6  Economic and social research questions using spatial data

In their critical appraisal of spatial econometrics, Gibbons and Overman (2012) state that:

In many (micro) economic fields—particularly development, education, environment, labor, health, and public finance—empirical work is increasingly concerned with questions about causality … If we increase an individual’s years of education, what happens to their wages? If we decrease class sizes, what happens to student grades? These questions are fundamentally of the type ‘if we change x, what do we expect to happen to y.’ Just as with economics more generally, such questions are fundamental to our understanding of spatial economics (Gibbons and Overman 2012, 172–73).

In the years preceding their description of spatial econometrics as risking becoming mostly pointless, practitioners had been realising that interpretation of model output was not as simple as had been assumed. In aspatial models, there is no spatial spillover to consider, and in time-series models, only past observations can influence the present. However, in spatial autoregressive models that use, for example, the average of neighbours’ values of the dependent variable as an explanatory variable, the coefficient of the spatially lagged dependent variable interacts with the coefficients of the independent variables (Kelejian, Tavlas, and Hondroyiannis 2006; LeSage and Fischer 2008; Ward and Gleditsch 2008; LeSage and Pace 2009). The correct interpretation of the estimated coefficients of models using spatial data is, naturally, central to exploring “‘if we change x, what do we expect to happen to y.’” See also Corrado and Fingleton (2012) for further discussion of the importance of thinking of spatial econometric models as tools for studying causality as understood at that time.
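To make this interaction concrete, the following is a minimal sketch in Python, not taken from the works cited. It assumes a spatial autoregressive model written as y = ρWy + Xβ + ε, where the notation (ρ for the spatial coefficient, W for a row-standardised weights matrix, β for a single covariate coefficient) and the function name `sar_impacts` are introduced here for illustration only. Solving for y gives y = (I − ρW)⁻¹(Xβ + ε), so the effect of changing x in one region on y in every region is governed by the matrix (I − ρW)⁻¹β rather than by β alone. The sketch reports average direct, indirect and total impacts, in the style of the summary measures discussed by LeSage and Pace (2009).

```python
import numpy as np

def sar_impacts(W, rho, beta):
    """Average direct, indirect and total impacts of one covariate
    in a SAR model y = rho * W y + x * beta + e (illustrative sketch)."""
    n = W.shape[0]
    # Partial derivatives dy_i / dx_j = [(I - rho W)^{-1}]_{ij} * beta
    S = np.linalg.inv(np.eye(n) - rho * W) * beta
    direct = np.trace(S) / n   # mean own-region effect (diagonal)
    total = S.sum() / n        # mean row sum: effect of changing x everywhere
    indirect = total - direct  # mean spillover to and from other regions
    return direct, indirect, total

# Hypothetical toy example: four regions on a line, neighbours share a border
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)  # row-standardise

direct, indirect, total = sar_impacts(W, rho=0.5, beta=2.0)
# Because W is row-standardised, total equals beta / (1 - rho) = 4.0
# (up to floating point error), twice the raw coefficient beta.
print(direct, indirect, total)
```

In this toy case the direct impact exceeds β itself, because a change in a region's x feeds back through its neighbours, which is why reading β as "the effect of x on y" is misleading once ρ is non-zero.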

A further, associated, bundle of strands of development in spatial econometrics following 2010 also involved causality. Delgado and Florax (2015) draw on Rubin (1974) and Rubin (1978) to highlight the risk posed to the stable unit treatment value assumption (SUTVA) by unmodelled spatial dependency in the data. This key assumption for causal analysis “implies that potential outcomes for person i are unrelated to the treatment status of other individuals” (Angrist, Imbens, and Rubin 1996). The assumption, and the risk of its violation when spatial data are modelled aspatially, are also discussed at that time by Koschinsky (2013) and Baylis and Ham (2015).

Work by Delgado and Florax (2015) has been followed up by Bardaka, Delgado, and Florax (2018) and Bardaka, Delgado, and Florax (2019), showing in practice how a difference-in-difference (DID) econometric design can be used to measure the impact of change on a chosen dependent variable. Dubé et al. (2014) approach spatial difference-in-difference in a similar way, followed up in Dubé et al. (2017) and Dubé, AbdelHalim, and Devaux (2021). Bardaka, Delgado, and Florax (2018) contrast the two approaches as follows:

Spatial DID models have been proposed by Dubé et al. (2014) and Delgado and Florax (2015); Delgado and Florax (2015) focus on violations of SUTVA in the case of spillover effects local to the treatment whereas Dubé et al. (2014) focus on global effects (Bardaka, Delgado, and Florax 2018, 17).

Dubé et al. (2014) and work derived from it, such as Sunak and Madlener (2016) and Diao, Leonard, and Sing (2017), have been referred to directly and indirectly by an increasing number of applied studies, including Fang (2021), Jia, Shao, and Yang (2021), Qiu and Tong (2021), Liu et al. (2022), Chen et al. (2023), D. Gao and Wang (2023), Pan et al. (2023), Yu and Jin (2023) and Zeng, Blanco-González-Tejero, and Sendra (2023). Some refer to both spatial DID origins: Chagas, Azzoni, and Almeida (2016) and Qiu and Tong (2021), while some derive from Delgado and Florax (2015) alone: Han et al. (2018) and Kosfeld et al. (2021). It is clear that the demand for assessments of the consequences of, for example, infrastructure investments on house prices or environmental measures is enormous, so it is likely that such studies will continue to proliferate. Both Dubé et al. (2014) and Delgado and Florax (2015) presuppose that applied researchers using their approaches are adequately trained in spatial econometrics, so that these researchers are more than familiar with the spatial extensions to aspatial econometric techniques. These spatial extensions are the core of our book.

Before continuing to present the structure of this book, it is also sensible to cover three “breaking” topics related to causality in a spatial context. The first of the “breaking” topics extends regression discontinuity designs, from discussion in Gibbons, Machin, and Silva (2013) through Calonico, Cattaneo, and Titiunik (2014) and Keele and Titiunik (2015) to Butts (2023a) and Butts (2023b), with Cattaneo and Titiunik (2022) as an up-to-date general review. While neither Dubé et al. (2014) nor Delgado and Florax (2015) appears to be directly backed by software, software packages for R, Python and Stata implementing work by Sebastian Calonico, Matias D. Cattaneo and Rocío Titiunik have been published; ongoing work by Kyle Butts is also available.

The other two “breaking” topics relate directly to Olsson (1970), in which he addressed the question of the extent to which prediction and explanation could be seen as symmetric. Summarising his findings, Olsson asserted that:

… an adequate explanation may lead to a successful prediction, but … successful prediction is not the same as successful explanation. (Olsson 1970, 230)

This relates directly to: “‘if we change x, what do we expect to happen to y’”, which is more about explanation (and hence causality) than prediction. Many contemporary quantitative methods utilise predictive success to attempt to improve model fit, and having achieved predictive success try to re-construct the meaning of the output model in order to work back to explanation.

Secondly, three important surveys of causality in spatial data analysis have appeared recently: Kolak and Anselin (2020), B. Gao et al. (2022), and Akbari, Winter, and Tomko (2023). All of these take up spatial challenges to the stable unit treatment value assumption, Kolak and Anselin (2020) with an example of the impact of changes in minimum legal drinking age laws on mortality for US states. B. Gao et al. (2022) point to the rapid extension of spatial statistics to other knowledge domains including bioinformatics, in which causal inference is clearly important. The main use cases considered by Akbari, Winter, and Tomko (2023) are in spatial cognition, including wayfinding processes and navigation systems, because these are so much broader in impact than program evaluations. Because Akbari, Winter, and Tomko (2023) is a literature review, it does not propose methods, but describes those available. They also comment that of the minority of articles included in the review that reported what software was used, the most commonly used software was R. Only 12 percent of the articles cited code needed to reproduce their results (see also Wolf 2023). They comment:

This low rate of accessibility to code is a big challenge that not only limits reproducibility of the reviewed papers, but also affects the portability and translation of approaches to other case studies in spatial causal inference. … In sum, in most of the reviewed research, there are no clear procedures related to reproducibility and validation. We can trust more the results of papers with straightforward approaches with a sufficient level of details. (Akbari, Winter, and Tomko 2023, 79)

The final “breaking” topic is the influence of spatial autocorrelation on machine learning, statistical learning, and convolutional neural networks. Kattenborn et al. (2022) study the impact of spatial autocorrelation on the training of convolutional neural networks for data acquired by drones, and find:

Our results suggest that violating spatial independence between training and test data can severely inflate model apparent performance (up to almost 30%) and, hence, lead to an overly optimistic evaluation of the generalization of such models. (Kattenborn et al. 2022, 7)

This observation, that the violation of the assumption of spatial independence between training, validation and test data sets prejudices outcomes, has been recognised in much of spatial data science for years, at least from Brenning (2012). There is now an extensive literature both on the split between training and test data sets, and on the use of fitted models for prediction to areas that were un- or under-represented in the data used to fit the model (Meyer et al. 2018, 2019; Valavi et al. 2019; Schratz et al. 2019; Meyer and Pebesma 2021, 2022; Mila et al. 2022; Linnenbrink et al. 2023). These articles are accompanied by software permitting the reproduction of their findings and the application of suggested adaptations, replacing random permutations in machine learning model fitting and tuning by spatially-aware procedures.
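The idea of replacing random splits by spatially-aware ones can be sketched in a few lines of Python. This is an illustrative sketch only and does not reproduce the procedure of any of the software accompanying the articles cited above; the function name `spatial_block_folds`, the square-block scheme and the simulated coordinates are all assumptions made here. Observations are grouped into square spatial blocks, and whole blocks, rather than individual observations, are assigned to folds, so that close neighbours are less likely to end up on opposite sides of the training/test divide.

```python
import random

def spatial_block_folds(coords, block_size, k, seed=0):
    """Assign each point to one of k folds by square spatial block
    (illustrative sketch of block cross-validation)."""
    # Identify the grid cell (block) that contains each point
    blocks = [(int(x // block_size), int(y // block_size)) for x, y in coords]
    # Shuffle the occupied blocks, then deal them out to folds in turn
    unique_blocks = sorted(set(blocks))
    random.Random(seed).shuffle(unique_blocks)
    fold_of_block = {b: i % k for i, b in enumerate(unique_blocks)}
    # Every point inherits the fold of its block
    return [fold_of_block[b] for b in blocks]

# Hypothetical data: 200 random points in a 100 x 100 study area
rng = random.Random(42)
coords = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(200)]

folds = spatial_block_folds(coords, block_size=25.0, k=4)

# Hold out fold 0 for testing and fit on the remaining folds
test_idx = [i for i, f in enumerate(folds) if f == 0]
train_idx = [i for i, f in enumerate(folds) if f != 0]
```

The block size is the key design choice: as a rule of thumb, blocks that are small relative to the range of spatial autocorrelation leave training and test observations dependent, so the choice would normally be informed by the data rather than fixed in advance as it is here.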

Kopczewska (2022) summarises the current research position with regard to spatial data use in machine learning in this way:

It is clear from many studies that unaddressed spatial autocorrelation generates problems, such as overoptimistic fit of models, omitted information and/or biased (suboptimal) prediction. Thus, an up-to-date toolbox dealing with spatial autocorrelation should be used in all ML models in order to ensure methodological appropriateness. (Kopczewska 2022, 732)

She does, however, provide encouraging examples of the spatially-informed use of machine learning methods, and a concise overview of concepts and a listing of relevant R packages in two appendices (Kopczewska 2022, 735–49).

Wagner and Zeileis (2019) use model-based recursive partitioning to study heterogeneous growth, handling spatial dependence by including the spatially lagged dependent variable in a spatial econometric model. Vidoli, Pignataro, and Benedetti (2022) approach heterogeneity through spatial regimes, as do Piras and Sarrias (2023); all three articles are backed by software.

Nikparvar and Thill (2021) give a broad review of machine learning methods and applications to spatial data, including attention to spatial autocorrelation, spatial scale and spatial heterogeneity. Credit (2022) considers the intersection of random forest models - a machine learning method aggregating the output of many decision trees - and spatial econometrics models. In addition to the inclusion of the spatially lagged dependent variable (Wagner and Zeileis 2019), spatially lagged independent variables were considered. Yoshida, Murakami, and Seya (2022) compare a selection of spatially-aware machine learning approaches and a nearest-neighbour Gaussian process model for predicting apartment rents, following some of the suggestions made by Credit (2022); references to software are provided.
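As a rough illustration of what adding spatially lagged variables to a machine learning feature set involves, the following Python sketch builds a row-standardised k-nearest-neighbour weights matrix W and appends the lags WX and Wy to the original covariates X. It is not the implementation used in any of the articles cited; the function name `knn_weights`, the choice of k-nearest-neighbour weights and the simulated data are assumptions made purely for illustration.

```python
import numpy as np

def knn_weights(coords, k):
    """Row-standardised k-nearest-neighbour spatial weights matrix
    (illustrative sketch; ties are broken by sort order)."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances; a point is never its own neighbour
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = np.argsort(d, axis=1)[:, :k]
    W = np.zeros((n, n))
    # Each of the k neighbours gets weight 1/k, so every row sums to one
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0 / k
    return W

# Hypothetical data: 50 locations, 3 covariates, 1 response
rng = np.random.default_rng(1)
coords = rng.uniform(0, 10, size=(50, 2))
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)

W = knn_weights(coords, k=5)
# Feature matrix: original covariates, their spatial lags WX, and the lag Wy
features = np.column_stack([X, W @ X, W @ y])
```

Note that the column Wy is built from observed responses at neighbouring locations. When those neighbours fall in the test set, or when predictions are needed where no responses have been observed, this feature is either unavailable or leaks information across the training/test split, so its use has to be considered together with the way the data are partitioned.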

Consideration of machine learning and the application of deep learning/neural networks overlap in a fair number of cases, raising similar concerns about how the probable lack of independence between proximate observations in space will be handled. Ahmed et al. (2021) stress the need for interpretability and explainability in deep learning/neural networks (also known as artificial intelligence) as well as in machine learning, and explore model agnostic greedy explanations of model predictions; references to software are provided.

Zhang et al. (2022) and Deng, He, and Liu (2023) also focus on interpretability of machine learning methods for predicting crime risk; Deng, He, and Liu (2023) provide their complete data set. Zhu et al. (2023b; minor correction Zhu et al. 2023a) propose spatial regression graph convolutional neural networks for modelling and predicting multivariate spatial data, also considering what is known as feature selection (or engineering), which is related to interpretability; references to software are provided.

Li et al. (2023), like Credit (2022), introduce the spatially lagged dependent variable (whether a manhole in an urban drainage system overflows or not) into a deep neural network model. Xiao, Song, and Wang (2023) and Wang and Song (2023) in two articles with overlapping authorships look more closely at integrating adaptations of classical spatial econometrics models - spatial autoregressive models - into deep neural network models, both making nonparametric additions to the classical models.

The literatures covering the interpretation of spatial econometric models, causality when the data used are spatial and so challenge standard econometric assumptions, and training/test set data splits affecting machine learning and artificial intelligence applications, are all burgeoning. It might be thought that the brief mention of these questions in this section should direct us to focus attention in this book on emerging research opportunities. However, our reading of contributions to the current literature, taken with readings of the many other articles published since 2020 - Dubé et al. (2014) has been cited hundreds of times, mostly since 2020 - suggests that many authors would benefit substantially from a clearer grasp of the background to spatial econometric models, and of many of their internal characteristics. Hence, based in part on this motivation, we will now move to present the structure of this book.