All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.
I am running R 4.1.2, with recent update.packages()
.
needed <- c("sphet", "coda", "spatialreg", "spData", "sf")
Script and data at https://github.com/rsbivand/ECS530_h21/raw/main/ECS530_211118.zip. Download to suitable location, unzip and use as basis.
In the same way that Fujita, Krugman, and Venables (1999) begin their study of the spatial economy by looking at the antecedents of their subject, it is helpful to place spatial econometrics in its temporal and academic context. This context is sufficiently different from the contemporary setting that it may be hard to grasp the background for many of the features of spatial econometrics that came into being during its earlier years. Indeed, the ranges of topics that were studied in economics in the 1960’s and 1970’s differ markedly from those in focus today. If we can sketch the context within which spatial econometrics was created, and its methods developed, we should be able to illuminate choices made then which influence our understanding and application of spatial econometric methods.
Critics of the practice of spatial econometrics, such as Gibbons and Overman (2012), appear to overlook these antecedents, and consequently judge the potential of the field on a partial, perhaps anachronistic, understanding, viewing phenomena with a history in ahistorical way. Since we are attempting to provide an introduction to applied spatial econometrics, we need to throw light on the original motivations and concerns of the first scholars engaged in the field. Luc Anselin (2010) indicates clearly and repeatedly [L. Anselin (1988b); anselin:06; anselin:10a] that we should acknowledge Spatial Econometrics by Paelinck and Klaassen (1979) of the Netherlands Economic Institute as our starting point, and so celebrates thirty years of spatial econometrics in 2009. This firm confirmation of the importance of Jean Paelinck’s contributions as scholar and community-builder is fully justified. We should then turn to the motivations given in Paelinck and Klaassen (1979) to indicate which contextual factors were of importance at that time, and the breadth of the academic communities with which they were in contact.
In a recent short commentary, Paelinck (2013) recalls his conviction, expressed in 1967, that “early econometric exercises \(\ldots\) relating only variables possessing the same regional index \(\ldots\) were inadequate to represent the correct spatial workings of the economy, which would then be reflected in the policy outcomes.” A year before, Paelinck (2012) points to salient isomorphisms linking spatial regression models, simultaneous equation models and input-output models; these were known of and discussed in the early formative period of spatial econometrics. We will return in subsequent chapters to the ways in which spatial regression models may be specified, but for now, a simple presentation of these isomorphisms as perceived in the early period is sufficient:
\[ {\mathbf y} = {\mathbf A}{\mathbf y} + {\mathbf X}{\mathbf b} + {\mathbf \varepsilon} \]
is a spatial regression model where \({\mathbf A}\) is a matrix expressing the mutual first order spatial dependencies between regions — the similarity of this form and the Koyck distributed lag model is striking [Koyck (1954); Klein (1958); Griliches (1967)};
\[ {\mathbf A}{\mathbf y} + {\mathbf X}{\mathbf b} = {\mathbf \varepsilon} \]
is a simultaneous equation model where \({\mathbf A}\) is a matrix expressing the dependencies between the equations; and:
\[ {\mathbf y} = {\mathbf A}{\mathbf y} + {\mathbf f} \]
is an input-output model where \({\mathbf A}\) is a matrix of sectoral input-output coefficients, and \({\mathbf f}\) is final demand.
Input-output models, simultaneous equation models, and the importance of policy outcomes were all known intimately at the Netherlands Economic Institute at this time, and elsewhere among applied economists. The isomporhisms flowed from the known to the unknown, from the stuff of contemporary research and policy advice to doubts about the calibration of aspatial models, and on to what became termed spatial econometrics. If we compare these topics with those described for Regional Science by Boyce (2004), we can see the outlines of research priorities at the time: including urban and regional models for planning, regional and interregional input-output models, transport and location models. During the 1960s and 1970s, many of these models were enhanced — matching needs for policy advice — to cover environmental questions, adding natural resources as inputs and pollution to outputs. Paelinck’s co-author in a key paper in spatial econometrics (Hordijk and Paelinck 1976), went on to work in environmental management and research.
Reading Paelinck and Klaassen (1979), we see that the programme of research into the space economy undertaken at the Netherlands Economic Institute led first to the publication of Paelinck and Nijkamp (1975), and then to Klaassen, Paelinck, and Wagenaar (1979), published in the same year as Spatial Econometrics. All three books were published in the same series and appear to reflect the core concerns of economists at the Institute doing research on regionalised national macro-economic models. The direct link to Jan Tinbergen is evident in the account of the context of economic research in the Netherlands given by Theil (1964). If we take Paelinck at his word, he and his colleagues were aware that an aspatial regionalisation of national accounts, of input-output models, or transport models, might prejudice policy advice and outcomes through inadequate and inappropriate calibration.
Klaassen, Paelinck, and Wagenaar (1979) is mainly concerned with model construction, while about a third of Paelinck and Nijkamp (1975) is devoted to input-output analysis. Both books show sustained concern for economic measurement, especially of national accounts data, intersectoral transactions, and many other topics. Considerable attention is also paid to the data collection units, be they sectors or regions. The need to attempt to define regions that match the underlying economic realities was recognised clearly, and a key part of Paelinck and Nijkamp (1975) is devoted to regionalisation, and the distinction between functional regions and homogeneous regional classifications is made. In the motivation for spatial econometric models given in Paelinck and Klaassen (1979), consumption and investment in a region are modelled as depending on income both in the region itself and in its contiguous neighbours, termed a “spatial income-generating model.” It became important to be able to calibrate planning models of this kind to provide indications of the possible outcomes of alternative policy choices, hence the need for spatial econometrics.
Economic planning was widespread in Europe at the time, and was also central in the development of Regional Science, in particular input-output models; as Boyce (2004) recounts, Walter Isard worked closely with Wassily Leontief. Operational and planning motivations for applied economics were unquestioned, as economists in the post-war period saw their role, beyond educating young economists, as providing rational foundations for economic policy. It is worth noting that Jean Paelinck participated actively in the Association de Science Régionale De Langue Française, becoming president in 1973–1976. The first president of the association was François Perroux, who had founded it with Walter Isard in 1961 (Bailly, Derycke, and Torre 2012).
Until the 1980s, it was not at all unusual to publish original results in other languages than English. French spatial economic research, for example Ponsard (1983), while making little impact in Anglophone countries, was widely used in teaching and research elsewhere (Billot and Thisse 1992). They contrast, though, the “word wizardry of François Perroux with the rigour of Claude Ponsard” (Billot and Thisse 1992), echoing the views expressed by Drèze (1964) with regard to the work of Perroux. Even if we accept that “word wizardry” deserves more rigour and recasting in normative and empirically testable forms, it is also part of the context within which spatial econometrics came into being. A reading of Perroux (1950) is worthwhile, because it not only gives the reader a vignette of the context in the post-war period, but also provides a discussion of economic space, as opposed to banal, unreflected space — mere position — that has largely disappeared from our considerations.
The title of the journal: Regional and Urban Economics, Operational Methods , founded by Jean Paelinck in 1971, and which was renamed as Regional Science and Urban Economics in 1975 (Boyce 2004), points to the perceived importance of “operational methods,” a version of the term “operational theory and method” used in the title of Paelinck and Nijkamp (1975). Spatial econometrics does not seem to have come into being as a set of estimation techniques as such, as perhaps we might think today, but rather as an approach addressing open research questions both in space economy and in the enhancement of interregional models to be used in offering policy advice.
Were motivations of this kind common during the 1960s and early 1970s? Not only was the spread of Regional Science extensive and firmly established (Boyce 2004), but public bodies were concerned to regionalise economic measurement and policy advice (Graham and Romans 1971). In Britain, Environment and Planning was started in 1969 with Alan Wilson as founding editor and published by Pion; he was assistant director at the Centre for Environmental Studies at this time before moving to the University of Leeds. In a recently published lecture series, Wilson (2012) cites Paelinck and Nijkamp (1975) as giving principles for contributions from economics to urban and regional analysis (Wilson 2000). The papers presented at annual Regional Science meetings were published in a series by Pion; the first number in the series included contributions by Granger (1969) and A. D. Cliff and Ord (1969).
In a contribution to a panel session at the 2006 annual meeting of the American Association of Geographers (co-panelists Luc Anselin and Daniel Griffith), Keith Ord pointed to the continued relevance of Granger’s remarks at the meeting almost forty years earlier (J. Keith Ord 2010); we will return to these concerns below. As noted by R. S. Bivand (2008), communities of researchers working in and near mathematical and theoretical geography was more integrated in the pre-internet and pre-photocopier age than one might expect, with duplicated working papers prepared using stencils circulating rapidly between collaborating academic centres. Knowledge of the preliminary results of other researchers then fed through into rapid innovation in an exciting climate for those with access to these meetings and working papers.
There was considerable overlap between quantitative geography and regional science, so that work like A. D. Cliff and Ord (1969) is cited by Hordijk (1974), and was certainly known at the Netherlands Economic Institute several years earlier. Although it has not been possible to find out who participated in the August 1968 conference of the British and Irish Section of the Regional Science Association at which A. D. Cliff and Ord (1969) was read, it was not unusual for members of other sections to be present, and to return home with bundles of duplicated papers. Up to the 1990s, presenters at conferences handed out copies of their papers, and conference participants posted home parcels of these hand-outs, indexed using the conference programme.
Leslie Hepple was among the more thorough scholars working on the underpinnings of spatial econometrics prior to the publication of Paelinck and Klaassen (1979). His wide-ranging review (Hepple 1974) is cited by Bartels and Hordijk (1977), again demonstrating the close links between those working in this field. We will be returning to the review paper, and to Hepple (1976), which studies methods of estimation for spatial econometrics models in some depth, building on and extending J. K. Ord (1975).
Hepple (1974), like A. D. Cliff and Ord (1973), saw no distinction between spatial statistics and the antecedents to spatial econometrics. Obviously, spatial econometrics was strongly influenced by the research tasks undertaken by regional and urban economists and regional scientists. As D. A. Griffith and Paelinck (2007) point out, spatial statistics and spatial econometrics continue to share most topics of interests, with each also possessing shorter lists of topics that have been of less concern to the other. They advocate a “non-standard” spatial econometrics, which is inclusive to wider concerns. It seems appropriate in this context to mention the somewhat heterodox position taken by McMillen (2010), who draws attention to the crucial issue of functional form, which he argues may well lie behind observed spatial autocorrelation; we will return to this in later chapters.
Having described some of the contextual issues suggesting how specific research concerns influenced how spatial econometrics came into being, it may now be helpful to turn to broader econometrics. It will become clearer that the research concerns of broader econometrics at that time, apart from spatial interdependence and interaction, were generally similar to those of spatial econometrics. Indeed, econometrics and the provision of national economic data for analysis, modelling, and for the provision of policy advice were intimately linked, as were mathematical economics and econometrics. The place of econometrics within economics, as Sandmo (2011) shows, was a matter of some contention from the very beginning.
Both Morgan (1990) and Qin (1993) conclude their historical accounts of the beginnings of econometrics in pessimistic ways. Econometrics had begun by addressing a range of research topics, including the expression of economic theories in mathematical form, the building of operational econometric models, the exploration of testing and estimation techniques, and statistical data preparation. “By the 1950s the founding ideal of econometrics, the union of mathematical and statistical economics into a truly synthetic economics, had collapsed” (Morgan 1990). The flavour of Paelinck and Klaassen (1979) seems somewhat anachronistic compared to late 1950s and early 1960s “textbook” econometrics. One might conclude that while spatial econometrics was aligned with Haavelmo’s probabilistic revolution (Qin 1993), it retained, like Haavelmo, adherence to the founding ideals [Bjerkholt (2007); hendry+johansen:12].
It appears from the research programme culminating in Paelinck and Klaassen (1979) that spatial econometrics could be seen as “creative juggling in which theory and data came together to find out about the real world” (Morgan 1990). Morgan (1990) and Qin (1993) indicate that, from the 1950s, the depth of econometrics was flattened , with “[D]ata taken less seriously as a source of ideas and information for econometric models, and the theory-development role of applied econometrics was downgraded to the theory-testing role” (Morgan 1990). The history of econometric ideas is of assistance in illuminating key components of what has come to be the practice of spatial econometrics:
With Haavelmo, the profession seemed to have arrived at a consensus: a statistical model of plural economic causes and errors in the relations. But in the meantime, the old scientific ideal had also changed, from a deterministic to a probabilistic view of the way the world worked. This left the explanatory level of Haavelmo’s model open to new doubts. Was it really based on underlying random behaviour of the economic variables (as in contemporary evolutionary biology and quantum mechanics), or was it, after all, only a convenient way of formally dealing with inference in the non-experimental framework? (Morgan 1990).
The study of the history of econometrics has been actively promoted by David Hendry. It is desirable to motivate our choice of the term “applied spatial econometrics”; to do this we turn to Hendry (2009), the introductory chapter in the “Applied Econometrics” volume of the Palgrave Handbook of Econometrics. If we can shed light on how “applied econometrics” is understood, then we may be able to provide adequate underpinnings for what is to follow. This, however, turns out to be hard to do, as Hendry challenges simple definitions:
At the superficial level, “Applied Econometrics” is “any application of econometrics, as distinct from theoretical econometrics. \(\ldots\) Some applied econometricians would include any applications involving analyses of”real economic data” by econometric methods, making “Applied econometrics” synonymous with empirical econometrics. However, such a view leads to demarcation difficulties from applied economics on the one hand, and applied statistics on the other. \(\ldots\) Outsiders might have thought that “Applied Econometrics” was just the application of econometrics to data, but that is definitely not so \(\ldots\) Rather, the notion of mutual penetration dominates, but as a one-way street. Economic theory comes first, almost mandatorially. Perhaps this just arises from a false view of science, namely that theory precedes evidence, \(\ldots\) (Hendry 2009).
He butresses his views using the history of econometrics to caricature empirical econometric research in the light of his view that applied economics has a “false” view of science:
\(\ldots\) cumulative critiques \(\ldots\) led to an almost monolithic approach to empirical econometric research: first postulate an individualistic, intertemporal optimization theory; next derive a model therefrom; third, find data with the same names as the theory variables; then select a recipe from the econometrics cookbook that appropriately blends the model and the data, or if necessary, develop another estimator; finally report the newly forged economic law. \(\ldots\) Instead of progress, we find fashions, cycles and “schools” in research. \(\ldots\) At about the same time that a priori theory-based econometrics became dominant, data measurement and quality issues were also relegated as a central component of empirical publications (Hendry 2009).
Hendry’s scepticism with regard to the practice of applied econometrics finds ample support in the proposal by Anderson et al. (2008) that econometric results be acknowledged only to the extent that they are open for replication. They point out that theoretical results in economics (and econometrics) are much easier to check, and that referees routinely question formal proofs, because the relevant equations are included in journal submissions. Further contributions to the discussion on reproducible econometric research results have been made by Koenker and Zeileis (2009) and Yalta and Yalta (2010). As we will see later on, spatial econometrics is in the fortunate position of having relatively many open source software implementations.
In order to complete this brief review of the antecedents to spatial econometrics, it makes sense to point to descriptions of the development of econometric software. Both in a handbook chapter (Renfro 2009a), and in book form (Renfro 2009b), we can benefit from Charles Renfro’s experience and insights. He does admit to expressing distinct views, but they are backed both by experience and evidence. One concern is that econometric software development has been seen less and less as academic achievement, but rather as a practical, technical concern only:
To take an interest in programming and econometric software development would seem therefore to be the graveyard of any academic economist’s professional ambitions, justifiable only as a minimally diverting hobby, spoken of only to trusted colleagues. (Renfro 2009b)
He is not alone in drawing attention to the view that economics suffers from a skewed distribution of attention to the actual components of knowledge creation, placing theory firmly in first place, with commensurately much less weight given to measurement, data preparation and programming. As he says, applied economics researchers would benefit from a more even distribution of academic acknowledgement to those who, in terms of the division of labour within the discipline, develop software:
As a software developer, the econometrician who incorporates new theoretical econometric results may therefore be faced with the often-difficult task of not only evaluating the relevance, hence the operational validity of the theoretical solution, but also implementing these results in a way that is contextually meaningful. This operationally focused econometrician consequently not only needs to understand the theoretical advances that are made but also to exercise independent judgment in the numerical implementation of new techniques, for in fact neither are blueprints provided nor are the building blocks prefabricated. (Renfro 2009b)
Since this division of labour is arguably more uneven in economics than in other subjects, it leads to the restriction of research questions that applied economists can address to those that match estimation functions available in software implementations for which they have licences and which they know how to use:
One of the consequences of this specialization has been to introduce an element of user dependence, now grown to a sufficiently great degree that for many economists whichever set of econometric operations can be performed by their choice of program or programs has for them in effect become the universal set. (Renfro 2009b)
This deplorable situation has arisen over time, and certainly did not characterise the research context when spatial econometrics came into being. At that time, graduate students simply regarded learning to program, often in Fortran, as an essential part of their preparation as researchers. This meant that researchers were moch “closer” to their tools, and could adapt them to suit their research needs. Nowadays, few economists feel confident as programmers despite the fact that modern high-level languages such as Matlab, Python, or R are easy to learn and very flexible, and many econometric and statistical software applications offer scripting languages (such as SAS, Stata, SPSS and specifically econometric programs).
The links between methodological advance and the evolution of spatial economic theory are only touched upon in Luc Anselin (2010) — in that sense, his review is concerned with theoretical spatial econometrics (statistical methods) rather than applied spatial econometrics (economic models). Over time, applied spatial econometrics has tended to become synonymous with regression modelling applied to spatial data where spatial autocorrelation and spatial heterogeneity in particular are present and need to be accommodated. Its treatment of spatial effects reflects the growing “legitimization of space and geography” (Luc Anselin 2010) in the quantitative social sciences more generally. But the subfield perhaps needs to be more than that if it is to justify its separate identity from spatial statistics and fully justify its “econometric” label. A close link with mainstream economic theory would seem essential in order to provide economic legitimacy to models (systems of equations) within which geography and spatial relationships have been, in economic terms, rigorously embedded (Fingleton 2000). (Haining 2014)
Beyond software functionality, the existence of communities of interest becomes important
These extend the choices available, and bring in points of view across applied scientific domains and in modern computation, for example interworking across C++, Python and Julia with R
For example, the HSAR package [Dong and Harris (2015); dongetal:15] uses C++ extensively, rather than R sparse matrix packages
The model being introduced into INLA for non-MCMC Bayesian inference is another example of cooperation with other applied statisticians
The R packages spatialreg, sphet (Roger S. Bivand and Piras 2015) and for spatial panel models splm (Millo and Piras 2012) provide implementations of many model fitting functions for Gaussian dependent variables
These also use shared methods in spdep for computing impacts, and for MC inference on impacts
There are a range of fitting functions for non-Gaussian dependent variables in McSpatial, spatialprobit (Wilhelm and Matos 2013) and ProbitSpatial (Martinetti and Geniaux 2017), and these are also available using the ´slm´ model being introduced into INLA (Gómez-Rubio, Bivand, and Rue 2021)
Work is continuing on providing Bayesian inference functions, through MCMC and otherwise; the HSAR package is an example
See R. Bivand, Millo, and Piras (2021) for a recent survey
Even though it may be tempting to focus on interpreting the map pattern of an area support response variable of interest, the pattern may largely derive from covariates (and their functional forms), as well as the respective spatial footprints of the variables in play. Spatial autoregressive models in two dimensions began without covariates and with clear links to time series (Whittle 1954). Extensions included tests for spatial autocorrelation in linear model residuals, and models applying the autoregressive component to the response or the residuals, where the latter matched the tests for residuals (A. Cliff and Ord 1972; A. D. Cliff and Ord 1973). These “lattice” models of areal data typically express the dependence between observations using a graph of neighbours in the form of a contiguity matrix.
A division has grown up, possibly unhelpfully, between scientific fields using conditional autoregressive (CAR) models (Besag 1974), and simultaneous autoregressive models (SAR) (J. K. Ord 1975; Hepple 1976). Although CAR and SAR models are closely related, these fields have found it difficult to share experience of applying similar models, often despite referring to key work summarising the models (Ripley 1981, 1988; Cressie 1993). Ripley gives the SAR variance as (1981, 89):
\[ {\rm Var}_S = \sigma^ 2(I-\lambda W_S)^{-1} (I-\lambda W_S^{\rm T})^{-1} \] where \(\lambda\) is a spatial autocorrelation parameter and \(W_S\) is a nonsingular matrix that represents spatial dependence. The CAR variance is:
\[ {\rm Var}_C = \sigma^ 2(I-\lambda W_C)^{-1} \] where and \(W_C\) is a symmetric and strictly positive definite matrix
More recent books expounding the theoretical bases for modelling with areal data simply point out the similarities in relevant chapters (Gaetan and Guyon 2010; Lieshout 2019); the interested reader is invited to consult these sources for background information and examples using the functions described below.
Of course, handling a spatial correlation structure in a generalised least squares model or a (generalised) linear or nonlinear mixed effects model such as those provided in the nlme and many other packages does not have to use a graph of neighbours (Pinheiro and Bates 2000). These models are also spatial regression models, using functions of the distance between observations, and fitted variograms to model the spatial autocorrelation present; such models have been held to yield a clearer picture of the underlying processes (Wall 2004), building on geostatistics. For example, the glmmTMB package successfully uses this approach to spatial regression (Brooks et al. 2017). Here we will only consider spatial regression using spatial weights, chiefly as implemented in the spatialreg package recently split out of the spdep package which had grown unnecessarily large, covering too many aspects of spatial dependence.
Recent development: (R. Bivand, Millo, and Piras 2021)
Spatial autoregression models using spatial weights matrices were described in some detail using maximum likelihood estimation some time ago (A. D. Cliff and Ord 1973, 1981). A family of models were elaborated in spatial econometric terms extending earlier work, and in many cases using the simultaneous autoregressive framework and row standardization of spatial weights (L. Anselin 1988a). The simultaneous and conditional autoregressive frameworks can be compared, and both can be supplemented using case weights to reflect the relative importance of different observations (Waller and Gotway 2004).
Here we shall use the Boston housing data set, which has been restructured and furnished with census tract boundaries (R. Bivand 2017). The original data set used 506 census tracts and a hedonic model to try to estimate willingness to pay for clean air. The response was constructed from counts of ordinal answers to a 1970 census question about house value; the response is left and right censored in the census source. The key covariate was created from a calibrated meteorological model showing the annual nitrogen oxides (NOX) level for a smaller number of model output zones. The numbers of houses responding also varies by tract and model output zone. There are several other covariates, some measured at the tract level, some by town only, where towns broadly correspond to the air pollution model output zones.
library(sf)
## Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0
library(spatialreg)
## Loading required package: spData
## Loading required package: Matrix
boston_506 <- st_read(system.file("shapes/boston_tracts.shp", package="spData")[1])
## Reading layer `boston_tracts' from data source
## `/home/rsb/lib/r_libs/spData/shapes/boston_tracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 506 features and 36 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS: NAD27
nb_q <- spdep::poly2nb(boston_506)
lw_q <- spdep::nb2listw(nb_q, style="W")
We can start by reading in the 506 tract data set from spData, and creating a contiguity neighbour object and from that again a row standardized spatial weights object. If we examine the median house values, we find that they have been assigned as missing values, and that 17 tracts are affected.
table(boston_506$censored)
##
## left no right
## 2 489 15
summary(boston_506$median)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 5600 16800 21000 21749 24700 50000 17
Next, we can subset to the remaining 489 tracts with non-censored house values, and the neighbour object to match. The neighbour object now has one observation with no neighbours.
boston_506$CHAS <- as.factor(boston_506$CHAS)
boston_489 <- boston_506[!is.na(boston_506$median),]
nb_q_489 <- spdep::poly2nb(boston_489)
lw_q_489 <- spdep::nb2listw(nb_q_489, style="W", zero.policy=TRUE)
The NOX_ID
variable specifies the upper level aggregation, letting us aggregate the tracts to air pollution model output zones. We can create aggregate neighbour and row standardized spatial weights objects, and aggregate the NOX
variable taking means, and the CHAS
Charles River dummy variable for observations on the river.
agg_96 <- list(as.character(boston_506$NOX_ID))
boston_96 <- aggregate(boston_506[, "NOX_ID"], by=agg_96, unique)
nb_q_96 <- spdep::poly2nb(boston_96)
lw_q_96 <- spdep::nb2listw(nb_q_96)
boston_96$NOX <- aggregate(boston_506$NOX, agg_96, mean)$x
boston_96$CHAS <- aggregate(as.integer(boston_506$CHAS)-1, agg_96, max)$x
The response is aggregated using the weightedMedian()
function in matrixStats, and midpoint values for the house value classes. Counts of houses by value class were punched to check the published census values, which can be replicated using weightedMedian()
at the tract level. Here we find two output zones with calculated weighted medians over the upper census question limit of USD 50,000, and remove them subsequently as they also are affected by not knowing the appropriate value to insert for the top class by value.
nms <- names(boston_506)
ccounts <- 23:31
for (nm in nms[c(22, ccounts, 36)]) {
boston_96[[nm]] <- aggregate(boston_506[[nm]], agg_96, sum)$x
}
br2 <- c(3.50, 6.25, 8.75, 12.50, 17.50, 22.50, 30.00, 42.50, 60.00)*1000
counts <- as.data.frame(boston_96)[, nms[ccounts]]
f <- function(x) matrixStats::weightedMedian(x=br2, w=x, interpolate=TRUE)
boston_96$median <- apply(counts, 1, f)
is.na(boston_96$median) <- boston_96$median > 50000
summary(boston_96$median)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9009 20417 23523 25263 30073 49496 2
Before subsetting, we aggregate the remaining covariates by weighted mean using the tract population counts punched from the census (R. Bivand 2017).
boston_94 <- boston_96[!is.na(boston_96$median),]
nb_q_94 <- spdep::subset.nb(nb_q_96, !is.na(boston_96$median))
lw_q_94 <- spdep::nb2listw(nb_q_94, style="W")
We now have two data sets each at the lower, census tract level and the upper, air pollution model output zone level, one including the censored observations, the other excluding them.
The original model related the log of median house values by tract to the square of NOX values, including other covariates usually related to house value by tract, such as aggregate room counts, aggregate age, ethnicity, social status, distance to downtown and to the nearest radial road, a crime rate, and town-level variables reflecting land use (zoning, industry), taxation and education (R. Bivand 2017). This structure will be used here to exercise issues raised in fitting spatial regression models, including the presence of multiple levels.
form <- formula(log(median) ~ CRIM + ZN + INDUS + CHAS + I((NOX*10)^2) + I(RM^2) +
AGE + log(DIS) + log(RAD) + TAX + PTRATIO + I(BB/100) +
log(I(LSTAT/100)))
Before moving to presentations of issues raised in fitting spatial regression models, it is worth making a few further points. A recent review of spatial regression in a spatial econometrics setting is given by Kelejian and Piras (2017); note that their usage is to call the spatial coefficient of the lagged response \(\lambda\) and that of the lagged residuals \(\rho\), the reverse of other usage (L. Anselin 1988a; James P. LeSage and Pace 2009); here we use \(\rho_{\mathrm{Lag}}\) for the spatial coefficient in the spatial lag model, and \(\rho_{\mathrm{Err}}\) for the spatial error model. One interesting finding is that relatively dense spatial weights matrices may downweight model estimates, suggesting that sparser weights are preferable (Tony E. Smith 2009). Another useful finding is that the presence of residual spatial autocorrelation need not bias the estimates of variance of regression coefficients, provided that the covariates themselves do not exhibit spatial autocorrelation (T. E. Smith and Lee 2012). In general, however, the footprints of the spatial processes of the response and covariates may not be aligned, and if covariates and the residual are autocorrelated, it is likely that the estimates of variance of regression coefficients will be biassed downwards if attempts are not made to model the spatial processes.
In trying to model these spatial processes, we may choose to model the spatial autocorrelation in the residual with a spatial error model (SEM).
\[ {\mathbf y} = {\mathbf X}{\mathbf \beta} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \] where \({\mathbf y}\) is an \((N \times 1)\) vector of observations on a response variable taken at each of \(N\) locations, \({\mathbf X}\) is an \((N \times k)\) matrix of covariates, \({\mathbf \beta}\) is a \((k \times 1)\) vector of parameters, \({\mathbf u}\) is an \((N \times 1)\) spatially autocorrelated disturbance vector, \({\mathbf \varepsilon}\) is an \((N \times 1)\) vector of independent and identically distributed disturbances and \(\rho_{\mathrm{Err}}\) is a scalar spatial parameter.
If the processes in the covariates and the response match, we should find little difference between the coefficients of a least squares and a SEM, but very often they diverge, suggesting that a Hausman test for this condition should be employed (R. Pace and LeSage 2008). This may be related to earlier discussions of a spatial equivalent to the unit root and cointegration where spatial processes match (Fingleton 1999).
A model with a spatial process in the response only is termed a spatial lag model (SLM, often SAR - spatial autoregressive) (James P. LeSage and Pace 2009).
\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf \varepsilon}, \] where \(\rho_{\mathrm{Lag}}\) is a scalar spatial parameter.
Work reviewed by Mur and Angulo (2006) on the Durbin model; the Durbin model adds the spatially lagged covariates to the covariates included in the spatial lag model, giving a spatial Durbin model (SDM) with different processes in the response and covariates:
\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma} + {\mathbf \varepsilon}, \] where \({\mathbf \gamma}\) is a \((k' \times 1)\) vector of parameters. \(k'\) defines the subset of the intercept and covariates, often \(k' = k-1\) when using row standardised spatial weights and omitting the spatially lagged intercept.
This permits the spatial processes to be viewed and tested for as a Common Factor (Burridge 1981; R. S. Bivand 1984). The inclusion of spatially lagged covariates lets us check whether the same spatial process is manifest in the response and the covariates (SEM), whether they are different processes, or whether no process is detected. The Common Factor is present when \({\mathbf \gamma} = - \rho_{\mathrm{Lag}} {\mathbf \beta}\):
\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} - \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf X} {\mathbf \beta} + {\mathbf \varepsilon}, \qquad ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}){\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}) {\mathbf X}{\mathbf \beta} + {\mathbf \varepsilon}, \] where \({\mathbf I}\) is the \(N \times N\) identity matrix, and \({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}\) is the Common Factor:
\[ \qquad {\mathbf y} = {\mathbf X}{\mathbf \beta} + ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} {\mathbf \varepsilon}, \qquad {\mathbf y} = {\mathbf X}{\mathbf \beta} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \]
If we extend this family with processes in the covariates and the residual, we get a spatial error Durbin model (SDEM). If it is chosen to admit a spatial process in the residuals in addition to a spatial process in the response, again two models are formed, a general nested model (GNM) nesting all the others, and a model without spatially lagged covariates (SAC, also known as SARAR - Spatial AutoRegressive-AutoRegressive model). If neither the residuals nor the residual are modelled with spatial processes, spatially lagged covariates may be added to a linear model, as a spatially lagged X model (SLX) (Elhorst 2010; Roger S. Bivand 2012; J. P. LeSage 2014; Halleck Vega and Elhorst 2015).
We can write the general nested model (GNM) as:
\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \]
This may be constrained to the double spatial coefficient model SAC/SARAR by setting \({\mathbf \gamma} = 0\), to the spatial Durbin (SDM) by setting \(\rho_{\mathrm{Err}} = 0\), and to the error Durbin model (SDEM) by setting \(\rho_{\mathrm{Lag}} = 0\). Imposing more conditions gives the spatial lag model (SLM) with \({\mathbf \gamma} = 0\) and \(\rho_{\mathrm{Err}} = 0\), the spatial error model (SEM) with \({\mathbf \gamma} = 0\) and \(\rho_{\mathrm{Lag}} = 0\), and the spatially lagged X model (SLX) with \(\rho_{\mathrm{Lag}} = 0\) and \(\rho_{\mathrm{Err}} = 0\).
Although making predictions for new locations for which covariates are observed was raised as an issue some time ago, it has many years to make progress in reviewing the possibilities (R. S. Bivand 2002; Goulard, Laurent, and Thomas-Agnan 2017). The prediction methods for SLM, SDM, SEM, SDEM, SAC and GNM models fitted with maximum likelihood were contributed as a Google Summer of Coding project by Martin Gubri. This work, and work on similar models with missing data (Suesse 2018) is also relevant for exploring censored median house values in the Boston data set. Work on prediction also exposed the importance of the reduced form of these models, in which the spatial process in the response interacts with the regression coefficients in the SLM, SDM, SAC and GNM models.
The consequence of these interactions is that a unit change in a covariate will only impact the response as the value of the regression coefficient if the spatial coefficient of the lagged response is zero. Where it is non-zero, global spillovers, impacts, come into play, and these impacts should be reported rather than the regression coefficients (James P. LeSage and Pace 2009; Elhorst 2010; Roger S. Bivand 2012; J. P. LeSage 2014; Halleck Vega and Elhorst 2015). Local impacts may be reported for SDEM and SLX models, using linear combination to calculate standard errors for the total impacts of each covariate (sums of coefficients on the covariates and their spatial lags).
This can be seen from the GNM data generation process:
\[ ({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}){\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})({\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma}) + {\mathbf \varepsilon}, \] re-writing:
\[ {\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1}({\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma}) + ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1}({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})^{-1}{\mathbf \varepsilon}. \] There is interaction between the \(\rho_{\mathrm{Lag}}\) and \({\mathbf \beta}\) (and \({\mathbf \gamma}\) if present) coefficients. This can be seen from the partial derivatives: \(\partial y_i / \partial x_{jr} = (({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} ({\mathbf I} \beta_r + {\mathbf W} \gamma_r))_{ij}\). This dense matrix \(S_r({\mathbf W}) = (({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} ({\mathbf I} \beta_r + {\mathbf W} \gamma_r))\) expresses the direct impacts (effects) on its principal diagonal, and indirect impacts in off-diagonal elements.
Current work in the spatialreg package is focused on refining the handling of spatially lagged covariates using a consistent Durbin=
argument taking either a logical value or a formula giving the subset of covariates to add in spatially lagged form. There is a speculation that some covariates, for example some dummy variables, should not be added in spatially lagged form. This then extends to handling these included spatially lagged covariates appropriately in calculating impacts. This work applies to cross-sectional models fitted using MCMC or maximum likelihood, and will offer facilities to spatial panel models.
It is worth mentioning the almost unexplored issues of functional form assumptions, for which flexible structures are useful, including spatial quantile regression presented in the McSpatial package (McMillen 2013). There are further issues with discrete response variables, covered by some functions in McSpatial, and in the spatialprobit and ProbitSpatial packages (Wilhelm and Matos 2013; Martinetti and Geniaux 2017); the MCMC implementations of the former are based on LeSage and Pace (2009). Finally, Wagner and Zeileis (2019) show how an SLM model may be used in the setting of recursive partitioning, with an implementation using spatialreg::lagsarlm()
in the lagsarlmtree package.
The review of cross-sectional maximum likelihood and generalized method of moments (GMM) estimators in spatialreg and sphet for spatial econometrics style spatial regression models by Bivand and Piras (2015) is still largely valid. In the review, estimators in these R packages were compared with alternative implementations available in other programming languages elsewhere. The review did not cover Bayesian spatial econometrics style spatial regression. More has changed with respect to spatial panel estimators described in Millo and Piras (2012), but will not be covered here.
For models with single spatial coefficients (SEM and SDEM using errorsarlm()
, SLM and SDM using lagsarlm()
), the methods initially described by Ord (1975) are used. The following table shows the functions that can be used to estimate the models described above using maximum likelihood.
model | model name | maximum likelihood estimation function |
---|---|---|
SEM | spatial error | errorsarlm(..., Durbin=FALSE, ...) |
SEM | spatial error | spautolm(..., family="SAR", ...) |
SDEM | spatial Durbin error | errorsarlm(..., Durbin=TRUE, ...) |
SLM | spatial lag | lagsarlm(..., Durbin=FALSE, ...) |
SDM | spatial Durbin | lagsarlm(..., Durbin=TRUE, ...) |
SAC | spatial autoregressive combined | sacsarlm(..., Durbin=FALSE, ...) |
GNM | general nested | sacsarlm(..., Durbin=TRUE, ...) |
The estimating functions errorsarlm()
and lagsarlm()
take similar arguments, where the first two, formula=
and data=
are shared by most model estimating functions. The third argument is a listw
spatial weights object, while na.action=
behaves as in other model estimating functions if the spatial weights can reasonably be subsetted to avoid observations with missing values. The weights=
argument may be used to provide weights indicating the known degree of per-observation variability in the variance term - this is not available for lagsarlm()
.
args(errorsarlm)
## function (formula, data = list(), listw, na.action, weights = NULL,
## Durbin, etype, method = "eigen", quiet = NULL, zero.policy = NULL,
## interval = NULL, tol.solve = .Machine$double.eps, trs = NULL,
## control = list())
## NULL
args(lagsarlm)
## function (formula, data = list(), listw, na.action, Durbin, type,
## method = "eigen", quiet = NULL, zero.policy = NULL, interval = NULL,
## tol.solve = .Machine$double.eps, trs = NULL, control = list())
## NULL
The Durbin=
argument replaces the earlier type=
and etype=
arguments, and if not given is taken as FALSE
. If given, it may be FALSE
, TRUE
in which case all spatially lagged covariates are included, or a one-sided formula specifying which spatially lagged covariates should be included. The method=
argument gives the method for calculating the log determinant term in the log likelihood function, and defaults to "eigen"
, suitable for moderately sized data sets. The interval=
argument gives the bounds of the domain for the line search using stats::optimize()
used for finding the spatial coefficient. The tol.solve()
argument, passed through to base::solve()
, was needed to handle data sets with differing numerical scales among the coefficients which hindered inversion of the variance-covariance matrix; the default value in base::solve()
used to be much larger. The control=
argument takes a list of control values to permit more careful adjustment of the running of the estimation function.
The spautolm()
function also fits spatial regressions with the spatial process in the residuals, and takes a possibly poorly named family=
argument, taking the values of "SAR"
for the simultaneous autoregressive model (like errorsarlm()
), "CAR"
for the conditional autoregressive model, and "SMA"
for the spatial moving average model.
args(spautolm)
## function (formula, data = list(), listw, weights, na.action,
## family = "SAR", method = "eigen", verbose = NULL, trs = NULL,
## interval = NULL, zero.policy = NULL, tol.solve = .Machine$double.eps,
## llprof = NULL, control = list())
## NULL
The sacsarlm()
function may take second spatial weights and interval arguments if the spatial weights used to model the two spatial processes in the SAC and GNM specifications differ. By default, the same spatial weights are used. By default, stats::nlminb()
is used for numerical optimization, using a heuristic to choose starting values.
args(sacsarlm)
## function (formula, data = list(), listw, listw2 = NULL, na.action,
## Durbin, type, method = "eigen", quiet = NULL, zero.policy = NULL,
## tol.solve = .Machine$double.eps, llprof = NULL, interval1 = NULL,
## interval2 = NULL, trs1 = NULL, trs2 = NULL, control = list())
## NULL
Where larger data sets are used, a numerical Hessian approach is used to calculate the variance-covariance matrix of coefficients, rather than an analytical asymptotic approach.
Apart from spautolm()
which returns an "spautolm"
object, the model fitting functions return "Sarlm"
objects. Standard methods for fitted models are provided, such as summary()
:
args(getS3method("summary", "Sarlm"))
## function (object, correlation = FALSE, Nagelkerke = FALSE, Hausman = FALSE,
## adj.se = FALSE, ...)
## NULL
The Nagelkerke=
argument permits the return of a value approximately corresponding to a coefficient of determination, although the summary method anyway provides the value of stats::AIC()
because a stats::logLik()
method is provided for "sarlm"
and "spautolm"
objects. If the "sarlm"
object is a SEM or SDEM, the Hausman test may be performed by setting Hausman=TRUE
to see whether the regression coefficients are sufficiently like least squares coefficients, indicating absence of mis-specification from that source. As an example, we may fit SEM and SDEM to the 94 and 489 observation Boston data sets, and present the Hausman test results:
eigs_489 <- eigenw(lw_q_489)
SDEM_489 <- errorsarlm(form, data=boston_489, listw=lw_q_489, Durbin=TRUE,
zero.policy=TRUE, control=list(pre_eig=eigs_489))
SEM_489 <- errorsarlm(form, data=boston_489, listw=lw_q_489,
zero.policy=TRUE, control=list(pre_eig=eigs_489))
cbind(data.frame(model=c("SEM", "SDEM")),
rbind(broom::tidy(Hausman.test(SEM_489)),
broom::tidy(Hausman.test(SDEM_489))))[,1:4]
## model statistic p.value parameter
## 1 SEM 51.9862 2.827192e-06 14
## 2 SDEM 48.6551 6.481654e-03 27
Here we are using the control=
list argument to pass through pre-computed eigenvalues for the default "eigen"
method. Both test results for the 489 tract data set suggest that the regression coefficients do differ, perhaps that the footprints of the spatial processes do not match. Likelihood ratio tests of the spatial models against their least squares equivalents show that spatial process(es) are present, but we find that neither SEM not SDM are adequate representations.
cbind(data.frame(model=c("SEM", "SDEM")),
rbind(broom::tidy(LR1.Sarlm(SEM_489)),
broom::tidy(LR1.Sarlm(SDEM_489))))[,c(1, 4:6)]
## model statistic p.value parameter
## 1 SEM 198.4133 0 1
## 2 SDEM 159.3798 0 1
For the 94 air pollution model output zones, the Hausman tests find little difference between coefficients, but this is related to the fact that the SEM and SDEM models add little to least squares or SLX using likelihood ratio tests.
eigs_94 <- eigenw(lw_q_94)
SDEM_94 <- errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE,
control=list(pre_eig=eigs_94))
SEM_94 <- errorsarlm(form, data=boston_94, listw=lw_q_94, control=list(pre_eig=eigs_94))
cbind(data.frame(model=c("SEM", "SDEM")),
rbind(broom::tidy(Hausman.test(SEM_94)),
broom::tidy(Hausman.test(SDEM_94))))[, 1:4]
## model statistic p.value parameter
## 1 SEM 15.657083 0.3347612 14
## 2 SDEM 9.205152 0.9994394 27
cbind(data.frame(model=c("SEM", "SDEM")), rbind(broom::tidy(LR1.Sarlm(SEM_94)), broom::tidy(LR1.Sarlm(SDEM_94))))[,c(1, 4:6)]
## model statistic p.value parameter
## 1 SEM 2.5933581 0.1073126 1
## 2 SDEM 0.2158487 0.6422213 1
We can use spatialreg::LR.sarlm()
to apply a likelihood ratio test between nested models, but here choose lmtest::lrtest()
, which gives the same results, preferring models including spatially lagged covariates.
broom::tidy(lmtest::lrtest(SEM_489, SDEM_489))
## Warning in tidy.anova(lmtest::lrtest(SEM_489, SDEM_489)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik
## # A tibble: 2 × 5
## X.Df LogLik df statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16 273. NA NA NA
## 2 29 311. 13 74.4 1.23e-10
broom::tidy(lmtest::lrtest(SEM_94, SDEM_94))
## Warning in tidy.anova(lmtest::lrtest(SEM_94, SDEM_94)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik
## # A tibble: 2 × 5
## X.Df LogLik df statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 16 59.7 NA NA NA
## 2 29 81.3 13 43.2 0.0000421
The SLX model is fitted using least squares, and also returns a log likelihood value, letting us test whether we need a spatial process in the residuals.
args(lmSLX)
## function (formula, data = list(), listw, na.action, weights = NULL,
## Durbin = TRUE, zero.policy = NULL)
## NULL
In the tract data set we obviously do:
SLX_489 <- lmSLX(form, data=boston_489, listw=lw_q_489, zero.policy=TRUE)
broom::tidy(lmtest::lrtest(SLX_489, SDEM_489))
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"
## Warning in tidy.anova(lmtest::lrtest(SLX_489, SDEM_489)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik
## # A tibble: 2 × 5
## X.Df LogLik df statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 231. NA NA NA
## 2 29 311. 1 159. 1.55e-36
but in the output zone case we do not.
SLX_94 <- lmSLX(form, data=boston_94, listw=lw_q_94)
broom::tidy(lmtest::lrtest(SLX_94, SDEM_94))
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"
## Warning in tidy.anova(lmtest::lrtest(SLX_94, SDEM_94)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik
## # A tibble: 2 × 5
## X.Df LogLik df statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 81.2 NA NA NA
## 2 29 81.3 1 0.216 0.642
This outcome is sustained also when we use the counts of house units by tract and output zones as weights:
SLX_94w <- lmSLX(form, data=boston_94, listw=lw_q_94, weights=units)
SDEM_94w <- errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE, weights=units,
control=list(pre_eig=eigs_94))
broom::tidy(lmtest::lrtest(SLX_94w, SDEM_94w))
## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"
## Warning in tidy.anova(lmtest::lrtest(SLX_94w, SDEM_94w)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik
## # A tibble: 2 × 5
## X.Df LogLik df statistic p.value
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 28 97.5 NA NA NA
## 2 29 98.0 1 0.917 0.338
The estimation methods used for fitting SLM and the spatially lagged response part of SARAR models are based on instrumental variables, using the spatially lagged covariates (usually including lags of lags too) as instruments for the spatially lagged response (Piras 2010). This, and the use of spatially lagged covariates in the moment conditions for models including a spatial process in the residuals, means that including the spatially lagged covariates in the initial model is challenging, and functions in the sphet package do not provide a Durbin=
argument. This makes it harder to accommodate data with multiple spatial process footprints. However, as Kelejian and Piras show in their recent review (2017), these approaches have other benefits, such as being able to instrument variables suspected of suffering from endogeneity or measurement error. Let us first compare the ML and GMM estimates of the SEM regression coefficients for rescaled squared NOX values. We use the sphet::spreg()
wrapper function, and fit a SEM model. Extracting the matching rows from the summary objects for these models, we can see that the values are not dissimilar, despite the difference in estimation methods.
SEM1_94 <- sphet::spreg(form, data=boston_94, listw=lw_q_94, model="error")
res <- rbind(summary(SEM_94)$Coef["I((NOX * 10)^2)",],
summary(SEM1_94)$CoefTable["I((NOX * 10)^2)",])
rownames(res) <- c("ML", "GMM")
res
## Estimate Std. Error z value Pr(>|z|)
## ML -0.009564967 0.0026099 -3.664879 0.0002474555
## GMM -0.009744770 0.0026113 -3.731770 0.0001901391
Using sphet::spreg()
, we can instrument the rescaled squared NOX variable, dropping it first from the formula, next creating the rescaled squared NOX variable as a column in the "sf"
object, extracting a matrix of coordinates from the centroids of the output zones, and creating a one-sided instrument formula from a second order polynomial in the coordinates (here improperly, as they are not projected) and mean distances to downtown and radial roads. The endog=
argument takes a one-sided formula for the variables to be modelled in the first stage of the model. Had we re-run the original air pollution model many times under slightly varying scenarios, we could have used an ensemble of NOX loadings to yield its distribution by output zone. Because this is not possible, we assume that the measurement error can be captured by using selected instruments. Unfortunately, the NOX regression coefficient estimate from the second stage has fallen substantially in absolute size, although the sign is unchanged.
formiv <- update(form, . ~ . - I((NOX*10)^2))
boston_94$NOX2 <- (boston_94$NOX*10)^2
suppressWarnings(ccoords <- st_coordinates(st_centroid(st_geometry(boston_94))))
iform <- formula(~poly(ccoords, degree=2) + DIS + RAD)
SEM1_94iv <- sphet::spreg(formiv, data=boston_94, listw=lw_q_94, endog = ~NOX2,
instruments=iform, model="error")
summary(SEM1_94iv)$CoefTable["NOX2",]
## Estimate Std. Error t-value Pr(>|t|)
## -0.001974509 0.005664411 -0.348581556 0.727403476
Handling measurement error in this or similar ways is one of the benefits of GMM estimation methods, although here the choice of instruments was somewhat arbitrary.
The Spatial Econometrics Library is part of the extensive Matlab code repository at https://www.spatial-econometrics.com/ and documented in LeSage and Pace (2009). The Google Summer of Coding project in 2011 by Abhirup Mallik mentored by Virgilio Gómez-Rubio yielded translations of some of the model fitting functions for SEM, SDEM, SLM, SDM, SAC and GNM from the Matlab code. These have now been added to spatialreg as spBreg_err()
, spBreg_lag()
and spBreg_sac()
with Durbin=
arguments to handle the inclusion of spatially lagged covariates. As yet, heteroskedastic disturbances are not accommodated. The functions return "mcmc"
objects as specified in the coda package, permitting the use of tools from coda for handling model output. The default burnin is 500 draws, followed by default 2000 draws returned, but these values and many others may be set through the control=
list argument. Fitting the SDEM model for the output zones takes about an order of magnitude longer than using ML, but there is more work to do subsequently, and this difference scales more in the number of samples than covariates or observations.
system.time(SDEM_94B <- spBreg_err(form, data=boston_94, listw=lw_q_94, Durbin=TRUE))
## user system elapsed
## 1.818 0.001 1.835
system.time(SDEM_489B <- spBreg_err(form, data=boston_489, listw=lw_q_489, Durbin=TRUE, zero.policy=TRUE))
## user system elapsed
## 2.496 0.004 2.521
Most time in the ML case using eigenvalues is taken by log determinant setup and optimization, and by dense matrix asymptotic standard errors if chosen (always chosen with default "eigen"
log determinant method):
t(errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE)$timings[,2])
## set_up eigen_set_up eigen_opt coefs eigen_hcov eigen_se
## [1,] 0.004 0.005 0.006 0.003 0.002 0.001
t(errorsarlm(form, data=boston_489, listw=lw_q_489, Durbin=TRUE, zero.policy=TRUE)$timings[,2])
## set_up eigen_set_up eigen_opt coefs eigen_hcov eigen_se
## [1,] 0.008 0.053 0.025 0.005 0.094 0.192
while in the MCMC case, the default use of Spatial Econometric Toolbox gridded log determinants and obviously sampling takes most time:
t(attr(SDEM_94B, "timings")[ , 3])
## set_up SE_classic_set_up complete_setup sampling finalise
## [1,] 0.003 0.245 0.027 1.501 0.057
t(attr(SDEM_489B, "timings")[ , 3])
## set_up SE_classic_set_up complete_setup sampling finalise
## [1,] 0.005 0.42 0.069 1.977 0.05
However, as we will see shortly, inference from model impacts may need sampling (where the spatially lagged response is included in the model), and in the case of MCMC models, the samples have already been drawn.
We will briefly touch on selected implementation details that applied users of spatial regression models would be wise to review. The handling of the log determinant term applies to all such users, while impacts are restricted to those employing spatial econometrics style models either including the spatially lagged response or including spatially lagged covariates.
It has been known for over twenty years that the sparse matrix representation of spatial weights overcomes the difficulties of fitting models with larger numbers of observations using maximum likelihood and MCMC where the log determinant term comes into play (R. K. Pace and Barry 1997a, 1997c, 1997d, 1997b). During the development of these approaches in model fitting functions in spatialreg, use was first made of C code also used in the S-PLUS SpatialStats module (Kaluzny et al. 1998), then SparseM which used a compressed sparse row form very similar to "nb"
and "listw"
objects. This was followed by the use of spam and Matrix methods, both of which mainly use compressed sparse column representations. Details are provided in Bivand, Hauke and Kossowski (2013).
The domain of the spatial coefficient(s) is given by the interval=
argument to model fitting functions, and returned in the fitted object:
SEM_94$interval
## [1] -1.527257 1.000000
This case is trivial, because the upper bound is unity by definition, because of the use of row standardization. The interval is the inverse of the range of the eigenvalues of the weights matrix:
1/range(eigs_94)
## [1] -1.527257 1.000000
Finding the interval within which to search for the spatial coefficient is trivial for smaller data sets, but more complex for larger ones. It is possible to use heuristics implemented in lextrW()
(D. Griffith, Bivand, and Chun 2015):
1/c(lextrW(lw_q_94))
## lambda_n lambda_1
## -1.527257 1.000000
or RSpectra::eigs()
after coercion to a Matrix package compressed sparse column representation:
W <- as(lw_q_94, "CsparseMatrix")
1/Re(c(RSpectra::eigs(W, k=1, which="SR")$values,
RSpectra::eigs(W, k=1, which="LR")$values))
## [1] -1.527257 1.000000
Why are we extracting the real part of the values returned by eigs()
? Since nb_q_94
is symmetric, the row-standardized object lw_q_94
is symmetric by similarity, a result known since Ord (1975); consequently, we can take the real part without concern. Had the underlying neighbour relationships not been symmetric, we should be more careful.
The baseline log determinant term as given by Ord (1975) for a coefficient value proposed in sampling or during numerical optimization; this extract matches the "eigen"
method (with or without control=list(pre_eig=...)"
):
coef <- 0.5
sum(log(1 - coef * eigs_94))
## [1] -2.867292
Using sparse matrix functions from Matrix, the LU decomposition can be used for asymmetric matrices; this extract matches the "LU"
method:
I <- Diagonal(nrow(boston_94))
LU <- lu(I - coef * W)
dU <- abs(diag(slot(LU, "U")))
sum(log(dU))
## [1] -2.867292
and Cholesky decomposition for symmetric matrices, with similar.listw()
used to handle asymmetric weights that are similar to symmetric. The default value od super
allows the underlying Matrix code to choose between supernodal or simplicial decomposition; this extract matches the "Matrix_J"
method:
W <- as(similar.listw(lw_q_94), "CsparseMatrix")
super <- as.logical(NA)
cch <- Cholesky((I - coef * W), super=super)
c(2 * determinant(cch, logarithm = TRUE)$modulus)
## [1] -2.867292
The "Matrix"
and "spam_update"
methods are to be preferred as they pre-compute the fill-reducing permutation of the decomposition since the weights do not change for different values of the coefficient.
Maximum likelihood model fitting functions in spatialreg and splm use jacobianSetup()
to populate env=
environment with intermediate objects needed to find log determinants during optimization. Passing environments to objective functions is efficient because they are passed by reference rather than value. The con=
argument passes through the populated control list, containing default values unless the key-value pairs were given in the function call (pre_eig=
is extracted separately). The which=
argument is 1
by default, but will also take 2
in SAC and GNM models. HSAR uses mcdet_setup()
to set up Monte Carlo approximation terms.
args(jacobianSetup)
## function (method, env, con, pre_eig = NULL, trs = NULL, interval = NULL,
## which = 1)
## NULL
For each value of coef
, the do_ldet()
function returns the log determinant, using the values stored in environment env=
:
args(do_ldet)
## function (coef, env, which = 1)
## NULL
As yet the Bayesian models are limited to control argument ldet_method="SE_classic"
at present, using "LU"
to generate a coarse grid of control argument nrho=200L
log determinant values in the interval, spline interpolated to a finer grid of length control argument interpn=2000L
, from which griddy Gibbs samples are drawn. It is hoped to add facilities to choose alternative methods in the future. This would offer possibilities to move beyond griddy Gibbs, but using gridded log determinant values seems reasonable at present.
Global impacts have been seen as crucial for reporting results from fitting models including the spatially lagged response (SLM, SDM, SAC. GNM) for over ten years (James P. LeSage and Pace 2009). Extension to other models including spatially lagged covariates (SLX, SDEM) has followed (Elhorst 2010; Roger S. Bivand 2012; Halleck Vega and Elhorst 2015). In order to infer from the impacts, linear combination may be used for SLX and SDEM models. For SLM, SDM, SAC and GNM models fitted with maximum likelihood or GMM, the variance-covariance matrix of the coefficients is available, and can be used to make random draws from a multivariate Normal distribution with mean set to coefficient values and variance to the estimated variance-covariance matrix. For these models fitted using Bayesian methods, draws are already available. In the SDEM case, the draws on the regression coefficients of the unlagged covariates represent direct impacts, and draws on the coefficients of the spatially lagged covariates represent indirect impacts, and their by-draw sums the total impacts.
Impacts are calculated using model object class specific impacts()
methods, here taking the method for "sarlm"
objects as an example. In the sphet package, the impacts method for "gstsls"
uses the spatialreg impacts()
framework, as does the splm package for "splm"
fitted model objects. impacts()
methods require either a tr=
- a vector of traces of the power series of the weights object typically computed with trW()
or a listw=
argument. If listw=
is given, dense matrix methods are used. The evalues=
argument is experimental, does not yet work for all model types, and takes the eigenvalues of the weights matrix. The R=
argument gives the number of samples to be taken from the fitted model. The Q=
permits the decomposition of impacts to components of the power series of the weights matrix (James P. LeSage and Pace 2009).
args(getS3method("impacts", "Sarlm"))
## function (obj, ..., tr = NULL, R = NULL, listw = NULL, evalues = NULL,
## useHESS = NULL, tol = 1e-06, empirical = FALSE, Q = NULL)
## NULL
The summary method for the output of impacts()
methods where inference from samples was requested by default uses the summary()
method for "mcmc"
objects defined in the coda package. It can instead report just matrices of standard errors, z-values and p-values by setting zstats=
and short=
to TRUE
.
args(getS3method("summary", "LagImpact"))
## function (object, ..., zstats = FALSE, short = FALSE, reportQ = NULL)
## NULL
Since sampling is not required for inference for SLX and SDEM models, linear combination is used for models fitted using maximum likelihood; results are shown here for the air pollution variable only. The literature has not yet resolved the question of how to report model output, as each covariate is now represented by three impacts. Where spatially lagged covariates are included, two coefficients are replaced by three impacts.
sum_imp_94_SDEM <- summary(impacts(SDEM_94))
rbind(Impacts=sum_imp_94_SDEM$mat[5,], SE=sum_imp_94_SDEM$semat[5,])
## Direct Indirect Total
## Impacts -0.012764045 -0.018454532 -0.031218577
## SE 0.002354762 0.004717734 0.005303962
The impacts from the same model fitted by MCMC are very similar:
sum_imp_94_SDEM_B <- summary(impacts(SDEM_94B))
rbind(Impacts=sum_imp_94_SDEM_B$mat[5,], SE=sum_imp_94_SDEM_B$semat[5,])
## Direct Indirect Total
## Impacts -0.012791651 -0.017728819 -0.030520470
## SE 0.002460167 0.005119155 0.006077588
as also are those from the SLX model. In the SLX and SDEM models, the direct impacts are the consequences for the response of changes in air pollution in the same observational entity, and the indirect (local) impacts are the consequences for the response of changes in air pollution in neighbouring observational entities.
sum_imp_94_SLX <- summary(impacts(SLX_94))
rbind(Impacts=sum_imp_94_SLX$mat[5,], SE=sum_imp_94_SLX$semat[5,])
## Direct Indirect Total
## Impacts -0.012766473 -0.018743765 -0.031510238
## SE 0.002796966 0.005564633 0.006114058
In contrast to local indirect impacts in SLX and SDEM models, global indirect impacts are found in models including the spatially lagged response. For purposes of exposition, let us fit an SLM:
SLM_489 <- lagsarlm(form, data=boston_489, listw=lw_q_489, zero.policy=TRUE)
Traces of the first m=
matrices of the power series in the spatial weights are pre-computed to speed up inference from samples from the fitted model, and from existing MCMC samples (James P. LeSage and Pace 2009). The traces can also be used in other contexts too, so their pre-computation may be worthwhile anyway. The type=
argument is "mult"
by default, but may also be set to "MC"
for Monte Carlo simulation or "moments"
using a space-saving looping algorithm.
args(trW)
## function (W = NULL, m = 30, p = 16, type = "mult", listw = NULL,
## momentsSymmetry = TRUE)
## NULL
W <- as(lw_q_489, "CsparseMatrix")
tr_489 <- trW(W)
str(tr_489)
## num [1:30] 0 90.8 29.4 42.1 26.5 ...
## - attr(*, "timings")= Named num [1:2] 0.173 0.18
## ..- attr(*, "names")= chr [1:2] "user.self" "elapsed"
## - attr(*, "type")= chr "mult"
## - attr(*, "n")= int 489
In this case, the spatial process in the response is not strong, so the global indirect impacts (here for the air pollution variable) are weak.
SLM_489_imp <- impacts(SLM_489, tr=tr_489, R=2000)
SLM_489_imp_sum <- summary(SLM_489_imp, short=TRUE, zstats=TRUE)
res <- rbind(Impacts=sapply(SLM_489_imp$res, "[", 5), SE=SLM_489_imp_sum$semat[5,])
colnames(res) <- c("Direct", "Indirect", "Total")
res
## Direct Indirect Total
## Impacts -0.005930731 -1.014744e-05 -0.005940879
## SE 0.001048725 1.010022e-04 0.001052947
Of more interest is trying to reconstruct the direct and total impacts using dense matrix methods; the direct global impacts are the mean of the diagonal of the dense impacts matrix, and the total global impacts are the sum of all matrix elements divided by the number of observations. The direct impacts agree, but the total impacts differ slightly.
coef_SLM_489 <- coef(SLM_489)
IrW <- Diagonal(489) - coef_SLM_489[1] * W
S_W <- solve(IrW)
S_NOX_W <- S_W %*% (diag(489) * coef_SLM_489[7])
c(Direct=mean(diag(S_NOX_W)), Total=sum(S_NOX_W)/489)
## Direct Total
## -0.005930731 -0.005940858
This bare-bones approach corresponds to using the listw=
argument, and as expected gives the same output.
sapply(impacts(SLM_489, listw=lw_q_489), "[", 5)
## direct indirect total
## -5.930731e-03 -1.014744e-05 -5.940879e-03
The experimental evalues=
approach which is known to be numerically exact by definition gives the same results as the matrix power series trace approach, so the slight difference may be attributed to the consequences of inverting the spatial process matrix.
sapply(impacts(SLM_489, evalues=eigs_489), "[", 5)
## direct indirect total
## -5.930731e-03 -1.014744e-05 -5.940879e-03
Impacts are crucial to the interpretation of Durbin models including spatially lagged covariates and models including the spatially lagged response. Tools to calculate impacts and their inferential bases are now available, and should be employed, but as yet some implementation details are under development and ways of presenting results in tabular form have not reached maturity.
We will use the predict()
method for "sarlm"
objects to double-check impacts, here for the pupil-teacher ratio (PTRATIO
). The method was re-written by Martin Gubri based on Goulard, Laurent and Thomas-Agnan (2017). The pred.type=
argument specifies the prediction strategy among those presented in the article.
args(getS3method("predict", "sarlm"))
## function (object, newdata = NULL, listw = NULL, pred.type = "TS",
## all.data = FALSE, zero.policy = NULL, legacy = TRUE, legacy.mixed = FALSE,
## power = NULL, order = 250, tol = .Machine$double.eps^(3/5),
## spChk = NULL, ...)
## NULL
First we’ll increment PTRATIO
by one to show that, using least squares, the mean difference between predictions from the incremented new data and fitted values is equal to the regression coefficient.
nd_489 <- boston_489
nd_489$PTRATIO <- nd_489$PTRATIO + 1
OLS_489 <- lm(form, data=boston_489)
fitted <- predict(OLS_489)
nd_fitted <- predict(OLS_489, newdata=nd_489)
all.equal(unname(coef(OLS_489)[12]), mean(nd_fitted - fitted))
## [1] TRUE
In models including the spatially lagged response, and when the spatial coefficient in different from zero, this is not the case in general, and is why we need impacts()
methods. The difference here is not great, but neither is it zero, and needs to be handled.
fitted <- predict(SLM_489)
## This method assumes the response is known - see manual page
nd_fitted <- predict(SLM_489, newdata=nd_489, listw=lw_q_489, pred.type="TS",
zero.policy=TRUE)
all.equal(unname(coef_SLM_489[13]), mean(nd_fitted - fitted))
## [1] "Mean relative difference: 0.001783081"
In the Boston tracts data set, 17 observations of median house values, the response, are censored. Using these as an example and comparing some pred.type=
variants for the SDEM model and predicting out-of-sample, we can see that there are differences, suggesting that this is a fruitful area for study. There have been a number of alternative proposals for handling missing variables (Gómez-Rubio, Bivand, and Rue 2015; Suesse 2018). Another reason for increasing attention on prediction is that it is fundamental for machine learning approaches, in which prediction for validation and test data sets drives model specification choice. The choice of training and other data sets with dependent spatial data remains an open question, and is certainly not as simple as with independent data.
Here, we’ll list the predictions for the censored tract observations using three different prediction types, taking the exponent to get back to the USD median house values.
nd <- boston_506[is.na(boston_506$median),]
t0 <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="TS", zero.policy=TRUE))
suppressWarnings(t1 <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="KP2",
zero.policy=TRUE)))
suppressWarnings(t2 <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="KP5",
zero.policy=TRUE)))
data.frame(fit_TS=t0[,1], fit_KP2=c(t1), fit_KP5=c(t2),
censored=boston_506$censored[as.integer(attr(t0, "region.id"))])
## fit_TS fit_KP2 fit_KP5 censored
## 13 23912.450 29477.240 28147.151 right
## 14 28125.588 27001.192 28516.159 right
## 15 30553.380 36184.007 32476.053 right
## 17 18518.448 19620.801 18878.244 right
## 43 9563.613 6816.721 7561.353 left
## 50 8371.200 7196.445 7383.139 left
## 312 51477.038 53300.877 54172.605 right
## 313 45920.630 45823.263 47094.833 right
## 314 44195.663 44585.510 45361.371 right
## 317 43427.407 45706.682 45442.461 right
## 337 39878.563 42071.842 41126.925 right
## 346 44708.480 46693.655 46107.827 right
## 355 48187.620 49067.627 48910.651 right
## 376 42881.289 45883.193 44965.674 right
## 408 44293.862 44615.008 45669.977 right
## 418 38210.776 43374.808 41913.830 right
## 434 41647.159 41690.222 42397.968 right
Let us read in again the ACS data set with over 70,000 observations:
library(sf)
df_tracts <- st_read("df_tracts.gpkg")
## Reading layer `df_tracts' from data source
## `/home/rsb/und/ecs530/2021/df_tracts.gpkg' using driver `GPKG'
## Simple feature collection with 71353 features and 28 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
set.ZeroPolicyOption(TRUE)
## [1] FALSE
spdep::set.ZeroPolicyOption(TRUE)
## [1] FALSE
nb_subset <- readRDS("nb_subset.rds")
lw <- spdep::nb2listw(nb_subset, style="W")
In the original study, logarithms of the median income CV were thought to be driven by a slightly broader set of covariates. Here we take logs of the vacancy rate, the elderly rate, the Black and Hispanic rates, the population density, and group quarters population (institutional like prisons, non-institutional like student accommodation https://www.census.gov/newsroom/releases/archives/2010_census/cb11-tps13.html).
tr_form <- log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) + log1p(hisp_rate) + log1p(group_pop) + log1p(dens)
The available covariates are not the same as in the original article, but our focus here is to show that larger data sets can be used in analysis:
lm_mod <- lm(tr_form, data=df_tracts)
summary(lm_mod)
##
## Call:
## lm(formula = tr_form, data = df_tracts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.14822 -0.29445 0.01038 0.30068 2.99263
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2284926 0.0355078 -175.41 <2e-16 ***
## log1p(vacancy_rate) 0.4039100 0.0248345 16.26 <2e-16 ***
## log1p(old_rate) 0.4069179 0.0040408 100.70 <2e-16 ***
## log1p(black_rate) 0.5478924 0.0110139 49.75 <2e-16 ***
## log1p(hisp_rate) 0.5854893 0.0121825 48.06 <2e-16 ***
## log1p(group_pop) 0.0216907 0.0007901 27.45 <2e-16 ***
## log1p(dens) 0.0354277 0.0018286 19.37 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4526 on 71346 degrees of freedom
## Multiple R-squared: 0.2296, Adjusted R-squared: 0.2295
## F-statistic: 3543 on 6 and 71346 DF, p-value: < 2.2e-16
The least squares residuals show strong positive spatial autocorrelation:
spdep::lm.morantest(lm_mod, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = tr_form, data = df_tracts)
## weights: lw
##
## Moran I statistic standard deviate = 50.637, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.117328e-01 -6.347763e-05 4.874291e-06
Adding the spatially lagged covariates in SLX model form might make sense:
SLX <- lmSLX(tr_form, data=df_tracts, listw=lw, zero.policy=TRUE)
summary(SLX)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.14224 -0.29424 0.01015 0.30057 2.94089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.091034 0.054824 -111.102 < 2e-16 ***
## log1p.vacancy_rate. 0.204452 0.039726 5.147 2.66e-07 ***
## log1p.old_rate. 0.402543 0.004422 91.026 < 2e-16 ***
## log1p.black_rate. 0.493457 0.027068 18.230 < 2e-16 ***
## log1p.hisp_rate. 0.607203 0.033922 17.900 < 2e-16 ***
## log1p.group_pop. 0.019748 0.000834 23.678 < 2e-16 ***
## log1p.dens. 0.010350 0.004203 2.463 0.01380 *
## lag.log1p.vacancy_rate. 0.374812 0.050936 7.359 1.88e-13 ***
## lag.log1p.old_rate. -0.016907 0.006436 -2.627 0.00861 **
## lag.log1p.black_rate. 0.053985 0.030672 1.760 0.07840 .
## lag.log1p.hisp_rate. -0.039247 0.037392 -1.050 0.29390
## lag.log1p.group_pop. 0.012803 0.001529 8.374 < 2e-16 ***
## lag.log1p.dens. 0.033688 0.004742 7.105 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4521 on 71340 degrees of freedom
## Multiple R-squared: 0.2315, Adjusted R-squared: 0.2314
## F-statistic: 1791 on 12 and 71340 DF, p-value: < 2.2e-16
The level of residual spatial autocorrelation is effectively unchanged, perhaps suggesting that the spatial footprint of the autocorrelation process(es) may not be well-represented by the chosen definition of tract neighbours:
spdep::lm.morantest(SLX, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## Moran I statistic standard deviate = 50.563, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.115597e-01 -7.287566e-05 4.874367e-06
The linear combinations of unlagged and lagged covariates give these impacts, many pulling the same way:
summary(spatialreg::impacts(SLX))
## Impact measures (SlX, estimable, n-k):
## Direct Indirect Total
## log1p(vacancy_rate) 0.20445207 0.37481174 0.57926381
## log1p(old_rate) 0.40254315 -0.01690728 0.38563587
## log1p(black_rate) 0.49345676 0.05398505 0.54744181
## log1p(hisp_rate) 0.60720345 -0.03924693 0.56795651
## log1p(group_pop) 0.01974814 0.01280282 0.03255096
## log1p(dens) 0.01035030 0.03368823 0.04403853
## ========================================================
## Standard errors:
## Direct Indirect Total
## log1p(vacancy_rate) 0.0397260202 0.050935722 0.032647455
## log1p(old_rate) 0.0044223032 0.006435572 0.006346559
## log1p(black_rate) 0.0270685196 0.030671611 0.012643255
## log1p(hisp_rate) 0.0339216058 0.037391531 0.013710963
## log1p(group_pop) 0.0008340256 0.001528895 0.001507331
## log1p(dens) 0.0042030102 0.004741785 0.002155113
## ========================================================
## Z-values:
## Direct Indirect Total
## log1p(vacancy_rate) 5.146553 7.358524 17.74300
## log1p(old_rate) 91.025678 -2.627160 60.76298
## log1p(black_rate) 18.229913 1.760098 43.29912
## log1p(hisp_rate) 17.900198 -1.049621 41.42353
## log1p(group_pop) 23.678098 8.373907 21.59510
## log1p(dens) 2.462592 7.104546 20.43444
##
## p-values:
## Direct Indirect Total
## log1p(vacancy_rate) 2.6532e-07 1.8585e-13 < 2.22e-16
## log1p(old_rate) < 2.22e-16 0.0086101 < 2.22e-16
## log1p(black_rate) < 2.22e-16 0.0783911 < 2.22e-16
## log1p(hisp_rate) < 2.22e-16 0.2938925 < 2.22e-16
## log1p(group_pop) < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens) 0.013794 1.2073e-12 < 2.22e-16
summary(df_tracts$tot_pop)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 501 2925 4028 4283 5341 37452
quantile(df_tracts$group_pop, seq(0, 1, 0.1))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0 0 0 0 2 7 16 40 91 187 19496
plot(df_tracts$group_pop, df_tracts$group_pop/df_tracts$tot_pop)
plot(df_tracts$tot_pop, df_tracts$group_pop/df_tracts$tot_pop)
Weighted least squares use the range of the weights to assign tracts with larger populations less residual variance, but here the group quarters populations play in too:
lm_modw <- lm(tr_form, data=df_tracts, weights=tot_pop)
summary(lm_modw)
##
## Call:
## lm(formula = tr_form, data = df_tracts, weights = tot_pop)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -165.025 -18.290 0.634 18.582 206.188
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2252852 0.0349852 -177.94 <2e-16 ***
## log1p(vacancy_rate) 0.5567749 0.0272130 20.46 <2e-16 ***
## log1p(old_rate) 0.4053828 0.0040674 99.67 <2e-16 ***
## log1p(black_rate) 0.5058137 0.0113856 44.43 <2e-16 ***
## log1p(hisp_rate) 0.5708457 0.0114385 49.91 <2e-16 ***
## log1p(group_pop) 0.0239794 0.0007418 32.33 <2e-16 ***
## log1p(dens) 0.0336080 0.0017827 18.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.5 on 71346 degrees of freedom
## Multiple R-squared: 0.2204, Adjusted R-squared: 0.2203
## F-statistic: 3361 on 6 and 71346 DF, p-value: < 2.2e-16
spdep::lm.morantest(lm_modw, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = tr_form, data = df_tracts, weights = tot_pop)
## weights: lw
##
## Moran I statistic standard deviate = 49.407, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.090160e-01 -6.281631e-05 4.874287e-06
We can try the same with other models, such as SLX, but again with little change:
SLXw <- lmSLX(tr_form, weights=tot_pop, data=df_tracts, listw=lw, zero.policy=TRUE)
summary(SLXw)
##
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))),
## data = as.data.frame(x), weights = weights)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -164.762 -18.250 0.628 18.602 212.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.2010197 0.0542376 -114.331 < 2e-16 ***
## log1p.vacancy_rate. 0.2900525 0.0426211 6.805 1.02e-11 ***
## log1p.old_rate. 0.3983487 0.0044799 88.919 < 2e-16 ***
## log1p.black_rate. 0.4647722 0.0276407 16.815 < 2e-16 ***
## log1p.hisp_rate. 0.5933632 0.0325690 18.219 < 2e-16 ***
## log1p.group_pop. 0.0212821 0.0007877 27.019 < 2e-16 ***
## log1p.dens. 0.0168140 0.0040944 4.107 4.02e-05 ***
## lag.log1p.vacancy_rate. 0.4325027 0.0518089 8.348 < 2e-16 ***
## lag.log1p.old_rate. -0.0019333 0.0065644 -0.295 0.768
## lag.log1p.black_rate. 0.0407380 0.0311804 1.307 0.191
## lag.log1p.hisp_rate. -0.0310037 0.0358891 -0.864 0.388
## lag.log1p.group_pop. 0.0164220 0.0014939 10.993 < 2e-16 ***
## lag.log1p.dens. 0.0231145 0.0046590 4.961 7.02e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.45 on 71340 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2226
## F-statistic: 1704 on 12 and 71340 DF, p-value: < 2.2e-16
spdep::lm.morantest(SLXw, lw)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
##
## Moran I statistic standard deviate = 49.483, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## 1.091760e-01 -7.215477e-05 4.874344e-06
summary(spatialreg::impacts(SLXw))
## Impact measures (SlX, estimable, n-k):
## Direct Indirect Total
## log1p(vacancy_rate) 0.29005246 0.432502675 0.72255514
## log1p(old_rate) 0.39834869 -0.001933313 0.39641538
## log1p(black_rate) 0.46477224 0.040737989 0.50551022
## log1p(hisp_rate) 0.59336317 -0.031003701 0.56235946
## log1p(group_pop) 0.02128214 0.016422044 0.03770419
## log1p(dens) 0.01681396 0.023114529 0.03992849
## ========================================================
## Standard errors:
## Direct Indirect Total
## log1p(vacancy_rate) 0.0426210645 0.051808861 0.033980108
## log1p(old_rate) 0.0044798951 0.006564424 0.006325152
## log1p(black_rate) 0.0276406965 0.031180367 0.012944562
## log1p(hisp_rate) 0.0325690346 0.035889146 0.012845410
## log1p(group_pop) 0.0007876827 0.001493860 0.001457527
## log1p(dens) 0.0040943629 0.004658979 0.002114653
## ========================================================
## Z-values:
## Direct Indirect Total
## log1p(vacancy_rate) 6.805378 8.3480444 21.26406
## log1p(old_rate) 88.919201 -0.2945138 62.67286
## log1p(black_rate) 16.814780 1.3065269 39.05194
## log1p(hisp_rate) 18.218629 -0.8638740 43.77902
## log1p(group_pop) 27.018677 10.9930292 25.86861
## log1p(dens) 4.106612 4.9612865 18.88181
##
## p-values:
## Direct Indirect Total
## log1p(vacancy_rate) 1.0078e-11 < 2.22e-16 < 2.22e-16
## log1p(old_rate) < 2.22e-16 0.76837 < 2.22e-16
## log1p(black_rate) < 2.22e-16 0.19137 < 2.22e-16
## log1p(hisp_rate) < 2.22e-16 0.38766 < 2.22e-16
## log1p(group_pop) < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens) 4.0151e-05 7.0028e-07 < 2.22e-16
With a larger number of observations, using eigenvalues to calculate the Jacobian is not possible, so we use a pre-computed Cholesky decomposition:
SEM <- spatialreg::errorsarlm(tr_form, data=df_tracts, listw=lw, method="Matrix", zero.policy=TRUE)
apply(SEM$timings, 2, sum)
## user.self elapsed
## 4.857 5.050
summary(SEM, Hausman=TRUE, Nagelkerke=TRUE)
##
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw,
## method = "Matrix", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.161891 -0.287439 0.010315 0.296182 2.979651
##
## Type: error
## Regions with no neighbours included:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306 68844 68898 68901 69293 69294
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.25184400 0.03703706 -168.800 < 2.2e-16
## log1p(vacancy_rate) 0.35134711 0.02829437 12.418 < 2.2e-16
## log1p(old_rate) 0.41108547 0.00418162 98.308 < 2.2e-16
## log1p(black_rate) 0.54672194 0.01341739 40.747 < 2.2e-16
## log1p(hisp_rate) 0.59289677 0.01501856 39.478 < 2.2e-16
## log1p(group_pop) 0.02045655 0.00080153 25.522 < 2.2e-16
## log1p(dens) 0.03120040 0.00220509 14.149 < 2.2e-16
##
## Lambda: 0.25388, LR test value: 2082.7, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0079519
## z-value: 31.927, p-value: < 2.22e-16
## Wald statistic: 1019.3, p-value: < 2.22e-16
##
## Log likelihood: -43642.84 for error model
## ML residual variance (sigma squared): 0.19674, (sigma: 0.44356)
## Nagelkerke pseudo-R-squared: 0.25173
## Number of observations: 71353
## Number of parameters estimated: 9
## AIC: 87304, (AIC for lm: 89384)
## Hausman test: 171.45, df: 7, p-value: < 2.22e-16
We can also use single-chain MCMC with default 2500 draws, of which 500 are treated as burn-in and discarded; much of the time is spent on pre-computing a coarse grid of LU-decomposition Jacobian terms:
set.seed(1)
BSEM <- spatialreg::spBreg_err(tr_form, data=df_tracts, listw=lw, zero.policy=TRUE)
apply(attr(BSEM, "timings")[,c(1,3)], 2, sum)
## user.self elapsed
## 115.859 118.196
As we know, the model MCMC output is held in an "mcmc"
object defined in coda:
library(coda)
summary(BSEM)
##
## Iterations = 501:2500
## Thinning interval = 1
## Number of chains = 1
## Sample size per chain = 2000
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## (Intercept) -6.25240 0.0368774 8.246e-04 8.246e-04
## log1p(vacancy_rate) 0.35138 0.0288813 6.458e-04 6.458e-04
## log1p(old_rate) 0.41115 0.0041649 9.313e-05 9.313e-05
## log1p(black_rate) 0.54719 0.0137979 3.085e-04 3.085e-04
## log1p(hisp_rate) 0.59254 0.0154012 3.444e-04 3.444e-04
## log1p(group_pop) 0.02045 0.0007997 1.788e-05 1.929e-05
## log1p(dens) 0.03117 0.0022222 4.969e-05 4.969e-05
## lambda 0.25291 0.0053490 1.196e-04 1.196e-04
## sige 0.20019 0.0012118 2.710e-05 2.710e-05
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## (Intercept) -6.32307 -6.27707 -6.25300 -6.22796 -6.17848
## log1p(vacancy_rate) 0.29587 0.33131 0.35145 0.37149 0.40597
## log1p(old_rate) 0.40290 0.40839 0.41116 0.41394 0.41910
## log1p(black_rate) 0.51979 0.53798 0.54742 0.55630 0.57370
## log1p(hisp_rate) 0.56259 0.58228 0.59234 0.60257 0.62334
## log1p(group_pop) 0.01887 0.01990 0.02045 0.02099 0.02201
## log1p(dens) 0.02684 0.02966 0.03119 0.03271 0.03541
## lambda 0.24262 0.24962 0.25263 0.25663 0.26363
## sige 0.19800 0.19944 0.20018 0.20088 0.20234
Diagnostic tests from coda can also be used:
raftery.diag(BSEM, r=0.01)
##
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.01
## Probability (s) = 0.95
##
## Burn-in Total Lower bound Dependence
## (M) (N) (Nmin) factor (I)
## (Intercept) 2 930 937 0.993
## log1p(vacancy_rate) 2 950 937 1.010
## log1p(old_rate) 2 930 937 0.993
## log1p(black_rate) 2 913 937 0.974
## log1p(hisp_rate) 2 892 937 0.952
## log1p(group_pop) 2 930 937 0.993
## log1p(dens) 2 950 937 1.010
## lambda 1 1190 937 1.270
## sige 2 930 937 0.993
plot(BSEM, trace=FALSE)
We can also use weights with SEM and conditioned SEM models:
SEMw <- spatialreg::errorsarlm(tr_form, weights=tot_pop, data=df_tracts, listw=lw, method="Matrix", zero.policy=TRUE)
apply(SEMw$timings, 2, sum)
## user.self elapsed
## 4.884 4.986
summary(SEMw, Hausman=TRUE, Nagelkerke=TRUE)
##
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw,
## weights = tot_pop, method = "Matrix", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1721397 -0.2875481 0.0096908 0.2949229 2.9755875
##
## Type: error
## Regions with no neighbours included:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306 68844 68898 68901 69293 69294
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.25080288 0.03640540 -171.700 < 2.2e-16
## log1p(vacancy_rate) 0.49017557 0.03116976 15.726 < 2.2e-16
## log1p(old_rate) 0.41003992 0.00422318 97.093 < 2.2e-16
## log1p(black_rate) 0.50494215 0.01384960 36.459 < 2.2e-16
## log1p(hisp_rate) 0.57340744 0.01406260 40.775 < 2.2e-16
## log1p(group_pop) 0.02242876 0.00075294 29.788 < 2.2e-16
## log1p(dens) 0.02993784 0.00213535 14.020 < 2.2e-16
##
## Lambda: 0.24966, LR test value: 2032, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0038204
## z-value: 65.349, p-value: < 2.22e-16
## Wald statistic: 4270.5, p-value: < 2.22e-16
##
## Log likelihood: -44437.41 for error model
## ML residual variance (sigma squared): 780.56, (sigma: 27.938)
## Nagelkerke pseudo-R-squared: 0.23488
## Number of observations: 71353
## Number of parameters estimated: 9
## AIC: 88893, (AIC for lm: 90923)
## Hausman test: 208.19, df: 7, p-value: < 2.22e-16
and its Durbin variant:
SDEM <- spatialreg::errorsarlm(tr_form, data=df_tracts, listw=lw, method="Matrix", Durbin=TRUE, zero.policy=TRUE)
apply(SDEM$timings, 2, sum)
## user.self elapsed
## 9.584 9.952
summary(SDEM, Nagelkerke=TRUE)
##
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw,
## Durbin = TRUE, method = "Matrix", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.153522 -0.286920 0.010042 0.295821 2.932170
##
## Type: error
## Regions with no neighbours included:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306 68844 68898 68901 69293 69294
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.16017866 0.06471718 -95.1861 < 2.2e-16
## log1p(vacancy_rate) 0.22011019 0.03746441 5.8752 4.224e-09
## log1p(old_rate) 0.40323063 0.00432342 93.2666 < 2.2e-16
## log1p(black_rate) 0.50755974 0.02561676 19.8136 < 2.2e-16
## log1p(hisp_rate) 0.62902991 0.03218041 19.5470 < 2.2e-16
## log1p(group_pop) 0.01986287 0.00080835 24.5722 < 2.2e-16
## log1p(dens) 0.00833251 0.00399191 2.0873 0.03686
## lag.log1p(vacancy_rate) 0.33910367 0.05237609 6.4744 9.519e-11
## lag.log1p(old_rate) -0.00961209 0.00682877 -1.4076 0.15925
## lag.log1p(black_rate) 0.03009720 0.03046327 0.9880 0.32316
## lag.log1p(hisp_rate) -0.06474559 0.03688723 -1.7552 0.07922
## lag.log1p(group_pop) 0.01440312 0.00167140 8.6174 < 2.2e-16
## lag.log1p(dens) 0.03495989 0.00475808 7.3475 2.021e-13
##
## Lambda: 0.25272, LR test value: 2070, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0042256
## z-value: 59.807, p-value: < 2.22e-16
## Wald statistic: 3576.8, p-value: < 2.22e-16
##
## Log likelihood: -43560.04 for error model
## ML residual variance (sigma squared): 0.19631, (sigma: 0.44307)
## Nagelkerke pseudo-R-squared: 0.25347
## Number of observations: 71353
## Number of parameters estimated: 15
## AIC: 87150, (AIC for lm: 89218)
summary(spatialreg::impacts(SDEM))
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## log1p(vacancy_rate) 0.220110186 0.339103674 0.55921386
## log1p(old_rate) 0.403230628 -0.009612094 0.39361853
## log1p(black_rate) 0.507559744 0.030097197 0.53765694
## log1p(hisp_rate) 0.629029914 -0.064745592 0.56428432
## log1p(group_pop) 0.019862873 0.014403116 0.03426599
## log1p(dens) 0.008332506 0.034959895 0.04329240
## ========================================================
## Standard errors:
## Direct Indirect Total
## log1p(vacancy_rate) 0.0374644083 0.052376090 0.040633564
## log1p(old_rate) 0.0043234188 0.006828774 0.007484115
## log1p(black_rate) 0.0256167606 0.030463269 0.016154547
## log1p(hisp_rate) 0.0321804058 0.036887235 0.017557548
## log1p(group_pop) 0.0008083488 0.001671399 0.001803419
## log1p(dens) 0.0039919138 0.004758078 0.002749267
## ========================================================
## Z-values:
## Direct Indirect Total
## log1p(vacancy_rate) 5.875181 6.4743984 13.76236
## log1p(old_rate) 93.266613 -1.4075870 52.59387
## log1p(black_rate) 19.813580 0.9879832 33.28208
## log1p(hisp_rate) 19.546985 -1.7552303 32.13913
## log1p(group_pop) 24.572157 8.6174018 19.00057
## log1p(dens) 2.087346 7.3474828 15.74689
##
## p-values:
## Direct Indirect Total
## log1p(vacancy_rate) 4.2238e-09 9.5191e-11 < 2.22e-16
## log1p(old_rate) < 2.22e-16 0.15925 < 2.22e-16
## log1p(black_rate) < 2.22e-16 0.32316 < 2.22e-16
## log1p(hisp_rate) < 2.22e-16 0.07922 < 2.22e-16
## log1p(group_pop) < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens) 0.036857 2.0206e-13 < 2.22e-16
and weighted:
SDEMw <- spatialreg::errorsarlm(tr_form, weights=tot_pop, data=df_tracts, listw=lw, method="Matrix", Durbin=TRUE, zero.policy=TRUE)
apply(SDEMw$timings, 2, sum)
## user.self elapsed
## 8.398 8.563
summary(SDEMw, Hausman=TRUE, Nagelkerke=TRUE)
##
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw,
## weights = tot_pop, Durbin = TRUE, method = "Matrix", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1665752 -0.2883039 0.0096818 0.2960586 2.9234058
##
## Type: error
## Regions with no neighbours included:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306 68844 68898 68901 69293 69294
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.26779399 0.06438165 -97.3537 < 2.2e-16
## log1p(vacancy_rate) 0.29917270 0.04034320 7.4157 1.210e-13
## log1p(old_rate) 0.40182015 0.00435840 92.1944 < 2.2e-16
## log1p(black_rate) 0.48896909 0.02617479 18.6809 < 2.2e-16
## log1p(hisp_rate) 0.61656157 0.03094209 19.9263 < 2.2e-16
## log1p(group_pop) 0.02144233 0.00076196 28.1408 < 2.2e-16
## log1p(dens) 0.01417781 0.00388774 3.6468 0.0002655
## lag.log1p(vacancy_rate) 0.40922588 0.05309595 7.7073 1.288e-14
## lag.log1p(old_rate) 0.00219764 0.00697219 0.3152 0.7526096
## lag.log1p(black_rate) 0.00270443 0.03096953 0.0873 0.9304128
## lag.log1p(hisp_rate) -0.06315358 0.03539952 -1.7840 0.0744198
## lag.log1p(group_pop) 0.01860324 0.00163288 11.3929 < 2.2e-16
## lag.log1p(dens) 0.02522840 0.00467896 5.3919 6.972e-08
##
## Lambda: 0.24893, LR test value: 2029.7, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0056695
## z-value: 43.906, p-value: < 2.22e-16
## Wald statistic: 1927.8, p-value: < 2.22e-16
##
## Log likelihood: -44328.16 for error model
## ML residual variance (sigma squared): 778.22, (sigma: 27.897)
## Nagelkerke pseudo-R-squared: 0.23722
## Number of observations: 71353
## Number of parameters estimated: 15
## AIC: 88686, (AIC for lm: 90714)
## Hausman test: 67.692, df: 13, p-value: 2.1293e-09
summary(spatialreg::impacts(SDEMw))
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## log1p(vacancy_rate) 0.29917270 0.409225875 0.70839857
## log1p(old_rate) 0.40182015 0.002197637 0.40401779
## log1p(black_rate) 0.48896909 0.002704431 0.49167353
## log1p(hisp_rate) 0.61656157 -0.063153582 0.55340799
## log1p(group_pop) 0.02144233 0.018603237 0.04004557
## log1p(dens) 0.01417781 0.025228403 0.03940622
## ========================================================
## Standard errors:
## Direct Indirect Total
## log1p(vacancy_rate) 0.040343204 0.053095951 0.042489080
## log1p(old_rate) 0.004358399 0.006972192 0.007509624
## log1p(black_rate) 0.026174793 0.030969533 0.016508152
## log1p(hisp_rate) 0.030942095 0.035399521 0.016387297
## log1p(group_pop) 0.000761965 0.001632879 0.001739686
## log1p(dens) 0.003887738 0.004678957 0.002686349
## ========================================================
## Z-values:
## Direct Indirect Total
## log1p(vacancy_rate) 7.415690 7.70728964 16.67249
## log1p(old_rate) 92.194431 0.31520029 53.80000
## log1p(black_rate) 18.680915 0.08732554 29.78368
## log1p(hisp_rate) 19.926303 -1.78402361 33.77055
## log1p(group_pop) 28.140837 11.39290899 23.01884
## log1p(dens) 3.646803 5.39188581 14.66906
##
## p-values:
## Direct Indirect Total
## log1p(vacancy_rate) 1.2101e-13 1.2879e-14 < 2.22e-16
## log1p(old_rate) < 2.22e-16 0.75261 < 2.22e-16
## log1p(black_rate) < 2.22e-16 0.93041 < 2.22e-16
## log1p(hisp_rate) < 2.22e-16 0.07442 < 2.22e-16
## log1p(group_pop) < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens) 0.00026552 6.9722e-08 < 2.22e-16
Finally, using the development version of sphet:
library(sphet)
##
## Attaching package: 'sphet'
## The following object is masked from 'package:spatialreg':
##
## impacts
system.time(SEM_GMM <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=FALSE))
## user system elapsed
## 1.933 0.148 2.102
summary(SEM_GMM)
##
## Generalized stsls
##
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"),
## model = "error", het = FALSE)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.15222 -0.29469 0.01032 -0.00021 0.30086 3.00213
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -6.24575367 0.03720516 -167.873 < 2.2e-16 ***
## log1p(vacancy_rate) 0.35157200 0.02843557 12.364 < 2.2e-16 ***
## log1p(old_rate) 0.41042586 0.00419882 97.748 < 2.2e-16 ***
## log1p(black_rate) 0.54682738 0.01352155 40.441 < 2.2e-16 ***
## log1p(hisp_rate) 0.59271015 0.01514450 39.137 < 2.2e-16 ***
## log1p(group_pop) 0.02040826 0.00080207 25.444 < 2.2e-16 ***
## log1p(dens) 0.03112745 0.00222134 14.013 < 2.2e-16 ***
## rho 0.26259335 0.00477093 55.040 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
with its Durbin variant:
system.time(SDEM_GMM <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=FALSE, Durbin=TRUE))
## user system elapsed
## 2.228 0.173 2.428
summary(SDEM_GMM)
##
## Generalized stsls
##
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"),
## model = "error", het = FALSE, Durbin = TRUE)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.14258 -0.29470 0.00949 -0.00075 0.29963 2.94460
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -6.13268960 0.06558835 -93.5027 < 2.2e-16 ***
## log1p(vacancy_rate) 0.22111422 0.03740353 5.9116 3.388e-09 ***
## log1p(old_rate) 0.40241917 0.00432891 92.9608 < 2.2e-16 ***
## log1p(black_rate) 0.50792821 0.02557387 19.8612 < 2.2e-16 ***
## log1p(hisp_rate) 0.62901587 0.03212850 19.5781 < 2.2e-16 ***
## log1p(group_pop) 0.01983835 0.00080826 24.5445 < 2.2e-16 ***
## log1p(dens) 0.00826926 0.00398581 2.0747 0.03802 *
## lag_log1p(vacancy_rate) 0.34382296 0.05251777 6.5468 5.879e-11 ***
## lag_log1p(old_rate) -0.01195826 0.00688147 -1.7377 0.08226 .
## lag_log1p(black_rate) 0.03012242 0.03048615 0.9881 0.32312
## lag_log1p(hisp_rate) -0.06631076 0.03690239 -1.7969 0.07235 .
## lag_log1p(group_pop) 0.01435628 0.00167892 8.5509 < 2.2e-16 ***
## lag_log1p(dens) 0.03521147 0.00476425 7.3908 1.460e-13 ***
## rho 0.26161661 0.00476485 54.9055 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This also lets us adjust the coefficient standard errors for heterskedasticity:
system.time(SEM_GMMh <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=TRUE))
## user system elapsed
## 2.978 0.117 3.134
summary(SEM_GMMh)
##
## Generalized stsls
##
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"),
## model = "error", het = TRUE)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.15225 -0.29467 0.01030 -0.00021 0.30090 3.00219
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -6.24579581 0.03996340 -156.288 < 2.2e-16 ***
## log1p(vacancy_rate) 0.35123442 0.03135236 11.203 < 2.2e-16 ***
## log1p(old_rate) 0.41044090 0.00456026 90.004 < 2.2e-16 ***
## log1p(black_rate) 0.54681845 0.01429321 38.257 < 2.2e-16 ***
## log1p(hisp_rate) 0.59276338 0.01559615 38.007 < 2.2e-16 ***
## log1p(group_pop) 0.02040097 0.00083032 24.570 < 2.2e-16 ***
## log1p(dens) 0.03109688 0.00232960 13.349 < 2.2e-16 ***
## rho 0.27593286 0.00488909 56.438 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
system.time(SDEM_GMMh <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=TRUE, Durbin=TRUE))
## user system elapsed
## 3.019 0.162 3.215
summary(SDEM_GMMh)
##
## Generalized stsls
##
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"),
## model = "error", het = TRUE, Durbin = TRUE)
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.14259 -0.29469 0.00947 -0.00076 0.29964 2.94460
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -6.13295558 0.06908374 -88.7757 < 2.2e-16 ***
## log1p(vacancy_rate) 0.22118334 0.04042524 5.4714 4.465e-08 ***
## log1p(old_rate) 0.40242167 0.00468697 85.8597 < 2.2e-16 ***
## log1p(black_rate) 0.50798180 0.02743648 18.5148 < 2.2e-16 ***
## log1p(hisp_rate) 0.62911182 0.03381343 18.6054 < 2.2e-16 ***
## log1p(group_pop) 0.01983960 0.00083555 23.7443 < 2.2e-16 ***
## log1p(dens) 0.00825998 0.00410720 2.0111 0.04432 *
## lag_log1p(vacancy_rate) 0.34365182 0.05502973 6.2448 4.242e-10 ***
## lag_log1p(old_rate) -0.01192960 0.00729841 -1.6345 0.10214
## lag_log1p(black_rate) 0.03000758 0.03268209 0.9182 0.35853
## lag_log1p(hisp_rate) -0.06644231 0.03895247 -1.7057 0.08806 .
## lag_log1p(group_pop) 0.01436527 0.00171910 8.3563 < 2.2e-16 ***
## lag_log1p(dens) 0.03521675 0.00491840 7.1602 8.056e-13 ***
## rho 0.27495975 0.00488766 56.2560 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sessionInfo()
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 34 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R412-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R412-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] sphet_1.11 coda_0.19-4 spatialreg_1.1-9 Matrix_1.3-4
## [5] spData_2.0.1 sf_1.0-4
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.0 tidyr_1.1.4 jsonlite_1.7.2 splines_4.1.2
## [5] gtools_3.9.2 bslib_0.3.1 assertthat_0.2.1 expm_0.999-6
## [9] highr_0.9 sp_1.4-6 yaml_2.2.1 LearnBayes_2.15.1
## [13] pillar_1.6.4 backports_1.3.0 lattice_0.20-45 glue_1.4.2
## [17] digest_0.6.28 htmltools_0.5.2 spDataLarge_0.5.4 pkgconfig_2.0.3
## [21] broom_0.7.10 raster_3.5-2 gmodels_2.18.1 s2_1.0.7
## [25] mvtnorm_1.1-3 purrr_0.3.4 gdata_2.18.0 RSpectra_0.16-0
## [29] terra_1.4-16 tibble_3.1.5 proxy_0.4-26 generics_0.1.1
## [33] ellipsis_0.3.2 cli_3.1.0 magrittr_2.0.1 crayon_1.4.2
## [37] deldir_1.0-6 evaluate_0.14 fansi_0.5.0 nlme_3.1-153
## [41] MASS_7.3-54 class_7.3-19 tools_4.1.2 lifecycle_1.0.1
## [45] matrixStats_0.61.0 stringr_1.4.0 compiler_4.1.2 jquerylib_0.1.4
## [49] e1071_1.7-9 rlang_0.4.12 classInt_0.4-3 units_0.7-2
## [53] grid_4.1.2 rstudioapi_0.13 rmarkdown_2.11 boot_1.3-28
## [57] wk_0.5.0 codetools_0.2-18 DBI_1.1.1 R6_2.5.1
## [61] zoo_1.8-9 knitr_1.36 dplyr_1.0.7 fastmap_1.1.0
## [65] utf8_1.2.2 spdep_1.1-12 KernSmooth_2.23-20 stringi_1.7.5
## [69] Rcpp_1.0.7 vctrs_0.3.8 tidyselect_1.1.1 xfun_0.27
## [73] lmtest_0.9-38