Copyright

All the material presented here, to the extent it is original, is available under CC-BY-SA.

Required current contributed CRAN packages:

I am running R 4.1.0, with recent update.packages().

needed <- c("sphet", "coda", "spatialreg", "spData", "spdep", "sf")

Script

Script and data at https://github.com/rsbivand/UAM21_II/raw/main/UAM21_II_210614.zip. Download to suitable location, unzip and use as basis.

Seminar introduction

Schedule

Time	Topic
Monday 7/6
09.00-12.00	How may we integrate spatial data from different sources? How may we aggregate spatial data? Which data structures are helpful in handling spatial data?
13.00-16.00	Observations of spatial data are related to each other either by distance, or by a graph of edges linking observations seen as being neighbours. How may they be constructed? How may we address the issues raised by the probable presence of spatial autocorrelation in the spatial data that we are using?
Tuesday 8/6
09.00-12.00	How can we measure global spatial autocorrelation?
13.00-16.00	How can we measure local spatial autocorrelation?
Monday 14/6
09.00-12.00	How may we fit regression models to spatial data in the presence of spatial autocorrelation? Maximum likelihood GMM and Bayesian estimation, case weights.
13.00-16.00	How should we interpret the coefficients or impacts of spatial regression models? How may we predict from spatial regression models?
Tuesday 15/6
09.00-12.00	Multi-level models: at which level in the data may we fit spatial processes?
13.00-16.00	Spatial filtering, hierarchical GLM, GAM, etc., spatial epidemiological approaches
Monday 21/6
09.00-12.00	Presentations/consultations/discussion
13.00-16.00	Presentations/consultations/discussion

Introduction

Beyond software functionality, the existence of communities of interest becomes important
These extend the choices available, and bring in points of view across applied scientific domains and in modern computation, for example interworking across C++, Python and Julia with R
For example, the HSAR package [Dong and Harris (2015); dongetal:15] uses C++ extensively, rather than R sparse matrix packages
The model being introduced into INLA for non-MCMC Bayesian inference is another example of cooperation with other applied statisticians
The R packages spatialreg, sphet (Roger S. Bivand and Piras 2015) and for spatial panel models splm (Millo and Piras 2012) provide implementations of many model fitting functions for Gaussian dependent variables
These also use shared methods in spdep for computing impacts, and for MC inference on impacts
There are a range of fitting functions for non-Gaussian dependent variables in McSpatial, spatialprobit (Wilhelm and Matos 2013) and ProbitSpatial (Martinetti and Geniaux 2017), and these are also available using the ´slm´ model being introduced into INLA
Work is continuing on providing Bayesian inference functions, through MCMC and otherwise; the HSAR package is an example

Spatial Regression

Even though it may be tempting to focus on interpreting the map pattern of an area support response variable of interest, the pattern may largely derive from covariates (and their functional forms), as well as the respective spatial footprints of the variables in play. Spatial autoregressive models in two dimensions began without covariates and with clear links to time series (Whittle 1954). Extensions included tests for spatial autocorrelation in linear model residuals, and models applying the autoregressive component to the response or the residuals, where the latter matched the tests for residuals (A. Cliff and Ord 1972; A. D. Cliff and Ord 1973). These “lattice” models of areal data typically express the dependence between observations using a graph of neighbours in the form of a contiguity matrix.

A division has grown up, possibly unhelpfully, between scientific fields using conditional autoregressive (CAR) models (Besag 1974), and simultaneous autoregressive models (SAR) (Ord 1975; Hepple 1976). Although CAR and SAR models are closely related, these fields have found it difficult to share experience of applying similar models, often despite referring to key work summarising the models (Ripley 1981, 1988; Cressie 1993). Ripley gives the SAR variance as (1981, 89):

\[ {\rm Var}_S = \sigma^ 2(I-\lambda W_S)^{-1} (I-\lambda W_S^{\rm T})^{-1} \] where \(\lambda\) is a spatial autocorrelation parameter and \(W_S\) is a nonsingular matrix that represents spatial dependence. The CAR variance is:

\[ {\rm Var}_C = \sigma^ 2(I-\lambda W_C)^{-1} \] where and \(W_C\) is a symmetric and strictly positive definite matrix

More recent books expounding the theoretical bases for modelling with areal data simply point out the similarities in relevant chapters (Gaetan and Guyon 2010; Lieshout 2019); the interested reader is invited to consult these sources for background information and examples using the functions described below.

Of course, handling a spatial correlation structure in a generalised least squares model or a (generalised) linear or nonlinear mixed effects model such as those provided in the nlme and many other packages does not have to use a graph of neighbours (Pinheiro and Bates 2000). These models are also spatial regression models, using functions of the distance between observations, and fitted variograms to model the spatial autocorrelation present; such models have been held to yield a clearer picture of the underlying processes (Wall 2004), building on geostatistics. For example, the glmmTMB package successfully uses this approach to spatial regression (Brooks et al. 2017). Here we will only consider spatial regression using spatial weights, chiefly as implemented in the spatialreg package recently split out of the spdep package which had grown unnecessarily large, covering too many aspects of spatial dependence.

Recent development: (R. Bivand, Millo, and Piras 2021)

Spatial regression with spatial weights

Spatial autoregression models using spatial weights matrices were described in some detail using maximum likelihood estimation some time ago (A. D. Cliff and Ord 1973, 1981). A family of models were elaborated in spatial econometric terms extending earlier work, and in many cases using the simultaneous autoregressive framework and row standardization of spatial weights (Anselin 1988). The simultaneous and conditional autoregressive frameworks can be compared, and both can be supplemented using case weights to reflect the relative importance of different observations (Waller and Gotway 2004).

Here we shall use the Boston housing data set, which has been restructured and furnished with census tract boundaries (R. Bivand 2017). The original data set used 506 census tracts and a hedonic model to try to estimate willingness to pay for clean air. The response was constructed from counts of ordinal answers to a 1970 census question about house value; the response is left and right censored in the census source. The key covariate was created from a calibrated meteorological model showing the annual nitrogen oxides (NOX) level for a smaller number of model output zones. The numbers of houses responding also varies by tract and model output zone. There are several other covariates, some measured at the tract level, some by town only, where towns broadly correspond to the air pollution model output zones.

library(sf)

## Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1

library(spatialreg)

## Loading required package: spData

## Loading required package: Matrix

boston_506 <- st_read(system.file("shapes/boston_tracts.shp", package="spData")[1])

## Reading layer `boston_tracts' from data source 
##   `/home/rsb/lib/r_libs/spData/shapes/boston_tracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 506 features and 36 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS:  NAD27

nb_q <- spdep::poly2nb(boston_506)
lw_q <- spdep::nb2listw(nb_q, style="W")

We can start by reading in the 506 tract data set from spData, and creating a contiguity neighbour object and from that again a row standardized spatial weights object. If we examine the median house values, we find that they have been assigned as missing values, and that 17 tracts are affected.

table(boston_506$censored)

## 
##  left    no right 
##     2   489    15

summary(boston_506$median)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    5600   16800   21000   21749   24700   50000      17

Next, we can subset to the remaining 489 tracts with non-censored house values, and the neighbour object to match. The neighbour object now has one observation with no neighbours.

boston_506$CHAS <- as.factor(boston_506$CHAS)
boston_489 <- boston_506[!is.na(boston_506$median),]
nb_q_489 <- spdep::poly2nb(boston_489)
lw_q_489 <- spdep::nb2listw(nb_q_489, style="W", zero.policy=TRUE)

The NOX_ID variable specifies the upper level aggregation, letting us aggregate the tracts to air pollution model output zones. We can create aggregate neighbour and row standardized spatial weights objects, and aggregate the NOX variable taking means, and the CHAS Charles River dummy variable for observations on the river.

agg_96 <- list(as.character(boston_506$NOX_ID))
boston_96 <- aggregate(boston_506[, "NOX_ID"], by=agg_96, unique)
nb_q_96 <- spdep::poly2nb(boston_96)
lw_q_96 <- spdep::nb2listw(nb_q_96)
boston_96$NOX <- aggregate(boston_506$NOX, agg_96, mean)$x
boston_96$CHAS <- aggregate(as.integer(boston_506$CHAS)-1, agg_96, max)$x

The response is aggregated using the weightedMedian() function in matrixStats, and midpoint values for the house value classes. Counts of houses by value class were punched to check the published census values, which can be replicated using weightedMedian() at the tract level. Here we find two output zones with calculated weighted medians over the upper census question limit of USD 50,000, and remove them subsequently as they also are affected by not knowing the appropriate value to insert for the top class by value.

nms <- names(boston_506)
ccounts <- 23:31
for (nm in nms[c(22, ccounts, 36)]) {
  boston_96[[nm]] <- aggregate(boston_506[[nm]], agg_96, sum)$x
}
br2 <- c(3.50,  6.25,  8.75, 12.50, 17.50, 22.50, 30.00, 42.50, 60.00)*1000
counts <- as.data.frame(boston_96)[, nms[ccounts]]
f <- function(x) matrixStats::weightedMedian(x=br2, w=x, interpolate=TRUE)
boston_96$median <- apply(counts, 1, f)
is.na(boston_96$median) <- boston_96$median > 50000
summary(boston_96$median)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    9009   20417   23523   25263   30073   49496       2

Before subsetting, we aggregate the remaining covariates by weighted mean using the tract population counts punched from the census (R. Bivand 2017).

boston_94 <- boston_96[!is.na(boston_96$median),]
nb_q_94 <- spdep::subset.nb(nb_q_96, !is.na(boston_96$median))
lw_q_94 <- spdep::nb2listw(nb_q_94, style="W")

We now have two data sets each at the lower, census tract level and the upper, air pollution model output zone level, one including the censored observations, the other excluding them.

The original model related the log of median house values by tract to the square of NOX values, including other covariates usually related to house value by tract, such as aggregate room counts, aggregate age, ethnicity, social status, distance to downtown and to the nearest radial road, a crime rate, and town-level variables reflecting land use (zoning, industry), taxation and education (R. Bivand 2017). This structure will be used here to exercise issues raised in fitting spatial regression models, including the presence of multiple levels.

form <- formula(log(median) ~ CRIM + ZN + INDUS + CHAS + I((NOX*10)^2) + I(RM^2) + 
                  AGE + log(DIS) + log(RAD) + TAX + PTRATIO + I(BB/100) + 
                  log(I(LSTAT/100)))

Before moving to presentations of issues raised in fitting spatial regression models, it is worth making a few further points. A recent review of spatial regression in a spatial econometrics setting is given by Kelejian and Piras (2017); note that their usage is to call the spatial coefficient of the lagged response \(\lambda\) and that of the lagged residuals \(\rho\), the reverse of other usage (Anselin 1988; James P. LeSage and Pace 2009); here we use \(\rho_{\mathrm{Lag}}\) for the spatial coefficient in the spatial lag model, and \(\rho_{\mathrm{Err}}\) for the spatial error model. One interesting finding is that relatively dense spatial weights matrices may downweight model estimates, suggesting that sparser weights are preferable (Tony E. Smith 2009). Another useful finding is that the presence of residual spatial autocorrelation need not bias the estimates of variance of regression coefficients, provided that the covariates themselves do not exhibit spatial autocorrelation (T. E. Smith and Lee 2012). In general, however, the footprints of the spatial processes of the response and covariates may not be aligned, and if covariates and the residual are autocorrelated, it is likely that the estimates of variance of regression coefficients will be biassed downwards if attempts are not made to model the spatial processes.

In trying to model these spatial processes, we may choose to model the spatial autocorrelation in the residual with a spatial error model (SEM).

\[ {\mathbf y} = {\mathbf X}{\mathbf \beta} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \] where \({\mathbf y}\) is an \((N \times 1)\) vector of observations on a response variable taken at each of \(N\) locations, \({\mathbf X}\) is an \((N \times k)\) matrix of covariates, \({\mathbf \beta}\) is a \((k \times 1)\) vector of parameters, \({\mathbf u}\) is an \((N \times 1)\) spatially autocorrelated disturbance vector, \({\mathbf \varepsilon}\) is an \((N \times 1)\) vector of independent and identically distributed disturbances and \(\rho_{\mathrm{Err}}\) is a scalar spatial parameter.

If the processes in the covariates and the response match, we should find little difference between the coefficients of a least squares and a SEM, but very often they diverge, suggesting that a Hausman test for this condition should be employed (R. Pace and LeSage 2008). This may be related to earlier discussions of a spatial equivalent to the unit root and cointegration where spatial processes match (Fingleton 1999).

A model with a spatial process in the response only is termed a spatial lag model (SLM, often SAR - spatial autoregressive) (James P. LeSage and Pace 2009).

\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf \varepsilon}, \] where \(\rho_{\mathrm{Lag}}\) is a scalar spatial parameter.

Work reviewed by Mur and Angulo (2006) on the Durbin model; the Durbin model adds the spatially lagged covariates to the covariates included in the spatial lag model, giving a spatial Durbin model (SDM) with different processes in the response and covariates:

\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma} + {\mathbf \varepsilon}, \] where \({\mathbf \gamma}\) is a \((k' \times 1)\) vector of parameters. \(k'\) defines the subset of the intercept and covariates, often \(k' = k-1\) when using row standardised spatial weights and omitting the spatially lagged intercept.

This permits the spatial processes to be viewed and tested for as a Common Factor (Burridge 1981; R. S. Bivand 1984). The inclusion of spatially lagged covariates lets us check whether the same spatial process is manifest in the response and the covariates (SEM), whether they are different processes, or whether no process is detected. The Common Factor is present when \({\mathbf \gamma} = - \rho_{\mathrm{Lag}} {\mathbf \beta}\):

\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} - \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf X} {\mathbf \beta} + {\mathbf \varepsilon}, \qquad ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}){\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}) {\mathbf X}{\mathbf \beta} + {\mathbf \varepsilon}, \] where \({\mathbf I}\) is the \(N \times N\) identity matrix, and \({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}\) is the Common Factor:

\[ \qquad {\mathbf y} = {\mathbf X}{\mathbf \beta} + ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} {\mathbf \varepsilon}, \qquad {\mathbf y} = {\mathbf X}{\mathbf \beta} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \]

If we extend this family with processes in the covariates and the residual, we get a spatial error Durbin model (SDEM). If it is chosen to admit a spatial process in the residuals in addition to a spatial process in the response, again two models are formed, a general nested model (GNM) nesting all the others, and a model without spatially lagged covariates (SAC, also known as SARAR - Spatial AutoRegressive-AutoRegressive model). If neither the residuals nor the residual are modelled with spatial processes, spatially lagged covariates may be added to a linear model, as a spatially lagged X model (SLX) (Elhorst 2010; Roger S. Bivand 2012; J. P. LeSage 2014; Halleck Vega and Elhorst 2015).

We can write the general nested model (GNM) as:

\[ {\mathbf y} = \rho_{\mathrm{Lag}} {\mathbf W}{\mathbf y} + {\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma} + {\mathbf u}, \qquad {\mathbf u} = \rho_{\mathrm{Err}} {\mathbf W} {\mathbf u} + {\mathbf \varepsilon}, \]

This may be constrained to the double spatial coefficient model SAC/SARAR by setting \({\mathbf \gamma} = 0\), to the spatial Durbin (SDM) by setting \(\rho_{\mathrm{Err}} = 0\), and to the error Durbin model (SDEM) by setting \(\rho_{\mathrm{Lag}} = 0\). Imposing more conditions gives the spatial lag model (SLM) with \({\mathbf \gamma} = 0\) and \(\rho_{\mathrm{Err}} = 0\), the spatial error model (SEM) with \({\mathbf \gamma} = 0\) and \(\rho_{\mathrm{Lag}} = 0\), and the spatially lagged X model (SLX) with \(\rho_{\mathrm{Lag}} = 0\) and \(\rho_{\mathrm{Err}} = 0\).

Although making predictions for new locations for which covariates are observed was raised as an issue some time ago, it has many years to make progress in reviewing the possibilities (R. S. Bivand 2002; Goulard, Laurent, and Thomas-Agnan 2017). The prediction methods for SLM, SDM, SEM, SDEM, SAC and GNM models fitted with maximum likelihood were contributed as a Google Summer of Coding project by Martin Gubri. This work, and work on similar models with missing data (Suesse 2018) is also relevant for exploring censored median house values in the Boston data set. Work on prediction also exposed the importance of the reduced form of these models, in which the spatial process in the response interacts with the regression coefficients in the SLM, SDM, SAC and GNM models.

The consequence of these interactions is that a unit change in a covariate will only impact the response as the value of the regression coefficient if the spatial coefficient of the lagged response is zero. Where it is non-zero, global spillovers, impacts, come into play, and these impacts should be reported rather than the regression coefficients (James P. LeSage and Pace 2009; Elhorst 2010; Roger S. Bivand 2012; J. P. LeSage 2014; Halleck Vega and Elhorst 2015). Local impacts may be reported for SDEM and SLX models, using linear combination to calculate standard errors for the total impacts of each covariate (sums of coefficients on the covariates and their spatial lags).

This can be seen from the GNM data generation process:

\[ ({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W}){\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})({\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma}) + {\mathbf \varepsilon}, \] re-writing:

\[ {\mathbf y} = ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1}({\mathbf X}{\mathbf \beta} + {\mathbf W}{\mathbf X}{\mathbf \gamma}) + ({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1}({\mathbf I} - \rho_{\mathrm{Err}} {\mathbf W})^{-1}{\mathbf \varepsilon}. \] There is interaction between the \(\rho_{\mathrm{Lag}}\) and \({\mathbf \beta}\) (and \({\mathbf \gamma}\) if present) coefficients. This can be seen from the partial derivatives: \(\partial y_i / \partial x_{jr} = (({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} ({\mathbf I} \beta_r + {\mathbf W} \gamma_r))_{ij}\). This dense matrix \(S_r({\mathbf W}) = (({\mathbf I} - \rho_{\mathrm{Lag}} {\mathbf W})^{-1} ({\mathbf I} \beta_r + {\mathbf W} \gamma_r))\) expresses the direct impacts (effects) on its principal diagonal, and indirect impacts in off-diagonal elements.

Current work in the spatialreg package is focused on refining the handling of spatially lagged covariates using a consistent Durbin= argument taking either a logical value or a formula giving the subset of covariates to add in spatially lagged form. There is a speculation that some covariates, for example some dummy variables, should not be added in spatially lagged form. This then extends to handling these included spatially lagged covariates appropriately in calculating impacts. This work applies to cross-sectional models fitted using MCMC or maximum likelihood, and will offer facilities to spatial panel models.

It is worth mentioning the almost unexplored issues of functional form assumptions, for which flexible structures are useful, including spatial quantile regression presented in the McSpatial package (McMillen 2013). There are further issues with discrete response variables, covered by some functions in McSpatial, and in the spatialprobit and ProbitSpatial packages (Wilhelm and Matos 2013; Martinetti and Geniaux 2017); the MCMC implementations of the former are based on LeSage and Pace (2009). Finally, Wagner and Zeileis (2019) show how an SLM model may be used in the setting of recursive partitioning, with an implementation using spatialreg::lagsarlm() in the lagsarlmtree package.

Estimators

The review of cross-sectional maximum likelihood and generalized method of moments (GMM) estimators in spatialreg and sphet for spatial econometrics style spatial regression models by Bivand and Piras (2015) is still largely valid. In the review, estimators in these R packages were compared with alternative implementations available in other programming languages elsewhere. The review did not cover Bayesian spatial econometrics style spatial regression. More has changed with respect to spatial panel estimators described in Millo and Piras (2012), but will not be covered here.

Maximum likelihood

For models with single spatial coefficients (SEM and SDEM using errorsarlm(), SLM and SDM using lagsarlm()), the methods initially described by Ord (1975) are used. The following table shows the functions that can be used to estimate the models described above using maximum likelihood.

model	model name	maximum likelihood estimation function
SEM	spatial error	`errorsarlm(..., Durbin=FALSE, ...)`
SEM	spatial error	`spautolm(..., family="SAR", ...)`
SDEM	spatial Durbin error	`errorsarlm(..., Durbin=TRUE, ...)`
SLM	spatial lag	`lagsarlm(..., Durbin=FALSE, ...)`
SDM	spatial Durbin	`lagsarlm(..., Durbin=TRUE, ...)`
SAC	spatial autoregressive combined	`sacsarlm(..., Durbin=FALSE, ...)`
GNM	general nested	`sacsarlm(..., Durbin=TRUE, ...)`

The estimating functions errorsarlm() and lagsarlm() take similar arguments, where the first two, formula= and data= are shared by most model estimating functions. The third argument is a listw spatial weights object, while na.action= behaves as in other model estimating functions if the spatial weights can reasonably be subsetted to avoid observations with missing values. The weights= argument may be used to provide weights indicating the known degree of per-observation variability in the variance term - this is not available for lagsarlm().

args(errorsarlm)

## function (formula, data = list(), listw, na.action, weights = NULL, 
##     Durbin, etype, method = "eigen", quiet = NULL, zero.policy = NULL, 
##     interval = NULL, tol.solve = .Machine$double.eps, trs = NULL, 
##     control = list()) 
## NULL

args(lagsarlm)

## function (formula, data = list(), listw, na.action, Durbin, type, 
##     method = "eigen", quiet = NULL, zero.policy = NULL, interval = NULL, 
##     tol.solve = .Machine$double.eps, trs = NULL, control = list()) 
## NULL

The Durbin= argument replaces the earlier type= and etype= arguments, and if not given is taken as FALSE. If given, it may be FALSE, TRUE in which case all spatially lagged covariates are included, or a one-sided formula specifying which spatially lagged covariates should be included. The method= argument gives the method for calculating the log determinant term in the log likelihood function, and defaults to "eigen", suitable for moderately sized data sets. The interval= argument gives the bounds of the domain for the line search using stats::optimize() used for finding the spatial coefficient. The tol.solve() argument, passed through to base::solve(), was needed to handle data sets with differing numerical scales among the coefficients which hindered inversion of the variance-covariance matrix; the default value in base::solve() used to be much larger. The control= argument takes a list of control values to permit more careful adjustment of the running of the estimation function.

The spautolm() function also fits spatial regressions with the spatial process in the residuals, and takes a possibly poorly named family= argument, taking the values of "SAR" for the simultaneous autoregressive model (like errorsarlm()), "CAR" for the conditional autoregressive model, and "SMA" for the spatial moving average model.

args(spautolm)

## function (formula, data = list(), listw, weights, na.action, 
##     family = "SAR", method = "eigen", verbose = NULL, trs = NULL, 
##     interval = NULL, zero.policy = NULL, tol.solve = .Machine$double.eps, 
##     llprof = NULL, control = list()) 
## NULL

The sacsarlm() function may take second spatial weights and interval arguments if the spatial weights used to model the two spatial processes in the SAC and GNM specifications differ. By default, the same spatial weights are used. By default, stats::nlminb() is used for numerical optimization, using a heuristic to choose starting values.

args(sacsarlm)

## function (formula, data = list(), listw, listw2 = NULL, na.action, 
##     Durbin, type, method = "eigen", quiet = NULL, zero.policy = NULL, 
##     tol.solve = .Machine$double.eps, llprof = NULL, interval1 = NULL, 
##     interval2 = NULL, trs1 = NULL, trs2 = NULL, control = list()) 
## NULL

Where larger data sets are used, a numerical Hessian approach is used to calculate the variance-covariance matrix of coefficients, rather than an analytical asymptotic approach.

Apart from spautolm() which returns an "spautolm" object, the model fitting functions return "Sarlm" objects. Standard methods for fitted models are provided, such as summary():

args(getS3method("summary", "Sarlm"))

## function (object, correlation = FALSE, Nagelkerke = FALSE, Hausman = FALSE, 
##     adj.se = FALSE, ...) 
## NULL

The Nagelkerke= argument permits the return of a value approximately corresponding to a coefficient of determination, although the summary method anyway provides the value of stats::AIC() because a stats::logLik() method is provided for "sarlm" and "spautolm" objects. If the "sarlm" object is a SEM or SDEM, the Hausman test may be performed by setting Hausman=TRUE to see whether the regression coefficients are sufficiently like least squares coefficients, indicating absence of mis-specification from that source. As an example, we may fit SEM and SDEM to the 94 and 489 observation Boston data sets, and present the Hausman test results:

eigs_489 <- eigenw(lw_q_489)
SDEM_489 <- errorsarlm(form, data=boston_489, listw=lw_q_489, Durbin=TRUE, 
                       zero.policy=TRUE, control=list(pre_eig=eigs_489))
SEM_489 <- errorsarlm(form, data=boston_489, listw=lw_q_489, 
                      zero.policy=TRUE, control=list(pre_eig=eigs_489))
cbind(data.frame(model=c("SEM", "SDEM")), 
      rbind(broom::tidy(Hausman.test(SEM_489)), 
            broom::tidy(Hausman.test(SDEM_489))))[,1:4]

##   model statistic      p.value parameter
## 1   SEM   51.9862 2.827192e-06        14
## 2  SDEM   48.6551 6.481654e-03        27

Here we are using the control= list argument to pass through pre-computed eigenvalues for the default "eigen" method. Both test results for the 489 tract data set suggest that the regression coefficients do differ, perhaps that the footprints of the spatial processes do not match. Likelihood ratio tests of the spatial models against their least squares equivalents show that spatial process(es) are present, but we find that neither SEM not SDM are adequate representations.

cbind(data.frame(model=c("SEM", "SDEM")), 
      rbind(broom::tidy(LR1.Sarlm(SEM_489)), 
            broom::tidy(LR1.Sarlm(SDEM_489))))[,c(1, 4:6)]

##   model statistic p.value parameter
## 1   SEM  198.4133       0         1
## 2  SDEM  159.3798       0         1

For the 94 air pollution model output zones, the Hausman tests find little difference between coefficients, but this is related to the fact that the SEM and SDEM models add little to least squares or SLX using likelihood ratio tests.

eigs_94 <- eigenw(lw_q_94)
SDEM_94 <- errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE,
                      control=list(pre_eig=eigs_94))
SEM_94 <- errorsarlm(form, data=boston_94, listw=lw_q_94, control=list(pre_eig=eigs_94))
cbind(data.frame(model=c("SEM", "SDEM")), 
      rbind(broom::tidy(Hausman.test(SEM_94)), 
            broom::tidy(Hausman.test(SDEM_94))))[, 1:4]

##   model statistic   p.value parameter
## 1   SEM 15.657083 0.3347612        14
## 2  SDEM  9.205152 0.9994394        27

cbind(data.frame(model=c("SEM", "SDEM")), rbind(broom::tidy(LR1.Sarlm(SEM_94)), broom::tidy(LR1.Sarlm(SDEM_94))))[,c(1, 4:6)]

##   model statistic   p.value parameter
## 1   SEM 2.5933581 0.1073126         1
## 2  SDEM 0.2158487 0.6422213         1

We can use spatialreg::LR.sarlm() to apply a likelihood ratio test between nested models, but here choose lmtest::lrtest(), which gives the same results, preferring models including spatially lagged covariates.

broom::tidy(lmtest::lrtest(SEM_489, SDEM_489))

## Warning in tidy.anova(lmtest::lrtest(SEM_489, SDEM_489)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik

## # A tibble: 2 x 5
##    X.Df LogLik    df statistic   p.value
##   <dbl>  <dbl> <dbl>     <dbl>     <dbl>
## 1    16   273.    NA      NA   NA       
## 2    29   311.    13      74.4  1.23e-10

broom::tidy(lmtest::lrtest(SEM_94, SDEM_94))

## Warning in tidy.anova(lmtest::lrtest(SEM_94, SDEM_94)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik

## # A tibble: 2 x 5
##    X.Df LogLik    df statistic    p.value
##   <dbl>  <dbl> <dbl>     <dbl>      <dbl>
## 1    16   59.7    NA      NA   NA        
## 2    29   81.3    13      43.2  0.0000421

The SLX model is fitted using least squares, and also returns a log likelihood value, letting us test whether we need a spatial process in the residuals.

args(lmSLX)

## function (formula, data = list(), listw, na.action, weights = NULL, 
##     Durbin = TRUE, zero.policy = NULL) 
## NULL

In the tract data set we obviously do:

SLX_489 <- lmSLX(form, data=boston_489, listw=lw_q_489, zero.policy=TRUE)
broom::tidy(lmtest::lrtest(SLX_489, SDEM_489))

## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"

## Warning in tidy.anova(lmtest::lrtest(SLX_489, SDEM_489)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik

## # A tibble: 2 x 5
##    X.Df LogLik    df statistic   p.value
##   <dbl>  <dbl> <dbl>     <dbl>     <dbl>
## 1    28   231.    NA       NA  NA       
## 2    29   311.     1      159.  1.55e-36

but in the output zone case we do not.

SLX_94 <- lmSLX(form, data=boston_94, listw=lw_q_94)
broom::tidy(lmtest::lrtest(SLX_94, SDEM_94))

## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"

## Warning in tidy.anova(lmtest::lrtest(SLX_94, SDEM_94)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik

## # A tibble: 2 x 5
##    X.Df LogLik    df statistic p.value
##   <dbl>  <dbl> <dbl>     <dbl>   <dbl>
## 1    28   81.2    NA    NA      NA    
## 2    29   81.3     1     0.216   0.642

This outcome is sustained also when we use the counts of house units by tract and output zones as weights:

SLX_94w <- lmSLX(form, data=boston_94, listw=lw_q_94, weights=units)
SDEM_94w <- errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE, weights=units,
                       control=list(pre_eig=eigs_94))
broom::tidy(lmtest::lrtest(SLX_94w, SDEM_94w))

## Warning in modelUpdate(objects[[i - 1]], objects[[i]]): original model was of
## class "SlX", updated model is of class "Sarlm"

## Warning in tidy.anova(lmtest::lrtest(SLX_94w, SDEM_94w)): The following column
## names in ANOVA output were not recognized or transformed: X.Df, LogLik

## # A tibble: 2 x 5
##    X.Df LogLik    df statistic p.value
##   <dbl>  <dbl> <dbl>     <dbl>   <dbl>
## 1    28   97.5    NA    NA      NA    
## 2    29   98.0     1     0.917   0.338

Generalized method of moments

The estimation methods used for fitting SLM and the spatially lagged response part of SARAR models are based on instrumental variables, using the spatially lagged covariates (usually including lags of lags too) as instruments for the spatially lagged response (Piras 2010). This, and the use of spatially lagged covariates in the moment conditions for models including a spatial process in the residuals, means that including the spatially lagged covariates in the initial model is challenging, and functions in the sphet package do not provide a Durbin= argument. This makes it harder to accommodate data with multiple spatial process footprints. However, as Kelejian and Piras show in their recent review (2017), these approaches have other benefits, such as being able to instrument variables suspected of suffering from endogeneity or measurement error. Let us first compare the ML and GMM estimates of the SEM regression coefficients for rescaled squared NOX values. We use the sphet::spreg() wrapper function, and fit a SEM model. Extracting the matching rows from the summary objects for these models, we can see that the values are not dissimilar, despite the difference in estimation methods.

SEM1_94 <- sphet::spreg(form, data=boston_94, listw=lw_q_94, model="error")
res <- rbind(summary(SEM_94)$Coef["I((NOX * 10)^2)",], 
             summary(SEM1_94)$CoefTable["I((NOX * 10)^2)",])
rownames(res) <- c("ML", "GMM")
res

##         Estimate Std. Error   z value     Pr(>|z|)
## ML  -0.009564967  0.0026099 -3.664879 0.0002474555
## GMM -0.009744770  0.0026113 -3.731770 0.0001901391

Using sphet::spreg(), we can instrument the rescaled squared NOX variable, dropping it first from the formula, next creating the rescaled squared NOX variable as a column in the "sf" object, extracting a matrix of coordinates from the centroids of the output zones, and creating a one-sided instrument formula from a second order polynomial in the coordinates (here improperly, as they are not projected) and mean distances to downtown and radial roads. The endog= argument takes a one-sided formula for the variables to be modelled in the first stage of the model. Had we re-run the original air pollution model many times under slightly varying scenarios, we could have used an ensemble of NOX loadings to yield its distribution by output zone. Because this is not possible, we assume that the measurement error can be captured by using selected instruments. Unfortunately, the NOX regression coefficient estimate from the second stage has fallen substantially in absolute size, although the sign is unchanged.

formiv <- update(form, . ~ . - I((NOX*10)^2))
boston_94$NOX2 <- (boston_94$NOX*10)^2
suppressWarnings(ccoords <- st_coordinates(st_centroid(st_geometry(boston_94))))
iform <- formula(~poly(ccoords, degree=2) + DIS + RAD)
SEM1_94iv <- sphet::spreg(formiv, data=boston_94, listw=lw_q_94, endog = ~NOX2, 
                   instruments=iform, model="error")
summary(SEM1_94iv)$CoefTable["NOX2",]

##     Estimate   Std. Error      t-value     Pr(>|t|) 
## -0.001972107  0.005664617 -0.348144820  0.727731426

Handling measurement error in this or similar ways is one of the benefits of GMM estimation methods, although here the choice of instruments was somewhat arbitrary.

Markov chain Monte Carlo

The Spatial Econometrics Library is part of the extensive Matlab code repository at https://www.spatial-econometrics.com/ and documented in LeSage and Pace (2009). The Google Summer of Coding project in 2011 by Abhirup Mallik mentored by Virgilio Gómez-Rubio yielded translations of some of the model fitting functions for SEM, SDEM, SLM, SDM, SAC and GNM from the Matlab code. These have now been added to spatialreg as spBreg_err(), spBreg_lag() and spBreg_sac() with Durbin= arguments to handle the inclusion of spatially lagged covariates. As yet, heteroskedastic disturbances are not accommodated. The functions return "mcmc" objects as specified in the coda package, permitting the use of tools from coda for handling model output. The default burnin is 500 draws, followed by default 2000 draws returned, but these values and many others may be set through the control= list argument. Fitting the SDEM model for the output zones takes about an order of magnitude longer than using ML, but there is more work to do subsequently, and this difference scales more in the number of samples than covariates or observations.

system.time(SDEM_94B <- spBreg_err(form, data=boston_94, listw=lw_q_94, Durbin=TRUE))

##    user  system elapsed 
##   1.750   0.010   1.767

system.time(SDEM_489B <- spBreg_err(form, data=boston_489, listw=lw_q_489, Durbin=TRUE, zero.policy=TRUE))

##    user  system elapsed 
##   2.467   0.001   2.477

Most time in the ML case using eigenvalues is taken by log determinant setup and optimization, and by dense matrix asymptotic standard errors if chosen (always chosen with default "eigen" log determinant method):

t(errorsarlm(form, data=boston_94, listw=lw_q_94, Durbin=TRUE)$timings[,2])

##      set_up eigen_set_up eigen_opt coefs eigen_hcov eigen_se
## [1,]  0.004        0.004     0.007 0.003      0.002    0.001

t(errorsarlm(form, data=boston_489, listw=lw_q_489, Durbin=TRUE, zero.policy=TRUE)$timings[,2])

##      set_up eigen_set_up eigen_opt coefs eigen_hcov eigen_se
## [1,]  0.007        0.052     0.025 0.006      0.096    0.187

while in the MCMC case, the default use of Spatial Econometric Toolbox gridded log determinants and obviously sampling takes most time:

t(attr(SDEM_94B, "timings")[ , 3])

##      set_up SE_classic_set_up complete_setup sampling finalise
## [1,]  0.003              0.25           0.03    1.427    0.056

t(attr(SDEM_489B, "timings")[ , 3])

##      set_up SE_classic_set_up complete_setup sampling finalise
## [1,]  0.005             0.414           0.07    1.935    0.053

However, as we will see shortly, inference from model impacts may need sampling (where the spatially lagged response is included in the model), and in the case of MCMC models, the samples have already been drawn.

Handling the log determinant term

We will briefly touch on selected implementation details that applied users of spatial regression models would be wise to review. The handling of the log determinant term applies to all such users, while impacts are restricted to those employing spatial econometrics style models either including the spatially lagged response or including spatially lagged covariates.

It has been known for over twenty years that the sparse matrix representation of spatial weights overcomes the difficulties of fitting models with larger numbers of observations using maximum likelihood and MCMC where the log determinant term comes into play (R. K. Pace and Barry 1997a, 1997c, 1997d, 1997b). During the development of these approaches in model fitting functions in spatialreg, use was first made of C code also used in the S-PLUS SpatialStats module (Kaluzny et al. 1998), then SparseM which used a compressed sparse row form very similar to "nb" and "listw" objects. This was followed by the use of spam and Matrix methods, both of which mainly use compressed sparse column representations. Details are provided in Bivand, Hauke and Kossowski (2013).

The domain of the spatial coefficient(s) is given by the interval= argument to model fitting functions, and returned in the fitted object:

SEM_94$interval

## [1] -1.527257  1.000000

This case is trivial, because the upper bound is unity by definition, because of the use of row standardization. The interval is the inverse of the range of the eigenvalues of the weights matrix:

1/range(eigs_94)

## [1] -1.527257  1.000000

Finding the interval within which to search for the spatial coefficient is trivial for smaller data sets, but more complex for larger ones. It is possible to use heuristics implemented in lextrW() (Griffith, Bivand, and Chun 2015):

1/c(lextrW(lw_q_94))

##  lambda_n  lambda_1 
## -1.527257  1.000000

or RSpectra::eigs() after coercion to a Matrix package compressed sparse column representation:

W <- as(lw_q_94, "CsparseMatrix")
1/Re(c(RSpectra::eigs(W, k=1, which="SR")$values, 
       RSpectra::eigs(W, k=1, which="LR")$values))

## [1] -1.527257  1.000000

Why are we extracting the real part of the values returned by eigs()? Since nb_q_94 is symmetric, the row-standardized object lw_q_94 is symmetric by similarity, a result known since Ord (1975); consequently, we can take the real part without concern. Had the underlying neighbour relationships not been symmetric, we should be more careful.

The baseline log determinant term as given by Ord (1975) for a coefficient value proposed in sampling or during numerical optimization; this extract matches the "eigen" method (with or without control=list(pre_eig=...)"):

coef <- 0.5
sum(log(1 - coef * eigs_94))

## [1] -2.867292

Using sparse matrix functions from Matrix, the LU decomposition can be used for asymmetric matrices; this extract matches the "LU" method:

I <- Diagonal(nrow(boston_94))
LU <- lu(I - coef * W)
dU <- abs(diag(slot(LU, "U")))
sum(log(dU))

## [1] -2.867292

and Cholesky decomposition for symmetric matrices, with similar.listw() used to handle asymmetric weights that are similar to symmetric. The default value od super allows the underlying Matrix code to choose between supernodal or simplicial decomposition; this extract matches the "Matrix_J" method:

W <- as(similar.listw(lw_q_94), "CsparseMatrix")
super <- as.logical(NA)
cch <- Cholesky((I - coef * W), super=super)
c(2 * determinant(cch, logarithm = TRUE)$modulus)

## [1] -2.867292

The "Matrix" and "spam_update" methods are to be preferred as they pre-compute the fill-reducing permutation of the decomposition since the weights do not change for different values of the coefficient.

Maximum likelihood model fitting functions in spatialreg and splm use jacobianSetup() to populate env= environment with intermediate objects needed to find log determinants during optimization. Passing environments to objective functions is efficient because they are passed by reference rather than value. The con= argument passes through the populated control list, containing default values unless the key-value pairs were given in the function call (pre_eig= is extracted separately). The which= argument is 1 by default, but will also take 2 in SAC and GNM models. HSAR uses mcdet_setup() to set up Monte Carlo approximation terms.

args(jacobianSetup)

## function (method, env, con, pre_eig = NULL, trs = NULL, interval = NULL, 
##     which = 1) 
## NULL

For each value of coef, the do_ldet() function returns the log determinant, using the values stored in environment env=:

args(do_ldet)

## function (coef, env, which = 1) 
## NULL

As yet the Bayesian models are limited to control argument ldet_method="SE_classic" at present, using "LU" to generate a coarse grid of control argument nrho=200L log determinant values in the interval, spline interpolated to a finer grid of length control argument interpn=2000L, from which griddy Gibbs samples are drawn. It is hoped to add facilities to choose alternative methods in the future. This would offer possibilities to move beyond griddy Gibbs, but using gridded log determinant values seems reasonable at present.

Impacts

Global impacts have been seen as crucial for reporting results from fitting models including the spatially lagged response (SLM, SDM, SAC. GNM) for over ten years (James P. LeSage and Pace 2009). Extension to other models including spatially lagged covariates (SLX, SDEM) has followed (Elhorst 2010; Roger S. Bivand 2012; Halleck Vega and Elhorst 2015). In order to infer from the impacts, linear combination may be used for SLX and SDEM models. For SLM, SDM, SAC and GNM models fitted with maximum likelihood or GMM, the variance-covariance matrix of the coefficients is available, and can be used to make random draws from a multivariate Normal distribution with mean set to coefficient values and variance to the estimated variance-covariance matrix. For these models fitted using Bayesian methods, draws are already available. In the SDEM case, the draws on the regression coefficients of the unlagged covariates represent direct impacts, and draws on the coefficients of the spatially lagged covariates represent indirect impacts, and their by-draw sums the total impacts.

Impacts are calculated using model object class specific impacts() methods, here taking the method for "sarlm" objects as an example. In the sphet package, the impacts method for "gstsls" uses the spatialreg impacts() framework, as does the splm package for "splm" fitted model objects. impacts() methods require either a tr= - a vector of traces of the power series of the weights object typically computed with trW() or a listw= argument. If listw= is given, dense matrix methods are used. The evalues= argument is experimental, does not yet work for all model types, and takes the eigenvalues of the weights matrix. The R= argument gives the number of samples to be taken from the fitted model. The Q= permits the decomposition of impacts to components of the power series of the weights matrix (James P. LeSage and Pace 2009).

args(getS3method("impacts", "Sarlm"))

## function (obj, ..., tr = NULL, R = NULL, listw = NULL, evalues = NULL, 
##     useHESS = NULL, tol = 1e-06, empirical = FALSE, Q = NULL) 
## NULL

The summary method for the output of impacts() methods where inference from samples was requested by default uses the summary() method for "mcmc" objects defined in the coda package. It can instead report just matrices of standard errors, z-values and p-values by setting zstats= and short= to TRUE.

args(getS3method("summary", "LagImpact"))

## function (object, ..., zstats = FALSE, short = FALSE, reportQ = NULL) 
## NULL

Since sampling is not required for inference for SLX and SDEM models, linear combination is used for models fitted using maximum likelihood; results are shown here for the air pollution variable only. The literature has not yet resolved the question of how to report model output, as each covariate is now represented by three impacts. Where spatially lagged covariates are included, two coefficients are replaced by three impacts.

sum_imp_94_SDEM <- summary(impacts(SDEM_94))
rbind(Impacts=sum_imp_94_SDEM$mat[5,], SE=sum_imp_94_SDEM$semat[5,])

##               Direct     Indirect        Total
## Impacts -0.012764045 -0.018454532 -0.031218577
## SE       0.002354762  0.004717734  0.005303962

The impacts from the same model fitted by MCMC are very similar:

sum_imp_94_SDEM_B <- summary(impacts(SDEM_94B))
rbind(Impacts=sum_imp_94_SDEM_B$mat[5,], SE=sum_imp_94_SDEM_B$semat[5,])

##               Direct    Indirect        Total
## Impacts -0.012653812 -0.01740874 -0.030062550
## SE       0.002485255  0.00575439  0.006774467

as also are those from the SLX model. In the SLX and SDEM models, the direct impacts are the consequences for the response of changes in air pollution in the same observational entity, and the indirect (local) impacts are the consequences for the response of changes in air pollution in neighbouring observational entities.

sum_imp_94_SLX <- summary(impacts(SLX_94))
rbind(Impacts=sum_imp_94_SLX$mat[5,], SE=sum_imp_94_SLX$semat[5,])

##               Direct     Indirect        Total
## Impacts -0.012766473 -0.018743765 -0.031510238
## SE       0.002796966  0.005564633  0.006114058

In contrast to local indirect impacts in SLX and SDEM models, global indirect impacts are found in models including the spatially lagged response. For purposes of exposition, let us fit an SLM:

SLM_489 <- lagsarlm(form, data=boston_489, listw=lw_q_489, zero.policy=TRUE)

Traces of the first m= matrices of the power series in the spatial weights are pre-computed to speed up inference from samples from the fitted model, and from existing MCMC samples (James P. LeSage and Pace 2009). The traces can also be used in other contexts too, so their pre-computation may be worthwhile anyway. The type= argument is "mult" by default, but may also be set to "MC" for Monte Carlo simulation or "moments" using a space-saving looping algorithm.

args(trW)

## function (W = NULL, m = 30, p = 16, type = "mult", listw = NULL, 
##     momentsSymmetry = TRUE) 
## NULL

W <- as(lw_q_489, "CsparseMatrix")
tr_489 <- trW(W)
str(tr_489)

##  num [1:30] 0 90.8 29.4 42.1 26.5 ...
##  - attr(*, "timings")= Named num [1:2] 0.164 0.169
##   ..- attr(*, "names")= chr [1:2] "user.self" "elapsed"
##  - attr(*, "type")= chr "mult"
##  - attr(*, "n")= int 489

In this case, the spatial process in the response is not strong, so the global indirect impacts (here for the air pollution variable) are weak.

SLM_489_imp <- impacts(SLM_489, tr=tr_489, R=2000)
SLM_489_imp_sum <- summary(SLM_489_imp, short=TRUE, zstats=TRUE)
res <- rbind(Impacts=sapply(SLM_489_imp$res, "[", 5), SE=SLM_489_imp_sum$semat[5,])
colnames(res) <- c("Direct", "Indirect", "Total")
res

##               Direct      Indirect        Total
## Impacts -0.005930731 -1.014744e-05 -0.005940879
## SE       0.001073926  1.013427e-04  0.001082105

Of more interest is trying to reconstruct the direct and total impacts using dense matrix methods; the direct global impacts are the mean of the diagonal of the dense impacts matrix, and the total global impacts are the sum of all matrix elements divided by the number of observations. The direct impacts agree, but the total impacts differ slightly.

coef_SLM_489 <- coef(SLM_489)
IrW <- Diagonal(489) - coef_SLM_489[1] * W
S_W <- solve(IrW)
S_NOX_W <- S_W %*% (diag(489) * coef_SLM_489[7])
c(Direct=mean(diag(S_NOX_W)), Total=sum(S_NOX_W)/489)

##       Direct        Total 
## -0.005930731 -0.005940858

This bare-bones approach corresponds to using the listw= argument, and as expected gives the same output.

sapply(impacts(SLM_489, listw=lw_q_489), "[", 5)

##        direct      indirect         total 
## -5.930731e-03 -1.014744e-05 -5.940879e-03

The experimental evalues= approach which is known to be numerically exact by definition gives the same results as the matrix power series trace approach, so the slight difference may be attributed to the consequences of inverting the spatial process matrix.

sapply(impacts(SLM_489, evalues=eigs_489), "[", 5)

##        direct      indirect         total 
## -5.930731e-03 -1.014744e-05 -5.940879e-03

Impacts are crucial to the interpretation of Durbin models including spatially lagged covariates and models including the spatially lagged response. Tools to calculate impacts and their inferential bases are now available, and should be employed, but as yet some implementation details are under development and ways of presenting results in tabular form have not reached maturity.

Predictions

We will use the predict() method for "sarlm" objects to double-check impacts, here for the pupil-teacher ratio (PTRATIO). The method was re-written by Martin Gubri based on Goulard, Laurent and Thomas-Agnan (2017). The pred.type= argument specifies the prediction strategy among those presented in the article.

args(getS3method("predict", "sarlm"))

## function (object, newdata = NULL, listw = NULL, pred.type = "TS", 
##     all.data = FALSE, zero.policy = NULL, legacy = TRUE, legacy.mixed = FALSE, 
##     power = NULL, order = 250, tol = .Machine$double.eps^(3/5), 
##     spChk = NULL, ...) 
## NULL

First we’ll increment PTRATIO by one to show that, using least squares, the mean difference between predictions from the incremented new data and fitted values is equal to the regression coefficient.

nd_489 <- boston_489
nd_489$PTRATIO <- nd_489$PTRATIO + 1
OLS_489 <- lm(form, data=boston_489)
fitted <- predict(OLS_489)
nd_fitted <- predict(OLS_489, newdata=nd_489)
all.equal(unname(coef(OLS_489)[12]), mean(nd_fitted - fitted))

## [1] TRUE

In models including the spatially lagged response, and when the spatial coefficient in different from zero, this is not the case in general, and is why we need impacts() methods. The difference here is not great, but neither is it zero, and needs to be handled.

fitted <- predict(SLM_489)

## This method assumes the response is known - see manual page

nd_fitted <- predict(SLM_489, newdata=nd_489, listw=lw_q_489, pred.type="TS",
                     zero.policy=TRUE)
all.equal(unname(coef_SLM_489[13]), mean(nd_fitted - fitted))

## [1] "Mean relative difference: 0.001783081"

In the Boston tracts data set, 17 observations of median house values, the response, are censored. Using these as an example and comparing some pred.type= variants for the SDEM model and predicting out-of-sample, we can see that there are differences, suggesting that this is a fruitful area for study. There have been a number of alternative proposals for handling missing variables (Gómez-Rubio, Bivand, and Rue 2015; Suesse 2018). Another reason for increasing attention on prediction is that it is fundamental for machine learning approaches, in which prediction for validation and test data sets drives model specification choice. The choice of training and other data sets with dependent spatial data remains an open question, and is certainly not as simple as with independent data.

Here, we’ll list the predictions for the censored tract observations using three different prediction types, taking the exponent to get back to the USD median house values.

nd <- boston_506[is.na(boston_506$median),]
t0 <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="TS", zero.policy=TRUE))
suppressWarnings(t1  <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="KP2",
                                    zero.policy=TRUE)))
suppressWarnings(t2  <- exp(predict(SDEM_489, newdata=nd, listw=lw_q, pred.type="KP5",
                                    zero.policy=TRUE)))
data.frame(fit_TS=t0[,1], fit_KP2=c(t1), fit_KP5=c(t2),
           censored=boston_506$censored[as.integer(attr(t0, "region.id"))])

##        fit_TS   fit_KP2   fit_KP5 censored
## 13  23912.450 29477.240 28147.151    right
## 14  28125.588 27001.192 28516.159    right
## 15  30553.380 36184.007 32476.053    right
## 17  18518.448 19620.801 18878.244    right
## 43   9563.613  6816.721  7561.353     left
## 50   8371.200  7196.445  7383.139     left
## 312 51477.038 53300.877 54172.605    right
## 313 45920.630 45823.263 47094.833    right
## 314 44195.663 44585.510 45361.371    right
## 317 43427.407 45706.682 45442.461    right
## 337 39878.563 42071.842 41126.925    right
## 346 44708.480 46693.655 46107.827    right
## 355 48187.620 49067.627 48910.651    right
## 376 42881.289 45883.193 44965.674    right
## 408 44293.862 44615.008 45669.977    right
## 418 38210.776 43374.808 41913.830    right
## 434 41647.159 41690.222 42397.968    right

The ACS US census tracts data set

Let us read in again the ACS data set with over 70,000 observations:

library(sf)
df_tracts <- st_read("df_tracts.gpkg")

## Reading layer `df_tracts' from data source 
##   `/home/rsb/und/uam21/units_2/df_tracts.gpkg' using driver `GPKG'
## Simple feature collection with 71353 features and 28 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83

set.ZeroPolicyOption(TRUE)

## [1] FALSE

spdep::set.ZeroPolicyOption(TRUE)

## [1] FALSE

nb_subset <- readRDS("nb_subset.rds")
lw <- spdep::nb2listw(nb_subset, style="W")

In the original study, logarithms of the median income CV were thought to be driven by a slightly broader set of covariates. Here we take logs of the vacancy rate, the elderly rate, the Black and Hispanic rates, the population density, and group quarters population (institutional like prisons, non-institutional like student accommodation https://www.census.gov/newsroom/releases/archives/2010_census/cb11-tps13.html).

tr_form <- log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) + log1p(hisp_rate) + log1p(group_pop) + log1p(dens)

The available covariates are not the same as in the original article, but our focus here is to show that larger data sets can be used in analysis:

lm_mod <- lm(tr_form, data=df_tracts)
summary(lm_mod)

## 
## Call:
## lm(formula = tr_form, data = df_tracts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.14822 -0.29445  0.01038  0.30068  2.99263 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.2284926  0.0355078 -175.41   <2e-16 ***
## log1p(vacancy_rate)  0.4039100  0.0248345   16.26   <2e-16 ***
## log1p(old_rate)      0.4069179  0.0040408  100.70   <2e-16 ***
## log1p(black_rate)    0.5478924  0.0110139   49.75   <2e-16 ***
## log1p(hisp_rate)     0.5854893  0.0121825   48.06   <2e-16 ***
## log1p(group_pop)     0.0216907  0.0007901   27.45   <2e-16 ***
## log1p(dens)          0.0354277  0.0018286   19.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4526 on 71346 degrees of freedom
## Multiple R-squared:  0.2296, Adjusted R-squared:  0.2295 
## F-statistic:  3543 on 6 and 71346 DF,  p-value: < 2.2e-16

The least squares residuals show strong positive spatial autocorrelation:

spdep::lm.morantest(lm_mod, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = tr_form, data = df_tracts)
## weights: lw
## 
## Moran I statistic standard deviate = 50.637, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.117328e-01    -6.347763e-05     4.874291e-06

Adding the spatially lagged covariates in SLX model form might make sense:

SLX <- lmSLX(tr_form, data=df_tracts, listw=lw, zero.policy=TRUE)
summary(SLX)

## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.14224 -0.29424  0.01015  0.30057  2.94089 
## 
## Coefficients:
##                          Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -6.091034   0.054824 -111.102  < 2e-16 ***
## log1p.vacancy_rate.      0.204452   0.039726    5.147 2.66e-07 ***
## log1p.old_rate.          0.402543   0.004422   91.026  < 2e-16 ***
## log1p.black_rate.        0.493457   0.027068   18.230  < 2e-16 ***
## log1p.hisp_rate.         0.607203   0.033922   17.900  < 2e-16 ***
## log1p.group_pop.         0.019748   0.000834   23.678  < 2e-16 ***
## log1p.dens.              0.010350   0.004203    2.463  0.01380 *  
## lag.log1p.vacancy_rate.  0.374812   0.050936    7.359 1.88e-13 ***
## lag.log1p.old_rate.     -0.016907   0.006436   -2.627  0.00861 ** 
## lag.log1p.black_rate.    0.053985   0.030672    1.760  0.07840 .  
## lag.log1p.hisp_rate.    -0.039247   0.037392   -1.050  0.29390    
## lag.log1p.group_pop.     0.012803   0.001529    8.374  < 2e-16 ***
## lag.log1p.dens.          0.033688   0.004742    7.105 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4521 on 71340 degrees of freedom
## Multiple R-squared:  0.2315, Adjusted R-squared:  0.2314 
## F-statistic:  1791 on 12 and 71340 DF,  p-value: < 2.2e-16

The level of residual spatial autocorrelation is effectively unchanged, perhaps suggesting that the spatial footprint of the autocorrelation process(es) may not be well-represented by the chosen definition of tract neighbours:

spdep::lm.morantest(SLX, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## Moran I statistic standard deviate = 50.563, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.115597e-01    -7.287566e-05     4.874367e-06

The linear combinations of unlagged and lagged covariates give these impacts, many pulling the same way:

summary(spatialreg::impacts(SLX))

## Impact measures (SlX, estimable, n-k):
##                         Direct    Indirect      Total
## log1p(vacancy_rate) 0.20445207  0.37481174 0.57926381
## log1p(old_rate)     0.40254315 -0.01690728 0.38563587
## log1p(black_rate)   0.49345676  0.05398505 0.54744181
## log1p(hisp_rate)    0.60720345 -0.03924693 0.56795651
## log1p(group_pop)    0.01974814  0.01280282 0.03255096
## log1p(dens)         0.01035030  0.03368823 0.04403853
## ========================================================
## Standard errors:
##                           Direct    Indirect       Total
## log1p(vacancy_rate) 0.0397260202 0.050935722 0.032647455
## log1p(old_rate)     0.0044223032 0.006435572 0.006346559
## log1p(black_rate)   0.0270685196 0.030671611 0.012643255
## log1p(hisp_rate)    0.0339216058 0.037391531 0.013710963
## log1p(group_pop)    0.0008340256 0.001528895 0.001507331
## log1p(dens)         0.0042030102 0.004741785 0.002155113
## ========================================================
## Z-values:
##                        Direct  Indirect    Total
## log1p(vacancy_rate)  5.146553  7.358524 17.74300
## log1p(old_rate)     91.025678 -2.627160 60.76298
## log1p(black_rate)   18.229913  1.760098 43.29912
## log1p(hisp_rate)    17.900198 -1.049621 41.42353
## log1p(group_pop)    23.678098  8.373907 21.59510
## log1p(dens)          2.462592  7.104546 20.43444
## 
## p-values:
##                     Direct     Indirect   Total     
## log1p(vacancy_rate) 2.6532e-07 1.8585e-13 < 2.22e-16
## log1p(old_rate)     < 2.22e-16 0.0086101  < 2.22e-16
## log1p(black_rate)   < 2.22e-16 0.0783911  < 2.22e-16
## log1p(hisp_rate)    < 2.22e-16 0.2938925  < 2.22e-16
## log1p(group_pop)    < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens)         0.013794   1.2073e-12 < 2.22e-16

summary(df_tracts$tot_pop)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     501    2925    4028    4283    5341   37452

quantile(df_tracts$group_pop, seq(0, 1, 0.1))

##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##     0     0     0     0     2     7    16    40    91   187 19496

plot(df_tracts$group_pop, df_tracts$group_pop/df_tracts$tot_pop)

plot(df_tracts$tot_pop, df_tracts$group_pop/df_tracts$tot_pop)

Weighted least squares use the range of the weights to assign tracts with larger populations less residual variance, but here the group quarters populations play in too:

lm_modw <- lm(tr_form, data=df_tracts, weights=tot_pop)
summary(lm_modw)

## 
## Call:
## lm(formula = tr_form, data = df_tracts, weights = tot_pop)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -165.025  -18.290    0.634   18.582  206.188 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.2252852  0.0349852 -177.94   <2e-16 ***
## log1p(vacancy_rate)  0.5567749  0.0272130   20.46   <2e-16 ***
## log1p(old_rate)      0.4053828  0.0040674   99.67   <2e-16 ***
## log1p(black_rate)    0.5058137  0.0113856   44.43   <2e-16 ***
## log1p(hisp_rate)     0.5708457  0.0114385   49.91   <2e-16 ***
## log1p(group_pop)     0.0239794  0.0007418   32.33   <2e-16 ***
## log1p(dens)          0.0336080  0.0017827   18.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.5 on 71346 degrees of freedom
## Multiple R-squared:  0.2204, Adjusted R-squared:  0.2203 
## F-statistic:  3361 on 6 and 71346 DF,  p-value: < 2.2e-16

spdep::lm.morantest(lm_modw, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = tr_form, data = df_tracts, weights = tot_pop)
## weights: lw
## 
## Moran I statistic standard deviate = 49.407, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.090160e-01    -6.281631e-05     4.874287e-06

We can try the same with other models, such as SLX, but again with little change:

SLXw <- lmSLX(tr_form, weights=tot_pop, data=df_tracts, listw=lw, zero.policy=TRUE)
summary(SLXw)

## 
## Call:
## lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1], collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Weighted Residuals:
##      Min       1Q   Median       3Q      Max 
## -164.762  -18.250    0.628   18.602  212.170 
## 
## Coefficients:
##                           Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)             -6.2010197  0.0542376 -114.331  < 2e-16 ***
## log1p.vacancy_rate.      0.2900525  0.0426211    6.805 1.02e-11 ***
## log1p.old_rate.          0.3983487  0.0044799   88.919  < 2e-16 ***
## log1p.black_rate.        0.4647722  0.0276407   16.815  < 2e-16 ***
## log1p.hisp_rate.         0.5933632  0.0325690   18.219  < 2e-16 ***
## log1p.group_pop.         0.0212821  0.0007877   27.019  < 2e-16 ***
## log1p.dens.              0.0168140  0.0040944    4.107 4.02e-05 ***
## lag.log1p.vacancy_rate.  0.4325027  0.0518089    8.348  < 2e-16 ***
## lag.log1p.old_rate.     -0.0019333  0.0065644   -0.295    0.768    
## lag.log1p.black_rate.    0.0407380  0.0311804    1.307    0.191    
## lag.log1p.hisp_rate.    -0.0310037  0.0358891   -0.864    0.388    
## lag.log1p.group_pop.     0.0164220  0.0014939   10.993  < 2e-16 ***
## lag.log1p.dens.          0.0231145  0.0046590    4.961 7.02e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28.45 on 71340 degrees of freedom
## Multiple R-squared:  0.2228, Adjusted R-squared:  0.2226 
## F-statistic:  1704 on 12 and 71340 DF,  p-value: < 2.2e-16

spdep::lm.morantest(SLXw, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = formula(paste("y ~ ", paste(colnames(x)[-1],
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## Moran I statistic standard deviate = 49.483, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.091760e-01    -7.215477e-05     4.874344e-06

summary(spatialreg::impacts(SLXw))

## Impact measures (SlX, estimable, n-k):
##                         Direct     Indirect      Total
## log1p(vacancy_rate) 0.29005246  0.432502675 0.72255514
## log1p(old_rate)     0.39834869 -0.001933313 0.39641538
## log1p(black_rate)   0.46477224  0.040737989 0.50551022
## log1p(hisp_rate)    0.59336317 -0.031003701 0.56235946
## log1p(group_pop)    0.02128214  0.016422044 0.03770419
## log1p(dens)         0.01681396  0.023114529 0.03992849
## ========================================================
## Standard errors:
##                           Direct    Indirect       Total
## log1p(vacancy_rate) 0.0426210645 0.051808861 0.033980108
## log1p(old_rate)     0.0044798951 0.006564424 0.006325152
## log1p(black_rate)   0.0276406965 0.031180367 0.012944562
## log1p(hisp_rate)    0.0325690346 0.035889146 0.012845410
## log1p(group_pop)    0.0007876827 0.001493860 0.001457527
## log1p(dens)         0.0040943629 0.004658979 0.002114653
## ========================================================
## Z-values:
##                        Direct   Indirect    Total
## log1p(vacancy_rate)  6.805378  8.3480444 21.26406
## log1p(old_rate)     88.919201 -0.2945138 62.67286
## log1p(black_rate)   16.814780  1.3065269 39.05194
## log1p(hisp_rate)    18.218629 -0.8638740 43.77902
## log1p(group_pop)    27.018677 10.9930292 25.86861
## log1p(dens)          4.106612  4.9612865 18.88181
## 
## p-values:
##                     Direct     Indirect   Total     
## log1p(vacancy_rate) 1.0078e-11 < 2.22e-16 < 2.22e-16
## log1p(old_rate)     < 2.22e-16 0.76837    < 2.22e-16
## log1p(black_rate)   < 2.22e-16 0.19137    < 2.22e-16
## log1p(hisp_rate)    < 2.22e-16 0.38766    < 2.22e-16
## log1p(group_pop)    < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens)         4.0151e-05 7.0028e-07 < 2.22e-16

We can also interact the ESRI reliability classes with the other covariates:

df_tracts$mi_cv_esri <- cut(df_tracts$med_inc_cv, c(0, 0.12, 0.40, Inf), labels=c("High", "Medium", "Low"), right=TRUE, include.lowest=TRUE, ordered_result=TRUE)
ctr_form <- update(tr_form, . ~ mi_cv_esri/(. -1))
lm_modc <- lm(ctr_form, data=df_tracts)
summary(lm_modc)

## 
## Call:
## lm(formula = ctr_form, data = df_tracts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.85057 -0.18982  0.01306  0.22555  1.61825 
## 
## Coefficients:
##                                        Estimate Std. Error  t value Pr(>|t|)
## mi_cv_esriHigh                       -3.9650108  0.0310620 -127.648  < 2e-16
## mi_cv_esriMedium                     -3.6373216  0.0443373  -82.037  < 2e-16
## mi_cv_esriLow                        -1.4641573  0.3551794   -4.122 3.76e-05
## mi_cv_esriHigh:log1p(vacancy_rate)    0.1618894  0.0211417    7.657 1.92e-14
## mi_cv_esriMedium:log1p(vacancy_rate)  0.1666862  0.0281628    5.919 3.26e-09
## mi_cv_esriLow:log1p(vacancy_rate)     0.0241005  0.1577771    0.153  0.87860
## mi_cv_esriHigh:log1p(old_rate)        0.1402792  0.0035544   39.466  < 2e-16
## mi_cv_esriMedium:log1p(old_rate)      0.1919496  0.0048860   39.286  < 2e-16
## mi_cv_esriLow:log1p(old_rate)         0.0759502  0.0359904    2.110  0.03484
## mi_cv_esriHigh:log1p(black_rate)      0.2260342  0.0104547   21.620  < 2e-16
## mi_cv_esriMedium:log1p(black_rate)    0.1641366  0.0111635   14.703  < 2e-16
## mi_cv_esriLow:log1p(black_rate)       0.0313052  0.0677407    0.462  0.64399
## mi_cv_esriHigh:log1p(hisp_rate)       0.2724036  0.0109594   24.856  < 2e-16
## mi_cv_esriMedium:log1p(hisp_rate)     0.2057883  0.0130134   15.814  < 2e-16
## mi_cv_esriLow:log1p(hisp_rate)        0.0323785  0.0910110    0.356  0.72202
## mi_cv_esriHigh:log1p(group_pop)       0.0111474  0.0006592   16.911  < 2e-16
## mi_cv_esriMedium:log1p(group_pop)     0.0097060  0.0009280   10.459  < 2e-16
## mi_cv_esriLow:log1p(group_pop)        0.0171415  0.0060395    2.838  0.00454
## mi_cv_esriHigh:log1p(dens)            0.0219977  0.0016064   13.694  < 2e-16
## mi_cv_esriMedium:log1p(dens)          0.0066425  0.0019684    3.375  0.00074
## mi_cv_esriLow:log1p(dens)            -0.0055058  0.0127856   -0.431  0.66674
##                                         
## mi_cv_esriHigh                       ***
## mi_cv_esriMedium                     ***
## mi_cv_esriLow                        ***
## mi_cv_esriHigh:log1p(vacancy_rate)   ***
## mi_cv_esriMedium:log1p(vacancy_rate) ***
## mi_cv_esriLow:log1p(vacancy_rate)       
## mi_cv_esriHigh:log1p(old_rate)       ***
## mi_cv_esriMedium:log1p(old_rate)     ***
## mi_cv_esriLow:log1p(old_rate)        *  
## mi_cv_esriHigh:log1p(black_rate)     ***
## mi_cv_esriMedium:log1p(black_rate)   ***
## mi_cv_esriLow:log1p(black_rate)         
## mi_cv_esriHigh:log1p(hisp_rate)      ***
## mi_cv_esriMedium:log1p(hisp_rate)    ***
## mi_cv_esriLow:log1p(hisp_rate)          
## mi_cv_esriHigh:log1p(group_pop)      ***
## mi_cv_esriMedium:log1p(group_pop)    ***
## mi_cv_esriLow:log1p(group_pop)       ** 
## mi_cv_esriHigh:log1p(dens)           ***
## mi_cv_esriMedium:log1p(dens)         ***
## mi_cv_esriLow:log1p(dens)               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3054 on 71332 degrees of freedom
## Multiple R-squared:  0.9837, Adjusted R-squared:  0.9837 
## F-statistic: 2.054e+05 on 21 and 71332 DF,  p-value: < 2.2e-16

There is still a lot of residual autocorrelation:

spdep::lm.morantest(lm_modc, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = ctr_form, data = df_tracts)
## weights: lw
## 
## Moran I statistic standard deviate = 26.685, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     5.883956e-02    -7.266655e-05     4.873925e-06

anova(lm_modc)

## Analysis of Variance Table
## 
## Response: log(med_inc_cv)
##                                   Df Sum Sq Mean Sq    F value    Pr(>F)    
## mi_cv_esri                         3 401888  133963 1.4359e+06 < 2.2e-16 ***
## mi_cv_esri:log1p(vacancy_rate)     3     53      18 1.9014e+02 < 2.2e-16 ***
## mi_cv_esri:log1p(old_rate)         3    245      82 8.7423e+02 < 2.2e-16 ***
## mi_cv_esri:log1p(black_rate)       3     94      31 3.3653e+02 < 2.2e-16 ***
## mi_cv_esri:log1p(hisp_rate)        3    141      47 5.0366e+02 < 2.2e-16 ***
## mi_cv_esri:log1p(group_pop)        3     37      12 1.3377e+02 < 2.2e-16 ***
## mi_cv_esri:log1p(dens)             3     19       6 6.6365e+01 < 2.2e-16 ***
## Residuals                      71332   6655       0                         
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Conditioning on ESRI reliability yields an improvement in fit:

anova(lm_mod, lm_modc)

## Analysis of Variance Table
## 
## Model 1: log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) + 
##     log1p(hisp_rate) + log1p(group_pop) + log1p(dens)
## Model 2: log(med_inc_cv) ~ mi_cv_esri + mi_cv_esri:log1p(vacancy_rate) + 
##     mi_cv_esri:log1p(old_rate) + mi_cv_esri:log1p(black_rate) + 
##     mi_cv_esri:log1p(hisp_rate) + mi_cv_esri:log1p(group_pop) + 
##     mi_cv_esri:log1p(dens) - 1
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  71346 14617.9                                  
## 2  71332  6654.7 14    7963.1 6096.9 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can also fit the SLX model on the conditioned representation:

SLXc <- lmSLX(ctr_form, data=df_tracts, listw=lw, zero.policy=TRUE)
summary(SLXc)

## 
## Call:
## lm(formula = formula(paste("y ~ 0 + ", paste(colnames(x), collapse = "+"))), 
##     data = as.data.frame(x), weights = weights)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91411 -0.18856  0.01429  0.22304  1.61503 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## mi_cv_esriHigh                           -4.0901041  0.0814612 -50.209  < 2e-16
## mi_cv_esriMedium                         -3.7875996  0.0877375 -43.170  < 2e-16
## mi_cv_esriLow                            -1.7110394  0.3614910  -4.733 2.21e-06
## mi_cv_esriHigh.log1p.vacancy_rate.        0.0863174  0.0302771   2.851 0.004361
## mi_cv_esriMedium.log1p.vacancy_rate.      0.0618700  0.0346242   1.787 0.073958
## mi_cv_esriLow.log1p.vacancy_rate.        -0.0724166  0.1579352  -0.459 0.646580
## mi_cv_esriHigh.log1p.old_rate.            0.1448400  0.0038209  37.908  < 2e-16
## mi_cv_esriMedium.log1p.old_rate.          0.1969933  0.0050701  38.854  < 2e-16
## mi_cv_esriLow.log1p.old_rate.             0.0875928  0.0358438   2.444 0.014538
## mi_cv_esriHigh.log1p.black_rate.          0.1811410  0.0198064   9.146  < 2e-16
## mi_cv_esriMedium.log1p.black_rate.        0.1495776  0.0199107   7.512 5.87e-14
## mi_cv_esriLow.log1p.black_rate.           0.0625038  0.0693053   0.902 0.367133
## mi_cv_esriHigh.log1p.hisp_rate.           0.2830902  0.0239404  11.825  < 2e-16
## mi_cv_esriMedium.log1p.hisp_rate.         0.2470295  0.0249812   9.889  < 2e-16
## mi_cv_esriLow.log1p.hisp_rate.            0.1345804  0.0931431   1.445 0.148496
## mi_cv_esriHigh.log1p.group_pop.           0.0106472  0.0006818  15.617  < 2e-16
## mi_cv_esriMedium.log1p.group_pop.         0.0088891  0.0009425   9.431  < 2e-16
## mi_cv_esriLow.log1p.group_pop.            0.0173298  0.0060110   2.883 0.003940
## mi_cv_esriHigh.log1p.dens.                0.0155850  0.0029777   5.234 1.66e-07
## mi_cv_esriMedium.log1p.dens.              0.0036302  0.0033277   1.091 0.275324
## mi_cv_esriLow.log1p.dens.                -0.0082932  0.0130541  -0.635 0.525240
## lag.mi_cv_esriHigh                        0.3598838  0.0950003   3.788 0.000152
## lag.mi_cv_esriMedium                      0.9177353  0.1195552   7.676 1.66e-14
## lag.mi_cv_esriLow                         0.6935848  0.8284068   0.837 0.402454
## lag.mi_cv_esriHigh.log1p.vacancy_rate.    0.0416962  0.0417443   0.999 0.317871
## lag.mi_cv_esriMedium.log1p.vacancy_rate.  0.2563055  0.0580503   4.415 1.01e-05
## lag.mi_cv_esriLow.log1p.vacancy_rate.    -0.1267525  0.3680428  -0.344 0.730550
## lag.mi_cv_esriHigh.log1p.old_rate.       -0.0370668  0.0068030  -5.449 5.09e-08
## lag.mi_cv_esriMedium.log1p.old_rate.     -0.0757979  0.0103622  -7.315 2.60e-13
## lag.mi_cv_esriLow.log1p.old_rate.        -0.0308527  0.0840665  -0.367 0.713617
## lag.mi_cv_esriHigh.log1p.black_rate.      0.0517898  0.0251535   2.059 0.039502
## lag.mi_cv_esriMedium.log1p.black_rate.   -0.1170101  0.0273028  -4.286 1.82e-05
## lag.mi_cv_esriLow.log1p.black_rate.      -0.2485985  0.1625651  -1.529 0.126213
## lag.mi_cv_esriHigh.log1p.hisp_rate.      -0.0012532  0.0291621  -0.043 0.965722
## lag.mi_cv_esriMedium.log1p.hisp_rate.    -0.2254311  0.0332033  -6.789 1.13e-11
## lag.mi_cv_esriLow.log1p.hisp_rate.       -0.2781210  0.2057159  -1.352 0.176390
## lag.mi_cv_esriHigh.log1p.group_pop.       0.0047556  0.0013283   3.580 0.000344
## lag.mi_cv_esriMedium.log1p.group_pop.     0.0019507  0.0020011   0.975 0.329666
## lag.mi_cv_esriLow.log1p.group_pop.       -0.0079431  0.0140328  -0.566 0.571370
## lag.mi_cv_esriHigh.log1p.dens.            0.0123808  0.0038052   3.254 0.001140
## lag.mi_cv_esriMedium.log1p.dens.         -0.0111389  0.0044558  -2.500 0.012427
## lag.mi_cv_esriLow.log1p.dens.            -0.0082363  0.0299712  -0.275 0.783465
##                                             
## mi_cv_esriHigh                           ***
## mi_cv_esriMedium                         ***
## mi_cv_esriLow                            ***
## mi_cv_esriHigh.log1p.vacancy_rate.       ** 
## mi_cv_esriMedium.log1p.vacancy_rate.     .  
## mi_cv_esriLow.log1p.vacancy_rate.           
## mi_cv_esriHigh.log1p.old_rate.           ***
## mi_cv_esriMedium.log1p.old_rate.         ***
## mi_cv_esriLow.log1p.old_rate.            *  
## mi_cv_esriHigh.log1p.black_rate.         ***
## mi_cv_esriMedium.log1p.black_rate.       ***
## mi_cv_esriLow.log1p.black_rate.             
## mi_cv_esriHigh.log1p.hisp_rate.          ***
## mi_cv_esriMedium.log1p.hisp_rate.        ***
## mi_cv_esriLow.log1p.hisp_rate.              
## mi_cv_esriHigh.log1p.group_pop.          ***
## mi_cv_esriMedium.log1p.group_pop.        ***
## mi_cv_esriLow.log1p.group_pop.           ** 
## mi_cv_esriHigh.log1p.dens.               ***
## mi_cv_esriMedium.log1p.dens.                
## mi_cv_esriLow.log1p.dens.                   
## lag.mi_cv_esriHigh                       ***
## lag.mi_cv_esriMedium                     ***
## lag.mi_cv_esriLow                           
## lag.mi_cv_esriHigh.log1p.vacancy_rate.      
## lag.mi_cv_esriMedium.log1p.vacancy_rate. ***
## lag.mi_cv_esriLow.log1p.vacancy_rate.       
## lag.mi_cv_esriHigh.log1p.old_rate.       ***
## lag.mi_cv_esriMedium.log1p.old_rate.     ***
## lag.mi_cv_esriLow.log1p.old_rate.           
## lag.mi_cv_esriHigh.log1p.black_rate.     *  
## lag.mi_cv_esriMedium.log1p.black_rate.   ***
## lag.mi_cv_esriLow.log1p.black_rate.         
## lag.mi_cv_esriHigh.log1p.hisp_rate.         
## lag.mi_cv_esriMedium.log1p.hisp_rate.    ***
## lag.mi_cv_esriLow.log1p.hisp_rate.          
## lag.mi_cv_esriHigh.log1p.group_pop.      ***
## lag.mi_cv_esriMedium.log1p.group_pop.       
## lag.mi_cv_esriLow.log1p.group_pop.          
## lag.mi_cv_esriHigh.log1p.dens.           ** 
## lag.mi_cv_esriMedium.log1p.dens.         *  
## lag.mi_cv_esriLow.log1p.dens.               
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3035 on 71311 degrees of freedom
## Multiple R-squared:  0.9839, Adjusted R-squared:  0.9839 
## F-statistic: 1.041e+05 on 42 and 71311 DF,  p-value: < 2.2e-16

spdep::lm.morantest(SLXc, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = formula(paste("y ~ 0 + ", paste(colnames(x),
## collapse = "+"))), data = as.data.frame(x), weights = weights)
## weights: lw
## 
## Moran I statistic standard deviate = 21.462, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     4.722938e-02    -1.493084e-04     4.873462e-06

summary(spatialreg::impacts(SLXc))

## Impact measures (SlX, estimable, n-k):
##                                            Direct     Indirect        Total
## mi_cv_esriHigh                       -4.090104132  0.359883802 -3.730220329
## mi_cv_esriMedium                     -3.787599585  0.917735347 -2.869864238
## mi_cv_esriLow                        -1.711039382  0.693584754 -1.017454628
## mi_cv_esriHigh:log1p(vacancy_rate)    0.086317399  0.041696231  0.128013630
## mi_cv_esriMedium:log1p(vacancy_rate)  0.061870035  0.256305459  0.318175495
## mi_cv_esriLow:log1p(vacancy_rate)    -0.072416561 -0.126752462 -0.199169022
## mi_cv_esriHigh:log1p(old_rate)        0.144840014 -0.037066793  0.107773221
## mi_cv_esriMedium:log1p(old_rate)      0.196993270 -0.075797877  0.121195393
## mi_cv_esriLow:log1p(old_rate)         0.087592829 -0.030852720  0.056740109
## mi_cv_esriHigh:log1p(black_rate)      0.181140990  0.051789817  0.232930806
## mi_cv_esriMedium:log1p(black_rate)    0.149577592 -0.117010089  0.032567503
## mi_cv_esriLow:log1p(black_rate)       0.062503849 -0.248598460 -0.186094611
## mi_cv_esriHigh:log1p(hisp_rate)       0.283090232 -0.001253238  0.281836994
## mi_cv_esriMedium:log1p(hisp_rate)     0.247029465 -0.225431066  0.021598398
## mi_cv_esriLow:log1p(hisp_rate)        0.134580420 -0.278120971 -0.143540551
## mi_cv_esriHigh:log1p(group_pop)       0.010647239  0.004755558  0.015402797
## mi_cv_esriMedium:log1p(group_pop)     0.008889058  0.001950689  0.010839747
## mi_cv_esriLow:log1p(group_pop)        0.017329841 -0.007943069  0.009386772
## mi_cv_esriHigh:log1p(dens)            0.015585033  0.012380828  0.027965861
## mi_cv_esriMedium:log1p(dens)          0.003630192 -0.011138902 -0.007508710
## mi_cv_esriLow:log1p(dens)            -0.008293167 -0.008236288 -0.016529455
## ========================================================
## Standard errors:
##                                            Direct    Indirect       Total
## mi_cv_esriHigh                       0.0814611959 0.095000270 0.057699914
## mi_cv_esriMedium                     0.0877375277 0.119555240 0.096886235
## mi_cv_esriLow                        0.3614910215 0.828406833 0.887832679
## mi_cv_esriHigh:log1p(vacancy_rate)   0.0302770553 0.041744275 0.032787516
## mi_cv_esriMedium:log1p(vacancy_rate) 0.0346242226 0.058050256 0.053524495
## mi_cv_esriLow:log1p(vacancy_rate)    0.1579352418 0.368042774 0.396163158
## mi_cv_esriHigh:log1p(old_rate)       0.0038208790 0.006803000 0.006666668
## mi_cv_esriMedium:log1p(old_rate)     0.0050701074 0.010362231 0.010737603
## mi_cv_esriLow:log1p(old_rate)        0.0358438232 0.084066540 0.090446739
## mi_cv_esriHigh:log1p(black_rate)     0.0198064357 0.025153463 0.017097134
## mi_cv_esriMedium:log1p(black_rate)   0.0199107463 0.027302820 0.020220798
## mi_cv_esriLow:log1p(black_rate)      0.0693052818 0.162565109 0.172245086
## mi_cv_esriHigh:log1p(hisp_rate)      0.0239403788 0.029162060 0.018012159
## mi_cv_esriMedium:log1p(hisp_rate)    0.0249812349 0.033203320 0.024192830
## mi_cv_esriLow:log1p(hisp_rate)       0.0931430877 0.205715885 0.222680660
## mi_cv_esriHigh:log1p(group_pop)      0.0006817651 0.001328308 0.001361229
## mi_cv_esriMedium:log1p(group_pop)    0.0009425010 0.002001129 0.002091060
## mi_cv_esriLow:log1p(group_pop)       0.0060109804 0.014032752 0.015248246
## mi_cv_esriHigh:log1p(dens)           0.0029777193 0.003805163 0.002574415
## mi_cv_esriMedium:log1p(dens)         0.0033277284 0.004455829 0.003476595
## mi_cv_esriLow:log1p(dens)            0.0130541070 0.029971163 0.032094623
## ========================================================
## Z-values:
##                                           Direct    Indirect       Total
## mi_cv_esriHigh                       -50.2092326  3.78823977 -64.6486294
## mi_cv_esriMedium                     -43.1696639  7.67624530 -29.6209699
## mi_cv_esriLow                         -4.7332832  0.83725137  -1.1459982
## mi_cv_esriHigh:log1p(vacancy_rate)     2.8509179  0.99884909   3.9043406
## mi_cv_esriMedium:log1p(vacancy_rate)   1.7869003  4.41523394   5.9444838
## mi_cv_esriLow:log1p(vacancy_rate)     -0.4585206 -0.34439601  -0.5027449
## mi_cv_esriHigh:log1p(old_rate)        37.9075113 -5.44859520  16.1659789
## mi_cv_esriMedium:log1p(old_rate)      38.8538652 -7.31482238  11.2870067
## mi_cv_esriLow:log1p(old_rate)          2.4437357 -0.36700357   0.6273317
## mi_cv_esriHigh:log1p(black_rate)       9.1455622  2.05895371  13.6239681
## mi_cv_esriMedium:log1p(black_rate)     7.5124051 -4.28564119   1.6105943
## mi_cv_esriLow:log1p(black_rate)        0.9018627 -1.52922395  -1.0804059
## mi_cv_esriHigh:log1p(hisp_rate)       11.8248017 -0.04297494  15.6470416
## mi_cv_esriMedium:log1p(hisp_rate)      9.8886010 -6.78941336   0.8927603
## mi_cv_esriLow:log1p(hisp_rate)         1.4448782 -1.35196643  -0.6446027
## mi_cv_esriHigh:log1p(group_pop)       15.6171649  3.58016342  11.3153636
## mi_cv_esriMedium:log1p(group_pop)      9.4313516  0.97479398   5.1838527
## mi_cv_esriLow:log1p(group_pop)         2.8830307 -0.56603789   0.6155968
## mi_cv_esriHigh:log1p(dens)             5.2338823  3.25369239  10.8629944
## mi_cv_esriMedium:log1p(dens)           1.0908920 -2.49984973  -2.1597888
## mi_cv_esriLow:log1p(dens)             -0.6352918 -0.27480708  -0.5150226
## 
## p-values:
##                                      Direct     Indirect   Total     
## mi_cv_esriHigh                       < 2.22e-16 0.00015172 < 2.22e-16
## mi_cv_esriMedium                     < 2.22e-16 1.6431e-14 < 2.22e-16
## mi_cv_esriLow                        2.2092e-06 0.40245129 0.251796  
## mi_cv_esriHigh:log1p(vacancy_rate)   0.0043593  0.31786780 9.4483e-05
## mi_cv_esriMedium:log1p(vacancy_rate) 0.0739536  1.0090e-05 2.7733e-09
## mi_cv_esriLow:log1p(vacancy_rate)    0.6465785  0.73054849 0.615144  
## mi_cv_esriHigh:log1p(old_rate)       < 2.22e-16 5.0769e-08 < 2.22e-16
## mi_cv_esriMedium:log1p(old_rate)     < 2.22e-16 2.5779e-13 < 2.22e-16
## mi_cv_esriLow:log1p(old_rate)        0.0145361  0.71361636 0.530442  
## mi_cv_esriHigh:log1p(black_rate)     < 2.22e-16 0.03949867 < 2.22e-16
## mi_cv_esriMedium:log1p(black_rate)   5.7954e-14 1.8221e-05 0.107268  
## mi_cv_esriLow:log1p(black_rate)      0.3671298  0.12620893 0.279961  
## mi_cv_esriHigh:log1p(hisp_rate)      < 2.22e-16 0.96572151 < 2.22e-16
## mi_cv_esriMedium:log1p(hisp_rate)    < 2.22e-16 1.1259e-11 0.371986  
## mi_cv_esriLow:log1p(hisp_rate)       0.1484921  0.17638605 0.519185  
## mi_cv_esriHigh:log1p(group_pop)      < 2.22e-16 0.00034338 < 2.22e-16
## mi_cv_esriMedium:log1p(group_pop)    < 2.22e-16 0.32966247 2.1735e-07
## mi_cv_esriLow:log1p(group_pop)       0.0039387  0.57136803 0.538161  
## mi_cv_esriHigh:log1p(dens)           1.6599e-07 0.00113916 < 2.22e-16
## mi_cv_esriMedium:log1p(dens)         0.2753204  0.01242460 0.030789  
## mi_cv_esriLow:log1p(dens)            0.5252381  0.78346446 0.606537

Again, the conditioned representation yields an improved fit, but of course ESRI reliability is directly cut from the response (CV):

cat(capture.output(print(anova(SLX, SLXc)))[c(1, 2, 27:31)], sep="\n") # anova(SLX, SLXc)

## Analysis of Variance Table
## 
##   Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
## 1  71340 14581.4                                  
## 2  71311  6568.9 29    8012.5 2999.4 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a larger number of observations, using eigenvalues to calculate the Jacobian is not possible, so we use a pre-computed Cholesky decomposition:

SEM <- spatialreg::errorsarlm(tr_form, data=df_tracts, listw=lw, method="Matrix", zero.policy=TRUE)
apply(SEM$timings, 2, sum)

## user.self   elapsed 
##     4.424     4.497

summary(SEM, Hausman=TRUE, Nagelkerke=TRUE)

## 
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw, 
##     method = "Matrix", zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.161891 -0.287439  0.010315  0.296182  2.979651 
## 
## Type: error 
## Regions with no neighbours included:
##  12745 26628 29077 29120 32398 32775 43104 46344 46390 52303 58113 58261 70000 70055 70058 70457 70458 
## Coefficients: (asymptotic standard errors) 
##                        Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         -6.25184400  0.03703706 -168.800 < 2.2e-16
## log1p(vacancy_rate)  0.35134711  0.02829437   12.418 < 2.2e-16
## log1p(old_rate)      0.41108547  0.00418162   98.308 < 2.2e-16
## log1p(black_rate)    0.54672194  0.01341739   40.747 < 2.2e-16
## log1p(hisp_rate)     0.59289677  0.01501856   39.478 < 2.2e-16
## log1p(group_pop)     0.02045655  0.00080153   25.522 < 2.2e-16
## log1p(dens)          0.03120040  0.00220509   14.149 < 2.2e-16
## 
## Lambda: 0.25388, LR test value: 2082.7, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0079519
##     z-value: 31.927, p-value: < 2.22e-16
## Wald statistic: 1019.3, p-value: < 2.22e-16
## 
## Log likelihood: -43642.84 for error model
## ML residual variance (sigma squared): 0.19674, (sigma: 0.44356)
## Nagelkerke pseudo-R-squared: 0.25173 
## Number of observations: 71353 
## Number of parameters estimated: 9 
## AIC: 87304, (AIC for lm: 89384)
## Hausman test: 171.45, df: 7, p-value: < 2.22e-16

We can also use single-chain MCMC with default 2500 draws, of which 500 are treated as burn-in and discarded; much of the time is spent on pre-computing a coarse grid of LU-decomposition Jacobian terms:

set.seed(1)
BSEM <- spatialreg::spBreg_err(tr_form, data=df_tracts, listw=lw, zero.policy=TRUE)
apply(attr(BSEM, "timings")[,c(1,3)], 2, sum)

## user.self   elapsed 
##   115.015   116.260

As we know, the model MCMC output is held in an "mcmc" object defined in coda:

library(coda)
summary(BSEM)

## 
## Iterations = 501:2500
## Thinning interval = 1 
## Number of chains = 1 
## Sample size per chain = 2000 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                         Mean        SD  Naive SE Time-series SE
## (Intercept)         -6.25240 0.0368774 8.246e-04      8.246e-04
## log1p(vacancy_rate)  0.35138 0.0288813 6.458e-04      6.458e-04
## log1p(old_rate)      0.41115 0.0041649 9.313e-05      9.313e-05
## log1p(black_rate)    0.54719 0.0137979 3.085e-04      3.085e-04
## log1p(hisp_rate)     0.59254 0.0154012 3.444e-04      3.444e-04
## log1p(group_pop)     0.02045 0.0007997 1.788e-05      1.929e-05
## log1p(dens)          0.03117 0.0022222 4.969e-05      4.969e-05
## lambda               0.25291 0.0053490 1.196e-04      1.196e-04
## sige                 0.20019 0.0012118 2.710e-05      2.710e-05
## 
## 2. Quantiles for each variable:
## 
##                         2.5%      25%      50%      75%    97.5%
## (Intercept)         -6.32307 -6.27707 -6.25300 -6.22796 -6.17848
## log1p(vacancy_rate)  0.29587  0.33131  0.35145  0.37149  0.40597
## log1p(old_rate)      0.40290  0.40839  0.41116  0.41394  0.41910
## log1p(black_rate)    0.51979  0.53798  0.54742  0.55630  0.57370
## log1p(hisp_rate)     0.56259  0.58228  0.59234  0.60257  0.62334
## log1p(group_pop)     0.01887  0.01990  0.02045  0.02099  0.02201
## log1p(dens)          0.02684  0.02966  0.03119  0.03271  0.03541
## lambda               0.24262  0.24962  0.25263  0.25663  0.26363
## sige                 0.19800  0.19944  0.20018  0.20088  0.20234

Diagnostic tests from coda can also be used:

raftery.diag(BSEM, r=0.01)

## 
## Quantile (q) = 0.025
## Accuracy (r) = +/- 0.01
## Probability (s) = 0.95 
##                                                            
##                      Burn-in  Total Lower bound  Dependence
##                      (M)      (N)   (Nmin)       factor (I)
##  (Intercept)         2        930   937          0.993     
##  log1p(vacancy_rate) 2        950   937          1.010     
##  log1p(old_rate)     2        930   937          0.993     
##  log1p(black_rate)   2        913   937          0.974     
##  log1p(hisp_rate)    2        892   937          0.952     
##  log1p(group_pop)    2        930   937          0.993     
##  log1p(dens)         2        950   937          1.010     
##  lambda              1        1190  937          1.270     
##  sige                2        930   937          0.993

plot(BSEM, trace=FALSE)

We can also use weights with SEM and conditioned SEM models:

SEMw <- spatialreg::errorsarlm(tr_form, weights=tot_pop, data=df_tracts, listw=lw, method="Matrix", zero.policy=TRUE)
apply(SEMw$timings, 2, sum)

## user.self   elapsed 
##     4.089     4.101

summary(SEMw, Hausman=TRUE, Nagelkerke=TRUE)

## 
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw, 
##     weights = tot_pop, method = "Matrix", zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.1721397 -0.2875481  0.0096908  0.2949229  2.9755875 
## 
## Type: error 
## Regions with no neighbours included:
##  12745 26628 29077 29120 32398 32775 43104 46344 46390 52303 58113 58261 70000 70055 70058 70457 70458 
## Coefficients: (asymptotic standard errors) 
##                        Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)         -6.25080288  0.03640540 -171.700 < 2.2e-16
## log1p(vacancy_rate)  0.49017557  0.03116976   15.726 < 2.2e-16
## log1p(old_rate)      0.41003992  0.00422318   97.093 < 2.2e-16
## log1p(black_rate)    0.50494215  0.01384960   36.459 < 2.2e-16
## log1p(hisp_rate)     0.57340744  0.01406260   40.775 < 2.2e-16
## log1p(group_pop)     0.02242876  0.00075294   29.788 < 2.2e-16
## log1p(dens)          0.02993784  0.00213535   14.020 < 2.2e-16
## 
## Lambda: 0.24966, LR test value: 2032, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0038204
##     z-value: 65.349, p-value: < 2.22e-16
## Wald statistic: 4270.5, p-value: < 2.22e-16
## 
## Log likelihood: -44437.41 for error model
## ML residual variance (sigma squared): 780.56, (sigma: 27.938)
## Nagelkerke pseudo-R-squared: 0.23488 
## Number of observations: 71353 
## Number of parameters estimated: 9 
## AIC: 88893, (AIC for lm: 90923)
## Hausman test: 208.19, df: 7, p-value: < 2.22e-16

SEMc <- spatialreg::errorsarlm(ctr_form, weights=tot_pop, data=df_tracts, listw=lw, method="Matrix", zero.policy=TRUE)
apply(SEMc$timings, 2, sum)

## user.self   elapsed 
##    16.977    17.148

summary(SEMc, Hausman=TRUE, Nagelkerke=TRUE)

## 
## Call:spatialreg::errorsarlm(formula = ctr_form, data = df_tracts, 
##     listw = lw, weights = tot_pop, method = "Matrix", zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -2.875210 -0.189846  0.010025  0.218619  1.589451 
## 
## Type: error 
## Regions with no neighbours included:
##  12745 26628 29077 29120 32398 32775 43104 46344 46390 52303 58113 58261 70000 70055 70058 70457 70458 
## Coefficients: (asymptotic standard errors) 
##                                         Estimate  Std. Error   z value
## mi_cv_esriHigh                       -4.50194913  0.03081245 -146.1081
## mi_cv_esriMedium                     -3.52293575  0.05244773  -67.1704
## mi_cv_esriLow                        -1.51636186  0.50068998   -3.0285
## mi_cv_esriHigh:log1p(vacancy_rate)    0.24662900  0.02483581    9.9304
## mi_cv_esriMedium:log1p(vacancy_rate)  0.17373501  0.03654076    4.7546
## mi_cv_esriLow:log1p(vacancy_rate)    -0.08408966  0.26643804   -0.3156
## mi_cv_esriHigh:log1p(old_rate)        0.20108446  0.00359308   55.9644
## mi_cv_esriMedium:log1p(old_rate)      0.17805021  0.00588966   30.2310
## mi_cv_esriLow:log1p(old_rate)         0.07629282  0.05097803    1.4966
## mi_cv_esriHigh:log1p(black_rate)      0.22727812  0.01145022   19.8492
## mi_cv_esriMedium:log1p(black_rate)    0.16015804  0.01387312   11.5445
## mi_cv_esriLow:log1p(black_rate)       0.12311592  0.10298571    1.1955
## mi_cv_esriHigh:log1p(hisp_rate)       0.31728054  0.01113463   28.4949
## mi_cv_esriMedium:log1p(hisp_rate)     0.19542720  0.01472698   13.2700
## mi_cv_esriLow:log1p(hisp_rate)        0.09478542  0.12665203    0.7484
## mi_cv_esriHigh:log1p(group_pop)       0.01317602  0.00063026   20.9056
## mi_cv_esriMedium:log1p(group_pop)     0.01054448  0.00100731   10.4680
## mi_cv_esriLow:log1p(group_pop)        0.02180365  0.00879205    2.4799
## mi_cv_esriHigh:log1p(dens)            0.01785684  0.00169428   10.5395
## mi_cv_esriMedium:log1p(dens)          0.00869646  0.00228621    3.8039
## mi_cv_esriLow:log1p(dens)            -0.00350301  0.01820832   -0.1924
##                                       Pr(>|z|)
## mi_cv_esriHigh                       < 2.2e-16
## mi_cv_esriMedium                     < 2.2e-16
## mi_cv_esriLow                        0.0024573
## mi_cv_esriHigh:log1p(vacancy_rate)   < 2.2e-16
## mi_cv_esriMedium:log1p(vacancy_rate) 1.989e-06
## mi_cv_esriLow:log1p(vacancy_rate)    0.7523010
## mi_cv_esriHigh:log1p(old_rate)       < 2.2e-16
## mi_cv_esriMedium:log1p(old_rate)     < 2.2e-16
## mi_cv_esriLow:log1p(old_rate)        0.1345020
## mi_cv_esriHigh:log1p(black_rate)     < 2.2e-16
## mi_cv_esriMedium:log1p(black_rate)   < 2.2e-16
## mi_cv_esriLow:log1p(black_rate)      0.2319050
## mi_cv_esriHigh:log1p(hisp_rate)      < 2.2e-16
## mi_cv_esriMedium:log1p(hisp_rate)    < 2.2e-16
## mi_cv_esriLow:log1p(hisp_rate)       0.4542235
## mi_cv_esriHigh:log1p(group_pop)      < 2.2e-16
## mi_cv_esriMedium:log1p(group_pop)    < 2.2e-16
## mi_cv_esriLow:log1p(group_pop)       0.0131409
## mi_cv_esriHigh:log1p(dens)           < 2.2e-16
## mi_cv_esriMedium:log1p(dens)         0.0001425
## mi_cv_esriLow:log1p(dens)            0.8474405
## 
## Lambda: 0.1595, LR test value: 707.11, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0028702
##     z-value: 55.573, p-value: < 2.22e-16
## Wald statistic: 3088.3, p-value: < 2.22e-16
## 
## Log likelihood: -20613.76 for error model
## ML residual variance (sigma squared): 402.95, (sigma: 20.074)
## Number of observations: 71353 
## Number of parameters estimated: 23 
## AIC: 41274, (AIC for lm: 41979)
## Hausman test: 855.85, df: 21, p-value: < 2.22e-16

anova(SEM, SEMc)

##      Model df   AIC logLik Test L.Ratio p-value
## SEM      1  9 87304 -43643    1                
## SEMc     2 23 41274 -20614    2   46058       0

and its Durbin variant:

SDEM <- spatialreg::errorsarlm(tr_form, data=df_tracts, listw=lw, method="Matrix", Durbin=TRUE, zero.policy=TRUE)
apply(SDEM$timings, 2, sum)

## user.self   elapsed 
##     8.973     8.997

summary(SDEM, Nagelkerke=TRUE)

## 
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw, 
##     Durbin = TRUE, method = "Matrix", zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -3.153522 -0.286920  0.010042  0.295821  2.932170 
## 
## Type: error 
## Regions with no neighbours included:
##  12745 26628 29077 29120 32398 32775 43104 46344 46390 52303 58113 58261 70000 70055 70058 70457 70458 
## Coefficients: (asymptotic standard errors) 
##                            Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)             -6.16017866  0.06471718 -95.1861 < 2.2e-16
## log1p(vacancy_rate)      0.22011019  0.03746441   5.8752 4.224e-09
## log1p(old_rate)          0.40323063  0.00432342  93.2666 < 2.2e-16
## log1p(black_rate)        0.50755974  0.02561676  19.8136 < 2.2e-16
## log1p(hisp_rate)         0.62902991  0.03218041  19.5470 < 2.2e-16
## log1p(group_pop)         0.01986287  0.00080835  24.5722 < 2.2e-16
## log1p(dens)              0.00833251  0.00399191   2.0873   0.03686
## lag.log1p(vacancy_rate)  0.33910367  0.05237609   6.4744 9.519e-11
## lag.log1p(old_rate)     -0.00961209  0.00682877  -1.4076   0.15925
## lag.log1p(black_rate)    0.03009720  0.03046327   0.9880   0.32316
## lag.log1p(hisp_rate)    -0.06474559  0.03688723  -1.7552   0.07922
## lag.log1p(group_pop)     0.01440312  0.00167140   8.6174 < 2.2e-16
## lag.log1p(dens)          0.03495989  0.00475808   7.3475 2.021e-13
## 
## Lambda: 0.25272, LR test value: 2070, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0042256
##     z-value: 59.807, p-value: < 2.22e-16
## Wald statistic: 3576.8, p-value: < 2.22e-16
## 
## Log likelihood: -43560.04 for error model
## ML residual variance (sigma squared): 0.19631, (sigma: 0.44307)
## Nagelkerke pseudo-R-squared: 0.25347 
## Number of observations: 71353 
## Number of parameters estimated: 15 
## AIC: 87150, (AIC for lm: 89218)

summary(spatialreg::impacts(SDEM))

## Impact measures (SDEM, estimable, n):
##                          Direct     Indirect      Total
## log1p(vacancy_rate) 0.220110186  0.339103674 0.55921386
## log1p(old_rate)     0.403230628 -0.009612094 0.39361853
## log1p(black_rate)   0.507559744  0.030097197 0.53765694
## log1p(hisp_rate)    0.629029914 -0.064745592 0.56428432
## log1p(group_pop)    0.019862873  0.014403116 0.03426599
## log1p(dens)         0.008332506  0.034959895 0.04329240
## ========================================================
## Standard errors:
##                           Direct    Indirect       Total
## log1p(vacancy_rate) 0.0374644083 0.052376090 0.040633564
## log1p(old_rate)     0.0043234188 0.006828774 0.007484115
## log1p(black_rate)   0.0256167606 0.030463269 0.016154547
## log1p(hisp_rate)    0.0321804058 0.036887235 0.017557548
## log1p(group_pop)    0.0008083488 0.001671399 0.001803419
## log1p(dens)         0.0039919138 0.004758078 0.002749267
## ========================================================
## Z-values:
##                        Direct   Indirect    Total
## log1p(vacancy_rate)  5.875181  6.4743984 13.76236
## log1p(old_rate)     93.266613 -1.4075870 52.59387
## log1p(black_rate)   19.813580  0.9879832 33.28208
## log1p(hisp_rate)    19.546985 -1.7552303 32.13913
## log1p(group_pop)    24.572157  8.6174018 19.00057
## log1p(dens)          2.087346  7.3474828 15.74689
## 
## p-values:
##                     Direct     Indirect   Total     
## log1p(vacancy_rate) 4.2238e-09 9.5191e-11 < 2.22e-16
## log1p(old_rate)     < 2.22e-16 0.15925    < 2.22e-16
## log1p(black_rate)   < 2.22e-16 0.32316    < 2.22e-16
## log1p(hisp_rate)    < 2.22e-16 0.07922    < 2.22e-16
## log1p(group_pop)    < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens)         0.036857   2.0206e-13 < 2.22e-16

and weighted:

SDEMw <- spatialreg::errorsarlm(tr_form, weights=tot_pop, data=df_tracts, listw=lw, method="Matrix", Durbin=TRUE, zero.policy=TRUE)
apply(SDEMw$timings, 2, sum)

## user.self   elapsed 
##     8.204     8.228

summary(SDEMw, Hausman=TRUE, Nagelkerke=TRUE)

## 
## Call:spatialreg::errorsarlm(formula = tr_form, data = df_tracts, listw = lw, 
##     weights = tot_pop, Durbin = TRUE, method = "Matrix", zero.policy = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.1665752 -0.2883039  0.0096818  0.2960586  2.9234058 
## 
## Type: error 
## Regions with no neighbours included:
##  12745 26628 29077 29120 32398 32775 43104 46344 46390 52303 58113 58261 70000 70055 70058 70457 70458 
## Coefficients: (asymptotic standard errors) 
##                            Estimate  Std. Error  z value  Pr(>|z|)
## (Intercept)             -6.26779399  0.06438165 -97.3537 < 2.2e-16
## log1p(vacancy_rate)      0.29917270  0.04034320   7.4157 1.210e-13
## log1p(old_rate)          0.40182015  0.00435840  92.1944 < 2.2e-16
## log1p(black_rate)        0.48896909  0.02617479  18.6809 < 2.2e-16
## log1p(hisp_rate)         0.61656157  0.03094209  19.9263 < 2.2e-16
## log1p(group_pop)         0.02144233  0.00076196  28.1408 < 2.2e-16
## log1p(dens)              0.01417781  0.00388774   3.6468 0.0002655
## lag.log1p(vacancy_rate)  0.40922588  0.05309595   7.7073 1.288e-14
## lag.log1p(old_rate)      0.00219764  0.00697219   0.3152 0.7526096
## lag.log1p(black_rate)    0.00270443  0.03096953   0.0873 0.9304128
## lag.log1p(hisp_rate)    -0.06315358  0.03539952  -1.7840 0.0744198
## lag.log1p(group_pop)     0.01860324  0.00163288  11.3929 < 2.2e-16
## lag.log1p(dens)          0.02522840  0.00467896   5.3919 6.972e-08
## 
## Lambda: 0.24893, LR test value: 2029.7, p-value: < 2.22e-16
## Approximate (numerical Hessian) standard error: 0.0056695
##     z-value: 43.906, p-value: < 2.22e-16
## Wald statistic: 1927.8, p-value: < 2.22e-16
## 
## Log likelihood: -44328.16 for error model
## ML residual variance (sigma squared): 778.22, (sigma: 27.897)
## Nagelkerke pseudo-R-squared: 0.23722 
## Number of observations: 71353 
## Number of parameters estimated: 15 
## AIC: 88686, (AIC for lm: 90714)
## Hausman test: 67.692, df: 13, p-value: 2.1293e-09

summary(spatialreg::impacts(SDEMw))

## Impact measures (SDEM, estimable, n):
##                         Direct     Indirect      Total
## log1p(vacancy_rate) 0.29917270  0.409225875 0.70839857
## log1p(old_rate)     0.40182015  0.002197637 0.40401779
## log1p(black_rate)   0.48896909  0.002704431 0.49167353
## log1p(hisp_rate)    0.61656157 -0.063153582 0.55340799
## log1p(group_pop)    0.02144233  0.018603237 0.04004557
## log1p(dens)         0.01417781  0.025228403 0.03940622
## ========================================================
## Standard errors:
##                          Direct    Indirect       Total
## log1p(vacancy_rate) 0.040343204 0.053095951 0.042489080
## log1p(old_rate)     0.004358399 0.006972192 0.007509624
## log1p(black_rate)   0.026174793 0.030969533 0.016508152
## log1p(hisp_rate)    0.030942095 0.035399521 0.016387297
## log1p(group_pop)    0.000761965 0.001632879 0.001739686
## log1p(dens)         0.003887738 0.004678957 0.002686349
## ========================================================
## Z-values:
##                        Direct    Indirect    Total
## log1p(vacancy_rate)  7.415690  7.70728964 16.67249
## log1p(old_rate)     92.194431  0.31520029 53.80000
## log1p(black_rate)   18.680915  0.08732554 29.78368
## log1p(hisp_rate)    19.926303 -1.78402361 33.77055
## log1p(group_pop)    28.140837 11.39290899 23.01884
## log1p(dens)          3.646803  5.39188581 14.66906
## 
## p-values:
##                     Direct     Indirect   Total     
## log1p(vacancy_rate) 1.2101e-13 1.2879e-14 < 2.22e-16
## log1p(old_rate)     < 2.22e-16 0.75261    < 2.22e-16
## log1p(black_rate)   < 2.22e-16 0.93041    < 2.22e-16
## log1p(hisp_rate)    < 2.22e-16 0.07442    < 2.22e-16
## log1p(group_pop)    < 2.22e-16 < 2.22e-16 < 2.22e-16
## log1p(dens)         0.00026552 6.9722e-08 < 2.22e-16

Finally, using the development version of sphet:

library(sphet)

## 
## Attaching package: 'sphet'

## The following object is masked from 'package:spatialreg':
## 
##     impacts

system.time(SEM_GMM <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=FALSE))

##    user  system elapsed 
##   1.902   0.058   1.966

summary(SEM_GMM)

## 
##  Generalized stsls
## 
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"), 
##     model = "error", het = FALSE)
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.15222 -0.29469  0.01032 -0.00021  0.30086  3.00213 
## 
## Coefficients:
##                        Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)         -6.24575367  0.03720516 -167.873 < 2.2e-16 ***
## log1p(vacancy_rate)  0.35157200  0.02843557   12.364 < 2.2e-16 ***
## log1p(old_rate)      0.41042586  0.00419882   97.748 < 2.2e-16 ***
## log1p(black_rate)    0.54682738  0.01352155   40.441 < 2.2e-16 ***
## log1p(hisp_rate)     0.59271015  0.01514450   39.137 < 2.2e-16 ***
## log1p(group_pop)     0.02040826  0.00080207   25.444 < 2.2e-16 ***
## log1p(dens)          0.03112745  0.00222134   14.013 < 2.2e-16 ***
## rho                  0.26259335  0.00477093   55.040 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

with its Durbin variant:

system.time(SDEM_GMM <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=FALSE, Durbin=TRUE))

##    user  system elapsed 
##   2.133   0.078   2.216

summary(SDEM_GMM)

## 
##  Generalized stsls
## 
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"), 
##     model = "error", het = FALSE, Durbin = TRUE)
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.14258 -0.29470  0.00949 -0.00075  0.29963  2.94460 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)             -6.13268960  0.06558835 -93.5027 < 2.2e-16 ***
## log1p(vacancy_rate)      0.22111422  0.03740353   5.9116 3.388e-09 ***
## log1p(old_rate)          0.40241917  0.00432891  92.9608 < 2.2e-16 ***
## log1p(black_rate)        0.50792821  0.02557387  19.8612 < 2.2e-16 ***
## log1p(hisp_rate)         0.62901587  0.03212850  19.5781 < 2.2e-16 ***
## log1p(group_pop)         0.01983835  0.00080826  24.5445 < 2.2e-16 ***
## log1p(dens)              0.00826926  0.00398581   2.0747   0.03802 *  
## lag_log1p(vacancy_rate)  0.34382296  0.05251777   6.5468 5.879e-11 ***
## lag_log1p(old_rate)     -0.01195826  0.00688147  -1.7377   0.08226 .  
## lag_log1p(black_rate)    0.03012242  0.03048615   0.9881   0.32312    
## lag_log1p(hisp_rate)    -0.06631076  0.03690239  -1.7969   0.07235 .  
## lag_log1p(group_pop)     0.01435628  0.00167892   8.5509 < 2.2e-16 ***
## lag_log1p(dens)          0.03521147  0.00476425   7.3908 1.460e-13 ***
## rho                      0.26161661  0.00476485  54.9055 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This also lets us adjust the coefficient standard errors for heterskedasticity:

system.time(SEM_GMMh <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=TRUE))

##    user  system elapsed 
##   2.660   0.036   2.703

summary(SEM_GMMh)

## 
##  Generalized stsls
## 
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"), 
##     model = "error", het = TRUE)
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.15225 -0.29467  0.01030 -0.00021  0.30090  3.00219 
## 
## Coefficients:
##                        Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)         -6.24579581  0.03996340 -156.288 < 2.2e-16 ***
## log1p(vacancy_rate)  0.35123442  0.03135236   11.203 < 2.2e-16 ***
## log1p(old_rate)      0.41044090  0.00456026   90.004 < 2.2e-16 ***
## log1p(black_rate)    0.54681845  0.01429321   38.257 < 2.2e-16 ***
## log1p(hisp_rate)     0.59276338  0.01559615   38.007 < 2.2e-16 ***
## log1p(group_pop)     0.02040097  0.00083032   24.570 < 2.2e-16 ***
## log1p(dens)          0.03109688  0.00232960   13.349 < 2.2e-16 ***
## rho                  0.27593286  0.00488909   56.438 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

system.time(SDEM_GMMh <- spreg(tr_form, data=df_tracts, listw=as(lw, "CsparseMatrix"), model="error", het=TRUE, Durbin=TRUE))

##    user  system elapsed 
##   3.207   0.095   3.311

summary(SDEM_GMMh)

## 
##  Generalized stsls
## 
## Call:
## spreg(formula = tr_form, data = df_tracts, listw = as(lw, "CsparseMatrix"), 
##     model = "error", het = TRUE, Durbin = TRUE)
## 
## Residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -3.14259 -0.29469  0.00947 -0.00076  0.29964  2.94460 
## 
## Coefficients:
##                            Estimate  Std. Error  t-value  Pr(>|t|)    
## (Intercept)             -6.13295558  0.06908374 -88.7757 < 2.2e-16 ***
## log1p(vacancy_rate)      0.22118334  0.04042524   5.4714 4.465e-08 ***
## log1p(old_rate)          0.40242167  0.00468697  85.8597 < 2.2e-16 ***
## log1p(black_rate)        0.50798180  0.02743648  18.5148 < 2.2e-16 ***
## log1p(hisp_rate)         0.62911182  0.03381343  18.6054 < 2.2e-16 ***
## log1p(group_pop)         0.01983960  0.00083555  23.7443 < 2.2e-16 ***
## log1p(dens)              0.00825998  0.00410720   2.0111   0.04432 *  
## lag_log1p(vacancy_rate)  0.34365182  0.05502973   6.2448 4.242e-10 ***
## lag_log1p(old_rate)     -0.01192960  0.00729841  -1.6345   0.10214    
## lag_log1p(black_rate)    0.03000758  0.03268209   0.9182   0.35853    
## lag_log1p(hisp_rate)    -0.06644231  0.03895247  -1.7057   0.08806 .  
## lag_log1p(group_pop)     0.01436527  0.00171910   8.3563 < 2.2e-16 ***
## lag_log1p(dens)          0.03521675  0.00491840   7.1602 8.056e-13 ***
## rho                      0.27495975  0.00488766  56.2560 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R’s `sessionInfo()`

sessionInfo()

## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 34 (Workstation Edition)
## 
## Matrix products: default
## BLAS:   /home/rsb/topics/R/R410-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R410-share/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] sphet_1.11       coda_0.19-4      spatialreg_1.1-8 Matrix_1.3-3    
## [5] spData_0.3.8     sf_0.9-9        
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6         mvtnorm_1.1-1      lattice_0.20-44    tidyr_1.1.3       
##  [5] deldir_0.2-10      class_7.3-19       zoo_1.8-9          gtools_3.8.2      
##  [9] assertthat_0.2.1   digest_0.6.27      lmtest_0.9-38      utf8_1.2.1        
## [13] RSpectra_0.16-0    spDataLarge_0.5.1  R6_2.5.0           backports_1.2.1   
## [17] evaluate_0.14      e1071_1.7-7        highr_0.9          pillar_1.6.1      
## [21] rlang_0.4.11       spdep_1.1-8        rstudioapi_0.13    gdata_2.18.0      
## [25] raster_3.4-10      jquerylib_0.1.4    gmodels_2.18.1     rmarkdown_2.8     
## [29] splines_4.1.0      stringr_1.4.0      proxy_0.4-25       broom_0.7.6       
## [33] compiler_4.1.0     xfun_0.23          pkgconfig_2.0.3    htmltools_0.5.1.1 
## [37] tidyselect_1.1.1   tibble_3.1.2       expm_0.999-6       codetools_0.2-18  
## [41] matrixStats_0.58.0 fansi_0.4.2        crayon_1.4.1       dplyr_1.0.6       
## [45] MASS_7.3-54        grid_4.1.0         nlme_3.1-152       jsonlite_1.7.2    
## [49] lifecycle_1.0.0    DBI_1.1.1          magrittr_2.0.1     units_0.7-1       
## [53] KernSmooth_2.23-20 cli_2.5.0          stringi_1.6.2      LearnBayes_2.15.1 
## [57] sp_1.4-5           bslib_0.2.5.1      ellipsis_0.3.2     generics_0.1.0    
## [61] vctrs_0.3.8        boot_1.3-28        tools_4.1.0        glue_1.4.2        
## [65] purrr_0.3.4        yaml_2.2.1         classInt_0.4-3     knitr_1.33        
## [69] sass_0.4.0

Anselin, L. 1988. Spatial Econometrics: Methods and Models. Kluwer Academic Publishers.

Besag, Julian. 1974. “Spatial Interaction and the Statistical Analysis of Lattice Systems.” Journal of the Royal Statistical Society. Series B (Methodological) 36: pp. 192–236.

Bivand, R. S. 1984. “Regression Modeling with Spatial Dependence: an Application of Some Class Selection and Estimation Methods.” Geographical Analysis 16: 25–37.

———. 2002. “Spatial Econometrics Functions in R: Classes and Methods.” Journal of Geographical Systems 4: 405–21.

Bivand, Roger. 2017. “Revisiting the Boston Data Set — Changing the Units of Observation Affects Estimated Willingness to Pay for Clean Air.” REGION 4 (1): 109–27. https://doi.org/10.18335/region.v4i1.107.

Bivand, Roger S. 2012. “After ’Raising the Bar’: Applied Maximum Likelihood Estimation of Families of Models in Spatial Econometrics.” Estadística Española 54: 71–88.

Bivand, Roger S., Jan Hauke, and Tomasz Kossowski. 2013. “Computing the Jacobian in Gaussian Spatial Autoregressive Models: An Illustrated Comparison of Available Methods.” Geographical Analysis 45 (2): 150–79. https://doi.org/10.1111/gean.12008.

Bivand, Roger S., and Gianfranco Piras. 2015. “Comparing Implementations of Estimation Methods for Spatial Econometrics.” Journal of Statistical Software 63 (1): 1–36. https://doi.org/10.18637/jss.v063.i18.

Bivand, Roger, Giovanni Millo, and Gianfranco Piras. 2021. “A Review of Software for Spatial Econometrics in R.” Mathematics 9 (11). https://doi.org/10.3390/math9111276.

Brooks, Mollie E., Kasper Kristensen, Koen J. van Benthem, Arni Magnusson, Casper W. Berg, Anders Nielsen, Hans J. Skaug, Martin Maechler, and Benjamin M. Bolker. 2017. “glmmTMB Balances Speed and Flexibility Among Packages for Zero-Inflated Generalized Linear Mixed Modeling.” The R Journal 9 (2): 378–400. https://journal.r-project.org/archive/2017/RJ-2017-066/index.html.

Burridge, P. 1981. “Testing for a Common Factor in a Spatial Autoregression Model.” Environment and Planning A 13: 795–800.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes. London: Pion.

Cliff, A., and J. K. Ord. 1972. “Testing for Spatial Autocorrelation Among Regression Residuals.” Geographical Analysis 4: 267–84.

Cressie, N. A. C. 1993. Statistics for Spatial Data. New York:Wiley.

Dong, Guanpeng, and Richard Harris. 2015. “Spatial Autoregressive Models for Geographically Hierarchical Data Structures.” Geographical Analysis 47 (2): 173–91. https://doi.org/10.1111/gean.12049.

Elhorst, J. Paul. 2010. “Applied Spatial Econometrics: Raising the Bar.” Spatial Economic Analysis 5: 9–28.

Fingleton, B. 1999. “Spurious spatial regression: Some Monte Carlo results with a spatial unit root and spatial cointegration.” Journal of Regional Science 39: 1–19.

Gaetan, Carlo, and Xavier Guyon. 2010. Spatial Statistics and Modeling. New York: Springer.

Goulard, Michel, Thibault Laurent, and Christine Thomas-Agnan. 2017. “About Predictions in Spatial Autoregressive Models: Optimal and Almost Optimal Strategies.” Spatial Economic Analysis 12 (2-3): 304–25. https://doi.org/10.1080/17421772.2017.1300679.

Gómez-Rubio, Virgilio, Roger Bivand, and Håvard Rue. 2015. “A New Latent Class to Fit Spatial Econometrics Models with Integrated Nested Laplace Approximations.” Procedia Environmental Sciences 27: 116–18. https://doi.org/https://doi.org/10.1016/j.proenv.2015.07.119.

Griffith, Daniel, Roger Bivand, and Yongwan Chun. 2015. “Implementing Approximations to Extreme Eigenvalues and Eigenvalues of Irregular Surface Partitionings for Use in SAR and CAR Models.” Procedia Environmental Sciences 26: 119–22. https://doi.org/https://doi.org/10.1016/j.proenv.2015.05.013.

Halleck Vega, Solmaria, and J. Paul Elhorst. 2015. “The SLX Model.” Journal of Regional Science 55 (3): 339–63. https://doi.org/10.1111/jors.12188.

Hepple, Leslie W. 1976. “A Maximum Likelihood Model for Econometric Estimation with Spatial Series.” In Theory and Practice in Regional Science, edited by I. Masser, 90–104. London Papers in Regional Science. London: Pion.

Kaluzny, S. P., S. C. Vega, T. P. Cardoso, and A. A. Shelly. 1998. S+SpatialStats. New York, NY: Springer.

Kelejian, Harry, and Gianfranco Piras. 2017. Spatial Econometrics. London: Academic Press.

LeSage, J. P. 2014. “What Regional Scientists Need to Know about Spatial Econometrics.” Review of Regional Studies 44: 13–32. https://journal.srsa.org/ojs/index.php/RRS/article/view/44.1.2.

LeSage, James P., and Kelley R. Pace. 2009. Introduction to Spatial Econometrics. Boca Raton, FL: CRC Press.

Lieshout, M. N. M. van. 2019. Theory of Spatial Statistics. Boca Raton, FL: Chapman; Hall/CRC.

Martinetti, Davide, and Ghislain Geniaux. 2017. “Approximate Likelihood Estimation of Spatial Probit Models.” Regional Science and Urban Economics 64: 30–45. https://doi.org/https://doi.org/10.1016/j.regsciurbeco.2017.02.002.

McMillen, D. P. 2013. Quantile Regression for Spatial Data. Heidelberg: Springer-Verlag.

Millo, Giovanni, and Gianfranco Piras. 2012. “splm: Spatial Panel Data Models in R.” Journal of Statistical Software 47 (1): 1–38.

Mur, Jesús, and Ana Angulo. 2006. “The Spatial Durbin Model and the Common Factor Tests.” Spatial Economic Analysis 1 (2): 207–26. https://doi.org/10.1080/17421770601009841.

Ord, J. K. 1975. “Estimation Methods for Models of Spatial Interaction.” Journal of the American Statistical Association 70: 120–26.

Pace, R. K., and R. P. Barry. 1997a. “Fast CARs.” Journal of Statistical Computation and Simulation 59 (2): 123–45.

———. 1997b. “Fast spatial estimation.” Applied Economics Letters 4 (5): 337–41.

———. 1997c. “Quick computation of spatial autoregressive estimators.” Geographics Analysis 29 (3): 232–47.

———. 1997d. “Sparse spatial autoregressions.” Statistics & Probability Letters 33 (3): 291–97.

Pace, RK, and JP LeSage. 2008. “A Spatial Hausman Test.” Economics Letters 101: 282–84.

Pinheiro, Jose C., and Douglas M. Bates. 2000. Mixed-Effects Models in S and S-Plus. New York: Springer.

Piras, Gianfranco. 2010. “sphet: Spatial Models with Heteroskedastic Innovations in R.” Journal of Statistical Software 35 (1): 1–21.

Ripley, B. D. 1981. Spatial Statistics. New York: Wiley.

———. 1988. Statistical Inference for Spatial Processes. Cambridge: Cambridge University Press.

Smith, T. E., and K. L. Lee. 2012. “The effects of spatial autoregressive dependencies on inference in ordinary least squares: a geometric approach.” Journal of Geographical Systems 14 (January): 91–124. https://doi.org/10.1007/s10109-011-0152-x.

Smith, Tony E. 2009. “Estimation Bias in Spatial Models with Strongly Connected Weight Matrices.” Geographical Analysis 41 (3): 307–32. https://doi.org/10.1111/j.1538-4632.2009.00758.x.

Suesse, Thomas. 2018. “Marginal Maximum Likelihood Estimation of SAR Models with Missing Data.” Computational Statistics & Data Analysis 120: 98–110. https://doi.org/https://doi.org/10.1016/j.csda.2017.11.004.

Wagner, Martin, and Achim Zeileis. 2019. “Heterogeneity and Spatial Dependence of Regional Growth in the EU: A Recursive Partitioning Approach.” German Economic Review 20 (1): 67–82. https://doi.org/10.1111/geer.12146.

Wall, M. M. 2004. “A Close Look at the Spatial Structure Implied by the CAR and SAR Models.” Journal of Statistical Planning and Inference 121: 311–24.

Waller, Lance A., and Carol A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken, NJ: John Wiley & Sons.

Whittle, P. 1954. “On Stationary Processes in the Plane.” Biometrika 41 (3-4): 434–49. https://doi.org/10.1093/biomet/41.3-4.434.

Wilhelm, Stefan, and Miguel Godinho de Matos. 2013. “Estimating Spatial Probit Models in R.” The R Journal 5 (1): 130–43. https://doi.org/10.32614/RJ-2013-013.

Introduction to R (II)

Roger Bivand

Monday 14 June 2021

Copyright

Required current contributed CRAN packages:

Script

Seminar introduction

Schedule

Introduction

Spatial Regression

Spatial regression with spatial weights

Estimators

Maximum likelihood

Generalized method of moments

Markov chain Monte Carlo

Handling the log determinant term

Impacts

Predictions

The ACS US census tracts data set

R’s `sessionInfo()`

Introduction to R (II)

Roger Bivand

Monday 14 June 2021

Copyright

Required current contributed CRAN packages:

Script

Seminar introduction

Schedule

Introduction

Spatial Regression

Spatial regression with spatial weights

Estimators

Maximum likelihood

Generalized method of moments

Markov chain Monte Carlo

Handling the log determinant term

Impacts

Predictions

The ACS US census tracts data set

R’s sessionInfo()

R’s `sessionInfo()`