ECS530: (VII) Spatial regression

Copyright

All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.

Required current contributed CRAN packages:

I am running R 3.6.1, with recent update.packages().

needed <- c("hglm", "hglm.data", "tmap", "spatialreg", "spdep", "spData", "sp", "sf")

Script

Script and data at https://github.com/rsbivand/ECS530_h19/raw/master/ECS530_VII.zip. Download to suitable location, unzip and use as basis.

Schedule

2/12 (I) Spatial data representation, (II) Support+topology, input/output
3/12 (III) Coordinate reference systems, (IV) Visualization
4/12 (VI) Spatial autocorrelation, project surgery
5/12 (VII) Spatial regression, (VIII) Spatial multilevel regression
6/12 (IX) Interpolation, point processes, project surgery
7/12 Presentations

Session VII

09:15-09:45 Spatial regression background
09:45-10:30 Simultaneous autoregression
10:30-11:00 Conditional autoregression & Markov Random Fields

library(sf)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)
nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
nc$ft.NWBIR74 <- sqrt(1000)*(sqrt(nc$NWBIR74/nc$BIR74) + sqrt((nc$NWBIR74+1)/nc$BIR74))
library(spdep)
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
nc_lw <- nb2listw(ncCR85, style="B")
nc_W <- as(nc_lw, "CsparseMatrix")

Simultaneous autoregressive approaches

The approach taken by social scientists including economists, and some others has been to approach this through simultaneous autoregressive approaches, where the response is modelled using fixed covariates, and the residual process is modelled by optimising a log likelihood function. The spatialreg package provides spautolm() and errorsarlm():

library(spatialreg)
m1 <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="SAR")

When we include the covariate, the spatial error coefficient contributes little:

summary(m1)

## 
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, family = "SAR")
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83149 -0.44934  0.09456  0.56278  2.90208 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.5816687  0.2507427  6.3079 2.828e-10
## ft.NWBIR74  0.0366864  0.0069608  5.2704 1.361e-07
## 
## Lambda: 0.035437 LR test value: 1.787 p-value: 0.18129 
## Numerical Hessian standard error of lambda: 0.026096 
## 
## Log likelihood: -119.6704 
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.34

It is unusual to present maps of the spatially structured random effect in cases where the simultaneous autoregressive approach is used, but it is fully possible using components of the returned model object (the vector is doubled for comparison with the CAR version):

nc$SAR_ssre <- 2*as.vector((m1$lambda * nc_W) %*% m1$Y - (m1$lambda * nc_W) %*% (m1$X %*% m1$fit$coefficients))
library(tmap)
tm_shape(nc) + tm_fill(c("ft.SID74", "SAR_ssre"), midpoint=c(NA, 0))

The other maximum likelihood implementation gives the same results, but provides a Hausman test for shifts in the covariate coefficients between the aspatial and spatial estimates (Pace and LeSage 2008); there is none:

m1a <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw)
summary(m1a, Hausman=TRUE)

## 
## Call:errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83149 -0.44934  0.09456  0.56278  2.90208 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.5816687  0.2507427  6.3079 2.828e-10
## ft.NWBIR74  0.0366864  0.0069608  5.2704 1.361e-07
## 
## Lambda: 0.035437, LR test value: 1.787, p-value: 0.18129
## Asymptotic standard error: 0.029167
##     z-value: 1.2149, p-value: 0.22439
## Wald statistic: 1.4761, p-value: 0.22439
## 
## Log likelihood: -119.6704 for error model
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.34, (AIC for lm: 247.13)
## Hausman test: 1.565, df: 2, p-value: 0.45726

It also lets us add Durbin= terms, that is spatially lagged covariates:

m1b <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, Durbin=TRUE)
summary(m1b, Hausman=TRUE)

## 
## Call:errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, Durbin = TRUE)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.8705347 -0.4488357  0.0094254  0.4864939  2.7406880 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                    Estimate  Std. Error z value  Pr(>|z|)
## (Intercept)      2.28212888  0.33548214  6.8025 1.028e-11
## ft.NWBIR74       0.03532229  0.00811453  4.3530 1.343e-05
## lag.(Intercept) -0.15336745  0.06073121 -2.5253   0.01156
## lag.ft.NWBIR74   0.00098902  0.00159099  0.6216   0.53418
## 
## Lambda: 0.028053, LR test value: 1.036, p-value: 0.30874
## Asymptotic standard error: 0.029824
##     z-value: 0.94064, p-value: 0.34689
## Wald statistic: 0.8848, p-value: 0.34689
## 
## Log likelihood: -115.1974 for error model
## ML residual variance (sigma squared): 1195.8, (sigma: 34.58)
## Number of observations: 100 
## Number of parameters estimated: 6 
## AIC: 242.39, (AIC for lm: 241.43)
## Hausman test: 1.8315, df: 4, p-value: 0.76671

and to present the by-covariate effects taking into account the unlagged (direct) and lagged (indirect) covariates and the sum of the coefficients (total):

summary(impacts(m1b))

## Impact measures (SDEM, estimable, n):
##                Direct   Indirect      Total
## ft.NWBIR74 0.03532229 -0.1533675 -0.1180452
## ========================================================
## Standard errors:
##                 Direct   Indirect      Total
## ft.NWBIR74 0.008114531 0.06073121 0.06438103
## ========================================================
## Z-values:
##              Direct  Indirect     Total
## ft.NWBIR74 4.352968 -2.525348 -1.833539
## 
## p-values:
##            Direct     Indirect Total   
## ft.NWBIR74 1.3431e-05 0.011558 0.066722

However, our model may suffer from not using a mixed model approach to a count response; the simultaneous autoregressive models are mostly used with Gaussian responses. One possibility is to employ the hierarchical generalized linear model approach from the hglm package. First we’ll fit an unstructured IID (independent and identically distributed) random effect (Alam, Rönnegård, and Shen 2015):

library(hglm)
E <- nc$BIR74 * sum(nc$SID74)/sum(nc$BIR74)
HGLM_iid <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
                 data=nc, family=poisson(link=log))

The random effects also have their own standard errors, so we can order and display them with error bars in a forest or caterpillar plot:

ranef_iid <- unname(summary(HGLM_iid, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_iid[,1], ranef_iid[,2], subset=order(ranef_iid[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

We can also fit a weighted Poisson simultaneous autoregressive model, and examine the random effects:

HGLM_sar <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74, 
                 data=nc, family=poisson(link=log), rand.family=SAR(D=nc_W))
ranef_sar <- unname(summary(HGLM_sar, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_sar[,1], ranef_sar[,2], subset=order(ranef_sar[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

There is not much difference between the IID and the SAR spatially structured random effects:

nc$HGLM_re <- ranef_iid[,1]
nc$HGLM_ss_SAR <- ranef_sar[,1]
tm_shape(nc) + tm_fill(c("HGLM_re", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("IID", "SAR SSRE"))

Conditional and MRF autoregressive approaches

Most epidemiological applications use conditional autoregressive approaches, where some (like spautolm() and the hglm() implementations) fit a spatial coefficient, but many fit an intrinsic CAR. First the spautolm() and the hglm() implementations:

m1c <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="CAR")
summary(m1c)

## 
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw, 
##     weights = BIR74, family = "CAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.890805 -0.483471  0.070376  0.575906  2.824220 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 1.4561763  0.2499558  5.8257 5.686e-09
## ft.NWBIR74  0.0410281  0.0069417  5.9103 3.414e-09
## 
## Lambda: 0.06028 LR test value: 1.6364 p-value: 0.20082 
## Numerical Hessian standard error of lambda: 0.053566 
## 
## Log likelihood: -119.7458 
## ML residual variance (sigma squared): 1302, (sigma: 36.083)
## Number of observations: 100 
## Number of parameters estimated: 4 
## AIC: 247.49

Again, we can calculate something that represents the spatial patterning of the spatial process (termed “signal” in the documentation of spatialreg::predict.sarlm), but it is not scaled in the same way as the hgml() random effects (and importantly we do not have standard errors):

nc$CAR_ssre <- as.vector((m1c$lambda * nc_W) %*% m1c$Y - 
                           (m1c$lambda * nc_W) %*% (m1c$X %*% m1c$fit$coefficients))
tm_shape(nc) + tm_fill(c("SAR_ssre", "CAR_ssre"), midpoint=c(0), title="Gauss ML RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("SAR SSRE", "CAR SSRE"))

Fitting the HGLM CAR model is just like the SAR model, and the forest plot of the spatially structured random effect is similar. Recall that spautolm() is fitting a Gaussian model, but hglm() is fitting a Poisson model, arguably better suited to count data. This means that the scalings of the random effects will vary in scale:

HGLM_car <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74, 
                 data=nc, family=poisson(link=log), rand.family=CAR(D=nc_W))
ranef_car <- unname(summary(HGLM_car, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_car[,1], ranef_car[,2], subset=order(ranef_car[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

nc$HGLM_ss_CAR <- ranef_car[,1]
tm_shape(nc) + tm_fill(c("HGLM_ss_CAR", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
  tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("CAR SSRE", "SAR SSRE"))

To use a generalized additive mixed model (mgcv::gam() with an "mrf" random effect), and some other mixed models, the areal entities need to be grouped (done in the first exercise), and we can try a flexible fit on the covariate:

nc$LM <- as.numeric(interaction(nc$L_id, nc$M_id))
aggLM <- aggregate(nc[,"LM"], list(nc$LM), head, n=1)
aggnb <- poly2nb(aggLM)
library(mgcv)
names(aggnb) <- as.character(aggLM$Group.1)
nc$LM <- as.factor(nc$LM)
GAM_mrf <- gam(SID74 ~ s(ft.NWBIR74) + s(LM, bs="mrf", xt=list(nb=aggnb)), offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(GAM_mrf)

And plot the covariate smooth term:

library(mgcv)

## Loading required package: nlme

## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.

plot(GAM_mrf)

The forest plot is obviously grouped too:

GAM_mrf_re <- predict(GAM_mrf, type="terms", se=TRUE)
metafor::forest(GAM_mrf_re$fit[,2], GAM_mrf_re$se.fit[,2], subset=order(GAM_mrf_re$fit[,1], decreasing=TRUE), 
        slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)

as is the RE map:

nc$GAM_mrf_re <- GAM_mrf_re$fit[,2]
tm_shape(nc) + tm_fill(c("GAM_mrf_re"), midpoint=c(0), title="Poisson GAM MRF RE")

The New York 8 county data set contains population standardized leukemia cases, with Z as a transformed rate:

NY8 <- st_read(system.file("shapes/NY8_utm18.shp", package="spData"))

## Reading layer `NY8_utm18' from data source `/home/rsb/lib/r_libs/spData/shapes/NY8_utm18.shp' using driver `ESRI Shapefile'
## Simple feature collection with 281 features and 17 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 358241.9 ymin: 4649755 xmax: 480393.1 ymax: 4808545
## epsg (SRID):    32618
## proj4string:    +proj=utm +zone=18 +ellps=WGS84 +units=m +no_defs

NY8 <- st_buffer(NY8, dist=0)

tm_shape(NY8) + tm_fill("Z", midpoint=0)

Create a neighbour object:

NY_nb <- poly2nb(NY8)
NY_lw <- nb2listw(NY_nb, style="B")

Check how the SAR and CAR models behave, with and without case weights:

mod1 <- spautolm(Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data=NY8, family="CAR", listw=NY_lw, weights=POP8)
summary(mod1)

## 
## Call: spautolm(formula = Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME, data = NY8, 
##     listw = NY_lw, weights = POP8, family = "CAR")
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1.500566 -0.274566  0.087494  0.454262  4.257081 
## 
## Coefficients: 
##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -0.783775   0.142645 -5.4946 3.916e-08
## PEXPOSURE    0.078887   0.027922  2.8253  0.004724
## PCTAGE65P    3.841221   0.573083  6.7027 2.046e-11
## PCTOWNHOME  -0.392319   0.154995 -2.5312  0.011368
## 
## Lambda: 0.011751 LR test value: 0.11668 p-value: 0.73266 
## Numerical Hessian standard error of lambda: 0.039471 
## 
## Log likelihood: -251.7067 
## ML residual variance (sigma squared): 1105.1, (sigma: 33.243)
## Number of observations: 281 
## Number of parameters estimated: 6 
## AIC: 515.41

This data set is used Waller and Gotway (2004), and in both editions of ASDAR. It is harder to deploy Poisson models because the cases are not integer counts.

Alam, Moudud, Lars Rönnegård, and Xia Shen. 2015. “Fitting Conditional and Simultaneous Autoregressive Spatial Models in Hglm.” The R Journal 7 (2): 5–18. https://doi.org/10.32614/RJ-2015-017.

Pace, RK, and JP LeSage. 2008. “A Spatial Hausman Test.” Economics Letters 101: 282–84.

Waller, Lance A., and Carol A. Gotway. 2004. Applied Spatial Statistics for Public Health Data. Hoboken, NJ: John Wiley & Sons.