11  Prediction and spatial models

Dubin (2003) presented a contrast from simulation between geostatistical prediction and prediction from fitted spatial econometrics models.

Bivand (2002) inquired whether prediction had been neglected in spatial econometrics. Other authors rather considered missing data in spatial econometrics models, as followed up by Kato (2013). Suesse and Zammit-Mangion (2017) and Suesse (2018) continue the examination of imputing missing spatial data Goulard, Laurent, and Thomas-Agnan (2017) review what was then known about prediction using spatial econometrics models, providing implementations of existing and proposed methods in spatialreg.

11.1 Prediction and impacts

In models fitted by OLS, the marginal effect of a unit change in a numerical independent variable is equal to the regression coefficient. Taking the mean difference of predicted values before and after the incrementation, we can see that this holds:

OLS_pre <- lm(form_pre_maj, data=eng324)
row.names(eng324) <- eng324$NAME
newdata <- eng324
p0 <- predict(OLS_pre, newdata=newdata)
newdata$dens <- exp(log(newdata$dens)+1)
p1 <- predict(OLS_pre, newdata=newdata)
mean(p1-p0)
[1] -0.01552116
coefficients(OLS_pre)["log(dens)"]
  log(dens) 
-0.01552116 

When the spatially lagged dependent variable is included in the model (SLM, SDM, GNM), and as we have just seen, global spill-over occurs, varying in intensity as the coefficient departs from zero.

library(spdep)
lw <- nb2listw(unb, style="W")
library(spatialreg)
e <- eigenw(lw)
W <- as(lw, "CsparseMatrix")
trMat <- trW(W, type="mult")
SLM_pre_maj <- lagsarlm(form_pre_maj, data=eng324, listw=lw, control=list(pre_eig=e))
SLM_post_maj <- lagsarlm(form_post_maj, data=eng324, listw=lw, control=list(pre_eig=e))
(imp_SLM_pre_maj_ev <- impacts(SLM_pre_maj, listw=lw, evalue=e))
Impact measures (lag, exact):
                    Direct     Indirect       Total
log(units)      0.97021645  0.166102763  1.13631922
house          -1.75150349 -0.299860477 -2.05136397
log(dens)      -0.02019100 -0.003456735 -0.02364774
Metroplondon    0.07432424  0.012724440  0.08704868
Metropmetrop   -0.05080358 -0.008697662 -0.05950124
log(realWgPre)  0.34764203  0.059516927  0.40715896
MajorityCONS    0.03796509  0.006499690  0.04446478
MajorityLAB     0.19524091  0.033425587  0.22866650

Doing this by hand and using the reduced form model: \hat{{\mathbf y}} = ({\mathbf I} - \rho {\mathbf W})^{-1}{\mathbf X}{\mathbf \beta}, we can see that the equality between the mean difference of before-and-after incrementation predictions is not with the regression coefficient for the chosen variable, byt with its total impact:

IrW <- invIrW(lw, rho=SLM_pre_maj$rho)
p0a <- IrW %*% SLM_pre_maj$X %*% coef(SLM_pre_maj)[-1]
nd <- SLM_pre_maj$X
nd[,4] <- nd[,4] + 1
p1 <- IrW %*% nd %*% coef(SLM_pre_maj)[-1]
mean(p1-p0a)
[1] -0.02364774
coefficients(SLM_pre_maj)["log(dens)"]
  log(dens) 
-0.02009104 
imp_SLM_pre_maj_ev$total[3]
[1] -0.02364774

Using the predict method described by Goulard, Laurent, and Thomas-Agnan (2017), we can see that this continues to hold:

p0 <- predict(SLM_pre_maj, newdata=eng324, listw=lw, pred.type="TS", legacy=FALSE)
all.equal(unclass(p0), c(p0a), check.attributes=FALSE)
[1] TRUE
p1 <- predict(SLM_pre_maj, newdata=newdata, listw=lw, pred.type="TS", legacy=FALSE)
mean(p1-p0)
[1] -0.02364774
imp_SLM_pre_maj_ev$total[3]
[1] -0.02364774

11.2 Predicting post-CCT with the pre-CCT model

We can also use prediction in a counterfactual setting, by changing the only time-varying independent variable (real wages) in the newdata object. The model is fitted with the pre-CCT data, but the predictions are made with the fitted pre-CCT model and post-CCT data:

pre_nd <- eng324
pre_nd$realWgPre <- pre_nd$realWgPst
pre_post <- predict(SLM_pre_maj, newdata=pre_nd, listw=lw, pred.type="TS", legacy=FALSE)
eng324$diff <- eng324$realNetPst-eng324$realNetPre
eng324$pre_diff <- eng324$realNetPst-exp(pre_post)
library(tmap)
Breaking News: tmap 3.x is retiring. Please test v4, e.g. with
remotes::install_github('r-tmap/tmap')
tm_shape(eng324) + tm_fill(c("diff", "pre_diff"), n=11, style="pretty", midpoint=0, palette="-RdBu", title="£ 000") + tm_borders() + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Post-CCT - Pre-CCT expenditure", "Post-CCT expenditure - prediction from pre-CCT model"))

Difference between real net expenditure post-CCT and pre-CCT (left), and predictions made with post-CCT independent variables and pre-CCT SLM model coefficients (right)

11.3 Out-of-sample prediction

The situation becomes more complex for out-of-sample prediction. If we make predictions for a set of observations not linked in space through the spatial weights object with the set of observations used to fit the model, there are no troublesome spill-overs between the fitting set and the prediction set. Recall from chapter 6 that:

There is now an extensive literature both on the split between training and test data sets, and on the use of fitted models for prediction to areas that were un- or under-represented in the data used to fit the model (Meyer et al. 2018, 2019; Valavi et al. 2019; Schratz et al. 2019; Meyer and Pebesma 2021, 2022; Mila et al. 2022; Linnenbrink et al. 2023). These articles are accompanied by software permitting the reproduction of their findings and the application of suggested adaptations, replacing random permutations in machine learning model fitting and tuning by spatially-aware procedures.

Here we will split England into the North region and the rest, modify the Metrop factor because no "london" level is present in the North, and divide the data and neighbour objects:

eng324$Metrop |> as.character() |> (\(d) sub("london", "metrop", x=d))() |> factor() -> eng324$Metrop
rest <- eng324$region != "NORTH"
eng324_rest <- eng324[rest,]
unb_rest <- subset(unb, rest)
unb_rest
Neighbour list object:
Number of regions: 297 
Number of nonzero links: 1454 
Percentage nonzero weights: 1.648358 
Average number of links: 4.895623 
lw_rest <- nb2listw(unb_rest, style="W")
e_rest <- eigenw(lw_rest)

Then we fit the rest of England subset using the formula without the political control variable, as there were no Conservative controlled local authorities in the North, and prepare the predictions made for the whole data set for mapping, allowing spill-over to occur between North and the rest of England:

SLM_rest <-lagsarlm(form_pre, data=eng324_rest, listw=lw_rest, control=list(pre_eig=e_rest))
eng324$rest_pred <- exp(c(unclass(predict(SLM_rest, newdata=eng324, listw=lw, pred.type="TS", legacy=FALSE))))
Warning in predict.Sarlm(SLM_rest, newdata = eng324, listw = lw, pred.type =
"TS", : some region.id are both in data and newdata

Next, we repeat the subsetting, fitting the same formula to North data, and predicting for the whole data set:

eng324_north <- eng324[!rest,]
unb_north <- subset(unb, !rest)
unb_north
Neighbour list object:
Number of regions: 27 
Number of nonzero links: 112 
Percentage nonzero weights: 15.36351 
Average number of links: 4.148148 
lw_north <- nb2listw(unb_north, style="W")
e_north <- eigenw(lw_north)
SLM_north <-lagsarlm(form_pre, data=eng324_north, listw=lw_north, control=list(pre_eig=e_north))
eng324$north_pred <- exp(c(unclass(predict(SLM_north, newdata=eng324, listw=lw, pred.type="TS", legacy=FALSE))))
Warning in predict.Sarlm(SLM_north, newdata = eng324, listw = lw, pred.type =
"TS", : some region.id are both in data and newdata
tm_shape(eng324) + tm_fill(c("rest_pred", "north_pred"), n=7, style="quantile", palette="YlOrBr", title="£ 000") + tm_borders(col="grey") + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Predicted pre-CCT expenditure, model without North region", "Predicted pre-CCT expenditure, North region model"))

Difference between real net expenditure post-CCT and pre-CCT (left), and predictions made with post-CCT independent variables and pre-CCT SLM model coefficients (right)