All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.
I am running R 4.1.2, with packages recently updated using update.packages().
needed <- c("stars", "abind", "raster", "spatstat", "spatstat.linnet", "spatstat.core",
"spatstat.geom", "spatstat.data", "ranger", "automap", "RColorBrewer", "gstat",
"ggplot2", "R2BayesX", "colorspace", "HSAR", "hglm", "sp", "lme4", "spfilteR",
"spmoran", "spatialreg", "spData", "tmap", "sf")
Script and data at https://github.com/rsbivand/ECS530_h21/raw/main/ECS530_211119.zip. Download to a suitable location, unzip, and use as the basis for this session.
Spatial filtering methods as developed by Daniel A. Griffith (2003) build on standard linear and generalized linear models supplemented with selected eigenvectors from the spatial weights matrix. In Patuelli et al. (2012), D. A. Griffith (2010) and D. A. Griffith and Paelinck (2011), examples were given of how standard and non-standard spatial econometric problems could be approached using spatial filtering. Tiefelsdorf and Griffith (2007) proposed that the eigenvectors for inclusion should be selected by their ability to reduce residual autocorrelation rather than to increase model fit. This approach was implemented by Yongwan Chun and Michael Tiefelsdorf in spdep and moved to spatialreg; it proceeds in two steps, first selecting eigenvectors taken from the spatial weights matrix doubly centred using the hat matrix of the actual regression, then using lm to fit the model, effectively removing residual autocorrelation:
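In sketch form (assuming the standard Tiefelsdorf-Griffith construction), with the annihilator of the actual regression \(M = I - X(X'X)^{-1}X'\), the candidate eigenvectors are those of the doubly centred matrix \[ M W M, \] where \(W\) is the symmetrized spatial weights matrix.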
The New York 8 county data set contains population standardized leukaemia cases, with Z as a transformed rate:
library(sf)
## Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0
NY8 <- st_read(system.file("shapes/NY8_utm18.shp", package="spData"))
## Reading layer `NY8_utm18' from data source
## `/home/rsb/lib/r_libs/spData/shapes/NY8_utm18.shp' using driver `ESRI Shapefile'
## Simple feature collection with 281 features and 17 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 358241.9 ymin: 4649755 xmax: 480393.1 ymax: 4808545
## Projected CRS: WGS 84 / UTM zone 18N
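# zero-distance buffer: a common workaround to repair invalid polygon geometries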
NY8 <- st_buffer(NY8, dist=0)
library(tmap)
## Registered S3 methods overwritten by 'stars':
## method from
## st_bbox.SpatRaster sf
## st_bbox.SpatVector sf
## st_crs.SpatRaster sf
## st_crs.SpatVector sf
tm_shape(NY8) + tm_fill("Z", midpoint=0)
Create a neighbour object:
NY_nb <- spdep::poly2nb(NY8)
NY_lwB <- spdep::nb2listw(NY_nb, style="B")
NY_lwW <- spdep::nb2listw(NY_nb, style="W")
Check how the SAR and CAR models behave, with and without case weights:
library(spatialreg)
## Loading required package: spData
## Loading required package: Matrix
gform <- Z ~ PEXPOSURE + PCTAGE65P + PCTOWNHOME
mod1 <- spautolm(gform, data=NY8, family="CAR", listw=NY_lwB, weights=POP8)
summary(mod1)
##
## Call: spautolm(formula = gform, data = NY8, listw = NY_lwB, weights = POP8,
## family = "CAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.500566 -0.274566 0.087494 0.454262 4.257081
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.783775 0.142645 -5.4946 3.916e-08
## PEXPOSURE 0.078887 0.027922 2.8253 0.004724
## PCTAGE65P 3.841221 0.573083 6.7027 2.046e-11
## PCTOWNHOME -0.392319 0.154995 -2.5312 0.011368
##
## Lambda: 0.011751 LR test value: 0.11668 p-value: 0.73266
## Numerical Hessian standard error of lambda: 0.039471
##
## Log likelihood: -251.7067
## ML residual variance (sigma squared): 1105.1, (sigma: 33.243)
## Number of observations: 281
## Number of parameters estimated: 6
## AIC: 515.41
This data set is used in Waller and Gotway (2004), and in both editions of ASDAR. Poisson models are harder to deploy here because the cases are not integer counts.
(SF0 <- SpatialFiltering(gform, data = NY8, nb=NY_nb, style="W"))
## Step SelEvec Eval MinMi ZMinMi Pr(ZI) R2
## 0 0 0 0.0000000 0.077160258 2.41881146 0.01557131 0.1931605
## 1 1 21 0.7617439 0.065249815 2.17039169 0.02997719 0.2069579
## 2 2 78 0.2132604 0.052273969 1.82565937 0.06790159 0.2708788
## 3 3 12 0.8593013 0.040520587 1.59403969 0.11092715 0.2813452
## 4 4 11 0.8763886 0.029200840 1.37568345 0.16891966 0.2909475
## 5 5 10 0.8936187 0.017791439 1.15578610 0.24776866 0.3001843
## 6 6 36 0.5975208 0.006126374 0.89180778 0.37249597 0.3139880
## 7 7 25 0.7045949 -0.005770779 0.63373494 0.52625382 0.3254772
## 8 8 8 0.9138669 -0.016486915 0.43237073 0.66547199 0.3332466
## 9 9 41 0.5555168 -0.027419742 0.18369582 0.85425209 0.3457514
## 10 10 4 0.9659973 -0.036297199 0.03794093 0.96973478 0.3515462
## gamma
## 0 0.0000000
## 1 -1.4302287
## 2 -3.0784210
## 3 -1.2456734
## 4 -1.1931510
## 5 1.1702215
## 6 -1.4305515
## 7 1.3051280
## 8 -1.0732479
## 9 1.3615854
## 10 0.9268825
SF_obj2 <- lm(update(gform, . ~ . + fitted(SF0)), data = NY8)
summary(SF_obj2)
##
## Call:
## lm(formula = update(gform, . ~ . + fitted(SF0)), data = NY8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42548 -0.38747 -0.03151 0.32462 3.11533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.51728 0.14478 -3.573 0.000419 ***
## PEXPOSURE 0.04884 0.03202 1.525 0.128366
## PCTAGE65P 3.95089 0.55290 7.146 8.54e-12 ***
## PCTOWNHOME -0.56004 0.15551 -3.601 0.000378 ***
## fitted(SF0)vec21 -1.43023 0.60005 -2.383 0.017848 *
## fitted(SF0)vec78 -3.07842 0.60005 -5.130 5.57e-07 ***
## fitted(SF0)vec12 -1.24567 0.60005 -2.076 0.038857 *
## fitted(SF0)vec11 -1.19315 0.60005 -1.988 0.047788 *
## fitted(SF0)vec10 1.17022 0.60005 1.950 0.052200 .
## fitted(SF0)vec36 -1.43055 0.60005 -2.384 0.017823 *
## fitted(SF0)vec25 1.30513 0.60005 2.175 0.030507 *
## fitted(SF0)vec8 -1.07325 0.60005 -1.789 0.074815 .
## fitted(SF0)vec41 1.36159 0.60005 2.269 0.024060 *
## fitted(SF0)vec4 0.92688 0.60005 1.545 0.123612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6001 on 267 degrees of freedom
## Multiple R-squared: 0.3515, Adjusted R-squared: 0.32
## F-statistic: 11.13 on 13 and 267 DF, p-value: < 2.2e-16
spdep::lm.morantest(SF_obj2, NY_lwW)
##
## Global Moran I for regression residuals
##
## data:
## model: lm(formula = update(gform, . ~ . + fitted(SF0)), data = NY8)
## weights: NY_lwW
##
## Moran I statistic standard deviate = 0.037941, p-value = 0.4849
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I Expectation Variance
## -0.036297199 -0.037620475 0.001216424
vecs <- fitted(SF0)
pvecs <- vecs %*% diag(coefficients(SF_obj2)[5:14])
colnames(pvecs) <- colnames(vecs)
Typically, the small subset of eigenvectors selected mops up spatial autocorrelation in the residual. S. Dray, Legendre, and Peres-Neto (2006) and D. A. Griffith and Peres-Neto (2006) adopt a similar approach in a generalized linear model context, implemented in spdep by Pedro Peres-Neto and now moved to spatialreg as ME, analogous with SpatialFiltering, but centering the spatial weights matrix on the null model hat matrix, and using bootstrap methods in evaluating the choice of eigenvectors. The correlations between the implied cumulated outcomes of these methods are shown in the correlation matrix below. Stéphane Dray et al. (2012) describe many of the underlying motivations, including the view that Moran eigenvector spatial filtering approaches may permit both spatial autocorrelation and spatial scale to be accommodated in a single model; a further implementation is given in Stéphane Dray et al. (2021).
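In sketch form, for the null (intercept-only) model the hat matrix is \(H_0 = \mathbf{1}\mathbf{1}'/n\), so the candidate Moran eigenvectors are those of \[ \left(I - \frac{\mathbf{1}\mathbf{1}'}{n}\right) W \left(I - \frac{\mathbf{1}\mathbf{1}'}{n}\right). \]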
set.seed(22)
(ME0 <- ME(gform, data = NY8, listw=NY_lwW, alpha=0.15, nsim=99))
## Eigenvector ZI pr(ZI)
## 0 NA NA 0.02
## 1 14 NA 0.05
## 2 10 NA 0.09
## 3 13 NA 0.19
ME_obj2 <- lm(update(gform, . ~ . + fitted(ME0)), data = NY8)
summary(ME_obj2)
##
## Call:
## lm(formula = update(gform, . ~ . + fitted(ME0)), data = NY8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8380 -0.4017 -0.0284 0.3515 3.9741
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.50990 0.16308 -3.127 0.00196 **
## PEXPOSURE 0.04796 0.03454 1.389 0.16611
## PCTAGE65P 3.98900 0.60341 6.611 1.99e-10 ***
## PCTOWNHOME -0.57883 0.17669 -3.276 0.00119 **
## fitted(ME0)vec14 1.68445 0.65969 2.553 0.01121 *
## fitted(ME0)vec10 -1.24178 0.67254 -1.846 0.06591 .
## fitted(ME0)vec13 -1.06217 0.64759 -1.640 0.10211
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6454 on 274 degrees of freedom
## Multiple R-squared: 0.2303, Adjusted R-squared: 0.2134
## F-statistic: 13.66 on 6 and 274 DF, p-value: 1.435e-13
mvecs <- fitted(ME0) %*% diag(coefficients(ME_obj2)[5:7])
sem_obj2a <- spautolm(gform, NY8, listw=NY_lwW)
W <- as(NY_lwW, "CsparseMatrix")
SAR_ssre <- 2*as.vector((sem_obj2a$lambda * W) %*% sem_obj2a$Y - (sem_obj2a$lambda * W) %*% (sem_obj2a$X %*% sem_obj2a$fit$coefficients))
Murakami and Griffith (2019) provide a more recent take on spatial filtering, implemented in Murakami (2021). This also appears to centre the spatial weights matrix on the null model hat matrix, but chooses, among the eigenvectors with positive eigenvalues, those that increase model fit most, up to a threshold to control overfitting, rather than those that reduce residual autocorrelation. The default approach uses an exponential variogram model to generate the weights matrix from planar coordinates. The meigen function subsets the full set of eigenvectors before the data are seen, then esf calls lm itself while further subsetting the eigenvectors.
library(spmoran)
centroids_utm <- st_centroid(st_geometry(NY8))
meig <- meigen(st_coordinates(centroids_utm))
## 33/281 eigen-pairs are extracted
y <- model.response(model.frame(gform, NY8))
X <- model.matrix(update(gform, ~ .), NY8)
(esf_obj2 <- esf(y, X, meig=meig))
## 13/33 eigenvectors are selected
## Call:
## esf(y = y, x = X, meig = meig)
##
## ----Coefficients------------------------------
## Estimate SE t_value p_value
## (Intercept) -0.645834939 0.1653420 -3.90605508 1.192357e-04
## PEXPOSURE -0.003338754 0.0393389 -0.08487155 9.324278e-01
## PCTAGE65P 3.815483576 0.6071078 6.28468921 1.350496e-09
## PCTOWNHOME -0.149187135 0.2024274 -0.73699095 4.617823e-01
##
## ----Spatial effects (residuals)---------------
## Estimate
## SE 0.25038503
## Moran.I/max(Moran.I) 0.09378257
##
## ----Error statistics--------------------------
## stat
## resid_SE 0.6337987
## adjR2 0.2413396
## logLik -261.8110562
## AIC 559.6221123
## BIC 625.1124964
esf_obj2$r
## Estimate SE t_value p_value
## sf4 2.1178467 0.7512534 2.819084 0.005180899
## sf16 1.1685835 0.6394087 1.827600 0.068738267
## sf10 1.5165780 0.6639666 2.284118 0.023158813
## sf7 -0.9797156 0.6452891 -1.518258 0.130145881
## sf3 -1.2546505 0.6715339 -1.868335 0.062822932
## sf23 -0.9807425 0.6421764 -1.527217 0.127904327
## sf17 -0.9867262 0.6344423 -1.555266 0.121080772
## sf30 -0.9672723 0.6339788 -1.525717 0.128277437
## sf19 1.0235518 0.6477160 1.580248 0.115247446
## sf18 0.9472687 0.6429683 1.473274 0.141868377
## sf13 0.8214887 0.6437379 1.276123 0.203033580
## sf8 -0.8464719 0.6536142 -1.295063 0.196429731
## sf12 0.8134187 0.6831693 1.190655 0.234858689
r_used_eigs <- row.names(esf_obj2$r)
used_eigs <- as.integer(substring(r_used_eigs, 3))
evecs <- meig$sf[, used_eigs] %*% diag(esf_obj2$r[1:length(r_used_eigs),1])
colnames(evecs) <- r_used_eigs
Juhl (2021a) describes another implementation (Juhl 2021b), permitting the choice of function to be optimised:
library(spfilteR)
##
## Attaching package: 'spfilteR'
## The following object is masked _by_ '.GlobalEnv':
##
## W
set.seed(1)
spfMI_obj <- lmFilter(y=y, x=X[,-1], W=as.matrix(W), objfn="MI", positive = FALSE, tol = .2)
summary(spfMI_obj)
##
## - Spatial Filtering with Eigenvectors (Linear Model) -
##
## Coefficients (OLS):
## Estimate SE p-value
## (Intercept) -0.5904074860 0.19008472 2.102977e-03 **
## PEXPOSURE 0.0003415117 0.05288584 9.948526e-01
## PCTAGE65P 3.7422103815 0.59569126 1.369974e-09 ***
## PCTOWNHOME -0.2377502591 0.20014511 2.359446e-01
##
## Adjusted R-squared:
## Initial Filtered
## 0.1844222 0.2853227
##
## Filtered for positive spatial autocorrelation
## 13 out of 75 candidate eigenvectors selected
## Objective Function: "MI"
##
## Moran's I (Residuals):
## Observed Expected Variance z p-value
## Initial 0.07716026 -0.009757607 0.001294237 2.41602865 0.007845413 **
## Filtered -0.04232760 -0.046289688 0.001933631 0.09010271 0.464102797
spfvecs <- spfMI_obj$selvecs %*% diag(spfMI_obj$EV[,1])
SEs <- cbind("ESF"=apply(evecs, 1, sum), "SF"=apply(pvecs, 1, sum), "SPF"=apply(spfvecs, 1, sum), "ME"=apply(mvecs, 1, sum), "SAR"=SAR_ssre)
NY8_SEs <- cbind(NY8, SEs)
mat <- cor(SEs)
colnames(mat) <- rownames(mat) <- c("ESF", "SF", "SPF", "ME", "SAR")
mat
## ESF SF SPF ME SAR
## ESF 1.0000000 0.3452012 0.6939342 0.4634257 0.6032419
## SF 0.3452012 1.0000000 0.4450188 0.4639506 0.5144771
## SPF 0.6939342 0.4450188 1.0000000 0.4840028 0.6143533
## ME 0.4634257 0.4639506 0.4840028 1.0000000 0.3907094
## SAR 0.6032419 0.5144771 0.6143533 0.3907094 1.0000000
The table shows the correlations between the four cumulated spatial filtering approaches and the spatially structured term implied by the ML estimates of the spatial error model. Differences are related to the weighting of the linear model, and to centring on the full or null model. As can be seen, they are broadly similar to each other, so the choice of approach may be fairly flexible, relating more to the needs of users and their domain usages than to a single body of theory.
tm_shape(NY8_SEs) + tm_fill(c("ESF", "SF", "SPF", "ME", "SAR"), n=8, palette="BrBG", midpoint=0, title="Spatial effect") + tm_facets(free.scales=FALSE) + tm_borders(lwd=0.5) + tm_layout(panel.labels=c("ESF", "SF", "SPF", "ME", "SAR"), bg="grey90")
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "EPSG:4267"
row.names(nc) <- as.character(nc$FIPSNO)
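# Freeman-Tukey square root transformations of the SIDS and non-white birth rates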
nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
nc$ft.NWBIR74 <- sqrt(1000)*(sqrt(nc$NWBIR74/nc$BIR74) + sqrt((nc$NWBIR74+1)/nc$BIR74))
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- spdep::read.gal(gal_file, region.id=nc$FIPSNO)
nc_lw <- spdep::nb2listw(ncCR85, style="B")
nc_W <- as(nc_lw, "CsparseMatrix")
E <- nc$BIR74 * sum(nc$SID74)/sum(nc$BIR74)
set.seed(1001)
(ME_nc_p <- ME(SID74 ~ ft.NWBIR74, offset=log(E), weights=BIR74, data=nc, family=poisson(link=log), listw=nc_lw, alpha=0.25, nsim=499))
## Eigenvector ZI pr(ZI)
## 0 NA NA 0.236
## 1 29 NA 0.376
ME_obj_nc0 <- glm(SID74 ~ ft.NWBIR74, offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(ME_obj_nc0)
##
## Call:
## glm(formula = SID74 ~ ft.NWBIR74, family = poisson(link = log),
## data = nc, weights = BIR74, offset = log(E))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -313.610 -24.400 5.395 44.517 300.187
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.170e+00 2.524e-03 -463.5 <2e-16 ***
## ft.NWBIR74 2.988e-02 6.708e-05 445.5 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 789476 on 99 degrees of freedom
## Residual deviance: 588642 on 98 degrees of freedom
## AIC: 1934183
##
## Number of Fisher Scoring iterations: 4
ME_obj_nc <- glm(update(SID74 ~ ft.NWBIR74, . ~ . + fitted(ME_nc_p)), offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(ME_obj_nc)
##
## Call:
## glm(formula = update(SID74 ~ ft.NWBIR74, . ~ . + fitted(ME_nc_p)),
## family = poisson(link = log), data = nc, weights = BIR74,
## offset = log(E))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -336.68 -27.88 2.30 37.86 316.34
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.026e+00 2.548e-03 -402.7 <2e-16 ***
## ft.NWBIR74 2.738e-02 6.622e-05 413.5 <2e-16 ***
## fitted(ME_nc_p) -9.998e-01 5.081e-03 -196.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 789476 on 99 degrees of freedom
## Residual deviance: 550383 on 97 degrees of freedom
## AIC: 1895926
##
## Number of Fisher Scoring iterations: 4
anova(ME_obj_nc0, ME_obj_nc, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: SID74 ~ ft.NWBIR74
## Model 2: SID74 ~ ft.NWBIR74 + fitted(ME_nc_p)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 98 588642
## 2 97 550383 1 38259 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
boston_506 <- st_read(system.file("shapes/boston_tracts.shp", package="spData")[1])
## Reading layer `boston_tracts' from data source
## `/home/rsb/lib/r_libs/spData/shapes/boston_tracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 506 features and 36 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## Geodetic CRS: NAD27
nb_q <- spdep::poly2nb(boston_506)
lw_q <- spdep::nb2listw(nb_q, style="W")
boston_506$CHAS <- as.factor(boston_506$CHAS)
boston_489 <- boston_506[!is.na(boston_506$median),]
nb_q_489 <- spdep::poly2nb(boston_489)
lw_q_489 <- spdep::nb2listw(nb_q_489, style="W", zero.policy=TRUE)
agg_96 <- list(as.character(boston_506$NOX_ID))
boston_96 <- aggregate(boston_506[, "NOX_ID"], by=agg_96, unique)
nb_q_96 <- spdep::poly2nb(boston_96)
lw_q_96 <- spdep::nb2listw(nb_q_96)
boston_96$NOX <- aggregate(boston_506$NOX, agg_96, mean)$x
boston_96$CHAS <- aggregate(as.integer(boston_506$CHAS)-1, agg_96, max)$x
nms <- names(boston_506)
ccounts <- 23:31
for (nm in nms[c(22, ccounts, 36)]) {
boston_96[[nm]] <- aggregate(boston_506[[nm]], agg_96, sum)$x
}
br2 <- c(3.50, 6.25, 8.75, 12.50, 17.50, 22.50, 30.00, 42.50, 60.00)*1000
counts <- as.data.frame(boston_96)[, nms[ccounts]]
f <- function(x) matrixStats::weightedMedian(x=br2, w=x, interpolate=TRUE)
boston_96$median <- apply(counts, 1, f)
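# weighted medians above 50000 fall in the open-ended top class, so are treated as censored (NA)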
is.na(boston_96$median) <- boston_96$median > 50000
summary(boston_96$median)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 9009 20417 23523 25263 30073 49496 2
boston_94 <- boston_96[!is.na(boston_96$median),]
nb_q_94 <- spdep::subset.nb(nb_q_96, !is.na(boston_96$median))
lw_q_94 <- spdep::nb2listw(nb_q_94, style="W")
There is a large literature in spatial epidemiology using CAR and ICAR models in spatially structured random effects. These extend to multilevel models, in which the spatially structured random effects may apply at different levels of the model (R. S. Bivand et al. 2017). In order to try out some of the variants, we need to remove the no-neighbour observations from the tract level, and from the model output zone aggregated level, in two steps as reducing the tract level induces a no-neighbour outcome at the model output zone level.
boston_94a <- aggregate(boston_489[,"NOX_ID"], list(boston_489$NOX_ID), unique)
nb_q_94a <- spdep::poly2nb(boston_94a)
NOX_ID_no_neighs <- boston_94a$NOX_ID[which(spdep::card(nb_q_94a) == 0)]
boston_487 <- boston_489[is.na(match(boston_489$NOX_ID, NOX_ID_no_neighs)),]
boston_93 <- aggregate(boston_487[, "NOX_ID"], list(ids = boston_487$NOX_ID), unique)
row.names(boston_93) <- as.character(boston_93$NOX_ID)
nb_q_93 <- spdep::poly2nb(boston_93, row.names=unique(as.character(boston_93$NOX_ID)))
The lme4 package lets us add an IID unstructured random effect at the model output zone level:
form <- formula(log(median) ~ CRIM + ZN + INDUS + CHAS + I((NOX*10)^2) + I(RM^2) +
AGE + log(DIS) + log(RAD) + TAX + PTRATIO + I(BB/100) +
log(I(LSTAT/100)))
The ZN, INDUS, NOX, RAD, TAX and PTRATIO variables show effectively no variability within the TASSIM zones, so in a multilevel model the random effect may absorb their influence.
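As a quick check of this claim (a sketch, not part of the original script), the largest within-zone standard deviation of each of these covariates should be near zero:
# for each covariate, the maximum standard deviation taken within NOX_ID zones
sapply(c("ZN", "INDUS", "NOX", "RAD", "TAX", "PTRATIO"), function(v)
  max(tapply(boston_487[[v]], boston_487$NOX_ID, sd), na.rm=TRUE))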
library(lme4)
MLM <- lmer(update(form, . ~ . + (1 | NOX_ID)), data=boston_487, REML=FALSE)
copying the random effect into the "sf" object for mapping below:
boston_93$MLM_re <- ranef(MLM)[[1]][,1]
Two packages, hglm and HSAR, offer SAR upper level spatially structured random effects, and require the specification of a sparse matrix mapping the upper level entities onto the lower level entities, as well as sparse binary weights matrices:
library(Matrix)
suppressMessages(library(MatrixModels))
Delta <- as(model.Matrix(~ -1 + as.factor(NOX_ID), data=boston_487, sparse=TRUE),
"dgCMatrix")
M <- as(spdep::nb2listw(nb_q_93, style="B"), "CsparseMatrix")
The extension of hglm to the sparse spatial setting broadened its facilities (Alam, Rönnegård, and Shen 2015), and also permits the modelling of discrete responses. First we fit an IID random effect:
suppressPackageStartupMessages(library(hglm))
y_hglm <- log(boston_487$median)
X_hglm <- model.matrix(lm(form, data=boston_487))
suppressWarnings(HGLM_iid <- hglm(y=y_hglm, X=X_hglm, Z=Delta))
followed by a SAR model at the upper level (corresponding to a spatial error (SEM) model), which reports the spatially structured random effect without fully converging, so coefficient standard errors are not available:
suppressWarnings(HGLM_sar <- hglm(y=y_hglm, X=X_hglm, Z=Delta, rand.family=SAR(D=M)))
boston_93$HGLM_re <- unname(HGLM_iid$ranef)
boston_93$HGLM_ss <- HGLM_sar$ranef[,1]
The HSAR package is restricted to the Gaussian response case, and fits an upper level SEM using MCMC; if a lower level weights matrix is passed as W=, it will also fit a lower level SLM (Guanpeng Dong and Harris 2015; G. Dong et al. 2015):
library(HSAR)
suppressWarnings(HSAR <- hsar(form, data=boston_487, W=NULL, M=M, Delta=Delta,
burnin=500, Nsim=2500, thinning=1))
boston_93$HSAR_ss <- HSAR$Mus[1,]
The R2BayesX package provides flexible support for structured additive regression models, including spatial multilevel models. The models include an IID unstructured random effect at the upper level using the "re" specification (Umlauf et al. 2015); we choose the "MCMC" method:
suppressPackageStartupMessages(library(R2BayesX))
BX_iid <- bayesx(update(form, . ~ . + sx(NOX_ID, bs="re")), family="gaussian",
data=boston_487, method="MCMC", iterations=12000, burnin=2000, step=2, seed=123)
boston_93$BX_re <- BX_iid$effects["sx(NOX_ID):re"][[1]]$Mean
and the "mrf"
(Markov random field) spatially structured random effect specification based on a graph derived from converting a suitable "nb"
object for the upper level. The "region.id"
attribute of the "nb"
object needs to contain values corresponding the the indexing variable.
RBX_gra <- nb2gra(nb_q_93)
BX_mrf <- bayesx(update(form, . ~ . + sx(NOX_ID, bs="mrf", map=RBX_gra)),
family="gaussian", data=boston_487, method="MCMC",
iterations=12000, burnin=2000,step=2, seed=123)
boston_93$BX_ss <- BX_mrf$effects["sx(NOX_ID):mrf"][[1]]$Mean
In a very similar way, mgcv::gam() can take an "mrf" term using a suitable "nb" object for the upper level. In this case the "nb" object needs to have the contents of the "region.id" attribute copied as the names of the neighbour list components, and the indexing variable needs to be a factor (Wood 2017) (the "REML" method of bayesx() gives the same result here):
library(mgcv)
names(nb_q_93) <- attr(nb_q_93, "region.id")
boston_487$NOX_ID <- as.factor(boston_487$NOX_ID)
GAM_MRF <- gam(update(form, . ~ . + s(NOX_ID, bs="mrf", xt=list(nb=nb_q_93))),
data=boston_487, method="REML")
boston_93$GAM_ss <- aggregate(predict(GAM_MRF, type="terms", se=FALSE)[,14],
list(boston_487$NOX_ID), mean)$x
In the cases of hglm(), bayesx() and gam(), we could also model discrete responses without further major difficulty, and bayesx() and gam() also facilitate the generalization of functional form fitting for included covariates.
res <- rbind(iid_lmer=summary(MLM)$coefficients[6, 1:2],
iid_hglm=summary(HGLM_iid)$FixCoefMat[6, 1:2],
iid_BX=BX_iid$fixed.effects[6, 1:2],
sar_hsar=c(HSAR$Mbetas[1, 6], HSAR$SDbetas[1, 6]),
mrf_BX=BX_mrf$fixed.effects[6, 1:2],
mrf_GAM=c(summary(GAM_MRF)$p.coeff[6], summary(GAM_MRF)$se[6]))
Unfortunately, the coefficient estimates for the air pollution variable from these multilevel models are not helpful. All remain negative, but the inclusion of the model output zone level effects, be they IID or spatially structured, suggests that it is hard to disentangle the influence of the scale of observation from that of covariates observed at that scale.
suppressPackageStartupMessages(library(ggplot2))
df_res <- as.data.frame(res)
names(df_res) <- c("mean", "sd")
limits <- aes(ymax = mean + qnorm(0.975)*sd, ymin=mean + qnorm(0.025)*sd)
df_res$model <- row.names(df_res)
p <- ggplot(df_res, aes(y=mean, x=model)) + geom_point() + geom_errorbar(limits) + geom_hline(yintercept = 0, col="#EB811B") + coord_flip()
p + ggtitle("NOX coefficients and error bars") + theme(plot.background = element_rect(fill = "transparent",colour = NA), legend.background = element_rect(colour = NA, fill = "transparent"))
This map shows that the model output zone level IID random effects are very similar across the three model fitting functions reported.
library(tmap)
tm_shape(boston_93) + tm_fill(c("MLM_re", "HGLM_re" , "BX_re"), midpoint=0, title="IID") + tm_facets(free.scales=FALSE) + tm_borders(lwd=0.3, alpha=0.4) + tm_layout(panel.labels=c("MLM", "HGLM", "BX"))
The spatially structured SAR and MRF random effects (MRF term in the gam()
case) are also very similar, with the MRF somewhat less smoothed than the SAR values.
tm_shape(boston_93) + tm_fill(c("HGLM_ss", "HSAR_ss", "BX_ss", "GAM_ss"), midpoint=0, title="SS") + tm_facets(free.scales=FALSE) + tm_borders(lwd=0.3, alpha=0.4) + tm_layout(panel.labels=c("HGLM SAR", "HSAR SAR", "BX MRF", "GAM MRF"))
Although there is still a great need for more thorough comparative studies of model fitting functions for spatial regression, there has been much progress over recent years. A proper set of comparisons has not been completed for this data set; R. S. Bivand et al. (2017) and R. Bivand (2017) contain comparisons for Beijing land parcels and a housing data set for SW Norway.
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "EPSG:4267"
row.names(nc) <- as.character(nc$FIPSNO)
nc$ft.SID74 <- sqrt(1000)*(sqrt(nc$SID74/nc$BIR74) + sqrt((nc$SID74+1)/nc$BIR74))
nc$ft.NWBIR74 <- sqrt(1000)*(sqrt(nc$NWBIR74/nc$BIR74) + sqrt((nc$NWBIR74+1)/nc$BIR74))
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- spdep::read.gal(gal_file, region.id=nc$FIPSNO)
nc_lw <- spdep::nb2listw(ncCR85, style="B")
nc_W <- as(nc_lw, "CsparseMatrix")
The approach taken by social scientists, including economists and some others, has been to address this through simultaneous autoregressive models, where the response is modelled using fixed covariates, and the residual process is modelled by optimising a log likelihood function. The spatialreg package provides spautolm() and errorsarlm():
m1 <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="SAR")
When we include the covariate, the spatial error coefficient contributes little:
summary(m1)
##
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, family = "SAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83149 -0.44934 0.09456 0.56278 2.90208
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5816687 0.2507427 6.3079 2.828e-10
## ft.NWBIR74 0.0366864 0.0069608 5.2704 1.361e-07
##
## Lambda: 0.035437 LR test value: 1.787 p-value: 0.18129
## Numerical Hessian standard error of lambda: 0.026096
##
## Log likelihood: -119.6704
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.34
It is unusual to present maps of the spatially structured random effect in cases where the simultaneous autoregressive approach is used, but it is fully possible using components of the returned model object (the vector is doubled for comparison with the CAR version):
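In sketch form, the code below computes twice the spatially lagged residual implied by the fitted SAR error process, \[ 2\,\hat{\lambda} W (y - X\hat{\beta}), \] with the doubling easing comparison with the CAR version.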
nc$SAR_ssre <- 2*as.vector((m1$lambda * nc_W) %*% m1$Y - (m1$lambda * nc_W) %*% (m1$X %*% m1$fit$coefficients))
library(tmap)
tm_shape(nc) + tm_fill(c("ft.SID74", "SAR_ssre"), midpoint=c(NA, 0))
The other maximum likelihood implementation gives the same results, but provides a Hausman test for shifts in the covariate coefficients between the aspatial and spatial estimates (Pace and LeSage 2008); there is none:
m1a <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw)
summary(m1a, Hausman=TRUE)
##
## Call:errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83149 -0.44934 0.09456 0.56278 2.90208
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5816687 0.2507427 6.3079 2.828e-10
## ft.NWBIR74 0.0366864 0.0069608 5.2704 1.361e-07
##
## Lambda: 0.035437, LR test value: 1.787, p-value: 0.18129
## Asymptotic standard error: 0.029167
## z-value: 1.2149, p-value: 0.22439
## Wald statistic: 1.4761, p-value: 0.22439
##
## Log likelihood: -119.6704 for error model
## ML residual variance (sigma squared): 1304.4, (sigma: 36.117)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.34, (AIC for lm: 247.13)
## Hausman test: 1.565, df: 2, p-value: 0.45726
It also lets us add Durbin= terms, that is, spatially lagged covariates:
m1b <- errorsarlm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, Durbin=TRUE)
summary(m1b, Hausman=TRUE)
##
## Call:errorsarlm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, Durbin = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8705347 -0.4488357 0.0094254 0.4864939 2.7406880
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.28212888 0.33548214 6.8025 1.028e-11
## ft.NWBIR74 0.03532229 0.00811453 4.3530 1.343e-05
## lag.(Intercept) -0.15336745 0.06073121 -2.5253 0.01156
## lag.ft.NWBIR74 0.00098902 0.00159099 0.6216 0.53418
##
## Lambda: 0.028053, LR test value: 1.036, p-value: 0.30874
## Asymptotic standard error: 0.029824
## z-value: 0.94064, p-value: 0.34689
## Wald statistic: 0.8848, p-value: 0.34689
##
## Log likelihood: -115.1974 for error model
## ML residual variance (sigma squared): 1195.8, (sigma: 34.58)
## Number of observations: 100
## Number of parameters estimated: 6
## AIC: 242.39, (AIC for lm: 241.43)
## Hausman test: 1.8315, df: 4, p-value: 0.76671
and to present the by-covariate effects taking into account the unlagged (direct) and lagged (indirect) covariates and the sum of the coefficients (total):
summary(impacts(m1b))
## Impact measures (SDEM, estimable, n):
## Direct Indirect Total
## ft.NWBIR74 0.03532229 0.0009890218 0.03631131
## ========================================================
## Standard errors:
## Direct Indirect Total
## ft.NWBIR74 0.008114531 0.001590986 0.007277556
## ========================================================
## Z-values:
## Direct Indirect Total
## ft.NWBIR74 4.352968 0.6216407 4.989493
##
## p-values:
## Direct Indirect Total
## ft.NWBIR74 1.3431e-05 0.53418 6.0538e-07
However, our model may suffer from not using a mixed model approach to a count response; the simultaneous autoregressive models are mostly used with Gaussian responses. One possibility is to employ the hierarchical generalized linear model approach from the hglm package. First we’ll fit an unstructured IID (independent and identically distributed) random effect (Alam, Rönnegård, and Shen 2015):
library(hglm)
E <- nc$BIR74 * sum(nc$SID74)/sum(nc$BIR74)
HGLM_iid <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log))
The random effects also have their own standard errors, so we can order and display them with error bars in a forest or caterpillar plot:
ranef_iid <- unname(summary(HGLM_iid, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_iid[,1], ranef_iid[,2], subset=order(ranef_iid[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
We can also fit a weighted Poisson simultaneous autoregressive model, and examine the random effects:
HGLM_sar <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log), rand.family=SAR(D=nc_W))
ranef_sar <- unname(summary(HGLM_sar, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_sar[,1], ranef_sar[,2], subset=order(ranef_sar[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
There is not much difference between the IID and the SAR spatially structured random effects:
nc$HGLM_re <- ranef_iid[,1]
nc$HGLM_ss_SAR <- ranef_sar[,1]
tm_shape(nc) + tm_fill(c("HGLM_re", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("IID", "SAR SSRE"))
Most epidemiological applications use conditional autoregressive approaches, where some (like the spautolm() and hglm() implementations) fit a spatial coefficient, but many fit an intrinsic CAR. First the spautolm() and hglm() implementations:
m1c <- spautolm(ft.SID74 ~ ft.NWBIR74, weights=BIR74, data=nc, listw=nc_lw, family="CAR")
summary(m1c)
##
## Call: spautolm(formula = ft.SID74 ~ ft.NWBIR74, data = nc, listw = nc_lw,
## weights = BIR74, family = "CAR")
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.890805 -0.483471 0.070376 0.575906 2.824220
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4561763 0.2499558 5.8257 5.686e-09
## ft.NWBIR74 0.0410281 0.0069417 5.9103 3.414e-09
##
## Lambda: 0.06028 LR test value: 1.6364 p-value: 0.20082
## Numerical Hessian standard error of lambda: 0.053566
##
## Log likelihood: -119.7458
## ML residual variance (sigma squared): 1302, (sigma: 36.083)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 247.49
Again, we can calculate something that represents the spatial patterning of the spatial process (termed "signal" in the documentation of spatialreg::predict.sarlm), but it is not scaled in the same way as the hglm() random effects (and importantly we do not have standard errors):
nc$CAR_ssre <- as.vector((m1c$lambda * nc_W) %*% m1c$Y -
(m1c$lambda * nc_W) %*% (m1c$X %*% m1c$fit$coefficients))
tm_shape(nc) + tm_fill(c("SAR_ssre", "CAR_ssre"), midpoint=c(0), title="Gauss ML RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("SAR SSRE", "CAR SSRE"))
Fitting the HGLM CAR model is just like the SAR model, and the forest plot of the spatially structured random effect is similar. Recall that spautolm() is fitting a Gaussian model, but hglm() is fitting a Poisson model, arguably better suited to count data. This means that the scalings of the random effects will differ:
HGLM_car <- hglm(fixed=SID74 ~ ft.NWBIR74, random= ~ 1|CRESS_ID, offset=log(E), weights=BIR74,
data=nc, family=poisson(link=log), rand.family=CAR(D=nc_W))
ranef_car <- unname(summary(HGLM_car, print.ranef=TRUE)$RandCoefMat)
metafor::forest(ranef_car[,1], ranef_car[,2], subset=order(ranef_car[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
nc$HGLM_ss_CAR <- ranef_car[,1]
tm_shape(nc) + tm_fill(c("HGLM_ss_CAR", "HGLM_ss_SAR"), midpoint=c(0), title="Poisson HGLM RE") +
tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("CAR SSRE", "SAR SSRE"))
To use a generalized additive mixed model (mgcv::gam() with an "mrf" random effect), and some other mixed models, the areal entities need to be grouped (done in the first exercise), and we can try a flexible fit on the covariate:
nc$LM <- as.numeric(interaction(nc$L_id, nc$M_id))
aggLM <- aggregate(nc[,"LM"], list(nc$LM), head, n=1)
aggnb <- spdep::poly2nb(aggLM)
library(mgcv)
names(aggnb) <- as.character(aggLM$Group.1)
nc$LM <- as.factor(nc$LM)
GAM_mrf <- gam(SID74 ~ s(ft.NWBIR74) + s(LM, bs="mrf", xt=list(nb=aggnb)), offset=log(E), weights=BIR74, data=nc, family=poisson(link=log))
summary(GAM_mrf)
And plot the covariate smooth term:
library(mgcv)
plot(GAM_mrf)
The forest plot is obviously grouped too:
GAM_mrf_re <- predict(GAM_mrf, type="terms", se=TRUE)
metafor::forest(GAM_mrf_re$fit[,2], GAM_mrf_re$se.fit[,2], subset=order(GAM_mrf_re$fit[,1], decreasing=TRUE),
slab=NA, annotate=FALSE, lty=c("solid","blank"), pch=19, psize=2, cex.lab=1, cex.axis=1)
as is the RE map:
nc$GAM_mrf_re <- GAM_mrf_re$fit[,2]
tm_shape(nc) + tm_fill(c("GAM_mrf_re"), midpoint=c(0), title="Poisson GAM MRF RE")
df_tracts <- st_read("df_tracts.gpkg")
## Reading layer `df_tracts' from data source
## `/home/rsb/und/ecs530/2021/df_tracts.gpkg' using driver `GPKG'
## Simple feature collection with 71353 features and 28 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS: NAD83
tr_form <- log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) + log1p(hisp_rate) + log1p(group_pop) + log1p(dens)
df_tracts$ST <- factor(substring(df_tracts$GEOID, 1, 2))
library(mgcv)
GAM_RE <- gam(update(tr_form, . ~ . + s(ST, bs="re")), data=df_tracts)
summary(GAM_RE)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) +
## log1p(hisp_rate) + log1p(group_pop) + log1p(dens) + s(ST,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.410054 0.039241 -163.35 <2e-16 ***
## log1p(vacancy_rate) 0.356864 0.025239 14.14 <2e-16 ***
## log1p(old_rate) 0.426732 0.004054 105.27 <2e-16 ***
## log1p(black_rate) 0.434330 0.012156 35.73 <2e-16 ***
## log1p(hisp_rate) 0.515721 0.014432 35.73 <2e-16 ***
## log1p(group_pop) 0.024723 0.000792 31.21 <2e-16 ***
## log1p(dens) 0.045569 0.002023 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(ST) 46.79 48 49.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.255 Deviance explained = 25.5%
## GCV = 0.19839 Scale est. = 0.19824 n = 71353
states <- aggregate(df_tracts["ST"], list(df_tracts$ST), head, n=1)
nb_st <- spdep::poly2nb(states, row.names=as.character(states$ST))
names(nb_st) <- as.character(states$ST)
GAM_MRF <- gam(update(tr_form, . ~ . + s(ST, bs="mrf", k=49, xt=list(nb=nb_st))), data=df_tracts)
summary(GAM_MRF)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) +
## log1p(hisp_rate) + log1p(group_pop) + log1p(dens) + s(ST,
## bs = "mrf", k = 49, xt = list(nb = nb_st))
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.4018825 0.0356203 -179.73 <2e-16 ***
## log1p(vacancy_rate) 0.3572201 0.0252390 14.15 <2e-16 ***
## log1p(old_rate) 0.4270174 0.0040538 105.34 <2e-16 ***
## log1p(black_rate) 0.4318845 0.0121718 35.48 <2e-16 ***
## log1p(hisp_rate) 0.5151583 0.0144364 35.69 <2e-16 ***
## log1p(group_pop) 0.0247906 0.0007921 31.30 <2e-16 ***
## log1p(dens) 0.0456938 0.0020231 22.59 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(ST) 46.25 48 50.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.255 Deviance explained = 25.5%
## GCV = 0.19839 Scale est. = 0.19824 n = 71353
cat(capture.output(print(anova(GAM_RE, GAM_MRF, test="Chisq")))[c(1, 2, 9:11)], sep="\n")
## Analysis of Deviance Table
##
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 71298 14134
## 2 71298 14134 -0.081294 -0.11743 0.03742 *
What is interpolation, and what does it require with regard to data, observation and prediction locations?
The support of the observation and prediction locations - what is it, and how does it affect interpolation?
Standard models for interpolation: deterministic and geostatistical prediction for continuous variables
Extensions to the standard model for non-standard data and non-planar locations
Geostatistics is a bit like the alchemy of spatial statistics, focussed more on prediction than model fitting
Since the reason for modelling is chiefly prediction in pre-model-based geostatistics, and to a good extent in model-based geostatistics, we’ll also keep to interpolation here
Interpolation is trying to make as good guesses as possible of the values of the variable of interest for places where there are no observations (can be in 1, 2, 3, \(\ldots\) dimensions)
These are based on the relative positions of places with observations and places for which predictions are required, and the observed values at observations
Interpolation for point data involves estimates of a variable (or variables) from known observations at known positions to unknown positions
If the observation and prediction positions have point support, then we are dealing with interpolation for point data
The underlying assumptions for all interpolation include distributional assumptions about the data, assumptions about the quality of the data and covariates, and assumptions about the quality of positional support and measurement
The standard model assumes planar geometry, error-free coordinates and covariates, and data that are observed as directly as possible from the same realization of the data generation process
Since predictions are only of use if we know how far we can rely on them, all interpolation should try to accommodate uncertainty about measurements
If we are binning to try to obtain "observations", we are going much further than experimental scientists do in designing model matrices; we have much less control over the values of the predictors
Model-based geostatistics are among the techniques that can be used, but we can start from block kriging to get a feel for COSP
Very likely, hierarchical or other Markov chain Monte Carlo (MCMC) models will be required to specify the errors to propagate
We could say that we know very little about any spatial process in the DGP, and take the value at each prediction point as the value at the nearest observation, disregarding covariates
We could extend this by taking some measure of the values observed at nearby observations, for example a median or average
Inverse distance weighting (IDW) with a power parameter interpolates by taking the distance-weighted mean of all, or a local region of, observations (a minimal sketch follows this list of approaches)
Trend surfaces fit a surface through the observations using a polynomial in the point coordinates of the observations; splines also fit a surface based on local fits
While splines may have a statistical interpretation, they are often used deterministically; as with IDW and trend surfaces, the fitted model may be chosen by cross-validation (e.g. Geostatistical Analyst in ArcGIS)
It is also perfectly possible to fit an aspatial statistical model with covariates (linear, GLM, GAM, CART, etc.), and predict to locations for which observations on the covariates are available
Many of these can be extended to their mixed-model version, as LME, GLMM, GAMM, etc., using a spatial process model for random effects, or for example a spline surface in the observation coordinates (GAM)
The spatial process models are mostly geostatistical, so we’ll focus on these
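As a minimal sketch of inverse distance weighting in base R (the function idw and its arguments are illustrative assumptions, not any package's API):
idw <- function(pred_xy, obs_xy, z, power=2) {
  # predict at each row of pred_xy from observations at obs_xy with values z
  apply(pred_xy, 1, function(pt) {
    d <- sqrt(colSums((t(obs_xy) - pt)^2))   # distances to all observations
    if (any(d == 0)) return(z[which.min(d)]) # coincident point: use its value
    w <- 1/d^power                           # inverse-distance weights
    sum(w*z)/sum(w)                          # distance-weighted mean
  })
}
set.seed(1)
obs_xy <- cbind(runif(25), runif(25))
idw(cbind(0.5, 0.5), obs_xy, z=rnorm(25))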
Assuming that we have a response variable and perhaps explanatory variables observed on the same support and yielding a mean-stationary error process, our interest is in finding out whether there is still useful information in the error
The information may take the form of spatial covariance, the error autocorrelation between observations
This also involves the introduction of the concept of the random field, a stochastic process in more than one dimension, as a representation of space, where these can be continuous or discrete
The continuous form is:
\[ Z(\mathbf{s}) = \mu(\mathbf{s}) + W(\mathbf{s}) + \eta(\mathbf{s}) + \epsilon(\mathbf{s}), \] where \(Z(\mathbf{s})\) is a random field, \(\mathbf{s} = [s_1, s_2, \ldots, s_d]'\) are the coordinates of a spatial process indexed over \(D \subset \mathcal{R}^d\), the mean function \(\mu(\mathbf{s})\) is the large-scale trend, \(W(\mathbf{s})\) is the smooth-scale variation and is a stationary process with a covariance function, \(\eta(\mathbf{s})\) is the often unobservable micro-scale variation, and \(\epsilon(\mathbf{s})\) is white noise measurement error
In geostatistics, two steps are involved in prediction from point data, first modelling the smooth-scale variation, then making the prediction from the mean function and this fitted model
The first step in practice involves fitting a variogram model to the observed differences in the residuals of the model after fitting the mean function between pairs of observation points
The fitting of a model to the empirical variogram may be done by hand and by a number of statistical techniques (which depend on assumptions)
Choosing a different variogram model may lead to differences in predictions; it may not be possible to choose a satisfactory model
It is probable that more exploratory spatial data analysis is done in geostatistics than in the remaining domains of spatial data analysis
It is easy to grasp why interpolation is crucially dependent on identifying the right model, in terms of the selection of observation locations, the fitting of models of spatial autocorrelation, detecting useful covariates, and checking the appropriateness of assumptions such as isotropy
Here we will use a data set of precipitation values for Switzerland, discussed in Diggle and Ribeiro (2007), and used in the "Spatial Interpolation Comparison 97" contest
The examples demonstrate that geostatistics software, here packages, provides much support for exploratory spatial data analysis
Both geoR and gstat include data for precipitation at 467 met. stations in Switzerland on 8 May 1986, measured in 0.1 mm. We'll fit with 100 training sites, and predict to the remaining 367 sites. geoR gives a four-panel ESDA display that conveys a lot of information for the 100 training sites.
library(geoR)
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.8-1 (built on 2020-02-08) is now loaded
## --------------------------------------------------------------
data(SIC)
plot(sic.100, borders=sic.borders, lowess=TRUE)
The first diagnostic plot provided in gstat is known as an \(h\)-scatterplot, and conditions a scatterplot of the values at pairs of locations on the binned distance \(h_{ij}\) between them; the diagonal lines represent perfect correlation
library(sf)
sic.100sf <- st_as_sf(cbind(as.data.frame(sic.100[[1]]), precip=sic.100[[2]], sic.100[[3]]), coords=1:2)
sic.allsf <- st_as_sf(cbind(as.data.frame(sic.all[[1]]), precip=sic.all[[2]], sic.all[[3]]), coords=1:2)
library(gstat)
hscat(precip ~ altitude, data=sic.100sf, seq(0,120,20))
We now compute a variogram cloud plot and a plot of empirical variogram values for 20 bins each 5 km wide, for classical and robust versions of the variogram. The bin borders are shown to highlight how the empirical variogram is constructed as a measure of central tendency of squared differences in the variable of interest between pairs of points whose inter-point distance falls into the bin
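For reference, the classical empirical variogram estimator underlying these plots is \[ \hat{\gamma}(h) = \frac{1}{2\,|N(h)|} \sum_{(i,j) \in N(h)} \left(e(\mathbf{s}_i) - e(\mathbf{s}_j)\right)^2, \] where \(N(h)\) is the set of point pairs whose separation falls in the bin around \(h\), and \(e(\cdot)\) are the residuals after the mean function.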
g <- gstat(id="precip", formula=precip ~ altitude, data=sic.100sf)
evgm <- variogram(g, cutoff=100, width=5)
revgm <- variogram(g, cutoff=100, width=5, cressie=TRUE)
cevgm <- variogram(g, cutoff=100, width=5, cloud=TRUE)
oopar <- par(mfrow=c(1,2))
plot(gamma ~ dist, cevgm, pch=".", cex=2, col="grey65", ylab="semivariance", xlab="distance")
lines(gamma ~ dist, evgm, lwd=2)
lines(gamma ~ dist, revgm, lwd=2, lty=2)
abline(v=seq(0,100,5), lty=2, col="grey50")
plot(gamma ~ dist, evgm, ylab="semivariance", xlab="distance", type="b", lwd=2)
points(gamma ~ dist, revgm, pch=3)
lines(gamma ~ dist, revgm, lty=2, lwd=2)
abline(v=seq(0,100,5), lty=2, col="grey50")
legend("topleft", legend=c("classic", "robust"), pch=c(1,3), lty=c(1,2), bty="n", lwd=2)
par(oopar)
We can close in on these within-bin distributions by using boxplots constructed from the gstat variogram cloud — box widths proportional to pair counts in bins, classical empirical variogram shown as a dashed line; the fields package returns number summaries by bin in addition to the classical variogram estimator in output from its vgram function
dbin <- findInterval(cevgm$dist, seq(0, 100, 5), all.inside=TRUE)
wid <- tapply(cevgm$gamma, dbin, length)
boxplot(cevgm$gamma ~ dbin, width=wid, ylab="semivariance", xlab="distance", axes=FALSE)
axis(2)
axis(1, at=c(0.5, 5.5, 10.5, 15.5, 20.5), labels=c(0, 25, 50, 75, 100))
box()
lines(gamma ~ I(dist/5), evgm, lwd=2, lty=2)
Finally, we explore possible anisotropy in the data set. Using the same bins as earlier, we add arguments to the variogram function to create objects for plotting, a variogram map, and four empirical variograms for four axes at \(0\,^{\circ}\), \(45\,^{\circ}\), \(90\,^{\circ}\) and \(135\,^{\circ}\); the variogram direction lines are coded in the same way on both panels
mevgm <- variogram(g, cutoff=100, width=5, map=TRUE)
aevgm <- variogram(g, cutoff=100, width=5, alpha=c(0, 45, 90, 135))
library(RColorBrewer)
oopar <- par(mar=c(1,1,1,1))
image(mevgm$map, col=colorRampPalette(brewer.pal(7, "Blues")[-(6:7)])(20))
abline(v=0, lty=1)
abline(a=0, b=1, lty=2)
abline(h=0, lty=3)
abline(a=0, b=-1, lty=4)
par(oopar)
library(lattice)
trellis.device(new=FALSE,color=FALSE)
plot(aevgm, multipanel=FALSE)
First we will fit a Matern variogram model in geoR using weighted least squares and maximum likelihood:
evg <- variog(sic.100, max.dist=200, trend=formula(~altitude), messages=FALSE)
fvg <- variofit(evg, messages=FALSE)
ml <- likfit(sic.100, ini.cov.pars=c(0.5, 0.5), trend=formula(~altitude), messages=FALSE)
plot(evg)
lines(fvg)
lines(ml, lty=2)
legend("topleft", legend=c("WLS", "ML"), lty=1:2, bty="n")
Next we will fit a Matern variogram model in gstat:
evgm <- variogram(g, cutoff=200, width=5)
fevgm <- fit.variogram(evgm, vgm(psill=16000, model="Mat", range=30, nugget=1, kappa=0.5))
fevgm
## model psill range kappa
## 1 Nug 0.0 0.00000 0.0
## 2 Mat 16869.7 46.96405 0.5
plot(evgm, model=fevgm)
The geostatistical packages, like gstat, use formula objects in standard ways where possible, which allows for considerable flexibility:
UK_fit <- gstat(g, id="precip", model=fevgm)
z <- predict(UK_fit, newdata=sic.allsf, debug.level=0)
sic.367sf <- sic.allsf[which(z$precip.var > 0.0001),]
z <- predict(UK_fit, newdata=sic.367sf, debug.level=0)
Using geoR, we get:
kcwls <- krige.conv(sic.100, locations=st_coordinates(sic.367sf),
krige=krige.control(obj.model=fvg, type.krige="OK",
trend.d=formula(~altitude), trend.l=formula(~sic.367sf$altitude)))
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
kcml <- krige.conv(sic.100, locations=st_coordinates(sic.367sf),
krige=krige.control(obj.model=ml, type.krige="OK",
trend.d=formula(~altitude), trend.l=formula(~sic.367sf$altitude)))
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
kcB <- krige.bayes(sic.100, locations=st_coordinates(sic.367sf),
model=model.control(trend.d=formula(~altitude),
trend.l=formula(~sic.367sf$altitude)))
## krige.bayes: model with mean defined by covariates provided by the user
## krige.bayes: computing the discrete posterior of phi/tausq.rel
## krige.bayes: argument `phi.discrete` not provided, using default values
## krige.bayes: computing the posterior probabilities.
## Number of parameter sets: 50
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50,
##
## krige.bayes: sampling from posterior distribution
## krige.bayes: sample from the (joint) posterior of phi and tausq.rel
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## phi 23.44137 35.16205 46.88273 58.60342 70.3241 82.04478 93.76547
## tausq.rel 0.00000 0.00000 0.00000 0.00000 0.0000 0.00000 0.00000
## frequency 10.00000 48.00000 79.00000 64.00000 55.0000 45.00000 43.00000
## [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## phi 105.4862 117.2068 128.9275 140.6482 152.3689 164.0896 175.8103
## tausq.rel 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## frequency 28.0000 39.0000 27.0000 18.0000 30.0000 20.0000 25.0000
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## phi 187.5309 199.2516 210.9723 222.693 234.4137 246.1344 257.855 269.5757
## tausq.rel 0.0000 0.0000 0.0000 0.000 0.0000 0.0000 0.000 0.0000
## frequency 20.0000 14.0000 20.0000 22.000 10.0000 18.0000 22.000 11.0000
## [,23] [,24] [,25] [,26] [,27] [,28] [,29]
## phi 281.2964 293.0171 304.7378 316.4585 328.1791 339.8998 351.6205
## tausq.rel 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## frequency 18.0000 18.0000 16.0000 16.0000 13.0000 19.0000 16.0000
## [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## phi 363.3412 375.0619 386.7826 398.5032 410.2239 421.9446 433.6653
## tausq.rel 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
## frequency 7.0000 14.0000 10.0000 15.0000 10.0000 11.0000 12.0000
## [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44]
## phi 445.386 457.1067 468.8273 480.548 492.2687 503.9894 515.7101 527.4308
## tausq.rel 0.000 0.0000 0.0000 0.000 0.0000 0.0000 0.0000 0.0000
## frequency 9.000 10.0000 7.0000 13.000 13.0000 8.0000 12.0000 12.0000
## [,45] [,46] [,47] [,48] [,49]
## phi 539.1514 550.8721 562.5928 574.3135 586.0342
## tausq.rel 0.0000 0.0000 0.0000 0.0000 0.0000
## frequency 10.0000 11.0000 11.0000 8.0000 13.0000
##
## krige.bayes: starting prediction at the provided locations
## krige.bayes: phi/tausq.rel samples for the predictive are same as for the posterior
## krige.bayes: computing moments of the predictive distribution
## krige.bayes: sampling from the predictive
## Number of parameter sets: 49
## 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
## krige.bayes: preparing summaries of the predictive distribution
The point about the standard assumptions is that when they are met, the prediction standard errors are tractable — we plot the MCMC prediction simulations for location 1:
plot(density(kcB$predictive$simulations[1,]), ylim=c(0, 0.006))
abline(v=kcB$predictive$mean[1], col="red", lwd=2)
curve(dnorm(x, mean=kcB$predictive$mean[1],
sd=sqrt(kcB$predictive$variance[1])), lty=2, lwd=2, from=-100, to=500, add=TRUE)
abline(v=sic.367sf$precip[1], col="blue", lwd=2)
The automap package builds on gstat to automate the choice of variogram model:
library(automap)
aK <- autoKrige(formula=precip ~ altitude, input_data=as(sic.100sf, "Spatial"), new_data=as(sic.367sf, "Spatial"))
## [using universal kriging]
aK$var_model
## model psill range kappa
## 1 Nug 861.4484 0.00000 0
## 2 Ste 14011.7859 35.92031 5
plot(aK)
Hengl et al. (2018; doi:10.7287/peerj.preprints.26693v3) point to possible uses of ML technologies for interpolation, with code examples provided in the ebook at https://envirometrix.github.io/PredictiveSoilMapping/. First distance matrices are constructed and converted to data frames:
#grid.dist0 <- GSIF::buffer.dist(sic.100SP["precip"], sic.367SP[1], as.factor(1:nrow(sic.100SP)))
dist0sf <- as.data.frame(st_distance(st_geometry(sic.100sf)))
names(dist0sf) <- paste("layer", names(dist0sf), sep=".")
dist1sf <- as.data.frame(st_distance(st_geometry(sic.367sf), st_geometry(sic.100sf)))
names(dist1sf) <- paste("layer", names(dist1sf), sep=".")
Then the observed responses and any covariates are added to the per-observation distances, and a formula constructed:
rm.precip <- cbind(data.frame(precip=sic.100sf$precip, altitude=sic.100sf$altitude), dist0sf)
rm.precip1 <- cbind(data.frame(altitude=sic.367sf$altitude), dist1sf)
dn0 <- paste(names(dist0sf), collapse="+")
fm0 <- as.formula(paste("precip ~ altitude +", dn0))
#fm0
The ranger function from the ranger package can be used for fast random forest fitting:
library(ranger)
m.precip <- ranger(fm0, rm.precip, quantreg=TRUE, num.trees=150, seed=1)
m.precip
## Ranger result
##
## Call:
## ranger(fm0, rm.precip, quantreg = TRUE, num.trees = 150, seed = 1)
##
## Type: Regression
## Number of trees: 150
## Sample size: 100
## Number of independent variables: 101
## Mtry: 10
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 6150.302
## R squared (OOB): 0.5482526
And make predictions from the fitted model for the 367 weather stations held back:
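# pnorm(1, lower.tail=FALSE) is about 0.159 and pnorm(1) about 0.841, so the
# outer quantiles bracket the central ~68% interval, mimicking +/- one sd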
quantiles <- c(pnorm(1, lower.tail=FALSE), 0.5, pnorm(1))
precip.rfd <- as.data.frame(predict(m.precip, rm.precip1, type="quantiles",
quantiles=quantiles)$predictions)
res <- cbind(sic.367sf[,"precip"], precip.rfd, as.data.frame(aK$krige_output)[,3:5])
res$rf_sd <- (res[[4]] - res[[2]])/2
names(res) <- make.names(names(res))
names(res)[c(2,4)] <- c("quantile= 0.159", "quantile= 0.841")
library(tmap)
st_crs(res) <- 32662
tm_shape(res) + tm_symbols(col=c("precip", "var1.pred", "quantile..0.5"), pal="Blues", size=0.2) + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Preciptation", "Kriging predictons", "Random Forest predictions"))
tm_shape(res) + tm_symbols(col=c("var1.stdev", "rf_sd"), pal="Reds", size=0.2) + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Kriging std. dev", "Random Forest 0.159-0.841 range"))
The coordinates of the points may also carry useful information, so we add them to the per-observation data frames as covariates and update the formula:
xy100 <- st_coordinates(sic.100sf)
xy367 <- st_coordinates(sic.367sf)
rm.precipa <- cbind(rm.precip, x=xy100[,1], y=xy100[,2])
rm.precipa1 <- cbind(rm.precip1, x=xy367[,1], y=xy367[,2])
fm1 <- update(fm0, . ~ . + x + y)
m.precipa <- ranger(fm1, rm.precipa, quantreg=TRUE, num.trees=150, seed=1)
m.precipa
## Ranger result
##
## Call:
## ranger(fm1, rm.precipa, quantreg = TRUE, num.trees = 150, seed = 1)
##
## Type: Regression
## Number of trees: 150
## Sample size: 100
## Number of independent variables: 103
## Mtry: 10
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 6013.641
## R squared (OOB): 0.5582906
Again we make predictions from the fitted model for the 367 weather stations held back:
quantiles <- c(pnorm(1, lower.tail=FALSE), 0.5, pnorm(1))
precipa.rfd <- as.data.frame(predict(m.precipa, rm.precipa1, type="quantiles",
quantiles=quantiles)$predictions)
resa <- cbind(sic.367sf[,"precip"], precipa.rfd, as.data.frame(aK$krige_output)[,3:5])
resa$rf_sda <- (resa[[4]] - resa[[2]])/2
names(resa) <- make.names(names(resa))
names(resa)[c(2,4)] <- c("quantile= 0.159", "quantile= 0.841")
st_crs(resa) <- 32662
tm_shape(resa) + tm_symbols(col=c("precip", "var1.pred", "quantile..0.5"), pal="Blues", size=0.2) + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Preciptation", "Kriging predictons", "Random Forest predictions"))
tm_shape(resa) + tm_symbols(col=c("var1.stdev", "rf_sda"), pal="Reds", size=0.2) + tm_facets(free.scales=FALSE) + tm_layout(panel.labels=c("Kriging std. dev", "Random Forest 0.159-0.841 range"))
What we see on a map is a pattern, or perhaps some patterns mixed together.
It is not easy to work back from a map pattern to the process or processes that generated it.
Using a variety of approaches, we can explore and analyse point patterns, also reviewing an important chapter in the development of quantitative geography.
Practically, we will also see how we can try out different approaches, and how their assumptions affect our conclusions.
David O’Sullivan and David Unwin (2003) Geographic Information Analysis, Wiley, chapter 4, plus chapter 3 for the curious (or 2010, ch. 5 plus ch. 4);
Ian Smalley and David Unwin (1968) The formation and shape of drumlins and their distribution and orientation in drumlin fields, Journal of Glaciology, 7, pp. 377–390; Alan R. Hill (1973) The distribution of drumlins in County Down, Ireland, Annals of the Association of American Geographers, 63 (2), pp. 226–240.
Others may also like Trevor Bailey and Anthony Gatrell (1995) Interactive Spatial Data Analysis, Longman, chapter 3.
library(sf)
drumlins <- st_geometry(st_read("drumlins.gpkg"))
## Reading layer `drumlins' from data source
## `/home/rsb/und/ecs530/2021/drumlins.gpkg' using driver `GPKG'
## Simple feature collection with 232 features and 0 fields
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 0.2804878 ymin: 5.567073 xmax: 8.231707 ymax: 13.4878
## Projected CRS: Undefined Cartesian SRS
A data set similar to the one referred to by O’Sullivan and Unwin on pp. 100–101 is available in the spatial package shipped with R (associated with Venables and Ripley (2002) Modern Applied Statistics with S); it is the one used by Upton and Fingleton, coded by Ripley. We have here copied the points to a GeoPackage file.
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: spatstat.geom
## spatstat.geom 2.3-0
##
## Attaching package: 'spatstat.geom'
## The following object is masked from 'package:colorspace':
##
## coords
## The following object is masked from 'package:MASS':
##
## area
## Loading required package: spatstat.core
## Loading required package: rpart
## spatstat.core 2.3-1
##
## Attaching package: 'spatstat.core'
## The following object is masked from 'package:lattice':
##
## panel.histogram
## The following object is masked from 'package:gstat':
##
## idw
## Loading required package: spatstat.linnet
## spatstat.linnet 2.3-0
##
## spatstat 2.2-0 (nickname: 'That's not important right now')
## For an introduction to spatstat, type 'beginner'
(drumlins_ppp <- as.ppp(drumlins))
## Planar point pattern: 232 points
## window: rectangle = [0.280488, 8.231707] x [5.567073, 13.487805] units
Although spatstat and the sf and sp classes have developed independently, they have a good deal in common, and point patterns, images and polygon windows can be exchanged between them.
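As a minimal sketch of this exchange (assuming a recent sf version, which provides st_as_sf methods for spatstat objects; the label column distinguishing the window row from the point rows is how sf encodes the result):
drumlins_back <- st_as_sf(drumlins_ppp)   # round-trip: ppp object back to sf
table(drumlins_back$label)                # one "window" row plus one row per point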
Point pattern objects need bounding windows to show where the population of data points was collected. The default window is the bounding box of the points, but others are available.
bb <- boundingbox(drumlins_ppp)
ch <- convexhull.xy(drumlins_ppp)
rr <- ripras(drumlins_ppp)
drumlins_rr <- ppp(drumlins_ppp$x, drumlins_ppp$y, window=rr)
plot(drumlins_ppp)
plot(bb, add=TRUE, border="darkgreen", lwd=2, lty=1)
plot(ch, add=TRUE, border="darkred", lwd=2, lty=3)
plot(rr, add=TRUE, border="orange", lwd=2, lty=2)
One legacy approach to point patterns, avoiding the drudgery of measuring inter-point distances, has been to divide the study area into quadrats and count the numbers of points falling into each quadrat. This can take the form of a 2D histogram, or be displayed as an image plot.
qc <- quadratcount(drumlins_ppp)
plot(drumlins, cex=0.8)
t3 <- cbind(expand.grid(x=attr(qc, "xbreaks")[1:5] + diff(attr(qc, "xbreaks"))[1]/2, y=rev(attr(qc, "ybreaks")[1:5] + diff(attr(qc, "ybreaks"))[1]/2)), qc=c(t(qc)))
text(t3[,1], t3[,2], t3[,3], cex=1.2, font=2, col="darkred")
abline(h=attr(qc, "ybreaks"))
abline(v=attr(qc, "xbreaks"))
Chi-squared tests for Complete Spatial Randomness using quadrat counts may seem attractive, but suffer from the same problems as do histogram bins:
quadrat.test(drumlins_ppp)
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_ppp
## X2 = 37.828, df = 24, p-value = 0.07222
## alternative hypothesis: two.sided
##
## Quadrats: 5 by 5 grid of tiles
Just adding one more row and column of quadrats, or switching windows, changes our conclusion:
quadrat.test(drumlins_ppp, nx=6)
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_ppp
## X2 = 41.414, df = 35, p-value = 0.4221
## alternative hypothesis: two.sided
##
## Quadrats: 6 by 6 grid of tiles
quadrat.test(drumlins_rr)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_rr
## X2 = 34.389, df = 24, p-value = 0.156
## alternative hypothesis: two.sided
##
## Quadrats: 25 tiles (irregular windows)
Density plots use a 2D kernel, in spatstat a Gaussian kernel, to create smoothed histograms avoiding the problems of quadrat counts. The key argument to pass to the density method for point pattern objects is sigma=, which determines the bandwidth of the kernel. Since we can coerce the image objects output by the method to RasterLayer objects, we use this to collect density values for different values of sigma.
crds <- st_coordinates(st_sample(st_as_sfc(rr), size=10000, type="regular"))
crds <- list(x=crds[,1], y=crds[,2])
library(raster)
##
## Attaching package: 'raster'
## The following object is masked from 'package:nlme':
##
## getData
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:lme4':
##
## getData
k02 <- as(density(drumlins_rr, sigma=0.2, xy=crds), "RasterLayer")
k04 <- as(density(drumlins_rr, sigma=0.4, xy=crds), "RasterLayer")
k06 <- as(density(drumlins_rr, sigma=0.6, xy=crds), "RasterLayer")
k08 <- as(density(drumlins_rr, sigma=0.8, xy=crds), "RasterLayer")
rB <- brick(k02, k04, k06, k08)
library(stars)
## Loading required package: abind
rB_st <- st_as_stars(rB)
library(tmap)
st_crs(rB_st) <- 32662
st_crs(drumlins) <- 32662
## Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
## that
tm_shape(rB_st) + tm_raster(title="Density") + tm_layout(panel.labels=c("0.2", "0.4", "0.6", "0.8")) + tm_shape(drumlins) + tm_symbols(size=0.25, shape=4)
Narrower bandwidths yield more extreme values, while broader bandwidths narrow the interquartile range. From this table, we can see how the change in bandwidth affects the relative differences in our view of the local estimates of intensity.
summary(rB)
## layer.1 layer.2 layer.3 layer.4
## Min. 9.421760e-07 0.05803886 0.3291022 0.6437881
## 1st Qu. 1.663420e+00 2.54392720 2.8779842 3.0843853
## Median 3.578001e+00 3.73802602 3.7933231 3.8079261
## 3rd Qu. 5.374750e+00 4.83903887 4.5984496 4.4039411
## Max. 1.444779e+01 7.47133680 5.7460480 5.3224758
## NA's 7.980000e+02 798.00000000 798.0000000 798.0000000
boxplot(rB)
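Rather than comparing bandwidths by eye, a data-driven selector can be tried; a minimal sketch using spatstat's bw.diggle cross-validation selector (choosing this selector is itself an assumption, and alternatives such as bw.scott exist):
sigma_hat <- bw.diggle(drumlins_rr)    # mean-square error cross-validation bandwidth
plot(density(drumlins_rr, sigma=sigma_hat))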
We can find nearest neighbour distances with nndist; plotting the empirical cumulative distribution function of the nearest neighbour distances is interesting:
nns <- nndist(drumlins_rr)
summary(nns)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1446 0.2700 0.3203 0.3254 0.3685 1.0073
plot(ecdf(nns))
The \(\hat{G}\) measure turns out to be just the ECDF of the nearest neighbour distances, plotted by default with the expected CSR line; Gest returns binned values for a range of distance bins best chosen by the function:
plot(ecdf(nns), xlim=c(0, 0.5))
plot(Gest(drumlins_ppp), add=TRUE, lwd=3)
If we generate many simulated CSR point patterns for the current window, we can use the envelope method to explore how the observed \(\hat{G}\) measures lie in relation to the simulated ones:
n <- drumlins_rr$n
set.seed(121122)
ex <- expression(runifpoint(n, win=rr))
res <- envelope(drumlins_rr, Gest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
plot(res, xlim=c(0,0.7))
for(i in 2:100) lines(attr(res, "simfuns")[[1]], attr(res, "simfuns")[[i]], col="grey")
plot(res, add=TRUE, lwd=3, xlim=c(0,0.7))
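The envelopes shown are pointwise, so reading significance off them at many distances at once invites multiple testing problems; envelope can also construct global envelopes based on the maximum deviation across distances (a sketch re-using the simulation expression defined above):
res_g <- envelope(drumlins_rr, Gest, nsim=99, simulate=ex, global=TRUE, verbose=FALSE)
plot(res_g, xlim=c(0, 0.7))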
We can also compute the nearest neighbour based Clark/Evans R statistic:
clarkevans(drumlins_ppp)
## naive Donnelly cdf
## 1.248991 1.214479 1.229503
clarkevans(drumlins_rr, correction="none")
## [1] 1.238292
clarkevans(drumlins_rr, correction="guard", clipregion=erosion.owin(rr, r=1))
## [1] 1.194691
which seems to indicate that the observed and CSR expected distances are similar, with the points perhaps more evenly spaced than clustered.
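As a check on what the statistic measures, the uncorrected value can be reproduced from its definition, the ratio of the observed mean nearest neighbour distance to its expectation \(1/(2\sqrt{\lambda})\) under CSR (a sketch re-using the nns distances computed above):
lambda <- drumlins_rr$n / area(rr)    # estimated intensity in the Ripley-Rasson window
mean(nns) * 2 * sqrt(lambda)          # should match the "none" correction value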
From what we have seen, it appears that the drumlin summit points are more regularly than randomly distributed. The absence of short nearest neighbour distances may, however, mean that the summits “push” each other apart (in fact this is about points not being a good way of representing ellipses), so we can try to simulate from a Simple Sequential Inhibition (SSI) process with a 180 m inhibition radius (0.18 in the map units used here) instead of CSR:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Gest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
null <- capture.output(plot(res, xlim=c(0,0.7)))
for(i in 2:100) lines(attr(res, "simfuns")[[1]], attr(res, "simfuns")[[i]], col="grey")
null <- capture.output(plot(res, add=TRUE, lwd=3, xlim=c(0,0.7)))
As we know, \(\hat{G}\) uses nearest neighbour distances to express summary features of a point pattern. The \(\hat{K}\) function uses point intensities in rings spreading out from the points, and so uses more of the data to examine what is driving the process; it is reported here on the \(\hat{L}\) scale, \(\hat{L}(r) = \sqrt{\hat{K}(r)/\pi} - r\), which is zero under CSR:
ex <- expression(runifpoint(n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="CSR simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
From what we already know, drumlins represented as points appear to inhibit each other under a distance of about 200m, so running the \(\hat{K}\) measure with an SSI process should show more of what is going on:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="SSI simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Another possibility is that the CSR hypothesis is in error in assuming that the process is homogeneous; we may also test against an inhomogeneous process using the Kinhom function. If its lambda argument is omitted, it does leave-one-out kernel smoothing to find \(\lambda_i\), omitting the \(i\)-th point:
ex <- expression(runifpoint(n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kinhom, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="CSR simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Finally, we round off with an inhomogeneous SSI process:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kinhom, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="SSI simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Comparing the SSI with the CSR results indicates that we can not only reject the CSR process as the one driving drumlin point locations, but that we also have good grounds to suppose that a spatial inhibition process is operating.
It is also very possible that the process is inhomogeneous, that is, that an omitted heterogeneity in the surface is influencing the point pattern.
The minimum drumlin footprint is such that drumlins cannot be closer to each other than a certain minimum distance.
At short range, the estimated \(\hat{L}\) values are lower than the lower envelope bounds, but only at distances of less than 0.4 distance units; this corresponds to spatial inhibition.
As the \(\hat{L}\) simulation values for the SSI process indicate, drumlins are not well represented by points, because they have area, even volume, and rarely overlap.
It is frequently the case that point patterns are not measured on homogeneous surfaces.
One way to tackle this, as in these examples from Zhang et al. (2009), is to sample from the population at risk to form a control group.
We are then interested in examining the spatial distribution of cases compared to the spatial distribution of controls.
The cases are observations of schistosomiasis in Guichi, China, and the controls are taken from the population at risk.
First we’ll read in the points, enclosing polygon, and river banks:
points <- st_read("cc.gpkg", layer="points")
## Reading layer `points' from data source `/home/rsb/und/ecs530/2021/cc.gpkg' using driver `GPKG'
## Simple feature collection with 166 features and 1 field
## Geometry type: POINT
## Dimension: XY
## Bounding box: xmin: 0.1299685 ymin: 0.09089491 xmax: 0.9548543 ymax: 0.8590922
## Projected CRS: Undefined Cartesian SRS
rivers <- st_geometry(st_read("cc.gpkg", layer="rivers"))
## Reading layer `rivers' from data source `/home/rsb/und/ecs530/2021/cc.gpkg' using driver `GPKG'
## Simple feature collection with 5 features and 0 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -0.1455332 ymin: 0.3004564 xmax: 0.9188697 ymax: 1.181636
## Projected CRS: Undefined Cartesian SRS
poly <- st_geometry(st_read("cc.gpkg", layer="poly"))
## Reading layer `poly' from data source `/home/rsb/und/ecs530/2021/cc.gpkg' using driver `GPKG'
## Simple feature collection with 1 feature and 0 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 0 ymin: 0 xmax: 1 ymax: 0.8758322
## Projected CRS: Undefined Cartesian SRS
plot(poly)
plot(rivers, add=TRUE)
plot(st_geometry(points), pch=3:4, add=TRUE)
We coerce the sf objects to their spatstat representations, the points with marks:
points$mark <- factor(points$mark)
points.ppp <- as.ppp(points)
points.ppp$window <- as.owin(poly)
summary(points.ppp)
## Marked planar point pattern: 166 points
## Average intensity 317.9072 points per square unit
##
## Coordinates are given to 8 decimal places
##
## Multitype:
## frequency proportion intensity
## case 83 0.5 158.9536
## control 83 0.5 158.9536
##
## Window: polygonal boundary
## single connected closed polygon with 328 vertices
## enclosing rectangle: [0, 1] x [0, 0.8758322] units
## Window area = 0.522165 square units
## Fraction of frame area: 0.596
plot(split(points.ppp))
We make a mask for the kernel density calculations, and show the overall density:
XY <- st_coordinates(st_sample(poly, size=10000, type="regular"))
XY <- list(x=XY[,1], y=XY[,2])
Z <- density(points.ppp, xy=XY)
plot(Z)
The split density shows how the point patterns differ:
Z <- density(split(points.ppp), xy=XY)
plot(Z)
We can also display the case density as a proportion of the overall density:
pCases <- with(Z, eval.im(case/(case + control)))
plot(pCases)
plot(points.ppp, add=TRUE)
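The same eval.im machinery gives a log relative risk surface, symmetric about zero where case and control densities are equal (a minimal sketch re-using the split densities computed above):
logrr <- with(Z, eval.im(log(case/control)))   # log relative risk of cases vs. controls
plot(logrr)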
sessionInfo()
## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 34 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R412-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R412-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] stars_0.5-4 abind_1.4-5 raster_3.5-2
## [4] spatstat_2.2-0 spatstat.linnet_2.3-0 spatstat.core_2.3-1
## [7] rpart_4.1-15 spatstat.geom_2.3-0 spatstat.data_2.1-0
## [10] ranger_0.13.1 automap_1.0-14 lattice_0.20-45
## [13] RColorBrewer_1.1-2 gstat_2.0-8 geoR_1.8-1
## [16] ggplot2_3.3.5 R2BayesX_1.1-1.1 mgcv_1.8-38
## [19] nlme_3.1-153 colorspace_2.0-2 BayesXsrc_3.0-1.1
## [22] HSAR_0.5.1 hglm_2.2-1 hglm.data_1.0-1
## [25] sp_1.4-6 MASS_7.3-54 MatrixModels_0.5-0
## [28] lme4_1.1-27.1 spfilteR_1.1.1 spmoran_0.2.2
## [31] spatialreg_1.1-9 Matrix_1.3-4 spData_2.0.1
## [34] tmap_3.3-2 sf_1.0-4
##
## loaded via a namespace (and not attached):
## [1] spam_2.7-0 lwgeom_0.2-8 plyr_1.8.6
## [4] splines_4.1.2 crosstalk_1.1.1 leaflet_2.0.4.1
## [7] digest_0.6.28 foreach_1.5.1 htmltools_0.5.2
## [10] viridis_0.6.2 gdata_2.18.0 fansi_0.5.0
## [13] magrittr_2.0.1 RandomFieldsUtils_0.5.6 tensor_1.5
## [16] cluster_2.1.2 doParallel_1.0.16 matrixStats_0.61.0
## [19] gmodels_2.18.1 rARPACK_0.11-0 xts_0.12.1
## [22] spatstat.sparse_2.0-0 rgdal_1.5-27 xfun_0.27
## [25] dplyr_1.0.7 leafem_0.1.6 tcltk_4.1.2
## [28] crayon_1.4.2 jsonlite_1.7.2 zoo_1.8-9
## [31] iterators_1.0.13 glue_1.4.2 polyclip_1.10-0
## [34] gtable_0.3.0 maps_3.4.0 scales_1.1.1
## [37] DBI_1.1.1 Rcpp_1.0.7 RandomFields_3.3.13
## [40] viridisLite_0.4.0 units_0.7-2 spdep_1.1-12
## [43] proxy_0.4-26 dotCall64_1.0-1 intervals_0.15.2
## [46] htmlwidgets_1.5.4 FNN_1.1.3 wk_0.5.0
## [49] ellipsis_0.3.2 reshape_0.8.8 pkgconfig_2.0.3
## [52] XML_3.99-0.8 farver_2.1.0 sass_0.4.0
## [55] deldir_1.0-6 utf8_1.2.2 tidyselect_1.1.1
## [58] labeling_0.4.2 rlang_0.4.12 tmaptools_3.1-1
## [61] munsell_0.5.0 tools_4.1.2 generics_0.1.1
## [64] splancs_2.01-42 mathjaxr_1.4-0 fftwtools_0.9-11
## [67] evaluate_0.14 stringr_1.4.0 fastmap_1.1.0
## [70] goftest_1.2-3 yaml_2.2.1 leafsync_0.1.0
## [73] knitr_1.36 purrr_0.3.4 s2_1.0.7
## [76] compiler_4.1.2 png_0.1-7 e1071_1.7-9
## [79] spatstat.utils_2.2-0 tibble_3.1.5 spacetime_1.2-5
## [82] bslib_0.3.1 stringi_1.7.5 highr_0.9
## [85] RSpectra_0.16-0 fields_13.3 classInt_0.4-3
## [88] nloptr_1.2.2.3 vegan_2.5-7 permute_0.9-5
## [91] vctrs_0.3.8 pillar_1.6.4 LearnBayes_2.15.1
## [94] lifecycle_1.0.1 jquerylib_0.1.4 R6_2.5.1
## [97] KernSmooth_2.23-20 gridExtra_2.3 spDataLarge_0.5.4
## [100] codetools_0.2-18 dichromat_2.0-0 boot_1.3-28
## [103] gtools_3.9.2 assertthat_0.2.1 withr_2.4.2
## [106] metafor_3.0-2 expm_0.999-6 parallel_4.1.2
## [109] terra_1.4-16 grid_4.1.2 coda_0.19-4
## [112] class_7.3-19 minqa_1.2.4 rmarkdown_2.11
## [115] base64enc_0.1-3