Copyright

All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.

Required current contributed CRAN packages:

I am running R 4.1.2, with recent update.packages().

needed <- c("rgeoda", "digest", "igraph", "tmap", "spdep", "spData", "sp", "gstat",
"spatstat.linnet", "spatstat.core", "spatstat.geom", "spatstat.data_2.1-0")

Script

Script and data at https://github.com/rsbivand/ECS530_h21/raw/main/ECS530_211117.zip. Download to suitable location, unzip and use as basis.

Is spatial autocorrelation just poor data collection design and/or model mis-specification?

Spatial, time series, and spatio-temporal data breaks the fundamental rule of independence of observations (social networks do too)
Often, we are not sampling from a known population to get observations, so caution is needed for inference
Often again, the observation units or entities are not chosen by us, but by others
Designing spatial samples has become a less important part of the field than previously (Ripley 1981; Müller 2007)

Spatial modelling: why?

If we want to detect and classify patterns (ML), infer about covariates, or interpolate from a fitted model, we need models that take account of the possible interdependency of spatial, time series and spatio-temporal observations
Here we are focussing on the spatial case; time differs in having a direction of flow, and spatio-temporal processes are necessarily more complex, especially when non-separable
We will not look at machine learning issues, although the partition into training and test sets raises important spatial questions for tuning (Schratz et al. 2019)

Spatial modelling: which kinds?

Spatial point processes are most closely tied to the relative positions of the observations, but may accommodate inhomogeneities
Geostatistics is concerned with interpolating values of a variable of interest at unobserved locations, and the relative positions of the observations contribute through the modelling a function of differences in observed values of that variable between observations at increasing distances from each other
Disease mapping, spatial econometrics/regression and the application of generalized linear mixed models (GLMM, GAMM) in for example ecology are more interested in inferences about the spatial processes in play and the included covariates; some approaches may use distance based spatial correlation functions, others use graph or lattice neighbourhoods

When is a “spatial” process actually induced by the analyst

Sometimes when we use spatial data, we jump to spatial statistical methods because of the widely cited first law of geography, that nearby observations are more likely to be similar than those further away
But maybe what we see as clustering, patchiness, pattern, that looks spatial is actually mis-specification, such as a missing covariate, and/or inappropriate functional form, and/or including variables acting at different scales, and/or consequences of suboptimal bounding of tesselated observations …
Here, we’ll first look at the consequences of treating inhomogeneous point as homogeneous (the intensity trends upwards with x)
Then we’ll use those points to see what happens in adding a similar trend to a random variable; empirical variograms are constructed ignoring and including the trend, and tests for global and local spatial autocorrelation are calculated

Spatial point processes

We can start by using spatstat (Baddeley, Rubak, and Turner 2015) to generate completely spatially random (CSR) points with intensity increasing with x to introduce trend inhomogeneity in a unit square (note the use of set.seed():

suppressPackageStartupMessages(library(spatstat))
intenfun <- function(x, y) 200 * x
set.seed(1)
(ihpp <- rpoispp(intenfun, lmax = 200))

## Planar point pattern: 95 points
## window: rectangle = [0, 1] x [0, 1] units

plot(density(ihpp), axes=TRUE)
points(ihpp, col="green2", pch=19)

We can use $\hat{K}$ tests ignoring and including inhomogeneity to adapt the test to the underlying data generation process. The homogeneous test examines the divergence from a theoretical CSR at distance bins, the inhomogeneous tries to guess the kind of patterning in the relative positions of the point observations. If we ignore the inhomogeneity, we find significant clustering at most distances, with the opposite finding when using the test attempting to accommodate inhomogeneity:

opar <- par(mfrow=c(1,2))
plot(envelope(ihpp, Kest, verbose=FALSE), main="Homogeneous")
plot(envelope(ihpp, Kinhom, verbose=FALSE), main="Inhomogeneous")

par(opar)

Adding a `y` trend variable

Coercing the spatstat "ppp" object to an sf "sf" object is trivial, but first adds the window of the point process, which we need to drop by retaining only those labelled "point". Then we take the x coordinate and use it to create y, which trends with x; the DGP is not very noisy.

library(sf)

## Linking to GEOS 3.10.1, GDAL 3.4.0, PROJ 8.2.0

sf_ihpp <- subset(st_as_sf(ihpp), label == "point")
#st_as_sf(ihpp) %>% dplyr::filter(label == "point") -> sf_ihpp
#st_as_sf(ihpp) ->.; .[.$label == "point",] -> sf_ihpp
crds <- st_coordinates(sf_ihpp)
sf_ihpp$x <- crds[,1]
sf_ihpp$y <- 100 + 50 * sf_ihpp$x + 20 * rnorm(nrow(sf_ihpp))
plot(sf_ihpp[,"y"], pch=19)

Variogram model

Variograms for models ignoring (y ~ 1) and including (y ~ x) the trend; if we ignore the trend, we find spurious relationships, shown by loess() fits

suppressPackageStartupMessages(library(gstat))
vg0 <- variogram(y ~ 1, sf_ihpp)
vg1 <- variogram(y ~ x, sf_ihpp)
library(ggplot2)
g0 <- ggplot(vg0, aes(x=dist, y=gamma)) + geom_point() + geom_smooth(span=1) + ylim(300, 575) + ggtitle("Trend ignored")
g1 <- ggplot(vg1, aes(x=dist, y=gamma)) + geom_point() + geom_smooth(span=1) + ylim(300, 575) + ggtitle("Trend included")
gridExtra::grid.arrange(g0, g1, ncol=2)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## Warning: Removed 2 rows containing non-finite values (stat_smooth).

## Warning: Removed 2 rows containing missing values (geom_point).

Spatial autocorrelation

Going on from a continuous to a discrete treatment of space, we can use functions in spdep to define neighbours and then test for global and local spatial autocorrelation. Making first a triangulated neighbour objects, we can thin out improbable neighbours lying too far from one another using a sphere of influence graph to make a symmetric neighbour object:

suppressPackageStartupMessages(library(spdep))
nb_tri <- tri2nb(crds)

(nb_soi <- graph2nb(soi.graph(nb_tri, crds), sym=TRUE))

## Neighbour list object:
## Number of regions: 95 
## Number of nonzero links: 302 
## Percentage nonzero weights: 3.34626 
## Average number of links: 3.178947

The sphere of influence graph generates three subgraphs; a bit unfortunate, but we can live with that (for now):

plot(nb_soi, crds)

comps <- n.comp.nb(nb_soi)
sf_ihpp$comps <- comps$comp.id
comps$nc

## [1] 3

Using binary weights, looking at the variable but ignoring the trend, we find strong global spatial autocorrelation using global Moran’s $I$ (tests return "htest" objects tidied here using broom)

lwB <- nb2listw(nb_soi, style="B")
out <- broom::tidy(moran.test(sf_ihpp$y, listw=lwB, randomisation=FALSE, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out

## # A tibble: 1 × 5
##   `Moran's I` Expectation Variance statistic    p.value
##         <dbl>       <dbl>    <dbl>     <dbl>      <dbl>
## 1       0.342     -0.0106  0.00633      4.44 0.00000917

If, however, we take the residuals after including the trend using a linear model, the apparent spatial autocorrelation evaporates

lm_obj <- lm(y ~ x, data=sf_ihpp)
out <- broom::tidy(lm.morantest(lm_obj, listw=lwB, alternative="two.sided"))[1:5]
names(out)[1:3] <- c("Moran's I", "Expectation", "Variance"); out

## # A tibble: 1 × 5
##   `Moran's I` Expectation Variance statistic[,1] p.value[,1]
##         <dbl>       <dbl>    <dbl>         <dbl>       <dbl>
## 1      0.0579     -0.0199  0.00623         0.986       0.324

The same happens if we calculate local Moran’s $I$ ignoring the trend (randomisation assumption), and including the trend (normality assumption and saddlepoint approximation)

lmor0 <- localmoran(sf_ihpp$y, listw=lwB, alternative="two.sided")
lmor1 <- as.data.frame(localmoran.sad(lm_obj, nb=nb_soi, style="B", alternative="two.sided"))
sf_ihpp$z_value <- lmor0[,4]
sf_ihpp$z_lmor1_N <- lmor1[,2]
sf_ihpp$z_lmor1_SAD <- lmor1[,4]

suppressPackageStartupMessages(library(tmap))
tm_shape(sf_ihpp) + tm_symbols(col=c("z_value", "z_lmor1_N", "z_lmor1_SAD"), midpoint=0) + tm_facets(free.scales=FALSE, nrow=1) + tm_layout(panel.labels=c("No trend", "Trend, normal", "Trend, SAD"))

Look at the North Carolina SIDS data vignette for background:

library(sf)
nc <- st_read(system.file("shapes/sids.shp", package="spData")[1], quiet=TRUE)
st_crs(nc) <- "+proj=longlat +datum=NAD27"
row.names(nc) <- as.character(nc$FIPSNO)
head(nc)

## Simple feature collection with 6 features and 22 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## CRS:           +proj=longlat +datum=NAD27
##       CNTY_ID  AREA PERIMETER CNTY_        NAME  FIPS FIPSNO CRESS_ID BIR74
## 37009    1825 0.114     1.442  1825        Ashe 37009  37009        5  1091
## 37005    1827 0.061     1.231  1827   Alleghany 37005  37005        3   487
## 37171    1828 0.143     1.630  1828       Surry 37171  37171       86  3188
## 37053    1831 0.070     2.968  1831   Currituck 37053  37053       27   508
## 37131    1832 0.153     2.206  1832 Northampton 37131  37131       66  1421
## 37091    1833 0.097     1.670  1833    Hertford 37091  37091       46  1452
##       SID74 NWBIR74 BIR79 SID79 NWBIR79 east north      x       y       lon
## 37009     1      10  1364     0      19  164   176 -81.67 4052.29 -81.48594
## 37005     0      10   542     3      12  183   182 -50.06 4059.70 -81.14061
## 37171     5     208  3616     6     260  204   174 -16.14 4043.76 -80.75312
## 37053     1     123   830     2     145  461   182 406.01 4035.10 -76.04892
## 37131     9    1066  1606     3    1197  385   176 281.10 4029.75 -77.44057
## 37091     7     954  1838     5    1237  411   176 323.77 4028.10 -76.96474
##            lat L_id M_id                       geometry
## 37009 36.43940    1    2 MULTIPOLYGON (((-81.47276 3...
## 37005 36.52443    1    2 MULTIPOLYGON (((-81.23989 3...
## 37171 36.40033    1    2 MULTIPOLYGON (((-80.45634 3...
## 37053 36.45655    1    4 MULTIPOLYGON (((-76.00897 3...
## 37131 36.38799    1    4 MULTIPOLYGON (((-77.21767 3...
## 37091 36.38189    1    4 MULTIPOLYGON (((-76.74506 3...

The variables are largely count data, L_id and M_id are grouping variables. We can also read the original neighbour object:

library(spdep)
gal_file <- system.file("weights/ncCR85.gal", package="spData")[1]
ncCR85 <- read.gal(gal_file, region.id=nc$FIPSNO)
ncCR85

## Neighbour list object:
## Number of regions: 100 
## Number of nonzero links: 492 
## Percentage nonzero weights: 4.92 
## Average number of links: 4.92

plot(st_geometry(nc), border="grey")
plot(ncCR85, st_centroid(st_geometry(nc), of_largest_polygon), add=TRUE, col="blue")

Now generate a random variable. Here I’ve set the seed - maybe choose your own, and compare outcomes with the people around you. With many people in the room, about 5 in 100 may get a draw that is autocorrelated when tested with Moran’s $I$ (why?).

set.seed(1)
nc$rand <- rnorm(nrow(nc))
lw <- nb2listw(ncCR85, style="B")
moran.test(nc$rand, listw=lw, alternative="two.sided")

## 
##  Moran I test under randomisation
## 
## data:  nc$rand  
## weights: lw    
## 
## Moran I statistic standard deviate = -0.82038, p-value = 0.412
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.060754158      -0.010101010       0.003812269

Now we’ll create a trend (maybe try plotting LM to see the trend pattern). Do we get different test outcomes by varying beta and sigma (alpha is constant).

nc$LM <- as.numeric(interaction(nc$L_id, nc$M_id))
alpha <- 1
beta <- 0.5
sigma <- 2
nc$trend <- alpha + beta*nc$LM + sigma*nc$rand
moran.test(nc$trend, listw=lw, alternative="two.sided")

## 
##  Moran I test under randomisation
## 
## data:  nc$trend  
## weights: lw    
## 
## Moran I statistic standard deviate = 6.1163, p-value = 9.58e-10
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.368410659      -0.010101010       0.003829901

To get back to reality, include the trend in a linear model, and test again.

lm.morantest(lm(trend ~ LM, nc), listw=lw, alternative="two.sided")

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = trend ~ LM, data = nc)
## weights: lw
## 
## Moran I statistic standard deviate = -0.50788, p-value = 0.6115
## alternative hypothesis: two.sided
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     -0.049567147     -0.018718176      0.003689356

So we can manipulate a missing variable mis-specification to look like spatial autocorrelation. Is this informative?

Sometimes we only have data on a covariate for aggregates of our units of observation. What happens when we “copy out” these aggregate values to the less aggregated observations? First we’ll aggregate nc by LM, then make a neighbour object for the aggregate units

aggLM <- aggregate(nc[,"LM"], list(nc$LM), head, n=1)
(aggnb <- poly2nb(aggLM))

## Neighbour list object:
## Number of regions: 12 
## Number of nonzero links: 46 
## Percentage nonzero weights: 31.94444 
## Average number of links: 3.833333

Next, draw a random sample for the aggregated units:

set.seed(1)
LMrand <- rnorm(nrow(aggLM))

Check that it does not show any spatial autocorrelation:

moran.test(LMrand, nb2listw(aggnb, style="B"))

## 
##  Moran I test under randomisation
## 
## data:  LMrand  
## weights: nb2listw(aggnb, style = "B")    
## 
## Moran I statistic standard deviate = -0.24911, p-value = 0.5984
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       -0.13101880       -0.09090909        0.02592483

Copy it out to the full data set, indexing on values of LM; the pattern now looks pretty autocorrelated

nc$LMrand <- LMrand[match(nc$LM, aggLM$LM)]
plot(nc[,"LMrand"])

which it is:

moran.test(nc$LMrand, listw=lw, alternative="two.sided")

## 
##  Moran I test under randomisation
## 
## data:  nc$LMrand  
## weights: lw    
## 
## Moran I statistic standard deviate = 9.4301, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.575091242      -0.010101010       0.003850884

Again, we’ve manipulated ourselves into a situation with abundant spatial autocorrelation at the level of the counties, but only by copying out from a more aggregated level. What is going on?

Constructing neighbour and spatial weight objects

Areal units of observation are very often used when simultaneous observations are aggregated within non-overlapping boundaries. The boundaries may be those of administrative entities, and may be related to underlying spatial processes, such as commuting flows, but are usually arbitrary. If they do not match the underlying and unobserved spatial processes in one or more variables of interest, proximate areal units will contain parts of the underlying processes, engendering spatial autocorrelation. By proximity, we mean closeness in ways that make sense for the data generation processes thought to be involved. In cross-sectional geostatistical analysis with point support, measured distance makes sense for typical data generation processes. In similar analysis of areal data, sharing a border may make more sense, because that is what we do know, but we cannot measure the distance between the areas in as adequate a way.

With support of data we mean the physical size (lenth, area, volume) associated with an individual observational unit (measurement). It is possible to represent the support of areal data by a point, despite the fact that the data have polygonal support. The centroid of the polygon may be taken as a representative point, or the centroid of the largest polygon in a multi-polygon object. When data with intrinsic point support are treated as areal data, the change of support goes the other way, from the known point to a non-overlapping tesselation such as a Voronoi diagram or Dirichlet tessellation or Thiessen polygons often through a Delaunay triangulation using projected coordinates. Here, different metrics may also be chosen, or distances measured on a network rather than on the plane. There is also a literature using weighted Voronoi diagrams in local spatial analysis (see for example Boots and Okabe 2007; Okabe et al. 2008; She et al. 2015).

When the intrinsic support of the data is as points, but the underlying process is between proximate observations rather than driven chiefly by distance however measured between observations, the data may be aggregate counts or totals (polling stations, retail turnover) or represent a directly observed characteristic of the observation (opening hours of the polling station). Obviously, the risk of mis-representing the footprint of the underlying spatial processes remains in all of these cases, not least because the observations are taken as encompassing the entirety of the underlying process in the case of tesselation of the whole area of interest. This is distinct from the geostatistical setting in which observations are rather samples taken using some scheme within the area of interest. It is also partly distinct from the practice of taking areal sample plots within the area of interest but covering only a small proportion of the area, typically used in ecological and environmental research.

In order to explore and analyse areal data of these kinds, methods are needed to represent the proximity of observations. This chapter then considers a subset of the such methods, where the spatial processes are considered as working through proximity understood in the first instance as contiguity, as a graph linking observations taken as neighbours. This graph is typically undirected and unweighted, but may be directed and/or weighted in certain settings, which then leads to further issues with regard to symmetry. In principle, proximity would be expected to operate symmetrically in space, that is that the influence of $i$ on $j$ and of $j$ on $i$ based on their relative positions should be equivalent. Edge effects are not considered in standard treatments.

Representing proximity in spdep

Handling spatial autocorrelation using relationships to neighbours on a graph takes the graph as given, chosen by the analyst. This differs from the geostatistical approach in which the analyst chooses the binning of the empirical variogram and function used, and then the way the fitted variogram is fitted. Both involve a priori choices, but represent the underlying correlation in different ways (Wall 2004). In Bavaud (1998) and work citing his contribution, attempts have been made to place graph-based neighbours in a broader context.

One issue arising in the creation of objects representing neighbourhood relationships is that of no-neighbour areal units (R. S. Bivand and Portnov 2004). Islands or units separated by rivers may not be recognised as neighbours when the units have areal support and when using topological relationships such as shared boundaries. In some settings, for example mrf (Markov Random Field) terms in mgcv::gam() and similar model fitting functions that require undirected connected graphs, a requirement is violated when there are disconnected subgraphs.

No-neighbour observations can also occur when a distance threshold is used between points, where the threshold is smaller than the maximum nearest neighbour distance. Shared boundary contiguities are not affected by using geographical, unprojected coordinates, but all point-based approaches use distance in one way or another, and need to calculate distances in an appropriate way.

The spdep package provides an nb class for neighbours, a list of length equal to the number of observations, with integer vector components. No-neighbours are encoded as an integer vector with a single element 0L, and observations with neighbours as sorted integer vectors containing values in 1L:n pointing to the neighbouring observations. This is a typical row-oriented sparse representation of neighbours. spdep provides many ways of constructing nb objects, and the representation and construction functions are widely used in other packages.

spdep builds on the nb representation (undirected or directed graphs) with the listw object, a list with three components, an nb object, a matching list of numerical weights, and a single element character vector containing the single letter name of the way in which the weights were calculated. The most frequently used approach in the social sciences is calculating weights by row standardization, so that all the non-zero weights for one observation will be the inverse of the cardinality of its set of neighbours (1/card(nb[[i]])).

We will be using election data from the 2015 Polish Presidential election in this chapter, with 2495 municipalities and Warsaw boroughs of the municipality types) , and complete count data from polling stations aggregated to these areal units. The data are an sf sf object:

data(pol_pres15, package="spDataLarge")
pol_pres15 |> 
    subset(select=c(TERYT, name, types)) |> 
    head()

## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## Projected CRS: ETRS89 / Poland CS92
##    TERYT                name       types                       geometry
## 1 020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
## 2 020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
## 3 020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
## 4 020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...

library(tmap)
tm_shape(pol_pres15) + tm_fill("types")

Between early 2002 and April 2019, spdep contained functions for constructing and handling neighbour and spatial weights objects, tests for spatial autocorrelation, and model fitting functions. The latter have been split out into spatialreg, and will be discussed in the next chapter. spdep now accommodates objects represented using sf classes and sp classes directly.

library(spdep)

Contiguous neighbours

The poly2nb() function in spdep takes the boundary points making up the polygon boundaries in the object passed as the pl= argument, and for each observation checks whether at least one (queen=TRUE, default), or at least two (rook, queen=FALSE) points are within snap= distance units of each other. The distances are planar in the raw coordinate units, ignoring geographical projections. Once the required number of sufficiently close points is found, the search is stopped.

args(poly2nb)

## function (pl, row.names = NULL, snap = sqrt(.Machine$double.eps), 
##     queen = TRUE, useC = TRUE, foundInBox = NULL) 
## NULL

From spdep 1.1-7, the GEOS interface of the sf package is used within poly2nb() if foundInBox=NULL to find the candidate neighbours and populate foundInBox internally. In this case, this use of spatial indexing (STRtree queries) in GEOS through sf is the default:

system.time(pol_pres15 |> poly2nb(queen=TRUE) -> nb_q)

##    user  system elapsed 
##   0.600   0.019   0.623

The print method shows the summary structure of the neighbour object:

nb_q

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14242 
## Percentage nonzero weights: 0.2287862 
## Average number of links: 5.708216

If the objects are in unprojected coordinate reference systems, as is the case here, the candidate neighbours are found using spherical spatial indexing in the package. There is a minor cost to the use of spherical rather than planar topological predicates but only roughly a doubling of processing time. R spatial packages are choosing to prefer spherical rather than planar representations for geographical coordinate reference systems because straight line segments between two points are only straight in planar settings, not in spherical pretending to be planar, something we have known at least since Al-Khwarizmi.

Much of the work involved in finding contiguous neighbours is spent on finding candidate neighbours with intersecting bounding boxes. Note that nb objects record both symmetric neighbour relationships, because these objects admit asymmetric relationships as well, but these duplications are not needed for object construction.

Most of the spdep functions for constructing neighbour objects take a row.names= argument, the value of which is stored as a region.id attribute. If not given, the values are taken from row.names() of the first argument. These can be used to check that the neighbours object is in the same order as data. If nb objects are subsetted, the indices change to continue to be within 1:length(subsetted_nb), but the region.id attribute values point back to the object from which it was constructed.

We can also check that this undirected graph is connected using the n.comp.nb() function; while some model estimation techniques do not support graphs that are not connected, it is helpful to be aware of possible problems (Freni-Sterrantino, Ventrucci, and Rue 2018):

(nb_q |> n.comp.nb())$nc

## [1] 1

This approach is equivalent to treating the neighbour object as a graph and using graph analysis on that graph (Csardi and Nepusz 2006), by first coercing to a binary sparse matrix:

library(Matrix)
nb_q |> 
    nb2listw(style="B") |> 
    as("CsparseMatrix") -> smat
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:spatstat.geom':
## 
##     diameter, edges, is.connected, vertices

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

(smat |> 
        graph.adjacency() -> g1) |> 
    count_components()

## [1] 1

Neighbour objects may be exported and imported in GAL format for exchange with other software, using write.nb.gal() and read.gal():

tf <- tempfile(fileext=".gal")
write.nb.gal(nb_q, tf)

Graph-based neighbours

If areal units are an appropriate representation, but only points have been observed, contiguity relationships may be approximated using graph-based neighbours. In this case, the imputed boundaries tesselate the plane such that points closer to one observation than any other fall within its polygon. The simplest form is by using triangulation, here using the deldir() function in the deldir package. Because the function returns from and to identifiers, it is easy to construct a long representation of a listw object, as used in the S-Plus SpatialStats module and the sn2listw() function internally to construct an nb object (ragged wide representation). Alternatives often fail to return sufficient information to permit the neighbours to be identified.

The soi.graph() function takes triangulated neighbours and prunes off neighbour relationships represented by unusually long edges, especially around the convex hull, but may render the output object asymmetric. Other graph-based approaches include relativeneigh() and gabrielneigh().

The output of these functions is then converted to the nb representation using graph2nb(), with the possible use of the sym= argument to coerce to symmetry. We take the centroids of the largest component polygon for each observation as the point representation; population-weighted centroids might have been a better choice if they were available:

pol_pres15 |> 
    st_geometry() |> 
    st_centroid(of_largest_polygon=TRUE) -> coords 
(coords |> tri2nb() -> nb_tri)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14930 
## Percentage nonzero weights: 0.2398384 
## Average number of links: 5.983968

The average number of neighbours is similar to the Queen boundary contiguity case, but if we look at the distribution of edge lengths using nbdists(), we can see that although the upper quartile is about 15 km, the maximum is almost 300 km, an edge along much of one side of the convex hull. The short minimum distance is also of interest, as many centroids of urban municipalities are very close to the centroids of their surrounding rural counterparts.

nb_tri |> 
    nbdists(coords) |> 
    unlist() |> 
    summary()

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    246.6   9847.2  12151.2  13485.2  14993.5 296973.7

Triangulated neighbours also yield a connected graph:

(nb_tri |> n.comp.nb())$nc

## [1] 1

The sphere of influence graph trims a neighbour object such as nb_tri to remove edges that seem long in relation to typical neighbours (Avis and Horton 1985).

(nb_tri |> 
        soi.graph(coords) |> 
        graph2nb() -> nb_soi)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 12792 
## Percentage nonzero weights: 0.2054932 
## Average number of links: 5.127054

Unpicking the triangulated neighbours does however remove the connected character of the underlying graph:

(nb_soi |> n.comp.nb() -> n_comp)$nc

## [1] 16

The SoI algorithm has stripped out longer edges leading to urban and rural municipality pairs where their centroids are very close to each other because the rural ones completely surround the urban, giving 15 pairs of neighbours unconnected to the main graph:

table(n_comp$comp.id)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2465    2    2    2    2    2    2    2    2    2    2    2    2    2    2    2

The largest length edges along the convex hull have been removed, but “holes” have appeared where the unconnected pairs of neighbours have appeared. The differences between nb_tri and nb_soi are shown in orange.

opar <- par(mar=c(0,0,0,0)+0.5)
pol_pres15 |> 
    st_geometry() |> 
    plot(border="grey", lwd=0.5)
nb_soi |> 
    plot(coords=coords, add=TRUE, points=FALSE, lwd=0.5)
nb_tri |> 
    diffnb(nb_soi) |> 
    plot(coords=coords, col="orange", add=TRUE, points=FALSE, lwd=0.5)

par(opar)

Distance-based neighbours

Distance-based neighbours can be constructed using dnearneigh(), with a distance band with lower d1= and upper d2= bounds controlled by the bounds= argument. If spherical coordinates are used and either specified in the coordinates object x or with x as a two column matrix and longlat=TRUE, great circle distances in km will be calculated assuming the WGS84 reference ellipsoid. From spdep 1.1-7, two arguments have been added, to use functionality in the dbscan package for finding neighbours using a kd-tree in two or three dimensions by default, and not to test the symmetry of the output neighbour object.

The knearneigh() function for $k$-nearest neighbours returns a knn object, converted to an nb object using knn2nb(). It can also use great circle distances, not least because nearest neighbours may differ when uprojected coordinates are treated as planar. k= should be a small number. For projected coordinates, the dbscan package is used to compute nearest neighbours more efficiently. Note that nb objects constructed in this way are most unlikely to be symmetric, hence knn2nb() has a sym= argument to permit the imposition of symmetry, which will mean that all units have at least k= neighbours, not that all units will have exactly k= neighbours.

The nbdists() function returns the length of neighbour relationship edges in the units of the coordinates if the coordinates are projected, in km otherwise. In order to set the upper limit for distance bands, one may first find the maximum first nearest neighbour distance, using unlist() to remove the list structure of the returned object.

coords |> 
    knearneigh(k=1) |> 
    knn2nb() |> 
    nbdists(coords) |> 
    unlist() |> 
    summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   246.5  6663.4  8538.0  8275.1 10123.9 17978.8

Here the largest first nearest neighbour distance is just under 18 km, so using this as the upper threshold gives certainty that all units will have at least one neighbour:

system.time(coords |> dnearneigh(0, 18000) -> nb_d18)

##    user  system elapsed 
##   0.171   0.003   0.175

system.time(coords |> dnearneigh(0, 18000, use_kd_tree=FALSE) -> nb_d18a)

##    user  system elapsed 
##   0.168   0.002   0.172

all.equal(nb_d18, nb_d18a, check.attributes=FALSE)

## [1] TRUE

nb_d18

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 20358 
## Percentage nonzero weights: 0.3270348 
## Average number of links: 8.159519

However, even though there are no no-neighbour observations (their presence is reported by the print method for nb objects), the graph is not connected, as a pair of observations are each others’ only neighbours.

(nb_d18 |> n.comp.nb() -> n_comp)$nc

## [1] 2

table(n_comp$comp.id)

## 
##    1    2 
## 2493    2

Adding 300 m to the threshold gives us a neighbour object with no no-neighbour units, and all units can be reached from all others across the graph.

(coords |> dnearneigh(0, 18300) -> nb_d183)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 21086 
## Percentage nonzero weights: 0.3387296 
## Average number of links: 8.451303

(nb_d183 |> n.comp.nb())$nc

## [1] 1

One characteristic of distance-based neighbours is that more densely settled areas, with units which are smaller in terms of area (Warsaw boroughs are much smaller on average, but have almost 30 neighbours). Having many neighbours smooths the neighbour relationship across more neighbours. For use later, we also construct a neighbour object with no-neighbour units, using a threshold of 16 km:

(coords |> dnearneigh(0, 16000) -> nb_d16)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 15850 
## Percentage nonzero weights: 0.2546175 
## Average number of links: 6.352705 
## 7 regions with no links:
## 569 1371 1522 2374 2385 2473 2474

It is possible to control the numbers of neighbours directly using $k$-nearest neighbours, either accepting asymmetric neighbours or imposing symmetry:

(coords |> knearneigh(k=6) -> knn_k6) |> knn2nb()

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14970 
## Percentage nonzero weights: 0.240481 
## Average number of links: 6 
## Non-symmetric neighbours list

(knn_k6 |> knn2nb(sym=TRUE) -> nb_k6s)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 16810 
## Percentage nonzero weights: 0.2700391 
## Average number of links: 6.737475

Here the size of k= is sufficient to ensure connectedness, although the graph is not planar as edges cross at locations other than nodes, which is not the case for contiguous or graph-based neighbours.

(nb_k6s |> n.comp.nb())$nc

## [1] 1

Weights specification

Once neighbour objects are available, further choices need to made in specifying the weights objects. The nb2listw() function is used to create a listw weights object with an nb object, a matching list of weights vectors, and a style specification. Because handling no-neighbour observations now begins to matter, the zero.policy= argument is introduced. By default, this is FALSE, indicating that no-neighbour observations will cause an error, as the spatially lagged value for an observation with no neighbours is not available. By convention, zero is substituted for the lagged value, as the cross product of a vector of zero-valued weights and a data vector, hence the name of zero.policy.

args(nb2listw)

## function (neighbours, glist = NULL, style = "W", zero.policy = NULL) 
## NULL

We will be using the helper function spweights.constants() below to show some consequences of varing style choices. It returns constants for a listw object, $n$ is the number of observations, n1 to n3 are $n-1, \ldots$, nn is $n^2$ and $S_0$, $S_1$ and $S_2$ are constants, $S_0$ being the sum of the weights. There is a full discussion of the constants in Bivand and Wong (2018).

args(spweights.constants)

## function (listw, zero.policy = NULL, adjust.n = TRUE) 
## NULL

The "B" binary style gives a weight of unity to each neighbour relationship, and typically upweights units with no boundaries on the edge of the study area.

(nb_q |> 
        nb2listw(style="B") -> lw_q_B) |> 
    spweights.constants() |> 
    data.frame() |> 
    subset(select=c(n, S0, S1, S2))

##      n    S0    S1     S2
## 1 2495 14242 28484 357280

The "W" row-standardized style upweights units around the edge of the study area that necessarily have fewer neighbours. This style first gives a weight of unity to each neighbour relationship, then divides these weights by the per unit sums of weights. Naturally this leads to division by zero where there are no neighbours, a not-a-number result, unless the chosen policy is to permit no-neighbour observations. We can see that $S_0$ is now equal to $n$.

(nb_q |> 
        nb2listw(style="W") -> lw_q_W) |> 
    spweights.constants() |> 
    data.frame() |> 
    subset(select=c(n, S0, S1, S2))

##      n   S0       S1       S2
## 1 2495 2495 957.5303 10406.44

Inverse distance weights are used in a number of scientific fields. Some use dense inverse distance matrices, but many of the inverse distances are close to zero, so have little practical contribution, especially as the spatial process matrix is itself dense. Inverse distance weights may be constructed by taking the lengths of edges, changing units to avoid most weights being too large or small (here from m to km), taking the inverse, and passing through the glist= argument to nb2listw():

nb_d183 |> 
    nbdists(coords) |> 
    lapply(\(x) 1/(x/1000)) -> gwts
(nb_d183 |> nb2listw(glist=gwts, style="B") -> lw_d183_idw_B) |> 
    spweights.constants() |> 
    data.frame() |> 
    subset(select=c(n, S0, S1, S2))

##      n       S0       S1       S2
## 1 2495 1841.345 533.7606 7264.958

No-neighbour handling is by default to prevent the construction of a weights object, making the analyst take a position on how to proceed.

try(nb_d16 |> nb2listw(style="B") -> lw_d16_B)

## Error in nb2listw(nb_d16, style = "B") : Empty neighbour sets found

Use can be made of the zero.policy= argument to many functions used with nb and listw objects.

nb_d16 |> 
    nb2listw(style="B", zero.policy=TRUE) |> 
    spweights.constants(zero.policy=TRUE) |> 
    data.frame() |> 
    subset(select=c(n, S0, S1, S2))

##      n    S0    S1     S2
## 1 2488 15850 31700 506480

Note that by default the adjust.n= argument to spweights.constants() is set by default to TRUE, subtracting the count of no-neighbour observations from the observation count, so $n$ is smaller with possible consequences for inference. The complete count can be retrieved by changing the argument.

Higher order neighbours

We recall the characteristics of the neighbour object based on Queen contiguities:

nb_q

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 14242 
## Percentage nonzero weights: 0.2287862 
## Average number of links: 5.708216

If we wish to create an object showing $i$ to $k$ neighbours, where $i$ is a neighbour of $j$, and $j$ in turn is a neighbour of $k$, so taking two steps on the neighbour graph, we can use nblag(), which automatically removes $i$ to $i$ self-neighbours:

(nb_q |> nblag(2) -> nb_q2)[[2]]

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 32930 
## Percentage nonzero weights: 0.5289939 
## Average number of links: 13.1984

The nblag_cumul() function cumulates the list of neighbours for the whole list of lags:

nblag_cumul(nb_q2)

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 47172 
## Percentage nonzero weights: 0.7577801 
## Average number of links: 18.90661

while the set operation union.nb() takes two objects, giving here the same outcome:

union.nb(nb_q2[[2]], nb_q2[[1]])

## Neighbour list object:
## Number of regions: 2495 
## Number of nonzero links: 47172 
## Percentage nonzero weights: 0.7577801 
## Average number of links: 18.90661

Returning to the graph representation of the same neighbour object, we can ask how many steps might be needed to traverse the graph:

diameter(g1)

## [1] 52

We step out from each observation across the graph to establish the number of steps needed to reach each other observation by the shortest path, once again finding the same count, and that the municipality is called Lutowiska, close to the Ukrainian border in the far south east of the country.

g1 |> shortest.paths() -> sps
(sps |> apply(2, max) -> spmax) |> max()

## [1] 52

mr <- which.max(spmax)
pol_pres15$name0[mr]

## [1] "Lutowiska"

This shows that contiguity neighbours represent the same kinds of relationships with other observations as distance. Some approaches prefer distance neighbours on the basis that, for example, inverse distance neighbours show clearly how all observations are related to each other. However, the development of both tests for spatial autocorrelation, and spatial regression models has involved the inverse of a spatial process model, which in turn can be represented as the sum of a power series of the neighbour object and a coefficient, so intrinsically acknowledging the relationships of all with all other. Sparse contiguity neighbour objects accommodate rich dependency structures without the need to make the structures explicit.

Relationship of shortest paths to distance for Lutowiska; left panel: shortest path counts from Lutowiska; right panel: plot of shortest paths from Lutowiska to other observations and distances (km) from Lutowiska to other observations

gridExtra::grid.arrange(tmap_grob(tm1), g1, nrow=1)

## Note that the aspect ratio for which the grob has been generated is 1.4

Finally, we create contiguity neighbours for the US Census tracts, completing in a respectable 25 seconds:

library(spdep)
t0 <- system.time(nb_subset <- poly2nb(df_tracts, queen=TRUE, row.names=map10$GEOID))
saveRDS(nb_subset, file="nb_subset.rds")
#   user  system elapsed 
# 24.233   0.129  24.417

library(spdep)
(nb_subset <- readRDS("nb_subset.rds"))

## Neighbour list object:
## Number of regions: 71353 
## Number of nonzero links: 438146 
## Percentage nonzero weights: 0.008605862 
## Average number of links: 6.140541 
## 17 regions with no links:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306
## 68844 68898 68901 69293 69294

Measures of spatial autocorrelation

library(sf)

data(pol_pres15, package="spDataLarge")
pol_pres15 |> 
    subset(select=c(TERYT, name, types)) |> 
    head()

## Simple feature collection with 6 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: 235157.1 ymin: 366913.3 xmax: 281431.7 ymax: 413016.6
## Projected CRS: ETRS89 / Poland CS92
##    TERYT                name       types                       geometry
## 1 020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
## 2 020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
## 3 020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
## 4 020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
## 5 020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
## 6 020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...

library(spdep)
pol_pres15 |> poly2nb(queen=TRUE) -> nb_q
nb_q |> nb2listw(style="B") -> lw_q_B
nb_q |> nb2listw(style="W") -> lw_q_W

pol_pres15 |> st_geometry() |> st_centroid(of_largest_polygon=TRUE) -> coords 
coords |> dnearneigh(0, 18300) -> nb_d183
nb_d183 |> nbdists(coords) |> lapply(\(x) 1/(x/1000)) -> gwts
nb_d183 |> nb2listw(glist=gwts, style="B") -> lw_d183_idw_B

When analysing areal data, it has long been recognised that, if present, spatial autocorrelation changes how we may infer, relative to the default position of independent observations. In the presence of spatial autocorrelation, we can predict the values of observation $i$ from the values observed at $j \in N_i$, the set of its proximate neighbours. Early results (Moran 1948; Geary 1954), entered into research practice gradually, for example the social sciences (Duncan, Cuzzort, and Duncan 1961). These results were then collated and extended to yield a set of basic tools of analysis (Cliff and Ord 1973, 1981).

Cliff and Ord (1973) generalised and extended the expression of the spatial weights matrix representation as part of the framework for establishing the distribution theory for join count, Moran’s $I$ and Geary’s $C$ statistics. This development of what have become known as global measures, returning a single value of autocorrelation for the total study area, has been supplemented by local measures returning values for each areal unit (Getis and Ord 1992; L. Anselin 1995).

Global measures

We will begin by examining join count statistics, where joincount.test() takes a factor vector of values fx= and a listw object, and returns a list of htest (hypothesis test) objects defined in the stats package, one htest object for each level of the fx= argument. The observed counts are of neighbours with the same factor levels, known as same-colour joins.

args(joincount.test)

## function (fx, listw, zero.policy = NULL, alternative = "greater", 
##     sampling = "nonfree", spChk = NULL, adjust.n = TRUE) 
## NULL

The function takes an alternative= argument for hypothesis testing, a sampling= argument showing the basis for the construction of the variance of the measure, where the default "nonfree" choice corresponds to analytical permutation; the spChk= argument is retained for backward compatibility. For reference, the counts of factor levels for the type of municipality or Warsaw borough are:

(pol_pres15 |> 
        st_drop_geometry() |> 
        subset(select=types, drop=TRUE) -> Types) |> 
    table()

## 
##          Rural          Urban    Urban/rural Warsaw Borough 
##           1563            303            611             18

Since there are four levels, we re-arrange the list of htest objects to give a matrix of estimated results. The observed same-colour join counts are tabulated with their expectations based on the counts of levels of the input factor, so that few joins would be expected between for example Warsaw boroughs, because there are very few of them. The variance calculation uses the underlying constants of the chosen listw object and the counts of levels of the input factor. The z-value is obtained in the usual way by dividing the difference between the observed and expected join counts by the square root of the variance.

The join count test was subsequently adapted for multi-colour join counts (Upton and Fingleton 1985). The implementation as joincount.mult() in spdep returns a table based on nonfree sampling, and does not report p-values.

Types |> joincount.multi(listw=lw_q_B)

##                                Joincount   Expected   Variance  z-value
## Rural:Rural                   3087.00000 2793.92018 1126.53420   8.7320
## Urban:Urban                    110.00000  104.71854   93.29937   0.5468
## Urban/rural:Urban/rural        656.00000  426.52553  331.75903  12.5986
## Warsaw Borough:Warsaw Borough   41.00000    0.35018    0.34743  68.9646
## Urban:Rural                    668.00000 1083.94086  708.20864 -15.6297
## Urban/rural:Rural             2359.00000 2185.76854 1267.13133   4.8665
## Urban/rural:Urban              171.00000  423.72864  352.18954 -13.4669
## Warsaw Borough:Rural            12.00000   64.39253   46.45991  -7.6865
## Warsaw Borough:Urban             9.00000   12.48300   11.75800  -1.0158
## Warsaw Borough:Urban/rural       8.00000   25.17200   22.35382  -3.6320
## Jtot                          3227.00000 3795.48557 1496.39842 -14.6959

So far, we have used binary weights, so the sum of join counts multiplied by the weight on that join remains integer. If we change to row standardised weights, where the weights are not unity in all cases, the counts, expectations and variances change, but there are few major changes in the z-values.

Using an inverse distance based listw object does, however, change the z-values markedly, because closer centroids are upweighted relatively strongly:

Types |> joincount.multi(listw=lw_d183_idw_B)

##                                Joincount   Expected   Variance  z-value
## Rural:Rural                   3.4648e+02 3.6123e+02 4.9314e+01  -2.1003
## Urban:Urban                   2.9045e+01 1.3539e+01 2.2281e+00  10.3877
## Urban/rural:Urban/rural       4.6498e+01 5.5145e+01 9.6134e+00  -2.7891
## Warsaw Borough:Warsaw Borough 1.6822e+01 4.5275e-02 6.6083e-03 206.3805
## Urban:Rural                   2.0206e+02 1.4014e+02 2.3645e+01  12.7338
## Urban/rural:Rural             2.2517e+02 2.8260e+02 3.5892e+01  -9.5852
## Urban/rural:Urban             3.6499e+01 5.4784e+01 8.8586e+00  -6.1434
## Warsaw Borough:Rural          5.6502e+00 8.3253e+00 1.7260e+00  -2.0362
## Warsaw Borough:Urban          9.1800e+00 1.6139e+00 2.5392e-01  15.0150
## Warsaw Borough:Urban/rural    3.2676e+00 3.2545e+00 5.5180e-01   0.0177
## Jtot                          4.8183e+02 4.9072e+02 4.1570e+01  -1.3781

The implementation of Moran’s $I$ in spdep in the moran.test() function has similar arguments to those of joincount.test(), but sampling= is replaced by randomisation= to indicate the underlying analytical approach used for calculating the variance of the measure. It is also possible to use ranks rather than numerical values (Cliff and Ord 1981, 46). The drop.EI2= agrument may be used to reproduce results where the final component of the variance term is omitted as found in some legacy software implementations.

args(moran.test)

## function (x, listw, randomisation = TRUE, zero.policy = NULL, 
##     alternative = "greater", rank = FALSE, na.action = na.fail, 
##     spChk = NULL, adjust.n = TRUE, drop.EI2 = FALSE) 
## NULL

The default for the randomisation= argument is TRUE, but here we will simply show that the test under normality is the same as a test of least squares residuals with only the intercept used in the mean model. The spelling of randomisation is that of Cliff and Ord (1973).

glance_htest <- function(ht) c(ht$estimate, 
    "Std deviate"=unname(ht$statistic), 
    "p.value"=unname(ht$p.value))
(pol_pres15 |> 
        st_drop_geometry() |> 
        subset(select=I_turnout, drop=TRUE) -> z) |> 
    moran.test(listw=lw_q_B, randomisation=FALSE) |> 
    glance_htest()

## Moran I statistic       Expectation          Variance       Std deviate 
##      0.6914339743     -0.0004009623      0.0001400449     58.4613487704 
##           p.value 
##      0.0000000000

The lm.morantest() function also takes a resfun= argument to set the function used to extract the residuals used for testing, and clearly lets us model other salient features of the response variable (Cliff and Ord 1981, 203). To compare with the standard test, we are only using the intercept here, and as can be seen, the results are the same.

lm(I_turnout ~ 1, pol_pres15) |> 
    lm.morantest(listw=lw_q_B) |> 
    glance_htest()

## Observed Moran I      Expectation         Variance      Std deviate 
##     0.6914339743    -0.0004009623     0.0001400449    58.4613487704 
##          p.value 
##     0.0000000000

The only difference between tests under normality and randomisation is that an extra term is added if the kurtosis of the variable of interest indicates a flatter or more peaked distribution, where the measure used is the classical measure of kurtosis. Under the default randomisation assumption of analytical randomisation, the results are largely unchanged.

(z |> 
    moran.test(listw=lw_q_B) -> mtr) |> 
    glance_htest()

## Moran I statistic       Expectation          Variance       Std deviate 
##      0.6914339743     -0.0004009623      0.0001400522     58.4598351527 
##           p.value 
##      0.0000000000

Of course, from the very beginning, interest was shown in Monte Carlo testing, also known as a Hope-type test and as a permutation bootstrap. By default, moran.mc() returns a "htest" object, but may simply use boot::boot() internally and return a "boot" object when return_boot=TRUE. In addition the number of simulations of the variable of interest by permutation, that is shuffling the values across the observations at random, needs to be given as nsim=.

set.seed(1)
z |> 
    moran.mc(listw=lw_q_B, nsim=999, return_boot = TRUE) -> mmc

The bootstrap permutation retains the outcomes of each of the random permutations, reporting the observed value of the statistic, here Moran’s $I$, the difference between this value and the mean of the simulations under randomisation (equivalent to $E(I)$), and the standard deviation of the simulations under randomisation.

If we compare the Monte Carlo and analytical variances of $I$ under randomisation, we typically see few differences, arguably rendering Monte Carlo testing unnecessary.

c("Permutation bootstrap"=var(mmc$t), 
  "Analytical randomisation"=unname(mtr$estimate[3]))

##    Permutation bootstrap Analytical randomisation 
##             0.0001441596             0.0001400522

Geary’s global $C$ is implemented in geary.test() largely following the same argument structure as moran.test(). The Getis-Ord $G$ test includes extra arguments to accommodate differences between implementations, as Roger S. Bivand and Wong (2018) found multiple divergences from the original definitions, often to omit no-neighbour observations generated when using distance band neighbours. It is given by (Getis and Ord 1992, 194). For $G_*$, the $\sum_{(2)}$ constraint is relaxed by including $i$ as a neighbour of itself (thereby also removing the no-neighbour problem, because all observations have at least one neighbour).

Finally, the empirical Bayes Moran’s $I$ takes account of the denominator in assessing spatial autocorrelation in rates data (Assunção and Reis 1999). Until now, we have considered the proportion of valid votes cast in relation to the numbers entitled to vote by spatial entity, but using EBImoran.mc() we can try to accommodate uncertainty in extreme rates in entities with small numbers entitled to vote. There is, however, little impact on the outcome in this case.

Global measures of spatial autocorrelation using spatial weights objects based on graphs of neighbours are, as we have seen, rather blunt tools, which for interpretation depend critically on a reasoned mean model of the variable in question. If the mean model is just the intercept, the global measures will respond to all kinds of mis-specification, not only spatial autocorrelation. A key source of mis-specification will typically also include the choice of entities for aggregation of data.

ACS CV data set

df_tracts <- st_read("df_tracts.gpkg")

## Reading layer `df_tracts' from data source 
##   `/home/rsb/und/ecs530/2021/df_tracts.gpkg' using driver `GPKG'
## Simple feature collection with 71353 features and 28 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7332 ymin: 24.5447 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83

(nb_subset <- readRDS("nb_subset.rds"))

## Neighbour list object:
## Number of regions: 71353 
## Number of nonzero links: 438146 
## Percentage nonzero weights: 0.008605862 
## Average number of links: 6.140541 
## 17 regions with no links:
## 12562 26251 28654 28696 31900 32259 42437 45571 45614 51434 57159 57306
## 68844 68898 68901 69293 69294

nc_nb_subset <- n.comp.nb(nb_subset)
nc_nb_subset$nc

## [1] 31

table(table(nc_nb_subset$comp.id))

## 
##     1     2     4    13    15    17   107 71156 
##    17     4     5     1     1     1     1     1

set.ZeroPolicyOption(TRUE)

## [1] FALSE

lw <- nb2listw(nb_subset, style="W")

system.time(mt_med_inc_cv <- moran.test(df_tracts$med_inc_cv, lw, alternative="two.sided"))

##    user  system elapsed 
##   2.826   0.000   2.840

mt_med_inc_cv

## 
##  Moran I test under randomisation
## 
## data:  df_tracts$med_inc_cv  
## weights: lw  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 85.539, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      1.887741e-01     -1.401836e-05      4.871047e-06

form <- log(med_inc_cv) ~ log1p(vacancy_rate) + log1p(old_rate) + log1p(black_rate) + log1p(hisp_rate) + log1p(group_pop) + log1p(dens)

lm_mod <- lm(form, data=df_tracts)
summary(lm_mod)

## 
## Call:
## lm(formula = form, data = df_tracts)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.14822 -0.29445  0.01038  0.30068  2.99263 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -6.2284926  0.0355078 -175.41   <2e-16 ***
## log1p(vacancy_rate)  0.4039100  0.0248345   16.26   <2e-16 ***
## log1p(old_rate)      0.4069179  0.0040408  100.70   <2e-16 ***
## log1p(black_rate)    0.5478924  0.0110139   49.75   <2e-16 ***
## log1p(hisp_rate)     0.5854893  0.0121825   48.06   <2e-16 ***
## log1p(group_pop)     0.0216907  0.0007901   27.45   <2e-16 ***
## log1p(dens)          0.0354277  0.0018286   19.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4526 on 71346 degrees of freedom
## Multiple R-squared:  0.2296, Adjusted R-squared:  0.2295 
## F-statistic:  3543 on 6 and 71346 DF,  p-value: < 2.2e-16

lm.morantest(lm_mod, lw)

## 
##  Global Moran I for regression residuals
## 
## data:  
## model: lm(formula = form, data = df_tracts)
## weights: lw
## 
## Moran I statistic standard deviate = 50.637, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.117328e-01    -6.347763e-05     4.874291e-06

Local measures

Building on insights from the weaknesses of global measures, local indicators of spatial association began to appear in the first half of the 1990s (L. Anselin 1995; Getis and Ord 1992, 1996). In addition, the Moran plot was introduced, plotting the values of the variable of interest against their spatially lagged values, typically using row-standardised weights to make the axes more directly comparable (L. Anselin 1996). The moran.plot() function also returns an influence measures object used to label observations exerting more than proportional influence on the slope of the line representing global Moran’s $I$. We can see that there are many spatial entities exerting such influence. These pairs of observed and lagged observed values make up in aggregate the global measure, but can also be explored in detail. The quadrants of the Moran plot also show low-low pairs in the lower left quadrant, high-high in the upper right quadrant, and fewer low-high and high-low pairs in the upper left and lower right quadrants.

z |> moran.plot(listw=lw_q_W, labels=pol_pres15$TERYT, cex=1, pch=".", xlab="I round turnout", ylab="lagged turnout") -> infl_W

If we extract the hat value influence measure from the returned object, the figure suggests that some edge entities exert more than proportional influence (perhaps because of row standardisation), as do entities in or near larger urban areas.

pol_pres15$hat_value <- infl_W$hat
library(tmap)
tm_shape(pol_pres15) + tm_fill("hat_value")

Roger S. Bivand and Wong (2018) discuss issues impacting the use of local indicators, such as local Moran’s $I$ and local Getis-Ord $G$. Some issues affect the calculation of the local indicators, others inference from their values. Because $n$ statistics may be being calculated from the same number of observations, there are multiple comparison problems that need to be addressed. Although the apparent detection of hotspots from values of local indicators has been quite widely adopted, it remains fraught with difficulty because adjustment of the inferential basis to accommodate multiple comparisons is not often chosen, and as in the global case, mis-specification also remains a source of confusion. Further, interpreting local spatial autocorrelation in the presence of global spatial autocorrelation is challenging (Ord and Getis 2001; Tiefelsdorf 2002; R. S. Bivand, Müller, and Reder 2009). The mlvar= and adjust.x= arguments to localmoran() are discussed in Roger S. Bivand and Wong (2018), and permit matching with other implementations. The p.adjust.method= argument uses an untested speculation that adjustment should only take into account the cardinality of the neighbour set of each observation when adjusting for multiple comparisons; using stats::p.adjust() is preferable.

z |> 
    localmoran(listw=lw_q_W, conditional=FALSE, alternative="two.sided") -> locm

Recent development: Sauer et al. (n.d.) point out that the analytical inferential basis has been unclear, and argue that conditional= (a new argument) should be used rather than the legacy “total” setting based on randomising all the observed values, as in the case of the global test, not the observed values except those at $i$. For reference, here we take the legacy values and contrast them below with the new default.

Taking "two.sided" p-values because these local indicators when summed and divided by the sum of the spatial weights, and thus positive and negative local spatial autocorrelation may be present, we obtain:

all.equal(sum(locm[,1])/Szero(lw_q_W), unname(moran.test(z, lw_q_W)$estimate[1]))

## [1] TRUE

Using stats::p.adjust() to adjust for multiple comparisons, we see that almost 29% of the local measures have p-values < 0.05 if no adjustment is applied, but only 12% using Bonferroni adjustment, with two other choices also shown:

pva <- \(pv) cbind("none"=pv, "bonferroni"=p.adjust(pv, "bonferroni"), "fdr"=p.adjust(pv, "fdr"), "BY"=p.adjust(pv, "BY"))
locm |> 
    subset(select="Pr(z != E(Ii))", drop=TRUE) |> 
    pva() -> pvsp
f <- \(x) sum(x < 0.05)
apply(pvsp, 2, f)

##       none bonferroni        fdr         BY 
##        715        297        576        424

In the global measure case, bootstrap permutations could be used as an alternative to analytical methods for possible inference. In the local case, conditional permutation may be used, retaining the value at observation $i$ and randomly sampling from the remaining $n-1$ values to find randomised values at neighbours, and is provided as localmoran_perm(), which will use multiple nodes to sample in parallel if provided, and permits the setting of a seed for the random number generator across the compute nodes:

library(parallel)
set.coresOption(ifelse(detectCores() == 1, 1, detectCores()-1L))

## NULL

system.time(z |> 
        localmoran_perm(listw=lw_q_W, nsim=499, alternative="two.sided", iseed=1) -> locm_p)

##    user  system elapsed 
##   1.163   0.669   0.473

The outcome is that almost 32% of observations have two sided p-values < 0.05 without multiple comparison adjustment, and under 3% with Bonferroni adjustment.

locm_p |> 
    subset(select="Pr(z != E(Ii))", drop=TRUE) |> 
    pva() -> pvsp
apply(pvsp, 2, f)

##       none bonferroni        fdr         BY 
##        783         68        466        161

We can see what is happening by tabulating counts of the standard deviate of local Moran’s $I$, where the two-sided $\alpha=0.05$ bounds would be $0.025$ and $0.975$, but Bonferroni adjustment is close to $0.00001$ and $0.99999$. Without adjustment, almost 800 observations are significant, with Bonferroni adjustment, only 68 in the conditional permutation case:

brks <- qnorm(c(0, 0.00001, 0.0001, 0.001, 0.01, 0.025, 0.5, 0.975, 0.99, 0.999, 0.9999, 0.99999, 1))
(locm_p |> 
        subset(select=Z.Ii, drop=TRUE) |> 
        cut(brks) |> 
        table()-> tab)

## 
##  (-Inf,-4.26] (-4.26,-3.72] (-3.72,-3.09] (-3.09,-2.33] (-2.33,-1.96] 
##             0             0             1             4             3 
##     (-1.96,0]      (0,1.96]   (1.96,2.33]   (2.33,3.09]   (3.09,3.72] 
##           471          1241           182           320           142 
##   (3.72,4.26]   (4.26, Inf] 
##            63            68

sum(tab[c(1:5, 8:12)])

## [1] 783

sum(tab[c(1, 12)])

## [1] 68

Roger S. Bivand and Wong (2018) were puzzled by the discrepancy between the analytical and permutation-based standard deviates of local Moran’s $I_i$. Because local indicators of spatial association have become such a central part of exploratory data analysis, some of the puzzles in the practical application of these measures will be examined here in detail, as a case study in open source implementation and comparison. In Roger S. Bivand and Wong (2018), we did not read the appendix of Sokal, Oden, and Thomson (1998) carefully enough, as has been pointed out by Sauer et al. (n.d.). They identified the underlying reason for the discrepancy noted above in the incorrect comparison of the analytical derivation of the expectations of the first two moments of $I_i$ based on the total'' randomization, and on conditional randomization (fixing the value of the variable at $i$). They point out that we confused the null hypothesis (total’’ versus ``conditional’’) and methods of calculation (analytical versus simulation/permutation), and that Sokal, Oden, and Thomson (1998) provide analytical formulae for both ways of specifying the null. They show clearly that analytical conditional randomization and simulation-based conditional approaches yield very similar outcomes. They generously contributed code to and have included equivalent code in PySAL .

While used to use the analytical total null as default, the argument is now by default; legacy script output can be retrived by setting the argument to . Previously, the default of the argument was but has been changed to as more appropriate for local indicators.

z |> 
    localmoran(listw=lw_q_W, conditional=TRUE, alternative="two.sided") -> locm_c

locm_c |> 
    subset(select="Pr(z != E(Ii))", drop=TRUE) |> 
    pva() -> pvsp
apply(pvsp, 2, f)

##       none bonferroni        fdr         BY 
##        789         69        468        156

pol_pres15$locm_Z <- locm[, "Z.Ii"]
pol_pres15$locm_c_Z <- locm_c[, "Z.Ii"]
pol_pres15$locm_p_Z <- locm_p[, "Z.Ii"]
tm_shape(pol_pres15) + tm_fill(c("locm_Z", "locm_c_Z", "locm_p_Z"), breaks=brks, midpoint=0, title="Standard deviates of\nLocal Moran's I") + tm_facets(free.scales=FALSE, ncol=2) + tm_layout(panel.labels=c("Analytical total", "Analytical conditional", "Conditional permutation"))

The figure shows that conditional analytical and permutation approaches scale back the proportion of standard deviate values taking extreme values, especially positive values. As we will see below, the analytical standard deviates of local Moran’s $I$ should probably not be used if alternatives are available.

In presenting local Moran’s $I$, use is often made of “hotspot” maps. Because $I_i$ takes high values both for strong positive autocorrelation of low and high values of the input variable, it is hard to show where “clusters” of similar neighbours with low or high values of the input variable occur. The quadrants of the Moran plot are used, by creating a categorical quadrant variable interacting the input variable and its spatial lag split at 0 (rgeoda and PySAL/esda) or their means. While , this does not hold for the spatially lagged centred variable of interest in general, leading to differences in assignment of observations to quadrants.

The quadrant categories are then set to NA if, for the chosen standard deviate and adjustment, $I_i$ would be considered insignificant. Here, for the conditional permutation standard deviates, Bonferroni adjusted, 14 observations belong to “Low-Low clusters,” and 54 to “High-High clusters”:

q_mean <- attr(locm, "quadr")$mean
pol_pres15$hs_ac_q <- pol_pres15$hs_cp_q <- pol_pres15$hs_an_q <- q_mean
is.na(pol_pres15$hs_an_q) <- !(pol_pres15$locm_Z < brks[6] | pol_pres15$locm_Z > brks[8])
is.na(pol_pres15$hs_cp_q) <- !(pol_pres15$locm_p_Z < brks[2] | pol_pres15$locm_p_Z > brks[12])
is.na(pol_pres15$hs_ac_q) <- !(pol_pres15$locm_c_Z < brks[2] | pol_pres15$locm_c_Z > brks[12])

tm_shape(pol_pres15) + tm_fill(c("hs_an_q", "hs_ac_q", "hs_cp_q"), colorNA="grey95", textNA="Not significant", title="Turnout hotspot status\nLocal Moran's I") + tm_facets(free.scales=FALSE, ncol=2) + tm_layout(panel.labels=c("Unadjusted analytical total", "Bonferroni analytical cond.", "Cond. perm. with Bonferroni"))

The figure shows the impact of using analytical or conditional permutation standard deviates, and no or Bonferroni adjustment, reducing the counts of observations in “Low-Low clusters” from 370 to 14, and “High-High clusters” from 342 to 54; the “High-High clusters” are metropolitan areas.

The local Getis-Ord $G$ measure is reported as a standard deviate, and may also take the $G^*$ form where self-neighbours are inserted into the neighbour object using include.self(). The observed and expected values of local $G$ with their analytical variances may also be returned if return_internals=TRUE.

system.time(z |> 
        localG(lw_q_W) -> locG)

##    user  system elapsed 
##   0.005   0.000   0.005

system.time(z |> 
        localG_perm(lw_q_W, nsim=499, iseed=1) -> locG_p)

##    user  system elapsed 
##   0.261   0.182   0.230

Once again we face the problem of multiple comparisons, with the count of areal unit p-values < 0.05 being reduced by an order of magnitude when employing Bonferroni correction:

locG |> 
    c() |> 
    abs() |> 
    pnorm(lower.tail = FALSE) |> 
    (\(x) x*2)() |> 
    pva() -> pvsp
apply(pvsp, 2, f)

##       none bonferroni        fdr         BY 
##        789         69        468        156

library(ggplot2)
p1 <- ggplot(data.frame(Zi=locm_c[,4], Zi_perm=locm_p[,4])) + geom_point(aes(x=Zi, y=Zi_perm), alpha=0.2) + xlab("Analytical conditional") + ylab("Permutation conditional") + coord_fixed() + ggtitle("Local Moran's I")
p2 <- ggplot(data.frame(Zi=c(locG), Zi_perm=c(locG_p))) + geom_point(aes(x=Zi, y=Zi_perm), alpha=0.2) + xlab("Analytical conditional") + ylab("Permutation conditional") + coord_fixed() + ggtitle("Local G")
gridExtra::grid.arrange(p1, p2, nrow=1)

The figure shows that, keeping fixed aspect in both panels, conditional permutation changes the range and distribution of the standard deviate values for $I_i$, but that for $G_i$, the two sets of standard deviates are equivalent.

pol_pres15$locG_Z <- c(locG)
pol_pres15$hs_G <- cut(c(locG), c(-Inf, brks[2], brks[12], Inf), labels=c("Low", "Not significant", "High"))
table(pol_pres15$hs_G)

## 
##             Low Not significant            High 
##              14            2426              55

m1 <- tm_shape(pol_pres15) + tm_fill(c("locG_Z"), midpoint=0, title="Standard\ndeviate")
m2 <- tm_shape(pol_pres15) + tm_fill(c("hs_G"), title="Bonferroni\nhotspot status")
tmap_arrange(m1, m2, nrow=1)

As can be seen, we do not need to contrast the two estimation methods, and showing the mapped standard deviate is as informative as the “hotspot” status for the chosen adjustment. In the case of $G_i$, the values taken by the measure reflect the values of the input variable, so a “High cluster” is found for observations with high values of the input variable, here high turnout in metropolitan areas.

New rgeoda

Very recently, Geoda has been wrapped for R as rgeoda (Li and Anselin 2021), and will provide very similar functionalities for the exploration of spatial autocorrelation in areal data as spdep. The active objects are kept as pointers to a compiled code workspace; using compiled code for all operations (as in Geoda itself) makes rgeoda perform fast, but leaves less flexible when modifications or enhancements are desired.

The contiguity neighbours it constructs are the same as those found by poly2nb(), as almost are the $I_i$ measures. The difference is as established by Roger S. Bivand and Wong (2018), that localmoran() calculates the input variable variance divinding by $n$, but Geoda uses $(n-1)$, as can be reproduced by setting mlvar=FALSE:

library(rgeoda)

## Loading required package: digest

## 
## Attaching package: 'rgeoda'

## The following object is masked from 'package:spdep':
## 
##     skater

system.time(Geoda_w <- queen_weights(pol_pres15))

##    user  system elapsed 
##   0.069   0.000   0.069

summary(Geoda_w)

##                      name               value
## 1 number of observations:                2495
## 2          is symmetric:                 TRUE
## 3               sparsity: 0.00228786229774178
## 4        # min neighbors:                   1
## 5        # max neighbors:                  13
## 6       # mean neighbors:    5.70821643286573
## 7     # median neighbors:                   6
## 8           has isolates:               FALSE

system.time(lisa <- local_moran(Geoda_w, pol_pres15["I_turnout"], 
    cpu_threads=ifelse(parallel::detectCores() == 1, 1, parallel::detectCores()-1L), permutations=499, seed=1))

##    user  system elapsed 
##   0.166   0.005   0.068

all.equal(card(nb_q), lisa_num_nbrs(lisa), check.attributes=FALSE)

## [1] TRUE

all.equal(lisa_values(lisa), localmoran(pol_pres15$I_turnout, listw=lw_q_W, mlvar=FALSE)[,1], check.attributes=FALSE)

## [1] TRUE

R’s `sessionInfo()`

sessionInfo()

## R version 4.1.2 (2021-11-01)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 34 (Workstation Edition)
## 
## Matrix products: default
## BLAS:   /home/rsb/topics/R/R412-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R412-share/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] rgeoda_0.0.8-6        digest_0.6.28         igraph_1.2.7         
##  [4] Matrix_1.3-4          tmap_3.3-2            spdep_1.1-12         
##  [7] spData_2.0.1          sp_1.4-6              ggplot2_3.3.5        
## [10] gstat_2.0-8           sf_1.0-4              spatstat_2.2-0       
## [13] spatstat.linnet_2.3-0 spatstat.core_2.3-1   rpart_4.1-15         
## [16] nlme_3.1-153          spatstat.geom_2.3-0   spatstat.data_2.1-0  
## 
## loaded via a namespace (and not attached):
##   [1] leafem_0.1.6          colorspace_2.0-2      deldir_1.0-6         
##   [4] ellipsis_0.3.2        class_7.3-19          leaflet_2.0.4.1      
##   [7] rgdal_1.5-27          base64enc_0.1-3       fftwtools_0.9-11     
##  [10] dichromat_2.0-0       rstudioapi_0.13       proxy_0.4-26         
##  [13] farver_2.1.0          fansi_0.5.0           codetools_0.2-18     
##  [16] splines_4.1.2         knitr_1.36            polyclip_1.10-0      
##  [19] jsonlite_1.7.2        tmaptools_3.1-1       broom_0.7.10         
##  [22] png_0.1-7             spatstat.sparse_2.0-0 compiler_4.1.2       
##  [25] backports_1.3.0       assertthat_0.2.1      fastmap_1.1.0        
##  [28] cli_3.1.0             s2_1.0.7              htmltools_0.5.2      
##  [31] tools_4.1.2           coda_0.19-4           gtable_0.3.0         
##  [34] glue_1.4.2            dplyr_1.0.7           wk_0.5.0             
##  [37] gmodels_2.18.1        Rcpp_1.0.7            jquerylib_0.1.4      
##  [40] raster_3.5-2          vctrs_0.3.8           gdata_2.18.0         
##  [43] leafsync_0.1.0        crosstalk_1.1.1       lwgeom_0.2-8         
##  [46] xfun_0.27             stringr_1.4.0         lifecycle_1.0.1      
##  [49] gtools_3.9.2          XML_3.99-0.8          goftest_1.2-3        
##  [52] terra_1.4-16          LearnBayes_2.15.1     MASS_7.3-54          
##  [55] zoo_1.8-9             scales_1.1.1          spatstat.utils_2.2-0 
##  [58] expm_0.999-6          RColorBrewer_1.1-2    spatialreg_1.1-9     
##  [61] yaml_2.2.1            gridExtra_2.3         sass_0.4.0           
##  [64] stringi_1.7.5         highr_0.9             spDataLarge_0.5.4    
##  [67] e1071_1.7-9           boot_1.3-28           intervals_0.15.2     
##  [70] rlang_0.4.12          pkgconfig_2.0.3       evaluate_0.14        
##  [73] lattice_0.20-45       purrr_0.3.4           tensor_1.5           
##  [76] htmlwidgets_1.5.4     labeling_0.4.2        tidyselect_1.1.1     
##  [79] magrittr_2.0.1        R6_2.5.1              generics_0.1.1       
##  [82] DBI_1.1.1             pillar_1.6.4          withr_2.4.2          
##  [85] mgcv_1.8-38           units_0.7-2           stars_0.5-4          
##  [88] xts_0.12.1            abind_1.4-5           tibble_3.1.5         
##  [91] spacetime_1.2-5       crayon_1.4.2          KernSmooth_2.23-20   
##  [94] utf8_1.2.2            rmarkdown_2.11        grid_4.1.2           
##  [97] FNN_1.1.3             classInt_0.4-3        tidyr_1.1.4          
## [100] dbscan_1.1-8          munsell_0.5.0         viridisLite_0.4.0    
## [103] bslib_0.3.1

Anselin, L. 1995. “Local indicators of spatial association - LISA.” Geographical Analysis 27 (2): 93–115.

Anselin, L. 1996. “The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association.” In Spatial Analytical Perspectives on GIS, edited by M. M. Fischer, H. J. Scholten, and D. Unwin, 111–25. London: Taylor & Francis.

Assunção, R. M., and E. A. Reis. 1999. “A New Proposal to Adjust Moran’s $I$ for Population Density.” Statistics in Medicine 18: 2147–62.

Avis, D., and J. Horton. 1985. “Remarks on the Sphere of Influence Graph.” In Discrete Geometry and Convexity, edited by J. E. Goodman, 323–27. New York: New York Academy of Sciences, New York.

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with r. Chapman; Hall/CRC.

Bavaud, F. 1998. “Models for Spatial Weights: A Systematic Look.” Geographical Analysis 30: 153–71. https://doi.org/10.1111/j.1538-4632.1998.tb00394.x.

Bivand, R. S., W. Müller, and M. Reder. 2009. “Power Calculations for Global and Local Moran’s $I$.” Computational Statistics and Data Analysis 53: 2859–72.

Bivand, R. S., and B. A. Portnov. 2004. “Exploring Spatial Data Analysis Techniques Using R: The Case of Observations with No Neighbours.” In Advances in Spatial Econometrics: Methodology, Tools, Applications, edited by L. Anselin, R. J. G. M. Florax, and S. J. Rey, 121–42. Berlin: Springer.

Bivand, Roger S., and David W. S. Wong. 2018. “Comparing Implementations of Global and Local Indicators of Spatial Association.” TEST 27 (3): 716–48. https://doi.org/10.1007/s11749-018-0599-x.

Boots, B., and A. Okabe. 2007. “Local Statistical Spatial Analysis: Inventory and Prospect.” International Journal of Geographical Information Science 21 (4): 355–75. https://doi.org/10.1080/13658810601034267.

Cliff, A. D., and J. K. Ord. 1973. Spatial Autocorrelation. London: Pion.

———. 1981. Spatial Processes. London: Pion.

Csardi, Gabor, and Tamas Nepusz. 2006. “The Igraph Software Package for Complex Network Research.” InterJournal Complex Systems: 1695. https://igraph.org.

Duncan, O. D., R. P. Cuzzort, and B. Duncan. 1961. Statistical Geography: Problems in Analyzing Areal Data. Glencoe, IL: Free Press.

Freni-Sterrantino, Anna, Massimo Ventrucci, and Håvard Rue. 2018. “A Note on Intrinsic Conditional Autoregressive Models for Disconnected Graphs.” Spatial and Spatio-Temporal Epidemiology 26: 25–34. https://doi.org/https://doi.org/10.1016/j.sste.2018.04.002.

Geary, R. C. 1954. “The Contiguity Ratio and Statistical Mapping.” The Incorporated Statistician 5: 115–45.

Getis, A., and J. K. Ord. 1992. “The Analysis of Spatial Association by the Use of Distance Statistics.” Geographical Analysis 24 (2): 189–206.

———. 1996. “Local Spatial Statistics: An Overview.” In Spatial Analysis: Modelling in a GIS Environment, edited by P. Longley and M Batty, 261–77. Cambridge: GeoInformation International.

Li, Xun, and Luc Anselin. 2021. Rgeoda: R Library for Spatial Data Analysis. https://CRAN.R-project.org/package=rgeoda.

Moran, P. A. P. 1948. “The Interpretation of Statistical Maps.” Journal of the Royal Statistical Society, Series B (Methodological) 10 (2): 243–51.

Müller, Werner G. 2007. Collecting Spatial Data: Optimum Design of Experiments for Random Fields. Berlin: Springer-Verlag.

Okabe, A., T. Satoh, T. Furuta, A. Suzuki, and K. Okano. 2008. “Generalized Network Voronoi Diagrams: Concepts, Computational Methods, and Applications.” International Journal of Geographical Information Science 22 (9): 965–94. https://doi.org/10.1080/13658810701587891.

Ord, J. K., and A. Getis. 2001. “Testing for Local Spatial Autocorrelation in the Presence of Global Autocorrelation.” Journal of Regional Science 41 (3): 411–32.

Ripley, B. D. 1981. Spatial Statistics. New York: Wiley.

Sauer, Jeffery, Taylor Oshan, Sergio Rey, and Levi John Wolf. n.d. “The Importance of Null Hypotheses: Understanding Differences in Local Moran’s Under Heteroskedasticity.” Geographical Analysis. https://doi.org/https://doi.org/10.1111/gean.12304.

Schratz, Patrick, Jannes Muenchow, Eugenia Iturritxa, Jakob Richter, and Alexander Brenning. 2019. “Hyperparameter Tuning and Performance Assessment of Statistical and Machine-Learning Algorithms Using Spatial Data.” Ecological Modelling 406: 109–20. https://doi.org/https://doi.org/10.1016/j.ecolmodel.2019.06.002.

She, Bing, Xinyan Zhu, Xinyue Ye, Wei Guo, Kehua Su, and Jay Lee. 2015. “Weighted Network Voronoi Diagrams for Local Spatial Analysis.” Computers, Environment and Urban Systems 52: 70–80. https://doi.org/https://doi.org/10.1016/j.compenvurbsys.2015.03.005.

Sokal, R. R, N. L. Oden, and B. A. Thomson. 1998. “Local Spatial Autocorrelation in a Biological Model.” Geographical Analysis 30: 331–54.

Tiefelsdorf, M. 2002. “The Saddlepoint Approximation of Moran’s I and Local Moran’s ${I}_i$ Reference Distributions and Their Numerical Evaluation.” Geographical Analysis 34: 187–206.

Upton, G., and B. Fingleton. 1985. Spatial Data Analysis by Example: Point Pattern and Qualitative Data. New York: Wiley.

Wall, M. M. 2004. “A Close Look at the Spatial Structure Implied by the CAR and SAR Models.” Journal of Statistical Planning and Inference 121: 311–24.

ECS530: Spatial Data Analysis III

Roger Bivand

Wednesday 17 November 2021

Copyright

Required current contributed CRAN packages:

Script

Is spatial autocorrelation just poor data collection design and/or model mis-specification?

Spatial modelling: why?

Spatial modelling: which kinds?

When is a “spatial” process actually induced by the analyst

Spatial point processes

Adding a `y` trend variable

Variogram model

Spatial autocorrelation

Constructing neighbour and spatial weight objects

Representing proximity in spdep

Contiguous neighbours

Graph-based neighbours

Distance-based neighbours

Weights specification

Higher order neighbours

Measures of spatial autocorrelation

Global measures

ACS CV data set

Local measures

New rgeoda

R’s `sessionInfo()`

ECS530: Spatial Data Analysis III

Roger Bivand

Wednesday 17 November 2021

Copyright

Required current contributed CRAN packages:

Script

Is spatial autocorrelation just poor data collection design and/or model mis-specification?

Spatial modelling: why?

Spatial modelling: which kinds?

When is a “spatial” process actually induced by the analyst

Spatial point processes

Adding a y trend variable

Variogram model

Spatial autocorrelation

Constructing neighbour and spatial weight objects

Representing proximity in spdep

Contiguous neighbours

Graph-based neighbours

Distance-based neighbours

Weights specification

Higher order neighbours

Measures of spatial autocorrelation

Global measures

ACS CV data set

Local measures

New rgeoda

R’s sessionInfo()

Adding a `y` trend variable

R’s `sessionInfo()`