4  Creating spatial weights objects

4.1 Using spdep to create neighbour and weights objects

Chapter 14 of Pebesma and Bivand (2023) provides a fairly up-to-date account of how one may construct "nb" neighbour objects, and from these "listw" weights objects. In the following chapters, these objects will be used extensively in modelling, and will often be referred to as spatial weights matrices. While their description in terms of linear algebra as matrices is reasonable, their representation as dense n by n matrices is much less reasonable, as in most cases, only a very few elements of such a matrix are non-zero. This means that sparse representations are certainly preferable in terms of storage and efficiency, and repeated multiplication by zero is simply wasteful. "nb" neighbour objects, and "listw" weights objects are sparse representations, where the "nb" object records which objects j are neighbours of i for each i, and "listw" objects add in the weights assigned to each such i-j relationship.

We will use the spdep package to construct neighbour objects for the 123 Pomeranian municipalities. Since we have the boundaries of the municipalities, we can see which municipalities share boundary segments, known as rook contiguity; we add identifiers to the output object, now only to show it can be done, but later for use in out-of-sample prediction among other uses:

library(sf)
Linking to GEOS 3.12.1, GDAL 3.9.0, PROJ 9.4.0; sf_use_s2() is TRUE
library(spdep)
Loading required package: spData
(pom6a |> poly2nb(queen=FALSE,
 row.names=pom6a$TERYT) -> pom_rook_nb)
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 504 
Percentage nonzero weights: 3.33135 
Average number of links: 4.097561 
2 regions with no links:
2206011 2212053
7 disjoint connected subgraphs

If we slacken our definition of contiguity to simply sharing one boundary point, to queen contiguity, we see that the problems detected here of two no-neighbour municipalities and seven disjoint subgraphs remain:

(pom6a |> poly2nb(queen=TRUE,
 row.names=pom6a$TERYT) -> pom_queen_nb)
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 504 
Percentage nonzero weights: 3.33135 
Average number of links: 4.097561 
2 regions with no links:
2206011 2212053
7 disjoint connected subgraphs

One possible reason for such problems is that boundaries provided by public agencies are accurate enough for their purposes, but where shared boundary points are not exactly identical. If we pass a very small distance, here one-hundredth of a millimetre, as a snap distance, the problems are resolved:

(pom6a |> poly2nb(queen=FALSE, row.names=pom6a$TERYT,
 snap=0.00001) -> pom_rook_nb)
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 614 
Percentage nonzero weights: 4.058431 
Average number of links: 4.99187 
(pom6a |> poly2nb(queen=TRUE, row.names=pom6a$TERYT,
 snap=0.00001) -> pom_queen_nb)
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 614 
Percentage nonzero weights: 4.058431 
Average number of links: 4.99187 

We can also see that in this case the rook and queen definitions lead to the same neighbour object; differences are more often seen in US tracts or counties with grid-like boundaries where four entities meet at a single point.

By definition, contiguity neighbours, like distance threshold neighbours are symmetric, that is if i is a neighbour of j, j must be a neighbour of i. This is only rarely the case when defining neighbours as the k-nearest points. Asymmetric neigbours lead to directed graphs, which can also be handled, but which need subsequent numerical special treatment. There may be substantive reasons for choosing asymmetric neighbours, but it is probably a good idea to spell out the reasons clearly.

It is important to establish whether no-neighbour entities really have no neighbours because in matrix terms they lead to columns and rows with only zero values. Some methods become poorly defined if observations “drop out” of analysis in this way. For similar reasons, the detection of disjoint subgraphs, that is parts of the neighbour object that are unconnected from other parts of the object, jeopardises the outcome of modelling, as spatial processes cannot travel across the whole graph. The graph we have found now can be plotted, using st_point_on_surface internally to provide points to represent the polygons:

geom <- st_geometry(pom6a) 
plot(geom, border="grey")
plot(pom_queen_nb, geom, add=TRUE)

Queen neighbour object, Pomeranian municipalities, 2022

When choosing the five nearest neighbours as the neighbour criterion, with distance measured to suit the coordinate reference system of the point geometries, the outcome is most often asymmetric:

(geom |> st_point_on_surface() |> knearneigh(k=5) |>
 knn2nb(row.names=pom6a$TERYT) -> pom_k5_nb)
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 615 
Percentage nonzero weights: 4.065041 
Average number of links: 5 
Non-symmetric neighbours list
plot(geom, border="grey")
plot(pom_k5_nb, geom, add=TRUE)

Five nearest neighbour object, Pomeranian municipalities, 2022

As the figure above shows, open coastal water is crossed by links between neighbours defined in this way. For the point support Gdańsk take-away outlets and five nearest neighbours, asymmetry is expected, but as with many distance measures, this case does raise the question of whether to measure distance in a distance metric or perhaps a time metric, because crossing a busy street takes more time than walking or cycling along a street:

(gd_takeaways|> knearneigh(k=5) |>
 knn2nb(row.names=gd_takeaways$osm_id) -> gd_k5_nb)
Neighbour list object:
Number of regions: 160 
Number of nonzero links: 800 
Percentage nonzero weights: 3.125 
Average number of links: 5 
7 disjoint connected subgraphs
Non-symmetric neighbours list
gd_k5_nb |> nb2lines(coords=st_geometry(gd_takeaways)
 ) -> gd_k5_nb_sf

As the figure below shows, the outlets are tightly clustered into disjoint subgraphs, but on more careful consideration, the outlets are multi-function, also serving drop-in customers. An outlet only offering take-away service might be located best near peak demand conditioned by the cost of property rental and transport accessibility, but most seem to sell to multiple markets: take-away, eat in restaurant, and delivery to customer. Maybe some compete on price, but cuisine may be a differentiator as well.

library(mapview)
mapview(gd_takeaways) + mapview(gd_k5_nb_sf, color="red4")

Five nearest neighbour object, Gdańsk take-away outlets, OpenStreetMap

Having shown some methods for constructing neighbour objects, we can briefly mention the assignment of weights to the neighbour links:

pom_queen_nb |> nb2listw(style="W") -> pom_queen_lw
summary(pom_queen_lw)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 123 
Number of nonzero links: 614 
Percentage nonzero weights: 4.058431 
Average number of links: 4.99187 
Link number distribution:

 1  2  3  4  5  6  7  8  9 10 
 7 10 12 14 30 22 16  8  3  1 
7 least connected regions:
2203011 2206011 2210011 2211011 2211031 2212011 2213031 with 1 link
1 most connected region:
2206042 with 10 links

Weights style: W 
Weights constants summary:
    n    nn  S0      S1       S2
W 123 15129 123 57.8465 522.8508

The output of the summary method shows the same details as for the nb object, adding a tabulation by neighbour count, the weights style "W", meaning that the weights are standardised so that the sum of weights for each observation is unity:

pom_queen_lw$weights |> sapply(sum) |> unique()
[1] 1

In some modelling settings, the "B" binary zero-one style is preferred, and there are a number of alternative ways of specifying weights. Finally, if the neighbour object contains no-neighbour entities, the zero.policy argument to nb2listw may be used to indicate that zero weights are acceptable - the obvious alternatives are to exclude such entities, or to add neighbour links so that all entities have neigbours.

4.2 English garbage data set and weights construction

From the seventh lecture, we shall be using a data set from Bivand and Szymanski (1997) and Bivand and Szymanski (2000), extending results in SZYMANSKI and WILKINS (1993), SYMANSKI (1996) and Bello and Szymanski (1996). The question posed in the research reported in these articles is whether the costs of garbage collection in English local authority districts (which had a statutory obligation to collect garbage) were reduced when compulsory competitive tendering was introduced. Not all districts introduced compulsory competitive tendering at the same time, and some had not begun when data collection ceased. Of 366 districts, only 324 had completed at this point, and the data used relate to these districts. Since compulsory competitive tendering was implemented in different years, the data set reports real district net expenditure on garbage collection for the year before the implementation of compulsory competitive tendering (pre-CCT), and the year after (post-CCT).

As mentioned in Bivand and Szymanski (2000), the observations are agents, their boundaries do not change during the study period, and as entities they meet reasonable behavioural expectations. Revelli (2003), Brueckner (2003), Revelli (2006), Revelli and Tovmo (2007) and others engage with a literature including Case, Hines, and Rosen (1993) and Besley and Case (1995) concerning strategic interactions, of which yardstick competition may play a part. In Bivand and Szymanski (1997), the identification of residual spatial autocorrelation in the results given by Bello and Szymanski (1996) was used to develop a spatial yardstick competition framework, in which districts without compulsory competitive tendering were more likely to be influenced by the behaviour of their proximate neighbours than by general market conditions.

The data used in Bivand and Szymanski (2000) have been matched with simplified boundaries for English districts, and may be read in the usual way:

library(sf)
eng324 <- st_read("Datasets/cross section/garbage/eng324s.gpkg")
Reading layer `eng324s' from data source 
  `/home/rsb/und/PG_AGII_2sem/Datasets/cross section/garbage/eng324s.gpkg' 
  using driver `GPKG'
Simple feature collection with 324 features and 34 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 134199 ymin: 11551 xmax: 655604 ymax: 657534
Projected CRS: OSGB36 / British National Grid

Showing which districts implemented compulsory competitive tendering and whether they awarded the contract to an outside bidder or to thir direct service organisation also shows the gaps, that is districts that had not implemented the policy by the end of data collection:

library(mapview)
mapview(eng324, zcol="f_dso")

Direct service organisation chosen, compulsory competitive tendering, English districts

Naturally, the gaps lead to difficulties in creating a viable neighbour object:

library(spdep)
(pnb <- poly2nb(eng324, queen=TRUE, row.names=eng324$NAME))
Neighbour list object:
Number of regions: 324 
Number of nonzero links: 1440 
Percentage nonzero weights: 1.371742 
Average number of links: 4.444444 
3 regions with no links:
HEREFORD KENSINGTON & CHELSEA STEVENAGE
6 disjoint connected subgraphs

Three disticts have no neighbours, and there are 6 disjoint subgraphs. We can use k-nearest neighbours with small k to patch in the missing links, and make them symmetric:

crds <- st_centroid(st_geometry(eng324))
k2 <- knn2nb(knearneigh(crds, k=2), row.names=eng324$NAME)
(unb <- make.sym.nb(union.nb(pnb, k2)))
Neighbour list object:
Number of regions: 324 
Number of nonzero links: 1584 
Percentage nonzero weights: 1.508916 
Average number of links: 4.888889 
pnb_sf <- nb2lines(pnb, coords=crds)
k2_sf <- nb2lines(setdiff.nb(intersect.nb(pnb, k2), k2),
 coords=crds)
mapview(eng324, alpha.regions=0.15) +
 mapview(pnb_sf, color="blue3") +
 mapview(k2_sf, color="red4")

Contiguity neighbours and added k-nearest neighbours, 324 Englist districts

We can also read the original neighbour object modified by hand for Bivand and Szymanski (1997):

(orignb <- read.gal("Datasets/cross section/garbage/eng324_orig.gal"))
Neighbour list object:
Number of regions: 324 
Number of nonzero links: 1570 
Percentage nonzero weights: 1.49558 
Average number of links: 4.845679 
attr(orignb, "region.id") <- attr(unb, "region.id")
orignb_sf <- nb2lines(orignb, coords=crds)
mapview(eng324, alpha.regions=0.15) +
 mapview(orignb_sf, color="green4")

Original simple graph, 324 English districts

Now we have a basis for work later on involving a data set used in published articles some time ago, but still relevant, as the entities are well-founded, and we have a micro-economic model for spillovers between entities.