All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.
I am running R 3.6.1, with recent update.packages().
needed <- c("spatstat", "spatstat.data", "maptools", "rgdal", "ranger", "rgeos", "sp", "geoR", "gstat", "automap", "RColorBrewer", "lattice")
Script and data at https://github.com/rsbivand/ECS530_h19/raw/master/ECS530_IX.zip. Download to a suitable location, unzip, and use as the basis.
2/12 (I) Spatial data representation, (II) Support+topology, input/output
3/12 (III) Coordinate reference systems, (IV) Visualization
4/12 (VI) Spatial autocorrelation, project surgery
5/12 (VII) Spatial regression, (VIII) Spatial multilevel regression
6/12 (IX) Interpolation, point processes, project surgery
7/12 Presentations
Timetable: 09:15-09:45, 09:45-10:30, 10:30-11:00
What is interpolation, and what does it require with regard to data, observation and prediction locations?
The support of the observation and prediction locations - what is it, and how does it affect interpolation?
Standard models for interpolation: deterministic and geostatistical prediction for continuous variables
Extensions to the standard model for non-standard data and non-planar locations
Geostatistics is a bit like the alchemy of spatial statistics, focussed more on prediction than model fitting
Since the reason for modelling is chiefly prediction in pre-model-based geostatistics, and to a good extent in model-based geostatistics, we’ll also keep to interpolation here
Interpolation tries to make the best possible guesses of the values of the variable of interest at places where there are no observations (in 1, 2, 3, \(\ldots\) dimensions)
The guesses are based on the relative positions of the places with observations and the places for which predictions are required, and on the values observed at those positions
Interpolation for point data involves estimates of a variable (or variables) from known observations at known positions to unknown positions
If the observation and prediction positions have point support, then we are dealing with interpolation for point data
The underlying assumptions for all interpolation include distributional assumptions about the data, assumptions about the quality of the data and covariates, and assumptions about the quality of positional support and measurement
The standard model assumes planar geometry, error-free coordinates and covariates, and data that are observed as directly as possible from the same realization of the data generation process
Since predictions are only of use if we know how far we can rely on them, all interpolation should try to accommodate uncertainty about measurements
If we are binning to try to obtain “observations”, we are going much further than experimental scientists do in designing model matrices, as we have much less control over the values of the predictors
Model-based geostatistics are among the techniques that can be used, but we can start from block kriging to get a feel for the change of support problem (COSP); a code sketch follows below
Very likely, hierarchical or other Markov chain Monte Carlo (MCMC) models will be required to specify the errors to propagate
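As a minimal sketch of the change of support, assuming gstat’s meuse example data (not the Swiss data used later) and illustrative variogram starting values, block kriging averages predictions over blocks rather than points:
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- c("x", "y")
data(meuse.grid)
coordinates(meuse.grid) <- c("x", "y")
gridded(meuse.grid) <- TRUE
# fit a spherical variogram to log-zinc; starting values are assumptions
v.fit <- fit.variogram(variogram(log(zinc) ~ 1, meuse), vgm(1, "Sph", 800, 1))
# point-support prediction, then prediction averaged over 40m x 40m blocks;
# block predictions are smoother, with smaller prediction variances
pk <- krige(log(zinc) ~ 1, meuse, meuse.grid, model=v.fit)
bk <- krige(log(zinc) ~ 1, meuse, meuse.grid, model=v.fit, block=c(40, 40))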
We could say that we know very little about any spatial process in the DGP, and take the value at each prediction point as the value at the nearest observation, disregarding covariates
We could extend this by taking some measure of the values observed at nearby observations, for example a median or average
Inverse distance weighting with a power parameter interpolates by taking the distance-weighted mean of all, or a local region of, observations (sketched in code after this list)
Trend surfaces fit a surface through the observations using a polynomial in the point coordinates of the observations; splines also fit a surface based on local fits
While splines may have a statistical interpretation, they are often used deterministically, but as with IDW and trend surfaces, the fitted model may be chosen by cross-validation (e.g. Geostatistical Analyst in ArcGIS)
It is also perfectly possible to fit an aspatial statistical model with covariates (linear, GLM, GAM, CART, etc.), and predict to locations for which observations on the covariates are available
Many of these can be extended to their mixed-model version, as LME, GLMM, GAMM, etc., using a spatial process model for random effects, or for example a spline surface in the observation coordinates (GAM)
The spatial process models are mostly geostatistical, so we’ll focus on these
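Before turning to the geostatistical models, here is a minimal sketch of three of the deterministic interpolators listed above, again assuming the meuse example data from sp; the power parameter and trend surface degree are arbitrary choices, not tuned:
library(sp)
library(gstat)
data(meuse)
coordinates(meuse) <- c("x", "y")
data(meuse.grid)
coordinates(meuse.grid) <- c("x", "y")
gridded(meuse.grid) <- TRUE
# value at the nearest observation: a weighted mean of a single neighbour
nn <- krige(zinc ~ 1, meuse, meuse.grid, nmax=1)
# inverse distance weighting with power parameter idp
zn.idw <- idw(zinc ~ 1, meuse, meuse.grid, idp=2)
# second-order trend surface in the point coordinates
ts2 <- krige(zinc ~ 1, meuse, meuse.grid, degree=2)
spplot(zn.idw["var1.pred"])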
Assuming that we have a response variable and perhaps explanatory variables observed on the same support and yielding a mean-stationary error process, our interest is in finding out whether there is still useful information in the error
The information may take the form of spatial covariance, the error autocorrelation between observations
This also involves the introduction of the concept of the random field, a stochastic process in more than one dimension, as a representation of space, where these can be continuous or discrete
The continuous form is:
\[ Z(\mathbf{s}) = \mu(\mathbf{s}) + W(\mathbf{s}) + \eta(\mathbf{s}) + \epsilon(\mathbf{s}), \] where \(Z(\mathbf{s})\) is a random field, \(\mathbf{s} = [s_1, s_2, \ldots, s_d]'\) are the coordinates of a spatial process indexed over \(D \subset \mathcal{R}^d\), the mean function \(\mu(\mathbf{s})\) is the large-scale trend, \(W(\mathbf{s})\) is the smooth-scale variation and is a stationary process with a covariance function, \(\eta(\mathbf{s})\) is the often unobservable micro-scale variation, and \(\epsilon(\mathbf{s})\) is white noise measurement error
In geostatistics, two steps are involved in prediction from point data, first modelling the smooth-scale variation, then making the prediction from the mean function and this fitted model
The first step in practice involves fitting a variogram model to the observed squared differences, between pairs of observation points, in the residuals left after fitting the mean function
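In standard notation, assuming intrinsic stationarity of the residual process, the semivariance and its classical (Matheron) estimator for a distance bin \(\tilde{h}\) are:
\[ \gamma(\mathbf{h}) = \frac{1}{2}\mathrm{E}[(Z(\mathbf{s}) - Z(\mathbf{s} + \mathbf{h}))^2], \qquad \hat{\gamma}(\tilde{h}) = \frac{1}{2|N(\tilde{h})|}\sum_{(i,j) \in N(\tilde{h})}(e_i - e_j)^2, \] where \(e_i\) are the residuals after fitting the mean function and \(N(\tilde{h})\) is the set of pairs of observation points whose separation distance falls in the bin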
The fitting of a model to the empirical variogram may be done by hand and by a number of statistical techniques (which depend on assumptions)
Choosing a different variogram model may lead to differences in predictions; it may not be possible to choose a satisfactory model
It is probable that more exploratory spatial data analysis is done in geostatistics than in the remaining domains of spatial data analysis
It is easy to grasp why interpolation is crucially dependent on identifying the right model, in terms of the selection of observation locations, the fitting of models of spatial autocorrelation, detecting useful covariates, and checking the appropriateness of assumptions such as isotropy
Here we will use a data set of precipitation values for Switzerland, discussed in Diggle and Ribeiro (2007), and used in the “Spatial Interpolation Comparison 97” contest
The examples demonstrate that geostatistics software, here R packages, provides much support for exploratory spatial data analysis
Both geoR and gstat include data for precipitation at 467 meteorological stations in Switzerland for 8 May 1986, measured in 0.1 mm. We’ll fit with 100 training sites, and predict to the remaining 367 sites. geoR gives a four-panel ESDA display that conveys a lot of information for the 100 training sites.
library(geoR)
## --------------------------------------------------------------
## Analysis of Geostatistical Data
## For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
## geoR version 1.7-5.2.1 (built on 2016-05-02) is now loaded
## --------------------------------------------------------------
data(SIC)
plot(sic.100, borders=sic.borders, lowess=TRUE)
The first diagnostic plot provided in gstat is known as an \(h\)-scatterplot, and conditions a scatterplot of the values at pairs of locations on the binned distance \(h_{ij}\) between them; the diagonal lines represent perfect correlation
library(sp)
library(gstat)
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
sic.100SP <- SpatialPointsDataFrame(SpatialPoints(sic.100$coords),
data=data.frame(precip=sic.100$data, altitude=sic.100$altitude))
sic.allSP <- SpatialPointsDataFrame(SpatialPoints(sic.all$coords),
data=data.frame(precip=sic.all$data, altitude=sic.all$altitude))
names(sic.allSP) <- c("precip", "altitude")
hscat(precip ~ altitude, data=sic.100SP, seq(0,120,20))
We now compute a variogram cloud plot and a plot of empirical variogram values for 20 bins each 5 km wide, for classical and robust versions of the variogram. The bin borders are shown to highlight how the empirical variogram is constructed as a measure of central tendency of squared differences in the variable of interest between pairs of points whose inter-point distance falls into the bin
g <- gstat(id="precip", formula=precip ~ altitude, data=sic.100SP)
evgm <- variogram(g, cutoff=100, width=5)
revgm <- variogram(g, cutoff=100, width=5, cressie=TRUE)
cevgm <- variogram(g, cutoff=100, width=5, cloud=TRUE)
oopar <- par(mfrow=c(1,2))
plot(gamma ~ dist, cevgm, pch=".", cex=2, col="grey65", ylab="semivariance", xlab="distance")
lines(gamma ~ dist, evgm, lwd=2)
lines(gamma ~ dist, revgm, lwd=2, lty=2)
abline(v=seq(0,100,5), lty=2, col="grey50")
plot(gamma ~ dist, evgm, ylab="semivariance", xlab="distance", type="b", lwd=2)
points(gamma ~ dist, revgm, pch=3)
lines(gamma ~ dist, revgm, lty=2, lwd=2)
abline(v=seq(0,100,5), lty=2, col="grey50")
legend("topleft", legend=c("classic", "robust"), pch=c(1,3), lty=c(1,2), bty="n", lwd=2)
par(oopar)
We can close in on these within-bin distributions by using boxplots constructed from the gstat variogram cloud — box widths proportional to pair counts in bins, classical empirical variogram shown as a dashed line; the fields package returns number summaries by bin in addition to the classical variogram estimator in output from the vgram function
dbin <- findInterval(cevgm$dist, seq(0, 100, 5), all.inside=TRUE)
wid <- tapply(cevgm$gamma, dbin, length)
boxplot(cevgm$gamma ~ dbin, width=wid, ylab="semivariance", xlab="distance", axes=FALSE)
axis(2)
axis(1, at=c(0.5, 5.5, 10.5, 15.5, 20.5), labels=c(0, 25, 50, 75, 100))
box()
lines(gamma ~ I(dist/5), evgm, lwd=2, lty=2)
Finally, we explore possible anisotropy in the data set. Using the same bins as earlier, we add arguments to the variogram function to create objects for plotting: a variogram map, and four empirical variograms for four axes at \(0\,^{\circ}\), \(45\,^{\circ}\), \(90\,^{\circ}\) and \(135\,^{\circ}\); the variogram direction lines are coded in the same way on both panels
mevgm <- variogram(g, cutoff=100, width=5, map=TRUE)
aevgm <- variogram(g, cutoff=100, width=5, alpha=c(0, 45, 90, 135))
library(RColorBrewer)
oopar <- par(mar=c(1,1,1,1))
image(mevgm$map, col=colorRampPalette(brewer.pal(7, "Blues")[-(6:7)])(20))
abline(v=0, lty=1)
abline(a=0, b=1, lty=2)
abline(h=0, lty=3)
abline(a=0, b=-1, lty=4)
par(oopar)
library(lattice)
trellis.device(new=FALSE,color=FALSE)
plot(aevgm, multipanel=FALSE)
First we will fit a Matérn variogram model in geoR using weighted least squares and maximum likelihood:
evg <- variog(sic.100, max.dist=200, trend=formula(~altitude), messages=FALSE)
fvg <- variofit(evg, messages=FALSE)
ml <- likfit(sic.100, ini.cov.pars=c(0.5, 0.5), trend=formula(~altitude), messages=FALSE)
plot(evg)
lines(fvg)
lines(ml, lty=2)
legend("topleft", legend=c("WLS", "ML"), lty=1:2, bty="n")
Next we will fit a Matérn variogram model in gstat:
evgm <- variogram(g, cutoff=200, width=5)
fevgm <- fit.variogram(evgm, vgm(psill=16000, model="Mat", range=30, nugget=1, kappa=0.5))
fevgm
## model psill range kappa
## 1 Nug 0.0 0.00000 0.0
## 2 Mat 16869.7 46.96405 0.5
plot(evgm, model=fevgm)
The geostatistical packages, like gstat, use formula objects in standard ways where possible, which allows for considerable flexibility:
UK_fit <- gstat(g, id="precip", model=fevgm)
z <- predict(UK_fit, newdata=sic.allSP, debug.level=0)
# keep the 367 held-back sites; at the 100 training sites the kriging variance is effectively zero
sic.367SP <- sic.allSP[which(z$precip.var > 0.0001),]
z <- predict(UK_fit, newdata=sic.367SP, debug.level=0)
Using geoR, we get:
kcwls <- krige.conv(sic.100, locations=coordinates(sic.367SP),
krige=krige.control(obj.model=fvg, type.krige="OK",
trend.d=formula(~altitude), trend.l=formula(~sic.367SP$altitude)))
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
kcml <- krige.conv(sic.100, locations=coordinates(sic.367SP),
krige=krige.control(obj.model=ml, type.krige="OK",
trend.d=formula(~altitude), trend.l=formula(~sic.367SP$altitude)))
## krige.conv: model with mean defined by covariates provided by the user
## krige.conv: Kriging performed using global neighbourhood
kcB <- krige.bayes(sic.100, locations=coordinates(sic.367SP),
model=model.control(trend.d=formula(~altitude),
trend.l=formula(~sic.367SP$altitude)))
## krige.bayes: model with mean defined by covariates provided by the user
## krige.bayes: computing the discrete posterior of phi/tausq.rel
## krige.bayes: argument `phi.discrete` not provided, using default values
## krige.bayes: computing the posterior probabilities.
## Number of parameter sets: 50
## (progress counter omitted)
##
## krige.bayes: sampling from posterior distribution
## krige.bayes: sample from the (joint) posterior of phi and tausq.rel
## (table of the 49 sampled phi values, tausq.rel values and posterior frequencies omitted)
##
## krige.bayes: starting prediction at the provided locations
## krige.bayes: phi/tausq.rel samples for the predictive are same as for the posterior
## krige.bayes: computing moments of the predictive distribution
## krige.bayes: sampling from the predictive
## Number of parameter sets: 49
## (progress counter omitted)
## krige.bayes: preparing summaries of the predictive distribution
The point about the standard assumptions is that when they are met, the prediction standard errors are tractable — we plot the MCMC prediction simulations for location 1:
plot(density(kcB$predictive$simulations[1,]))
abline(v=kcB$predictive$mean[1], col="red", lwd=2)
curve(dnorm(x, mean=kcB$predictive$mean[1],
sd=sqrt(kcB$predictive$variance[1])), lty=2, lwd=2, from=-100, to=500, add=TRUE)
abline(v=sic.367SP$precip[1], col="blue", lwd=2)
The automap package builds on gstat to automate the choice of variogram model:
library(automap)
aK <- autoKrige(formula=precip ~ altitude, input_data=sic.100SP, new_data=sic.367SP)
## [using universal kriging]
aK$var_model
## model psill range kappa
## 1 Nug 861.4484 0.00000 0
## 2 Ste 14011.7859 35.92031 5
plot(aK)
Hengl and co-authors point to possible uses of ML technologies for interpolation, giving code examples that use distances to observations as covariates in random forests; further examples for ranger are provided in an ebook. First, distance matrices are constructed and converted to data frames:
#grid.dist0 <- GSIF::buffer.dist(sic.100SP["precip"], sic.367SP[1], as.factor(1:nrow(sic.100SP)))
library(rgeos)
## rgeos version: 0.5-2, (SVN revision 621)
## GEOS runtime version: 3.8.0-CAPI-1.13.1
## Linking to sp version: 1.3-3
## Polygon checking: TRUE
dist0 <- as.data.frame(gDistance(sic.100SP["precip"], sic.100SP["precip"], byid=TRUE))
names(dist0) <- paste("layer", names(dist0), sep=".")
dist1 <- as.data.frame(gDistance(sic.100SP["precip"], sic.367SP["precip"], byid=TRUE))
names(dist1) <- paste("layer", names(dist1), sep=".")
Then the observed responses and any covariates are added to the per-observation distances, and a formula constructed:
rm.precip <- cbind(as.data.frame(sic.100SP)[c("precip", "altitude")], dist0)
rm.precip1 <- cbind(as.data.frame(sic.367SP)[c("altitude")], dist1)
dn0 <- paste(names(dist0), collapse="+")
fm0 <- as.formula(paste("precip ~ altitude +", dn0))
#fm0
The ranger function from the ranger package can be used for fast random forest fitting:
library(ranger)
m.precip <- ranger(fm0, rm.precip, quantreg=TRUE, num.trees=150, seed=1)
m.precip
## Ranger result
##
## Call:
## ranger(fm0, rm.precip, quantreg = TRUE, num.trees = 150, seed = 1)
##
## Type: Regression
## Number of trees: 150
## Sample size: 100
## Number of independent variables: 101
## Mtry: 10
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 6150.302
## R squared (OOB): 0.5482526
And make predictions from the fitted model for the 367 weather stations held back:
# the median and approximately +/- one standard deviation under normality
# (pnorm(1) is about 0.841, pnorm(1, lower.tail=FALSE) about 0.159)
quantiles <- c(pnorm(1, lower.tail=FALSE), 0.5, pnorm(1))
precip.rfd <- as.data.frame(predict(m.precip, rm.precip1, type="quantiles",
quantiles=quantiles)$predictions)
res <- SpatialPointsDataFrame(as(sic.367SP, "SpatialPoints"),
data=cbind(sic.367SP[["precip"]], precip.rfd,
as.data.frame(aK$krige_output)[,3:5]))
# approximate the prediction standard deviation by half the spread between
# the upper and lower quantile predictions
res$rf_sd <- (res[[4]] - res[[2]])/2
names(res) <- make.names(names(res))
spplot(res, zcol=c(1,5,3))
spplot(res, zcol=c(7,8))
The steps are then repeated, this time also adding the point coordinates to the covariates joined to the per-observation distances, and constructing a new formula:
xy100 <- coordinates(sic.100SP)
xy367 <- coordinates(sic.367SP)
rm.precip <- cbind(as.data.frame(sic.100SP)[c("precip", "altitude")], x=xy100[,1], y=xy100[,2], dist0)
rm.precip1 <- cbind(as.data.frame(sic.367SP)[c("altitude")], x=xy367[,1], y=xy367[,2], dist1)
dn0 <- paste(names(dist0), collapse="+")
fm0 <- as.formula(paste("precip ~ altitude + x + y +", dn0))
#fm0
m.precip <- ranger(fm0, rm.precip, quantreg=TRUE, num.trees=150, seed=1)
m.precip
## Ranger result
##
## Call:
## ranger(fm0, rm.precip, quantreg = TRUE, num.trees = 150, seed = 1)
##
## Type: Regression
## Number of trees: 150
## Sample size: 100
## Number of independent variables: 103
## Mtry: 10
## Target node size: 5
## Variable importance mode: none
## Splitrule: variance
## OOB prediction error (MSE): 5847.392
## R squared (OOB): 0.5705017
And make predictions from the fitted model for the 367 weather stations held back:
quantiles <- c(pnorm(1, lower.tail=FALSE), 0.5, pnorm(1))
precip.rfd <- as.data.frame(predict(m.precip, rm.precip1, type="quantiles",
quantiles=quantiles)$predictions)
res <- SpatialPointsDataFrame(as(sic.367SP, "SpatialPoints"),
data=cbind(sic.367SP[["precip"]], precip.rfd,
as.data.frame(aK$krige_output)[,3:5]))
res$rf_sd <- (res[[4]] - res[[2]])/2
names(res) <- make.names(names(res))
spplot(res, zcol=c(1,5,3))
spplot(res, zcol=c(7,8))
What we see on a map is a pattern, or perhaps some patterns mixed together.
It is not easy to work back from map pattern to the process or processes that generated it/them.
Using a variety of approaches, we can explore and analyse point patterns, also reviewing an important chapter in the development of quantitative geography.
Practically, we will also see how we can try out different approaches, and how their assumptions affect our conclusions.
David O’Sullivan and David Unwin (2003) Geographic Information Analysis, Wiley, chapter 4, plus chapter 3 for the curious (or 2010, ch. 5 plus ch. 4);
Ian Smalley and David Unwin (1968) The formation and shape of drumlins and their distribution and orientation in drumlin fields, Journal of Glaciology, 7, pp. 377–390; Alan R. Hill (1973) The distribution of drumlins in County Down, Ireland, Annals of the Association of American Geographers, 63 (2), pp. 226–240.
Others may also like Trevor Bailey and Anthony Gatrell (1995) Interactive Spatial Data Analysis, Longman, chapter 3.
library(rgdal)
## rgdal: version: 1.5-2, (SVN revision 896)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.0.2, released 2019/10/28
## Path to GDAL shared files: /usr/local/share/gdal
## GDAL binary built with GEOS: TRUE
## Loaded PROJ.4 runtime: Rel. 6.2.1, November 1st, 2019, [PJ_VERSION: 621]
## Path to PROJ.4 shared files: /usr/local/share/proj
## Linking to sp version: 1.3-3
drumlins <- readOGR(".", "drumlins")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rsb/und/ecs530/h19", layer: "drumlins"
## with 232 features
## It has 4 fields
drumlins_SP <- as(drumlins, "SpatialPoints")
A data set similar to the one referred to by O’Sullivan and Unwin on pp. 100-101 is available as part of the spatial package in R (associated with Venables and Ripley (2002) Modern Applied Statistics with S) — it is the one used by Upton and Fingleton, coded by Ripley. We have here copied the points to a shapefile.
library(maptools)
## Checking rgeos availability: TRUE
library(spatstat)
## Loading required package: spatstat.data
## Loading required package: nlme
## Loading required package: rpart
##
## spatstat 1.61-0 (nickname: 'Puppy zoomies')
## For an introduction to spatstat, type 'beginner'
##
## Note: spatstat version 1.61-0 is out of date by more than 11 weeks; a newer version should be available.
##
## Attaching package: 'spatstat'
## The following object is masked from 'package:lattice':
##
## panel.histogram
## The following object is masked from 'package:gstat':
##
## idw
drumlins_ppp <- as(drumlins_SP, "ppp")
head(drumlins_ppp)
## Planar point pattern: 6 points
## window: rectangle = [0.280488, 8.231707] x [5.567073, 13.487805] units
Although spatstat and the sp classes have developed independently, they have a good deal in common, and point patterns, images and polygon windows can be exchanged
Point pattern objects need bounding windows to show where the population of data points were collected. The default window is the bounding box of the points, but others are available.
bb <- boundingbox(drumlins_ppp)
ch <- convexhull.xy(drumlins_ppp)
rr <- ripras(drumlins_ppp)
drumlins_rr <- ppp(drumlins_ppp$x, drumlins_ppp$y, window=rr)
plot(drumlins_ppp)
plot(bb, add=TRUE, border="darkgreen", lwd=2, lty=1)
plot(ch, add=TRUE, border="darkred", lwd=2, lty=3)
plot(rr, add=TRUE, border="orange", lwd=2, lty=2)
One legacy approach to point patterns, avoiding the drudgery of measuring inter-point distances, has been to divide the study area into quadrats, and count the numbers of points falling into each quadrat. This can take the form of a 2D histogram, or be displayed as an image plot.
qc <- quadratcount(drumlins_ppp)
cellsize <- mean(diff(attr(qc, "xbreaks")))
cellsize <- c(cellsize, mean(diff(attr(qc, "ybreaks"))))
grd <- GridTopology(c(attr(qc, "xbreaks")[1]+cellsize[1]/2, attr(qc, "ybreaks")[1]+cellsize[2]/2), cellsize, dim(qc))
t3 <- cbind(coordinates(grd), c(t(qc)))
plot(as(SpatialGrid(grd), "Spatial"), axes=TRUE)
title(main="Quadrat counts")
plot(drumlins_SP, add=TRUE, cex=0.8)
text(t3[,1], t3[,2], t3[,3], cex=1.2, font=2, col="darkred")
abline(h=attr(qc, "ybreaks"))
abline(v=attr(qc, "xbreaks"))
Chi-squared tests for Complete Spatial Randomness using quadrat counts may seem attractive, but suffer from the same problems as do histogram bins:
quadrat.test(drumlins_ppp)
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_ppp
## X2 = 37.828, df = 24, p-value = 0.07222
## alternative hypothesis: two.sided
##
## Quadrats: 5 by 5 grid of tiles
Just adding one more row and column of quadrats, or switching windows, changes our conclusion:
quadrat.test(drumlins_ppp, nx=6)
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_ppp
## X2 = 41.414, df = 35, p-value = 0.4221
## alternative hypothesis: two.sided
##
## Quadrats: 6 by 6 grid of tiles
quadrat.test(drumlins_rr)
## Warning: Some expected counts are small; chi^2 approximation may be inaccurate
##
## Chi-squared test of CSR using quadrat counts
##
## data: drumlins_rr
## X2 = 34.389, df = 24, p-value = 0.156
## alternative hypothesis: two.sided
##
## Quadrats: 25 tiles (irregular windows)
Density plots use a 2D kernel, in spatstat a Gaussian kernel, to create smoothed histograms avoiding the problems of quadrat counts. The key argument to pass to the density method for point pattern objects is sigma=, which determines the bandwidth of the kernel. Since we can coerce the image objects output by the method to an sp class, we use this to accumulate density values for different values of sigma.
SPrr <- as(rr, "SpatialPolygons")
SG0 <- Sobj_SpatialGrid(SPrr)$SG
crds <- coordinates(SG0)
crds <- list(x=crds[,1], y=crds[,2])
k025 <- density(drumlins_rr, sigma=0.25, xy=crds)
SG <- as(k025, "SpatialGridDataFrame")
k050 <- density(drumlins_rr, sigma=0.5, xy=crds)
SG <- cbind(SG, as(k050, "SpatialGridDataFrame"))
k075 <- density(drumlins_rr, sigma=0.75, xy=crds)
SG <- cbind(SG, as(k075, "SpatialGridDataFrame"))
k100 <- density(drumlins_rr, sigma=1.0, xy=crds)
SG <- cbind(SG, as(k100, "SpatialGridDataFrame"))
names(SG) <- c("k025", "k050", "k075", "k100")
spl <- list("sp.points", drumlins, pch=".", col=1)
spplot(SG, c("k025", "k050", "k075", "k100"), col.regions=terrain.colors(16), cuts=15, sp.layout=list(spl), par.strip.text=list(cex=0.7))
Narrower bandwidths yield more extreme values; broader bandwidths narrow the interquartile range. From this table, we can see how changing the bandwidth affects the relative differences in our view of the local estimates of intensity.
summary(as(SG, "data.frame")[,1:4])
## k025 k050 k075 k100
## Min. : 0.000246 Min. :0.1863 Min. :0.5847 Min. :0.9518
## 1st Qu.: 2.140005 1st Qu.:2.7470 1st Qu.:3.0331 1st Qu.:3.1944
## Median : 3.648164 Median :3.7812 Median :3.8033 Median :3.8120
## Mean : 3.649323 Mean :3.6303 Mean :3.6151 Mean :3.6113
## 3rd Qu.: 5.122177 3rd Qu.:4.7156 3rd Qu.:4.4479 3rd Qu.:4.2417
## Max. :11.145954 Max. :6.2894 Max. :5.3341 Max. :5.1032
We can find and plot nearest neighbour distances, finding them with the nndist function — plotting the empirical cumulative distribution function of the nearest neighbour distances is interesting:
nns <- nndist(drumlins_rr)
summary(nns)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1446 0.2700 0.3203 0.3254 0.3685 1.0073
plot(ecdf(nns))
The \(\hat{G}\) measure turns out to be just the ECDF of the nearest neighbour distances, plotted by default with the expected CSR line; the Gest function returns binned values for a range of distance bins best chosen by the function:
plot(ecdf(nns), xlim=c(0, 0.5))
plot(Gest(drumlins_ppp), add=TRUE, lwd=3)
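For reference, the expected CSR line drawn by Gest is the standard closed form for a homogeneous Poisson process with intensity \(\lambda\):
\[ G(r) = 1 - \exp(-\lambda \pi r^2), \] so an empirical \(\hat{G}\) above this curve suggests clustering (shorter nearest neighbour distances than expected), and one below it suggests regularity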
If we generate many simulated CSR point patterns for the current window, we can use the envelope method to explore how the observed \(\hat{G}\) measures lie in relation to the simulated ones:
n <- drumlins_rr$n
set.seed(121122)
ex <- expression(runifpoint(n, win=rr))
res <- envelope(drumlins_rr, Gest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
plot(res, xlim=c(0,0.7))
for(i in 2:100) lines(attr(res, "simfuns")[[1]], attr(res, "simfuns")[[i]], col="grey")
plot(res, add=TRUE, lwd=3, xlim=c(0,0.7))
We can also compute the nearest neighbour based Clark-Evans R statistic:
clarkevans(drumlins_ppp)
## naive Donnelly cdf
## 1.248991 1.214479 1.229503
clarkevans(drumlins_rr, correction="none")
## [1] 1.238292
clarkevans(drumlins_rr, correction="guard", clipregion=erosion.owin(rr, r=1))
## [1] 1.194691
which seem to indicate that the observed and CSR expected distances are similar, but perhaps more evenly spaced than clustered.
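As a cross-check, the naive statistic can be reproduced by hand from its definition, the mean observed nearest neighbour distance divided by its CSR expectation \(1/(2\sqrt{\hat{\lambda}})\) (a sketch, ignoring edge correction):
# naive Clark-Evans R computed from first principles, no edge correction
lambda.hat <- drumlins_rr$n/area.owin(rr)
mean(nndist(drumlins_rr))/(1/(2*sqrt(lambda.hat)))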
From what we have seen, it appears that the drumlin summit points are more regularly than randomly distributed. If we consider, however, that the absence of short nearest neighbour distances may mean that the drumlins “push” each other apart (in fact this is about points not being a good way of representing ellipses), we can try to simulate from a Simple Sequential Inhibition (SSI) process with a 180m inhibition radius (0.18 in the units of the window) instead of CSR:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Gest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
capture.output(plot(res, xlim=c(0,0.7)))
for(i in 2:100) lines(attr(res, "simfuns")[[1]], attr(res, "simfuns")[[i]], col="grey")
capture.output(plot(res, add=TRUE, lwd=3, xlim=c(0,0.7)))
As we know, \(\hat{G}\) uses nearest neighbour distances to express summary features of a point pattern. The \(\hat{K}\) function uses point intensities in rings spreading out from the points, and so uses more of the data to examine what is driving the process (reported here as \(\hat{L}(r) = \sqrt{\hat{K}(r)/\pi} - r\), which has expectation zero under CSR):
ex <- expression(runifpoint(n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="CSR simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
From what we already know, drumlins represented as points appear to inhibit each other under a distance of about 200m, so running the \(\hat{K}\) measure with an SSI process should show more of what is going on:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kest, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="SSI simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Another possibility is that the CSR hypothesis is in error in assuming that the process is homogeneous — we may also test against an inhomogeneous process using the Kinhom function. If its lambda argument is omitted, it does leave-one-out kernel smoothing to find \(\lambda_i\) by omitting the \(i\)-th point:
ex <- expression(runifpoint(n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kinhom, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="CSR simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Finally, we round off with an inhomogeneous SSI process:
ex <- expression(rSSI(0.18, n, win=rr))
set.seed(121122)
res <- envelope(drumlins_rr, Kinhom, nsim=99, simulate=ex,
verbose=FALSE, savefuns=TRUE)
r <- res$r
Lhat <- function(k, r) { (sqrt(k/pi)) - r }
plot(r, Lhat(res$obs, r), type="n", ylab="L(r)", main="SSI simulation", ylim=c(-0.17, 0.1))
for(i in 2:100) lines(r, Lhat(attr(res, "simfuns")[[i]], r), col="grey")
lines(r, Lhat(res$obs, r), lwd=2, col="brown")
lines(r, Lhat(res$lo, r), lwd=2, col="black", lty=2)
lines(r, Lhat(res$hi, r), lwd=2, col="black", lty=2)
Comparing the SSI with the CSR results indicates that we can not only reject CSR as the process driving drumlin point locations, but also that we have good grounds to suppose that a spatial inhibition process is operating
It is also very possible that the process is inhomogeneous, that is that an omitted heterogeneity in the surface is influencing the point pattern
The minimum drumlin footprint is such that drumlins cannot be closer to each other than a certain minimum distance
At short range, the estimated \(\hat{L}\) values are lower than the lower envelope bounds, but only at distances of less than 0.4 units — this corresponds to spatial inhibition
As the \(\hat{L}\) simulation values for the SSI process indicate, drumlins are not well represented by points, because they have area, even volume, and rarely “overlap”
It is frequently the case that point patterns are not measured on homogeneous surfaces
One way to tackle this is, as in these examples from Zhang et al. (2009), to sample from the population at risk to form a control group
We are then interested in examining the spatial distribution of cases compared to the spatial distribution of controls
The cases are observations of schistosomiasis in Guichi, China, and the controls are taken from the population at risk
First we’ll read in the points, enclosing polygon, and river banks:
points <- readOGR(".", "points")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rsb/und/ecs530/h19", layer: "points"
## with 166 features
## It has 1 fields
rivers <- readOGR(".", "rivers")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rsb/und/ecs530/h19", layer: "rivers"
## with 5 features
## It has 1 fields
## Integer64 fields read as strings: OBJECTID
poly <- readOGR(".", "poly")
## OGR data source with driver: ESRI Shapefile
## Source: "/home/rsb/und/ecs530/h19", layer: "poly"
## with 1 features
## It has 1 fields
## Integer64 fields read as strings: ID
plot(poly)
plot(rivers, add=TRUE)
points(points, pch=as.integer(points$mark))
We coerce the sp objects to their spatstat representations, the points with marks:
poly.owin <- as(poly, "owin")
xy <- coordinates(points)
points.ppp <- ppp(xy[,1], xy[,2], marks=points$mark, window=poly.owin)
summary(points.ppp)
## Marked planar point pattern: 166 points
## Average intensity 317.9072 points per square unit
##
## Coordinates are given to 8 decimal places
##
## Multitype:
## frequency proportion intensity
## case 83 0.5 158.9536
## control 83 0.5 158.9536
##
## Window: polygonal boundary
## single connected closed polygon with 328 vertices
## enclosing rectangle: [0, 1] x [0, 0.8758322] units
## Window area = 0.522165 square units
## Fraction of frame area: 0.596
plot(split(points.ppp), nrows=2)
We make a mask for the kernel density calculations, and show the overall density:
Mask <- Sobj_SpatialGrid(poly)
o <- over(Mask$SG, poly)
Mask1 <- SpatialGridDataFrame(slot(Mask$SG, "grid"), data=o)
Mask2 <- as(Mask1, "SpatialPixelsDataFrame")
XY <- coordinates(Mask2)
XY <- list(x=XY[,1], y=XY[,2])
Z <- density(points.ppp, xy=XY)
plot(Z)
The split density shows how the point patterns differ:
Z <- density(split(points.ppp), xy=XY)
plot(Z, nrows=2)
We can also display the case density as a proportion of the overall density:
pCases <- with(Z, eval.im(case/(case + control)))
plot(pCases)
plot(points.ppp, add=TRUE)