Copyright

All the material presented here, to the extent it is original, is available under CC-BY-SA. Parts build on joint tutorials with Edzer Pebesma.

Required current contributed CRAN packages:

I am running R 3.6.1, with recent update.packages().

needed <- c("rgdal", "RSQLite", "units", "gdistance", "Matrix", "igraph", "raster", "sp", "spData", "mapview", "sf")

Script

Script and data at https://github.com/rsbivand/ECS530_h19/raw/master/ECS530_II.zip. Download to suitable location, unzip and use as basis.

Schedule

2/12 (I) Spatial data representation, (II) Support+topology, input/output
3/12 (III) Coordinate reference systems, (IV) visualization
4/12 (V) R/GIS interfaces, project surgery
5/12 (VI) Spatial autocorrelation, (VII) regression
6/12 (VIII) Interpolation, point processes, project surgery
7/12 Presentations

Session II

13:15-13:45 Support
13:45-14:15 Topology operations
14:15-15:00 Input/output

Support

Support expressses the relationship between the spatial and temporal entities of observation used for capturing underlying data generation processes, and those processes themselves.

The processes and their associated spatial and temporal scales (“footprints”) and strides may not be well-understood, so the ways that we conduct observations may or may not give use good “handles” on the underlying realities.

Since we are most often interested in drawing conclusions about the underlying realities, we should try to be aware of issues raised when we mix things up, the ecological fallacy being a typical example (Wakefield and Lyons 2010), involving the drawing of conclusions about individuals from aggregates.

Change of support occurs when the observational entities we are using differ in their spatial and/or temporal footprint, and we need to impute or interpolate from one support to another (Gotway and Young 2002). Areal interpolation is one of the pathways (Do, Thomas-Agnan, and Vanhems 2015).

Often we are not clear about the aggregation status of variables that we observe by entity either. In many entities, we are dealing with count aggregates. Say we have an accident count on a road segment, it is clear that if we subset the segment, we would need to impute the count to the subsegments that we have chosen. The variable is an aggregate, and subsetting should preserve the sum over all subsets rule.

byb <- readRDS("byb.rds")
names(attributes(byb))

## [1] "names"     "row.names" "sf_column" "agr"       "class"

library(sf)

## Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1

st_agr(byb)

##      osm_id        name electrified   frequency       gauge    maxspeed 
##        <NA>        <NA>        <NA>        <NA>        <NA>        <NA> 
##     name.en     railway      source     voltage 
##        <NA>        <NA>        <NA>        <NA> 
## Levels: constant aggregate identity

Work by (Stasch et al. 2014; Scheider et al. 2016) has shown that misunderstandings about whether variable values are constant over a segment (we really want the gauge to be constant), whether they are identities (osm_id), or whether they are measured over the whole observed time period at the point, line segment, polygon, or raster cell by counting or other aggregation, are quite prevalent.

All "sf" objects have an "agr" attribute, set by default to unknown (NA) for each non-geometry column in the data frame. In this case the information is of very poor quality (many missing values, others guessed), but use can be made of the facility in other datasets.

byb$length <- st_length(byb)
summary(byb$length)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    7.481   47.641  133.922  255.529  380.768 1216.524

str(byb$length)

## Object of class units:
##  num [1:189] 380 771 372 818 814 ...
##  - attr(*, "units")=List of 2
##   ..$ numerator  : chr "m"
##   ..$ denominator: chr(0) 
##   ..- attr(*, "class")= chr "symbolic_units"

Unfortunately, the simple examples given in SDSR do not work. The introduction of units, shown here and in (Pebesma, Mailund, and Hiebert 2016) also do not (yet) provide the background for issuing warnings with regard to support that were anticipated when the ideas were first considered. The idea underlying st_agr() has been to warn when an aggregate variable is copied across to a part of an entity as though it was a constant.

Boston housing data set example

The results I presented at useR! 2016 on this data set aggregated the data to air pollution model output zones

Here, based on , spatially structured random effects are added by air pollution model output zones instead

used a hedonic model to find out how house values were affected by air pollution in Boston, when other variables were taken into consideration

They chose to use 506 census tracts as units of observation, but air pollution values were available from model output for 122 zones, of which less than 100 fell within the study area

By taking the 96 air pollution model output zones as the upper level in a hierarchical spatial model, we explore the consequences for the results

The (Harrison and Rubinfeld 1978) Boston housing data set has been widely used because of its availability from (Belsley, Kuh, and Welsch 1980), (Pace and Gilley 1997) and (Gilley and Pace 1996)

The underlying research question in the original article was the estimation of willingness to pay for clean air, using air pollution levels and house values in a hedonic regression

(Pace and Gilley 1997) showed clearly, the air pollution coefficient estimate in the model changed when residual spatial autocorrelation was taken into account (from -0.0060 to -0.0037), as did its standard error (from 0.0012 to 0.0016)

Is the strength of spatial autocorrelation observed in this data set a feature of the census tract observations themselves, or has it been introduced or strengthened by changes in the observational units used for the different variables?

Our focus will be on the choices of observational units made in assembling the original data set, and on another relevant alternative

Using an approximation to the model output zones from which the air pollution variable levels were taken, it will be shown that much of the puzzling spatial autocorrelation is removed

House value data

scan from census form

(Harrison and Rubinfeld 1978) used median house values in 1970 USD for 506 census tracts in the Boston SMSA for owner-occupied one-family houses; census tracts with no reported owner-occupied one-family housing units were excluded from the data set. The relevant question is H11, which was answered by crossing off one grouped value alternative, ranging from under USD 5,000 to over USD 50,000

The house value data have census tract support, and are median values calculated from group counts from the alternatives offered in H11; tracts with weighted median values in these upper and lower alternative value classes are censored

The published census tract tabulations show the link between question H11 and the Statlib-based data (after correction)

The median values tabulated in the census report can be reconstructed from the tallies shown in the same Census tables fairly accurately using the weightedMedian function in the matrixStats package in R, using linear interpolation

The effectiveness of the study was prejudiced by the fact that areas of central Boston with the highest levels of air pollution also lose house value data, either because of tract exclusion (no one-family housing units reported) or right or left censored tracts

Air pollution data

The data on air pollution concentrations were obtained from the Transportation and Air Shed SIMulation model (TASSIM) applied to the Boston air shed (Ingram and Fauth 1974)

The calibrated model results were obtained for 122 zones, and assigned proportionally to the 506 census tracts

The NOX values in the published data sets are in units of 10 ppm (10 parts per million), and were then multiplied by 10 again in the regression models to yield parts per 100 million (pphm)

Many of the smaller tracts belong to the same TASSIM zones; this is a clear case of change of support, with very different spatial statistical properties under the two different entitation schemes (???)

TASSIM <- st_read("TASSIM.gpkg")

## Reading layer `TASSIM' from data source `/home/rsb/und/ecs530/h19/TASSIM.gpkg' using driver `GPKG'
## Simple feature collection with 122 features and 1 field
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -111000.5 ymin: 364415.7 xmax: 171926.9 ymax: 633116.7
## epsg (SRID):    NA
## proj4string:    +proj=lcc +lat_0=41 +lon_0=-70.5 +lat_1=41.2833333333333 +lat_2=41.4833333333333 +x_0=60960.1219202439 +y_0=0 +ellps=clrk66 +nadgrids=@conus,@alaska,@ntv2_0.gsb,@ntv1_can.dat +units=us-ft +no_defs

library(mapview)
mapview(TASSIM)

A two-part report details the use of the TASSIM simulation model . Both of these volumes include line-printer maps of the TASSIM zones, and the Fortran code in volume 2 (Ingram, Fauth, and Kroch 1974) shows the links between the 122 TASSIM zones and the line printer output. Western TASSIM zones appear to lie outside the Boston SMSA tracts included in the 506 census tract data set.

The Boston data set has 2D polygon and multipolygon geometries stored in a shapefile; shapefiles are pre-SF

library(sf)
library(spData)
b506 <- st_read(system.file("shapes/boston_tracts.shp", package="spData")[1])

## Reading layer `boston_tracts' from data source `/home/rsb/lib/r_libs/spData/shapes/boston_tracts.shp' using driver `ESRI Shapefile'
## Simple feature collection with 506 features and 36 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -71.52311 ymin: 42.00305 xmax: -70.63823 ymax: 42.67307
## epsg (SRID):    4267
## proj4string:    +proj=longlat +datum=NAD27 +no_defs

mapview(b506, zcol="censored")

Aggregating geometries to model output zones

After dropping the censored census tracts, we need to derive the model output zones. The aggregate method calls sf::st_union on each unique grouping value, but the function called internally is a trick, putting only the first value of each variable in the output; for this reason we only retain the ids:

b489 <- b506[b506$censored == "no",]
t0 <- aggregate(b489, list(ids = b489$NOX_ID), head, n = 1)
b94 <- t0[, c("ids", attr(t0, "sf_column"))]

Air pollution data

The figure shows clearly that the study of the relationship between NOX and house value will be impacted by ``copying out’’ NOX values to census tracts, as noted by (Harrison and Rubinfeld 1978)

mapview(b489, zcol="NOX")

Even if we were to use more class intervals in these choropleth maps, the visual impression would be the same, because the underlying data have support approximated by the TASSIM zones, not by the census tracts

Other independent variables

Besides NOX, the other census covariates included in the hedonic regression to account for median house values are the average number of rooms per house, the proportion of houses older than 1940, the proportion low-status inhabitants in each tract, and the Black proportion of population in the tract - originally expressed as a broken-stick relationship, but here taken as a percentage

The crime rate is said to be taken from FBI data by town, but which is found on inspection to vary by tract

The distance from tract to employment centres is derived from other sources, as is the dummy variable for tracts bordering Charles River

Other covariates are defined by town, with some also being fixed for all towns in Boston

The town aggregates of census tracts are used in many of the census report tabulations, and of the 92 towns, 17 only contain one census tract, while one town contains thirty census tracts

The variables are the proportion of residential lots zoned over 25000 sq. ft, the proportion of non-retail business acres, accessibility to radial highways, full-value property-tax rate per USD 10,000, and pupil-teacher ratio by town school district

Towns and TASSIM zones

In the case of 80 approximate TASSIM zones aggregated from census tracts, the boundaries do coincide exactly with town boundaries

For the remaining 12 towns and 16 TASSIM zones, there are overlaps between more than one town and TASSIM zone, mostly in Boston itself

Using TASSIM zones for analysis should therefore also reduce the levels of autocorrelation induced by ``copying out’’ town values to tracts within towns

The exact match between town boundaries defined using census tracts, and approximated TASSIM zones also constructed using census tracts is not necessarily an indication that towns were used as TASSIM zones

GEOS, topology operations

(precision in sf, scale in rgeos)

Broad Street Cholera Data

Even though we know that John Snow already had a working hypothesis about cholera epidemics, his data remain interesting, especially if we use a GIS to find the street distances from mortality dwellings to the Broad Street pump in Soho in central London. Brody et al. (2000) point out that John Snow did not use maps to find the Broad Street pump, the polluted water source behind the 1854 cholera epidemic, because he associated cholera with water contaminated with sewage, based on earlier experience.

The basic data to be used here were made available by Jim Detwiler, who had collated them for David O’Sullivan for use on the cover of O’Sullivan and Unwin (2003), based on earlier work by Waldo Tobler and others. The files were a shapefile of counts of deaths at front doors of houses, two shapefiles of pump locations and a georeferenced copy of the Snow map as an image; the files were registered in the British National Grid CRS. These have been converted to GPKG format. In GRASS, a suitable location was set up in this CRS and the image file was imported; the building contours were then digitised as a vector layer and cleaned.

We would like to find the line of equal distances shown on the extract from John Snow’s map shown in Brody et al. (2000) shown here, or equivalently find the distances from the pumps to the front doors of houses with mortalities following the roads, not the straight line distance. We should recall that we only have the locations of counts of mortalities, not of people at risk or of survivors.

library(sf)
bbo <- st_read("snow/bbo.gpkg")

## Reading layer `bbo' from data source `/home/rsb/und/ecs530/h19/snow/bbo.gpkg' using driver `GPKG'
## Simple feature collection with 1 feature and 1 field
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 528890.7 ymin: 180557.9 xmax: 529808.7 ymax: 181415.7
## epsg (SRID):    NA
## proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs

buildings <- st_read("snow/buildings.gpkg", quiet=TRUE)
deaths <- st_read("snow/deaths.gpkg", quiet=TRUE)
sum(deaths$Num_Css)

## [1] 578

b_pump <- st_read("snow/b_pump.gpkg", quiet=TRUE)
nb_pump <- st_read("snow/nb_pump.gpkg", quiet=TRUE)

As there is a small difference between the CRS values, we copy across before conducting an intersection operation to clip the buildings to the boundary, then we buffer in the buildings object (to make the roads broader).

library(sf)
st_crs(buildings) <- st_crs(bbo)
buildings1 <- st_intersection(buildings, bbo)
buildings2 <- st_buffer(buildings1, dist=-4)

plot(st_geometry(buildings2))

Next we create a dummy raster using raster with 1 meter resolution in the extent of the buildings object (note that raster::extent() works with sf objects, but the CRS must be given as a string):

library(raster)

## Loading required package: sp

resolution <- 1
r <- raster(extent(buildings2), resolution=resolution, crs=st_crs(bbo)$proj4string)

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum Unknown_based_on_Airy_1830_ellipsoid in CRS definition

r[] <- resolution
summary(r)

##         layer
## Min.        1
## 1st Qu.     1
## Median      1
## 3rd Qu.     1
## Max.        1
## NA's        0

One of the building3 component geometries was empty (permitted in sf, not in sp), so should be dropped before running raster::cellFromPolygon() to list raster cells in each geometry (so we need unlist() to assign NA to the in-buffered buildings):

buildings3 <- as(buildings2[!st_is_empty(buildings2),], "Spatial")
cfp <- cellFromPolygon(r, buildings3)
is.na(r[]) <- unlist(cfp)
summary(r)

##          layer
## Min.         1
## 1st Qu.      1
## Median       1
## 3rd Qu.      1
## Max.         1
## NA's    330537

plot(r)

Using gdistance, we create a symmetric transition object with an internal sparse matrix representation, from which shortest paths can be computed:

library(gdistance)

tr1 <- transition(r, transitionFunction=function(x) 1/mean(x), directions=8, symm=TRUE)

## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum Unknown_based_on_Airy_1830_ellipsoid in CRS definition

We need to find shortest paths from addresses with mortalities to the Broad Street pump first:

sp_deaths <- as(deaths, "Spatial")
d_b_pump <- st_length(st_as_sfc(shortestPath(tr1, as(b_pump, "Spatial"), sp_deaths, output="SpatialLines")))

and then in a loop from the same addresses to each of the other pumps in turn, finally taking the minimum:

res <- matrix(NA, ncol=nrow(nb_pump), nrow=nrow(deaths))
sp_nb_pump <- as(nb_pump, "Spatial")
for (i in 1:nrow(nb_pump)) res[,i] <- st_length(st_as_sfc(shortestPath(tr1, sp_nb_pump[i,], sp_deaths, output="SpatialLines")))
d_nb_pump <- apply(res, 1, min)

Because sf::st_length() uses units units, but they get lost in assigning to a matrix, we need to re-assign before testing whether the Broad Street pump is closer or not:

library(units)

## udunits system database from /usr/share/udunits

units(d_nb_pump) <- "m"
deaths$b_nearer <- d_b_pump < d_nb_pump
by(deaths$Num_Css, deaths$b_nearer, sum)

## deaths$b_nearer: FALSE
## [1] 217
## ------------------------------------------------------------ 
## deaths$b_nearer: TRUE
## [1] 361

Recent challenges

A recent upgrade of GEOS from 3.7.1 to 3.7.2 on a CRAN test server led to failures in three packages using rgeos for topological operations. rgeos 0.4-3 set the checkValidity= argument to for example gIntersection() to FALSE (TRUE threw an error if either geometry was invalid). An issue was opened on the sf github repository (rgeos is developed on R-Forge). The test objects (from an example from inlmisc) will be used here:

rgeos::version_GEOS0()

## [1] "3.8.0"

For rgeos <= 0.4-3, the default was not to check input geometries for validity before trying topological operations, for >= 0.5-1, the default changes when GEOS > 3.7.1 to check for validity. The mode of the argument also changes to integer from logical:

cV_old_default <- ifelse(rgeos::version_GEOS0() >= "3.7.2", 0L, FALSE)
yy <- rgeos::readWKT(readLines("invalid.wkt"))
rgeos::gIsValid(yy, byid=TRUE, reason=TRUE)

##                             1                             2 
## "Ring Self-intersection[2 3]"              "Valid Geometry" 
##                             3                             4 
##              "Valid Geometry"              "Valid Geometry"

sf::sf_extSoftVersion()

##           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
##        "3.8.0"        "3.0.2"        "6.2.1"         "true"         "true"

The same underlyng GEOS code is used in sf:

sf::st_is_valid(sf::st_as_sf(yy), reason=TRUE)

## [1] "Ring Self-intersection[2 3]" "Valid Geometry"             
## [3] "Valid Geometry"              "Valid Geometry"

The geometries were also invalid in GEOS 3.7.1, but the operations succeeded:

ply <- rgeos::readWKT(readLines("ply.wkt"))
oo <- try(rgeos::gIntersection(yy, ply, byid=TRUE, checkValidity=cV_old_default), silent=TRUE)
print(attr(oo, "condition")$message)

## [1] "TopologyException: Input geom 0 is invalid: Ring Self-intersection at or near point 2 3 at 2 3"

ooo <- try(sf::st_intersection(sf::st_as_sf(yy), sf::st_as_sf(ply)), silent=TRUE)
print(attr(oo, "condition")$message)

## [1] "TopologyException: Input geom 0 is invalid: Ring Self-intersection at or near point 2 3 at 2 3"

In rgeos 0.5-1 and GEOS 3.7.2, new warnings are provided, and advice to check validity.

cV_new_default <- ifelse(rgeos::version_GEOS0() >= "3.7.2", 1L, TRUE)
try(rgeos::gIntersection(yy, ply, byid=TRUE, checkValidity=cV_new_default), silent=TRUE)

## Warning in RGEOSUnaryPredFunc(spgeom, byid, "rgeos_isvalid"): Ring Self-
## intersection at or near point 2 3

## yy is invalid

## Warning in rgeos::gIntersection(yy, ply, byid = TRUE, checkValidity
## = cV_new_default): Invalid objects found; consider using
## set_RGEOS_CheckValidity(2L)

New options are provided, get_RGEOS_CheckValidity() and set_RGEOS_CheckValidity(), because in some packages the use of topological operations may happen through other packages, such as raster::crop() calling rgeos::gIntersection() without access to the arguments of the latter function.

If we follow the advice, zero-width buffering is used to try to rectify the invalidity:

oo <- rgeos::gIntersection(yy, ply, byid=TRUE, checkValidity=2L)

## Warning in RGEOSUnaryPredFunc(spgeom, byid, "rgeos_isvalid"): Ring Self-
## intersection at or near point 2 3

## yy is invalid

## Attempting to make yy valid by zero-width buffering

rgeos::gIsValid(oo)

## [1] TRUE

equivalently:

oo <- rgeos::gIntersection(rgeos::gBuffer(yy, byid=TRUE, width=0), ply, byid=TRUE, checkValidity=1L)
rgeos::gIsValid(oo)

## [1] TRUE

and by extension to sf until GEOS 3.7.2 is accommodated:

ooo <- sf::st_intersection(sf::st_buffer(sf::st_as_sf(yy), dist=0), sf::st_as_sf(ply))
all(sf::st_is_valid(ooo))

## [1] TRUE

The actual cause was the use of an ESRI/shapefile style/understanding of the self-touching exterior ring. In OGC style, an interior ring is required, but not in shapefile style. Martin Davis responded in the issue:

The problem turned out to be a noding robustness issue, which caused the valid input linework to have a self-touch after noding. This caused the output to be invalid. The fix was to tighten up the internal overlay noding validation check to catch this situation. This has the side-effect of detecting (and failing) all self-touches in input geometry. Previously, vertex-vertex self-touches were not detected, and in many cases they would simply propagate through the overlay algorithm. (This made the output invalid as well, but since the inputs were already invalid this behaviour was considered acceptable).

The change in GEOS behaviour was not planned as such, but has consequences, fortunately detected because CRAN checks by default much more than say Travis by default. Zero-width buffering will not repair all cases of invalidity, but does work here.

Input/output

sf_extSoftVersion()

##           GEOS           GDAL         proj.4 GDAL_with_GEOS     USE_PROJ_H 
##        "3.8.0"        "3.0.2"        "6.2.1"         "true"         "true"

While sp handed off dependencies to interfaces to external software GEOS (rgeos) and GDAL+PROJ (rgdal), sf includes all the external dependencies itself. This also means that stars needs sf to provide raster drivers (some other packages like gdalcubes themselves link to GDAL).

sort(as.character(st_drivers("vector")$name))

##  [1] "AeronavFAA"     "AmigoCloud"     "ARCGEN"         "AVCBin"        
##  [5] "AVCE00"         "BNA"            "CAD"            "Carto"         
##  [9] "Cloudant"       "CouchDB"        "CSV"            "CSW"           
## [13] "DGN"            "DXF"            "EDIGEO"         "EEDA"          
## [17] "ElasticSearch"  "ESRI Shapefile" "ESRIJSON"       "Geoconcept"    
## [21] "GeoJSON"        "GeoJSONSeq"     "Geomedia"       "GeoRSS"        
## [25] "GFT"            "GML"            "GMLAS"          "GPKG"          
## [29] "GPSBabel"       "GPSTrackMaker"  "GPX"            "HTF"           
## [33] "HTTP"           "Idrisi"         "Interlis 1"     "Interlis 2"    
## [37] "JML"            "JP2OpenJPEG"    "KML"            "MapInfo File"  
## [41] "MBTiles"        "Memory"         "MSSQLSpatial"   "MVT"           
## [45] "NAS"            "netCDF"         "NGW"            "ODBC"          
## [49] "ODS"            "OGR_GMT"        "OGR_PDS"        "OGR_SDTS"      
## [53] "OGR_VRT"        "OpenAir"        "OpenFileGDB"    "OSM"           
## [57] "PCIDSK"         "PDF"            "PDS4"           "PGDUMP"        
## [61] "PGeo"           "PLSCENES"       "PostgreSQL"     "REC"           
## [65] "S57"            "SEGUKOOA"       "SEGY"           "Selafin"       
## [69] "SQLite"         "SUA"            "SVG"            "SXF"           
## [73] "TIGER"          "TopoJSON"       "UK .NTF"        "VDV"           
## [77] "VFK"            "Walk"           "WAsP"           "WFS"           
## [81] "WFS3"           "XLS"            "XLSX"           "XPlane"

The drivers provided by GDAL can (mostly) read from data formatted as described for the drivers, and can to a lesser extent write data out. Raster access can use spatial subsets of the data extent, something that is harder to do with vector. Proxy handling is similarly largely restricted to raster drivers.

sort(as.character(st_drivers("raster")$name))

##   [1] "AAIGrid"             "ACE2"                "ADRG"               
##   [4] "AIG"                 "AirSAR"              "ARG"                
##   [7] "BAG"                 "BIGGIF"              "BLX"                
##  [10] "BMP"                 "BSB"                 "BT"                 
##  [13] "BYN"                 "CAD"                 "CALS"               
##  [16] "CEOS"                "COASP"               "COSAR"              
##  [19] "CPG"                 "CTable2"             "CTG"                
##  [22] "DAAS"                "DERIVED"             "DIMAP"              
##  [25] "DIPEx"               "DOQ1"                "DOQ2"               
##  [28] "DTED"                "E00GRID"             "ECRGTOC"            
##  [31] "EEDAI"               "EHdr"                "EIR"                
##  [34] "ELAS"                "ENVI"                "ERS"                
##  [37] "ESAT"                "FAST"                "FIT"                
##  [40] "FujiBAS"             "GenBin"              "GFF"                
##  [43] "GIF"                 "GMT"                 "GPKG"               
##  [46] "GRASSASCIIGrid"      "GRIB"                "GS7BG"              
##  [49] "GSAG"                "GSBG"                "GSC"                
##  [52] "GTiff"               "GTX"                 "GXF"                
##  [55] "HDF5"                "HDF5Image"           "HF2"                
##  [58] "HFA"                 "HTTP"                "IDA"                
##  [61] "IGNFHeightASCIIGrid" "ILWIS"               "INGR"               
##  [64] "IRIS"                "ISCE"                "ISIS2"              
##  [67] "ISIS3"               "JAXAPALSAR"          "JDEM"               
##  [70] "JP2OpenJPEG"         "JPEG"                "KMLSUPEROVERLAY"    
##  [73] "KRO"                 "L1B"                 "LAN"                
##  [76] "LCP"                 "Leveller"            "LOSLAS"             
##  [79] "MAP"                 "MBTiles"             "MEM"                
##  [82] "MFF"                 "MFF2"                "MRF"                
##  [85] "MSGN"                "NDF"                 "netCDF"             
##  [88] "NGSGEOID"            "NGW"                 "NITF"               
##  [91] "NTv1"                "NTv2"                "NWT_GRC"            
##  [94] "NWT_GRD"             "OZI"                 "PAux"               
##  [97] "PCIDSK"              "PCRaster"            "PDF"                
## [100] "PDS"                 "PDS4"                "PLMOSAIC"           
## [103] "PLSCENES"            "PNG"                 "PNM"                
## [106] "PostGISRaster"       "PRF"                 "R"                  
## [109] "Rasterlite"          "RDA"                 "RIK"                
## [112] "RMF"                 "ROI_PAC"             "RPFTOC"             
## [115] "RRASTER"             "RS2"                 "RST"                
## [118] "SAFE"                "SAGA"                "SAR_CEOS"           
## [121] "SDTS"                "SENTINEL2"           "SGI"                
## [124] "SIGDEM"              "SNODAS"              "SRP"                
## [127] "SRTMHGT"             "Terragen"            "TIL"                
## [130] "TSX"                 "USGSDEM"             "VICAR"              
## [133] "VRT"                 "WCS"                 "WEBP"               
## [136] "WMS"                 "WMTS"                "XPM"                
## [139] "XYZ"                 "ZMap"

There are clear preferences among data providers and users for particular data formats, so some drivers get more exposure than others. For vector data, many still use "ESRI SShapefile", although its geometries are not SF-compliant, and data on features are stored in variant DBF files (text tiles, numerically imprecise, field name length restrictions, encoding issues). "geojson" and "GML" are text files with numeric imprecision in coordinates as well as data fields. Among vector drivers, "GPKG" is a viable standard and should be used as far as possible.

library(RSQLite)
db = dbConnect(SQLite(), dbname="snow/b_pump.gpkg")
dbListTables(db)

##  [1] "b_pump"                   "gpkg_contents"           
##  [3] "gpkg_extensions"          "gpkg_geometry_columns"   
##  [5] "gpkg_ogr_contents"        "gpkg_spatial_ref_sys"    
##  [7] "gpkg_tile_matrix"         "gpkg_tile_matrix_set"    
##  [9] "rtree_b_pump_geom"        "rtree_b_pump_geom_node"  
## [11] "rtree_b_pump_geom_parent" "rtree_b_pump_geom_rowid" 
## [13] "sqlite_sequence"

str(dbReadTable(db, "gpkg_geometry_columns"))

## 'data.frame':    1 obs. of  6 variables:
##  $ table_name        : chr "b_pump"
##  $ column_name       : chr "geom"
##  $ geometry_type_name: chr "POINT"
##  $ srs_id            : int 100000
##  $ z                 : int 0
##  $ m                 : int 0

str(dbReadTable(db, "b_pump")$geom)

## blob [1:1] 
## $ : raw [1:29] 47 50 00 01 ...
## @ ptype: raw(0)

dbDisconnect(db)

str(st_layers("snow/b_pump.gpkg"))

## List of 5
##  $ name    : chr "b_pump"
##  $ geomtype:List of 1
##   ..$ : chr "Point"
##  $ driver  : chr "GPKG"
##  $ features: num 1
##  $ fields  : num 1
##  - attr(*, "class")= chr "sf_layers"

st_layers("snow/nb_pump.gpkg")

## Driver: GPKG 
## Available layers:
##   layer_name geometry_type features fields
## 1    nb_pump         Point       11      1

library(rgdal)

## rgdal: version: 1.5-2, (SVN revision 896)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 3.0.2, released 2019/10/28
##  Path to GDAL shared files: /usr/local/share/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 6.2.1, November 1st, 2019, [PJ_VERSION: 621]
##  Path to PROJ.4 shared files: /usr/local/share/proj
##  Linking to sp version: 1.3-3

ogrInfo("snow/nb_pump.gpkg")

## Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
## dumpSRS): Discarded datum OSGB_1936 in CRS definition: +proj=tmerc +lat_0=49
## +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs

## Source: "/home/rsb/und/ecs530/h19/snow/nb_pump.gpkg", layer: "nb_pump"
## Driver: GPKG; number of rows: 11 
## Feature type: wkbPoint with 2 dimensions
## Extent: (529180.4 180665.9) - (529747.4 181359)
## CRS: +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.999601 +x_0=400000 +y_0=-100000 +ellps=airy +units=m +no_defs 
## Number of fields: 1 
##   name type length  typeName
## 1  cat   12      0 Integer64

rgdal::GDALinfo("output/preview.tif")

## Warning in rgdal::GDALinfo("output/preview.tif"): statistics not supported by
## this driver

## rows        1029 
## columns     1006 
## bands       1 
## lower left origin.x        651700 
## lower left origin.y        5160380 
## res.x       20 
## res.y       20 
## ysign       -1 
## oblique.x   0 
## oblique.y   0 
## driver      GTiff 
## projection  +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
## file        output/preview.tif 
## apparent band summary:
##    GDType hasNoDataValue NoDataValue blockSize1 blockSize2
## 1 Float64          FALSE           0          1       1006
## apparent band statistics:
##          Bmin       Bmax Bmean Bsd
## 1 -4294967295 4294967295    NA  NA
## Metadata:
## AREA_OR_POINT=Area

obj <- GDAL.open("output/preview.tif")

dim(obj)

## [1] 1029 1006

getDriverLongName(getDriver(obj))

## [1] "GeoTIFF"

image(getRasterData(obj, band=1, offset=c(100, 100), region.dim=c(200, 200)))

GDAL.close(obj)

All of these facilities are taken from GDAL; the raster facilities have been extant for many years. raster used the ease of subsetting to permit large rasters to be handled out-of-memory.

Summary: sf::st_read() and rgdal::readOGR() are equivalent, as are sf::st_write() and rgdal::writeOGR(). When writing, you may need to take steps if overwriting. rgdal::readGDAL() reads the raster data (sub)set into an sp object, stars::read_stars() reads into a possibly proxy stars object, and raster can also be used:

library(raster)
(obj <- raster("output/preview.tif"))

## class      : RasterLayer 
## dimensions : 1029, 1006, 1035174  (nrow, ncol, ncell)
## resolution : 20, 20  (x, y)
## extent     : 651700, 671820, 5160380, 5180960  (xmin, xmax, ymin, ymax)
## crs        : +proj=utm +zone=32 +datum=WGS84 +units=m +no_defs 
## source     : /home/rsb/und/ecs530/h19/output/preview.tif 
## names      : preview

Output: rgdal::writeGDAL(), stars::write_stars() or raster::writeRaster() may be used for writing, but what happens depends on details, such as storage formats. Unlike vector, most often storage formats will be taken as homogeneous by type.

Tiled representations

While interactive web mapping interfaces use raster or vector tiled backgrounds, we have not (yet) approached tiles or pyramids internally.

Belsley, David A., Edwin Kuh, and Roy E. Welsch. 1980. Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: John Wiley & Sons.

Brody, H., M. R. Rip, P. Vinten-Johansen, N. Paneth, and S. Rachman. 2000. “Map-Making and Myth-Making in Broad Street: The London Cholera Epidemic, 1854.” Lancet 356: 64–68.

Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015. “Accuracy of Areal Interpolation Methods for Count Data.” Spatial Statistics 14: 412–38. https://doi.org/10.1016/j.spasta.2015.07.005.

Gilley, Otis W., and R. Kelley Pace. 1996. “On the Harrison and Rubinfeld Data.” Journal of Environmental Economics and Management 31 (3): 403–5. http://ideas.repec.org/a/eee/jeeman/v31y1996i3p403-405.html.

Gotway, C. A., and L. J. Young. 2002. “Combining Incompatible Spatial Data.” Journal of the American Statistical Association 97: 632–48.

Harrison, David, and Daniel L. Rubinfeld. 1978. “Hedonic Housing Prices and the Demand for Clean Air.” Journal of Environmental Economics and Management 5: 81–102.

Ingram, G. K., and G. R. Fauth. 1974. TASSIM: A Transportation and Air Shed SIMulation Model, Volume 1. Case Study of the Boston Region. Department of City; Regional Planning, Harvard University.

Ingram, G. K., G. R. Fauth, and E. A. Kroch. 1974. TASSIM: A Transportation and Air Shed SIMulation Model, Volume 2: Program User’s Guide. Department of City; Regional Planning, Harvard University.

O’Sullivan, D., and D. J. Unwin. 2003. Geographical Information Analysis. Hoboken, NJ: Wiley.

Pace, R. Kelley, and O.W. Gilley. 1997. “Using the Spatial Configuration of the Data to Improve Estimation.” Journal of the Real Estate Finance and Economics 14: 333–40.

Pebesma, Edzer, Thomas Mailund, and James Hiebert. 2016. “Measurement Units in R.” The R Journal 8 (2): 486–94. https://doi.org/10.32614/RJ-2016-061.

Scheider, Simon, Benedikt Gräler, Edzer Pebesma, and Christoph Stasch. 2016. “Modeling Spatiotemporal Information Generation.” International Journal of Geographical Information Science 30 (10): 1980–2008. https://doi.org/10.1080/13658816.2016.1151520.

Stasch, Christoph, Simon Scheider, Edzer Pebesma, and Werner Kuhn. 2014. “Meaningful Spatial Prediction and Aggregation.” Environmental Modelling & Software 51: 149–65. https://doi.org/10.1016/j.envsoft.2013.09.006.

Wakefield, J. C., and H. Lyons. 2010. “Spatial Aggregation and the Ecological Fallacy.” In Handbook of Spatial Statistics, edited by Alan E. Gelfand, Peter Diggle, Peter Guttorp, and Montserrat Fuentes, 541–58. Boca Raton: Chapman & Hall/CRC.

ECS530: (II) Import/export of spatial data

Roger Bivand

Monday 2 December 2019, 13:15-15.00, aud. C

Copyright

Required current contributed CRAN packages:

Script

Schedule

Session II

Support

Boston housing data set example

House value data

Air pollution data

Aggregating geometries to model output zones

Air pollution data

Other independent variables

Towns and TASSIM zones

GEOS, topology operations

Broad Street Cholera Data

Recent challenges

Input/output

Tiled representations