Required current contributed CRAN packages:

I am running R 3.6.1, with recent update.packages().

needed <- c("raster", "stars", "abind", "elevatr", "sp", "mapview", "sf", "osmdata")

Script

Script and data at https://github.com/rsbivand/ECS530_h19/raw/master/ECS530_I.zip. Download to suitable location, unzip and use as basis.

Schedule

  • 2/12 (I) Spatial data representation, (II) Support+topology, input/output

  • 3/12 (III) Coordinate reference systems, (IV) Visualization

  • 4/12 (V) R/GIS interfaces, project surgery

  • 5/12 (VI) Spatial autocorrelation, (VII) Spatial regression

  • 6/12 (VIII) Interpolation, point processes, project surgery

  • 7/12 Presentations

Outline

Introduction and background: why break stuff (why not)?

Data input/output and representation: movement from legacy sp to standards-compliant sf representation; coordinate reference systems; developments similar to GeoPandas and Shapely in Python

Opportunities in visualization (tmap, mapview), but also challenges when upstream software libraries evolve (PROJ/GDAL)

Spatial weights and measures of autocorrelation: software packages previously using the sp representation may add sf representation or replace sp with sf; spdep can use both for constructing spatial weights

Spatial regression: model estimation and handling split out from spdep to spatialreg; fewer reverse dependencies, quicker changes

Session I

  • 09:15-09:45 Course startup, practical matters, background

  • 09:45-10:30 Vector representation: sf replaces sp, rgeos and rgdal (vector)

  • 10:30-11:00 Raster representation: stars/sf replace sp and rgdal (raster); the raster package remains for now

R-Spatial: getting to here and now

Background

In the early and mid 1990s, those of us who were teaching courses in spatial analysis beyond the direct application of geographical information systems (GIS) found the paucity of software limiting.

In institutions with funding for site licenses for GIS, it was possible to write or share scripts for Arc/Info (in AML), ArcView (in Avenue), or later in Visual Basic for ArcGIS.

If site licenses and associated dongles used in the field were a problem (including students involved in fieldwork in research projects), there were few alternatives, but opportunities were discussed on mailing lists.

From late 1996, the R programming language and environment began to be seen as an alternative for teaching and research involving spatial analysis.

R uses much of the syntax of S, then available commercially as S-Plus, but was and remains free to install, use and extend under the GNU General Public License (GPL).

In addition, it could be installed portably across multiple operating systems, including Windows and Apple macOS.

At about the same time, the S-Plus SpatialStats module was published (Kaluzny et al. 1998), and a meeting was held in Leicester in which many of those looking for solutions took part.

Much of the porting of S code to R for spatial statistics was begun by Albrecht Gebhardt as soon as the R package mechanism matured. Since teachers moving courses from S to R needed access to the S libraries previously used, porting was a crucial step.

CRAN listings show tripack (Renka and Gebhardt 2016) and akima (Akima and Gebhardt 2016) - both with non-open source licenses - available from August 1998 ported by Albrecht Gebhardt; ash and sgeostat (Majure and Gebhardt 2016) followed in April 1999.

The spatial package was available as part of MASS (Venables and Ripley 2002), also ported in part by Albrecht Gebhardt.

In the earliest period, CRAN administrators helped practically with porting and publication.

Albrecht and I presented an overview of the possibilities of using R for research and teaching in spatial analysis and statistics in August 1998 (Bivand and Gebhardt 2000).

The S-PLUS version of splancs provided point pattern analysis (Rowlingson and Diggle 1993, 2017).

I had contacted Barry Rowlingson in 1997 but only moved forward with porting as R’s ability to load shared objects advanced.

In September 1998, I wrote to him: “It wasn’t at all difficult to get things running, which I think is a result of your coding, thank you!”

However, I added this speculation: “An issue I have thought about a little is whether at some stage Albrecht and I wouldn’t integrate or harmonize the points and pairs objects in splancs, spatial and sgeostat - they aren’t the same, but for users maybe they ought to appear to be so”.

This concern with class representations for geographical data turned out to be fruitful.

A further step was to link GRASS and R (Bivand 2000), followed up at several meetings and through close collaboration with Markus Neteler.

The interface has evolved, and its almost current status is presented by Lovelace, Nowosad, and Muenchow (2019); we return to its current status below.

A consequence of this work was that the CRAN team suggested that I attend a meeting in Vienna in early 2001 to talk about the GRASS GIS interface.

The meeting gave unique insights into the dynamics of R development, and very valuable contacts.

Later the same year Luc Anselin and Serge Rey asked me to take part in a workshop in Santa Barbara, which again led to many fruitful new contacts (Bivand 2006).

Further progress was made in spatial econometrics (Bivand 2002).

2003 Vienna workshop

During the second half of 2002, it seemed relevant to propose a spatial statistics paper session at the next Vienna meeting to be held in March 2003, together with a workshop to discuss classes for spatial data.

I had reached out to Edzer Pebesma as an author of the stand-alone open source program gstat (Pebesma and Wesseling 1998); it turned out that he had just been approached to wrap the program for S-Plus.

He saw the potential of the workshop immediately, and in November 2002 wrote in an email: “I wonder whether I should start writing S classes. I’m afraid I should.”

Virgilio Gómez-Rubio had been developing two spatial packages, RArcInfo (V. Gómez-Rubio and López-Quílez 2005; Gómez-Rubio 2011) and DCluster (V. Gómez-Rubio, Ferrándiz-Ferragud, and Lopez-Quílez 2005; Gómez-Rubio, Ferrándiz-Ferragud, and López-Quílez 2015), and was committed to participating.

Although he could not get to the workshop, Nicholas Lewin-Koh wrote in March 2003 that: “I was looking over all the DSC material, especially the spatial stuff. I did notice, after looking through peoples’ packages that there is a lot of duplication of effort. My suggestion is that we set up a repository for spatial packages similar to the Bioconductor mode, where we have a base spatial package that has S-4 based methods and classes that are efficient and general.”

Straight after the workshop, a collaborative repository for the development of software using SourceForge was established, and the R-sig-geo mailing list (still with over 3,500 subscribers) was created to facilitate interaction.

Beginnings of sp

So the mandate for the development of the sp package emerged in discussions between interested contributors before, during, and especially following the 2003 Vienna workshop.

Coding meetings were organized by Barry Rowlingson in Lancaster in November 2004 and by Virgilio Gómez-Rubio in Valencia in May 2005, at both of which the class definitions and implementations were stress-tested and often changed radically; the package was first published on CRAN in April 2005.

The underlying model adopted was for S4 (new-style) classes to be used, for "Spatial" objects, whether raster or vector, to behave like "data.frame" objects, and for visualization methods to make it easy to show the objects.
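As a minimal sketch of that design (illustrative, not taken from the original material), an sp object can be built from a coordinate matrix and a data frame, and then queried much like a data.frame:

library(sp)
# three points in the plane with one attribute column
xy <- cbind(x = c(1, 2, 3), y = c(3, 4, 5))
spdf <- SpatialPointsDataFrame(xy, data = data.frame(v = c("a", "b", "c")))
summary(spdf)   # S4 object with bbox, proj4string and data slots
spdf$v          # column access as for a data.frame
spdf[1:2, ]     # row subsetting as for a data.frame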

Relationships with other packages

From an early point in time, object conversion (known as coercion in S and R) was provided to and from sp classes and classes in, for example, the spatstat package (Baddeley and Turner 2005; Baddeley, Rubak, and Turner 2015; Baddeley, Turner, and Rubak 2019).

Packages could choose whether they would use sp classes and methods directly, or rather use those classes for functionality that they did not provide themselves through coercion.
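As a small sketch of such coercion (assuming the coercion methods registered by maptools between sp and spatstat classes), sp point objects can be passed to spatstat and back:

library(sp)
library(spatstat)
library(maptools)   # registers coercion methods between sp and spatstat classes
set.seed(1)
pts <- SpatialPoints(cbind(runif(10), runif(10)))
pp <- as(pts, "ppp")               # sp -> spatstat point pattern
summary(pp)
back <- as(pp, "SpatialPoints")    # and back to sp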

Reading and writing ESRI Shapefiles had been possible using the maptools package (Bivand and Lewin-Koh 2019) available from CRAN since August 2003, but rgdal, on CRAN from November 2003, initially only supported raster data read and written using the external GDAL library (Warmerdam 2008).

Further code contributions by Barry Rowlingson for handling projections using the external PROJ.4 library and the vector drivers in the then OGR part of GDAL were folded into rgdal, permitting reading into sp-objects and writing from sp-objects of vector and raster data.

Completing the sp-verse

For vector data it became possible to project coordinates, and in addition to transform them where datum specifications were available.

Until recently, the interfaces to external libraries GDAL and PROJ have been relatively stable, and upstream changes have not led to breaking changes for users of packages using sp classes or rgdal functionalities, although they have involved significant maintenance effort.

The final part of the framework for spatial vector data handling was the addition of the rgeos package interfacing the external GEOS library in 2011, thanks to Colin Rundel’s 2010 Google Summer of Code project.

The rgeos package provided vector topological predicates and operations typically found in GIS such as intersection; note that by this time, both GDAL and GEOS used the Simple Features vector representation internally.

ASDAR first edition

By the publication of ASDAR (Bivand, Pebesma, and Gomez-Rubio 2008), a few packages not written or maintained by the book authors and their nearest collaborators had begun to use sp classes. By the publication of the second edition (Bivand, Pebesma, and Gomez-Rubio 2013), we had seen that the number of packages depending on sp, importing from and suggesting it (in CRAN terminology for levels of dependency) had grown strongly. In late 2014, de Vries (2014) looked at CRAN package clusters from a page rank graph, and found a clear spatial cluster that we had not expected. This cluster is from late November 2019:

Spatial data

Spatial data typically combine position data in 2D (or 3D), attribute data and metadata related to the position data. Much spatial data could be called map data or GIS data. We have collected and handled much more position data since global navigation satellite systems (GNSS) like GPS came on stream 20 years ago; earth observation satellites have been providing data for longer.

suppressPackageStartupMessages(library(osmdata))
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.2, PROJ 6.2.1
# Overpass query for the Bergen bounding box
bbox <- opq(bbox = 'bergen norway')
# light rail and tram lines as "sf" objects with LINESTRING geometries
byb0 <- osmdata_sf(add_osm_feature(bbox, key = 'railway',
  value = 'light_rail'))$osm_lines
tram <- osmdata_sf(add_osm_feature(bbox, key = 'railway',
  value = 'tram'))$osm_lines
byb1 <- tram[!is.na(tram$name),]
# keep the columns shared by both objects, bind and cache the result
o <- intersect(names(byb0), names(byb1))
byb <- rbind(byb0[,o], byb1[,o])
saveRDS(byb, file="byb.rds")

Spatial vector data is based on points, from which other geometries are constructed. Vector data is often also termed object-based spatial data. The light rail tracks are 2D vector data. The points themselves are stored as double precision floating point numbers, typically without recorded measures of accuracy (GNSS provides a measure of accuracy). Here, lines are constructed from points.
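As a small illustration (not part of the downloaded data), a single line geometry can be built directly from a matrix of point coordinates:

library(sf)
# three points in geographical coordinates, joined into one LINESTRING
pts <- rbind(c(5.33, 60.39), c(5.34, 60.38), c(5.35, 60.37))
ln <- st_linestring(pts)
st_as_text(ln)   # WKT representation of the line built from the points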

library(mapview)
mapview(byb)

Data handling

We can download monthly CSV files of city bike use, and manipulate the input to let us use the stplanr package to aggregate origin-destination data. One destination is in Oslo, some are round trips, but otherwise things are OK. We can then use line2route() to route the volumes onto cycle paths via the CycleStreets API, which requires an API key. We’d still need to aggregate the bike traffic by cycle path segment for completeness.

bike_fls <- list.files("bbs")
trips0 <- NULL
# read and stack the monthly CSV files
for (fl in bike_fls) trips0 <- rbind(trips0,
  read.csv(file.path("bbs", fl), header=TRUE))
# drop trips outside the Bergen area (one destination is in Oslo)
trips0 <- trips0[trips0[, 8] < 6 & trips0[, 13] < 6,]
trips <- cbind(trips0[,c(1, 4, 2, 9)], data.frame(count=1))
# unique start and end stations, with prefixes stripped from column names
from <- unique(trips0[,c(4,5,7,8)])
names(from) <- substring(names(from), 7)
to <- unique(trips0[,c(9,10,12,13)])
names(to) <- substring(names(to), 5)
stations0 <- st_as_sf(merge(from, to, all=TRUE),
  coords=c("station_longitude", "station_latitude"))
stations <- aggregate(stations0, list(stations0$station_id),
  head, n=1)
suppressWarnings(stations <- st_cast(stations, "POINT"))
st_crs(stations) <- 4326
# aggregate trip counts by origin-destination pair, then drop round trips
od <- aggregate(trips[,-(1:4)], list(trips$start_station_id,
  trips$end_station_id), sum)
od <- od[-(which(od[,1] == od[,2])),]
library(stplanr)
od_lines <- od2line(flow=od, zones=stations, zone_code="Group.1",
  origin_code="Group.1", dest_code="Group.2")
saveRDS(od_lines, "od_lines.rds")
# route each origin-destination line along the cycle network
# (requires a CycleStreets API key)
Sys.setenv(CYCLESTREET="XxXxXxXxXxXxXxXx")
od_routes <- line2route(od_lines, "route_cyclestreet",
  plan = "fastest")
saveRDS(od_routes, "od_routes.rds")

Origin-destination lines

Routed lines along cycle routes

Advancing from the sp representation

Representing spatial vector data in R (sp)

The sp package was a child of its time, using S4 formal classes, and the best compromise we then had for positional representation (not arc-node, but with holes in polygons hard to handle). If we coerce byb to the sp representation, we see the formal class structure. Input/output used OGR/GDAL vector drivers in the rgdal package, and topological operations used GEOS in the rgeos package.

library(sp)
byb_sp <- as(byb, "Spatial")
str(byb_sp, max.level=2)
## Formal class 'SpatialLinesDataFrame' [package "sp"] with 4 slots
##   ..@ data       :'data.frame':  189 obs. of  10 variables:
##   ..@ lines      :List of 189
##   ..@ bbox       : num [1:2, 1:2] 5.23 60.29 5.36 60.39
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
str(slot(byb_sp, "lines")[[1]])
## Formal class 'Lines' [package "sp"] with 2 slots
##   ..@ Lines:List of 1
##   .. ..$ :Formal class 'Line' [package "sp"] with 1 slot
##   .. .. .. ..@ coords: num [1:14, 1:2] 5.33 5.33 5.33 5.33 5.33 ...
##   .. .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. .. ..$ : chr [1:14] "4870663682" "331531217" "331531216" "331531215" ...
##   .. .. .. .. .. ..$ : chr [1:2] "lon" "lat"
##   ..@ ID   : chr "30098967"

Raster data

Spatial raster data is observed using rectangular (often square) cells, within which attribute data are observed. Raster data are very rarely object-based, very often they are field-based and could have been observed everywhere. We probably do not know where within the raster cell the observed value is correct; all we know is that at the chosen resolution, this is the value representing the whole cell area.
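A minimal sketch (illustrative only) of such a field-based object: a small raster of square cells, each holding one value that stands for the whole cell area:

library(raster)
# a 4 x 4 raster of square cells covering a 4 x 4 unit extent
r <- raster(nrows = 4, ncols = 4, xmn = 0, xmx = 4, ymn = 0, ymx = 4)
values(r) <- runif(ncell(r))
res(r)   # the chosen resolution: each cell is 1 unit by 1 unit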

library(elevatr)
elevation <- get_elev_raster(byb_sp, z = 10)
is.na(elevation) <- elevation < 1
saveRDS(elevation, file="elevation.rds")
elevation <- readRDS("elevation.rds")
str(elevation, max.level=2)
## Formal class 'RasterLayer' [package "raster"] with 12 slots
##   ..@ file    :Formal class '.RasterFile' [package "raster"] with 13 slots
##   ..@ data    :Formal class '.SingleLayerData' [package "raster"] with 13 slots
##   ..@ legend  :Formal class '.RasterLegend' [package "raster"] with 5 slots
##   ..@ title   : chr(0) 
##   ..@ extent  :Formal class 'Extent' [package "raster"] with 4 slots
##   ..@ rotated : logi FALSE
##   ..@ rotation:Formal class '.Rotation' [package "raster"] with 2 slots
##   ..@ ncols   : int 1545
##   ..@ nrows   : int 1043
##   ..@ crs     :Formal class 'CRS' [package "sp"] with 1 slot
##   ..@ history : list()
##   ..@ z       : list()
str(as(elevation, "SpatialGridDataFrame"), max.level=2)
## Formal class 'SpatialGridDataFrame' [package "sp"] with 4 slots
##   ..@ data       :'data.frame':  1611435 obs. of  1 variable:
##   ..@ grid       :Formal class 'GridTopology' [package "sp"] with 3 slots
##   ..@ bbox       : num [1:2, 1:2] 4.92 60.06 5.98 60.42
##   .. ..- attr(*, "dimnames")=List of 2
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
mapview(elevation, col=topo.colors)
## Warning in rasterCheckSize(x, maxpixels = maxpixels): maximum number of pixels for Raster* viewing is 5e+05 ; 
## the supplied Raster* has 1611435 
##  ... decreasing Raster* resolution to 5e+05 pixels
##  to view full resolution set 'maxpixels =  1611435 '
## Warning in showSRID(uprojargs, format = "PROJ", multiline = "NO"): Discarded
## datum WGS_1984 in CRS definition
## Warning in rgdal::rawTransform(projfrom, projto, nrow(xy), xy[, 1], xy[, : NOT
## UPDATED FOR PROJ >= 6

Raster data

The raster package complemented sp for handling raster objects and their interactions with vector objects.

It added input/output using GDAL through rgdal, and better access to NetCDF files where GDAL had been built without the relevant drivers.

It may be mentioned in passing that, thanks to help from CRAN administrators and especially Brian Ripley, CRAN binary builds of rgdal for Windows and Apple Mac OS X became available from 2006, but with a limited set of vector and raster drivers.

Support from CRAN administrators remains central to making packages available to users who are not able to install R source packages themselves, particularly packages linking to external libraries.

Initially, raster was written in R using functionalities in sp and rgdal with rgeos coming later.

It used a feature of GDAL raster drivers permitting the successive reading of subsets of rasters by row and column, allowing the processing of much larger objects than can be held in memory.

In addition, the concepts of bricks and stacks of rasters were introduced, diverging somewhat from the sp treatment of raster bands as stacked columns as vectors in a data frame.
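A hedged sketch of this block-wise access, assuming a large GeoTIFF file big.tif is available locally (the file name is only illustrative):

library(raster)
r <- raster("big.tif")   # only the header is read; values stay on disk
inMemory(r)              # FALSE for a file-backed raster
bs <- blockSize(r)       # suggested blocks of rows for chunked processing
v <- getValues(r, row = bs$row[1], nrows = bs$nrows[1])  # read one block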

Questions arose

As raster evolved, two other packages emerged that raised issues with the ways in which spatial objects had been conceptualized in sp.

The rgeos package used the C application programming interface (API) to the C++ GEOS library, which is itself a translation of the Java Topology Suite (JTS).

While the GDAL vector drivers did use the standard Simple Features representation of vector geometries, it was not strongly enforced.

This laxity now seems most closely associated with the use of ESRI Shapefiles as a de-facto file standard for representation, in which many Simple Features are not consistently representable.

Need for vector standards compliance

Both JTS and GEOS required a Simple Features compliant representation, and this led to the need for curious and fragile adaptations.

For example, these affected the representation of sp "Polygons" objects, which were originally conceptualized after the Shapefile specification: ring direction determined whether a ring was exterior or interior (a hole), but no guidance was given to show which exterior ring a hole might belong to.

As R provides a way to add a character string comment to any object, comments were added to each "Polygons" object encoding the necessary information.

In this way, GEOS functionality could be used, but the fragility of vector representation in sp was made very obvious.
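A sketch of how such a comment looks, assuming rgeos is available to create it (the geometry here is purely illustrative):

library(sp)
library(rgeos)
# one exterior ring with a single hole
outer <- Polygon(cbind(c(0, 0, 3, 3, 0), c(0, 3, 3, 0, 0)))
hole <- Polygon(cbind(c(1, 1, 2, 2, 1), c(1, 2, 2, 1, 1)), hole = TRUE)
pls <- SpatialPolygons(list(Polygons(list(outer, hole), ID = "1")))
pls <- createSPComment(pls)           # add the comment encoding ring ownership
comment(slot(pls, "polygons")[[1]])   # "0 1": ring 2 is a hole of exterior ring 1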

Spatio-temporal data

Another package affecting thinking about representation was spacetime, as it diverged from raster by stacking vectors for regular spatio-temporal objects with space varying faster than time.

So a single earth observation band observed repeatedly would be stored in a single vector in a data frame, rather than in the arguably more robust form of a four-dimensional array, with the band taking one position on the final dimension.
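A small sketch of the difference (illustrative only): a single band observed at two time points over a 3 by 4 grid can be stored long, with space varying fastest, or as an array with the band on one dimension:

set.seed(1)
vals <- rnorm(3 * 4 * 2)
# long form: one column, cells stacked within time points
long <- data.frame(cell = rep(1:12, times = 2), time = rep(1:2, each = 12),
  band1 = vals)
# array form: x, y, band, time
arr <- array(vals, dim = c(4, 3, 1, 2))
all.equal(as.vector(arr), long$band1)   # the same values, reorganized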

The second edition (Bivand, Pebesma, and Gomez-Rubio 2013) took up all of these issues in one way or another, but after completing a spatial statistics special issue of the Journal of Statistical Software (Pebesma, Bivand, and Ribeiro 2015), it was time to begin fresh implementations of classes for spatial data.

Simple Features in R

It was clear that vector representations needed urgent attention, so the sf package was begun, aiming to implement the most frequently used parts of the specification (ISO 2004; Kralidis 2008; Herring 2011).

Development was supported by a grant from the then newly started R Consortium, which brings together R developers and industry members.

A key breakthrough came at the useR! 2016 conference, following an earlier decision to re-base vector objects on data frames, rather than as in sp to embed a data frame inside a collection of spatial features of the same kind.

However, although data frame objects in S and R have always been able to take list columns as valid columns, such list columns were not seen as “tidy” (Wickham 2014).

Refresher: data frame objects

First, let us see what is behind the data.frame object: the list object. list objects are vectors that contain other objects, which can be addressed by name or by 1-based indices. Like the vectors we have already met, lists can be accessed and manipulated using square brackets []. Single list elements can be accessed and manipulated using double square brackets [[]].

Starting with four vectors of differing types, we can assemble a list object; as we see, its structure is quite simple. The vectors in the list may vary in length, and lists can (and often do) include lists.

V1 <- 1:3
V2 <- letters[1:3]
V3 <- sqrt(V1)
V4 <- sqrt(as.complex(-V1))
L <- list(v1=V1, v2=V2, v3=V3, v4=V4)
str(L)
## List of 4
##  $ v1: int [1:3] 1 2 3
##  $ v2: chr [1:3] "a" "b" "c"
##  $ v3: num [1:3] 1 1.41 1.73
##  $ v4: cplx [1:3] 0+1i 0+1.41i 0+1.73i
L$v3[2]
## [1] 1.414214
L[[3]][2]
## [1] 1.414214

Our list object contains four vectors of different types but of the same length; conversion to a data.frame is convenient. Note that by default strings are converted into factors:

DF <- as.data.frame(L)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: Factor w/ 3 levels "a","b","c": 1 2 3
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i
DF <- as.data.frame(L, stringsAsFactors=FALSE)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: chr  "a" "b" "c"
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i

We can also provoke an error in conversion from a valid list made up of vectors of different length to a data.frame:

V2a <- letters[1:4]
V4a <- factor(V2a)
La <- list(v1=V1, v2=V2a, v3=V3, v4=V4a)
DFa <- try(as.data.frame(La, stringsAsFactors=FALSE), silent=TRUE)
message(DFa)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
##   arguments imply differing number of rows: 3, 4

We can access data.frame elements as list elements, where the $ is effectively the same as [[]] with the list component name as a string:

DF$v3[2]
## [1] 1.414214
DF[[3]][2]
## [1] 1.414214
DF[["v3"]][2]
## [1] 1.414214

Since a data.frame is a rectangular object with named columns with equal numbers of rows, it can also be indexed like a matrix, where the rows are the first index and the columns (variables) the second:

DF[2, 3]
## [1] 1.414214
DF[2, "v3"]
## [1] 1.414214
str(DF[2, 3])
##  num 1.41
str(DF[2, 3, drop=FALSE])
## 'data.frame':    1 obs. of  1 variable:
##  $ v3: num 1.41

If we coerce a data.frame containing a character vector or factor into a matrix, we get a character matrix; if we extract an integer and a numeric column, we get a numeric matrix.

as.matrix(DF)
##      v1  v2  v3         v4           
## [1,] "1" "a" "1.000000" "0+1.000000i"
## [2,] "2" "b" "1.414214" "0+1.414214i"
## [3,] "3" "c" "1.732051" "0+1.732051i"
as.matrix(DF[,c(1,3)])
##      v1       v3
## [1,]  1 1.000000
## [2,]  2 1.414214
## [3,]  3 1.732051

The fact that data.frame objects descend from list objects is shown by looking at their lengths; the length of a matrix is not its number of columns, but its element count:

length(L)
## [1] 4
length(DF)
## [1] 4
length(as.matrix(DF))
## [1] 12

There are dim methods for data.frame objects and matrices (and arrays with more than two dimensions); matrices and arrays are seen as vectors with dimensions; list objects have no dimensions:

dim(L)
## NULL
dim(DF)
## [1] 3 4
dim(as.matrix(DF))
## [1] 3 4
str(as.matrix(DF))
##  chr [1:3, 1:4] "1" "2" "3" "a" "b" "c" "1.000000" "1.414214" "1.732051" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "v1" "v2" "v3" "v4"

data.frame objects have names and row.names, matrices have dimnames, colnames and rownames; all can be used for setting new values:

row.names(DF)
## [1] "1" "2" "3"
names(DF)
## [1] "v1" "v2" "v3" "v4"
names(DF) <- LETTERS[1:4]
names(DF)
## [1] "A" "B" "C" "D"
str(dimnames(as.matrix(DF)))
## List of 2
##  $ : NULL
##  $ : chr [1:4] "A" "B" "C" "D"

R objects have attributes that are not normally displayed, but which show their structure and class (if any); we can see that data.frame objects are quite different internally from matrices:

str(attributes(DF))
## List of 3
##  $ names    : chr [1:4] "A" "B" "C" "D"
##  $ class    : chr "data.frame"
##  $ row.names: int [1:3] 1 2 3
str(attributes(as.matrix(DF)))
## List of 2
##  $ dim     : int [1:2] 3 4
##  $ dimnames:List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "A" "B" "C" "D"

If the reason for different vector lengths was that one or more observations are missing on that variable, NA should be used; the lengths are then equal, and a rectangular table can be created:

V1a <- c(V1, NA)
V3a <- sqrt(V1a)
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a)
DFa <- as.data.frame(La, stringsAsFactors=FALSE)
str(DFa)
## 'data.frame':    4 obs. of  4 variables:
##  $ v1: int  1 2 3 NA
##  $ v2: chr  "a" "b" "c" "d"
##  $ v3: num  1 1.41 1.73 NA
##  $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4

Tidy list columns

DF$E <- list(d=1, e="1", f=TRUE)
str(DF)
## 'data.frame':    3 obs. of  5 variables:
##  $ A: int  1 2 3
##  $ B: chr  "a" "b" "c"
##  $ C: num  1 1.41 1.73
##  $ D: cplx  0+1i 0+1.41i 0+1.73i
##  $ E:List of 3
##   ..$ d: num 1
##   ..$ e: chr "1"
##   ..$ f: logi TRUE

At useR! in 2016, list columns were declared “tidy”, using examples including the difficulty of encoding polygon interior rings in non-list columns. The decision to accommodate “tidy” workflows as well as base-R workflows had already been made, as at least some users only know how to use “tidy” workflows.

sf begins

Pebesma (2018) shows the status of sf towards the end of 2017, with a geometry list column containing R wrappers around objects adhering to Simple Features specification definitions. The feature geometries are stored in numeric vectors, matrices, or lists of matrices, and may also be subject to arithmetic operations. Features are held in the "XY" class if two-dimensional, or "XYZ", "XYM" or "XYZM" if such coordinates are available; all single features are "sfg" (Simple Feature geometry) objects:

pt1 <- st_point(c(1,3))
pt2 <- pt1 + 1
pt3 <- pt2 + 1
str(pt3)
##  'XY' num [1:2] 3 5

Geometries may be represented as “Well Known Text” (WKT):

st_as_text(pt3)
## [1] "POINT (3 5)"

or as “Well Known Binary” (WKB) as in databases’ “binary large objects” (BLOBs), resolving the problem of representation when working with GDAL vector drivers and functions, and with GEOS predicates and topological operations:

st_as_binary(pt3)
##  [1] 01 01 00 00 00 00 00 00 00 00 00 08 40 00 00 00 00 00 00 14 40

A column of simple feature geometries ("sfc") is constructed as a list of "sfg" objects, which do not have to belong to the same Simple Features category:

pt_sfc <- st_as_sfc(list(pt1, pt2, pt3))
str(pt_sfc)
## sfc_POINT of length 3; first list element:  'XY' num [1:2] 1 3

Finally, an "sfc" object, a geometry column, can be added to a data.frame object using st_geometry(), which sets a number of attributes on the object and defines it as also being an "sf" object (the "agr" attribute, if populated, shows how observations on non-geometry columns should be understood):

st_geometry(DF) <- pt_sfc
(DF)
## Simple feature collection with 3 features and 5 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1 ymin: 3 xmax: 3 ymax: 5
## epsg (SRID):    NA
## proj4string:    NA
##   A B        C           D    E    geometry
## 1 1 a 1.000000 0+1.000000i    1 POINT (1 3)
## 2 2 b 1.414214 0+1.414214i    1 POINT (2 4)
## 3 3 c 1.732051 0+1.732051i TRUE POINT (3 5)

The sf package does not implement all of the Simple Features geometry categories, but geometries may be converted to the chosen subset, using for example the gdal_utils() function with util="ogr2ogr", options="-nlt CONVERT_TO_LINEAR" to convert curve geometries in an input file to linear geometries.
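A hedged sketch of such a conversion, assuming an input file curves.gml containing curve geometries (both file names are only illustrative); "vectortranslate" is the gdal_utils() name for GDAL's ogr2ogr utility:

library(sf)
# copy the data source, converting curve geometries to linear ones
gdal_utils(util = "vectortranslate", source = "curves.gml",
  destination = "linear.gpkg",
  options = c("-nlt", "CONVERT_TO_LINEAR"))
st_geometry_type(st_read("linear.gpkg", quiet = TRUE))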

Many of the functions in the sf package begin with st_ as a reference to the same usage in PostGIS, where the letters were intended to symbolise space and time, but where time has not yet been implemented.

sf also integrates GEOS topological predicates and operations into the same framework, replacing the rgeos package for access to GEOS functionality. The precision and scale defaults differ between sf and rgeos slightly; both remain fragile with respect to invalid geometries, of which there are many in circulation.

(buf_DF <- st_buffer(DF, dist=0.3))
## Simple feature collection with 3 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 0.7 ymin: 2.7 xmax: 3.3 ymax: 5.3
## epsg (SRID):    NA
## proj4string:    NA
##   A B        C           D    E                       geometry
## 1 1 a 1.000000 0+1.000000i    1 POLYGON ((1.3 3, 1.299589 2...
## 2 2 b 1.414214 0+1.414214i    1 POLYGON ((2.3 4, 2.299589 3...
## 3 3 c 1.732051 0+1.732051i TRUE POLYGON ((3.3 5, 3.299589 4...
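As a further illustration, GEOS binary predicates can be applied to the points and their buffers within the same framework:

st_intersects(DF, buf_DF, sparse = FALSE)   # each point lies in its own buffer
st_contains(buf_DF, DF, sparse = FALSE)     # and each buffer contains its point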

Raster representations: stars

Like sf, stars was supported by an R Consortium grant, for scalable, spatio-temporal tidy arrays for R.

Spatio-temporal arrays were seen as an alternative way of representing multivariate spatio-temporal data from the choices made in the spacetime package, where a two-dimensional data frame contained stacked positions within stacked time points or intervals.

The proposed arrays might collapse to a raster layer if only one variable was chosen for one time point or interval.

More important, the development of the package was extended to accommodate a backend for earth data processing in which the data are retrieved and rescaled as needed from servers, most often cloud-based servers.

This example only covers a multivariate raster taken from a Landsat 7 view of a small part of the Brazilian coast. In the first part, a GeoTIFF file is read into memory, using three array dimensions, two in planar space, the third across six bands:

library(stars)
## Loading required package: abind
fn <- system.file("tif/L7_ETMs.tif", package = "stars")
L7 <- read_stars(fn)
L7
## stars object with 3 dimensions and 1 attribute
## attribute(s):
##   L7_ETMs.tif    
##  Min.   :  1.00  
##  1st Qu.: 54.00  
##  Median : 69.00  
##  Mean   : 68.91  
##  3rd Qu.: 86.00  
##  Max.   :255.00  
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

The bands can be operated on arithmetically, for example to generate a new object containing values of the normalized difference vegetation index through a function applied across the x and y spatial dimensions:

ndvi <- function(x) (x[4] - x[3])/(x[4] + x[3])
(s2.ndvi <- st_apply(L7, c("x", "y"), ndvi))
## stars object with 2 dimensions and 1 attribute
## attribute(s):
##      ndvi          
##  Min.   :-0.75342  
##  1st Qu.:-0.20301  
##  Median :-0.06870  
##  Mean   :-0.06432  
##  3rd Qu.: 0.18667  
##  Max.   : 0.58667  
## dimension(s):
##   from  to  offset delta                       refsys point values    
## x    1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y    1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]

The same file can also be accessed using the proxy mechanism, which creates a link to the external entity, here a file:

L7p <- read_stars(fn, proxy=TRUE)
L7p
## stars_proxy object with 1 attribute in file:
## $L7_ETMs.tif
## [1] "/home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif"
## 
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL

The same function can also be applied across the same two spatial dimensions of the array, but no calculation is carried out until the data are needed and the output resolution is known:

(L7p.ndvi = st_apply(L7p, c("x", "y"), ndvi))
## stars_proxy object with 1 attribute in file:
## $L7_ETMs.tif
## [1] "/home/rsb/lib/r_libs/stars/tif/L7_ETMs.tif"
## 
## dimension(s):
##      from  to  offset delta                       refsys point values    
## x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
## band    1   6      NA    NA                           NA    NA   NULL    
## call list:
## [[1]]
## st_apply(X = X, MARGIN = c("x", "y"), FUN = ndvi)

The array object can also be split, here on the band dimension, to yield a representation as six rasters in list form:

(x6 <- split(L7, "band"))
## stars object with 2 dimensions and 6 attributes
## attribute(s):
##       X1               X2               X3               X4         
##  Min.   : 47.00   Min.   : 32.00   Min.   : 21.00   Min.   :  9.00  
##  1st Qu.: 67.00   1st Qu.: 55.00   1st Qu.: 49.00   1st Qu.: 52.00  
##  Median : 78.00   Median : 66.00   Median : 63.00   Median : 63.00  
##  Mean   : 79.15   Mean   : 67.57   Mean   : 64.36   Mean   : 59.24  
##  3rd Qu.: 89.00   3rd Qu.: 79.00   3rd Qu.: 77.00   3rd Qu.: 75.00  
##  Max.   :255.00   Max.   :255.00   Max.   :255.00   Max.   :255.00  
##       X5               X6         
##  Min.   :  1.00   Min.   :  1.00  
##  1st Qu.: 63.00   1st Qu.: 32.00  
##  Median : 89.00   Median : 60.00  
##  Mean   : 83.18   Mean   : 59.98  
##  3rd Qu.:112.00   3rd Qu.: 88.00  
##  Max.   :255.00   Max.   :255.00  
## dimension(s):
##   from  to  offset delta                       refsys point values    
## x    1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
## y    1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]

These rasters may also be subjected to arithmetical operations, and as may be seen, explicit arithmetic on the six rasters has the same outcome as applying the same calculation to the three-dimensional array:

x6$mean <- (x6[[1]] + x6[[2]] + x6[[3]] + x6[[4]] + x6[[5]] +
              x6[[6]])/6
xm <- st_apply(L7, c("x", "y"), mean)
all.equal(xm[[1]], x6$mean)
## [1] TRUE

openeo

OpenEO proposes proof-of-concept client-server API approaches. The sample data sets are from the southern Alps, and the project is under development.

if (!dir.exists("output")) dir.create("output")
library(openeo) # v 0.3.1
euracHost = "http://saocompute.eurac.edu/openEO_0_3_0/openeo/"
eurac = connect(host = euracHost,disable_auth = TRUE)
## Connected to host
pgb = eurac %>% pgb()
bb <- list(west=10.98,east=11.25,south=46.58,north=46.76)
tt <- c("2017-01-01T00:00:00Z","2017-01-31T00:00:00Z")
pgb$collection$S2_L2A_T32TPS_20M %>%
  pgb$filter_bbox(extent=bb) %>%
  pgb$filter_daterange(extent=tt) %>%
  pgb$NDVI(red="B04",nir="B8A") %>%
  pgb$min_time() -> task
tf <- "output/preview.tif"
preview <- preview(eurac, task=task, format="GTiff",
  output_file=tf)
plot(read_stars(preview))

gdalcubes

Earth Observation Data Cubes from Satellite Image Collections - an extension of the stars proxy mechanism and the raster out-of-memory approach: (https://github.com/appelmar/gdalcubes_R).

Processing collections of Earth observation images as on-demand multispectral, multitemporal data cubes. Users define cubes by spatiotemporal extent, resolution, and spatial reference system and let ‘gdalcubes’ automatically apply cropping, reprojection, and resampling using the ‘Geospatial Data Abstraction Library’ (‘GDAL’).

Implemented functions on data cubes include reduction over space and time, applying arithmetic expressions on pixel band values, moving window aggregates over time, filtering by space, time, bands, and predicates on pixel values, materializing data cubes as ‘netCDF’ files, and plotting. User-defined ‘R’ functions can be applied over chunks of data cubes. The package implements lazy evaluation and multithreading. See also a five month old blog post.
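A hedged sketch of the workflow, assuming a local directory L2A/ of Sentinel-2 Level-2A archives and following the function names in the gdalcubes documentation (the collection format name, band names and grid parameters here are illustrative only):

library(gdalcubes)
library(magrittr)   # for the %>% pipe
# index the image files into a GDAL-based image collection
col <- create_image_collection(list.files("L2A", full.names = TRUE),
  format = "Sentinel2_L2A")
# define the cube: target grid, 100 m resolution, monthly time steps
v <- cube_view(extent = extent(col, "EPSG:32632"), srs = "EPSG:32632",
  dx = 100, dy = 100, dt = "P1M",
  aggregation = "median", resampling = "bilinear")
raster_cube(col, v) %>%                       # lazy: read chunk-wise on demand
  select_bands(c("B04", "B08")) %>%
  apply_pixel("(B08-B04)/(B08+B04)", "NDVI") %>%
  reduce_time("median(NDVI)") %>%
  plot()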

Akima, Hiroshi, and Albrecht Gebhardt. 2016. akima: Interpolation of Irregularly and Regularly Spaced Data. https://CRAN.R-project.org/package=akima.

Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. London: Chapman; Hall/CRC Press. http://www.crcpress.com/Spatial-Point-Patterns-Methodology-and-Applications-with-R/Baddeley-Rubak-Turner/9781482210200/.

Baddeley, Adrian, and Rolf Turner. 2005. “spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42. http://www.jstatsoft.org/v12/i06/.

Baddeley, Adrian, Rolf Turner, and Ege Rubak. 2019. Spatstat: Spatial Point Pattern Analysis, Model-Fitting, Simulation, Tests. https://CRAN.R-project.org/package=spatstat.

Bivand, R. 2002. “Spatial Econometrics Functions in R: Classes and Methods.” Journal of Geographical Systems 4: 405–21.

———. 2006. “Implementing Spatial Data Analysis Software Tools in R.” Geographical Analysis 38: 23–40.

Bivand, Roger. 2000. “Using the R Statistical Data Analysis Language on GRASS 5.0 GIS Database Files.” Computers & Geosciences 26 (9): 1043–52.

Bivand, Roger, and Albrecht Gebhardt. 2000. “Implementing Functions for Spatial Statistical Analysis Using the R Language.” Journal of Geographical Systems 2 (3): 307–17. https://doi.org/10.1007/PL00011460.

Bivand, Roger, and Nicholas Lewin-Koh. 2019. Maptools: Tools for Handling Spatial Objects. https://CRAN.R-project.org/package=maptools.

Bivand, Roger, Edzer Pebesma, and Virgilio Gomez-Rubio. 2008. Applied Spatial Data Analysis with R. Springer, NY. https://asdar-book.org/.

———. 2013. Applied Spatial Data Analysis with R, Second Edition. Springer, NY. https://asdar-book.org/.

de Vries, Andrie. 2014. “Finding Clusters of CRAN Packages Using Igraph.” https://blog.revolutionanalytics.com/2014/12/finding-clusters-of-cran-packages-using-igraph.html.

Gómez-Rubio, V., J. Ferrándiz-Ferragud, and A. Lopez-Quílez. 2005. “Detecting Clusters of Disease with R.” Journal of Geographical Systems 7 (2): 189–206.

Gómez-Rubio, Virgilio. 2011. RArcInfo: Functions to Import Data from Arc/Info V7.x Binary Coverages. https://CRAN.R-project.org/package=RArcInfo.

Gómez-Rubio, Virgilio, Juan Ferrándiz-Ferragud, and Antonio López-Quílez. 2015. DCluster: Functions for the Detection of Spatial Clusters of Diseases. https://CRAN.R-project.org/package=DCluster.

Gómez-Rubio, V., and A. López-Quílez. 2005. “RArcInfo: Using Gis Data with R.” Computers & Geosciences 31 (8): 1000–1006.

Herring, John R. 2011. “OpenGIS Implementation Standard for Geographic Information-Simple Feature Access-Part 1: Common Architecture.” Open Geospatial Consortium Inc, 111. http://portal.opengeospatial.org/files/?artifact\_id=25355.

ISO. 2004. ISO 19125-1:2004 Geographic Information – Simple Feature Access – Part 1: Common Architecture. https://www.iso.org/standard/40114.html.

Kaluzny, S.P., S.C. Vega, T.P. Cardoso, and A.A. Shelly. 1998. S+SpatialStats. New York, NY: Springer.

Kralidis, Athanasios Tom. 2008. “Geospatial Open Source and Open Standards Convergences.” In Open Source Approaches in Spatial Data Handling, edited by G. B. Hall and M. Leahy, 1–20. Berlin: Springer-Verlag.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. Boca Raton, FL: Chapman and Hall/CRC. https://geocompr.robinlovelace.net/.

Majure, James J., and Albrecht Gebhardt. 2016. sgeostat: An Object-Oriented Framework for Geostatistical Modeling in S+. https://CRAN.R-project.org/package=sgeostat.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

Pebesma, Edzer, Roger Bivand, and Paulo Ribeiro. 2015. “Software for Spatial Statistics.” Journal of Statistical Software, Articles 63 (1): 1–8. https://doi.org/10.18637/jss.v063.i01.

Pebesma, E. J., and C. G. Wesseling. 1998. “Gstat, a Program for Geostatistical Modelling, Prediction and Simulation.” Computers and Geosciences 24: 17–31.

Renka, R. J., and Albrecht Gebhardt. 2016. tripack: Triangulation of Irregularly Spaced Data. https://CRAN.R-project.org/package=tripack.

Rowlingson, Barry, and Peter Diggle. 2017. Splancs: Spatial and Space-Time Point Pattern Analysis. https://CRAN.R-project.org/package=splancs.

Rowlingson, B., and P. J. Diggle. 1993. “Splancs: Spatial Point Pattern Analysis Code in S-Plus.” Computers and Geosciences 19: 627–55.

Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S. Fourth. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS4.

Warmerdam, Frank. 2008. “The Geospatial Data Abstraction Library.” In Open Source Approaches in Spatial Data Handling, edited by G. B. Hall and M. Leahy, 87–104. Berlin: Springer-Verlag.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software, Articles 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.