All the material presented here, to the extent it is original, is available under CC-BY-SA.
I am running R 4.0.5, with packages recently updated using update.packages().
needed <- c("mapview", "mapsf", "tmap", "ggplot2", "lme4", "openxlsx", "bdl", "terra", "stars", "sf", "readxl", "rgugik")
Script and data are at https://github.com/rsbivand/UAM21_I/raw/main/UAM21_I_210506.zip. Download to a suitable location, unzip, and use as a working basis, as sketched below.
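A hedged sketch of fetching and unpacking the materials from within R itself (destination paths are just examples):
# download the course zip and unpack it next to the working directory
download.file("https://github.com/rsbivand/UAM21_I/raw/main/UAM21_I_210506.zip", destfile="UAM21_I_210506.zip", mode="wb")
unzip("UAM21_I_210506.zip", exdir="UAM21_I")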
Time | Topic |
---|---|
Wednesday 5/5 | |
09.00-12.00 | What is R: programming language, community, ecosystem? What may it be used for in analysing spatial data in a social science setting? What are the basic data structures in R? How can we start writing an R markdown notebook? How to access help in using R? How to use built-in data sets and why? How to write reproducible examples? What can we learn from code examples? How can R help us in furthering reproducible research? |
13.00-16.00 | What kinds of data objects are used in R? What is the structure of a data.frame? What is a list object? What kinds of data can be contained in data objects? |
Thursday 6/5 | |
09.00-12.00 | How may we read data into R? From files, including spatial data files, and from online resources? How can we choose between output formats for notebooks and other output media? What are classes and formulae? How can one choose between the basic graphics functions and devices in R? |
13.00-16.00 | When our data include spatial data objects, in which ways may they be represented in R? How can one make simple thematic maps using R? (sf, stars, tmap) |
Monday 10/5 | |
09.00-12.00 | May we use R “like a GIS?” How may we structure temporal and spatio-temporal data? Closer introduction to R-spatial (sf, stars, gdalcubes, terra, GDAL, GEOS) |
13.00-16.00 | Planar and spherical geometries, projections and transformations (s2, PROJ, tmap, mapview, leaflet, geogrid) |
Tuesday 11/5 | |
09.00-12.00 | What forms of expression and colour scales are available in R? How can we use class intervals and colour palettes to communicate? Rather than “lying with maps,” how can we explore the impact of choices made in thematic cartography? How can we condition on continuous or discrete variables to permit visual comparison? How can we combine multiple graphical elements in data visualization? (classInt, sf, tmap, mapsf) |
13.00-16.00 | Doing things with spatial data … (osmdata, …) |
Thursday 20/5 | |
09.00-12.00 | Presentations/consultations/discussion |
13.00-16.00 | Presentations/consultations/discussion |
When you quit an R session, you may be asked whether you want to save the workspace (in your current working directory, as .RData) and/or your session history (in your current working directory, as .Rhistory).
I never save these, and thus avoid the possible problem that starting an R session in that directory loads both the objects and the history into R as it begins.
We’ll get back to saving objects and command history from an R session later. For now we’ll concentrate on getting data into R.
We have already seen that built-in data sets can be used; here we use the voivodeship_names "data.frame" object included in the rgugik package:
library(rgugik)
data(voivodeship_names)
class(voivodeship_names)
## [1] "data.frame"
It contains three character columns: the voivodeship names in Polish and in English, and a two-digit TERC code:
head(voivodeship_names)
## NAME_PL NAME_EN TERC
## 1 dolnośląskie Lower Silesia 02
## 2 kujawsko-pomorskie Kuyavian-Pomeranian 04
## 3 lubelskie Lublin 06
## 4 lubuskie Lubusz 08
## 5 łódzkie Lodz 10
## 6 małopolskie Lesser Poland 12
str(voivodeship_names)
## 'data.frame': 16 obs. of 3 variables:
## $ NAME_PL: chr "dolnośląskie" "kujawsko-pomorskie" "lubelskie" "lubuskie" ...
## $ NAME_EN: chr "Lower Silesia" "Kuyavian-Pomeranian" "Lublin" "Lubusz" ...
## $ TERC : chr "02" "04" "06" "08" ...
Data can be input directly if need be, for example from https://pl.wikipedia.org/wiki/Sejmik_wojew%C3%B3dztwa. The rule explained there is that voivodeships with populations up to 2 million choose 30 councillors, with three more for each further half-million begun. If we create a new "TERC" column, we can then merge our new "data.frame" with the one from rgugik:
TERC <- formatC(seq(2, 32, 2), format="d", width=2, flag="0")
sejmik_size <- c(36, 30, 33, 30, 33, 39, 51, 30, 33, 30, 33, 45, 30, 30, 39, 30)
my_data <- data.frame(TERC=TERC, sejmik_size=sejmik_size)
my_data1 <- merge(voivodeship_names, my_data, by="TERC")
head(my_data1)
## TERC NAME_PL NAME_EN sejmik_size
## 1 02 dolnośląskie Lower Silesia 36
## 2 04 kujawsko-pomorskie Kuyavian-Pomeranian 30
## 3 06 lubelskie Lublin 33
## 4 08 lubuskie Lubusz 30
## 5 10 łódzkie Lodz 33
## 6 12 małopolskie Lesser Poland 39
I also copied and pasted the same Wikipedia table into a spreadsheet, and saved it as a comma-separated values file. Some “CSV” files actually use semicolons as column delimiters, but here it is a comma - let’s check the top of the file:
readLines("../data/copied_data.csv", n=2)
## [1] "Sejmik ,Siedziba ,Liczba radnych "
## [2] "Sejmik Województwa Dolnośląskiego ,Wrocław ,36"
copied_data <- read.csv("../data/copied_data.csv")
str(copied_data)
## 'data.frame': 16 obs. of 3 variables:
## $ Sejmik : chr "Sejmik Województwa Dolnośląskiego " "Sejmik Województwa Kujawsko-Pomorskiego " "Sejmik Województwa Lubelskiego " "Sejmik Województwa Lubuskiego " ...
## $ Siedziba : chr "Wrocław " "Toruń " "Lublin " "Zielona Góra " ...
## $ Liczba.radnych: int 36 30 33 30 33 39 51 30 33 30 ...
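Had the spreadsheet been saved with semicolon delimiters (common with Polish locale settings), read.csv2() would be the drop-in replacement; a sketch with a hypothetical file name:
# read.csv2() assumes ";" as separator and "," as decimal mark
copied_data_semi <- read.csv2("../data/copied_data_semi.csv")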
When reading plain text files, the function used has to work out what kind of data each column contains; base functions read the data twice to check, but the user can override these guesses. It is also possible to use other packages for very large plain text data files, such as data.table::fread() or readr::read_csv(); they read some rows to guess the column types, and fread() is fast because it can utilise multiple cores.
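As a sketch, assuming data.table and readr are installed, either can read the same file:
dt <- data.table::fread("../data/copied_data.csv")  # multi-threaded reader
tb <- readr::read_csv("../data/copied_data.csv")    # returns a tibble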
This object doesn’t have an easy look-up key like "TERC", so we’ll need to match the names carefully, removing case endings from lower-case renderings of the names:
a <- gsub("kie", "", my_data1$NAME_PL)
b <- sub(" ", "", gsub("kiego", "", substring(tolower(copied_data$Sejmik), 20)))
(o <- match(a, b))
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
We can enforce the ordering by indexing the rows by the matches - here this is a no-op:
my_data2 <- cbind(my_data1, copied_data[o,])
str(my_data2)
## 'data.frame': 16 obs. of 7 variables:
## $ TERC : chr "02" "04" "06" "08" ...
## $ NAME_PL : chr "dolnośląskie" "kujawsko-pomorskie" "lubelskie" "lubuskie" ...
## $ NAME_EN : chr "Lower Silesia" "Kuyavian-Pomeranian" "Lublin" "Lubusz" ...
## $ sejmik_size : num 36 30 33 30 33 39 51 30 33 30 ...
## $ Sejmik : chr "Sejmik Województwa Dolnośląskiego " "Sejmik Województwa Kujawsko-Pomorskiego " "Sejmik Województwa Lubelskiego " "Sejmik Województwa Lubuskiego " ...
## $ Siedziba : chr "Wrocław " "Toruń " "Lublin " "Zielona Góra " ...
## $ Liczba.radnych: int 36 30 33 30 33 39 51 30 33 30 ...
However, if we randomise the order of the rows in the "data.frame", we can still get the fish back from the fish soup (we control the randomisation by setting the random number generator seed):
set.seed(1)
zz <- copied_data[sample(nrow(copied_data)),]
head(zz)
## Sejmik Siedziba Liczba.radnych
## 9 Sejmik Województwa Podkarpackiego Rzeszów 33
## 4 Sejmik Województwa Lubuskiego Zielona Góra 30
## 7 Sejmik Województwa Mazowieckiego Warszawa 51
## 1 Sejmik Województwa Dolnośląskiego Wrocław 36
## 2 Sejmik Województwa Kujawsko-Pomorskiego Toruń 30
## 14 Sejmik Województwa Warmińsko-Mazurskiego Olsztyn 30
b1 <- sub(" ", "", gsub("kiego", "", substring(tolower(zz$Sejmik), 20)))
(o1 <- match(a, b1))
## [1] 4 5 8 2 10 13 3 16 1 12 11 7 9 6 14 15
zz1 <- cbind(my_data1, zz[o1,])
head(zz1)
## TERC NAME_PL NAME_EN sejmik_size
## 1 02 dolnośląskie Lower Silesia 36
## 2 04 kujawsko-pomorskie Kuyavian-Pomeranian 30
## 3 06 lubelskie Lublin 33
## 4 08 lubuskie Lubusz 30
## 5 10 łódzkie Lodz 33
## 6 12 małopolskie Lesser Poland 39
## Sejmik Siedziba Liczba.radnych
## 1 Sejmik Województwa Dolnośląskiego Wrocław 36
## 2 Sejmik Województwa Kujawsko-Pomorskiego Toruń 30
## 3 Sejmik Województwa Lubelskiego Lublin 33
## 4 Sejmik Województwa Lubuskiego Zielona Góra 30
## 5 Sejmik Województwa Łódzkiego Łódź 33
## 6 Sejmik Województwa Małopolskiego Kraków 39
all.equal(my_data2$sejmik_size, my_data2$Liczba.radnych)
## [1] TRUE
all.equal(zz1$sejmik_size, zz1$Liczba.radnych)
## [1] TRUE
Sample binary (well, compressed structured text) files include spreadsheet files as often provided by public organisations. Here we are using election data from the 2018 local authority elections, downloaded from: https://wybory2018.pkw.gov.pl/xls/2018-sejmiki.zip. One of the files contains results tabulated by voivodeship, and we can use the readxl package and read_excel() to read the only sheet into a "data.frame"; there are lots of variables, but we’ll make the names syntactically valid first:
library(readxl)
wybory18 <- as.data.frame(read_excel("../data/2018-sejmiki-po-województwach.xlsx"))
names(wybory18) <- make.names(names(wybory18))
str(wybory18)
## 'data.frame': 16 obs. of 52 variables:
## $ Województwo : chr "dolnośląskie" "kujawsko-pomorskie" "lubelskie" "lubuskie" ...
## $ L..okr. : num 5 6 5 5 5 6 7 5 5 4 ...
## $ L..obw. : num 1940 1651 1892 719 1732 ...
## $ Liczba.wyborców.uprawnionych.do.głosowania : num 2275468 1611366 1720230 792247 1977916 ...
## $ Komisja.otrzymała.kart.do.głosowania : num 2055956 1469005 1550056 716272 1892029 ...
## $ Nie.wykorzystano.kart.do.głosowania : num 847260 616501 614504 301674 762433 ...
## $ Liczba.wyborców..którym.wydano.karty.do.głosowania : num 1208354 852231 935291 414530 1128555 ...
## $ Liczba.wyborców.głosujących.przez.pełnomocnika : num 1280 1195 2469 412 1516 ...
## $ Liczba.wyborców..którym.wysłano.pakiety.wyborcze : num 122 58 42 44 64 126 218 20 43 31 ...
## $ Liczba.otrzymanych.kopert.zwrotnych : num 112 54 41 42 61 118 203 20 42 29 ...
## $ Liczba.kopert.zwrotnych..w.których.nie.było.oświadczenia.o.osobistym.i.tajnym.oddaniu.głosu : num 8 9 2 2 2 6 29 0 9 4 ...
## $ Liczba.kopert.zwrotnych..w.których.oświadczenie.nie.było.podpisane : num 1 0 1 0 0 1 1 0 1 0 ...
## $ Liczba.kopert.zwrotnych..w.których.nie.było.koperty.na.kartę.do.głosowania : num 0 0 0 2 1 0 0 0 0 1 ...
## $ Liczba.kopert.zwrotnych..w.których.znajdowała.się.niezaklejona.koperta.na.kartę.do.głosowania : num 3 2 0 0 0 2 4 1 0 0 ...
## $ Liczba.kopert.na.kartę.do.głosowania.wrzuconych.do.urny : num 103 45 39 38 59 110 172 19 34 24 ...
## $ Liczba.kart.wyjętych.z.urny : num 1207902 852139 935069 414548 1127559 ...
## $ w.tym.liczba.kart.wyjętych.z.kopert.na.kartę.do.głosowania : num 103 45 38 38 59 109 169 19 33 24 ...
## $ Liczba.kart.nieważnych : num 1714 810 1076 433 1965 ...
## $ Liczba.kart.ważnych : num 1206188 851329 933993 414115 1125594 ...
## $ Liczba.głosów.nieważnych : num 81371 61322 58892 28757 78478 ...
## $ w.tym.z.powodu.postawienia.znaku..X..obok.nazwiska.dwóch.lub.większej.liczby.kandydatów.z.różnych.list : num 23175 22091 19307 10328 20355 ...
## $ w.tym.z.powodu.niepostawienia.znaku..X..obok.nazwiska.żadnego.kandydata : num 58196 39231 39585 18429 58123 ...
## $ w.tym.z.powodu.postawienia.znaku..X..wyłącznie.obok.nazwiska.kandydata.na.liście..której.rejestracja.została.unieważniona: num 0 0 0 0 0 0 0 0 0 0 ...
## $ Liczba.głosów.ważnych : num 1124817 790007 875101 385358 1047116 ...
## $ KW.PRAWO.I.SPRAWIEDLIWOŚĆ : num 320908 223357 385422 96747 365241 ...
## $ KKW.PLATFORMA.NOWOCZESNA.KOALICJA.OBYWATELSKA : num 289831 273800 159167 115294 297245 ...
## $ KOMITET.WYBORCZY.PSL : num 58820 113012 168921 47919 136743 ...
## $ KKW.SLD.LEWICA.RAZEM : num 61889 71131 51520 36941 66160 ...
## $ KWW.KUKIZ.15 : num 53800 45011 58072 12391 61333 ...
## $ KWW.BEZPARTYJNI.SAMORZĄDOWCY : num 168442 5077 3709 50764 40875 ...
## $ KWW.WOLNOŚĆ.W.SAMORZĄDZIE : num 15499 15627 14156 2132 14435 ...
## $ KW.RAZEM : num 18087 13988 8179 6623 18016 ...
## $ KW.RUCH.NARODOWY.RP : num 14874 4288 11525 6435 12200 ...
## $ KW.PARTIA.ZIELONI : num 19783 9413 1196 10112 14047 ...
## $ KW.WOLNI.I.SOLIDARNI : num 9624 12017 6763 NA 10798 ...
## $ KWW.Z.DUTKIEWICZEM.DLA.DOLNEGO.ŚLĄSKA : num 93260 NA NA NA NA ...
## $ KW..WSPÓLNA.MAŁOPOLSKA. : num NA NA NA NA NA ...
## $ KW.ŚPR : num NA NA NA NA NA ...
## $ KW.ŚLONZOKI.RAZEM : num NA NA NA NA NA NA NA NA NA NA ...
## $ KWW.MNIEJSZOŚĆ.NIEMIECKA : num NA NA NA NA NA ...
## $ KWW.PROJEKT.ŚWIĘTOKRZYSKIE.BOGDANA.WENTY : num NA NA NA NA NA NA NA NA NA NA ...
## $ KWW.JEDNOŚĆ.NARODU...WSPÓLNOTA : num NA NA 6471 NA 4369 ...
## $ KW.ZJEDNOCZENIE.CHRZEŚCIJAŃSKICH.RODZIN : num NA NA NA NA NA ...
## $ KW.INICJATYWA.OBYWATELSKA : num NA NA NA NA NA NA NA NA NA NA ...
## $ KWW.POLSKIE.RODZINY.RAZEM : num NA NA NA NA NA NA NA NA NA NA ...
## $ KWW.SPOZA.SITWY : num NA NA NA NA NA ...
## $ KWW.ISKRA : num NA NA NA NA NA ...
## $ KWW.AKCJA.NARODOWA : num NA NA NA NA 5654 ...
## $ KW.ZWIĄZKU.SŁOWIAŃSKIEGO : num NA 3286 NA NA NA ...
## $ KW.STOWARZYSZENIA.LEX.NATURALIS : num NA NA NA NA NA 2120 NA NA NA NA ...
## $ KWW.AGNIESZKI.JĘDRZEJEWSKIEJ : num NA NA NA NA NA NA NA NA NA NA ...
## $ KW.STRONNICTWA.PRACY : num NA NA NA NA NA NA 761 NA NA NA ...
We can also tell merge() that the matching columns have different names:
my_data_3 <- merge(x=my_data2, y=wybory18, by.x="NAME_PL", by.y="Województwo")
The foreign package and later additions permit the reading of binary files from other (statistical) software into "data.frame" objects; the function or package used may vary with the version of the software. Spreadsheets are probably the most frequently encountered non-plain-text files, but care is needed with multiple sheets and with extra rows of non-tabular information.
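A minimal sketch, assuming an SPSS file is to hand ("survey.sav" is a hypothetical file name):
library(foreign)
# read.spss() returns a list unless to.data.frame=TRUE
survey <- read.spss("../data/survey.sav", to.data.frame=TRUE)
# haven::read_sav() is a newer alternative, returning a tibble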
In general terms, almost all geospatial data is read into R workspaces using drivers provided by the external GDAL library. On Windows and macOS, a fairly recent version of GDAL is built into the sf package binaries, providing many vector drivers. Here we mean GIS-style vector data (point coordinates, or points structuring lines and polygons), not R vectors.
library(sf)
## Linking to GEOS 3.10.0dev, GDAL 3.3.0, PROJ 8.0.1
drvs <- st_drivers("vector")[, -2]
drvs[order(drvs$name),]
## name write copy is_raster is_vector vsi
## AmigoCloud AmigoCloud TRUE FALSE FALSE TRUE FALSE
## ARCGEN ARCGEN FALSE FALSE FALSE TRUE TRUE
## AVCBin AVCBin FALSE FALSE FALSE TRUE TRUE
## AVCE00 AVCE00 FALSE FALSE FALSE TRUE TRUE
## BAG BAG TRUE TRUE TRUE TRUE TRUE
## CAD CAD FALSE FALSE TRUE TRUE TRUE
## Carto Carto TRUE FALSE FALSE TRUE FALSE
## Cloudant Cloudant TRUE FALSE FALSE TRUE FALSE
## CouchDB CouchDB TRUE FALSE FALSE TRUE FALSE
## CSV CSV TRUE FALSE FALSE TRUE TRUE
## CSW CSW FALSE FALSE FALSE TRUE FALSE
## DGN DGN TRUE FALSE FALSE TRUE TRUE
## DXF DXF TRUE FALSE FALSE TRUE TRUE
## EDIGEO EDIGEO FALSE FALSE FALSE TRUE TRUE
## EEDA EEDA FALSE FALSE FALSE TRUE FALSE
## Elasticsearch Elasticsearch TRUE FALSE FALSE TRUE FALSE
## ESRI Shapefile ESRI Shapefile TRUE FALSE FALSE TRUE TRUE
## ESRIC ESRIC FALSE FALSE TRUE TRUE TRUE
## ESRIJSON ESRIJSON FALSE FALSE FALSE TRUE TRUE
## FlatGeobuf FlatGeobuf TRUE FALSE FALSE TRUE TRUE
## Geoconcept Geoconcept TRUE FALSE FALSE TRUE TRUE
## GeoJSON GeoJSON TRUE FALSE FALSE TRUE TRUE
## GeoJSONSeq GeoJSONSeq TRUE FALSE FALSE TRUE TRUE
## Geomedia Geomedia FALSE FALSE FALSE TRUE FALSE
## GeoRSS GeoRSS TRUE FALSE FALSE TRUE TRUE
## GML GML TRUE FALSE FALSE TRUE TRUE
## GPKG GPKG TRUE TRUE TRUE TRUE TRUE
## GPSBabel GPSBabel TRUE FALSE FALSE TRUE FALSE
## GPSTrackMaker GPSTrackMaker TRUE FALSE FALSE TRUE TRUE
## GPX GPX TRUE FALSE FALSE TRUE TRUE
## HTTP HTTP FALSE FALSE TRUE TRUE FALSE
## Idrisi Idrisi FALSE FALSE FALSE TRUE TRUE
## JML JML TRUE FALSE FALSE TRUE TRUE
## JP2OpenJPEG JP2OpenJPEG FALSE TRUE TRUE TRUE TRUE
## KML KML TRUE FALSE FALSE TRUE TRUE
## LVBAG LVBAG FALSE FALSE FALSE TRUE TRUE
## MapInfo File MapInfo File TRUE FALSE FALSE TRUE TRUE
## MapML MapML TRUE FALSE FALSE TRUE TRUE
## MBTiles MBTiles TRUE TRUE TRUE TRUE TRUE
## Memory Memory TRUE FALSE FALSE TRUE FALSE
## MSSQLSpatial MSSQLSpatial TRUE FALSE FALSE TRUE FALSE
## MVT MVT TRUE FALSE FALSE TRUE TRUE
## MySQL MySQL TRUE FALSE FALSE TRUE FALSE
## netCDF netCDF TRUE TRUE TRUE TRUE TRUE
## NGW NGW TRUE TRUE TRUE TRUE FALSE
## OAPIF OAPIF FALSE FALSE FALSE TRUE FALSE
## ODBC ODBC TRUE FALSE FALSE TRUE FALSE
## ODS ODS TRUE FALSE FALSE TRUE TRUE
## OGCAPI OGCAPI FALSE FALSE TRUE TRUE TRUE
## OGR_GMT OGR_GMT TRUE FALSE FALSE TRUE TRUE
## OGR_PDS OGR_PDS FALSE FALSE FALSE TRUE TRUE
## OGR_SDTS OGR_SDTS FALSE FALSE FALSE TRUE TRUE
## OGR_VRT OGR_VRT FALSE FALSE FALSE TRUE TRUE
## OpenFileGDB OpenFileGDB FALSE FALSE FALSE TRUE TRUE
## OSM OSM FALSE FALSE FALSE TRUE TRUE
## PCIDSK PCIDSK TRUE FALSE TRUE TRUE TRUE
## PDF PDF TRUE TRUE TRUE TRUE FALSE
## PDS4 PDS4 TRUE TRUE TRUE TRUE TRUE
## PGDUMP PGDUMP TRUE FALSE FALSE TRUE TRUE
## PGeo PGeo FALSE FALSE FALSE TRUE FALSE
## PLSCENES PLSCENES FALSE FALSE TRUE TRUE FALSE
## PostgreSQL PostgreSQL TRUE FALSE FALSE TRUE FALSE
## REC REC FALSE FALSE FALSE TRUE FALSE
## S57 S57 TRUE FALSE FALSE TRUE TRUE
## Selafin Selafin TRUE FALSE FALSE TRUE TRUE
## SQLite SQLite TRUE FALSE FALSE TRUE TRUE
## SVG SVG FALSE FALSE FALSE TRUE TRUE
## SXF SXF FALSE FALSE FALSE TRUE TRUE
## TIGER TIGER TRUE FALSE FALSE TRUE TRUE
## TopoJSON TopoJSON FALSE FALSE FALSE TRUE TRUE
## UK .NTF UK .NTF FALSE FALSE FALSE TRUE TRUE
## VDV VDV TRUE FALSE FALSE TRUE TRUE
## VFK VFK FALSE FALSE FALSE TRUE FALSE
## VICAR VICAR TRUE TRUE TRUE TRUE TRUE
## Walk Walk FALSE FALSE FALSE TRUE FALSE
## WAsP WAsP TRUE FALSE FALSE TRUE TRUE
## WFS WFS FALSE FALSE FALSE TRUE TRUE
## XLSX XLSX TRUE FALSE FALSE TRUE TRUE
The drivers available vary from GDAL version to version, and depend on the build and installation system. The st_read() function reads the geometry and attribute data, storing the geometries in a list column of a "data.frame":
vv <- st_read("../data/rgugik_vv.gpkg")
## Reading layer `rgugik_vv' from data source
## `/home/rsb/und/uam21/data/rgugik_vv.gpkg' using driver `GPKG'
## Simple feature collection with 16 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 171677.6 ymin: 133223.7 xmax: 861895.8 ymax: 775019.1
## Projected CRS: ETRF2000-PL / CS92
class(vv)
## [1] "sf" "data.frame"
We’ll return to look at the structures involved next week. The indexing variable here is "TERYT" rather than "TERC", but they have the same form:
head(vv)
## Simple feature collection with 6 features and 1 field
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 196217.3 ymin: 146222 xmax: 861895.8 ymax: 657775.3
## Projected CRS: ETRF2000-PL / CS92
## TERYT geom
## 1 02 MULTIPOLYGON (((205901.1 34...
## 2 04 MULTIPOLYGON (((383436.9 61...
## 3 06 MULTIPOLYGON (((681015.8 42...
## 4 08 MULTIPOLYGON (((196217.3 44...
## 5 10 MULTIPOLYGON (((435569.2 38...
## 6 12 MULTIPOLYGON (((505961.5 23...
We’ll return to raster files next week if participants need them. The stars and terra packages use GDAL raster drivers to handle raster data; both can handle multiband data, and stars and gdalcubes handle multiband space-time arrays (perhaps terra does too). A sketch of reading a raster file follows the package loading below.
library(stars)
## Loading required package: abind
library(terra)
## terra version 1.1.17
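A hedged sketch of reading a raster file with either package ("example.tif" is a hypothetical GeoTIFF):
r_stars <- read_stars("../data/example.tif")  # stars: via GDAL raster drivers
r_terra <- rast("../data/example.tif")        # terra: also via GDAL raster drivers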
We have already implicitly used one package, rgugik, to access voivodeship boundaries (https://cran.r-project.org/package=rgugik):
library(rgugik)
vv1 <- borders_get(voivodeship_names$NAME_PL)
head(vv1)
## Simple feature collection with 6 features and 1 field
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: 196217.3 ymin: 146222 xmax: 861895.8 ymax: 657775.3
## Projected CRS: ETRF2000-PL / CS92
## TERYT geometry
## 1 02 POLYGON ((205901.1 343182, ...
## 2 04 POLYGON ((383436.9 613642.8...
## 3 06 POLYGON ((681015.8 420027.3...
## 4 08 POLYGON ((196217.3 448607.3...
## 5 10 POLYGON ((435569.2 387437, ...
## 6 12 POLYGON ((505961.5 231480.9...
my_data_4 <- merge(vv1, my_data_3, by.x="TERYT", by.y="TERC")
st_geometry(my_data_4)
## Geometry set for 16 features
## Geometry type: GEOMETRY
## Dimension: XY
## Bounding box: xmin: 171677.6 ymin: 133223.7 xmax: 861895.8 ymax: 775019.1
## Projected CRS: ETRF2000-PL / CS92
## First 5 geometries:
## POLYGON ((205901.1 343182, 205962.9 343351, 206...
## POLYGON ((383436.9 613642.8, 383483.6 613714.5,...
## POLYGON ((681015.8 420027.3, 681441.2 420527.7,...
## POLYGON ((196217.3 448607.3, 196232.7 448674.5,...
## POLYGON ((435569.2 387437, 435633.4 387486.8, 4...
The new bdl package gives access to the Local Data Bank of Statistics Poland (https://cran.r-project.org/package=bdl); many organisations now publish public API specifications for accessing their data, in addition to providing interactive access letting the user choose and download data. If we use the interactive interface at https://bdl.stat.gov.pl/BDL/start, we can look for regional accounts. These screenshots show choices made in the Polish version:
![](Screenshot_2021-04-29 GUS - Bank Danych Lokalnych.png)
![](Screenshot_2021-04-29 GUS - Bank Danych Lokalnych2.png)
![](Screenshot_2021-04-29 GUS - Bank Danych Lokalnych3.png)
leading to the download of files such as:
GRP_csv <- read.csv2("../data/RACH_3498_CTAB_20210429141300.csv")
dim(GRP_csv)
## [1] 16 9
head(GRP_csv)
## Kod Nazwa produkt.krajowy.brutto.ogółem.2014..mln.zł.
## 1 200000 DOLNOŚLĄSKIE 144599
## 2 400000 KUJAWSKO-POMORSKIE 75756
## 3 600000 LUBELSKIE 67108
## 4 800000 LUBUSKIE 38425
## 5 1000000 ŁÓDZKIE 104730
## 6 1200000 MAŁOPOLSKIE 133114
## produkt.krajowy.brutto.ogółem.2015..mln.zł.
## 1 151724
## 2 79835
## 3 69076
## 4 39939
## 5 109557
## 6 142292
## produkt.krajowy.brutto.ogółem.2016..mln.zł.
## 1 156091
## 2 82515
## 3 71637
## 4 41496
## 5 112759
## 6 148446
## produkt.krajowy.brutto.ogółem.2017..mln.zł.
## 1 166114
## 2 87250
## 3 76402
## 4 43498
## 5 119506
## 6 160254
## produkt.krajowy.brutto.ogółem.2018..mln.zł.
## 1 175690
## 2 93294
## 3 79472
## 4 46145
## 5 126981
## 6 172708
## produkt.krajowy.brutto.ogółem.2019..mln.zł. X
## 1 NA NA
## 2 NA NA
## 3 NA NA
## 4 NA NA
## 5 NA NA
## 6 NA NA
GRP_piv_dane <- as.data.frame(read_excel("../data/RACH_3498_XPIV_20210429142101.xlsx", sheet="DANE"))
dim(GRP_piv_dane)
## [1] 96 7
head(GRP_piv_dane)
## Kod Nazwa Produkt krajowy brutto Rok Wartosc
## 1 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2014 144599
## 2 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2015 151724
## 3 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2016 156091
## 4 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2017 166114
## 5 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2018 175690
## 6 0200000 DOLNOŚLĄSKIE produkt krajowy brutto ogółem 2019 -
## Jednostka miary Atrybut
## 1 mln zł <NA>
## 2 mln zł <NA>
## 3 mln zł <NA>
## 4 mln zł <NA>
## 5 mln zł <NA>
## 6 mln zł n
We can do the same programmatically in similar steps to those taken interactively. Of course, what we access depends on the current state of the data source, so downloads or API access may yield different outcomes at different times as errors are corrected. First, what level corresponds to voivodeships?
library(bdl)
GUSlevels <- as.data.frame(get_levels())
woj_level <- GUSlevels[grep("Woj", GUSlevels$name), "id"]
str(woj_level)
## int 2
Now we need to search for the keyword "RACH":
(rachunki_reg <- as.data.frame(search_subjects("RACH")))
## id name
## 1 K29 RACHUNKI REGIONALNE
## 2 P2121 Gospodarstwa domowe źródło utrzymania praca na rachunek własny
## 3 P3474 Produkcja roślinna - udział województw w zbiorach
## hasVariables
## 1 FALSE
## 2 FALSE
## 3 FALSE
## children
## 1 G541, G542, G543, G544, G545, G546, G547, G548, G549, G550, G551, G554, G555, G563
## 2
## 3
## levels parentId
## 1 3, 4 <NA>
## 2 6 G332
## 3 3 G180
(rr_subject <- rachunki_reg[grep("RACH", rachunki_reg$name), "id"])
## [1] "K29"
x <- get_subjects(parentId=rr_subject)
(rr_subjects <- as.data.frame(x))
## id parentId
## 1 G541 K29
## 2 G542 K29
## 3 G543 K29
## 4 G544 K29
## 5 G545 K29
## 6 G546 K29
## 7 G547 K29
## 8 G548 K29
## 9 G549 K29
## 10 G550 K29
## 11 G551 K29
## 12 G554 K29
## 13 G555 K29
## 14 G563 K29
## name
## 1 PRODUKT KRAJOWY BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - NUTS 2
## 2 PRODUKT KRAJOWY BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - NUTS 3
## 3 PRODUKT KRAJOWY BRUTTO (CENY STAŁE) - PKD 2007 - ESA 2010
## 4 WARTOŚĆ DODANA BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - NUTS 2
## 5 WARTOŚĆ DODANA BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - NUTS 3
## 6 PRODUKCJA GLOBALNA (CENY BIEŻĄCE) - PKD 2007 - ESA 2010
## 7 ZUŻYCIE POŚREDNIE (CENY BIEŻĄCE) - PKD 2007 - ESA 2010
## 8 KOSZTY ZWIĄZANE Z ZATRUDNIENIEM (CENY BIEŻĄCE) - PKD 2007 - ESA 2010
## 9 NADWYŻKA OPERACYJNA BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010
## 10 NOMINALNE DOCHODY W SEKTORZE GOSPODARSTW DOMOWYCH - PKD 2007 - ESA 2010
## 11 REALNE DOCHODY W SEKTORZE GOSPODARSTW DOMOWYCH - PKD 2007 - ESA 2010
## 12 PRODUKT KRAJOWY BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - SZACUNKI WSTĘPNE
## 13 WARTOŚĆ DODANA BRUTTO (CENY BIEŻĄCE) - PKD 2007 - ESA 2010 - SZACUNKI WSTĘPNE
## 14 DANE ARCHIWALNE
## hasVariables
## 1 FALSE
## 2 FALSE
## 3 FALSE
## 4 FALSE
## 5 FALSE
## 6 FALSE
## 7 FALSE
## 8 FALSE
## 9 FALSE
## 10 FALSE
## 11 FALSE
## 12 FALSE
## 13 FALSE
## 14 FALSE
## children
## 1 P3498, P3499
## 2 P3500, P3501
## 3 P3502, P3503
## 4 P3504, P3505, P3506, P3507, P3508, P3509, P3510, P3511
## 5 P3512, P3513, P3514
## 6 P3515
## 7 P3516
## 8 P3517, P3518, P3519
## 9 P3520
## 10 P3521, P3522, P3523, P3524
## 11 P3525, P3526
## 12 P3562, P3563
## 13 P3564
## 14 P2535, P2536, P2537, P2538, P2539, P2540, P2541, P2542, P2543, P2544, P2545, P2546, P2547, P2548, P2549, P2550, P2551, P2552, P2553, P2554, P2620, P2739, P2740, P2741, P2742, P2743, P2927, P2928, P2929, P2930, P2931, P2932, P2933, P2934, P2935, P2936, P2937, P2938, P2939, P2940, P2941, P2942, P2943, P2944, P2945, P2946, P2947, P2948, P2949, P2950, P2951, P2952, P2953, P2954, P3192
## levels
## 1 3
## 2 4
## 3 3
## 4 3
## 5 4
## 6 3
## 7 3
## 8 3
## 9 3
## 10 3
## 11 3
## 12 3
## 13 3
## 14 3, 4
(rr_target <- unlist(rr_subjects[grep("NUTS 2", rr_subjects$name)[1], "children"]))
## [1] "P3498" "P3499"
(rr_vars <- as.data.frame(get_variables(rr_target[1])))
## id subjectId n1
## 1 458271 P3498 produkt krajowy brutto ogółem
## 2 458272 P3498 dynamika produktu krajowego brutto ogółem, rok poprzedni=100
## 3 458273 P3498 produkt krajowy brutto ogółem, Polska=100
## level measureUnitId measureUnitName
## 1 3 6 mln zł
## 2 3 50 %
## 3 3 50 %
(rr_var <- rr_vars$id[1])
## [1] 458271
rr_data <- as.data.frame(get_data_by_variable(rr_var, unitLevel = woj_level, year = 2014:2019))
rr_data$TERC <- substring(rr_data$id, 3, 4)
names(rr_data)[4] <- "PRB"
head(rr_data)
## id name year PRB measureUnitId measureName attrId
## 1 011200000000 MAŁOPOLSKIE 2014 133114 6 mln zł 1
## 2 011200000000 MAŁOPOLSKIE 2015 142292 6 mln zł 1
## 3 011200000000 MAŁOPOLSKIE 2016 148446 6 mln zł 1
## 4 011200000000 MAŁOPOLSKIE 2017 160254 6 mln zł 1
## 5 011200000000 MAŁOPOLSKIE 2018 172708 6 mln zł 1
## 6 012400000000 ŚLĄSKIE 2014 212062 6 mln zł 1
## attributeDescription TERC
## 1 12
## 2 12
## 3 12
## 4 12
## 5 12
## 6 24
oo <- reshape(rr_data[, -c(5:8)], v.names="PRB", timevar="year", idvar=c("TERC", "id", "name"), direction="wide")
rr_data_wide <- oo[order(oo$TERC), ]
dim(rr_data_wide)
## [1] 16 8
head(rr_data_wide)
## id name TERC PRB.2014 PRB.2015 PRB.2016 PRB.2017
## 26 030200000000 DOLNOŚLĄSKIE 02 144599 151724 156091 166114
## 36 040400000000 KUJAWSKO-POMORSKIE 04 75756 79835 82515 87250
## 61 060600000000 LUBELSKIE 06 67108 69076 71637 76402
## 11 020800000000 LUBUSKIE 08 38425 39939 41496 43498
## 51 051000000000 ŁÓDZKIE 10 104730 109557 112759 119506
## 1 011200000000 MAŁOPOLSKIE 12 133114 142292 148446 160254
## PRB.2018
## 26 175690
## 36 93294
## 61 79472
## 11 46145
## 51 126981
## 1 172708
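The same reshape can be expressed with tidyr, if installed; a sketch (the column order of the result may differ):
# pivot_wider() spreads the year column into PRB.2014 ... PRB.2018 columns
rr_wide2 <- tidyr::pivot_wider(rr_data[, c("id", "name", "TERC", "year", "PRB")], names_from=year, values_from=PRB, names_prefix="PRB.")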
So online download from the data provider via bdl and the API gives us the same data as choosing the data interactively and downloading a file to be read into R (if done at about the same time):
all.equal(rr_data_wide[, 4:8], GRP_csv[,3:7], check.attributes=FALSE)
## [1] TRUE
So now we’ll merge the gross regional product with the geometries and data we already have:
names(my_data_4)
## [1] "TERYT"
## [2] "NAME_PL"
## [3] "NAME_EN"
## [4] "sejmik_size"
## [5] "Sejmik"
## [6] "Siedziba"
## [7] "Liczba.radnych"
## [8] "L..okr."
## [9] "L..obw."
## [10] "Liczba.wyborców.uprawnionych.do.głosowania"
## [11] "Komisja.otrzymała.kart.do.głosowania"
## [12] "Nie.wykorzystano.kart.do.głosowania"
## [13] "Liczba.wyborców..którym.wydano.karty.do.głosowania"
## [14] "Liczba.wyborców.głosujących.przez.pełnomocnika"
## [15] "Liczba.wyborców..którym.wysłano.pakiety.wyborcze"
## [16] "Liczba.otrzymanych.kopert.zwrotnych"
## [17] "Liczba.kopert.zwrotnych..w.których.nie.było.oświadczenia.o.osobistym.i.tajnym.oddaniu.głosu"
## [18] "Liczba.kopert.zwrotnych..w.których.oświadczenie.nie.było.podpisane"
## [19] "Liczba.kopert.zwrotnych..w.których.nie.było.koperty.na.kartę.do.głosowania"
## [20] "Liczba.kopert.zwrotnych..w.których.znajdowała.się.niezaklejona.koperta.na.kartę.do.głosowania"
## [21] "Liczba.kopert.na.kartę.do.głosowania.wrzuconych.do.urny"
## [22] "Liczba.kart.wyjętych.z.urny"
## [23] "w.tym.liczba.kart.wyjętych.z.kopert.na.kartę.do.głosowania"
## [24] "Liczba.kart.nieważnych"
## [25] "Liczba.kart.ważnych"
## [26] "Liczba.głosów.nieważnych"
## [27] "w.tym.z.powodu.postawienia.znaku..X..obok.nazwiska.dwóch.lub.większej.liczby.kandydatów.z.różnych.list"
## [28] "w.tym.z.powodu.niepostawienia.znaku..X..obok.nazwiska.żadnego.kandydata"
## [29] "w.tym.z.powodu.postawienia.znaku..X..wyłącznie.obok.nazwiska.kandydata.na.liście..której.rejestracja.została.unieważniona"
## [30] "Liczba.głosów.ważnych"
## [31] "KW.PRAWO.I.SPRAWIEDLIWOŚĆ"
## [32] "KKW.PLATFORMA.NOWOCZESNA.KOALICJA.OBYWATELSKA"
## [33] "KOMITET.WYBORCZY.PSL"
## [34] "KKW.SLD.LEWICA.RAZEM"
## [35] "KWW.KUKIZ.15"
## [36] "KWW.BEZPARTYJNI.SAMORZĄDOWCY"
## [37] "KWW.WOLNOŚĆ.W.SAMORZĄDZIE"
## [38] "KW.RAZEM"
## [39] "KW.RUCH.NARODOWY.RP"
## [40] "KW.PARTIA.ZIELONI"
## [41] "KW.WOLNI.I.SOLIDARNI"
## [42] "KWW.Z.DUTKIEWICZEM.DLA.DOLNEGO.ŚLĄSKA"
## [43] "KW..WSPÓLNA.MAŁOPOLSKA."
## [44] "KW.ŚPR"
## [45] "KW.ŚLONZOKI.RAZEM"
## [46] "KWW.MNIEJSZOŚĆ.NIEMIECKA"
## [47] "KWW.PROJEKT.ŚWIĘTOKRZYSKIE.BOGDANA.WENTY"
## [48] "KWW.JEDNOŚĆ.NARODU...WSPÓLNOTA"
## [49] "KW.ZJEDNOCZENIE.CHRZEŚCIJAŃSKICH.RODZIN"
## [50] "KW.INICJATYWA.OBYWATELSKA"
## [51] "KWW.POLSKIE.RODZINY.RAZEM"
## [52] "KWW.SPOZA.SITWY"
## [53] "KWW.ISKRA"
## [54] "KWW.AKCJA.NARODOWA"
## [55] "KW.ZWIĄZKU.SŁOWIAŃSKIEGO"
## [56] "KW.STOWARZYSZENIA.LEX.NATURALIS"
## [57] "KWW.AGNIESZKI.JĘDRZEJEWSKIEJ"
## [58] "KW.STRONNICTWA.PRACY"
## [59] "geometry"
my_data_5 <- merge(my_data_4, rr_data_wide, by.x="TERYT", by.y="TERC")
The simplest current output of results is a notebook, using Sweave() or knitr::knit(), both also available in the RStudio IDE. This keeps the code and the output together, and for HTML output may also include interactive components (which may be quite large). Both R and RStudio include local web servers to display HTML output through browsers or other renderers. Such notebooks are similar in kind to Jupyter notebooks in Python and elsewhere.
In producing such notebook output, RStudio can cache objects, avoiding re-running long computations (or, here, avoiding repeated downloads from GUS or GUGIK) if the code in a chunk is unchanged, and none of the arguments used are changed by earlier chunks. We can also do this manually, storing single R objects in compressed binary RDS form (saveRDS(), readRDS()), or multiple objects in RData/RDA form (save(), load()). The RDA form is also used to save whole workspaces.
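As an illustration (chunk label hypothetical), caching is declared in the chunk header in the .Rmd source:
# in the .Rmd file, a chunk can be cached like this:
# ```{r bdl_download, cache=TRUE}
# rr_data <- as.data.frame(get_data_by_variable(rr_var, unitLevel=woj_level))
# ```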
saveRDS(my_data_5, file="my_data_5.rds")
my_data_5a <- readRDS("my_data_5.rds")
all.equal(my_data_5, my_data_5a)
## [1] TRUE
Similarly, savehistory() can be used to save your session history to a text file for subsequent reference. In RStudio, the active history is shown in a tab, and can be saved from there.
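For example (file name arbitrary):
savehistory(file="UAM21_I_session.Rhistory")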
Tabular data may be saved using write.csv() and similar functions:
write.csv(rr_data, file="rr_data.csv", row.names=FALSE)
cat(readLines("rr_data.csv", n=3), "\n")
## "id","name","year","PRB","measureUnitId","measureName","attrId","attributeDescription","TERC" "011200000000","MAŁOPOLSKIE","2014",133114,"6","mln zł",1,"","12" "011200000000","MAŁOPOLSKIE","2015",142292,"6","mln zł",1,"","12"
Non-tabular data, such as model output, may be dumped to file using sink():
summary(lm(Liczba.głosów.ważnych ~ PRB.2018, data=my_data_5))
##
## Call:
## lm(formula = Liczba.głosów.ważnych ~ PRB.2018, data = my_data_5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -200122 -73492 -20112 111391 217660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.377e+05 5.336e+04 6.328 1.86e-05 ***
## PRB.2018 4.728e+00 3.108e-01 15.215 4.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135600 on 14 degrees of freedom
## Multiple R-squared: 0.943, Adjusted R-squared: 0.9389
## F-statistic: 231.5 on 1 and 14 DF, p-value: 4.215e-10
sink(file="output.txt")
summary(lm(Liczba.głosów.ważnych ~ PRB.2018, data=my_data_5))
##
## Call:
## lm(formula = Liczba.głosów.ważnych ~ PRB.2018, data = my_data_5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -200122 -73492 -20112 111391 217660
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.377e+05 5.336e+04 6.328 1.86e-05 ***
## PRB.2018 4.728e+00 3.108e-01 15.215 4.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135600 on 14 degrees of freedom
## Multiple R-squared: 0.943, Adjusted R-squared: 0.9389
## F-statistic: 231.5 on 1 and 14 DF, p-value: 4.215e-10
sink()
cat(readLines("output.txt"), sep="\n")
In addition to R objects serialised to binary files, we can of course export tabular data to other software. Writing to Excel is possible with openxlsx:
library(openxlsx)
write.xlsx(rr_data_wide, file="rr_data_wide.xlsx")
all.equal(read_excel("rr_data_wide.xlsx"), rr_data_wide, check.attributes=FALSE)
## [1] TRUE
For exchange of tabular data with other statistical software, again start with foreign. For spatial vector data, we typically use the sf function st_write(), again using GDAL drivers:
st_write(vv1, dsn="vv1.gpkg", append=FALSE)
## Deleting layer `vv1' using driver `GPKG'
## Writing layer `vv1' to data source `vv1.gpkg' using driver `GPKG'
## Writing 16 features with 1 fields and geometry type Unknown (any).
In S2 syntax, there were objects; in S3, some objects also had a class attribute set, to offer guidance on what the object might contain
In S4, the contents of objects were formalised, moving checks from methods dispatched by the class of the first method argument to the classes themselves
In OOP in other languages, methods belong to objects, and RC (reference classes) and R6 systems have been developed to provide these
RC are linked to the success of Rcpp, and perhaps used in reticulate to interface Python from R (see keras and tensorflow)
Formulae provide the S3 modelling interface, and use non-standard evaluation
They tell us where to look for variables in modelling, and which transformations to apply to them
They provide us with what we need from our data in terms of output from analyses
They are very flexible, with update methods to modify our approaches
There are plenty of base R functions that use non-standard evaluation, such as formula and library
The underlying issue is often how to point to objects within other objects, where the encompassing object is an environment or a data.frame
Connections to files and databases include the original meaning of “pipe”
More recently, pipes may be used instead of nested function arguments
Classes and objects appeared first in Simula from the Norwegian Computing Center in Oslo, and were intended to further encapsulation in programs
C++ was the extension of C to include Simula-like object handling, although unlike Simula it did not include a garbage collector
More modern languages, like Java, settled on a strict OOP view of classes of objects that contained methods, but until the 1990s, this was not the only possibility
S adopted a functional class quasi-system with S3, where generic methods stood apart from classes of objects, but related to them
As in file names, various non-alphanumeric characters can be used to separate parts, for example the . dot
In S and early R, the _ underscore was used for assignment like <- (see this posting)
So a central S3 class was called data.frame, and coercion to a data.frame was as.data.frame.matrix for a matrix argument, and as.data.frame.list for a list argument
Here the as.data.frame part was sufficient, and method dispatch would choose the appropriate implementation by matching the class of the first argument to the final dot-separated part of the list of available methods
methods(as.data.frame)
## [1] as.data.frame,ANY-method
## [2] as.data.frame,Raster-method
## [3] as.data.frame,SpatRaster-method
## [4] as.data.frame,SpatVector-method
## [5] as.data.frame.aovproj*
## [6] as.data.frame.array
## [7] as.data.frame.AsIs
## [8] as.data.frame.character
## [9] as.data.frame.complex
## [10] as.data.frame.data.frame
## [11] as.data.frame.data.table*
## [12] as.data.frame.Date
## [13] as.data.frame.default
## [14] as.data.frame.difftime
## [15] as.data.frame.factor
## [16] as.data.frame.ftable*
## [17] as.data.frame.GridTopology*
## [18] as.data.frame.grouped_df*
## [19] as.data.frame.integer
## [20] as.data.frame.ITime*
## [21] as.data.frame.list
## [22] as.data.frame.logical
## [23] as.data.frame.logLik*
## [24] as.data.frame.mapped_discrete*
## [25] as.data.frame.matrix
## [26] as.data.frame.model.matrix
## [27] as.data.frame.noquote
## [28] as.data.frame.numeric
## [29] as.data.frame.numeric_version
## [30] as.data.frame.ordered
## [31] as.data.frame.POSIXct
## [32] as.data.frame.POSIXlt
## [33] as.data.frame.proxy_registry*
## [34] as.data.frame.raw
## [35] as.data.frame.sf*
## [36] as.data.frame.sfc*
## [37] as.data.frame.sgbp*
## [38] as.data.frame.shingle*
## [39] as.data.frame.SpatialGrid*
## [40] as.data.frame.SpatialGridDataFrame*
## [41] as.data.frame.SpatialLinesDataFrame*
## [42] as.data.frame.SpatialMultiPoints*
## [43] as.data.frame.SpatialMultiPointsDataFrame*
## [44] as.data.frame.SpatialPixels*
## [45] as.data.frame.SpatialPixelsDataFrame*
## [46] as.data.frame.SpatialPoints*
## [47] as.data.frame.SpatialPointsDataFrame*
## [48] as.data.frame.SpatialPolygonsDataFrame*
## [49] as.data.frame.stars*
## [50] as.data.frame.stars_proxy*
## [51] as.data.frame.table
## [52] as.data.frame.tbl_df*
## [53] as.data.frame.ts
## [54] as.data.frame.units*
## [55] as.data.frame.univaov*
## [56] as.data.frame.vctrs_sclr*
## [57] as.data.frame.vctrs_vctr*
## [58] as.data.frame.vector
## see '?methods' for accessing help and source code
Rasmus Bååth has a report on naming conventions used in R some years ago
More recently, lowerCamelCase and snake_case have become predominant, with most recent code being snake_case (all lower case and words separated by underscore)
Obviously, the class-matching component of generic methods for S3 objects has to be separated by a dot
It is easy to create new methods for existing generic functions, like print, plot, or summary
It is also easy to create new classes - no definition is needed, but as software develops, the class-specific methods often need to guard against the absence of object components, typically with !is.null() checks, as sketched below
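A minimal sketch of both (class name hypothetical):
make_course <- function(name, hours=NULL) {
  structure(list(name=name, hours=hours), class="course")  # no formal definition needed
}
print.course <- function(x, ...) {
  cat("Course:", x$name, "\n")
  if (!is.null(x$hours)) cat("Hours:", x$hours, "\n")  # guard against an absent component
  invisible(x)
}
make_course("UAM21_I", 24)  # autoprinting dispatches to print.course()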
If you save an S3 object, say to an RDS file, possibly for serialization, the package context of the class will be lost
Then the class attribute will be set, but which package provides the methods for that class is not recorded
data(mtcars)
class(mtcars)
## [1] "data.frame"
attributes(mtcars)
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $row.names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## $class
## [1] "data.frame"
(oS <- search())
## [1] ".GlobalEnv" "package:openxlsx" "package:bdl"
## [4] "package:terra" "package:stars" "package:abind"
## [7] "package:sf" "package:readxl" "package:rgugik"
## [10] "package:stats" "package:graphics" "package:grDevices"
## [13] "package:utils" "package:datasets" "package:methods"
## [16] "Autoloads" "package:base"
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:terra':
##
## collapse
## This is mgcv 1.8-35. For overview type 'help("mgcv-package")'.
gm <- gam(formula=mpg ~ s(wt), data=mtcars)
class(gm)
## [1] "gam" "glm" "lm"
print(gm)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## mpg ~ s(wt)
##
## Estimated degrees of freedom:
## 2.14 total = 3.14
##
## GCV score: 7.887675
stats:::print.glm(gm)
##
## Call: gam(formula = mpg ~ s(wt), data = mtcars)
##
## Coefficients:
## (Intercept) s(wt).1 s(wt).2 s(wt).3 s(wt).4 s(wt).5
## 20.09063 -0.05848 -0.47219 -0.02163 0.43973 -0.38559
## s(wt).6 s(wt).7 s(wt).8 s(wt).9
## -0.61434 -0.27343 -3.45191 -4.99051
##
## Degrees of Freedom: 31 Total (i.e. Null); 28.85932 Residual
## Null Deviance: 1126
## Residual Deviance: 205.3 AIC: 158.6
methods(logLik)
## [1] logLik.Arima* logLik.classIntervals* logLik.corStruct*
## [4] logLik.gam logLik.glm* logLik.gls*
## [7] logLik.glsStruct* logLik.gnls* logLik.gnlsStruct*
## [10] logLik.lm* logLik.lme* logLik.lmeStruct*
## [13] logLik.lmeStructInt* logLik.lmList* logLik.logLik*
## [16] logLik.nls* logLik.reStruct* logLik.varComb*
## [19] logLik.varFunc*
## see '?methods' for accessing help and source code
Because library affects the search path, we have added visible methods compared to our first view of those available
nS <- search()
nS[!(nS %in% oS)]
## [1] "package:mgcv" "package:nlme"
In the online version, and in the forthcoming second edition, there are short descriptions of formal S4 classes
They were introduced in the Green Book, and covered in later works
The methods package provides S4 classes in R and is used both for formal S4 classes and for RC reference classes also used in Rcpp modules
S4 classes are used extensively in Bioconductor packages
Writing S4 classes involves thinking ahead, to plan a hierarchy of classes (and virtual classes)
If it is possible to generalise methods in the inheritance tree of class definitions, a method can be used on all descendants inheriting from a root class
Formal classes also provide certainty that the classes contain slots as required, and objects can be checked for validity
Over time, unclassed and S3 objects have been made able to work within S4 settings, but some of these adaptations have been fragile
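A minimal sketch of a formal class definition (class name hypothetical):
setClass("Course", slots=c(name="character", hours="numeric"),
         validity=function(object) {
           if (object@hours < 0) "hours must be non-negative" else TRUE
         })
crs <- new("Course", name="UAM21_I", hours=24)  # validity is checked here
isS4(crs)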
First, a standard dense identity (unit diagonal) matrix created using diag:
d100 <- diag(100)
isS4(d100)
## [1] FALSE
class(d100)
## [1] "matrix" "array"
str(d100)
## num [1:100, 1:100] 1 0 0 0 0 0 0 0 0 0 ...
object.size(d100)
## 80216 bytes
getClass(class(d100))
## Class "matrix" [package "methods"]
##
## No Slots, prototype of class "matrix"
##
## Extends:
## Class "array", directly
## Class "mMatrix", directly
## Class "structure", by class "array", distance 2
## Class "vector", by class "array", distance 3, with explicit coerce
##
## Known Subclasses: "mts"
Now a sparse identity (unit diagonal) matrix with Diagonal from the Matrix package:
library(Matrix)
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:terra':
##
## expand, pack
D100 <- Diagonal(100)
class(D100)
## [1] "ddiMatrix"
## attr(,"package")
## [1] "Matrix"
isS4(D100)
## [1] TRUE
str(D100)
## Formal class 'ddiMatrix' [package "Matrix"] with 4 slots
## ..@ diag : chr "U"
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num(0)
object.size(D100)
## 1240 bytes
getClass(class(D100))
## Class "ddiMatrix" [package "Matrix"]
##
## Slots:
##
## Name: diag Dim Dimnames x
## Class: character integer list numeric
##
## Extends:
## Class "diagonalMatrix", directly
## Class "dMatrix", directly
## Class "sparseMatrix", by class "diagonalMatrix", distance 2
## Class "Matrix", by class "dMatrix", distance 2
## Class "xMatrix", by class "dMatrix", distance 2
## Class "mMatrix", by class "Matrix", distance 4
## Class "Mnumeric", by class "Matrix", distance 4
## Class "replValueSp", by class "Matrix", distance 4
showMethods(f=coerce, classes=class(D100))
## Function: coerce (package methods)
## from="ddiMatrix", to="CsparseMatrix"
## from="ddiMatrix", to="ddenseMatrix"
## from="ddiMatrix", to="dgCMatrix"
## from="ddiMatrix", to="dgeMatrix"
## from="ddiMatrix", to="diagonalMatrix"
## from="ddiMatrix", to="dMatrix"
## from="ddiMatrix", to="dsparseMatrix"
## from="ddiMatrix", to="dtCMatrix"
## from="ddiMatrix", to="matrix"
## from="ddiMatrix", to="Matrix"
## from="ddiMatrix", to="Mnumeric"
## from="ddiMatrix", to="symmetricMatrix"
## from="ddiMatrix", to="triangularMatrix"
## from="ddiMatrix", to="TsparseMatrix"
str(as(D100, "dgeMatrix"))
## Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
## ..@ x : num [1:10000] 1 0 0 0 0 0 0 0 0 0 ...
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ factors : list()
Reference classes are (much more) like OOP mechanisms in C++, Java, and other programming languages, in which objects contain their class-specific methods
Because the methods are part of the class definitions, the appropriate method is known by the object, and dispatch is unproblematic
Reference classes are provided in the methods package, and R6 classes in the R6 package
R6 classes are more lightweight than RC classes
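A minimal RC sketch (class name hypothetical), with the methods held in the object itself:
Counter <- setRefClass("Counter", fields=list(n="numeric"),
  methods=list(
    bump=function() { n <<- n + 1 },             # fields are updated with <<-
    show=function() cat("count:", n, "\n")
  ))
cc <- Counter$new(n=0)
cc$bump(); cc$bump()
cc  # autoprinting calls the object's own show() method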
A unifying feature of S and R has been the treatment of data.frame objects as data= arguments to functions and methods
Most of the functions and methods fit models to data, and formula objects show how to treat the variables in the data= argument, or, if not found there, in the calling environments of the model fitting function
We saw this earlier when looking at mgcv::gam, but did not explain it
Formulae may be two-sided (mostly) and one-sided; the Formula package provides extensions useful in econometrics
Formula objects can be updated
library(mgcv)
f <- mpg ~ s(wt)
gm <- gam(formula=f, data=mtcars)
gm1 <- gam(update(f, . ~ - s(wt) + poly(wt, 3)), data=mtcars)
anova(gm, gm1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: mpg ~ s(wt)
## Model 2: mpg ~ poly(wt, 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 28.399 205.29
## 2 28.000 203.67 0.39915 1.6217 0.3098
mtcars$fcyl <- as.factor(mtcars$cyl)
f1 <- update(f, log(.) ~ - s(wt) - 1 + wt + fcyl)
f1
## log(mpg) ~ wt + fcyl - 1
lm(f1, data=mtcars)
##
## Call:
## lm(formula = f1, data = mtcars)
##
## Coefficients:
## wt fcyl4 fcyl6 fcyl8
## -0.1762 3.6731 3.5295 3.4046
terms(f1)
## log(mpg) ~ wt + fcyl - 1
## attr(,"variables")
## list(log(mpg), wt, fcyl)
## attr(,"factors")
## wt fcyl
## log(mpg) 0 0
## wt 1 0
## fcyl 0 1
## attr(,"term.labels")
## [1] "wt" "fcyl"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
head(model.matrix(f1, mtcars))
## wt fcyl4 fcyl6 fcyl8
## Mazda RX4 2.620 0 1 0
## Mazda RX4 Wag 2.875 0 1 0
## Datsun 710 2.320 1 0 0
## Hornet 4 Drive 3.215 0 1 0
## Hornet Sportabout 3.440 0 0 1
## Valiant 3.460 0 1 0
model.response(model.frame(f1, mtcars))[1:6]
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 3.044522 3.044522 3.126761 3.063391
## Hornet Sportabout Valiant
## 2.928524 2.895912
mtcars$mpg[1:6]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
log(mtcars$mpg[1:6])
## [1] 3.044522 3.044522 3.126761 3.063391 2.928524 2.895912
library(lme4)
## Registered S3 methods overwritten by 'lme4':
## method from
## cooks.distance.influence.merMod car
## influence.merMod car
## dfbeta.influence.merMod car
## dfbetas.influence.merMod car
##
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
##
## lmList
lmm <- lmer(update(f, log(.) ~ poly(wt, 3) | fcyl), mtcars)
## boundary (singular) fit: see ?isSingular
fixef(lmm)
## (Intercept)
## 2.922225
ranef(lmm)
## $fcyl
## (Intercept) poly(wt, 3)1 poly(wt, 3)2 poly(wt, 3)3
## 4 0.26268014 -0.10735016 0.7149126 0.024031242
## 6 0.08117715 -0.03317492 0.2209325 0.007426475
## 8 -0.24527314 0.10023633 -0.6675376 -0.022438767
##
## with conditional variances for "fcyl"
mtcars$fam <- as.factor(mtcars$am)
lm(update(f, log(.) ~ - s(wt) + (fam*fcyl)/wt - 1), mtcars)
##
## Call:
## lm(formula = update(f, log(.) ~ -s(wt) + (fam * fcyl)/wt - 1),
## data = mtcars)
##
## Coefficients:
## fam0 fam1 fcyl6 fcyl8 fam1:fcyl6
## 2.73316 3.94159 2.46766 0.70085 -3.30564
## fam1:fcyl8 fam0:fcyl4:wt fam1:fcyl4:wt fam0:fcyl6:wt fam1:fcyl6:wt
## -1.47065 0.13514 -0.30280 -0.66469 -0.02918
## fam0:fcyl8:wt fam1:fcyl8:wt
## -0.18018 -0.12990
xtabs(~ fam + fcyl, data=mtcars)
## fcyl
## fam 4 6 8
## 0 3 4 12
## 1 8 3 2
row.names(mtcars)[mtcars$am == 1]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Fiat 128"
## [5] "Honda Civic" "Toyota Corolla" "Fiat X1-9" "Porsche 914-2"
## [9] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
## [13] "Volvo 142E"
In some settings in S and later R, it has been convenient to drop quotation marks around strings representing names of packages, list components and classes
We’ve seen this in action already, but have not drawn attention to it
In particular, the $ list component selection operator uses this, as does the use of names in formulae
We also see the same thing in the package-attaching functions library and require:
require
## function (package, lib.loc = NULL, quietly = FALSE, warn.conflicts,
## character.only = FALSE, mask.ok, exclude, include.only, attach.required = missing(include.only))
## {
## if (!character.only)
## package <- as.character(substitute(package))
## loaded <- paste0("package:", package) %in% search()
## if (!loaded) {
## if (!quietly)
## packageStartupMessage(gettextf("Loading required package: %s",
## package), domain = NA)
## value <- tryCatch(library(package, lib.loc = lib.loc,
## character.only = TRUE, logical.return = TRUE, warn.conflicts = warn.conflicts,
## quietly = quietly, mask.ok = mask.ok, exclude = exclude,
## include.only = include.only, attach.required = attach.required),
## error = function(e) e)
## if (inherits(value, "error")) {
## if (!quietly) {
## msg <- conditionMessage(value)
## cat("Failed with error: ", sQuote(msg), "\n",
## file = stderr(), sep = "")
## .Internal(printDeferredWarnings())
## }
## return(invisible(FALSE))
## }
## if (!value)
## return(invisible(FALSE))
## }
## else value <- TRUE
## invisible(value)
## }
## <bytecode: 0x12aef08>
## <environment: namespace:base>
In such settings, we cannot use a single element character vector to transmit information:
ggplot2 <- "gridExtra"
ggplot2
## [1] "gridExtra"
ggplot2 <- as.character(substitute(ggplot2))
ggplot2
## [1] "ggplot2"
paste0("package:", ggplot2) %in% search()
## [1] FALSE
Here a similar mechanism is used to look inside the data.frame given as the first argument, rather than in the current and global environments:
subset(mtcars, subset=cyl == 4)
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl fam
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4 0
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4 0
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4 0
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4 1
acyl <- "cyl"
ncyl <- 4
mtcars[mtcars[[acyl]] == ncyl,]
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl fam
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4 0
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4 0
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4 0
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4 1
is.list(mtcars)
## [1] TRUE
mtcars[eval(substitute(cyl == 4), mtcars, parent.frame()), ]
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl fam
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4 1
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4 0
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4 0
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4 1
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4 0
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4 1
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4 1
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4 1
In a blog post, rlang is used to create a list of formulae, but we can use standard evaluation (SE) constructs:
create_form = function(power){
rhs = substitute(I(hp^pow), list(pow=power))
as.formula(paste0("mpg ~ ", deparse(rhs)))
}
list_formulae = Map(create_form, seq(1,6))
# mapply(create_form, seq(1,6))
llm <- lapply(list_formulae, lm, data=mtcars)
sapply(llm, function(x) summary(x)$sigma)
## [1] 3.862962 4.577939 5.132309 5.494144 5.712194 5.840452
Non-standard evaluation may be attractive, as the history of S and R syntax has shown
Use in additive graphics in ggplot2 and in verbs in dplyr and similar packages has led to increases in its use
Standard evaluation is arguably more programmable, but workflows using NSE are promoted (not least by RStudio)
It is interesting that Microsoft are drawing attention to seplyr (see also John Mount’s blog)
Connections are part of base R, corresponding to innovations in version 4 of S; the DBI database interface abstractions began to appear at about the same time
The DBI classes are formal S4 classes for obvious reasons, but in R connections are S3 classes
Connections generalize files to include downloading (also from https) and uncompressing files, and exchanging data by socket
Connections are sometimes platform-specific, and are used inside input-output functions (tomorrow)
(lns <- readLines(pipe("dir")))
## [1] "my_data_5.rds"
## [2] "output.txt"
## [3] "rr_data.csv"
## [4] "rr_data_wide.xlsx"
## [5] "Screenshot_2021-04-29\\ GUS\\ -\\ Bank\\ Danych\\ Lokalnych2.png"
## [6] "Screenshot_2021-04-29\\ GUS\\ -\\ Bank\\ Danych\\ Lokalnych3.png"
## [7] "Screenshot_2021-04-29\\ GUS\\ -\\ Bank\\ Danych\\ Lokalnych.png"
## [8] "thur2104291415.Rhistory"
## [9] "uam21.bib"
## [10] "UAM21_I_210505.html"
## [11] "UAM21_I_210505.R"
## [12] "UAM21_I_210505.Rmd"
## [13] "UAM21_I_210506_cache"
## [14] "UAM21_I_210506_files"
## [15] "UAM21_I_210506.html"
## [16] "UAM21_I_210506.R"
## [17] "UAM21_I_210506.Rmd"
## [18] "UAM21_I_210510.Rmd"
## [19] "UAM21_I_210511.Rmd"
## [20] "vv1.gpkg"
## [21] "vv.sqlite"
## [22] "wed2104282100.Rhistory"
con <- file(lns[3])
readLines(con, n=6L)
## [1] "\"id\",\"name\",\"year\",\"PRB\",\"measureUnitId\",\"measureName\",\"attrId\",\"attributeDescription\",\"TERC\""
## [2] "\"011200000000\",\"MAŁOPOLSKIE\",\"2014\",133114,\"6\",\"mln zł\",1,\"\",\"12\""
## [3] "\"011200000000\",\"MAŁOPOLSKIE\",\"2015\",142292,\"6\",\"mln zł\",1,\"\",\"12\""
## [4] "\"011200000000\",\"MAŁOPOLSKIE\",\"2016\",148446,\"6\",\"mln zł\",1,\"\",\"12\""
## [5] "\"011200000000\",\"MAŁOPOLSKIE\",\"2017\",160254,\"6\",\"mln zł\",1,\"\",\"12\""
## [6] "\"011200000000\",\"MAŁOPOLSKIE\",\"2018\",172708,\"6\",\"mln zł\",1,\"\",\"12\""
close(con)
The magrittr package introduced the %>% pipe; coupled to the right assign ->, it even looks reasonable
There are more pipes in R, and the Bizarro pipe ->.; is pure base R, saving output to a variable named .
The arguments for writing with pipes are concentrated on readability
It is possible that R 4.1.0 due 18 May will include native pipes: |>
, but this remains ongoing work.
Finally, I first saw ->
used with pipes in a talk about the archivist package
archivist is also discussed in this blog
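A sketch comparing the pipes, assuming magrittr is installed; the native pipe line is commented out as R 4.1.0 is not yet released
library(magrittr)
mtcars %>% subset(cyl == 4) %>% nrow() -> n4          # magrittr pipe with right assign
mtcars ->.; subset(., cyl == 4) ->.; nrow(.) -> n4b   # Bizarro pipe: pure base R, output saved to .
identical(n4, n4b)
## [1] TRUE
# mtcars |> subset(cyl == 4) |> nrow()                # native pipe, R >= 4.1.0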
We can distinguish between presentation graphics and analytical graphics
Presentation graphics are intended for others and are completed (even if interactive) - see work by Florence Nightingale
Analytical graphics may evolve into presentation graphics, but their main purpose is to visualize the data being analysed (see Antony Unwin’s book and website, and Claus Wilke’s Fundamentals of Data Visualization)
Many of the researchers who have developed approaches to visualization have been involved with Bell Labs, where S came from
As installed, R provides two graphics approaches, one known as base graphics, the other trellis or lattice graphics
Most types of visualization are available for both, but lattice graphics were conceived to handle conditioning, for example to generate matching plots for different categories
Many of these approaches were based on the data-ink ratio, favouring graphics with little or no extraneous detail (Edward Tufte) - see Lukasz Piwek’s blog
There are other approaches, such as Leland Wilkinson’s Grammar of Graphics, implemented in Hadley Wickham’s ggplot2 package, which we will also be using here as its use is growing fast
So there are presentation and analytical graphics, and there can be a number of approaches to how best to communicate information in either of those modes
R can create excellent presentation graphics, or provide input for graphics designers to improve for print or other media
What we need now are the most relevant simple analytical graphics for the kinds of data we use
In early R, all the graphics functionality was in base; graphics was split out of base in 1.9.0 (2004-04-12), and grDevices in 2.0.0 (2004-10-04)
When R starts now, the graphics and grDevices packages are loaded and attached, ready for use
graphics provides the user-level graphical functions and methods, especially the most used plot()
methods that many other packages extend
grDevices provides lower-level interfaces to graphics devices, some of which create files and others display windows
The search() function shows the packages present in the search path, so here we run an instance of R through system() to check the startup status. In RStudio, one will also see "tools:rstudio" in the search path.
cat(system('echo "search()" | R --no-save --vanilla', intern=TRUE)[20:23], sep="\n")
## > search()
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
The PDF (pdf()) and PostScript (postscript()) devices write commonly used vector files
Display formats and raster/pixel file devices must rasterize vector graphics input
Display formats vary by platform: X11() on X11 systems, quartz() on macOS, and windows() on Windows, each available if support was compiled in
png(), jpeg() and tiff() are generally available; svg(), cairo_pdf() and cairo_ps() may also be built
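A minimal sketch of a file device (the file name is arbitrary): open the device, draw, then close it so the file is completed
png("scatter.png", width=480, height=480)
plot(1:10, (1:10)^2)   # drawing goes to the file, not the screen
dev.off()              # closing the device writes the file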
The capabilities() function shows what R itself can offer, including non-graphics capabilities, and we can also check the versions of external software used
capabilities()
## jpeg png tiff tcltk X11 aqua
## TRUE TRUE TRUE TRUE TRUE FALSE
## http/ftp sockets libxml fifo cledit iconv
## TRUE TRUE TRUE TRUE FALSE TRUE
## NLS profmem cairo ICU long.double libcurl
## TRUE FALSE TRUE TRUE TRUE TRUE
grSoftVersion()
## cairo cairoFT pango
## "1.16.0" "" "1.48.4"
## libpng jpeg libtiff
## "1.6.37" "6.2" "LIBTIFF, Version 4.1.0"
When there is no device active, the current device is the null device at position 1; one can open several devices and move between them
In RStudio, the devices used vary by context - if the console is used, RStudioGD will be used, backed by png; in a notebook, small embedded devices appear (apparently PNG inline images). In addition to providing a device, RStudio appears to work quite hard to embed fonts and crop white space on the edges of the graphics file.
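A small sketch of moving between devices in a fresh session (file names are arbitrary):
png("a.png"); png("b.png")   # open two file devices
dev.list()                   # both are listed; the second is current
dev.set(dev.prev())          # make the previous device current
dev.cur()                    # confirm the active device
graphics.off()               # close all open devices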
There are four numeric variables of the same length, read into a data.frame. The summary method shows their characteristics:
jacoby <- read.table("../data/jacoby.txt")
summary(jacoby)
## x1 x2 x3 x4
## Min. :17.50 Min. :21.60 Min. :19.10 Min. :22.10
## 1st Qu.:27.52 1st Qu.:25.32 1st Qu.:25.98 1st Qu.:27.38
## Median :30.00 Median :30.00 Median :28.35 Median :29.60
## Mean :30.00 Mean :30.00 Mean :30.00 Mean :30.00
## 3rd Qu.:32.48 3rd Qu.:34.67 3rd Qu.:34.12 3rd Qu.:30.23
## Max. :42.50 Max. :38.40 Max. :40.90 Max. :51.00
Using the plot method on a single numeric vector puts that variable on the vertical (y) axis, and an observation index on the horizontal (x) axis
plot(jacoby$x1)
Using the plot method on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis
plot(jacoby$x1, jacoby$x2)
By convention, the variable on the x-axis is seen as explaining the variable on the y-axis, so we can use a formula interface (response ~ explanatory), and a data argument to tell the method where to find the vectors by name (column names in the data.frame)
plot(x2 ~ x1, data=jacoby)
In these cases, the default plot shows points, which is reasonable here
The default glyph (symbol) is a small unfilled circle; the pch argument lets us change this (and its colour by col)
The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations
The display includes axis labels, but does not include a title
Here, we’ll manipulate the axis labels, and symbol type, colour and size; the symbol type, colour and size are also vectors, so can vary by observation or for groups of observations
plot(x2 ~ x1, data=jacoby, xlab="First vector",
ylab="Second vector", pch=16, col="#EB811B",
cex=1.2)
Base graphics methods like plot can be added to, drawing more on the open device. We’ll add lines with abline showing the means of the two vectors:
plot(x2 ~ x1, data=jacoby, xlab="First vector",
ylab="Second vector", pch=16, col="#EB811B",
cex=1.2)
abline(v=mean(jacoby$x1), h=mean(jacoby$x2),
lty=2, lwd=2, col="#EB811B")
In lattice or trellis graphics, xyplot() is used for scatterplots; the formula interface is standard:
library(lattice)
xyplot(x2 ~ x1, jacoby)
Extra elements may be manipulated by modifying the panel functions, and will then propagate to all the panels if grouping is used:
xyplot(x2 ~ x1, jacoby, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h = mean(y), v=mean(x), lty = 2,
lwd=2, col="#EB811B")
})
Syntax similar to that of ggplot2 was introduced in latticeExtra, adding layers to the plot object:
xyplot(x2 ~ x1, jacoby) +
latticeExtra::layer(panel.abline(h=mean(y), v=mean(x),
lty = 2, lwd=2, col="#EB811B"))
The ggplot2 package builds graphic output layer by layer, like base graphics, using aes — aesthetics — to say what is being shown. Starting with the jacoby "data.frame" object, we plot points with geom_point, placing variable x1 on the x-axis and a placeholder empty string "" on the other axis
library(ggplot2)
ggplot(jacoby) + geom_point(aes(x=x1, y="")) + xlab("")
Using aes on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis, to make a scatterplot
ggplot(jacoby) + geom_point(aes(x1, x2))
In these cases, the default plot shows points, which is reasonable here
The default glyph (symbol) is a small filled circle; the size argument lets us change its size, and its colour by colour
The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations
The display includes axis labels, but does not include a title
In ggplot2, the functions output (sums of) grobs — grid objects, with a print method
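A quick check: building a ggplot object draws nothing; rendering happens when the print method runs (at top level, typing the object’s name calls print)
g <- ggplot(jacoby) + geom_point(aes(x1, x2))
class(g)
## [1] "gg"     "ggplot"
print(g)   # now the grobs are rendered on the active device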
Here, we’ll change the axis labels, and symbol colour and size; these are also vectors, so can vary by observation or for groups of observations; we turn off the legend which is not relevant here
p <- ggplot(jacoby) + geom_point(aes(x1, x2),
colour="#EB811B", size=2) + xlab("First vector") +
ylab("Second vector") +
theme(legend.position = "none")
p
We’ll add lines with geom_hline and geom_vline showing the means of the two vectors:
p + geom_hline(yintercept=mean(jacoby$x2),
colour="#EB811B", linetype=2) +
geom_vline(xintercept=mean(jacoby$x1),
colour="#EB811B", linetype=2)
We will use stripcharts (p. 6, pp. 30–32) to display all four vectors together on shared axes; now we see why the default symbol is a hollow circle
stripchart(jacoby, pch=1, xlab="Data Values",
ylab="Variable", main="Scatterchart")
We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0) to make the data easier to see
stripchart(jacoby, method="jitter",
jitter=0.05, pch=1,
xlab="Data Values",
ylab="Variable",
main="Scatterchart with jittering")
Lattice stripplots do the same as stripcharts, using a formula interface rather than just an object to choose the correct method. To get started, we need to stack the data in “long” form. It is also possible to use reshape2::melt.
jacobyS <- stack(jacoby)
str(jacobyS, width=45, strict.width="cut")
## 'data.frame': 80 obs. of 2 variables:
## $ values: num 32.3 28 31.4 29.5 40 20 26 ..
## $ ind : Factor w/ 4 levels "x1","x2","x"..
The formula says that ind (y-axis) depends on values (x-axis)
stripplot(ind ~ values, data=jacobyS, jitter.data=TRUE)
The same stripchart display (p. 6, pp. 30–32) in ggplot2 shows all four vectors together on shared axes, using the stacked data
#library(reshape2)
#jacobyS <- melt(jacoby)
p <- ggplot(jacobyS, aes(values, ind))
p + geom_point() + ylab("")
We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0, setting the RNG seed for reproducibility) to make the data easier to see
set.seed(1)
p + geom_point() + ylab("") +
geom_jitter(position=position_jitter(0.1))
The boxplot (pp. 38–43) display places all the vectors on the same axis, here the horizontal axis, places thicker lines at the medians, and boxes representing the interquartile ranges
boxplot(jacoby)
The lattice version of boxplot is bwplot(), the box-and-whiskers plot; again we stay with the counter-intuitive horizontal display
bwplot(values ~ ind, data=jacobyS)
The ggplot2 boxplot display again places all the vectors on the same horizontal axis, with thicker lines at the medians and boxes representing the interquartile ranges
p <- ggplot(jacobyS, aes(ind, values))
p + geom_boxplot() + xlab("")
Histograms (pp. 13–17) are very frequently used to examine data, and are directly comparable to thematic maps — both need chosen break points, which have a start point, an end point, and intermediate points which may be evenly spaced, defining bins or intervals
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
brks <- seq(15,55,by=5)
for (i in 1:4) {
hist(jacoby[,i], breaks=brks, col="grey85",
xlab=paste("x", i, ": seq(15, 55, by=5)", sep=""),
freq=TRUE, main="")
}
par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
hist(jacoby[,i], breaks=seq(17.5,52.5,by=5), col="grey85",
xlab=paste("x", i, ": seq(17.5, 52.5, by=5)", sep=""), freq=TRUE, main="")
}
par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
hist(jacoby[,i], breaks=seq(17.5,52.5,by=2.5), col="grey85",
xlab=paste("x", i, ": seq(17.5, 52.5, by=2.5)", sep=""), freq=TRUE, main="")
}
par(oldpar)
The lattice syntax is perhaps simpler, with default starting points and bin widths driven by the data
histogram(~ values | ind, data=jacobyS,
breaks=seq(15,55,by=5), type="count",
index.cond=list(c(3,4,1,2)))
The ggplot2 histograms (pp. 13–17) need the same chosen break points; we use facet_wrap() to place the plots
ggplot(jacobyS, aes(x=values)) +
geom_histogram(breaks=seq(15, 55, by=5)) +
facet_wrap(~ ind, ncol=2)
When we realise that histograms can be constructed by choice of break points to create an impression that may (or may not) mislead, we can look at smoothed histograms (density plots (pp. 18–30)) as an alternative. Here we use a fixed bandwidth
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
plot(density(jacoby[,i], bw=1.5), main="",
xlim=c(15,55), ylim=c(0, 0.15))
rug(jacoby[,i], ticksize=0.07, lwd=2)
title(main=paste("Smoothed histogram of x",
i, sep=""))
}
par(oldpar)
The lattice approach again uses the formula interface, and a rug by default:
densityplot(~ values | ind, data=jacobyS, bw=1.5,
index.cond=list(c(3,4,1,2)))
By default, ggplot facets fix the scales, but it isn’t clear whether the bandwidth is fixed, and it does not seem to be reported
ggplot(jacobyS, aes(x=values)) +
geom_density(bw=1.5) + geom_rug() +
facet_wrap(~ ind, ncol=2) +
xlim(c(15, 55))
Empirical cumulative distribution functions need no tuning arguments:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
plot(ecdf(jacoby[,i]), main="",
xlim=c(15,55))
title(main=paste("ECDF of x",
i, sep=""))
}
par(oldpar)
We need latticeExtra to present the ECDF:
library(latticeExtra)
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
ecdfplot(~ values | ind, data=jacobyS,
index.cond=list(c(3,4,1,2)))
detach(package:latticeExtra)
In ggplot, the stat_ecdf() operator is used
ggplot(jacobyS, aes(x=values)) +
stat_ecdf() +
facet_wrap(~ ind, ncol=2)
Quantile-quantile plots require the choice of a theoretical distribution, and as Deepayan Sarkar says, the ECDF plot uses a uniform theoretical distribution, so is a Uniform QQ plot in a different orientation:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
x <- qunif(ppoints(100))
for (i in 1:4) {
qqplot(x=x, y=jacoby[,i])
title(main=paste("Uniform QQ of x",
i, sep=""))
}
par(oldpar)
The qqnorm() function gives a Normal QQ plot directly:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
qqnorm(y=jacoby[,i], xlab="", ylab="",
ylim=c(15, 55), main="")
title(main=paste("Normal QQ of x",
i, sep=""))
}
par(oldpar)
In lattice, qqmath() takes a distribution argument:
qqmath(~ values | ind, data=jacobyS,
distribution=qunif,
index.cond=list(c(3,4,1,2)))
so this reproduces qqnorm:
qqmath(~ values | ind, data=jacobyS,
distribution=qnorm,
index.cond=list(c(3,4,1,2)))
In ggplot, the stat_qq() operator is used with a specified distribution function, here Uniform:
ggplot(jacobyS, aes(sample=values)) +
stat_qq(distribution=qunif) +
facet_wrap(~ ind, ncol=2)
and for Normal QQ plots
ggplot(jacobyS, aes(sample=values)) +
stat_qq(distribution=qnorm) +
facet_wrap(~ ind, ncol=2)
There are a number of options for mapping, to which we’ll return later. We can use the plot() method in sf in base graphics:
plot(st_geometry(my_data_5))
or tmap in grid graphics:
library(tmap)
(map_out <- tm_shape(my_data_5) + tm_borders(col="grey"))
tmap also offers interactive mapping, but because this requires transformation of the object to geographical coordinates and on to Web Mercator, all the polygons need to be valid:
all(st_is_valid(st_geometry(my_data_5)))
## [1] FALSE
my_data_5b <- st_make_valid(my_data_5)
Now we can show the polygon borders on a web map:
tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(my_data_5b) + tm_borders()