Required current contributed CRAN packages:

I am running R 4.0.5, with recent update.packages().

needed <- c("zoo")

Script

Script and data at https://github.com/rsbivand/UAM21_I/raw/main/UAM21_I_210505.zip. Download to a suitable location, unzip, and use as the basis.

Seminar introduction

This seminar/course

  • Getting to know R: today we will start fairly thoroughly - probably a challenging beginning

  • Not speed-dating; starting ab ovo or rather ab ova: R has multiple mutations and is better seen as an ecosystem than as an inherently purposed system

  • Once we have the antecedents, we can see how they may affect data structures and their uses in R

  • Further, we’ll be able to make informed choices with respect to use of data structures

  • This course is made up of integrated classes and lab sessions, with the tasks to be carried out tightly linked to the classes

  • It is important to participate in class and in the lab, both with the instructor and with other participants

  • If you get stuck, ask someone; everyone gets stuck, it isn’t embarrassing. When someone else gets stuck, try to help; learning R is not like having your hair done

Reproducible research (R)

  • R is about reproducible research; we learn by doing, and by building on things others have done

  • We can only benefit from things others have done if they are available and if we can show that we get the same results - we can reproduce their work

  • Scripts (recipes) are the basis for this, and can be extended to literate programming by writing text explaining the steps taken

  • The threshold to learning enough markdown to write documents showing what has been done is not high

Schedule

  • Today, background and basic structures; Tuesday, starting spatial data
Monday 5/5
  09.00-12.00  What is R: programming language, community, ecosystem? What may it be used for in analysing spatial data in a social science setting? What are the basic data structures in R? How can we start writing an R Markdown notebook? How to access help in using R? How to use built-in data sets and why? How to write reproducible examples? What can we learn from code examples? How can R help us in furthering reproducible research?
  13.00-16.00  What kinds of data objects are used in R? What is the structure of a data.frame? What is a list object? What kinds of data can be contained in data objects?
Tuesday 6/5
  09.00-12.00  How may we read data into R? From files, including spatial data files, and from online resources? How can we choose between output formats for notebooks and other output media? How can one choose between the basic graphics functions and devices in R?
  13.00-16.00  When our data include spatial data objects, in which ways may they be represented in R? How can one make simple thematic maps using R? (sf, stars, tmap)
Monday 10/5
  09.00-12.00  May we use R “like a GIS?” How may we structure temporal and spatio-temporal data? Closer introduction to R-spatial (sf, stars, gdalcubes, terra, GDAL, GEOS)
  13.00-16.00  Planar and spherical geometries, projections and transformations (s2, PROJ, tmap, mapview, leaflet, geogrid)
Tuesday 11/5
  09.00-12.00  What forms of expression and colour scales are available in R? How can we use class intervals and colour palettes to communicate? Rather than “lying with maps,” how can we explore the impact of choices made in thematic cartography? How can we condition on continuous or discrete variables to permit visual comparison? How can we combine multiple graphical elements in data visualization? (classInt, sf, tmap, mapsf)
  13.00-16.00  Doing things with spatial data … (osmdata, …)
Thursday 20/11
  09.00-12.00  Presentations/consultations/discussion
  13.00-16.00  Presentations/consultations/discussion
  • The underlying aim: to survey contemporary approaches to spatial data structures in R and their handling in context

  • Why in context? Because without the context, some alternatives may seem to be closed off by the presentation narrative

Learning resources

  • Needs for learning resources, and ways of making use of them, vary greatly between participants

  • There are lots of books, but many now present one-size-fits-all solutions that may not be a best fit

  • Some resources in Polish may be found on Przemysław Biecek’s website, including two chapters of his introductory book and ebook versions of course materials

  • Other materials are described on the R site and on CRAN

  • RStudio also provides an online learning page with a number of options; it no longer links to DataCamp, but still suggests swirl

  • R is distributed from mirrors of the comprehensive R archive network (CRAN)

  • The cloud mirror is the easiest, but a local server may be faster

  • RStudio can be downloaded and installed after R has been installed

  • R comes with many contributed packages - the ones we need are on CRAN, which lists them and provides information about each; we’ll get back to contributed packages later

Basic data structures

R as a calculator

R can be a calculator, with output printed by the default method

2+3
## [1] 5
7*8
## [1] 56
3^2
## [1] 9
log(1)
## [1] 0
log10(10)
## [1] 1

We could print explicitly:

print(2+3)
## [1] 5
print(sqrt(2))
## [1] 1.414214
print(sqrt(2), digits=10)
## [1] 1.414213562
print(10^7)
## [1] 1e+07

Exceptions also happen (Inf is infinity, NaN is Not a Number):

log(0)
## [1] -Inf
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
1/0
## [1] Inf
0/0
## [1] NaN

Assignment and object names

We assign the results of operations and functions to named objects with <-, or equivalently =; names begin with a letter or a dot (a dot not followed by a digit):

a <- 2+3
a
## [1] 5
is.finite(a)
## [1] TRUE
a <- log(0)
is.finite(a)
## [1] FALSE
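Syntactic names may contain letters, digits, dots and underscores; as a small added illustration (the names used here are invented for the example), backticks permit otherwise non-syntactic names:

```r
.a <- 2 + 3            # a leading dot is allowed (such objects are hidden from ls() by default)
`my result` <- .a * 2  # backticks permit names that break the usual rules
`my result`
## [1] 10
```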

Vectors

The printed results are prefixed by a curious [1]: it is the index of the first element on the output line, since all these results are vectors of length one. We can combine several objects with c():

a <- c(2, 3)
a
## [1] 2 3
sum(a)
## [1] 5
str(a)
##  num [1:2] 2 3
aa <- rep(a, 50)
aa
##   [1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
##  [38] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
##  [75] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3

The single square brackets [] are used to access or set elements of vectors (the colon : gives an integer sequence); negative indices drop elements:

length(aa)
## [1] 100
aa[1:10]
##  [1] 2 3 2 3 2 3 2 3 2 3
sum(aa)
## [1] 250
sum(aa[1:10])
## [1] 25
sum(aa[-(11:length(aa))])
## [1] 25
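Vectors may also be indexed by logical conditions; a brief added example, rebuilding aa as above:

```r
aa <- rep(c(2, 3), 50)
aa[aa > 2][1:5]  # logical indexing keeps elements where the condition is TRUE
## [1] 3 3 3 3 3
sum(aa > 2)      # in arithmetic, TRUE counts as 1 and FALSE as 0
## [1] 50
```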

Arithmetic under the hood

Infix syntax is just a representation of the actual underlying forms

a[1] + a[2]
## [1] 5
sum(a)
## [1] 5
`+`(a[1], a[2])
## [1] 5
Reduce(`+`, a)
## [1] 5

We’ve done arithmetic on scalars, we can do vector-scalar arithmetic:

sum(aa)
## [1] 250
sum(aa+2)
## [1] 450
sum(aa)+2
## [1] 252
sum(aa*2)
## [1] 500
sum(aa)*2
## [1] 500

But vector-vector arithmetic poses the question of vector length and recycling (the shorter one gets recycled):

v5 <- 1:5
v2 <- c(5, 10)
v5 * v2
## Warning in v5 * v2: longer object length is not a multiple of shorter object
## length
## [1]  5 20 15 40 25
v2_stretch <- rep(v2, length.out=length(v5))
v2_stretch
## [1]  5 10  5 10  5
v5 * v2_stretch
## [1]  5 20 15 40 25
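When the longer length is an exact multiple of the shorter, recycling is silent; a small added example:

```r
v6 <- 1:6
v2 <- c(5, 10)
v6 * v2  # lengths 6 and 2: v2 is recycled three times without a warning
## [1]  5 20 15 40 25 60
```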

Missing values

In working with real data, we often meet missing values, coded by NA, meaning Not Available; the replacement form is.na(aa) <- 5 sets the fifth element of aa to NA:

anyNA(aa)
## [1] FALSE
is.na(aa) <- 5
aa[1:10]
##  [1]  2  3  2  3 NA  3  2  3  2  3
anyNA(aa)
## [1] TRUE
sum(aa)
## [1] NA
sum(aa, na.rm=TRUE)
## [1] 248
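Other functions locate or propagate NA in the same way; an added illustration with invented values:

```r
x <- c(2, 3, NA, 5)
which(is.na(x))      # positions of the missing values
## [1] 3
mean(x)              # NA propagates through arithmetic by default
## [1] NA
mean(x, na.rm=TRUE)
## [1] 3.333333
```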

Exceptions

  • We’ve looked at the simple stuff, when arithmetic and assignment happens as expected

  • A strength of R is the handling of exceptions, which do happen when handling real data, which not infrequently differs from what we thought it was

  • Wanting a result from data is reasonable when the data meet all the requirements

  • If the data do not meet the requirements, we may get unexpected results, warnings or even errors: most often we need to go back and check our input data

Checking data

One way to check our input data is to print in the console - this works with small objects as we’ve seen, but for larger objects we need methods:

big <- 1:(10^5)
length(big)
## [1] 100000
head(big)
## [1] 1 2 3 4 5 6
str(big)
##  int [1:100000] 1 2 3 4 5 6 7 8 9 10 ...
summary(big)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1   25001   50000   50000   75000  100000

Basic vector types

  • There are length, head, str (structure) and summary methods for many types of objects

  • str also gives us a hint of the type of object and its dimensions

  • We’ve seen a couple of uses of str so far, str(a) was num and str(big) was int, what does this signify?

  • They are both numbers, but of different types

  • There are six basic vector types: list, integer, double, logical, character and complex

  • The derived type factor (to which we return shortly) is integer with extra information

  • str reports these as int, num, logi, chr and cplx, and lists are enumerated recursively

  • In RStudio you see more or less the str output in the environment pane as Values in the list view; the grid view adds the object size in memory

  • From early S, we have typeof and storage.mode (including single precision, not used in R) - these are important for interfacing C, C++, Fortran and other languages

  • Beyond this is class, but then the different class systems (S3 and formal S4) complicate things

  • Objects such as vectors may also have attributes in which their class and other information may be placed

  • Typically, a lot of use is made of attributes to squirrel away strings and short vectors

Testing types

is methods are used to test types of objects; note that integers are also seen as numeric:

set.seed(1)
x <- runif(50, 1, 10)
is.numeric(x)
## [1] TRUE
y <- rpois(50, lambda=6)
is.numeric(y)
## [1] TRUE
is.integer(y)
## [1] TRUE
xy <- x < y
is.logical(xy)
## [1] TRUE

Coercion between types

as methods try to convert between object types and are widely used:

str(as.integer(xy))
##  int [1:50] 1 1 0 0 1 0 0 0 1 1 ...
str(as.numeric(y))
##  num [1:50] 6 9 5 4 3 3 5 6 7 5 ...
str(as.character(y))
##  chr [1:50] "6" "9" "5" "4" "3" "3" "5" "6" "7" "5" "9" "5" "6" "5" "7" "4" ...
str(as.integer(x))
##  int [1:50] 3 4 6 9 2 9 9 6 6 1 ...
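Coercion can lose information, so it is worth knowing the rules; a short added sketch:

```r
as.integer(2.9)    # truncates towards zero rather than rounding
## [1] 2
as.integer(-2.9)
## [1] -2
as.numeric("7.5")  # strings that look like numbers convert cleanly
## [1] 7.5
suppressWarnings(as.numeric("abc"))  # a failed coercion yields NA (with a warning)
## [1] NA
```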

Factor, time, encoding

What is a factor?

  • Sometimes character values are just that, not categorical values to be used in handling data

  • Factors are meant to be used for categories, and are stored as an integer vector with values pointing to places in a character vector of levels stored as an attribute of the object

  • Until R 4.0, character data were read into R by default as factors; since R 4.0 the default is stringsAsFactors=FALSE, so conversion to factor must be requested explicitly

  • Having a pre-defined set of indices to level values is very useful for visualization and analysis

  • Ordered factors can be used for ordinal data

Factors

We can retrieve the input character vector by indexing the levels:

gen <- c("female", "male", NA)
fgen <- factor(gen)
str(fgen)
##  Factor w/ 2 levels "female","male": 1 2 NA
nlevels(fgen)
## [1] 2
levels(fgen)
## [1] "female" "male"
as.integer(fgen)
## [1]  1  2 NA
levels(fgen)[as.integer(fgen)]
## [1] "female" "male"   NA

Ordered factors

Ordered factors do not sort the levels alphabetically:

status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- ordered(status, levels=c("Lo", "Med", "Hi"))
ordered.status
## [1] Lo  Hi  Med Med Hi 
## Levels: Lo < Med < Hi
str(ordered.status)
##  Ord.factor w/ 3 levels "Lo"<"Med"<"Hi": 1 3 2 2 3
table(status)
## status
##  Hi  Lo Med 
##   2   1   2
table(ordered.status)
## ordered.status
##  Lo Med  Hi 
##   1   2   2
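Because the levels are ordered, comparison operators and max/min work as expected; an added example continuing the same status data:

```r
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- ordered(status, levels=c("Lo", "Med", "Hi"))
ordered.status > "Lo"  # comparisons respect the declared ordering
## [1] FALSE  TRUE  TRUE  TRUE  TRUE
max(ordered.status)
## [1] Hi
## Levels: Lo < Med < Hi
```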

Encodings

So far, we’ve only met ASCII 7-bit characters, but in many situations, we need more. The default encoding will depend on the locale in which your R session is running - this is my locale:

strsplit(Sys.getlocale(), ";")
## [[1]]
##  [1] "LC_CTYPE=en_GB.UTF-8"       "LC_NUMERIC=C"              
##  [3] "LC_TIME=en_GB.UTF-8"        "LC_COLLATE=en_GB.UTF-8"    
##  [5] "LC_MONETARY=en_GB.UTF-8"    "LC_MESSAGES=en_GB.UTF-8"   
##  [7] "LC_PAPER=en_GB.UTF-8"       "LC_NAME=C"                 
##  [9] "LC_ADDRESS=C"               "LC_TELEPHONE=C"            
## [11] "LC_MEASUREMENT=en_GB.UTF-8" "LC_IDENTIFICATION=C"

In UTF-8, ASCII characters occupy single bytes with the 8th bit unset, while non-ASCII characters are encoded as multi-byte sequences flagged by the 8th bit; in codepage and ISO 8-bit character sets, the 8th bit is part of a single-byte character, but the assignments differ from set to set:

V5 <- c("ł", "ę", "ą", "Ł")
sapply(V5, charToRaw)
##       ł  ę  ą  Ł
## [1,] c5 c4 c4 c5
## [2,] 82 99 85 81
V6 <- iconv(V5, to="CP1250")
sapply(V6, charToRaw)
## \xb3 \xea \xb9 \xa3 
##   b3   ea   b9   a3
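The distinction between characters and bytes matters when counting; a small added example using a Unicode escape (the exact byte count assumes a UTF-8 session):

```r
lch <- "\u0142"           # "ł" written as a Unicode escape
nchar(lch, type="chars")  # one character ...
## [1] 1
nchar(lch, type="bytes")  # ... but two bytes in a UTF-8 locale
```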

Date/time vectors

An aside before we proceed: handling temporal data is confusing. Time is multifaceted, where two of the variants are instantaneous time with data at that time point and interval time with data aggregated over the interval:

now <- Sys.time()
now
## [1] "2021-04-20 14:42:04 CEST"
class(now)
## [1] "POSIXct" "POSIXt"
as.Date(now)
## [1] "2021-04-20"
unclass(now)
## [1] 1618922525

One representation is in seconds since the epoch (with decimal parts of a second), another is in components also including important time zone information (time zone listings are updated regularly):

str(unclass(as.POSIXlt(now)))
## List of 11
##  $ sec   : num 4.86
##  $ min   : int 42
##  $ hour  : int 14
##  $ mday  : int 20
##  $ mon   : int 3
##  $ year  : int 121
##  $ wday  : int 2
##  $ yday  : int 109
##  $ isdst : int 1
##  $ zone  : chr "CEST"
##  $ gmtoff: int 7200
##  - attr(*, "tzone")= chr [1:3] "" "CET" "CEST"

In the social sciences, we are more likely to need annual or monthly representations, but it is useful to be aware that a year can mean status at year end, or an aggregated value accumulated during an interval.

suppressMessages(library(zoo))
as.yearmon(now)
## [1] "Apr 2021"
as.yearqtr(now)
## [1] "2021 Q2"
as.Date("2016-03-01") - 1 # day
## [1] "2016-02-29"
as.Date("2018-03-01") - 1 # day
## [1] "2018-02-28"
seq(as.Date(now), as.Date(now)+12, length.out=4)
## [1] "2021-04-20" "2021-04-24" "2021-04-28" "2021-05-02"

R itself

  • R is as small or large as you like, and runs in many different configurations (no smartphones); the core is written in C

  • The language has developed from S written at Bell Labs NJ, where Unix, C, C++, and scripting technologies were created in the 1970s and 1980s

  • Bell Labs statisticians had a strong focus on graphics and exploratory data analysis from the beginning

  • Many underlying abstractions were established by 1988 and 1992; we’ll get to the data.frame and formula abstractions later

  • An R session records its history - all that is entered at the console prompt - and a workspace containing objects

  • On exiting a session, the history may be saved to a history file, and the workspace may be saved to an RData file; history and chosen objects (or all objects) may be saved manually before exit

  • The workspace is in the memory of the computer, and R itself expects there to be enough memory for all of the data, intermediate and final results

  • Modern R is 64-bit, so limits are most often set by the computer hardware; use can be made of multiple cores to compute in parallel

Using Markdown in R

  • In the RStudio Interactive Development Environment (IDE), it is convenient to use R Markdown to write notebooks (annotated scripts)

  • Chunks of code are run in sequence and may be echoed in the output

  • Output is shown in its right place, including graphics output

  • The document may also be converted to a script, mirroring the weave/tangle - knit/purl duality

  • This presentation is written in Markdown, as we’ll see …

History of R and its data structures

Early R was Scheme via SICP

Ross Ihaka’s description (JSM talk)

Brown Books

S: An Interactive Environment for Data Analysis and Graphics, A.K.A. the Brown Book (R. A. Becker and Chambers 1984); Extending the S System (R. A. Becker and Chambers 1985)

Blue and White Books

The New S Language: A Programming Environment for Data Analysis and Graphics, A.K.A. the Blue Book (Richard A. Becker, Chambers, and Wilks 1988); Statistical Models in S, A.K.A. the White Book (Chambers and Hastie 1992)

Green Book

Programming with Data: A Guide to the S Language, A.K.A. the Green Book (Chambers 1998); S Programming (Venables and Ripley 2000)

S2 to S3 to S4

  • The S2 system was described in the Brown Book, S3 in the Blue Book and completed in the White Book, finally S4 in the Green Book

  • The big advances from S2 to S3 were that users could write functions; that data.frame objects were defined; that formula objects were defined; and that S3 classes and method dispatch appeared

  • S4 brought connections and formal S4 classes, the latter seen in R in the methods package (still controversial)

  • S-PLUS was/is the commercial implementation of S and its releases drove S3 and S4 changes

S, Bell Labs, S-PLUS

  • S was a Bell Labs innovation, like Unix, C, C++, and many interpreted languages (like AWK); many of these share key understandings

  • Now owned by Nokia, previously Alcatel-Lucent, Lucent, and AT&T

  • Why would a telecoms major (AT&T) pay for fundamental research in computer science and data analysis (not to sell or market other products better)?

  • Some Green Book examples are for quality control of telecoms components

S-PLUS and R

  • S-PLUS was quickly adopted for teaching and research, and with S3, provided extensibility in the form of libraries

  • Most links have died by now, but see this FAQ for a flavour - there was a lively community of applied statisticians during the 1990s

  • S built on a long tradition of documentation through examples, with use cases and data sets taken from the applied statistical literature; this let users compare output with methods descriptions

  • … so we get to R

and what about LispStat?

  • Luke Tierney was in R core in 1997, and has continued to exert clear influence over development

  • Because R uses a Scheme engine, similar to Lisp, under the hood, his insight into issues like the garbage collector, namespaces, byte-compilation, serialization, parallelization, and now ALTREP has been crucial (see also the proposal by Luke Tierney, Gabe Becker and Tomas Kalibera)

  • Many of these issues involve the defensive copy on possible change policy involved in lazy evaluation, which may lead to multiple redundant copies of data being present in memory

  • Luke Tierney and Brian Ripley have fought hard to let R load fast, something that is crucial to ease the use of R on multicore systems or inside databases

Roundup: history

  • Many sources in applied statistics with an S-like syntax but Lisp/Scheme-like internals, and sustained tensions between these

  • Many different opinions on preferred ways of structuring data and data handling, opening the way for adaptations to different settings

  • More recently larger commercial interest in handling large input long data sets, previously also present; simulations also generate large output data sets; bioinformatics both wide and long

  • Differing views of the world in terms of goals and approaches

  • Differences provide ecological robustness

Self-help in R

Help and examples

  • In RStudio, the Help tab in the lower right pane (default position) gives access to the R manuals and to the installed packages help pages through the Packages link under Reference

  • In R itself, help pages are available in HTML (browser) and text form; help.start() uses the default browser to display the Manuals, Reference and Miscellaneous Material sections in RStudio’s home help tab

  • The search engine can be used to locate help pages, but is not great if many packages are installed, as no indices are stored

  • The help system needs to be learned in order to provide the user with ways of progressing without wasting too much time

Base help system

  • The base help system does not tell you how to use R as a system, about packages not installed on your machine, or about R as a community

  • It does provide information about functions, methods and (some) classes in base R and in contributed packages installed on your machine

  • We’ll cover these first, then go on to look at vignettes, R Journal, task views, online help pages, and the blog aggregator

  • There are different requirements with regard to help systems - in R, the help pages of base R are expected to be accurate although terse

Help pages

  • Each help page provides a short description of the functions, methods or classes it covers; some pages cover more than one such

  • Help pages are grouped by package, so that the browser-based system is not easy to browse if you do not know which package a function belongs to

  • The usage of the function is shown explicitly, including any defaults for arguments to functions or methods

  • Each argument is described, showing names and types; in addition details of the description are given, together with the value returned

Interactive use of help pages

  • Rather than starting from the packages hierarchy of help pages, users most often use the help function

  • The function takes the name of the function about which we need help; the name may be given in quotation marks, and class names, which contain a hyphen, must be quoted

  • Instead of using say help(help), we can shorten to the question mark operator: ?help

  • Occasionally, several packages offer different functions with the same name, and we may be offered a choice; we can disambiguate by putting the package name and two colons before the function name

Function arguments

  • In the usage section, function arguments are shown by name and order; the args function returns information

  • In general, if arguments are given by name, their order is arbitrary; unnamed arguments are matched by position, so for them order matters

  • Some arguments do not have default values and are probably required, although some are guessed if missing

  • Being explicit about the names of arguments and the values they take is helpful in scripting and reproducible research

  • The ellipsis ... indicates that the function itself examines objects passed to see what to do
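The args function and named arguments can be seen together; a brief added example using sd from the stats package:

```r
args(sd)  # shows the usage section: argument names and defaults
## function (x, na.rm = FALSE) 
## NULL
sd(na.rm=TRUE, x=c(1, 2, NA, 4))  # with names given, argument order is free
## [1] 1.527525
```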

Tooltips and completion

  • The regular R console does not provide tooltips, that is, a pop-up first offering alternative function or object names as you type, then lists of argument names

  • RStudio, like many IDEs, does provide this, controlled by Tools -> Global options -> Code -> Completion (by default it is operative)

  • This may be helpful or not, depending on your style of working; if you find it helpful, fine, if not, you can make it less invasive under Global options

  • Other IDEs also provide this facility, which builds directly on the usage sections of help pages of functions in installed packages

Coherence code/documentation

  • Base R has a set of checks and tests that ensure coherence between the code itself and the usage sections in help pages

  • These mechanisms are used in checking contributed packages before they are released through the archive network; the description of arguments on help pages must match the function definition

  • It is also possible to generate help pages documenting functions automatically, for example using the roxygen2 package

  • It is important to know that we can rely on this coherence

Returned values

  • The objects returned by functions are also documented on help pages, but the coherence of the description with reality is harder to check

  • This means that use of str or other functions or methods may be helpful when we want to look inside the returned object

  • The form taken by returned values will often also vary, depending on the arguments given

  • Most help pages address this issue not by writing more about the returned values, but by using the examples section to highlight points of potential importance for the user

Examples

  • Reading the examples section on the help page is often enlightening, but we do not need to copy and paste

  • The example function runs those parts of the code in the examples section of a help page that are not tagged \dontrun - this can be overridden, but may involve meeting conditions not met on your machine

  • This code is run nightly on CRAN servers on multiple operating systems and using released, patched and development versions of R, so checking both packages and the three versions of R

  • Some examples use data given verbatim, but many use built-in data sets; most packages also provide data sets to use for running examples

Built-in data sets

  • This means that the examples and the built-in data sets are a most significant resource for learning how to solve problems with R

  • Very often, one recognizes classic textbook data sets from the history of applied statistics; contemporary text book authors often publish collections of data sets as packages on CRAN

  • The built-in data sets also have help pages, describing their representation as R objects, and their licence and copyright status

  • These help pages also often include an examples section showing some of the analyses that may be carried out using them

  • One approach that typically works well when you have a data set of your own, but are unsure how to proceed, is to find a built-in data set that resembles the real one, and play with that first

  • The built-in data sets are often quite small, and if linked to text books, they are well described there as well as in the help pages

  • By definition, the built-in data sets do not have to be imported into R, as they are almost always stored as files of R objects

  • In some cases, these data sets are stored in external file formats, most often to show how to read those formats

  • The built-in data sets in the base datasets package are in the search path, but data sets in other packages should be loaded using the data() function:

str(Titanic)
##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"
library(MASS)
data(deaths)
str(deaths)
##  Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...

Vignettes

  • At about the time that literate programming arrived in R with Sweave and Stangle - we mostly use knitr now - the idea arose of supplementing package documentation with example workflows

  • Vignettes are PDF or HTML documents with accompanying runnable R code that describe how to carry out particular sequences of operations

  • The RStudio packages help tab package index file shows user guides, package vignettes and other documentation

  • The vignette() function can be used to list vignettes by installed package, and to open the chosen vignette in a PDF reader

  • A very typical way of using vignettes on a machine with enough screen space is to read the document and run the code from the R file at the same time

  • Assign the output of vignette to an object; the print method shows the PDF or HTML, the edit method gives direct access to the underlying code for copy and paste

  • The help system in RStudio provides equivalent access to vignette documents and code

  • Papers about R contributed packages published in the Journal of Statistical Software and the R Journal are often constructed in this way too

Task views

  • As R has developed, the number of packages on CRAN has grown (other packages are on BioConductor and github)

  • CRAN task views were introduced to try to provide some subject area guidance

  • They remain terse, and struggle to keep up, but are still worth reviewing

  • Note that those working in different subject areas often see things rather differently, leading to subject specific treatment of intrinsically similar themes

Online help pages

  • The help system and vignettes were designed to be used offline, so that the versions of R and installed packages matched the documentation

  • If you search online for information about functions in R or in contributed packages, you often reach third-party copies of the help pages, such as RDocumentation or rdrr.io (the inside-R site, once sponsored by Revolution Analytics, has been retired)

  • Help pages may also be viewed online from your chosen CRAN mirror; package pages provide these (Reference manual) and vignettes as links

  • Remember to check that the versions of your installed software and the online documentation are the same

R communities

  • The R community has become a number of linked communities rather than a coherent and hierarchical whole

  • As in many open source projects, the R project is more bazaar than cathedral; think of niches in ecosystems with differing local optima in contrast to a master plan

  • One style is based on mailing lists, in which an issue raised by an original poster is resolved later in that thread

  • Another style is to use online fora, such as StackOverflow, which you need to visit rather than receiving messages in your inbox

  • There are now many blogs involving the use of R, fortunately aggregated at R-bloggers, where other resources may also be found

  • New aggregated blog topics are linked to a Twitter account, so if you want, you too can be bombarded by notifications

  • These are also a potential source of project ideas, especially because some claims should be challenged

  • R Users Groups and R Ladies provide face-to-face meeting places that many value

R Consortium

  • R started as a teaching tool for applied statistics, but this community model has been complemented by others

  • R is now widely used in business, public administration and voluntary organizations for data analysis and visualization

  • The R Consortium was created in 2015 as a vehicle for companies with relationships to R

  • R itself remains under the control of the R Foundation, which is still mostly academic in flavour

Combining data structures

List, data.frame, matrix, array

The data frame object

  • First, let us see what is behind the data.frame object: the list object

  • list objects are vectors that contain other objects, which can be addressed by name or by 1-based indices

  • Like the vectors we have already met, lists can be accessed and manipulated using square brackets []

  • Single list elements can be accessed and manipulated using double square brackets [[]]

List objects

Starting with four vectors of differing types, we can assemble a list object; as we see, its structure is quite simple. The vectors in the list may vary in length, and lists can (and often do) include lists

V1 <- 1:3
V2 <- letters[1:3]
V3 <- sqrt(V1)
V4 <- sqrt(as.complex(-V1))
L <- list(v1=V1, v2=V2, v3=V3, v4=V4)
str(L)
## List of 4
##  $ v1: int [1:3] 1 2 3
##  $ v2: chr [1:3] "a" "b" "c"
##  $ v3: num [1:3] 1 1.41 1.73
##  $ v4: cplx [1:3] 0+1i 0+1.41i 0+1.73i
L$v3[2]
## [1] 1.414214
L[[3]][2]
## [1] 1.414214

Data Frames

Our list object contains four vectors of different types but of the same length; conversion to a data.frame is convenient. Note that since R 4.0, strings are no longer converted into factors by default (stringsAsFactors=FALSE), so the explicit argument below changes nothing:

DF <- as.data.frame(L)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: chr  "a" "b" "c"
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i
DF <- as.data.frame(L, stringsAsFactors=FALSE)
str(DF)
## 'data.frame':    3 obs. of  4 variables:
##  $ v1: int  1 2 3
##  $ v2: chr  "a" "b" "c"
##  $ v3: num  1 1.41 1.73
##  $ v4: cplx  0+1i 0+1.41i 0+1.73i

We can also provoke an error in conversion from a valid list made up of vectors of different length to a data.frame:

V2a <- letters[1:4]
V4a <- factor(V2a)
La <- list(v1=V1, v2=V2a, v3=V3, v4=V4a)
DFa <- try(as.data.frame(La, stringsAsFactors=FALSE), silent=TRUE)
message(DFa)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE,  : 
##   arguments imply differing number of rows: 3, 4

We can access data.frame elements as list elements, where the $ is effectively the same as [[]] with the list component name as a string:

DF$v3[2]
## [1] 1.414214
DF[[3]][2]
## [1] 1.414214
DF[["v3"]][2]
## [1] 1.414214

Since a data.frame is a rectangular object whose named columns all have the same number of rows, it can also be indexed like a matrix, with the rows as the first index and the columns (variables) as the second:

DF[2, 3]
## [1] 1.414214
DF[2, "v3"]
## [1] 1.414214
str(DF[2, 3])
##  num 1.41
str(DF[2, 3, drop=FALSE])
## 'data.frame':    1 obs. of  1 variable:
##  $ v3: num 1.41
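Leaving one index empty selects whole rows or columns under the matrix convention; a small sketch, using a data frame built like DF above:

```r
DF <- data.frame(v1 = 1:3, v2 = letters[1:3], v3 = sqrt(1:3))
DF[2, ]          # one row: still a data.frame
DF[, "v3"]       # one column: dropped to a numeric vector by default
DF[DF$v1 > 1, ]  # a logical vector selects the matching rows
```

Note the asymmetry: selecting a single row keeps the data.frame class (the columns differ in type), while selecting a single column drops to a vector unless drop=FALSE is given.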

If we coerce a data.frame containing a character vector or factor into a matrix, all columns are converted to character; if we extract only the integer and numeric columns first, we get a numeric matrix.

as.matrix(DF)
##      v1  v2  v3         v4           
## [1,] "1" "a" "1.000000" "0+1.000000i"
## [2,] "2" "b" "1.414214" "0+1.414214i"
## [3,] "3" "c" "1.732051" "0+1.732051i"
as.matrix(DF[,c(1,3)])
##      v1       v3
## [1,]  1 1.000000
## [2,]  2 1.414214
## [3,]  3 1.732051

The fact that data.frame objects descend from list objects is shown by looking at their lengths; the length of a matrix is not its number of columns, but its element count:

length(L)
## [1] 4
length(DF)
## [1] 4
length(as.matrix(DF))
## [1] 12

There are dim methods for data.frame objects and matrices (and arrays with more than two dimensions); matrices and arrays are seen as vectors with dimensions; list objects have no dimensions:

dim(L)
## NULL
dim(DF)
## [1] 3 4
dim(as.matrix(DF))
## [1] 3 4
str(as.matrix(DF))
##  chr [1:3, 1:4] "1" "2" "3" "a" "b" "c" "1.000000" "1.414214" "1.732051" ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "v1" "v2" "v3" "v4"

data.frame objects have names and row.names, matrices have dimnames, colnames and rownames; all can be used for setting new values:

row.names(DF)
## [1] "1" "2" "3"
names(DF)
## [1] "v1" "v2" "v3" "v4"
names(DF) <- LETTERS[1:4]
names(DF)
## [1] "A" "B" "C" "D"
str(dimnames(as.matrix(DF)))
## List of 2
##  $ : NULL
##  $ : chr [1:4] "A" "B" "C" "D"

R objects have attributes that are not normally displayed, but which show their structure and class (if any); we can see that data.frame objects are quite different internally from matrices:

str(attributes(DF))
## List of 3
##  $ names    : chr [1:4] "A" "B" "C" "D"
##  $ class    : chr "data.frame"
##  $ row.names: int [1:3] 1 2 3
str(attributes(as.matrix(DF)))
## List of 2
##  $ dim     : int [1:2] 3 4
##  $ dimnames:List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "A" "B" "C" "D"
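The class attribute is what makes a list behave as a data.frame; removing it with unclass() exposes the underlying list. A minimal sketch:

```r
DF <- data.frame(v1 = 1:3, v2 = letters[1:3])
class(DF)          # "data.frame"
L0 <- unclass(DF)  # drop the class attribute
is.list(L0)        # TRUE: a plain named list (row.names survives as an attribute)
```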

If the reason for differing vector lengths is that one or more observations are missing on a variable, NA should be used to fill the gaps; the lengths are then equal, and a rectangular table can be created:

V1a <- c(V1, NA)
V3a <- sqrt(V1a)
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a)
DFa <- as.data.frame(La, stringsAsFactors=FALSE)
str(DFa)
## 'data.frame':    4 obs. of  4 variables:
##  $ v1: int  1 2 3 NA
##  $ v2: chr  "a" "b" "c" "d"
##  $ v3: num  1 1.41 1.73 NA
##  $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
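Missing values propagate through computations; na.rm= arguments and complete.cases() are the usual tools for handling them. A small sketch, assuming a padded data frame like DFa above:

```r
DFa <- data.frame(v1 = c(1:3, NA), v3 = sqrt(c(1:3, NA)))
is.na(DFa$v1)               # which elements are missing
mean(DFa$v1)                # NA: missing values propagate
mean(DFa$v1, na.rm = TRUE)  # 2: drop NAs before computing
complete.cases(DFa)         # rows with no missing values
```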

Encodings do not affect representation within the R workspace, but are a real problem for reading and writing data:

La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a, v5=V5, v6=V6)
DFa <- as.data.frame(La)
str(DFa)
## 'data.frame':    4 obs. of  6 variables:
##  $ v1: int  1 2 3 NA
##  $ v2: chr  "a" "b" "c" "d"
##  $ v3: num  1 1.41 1.73 NA
##  $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
##  $ v5: chr  "ł" "ę" "ą" "Ł"
##  $ v6: chr  "\xb3" "\xea" "\xb9" "\xa3"
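When reading such data, iconv() can convert between declared encodings, for example from ISO 8859-2 (latin2) bytes like those in v6 to UTF-8; a sketch with a single byte (encoding names such as "latin2" are assumed available, which may vary by platform):

```r
x <- "\xb3"  # in ISO 8859-2, the byte 0xB3 is the Polish letter ł
y <- iconv(x, from = "latin2", to = "UTF-8")
nchar(y)     # one character once the encoding is correctly declared
Encoding(y)  # "UTF-8"
```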

R’s sessionInfo()

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 33 (Workstation Edition)
## 
## Matrix products: default
## BLAS:   /home/rsb/topics/R/R405-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R405-share/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] MASS_7.3-53.1 zoo_1.8-9    
## 
## loaded via a namespace (and not attached):
##  [1] lattice_0.20-41   digest_0.6.27     grid_4.0.5        R6_2.5.0         
##  [5] jsonlite_1.7.2    magrittr_2.0.1    evaluate_0.14     rlang_0.4.10     
##  [9] stringi_1.5.3     jquerylib_0.1.3   bslib_0.2.4       rmarkdown_2.7    
## [13] tools_4.0.5       stringr_1.4.0     xfun_0.22         yaml_2.2.1       
## [17] compiler_4.0.5    htmltools_0.5.1.1 knitr_1.32        sass_0.3.1