Required current contributed CRAN packages:

I am running R 3.6.2, with recent update.packages().

needed <- c("cartography", "colorblindr", "colorspace", "viridis",
            "viridisLite", "RColorBrewer", "tmap", "sf",
            "rnaturalearthdata", "rnaturalearth", "matrixStats",
            "ISOcodes", "ggplot2")

Script

Script and data at https://github.com/rsbivand/BAN422_V20/raw/master/ban422_v20_mon.zip. Download to suitable location, unzip and use as basis.

Seminar introduction

This seminar/course

  • Visualization is one of the key strengths of R

  • However, visualization involves many choices, and R offers such a wide range of choices that guidance is desirable

  • There are two major underlying technologies, base graphics and grid graphics, and several toolboxes built on these (trellis graphics and grammar of graphics on grid graphics), in addition to JavaScript widgets for interaction

  • In addition, much work has been done on the effective use of shapes and colours.

Projects

  • The group projects provide learning opportunities for exploring visualization, through comparison of implementation and maybe perception mini-experiments using nullabor

  • Similar topics may be gleaned from R-bloggers and its Twitter feed; some of the claims deserve to be checked

  • The aim is not to find winners, but to explore alternatives

  • Thursday group work-day, Friday presentation/feedback day, hand in via WiseFlow by Friday 31 January 14:00.

Schedule

  • Today, we’ll borrow two talks from the Ihaka lecture series at Auckland University from March 2018, review simple visualizations, and follow up some of the data used in the second talk by Dianne Cook

  • Tuesday: binning, colours and conditioning, base graphics and graphics devices; Wednesday: grid graphics with lattice and ggplot2, interactive graphics, dashboards

  • The underlying aim: to survey contemporary approaches to visualization using R

  • Your (group) projects are a key part of the seminar

Time         Topic
Monday 13/1
08.15-10.00  Introduction - Visualization in R - basics
10.15-11.00  Ihaka lecture 2018 Alberto Cairo
12.15-13.00  Ihaka lecture 2018 Dianne Cook
13.15-15.00  Reproducing PISA graphics, use of colour palettes
Tuesday 14/1
08.15-10.00  Conditioning and facets, choice of binning intervals in visualization
10.15-11.00  Practice
12.15-14.00  Base graphics structure and graphics devices
14.15-15.00  Practice
Wednesday 15/1
08.15-10.00  Grid graphics structure, lattice, ggplot2, etc.
10.15-11.00  Practice
12.15-14.00  Interactive graphics challenges, plotly, googleVis, Dashboards and shiny
14.15-15.00  Practice
Thursday 16/1
09.00-16.00  Group work day (not in aud. C, surgery in SAM-D217)
Friday 17/1
08.15-10.00  Presentations
10.15-11.00  Presentations
12.00-13.45  Presentations

R itself

  • R is as small or large as you like, and runs in many different configurations (no smartphones); the core is written in C

  • The language has developed from S written at Bell Labs NJ, where Unix, C, C++, and scripting technologies were created in the 1970s and 1980s

  • Bell Labs statisticians had a strong focus on graphics and exploratory data analysis from the beginning

  • Many underlying abstractions were established by 1988 and 1992 (the S "Blue Book" and "White Book"); we’ll get to the data.frame and formula abstractions later

  • An R session records its history - all that is entered at the console prompt - and a workspace containing objects

  • On exiting a session, the history may be saved to a history file, and the workspace may be saved to an RData file; history and chosen objects (or all objects) may be saved manually before exit

  • The workspace is in the memory of the computer, and R itself expects there to be enough memory for all of the data, intermediate and final results

  • Modern R is 64-bit, so limits are most often set by the computer hardware; use can be made of multiple cores to compute in parallel
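
For example, history and chosen objects may be saved by hand, and the core count queried (a sketch; the file and object names are arbitrary):

savehistory("session.Rhistory")             # console history so far
save.image("workspace.RData")               # all objects in the workspace
# save(obj_a, obj_b, file="chosen.RData")   # chosen objects (hypothetical names)
parallel::detectCores()                     # cores available for parallel computing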

Using Markdown in R

  • In the RStudio Interactive Development Environment (IDE), it is convenient to use R Markdown to write notebooks (annotated scripts)

  • Chunks of code are run in sequence and may be echoed in the output

  • Output is shown in its right place, including graphics output

  • The document may also be converted to a script, mirroring the weave/tangle - knit/purl duality

  • This presentation is written in Markdown, as we’ll see …
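
A minimal hypothetical notebook sketch; chunks are fenced with backticks, run in sequence, and may be echoed. knitr::purl("example.Rmd") would extract ("tangle") just the code as a script, and rmarkdown::render("example.Rmd") would knit the full document:

---
title: "Example notebook"
output: html_document
---

```{r summary-chunk, echo=TRUE}
summary(cars)  # cars is a built-in data set
```

```{r plot-chunk, fig.height=4}
plot(dist ~ speed, data=cars)
```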

Ancillary information

Help, examples and built-in datasets in R

  • In RStudio, the Help tab in the lower right pane (default position) gives access to the R manuals and to the installed packages help pages through the Packages link under Reference

  • In R itself, help pages are available in HTML (browser) and text form; help.start() uses the default browser to display the Manuals, Reference and Miscellaneous Material sections in RStudio’s home help tab

  • The search engine can be used to locate help pages, but is not great if many packages are installed, as no indices are stored

  • The help system needs to be learned in order to provide the user with ways of progressing without wasting too much time

Base help system

  • The base help system does not tell you how to use R as a system, about packages not installed on your machine, or about R as a community

  • It does provide information about functions, methods and (some) classes in base R and in contributed packages installed on your machine

  • We’ll cover these first, then go on to look at vignettes, R Journal, task views, online help pages, and the blog aggregator

  • There are different requirements with regard to help systems - in R, the help pages of base R are expected to be accurate although terse

Help pages

  • Each help page provides a short description of the functions, methods or classes it covers; some pages cover more than one of these

  • Help pages are grouped by package, so that the browser-based system is not easy to browse if you do not know which package a function belongs to

  • The usage of the function is shown explicitly, including any defaults for arguments to functions or methods

  • Each argument is described, showing names and types; in addition details of the description are given, together with the value returned

Interactive use of help pages

  • Rather than starting from the packages hierarchy of help pages, users most often use the help function

  • The function takes the name of the function about which we need help; the name may be given in quotation marks, and class names contain a hyphen so must be quoted

  • Instead of using say help(help), we can shorten to the question mark operator: ?help

  • Occasionally, several packages offer different functions with the same name, and we may be offered a choice; we can disambiguate by putting the package name and two colons before the function name
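
For example, all of these reach help pages in base R (a sketch):

help("help")                      # same as help(help)
?help                             # question mark shorthand
?"for"                            # reserved words must be quoted
help("plot", package="graphics")  # choose the package explicitly
?graphics::plot                   # two-colon disambiguation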

Function arguments

  • In the usage section, function arguments are shown by name and order; the args function displays a function’s formal arguments and their default values (see the sketch after this list)

  • In general, if arguments are given by name, the order is arbitrary, but if names are not used at least sometimes, order matters

  • Some arguments do not have default values and are probably required, although some are guessed if missing

  • Being explicit about the names of arguments and the values they take is helpful in scripting and reproducible research

  • The ellipsis ... indicates that the function itself examines objects passed to see what to do
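
A sketch using rnorm, a base function, shows how named arguments free us from argument order:

args(rnorm)                      # function (n, mean = 0, sd = 1)
x <- rnorm(5, 0, 2)              # positional: n, mean, sd
y <- rnorm(n=5, sd=2, mean=0)    # named: order is arbitrary
z <- rnorm(5, sd=2)              # mean falls back to its default 0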

Tooltips and completion

  • The regular R console does not provide tooltips (a pop-up that first offers alternative function or object names as you type, then lists of argument names)

  • RStudio, like many IDEs, does provide this, controlled by Tools -> Global options -> Code -> Completion (by default it is operative)

  • This may be helpful or not, depending on your style of working; if you find it helpful, fine, if not, you can make it less invasive under Global options

  • Other IDEs also provide this facility, which builds directly on the usage sections of the help pages of functions in installed packages

Coherence code/documentation

  • Base R has a set of checks and tests that ensure coherence between the code itself and the usage sections in help pages

  • These mechanisms are used in checking contributed packages before they are released through the archive network (CRAN); the description of arguments on help pages must match the function definition

  • It is also possible to generate help pages documenting functions automatically, for example using the roxygen2 package

  • It is important to know that we can rely on this coherence
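
As an illustration of the roxygen2 point above, a hypothetical function documented with roxygen2 comments, from which a help page with coherent usage and arguments sections can be generated:

# a hypothetical function, not from the seminar code
#' Add two numeric vectors
#'
#' @param x A numeric vector.
#' @param y A numeric vector the same length as \code{x}.
#' @return The elementwise sum of \code{x} and \code{y}.
#' @examples
#' add_xy(1:3, 4:6)
add_xy <- function(x, y) x + y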

Returned values

  • The objects returned by functions are also documented on help pages, but the coherence of the description with reality is harder to check

  • This means that use of str or other functions or methods may be helpful when we want to look inside the returned object

  • The form taken by returned values will often also vary, depending on the arguments given

  • Most help pages address this issue not by writing more about the returned values, but by using the examples section to highlight points of potential importance for the user
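
For instance, str and names let us look inside a returned object (a sketch using the built-in cars data set):

fit <- lm(dist ~ speed, data=cars)  # returns an object of class "lm"
names(fit)                          # component names of the returned object
str(fit, max.level=1)               # terse overview of each component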

Examples

  • Reading the examples section on the help page is often enlightening, but we do not need to copy and paste

  • The example function runs those parts of the code in the examples section of a function’s help page that are not tagged \dontrun{} - this can be overridden, but may involve meeting conditions not met on your machine

  • This code is run nightly on CRAN servers on multiple operating systems and using released, patched and development versions of R, so both the packages and the three versions of R are checked

  • Some examples use data given verbatim, but many use built-in data sets; most packages also provide data sets to use for running examples
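
The example function can be run directly (a sketch):

example(mean)   # runs the examples from help(mean)
example("seq")  # the name may also be quoted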

Built-in data sets

  • This means that the examples and the built-in data sets are a most significant resource for learning how to solve problems with R

  • Very often, one recognizes classic textbook data sets from the history of applied statistics; contemporary text book authors often publish collections of data sets as packages on CRAN

  • The built-in data sets also have help pages, describing their representation as R objects, and their licence and copyright status

  • These help pages also often include an examples section showing some of the analyses that may be carried out using them

  • One approach that typically works well when you have a data set of your own, but are unsure how to proceed, is to find a built-in data set that resembles the real one, and play with that first

  • The built-in data sets are often quite small, and if linked to text books, they are well described there as well as in the help pages

  • By definition, the built-in data sets do not have to be imported into R, as they are almost always stored as files of R objects

  • In some cases, these data sets are stored in external file formats, most often to show how to read those formats

  • The built-in data sets in the base datasets package are in the search path, but data sets in other packages should be loaded using the data() function:

str(Titanic)
##  'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
##  - attr(*, "dimnames")=List of 4
##   ..$ Class   : chr [1:4] "1st" "2nd" "3rd" "Crew"
##   ..$ Sex     : chr [1:2] "Male" "Female"
##   ..$ Age     : chr [1:2] "Child" "Adult"
##   ..$ Survived: chr [1:2] "No" "Yes"
library(MASS)
data(deaths)
str(deaths)
##  Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...

Vignettes

  • At about the time that literate programming arrived in R with Sweave and Stangle - we mostly use knitr now - the idea arose of supplementing package documentation with example workflows

  • Vignettes are PDF or HTML documents with accompanying runnable R code that describe how to carry out particular sequences of operations

  • The package index page in the RStudio Help tab shows user guides, package vignettes and other documentation

  • The vignette() function can be used to list vignettes by installed package, and to open the chosen vignette in a PDF reader

  • A very typical way of using vignettes on a machine with enough screen space is to read the document and run the code from the R file at the same time

  • Assign the output of vignette to an object; the print method shows the PDF or HTML, the edit method gives direct access to the underlying code for copy and paste

  • The help system in RStudio provides equivalent access to vignette documents and code

  • Papers about R contributed packages published in the Journal of Statistical Software and the R Journal are often constructed in this way too
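
A sketch of that workflow; the vignette name "sf1" in the sf package is an assumption, so list what is installed first:

vignette()                          # list vignettes in attached packages
vignette(package="sf")              # vignettes in a chosen package
v <- vignette("sf1", package="sf")  # assumed vignette name
print(v)                            # open the document (PDF or HTML)
edit(v)                             # access the underlying code for copy and paste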

Task views

  • As R has developed, the number of packages on CRAN has grown (other packages are on Bioconductor and GitHub)

  • CRAN task views were introduced to try to provide some subject area guidance

  • They remain terse, and struggle to keep up, but are still worth reviewing

  • Note that those working in different subject areas often see things rather differently, leading to subject specific treatment of intrinsically similar themes
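
The contributed ctv package (an extra install, not in the needed list above) automates working with task views:

install.packages("ctv")
ctv::available.views()          # list the current task views
ctv::install.views("Graphics")  # install all packages in the Graphics view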

Online help pages

  • The help system and vignettes were designed to be used offline, so that the versions of R and installed packages matched the documentation

  • If you search online for information about functions in R or in contributed packages, you often reach inside-R, sponsored by Revolution Analytics

  • Help pages may also be viewed online from your chosen CRAN mirror; package pages provide these (Reference manual) and vignettes as links

  • Remember to check that the versions of your installed software and the online documentation are the same

R communities

  • The R community has become a number of linked communities rather than a coherent and hierarchical whole

  • As in many open source projects, the R project is more bazaar than cathedral; think of niches in ecosystems with differing local optima in contrast to a master plan

  • One style is based on mailing lists, in which an issue raised by an original poster is resolved later in that thread

  • Another style is to use online fora, such as StackOverflow, which you need to visit rather than receiving messages in your inbox

  • There are now many blogs involving the use of R, fortunately aggregated at R-bloggers, where other resources may also be found

  • New aggregated blog topics are linked to a Twitter account, so if you want, you too can be bombarded by notifications

  • These are also a potential source of project ideas, especially because some claims should be challenged

  • R Users Groups and R Ladies provide face-to-face meeting places that many value

R Consortium

  • R started as a teaching tool for applied statistics, but this community model has been complemented by others

  • R is now widely used in business, public administration and voluntary organizations for data analysis and visualization

  • The R Consortium was created in 2015 as a vehicle for companies with relationships to R

  • R itself remains under the control of the R Foundation, which is still mostly academic in flavour

Simple visualizations

Graphics and visualization in R

  • We can distinguish between presentation graphics and analytical graphics

  • Presentation graphics are intended for others and are completed (even if interactive) - see work by Florence Nightingale

  • Analytical graphics may evolve into presentation graphics, but their main purpose is to visualize the data being analysed (see Antony Unwin’s book and website, and Claus Wilke’s Fundamentals of Data Visualization)

  • Many of the researchers who have developed approaches to visualization have been involved with Bell Labs, where S came from

  • As installed, R provides two graphics approaches, one known as base graphics, the other trellis or lattice graphics

  • Most types of visualization are available for both, but lattice graphics were conceived to handle conditioning, for example to generate matching plots for different categories

  • Many of these were based on the data-ink ratio, favouring graphics with little or no extraneous detail (Edward Tufte) - see Lukasz Piwek’s blog

  • There are other approaches, such as Leland Wilkinson’s Grammar of Graphics, implemented in Hadley Wickham’s ggplot2 package, which we will also be using here as its use is growing fast

  • So there are presentation and analytical graphics, and there can be a number of approaches to how best to communicate information in either of those modes

  • R can create excellent presentation graphics, or provide input for graphics designers to improve for print or other media

  • What we need now are the most relevant simple analytical graphics for the kinds of data we use

Jacoby: synthetic data and methods

There are four numeric variables of the same length, read into a data.frame. The summary method shows their characteristics:

jacoby <- read.table("data/jacoby.txt")
summary(jacoby)
##        x1              x2              x3              x4       
##  Min.   :17.50   Min.   :21.60   Min.   :19.10   Min.   :22.10  
##  1st Qu.:27.52   1st Qu.:25.32   1st Qu.:25.98   1st Qu.:27.38  
##  Median :30.00   Median :30.00   Median :28.35   Median :29.60  
##  Mean   :30.00   Mean   :30.00   Mean   :30.00   Mean   :30.00  
##  3rd Qu.:32.48   3rd Qu.:34.67   3rd Qu.:34.12   3rd Qu.:30.23  
##  Max.   :42.50   Max.   :38.40   Max.   :40.90   Max.   :51.00

Using the plot method on a single numeric vector puts that variable on the vertical (y) axis, and an observation index on the horizontal (x) axis

plot(jacoby$x1)

Simple plot methods

Using the plot method on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis

plot(jacoby$x1, jacoby$x2)

By convention, the variable on the x-axis is seen as causing the variable on the y-axis, so we can use a formula interface, and a data argument to tell the method where to find the vectors by name (column names in the data.frame)

plot(x2 ~ x1, data=jacoby)

  • In these cases, the default plot shows points, which is reasonable here

  • The default glyph (symbol) is a small unfilled circle, the pch argument lets us change this (and its colour by col)

  • The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations

  • The display includes axis labels, but does not include a title

Here, we’ll manipulate the axis labels, and symbol type, colour and size; the symbol type, colour and size are also vectors, so can vary by observation or for groups of observations

plot(x2 ~ x1, data=jacoby, xlab="First vector",
  ylab="Second vector", pch=16, col="#EB811B",
  cex=1.2)

Base graphics methods like plot can be added to, drawing more on the open device. We’ll add lines with abline showing the means of the two vectors:

plot(x2 ~ x1, data=jacoby, xlab="First vector",
  ylab="Second vector", pch=16, col="#EB811B",
  cex=1.2)
abline(v=mean(jacoby$x1), h=mean(jacoby$x2),
  lty=2, lwd=2, col="#EB811B")

In lattice or trellis graphics, xyplot() is used for scatterplots; the formula interface is standard:

library(lattice)
xyplot(x2 ~ x1, jacoby)

Extra elements may be manipulated by modifying the panel functions, and will then propagate to all the panels if grouping is used:

xyplot(x2 ~ x1, jacoby, panel = function(x, y, ...) {
  panel.xyplot(x, y, ...)  
  panel.abline(h = mean(y), v=mean(x), lty = 2,
    lwd=2, col="#EB811B")  
})

Syntax similar to that of ggplot2 was introduced in latticeExtra, adding layers to the plot object:

xyplot(x2 ~ x1, jacoby) +
  latticeExtra::layer(panel.abline(h=mean(y), v=mean(x),
  lty = 2, lwd=2, col="#EB811B"))

The ggplot2 package builds graphic output layer by layer, like base graphics, using aes — aesthetics — to say what is being shown. Starting with the jacoby "data.frame" object, we plot points with geom_point, putting variable x1 on the x-axis and a placeholder empty string "" on the y-axis

library(ggplot2)
ggplot(jacoby) + geom_point(aes(x=x1, y="")) + xlab("")

Using aes on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis, to make a scatterplot

ggplot(jacoby) + geom_point(aes(x1, x2))

  • In these cases, the default plot shows points, which is reasonable here

  • The default glyph (symbol) is a small filled circle; the size argument lets us change its size, and the colour argument its colour

  • The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations

  • The display includes axis labels, but does not include a title

  • In ggplot2, the functions output (sums of) grobs — grid objects — which are drawn by a print method

Here, we’ll change the axis labels, and symbol colour and size; these are also vectors, so can vary by observation or for groups of observations; we turn off the legend which is not relevant here

p <- ggplot(jacoby) + geom_point(aes(x1, x2), 
  colour="#EB811B", size=2) + xlab("First vector") +
  ylab("Second vector") +
  theme(legend.position = "none")
p

We’ll add lines with geom_hline and geom_vline showing the means of the two vectors:

p + geom_hline(yintercept=mean(jacoby$x2), 
  colour="#EB811B", linetype=2) +
  geom_vline(xintercept=mean(jacoby$x1),
  colour="#EB811B", linetype=2)

Jacoby: synthetic data stripcharts

We will use stripcharts (p. 6, pp. 30–32) to display all four vectors together on shared axes; now we see why the default symbol is a hollow circle

stripchart(jacoby, pch=1, xlab="Data Values",
  ylab="Variable", main="Scatterchart")

We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0) to make the data easier to see

stripchart(jacoby, method="jitter",
  jitter=0.05, pch=1,
  xlab="Data Values",
  ylab="Variable",
  main="Scatterchart with jittering")

Lattice stripplots do the same as stripcharts, using a formula interface rather than just an object to choose the correct method. To get started, we need to stack the data in “long” form. It is also possible to use reshape2::melt.

jacobyS <- stack(jacoby)
str(jacobyS, width=45, strict.width="cut")
## 'data.frame':    80 obs. of  2 variables:
##  $ values: num  32.3 28 31.4 29.5 40 20 26 ..
##  $ ind   : Factor w/ 4 levels "x1","x2","x"..

The formula says that ind (y-axis) depends on values (x-axis)

stripplot(ind ~ values, data=jacobyS, jitter.data=TRUE)

We will use stripcharts (p. 6, pp. 30–32) to display all four vectors together on shared axes

#library(reshape2)
#jacobyS <- melt(jacoby)
p <- ggplot(jacobyS, aes(values, ind))
p + geom_point() + ylab("")

We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0, setting the RNG seed for reproducibility) to make the data easier to see

set.seed(1)
p + ylab("") + 
  geom_jitter(position=position_jitter(width=0, height=0.1))

Jacoby: synthetic data boxplots

The boxplot (pp. 38–43) display places all the vectors on the same axis, here the horizontal axis, places thicker lines at the medians, and boxes representing the interquartile ranges

boxplot(jacoby)

The lattice version of the boxplot is bwplot() — the box and whiskers plot; again we stay with the counter-intuitive horizontal display

bwplot(values ~ ind, data=jacobyS)

The boxplot (pp. 38–43) display places all the vectors on the same axis, here the horizontal axis, places thicker lines at the medians, and boxes representing the interquartile ranges

p <- ggplot(jacobyS, aes(ind, values))
p + geom_boxplot() + xlab("")

Jacoby: synthetic data histograms

Histograms (pp. 13–17) are very frequently used to examine data, and are directly comparable to thematic maps — both need chosen break points, which have a start point, an end point, and intermediate points which may be evenly spaced, defining bins or intervals

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
brks <- seq(15,55,by=5)
for (i in 1:4) {
 hist(jacoby[,i], breaks=brks, col="grey85",
 xlab=paste("x", i, ": seq(15, 55, by=5)", sep=""),
 freq=TRUE, main="")
}

par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
 hist(jacoby[,i], breaks=seq(17.5,52.5,by=5), col="grey85",
 xlab=paste("x", i, ": seq(17.5, 52.5, by=5)", sep=""), freq=TRUE, main="")
}

par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
 hist(jacoby[,i], breaks=seq(17.5,52.5,by=2.5), col="grey85",
 xlab=paste("x", i, ": seq(17.5, 52.5, by=2.5)", sep=""), freq=TRUE, main="")
}

par(oldpar)

The lattice syntax is perhaps simpler, with default starting points and bin widths driven by the data

histogram(~ values | ind, data=jacobyS,
  breaks=seq(15,55,by=5), type="count",
  index.cond=list(c(3,4,1,2)))

Histograms (pp. 13–17) are very frequently used to examine data, and are directly comparable to thematic maps — both need chosen break points, which have a start point, an end point, and intermediate points which may be evenly spaced, defining bins or intervals; we use facet_wrap() to place the plots

ggplot(jacobyS, aes(x=values)) + 
  geom_histogram(breaks=seq(15, 55, by=5)) + 
  facet_wrap(~ ind, ncol=2)

Jacoby: synthetic data density plots

When we realise that histograms can be constructed by choice of break points to create an impression that may (or may not) mislead, we can look at smoothed histograms (density plots (pp. 18–30)) as an alternative. Here we use a fixed bandwidth

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
  plot(density(jacoby[,i], bw=1.5), main="",
    xlim=c(15,55), ylim=c(0, 0.15))
  rug(jacoby[,i], ticksize=0.07, lwd=2)
  title(main=paste("Smoothed histogram of x",
    i, sep=""))
}

par(oldpar)

The lattice approach again uses the formula interface, and a rug by default:

densityplot(~ values | ind, data=jacobyS, bw=1.5,
  index.cond=list(c(3,4,1,2)))

By default, ggplot facets fix the scales, but it isn’t clear whether the bandwidth is fixed, and it does not seem to be reported

ggplot(jacobyS, aes(x=values)) + 
  geom_density(bw=1.5) + geom_rug() +
  facet_wrap(~ ind, ncol=2) + 
  xlim(c(15, 55))

Jacoby: empirical cumulative distribution function

Empirical cumulative distribution functions need no tuning arguments:

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
  plot(ecdf(jacoby[,i]), main="",
    xlim=c(15,55))
  title(main=paste("ECDF of x",
    i, sep=""))
}

par(oldpar)

We need latticeExtra to present the ECDF:

library(latticeExtra)
## 
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
## 
##     layer
ecdfplot(~ values | ind, data=jacobyS,
  index.cond=list(c(3,4,1,2)))

detach(package:latticeExtra)

In ggplot, the stat_ecdf() operator is used

ggplot(jacobyS, aes(x=values)) + 
  stat_ecdf() +
  facet_wrap(~ ind, ncol=2)

Jacoby: quantile-quantile plots

Quantile-quantile plots require the choice of a theoretical distribution, and as Deepayan Sarkar says, the ECDF plot uses a uniform theoretical distribution, so is a Uniform QQ plot in a different orientation:

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
x <- qunif(ppoints(100))
for (i in 1:4) {
  qqplot(x=x, y=jacoby[,i])
  title(main=paste("Uniform QQ of x",
    i, sep=""))
}

par(oldpar)

The qqnorm() function gives a Normal QQ plot directly:

oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
  qqnorm(y=jacoby[,i], xlab="", ylab="",
    ylim=c(15, 55), main="")
  title(main=paste("Normal QQ of x",
    i, sep=""))
}

par(oldpar)

In lattice, qqmath() takes a distribution argument:

qqmath(~ values | ind, data=jacobyS,
  distribution=qunif,
  index.cond=list(c(3,4,1,2)))

so this reproduces qqnorm:

qqmath(~ values | ind, data=jacobyS,
  distribution=qnorm,
  index.cond=list(c(3,4,1,2)))

In ggplot, the stat_qq() operator is used with a specified distribution function, here Uniform:

ggplot(jacobyS, aes(sample=values)) + 
  stat_qq(distribution=qunif) +
  facet_wrap(~ ind, ncol=2)

and for Normal QQ plots

ggplot(jacobyS, aes(sample=values)) + 
  stat_qq(distribution=qnorm) +
  facet_wrap(~ ind, ncol=2)

Ihaka Lecture 1, 2018

PISA data subset

  • In the talk, the PISA 2015 data set is mentioned

  • An OECD technical report describes in more detail how the plausible values from fitted item response models are generated; in 2015 there were 10 PVs for each participant, compared with 5 PVs in earlier rounds

  • In the talk, Di Cook uses the first plausible value; we can check what she does in the code blocks in her slides

Importing the SPSS data

The data provided for the over 500,000 individual students include over 900 variables, but we’ll look at the score plausible values (PV), taking means and standard deviations of the 10 PVs given for math, reading and science for each student. In the talk, only PV1 was used. We reread the data to get three-letter country codes.

library(foreign)
QQQ <- read.spss("data/CY6_MS_CMB_STU_QQQ.sav", to.data.frame=TRUE)
QQQ1 <- QQQ[,c(1,2,4,29,30,35,64,810:839,920)]
QQQ1$math_mean <- apply(as.matrix(QQQ1[,8:17]), 1, mean)
QQQ1$math_sd <- apply(as.matrix(QQQ1[,8:17]), 1, sd)
QQQ1$read_mean <- apply(as.matrix(QQQ1[,18:27]), 1, mean)
QQQ1$read_sd <- apply(as.matrix(QQQ1[,18:27]), 1, sd)
QQQ1$sci_mean <- apply(as.matrix(QQQ1[,28:37]), 1, mean)
QQQ1$sci_sd <- apply(as.matrix(QQQ1[,28:37]), 1, sd)
QQQ <- read.spss("data/CY6_MS_CMB_STU_QQQ.sav", to.data.frame=TRUE, use.value.labels=FALSE)
QQQ1$CNT_vl <- QQQ1$CNT
QQQ1$CNT <- QQQ$CNT
saveRDS(QQQ1, file="dicook/pisa_raw_subset.rds")

Subsetting the scores

Next we change some country codes to use ISO 3-letter alphabetical codes, and drop three entities as in the original script, and re-save this observation subset.

QQQ1 <- readRDS("dicook/pisa_raw_subset.rds")
recodes <- c("'QES'='ESP'", "'QCH'='CHN'", "'QAR'='ARG'", "'TAP'='TWN'")
for (str in recodes) QQQ1$CNT <- car::recode(QQQ1$CNT, str)
QQQ2 <- droplevels(QQQ1[!(as.character(QQQ1$CNT) %in% c("QUC", "QUD", "QUE")),])
library(ISOcodes)
data("ISO_3166_1")
scores <- merge(QQQ2, ISO_3166_1[,c("Alpha_3", "Name")], by.x="CNT", by.y="Alpha_3", all.x=TRUE)
scores$Name[scores$CNT == "KSV"] <- "Kosovo"
saveRDS(scores, file="dicook/pisa_subset.rds")

Adding country codes and names

Some of this repeats the steps above, and could be rationalised. It is intended to yield a helper object listing countries and country codes with numbers of students, hence the use of table() and aggregate().

QQQ1 <- readRDS("dicook/pisa_raw_subset.rds")
recodes <- c("'QES'='ESP'", "'QCH'='CHN'", "'QAR'='ARG'", "'TAP'='TWN'")
CNT_count <- as.data.frame(table(QQQ1$CNT))
names(CNT_count) <- c("CNT", "n")
for(str in recodes) CNT_count$CNT <- car::recode(CNT_count$CNT, str)
CNT_count1 <- aggregate(CNT_count$n, list(CNT_count$CNT), sum)
CNT_count2 <- droplevels(CNT_count1[!(as.character(CNT_count1$Group.1) %in% c("QUC", "QUD", "QUE")),])
names(CNT_count2) <- c("CNT", "n")
library(ISOcodes)
data("ISO_3166_1")
countries <- merge(CNT_count2, ISO_3166_1, by.x="CNT", by.y="Alpha_3", all.x=TRUE)
countries$Name[countries$CNT == "KSV"] <- "Kosovo"
saveRDS(countries, file="dicook/countries.rds")

Splitting into groups

Once these preparations are complete, we have about 500,000 observations on 10 PVs for each of three school subjects by binary gender and country, and use split() to make a list of student data frames, one for each combination of two genders and 69 countries.

QQQ1 <- readRDS("dicook/pisa_subset.rds")
countries <- readRDS("dicook/countries.rds")
a0 <- split(QQQ1, list(QQQ1$CNT, QQQ1$ST004D01T))

Aggregating weighted math means

Using sapply() on this list gives us one weighted mean of the means per student of maths PVs. We weight using normalized “senate” weights to permit comparisons between countries: “The senate weight makes the population of each country to be 5,000 to ensure an equal contribution by each of the countries in the analysis.”

math_mean <- sapply(a0, function(x) weighted.mean(x$math_mean, w=x$SENWT))
n2 <- length(math_mean)/2
country <- sapply(strsplit(names(math_mean), "\\."), "[", 1)[1:n2]
co <- match(country, countries$CNT)
nms <- countries$Name[co]
gender <- factor(sapply(strsplit(names(math_mean), "\\."), "[", 2))[1:n2]
o <- order(math_mean[1:n2])

Math means Cleveland dotchart

The workhorse is a Cleveland dotchart, which shows one point per math mean per gender and country. Standard dotcharts would show either one grouping category, or two separate dotcharts, one per gender, but here we wish to compare means of PV means:

dotchart(math_mean[o], label=nms[o], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(math_mean[n2+o], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean math PV means, age 15")

but what about a +/- 2 standard error caterpillar plot?

However, in much applied work, it is useful to see whether apparent differences are only apparent. The caterpillar plot is useful for this: it is like a dotplot, but adds lines, usually showing a multiple of standard errors (here +/- 2). We use a weighted standard deviation and sums of weights to ensure that the means and standard errors are coherent. We can insert the lines using a for() loop:

library(matrixStats)
sqn <- sqrt(sapply(a0, function(x) sum(x$SENWT)))
math_se <- sapply(a0, function(x) weightedSd(x$math_mean, w=x$SENWT))/sqn
dotchart(math_mean[o], label=nms[o], cex=0.5, pt.cex=0.9, pch=3, xlim=c(270, 650))
for (i in 1:n2) segments(x0=math_mean[o][i]-2*(math_se[o][i]), y0=i, x1=math_mean[o][i]+2*(math_se[o][i]), lwd=2)
points(math_mean[n2+o], (1:n2)-0.25, cex=0.9, pch=4, col="brown1")
for (i in 1:n2) segments(x0=math_mean[n2+o][i]-2*(math_se[n2+o][i]), y0=i-0.25, x1=math_mean[n2+o][i]+2*(math_se[n2+o][i]), lwd=2, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA math PV means, age 15, +/- 2se")

Aggregating weighted reading and science means

We repeat the simple comparative dotplot for the remaining school subjects:

read_mean <- sapply(a0, function(x) weighted.mean(x$read_mean, w=x$SENWT))
ro <- order(read_mean[1:n2])
sci_mean <- sapply(a0, function(x) weighted.mean(x$sci_mean, w=x$SENWT))
so <- order(sci_mean[1:n2])

Reading means Cleveland dotchart

dotchart(read_mean[ro], label=nms[ro], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(read_mean[n2+ro], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean reading PV means, age 15")

Science means Cleveland dotchart

dotchart(sci_mean[so], label=nms[so], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(sci_mean[n2+so], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean science PV means, age 15")

Mapping gaps Female-Male

As in the talk, we can make maps of the apparent gender gaps between the means of mean plausible values, first calculating the gaps and adding country identifiers:

math_gap <- math_mean[1:n2] - math_mean[n2+(1:n2)]
read_gap <- read_mean[1:n2] - read_mean[n2+(1:n2)]
sci_gap <- sci_mean[1:n2] - sci_mean[n2+(1:n2)]
gaps <- data.frame(math_gap, read_gap, sci_gap, iso_a3=country,
  iso_a2=countries$Alpha_2[co])

Country boundaries

We use a 1:50m set of country boundaries (in the 1:110m set, Singapore disappears).

library(rnaturalearth)
library(rnaturalearthdata)
data(countries50)
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.1.0dev-84d972714a, PROJ 6.2.1
countries50 <- st_as_sf(countries50)

Using tmap for maps

We can merge() the gaps data with the country boundaries, and use the tmap package to create the map. Tomorrow we’ll look at binning and colours. Using ttm(), we can toggle to an interactive format, using last_map() to recall the last generated map.

world_gaps <- merge(countries50, gaps, by="iso_a2", all.x=TRUE)
library(tmap)
tm_shape(world_gaps) + tm_fill("math_gap", palette="div", n=7, style="jenks", midpoint=0)
## Warning: The shape world_gaps is invalid. See sf::st_is_valid

#ttm()
#last_map()

Other points

  • In the talk, boxplots are used to explore the variability between students by gender and country; here we tried a caterpillar plot. The PV means also flatten variability - would medians be more sensible?

  • Could aspects of the PISA data set be used in thinking about projects?

  • Other data sets are made available by Di Cook, and the nullabor package seems interesting, and maybe could also work (well) for projects, maybe also varying graphical representation?

Somewhere over the Rainbow

New base palettes coming

A New palette() for R

Colour palettes

  • Zeileis and others have discussed the opportunities to be found for effective visual communication in statistical graphics using colours; we distinguish between sequential, diverging and qualitative palettes

  • There are differences of opinion with regard to colour look-up (continuous or discrete) between implementations - RColorBrewer is discrete

  • There are several palettes, such as rainbow, cm.colors, heat.colors and others in grDevices

  • Other packages include viridis, colorspace, and further palettes provided in plotting methods (sf.colors uses sp::bpy.colors)
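
A sketch of some of these palette functions; hcl.colors() arrived in grDevices with R 3.6:

rainbow(5)                           # qualitative, much criticized
heat.colors(5)                       # sequential
cm.colors(5)                         # diverging
hcl.colors(5, palette="viridis")     # HCL-based, new in R 3.6
RColorBrewer::brewer.pal(5, "BuGn")  # discrete sequential ColorBrewer palette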

ColorBrewer

Fortunately, the RColorBrewer package provides, by permission, access to the ColorBrewer palettes accessible from the ColorBrewer website. Note that ColorBrewer limits the number of classes tightly, with only 3–9 sequential classes

library(RColorBrewer)
opar <- par(cex=0.7)
display.brewer.all()

par(opar)

Viridis

suppressPackageStartupMessages(library(viridis))
library(colorspace)
library(ggplot2)
#devtools::install_github("wilkelab/cowplot")
#install.packages("colorspace", repos = "http://R-Forge.R-project.org")
#devtools::install_github("clauswilke/colorblindr")
library(colorblindr)
n_col <- 10
swatchplot(
"terrain.colors"=rev(terrain.colors(n_col)),
"heat"=rev(heat.colors(n_col)), 
"ggplot default"=rev(scales::seq_gradient_pal(low = "#132B43", high = "#56B1F7", space = "Lab")(seq(0, 1, length=n_col))), 
"brewer blues"=scales::gradient_n_pal(scales::brewer_pal(type="seq")(9))(seq(0, 1, length=n_col)), 
"brewer yellow-green-blue"=scales::gradient_n_pal(scales::brewer_pal(type="seq", palette = "YlGnBu")(9))(seq(0, 1, length=n_col)), 
"viridis"=rev(viridis(n_col)), 
"magma"=rev(magma(n_col))
)

Cartography

suppressPackageStartupMessages(library(cartography))
display.carto.all(n = 10)

colorspace and HCLWizard

  • The colorspace package provides choose_palette() to choose HCL palettes interactively; it returns a palette function.

  • HCL-Based Color Palettes in grDevices

  • The colorblindr package from GitHub provides the swatchplot() function to show colour swatches; it also provides an Okabe-Ito qualitative palette as palette_OkabeIto
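
The choose_palette() chooser mentioned above returns a palette function (a sketch; needs an interactive session):

pal <- colorspace::choose_palette()  # opens a GUI, returns a palette function
pal(7)                               # seven colours from the chosen palette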

n_col <- 8
swatchplot(
  "Okabe Ito"=palette_OkabeIto,
  "ColorBrewer accent"=brewer.pal(n_col, "Accent"),
  "ColorBrewer Pastel2"=brewer.pal(n_col, "Pastel2"),
  "Qualitative HCL"=qualitative_hcl(n_col)
)

n_col <- 10
swatchplot(
"terrain.colors"=rev(terrain.colors(n_col)),
"terrain HCL"=rev(terrain_hcl(n_col)),
"heat"=rev(heat.colors(n_col)), 
"heat HCL"=rev(heat_hcl(n_col)),
"rainbow"=rainbow(n_col),
"rainbow HCL"=rainbow_hcl(n_col)
)