All the material presented here, to the extent it is original, is available under CC-BY-SA.
I am running R 3.6.2, with a recent update.packages().
needed <- c("cartography", "colorblindr", "colorspace", "viridis",
"viridisLite", "RColorBrewer", "tmap", "sf",
"rnaturalearthdata", "rnaturalearth", "matrixStats",
"ISOcodes", "ggplot2")
Script and data at https://github.com/rsbivand/BAN422_V20/raw/master/ban422_v20_mon.zip. Download to suitable location, unzip and use as basis.
Visualization is one of the key strengths of R
However, visualization involves many choices, and R offers such a wide range of choices that guidance is desirable
There are two major underlying technologies, base graphics and grid graphics, and several toolboxes built on these (trellis graphics and grammar of graphics on grid graphics), in addition to JavaScript widgets for interaction
In addition, much work has been done on the effective use of shapes and colours.
The group projects provide learning opportunities for exploring visualization, through comparison of implementation and maybe perception mini-experiments using nullabor
Similar topics may be gleaned from R-bloggers and its Twitter feed; some of the claims deserve to be checked
The aim is not to find winners, but to explore alternatives
Thursday group work-day, Friday presentation/feedback day, hand in via WiseFlow by Friday 31 January 14:00.
Today, we’ll borrow two talks from the Ihaka lecture series at Auckland University from March 2018, review simple visualizations, and follow up some of the data used in the second talk by Dianne Cook
Tuesday: binning, colours and conditioning, base graphics and graphics devices; Wednesday: grid graphics with lattice and ggplot2, interactive graphics, dashboards
The underlying aim: to survey contemporary approaches to visualization using R
Your (group) projects are a key part of the seminar
Time | Topic |
---|---|
Monday 13/1 | |
08.15-10.00 | Introduction - Visualization in R - basics |
10.15-11.00 | Ihaka lecture 2018 Alberto Cairo |
12.15-13.00 | Ihaka lecture 2018 Dianne Cook |
13.15-15.00 | Reproducing PISA graphics, use of colour palettes |
Tuesday 14/1 | |
08.15-10.00 | Conditioning and facets, choice of binning intervals in visualization |
10.15-11.00 | Practice |
12.15-14.00 | Base graphics structure and graphics devices |
14.15-15.00 | Practice |
Wednesday 15/1 | |
08.15-10.00 | Grid graphics structure, lattice, ggplot2, etc. |
10.15-11.00 | Practice |
12.15-14.00 | Interactive graphics challenges, plotly, googleVis, Dashboards and shiny |
14.15-15.00 | Practice |
Thursday 16/1 | |
09.00-16.00 | Group work day (not in aud. C, surgery in SAM-D217) |
Friday 17/1 | |
08.15-10.00 | Presentations |
10.15-11.00 | Presentations |
12.00-13.45 | Presentations |
R is as small or large as you like, and runs in many different configurations (no smartphones); the core is written in C
The language has developed from S written at Bell Labs NJ, where Unix, C, C++, and scripting technologies were created in the 1970s and 1980s
Bell Labs statisticians had a strong focus on graphics and exploratory data analysis from the beginning
Many underlying abstractions were established by 1988 and 1992; we’ll get to the data.frame and formula abstractions later
An R session records its history - all that is entered at the console prompt - and a workspace containing objects
On exiting a session, the history may be saved to a history file, and the workspace may be saved to an RData file; history and chosen objects (or all objects) may be saved manually before exit
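A minimal sketch of saving manually before exit (file names are illustrative):
savehistory(file = "session.Rhistory")   # all console input so far
save(needed, file = "objects.RData")     # chosen objects, here the needed vector defined above
# save.image()                           # or save the whole workspace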
The workspace is in the memory of the computer, and R itself expects there to be enough memory for all of the data, intermediate and final results
Modern R is 64-bit, so limits are most often set by the computer hardware; use can be made of multiple cores to compute in parallel
In the RStudio Interactive Development Environment (IDE), it is convenient to use R Markdown to write notebooks (annotated scripts)
Chunks of code are run in sequence and may be echoed in the output
Output is shown in its right place, including graphics output
The document may also be converted to a script, mirroring the weave/tangle - knit/purl duality
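A minimal sketch of the knit/purl duality (the file name notebook.Rmd is only illustrative):
# rmarkdown::render("notebook.Rmd")                    # weave/knit: chunks are run and a document is produced
# knitr::purl("notebook.Rmd", output = "notebook.R")   # tangle/purl: extract only the R code chunks as a script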
This presentation is written in Markdown, as we’ll see …
In RStudio, the Help tab in the lower right pane (default position) gives access to the R manuals and to the installed packages help pages through the Packages link under Reference
In R itself, help pages are available in HTML (browser) and text form; help.start() uses the default browser to display the Manuals, Reference and Miscellaneous Material sections in RStudio’s home help tab
The search engine can be used to locate help pages, but is not great if many packages are installed, as no indices are stored
The help system needs to be learned in order to provide the user with ways of progressing without wasting too much time
The base help system does not tell you how to use R as a system, about packages not installed on your machine, or about R as a community
It does provide information about functions, methods and (some) classes in base R and in contributed packages installed on your machine
We’ll cover these first, then go on to look at vignettes, R Journal, task views, online help pages, and the blog aggregator
There are different requirements with regard to help systems - in R, the help pages of base R are expected to be accurate although terse
Each help page provides a short description of the functions, methods or classes it covers; some pages cover more than one such
Help pages are grouped by package, so that the browser-based system is not easy to browse if you do not know which package a function belongs to
The usage of the function is shown explicitly, including any defaults for arguments to functions or methods
Each argument is described, showing names and types; in addition details of the description are given, together with the value returned
Rather than starting from the packages hierarchy of help pages, users most often use the help function
The function takes the name of the function about which we need help; the name may be given in quotation marks; class names contain a hyphen and must be quoted
Instead of using say help(help), we can shorten to the question mark operator: ?help
Occasionally, several packages offer different functions with the same name, and we may be offered a choice; we can disambiguate by putting the package name and two colons before the function name
In the usage section, function arguments are shown by name and order; the args function returns this information at the console
In general, if arguments are given by name, the order is arbitrary, but if names are not used at least sometimes, order matters
Some arguments do not have default values and are probably required, although some are guessed if missing
Being explicit about the names of arguments and the values they take is helpful in scripting and reproducible research
The ellipsis ... indicates that the function itself examines objects passed to see what to do
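A minimal sketch of these help idioms at the console:
help("help")         # same as ?help; quoting the topic is always safe
?help                # question mark shorthand
?stats::filter       # disambiguate with the package name and two colons
args(plot.default)   # show the usage (argument names, order and defaults)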
The regular R console does not provide tooltips, that is, a bubble first offering alternative function or object names as you type, then lists of argument names
RStudio, like many IDEs, does provide this, controlled by Tools -> Global options -> Code -> Completion (by default it is operative)
This may be helpful or not, depending on your style of working; if you find it helpful, fine, if not, you can make it less invasive under Global options
Other IDEs have also provided this facility, which builds directly on the usage sections of help pages of functions in installed packages
Base R has a set of checks and tests that ensure coherence between the code itself and the usage sections in help pages
These mechanisms are used in checking contributed packages before they are released through the archive network; the description of arguments on help pages must match the function definition
It is also possible to generate help pages documenting functions automatically, for example using the roxygen2 package
It is important to know that we can rely on this coherence
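A minimal roxygen2 sketch (package-development context assumed; the function add2 is invented for illustration):
#' Add two numeric vectors
#'
#' @param x,y numeric vectors of equal length
#' @return The elementwise sum of \code{x} and \code{y}.
#' @examples
#' add2(1:3, 4:6)
#' @export
add2 <- function(x, y) x + y
# roxygen2::roxygenise() run in the package directory would generate man/add2.Rd,
# keeping the documented arguments in step with the function definition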
The objects returned by functions are also documented on help pages, but the coherence of the description with reality is harder to check
This means that use of str or other functions or methods may be helpful when we want to look inside the returned object
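For example, a quick look inside a fitted model object, sketched with a built-in data set:
fit <- lm(dist ~ speed, data = cars)   # returns an object of class "lm"
str(fit, max.level = 1)                # top-level components of the returned object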
The form taken by returned values will often also vary, depending on the arguments given
Most help pages address this issue not by writing more about the returned values, but by using the examples section to highlight points of potential importance for the user
Reading the examples section on the help page is often enlightening, but we do not need to copy and paste
The example function runs those parts of the code in the examples section of a function that are not tagged don’t run - this can be overridden, but may involve meeting conditions not met on your machine
This code is run nightly on CRAN servers on multiple operating systems and using released, patched and development versions of R, so checking both packages and the three versions of R
Some examples use data given verbatim, but many use built-in data sets; most packages also provide data sets to use for running examples
This means that the examples and the built-in data sets are a most significant resource for learning how to solve problems with R
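A small sketch of running a help-page examples section directly:
example(mean)                             # runs the examples from ?mean
# example(boxplot, package = "graphics")  # the package may be given explicitly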
Very often, one recognizes classic textbook data sets from the history of applied statistics; contemporary text book authors often publish collections of data sets as packages on CRAN
The built-in data sets also have help pages, describing their representation as R objects, and their licence and copyright status
These help pages also often include an examples section showing some of the analyses that may be carried out using them
One approach that typically works well when you have a data set of your own, but are unsure how to proceed, is to find a built-in data set that resembles the real one, and play with that first
The built-in data sets are often quite small, and if linked to text books, they are well described there as well as in the help pages
By definition, the built-in data sets do not have to be imported into R, as they are almost always stored as files of R objects
In some cases, these data sets are stored in external file formats, most often to show how to read those formats
The built-in data sets in the base datasets package are in the search path, but data sets in other packages should be loaded using the data() function:
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
library(MASS)
data(deaths)
str(deaths)
## Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...
At about the time that literate programming arrived in R with Sweave and Stangle - we mostly use knitr now - the idea arose of supplementing package documentation with example workflows
Vignettes are PDF documents with accompanying runnable R code that describe how to carry out particular sequences of operations
The package index pages in the RStudio Help tab show user guides, package vignettes and other documentation
The vignette() function can be used to list vignettes by installed package, and to open the chosen vignette in a PDF reader
A very typical way of using vignettes on a machine with enough screen space is to read the document and run the code from the R file at the same time
Assign the output of vignette to an object; the print method shows the PDF or HTML, the edit method gives direct access to the underlying code for copy and paste
The help system in RStudio provides equivalent access to vignette documents and code
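A minimal sketch of this workflow (the vignette name is assumed; any installed vignette listed by the first call will do):
vignette(package = "grid")               # list the vignettes installed with one package
v <- vignette("grid", package = "grid")  # pick one by name
print(v)                                 # open the PDF/HTML document
edit(v)                                  # open the underlying R code for copy and paste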
Papers about R contributed packages published in the Journal of Statistical Software and the R Journal are often constructed in this way too
As R has developed, the number of packages on CRAN has grown (other packages are on BioConductor and github)
CRAN task views were introduced to try to provide some subject area guidance
They remain terse, and struggle to keep up, but are still worth reviewing
Note that those working in different subject areas often see things rather differently, leading to subject specific treatment of intrinsically similar themes
The help system and vignettes were designed to be used offline, so that the versions of R and installed packages matched the documentation
If you search online for information about functions in R or in contributed packages, you often reach inside-R, sponsored by Revolution Analytics
Help pages may also be viewed online from your chosen CRAN mirror; package pages provide these (Reference manual) and vignettes as links
Remember to check that the versions of your installed software and the online documentation are the same
The R community has become a number of linked communities rather than a coherent and hierarchical whole
As in many open source projects, the R project is more bazaar than cathedral; think of niches in ecosystems with differing local optima in contrast to a master plan
One style is based on mailing lists, in which an issue raised by an original poster is resolved later in that thread
Another style is to use online fora, such as StackOverflow, which you need to visit rather than receiving messages in your inbox
There are now many blogs involving the use of R, fortunately aggregated at R-bloggers, where other resources may also be found
New aggregated blog topics are linked to a Twitter account, so if you want, you too can be bombarded by notifications
These are also a potential source of project ideas, especially because some claims should be challenged
R Users Groups and R Ladies provide face-to-face meeting places that many value
R started as a teaching tool for applied statistics, but this community model has been complemented by others
R is now widely used in business, public administration and voluntary organizations for data analysis and visualization
The R Consortium was created in 2015 as a vehicle for companies with relationships to R
R itself remains under the control of the R Foundation, which is still mostly academic in flavour
We can distinguish between presentation graphics and analytical graphics
Presentation graphics are intended for others and are completed (even if interactive) - see work by Florence Nightingale
Analytical graphics may evolve into presentation graphics, but their main purpose is to visualize the data being analysed (see Antony Unwin’s book and website, and Claus Wilke’s Fundamentals of Data Visualization)
Many of the researchers who have developed approaches to visualization have been involved with Bell Labs, where S came from
As installed, R provides two graphics approaches, one known as base graphics, the other trellis or lattice graphics
Most types of visualization are available for both, but lattice graphics were conceived to handle conditioning, for example to generate matching plots for different categories
Many of these were based on the data-ink ratio, favouring graphics with little or no extraneous detail (Edward Tufte) - see Lukasz Piwek’s blog
There are other approaches, such as Leland Wilkinson’s Grammar of Graphics, implemented in Hadley Wickham’s ggplot2 package, which we will also be using here as its use is growing fast
So there are presentation and analytical graphics, and there can be a number of approaches to how best to communicate information in either of those modes
R can create excellent presentation graphics, or provide input for graphics designers to improve for print or other media
What we need now are the most relevant simple analytical graphics for the kinds of data we use
There are four numeric variables of the same length, read into a data.frame. The summary method shows their characteristics:
jacoby <- read.table("data/jacoby.txt")
summary(jacoby)
## x1 x2 x3 x4
## Min. :17.50 Min. :21.60 Min. :19.10 Min. :22.10
## 1st Qu.:27.52 1st Qu.:25.32 1st Qu.:25.98 1st Qu.:27.38
## Median :30.00 Median :30.00 Median :28.35 Median :29.60
## Mean :30.00 Mean :30.00 Mean :30.00 Mean :30.00
## 3rd Qu.:32.48 3rd Qu.:34.67 3rd Qu.:34.12 3rd Qu.:30.23
## Max. :42.50 Max. :38.40 Max. :40.90 Max. :51.00
Using the plot method on a single numeric vector puts that variable on the vertical (y) axis, and an observation index on the horizontal (x) axis
plot(jacoby$x1)
Using the plot method on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis
plot(jacoby$x1, jacoby$x2)
By convention, the x-axis is seen as causing the y-axis, so we can use a formula interface, and a data argument to tell the method where to find the vectors by name (column names in the data.frame)
plot(x2 ~ x1, data=jacoby)
In these cases, the default plot shows points, which is reasonable here
The default glyph (symbol) is a small unfilled circle; the pch argument lets us change this (and its colour by col)
The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations
The display includes axis labels, but does not include a title
Here, we’ll manipulate the axis labels, and symbol type, colour and size; the symbol type, colour and size are also vectors, so can vary by observation or for groups of observations
plot(x2 ~ x1, data=jacoby, xlab="First vector",
ylab="Second vector", pch=16, col="#EB811B",
cex=1.2)
Base graphics methods like plot can be added to, drawing more on the open device. We’ll add lines with abline showing the means of the two vectors:
plot(x2 ~ x1, data=jacoby, xlab="First vector",
ylab="Second vector", pch=16, col="#EB811B",
cex=1.2)
abline(v=mean(jacoby$x1), h=mean(jacoby$x2),
lty=2, lwd=2, col="#EB811B")
In lattice or trellis graphics, xyplot() is used for scatterplots; the formula interface is standard:
library(lattice)
xyplot(x2 ~ x1, jacoby)
Extra elements may be manipulated by modifying the panel functions, and will then propagate to all the panels if grouping is used:
xyplot(x2 ~ x1, jacoby, panel = function(x, y, ...) {
panel.xyplot(x, y, ...)
panel.abline(h = mean(y), v=mean(x), lty = 2,
lwd=2, col="#EB811B")
})
Syntax similar to that of ggplot2 was introduced in latticeExtra, adding layers to the plot object:
xyplot(x2 ~ x1, jacoby) +
latticeExtra::layer(panel.abline(h=mean(y), v=mean(x),
lty = 2, lwd=2, col="#EB811B"))
The ggplot2 package builds graphic output layer by layer, like base graphics, using aes — aesthetics — to say what is being shown. Starting with the jacoby "data.frame" object, we plot points with geom_point, placing variable "x1" on one axis against a placeholder empty string "" on the other axis
library(ggplot2)
ggplot(jacoby) + geom_point(aes(x=x1, y="")) + xlab("")
Using aes on two numeric vectors puts the first vector on the horizontal (x) axis, and the second on the vertical (y) axis, to make a scatterplot
ggplot(jacoby) + geom_point(aes(x1, x2))
In these cases, the default plot shows points, which is reasonable here
The default glyph (symbol) is a small filled circle; the size argument lets us change its size, and the colour argument its colour
The display is annotated with value scales along the axes, and the axes are placed to add a little space outside the bounding box of the observations
The display includes axis labels, but does not include a title
In ggplot2, the functions output (summations of) grobs — grid objects, with a print method
Here, we’ll change the axis labels, and symbol colour and size; these are also vectors, so can vary by observation or for groups of observations; we turn off the legend which is not relevant here
p <- ggplot(jacoby) + geom_point(aes(x1, x2),
colour="#EB811B", size=2) + xlab("First vector") +
ylab("Second vector") +
theme(legend.position = "none")
p
We’ll add lines with geom_hline and geom_vline showing the means of the two vectors:
p + geom_hline(yintercept=mean(jacoby$x2),
colour="#EB811B", linetype=2) +
geom_vline(xintercept=mean(jacoby$x1),
colour="#EB811B", linetype=2)
We will use stripcharts (p. 6, pp. 30–32) to display all four vectors together on shared axes; now we see why the default symbol is a hollow circle
stripchart(jacoby, pch=1, xlab="Data Values",
ylab="Variable", main="Scatterchart")
We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0) to make the data easier to see
stripchart(jacoby, method="jitter",
jitter=0.05, pch=1,
xlab="Data Values",
ylab="Variable",
main="Scatterchart with jittering")
Lattice stripplots do the same as stripcharts, using a formula interface rather than just an object to choose the correct method. To get started, we need to stack the data in “long” form. It is also possible to use reshape2::melt.
jacobyS <- stack(jacoby)
str(jacobyS, width=45, strict.width="cut")
## 'data.frame': 80 obs. of 2 variables:
## $ values: num 32.3 28 31.4 29.5 40 20 26 ..
## $ ind : Factor w/ 4 levels "x1","x2","x"..
The formula says that ind (y-axis) depends on values (x-axis)
stripplot(ind ~ values, data=jacobyS, jitter.data=TRUE)
We will use stripcharts (p. 6, pp. 30–32) to display all four vectors together on shared axes
#library(reshape2)
#jacobyS <- melt(jacoby)
p <- ggplot(jacobyS, aes(values, ind))
p + geom_point() + ylab("")
We can choose to jitter the symbols in the vertical dimension (adding small random amounts to 0, setting the RNG seed for reproducibility) to make the data easier to see
set.seed(1)
p + geom_point() + ylab("") +
geom_jitter(position=position_jitter(0.1))
The boxplot (pp. 38–43) display places all the vectors on the same axis, here the horizontal axis, places thicker lines at the medians, and boxes representing the interquartile ranges
boxplot(jacoby)
The lattice version of the boxplot is bwplot — the box and whiskers plot; again we stay with the counter-intuitive horizontal display
bwplot(values ~ ind, data=jacobyS)
The boxplot (pp. 38–43) display places all the vectors on the same axis, here the horizontal axis, places thicker lines at the medians, and boxes representing the interquartile ranges
p <- ggplot(jacobyS, aes(ind, values))
p + geom_boxplot() + xlab("")
Histograms (pp. 13–17) are very frequently used to examine data, and are directly comparable to thematic maps — both need chosen break points, which have a start point, an end point, and intermediate points which may be evenly spaced, defining bins or intervals
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
brks <- seq(15,55,by=5)
for (i in 1:4) {
hist(jacoby[,i], breaks=brks, col="grey85",
xlab=paste("x", i, ": seq(15, 55, by=5)", sep=""),
freq=TRUE, main="")
}
par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
hist(jacoby[,i], breaks=seq(17.5,52.5,by=5), col="grey85",
xlab=paste("x", i, ": seq(17.5, 52.5, by=5)", sep=""), freq=TRUE, main="")
}
par(oldpar)
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
hist(jacoby[,i], breaks=seq(17.5,52.5,by=2.5), col="grey85",
xlab=paste("x", i, ": seq(17.5, 52.5, by=2.5)", sep=""), freq=TRUE, main="")
}
par(oldpar)
The lattice syntax is perhaps simpler, with default starting points and bin widths driven by the data
histogram(~ values | ind, data=jacobyS,
breaks=seq(15,55,by=5), type="count",
index.cond=list(c(3,4,1,2)))
Histograms (pp. 13–17) are very frequently used to examine data, and are directly comparable to thematic maps — both need chosen break points, which have a start point, an end point, and intermediate points which may be evenly spaced, defining bins or intervals; we use facet_wrap() to place the plots
ggplot(jacobyS, aes(x=values)) +
geom_histogram(breaks=seq(15, 55, by=5)) +
facet_wrap(~ ind, ncol=2)
When we realise that histograms can be constructed by choice of break points to create an impression that may (or may not) mislead, we can look at smoothed histograms (density plots (pp. 18–30)) as an alternative. Here we use a fixed bandwidth
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
plot(density(jacoby[,i], bw=1.5), main="",
xlim=c(15,55), ylim=c(0, 0.15))
rug(jacoby[,i], ticksize=0.07, lwd=2)
title(main=paste("Smoothed histogram of x",
i, sep=""))
}
par(oldpar)
The lattice approach again uses the formula interface, and a rug by default:
densityplot(~ values | ind, data=jacobyS, bw=1.5,
index.cond=list(c(3,4,1,2)))
By default, ggplot facets fix the scales, but it isn’t clear whether the bandwidth is fixed, and it does not seem to be reported
ggplot(jacobyS, aes(x=values)) +
geom_density(bw=1.5) + geom_rug() +
facet_wrap(~ ind, ncol=2) +
xlim(c(15, 55))
Empirical cumulative distribution functions need no tuning arguments:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
plot(ecdf(jacoby[,i]), main="",
xlim=c(15,55))
title(main=paste("ECDF of x",
i, sep=""))
}
par(oldpar)
We need latticeExtra to present the ECDF:
library(latticeExtra)
##
## Attaching package: 'latticeExtra'
## The following object is masked from 'package:ggplot2':
##
## layer
ecdfplot(~ values | ind, data=jacobyS,
index.cond=list(c(3,4,1,2)))
detach(package:latticeExtra)
In ggplot, the stat_ecdf() operator is used
ggplot(jacobyS, aes(x=values)) +
stat_ecdf() +
facet_wrap(~ ind, ncol=2)
Quantile-quantile plots require the choice of a theoretical distribution, and as Deepayan Sarkar says, the ECDF plot uses a uniform theoretical distribution, so is a Uniform QQ plot in a different orientation:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
x <- qunif(ppoints(100))
for (i in 1:4) {
qqplot(x=x, y=jacoby[,i])
title(main=paste("Uniform QQ of x",
i, sep=""))
}
par(oldpar)
The qqnorm() function gives a Normal QQ plot directly:
oldpar <- par(no.readonly=TRUE)
par(mfrow=c(2,2))
for (i in 1:4) {
qqnorm(y=jacoby[,i], xlab="", ylab="",
ylim=c(15, 55), main="")
title(main=paste("Normal QQ of x",
i, sep=""))
}
par(oldpar)
In lattice, qqmath() takes a distribution argument:
qqmath(~ values | ind, data=jacobyS,
distribution=qunif,
index.cond=list(c(3,4,1,2)))
so this reproduces qqnorm:
qqmath(~ values | ind, data=jacobyS,
distribution=qnorm,
index.cond=list(c(3,4,1,2)))
In ggplot, the stat_qq() operator is used with a specified distribution function, here Uniform:
ggplot(jacobyS, aes(sample=values)) +
stat_qq(distribution=qunif) +
facet_wrap(~ ind, ncol=2)
and for Normal QQ plots
ggplot(jacobyS, aes(sample=values)) +
stat_qq(distribution=qnorm) +
facet_wrap(~ ind, ncol=2)
In the talk, the PISA 2015 data set is mentioned
An OECD technical report describes in more detail how the plausible values from fitted item response models are generated; in 2015 there were 10 PVs for each participant, compared with 5 PVs in earlier rounds
In the talk, Di Cook uses the first plausible value; we can check what she does in the code blocks in her slides
The data provided for the over 500,000 individual students include over 900 variables, but we’ll look at the score plausible values (PV), taking means and standard deviations of the 10 PVs given for math, reading and science for each student. In the talk, only PV1 was used. We reread the data to get three-letter country codes.
library(foreign)
QQQ <- read.spss("data/CY6_MS_CMB_STU_QQQ.sav", to.data.frame=TRUE)
QQQ1 <- QQQ[,c(1,2,4,29,30,35,64,810:839,920)]
QQQ1$math_mean <- apply(as.matrix(QQQ1[,8:17]), 1, mean)
QQQ1$math_sd <- apply(as.matrix(QQQ1[,8:17]), 1, sd)
QQQ1$read_mean <- apply(as.matrix(QQQ1[,18:27]), 1, mean)
QQQ1$read_sd <- apply(as.matrix(QQQ1[,18:27]), 1, sd)
QQQ1$sci_mean <- apply(as.matrix(QQQ1[,28:37]), 1, mean)
QQQ1$sci_sd <- apply(as.matrix(QQQ1[,28:37]), 1, sd)
QQQ <- read.spss("data/CY6_MS_CMB_STU_QQQ.sav", to.data.frame=TRUE, use.value.labels=FALSE)
QQQ1$CNT_vl <- QQQ1$CNT
QQQ1$CNT <- QQQ$CNT
saveRDS(QQQ1, file="dicook/pisa_raw_subset.rds")
Next we change some country codes to use ISO 3-letter alphabetical codes, and drop three entities as in the original script, and re-save this observation subset.
QQQ1 <- readRDS("dicook/pisa_raw_subset.rds")
recodes <- c("'QES'='ESP'", "'QCH'='CHN'", "'QAR'='ARG'", "'TAP'='TWN'")
for (str in recodes) QQQ1$CNT <- car::recode(QQQ1$CNT, str)
QQQ2 <- droplevels(QQQ1[!(as.character(QQQ1$CNT) %in% c("QUC", "QUD", "QUE")),])
library(ISOcodes)
data("ISO_3166_1")
scores <- merge(QQQ2, ISO_3166_1[,c("Alpha_3", "Name")], by.x="CNT", by.y="Alpha_3", all.x=TRUE)
scores$Name[scores$CNT == "KSV"] <- "Kosovo"
saveRDS(scores, file="dicook/pisa_subset.rds")
Some of this repeats the steps above, and could be rationalised. It is intended to yield a helper object listing countries and country codes with numbers of students, hence the use of table() and aggregate().
QQQ1 <- readRDS("dicook/pisa_raw_subset.rds")
recodes <- c("'QES'='ESP'", "'QCH'='CHN'", "'QAR'='ARG'", "'TAP'='TWN'")
CNT_count <- as.data.frame(table(QQQ1$CNT))
names(CNT_count) <- c("CNT", "n")
for(str in recodes) CNT_count$CNT <- car::recode(CNT_count$CNT, str)
CNT_count1 <- aggregate(CNT_count$n, list(CNT_count$CNT), sum)
CNT_count2 <- droplevels(CNT_count1[!(as.character(CNT_count1$Group.1) %in% c("QUC", "QUD", "QUE")),])
names(CNT_count2) <- c("CNT", "n")
library(ISOcodes)
data("ISO_3166_1")
countries <- merge(CNT_count2, ISO_3166_1, by.x="CNT", by.y="Alpha_3", all.x=TRUE)
countries$Name[countries$CNT == "KSV"] <- "Kosovo"
saveRDS(countries, file="dicook/countries.rds")
Once these preparations are complete, we have about 500,000 observations on 10 PVs for each of three school subjects by binary gender and country, and use split() to make a list of two genders times 69 countries of student data frames.
QQQ1 <- readRDS("dicook/pisa_subset.rds")
countries <- readRDS("dicook/countries.rds")
a0 <- split(QQQ1, list(QQQ1$CNT, QQQ1$ST004D01T))
Using sapply() on this list gives us one weighted mean of the means per student of maths PVs. We weight using normalized “senate” weights to permit comparisons between countries: “The senate weight makes the population of each country to be 5,000 to ensure an equal contribution by each of the countries in the analysis.”
math_mean <- sapply(a0, function(x) weighted.mean(x$math_mean, w=x$SENWT))
n2 <- length(math_mean)/2
country <- sapply(strsplit(names(math_mean), "\\."), "[", 1)[1:n2]
co <- match(country, countries$CNT)
nms <- countries$Name[co]
gender <- factor(sapply(strsplit(names(math_mean), "\\."), "[", 2))[1:n2]
o <- order(math_mean[1:n2])
The workhorse is a Cleveland dotchart, which shows one point per math mean per gender and country. Standard dotcharts would show either one grouping category, or two separate dotcharts, one per gender, but here we wish to compare means of PV means:
dotchart(math_mean[o], label=nms[o], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(math_mean[n2+o], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean math PV means, age 15")
However, in much applied work, it is useful to see whether apparent differences are only apparent. The caterpillar plot is useful for this; it is like a dotplot, but adds lines usually showing a multiple of standard errors (here +/- 2). We use a weighted standard deviation and sums of weights to ensure that the means and standard errors are coherent. We can insert the lines using a for() loop:
library(matrixStats)
sqn <- sqrt(sapply(a0, function(x) sum(x$SENWT)))
math_se <- sapply(a0, function(x) weightedSd(x$math_mean, w=x$SENWT))/sqn
dotchart(math_mean[o], label=nms[o], cex=0.5, pt.cex=0.9, pch=3, xlim=c(270, 650))
for (i in 1:n2) segments(x0=math_mean[o][i]-2*(math_se[o][i]), y0=i, x1=math_mean[o][i]+2*(math_se[o][i]), lwd=2)
points(math_mean[n2+o], (1:n2)-0.25, cex=0.9, pch=4, col="brown1")
for (i in 1:n2) segments(x0=math_mean[n2+o][i]-2*(math_se[n2+o][i]), y0=i-0.25, x1=math_mean[n2+o][i]+2*(math_se[n2+o][i]), lwd=2, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA math PV means, age 15, +/- 2se")
We repeat the simple comparative dotplot for the remaining school subjects:
read_mean <- sapply(a0, function(x) weighted.mean(x$read_mean, w=x$SENWT))
ro <- order(read_mean[1:n2])
sci_mean <- sapply(a0, function(x) weighted.mean(x$sci_mean, w=x$SENWT))
so <- order(sci_mean[1:n2])
dotchart(read_mean[ro], label=nms[ro], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(read_mean[n2+ro], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean reading PV means, age 15")
dotchart(sci_mean[so], label=nms[so], cex=0.5, pt.cex=0.9, pch=3, xlim=c(325, 575))
points(sci_mean[n2+so], 1:n2, cex=0.9, pch=4, col="brown1")
legend("bottomright", legend=levels(gender), col=c("black", "brown1"), pch=3:4, pt.cex=0.9, bty="n", cex=0.8)
title("PISA mean science PV means, age 15")
As in the talk, we can make maps of the apparent gender gaps between the means of mean plausible values, first calculating the gaps and adding country identifiers:
math_gap <- math_mean[1:n2] - math_mean[n2+(1:n2)]
read_gap <- read_mean[1:n2] - read_mean[n2+(1:n2)]
sci_gap <- sci_mean[1:n2] - sci_mean[n2+(1:n2)]
gaps <- data.frame(math_gap, read_gap, sci_gap, iso_a3=country,
iso_a2=countries$Alpha_2[co])
We use a 1:50m set of country boundaries (in the 1:110m set, Singapore disappears).
library(rnaturalearth)
library(rnaturalearthdata)
data(countries50)
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.1.0dev-84d972714a, PROJ 6.2.1
countries50 <- st_as_sf(countries50)
We can merge() the gaps data with the country boundaries, and use the tmap package to create the map. Tomorrow we’ll look at binning and colours. Using ttm(), we can toggle to an interactive format, using last_map() to recall the last generated map.
world_gaps <- merge(countries50, gaps, by="iso_a2", all.x=TRUE)
library(tmap)
tm_shape(world_gaps) + tm_fill("math_gap", palette="div", n=7, style="jenks", midpoint=0)
## Warning: The shape world_gaps is invalid. See sf::st_is_valid
#ttm()
#last_map()
In the talk, boxplots are used to explore the variability between students by gender and country; here we tried a caterpillar plot. The PV means also flatten variability - would medians be more sensible?
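One reading of that question, sketched here (not in the original script): replace the weighted mean across students with a weighted median, reusing the split list a0 and the senate weights:
library(matrixStats)   # already attached above; weightedMedian() is used here
math_median <- sapply(a0, function(x) weightedMedian(x$math_mean, w=x$SENWT))
mo <- order(math_median[1:n2])
dotchart(math_median[mo], label=nms[mo], cex=0.5, pt.cex=0.9, pch=3)
points(math_median[n2+mo], 1:n2, cex=0.9, pch=4, col="brown1")
title("PISA weighted median of math PV means, age 15")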
Could aspects of the PISA data set be used in thinking about projects?
Other data sets are made available by Di Cook, and the nullabor package seems interesting, and maybe could also work (well) for projects, maybe also varying graphical representation?
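A minimal nullabor sketch of a lineup (using the built-in mtcars data; purely illustrative of the protocol):
library(nullabor)
library(ggplot2)
# one panel shows the real data, the other 19 show permuted (null) versions;
# can viewers pick out the real panel?
d <- lineup(null_permute("mpg"), mtcars)
ggplot(d, aes(mpg, wt)) + geom_point() + facet_wrap(~ .sample)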
Zeileis and others have discussed the opportunities to be found for effective visual communication in statistical graphics using colours; we distinguish between sequential, diverging and qualitative palettes
There are differences in opinions with regard to colour look-up (continuous or discrete) between implementations - RColorBrewer is discrete
There are several palettes, such as rainbow, cm.colors, heat.colors and others in grDevices
Other packages include viridis, colorspace, and further palettes provided in plotting methods (sf.colors uses sp::bpy.colors)
Fortunately, the RColorBrewer package gives, by permission, access to the ColorBrewer palettes accessible from the ColorBrewer website. Note that ColorBrewer limits the number of classes tightly, only 3–9 sequential classes
library(RColorBrewer)
opar <- par(cex=0.7)
display.brewer.all()
par(opar)
suppressPackageStartupMessages(library(viridis))
library(colorspace)
library(ggplot2)
#devtools::install_github("wilkelab/cowplot")
#install.packages("colorspace", repos = "http://R-Forge.R-project.org")
#devtools::install_github("clauswilke/colorblindr")
library(colorblindr)
n_col <- 10
swatchplot(
"terrain.colors"=rev(terrain.colors(n_col)),
"heat"=rev(heat.colors(n_col)),
"ggplot default"=rev(scales::seq_gradient_pal(low = "#132B43", high = "#56B1F7", space = "Lab")(seq(0, 1, length=n_col))),
"brewer blues"=scales::gradient_n_pal(scales::brewer_pal(type="seq")(9))(seq(0, 1, length=n_col)),
"brewer yellow-green-blue"=scales::gradient_n_pal(scales::brewer_pal(type="seq", palette = "YlGnBu")(9))(seq(0, 1, length=n_col)),
"viridis"=rev(viridis(n_col)),
"magma"=rev(magma(n_col))
)
suppressPackageStartupMessages(library(cartography))
display.carto.all(n = 10)
The colorspace package provides choose_palette() to choose HCL palettes interactively; it returns a palette function.
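Since choose_palette() is interactive, a non-interactive sketch using the HCL palette constructors it builds on (defaults shown; hcl_palettes() lists the named palettes):
library(colorspace)
# pal <- choose_palette()   # interactive; returns a palette function pal(n)
sequential_hcl(7)           # sequential HCL palette with default parameters
diverging_hcl(7)            # diverging HCL palette
qualitative_hcl(5)          # qualitative HCL palette
hcl_palettes()              # overview of the available named HCL palettes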
The colorblindr package from GitHub provides the swatchplot() function to show colour swatches; it also provides an Okabe-Ito qualitative palette as palette_OkabeIto
n_col <- 8
swatchplot(
"Okabe Ito"=palette_OkabeIto,
"ColorBrewer accent"=brewer.pal(n_col, "Accent"),
"ColorBrewer Pastel2"=brewer.pal(n_col, "Pastel2"),
"Qualitative HCL"=qualitative_hcl(n_col)
)
n_col <- 10
swatchplot(
"terrain.colors"=rev(terrain.colors(n_col)),
"terrain HCL"=rev(terrain_hcl(n_col)),
"heat"=rev(heat.colors(n_col)),
"heat HCL"=rev(heat_hcl(n_col)),
"rainbow"=rainbow(n_col),
"rainbow HCL"=rainbow_hcl(n_col)
)