Required current contributed CRAN packages:

I am running R 3.6.2, with recent update.packages().

needed <- c("jpeg", "mapview", "ggplot2", "RColorBrewer", "classInt", "tmap", "sf", "rnaturalearthdata", "rnaturalearth", "matrixStats")

Script

Script and data at https://github.com/rsbivand/BAN422_V20/raw/master/ban422_v20_tues.zip. Download to suitable location, unzip and use as basis.

Introduction

  • Today: binning intervals; conditioning (trellis, lattice, facets); and graphics devices and base graphics

  • Tomorrow: grid graphics; interactive graphics; shiny and dashboards

  • After the lunch break, we’ll go on to look at graphics devices and base graphics

Binning intervals

Binning (class) intervals

  • We saw yesterday that histograms and many other kinds of chart require user choices about the number of bins to be used and bin/class intervals

  • This may also be termed quantization: the division of part of the real line on which we have measured a variable into intervals

  • This can also apply to combined categories if they are recoded to reduce the number of alternatives to be displayed

  • Class intervals are much used in thematic cartography, and I’m the author of the classInt package

Recreating data objects from Monday

QQQ1 <- readRDS("../mon/dicook/pisa_subset.rds")
countries <- readRDS("../mon/dicook/countries.rds")
countries$Alpha_2[countries$Name == "Kosovo"] <- "XK"
a0 <- split(QQQ1, list(QQQ1$CNT, QQQ1$ST004D01T))
math_mean <- sapply(a0, function(x) weighted.mean(x$math_mean, w=x$SENWT))
n2 <- length(math_mean)/2
country <- sapply(strsplit(names(math_mean), "\\."), "[", 1)[1:n2]
co <- match(country, countries$CNT)
gender <- factor(sapply(strsplit(names(math_mean), "\\."), "[", 2))
library(matrixStats)
sqn <- sqrt(sapply(a0, function(x) sum(x$SENWT)))
math_se <- sapply(a0, function(x) weightedSd(x$math_mean, w=x$SENWT))/sqn
read_mean <- sapply(a0, function(x) weighted.mean(x$read_mean, w=x$SENWT))
sci_mean <- sapply(a0, function(x) weighted.mean(x$sci_mean, w=x$SENWT))

Recreating data objects from Monday

math_mean_female <- math_mean[gender == "Female"]
math_mean_male <- math_mean[gender == "Male"]
math_se_female <- math_se[gender == "Female"]
math_se_male <- math_se[gender == "Male"]
math_gap <- math_mean_female - math_mean_male
read_mean_female <- read_mean[gender == "Female"]
read_mean_male <- read_mean[gender == "Male"]
read_gap <- read_mean_female - read_mean_male
sci_mean_female <- sci_mean[gender == "Female"]
sci_mean_male <- sci_mean[gender == "Male"]
sci_gap <- sci_mean_female - sci_mean_male
df <- data.frame(math_mean_female, math_mean_male, math_se_female, math_se_male,
  read_mean_female, read_mean_male, sci_mean_female, sci_mean_male,
  math_gap, read_gap, sci_gap, iso_a3=country, iso_a2=countries$Alpha_2[co])

Country boundaries

We use a 1:50m set of country boundaries (in the 1:110m set, Singapore disappears).

library(rnaturalearth)
library(rnaturalearthdata)
data(countries50)
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.1.0dev-84d972714a, PROJ 6.2.1
countries50 <- st_as_sf(countries50)
countries50$iso_a2[countries50$name == "Kosovo"] <- "XK"

Using tmap for maps

We can merge() the gaps data with the country boundaries, and the tmap package to create the map. In doing so yesterday, we did not look at the arguments. palette= sets the colour palette ("div" for diverging at zero), n= the intended number of bins/classes, style= the method for constructing the bins, and midpoint=0 to split the bins on zero.

world_gaps <- merge(countries50[!is.na(countries50$iso_a2),], df, by="iso_a2", all.x=TRUE)
library(tmap)
tm_shape(world_gaps) + tm_fill("math_gap", palette="RdYlGn", n=7, style="jenks", midpoint=0)
## Warning: The shape world_gaps is invalid. See sf::st_is_valid

Class intervals

  • Class intervals can be chosen in many ways, and some have been collected for convenience in the classInt package

  • The first problem is to assign class boundaries to values in a single dimension, for which many classification techniques may be used, including pretty, quantile, natural breaks among others, or even simple fixed values

  • From there, the intervals can be used to generate colours from a colour palette, using the very nice colorRampPalette() function

  • Because there are potentially many alternative class memberships even for a given number of classes, choosing a communicative set matters

Numbers of intervals

We may choose the number of intervals ourselves arbitrarily or after examination of the data, or use provided functions, such as nclass.Sturges(), nclass.scott() or nclass.FD(). In hist(), nclass.Sturges() is used by default. We can also split on sign(), but handling diverging intervals often involves more work.

x <- na.exclude(world_gaps$math_gap)
nclass.Sturges(x)
## [1] 8
nclass.scott(x)
## [1] 6
(n <- nclass.FD(x))
## [1] 7
nclass.Sturges(x[sign(x) == 1L])
## [1] 6
nclass.Sturges(x[sign(x) == -1L])
## [1] 7

Class interval breaks: pretty

The default intervals for bins in hist() are pretty(range(x), n = breaks, min.n = 1), where breaks <- nclass.Sturges(x). The function computes a sequence of about n+1 equally spaced ‘round’ values which cover the range of the values in x. The values are chosen like values of coins or banknotes (1, 2, 5, etc.)

n
## [1] 7
range(x)
## [1] -26.96514  17.60435
(p <-pretty(x, n=n))
##  [1] -30 -25 -20 -15 -10  -5   0   5  10  15  20
length(p)
## [1] 11
(p <-pretty(x, n=n, high.u.bias=3))
## [1] -30 -20 -10   0  10  20
length(p)
## [1] 6

If we use the classIntervals() function from classInt, we can pass through arguments to the function called through style=, and note that n will not necessarily be the number of output classes. By default, intervalClosure= is "left", so [-30, -20) means numbers greater than and equal to (>=) -30 and less than (<) -20; [10, 20] is numbers >= 10 and <= 20.

suppressPackageStartupMessages(library(classInt))
(ppd <- classIntervals(x, n=n, style="pretty",
  cutlabels=TRUE))
## style: pretty
##   one of 49,280,065,120 possible partitions of this variable into 10 classes
## [-30,-25) [-25,-20) [-20,-15) [-15,-10)  [-10,-5)    [-5,0)     [0,5)    [5,10) 
##         1         2         6        10        20         6        13         7 
##   [10,15)   [15,20] 
##         3         1
(pp3 <- classIntervals(x, n=n, style="pretty",
  high.u.bias=3, cutlabels=TRUE))
## style: pretty
##   one of 814,385 possible partitions of this variable into 5 classes
## [-30,-20) [-20,-10)   [-10,0)    [0,10)   [10,20] 
##         3        16        26        20         4

Class interval plot method

library(RColorBrewer)
pal <- brewer.pal(7, "RdYlGn")
oopar <- par(mfrow=c(1, 2))
plot(ppd, pal=pal, main="Pretty (default)", xlab="", ylab="")
plot(pp3, pal=pal, main="Pretty (compressed)", xlab="", ylab="")

par(oopar)

Class interval breaks: quantiles

Quantiles may seem simple, but there are many ways of implementing the breaks, see quantile(). We can also set the dataPrecision= to make the class breaks easier to read:

(pq7 <- classIntervals(x, n=n, style="quantile",
  type=7L, dataPrecision=2))
## style: quantile
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-14.38)  [-14.38,-9.75)   [-9.75,-6.97)   [-6.97,-4.01)    [-4.01,1.87) 
##              10              10              10               9              10 
##     [1.87,6.51)    [6.51,17.61] 
##              10              10
(pq3 <- classIntervals(x, n=n, style="quantile",
  type=3L, dataPrecision=2))
## style: quantile
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-14.67)  [-14.67,-9.96)   [-9.96,-7.12)    [-7.12,-5.2)     [-5.2,1.85) 
##               9              10              10               9              10 
##     [1.85,6.51)    [6.51,17.61] 
##              10              11
oopar <- par(mfrow=c(1, 2))
plot(pq7, pal=pal, main="Quantile type=7", xlab="", ylab="")
plot(pq3, pal=pal, main="Quantile type=3", xlab="", ylab="")

par(oopar)

Class interval breaks: equal and standard deviations

The "equal" style divides the range into n equal parts, while "sd" centres and scales the variable before using pretty on the result, converting back to get the breaks:

(peq <- classIntervals(x, n=n, style="equal",
  dataPrecision=2))
## style: equal
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-20.59) [-20.59,-14.23)  [-14.23,-7.86)   [-7.86,-1.49)    [-1.49,4.88) 
##               3               8              17              14              16 
##    [4.88,11.24)   [11.24,17.61] 
##               7               4
diff(peq$brks)
## [1] 6.36707 6.36707 6.36707 6.36707 6.36707 6.36707 6.36707
(psd <- classIntervals(x, n=n, style="sd", 
  high.u.bias=3, dataPrecision=2))
## style: sd
##   one of 10,424,128 possible partitions of this variable into 6 classes
## [-32.85,-23.44) [-23.44,-14.03)  [-14.03,-4.62)    [-4.62,4.79)     [4.79,14.2) 
##               1              10              28              19               9 
##    [14.2,23.61] 
##               2
(psd$brks-attr(psd, "parameters")["center"]) /
  attr(psd, "parameters")["scale"]
## [1] -3 -2 -1  0  1  2  3
oopar <- par(mfrow=c(1, 2))
plot(peq, pal=pal, main="Equal intervals", xlab="", ylab="")
plot(psd, pal=pal, main="Standard deviations", xlab="", ylab="")

par(oopar)

Class interval breaks: clusters on the line

Hierarchical clustering is a well-known approach to trying to find similarities between multivariate observations based on dissimilarities or distances. If we use distances on the real line, we can build cluster trees for our univariate observations and chosen method=, and cut the trees for chosen numbers of classes:

(phc7 <- classIntervals(x, n=n, style="hclust", 
  method="complete", dataPrecision=2))
## style: hclust
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-24.51) [-24.51,-13.13)   [-13.13,-6.6)    [-6.6,-1.45)    [-1.45,5.34) 
##               1              12              17              12              16 
##    [5.34,10.39)   [10.39,17.61] 
##               7               4
(phc5 <- getHclustClassIntervals(phc7, 5))
## style: hclust
##   one of 814,385 possible partitions of this variable into 5 classes
## [-26.96514,-24.51651)  [-24.51651,-13.1365)  [-13.1365,-1.453571) 
##                     1                    12                    29 
##  [-1.453571,10.38568)   [10.38568,17.60435] 
##                    23                     4
oopar <- par(mfrow=c(1, 2))
plot(phc7, pal=pal, main="Hierarchical clusters (7)", xlab="", ylab="")
plot(phc5, pal=pal, main="Hierarchical clusters (5)", xlab="", ylab="")

par(oopar)

kmeans provides a non-hierarchical clustering approach, and can be combined with hierarchical clustering using bagged clustering provided by e1071::bclust(); here we are using set.seed(1) to be able to reproduce the output:

set.seed(1)
(pk7 <- classIntervals(x, n=n, style="kmeans",
  dataPrecision=2))
## style: kmeans
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-17.46) [-17.46,-10.43)  [-10.43,-4.51)    [-4.51,0.64)     [0.64,5.34) 
##               5              14              20               7              12 
##    [5.34,10.39)   [10.39,17.61] 
##               7               4
set.seed(1)
(pbc7 <- classIntervals(x, n=n, style="bclust", 
  verbose=FALSE, dataPrecision=2))
## style: bclust
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-24.51) [-24.51,-19.14) [-19.14,-13.13)  [-13.13,-4.51)    [-4.51,5.34) 
##               1               3               9              26              19 
##    [5.34,12.21)   [12.21,17.61] 
##               8               3
oopar <- par(mfrow=c(1, 2))
plot(pk7, pal=pal, main="K-means clusters", xlab="", ylab="")
plot(pbc7, pal=pal, main="Bagged clusters", xlab="", ylab="")

par(oopar)

Class interval breaks: cartographic natural breaks

There are two implementations of the cartographic natural breaks approach, contributed by Hisaji Ono. As we have already seen, most methods for defining intervals differ in details. As with the cluster approaches, the intention is to place the breaks in obvious gaps in the distribution of the variable.

(pj7 <- classIntervals(x, n=n, style="jenks",
  dataPrecision=2))
## style: jenks
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.97,-19.94] (-19.94,-13.64]  (-13.64,-8.17]    (-8.17,-2.3]     (-2.3,4.15] 
##               4               9              15              14              16 
##      (4.15,9.2]      (9.2,17.6] 
##               7               4
(pf7 <- classIntervals(x, n=n, style="fisher",
  dataPrecision=2))
## style: fisher
##   one of 109,453,344 possible partitions of this variable into 7 classes
## [-26.96,-19.14) [-19.14,-13.13)  [-13.13,-7.72)   [-7.72,-1.45)    [-1.45,5.34) 
##               4               9              15              14              16 
##    [5.34,10.39)   [10.39,17.61] 
##               7               4
oopar <- par(mfrow=c(1, 2))
plot(pj7, pal=pal, main="Jenks natural breaks", xlab="", ylab="")
plot(pf7, pal=pal, main="Fisher natural breaks", xlab="", ylab="")

par(oopar)

Class interval assignment at breaks

Once found, the breaks need to be applied to the data vector (or vectors - data to be contrasted may need shared intervals). In classInt the findCols() function can be used, wrapping base::findInterval(); base::cut() could also be used:

str(findCols(pj7))
##  num [1:69] 6 6 1 1 4 2 5 2 3 3 ...
str(findInterval(x, pj7$brks, all.inside=TRUE,
  left.open=TRUE))
##  int [1:69] 6 6 1 1 4 2 5 2 3 3 ...
str(as.integer(cut(x, breaks=pj7$brks,
  include.lowest=TRUE)))
##  int [1:69] 6 6 1 1 4 2 5 2 3 3 ...

How does ggplot do this?

suppressPackageStartupMessages(library(ggplot2))
stat_bin
## function (mapping = NULL, data = NULL, geom = "bar", position = "stack", 
##     ..., binwidth = NULL, bins = NULL, center = NULL, boundary = NULL, 
##     breaks = NULL, closed = c("right", "left"), pad = FALSE, 
##     na.rm = FALSE, show.legend = NA, inherit.aes = TRUE) 
## {
##     layer(data = data, mapping = mapping, stat = StatBin, geom = geom, 
##         position = position, show.legend = show.legend, inherit.aes = inherit.aes, 
##         params = list(binwidth = binwidth, bins = bins, center = center, 
##             boundary = boundary, breaks = breaks, closed = closed, 
##             pad = pad, na.rm = na.rm, ...))
## }
## <bytecode: 0x1cbceda0>
## <environment: namespace:ggplot2>
StatBin
## <ggproto object: Class StatBin, Stat, gg>
##     aesthetics: function
##     compute_group: function
##     compute_layer: function
##     compute_panel: function
##     default_aes: uneval
##     extra_params: na.rm
##     finish_layer: function
##     non_missing_aes: 
##     parameters: function
##     required_aes: x
##     retransform: TRUE
##     setup_data: function
##     setup_params: function
##     super:  <ggproto object: Class Stat, gg>
cat(capture.output(print(StatBin$compute_group))[-(1:5)], sep="\n")
##   <Inner function (f)>
##     function (data, scales, binwidth = NULL, bins = NULL, center = NULL, 
##     boundary = NULL, closed = c("right", "left"), pad = FALSE, 
##     breaks = NULL, origin = NULL, right = NULL, drop = NULL, 
##     width = NULL) 
## {
##     if (!is.null(breaks)) {
##         if (!scales$x$is_discrete()) {
##             breaks <- scales$x$transform(breaks)
##         }
##         bins <- bin_breaks(breaks, closed)
##     }
##     else if (!is.null(binwidth)) {
##         if (is.function(binwidth)) {
##             binwidth <- binwidth(data$x)
##         }
##         bins <- bin_breaks_width(scales$x$dimension(), binwidth, 
##             center = center, boundary = boundary, closed = closed)
##     }
##     else {
##         bins <- bin_breaks_bins(scales$x$dimension(), bins, center = center, 
##             boundary = boundary, closed = closed)
##     }
##     bin_vector(data$x, bins, weight = data$weight, pad = pad)
## }
ggplot2:::bin_breaks_bins
## function (x_range, bins = 30, center = NULL, boundary = NULL, 
##     closed = c("right", "left")) 
## {
##     stopifnot(length(x_range) == 2)
##     bins <- as.integer(bins)
##     if (bins < 1) {
##         stop("Need at least one bin.", call. = FALSE)
##     }
##     else if (zero_range(x_range)) {
##         width <- 0.1
##     }
##     else if (bins == 1) {
##         width <- diff(x_range)
##         boundary <- x_range[1]
##     }
##     else {
##         width <- (x_range[2] - x_range[1])/(bins - 1)
##     }
##     bin_breaks_width(x_range, width, boundary = boundary, center = center, 
##         closed = closed)
## }
## <bytecode: 0x1c036958>
## <environment: namespace:ggplot2>
cat(capture.output(print(ggplot2:::bin_breaks_width))[-(4:8)], sep="\n")
## function (x_range, width = NULL, center = NULL, boundary = NULL, 
##     closed = c("right", "left")) 
## {
##     if (!is.null(boundary) && !is.null(center)) {
##         stop("Only one of 'boundary' and 'center' may be specified.")
##     }
##     else if (is.null(boundary)) {
##         if (is.null(center)) {
##             boundary <- width/2
##         }
##         else {
##             boundary <- center - width/2
##         }
##     }
##     x_range <- as.numeric(x_range)
##     width <- as.numeric(width)
##     boundary <- as.numeric(boundary)
##     shift <- floor((x_range[1] - boundary)/width)
##     origin <- boundary + shift * width
##     max_x <- x_range[2] + (1 - 1e-08) * width
##     breaks <- seq(origin, max_x, width)
##     if (length(breaks) > 1e+06) {
##         stop("The number of histogram bins must be less than 1,000,000.\nDid you make `binwidth` too small?", 
##             call. = FALSE)
##     }
##     bin_breaks(breaks, closed = closed)
## }
## <bytecode: 0x1e1e3a30>
## <environment: namespace:ggplot2>
ggplot2:::bin_breaks
## function (breaks, closed = c("right", "left")) 
## {
##     bins(breaks, closed)
## }
## <bytecode: 0x3b1c0b8>
## <environment: namespace:ggplot2>
ggplot2:::bins
## function (breaks, closed = c("right", "left"), fuzz = 1e-08 * 
##     stats::median(diff(breaks))) 
## {
##     stopifnot(is.numeric(breaks))
##     closed <- match.arg(closed)
##     breaks <- sort(breaks)
##     if (closed == "right") {
##         fuzzes <- c(-fuzz, rep.int(fuzz, length(breaks) - 1))
##     }
##     else {
##         fuzzes <- c(rep.int(-fuzz, length(breaks) - 1), fuzz)
##     }
##     structure(list(breaks = breaks, fuzzy = breaks + fuzzes, 
##         right_closed = closed == "right"), class = "ggplot2_bins")
## }
## <bytecode: 0x1bb0b1e8>
## <environment: namespace:ggplot2>

Visualizing on web maps

We can also pass palettes and class intervals through (maybe need to stretch the min-max interval bounds)

library(mapview)
mapview(world_gaps, zcol="math_gap",
  col.regions=pal, at=pj7$brks)

But tmap just works

  • The tmap package provides cartographically informed, grammar of graphics (gg) based functionality now, like ggplot2 using grid graphics.

  • John McIntosh tried with ggplot2, with quite nice results

  • I suggested he look at tmap, and things got better, because tmap can switch between interactive and static viewing

  • tmap also provides direct access to classInt class intervals

Like the sf::plot() method, tmap plotting can use classInt internally and accepts a palette (try looking at tmaptools::palette_explorer() for ColorBrewer palettes):

library(tmap)
tm_shape(world_gaps) + tm_fill("math_gap",
  n=n, style="jenks", palette=pal, midpoint=0) +
  tm_borders(lwd=0.5, alpha=0.4)
## Warning: The shape world_gaps is invalid. See sf::st_is_valid

What about facets/panels?

If we give a short vector of column names, we will get facet/panel displays, but need to use tm_facets(free.scales=FALSE) to use the same class intervals

tm_shape(world_gaps) + tm_fill(c("math_mean_female", "math_mean_male"), n=7, style="jenks", palette=pal) +
  tm_facets(free.scales=FALSE) + tm_borders(lwd=0.5, alpha=0.4) + tm_layout(panel.labels=c("Female", "Male"))
## Warning: The shape world_gaps is invalid. See sf::st_is_valid

Trellis/lattice/facets: conditioning

Trellis/lattice/facets

  • The underlying logic of conditioned graphics is that multiple displays (windows, panes) use the same scales and representational elements for comparison

  • Using the same scales and representational elements for comparison can be done manually, imposing the same scales, colours and shapes in each plot and laying the plots out in a grid

  • Trellis graphics automated this in S, and lattice provides similar but enhanced facilities in R with a formula interface

  • ggplot2 and other packages also provide similar functionalities

Subsetting PISA Nordics

To try this out, let’s use about 25,000 PISA Science PV means by country, gender and count of books at home, choosing the appropriate country and gender data.frame objects from the big list:

nordics <- c("DNK", "FIN", "ISL", "NOR", "SWE")
a1 <- a0[which(sapply(strsplit(names(a0), "\\."),
  "[", 1) %in% nordics)]
a11 <- do.call("rbind", a1)
a11$CNT <- droplevels(a11$CNT)
levels(a11$ST013Q01TA) <- sub("More than",
  ">", sub(" books", "", levels(a11$ST013Q01TA)))
saveRDS(a11, "../mon/dicook/a11.rds")
library(lattice)
library(ggplot2)

Lattice - Science mean PVs by gender and country

histogram(~ sci_mean | CNT*ST004D01T, data=a11, main="Science mean PVs", xlab="")

ggplot - Science mean PVs by gender and country

ggplot(a11, aes(x=sci_mean)) + geom_histogram() + facet_wrap(~ ST004D01T + CNT, nrow=2) +
  ggtitle("Science mean PVs") + xlab("")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Lattice - Science mean PVs by gender and country (counts)

histogram(~ sci_mean | CNT*ST004D01T, data=a11, main="Science mean PVs", xlab="", type="count")

Lattice - Science mean PVs by gender and country

densityplot(~ sci_mean | ST004D01T, groups=CNT, data=a11, auto.key=list(space="right"),
  plot.points=FALSE, main="Science mean PVs", xlab="")

ggplot - Science mean PVs by gender and country

ggplot(a11, aes(x=sci_mean)) + geom_density(aes(colour=CNT)) + facet_wrap(~ ST004D01T, ncol=2) + 
  ggtitle("Science mean PVs") + xlab("")

Lattice - Science mean PVs by gender and country

densityplot(~ sci_mean | ST004D01T, groups=CNT, data=a11, auto.key=list(space="right"),
  plot.points=FALSE, main="Science mean PVs", xlab="", type=c("l", "g"))

Lattice - Science mean PVs by gender and country

densityplot(~ sci_mean | CNT, groups=ST004D01T, a11, auto.key=list(space="right"),
  plot.points=FALSE, main="Science mean PVs", xlab="")

ggplot - Science mean PVs by gender and country

ggplot(a11, aes(x=sci_mean)) + geom_density(aes(colour=ST004D01T)) + facet_wrap(~ CNT, ncol=3) + 
  ggtitle("Science mean PVs") + xlab("")

Lattice - Science mean PVs by gender and books at home

bwplot(sci_mean ~ ST013Q01TA | ST004D01T, a11, main="Science mean PVs", xlab="books", ylab="")

ggplot - Science mean PVs by gender and books at home

ggplot(na.omit(a11), aes(ST013Q01TA, sci_mean)) + geom_boxplot(aes(group=ST013Q01TA)) +
  facet_wrap(~ ST004D01T) + ggtitle("Science mean PVs") + xlab("books") + ylab("")

Lattice - Science mean PVs by gender and books at home

bwplot(sci_mean ~ ST013Q01TA | ST004D01T, a11, main="Science mean PVs", xlab="books", ylab="", varwidth=TRUE)

ggplot - Science mean PVs by gender and books at home

ggplot(na.omit(a11), aes(ST013Q01TA, sci_mean)) + geom_boxplot(aes(group=ST013Q01TA), varwidth=TRUE) +
  facet_wrap(~ ST004D01T) + ggtitle("Science mean PVs") + xlab("books") + ylab("")

Base graphics

Base graphics

  • In early R, all the graphics functionality was in base; graphics was split out of base in 1.9.0 (2004-04-12), and grDevices in 2.0.0 (2004-10-04)

  • When R starts now, the graphics and grDevices packages are loaded and attached, ready for use

  • graphics provides the user-level graphical functions and methods, especially the most used plot() methods that many other packages extend

  • grDevices provides lower-level interfaces to graphics devices, some of which create files and others display windows

Base graphics packages

The search() function shows the packages present in the search path, so here we run an instance of R through system() to check the startup status. In RStudio, one will also see "tools:rstudio" in the search path.

cat(system('echo "search()" | R --no-save --vanilla', intern=TRUE)[20:23], sep="\n")
## > search()
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

Graphics devices

  • The PDF (pdf()) and PostScript (postscript()) devices write commonly used vector files

  • Display formats and raster/pixel file devices must rasterize vector graphics input

  • Display formats vary by platform: X11() on X11 systems, quartz() on macOS, windows() on Windows and are available if compiled

  • png(), jpeg(), tiff() are generally available, svg() and cairo_pdf() and cairo_ps() may also be built

The capabilities() function shows what R itself can offer, including non-graphics capabilities, and we can also check the versions of external software used

capabilities()
##        jpeg         png        tiff       tcltk         X11        aqua 
##        TRUE        TRUE        TRUE        TRUE        TRUE       FALSE 
##    http/ftp     sockets      libxml        fifo      cledit       iconv 
##        TRUE        TRUE        TRUE        TRUE       FALSE        TRUE 
##         NLS     profmem       cairo         ICU long.double     libcurl 
##        TRUE       FALSE        TRUE        TRUE        TRUE        TRUE
grSoftVersion()
##                     cairo                    libpng                      jpeg 
##                  "1.16.0"                  "1.6.37"                     "6.2" 
##                   libtiff 
## "LIBTIFF, Version 4.0.10"

When there is no device active, the current device is the null device at position 1; one can open several and move between active devices. In RStudio, the devices used vary by context - if the console is used, RStudioGD will be used, backed by png; in a notebook, small embedded devices appear (apparently PNG inline images).In addition to providing a device, RStudio appears to work quite hard to embed fonts and crop white space on the edges of the graphics file

a11 <- readRDS("../mon/dicook/a11.rds")
png("plot.png")
dev.cur()
## png 
##   3
boxplot(sci_mean ~ ST004D01T, a11)
dev.off()
## png 
##   2
png(tempfile())
unlist(dev.capabilities())
##      semiTransparency transparentBackground           rasterImage 
##                "TRUE"                "semi"                 "yes" 
##               capture               locator 
##               "FALSE"               "FALSE"
dev.size("in")
## [1] 6.666667 6.666667
dev.interactive()
## [1] FALSE
dev.off()
## png 
##   2

The markdown code to include say a logo image is: ![alt text][logo], and using this we can display the boxplot created above.

Read in and display plot.png:

[plot.png][plot.png]

Base graphics

  • The graphics package provides the basic components for drawing on the devices, and user-level functions bundling the components

  • A boxplot() will need to be placed in a plotting area, the boxes and whiskers drawn, axes and ticks drawn, and anotation added

  • The graphical parameters are set and may be modified using the par() function; the layout() function can also be used to position multiple display elements.

  • Inspecting ?par shows how much can be modified within the limits of the base graphics system

We can manipulate the number of plots mfrow=, axis label orientation las=, line width lwd=, point symbol pch= and so on

opar <- par(mfrow=c(2,2))
boxplot(sci_mean ~ ST004D01T, a11, las=0)
boxplot(sci_mean ~ ST004D01T, a11, las=1)
boxplot(sci_mean ~ ST004D01T, a11, lwd=2)
boxplot(sci_mean ~ ST004D01T, a11, pch=4)

par(opar)

We can build up the plot bit-by-bit - here using the axes as an example

opar <- par(mfrow=c(2,2))
boxplot(sci_mean ~ ST004D01T, a11, axes=FALSE)
boxplot(sci_mean ~ ST004D01T, a11, axes=FALSE)
axis(2, las=1)
boxplot(sci_mean ~ ST004D01T, a11, axes=FALSE)
axis(2, las=1)
axis(1, at=1:2, labels=levels(a11$ST004D01T))
boxplot(sci_mean ~ ST004D01T, a11, axes=FALSE)
axis(2, las=1)
axis(1, at=1:2, labels=levels(a11$ST004D01T))
box()

par(opar)

We can manipulate the margins of the plot region within the figure region; layout() helps us align the output

layout(cbind(c(1, 1), c(2,2)))
opar <- par(mar=c(3, 3, 0, 0)+0.1)
boxplot(sci_mean ~ ST004D01T, a11)
par(mar=c(10, 3, 4, 0)+0.1)
boxplot(sci_mean ~ ST004D01T, a11)

layout(1)

It is tricky to use plot region metrics in RStudioGD, so we’ll try the PNG device to look at device dimensions and to plot some logo images

library("jpeg")
im <- as.raster(readJPEG("image001.jpg"))
prop <- dim(im)[2]/dim(im)[1]
png("plot1.png")
par("din")
## [1] 6.666667 6.666667
plot(1, type="n", axes=FALSE, xlim=c(1, 10),
  ylim=c(1, 10), asp=1, ann=FALSE)
rasterImage(im, 1:9, 1:9, (1:9) + prop, 2:10)
dev.off()
## png 
##   2
plot1.png

plot1.png

The plot region and figure region are shown using R Graphics code

opar <- par(oma=rep(3, 4), bg="gray80")
plot(c(0, 1), c(0, 1), type="n", ann=FALSE,
  axes=FALSE); box("outer", col="gray")
par(xpd=TRUE)
rect(-1, -1, 2, 2, col="gray90")
box("figure"); par(xpd=FALSE)
rect(-1, -1, 2, 2, col="gray80")
box("plot", lty="dashed")
text(.5, .5, "Plot Region")
mtext("Figure Region", side=3, line=2)
for (i in 1:4) mtext(paste("Outer margin", i),
  side=i, line=1, outer=TRUE)

par(opar)

Base plot methods

The plot.lm() method in base package stats shows diagnostic plots, here indicating potential outliers; outlier detection is also provided in OutliersO3 and HDoutliers among others

opar <- par(mfrow=c(2,2))
plot(lm(math_mean ~ read_mean, data=a11), pch=".")

par(opar)