I am running R 3.6.1, with recent update.packages()
needed <- c("rbenchmark", "wrapr", "magrittr", "lme4", "reticulate", "Matrix", "mgcv", "nlme", "Rcpp", "lattice", "MASS", "grid", "jsonlite", "minqa", "nloptr", "boot", "splines", "tools", "compiler", "BiocManager")
Script and data at Download to suitable location, unzip and use as basis.
Yesterday was mostly for setting the scene; today and tomorrow should make more progress
The classes will be split 45m introduction, 45m exercises/exploration/seminar
We will also be open to multi-speed approaches, rather than moving in convoy; those moving more slowly are still learning, and can benefit from what those with stronger prior knowledge and experience are doing
In S2 syntax, there were objects; in S3, some objects also had a class attribute set, to offer guidance on what the object might contain
In S4, the contents of objects were formalised, moving checks from methods dispatched by the class of the first method argument to the classes themselves
In OOP in other languages, methods belong to objects, and RC (reference classes) and R6 systems have been developed to provide these
RC are linked to the success of Rcpp, and perhaps used in reticulate to interface Python from R (see keras and tensorflow)
Formulae provide the S3 modelling interface, and use non-standard evaluation
They tell us where to look for variables in modelling, and which transformations to apply to them
They provide us with what we need from our data in terms of output from analyses
They are very flexible, with update methods to modify our approaches
There are plenty of base R functions that use non-standard evaluation, such as formula and library
The underlying issue is often how to point to objects within other objects, where the encompassing object is an environment or data.frame
Connections to files and databases include the original meaning of “pipe”
More recently, pipes may be used instead of nested function arguments
Classes and objects appeared first in Simula from the Norwegian Computing Center in Oslo, and were intended to further encapsulation in programs
C++ was the extension of C to include Simula-like object handling, but, unlike Simula, without a garbage collector
More modern languages, like Java, settled on a strict OOP view of classes of objects that contained methods, but until the 1990s, this was not the only possibility
S adopted a functional class quasi-system with S3, where generic methods stood apart from classes of objects, but related to them
As in file names, various non-alphanumeric characters can be used to separate parts of names, for example the . (dot)
In S and early R, the _ underscore was used for assignment like <- (see this posting)
So a central S3 class was called data.frame, and coercion to a data.frame used as.data.frame.matrix for a matrix argument, and as.data.frame.list for a list argument
Here the as.data.frame part was sufficient, and method dispatch would choose the appropriate implementation by matching the class of the first argument to the final dot-separated part of the list of available methods
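A minimal sketch of this dispatch rule, assuming a made-up generic called describe (the generic and its methods are hypothetical, not part of base R); the last two lines use the real as.data.frame generic in the same way:

```r
# Hypothetical generic; methods are named generic.class, as in as.data.frame.matrix
describe <- function(x, ...) UseMethod("describe")
describe.matrix <- function(x, ...) paste("matrix with", nrow(x), "rows")
describe.list <- function(x, ...) paste("list with", length(x), "components")
describe.default <- function(x, ...) paste("object of class", class(x)[1])

# Only the generic is called; the class of the first argument picks the method
m_desc <- describe(matrix(1:6, nrow = 2))
l_desc <- describe(list(a = 1, b = "x"))
d_desc <- describe(1:3)   # no integer method, so the default is used

# The base R coercion generic behaves the same way
df_from_matrix <- as.data.frame(matrix(1:6, nrow = 2))
df_from_list <- as.data.frame(list(a = 1:2, b = c("x", "y")))
```

The user never types the part after the final dot; it is filled in by dispatch.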
Rasmus Bååth has a report on naming conventions used in R some years ago
More recently, lowerCamelCase and snake_case have become predominant, with most recent code being snake_case (all lower case and words separated by underscore)
Obviously, the class-matching component of generic methods for S3 objects has to be separated by a dot
It is easy to create new methods for existing generic functions, like print, plot, or summary
It is also easy to create new classes: no definition is needed, but as software develops, the class-specific methods often need to guard against the absence of object components, typically with !is.null()
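As a sketch only, here is a new S3 class with methods for the existing format and print generics; the class name toy_fit, its constructor and its components are all hypothetical:

```r
# "toy_fit" is a hypothetical class; setting the attribute is all that is needed
new_toy_fit <- function(coef, note = NULL) {
  obj <- list(coef = coef, note = note)
  class(obj) <- "toy_fit"
  obj
}

# Method for the existing format generic
format.toy_fit <- function(x, ...) {
  out <- paste("toy_fit with", length(x$coef), "coefficients")
  # Guard: objects created without a note have no such component
  if (!is.null(x$note)) out <- paste0(out, " (", x$note, ")")
  out
}

# Method for the existing print generic, built on format
print.toy_fit <- function(x, ...) {
  cat(format(x), "\n")
  invisible(x)
}

fit_a <- new_toy_fit(c(1.5, -0.2))
fit_b <- new_toy_fit(c(1.5, -0.2), note = "refit")
print(fit_a)
print(fit_b)
```

Nothing checks that an object carrying this class attribute really has a coef component, which is why such guards accumulate over time.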
If you save an S3 object, say to an RDS file, possibly for serialization, the package context of the class will be lost
Then the class attribute will be set, but which package provides the methods for that class is not recorded
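A small sketch of the round trip, assuming a hypothetical class name orphan_class for which no package provides methods:

```r
# Hypothetical class with no methods defined anywhere
obj <- structure(list(value = 42), class = "orphan_class")

tmp <- tempfile(fileext = ".rds")
saveRDS(obj, tmp)
restored <- readRDS(tmp)
unlink(tmp)

# The class attribute survives the round trip ...
restored_class <- class(restored)

# ... but nothing in the file says where methods should come from;
# here no print method for the class can be found at all
has_method <- !is.null(getS3method("print", "orphan_class", optional = TRUE))
```

In practice the methods only become available again once the package that defines them has been loaded.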
oS <- search()  # search path before attaching mgcv
library(mgcv)
gm <- gam(formula=mpg ~ s(wt), data=mtcars)
class(gm)
## [1] "gam" "glm" "lm"
## Family: gaussian
## Link function: identity
## Formula:
## mpg ~ s(wt)
## Estimated degrees of freedom:
## 2.14 total = 3.14
## GCV score: 7.887675
## Call: gam(formula = mpg ~ s(wt), data = mtcars)
## Coefficients:
## (Intercept) s(wt).1 s(wt).2 s(wt).3 s(wt).4
## 20.09062 -0.05848 -0.47219 -0.02163 0.43973
## s(wt).5 s(wt).6 s(wt).7 s(wt).8 s(wt).9
## -0.38559 -0.61434 -0.27343 -3.45191 -4.99051
## Degrees of Freedom: 31 Total (i.e. Null); 28.85932 Residual
## Null Deviance: 1126
## Residual Deviance: 205.3 AIC: 158.6
Because library affects the search path, we have added visible methods compared to our first view of those available
nS <- search()
nS[!(nS %in% oS)]
## [1] "package:mgcv" "package:nlme"
In the online version of here, and in the forthcoming second edition, there are short descriptions of formal S4 classes
They were introduced in the Green Book
The methods package provides S4 classes in R and is used both for formal S4 classes and for RC reference classes also used in Rcpp modules
S4 classes are used extensively in Bioconductor packages
cran_deps <- grep("methods", cran_db[, "Depends"])
cran_imps <- grep("methods", cran_db[, "Imports"])
cran_methods_packages <- cran_db[unique(sort(c(cran_deps, cran_imps))), "Package"]
## [1] 15109
## [1] 2838
bioc_deps <- grep("methods", bioc_db[, "Depends"])
bioc_imps <- grep("methods", bioc_db[, "Imports"])
bioc_methods_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps))), "Package"]
## [1] 3053
## [1] 1541
Just to check how many of the CRAN packages using methods may be being driven by Rcpp modules
cran_deps <- grep("Rcpp", cran_db[, "Depends"])
cran_imps <- grep("Rcpp", cran_db[, "Imports"])
cran_lt <- grep("Rcpp", cran_db[, "LinkingTo"])
cran_Rcpp_packages <- cran_db[unique(sort(c(cran_deps, cran_imps, cran_lt))), "Package"]
cran_Rcpp_methods_packages <- intersect(cran_methods_packages, cran_Rcpp_packages)
## [1] 1846
## [1] 586
bioc_deps <- grep("Rcpp", bioc_db[, "Depends"])
bioc_imps <- grep("Rcpp", bioc_db[, "Imports"])
bioc_lt <- grep("Rcpp", bioc_db[, "LinkingTo"])
bioc_Rcpp_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps, bioc_lt))), "Package"]
bioc_Rcpp_methods_packages <- intersect(bioc_methods_packages, bioc_Rcpp_packages)
## [1] 183
## [1] 135
Writing S4 classes involves thinking ahead, to plan a hierarchy of classes (and virtual classes)
If it is possible to generalise methods in the inheritance tree of class definitions, a method can be used on all descendants inheriting from a root class
Formal classes also provide certainty that the classes contain slots as required, and objects can be checked for validity
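The points above can be sketched with a tiny hierarchy; the classes Shape and Circle and the generic label are hypothetical names chosen for illustration:

```r
library(methods)

# Hypothetical virtual root class with one slot
setClass("Shape",
         representation = representation("VIRTUAL", name = "character"))

# Concrete descendant with its own slot and a validity check
setClass("Circle", contains = "Shape",
         representation = representation(radius = "numeric"),
         validity = function(object) {
           if (length(object@radius) != 1L || object@radius < 0)
             "radius must be a single non-negative number"
           else TRUE
         })

# A method defined for the root class is inherited by all descendants
setGeneric("label", function(obj) standardGeneric("label"))
setMethod("label", "Shape", function(obj) paste("shape:", obj@name))

circ <- new("Circle", name = "unit", radius = 1)
circ_label <- label(circ)

# Creating an invalid object fails the validity check
bad <- tryCatch(new("Circle", name = "bad", radius = -1),
                error = function(e) "invalid")
```

The planning cost is visible even here: the root class and its slots have to be settled before any descendant is written.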
Over time, unclassed and S3 objects have been made able to work within S4 settings, but some of these adaptations have been fragile
First, a standard dense identity (unit diagonal) matrix created using diag
d100 <- diag(100)
## [1] FALSE
## num [1:100, 1:100] 1 0 0 0 0 0 0 0 0 0 ...
## 80216 bytes
## Class "matrix" [package "methods"]
## No Slots, prototype of class "matrix"
## Extends:
## Class "array", directly
## Class "mMatrix", directly
## Class "structure", by class "array", distance 2
## Class "vector", by class "array", distance 3, with explicit coerce
## Known Subclasses: "mts"
Now a sparse identity (unit diagonal) matrix with Diagonal from the Matrix package
D100 <- Diagonal(100)
## [1] TRUE
## Formal class 'ddiMatrix' [package "Matrix"] with 4 slots
## ..@ diag : chr "U"
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num(0)
## 1240 bytes
## Class "ddiMatrix" [package "Matrix"]
## Slots:
## Name: diag Dim Dimnames x
## Class: character integer list numeric
## Extends:
## Class "diagonalMatrix", directly
## Class "dMatrix", directly
## Class "sparseMatrix", by class "diagonalMatrix", distance 2
## Class "Matrix", by class "dMatrix", distance 2
## Class "xMatrix", by class "dMatrix", distance 2
## Class "mMatrix", by class "Matrix", distance 4
## Class "Mnumeric", by class "Matrix", distance 4
## Class "replValueSp", by class "Matrix", distance 4
showMethods(f=coerce, classes=class(D100))
## Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
## ..@ x : num [1:10000] 1 0 0 0 0 0 0 0 0 0 ...
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ factors : list()
Reference classes are (much more) like OOP mechanisms in C++, Java, and other programming languages, in which objects contain their class-specific methods
Because the methods are part of the class definitions, the appropriate method is known by the object, and dispatch is unproblematic
Reference classes are provided in the methods package, and R6 classes in the R6 package
R6 classes are more light-weight than RC classes
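A minimal reference class sketch using setRefClass from the methods package; the class Counter and its increment method are hypothetical:

```r
library(methods)

# Hypothetical reference class: the method is part of the class definition
Counter <- setRefClass("Counter",
  fields = list(count = "numeric"),
  methods = list(
    increment = function(by = 1) {
      count <<- count + by   # modifies the object in place
      invisible(.self)
    }
  )
)

c1 <- Counter$new(count = 0)
c2 <- c1             # not a copy: both names refer to the same object
c1$increment()
c1$increment(by = 2)
c3 <- c1$copy()      # an explicit copy is independent
c1$increment()
```

After this, c1 and c2 both report a count of 4 while c3 stays at 3, unlike the copy-on-modify behaviour of ordinary R objects.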
deps <- grep("R6", cran_db[, "Depends"])
imps <- grep("R6", cran_db[, "Imports"])
R6_packages <- cran_db[unique(sort(c(deps, imps))), "Package"]
## [1] 252
cran_db["reticulate", c("Depends", "Imports", "LinkingTo")]
os <- import("os")
## [1] "/home/rsb/und/ban421/h19/tues"
np <- import("numpy", convert = FALSE)
a <- np$array(c(1:4))
## [1 2 3 4]
## [1] 1 2 3 4
sum <- a$cumsum()
## [ 1 3 6 10]
## [1] 1 3 6 10
## [1] 1 3 6 10
A unifying feature of S and R has been the treatment of data.frame objects as data= arguments to functions and methods
Most of the functions and methods fit models to data, and formula objects show how to treat the variables in the data= argument, or if not found there, in the calling environments of the model fitting function
We saw this earlier when looking at mgcv::gam, but did not explain it
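A short sketch of this lookup order using lm; the variable extra_w is hypothetical and deliberately kept outside mtcars:

```r
# wt is found in data=; extra_w is not a column of mtcars, so it is
# looked up in the environment where the formula was created
set.seed(1)
extra_w <- rnorm(nrow(mtcars))   # hypothetical variable, not in mtcars
fit <- lm(mpg ~ wt + extra_w, data = mtcars)
coef_names <- names(coef(fit))

# all.vars lists the variable names that a formula refers to
vars <- all.vars(mpg ~ log(wt) + extra_w)
```

Relying on variables outside data= is convenient interactively, but makes it harder to see where a fitted model got its inputs.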
Formulae may be two-sided (mostly) and one-sided; the Formula package provides extensions useful in econometrics
Formula objects can be updated
f <- mpg ~ s(wt)
gm <- gam(formula=f, data=mtcars)
gm1 <- gam(update(f, . ~ - s(wt) + poly(wt, 3)), data=mtcars)
anova(gm, gm1, test="Chisq")
mtcars$fcyl <- as.factor(mtcars$cyl)
f1 <- update(f, log(.) ~ - s(wt) - 1 + wt + fcyl)
## log(mpg) ~ wt + fcyl - 1
lm(f1, data=mtcars)
head(model.matrix(f1, mtcars))
## The following object is masked from 'package:nlme':
## lmList
lmm <- lmer(update(f, log(.) ~ poly(wt, 3) | fcyl), mtcars)
## boundary (singular) fit: see ?isSingular
mtcars$fam <- as.factor(mtcars$am)
lm(update(f, log(.) ~ - s(wt) + (fam*fcyl)/wt - 1), mtcars)
row.names(mtcars)[mtcars$am == 1]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Fiat 128"
## [5] "Honda Civic" "Toyota Corolla" "Fiat X1-9" "Porsche 914-2"
## [9] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
## [13] "Volvo 142E"
In some settings in S and later R, it has been convenient to drop quotation marks around strings representing names of packages, list components and classes
We’ve seen this in action already, but have not drawn attention to it
In particular, the $ list component selection operator uses this, as does the use of names in formulae
We also see the same thing in the functions for attaching packages, library and require
In such settings, we cannot use a single element character vector to transmit information:
ggplot2 <- "gridExtra"
## [1] "gridExtra"
ggplot2 <- as.character(substitute(ggplot2))
## [1] "ggplot2"
paste0("package:", ggplot2) %in% search()
## [1] FALSE
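The same capture can be sketched inside a function; arg_name is a hypothetical helper, while character.only is a documented argument of library and require that switches them back to standard evaluation:

```r
# Hypothetical helper: capture the argument as typed, without evaluating it
arg_name <- function(x) as.character(substitute(x))

ggplot2 <- "gridExtra"
captured <- arg_name(ggplot2)   # the symbol name, not the value "gridExtra"

# Standard-evaluation escape hatch: the value of pkg is used
pkg <- "stats"
library(pkg, character.only = TRUE)
attached <- paste0("package:", pkg) %in% search()
```

With character.only = TRUE a package name held in a character vector can be passed on programmatically.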
Here a similar mechanism is used to look inside the data.frame given as the first argument rather than the current and global environments
subset(mtcars, subset=cyl == 4)
acyl <- "cyl"
ncyl <- 4
mtcars[mtcars[[acyl]] == ncyl,]
## [1] TRUE
mtcars[eval(substitute(cyl == 4), mtcars, parent.frame()), ]
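Wrapped in a function, the idea looks roughly like this; my_subset is a hypothetical, simplified stand-in and not the actual implementation of subset:

```r
# Hypothetical simplified version of the idea behind subset()
my_subset <- function(df, cond) {
  cond_expr <- substitute(cond)                 # capture the condition unevaluated
  keep <- eval(cond_expr, df, parent.frame())   # columns first, then the caller
  df[keep & !is.na(keep), , drop = FALSE]
}

ncyl <- 4
nse_res <- my_subset(mtcars, cyl == ncyl)     # cyl from mtcars, ncyl from the caller
se_res <- mtcars[mtcars[["cyl"]] == ncyl, ]   # standard-evaluation equivalent
same <- isTRUE(all.equal(nse_res, se_res))
```

The third argument to eval matters: without parent.frame(), ncyl would be looked up from inside my_subset rather than from where it was called.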
In a blog post, rlang is used to create a list of formulae, but we can use SE constructs
create_form = function(power){
  rhs = substitute(I(hp^pow), list(pow=power))
  as.formula(paste0("mpg ~ ", deparse(rhs)))
}
list_formulae = Map(create_form, seq(1,6))
# mapply(create_form, seq(1,6))
llm <- lapply(list_formulae, lm, data=mtcars)
sapply(llm, function(x) summary(x)$sigma)
## [1] 3.862962 4.577939 5.132309 5.494144 5.712194 5.840452
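Another standard-evaluation route, offered as a sketch rather than as the approach taken in the blog post, is reformulate from the stats package, which builds a formula from character term labels:

```r
# Build mpg ~ I(hp^k) for k = 1..6 from character strings
rhs_terms <- paste0("I(hp^", 1:6, ")")
forms <- lapply(rhs_terms, reformulate, response = "mpg")

# Fit each model and collect the residual standard errors
fits <- lapply(forms, lm, data = mtcars)
sigmas <- sapply(fits, function(x) summary(x)$sigma)
```

Both routes stay in base R and treat the formula as data to be assembled, which keeps the loop easy to program.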
Non-standard evaluation may be attractive, as the history of S and R syntax has shown
Use in additive graphics in ggplot2 and in verbs in dplyr and similar packages has led to increased use of NSE
Standard evaluation is arguably more programmable, but workflows using NSE are promoted (not least by RStudio)
It is interesting that Microsoft are drawing attention to seplyr (see also John Mount’s blog)
Connections are part of base R, corresponding to innovations in S4; the DBI database interface abstractions began to appear at about the same time
The DBI classes are formal S4 classes for obvious reasons, but in R connections are S3 classes
Connections generalize files to include downloading (also from https) and uncompressing files, and exchanging data by socket
Connections are sometimes platform-specific, and are used inside input-output functions (tomorrow)
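A small sketch of a connection generalizing a file, here writing and reading compressed text through gzfile; the temporary file and its contents are illustrative only:

```r
# Write two lines through a compressing connection
tmp <- tempfile(fileext = ".gz")
con <- gzfile(tmp, open = "wt")
writeLines(c("first line", "second line"), con)
close(con)

# Read them back; decompression happens inside the connection
con <- gzfile(tmp, open = "rt")
con_class <- class(con)     # an S3 class vector ending in "connection"
lines_back <- readLines(con)
close(con)
unlink(tmp)
```

The reading code is the same as for a plain file connection, since only the function that creates the connection changes.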
The magrittr package introduced the %>% pipe; coupled to right assign ->, it even looks reasonable
There are more pipes in R, and the Bizarro pipe ->.; is pure base R, saving output to .
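To see that the Bizarro pipe needs no package at all, here is a sketch comparing it with the equivalent nested call; the name bizarro for the result is arbitrary:

```r
# Nested calls, read inside-out
nested <- cos(exp(sin(4)))

# Bizarro pipe: each step right-assigns its result to the object named "."
4 ->.; sin(.) ->.; exp(.) ->.; cos(.) -> bizarro
```

Each ->.; is just a right assignment to . followed by a statement separator, so this is ordinary standard evaluation.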
The arguments for writing with pipes are concentrated on readability
But is the magrittr implementation one that is efficient, or can we stay with standard evaluation?
## Attaching package: 'wrapr'
## The following object is masked from 'package:mgcv':
## %.%
rbenchmark::benchmark(
  "dot-pipe"=4 %.>% sin(.) %.>% exp(.) %.>% cos(.),
  "bizarro-pipe"={4 ->.; sin(.) ->.; exp(.) ->.; cos(.)},
  "magrittr-pipe"=4 %>% sin() %>% exp() %>% cos(),
  replications=10000, order="elapsed")
The overhead shown above seems to stem from aggressive NSE, breaking each pipe transfer down into parts, then running them sequentially
chain_parts <- readRDS("chain_parts.rds")
So it seems as though temporary objects are being created and garbage collected in all cases
There are however differences in how many such objects are being created, and how big they are
Finally, I first saw -> used with pipes in a talk about the archivist package
archivist is also discussed in this blog
