Required current contributed CRAN packages:

I am running R 3.6.1, with recent update.packages().

needed <- c("rbenchmark", "wrapr", "magrittr", "lme4", "reticulate", "Matrix", "mgcv", "nlme", "Rcpp", "lattice", "MASS", "grid_3.6.1", "jsonlite", "minqa", "nloptr", "boot", "splines", "tools", "compiler", "BiocManager")

Script

Script and data at https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_tues.zip. Download to suitable location, unzip and use as basis.

Introduction: classes, methods and formulae etc.

Aims for today

  • Yesterday was mostly for setting the scene; today and tomorrow should make more progress

  • The classes will be split 45m introduction, 45m exercises/exploration/seminar

  • We will also open for multi-speed approaches, rather than convoy; those moving more slowly are still learning, and can benefit from what those with stronger prior knowledge and experience are doing

Classes and methods

  • In S2 syntax, there were objects; in S3, some objects also had a class attribute set, to offer guidance on what the object might contain

  • In S4, the contents of objects were formalised, moving checks from methods dispached by the class of the first method argument to the classes themselves

  • In OOP in other languages, methods belong to objects, and RC (reference classes) and R6 systems have been developed to provide these

  • RC are linked to the success of Rcpp, and perhaps used in reticulate to interface Python from R (see keras and tensorflow)

Formulae

  • Formulae provide the S3 modelling interface, and use non-standard evaluation

  • They tell us where to look for variables in modelling, and which transformations to apply to them

  • They provide us with what we need from our data in terms of output from analyses

  • They are very flexible, with update methods to modify our approaches flexibly

Non-standard evaluation and combining functions/pipes

  • There are plenty of base R functions that use non-standard evaluation, such as formula and library

  • The underlying issue is often how to point to objects within other objects, where the emcompassing object is an environment or data.frame

  • Connections to files and databases include the original meaning of “pipe”

  • More recently, pipes may be used instead of nested function arguments

Classes and methods

Classes

  • Classes and objects appeared first in Simula from the Norwegian Computing Center in Oslo, and were intended to further encapsulation in programs

  • C++ was the extension of C to include Simula-like object handling, and, like Simula, a garbage collector

  • More modern languages, like Java, settled on a strict OOP view of classes of objects that contained methods, but until the 1990s, this was not the only possibility

  • S adopted a functional class quasi-system with S3, where generic methods stood apart from classes of objects, but related to them

S3 classes and methods

  • As in file names, various non-alphanumeric characters can be used to separate parts, for example the . dot

  • In S and early R, the _ underscore was used for assignment like <- (see this posting)

  • So a central S3 class was called data.frame, and coercion to a data.frame was as.data.frame.matrix for a matrix argument, and as.data.frame.list for a list argument

  • Here the as.data.frame part was sufficient, and method dispatch would choose the appropriate implementation by matching the class of the first argument to the final dot-separated part of the list of available methods

S3 classes and methods

methods(as.data.frame)
##  [1] as.data.frame.aovproj*        as.data.frame.array          
##  [3] as.data.frame.AsIs            as.data.frame.character      
##  [5] as.data.frame.complex         as.data.frame.data.frame     
##  [7] as.data.frame.Date            as.data.frame.default        
##  [9] as.data.frame.difftime        as.data.frame.factor         
## [11] as.data.frame.ftable*         as.data.frame.integer        
## [13] as.data.frame.list            as.data.frame.logical        
## [15] as.data.frame.logLik*         as.data.frame.matrix         
## [17] as.data.frame.model.matrix    as.data.frame.noquote        
## [19] as.data.frame.numeric         as.data.frame.numeric_version
## [21] as.data.frame.ordered         as.data.frame.POSIXct        
## [23] as.data.frame.POSIXlt         as.data.frame.raw            
## [25] as.data.frame.table           as.data.frame.ts             
## [27] as.data.frame.vector         
## see '?methods' for accessing help and source code

Side note on naming conventions

  • Rasmus Bååth has a report on naming conventions used in R some years ago

  • More recently, lowerCamelCase and snake_case have become predominant, with most recent code being snake_case (all lower case and words separated by underscore)

  • Obviously, the case-matching component of generic methods for S3 objects has to be separated by a dot

S3 and extension

  • It is easy to create new methods for existing generic functions, like print, plot, or summary

  • It is also easy to create new classes - no definition is needed, but as software develops, the class-specific methods often need to guard against the absence of object components typically with !is.null() usage

  • If you save an S3 object, say to an RDS file, possibly for serialization, the package context of the class will be lost

  • Then the class attribute will be set, but which package provides the methods for that class is not recorded

Attributes

data(mtcars)
class(mtcars)
## [1] "data.frame"
attributes(mtcars)
## $names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"
## 
## $row.names
##  [1] "Mazda RX4"           "Mazda RX4 Wag"       "Datsun 710"         
##  [4] "Hornet 4 Drive"      "Hornet Sportabout"   "Valiant"            
##  [7] "Duster 360"          "Merc 240D"           "Merc 230"           
## [10] "Merc 280"            "Merc 280C"           "Merc 450SE"         
## [13] "Merc 450SL"          "Merc 450SLC"         "Cadillac Fleetwood" 
## [16] "Lincoln Continental" "Chrysler Imperial"   "Fiat 128"           
## [19] "Honda Civic"         "Toyota Corolla"      "Toyota Corona"      
## [22] "Dodge Challenger"    "AMC Javelin"         "Camaro Z28"         
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"      
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"       
## [31] "Maserati Bora"       "Volvo 142E"         
## 
## $class
## [1] "data.frame"

Search path

(oS <- search())
## [1] ".GlobalEnv"        "package:stats"     "package:graphics" 
## [4] "package:grDevices" "package:utils"     "package:datasets" 
## [7] "package:methods"   "Autoloads"         "package:base"

Method dispatch

library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-29. For overview type 'help("mgcv-package")'.
gm <- gam(formula=mpg ~ s(wt), data=mtcars)
class(gm)
## [1] "gam" "glm" "lm"
print(gm)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## mpg ~ s(wt)
## 
## Estimated degrees of freedom:
## 2.14  total = 3.14 
## 
## GCV score: 7.887675
stats:::print.glm(gm)
## 
## Call:  gam(formula = mpg ~ s(wt), data = mtcars)
## 
## Coefficients:
## (Intercept)      s(wt).1      s(wt).2      s(wt).3      s(wt).4  
##    20.09062     -0.05848     -0.47219     -0.02163      0.43973  
##     s(wt).5      s(wt).6      s(wt).7      s(wt).8      s(wt).9  
##    -0.38559     -0.61434     -0.27343     -3.45191     -4.99051  
## 
## Degrees of Freedom: 31 Total (i.e. Null);  28.85932 Residual
## Null Deviance:       1126 
## Residual Deviance: 205.3     AIC: 158.6

Which methods

methods(logLik)
##  [1] logLik.Arima*        logLik.corStruct*    logLik.gam          
##  [4] logLik.glm*          logLik.gls*          logLik.glsStruct*   
##  [7] logLik.gnls*         logLik.gnlsStruct*   logLik.lm*          
## [10] logLik.lme*          logLik.lmeStruct*    logLik.lmeStructInt*
## [13] logLik.lmList*       logLik.logLik*       logLik.nls*         
## [16] logLik.reStruct*     logLik.varComb*      logLik.varFunc*     
## see '?methods' for accessing help and source code

library affects the search path

Because library affects the search path, we have added visible methods compared to our first view of those available

nS <- search()
nS[!(nS %in% oS)]
## [1] "package:mgcv" "package:nlme"

S4 classes and methods

  • In the online version of here, and in the forthcoming second edition, there are short descriptions of formal S4 classes

  • They were introduced in the Green Book , and covered in , and

  • The methods package provides S4 classes in R and is used both for formal S4 classes and for RC reference classes also used in Rcpp modules

  • S4 classes are used extensively in Bioconductor packages

S4 classes and methods

As of SVN revision 74084 (Luke Tierney, 6 Jan 2018):

$ svn diff -r 74083:74084
     \item{\option{--default-packages=list}}{where \code{list} is a
       comma-separated list of package names or \code{NULL}.  Sets the
       environment variable \env{R_DEFAULT_PACKAGES} which determines the
-      packages loaded on startup.  The default for \command{Rscript}
-      omits \pkg{methods} as it takes about 60\% of the startup time.
+      packages loaded on startup.
cran_deps <- grep("methods", cran_db[, "Depends"])
cran_imps <- grep("methods", cran_db[, "Imports"])
cran_methods_packages <- cran_db[unique(sort(c(cran_deps, cran_imps))), "Package"]
nrow(cran_db)
## [1] 15109
length(cran_methods_packages)
## [1] 2838
bioc_deps <- grep("methods", bioc_db[, "Depends"])
bioc_imps <- grep("methods", bioc_db[, "Imports"])
bioc_methods_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps))), "Package"]
nrow(bioc_db)
## [1] 3053
length(bioc_methods_packages)
## [1] 1541

S4 classes and methods (CRAN)

Just to check how many of the CRAN packages using methods may be being driven by Rcpp modules

cran_deps <- grep("Rcpp", cran_db[, "Depends"])
cran_imps <- grep("Rcpp", cran_db[, "Imports"])
cran_lt <- grep("Rcpp", cran_db[, "LinkingTo"])
cran_Rcpp_packages <- cran_db[unique(sort(c(cran_deps, cran_imps, cran_lt))), "Package"]
cran_Rcpp_methods_packages <- intersect(cran_methods_packages, cran_Rcpp_packages)
length(cran_Rcpp_packages)
## [1] 1846
length(cran_Rcpp_methods_packages)
## [1] 586

S4 classes and methods (BIOC)

bioc_deps <- grep("Rcpp", bioc_db[, "Depends"])
bioc_imps <- grep("Rcpp", bioc_db[, "Imports"])
bioc_lt <- grep("Rcpp", bioc_db[, "LinkingTo"])
bioc_Rcpp_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps, bioc_lt))), "Package"]
bioc_Rcpp_methods_packages <- intersect(bioc_methods_packages, bioc_Rcpp_packages)
length(bioc_Rcpp_packages)
## [1] 183
length(bioc_Rcpp_methods_packages)
## [1] 135

S4 classes and methods

  • Writing S4 classes involves thinking ahead, to plan a hierarchy of classes (and virtual classes)

  • If it is possible to generalise methods in the inheritance tree od class definitions, a method can be used on all descendants inheriting from a root class

  • Formal classes also provide certainty that the classes contain slots as required, and objects can be checked for validity

  • Over time, unclassed and S3 objects have been made able to work within S4 settings, but some of these adaptions have been fragile

S4 classes for sparse matrices

First, a standard dense identity (unit diagonal) matrix created using diag

d100 <- diag(100)
isS4(d100)
## [1] FALSE
str(d100)
##  num [1:100, 1:100] 1 0 0 0 0 0 0 0 0 0 ...
object.size(d100)
## 80216 bytes
getClass(class(d100))
## Class "matrix" [package "methods"]
## 
## No Slots, prototype of class "matrix"
## 
## Extends: 
## Class "array", directly
## Class "mMatrix", directly
## Class "structure", by class "array", distance 2
## Class "vector", by class "array", distance 3, with explicit coerce
## 
## Known Subclasses: "mts"

Now a sparse identity (unit diagonal) matrix with Diagonal from the Matrix package

library(Matrix)
D100 <- Diagonal(100)
isS4(D100)
## [1] TRUE
str(D100)
## Formal class 'ddiMatrix' [package "Matrix"] with 4 slots
##   ..@ diag    : chr "U"
##   ..@ Dim     : int [1:2] 100 100
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num(0)
object.size(D100)
## 1240 bytes
getClass(class(D100))
## Class "ddiMatrix" [package "Matrix"]
## 
## Slots:
##                                               
## Name:       diag       Dim  Dimnames         x
## Class: character   integer      list   numeric
## 
## Extends: 
## Class "diagonalMatrix", directly
## Class "dMatrix", directly
## Class "sparseMatrix", by class "diagonalMatrix", distance 2
## Class "Matrix", by class "dMatrix", distance 2
## Class "xMatrix", by class "dMatrix", distance 2
## Class "mMatrix", by class "Matrix", distance 4
## Class "Mnumeric", by class "Matrix", distance 4
## Class "replValueSp", by class "Matrix", distance 4
showMethods(f=coerce, classes=class(D100))
## Function: coerce (package methods)
## from="ddiMatrix", to="CsparseMatrix"
## from="ddiMatrix", to="ddenseMatrix"
## from="ddiMatrix", to="dgeMatrix"
## from="ddiMatrix", to="diagonalMatrix"
## from="ddiMatrix", to="dMatrix"
## from="ddiMatrix", to="dsparseMatrix"
## from="ddiMatrix", to="matrix"
## from="ddiMatrix", to="Matrix"
## from="ddiMatrix", to="Mnumeric"
## from="ddiMatrix", to="symmetricMatrix"
## from="ddiMatrix", to="triangularMatrix"
## from="ddiMatrix", to="TsparseMatrix"
str(as(D100, "dgeMatrix"))
## Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
##   ..@ x       : num [1:10000] 1 0 0 0 0 0 0 0 0 0 ...
##   ..@ Dim     : int [1:2] 100 100
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ factors : list()

Reference classes (RC and R6)

  • Reference classes are (much more) like OOP mechanisms in C++, Java, and other programming languages, in which objects contain their class-specific methods

  • Because the methods are part of the class definitions, the appropriate method is known by the object, and dispatch is unproblematic

  • Reference classes are provided in the methods package, and R6 classes in the R6 package

  • R6 classes are more light-weight than RC classes

R6 Reference classes

deps <- grep("R6", cran_db[, "Depends"])
imps <- grep("R6", cran_db[, "Imports"])
R6_packages <- cran_db[unique(sort(c(deps, imps))), "Package"]
length(R6_packages)
## [1] 252

reticulate

cran_db["reticulate", c("Depends", "Imports", "LinkingTo")]
##                                               Depends 
##                                          "R (>= 3.0)" 
##                                               Imports 
## "utils, graphics, jsonlite, Rcpp (>= 0.12.7), Matrix" 
##                                             LinkingTo 
##                                                "Rcpp"
cran_db["jsonlite", c("Depends", "Imports", "LinkingTo")]
##   Depends   Imports LinkingTo 
## "methods"        NA        NA
cran_db["Rcpp", c("Depends", "Imports", "LinkingTo")]
##          Depends          Imports        LinkingTo 
##   "R (>= 3.0.0)" "methods, utils"               NA
cran_db["Matrix", c("Depends", "Imports", "LinkingTo")]
##                                          Depends 
##                                   "R (>= 3.2.0)" 
##                                          Imports 
## "methods, graphics, grid, stats, utils, lattice" 
##                                        LinkingTo 
##                                               NA

reticulate

library(reticulate)
os <- import("os")
os$getcwd()
## [1] "/home/rsb/und/ban421/h19/tues"
np <- import("numpy", convert = FALSE)
a <- np$array(c(1:4))
a
## [1 2 3 4]
py_to_r(a)
## [1] 1 2 3 4
sum <- a$cumsum()
sum
## [ 1  3  6 10]
py_to_r(sum)
## [1]  1  3  6 10
cumsum(c(1:4))
## [1]  1  3  6 10

Formulae

Formula objects

  • A unifying feature of S and R has been the treatment of data.frame objects as data= arguments to functions and methods

  • Most of the functions and methods fit models to data, and formula objects show how to treat the variables in the data= argument, or if not found there, in the calling environments of the model fitting function

  • We saw this earlier when looking at mgcv::gam, but did not explain it

  • Formulae may be two-sided (mostly) and one-sided; the Formula package provides extensions useful in econometrics

Formula objects can be updated

library(mgcv)
f <- mpg ~ s(wt)
gm <- gam(formula=f, data=mtcars)
gm1 <- gam(update(f, . ~ - s(wt) + poly(wt, 3)), data=mtcars)
anova(gm, gm1, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: mpg ~ s(wt)
## Model 2: mpg ~ poly(wt, 3)
##   Resid. Df Resid. Dev      Df Deviance Pr(>Chi)
## 1    28.399     205.29                          
## 2    28.000     203.67 0.39915   1.6217   0.3098
mtcars$fcyl <- as.factor(mtcars$cyl)
f1 <- update(f, log(.) ~ - s(wt) - 1 + wt + fcyl)
f1
## log(mpg) ~ wt + fcyl - 1
lm(f1, data=mtcars)
## 
## Call:
## lm(formula = f1, data = mtcars)
## 
## Coefficients:
##      wt    fcyl4    fcyl6    fcyl8  
## -0.1762   3.6731   3.5295   3.4046
terms(f1)
## log(mpg) ~ wt + fcyl - 1
## attr(,"variables")
## list(log(mpg), wt, fcyl)
## attr(,"factors")
##          wt fcyl
## log(mpg)  0    0
## wt        1    0
## fcyl      0    1
## attr(,"term.labels")
## [1] "wt"   "fcyl"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
head(model.matrix(f1, mtcars))
##                      wt fcyl4 fcyl6 fcyl8
## Mazda RX4         2.620     0     1     0
## Mazda RX4 Wag     2.875     0     1     0
## Datsun 710        2.320     1     0     0
## Hornet 4 Drive    3.215     0     1     0
## Hornet Sportabout 3.440     0     0     1
## Valiant           3.460     0     1     0
model.response(model.frame(f1, mtcars))[1:6]
##         Mazda RX4     Mazda RX4 Wag        Datsun 710    Hornet 4 Drive 
##          3.044522          3.044522          3.126761          3.063391 
## Hornet Sportabout           Valiant 
##          2.928524          2.895912
mtcars$mpg[1:6]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
log(mtcars$mpg[1:6])
## [1] 3.044522 3.044522 3.126761 3.063391 2.928524 2.895912
library(lme4)
## 
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
## 
##     lmList
lmm <- lmer(update(f, log(.) ~ poly(wt, 3) | fcyl), mtcars)
## boundary (singular) fit: see ?isSingular
fixef(lmm)
## (Intercept) 
##    2.922204
ranef(lmm)
## $fcyl
##   (Intercept) poly(wt, 3)1 poly(wt, 3)2 poly(wt, 3)3
## 4  0.26290733  -0.10590267    0.7154024  0.023990411
## 6  0.08117715  -0.03269927    0.2208928  0.007407559
## 8 -0.24499664   0.09868800   -0.6666653 -0.022356080
## 
## with conditional variances for "fcyl"
mtcars$fam <- as.factor(mtcars$am)
lm(update(f, log(.) ~ - s(wt) + (fam*fcyl)/wt - 1), mtcars)
## 
## Call:
## lm(formula = update(f, log(.) ~ -s(wt) + (fam * fcyl)/wt - 1), 
##     data = mtcars)
## 
## Coefficients:
##          fam0           fam1          fcyl6          fcyl8     fam1:fcyl6  
##       2.73316        3.94159        2.46766        0.70085       -3.30564  
##    fam1:fcyl8  fam0:fcyl4:wt  fam1:fcyl4:wt  fam0:fcyl6:wt  fam1:fcyl6:wt  
##      -1.47065        0.13514       -0.30280       -0.66469       -0.02918  
## fam0:fcyl8:wt  fam1:fcyl8:wt  
##      -0.18018       -0.12990
xtabs(~ fam + fcyl, data=mtcars)
##    fcyl
## fam  4  6  8
##   0  3  4 12
##   1  8  3  2
row.names(mtcars)[mtcars$am == 1]
##  [1] "Mazda RX4"      "Mazda RX4 Wag"  "Datsun 710"     "Fiat 128"      
##  [5] "Honda Civic"    "Toyota Corolla" "Fiat X1-9"      "Porsche 914-2" 
##  [9] "Lotus Europa"   "Ford Pantera L" "Ferrari Dino"   "Maserati Bora" 
## [13] "Volvo 142E"

Non-standard evaluation and combining functions/pipes

Non-standard evaluation

  • In some settings in S and later R, it has been convenient to drop quotation marks around strings repesenting names of packages, list components and classes

  • We’ve seen this in action already, but have not drawn attention to it

  • In particular, the $ list component selection operator uses this, as does the use of names in formulae

  • We also see the same thing in functions attaching packages library and require

require
## function (package, lib.loc = NULL, quietly = FALSE, warn.conflicts, 
##     character.only = FALSE, mask.ok, exclude, include.only, attach.required = missing(include.only)) 
## {
##     if (!character.only) 
##         package <- as.character(substitute(package))
##     loaded <- paste0("package:", package) %in% search()
##     if (!loaded) {
##         if (!quietly) 
##             packageStartupMessage(gettextf("Loading required package: %s", 
##                 package), domain = NA)
##         value <- tryCatch(library(package, lib.loc = lib.loc, 
##             character.only = TRUE, logical.return = TRUE, warn.conflicts = warn.conflicts, 
##             quietly = quietly, mask.ok = mask.ok, exclude = exclude, 
##             include.only = include.only, attach.required = attach.required), 
##             error = function(e) e)
##         if (inherits(value, "error")) {
##             if (!quietly) {
##                 msg <- conditionMessage(value)
##                 cat("Failed with error:  ", sQuote(msg), "\n", 
##                   file = stderr(), sep = "")
##                 .Internal(printDeferredWarnings())
##             }
##             return(invisible(FALSE))
##         }
##         if (!value) 
##             return(invisible(FALSE))
##     }
##     else value <- TRUE
##     invisible(value)
## }
## <bytecode: 0xd8c890>
## <environment: namespace:base>

In such settings, we cannot use a single element character vector to transmit information:

ggplot2 <- "gridExtra"
ggplot2
## [1] "gridExtra"
ggplot2 <- as.character(substitute(ggplot2))
ggplot2
## [1] "ggplot2"
paste0("package:", ggplot2) %in% search()
## [1] FALSE

Here a similar mechanism is used to look inside the first data.frame rather than the current and global environments

subset(mtcars, subset=cyl == 4)
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb fcyl
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    4
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    4
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1    4
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2    4
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1    4
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    4
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1    4
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2    4
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2    4
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2    4
##                fam
## Datsun 710       1
## Merc 240D        0
## Merc 230         0
## Fiat 128         1
## Honda Civic      1
## Toyota Corolla   1
## Toyota Corona    0
## Fiat X1-9        1
## Porsche 914-2    1
## Lotus Europa     1
## Volvo 142E       1
acyl <- "cyl"
ncyl <- 4
mtcars[mtcars[[acyl]] == ncyl,]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb fcyl
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    4
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    4
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1    4
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2    4
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1    4
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    4
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1    4
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2    4
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2    4
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2    4
##                fam
## Datsun 710       1
## Merc 240D        0
## Merc 230         0
## Fiat 128         1
## Honda Civic      1
## Toyota Corolla   1
## Toyota Corona    0
## Fiat X1-9        1
## Porsche 914-2    1
## Lotus Europa     1
## Volvo 142E       1
is.list(mtcars)
## [1] TRUE
mtcars[eval(substitute(cyl == 4), mtcars, parent.frame()), ]
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb fcyl
## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1    4
## Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2    4
## Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2    4
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1    4
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2    4
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1    4
## Toyota Corona  21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1    4
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1    4
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2    4
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2    4
## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2    4
##                fam
## Datsun 710       1
## Merc 240D        0
## Merc 230         0
## Fiat 128         1
## Honda Civic      1
## Toyota Corolla   1
## Toyota Corona    0
## Fiat X1-9        1
## Porsche 914-2    1
## Lotus Europa     1
## Volvo 142E       1

In a blog post, rlang is used to create a list of formulae, but we can use SE constructs

create_form = function(power){
 rhs = substitute(I(hp^pow), list(pow=power))
 as.formula(paste0("mpg ~ ", deparse(rhs)))
}
list_formulae = Map(create_form, seq(1,6))
  # mapply(create_form, seq(1,6))
llm <- lapply(list_formulae, lm, data=mtcars)
sapply(llm, function(x) summary(x)$sigma)
## [1] 3.862962 4.577939 5.132309 5.494144 5.712194 5.840452
  • Non-standard evaluation may be attractive, as the history of S and R syntax has shown

  • Use in additive graphics in ggplot2 and in verbs in dplyr and similar packages has led to increases

  • Standard evaluation is arguably more programmable, but workflows using NSE are promoted (not least by RStudio)

  • It is interesting that Microsoft are drawing attention to seplyr (see also John Mount’s blog)

Connections

  • Connections are part of base R corresponding to innovations in S4, the DBI database interface abstractions began to appear at about the same time

  • The DBI classes are formal S4 classes for obvious reasons, but in R connections are S3 classes

  • Connections generalize files to include downloading (also from https) and uncompressing files, and exchanging data by socket

  • Connections are sometimes platform-specific, and are used inside input-output functions (tomorrow)

(lns <- readLines(pipe("dir")))
## [1] "ban421.bib\t       ban421_h19_tues.html  chain_parts.rds"
## [2] "ban421_h18_tues.Rmd    ban421_h19_tues.R"                
## [3] "ban421_h19_tues_cache  ban421_h19_tues.Rmd"
lns <- sort(do.call("c", strsplit(lns, " +")))
con <- file(lns[3])
readLines(con, n=6L)
## [1] "<!DOCTYPE html>"                              
## [2] ""                                             
## [3] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
## [4] ""                                             
## [5] "<head>"                                       
## [6] ""
close(con)

Pipes

  • The magrittr package introduced the %>% pipe; coupled to right assign ->, it even looks reasonable

  • There are more pipes in R, and the Bizarro pipe ->.; is pure base R, saving output to .

  • The arguments for writing with pipes are concentrated on readability

  • But is the magrittr implementation one that is efficient, or can we stay with standard evaluation?

library(magrittr)
magrittr:::'%>%'
## function (lhs, rhs) 
## {
##     parent <- parent.frame()
##     env <- new.env(parent = parent)
##     chain_parts <- split_chain(match.call(), env = env)
##     pipes <- chain_parts[["pipes"]]
##     rhss <- chain_parts[["rhss"]]
##     lhs <- chain_parts[["lhs"]]
##     env[["_function_list"]] <- lapply(1:length(rhss), function(i) wrap_function(rhss[[i]], 
##         pipes[[i]], parent))
##     env[["_fseq"]] <- `class<-`(eval(quote(function(value) freduce(value, 
##         `_function_list`)), env, env), c("fseq", "function"))
##     env[["freduce"]] <- freduce
##     if (is_placeholder(lhs)) {
##         env[["_fseq"]]
##     }
##     else {
##         env[["_lhs"]] <- eval(lhs, parent, parent)
##         result <- withVisible(eval(quote(`_fseq`(`_lhs`)), env, 
##             env))
##         if (is_compound_pipe(pipes[[1L]])) {
##             eval(call("<-", lhs, result[["value"]]), parent, 
##                 parent)
##         }
##         else {
##             if (result[["visible"]]) 
##                 result[["value"]]
##             else invisible(result[["value"]])
##         }
##     }
## }
## <bytecode: 0x9f0a370>
## <environment: 0x9f0b250>
library(wrapr)
## 
## Attaching package: 'wrapr'
## The following object is masked from 'package:mgcv':
## 
##     %.%
library(rbenchmark);
benchmark(SE=cos(exp(sin(4))),
          "dot-pipe"=4 %.>% sin(.) %.>% exp(.) %.>% cos(.),
          "bizarro-pipe"={4 ->.; sin(.) ->.; exp(.) ->.; cos(.)},
          "magrittr-pipe"=4 %>% sin() %>% exp() %>% cos(),
          replications=10000, order="elapsed")
##            test replications elapsed relative user.self sys.self
## 1            SE        10000   0.015    1.000     0.015    0.000
## 3  bizarro-pipe        10000   0.021    1.400     0.021    0.000
## 2      dot-pipe        10000   0.666   44.400     0.653    0.012
## 4 magrittr-pipe        10000   1.127   75.133     1.125    0.000
##   user.child sys.child
## 1          0         0
## 3          0         0
## 2          0         0
## 4          0         0

The overhead shown above seems to stem from aggressive NSE, breaking each pipe transfer down into parts, then running them sequentially

chain_parts <- readRDS("chain_parts.rds")
str(chain_parts)
## List of 3
##  $ rhss :List of 3
##   ..$ : language sin(.)
##   ..$ : language exp(.)
##   ..$ : language cos(.)
##  $ pipes:List of 3
##   ..$ : symbol %>%
##   ..$ : symbol %>%
##   ..$ : symbol %>%
##  $ lhs  : num 4
object.size(chain_parts)
## 1496 bytes
  • So it seems as though temporary objects are being created and garbage collected in all cases

  • There are however differences in how many such objects are being created, and how big they are

  • Finally, I first saw -> used with pipes in a talk about the archivist package

  • archivist is also discussed in this blog

R’s sessionInfo()

sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 30 (Workstation Edition)
## 
## Matrix products: default
## BLAS:   /home/rsb/topics/R/R361-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R361-share/lib64/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rbenchmark_1.0.0 wrapr_1.9.2      magrittr_1.5     lme4_1.1-21     
## [5] reticulate_1.13  Matrix_1.2-17    mgcv_1.8-29      nlme_3.1-141    
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.2      knitr_1.25      splines_3.6.1   MASS_7.3-51.4  
##  [5] lattice_0.20-38 rlang_0.4.0     minqa_1.2.4     stringr_1.4.0  
##  [9] tools_3.6.1     grid_3.6.1      xfun_0.10       htmltools_0.4.0
## [13] yaml_2.2.0      digest_0.6.21   nloptr_1.2.1    evaluate_0.14  
## [17] rmarkdown_1.16  stringi_1.4.3   compiler_3.6.1  boot_1.3-23    
## [21] jsonlite_1.6