All the material presented here, to the extent it is original, is available under CC-BY-SA.
I am running R 3.6.1, with recent update.packages()
.
needed <- c("rbenchmark", "wrapr", "magrittr", "lme4", "reticulate", "Matrix", "mgcv", "nlme", "Rcpp", "lattice", "MASS", "grid_3.6.1", "jsonlite", "minqa", "nloptr", "boot", "splines", "tools", "compiler", "BiocManager")
Script and data at https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_tues.zip. Download to suitable location, unzip and use as basis.
Yesterday was mostly for setting the scene; today and tomorrow should make more progress
The classes will be split 45m introduction, 45m exercises/exploration/seminar
We will also open for multi-speed approaches, rather than convoy; those moving more slowly are still learning, and can benefit from what those with stronger prior knowledge and experience are doing
In S2 syntax, there were objects; in S3, some objects also had a class attribute set, to offer guidance on what the object might contain
In S4, the contents of objects were formalised, moving checks from methods dispached by the class of the first method argument to the classes themselves
In OOP in other languages, methods belong to objects, and RC (reference classes) and R6 systems have been developed to provide these
RC are linked to the success of Rcpp, and perhaps used in reticulate to interface Python from R (see keras and tensorflow)
Formulae provide the S3 modelling interface, and use non-standard evaluation
They tell us where to look for variables in modelling, and which transformations to apply to them
They provide us with what we need from our data in terms of output from analyses
They are very flexible, with update
methods to modify our approaches flexibly
There are plenty of base R functions that use non-standard evaluation, such as formula
and library
The underlying issue is often how to point to objects within other objects, where the emcompassing object is an environment
or data.frame
Connections to files and databases include the original meaning of “pipe”
More recently, pipes may be used instead of nested function arguments
Classes and objects appeared first in Simula from the Norwegian Computing Center in Oslo, and were intended to further encapsulation in programs
C++ was the extension of C to include Simula-like object handling, and, like Simula, a garbage collector
More modern languages, like Java, settled on a strict OOP view of classes of objects that contained methods, but until the 1990s, this was not the only possibility
S adopted a functional class quasi-system with S3, where generic methods stood apart from classes of objects, but related to them
As in file names, various non-alphanumeric characters can be used to separate parts, for example the .
dot
In S and early R, the _
underscore was used for assignment like <-
(see this posting)
So a central S3 class was called data.frame
, and coercion to a data.frame
was as.data.frame.matrix
for a matrix argument, and as.data.frame.list
for a list argument
Here the as.data.frame
part was sufficient, and method dispatch would choose the appropriate implementation by matching the class of the first argument to the final dot-separated part of the list of available methods
methods(as.data.frame)
## [1] as.data.frame.aovproj* as.data.frame.array
## [3] as.data.frame.AsIs as.data.frame.character
## [5] as.data.frame.complex as.data.frame.data.frame
## [7] as.data.frame.Date as.data.frame.default
## [9] as.data.frame.difftime as.data.frame.factor
## [11] as.data.frame.ftable* as.data.frame.integer
## [13] as.data.frame.list as.data.frame.logical
## [15] as.data.frame.logLik* as.data.frame.matrix
## [17] as.data.frame.model.matrix as.data.frame.noquote
## [19] as.data.frame.numeric as.data.frame.numeric_version
## [21] as.data.frame.ordered as.data.frame.POSIXct
## [23] as.data.frame.POSIXlt as.data.frame.raw
## [25] as.data.frame.table as.data.frame.ts
## [27] as.data.frame.vector
## see '?methods' for accessing help and source code
Rasmus Bååth has a report on naming conventions used in R some years ago
More recently, lowerCamelCase and snake_case have become predominant, with most recent code being snake_case (all lower case and words separated by underscore)
Obviously, the case-matching component of generic methods for S3 objects has to be separated by a dot
It is easy to create new methods for existing generic functions, like print
, plot
, or summary
It is also easy to create new classes - no definition is needed, but as software develops, the class-specific methods often need to guard against the absence of object components typically with !is.null()
usage
If you save an S3 object, say to an RDS file, possibly for serialization, the package context of the class will be lost
Then the class attribute will be set, but which package provides the methods for that class is not recorded
data(mtcars)
class(mtcars)
## [1] "data.frame"
attributes(mtcars)
## $names
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
##
## $row.names
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710"
## [4] "Hornet 4 Drive" "Hornet Sportabout" "Valiant"
## [7] "Duster 360" "Merc 240D" "Merc 230"
## [10] "Merc 280" "Merc 280C" "Merc 450SE"
## [13] "Merc 450SL" "Merc 450SLC" "Cadillac Fleetwood"
## [16] "Lincoln Continental" "Chrysler Imperial" "Fiat 128"
## [19] "Honda Civic" "Toyota Corolla" "Toyota Corona"
## [22] "Dodge Challenger" "AMC Javelin" "Camaro Z28"
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
##
## $class
## [1] "data.frame"
(oS <- search())
## [1] ".GlobalEnv" "package:stats" "package:graphics"
## [4] "package:grDevices" "package:utils" "package:datasets"
## [7] "package:methods" "Autoloads" "package:base"
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-29. For overview type 'help("mgcv-package")'.
gm <- gam(formula=mpg ~ s(wt), data=mtcars)
class(gm)
## [1] "gam" "glm" "lm"
print(gm)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## mpg ~ s(wt)
##
## Estimated degrees of freedom:
## 2.14 total = 3.14
##
## GCV score: 7.887675
stats:::print.glm(gm)
##
## Call: gam(formula = mpg ~ s(wt), data = mtcars)
##
## Coefficients:
## (Intercept) s(wt).1 s(wt).2 s(wt).3 s(wt).4
## 20.09062 -0.05848 -0.47219 -0.02163 0.43973
## s(wt).5 s(wt).6 s(wt).7 s(wt).8 s(wt).9
## -0.38559 -0.61434 -0.27343 -3.45191 -4.99051
##
## Degrees of Freedom: 31 Total (i.e. Null); 28.85932 Residual
## Null Deviance: 1126
## Residual Deviance: 205.3 AIC: 158.6
methods(logLik)
## [1] logLik.Arima* logLik.corStruct* logLik.gam
## [4] logLik.glm* logLik.gls* logLik.glsStruct*
## [7] logLik.gnls* logLik.gnlsStruct* logLik.lm*
## [10] logLik.lme* logLik.lmeStruct* logLik.lmeStructInt*
## [13] logLik.lmList* logLik.logLik* logLik.nls*
## [16] logLik.reStruct* logLik.varComb* logLik.varFunc*
## see '?methods' for accessing help and source code
library
affects the search pathBecause library
affects the search path, we have added visible methods compared to our first view of those available
nS <- search()
nS[!(nS %in% oS)]
## [1] "package:mgcv" "package:nlme"
In the online version of here, and in the forthcoming second edition, there are short descriptions of formal S4 classes
They were introduced in the Green Book , and covered in , and
The methods package provides S4 classes in R and is used both for formal S4 classes and for RC reference classes also used in Rcpp modules
S4 classes are used extensively in Bioconductor packages
As of SVN revision 74084 (Luke Tierney, 6 Jan 2018):
$ svn diff -r 74083:74084
\item{\option{--default-packages=list}}{where \code{list} is a
comma-separated list of package names or \code{NULL}. Sets the
environment variable \env{R_DEFAULT_PACKAGES} which determines the
- packages loaded on startup. The default for \command{Rscript}
- omits \pkg{methods} as it takes about 60\% of the startup time.
+ packages loaded on startup.
cran_deps <- grep("methods", cran_db[, "Depends"])
cran_imps <- grep("methods", cran_db[, "Imports"])
cran_methods_packages <- cran_db[unique(sort(c(cran_deps, cran_imps))), "Package"]
nrow(cran_db)
## [1] 15109
length(cran_methods_packages)
## [1] 2838
bioc_deps <- grep("methods", bioc_db[, "Depends"])
bioc_imps <- grep("methods", bioc_db[, "Imports"])
bioc_methods_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps))), "Package"]
nrow(bioc_db)
## [1] 3053
length(bioc_methods_packages)
## [1] 1541
Just to check how many of the CRAN packages using methods may be being driven by Rcpp modules
cran_deps <- grep("Rcpp", cran_db[, "Depends"])
cran_imps <- grep("Rcpp", cran_db[, "Imports"])
cran_lt <- grep("Rcpp", cran_db[, "LinkingTo"])
cran_Rcpp_packages <- cran_db[unique(sort(c(cran_deps, cran_imps, cran_lt))), "Package"]
cran_Rcpp_methods_packages <- intersect(cran_methods_packages, cran_Rcpp_packages)
length(cran_Rcpp_packages)
## [1] 1846
length(cran_Rcpp_methods_packages)
## [1] 586
bioc_deps <- grep("Rcpp", bioc_db[, "Depends"])
bioc_imps <- grep("Rcpp", bioc_db[, "Imports"])
bioc_lt <- grep("Rcpp", bioc_db[, "LinkingTo"])
bioc_Rcpp_packages <- bioc_db[unique(sort(c(bioc_deps, bioc_imps, bioc_lt))), "Package"]
bioc_Rcpp_methods_packages <- intersect(bioc_methods_packages, bioc_Rcpp_packages)
length(bioc_Rcpp_packages)
## [1] 183
length(bioc_Rcpp_methods_packages)
## [1] 135
Writing S4 classes involves thinking ahead, to plan a hierarchy of classes (and virtual classes)
If it is possible to generalise methods in the inheritance tree od class definitions, a method can be used on all descendants inheriting from a root class
Formal classes also provide certainty that the classes contain slots as required, and objects can be checked for validity
Over time, unclassed and S3 objects have been made able to work within S4 settings, but some of these adaptions have been fragile
First, a standard dense identity (unit diagonal) matrix created using diag
d100 <- diag(100)
isS4(d100)
## [1] FALSE
str(d100)
## num [1:100, 1:100] 1 0 0 0 0 0 0 0 0 0 ...
object.size(d100)
## 80216 bytes
getClass(class(d100))
## Class "matrix" [package "methods"]
##
## No Slots, prototype of class "matrix"
##
## Extends:
## Class "array", directly
## Class "mMatrix", directly
## Class "structure", by class "array", distance 2
## Class "vector", by class "array", distance 3, with explicit coerce
##
## Known Subclasses: "mts"
Now a sparse identity (unit diagonal) matrix with Diagonal
from the Matrix package
library(Matrix)
D100 <- Diagonal(100)
isS4(D100)
## [1] TRUE
str(D100)
## Formal class 'ddiMatrix' [package "Matrix"] with 4 slots
## ..@ diag : chr "U"
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ x : num(0)
object.size(D100)
## 1240 bytes
getClass(class(D100))
## Class "ddiMatrix" [package "Matrix"]
##
## Slots:
##
## Name: diag Dim Dimnames x
## Class: character integer list numeric
##
## Extends:
## Class "diagonalMatrix", directly
## Class "dMatrix", directly
## Class "sparseMatrix", by class "diagonalMatrix", distance 2
## Class "Matrix", by class "dMatrix", distance 2
## Class "xMatrix", by class "dMatrix", distance 2
## Class "mMatrix", by class "Matrix", distance 4
## Class "Mnumeric", by class "Matrix", distance 4
## Class "replValueSp", by class "Matrix", distance 4
showMethods(f=coerce, classes=class(D100))
## Function: coerce (package methods)
## from="ddiMatrix", to="CsparseMatrix"
## from="ddiMatrix", to="ddenseMatrix"
## from="ddiMatrix", to="dgeMatrix"
## from="ddiMatrix", to="diagonalMatrix"
## from="ddiMatrix", to="dMatrix"
## from="ddiMatrix", to="dsparseMatrix"
## from="ddiMatrix", to="matrix"
## from="ddiMatrix", to="Matrix"
## from="ddiMatrix", to="Mnumeric"
## from="ddiMatrix", to="symmetricMatrix"
## from="ddiMatrix", to="triangularMatrix"
## from="ddiMatrix", to="TsparseMatrix"
str(as(D100, "dgeMatrix"))
## Formal class 'dgeMatrix' [package "Matrix"] with 4 slots
## ..@ x : num [1:10000] 1 0 0 0 0 0 0 0 0 0 ...
## ..@ Dim : int [1:2] 100 100
## ..@ Dimnames:List of 2
## .. ..$ : NULL
## .. ..$ : NULL
## ..@ factors : list()
Reference classes are (much more) like OOP mechanisms in C++, Java, and other programming languages, in which objects contain their class-specific methods
Because the methods are part of the class definitions, the appropriate method is known by the object, and dispatch is unproblematic
Reference classes are provided in the methods package, and R6 classes in the R6 package
R6 classes are more light-weight than RC classes
deps <- grep("R6", cran_db[, "Depends"])
imps <- grep("R6", cran_db[, "Imports"])
R6_packages <- cran_db[unique(sort(c(deps, imps))), "Package"]
length(R6_packages)
## [1] 252
reticulate
cran_db["reticulate", c("Depends", "Imports", "LinkingTo")]
## Depends
## "R (>= 3.0)"
## Imports
## "utils, graphics, jsonlite, Rcpp (>= 0.12.7), Matrix"
## LinkingTo
## "Rcpp"
cran_db["jsonlite", c("Depends", "Imports", "LinkingTo")]
## Depends Imports LinkingTo
## "methods" NA NA
cran_db["Rcpp", c("Depends", "Imports", "LinkingTo")]
## Depends Imports LinkingTo
## "R (>= 3.0.0)" "methods, utils" NA
cran_db["Matrix", c("Depends", "Imports", "LinkingTo")]
## Depends
## "R (>= 3.2.0)"
## Imports
## "methods, graphics, grid, stats, utils, lattice"
## LinkingTo
## NA
reticulate
library(reticulate)
os <- import("os")
os$getcwd()
## [1] "/home/rsb/und/ban421/h19/tues"
np <- import("numpy", convert = FALSE)
a <- np$array(c(1:4))
a
## [1 2 3 4]
py_to_r(a)
## [1] 1 2 3 4
sum <- a$cumsum()
sum
## [ 1 3 6 10]
py_to_r(sum)
## [1] 1 3 6 10
cumsum(c(1:4))
## [1] 1 3 6 10
A unifying feature of S and R has been the treatment of data.frame
objects as data=
arguments to functions and methods
Most of the functions and methods fit models to data, and formula
objects show how to treat the variables in the data=
argument, or if not found there, in the calling environments of the model fitting function
We saw this earlier when looking at mgcv::gam
, but did not explain it
Formulae may be two-sided (mostly) and one-sided; the Formula package provides extensions useful in econometrics
Formula objects can be updated
library(mgcv)
f <- mpg ~ s(wt)
gm <- gam(formula=f, data=mtcars)
gm1 <- gam(update(f, . ~ - s(wt) + poly(wt, 3)), data=mtcars)
anova(gm, gm1, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: mpg ~ s(wt)
## Model 2: mpg ~ poly(wt, 3)
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 28.399 205.29
## 2 28.000 203.67 0.39915 1.6217 0.3098
mtcars$fcyl <- as.factor(mtcars$cyl)
f1 <- update(f, log(.) ~ - s(wt) - 1 + wt + fcyl)
f1
## log(mpg) ~ wt + fcyl - 1
lm(f1, data=mtcars)
##
## Call:
## lm(formula = f1, data = mtcars)
##
## Coefficients:
## wt fcyl4 fcyl6 fcyl8
## -0.1762 3.6731 3.5295 3.4046
terms(f1)
## log(mpg) ~ wt + fcyl - 1
## attr(,"variables")
## list(log(mpg), wt, fcyl)
## attr(,"factors")
## wt fcyl
## log(mpg) 0 0
## wt 1 0
## fcyl 0 1
## attr(,"term.labels")
## [1] "wt" "fcyl"
## attr(,"order")
## [1] 1 1
## attr(,"intercept")
## [1] 0
## attr(,"response")
## [1] 1
## attr(,".Environment")
## <environment: R_GlobalEnv>
head(model.matrix(f1, mtcars))
## wt fcyl4 fcyl6 fcyl8
## Mazda RX4 2.620 0 1 0
## Mazda RX4 Wag 2.875 0 1 0
## Datsun 710 2.320 1 0 0
## Hornet 4 Drive 3.215 0 1 0
## Hornet Sportabout 3.440 0 0 1
## Valiant 3.460 0 1 0
model.response(model.frame(f1, mtcars))[1:6]
## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive
## 3.044522 3.044522 3.126761 3.063391
## Hornet Sportabout Valiant
## 2.928524 2.895912
mtcars$mpg[1:6]
## [1] 21.0 21.0 22.8 21.4 18.7 18.1
log(mtcars$mpg[1:6])
## [1] 3.044522 3.044522 3.126761 3.063391 2.928524 2.895912
library(lme4)
##
## Attaching package: 'lme4'
## The following object is masked from 'package:nlme':
##
## lmList
lmm <- lmer(update(f, log(.) ~ poly(wt, 3) | fcyl), mtcars)
## boundary (singular) fit: see ?isSingular
fixef(lmm)
## (Intercept)
## 2.922204
ranef(lmm)
## $fcyl
## (Intercept) poly(wt, 3)1 poly(wt, 3)2 poly(wt, 3)3
## 4 0.26290733 -0.10590267 0.7154024 0.023990411
## 6 0.08117715 -0.03269927 0.2208928 0.007407559
## 8 -0.24499664 0.09868800 -0.6666653 -0.022356080
##
## with conditional variances for "fcyl"
mtcars$fam <- as.factor(mtcars$am)
lm(update(f, log(.) ~ - s(wt) + (fam*fcyl)/wt - 1), mtcars)
##
## Call:
## lm(formula = update(f, log(.) ~ -s(wt) + (fam * fcyl)/wt - 1),
## data = mtcars)
##
## Coefficients:
## fam0 fam1 fcyl6 fcyl8 fam1:fcyl6
## 2.73316 3.94159 2.46766 0.70085 -3.30564
## fam1:fcyl8 fam0:fcyl4:wt fam1:fcyl4:wt fam0:fcyl6:wt fam1:fcyl6:wt
## -1.47065 0.13514 -0.30280 -0.66469 -0.02918
## fam0:fcyl8:wt fam1:fcyl8:wt
## -0.18018 -0.12990
xtabs(~ fam + fcyl, data=mtcars)
## fcyl
## fam 4 6 8
## 0 3 4 12
## 1 8 3 2
row.names(mtcars)[mtcars$am == 1]
## [1] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Fiat 128"
## [5] "Honda Civic" "Toyota Corolla" "Fiat X1-9" "Porsche 914-2"
## [9] "Lotus Europa" "Ford Pantera L" "Ferrari Dino" "Maserati Bora"
## [13] "Volvo 142E"
In some settings in S and later R, it has been convenient to drop quotation marks around strings repesenting names of packages, list components and classes
We’ve seen this in action already, but have not drawn attention to it
In particular, the $
list component selection operator uses this, as does the use of names in formulae
We also see the same thing in functions attaching packages library
and require
require
## function (package, lib.loc = NULL, quietly = FALSE, warn.conflicts,
## character.only = FALSE, mask.ok, exclude, include.only, attach.required = missing(include.only))
## {
## if (!character.only)
## package <- as.character(substitute(package))
## loaded <- paste0("package:", package) %in% search()
## if (!loaded) {
## if (!quietly)
## packageStartupMessage(gettextf("Loading required package: %s",
## package), domain = NA)
## value <- tryCatch(library(package, lib.loc = lib.loc,
## character.only = TRUE, logical.return = TRUE, warn.conflicts = warn.conflicts,
## quietly = quietly, mask.ok = mask.ok, exclude = exclude,
## include.only = include.only, attach.required = attach.required),
## error = function(e) e)
## if (inherits(value, "error")) {
## if (!quietly) {
## msg <- conditionMessage(value)
## cat("Failed with error: ", sQuote(msg), "\n",
## file = stderr(), sep = "")
## .Internal(printDeferredWarnings())
## }
## return(invisible(FALSE))
## }
## if (!value)
## return(invisible(FALSE))
## }
## else value <- TRUE
## invisible(value)
## }
## <bytecode: 0xd8c890>
## <environment: namespace:base>
In such settings, we cannot use a single element character vector to transmit information:
ggplot2 <- "gridExtra"
ggplot2
## [1] "gridExtra"
ggplot2 <- as.character(substitute(ggplot2))
ggplot2
## [1] "ggplot2"
paste0("package:", ggplot2) %in% search()
## [1] FALSE
Here a similar mechanism is used to look inside the first data.frame
rather than the current and global environments
subset(mtcars, subset=cyl == 4)
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4
## fam
## Datsun 710 1
## Merc 240D 0
## Merc 230 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Toyota Corona 0
## Fiat X1-9 1
## Porsche 914-2 1
## Lotus Europa 1
## Volvo 142E 1
acyl <- "cyl"
ncyl <- 4
mtcars[mtcars[[acyl]] == ncyl,]
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4
## fam
## Datsun 710 1
## Merc 240D 0
## Merc 230 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Toyota Corona 0
## Fiat X1-9 1
## Porsche 914-2 1
## Lotus Europa 1
## Volvo 142E 1
is.list(mtcars)
## [1] TRUE
mtcars[eval(substitute(cyl == 4), mtcars, parent.frame()), ]
## mpg cyl disp hp drat wt qsec vs am gear carb fcyl
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 4
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 4
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 4
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 4
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 4
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 4
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 4
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 4
## fam
## Datsun 710 1
## Merc 240D 0
## Merc 230 0
## Fiat 128 1
## Honda Civic 1
## Toyota Corolla 1
## Toyota Corona 0
## Fiat X1-9 1
## Porsche 914-2 1
## Lotus Europa 1
## Volvo 142E 1
In a blog post, rlang is used to create a list of formulae, but we can use SE constructs
create_form = function(power){
rhs = substitute(I(hp^pow), list(pow=power))
as.formula(paste0("mpg ~ ", deparse(rhs)))
}
list_formulae = Map(create_form, seq(1,6))
# mapply(create_form, seq(1,6))
llm <- lapply(list_formulae, lm, data=mtcars)
sapply(llm, function(x) summary(x)$sigma)
## [1] 3.862962 4.577939 5.132309 5.494144 5.712194 5.840452
Non-standard evaluation may be attractive, as the history of S and R syntax has shown
Use in additive graphics in ggplot2 and in verbs in dplyr and similar packages has led to increases
Standard evaluation is arguably more programmable, but workflows using NSE are promoted (not least by RStudio)
It is interesting that Microsoft are drawing attention to seplyr (see also John Mount’s blog)
Connections are part of base R corresponding to innovations in S4, the DBI database interface abstractions began to appear at about the same time
The DBI classes are formal S4 classes for obvious reasons, but in R connections are S3 classes
Connections generalize files to include downloading (also from https) and uncompressing files, and exchanging data by socket
Connections are sometimes platform-specific, and are used inside input-output functions (tomorrow)
(lns <- readLines(pipe("dir")))
## [1] "ban421.bib\t ban421_h19_tues.html chain_parts.rds"
## [2] "ban421_h18_tues.Rmd ban421_h19_tues.R"
## [3] "ban421_h19_tues_cache ban421_h19_tues.Rmd"
lns <- sort(do.call("c", strsplit(lns, " +")))
con <- file(lns[3])
readLines(con, n=6L)
## [1] "<!DOCTYPE html>"
## [2] ""
## [3] "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
## [4] ""
## [5] "<head>"
## [6] ""
close(con)
The magrittr package introduced the %>%
pipe; coupled to right assign ->
, it even looks reasonable
There are more pipes in R, and the Bizarro pipe ->.;
is pure base R, saving output to .
The arguments for writing with pipes are concentrated on readability
But is the magrittr implementation one that is efficient, or can we stay with standard evaluation?
library(magrittr)
magrittr:::'%>%'
## function (lhs, rhs)
## {
## parent <- parent.frame()
## env <- new.env(parent = parent)
## chain_parts <- split_chain(match.call(), env = env)
## pipes <- chain_parts[["pipes"]]
## rhss <- chain_parts[["rhss"]]
## lhs <- chain_parts[["lhs"]]
## env[["_function_list"]] <- lapply(1:length(rhss), function(i) wrap_function(rhss[[i]],
## pipes[[i]], parent))
## env[["_fseq"]] <- `class<-`(eval(quote(function(value) freduce(value,
## `_function_list`)), env, env), c("fseq", "function"))
## env[["freduce"]] <- freduce
## if (is_placeholder(lhs)) {
## env[["_fseq"]]
## }
## else {
## env[["_lhs"]] <- eval(lhs, parent, parent)
## result <- withVisible(eval(quote(`_fseq`(`_lhs`)), env,
## env))
## if (is_compound_pipe(pipes[[1L]])) {
## eval(call("<-", lhs, result[["value"]]), parent,
## parent)
## }
## else {
## if (result[["visible"]])
## result[["value"]]
## else invisible(result[["value"]])
## }
## }
## }
## <bytecode: 0x9f0a370>
## <environment: 0x9f0b250>
library(wrapr)
##
## Attaching package: 'wrapr'
## The following object is masked from 'package:mgcv':
##
## %.%
library(rbenchmark);
benchmark(SE=cos(exp(sin(4))),
"dot-pipe"=4 %.>% sin(.) %.>% exp(.) %.>% cos(.),
"bizarro-pipe"={4 ->.; sin(.) ->.; exp(.) ->.; cos(.)},
"magrittr-pipe"=4 %>% sin() %>% exp() %>% cos(),
replications=10000, order="elapsed")
## test replications elapsed relative user.self sys.self
## 1 SE 10000 0.015 1.000 0.015 0.000
## 3 bizarro-pipe 10000 0.021 1.400 0.021 0.000
## 2 dot-pipe 10000 0.666 44.400 0.653 0.012
## 4 magrittr-pipe 10000 1.127 75.133 1.125 0.000
## user.child sys.child
## 1 0 0
## 3 0 0
## 2 0 0
## 4 0 0
The overhead shown above seems to stem from aggressive NSE, breaking each pipe transfer down into parts, then running them sequentially
chain_parts <- readRDS("chain_parts.rds")
str(chain_parts)
## List of 3
## $ rhss :List of 3
## ..$ : language sin(.)
## ..$ : language exp(.)
## ..$ : language cos(.)
## $ pipes:List of 3
## ..$ : symbol %>%
## ..$ : symbol %>%
## ..$ : symbol %>%
## $ lhs : num 4
object.size(chain_parts)
## 1496 bytes
So it seems as though temporary objects are being created and garbage collected in all cases
There are however differences in how many such objects are being created, and how big they are
Finally, I first saw ->
used with pipes in a talk about the archivist package
archivist is also discussed in this blog
sessionInfo()
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 30 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R361-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R361-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] rbenchmark_1.0.0 wrapr_1.9.2 magrittr_1.5 lme4_1.1-21
## [5] reticulate_1.13 Matrix_1.2-17 mgcv_1.8-29 nlme_3.1-141
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 knitr_1.25 splines_3.6.1 MASS_7.3-51.4
## [5] lattice_0.20-38 rlang_0.4.0 minqa_1.2.4 stringr_1.4.0
## [9] tools_3.6.1 grid_3.6.1 xfun_0.10 htmltools_0.4.0
## [13] yaml_2.2.0 digest_0.6.21 nloptr_1.2.1 evaluate_0.14
## [17] rmarkdown_1.16 stringi_1.4.3 compiler_3.6.1 boot_1.3-23
## [21] jsonlite_1.6