All the material presented here, to the extent it is original, is available under CC-BY-SA.
I am running R 3.6.1, with recent `update.packages()`.
needed <- c("zoo", "ggraph", "tidygraph", "forcats", "stringr", "dplyr", "purrr", "readr", "tidyr", "tibble", "ggplot2", "tidyverse", "wordcloud", "RColorBrewer", "magrittr", "igraph", "miniCRAN", "XML", "MASS", "BiocManager")
Script and data at https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_mon.zip. Download to a suitable location, unzip, and use as the basis for the exercises.
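The setup can also be done from R itself; a minimal sketch, assuming the working directory is where you want the files (the `exdir` name is an assumption):
download.file("https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_mon.zip", destfile = "ban421_h19_mon.zip") # fetch the course archive
unzip("ban421_h19_mon.zip", exdir = "ban421_h19") # unpack into a subdirectory
inst <- needed %in% rownames(installed.packages()) # which needed packages are already present?
if (any(!inst)) install.packages(needed[!inst]) # install only the missing ones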
Getting to know R: today we will start fairly thoroughly - probably a challenging beginning
Not speed-dating; starting ab ovo or rather ab ova: R has multiple mutations and is better seen as an ecosystem than as an inherently purposed system
Once we have the antecedents, we can see how they may affect data structures and their uses in R
Further, we’ll be able to make informed choices with respect to use of data structures
This course is made up of integrated classes and lab sessions, with the tasks to be carried out tightly linked to the classes
It is important to participate in class and in the lab, both with the instructor and with other participants
If you get stuck, ask someone; everyone gets stuck, it isn’t embarrassing. When someone else gets stuck, try to help; learning R is not like having your hair done
R is about reproducible research; we learn by doing, and by building on things others have done
We can only benefit from things others have done if they are available and if we can show that we get the same results - we can reproduce their work
Scripts (recipes) are the basis for this, and can be extended to literate programming by writing text explaining the steps taken
The threshold to learning enough markdown to write documents showing what has been done is not high
Time | Topic |
---|---|
Monday 4/11 | |
08.15-10.00 | History of R and its data structures |
10.15-11.00 | Class exercises |
12.15-14.00 | Basic data structures: vectors, list, data.frame, matrix, array |
14.15-15.00 | Class exercises |
Tuesday 5/11 | |
08.15-10.00 | Basic data structures: factors, time, encoding |
10.15-11.00 | Class exercises |
12.15-14.00 | Class systems and method dispatch; formulae, non-standard evaluation and combining functions |
14.15-15.00 | Class exercises |
Wednesday 6/11 | |
08.15-10.00 | Basic input/output into/from data structures |
10.15-11.00 | Class exercises |
12.15-14.00 | Comparing alternatives |
14.15-15.00 | Class exercises |
Thursday 7/11 | |
09.00-16.00 | Group work day |
Friday 8/11 | |
08.00-10.00 | Presentations |
10.15-11.45 | Presentations |
12.15-15.00 | Presentations |
The underlying aim: to survey contemporary approaches to data structures and their handling in context
Why in context? Because without the context, some alternatives may seem to be closed off by the presentation narrative
Your (group) projects are key part of the seminar
Suggested topics include comparisons of different representations, both legacy (`data.frame`) and modern (`data.table`, `tibble`, …), their input/output and handling methods
Similar topics may be gleaned from R-bloggers and its Twitter feed; some of the claims deserve to be checked
The aim is not to find winners, but to explore alternatives
Thursday work-day, Friday presentation day, hand in via WiseFlow by 14.00, 23 November.
Needs for learning resources, and ways of making use of them, vary greatly between participants
There are lots of books, but many now present one-size-fits-all solutions that may not be a best fit
RStudio also provides an online learning page with a number of options; it no longer points to DataCamp, but still points to swirl
R is distributed from mirrors of the comprehensive R archive network (CRAN)
The cloud mirror is the easiest, but a local server may be faster
RStudio can be downloaded and installed after R has been installed
R comes with many contributed packages - the ones we need are on CRAN, which lists them providing information; we’ll get back to contributed packages later
R is as small or large as you like, and runs in many different configurations (no smartphones); the core is written in C
The language has developed from S written at Bell Labs NJ, where Unix, C, C++, and scripting technologies were created in the 1970s and 1980s
Bell Labs statisticians had a strong focus on graphics and exploratory data analysis from the beginning
Many underlying abstractions were established by 1988 and 1992; we’ll get to the `data.frame` and `formula` abstractions later
An R session records its history - all that is entered at the console prompt - and a workspace containing objects
On exiting a session, the history may be saved to a history file, and the workspace may be saved to an RData file; history and chosen objects (or all objects) may be saved manually before exit
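Done manually, that might look like the following sketch (file names are assumptions; `savehistory` only works in interactive sessions):
x <- rnorm(10) # an example object in the workspace
savehistory("ban421.Rhistory") # save the console history so far
save(x, file = "x.RData") # save a chosen object
save.image("workspace.RData") # save all objects in the workspace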
The workspace is in the memory of the computer, and R itself expects there to be enough memory for all of the data, intermediate and final results
Modern R is 64-bit, so limits are most often set by the computer hardware; use can be made of multiple cores to compute in parallel
In the RStudio Interactive Development Environment (IDE), it is convenient to use R Markdown to write notebooks (annotated scripts)
Chunks of code are run in sequence and may be echoed in the output
Output is shown in its right place, including graphics output
The document may also be converted to a script, mirroring the weave/tangle - knit/purl duality
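For example, `knitr::purl()` tangles the code chunks out of an R Markdown document (assuming knitr is installed; the file name is hypothetical):
knitr::purl("ban421_notes.Rmd") # writes ban421_notes.R, a script containing only the code chunks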
This presentation is written in Markdown, as we’ll see …
In RStudio, the Help tab in the lower right pane (default position) gives access to the R manuals and to the installed packages help pages through the Packages link under Reference
In R itself, help pages are available in HTML (browser) and text form; `help.start()` uses the default browser to display the Manuals, Reference and Miscellaneous Material sections in RStudio’s home help tab
The search engine can be used to locate help pages, but is not great if many packages are installed, as no indices are stored
The help system needs to be learned in order to provide the user with ways of progressing without wasting too much time
The base help system does not tell you how to use R as a system, about packages not installed on your machine, or about R as a community
It does provide information about functions, methods and (some) classes in base R and in contributed packages installed on your machine
We’ll cover these first, then go on to look at vignettes, R Journal, task views, online help pages, and the blog aggregator
There are different requirements with regard to help systems - in R, the help pages of base R are expected to be accurate although terse
Each help page provides a short description of the functions, methods or classes it covers; some pages cover more than one such
Help pages are grouped by package, so that the browser-based system is not easy to browse if you do not know which package a function belongs to
The usage of the function is shown explicitly, including any defaults for arguments to functions or methods
Each argument is described, showing names and types; in addition details of the description are given, together with the value returned
Rather than starting from the packages hierarchy of help pages, users most often use the `help` function
The function takes the name of the function about which we need help; the name may be in quotation marks, and class names contain a hyphen and must be quoted
Instead of using, say, `help(help)`, we can shorten to the question mark operator: `?help`
Occasionally, several packages offer different functions with the same name, and we may be offered a choice; we can disambiguate by putting the package name and two colons before the function name
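For example, both the stats and dplyr packages provide a function called `filter`:
?filter # may offer a choice if a package masking stats::filter is attached
?dplyr::filter # unambiguously dplyr's row-subsetting filter
?stats::filter # unambiguously the time-series filter in stats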
In the usage section, function arguments are shown by name and order; the `args` function returns this information
In general, if arguments are given by name, the order is arbitrary, but if names are not used at least sometimes, order matters
Some arguments do not have default values and are probably required, although some are guessed if missing
Being explicit about the names of arguments and the values they take is helpful in scripting and reproducible research
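A small illustration with `rnorm`:
args(rnorm) # function (n, mean = 0, sd = 1)
rnorm(3, 100, 15) # positional matching: order matters
rnorm(sd = 15, n = 3, mean = 100) # named matching: order is arbitrary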
The ellipsis `...` indicates that the function itself examines the objects passed to see what to do
The regular R console does not provide tooltips, that is, a bubble first offering alternative function or object names as you type, then lists of argument names
RStudio, like many IDEs, does provide this, controlled by Tools -> Global options -> Code -> Completion (by default it is operative)
This may be helpful or not, depending on your style of working; if you find it helpful, fine, if not, you can make it less invasive under Global options
Other IDEs have also provided this facility, which builds directly on the usage sections of help pages of functions in installed packages
Base R has a set of checks and tests that ensure coherence between the code itself and the usage sections in help pages
These mechanisms are used in checking contributed packages before they are released through the archive network; the description of arguments on help pages must match the function definition
It is also possible to generate help pages documenting functions automatically, for example using the roxygen2 package
It is important to know that we can rely on this coherence
The objects returned by functions are also documented on help pages, but the coherence of the description with reality is harder to check
This means that use of `str` or other functions or methods may be helpful when we want to look inside the returned object
The form taken by returned values will often also vary, depending on the arguments given
Most help pages address this issue not by writing more about the returned values, but by using the examples section to highlight points of potential importance for the user
Reading the examples section on the help page is often enlightening, but we do not need to copy and paste
The `example` function runs those parts of the code in the examples section of a function's help page that are not tagged "don't run"; this can be overridden, but may involve meeting conditions not met on your machine
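For instance, to run the examples from the help page of `mean`:
example(mean) # runs the examples section of ?mean, echoing the code as it goes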
This code is run nightly on CRAN servers on multiple operating systems and using released, patched and development versions of R, so checking both packages and the three versions of R
Some examples use data given verbatim, but many use built-in data sets; most packages also provide data sets to use for running examples
This means that the examples and the built-in data sets are a most significant resource for learning how to solve problems with R
Very often, one recognizes classic textbook data sets from the history of applied statistics; contemporary textbook authors often publish collections of data sets as packages on CRAN
The built-in data sets also have help pages, describing their representation as R objects, and their licence and copyright status
These help pages also often include an examples section showing some of the analyses that may be carried out using them
One approach that typically works well when you have a data set of your own, but are unsure how to proceed, is to find a built-in data set that resembles the real one, and play with that first
The built-in data sets are often quite small, and if linked to text books, they are well described there as well as in the help pages
By definition, the built-in data sets do not have to be imported into R, as they are almost always stored as files of R objects
In some cases, these data sets are stored in external file formats, most often to show how to read those formats
The built-in data sets in the base datasets package are in the search path, but data sets in other packages should be loaded using the `data()` function:
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
library(MASS)
data(deaths)
str(deaths)
## Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...
At about the time that literate programming arrived in R with `Sweave` and `Stangle` (we mostly use knitr now), the idea arose of supplementing package documentation with example workflows
Vignettes are PDF documents with accompanying runnable R code that describe how to carry out particular sequences of operations
The package index page in the RStudio help tab shows user guides, package vignettes and other documentation
The `vignette()` function can be used to list vignettes by installed package, and to open the chosen vignette in a PDF reader
A very typical way of using vignettes on a machine with enough screen space is to read the document and run the code from the R file at the same time
Assign the output of `vignette` to an object; the `print` method shows the PDF or HTML, and the `edit` method gives direct access to the underlying code for copy and paste
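A sketch, assuming the grid package's introductory vignette is installed (it ships with base R):
v <- vignette("grid", package = "grid")
v # the print method displays the formatted document
edit(v) # gives direct access to the underlying R code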
The help system in RStudio provides equivalent access to vignette documents and code
Papers about R contributed packages published in the Journal of Statistical Software and the R Journal are often constructed in this way too
As R has developed, the number of packages on CRAN has grown (other packages are on BioConductor and github)
CRAN task views were introduced to try to provide some subject area guidance
They remain terse, and struggle to keep up, but are still worth reviewing
Note that those working in different subject areas often see things rather differently, leading to subject specific treatment of intrinsically similar themes
The help system and vignettes were designed to be used offline, so that the versions of R and installed packages matched the documentation
If you search online for information about functions in R or in contributed packages, you often reach inside-R, sponsored by Revolution Analytics
Help pages may also be viewed online from your chosen CRAN mirror; package pages provide these (Reference manual) and vignettes as links
Remember to check that the versions of your installed software and the online documentation are the same
The R community has become a number of linked communities rather than a coherent and hierarchical whole
As in many open source projects, the R project is more bazaar than cathedral; think of niches in ecosystems with differing local optima in contrast to a master plan
One style is based on mailing lists, in which an issue raised by an original poster is resolved later in that thread
Another style is to use online fora, such as StackOverflow, which you need to visit rather than receiving messages in your inbox
There are now many blogs involving the use of R, fortunately aggregated at R-bloggers, where other resources may also be found
New aggregated blog topics are linked to a Twitter account, so if you want, you too can be bombarded by notifications
These are also a potential source of project ideas, especially because some claims should be challenged
R Users Groups and R Ladies provide face-to-face meeting places that many value
R started as a teaching tool for applied statistics, but this community model has been complemented by others
R is now widely used in business, public administration and voluntary organizations for data analysis and visualization
The R Consortium was created in 2015 as a vehicle for companies with relationships to R
R itself remains under the control of the R Foundation, which is still mostly academic in flavour
Rasmus Bååth has a useful blog piece on R’s antecedents in the S language
Something similar is present in the second chapter of (Chambers 2016), from the viewpoint of one of those responsible for the development of the S language
In addition to S, we need to take SICP and Scheme into account (Abelson and Sussman 1996), as described by (Ihaka and Gentleman 1996) and (Wickham 2014)
Finally, LispStat and its creators have played and continue to play a major role in developing R (Tierney 1990, 1996, 2005)
S: An Interactive Environment for Data Analysis and Graphics, A.K.A. the Brown Book (Becker and Chambers 1984); Extending the S System (Becker and Chambers 1985)
Brown Books
The New S Language: A Programming Environment for Data Analysis and Graphics, A.K.A. the Blue Book (Becker, Chambers, and Wilks 1988); Statistical Models in S, A.K.A. the White Book (Chambers and Hastie 1992)
Blue and White Books
Programming with Data: A Guide to the S Language, A.K.A. the Green Book (Chambers 1998); S Programming (Venables and Ripley 2000)
Green Book
The S2 system was described in the Brown Book, S3 in the Blue Book and completed in the White Book, finally S4 in the Green Book
The big advance from S2 to S3 was that users could write functions; that data.frame objects were defined; that formula objects were defined; and that S3 classes and method dispatch appeared
S4 brought connections and formal S4 classes, the latter seen in R in the methods package (still controversial)
S-PLUS was/is the commercial implementation of S and its releases drove S3 and S4 changes
S was a Bell Labs innovation, like Unix, C, C++, and many interpreted languages (like AWK); many of these share key understandings
Now owned by Nokia, previously Alcatel-Lucent, Lucent, and AT&T
Why would a telecoms major (AT&T) pay for fundamental research in computer science and data analysis (not to sell or market other products better)?
Some Green Book examples are for quality control of telecoms components
S-PLUS was quickly adopted for teaching and research, and with S3, provided extensibility in the form of libraries
Most links have died by now, but see this FAQ for a flavour - there was a lively community of applied statisticians during the 1990s
S built on a long tradition of documentation through examples, with use cases and data sets taken from the applied statistical literature; this let users compare output with methods descriptions
… so we get to R
Luke Tierney was in R core in 1997, and has continued to exert clear influence over development
Because R uses a Scheme engine, similar to Lisp, under the hood, his insight into issues like the garbage collector, namespaces, byte-compilation, serialization, parallelization, and now ALTREP has been crucial (see also the proposal by Luke Tierney, Gabe Becker and Tomas Kalibera)
Many of these issues involve the defensive copy on possible change policy involved in lazy evaluation, which may lead to multiple redundant copies of data being present in memory
Luke Tierney and Brian Ripley have fought hard to let R load fast, something that is crucial to ease the use of R on multicore systems or inside databases
R 3.4.4
> n <- 1e7
> set.seed(1)
> x <- rnorm(n)
> y <- rnorm(n)
> system.time(lm(y ~ x))
user system elapsed
7.007 0.356 7.431
ALTREP R 3.5.1
> n <- 1e7
> set.seed(1)
> x <- rnorm(n)
> y <- rnorm(n)
> system.time(lm(y ~ x))
user system elapsed
1.254 0.433 1.700
Computational processes are abstract "beings" that inhabit computers.
Processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules called a program.
People create programs to direct processes.
Analysing data always involves programming, even if hidden under a point-and-click user interface.
Scripts are simple interpreted programs that are easy to write, read, and improve, and permit analysis to be documented in a responsible way.
A computational process, in a correctly working computer, executes programs precisely and accurately.
Novice programmers must learn to understand and foresee the consequences of their attempts to create and execute programs.
Even small errors in programs can have complex and unanticipated consequences.
Software engineers or programmers learn to organize programs so that they can be reasonably sure that the resulting processes will perform the tasks intended.
They know how to structure programs so that unanticipated problems do not lead to catastrophic consequences, and when problems do arise, they can debug their programs.
The language is more than just a means for instructing a computer to perform tasks, it also serves as a framework within which we organize our ideas about processes.
When we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Each language has three mechanisms for this:
primitive expressions, which represent the simplest entities the language is concerned with,
means of combination, by which compound elements are built from simpler ones, and
means of abstraction, by which compound elements can be named and manipulated as units.
All of these mechanisms are present in scripting languages, but may only be available to the user in specialised forms.
Think of a pocket calculator. You type an expression `50`, and the calculator responds by displaying the result of its evaluating that expression: `50`.
A number is a primitive expression. Type a number, and your calculator displays the result. Expressions representing numbers can be combined with an expression representing a primitive arithmetic procedure (such as addition, subtraction, multiplication or division) to form a compound expression that represents the application of the procedure to those numbers.
These compound expressions are called combinations. They are built up of an operator and a number of operands. The order in which the operator and the operands are typed depends on the grammar of the language, called its notation, and some means are needed to signal the beginning and end of the combination, for example parentheses ().
Most languages use infix notation, which means that operators and operands are mixed together in the combination, and require care in typing. Most calculators also use infix notation: typing the combination `50+25` yields the result `75`.
The expression `50+25*2` may be ambiguous without rules for nesting combinations; left to right gives `150`, but right to left `100`. Moral: use parentheses to make sure the program does what you want: `(50+25)*2`, or `50+(25*2)`. Infix notation looks like arithmetic, but in programming can be ambiguous.
A critical aspect of a programming language is the means it provides for using names to refer to computational objects. We say that the name identifies a variable whose value is the object. Our calculator often has a key named `PI`, which when typed displays the value `3.1415927`.
So we can type `(2*PI*10)` to calculate the circumference of a circle of radius 10 units: the calculator displays `62.831853`. Naming or defining variables in most scripting languages uses the assignment abstraction, symbolised by the `=` sign. This is not "equals"!
In `res = expression`, it is `res` that is being named as a variable, with its value set to the result of evaluating the expression. Documenting what you did saves tears later.
Associating values with symbols and later retrieving them means that the interpreter must maintain a memory that keeps track of the name-object pairs - the environment.
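In R terms, a minimal sketch of binding a value to a name and retrieving it:
res <- 50 + 25 # bind the value of the expression to the name res
res # the interpreter looks res up in the environment
## [1] 75
ls() # names currently bound in the global environment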
To build compound procedures, we need to remember that combinations can be nested in an expression: `((layer1*10)+(layer2/100))`, or `((x*x)+(y*y))`, where the variables `x` and `y` have been given values.
But this expression `((x*x)+(y*y))` is actually a combination of two identical subexpressions for squaring the specified variable. It would be a good idea to give a name to this compound procedure; for example, on our calculator it could be named `x2`, taking one formal parameter.
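In R, naming a compound procedure means defining a function; a sketch using the hypothetical calculator name `x2`:
x2 <- function(x) x * x # one formal parameter, x
x2(3) + x2(4) # the combination ((x*x)+(y*y)) with x = 3, y = 4
## [1] 25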
Other compound procedures found on our calculator are `SQRT`, `log`, and `sin`. Of course, you could make the value of the formal parameter to these compound procedures an arbitrarily complicated expression.
Compound procedures are used in exactly the same way as primitive procedures.
Because we are dealing with computational processes, not arithmetic, we will find that there are exceptional conditions that can arise when the value of an operand that we have typed is not in the set of feasible values.
The expression `1/0` yields `Inf` (infinity), while `0/0` yields `NaN` (not-a-number), and `log(0)` yields an error.
When does `3/4` yield `0`? In most languages, because the numbers are integers, not floating point numbers. If you want a floating point result, you have to make the operands floating point first: `3.0/4.0` and `FLOAT(3)/FLOAT(4)` yield `0.75`.
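R behaves differently here: `/` always performs floating-point division, and integer division has its own operator:
3 / 4 # floating-point division, even for whole-number operands
## [1] 0.75
3L %/% 4L # explicit integer division
## [1] 0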
Why was `underscore_separated` not a permitted naming convention in R earlier (see (Bååth 2012))? `_` was not a permitted character in names until it had lost its left assign role, the same as `<-`, in 1.9.0 in 2004. (Brown Book p. 256, Blue Book p. 387)
![](../../pix/assign.png)
`stringsAsFactors`
Why is the `factor` storage mode still so central? `stringsAsFactors = TRUE` was the legacy `as.is = FALSE`; analysis of categorical variables was more important, and `factor` only needed to store `nlevels()` strings (White Book pp. 55-56, 567)
![](../../pix/asis1.png)
`drop`
`drop = TRUE` for array-like objects; since matrices are vectors with a `dim` attribute, choosing (part of) a row or column made `dim` redundant (Blue Book p. 128, White Book p. 64)
![](../../pix/drop.png)
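A small sketch of `drop` at work on a matrix:
m <- matrix(1:6, nrow = 2) # a 2-by-3 matrix: a vector with a dim attribute
str(m[1, ]) # default drop = TRUE: dim is dropped, a plain vector remains
## int [1:3] 1 3 5
str(m[1, , drop = FALSE]) # dim is kept: still a 1-by-3 matrix
## int [1, 1:3] 1 3 5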
Treating scalars as vectors is not efficient:
(Screenshot: Ihaka Lecture Series 2017, "Statistical computing in a (more) static environment", YouTube)
An R-0.49 source tarball is available from CRAN
Diffs for Fedora 27 (gcc 7.3.1) include setting compilers and `-fPIC` in `config.site`, putting `./` before `config.site` in `configure`, and three corrections in `src/unix`: in `dataentry.h` add `#include <X11/Xfuncproto.h>` and comment out `NeedFunctionPrototypes`; in `rotated.c` comment out `/*static*/ double round`; in `system.c` comment out `__setfpucw` twice; BLAS must be provided externally
Not (yet) working: prototypes are missing in the eda and mva packages so the shared objects fail to build
The command `svn log --xml --verbose -r 6:77269 https://svn.r-project.org/R/trunk > trunk_verbose_log_new1.xml` provides a rich data source
Each log entry has a revision number, author and timestamp, message and paths to files indicating the action undertaken for each file
The XML version is somewhat easier to untangle than the plain-text version
I haven’t tried similar approaches with Winston Chang’s r-source repo on GitHub
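A minimal sketch of untangling the XML log with the XML package, assuming the log file is in the working directory:
library(XML)
doc <- xmlParse("trunk_verbose_log_new1.xml") # parse the whole log
entries <- getNodeSet(doc, "//logentry") # one node per commit
authors <- sapply(entries, function(e) xmlValue(e[["author"]]))
head(sort(table(authors), decreasing = TRUE)) # the most active committers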
## <logentry revision="6">
## <author>ihaka</author>
## <date>1997-09-18T04:41:25.000000Z</date>
## <paths>
## <path action="M" prop-mods="false" text-mods="true" kind="file">/trunk/src/library/base/R/lm</path>
## </paths>
## <msg>New predict.lm from Peter Dalgaard</msg>
## </logentry>
## <logentry revision="77268">
## <author>ripley</author>
## <date>2019-10-09T08:47:26.341664Z</date>
## <paths>
## <path text-mods="true" kind="file" action="M" prop-mods="false">/trunk/doc/manual/R-exts.texi</path>
## </paths>
## <msg>add note on need to declare Python etc</msg>
## </logentry>
## 2015 68948 2150 use https
## 2012 59039 1727 use preferred form of 'R Core Team'
## 2011 56186 1260 Revert r56184 and r56185
## 2011 56184 1249 Remove redundant \alias entries from man pages
## 2007 42333 1223 add copyright/licence header, remove CVS-style $Id fields
## 2012 61433 620 remove trailing spaces
## 2012 60146 602 add copyright statements
## 2007 42338 559 add licence statements
## 2012 59780 524 update, including bug-reporting address
## 2003 27444 497 splitting base
##
## tools FAQ m4 etc configure.ac
## 409 467 626 637 747
## share configure BUGS date-stamp po
## 1118 1398 1485 2531 3762
## NEWS doc tests src
## 5842 12243 13132 99998
##
## windows graphics gnome macintosh appl unix nmath
## 74 79 217 611 820 1545 1771
## scripts extra modules include gnuwin32 main library
## 1992 2367 2394 3621 9199 15505 59682
##
## compiler profile stats4 translations nls
## 248 249 261 296 319
## datasets modreg mva splines ctest
## 340 402 462 465 549
## tcltk ts parallel grid graphics
## 872 900 1195 2046 2213
## grDevices methods utils stats tools
## 3849 4066 5656 7256 7542
## base
## 19720
##
## Makefile DESCRIPTION.in makebasedb.R Makefile.win baseloader.R
## 8 10 11 18 32
## demo data Makefile.in inst po
## 62 66 75 270 499
## R man
## 6884 11778
![](ban421_h18_mon_files/figure-beamer/fig3-1.pdf)
Once S3 permitted extension by writing functions, and packaging functions in libraries, S and R ceased to be monolithic
In R, a library is where packages are kept, distinguishing between base and recommended packages distributed with R, and contributed packages
Contributed packages can be installed from CRAN (infrastructure built on CPAN and CTAN for Perl and Tex), Bioconductor, other package repositories, and other sources such as github
With over 12000 contributed packages, CRAN is central to the R community, but is stressed by dependency issues (CRAN is not run by R core)
Andrie de Vries’ post *Finding clusters of CRAN packages using igraph* looked at CRAN package clusters from a page rank graph
We are over three years further on now, so updating may be informative
However, this is only CRAN, and there is the big Bioconductor repository to consider too
Adding in the Bioconductor (S4, curated) repo does alter the optics, as you’ll see, over and above the cluster dominated by Rcpp
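A sketch of the underlying computation with miniCRAN and igraph (the repository URL is an assumption, and building the full dependency graph takes a while):
library(miniCRAN)
library(igraph)
pdb <- available.packages(repos = "https://cloud.r-project.org") # CRAN package database
g <- makeDepGraph(rownames(pdb), availPkgs = pdb, suggests = FALSE) # igraph dependency graph
pr <- page.rank(g) # page rank centrality of each package
head(sort(pr$vector, decreasing = TRUE), 5) # the most central packages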
## Rcpp ggplot2 MASS AnnotationDbi dplyr
## 0.023611 0.012835 0.011114 0.009177 0.008589
## Matrix stringr magrittr data.table mvtnorm
## 0.006606 0.005141 0.005132 0.004810 0.004725
## plyr survival jsonlite RcppArmadillo Biobase
## 0.004700 0.004672 0.004445 0.004274 0.004183
## httr igraph tibble foreach shiny
## 0.004035 0.003822 0.003565 0.003505 0.003470
## MASS mvtnorm survival igraph foreach lattice
## 0.011113669 0.004724803 0.004671883 0.003822415 0.003504848 0.003273646
## doParallel zoo coda glmnet nlme R6
## 0.002545046 0.001938126 0.001881350 0.001722928 0.001588220 0.001582231
## ggplot2 dplyr stringr magrittr data.table plyr
## 0.012835220 0.008588749 0.005140872 0.005132022 0.004809706 0.004699903
## jsonlite httr tibble shiny reshape2 tidyr
## 0.004444708 0.004035063 0.003565334 0.003469870 0.003402997 0.003310935
## Biobase GenomicRanges BiocGenerics
## 0.0041831518 0.0024108003 0.0021567369
## SummarizedExperiment limma GenomeInfoDb
## 0.0015930622 0.0013923746 0.0012813877
## graph BiocParallel Rsamtools
## 0.0009660412 0.0009172630 0.0008740876
## affy rtracklayer edgeR
## 0.0007696580 0.0007413752 0.0005661088
## Rcpp Matrix RcppArmadillo RcppEigen BH
## 0.0236111055 0.0066059623 0.0042738209 0.0013315944 0.0012213641
## rstan RcppParallel bigmemory RcppProgress StanHeaders
## 0.0006470612 0.0005426527 0.0003310505 0.0003149103 0.0002288213
## nlmixr rstantools
## 0.0001900809 0.0001566603
## sp raster rgdal sf rgeos
## 0.0029299842 0.0018844473 0.0010844239 0.0007735674 0.0007501712
## spatstat png maptools maps leaflet
## 0.0006813641 0.0006595522 0.0006108745 0.0004304173 0.0003218325
## geosphere ncdf4
## 0.0002870568 0.0002709302
## AnnotationDbi org.Hs.eg.db
## 9.176603e-03 1.068938e-03
## GenomicFeatures org.Mm.eg.db
## 9.622581e-04 5.691116e-04
## org.Rn.eg.db ChIPQC
## 3.689497e-04 1.231048e-04
## rCGH chimera
## 1.060978e-04 1.019420e-04
## TxDb.Hsapiens.UCSC.hg19.knownGene org.Dm.eg.db
## 7.966464e-05 7.002200e-05
## Mus.musculus Rattus.norvegicus
## 6.394154e-05 6.137107e-05
## Name Package
## 1 Bioconductor Package 42
## 2 Martin Morgan 39
## 3 Wolfgang Huber 34
## 4 Marc Carlson 29
## 5 Herve Pages 24
## 6 Aaron Lun 15
## 7 Levi Waldron 14
## 8 Marcel Ramos 13
## 9 Davide Risso 11
## 10 Gordon Smyth 10
## 11 Mike Smith 10
## 12 Joern Toedling 10
## Name Package
## 1 Rafael A. Irizarry 29
## 2 Kasper Daniel Hansen 21
## 3 Matthew N. McCall 20
## 4 Hector Corrada Bravo 17
## 5 Tim Triche 14
## 6 Andrew E. Jaffe 12
## 7 John D. Storey 12
## 8 Jeffrey T. Leek 11
## 9 Leonardo Collado-Torres 9
## 10 Jean-Philippe Fortin 9
## 11 D. 9
## 12 Rafael Irizarry 8
## Name Package
## 1 Hana Sevcikova 14
## 2 Adrian Raftery 12
## 3 Thomas Brendan Murphy 9
## 4 Chris Fraley 9
## 5 University of Washington 8
## 6 Adrian E. Raftery 7
## 7 Luca Scrucca 7
## 8 Xiuwen Zheng 6
## 9 Isobel Claire Gormley 6
## 10 Michael Fop 5
## 11 Ian Painter 5
## 12 Patrick Gerland 5
## Name Package
## 1 org 101
## 2 bioconductor 97
## 3 The Bioconductor Pro 97
## 4 Mark S. Handcock 13
## 5 Martina Morris 11
## 6 Pavel N. Krivitsky 11
## 7 Skye Bender-deMoll 10
## 8 David R. Hunter 8
## 9 Li Wang 7
## 10 Steven M. Goodreau 7
## 11 Carter T. Butts 7
## 12 Kirk Li 6
Many sources in applied statistics with an S-like syntax but Lisp/Scheme-like internals, and sustained tensions between these
Many different opinions on preferred ways of structuring data and data handling, opening for adaptations to different settings
More recently, larger commercial interest in handling large, long input data sets (an interest previously also present); simulations also generate large output data sets; bioinformatics data are both wide and long
Differing views of the world in terms of goals and approaches
Differences provide ecological robustness
R can be a calculator, with output printed by the default method
2+3
## [1] 5
7*8
## [1] 56
3^2
## [1] 9
log(1)
## [1] 0
log10(10)
## [1] 1
We could print explicitly:
print(2+3)
## [1] 5
print(sqrt(2))
## [1] 1.414214
print(sqrt(2), digits=10)
## [1] 1.414213562
print(10^7)
## [1] 1e+07
Exceptions also happen (Inf is infinity, NaN is Not a Number):
log(0)
## [1] -Inf
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
1/0
## [1] Inf
0/0
## [1] NaN
We assign results of operations and functions to named objects with `<-`, or equivalently `=`; names begin with letters or a dot:
a <- 2+3
a
## [1] 5
is.finite(a)
## [1] TRUE
a <- log(0)
is.finite(a)
## [1] FALSE
The printed results are prepended by a curious `[1]`; all these results are unit-length vectors. We can combine several objects with `c()`:
a <- c(2, 3)
a
## [1] 2 3
sum(a)
## [1] 5
str(a)
## num [1:2] 2 3
aa <- rep(a, 50)
aa
## [1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## [36] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [71] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
The single square brackets `[]` are used to access or set elements of vectors (the colon `:` gives an integer sequence); negative indices drop elements:
length(aa)
## [1] 100
aa[1:10]
## [1] 2 3 2 3 2 3 2 3 2 3
sum(aa)
## [1] 250
sum(aa[1:10])
## [1] 25
sum(aa[-(11:length(aa))])
## [1] 25
Infix syntax is just a representation of the actual underlying forms
a[1] + a[2]
## [1] 5
sum(a)
## [1] 5
`+`(a[1], a[2])
## [1] 5
Reduce(`+`, a)
## [1] 5
We’ve done arithmetic on scalars, we can do vector-scalar arithmetic:
sum(aa)
## [1] 250
sum(aa+2)
## [1] 450
sum(aa)+2
## [1] 252
sum(aa*2)
## [1] 500
sum(aa)*2
## [1] 500
But vector-vector arithmetic poses the question of vector length and recycling (the shorter one gets recycled):
v5 <- 1:5
v2 <- c(5, 10)
v5 * v2
## Warning in v5 * v2: longer object length is not a multiple of shorter
## object length
## [1] 5 20 15 40 25
v2_stretch <- rep(v2, length.out=length(v5))
v2_stretch
## [1] 5 10 5 10 5
v5 * v2_stretch
## [1] 5 20 15 40 25
In working with real data, we often meet missing values, coded by NA meaning Not Available:
anyNA(aa)
## [1] FALSE
is.na(aa) <- 5
aa[1:10]
## [1] 2 3 2 3 NA 3 2 3 2 3
anyNA(aa)
## [1] TRUE
sum(aa)
## [1] NA
sum(aa, na.rm=TRUE)
## [1] 248
We’ve looked at the simple stuff, when arithmetic and assignment happens as expected
A strength of R is the handling of exceptions, which do happen when handling real data, which not infrequently differs from what we thought it was
Wanting a result from data is reasonable when the data meet all the requirements
If the data do not meet the requirements, we may get unexpected results, warnings or even errors: most often we need to go back and check our input data
One way to check our input data is to print in the console - this works with small objects as we’ve seen, but for larger objects we need methods:
big <- 1:(10^5)
length(big)
## [1] 100000
head(big)
## [1] 1 2 3 4 5 6
str(big)
## int [1:100000] 1 2 3 4 5 6 7 8 9 10 ...
summary(big)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 25001 50000 50000 75000 100000
There are `length`, `head`, `str` (structure) and `summary` methods for many types of objects
`str` also gives us a hint of the type of object and its dimensions
We’ve seen a couple of uses of `str` so far: `str(a)` was `num` and `str(big)` was `int`; what does this signify?
They are both numbers, but of different types
There are six basic vector types: list, integer, double, logical, character and complex
The derived type factor (to which we return shortly) is integer with extra information
`str` reports these as int, num, logi, chr and cplx; lists are enumerated recursively
In RStudio you see more or less the `str` output in the Environment pane as Values in the list view; the grid view adds the object size in memory
From early S, we have `typeof` and `storage.mode` (including single precision, not used in R); these are important for interfacing C, C++, Fortran and other languages
Beyond this is `class`, but then the different class systems (S3 and formal S4) complicate things
Objects such as vectors may also have attributes in which their class and other information may be placed
Typically, a lot of use is made of attributes to squirrel away strings and short vectors
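A small sketch of squirreling a string away in an attribute:
v <- 1:3
attr(v, "units") <- "metres" # stash metadata in an attribute
attributes(v)
## $units
## [1] "metres"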
`is` methods are used to test types of objects; note that integers are also seen as numeric:
set.seed(1)
x <- runif(50, 1, 10)
is.numeric(x)
## [1] TRUE
y <- rpois(50, lambda=6)
is.numeric(y)
## [1] TRUE
is.integer(y)
## [1] TRUE
xy <- x < y
is.logical(xy)
## [1] TRUE
`as` methods try to convert between object types and are widely used:
str(as.integer(xy))
## int [1:50] 1 1 0 0 1 0 0 0 1 1 ...
str(as.numeric(y))
## num [1:50] 6 9 5 4 3 3 5 6 7 5 ...
str(as.character(y))
## chr [1:50] "6" "9" "5" "4" "3" "3" "5" "6" "7" "5" "9" "5" "6" "5" ...
str(as.integer(x))
## int [1:50] 3 4 6 9 2 9 9 6 6 1 ...
First, let us see what is behind the `data.frame` object: the `list` object
`list` objects are vectors that contain other objects, which can be addressed by name or by 1-based indices
Like the vectors we have already met, lists can be accessed and manipulated using square brackets []
Single list elements can be accessed and manipulated using double square brackets [[]]
Starting with four vectors of differing types, we can assemble a list object; as we see, its structure is quite simple. The vectors in the list may vary in length, and lists can (and do often) include lists
V1 <- 1:3
V2 <- letters[1:3]
V3 <- sqrt(V1)
V4 <- sqrt(as.complex(-V1))
L <- list(v1=V1, v2=V2, v3=V3, v4=V4)
str(L)
## List of 4
## $ v1: int [1:3] 1 2 3
## $ v2: chr [1:3] "a" "b" "c"
## $ v3: num [1:3] 1 1.41 1.73
## $ v4: cplx [1:3] 0+1i 0+1.41i 0+1.73i
L$v3[2]
## [1] 1.414214
L[[3]][2]
## [1] 1.414214
Our `list` object contains four vectors of different types but of the same length; conversion to a `data.frame` is convenient. Note that by default strings are converted into factors:
DF <- as.data.frame(L)
str(DF)
## 'data.frame': 3 obs. of 4 variables:
## $ v1: int 1 2 3
## $ v2: Factor w/ 3 levels "a","b","c": 1 2 3
## $ v3: num 1 1.41 1.73
## $ v4: cplx 0+1i 0+1.41i 0+1.73i
DF <- as.data.frame(L, stringsAsFactors=FALSE)
str(DF)
## 'data.frame': 3 obs. of 4 variables:
## $ v1: int 1 2 3
## $ v2: chr "a" "b" "c"
## $ v3: num 1 1.41 1.73
## $ v4: cplx 0+1i 0+1.41i 0+1.73i
We can also provoke an error in conversion from a valid `list` made up of vectors of different length to a `data.frame`:
V2a <- letters[1:4]
V4a <- factor(V2a)
La <- list(v1=V1, v2=V2a, v3=V3, v4=V4a)
DFa <- try(as.data.frame(La, stringsAsFactors=FALSE), silent=TRUE)
message(DFa)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
## arguments imply differing number of rows: 3, 4
We can access `data.frame` elements as `list` elements, where the `$` is effectively the same as `[[]]` with the list component name as a string:
DF$v3[2]
## [1] 1.414214
DF[[3]][2]
## [1] 1.414214
DF[["v3"]][2]
## [1] 1.414214
Since a `data.frame` is a rectangular object with named columns and equal numbers of rows, it can also be indexed like a matrix, where the rows are the first index and the columns (variables) the second:
DF[2, 3]
## [1] 1.414214
DF[2, "v3"]
## [1] 1.414214
str(DF[2, 3])
## num 1.41
str(DF[2, 3, drop=FALSE])
## 'data.frame': 1 obs. of 1 variable:
## $ v3: num 1.41
If we coerce a `data.frame` containing a character vector or factor into a matrix, we get a character matrix; if we extract an integer and a numeric column, we get a numeric matrix.
as.matrix(DF)
## v1 v2 v3 v4
## [1,] "1" "a" "1.000000" "0+1.000000i"
## [2,] "2" "b" "1.414214" "0+1.414214i"
## [3,] "3" "c" "1.732051" "0+1.732051i"
as.matrix(DF[,c(1,3)])
## v1 v3
## [1,] 1 1.000000
## [2,] 2 1.414214
## [3,] 3 1.732051
The fact that `data.frame` objects descend from `list` objects is shown by looking at their lengths; the length of a matrix is not its number of columns, but its element count:
length(L)
## [1] 4
length(DF)
## [1] 4
length(as.matrix(DF))
## [1] 12
There are `dim` methods for `data.frame` objects and matrices (and arrays with more than two dimensions); matrices and arrays are seen as vectors with dimensions; `list` objects have no dimensions:
dim(L)
## NULL
dim(DF)
## [1] 3 4
dim(as.matrix(DF))
## [1] 3 4
str(as.matrix(DF))
## chr [1:3, 1:4] "1" "2" "3" "a" "b" "c" "1.000000" "1.414214" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "v1" "v2" "v3" "v4"
`data.frame` objects have `names` and `row.names`; matrices have `dimnames`, `colnames` and `rownames`; all can be used for setting new values:
row.names(DF)
## [1] "1" "2" "3"
names(DF)
## [1] "v1" "v2" "v3" "v4"
names(DF) <- LETTERS[1:4]
names(DF)
## [1] "A" "B" "C" "D"
str(dimnames(as.matrix(DF)))
## List of 2
## $ : NULL
## $ : chr [1:4] "A" "B" "C" "D"
R objects have attributes that are not normally displayed, but which show their structure and class (if any); we can see that `data.frame` objects are quite different internally from matrices:
str(attributes(DF))
## List of 3
## $ names : chr [1:4] "A" "B" "C" "D"
## $ class : chr "data.frame"
## $ row.names: int [1:3] 1 2 3
str(attributes(as.matrix(DF)))
## List of 2
## $ dim : int [1:2] 3 4
## $ dimnames:List of 2
## ..$ : NULL
## ..$ : chr [1:4] "A" "B" "C" "D"
If the reason for different vector lengths was that one or more observations are missing on that variable, `NA` should be used; the lengths are then equal, and a rectangular table can be created:
V1a <- c(V1, NA)
V3a <- sqrt(V1a)
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a)
DFa <- as.data.frame(La, stringsAsFactors=FALSE)
str(DFa)
## 'data.frame': 4 obs. of 4 variables:
## $ v1: int 1 2 3 NA
## $ v2: chr "a" "b" "c" "d"
## $ v3: num 1 1.41 1.73 NA
## $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
Sometimes character values are just that, not categorical values to be used in handling data
Factors are meant to be used for categories, and are stored as an integer vector with values pointing to places in a character vector of levels stored as an attribute of the object
Character data are read into R by default as factors, because that is the most usual scenario
Having a pre-defined set of indices to level values is very useful for visualization and analysis
Ordered factors can be used for ordinal data
We can retrieve the input character vector by indexing the levels:
gen <- c("female", "male", NA)
fgen <- factor(gen)
str(fgen)
## Factor w/ 2 levels "female","male": 1 2 NA
nlevels(fgen)
## [1] 2
levels(fgen)
## [1] "female" "male"
as.integer(fgen)
## [1] 1 2 NA
levels(fgen)[as.integer(fgen)]
## [1] "female" "male" NA
Ordered factors do not sort the levels alphabetically:
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- ordered(status, levels=c("Lo", "Med", "Hi"))
ordered.status
## [1] Lo Hi Med Med Hi
## Levels: Lo < Med < Hi
str(ordered.status)
## Ord.factor w/ 3 levels "Lo"<"Med"<"Hi": 1 3 2 2 3
table(status)
## status
## Hi Lo Med
## 2 1 2
table(ordered.status)
## ordered.status
## Lo Med Hi
## 1 2 2
So far, we’ve only met ASCII 7-bit characters, but in many situations, we need more. The default encoding will depend on the locale in which your R session is running - this is my locale:
strsplit(Sys.getlocale(), ";")
## [[1]]
## [1] "LC_CTYPE=en_GB.UTF-8" "LC_NUMERIC=C"
## [3] "LC_TIME=en_GB.UTF-8" "LC_COLLATE=en_GB.UTF-8"
## [5] "LC_MONETARY=en_GB.UTF-8" "LC_MESSAGES=en_GB.UTF-8"
## [7] "LC_PAPER=en_GB.UTF-8" "LC_NAME=C"
## [9] "LC_ADDRESS=C" "LC_TELEPHONE=C"
## [11] "LC_MEASUREMENT=en_GB.UTF-8" "LC_IDENTIFICATION=C"
In UTF-8, non-ASCII characters are encoded by an 8th bit flag, and a second byte with the value; in codepage and ISO 8-bit character sets, the 8th bit is part of the character, but differs from set to set:
V5 <- c("æ", "Æ", "ø", "å")
sapply(V5, charToRaw)
## æ Æ ø å
## [1,] c3 c3 c3 c3
## [2,] a6 86 b8 a5
V6 <- iconv(V5, to="CP1252")
sapply(V6, charToRaw)
## æ Æ ø å
## e6 c6 f8 e5
Encodings do not affect representation within the R workspace, but are a real problem for reading and writing data:
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a, v5=V5, v6=V6)
DFa <- as.data.frame(La)
str(DFa)
## 'data.frame': 4 obs. of 6 variables:
## $ v1: int 1 2 3 NA
## $ v2: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
## $ v3: num 1 1.41 1.73 NA
## $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
## $ v5: Factor w/ 4 levels "å","æ","Æ","ø": 2 3 4 1
## $ v6: Factor w/ 4 levels "å","æ","Æ","ø": 2 3 4 1
An aside before we proceed: handling temporal data is confusing. Time is multifaceted, where two of the variants are instantaneous time with data at that time point and interval time with data aggregated over the interval:
now <- Sys.time()
now
## [1] "2019-10-22 11:35:20 CEST"
class(now)
## [1] "POSIXct" "POSIXt"
as.Date(now)
## [1] "2019-10-22"
unclass(now)
## [1] 1571736920
One representation is in seconds since the epoch (with decimal parts of a second), another is in components also including important time zone information (time zone listings are updated regularly):
str(unclass(as.POSIXlt(now)))
## List of 11
## $ sec : num 20.2
## $ min : int 35
## $ hour : int 11
## $ mday : int 22
## $ mon : int 9
## $ year : int 119
## $ wday : int 2
## $ yday : int 294
## $ isdst : int 1
## $ zone : chr "CEST"
## $ gmtoff: int 7200
## - attr(*, "tzone")= chr [1:3] "" "CET" "CEST"
In the social sciences, we are more likely to need annual or monthly representations, but it is useful to be aware that a year can mean status at year end, or an aggregated value accumulated during an interval.
suppressMessages(library(zoo))
as.yearmon(now)
## [1] "Oct 2019"
as.yearqtr(now)
## [1] "2019 Q4"
as.Date("2016-03-01") - 1 # day
## [1] "2016-02-29"
as.Date("2018-03-01") - 1 # day
## [1] "2018-02-28"
seq(as.Date(now), as.Date(now)+12, length.out=4)
## [1] "2019-10-22" "2019-10-26" "2019-10-30" "2019-11-03"
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 30 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R361-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R361-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] zoo_1.8-6 wordcloud_2.6 RColorBrewer_1.1-2
## [4] MASS_7.3-51.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 lattice_0.20-38 digest_0.6.21 grid_3.6.1
## [5] magrittr_1.5 evaluate_0.14 rlang_0.4.0 stringi_1.4.3
## [9] rmarkdown_1.16 tools_3.6.1 stringr_1.4.0 xfun_0.10
## [13] yaml_2.2.0 compiler_3.6.1 htmltools_0.4.0 knitr_1.25
Abelson, Harold, and Gerald Jay Sussman. 1996. Structure and Interpretation of Computer Programs. Boston, MA: MIT Press.
Bååth, Rasmus. 2012. “The State of Naming Conventions in R.” The R Journal 4 (2): 74–75. https://journal.r-project.org/archive/2012/RJ-2012-018/index.html.
Becker, R.A., and J.M. Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. Pacific Grove, CA, USA: Wadsworth & Brooks/Cole.
———. 1985. Extending the S System. Pacific Grove, CA, USA: Wadsworth & Brooks/Cole.
Becker, Richard A., John M. Chambers, and Allan R. Wilks. 1988. The New S Language. London: Chapman & Hall.
Chambers, John M. 1998. Programming with Data. New York: Springer.
———. 2016. Extending R. Boca Raton: Chapman & Hall.
Chambers, John M., and Trevor J. Hastie. 1992. Statistical Models in S. London: Chapman & Hall.
Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3): 299–314. https://doi.org/10.1080/10618600.1996.10474713.
Tierney, Luke. 1990. LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. New York: Wiley.
———. 1996. “Recent Developments and Future Directions in Lisp-Stat.” Journal of Computational and Graphical Statistics 5 (3): 250–62.
———. 2005. “Some Notes on the Past and Future of Lisp-Stat.” Journal of Statistical Software, Articles 13 (9): 1–15. https://doi.org/10.18637/jss.v013.i09.
Venables, William N., and Brian D. Ripley. 2000. S Programming. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS3/Sprog/.
Wickham, Hadley. 2014. Advanced R. Boca Raton, FL: Chapman & Hall. http://adv-r.had.co.nz/.