All the material presented here, to the extent it is original, is available under CC-BY-SA.
I am running R 3.6.1, with recent `update.packages()`.
needed <- c("zoo", "ggraph", "tidygraph", "forcats", "stringr", "dplyr", "purrr", "readr", "tidyr", "tibble", "ggplot2", "tidyverse", "wordcloud", "RColorBrewer", "magrittr", "igraph", "miniCRAN", "XML", "MASS", "BiocManager")
Script and data at https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_mon.zip. Download to a suitable location, unzip, and use as the basis for the exercises.
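The setup can also be done from R itself; a minimal sketch, assuming the working directory is where you want the files (the `exdir` name is an assumption):
download.file("https://github.com/rsbivand/ban421_h19/raw/master/ban421_h19_mon.zip", destfile = "ban421_h19_mon.zip") # fetch the course archive
unzip("ban421_h19_mon.zip", exdir = "ban421_h19") # unpack into a subdirectory
inst <- needed %in% rownames(installed.packages()) # which needed packages are already present?
if (any(!inst)) install.packages(needed[!inst]) # install only the missing ones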
Getting to know R: today we will start fairly thoroughly - probably a challenging beginning
Not speed-dating; starting ab ovo or rather ab ova: R has multiple mutations and is better seen as an ecosystem than as an inherently purposed system
Once we have the antecedents, we can see how they may affect data structures and their uses in R
Further, we’ll be able to make informed choices with respect to use of data structures
This course is made up of integrated classes and lab sessions, with the tasks to be carried out tightly linked to the classes
It is important to participate in class and in the lab, both with the instructor and with other participants
If you get stuck, ask someone; everyone gets stuck, it isn’t embarrassing. When someone else gets stuck, try to help; learning R is not like having your hair done
R is about reproducible research; we learn by doing, and by building on things others have done
We can only benefit from things others have done if they are available and if we can show that we get the same results - we can reproduce their work
Scripts (recipes) are the basis for this, and can be extended to literate programming by writing text explaining the steps taken
The threshold to learning enough markdown to write documents showing what has been done is not high
Time | Topic |
---|---|
Monday 4/11 | |
08.15-10.00 | History of R and its data structures |
10.15-11.00 | Class exercises |
12.15-14.00 | Basic data structures: vectors, list, data.frame, matrix, array |
14.15-15.00 | Class exercises |
Tuesday 5/11 | |
08.15-10.00 | Basic data structures: factors, time, encoding |
10.15-11.00 | Class exercises |
12.15-14.00 | Class systems and method dispatch; formulae, non-standard evaluation and combining functions |
14.15-15.00 | Class exercises |
Wednesday 6/11 | |
08.15-10.00 | Basic input/output into/from data structures |
10.15-11.00 | Class exercises |
12.15-14.00 | Comparing alternatives |
14.15-15.00 | Class exercises |
Thursday 7/11 | |
09.00-16.00 | Group work day |
Friday 8/11 | |
08.00-10.00 | Presentations |
10.15-11.45 | Presentations |
12.15-15.00 | Presentations |
The underlying aim: to survey contemporary approaches to data structures and their handling in context
Why in context? Because without the context, some alternatives may seem to be closed off by the presentation narrative
Your (group) projects are key part of the seminar
Suggested topics include comparisons of different representations, both legacy (`data.frame`) and modern (`data.table`, `tibble`, …), their input/output and handling methods
Similar topics may be gleaned from R-bloggers and its Twitter feed; some of the claims deserve to be checked
The aim is not to find winners, but to explore alternatives
Thursday work-day, Friday presentation day, hand in via WiseFlow by 14.00, 23 November.
Needs for learning resources, and ways of making use of them, vary greatly between participants
There are lots of books, but many now present one-size-fits-all solutions that may not be a best fit
RStudio also provides an online learning page with a number of options; it no longer points to DataCamp, but still points to swirl
R is distributed from mirrors of the comprehensive R archive network (CRAN)
The cloud mirror is the easiest, but a local server may be faster
RStudio can be downloaded and installed after R has been installed
R comes with many contributed packages - the ones we need are on CRAN, which lists them providing information; we’ll get back to contributed packages later
R is as small or large as you like, and runs in many different configurations (no smartphones); the core is written in C
The language has developed from S written at Bell Labs NJ, where Unix, C, C++, and scripting technologies were created in the 1970s and 1980s
Bell Labs statisticians had a strong focus on graphics and exploratory data analysis from the beginning
Many underlying abstractions were established by 1988 and 1992; we’ll get to the `data.frame` and `formula` abstractions later
An R session records its history - all that is entered at the console prompt - and a workspace containing objects
On exiting a session, the history may be saved to a history file, and the workspace may be saved to an RData file; history and chosen objects (or all objects) may be saved manually before exit
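Done manually, that might look like the following sketch (file names are assumptions; `savehistory` only works in interactive sessions):
x <- rnorm(10) # an example object in the workspace
savehistory("ban421.Rhistory") # save the console history so far
save(x, file = "x.RData") # save a chosen object
save.image("workspace.RData") # save all objects in the workspace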
The workspace is in the memory of the computer, and R itself expects there to be enough memory for all of the data, intermediate and final results
Modern R is 64-bit, so limits are most often set by the computer hardware; use can be made of multiple cores to compute in parallel
In the RStudio Interactive Development Environment (IDE), it is convenient to use R Markdown to write notebooks (annotated scripts)
Chunks of code are run in sequence and may be echoed in the output
Output is shown in its right place, including graphics output
The document may also be converted to a script, mirroring the weave/tangle - knit/purl duality
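For example, `knitr::purl()` tangles the code chunks out of an R Markdown document (assuming knitr is installed; the file name is hypothetical):
knitr::purl("ban421_notes.Rmd") # writes ban421_notes.R, a script containing only the code chunks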
This presentation is written in Markdown, as we’ll see …
In RStudio, the Help tab in the lower right pane (default position) gives access to the R manuals and to the installed packages help pages through the Packages link under Reference
In R itself, help pages are available in HTML (browser) and text form; `help.start()` uses the default browser to display the Manuals, Reference and Miscellaneous Material sections in RStudio’s home help tab
The search engine can be used to locate help pages, but is not great if many packages are installed, as no indices are stored
The help system needs to be learned in order to provide the user with ways of progressing without wasting too much time
The base help system does not tell you how to use R as a system, about packages not installed on your machine, or about R as a community
It does provide information about functions, methods and (some) classes in base R and in contributed packages installed on your machine
We’ll cover these first, then go on to look at vignettes, R Journal, task views, online help pages, and the blog aggregator
There are different requirements with regard to help systems - in R, the help pages of base R are expected to be accurate although terse
Each help page provides a short description of the functions, methods or classes it covers; some pages cover more than one such
Help pages are grouped by package, so that the browser-based system is not easy to browse if you do not know which package a function belongs to
The usage of the function is shown explicitly, including any defaults for arguments to functions or methods
Each argument is described, showing names and types; in addition details of the description are given, together with the value returned
Rather than starting from the packages hierarchy of help pages, users most often use the `help` function
The function takes the name of the function about which we need help; the name may be in quotation marks, and class names contain a hyphen and must be quoted
Instead of using, say, `help(help)`, we can shorten to the question mark operator: `?help`
Occasionally, several packages offer different functions with the same name, and we may be offered a choice; we can disambiguate by putting the package name and two colons before the function name
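For example, both the stats and dplyr packages provide a function called `filter`:
?filter # may offer a choice if a package masking stats::filter is attached
?dplyr::filter # unambiguously dplyr's row-subsetting filter
?stats::filter # unambiguously the time-series filter in stats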
In the usage section, function arguments are shown by name and order; the `args` function returns this information
In general, if arguments are given by name, the order is arbitrary, but if names are not used at least sometimes, order matters
Some arguments do not have default values and are probably required, although some are guessed if missing
Being explicit about the names of arguments and the values they take is helpful in scripting and reproducible research
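A small illustration with `rnorm`:
args(rnorm) # function (n, mean = 0, sd = 1)
rnorm(3, 100, 15) # positional matching: order matters
rnorm(sd = 15, n = 3, mean = 100) # named matching: order is arbitrary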
The ellipsis `...` indicates that the function itself examines the objects passed to see what to do
The regular R console does not provide tooltips, that is, a bubble first offering alternative function or object names as you type, then lists of argument names
RStudio, like many IDEs, does provide this, controlled by Tools -> Global options -> Code -> Completion (by default it is operative)
This may be helpful or not, depending on your style of working; if you find it helpful, fine, if not, you can make it less invasive under Global options
Other IDEs have also provided this facility, which builds directly on the usage sections of help pages of functions in installed packages
Base R has a set of checks and tests that ensure coherence between the code itself and the usage sections in help pages
These mechanisms are used in checking contributed packages before they are released through the archive network; the description of arguments on help pages must match the function definition
It is also possible to generate help pages documenting functions automatically, for example using the roxygen2 package
It is important to know that we can rely on this coherence
The objects returned by functions are also documented on help pages, but the coherence of the description with reality is harder to check
This means that use of `str` or other functions or methods may be helpful when we want to look inside the returned object
The form taken by returned values will often also vary, depending on the arguments given
Most help pages address this issue not by writing more about the returned values, but by using the examples section to highlight points of potential importance for the user
Reading the examples section on the help page is often enlightening, but we do not need to copy and paste
The `example` function runs those parts of the code in the examples section of a function's help page that are not tagged "don't run"; this can be overridden, but may involve meeting conditions not met on your machine
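For instance, to run the examples from the help page of `mean`:
example(mean) # runs the examples section of ?mean, echoing the code as it goes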
This code is run nightly on CRAN servers on multiple operating systems and using released, patched and development versions of R, so checking both packages and the three versions of R
Some examples use data given verbatim, but many use built-in data sets; most packages also provide data sets to use for running examples
This means that the examples and the built-in data sets are a most significant resource for learning how to solve problems with R
Very often, one recognizes classic textbook data sets from the history of applied statistics; contemporary textbook authors often publish collections of data sets as packages on CRAN
The built-in data sets also have help pages, describing their representation as R objects, and their licence and copyright status
These help pages also often include an examples section showing some of the analyses that may be carried out using them
One approach that typically works well when you have a data set of your own, but are unsure how to proceed, is to find a built-in data set that resembles the real one, and play with that first
The built-in data sets are often quite small, and if linked to text books, they are well described there as well as in the help pages
By definition, the built-in data sets do not have to be imported into R, as they are almost always stored as files of R objects
In some cases, these data sets are stored in external file formats, most often to show how to read those formats
The built-in data sets in the base datasets package are in the search path, but data sets in other packages should be loaded using the `data()` function:
str(Titanic)
## 'table' num [1:4, 1:2, 1:2, 1:2] 0 0 35 0 0 0 17 0 118 154 ...
## - attr(*, "dimnames")=List of 4
## ..$ Class : chr [1:4] "1st" "2nd" "3rd" "Crew"
## ..$ Sex : chr [1:2] "Male" "Female"
## ..$ Age : chr [1:2] "Child" "Adult"
## ..$ Survived: chr [1:2] "No" "Yes"
library(MASS)
data(deaths)
str(deaths)
## Time-Series [1:72] from 1974 to 1980: 3035 2552 2704 2554 2014 ...
At about the time that literate programming arrived in R with `Sweave` and `Stangle` (we mostly use knitr now), the idea arose of supplementing package documentation with example workflows
Vignettes are PDF documents with accompanying runnable R code that describe how to carry out particular sequences of operations
The package index page in the RStudio help tab shows user guides, package vignettes and other documentation
The `vignette()` function can be used to list vignettes by installed package, and to open the chosen vignette in a PDF reader
A very typical way of using vignettes on a machine with enough screen space is to read the document and run the code from the R file at the same time
Assign the output of `vignette` to an object; the `print` method shows the PDF or HTML, and the `edit` method gives direct access to the underlying code for copy and paste
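A sketch, assuming the grid package's introductory vignette is installed (it ships with base R):
v <- vignette("grid", package = "grid")
v # the print method displays the formatted document
edit(v) # gives direct access to the underlying R code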
The help system in RStudio provides equivalent access to vignette documents and code
Papers about R contributed packages published in the Journal of Statistical Software and the R Journal are often constructed in this way too
As R has developed, the number of packages on CRAN has grown (other packages are on BioConductor and github)
CRAN task views were introduced to try to provide some subject area guidance
They remain terse, and struggle to keep up, but are still worth reviewing
Note that those working in different subject areas often see things rather differently, leading to subject specific treatment of intrinsically similar themes
The help system and vignettes were designed to be used offline, so that the versions of R and installed packages matched the documentation
If you search online for information about functions in R or in contributed packages, you often reach inside-R, sponsored by Revolution Analytics
Help pages may also be viewed online from your chosen CRAN mirror; package pages provide these (Reference manual) and vignettes as links
Remember to check that the versions of your installed software and the online documentation are the same
The R community has become a number of linked communities rather than a coherent and hierarchical whole
As in many open source projects, the R project is more bazaar than cathedral; think of niches in ecosystems with differing local optima in contrast to a master plan
One style is based on mailing lists, in which an issue raised by an original poster is resolved later in that thread
Another style is to use online fora, such as StackOverflow, which you need to visit rather than receiving messages in your inbox
There are now many blogs involving the use of R, fortunately aggregated at R-bloggers, where other resources may also be found
New aggregated blog topics are linked to a Twitter account, so if you want, you too can be bombarded by notifications
These are also a potential source of project ideas, especially because some claims should be challenged
R Users Groups and R Ladies provide face-to-face meeting places that many value
R started as a teaching tool for applied statistics, but this community model has been complemented by others
R is now widely used in business, public administration and voluntary organizations for data analysis and visualization
The R Consortium was created in 2015 as a vehicle for companies with relationships to R
R itself remains under the control of the R Foundation, which is still mostly academic in flavour
Rasmus Bååth has a useful blog piece on R’s antecedents in the S language
Something similar is present in the second chapter of (Chambers 2016), from the viewpoint of one of those responsible for the development of the S language
In addition to S, we need to take SICP and Scheme into account (Abelson and Sussman 1996), as described by (Ihaka and Gentleman 1996) and (Wickham 2014)
Finally, LispStat and its creators have played and continue to play a major role in developing R (Tierney 1990, 1996, 2005)
S: An Interactive Environment for Data Analysis and Graphics, A.K.A. the Brown Book (Becker and Chambers 1984); Extending the S System (Becker and Chambers 1985)
Brown Books
The New S Language: A Programming Environment for Data Analysis and Graphics, A.K.A. the Blue Book (Becker, Chambers, and Wilks 1988); Statistical Models in S, A.K.A. the White Book (Chambers and Hastie 1992)
Blue and White Books
Programming with Data: A Guide to the S Language, A.K.A. the Green Book (Chambers 1998); S Programming (Venables and Ripley 2000)
Green Book
The S2 system was described in the Brown Book, S3 in the Blue Book and completed in the White Book, finally S4 in the Green Book
The big advance from S2 to S3 was that users could write functions; that data.frame objects were defined; that formula objects were defined; and that S3 classes and method dispatch appeared
S4 brought connections and formal S4 classes, the latter seen in R in the methods package (still controversial)
S-PLUS was/is the commercial implementation of S and its releases drove S3 and S4 changes
S was a Bell Labs innovation, like Unix, C, C++, and many interpreted languages (like AWK); many of these share key understandings
Now owned by Nokia, previously Alcatel-Lucent, Lucent, and AT&T
Why would a telecoms major (AT&T) pay for fundamental research in computer science and data analysis (not to sell or market other products better)?
Some Green Book examples are for quality control of telecoms components
S-PLUS was quickly adopted for teaching and research, and with S3, provided extensibility in the form of libraries
Most links have died by now, but see this FAQ for a flavour - there was a lively community of applied statisticians during the 1990s
S built on a long tradition of documentation through examples, with use cases and data sets taken from the applied statistical literature; this let users compare output with methods descriptions
… so we get to R
Luke Tierney was in R core in 1997, and has continued to exert clear influence over development
Because R uses a Scheme engine, similar to Lisp, under the hood, his insight into issues like the garbage collector, namespaces, byte-compilation, serialization, parallelization, and now ALTREP has been crucial (see also the proposal by Luke Tierney, Gabe Becker and Tomas Kalibera)
Many of these issues involve the defensive copy on possible change policy involved in lazy evaluation, which may lead to multiple redundant copies of data being present in memory
Luke Tierney and Brian Ripley have fought hard to let R load fast, something that is crucial to ease the use of R on multicore systems or inside databases
R 3.4.4
> n <- 1e7
> set.seed(1)
> x <- rnorm(n)
> y <- rnorm(n)
> system.time(lm(y ~ x))
user system elapsed
7.007 0.356 7.431
ALTREP R 3.5.1
> n <- 1e7
> set.seed(1)
> x <- rnorm(n)
> y <- rnorm(n)
> system.time(lm(y ~ x))
user system elapsed
1.254 0.433 1.700
Computational processes are abstract "beings" that inhabit computers.
Processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules called a program.
People create programs to direct processes.
Analysing data always involves programming, even if hidden under a point-and-click user interface.
Scripts are simple interpreted programs that are easy to write, read, and improve, and permit analysis to be documented in a responsible way.
A computational process, in a correctly working computer, executes programs precisely and accurately.
Novice programmers must learn to understand and foresee the consequences of their attempts to create and execute programs.
Even small errors in programs can have complex and unanticipated consequences.
Software engineers or programmers learn to organize programs so that they can be reasonably sure that the resulting processes will perform the tasks intended.
They know how to structure programs so that unanticipated problems do not lead to catastrophic consequences, and when problems do arise, they can debug their programs.
The language is more than just a means for instructing a computer to perform tasks, it also serves as a framework within which we organize our ideas about processes.
When we describe a language, we should pay particular attention to the means that the language provides for combining simple ideas to form more complex ideas. Each language has three mechanisms for this:
primitive expressions, which represent the simplest entities the language is concerned with,
means of combination, by which compound elements are built from simpler ones, and
means of abstraction, by which compound elements can be named and manipulated as units.
All of these mechanisms are present in scripting languages, but may only be available to the user in specialised forms.
Think of a pocket calculator. You type an expression `50`, and the calculator responds by displaying the result of its evaluating that expression: `50`.
A number is a primitive expression. Type a number, and your calculator displays the result. Expressions representing numbers can be combined with an expression representing a primitive arithmetic procedure (such as addition, subtraction, multiplication or division) to form a compound expression that represents the application of the procedure to those numbers.
These compound expressions are called combinations. They are built up of an operator and a number of operands. The order in which the operator and the operands are typed depends on the grammar of the language, called its notation, and some means are needed to signal the beginning and end of the combination, for example parentheses ().
Most languages use infix notation, which means that operators and operands are mixed together in the combination, and require care in typing. Most calculators also use infix notation: typing the combination `50+25` yields the result `75`.
The expression `50+25*2` may be ambiguous without rules for nesting combinations; left to right gives `150`, but right to left `100`. Moral: use parentheses to make sure the program does what you want: `(50+25)*2`, or `50+(25*2)`. Infix notation looks like arithmetic, but in programming can be ambiguous.
A critical aspect of a programming language is the means it provides for using names to refer to computational objects. We say that the name identifies a variable whose value is the object. Our calculator often has a key named `PI`, which when typed displays the value `3.1415927`.
So we can type `(2*PI*10)` to calculate the circumference of a circle of radius 10 units: the calculator displays `62.831853`. Naming or defining variables in most scripting languages uses the assignment abstraction, symbolised by the `=` sign. This is not "equals"!
In `res = expression`, it is `res` that is being named as a variable, with its value set to the result of evaluating the expression. Documenting what you did saves tears later.
Associating values with symbols and later retrieving them means that the interpreter must maintain a memory that keeps track of the name-object pairs - the environment.
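In R terms, a minimal sketch of binding a value to a name and retrieving it:
res <- 50 + 25 # bind the value of the expression to the name res
res # the interpreter looks res up in the environment
## [1] 75
ls() # names currently bound in the global environment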
To build compound procedures, we need to remember that combinations can be nested in an expression: `((layer1*10)+(layer2/100))`, or `((x*x)+(y*y))`, where the variables `x` and `y` have been given values.
But this expression `((x*x)+(y*y))` is actually a combination of two identical subexpressions for squaring the specified variable. It would be a good idea to give a name to this compound procedure; for example, on our calculator it could be named `x2`, taking one formal parameter.
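In R, naming a compound procedure means defining a function; a sketch using the hypothetical calculator name `x2`:
x2 <- function(x) x * x # one formal parameter, x
x2(3) + x2(4) # the combination ((x*x)+(y*y)) with x = 3, y = 4
## [1] 25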
Other compound procedures found on our calculator are `SQRT`, `log`, and `sin`. Of course, you could make the value of the formal parameter to these compound procedures an arbitrarily complicated expression.
Compound procedures are used in exactly the same way as primitive procedures.
Because we are dealing with computational processes, not arithmetic, we will find that there are exceptional conditions that can arise when the value of an operand that we have typed is not in the set of feasible values.
The expression `1/0` yields `Inf` (infinity), while `0/0` yields `NaN` (not-a-number), and `log(0)` yields an error.
When does `3/4` yield `0`? In most languages, because the numbers are integers, not floating point numbers. If you want a floating point result, you have to make the operands floating point first: `3.0/4.0` and `FLOAT(3)/FLOAT(4)` yield `0.75`.
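R behaves differently here: `/` always performs floating-point division, and integer division has its own operator:
3 / 4 # floating-point division, even for whole-number operands
## [1] 0.75
3L %/% 4L # explicit integer division
## [1] 0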
Why was `underscore_separated` not a permitted naming convention in R earlier (see (Bååth 2012))? `_` was not a permitted character in names until it had lost its left assign role, the same as `<-`, in 1.9.0 in 2004. (Brown Book p. 256, Blue Book p. 387)
![](../../pix/assign.png)
`stringsAsFactors`
Why is the `factor` storage mode still so central? `stringsAsFactors = TRUE` was the legacy `as.is = FALSE`; analysis of categorical variables was more important, and `factor` only needed to store `nlevels()` strings (White Book pp. 55-56, 567)
![](../../pix/asis1.png)
`drop`
`drop = TRUE` for array-like objects; since matrices are vectors with a `dim` attribute, choosing (part of) a row or column made `dim` redundant (Blue Book p. 128, White Book p. 64)
![](../../pix/drop.png)
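A small sketch of `drop` at work on a matrix:
m <- matrix(1:6, nrow = 2) # a 2-by-3 matrix: a vector with a dim attribute
str(m[1, ]) # default drop = TRUE: dim is dropped, a plain vector remains
## int [1:3] 1 3 5
str(m[1, , drop = FALSE]) # dim is kept: still a 1-by-3 matrix
## int [1, 1:3] 1 3 5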
Treating scalars as vectors is not efficient:
(Screenshot: Ihaka Lecture Series 2017, "Statistical computing in a (more) static environment", YouTube)
An R-0.49 source tarball is available from CRAN
Diffs for Fedora 27 (gcc 7.3.1) include setting compilers and `-fPIC` in `config.site`, putting `./` before `config.site` in `configure`, and three corrections in `src/unix`: in `dataentry.h` add `#include <X11/Xfuncproto.h>` and comment out `NeedFunctionPrototypes`; in `rotated.c` comment out `/*static*/ double round`; in `system.c` comment out `__setfpucw` twice; BLAS must be provided externally
Not (yet) working: prototypes are missing in the eda and mva packages so the shared objects fail to build
The command `svn log --xml --verbose -r 6:77269 https://svn.r-project.org/R/trunk > trunk_verbose_log_new1.xml` provides a rich data source
Each log entry has a revision number, author and timestamp, message and paths to files indicating the action undertaken for each file
The XML version is somewhat easier to untangle than the plain-text version
I haven’t tried similar approaches with Winston Chang’s r-source repo on GitHub
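A minimal sketch of untangling the XML log with the XML package, assuming the log file is in the working directory:
library(XML)
doc <- xmlParse("trunk_verbose_log_new1.xml") # parse the whole log
entries <- getNodeSet(doc, "//logentry") # one node per commit
authors <- sapply(entries, function(e) xmlValue(e[["author"]]))
head(sort(table(authors), decreasing = TRUE)) # the most active committers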
## <logentry revision="6">
## <author>ihaka</author>
## <date>1997-09-18T04:41:25.000000Z</date>
## <paths>
## <path action="M" prop-mods="false" text-mods="true" kind="file">/trunk/src/library/base/R/lm</path>
## </paths>
## <msg>New predict.lm from Peter Dalgaard</msg>
## </logentry>
## <logentry revision="77268">
## <author>ripley</author>
## <date>2019-10-09T08:47:26.341664Z</date>
## <paths>
## <path text-mods="true" kind="file" action="M" prop-mods="false">/trunk/doc/manual/R-exts.texi</path>
## </paths>
## <msg>add note on need to declare Python etc</msg>
## </logentry>
## 2015 68948 2150 use https
## 2012 59039 1727 use preferred form of 'R Core Team'
## 2011 56186 1260 Revert r56184 and r56185
## 2011 56184 1249 Remove redundant \alias entries from man pages
## 2007 42333 1223 add copyright/licence header, remove CVS-style $Id fields
## 2012 61433 620 remove trailing spaces
## 2012 60146 602 add copyright statements
## 2007 42338 559 add licence statements
## 2012 59780 524 update, including bug-reporting address
## 2003 27444 497 splitting base
##
## tools FAQ m4 etc configure.ac
## 409 467 626 637 747
## share configure BUGS date-stamp po
## 1118 1398 1485 2531 3762
## NEWS doc tests src
## 5842 12243 13132 99998
##
## windows graphics gnome macintosh appl unix nmath
## 74 79 217 611 820 1545 1771
## scripts extra modules include gnuwin32 main library
## 1992 2367 2394 3621 9199 15505 59682
##
## compiler profile stats4 translations nls
## 248 249 261 296 319
## datasets modreg mva splines ctest
## 340 402 462 465 549
## tcltk ts parallel grid graphics
## 872 900 1195 2046 2213
## grDevices methods utils stats tools
## 3849 4066 5656 7256 7542
## base
## 19720
##
## Makefile DESCRIPTION.in makebasedb.R Makefile.win baseloader.R
## 8 10 11 18 32
## demo data Makefile.in inst po
## 62 66 75 270 499
## R man
## 6884 11778
![](ban421_h18_mon_files/figure-beamer/fig3-1.pdf)
Once S3 permitted extension by writing functions, and packaging functions in libraries, S and R ceased to be monolithic
In R, a library is where packages are kept, distinguishing between base and recommended packages distributed with R, and contributed packages
Contributed packages can be installed from CRAN (infrastructure built on CPAN and CTAN for Perl and Tex), Bioconductor, other package repositories, and other sources such as github
With over 12000 contributed packages, CRAN is central to the R community, but is stressed by dependency issues (CRAN is not run by R core)
Andrie de Vries’ post *Finding clusters of CRAN packages using igraph* looked at CRAN package clusters from a page rank graph
We are over three years further on now, so updating may be informative
However, this is only CRAN, and there is the big Bioconductor repository to consider too
Adding in the Bioconductor (S4, curated) repo does alter the optics, as you’ll see, over and above the cluster dominated by Rcpp
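A sketch of the underlying computation with miniCRAN and igraph (the repository URL is an assumption, and building the full dependency graph takes a while):
library(miniCRAN)
library(igraph)
pdb <- available.packages(repos = "https://cloud.r-project.org") # CRAN package database
g <- makeDepGraph(rownames(pdb), availPkgs = pdb, suggests = FALSE) # igraph dependency graph
pr <- page.rank(g) # page rank centrality of each package
head(sort(pr$vector, decreasing = TRUE), 5) # the most central packages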
## Rcpp ggplot2 MASS AnnotationDbi dplyr
## 0.023611 0.012835 0.011114 0.009177 0.008589
## Matrix stringr magrittr data.table mvtnorm
## 0.006606 0.005141 0.005132 0.004810 0.004725
## plyr survival jsonlite RcppArmadillo Biobase
## 0.004700 0.004672 0.004445 0.004274 0.004183
## httr igraph tibble foreach shiny
## 0.004035 0.003822 0.003565 0.003505 0.003470
## MASS mvtnorm survival igraph foreach lattice
## 0.011113669 0.004724803 0.004671883 0.003822415 0.003504848 0.003273646
## doParallel zoo coda glmnet nlme R6
## 0.002545046 0.001938126 0.001881350 0.001722928 0.001588220 0.001582231
## ggplot2 dplyr stringr magrittr data.table plyr
## 0.012835220 0.008588749 0.005140872 0.005132022 0.004809706 0.004699903
## jsonlite httr tibble shiny reshape2 tidyr
## 0.004444708 0.004035063 0.003565334 0.003469870 0.003402997 0.003310935
## Biobase GenomicRanges BiocGenerics
## 0.0041831518 0.0024108003 0.0021567369
## SummarizedExperiment limma GenomeInfoDb
## 0.0015930622 0.0013923746 0.0012813877
## graph BiocParallel Rsamtools
## 0.0009660412 0.0009172630 0.0008740876
## affy rtracklayer edgeR
## 0.0007696580 0.0007413752 0.0005661088
## Rcpp Matrix RcppArmadillo RcppEigen BH
## 0.0236111055 0.0066059623 0.0042738209 0.0013315944 0.0012213641
## rstan RcppParallel bigmemory RcppProgress StanHeaders
## 0.0006470612 0.0005426527 0.0003310505 0.0003149103 0.0002288213
## nlmixr rstantools
## 0.0001900809 0.0001566603
## sp raster rgdal sf rgeos
## 0.0029299842 0.0018844473 0.0010844239 0.0007735674 0.0007501712
## spatstat png maptools maps leaflet
## 0.0006813641 0.0006595522 0.0006108745 0.0004304173 0.0003218325
## geosphere ncdf4
## 0.0002870568 0.0002709302
## AnnotationDbi org.Hs.eg.db
## 9.176603e-03 1.068938e-03
## GenomicFeatures org.Mm.eg.db
## 9.622581e-04 5.691116e-04
## org.Rn.eg.db ChIPQC
## 3.689497e-04 1.231048e-04
## rCGH chimera
## 1.060978e-04 1.019420e-04
## TxDb.Hsapiens.UCSC.hg19.knownGene org.Dm.eg.db
## 7.966464e-05 7.002200e-05
## Mus.musculus Rattus.norvegicus
## 6.394154e-05 6.137107e-05
## Name Package
## 1 Bioconductor Package 42
## 2 Martin Morgan 39
## 3 Wolfgang Huber 34
## 4 Marc Carlson 29
## 5 Herve Pages 24
## 6 Aaron Lun 15
## 7 Levi Waldron 14
## 8 Marcel Ramos 13
## 9 Davide Risso 11
## 10 Gordon Smyth 10
## 11 Mike Smith 10
## 12 Joern Toedling 10
## Name Package
## 1 Rafael A. Irizarry 29
## 2 Kasper Daniel Hansen 21
## 3 Matthew N. McCall 20
## 4 Hector Corrada Bravo 17
## 5 Tim Triche 14
## 6 Andrew E. Jaffe 12
## 7 John D. Storey 12
## 8 Jeffrey T. Leek 11
## 9 Leonardo Collado-Torres 9
## 10 Jean-Philippe Fortin 9
## 11 D. 9
## 12 Rafael Irizarry 8
## Name Package
## 1 Hana Sevcikova 14
## 2 Adrian Raftery 12
## 3 Thomas Brendan Murphy 9
## 4 Chris Fraley 9
## 5 University of Washington 8
## 6 Adrian E. Raftery 7
## 7 Luca Scrucca 7
## 8 Xiuwen Zheng 6
## 9 Isobel Claire Gormley 6
## 10 Michael Fop 5
## 11 Ian Painter 5
## 12 Patrick Gerland 5
## Name Package
## 1 org 101
## 2 bioconductor 97
## 3 The Bioconductor Pro 97
## 4 Mark S. Handcock 13
## 5 Martina Morris 11
## 6 Pavel N. Krivitsky 11
## 7 Skye Bender-deMoll 10
## 8 David R. Hunter 8
## 9 Li Wang 7
## 10 Steven M. Goodreau 7
## 11 Carter T. Butts 7
## 12 Kirk Li 6
Many sources in applied statistics with an S-like syntax but Lisp/Scheme-like internals, and sustained tensions between these
Many different opinions on preferred ways of structuring data and data handling, opening for adaptations to different settings
More recently, larger commercial interest in handling large, long input data sets (an interest previously also present); simulations also generate large output data sets; bioinformatics data are both wide and long
Differing views of the world in terms of goals and approaches
Differences provide ecological robustness
R can be a calculator, with output printed by the default method
2+3
## [1] 5
7*8
## [1] 56
3^2
## [1] 9
log(1)
## [1] 0
log10(10)
## [1] 1
We could print explicitly:
print(2+3)
## [1] 5
print(sqrt(2))
## [1] 1.414214
print(sqrt(2), digits=10)
## [1] 1.414213562
print(10^7)
## [1] 1e+07
Exceptions also happen (Inf is infinity, NaN is Not a Number):
log(0)
## [1] -Inf
sqrt(-1)
## Warning in sqrt(-1): NaNs produced
## [1] NaN
1/0
## [1] Inf
0/0
## [1] NaN
We assign results of operations and functions to named objects with `<-`, or equivalently `=`; names begin with letters or a dot:
a <- 2+3
a
## [1] 5
is.finite(a)
## [1] TRUE
a <- log(0)
is.finite(a)
## [1] FALSE
The printed results are prepended by a curious `[1]`; all these results are unit-length vectors. We can combine several objects with `c()`:
a <- c(2, 3)
a
## [1] 2 3
sum(a)
## [1] 5
str(a)
## num [1:2] 2 3
aa <- rep(a, 50)
aa
## [1] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
## [36] 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
## [71] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
The single square brackets `[]` are used to access or set elements of vectors (the colon `:` gives an integer sequence); negative indices drop elements:
length(aa)
## [1] 100
aa[1:10]
## [1] 2 3 2 3 2 3 2 3 2 3
sum(aa)
## [1] 250
sum(aa[1:10])
## [1] 25
sum(aa[-(11:length(aa))])
## [1] 25
Infix syntax is just a representation of the actual underlying forms
a[1] + a[2]
## [1] 5
sum(a)
## [1] 5
`+`(a[1], a[2])
## [1] 5
Reduce(`+`, a)
## [1] 5
We’ve done arithmetic on scalars, we can do vector-scalar arithmetic:
sum(aa)
## [1] 250
sum(aa+2)
## [1] 450
sum(aa)+2
## [1] 252
sum(aa*2)
## [1] 500
sum(aa)*2
## [1] 500
But vector-vector arithmetic poses the question of vector length and recycling (the shorter one gets recycled):
v5 <- 1:5
v2 <- c(5, 10)
v5 * v2
## Warning in v5 * v2: longer object length is not a multiple of shorter
## object length
## [1] 5 20 15 40 25
v2_stretch <- rep(v2, length.out=length(v5))
v2_stretch
## [1] 5 10 5 10 5
v5 * v2_stretch
## [1] 5 20 15 40 25
In working with real data, we often meet missing values, coded by NA meaning Not Available:
anyNA(aa)
## [1] FALSE
is.na(aa) <- 5
aa[1:10]
## [1] 2 3 2 3 NA 3 2 3 2 3
anyNA(aa)
## [1] TRUE
sum(aa)
## [1] NA
sum(aa, na.rm=TRUE)
## [1] 248
We’ve looked at the simple stuff, when arithmetic and assignment happens as expected
A strength of R is the handling of exceptions, which do happen when handling real data, which not infrequently differs from what we thought it was
Wanting a result from data is reasonable when the data meet all the requirements
If the data do not meet the requirements, we may get unexpected results, warnings or even errors: most often we need to go back and check our input data
One way to check our input data is to print in the console - this works with small objects as we’ve seen, but for larger objects we need methods:
big <- 1:(10^5)
length(big)
## [1] 100000
head(big)
## [1] 1 2 3 4 5 6
str(big)
## int [1:100000] 1 2 3 4 5 6 7 8 9 10 ...
summary(big)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 25001 50000 50000 75000 100000
There are `length`, `head`, `str` (structure) and `summary` methods for many types of objects
`str` also gives us a hint of the type of object and its dimensions
We’ve seen a couple of uses of `str` so far: `str(a)` was `num` and `str(big)` was `int`; what does this signify?
They are both numbers, but of different types
There are six basic vector types: list, integer, double, logical, character and complex
The derived type factor (to which we return shortly) is integer with extra information
`str` reports these as int, num, logi, chr and cplx; lists are enumerated recursively
In RStudio you see more or less the `str` output in the Environment pane as Values in the list view; the grid view adds the object size in memory
From early S, we have `typeof` and `storage.mode` (including single precision, not used in R); these are important for interfacing C, C++, Fortran and other languages
Beyond this is `class`, but then the different class systems (S3 and formal S4) complicate things
Objects such as vectors may also have attributes in which their class and other information may be placed
Typically, a lot of use is made of attributes to squirrel away strings and short vectors
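A small sketch of squirreling a string away in an attribute:
v <- 1:3
attr(v, "units") <- "metres" # stash metadata in an attribute
attributes(v)
## $units
## [1] "metres"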
`is` methods are used to test types of objects; note that integers are also seen as numeric:
set.seed(1)
x <- runif(50, 1, 10)
is.numeric(x)
## [1] TRUE
y <- rpois(50, lambda=6)
is.numeric(y)
## [1] TRUE
is.integer(y)
## [1] TRUE
xy <- x < y
is.logical(xy)
## [1] TRUE
`as` methods try to convert between object types and are widely used:
str(as.integer(xy))
## int [1:50] 1 1 0 0 1 0 0 0 1 1 ...
str(as.numeric(y))
## num [1:50] 6 9 5 4 3 3 5 6 7 5 ...
str(as.character(y))
## chr [1:50] "6" "9" "5" "4" "3" "3" "5" "6" "7" "5" "9" "5" "6" "5" ...
str(as.integer(x))
## int [1:50] 3 4 6 9 2 9 9 6 6 1 ...
First, let us see what is behind the `data.frame` object: the `list` object
`list` objects are vectors that contain other objects, which can be addressed by name or by 1-based indices
Like the vectors we have already met, lists can be accessed and manipulated using square brackets []
Single list elements can be accessed and manipulated using double square brackets [[]]
Starting with four vectors of differing types, we can assemble a list object; as we see, its structure is quite simple. The vectors in the list may vary in length, and lists can (and do often) include lists
V1 <- 1:3
V2 <- letters[1:3]
V3 <- sqrt(V1)
V4 <- sqrt(as.complex(-V1))
L <- list(v1=V1, v2=V2, v3=V3, v4=V4)
str(L)
## List of 4
## $ v1: int [1:3] 1 2 3
## $ v2: chr [1:3] "a" "b" "c"
## $ v3: num [1:3] 1 1.41 1.73
## $ v4: cplx [1:3] 0+1i 0+1.41i 0+1.73i
L$v3[2]
## [1] 1.414214
L[[3]][2]
## [1] 1.414214
Our `list` object contains four vectors of different types but of the same length; conversion to a `data.frame` is convenient. Note that by default strings are converted into factors:
DF <- as.data.frame(L)
str(DF)
## 'data.frame': 3 obs. of 4 variables:
## $ v1: int 1 2 3
## $ v2: Factor w/ 3 levels "a","b","c": 1 2 3
## $ v3: num 1 1.41 1.73
## $ v4: cplx 0+1i 0+1.41i 0+1.73i
DF <- as.data.frame(L, stringsAsFactors=FALSE)
str(DF)
## 'data.frame': 3 obs. of 4 variables:
## $ v1: int 1 2 3
## $ v2: chr "a" "b" "c"
## $ v3: num 1 1.41 1.73
## $ v4: cplx 0+1i 0+1.41i 0+1.73i
We can also provoke an error in conversion from a valid `list` made up of vectors of different length to a `data.frame`:
V2a <- letters[1:4]
V4a <- factor(V2a)
La <- list(v1=V1, v2=V2a, v3=V3, v4=V4a)
DFa <- try(as.data.frame(La, stringsAsFactors=FALSE), silent=TRUE)
message(DFa)
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
## arguments imply differing number of rows: 3, 4
We can access `data.frame` elements as `list` elements, where the `$` is effectively the same as `[[]]` with the list component name as a string:
DF$v3[2]
## [1] 1.414214
DF[[3]][2]
## [1] 1.414214
DF[["v3"]][2]
## [1] 1.414214
Since a `data.frame` is a rectangular object with named columns and equal numbers of rows, it can also be indexed like a matrix, where the rows are the first index and the columns (variables) the second:
DF[2, 3]
## [1] 1.414214
DF[2, "v3"]
## [1] 1.414214
str(DF[2, 3])
## num 1.41
str(DF[2, 3, drop=FALSE])
## 'data.frame': 1 obs. of 1 variable:
## $ v3: num 1.41
If we coerce a `data.frame` containing a character vector or factor into a matrix, we get a character matrix; if we extract an integer and a numeric column, we get a numeric matrix.
as.matrix(DF)
## v1 v2 v3 v4
## [1,] "1" "a" "1.000000" "0+1.000000i"
## [2,] "2" "b" "1.414214" "0+1.414214i"
## [3,] "3" "c" "1.732051" "0+1.732051i"
as.matrix(DF[,c(1,3)])
## v1 v3
## [1,] 1 1.000000
## [2,] 2 1.414214
## [3,] 3 1.732051
The fact that `data.frame` objects descend from `list` objects is shown by looking at their lengths; the length of a matrix is not its number of columns, but its element count:
length(L)
## [1] 4
length(DF)
## [1] 4
length(as.matrix(DF))
## [1] 12
There are `dim` methods for `data.frame` objects and matrices (and arrays with more than two dimensions); matrices and arrays are seen as vectors with dimensions; `list` objects have no dimensions:
dim(L)
## NULL
dim(DF)
## [1] 3 4
dim(as.matrix(DF))
## [1] 3 4
str(as.matrix(DF))
## chr [1:3, 1:4] "1" "2" "3" "a" "b" "c" "1.000000" "1.414214" ...
## - attr(*, "dimnames")=List of 2
## ..$ : NULL
## ..$ : chr [1:4] "v1" "v2" "v3" "v4"
`data.frame` objects have `names` and `row.names`; matrices have `dimnames`, `colnames` and `rownames`; all can be used for setting new values:
row.names(DF)
## [1] "1" "2" "3"
names(DF)
## [1] "v1" "v2" "v3" "v4"
names(DF) <- LETTERS[1:4]
names(DF)
## [1] "A" "B" "C" "D"
str(dimnames(as.matrix(DF)))
## List of 2
## $ : NULL
## $ : chr [1:4] "A" "B" "C" "D"
R objects have attributes that are not normally displayed, but which show their structure and class (if any); we can see that `data.frame` objects are quite different internally from matrices:
str(attributes(DF))
## List of 3
## $ names : chr [1:4] "A" "B" "C" "D"
## $ class : chr "data.frame"
## $ row.names: int [1:3] 1 2 3
str(attributes(as.matrix(DF)))
## List of 2
## $ dim : int [1:2] 3 4
## $ dimnames:List of 2
## ..$ : NULL
## ..$ : chr [1:4] "A" "B" "C" "D"
If the reason for different vector lengths was that one or more observations are missing on that variable, `NA` should be used; the lengths are then equal, and a rectangular table can be created:
V1a <- c(V1, NA)
V3a <- sqrt(V1a)
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a)
DFa <- as.data.frame(La, stringsAsFactors=FALSE)
str(DFa)
## 'data.frame': 4 obs. of 4 variables:
## $ v1: int 1 2 3 NA
## $ v2: chr "a" "b" "c" "d"
## $ v3: num 1 1.41 1.73 NA
## $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
Sometimes character values are just that, not categorical values to be used in handling data
Factors are meant to be used for categories, and are stored as an integer vector with values pointing to places in a character vector of levels stored as an attribute of the object
Character data are read into R by default as factors, because that is the most usual scenario
Having a pre-defined set of indices to level values is very useful for visualization and analysis
Ordered factors can be used for ordinal data
We can retrieve the input character vector by indexing the levels:
gen <- c("female", "male", NA)
fgen <- factor(gen)
str(fgen)
## Factor w/ 2 levels "female","male": 1 2 NA
nlevels(fgen)
## [1] 2
levels(fgen)
## [1] "female" "male"
as.integer(fgen)
## [1] 1 2 NA
levels(fgen)[as.integer(fgen)]
## [1] "female" "male" NA
Ordered factors do not sort the levels alphabetically:
status <- c("Lo", "Hi", "Med", "Med", "Hi")
ordered.status <- ordered(status, levels=c("Lo", "Med", "Hi"))
ordered.status
## [1] Lo Hi Med Med Hi
## Levels: Lo < Med < Hi
str(ordered.status)
## Ord.factor w/ 3 levels "Lo"<"Med"<"Hi": 1 3 2 2 3
table(status)
## status
## Hi Lo Med
## 2 1 2
table(ordered.status)
## ordered.status
## Lo Med Hi
## 1 2 2
So far, we’ve only met ASCII 7-bit characters, but in many situations, we need more. The default encoding will depend on the locale in which your R session is running - this is my locale:
strsplit(Sys.getlocale(), ";")
## [[1]]
## [1] "LC_CTYPE=en_GB.UTF-8" "LC_NUMERIC=C"
## [3] "LC_TIME=en_GB.UTF-8" "LC_COLLATE=en_GB.UTF-8"
## [5] "LC_MONETARY=en_GB.UTF-8" "LC_MESSAGES=en_GB.UTF-8"
## [7] "LC_PAPER=en_GB.UTF-8" "LC_NAME=C"
## [9] "LC_ADDRESS=C" "LC_TELEPHONE=C"
## [11] "LC_MEASUREMENT=en_GB.UTF-8" "LC_IDENTIFICATION=C"
In UTF-8, non-ASCII characters are encoded by an 8th bit flag, and a second byte with the value; in codepage and ISO 8-bit character sets, the 8th bit is part of the character, but differs from set to set:
V5 <- c("æ", "Æ", "ø", "å")
sapply(V5, charToRaw)
## æ Æ ø å
## [1,] c3 c3 c3 c3
## [2,] a6 86 b8 a5
V6 <- iconv(V5, to="CP1252")
sapply(V6, charToRaw)
## æ Æ ø å
## e6 c6 f8 e5
Encodings do not affect representation within the R workspace, but are a real problem for reading and writing data:
La <- list(v1=V1a, v2=V2a, v3=V3a, v4=V4a, v5=V5, v6=V6)
DFa <- as.data.frame(La)
str(DFa)
## 'data.frame': 4 obs. of 6 variables:
## $ v1: int 1 2 3 NA
## $ v2: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
## $ v3: num 1 1.41 1.73 NA
## $ v4: Factor w/ 4 levels "a","b","c","d": 1 2 3 4
## $ v5: Factor w/ 4 levels "å","æ","Æ","ø": 2 3 4 1
## $ v6: Factor w/ 4 levels "å","æ","Æ","ø": 2 3 4 1
An aside before we proceed: handling temporal data is confusing. Time is multifaceted, where two of the variants are instantaneous time with data at that time point and interval time with data aggregated over the interval:
now <- Sys.time()
now
## [1] "2019-10-22 11:35:20 CEST"
class(now)
## [1] "POSIXct" "POSIXt"
as.Date(now)
## [1] "2019-10-22"
unclass(now)
## [1] 1571736920
One representation is in seconds since the epoch (with decimal parts of a second), another is in components also including important time zone information (time zone listings are updated regularly):
str(unclass(as.POSIXlt(now)))
## List of 11
## $ sec : num 20.2
## $ min : int 35
## $ hour : int 11
## $ mday : int 22
## $ mon : int 9
## $ year : int 119
## $ wday : int 2
## $ yday : int 294
## $ isdst : int 1
## $ zone : chr "CEST"
## $ gmtoff: int 7200
## - attr(*, "tzone")= chr [1:3] "" "CET" "CEST"
In the social sciences, we are more likely to need annual or monthly representations, but it is useful to be aware that a year can mean status at year end, or an aggregated value accumulated during an interval.
suppressMessages(library(zoo))
as.yearmon(now)
## [1] "Oct 2019"
as.yearqtr(now)
## [1] "2019 Q4"
as.Date("2016-03-01") - 1 # day
## [1] "2016-02-29"
as.Date("2018-03-01") - 1 # day
## [1] "2018-02-28"
seq(as.Date(now), as.Date(now)+12, length.out=4)
## [1] "2019-10-22" "2019-10-26" "2019-10-30" "2019-11-03"
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 30 (Workstation Edition)
##
## Matrix products: default
## BLAS: /home/rsb/topics/R/R361-share/lib64/R/lib/libRblas.so
## LAPACK: /home/rsb/topics/R/R361-share/lib64/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] zoo_1.8-6 wordcloud_2.6 RColorBrewer_1.1-2
## [4] MASS_7.3-51.4
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 lattice_0.20-38 digest_0.6.21 grid_3.6.1
## [5] magrittr_1.5 evaluate_0.14 rlang_0.4.0 stringi_1.4.3
## [9] rmarkdown_1.16 tools_3.6.1 stringr_1.4.0 xfun_0.10
## [13] yaml_2.2.0 compiler_3.6.1 htmltools_0.4.0 knitr_1.25
Abelson, Harold, and Gerald Jay Sussman. 1996. Structure and Interpretation of Computer Programs. Boston, MA: MIT Press.
Bååth, Rasmus. 2012. “The State of Naming Conventions in R.” The R Journal 4 (2): 74–75. https://journal.r-project.org/archive/2012/RJ-2012-018/index.html.
Becker, R.A., and J.M. Chambers. 1984. S: An Interactive Environment for Data Analysis and Graphics. Pacific Grove, CA, USA: Wadsworth & Brooks/Cole.
———. 1985. Extending the S System. Pacific Grove, CA, USA: Wadsworth & Brooks/Cole.
Becker, Richard A., John M. Chambers, and Allan R. Wilks. 1988. The New S Language. London: Chapman & Hall.
Chambers, John M. 1998. Programming with Data. New York: Springer.
———. 2016. Extending R. Boca Raton: Chapman & Hall.
Chambers, John M., and Trevor J. Hastie. 1992. Statistical Models in S. London: Chapman & Hall.
Ihaka, Ross, and Robert Gentleman. 1996. “R: A Language for Data Analysis and Graphics.” Journal of Computational and Graphical Statistics 5 (3): 299–314. https://doi.org/10.1080/10618600.1996.10474713.
Tierney, Luke. 1990. LISP-STAT: An Object-Oriented Environment for Statistical Computing and Dynamic Graphics. New York: Wiley.
———. 1996. “Recent Developments and Future Directions in Lisp-Stat.” Journal of Computational and Graphical Statistics 5 (3): 250–62.
———. 2005. “Some Notes on the Past and Future of Lisp-Stat.” Journal of Statistical Software, Articles 13 (9): 1–15. https://doi.org/10.18637/jss.v013.i09.
Venables, William N., and Brian D. Ripley. 2000. S Programming. New York: Springer. http://www.stats.ox.ac.uk/pub/MASS3/Sprog/.
Wickham, Hadley. 2014. Advanced R. Boca Raton, FL: Chapman & Hall. http://adv-r.had.co.nz/.