Chapter 2 Learning R and Bioconductor
In this section, we outline various resources for learning R and Bioconductor. We provide a brief set of instructions for installing R on your own machine, and then cover how to get help for functions, packages, and Bioconductor-specific resources for learning more.
2.1 The Benefits of R and Bioconductor
R is a high-level programming language that was initially designed for statistical applications. While there is much to be said about R as a programming language, one of the key advantages of using R is that it is highly extensible through packages. Packages are collections of functions, data, and documentation that extends the capabilities of base R. The ease of development and distribution of packages for R has made it a rich environment for many fields of study and application.
One of the primary ways in which packages are distributed is through centralized repositories. The first R repository a user typically runs into is the Comprehensive R Archive Network (CRAN), which hosts over 13,000 packages to date, and is home to many of the most popular R packages.
Similar to CRAN, Bioconductor is a repository of R packages as well. However, whereas CRAN is a general purpose repository, Bioconductor focuses on software tailored for genomic analysis. Furthermore, Bioconductor has stricter requirements for a package to be accepted into the repository. Of particular interest to us is the inclusion of high quality documentation and the use of common data infrastructure to promote package interoperability.
In order to use these packages from CRAN and Bioconductor, and start programming with R to follow along in these workflows, some knowledge of R is helpful. Here we outline resources to guide you through learning the basics.
2.2 Learning R Online
To learn more about programming with R, we highly recommend checking out the online courses offered by Datacamp, which includes both introductory and advanced courses within the R track. Datacamp is all online with many free courses, with videos and a code editor/console that promotes an interactive learning experience. What we like about Datacamp is that it is more focused on topics and programming paradigms that center around data science, which is especially helpful for getting started with R.
Beyond just Datacamp, a mainstay resource for learning R is the R for Data Science book. This book illustrates R programming through the exploration of various data science concepts - transformation, visualization, exploration, and more.
2.3 Running R Locally
While learning R through online resources is a great way to start with R, as it requires minimal knowledge to start up, at some point, it will be desirable to have a local installation - on your own hardware - of R. This will allow you to install and maintain your own software and code, and furthermore allow you to create a personalized workspace.
2.3.1 Installing R
Prior to getting started with this book, some prior programming experience with R is helpful. Check out the Learning R and More chapter for a list of resources to get started with R and other useful tools for bioinformatic analysis.
To follow along with the analysis workflows in this book on your personal computer, it is first necessary to install the R programming language. Additionally, we recommend a graphical user interface such as RStudio for programming in R and visualization. RStudio features many helpful tools, such as code completion and an interactive data viewer to name but two. For more details, please see the online book R for Data Science prerequisites section for more information about installing R and using RStudio.
2.3.1.1 For MacOS/Linux Users
A special note for MacOS/Linux users: we highly recommend using a package manager to manage your R installation. This differs across different Linux distributions, but for MacOS we highly recommend the Homebrew package manager. Follow the website directions to install homebrew, and install R via the commandline with brew install R
, and it will automatically configure your installation for you. Upgrading to new R versions can be done by running brew upgrade
.
2.3.2 Installing R & Bioconductor Packages
After installing R, the next step is to install R packages. In the R console, you can install packages from CRAN via the install.packages()
function. In order to install Bioconductor packages, we will first need the BiocManager package which is hosted on CRAN. This can be done by running:
install.packages("BiocManager")
The BiocManager package makes it easy to install packages from the Bioconductor repository. For example, to install the SingleCellExperiment package, we run:
## the command below is a one-line shortcut for:
## library(BiocManager)
## install("SingleCellExperiment")
BiocManager::install("SingleCellExperiment")
Throughout the book, we can load packages via the library()
function, which by convention usually comes at the top of scripts to alert readers as to what packages are required. For example, to load the SingleCellExperiment package, we run:
library(SingleCellExperiment)
Many packages will be referenced throughout the book within the workflows, and similar to the above, can be installed using the BiocManager::install()
function.
2.4 Getting Help In (and Out) of R
One of the most helpful parts of R is being able to get help inside of R. For example, to get the manual associated with a function, class, dataset, or package, you can prepend the code of interest with a ?
to retrieve the relevant help page. For example, to get information about the data.frame()
function, the SingleCellExperiment class, the in-built iris dataset, or for the BiocManager package, you can type:
?data.frame
?SingleCellExperiment
?iris
?BiocManager
Beyond the R console, there are myriad online resources to get help. The R for Data Science book has a great section dedicated to looking for help outside of R. In particular, Stackoverflow’s R tag is a helpful resource for asking and exploring general R programming questions.
2.5 Bioconductor Help
One of the key tenets of Bioconductor software that makes it stand out from CRAN is the required documentation of packages and workflows. In addition, Bioconductor hosts a Bioconductor-specific support site that has grown into a valuable resource of its own, thanks to the work of dedicated volunteers.
2.5.1 Bioconductor Packages
Each package hosted on Bioconductor has a dedicated page with various resources. For an example, looking at the scater
package page on Bioconductor, we see that it contains:
- a brief description of the package at the top, in addition to the authors, maintainer, and an associated citation
- installation instructions that can be cut and paste into your R console
- documentation - vignettes, reference manual, news
Here, the most important information comes from the documentation section. Every package in Bioconductor is required to be submitted with a vignette - a document showcasing basic functionality of the package. Typically, these vignettes have a descriptive title that summarizes the main objective of the vignette. These vignettes are a great resource for learning how to operate the essential functionality of the package.
The reference manual contains a comprehensive listing of all the functions available in the package. This is a compilation of each function’s manual, aka help pages, which can be accessed programmatically in the R console via ?<function>
.
Finally, the NEWS file contains notes from the authors which highlight changes across different versions of the package. This is a great way of tracking changes, especially functions that are added, removed, or deprecated, in order to keep your scripts current with new versions of dependent packages.
Below this, the Details section covers finer nuances of the package, mostly relating to its relationship to other packages:
- upstream dependencies (Depends, Imports, Suggests fields): packages that are imported upon loading the given package
- downstream dependencies (Depends On Me, Imports Me, Suggests Me): packages that import the given package when loaded
For example, we can see that an entry called simpleSingle in the Depends On Me field on the scater
page takes us to a step-by-step workflow for low-level analysis of single-cell RNA-seq data.
One additional Details entry, the biocViews, is helpful for looking at how the authors annotate their package. For example, for the scater
package, we see that it is associated with DataImport
, DimensionReduction
, GeneExpression
, RNASeq
, and SingleCell
, to name but some of its many annotations. We cover biocViews in more detail.
2.5.2 biocViews
To find packages via the Bioconductor website, one useful resource is the BiocViews page, which provides a hierarchically organized view of annotations associated with Bioconductor packages.
Under the “Software” label for example (which is comprised of most of the Bioconductor packages), there exist many different views to explore packages. For example, we can inspect based on the associated “Technology”, and explore “Sequencing” associated packages, and furthermore subset based on “RNASeq”.
Another area of particular interest is the “Workflow” view, which provides Bioconductor packages that illustrate an analytical workflow. For example, the “SingleCellWorkflow” contains the aforementioned tutorial, encapsulated in the simpleSingleCell package.
2.5.3 Bioconductor Forums
The Bioconductor support site contains a Stackoverflow-style question and answer support site that is actively contributed to from both users and package developers. Thanks to the work of dedicated volunteers, there are ample questions to explore to learn more about Bioconductor specific workflows.
Another way to connect with the Bioconductor community is through Slack, which hosts various channels dedicated to packages and workflows. The Bioc-community Slack is a great way to stay in the loop on the latest developments happening across Bioconductor, and we recommend exploring the “Channels” section to find topics of interest.