Chapter 7 Workflow: Clustering
The purpose of this case study is to demonstrate various approaches to clustering scRNA-seq datasets using R/Bioconductor packages. In this workflow, we go from preprocessing the data to clustering the data. Furthermore, we highlight methods which are especially suitable for large datasets.
Here we will be working with a dataset from the CellBench scRNA-seq benchmarking dataset collection. Specifically, we will be working with the sc_10x_5cl
dataset, which contains 5 sorted cell lines that were sequencing on the 10X Genomics platform. We will use this to showcase our different clustering strategies.
7.1 Package Requirements
These packages will be required for working through the vignette, and can be installed by running the code below:
BiocManager::install(c('scater', 'scran',
'SC3', 'clusterExperiment', 'BiocNeighbors',
'BiocParallel'))
## suggested for combining plots
devtools::install_github('thomasp85/patchwork')
library(scater)
library(scran)
library(SC3)
library(clusterExperiment)
library(BiocNeighbors)
library(BiocParallel)
library(patchwork)
7.2 Loading the Data
To follow along with the workflow, we use the CellBench_data Github repository data
folder’s sincell_with_class_5cl.RData
workspace, which contains an object called sc_10x_5cl_qc
. While this data will eventually be submitted to ExperimentHub, a central repository for datasets, it is currently as yet unavailable there.
For the time being, the data can be found in this book’s Github repo in the _rfiles/_data
folder. The data arrives as a SingleCellExperiment
class object.
sce <- readRDS("_rfiles/_data/cellbench_sce_sc_10x_5cl_qc.rds")
7.3 Understanding the Data
Within the cell metadata stored in the colData
slot of our sce
object, we see that information regarding the cell line of origin for each cell is stored under the column cell_line
, and furthermore can count the number of instances of each cell line present in the dataset as follows:
table(sce$cell_line)
##
## A549 H1975 H2228 H838 HCC827
## 1244 515 751 840 568
We will be using this information to illustrate our dimensionality reduction by highlighting each of the cell lines, and furthermore using this information to verify our clustering performance. While ground truth is not often known in practice, such benchmarking datasets are essential for validation of methods. While simulations are another source of establishing ground truth, this dataset is particularly appealing as it is derived under realistic experimental conditions.
7.4 Preprocessing
This dataset has already undergone cell quality control, so in this case we will skip this step (for this, we refer to the Workflow on Integrating Datasets. Thus, we will begin with the feature selection step and go on from there.
7.5 Feature Selection
In order to improve performance in terms of both speed and quality of results, we will start with defining a set of highly variable genes based on the biological coefficient of variability (see the scran vignette for details).
## Calculating highly variable genes -------------------
fit <- trendVar(sce, parametric=TRUE, use.spikes = FALSE)
dec <- decomposeVar(sce, fit)
hvg_genes <- rownames(dec[dec$FDR < 0.00001, ]) # 1874 genes
## remove uninteresting/unknown genes (ribosomal, mitochondrial)
bad_patterns <- 'RP[0-9]|^MT|[0-9][0-9][0-9][0-9]|^RPL|^RPS'
hvg_genes <- hvg_genes[!grepl(bad_patterns, hvg_genes)] # 1713 genes
## save the decomposed variance table and hvg_genes into metadata for safekeeping
metadata(sce)$hvg_genes <- hvg_genes
metadata(sce)$dec_var <- dec
7.6 Dimension Reduction
Here we will use PCA to calculate the first 20 principal components with our highly variable geneset, using the irlba
method for a faster approximate version of the calculation. These PCs will then form the basis of our subsequent tSNE calculation.
sce <- runPCA(sce, method = "irlba",
ncomponents = 20,
feature_set = metadata(sce)$hvg_genes)
sce <- runTSNE(sce,
perplexity = 30,
feature_set = metadata(sce)$hvg_genes)
sce <- runUMAP(sce,
n_neighbors = 15,
feature_set = metadata(sce)$hvg_genes)
Here, we see that PCA nicely separates most of the cell lines, with the exception of HCC827 and H1975. On the other hand, tSNE nicely manages to separate the 5 cell lines, but interestingly produces two separate clusters for cell line H1975.
We leave it as an exercise to the interested reader to modify the essential tuning parameters, perplexity
, and n_neighbors
for TSNE and UMAP respectively to see how the dimensionality reduction embeddings change. Here we have used the default parameters, but simply made them explicit. Information on these and other parameters can be seen by accessing the help for each function.
Note that clustering performance is not necessarily dependent on the dimensionality reduction results (to answer this, it will depend on the specific clustering technique at hand).
However, clustering performance indeed may be reflected in the dimensionality reduction results. For example, given the PCA result, we might expect some confusion between HCC827 and H1975, and based on the UMAP results, we might possibly even see two clusters within the H1975 cell line. Note however that these dimensionality reduction embeddings are also a function of the tuning parameters noted above.
gPCA <- plotReducedDim(sce, "PCA", colour_by = "cell_line")
gTSNE <- plotReducedDim(sce, "TSNE", colour_by = "cell_line")
gUMAP <- plotReducedDim(sce, "UMAP", colour_by = "cell_line")
patchwork::wrap_plots(gPCA, gTSNE, gUMAP, heights = 1, widths = 1, ncol = 1)
7.7 Clustering
Here we highlight different frameworks for clustering. The first two, using the sc3
and clusterExperiment
Bioconductor packages, are fuller implementations that can test across multiple parametrizations and furthermore inspect the quality of the clustering results. The BiocNeighbors
package is briefly highlighted to show a minimal alternative to clustering that emphasizes speed.
7.7.1 SC3
The sc3
package provides a simple framework that allows users to test for many k’s, e.g. numbers of clusters, and subsequently compare the results from these differing k’s in both quantitative and qualitative ways to pick an optimal k result. Below, we first set an essential rowData
feature required by the package to run. Subsequently, we run the sc3()
function in this example using a subset of the data (only the highly variable genes, hvg_genes
, testing k’s 3 through 6.
## SC3 requires this column to be appended
rowData(sce)$feature_symbol <- rownames(sce)
## SC3 will return an SCE object with appended "sc3_" columns
sce <- sc3(sce[metadata(sce)$hvg_genes, ],
ks = 3:6,
k_estimator = TRUE)
After using sc3()
, the function returns the original SingleCellExperiment
object, but with new columns in colData(sce)
corresponding to the different ks
supplied to the function, as well as a full representation of the analysis that is stored in metadata(sce)$sc3
, which includes an estimate of the optimal k
(as dictated by the k_estimator = TRUE
argument above).
Below, we show the clustering results of the ks
we supplied, 3 through 6, shown on the TSNE representation of the data. We use the scater
package function plotReducedDim()
to produce the individual plots, and for the sake of this vignette, wrap them together into a single plot using the patchwork
package.
## Basic code:
g3 <- plotReducedDim(sce, use_dimred = "TSNE", colour_by = "sc3_3_clusters")
g4 <- plotReducedDim(sce, use_dimred = "TSNE", colour_by = "sc3_4_clusters")
g5 <- plotReducedDim(sce, use_dimred = "TSNE", colour_by = "sc3_5_clusters")
g6 <- plotReducedDim(sce, use_dimred = "TSNE", colour_by = "sc3_6_clusters")
patchwork::wrap_plots(g3, g4, g5, g6, widths = 1, heights = 1, ncol = 1)
The sc3
package contains many more utilities for exploring the stability of clustering and can even produce differential expression analysis results using the biology = TRUE
argument within the sc3()
function. We leave it to the interested reader to learn more about sc3
via their vignette.
7.7.2 clusterExperiment
The clusterExperiment
package uses the function RSEC()
to calculate clusters across various parametrizations, and contains multiple parameters to do so. Here, we specify the parametrizations to iterate over using the alphas
and k0s
arguments below. We refer the reader to the help page for ?RSEC
to learn more about the different parameters. In the end, this exhaustive exercise is used to ultimately determine a consensus clustering that combines the information learned from the various parametrizations.
Note that clusterExperiment
takes a long time to run, even with the reduced search space parametrized below. We recommend utilizing cluster resources for this job.
rsec <- RSEC(sce[metadata(sce)$hvg_genes, ],
reduceMethod = "PCA",
nReducedDims = 20,
alphas = c(0.1, 0.3), k0s = 4:6,
consensusMinSize = 50,
ncores = 8)
metadata(sce)$rsec <- rsec
colData(sce)$rsec_consensus_clusters <- as.factor(clusterMatrix(rsec)[, "mergeClusters"])
Once again, we refer the interested reader to the vignette for clusterExperiment
to learn more about extended functionality for visualizing the data and about the specifics of the clustering workflow.
7.7.3 BiocNeighbors
The BiocNeighbors
package provides a lower-level method for clustering of data. Under the hood, it provides various types of clustering algorithms, as specified through the BNPARAM
class object. This BNPARAM
object can be passed into the BiocNeighbors
clustering function, findKNN()
, as well as other functions that support BNPARAM
.
The BiocNeighbors
packaged is especially designed for developers to enable a unified interface for clustering algorithm specification, allowing for the separation of clustering functionality from the main function.