Chapter 6 Workflow: Integrating Datasets

authors: Stephanie C. Hicks, Robert A. Amezquita

The purpose of this case study is to demonstrate how to integrate multiple scRNA-seq datasets using R/Bioconductor packages. In this workflow, we go from preprocessing the data to integrating the data and visualizing it in a reduced dimensionality space to showcase the success of the integration approach using mutual nearest neighbors relative to a naive approach. This approach is helpful for ameliorating batch effects that can be introduced when combining data from different sequencing runs and/or platforms.

Here, we will be combining two datasets from 10X Genomics PBMC Data. One dataset is comprised of 3000 PBMCs from a healthy donor, and the other dataset is comprised of 4000 PBMCs from a different healthy donor. Our goal is to produce an integrated representation of this data to facilitate downstream analysis, such as clustering.

6.1 Package Requirements

These packages will be required for working through the vignette, and can be installed by running the code below. The data that we will be using comes from the TENxPBMCData package.

## required
BiocManager::install(c('scater', 'scran', 'limma', 'TENxPBMCData'))

## suggested
BiocManager::install(c('BiocParallel', 'BiocNeighbors'))
library(scater)
library(scran)
library(limma)
library(TENxPBMCData)
library(BiocParallel)
library(BiocNeighbors)

6.2 Loading the Data

Here we will be combining two different runs of scRNA-seq data - each from different healthy donors, and comprised of either 3000 cells (pbmc3k) or 4000 cells (pbmc4k). Note that these objects are already SingleCellExperiment objects.

pbmc3k <- TENxPBMCData('pbmc3k')
pbmc4k <- TENxPBMCData('pbmc4k')

6.3 Preprocessing

Here we walk through the steps required to produce a clean expression matrix, taking the raw count data through to a normalized representation.

6.3.1 Working with Common Genes

First, we find intersection of gene names and keep only the entries that are in common between the two datasets. We then reduce each of the individual datasets down to these matching entries (keep_genes) by subsetting.

keep_genes <- intersect(rownames(pbmc3k), rownames(pbmc4k))
pbmc3k <- pbmc3k[match(keep_genes, rownames(pbmc3k)), ]
pbmc4k <- pbmc4k[na.omit(match(keep_genes, rownames(pbmc4k))), ]

6.3.2 Cell and Gene Quality Control

First, for the combined data sce and the individual datasets pbmc3k and pbmc4k, we calculate essential quality control characteristics using the scater function calculateQCMetrics(). We then determine cells low quality cells by finding outliers with uncharacteristically low total counts or total number of features (genes) detected. We automate this into a function.

## For pbmc3k
pbmc3k <- calculateQCMetrics(pbmc3k)
low_lib_pbmc3k <- isOutlier(pbmc3k$log10_total_counts, type="lower", nmad=3)
low_genes_pbmc3k <- isOutlier(pbmc3k$log10_total_features_by_counts, type="lower", nmad=3)

## For pbmc4k
pbmc4k <- calculateQCMetrics(pbmc4k)
low_lib_pbmc4k <- isOutlier(pbmc4k$log10_total_counts, type="lower", nmad=3)
low_genes_pbmc4k <- isOutlier(pbmc4k$log10_total_features_by_counts, type="lower", nmad=3)

These results flag approximately 30 to 100 cells for removal from each of the datasets. We can then further subset our data to remove these cells by running the following:

pbmc3k <- pbmc3k[,!(low_lib_pbmc3k | low_genes_pbmc3k)]
pbmc4k <- pbmc4k[,!(low_lib_pbmc4k | low_genes_pbmc4k)]

6.3.3 Normalization

From here, we now compute the size factors using the scran package’s computeSumFactors() function, and apply the size factors via the scran package’s normalize() function to produce a new assay, logcounts, within each SingleCellExperiment object.

## compute the sizeFactors
pbmc3k <- computeSumFactors(pbmc3k)
pbmc4k <- computeSumFactors(pbmc4k)

## Normalize (using already calculated size factors)
pbmc3k <- normalize(pbmc3k)
pbmc4k <- normalize(pbmc4k)

6.3.4 Multibatch Normalization

We also rescale each batch to adjust for differences in sequencing depth between batches. The multiBatchNorm() function from the scran package recomputes log-normalized expression values after adjusting the size factors for systematic differences in coverage between SingleCellExperiment objects. The previously computed size factors only remove biases between cells within a single batch. This improves the quality of the correction step by removing one aspect of the technical differences between batches.

rescaled <- multiBatchNorm(pbmc3k, pbmc4k)
pbmc3k <- rescaled[[1]]
pbmc4k <- rescaled[[2]]

6.4 Feature Selection

A key step across many data integration methods is the selection of informative features across the different experiments. This helps to speed up computation and possibly improve the resulting integration.

Once again, we rely on the scran package to identify the genes with the highest biological coefficient of variation, using the trendVar() and decomposeVar() functions to calculate the per gene variance and separate it into technical versus biological components. We perform this for each individual dataset:

fit_pbmc3k <- trendVar(pbmc3k, use.spikes=FALSE) 
dec_pbmc3k <- decomposeVar(pbmc3k, fit_pbmc3k)
dec_pbmc3k$Symbol_TENx <- rowData(pbmc3k)$Symbol_TENx
dec_pbmc3k <- dec_pbmc3k[order(dec_pbmc3k$bio, decreasing = TRUE), ]

fit_pbmc4k <- trendVar(pbmc4k, use.spikes=FALSE) 
dec_pbmc4k <- decomposeVar(pbmc4k, fit_pbmc4k)
dec_pbmc4k$Symbol_TENx <- rowData(pbmc4k)$Symbol_TENx
dec_pbmc4k <- dec_pbmc4k[order(dec_pbmc4k$bio, decreasing = TRUE), ]

Then select the most informative genes that are shared across both datasets:

universe <- intersect(rownames(dec_pbmc3k), rownames(dec_pbmc4k))
mean.bio <- (dec_pbmc3k[universe,"bio"] + dec_pbmc4k[universe,"bio"])/2
hvg_genes <- universe[mean.bio > 0]

6.5 Combining the Datasets

Finally, we combine the datasets into a unified SingleCellExperiment object for the downstream integration approaches, now that the data has been normalized (both within and between datasets) and the shared most informative features have been identified.

## total raw counts
counts_pbmc <- cbind(counts(pbmc3k), counts(pbmc4k))

## total normalized counts (with multibatch normalization)
logcounts_pbmc <- cbind(logcounts(pbmc3k), logcounts(pbmc4k))

sce <- SingleCellExperiment( 
    assays = list(counts = counts_pbmc, logcounts = logcounts_pbmc),  
    rowData = rowData(pbmc3k), # same as rowData(pbmc4k) 
    colData = rbind(colData(pbmc3k), colData(pbmc4k)) 
) 

For safekeeping, we will also store the hvg_genes from the prior section into the sce object via:

metadata(sce)$hvg_genes <- hvg_genes

6.6 Integrating Datasets

Here we will now be comparing the results of different approaches to integration.

6.6.1 Naive Approach

The naive approach simply entails visualizing the combined sce object post-normalization with no attempt at batch correction. Here we manually calculate the PCA on the normalized data (retrieved via logcounts(sce) or, similarly, via assay(sce, "logcounts"), and then assign the result into the reducedDim slot of sce, naming it "PCA_naive".

px <- prcomp(t(logcounts(sce)[hvg_genes, ]))
reducedDim(sce, "PCA_naive") <- px$x[, 1:20]
plotReducedDim(sce, use_dimred = "PCA_naive",
               colour_by="Sample") + 
    ggtitle("PCA Without batch correction")

6.6.2 Limma Batch Correction

The limma package, a popular framework for the statistical analysis of RNA-seq, has a function removeBatchEffect() which will be used here to correct the normalized expression matrix logcounts across the two batches. The result will be assigned into the assays slot of the sce object as limma_corrected, and then used for PCA, saving the result in the reducedDim slot as "PCA_limma".

limma_corrected <- limma::removeBatchEffect(logcounts(sce), batch = sce$Sample)
assay(sce, "logcounts_limma") <- limma_corrected ## add new assay

pl <- prcomp(t(assay(sce, 'logcounts_limma')[hvg_genes, ]))
reducedDim(sce, "PCA_limma") <- pl$x[, 1:20]
plotReducedDim(sce, use_dimred = "PCA_limma",
               colour_by="Sample") + 
    ggtitle("PCA With limma removeBatchEffect() correction")

6.6.3 MNN Approach

The mutual nearest neighbors (MNN) approach within the scran package utilizes a novel approach to adjust for batch effects. The fastMNN() function returns a representation of the data with reduced dimensionality, which can be used in a similar fashion to other lower-dimensional representations such as PCA. In particular, this representation can be used for downstream methods such as clustering.

Where fastMNN() differs from other integration methods such as the limma approach above is that it does not produce a batch-corrected expression matrix. Thus, the result from fastMNN() should solely be treated as a reduced dimensionality representation, suitable for direct plotting, TSNE/UMAP, clustering, and trajectory analysis that relies on such results. The already (batch) normalized (via normalize() and multiBatchNorm()) can be supplied to other statistical frameworks that are better suited to handle batch effects, such as in the case of differential expression.

## Basic method to run - not run
mnn_out <- fastMNN(sce[hvg_genes, sce$Sample == "pbmc3k"],
                  sce[hvg_genes, sce$Sample == "pbmc4k"],
                  ## subset.row = hvg_genes, ## same as subsetting above
                  k = 20, d = 50, approximate = TRUE,
                  BNPARAM = BiocNeighbors::AnnoyParam(),
                  BPPARAM = BiocParallel::multiCoreParam())
## Adding parallelization and Annoy method for approximate nearest neighbors
## this makes fastMNN significantly faster on large data
mnn_out <- fastMNN(sce[hvg_genes, sce$Sample == "pbmc3k"],
                  sce[hvg_genes, sce$Sample == "pbmc4k"],
                  ## subset.row = hvg_genes, ## same as subsetting above
                  k = 20, d = 50, approximate = TRUE,
                  BNPARAM = BiocNeighbors::AnnoyParam(),
                  BPPARAM = BiocParallel::MulticoreParam(8))

reducedDim(sce, "MNN") <- mnn_out$correct
plotReducedDim(sce, use_dimred = "MNN",
                    colour_by="Sample") + 
    ggtitle("MNN Ouput Reduced Dimensions")

6.7 Session Info

sessionInfo()
## R Under development (unstable) (2019-01-13 r75986)
## Platform: x86_64-apple-darwin18.2.0 (64-bit)
## Running under: macOS Mojave 10.14.3
## 
## Matrix products: default
## BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.5/lib/libopenblasp-r0.3.5.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] BiocNeighbors_1.1.12        TENxPBMCData_1.1.0         
##  [3] HDF5Array_1.11.10           rhdf5_2.27.12              
##  [5] limma_3.39.12               scran_1.11.20              
##  [7] scater_1.11.12              ggplot2_3.1.0              
##  [9] SingleCellExperiment_1.5.2  SummarizedExperiment_1.13.0
## [11] DelayedArray_0.9.8          BiocParallel_1.17.17       
## [13] matrixStats_0.54.0          Biobase_2.43.1             
## [15] GenomicRanges_1.35.1        GenomeInfoDb_1.19.2        
## [17] IRanges_2.17.4              S4Vectors_0.21.10          
## [19] BiocGenerics_0.29.1        
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-6                  bit64_0.9-7                  
##  [3] httr_1.4.0                    dynamicTreeCut_1.63-1        
##  [5] tools_3.6.0                   R6_2.4.0                     
##  [7] irlba_2.3.3                   vipor_0.4.5                  
##  [9] DBI_1.0.0                     lazyeval_0.2.1               
## [11] colorspace_1.4-0              withr_2.1.2                  
## [13] tidyselect_0.2.5              gridExtra_2.3                
## [15] bit_1.1-14                    compiler_3.6.0               
## [17] labeling_0.3                  bookdown_0.9                 
## [19] scales_1.0.0                  stringr_1.4.0                
## [21] digest_0.6.18                 rmarkdown_1.11               
## [23] XVector_0.23.0                pkgconfig_2.0.2              
## [25] htmltools_0.3.6               rlang_0.3.1                  
## [27] RSQLite_2.1.1                 shiny_1.2.0                  
## [29] DelayedMatrixStats_1.5.2      dplyr_0.8.0.1                
## [31] RCurl_1.95-4.11               magrittr_1.5                 
## [33] BiocSingular_0.99.12          GenomeInfoDbData_1.2.0       
## [35] Matrix_1.2-15                 Rcpp_1.0.0                   
## [37] ggbeeswarm_0.6.0              munsell_0.5.0                
## [39] Rhdf5lib_1.5.1                viridis_0.5.1                
## [41] stringi_1.3.1                 yaml_2.2.0                   
## [43] edgeR_3.25.3                  zlibbioc_1.29.0              
## [45] plyr_1.8.4                    AnnotationHub_2.15.7         
## [47] grid_3.6.0                    blob_1.1.1                   
## [49] promises_1.0.1                ExperimentHub_1.9.1          
## [51] crayon_1.3.4                  lattice_0.20-38              
## [53] cowplot_0.9.4                 locfit_1.5-9.1               
## [55] knitr_1.21                    pillar_1.3.1                 
## [57] igraph_1.2.4                  glue_1.3.0                   
## [59] evaluate_0.13                 BiocManager_1.30.4           
## [61] httpuv_1.4.5.1                gtable_0.2.0                 
## [63] purrr_0.3.1                   assertthat_0.2.0             
## [65] xfun_0.5                      rsvd_1.0.0                   
## [67] mime_0.6                      xtable_1.8-3                 
## [69] later_0.8.0                   viridisLite_0.3.0            
## [71] tibble_2.0.1                  AnnotationDbi_1.45.0         
## [73] beeswarm_0.2.3                memoise_1.1.0                
## [75] statmod_1.4.30                interactiveDisplayBase_1.21.0
saveRDS(sce, "_rfiles/_data/integration_sce.rds", compress = "xz")