Get Started

Install and load the R package SMAI from Github page rongstat/SMAI:

library(devtools)
install_github("rongstat/SMAI")
library(SMAI)

Load additional R packages needed for subsequent data analysis.

library(Seurat)
library(SeuratData)
library(uwot)
library(ggrepel)
library(ggplot2)
library(cluster)
library(fossil)
library(batchelor)

Load datasets

Load the human peripheral blood mononuclear cell (PBMC) dataset from the R package SeuratData.

InstallData("pbmcsca")
LoadData("pbmcsca")
# split the dataset into a list
data.list <- SplitObject(pbmcsca, split.by = "Method")
data.X1 = as.matrix(data.list$`10x Chromium (v2)`@assays$RNA@counts)
ct.X1 = data.list$`10x Chromium (v2)`@meta.data$CellType
data.X2 = as.matrix(data.list$`10x Chromium (v3)`@assays$RNA@counts)
ct.X2 = data.list$`10x Chromium (v3)`@meta.data$CellType

Here we focus on integrating two datasets obtained based on different sequencing technologies. The first count matrix (data.X1) contains 3362 cells, which are obtained based on 10x Chromium (v2) technology, whereas the second count matrix (data.X2) contains 3222 cells, which are obtained based on 10x Chromium (v3) technology. The corresponding cell type information is stored in ct.X1 and ct.X2, respectively.

dim(data.X1)
## [1] 33694  3362
dim(data.X2)
## [1] 33694  3222
table(ct.X1)#cell type attributes in dataset1
## ct.X1
##                      B cell              CD14+ monocyte 
##                         862                         436 
##              CD16+ monocyte                 CD4+ T cell 
##                          50                         963 
##            Cytotoxic T cell              Dendritic cell 
##                         694                          76 
##               Megakaryocyte         Natural killer cell 
##                          32                         219 
## Plasmacytoid dendritic cell 
##                          30
table(ct.X2)#cell type attributes in dataset2
## ct.X2
##              B cell      CD14+ monocyte      CD16+ monocyte         CD4+ T cell 
##                 346                 354                  98                 960 
##    Cytotoxic T cell      Dendritic cell       Megakaryocyte Natural killer cell 
##                 962                  38                 270                 194

Get the normalized datasets with 2000 most variable genes using Seurat package.

all.RNA = cbind(data.X1, data.X2)
meta.data=data.frame(cell_type = factor(c(ct.X1,ct.X2)), 
                     method = factor(c(rep("data1",length(ct.X1)), rep("data2",length(ct.X2)))))
row.names(meta.data)=c(colnames(all.RNA))
all.data <- CreateSeuratObject(counts = all.RNA, project = "align", min.cells = 1,
                               min.features = 1, meta.data = meta.data)
all.data <- SplitObject(all.data, split.by = "method")
all.data <- lapply(X = all.data, FUN = function(x) {
  x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
  x <- NormalizeData(x)
})
features <- SelectIntegrationFeatures(object.list = all.data)
data.X1.smai = as.matrix(all.data$data1@assays$RNA@data)
data.X2.smai = as.matrix(all.data$data2@assays$RNA@data)
data.X1.smai = data.X1.smai[match(features,rownames(data.X1.smai)),]
data.X2.smai = data.X2.smai[match(features,rownames(data.X2.smai)),]

The normalized matrices for the two datasets are stored in data.X1.smai and data.X2.smai, respectively.

Run SMAI

Here we consider testing for partial alignability. The null hypothesis is that there exists subsets of the two datasets containing at least 60% (i.e., prop.palign=0.6) of the samples that are alignable up to similarity transformation.

Identify maximal correspondence subsets

We first identifty subsets of both datasets with maximal correspondence or structure-sharing, based on an initial alignment and mutual nearest neighbor matching. Below the initial alignment is obtained using Seurat. Alternatively, one can use other integration methods such as fastMNN in the R package batchelor.

#Seurat for initial alignment
data.anchors <- FindIntegrationAnchors(object.list = all.data, anchor.features = features)
data.combined <- IntegrateData(anchorset = data.anchors)
dim(data.combined@assays$integrated@data)
#get integrated data
data1=as.matrix(data.combined@assays$integrated@data[,1:length(ct.X1)])
data2=as.matrix(data.combined@assays$integrated@data[,(length(ct.X1)+1):(length(ct.X1)+length(ct.X2))])
#set the proportion of partially alignable samples to be tested against
prop.palign = 0.6
#mutual nearest neighbor matching
k1=15
prop.mc=0
while(prop.mc<prop.palign/2+0.05){
mnn.out = findMutualNN(t(data1),t(data2), k1=k1) 
prop.mc = min(c(length(unique(mnn.out$first))/dim(data1)[2],
                       length(unique(mnn.out$second))/dim(data2)[2]))
k1=k1+5
}

The parameter prop.palign defines the proportion of partially alignable samples in both datasets, to be tested agasint. For example, if prop.palign=0.6, our null hypothesis states that “at least 60% of the samples in each dataset are alignable.” The parameter k1 in findNutualNN is chosen so that the maximal correspondence subsets is sufficiently large for downstream testing. Note that in principle prop.align.par should always be non-smaller than the prop.align parameter in the SMAI function align().

SMAI for alignability testing and data integration

set.seed(8)
out.smai <- align(data.X1.smai, data.X2.smai, sel1=unique(mnn.out$first),  sel2=unique(mnn.out$second),
                  t=3, knn=30, r.max=200, dir.map = "auto",  denoise="scree",
                  prop.align=prop.palign/2,  cutoff = 1.001, cutoff.t=1.5)
## [1] "iteration 1: sample space aligned!"
## [1] "iteration 1: feature space aligned!"
## [1] "iteration 2: sample space aligned!"
## [1] "iteration 2: feature space aligned!"
## [1] "iteration 3: sample space aligned!"
## [1] "iteration 3: feature space aligned!"
## [1] "Optimal alignment identified! Data1 aligned to Data2!"
## [1] "Spectral alignment done!"
## [1] "Spectral inference done!"

Return the p-value for partial alignability.

out.smai$p.value
## [1] 0.2593894

If p-value is smaller than 0.05, it suggests the two datasets are not alignable. Otherwise, the data do not contain sufficient evidence for rejecting the (partial) alignability hypothesis. Here the p-value suggests the datasets to be alignable. Note that the final p-value may differ under different random seeds. This is because for partial alignability test, a sample splitting procedure is used to ensure the validity of the test. In fact, one may benefit from such randomness by repeating the test under various random seeds, to evaluate the robustness/stability of the test results with respect to resampling.

The integrated datasets are stored as out.smai$data1.integrate and out.smai$data2.integrate, whereas the parameters characterizing the final alignment function is stored in out.smai$gamma, out.smai$beta, and out.smai$feature.rot. Note that the final integrated datasets will be returned regardless the significance of the p-value. The test essentially indicates the goodness-of-fit of the data generative model motivating the proposed alignment procedure, but does not affect the execution of the alignment algorithm. One should use caution or even not work with the integrated data if a significant p-value is returned.

Some Further Analysis

Visualize integrated data

#UMAP plot
data.int = t(cbind(out.smai$data1.integrate,out.smai$data2.integrate))
umap.out = umap(data.int, n_neighbors = 50, metric = "cosine", spread = 5)
meta_data=data.frame(cell_type = factor(c(ct.X1,ct.X2)),
                     batch = factor(c(rep("X1",length(ct.X1)), rep("X2",length(ct.X2)))))
data.plot = data.frame(UMAP1 = c(umap.out[,1]), UMAP2= c(umap.out[,2]),
                       cell_type = factor(meta_data$cell_type), batch = factor(meta_data$batch))
p1 <- ggplot(data.plot, aes(x=UMAP1, y=UMAP2, colour = cell_type)) + geom_point(size = 0.7)
p1 <- p1 +
  theme(axis.text=element_text(size=10,face="bold"),
        axis.title=element_text(size=10,face="bold"),
        legend.key.size = unit(0.5, 'cm'),
        legend.title = element_text(size=10,face="bold"),
        legend.text = element_text(size=10,face="bold")) +
  guides(colour = guide_legend(override.aes = list(size=5)))
p1

p1 <- ggplot(data.plot, aes(x=UMAP1, y=UMAP2, colour = batch, alpha=0.5)) + geom_point(size = 0.7)
p1 <- p1 +
  theme(axis.text=element_text(size=15,face="bold"),
        axis.title=element_text(size=15,face="bold"),
        legend.key.size = unit(1, 'cm'),
        legend.title = element_text(size=14,face="bold"), 
        legend.text = element_text(size=14,face="bold")) +
  guides(colour = guide_legend(override.aes = list(size=5)))
p1

Evaluate structure preservation

#compare dataset1 before and after integration
faithful(data.int[1:length(ct.X1),], t(data.X1.smai))
##   kendall  spearman   pearson 
## 0.8921189 0.9793743 0.9794497
#compare dataset2 before and after integration
faithful(data.int[-c(1:length(ct.X1)),], t(data.X2.smai))
##  kendall spearman  pearson 
##        1        1        1

The outputs include averaged Kendall’s tau (“kendall”), Spearman’s rho (“spearman”), and Pearson’s correlation (“pearson”) between the pairwise distances with respect to each sample before and after integration. The high correlations suggest the integrated datasets obtained by SMAI-align are less susceptible to technical artifacts, structural distortions, and information loss, making them more reliable for downstream analyses.

Inspect batch effects - mean shift

SMAI’s interpretability reveals insights into the sources of batch effects. SMAI-align not only returns the aligned datasets, but also outputs explicitly the underlying alignment function achieving such alignment, which enables further inspection and a deeper understanding of possible sources of batch effects. We first look at the mean-shift vector \(\gamma\in\mathbb{R}^d\).

highlight = names(out.smai$gamma)[abs(out.smai$gamma)>quantile(abs(out.smai$gamma),0.995)]
data.plot=data.frame(gamma=out.smai$gamma, label=names(out.smai$gamma))
library(tidyverse)
library(cowplot)
data.plot <- data.plot %>%
  mutate(
    label = ifelse(label %in% highlight, label, "")
  )
data.plot=data.plot[order(out.smai$gamma),]
plt <- ggplot(data.plot, aes(x=1:length(gamma), y = gamma)) +
  geom_point(
    size = 1.5,
    alpha = 0.8,
    aes(colour= factor((label=="")))# It's nice to add some transparency because there may be overlap.
  ) + ylab("gamma")+xlab("index")+
  geom_text_repel(
    aes(label = label),
    force         = 30,
    max.overlaps = 30,
    segment.color = 'grey50')
plt + theme(legend.position = "none")+coord_flip()

This plot suggests only a sparse set of genes are affected by the batch effect. Then we examine the expression levels of some particular genes highlighted above, grouped by cell type.

i=which(rownames(data.X1.smai)=="MTRNR2L1")
data.plot=data.frame(gene=data.X1.smai[i,], cell_type=meta_data$cell_type[1:length(ct.X1)])
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
  geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=15,face="bold"),
        legend.key.size = unit(1, 'cm'),
        text=element_text(size=20),
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=14)) + ylim(0,8.5)+
  ylab(names(out.smai$gamma)[i])
sp

data.plot=data.frame(gene=data.X2.smai[i,], cell_type=ct.X2)
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
   theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=15,face="bold"),
        legend.key.size = unit(1, 'cm'),
        text=element_text(size=20),
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=14)) +ylim(0,8.5)+
  geom_boxplot(outlier.size = 0.5) +
  ylab(names(out.smai$gamma)[i])
sp

i=which(rownames(data.X1.smai)=="DUSP1")
data.plot=data.frame(gene=data.X1.smai[i,], cell_type=meta_data$cell_type[1:length(ct.X1)])
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
  geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=15,face="bold"),
        legend.key.size = unit(1, 'cm'),
        text=element_text(size=20),
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=14)) + ylim(0,8.5)+
  ylab(names(out.smai$gamma)[i])
sp

data.plot=data.frame(gene=data.X2.smai[i,], cell_type=ct.X2)
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
   theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=15,face="bold"),
        legend.key.size = unit(1, 'cm'),
        text=element_text(size=20),
        legend.title = element_text(size=14), #change legend title font size
        legend.text = element_text(size=14)) +ylim(0,8.5)+
  geom_boxplot(outlier.size = 0.5) +
  ylab(names(out.smai$gamma)[i])
sp

Note that MTRNR2L1 and DUSP1 are both DE genes (Megakaryocyte and CD14+ monocyte, respectively). The batch effects on these genes are relatively uniform across cell types, and therefore SMAI-align doesn’t affect DE results after integration.

Inspect batch effects - covariance shift

The matrix \({\bf R}\in\mathbb{R}^{d\times d}\) essentially captures and remove the batch effects altering the gene correlation structures in each dataset.

# display rotation matrix
library(igraph)
rot.adj = out.smai$feature.rot
rownames(rot.adj)=rownames(data.X1.smai)
colnames(rot.adj)=rownames(data.X1.smai)
# obtain sparse graph representation of R by thresholding its entries
rot.adj[abs(rot.adj)<quantile(abs(rot.adj),0.9999)]=0
diag(rot.adj)=0
rm.ind=intersect(which(rowSums(rot.adj!=0)==0), which(colSums(rot.adj!=0)==0))
rot.adj = rot.adj[-rm.ind,-rm.ind]
g1 = graph_from_adjacency_matrix(abs(rot.adj), mode="directed",weighted = TRUE)
plot(g1,vertex.color="lightblue", vertex.size=3,edge.width = E(g1)$weight*10,
     edge.arrow.size=0.4, vertex.label.cex=0.8, vertex.label.dist=0.2)

The network illustrates the feature rotations needed to remove the corresponding batch effects. The width of the edge indicates the magnitude of the correction.