Install and load the R package SMAI
from Github page rongstat/SMAI
:
library(devtools)
install_github("rongstat/SMAI")
library(SMAI)
Load additional R packages needed for subsequent data analysis.
library(Seurat)
library(SeuratData)
library(uwot)
library(ggrepel)
library(ggplot2)
library(cluster)
library(fossil)
library(batchelor)
Load the human peripheral blood mononuclear cell (PBMC) dataset from
the R package SeuratData
.
InstallData("pbmcsca")
LoadData("pbmcsca")
# split the dataset into a list
data.list <- SplitObject(pbmcsca, split.by = "Method")
data.X1 = as.matrix(data.list$`10x Chromium (v2)`@assays$RNA@counts)
ct.X1 = data.list$`10x Chromium (v2)`@meta.data$CellType
data.X2 = as.matrix(data.list$`10x Chromium (v3)`@assays$RNA@counts)
ct.X2 = data.list$`10x Chromium (v3)`@meta.data$CellType
Here we focus on integrating two datasets obtained based on different
sequencing technologies. The first count matrix (data.X1
)
contains 3362 cells, which are obtained based on 10x Chromium (v2)
technology, whereas the second count matrix (data.X2
)
contains 3222 cells, which are obtained based on 10x Chromium (v3)
technology. The corresponding cell type information is stored in
ct.X1
and ct.X2
, respectively.
dim(data.X1)
## [1] 33694 3362
dim(data.X2)
## [1] 33694 3222
table(ct.X1)#cell type attributes in dataset1
## ct.X1
## B cell CD14+ monocyte
## 862 436
## CD16+ monocyte CD4+ T cell
## 50 963
## Cytotoxic T cell Dendritic cell
## 694 76
## Megakaryocyte Natural killer cell
## 32 219
## Plasmacytoid dendritic cell
## 30
table(ct.X2)#cell type attributes in dataset2
## ct.X2
## B cell CD14+ monocyte CD16+ monocyte CD4+ T cell
## 346 354 98 960
## Cytotoxic T cell Dendritic cell Megakaryocyte Natural killer cell
## 962 38 270 194
Get the normalized datasets with 2000 most variable genes using
Seurat
package.
all.RNA = cbind(data.X1, data.X2)
meta.data=data.frame(cell_type = factor(c(ct.X1,ct.X2)),
method = factor(c(rep("data1",length(ct.X1)), rep("data2",length(ct.X2)))))
row.names(meta.data)=c(colnames(all.RNA))
all.data <- CreateSeuratObject(counts = all.RNA, project = "align", min.cells = 1,
min.features = 1, meta.data = meta.data)
all.data <- SplitObject(all.data, split.by = "method")
all.data <- lapply(X = all.data, FUN = function(x) {
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
x <- NormalizeData(x)
})
features <- SelectIntegrationFeatures(object.list = all.data)
data.X1.smai = as.matrix(all.data$data1@assays$RNA@data)
data.X2.smai = as.matrix(all.data$data2@assays$RNA@data)
data.X1.smai = data.X1.smai[match(features,rownames(data.X1.smai)),]
data.X2.smai = data.X2.smai[match(features,rownames(data.X2.smai)),]
The normalized matrices for the two datasets are stored in
data.X1.smai
and data.X2.smai
,
respectively.
Here we consider testing for partial alignability. The null
hypothesis is that there exists subsets of the two datasets containing
at least 60% (i.e., prop.palign=0.6
) of the samples that
are alignable up to similarity transformation.
We first identifty subsets of both datasets with maximal
correspondence or structure-sharing, based on an initial alignment and
mutual nearest neighbor matching. Below the initial alignment is
obtained using Seurat
. Alternatively, one can use other
integration methods such as fastMNN
in the R package
batchelor
.
#Seurat for initial alignment
data.anchors <- FindIntegrationAnchors(object.list = all.data, anchor.features = features)
data.combined <- IntegrateData(anchorset = data.anchors)
dim(data.combined@assays$integrated@data)
#get integrated data
data1=as.matrix(data.combined@assays$integrated@data[,1:length(ct.X1)])
data2=as.matrix(data.combined@assays$integrated@data[,(length(ct.X1)+1):(length(ct.X1)+length(ct.X2))])
#set the proportion of partially alignable samples to be tested against
prop.palign = 0.6
#mutual nearest neighbor matching
k1=15
prop.mc=0
while(prop.mc<prop.palign/2+0.05){
mnn.out = findMutualNN(t(data1),t(data2), k1=k1)
prop.mc = min(c(length(unique(mnn.out$first))/dim(data1)[2],
length(unique(mnn.out$second))/dim(data2)[2]))
k1=k1+5
}
The parameter prop.palign
defines the proportion of
partially alignable samples in both datasets, to be tested agasint. For
example, if prop.palign=0.6
, our null hypothesis states
that “at least 60% of the samples in each dataset are alignable.” The
parameter k1
in findNutualNN
is chosen so that
the maximal correspondence subsets is sufficiently large for downstream
testing. Note that in principle prop.align.par
should
always be non-smaller than the prop.align
parameter in the
SMAI function align()
.
set.seed(8)
out.smai <- align(data.X1.smai, data.X2.smai, sel1=unique(mnn.out$first), sel2=unique(mnn.out$second),
t=3, knn=30, r.max=200, dir.map = "auto", denoise="scree",
prop.align=prop.palign/2, cutoff = 1.001, cutoff.t=1.5)
## [1] "iteration 1: sample space aligned!"
## [1] "iteration 1: feature space aligned!"
## [1] "iteration 2: sample space aligned!"
## [1] "iteration 2: feature space aligned!"
## [1] "iteration 3: sample space aligned!"
## [1] "iteration 3: feature space aligned!"
## [1] "Optimal alignment identified! Data1 aligned to Data2!"
## [1] "Spectral alignment done!"
## [1] "Spectral inference done!"
Return the p-value for partial alignability.
out.smai$p.value
## [1] 0.2593894
If p-value is smaller than 0.05, it suggests the two datasets are not alignable. Otherwise, the data do not contain sufficient evidence for rejecting the (partial) alignability hypothesis. Here the p-value suggests the datasets to be alignable. Note that the final p-value may differ under different random seeds. This is because for partial alignability test, a sample splitting procedure is used to ensure the validity of the test. In fact, one may benefit from such randomness by repeating the test under various random seeds, to evaluate the robustness/stability of the test results with respect to resampling.
The integrated datasets are stored as
out.smai$data1.integrate
and
out.smai$data2.integrate
, whereas the parameters
characterizing the final alignment function is stored in
out.smai$gamma
, out.smai$beta
, and
out.smai$feature.rot
. Note that the final integrated
datasets will be returned regardless the significance of the p-value.
The test essentially indicates the goodness-of-fit of the data
generative model motivating the proposed alignment procedure, but does
not affect the execution of the alignment algorithm. One should use
caution or even not work with the integrated data if a significant
p-value is returned.
#UMAP plot
data.int = t(cbind(out.smai$data1.integrate,out.smai$data2.integrate))
umap.out = umap(data.int, n_neighbors = 50, metric = "cosine", spread = 5)
meta_data=data.frame(cell_type = factor(c(ct.X1,ct.X2)),
batch = factor(c(rep("X1",length(ct.X1)), rep("X2",length(ct.X2)))))
data.plot = data.frame(UMAP1 = c(umap.out[,1]), UMAP2= c(umap.out[,2]),
cell_type = factor(meta_data$cell_type), batch = factor(meta_data$batch))
p1 <- ggplot(data.plot, aes(x=UMAP1, y=UMAP2, colour = cell_type)) + geom_point(size = 0.7)
p1 <- p1 +
theme(axis.text=element_text(size=10,face="bold"),
axis.title=element_text(size=10,face="bold"),
legend.key.size = unit(0.5, 'cm'),
legend.title = element_text(size=10,face="bold"),
legend.text = element_text(size=10,face="bold")) +
guides(colour = guide_legend(override.aes = list(size=5)))
p1
p1 <- ggplot(data.plot, aes(x=UMAP1, y=UMAP2, colour = batch, alpha=0.5)) + geom_point(size = 0.7)
p1 <- p1 +
theme(axis.text=element_text(size=15,face="bold"),
axis.title=element_text(size=15,face="bold"),
legend.key.size = unit(1, 'cm'),
legend.title = element_text(size=14,face="bold"),
legend.text = element_text(size=14,face="bold")) +
guides(colour = guide_legend(override.aes = list(size=5)))
p1
#compare dataset1 before and after integration
faithful(data.int[1:length(ct.X1),], t(data.X1.smai))
## kendall spearman pearson
## 0.8921189 0.9793743 0.9794497
#compare dataset2 before and after integration
faithful(data.int[-c(1:length(ct.X1)),], t(data.X2.smai))
## kendall spearman pearson
## 1 1 1
The outputs include averaged Kendall’s tau (“kendall”), Spearman’s rho (“spearman”), and Pearson’s correlation (“pearson”) between the pairwise distances with respect to each sample before and after integration. The high correlations suggest the integrated datasets obtained by SMAI-align are less susceptible to technical artifacts, structural distortions, and information loss, making them more reliable for downstream analyses.
SMAI’s interpretability reveals insights into the sources of batch effects. SMAI-align not only returns the aligned datasets, but also outputs explicitly the underlying alignment function achieving such alignment, which enables further inspection and a deeper understanding of possible sources of batch effects. We first look at the mean-shift vector \(\gamma\in\mathbb{R}^d\).
highlight = names(out.smai$gamma)[abs(out.smai$gamma)>quantile(abs(out.smai$gamma),0.995)]
data.plot=data.frame(gamma=out.smai$gamma, label=names(out.smai$gamma))
library(tidyverse)
library(cowplot)
data.plot <- data.plot %>%
mutate(
label = ifelse(label %in% highlight, label, "")
)
data.plot=data.plot[order(out.smai$gamma),]
plt <- ggplot(data.plot, aes(x=1:length(gamma), y = gamma)) +
geom_point(
size = 1.5,
alpha = 0.8,
aes(colour= factor((label=="")))# It's nice to add some transparency because there may be overlap.
) + ylab("gamma")+xlab("index")+
geom_text_repel(
aes(label = label),
force = 30,
max.overlaps = 30,
segment.color = 'grey50')
plt + theme(legend.position = "none")+coord_flip()
This plot suggests only a sparse set of genes are affected by the batch effect. Then we examine the expression levels of some particular genes highlighted above, grouped by cell type.
i=which(rownames(data.X1.smai)=="MTRNR2L1")
data.plot=data.frame(gene=data.X1.smai[i,], cell_type=meta_data$cell_type[1:length(ct.X1)])
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
theme(axis.text=element_text(size=15),
axis.title=element_text(size=15,face="bold"),
legend.key.size = unit(1, 'cm'),
text=element_text(size=20),
legend.title = element_text(size=14), #change legend title font size
legend.text = element_text(size=14)) + ylim(0,8.5)+
ylab(names(out.smai$gamma)[i])
sp
data.plot=data.frame(gene=data.X2.smai[i,], cell_type=ct.X2)
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
theme(axis.text=element_text(size=15),
axis.title=element_text(size=15,face="bold"),
legend.key.size = unit(1, 'cm'),
text=element_text(size=20),
legend.title = element_text(size=14), #change legend title font size
legend.text = element_text(size=14)) +ylim(0,8.5)+
geom_boxplot(outlier.size = 0.5) +
ylab(names(out.smai$gamma)[i])
sp
i=which(rownames(data.X1.smai)=="DUSP1")
data.plot=data.frame(gene=data.X1.smai[i,], cell_type=meta_data$cell_type[1:length(ct.X1)])
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
theme(axis.text=element_text(size=15),
axis.title=element_text(size=15,face="bold"),
legend.key.size = unit(1, 'cm'),
text=element_text(size=20),
legend.title = element_text(size=14), #change legend title font size
legend.text = element_text(size=14)) + ylim(0,8.5)+
ylab(names(out.smai$gamma)[i])
sp
data.plot=data.frame(gene=data.X2.smai[i,], cell_type=ct.X2)
sp <- ggplot(data.plot, aes(x=cell_type, y=gene)) +
theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
theme(axis.text=element_text(size=15),
axis.title=element_text(size=15,face="bold"),
legend.key.size = unit(1, 'cm'),
text=element_text(size=20),
legend.title = element_text(size=14), #change legend title font size
legend.text = element_text(size=14)) +ylim(0,8.5)+
geom_boxplot(outlier.size = 0.5) +
ylab(names(out.smai$gamma)[i])
sp
Note that MTRNR2L1 and DUSP1 are both DE genes (Megakaryocyte and CD14+ monocyte, respectively). The batch effects on these genes are relatively uniform across cell types, and therefore SMAI-align doesn’t affect DE results after integration.
The matrix \({\bf R}\in\mathbb{R}^{d\times d}\) essentially captures and remove the batch effects altering the gene correlation structures in each dataset.
# display rotation matrix
library(igraph)
rot.adj = out.smai$feature.rot
rownames(rot.adj)=rownames(data.X1.smai)
colnames(rot.adj)=rownames(data.X1.smai)
# obtain sparse graph representation of R by thresholding its entries
rot.adj[abs(rot.adj)<quantile(abs(rot.adj),0.9999)]=0
diag(rot.adj)=0
rm.ind=intersect(which(rowSums(rot.adj!=0)==0), which(colSums(rot.adj!=0)==0))
rot.adj = rot.adj[-rm.ind,-rm.ind]
g1 = graph_from_adjacency_matrix(abs(rot.adj), mode="directed",weighted = TRUE)
plot(g1,vertex.color="lightblue", vertex.size=3,edge.width = E(g1)$weight*10,
edge.arrow.size=0.4, vertex.label.cex=0.8, vertex.label.dist=0.2)
The network illustrates the feature rotations needed to remove the corresponding batch effects. The width of the edge indicates the magnitude of the correction.