Load some useful packages and the R functions in main_fun.R, which can be downloaded from our GitHub page https://github.com/rongstat/meta-visualization/blob/main/main_fun.R.
source("R Codes/main_fun.R")
library(rARPACK)
library(MASS)
library(lle)
library(dimRed)
library(uwot)
library(cluster)
library(phateR)
library(Rtsne)
library(ggplot2)
We use the religious text data of Sah and Fokoue (2019), downloaded from https://archive.ics.uci.edu/ml/datasets/A+study+of++Asian+Religious+and+Biblical+Texts and also analyzed in our paper (Ma, Sun and Zou, 2022+).
This dataset contains n = 590 fragments of text, extracted from English translations of eight religious books or sacred scripts: Book of Proverbs (BOP), Book of Ecclesiastes (BOE1), Book of Ecclesiasticus (BOE2), Book of Wisdom (BOW), Four Noble Truths of Buddhism (BUD), Tao Te Ching (TTC), Yogasutras (YOG) and Upanishads (UPA). All the texts were pre-processed using natural language processing into a 590 x 8265 document-term matrix that counts the frequency of 8265 atomic words, such as truth, diligent, sense and power, in each text fragment. In other words, each text fragment is treated as a bag of words and represented by a vector with 8265 features. The word counts are centred and normalized before downstream analysis.
This dataset is also available on our GitHub page https://github.com/rongstat/meta-visualization/blob/main/Data/AllBooks_baseline_DTM_Labelled.csv.
data = read.csv("Data/AllBooks_baseline_DTM_Labelled.csv")
info = data[,1]
info = gsub("\\_.*", "",info)
data = data[,-1]
data = data[,which(colSums(data)!=0)]
data = scale(data, center=TRUE, scale = TRUE)
n=dim(data)[1]
info=factor(info)
levels(info) = c("BOE1", "BOE2", "BOP", "BOW", "BUD", "TTC", "UPA","YOG")
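The preprocessing steps above can be illustrated on a toy document-term matrix. This is a minimal sketch in base R; the label strings and word counts are made up for illustration and are not taken from the real dataset. It shows why all-zero columns are dropped before scaling: a word that never occurs has zero variance, so standardizing it would produce NaN values.

```r
# Hypothetical fragment labels of the form "<book>_<chapter>"
labels <- c("BookA_Ch1", "BookA_Ch2", "BookB_Ch1", "BookB_Ch2")
book <- gsub("_.*", "", labels)   # keep only the part before the underscore

# Toy document-term matrix: 4 fragments x 3 words; "zzz" never occurs
dtm <- data.frame(truth = c(2, 0, 1, 3),
                  power = c(0, 1, 1, 0),
                  zzz   = c(0, 0, 0, 0))

# Scaling a zero-variance column divides 0 by 0
has.nan <- any(is.nan(scale(dtm)))

# Drop words that never occur, then centre and scale each remaining column
dtm <- dtm[, colSums(dtm) != 0, drop = FALSE]
dtm.scaled <- scale(dtm, center = TRUE, scale = TRUE)

colMeans(dtm.scaled)       # column means after centring
apply(dtm.scaled, 2, sd)   # column standard deviations after scaling
```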
We apply our candidate.visual() function to obtain 16 candidate visualizations from 12 different embedding algorithms (kPCA, UMAP, tSNE and PHATE are each run with two parameter settings).
This may take 3-5 minutes.
candidate.out = candidate.visual(data, method=c("PCA", "MDS", "iMDS", "Sammon", "LLE", "HLLE","Isomap",
"kPCA", "LEIM", "UMAP", "tSNE","PHATE"),
kpca.sigma = c(0.01, 0.001),
umap.k= c(30, 50),
tsne.perplexity = c(10, 50),
phate.k = c(30, 50))
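To make the shape of this output concrete, the following minimal sketch builds a small list of candidate embeddings by hand using base R and MASS. It is not the implementation of candidate.visual(); the toy data and the choice of three methods are assumptions made for illustration, and the sketch only mimics the embed.list and method_name fields used in this tutorial.

```r
library(MASS)  # for sammon()

set.seed(1)
x <- scale(matrix(rnorm(50 * 10), 50, 10))  # toy data: 50 samples, 10 features
d <- dist(x)                                # Euclidean distances between samples

# Each candidate is an n x 2 matrix of coordinates
toy.out <- list(
  embed.list = list(
    prcomp(x)$x[, 1:2],                       # PCA
    cmdscale(d, k = 2),                       # classical MDS
    sammon(d, k = 2, trace = FALSE)$points    # Sammon mapping
  ),
  method_name = c("PCA", "MDS", "Sammon")
)

sapply(toy.out$embed.list, dim)  # one column of (rows, cols) per candidate
```

Any method that returns two-dimensional coordinates for the same n samples could be appended to such a list in the same way.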
Below are a few examples of candidate visualizations.
k=1
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
k=9
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
k=13
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
k=15
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
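The four plots above repeat the same code with a different k. A small helper can remove the repetition. The function name plot_embedding and the toy inputs below are hypothetical and are not part of main_fun.R; this is only a sketch of one way to wrap the plotting code.

```r
library(ggplot2)

# Scatter plot of a 2-D embedding, with shape and colour given by the labels
plot_embedding <- function(embed, labels, title = "") {
  data.plot <- data.frame(dim1 = embed[, 1], dim2 = embed[, 2],
                          cluster = factor(labels))
  ggplot(data.plot, aes(x = dim1, y = dim2)) +
    geom_point(size = 1.5, aes(shape = cluster, color = cluster)) +
    scale_shape_manual(values = 1:nlevels(data.plot$cluster)) +
    ggtitle(title)
}

# Toy usage: 20 random points in two groups
set.seed(1)
toy.embed <- matrix(rnorm(40), 20, 2)
toy.labels <- rep(c("a", "b"), each = 10)
p <- plot_embedding(toy.embed, toy.labels, "toy")
```

With the objects from this tutorial, a call such as plot_embedding(candidate.out$embed.list[[k]], info, candidate.out$method_name[k]) would reproduce each of the plots above.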
We apply our spectral method, which simultaneously obtains (i) the sample-specific eigenscores for each candidate visualization, quantifying the reliability and faithfulness of each point, and (ii) the consensus meta-distance matrix. Here we use the recommended function ensemble.viz() in main_fun.R.
ensemble.out = ensemble.viz(candidate.out$embed.list, candidate.out$method_name)
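To give some intuition for what a spectral combination of this kind can look like, here is a simplified sketch on toy data. It is an assumption about the general idea, not the code of ensemble.viz(): for each sample, the distances to all other samples are normalized within each candidate, the candidates are compared through their cosine similarities, and the leading eigenvector of that similarity matrix is used both as the eigenscore and as the weights for combining the distance profiles.

```r
set.seed(1)
n <- 30
x <- matrix(rnorm(n * 5), n, 5)

# Three toy candidates: PCA, classical MDS, and a pure-noise embedding
embed.list <- list(
  prcomp(x)$x[, 1:2],
  cmdscale(dist(x), k = 2),
  matrix(rnorm(n * 2), n, 2)
)
K <- length(embed.list)
dist.list <- lapply(embed.list, function(e) as.matrix(dist(e)))

eigenscore <- matrix(NA_real_, n, K)
meta.dist <- matrix(0, n, n)
for (i in seq_len(n)) {
  # Column k: unit-norm vector of distances from sample i in candidate k
  P <- sapply(dist.list, function(d) d[i, ] / sqrt(sum(d[i, ]^2)))
  G <- crossprod(P)                                  # K x K cosine similarities
  u <- abs(eigen(G, symmetric = TRUE)$vectors[, 1])  # leading eigenvector
  eigenscore[i, ] <- u                               # one score per candidate
  meta.dist[i, ] <- P %*% u                          # weighted distance profile
}
meta.dist <- (meta.dist + t(meta.dist)) / 2          # symmetrize

colMeans(eigenscore)  # average score of each candidate
```

In this toy example PCA and classical MDS agree with each other, so the noise embedding, which agrees with neither, receives the lowest score for every sample.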
We can assess the candidate visualizations by looking at boxplots of their eigenscores. In general, methods with higher eigenscores produce more faithful embeddings of the original data.
data.plot = data.frame(eigen.score = c(ensemble.out$eigenscore), method = rep(candidate.out$method_name, each=n))
ggplot(data.plot, aes(x=reorder(method, eigen.score, FUN=median), y=eigen.score)) +
geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
ylab("eigenscore") + xlab("method")
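The ordering used on the x-axis of the boxplot can also be read off as a table of median eigenscores. The sketch below uses a made-up n x K score matrix and made-up method names in place of ensemble.out$eigenscore and candidate.out$method_name, so the numbers carry no meaning beyond illustration.

```r
set.seed(1)
method_name <- c("PCA", "UMAP", "tSNE")          # hypothetical candidates
eigenscore <- cbind(runif(100, 0.1, 0.3),        # toy scores, one column
                    runif(100, 0.5, 0.7),        # per candidate
                    runif(100, 0.3, 0.5))

# Median eigenscore per candidate, sorted from highest to lowest
med <- setNames(apply(eigenscore, 2, median), method_name)
ranking <- sort(med, decreasing = TRUE)
ranking
```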
We can also visualize the eigenscores for each candidate visualization. Below are some examples.
k=8
data.plot = data.frame(dim1=candidate.out[[1]][[k]][,1], dim2=candidate.out[[1]][[k]][,2],
cluster=factor(info), eigenscore = c(ensemble.out[[2]][,k]))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=eigenscore)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
k=16
data.plot = data.frame(dim1=candidate.out[[1]][[k]][,1], dim2=candidate.out[[1]][[k]][,2],
cluster=factor(info), eigenscore = c(ensemble.out[[2]][,k]))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=eigenscore)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method[k])
Finally, we apply UMAP to the meta-distance matrix, to obtain the final meta-visualization.
ensemble.data=umap(as.dist(ensemble.out$ensemble.dist.mat), n_neighbors = 50)
data.plot = data.frame(dim1=ensemble.data[,1], dim2=ensemble.data[,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
scale_shape_manual(values=1:nlevels(data.plot$cluster))+ ggtitle("meta-spec")
The meta-visualization shows substantially better clustering of the text fragments by source than the individual candidate visualizations. In addition, the meta-visualization reflects deeper relationships between the eight religious books, such as the similarity between the two Hindu books YOG and UPA, the similarity between Buddhism (BUD) and Taoism (TTC), the similarity between the four Christian books BOE1, BOE2, BOP and BOW, and the general discrepancy between Asian religions (Hinduism, Buddhism, Taoism) and non-Asian religions (Christianity).
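One way to put a number on how well an embedding separates known groups is the average silhouette width from the cluster package, computed with the known labels in place of estimated clusters. The sketch below is an illustration on simulated data; the helper name mean_sil and the two toy embeddings are assumptions, and this is not necessarily the evaluation used in the paper.

```r
library(cluster)

set.seed(1)
labels <- factor(rep(c("A", "B", "C"), each = 20))
centers <- rbind(c(0, 0), c(5, 0), c(0, 5))

# Two toy embeddings of the same 60 points: well separated vs. heavily mixed
tight <- centers[as.integer(labels), ] + matrix(rnorm(120, sd = 0.5), 60, 2)
loose <- centers[as.integer(labels), ] + matrix(rnorm(120, sd = 3), 60, 2)

# Average silhouette width of the known labels in a given embedding
mean_sil <- function(embed, labels) {
  sil <- silhouette(as.integer(labels), dist(embed))
  mean(sil[, "sil_width"])
}

sil.tight <- mean_sil(tight, labels)
sil.loose <- mean_sil(loose, labels)
c(tight = sil.tight, loose = sil.loose)
```

Values closer to 1 indicate that points sit nearer to their own group than to any other group. With the objects from this tutorial, mean_sil(ensemble.data, info) could be compared against mean_sil(candidate.out$embed.list[[k]], info) for each k.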