Quick Guide to Meta-Visualization

Get Started

Load the required packages and the R functions in main_fun.R, which can be downloaded from our GitHub page https://github.com/rongstat/meta-visualization/blob/main/main_fun.R.

source("R Codes/main_fun.R")
library(rARPACK)
library(MASS)
library(lle)
library(dimRed)
library(uwot)
library(cluster)
library(phateR)
library(Rtsne)
library(ggplot2)

Load Example Data

We use the religious text data of Sah and Fokoue (2019), downloaded from https://archive.ics.uci.edu/ml/datasets/A+study+of++Asian+Religious+and+Biblical+Texts, and also analyzed in our paper (Ma, Sun and James, 2022+).

This dataset contains n = 590 text fragments extracted from English translations of eight religious books or sacred scriptures: Book of Proverbs (BOP), Book of Ecclesiastes (BOE1), Book of Ecclesiasticus (BOE2), Book of Wisdom (BOW), Four Noble Truths of Buddhism (BUD), Tao Te Ching (TTC), Yogasutras (YOG) and Upanishads (UPA). All the texts were pre-processed using natural language processing into a 590x8265 document-term matrix that counts the frequency of 8265 atomic words, such as truth, diligent, sense, and power, in each text fragment. In other words, each text fragment was treated as a bag of words, represented by a vector with 8265 features. The word counts were centred and normalized before downstream analysis.

This dataset is also available on our GitHub page https://github.com/rongstat/meta-visualization/blob/main/Data/AllBooks_baseline_DTM_Labelled.csv.

data = read.csv("Data/AllBooks_baseline_DTM_Labelled.csv")
info = data[,1]
info = gsub("\\_.*", "",info)
data = data[,-1]
data = data[,which(colSums(data)!=0)]
data = scale(data, center=TRUE, scale = TRUE)
n=dim(data)[1]
info=factor(info)
levels(info) = c("BOE1", "BOE2", "BOP", "BOW", "BUD", "TTC", "UPA","YOG")
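The preprocessing steps above can be tried on a toy matrix. This is a minimal sketch in base R, assuming made-up labels and word counts (`toy_labels` and `dtm` are hypothetical, not taken from the dataset). It shows why all-zero columns are dropped before calling scale(): a column with zero standard deviation would otherwise become NaN.

```r
# Hypothetical labels of the form "<book>_<chapter>"; the real labels may differ.
toy_labels <- c("BookA_Ch1", "BookA_Ch2", "BookB_Ch1", "BookB_Ch2")
toy_books <- factor(gsub("\\_.*", "", toy_labels))  # keep the part before "_"

# Toy document-term matrix: 4 fragments x 3 words; "unused" never occurs.
dtm <- matrix(c(2, 0, 0,
                0, 1, 0,
                1, 1, 0,
                3, 0, 0), nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("truth", "power", "unused")))

# Drop all-zero columns first; scale() would turn them into NaN (0 / 0).
dtm <- dtm[, colSums(dtm) != 0, drop = FALSE]

# Centre each column to mean 0 and rescale it to standard deviation 1.
dtm_scaled <- scale(dtm, center = TRUE, scale = TRUE)

colnames(dtm_scaled)
levels(toy_books)
```

After this step every remaining column has mean 0 and standard deviation 1, so frequent and rare words contribute on the same scale.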

Get Candidate Visualizations

We apply our candidate.visual() function to obtain 16 candidate visualizations based on 12 different embedding algorithms, and store the result in candidate.out. This may take 3-5 minutes.

candidate.out = candidate.visual(data, method=c("PCA", "MDS", "iMDS", "Sammon", "LLE", "HLLE","Isomap",
                                                "kPCA", "LEIM", "UMAP", "tSNE","PHATE"),
                                 kpca.sigma = c(0.01, 0.001), 
                                 umap.k= c(30, 50), 
                                 tsne.perplexity = c(10, 50),
                                 phate.k = c(30, 50))
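The count of 16 candidates from 12 algorithms can be checked with a little arithmetic. The sketch below assumes that each value supplied for a tuning parameter yields one candidate and that the remaining methods contribute one candidate each. That assumption is consistent with the call above, but it is not read from main_fun.R.

```r
methods <- c("PCA", "MDS", "iMDS", "Sammon", "LLE", "HLLE", "Isomap",
             "kPCA", "LEIM", "UMAP", "tSNE", "PHATE")

# Assumed: one candidate per method by default ...
n_settings <- setNames(rep(1L, length(methods)), methods)
# ... and two for each method that was given two tuning values above.
n_settings[c("kPCA", "UMAP", "tSNE", "PHATE")] <- 2L

length(methods)   # 12 algorithms
sum(n_settings)   # 16 candidate visualizations
```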

Below are a few examples of candidate visualizations.

# Plot candidate k, with points coloured and shaped by source book.
k=1
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])

k=9
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])

k=13
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])

k=15
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])
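Since the plotting code above differs only in k, it could be wrapped in a small helper. The following is a sketch only: `plot_embedding` is a hypothetical name that does not appear in main_fun.R, and the toy matrix stands in for one element of candidate.out$embed.list.

```r
library(ggplot2)

# Hypothetical helper (not part of main_fun.R): scatter plot of one
# two-dimensional embedding, with points coloured and shaped by label.
plot_embedding <- function(embed, labels, title = "") {
  data.plot <- data.frame(dim1 = embed[, 1], dim2 = embed[, 2],
                          cluster = factor(labels))
  ggplot(data.plot, aes(x = dim1, y = dim2)) +
    geom_point(size = 1.5, aes(shape = cluster, color = cluster)) +
    scale_shape_manual(values = 1:nlevels(data.plot$cluster)) +
    ggtitle(title)
}

# Toy stand-ins for an embedding and its labels.
set.seed(1)
toy_embed <- matrix(rnorm(60), 30, 2)
toy_labels <- rep(c("A", "B", "C"), each = 10)
p <- plot_embedding(toy_embed, toy_labels, "toy embedding")
```

With the real objects, a call such as plot_embedding(candidate.out$embed.list[[k]], info, candidate.out$method_name[k]) would reproduce one of the plots above. Inside a loop or function, the returned object needs an explicit print() to be drawn.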

Get Meta-Visualization

We apply our spectral method, which simultaneously obtains (i) the sample-specific eigenscores for each candidate visualization, quantifying how reliably and faithfully that visualization represents each point, and (ii) the consensus meta-distance matrix. Here, we use the recommended function ensemble.viz() in main_fun.R.

ensemble.out = ensemble.viz(candidate.out$embed.list, candidate.out$method_name)
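To give some intuition for what a spectral weighting of this kind computes, here is a simplified base-R sketch. It is not the ensemble.viz() implementation from main_fun.R, and `eigenscore_sketch` is a hypothetical name. The sketch assumes that, for each point, every candidate is summarized by its normalized vector of distances to all other points, that candidates are compared through the inner products of these vectors, and that the leading eigenvector of the resulting similarity matrix supplies the weights.

```r
# Hypothetical sketch of a spectral weighting step; this is NOT the
# ensemble.viz() implementation from main_fun.R.
eigenscore_sketch <- function(embed_list) {
  n <- nrow(embed_list[[1]])
  K <- length(embed_list)
  # Pairwise Euclidean distances within each candidate embedding.
  dist_list <- lapply(embed_list, function(e) as.matrix(dist(e)))
  eigenscore <- matrix(0, n, K)
  meta_dist <- matrix(0, n, n)
  for (i in seq_len(n)) {
    # Column k: unit-norm distance profile of point i in candidate k.
    P <- sapply(dist_list, function(d) d[i, ] / sqrt(sum(d[i, ]^2)))
    # K x K similarity between candidates, as seen from point i.
    G <- crossprod(P)
    # Absolute entries of the leading eigenvector serve as the scores.
    u <- abs(eigen(G, symmetric = TRUE)$vectors[, 1])
    eigenscore[i, ] <- u
    # Score-weighted average of the distance profiles.
    meta_dist[i, ] <- P %*% (u / sum(u))
  }
  # Symmetrize so the result can be passed to as.dist().
  list(eigenscore = eigenscore, meta_dist = (meta_dist + t(meta_dist)) / 2)
}

# Toy input: two candidates that agree (one is a rotation of the other)
# and one unrelated random embedding.
set.seed(1)
x <- matrix(rnorm(40), 20, 2)
theta <- pi / 3
rot <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
toy <- list(x, x %*% rot, matrix(rnorm(40), 20, 2))
out <- eigenscore_sketch(toy)
colMeans(out$eigenscore)
```

In this toy setting the two agreeing candidates receive identical scores at every point, and both score higher than the unrelated one. That is the behaviour the eigenscores are meant to capture.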

We can assess the candidate visualizations by looking at boxplots of their eigenscores. In general, methods with higher eigenscores give more faithful embeddings of the original data.

data.plot = data.frame(eigen.score = c(ensemble.out$eigenscore), method = rep(candidate.out$method_name, each=n))
ggplot(data.plot, aes(x=reorder(method, eigen.score, FUN=median), y=eigen.score)) +
  geom_boxplot(outlier.size = 0.5) + theme(axis.text.x = element_text(angle = 40, vjust = 1, hjust=1)) +
  ylab("eigenscore") + xlab("method")
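The ordering used on the x-axis of the boxplot can also be tabulated directly. This is a minimal sketch on a made-up score matrix. It assumes, as the data.frame construction above implies, that the eigenscores form an n x K matrix with one column per candidate. The column names A, B and C are placeholders.

```r
# Hypothetical n x K eigenscore matrix (n = 50 points, K = 3 candidates).
set.seed(1)
scores <- cbind(A = runif(50, 0.60, 0.90),
                B = runif(50, 0.10, 0.30),
                C = runif(50, 0.35, 0.55))

# Median eigenscore per candidate, from most to least concordant.
med <- sort(apply(scores, 2, median), decreasing = TRUE)
names(med)
```

With the real objects, the matrix would be ensemble.out$eigenscore with column names set from candidate.out$method_name.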

We can also visualize the eigenscores for each candidate visualization. Below are some examples.

# Plot candidate k, with points coloured by eigenscore and shaped by source book.
k=8
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], 
                       cluster=factor(info), eigenscore = c(ensemble.out$eigenscore[,k]))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=eigenscore)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])

k=16
data.plot = data.frame(dim1=candidate.out$embed.list[[k]][,1], dim2=candidate.out$embed.list[[k]][,2], 
                       cluster=factor(info), eigenscore = c(ensemble.out$eigenscore[,k]))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=eigenscore)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster)) + ggtitle(candidate.out$method_name[k])

Finally, we apply UMAP to the meta-distance matrix, to obtain the final meta-visualization.

ensemble.data=umap(as.dist(ensemble.out$ensemble.dist.mat),  n_neighbors = 50)

data.plot = data.frame(dim1=ensemble.data[,1], dim2=ensemble.data[,2], cluster=factor(info))
ggplot(data.plot, aes(x=dim1, y=dim2)) + geom_point(size=1.5, aes(shape=cluster, color=cluster)) +
  scale_shape_manual(values=1:nlevels(data.plot$cluster))+ ggtitle("meta-spec")

The meta-visualization clusters the text fragments by source substantially better than the individual candidate visualizations. In addition, the meta-visualization reflects deeper relationships among the eight religious books, such as the similarity between the two Hindu books YOG and UPA, the similarity between Buddhism (BUD) and Taoism (TTC), the similarity between the four Christian books BOE1, BOE2, BOP, and BOW, as well as the general discrepancy between Asian religions (Hinduism, Buddhism, Taoism) and non-Asian religions (Christianity).
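One way a clustering claim of this kind could be checked numerically is with the average silhouette width from the cluster package loaded earlier. This is offered as an illustration, not as the evaluation used in the paper. The helper `mean_sil` and both toy embeddings are hypothetical.

```r
library(cluster)

# Hypothetical helper: average silhouette width of known labels in an embedding.
mean_sil <- function(embed, labels) {
  codes <- as.integer(factor(labels))      # silhouette() expects integer codes
  mean(silhouette(codes, dist(embed))[, "sil_width"])
}

# Toy embeddings: one separates the two groups, the other does not.
set.seed(1)
labels <- rep(c("A", "B"), each = 25)
separated <- rbind(matrix(rnorm(50, mean = 0), 25, 2),
                   matrix(rnorm(50, mean = 6), 25, 2))
mixed <- matrix(rnorm(100), 50, 2)

mean_sil(separated, labels)
mean_sil(mixed, labels)
```

Values near 1 indicate that points sit close to their own group and far from the others, while values near 0 indicate overlapping groups. The same helper could be applied to ensemble.data and to each candidate embedding with info as the labels.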

2022-11-01