Generate datasets

We consider a simple experiment where the two datasets each contain a low-dimensional manifold, a torus, as their shared signal structure. Let’s first generate and visualize random samples from a torus in \(\mathbb{R}^3\).

n=5000
u = runif(n, 0, 2*pi)
v = runif(n, 0, 2*pi)
data = cbind( (2+0.8*cos(u))*cos(v),
              (2+0.8*cos(u))*sin(v),
              0.8*sin(u))
colfunc <- colorRampPalette(c("blue","forestgreen", "yellow"))
scatterplot3d(x=data[,1], y=data[,2], z=data[,3], angle=35, color = colfunc(n)[rank(-data[,1])],pch=20,
              xlab=NA, ylab=NA, zlab=NA)

Next we generate noisy, high-dimensional observations for both datasets. To do so, we embed the clean, noiseless samples from the torus into a much higher dimensional Euclidean space, and then add Gaussian noise to each data point. As a result, each dataset has dimensionality \(p=800\), containing the torus structure as (part of) its latent low-dimensional signal. For sample sizes, we set \(n_1=500\) for the first dataset and \(n_2=700\) for the second dataset.

In particular, to create nontrivial signal descripancy between the two datasets, for the second datasets, we add an additional signal structure (uniform signals between \([-20,20]\)) to 20 coordinates that are orthogonal to those containing the torus structure.

n1=500
n2=700
p=800
r=3 #assume the latent dimension is known
    
data.X1=data[sample(dim(data)[1],n1),]
data.X2=data[sample(dim(data)[1],n2),]
data.Z=matrix(rnorm(n1*p,0,0.4), ncol=p)
data.Y1 = data.Z
data.Y1[,1:r] = data.Y1[,1:r]+as.matrix(data.X1*5)
    
data.Z=matrix(rnorm(n2*p,0,1), ncol=p)
data.Y2 = data.Z
data.Y2[,1:r] = data.Y2[,1:r]+as.matrix(data.X2*5)
data.Y2[,(r+1):(r+20)] = data.Y2[,(r+1):(r+20)]+matrix(runif(n2*20,-20,20),ncol=20) #noise in Y2

These two datasets were each centered before joint analysis. The input of our algorithm should be two matrices, whose rows are samples and columns are features.

data.Y1 = scale(data.Y1, scale=FALSE)
data.Y2 = scale(data.Y2, scale=FALSE)

Let’s look at the marginal pairwise scatter plots for each dataset separately, focusing on their first four coordiantes.

pairs(data.Y1[,1:4])

pairs(data.Y2[,1:4])

Apparently, there is a shared manifold structure between the two datasets, in their first three coordinates. Note that the substantial contrast in the scale of the fourth coordinate between the two datasets.

Joint embedding by PCA, kernel PCA, and the duo-landmark method

It can be seen, for example, from a joint PCA (combine the two datasets and apply PCA) embedding of both datasets that, even after centering, there are significant discrepancy between the two datasets. In all of the joint embedding plots below, the red dots indicate samples in the first dataset, whereas the blue dots correspond to the samples in the second dataset.

pca.svd = svds(rbind(data.Y1,data.Y2), k=r)
X.combined =pca.svd$u[,1:r] %*% diag(pca.svd$d[1:r])^(1/2)
plot(X.combined[,c(1,2)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="PC1",ylab="PC2")

plot(X.combined[,c(2,3)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="PC2",ylab="PC3")

Moreover, a joint kernel PCA embedding of both datatsets also does not characterize the shared structure between the two datasets. Here, in line with Ma and Ding (2023) , we choose the bandwidth to be the median of all pairwise Euclidean distances between the samples across datasets.

dist.mat = Dist((rbind(data.Y1, data.Y2)))
K.mat.all = exp(-as.matrix(dist.mat)^2/quantile(dist.mat,0.5)^2)
kpca.svd = eigs(K.mat.all, k=r+1)
X.combined = kpca.svd$vectors[,1:r+1] %*% diag(kpca.svd$values[1:r+1])^(1/2)
plot(X.combined[,c(1,2)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="KPC1",ylab="KPC2")

plot(X.combined[,c(2,3)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="KPC2",ylab="KPC3")

Next we apply the proposed duo-landmark method to obtain a joint embedding. The R function that implements the proposed method is given below, which can be simply copied and ran for other related tasks.

DL.embed <- function(X,Y, w=0.5, Gamma=NULL){
    if(is.null(Gamma)){
    Gamma=c(2:min(dim(X),dim(Y)))
  }else{
    Gamma=Gamma+1
  }
  dist.mat = Dist(rbind(data.Y1, data.Y2))
  cross.dist = as.matrix(dist.mat)[1:dim(data.Y1)[1], (dim(data.Y1)[1]+1):(dim(data.Y1)[1]+dim(data.Y2)[1])]
  
  K.mat = exp(-cross.dist^2/quantile(cross.dist,0.5)^2)  
  prop.svd = svds(K.mat, k=max(Gamma))
  
  out1= sqrt(dim(prop.svd$u[,Gamma])[1])*prop.svd$u[,Gamma] %*% diag(prop.svd$d[Gamma])^(1/2)
  out2= sqrt(dim(prop.svd$v[,Gamma])[1])*prop.svd$v[,Gamma] %*% diag(prop.svd$d[Gamma])^(1/2)
  
  return(list(embed.data1=out1, embed.data2=out2))
}

Here we are interested in identifying the shared latent structure through a 2- or 3-dimensional joint embedding. We thus set Gamma=1:3 in DL.embed(). Note that internally our function automatically omit the latent space associated to the 1st singular vectors of the duo-landmark kernel matrix due to its lack of information (typically a nearly constant vector).

As can be seen below, the duo-landmark joint embedding successfully captures the shared manifold structure between the two datasets.

embed.out = DL.embed(data.Y1,data.Y2, Gamma=1:3)
X.combined = rbind(embed.out$embed.data1, embed.out$embed.data2)
plot(X.combined[,c(1,2)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="Dim 1",ylab="Dim 2", main="duo-landmark joint embedding")

plot(X.combined[,c(2,3)], col=c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="Dim 2",ylab="Dim 3")

A 3d scatterplot show a similar pattern.

scatterplot3d(x=X.combined[,2], y=X.combined[,3], z=X.combined[,1], angle=25, color =c(rep("red",dim(data.Y1)[1]), rep("blue",dim(data.Y2)[1])),pch=20, xlab="Dim 2", ylab="Dim 3", zlab="Dim 1")