Visualize Data – PCA vs t-SNE vs TSVD

When working with ML, we are usually dealing with vectors in high dimensions, which makes it very difficult to visualize and inspect the data directly. The methods named in the title are a few we can use to visualize such data.

PCA (Principal Component Analysis)

This is one of the most popular and simplest methods. I won’t go into detail on how it works, as there are many blogs that explain it in detail. In short, this is what it does:

“Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.”
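To make that quote a bit more concrete, here is a minimal NumPy sketch of the same idea: center each feature, take the SVD, and project onto the top singular directions. The function name pca_via_svd and the random demo data are made up for illustration; this is not scikit-learn's actual implementation, just the textbook recipe the quote describes.

```python
import numpy as np

def pca_via_svd(X, n_components=2):
    """Project X onto its top principal axes using a plain SVD (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    # Center each feature (column) on its mean; no scaling is applied.
    X_centered = X - X.mean(axis=0)
    # Thin SVD: X_centered = U @ diag(S) @ Vt, singular values in descending order.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Rows of Vt are the principal axes; keep the first n_components of them.
    components = Vt[:n_components]
    # Coordinates of every sample in the reduced space.
    return X_centered @ components.T

# Made-up data: 100 samples with 5 features each.
rng = np.random.default_rng(0)
demo = rng.normal(size=(100, 5))
demo_2d = pca_via_svd(demo)
print(demo_2d.shape)
```

The result should match what PCA(n_components=2) gives on the same data, possibly with the sign of a column flipped, since the direction of each axis is arbitrary.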

Let’s see this in practice to understand our TF-IDF vectors from the previous blog post:

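import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Convert the sparse TF-IDF matrix to a dense NumPy array
# (todense() returns np.matrix, which recent scikit-learn versions reject)
X = Xtr.toarray()

# Fit PCA and project the data onto the first two principal components
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:, 0], data2D[:, 1])
plt.show()  # not required if using an IPython notebook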

We get an output like this.


t-SNE (t-distributed Stochastic Neighbor Embedding) is another method to reduce dimensions and visualize data. You can read about it in detail here

To implement it:

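# t-SNE plot
import numpy as np
from sklearn.manifold import TSNE

# Embed the data in two dimensions; np.asarray guards against X being an np.matrix
embeddings = TSNE(n_components=2)
Y = embeddings.fit_transform(np.asarray(X))
plt.scatter(Y[:, 0], Y[:, 1])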
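The title also mentions TSVD, but no example for it appears in this post. Assuming TSVD means scikit-learn's TruncatedSVD, a minimal sketch could look like the one below. The tiny corpus and the names docs, Xtr_demo and data2D_svd are made up for illustration; in practice you would pass the real TF-IDF matrix instead.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in corpus; replace with the real documents.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
    "stock markets fell sharply today",
    "investors sold stocks as markets fell",
    "the market rallied after the sell off",
]

# TfidfVectorizer returns a sparse matrix, like the TF-IDF vectors in this post.
Xtr_demo = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD does not center the data, so it accepts the sparse matrix
# directly and no dense conversion is needed.
svd = TruncatedSVD(n_components=2, random_state=0)
data2D_svd = svd.fit_transform(Xtr_demo)
print(data2D_svd.shape)
```

The two columns of data2D_svd can be passed to plt.scatter in the same way as the PCA output. Skipping the dense conversion is the main reason to consider this over PCA when the TF-IDF matrix is large.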

There are more ways to visualize data, such as using K-Means clusters.
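One way this could work, as a rough sketch rather than the approach from the full source: cluster the vectors with KMeans, reduce them to 2-D, and use the cluster labels as colors. The data here is made up (two blobs in 10 dimensions), and the choice of n_clusters=2 is an assumption that fits this toy data only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Made-up data: two well-separated blobs of 50 points each in 10 dimensions.
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(50, 10)),
    rng.normal(loc=8.0, scale=1.0, size=(50, 10)),
])

# Cluster in the original high-dimensional space.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_demo)

# Project to 2-D only for plotting; the clustering itself used all 10 dimensions.
points_2d = PCA(n_components=2).fit_transform(X_demo)
print(points_2d.shape, np.bincount(labels))
```

Calling plt.scatter(points_2d[:, 0], points_2d[:, 1], c=labels) then draws each cluster in its own color.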

See full source here