When working with ML, we are usually dealing with vectors in high-dimensional spaces, which makes the data very hard to visualize directly. Below are a few methods we can use to visualize such data.
PCA (Principal Component Analysis)
This is one of the most popular and simplest approaches. I won't go into detail on how it works; there are many blogs that explain it in depth. In short, this is what it does:
“Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.”
Let's see this in practice on the tf-idf vectors from the previous blog.
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Xtr is the sparse tf-idf matrix from the previous blog; PCA needs a dense array
X = Xtr.toarray()
# Project the high-dimensional tf-idf vectors down to 2 components
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)
plt.scatter(data2D[:, 0], data2D[:, 1])
# plt.show()  # not required if using an IPython notebook
We get an output like this.
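As a quick sanity check on how much information the 2-D projection preserves, we can look at the explained variance of the two components. A minimal sketch, assuming the pca object fitted above:

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
# tf-idf vectors are very high dimensional, so two components
# typically capture only a small fraction of the total variance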
t-SNE
t-SNE is another method to reduce dimensions and visualize data. You can read about it in detail here: https://mlexplained.com/2018/09/14/paper-dissected-visualizing-data-using-t-sne-explained/
To implement it:
# t-SNE plot
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the tf-idf vectors to 2 dimensions with t-SNE
embeddings = TSNE(n_components=2)
Y = embeddings.fit_transform(X)
plt.scatter(Y[:, 0], Y[:, 1], cmap=plt.cm.Spectral)
# plt.show()
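Note that t-SNE output depends heavily on hyperparameters such as perplexity. A rough sketch of how you might compare a couple of settings, assuming the same X as above (the specific values 5 and 30 are just illustrative):

# Compare two perplexity values; lower values emphasize local structure
for perplexity in (5, 30):
    Y = TSNE(n_components=2, perplexity=perplexity).fit_transform(X)
    plt.scatter(Y[:, 0], Y[:, 1], label=f"perplexity={perplexity}", alpha=0.5)
plt.legend()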
There are more ways to visualize data, for example by coloring the 2-D points with K-Means cluster labels, as sketched below.
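A minimal sketch of that idea, reusing the data2D projection from the PCA section above; the cluster count of 5 is an arbitrary choice for illustration, not from the original post:

from sklearn.cluster import KMeans

# Cluster the tf-idf vectors, then color the 2-D PCA projection by cluster label
kmeans = KMeans(n_clusters=5, random_state=42)
labels = kmeans.fit_predict(X)
plt.scatter(data2D[:, 0], data2D[:, 1], c=labels, cmap=plt.cm.Spectral)
# plt.show()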
See the full source here: https://colab.research.google.com/drive/16AMYXE7EzoseR8FrkWSlC5JOfvEy_3JJ