In this post we will go over the basic concepts of word2vec and see how to implement and use it.
Previously we have seen models like Count Vectors and TF-IDF. While these models are useful, they are based purely on word frequency: they lose most of the characteristics of the language and the meanings of the words. Word2vec is a model in which words are mapped into a vector space such that similar words end up with vectors that are close to each other.
This means that word vectors have actual meaning and are not just random numbers. And since word vectors are numbers that carry meaning, we can add and subtract them to find new words! The classic example: vector("king") - vector("man") + vector("woman") lands very close to vector("queen").
This is very counter-intuitive at first, but once you understand the model in detail it becomes very exciting to play around with.
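To make this concrete, here is a minimal sketch of that arithmetic using Gensim's downloader API (an assumption on my part; any pre-trained vectors would do, and "glove-wiki-gigaword-50" is just one of the smaller models it offers):

```python
# A minimal sketch, assuming gensim is installed and you are fine with
# downloading the small "glove-wiki-gigaword-50" vectors on first use
# (they are cached locally afterwards).
import gensim.downloader as api

# Load pre-trained word vectors as a KeyedVectors object.
vectors = api.load("glove-wiki-gigaword-50")

# Classic analogy: king - man + woman should land near queen.
result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # expect something like [('queen', 0.8...)]
```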
Below are some must-read articles to understand word2vec in detail:
https://gist.github.com/aparrish/2f562e3737544cf29aaf1af30362f469
http://jalammar.github.io/illustrated-word2vec/
If you have read the above articles properly, two things should be clear: we can either use pre-trained word vectors like GloVe, or train our own word2vec model on our own data set using a library like Gensim.
Let’s see how to implement both of them.
https://www.machinelearningplus.com/nlp/gensim-tutorial/
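And here is a minimal sketch of training your own word2vec model with Gensim, assuming Gensim 4.x (older versions use `size` instead of `vector_size`) and a toy tokenized corpus in place of real data:

```python
# A minimal training sketch, assuming gensim 4.x.
from gensim.models import Word2Vec

# Each document is a list of tokens; in practice you would tokenize real text.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context window size
    min_count=1,      # keep every word, even rare ones (toy corpus)
    workers=4,        # parallel training threads
)

# Look up the learned vector and nearest neighbours for a word.
vec = model.wv["cat"]
print(model.wv.most_similar("cat", topn=3))

# Persist the model for later use.
model.save("word2vec.model")
```

With a corpus this tiny the neighbours are essentially noise; in practice you would feed in thousands of tokenized sentences before the vectors become meaningful.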
fastText vectors are a similar set of word vectors produced by Facebook; you can read about them here: https://fasttext.cc/docs/en/english-vectors.html
To use them in code, and to see how the Gensim fastText model produces vectors for out-of-vocabulary words, follow this: https://stackoverflow.com/questions/50828314/how-does-the-gensim-fasttext-pre-trained-model-get-vectors-for-out-of-vocabulary
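A minimal sketch of that out-of-vocabulary behaviour, again assuming Gensim 4.x: because fastText builds word vectors out of character n-grams, it can compose a vector even for a word it never saw during training.

```python
# A toy out-of-vocabulary demo, assuming gensim 4.x.
from gensim.models import FastText

corpus = [
    ["machine", "learning", "is", "fun"],
    ["deep", "learning", "uses", "neural", "networks"],
]

model = FastText(sentences=corpus, vector_size=50, window=3, min_count=1)

# "learnings" never appears in the corpus, but fastText still returns a
# vector by composing the character n-grams it shares with seen words.
print("learnings" in model.wv.key_to_index)  # False: not in the vocabulary
vec = model.wv["learnings"]                  # still works
print(vec.shape)                             # (50,)
```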
For a deeper dive into word2vec and its document-level extension doc2vec: https://shuzhanfan.github.io/2018/08/understanding-word2vec-and-doc2vec/
For general playing around, here is a Colab notebook:
https://colab.research.google.com/drive/1W7_F0JaU6Xyhfyyi3Sq_QkTTfsrz7sCx