Word2Vec

An Introduction to Word2Vec

Natural Language Processing

Technology

Word2Vec is a language model developed by researchers at Google in 2013. The word embeddings produced by word2vec are useful because they not only convert words into vector form but also carry the contextual meaning of the words, which was missing in earlier methods of converting words to vectors. The word embeddings in word2vec models are created while training a neural network. In this article, we will discuss word2vec models. But before talking about word2vec, we will look at earlier language models and the downsides that led to the creation of models like word2vec. We will cover how word2vec creates word embeddings, but we will not go into much detail about the methods proposed before it.

<center><h2><b>Bag-of-Words Approach</b></h2></center>

The need to convert words to vectors arises because computers can only work with numbers. One of the earliest approaches in this field was the bag-of-words approach. In the bag-of-words method, we do not care about the context or the meaning of the words; we simply convert them to vectors if they appear in the corpus (training dataset). Some common bag-of-words approaches are one-hot encoding, term (or word) frequency, and TF-IDF.

<b>TF-IDF</b> (Term Frequency-Inverse Document Frequency) is considered the best approach among those named above. It captures which words are more relevant and which are less relevant to a document, and the importance of a word in a text is of great significance in information retrieval. In NLP, <i>a corpus consists of a set of documents</i>. A document can be a paragraph, a sentence, or even a whole article. In the TF-IDF approach, the model not only counts the number of times a word appears in a document but also considers the number of times the word appears in other documents, which helps in deciding the "importance" of the word. (A short code sketch of these vectorizers appears at the end of this section.)

<img src="https://miro.medium.com/max/720/1*1pTLnoOPJKKcKIcRi3q0WA.jpeg" />

One of the problems with bag-of-words vectorizers, though, is that they <b>do not take into account the meaning of the words or the context in which a word was used</b>, and hence they cannot be used for tasks like next-word prediction, filling in missing words, etc.

<center><h2><b>Markov Chains</b></h2></center>

Markov chains come under the category of <b>probabilistic models</b> and were one of the first approaches for word-prediction tasks. A Markov chain does not convert words to vectors; it is, more specifically, a machine-learning model for <b>word prediction</b>. The Markovian property states that the <i>evolution of the Markov process in the future depends only on the present state and not on history</i>. This means that the model takes into account only the current word to predict the next word in the sentence.

<img src="https://aman.ai/coursera-nlp/assets/probabilistic/16.png" />

The Markov model works quite well in several NLP-related tasks. The model is trained by creating a directed graph from one part-of-speech (POS) tag to another, with each edge carrying the probability of transitioning from one POS tag to the next. In language, <b>a part of speech (POS) is a category to which a word is assigned in accordance with its syntactic functions</b>. Some examples of parts of speech in English are nouns, verbs, adverbs, etc. Every POS tag points to a set of words with associated probabilities. These probabilities are calculated from the number of times a word appears in the corpus with a particular POS tag. Though the model works quite well (a simplified sketch follows below), even this approach <b>misses the meaning of the words</b>: we cannot tell whether two words have similar or opposite meanings, find analogies, etc.
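To make the idea concrete, here is a minimal sketch of a Markov-chain next-word predictor. Note that, for simplicity, it works directly on word-to-word transition counts rather than the POS-tag graph described above, and the toy corpus is invented purely for illustration.

```python
# A simplified word-level Markov chain for next-word prediction.
# (The article describes a POS-tag-based variant; this sketch uses raw
# word-to-word transition counts instead, to keep the core idea visible.)
import random
from collections import defaultdict, Counter

corpus = "the cat sat on the mat . the dog sat on the log .".split()

# Count transitions: how often each word is followed by each other word.
transitions = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word][next_word] += 1

def predict_next(word):
    """Sample the next word in proportion to the observed transition counts."""
    followers = transitions[word]
    if not followers:
        return None
    words, counts = zip(*followers.items())
    return random.choices(words, weights=counts, k=1)[0]

print(predict_next("the"))   # e.g. "cat", "dog", "mat" or "log"
print(predict_next("sat"))   # always "on" in this toy corpus
```

Because only the current word is consulted, the sampler reflects the Markovian property described above: history beyond the present state is ignored.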
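Returning to the bag-of-words vectorizers discussed earlier, the following sketch contrasts raw term counts with TF-IDF weights using scikit-learn (assuming a recent version that provides `get_feature_names_out`). The toy documents are invented for illustration.

```python
# A minimal sketch of the bag-of-words family using scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Plain term frequency: each document becomes a vector of raw word counts.
count_vec = CountVectorizer()
counts = count_vec.fit_transform(documents)
print(count_vec.get_feature_names_out())
print(counts.toarray())

# TF-IDF: words that appear in every document (like "the") are down-weighted,
# while words that are distinctive to one document are up-weighted.
tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(documents)
print(tfidf.toarray().round(2))
```

Notice that neither representation encodes word order or meaning, which is exactly the limitation that motivates the models discussed next.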
<center><h2><b>Word2Vec</b></h2></center>

Word2vec was introduced to tackle the problems of the previous models. The word embeddings produced by word2vec carry the meaning of the word. Hence, we can use vector operations such as addition and subtraction, dot products, etc., to find similar words, opposite words, word analogies, and so on. Word2vec is a <b>self-supervised learning model</b>: a self-supervised model learns from unlabeled datasets. The rationale is that if <b>two distinct words are both frequently surrounded by a similar set of words across various sentences, then those two words tend to be related semantically</b>. There are other self-supervised learning models for creating word embeddings that carry word meaning, like GloVe (introduced by Stanford in 2014) and FastText (introduced by Facebook in 2016).

<center><h2><b>How Are Word2Vec Embeddings Created?</b></h2></center>

Word2Vec uses a neural network to "solve" a problem like word prediction, and the word embeddings are formed as a by-product of training this network (that is why it is self-supervised). Word2Vec essentially has two architectures: <b>CBOW (Continuous Bag of Words) and Continuous Skip-Gram</b>.

In the CBOW architecture, we have a "sliding window" of C words. The model needs to <b>predict the middle word when the other words in the window are given as input</b>. The window slides one step at a time across the corpus, and the whole process can be repeated for as many epochs as the programmer wants. After the model is trained, the <b>weights between the input layer and the hidden layer act as the word embeddings</b>.

<img src="https://aman.ai/coursera-nlp/assets/probabilistic/45.png" />

The other architecture is the <b>Continuous Skip-Gram</b>. It also uses a sliding window, but rather than predicting the middle word of the window, we provide the central word as input and predict the surrounding words. Even though the prediction task has changed, the idea remains the same: the word embeddings emerge in the weight matrix between the input and hidden layers as the model is trained. (A minimal training sketch using the gensim library follows after the next section.)

<center><h2><b>Shortcomings</b></h2></center>

Word2vec has some shortcomings as well. Even though it is able to embed the meaning and the context of a word, <b>it falls behind if the model encounters a word that was not part of the training corpus</b>, because Word2Vec creates embeddings only for the words present in the training corpus. To address these shortcomings, newer models like BERT, GPT-2, etc. were introduced. They belong to the category of Transformer models and can work with words the model has not encountered before. We will discuss them in some other blog.
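As a hands-on illustration of the CBOW and Skip-Gram training described above, here is a minimal sketch using the gensim library (assuming gensim >= 4.0 for the `vector_size` and `epochs` parameter names). The tiny tokenised corpus is invented for illustration; real embeddings require far more text.

```python
# A minimal sketch of training word2vec with gensim (>= 4.0 assumed).
from gensim.models import Word2Vec

sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "cat"],
    ["the", "cat", "chases", "the", "mouse"],
]

# sg=0 -> CBOW (predict the centre word from its window),
# sg=1 -> continuous skip-gram (predict the window from the centre word).
cbow_model = Word2Vec(sentences, vector_size=50, window=2,
                      min_count=1, sg=0, epochs=100)
skipgram_model = Word2Vec(sentences, vector_size=50, window=2,
                          min_count=1, sg=1, epochs=100)

# The learned input-to-hidden weight matrix is the embedding table:
# one row per vocabulary word.
print(cbow_model.wv.vectors.shape)   # (vocabulary size, 50)
print(cbow_model.wv["king"][:5])     # first few dimensions of one embedding
```

The `window` parameter corresponds to the sliding window discussed above, and `epochs` controls how many times the window sweeps over the corpus.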
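Continuing from the sketch above, the snippet below shows the vector operations mentioned earlier (nearest neighbours and analogy-style addition and subtraction) as well as the out-of-vocabulary shortcoming. On such a tiny corpus the neighbours will be noisy; the point is the mechanics, not the quality of the results.

```python
# Cosine-similarity neighbours of a word.
print(cbow_model.wv.most_similar("king", topn=3))

# Analogy-style queries via vector addition and subtraction. On a large
# corpus the classic pattern most_similar(positive=["king", "woman"],
# negative=["man"]) tends to return "queen"; here we can only use words
# that exist in the toy vocabulary.
print(cbow_model.wv.most_similar(positive=["queen", "cat"],
                                 negative=["king"], topn=3))

# The out-of-vocabulary shortcoming: words absent from the training
# corpus simply have no embedding.
print("unicorn" in cbow_model.wv)   # False
# cbow_model.wv["unicorn"]          # would raise a KeyError
```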

- Ojas Srivastava, 12:35 AM, 15 Oct, 2022

Language Models