NLP 101: Document Vectorization

In recent years, researchers have made significant contributions to the field of Artificial Intelligence, and the subfields of Computer Vision and Natural Language Processing (NLP) in particular have advanced rapidly. In this series of blogs, I will write about developments in NLP. This is the first part of the series, in which we will discuss vectorization, going from basic to more complex models.

<h2><b><center>Vectorization</center></b></h2>

Natural Language Processing involves working with natural language data such as text and speech and analyzing it to perform different tasks. The first task we need to complete before training a model is to convert the text (or speech) into vector (numerical) form, because computers can only understand the language of numbers. There are different techniques for converting words into vectors. We will discuss them and look at their advantages and disadvantages.

<h2><b><center>One Hot Encoding</center></b></h2>

One-hot encoding isn't a vectorization method made specifically for NLP; it can be used for any categorical data. In <b>One Hot Encoding</b>, we first go through the corpus to find all the unique words and give each one an index number. <b>Corpus</b> is the term used for the training dataset in NLP, and its unique words are called the <b>vocabulary of the corpus</b>. Then, for each word, we create a vector whose length equals the vocabulary size. This vector is filled with 0s except for a single position, which is set to 1 at the word's index (a short sketch appears below, after the Bag of Words discussion).

The disadvantage of one-hot encoding is that it suffers from the curse of dimensionality. In most cases, the vocabulary is vast, and creating such a large vector for every word takes up a lot of space and requires more processing time. One solution to this problem is to store the encoded vectors in a sparse matrix. We could also create variable-length codes (using a method similar to Huffman encoding), but those would be hard to use with neural networks that require fixed-size inputs, and they would be harder to decode.

Another disadvantage of one-hot encoding is that it doesn't capture the semantics of words. This can be a huge drawback in many situations: in English, for example, the same word can be used with different meanings, and a one-hot vector cannot distinguish between them.

<h2><b><center>Bag of Words (BOW) Approach</center></b></h2>

The bag of words approach is a family of vectorization techniques that do not embed the underlying semantics (meaning and grammar) of the words in the sentences of the corpus. One-hot encoding is a BOW approach. Other BOW techniques include the Count Vectorizer and TF-IDF.

In the <b>Count Vectorizer</b>, we first go through all the words in the corpus to collect the vocabulary and give each word an index. Then, for each sentence, we create a vector with a size equal to the vocabulary size and initialize every entry to 0. For each word appearing in the sentence, we store the number of times it appears at its corresponding index. This method has similar disadvantages to one-hot encoding, but it captures more information about the words: the number of times a word appears in a sentence signals how important it is to that sentence. However, counting occurrences alone can be deceiving; we must consider the length of the sentence as well. This is what TF-IDF does.
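Before turning to TF-IDF, here is a minimal sketch of the one-hot scheme described above. The toy corpus and the whitespace tokenization are assumptions made purely for illustration:

<pre>
# One-hot encoding over a toy corpus (illustrative sketch only).
corpus = ["the cat sat on the mat", "the dog sat on the log"]

# Build the vocabulary: every unique word in the corpus gets an index.
vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    # A vector of vocabulary size, all zeros except a 1 at the word's index.
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

print(vocabulary)      # ['cat', 'dog', 'log', 'mat', 'on', 'sat', 'the']
print(one_hot("cat"))  # [1, 0, 0, 0, 0, 0, 0]
</pre>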
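A count vectorizer can be sketched the same way, producing one vector per sentence rather than per word. Libraries such as scikit-learn ship a ready-made version of this, but the hand-rolled toy below is enough to show the idea:

<pre>
# Count vectorization over the same kind of toy corpus (illustrative sketch only).
corpus = ["the cat sat on the mat", "the dog sat on the log"]

vocabulary = sorted({word for sentence in corpus for word in sentence.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def count_vector(sentence):
    # Start from zeros, then count how often each word appears in the sentence.
    vector = [0] * len(vocabulary)
    for word in sentence.split():
        vector[index[word]] += 1
    return vector

for sentence in corpus:
    print(count_vector(sentence))
# [1, 0, 0, 1, 1, 1, 2]  -> "the" appears twice in the first sentence
# [0, 1, 1, 0, 1, 1, 2]
</pre>

Raw counts like these are exactly what TF-IDF reweights, as described next.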
In TF-IDF, TF stands for Term Frequency, and IDF stands for Inverse Document Frequency. Term Frequency normalizes the raw count of a word by the length of the sentence (or document) it appears in, while Inverse Document Frequency down-weights words that appear in many documents of the corpus, so that very common words contribute less than rare, informative ones. The value stored in the vector is the product of the two (a small worked sketch appears after the conclusion).

<h2><b><center>Word Embeddings</center></b></h2>

As discussed above, the disadvantage of BOW techniques is that they don't capture the meaning and context of the words in the corpus. A newer method of word vectorization is word embeddings. Word embeddings capture the meaning of words in addition to converting them to vector form. I have already written a blog on word embeddings, which you can find <a href="https://www.s-tronomic.in/post/109">here</a>, but I will briefly summarize the idea.

<b>Word embeddings</b> are created as a by-product of training a model. For example, Word2Vec, one popular family of word embeddings, can use the CBOW technique: the model is trained to predict a missing word given a window of words around it, and as a result, the weights between the input layer and the hidden layer can be used as word embeddings. The size of each vector equals the number of neurons in the hidden layer.

One disadvantage of word embeddings is that getting usable embeddings requires training the model on a huge dataset, which can take a long time. However, we can also use already-trained embeddings such as Word2Vec or GloVe. Word embeddings are generally preferred over BOW techniques because they capture the semantics of words, allowing the vectors to carry more useful information (a minimal training sketch also appears after the conclusion).

<h2><b><center>Conclusion</center></b></h2>

In this first part of the NLP series, we came across various methods of converting natural language text into numerical form. Before creating these vectors, though, we must preprocess the data. There are many techniques for data preprocessing, and they should be applied whenever required. In the next part, we will discuss preprocessing techniques like stemming, lemmatization, and removing stop words.
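To make the TF-IDF description above concrete, here is a rough sketch of one common variant (term frequency as count divided by sentence length, inverse document frequency as the log of the ratio of documents to documents containing the word). The toy corpus is assumed, and libraries often use slightly different smoothing:

<pre>
import math

# A rough TF-IDF sketch over a toy corpus (one common variant; libraries differ).
corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [sentence.split() for sentence in corpus]
vocabulary = sorted({word for words in tokenized for word in words})

def tf(word, words):
    # Term Frequency: count of the word normalized by sentence length.
    return words.count(word) / len(words)

def idf(word):
    # Inverse Document Frequency: penalizes words found in many documents.
    containing = sum(1 for words in tokenized if word in words)
    return math.log(len(tokenized) / containing)

def tfidf_vector(words):
    return [tf(word, words) * idf(word) for word in vocabulary]

print(tfidf_vector(tokenized[0]))
# "on", "sat" and "the" appear in both sentences, so their weight is 0;
# "cat" and "mat" receive a positive weight for the first sentence.
</pre>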
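Finally, a minimal sketch of training word embeddings with gensim's Word2Vec in CBOW mode. Gensim (4.x) is an assumption here, and a corpus this small cannot learn meaningful vectors; the snippet is only meant to show the shape of the workflow:

<pre>
# Toy word-embedding training with gensim's Word2Vec (gensim 4.x assumed).
from gensim.models import Word2Vec

corpus = ["the cat sat on the mat", "the dog sat on the log"]
tokenized = [sentence.split() for sentence in corpus]

# sg=0 selects CBOW: predict a word from the words in the window around it.
# vector_size is the hidden-layer size, i.e. the length of each embedding.
model = Word2Vec(sentences=tokenized, vector_size=50, window=2, min_count=1, sg=0)

print(model.wv["cat"].shape)         # (50,) -- one 50-dimensional vector per word
print(model.wv.most_similar("cat"))  # nearest words by cosine similarity
</pre>

In practice, we would train on a much larger corpus or simply load pre-trained vectors such as GloVe instead.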

- Ojas Srivastava, 08:29 PM, 12 Dec, 2022
