Basics of NLP - 1 [Week 5]

discussion into how text preprocessing, regex, frequencies, and word embeddings work

You must have used ChatGPT, but how does it work? How does Google Translate work? They all work under the study of Natural Language Processing. Today, we embark on a long journey which was inspired from a tweet from Saurabh Bhaiya [Link to Tweet - Saurabh Kumar on Twitter / X]. Let’s dive deep into the study of NLP.

This blog is broken into 4 pieces

Text Preprocessing - Lowercasing, Tokenization, Stopword Removal, Stemming, Lemmatization
Regex - Basics and some examples
Frequencies - BoW, TF, IDF, TF-IDF
Word Embeddings - Word2Vec, GloVe

Text Preprocessing

We will be using the NLTK library for this task, make sure import it earlier on using import nltk

Lowercasing

This step is obvious, we do this to ensure consistency. It also reduces the complexity of the text data we are going to input. We can simply just convert every character in the test data to its lower case form.

text=#your input data
text=text.lower()

Tokenization

Imagine you want to teach your kid to study English, instead of teaching him Shakespeare, you will first teach him alphabets, then words then sentences.

Here your child is the machine, and we break down sentences for it to learn so that the sentence is broken into meaningful tokens, which still carry the original essence of context. This makes pattern recognition easier.

Let’s say we have a sentence I am neuralnets.
We can break it down to ["I","am","neuralnets"] . This is word tokenization, which breaks long sentences into individual words.

Now let’s go down one step further. ['I', ' ', 'a', 'm', ' ','n', 'e', 'u', 'r', 'a', 'l', 'n', 'e', 't', 's'] . This kind of tokenization is called character tokenization, mostly used for spelling correction tasks.

We can also break words neuralnets to ["neural","nets"].This kind of tokenization is termed as subword tokenization. This is useful for languages that form meaning by combining smaller tokens.

import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
tokens = word_tokenize("I am neuralnets.")
print("Tokens:", tokens)

Stopword Removal

What is a stopword? They are usually the words that don’t contribute any meaning or context to the sentence. For example, the word the doesn’t bring out any significance to the sentence, hence we can just remove it from the input data. But words which have important context to the sentence, like for example, history can’t be termed as stopwords.

from nltk.corpus import stopwords
stop_words=set(stopwords.words('english') 
#this forms a set of stop words
words=[word for word in tokens if word not in stop_words]
#this stores all words in tokens, that are not stopwords

Stemming

Let us consider this sentence, The leaves are falling, have fallen, but will leave behind beauty

With stemming, what we do is chop down the sentence into their root forms, which might not make any sense and lose the meaning of the word.

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokeniz
text = "The leaves are falling, have fallen, but will leave behind beauty."
words = word_tokenize(text)
stemmer = PorterStemmer()
stems = [stemmer.stem(word) for word in words]
print("Stemming Results:", stems)

Lemmatization

Consider the earlier sentence.
What we do here, is we chop down the sentence into their base dictionary word, which considers the meaning and context of the word, hence not making a word, that makes no sense.

from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
text = "The leaves are falling, have fallen, but will leave behind beauty."
words = word_tokenize(text)
stemmer = WordNetLemmatizer()
lemmas = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatization Results:", lemmas)

Regular Expressions(Regex)

A regular expression (shortened as regex) is a sequence of characters that specifies a search pattern in text. Basically it’s a way of finding and searching stuff in a string.

How to write Regex?

→ First we import the regex module with import re

→ Then we need to create a regex object with re.compile

→ We need to pass our input string into the Regex object using search() , which returns an object

→ Finally call the group() function to return a string output from the object.

Common Regex Symbols

Example

phone_num_regex = re.compile(r'\d\d\d\d-\d\d\d\d\d\d')

mo = phone_num_regex.search('My number is 9834-872919.')

print(f'Phone number found: {mo.group()}')
#this will output the number in the string

Since this topic is very practical, and should be learnt through practice, you’re suggested to try examples out on your own to get the grasp of regex!
Check out - https://docs.python.org/3/howto/regex.html

Frequencies

Bag of Words(BoW)

It is nothing but a simple way to represent text data. We don’t care about the order of words not the grammar, we just care about what words appear in the sentence. It’s like putting words in a bag and then randomly counting each type of word.

Firstly, we need to create a vocabulary, that is all the unique words in our dataset. Then we count how often each word occurs in the vocabulary.

Let’s say we have sentences
S1 : I like pizza

S2 : I do not like pizza

Our vocabulary will be ["I", "like", "pizza", "do", "not"]

Now we count the number of appearances

Word	Sentence 1	Sentence 2
I	1	1
Like	1	1
pizza	1	1
do	0	1
not	0	1
Total Words	3	5

The BoW representation is :
1. Sentence 1 : [1,1,1,0,0]
2. Sentence 2 : [1,1,1,1,1]

The BoW model represents each sentence as a vector, where each dimension is a word in the vocabulary and the value in it represents the count of that word.

Term Frequency(TF)

It measures how often a specific word appears in a document, compared to the total number of words in that document. Instead of counting like in BoW, TF normalises the counts, thus for larger datasets, we can see how important a word is in context of the document. It sets the stage for more advanced models to come.

The formula for TF is

\[TF(w)=\frac{count \space of \space w \space in \space document}{Total\space number \space of \space words\space in \space document}\]

We will be using the same example like the last time.

Word	TF of Sentence 1	TF of Sentence 2
I	1/3 = 0.333	1/5 = 0.2
Like	1/3 = 0.333	1/5 = 0.2
pizza	1/3 = 0.333	1/5 = 0.2
do	0	1/5 = 0.2
not	0	1/5 = 0.2
Total Words	3	5

Inverse Document Frequency(IDF)

While TF told us about the importance of a word in a document, IDF tells us about how rare or unique a word is in the document. Using this, we can often eliminate words that carry very less context, rather than the rare words which might carry huge context. Words that appear in every document get a lower IDF (close to 0). Words that appear in fewer documents get a higher IDF.

The formula is :

\[IDF(w)=log(\frac{Total\space number \space of \space documents}{Number\space of \space documents\space having\space w})\]

We will be using the same example like the last time.

Word	Document Frequency
I	2
like	2
pizza	2
do	1
not	1

Now let’s calculate the IDFs :

\[log(\frac{2}{2})=0 \space |\space log(\frac{2}{1})=0.693\]

Word	Document Frequency	IDF
I	2	0
like	2	0
pizza	2	0
do	1	0.693
not	1	0.693

IDF helps downweight common words (like "I") and emphasize rare words (like "do" and "not"), which are often more meaningful for distinguishing between documents.

TF-IDF

TF tells us how important a word is in a document, IDF tells us how rare that word is. Combining them gives a score that tells us how important a word is in a document, while leaving behind common words.

The formula is :

\[TF-IDF(w)=TF(w)\times IDF(w)\]