10  Natural Language Processing (NLP)

Learning Objectives of the Chapter

At the end of the chapter, students should be able to:

  • Learn what Natural Language Processing (NLP) is

  • Understand the importance and different concepts of NLP

  • Learn about different R and Python packages for NLP

  • Perform some NLP on text data

10.1 Introduction

    In today's data-driven world, a significant amount of data is produced each day. For example, Google processes 24 petabytes of data every day; 10 million photos are uploaded to Facebook every hour; and around 400 million tweets are posted on X (formerly Twitter) each day. A significant portion of this data consists of text. Therefore, it is important to be able to gain insights from text data.

    Natural Language Processing (NLP), according to IBM, is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. Specifically, NLP involves understanding, interpreting, and extracting insights from human language. Businesses use NLP for many purposes, such as processing and analyzing large volumes of documents, analyzing customer reviews, and scaling customer service (e.g., developing chatbots or virtual assistants).

10.2 Python Libraries for NLP

    Python has a rich and efficient ecosystem for NLP. Some of the most popular Python libraries for NLP include NLTK (Natural Language Toolkit), spaCy, Gensim, and TextBlob.
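
    All four libraries are available on PyPI; a minimal sketch of installing and importing them is shown below (the pip command in the comment assumes the standard PyPI package names).

# A minimal sketch, assuming the packages have been installed first,
# e.g., pip install nltk spacy gensim textblob
import nltk                    # Natural Language Toolkit: tokenization, stemming, corpora
import spacy                   # industrial-strength pipelines: tagging, parsing, NER
import gensim                  # topic modeling and word embeddings
from textblob import TextBlob  # simple API for sentiment analysis and more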

10.3 Steps in NLP

    Like other data science tools and techniques, NLP involves several steps: most of the time, text data is not readily available, and even when it is, it must be cleaned and prepared before further processing. In this section, several important steps of NLP, collectively known as preprocessing, are discussed.

10.3.1 Preprocessing

    Before applying NLP techniques, it is necessary to preprocess and clean the text data. The processes of cleaning and preparing text data to make it ready for NLP models are collectively called preprocessing. Preprocessing is very important for getting effective and accurate insights from the data. Below we discuss several important preprocessing concepts.

# An example of text data
my_text = """
Accounting is the systematic process of recording, analyzing, and reporting financial \
transactions. It helps businesses track their income, expenses, and overall financial \
health. Accountants use various financial statements, such as balance sheets and income \
statements, to summarize a company's financial position. Double-entry bookkeeping is a \
fundamental principle in accounting, ensuring that every transaction affects at least two \
accounts. Financial accounting focuses on providing information to external stakeholders, \
such as investors and creditors, while managerial accounting provides information to \
internal stakeholders, like managers, to aid in decision-making. Auditing is an essential \
aspect of accounting, involving the examination of financial records to ensure accuracy \
and compliance. Tax accounting deals with preparing tax returns and planning for \
future tax obligations. Forensic accounting involves investigating financial discrepancies \
and fraud. Accounting software, like QuickBooks and Xero, has revolutionized the way \
businesses manage their finances, making the process more efficient and accurate. \
Overall, accounting plays a crucial role in the financial management and transparency \
of businesses and organizations.
"""

10.3.1.1 Tokenization

    Tokenization is the process of splitting text into smaller units, called tokens, such as sentences or words. NLTK's sent_tokenize and word_tokenize functions perform sentence and word tokenization, respectively.

import nltk
nltk.download("punkt_tab")  # tokenizer models used by sent_tokenize and word_tokenize
from nltk.tokenize import sent_tokenize, word_tokenize
# sentence tokenize
my_text_sent = sent_tokenize(my_text)
my_text_sent[0:5]
['\nAccounting is the systematic process of recording, analyzing, and reporting financial transactions.',
 'It helps businesses track their income, expenses, and overall financial health.',
 "Accountants use various financial statements, such as balance sheets and income statements, to summarize a company's financial position.",
 'Double-entry bookkeeping is a fundamental principle in accounting, ensuring that every transaction affects at least two accounts.',
 'Financial accounting focuses on providing information to external stakeholders, such as investors and creditors, while managerial accounting provides information to internal stakeholders, like managers, to aid in decision-making.']
# word tokenize
my_text_word = word_tokenize(my_text)
my_text_word[0:5]
['Accounting', 'is', 'the', 'systematic', 'process']

10.3.1.2 Removing Punctuation

    It is evident that our word tokens also include punctuation marks, such as the comma (,) and the full stop (.), which are unnecessary. Therefore, we need to eliminate them from the token list.

import string
my_text_nopunc = [x for x in my_text_word if x not in string.punctuation]
my_text_nopunc[:11]
['Accounting',
 'is',
 'the',
 'systematic',
 'process',
 'of',
 'recording',
 'analyzing',
 'and',
 'reporting',
 'financial']

10.3.1.3 Filtering Stop Words

    Stop words are common words, such as "in", "an", and "the", that carry little meaning and that we usually want to ignore. Therefore, in this step, we filter these kinds of words out of the token list.

nltk.download("stopwords") # to download the stopwords from NLTK repository
from nltk.corpus import stopwords # imports the module 
stop_words = set(stopwords.words("english")) # access the stopwords for english 
# print(stop_words)
my_text_nostopwords = [x for x in my_text_nopunc if x.lower() not in stop_words]
my_text_nostopwords[0:11]
['Accounting',
 'systematic',
 'process',
 'recording',
 'analyzing',
 'reporting',
 'financial',
 'transactions',
 'helps',
 'businesses',
 'track']

    We can still see some unnecessary tokens in the list. For example, "'s" (split off from "company's" during tokenization) remains in my_text_nostopwords, because the punctuation filter above only removes single-character tokens. We need to get rid of it.

"'s" in my_text_nostopwords
my_text_nostopwords = [x for x in my_text_nostopwords if "'s" not in x]
"'s" in my_text_nostopwords
False
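
    The standard English list can also be extended with domain-specific stop words. A small sketch follows; the extra words are arbitrary illustrative choices, not part of NLTK's list.

# extend the stop-word set with custom (illustrative) entries
custom_stop_words = stop_words | {"overall", "like"}
my_text_filtered = [x for x in my_text_nostopwords if x.lower() not in custom_stop_words]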

10.3.1.4 Stemming

    Stemming is the process of reducing words to their base or root form. For example, the token list contains words like recording, reporting, and analyzing; their base forms are record, report, and analyze, respectively. Stemming reduces such words to their base form. Several stemmers exist for this purpose, such as the Porter stemmer, Lovins stemmer, Dawson stemmer, Krovetz stemmer, and Xerox stemmer.

from nltk.stem import PorterStemmer,SnowballStemmer, LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
[porter.stem(x) for x in my_text_nostopwords]
[snowball.stem(x) for x in my_text_nostopwords]
[lancaster.stem(x) for x in my_text_nostopwords][0:11]
['account',
 'system',
 'process',
 'record',
 'analys',
 'report',
 'fin',
 'transact',
 'help',
 'busy',
 'track']
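
    The output above comes from the Lancaster stemmer, the most aggressive of the three (note "financial" reduced to "fin"). A quick way to compare the stemmers is to run them side by side on a few words from our list, as in this sketch:

# compare the three stemmers on a few sample words
for word in ["recording", "financial", "businesses", "analyzing"]:
    print(word, "->", porter.stem(word), snowball.stem(word), lancaster.stem(word))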

10.3.1.5 Lemmatization

    Lemmatization, like stemming, is the process of reducing a word to its base form; unlike stemming, however, it considers the context of the word and returns a valid dictionary word (the lemma).

from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # the WordNet data required by the lemmatizer
wordnet = WordNetLemmatizer()
my_text_lemmatized = [wordnet.lemmatize(x) for x in my_text_nostopwords]
my_text_lemmatized[:11]
['Accounting',
 'systematic',
 'process',
 'recording',
 'analyzing',
 'reporting',
 'financial',
 'transaction',
 'help',
 'business',
 'track']
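
    Notice that words such as "recording" and "analyzing" are unchanged above: by default, WordNetLemmatizer treats every token as a noun. Passing an explicit part-of-speech tag gives the verb lemmas, as in this small sketch:

# lemmatize with an explicit part-of-speech tag ("v" = verb)
wordnet.lemmatize("recording", pos="v")   # 'record'
wordnet.lemmatize("analyzing", pos="v")   # 'analyze'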

10.3.1.6 Other Steps in Preprocessing

    In addition to the above preprocessing, we might need to remove many other special characters from the text, including hashtags, HTML tags, and links. For this purpose, knowledge of regular expressions is useful, and Python's built-in re module comes in handy. To learn more about regular expressions, see https://www.w3schools.com/python/python_regex.asp.
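
    As a brief illustration, the sketch below uses re to strip HTML tags, links, and hashtags from a made-up sample string:

import re

sample = "Check out <b>our</b> latest report at https://example.com #accounting"
sample = re.sub(r"<[^>]+>", "", sample)       # remove HTML tags
sample = re.sub(r"https?://\S+", "", sample)  # remove links
sample = re.sub(r"#\w+", "", sample)          # remove hashtags
print(sample.strip())
Check out our latest report at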

10.4 Visualization of Words

10.4.1 Word Cloud

    Figure 10.1 shows a word cloud of our preprocessed (lemmatized) text.

from wordcloud import WordCloud
# WordCloud needs a single string, so the token list is joined below
my_text_lemmatized_string = ' '.join(my_text_lemmatized)
# Word Cloud
word_cloud = WordCloud(collocations=False, background_color='white').generate(my_text_lemmatized_string)
import matplotlib.pyplot as plt
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Figure 10.1: Word Cloud of the Words
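
    To keep the image for a report, the word cloud can also be written directly to an image file with the to_file method (the file name here is arbitrary):

# optionally save the word cloud as a PNG file
word_cloud.to_file("word_cloud.png")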

10.4.2 Bar Diagram of Word Frequency

    Figure 10.2 shows a bar diagram of the word frequencies in our lemmatized token list.

from collections import Counter
import pandas as pd
import seaborn as sns
# calculate word frequencies
word_freq = Counter(my_text_lemmatized)
# create a data frame of words and their frequencies
word_df = pd.DataFrame(word_freq.items(), columns=['Word', 'Frequency'])
word_df = word_df.sort_values(by='Frequency', ascending=False)
# create the bar diagram (only words that appear more than once)
plt.figure(figsize=(10, 5))
sns.barplot(y='Word', x='Frequency', data=word_df[word_df['Frequency'] > 1], palette='viridis')
plt.ylabel('Words')
plt.xlabel('Frequencies')
plt.show()
Figure 10.2: Bar Diagram of Word Frequency

10.5 Sentiment Analysis

    Sentiment analysis involves classifying text into sentiments such as positive, neutral, and negative. Text is widely used to express emotions, feelings, opinions, and so on; therefore, sentiment analysis is sometimes also called "opinion mining." Identifying sentiment in text can provide valuable insights for strategic decisions, such as improving product features, launching new products, and identifying strengths or weaknesses of product or service offerings. Before performing sentiment analysis, we first need to do the preprocessing described in Section 10.3.1.

    Below we use the textblob Python module for sentiment analysis of our text about accounting. textblob is simple to use for sentiment analysis: the function accepts text as input and returns sentiment scores. There are two types of sentiment scores: polarity and subjectivity. The polarity score measures the sentiment of the text; its values lie between -1 and +1, where -1 indicates very negative sentiment and +1 indicates very positive sentiment. The subjectivity score, on the other hand, measures whether the text contains factual information or personal opinion. Subjectivity scores range from 0 to 1, where 0 indicates factual information and 1 indicates personal opinion.

from textblob import TextBlob
# Determining Polarity 
TextBlob(my_text).sentiment.polarity
0.028571428571428574
# Determining Subjectivity 
TextBlob(my_text).sentiment.subjectivity
0.21706349206349207

    In the above analysis, the polarity score is 0.02857, which is very close to zero; therefore, we can say our text is neutral. The subjectivity score is 0.21706, which is close to 0, indicating that our text is factual information rather than personal opinion.
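
    A common follow-up is to map the polarity score onto a categorical label. The sketch below uses a cut-off of 0.05; the threshold is an arbitrary illustrative choice, not a textblob convention:

from textblob import TextBlob

def polarity_label(text, threshold=0.05):
    # map a TextBlob polarity score to a coarse sentiment label;
    # the 0.05 cut-off is an arbitrary illustrative choice
    score = TextBlob(text).sentiment.polarity
    if score > threshold:
        return "positive"
    elif score < -threshold:
        return "negative"
    return "neutral"

polarity_label(my_text)
'neutral'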

10.6 Readability Index

10.7 Text Similarity

10.8 Topic Modeling

10.9 Conclusion