10 Natural Language Processing (NLP)
Learning Objectives of the Chapter
At the End of the Chapter, Students should be Able to:
Learn What Natural Language Processing (NLP) Is
Understand the Importance and Different Concepts of NLP
Learn about Different R and Python Packages for NLP
Perform Some NLP on Text Data
10.1 Introduction
In today’s data-driven world, a significant amount of data is produced each day. For example, Google processes 24 petabytes of data every day; 10 million photos are uploaded every hour on Facebook; and 400 million tweets are posted on X (formerly Twitter). A significant portion of this data consists of text. Therefore, it is important to be able to gain insights from text data.
Natural Language Processing (NLP), according to IBM, is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language. Specifically, NLP involves understanding, interpreting, and extracting insights from human language. Businesses use NLP for many purposes, such as processing and analyzing large volumes of documents, analyzing customer reviews, and scaling customer service (for example, developing chatbots or virtual assistants).
10.2 Python Libraries for NLP
Python has a rich and efficient ecosystem for NLP. Some of the most popular Python modules (libraries) for NLP include NLTK (Natural Language Toolkit), spaCy, Gensim, and TextBlob.
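As a quick taste of one of these libraries, the minimal sketch below uses TextBlob to split a short sentence into word tokens (assuming the textblob package and the NLTK data it relies on are installed):
from textblob import TextBlob
# Tokenize a short sentence into words with TextBlob
TextBlob("NLP helps computers understand human language.").words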
10.3 Steps in NLP
Like other data science tools or techniques, NLP involves several steps, because most of the time text data is not readily available, and even when it is, we need to clean the data and prepare it for further processing. In this section, several important steps of NLP, collectively called preprocessing, will be discussed.
10.3.1 Preprocessing
Before applying NLP techniques, it is necessary to preprocess and clean the text data. The processes of cleaning and preparing text data to get it ready for NLP models are therefore called preprocessing. Preprocessing is very important in NLP for getting effective and accurate insights from the data. Below we discuss several important preprocessing concepts.
10.3.1.1 Tokenization
Tokenization is the process of splitting text into smaller units, called tokens, such as sentences or words. We will demonstrate this and the following preprocessing steps on the example text about accounting below.
# An example of text data
my_text = """
Accounting is the systematic process of recording, analyzing, and reporting financial \
transactions. It helps businesses track their income, expenses, and overall financial \
health. Accountants use various financial statements, such as balance sheets and income \
statements, to summarize a company's financial position. Double-entry bookkeeping is a \
fundamental principle in accounting, ensuring that every transaction affects at least two \
accounts. Financial accounting focuses on providing information to external stakeholders, \
such as investors and creditors, while managerial accounting provides information to \
internal stakeholders, like managers, to aid in decision-making. Auditing is an essential \
aspect of accounting, involving the examination of financial records to ensure accuracy \
and compliance. Tax accounting deals with preparing tax returns and planning for \
future tax obligations. Forensic accounting involves investigating financial discrepancies \
and fraud. Accounting software, like QuickBooks and Xero, has revolutionized the way \
businesses manage their finances, making the process more efficient and accurate. \
Overall, accounting plays a crucial role in the financial management and transparency \
of businesses and organizations.
"""
import nltk
nltk.download("punkt_tab")  # download the tokenizer data
from nltk.tokenize import sent_tokenize, word_tokenize
# sentence tokenize
my_text_sent = sent_tokenize(my_text)
my_text_sent[0:5]
['\nAccounting is the systematic process of recording, analyzing, and reporting financial transactions.',
'It helps businesses track their income, expenses, and overall financial health.',
"Accountants use various financial statements, such as balance sheets and income statements, to summarize a company's financial position.",
'Double-entry bookkeeping is a fundamental principle in accounting, ensuring that every transaction affects at least two accounts.',
'Financial accounting focuses on providing information to external stakeholders, such as investors and creditors, while managerial accounting provides information to internal stakeholders, like managers, to aid in decision-making.']
# word tokenize
my_text_word = word_tokenize(my_text)
my_text_word[0:5]
['Accounting', 'is', 'the', 'systematic', 'process']
10.3.1.2 Removing Punctuation
It is evident that our word tokens also include punctuation marks, like the comma (,) and the full stop (.), but they are unnecessary. Therefore, we need to eliminate them from the token list.
import string
my_text_nopunc = [x for x in my_text_word if x not in string.punctuation]
my_text_nopunc[:11]
['Accounting',
'is',
'the',
'systematic',
'process',
'of',
'recording',
'analyzing',
'and',
'reporting',
'financial']
10.3.1.3 Filtering Stop Words
Stop words are common words, such as “in”, “an”, and “the”, that we usually want to ignore because they carry little meaning on their own. Therefore, in this step, we filter out these kinds of words.
"stopwords") # to download the stopwords from NLTK repository
nltk.download(from nltk.corpus import stopwords # imports the module
= set(stopwords.words("english")) # access the stopwords for english
stop_words # print(stop_words)
= [x for x in my_text_nopunc if x.lower() not in stop_words]
my_text_nostopwords 0:11] my_text_nostopwords[
['Accounting',
'systematic',
'process',
'recording',
'analyzing',
'reporting',
'financial',
'transactions',
'helps',
'businesses',
'track']
We can still see some unnecessary tokens in the list, so we need to eliminate them. For example, the token “'s” is still in my_text_nostopwords, and we need to get rid of it.
"'s" in my_text_nostopwords
= [x for x in my_text_nostopwords if "'s" not in x]
my_text_nostopwords "'s" in my_text_nostopwords
False
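As an aside, a more general way to drop leftover fragments like this is to keep only purely alphabetic tokens. The sketch below is illustrative (the name only_alpha is ours); note that it would also drop any tokens containing digits or hyphens:
# Keep only purely alphabetic tokens (only_alpha is a hypothetical name)
only_alpha = [x for x in my_text_nostopwords if x.isalpha()]
only_alpha[0:11]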
10.3.1.4 Stemming
Stemming is the process of reducing words to their base or root form. For example, the token list contains words like recording, reporting, analyzing, and so on. The base forms of those words are record, report, and analyze, respectively. Therefore, we need to reduce those words to their base forms, and stemming helps us do so. For this purpose, there are several types of stemmers, such as the Porter stemmer, Lovins stemmer, Dawson stemmer, Krovetz stemmer, and Xerox stemmer.
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer
porter = PorterStemmer()
snowball = SnowballStemmer("english")
lancaster = LancasterStemmer()
[porter.stem(x) for x in my_text_nostopwords]
[snowball.stem(x) for x in my_text_nostopwords]
# only the last expression's result (Lancaster) is displayed below
[lancaster.stem(x) for x in my_text_nostopwords][0:11]
['account',
'system',
'process',
'record',
'analys',
'report',
'fin',
'transact',
'help',
'busy',
'track']
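The output above comes from the Lancaster stemmer, which is generally the most aggressive of the three. To see how the stemmers differ, a small side-by-side comparison on a few of our tokens can help (a minimal sketch reusing the stemmer objects created above):
# Compare the three stemmers on a few sample words
for w in ["recording", "analyzing", "financial", "businesses"]:
    print(w, "->", porter.stem(w), "|", snowball.stem(w), "|", lancaster.stem(w))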
10.3.1.5 Lemmatization
Lemmatization, like stemming, is the process of reducing a word to its base form, but, unlike stemming, it considers the context of the word.
from nltk.stem import WordNetLemmatizer
nltk.download("wordnet")  # the lemmatizer requires the WordNet corpus
wordnet = WordNetLemmatizer()
my_text_lemmatized = [wordnet.lemmatize(x) for x in my_text_nostopwords]
my_text_lemmatized[:11]
['Accounting',
'systematic',
'process',
'recording',
'analyzing',
'reporting',
'financial',
'transaction',
'help',
'business',
'track']
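Notice that words like Accounting, recording, and analyzing are unchanged in the output above. This is because lemmatize() treats every word as a noun unless told otherwise; passing a part-of-speech tag changes the result, as the short sketch below shows:
# By default, lemmatize() assumes the word is a noun
print(wordnet.lemmatize("analyzing"))           # unchanged: 'analyzing'
# With a part-of-speech tag ("v" for verb), the verb lemma is returned
print(wordnet.lemmatize("analyzing", pos="v"))  # 'analyze'
print(wordnet.lemmatize("recording", pos="v"))  # 'record'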
10.3.1.6 Other Steps in Preprocessing
In addition to the above preprocessing, we might need to remove many other special characters from the text, such as hashtags, HTML tags, and links. For this purpose, knowledge of “regular expressions” is useful, and Python’s built-in package re comes in handy for working with them. To learn more about regular expressions, see https://www.w3schools.com/python/python_regex.asp.
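As an illustration, the minimal sketch below uses re to strip links, HTML tags, and hashtags from a made-up string; the patterns are deliberately simple and assume reasonably well-formed input:
import re

sample = "Check out <b>our report</b> at https://example.com #accounting"
sample = re.sub(r"https?://\S+", "", sample)  # remove links
sample = re.sub(r"<[^>]+>", "", sample)       # remove HTML tags
sample = re.sub(r"#\w+", "", sample)          # remove hashtags
print(sample.strip())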
10.4 Visualization of Words
10.4.1 Word Cloud
Figure 10.1 shows a word cloud of our tokenized text.
from wordcloud import WordCloud
# WordCloud needs a single string, so the token list is joined below
my_text_lemmatized_string = ' '.join(my_text_lemmatized)
# Word Cloud
word_cloud = WordCloud(collocations = False, background_color = 'white').generate(my_text_lemmatized_string)
import matplotlib.pyplot as plt
plt.imshow(word_cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
10.4.2 Bar Diagram of Word Frequency
Figure 10.2 shows a bar diagram of the word frequencies in the tokenized list.
from collections import Counter
# calculate word frequencies
word_freq = Counter(my_text_lemmatized)
# extract words and their frequencies
words = list(word_freq.keys())
frequencies = list(word_freq.values())
# create a data frame
import pandas as pd
import seaborn as sns
word_df = pd.DataFrame(word_freq.items(), columns = ['Word', 'Frequency'])
word_df = word_df.sort_values(by='Frequency', ascending=False)
# create the bar diagram (only words appearing more than once)
plt.figure(figsize=(10, 5))
sns.barplot(y='Word', x='Frequency', data=word_df[word_df['Frequency']>1], palette='viridis')
plt.ylabel('Words')
plt.xlabel('Frequencies')
plt.xticks(rotation=90)
plt.show()
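As a side note, Counter can produce the most frequent words directly with its most_common() method, without building a data frame first:
# Top 10 most frequent words as (word, count) pairs
word_freq.most_common(10)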
10.5 Sentiment Analysis
Sentiment analysis involves converting text into sentiments such as positive, neutral, and negative. Text is widely used to express emotions, feelings, opinions, and so on; therefore, sentiment analysis is sometimes also called “Opinion Mining.” Identifying sentiment in text can provide valuable insights for strategic decisions, such as improving product features, launching new products, and identifying strengths or weaknesses of product or service offerings. Before we perform sentiment analysis, we need to do the preprocessing described in Section 10.3.1.
Below we use the textblob Python module for sentiment analysis of our text about accounting. textblob is simple to use for sentiment analysis because its function accepts text as input and returns sentiment scores. There are two types of sentiment scores: polarity and subjectivity. The polarity score measures the sentiment of the text, and its values are between -1 and +1, where -1 indicates very negative sentiment and +1 indicates very positive sentiment. On the other hand, the subjectivity score measures whether the text contains factual information or personal opinion. Subjectivity scores range from 0 to 1, where 0 indicates factual information and 1 indicates personal opinion.
from textblob import TextBlob
# Determining Polarity
TextBlob(my_text).sentiment.polarity
0.028571428571428574
# Determining Subjectivity
TextBlob(my_text).sentiment.subjectivity
0.21706349206349207
In the above analysis, we see that the polarity score is 0.02857, which is very close to zero; therefore, we can say our text is neutral. On the other hand, the subjectivity score is 0.21706, which is close to 0, indicating that our text is mostly factual information (not personal opinion).
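To get a feel for the polarity scale, we can score a couple of short, made-up sentences; the first should produce a clearly positive score and the second a clearly negative one:
# Hypothetical sentences illustrating the polarity scale
print(TextBlob("The service was excellent and I love it.").sentiment.polarity)  # expected: clearly positive
print(TextBlob("The service was terrible and I hate it.").sentiment.polarity)   # expected: clearly negative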