Data Pre-Processing: AI End-to-End Series (Part — 2.2 - NLP)

Accredian Publication
6 min read · Dec 14, 2021


By Hiren Rupchandani, Abhinav Jangir, and Ashish Lepcha

In the previous article, we saw various image pre-processing techniques. In this article, we will look at some pre-processing techniques for textual data that are used in natural language understanding. Common text preprocessing/cleaning steps include:

  • Lower casing
  • Removal of Punctuations
  • Removal of Stopwords
  • Removal of Frequent words
  • Removal of Rare words
  • Stemming
  • Lemmatization
  • Removal of emojis
  • Removal of emoticons
  • Removal of URLs
  • Removal of HTML tags

We will be using the customer tweets dataset for this article. Let’s see each of the pre-processing steps one by one:

Lower casing

  • Lower-casing is a common text preprocessing technique.
  • Words like TEXT and text mean the same thing, but unless they are converted to lower case, they are represented as two different tokens.
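As a minimal sketch, lower casing in Python is just a call to str.lower():

```python
# Lower-casing collapses case variants into a single token form
text = "This TEXT and this text are the SAME"
lowered = text.lower()
print(lowered)
```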

Removal of Punctuations

  • Another common text preprocessing technique is to remove punctuation from the text data.
  • The punctuation marks present in the text do not add value to the data.
[Figure: various punctuation marks]
  • Punctuation also creates additional tokens with the same meaning. For example, ‘Hurray’ and ‘Hurray!’ mean the same thing.
  • We can add or remove punctuation marks from the removal list as per our needs.
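A minimal sketch of punctuation removal, using Python’s string.punctuation as the removal list:

```python
import string

def remove_punctuation(text):
    # str.translate drops every character listed in string.punctuation
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuation("Hurray! We won."))
```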

Removal of Stopwords

  • Stopwords are commonly occurring words in a language that do not provide any useful information to infer content.
  • Sometimes they carry little meaning, like prepositions, or they are simply very frequent.
  • They can be removed from the text most of the time, as they don’t provide valuable information for downstream analysis.
  • These stopword lists are already compiled for different languages and we can safely use them.
  • For example, the stopword list for the English language from the nltk package can be loaded using:
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
", ".join(stopwords.words('english'))
# OUTPUT:
i, me, my, myself, we, our, ours, ourselves, you, you're, you've, you'll, you'd, your, yours, yourself, yourselves, he, him, his, himself, she, she's, her, hers, herself, it, it's, its, itself, they, them, their, theirs, themselves, what, which, who, whom, this, that, that'll, these, those, am, is, are, was, were, be, been, being, have, has, had, having, do, does, did, doing, a, an, the, and, but, if, or, because, as, until, while, of, at, by, for, with, about, against, between, into, through, during, before, after, above, below, to, from, up, down, in, out, on, off, over, under, again, further, then, once, here, there, when, where, why, how, all, any, both, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, s, t, can, will, just, don, don't, should, should've, now, d, ll, m, o, re, ve, y, ain, aren, aren't, couldn, couldn't, didn, didn't, doesn, doesn't, hadn, hadn't, hasn, hasn't, haven, haven't, isn, isn't, ma, mightn, mightn't, mustn, mustn't, needn, needn't, shan, shan't, shouldn, shouldn't, wasn, wasn't, weren, weren't, won, won't, wouldn, wouldn't
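To apply the list, we filter tokens against it. The sketch below inlines a small subset of the nltk list shown above so it runs without downloading the corpus; in practice you would use set(stopwords.words('english')):

```python
# Small, hand-picked subset of the English stopword list shown above
STOPWORDS = {"i", "me", "the", "a", "an", "is", "and", "to", "of"}

def remove_stopwords(text):
    return " ".join(w for w in text.lower().split() if w not in STOPWORDS)

print(remove_stopwords("The service is a delight to use"))
```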

Removal of Frequent Words

  • Sometimes in a domain-specific corpus, we might have some frequent words which are of no importance to us.
  • We can identify and remove them by counting the occurrences for each word and removing the words with the highest frequency.
[Figure: frequent words from our corpus]
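The counting step can be sketched with collections.Counter on a tiny made-up corpus (the cutoff of two words here is arbitrary and would be tuned by inspecting the counts):

```python
from collections import Counter

corpus = ["the product is great great",
          "great support from the team",
          "the team is great"]
counts = Counter(word for doc in corpus for word in doc.split())

# Treat the 2 most common words as "frequent" for this toy corpus
frequent = {word for word, _ in counts.most_common(2)}

def remove_frequent(text):
    return " ".join(w for w in text.split() if w not in frequent)

print(frequent)
print(remove_frequent("the product is great"))
```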

Removal of Rare Words

  • Similar to frequent words, we can also remove rarely occurring words.
  • Using the same method as in the previous step, we count the occurrences of each word and remove the words with the lowest frequency.
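With the same Counter approach, rare words can be taken as those that appear only once (the threshold is, again, a judgment call):

```python
from collections import Counter

counts = Counter("one two two three three three".split())

# Words seen exactly once are treated as "rare" here
rare = {word for word, count in counts.items() if count == 1}
print(rare)
```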

Stemming

  • Stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base, or root form.
[Figure: various derived forms of the stem ‘consult’]
  • Porter Stemmer is one of the most widely used of the many stemming algorithms available.
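A sketch using nltk’s PorterStemmer (this assumes nltk is installed; no corpus download is needed for the stemmer itself). Note that stems are not always dictionary words:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["consult", "consulting", "consultants", "running"]:
    print(word, "->", stemmer.stem(word))
```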

Lemmatization

  • Lemmatization is similar to stemming in that it reduces inflected words to a root form, but it differs by ensuring that the root word (the lemma) belongs to the language’s dictionary.
[Figure: stemming vs. lemmatizing]
  • nltk’s WordNetLemmatizer is one of the most commonly used tools to lemmatize a text corpus.
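In practice you would call nltk’s WordNetLemmatizer, which requires downloading the wordnet corpus. To keep this sketch self-contained, the lookup table below is a toy stand-in for a real dictionary and only illustrates how lemmatization differs from mechanical stemming:

```python
# Toy lookup table standing in for a real dictionary such as WordNet
LEMMAS = {"studies": "study", "geese": "goose", "better": "good"}

def toy_lemmatize(word):
    # A lemmatizer looks the word up, so the result is a dictionary word
    return LEMMAS.get(word, word)

def crude_stem(word):
    # A stemmer clips suffixes mechanically, with no dictionary check
    for suffix in ("es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

print(crude_stem("studies"))      # stem is not a dictionary word
print(toy_lemmatize("studies"))   # lemma is a dictionary word
```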

Removal of Emojis

  • With more and more usage of social media platforms, there is an explosion in the usage of emojis in our day-to-day life as well.
  • We might need to remove these emojis for some kinds of textual analysis.
  • Example:
# INPUT:
remove_emoji("game is on 🔥🔥")
# OUTPUT:
game is on
# INPUT:
remove_emoji("Hilarious😂")
# OUTPUT:
Hilarious
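The remove_emoji helper is not defined in the snippet above; a common way to implement it (an assumption, not necessarily the exact code used here) is a regex over the main emoji Unicode blocks:

```python
import re

def remove_emoji(text):
    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "]+",
        flags=re.UNICODE,
    )
    return emoji_pattern.sub("", text)

print(remove_emoji("Hilarious\U0001F602"))  # Hilarious
```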

Removal of Emoticons

  • There is a minor difference between emojis and emoticons.
  • An emoticon is built from keyboard characters that, when put together in a certain way, represent a facial expression; an emoji is an actual image.
  • :-) is an emoticon
  • 😀 is an emoji
  • We need to make sure to remove the emoticons as well ;-)
  • Example:
# INPUT:
remove_emoticons("Hello :>")
# OUTPUT:
Hello
# INPUT:
remove_emoticons("I am sad :(")
# OUTPUT:
I am sad
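The remove_emoticons helper can be built from an emoticon dictionary; the table below is a tiny hypothetical subset (full lists map hundreds of emoticons to descriptions):

```python
import re

# Hypothetical subset of a full emoticon dictionary
EMOTICONS = {
    ":-)": "Happy_face_or_smiley",
    ":)": "Happy_face_or_smiley",
    ":>": "Cheeky_or_smug",
    ":(": "Frown_sad_angry_or_pouting",
}

def remove_emoticons(text):
    # Build one alternation pattern out of the escaped emoticon keys
    pattern = re.compile("|".join(re.escape(e) for e in EMOTICONS))
    return pattern.sub("", text)

print(remove_emoticons("Hello :>"))
```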

Conversion of Emoticon to Words

  • In use cases like sentiment analysis, the emoticons give some valuable information and so removing them might not be a good solution.
  • One way is to convert the emoticons to word format so that they can be used in downstream modeling processes.
[Figure: emoji/emoticon representation]
# INPUT:
convert_emoticons("Hello :) :)")
# OUTPUT:
Hello Happy_face_or_smiley Happy_face_or_smiley
# INPUT:
convert_emoticons("I am sad :(")
# OUTPUT:
I am sad Frown_sad_angry_or_pouting
  • This method might be better for some use cases when we do not want to miss out on emoticon information.
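A convert_emoticons sketch using the same kind of hypothetical dictionary, replacing each emoticon with its word description:

```python
# Hypothetical subset of a full emoticon dictionary
EMOTICONS = {
    ":-)": "Happy_face_or_smiley",
    ":)": "Happy_face_or_smiley",
    ":(": "Frown_sad_angry_or_pouting",
}

def convert_emoticons(text):
    # Replace each emoticon with its description so models see words
    for emoticon, description in EMOTICONS.items():
        text = text.replace(emoticon, description)
    return text

print(convert_emoticons("I am sad :("))
```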

Removal of URLs

  • The next preprocessing step is to remove any URLs present in the data.
  • For example, if we are doing a Twitter analysis, then there is a good chance that the tweet will have some URL in it.
  • And, we might need to remove the URL for further analysis.
  • We are using the following regular expression to remove the URLs from our corpus: https?://\S+|www\.\S+
  • Let’s see the helper function for the same (it uses Python’s re module):
import re

def remove_urls(text):
    url_pattern = re.compile(r'https?://\S+|www\.\S+')
    return url_pattern.sub(r'', text)
  • Removing URLs in the following example:
# INPUT:
text = "NLP blog post on https://www.insaid.co/"
remove_urls(text)
# OUTPUT:
"NLP blog post on "

Removal of HTML Tags

  • Another common preprocessing technique that comes in handy in multiple places is the removal of HTML tags.
  • This is especially useful if we scrape data from different websites.
  • We might end up having HTML strings as part of our text.
  • Let’s see the helper function for the same:
import re

def remove_html(text):
    html_pattern = re.compile(r'<.*?>')
    return html_pattern.sub(r'', text)
  • Checking an example to remove HTML tags:
# INPUT:
text = """<div>
<h1> Introduction</h1>
<p> AutoML</p>
<a href="https://www.insaid.co/#exploreprograms"> School of AI and Data science</a>
</div>"""
print(remove_html(text))
# OUTPUT:
 Introduction
 AutoML
 School of AI and Data science
  • We can see that all the HTML tags have been removed from the text.
  • So these were the various text pre-processing techniques that we can apply before feeding text as input to a model.

What’s Next?

In the next article of this series, we will develop a model with the help of our preprocessed data.

Follow us for more upcoming future articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a Clap👏 if you find this article useful; your encouragement helps us create more content like this.

Visit us on https://www.insaid.co/


Accredian Publication

One of India’s leading institutions providing world-class Data Science & AI programs for working professionals with a mission to groom Data leaders of tomorrow!