Clean and preprocess text data in Pandas for NLP tasks

Cleaning and preprocessing data is often one of the most daunting, yet crucial stages in building AI and machine learning solutions that are driven by data. Text data is no exception.


This tutorial breaks the ice in tackling the challenge of preparing text data for NLP tasks like those that Language Models (LMs) can solve. By encapsulating your text data in pandas DataFrames, the steps below will help you get your text ready to be digested by NLP models and algorithms.

Load the data into a Pandas DataFrame

To keep this tutorial simple and focused on the text cleaning and preprocessing steps themselves, we’ll look at a small example of four single-attribute text instances loaded into a pandas DataFrame. Every preprocessing step from here on will be applied to this DataFrame object.

import pandas as pd

# Build a small DataFrame with a single 'text' column (one entry is missing)
data = {'text': ["I love cooking!", "Baking is fun", None, "Japanese cuisine is great!"]}
df = pd.DataFrame(data)
print(df)

Output:

                         text
0             I love cooking!
1               Baking is fun
2                        None
3  Japanese cuisine is great!

Process missing values

Did you notice the value ‘None’ in one of the sample data instances? This is known as a missing value. Missing values creep into collected data for various reasons, often by accident. In short: you have to deal with them. The simplest approach is to detect and remove instances with missing values, as done in the code below:

# Drop rows where the 'text' column is missing
df.dropna(subset=['text'], inplace=True)
print(df)

Output:

                         text
0             I love cooking!
1               Baking is fun
3  Japanese cuisine is great!
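
If dropping rows feels too aggressive for your dataset, a common alternative is to keep them and impute the missing text instead. The snippet below is a minimal sketch of that option, filling missing values with an empty string; it is not applied in the rest of this tutorial.

# Alternative (not used below): keep every row and replace missing text with ''
df['text'] = df['text'].fillna('')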

Normalize the text to make it consistent

Normalizing text involves standardizing or unifying elements that may appear in different formats in different instances, for example date formats, full names, or case sensitivity. The simplest approach to normalizing our text is to convert everything to lowercase, as follows.

# Lowercase all text so that casing is consistent across instances
df['text'] = df['text'].str.lower()
print(df)

Output:

                         text
0             i love cooking!
1               baking is fun
3  japanese cuisine is great!
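
Lowercasing is only one kind of normalization. As a further illustrative sketch (not applied in the remainder of this tutorial), you could also trim surrounding whitespace and collapse repeated internal spaces:

# Illustrative only: trim leading/trailing whitespace and collapse repeated spaces
df['text'] = df['text'].str.strip()
df['text'] = df['text'].str.replace(r'\s+', ' ', regex=True)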

Remove noise

Noise is unnecessary or unexpectedly collected data that can hinder subsequent modeling or prediction processes if not handled properly. In our example, we assume that punctuation marks such as “!” are not needed for the subsequent NLP task, so we perform some noise removal by detecting and deleting punctuation marks in the text with a regular expression. The built-in Python module ‘re’ is used to perform text operations based on regular expression matching.

import re

# Remove punctuation: delete any character that is not a word character or whitespace
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
print(df)

Output:

                        text
0             i love cooking
1              baking is fun
3  japanese cuisine is great
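
Punctuation is just one kind of noise. Depending on where your text comes from, you may also want to strip patterns such as URLs or digits; the lines below are a purely illustrative sketch (they are no-ops on our tiny example) rather than part of the main pipeline:

# Illustrative only: remove URLs, then any digits
df['text'] = df['text'].apply(lambda x: re.sub(r'http\S+', '', x))
df['text'] = df['text'].apply(lambda x: re.sub(r'\d+', '', x))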

Tokenize the text

Tokenization is perhaps the most important text preprocessing step – along with encoding text into a numerical representation – before NLP and language models are used. It consists of splitting each text input into a vector of chunks or tokens. In the simplest scenario, tokens are usually associated with words, but in some cases, such as compound words, a single word can give rise to multiple tokens. Certain punctuation marks (if not previously removed as noise) are also sometimes identified as standalone tokens.

This code splits each of our three text inputs into separate words (tokens), adds them as a new column in our DataFrame, and then shows the updated data structure with its two columns. The simplified tokenization approach applied here is known as simple whitespace tokenization: it uses only whitespace as the criterion to detect and separate tokens.

# Split each text into a list of word tokens using whitespace
df['tokens'] = df['text'].str.split()
print(df)

Output:

                        text                          tokens
0             i love cooking              [i, love, cooking]
1              baking is fun               [baking, is, fun]
3  japanese cuisine is great  [japanese, cuisine, is, great]
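
Whitespace splitting is the simplest option. If you need a more linguistically aware tokenizer, NLTK’s word_tokenize is a common alternative; the sketch below assumes NLTK is installed and that the ‘punkt’ tokenizer resource is available, and it is not used in the rest of this tutorial:

import nltk
from nltk.tokenize import word_tokenize

# Illustrative alternative: a more robust tokenizer than plain whitespace splitting
nltk.download('punkt')
alt_tokens = df['text'].apply(word_tokenize)
print(alt_tokens)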

Remove stop words

Once the text is tokenized, we filter out unnecessary tokens. This is usually the case with stop words, such as articles “a/an, the” or conjunctions, which do not add any real semantics to the text and should be removed for efficient later processing. This process is language-dependent: the code below uses the NLTK library to download a dictionary of English stop words and filter them out of the token vectors.

import nltk
from nltk.corpus import stopwords

# Download the English stop-word list and filter those words out of each token vector
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])

Output:

0               [love, cooking]
1                 [baking, fun]
3    [japanese, cuisine, great]

Stemming and lemmatization

Almost done! Stemming and lemmatization are additional text preprocessing steps that can sometimes be used, depending on the specific task at hand. Stemming reduces each token (word) to its base or root form, while lemmatization reduces it to its lemma or basic dictionary form, which depends on the context, e.g. “best” -> “good”. For simplicity, we will only apply stemming in this example, using the rule-based PorterStemmer implemented in the NLTK library; the wordnet dataset of word-root associations downloaded in the code is what NLTK’s lemmatizer relies on. The resulting stemmed words are stored in a new column in the DataFrame.

from nltk.stem import PorterStemmer

# Reduce each token to its stem (root form) with the rule-based Porter stemmer
nltk.download('wordnet')
stemmer = PorterStemmer()
df['stemmed'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df[['tokens', 'stemmed']])

Output:

                       tokens                   stemmed
0             [love, cooking]              [love, cook]
1               [baking, fun]               [bake, fun]
3  [japanese, cuisine, great]  [japanes, cuisin, great]
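
For completeness, lemmatization with NLTK’s WordNetLemmatizer (this is what the wordnet resource downloaded above is for) would look roughly as follows. This is a sketch of the optional step, not something the rest of the tutorial depends on:

from nltk.stem import WordNetLemmatizer

# Illustrative only: reduce each token to its dictionary form (lemma) instead of a stem
lemmatizer = WordNetLemmatizer()
df['lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
print(df[['tokens', 'lemmatized']])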

Convert your text to numeric representations

Last but not least, computer algorithms, including AI/ML models, do not understand human language, but numbers. Therefore, we need to convert our word vectors into numerical representations, commonly known as embedding vectors or simply embeddings. The example below joins the tokenized text in the “tokens” column back into strings and uses a TF-IDF vectorization approach (one of the most popular approaches in the good old days of classical NLP) to turn the text into numerical representations.

from sklearn.feature_extraction.text import TfidfVectorizer

# Rebuild plain-text strings from the filtered tokens, then vectorize with TF-IDF
df['text'] = df['tokens'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
print(X.toarray())

Output:

[[0.         0.70710678 0.         0.         0.         0.         0.70710678]
 [0.70710678 0.         0.         0.70710678 0.         0.         0.        ]
 [0.         0.         0.57735027 0.         0.57735027 0.57735027 0.        ]]
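
To interpret which column of the matrix corresponds to which word, you can ask the vectorizer for its learned vocabulary; a short sketch:

# Map each column of the TF-IDF matrix back to the word it represents
print(vectorizer.get_feature_names_out())
# Should print the vocabulary in alphabetical order, e.g.:
# ['baking' 'cooking' 'cuisine' 'fun' 'great' 'japanese' 'love']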

And that’s it! As incomprehensible as it may seem to us, this numerical representation of our preprocessed text is what intelligent systems, including NLP models, do understand and can process exceptionally well for challenging language tasks such as classifying sentiment in text, summarizing it, or even translating it into another language.

The next step is to feed these numerical representations into our NLP model so it can do its magic.

Ivan Palomares Carrascosa is a thought leader, author, speaker and advisor in AI, machine learning, deep learning and LLMs. He trains and mentors others in leveraging AI in the real world.