Natural Language Processing

Natural Language Processing (NLP) is a branch of artificial intelligence that focuses on enabling computers to understand, interpret, and interact with human language in a meaningful way. It combines linguistics, computer science, and machine learning to process text and speech, allowing machines to analyze syntax, semantics, and context in written or spoken language. NLP is used for tasks such as sentiment analysis, language translation, chatbots, information extraction, and text summarization. While NLP focuses on understanding and interpreting language rather than predicting future events, it forms the foundation for applications that require machines to comprehend and respond to human communication in a natural, human-like manner.


Text Pre-Processing

Python has a popular module called nltk that is used for NLP tasks. In a security context, it can be used to enhance threat detection and response, for example by analyzing the text of phishing emails (see the examples below)

Install

pip3 # Python package installer for Python 3
install # Command that tells pip to install a package
nltk # The Natural Language Toolkit library (used for NLP tasks)

pip3 install nltk

Run this in Python

import nltk # Imports the Natural Language Toolkit (NLP library) into your Python script
nltk.download('all') # Downloads all available NLTK datasets, models, and corpora

import nltk
nltk.download('all')
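Downloading everything works but is large. If you prefer, you can fetch only the resources the examples in this section rely on; a minimal sketch (exact resource names can vary slightly between NLTK versions):

import nltk
# Tokenizer models, stopword list, WordNet, POS tagger model, and English word list
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger", "words"]:
    nltk.download(resource)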

Breaking Sentences Into Words

You can break unstructured data and natural language text into chunks of information using a tokenizer; these tokens can then be turned into numerical data structures for machine learning. E.g., you can break a sentence into words using the word_tokenize() method

Example

from nltk.tokenize import word_tokenize # Imports the word_tokenize function from NLTK’s tokenize module
print(word_tokenize("Please follow this link.")) # Tokenizes (splits) the sentence into individual words and punctuation, then prints the resulting list

from nltk.tokenize import word_tokenize
print(word_tokenize("Please follow this link."))

Output

['Please', 'follow', 'this', 'link', '.']
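NLTK's tokenize module also provides sent_tokenize() for splitting text into sentences rather than words; here is a minimal sketch using a made-up two-sentence string:

from nltk.tokenize import sent_tokenize # Imports the sentence tokenizer
print(sent_tokenize("Please follow this link. It expires soon.")) # Splits the text into a list of sentences
# ['Please follow this link.', 'It expires soon.']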

Finding Common Words

You can find the most common words in a sentence by counting token frequencies with the FreqDist() method

Example

from nltk.probability import FreqDist # Imports FreqDist class to calculate word frequency distribution
from nltk.tokenize import word_tokenize # Imports the word_tokenize function to split text into tokens
tokens = word_tokenize("Please follow this link.") # Tokenizes the sentence into individual words and punctuation marks
FreqDist(tokens).tabulate() # Creates a frequency distribution of the tokens and displays the counts in a formatted table

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Please follow this link.")
FreqDist(tokens).tabulate()

Output

 Please follow    this    link       . 
      1       1       1       1       1 
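Every token in this example appears exactly once, so the counts are not very interesting. FreqDist also behaves like a Counter, so most_common(n) returns the n most frequent tokens; a minimal sketch with a made-up sentence that has repeated words:

from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
tokens = word_tokenize("to be or not to be") # A sentence with repeated words
print(FreqDist(tokens).most_common(2)) # Prints the two most frequent tokens: [('to', 2), ('be', 2)]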

Finding Sentence Parts

If you want to find noun, pronoun, verb, adjective, adverb, preposition, conjunction, interjection, etc. tags in a sentence, you can use the pos_tag() method. You can review all the tags using nltk.help.upenn_tagset()

Example

from nltk import pos_tag # Imports the part-of-speech (POS) tagging function
from nltk.tokenize import word_tokenize # Imports the tokenizer to split text into words
tokens = word_tokenize("Please follow this link.") # Splits the sentence into individual tokens (words and punctuation)
for token in tokens: # Loops through each token
    print(pos_tag([token])) # Tags the token with its part of speech and prints it

from nltk import pos_tag
from nltk.tokenize import word_tokenize
tokens = word_tokenize("Please follow this link.")
for token in tokens:
    print(pos_tag([token]))

Output

[('Please', 'VB')]
[('follow', 'NN')]
[('this', 'DT')]
[('link', 'NN')]
[('.', '.')]
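To look up what a tag such as VB or NN actually means, you can query the tagset help directly; note that it prints every tag whose name matches the given pattern:

import nltk
nltk.help.upenn_tagset('VB') # Prints definitions and example words for tags matching 'VB' (VB, VBD, VBG, ...)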

Normalizing Words

If you want to normalize a word, you can use the PorterStemmer() method or lemmatize(). Stemming removes the last few characters of a word (it strips the suffix), whereas lemmatization replaces a word with its dictionary base form (it returns the lemma of the word). Search engines typically use these techniques to analyze the meaning of a word and return results that include all of its relevant forms. E.g., if you search for cars, you also get results for car. Bots use the same idea to understand the overall meaning of a sentence.

Example

from nltk.stem import PorterStemmer # Imports the Porter Stemmer algorithm for word stemming
for item in ["test", "tests", "testing", "tested"]: # Loops through each word in the list
    print(item, ": ", PorterStemmer().stem(item)) # Applies stemming to each word and prints the original word along with its stemmed (root) form

from nltk.stem import PorterStemmer
for item in ["test","tests","testing","tested"]:
    print(item, ": ",PorterStemmer().stem(item))

Output

test :  test
tests :  test
testing :  test
tested :  test

Example

from nltk.stem import WordNetLemmatizer # Imports the WordNet lemmatizer (uses vocabulary + morphology rules)
for item in ["test", "tests", "testing", "tested"]: # Loops through each word in the list
    print(item, ": ", WordNetLemmatizer().lemmatize(item)) # Lemmatizes (reduces to dictionary base form) each word and prints the original word with its lemma

from nltk.stem import WordNetLemmatizer
for item in ["test","tests","testing","tested"]:
    print(item, ": ", WordNetLemmatizer().lemmatize(item))

Output

test :  test
tests :  test
testing :  testing
tested :  tested
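lemmatize() treats words as nouns by default, which is why testing and tested pass through unchanged above. Passing pos="v" tells the lemmatizer to treat each word as a verb instead; the next example then shows how to automate this POS choice with pos_tag():

from nltk.stem import WordNetLemmatizer
for item in ["test", "tests", "testing", "tested"]:
    print(item, ": ", WordNetLemmatizer().lemmatize(item, pos="v")) # All four forms reduce to "test" when lemmatized as verbs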

Example

from nltk.stem import WordNetLemmatizer # Imports the WordNet lemmatizer
from nltk.corpus import wordnet # Imports WordNet corpus (provides POS constants)
from nltk import word_tokenize, pos_tag # Imports tokenizer and POS tagger
mapped = {
    "V": wordnet.VERB, # Maps POS tags starting with 'V' to VERB
    "J": wordnet.ADJ, # Maps POS tags starting with 'J' to ADJECTIVE
    "R": wordnet.ADV # Maps POS tags starting with 'R' to ADVERB
}
tokens = word_tokenize("caring") # Tokenizes the word
for token, tag in pos_tag(tokens): # Tags the token with its Penn Treebank POS tag (e.g., VBG, NN, JJ)
    tag = mapped.get(tag[0], wordnet.NOUN) # Looks at the first letter of the POS tag; if it exists in the mapped dictionary, use the corresponding WordNet POS, otherwise default to NOUN
    print(token, WordNetLemmatizer().lemmatize(token, tag)) # Lemmatizes the token using the correct POS

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from nltk import word_tokenize, pos_tag

mapped = {
    "V": wordnet.VERB,
    "J": wordnet.ADJ,
    "R": wordnet.ADV
}

tokens = word_tokenize("caring")
for token, tag in pos_tag(tokens):
    tag = mapped.get(tag[0], wordnet.NOUN)
    print(token, WordNetLemmatizer().lemmatize(token, tag))

Part-Of-Speech

POS stands for Part-Of-Speech, which is a grammatical category assigned to each word in a sentence. POS tagging tells you whether a word is a noun, verb, adjective, adverb, etc., based on its role in the sentence. The full Penn Treebank tagset is listed below; a short example of tagging a whole sentence at once follows the list.

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there 
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
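The earlier example tagged each token in isolation. Passing the whole token list to pos_tag() in a single call lets the tagger use the surrounding words as context, which generally gives more accurate tags; a minimal sketch:

from nltk import pos_tag # Imports the POS tagger
from nltk.tokenize import word_tokenize # Imports the word tokenizer
tokens = word_tokenize("Please follow this link.") # Tokenizes the sentence
print(pos_tag(tokens)) # Tags all tokens in one call, using sentence context for each tag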

Removing Stop Words

If you want to remove stopwords from a sentence, you can compare the words of the sentence against NLTK's built-in stopword list

Example

from nltk.tokenize import word_tokenize # Import the word tokenizer
from nltk.corpus import stopwords # Import stopwords list
tokens = word_tokenize("Please follow this link.") # Tokenize sentence into words
stop_words = set(stopwords.words('english')) # Get the set of English stopwords
filtered = [w for w in tokens if w.lower() not in stop_words] # Filter out tokens that are stopwords (case-insensitive)
print(filtered) # Print the filtered words

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
tokens = word_tokenize("Please follow this link.")
stop_words = set(stopwords.words('english'))
filtered = [w for w in tokens if w.lower() not in stop_words]
print(filtered)

Output

['Please', 'follow', 'link', '.']

Example #1

You can clean text using regex and nltk

import re # Import regular expressions for pattern-based text cleaning
from nltk.corpus import stopwords # Import list of common English stopwords
stop_words = set(stopwords.words('english')) # Build the stopword set once, not once per token
def clean_text(text):
    text = text.lower() # Convert all letters to lowercase so that 'This' and 'this' are treated the same
    text = re.sub(r'\d+', ' ', text) # Remove all digits/numbers by replacing them with a space
    text = re.sub(r'[^\w\s]', ' ', text) # Remove punctuation by replacing anything that is NOT a word character or whitespace with a space
    text = " ".join(w for w in text.split() if w not in stop_words) # Remove stopwords (common words like 'the', 'is', 'this')
    return text # Return the cleaned text
print(clean_text("Please follow this link.")) # Expected output: "please follow link"

import re
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()
    text = re.sub(r'\d+', ' ', text)
    text = re.sub(r'[^\w\s]', ' ', text)
    text = " ".join(w for w in text.split() if w not in stop_words)
    return text

print(clean_text("Please follow this link."))

Output

please follow link

Example #2

If you want to check a phishing email for broken (misspelled) words, you can do that with the nltk module's English word corpus

import nltk # Import NLTK library
words = set(nltk.corpus.words.words()) # Load the set of valid English words from the NLTK corpus
sentence = "Please followw this link." # Example sentence to check
errors = [] # List to store words not found in the dictionary (possible typos)
for w in nltk.wordpunct_tokenize(sentence): # Tokenize the sentence into words and punctuation
    if w.lower() in words or not w.isalpha(): # Check if the word is in the dictionary or is non-alphabetic (punctuation, numbers)
        pass # Word is correct or ignored
    else:
        errors.append(w) # Word is likely a typo
print("Error(s): ", len(errors)) # Print the number of errors found

import nltk 
words = set(nltk.corpus.words.words())
sentence = "Please followw this link."
errors = []
for w in nltk.wordpunct_tokenize(sentence):
    if w.lower() in words or not w.isalpha():
        pass
    else:
        errors.append(w)
print("Error(s): ", len(errors))

Output

Error(s): 1
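The errors list also contains the flagged words themselves, not just their count. A minimal sketch that wraps the same check in a reusable helper (the broken_words name is just an illustration):

import nltk
words = set(nltk.corpus.words.words()) # Valid English vocabulary from the NLTK words corpus
def broken_words(text):
    # Return the alphabetic tokens that do not appear in the English word list
    return [w for w in nltk.wordpunct_tokenize(text) if w.isalpha() and w.lower() not in words]
print(broken_words("Please followw this link.")) # ['followw']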