Complete Guide to Natural Language Processing (NLP) – with Practical Examples

Natural language processing (NLP) is the field concerned with enabling computers to understand and work with human language. NLP allows you to perform a wide range of tasks such as classification, summarization, text generation, translation and more.

NLP has advanced so much in recent times that AI can write its own movie scripts, create poetry, summarize text and answer questions for you from a piece of text. This article will help you understand basic and advanced NLP concepts and show you how to implement them using popular NLP libraries: spaCy, Gensim, Hugging Face Transformers and NLTK.

Natural Language Processing. Art by Frances Hodgkins (d. 1947)

Introduction

It is often estimated that more than 80% of the data available today is unstructured. But what exactly is unstructured data, you ask?

Texts, videos and images that cannot be represented in tabular form (or in any consistent structured data model) constitute unstructured data. They hold the key to uncovering valuable information.

To process and interpret unstructured text data, we use NLP.

NLP has a wide range of applications in the real world.

  1. Sentiment Analysis
  2. Speech Recognition
  3. Text Classification
  4. Machine Translation
  5. Semantic Search
  6. News/Article Summarization
  7. Question Answering

It is in fact what drives your favorite voice assistants (like Google Assistant and Amazon Alexa), chatbots, language translation, article summarization and so on.

In this article, you will go from the basic concepts of NLP to more advanced ones and implement state-of-the-art tasks such as text summarization and classification.

Install and Load Main Python Libraries for NLP

Python has a large collection of NLP libraries. If you ask me to pick the most important ones, here they are. Using these, you can accomplish nearly all NLP tasks efficiently. This is all you need. Well, mostly.

1. Natural Language Toolkit (NLTK)

It can be installed and imported as shown:

# Install
!pip install nltk

Import the package and download the tokenizer data.

# importing nltk
import nltk
nltk.download('punkt')  # newer NLTK releases may ask for 'punkt_tab' instead

2. spaCy

It is one of the most popular and advanced libraries for implementing NLP today. It has many distinct features that provide a clear advantage for processing text data and modeling.

Let me show you how to import spaCy and create an nlp object. To load the models and data for the English language, use spacy.load('en_core_web_sm'). The model has to be downloaded once beforehand, for example with python -m spacy download en_core_web_sm.

# Importing spaCy and creating nlp object
import spacy
nlp = spacy.load('en_core_web_sm')

The nlp object is referred to as a language model instance.

3. Gensim

It was developed for topic modelling. It also supports NLP tasks such as word embeddings; text summarization was available in releases before Gensim 4.0.

!pip install gensim

# Importing gensim
import gensim

4. transformers

It was developed by Hugging Face and provides state-of-the-art models. It is an advanced library known for its transformer models and is under active development.

# Installing the package
!pip install transformers

In this article, you will learn how to use these libraries for various NLP tasks.

Text Pre-processing

The raw text data, often referred to as a text corpus, has a lot of noise. It contains punctuation, suffixes and stop words that carry little information. Text pre-processing involves preparing the text corpus to make it more usable for NLP tasks.

Let me walk you through some basic steps of pre-processing a raw text corpus.

A text can be converted into a spaCy Doc object as shown.

# Converting text into a spaCy object
raw_text = 'NLP is a very powerful tool'
text_doc = nlp(raw_text)

Next, an important term in NLP is ‘token’.

What is a Token?

Tokens are the units, such as words, numbers and punctuation marks, into which a text document/file is split.

What is Tokenization?

The process of extracting tokens from a text file/document is referred to as tokenization.

You can print the text of each token by accessing token.text while using spaCy.

# printing tokens
for token in text_doc:
    print(token.text)
NLP
is
a
very
powerful
tool
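
For comparison, here is a minimal, library-free sketch of what tokenization does. It is only an approximation: the regular expression and the function name simple_tokenize are assumptions made for illustration, and spaCy's own tokenizer applies many more rules (contractions, abbreviations, URLs and so on).

```python
import re

# Hypothetical helper, not part of spaCy or NLTK.
# \w+      matches a run of letters, digits or underscores (a word or number)
# [^\w\s]  matches one character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(simple_tokenize("NLP is a very powerful tool"))
print(simple_tokenize("I love you 3000!"))
```

For the first sentence this sketch yields the same six tokens as the spaCy output above. On harder input such as "don't" the two would differ, which is one reason to rely on a real tokenizer.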

spaCy exposes many attributes on each token.

What if you want to know whether a particular token consists of alphabetic characters?

You can find out by printing token.is_alpha.

# Checking if the tokens of a text are alphabetic
mixed_text = 'I love you 3000!'
mixed_text_doc = nlp(mixed_text)

for token in mixed_text_doc:
    print(token.text, token.is_alpha)
I True
love True
you True
3000 False
! False

Similarly, what if you want to know whether a particular token is whitespace, a stop word or punctuation?

All this information is stored in attributes of the token. is_stop returns True if the token is a stop word (stop words are ‘It’, ‘was’, etc.).

is_punct returns True if the token is punctuation (!, ., etc.).

Let’s see an example of how to access them.

# Accessing information about tokens through the attributes

random_text='It was amazing! It is a great pleasure to be a part of the band.'
random_text_doc=nlp(random_text)
print("Text".ljust(10), ' ', "Alpha", "Space", "Stop", "Punct")
for token in random_text_doc:
    print(token.text.ljust(10), ':', token.is_alpha, token.is_space, token.is_stop, token.is_punct)
Text         Alpha Space Stop Punct
It         : True False True False
was        : True False True False
amazing    : True False False False
!          : False False False True
It         : True False True False
is         : True False True False
a          : True False True False
great      : True False False False
pleasure   : True False False False
to         : True False True False
..(truncated)

You can see the details of each token similarly.

Tokenization

Let us look at another example on a larger amount of text. Let’s say you have text data on a product, Alexa, and you wish to analyze it.

Below you can see the code for tokenizing this text, printing the tokens and counting them.

# printing all the tokens of a text
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).(truncated...).'
text_doc=nlp(raw_text)

token_count=0

for token in text_doc:
    print(token.text)
    token_count+=1
Amazon
Alexa
,
also
known
simply
(..truncated)

Let us also print the total number of tokens.

print('No of tokens originally',token_count)
No of tokens originally 708

So, you can see we have a lot of tokens here. Stop words like ‘it’, ‘was’, ‘that’, ‘to’ and so on do not give us much information, especially for models that look at which words are present and how many times they are repeated. The same goes for punctuation.

In large text files, stop words and punctuation occur very frequently, which can mislead us into thinking they are important. So, it is necessary to filter them out.

spaCy has a built-in list of stop words. You can view it as shown below.

# Accessing the built-in stop words in spaCy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS
list_stopwords=list(stopwords)

# printing a few entries through slicing (a set is unordered, so yours may differ)
for word in list_stopwords[:7]:
    print(word)
see
whom
through
whatever
may
go
regarding

How to remove the stop words and punctuation

As we already established, stop words need to be removed when performing frequency analysis.

In the same text data about the product Alexa, I am going to remove the stop words.

You can use is_stop to identify the stop words and remove them with the code below.

# Raw text document
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).[6][7]As of November 2018, Amazon had more than 10,000 employees working on Alexa and related products.[8] In January 2019, Amazons devices team announced that they had sold over 100 million Alexa-enabled devices.[9]In September 2019, Amazon launched many new devices achieving many records while competing with the worlds smart home industry. The new Echo Studio became the first smart speaker with 360 sound and Dolby sound. 
Other new devices included an Echo dot with a clock behind the fabric, a new third-generation Amazon Echo, Echo Show 8, a plug-in Echo device, Echo Flex, Alexa built-in wireless earphones, Echo buds, Alexa built-in spectacles, Echo frames, an Alexa built-in Ring, and Echo Loop.Alexa can perform a number of preset functions out-of-the-box such as set timers, share the current weather, create lists, access Wikipedia articles, and many more things. Users say a designated "wake word" (the default is simply "Alexa") to alert an Alexa-enabled device of an ensuing function command. Alexa listens for the command and performs the appropriate function, or skill, to answer a question or command. Alexas question answering ability is partly powered by the Wolfram Language. When questions are asked, Alexa converts sound waves into text which allows it to gather information from various sources. Behind the scenes, the data gathered is then parsed by Wolfram technology to generate suitable and accurate answers. Alexa-supported devices can stream music from the owners Amazon Music accounts and have built-in support for Pandora and Spotify accounts. Alexa can play music from streaming services such as Apple Music and Google Play Music from a phone or tablet.In addition to performing pre-set functions, Alexa can also perform additional functions through third-party skills that users can enable. Some of the most popular Alexa skills in 2018 included "Question of the Day" and "National Geographic Geo Quiz" for trivia; "TuneIn Live" to listen to live sporting events and news stations; "Big Sky" for hyper local weather updates; "Sleep and Relaxation Sounds" for listening to calming sounds; "Sesame Street" for children entertainment; and "Fitbit" for Fitbit users who want to check in on their health stats. In 2019, Apple, Google, Amazon, and Zigbee Alliance announced a partnership to make smart home products work together.'
text_doc=nlp(raw_text)

token_count_without_stopwords=0

# Filtering out the stopwords
filtered_text= [token for token in text_doc if not token.is_stop]

# Counting the tokens after removal of stopwords
for token in filtered_text:
    print(token)
    token_count_without_stopwords+=1
Amazon
Alexa
,
known
simply
Alexa
,
virtual
assistant
AI
technology
developed
Amazon
,
Amazon
Echo
smart
speakers
developed
Amazon
Lab126
.
capable
voice
interaction
,
..(truncated...)

You can see that the stop words have been removed. To understand how much effect this has, let us print the number of tokens after removing stop words.

print('No of tokens after removing stopwords', token_count_without_stopwords)
No of tokens after removing stopwords 488

You can observe that there is a significant reduction in the number of tokens.

Next, let us remove the punctuation too. is_punct is used to identify punctuation. See the example below.

# Removing punctuations
filtered_text=[token for token in filtered_text if not token.is_punct]

token_count_without_stop_and_punct=0

# Counting the new no of tokens
for token in filtered_text:
    print(token)
    token_count_without_stop_and_punct += 1
Amazon
Alexa
known
simply
Alexa
virtual
assistant
AI
technology
developed
Amazon
Amazon
Echo
smart
speakers
developed
Amazon
(..truncated..)

You can observe that all punctuation is gone. Let us print the number of tokens now.

print('No of tokens after removing stopwords and punctuations', token_count_without_stop_and_punct)
No of tokens after removing stopwords and punctuations 361

The count is reduced considerably; only the significant words remain, which will help us understand the text.
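
The two filtering passes can also be combined into a single pass. The sketch below works on plain strings so that it runs without a model; the tiny STOP_WORDS set and the helper name keep_content_words are assumptions for illustration only (spaCy's built-in list is much longer). With a spaCy Doc, the equivalent condition would be not token.is_stop and not token.is_punct.

```python
import string

# Illustrative stop-word set (an assumption; far smaller than spaCy's list).
STOP_WORDS = {"it", "is", "a", "was", "to", "be", "of", "the", "also", "as", "by", "in"}

def keep_content_words(tokens):
    """Drop stop words (case-insensitive) and pure-punctuation tokens in one pass."""
    kept = []
    for tok in tokens:
        is_stop = tok.lower() in STOP_WORDS
        is_punct = all(ch in string.punctuation for ch in tok)
        if not is_stop and not is_punct:
            kept.append(tok)
    return kept

tokens = ["Amazon", "Alexa", ",", "also", "known", "simply", "as", "Alexa", ",",
          "is", "a", "virtual", "assistant", "AI", "technology", "."]
content = keep_content_words(tokens)
print(content)
print(len(tokens), "->", len(content))
```

Comparing the lengths before and after, as in the last line, is a quick way to see how much of a text is made up of stop words and punctuation.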

Now that you have relatively better text for analysis, let us look at a few other text preprocessing methods.

Stemming

Let us look at some words, say ‘calculator’, ‘calculating’ and ‘calculation’. We know that these words are closely related. How do we ensure that algorithms know this too?

It can be achieved through stemming. Stemming means reducing a word to its ‘root form’.

Let us see an example of how to implement stemming using NLTK’s PorterStemmer.

# Import stemmer and perform stemming on a list of words
from nltk.stem import PorterStemmer
ps = PorterStemmer()
words = ['dance','dances', 'dancing','danced']  # no trailing spaces, or the suffix rules cannot match

for word in words:
    print(word.ljust(8),'-------', ps.stem(word))
dance    ------- danc
dances   ------- danc
dancing  ------- danc
danced   ------- danc

You can observe that inflected forms such as ‘dances’ and ‘danced’ are reduced to the stem ‘danc’.
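
To see why a stem need not be a dictionary word, consider a deliberately naive suffix stripper. This is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list and the name naive_stem are assumptions chosen only to illustrate the idea.

```python
# Suffixes are tried in this order, longer ones first.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for word in ["dance", "dances", "dancing", "danced"]:
    print(word.ljust(8), "-------", naive_stem(word))
```

Purely mechanical suffix removal is fast but blind to vocabulary, which is how fragments such as ‘danc’ come about.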

But ‘danc’ is not a valid dictionary word. Stemming may give us root forms that are not present in the dictionary. How do we overcome this?

That’s where lemmatization comes to the rescue.

Lemmatization

It is similar to stemming, except that the resulting root word (the lemma) is a valid, meaningful word.

There are different ways to perform lemmatization. I’ll show lemmatization using NLTK and spaCy in this article.

Lemmatization through NLTK

A commonly used lemmatization technique is the WordNetLemmatizer from the NLTK library.

Let me show a simple example of how it works.

# download wordnet
nltk.download('wordnet')

# Importing the lemmatizer
from nltk.stem import WordNetLemmatizer 

# Initialize the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()

# Lemmatizing a single word
print(lemmatizer.lemmatize("dances"))

dance

You can see that dances ==> dance. Now let us move to a more realistic case and lemmatize a longer text.

Let us say you have text data on the performance of various models.

# Tokenize
my_text='The model AIM has a performce accuracy of 87% and it is redady to be released. The release is going to be held on day 12.3.2000 and  it has amzing performance recod.Model AIM2.0 has accuracy of 89% and better performing expectations are there.le us see how it performs'

words=nltk.word_tokenize(my_text)

print(words)
['The', 'model', 'AIM', 'has', 'a', 'performce', 'accuracy', 'of', '87', '%', 'and', 'it', 'is', 'redady', 'to', 'be', 'released', '.', 'The', 'release', 'is', 'going', 'to', 'be', 'held', 'on', 'day', '12.3.2000', 'and', 'it', 'has', 'amzing', 'performance', 'recod.Model', 'AIM2.0', 'has', 'accuracy', 'of', '89', '%', 'and', 'better', 'performing', 'expectations', 'are', 'there.le', 'us', 'see', 'how', 'it', 'performs']

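You can observe the tokens printed. The code below performs lemmatization and prints the final lemmatized text. Note that lemmatize() treats every word as a noun unless a part of speech is passed through its pos argument, which is why ‘has’ becomes ‘ha’ and ‘us’ becomes ‘u’ in the output.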

lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in words])
print(lemmatized_output)
The model AIM ha a performce accuracy of 87 % and it is redady to be released . The release is going to be held on day 12.3.2000 and it ha amzing performance recod.Model AIM2.0 ha accuracy of 89 % and better performing expectation are there.le u see how it performs

Lemmatization using spaCy

In spaCy, the token object has an attribute .lemma_ which gives you the lemmatized version of that token. See the example below.

# Lemmatization using spaCy

text='dance dances dancing danced'
text_doc=nlp(text)

for token in text_doc:
    print(token.text,'-------',token.lemma_)
dance ------- dance
dances ------- dance
dancing ------- dance
danced ------- dance

Here, all words are reduced to ‘dance’, which is meaningful and just as required.
Lemmatization is therefore often preferred over stemming.

Let me show you how to convert a text into its lemmatized form.

# lemmatizing a text document using spacy

text=' I had calculated it wrong.I am calculating the bill again.'
text_doc=nlp(text)

# Remove punctuation only; stop words are kept, as the output below shows.
# (Add "and not token.is_stop" to the condition to drop stop words as well.)
filter_text=[token for token in text_doc if not token.is_punct]


" ".join([token.lemma_ for token in filter_text])
'  -PRON- have calculate -PRON- wrong -PRON- be calculate the bill again'

You can see that the words are lemmatized. Note that the -PRON- placeholder for pronouns is produced by spaCy 2.x; spaCy 3 returns the pronoun itself instead.

Word Frequency Analysis

Once the stop words are removed and lemmatization is done, the remaining tokens can be analysed further for information about the text data.

The words which occur most frequently in a text often hold the key to its core topic. So, we shall store all tokens along with their frequencies.

You can use Counter to get the frequency of each token as shown below. If you pass a list to Counter, it returns a dictionary-like object that maps each element to its frequency.

# import Counter
from collections import Counter

# creating nlp obj using spaCy
data='It is my birthday today. I could not have a birthday party. I felt sad'
data_doc=nlp(data)

#creating a list of tokens after preprocessing
list_of_tokens=[token.text for token in data_doc if not token.is_stop and not token.is_punct]

#passing the list of tokens to Counter
token_frequency=Counter(list_of_tokens)
print(token_frequency)
Counter({'birthday': 2, 'today': 1, 'party': 1, 'felt': 1, 'sad': 1})

From the above output, the word ‘birthday’ is the most common. So, it gives us an idea that the text is about a birthday.

Let us take a similar case, but with a larger amount of data. For example, you have text data about a particular place, and you want to know the important factors.

How would you find out?

# Word Frequency
longtext='Gangtok is a city, municipality, the capital and the largest town of the Indian state of Sikkim. It is also the headquarters of the East Sikkim district. Gangtok is in the eastern Himalayan range, at an elevation of 1,650. The towns population of 100000 are from different ethnicities such as Bhutia, Lepchas and Indian Gorkhas. Within the higher peaks of the Himalaya and with a year-round mild temperate climate, Gangtok is at the centre of Sikkims tourism industry.Gangtok rose to prominence as a popular Buddhist pilgrimage site after the construction of the Enchey Monastery in 1840. In 1894, the ruling Sikkimese Chogyal, Thutob Namgyal, transferred the capital to Gangtok. In the early 20th century, Gangtok became a major stopover on the trade route between Lhasa in Tibet and cities such as Kolkata (then Calcutta) in British India. After India won its independence from Britain in 1947, Sikkim chose to remain an independent monarchy, with Gangtok as its capital. In 1975, after the integration with the union of India, Gangtok was made Indias 22nd state capital.Like the rest of Sikkim, not much is known about the early history of Gangtok. The earliest records date from the construction of the hermitic Gangtok monastery in 1716.[7] Gangtok remained a small hamlet until the construction of the Enchey Monastery in 1840 made it a pilgrimage center. It became the capital of what was left of Sikkim after an English conquest in the mid-19th century in response to a hostage crisis. After the defeat of the Tibetans by the British, Gangtok became a major stopover in the trade between Tibet and British India at the end of the 19th century. Most of the roads and the telegraph in the area were built during this time.In 1894, Thutob Namgyal, the Sikkimese monarch under British rule, shifted the capital from Tumlong to Gangtok, increasing the citys importance. A new grand palace along with other state buildings was built in the new capital. 
Following India independence in 1947, Sikkim became a nation-state with Gangtok as its capital. Sikkim came under the suzerainty of India, with the condition that it would retain its independence, by the treaty signed between the Chogyal and the then Indian Prime Minister Jawaharlal Nehru.[9] This pact gave the Indians control of external affairs on behalf of Sikkimese. Trade between India and Tibet continued to flourish through the Nathula and Jelepla passes, offshoots of the ancient Silk Road near Gangtok. These border passes were sealed after the Sino-Indian War in 1962, which deprived Gangtok of its trading business.[10] The Nathula pass was finally opened for limited trade in 2006, fuelling hopes of economic boomIn 1975, after years of political uncertainty and struggle, including riots, the monarchy was abrogated and Sikkim became India twenty-second state, with Gangtok as its capital after a referendum. Gangtok has witnessed annual landslides, resulting in loss of life and damage to property. The largest disaster occurred in June 1997, when 38 were killed and hundreds of buildings were destroyed'
long_text=nlp(longtext)


list_of_tokens=[token.text for token in long_text if not token.is_stop and not token.is_punct]
token_frequency=Counter(list_of_tokens)
print(token_frequency)
Counter({'Gangtok': 18, 'capital': 9, 'Sikkim': 8, 'India': 8, 'state': 5, 'Indian': 4, 'British': 4, 'construction': 3, 'Sikkimese': 3, 'century': 3, 'trade': 3, 'Tibet': 3, 'independence': 3, 'largest': 2, 'pilgrimage': 2, 'Enchey': 2, 'Monastery': 2, '1840': 2, '1894': 2, 'Chogyal': 2, 'Thutob': 2, 'Namgyal': 2, 'early': 2, 'major': 2, 'stopover': 2, '1947': 2, 'monarchy': 2, '1975': 2, 'built': 2, 'new': 2, 'buildings': 2, 'Nathula': 2, 'passes': 2, 'city': 1, 'municipality': 1, 'town': 1, 'headquarters': 1, 'East': 1, 'district': 1, 'eastern': 1, 'Himalayan': 1, 'range': 1, 'elevation': 1, '1,650': 1, 'towns': 1, 'population': 1, '100000': 1, 'different': 1, 'ethnicities': 1, 'Bhutia': 1, 'Lepchas': 1, 'Gorkhas': 1, 'higher': 1, 'peaks': 1, 'Himalaya': 1, 'year': 1, 'round': 1, 'mild': 1, 'temperate': 1, 'climate': 1, 'centre': 1, 'Sikkims': 1, 'tourism': 1, 'industry': 1, 'rose': 1, 'prominence': 1, 'popular': 1, 'Buddhist': 1 (..truncated..)})

As you can see, as the size of the text data increases, it becomes difficult to analyse the frequency of all tokens. So, you can print the n most common tokens using the most_common() method of Counter.

See the example below on how to print the 6 most frequent words of the text.

# printing the 6 most common words using the most_common() method of Counter

most_frequent_tokens=token_frequency.most_common(6)
print(most_frequent_tokens)
[('Gangtok', 18), ('capital', 9), ('Sikkim', 8), ('India', 8), ('state', 5), ('Indian', 4)]

You see that the keywords are Gangtok, Sikkim, India and so on. It gives us an idea of the text to start with.

It helps you find the main keywords. Hence, frequency analysis of tokens is an important method in text processing.
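
One caveat when counting raw token text: Counter treats ‘Gangtok’ and ‘gangtok’ as different keys. A common remedy is to normalise case (or to count lemmas) before counting. The sketch below uses a hand-written token list so that it runs on its own; the list and the variable names are assumptions for illustration. With spaCy you could build the list from token.lower_ or token.lemma_ instead.

```python
from collections import Counter

# Hand-written tokens standing in for a preprocessed document (illustrative only).
tokens = ["Gangtok", "capital", "Sikkim", "gangtok", "Capital", "India", "Gangtok"]

# Counting the raw strings keeps differently cased variants apart.
raw_counts = Counter(tokens)

# Lower-casing first merges the variants into one key each.
normalised_counts = Counter(tok.lower() for tok in tokens)

print(raw_counts.most_common(2))
print(normalised_counts.most_common(2))
```

Whether to normalise depends on the task: for keyword extraction it usually helps, while for named entities the original casing can carry information.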

Let us look at more methods to understand the text data.

Part of Speech Tagging

Each word has its own role in a sentence. For example, in ‘Geeta is dancing’, Geeta is a person, so it is a ‘Noun’, and dancing is the action performed by her, so it is a ‘Verb’. Likewise, each word can be classified. This is referred to as POS or Part of Speech tagging.

What are the parts of Speech?

  • Noun
  • Verb
  • Adjective
  • Pronoun
  • Adverb
  • Preposition
  • Conjunction
  • Interjection

and so on.

POS tagging can be implemented using both spaCy and NLTK. First, let us see the method using spaCy.

In spaCy, the POS tag is stored as an attribute of the Token object. You can access the POS tag of a particular token through the token.pos_ attribute.

See the example below.

text='Nitika was screaming out loud as usual Her roommate kept ignoring her '
text_doc=nlp(text)

for token in text_doc:
    print(token.text.ljust(10),'-----',token.pos_)
Nitika ----- PROPN
was ----- AUX
screaming ----- VERB
out ----- ADP
loud ----- ADV
as ----- SCONJ
usual ----- ADJ
Her ----- DET
roommate ----- NOUN
kept ----- VERB
ignoring ----- VERB
her ----- PRON

It is very easy, as the tag is already available as an attribute of the token. Features like this are why many prefer spaCy over NLTK.

POS tagging in a text file

In real life, you will stumble across huge amounts of data in the form of text files.

How do you perform part-of-speech analysis on a text file?

Say you have a text file on robotics. See the code below to understand how to work with text files.

# opening the text file and reading it into a string

with open('robotics.txt') as file:
    robot_text=file.read()

Now, the content of the text file is stored in the string robot_text.

Next, like we did before, you have to pre-process the text.

# Preprocessing the text
robot_doc=nlp(robot_text)
robot_doc=[token for token in robot_doc if not token.is_stop and not token.is_punct]

What if you want to know what kinds of tokens are present in your text?

You can print them with the help of token.pos_ as shown in the code below.

# creating a dictionary with parts of speech present in the doc
all_tags = {token.pos: token.pos_ for token in robot_doc}
print(all_tags)
{92: 'NOUN', 84: 'ADJ', 100: 'VERB', 103: 'SPACE', 86: 'ADV', 96: 'PROPN', 93: 'NUM', 98: 'SCONJ', 101: 'X', 85: 'ADP'}

What if you want to extract all the nouns present in the text?

You can categorize the tokens depending on their POS tags. The example below demonstrates how to print all the nouns and verbs in robot_doc.

# creating list to store nouns and verbs
nouns=[]
verbs=[]

# appending tokens to the lists according to their POS category
for token in robot_doc:
    if token.pos_ =='NOUN':
        nouns.append(token)
    if token.pos_ =='VERB':
        verbs.append(token)

print('List of Nouns in the text\n',nouns)
print('List of verbs in the text\n',verbs)
List of Nouns in the text
 [Robotics, research, area, interface, computer, science, engineering, Robotics, design, construction, operation, use, robots, goal, robotics, machines, humans, day, day, lives, Robotics, achievement, information, engineering, computer, engineering, engineering, engineering, Robotics, machines, humans, actions, Robots, situations, lots, purposes, today, environments, inspection, materials, bomb, detection, deactivation, manufacturing, processes, humans, space, underwater, heat, containment, materials, radiation, Robots, form, humans, appearance, acceptance, robot, behaviors, people, robots, lifting, speech, cognition, activity, today, robots, nature, field, bio, (..truncated..)]
List of verbs in the text
 [involves, design, help, assist, draws, develops, substitute, replicate, including, survive, clean, resemble, said, help, performed, attempt, replicate, walking, inspired, contributing, inspired, creating, operate, dates, grow, assumed, robots, mimic, manage, growing, continue, researching, building, serve, built, defusing, finding, exploring, injected, revolutionize, involves, overlaps, derived, introduced, published, comes, means, begins, makes, called, mistaken, coin, wrote, named, According, published, coining, assumed, referred, states, introduced, predates, cited, share, comes, designed, achieve, designed, travel, use, completing, g, wheeled, designed, maneuver, function,(..truncated..)]

All the tokens which are nouns have been added to the list nouns, and the verbs to the list verbs.
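
When more than two categories are of interest, keeping one list per tag becomes unwieldy, and grouping the tokens by tag in a dictionary scales better. The sketch below uses (text, tag) pairs copied from the spaCy output of the ‘Nitika’ sentence shown earlier, so that it runs without a model; group_by_pos is a hypothetical helper name. With a real Doc you would pass (token.text, token.pos_) pairs.

```python
from collections import defaultdict

def group_by_pos(tagged_tokens):
    """Map each POS tag to the list of token texts carrying that tag."""
    groups = defaultdict(list)
    for text, tag in tagged_tokens:
        groups[tag].append(text)
    return dict(groups)

# (text, tag) pairs taken from the printed output for the 'Nitika' sentence.
tagged = [("Nitika", "PROPN"), ("was", "AUX"), ("screaming", "VERB"),
          ("out", "ADP"), ("loud", "ADV"), ("as", "SCONJ"), ("usual", "ADJ"),
          ("Her", "DET"), ("roommate", "NOUN"), ("kept", "VERB"),
          ("ignoring", "VERB"), ("her", "PRON")]

groups = group_by_pos(tagged)
print(groups["VERB"])
print(groups["NOUN"])
```

The same dictionary also answers the earlier question of which tags occur at all, through its keys.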

In some cases, you may not need the verbs or numbers, when your information lies in nouns and adjectives.

In the current example, there is a POS tag named ‘X’. Let us look at which tokens belong to this category.

# Printing the tokens tagged 'X'

for token in robot_doc:
    if token.pos_=='X':
        print(token.text)
i.e.
etc
etc
etc
etc

It is clear that the tokens of this category are not significant. So, you can remove them.

How to remove tokens of certain POS categories?

You can write a function remove_pos that returns True if the token belongs to a junk POS category and False otherwise.

The code below removes the tokens of category ‘X’ and ‘SCONJ’.

# Storing the junk POS tags  
junk_pos=['X','SCONJ']

# Function to check if the token falls in the junk POS category.
def remove_pos(word):
    return word.pos_ in junk_pos

# Creating a new doc without the unwanted tokens
revised_robot_doc=[token for token in robot_doc if not remove_pos(token)]

# Printing POS tags present in the new document.                   
all_tags = {token.pos: token.pos_ for token in revised_robot_doc}
print(all_tags)
{92: 'NOUN', 84: 'ADJ', 100: 'VERB', 103: 'SPACE', 86: 'ADV', 96: 'PROPN', 93: 'NUM', 85: 'ADP'}

You can see that in the new document the X and SCONJ tags are not there. So, mission successful!

For a better understanding, you can visualize the parse with spaCy’s displacy module.

# Using displacy (rendering inline requires a Jupyter notebook)
from spacy import displacy

with open('/content/robotics.txt') as file:
    text=file.read()
doc=nlp(text)
displacy.render(doc,style='dep',jupyter=True)

NLP - Usage of spaCy’s displacy

Dependency Parsing

In a sentence, the words have a relationship with each other. The one word in a sentence that does not depend on any other word is called the head or root word. All the other words depend, directly or indirectly, on the root word; they are termed dependents.

Dependency parsing is the method of analyzing the relationships/dependencies between the different words of a sentence.

Usually, in a sentence, the main verb is the root word.

You can access the dependency label of a token through the token.dep_ attribute.

token.dep_ gives the dependency tag of each token.

What is the meaning of these dependency tags?

  • nsubj: nominal subject of the clause
  • ROOT: the head word of the sentence, which does not depend on any other word (generally the main verb)
  • prep: prepositional modifier; it modifies the meaning of a noun, verb, adjective, or preposition
  • cc and conj: a coordinating conjunction (and, or, but) and the word it conjoins (the conjunct)
  • pobj: object of a preposition
  • aux: auxiliary verb
  • dobj: direct object of the verb
  • det: determiner (for example a, an, the)
  • poss: possession modifier (for example her in "her leg")

Let me show an example printing the dependencies of a sentence.

# Printing the dependency of each token
my_text='Ardra fell into a well and fractured her leg'
my_doc=nlp(my_text)

for token in my_doc:
    print(token.text,'---',token.dep_)
Ardra --- nsubj
fell --- ROOT
into --- prep
a --- det
well --- pobj
and --- cc
fractured --- conj
her --- poss
leg --- dobj

Clearly, the verb “fell” is the ROOT word, as expected.

For a better understanding of the dependencies, you can use spaCy’s displacy module on our doc object.

from spacy import displacy
displacy.render(my_doc ,style='dep', jupyter=True)

What if you want to know the head word of a particular token?

In spacy, you can access the head word of every token through token.head.text. The root word of a sentence is its own head.

The example below shows this.

# Accessing headword of each token
my_text="The robot displayed its functions on the big screen  The audience applauded"
my_doc=nlp(my_text)

for token in my_doc:
    print(token.text,'---',token.head.text)
The --- robot
robot --- displayed
displayed --- displayed
its --- functions
functions --- displayed
on --- displayed
the --- screen
big --- screen
screen --- on
  --- screen
The --- audience
audience --- applauded
applauded --- applauded
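Because the root is its own head, you can follow token.head repeatedly to walk from any token up to the root. The sketch below illustrates the idea. The helper name path_to_root is hypothetical, and the small document is annotated by hand to mirror the first sentence of the output above, so no trained model is needed to run it.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

def path_to_root(token):
    """Return the texts of the tokens visited while following .head up to the root."""
    path = [token.text]
    # The root token is its own head, so stop once head and token coincide
    while token.head.i != token.i:
        token = token.head
        path.append(token.text)
    return path

# Hand-built parse (assumption: copied from the head words printed above)
words = ["The", "robot", "displayed", "its", "functions", "on", "the", "big", "screen"]
heads = [1, 2, 2, 4, 2, 2, 8, 8, 5]  # absolute index of each token's head
deps = ["det", "nsubj", "ROOT", "poss", "dobj", "prep", "det", "amod", "pobj"]
head_demo_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# Walk upwards starting from the token "big"
print(path_to_root(head_demo_doc[7]))
```

With a parsed document, you would call path_to_root(my_doc[i]) on any token of my_doc.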

Let us see an example of dependency parsing on a real document.

I will use the same text file on robotics for this too.

How to print the root word of every sentence?

Each sentence in doc.sents exposes its root token through the .root attribute, as the example below shows.

# Printing root word of each sentence of the doc
with open('robotics.txt') as file:
    robot_text=file.read()
robotics_doc=nlp(robot_text)

sentences=list(robotics_doc.sents)

for sentence in sentences:
    print(sentence.root)
is
involves
is
draws
develops
used
take
said
attempt

(output truncated)

How to navigate in a dependency tree

Let me show you how to access the children of a particular token, that is, the tokens that directly depend on it.

print([token.text for token in robotics_doc[5].children])
['an', 'interdisciplinary', 'research', 'at']

How to get the previous neighboring token of a token?

print (robotics_doc[5].nbor(-1))
research

How to print the token that is 5 positions ahead of a token?

print (robotics_doc[5].nbor(5))
computer
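Besides children and nbor, spacy tokens expose a few more navigation attributes: lefts and rights (children before and after the token), subtree (the token plus all its descendants) and ancestors (the chain of heads up to the root). Here is a minimal sketch on a hand-annotated five-word document; the heads and labels are assumptions written out manually, so the snippet runs without a trained model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-built parse of a short sentence (hypothetical annotation for illustration)
words = ["The", "robot", "displayed", "its", "functions"]
heads = [1, 2, 2, 4, 2]  # absolute index of each token's head
deps = ["det", "nsubj", "ROOT", "poss", "dobj"]
tree_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

verb = tree_doc[2]  # "displayed"

# Children that appear before / after the token in the sentence
left_children = [t.text for t in verb.lefts]
right_children = [t.text for t in verb.rights]

# The token "robot" together with everything that depends on it
subject_subtree = [t.text for t in tree_doc[1].subtree]

# Heads of "its", from its direct head up to the root
ancestors_of_its = [t.text for t in tree_doc[3].ancestors]

print(left_children, right_children, subject_subtree, ancestors_of_its)
```

On the robotics document, the same attributes are available on any token, for example robotics_doc[5].subtree.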

Named Entity Recognition

Suppose you have a collection of news articles as text data. What if you want to know which companies or organizations have been in the news? How will you do that?

Take another case: a text dataset of information about influential people. What if you want to know the names of the influencers in this dataset?

The answer to all these questions is Named Entity Recognition (NER).

What is NER?

NER is the technique of identifying named entities in a text corpus and assigning them predefined categories such as ‘person names’, ‘locations’, ‘organizations’, etc.

It is a very useful method, especially for classification problems and search engine optimization.

NER can be implemented through both nltk and spacy. I will walk you through both methods.

NER with NLTK

Let us start with a simple example to understand how to implement NER with nltk.

Consider the sentence below. Your goal is to identify which tokens are person names and which is a company.

sentence = " John and Emily are from India , they work at Yahoo"

To get there, you need to identify the named entities.

How to do this?

First, tokenize your text using nltk.word_tokenize. Next, pass the tokens through nltk.pos_tag to get the part-of-speech tag of each token, as shown below.

# POS tagging on the text
from nltk import word_tokenize, pos_tag
tokens=word_tokenize(sentence)
tokens_pos=pos_tag(tokens)
print(tokens_pos)
[('John', 'NNP'), ('and', 'CC'), ('Emily', 'NNP'), ('are', 'VBP'), ('from', 'IN'), ('India', 'NNP'), (',', ','), ('they', 'PRP'), ('work', 'VBP'), ('at', 'IN'), ('Yahoo', 'NNP')]

Now you have a POS tag associated with each token.
Next, to obtain the named entities of the text, nltk provides the ne_chunk method.

The code below demonstrates how to use nltk.ne_chunk on the above sentence.

# Importing ne_chunk from nltk and downloading the resources it needs
import nltk
from nltk import ne_chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!

You can pass the tokens, along with their POS tags, as input to ne_chunk(), as shown in the code below.

# Print all the named entities
named_entity=ne_chunk(tokens_pos)
print(named_entity)
(S
  (PERSON John/NNP)
  and/CC
  (PERSON Emily/NNP)
  are/VBP
  from/IN
  (GPE India/NNP)
  ,/,
  they/PRP
  work/VBP
  at/IN
  (ORGANIZATION Yahoo/NNP))

In the above output, you can see that John and Emily have been categorized as ‘PERSON’, India as ‘GPE’ and Yahoo as ‘ORGANIZATION’, just as required.
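The entities can also be pulled out of the result programmatically rather than read off the printout, because ne_chunk returns an nltk Tree in which every named entity is a nested subtree. The sketch below shows one way to do it. The helper name extract_entities is hypothetical, and the tree is rebuilt by hand from the output above so that the snippet runs without downloading any nltk resources.

```python
from nltk import Tree

def extract_entities(tree):
    """Collect (entity text, label) pairs from an ne_chunk result."""
    entities = []
    for node in tree:
        # Named entities are nested Tree objects; ordinary tokens are (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities

# Hand-built copy of the tree printed above (assumption: same structure as named_entity)
demo_tree = Tree("S", [
    Tree("PERSON", [("John", "NNP")]),
    ("and", "CC"),
    Tree("PERSON", [("Emily", "NNP")]),
    ("are", "VBP"),
    ("from", "IN"),
    Tree("GPE", [("India", "NNP")]),
    (",", ","),
    ("they", "PRP"),
    ("work", "VBP"),
    ("at", "IN"),
    Tree("ORGANIZATION", [("Yahoo", "NNP")]),
])

print(extract_entities(demo_tree))
```

In the walkthrough above, you would call extract_entities(named_entity) on the real ne_chunk output instead.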

Now, what if you have a huge amount of data? Printing the tree and checking for names by eye becomes impractical. For cases like this, spacy can be easier to work with.

NER with spacy

spacy is a highly flexible and advanced library. Named entity recognition through spacy is easy.

Every named entity in a spacy doc (available through doc.ents) has a label_ attribute, which stores the category of that entity.

A few common labels are:

  • ‘ORG’ : companies, organizations, etc.
  • ‘PERSON’ : names of people
  • ‘GPE’ : countries, states, etc.
  • ‘PRODUCT’ : vehicles, food products and so on
  • ‘LANGUAGE’ : names of different languages

There are many other labels too.

Let me show you a simple example to understand how to access the entity labels through spacy.

# Creating a spacy doc of a sentence
sentence=' The building is located at London. It is the headquaters of Yahoo. John works there. He speaks English'
doc=nlp(sentence)

You can iterate through the entities present in the doc using .ents and print their labels through .label_.

# Print named entities
for entity in doc.ents:
  print(entity.text,'--',entity.label_)
London -- GPE
Yahoo -- ORG
John -- PERSON
English -- LANGUAGE
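When a document contains many entities, it often helps to group them by label instead of printing them one by one. The sketch below shows one way to do that. The helper name entities_by_label is hypothetical, and the entities of the short demo sentence are set by hand (an assumption for illustration) so that the snippet runs with a blank pipeline and no trained model.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

def entities_by_label(doc):
    """Group the entity texts of a doc by their label."""
    grouped = defaultdict(list)
    for ent in doc.ents:
        grouped[ent.label_].append(ent.text)
    return dict(grouped)

# A blank English pipeline only tokenizes; it does not predict entities
blank_nlp = spacy.blank("en")
ner_demo_doc = blank_nlp("John works at Yahoo in London")

# Hand-annotated entities (token start index, token end index, label)
ner_demo_doc.ents = [
    Span(ner_demo_doc, 0, 1, label="PERSON"),
    Span(ner_demo_doc, 3, 4, label="ORG"),
    Span(ner_demo_doc, 5, 6, label="GPE"),
]

print(entities_by_label(ner_demo_doc))
```

With a trained pipeline, you would skip the manual annotation and simply call entities_by_label(doc) on the doc created above.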

You can see the output labels printed. spacy also provides a visualization for better understanding.

You can visualize the entities through displacy by setting style='ent'.

from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)

NLP - using displacy for entity recognition

Now that you have understood the basics of NER, let me show you how it is useful in real life.

Let us take the text of a collection of news headlines.

What if you want a list of all the names that have occurred in the news?

This is where spacy has the upper hand: you can check the entity category of each token through the token's .ent_type_ attribute (.ent_type holds the corresponding integer ID).

The code below demonstrates how to get a list of all the names in the news.

news_articles='Restaurant inspections: Space Coast establishments fare well with no closures or vermin Despite strong talk Tallahassee has a lot more to do for Indian River Lagoon | Opinion Kelly Slater-coached surfer from Cocoa Beach wins national title in Texas Despite parachute mishap Boeing Starliner pad abort test a success Vote for FLORIDA TODAY Community Credit Union Athlete of the Week for Oct 28-Nov 2 Brevard diners rave about experiences at Flavor on the Space Coast restaurants Heres what you need to know about Tuesdays municipal elections in Brevard County Save the turtles rally planned to oppose Satellite Beach hotel-condominium project Elon Musk wants humans to colonize space Brad Pitt shows thats easier said than done Google honors actor Will Rogers with new Doodle SpaceX targeting faster deployment of Starlink internet-beaming satellites Rockledge Cocoa hold while Merritt Island gains traction in state football polls Boeing launches and lands Starliner capsule for pad abort test in New Mexico Eight of the 10 most dangerous metro areas for pedestrians are in Florida report says Brevard faith leaders demand apology in open letter to Bill Mick How a Delaware researcher discovered a songbird can predict hurricane seasons Open doors change lives: United Way of Brevard kicks off 2019 campaign Florida man pleads guilty to killing sawfish by removing extended nose with power saw One-woman comedy Excuse My French at Viera Studio dives into authors mind Letters and feedback: Sept 19 2019 Viera talk will highlight technology used for dementia patients Adamson Road open after dump truck crash spilled sand and fuel one injured From the mob to mob movies with Robert De Niro: Meet Larry Mazza former hitman Irishman consultant Tropical Storm Jerry getting stronger in Atlantic; Hurricane Humberto nearing Bermuda Palm Bay Halloween party homicide suspect turns self in police say Fall playoffs: Satellite XC boys win region; Titusville swimmers second Sounds of 
celebration during Puerto Rican Day Parade in Palm Bay Tractor trailer fire extinguished on I-95 in Cocoa lanes open Logan Harvey bowls a 279 to help Astronaut beat Space Coast in battle of undefeated teams Letters and feedback: Sept 18 2019 Accelerated bargaining a bust as teachers union walks out on second day of pay talks Traffic Alert: Puerto Rican Day Parade set for Palm Bay County commissioners debate biosolids at first of two public hearings US 1 widening in Rockledge under study along with sidewalks and bike lanes Letters and feedback: Nov 3 2019 Titusville man indicted on first-degree murder charge Rockledge Cocoa lead 8 Brevard HS football teams into playoffs Space Coast Junior Achievement accepting business hall of fame nominations Why Brevards proposed moratorium on sewage sludge isnt enough |Rangel Brevard high school football standings after Week 4 Starbucks might build new store at old Checkers site in West Melbourne on US 192 Police search for Palm Bay shooting suspect in Orlando area Does your drivers license have a gold star? Well you need one Cocoa man charged with arson in blaze at triplex where 16 children were celebrating a birthday Unit and lot owners may inspect their ledgers upon written request Palm Bay veteran officer Nelson Moya chosen as police chief By the numbers: Your link to every state salary and what high ranking Florida officials make Longtime state county legislative aide Furru honored by County Commission Week 11: Rockledge pulls out BBQ win; EG Palm Bay Astro win rivalry games County commissioners allocate $446 million to repair South Beaches after Hurricane Dorian Tour of Brevard: Holy Trinity Tigers County Commission begins shakeup of its bloated advisory boards while adding term limits Mystery Jesus plaque washes ashore near Melbourne Beach thanks to Hurricane Humberto waves Widow of combat veteran who died after fight at Brevard Jail appeals to governor Health calendar: June 27-July 4 Going surfing this weekend? 
Check our wave forecast Have you ever seen a turtle release? Heres your chance Darkness turns to day as SpaceX Falcon Heavy launches from Kennedy Space Center Losing hair in strange places? It could be Alopecia areata Golf tip: Pairing and tee times on the PGA Tour NOAA updates projections saying above normal season could see 5-9 hurricanes 2-4 of them major Cape Canaveral Hospital relocation: Loss of a landmark hopeful future for healthcare and land How do I get adult child to move out start own life? The Atlas V rockets jellyfish in the sky Scores: High school football Week 11 in Brevard County Florida gator caught on camera munching on plastic at St Marks National Wildlife Refuge Q&A: New Melbourne head football coach Tony Rowell Does the First Amendment go too far? | Opinion FHSAA says it has a contingency plan if football associations strike but wont reveal it Tropical Depression Ten forms in central Atlantic; Hurricane Humberto continues to strengthen Updates: Watch SpaceX launch its Falcon Heavy rocket from Kennedy Space Center in Florida Letters and feedback: May 9 2019 The Good Stuff: Soccer star makes Wheaties cover cookies in space and more Highly anticipated pad abort test of Boeing’s Starliner spacecraft is Monday morning Health calendar: Sept 19-26 Naked man running through Titusville Walmart parking lot arrested Lacking money and manpower West Melbourne considers swapping school SROs for armed guards Worst best airports in country: 3 Florida airports make Top 5 worst list Colorado STEM school shooting impacted my family: When will we say enough? | Bellaby Human ashes Legos and a sandwich: Strange things we sent into space Massive 2600-acre burn underway near Brevard County line Health Pro: Trip to Philippines validated nurses career decision Harry Anderson Palm Bays Sitting Man exposes gaps in helping the homeless Brevard School Board splits 4-1 in favor of district plan for teacher pay; union vows war Whos No 1? 
See the Brevard high school football rankings after Week 4 3 people taken into custody after multiple agencies tracked them at high speeds across Central Florida As dawn breaks over Florida ULA Atlas V rocket lights up morning sky Why breast cancer was called Nuns disease Space Coast-based Rocket Crafters partners with Swiss tech giant RUAG Investigation continues for Palm Bay double killing in 2018 Can’t breastfeed? Tips to provide same nutrition to baby Former Florida Tech golfer Iacobelli wins playoff in Michigan for third pro title Baseball bliss every 50 years: Let’s go Mets! Let’s go Nats! 5 tips to make your holiday travel stress-free Updates: ULA launches Atlas V rocket military satellite from Cape Canaveral Andrew Reed bowls 274 in Eau Gallie win over Satellite Space industry propels Brevard economy into optimistic future; highlights from EDC investors meeting Governor DeSantis signs bill geared toward expanding vocational training Outage leaves 12500 customers without power in Melbourne for nearly an hour Weak cold front brings fall weather after unusually warm October on Space Coast 321 Launch: The space news you might have missed Teacher pay: Union opens pay talks with raise demand supplement for veteran teachers Tour of Brevard: Bayside Bears Youth fishing tournament coming Saturday to Melbourne President Trump heres what Hurricane Michael damage still looks like today Letters and feedback: Aug 8 2019 Man leads police on 100 mph chase through Melbourne Palm Bay Puerto Rican parade to take place this weekend in Palm Bay Viera-based USSSA Pride depart from pro softball league Tip of organized theft ring targeting boats motors leads to arrest of 3 men in Melbourne Multiple crashes slow I-95 traffic near Melbourne Is it time to reconsider how free speech should be? 
| Bill Cotterell SPCA of Brevard takes influx of cats and dogs from Glades County Brevard residents share some of their favorite recent dining experiences Tuesdays brewery running tour stops at Florida Beer Co in Cape Canaveral 89-year-old Tallahassee Grandma Bunny battles 6-foot snake after it eats visiting birds The horrors of Dorian parallel another nearby tragedy of historic proportions EFSCs Carter Stewart named NJCAA Region 8 Pitcher of the Week Titusville teenager charged with murder of grandfather If Trump bans vape flavors Brevard shops say theyll be out of business Cabana: NASAs Artemis is space explorations next giant leap | Opinion Florida continues cutting phone cords BSOs Confessore starting 25th year as music director High school sports results: May 7 Libraries busy with summer reading programs Four hurricanes in six weeks? Remember 2004 the year of hurricanes Letters and feedback: Nov 2 2019 Letters and feedback: May 8 2019 As Brevard kids head back to school here are five things to make lunchtime happy healthy Palm Bay police investigate overnight shooting that left one dead Rent takes the stage at two playhouses Cinderella Brighton Beach Memoirs arrive on Brevard stages this weekend Junes Saharan dust setting up right conditions for red tide but still too soon to tell Wizarding World of Harry Potter: Dark Arts unleashed on Hogwarts Castle at Universal Orlando Health First restructuring: What we know (and what we dont) From Jack in the Box to In-N-Out Burger: 7 restaurant chains missing in Florida SpaceX anomaly: No further action needed on Crew Dragon explosion cleanup Vietnam War mural pits residents vs Florida community Matter settled  unhappily British cruise line Marella to sail from Port Canaveral in 2021 Kids are at risk as religious exemptions to vaccines spike in Florida Brevard County crime rate falls in 2018 reflecting statewide trend according to FDLE report Florida-Georgia FSU-Miami: Why Saturday afternoon is a college football party 
in Florida Restaurants in Viera Suntree Port Canaveral Palm Bay Melbourne Beach draw Flavor raves Vote for the FLORIDA TODAY Community Credit Union Athlete of the Week for Sept 9-14 Whats happening May 15-21: Concert in Cocoa summer camps fun fair in Melbourne Port Authority Secretary/Treasurer Harvey seeks more public input at evening meetings SpaceX Falcon Heavy rocket launch: Where to watch from Space Coast Treasure Coast Volusia County Tallahassee Here are 10 of Arizonas most photogenic spots Some of them might surprise you Rockledge police search for suspects after armed robbery Kardashian production features Palm Bay homicide involving sisters-in-law FLORIDA TODAY journalist The holidays are almost here; Space Coast cooks are prepping special meals now How to watch tonights SpaceX Falcon Heavy launch from Kennedy Space Center in Florida New state budget spending wish list includes bug studies lionfish contests stronger coastlines Wheaties: Team USA soccer star Ashlyn Harris make cereal cover and it costs $23 a box Letter carriers Stamp Out Hunger food drive is Saturday May 11 Suzy Fleming Leonard: Fourth-graders talk about journalism share favorite restaurants Falcon Heavys night launch may be seen across portion of Sunshine State Police: Palm Bay teacher jerked autistic student from bus faces abuse charge Kicking axe: Urban axe throwing makes its debut in Brevard as trending sport stress reliever Palm Bay official Andy Anderson hired as Mexico Beach city administrator on Florida Panhandle Daddy Duty: Are we celebrating Mothers Day the right way? Father of 6 dies trying to save sons from rip current in Cocoa Beach Its time we address lack of affordable housing on Space Coast | Opinion Orlando Pride soccer player from Brevard diagnosed with breast cancer Space Coast bus system: Lets peck away at this problem County Commission chair says Do lovebugs have a purpose beyond annoying us all twice a year? 
Hurricane Humberto continues to strengthen swells expected to affect East-Central Florida surf One more launch this week; Things to know about ULA Atlas V launch from Cape Canaveral Free self-defense class for women this Friday in Cocoa Beach The farce behind blaming video games for El Paso Dayton mass shootings | Rangel Teacher pay talks take new shape as school district and union vow to work together First-place USSSA Pride on hot streak; Bennett on ESPNs Top Plays A legacy of giving: When it comes to giving back Judy and Bryan Roub lead by example Staph infection can come from ocean water beach sand; Florida man gets it from sandbar Its Launch Day! Things to know about SpaceX Falcon Heavy launch from Kennedy Space Center Viera 6 more Brevard HS softball teams enter playoffs Enjoy Jack Daniels tastings in Cocoa Village and Melbourne this month Brevard County location comes in at No 7 on list of US cities that get the most sun Letters and feedback: Oct 31 2019 Vietnam veteran receives settlement from Veterans Affairs  after waiting 20 years Brevard restaurants with outdoor seating wage war against lovebugs Aldi adds beer sparkling wine and dog treats to its collection of Advent calendars If millennials choose socialism fine Just dont make this mistake Gardening: Adding a tree to your Florida landscape? Southern redcedar is a great option Big Brevard high school football rivalries vary in postseason implications 6 hurricanes and 12 named storms predicted for remainder of 2019 hurricane season Cocoa man suspected of firing shots with unwitting mother in tow faces multiple charges Dogs on the beach? 
County Commission rejects proposal to allow dogs on South Brevard beaches Letters and feedback: Sept 15 2019 Despite weather odds SpaceX launches Falcon 9 rocket from Cape Canaveral Cyberstalking of alleged battery victim lands Rockledge man back in jail Winds are coming back again but inshore fishing is excellent for redfish and trout Exclusive: Trump administration sets plans for 2019 hurricane season after wakeup call of recent disasters Melting Pot: Intimacy of a shared meal makes this Viera spot perfect for date night The Chefs Table: Fine meat seafood draw diners to this Suntree favorite Holmes nurse arrested accused of sexual battery on a patient Tour of Brevard: Satellite Scorpions Djons Steak & Lobster House: Discover old-Florida elegance in Melbourne Beach restaurant Judicial Nominating Commission chair resigns claims governors office interfered with independence Medical marijuana firms seek to open more storefronts in Florida Brevard School votes 4-1 to support superintendents plan for teacher raises; audience not happy Tracking Brevards stars: Arena football champ UCF star PBA rookie lead the way Brevard school district miscalculated attrition savings finds $15 million for teacher pay Brevard high school cross country honor roll Oct 31 NASA Kennedy Space Center team wins Emmy for coverage of SpaceX mission Golf tip: Making adjustments to correct your swing Cruise ship rescues and mishaps: 6 times emergency struck on Royal Caribbean Carnival more Updates: SpaceX launches Falcon 9 from Cape Canaveral Air Force Station Secular group challenges In God We Trust motto on Brevard sheriffs patrol vehicles Chemicals in sunscreen seep into your bloodstream after just one day FDA says Cottage Rose mural dispute resolved in Indialantic after owner applies for permit Humberto strengthens into a hurricane as it moves away from Space Coast; NHC watching 2 more Planting herbs to cutting back on water heres what to do in your Brevard yard this month Viera Suntree Little 
League team heads to Michigan for Junior World Series We must prevent human trafficking through education | Opinion Please dont eat the deer Family spots Florida panther wandering around Estero backyard Man found hanging onto side of kayak in Atlantic Ocean off Patrick Air Force Base SpaceX set to launch Falcon Heavy on mission with 24 payloads –\xa0including human ashes Erin Baird is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards ZZ Top Steve Martin and Martin Short coming to King Center Needy Brevard students sleep easier on new beds donated from Ashley HomeStore These fitness trends are worth sticking with over time Florida saw fewer infections from Vibrio vulnificus (aka flesh-eating bacteria) in 2018 Health calendar: April 25-May 2 Young actresses shine in Matilda at Titusville Playhouse Freshman quarterback Gabriel leads UCF over Stanford in Orlando John Daly is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Health Pro: Doctors interest grew out of childhood health issues More strong storms move eastward over Brevard; funnel clouds possible in Viera Nurse at Holmes Regional Medical Center arrested on suspicion of sexual battery Strong isolated showers heat impacting Brevard County Brevard firefighters widow wins $9 million verdict in Dominos Pizza delivery death Housing providers may be prohibited from rejecting renters on basis of arrest record 10-foot great white shark Miss May pings off Satellite Beach Palm Bay man 22 dies in overnight crash on Floridas Turnpike Tamara Carroll is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Tropical Storm Humberto not expected to affect Space Coast; NHC watching two more disturbances Theater veteran Blackledge passes away unexpectedly Ten dead issues from the 2019 legislative session Letters and feedback: June 23 2019 Rockledge Cocoa win big as Space Coast takes Holy Trinity in OT Surfers can enjoy chest-high waves 
most of weekend in Brevard Plan ahead for Aug 14-20: Space Coast Symphony performs Mikado Chef Larry tweaks hours Daylight saving time ends on Nov 3 Here are 7 facts to know about the time change Two new Royal Caribbean ships including second largest in world move to Port Canaveral Fla Sen Debbie Mayfield files bill to define e-cigarettes liquid nicotine as tobacco products Escape the hustle and bustle for a western getaway Missing Palm Bay teen found in South Carolina; mother arrested police say Impossible burger Impossible Cottage Pie make 2019 Epcot Food and Wine Festival menu Got bats? Why you need to get rid of them by Tax Day 2 charged with attempted murder; Titusville police say they pistol-whipped waterboarded force-fed him dog food Bicyclist dies in hit-and-run crash on US 1 near Cocoa Group: St Johns River pollution threatens water supplies Rockledge man charged after hoax to shoot up Merritt Island Walmart Man awarded $80M in lawsuit claiming Monsantos Roundup causes cancer Astronauts Alisa Rendina won FLORIDA TODAY Community Credit Union Athlete of the Week vote Alligators and airboats: How to explore Everglades National Park in one day Longtime educator church mother and humanitarian Johnnie Mae Riley remembered in Palm Bay George Clooney astronauts to attend black-tie event at Kennedy Space Center High school sports results: March 27 Bill Nye visits Cocoa Beach ahead of SpaceX Falcon Heavy launch from Kennedy Space Center Health First to relocate Cape Canaveral Hospital to Merritt Island open new facilities Letters and feedback: Aug 7 2019 Duran to host top area amateur golfers this weekend Buckaroo Ball benefits Harmony Farms Get ready to rumble: Atlas V blasting off early Thursday morning War on lovebugs: Social media reactions tweets memes about pesky bugs cars windshields Master association that contains condominiums may be able to collect capital contribution The new LEGO Movie World opens at LEGOLAND River Rocks: Enjoy fresh seafood and river 
views at this Rockledge tradition March 27: Panthers Walk It Off 10 takeaways from the presidents campaign kickoff visit to Orlando Cocoa man arrested on manslaughter charges after Sharpes shooting Corrections & Clarifications Accused burglars hit five south Brevard stores before Melbourne officer shot one of them Brevard family to receive Filipino World War II veterans Congressional Gold Medal NASA: Want to go to the moon in five years? We need more money How to REALLY clean lovebugs off your windshield car house Top party schools: Florida Florida State make it to top of Princeton Reviews 2020 list Live scores: High school football Week 4 in Brevard County Restaurant inspections: Happy Kitchen in West Melbourne Divine Grace in Palm Bay closed after inspections School district sticks to $770 raise offer for teachers; special magistrate getting involved Candidate lineup is set for 2019 municipal special district elections in Brevard Death investigation underway after mans body found in Titusville ditch November hurricane forecast: Season not technically over but its over | WeatherTiger Tyndall Air Force Bases future remains uncertain after devastation from Hurricane Michael Whats behind a low performing school? Crummy parents | Opinion Well known Chevrolet dealership has The Right Stuff Lovebugs are back Many hate them Experts say theyre here for good The lax disciplinary policies that caused\xa0Parkland massacre may have spread to your school Port Canaveral commissioners tackle parking issues: 10 things to know Melbourne woman injured in crash dies from her injuries; police looking for witnesses Senate President Galvano calls for Florida lawmakers to focus on mass shootings Governor DeSantis approves $91 billion budget slashes $133 million in local projects Letters and feedback: March 28 2019 Vote for 321prepscom Community Credit Union Athlete of the Week for April 29-May 4 South Patrick Shores approved for federal cleanup program Island H2O Live! 
water park in Kissimmee opens: Make the plunge or chill in lazy river Travis Proctor is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Vietnam War 50th anniversary to be remembered at Cape Canaveral National Cemetery Its launch day! Things to know about SpaceX Falcon 9 launch from Cape Canaveral NASA gave Artemis contractors bonuses despite delays and cost overruns GAO says Michael Bloom is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Health Pro: Eye doctor honed skills on aircraft carrier Brevard pitchers fare well in state horseshoe event The real reason NASAs all-female spacewalk is not happening Hint: Its not discrimination Port Canaveral aquarium the best use of tourism dollars | Opinion SpaceX Falcon Heavy will carry worlds first green rocket fuel mission School District union at odds on whats actually available for teacher raises | Opinion Nuclear power is too costly and too risky Jarvis Wash is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards SpaceX Dragon installed at ISS after launch from Cape Canaveral Rare phenomenon: Hail damage from unusual thunderstorm has Brevard residents talking Signs you live on the Space Coast Titusville couple charged with neglect after infant found malnourished Health coaches a personal way to help you meet goals New Viera elementary school to break ground in fall fast-tracked to open August 2020 Phase out nuclear power? Elizabeth Warren and Bernie Sanders locked in absurd arms race Strange days: Space Coast gets freak hail flaming meteor mysterious rumble red tide over 6 months Four women arrested on prostitution charges in ongoing Melbourne enforcement detail Whose name is on deed - partner or spouse? 
It matters a lot Florida should steer clear of drug importation from other countries | Opinion Melbourne police investigate Palm Bay Road shooting Brevard bowlers head to state tournament off district wins Florida Historical Societys May 16-18 conference: Countdown to History: Ice Age to the Space Age Turning the Toxic Tide: Florida must address the problem of human waste Contributions pour in to group seeking to ban possession sale of assault weapons in Florida Why Disney World sweets treats desserts are SO worthy of Instagram Satellite Beach girl 10 leads charge for organ donation Poliakoff: Community development district offers different advantages and disadvantages to an HOA Florida Tech in Melbourne hosts International Dinner with food from South Latin America From discovering water to snapping selfies: The lasting memories of Mars rover Opportunity Charges dropped against Texas man after comments against Cocoa Beach hotel werent direct specific threat Brevard Federation of Teachers just cant say yes to raise offers | Opinion How Mars Opportunity rover ranks in list of more than 1000 unmanned space missions by NASA since 1958 321 Flavor members rave about Fresh Scratch Bistro Sushi Factory and Beachside Seafood Texas man recovering after shark bite in Florida Buckeye the manatee rescued again by SeaWorld Rain hail pelts north central Brevard County causing widespread damage Regulators approve merger of Harris L3; deal to close June 29 Teachers approve largest pay raise in years but will it keep them in Brevard? 
Restaurant review: Continental Flambe in Melbourne shows great promise Weak cool front could bring torrential rain damaging winds to Space Coast Officials: Removing Beachline earthen causeways would improve Banana River water quality Buzz Aldrins outfit is everything in photo of last remaining Apollo astronauts Voting Republican only helps if youre rich | Opinion Homelessness: A national problem with a local solution EPA plans to regulate cancer-causing chemicals found in Americas drinking water UCF has chance to wow recruits playoff voters with a win over Stanford Ping-pong-sized hail pounds Space Coast as severe thunderstorms move through area Harris CEO optimistic about merger finances as deal with L3 nears completion March 26: EFSC’s Carter Stewart named NJCAA Region VIII Pitcher of the Week Knife-wielding suspect leads deputies officers on foot chase in Melbourne 4 Brevard winners in girls regional basketball Thursday March 26: Women’s Golf Finishes Runner-Up at Barry Invite Some seek healing remembrance at arrival of Vietnam War replica wall High school sports results: March 26 Following surge in discipline problems Brevard students ask lawmakers to address vaping in schools Titusville man 21 pronounced dead after trying to make a U-turn on South Street Donald Trumps criminal justice reform shows results | Opinion Florida cold case solved 22 years later after Google Earth satellite image shows missing mans car Health calendar: Aug 8-15 Tracking Brevards stars: Harris to World Cup; Dawson in golf hunt; Allen Taylor hitting 300 Satellite Beach hotel condo towers homes slated for 27 acres by Hightower Beach park Letters and feedback: Sept 14 2019 Mike Martin Jr to be named Florida States new baseball coach Golf tip: Hit down on hybrids similar to how you hit an iron Stormy weather moves across Space Coast bringing lightning heavy rain Commission OKs pay raises for 625 county employees to address disparities turnover High school sports results: Feb 14 weVENTURE 
is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Man taken to hospital after domestic incident in Titusville Brevard school district considers selling $800000 in unused land Weather looks OK for SpaceX Falcon Heavy launch from Kennedy Space Center 321 Flavor: Margarita tops the list of Brevards favorite cocktails UCF experiment and Citronaut fly to space on Blue Origins New Shepard Bridges sex assault: Victim’s family sues nonprofit caregiver over pregnancy Sheriff Ivey and Brevard emergency officials send mixed messages during Dorian New proposal on dog cat sales at pet stores is less restrictive than previous one Space Coast Honor Flight is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards It goes by so quickly: Brevard teachers back to work as days count down to school County commissioners reject lagoon funding plan saying they want less money for muck projects Yuck! Hypodermic needles stabbing employees inside Brevard recycling plant Hard-line anti-illegal immigration group works to sway sanctuary cities bill Titusville High School principal retires amid accusations of hostile environment Pol Pot’s deputy Nuon Chea architect of bloody Khmer Rouge policy dead at 93 in Cambodia NASAs new goal: Land astronauts on the moon by 2024 Going surfing this weekend? 
Check our forecast Neighbor Up is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Severe storms bring wind lightning and possible hail across Brevard Starbucks in Indialantic West Melbourne may move to larger stores with drive-thru windows SpaceX Falcon Heavy: Why everyone should watch this night launch from Kennedy Space Center Fishing report: Calm seas and good fishing ahead Two World Series commercials feature Brevard celebrities Letters and feedback: May 5 2019 In latest motion US Attorney confirms multiple investigations in Tallahassee Lane 70 softball team wins spring tournament Palm Bay police double down on nightclub safety following four violent episodes Relativity Space signs deal to deliver small satellites to more orbits HBO documentary new surf movie spotlight surf legends Kelly Slater Hobgood twins Plan ahead for April 3-10: Music in Melbourne adult prom in Titusville; art in Eau Gallie Family grieves woman fatally struck by car near Lous Blues bar in Melbourne At Cape Canaveral SpaceX kicks off Star Wars Day with smoke and fire Updates: SpaceX launches Falcon 9 from Cape Canaveral to ISS sticks drone ship landing Letters and feedback: March 27 2019 Florida Tech to host event celebrating Apollo 11 anniversary and future of US space program Boeing on Starliner orbital flight test: We really are close Florida Tech spring football game set for Saturday Feb 14: EFSC women’s golfer Kacey Walker nominated for 2019 Arnold Palmer Cup Four Panthers Selected to All-SSC Teams Heavy winds and lightning to hit north Brevard County How Ron DeSantis is shaking up the establishment | Opinion High school sports results: May 3 Palm Bay man who asked woman for permission to have illicit sexual encounter with child sentenced Letters and feedback: June 22 2019 Air Force pumped for two launches less than 48 hours apart Heres why NASA scrubbed all-female spacewalk Plan ahead for Nov 6-12: Holiday bazaar in Suntree pet events art openings 
and food fun Edgewoods Jon Wang shoots one under par for second time this week Burglar steals bolted safe from Melbourne bar in latest smash-and-grab Feb 14: Florida Tech Suffers Last-Second Heartbreaker Against Saint Leo Daddy Duty: My school cheating scandal is an example for my 4-year-old Tropical Storm Humberto forms but isnt expected to affect Space Coast Air Force weather squadron key player in ensuring successful rocket launches Ronsisvalle: Getting through to the disconnected husband How to watch SpaceX launch its Falcon 9 rocket from Cape Canaveral Air Force Station Members of Brevards Bahai Faith community come together to commemorate twin holy days A&Es Tiny House Nation to feature Melbourne tiny home builder Movable Roots Lawmakers grapple with felons voting rights All-female spacewalk canceled due to lack of medium-sized spacesuits Twitter reacts Medical marijuana and 3 other school district policy changes you should know about Felons rights bill heads to Gov DeSantis Lanes open delays persist after Crash on I-95 near Wickham Road Former UCF chair resigns day after Rep Randy Fine proposes UC'

Create a spaCy doc from news_articles. Iterate through every token and check whether token.ent_type_ is 'PERSON'. If it is, append the token's text to the list.

# creating spacy doc
news_doc=nlp(news_articles)

# creating an empty list to store people's names
list_of_people=[]

# iterating through every token in doc
for token in news_doc:
  # checking if category matches
  if token.ent_type_=='PERSON':
    # appending to list if PERSON
    list_of_people.append(token.text)

print(list_of_people)
['Kelly', 'Slater', 'Elon', 'Musk', 'Brad', 'Pitt', 'Will', 'Rogers', 'Bill', 'Mick', 'United', 'Way', 'Viera', 'Studio', 'Viera', 'Robert', 'De', 'Niro', 'Larry', 'Mazza', 'hitman', 'Tropical', 'Storm', 'Jerry', 'Logan', 'Harvey', 'Astronaut', 'beat', 'Space', 'Coast', 'Titusville', 'Nelson', 'Moya', 'Jesus', 'Alopecia', 'Atlas', 'New', 'Melbourne', 'Tony', 'Rowell', 'Titusville', 'Walmart', 'Harry', 'Anderson', 'Palm', 'Bays', 'Iacobelli', 'Atlas', 'Andrew', 'Reed', 'Eau', 'Gallie', 'DeSantis', 'Trump', 'Bill', 'Cotterell', 'Bunny', 'Carter', 'Stewart', 'Brevard', 'Cinderella', 'Brighton', 'Jack', 'Box', 'Marella', 'Treasurer', 'Harvey', 'Ashlyn', 'Harris', 'Suzy', 'Fleming', 'Falcon', 'Heavys', 'Andy', 'Anderson', 'Brevard', 'Lets', '|', 'Rangel', 'Teacher', 'Bennett', 'Judy', 'Bryan', 'Roub', 'Staph', 'Viera', 'Enjoy', 'Jack', 'Daniels', 'Aldi', 'Melting', 'Pot', ':', 'Viera', 'Holmes', 'Estero', 'Erin', 'Baird', 'Steve', 'Martin', 'Martin', 'Short', 'Ashley', 'HomeStore', 'Vibrio', 'Gabriel', 'John', 'Daly', 'Aug', 'Mikado', 'Chef', 'Larry', 'Titusville', 'Alisa', 'Rendina', 'Bill', 'Nye', 'Atlas', 'Brevard', 'Melbourne', 'Galvano', 'DeSantis', 'Travis', 'Proctor', 'Michael', 'Bloom', 'Jarvis', 'Wash', 'New', 'Viera', 'Elizabeth', 'Warren', 'Bernie', 'Sanders', 'Melbourne', 'Florida', 'Tech', 'Harris', 'L3', 'Continental', 'Flambe', 'Removing', 'Beachline', 'Buzz', 'Aldrins', 'Stanford', 'Ping', '-', 'pong', 'Harris', 'Carter', 'Stewart', 'Golf', 'Finishes', 'Runner', 'Titusville', 'Harris', 'Dawson', 'Allen', 'Taylor', 'Mike', 'Martin', 'Jr', 'Margarita', 'Yuck', 'Pol', 'Pot', '’s', 'Nuon', 'Chea', 'Kelly', 'Slater', 'Eau', 'Gallie', 'Family', 'Kacey', 'Walker', 'Ron', 'DeSantis', 'Edgewoods', 'Jon', 'Wang', 'Crash']

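From the output of the above code, you can see the names of people that appeared in the news. Note that each name is split into separate tokens ('Kelly', 'Slater'), and that the model also mislabels some non-person tokens as PERSON (for example 'Walmart', 'Titusville' and ':').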
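Because the token loop returns names in pieces, one option is to read whole entity spans from doc.ents instead. The sketch below is only an illustration: the helper name people_from_doc and the demo sentence are made up for this example, and the entity spans are set by hand on a blank pipeline so that the snippet runs without downloading a model. With the trained nlp used in this article, you would call people_from_doc(news_doc) directly.

```python
import spacy
from spacy.tokens import Span


def people_from_doc(doc):
    """Return the text of every entity span labelled PERSON, in document order."""
    return [ent.text for ent in doc.ents if ent.label_ == "PERSON"]


# A blank pipeline has no NER component, so the entities are assigned
# manually here (an assumption made purely for demonstration).
demo_nlp = spacy.blank("en")
demo_doc = demo_nlp("Kelly Slater surfed in Melbourne with Elon Musk")
demo_doc.ents = [
    Span(demo_doc, 0, 2, label="PERSON"),  # Kelly Slater
    Span(demo_doc, 4, 5, label="GPE"),     # Melbourne
    Span(demo_doc, 6, 8, label="PERSON"),  # Elon Musk
]

print(people_from_doc(demo_doc))
```

Each entry is now a full name rather than a single token. Any mislabelled entities from the model would still appear, so the result may need manual review.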

Also, in many cases you may not want the actual details to be published. So, what if you want to replace every person’s name, organization or place with ‘UNKNOWN’ to preserve privacy?

The code below demonstrates how to do this.

# Function to identify if tokens are named entities and replace them with UNKNOWN
def remove_details(word):
    if word.ent_type_ in ('PERSON', 'ORG', 'GPE'):
        return ' UNKNOWN '
    # text_with_ws keeps trailing whitespace (token.string was removed in spaCy v3)
    return word.text_with_ws


# Function where each token of spacy doc is passed through remove_details()
def update_article(doc):
    # merging every multi-token entity into a single token
    # (Span.merge() was removed in spaCy v3; the retokenizer replaces it)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent, attrs={'ent_type': ent.label})
    # Passing each token through remove_details() function.
    tokens = map(remove_details, doc)
    return ''.join(tokens)

# Passing our news_doc to the function update_article()
update_article(news_doc)
'Restaurant inspections: Space Coast establishments fare well with no closures or vermin Despite strong talk  UNKNOWN has a lot more to do for Indian River Lagoon | Opinion  UNKNOWN -coached surfer from  UNKNOWN wins national title in  UNKNOWN Despite parachute mishap  UNKNOWN Starliner pad abort test a success Vote for  UNKNOWN of the Week for Oct 28-Nov 2 Brevard diners rave about experiences at  UNKNOWN what you need to know about Tuesdays municipal elections in  UNKNOWN Save the turtles rally planned to oppose  UNKNOWN hotel-condominium project  UNKNOWN wants humans to colonize space  UNKNOWN shows thats easier said than done  UNKNOWN honors actor  UNKNOWN with new Doodle SpaceX targeting faster deployment of Starlink internet-beaming satellites Rockledge Cocoa hold while  UNKNOWN gains traction in state football polls  UNKNOWN launches and lands Starliner capsule for pad abort test in  UNKNOWN Eight of the 10 most dangerous metro areas for pedestrians are in  UNKNOWN report says Brevard faith leaders demand apology in open letter to  UNKNOWN How a  UNKNOWN researcher discovered a songbird can predict hurricane seasons Open doors change lives:  UNKNOWN of  UNKNOWN kicks off 2019 campaign  UNKNOWN man pleads guilty to killing sawfish by removing extended nose with power saw One-woman comedy Excuse My French at  UNKNOWN dives into authors mind Letters and feedback: Sept 19 2019  UNKNOWN talk will highlight technology used for dementia patients Adamson Road open after dump truck crash spilled sand and fuel one injured From the mob to mob movies with  UNKNOWN : Meet  UNKNOWN former  UNKNOWN Irishman consultant  UNKNOWN getting stronger in Atlantic; Hurricane Humberto nearing Bermuda Palm Bay Halloween party homicide suspect turns self in police say Fall playoffs: Satellite XC boys win region;  UNKNOWN swimmers second Sounds of celebration during Puerto Rican Day Parade in Palm Bay Tractor trailer fire extinguished on I-95 in  UNKNOWN lanes open  UNKNOWN bowls a 279 
to help  UNKNOWN in battle of undefeated teams Letters and feedback: Sept 18 2019 Accelerated bargaining a bust as teachers union walks out on second day of pay talks Traffic Alert: Puerto Rican Day Parade set for Palm Bay County commissioners debate biosolids at first of two public hearings  UNKNOWN widening in  UNKNOWN under study along with sidewalks and bike lanes Letters and feedback: Nov 3 2019  UNKNOWN man indicted on first-degree murder charge Rockledge Cocoa lead 8 Brevard HS football teams into  UNKNOWN accepting business hall of fame nominations Why Brevards proposed moratorium on sewage sludge isnt enough |Rangel Brevard high school football standings after Week 4 Starbucks might build new store at old  UNKNOWN site in  UNKNOWN on  UNKNOWN 192 Police search for Palm Bay shooting suspect in  UNKNOWN area Does your drivers license have a gold star? Well you need one  UNKNOWN man charged with arson in blaze at triplex where 16 children were celebrating a birthday Unit and lot owners may inspect their ledgers upon written request Palm Bay veteran officer  UNKNOWN chosen as police chief By the numbers: Your link to every state salary and what high ranking  UNKNOWN officials make Longtime state county legislative aide  UNKNOWN honored by  UNKNOWN Week 11: Rockledge pulls out  UNKNOWN win; EG Palm Bay Astro win rivalry games County commissioners allocate $446 million to repair South Beaches after Hurricane Dorian Tour of Brevard: Holy Trinity Tigers County Commission begins shakeup of its bloated advisory boards while adding term limits Mystery  UNKNOWN plaque washes ashore near  UNKNOWN thanks to Hurricane Humberto waves Widow of combat veteran who died after fight at  UNKNOWN appeals to governor Health calendar: June 27-July 4 Going surfing this weekend? Check our wave forecast Have you ever seen a turtle release? Heres your chance Darkness turns to day as SpaceX  UNKNOWN Heavy launches from Kennedy Space Center Losing hair in strange places? 
It could be  UNKNOWN areata Golf tip: Pairing and tee times on  UNKNOWN NOAA updates projections saying above normal season could see 5-9 hurricanes 2-4 of them major  UNKNOWN relocation: Loss of a landmark hopeful future for healthcare and land How do I get adult child to move out start own life? The  UNKNOWN V rockets jellyfish in the sky Scores: High school football Week 11 in  UNKNOWN  UNKNOWN gator caught on camera munching on plastic at  UNKNOWN :  UNKNOWN head football coach  UNKNOWN Does the First Amendment go too far? | Opinion FHSAA says it has a contingency plan if football associations strike but wont reveal it Tropical Depression Ten forms in central Atlantic; Hurricane Humberto continues to strengthen Updates: Watch SpaceX launch its  UNKNOWN Heavy rocket from Kennedy Space Center in  UNKNOWN Letters and feedback: May 9 2019 The Good Stuff: Soccer star makes  UNKNOWN cover cookies in space and more Highly anticipated pad abort test of  UNKNOWN ’s Starliner spacecraft is Monday morning Health calendar: Sept 19-26 Naked man running through  UNKNOWN parking lot arrested Lacking money and manpower  UNKNOWN considers swapping school SROs for armed guards Worst best airports in country: 3  UNKNOWN airports make Top 5 worst list  UNKNOWN STEM school shooting impacted my family: When will we say enough? | Bellaby Human ashes Legos and a sandwich: Strange things we sent into space Massive 2600-acre burn underway near  UNKNOWN line Health Pro: Trip to  UNKNOWN validated nurses career decision  UNKNOWN Sitting Man exposes gaps in helping the homeless  UNKNOWN splits 4-1 in favor of district plan for teacher pay; union vows war Whos No 1? 
See the  UNKNOWN high school football rankings after Week 4 3 people taken into custody after multiple agencies tracked them at high speeds across Central Florida As dawn breaks over Florida ULA Atlas V rocket lights up morning sky Why breast cancer was called  UNKNOWN disease Space Coast-based  UNKNOWN partners with Swiss tech giant  UNKNOWN continues for Palm Bay double killing in 2018 Can’t breastfeed? Tips to provide same nutrition to baby Former  UNKNOWN Tech golfer  UNKNOWN wins playoff in  UNKNOWN for third pro title Baseball bliss every 50 years: Let’s go Mets! Let’s go  UNKNOWN ! 5 tips to make your holiday travel stress-free Updates:  UNKNOWN launches  UNKNOWN V rocket military satellite from Cape Canaveral  UNKNOWN bowls 274 in  UNKNOWN win over  UNKNOWN industry propels Brevard economy into optimistic future; highlights from  UNKNOWN investors meeting Governor  UNKNOWN signs bill geared toward expanding vocational training  UNKNOWN leaves 12500 customers without power in  UNKNOWN for nearly an hour Weak cold front brings fall weather after unusually warm October on Space Coast 321 Launch: The space news you might have missed Teacher pay: Union opens pay talks with raise demand supplement for veteran teachers  UNKNOWN Bayside Bears Youth fishing tournament coming Saturday to  UNKNOWN President  UNKNOWN heres what  UNKNOWN damage still looks like today Letters and feedback: Aug 8 2019 Man leads police on 100 mph chase through  UNKNOWN  UNKNOWN parade to take place this weekend in Palm Bay Viera-based  UNKNOWN Pride depart from pro softball league Tip of organized theft ring targeting boats motors leads to arrest of 3 men in  UNKNOWN Multiple crashes slow I-95 traffic near  UNKNOWN Is it time to reconsider how free speech should be? 
|  UNKNOWN SPCA of Brevard takes influx of cats and dogs from  UNKNOWN residents share some of their favorite recent dining experiences Tuesdays brewery running tour stops at  UNKNOWN in  UNKNOWN 89-year-old  UNKNOWN Grandma  UNKNOWN battles 6-foot snake after it eats visiting birds The horrors of Dorian parallel another nearby tragedy of historic proportions EFSCs  UNKNOWN named  UNKNOWN Region 8 Pitcher of the Week Titusville teenager charged with murder of grandfather If Trump bans vape flavors Brevard shops say theyll be out of business  UNKNOWN : NASAs Artemis is space explorations next giant leap  UNKNOWN  UNKNOWN continues cutting phone cords BSOs Confessore starting 25th year as music director High school sports results: May 7 Libraries busy with summer reading programs Four hurricanes in six weeks? Remember 2004 the year of hurricanes Letters and feedback: Nov 2 2019 Letters and feedback: May 8 2019 As  UNKNOWN kids head back to school here are five things to make lunchtime happy healthy Palm Bay police investigate overnight shooting that left one dead Rent takes the stage at two playhouses  UNKNOWN Beach Memoirs arrive on  UNKNOWN stages this weekend Junes Saharan dust setting up right conditions for red tide but still too soon to tell  UNKNOWN of Harry Potter: Dark Arts unleashed on  UNKNOWN at  UNKNOWN restructuring: What we know (and what we dont) From  UNKNOWN in the  UNKNOWN to In-N-Out Burger: 7 restaurant chains missing in  UNKNOWN SpaceX anomaly: No further action needed on Crew Dragon explosion cleanup Vietnam War mural pits residents vs  UNKNOWN community Matter settled  unhappily British cruise line  UNKNOWN to sail from  UNKNOWN in 2021 Kids are at risk as religious exemptions to vaccines spike in  UNKNOWN crime rate falls in 2018 reflecting statewide trend according to  UNKNOWN report  UNKNOWN - UNKNOWN FSU-Miami: Why Saturday afternoon is a college football party in  UNKNOWN in  UNKNOWN Suntree Port Canaveral Palm Bay  UNKNOWN draw  UNKNOWN 
raves Vote for  UNKNOWN of the Week for Sept 9-14 Whats happening May 15-21: Concert in  UNKNOWN summer camps fun fair in  UNKNOWN Secretary/ UNKNOWN seeks more public input at evening meetings SpaceX  UNKNOWN Heavy rocket launch: Where to watch from Space Coast Treasure Coast Volusia County  UNKNOWN Here are 10 of  UNKNOWN most photogenic spots Some of them might surprise you Rockledge police search for suspects after armed robbery Kardashian production features Palm Bay homicide involving sisters-in-law  UNKNOWN TODAY journalist The holidays are almost here; Space Coast cooks are prepping special meals now How to watch tonights SpaceX  UNKNOWN Heavy launch from  UNKNOWN in  UNKNOWN New state budget spending wish list includes bug studies lionfish contests stronger coastlines Wheaties:  UNKNOWN soccer star  UNKNOWN make cereal cover and it costs $23 a box Letter carriers Stamp Out Hunger food drive is Saturday May 11  UNKNOWN Leonard: Fourth-graders talk about journalism share favorite restaurants  UNKNOWN night launch may be seen across portion of  UNKNOWN : Palm Bay teacher jerked autistic student from bus faces abuse charge Kicking axe: Urban axe throwing makes its debut in  UNKNOWN as trending sport stress reliever Palm Bay official  UNKNOWN hired as  UNKNOWN city administrator on  UNKNOWN Panhandle Daddy Duty: Are we celebrating Mothers Day the right way? Father of 6 dies trying to save sons from rip current in  UNKNOWN Its time we address lack of affordable housing on Space Coast | Opinion  UNKNOWN Pride soccer player from  UNKNOWN diagnosed with breast cancer Space Coast bus system:  UNKNOWN peck away at this problem  UNKNOWN chair says Do lovebugs have a purpose beyond annoying us all twice a year? 
Hurricane Humberto continues to strengthen swells expected to affect East-Central Florida surf One more launch this week; Things to know about ULA Atlas V launch from  UNKNOWN Free self-defense class for women this Friday in  UNKNOWN The farce behind blaming video games for  UNKNOWN mass shootings  UNKNOWN pay talks take new shape as school district and union vow to work together First-place USSSA Pride on hot streak;  UNKNOWN on ESPNs Top Plays A legacy of giving: When it comes to giving back  UNKNOWN and  UNKNOWN lead by example  UNKNOWN infection can come from ocean water beach sand;  UNKNOWN man gets it from sandbar Its Launch Day! Things to know about SpaceX  UNKNOWN Heavy launch from Kennedy Space Center  UNKNOWN 6 more  UNKNOWN softball teams enter playoffs  UNKNOWN tastings in  UNKNOWN and  UNKNOWN this month  UNKNOWN location comes in at No 7 on list of  UNKNOWN cities that get the most sun Letters and feedback: Oct 31 2019  UNKNOWN veteran receives settlement from  UNKNOWN  after waiting 20 years Brevard restaurants with outdoor seating wage war against lovebugs  UNKNOWN adds beer sparkling wine and dog treats to its collection of Advent calendars If millennials choose socialism fine Just dont make this mistake Gardening: Adding a tree to your  UNKNOWN landscape? Southern redcedar is a great option  UNKNOWN high school football rivalries vary in postseason implications 6 hurricanes and 12 named storms predicted for remainder of 2019 hurricane season  UNKNOWN man suspected of firing shots with unwitting mother in tow faces multiple charges Dogs on the beach?  
UNKNOWN rejects proposal to allow dogs on South Brevard beaches Letters and feedback: Sept 15 2019 Despite weather odds SpaceX launches  UNKNOWN 9 rocket from Cape Canaveral Cyberstalking of alleged battery victim lands Rockledge man back in jail Winds are coming back again but inshore fishing is excellent for redfish and trout Exclusive: Trump administration sets plans for 2019 hurricane season after wakeup call of recent disasters  UNKNOWN Intimacy of a shared meal makes this  UNKNOWN spot perfect for date night The Chefs Table: Fine meat seafood draw diners to this Suntree favorite  UNKNOWN nurse arrested accused of sexual battery on a patient  UNKNOWN : Discover old- UNKNOWN elegance in  UNKNOWN restaurant  UNKNOWN chair resigns claims governors office interfered with independence Medical marijuana firms seek to open more storefronts in  UNKNOWN votes 4-1 to support superintendents plan for teacher raises; audience not happy Tracking Brevards stars: Arena football champ  UNKNOWN star  UNKNOWN rookie lead the way Brevard school district miscalculated attrition savings finds $15 million for teacher pay  UNKNOWN country honor roll Oct 31  UNKNOWN team wins Emmy for coverage of SpaceX mission Golf tip: Making adjustments to correct your swing Cruise ship rescues and mishaps: 6 times emergency struck on  UNKNOWN more Updates: SpaceX launches Falcon 9 from  UNKNOWN group challenges  UNKNOWN We Trust motto on  UNKNOWN sheriffs patrol vehicles Chemicals in sunscreen seep into your bloodstream after just one day  UNKNOWN says  UNKNOWN Rose mural dispute resolved in Indialantic after owner applies for permit  UNKNOWN strengthens into a hurricane as it moves away from Space Coast;  UNKNOWN watching 2 more Planting herbs to cutting back on water heres what to do in your Brevard yard this month  UNKNOWN team heads to  UNKNOWN for Junior World Series We must prevent human trafficking through education | Opinion Please dont eat the deer Family spots  UNKNOWN panther wandering 
around  UNKNOWN backyard Man found hanging onto side of kayak in Atlantic Ocean off Patrick Air Force Base SpaceX set to launch  UNKNOWN on mission with 24 payloads –\xa0including human ashes  UNKNOWN is 2019 Volunteer of the Year finalist in  UNKNOWN Awards ZZ Top  UNKNOWN and  UNKNOWN coming to  UNKNOWN students sleep easier on new beds donated from  UNKNOWN These fitness trends are worth sticking with over time  UNKNOWN saw fewer infections from  UNKNOWN vulnificus (aka flesh-eating bacteria) in 2018 Health calendar: April 25-May 2 Young actresses shine in  UNKNOWN at Titusville Playhouse Freshman quarterback  UNKNOWN leads  UNKNOWN over  UNKNOWN in  UNKNOWN  UNKNOWN is 2019 Volunteer of the Year finalist in  UNKNOWN Doctors interest grew out of childhood health issues More strong storms move eastward over  UNKNOWN ; funnel clouds possible in  UNKNOWN arrested on suspicion of sexual battery Strong isolated showers heat impacting  UNKNOWN firefighters widow wins $9 million verdict in  UNKNOWN delivery death Housing providers may be prohibited from rejecting renters on basis of arrest record 10-foot great white shark Miss May pings off  UNKNOWN Beach Palm Bay man 22 dies in overnight crash on Floridas Turnpike Tamara Carroll is 2019 Volunteer of the Year finalist in  UNKNOWN Awards Tropical Storm Humberto not expected to affect Space Coast;  UNKNOWN watching two more disturbances  UNKNOWN veteran  UNKNOWN passes away unexpectedly Ten dead issues from the 2019 legislative session Letters and feedback: June 23 2019 Rockledge Cocoa win big as Space Coast takes  UNKNOWN in OT Surfers can enjoy chest-high waves most of weekend in  UNKNOWN ahead for  UNKNOWN 14-20:  UNKNOWN performs  UNKNOWN tweaks hours Daylight saving time ends on Nov 3 Here are 7 facts to know about the time change Two new  UNKNOWN ships including second largest in world move to Port Canaveral Fla Sen Debbie Mayfield files bill to define e-cigarettes liquid nicotine as tobacco products Escape the 
hustle and bustle for a western getaway Missing Palm Bay teen found in  UNKNOWN ; mother arrested police say Impossible burger Impossible Cottage Pie make 2019 Epcot Food and Wine Festival menu Got bats? Why you need to get rid of them by Tax Day 2 charged with attempted murder;  UNKNOWN police say they pistol-whipped waterboarded force-fed him dog food Bicyclist dies in hit-and-run crash on  UNKNOWN 1 near  UNKNOWN : St Johns River pollution threatens water supplies Rockledge man charged after hoax to shoot up  UNKNOWN awarded $80M in lawsuit claiming Monsantos Roundup causes cancer Astronauts  UNKNOWN won  UNKNOWN of the Week vote Alligators and airboats: How to explore Everglades National Park in one day Longtime educator church mother and humanitarian  UNKNOWN remembered in Palm Bay George Clooney astronauts to attend black-tie event at Kennedy Space Center High school sports results: March 27  UNKNOWN visits  UNKNOWN ahead of SpaceX  UNKNOWN Heavy launch from  UNKNOWN to relocate  UNKNOWN to Merritt Island open new facilities Letters and feedback: Aug 7 2019 Duran to host top area amateur golfers this weekend  UNKNOWN benefits Harmony Farms Get ready to rumble:  UNKNOWN V blasting off early Thursday morning War on lovebugs: Social media reactions tweets memes about pesky bugs cars windshields Master association that contains condominiums may be able to collect capital contribution The new  UNKNOWN opens at LEGOLAND River Rocks: Enjoy fresh seafood and river views at this  UNKNOWN tradition March 27:  UNKNOWN Off 10 takeaways from the presidents campaign kickoff visit to  UNKNOWN man arrested on manslaughter charges after  UNKNOWN burglars hit five south Brevard stores before  UNKNOWN officer shot one of them  UNKNOWN family to receive Filipino World War II veterans Congressional Gold Medal NASA: Want to go to the moon in five years? 
We need more money How to REALLY clean lovebugs off your windshield car house Top party schools:  UNKNOWN Florida State make it to top of  UNKNOWN 2020 list Live scores: High school football Week 4 in  UNKNOWN Restaurant inspections: Happy Kitchen in  UNKNOWN Divine Grace in Palm Bay closed after inspections School district sticks to $770 raise offer for teachers; special magistrate getting involved Candidate lineup is set for 2019 municipal special district elections in  UNKNOWN investigation underway after mans body found in  UNKNOWN ditch November hurricane forecast: Season not technically over but its over | WeatherTiger Tyndall Air Force Bases future remains uncertain after devastation from Hurricane Michael Whats behind a low performing school? Crummy parents | Opinion Well known  UNKNOWN dealership has The Right Stuff Lovebugs are back Many hate them Experts say theyre here for good The lax disciplinary policies that caused\xa0Parkland massacre may have spread to your school  UNKNOWN commissioners tackle parking issues: 10 things to know  UNKNOWN woman injured in crash dies from her injuries; police looking for witnesses  UNKNOWN President  UNKNOWN calls for  UNKNOWN lawmakers to focus on mass shootings Governor  UNKNOWN approves $91 billion budget slashes $133 million in local projects  UNKNOWN and feedback: March 28 2019 Vote for 321prepscom  UNKNOWN of the Week for April 29-May 4  UNKNOWN approved for federal cleanup program Island H2O Live! water park in  UNKNOWN opens: Make the plunge or chill in lazy river  UNKNOWN is 2019 Citizen of the Year finalist in  UNKNOWN Awards Vietnam War 50th anniversary to be remembered at  UNKNOWN launch day! 
Things to know about SpaceX  UNKNOWN 9 launch from  UNKNOWN gave  UNKNOWN contractors bonuses despite delays and cost overruns  UNKNOWN says  UNKNOWN is 2019 Citizen of the Year finalist in  UNKNOWN Eye doctor honed skills on aircraft carrier  UNKNOWN pitchers fare well in state  UNKNOWN event The real reason NASAs all-female spacewalk is not happening  UNKNOWN : Its not discrimination Port Canaveral aquarium the best use of tourism dollars | Opinion SpaceX  UNKNOWN will carry worlds first green rocket fuel mission  UNKNOWN at odds on whats actually available for teacher raises  UNKNOWN power is too costly and too risky  UNKNOWN is 2019 Citizen of the Year finalist in  UNKNOWN TODAY Volunteer Recognition Awards SpaceX Dragon installed at  UNKNOWN after launch from  UNKNOWN phenomenon: Hail damage from unusual thunderstorm has Brevard residents talking Signs you live on the Space Coast Titusville couple charged with neglect after infant found malnourished Health coaches a personal way to help you meet goals  UNKNOWN elementary school to break ground in fall fast-tracked to open August 2020 Phase out nuclear power?  UNKNOWN and  UNKNOWN locked in absurd arms race Strange days: Space Coast gets freak hail flaming meteor mysterious rumble red tide over 6 months Four women arrested on prostitution charges in ongoing  UNKNOWN enforcement detail Whose name is on deed - partner or spouse? 
It matters a lot  UNKNOWN should steer clear of drug importation from other countries | Opinion Melbourne police investigate Palm Bay Road shooting Brevard bowlers head to state tournament off district wins  UNKNOWN Historical Societys May 16-18 conference: Countdown to History: Ice Age to the Space Age Turning the Toxic Tide:  UNKNOWN must address the problem of human waste Contributions pour in to group seeking to ban possession sale of assault weapons in  UNKNOWN  UNKNOWN sweets treats desserts are SO worthy of Instagram Satellite Beach girl 10 leads charge for organ donation Poliakoff: Community development district offers different advantages and disadvantages to an  UNKNOWN  UNKNOWN in  UNKNOWN hosts International Dinner with food from South Latin America From discovering water to snapping selfies: The lasting memories of Mars rover Opportunity Charges dropped against  UNKNOWN man after comments against Cocoa Beach hotel werent direct specific threat  UNKNOWN just cant say yes to raise offers | Opinion How Mars Opportunity rover ranks in list of more than 1000 unmanned space missions by  UNKNOWN since 1958 321 Flavor members rave about  UNKNOWN and  UNKNOWN  UNKNOWN man recovering after shark bite in  UNKNOWN Buckeye the manatee rescued again by  UNKNOWN Rain hail pelts north central  UNKNOWN causing widespread damage Regulators approve merger of  UNKNOWN ; deal to close June 29 Teachers approve largest pay raise in years but will it keep them in  UNKNOWN ? 
Restaurant review:  UNKNOWN in  UNKNOWN shows great promise Weak cool front could bring torrential rain damaging winds to  UNKNOWN :  UNKNOWN earthen causeways would improve Banana River water quality  UNKNOWN outfit is everything in photo of last remaining  UNKNOWN astronauts Voting Republican only helps if youre rich | Opinion Homelessness: A national problem with a local solution  UNKNOWN plans to regulate cancer-causing chemicals found in Americas drinking water  UNKNOWN has chance to wow recruits playoff voters with a win over  UNKNOWN -sized hail pounds Space Coast as severe thunderstorms move through area  UNKNOWN CEO optimistic about merger finances as deal with  UNKNOWN nears completion March 26:  UNKNOWN ’s  UNKNOWN named  UNKNOWN Region VIII Pitcher of the Week Knife-wielding suspect leads deputies officers on foot chase in Melbourne 4 Brevard winners in girls regional basketball Thursday March 26: Women’s  UNKNOWN -Up at Barry Invite Some seek healing remembrance at arrival of Vietnam War replica wall High school sports results: March 26 Following surge in discipline problems Brevard students ask lawmakers to address vaping in schools  UNKNOWN man 21 pronounced dead after trying to make a U-turn on South Street Donald Trumps criminal justice reform shows results  UNKNOWN  UNKNOWN cold case solved 22 years later after Google Earth satellite image shows missing mans car Health calendar: Aug 8-15 Tracking Brevards stars:  UNKNOWN to World Cup;  UNKNOWN in golf hunt;  UNKNOWN hitting 300  UNKNOWN hotel condo towers homes slated for 27 acres by Hightower Beach park Letters and feedback: Sept 14 2019  UNKNOWN to be named  UNKNOWN new baseball coach Golf tip: Hit down on hybrids similar to how you hit an iron  UNKNOWN weather moves across Space Coast bringing lightning heavy rain Commission OKs pay raises for 625 county employees to address disparities turnover High school sports results: Feb 14 weVENTURE is 2019 Organization of the Year finalist in  UNKNOWN 
Man taken to hospital after domestic incident in Titusville Brevard school district considers selling $800000 in unused land Weather looks OK for SpaceX  UNKNOWN Heavy launch from  UNKNOWN Flavor:  UNKNOWN tops the list of Brevards favorite cocktails  UNKNOWN experiment and  UNKNOWN fly to space on  UNKNOWN New Shepard Bridges sex assault: Victim’s family sues nonprofit caregiver over pregnancy  UNKNOWN emergency officials send mixed messages during Dorian New proposal on dog cat sales at pet stores is less restrictive than previous one Space Coast Honor Flight is 2019 Organization of the Year finalist in  UNKNOWN It goes by so quickly: Brevard teachers back to work as days count down to school County commissioners reject lagoon funding plan saying they want less money for muck projects  UNKNOWN ! Hypodermic needles stabbing employees inside Brevard recycling plant Hard-line anti-illegal immigration group works to sway sanctuary cities bill  UNKNOWN principal retires amid accusations of hostile environment  UNKNOWN deputy  UNKNOWN architect of bloody  UNKNOWN policy dead at 93 in  UNKNOWN NASAs new goal: Land astronauts on the moon by 2024 Going surfing this weekend? 
Check our forecast Neighbor Up is 2019 Organization of the Year finalist in  UNKNOWN TODAY Volunteer Recognition Awards Severe storms bring wind lightning and possible hail across  UNKNOWN in  UNKNOWN Melbourne may move to larger stores with drive-thru windows SpaceX  UNKNOWN Heavy: Why everyone should watch this night launch from  UNKNOWN report: Calm seas and good fishing ahead Two World Series commercials feature Brevard celebrities Letters and feedback: May 5 2019 In latest motion  UNKNOWN Attorney confirms multiple investigations in Tallahassee Lane 70 softball team wins spring tournament Palm Bay police double down on nightclub safety following four violent episodes  UNKNOWN signs deal to deliver small satellites to more orbits  UNKNOWN documentary new surf movie spotlight surf legends  UNKNOWN Hobgood twins Plan ahead for April 3-10: Music in  UNKNOWN adult prom in  UNKNOWN ; art in  UNKNOWN grieves woman fatally struck by car near Lous Blues bar in  UNKNOWN At Cape Canaveral SpaceX kicks off Star Wars Day with smoke and fire Updates: SpaceX launches  UNKNOWN 9 from  UNKNOWN to  UNKNOWN sticks drone ship landing Letters and feedback: March 27 2019  UNKNOWN Tech to host event celebrating Apollo 11 anniversary and future of  UNKNOWN space program  UNKNOWN on  UNKNOWN orbital flight test: We really are close  UNKNOWN spring football game set for Saturday Feb 14: EFSC women’s golfer  UNKNOWN nominated for 2019 Arnold Palmer Cup Four Panthers Selected to All-SSC Teams Heavy winds and lightning to hit north  UNKNOWN How  UNKNOWN is shaking up the establishment | Opinion High school sports results: May 3 Palm Bay man who asked woman for permission to have illicit sexual encounter with child sentenced Letters and feedback: June 22 2019  UNKNOWN pumped for two launches less than 48 hours apart Heres why  UNKNOWN scrubbed all-female spacewalk Plan ahead for Nov 6-12: Holiday bazaar in  UNKNOWN pet events art openings and food fun  UNKNOWN shoots one under par for 
second time this week Burglar steals bolted safe from  UNKNOWN bar in latest smash-and-grab Feb 14:  UNKNOWN Tech Suffers Last-Second Heartbreaker Against Saint Leo Daddy Duty: My school cheating scandal is an example for my 4-year-old Tropical Storm Humberto forms but isnt expected to affect  UNKNOWN weather squadron key player in ensuring successful rocket launches  UNKNOWN : Getting through to the disconnected husband How to watch SpaceX launch its  UNKNOWN 9 rocket from  UNKNOWN Members of  UNKNOWN Bahai Faith community come together to commemorate twin holy days A&Es  UNKNOWN to feature  UNKNOWN tiny home builder Movable Roots Lawmakers grapple with felons voting rights All-female spacewalk canceled due to lack of medium-sized spacesuits  UNKNOWN and 3 other school district policy changes you should know about Felons rights bill heads to Gov DeSantis Lanes open delays persist after  UNKNOWN on I-95 near Wickham Road Former  UNKNOWN chair resigns day after  UNKNOWN proposes UC'

All the private details have now been replaced with ‘UNKNOWN’.

I hope this shows you the immense usefulness of NER.

Implementing NLP Tasks

Now that you have learnt about various NLP techniques, it’s time to implement them. Examples of NLP are everywhere around you: the chatbots you use on a website, the news summaries you read online, the sorting of movie reviews into positive and negative, and so on.

This section will show you how to implement these vital NLP tasks.

Text Summarization

Suppose you find a magazine with a bunch of articles. Do you just start reading it from page 1 to the end?

No! You first read the summaries to choose the article that interests you. The same applies to news articles, research papers, etc.

The technique of creating a concise summary of a large text without losing the key information is referred to as Text Summarization.

Text Summarization is highly useful in today’s digital world. I will now walk you through some important methods to implement Text Summarization.

What is Extractive Text Summarization

This is the traditional method, in which the process is to identify the significant phrases/sentences of the text corpus and include them in the summary.

The summary obtained from this method will contain the key sentences of the original text corpus. It can be done in many ways; I will show you how to do it using gensim and spacy.

Extractive Text Summarization using Gensim

The text summarization process in the gensim library is based on the TextRank algorithm.

What does the TextRank Algorithm do?

  1. The raw text is preprocessed (stopwords and punctuation are removed, and words are lemmatized).
  2. Each sentence of the text corpus undergoes vectorization, i.e., it is converted into a numeric representation (for example, from word embeddings).
  3. Next, the algorithm calculates the similarities between the sentence vectors and stores the values in a matrix.
  4. Now, it builds a graph representation. Each sentence is a NODE, and the similarities between sentences (stored in the matrix) are the EDGE weights.
  5. The sentences are ranked according to their significance in this weighted graph.

The top-ranked sentences are chosen to be part of the summary.
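To make the steps above concrete, here is a minimal, standard-library sketch of the same idea. It is not gensim's implementation: the helper names (split_sentences, overlap_similarity, textrank_sketch) are hypothetical, the sentence splitter is deliberately naive, and the similarity measure is a simple word-overlap score assumed for illustration instead of word embeddings.

```python
import math
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap_similarity(words_a, words_b):
    """Shared-word count scaled by sentence lengths (an assumed measure)."""
    if len(words_a) < 2 or len(words_b) < 2:
        return 0.0  # avoids a zero denominator for one-word sentences
    shared = len(words_a & words_b)
    return shared / (math.log(len(words_a)) + math.log(len(words_b)))


def textrank_sketch(text, top_n=2, damping=0.85, iterations=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n == 0:
        return []

    # Steps 1-2: crude preprocessing, each sentence becomes a set of words.
    word_sets = [set(re.findall(r"[a-z']+", s.lower())) for s in sentences]

    # Step 3: similarity matrix, which doubles as the graph's edge weights.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                weights[i][j] = overlap_similarity(word_sets[i], word_sets[j])

    # Steps 4-5: rank the nodes with a weighted PageRank-style iteration.
    out_sums = [sum(row) for row in weights]
    ranks = [1.0 / n] * n
    for _ in range(iterations):
        new_ranks = []
        for i in range(n):
            incoming = sum(
                weights[j][i] / out_sums[j] * ranks[j]
                for j in range(n)
                if out_sums[j] > 0
            )
            new_ranks.append((1 - damping) / n + damping * incoming)
        ranks = new_ranks

    # Keep the top-ranked sentences, restored to their original order.
    best = sorted(range(n), key=lambda i: ranks[i], reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]
```

Calling textrank_sketch(some_text, top_n=3) returns the three highest-ranked sentences in document order. Sentences that share many words with the rest of the text collect more rank, while a sentence that shares nothing with the others stays at the baseline score.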

Now, I shall guide you through the code to implement this with gensim. Our first step would be to import the summarizer from gensim.summarization. Note that this module was removed in gensim 4.0, so the code below needs a gensim 3.x release.

import gensim
from gensim.summarization import summarize

Let us say you have an article about junk food that you want to summarize.

article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. Hope it will help you'

You can pass the text as input to the summarize function.

short_summary=summarize(article_text)
print(short_summary)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier.
Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.
For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats.

That seems too long, right? What if you want to control how many words the summary contains?

You can change the default parameters of the summarize function according to your requirements.

The parameters are:

  • ratio: A number between 0 and 1. It sets the proportion of sentences from the original text that are kept in the summary.
  • word_count: It sets the number of words in the summary.

Let me show you how to use these parameters in the above example.

summary_by_ratio=summarize(article_text,ratio=0.1)
print(summary_by_ratio)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.

In the above output, you can notice that only about 10% of the original sentences are kept in the summary.

Next, let me show you how to summarize using word_count.

summary_by_word_count=summarize(article_text,word_count=30)
print(summary_by_word_count)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.

In the above output, you can see the summary limited by word_count.

What if you provide conflicting word_count and ratio values?

If both are specified, the summarize function ignores ratio, so word_count takes precedence.

Let me demonstrate this in the code below.

summary=summarize(article_text,ratio=0.1,word_count=30)
print(summary)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.

It is clear that word_count has been followed.

Moving on, let us look at how to summarize text using the spacy library.

Extractive Text Summarization with spacy

You can also implement Text Summarization using the spacy package.

Here, I will summarize the same article on junk food using spacy.

article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. '

First, import the package and create a spacy object doc.

# Importing & creating a spacy object

import spacy
nlp = spacy.load('en_core_web_sm')
doc=nlp(article_text)

Next, you know that extractive summarization is based on identifying the significant words. So, you need to store the keywords of the text in a list.

How do you choose the important words?

We know that punctuation and stopwords are just noise, so you can ignore them. Usually, the nouns, proper nouns, adjectives and verbs add significant value to the text.

spacy lets you check a token’s part of speech through the token.pos_ attribute. Using it, you can select the desired tokens as shown below.

# creating an empty list to store keywords
keywords_list = []

# List of the POS categories which you think are significant
desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']

# Import punctuations for text cleaning
from string import punctuation


# Iterating through tokens 
for token in doc: 
  # checking if a token is stopword or punctuation
  if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
    # If true, they are just ignored and loop goes to the next token
    continue
  #  checking if the POS tag of the token is in our desired list
  if(token.pos_ in desired_pos):
    # If true, append the token to our keywords list
    keywords_list.append(token.text)

The above code iterates through every token and stores the tokens that are NOUN, PROPER NOUN, VERB or ADJECTIVE in keywords_list.

Next, you can find the frequency of each token in keywords_list using Counter. The list of keywords is passed as input to Counter, which returns a dictionary of keywords and their frequencies.

# Importing Counter from the collections module
from collections import Counter

# creating dictionary of keywords + frequency
dictionary = Counter(keywords_list) 
print(dictionary)
Counter({'food': 19, 'junk': 13, 'foods': 12, 'high': 9, 'sugar': 7, 'Junk': 6, 'healthy': 6, 'weight': 6, 'cholesterol': 5, 'body': 5, 'blood': 5, 'heart': 5, 'good': 4, 'sodium': 4, 'fat': 4, 'gain': 4, 'life': 4, 'diabetes': 4, 'level': 4, 'increases': 4, 'risk': 4, 'children': 3, 'effects': 3, 'health': 3, 'found': 3, 'nutrients': 3, 'unhealthy': 3, 'lack': 3, 'impact': 3, 'person': 3, 'bad': 3, 'fats': 3, 'result': 3, 'taste': 2, 'age': 2, 'parents': 2, 'harmful': 2, 'fried': 2, 'dietary': 2, 'fibers': 2, 'makes': 2, 'obesity': 2, 'tastes': 2, 'looks': 2, 'calorie': 2, 'fries': 2, 'burgers': 2, 'candy': 2, 'cookies': 2, 'eating': 2, 'prone': 2, 'type-2': 2, 'disease': 2, 'kidney': 2, 'failure': 2, 'lead': 2, 'nutritional': 2, 'deficiencies': 2, 'essential': 2, 'vitamins': 2, 'minerals': 2, 'cardiovascular': 2, 'diseases': 2, 'diet': 2, 'pressure': 2, 'functioning': 2, 'spike': 2, 'people': 2, 'dull': 2, 'day': 2, 'arteries': 2, 'attack': 2, 'way': 2, 'main': 2, 'instance': 2, 'overtime': 2, 'things': 2, 'time': 2, 'green': 2, 'find': 2, 'try': 2, 'different': 2, 'help': 2, 'liked': 1, 'group': 1, 'kids': 1, 'school': 1, 'going': 1, 'ask': 1, 'trend': 1, 'childhood': 1, 'discussed': 1, 'According': 1, 'research': 1, 'scientists': 1, 'negative': 1, 'ways': 1, 'market': 1, 'packets': 1, 'calories': 1, 'low': 1, 'mineral': 1, 'starch': 1, 'protein': 1, 'Processed': 1, 'means': 1, 'rapid': 1, 'able': 1, 'excessive': 1, 'called': 1, 'fulfil': 1, 'requirement': 1, 'french': 1, 'pizza': 1, 'soft': 1, 'drinks': 1, 'baked': 1, 'goods': 1, 'ice': 1, 'cream': 1, 'example': 1, 'containing': 1, 'according': 1, 'Centres': 1, 'Disease': 1, 'Control': 1, 'Prevention': 1, 'Kids': 1, 'unable': 1, 'regulate': 1, 'Risk': 1, 'getting': 1, 'increasing': 1, 'obese': 1, 'overweight': 1, 'Eating': 1, 'iron': 1, 'rich': 1, 'saturated': 1, 'High': 1, 'overloads': 1, 'like': 1, 'develop': 1, 'extra': 1, 'fatter': 1, 'unhealthier': 1, 'contain': 1, 'carbohydrate': 1, 'lethargic': 1, 
'sleepy': 1, 'active': 1, 'alert': 1, 'Reflexes': 1, 'senses': 1, 'live': 1, 'sedentary': 1, 'source': 1, 'constipation': 1, 'ailments': 1, 'clogged': 1, 'strokes': 1, 'poor': 1, 'nutrition': 1, 'easiest': 1, 'reasons': 1, 'increase': 1, 'positive': 1, 'points': 1, 'requires': 1, 'stay': 1, 'fit': 1, 'fulfilled': 1, 'French': 1, 'amounts': 1, 'long': 1, 'term': 1, 'illnesses': 1, 'consume': 1, 'consumption': 1, 'words': 1, 'interferes': 1, 'contains': 1, 'higher': 1, 'carbohydrates': 1, 'levels': 1, 'lethargy': 1, 'inactiveness': 1, 'sleepiness': 1, 'reflex': 1, 'inactive': 1, 'worse': 1, 'clogs': 1, 'avoided': 1, 'save': 1, 'ruined': 1, 'problem': 1, 'realize': 1, 'ill': 1, 'comes': 1, 'late': 1, 'issue': 1, 'works': 1, 'face': 1, 'consequences': 1, 'better': 1, 'stop': 1, 'avoid': 1, 'encouraging': 1, 'early': 1, 'eat': 1, 'vegetables': 1, 'buds': 1, 'developed': 1, 'tasty': 1, 'mix': 1, 'serve': 1, 'vegetable': 1, 'style': 1, 'Incorporate': 1, 'types': 1, 'following': 1, 'recipes': 1, 'home': 1, 'attracted': 1, 'short': 1, 'deprive': 1, 'Children': 1, 'Make': 1, 'sure': 1, 'limited': 1, 'quantities': 1, 'periods': 1})

Now, you need to normalize the frequency of all keywords.

For that, find the highest frequency using the .most_common method. Then normalize by dividing every keyword frequency in the dictionary by this highest frequency.

# find the highest frequency 
highest_frequency = Counter(keywords_list).most_common(1)[0][1] 

# Normalizing process
for word in dictionary:
    dictionary[word] = (dictionary[word]/highest_frequency) 

print(dictionary)
Counter({'food': 1.0, 'junk': 0.6842105263157895, 'foods': 0.631578947368421, 'high': 0.47368421052631576, 'sugar': 0.3684210526315789, 'Junk': 0.3157894736842105, 'healthy': 0.3157894736842105, 'weight': 0.3157894736842105, 'cholesterol': 0.2631578947368421, 'body': 0.2631578947368421, 'blood': 0.2631578947368421, 'heart': 0.2631578947368421, 'good': 0.21052631578947367, 'sodium': 0.21052631578947367, 'fat': 0.21052631578947367, 'gain': 0.21052631578947367, 'life': 0.21052631578947367, 'diabetes': 0.21052631578947367, 'level': 0.21052631578947367, 'increases': 0.21052631578947367, 'risk': 0.21052631578947367, 'children': 0.15789473684210525, 'effects': 0.15789473684210525, 'health': 0.15789473684210525, 'found': 0.15789473684210525, 'nutrients': 0.15789473684210525, 'unhealthy': 0.15789473684210525, 'lack': 0.15789473684210525, 'impact': 0.15789473684210525, 'person': 0.15789473684210525, 'bad': 0.15789473684210525, 'fats': 0.15789473684210525, 'result': 0.15789473684210525, 'taste': 0.10526315789473684, 'age': 0.10526315789473684, 'parents': 0.10526315789473684, 'harmful': 0.10526315789473684, 'fried': 0.10526315789473684, 'dietary': 0.10526315789473684, 'fibers': 0.10526315789473684, 'makes': 0.10526315789473684, 'obesity': 0.10526315789473684, 'tastes': 0.10526315789473684, 'looks': 0.10526315789473684, 'calorie': 0.10526315789473684, 'fries': 0.10526315789473684, 'burgers': 0.10526315789473684, 'candy': 0.10526315789473684, 'cookies': 0.10526315789473684, 'eating': 0.10526315789473684, 'prone': 0.10526315789473684, 'type-2': 0.10526315789473684, 'disease': 0.10526315789473684, 'kidney': 0.10526315789473684, 'failure': 0.10526315789473684, 'lead': 0.10526315789473684, 'nutritional': 0.10526315789473684, 'deficiencies': 0.10526315789473684, 'essential': 0.10526315789473684, 'vitamins': 0.10526315789473684, 'minerals': 0.10526315789473684, 'cardiovascular': 0.10526315789473684, 'diseases': 0.10526315789473684, 'diet': 0.10526315789473684, 'pressure': 
0.10526315789473684, 'functioning': 0.10526315789473684, 'spike': 0.10526315789473684, 'people': 0.10526315789473684, 'dull': 0.10526315789473684, 'day': 0.10526315789473684, 'arteries': 0.10526315789473684, 'attack': 0.10526315789473684, 'way': 0.10526315789473684, 'main': 0.10526315789473684, 'instance': 0.10526315789473684, 'overtime': 0.10526315789473684, 'things': 0.10526315789473684, 'time': 0.10526315789473684, 'green': 0.10526315789473684, 'find': 0.10526315789473684, 'try': 0.10526315789473684, 'different': 0.10526315789473684, 'help': 0.10526315789473684, 'liked': 0.05263157894736842, 'group': 0.05263157894736842, 'kids': 0.05263157894736842, 'school': 0.05263157894736842, 'going': 0.05263157894736842, 'ask': 0.05263157894736842, 'trend': 0.05263157894736842, 'childhood': 0.05263157894736842, 'discussed': 0.05263157894736842, 'According': 0.05263157894736842, 'research': 0.05263157894736842, 'scientists': 0.05263157894736842, 'negative': 0.05263157894736842, 'ways': 0.05263157894736842, 'market': 0.05263157894736842, 'packets': 0.05263157894736842, 'calories': 0.05263157894736842, 'low': 0.05263157894736842, 'mineral': 0.05263157894736842, 'starch': 0.05263157894736842, 'protein': 0.05263157894736842, 'Processed': 0.05263157894736842, 'means': 0.05263157894736842, 'rapid': 0.05263157894736842, 'able': 0.05263157894736842, 'excessive': 0.05263157894736842, 'called': 0.05263157894736842, 'fulfil': 0.05263157894736842, 'requirement': 0.05263157894736842, 'french': 0.05263157894736842, 'pizza': 0.05263157894736842, 'soft': 0.05263157894736842, 'drinks': 0.05263157894736842, 'baked': 0.05263157894736842, 'goods': 0.05263157894736842, 'ice': 0.05263157894736842, 'cream': 0.05263157894736842, 'example': 0.05263157894736842, 'containing': 0.05263157894736842, 'according': 0.05263157894736842, 'Centres': 0.05263157894736842, 'Disease': 0.05263157894736842, 'Control': 0.05263157894736842, 'Prevention': 0.05263157894736842, 'Kids': 0.05263157894736842, 'unable': 
0.05263157894736842, 'regulate': 0.05263157894736842, 'Risk': 0.05263157894736842, 'getting': 0.05263157894736842, 'increasing': 0.05263157894736842, 'obese': 0.05263157894736842, 'overweight': 0.05263157894736842, 'Eating': 0.05263157894736842, 'iron': 0.05263157894736842, 'rich': 0.05263157894736842, 'saturated': 0.05263157894736842, 'High': 0.05263157894736842, 'overloads': 0.05263157894736842, 'like': 0.05263157894736842, 'develop': 0.05263157894736842, 'extra': 0.05263157894736842, 'fatter': 0.05263157894736842, 'unhealthier': 0.05263157894736842, 'contain': 0.05263157894736842, 'carbohydrate': 0.05263157894736842, 'lethargic': 0.05263157894736842, 'sleepy': 0.05263157894736842, 'active': 0.05263157894736842, 'alert': 0.05263157894736842, 'Reflexes': 0.05263157894736842, 'senses': 0.05263157894736842, 'live': 0.05263157894736842, 'sedentary': 0.05263157894736842, 'source': 0.05263157894736842, 'constipation': 0.05263157894736842, 'ailments': 0.05263157894736842, 'clogged': 0.05263157894736842, 'strokes': 0.05263157894736842, 'poor': 0.05263157894736842, 'nutrition': 0.05263157894736842, 'easiest': 0.05263157894736842, 'reasons': 0.05263157894736842, 'increase': 0.05263157894736842, 'positive': 0.05263157894736842, 'points': 0.05263157894736842, 'requires': 0.05263157894736842, 'stay': 0.05263157894736842, 'fit': 0.05263157894736842, 'fulfilled': 0.05263157894736842, 'French': 0.05263157894736842, 'amounts': 0.05263157894736842, 'long': 0.05263157894736842, 'term': 0.05263157894736842, 'illnesses': 0.05263157894736842, 'consume': 0.05263157894736842, 'consumption': 0.05263157894736842, 'words': 0.05263157894736842, 'interferes': 0.05263157894736842, 'contains': 0.05263157894736842, 'higher': 0.05263157894736842, 'carbohydrates': 0.05263157894736842, 'levels': 0.05263157894736842, 'lethargy': 0.05263157894736842, 'inactiveness': 0.05263157894736842, 'sleepiness': 0.05263157894736842, 'reflex': 0.05263157894736842, 'inactive': 0.05263157894736842, 'worse': 
0.05263157894736842, 'clogs': 0.05263157894736842, 'avoided': 0.05263157894736842, 'save': 0.05263157894736842, 'ruined': 0.05263157894736842, 'problem': 0.05263157894736842, 'realize': 0.05263157894736842, 'ill': 0.05263157894736842, 'comes': 0.05263157894736842, 'late': 0.05263157894736842, 'issue': 0.05263157894736842, 'works': 0.05263157894736842, 'face': 0.05263157894736842, 'consequences': 0.05263157894736842, 'better': 0.05263157894736842, 'stop': 0.05263157894736842, 'avoid': 0.05263157894736842, 'encouraging': 0.05263157894736842, 'early': 0.05263157894736842, 'eat': 0.05263157894736842, 'vegetables': 0.05263157894736842, 'buds': 0.05263157894736842, 'developed': 0.05263157894736842, 'tasty': 0.05263157894736842, 'mix': 0.05263157894736842, 'serve': 0.05263157894736842, 'vegetable': 0.05263157894736842, 'style': 0.05263157894736842, 'Incorporate': 0.05263157894736842, 'types': 0.05263157894736842, 'following': 0.05263157894736842, 'recipes': 0.05263157894736842, 'home': 0.05263157894736842, 'attracted': 0.05263157894736842, 'short': 0.05263157894736842, 'deprive': 0.05263157894736842, 'Children': 0.05263157894736842, 'Make': 0.05263157894736842, 'sure': 0.05263157894736842, 'limited': 0.05263157894736842, 'quantities': 0.05263157894736842, 'periods': 0.05263157894736842})

In the output above, each keyword is associated with a value between 0 and 1.
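To see where values like these can come from, here is a minimal sketch that counts keywords and divides each count by the highest count, so the most frequent keyword gets 1.0. The keyword list is made up for illustration, and this normalization is my assumption about how scores of this kind are produced, not a copy of the earlier code.

```python
from collections import Counter

# Hypothetical keyword list; in the article the keywords come from the spaCy doc.
keywords = ["junk", "food", "junk", "health", "food", "junk", "sugar"]

counts = Counter(keywords)
max_count = max(counts.values())

# Dividing by the largest count puts every score in the range (0, 1].
normalized = {word: count / max_count for word, count in counts.items()}

print(normalized)
```

A keyword that appears only once ends up with the smallest score, which is why so many entries in a long dictionary can share the same small value.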

Next, you need to assign a score to each sentence. You can access the sentences of a doc through doc.sents. Iterate through the tokens of each sentence, look up the value of every token that is a keyword, and add it to that sentence's entry in a dictionary named score.

The code below demonstrates this.

# Creating a dictionary to store the score of each sentence
score={}

# Iterating through each sentence
for sentence in doc.sents: 
    # Iterating through token of each sentence
    for token in sentence:
        # checking if the token is a keyword
        if token.text in dictionary.keys():
            # If true, add the frequency of the keyword to the sentence's score
            if sentence in score.keys():
                score[sentence]+=dictionary[token.text]
            else:
                score[sentence]=dictionary[token.text]

print(score)
{Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children.: 1.789473684210526, They generally ask for the junk food daily because they have been trend so by their parents from the childhood.: 1.9473684210526316, They never have been discussed by their parents about the harmful effects of junk foods over health.: 1.894736842105263, According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways.: 2.0526315789473686, They are generally fried food found in the market in the packets.: 1.3684210526315788, They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.: 4.36842105263158, Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.: 2.7894736842105265, It makes able a person to gain excessive weight which is called as obesity.: 1.0526315789473686, Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.: 2.3684210526315788, Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods.: 4.526315789473684, It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.: 2.8421052631578947, In type-2 diabetes our body become unable to regulate blood sugar level.: 1.526315789473684, Risk of getting this disease is increasing as one become more obese or overweight.: 0.3684210526315789, It increases the risk of kidney failure.: 0.631578947368421, Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary 
fibers.: 3.210526315789473, It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.: 1.5789473684210524, High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.: 1.789473684210526, One who like junk food develop more risk to put on extra weight and become fatter and unhealthier.: 2.4736842105263164, Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.: 3.052631578947369, Reflexes and senses of the people eating this food become dull day by day: 1.6315789473684208, thus they live more sedentary life.: 0.3157894736842105, Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition.: 2.4210526315789473, Junk food is the easiest way to gain unhealthy weight.: 2.1578947368421053, The amount of fats and sugar in the food makes you gain weight rapidly.: 2.157894736842105, However, this is not a healthy weight.: 0.631578947368421, It is more of fats and cholesterol which will have a harmful impact on your health.: 0.8421052631578947, Junk food is also one of the main reasons for the increase in obesity nowadays.: 1.6315789473684208, This food only looks and tastes good, other than that, it has no positive points.: 1.5263157894736838, The amount of calorie your body requires to stay fit is not fulfilled by this food.: 1.5789473684210527, For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats.: 2.31578947368421, Therefore, this can result in long-term illnesses like diabetes and high blood pressure.: 1.4210526315789473, This may also result in kidney failure.: 0.3684210526315789, Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more.: 0.7368421052631579, You 
become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium.: 1.2105263157894737, In other words, all this interferes with the functioning of your heart.: 0.47368421052631576, Furthermore, junk food contains a higher level of carbohydrates.: 2.052631578947368, It will instantly spike your blood sugar levels.: 0.7894736842105263, This will result in lethargy, inactiveness, and sleepiness.: 0.3157894736842105, A person reflex becomes dull overtime: 0.42105263157894735, and they lead an inactive life.: 0.3684210526315789, To make things worse, junk food also clogs your arteries and increases the risk of a heart attack.: 2.7894736842105257, Therefore, it must be avoided at the first instance to save your life from becoming ruined.: 0.47368421052631576, The main problem with junk food is that people don’t realize its ill effects now.: 2.2105263157894735, When the time comes, it is too late.: 0.21052631578947367, Most importantly, the issue is that it does not impact you instantly.: 0.21052631578947367, It works on your overtime; you will face the consequences sooner or later.: 0.2631578947368421, Thus, it is better to stop now.: 0.10526315789473684, You can avoid junk food by encouraging your children from an early age to eat green vegetables.: 2.3157894736842106, Their taste buds must be developed as such that they find healthy food tasty.: 1.6842105263157894, Moreover, try to mix things up.: 0.2631578947368421, Do not serve the same green vegetable daily in the same style.: 0.2631578947368421, Incorporate different types of healthy food in their diet following different recipes.: 1.8421052631578945, This will help them to try foods at home rather than being attracted to junk food.: 2.631578947368421, In short, do not deprive them completely of it as that will not help.: 0.21052631578947367, Children will find one way or the other to have it.: 0.2631578947368421, Make sure you give them junk food in limited quantities and at 
healthy periods of time.: 2.3684210526315788}

Now that you have a score for each sentence, you can sort the sentences in descending order of significance.

# sorting the sentence scores
sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)

Now you are at the final part. Decide the number of sentences you want in the summary, then add sentences from sorted_score until you reach the desired no_of_sentences.

This gives you the final text summary.

# list to store sentences of summary
text_summary=[]

# Deciding the  total no of sentences in summary
no_of_sentences=4

# to count the no of sentence we already added to summary
total = 0
for i in range(len(sorted_score)):
    # appending to the summary
    text_summary.append(str(sorted_score[i][0]).capitalize()) 
    total += 1
    # checking if limit exceeded
    if(total >= no_of_sentences):
        break 

print(text_summary)

The printed list is the 4-sentence summary. You will have noticed that this approach takes more code than using gensim.
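One thing to keep in mind is that picking sentences from a list sorted by score returns them in score order, not in the order they appear in the text. If you want the summary to read in document order, a small sketch like the following can help. It uses plain strings and invented scores instead of spaCy spans, and the helper name top_sentences_in_order is hypothetical.

```python
import heapq

# Hypothetical (sentence, score) pairs, listed in document order.
scored_sentences = [
    ("Junk food tastes good.", 1.8),
    ("It is high in sugar and fat.", 4.4),
    ("It is better to stop now.", 0.1),
    ("It raises the risk of heart disease.", 2.8),
]

def top_sentences_in_order(scored, n):
    """Return the n highest-scoring sentences, in their original order."""
    # Pick the positions of the n best scores.
    best = heapq.nlargest(n, range(len(scored)), key=lambda i: scored[i][1])
    # Sorting the positions restores document order.
    return [scored[i][0] for i in sorted(best)]

summary = top_sentences_in_order(scored_sentences, 2)
print(" ".join(summary))
```

The design choice here is to rank by score but present by position, which usually makes an extractive summary easier to follow.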

These two are the traditional methods of text summarization. A newer, more advanced method outperforms both of them. I will discuss it next.

Generative Text Summarization

You can notice that in the extractive method, the sentences of the summary are all taken from the original text. There is no change in structure of any sentence.

Generative text summarization methods overcome this shortcoming. The idea is to capture the meaning of the text and generate entirely new sentences that best represent it in the summary.

These methods are more advanced and generally produce more natural summaries. Here, I shall guide you through implementing generative text summarization using Hugging Face.

The Hugging Face transformers library currently supports PyTorch and TensorFlow 2.0. You can use it to implement summarization. First, import torch.

import torch

You can install Hugging Face Transformers with the command below.

!pip install transformers --upgrade

Next, you can import pipeline from transformers. It is a function that builds a ready-to-use pipeline (model plus tokenizer) for a given task.

from transformers import pipeline

The pipeline has parameters which you can set as per the requirements of your problem.
Some main parameters are:

  • task : A string that specifies the task; the corresponding pipeline will be returned. Example: task="summarization" returns a SummarizationPipeline.
  • model : The model that will be used by the pipeline. You can use pretrained models.
  • tokenizer : The tokenizer used to encode the data for the model. You can use pretrained tokenizers.

Note : model and tokenizer are optional arguments. If they are not specified, the default ones for the task are used.

The example below builds a summarization pipeline.

# The model is published on the Hugging Face Hub under the id 'facebook/bart-large-cnn'
summarizer=pipeline(task="summarization",model='facebook/bart-large-cnn',tokenizer='facebook/bart-large-cnn')

The above code returns a SummarizationPipeline.

summarizer
article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. '

You can pass your text as input to the summarizer. The parameters min_length and max_length let you control the length of the summary.

our_summary=summarizer( article_text,min_length=5,max_length=100)
print(our_summary)
[{'summary_text': 'Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways.'}]

The output above shows the short summary. This is the method of summarization using pipeline.
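Models like bart-large-cnn accept only a limited number of input tokens (1024 for this model), so a long article may not be read in full. One possible workaround is to split the text into chunks of whole sentences and summarize each chunk. The sketch below is only an illustration: the helper names, the regex-based sentence splitting and the 400-word budget are my assumptions, and words are a rough stand-in for tokens.

```python
import re

def split_into_chunks(text, max_words=400):
    """Group whole sentences into chunks of roughly max_words words."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        # A single sentence longer than the budget still becomes its own chunk.
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

def summarize_long_text(summarizer, text, max_words=400):
    """Summarize each chunk with the given pipeline and join the results."""
    parts = [
        summarizer(chunk, min_length=5, max_length=100)[0]["summary_text"]
        for chunk in split_into_chunks(text, max_words)
    ]
    return " ".join(parts)

demo = "One two three. Four five six. Seven eight nine."
print(split_into_chunks(demo, max_words=6))
```

With the summarizer built above, you could call summarize_long_text(summarizer, article_text). The result is a concatenation of per-chunk summaries, so it can be longer and less coherent than a single summary.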

Now, let me introduce you to another method of text summarization using Pretrained models available in the transformers library.

Pretrained models with weights are available and can be accessed through the .from_pretrained() method. We shall use one such model, bart-large-cnn, for text summarization in this case.

To work with this model, import the corresponding tokenizer and model as shown below.

# Importing model and tokenizer
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig

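# Initializing the model & tokenizer using .from_pretrained()
# The model is published on the Hugging Face Hub under the id 'facebook/bart-large-cnn'
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')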

Next, you need to convert the input text into input ids using the batch_encode_plus method. With return_tensors='pt', the ids are returned as PyTorch tensors.

The encoded ids are passed as input to model.generate() along with max_length, which caps the number of tokens (not words) in the summary. This function returns the summary_ids. You can then decode these ids to text and print the summary.

# encode the input, truncating it to the 1024-token limit
inputs = tokenizer.batch_encode_plus([article_text], max_length=1024, truncation=True, return_tensors='pt')

# You can generate the ids that will constitute the summary.
summary_ids = model.generate(inputs['input_ids'], max_length=100, early_stopping=False)

# iterating through each id in summary
for ids in summary_ids:
    # decoding the ids back to text
    short = tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    # comparing original length and summary length
    print(len(article_text),'------>' ,len(short))
    print(short)
4870 ------> 484
Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in

ChatBot with Hugging Face

I’m sure you have used a chatbot on a website. It welcomes you and answers your questions.
It’s like a small-scale AI assistant!

How do these super-cool chatbots work?

They are built using NLP techniques to understand the context of a question and provide answers based on what they were trained on.

Hugging Face, a startup that is reshaping the field, has introduced an advanced method of building chatbots with transformers. You can check the Hugging Face implementation here.

If you are a beginner, you might get overwhelmed trying to implement it. The simpletransformers package makes the process easier. This library is built on the transformers library of Hugging Face.

Here, I shall guide you through building a chatbot using simpletransformers. First, install it with the command below.

pip install simpletransformers

For building chatbots or conversational AI models, you can use the ConvAIModel class.

from simpletransformers.conv_ai import ConvAIModel

Presently, it supports the OpenAI GPT and GPT-2 models. Some important parameters of ConvAIModel are:

  • model_type : A string indicating your model type, such as "gpt" or "gpt2".
  • model_name : A string specifying the exact name of a pretrained model. Here, I have used "openai-gpt", a pretrained model from Hugging Face Transformers. You can also pass the path to a folder containing the model.
  • args : A dictionary where you can set argument values that override the defaults.

Let me show you how to initialize the model in the code below.

train_args ={"fp16":False,
             "num_train_epochs": 1,
              "save_model_every_epoch": False}

my_chatbot=ConvAIModel("gpt","openai-gpt",use_cuda=True,args=train_args)
ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.

Now that the model is stored in my_chatbot, you can train it using the .train_model() function. When you call train_model() without passing training data, simpletransformers downloads and uses the default training data.

my_chatbot.train_model()
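If you want to train on your own conversations instead, you need a file in the layout the library expects. As far as I know, simpletransformers follows the PERSONA-CHAT style JSON used by Hugging Face, where each dialogue has a personality and a list of utterances. The sketch below builds one such record and checks its shape. The exact layout, the convention that the last candidate is the correct reply, and the helper is_valid_dialogue are assumptions for illustration, so please confirm them against the simpletransformers documentation.

```python
import json

# One hypothetical dialogue in the PERSONA-CHAT style layout (assumed format).
dialogue = {
    "personality": ["i run a small travel agency.", "i like mountains."],
    "utterances": [
        {
            # Earlier turns of the conversation, oldest first.
            "history": ["hello, do you offer trips to ooty?"],
            # Possible replies; by assumption the last one is the correct reply.
            "candidates": ["i have two cats.", "yes, we have a basic package for ooty."],
        }
    ],
}

def is_valid_dialogue(record):
    """Check that a record has the keys and types assumed above."""
    if not isinstance(record.get("personality"), list):
        return False
    utterances = record.get("utterances")
    if not isinstance(utterances, list) or not utterances:
        return False
    return all(
        isinstance(u.get("history"), list)
        and isinstance(u.get("candidates"), list)
        and len(u["candidates"]) > 0
        for u in utterances
    )

training_data = [dialogue]
assert all(is_valid_dialogue(d) for d in training_data)

# Serialized form that could be written to a training file.
print(json.dumps(training_data, indent=2))
```

Once such a file is saved, its path would be handed to train_model(). The documentation describes a train_file argument for this, but check the version you have installed.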

Once your chatbot is trained, you can test it through the interact() function as shown below.

my_chatbot.interact()

You have now successfully created your chatbot.

Language Translation

I am sure each of us has used a translator at some point! Language translation is the miracle that has made communication possible between people who speak different languages.

Language translation is one of the main applications of NLP. Here, I shall introduce you to some advanced methods to implement it.

A language translator can be built in a few steps using Hugging Face’s transformers library. Let us first install it.

!pip install transformers

The transformers library provides task-specific pipelines for our needs. This is one of the main features that gives Hugging Face its edge.

You can decide which pipeline you want through the task parameter. For translation tasks, you can provide task='translation_xx_to_yy'
(for xx and yy, refer to the standard language codes).

Let us say I want to translate English to Romanian!
My input will be task='translation_en_to_ro'.

from transformers import pipeline
translator_model=pipeline(task='translation_en_to_ro')
translator_model

Now that your pipeline is ready, you can pass the text in English as input.

translator_model('I welcome you to my country ! I am sure you will enjoy your visit')
[{'translation_text': 'Vă urez bun venit în țara mea !Sunt sigur că vă veți bucura de vizita dvs.'}]

You can see the translated sentence in the above output.
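Since the task name is just a string built from two language codes, you can generate it with a small helper. This is a minimal sketch: the function names translation_task and build_translator are hypothetical, and the check assumes plain two-letter lowercase codes.

```python
def translation_task(source, target):
    """Build a pipeline task name such as 'translation_en_to_ro'."""
    for code in (source, target):
        # Assumes plain two-letter lowercase language codes.
        if not (len(code) == 2 and code.isalpha() and code.islower()):
            raise ValueError(f"unexpected language code: {code!r}")
    return f"translation_{source}_to_{target}"

def build_translator(source, target):
    """Create the pipeline lazily so the import only happens when needed."""
    from transformers import pipeline
    return pipeline(task=translation_task(source, target))

print(translation_task("en", "ro"))
```

Keep in mind that, as far as I know, a default model is available only for some language pairs. For other pairs you would likely need to pass a suitable model explicitly.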

You might be wondering: is this the only way to do it?

Now, I shall introduce you to another efficient method, using simpletransformers. You can install it with the command below.

!pip install simpletransformers

For language translation, we shall use sequence-to-sequence models. So, you can import Seq2SeqModel as shown below.

import torch
from simpletransformers.seq2seq import Seq2SeqModel

What is a Sequence to Sequence model?

It has two parts: an encoder and a decoder. The encoder reads the input sequence and converts it into an intermediate vector. From this intermediate vector, the decoder predicts the output sequence. It is the common model for translation problems.

Coming back to our example, let us see how to instantiate the Seq2SeqModel.

The parameters are:

  • encoder_decoder_type : It can be bart, marian or encoder-decoder. For translation, we usually use marian.
  • encoder_decoder_name : The name of, or path to, a pretrained Hugging Face model.
    For the list of available pretrained models, click here.

The code below initializes the model with encoder_decoder_type='marian' and a pretrained model for translating English to Romanian.

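# model_args is not defined in the original snippet; an empty dict (an assumption) keeps the library defaults
model_args = {}
model = Seq2SeqModel(encoder_decoder_type="marian",encoder_decoder_name="Helsinki-NLP/opus-mt-en-ro",args=model_args,use_cuda=False)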

Now, your translation model is ready. You can pass the input text as a list of strings to the model.predict() function.

model.predict(['I welcome you to my country','I am sure you will enjoy your stay here'])
['Vă urez bun venit în ţara mea.',
 'Sunt sigur că vă veţi bucura de şederea dvs aici.']

You may feel that the time taken is long. You can always modify the arguments according to the needs of your problem. You can view the current values of the arguments through the model.args attribute.

model.args
{'adam_epsilon': 1e-08,
 'base_marian_model_name': 'Helsinki-NLP/opus-mt-en-ro',
 'best_model_dir': 'outputs/best_model',
 'cache_dir': 'cache_dir/',
 'config': {},
 'dataset_class': None,
 'do_lower_case': False,
 'do_sample': False,
 'early_stopping': True,
 'early_stopping_consider_epochs': False,
 'early_stopping_delta': 0,
 'early_stopping_metric': 'eval_loss',
 'early_stopping_metric_minimize': True,
 'early_stopping_patience': 3,
 'encoding': None,
 'eval_batch_size': 8,
 'evaluate_during_training': False,
 'evaluate_during_training_steps': 2000,
 'evaluate_during_training_verbose': True,
 'evaluate_generated_text': True,
 'fp16': False,
 'fp16_opt_level': 'O1',
 'gradient_accumulation_steps': 1,
 'learning_rate': 4e-05,
 'length_penalty': 2.0,
 'logging_steps': 50,
 'manual_seed': 4,
 'max_grad_norm': 1.0,
 'max_length': 50,
 'max_seq_length': 50,
 'max_steps': -1,
 'model_name': 'Helsinki-NLP/opus-mt-en-ro',
 'model_type': 'marian',
 'multiprocessing_chunksize': 500,
 'n_gpu': 1,
 'no_cache': False,
 'no_save': False,
 'num_beams': 1,
 'num_train_epochs': 10,
 'output_dir': 'outputs/',
 'overwrite_output_dir': True,
 'process_count': 1,
 'repetition_penalty': 1.0,
 'reprocess_input_data': True,
 'save_best_model': True,
 'save_eval_checkpoints': False,
 'save_model_every_epoch': False,
 'save_steps': 2000,
 'silent': False,
 'tensorboard_dir': None,
 'train_batch_size': 2,
 'use_cached_eval_features': False,
 'use_cuda': False,
 'use_early_stopping': False,
 'use_multiprocessing': False,
 'wandb_kwargs': {},
 'wandb_project': None,
 'warmup_ratio': 0.06,
 'warmup_steps': 0,
 'weight_decay': 0}
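To change some of these values, you can pass your own dictionary through the args parameter when creating the model. Here is a minimal sketch of merging a few overrides with the defaults shown above. The override values are arbitrary examples, and build_translation_model is a hypothetical helper that simply wraps the constructor call used earlier.

```python
# A few defaults copied from the model.args output above.
default_args = {"max_length": 50, "num_beams": 1, "eval_batch_size": 8}

# Hypothetical overrides: shorter outputs and a larger prediction batch.
overrides = {"max_length": 30, "eval_batch_size": 16}

# Later keys win, so the overrides replace the matching defaults.
model_args = {**default_args, **overrides}
print(model_args)

def build_translation_model(args):
    """Create the Marian model with the given args (the import is deferred)."""
    from simpletransformers.seq2seq import Seq2SeqModel
    return Seq2SeqModel(
        encoder_decoder_type="marian",
        encoder_decoder_name="Helsinki-NLP/opus-mt-en-ro",
        args=args,
        use_cuda=False,
    )
```

Shorter outputs usually take less time to decode, so lowering max_length is one of the simpler knobs to try when prediction feels slow.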

Text Generation

If you give a sentence or a phrase to a student, she can develop the sentence into a paragraph based on the context of the phrases. You can train your model to do the same with NLP.

This technique of generating new sentences relevant to context is called Text Generation.

You can implement text generation using torch and Hugging Face’s transformers

pip install transformers 
import torch

The transformers library has various pretrained models with weights. At any time, you can instantiate a pretrained version of a model through the .from_pretrained() method. There are different types of models such as BERT, GPT, GPT-2 and XLM.

For text generation, the GPT-2 model is a common choice. So, let us import the model and tokenizer.

from transformers import GPT2Tokenizer
from transformers import GPT2DoubleHeadsModel

Next, you can initialize your tokenizer and model from a pretrained GPT-2 model. Here, I will use 'gpt2-medium'.

tokenizer=GPT2Tokenizer.from_pretrained('gpt2-medium')
model=GPT2DoubleHeadsModel.from_pretrained('gpt2-medium')

Write the start of the sentence you want to build on and store it in a string.

my_text='It is a bright'

I shall first walk you step by step through the process to understand how the next word of the sentence is generated. After that, you can loop over the process to generate as many words as you want.

You can pass the string to .encode(), which converts a string into a sequence of ids using the tokenizer and vocabulary.

ids=tokenizer.encode(my_text)
ids
[1026, 318, 257, 6016]

Once you have encoded the text, pass the list of ids to torch.tensor(), which returns a tensor of these input ids. Store it in a variable my_tensor.

my_tensor=torch.tensor([ids])

You have to set your model in evaluation mode through the .eval() method. You can then move the tensor and the model to the GPU through the .to() method. Here, I have moved both to CUDA.

# Setting model in eval mode
model.eval()
my_tensor= my_tensor.to('cuda')
model.to('cuda')

You can pass the CUDA tensor as an argument to your model. The first element of the result holds a score for every vocabulary token at each position, and it is stored in predictions.

result=model(my_tensor)
predictions=result[0]

predictions is a tensor with a score (logit) for every possible next token id.

The torch.argmax() method returns the index of the maximum value in the input tensor. So you pass the scores for the last position to torch.argmax(), and the returned value is the id of the most likely next word.

predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_index
1110
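To make the argmax step more concrete, here is a small sketch in plain Python with a made-up five-word vocabulary. The scores are invented, and softmax is shown only to illustrate how scores relate to probabilities; the word with the highest score is also the one with the highest probability.

```python
import math

# Hypothetical scores (logits) for a five-word vocabulary.
vocab = ["day", "night", "idea", "car", "the"]
logits = [4.0, 2.5, 3.0, 0.5, 1.0]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    # Subtracting the maximum keeps the exponentials numerically stable.
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)

# argmax: the position of the largest value.
predicted_index = max(range(len(logits)), key=lambda i: logits[i])

print(vocab[predicted_index], round(probs[predicted_index], 3))
```

This is why the code above can apply argmax directly to the raw scores without converting them to probabilities first.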

You now have the id of the next word in encoded form. You have to convert it to text through the .decode() method. Here, I have appended predicted_index to the input ids and then decoded them to get the whole sentence.

predicted_text = tokenizer.decode(ids + [predicted_index])

print(predicted_text)
It is a bright day

You can see that the next word "day" has been generated and added to the original text.

Now that you understand how to generate the next word of a sentence, you can generate the required number of words with a loop.

Let me demonstrate the final method.

# Fix the no of words you want to generate
no_of_words_to_generate=30

# original text
text = 'It is a sunny'

# Moving the model to the GPU once, before the loop
model.to('cuda')

# Looping for each word to be generated
for i in range(no_of_words_to_generate):
    # encode the input text
    input_ids = tokenizer.encode(text)
    # convert into a tensor
    input_tensor = torch.tensor([input_ids])
    # convert into a CUDA tensor
    input_tensor = input_tensor.to('cuda')

    # passing input tensor to model
    result = model(input_tensor)
    # storing the scores of every possible next token
    predictions = result[0]
    # Choosing the predicted_index (of maximum value)
    predicted_index = torch.argmax(predictions[0,-1,:]).item()
    # decoding the predicted index to text and concatenating to original text
    text = tokenizer.decode(input_ids + [predicted_index])
    # end of loop


# Printing the whole generated text
print(text)
It is a sunny day in the city of San Francisco, and the sun is shining. The city is filled with people, and the people are enjoying themselves. The sun

The above output shows the text generated.
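The loop above always picks the single highest-scoring next token, which is usually called greedy decoding. To show the structure of that loop without a GPU or a downloaded model, here is a toy sketch where the "model" is just a lookup table. The table and the function name generate_greedy are invented for illustration, and unlike GPT-2 this toy looks only at the last word rather than the whole text.

```python
# Hypothetical "model": for each word, scores for the possible next words.
next_word_scores = {
    "It": {"is": 0.9, "was": 0.1},
    "is": {"a": 0.7, "the": 0.3},
    "a": {"sunny": 0.6, "bright": 0.4},
    "sunny": {"day": 0.8, "morning": 0.2},
    "day": {".": 1.0},
}

def generate_greedy(start, max_new_words):
    """Repeatedly append the highest-scoring next word."""
    words = start.split()
    for _ in range(max_new_words):
        candidates = next_word_scores.get(words[-1])
        if not candidates:
            break  # nothing is known about this word, so stop early
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

print(generate_greedy("It is", 10))
```

Because the highest-scoring word is chosen every time, the same starting text always produces the same continuation.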

Question-Answering with NLP

When given a source of knowledge, humans can understand it and answer questions based on it. How do you build a model that can do the same?

It is possible with NLP. The transformers library of Hugging Face provides an easy way to implement this.

As discussed in previous sections, the pipeline of transformers can be used for particular tasks. In this case, you can pass task='question-answering' as input to pipeline.

# Import pipeline from transformers
from transformers import pipeline

# Create a question-answering pipeline
faq_machine = pipeline(task="question-answering")

The above code will return a QuestionAnsweringPipeline. You can verify it.

faq_machine

Our model requires two inputs: question and context.

context refers to the source text on which the model bases its answers. And of course, you have to pass your question as a string too.

about_junkfood= ' Junk food refers to items like pizza,chips ,oily snacks which have high fat content. They are the main reason for obesity and heart diseases. You need to contol and reduce the amount.You should eat healthy food like fruits and vegetables '

# Passing the question and context to the model.
faq_machine(question=' what is junk food ?',context=about_junkfood)

{'answer': 'pizza,chips ,oily snacks',
 'end': 56,
 'score': 0.48831075420295633,
 'start': 32}

You can notice that faq_machine returns a dictionary which has the answer stored under the answer key.
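The start and end values in this dictionary are character positions in the context, so slicing the context with them gives back the answer. The sketch below checks this with the values from the output above and adds a simple score threshold. The helper answer_or_fallback and the threshold of 0.3 are my own assumptions, not part of the transformers API.

```python
# Beginning of the context string used above, and the result shown above.
context = " Junk food refers to items like pizza,chips ,oily snacks which have high fat content."
result = {
    "answer": "pizza,chips ,oily snacks",
    "start": 32,
    "end": 56,
    "score": 0.48831075420295633,
}

# start and end are character offsets into the context.
span = context[result["start"]:result["end"]]
print(span)

def answer_or_fallback(result, threshold=0.3, fallback="Sorry, I could not find that."):
    """Return the answer only when the model's score reaches the threshold."""
    # The threshold value is an arbitrary choice for this sketch.
    return result["answer"] if result["score"] >= threshold else fallback

print(answer_or_fallback(result))
```

A threshold like this can be useful in a customer-facing setting, where showing no answer is often better than showing a low-confidence one.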

There are many real-life applications of this. For example, let us say you have a tourism company. Every time a customer has a question, you may not have people available to answer. You can build a model which will do the job for you.

Let me demonstrate how QuestionAnsweringPipeline can be of great help for a tourism company.

# context about the tourist package
information=' The packages we provide are basic, medium and super deluxe. The tourist places include Gangtok,Ooty,Kodaikanal,Shimla,Himachal. The price for all basic packages are between 5000-7000 per person. The price for all medium packages are between 8000-1000 per person. The price for all super deluxe packages are between 12000-20000 per person. Travel will be through train for basic and medium packages.Travel will be through flight for super deluxe package.For local travel, private AC cars shall be provided'

# Querying the pipeline by passing question and context
answer=faq_machine(question='What packages do you offer ?',context=information)

#Printing the answer only from the dictionary
answer['answer']
'basic, medium and super deluxe.'

Let us try another question too.

# Another example
answer=faq_machine(question='What about local travel ?',context=information)
answer['answer']
'private AC cars shall be provided'

Text Classification

You can notice that all products and services get our feedback or review . How do they get to know if we liked their product or not ? It is impossible to classify each review as positive or negative manually!

This is where Text Classification with NLP takes the stage. You can classify texts into different groups based on the similarity of their content.

An advanced method for implementing Text Classification is through the simpletransformers library. Install it with the command below.

pip install simpletransformers

Now, I will walk you through a real-data example of classifying movie reviews as positive or negative. You can download the dataset from here.

First, read the dataset into a pandas dataframe and view its columns.

import pandas as pd
movie_data=pd.read_csv('/content/movie_reviews.csv')
movie_data.head()

   Unnamed: 0  review                                           sentiment
0           0  One of the other reviewers has mentioned that …  positive
1           1  A wonderful little production. <br /><br />The…  positive
2           2  I thought this was a wonderful way to spend ti…  positive
3           3  Basically there’s a family where a little boy …  negative
4           4  Petter Mattei’s “Love in the Time of Money” is…  positive

You can see it has a review column, which is our text data, and a sentiment column, which is the classification label. You need to build a model trained on movie_data which can classify any new review as positive or negative.

The simpletransformers library has ClassificationModel, which is designed specifically for text classification problems.

from simpletransformers.classification import ClassificationModel

You can initialize your model as shown below.

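# Training arguments: a single epoch, no mixed precision, overwrite any earlier output directory
train_args = {'fp16': False, 'save_model_every_epoch': False, 'num_train_epochs': 1, 'overwrite_output_dir': True, 'learning_rate': 1e-5}

# BERT-based classifier; use_cuda=True expects a GPU to be available
model = ClassificationModel('bert', 'bert-large-cased', use_cuda=True, args=train_args)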
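
Since use_cuda=True expects a CUDA-capable GPU, a small check like the following can pick the value for you when you are not sure one is available. This is only a sketch: detect_cuda is a hypothetical helper name, and it relies on torch.cuda.is_available() from PyTorch, which simpletransformers is built on.

```python
def detect_cuda():
    """Return True only if PyTorch is installed and reports a usable GPU."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to run on the GPU.
        return False
    return bool(torch.cuda.is_available())


use_cuda = detect_cuda()
print('Using GPU:', use_cuda)
```

You could then pass use_cuda=use_cuda when creating ClassificationModel. Keep in mind that training a large BERT model on a CPU will be very slow.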

Note that the training data you provide to ClassificationModel should contain the text in the first column and the label in the next column.

So, you need to preprocess your training data accordingly.

# Encode the sentiment as an integer label: positive -> 1, negative -> 0
movie_data['label'] = (movie_data['sentiment']=='positive').astype(int)

# Keep only the text and label columns, with the text first
movie_data = movie_data[['review', 'label']]

movie_data.head()

   review                                           label
0  One of the other reviewers has mentioned that …  1
1  A wonderful little production. <br /><br />The…  1
2  I thought this was a wonderful way to spend ti…  1
3  Basically there’s a family where a little boy …  0
4  Petter Mattei’s “Love in the Time of Money” is…  1

You can see that the training data is in the required form. You can pass it to the model.train_model() function.

model.train_model(movie_data)

Now that your model is trained, you can pass a new review string to the model.predict() function and check the output.

model.predict(['The movie was amazing '])
(array([1]), array([[-3.3968935,  4.061106 ]], dtype=float32))

From the above output, you can see that the model has assigned label 1 to your input review, meaning positive. A review the model judges to be negative would get label 0. So the model behaves as expected on this example.
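
The second array returned by model.predict() holds the raw model outputs, one score per label. If you would like something closer to a probability, applying a softmax to those scores is a common choice. The sketch below does that for the numbers printed above using only the standard library; softmax is a helper defined here, not a simpletransformers function.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Raw outputs copied from the prediction above: [score for label 0, score for label 1]
raw_outputs = [-3.3968935, 4.061106]

probabilities = softmax(raw_outputs)

# The predicted label is the index of the largest probability.
predicted_label = probabilities.index(max(probabilities))

print(predicted_label)
print([round(p, 4) for p in probabilities])
```

Here the probability for label 1 comes out above 0.99, which agrees with the label the model returned.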

You have seen the various uses of NLP techniques in this article. I hope you can now efficiently perform these tasks on any real dataset. The field of NLP is evolving rapidly, so stay tuned for more articles.
