Natural language processing (NLP) is the set of techniques by which computers process and understand human language. NLP lets you perform a wide range of tasks such as classification, summarization, text generation, translation and more.
NLP has advanced so much in recent times that AI can write its own movie scripts, create poetry, summarize text and answer questions for you from a piece of text. This article will help you understand the basic and advanced NLP concepts and show you how to implement them using the most popular NLP libraries: spaCy, Gensim, Hugging Face Transformers and NLTK.
Natural Language Processing. Art by Frances Hodgkins (d. 1947)
Introduction
More than 80% of the data available today is unstructured data. But what exactly is unstructured data, you ask?
Text, videos and images that cannot be represented in tabular form (or in any consistent structured data model) constitute unstructured data. They hold the key to unlocking powerful information.
To process and interpret unstructured text data, we use NLP.
NLP has a wide range of applications in the real world:
- Sentiment analysis
- Speech recognition
- Text Classification
- Machine Translation
- Semantic Search
- News/article Summarization
- Answering Questions
It is in fact what drives your favorite voice assistant software (like Google Assistant, Amazon Echo, etc.), chatbots, language translation, article summarization and so on.
In this article, you will go from the basic (and advanced) concepts of NLP to implementing state-of-the-art tasks like text summarization, classification, etc.
Install and Load Main Python Libraries for NLP
We have a large collection of NLP libraries available in Python. However, if you ask me to pick the most important ones, here they are. Using these, you can accomplish nearly all NLP tasks efficiently. This is all you need, well, mostly.
1. Natural Language Toolkit (NLTK)
It can be installed as shown:
# Install
!pip install nltk
Import the package and download the Punkt tokenizer data.
# importing nltk
import nltk
nltk.download('punkt')
2. spaCy
It is one of the most popular and advanced libraries for implementing NLP today. It has many distinct features that provide a clear advantage for processing text data and modeling.
Let me show you how to import spaCy and create an nlp object. To load the models and data for the English language, you have to use spacy.load('en_core_web_sm'). The model has to be downloaded once beforehand with python -m spacy download en_core_web_sm.
# Importing spaCy and creating nlp object
import spacy
nlp = spacy.load('en_core_web_sm')
The nlp object is referred to as a language model instance.
3. Gensim
It was developed for topic modelling. It supports NLP tasks like word embeddings, text summarization (only in Gensim versions before 4.0, which removed the summarization module) and many others.
!pip install gensim
# Importing gensim
import gensim
4. transformers
It was developed by Hugging Face and provides state-of-the-art models. It is an advanced library known for its transformer models, and it is under active development.
# Installing the package
!pip install transformers
In this article, you will learn how to use these libraries for various NLP tasks.
Text Pre-processing
The raw text data, often referred to as a text corpus, has a lot of noise. There are punctuation marks, suffixes and stop words that do not give us much information. Text pre-processing involves preparing the text corpus to make it more usable for NLP tasks.
Let me walk you through some basic steps of pre-processing a raw text corpus.
A text can be converted into a spaCy Doc object by passing it to the nlp object, as shown.
# Converting text into a spaCy object
raw_text = 'NLP is a very powerful tool'
text_doc = nlp(raw_text)
Next, an important term in NLP is ‘token’.
What is a Token?
The individual units of a text document/file, such as words and punctuation marks, are called tokens.
What is Tokenization?
The process of extracting tokens from a text file/document is referred to as tokenization.
You can print the text of each token by accessing token.text while using spaCy.
# printing tokens
for token in text_doc:
    print(token.text)
NLP
is
a
very
powerful
tool
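To make the idea of tokenization concrete, here is a minimal sketch that splits text into word and punctuation tokens with a regular expression from Python's standard library. The function name simple_tokenize is made up for this illustration. spaCy's real tokenizer is rule-based and handles many more cases (contractions, URLs, abbreviations), so treat this only as an approximation.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; this is not how spaCy tokenizes.
    """
    # \w+ matches a run of letters, digits or underscores;
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("NLP is a very powerful tool"))
print(simple_tokenize("I love you 3000!"))
```

Note that the punctuation mark comes out as a token of its own, just as it does in spaCy.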
spaCy provides many token attributes along with tokenization.
What if you want to know whether a particular token consists of alphabetic characters?
You can find out by printing token.is_alpha.
# Checking if the tokens of a text are alphabetic
mixed_text = 'I love you 3000!'
mixed_text_doc = nlp(mixed_text)
for token in mixed_text_doc:
    print(token.text, token.is_alpha)
I True
love True
you True
3000 False
! False
Similarly, what if you want to know whether a particular token is a space, a stop word or punctuation?
All this information is stored in attributes of the token. is_stop returns True if the token is a stop word (stop words are ‘It’, ‘was’, etc.).
is_punct returns True if the token is punctuation (!, ., etc.).
Let’s see an example of how to access them.
# Accessing information about tokens through the attributes
random_text='It was amazing! It is a great pleasure to be a part of the band.'
random_text_doc=nlp(random_text)
print("Text".ljust(10), ' ', "Alpha", "Space", "Stop", "Punct")
for token in random_text_doc:
    print(token.text.ljust(10), ':', token.is_alpha, token.is_space, token.is_stop, token.is_punct)
Text Alpha Space Stop Punct
It : True False True False
was : True False True False
amazing : True False False False
! : False False False True
It : True False True False
is : True False True False
a : True False True False
great : True False False False
pleasure : True False False False
to : True False True False
..(truncated)
You can see the details of each token similarly.
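These flags can be approximated with plain Python string methods, which may help build intuition for what they check. This is a rough sketch: describe_token is a hypothetical helper, and the punctuation test used here (every character belongs to a Unicode punctuation category) is an assumption about how such a flag can be computed, not a statement of spaCy's exact behaviour.

```python
import unicodedata

def describe_token(text):
    """Return rough stand-ins for the is_alpha, is_space and is_punct flags."""
    return {
        "text": text,
        # True only if every character is a letter
        "is_alpha": text.isalpha(),
        # True only if every character is whitespace
        "is_space": text.isspace(),
        # True only if the text is non-empty and every character
        # is in a Unicode punctuation category (category starts with "P")
        "is_punct": len(text) > 0 and all(
            unicodedata.category(ch).startswith("P") for ch in text
        ),
    }

for tok in ["amazing", "3000", "!", " "]:
    print(describe_token(tok))
```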
Tokenization
Let us look at another example, this time on a larger amount of text. Let’s say you have text data on a product, Alexa, and you wish to analyze it.
Below you can see the code for tokenizing this text, printing the tokens and counting them.
# printing all the tokens of a text
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).(truncated...).'
text_doc=nlp(raw_text)
token_count=0
for token in text_doc:
    print(token.text)
    token_count+=1
Amazon
Alexa
,
also
known
simply
(..truncated)
Let us also print the total number of tokens.
print('No of tokens originally',token_count)
No of tokens originally 708
So, you can see we have a lot of tokens here. Stop words like ‘it’, ‘was’, ‘that’, ‘to’ and so on do not give us much information, especially for models that look at which words are present and how many times they are repeated. The same goes for punctuation.
In large text files, stop words and punctuation are repeated very often, which can mislead us into thinking they are important. So, it is necessary to filter them out.
spaCy has a built-in set of stop words. You can view it as shown below. Because it is a set, the words may print in a different order each time.
# Accessing the built-in stop words in spaCy
from spacy.lang.en.stop_words import STOP_WORDS
stopwords = STOP_WORDS
list_stopwords=list(stopwords)
# printing a fraction of the list through indexing
for word in list_stopwords[:7]:
    print(word)
see
whom
through
whatever
may
go
regarding
How to remove the stop words and punctuation
As we already established, stop words need to be removed when performing frequency analysis.
In the same text data about the product Alexa, I am going to remove the stop words.
You can use is_stop to identify the stop words and remove them with the code below.
# Raw text document
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).[6][7]As of November 2018, Amazon had more than 10,000 employees working on Alexa and related products.[8] In January 2019, Amazons devices team announced that they had sold over 100 million Alexa-enabled devices.[9]In September 2019, Amazon launched many new devices achieving many records while competing with the worlds smart home industry. The new Echo Studio became the first smart speaker with 360 sound and Dolby sound. 
Other new devices included an Echo dot with a clock behind the fabric, a new third-generation Amazon Echo, Echo Show 8, a plug-in Echo device, Echo Flex, Alexa built-in wireless earphones, Echo buds, Alexa built-in spectacles, Echo frames, an Alexa built-in Ring, and Echo Loop.Alexa can perform a number of preset functions out-of-the-box such as set timers, share the current weather, create lists, access Wikipedia articles, and many more things. Users say a designated "wake word" (the default is simply "Alexa") to alert an Alexa-enabled device of an ensuing function command. Alexa listens for the command and performs the appropriate function, or skill, to answer a question or command. Alexas question answering ability is partly powered by the Wolfram Language. When questions are asked, Alexa converts sound waves into text which allows it to gather information from various sources. Behind the scenes, the data gathered is then parsed by Wolfram technology to generate suitable and accurate answers. Alexa-supported devices can stream music from the owners Amazon Music accounts and have built-in support for Pandora and Spotify accounts. Alexa can play music from streaming services such as Apple Music and Google Play Music from a phone or tablet.In addition to performing pre-set functions, Alexa can also perform additional functions through third-party skills that users can enable. Some of the most popular Alexa skills in 2018 included "Question of the Day" and "National Geographic Geo Quiz" for trivia; "TuneIn Live" to listen to live sporting events and news stations; "Big Sky" for hyper local weather updates; "Sleep and Relaxation Sounds" for listening to calming sounds; "Sesame Street" for children entertainment; and "Fitbit" for Fitbit users who want to check in on their health stats. In 2019, Apple, Google, Amazon, and Zigbee Alliance announced a partnership to make smart home products work together.'
text_doc=nlp(raw_text)
token_count_without_stopwords=0
# Filtering out the stopwords
filtered_text= [token for token in text_doc if not token.is_stop]
# Counting the tokens after removal of stopwords
for token in filtered_text:
    print(token)
    token_count_without_stopwords+=1
Amazon
Alexa
,
known
simply
Alexa
,
virtual
assistant
AI
technology
developed
Amazon
,
Amazon
Echo
smart
speakers
developed
Amazon
Lab126
.
capable
voice
interaction
,
..(truncated...)
You can see that the stop words have been removed. To understand how much effect it has, let us print the number of tokens after removing stopwords.
print('No of tokens after removing stopwords', token_count_without_stopwords)
No of tokens after removing stopwords 488
You can observe that there is a significant reduction in the number of tokens.
Next, let us remove the punctuation too. is_punct is used to identify punctuation. See the example below.
# Removing punctuation
filtered_text=[token for token in filtered_text if not token.is_punct]
token_count_without_stop_and_punct=0
# Counting the new number of tokens
for token in filtered_text:
    print(token)
    token_count_without_stop_and_punct += 1
Amazon
Alexa
known
simply
Alexa
virtual
assistant
AI
technology
developed
Amazon
Amazon
Echo
smart
speakers
developed
Amazon
(..truncated..)
You can observe that all punctuation marks are gone. Let us print the number of tokens now.
print('No of tokens after removing stopwords and punctuations', token_count_without_stop_and_punct)
No of tokens after removing stopwords and punctuations 361
The count is reduced considerably; we are left with only the significant words, which will help us understand the text.
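The two filtering passes can also be combined into one. The sketch below assumes only that each token exposes is_stop and is_punct attributes, as spaCy tokens do. The SimpleNamespace objects are stand-ins so that the example runs without loading a model, and keep_content_tokens and make_token are names chosen for this illustration.

```python
from types import SimpleNamespace

def keep_content_tokens(tokens):
    """Drop stop words and punctuation in a single pass."""
    return [t for t in tokens if not t.is_stop and not t.is_punct]

def make_token(text, is_stop=False, is_punct=False):
    # Stand-in for a spaCy token, for demonstration only
    return SimpleNamespace(text=text, is_stop=is_stop, is_punct=is_punct)

tokens = [
    make_token("It", is_stop=True),
    make_token("was", is_stop=True),
    make_token("amazing"),
    make_token("!", is_punct=True),
]

kept = keep_content_tokens(tokens)
print([t.text for t in kept])
# len() gives the counts directly, so no manual counter is needed
print(f"{len(tokens)} tokens before, {len(kept)} after")
```

With spaCy, the same function could be called on a real document, for example keep_content_tokens(nlp(raw_text)).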
Now that you have relatively better text for analysis, let us look at a few other text preprocessing methods.
Stemming
Let us look at some words, say ‘calculator’, ‘calculating’ and ‘calculation’. We know that these words are closely related. How do we ensure that algorithms know this too?
It can be achieved through stemming. Stemming means reducing a word to its ‘root form’.
Let us see an example of how to implement stemming using PorterStemmer() from nltk.
# Import stemmer and perform stemming on a list of words
from nltk.stem import PorterStemmer
ps = PorterStemmer()
# No trailing spaces in the words: a trailing space prevents the stemmer from matching suffixes
words = ['dance', 'dances', 'dancing', 'danced']
for word in words:
    print(word.ljust(8),'-------', ps.stem(word))
dance    ------- danc
dances   ------- danc
dancing  ------- danc
danced   ------- danc
You can observe that all four words were reduced to the common stem ‘danc’.
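To see why a stemmer can produce non-words, here is a deliberately naive suffix-stripping sketch. It is not the Porter algorithm, which applies several ordered phases of rules. The suffix list and the toy_stem name are assumptions made for illustration.

```python
# Suffixes are tried in order, longer ones first
SUFFIXES = ("ing", "ed", "es", "e", "s")

def toy_stem(word, min_stem_length=3):
    """Chop the first matching suffix, keeping at least min_stem_length characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[: -len(suffix)]
    return word

for word in ["dance", "dances", "dancing", "danced"]:
    print(word.ljust(8), "-------", toy_stem(word))
```

Because the rule only looks at characters, the result is a shared prefix rather than a dictionary word.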
But ‘danc’ is not a real word. Stemming may give us root forms that are not present in the dictionary. How do we overcome this?
That’s where lemmatization comes to the rescue.
Lemmatization
It is similar to stemming, except that the root word (the lemma) is a valid dictionary word.
There are different ways to perform lemmatization. I’ll show lemmatization using nltk and spacy in this article.
Lemmatization through NLTK
The most commonly used lemmatization technique is WordNetLemmatizer from the nltk library.
Let me show a simple example of how it works.
# download wordnet
nltk.download('wordnet')
# Importing the lemmatizer
from nltk.stem import WordNetLemmatizer
# Initialize the Wordnet Lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatizing a single word
print(lemmatizer.lemmatize("dances"))
dance
You can see that dances ==> dance. Now let us move on to a more realistic case with a larger text and perform lemmatization.
Let us say you have text data on the performance of various models.
# Tokenize
my_text='The model AIM has a performce accuracy of 87% and it is redady to be released. The release is going to be held on day 12.3.2000 and it has amzing performance recod.Model AIM2.0 has accuracy of 89% and better performing expectations are there.le us see how it performs'
words=nltk.word_tokenize(my_text)
print(words)
['The', 'model', 'AIM', 'has', 'a', 'performce', 'accuracy', 'of', '87', '%', 'and', 'it', 'is', 'redady', 'to', 'be', 'released', '.', 'The', 'release', 'is', 'going', 'to', 'be', 'held', 'on', 'day', '12.3.2000', 'and', 'it', 'has', 'amzing', 'performance', 'recod.Model', 'AIM2.0', 'has', 'accuracy', 'of', '89', '%', 'and', 'better', 'performing', 'expectations', 'are', 'there.le', 'us', 'see', 'how', 'it', 'performs']
You can observe the tokens printed. The code below performs lemmatization and prints the final lemmatized text.
lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in words])
print(lemmatized_output)
The model AIM ha a performce accuracy of 87 % and it is redady to be released . The release is going to be held on day 12.3.2000 and it ha amzing performance recod.Model AIM2.0 ha accuracy of 89 % and better performing expectation are there.le u see how it performs
Notice that ‘has’ became ‘ha’ and ‘us’ became ‘u’. This happens because lemmatize() treats every word as a noun unless you pass a part-of-speech tag through its pos argument (for example, pos='v' for verbs).
Lemmatization using spaCy
In spaCy, the token object has an attribute .lemma_ which allows you to access the lemmatized version of that token. See the example below.
# Lemmatization using spaCy
text='dance dances dancing danced'
text_doc=nlp(text)
for token in text_doc:
    print(token.text,'-------',token.lemma_)
dance ------- dance
dances ------- dance
dancing ------- dance
danced ------- dance
Here, all words are reduced to ‘dance’, which is meaningful and just as required.
This is why lemmatization is often preferred over stemming.
Let me show you how to convert a text into lemmatized form.
# lemmatizing a text document using spacy
text=' I had calculated it wrong.I am calculating the bill again.'
text_doc=nlp(text)
# Only punctuation is filtered here; stop words are kept, as the output below shows
filter_text=[token for token in text_doc if not token.is_punct]
" ".join([token.lemma_ for token in filter_text])
' -PRON- have calculate -PRON- wrong -PRON- be calculate the bill again'
You can see that the words are lemmatized. Also, spaCy 2.x uses the placeholder -PRON- as the lemma of every pronoun in the sentence; newer versions return a regular lemma for pronouns instead.
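In contrast to suffix stripping, a lemmatizer relies on vocabulary knowledge. The toy lookup below is only a sketch of that idea: the LEMMA_TABLE entries are hand-written for this example, whereas real lemmatizers such as spaCy's or WordNet's combine much larger dictionaries with part-of-speech information.

```python
# Hand-written mapping for illustration; real lemmatizers use large lexicons
LEMMA_TABLE = {
    "had": "have",
    "am": "be",
    "calculated": "calculate",
    "calculating": "calculate",
}

def toy_lemmatize(word):
    """Look the lowercased word up in the table; fall back to the word itself."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)

words = ["I", "had", "calculated", "it", "wrong"]
print(" ".join(toy_lemmatize(w) for w in words))
```

Irregular forms such as ‘had’ and ‘am’ are exactly the cases that no suffix rule could handle.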
Word Frequency Analysis
Once the stop words are removed and lemmatization is done, the tokens we have can be analysed further for information about the text data.
The words which occur most frequently in a text often hold the key to what it is about. So, we shall store all tokens with their frequencies.
You can use Counter to get the frequency of each token as shown below. If you provide a list to Counter, it returns a dictionary-like object with each element as a key and its frequency as the value.
# import Counter
from collections import Counter
# creating nlp obj using spaCy
data='It is my birthday today. I could not have a birthday party. I felt sad'
data_doc=nlp(data)
#creating a list of tokens after preprocessing
list_of_tokens=[token.text for token in data_doc if not token.is_stop and not token.is_punct]
#passing the list of tokens to Counter
token_frequency=Counter(list_of_tokens)
print(token_frequency)
Counter({'birthday': 2, 'today': 1, 'party': 1, 'felt': 1, 'sad': 1})
From the above output, the word ‘birthday’ is the most common. So, it gives us an idea that the text is about a birthday.
Let us take a similar case, but with a larger amount of data. For example, you have text data about a particular place, and you want to know the important topics.
How do you find out?
# Word Frequency
longtext='Gangtok is a city, municipality, the capital and the largest town of the Indian state of Sikkim. It is also the headquarters of the East Sikkim district. Gangtok is in the eastern Himalayan range, at an elevation of 1,650. The towns population of 100000 are from different ethnicities such as Bhutia, Lepchas and Indian Gorkhas. Within the higher peaks of the Himalaya and with a year-round mild temperate climate, Gangtok is at the centre of Sikkims tourism industry.Gangtok rose to prominence as a popular Buddhist pilgrimage site after the construction of the Enchey Monastery in 1840. In 1894, the ruling Sikkimese Chogyal, Thutob Namgyal, transferred the capital to Gangtok. In the early 20th century, Gangtok became a major stopover on the trade route between Lhasa in Tibet and cities such as Kolkata (then Calcutta) in British India. After India won its independence from Britain in 1947, Sikkim chose to remain an independent monarchy, with Gangtok as its capital. In 1975, after the integration with the union of India, Gangtok was made Indias 22nd state capital.Like the rest of Sikkim, not much is known about the early history of Gangtok. The earliest records date from the construction of the hermitic Gangtok monastery in 1716.[7] Gangtok remained a small hamlet until the construction of the Enchey Monastery in 1840 made it a pilgrimage center. It became the capital of what was left of Sikkim after an English conquest in the mid-19th century in response to a hostage crisis. After the defeat of the Tibetans by the British, Gangtok became a major stopover in the trade between Tibet and British India at the end of the 19th century. Most of the roads and the telegraph in the area were built during this time.In 1894, Thutob Namgyal, the Sikkimese monarch under British rule, shifted the capital from Tumlong to Gangtok, increasing the citys importance. A new grand palace along with other state buildings was built in the new capital. 
Following India independence in 1947, Sikkim became a nation-state with Gangtok as its capital. Sikkim came under the suzerainty of India, with the condition that it would retain its independence, by the treaty signed between the Chogyal and the then Indian Prime Minister Jawaharlal Nehru.[9] This pact gave the Indians control of external affairs on behalf of Sikkimese. Trade between India and Tibet continued to flourish through the Nathula and Jelepla passes, offshoots of the ancient Silk Road near Gangtok. These border passes were sealed after the Sino-Indian War in 1962, which deprived Gangtok of its trading business.[10] The Nathula pass was finally opened for limited trade in 2006, fuelling hopes of economic boomIn 1975, after years of political uncertainty and struggle, including riots, the monarchy was abrogated and Sikkim became India twenty-second state, with Gangtok as its capital after a referendum. Gangtok has witnessed annual landslides, resulting in loss of life and damage to property. The largest disaster occurred in June 1997, when 38 were killed and hundreds of buildings were destroyed'
long_text=nlp(longtext)
list_of_tokens=[token.text for token in long_text if not token.is_stop and not token.is_punct]
token_frequency=Counter(list_of_tokens)
print(token_frequency)
Counter({'Gangtok': 18, 'capital': 9, 'Sikkim': 8, 'India': 8, 'state': 5, 'Indian': 4, 'British': 4, 'construction': 3, 'Sikkimese': 3, 'century': 3, 'trade': 3, 'Tibet': 3, 'independence': 3, 'largest': 2, 'pilgrimage': 2, 'Enchey': 2, 'Monastery': 2, '1840': 2, '1894': 2, 'Chogyal': 2, 'Thutob': 2, 'Namgyal': 2, 'early': 2, 'major': 2, 'stopover': 2, '1947': 2, 'monarchy': 2, '1975': 2, 'built': 2, 'new': 2, 'buildings': 2, 'Nathula': 2, 'passes': 2, 'city': 1, 'municipality': 1, 'town': 1, 'headquarters': 1, 'East': 1, 'district': 1, 'eastern': 1, 'Himalayan': 1, 'range': 1, 'elevation': 1, '1,650': 1, 'towns': 1, 'population': 1, '100000': 1, 'different': 1, 'ethnicities': 1, 'Bhutia': 1, 'Lepchas': 1, 'Gorkhas': 1, 'higher': 1, 'peaks': 1, 'Himalaya': 1, 'year': 1, 'round': 1, 'mild': 1, 'temperate': 1, 'climate': 1, 'centre': 1, 'Sikkims': 1, 'tourism': 1, 'industry': 1, 'rose': 1, 'prominence': 1, 'popular': 1, 'Buddhist': 1 (..truncated..)})
As you can see, as the size of the text data increases, it becomes difficult to analyse the frequency of every token. So, you can print the n most common tokens using the most_common method of Counter.
See the example below on how to print the 6 most frequent words of the text.
# printing the 6 most common words using the most_common() method of Counter
most_frequent_tokens=token_frequency.most_common(6)
print(most_frequent_tokens)
[('Gangtok', 18), ('capital', 9), ('Sikkim', 8), ('India', 8), ('state', 5), ('Indian', 4)]
You see that the keywords are Gangtok, capital, Sikkim, India and so on. It gives us an idea of the text to start with.
It helps you find the main keywords. Hence, frequency analysis of tokens is an important method in text processing.
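One refinement worth considering is to normalise case before counting, so that ‘Capital’ and ‘capital’ are not counted separately, and to express counts as a share of all tokens. The sketch below uses a short hand-made token list rather than the Gangtok text, so its numbers are purely illustrative.

```python
from collections import Counter

# Hand-made token list for illustration only
tokens = ["Gangtok", "capital", "Sikkim", "Capital", "gangtok", "India", "Gangtok"]

# Lowercase every token so that case variants share one key
counts = Counter(token.lower() for token in tokens)
total = sum(counts.values())

for word, count in counts.most_common(3):
    # Relative frequency = occurrences of the word / total number of tokens
    print(f"{word:<10} {count}  {count / total:.2f}")
```

Lowercasing is a trade-off: it merges variants of the same word, but it can also merge words that differ only by case.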
Let us look at more methods to understand the text data.
Part of Speech Tagging
Each word has its own role in a sentence. For example, take ‘Geeta is dancing’. Geeta is the person, a ‘Noun’, and dancing is the action performed by her, so it is a ‘Verb’. Likewise, each word can be classified. This is referred to as POS or Part of Speech tagging.
What are the parts of Speech?
- Noun
- Verb
- Adjective
- Pronoun
- Adverb
- Preposition
- Conjunction
- Interjection
and so on.
POS tagging can be implemented using both spaCy and nltk. First, let us see the method using spaCy.
In spaCy, the POS tags are stored as an attribute of the Token object. You can access the POS tag of a particular token through the token.pos_ attribute.
See the example below.
text='Nitika was screaming out loud as usual Her roommate kept ignoring her '
text_doc=nlp(text)
for token in text_doc:
    print(token.text.ljust(10),'-----',token.pos_)
Nitika ----- PROPN
was ----- AUX
screaming ----- VERB
out ----- ADP
loud ----- ADV
as ----- SCONJ
usual ----- ADJ
Her ----- DET
roommate ----- NOUN
kept ----- VERB
ignoring ----- VERB
her ----- PRON
It is very easy, as it is already available as an attribute of the token. Features like this are why many people prefer spaCy over nltk.
POS tagging in a text file
In real life, you will stumble across huge amounts of data in the form of text files.
How do you perform part-of-speech analysis on a text file?
Say you have a text file on robotics. See the code below to understand how to work with text files.
# opening the text file and reading it into a string
with open('robotics.txt') as file:
    robot_text=file.read()
Now, the content of the text file is stored in the string robot_text.
Next, like we did before, you have to pre-process the text.
# Preprocessing the text
robot_doc=nlp(robot_text)
robot_doc=[token for token in robot_doc if not token.is_stop and not token.is_punct]
What if you want to know what kinds of tokens are present in your text?
You can print them with the help of token.pos_, as shown in the code below.
# creating a dictionary with parts of speech present in the doc
all_tags = {token.pos: token.pos_ for token in robot_doc}
print(all_tags)
{92: 'NOUN', 84: 'ADJ', 100: 'VERB', 103: 'SPACE', 86: 'ADV', 96: 'PROPN', 93: 'NUM', 98: 'SCONJ', 101: 'X', 85: 'ADP'}
What if you want to extract all the nouns present in the text?
You can categorize the tokens depending on their POS tags. The example below demonstrates how to collect and print all the nouns and verbs in robot_doc.
# creating lists to store nouns and verbs
nouns=[]
verbs=[]
# appending tokens to the lists according to POS category
for token in robot_doc:
    if token.pos_ =='NOUN':
        nouns.append(token)
    if token.pos_ =='VERB':
        verbs.append(token)
print('List of Nouns in the text\n',nouns)
print('List of verbs in the text\n',verbs)
List of Nouns in the text
[Robotics, research, area, interface, computer, science, engineering, Robotics, design, construction, operation, use, robots, goal, robotics, machines, humans, day, day, lives, Robotics, achievement, information, engineering, computer, engineering, engineering, engineering, Robotics, machines, humans, actions, Robots, situations, lots, purposes, today, environments, inspection, materials, bomb, detection, deactivation, manufacturing, processes, humans, space, underwater, heat, containment, materials, radiation, Robots, form, humans, appearance, acceptance, robot, behaviors, people, robots, lifting, speech, cognition, activity, today, robots, nature, field, bio, (..truncated..)]
List of verbs in the text
[involves, design, help, assist, draws, develops, substitute, replicate, including, survive, clean, resemble, said, help, performed, attempt, replicate, walking, inspired, contributing, inspired, creating, operate, dates, grow, assumed, robots, mimic, manage, growing, continue, researching, building, serve, built, defusing, finding, exploring, injected, revolutionize, involves, overlaps, derived, introduced, published, comes, means, begins, makes, called, mistaken, coin, wrote, named, According, published, coining, assumed, referred, states, introduced, predates, cited, share, comes, designed, achieve, designed, travel, use, completing, g, wheeled, designed, maneuver, function,(..truncated..)]
All the tokens which are nouns have been added to the list nouns, and all the verbs to the list verbs.
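If you need more than two categories, one list per tag becomes unwieldy. A possible alternative is to group tokens by tag in a dictionary. The sketch assumes only that each token has text and pos_ attributes, as spaCy tokens do. The SimpleNamespace objects and the group_by_pos name are stand-ins chosen for this illustration.

```python
from collections import defaultdict
from types import SimpleNamespace

def group_by_pos(tokens):
    """Map each POS tag to the list of token texts carrying that tag."""
    groups = defaultdict(list)
    for token in tokens:
        groups[token.pos_].append(token.text)
    return dict(groups)

# Stand-ins for spaCy tokens, tagged as in the example sentence earlier
tokens = [
    SimpleNamespace(text="Nitika", pos_="PROPN"),
    SimpleNamespace(text="was", pos_="AUX"),
    SimpleNamespace(text="screaming", pos_="VERB"),
    SimpleNamespace(text="roommate", pos_="NOUN"),
    SimpleNamespace(text="kept", pos_="VERB"),
]

groups = group_by_pos(tokens)
print(groups["VERB"])
# .get avoids a KeyError for tags that never occurred
print(groups.get("ADJ", []))
```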
In some cases, you may not need the verbs or numbers, when your information lies in nouns and adjectives.
In the current example, there is a POS tag named ‘X’. Let us look at which tokens belong to this category.
for token in robot_doc:
    if token.pos_=='X':
        print(token.text)
i.e.
etc
etc
etc
etc
It is clear that the tokens of this category are not significant. So, you can remove them.
How to remove tokens of certain POS categories?
You can write a function remove_pos that returns True if the token belongs to a junk POS category, and False otherwise.
The code below removes the tokens of categories ‘X’ and ‘SCONJ’.
# Storing the junk POS tags
junk_pos=['X','SCONJ']
# Function to check if the token falls in the junk POS category.
def remove_pos(word):
    return word.pos_ in junk_pos
# Creating a new doc without the unwanted tokens
revised_robot_doc=[token for token in robot_doc if not remove_pos(token)]
# Printing POS tags present in the new document.
all_tags = {token.pos: token.pos_ for token in revised_robot_doc}
print(all_tags)
{92: 'NOUN', 84: 'ADJ', 100: 'VERB', 103: 'SPACE', 86: 'ADV', 96: 'PROPN', 93: 'NUM', 85: 'ADP'}
You can see that in the new document the X and SCONJ tags are not there. So, mission successful!
For better understanding, you can use the displacy visualizer of spaCy.
# Using displacy
from spacy import displacy
with open('/content/robotics.txt') as file:
    text=file.read()
doc=nlp(text)
displacy.render(doc,style='dep',jupyter=True)
Dependency Parsing
In a sentence, the words have a relationship with each other. The one word in a sentence which does not depend on any other word is called the head or root word. All the other words depend, directly or indirectly, on the root word; they are termed dependents.
Dependency parsing is the method of analyzing the relationships (dependencies) between the different words of a sentence.
Usually, in a sentence, the main verb is the root word.
You can access the dependency label of a token through the token.dep_ attribute. Printing token.dep_ for each token gives its dependency tag.
What is the meaning of these dependency tags?
- nsubj: nominal subject of the sentence
- ROOT: the root word (generally the main verb)
- prep: prepositional modifier; it modifies the meaning of a noun, verb, adjective, or preposition
- cc and conj: cc is a coordinating conjunction (and, or, but); conj is a word joined to an earlier one by such a conjunction
- pobj: object of the preposition
- aux: auxiliary verb
- dobj: direct object of the verb
- det: determiner (a, the), which depends on the noun it introduces
- poss: possession modifier (her, its)
Let me show an example that prints the dependencies of a sentence.
# Printing the dependency of each token
my_text='Ardra fell into a well and fractured her leg'
my_doc=nlp(my_text)
for token in my_doc:
    print(token.text,'---',token.dep_)
Ardra --- nsubj
fell --- ROOT
into --- prep
a --- det
well --- pobj
and --- cc
fractured --- conj
her --- poss
leg --- dobj
Clearly the verb “fell” is the ROOT word, as expected.
For a better understanding of the dependencies, you can use displacy from spaCy on our doc object.
from spacy import displacy
displacy.render(my_doc ,style='dep', jupyter=True)
What if you want to know the head word of a particular token?
In spaCy, you can access the head word of every token through token.head.text.
See the example below.
# Accessing the head word of each token
my_text="The robot displayed its functions on the big screen The audience applauded"
my_doc=nlp(my_text)
for token in my_doc:
    print(token.text,'---',token.head.text)
The --- robot
robot --- displayed
displayed --- displayed
its --- functions
functions --- displayed
on --- displayed
the --- screen
big --- screen
screen --- on
--- screen
The --- audience
audience --- applauded
applauded --- applauded
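Notice that "displayed" and "applauded" point to themselves in the output: a root token is its own head. The sketch below uses that fact to pick out the roots and to group the dependents under each head. To keep it runnable without a downloaded model, it builds the Doc by hand, assuming spaCy v3 (where heads are given as absolute token indices). The heads mirror the output above; the dependency labels are my assumption and are typed in rather than predicted by a parser.

```python
from collections import defaultdict

from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-typed parse for illustration (not parser output).
# heads[i] is the absolute index of token i's head (spaCy v3 convention).
words = ["The", "robot", "displayed", "its", "functions", "on", "the", "big",
         "screen", "The", "audience", "applauded"]
heads = [1, 2, 2, 4, 2, 2, 8, 8, 5, 10, 11, 11]
deps = ["det", "nsubj", "ROOT", "poss", "dobj", "prep", "det", "amod",
        "pobj", "det", "nsubj", "ROOT"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

# A root token is its own head
roots = [token.text for token in doc if token.head.i == token.i]

# Group every non-root token under its head word
# (keyed by text for readability; repeated head words would share a key)
dependents = defaultdict(list)
for token in doc:
    if token.head.i != token.i:
        dependents[token.head.text].append(token.text)

print(roots)
print(dict(dependents))
```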
Let us see an example of dependency parsing in a real-life application. I will use the same text file on robotics for this too.
How do you print the root word of every sentence?
The example below demonstrates this.
# Printing root word of each sentence of the doc
with open('robotics.txt') as file:
    robot_text=file.read()
robotics_doc=nlp(robot_text)
sentences=list(robotics_doc.sents)
for sentence in sentences:
    print(sentence.root)
is
involves
is
draws
develops
used
take
said
attempt
—(truncated)
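A long list of roots is easier to read as a tally. The sketch below counts how often each word is the root of a sentence. The helper count_roots and the tiny two-sentence demo_doc are hypothetical stand-ins: the parse is typed in by hand (assuming spaCy v3) so that the snippet runs without the robotics file or a downloaded model.

```python
from collections import Counter

from spacy.tokens import Doc
from spacy.vocab import Vocab


def count_roots(doc):
    """Count how often each (lower-cased) word is the root of a sentence."""
    return Counter(sent.root.text.lower() for sent in doc.sents)


# Hand-typed two-sentence parse (illustrative only, not parser output)
words = ["Robots", "help", "people", ".", "Engineers", "help", "robots", "."]
heads = [1, 1, 1, 1, 5, 5, 5, 5]
deps = ["nsubj", "ROOT", "dobj", "punct", "nsubj", "ROOT", "dobj", "punct"]
demo_doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

root_counts = count_roots(demo_doc)
print(root_counts.most_common(3))
```

With a real parse, you would call count_roots(robotics_doc) instead.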
How to navigate a dependency tree
Let me show you an example of how to access the children of a particular token.
print([token.text for token in robotics_doc[5].children])
['an', 'interdisciplinary', 'research', 'at']
How do you get the previous neighboring node of a token?
print(robotics_doc[5].nbor(-1))
research
How do you print the fifth neighboring node after a token?
print(robotics_doc[5].nbor(5))
computer
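Besides children and nbor, a token also exposes lefts, rights, subtree and ancestors for moving around the tree. Here is a small sketch on the robot sentence from earlier. The parse is again typed in by hand (assuming spaCy v3, with labels that are my assumption) so that it runs without a model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hand-typed parse for illustration (not parser output)
words = ["The", "robot", "displayed", "its", "functions", "on", "the", "big", "screen"]
heads = [1, 2, 2, 4, 2, 2, 8, 8, 5]
deps = ["det", "nsubj", "ROOT", "poss", "dobj", "prep", "det", "amod", "pobj"]
doc = Doc(Vocab(), words=words, heads=heads, deps=deps)

verb = doc[2]  # 'displayed'
noun = doc[8]  # 'screen'

# Children that sit to the left / right of the verb
left_children = [t.text for t in verb.lefts]
right_children = [t.text for t in verb.rights]

# The noun together with everything that depends on it, in sentence order
noun_phrase = [t.text for t in noun.subtree]

# Heads on the way from the noun up to the root
path_to_root = [t.text for t in noun.ancestors]

print(left_children, right_children)
print(noun_phrase)
print(path_to_root)
```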
Named Entity Recognition
Suppose you have a collection of news articles. What if you want to know which companies or organizations have been in the news? How will you do that?
Take another case: a text dataset of information about influential people. What if you want to know the names of the influencers in this dataset?
The answer to all these questions is NER.
What is NER?
NER is the technique of identifying named entities in a text corpus and assigning them pre-defined categories such as ‘person names’, ‘locations’, ‘organizations’, etc.
It is a very useful method, especially for classification problems and search engine optimization.
NER can be implemented through both nltk and spacy. I will walk you through both methods.
NER with NLTK
Let us start with a simple example to understand how to implement NER with nltk.
Consider the sentence below. Your goal is to identify which tokens are person names and which is a company.
sentence = " John and Emily are from India , they work at Yahoo"
The goal is to identify the named entities. How do you do this?
First, tokenize your text using nltk.word_tokenize. Next, pass the tokens through nltk.pos_tag to get the part-of-speech tag of each token, as shown below.
# POS tagging on the text
from nltk import word_tokenize, pos_tag
tokens=word_tokenize(sentence)
tokens_pos=pos_tag(tokens)
print(tokens_pos)
[('John', 'NNP'), ('and', 'CC'), ('Emily', 'NNP'), ('are', 'VBP'), ('from', 'IN'), ('India', 'NNP'), (',', ','), ('they', 'PRP'), ('work', 'VBP'), ('at', 'IN'), ('Yahoo', 'NNP')]
Now you have a POS tag associated with each token.
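Named entities are usually proper nouns, so the POS tags already hint at the likely candidates. As a rough sketch (the list is copied from the output above, and filtering on the NNP/NNPS tags is only a simple heuristic for illustration, not how nltk performs NER), you can pull out the proper nouns like this:

```python
# POS-tagged tokens, copied from the output above
tokens_pos = [('John', 'NNP'), ('and', 'CC'), ('Emily', 'NNP'), ('are', 'VBP'),
              ('from', 'IN'), ('India', 'NNP'), (',', ','), ('they', 'PRP'),
              ('work', 'VBP'), ('at', 'IN'), ('Yahoo', 'NNP')]

# Keep only singular (NNP) and plural (NNPS) proper nouns
proper_nouns = [word for word, tag in tokens_pos if tag in ('NNP', 'NNPS')]
print(proper_nouns)
```

This tells you which tokens are candidates, but not whether each one is a person, a place or a company. That categorization is what the chunker adds.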
Next, to obtain the named entities of the text, nltk provides the ne_chunk method.
The code below demonstrates how to use nltk.ne_chunk on the above sentence.
# Importing ne_chunk from nltk and downloading requirements
import nltk
from nltk import ne_chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data] /root/nltk_data...
[nltk_data] Package averaged_perceptron_tagger is already up-to-
[nltk_data] date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data] /root/nltk_data...
[nltk_data] Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data] Package words is already up-to-date!
You can pass the tokens, along with their POS tags, as input to ne_chunk(), as shown in the code below.
# Print all the named entities
named_entity=ne_chunk(tokens_pos)
print(named_entity)
(S
  (PERSON John/NNP)
  and/CC
  (PERSON Emily/NNP)
  are/VBP
  from/IN
  (GPE India/NNP)
  ,/,
  they/PRP
  work/VBP
  at/IN
  (ORGANIZATION Yahoo/NNP))
In the above output, you can see that John and Emily have been categorized as ‘PERSON’ and Yahoo as ‘ORGANIZATION’, just as required.
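The result of ne_chunk is an nltk Tree, so you can also walk it in code instead of reading the printout. In the sketch below, extract_entities is a hypothetical helper of mine, and the tree is rebuilt by hand to mirror the output above so that the snippet runs without downloading the chunker models.

```python
from nltk import Tree


def extract_entities(tree):
    """Return (entity text, label) pairs for every labeled chunk in the tree."""
    entities = []
    for node in tree:
        # Named entities are nested Tree objects; plain tokens are (word, tag) tuples
        if isinstance(node, Tree):
            text = " ".join(word for word, tag in node.leaves())
            entities.append((text, node.label()))
    return entities


# Hand-built copy of the tree printed above (stand-in for named_entity)
chunked = Tree("S", [
    Tree("PERSON", [("John", "NNP")]), ("and", "CC"),
    Tree("PERSON", [("Emily", "NNP")]), ("are", "VBP"), ("from", "IN"),
    Tree("GPE", [("India", "NNP")]), (",", ","), ("they", "PRP"),
    ("work", "VBP"), ("at", "IN"),
    Tree("ORGANIZATION", [("Yahoo", "NNP")]),
])

print(extract_entities(chunked))
```

On real data you would pass the named_entity tree from above: extract_entities(named_entity).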
Now, what if you have a huge amount of data? It would be impractical to print the tree and check it by eye for names. For cases like this, using spacy can be easier.
NER with spacy
spacy is a highly flexible and advanced library, and named entity recognition with it is easy.
Every entity that spacy finds in a doc has a label_ attribute, which stores the category (label) of that entity.
A few common labels are:
- ‘ORG’ : companies, organizations, etc.
- ‘PERSON’ : names of people
- ‘GPE’ : countries, cities, states, etc.
- ‘PRODUCT’ : vehicles, food products and so on
- ‘LANGUAGE’ : names of languages
There are many other labels too.
Let me show you a simple example to understand how to access the entity labels through spacy.
# Creating a spacy doc of a sentence
sentence=' The building is located at London. It is the headquaters of Yahoo. John works there. He speaks English'
doc=nlp(sentence)
You can iterate through the entities present in the doc using .ents and print their labels through .label_.
# Print named entities
for entity in doc.ents:
    print(entity.text,'--',entity.label_)
London -- GPE
Yahoo -- ORG
John -- PERSON
English -- LANGUAGE
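When a text has many entities, it helps to group them by label. Below is a minimal sketch. entities_by_label is a hypothetical helper, and the demo doc is built by hand with its entity spans typed in to mirror the output above, so the snippet runs without a trained model.

```python
from collections import defaultdict

from spacy.tokens import Doc, Span
from spacy.vocab import Vocab


def entities_by_label(doc):
    """Map each entity label to the list of entity texts carrying it."""
    grouped = defaultdict(list)
    for entity in doc.ents:
        grouped[entity.label_].append(entity.text)
    return dict(grouped)


# Hand-built stand-in for the doc above (entities are typed in, not predicted)
words = ["The", "building", "is", "located", "at", "London", ".",
         "It", "is", "the", "headquarters", "of", "Yahoo", ".",
         "John", "works", "there", ".", "He", "speaks", "English"]
demo_doc = Doc(Vocab(), words=words)
demo_doc.ents = [
    Span(demo_doc, 5, 6, label="GPE"),
    Span(demo_doc, 12, 13, label="ORG"),
    Span(demo_doc, 14, 15, label="PERSON"),
    Span(demo_doc, 20, 21, label="LANGUAGE"),
]

print(entities_by_label(demo_doc))
```

With a real pipeline, you would pass the doc created above: entities_by_label(doc).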
You can see the output labels printed. spacy also provides a visualization for better understanding.
You can visualize the entities through displacy by setting style='ent'.
from spacy import displacy
displacy.render(doc,style='ent',jupyter=True)
Now that you have understood the basics of NER, let me show you how it is useful in real life.
Let us take the text of a collection of news headlines.
What if you want a list of all the names that have occurred in the news?
This is where spacy has the upper hand: you can check the entity category of a token through its .ent_type_ attribute.
The code below demonstrates how to get a list of all the names in the news.
news_articles='Restaurant inspections: Space Coast establishments fare well with no closures or vermin Despite strong talk Tallahassee has a lot more to do for Indian River Lagoon | Opinion Kelly Slater-coached surfer from Cocoa Beach wins national title in Texas Despite parachute mishap Boeing Starliner pad abort test a success Vote for FLORIDA TODAY Community Credit Union Athlete of the Week for Oct 28-Nov 2 Brevard diners rave about experiences at Flavor on the Space Coast restaurants Heres what you need to know about Tuesdays municipal elections in Brevard County Save the turtles rally planned to oppose Satellite Beach hotel-condominium project Elon Musk wants humans to colonize space Brad Pitt shows thats easier said than done Google honors actor Will Rogers with new Doodle SpaceX targeting faster deployment of Starlink internet-beaming satellites Rockledge Cocoa hold while Merritt Island gains traction in state football polls Boeing launches and lands Starliner capsule for pad abort test in New Mexico Eight of the 10 most dangerous metro areas for pedestrians are in Florida report says Brevard faith leaders demand apology in open letter to Bill Mick How a Delaware researcher discovered a songbird can predict hurricane seasons Open doors change lives: United Way of Brevard kicks off 2019 campaign Florida man pleads guilty to killing sawfish by removing extended nose with power saw One-woman comedy Excuse My French at Viera Studio dives into authors mind Letters and feedback: Sept 19 2019 Viera talk will highlight technology used for dementia patients Adamson Road open after dump truck crash spilled sand and fuel one injured From the mob to mob movies with Robert De Niro: Meet Larry Mazza former hitman Irishman consultant Tropical Storm Jerry getting stronger in Atlantic; Hurricane Humberto nearing Bermuda Palm Bay Halloween party homicide suspect turns self in police say Fall playoffs: Satellite XC boys win region; Titusville swimmers second Sounds of 
celebration during Puerto Rican Day Parade in Palm Bay Tractor trailer fire extinguished on I-95 in Cocoa lanes open Logan Harvey bowls a 279 to help Astronaut beat Space Coast in battle of undefeated teams Letters and feedback: Sept 18 2019 Accelerated bargaining a bust as teachers union walks out on second day of pay talks Traffic Alert: Puerto Rican Day Parade set for Palm Bay County commissioners debate biosolids at first of two public hearings US 1 widening in Rockledge under study along with sidewalks and bike lanes Letters and feedback: Nov 3 2019 Titusville man indicted on first-degree murder charge Rockledge Cocoa lead 8 Brevard HS football teams into playoffs Space Coast Junior Achievement accepting business hall of fame nominations Why Brevards proposed moratorium on sewage sludge isnt enough |Rangel Brevard high school football standings after Week 4 Starbucks might build new store at old Checkers site in West Melbourne on US 192 Police search for Palm Bay shooting suspect in Orlando area Does your drivers license have a gold star? Well you need one Cocoa man charged with arson in blaze at triplex where 16 children were celebrating a birthday Unit and lot owners may inspect their ledgers upon written request Palm Bay veteran officer Nelson Moya chosen as police chief By the numbers: Your link to every state salary and what high ranking Florida officials make Longtime state county legislative aide Furru honored by County Commission Week 11: Rockledge pulls out BBQ win; EG Palm Bay Astro win rivalry games County commissioners allocate $446 million to repair South Beaches after Hurricane Dorian Tour of Brevard: Holy Trinity Tigers County Commission begins shakeup of its bloated advisory boards while adding term limits Mystery Jesus plaque washes ashore near Melbourne Beach thanks to Hurricane Humberto waves Widow of combat veteran who died after fight at Brevard Jail appeals to governor Health calendar: June 27-July 4 Going surfing this weekend? 
Check our wave forecast Have you ever seen a turtle release? Heres your chance Darkness turns to day as SpaceX Falcon Heavy launches from Kennedy Space Center Losing hair in strange places? It could be Alopecia areata Golf tip: Pairing and tee times on the PGA Tour NOAA updates projections saying above normal season could see 5-9 hurricanes 2-4 of them major Cape Canaveral Hospital relocation: Loss of a landmark hopeful future for healthcare and land How do I get adult child to move out start own life? The Atlas V rockets jellyfish in the sky Scores: High school football Week 11 in Brevard County Florida gator caught on camera munching on plastic at St Marks National Wildlife Refuge Q&A: New Melbourne head football coach Tony Rowell Does the First Amendment go too far? | Opinion FHSAA says it has a contingency plan if football associations strike but wont reveal it Tropical Depression Ten forms in central Atlantic; Hurricane Humberto continues to strengthen Updates: Watch SpaceX launch its Falcon Heavy rocket from Kennedy Space Center in Florida Letters and feedback: May 9 2019 The Good Stuff: Soccer star makes Wheaties cover cookies in space and more Highly anticipated pad abort test of Boeing’s Starliner spacecraft is Monday morning Health calendar: Sept 19-26 Naked man running through Titusville Walmart parking lot arrested Lacking money and manpower West Melbourne considers swapping school SROs for armed guards Worst best airports in country: 3 Florida airports make Top 5 worst list Colorado STEM school shooting impacted my family: When will we say enough? | Bellaby Human ashes Legos and a sandwich: Strange things we sent into space Massive 2600-acre burn underway near Brevard County line Health Pro: Trip to Philippines validated nurses career decision Harry Anderson Palm Bays Sitting Man exposes gaps in helping the homeless Brevard School Board splits 4-1 in favor of district plan for teacher pay; union vows war Whos No 1? 
See the Brevard high school football rankings after Week 4 3 people taken into custody after multiple agencies tracked them at high speeds across Central Florida As dawn breaks over Florida ULA Atlas V rocket lights up morning sky Why breast cancer was called Nuns disease Space Coast-based Rocket Crafters partners with Swiss tech giant RUAG Investigation continues for Palm Bay double killing in 2018 Can’t breastfeed? Tips to provide same nutrition to baby Former Florida Tech golfer Iacobelli wins playoff in Michigan for third pro title Baseball bliss every 50 years: Let’s go Mets! Let’s go Nats! 5 tips to make your holiday travel stress-free Updates: ULA launches Atlas V rocket military satellite from Cape Canaveral Andrew Reed bowls 274 in Eau Gallie win over Satellite Space industry propels Brevard economy into optimistic future; highlights from EDC investors meeting Governor DeSantis signs bill geared toward expanding vocational training Outage leaves 12500 customers without power in Melbourne for nearly an hour Weak cold front brings fall weather after unusually warm October on Space Coast 321 Launch: The space news you might have missed Teacher pay: Union opens pay talks with raise demand supplement for veteran teachers Tour of Brevard: Bayside Bears Youth fishing tournament coming Saturday to Melbourne President Trump heres what Hurricane Michael damage still looks like today Letters and feedback: Aug 8 2019 Man leads police on 100 mph chase through Melbourne Palm Bay Puerto Rican parade to take place this weekend in Palm Bay Viera-based USSSA Pride depart from pro softball league Tip of organized theft ring targeting boats motors leads to arrest of 3 men in Melbourne Multiple crashes slow I-95 traffic near Melbourne Is it time to reconsider how free speech should be? 
| Bill Cotterell SPCA of Brevard takes influx of cats and dogs from Glades County Brevard residents share some of their favorite recent dining experiences Tuesdays brewery running tour stops at Florida Beer Co in Cape Canaveral 89-year-old Tallahassee Grandma Bunny battles 6-foot snake after it eats visiting birds The horrors of Dorian parallel another nearby tragedy of historic proportions EFSCs Carter Stewart named NJCAA Region 8 Pitcher of the Week Titusville teenager charged with murder of grandfather If Trump bans vape flavors Brevard shops say theyll be out of business Cabana: NASAs Artemis is space explorations next giant leap | Opinion Florida continues cutting phone cords BSOs Confessore starting 25th year as music director High school sports results: May 7 Libraries busy with summer reading programs Four hurricanes in six weeks? Remember 2004 the year of hurricanes Letters and feedback: Nov 2 2019 Letters and feedback: May 8 2019 As Brevard kids head back to school here are five things to make lunchtime happy healthy Palm Bay police investigate overnight shooting that left one dead Rent takes the stage at two playhouses Cinderella Brighton Beach Memoirs arrive on Brevard stages this weekend Junes Saharan dust setting up right conditions for red tide but still too soon to tell Wizarding World of Harry Potter: Dark Arts unleashed on Hogwarts Castle at Universal Orlando Health First restructuring: What we know (and what we dont) From Jack in the Box to In-N-Out Burger: 7 restaurant chains missing in Florida SpaceX anomaly: No further action needed on Crew Dragon explosion cleanup Vietnam War mural pits residents vs Florida community Matter settled unhappily British cruise line Marella to sail from Port Canaveral in 2021 Kids are at risk as religious exemptions to vaccines spike in Florida Brevard County crime rate falls in 2018 reflecting statewide trend according to FDLE report Florida-Georgia FSU-Miami: Why Saturday afternoon is a college football party in 
Florida Restaurants in Viera Suntree Port Canaveral Palm Bay Melbourne Beach draw Flavor raves Vote for the FLORIDA TODAY Community Credit Union Athlete of the Week for Sept 9-14 Whats happening May 15-21: Concert in Cocoa summer camps fun fair in Melbourne Port Authority Secretary/Treasurer Harvey seeks more public input at evening meetings SpaceX Falcon Heavy rocket launch: Where to watch from Space Coast Treasure Coast Volusia County Tallahassee Here are 10 of Arizonas most photogenic spots Some of them might surprise you Rockledge police search for suspects after armed robbery Kardashian production features Palm Bay homicide involving sisters-in-law FLORIDA TODAY journalist The holidays are almost here; Space Coast cooks are prepping special meals now How to watch tonights SpaceX Falcon Heavy launch from Kennedy Space Center in Florida New state budget spending wish list includes bug studies lionfish contests stronger coastlines Wheaties: Team USA soccer star Ashlyn Harris make cereal cover and it costs $23 a box Letter carriers Stamp Out Hunger food drive is Saturday May 11 Suzy Fleming Leonard: Fourth-graders talk about journalism share favorite restaurants Falcon Heavys night launch may be seen across portion of Sunshine State Police: Palm Bay teacher jerked autistic student from bus faces abuse charge Kicking axe: Urban axe throwing makes its debut in Brevard as trending sport stress reliever Palm Bay official Andy Anderson hired as Mexico Beach city administrator on Florida Panhandle Daddy Duty: Are we celebrating Mothers Day the right way? Father of 6 dies trying to save sons from rip current in Cocoa Beach Its time we address lack of affordable housing on Space Coast | Opinion Orlando Pride soccer player from Brevard diagnosed with breast cancer Space Coast bus system: Lets peck away at this problem County Commission chair says Do lovebugs have a purpose beyond annoying us all twice a year? 
Hurricane Humberto continues to strengthen swells expected to affect East-Central Florida surf One more launch this week; Things to know about ULA Atlas V launch from Cape Canaveral Free self-defense class for women this Friday in Cocoa Beach The farce behind blaming video games for El Paso Dayton mass shootings | Rangel Teacher pay talks take new shape as school district and union vow to work together First-place USSSA Pride on hot streak; Bennett on ESPNs Top Plays A legacy of giving: When it comes to giving back Judy and Bryan Roub lead by example Staph infection can come from ocean water beach sand; Florida man gets it from sandbar Its Launch Day! Things to know about SpaceX Falcon Heavy launch from Kennedy Space Center Viera 6 more Brevard HS softball teams enter playoffs Enjoy Jack Daniels tastings in Cocoa Village and Melbourne this month Brevard County location comes in at No 7 on list of US cities that get the most sun Letters and feedback: Oct 31 2019 Vietnam veteran receives settlement from Veterans Affairs after waiting 20 years Brevard restaurants with outdoor seating wage war against lovebugs Aldi adds beer sparkling wine and dog treats to its collection of Advent calendars If millennials choose socialism fine Just dont make this mistake Gardening: Adding a tree to your Florida landscape? Southern redcedar is a great option Big Brevard high school football rivalries vary in postseason implications 6 hurricanes and 12 named storms predicted for remainder of 2019 hurricane season Cocoa man suspected of firing shots with unwitting mother in tow faces multiple charges Dogs on the beach? 
County Commission rejects proposal to allow dogs on South Brevard beaches Letters and feedback: Sept 15 2019 Despite weather odds SpaceX launches Falcon 9 rocket from Cape Canaveral Cyberstalking of alleged battery victim lands Rockledge man back in jail Winds are coming back again but inshore fishing is excellent for redfish and trout Exclusive: Trump administration sets plans for 2019 hurricane season after wakeup call of recent disasters Melting Pot: Intimacy of a shared meal makes this Viera spot perfect for date night The Chefs Table: Fine meat seafood draw diners to this Suntree favorite Holmes nurse arrested accused of sexual battery on a patient Tour of Brevard: Satellite Scorpions Djons Steak & Lobster House: Discover old-Florida elegance in Melbourne Beach restaurant Judicial Nominating Commission chair resigns claims governors office interfered with independence Medical marijuana firms seek to open more storefronts in Florida Brevard School votes 4-1 to support superintendents plan for teacher raises; audience not happy Tracking Brevards stars: Arena football champ UCF star PBA rookie lead the way Brevard school district miscalculated attrition savings finds $15 million for teacher pay Brevard high school cross country honor roll Oct 31 NASA Kennedy Space Center team wins Emmy for coverage of SpaceX mission Golf tip: Making adjustments to correct your swing Cruise ship rescues and mishaps: 6 times emergency struck on Royal Caribbean Carnival more Updates: SpaceX launches Falcon 9 from Cape Canaveral Air Force Station Secular group challenges In God We Trust motto on Brevard sheriffs patrol vehicles Chemicals in sunscreen seep into your bloodstream after just one day FDA says Cottage Rose mural dispute resolved in Indialantic after owner applies for permit Humberto strengthens into a hurricane as it moves away from Space Coast; NHC watching 2 more Planting herbs to cutting back on water heres what to do in your Brevard yard this month Viera Suntree Little 
League team heads to Michigan for Junior World Series We must prevent human trafficking through education | Opinion Please dont eat the deer Family spots Florida panther wandering around Estero backyard Man found hanging onto side of kayak in Atlantic Ocean off Patrick Air Force Base SpaceX set to launch Falcon Heavy on mission with 24 payloads –\xa0including human ashes Erin Baird is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards ZZ Top Steve Martin and Martin Short coming to King Center Needy Brevard students sleep easier on new beds donated from Ashley HomeStore These fitness trends are worth sticking with over time Florida saw fewer infections from Vibrio vulnificus (aka flesh-eating bacteria) in 2018 Health calendar: April 25-May 2 Young actresses shine in Matilda at Titusville Playhouse Freshman quarterback Gabriel leads UCF over Stanford in Orlando John Daly is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Health Pro: Doctors interest grew out of childhood health issues More strong storms move eastward over Brevard; funnel clouds possible in Viera Nurse at Holmes Regional Medical Center arrested on suspicion of sexual battery Strong isolated showers heat impacting Brevard County Brevard firefighters widow wins $9 million verdict in Dominos Pizza delivery death Housing providers may be prohibited from rejecting renters on basis of arrest record 10-foot great white shark Miss May pings off Satellite Beach Palm Bay man 22 dies in overnight crash on Floridas Turnpike Tamara Carroll is 2019 Volunteer of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Tropical Storm Humberto not expected to affect Space Coast; NHC watching two more disturbances Theater veteran Blackledge passes away unexpectedly Ten dead issues from the 2019 legislative session Letters and feedback: June 23 2019 Rockledge Cocoa win big as Space Coast takes Holy Trinity in OT Surfers can enjoy chest-high waves 
most of weekend in Brevard Plan ahead for Aug 14-20: Space Coast Symphony performs Mikado Chef Larry tweaks hours Daylight saving time ends on Nov 3 Here are 7 facts to know about the time change Two new Royal Caribbean ships including second largest in world move to Port Canaveral Fla Sen Debbie Mayfield files bill to define e-cigarettes liquid nicotine as tobacco products Escape the hustle and bustle for a western getaway Missing Palm Bay teen found in South Carolina; mother arrested police say Impossible burger Impossible Cottage Pie make 2019 Epcot Food and Wine Festival menu Got bats? Why you need to get rid of them by Tax Day 2 charged with attempted murder; Titusville police say they pistol-whipped waterboarded force-fed him dog food Bicyclist dies in hit-and-run crash on US 1 near Cocoa Group: St Johns River pollution threatens water supplies Rockledge man charged after hoax to shoot up Merritt Island Walmart Man awarded $80M in lawsuit claiming Monsantos Roundup causes cancer Astronauts Alisa Rendina won FLORIDA TODAY Community Credit Union Athlete of the Week vote Alligators and airboats: How to explore Everglades National Park in one day Longtime educator church mother and humanitarian Johnnie Mae Riley remembered in Palm Bay George Clooney astronauts to attend black-tie event at Kennedy Space Center High school sports results: March 27 Bill Nye visits Cocoa Beach ahead of SpaceX Falcon Heavy launch from Kennedy Space Center Health First to relocate Cape Canaveral Hospital to Merritt Island open new facilities Letters and feedback: Aug 7 2019 Duran to host top area amateur golfers this weekend Buckaroo Ball benefits Harmony Farms Get ready to rumble: Atlas V blasting off early Thursday morning War on lovebugs: Social media reactions tweets memes about pesky bugs cars windshields Master association that contains condominiums may be able to collect capital contribution The new LEGO Movie World opens at LEGOLAND River Rocks: Enjoy fresh seafood and river 
views at this Rockledge tradition March 27: Panthers Walk It Off 10 takeaways from the presidents campaign kickoff visit to Orlando Cocoa man arrested on manslaughter charges after Sharpes shooting Corrections & Clarifications Accused burglars hit five south Brevard stores before Melbourne officer shot one of them Brevard family to receive Filipino World War II veterans Congressional Gold Medal NASA: Want to go to the moon in five years? We need more money How to REALLY clean lovebugs off your windshield car house Top party schools: Florida Florida State make it to top of Princeton Reviews 2020 list Live scores: High school football Week 4 in Brevard County Restaurant inspections: Happy Kitchen in West Melbourne Divine Grace in Palm Bay closed after inspections School district sticks to $770 raise offer for teachers; special magistrate getting involved Candidate lineup is set for 2019 municipal special district elections in Brevard Death investigation underway after mans body found in Titusville ditch November hurricane forecast: Season not technically over but its over | WeatherTiger Tyndall Air Force Bases future remains uncertain after devastation from Hurricane Michael Whats behind a low performing school? Crummy parents | Opinion Well known Chevrolet dealership has The Right Stuff Lovebugs are back Many hate them Experts say theyre here for good The lax disciplinary policies that caused\xa0Parkland massacre may have spread to your school Port Canaveral commissioners tackle parking issues: 10 things to know Melbourne woman injured in crash dies from her injuries; police looking for witnesses Senate President Galvano calls for Florida lawmakers to focus on mass shootings Governor DeSantis approves $91 billion budget slashes $133 million in local projects Letters and feedback: March 28 2019 Vote for 321prepscom Community Credit Union Athlete of the Week for April 29-May 4 South Patrick Shores approved for federal cleanup program Island H2O Live! 
water park in Kissimmee opens: Make the plunge or chill in lazy river Travis Proctor is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Vietnam War 50th anniversary to be remembered at Cape Canaveral National Cemetery Its launch day! Things to know about SpaceX Falcon 9 launch from Cape Canaveral NASA gave Artemis contractors bonuses despite delays and cost overruns GAO says Michael Bloom is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Health Pro: Eye doctor honed skills on aircraft carrier Brevard pitchers fare well in state horseshoe event The real reason NASAs all-female spacewalk is not happening Hint: Its not discrimination Port Canaveral aquarium the best use of tourism dollars | Opinion SpaceX Falcon Heavy will carry worlds first green rocket fuel mission School District union at odds on whats actually available for teacher raises | Opinion Nuclear power is too costly and too risky Jarvis Wash is 2019 Citizen of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards SpaceX Dragon installed at ISS after launch from Cape Canaveral Rare phenomenon: Hail damage from unusual thunderstorm has Brevard residents talking Signs you live on the Space Coast Titusville couple charged with neglect after infant found malnourished Health coaches a personal way to help you meet goals New Viera elementary school to break ground in fall fast-tracked to open August 2020 Phase out nuclear power? Elizabeth Warren and Bernie Sanders locked in absurd arms race Strange days: Space Coast gets freak hail flaming meteor mysterious rumble red tide over 6 months Four women arrested on prostitution charges in ongoing Melbourne enforcement detail Whose name is on deed - partner or spouse? 
It matters a lot Florida should steer clear of drug importation from other countries | Opinion Melbourne police investigate Palm Bay Road shooting Brevard bowlers head to state tournament off district wins Florida Historical Societys May 16-18 conference: Countdown to History: Ice Age to the Space Age Turning the Toxic Tide: Florida must address the problem of human waste Contributions pour in to group seeking to ban possession sale of assault weapons in Florida Why Disney World sweets treats desserts are SO worthy of Instagram Satellite Beach girl 10 leads charge for organ donation Poliakoff: Community development district offers different advantages and disadvantages to an HOA Florida Tech in Melbourne hosts International Dinner with food from South Latin America From discovering water to snapping selfies: The lasting memories of Mars rover Opportunity Charges dropped against Texas man after comments against Cocoa Beach hotel werent direct specific threat Brevard Federation of Teachers just cant say yes to raise offers | Opinion How Mars Opportunity rover ranks in list of more than 1000 unmanned space missions by NASA since 1958 321 Flavor members rave about Fresh Scratch Bistro Sushi Factory and Beachside Seafood Texas man recovering after shark bite in Florida Buckeye the manatee rescued again by SeaWorld Rain hail pelts north central Brevard County causing widespread damage Regulators approve merger of Harris L3; deal to close June 29 Teachers approve largest pay raise in years but will it keep them in Brevard? 
Restaurant review: Continental Flambe in Melbourne shows great promise Weak cool front could bring torrential rain damaging winds to Space Coast Officials: Removing Beachline earthen causeways would improve Banana River water quality Buzz Aldrins outfit is everything in photo of last remaining Apollo astronauts Voting Republican only helps if youre rich | Opinion Homelessness: A national problem with a local solution EPA plans to regulate cancer-causing chemicals found in Americas drinking water UCF has chance to wow recruits playoff voters with a win over Stanford Ping-pong-sized hail pounds Space Coast as severe thunderstorms move through area Harris CEO optimistic about merger finances as deal with L3 nears completion March 26: EFSC’s Carter Stewart named NJCAA Region VIII Pitcher of the Week Knife-wielding suspect leads deputies officers on foot chase in Melbourne 4 Brevard winners in girls regional basketball Thursday March 26: Women’s Golf Finishes Runner-Up at Barry Invite Some seek healing remembrance at arrival of Vietnam War replica wall High school sports results: March 26 Following surge in discipline problems Brevard students ask lawmakers to address vaping in schools Titusville man 21 pronounced dead after trying to make a U-turn on South Street Donald Trumps criminal justice reform shows results | Opinion Florida cold case solved 22 years later after Google Earth satellite image shows missing mans car Health calendar: Aug 8-15 Tracking Brevards stars: Harris to World Cup; Dawson in golf hunt; Allen Taylor hitting 300 Satellite Beach hotel condo towers homes slated for 27 acres by Hightower Beach park Letters and feedback: Sept 14 2019 Mike Martin Jr to be named Florida States new baseball coach Golf tip: Hit down on hybrids similar to how you hit an iron Stormy weather moves across Space Coast bringing lightning heavy rain Commission OKs pay raises for 625 county employees to address disparities turnover High school sports results: Feb 14 weVENTURE 
is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Man taken to hospital after domestic incident in Titusville Brevard school district considers selling $800000 in unused land Weather looks OK for SpaceX Falcon Heavy launch from Kennedy Space Center 321 Flavor: Margarita tops the list of Brevards favorite cocktails UCF experiment and Citronaut fly to space on Blue Origins New Shepard Bridges sex assault: Victim’s family sues nonprofit caregiver over pregnancy Sheriff Ivey and Brevard emergency officials send mixed messages during Dorian New proposal on dog cat sales at pet stores is less restrictive than previous one Space Coast Honor Flight is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards It goes by so quickly: Brevard teachers back to work as days count down to school County commissioners reject lagoon funding plan saying they want less money for muck projects Yuck! Hypodermic needles stabbing employees inside Brevard recycling plant Hard-line anti-illegal immigration group works to sway sanctuary cities bill Titusville High School principal retires amid accusations of hostile environment Pol Pot’s deputy Nuon Chea architect of bloody Khmer Rouge policy dead at 93 in Cambodia NASAs new goal: Land astronauts on the moon by 2024 Going surfing this weekend? 
Check our forecast Neighbor Up is 2019 Organization of the Year finalist in FLORIDA TODAY Volunteer Recognition Awards Severe storms bring wind lightning and possible hail across Brevard Starbucks in Indialantic West Melbourne may move to larger stores with drive-thru windows SpaceX Falcon Heavy: Why everyone should watch this night launch from Kennedy Space Center Fishing report: Calm seas and good fishing ahead Two World Series commercials feature Brevard celebrities Letters and feedback: May 5 2019 In latest motion US Attorney confirms multiple investigations in Tallahassee Lane 70 softball team wins spring tournament Palm Bay police double down on nightclub safety following four violent episodes Relativity Space signs deal to deliver small satellites to more orbits HBO documentary new surf movie spotlight surf legends Kelly Slater Hobgood twins Plan ahead for April 3-10: Music in Melbourne adult prom in Titusville; art in Eau Gallie Family grieves woman fatally struck by car near Lous Blues bar in Melbourne At Cape Canaveral SpaceX kicks off Star Wars Day with smoke and fire Updates: SpaceX launches Falcon 9 from Cape Canaveral to ISS sticks drone ship landing Letters and feedback: March 27 2019 Florida Tech to host event celebrating Apollo 11 anniversary and future of US space program Boeing on Starliner orbital flight test: We really are close Florida Tech spring football game set for Saturday Feb 14: EFSC women’s golfer Kacey Walker nominated for 2019 Arnold Palmer Cup Four Panthers Selected to All-SSC Teams Heavy winds and lightning to hit north Brevard County How Ron DeSantis is shaking up the establishment | Opinion High school sports results: May 3 Palm Bay man who asked woman for permission to have illicit sexual encounter with child sentenced Letters and feedback: June 22 2019 Air Force pumped for two launches less than 48 hours apart Heres why NASA scrubbed all-female spacewalk Plan ahead for Nov 6-12: Holiday bazaar in Suntree pet events art openings 
and food fun Edgewoods Jon Wang shoots one under par for second time this week Burglar steals bolted safe from Melbourne bar in latest smash-and-grab Feb 14: Florida Tech Suffers Last-Second Heartbreaker Against Saint Leo Daddy Duty: My school cheating scandal is an example for my 4-year-old Tropical Storm Humberto forms but isnt expected to affect Space Coast Air Force weather squadron key player in ensuring successful rocket launches Ronsisvalle: Getting through to the disconnected husband How to watch SpaceX launch its Falcon 9 rocket from Cape Canaveral Air Force Station Members of Brevards Bahai Faith community come together to commemorate twin holy days A&Es Tiny House Nation to feature Melbourne tiny home builder Movable Roots Lawmakers grapple with felons voting rights All-female spacewalk canceled due to lack of medium-sized spacesuits Twitter reacts Medical marijuana and 3 other school district policy changes you should know about Felons rights bill heads to Gov DeSantis Lanes open delays persist after Crash on I-95 near Wickham Road Former UCF chair resigns day after Rep Randy Fine proposes UC'
Create a spaCy doc from news_articles. Iterate through every token and check whether token.ent_type_ is 'PERSON'. If it is, append the token's text to the list.
# creating spacy doc
news_doc = nlp(news_articles)
# creating an empty list to store people's names
list_of_people = []
# iterating through every token in doc
for token in news_doc:
    # checking if category matches
    if token.ent_type_ == 'PERSON':
        # appending to list if PERSON
        list_of_people.append(token.text)
print(list_of_people)
['Kelly', 'Slater', 'Elon', 'Musk', 'Brad', 'Pitt', 'Will', 'Rogers', 'Bill', 'Mick', 'United', 'Way', 'Viera', 'Studio', 'Viera', 'Robert', 'De', 'Niro', 'Larry', 'Mazza', 'hitman', 'Tropical', 'Storm', 'Jerry', 'Logan', 'Harvey', 'Astronaut', 'beat', 'Space', 'Coast', 'Titusville', 'Nelson', 'Moya', 'Jesus', 'Alopecia', 'Atlas', 'New', 'Melbourne', 'Tony', 'Rowell', 'Titusville', 'Walmart', 'Harry', 'Anderson', 'Palm', 'Bays', 'Iacobelli', 'Atlas', 'Andrew', 'Reed', 'Eau', 'Gallie', 'DeSantis', 'Trump', 'Bill', 'Cotterell', 'Bunny', 'Carter', 'Stewart', 'Brevard', 'Cinderella', 'Brighton', 'Jack', 'Box', 'Marella', 'Treasurer', 'Harvey', 'Ashlyn', 'Harris', 'Suzy', 'Fleming', 'Falcon', 'Heavys', 'Andy', 'Anderson', 'Brevard', 'Lets', '|', 'Rangel', 'Teacher', 'Bennett', 'Judy', 'Bryan', 'Roub', 'Staph', 'Viera', 'Enjoy', 'Jack', 'Daniels', 'Aldi', 'Melting', 'Pot', ':', 'Viera', 'Holmes', 'Estero', 'Erin', 'Baird', 'Steve', 'Martin', 'Martin', 'Short', 'Ashley', 'HomeStore', 'Vibrio', 'Gabriel', 'John', 'Daly', 'Aug', 'Mikado', 'Chef', 'Larry', 'Titusville', 'Alisa', 'Rendina', 'Bill', 'Nye', 'Atlas', 'Brevard', 'Melbourne', 'Galvano', 'DeSantis', 'Travis', 'Proctor', 'Michael', 'Bloom', 'Jarvis', 'Wash', 'New', 'Viera', 'Elizabeth', 'Warren', 'Bernie', 'Sanders', 'Melbourne', 'Florida', 'Tech', 'Harris', 'L3', 'Continental', 'Flambe', 'Removing', 'Beachline', 'Buzz', 'Aldrins', 'Stanford', 'Ping', '-', 'pong', 'Harris', 'Carter', 'Stewart', 'Golf', 'Finishes', 'Runner', 'Titusville', 'Harris', 'Dawson', 'Allen', 'Taylor', 'Mike', 'Martin', 'Jr', 'Margarita', 'Yuck', 'Pol', 'Pot', '’s', 'Nuon', 'Chea', 'Kelly', 'Slater', 'Eau', 'Gallie', 'Family', 'Kacey', 'Walker', 'Ron', 'DeSantis', 'Edgewoods', 'Jon', 'Wang', 'Crash']
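The output contains the names of people that appeared in the news, split into individual tokens. It also includes some false positives, such as 'Titusville', 'Walmart' and '|', because the model's entity predictions are not perfect.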
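Because the loop above works token by token, multi-word names come out as separate pieces ('Kelly', 'Slater'). spaCy also exposes whole entity spans through doc.ents, each with a .text and a .label_, so full names can be collected and counted instead. The following is a minimal sketch of that idea, not the article's original method. The helper name people_from_entities and the sample pairs are hypothetical and hand-written for illustration; with a real doc you would build the pairs as [(ent.text, ent.label_) for ent in news_doc.ents].

```python
from collections import Counter

def people_from_entities(entities):
    """Count full person names from (text, label) pairs.

    `entities` is any iterable of (entity_text, entity_label) pairs,
    e.g. [(ent.text, ent.label_) for ent in news_doc.ents].
    """
    return Counter(text for text, label in entities if label == 'PERSON')

# Hand-written sample pairs for illustration only (not real model output)
sample_entities = [
    ('Kelly Slater', 'PERSON'),
    ('Florida', 'GPE'),
    ('Elon Musk', 'PERSON'),
    ('Kelly Slater', 'PERSON'),
]

people = people_from_entities(sample_entities)
print(people.most_common())
# [('Kelly Slater', 2), ('Elon Musk', 1)]
```

Counting at the entity level keeps each name intact and shows how often a person is mentioned. It does not remove the model's false positives; those still need filtering or manual review.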
Also, in many cases you do not want the actual details to be published. So, what if you want to replace every person’s name, organization or place with ‘UNKNOWN’ to preserve privacy?
The code below demonstrates how to do this.
# Function to identify if tokens are named entities and replace them with UNKNOWN
def remove_details(word):
    if word.ent_type_ in ('PERSON', 'ORG', 'GPE'):
        return ' UNKNOWN '
    # text_with_ws keeps the trailing whitespace (Token.string was removed in spaCy v3)
    return word.text_with_ws

# Function where each token of spacy doc is passed through remove_details()
def update_article(doc):
    # merging every multi-token entity into a single token
    # (Span.merge() was removed in spaCy v3; the retokenizer replaces it)
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            retokenizer.merge(ent)
    # Passing each token through remove_details() function.
    tokens = map(remove_details, doc)
    return ''.join(tokens)

# Passing our news_doc to the function update_article()
update_article(news_doc)
'Restaurant inspections: Space Coast establishments fare well with no closures or vermin Despite strong talk UNKNOWN has a lot more to do for Indian River Lagoon | Opinion UNKNOWN -coached surfer from UNKNOWN wins national title in UNKNOWN Despite parachute mishap UNKNOWN Starliner pad abort test a success Vote for UNKNOWN of the Week for Oct 28-Nov 2 Brevard diners rave about experiences at UNKNOWN what you need to know about Tuesdays municipal elections in UNKNOWN Save the turtles rally planned to oppose UNKNOWN hotel-condominium project UNKNOWN wants humans to colonize space UNKNOWN shows thats easier said than done UNKNOWN honors actor UNKNOWN with new Doodle SpaceX targeting faster deployment of Starlink internet-beaming satellites Rockledge Cocoa hold while UNKNOWN gains traction in state football polls UNKNOWN launches and lands Starliner capsule for pad abort test in UNKNOWN Eight of the 10 most dangerous metro areas for pedestrians are in UNKNOWN report says Brevard faith leaders demand apology in open letter to UNKNOWN How a UNKNOWN researcher discovered a songbird can predict hurricane seasons Open doors change lives: UNKNOWN of UNKNOWN kicks off 2019 campaign UNKNOWN man pleads guilty to killing sawfish by removing extended nose with power saw One-woman comedy Excuse My French at UNKNOWN dives into authors mind Letters and feedback: Sept 19 2019 UNKNOWN talk will highlight technology used for dementia patients Adamson Road open after dump truck crash spilled sand and fuel one injured From the mob to mob movies with UNKNOWN : Meet UNKNOWN former UNKNOWN Irishman consultant UNKNOWN getting stronger in Atlantic; Hurricane Humberto nearing Bermuda Palm Bay Halloween party homicide suspect turns self in police say Fall playoffs: Satellite XC boys win region; UNKNOWN swimmers second Sounds of celebration during Puerto Rican Day Parade in Palm Bay Tractor trailer fire extinguished on I-95 in UNKNOWN lanes open UNKNOWN bowls a 279 to help UNKNOWN in battle of 
undefeated teams Letters and feedback: Sept 18 2019 Accelerated bargaining a bust as teachers union walks out on second day of pay talks Traffic Alert: Puerto Rican Day Parade set for Palm Bay County commissioners debate biosolids at first of two public hearings UNKNOWN widening in UNKNOWN under study along with sidewalks and bike lanes Letters and feedback: Nov 3 2019 UNKNOWN man indicted on first-degree murder charge Rockledge Cocoa lead 8 Brevard HS football teams into UNKNOWN accepting business hall of fame nominations Why Brevards proposed moratorium on sewage sludge isnt enough |Rangel Brevard high school football standings after Week 4 Starbucks might build new store at old UNKNOWN site in UNKNOWN on UNKNOWN 192 Police search for Palm Bay shooting suspect in UNKNOWN area Does your drivers license have a gold star? Well you need one UNKNOWN man charged with arson in blaze at triplex where 16 children were celebrating a birthday Unit and lot owners may inspect their ledgers upon written request Palm Bay veteran officer UNKNOWN chosen as police chief By the numbers: Your link to every state salary and what high ranking UNKNOWN officials make Longtime state county legislative aide UNKNOWN honored by UNKNOWN Week 11: Rockledge pulls out UNKNOWN win; EG Palm Bay Astro win rivalry games County commissioners allocate $446 million to repair South Beaches after Hurricane Dorian Tour of Brevard: Holy Trinity Tigers County Commission begins shakeup of its bloated advisory boards while adding term limits Mystery UNKNOWN plaque washes ashore near UNKNOWN thanks to Hurricane Humberto waves Widow of combat veteran who died after fight at UNKNOWN appeals to governor Health calendar: June 27-July 4 Going surfing this weekend? Check our wave forecast Have you ever seen a turtle release? Heres your chance Darkness turns to day as SpaceX UNKNOWN Heavy launches from Kennedy Space Center Losing hair in strange places? 
It could be UNKNOWN areata Golf tip: Pairing and tee times on UNKNOWN NOAA updates projections saying above normal season could see 5-9 hurricanes 2-4 of them major UNKNOWN relocation: Loss of a landmark hopeful future for healthcare and land How do I get adult child to move out start own life? The UNKNOWN V rockets jellyfish in the sky Scores: High school football Week 11 in UNKNOWN UNKNOWN gator caught on camera munching on plastic at UNKNOWN : UNKNOWN head football coach UNKNOWN Does the First Amendment go too far? | Opinion FHSAA says it has a contingency plan if football associations strike but wont reveal it Tropical Depression Ten forms in central Atlantic; Hurricane Humberto continues to strengthen Updates: Watch SpaceX launch its UNKNOWN Heavy rocket from Kennedy Space Center in UNKNOWN Letters and feedback: May 9 2019 The Good Stuff: Soccer star makes UNKNOWN cover cookies in space and more Highly anticipated pad abort test of UNKNOWN ’s Starliner spacecraft is Monday morning Health calendar: Sept 19-26 Naked man running through UNKNOWN parking lot arrested Lacking money and manpower UNKNOWN considers swapping school SROs for armed guards Worst best airports in country: 3 UNKNOWN airports make Top 5 worst list UNKNOWN STEM school shooting impacted my family: When will we say enough? | Bellaby Human ashes Legos and a sandwich: Strange things we sent into space Massive 2600-acre burn underway near UNKNOWN line Health Pro: Trip to UNKNOWN validated nurses career decision UNKNOWN Sitting Man exposes gaps in helping the homeless UNKNOWN splits 4-1 in favor of district plan for teacher pay; union vows war Whos No 1? 
See the UNKNOWN high school football rankings after Week 4 3 people taken into custody after multiple agencies tracked them at high speeds across Central Florida As dawn breaks over Florida ULA Atlas V rocket lights up morning sky Why breast cancer was called UNKNOWN disease Space Coast-based UNKNOWN partners with Swiss tech giant UNKNOWN continues for Palm Bay double killing in 2018 Can’t breastfeed? Tips to provide same nutrition to baby Former UNKNOWN Tech golfer UNKNOWN wins playoff in UNKNOWN for third pro title Baseball bliss every 50 years: Let’s go Mets! Let’s go UNKNOWN ! 5 tips to make your holiday travel stress-free Updates: UNKNOWN launches UNKNOWN V rocket military satellite from Cape Canaveral UNKNOWN bowls 274 in UNKNOWN win over UNKNOWN industry propels Brevard economy into optimistic future; highlights from UNKNOWN investors meeting Governor UNKNOWN signs bill geared toward expanding vocational training UNKNOWN leaves 12500 customers without power in UNKNOWN for nearly an hour Weak cold front brings fall weather after unusually warm October on Space Coast 321 Launch: The space news you might have missed Teacher pay: Union opens pay talks with raise demand supplement for veteran teachers UNKNOWN Bayside Bears Youth fishing tournament coming Saturday to UNKNOWN President UNKNOWN heres what UNKNOWN damage still looks like today Letters and feedback: Aug 8 2019 Man leads police on 100 mph chase through UNKNOWN UNKNOWN parade to take place this weekend in Palm Bay Viera-based UNKNOWN Pride depart from pro softball league Tip of organized theft ring targeting boats motors leads to arrest of 3 men in UNKNOWN Multiple crashes slow I-95 traffic near UNKNOWN Is it time to reconsider how free speech should be? 
| UNKNOWN SPCA of Brevard takes influx of cats and dogs from UNKNOWN residents share some of their favorite recent dining experiences Tuesdays brewery running tour stops at UNKNOWN in UNKNOWN 89-year-old UNKNOWN Grandma UNKNOWN battles 6-foot snake after it eats visiting birds The horrors of Dorian parallel another nearby tragedy of historic proportions EFSCs UNKNOWN named UNKNOWN Region 8 Pitcher of the Week Titusville teenager charged with murder of grandfather If Trump bans vape flavors Brevard shops say theyll be out of business UNKNOWN : NASAs Artemis is space explorations next giant leap UNKNOWN UNKNOWN continues cutting phone cords BSOs Confessore starting 25th year as music director High school sports results: May 7 Libraries busy with summer reading programs Four hurricanes in six weeks? Remember 2004 the year of hurricanes Letters and feedback: Nov 2 2019 Letters and feedback: May 8 2019 As UNKNOWN kids head back to school here are five things to make lunchtime happy healthy Palm Bay police investigate overnight shooting that left one dead Rent takes the stage at two playhouses UNKNOWN Beach Memoirs arrive on UNKNOWN stages this weekend Junes Saharan dust setting up right conditions for red tide but still too soon to tell UNKNOWN of Harry Potter: Dark Arts unleashed on UNKNOWN at UNKNOWN restructuring: What we know (and what we dont) From UNKNOWN in the UNKNOWN to In-N-Out Burger: 7 restaurant chains missing in UNKNOWN SpaceX anomaly: No further action needed on Crew Dragon explosion cleanup Vietnam War mural pits residents vs UNKNOWN community Matter settled unhappily British cruise line UNKNOWN to sail from UNKNOWN in 2021 Kids are at risk as religious exemptions to vaccines spike in UNKNOWN crime rate falls in 2018 reflecting statewide trend according to UNKNOWN report UNKNOWN - UNKNOWN FSU-Miami: Why Saturday afternoon is a college football party in UNKNOWN in UNKNOWN Suntree Port Canaveral Palm Bay UNKNOWN draw UNKNOWN raves Vote for UNKNOWN of the 
Week for Sept 9-14 Whats happening May 15-21: Concert in UNKNOWN summer camps fun fair in UNKNOWN Secretary/ UNKNOWN seeks more public input at evening meetings SpaceX UNKNOWN Heavy rocket launch: Where to watch from Space Coast Treasure Coast Volusia County UNKNOWN Here are 10 of UNKNOWN most photogenic spots Some of them might surprise you Rockledge police search for suspects after armed robbery Kardashian production features Palm Bay homicide involving sisters-in-law UNKNOWN TODAY journalist The holidays are almost here; Space Coast cooks are prepping special meals now How to watch tonights SpaceX UNKNOWN Heavy launch from UNKNOWN in UNKNOWN New state budget spending wish list includes bug studies lionfish contests stronger coastlines Wheaties: UNKNOWN soccer star UNKNOWN make cereal cover and it costs $23 a box Letter carriers Stamp Out Hunger food drive is Saturday May 11 UNKNOWN Leonard: Fourth-graders talk about journalism share favorite restaurants UNKNOWN night launch may be seen across portion of UNKNOWN : Palm Bay teacher jerked autistic student from bus faces abuse charge Kicking axe: Urban axe throwing makes its debut in UNKNOWN as trending sport stress reliever Palm Bay official UNKNOWN hired as UNKNOWN city administrator on UNKNOWN Panhandle Daddy Duty: Are we celebrating Mothers Day the right way? Father of 6 dies trying to save sons from rip current in UNKNOWN Its time we address lack of affordable housing on Space Coast | Opinion UNKNOWN Pride soccer player from UNKNOWN diagnosed with breast cancer Space Coast bus system: UNKNOWN peck away at this problem UNKNOWN chair says Do lovebugs have a purpose beyond annoying us all twice a year? 
Hurricane Humberto continues to strengthen swells expected to affect East-Central Florida surf One more launch this week; Things to know about ULA Atlas V launch from UNKNOWN Free self-defense class for women this Friday in UNKNOWN The farce behind blaming video games for UNKNOWN mass shootings UNKNOWN pay talks take new shape as school district and union vow to work together First-place USSSA Pride on hot streak; UNKNOWN on ESPNs Top Plays A legacy of giving: When it comes to giving back UNKNOWN and UNKNOWN lead by example UNKNOWN infection can come from ocean water beach sand; UNKNOWN man gets it from sandbar Its Launch Day! Things to know about SpaceX UNKNOWN Heavy launch from Kennedy Space Center UNKNOWN 6 more UNKNOWN softball teams enter playoffs UNKNOWN tastings in UNKNOWN and UNKNOWN this month UNKNOWN location comes in at No 7 on list of UNKNOWN cities that get the most sun Letters and feedback: Oct 31 2019 UNKNOWN veteran receives settlement from UNKNOWN after waiting 20 years Brevard restaurants with outdoor seating wage war against lovebugs UNKNOWN adds beer sparkling wine and dog treats to its collection of Advent calendars If millennials choose socialism fine Just dont make this mistake Gardening: Adding a tree to your UNKNOWN landscape? Southern redcedar is a great option UNKNOWN high school football rivalries vary in postseason implications 6 hurricanes and 12 named storms predicted for remainder of 2019 hurricane season UNKNOWN man suspected of firing shots with unwitting mother in tow faces multiple charges Dogs on the beach? 
UNKNOWN rejects proposal to allow dogs on South Brevard beaches Letters and feedback: Sept 15 2019 Despite weather odds SpaceX launches UNKNOWN 9 rocket from Cape Canaveral Cyberstalking of alleged battery victim lands Rockledge man back in jail Winds are coming back again but inshore fishing is excellent for redfish and trout Exclusive: Trump administration sets plans for 2019 hurricane season after wakeup call of recent disasters UNKNOWN Intimacy of a shared meal makes this UNKNOWN spot perfect for date night The Chefs Table: Fine meat seafood draw diners to this Suntree favorite UNKNOWN nurse arrested accused of sexual battery on a patient UNKNOWN : Discover old- UNKNOWN elegance in UNKNOWN restaurant UNKNOWN chair resigns claims governors office interfered with independence Medical marijuana firms seek to open more storefronts in UNKNOWN votes 4-1 to support superintendents plan for teacher raises; audience not happy Tracking Brevards stars: Arena football champ UNKNOWN star UNKNOWN rookie lead the way Brevard school district miscalculated attrition savings finds $15 million for teacher pay UNKNOWN country honor roll Oct 31 UNKNOWN team wins Emmy for coverage of SpaceX mission Golf tip: Making adjustments to correct your swing Cruise ship rescues and mishaps: 6 times emergency struck on UNKNOWN more Updates: SpaceX launches Falcon 9 from UNKNOWN group challenges UNKNOWN We Trust motto on UNKNOWN sheriffs patrol vehicles Chemicals in sunscreen seep into your bloodstream after just one day UNKNOWN says UNKNOWN Rose mural dispute resolved in Indialantic after owner applies for permit UNKNOWN strengthens into a hurricane as it moves away from Space Coast; UNKNOWN watching 2 more Planting herbs to cutting back on water heres what to do in your Brevard yard this month UNKNOWN team heads to UNKNOWN for Junior World Series We must prevent human trafficking through education | Opinion Please dont eat the deer Family spots UNKNOWN panther wandering around UNKNOWN 
backyard Man found hanging onto side of kayak in Atlantic Ocean off Patrick Air Force Base SpaceX set to launch UNKNOWN on mission with 24 payloads –\xa0including human ashes UNKNOWN is 2019 Volunteer of the Year finalist in UNKNOWN Awards ZZ Top UNKNOWN and UNKNOWN coming to UNKNOWN students sleep easier on new beds donated from UNKNOWN These fitness trends are worth sticking with over time UNKNOWN saw fewer infections from UNKNOWN vulnificus (aka flesh-eating bacteria) in 2018 Health calendar: April 25-May 2 Young actresses shine in UNKNOWN at Titusville Playhouse Freshman quarterback UNKNOWN leads UNKNOWN over UNKNOWN in UNKNOWN UNKNOWN is 2019 Volunteer of the Year finalist in UNKNOWN Doctors interest grew out of childhood health issues More strong storms move eastward over UNKNOWN ; funnel clouds possible in UNKNOWN arrested on suspicion of sexual battery Strong isolated showers heat impacting UNKNOWN firefighters widow wins $9 million verdict in UNKNOWN delivery death Housing providers may be prohibited from rejecting renters on basis of arrest record 10-foot great white shark Miss May pings off UNKNOWN Beach Palm Bay man 22 dies in overnight crash on Floridas Turnpike Tamara Carroll is 2019 Volunteer of the Year finalist in UNKNOWN Awards Tropical Storm Humberto not expected to affect Space Coast; UNKNOWN watching two more disturbances UNKNOWN veteran UNKNOWN passes away unexpectedly Ten dead issues from the 2019 legislative session Letters and feedback: June 23 2019 Rockledge Cocoa win big as Space Coast takes UNKNOWN in OT Surfers can enjoy chest-high waves most of weekend in UNKNOWN ahead for UNKNOWN 14-20: UNKNOWN performs UNKNOWN tweaks hours Daylight saving time ends on Nov 3 Here are 7 facts to know about the time change Two new UNKNOWN ships including second largest in world move to Port Canaveral Fla Sen Debbie Mayfield files bill to define e-cigarettes liquid nicotine as tobacco products Escape the hustle and bustle for a western getaway Missing 
Palm Bay teen found in UNKNOWN ; mother arrested police say Impossible burger Impossible Cottage Pie make 2019 Epcot Food and Wine Festival menu Got bats? Why you need to get rid of them by Tax Day 2 charged with attempted murder; UNKNOWN police say they pistol-whipped waterboarded force-fed him dog food Bicyclist dies in hit-and-run crash on UNKNOWN 1 near UNKNOWN : St Johns River pollution threatens water supplies Rockledge man charged after hoax to shoot up UNKNOWN awarded $80M in lawsuit claiming Monsantos Roundup causes cancer Astronauts UNKNOWN won UNKNOWN of the Week vote Alligators and airboats: How to explore Everglades National Park in one day Longtime educator church mother and humanitarian UNKNOWN remembered in Palm Bay George Clooney astronauts to attend black-tie event at Kennedy Space Center High school sports results: March 27 UNKNOWN visits UNKNOWN ahead of SpaceX UNKNOWN Heavy launch from UNKNOWN to relocate UNKNOWN to Merritt Island open new facilities Letters and feedback: Aug 7 2019 Duran to host top area amateur golfers this weekend UNKNOWN benefits Harmony Farms Get ready to rumble: UNKNOWN V blasting off early Thursday morning War on lovebugs: Social media reactions tweets memes about pesky bugs cars windshields Master association that contains condominiums may be able to collect capital contribution The new UNKNOWN opens at LEGOLAND River Rocks: Enjoy fresh seafood and river views at this UNKNOWN tradition March 27: UNKNOWN Off 10 takeaways from the presidents campaign kickoff visit to UNKNOWN man arrested on manslaughter charges after UNKNOWN burglars hit five south Brevard stores before UNKNOWN officer shot one of them UNKNOWN family to receive Filipino World War II veterans Congressional Gold Medal NASA: Want to go to the moon in five years? 
We need more money How to REALLY clean lovebugs off your windshield car house Top party schools: UNKNOWN Florida State make it to top of UNKNOWN 2020 list Live scores: High school football Week 4 in UNKNOWN Restaurant inspections: Happy Kitchen in UNKNOWN Divine Grace in Palm Bay closed after inspections School district sticks to $770 raise offer for teachers; special magistrate getting involved Candidate lineup is set for 2019 municipal special district elections in UNKNOWN investigation underway after mans body found in UNKNOWN ditch November hurricane forecast: Season not technically over but its over | WeatherTiger Tyndall Air Force Bases future remains uncertain after devastation from Hurricane Michael Whats behind a low performing school? Crummy parents | Opinion Well known UNKNOWN dealership has The Right Stuff Lovebugs are back Many hate them Experts say theyre here for good The lax disciplinary policies that caused\xa0Parkland massacre may have spread to your school UNKNOWN commissioners tackle parking issues: 10 things to know UNKNOWN woman injured in crash dies from her injuries; police looking for witnesses UNKNOWN President UNKNOWN calls for UNKNOWN lawmakers to focus on mass shootings Governor UNKNOWN approves $91 billion budget slashes $133 million in local projects UNKNOWN and feedback: March 28 2019 Vote for 321prepscom UNKNOWN of the Week for April 29-May 4 UNKNOWN approved for federal cleanup program Island H2O Live! water park in UNKNOWN opens: Make the plunge or chill in lazy river UNKNOWN is 2019 Citizen of the Year finalist in UNKNOWN Awards Vietnam War 50th anniversary to be remembered at UNKNOWN launch day! 
Things to know about SpaceX UNKNOWN 9 launch from UNKNOWN gave UNKNOWN contractors bonuses despite delays and cost overruns UNKNOWN says UNKNOWN is 2019 Citizen of the Year finalist in UNKNOWN Eye doctor honed skills on aircraft carrier UNKNOWN pitchers fare well in state UNKNOWN event The real reason NASAs all-female spacewalk is not happening UNKNOWN : Its not discrimination Port Canaveral aquarium the best use of tourism dollars | Opinion SpaceX UNKNOWN will carry worlds first green rocket fuel mission UNKNOWN at odds on whats actually available for teacher raises UNKNOWN power is too costly and too risky UNKNOWN is 2019 Citizen of the Year finalist in UNKNOWN TODAY Volunteer Recognition Awards SpaceX Dragon installed at UNKNOWN after launch from UNKNOWN phenomenon: Hail damage from unusual thunderstorm has Brevard residents talking Signs you live on the Space Coast Titusville couple charged with neglect after infant found malnourished Health coaches a personal way to help you meet goals UNKNOWN elementary school to break ground in fall fast-tracked to open August 2020 Phase out nuclear power? UNKNOWN and UNKNOWN locked in absurd arms race Strange days: Space Coast gets freak hail flaming meteor mysterious rumble red tide over 6 months Four women arrested on prostitution charges in ongoing UNKNOWN enforcement detail Whose name is on deed - partner or spouse? 
It matters a lot UNKNOWN should steer clear of drug importation from other countries | Opinion Melbourne police investigate Palm Bay Road shooting Brevard bowlers head to state tournament off district wins UNKNOWN Historical Societys May 16-18 conference: Countdown to History: Ice Age to the Space Age Turning the Toxic Tide: UNKNOWN must address the problem of human waste Contributions pour in to group seeking to ban possession sale of assault weapons in UNKNOWN UNKNOWN sweets treats desserts are SO worthy of Instagram Satellite Beach girl 10 leads charge for organ donation Poliakoff: Community development district offers different advantages and disadvantages to an UNKNOWN UNKNOWN in UNKNOWN hosts International Dinner with food from South Latin America From discovering water to snapping selfies: The lasting memories of Mars rover Opportunity Charges dropped against UNKNOWN man after comments against Cocoa Beach hotel werent direct specific threat UNKNOWN just cant say yes to raise offers | Opinion How Mars Opportunity rover ranks in list of more than 1000 unmanned space missions by UNKNOWN since 1958 321 Flavor members rave about UNKNOWN and UNKNOWN UNKNOWN man recovering after shark bite in UNKNOWN Buckeye the manatee rescued again by UNKNOWN Rain hail pelts north central UNKNOWN causing widespread damage Regulators approve merger of UNKNOWN ; deal to close June 29 Teachers approve largest pay raise in years but will it keep them in UNKNOWN ? 
Restaurant review: UNKNOWN in UNKNOWN shows great promise Weak cool front could bring torrential rain damaging winds to UNKNOWN : UNKNOWN earthen causeways would improve Banana River water quality UNKNOWN outfit is everything in photo of last remaining UNKNOWN astronauts Voting Republican only helps if youre rich | Opinion Homelessness: A national problem with a local solution UNKNOWN plans to regulate cancer-causing chemicals found in Americas drinking water UNKNOWN has chance to wow recruits playoff voters with a win over UNKNOWN -sized hail pounds Space Coast as severe thunderstorms move through area UNKNOWN CEO optimistic about merger finances as deal with UNKNOWN nears completion March 26: UNKNOWN ’s UNKNOWN named UNKNOWN Region VIII Pitcher of the Week Knife-wielding suspect leads deputies officers on foot chase in Melbourne 4 Brevard winners in girls regional basketball Thursday March 26: Women’s UNKNOWN -Up at Barry Invite Some seek healing remembrance at arrival of Vietnam War replica wall High school sports results: March 26 Following surge in discipline problems Brevard students ask lawmakers to address vaping in schools UNKNOWN man 21 pronounced dead after trying to make a U-turn on South Street Donald Trumps criminal justice reform shows results UNKNOWN UNKNOWN cold case solved 22 years later after Google Earth satellite image shows missing mans car Health calendar: Aug 8-15 Tracking Brevards stars: UNKNOWN to World Cup; UNKNOWN in golf hunt; UNKNOWN hitting 300 UNKNOWN hotel condo towers homes slated for 27 acres by Hightower Beach park Letters and feedback: Sept 14 2019 UNKNOWN to be named UNKNOWN new baseball coach Golf tip: Hit down on hybrids similar to how you hit an iron UNKNOWN weather moves across Space Coast bringing lightning heavy rain Commission OKs pay raises for 625 county employees to address disparities turnover High school sports results: Feb 14 weVENTURE is 2019 Organization of the Year finalist in UNKNOWN Man taken to hospital after 
domestic incident in Titusville Brevard school district considers selling $800000 in unused land Weather looks OK for SpaceX UNKNOWN Heavy launch from UNKNOWN Flavor: UNKNOWN tops the list of Brevards favorite cocktails UNKNOWN experiment and UNKNOWN fly to space on UNKNOWN New Shepard Bridges sex assault: Victim’s family sues nonprofit caregiver over pregnancy UNKNOWN emergency officials send mixed messages during Dorian New proposal on dog cat sales at pet stores is less restrictive than previous one Space Coast Honor Flight is 2019 Organization of the Year finalist in UNKNOWN It goes by so quickly: Brevard teachers back to work as days count down to school County commissioners reject lagoon funding plan saying they want less money for muck projects UNKNOWN ! Hypodermic needles stabbing employees inside Brevard recycling plant Hard-line anti-illegal immigration group works to sway sanctuary cities bill UNKNOWN principal retires amid accusations of hostile environment UNKNOWN deputy UNKNOWN architect of bloody UNKNOWN policy dead at 93 in UNKNOWN NASAs new goal: Land astronauts on the moon by 2024 Going surfing this weekend? 
Check our forecast Neighbor Up is 2019 Organization of the Year finalist in UNKNOWN TODAY Volunteer Recognition Awards Severe storms bring wind lightning and possible hail across UNKNOWN in UNKNOWN Melbourne may move to larger stores with drive-thru windows SpaceX UNKNOWN Heavy: Why everyone should watch this night launch from UNKNOWN report: Calm seas and good fishing ahead Two World Series commercials feature Brevard celebrities Letters and feedback: May 5 2019 In latest motion UNKNOWN Attorney confirms multiple investigations in Tallahassee Lane 70 softball team wins spring tournament Palm Bay police double down on nightclub safety following four violent episodes UNKNOWN signs deal to deliver small satellites to more orbits UNKNOWN documentary new surf movie spotlight surf legends UNKNOWN Hobgood twins Plan ahead for April 3-10: Music in UNKNOWN adult prom in UNKNOWN ; art in UNKNOWN grieves woman fatally struck by car near Lous Blues bar in UNKNOWN At Cape Canaveral SpaceX kicks off Star Wars Day with smoke and fire Updates: SpaceX launches UNKNOWN 9 from UNKNOWN to UNKNOWN sticks drone ship landing Letters and feedback: March 27 2019 UNKNOWN Tech to host event celebrating Apollo 11 anniversary and future of UNKNOWN space program UNKNOWN on UNKNOWN orbital flight test: We really are close UNKNOWN spring football game set for Saturday Feb 14: EFSC women’s golfer UNKNOWN nominated for 2019 Arnold Palmer Cup Four Panthers Selected to All-SSC Teams Heavy winds and lightning to hit north UNKNOWN How UNKNOWN is shaking up the establishment | Opinion High school sports results: May 3 Palm Bay man who asked woman for permission to have illicit sexual encounter with child sentenced Letters and feedback: June 22 2019 UNKNOWN pumped for two launches less than 48 hours apart Heres why UNKNOWN scrubbed all-female spacewalk Plan ahead for Nov 6-12: Holiday bazaar in UNKNOWN pet events art openings and food fun UNKNOWN shoots one under par for second time this week Burglar 
steals bolted safe from UNKNOWN bar in latest smash-and-grab Feb 14: UNKNOWN Tech Suffers Last-Second Heartbreaker Against Saint Leo Daddy Duty: My school cheating scandal is an example for my 4-year-old Tropical Storm Humberto forms but isnt expected to affect UNKNOWN weather squadron key player in ensuring successful rocket launches UNKNOWN : Getting through to the disconnected husband How to watch SpaceX launch its UNKNOWN 9 rocket from UNKNOWN Members of UNKNOWN Bahai Faith community come together to commemorate twin holy days A&Es UNKNOWN to feature UNKNOWN tiny home builder Movable Roots Lawmakers grapple with felons voting rights All-female spacewalk canceled due to lack of medium-sized spacesuits UNKNOWN and 3 other school district policy changes you should know about Felons rights bill heads to Gov DeSantis Lanes open delays persist after UNKNOWN on I-95 near Wickham Road Former UNKNOWN chair resigns day after UNKNOWN proposes UC'
All the private details have now been replaced with ‘UNKNOWN’.
I hope this shows you the immense usefulness of NER.
Implementing NLP Tasks
Now that you have learnt about various NLP techniques, it’s time to implement them. Examples of NLP are everywhere around you, like the chatbots you use on a website, the news summaries you read online, positive and negative movie reviews and so on.
This section will show you how to implement these vital NLP tasks.
Text Summarization
Suppose you find a magazine with a bunch of articles. Do you just start reading it from page 1 to the end?
No! You first read the summaries to choose the articles that interest you. The same applies to news articles, research papers, etc.
The technique of creating a concise summary of a large body of text without losing the key information is referred to as Text Summarization.
Text Summarization is highly useful in today’s digital world. I will now walk you through some important methods to implement Text Summarization.
What is Extractive Text Summarization
This is the traditional method, in which the process is to identify the significant phrases/sentences of the text corpus and include them in the summary.
The summary obtained from this method will contain the key sentences of the original text corpus. It can be done in many ways; I will show you how using gensim and spacy.
Extractive Text Summarization using Gensim
The text summarization process in the gensim library is based on the TextRank algorithm.
What does the TextRank Algorithm do?
- The raw text is preprocessed (stopwords and punctuation are removed, and words are lemmatized).
- Each sentence of the text corpus is vectorized, i.e., converted into a numeric representation (for example, using word embeddings).
- Next, the algorithm calculates the similarities between the sentence vectors and stores the values in a matrix.
- Then it builds a graph representation: each sentence is a NODE, and the similarities between sentences (stored in the matrix) are the EDGE weights.
- The sentences are ranked according to their scores in this graph.
The top-ranked sentences are chosen to be part of the summary.
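To make the steps above concrete, here is a minimal, self-contained sketch of the ranking idea in plain Python. It is an illustration only, not gensim's implementation: it assumes bag-of-words count vectors with cosine similarity instead of word embeddings, it skips stopword removal and lemmatization, and the names sentence_vector, cosine, textrank_scores and summarize_sketch are hypothetical helpers made up for this example.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    # Assumption: a bag-of-words count vector stands in for a sentence embedding
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine(u, v):
    # Cosine similarity between two sparse count vectors
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def textrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    vectors = [sentence_vector(s) for s in sentences]
    # Similarity matrix: sentences are nodes, similarities are edge weights
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # Power iteration of weighted PageRank over the sentence graph
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            rank = sum(sim[j][i] / out_weight[j] * scores[j]
                       for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def summarize_sketch(sentences, top_n=2):
    scores = textrank_scores(sentences)
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:top_n]
    # Return the chosen sentences in their original order
    return [sentences[i] for i in sorted(top)]


sentences = [
    "Junk food is high in sugar and fat.",
    "Sugar and fat in junk food cause weight gain.",
    "The weather is sunny today.",
]
print(summarize_sketch(sentences, top_n=2))
```

In this toy input, the weather sentence shares almost no words with the other two, so it gets the weakest edges in the graph and is left out of the two-sentence summary.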
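Now, I shall guide you through the code to implement this with gensim. Our first step is to import summarize from gensim.summarization. Note that the gensim.summarization module was removed in Gensim 4.0, so the code in this section requires a Gensim 3.x release.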
import gensim
from gensim.summarization import summarize
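Since gensim.summarization is only available in Gensim 3.x releases, the import above can fail on newer installations. A defensive import like the sketch below makes the problem visible early. The fallback message and the suggested install command are assumptions for illustration; adapt them to your environment.

```python
try:
    # Available in Gensim 3.x; the module was removed in Gensim 4.0
    from gensim.summarization import summarize
except ImportError:
    summarize = None
    print("gensim.summarization is unavailable; "
          "install a Gensim 3.x release, for example: pip install 'gensim<4.0'")
```

If summarize ends up as None, the gensim examples in this section will not run until a compatible version is installed.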
Let us say you have an article about junk food that you want to summarize.
article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. Hope it will help you'
You can pass the text corpus as input to the summarize function.
short_summary=summarize(article_text)
print(short_summary)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.
It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier.
Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.
For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats.
Seems too long, right? What if you want to decide how many words the summary should contain?
You can change the default parameters of the summarize function according to your requirements.
The parameters are:
- ratio: It can take values between 0 and 1 (the default is 0.2). It represents the proportion of the original text's sentences to keep in the summary.
- word_count: It decides the number of words in the summary.
Let me show you how to use these parameters in the above example.
summary_by_ratio=summarize(article_text,ratio=0.1)
print(summary_by_ratio)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.
Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers.
High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.
Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.
In the above output, you can see that only about 10% of the original text is taken as the summary.
Next, let me show you how to summarize using word_count.
summary_by_word_count=summarize(article_text,word_count=30)
print(summary_by_word_count)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
In the above output, you can see the summary limited by word_count.
What if you provide contradicting word_count and ratio values?
In case both are mentioned, the summarize function ignores the ratio. So, word_count has the upper hand.
Let me demonstrate this in the code below.
summary=summarize(article_text,ratio=0.1,word_count=30)
print(summary)
They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.
It is clear that word_count has been followed.
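The precedence rule can be pictured with a small toy function. This is a hypothetical sketch, not gensim's actual code: pick_sentences and its simple word-budget logic are assumptions made for illustration, and it expects a list of sentences that is already ordered from most to least important.

```python
def pick_sentences(ranked_sentences, ratio=0.2, word_count=None):
    """Toy selection step: word_count, when given, overrides ratio."""
    if word_count is not None:
        # A word budget was given, so ratio is ignored entirely
        chosen, used = [], 0
        for sentence in ranked_sentences:
            n_words = len(sentence.split())
            # Always keep at least one sentence, then stop at the budget
            if chosen and used + n_words > word_count:
                break
            chosen.append(sentence)
            used += n_words
        return chosen
    # No word budget: keep a fixed proportion of the ranked sentences
    keep = max(1, int(len(ranked_sentences) * ratio))
    return ranked_sentences[:keep]


ranked = ["a b c", "d e", "f g h i", "j"]
print(pick_sentences(ranked, ratio=0.5))                # ratio decides
print(pick_sentences(ranked, ratio=0.5, word_count=3))  # word_count decides
```

With ratio alone, half of the four sentences are kept. Once word_count is passed as well, the same ratio value no longer has any effect on the result.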
Moving on, let us look at how to summarize text using the spacy library.
Extractive Text Summarization with spacy
You can also implement Text Summarization using the spacy package.
Here, I will try to summarize the same article on junk food using spacy.
article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. '
First, import the package and create a spacy object doc.
# Importing & creating a spacy object
import spacy
nlp = spacy.load('en_core_web_sm')
doc=nlp(article_text)
Next, you know that extractive summarization is based on identifying the significant words. So, you need to store the keywords of the text in a list.
How to choose the important words?
We know that punctuation and stopwords are just noise, so you can ignore them. Usually, the nouns, proper nouns, adjectives and verbs add significant value to the text.
spacy gives you the option to check a token’s part of speech through the token.pos_ attribute. Using this, you can select the desired tokens as shown below.
# creating an empty list to store keywords
keywords_list = []
# List of the POS categories which you think are significant
desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']
# Import punctuations for text cleaning
from string import punctuation
# Iterating through tokens
for token in doc:
    # checking if a token is stopword or punctuation
    if token.text in nlp.Defaults.stop_words or token.text in punctuation:
        # If true, they are just ignored and loop goes to the next token
        continue
    # checking if the POS tag of the token is in our desired list
    if token.pos_ in desired_pos:
        # If true, append the token to our keywords list
        keywords_list.append(token.text)
The above code iterates through every token and stores the tokens that are NOUN, PROPER NOUN, VERB or ADJECTIVE in keywords_list.
Next, you can find the frequency of each token in keywords_list using Counter. The list of keywords is passed as input to Counter, which returns a dictionary-like object mapping keywords to their frequencies.
# Importing Counter from the collections module
from collections import Counter
# creating dictionary of keywords + frequency
dictionary = Counter(keywords_list)
print(dictionary)
Counter({'food': 19, 'junk': 13, 'foods': 12, 'high': 9, 'sugar': 7, 'Junk': 6, 'healthy': 6, 'weight': 6, 'cholesterol': 5, 'body': 5, 'blood': 5, 'heart': 5, 'good': 4, 'sodium': 4, 'fat': 4, 'gain': 4, 'life': 4, 'diabetes': 4, 'level': 4, 'increases': 4, 'risk': 4, 'children': 3, 'effects': 3, 'health': 3, 'found': 3, 'nutrients': 3, 'unhealthy': 3, 'lack': 3, 'impact': 3, 'person': 3, 'bad': 3, 'fats': 3, 'result': 3, 'taste': 2, 'age': 2, 'parents': 2, 'harmful': 2, 'fried': 2, 'dietary': 2, 'fibers': 2, 'makes': 2, 'obesity': 2, 'tastes': 2, 'looks': 2, 'calorie': 2, 'fries': 2, 'burgers': 2, 'candy': 2, 'cookies': 2, 'eating': 2, 'prone': 2, 'type-2': 2, 'disease': 2, 'kidney': 2, 'failure': 2, 'lead': 2, 'nutritional': 2, 'deficiencies': 2, 'essential': 2, 'vitamins': 2, 'minerals': 2, 'cardiovascular': 2, 'diseases': 2, 'diet': 2, 'pressure': 2, 'functioning': 2, 'spike': 2, 'people': 2, 'dull': 2, 'day': 2, 'arteries': 2, 'attack': 2, 'way': 2, 'main': 2, 'instance': 2, 'overtime': 2, 'things': 2, 'time': 2, 'green': 2, 'find': 2, 'try': 2, 'different': 2, 'help': 2, 'liked': 1, 'group': 1, 'kids': 1, 'school': 1, 'going': 1, 'ask': 1, 'trend': 1, 'childhood': 1, 'discussed': 1, 'According': 1, 'research': 1, 'scientists': 1, 'negative': 1, 'ways': 1, 'market': 1, 'packets': 1, 'calories': 1, 'low': 1, 'mineral': 1, 'starch': 1, 'protein': 1, 'Processed': 1, 'means': 1, 'rapid': 1, 'able': 1, 'excessive': 1, 'called': 1, 'fulfil': 1, 'requirement': 1, 'french': 1, 'pizza': 1, 'soft': 1, 'drinks': 1, 'baked': 1, 'goods': 1, 'ice': 1, 'cream': 1, 'example': 1, 'containing': 1, 'according': 1, 'Centres': 1, 'Disease': 1, 'Control': 1, 'Prevention': 1, 'Kids': 1, 'unable': 1, 'regulate': 1, 'Risk': 1, 'getting': 1, 'increasing': 1, 'obese': 1, 'overweight': 1, 'Eating': 1, 'iron': 1, 'rich': 1, 'saturated': 1, 'High': 1, 'overloads': 1, 'like': 1, 'develop': 1, 'extra': 1, 'fatter': 1, 'unhealthier': 1, 'contain': 1, 'carbohydrate': 1, 'lethargic': 1, 
'sleepy': 1, 'active': 1, 'alert': 1, 'Reflexes': 1, 'senses': 1, 'live': 1, 'sedentary': 1, 'source': 1, 'constipation': 1, 'ailments': 1, 'clogged': 1, 'strokes': 1, 'poor': 1, 'nutrition': 1, 'easiest': 1, 'reasons': 1, 'increase': 1, 'positive': 1, 'points': 1, 'requires': 1, 'stay': 1, 'fit': 1, 'fulfilled': 1, 'French': 1, 'amounts': 1, 'long': 1, 'term': 1, 'illnesses': 1, 'consume': 1, 'consumption': 1, 'words': 1, 'interferes': 1, 'contains': 1, 'higher': 1, 'carbohydrates': 1, 'levels': 1, 'lethargy': 1, 'inactiveness': 1, 'sleepiness': 1, 'reflex': 1, 'inactive': 1, 'worse': 1, 'clogs': 1, 'avoided': 1, 'save': 1, 'ruined': 1, 'problem': 1, 'realize': 1, 'ill': 1, 'comes': 1, 'late': 1, 'issue': 1, 'works': 1, 'face': 1, 'consequences': 1, 'better': 1, 'stop': 1, 'avoid': 1, 'encouraging': 1, 'early': 1, 'eat': 1, 'vegetables': 1, 'buds': 1, 'developed': 1, 'tasty': 1, 'mix': 1, 'serve': 1, 'vegetable': 1, 'style': 1, 'Incorporate': 1, 'types': 1, 'following': 1, 'recipes': 1, 'home': 1, 'attracted': 1, 'short': 1, 'deprive': 1, 'Children': 1, 'Make': 1, 'sure': 1, 'limited': 1, 'quantities': 1, 'periods': 1})
Now, you need to normalize the frequency of all keywords.
For that, find the highest frequency using the .most_common method. Then divide every keyword frequency in the dictionary by this highest frequency.
# find the highest frequency
highest_frequency = Counter(keywords_list).most_common(1)[0][1]
# Normalizing process
for word in dictionary:
    dictionary[word] = (dictionary[word]/highest_frequency)
print(dictionary)
Counter({'food': 1.0, 'junk': 0.6842105263157895, 'foods': 0.631578947368421, 'high': 0.47368421052631576, 'sugar': 0.3684210526315789, 'Junk': 0.3157894736842105, 'healthy': 0.3157894736842105, 'weight': 0.3157894736842105, 'cholesterol': 0.2631578947368421, 'body': 0.2631578947368421, 'blood': 0.2631578947368421, 'heart': 0.2631578947368421, 'good': 0.21052631578947367, 'sodium': 0.21052631578947367, 'fat': 0.21052631578947367, 'gain': 0.21052631578947367, 'life': 0.21052631578947367, 'diabetes': 0.21052631578947367, 'level': 0.21052631578947367, 'increases': 0.21052631578947367, 'risk': 0.21052631578947367, 'children': 0.15789473684210525, 'effects': 0.15789473684210525, 'health': 0.15789473684210525, 'found': 0.15789473684210525, 'nutrients': 0.15789473684210525, 'unhealthy': 0.15789473684210525, 'lack': 0.15789473684210525, 'impact': 0.15789473684210525, 'person': 0.15789473684210525, 'bad': 0.15789473684210525, 'fats': 0.15789473684210525, 'result': 0.15789473684210525, 'taste': 0.10526315789473684, 'age': 0.10526315789473684, 'parents': 0.10526315789473684, 'harmful': 0.10526315789473684, 'fried': 0.10526315789473684, 'dietary': 0.10526315789473684, 'fibers': 0.10526315789473684, 'makes': 0.10526315789473684, 'obesity': 0.10526315789473684, 'tastes': 0.10526315789473684, 'looks': 0.10526315789473684, 'calorie': 0.10526315789473684, 'fries': 0.10526315789473684, 'burgers': 0.10526315789473684, 'candy': 0.10526315789473684, 'cookies': 0.10526315789473684, 'eating': 0.10526315789473684, 'prone': 0.10526315789473684, 'type-2': 0.10526315789473684, 'disease': 0.10526315789473684, 'kidney': 0.10526315789473684, 'failure': 0.10526315789473684, 'lead': 0.10526315789473684, 'nutritional': 0.10526315789473684, 'deficiencies': 0.10526315789473684, 'essential': 0.10526315789473684, 'vitamins': 0.10526315789473684, 'minerals': 0.10526315789473684, 'cardiovascular': 0.10526315789473684, 'diseases': 0.10526315789473684, 'diet': 0.10526315789473684, 'pressure': 
0.10526315789473684, 'functioning': 0.10526315789473684, 'spike': 0.10526315789473684, 'people': 0.10526315789473684, 'dull': 0.10526315789473684, 'day': 0.10526315789473684, 'arteries': 0.10526315789473684, 'attack': 0.10526315789473684, 'way': 0.10526315789473684, 'main': 0.10526315789473684, 'instance': 0.10526315789473684, 'overtime': 0.10526315789473684, 'things': 0.10526315789473684, 'time': 0.10526315789473684, 'green': 0.10526315789473684, 'find': 0.10526315789473684, 'try': 0.10526315789473684, 'different': 0.10526315789473684, 'help': 0.10526315789473684, 'liked': 0.05263157894736842, 'group': 0.05263157894736842, 'kids': 0.05263157894736842, 'school': 0.05263157894736842, 'going': 0.05263157894736842, 'ask': 0.05263157894736842, 'trend': 0.05263157894736842, 'childhood': 0.05263157894736842, 'discussed': 0.05263157894736842, 'According': 0.05263157894736842, 'research': 0.05263157894736842, 'scientists': 0.05263157894736842, 'negative': 0.05263157894736842, 'ways': 0.05263157894736842, 'market': 0.05263157894736842, 'packets': 0.05263157894736842, 'calories': 0.05263157894736842, 'low': 0.05263157894736842, 'mineral': 0.05263157894736842, 'starch': 0.05263157894736842, 'protein': 0.05263157894736842, 'Processed': 0.05263157894736842, 'means': 0.05263157894736842, 'rapid': 0.05263157894736842, 'able': 0.05263157894736842, 'excessive': 0.05263157894736842, 'called': 0.05263157894736842, 'fulfil': 0.05263157894736842, 'requirement': 0.05263157894736842, 'french': 0.05263157894736842, 'pizza': 0.05263157894736842, 'soft': 0.05263157894736842, 'drinks': 0.05263157894736842, 'baked': 0.05263157894736842, 'goods': 0.05263157894736842, 'ice': 0.05263157894736842, 'cream': 0.05263157894736842, 'example': 0.05263157894736842, 'containing': 0.05263157894736842, 'according': 0.05263157894736842, 'Centres': 0.05263157894736842, 'Disease': 0.05263157894736842, 'Control': 0.05263157894736842, 'Prevention': 0.05263157894736842, 'Kids': 0.05263157894736842, 'unable': 
0.05263157894736842, 'regulate': 0.05263157894736842, 'Risk': 0.05263157894736842, 'getting': 0.05263157894736842, 'increasing': 0.05263157894736842, 'obese': 0.05263157894736842, 'overweight': 0.05263157894736842, 'Eating': 0.05263157894736842, 'iron': 0.05263157894736842, 'rich': 0.05263157894736842, 'saturated': 0.05263157894736842, 'High': 0.05263157894736842, 'overloads': 0.05263157894736842, 'like': 0.05263157894736842, 'develop': 0.05263157894736842, 'extra': 0.05263157894736842, 'fatter': 0.05263157894736842, 'unhealthier': 0.05263157894736842, 'contain': 0.05263157894736842, 'carbohydrate': 0.05263157894736842, 'lethargic': 0.05263157894736842, 'sleepy': 0.05263157894736842, 'active': 0.05263157894736842, 'alert': 0.05263157894736842, 'Reflexes': 0.05263157894736842, 'senses': 0.05263157894736842, 'live': 0.05263157894736842, 'sedentary': 0.05263157894736842, 'source': 0.05263157894736842, 'constipation': 0.05263157894736842, 'ailments': 0.05263157894736842, 'clogged': 0.05263157894736842, 'strokes': 0.05263157894736842, 'poor': 0.05263157894736842, 'nutrition': 0.05263157894736842, 'easiest': 0.05263157894736842, 'reasons': 0.05263157894736842, 'increase': 0.05263157894736842, 'positive': 0.05263157894736842, 'points': 0.05263157894736842, 'requires': 0.05263157894736842, 'stay': 0.05263157894736842, 'fit': 0.05263157894736842, 'fulfilled': 0.05263157894736842, 'French': 0.05263157894736842, 'amounts': 0.05263157894736842, 'long': 0.05263157894736842, 'term': 0.05263157894736842, 'illnesses': 0.05263157894736842, 'consume': 0.05263157894736842, 'consumption': 0.05263157894736842, 'words': 0.05263157894736842, 'interferes': 0.05263157894736842, 'contains': 0.05263157894736842, 'higher': 0.05263157894736842, 'carbohydrates': 0.05263157894736842, 'levels': 0.05263157894736842, 'lethargy': 0.05263157894736842, 'inactiveness': 0.05263157894736842, 'sleepiness': 0.05263157894736842, 'reflex': 0.05263157894736842, 'inactive': 0.05263157894736842, 'worse': 
0.05263157894736842, 'clogs': 0.05263157894736842, 'avoided': 0.05263157894736842, 'save': 0.05263157894736842, 'ruined': 0.05263157894736842, 'problem': 0.05263157894736842, 'realize': 0.05263157894736842, 'ill': 0.05263157894736842, 'comes': 0.05263157894736842, 'late': 0.05263157894736842, 'issue': 0.05263157894736842, 'works': 0.05263157894736842, 'face': 0.05263157894736842, 'consequences': 0.05263157894736842, 'better': 0.05263157894736842, 'stop': 0.05263157894736842, 'avoid': 0.05263157894736842, 'encouraging': 0.05263157894736842, 'early': 0.05263157894736842, 'eat': 0.05263157894736842, 'vegetables': 0.05263157894736842, 'buds': 0.05263157894736842, 'developed': 0.05263157894736842, 'tasty': 0.05263157894736842, 'mix': 0.05263157894736842, 'serve': 0.05263157894736842, 'vegetable': 0.05263157894736842, 'style': 0.05263157894736842, 'Incorporate': 0.05263157894736842, 'types': 0.05263157894736842, 'following': 0.05263157894736842, 'recipes': 0.05263157894736842, 'home': 0.05263157894736842, 'attracted': 0.05263157894736842, 'short': 0.05263157894736842, 'deprive': 0.05263157894736842, 'Children': 0.05263157894736842, 'Make': 0.05263157894736842, 'sure': 0.05263157894736842, 'limited': 0.05263157894736842, 'quantities': 0.05263157894736842, 'periods': 0.05263157894736842})
From the above output, you can notice that each keyword is associated with a value between 0 and 1.
Next, you need to assign a score to each sentence. You can access the sentences in a doc through doc.sents. You can iterate through each token of a sentence, look up the values of the keywords, and accumulate them in a dictionary named score.
The code below demonstrates this.
# Creating a dictionary to store the score of each sentence
score={}
# Iterating through each sentence
for sentence in doc.sents:
    # Iterating through token of each sentence
    for token in sentence:
        # checking if the token is a keyword
        if token.text in dictionary.keys():
            # If true , add the frequency of keyword to the score dictionary
            if sentence in score.keys():
                score[sentence]+=dictionary[token.text]
            else:
                score[sentence]=dictionary[token.text]
print(score)
{Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children.: 1.789473684210526, They generally ask for the junk food daily because they have been trend so by their parents from the childhood.: 1.9473684210526316, They never have been discussed by their parents about the harmful effects of junk foods over health.: 1.894736842105263, According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways.: 2.0526315789473686, They are generally fried food found in the market in the packets.: 1.3684210526315788, They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers.: 4.36842105263158, Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life.: 2.7894736842105265, It makes able a person to gain excessive weight which is called as obesity.: 1.0526315789473686, Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body.: 2.3684210526315788, Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods.: 4.526315789473684, It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes.: 2.8421052631578947, In type-2 diabetes our body become unable to regulate blood sugar level.: 1.526315789473684, Risk of getting this disease is increasing as one become more obese or overweight.: 0.3684210526315789, It increases the risk of kidney failure.: 0.631578947368421, Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary 
fibers.: 3.210526315789473, It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol.: 1.5789473684210524, High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning.: 1.789473684210526, One who like junk food develop more risk to put on extra weight and become fatter and unhealthier.: 2.4736842105263164, Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert.: 3.052631578947369, Reflexes and senses of the people eating this food become dull day by day: 1.6315789473684208, thus they live more sedentary life.: 0.3157894736842105, Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition.: 2.4210526315789473, Junk food is the easiest way to gain unhealthy weight.: 2.1578947368421053, The amount of fats and sugar in the food makes you gain weight rapidly.: 2.157894736842105, However, this is not a healthy weight.: 0.631578947368421, It is more of fats and cholesterol which will have a harmful impact on your health.: 0.8421052631578947, Junk food is also one of the main reasons for the increase in obesity nowadays.: 1.6315789473684208, This food only looks and tastes good, other than that, it has no positive points.: 1.5263157894736838, The amount of calorie your body requires to stay fit is not fulfilled by this food.: 1.5789473684210527, For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats.: 2.31578947368421, Therefore, this can result in long-term illnesses like diabetes and high blood pressure.: 1.4210526315789473, This may also result in kidney failure.: 0.3684210526315789, Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more.: 0.7368421052631579, You 
become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium.: 1.2105263157894737, In other words, all this interferes with the functioning of your heart.: 0.47368421052631576, Furthermore, junk food contains a higher level of carbohydrates.: 2.052631578947368, It will instantly spike your blood sugar levels.: 0.7894736842105263, This will result in lethargy, inactiveness, and sleepiness.: 0.3157894736842105, A person reflex becomes dull overtime: 0.42105263157894735, and they lead an inactive life.: 0.3684210526315789, To make things worse, junk food also clogs your arteries and increases the risk of a heart attack.: 2.7894736842105257, Therefore, it must be avoided at the first instance to save your life from becoming ruined.: 0.47368421052631576, The main problem with junk food is that people don’t realize its ill effects now.: 2.2105263157894735, When the time comes, it is too late.: 0.21052631578947367, Most importantly, the issue is that it does not impact you instantly.: 0.21052631578947367, It works on your overtime; you will face the consequences sooner or later.: 0.2631578947368421, Thus, it is better to stop now.: 0.10526315789473684, You can avoid junk food by encouraging your children from an early age to eat green vegetables.: 2.3157894736842106, Their taste buds must be developed as such that they find healthy food tasty.: 1.6842105263157894, Moreover, try to mix things up.: 0.2631578947368421, Do not serve the same green vegetable daily in the same style.: 0.2631578947368421, Incorporate different types of healthy food in their diet following different recipes.: 1.8421052631578945, This will help them to try foods at home rather than being attracted to junk food.: 2.631578947368421, In short, do not deprive them completely of it as that will not help.: 0.21052631578947367, Children will find one way or the other to have it.: 0.2631578947368421, Make sure you give them junk food in limited quantities and at 
healthy periods of time.: 2.3684210526315788}
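To make the scoring arithmetic easier to follow, here is a minimal sketch of the same idea in plain Python, without spaCy. The weights and sentences are made-up toy values, and score_sentence is a hypothetical helper. Unlike the code above, it lowercases the words and also records sentences that contain no keyword, with a score of 0.

```python
# Toy keyword weights (hypothetical values standing in for the normalized frequencies)
keyword_weights = {"junk": 1.0, "food": 0.8, "sugar": 0.4, "weight": 0.2}

sentences = [
    "Junk food is high in sugar.",
    "It makes you gain weight.",
    "Eat vegetables instead.",
]

def score_sentence(sentence, weights):
    """Sum the weights of every keyword occurrence in the sentence."""
    total = 0.0
    for word in sentence.split():
        # Strip punctuation and lowercase so "Junk" matches "junk"
        token = word.strip(".,;:!?").lower()
        total += weights.get(token, 0.0)
    return total

scores = {s: score_sentence(s, keyword_weights) for s in sentences}
for s, value in scores.items():
    print(round(value, 2), s)
```

The first sentence collects the weights of "junk", "food" and "sugar", so it ends up with the highest score, which is exactly why keyword-dense sentences rise to the top of the summary.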
Now that you have the score of each sentence, you can sort the sentences in descending order of their significance.
# sorting the sentence scores
sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
Now you are at the final part. Decide the number of sentences you want in the summary, then add sentences from sorted_score until you have reached the desired no_of_sentences.
This gives you the final text summary.
# list to store sentences of summary
text_summary=[]
# Deciding the total no of sentences in summary
no_of_sentences=4
# to count the no of sentence we already added to summary
total = 0
for i in range(len(sorted_score)):
    # appending to the summary
    text_summary.append(str(sorted_score[i][0]).capitalize())
    total += 1
    # checking if limit exceeded
    if(total >= no_of_sentences):
        break
print(text_summary)
You can see the 4-sentence summary in the above output. You would have noticed that this approach is lengthier than using gensim.
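One thing the loop above does not do is keep the chosen sentences in the order they appear in the text. As a hedged variation, the sketch below picks the top sentences with heapq.nlargest and then restores document order. The (sentence, score) pairs are made-up toy values and top_sentences is a hypothetical helper, not part of spaCy or gensim.

```python
import heapq

# Hypothetical (sentence, score) pairs listed in document order
scored = [
    ("Junk food tastes good.", 1.8),
    ("It is high in sugar and fat.", 4.4),
    ("This is not a healthy weight.", 0.6),
    ("It raises the risk of heart disease.", 2.8),
]

def top_sentences(scored_sentences, n):
    """Return the n best-scoring sentences, kept in their original order."""
    indexed = list(enumerate(scored_sentences))
    best = heapq.nlargest(n, indexed, key=lambda item: item[1][1])
    # Sort the winners by position so the summary reads like the source text
    return [sentence for _, (sentence, _) in sorted(best)]

summary = top_sentences(scored, 2)
print(" ".join(summary))
```

Keeping the original order usually makes an extractive summary easier to read, because each sentence still follows the one it followed in the source.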
These two are the traditional methods of text summarization. There is a newer, more advanced method that outperforms both of them. I will discuss it next.
Generative Text Summarization
Notice that in the extractive method, the sentences of the summary are all taken verbatim from the original text. There is no change in the structure of any sentence.
Generative text summarization methods (also called abstractive summarization) overcome this shortcoming. The idea is to capture the meaning of the text and generate entirely new sentences that best represent it in the summary.
These are more advanced methods and usually produce better summaries. Here, I shall guide you through implementing generative text summarization using Hugging Face.
Hugging Face Transformers currently supports both PyTorch and TensorFlow 2.0. You can use the transformers library of Hugging Face to implement summarization. First, import torch.
import torch
You can install Hugging Face Transformers with the command below.
!pip install transformers --upgrade
Next, you can import pipeline from transformers. It is a factory function that returns a ready-to-use pipeline object for a given task.
from transformers import pipeline
The pipeline function has parameters which you can set as per the requirements of your problem.
Some main parameters are:
- task: A string that specifies the task; the corresponding pipeline will be returned. For example, task="summarization" returns a SummarizationPipeline.
- model: The model that will be used by the pipeline. You can use pretrained models.
- tokenizer: The tokenizer you want to use for encoding the data for the model. You can use pretrained tokenizers.
Note: model and tokenizer are optional arguments. If not specified, the default ones for the task will be used.
The example below demonstrates building a summarization pipeline.
summarizer=pipeline(task="summarization",model='facebook/bart-large-cnn',tokenizer='facebook/bart-large-cnn')
The above code returns a SummarizationPipeline.
summarizer
article_text='Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. They never have been discussed by their parents about the harmful effects of junk foods over health. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in healthy nutrients, high in sodium mineral, high in sugar, starch, unhealthy fat, lack of protein and lack of dietary fibers. Processed and junk foods are the means of rapid and unhealthy weight gain and negatively impact the whole body throughout the life. It makes able a person to gain excessive weight which is called as obesity. Junk foods tastes good and looks good however do not fulfil the healthy calorie requirement of the body. Some of the foods like french fries, fried foods, pizza, burgers, candy, soft drinks, baked goods, ice cream, cookies, etc are the example of high-sugar and high-fat containing foods. It is found according to the Centres for Disease Control and Prevention that Kids and children eating junk food are more prone to the type-2 diabetes. In type-2 diabetes our body become unable to regulate blood sugar level. Risk of getting this disease is increasing as one become more obese or overweight. It increases the risk of kidney failure. Eating junk food daily lead us to the nutritional deficiencies in the body because it is lack of essential nutrients, vitamins, iron, minerals and dietary fibers. It increases risk of cardiovascular diseases because it is rich in saturated fat, sodium and bad cholesterol. High sodium and bad cholesterol diet increases blood pressure and overloads the heart functioning. 
One who like junk food develop more risk to put on extra weight and become fatter and unhealthier. Junk foods contain high level carbohydrate which spike blood sugar level and make person more lethargic, sleepy and less active and alert. Reflexes and senses of the people eating this food become dull day by day thus they live more sedentary life. Junk foods are the source of constipation and other disease like diabetes, heart ailments, clogged arteries, heart attack, strokes, etc because of being poor in nutrition. Junk food is the easiest way to gain unhealthy weight. The amount of fats and sugar in the food makes you gain weight rapidly. However, this is not a healthy weight. It is more of fats and cholesterol which will have a harmful impact on your health. Junk food is also one of the main reasons for the increase in obesity nowadays.This food only looks and tastes good, other than that, it has no positive points. The amount of calorie your body requires to stay fit is not fulfilled by this food. For instance, foods like French fries, burgers, candy, and cookies, all have high amounts of sugar and fats. Therefore, this can result in long-term illnesses like diabetes and high blood pressure. This may also result in kidney failure. Above all, you can get various nutritional deficiencies when you don’t consume the essential nutrients, vitamins, minerals and more. You become prone to cardiovascular diseases due to the consumption of bad cholesterol and fat plus sodium. In other words, all this interferes with the functioning of your heart. Furthermore, junk food contains a higher level of carbohydrates. It will instantly spike your blood sugar levels. This will result in lethargy, inactiveness, and sleepiness. A person reflex becomes dull overtime and they lead an inactive life. To make things worse, junk food also clogs your arteries and increases the risk of a heart attack. 
Therefore, it must be avoided at the first instance to save your life from becoming ruined.The main problem with junk food is that people don’t realize its ill effects now. When the time comes, it is too late. Most importantly, the issue is that it does not impact you instantly. It works on your overtime; you will face the consequences sooner or later. Thus, it is better to stop now.You can avoid junk food by encouraging your children from an early age to eat green vegetables. Their taste buds must be developed as such that they find healthy food tasty. Moreover, try to mix things up. Do not serve the same green vegetable daily in the same style. Incorporate different types of healthy food in their diet following different recipes. This will help them to try foods at home rather than being attracted to junk food.In short, do not deprive them completely of it as that will not help. Children will find one way or the other to have it. Make sure you give them junk food in limited quantities and at healthy periods of time. '
You can pass your text as input to the summarizer. The parameters min_length and max_length allow you to control the length of the summary (measured in tokens) as per your needs.
our_summary=summarizer( article_text,min_length=5,max_length=100)
print(our_summary)
[{'summary_text': 'Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways.'}]
From the above output, you can see the short summary. This is the method of summarization using pipeline.
Now, let me introduce you to another method of text summarization, using the pretrained models available in the transformers library.
There are pretrained models with weights available which can be accessed through the .from_pretrained() method. We shall be using one such model, bart-large-cnn, for text summarization.
To work with this model, you can import the corresponding tokenizer and model as shown below.
# Importing model and tokenizer
from transformers import BartTokenizer, BartForConditionalGeneration, BartConfig
# Initializing the model & tokenizer using .from_pretrained()
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
Next, you need to convert the input text into input ids using the batch_encode_plus method. With return_tensors='pt', this returns the ids as PyTorch tensors.
The encoded ids are passed as input to model.generate() along with the max_length parameter, which caps the number of tokens in the summary. This function returns the summary_ids, the token ids of the generated summary. You can then decode these ids to text and print the summary.
# encode the input
inputs = tokenizer.batch_encode_plus([article_text], max_length=1024, return_tensors='pt')
# You can generate the ids that will constitute the summary.
summary_ids = model.generate(inputs['input_ids'], max_length=100, early_stopping=False)
# iterating through each id in summary
for ids in summary_ids:
    # decoding the ids back into text
    short = tokenizer.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    # comparing original length and summary length
    print(len(article_text),'------>' ,len(short))
    print(short)
4870 ------> 484
Junk foods taste good that’s why it is mostly liked by everyone of any age group especially kids and school going children. They generally ask for the junk food daily because they have been trend so by their parents from the childhood. According to the research by scientists, it has been found that junk foods have negative effects on the health in many ways. They are generally fried food found in the market in the packets. They become high in calories, high in cholesterol, low in
ChatBot with Hugging Face
I’m sure you would have used a chatbot on a website. It welcomes you and answers your doubts.
It’s like a small scale AI Assistant !
How do these super-cool chatbots work ?
They are built using NLP techniques to understand the context of a question and provide answers based on what they were trained on.
Hugging Face, a startup that is reshaping NLP, has introduced an advanced method of building chatbots with transformers. You can check the Hugging Face implementation here.
If you are a beginner, you might get overwhelmed trying to implement it. The process is made easier for you by the simpletransformers package. This library is built on top of the transformers library of Hugging Face.
Here, I shall guide you to build a chatbot efficiently using simpletransformers. First, install it with the command below.
pip install simpletransformers
For building chatbots or conversational AI models, you can use the ConvAIModel class.
from simpletransformers.conv_ai import ConvAIModel
Presently, it supports the OpenAI GPT and GPT-2 models. Some important parameters of ConvAIModel are:
- model_type: A string indicating your model type, such as "gpt" or "gpt2".
- model_name: A string specifying the exact name of a pretrained model. Here, I have used "openai-gpt", a pretrained model from Hugging Face Transformers. You can also pass the path to a folder containing the model.
- args: A dictionary where you can set argument values to override the default ones.
Let me show you how to initialize the model in the code below.
train_args ={"fp16":False,
"num_train_epochs": 1,
"save_model_every_epoch": False}
my_chatbot=ConvAIModel("gpt","openai-gpt",use_cuda=True,args=train_args)
ftfy or spacy is not installed using BERT BasicTokenizer instead of SpaCy & ftfy.
Now that the model is stored in my_chatbot, you can train it using the .train_model() function. When you call train_model() without passing training data, simpletransformers downloads and uses the default training data.
my_chatbot.train_model()
Once your chatbot is trained, you can test it through the interact() function as shown below.
my_chatbot.interact()
Hence, you have successfully created your Chatbot.
Language Translation
I am sure each of us has used a translator at some point! Language translation is the miracle that has made communication between diverse people possible.
Language translation is one of the main applications of NLP. Here, I shall introduce you to some advanced methods to implement it.
A language translator can be built in a few steps using Hugging Face’s transformers library. Let us first install it.
!pip install transformers
The transformers library provides task-specific pipelines for our needs. This is a main feature which gives Hugging Face its edge.
You can decide which pipeline you want through the task parameter. For translation tasks, you can provide task='translation_xx_to_yy' (xx and yy are the standard language codes of the source and target languages).
Let us say I want to translate English to Romanian!
My input will be task='translation_en_to_ro'.
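If you switch language pairs often, you could build the task string from the two language codes. The sketch below is only an illustration: translation_task and build_translator are hypothetical helper names, and whether a default model exists for a given pair depends on your installed transformers version, so treat any pair other than the one used here as an assumption to verify.

```python
def translation_task(source_lang, target_lang):
    """Build a task string of the form 'translation_xx_to_yy'."""
    return f"translation_{source_lang.lower()}_to_{target_lang.lower()}"

def build_translator(source_lang, target_lang):
    """Create a translation pipeline; the model is downloaded on first use."""
    # Imported here so that the helper above works without transformers installed
    from transformers import pipeline
    return pipeline(task=translation_task(source_lang, target_lang))

print(translation_task("en", "ro"))  # translation_en_to_ro
```

Calling build_translator("en", "ro") would then give you the same kind of pipeline object as passing the task string by hand.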
from transformers import pipeline
translator_model=pipeline(task='translation_en_to_ro')
translator_model
Now that your pipeline is ready, you can pass the text in English as input.
translator_model('I welcome you to my country ! I am sure you will enjoy your visit')
[{'translation_text': 'Vă urez bun venit în țara mea !Sunt sigur că vă veți bucura de vizita dvs.'}]
You can see the translated sentence in the above output.
You might be wondering whether this is the only way to do it.
Now, I shall introduce you to another efficient method, using simpletransformers. You can install it with the command below.
!pip install simpletransformers
For language translation, we shall use sequence-to-sequence models. So, you can import Seq2SeqModel as shown below.
import torch
from simpletransformers.seq2seq import Seq2SeqModel
What is a Sequence to Sequence model ?
It has two parts: an encoder and a decoder. The input sequence is fed to the encoder, which converts it into an intermediate vector representation. From this intermediate representation, the decoder predicts the output sequence. It is a common architecture for translation problems.
Coming back to our example, let us see how to instantiate the Seq2SeqModel.
The parameters are:
- encoder_decoder_type: It can be bart, marian or encoder-decoder. For translation, we usually use marian.
- encoder_decoder_name: The name of, or path to, a pretrained Hugging Face model.
You can see the list of available pretrained models here.
The code below demonstrates initializing the model with encoder_decoder_type='marian' and a pretrained model for translating English to Romanian. Note that model_args is a dictionary of argument values that has to be defined before this call.
model = Seq2SeqModel(encoder_decoder_type="marian",encoder_decoder_name="Helsinki-NLP/opus-mt-en-ro",args=model_args,use_cuda=False)
Now, your translation model is ready. You can pass the input text as a list of strings to the model.predict() function.
model.predict(['I welcome you to my country','I am sure you will enjoy your stay here'])
['Vă urez bun venit în ţara mea.',
'Sunt sigur că vă veţi bucura de şederea dvs aici.']
You may feel that the time taken is long. You can always modify the arguments according to the needs of your problem. You can view the current values of the arguments through the model.args attribute.
model.args
{'adam_epsilon': 1e-08,
'base_marian_model_name': 'Helsinki-NLP/opus-mt-en-ro',
'best_model_dir': 'outputs/best_model',
'cache_dir': 'cache_dir/',
'config': {},
'dataset_class': None,
'do_lower_case': False,
'do_sample': False,
'early_stopping': True,
'early_stopping_consider_epochs': False,
'early_stopping_delta': 0,
'early_stopping_metric': 'eval_loss',
'early_stopping_metric_minimize': True,
'early_stopping_patience': 3,
'encoding': None,
'eval_batch_size': 8,
'evaluate_during_training': False,
'evaluate_during_training_steps': 2000,
'evaluate_during_training_verbose': True,
'evaluate_generated_text': True,
'fp16': False,
'fp16_opt_level': 'O1',
'gradient_accumulation_steps': 1,
'learning_rate': 4e-05,
'length_penalty': 2.0,
'logging_steps': 50,
'manual_seed': 4,
'max_grad_norm': 1.0,
'max_length': 50,
'max_seq_length': 50,
'max_steps': -1,
'model_name': 'Helsinki-NLP/opus-mt-en-ro',
'model_type': 'marian',
'multiprocessing_chunksize': 500,
'n_gpu': 1,
'no_cache': False,
'no_save': False,
'num_beams': 1,
'num_train_epochs': 10,
'output_dir': 'outputs/',
'overwrite_output_dir': True,
'process_count': 1,
'repetition_penalty': 1.0,
'reprocess_input_data': True,
'save_best_model': True,
'save_eval_checkpoints': False,
'save_model_every_epoch': False,
'save_steps': 2000,
'silent': False,
'tensorboard_dir': None,
'train_batch_size': 2,
'use_cached_eval_features': False,
'use_cuda': False,
'use_early_stopping': False,
'use_multiprocessing': False,
'wandb_kwargs': {},
'wandb_project': None,
'warmup_ratio': 0.06,
'warmup_steps': 0,
'weight_decay': 0}
Text Generation
If you give a sentence or a phrase to a student, she can develop the sentence into a paragraph based on the context of the phrases. You can train your model to do the same with NLP.
This technique of generating new sentences relevant to context is called Text Generation.
You can implement text generation using torch and Hugging Face’s transformers.
pip install transformers
import torch
The transformers library has various pretrained models with weights. At any time, you can instantiate a pretrained version of a model through the .from_pretrained() method. There are different types of models like BERT, GPT, GPT-2, XLM, etc.
For text generation, the GPT-2 model is preferred. So, let us import the model and tokenizer.
from transformers import GPT2Tokenizer
from transformers import GPT2DoubleHeadsModel
Next, you can initialize your tokenizer and model from a pretrained GPT-2 model. Here, I will use 'gpt2-medium'.
tokenizer=GPT2Tokenizer.from_pretrained('gpt2-medium')
model=GPT2DoubleHeadsModel.from_pretrained('gpt2-medium')
Write the start of the sentence you want to continue and store it in a string.
my_text='It is a bright'
I shall first walk you step by step through the process, to understand how the next word of the sentence is generated. After that, you can loop over the process to generate as many words as you want.
You can pass the string to .encode(), which converts a string into a sequence of ids using the tokenizer and its vocabulary.
ids=tokenizer.encode(my_text)
ids
[1026, 318, 257, 6016]
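To see what "converting a string into ids using a vocabulary" means, here is a toy sketch with a five-word vocabulary. ToyTokenizer is a hypothetical class written only for illustration. The real GPT-2 tokenizer works on sub-word pieces and has a far larger vocabulary, so the ids differ from the ones printed above.

```python
class ToyTokenizer:
    """Whitespace tokenizer with a fixed vocabulary (illustration only)."""

    def __init__(self, vocabulary):
        self.token_to_id = {token: i for i, token in enumerate(vocabulary)}
        self.id_to_token = {i: token for token, i in self.token_to_id.items()}

    def encode(self, text):
        # Look up the id of every whitespace-separated token
        return [self.token_to_id[token] for token in text.split()]

    def decode(self, ids):
        # Map each id back to its token and join with spaces
        return " ".join(self.id_to_token[i] for i in ids)

toy_tokenizer = ToyTokenizer(["It", "is", "a", "bright", "day"])
toy_ids = toy_tokenizer.encode("It is a bright")
print(toy_ids)                        # [0, 1, 2, 3]
print(toy_tokenizer.decode(toy_ids))  # It is a bright
```

This toy version raises a KeyError for any word outside its vocabulary, which is one reason real tokenizers split text into smaller sub-word units.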
Once you have encoded the text, you pass the list of ids to torch.tensor(), which returns a tensor of these input ids. Store it in a variable my_tensor.
my_tensor=torch.tensor([ids])
You have to set your model in evaluation mode through the .eval() method. You can then move the tensor and the model to another device through the .to() method. Here, I have moved both to the GPU ('cuda').
# Setting model in eval mode
model.eval()
my_tensor= my_tensor.to('cuda')
model.to('cuda')
You can pass the CUDA tensor as an argument to your model. The scores for every possible next token will be stored in predictions.
result=model(my_tensor)
predictions=result[0]
predictions is a tensor holding, for each position in the input, a score for every token in the vocabulary.
The torch.argmax() method returns the index of the maximum value in the input tensor. So you pass the scores of the last position to torch.argmax, and the returned value gives you the id of the next word.
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_index
1110
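The indexing predictions[0, -1, :] can look cryptic at first, so here is a small plain-Python sketch of the same selection on a nested list. The scores are made-up numbers and the vocabulary has only four entries. The point is only to show which slice the argmax is taken over.

```python
# Hypothetical scores with shape [batch][position][vocabulary]
toy_predictions = [
    [
        [0.1, 2.0, 0.3, 0.0],  # scores after the 1st input token
        [1.5, 0.2, 0.1, 0.4],  # scores after the 2nd input token
        [0.2, 0.1, 0.3, 3.1],  # scores after the last input token
    ]
]

# Same selection as predictions[0, -1, :]: first batch item, last position
last_position_scores = toy_predictions[0][-1]

# Plain-Python argmax: the index whose score is largest
toy_predicted_index = max(range(len(last_position_scores)),
                          key=last_position_scores.__getitem__)
print(toy_predicted_index)  # 3
```

Only the last position matters here, because its scores describe the token that should come right after the whole input.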
You now have the id of the next word in encoded form. You have to convert it to text through the .decode() method. Here, I have appended the predicted_index to the input ids and then decoded them to get the whole sentence.
predicted_text = tokenizer.decode(ids + [predicted_index])
print(predicted_text)
It is a bright day
You can see the next word "day" has been generated and added to the original text.
Now that you have understood how to generate the next word of a sentence, you can similarly generate the required number of words using a loop.
Let me demonstrate the final method.
# Fix the no of words you want to generate
no_of_words_to_generate=30
# original text
text = 'It is a sunny'
# Looping for each word to be generated
for i in range(no_of_words_to_generate):
    # encode the input text
    input_ids = tokenizer.encode(text)
    # convert into a tensor
    input_tensor = torch.tensor([input_ids])
    # convert into a CUDA tensor
    input_tensor = input_tensor.to('cuda')
    model.to('cuda')
    # passing input tensor to model
    result = model(input_tensor)
    # storing the scores of all candidate next tokens
    predictions = result[0]
    # Choosing the predicted_index (of maximum value)
    predicted_index = torch.argmax(predictions[0,-1,:]).item()
    # decoding the predicted index to text and concatenating to original text
    text = tokenizer.decode(input_ids + [predicted_index])
# end of loop
# Printing the whole generated text
print(text)
It is a sunny day in the city of San Francisco, and the sun is shining. The city is filled with people, and the people are enjoying themselves. The sun
The above output shows the text generated.
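The loop above always picks the single highest-scoring token, a strategy usually called greedy decoding. To show the control flow without a GPU or a model download, here is a toy sketch where a small hand-written table plays the role of the model. The table, its scores and generate_greedy are all hypothetical. A real model looks at the whole preceding text, not just the last word.

```python
# Hypothetical next-word scores: for each word, the score of each candidate successor
next_word_scores = {
    "It": {"is": 0.9, "was": 0.1},
    "is": {"a": 0.7, "the": 0.3},
    "a": {"sunny": 0.6, "bright": 0.4},
    "sunny": {"day": 0.8, "morning": 0.2},
    "day": {".": 1.0},
}

def generate_greedy(start_words, max_new_words):
    """Repeatedly append the highest-scoring successor of the last word."""
    words = list(start_words)
    for _ in range(max_new_words):
        candidates = next_word_scores.get(words[-1])
        if not candidates:
            break  # no known continuation, stop early
        # Greedy choice: keep only the single best candidate (argmax)
        words.append(max(candidates, key=candidates.get))
    return " ".join(words)

generated = generate_greedy(["It", "is", "a", "sunny"], max_new_words=5)
print(generated)  # It is a sunny day .
```

Because the choice is always the top candidate, the same starting text produces the same continuation every time, which also explains why greedy output can become repetitive on longer generations.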
Question-Answering with NLP
Given a source of knowledge, humans can understand it and answer questions based on it. How do you build a model which can do the same?
It is possible with NLP. The transformers library of Hugging Face provides a very easy and advanced method to implement this.
As discussed in previous sections, the pipeline of transformers can be used for particular tasks. In this case, you can pass task='question-answering' as input to pipeline.
# Import pipeline from transformers
from transformers import pipeline
# Create a question-answering pipeline
faq_machine = pipeline(task="question-answering")
The above code returns a QuestionAnsweringPipeline. You can verify it.
faq_machine
Our model requires two inputs: question and context.
context refers to the source text from which we require answers from the model. And of course, you have to pass your question as a string too.
about_junkfood= ' Junk food refers to items like pizza,chips ,oily snacks which have high fat content. They are the main reason for obesity and heart diseases. You need to contol and reduce the amount.You should eat healthy food like fruits and vegetables '
# Passing the question and context to the model.
faq_machine(question=' what is junk food ?',context=about_junkfood)
{'answer': 'pizza,chips ,oily snacks',
'end': 56,
'score': 0.48831075420295633,
'start': 32}
You can notice that faq_machine returns a dictionary which has the answer stored under the answer key.
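The other keys of the dictionary are useful too: start and end are character offsets into the context, and score is the confidence of the model. The short sketch below checks this on the exact context and output shown above, so it runs without loading any model. The variable names qa_context and qa_result are mine.

```python
# Context string and result dictionary copied from the example above
qa_context = ' Junk food refers to items like pizza,chips ,oily snacks which have high fat content. They are the main reason for obesity and heart diseases. You need to contol and reduce the amount.You should eat healthy food like fruits and vegetables '
qa_result = {'answer': 'pizza,chips ,oily snacks',
             'end': 56,
             'score': 0.48831075420295633,
             'start': 32}

# 'start' and 'end' are character offsets into the context string
span = qa_context[qa_result['start']:qa_result['end']]
print(span)                          # pizza,chips ,oily snacks
print(round(qa_result['score'], 2))  # 0.49
```

Slicing the context with these offsets gives back the answer text, which is handy if you want to highlight the answer inside the original passage.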
There are many real-life applications of this. For example, let us say you have a tourism company. Every time a customer has a question, you may not have people available to answer. You can build a model which will do the job for you.
Let me demonstrate how QuestionAnsweringPipeline can be of great help to a tourism company.
# context about the tourist package
information=' The packages we provide are basic, medium and super deluxe. The tourist places include Gangktok,Ooty,Kodaikanal,DShimla,Himachal. The price for all basic packages are between 5000-7000 per person. The price for all basic packages are between 8000-1000 per person. The price for all basic packages are between 12000-20000 per person. Travel will be through train for basic and medium packages.Travel will be through flight for super deluxe package.For local travel, private AC cars shall be provided'
# Querying the model by passing question and context
answer=faq_machine(question='What packages do you offer ?',context=information)
#Printing the answer only from the dictionary
answer['answer']
'basic, medium and super deluxe.'
Let us try another question too.
# Another example
answer=faq_machine(question='What about local travel ?',context=information)
answer['answer']
'private AC cars shall be provided'
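In a real FAQ bot you would probably not want to show answers the model is unsure about. One possible approach, sketched below, is to compare the score with a threshold and fall back to a fixed message. The helper answer_or_fallback, the threshold of 0.3 and the fallback text are my assumptions, and fake_pipeline with its made-up scores only stands in for faq_machine so that the sketch runs without downloading a model.

```python
def answer_or_fallback(qa_pipeline, question, context, min_score=0.3,
                       fallback="Sorry, please contact our support team."):
    """Return the extracted answer, or a fallback when the score is too low."""
    result = qa_pipeline(question=question, context=context)
    if result["score"] < min_score:
        return fallback
    return result["answer"]

# Stand-in for a real pipeline so the sketch runs without downloading a model
def fake_pipeline(question, context):
    if "local travel" in question:
        return {"answer": "private AC cars shall be provided", "score": 0.62}
    return {"answer": "Gangktok", "score": 0.05}

print(answer_or_fallback(fake_pipeline, "What about local travel ?", "..."))
print(answer_or_fallback(fake_pipeline, "Do you offer visas ?", "..."))
```

With the real pipeline you would pass faq_machine and information instead of the stand-ins, and tune min_score on a few sample questions.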
Text Classification
You may have noticed that most products and services ask for our feedback or review. How do they get to know whether we liked their product or not? It is impractical to classify each review as positive or negative manually!
This is where Text Classification with NLP takes the stage. You can classify texts into different groups based on their content.
An advanced method for implementing text classification is through the simpletransformers library. Install it with the command below.
pip install simpletransformers
Now, I will walk you through a real-data example of classifying movie reviews as positive or negative. You can download the dataset from here.
First, read the dataset into a pandas dataframe and view its columns.
import pandas as pd
movie_data=pd.read_csv('/content/movie_reviews.csv')
movie_data.head()
| | Unnamed: 0 | review | sentiment |
|---|---|---|---|
| 0 | 0 | One of the other reviewers has mentioned that … | positive |
| 1 | 1 | A wonderful little production. <br /><br />The… | positive |
| 2 | 2 | I thought this was a wonderful way to spend ti… | positive |
| 3 | 3 | Basically there’s a family where a little boy … | negative |
| 4 | 4 | Petter Mattei’s “Love in the Time of Money” is… | positive |
You can see it has review, which is our text data, and sentiment, which is the classification label. You need to build a model trained on movie_data, which can classify any new review as positive or negative.
The simpletransformers library has ClassificationModel, which is especially designed for text classification problems.
from simpletransformers.classification import ClassificationModel
You can initialize your model as shown below.
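# Training options: one epoch, no per-epoch checkpoints, overwrite any previous output directory
train_args = {'fp16': False, 'save_model_every_epoch': False, 'num_train_epochs': 1, 'overwrite_output_dir': True, 'learning_rate': 1e-5}
# BERT (large, cased) classifier; use_cuda=True requires a GPU
model = ClassificationModel('bert', 'bert-large-cased', use_cuda=True, args=train_args)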
You should note that the training data you provide to ClassificationModel should contain the text in the first column and the label in the next column. So, you need to preprocess your training data accordingly.
# Encode the sentiment as an integer label: 1 for positive, 0 for negative
movie_data['label'] = (movie_data['sentiment'] == 'positive').astype(int)
# Keep only the text column followed by the label column
movie_data = movie_data[['review', 'label']]
movie_data.head()
| | review | label |
|---|---|---|
| 0 | One of the other reviewers has mentioned that … | 1 |
| 1 | A wonderful little production. <br /><br />The… | 1 |
| 2 | I thought this was a wonderful way to spend ti… | 1 |
| 3 | Basically there’s a family where a little boy … | 0 |
| 4 | Petter Mattei’s “Love in the Time of Money” is… | 1 |
You can see that the training data is in the required form. You can pass it to the model.train_model() function.
model.train_model(movie_data)
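Training on every row leaves nothing to measure the model against. One common option is to hold out a part of the data before training. The sketch below shows the idea on a tiny made-up dataframe that stands in for movie_data; the example reviews, the 80/20 split and the names train_df and eval_df are assumptions for illustration.

```python
import pandas as pd

# Tiny stand-in for the preprocessed movie_data (text first, label second)
reviews = pd.DataFrame({
    'review': [
        'Loved it', 'Terrible plot', 'Great acting', 'Too long and dull',
        'A wonderful film', 'Waste of time', 'Brilliant direction',
        'Poor script', 'Really enjoyable', 'Not worth watching',
    ],
    'label': [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Randomly pick 80% of the rows for training (fixed seed for repeatability)
train_df = reviews.sample(frac=0.8, random_state=42)
# The rows that were not sampled form the evaluation set
eval_df = reviews.drop(train_df.index)

print(len(train_df), len(eval_df))
```

With the real data, you could then call model.train_model(train_df) and check the held-out rows with model.eval_model(eval_df), which simpletransformers provides for this purpose.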
Now that your model is trained, you can pass a new review string to the model.predict() function and check the output.
model.predict(['The movie was amazing '])
(array([1]), array([[-3.3968935, 4.061106 ]], dtype=float32))
From the above output, you can see that the model has assigned label 1 (positive) to your input review. The second array holds the raw model outputs for each class. Similarly, for negative input reviews it assigns label 0. Thus, your model is working as expected.
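The second array returned by model.predict() is not a probability. Assuming these raw outputs are unnormalized scores (logits), one score per class, you can turn them into probabilities with a softmax. Here is a minimal sketch in plain Python using the numbers printed above; the softmax helper and the label_names mapping are written here for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)
    # Subtract the maximum before exponentiating for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Raw outputs for 'The movie was amazing ' from the prediction above
raw_output = [-3.3968935, 4.061106]
probs = softmax(raw_output)

# Index 0 is the negative class and index 1 the positive class
label_names = {0: 'negative', 1: 'positive'}
predicted = max(range(len(probs)), key=lambda i: probs[i])

print(label_names[predicted], probs[predicted])
```

The larger the gap between the two raw scores, the closer the winning probability gets to 1, so this gives you a quick sense of how sure the model is about a review.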
You have seen the various uses of NLP techniques in this article. I hope you can now efficiently perform these tasks on any real dataset. The field of NLP is evolving rapidly, with new innovations appearing all the time. Stay tuned for more articles.