How to Train spaCy to Autodetect New Entities (NER) [Complete Guide]

Named-entity recognition (NER) is the process of automatically identifying the entities discussed in a text and classifying them into pre-defined categories such as ‘person’, ‘organization’, ‘location’ and so on. The spaCy library lets you train NER models in two ways: by updating an existing spaCy model to suit the specific context of your text documents, or by training a fresh NER model from scratch. This article explains both methods in detail.

Contents

  1. Introduction to Named Entity Recognition
  2. Need for Custom NER
  3. Updating the Named Entity Recognizer
  4. Format of the training examples
  5. Training the NER model
  6. Let’s predict on new texts the model has not seen
  7. How to train NER from a blank SpaCy model
  8. Training completely new entity type in spaCy

1. Introduction to Named Entity Recognition

spaCy is an open-source library for NLP. It is widely used because of its flexible and advanced features. Before diving into how NER is implemented in spaCy, let’s quickly understand what a Named Entity Recognizer is.

Named Entity Recognition is a standard NLP task that identifies the entities discussed in a text document. A Named Entity Recognizer is a model that performs this task. It should be able to identify named entities like ‘America’, ‘Emily’, ‘London’, etc., and categorize them as PERSON, LOCATION, and so on. It is a very useful tool for information retrieval. In spaCy, Named Entity Recognition is implemented by the pipeline component ner. Most of the models have it in their processing pipeline by default.

# Load a spaCy model and check if it has ner
import spacy
nlp=spacy.load('en_core_web_sm')

nlp.pipe_names
#> ['tagger', 'parser', 'ner']

In case your model does not have ner, you can add it using the nlp.add_pipe() method.

2. Need for Custom NER model

As you saw, spaCy has a built-in pipeline component, ner, for named entity recognition. Though it performs well, it’s not always completely accurate for your text. Sometimes a word can be categorized as PERSON or ORG depending upon the context. Also, the category you want may not be built into spaCy.

Let’s have a look at how the default NER performs on an article about E-commerce companies.

# Performing NER on E-commerce article

article_text="""India that previously comprised only a handful of players in the e-commerce space, is now home to many biggies and giants battling out with each other to reach the top. This is thanks to the overwhelming internet and smartphone penetration coupled with the ever-increasing digital adoption across the country. These new-age innovations not only gave emerging startups a unique platform to deliver seamless shopping experiences but also provided brick and mortar stores with a level-playing field to begin their online journeys without leaving their offline legacies.
In the wake of so many players coming together on one platform, the Indian e-commerce market is envisioned to reach USD 84 billion in 2021 from USD 24 billion in 2017. Further, with the rate at which internet penetration is increasing, we can expect more and more international retailers coming to India in addition to a large pool of new startups. This, in turn, will provide a major Philip to the organized retail market and boost its share from 12% in 2017 to 22-25% by 2021. 
Here’s a view to the e-commerce giants that are dominating India’s online shopping space:
Amazon – One of the uncontested global leaders, Amazon started its journey as a simple online bookstore that gradually expanded its reach to provide a large suite of diversified products including media, furniture, food, and electronics, among others. And now with the launch of Amazon Prime and Amazon Music Limited, it has taken customer experience to a godly level, which will remain undefeatable for a very long time. 
Flipkart – Founded in 2007, Flipkart is recognized as the national leader in the Indian e-commerce market. Just like Amazon, it started operating by selling books and then entered other categories such as electronics, fashion, and lifestyle, mobile phones, etc. And now that it has been acquired by Walmart, one of the largest leading platforms of e-commerce in the US, it has also raised its bar of customer offerings in all aspects and giving huge competition to Amazon. 
Snapdeal – Started as a daily deals platform in 2010, Snapdeal became a full-fledged online marketplace in 2011 comprising more than 3 lac sellers across India. The platform offers over 30 million products across 800+ diverse categories from over 125,000 regional, national, and international brands and retailers. The Indian e-commerce firm follows a robust strategy to stay at the forefront of innovation and deliver seamless customer offerings to its wide customer base. It has shown great potential for recovery in recent years despite losing Freecharge and Unicommerce. 
ShopClues – Another renowned name in the Indian e-commerce industry, ShopClues was founded in July 2011. It’s a Gurugram based company having a current valuation of INR 1.1 billion and is backed by prominent names including Nexus Venture Partners, Tiger Global, and Helion Ventures as its major investors. Presently, the platform comprises more than 5 lac sellers selling products in nine different categories such as computers, cameras, mobiles, etc. 
Paytm Mall – To compete with the existing e-commerce giants, Paytm, an online payment system has also launched its online marketplace – Paytm Mall, which offers a wide array of products ranging from men and women fashion to groceries and cosmetics, electronics and home products, and many more. The unique thing about this platform is that it serves as a medium for third parties to sell their products directly through the widely-known app – Paytm. 
Reliance Retail – Given Reliance Jio’s disruptive venture in the Indian telecom space along with a solid market presence of Reliance, it is no wonder that Reliance will soon be foraying into retail space. As of now, it has plans to build an e-commerce space that will be established on online-to-offline market program and aim to bring local merchants on board to help them boost their sales and compete with the existing industry leaders. 
Big Basket – India’s biggest online supermarket, Big Basket provides a wide variety of imported and gourmet products through two types of delivery services – express delivery and slotted delivery. It also offers pre-cut fruits along with a long list of beverages including fresh juices, cold drinks, hot teas, etc. Moreover, it not only provides farm-fresh products but also ensures that the farmer gets better prices. 
Grofers – One of the leading e-commerce players in the grocery segment, Grofers started its operations in 2013 and has reached overwhelming heights in the last 5 years. Its wide range of products includes atta, milk, oil, daily need products, vegetables, dairy products, juices, beverages, among others. With its growing reach across India, it has become one of the favorite supermarkets for Indian consumers who want to shop grocery items from the comforts of their homes. 
Digital Mall of Asia – Going live in 2020, Digital Mall of Asia is a very unique concept coined by the founders of Yokeasia Malls. It is designed to provide an immersive digital space equipped with multiple visual and sensory elements to sellers and shoppers. It will also give retailers exclusive rights to sell a particular product category or brand in their respective cities. What makes it unique is its zero-commission model enabling retailers to pay only a fixed amount of monthly rental instead of paying commissions. With its one-of-a-kind features, DMA is expected to bring
never-seen transformation to the current e-commerce ecosystem while addressing all the existing e-commerce worries such as counterfeiting. """

doc=nlp(article_text)
for ent in doc.ents:
  print(ent.text,ent.label_)
#> India GPE
#> one CARDINAL
#> Indian NORP
#> USD 84 billion MONEY
#> 2021 DATE
#> USD 24 billion MONEY
#> 2017 DATE
#> India GPE
#> Philip PERSON
#> 12% PERCENT
#> 2017 DATE
#> 22-25% PERCENT
#> 2021 DATE
...(truncated)...

Observe the above output. Notice that Flipkart has been identified as PERSON; it should have been ORG. Walmart has also been categorized wrongly as LOC; in this context it should have been ORG. The same goes for Freecharge, ShopClues, etc.

In cases like this, you’ll face the need to update and train the NER as per the context and requirements. The next section will tell you how to do it.

3. Updating the Named Entity Recognizer

In the previous section, you saw why we need to update and train the NER. Now, let’s go ahead and see how to do it.

Let’s say you have a variety of texts about customer statements and companies. The task is to make sure the NER recognizes each company as ORG and not as PERSON, places the unidentified products under PRODUCT, and so on.

To enable this, you need to provide training examples which will make the NER learn for future samples.

To do this, let’s use an existing pre-trained spacy model and update it with newer examples.

First, let’s load a pre-existing spaCy model with a built-in ner component. Then, get the Named Entity Recognizer using the get_pipe() method.

# Load pre-existing spacy model
import spacy
nlp=spacy.load('en_core_web_sm')

# Getting the pipeline component
ner=nlp.get_pipe("ner")

To update a pretrained model with new examples, you’ll have to provide many examples to meaningfully improve the system: a few hundred is a good start, although more is better.

4. Format of the training examples

spaCy accepts training data as a list of tuples.

Each tuple should contain the text and a dictionary. The dictionary should hold the start and end character indices of each named entity in the text, along with the category or label of the named entity. The end index is exclusive, as in Python slicing.

For example, ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]})

To do this, you’ll need example texts and the character offsets and labels of each entity contained in the texts.
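Counting character offsets by hand is error-prone, so it can help to compute them instead. The following is a minimal sketch in plain Python (no spaCy required). The helper name make_example is hypothetical, and it assumes that each entity string occurs in the text and that its first occurrence is the one you want to label.

```python
def make_example(text, entities):
    """Build a (text, {"entities": [...]}) tuple from (substring, label) pairs.

    Assumption: the first occurrence of each substring is the entity.
    """
    spans = []
    for substring, label in entities:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        # The end offset is exclusive, matching Python slicing.
        spans.append((start, start + len(substring), label))
    return (text, {"entities": spans})


walmart = make_example("Walmart is a leading e-commerce company", [("Walmart", "ORG")])
shopclues = make_example("I ordered this from ShopClues", [("ShopClues", "ORG")])
print(walmart)
print(shopclues)
```

If an entity string appears more than once in a sentence, this sketch only labels the first occurrence, so such examples would still need manual offsets.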

# training data
TRAIN_DATA = [
              ("Walmart is a leading e-commerce company", {"entities": [(0, 7, "ORG")]}),
              ("I reached Chennai yesterday.", {"entities": [(10, 17, "GPE")]}),
              ("I recently ordered a book from Amazon", {"entities": [(31, 37, "ORG")]}),
              ("I was driving a BMW", {"entities": [(16, 19, "PRODUCT")]}),
              ("I ordered this from ShopClues", {"entities": [(20, 29, "ORG")]}),
              ("Fridge can be ordered in Amazon ", {"entities": [(0, 6, "PRODUCT"), (25, 31, "ORG")]}),
              ("I bought a new Washer", {"entities": [(15, 21, "PRODUCT")]}),
              ("I bought a old table", {"entities": [(15, 20, "PRODUCT")]}),
              ("I bought a fancy dress", {"entities": [(17, 22, "PRODUCT")]}),
              ("I rented a camera", {"entities": [(11, 17, "PRODUCT")]}),
              ("I rented a tent for our trip", {"entities": [(11, 15, "PRODUCT")]}),
              ("I rented a screwdriver from our neighbour", {"entities": [(11, 22, "PRODUCT")]}),
              ("I repaired my computer", {"entities": [(14, 22, "PRODUCT")]}),
              ("I got my clock fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("I got my truck fixed", {"entities": [(9, 14, "PRODUCT")]}),
              ("Flipkart started its journey from zero", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Max", {"entities": [(24, 27, "ORG")]}),
              ("Flipkart is recognized as leader in market", {"entities": [(0, 8, "ORG")]}),
              ("I recently ordered from Swiggy", {"entities": [(24, 30, "ORG")]})
              ]

The above code shows the training format. You have to add these labels to the ner using the ner.add_label() method of the pipeline component. The code below demonstrates this.

# Adding labels to the `ner`

for _, annotations in TRAIN_DATA:
  for ent in annotations.get("entities"):
    ner.add_label(ent[2])

Now it’s time to train the NER over these examples. But before you train, remember that apart from ner, the model has other pipeline components. These components should not be affected by training.

So, disable the other pipeline components through the nlp.disable_pipes() method.

# Disable pipeline components you don't need to change
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
unaffected_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

You have to perform the training with unaffected_pipes disabled.

5. Training the NER model

First, let’s understand the ideas involved before going to the code.

(a) To train an ner model, the model has to loop over the examples for a sufficient number of iterations. If you train it for just 5 or 6 iterations, it may not be effective.

(b) Before every iteration, it’s good practice to shuffle the examples randomly through the random.shuffle() function.

This ensures the model does not make generalizations based on the order of the examples.

(c) The training data is usually passed in batches.

You can call spaCy’s minibatch() function over the training data, which returns the data in batches. The minibatch function takes a size parameter to denote the batch size. You can use the utility function compounding to generate an infinite series of compounding values for it.

The compounding() function takes three inputs: start (the first value), stop (the maximum value that can be generated) and compound, the factor by which each value is multiplied to produce the next one. For example, compounding(4.0, 32.0, 1.001) yields 4.0, 4.004, 4.008004 and so on, never exceeding 32.0.
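To make the idea of a compounding series concrete, here is a minimal plain-Python sketch of such a generator. It is an illustrative re-implementation, not spaCy’s own source; the name compounding_sketch is hypothetical, and the sketch assumes start <= stop and a factor of at least 1.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop.

    Illustrative only: assumes start <= stop and compound >= 1.
    """
    value = start
    while True:
        yield value
        # Multiply by the factor, but never go past the maximum.
        value = min(value * compound, stop)


# A large factor (2.0) is used here only so the growth is easy to see;
# the training code in this article uses 1.001.
first_sizes = list(islice(compounding_sketch(4.0, 32.0, 2.0), 6))
print(first_sizes)  # [4.0, 8.0, 16.0, 32.0, 32.0, 32.0]
```

With a factor as small as 1.001 the batch size grows very slowly, so on a tiny dataset it stays close to the start value for the whole run.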

For each iteration, the model or ner is updated through the nlp.update() command. The parameters of nlp.update() are:

  • docs: This expects a batch of texts as input. You can pass each batch to the zip method, which will return batches of texts and annotations.

  • golds: You can pass the annotations obtained from the zip method here.

  • drop: This represents the dropout rate.

  • losses: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.

At each word, update() makes a prediction. It then consults the annotations to check if the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.

Finally, all of the training is done within the context of the nlp model with the other pipeline components disabled, to prevent them from being involved.

# Import requirements
import random
from spacy.util import minibatch, compounding
from pathlib import Path

# TRAINING THE MODEL
with nlp.disable_pipes(*unaffected_pipes):

  # Training for 30 iterations
  for iteration in range(30):

    # shuffling examples before every iteration
    random.shuffle(TRAIN_DATA)
    losses = {}
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(
                    texts,  # batch of texts
                    annotations,  # batch of annotations
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
        print("Losses", losses)
#> Losses {'ner': 1.7402108064597996}
#> Losses {'ner': 6.028685417995803}
#> Losses {'ner': 8.19496433725908}
#> Losses {'ner': 10.814626494518961}
#> Losses {'ner': 16.469317436159923}
#> Losses {'ner': 1.5056047175115737}
#> Losses {'ner': 1.6088395608770725}
#> Losses {'ner': 7.383738242986851}
#> Losses {'ner': 9.185998913797789}
#> Losses {'ner': 10.913445074821482}
...(truncated)...

6. Let’s predict on new texts the model has not seen

Training of our NER is complete now. You can test if the ner is now working as you expected. If it’s not up to your expectations, include more training examples and try again.

# Testing the model
doc = nlp("I was driving a Alto")
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
#> Entities [('Alto', 'PRODUCT')]

Yay!

You can observe that even though I didn’t directly train the model to recognize “Alto” as a vehicle name, it has predicted based on the similarity of context.

This is the awesome part of the NER model.

A good model does not just memorize the training examples. It should learn from them and be able to generalize to new examples.
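A single test sentence says little about how well the model generalizes. One option is to score its predictions against a few held-out annotated sentences. The sketch below is plain Python with a hypothetical helper name, score_entities, and made-up predicted spans; in practice the predicted spans would come from something like [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents].

```python
def score_entities(gold, predicted):
    """Return (precision, recall, f1) for two collections of (start, end, label) spans.

    A predicted span counts as correct only if start, end and label all match.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical spans for "Fridge can be ordered in FlipKart":
gold_spans = [(0, 6, "PRODUCT"), (25, 33, "ORG")]
predicted_spans = [(0, 6, "PRODUCT"), (25, 33, "PERSON")]  # wrong label on the second span
scores = score_entities(gold_spans, predicted_spans)
print(scores)  # (0.5, 0.5, 0.5)
```

Exact-match scoring is strict: a span with the right boundaries but the wrong label counts as both a false positive and a false negative.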

Once you find the performance of the model satisfactory, save the updated model.

You can save it to your desired directory through the to_disk command.

After saving, you can load the model from the directory at any point of time by passing the directory path to the spacy.load() function.

# Save the  model to directory
output_dir = Path('/content/')
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Load the saved model and predict
print("Loading from", output_dir)
nlp_updated = spacy.load(output_dir)
doc = nlp_updated("Fridge can be ordered in FlipKart" )
print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
#> Saved model to /content
#> Loading from /content
#> Entities [('Fridge', 'PRODUCT'), ('FlipKart', 'ORG')]

The above output shows that our model has been updated and works as per our expectations. This is how you can update and train the Named Entity Recognizer of any existing model in spaCy.

7. How to train NER from a blank SpaCy model

If you don’t want to use a pre-existing model, you can create an empty model using spacy.blank() by just passing the language ID. For creating an empty model in the English language, you have to pass “en”.

After this, most of the steps for training the NER are similar. The key points to remember are:

  1. As it is an empty model, it does not have any pipeline component by default. You have to add the ner to the pipeline through the add_pipe() method.

  2. You do not have to disable other pipeline components as in the previous case.

  3. You must provide a comparatively larger number of training examples in this case.

  4. Before you start training the new model, call nlp.begin_training().

The code below shows the initial steps for training the NER of a new empty model.

# Train NER from a blank spacy model
import spacy

nlp=spacy.blank("en")

nlp.add_pipe(nlp.create_pipe('ner'))

nlp.begin_training()
#> <thinc.neural.optimizers.Optimizer at 0x7f71cb046320>

After this, you can follow exactly the same procedure as for a pre-existing model.

This is how you can train the named entity recognizer to identify and categorize correctly as per the context.

8. Training completely new entity type in spaCy

In the previous section, we saw how to train the ner to categorize correctly.

What if you want to place an entity in a category that’s not already present?

Suppose you have a lot of text data on the food consumed in diverse areas, and you want the NER to classify all the food items under the category FOOD. But there’s no such existing category.

What can you do?

spaCy is highly flexible and allows you to add a new entity type and train the model. This feature is extremely useful as it allows you to add new entity types for easier information retrieval. This section explains how to implement it.

First, load the pre-existing spaCy model you want to use and get the ner pipeline component through the get_pipe() method.

# Import and load the spacy model
import spacy
nlp=spacy.load("en_core_web_sm") 

# Getting the ner component
ner=nlp.get_pipe('ner')

Next, store the name of the new category / entity type in a string variable LABEL.

Now, how will the model know which entities should be classified under the new label?

You will have to train the model with examples. The training examples should teach the model what type of entities should be classified as FOOD.

Let us prepare the training data.

The format of the training data is a list of tuples. Each tuple contains the example text and a dictionary. The dictionary has the key entities, which stores the start and end indices along with the label of the entities present in the text.

For example, to pass “Pizza is a common fast food” as an example, the format will be: ("Pizza is a common fast food",{"entities" : [(0, 5, "FOOD")]})

The code below shows the training data I have prepared.

# New label to add
LABEL = "FOOD"

# Training examples in the required format
TRAIN_DATA =[ ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),
              ("Pasta is an italian recipe", {"entities": [(0, 5, "FOOD")]}),
              ("China's noodles are very famous", {"entities": [(8, 15, "FOOD")]}),
              ("Shrimps are famous in China too", {"entities": [(0, 7, "FOOD")]}),
              ("Lasagna is another classic of Italy", {"entities": [(0, 7, "FOOD")]}),
              ("Sushi is extremely famous and expensive Japanese dish", {"entities": [(0, 5, "FOOD")]}),
              ("Unagi is a famous seafood of Japan", {"entities": [(0, 5, "FOOD")]}),
              ("Tempura , Soba are other famous dishes of Japan", {"entities": [(0, 7, "FOOD"), (10, 14, "FOOD")]}),
              ("Udon is a healthy type of noodles", {"entities": [(0, 4, "FOOD")]}),
              ("Chocolate soufflé is extremely famous french cuisine", {"entities": [(0, 17, "FOOD")]}),
              ("Flamiche is french pastry", {"entities": [(0, 8, "FOOD")]}),
              ("Burgers are the most commonly consumed fastfood", {"entities": [(0, 7, "FOOD")]}),
              ("Frenchfries are considered too oily", {"entities": [(0, 11, "FOOD")]})
           ]
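Before training, it can help to sanity-check hand-written annotations, since an offset that is off by one silently produces a bad example. Below is a minimal sketch in plain Python; the function name find_annotation_problems and the specific checks are assumptions chosen for illustration, not part of spaCy.

```python
def find_annotation_problems(train_data):
    """Return human-readable problems found in (text, {"entities": [...]}) examples."""
    problems = []
    for text, annotations in train_data:
        previous_end = 0
        for start, end, label in sorted(annotations.get("entities", [])):
            span = text[start:end]
            if not (0 <= start < end <= len(text)):
                problems.append(f"{text!r}: ({start}, {end}) is out of bounds")
            elif span != span.strip():
                problems.append(f"{text!r}: {span!r} has surrounding whitespace")
            elif (start > 0 and text[start - 1].isalnum()) or (
                end < len(text) and text[end].isalnum()
            ):
                problems.append(f"{text!r}: {span!r} starts or ends inside a word")
            elif start < previous_end:
                problems.append(f"{text!r}: {span!r} overlaps the previous entity")
            previous_end = max(previous_end, end)
    return problems


sample = [
    ("Pizza is a common fast food.", {"entities": [(0, 5, "FOOD")]}),   # fine
    ("Pizza is a common fast food.", {"entities": [(0, 6, "FOOD")]}),   # trailing space
    ("Pizza is a common fast food.", {"entities": [(0, 4, "FOOD")]}),   # cuts the word short
    ("Pizza is a common fast food.", {"entities": [(0, 40, "FOOD")]}),  # past the end of the text
]
problems = find_annotation_problems(sample)
for problem in problems:
    print(problem)
```

Running the same function over TRAIN_DATA and getting an empty list back is a cheap check, although it cannot tell whether a label itself is the right one.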

Now that the training data is ready, we can go ahead to see how these examples are used to train the ner.

Remember that the label “FOOD” is not known to the model yet.

So, our first task will be to add the label to ner through the add_label() method. Next, you can use the resume_training() function to return an optimizer.

Also, the other pipeline components would be affected when training is done. To prevent this, use the disable_pipes() method to disable all other pipes.

# Add the new label to ner
ner.add_label(LABEL)

# Resume training
optimizer = nlp.resume_training()
move_names = list(ner.move_names)

# List of pipes you want to train
pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]

# List of pipes which should remain unaffected in training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

A few key things to take care of:

a) You have to pass the examples through the model for a sufficient number of iterations. Here, I use 30 iterations.

b) Remember to tune the number of iterations according to performance. Also, before every iteration it’s better to shuffle the examples randomly through the random.shuffle() function. This ensures the model does not make generalizations based on the order of the examples.

c) The training data has to be passed in batches. You can call spaCy’s minibatch() function over the training examples, which returns the data in batches. The size parameter of the minibatch function denotes the batch size.
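To illustrate what batching with a changing size looks like, here is a minimal plain-Python sketch. It is not spaCy’s implementation; the name minibatch_sketch is hypothetical, and it assumes that size is either an int or an iterator that never runs out (as an infinite series of sizes would not).

```python
from itertools import count, islice, repeat


def minibatch_sketch(items, size):
    """Yield successive lists from items.

    size is either an int or an infinite iterator of batch sizes.
    """
    sizes = repeat(size) if isinstance(size, int) else iter(size)
    items = iter(items)
    while True:
        batch = list(islice(items, int(next(sizes))))
        if not batch:
            return
        yield batch


examples = list(range(10))  # stand-ins for (text, annotations) tuples
growing = list(minibatch_sketch(examples, count(2)))  # sizes 2, 3, 4, ...
fixed = list(minibatch_sketch(examples, 4))
print(growing)  # [[0, 1], [2, 3, 4], [5, 6, 7, 8], [9]]
print(fixed)    # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The last batch is simply whatever is left over, so it may be smaller than the requested size.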

For each iteration, the model or ner is updated through the nlp.update() command. The parameters of nlp.update() are:

  • docs: This expects a batch of texts as input. You can pass each batch to the zip method, which will return batches of texts and annotations.

  • sgd: You have to pass the optimizer that was returned by resume_training() here.

  • golds: You can pass the annotations obtained from the zip method here.

  • drop: This represents the dropout rate.

  • losses: A dictionary to hold the losses against each pipeline component. Create an empty dictionary and pass it here.

At each word, update() makes a prediction. It then consults the annotations to check if the prediction is right. If it isn’t, it adjusts the weights so that the correct action will score higher next time.

# Importing requirements
from spacy.util import minibatch, compounding
import random

# Begin training by disabling other pipeline components
with nlp.disable_pipes(*other_pipes) :

  sizes = compounding(1.0, 4.0, 1.001)
  # Training for 30 iterations     
  for itn in range(30):
    # shuffle examples before training
    random.shuffle(TRAIN_DATA)
    # batch up the examples using spaCy's minibatch
    batches = minibatch(TRAIN_DATA, size=sizes)
    # dictionary to store losses
    losses = {}
    for batch in batches:
      texts, annotations = zip(*batch)
      # Calling update() over the iteration
      nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)
      print("Losses", losses)
#> Losses {'ner': 5.880429838085547}
#> Losses {'ner': 11.32279500805599}
#> Losses {'ner': 13.28890202592863}
#> Losses {'ner': 17.162664665228625}
...(truncated)...
#> Losses {'ner': 67.31498911509686}
#> Losses {'ner': 71.54807552228147}

Training of our NER is complete now.

Let’s test if the ner can identify our new entity. If it’s not up to your expectations, try including more training examples.

# Testing the NER

test_text = "I ate Sushi yesterday. Maggi is a common fast food "
doc = nlp(test_text)
print("Entities in '%s'" % test_text)
for ent in doc.ents:
  print(ent)
#> Entities in 'I ate Sushi yesterday. Maggi is a common fast food '
#> Sushi
#> Maggi

Observe the above output. The model has correctly identified the FOOD items. Also, notice that I had not passed “Maggi” as a training example to the model. Still, based on the similarity of context, the model has identified “Maggi” as FOOD too. This is an important requirement! Our model should not just memorize the training examples. It should learn from them and generalize to new examples.

Once you find the performance of the model satisfactory, you can save the updated model to a directory using the to_disk command. You can load the model from the directory at any point of time by passing the directory path to the spacy.load() function.

# Output directory
from pathlib import Path
output_dir=Path('/content/')

# Saving the model to the output directory
if not output_dir.exists():
  output_dir.mkdir()
nlp.meta['name'] = 'my_ner'  # rename model
nlp.to_disk(output_dir)
print("Saved model to", output_dir)

# Loading the model from the directory
print("Loading from", output_dir)
nlp2 = spacy.load(output_dir)
assert nlp2.get_pipe("ner").move_names == move_names
doc2 = nlp2(' Dosa is an extremely famous south Indian dish')
for ent in doc2.ents:
  print(ent.label_, ent.text)
#> Saved model to /content
#> Loading from /content
#> FOOD Dosa

You can see that the model works as per our expectations. This is how you can train an additional entity type for the Named Entity Recognizer of spaCy.

I hope you have understood when and how to use custom NERs. Custom training of models has proven to be a game changer in many cases. It’s because of this flexibility that spaCy is widely used for NLP. Stay tuned for more such posts.