Topic Modeling with Gensim (Python)

Topic modeling is a technique for extracting the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python’s Gensim package. The challenge, however, is to extract good-quality topics that are clear, segregated and meaningful. This depends heavily on the quality of text preprocessing and on the strategy for finding the optimal number of topics. This tutorial attempts to tackle both of these problems.

1. Introduction

One of the primary applications of natural language processing is to automatically extract the topics people are discussing from large volumes of text. Examples of such text include social media feeds, customer reviews of hotels and movies, user feedback, news stories, customer complaint e-mails, and so on.

Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators and political campaigns. But it’s really hard to manually read through such large volumes and compile the topics.

What is needed is an automated algorithm that can read through the text documents and output the topics discussed.

In this tutorial, we will take a real example of the ’20 Newsgroups’ dataset and use LDA to extract the naturally discussed topics.

I will be using the Latent Dirichlet Allocation (LDA) implementation from the Gensim package, along with Mallet’s implementation (accessed via Gensim). Mallet’s implementation of LDA is efficient: it is known to run faster and to give better topic segregation.

We will also extract the volume and percentage contribution of each topic to get an idea of how important a topic is.

Let’s begin!

2. Prerequisites – Download nltk stopwords and spacy model

We will need the stopwords from NLTK and spacy’s en model for text pre-processing. Later, we will be using the spacy model for lemmatization.

Lemmatization is nothing but converting a word to its root word. For example: the lemma of the word ‘machines’ is ‘machine’. Likewise, ‘walking’ –> ‘walk’, ‘mice’ –> ‘mouse’ and so on.
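To make the idea concrete, here is a toy, purely illustrative lemmatizer covering just these examples. A real pipeline relies on spaCy’s statistical lemmatizer; the suffix rules and the irregular-forms table below are invented for this sketch:

```python
# Toy rule-based lemmatizer -- for illustration only. The tutorial itself uses
# spaCy, which handles far more than these few hand-written rules.
IRREGULAR = {'mice': 'mouse', 'geese': 'goose', 'went': 'go'}

def toy_lemma(word):
    word = word.lower()
    if word in IRREGULAR:            # irregular forms need a lookup table
        return IRREGULAR[word]
    if word.endswith('ing'):         # 'walking' -> 'walk'
        return word[:-3]
    if word.endswith('s') and not word.endswith('ss'):  # 'machines' -> 'machine'
        return word[:-1]
    return word

print([toy_lemma(w) for w in ['machines', 'walking', 'mice']])
# -> ['machine', 'walk', 'mouse']
```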

# Run in python console
import nltk; nltk.download('stopwords')

# Run in terminal or command prompt
python3 -m spacy download en

3. Import Packages

The core packages used in this tutorial are re, gensim, spacy and pyLDAvis. Besides these, we will also be using matplotlib, numpy and pandas for data handling and visualization. Let’s import them.

import re
import numpy as np
import pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Plotting tools
import pyLDAvis
import pyLDAvis.gensim  # don't skip this
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)

4. What does LDA do?

LDA’s approach to topic modeling is to treat each document as a collection of topics, each present in a certain proportion, and each topic as a collection of keywords that, again, contribute to the topic in a certain proportion.

Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distributions.

When I say topic, what actually is it and how is it represented?

A topic is nothing but a collection of dominant keywords that are typical representatives of it. Just by looking at the keywords, you can identify what the topic is about.
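This generative picture can be sketched with a toy example. Both the topic-word distributions and the document’s topic mixture below are invented for illustration:

```python
import random

random.seed(42)

# Hypothetical topics: each topic is a probability distribution over words.
topics = {
    'autos':  {'car': 0.5, 'engine': 0.3, 'drive': 0.2},
    'sports': {'game': 0.4, 'team': 0.4, 'win': 0.2},
}

# A document is a mixture of topics in certain proportions.
doc_topic_mix = {'autos': 0.7, 'sports': 0.3}

def sample_word():
    # 1. Pick a topic according to the document's topic proportions.
    topic = random.choices(list(doc_topic_mix), weights=list(doc_topic_mix.values()))[0]
    # 2. Pick a word according to that topic's word distribution.
    words = topics[topic]
    return random.choices(list(words), weights=list(words.values()))[0]

# 'Generating' a 10-word document -- LDA inference runs this story in reverse,
# recovering the topic mixtures and word distributions from observed documents.
print([sample_word() for _ in range(10)])
```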

The following are key factors in obtaining well-segregated topics:

  1. The quality of text processing.
  2. The variety of topics the text talks about.
  3. The choice of topic modeling algorithm.
  4. The number of topics fed to the algorithm.
  5. The algorithm’s tuning parameters.

5. Prepare Stopwords

We have already downloaded the stopwords. Let’s import them and make it available in stop_words.

# NLTK Stop words
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

6. Import Newsgroups Data

We will be using the 20-Newsgroups dataset for this exercise. This version of the dataset contains about 11k newsgroups posts from 20 different topics. This is available as newsgroups.json.

This is imported using pandas.read_json and the resulting dataset has 3 columns as shown.

# Import Dataset
df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
print(df.target_names.unique())
df.head()
['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
 'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
 'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
 'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
 'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
 'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']
20 Newsgroups Dataset

7. Remove emails and newline characters

As you can see, there are many emails, newline characters and extra spaces, which are quite distracting. Let’s get rid of them using regular expressions.

# Convert to list
data = df.content.values.tolist()

# Remove Emails
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove newline characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"\'", "", sent) for sent in data]

pprint(data[:1])
['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
 'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
 '15 I was wondering if anyone out there could enlighten me on this car I saw '
 'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
 'early 70s. It was called a Bricklin. The doors were really small. In '
 'addition, the front bumper was separate from the rest of the body. This is '
 'all I know.  (..truncated..)]

After removing the emails and extra spaces, the text still looks messy. It is not ready for the LDA to consume. You need to break down each sentence into a list of words through tokenization, while clearing up all the messy text in the process.

8. Tokenize words and Clean-up text

Let’s tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s simple_preprocess() is great for this. Additionally, I have set deacc=True to remove punctuation.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations

data_words = list(sent_to_words(data))

print(data_words[:1])
[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', (..truncated..))]]

9. Creating Bigram and Trigram Models

Bigrams are two words that frequently occur together in the corpus. Trigrams are three words that frequently occur together.

Some examples from our corpus are: ‘front_bumper’, ‘oil_leak’, ‘maryland_college_park’ etc.

Gensim’s Phrases model can build and implement bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold. The higher the values of these parameters, the harder it is for words to be combined into bigrams.
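To see how these two parameters interact, here is a sketch of the Mikolov-style scoring that underlies Phrases. The word counts and the vocabulary size below are made up for illustration:

```python
# Mikolov-style phrase score: how much more often do two words co-occur
# than chance would predict? (Counts below are invented for illustration.)
def phrase_score(count_a, count_b, count_ab, min_count, vocab_size):
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Suppose 'front' appears 200 times, 'bumper' 50 times, and the pair
# 'front bumper' 40 times, in a 30,000-word vocabulary.
score = phrase_score(200, 50, 40, min_count=5, vocab_size=30000)
print(score)   # (40 - 5) * 30000 / (200 * 50) = 105.0
```

With threshold=100, a score of 105.0 means the pair would be joined into ‘front_bumper’; raising min_count or threshold makes joining harder, which is why a higher threshold yields fewer phrases.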

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

# See trigram example
print(trigram_mod[bigram_mod[data_words[0]]])
['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp_posting_host', 'rac_wam_umd_edu', 'organization', 'university', 'of', 'maryland_college_park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', 'it', 'was', 'door', 'sports', 'car', 'looked', 'to', 'be', 'from', 'the', 'late', 'early', 'it', 'was', 'called', 'bricklin', 'the', 'doors', 'were', 'really', 'small', 'in', 'addition', 'the', 'front_bumper' (..truncated..)]

10. Remove Stopwords, Make Bigrams and Lemmatize

The bigrams model is ready. Let’s define functions to remove the stopwords, make bigrams and lemmatize, and then call them sequentially.

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

Let’s call the functions in order.

# Remove Stop Words
data_words_nostops = remove_stopwords(data_words)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# python3 -m spacy download en
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only noun, adj, vb, adv
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:1])
[['where', 's', 'thing', 'car', 'nntp_post', 'host', 'rac_wam', 'umd', 'organization', 'university', 'maryland_college', 'park', 'line', 'wonder', 'anyone', 'could', 'enlighten', 'car', 'see', 'day', 'door', 'sport', 'car', 'look', 'late', 'early', 'call', 'bricklin', 'door', 'really', 'small', 'addition', 'front_bumper', 'separate', 'rest', 'body', 'know', 'anyone', 'tellme', 'model', 'name', 'engine', 'spec', 'year', 'production', 'car', 'make', 'history', 'whatev', 'info', 'funky', 'look', 'car', 'mail', 'thank', 'bring', 'neighborhood', 'lerxst']]

11. Create the Dictionary and Corpus needed for Topic Modeling

The two main inputs to the LDA topic model are the dictionary(id2word) and the corpus. Let’s create them.

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])
[[(0, 1), (1, 2), (2, 1), (3, 1), (4, 1), (5, 1), (6, 5), (7, 1), (8, 1), (9, 2), (10, 1), (11, 1), (12, 1), (13, 1), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 2), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 1), (40, 1), (41, 1), (42, 1), (43, 1), (44, 1), (45, 1), (46, 1), (47, 1), (48, 1), (49, 1), (50, 1)]]

Gensim creates a unique id for each word in the document. The produced corpus shown above is a mapping of (word_id, word_frequency).

For example, (0, 1) above implies, word id 0 occurs once in the first document. Likewise, word id 1 occurs twice and so on.

This is used as the input by the LDA model.
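What doc2bow() does can be mimicked on a toy document with collections.Counter. The toy document and the sorted-order id assignment below are just for illustration; Gensim assigns ids in its own order:

```python
from collections import Counter

# Mimic what id2word + doc2bow produce, on a toy document.
doc = ['car', 'door', 'car', 'engine', 'door', 'car']

# The dictionary assigns each unique word an integer id (here: sorted order).
word2id = {w: i for i, w in enumerate(sorted(set(doc)))}

# doc2bow then returns (word_id, frequency) pairs.
bow = sorted((word2id[w], n) for w, n in Counter(doc).items())
print(bow)   # [(0, 3), (1, 2), (2, 1)] -> car x3, door x2, engine x1
```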

If you want to see what word a given id corresponds to, pass the id as a key to the dictionary.

id2word[0]
'addition'

Or, you can see a human-readable form of the corpus itself.

# Human readable format of corpus (term-frequency)
[[(id2word[id], freq) for id, freq in cp] for cp in corpus[:1]]
[[('addition', 1),
  ('anyone', 2),
  ('body', 1),
  ('bricklin', 1),
  ('bring', 1),
  ('call', 1),
  ('car', 5),
  ('could', 1),
  ('day', 1),
  ('door', 2),
  ('early', 1),
  ('engine', 1),
  ('enlighten', 1),
  ('front_bumper', 1),
  ('maryland_college', 1),
  (..truncated..)]]

Alright, without digressing further let’s jump back on track with the next step: Building the topic model.

12. Building the Topic Model

We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

Apart from that, alpha and eta are hyperparameters that affect the sparsity of the topics. According to the Gensim docs, both default to a 1.0/num_topics prior.

chunksize is the number of documents to be used in each training chunk. update_every determines how often the model parameters should be updated and passes is the total number of training passes.

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

13. View the topics in LDA model

The above LDA model is built with 20 different topics, where each topic is a combination of keywords and each keyword contributes a certain weight to the topic.

You can see the keywords for each topic and the weight (importance) of each keyword using lda_model.print_topics() as shown next.

# Print the keywords for each of the 20 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]
[(0,
  '0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" '
  '+ 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + '
  '0.006*"turn"'),
 (1,
  '0.072*"line" + 0.066*"organization" + 0.037*"write" + 0.032*"article" + '
  '0.028*"university" + 0.027*"nntp_post" + 0.026*"host" + 0.016*"reply" + '
  '0.014*"get" + 0.013*"thank"'),
 (2,
  '0.017*"patient" + 0.011*"study" + 0.010*"slave" + 0.009*"wing" + '
  '0.009*"disease" + 0.008*"food" + 0.008*"eat" + 0.008*"pain" + '
  '0.007*"treatment" + 0.007*"syndrome"'),
 (3,
  '0.013*"key" + 0.009*"use" + 0.009*"may" + 0.007*"public" + 0.007*"system" + '
  '0.007*"order" + 0.007*"government" + 0.006*"state" + 0.006*"provide" + '
  '0.006*"law"'),
 (4,
  '0.568*"ax" + 0.007*"rlk" + 0.005*"tufts_university" + 0.004*"ei" + '
  '0.004*"m" + 0.004*"vesa" + 0.004*"differential" + 0.004*"chz" + 0.004*"lk" '
  '+ 0.003*"weekly"'),
 (5,
  '0.029*"player" + 0.015*"master" + 0.015*"steven" + 0.009*"tor" + '
  '0.009*"van" + 0.008*"king" + 0.008*"scripture" + 0.007*"cal" + '
  '0.007*"helmet" + 0.007*"det"'),
 (6,
  '0.028*"system" + 0.020*"problem" + 0.019*"run" + 0.018*"use" + 0.016*"work" '
  '+ 0.015*"do" + 0.013*"window" + 0.013*"driver" + 0.013*"bit" + 0.012*"set"'),
 (7,
  '0.017*"israel" + 0.011*"israeli" + 0.010*"war" + 0.010*"armenian" + '
  '0.008*"kill" + 0.008*"soldier" + 0.008*"attack" + 0.008*"government" + '
  '0.007*"lebanese" + 0.007*"greek"'),
 (8,
  '0.018*"money" + 0.018*"year" + 0.016*"pay" + 0.012*"car" + 0.010*"drug" + '
  '0.010*"president" + 0.009*"rate" + 0.008*"face" + 0.007*"license" + '
  '0.007*"american"'),
 (9,
  '0.028*"god" + 0.020*"evidence" + 0.018*"christian" + 0.012*"believe" + '
  '0.012*"reason" + 0.011*"faith" + 0.009*"exist" + 0.008*"bible" + '
  '0.008*"religion" + 0.007*"claim"'),
 (10,
  '0.030*"physical" + 0.028*"science" + 0.012*"direct" + 0.012*"st" + '
  '0.012*"scientific" + 0.009*"waste" + 0.009*"jeff" + 0.008*"cub" + '
  '0.008*"brown" + 0.008*"msg"'),
 (11,
  '0.016*"wire" + 0.011*"keyboard" + 0.011*"md" + 0.009*"pm" + 0.008*"air" + '
  '0.008*"input" + 0.008*"fbi" + 0.007*"listen" + 0.007*"tube" + '
  '0.007*"koresh"'),
 (12,
  '0.016*"motif" + 0.014*"serial_number" + 0.013*"son" + 0.013*"father" + '
  '0.011*"choose" + 0.009*"server" + 0.009*"event" + 0.009*"value" + '
  '0.007*"collin" + 0.007*"prediction"'),
 (13,
  '0.098*"_" + 0.043*"max" + 0.015*"dn" + 0.011*"cx" + 0.009*"eeg" + '
  '0.008*"gateway" + 0.008*"c" + 0.005*"mu" + 0.005*"mr" + 0.005*"eg"'),
 (14,
  '0.024*"book" + 0.009*"april" + 0.007*"group" + 0.007*"page" + '
  '0.007*"new_york" + 0.007*"iran" + 0.006*"united_state" + 0.006*"author" + '
  '0.006*"include" + 0.006*"club"'),
 (15,
  '0.020*"would" + 0.017*"say" + 0.016*"people" + 0.016*"think" + 0.014*"make" '
  '+ 0.014*"go" + 0.013*"know" + 0.012*"see" + 0.011*"time" + 0.011*"get"'),
 (16,
  '0.026*"file" + 0.017*"program" + 0.012*"window" + 0.012*"version" + '
  '0.011*"entry" + 0.011*"software" + 0.011*"image" + 0.011*"color" + '
  '0.010*"source" + 0.010*"available"'),
 (17,
  '0.027*"game" + 0.027*"team" + 0.020*"year" + 0.017*"play" + 0.016*"win" + '
  '0.010*"good" + 0.009*"season" + 0.008*"fan" + 0.007*"run" + 0.007*"score"'),
 (18,
  '0.036*"drive" + 0.024*"card" + 0.020*"mac" + 0.017*"sale" + 0.014*"cpu" + '
  '0.010*"price" + 0.010*"disk" + 0.010*"board" + 0.010*"pin" + 0.010*"chip"'),
 (19,
  '0.030*"space" + 0.010*"sphere" + 0.010*"earth" + 0.009*"item" + '
  '0.008*"launch" + 0.007*"moon" + 0.007*"mission" + 0.007*"nasa" + '
  '0.007*"orbit" + 0.006*"research"')]

How to interpret this?

Topic 0 is represented as 0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive" + 0.007*"mount" + 0.007*"controller" + 0.007*"cool" + 0.007*"engine" + 0.007*"back" + 0.006*"turn".

It means the top 10 keywords that contribute to this topic are ‘car’, ‘power’, ‘light’ and so on, and the weight of ‘car’ in topic 0 is 0.016.

The weights reflect how important a keyword is to that topic.
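If you need these weights programmatically, the strings returned by print_topics() can be parsed back into (word, weight) pairs with a small regular expression. The string below is topic 0 from above, truncated to four terms:

```python
import re

# One topic string in the format print_topics() returns (truncated here).
topic = '0.016*"car" + 0.014*"power" + 0.010*"light" + 0.009*"drive"'

# Each term looks like weight*"word"; capture both parts.
pairs = [(w, float(p)) for p, w in re.findall(r'([\d.]+)\*"([^"]+)"', topic)]
print(pairs)
# -> [('car', 0.016), ('power', 0.014), ('light', 0.01), ('drive', 0.009)]
```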

Looking at these keywords, can you guess what this topic could be? You may summarise it as either ‘cars’ or ‘automobiles’.

Likewise, can you go through the remaining topic keywords and judge what the topic is?

Inferring Topic from Keywords

14. Compute Model Perplexity and Coherence Score

Model perplexity and topic coherence provide a convenient measure to judge how good a given topic model is. In my experience, topic coherence score, in particular, has been more helpful.

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # a measure of how good the model is. lower the better.

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)
Perplexity:  -8.86067503009

Coherence Score:  0.532947587081

There you have a coherence score of 0.53.

15. Visualize the topics-keywords

Now that the LDA model is built, the next step is to examine the produced topics and the associated keywords. There is no better tool than the pyLDAvis package’s interactive chart, which is designed to work well with Jupyter notebooks.

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
pyLDAvis Output

So how do we interpret pyLDAvis’s output?

Each bubble on the left-hand side plot represents a topic. The larger the bubble, the more prevalent that topic is.

A good topic model will have fairly big, non-overlapping bubbles scattered throughout the chart instead of being clustered in one quadrant.

A model with too many topics will typically have many overlapping, small-sized bubbles clustered in one region of the chart.

Alright, if you move the cursor over one of the bubbles, the words and bars on the right-hand side will update. These words are the salient keywords that form the selected topic.

We have successfully built a good looking topic model.

Given our prior knowledge of the number of natural topics in the document, finding the best model was fairly straightforward.

Up next, we will improve upon this model by using Mallet’s version of the LDA algorithm, and then we will focus on how to arrive at the optimal number of topics given any large corpus of text.

16. Building LDA Mallet Model

So far you have seen Gensim’s inbuilt version of the LDA algorithm. Mallet’s version, however, often gives better-quality topics.

Gensim provides a wrapper to implement Mallet’s LDA from within Gensim itself. You only need to download the zipfile, unzip it and provide the path to mallet in the unzipped directory to gensim.models.wrappers.LdaMallet. See how I have done this below.

# Download File: http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
mallet_path = 'path/to/mallet-2.0.8/bin/mallet' # update this path
ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)
# Show Topics
pprint(ldamallet.show_topics(formatted=False))

# Compute Coherence Score
coherence_model_ldamallet = CoherenceModel(model=ldamallet, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_ldamallet = coherence_model_ldamallet.get_coherence()
print('\nCoherence Score: ', coherence_ldamallet)
[(13,
  [('god', 0.022175351915726671),
   ('christian', 0.017560827817656381),
   ('people', 0.0088794630371958616),
   ('bible', 0.008215251235200895),
   ('word', 0.0077491376899412696),
   ('church', 0.0074112053696280414),
   ('religion', 0.0071198844038407759),
   ('man', 0.0067936049221590383),
   ('faith', 0.0067469935676330757),
   ('love', 0.0064556726018458093)]),
 (1,
  [('organization', 0.10977647987951586),
   ('line', 0.10182379194445974),
   ('write', 0.097397469098389255),
   ('article', 0.082483883409554246),
   ('nntp_post', 0.079894209047330425),
   ('host', 0.069737542931658306),
   ('university', 0.066303010266865026),
   ('reply', 0.02255404338163719),
   ('distribution_world', 0.014362591143681011),
   ('usa', 0.010928058478887726)]),
 (8,
  [('file', 0.02816690014008405),
   ('line', 0.021396171035954908),
   ('problem', 0.013508104862917751),
   ('program', 0.013157894736842105),
   ('read', 0.012607564538723234),
   ('follow', 0.01110666399839904),
   ('number', 0.011056633980388232),
   ('set', 0.010522980454939631),
   ('error', 0.010172770328863986),
   ('write', 0.010039356947501835)]),
 (7,
  [('include', 0.0091670556506405262),
   ('information', 0.0088169700741662776),
   ('national', 0.0085576474249260924),
   ('year', 0.0077667133447435295),
   ('report', 0.0070406099268710129),
   ('university', 0.0070406099268710129),
   ('book', 0.0068979824697889113),
   ('program', 0.0065219646283906432),
   ('group', 0.0058866241377521916),
   ('service', 0.0057180644157460714)]),
 (..truncated..)]

Coherence Score:  0.632431683088

Just by changing the LDA algorithm, we increased the coherence score from .53 to .63. Not bad!

17. How to find the optimal number of topics for LDA?

My approach to finding the optimal number of topics is to build many LDA models with different values of number of topics (k) and pick the one that gives the highest coherence value.

Choosing a ‘k’ that marks the end of a rapid growth of topic coherence usually offers meaningful and interpretable topics. Picking an even higher value can sometimes provide more granular sub-topics.

If you see the same keywords being repeated in multiple topics, it’s probably a sign that the ‘k’ is too large.
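This heuristic can be sketched numerically. Given coherence scores for a few candidate values of k (the numbers below are hypothetical, not the ones computed later), pick the smallest k whose score is close to the best:

```python
# Hypothetical coherence scores for candidate topic counts.
ks =     [2, 8, 14, 20, 26]
scores = [0.44, 0.59, 0.62, 0.64, 0.64]

best = max(scores)
# Smallest k whose coherence is within a small tolerance of the best --
# i.e. the point where the rapid growth in coherence has ended.
k_opt = next(k for k, s in zip(ks, scores) if s >= best - 0.01)
print(k_opt)   # -> 20
```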

The compute_coherence_values() (see below) trains multiple LDA models and provides the models and their corresponding coherence scores.

def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for various number of topics

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of input texts
    limit : Max num of topics

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []
    for num_topics in range(start, limit, step):
        model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)
        model_list.append(model)
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values
# Can take a long time to run.
model_list, coherence_values = compute_coherence_values(dictionary=id2word, corpus=corpus, texts=data_lemmatized, start=2, limit=40, step=6)
# Show graph
limit=40; start=2; step=6;
x = range(start, limit, step)
plt.plot(x, coherence_values)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
plt.legend(["coherence_values"], loc='best')
plt.show()
Choosing the optimal number of LDA topics

# Print the coherence scores
for m, cv in zip(x, coherence_values):
    print("Num Topics =", m, " has Coherence Value of", round(cv, 4))
Num Topics = 2  has Coherence Value of 0.4451
Num Topics = 8  has Coherence Value of 0.5943
Num Topics = 14  has Coherence Value of 0.6208
Num Topics = 20  has Coherence Value of 0.6438
Num Topics = 26  has Coherence Value of 0.643
Num Topics = 32  has Coherence Value of 0.6478
Num Topics = 38  has Coherence Value of 0.6525

If the coherence score seems to keep increasing, it may make better sense to pick the model that gave the highest coherence value before flattening out. This is exactly the case here.

So for the further steps, I will choose the model with 20 topics.

# Select the model and print the topics
optimal_model = model_list[3]
model_topics = optimal_model.show_topics(formatted=False)
pprint(optimal_model.print_topics(num_words=10))
[(0,
  '0.025*"game" + 0.018*"team" + 0.016*"year" + 0.014*"play" + 0.013*"good" + '
  '0.012*"player" + 0.011*"win" + 0.007*"season" + 0.007*"hockey" + '
  '0.007*"fan"'),
 (1,
  '0.021*"window" + 0.015*"file" + 0.012*"image" + 0.010*"program" + '
  '0.010*"version" + 0.009*"display" + 0.009*"server" + 0.009*"software" + '
  '0.008*"graphic" + 0.008*"application"'),
 (2,
  '0.021*"gun" + 0.019*"state" + 0.016*"law" + 0.010*"people" + 0.008*"case" + '
  '0.008*"crime" + 0.007*"government" + 0.007*"weapon" + 0.007*"police" + '
  '0.006*"firearm"'),
 (3,
  '0.855*"ax" + 0.062*"max" + 0.002*"tm" + 0.002*"qax" + 0.001*"mf" + '
  '0.001*"giz" + 0.001*"_" + 0.001*"ml" + 0.001*"fp" + 0.001*"mr"'),
 (4,
  '0.020*"file" + 0.020*"line" + 0.013*"read" + 0.013*"set" + 0.012*"program" '
  '+ 0.012*"number" + 0.010*"follow" + 0.010*"error" + 0.010*"change" + '
  '0.009*"entry"'),
 (5,
  '0.021*"god" + 0.016*"christian" + 0.008*"religion" + 0.008*"bible" + '
  '0.007*"life" + 0.007*"people" + 0.007*"church" + 0.007*"word" + 0.007*"man" '
  '+ 0.006*"faith"'),
 (..truncated..)]

Those were the topics for the chosen LDA model.

18. Finding the dominant topic in each sentence

One of the practical applications of topic modeling is determining what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.

The format_topics_sentences() function below nicely aggregates this information in a presentable table.

def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()

    # Get main topic in each document
    for i, row in enumerate(ldamodel[corpus]):
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)


df_topic_sents_keywords = format_topics_sentences(ldamodel=optimal_model, corpus=corpus, texts=data)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']

# Show
df_dominant_topic.head(10)
Dominant Topic For Each Document

19. Find the most representative document for each topic

Sometimes just the topic keywords may not be enough to make sense of what a topic is about. So, to help with understanding the topic, you can find the documents a given topic has contributed to the most and infer the topic by reading that document. Whew!!

# Group top 5 sentences under each topic
sent_topics_sorteddf_mallet = pd.DataFrame()

sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')

for i, grp in sent_topics_outdf_grpd:
    sent_topics_sorteddf_mallet = pd.concat([sent_topics_sorteddf_mallet, 
                                             grp.sort_values(['Perc_Contribution'], ascending=[0]).head(1)], 
                                            axis=0)

# Reset Index    
sent_topics_sorteddf_mallet.reset_index(drop=True, inplace=True)

# Format
sent_topics_sorteddf_mallet.columns = ['Topic_Num', "Topic_Perc_Contrib", "Keywords", "Text"]

# Show
sent_topics_sorteddf_mallet.head()
Most Representative Document For Each Topic

The tabular output above actually has 20 rows, one each for a topic. It has the topic number, the keywords, and the most representative document. The Perc_Contribution column is nothing but the percentage contribution of the topic in the given document.

20. Topic distribution across documents

Finally, we want to understand the volume and distribution of topics in order to judge how widely the topics were discussed. The table below exposes that information.

# Number of Documents for Each Topic
topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

# Percentage of Documents for Each Topic
topic_contribution = round(topic_counts/topic_counts.sum(), 4)

# Topic Number and Keywords
topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

# Concatenate Column wise
df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

# Change Column names
df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

# Show
df_dominant_topics
Topic Volume Distribution

21. Conclusion

We started with understanding what topic modeling can do. We built a basic topic model using Gensim’s LDA and visualized the topics using pyLDAvis. Then we built Mallet’s LDA implementation. You saw how to find the optimal number of topics using coherence scores and how to come to a logical understanding of choosing the optimal model.

Finally, we saw how to aggregate and present the results to generate insights that may be more actionable.

Hope you enjoyed reading this. I would appreciate if you leave your thoughts in the comments section below.

100 Comments on “Topic Modeling with Gensim (Python)”

  1. Awesome post I ever seen about Topic modeling!! Most of blogs are out of date and sometimes doesn’t work but this one is great!

    One question: if I want to use a TF-IDF model, where should I put the code? Do I have to substitute the `Bimodel` part with `TF-IDF`?

    1. Thanks 🙂

      To use a tfidf corpus instead, you will need something like:

      tfidf_model = gensim.models.TfidfModel(corpus, id2word=id2word)
      
      lda_model = gensim.models.ldamodel.LdaModel(tfidf_model[corpus], id2word=id2word, num_topics=20)
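      For intuition, what TfidfModel does under the hood is reweight each bag-of-words count by inverse document frequency (gensim's implementation uses a log-base-2 idf by default and also L2-normalizes each vector, which this sketch skips). The toy corpus below is made up:

```python
import math

# Toy corpus: each document is a list of (token_id, count) pairs,
# like the output of Dictionary.doc2bow().
corpus = [[(0, 2), (1, 1)],
          [(0, 1), (2, 3)]]

n_docs = len(corpus)

# Document frequency: in how many documents does each token id appear?
dfs = {}
for doc in corpus:
    for token_id, _ in doc:
        dfs[token_id] = dfs.get(token_id, 0) + 1

def tfidf(doc):
    # weight = term count * log2(total docs / docs containing the term)
    return [(tid, count * math.log2(n_docs / dfs[tid])) for tid, count in doc]

print(tfidf(corpus[0]))  # token 0 appears in every doc, so its weight drops to 0.0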
      
  2. Hi,
    This is a very nice and clear explanation of topic modeling. I followed your instructions to build the LDA Mallet model, but a problem has occurred.
    May I have your suggestion on this error msg?
    CalledProcessError: Command 'C:\mallet-2.0.8\mallet-2.0.8\bin\mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\Kellyz\AppData\Local\Temp\1a85b1_corpus.txt --output C:\Users\Kellyz\AppData\Local\Temp\1a85b1_corpus.mallet' returned non-zero exit status 1.

  3. Hi Kelly, I need to be able to reproduce this in order to resolve the error. But from the msg, check if the path to mallet is set correctly. It is typically something like ‘c:\mallet-2.0.8\bin\mallet’. You may have to get the forward or backward slash right, depending on your OS.

    1. Thanks for your answer, but the error msg still occurs: CalledProcessError: Command 'c:\mallet\bin\mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input C:\Users\Kellyz\AppData\Local\Temp\a76377_corpus.txt --output C:\Users\Kellyz\AppData\Local\Temp\a76377_corpus.mallet' returned non-zero exit status 1. Is there any prerequisite besides section 2 of your post?
      I appreciate your help!

      1. I would suggest a few things:
        1. Try replacing the backward slashes in the filepath with forward slashes
        2. The path to your mallet doesn't seem right. It should be something like 'c:/mallet-2.0.8/bin/mallet'. In case you renamed some directory, don't use that. Unzip the original zip file again and use that.

        You may also refer to this post about a similar error.

        1. hi,
          I ran into a problem on the same line as Kelly, but the error reads like this:
          [Errno 2] No such file or directory: 'C:\\Users\\X302UJ~1\\AppData\\Local\\Temp\\92a9_state.mallet.gz'
          I have been trying your suggestion, but the same error keeps popping up.
          Would you mind giving me some suggestions?

          *ps. I’m sorry for my bad english

          1. Angga, I would suggest deleting everything and re-doing it from scratch. Just make sure your mallet path points to the mallet file inside the bin/ directory, like this: 'c:/mallet-2.0.8/bin/mallet'. Hope it works.

            1. Hi Selva,

              I am facing the same problem like Angga. I tried changing the bin directory with the forward slash as suggested… but the problem still persists:

              FileNotFoundError: [Errno 2] No such file or directory: ‘C:\\Users\\Renuka\\AppData\\Local\\Temp\\ef2d0c_state.mallet.gz’

              1. 1. Check if the MALLET_HOME environment variable is set properly.
                2. Use Python 2 to run the code.

                rest is to sit back and feel the charm of code.

                Nice Article.

              2. This is what fixed it for me:
                a. Extracted to c:\ drive so I have c:/mallet-2.0.8 as my root directory
                b. Set MALLET_HOME environment variable to c:/mallet-2.0.8/
                c. mallet_path = "c:/mallet-2.0.8/bin/mallet"

                If you don’t want to restart a Notebook then do this to set the environment variable:

                import os
                os.environ.update({'MALLET_HOME': r'c:/mallet-2.0.8/'})

              3. hi, I went through the same problem. Changing the directory and setting the environment variable were not enough. I had to compile the Mallet source code that comes in the zipped file. To do that I needed to have Apache Ant installed (https://ant.apache.org/manual/install.html).
                Once installed, I used it to build Mallet, and then I was able to use it in my topic modeling.

                1. Hi Lamia, I configured Ant and the build was successful, but I am still getting the zero status error. Can you explain how you built Mallet using Ant?

        2. Hey Selva, I tried all your steps but it's not getting rectified. I am having the same error as Kelly. Please help.

          TIA

  4. Thanks for the post, that is super helpful!

    I have two queries and hope to seek your advice
    Is it true that it takes a long time to run spaCy lemmatization?
    I have about 300 thousand documents (about 16 GB) to analyze. Do you have suggestions for dealing with such a large database?

    Thanks!

    1. About spaCy: it's normally said that way because spaCy can do a whole lot of things in just one function call. But if you want to use only the lemmatization feature, it's better to turn off the other components in spaCy's pipeline, as is done in this article. Here, we turn off the 'parser' and 'ner' components, which we are not making use of.

      `nlp = spacy.load('en', disable=['parser', 'ner'])`

      For training on 300k documents, it's a matter of having the required computing resources. But it's possible to code it in a way that processes the documents without loading all of them into memory at once. Gensim can help you with that.

  5. Thank you so much, Selva. It worked magic! I have one more question I am not able to figure out (newbie). Is there a way to get output similar to Mallet's --output-doc-topics topic matrix per document? I reckon it should be something along the lines of point 18.

      1. Thank you so much for a swift reply! What I actually meant is whether it is possible to see not only the dominant topics' proportions per document (like in p.18) but all topic proportions per document, as with --output-doc-topics when I run it in the terminal. So, say, for document #3 the topics would be T_1 (0.532), T_2 (0.002), T_3 (0.021) etc. Is there a way to do that?

  6. Thanks for sharing this. It is a great article. I am relatively new to NLP and Python. I followed your code but I got stuck in combining the topics with the documents.

    I am getting an error in format_topics_sentences.

    row = sorted(row, key=lambda x: (x[1]), reverse=True)
    TypeError: unorderable types: int() < tuple()

    This thing does not seem to work in Python 3, I guess.

    Can you help me how to resolve this?

    1. I would run the following and see the contents of each 'row':

      for i, row in enumerate(ldamodel[corpus]):
          print(i, row)

      This should print the row counter (i), and 'row' should be tuples of the same size, if I remember correctly. I suspect your 'row' is probably irregular, containing items of different sizes. It's hard to say without reproducible code. If not, you need to rework `row = sorted(row, key=lambda x: (x[1]), reverse=True)` in order to sort that row to get the most dominant topic number to the front.
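      To make the shapes concrete, here is a tiny sketch (fabricated numbers) of what one 'row' looks like when the model was trained with per_word_topics=True, and how to pull out the dominant topic either way:

```python
# Fabricated example of one entry from ldamodel[corpus] when the model was
# built with per_word_topics=True: a 3-tuple whose first element is the
# document-topic distribution.
row = ([(0, 0.10), (1, 0.63), (2, 0.27)],   # (topic_id, probability) pairs
       [(0, [1])],                          # per-word topic assignments
       [(0, [(1, 0.9)])])                   # per-word topic probabilities

# Unwrap the triple if present; a model without per_word_topics yields
# the (topic_id, probability) list directly.
doc_topics = row[0] if isinstance(row, tuple) else row
doc_topics = sorted(doc_topics, key=lambda x: x[1], reverse=True)
dominant_topic, prop = doc_topics[0]
print(dominant_topic, prop)  # → 1 0.63
```

Sorting the inner list (not the whole triple) is what avoids the "int() < tuple()" TypeError discussed above.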

  7. Hi Selva,

    Thank you for such a wonderful post on Topic Modeling.

    I have a few questions:
    1) Can you please let me know how extensively organisations are making use of TOPIC MODELING?
    2) Which are the fields/sectors/organisations getting benefitted from TOPIC MODELING?

  8. Hi Selva,

    I have more questions, sorry

    1) How do you deploy TOPIC MODELING for any end user? Do you embed your model using FLASK or any other application?
    2) Do you also write on developing Chatbots or write on Artificial Intelligence?

    I’m very keen on learning and developing Chatbots

    1. Vijay Kumar,
      Thought-provoking questions, nice! My answers are quite simple though: 1. It depends, but from what I have seen, it's quite elementary. 2. Anywhere you would see a lot of text data: e-commerce, customer service, anywhere people write manual notes / comments. 3. I haven't deployed a topic model using Flask in particular, but it's very much possible. 4. I may write about it some day.

  9. This is great – how can I buy you a coffee?

    I’m trying to visualise Mallet model using pyLDAvis with no luck. This is my code:

    from gensim.models.wrappers import ldamallet

    pyLDAvis.enable_notebook()
    lda_model = ldamallet.malletmodel2ldamodel(optimal_model) # optimal_model is a Mallet model from your code above
    vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
    vis

    The code executes, but lda_model contains weirdly random topics, nothing like what's in optimal_model. Furthermore, the topics in lda_model change every time I re-run that code.

    Any suggestions?

    1. I have also tried to do the same, and plotting the Mallet model directly does not work. I converted the Mallet model to a Gensim model, following the instructions here ….

      thus I wrote

      mallet_gen_optimal = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(optimal_model)

      pyLDAvis.enable_notebook()
      vis = pyLDAvis.gensim.prepare(mallet_gen_optimal, corpus, id2word)
      vis

      but I get the same messy, useless result as you.

      Dear Selva, do you have any idea how to solve this?

    2. Hello stan,

      The Mallet LDA model takes a random seed. Try setting np.random.seed; hopefully after that your topics won't change much.

    3. Thanks, Praveen and Tano!

      – LDA Mallet is not compatible with pyLDAvis as of today. But the hack that Tano showed might help pull it off.

      – About messy topics: the topics that a topic model generates depend on multiple factors (listed in section 4). You may do your part in preparing (preprocessing) the texts in a way that removes the words you don't want to see in the topics.

      Try expanding stop words and consider keeping some of the negative stopwords (they may turn into bigrams).

      Make your lemmatization more specific by keeping only nouns and adjectives.

      Try adjusting the ‘threshold’ in bigram and trigram.

      If you see same words appearing in multiple topics, try reducing the n_topics till the problem goes away.

      Implement spell correction in text processing. It's available in textblob and autocorrect.

  10. Excellent article, very well written with very good explanations.
    I am stuck at the following place when coherence is computed:
    after perplexity is calculated, the program hangs and does not calculate the coherence score. Any pointers would be great. This is just before using Mallet LDA.

  11. Hello,
    Great article, I find it very interesting.
    But I have a question: how could you summarize each topic and find a suitable label or description for it? To make my question clear: you referred to topic 19 as space; how did you determine those words?
    Thanks in advance.

    1. I translated this article and posted it on my blog.
      If you do not allow it, I will delete it.
      Please answer me; I will check this comment again.
      Thank you.

  12. Hi Selva,

    I am a newbie in NLP and your post helped me a lot; it is crisp and to the point.

    I have a doubt: after generating the topic keywords, how do I infer a topic from them? Is it possible to do it dynamically?

  13. Hi again,

    Can I replace ldamallet with LDA from gensim in finding the optimal number of topics, and how?

  14. Hi Hiba,

    You can certainly do that. Try replacing the following line in `compute_coherence_values()`:

    model = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word)

    with something like:

    lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=id2word, num_topics=num_topics, random_state=100, alpha='auto')

  15. Hi,
    Great article, but I have a question: if I train my LDA model and then later on add a few more docs to the same corpus, how can I update the model?

  16. I’m not able to get the Mallet model working. I’m running Mac High Sierra 10.13
    When I try to do:

    mallet_path = 'Users/JACKLAVIOLETTE/Desktop/mallet-2.0.8/bin/mallet'
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=ml_corpus, num_topics=10, id2word=ml_id2word)

    I get:

    CalledProcessError: Command 'Users/JACKLAVIOLETTE/Desktop/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /var/folders/_d/dxfw48sn0x9bycv9yt4drl8w0000gp/T/3f4227_corpus.txt --output /var/folders/_d/dxfw48sn0x9bycv9yt4drl8w0000gp/T/3f4227_corpus.mallet' returned non-zero exit status 127.

    Is this something you can help with?

    Secondly, unlike in your example, my coherence more or less decreases as I increase the number of topics. Does this mean that my data is poorly suited to LDA?

    Thanks

  17. Great article, I am learning a lot, but there are some errors.

    One little thing (not an error but an aesthetic thing) is that pyLDAvis has different topic numbers from the model; to prevent this, include the parameter sort_topics=False when calling pyLDAvis.gensim.prepare.

    The use of the Mallet model in both pyLDAvis and in the topics per sentence (format_topic_sentences) does not work, because the words don’t translate back correctly from the Mallet model. This has been highlighted above.

    I believe that Rajeev Singhal tried, as I did, to instead use the Gensim model with format_topic_sentences – where it throws the error with the sentence structure. This is because lda_mallet[corpus] is a simple list of tuples per topic area, whereas lda_gensim[corpus] is a rather more complex beast.

    Have you verified that you really are correctly mapping key words per sentence when using the mallet model? And can you supply a version of format_topic_sentences that works with Gensim models?

  18. Hi Selva,

    I am using python 3 and getting following error, when executing format_topics_sentences function call:
    ----> 7 row = sorted(row, key=lambda x: (x[1]), reverse=True)
          8 # Get the Dominant topic, Perc Contribution and Keywords for each document
          9 for j, (topic_num, prop_topic) in enumerate(row):

    TypeError: '<' not supported between instances of 'int' and 'tuple'

    I executed the following code you mentioned, and the output is below:

    for i, row in enumerate(loaded_lda_model[corpus]):
        print(i, row)
        if i == 2:
            break

    Output
    ———
    0 ([(0, 0.095578134), (1, 0.15087242), (2, 0.054104384), (6, 0.047754303), (8, 0.63085496)], [(0, [1, 8]), (1, [8, 1]), (2, [6]), (3, [8, 1]), (4, [8]), (5, [8]), (6, [0]), (7, [8, 1]), (8, [2]), (9, [0]), (10, [8]), (11, [8, 6]), (12, [8, 2]), (13, [8]), (14, [8]), (15, [1]), (16, [8]), (17, [8, 1, 0]), (18, [8, 1, 0, 6, 2])], [(0, [(1, 1.98242), (8, 0.017578531)]), (1, [(1, 0.23531295), (8, 0.764687)]), (2, [(6, 0.99999696)]), (3, [(1, 0.022382142), (8, 0.9776156)]), (4, [(8, 0.9996943)]), (5, [(8, 0.99999344)]), (6, [(0, 0.9999866)]), (7, [(1, 0.0314975), (8, 0.96849954)]), (8, [(2, 0.9999861)]), (9, [(0, 0.9999952)]), (10, [(8, 0.99999297)]), (11, [(6, 0.014432267), (8, 0.9772537)]), (12, [(2, 0.16705652), (8, 0.8309764)]), (13, [(8, 0.9991682)]), (14, [(8, 3.0)]), (15, [(1, 0.99520165)]), (16, [(8, 0.9999935)]), (17, [(0, 0.10171902), (1, 0.103229314), (8, 1.7950518)]), (18, [(0, 0.083051875), (1, 0.14795224), (2, 0.031359576), (6, 0.03167073), (8, 0.7059564)])])
    1 ([(0, 0.05), (1, 0.050017588), (2, 0.05), (3, 0.05), (4, 0.05), (5, 0.5499824), (6, 0.05), (7, 0.05), (8, 0.05), (9, 0.05)], [(19, [5])], [(19, [(5, 0.9999653)])])
    2 ([(0, 0.05), (1, 0.050018568), (2, 0.05), (3, 0.05), (4, 0.05), (5, 0.5499814), (6, 0.05), (7, 0.05), (8, 0.05), (9, 0.05)], [(19, [5])], [(19, [(5, 0.9999653)])])

    Can you guide on what changes do we need to make in your code so that the function format_topics_sentences works?

  19. This is my modified version of format_topics_sentences, for those trying to work with a Gensim model. Note I add 1 to the topic number so it aligns with the 1-based numbering used in the pyLDAvis graph, and that I set up a dict of the top 10 keywords per topic up-front, so it doesn't have to be repeatedly created.

    def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
        # Array of top 10 topics
        top10array = []

        for row in range(ldamodel.num_topics):
            wp = ldamodel.show_topic(row)
            topic_keywords = ", ".join([word for word, prop in wp])
            top10array.append((row+1, topic_keywords))

        top10dict = dict(top10array)

        # Init output
        sent_topics_df = pd.DataFrame()

        # Get main topic in each document
        for topic in ldamodel[corpus]:
            row = topic[0]
            row = sorted(row, key=lambda x: (x[1]), reverse=True)
            # Get the Dominant topic, Perc Contribution and Keywords for each document
            for j, (topic_num, prop_topic) in enumerate(row):
                if j == 0:  # => dominant topic
                    topic_num += 1  # view topics as a series starting from 1
                    topic_keywords = top10dict[topic_num]
                    sent_topics_df = sent_topics_df.append(
                        pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                        ignore_index=True)
                else:
                    break

        sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']

        # Add original text to the end of the output
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
        return(sent_topics_df)

  20. Here is a version of format_topics_sentences for Gensim models that dispenses with the main processing loop by just using pandas.apply. Again note that I number topics from 1.

    I am testing this against a twitter corpus of 250,000 records – taking this approach has decreased processing time from 30 minutes to 7 minutes:

    def format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
        # Array of top 10 topics
        top10array = []

        for row in range(ldamodel.num_topics):
            wp = ldamodel.show_topic(row)
            topic_keywords = ", ".join([word for word, prop in wp])
            top10array.append((row+1, topic_keywords))

        top10dict = dict(top10array)

        sent_topics_df = pd.DataFrame(pd.DataFrame([sorted(topic[0], key=lambda x: (x[1]), reverse=True) for topic in ldamodel[corpus]])[0])
        sent_topics_df.columns = ["Data"]
        sent_topics_df['Dominant_Topic'] = sent_topics_df.Data.apply(lambda x: x[0]+1)
        sent_topics_df['Perc_Contribution'] = sent_topics_df.Data.apply(lambda x: round(x[1], 4))
        sent_topics_df['Topic_Keywords'] = sent_topics_df.Dominant_Topic.apply(lambda x: top10dict[x])

        # Add original text to the end of the output
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df, contents.rename("Text")], axis=1)
        sent_topics_df = sent_topics_df[['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Text']]
        return(sent_topics_df)

  21. Hi Chris,

    I really appreciate your help. I tried your first logic, it was slow but it worked. I will try the second version.

    One more thing I wanted to validate: would the code in sections 19 and 20 work without change once the change is made in the format_topics_sentences method? I have 10 topics and a corpus of around 140,000 records.

    I tried executing section 19 and it seemed to work, but when I tried the code in section 20, it showed me indexes from 0 to 140,000 (my total number of records), with only indexes 1 to 10 populated (assuming it just took the 10 topics and showed the corresponding details); the rest of the records (index 0 and all indexes > 10 up to 140,000) show NaN values in the "Num_Documents" and "Perc_Documents" columns.

    What am I missing?

  22. A version of format_topics_sentences for use with Mallet models which uses pandas.apply and runs a lot faster by avoiding a processing loop. Note that I number topics from 1 rather than zero; if it's your preference to use a zero starting point then remove any +1 from the code.

    To use this for a Gensim model, all that needs to change is for the `sorted(topic,` code fragment to be replaced with `sorted(topic[0],`.

    def format_mallet_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data):
        # Array of top 10 topics
        top10array = []

        for row in range(ldamodel.num_topics):
            wp = ldamodel.show_topic(row)
            topic_keywords = ", ".join([word for word, prop in wp])
            top10array.append((row+1, topic_keywords))

        top10dict = dict(top10array)

        # use sorted(topic[0], if passing in a Gensim model
        sent_topics_df = pd.DataFrame(pd.DataFrame([sorted(topic, key=lambda x: (x[1]), reverse=True) for topic in ldamodel[corpus]])[0])
        sent_topics_df.columns = ["Data"]
        sent_topics_df['Dominant_Topic'] = sent_topics_df.Data.apply(lambda x: x[0]+1)
        sent_topics_df['Perc_Contribution'] = sent_topics_df.Data.apply(lambda x: round(x[1], 4))
        sent_topics_df['Topic_Keywords'] = sent_topics_df.Dominant_Topic.apply(lambda x: top10dict[x])

        # Add original text to the end of the output
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df, contents.rename("Text")], axis=1)
        sent_topics_df = sent_topics_df[['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Text']]
        return(sent_topics_df)

  23. Hi Chris,

    This is my final piece of code after your suggestions, used sorted(topic[0].

    def format_topics_sentences(ldamodel=loaded_lda_model, corpus=corpus, texts=text_list_cleaned):
        # Array of top 10 topics
        top10array = []

        for row in range(ldamodel.num_topics):
            wp = ldamodel.show_topic(row)
            topic_keywords = ",".join([word for word, prop in wp])
            top10array.append((row+1, topic_keywords))

        top10dict = dict(top10array)

        # use sorted(topic[0], if passing in a Gensim model
        sent_topics_df = pd.DataFrame(pd.DataFrame([sorted(topic[0], key=lambda x: (x[1]), reverse=True) for topic in ldamodel[corpus]])[0])
        sent_topics_df.columns = ["Data"]
        sent_topics_df['Dominant_Topic'] = sent_topics_df.Data.apply(lambda x: x[0]+1)
        sent_topics_df['Perc_Contribution'] = sent_topics_df.Data.apply(lambda x: round(x[1], 4))
        sent_topics_df['Topic_Keywords'] = sent_topics_df.Dominant_Topic.apply(lambda x: top10dict[x])

        # Add original text to the end of the output
        contents = pd.Series(texts)
        sent_topics_df = pd.concat([sent_topics_df, contents.rename("Text")], axis=1)
        sent_topics_df = sent_topics_df[['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords', 'Text']]
        return(sent_topics_df)

    df_topic_sents_keywords = format_topics_sentences(ldamodel=loaded_lda_model, corpus=corpus, texts=text_list_cleaned)

    but I still see following issues:

    1.) Dominant topic starts with number 1 and section 19 and 20 gets topic numbers starting with 1 only.

    2.) I still see same issue with section 20.

    Section 20 logic is:

    # Finally, we want to understand the volume and distribution of topics in order to judge how widely they were discussed.
    # Number of Documents for Each Topic
    pd.set_option('display.max_colwidth', -1)
    topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()

    # Percentage of Documents for Each Topic
    topic_contribution = round(topic_counts/topic_counts.sum(), 4)

    # Topic Number and Keywords
    topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]

    # Concatenate Column wise
    df_dominant_topics = pd.concat([topic_num_keywords, topic_counts, topic_contribution], axis=1)

    # Change Column names
    df_dominant_topics.columns = ['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']

    # Show
    print(df_dominant_topics)

    It produces following output:

    =============

    Dominant_Topic Num_Documents Perc_Documents
    0 9 NaN NaN
    1 6 27855.0 0.2036
    2 6 13686.0 0.1000
    3 5 16628.0 0.1215
    4 10 14833.0 0.1084
    5 2 8659.0 0.0633
    6 2 7395.0 0.0541
    7 6 13399.0 0.0979
    8 6 7409.0 0.0542
    9 6 11553.0 0.0844
    10 6 15391.0 0.1125
    11 2 NaN NaN
    12 2 NaN NaN
    13 4 NaN NaN
    14 1 NaN NaN
    …………………………………………
    139000……………………….. NaN NaN
    140000……………………….. NaN NaN

    140,000 is my total records count

  24. Regarding section 20, I also found it was incorrect; it ended up giving me the entire corpus, not just the summary per class. This is what I ended up doing (note I use the mallet keyword in every variable, just to remind myself that I'm dealing with the mallet data here):

    # Number of Documents for Each Topic
    topic_counts = df_mallet_topic_sents_keywords['Dominant_Topic'].value_counts()

    # Percentage of Documents for Each Topic
    topic_contribution = round(topic_counts/topic_counts.sum(), 4)

    top10array = []

    for row in range(optimal_model.num_topics):
        wp = ldamodel.show_topic(row)
        topic_keywords = ", ".join([word for word, prop in wp])
        top10array.append((row+1, topic_keywords))

    top10dict = dict(top10array)

    df_mallet_dominant_topics = pd.concat([topic_counts, topic_contribution], axis=1)

    df_mallet_dominant_topics = df_mallet_dominant_topics.reset_index()
    df_mallet_dominant_topics.columns = ['Dominant_Topic', 'Num_Documents', 'Perc_Documents']

    df_mallet_dominant_topics['Topic_Keywords'] = df_mallet_dominant_topics.Dominant_Topic.apply(lambda x: top10dict[x])

    df_mallet_dominant_topics = df_mallet_dominant_topics[['Dominant_Topic', 'Topic_Keywords', 'Num_Documents', 'Perc_Documents']]

    # Show
    df_mallet_dominant_topics

  25. Regarding issues with topic labels in my suggested code, try removing any code where I've added 1 to bump the counts up to begin with 1. I may have missed something I've done elsewhere that you need to do to get success with a 1-based topic list.

  26. Hi Chris,

    Thanks for all your responses, they have been really helpful.

    One other question: it seems you preferred Mallet over LDA. I tried the same and it was giving me about a 9 to 10% higher coherence score than the LDA model, which was great, and I really wanted to use it, but I found two issues:

    1.) The pyLDAvis package does not work with it.
    2.) I did not find any random seed/state variable, and because of that the gensim Mallet model will generate different topics over time with different coherence scores.

    Just wanted to check, if you found some solutions to overcome these issues?

    Really appreciate the help!!

  27. A little correction on bullet 2 of my previous message:

    2.) I did not find any random seed/state variable, and because of that the gensim Mallet model will generate different topics with different coherence scores every time the model is trained.

  28. Hi Vik

    I'm glad I've been able to make useful suggestions! Regarding my preference, it is actually to stick with the Gensim LDA model, as (1) the problem with using pyLDAvis seems unsolvable at present, and (2) I found that the relationships between nearby topics in the Gensim model seemed a little more consistent with my perceptions of them. Note however I am not looking at 20 Newsgroups but my own twitter dataset.

    Also, regarding the 20 Newsgroups data, I am pretty familiar with it: you should note that there will be a number of empty and near-empty records once headers, footers and quotes are removed, and also that there are some records in the computing area that contain embedded images. These serve to confuse NLP processing somewhat and should be removed first. If you need it, I can help you with a workflow for that.

  29. This is a really great resource, thank you! I’ve put a copy of the code in a publicly accessible (view only) Google CoLab notebook here:

    https://drive.google.com/file/d/1B2kdhnm0KDZpANC2WT4lirXOxNHQ5FF2/view?usp=sharing

    With this, people can make a copy and play around without having to install anything to their own environment. Sometimes this is helpful for people who are just beginning with Python and machine learning. On the other hand, people comfortable making changes to their local environment might find that running on their own machine is faster.

  30. Thanks, Jason, for posting the link. I think there needs to be a correction applied to the format_topics_sentences function, else folks will face the following error, which I posted previously.

    I am using Python 3.6 and get the following error when executing the format_topics_sentences function call:

    ----> 7 row = sorted(row, key=lambda x: (x[1]), reverse=True)
          8 # Get the Dominant topic, Perc Contribution and Keywords for each document
          9 for j, (topic_num, prop_topic) in enumerate(row):

    TypeError: '<' not supported between instances of 'int' and 'tuple'

    1. I think I might have found the issue: the error goes away if I change the following line

      ----> 7 row = sorted(row, key=lambda x: (x[1]), reverse=True)

      to

      ----> 7 row = sorted(row[0], key=lambda x: (x[1]), reverse=True)

      The 0th index is where the list of all the topics and their corresponding probability scores is stored for a particular document.

  31. One more thing I noticed is that the bubble size for a particular topic, as shown in the pyLDAvis visualization, does not match the results you get for the same topic in section 20.

    E.g. topic 1 might show the highest bubble size in the pyLDAvis visualization, but the calculations from section 20 may show it as lower in prevalence, and maybe topic 3 is highest.

    Should the prevalence score be the same between pyLDAvis (bubble size) and section 20 for the same topic?

    1. Hi Selva,

      Can you please guide here?

      The bubble size showing the prevalence of a particular topic does not match the prevalence calculation in section 20. Do you agree that they could differ?

      Currently, I am a little fuzzy on which prevalence measure I should trust: pyLDAvis or the one in section 20.

      Really appreciate your help!!

  32. I tried your code with the dataset you provided:

    # Import Dataset
    df = pd.read_json('https://raw.githubusercontent.com/selva86/datasets/master/newsgroups.json')
    print(df.target_names.unique())

    however my output was:
    [u'rec.autos' u'comp.sys.mac.hardware' u'rec.motorcycles' u'misc.forsale'
    u'comp.os.ms-windows.misc' u'alt.atheism' u'comp.graphics'
    u'rec.sport.baseball' u'rec.sport.hockey' u'sci.electronics' u'sci.space'
    u'talk.politics.misc' u'sci.med' u'talk.politics.mideast'
    u'soc.religion.christian' u'comp.windows.x' u'comp.sys.ibm.pc.hardware'
    u'talk.politics.guns' u'talk.religion.misc' u'sci.crypt']

    I’m not sure why there’s an extra character. Is there any way to fix this issue?

  33. Hi Selva,

    Thank you so so much for your wonderful post(s). This one is exactly what I need for my thesis research!
    I was hoping you could help me with an error that's stopping the model, though? When I add the line I receive the error, which I just cannot get past.

    Do you know how to finish the topic modeling?

      1. Hi Selva, sorry for the late reply. I thought I had it figured out but it ended up not working 🙁 My question is quite long, but essentially my problem is that I get two different topic outputs for what are the same dataset/settings.

        I have used gensim by preprocessing texts, loading them in a corpus using nltk.corpus.reader.PlaintextCorpusReader, and applying the gensim topic model to that corpus, following your wonderful tutorial.

        My problem is that when I load my texts directly from my 'Data' folder (gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/Data/", r".*\.txt", encoding="ISO-8859-1")), I get the output I want.

        However, when I use a pandas dataframe to select certain categories of texts, print those texts to a different folder that looks exactly the same (see twodatadirectories.jpg) and then load it in the nltk.corpus.reader, the output is vastly different (gothiccorpus = nltk.corpus.reader.PlaintextCorpusReader(r"C:/Users/maart/Documents/Thesis/n." + handle + "-all/", r".*\.txt", encoding="ISO-8859-1")).

        Please see https://imgur.com/a/Ck3lPU2 for the similarity in data yet the difference in output. Except for selecting either /Data/ or /n+ handle + all/, which contains the same files, the process is exactly the same.

        How can this be the case, and how do I get the results from the directly-to-data-map rather than the preselect-printsamedatatonewmap model?

        What am I doing wrong? I am at wits' end.

        Thank you very much for this blog, it’s the reason I could start topic modeling in the first place! 😀

  34. Hi Selva,

    Thank you first of all! It worked for me too. Only when I try to visualise, I get the following message: module 'pyLDAvis' has no attribute 'gensim'

  35. Sorry, Selva. Ignore my question. The issue was solved after adding 'import pyLDAvis.gensim'.

    Can you please advise how I can add the outcomes/results (topic category and keywords) to the existing file used as a source? Thank you!
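For readers hitting the same AttributeError: importing a package does not automatically import its submodules, which is why `import pyLDAvis` alone leaves `pyLDAvis.gensim` undefined. The standard library's xml package shows the same behaviour (a sketch of the general rule, not pyLDAvis-specific):

```python
# Importing a package does not automatically import its submodules.
import xml
try:
    xml.etree  # may raise AttributeError in a fresh interpreter
except AttributeError:
    pass

import xml.etree.ElementTree  # the full dotted import loads the submodule
print(hasattr(xml, 'etree'))  # True
```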

  36. This is a great article. Thank you for posting and walking us through the entire process. This is very helpful to a newbie like me.

    I have a quick question about the function for lemmatizing words. I wanted to stick to this function

    from nltk.stem.wordnet import WordNetLemmatizer

    def get_lemma(word):
        return WordNetLemmatizer().lemmatize(word)

    However, when I run the next command:
    data_lemmatized = get_lemma(data_words_bigrams)

    I get TypeError: unhashable type: 'list'.
    How do I overcome this error? I turned my dataframe into a list in a prior step, like you did.
    Can you please let me know how I need to adjust my code? Again, I am brand new to Python.

    Thank you
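The TypeError above happens because lemmatize() expects a single word (a string), while data_words_bigrams is a list of token lists. A sketch of the fix, with a toy stand-in lemmatizer so it runs without NLTK's wordnet data (with NLTK installed, swap stub_lemmatize for WordNetLemmatizer().lemmatize):

```python
def stub_lemmatize(word):
    # toy stand-in for WordNetLemmatizer().lemmatize, which also takes one word
    return word.rstrip('s')

def lemmatize_docs(docs, lemmatize=stub_lemmatize):
    # map the one-word lemmatizer over every token of every document,
    # instead of passing the whole nested list at once
    return [[lemmatize(w) for w in doc] for doc in docs]

data_words_bigrams = [['cars', 'engines'], ['dogs']]
print(lemmatize_docs(data_words_bigrams))  # [['car', 'engine'], ['dog']]
```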

  37. Hi Selva, thanks for this great post. However, I couldn't build the LDA Mallet model. The error code is 127, and my platform is Ubuntu 16.04. Do you know what the problem is?

    Traceback (most recent call last):
    File "lda_final.py", line 198, in main
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=10, id2word=id2word)
    File "/home/anaconda3/envs/emre_env/lib/python3.6/site-packages/gensim/models/wrappers/ldamallet.py", line 126, in __init__
    self.train(corpus)
    subprocess.CalledProcessError: Command '/home/users/emre/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/ea2bdd_corpus.txt --output /tmp/ea2bdd_corpus.mallet' returned non-zero exit status 126.
    Command '/home/users/emre/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /tmp/ea2bdd_corpus.txt --output /tmp/ea2bdd_corpus.mallet' returned non-zero exit status 126.
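For this and the similar reports below: in the shell, exit status 127 generally means the mallet command was not found, and 126 that it exists but is not executable (fixable with chmod +x mallet-2.0.8/bin/mallet). A rough pre-flight check, written as a sketch (the path is this commenter's; adjust to your own):

```python
import os

def diagnose_mallet(path):
    # Rough mapping from the state of the mallet script to the exit status
    # the subprocess call would produce.
    if not os.path.isfile(path):
        return 'not found (exit 127): check mallet_path'
    if not os.access(path, os.X_OK):
        return 'not executable (exit 126): run chmod +x on bin/mallet'
    return 'ok'

print(diagnose_mallet('/home/users/emre/mallet-2.0.8/bin/mallet'))
```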

  38. Hi, thanks for your wonderful tutorial! It helps me a lot!
    But I am wondering about section 20:
    ——————————————————————————————-
    # Topic Number and Keywords
    topic_num_keywords = df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
    ——————————————————————————————-

    Shouldn’t it be
    ——————————————————————————————-
    # Topic Number and Keywords
    topic_num_keywords = sent_topics_sorteddf_mallet[['Topic_Num', 'Keywords']]
    ——————————————————————————————-

    since I think what we need is the representative topic and keywords per topic, instead of per document.
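The commenter's point can also be reached from the per-document table itself: df_topic_sents_keywords has one row per document, so each topic's keywords repeat, and the per-topic table is just the de-duplicated pair of columns. A sketch with toy rows (column names as in the tutorial):

```python
import pandas as pd

# one row per document; each topic's keywords repeat across its documents
df_topic_sents_keywords = pd.DataFrame({
    'Dominant_Topic': [0, 0, 1, 1, 1],
    'Topic_Keywords': ['car, engine'] * 2 + ['game, team'] * 3,
})

# dropping duplicates collapses it to one row per topic
topic_num_keywords = (df_topic_sents_keywords[['Dominant_Topic', 'Topic_Keywords']]
                      .drop_duplicates()
                      .reset_index(drop=True))
print(len(topic_num_keywords))  # 2
```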

  39. Hi,
    thank you for this wonderful post, it is really helpful.
    I have a doubt: for example, I found that my optimal number of topics is 5, and we are given topic_perc_contribution. I want to print only the 5 topics; is there any chance to see each document's contribution to the (5) topics?
    I also tried to get the dataframe sorted by the topic percentage contribution, but I am getting an error.
    Can you please help me?
    Thank you
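On the sorting error: with the tutorial's per-document table, sorting by the percentage-contribution column is a single sort_values call (column names assumed from the tutorial's df_dominant_topic; toy rows here):

```python
import pandas as pd

df_dominant_topic = pd.DataFrame({
    'Dominant_Topic': [1, 0, 4],
    'Topic_Perc_Contrib': [0.45, 0.90, 0.70],
})

# highest-contribution documents first; note the column name must match
# your dataframe exactly, or pandas raises a KeyError
df_sorted = df_dominant_topic.sort_values('Topic_Perc_Contrib', ascending=False)
print(df_sorted['Dominant_Topic'].tolist())  # [0, 4, 1]
```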

  40. And one more doubt: what exactly does the keywords column represent? Do the keywords in the 1st row represent the keywords of the 1st topic, and the 2nd row the keywords of topic 2?

    Thank you,

  41. Hi,
    thank you for this great tutorial. I used this method to analyze Arabic reviews to extract aspects, in order to build an aspect-based sentiment analyzer. Do you have any suggestions to improve my approach?

    Thank you so much.

    Ahmed / Baghdad

    1. Hi Ahmed,

      I don't have an immediate suggestion. It would be interesting to see it applied to the Arabic language. Keep us posted.

      1. Yes, it works very well, but I want to find a good Arabic lemmatizer to increase accuracy; otherwise I shall use only the ISRI stemmer.

        The topics are overlapping, but in aspect-based extraction this is not a drawback; it is useful as a measure of the intersection between aspects.

  42. Hi Selva. This is an amazing post. Could you see my coherence plot here:

    The plot sharply increases up to 8 topics, but then there is a sudden drop for a while until it hits 20 topics, at which point there is a slow climb again.

    https://drive.google.com/file/d/14ST-hK94RfzHvN0G4hXa173vDZn6TfAJ/view?usp=sharing

    I assumed that it would keep increasing until it begins to flatten out as in your example. Would you say that my model fit is poor based on this plot?

    Sincerely,
    Allan

  43. Hi Selva ,

    Thank you for posting this great article.

    In section "18. Finding the dominant topic in each sentence", how can I get the top 'n' topics for each document?

    Regards,
    mahipal
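One way to get the top n topics per document: gensim's lda_model.get_document_topics(bow) (equivalently lda_model[corpus[i]]) returns (topic_id, probability) pairs, so sorting by probability and slicing gives the top n. Sketched on a hand-made pair list so it runs without a trained model:

```python
def top_n_topics(doc_topics, n=3):
    # doc_topics: list of (topic_id, probability) pairs, as returned by
    # gensim's get_document_topics(); sort descending by probability
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:n]

doc_topics = [(0, 0.05), (3, 0.60), (7, 0.25), (9, 0.10)]
print(top_n_topics(doc_topics, n=2))  # [(3, 0.6), (7, 0.25)]
```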

  44. Hey Selva, this is excellent work! I learned a lot. In an administrator CMD, I was able to install spacy's en model using the command below, with the following success message. However, when running in PyCharm, I get an error message. Any advice?

    (1) Install spacy en command:
    py -3 -m spacy download en

    (2) message in CMD:
    Linking successful
    C:\Users\dannylim\AppData\Local\Programs\Python\Python36-32\lib\site-packages\en_core_web_sm
    –>
    C:\Users\dannylim\AppData\Local\Programs\Python\Python36-32\lib\site-packages\spacy\data\en

    You can now load the model via spacy.load('en')

    (3) error message after running in Pycharm
    raise IOError(Errors.E050.format(name=name))
    OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.
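A common cause of E050 after a successful command-line install is that PyCharm runs a different interpreter (often a project virtualenv) than the py -3 used in CMD, so the model is linked for the wrong Python. Checking which interpreter PyCharm actually uses narrows this down (a diagnostic sketch, not a guaranteed fix):

```python
import sys

# Run this inside PyCharm: it prints the interpreter PyCharm uses.
# If it differs from the Python that ran `py -3 -m spacy download en`,
# install the model for this one:  <printed path> -m spacy download en
print(sys.executable)
```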

  45. This is fantastic, thank you!
    I do have a question though – I’m a complete novice at this and I’m not sure how to use my own data input.
    I have a collection of .txt files in a folder named ‘input’ – how would I go about replacing the newsgroup data with my own? Thank you!
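A minimal way to swap in your own documents: read every .txt in the folder into a list of strings, then use that list wherever the tutorial uses the newsgroup texts (the folder name 'input' comes from the question; the encoding is an assumption):

```python
import glob
import io
import os

def load_texts(folder):
    # read each .txt file in `folder` into one string per document
    docs = []
    for path in sorted(glob.glob(os.path.join(folder, '*.txt'))):
        with io.open(path, encoding='utf-8', errors='ignore') as f:
            docs.append(f.read())
    return docs

data = load_texts('input')  # then continue the tutorial with `data`
```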

    1. Ok, I’ve managed to fix that problem.
      But I have a new one: the stop words aren't actually being removed. I'm getting topics full of words that are in the NLTK English stop word list, and I can't for the life of me work out what's wrong. I've checked and re-checked that I've copied the code over correctly, and I have. Any ideas what I'm doing wrong?
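A frequent cause of stop words surviving is a case mismatch: NLTK's stop word list is lowercase, so tokens must be lowercased before (or during) the comparison. A sketch of a case-safe removal step (the stop word set is truncated here; with NLTK you would use set(stopwords.words('english'))):

```python
stop_words = {'the', 'is', 'and'}  # stand-in for NLTK's English list

def remove_stopwords(docs, stop_words):
    # lowercase each token before the membership test, otherwise 'The'
    # sails past a lowercase stop word list
    return [[w for w in doc if w.lower() not in stop_words] for doc in docs]

docs = [['The', 'car', 'is', 'fast']]
print(remove_stopwords(docs, stop_words))  # [['car', 'fast']]
```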

  46. Hey Selva

    Thanks for the nice post.
    I am new to NLP; I tried all your steps but I am getting the same issue.
    Please help me.

    CalledProcessError: returned non-zero exit status 1.

  47. Thank you so much for this post, Selva. It's so helpful.

    I tried to implement the LDA Mallet model but am getting the same error I noticed some others have had here. I have tried out the suggestion but the issue persists. Do you see anything unusual in the error?

    Command '/Users/bliss/Documents/Personal/Project/mallet-2.0.8/bin/mallet import-file --preserve-case --keep-sequence --remove-stopwords --token-regex "\S+" --input /var/folders/g1/13w50y9904g6591cfzm08xvc0000gn/T/5d31c1_corpus.txt --output /var/folders/g1/13w50y9904g6591cfzm08xvc0000gn/T/5d31c1_corpus.mallet' returned non-zero exit status 1.

    This is my path, because the mallet folder is in the same directory as my notebook:

    mallet_path = '/Users/bliss/Documents/Personal/Project/mallet-2.0.8/bin/mallet'
    ldamallet = gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

  48. This is really helpful, thank you so much.

    I need a little help with my error: while doing lemmatization I am getting a "list index out of range" error.

    Can you please look at where I am going wrong in this code?

    My list contains 99977 elements, and I have used your lemmatization function only:

    lemmatization(list of cleaned documents, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    Thanks in advance.

  49. Hey Selva, Excellent reading. I have a doubt.

    I am planning to extract keywords from restaurant reviews and find the proportion of the 3 topics they discuss (food quality, service, price).

    How can I map keywords to topics programmatically?

    In your article you have mentioned “Likewise, can you go through the remaining topic keywords and judge what the topic is?”

    Should I manually figure out the topic which the review is talking about using the keywords?

  50. “Mallet’s version, however, often gives a better quality of topics.”

    It would have been better if you had specified the reason behind this: the default number of passes in Mallet's LDA is 1000, while you are running the normal LDA with 10 passes.

  51. “Mallet’s version, however, often gives a better quality of topics.”

    The reason behind this is that the default number of passes in Mallet's LDA is 1000, while you are running the normal LDA with 10 passes.

  52. Hi Selva,

    This is an amazing post. I have the same question as Sanjay Krishnamurthy. I have more than 10K news articles and social media feeds, and I want to assign each article to a list of topics, e.g. price, quality, service, etc.

    Please suggest a solution, as I am new to NLP and Python.


  54. Hi,
    Thank you for this amazing article! It helps me understand the LDA Mallet topic modelling implementation really well.
    I have one doubt though: why is the number of topic_keywords fixed at 10 per topic? Is there any way this can be variable? Say, the most semantically coherent words are clustered together and the cluster size is variable.

    Please guide me through this; if LDA Mallet is not an effective option for this problem, kindly suggest better leads.

    Thanks in advance!
