# LDA in Python – How to grid search best topic models?

Python’s scikit-learn provides a convenient interface for topic modeling using algorithms like Latent Dirichlet Allocation (LDA), LSI and Non-Negative Matrix Factorization. In this tutorial, you will learn how to build the best possible LDA topic model and explore how to showcase the outputs as meaningful results.

## 1. Introduction

In the last tutorial you saw how to build topic models with LDA using gensim. In this tutorial, however, I am going to use Python’s most popular machine learning library – scikit-learn.

With scikit-learn, you have an entirely different interface, and with grid search and vectorizers you have a lot of options to explore in order to find the optimal model and to present the results.

In this tutorial, you will learn:

1. How to clean and process text data?
2. How to prepare the text documents to build topic models with scikit learn?
3. How to build a basic topic model using LDA and understand the params?
4. How to extract the topic’s keywords?
5. How to gridsearch and tune for optimal model?
6. How to get the dominant topics in each document?
7. Review and visualize the topic keywords distribution
8. How to predict the topics for a new piece of text?
9. Cluster the documents based on topic distribution
10. How to get most similar documents based on topics discussed?

A lot of exciting stuff ahead. Let’s roll!

## 2. Load the packages

The core package used in this tutorial is scikit-learn (sklearn).

Regular expressions (re), gensim and spacy are used to process text; pyLDAvis and matplotlib are used for visualization; and numpy and pandas are used for manipulating and viewing data in tabular format.

Let’s import them.

# Run in terminal or command prompt
# python3 -m spacy download en

import numpy as np
import pandas as pd
import re, nltk, spacy, gensim

# Sklearn
from sklearn.decomposition import LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from pprint import pprint

# Plotting tools
import pyLDAvis
import pyLDAvis.sklearn
import matplotlib.pyplot as plt
%matplotlib inline


## 3. Import Newsgroups Text Data

I will be using the 20-Newsgroups dataset for this. This version of the dataset contains about 11k newsgroups posts from 20 different topics. This is available as newsgroups.json.

Since it is in a json format with a consistent structure, I am using pandas.read_json() and the resulting dataset has 3 columns as shown.

# Import Dataset (path assumes newsgroups.json is in the working directory)
df = pd.read_json('newsgroups.json')
print(df.target_names.unique())

['rec.autos' 'comp.sys.mac.hardware' 'rec.motorcycles' 'misc.forsale'
'comp.os.ms-windows.misc' 'alt.atheism' 'comp.graphics'
'rec.sport.baseball' 'rec.sport.hockey' 'sci.electronics' 'sci.space'
'talk.politics.misc' 'sci.med' 'talk.politics.mideast'
'soc.religion.christian' 'comp.windows.x' 'comp.sys.ibm.pc.hardware'
'talk.politics.guns' 'talk.religion.misc' 'sci.crypt']

df.head(15)


Input – 20Newsgroups

## 4. Remove emails and newline characters

The text contains many email addresses, newline characters and extra spaces, which are quite distracting. Let’s get rid of them using regular expressions.

# Convert to list
data = df.content.values.tolist()

# Remove Emails (raw strings avoid invalid-escape warnings)
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]

# Remove new line characters
data = [re.sub(r'\s+', ' ', sent) for sent in data]

# Remove distracting single quotes
data = [re.sub(r"'", "", sent) for sent in data]

pprint(data[:1])

['From: (wheres my thing) Subject: WHAT car is this!? Nntp-Posting-Host: '
'rac3.wam.umd.edu Organization: University of Maryland, College Park Lines: '
'15 I was wondering if anyone out there could enlighten me on this car I saw '
'the other day. It was a 2-door sports car, looked to be from the late 60s/ '
'early 70s. It was called a Bricklin. The doors were really small. In '
'addition, the front bumper was separate from the rest of the body. This is '
'all I know. If anyone can tellme a model name, engine specs, years of '
'production, where this car is made, history, or whatever info you have on '
'this funky looking car, please e-mail. Thanks, - IL ---- brought to you by '
'your neighborhood Lerxst ---- ']
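To see what each substitution does in isolation, here is a minimal sketch on a made-up string. The `clean_text` helper and the sample text are illustrative assumptions and not part of the tutorial's pipeline.

```python
import re

def clean_text(text):
    """Apply the three clean-up substitutions from this section, in order."""
    text = re.sub(r'\S*@\S*\s?', '', text)   # drop any whitespace-delimited token containing '@'
    text = re.sub(r'\s+', ' ', text)         # collapse newlines, tabs and runs of spaces into one space
    text = re.sub(r"'", '', text)            # drop single quotes
    return text

# Hypothetical sample, made up for illustration
sample = "From: jane@example.com\nSubject: What's   this car?\n"
cleaned = clean_text(sample)
print(repr(cleaned))
```

The email pattern removes the whole token around the `@`, which is why the header in the output above reads `From: (wheres my thing)` with no address left in it.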


## 5. Tokenize and Clean-up using gensim’s simple_preprocess()

The sentences look better now, but you want to tokenize each sentence into a list of words, removing punctuation and unnecessary characters altogether.

Gensim’s simple_preprocess() is great for this: its tokenizer lowercases the text and discards punctuation. Additionally, I have set deacc=True to strip accent marks from the tokens.

def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True strips accent marks

data_words = list(sent_to_words(data))

print(data_words[:1])

[['from', 'wheres', 'my', 'thing', 'subject', 'what', 'car', 'is', 'this', 'nntp', 'posting', 'host', 'rac', 'wam', 'umd', 'edu', 'organization', 'university', 'of', 'maryland', 'college', 'park', 'lines', 'was', 'wondering', 'if', 'anyone', 'out', 'there', 'could', 'enlighten', 'me', 'on', 'this', 'car', 'saw', 'the', 'other', 'day', (..truncated..)]]
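If you want a feel for what this step does without installing gensim, the sketch below approximates it with the standard library only. It is an assumption-laden imitation, not gensim's implementation: `rough_preprocess` and `strip_accents` are hypothetical names, and gensim's real tokenizer differs in details (for example, in how it treats underscores).

```python
import re
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, roughly what deacc=True asks for."""
    decomposed = unicodedata.normalize('NFD', text)
    kept = ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    return unicodedata.normalize('NFC', kept)

def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, keep alphabetic tokens of a sensible length."""
    text = strip_accents(text.lower())
    tokens = re.findall(r'[^\W\d_]+', text)   # runs of letters only: digits and punctuation split tokens
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]

# Hypothetical sample, made up for illustration
tokens = rough_preprocess("WHAT car is this!? I saw a Café, 2-door")
print(tokens)
```

Single-letter tokens such as "I" and "a" fall below the minimum length and digits never form tokens, which matches the real output above, where "15 I was" is reduced to 'was'.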


## 6. Lemmatization

Lemmatization is the process of converting each word to its root form.

For example: ‘Studying’ becomes ‘Study’, ‘Meeting’ becomes ‘Meet’, and ‘Better’ and ‘Best’ become ‘Good’.

The advantage of this is that it reduces the total number of unique words in the dictionary. As a result, the document-word matrix (created by CountVectorizer in the next step) will be denser, with fewer columns.

You can expect better topics to be generated in the end.

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append(" ".join([token.lemma_ if token.lemma_ not in ['-PRON-'] else '' for token in doc if token.pos_ in allowed_postags]))
    return texts_out

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
# Run in terminal: python3 -m spacy download en
# (spaCy 3+ dropped the 'en' shortcut: download and load 'en_core_web_sm' instead)
nlp = spacy.load('en', disable=['parser', 'ner'])

# Do lemmatization keeping only Noun, Adj, Verb, Adverb
data_lemmatized = lemmatization(data_words, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

print(data_lemmatized[:2])

['where s  thing subject what car be nntp post host rac wam umd edu organization university maryland college park line be wonder anyone out there could enlighten car see other days be door sport car look be late early be call bricklin door be really small addition front bumper be separate rest body be know anyone can tellme model name engine spec year production where car be make history whatev info have funky look car mail thank bring  neighborhood lerxst' (..truncated..)]


## 7. Create the Document-Word matrix

The LDA topic model algorithm requires a document word matrix as the main input.

You can create one using CountVectorizer. In the code below, I have configured the CountVectorizer to consider only words that occur in at least 10 documents (min_df), remove built-in English stopwords, convert all words to lowercase, and require a word to consist of at least 3 letters or digits.

So, to create the doc-word matrix, you first initialise the CountVectorizer class with the required configuration and then apply fit_transform to actually create the matrix.

Since most cells contain zeros, the result is a sparse matrix, which saves memory.

If you want to materialize it as a 2D array, call the todense() method of the sparse matrix, as is done in the next step.

vectorizer = CountVectorizer(analyzer='word',
                             min_df=10,                        # min number of documents a word must appear in
                             stop_words='english',             # remove stop words
                             lowercase=True,                   # convert all words to lowercase
                             token_pattern='[a-zA-Z0-9]{3,}',  # num chars >= 3
                             # max_features=50000,             # max number of uniq words
                            )

data_vectorized = vectorizer.fit_transform(data_lemmatized)
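The effect of these settings is easier to see on a tiny corpus. The following is a sketch on made-up documents, with min_df lowered to 2 so that the toy corpus is not filtered away entirely; the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical four-document corpus, made up for illustration
toy_docs = ["car engine car",
            "car door",
            "engine spec spec spec 12",
            "ab car"]

toy_vectorizer = CountVectorizer(analyzer='word',
                                 min_df=2,                         # keep words found in at least 2 documents
                                 stop_words='english',
                                 lowercase=True,
                                 token_pattern='[a-zA-Z0-9]{3,}')  # tokens need 3+ letters or digits
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

toy_vocab = sorted(toy_vectorizer.vocabulary_)   # column order of the matrix
print(toy_vocab)
print(toy_matrix.toarray())
```

Note that 'spec' is dropped even though it occurs three times, because min_df counts the documents a word appears in rather than its total occurrences. '12' and 'ab' never become tokens because they are shorter than three characters.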


## 8. Check the Sparsicity

Sparsicity is simply the percentage of non-zero cells in the document-word matrix, that is, data_vectorized.

Since most cells in this matrix will be zero, I am interested in knowing what percentage of cells contain non-zero values.

# Materialize the sparse data
data_dense = data_vectorized.todense()

# Compute Sparsicity = Percentage of Non-Zero cells
print("Sparsicity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsicity:  0.775887569365 %
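Materializing the full matrix with todense() can exhaust memory on a large corpus. As a minimal sketch, the same percentage can be computed from the sparse matrix's own `nnz` and `shape` attributes. The toy matrix below stands in for data_vectorized, and the sketch assumes the matrix stores no explicit zeros, since `nnz` counts stored entries.

```python
from scipy.sparse import csr_matrix

# Toy stand-in for data_vectorized: 3 documents x 4 words
toy_sparse = csr_matrix([[0, 2, 0, 0],
                         [1, 0, 0, 0],
                         [0, 0, 0, 3]])

n_cells = toy_sparse.shape[0] * toy_sparse.shape[1]
pct_nonzero = toy_sparse.nnz / n_cells * 100      # nnz = number of stored (non-zero) entries
print("Sparsicity: ", pct_nonzero, "%")
```

On the real data, replacing `toy_sparse` with `data_vectorized` should give the same figure as the dense computation without allocating the dense array.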


## 9. Build LDA model with sklearn

Everything is ready to build a Latent Dirichlet Allocation (LDA) model. Let’s initialise one and call fit_transform() to build the LDA model.

For this example, I have set the number of topics, n_components (named n_topics before scikit-learn 0.19), to 20 based on prior knowledge about the dataset. Later we will find the optimal number using grid search.

# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=20,           # Number of topics (n_topics before scikit-learn 0.19)
                                      max_iter=10,               # Max learning iterations
                                      learning_method='online',
                                      random_state=100,          # Random state
                                      batch_size=128,            # n docs in each learning iter
                                      evaluate_every = -1,       # compute perplexity every n iters, default: Don't
                                      n_jobs = -1,               # Use all available CPUs
                                     )
lda_output = lda_model.fit_transform(data_vectorized)

print(lda_model)  # Model attributes

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7,
learning_method='online', learning_offset=10.0,
max_doc_update_iter=100, max_iter=10, mean_change_tol=0.001,
n_components=10, n_jobs=-1, n_topics=20, perp_tol=0.1,
random_state=100, topic_word_prior=None,
total_samples=1000000.0, verbose=0)


## 10. Diagnose model performance with perplexity and log-likelihood

A model with higher log-likelihood and lower perplexity (exp(-1. * log-likelihood per word)) is considered to be good. Let’s check for our model.

# Log Likelihood: Higher the better
print("Log Likelihood: ", lda_model.score(data_vectorized))

# Perplexity: Lower the better. Perplexity = exp(-1. * log-likelihood per word)
print("Perplexity: ", lda_model.perplexity(data_vectorized))

# See model parameters
pprint(lda_model.get_params())

Log Likelihood:  -9965645.21463
Perplexity:  2061.88393838
{'batch_size': 128,
'doc_topic_prior': None,
'evaluate_every': -1,
'learning_decay': 0.7,
'learning_method': 'online',
'learning_offset': 10.0,
'max_doc_update_iter': 100,
'max_iter': 10,
'mean_change_tol': 0.001,
'n_components': 10,
'n_jobs': -1,
'n_topics': 20,
'perp_tol': 0.1,
'random_state': 100,
'topic_word_prior': None,
'total_samples': 1000000.0,
'verbose': 0}
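The relation between the two numbers is the one quoted in the code comment above: perplexity is the exponential of the negative log-likelihood per word. Below is a small sketch of that arithmetic with made-up numbers; the function name and the toy values are assumptions, and in the real model the word count would be the sum of all entries in data_vectorized.

```python
import math

def perplexity_from_log_likelihood(log_likelihood, n_words):
    """Perplexity = exp(-1 * log-likelihood per word)."""
    return math.exp(-1.0 * log_likelihood / n_words)

# Made-up numbers: a corpus of 200 words with a total log-likelihood of -1200
toy_log_likelihood = -1200.0
toy_n_words = 200
toy_perplexity = perplexity_from_log_likelihood(toy_log_likelihood, toy_n_words)
print(round(toy_perplexity, 2))   # exp(6), about 403.43
```

Because the log-likelihood is divided by the word count, perplexity can be compared across corpora of different sizes, whereas the raw log-likelihood cannot.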


On a different note, perplexity might not be the best measure for evaluating topic models, because it doesn’t consider the context and semantic associations between words. These can be captured with a topic coherence measure; an example is described in the gensim tutorial I mentioned earlier.

## 11. How to GridSearch the best LDA model?

The most important tuning parameter for LDA models is n_components (number of topics). In addition, I am going to search learning_decay (which controls the learning rate) as well.

Besides these, other possible search params are learning_offset (which downweights early iterations and should be > 1) and max_iter. These could be worth experimenting with if you have enough computing resources.

Be warned, the grid search constructs multiple LDA models for all possible combinations of param values in the param_grid dict. So, this process can consume a lot of time and resources.
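To get a feel for the cost before running the search, the rough sketch below enumerates the grid used in this section with the standard library. The fold count is an assumption (GridSearchCV defaults to 5 folds in scikit-learn 0.22 and later, 3 in earlier versions), as is the extra fit for refitting the best model.

```python
from itertools import product

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Every combination of parameter values becomes one candidate model
combinations = [dict(zip(search_params, values))
                for values in product(*search_params.values())]

cv_folds = 5                                      # assumed number of cross-validation folds
total_fits = len(combinations) * cv_folds + 1     # +1 for the final refit on the best params

print(len(combinations), "candidates,", total_fits, "model fits")
```

Adding a third parameter with four values would multiply the candidate count by four, so it pays to keep the grid coarse at first and refine it around the best region.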

# Define Search Param
search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

# Init the Model
lda = LatentDirichletAllocation()

# Init Grid Search Class
model = GridSearchCV(lda, param_grid=search_params)

# Do the Grid Search
model.fit(data_vectorized)

GridSearchCV(cv=None, error_score='raise',
estimator=LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_components=10, n_jobs=1,
n_topics=None, perp_tol=0.1, random_state=None,
topic_word_prior=None, total_samples=1000000.0, verbose=0),
fit_params=None, iid=True, n_jobs=1,
param_grid={'n_topics': [10, 15, 20, 25, 30], 'learning_decay': [0.5, 0.7, 0.9]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=0)


## 12. How to see the best topic model and its parameters?

# Best Model
best_lda_model = model.best_estimator_

# Model Parameters
print("Best Model's Params: ", model.best_params_)

# Log Likelihood Score
print("Best Log Likelihood Score: ", model.best_score_)

# Perplexity
print("Model Perplexity: ", best_lda_model.perplexity(data_vectorized))

Best Model's Params:  {'learning_decay': 0.9, 'n_topics': 10}
Best Log Likelihood Score:  -3417650.82946
Model Perplexity:  2028.79038336


## 13. Compare LDA Model Performance Scores

Plotting the log-likelihood scores against num_topics clearly shows that 10 topics gives the best scores, and that a learning_decay of 0.7 outperforms both 0.5 and 0.9.

This makes me think that, even though we know the dataset has 20 distinct topics to start with, some topics could share common keywords. For example, ‘alt.atheism’ and ‘soc.religion.christian’ can have a lot of common words. The same goes for ‘rec.motorcycles’ and ‘rec.autos’, or ‘comp.sys.ibm.pc.hardware’ and ‘comp.sys.mac.hardware’. You get the idea.

To tune this even further, you can do a finer grid search for number of topics between 10 and 15. But I am going to skip that for now.

So the bottom line is, a lower optimal number of distinct topics (even 10 topics) may be reasonable for this dataset. I don’t know that yet. But LDA says so. Let’s see.

# Get Log Likelihoods from Grid Search Output
# (cv_results_ replaces grid_scores_, which was removed in scikit-learn 0.20)
n_topics = [10, 15, 20, 25, 30]
results = pd.DataFrame(model.cv_results_).sort_values('param_n_components')
log_likelihoods_5 = [round(score) for score in results.loc[results['param_learning_decay'] == 0.5, 'mean_test_score']]
log_likelihoods_7 = [round(score) for score in results.loc[results['param_learning_decay'] == 0.7, 'mean_test_score']]
log_likelihoods_9 = [round(score) for score in results.loc[results['param_learning_decay'] == 0.9, 'mean_test_score']]

# Show graph
plt.figure(figsize=(12, 8))
plt.plot(n_topics, log_likelihoods_5, label='0.5')
plt.plot(n_topics, log_likelihoods_7, label='0.7')
plt.plot(n_topics, log_likelihoods_9, label='0.9')
plt.title("Choosing Optimal LDA Model")
plt.xlabel("Num Topics")
plt.ylabel("Log Likelihood Scores")
plt.legend(title='Learning decay', loc='best')
plt.show()


Grid Search Topic Models

## 14. How to see the dominant topic in each document?

To classify a document as belonging to a particular topic, a logical approach is to see which topic has the highest contribution to that document and assign it.

In the table below, I’ve greened out all major topics in a document and assigned the most dominant topic in its own column.

# Create Document - Topic Matrix
lda_output = best_lda_model.transform(data_vectorized)

# column names
topicnames = ["Topic" + str(i) for i in range(best_lda_model.n_components)]

# index names
docnames = ["Doc" + str(i) for i in range(len(data))]

# Make the pandas dataframe
df_document_topic = pd.DataFrame(np.round(lda_output, 2), columns=topicnames, index=docnames)

# Get dominant topic for each document
dominant_topic = np.argmax(df_document_topic.values, axis=1)
df_document_topic['dominant_topic'] = dominant_topic

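# Styling
def color_green(val):
    color = 'green' if val > .1 else 'black'
    return 'color: {col}'.format(col=color)

def make_bold(val):
    weight = 700 if val > .1 else 400
    return 'font-weight: {weight}'.format(weight=weight)

# Apply Style (to the first 15 rows; the row count is an arbitrary choice)
df_document_topics = df_document_topic.head(15).style.applymap(color_green).applymap(make_bold)
df_document_topics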


Document Topic Weights: df_document_topics
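The dominant-topic logic boils down to an argmax over each row of the document-topic matrix. Here is a minimal sketch on a made-up matrix that also shows how to keep the top two topics per document and how to count documents per dominant topic; the array values and variable names are illustrative assumptions.

```python
import numpy as np

# Made-up document-topic matrix: 4 documents x 3 topics, each row sums to 1
doc_topic = np.array([[0.10, 0.70, 0.20],
                      [0.60, 0.30, 0.10],
                      [0.25, 0.15, 0.60],
                      [0.02, 0.90, 0.08]])

# Dominant topic: column index of the largest weight in each row
dominant = doc_topic.argmax(axis=1)

# Top 2 topics per document, strongest first
top2 = np.argsort(-doc_topic, axis=1)[:, :2]

# Number of documents assigned to each topic
docs_per_topic = np.bincount(dominant, minlength=doc_topic.shape[1])

print(dominant, top2, docs_per_topic, sep="\n")
```

Keeping more than one topic per document is useful when a document's weight is split fairly evenly, in which case the single dominant topic hides much of the picture.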

## 15. Review topics distribution across documents

df_topic_distribution = df_document_topic['dominant_topic'].value_counts().reset_index(name="Num Documents")
df_topic_distribution.columns = ['Topic Num', 'Num Documents']
df_topic_distribution


Topic Document Distribution: df_topic_distribution

## 16. How to visualize the LDA model with pyLDAvis?

pyLDAvis offers the best visualization for viewing the topic-keyword distributions.

A good topic model will have non-overlapping, fairly big sized blobs for each topic. This seems to be the case here. So, we are good.

pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(best_lda_model, data_vectorized, vectorizer, mds='tsne')
panel


Visualize Topic Distribution using pyLDAvis

## 17. How to see the Topic’s keywords?

The weights of each keyword in each topic are contained in lda_model.components_ as a 2D array. The names of the keywords themselves can be obtained from the vectorizer object using get_feature_names() (replaced by get_feature_names_out() from scikit-learn 1.0 onwards).

Let’s use this info to construct a weight matrix for all keywords in each topic.

# Topic-Keyword Matrix
df_topic_keywords = pd.DataFrame(best_lda_model.components_)

# Assign Column and Index
df_topic_keywords.columns = vectorizer.get_feature_names()
df_topic_keywords.index = topicnames

# View
df_topic_keywords.head()


Topic Word Weights: df_topic_keywords

## 18. Get the top 15 keywords for each topic

From the above output, I want to see the top 15 keywords that are representative of each topic.

The show_topics() function defined below does that.

# Show top n keywords for each topic
def show_topics(vectorizer=vectorizer, lda_model=lda_model, n_words=20):
    keywords = np.array(vectorizer.get_feature_names())
    topic_keywords = []
    for topic_weights in lda_model.components_:
        top_keyword_locs = (-topic_weights).argsort()[:n_words]
        topic_keywords.append(keywords.take(top_keyword_locs))
    return topic_keywords

topic_keywords = show_topics(vectorizer=vectorizer, lda_model=best_lda_model, n_words=15)

# Topic - Keywords Dataframe
df_topic_keywords = pd.DataFrame(topic_keywords)
df_topic_keywords.columns = ['Word '+str(i) for i in range(df_topic_keywords.shape[1])]
df_topic_keywords.index = ['Topic '+str(i) for i in range(df_topic_keywords.shape[0])]
df_topic_keywords


Top 15 topic keywords
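The core of this step is sorting each topic's weight row in descending order and taking the first few positions. The sketch below shows it on a made-up 2-topic, 4-word weight matrix; the vocabulary and weights are assumptions chosen for illustration. It also normalizes each row so the weights read as a probability distribution over words.

```python
import numpy as np

# Made-up vocabulary and topic-word weights (2 topics x 4 words)
keywords = np.array(['car', 'engine', 'god', 'bible'])
components = np.array([[5.0, 3.0, 0.1, 0.2],
                       [0.2, 0.1, 4.0, 6.0]])

n_words = 2
# Negating the weights makes argsort return the largest weights first
top_words = [keywords.take((-row).argsort()[:n_words]).tolist() for row in components]

# Row-normalize so each topic's weights sum to 1
topic_word_probs = components / components.sum(axis=1, keepdims=True)

print(top_words)
print(np.round(topic_word_probs, 2))
```

The raw weights are not probabilities, so normalizing is worthwhile whenever you want to compare how concentrated different topics are.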

## 19. How to predict the topics for a new piece of text?

Assuming that you have already built the topic model, you need to take the text through the same routine of transformations before predicting the topic.

For our case, the order of transformations is:

sent_to_words() –> lemmatization() –> vectorizer.transform() –> best_lda_model.transform()

You need to apply these transformations in the same order. So to simplify it, let’s combine these steps into a predict_topic() function.

# Define function to predict topic for a given text document.
nlp = spacy.load('en', disable=['parser', 'ner'])

def predict_topic(text, nlp=nlp):
    # Step 1: Clean with simple_preprocess
    mytext_2 = list(sent_to_words(text))

    # Step 2: Lemmatize
    mytext_3 = lemmatization(mytext_2, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

    # Step 3: Vectorize transform
    mytext_4 = vectorizer.transform(mytext_3)

    # Step 4: LDA Transform
    topic_probability_scores = best_lda_model.transform(mytext_4)
    topic = df_topic_keywords.iloc[np.argmax(topic_probability_scores), :].values.tolist()
    return topic, topic_probability_scores

# Predict the topic
mytext = ["Some text about christianity and bible"]
topic, prob_scores = predict_topic(text = mytext)
print(topic)

['say', 'god', 'people', 'write', 'think', 'know', 'believe', 'christian', 'make', 'subject', 'line', 'good', 'just', 'organization', 'thing']


mytext has been allocated to the topic with religion- and Christianity-related keywords, which makes sense.
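The key point is that new text is only transformed, never fitted: the vocabulary and the topics stay exactly as learned from the training corpus. One way to make that hard to get wrong is to chain the vectorizer and the model in a scikit-learn pipeline. The sketch below does this on a made-up six-document corpus; it skips the gensim and spacy steps, and all names and texts in it are assumptions for illustration.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training corpus
toy_docs = ["god bible church faith",
            "bible church prayer god",
            "faith prayer church bible",
            "car engine wheel brake",
            "engine brake car road",
            "wheel road car engine"]

toy_pipeline = make_pipeline(
    CountVectorizer(stop_words='english'),
    LatentDirichletAllocation(n_components=2, random_state=100),
)
toy_pipeline.fit(toy_docs)                      # learns the vocabulary and the topics once

# New text goes through transform() only, reusing the fitted vocabulary
new_probs = toy_pipeline.transform(["the bible and the church"])
print(new_probs.round(2))
```

Words in the new text that are missing from the fitted vocabulary are simply ignored, so a text with no known words gives an uninformative topic distribution.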

## 20. How to cluster documents that share similar topics and plot?

You can use k-means clustering on the document-topic probability matrix, which is nothing but the lda_output object. I’ve set n_clusters=15 in KMeans(); the number of clusters is chosen independently of the number of topics (the best model found above has 10).

Alternately, you could avoid k-means and instead, assign the cluster as the topic column number with the highest probability score.

We now have the cluster number. But we also need the X and Y columns to draw the plot.

For the X and Y, you can use SVD on the lda_output object with n_components as 2. SVD ensures that these two columns capture the maximum possible amount of information from lda_output.

# Construct the k-means clusters
from sklearn.cluster import KMeans
clusters = KMeans(n_clusters=15, random_state=100).fit_predict(lda_output)

# Build the Singular Value Decomposition(SVD) model
svd_model = TruncatedSVD(n_components=2)  # 2 components
lda_output_svd = svd_model.fit_transform(lda_output)

# X and Y axes of the plot using SVD decomposition
x = lda_output_svd[:, 0]
y = lda_output_svd[:, 1]

# Weights of the topic columns of lda_output, for each component
print("Component's weights: \n", np.round(svd_model.components_, 2))

# Percentage of total information in 'lda_output' explained by the two components
print("Perc of Variance Explained: \n", np.round(svd_model.explained_variance_ratio_, 2))

Component's weights:
[[ 0.08  0.23  0.24  0.14  0.2   0.85  0.09  0.19  0.07  0.2 ]
[ 0.02 -0.1   0.9   0.16  0.16 -0.32 -0.01 -0.01  0.13  0.09]]
Perc of Variance Explained:
[ 0.09  0.21]
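To make the projection less of a black box, here is a sketch of the underlying linear algebra with plain NumPy on a made-up document-topic matrix. It is not a replacement for TruncatedSVD, whose solver and sign conventions may differ; the matrix and the variable names are assumptions.

```python
import numpy as np

# Made-up document-topic matrix: 4 documents x 3 topics
toy_doc_topic = np.array([[0.8, 0.1, 0.1],
                          [0.7, 0.2, 0.1],
                          [0.1, 0.8, 0.1],
                          [0.1, 0.1, 0.8]])

# Full SVD: toy_doc_topic = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(toy_doc_topic, full_matrices=False)

# Project every document onto the first two right singular vectors
coords = toy_doc_topic @ Vt[:2].T
toy_x, toy_y = coords[:, 0], coords[:, 1]

print(np.round(Vt[:2], 2))    # component weights, one row per component
print(np.round(coords, 2))    # 2D coordinates, one row per document
```

TruncatedSVD does not center the data first, so the first component largely tracks the average topic mix. That is one plausible reason why the first component in the output above explains less variance than the second.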


We have the X, Y and the cluster number for each document.

Let’s plot the documents along the two SVD decomposed components. The color of points represents the cluster number (in this case) or topic number.

# Plot
plt.figure(figsize=(12, 12))
plt.scatter(x, y, c=clusters)
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.title("Segregation of Topic Clusters")


Topic Clusters

## 21. How to get similar documents for any given piece of text?

Once you know the probability of topics for a given document (using predict_topic()), compute the Euclidean distance to the probability scores of all other documents.

The most similar documents are the ones with the smallest distance.

from sklearn.metrics.pairwise import euclidean_distances

nlp = spacy.load('en', disable=['parser', 'ner'])

def similar_documents(text, doc_topic_probs, documents = data, nlp=nlp, top_n=5, verbose=False):
    topic, x  = predict_topic(text)
    dists = euclidean_distances(x.reshape(1, -1), doc_topic_probs)[0]
    doc_ids = np.argsort(dists)[:top_n]
    if verbose:
        print("Topic KeyWords: ", topic)
        print("Topic Prob Scores of text: ", np.round(x, 1))
        print("Most Similar Doc's Probs:  ", np.round(doc_topic_probs[doc_ids], 1))
    return doc_ids, np.take(documents, doc_ids)

# Get similar documents
mytext = ["Some text about christianity and bible"]
doc_ids, docs = similar_documents(text=mytext, doc_topic_probs=lda_output, documents = data, top_n=1, verbose=True)
print('\n', docs[0][:500])

Topic KeyWords:  ['say', 'god', 'people', 'write', 'think', 'know', 'believe', 'christian', 'make', 'subject', 'line', 'good', 'just', 'organization', 'thing']
Topic Prob Scores of text:  [[ 0.   0.   0.8  0.   0.   0.   0.   0.   0.   0. ]]
Most Similar Doc's Probs:   [[ 0.   0.   0.8  0.   0.   0.   0.1  0.   0.   0. ]]

From: Subject: about Eliz C Prophet Lines: 21 Rob Butera asks about a book called THE LOST YEARS OF JESUS, by Elizabeth Clare Prophet. I do not know the book. However, Miss Prophet is the leader of a group (The Church Universal and Triumphant) derived from the I AM group founded by a Mr. Ballard who began his mission in the 1930s (I am writing this from memory and may not have all the details straight -- for an old account, check your library for a book by Marcus Bach) after an eighteenth-centu
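Stripped of the preprocessing, the similarity search is a distance computation followed by a sort. Here is a minimal NumPy sketch on made-up topic distributions; the vectors and names are illustrative assumptions.

```python
import numpy as np

# Made-up topic distribution of the query text (3 topics)
query_probs = np.array([0.0, 0.8, 0.2])

# Made-up topic distributions of three corpus documents
corpus_probs = np.array([[0.10, 0.70, 0.20],
                         [0.90, 0.05, 0.05],
                         [0.00, 0.80, 0.20]])

# Euclidean distance from the query to every document
dists = np.linalg.norm(corpus_probs - query_probs, axis=1)

# Indices of the two closest documents, nearest first
nearest = np.argsort(dists)[:2]
print(nearest, np.round(dists, 3))
```

Document 2 has the same topic mix as the query, so it comes first with distance 0, followed by document 0, which differs only slightly.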


## 22. Conclusion

We’ve covered some cutting-edge topic modeling approaches in this post. If you managed to work through it, well done. For those concerned about the time, memory consumption and variety of topics when building topic models, check out the gensim tutorial on LDA.

I will meet you with a new tutorial next week.

## 28 Comments on “LDA in Python – How to grid search best topic models?”

1. Awesome tutorial again, just like the one on Gensim LDA. It’s nice to have several implementations at hand!

2. Thank you, a very useful tutorial! Would it be possible to share a Jupyter notebook with the code?

3. Awesome and very useful! Is it possible to use Mallet LDA here? And may I ask which one (compared with Gensim LDA) is more widely used in practice?

Thanks!!!

1. Thanks Gshe! I don’t think a sklearn wrapper for Mallet LDA exists as of today. Between gensim and sklearn, it’s a matter of convenience. Personally, since gensim was conceived and built for topic modeling, it does a good job at it in terms of speed and facilities.

4. Hi,
I am unable to load the model ‘en’ from spacy. I get the following error:

OSError: [E050] Can’t find model ‘en’. It doesn’t seem to be a shortcut link, a Python package or a valid path to a data directory.

I tried running ‘python3 -m spacy download en’ from terminal but I got another error ‘An existing connection was forcibly closed by the remote host’. Please help

You certainly need to run ‘python3 -m spacy download en’ in order to load the ‘en’ model. There could be many reasons for the ‘An existing connection was forcibly closed by the remote host’ message, but the most common is probably a poor network connection.

5. Dear Selva,

Thanks for such a great tutorial. I ran the codes within the article, but ran into some problems at some steps. For instance, I printed out the value of best_lda_model.n_topics, which is ‘None’, (empty value). I am curious how this parameter is set, is it set during the parameter tuning step?

Besides, I also was not able to load the model ‘en’ from spacy, but I searched on Internet and got a workable fix. I am providing my fix here, because it may be useful to others. First, to visit the following page:
https://github.com/explosion/spacy-models/releases/tag/en_core_web_sm-2.0.0
and then download en_core_web_sm-2.0.0.tar.gz on the page. Next, to install the package using:
pip install en_core_web_sm-2.0.0.tar.gz.

1. Thanks Xiaowei.

It probably has to do with how you set up your search parameters when you define search_params.

search_params = {'n_components': [10, 15, 20, 25, 30], 'learning_decay': [.5, .7, .9]}

In this case, LDA will grid search for n_components (or n topics) as 10, 15, 20, 25, 30.

Also, check if your corpus is intact inside data_vectorized just before starting model.fit(data_vectorized). To be sure, run data_dense = data_vectorized.todense() and check few rows of data_dense.

1. Hi Selva,
great tutorial indeed!
I have got the same problem, but it depends on sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.LatentDirichletAllocation.html

n_topics : int, optional (default=None)

This parameter has been renamed to n_components and will be removed in version 0.21. .. deprecated:: 0.19

n_topics is still working, but deprecated when running the model, but it is not working anymore when it is called later.
just swap n_topics with n_components and it works

6. Thanks a lot selva learnt a lot of libraries and their uses
i have one question post the prediction of topic number how are u telling which all doc from the original file belong to a particular topic as we have only the doc number in out output and we are assigning random state

1. The topic that contributes to the most proportion to the document can be taken as the dominant topic for that doc. Based on your requirement, you can take more than 1 topic, perhaps the top 3 maybe.

7. Thanks for sharing this. It is a great article. I am relatively new to python. I followed your code but i got stuck
getting the 30 most common terms in the whole text with their weights for each topic.

8. I ran out of memory creating the dense matrix for measuring sparsity. So experimented and found it is not necessary to create a dense matrix to measure sparsity – just use the properties of the sparse matrix:

shape : 2-tuple
Shape of the matrix
nnz
Number of nonzero elements

print("Sparsity: ", (data_vectorized.nnz / (data_vectorized.shape[0] * data_vectorized.shape[1]))*100, "%")

9. Hi Selva,
Thanks for this excellent tutorial! It is really helpful!
While I was using GridSearchCV(), the optimal number of topics will finally converge to 1, which does not really make sense. And I see that in your output, 10 is optimal within [10, 15, 20, 25, 30]. So what will happen if continuing exploring (e.g. [1, 3, 5, 7, 10])? Will it converge to 1?
Thanks!

1. This is a subjective matter. Some pointers: How many topics do you think are there in your document? Are they different from each other or are they very similar?

Secondly, you don’t have to set the number of topics to 1. In the approach discussed, related topics tend to get clubbed together as one in sklearn. If you want to keep them separate, check out gensim.

The more approaches you have, the more control you have.

10. Hi Ivyswius

I observed this too. It’s not really appropriate to measure likelihood; as you’ve seen (and I did too), it’s increasingly likely you will find a place in a topic as you reduce the number. I think Selva needs to reconsider his advice here. In the Gensim tutorial he uses coherence, which delivers a proper measure, but I’ve spent some time looking for a way to do this with sklearn and haven’t found one yet. In the end, the topic modelling capabilities of Gensim or Mallet in the other tutorial seem more fit for purpose.

11. Excellent article and learnt a lot from this article. Thanks a lot for a detailed article.
Quick question. I created a project by following your code. I am able to get a good topic representation for a given data set. This document is about 2000 words and lets call it doc A. Now I have 10 document with average words around 2000 in each of the document (Similar to doc A) . I believe I need to follow the steps mentioned in Section 19 .(How to predict the topics for a new piece of text?) to get the topics for the 10 documents.

Here my question is

Are we reusing the same model (best_lda_model ) for the remaining 10 docs.

If yes, how did “Some text about christianity and bible” as input to best_lda_model produced the topic.

[‘say’, ‘god’, ‘people’, ‘write’, ‘think’, ‘know’, ‘believe’, ‘christian’, ‘make’, ‘subject’, ‘line’, ‘good’, ‘just’, ‘organization’, ‘thing’]

as these topic words are not made available in the source text. I believe topic modelling is unsupervised and there is no labelled data as input to the model. Again, correct me if I am wrong.

(In other words, does the 20-Newsgroups dataset has a foot print in the best_lda_model for a new piece of text?

Again, a wonderful post and kudos to you…

12. Thanks for this good example!

May I ask you why you created the k mean with n_clusters = 15? I thought the number of the topics in the model is 10.

13. Excellent article. I have been learning a lot. Keep up the good work. Kindly cover other Text related tasks. Thank you.

14. Thanks for the very useful article about topic modeling. It helps to understand the aspect in detail.

I have few questions. Can I use the same for Gujarati language dataset? How to make the model work with the dataset which is not in English?

15. If I wanted to add bigrams/trigrams in here as in the Gensim model, where would I insert them? I tried to work it in just before constructing the document-word matrix, and got a bunch of errors!

16. Awesome tutorial again, just like the one on Gensim LDA. My question is: have you done any comparison between different libraries, and which library do you recommend?

17. Thank you very much for such a good tutorial. Have you done any comparison among different libraries in terms of the LDA algorithm and what one do you prefer?