Not all Data Science projects that get kicked off see the light of day. While this is true for any kind of project, there are common failure reasons specific to Data Science projects, and they can be avoided.
These pitfalls can be prevented if you are aware of them and don't commit the mistakes. Or, if you see these problems already present, consider not stepping into the project in the first place.
Now, when it comes to pitfalls, certain aspects are under your control and certain others cannot be changed.
1. Data Leakage
Data Leakage can happen in two ways:
- Your ML model was trained on data that contains the information you are trying to predict. That is, the algorithm sees the answers it is supposed to predict, leading to abnormally accurate predictions. It's like cheating in an exam.
- Your ML model trains on data that will not be available at the time of prediction.
For example: your model is trying to forecast future page views of a website, but your training data contains the number of visitors. If the number of visitors will not be available at the time of prediction, it should NOT be part of the training data.
Besides, page views and the number of visitors tend to follow the exact same pattern and are highly correlated. If so, it is almost like cheating: you may as well multiply the number of visitors by a constant to make the prediction.
How to check?
You should become suspicious when the model's performance is too good to be true. Look for data leakage in such cases.
What can you do about it?
Just remove the offending variables and rebuild the models. The earlier you catch such mistakes, the better off you are.
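A quick way to screen for the first kind of leakage is to look for features that are almost perfectly correlated with the target. A minimal sketch (all column names here are made up to mirror the page-views example):

```python
import numpy as np
import pandas as pd

# Toy data: 'visitors' is essentially a rescaled copy of the target
# 'page_views', i.e. a leaked variable. All names are hypothetical.
rng = np.random.default_rng(42)
df = pd.DataFrame({"ad_spend": rng.uniform(0, 100, 500)})
df["page_views"] = 50 + 2 * df["ad_spend"] + rng.normal(0, 40, 500)
df["visitors"] = 0.8 * df["page_views"] + rng.normal(0, 1, 500)

# Flag features whose correlation with the target is suspiciously high
corr = df.drop(columns="page_views").corrwith(df["page_views"]).abs()
suspects = corr[corr > 0.95].index.tolist()
print(suspects)  # ['visitors']
```

A correlation threshold like 0.95 is an arbitrary starting point; the real test is whether the flagged feature will actually be available at prediction time.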
2. Weak variables
In order to predict some variable (say, crop yield), you need predictors that actually influence the crop yield. In this case, variables such as the type of soil, fertilizers applied, moisture content, etc. may be helpful.
But if you only have variables that don't influence crop yield (say, the hair color of the landlord, to take an extreme example), then your model will not have much useful information to predict the crop yield.
Garbage In, Garbage Out. Remember?
How to check?
Check for correlation if both your X and Y are continuous. Use the Chi-squared test if they are both categorical, and ANOVA if you have a mix of categorical and continuous. If the correlation is too low for all the variables in your data, and/or the variables are not meaningful for explaining the Y, it is quite possible your variables are not helpful.
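The checks above can be sketched with `scipy.stats`; the crop-yield variables here are synthetic stand-ins:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 300

# Continuous X vs continuous Y: Pearson correlation
rainfall = rng.normal(800, 100, n)               # informative predictor
irrelevant = rng.normal(20, 5, n)                # unrelated variable
crop_yield = 0.01 * rainfall + rng.normal(0, 0.5, n)

r_good, p_good = stats.pearsonr(rainfall, crop_yield)
r_weak, p_weak = stats.pearsonr(irrelevant, crop_yield)

# Categorical X vs continuous Y: one-way ANOVA across groups
soil = rng.choice(["clay", "loam", "sand"], n)
groups = [crop_yield[soil == s] for s in np.unique(soil)]
f_stat, p_anova = stats.f_oneway(*groups)

# Categorical X vs categorical Y: chi-squared test of independence
high_yield = crop_yield > np.median(crop_yield)
table = np.array([[np.sum((soil == s) & high_yield),
                   np.sum((soil == s) & ~high_yield)]
                  for s in np.unique(soil)])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

print(f"informative: r={r_good:.2f}, irrelevant: r={r_weak:.2f}")
```

Low absolute correlations (and large p-values) across the board suggest the feature set is weak; keep in mind these tests only capture simple relationships, so a tree-based model's feature importance is a useful complementary check.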
What can you do about it?
If good explanatory variables are not available:
- Think through which features, if present, could help predict the Y. For crop yield, you might want to look at soil, rainfall, temperature, crop variety, seeding conditions, etc.
- Have a few dedicated working sessions with your stakeholders (not your regular meetings) and come up with such a list.
3. Spurious variables
It is possible that certain X variables happen to be highly correlated with your Y in the training data purely by chance.
Such a variable may even have a large feature importance, even though, if you try to make rational sense of it, it cannot be that important.
This is problematic if it gets a higher variable importance than features that actually influence the Y variable in the real world.
How to check?
After you build the ML model, check the variable importance of the features. In a good model, the features that actually matter will have high importance.
What can you do about it?
- Remove variables that have low variance (an almost constant value).
- Try building your model with different algorithms and check whether the feature importances change. If they do, it could be an algorithm-specific problem, which you can examine further, that is, understand the way each algorithm learns from the data.
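Both steps can be sketched with scikit-learn on synthetic data (only the first feature actually drives y, and the last column is near-constant):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 4))
X[:, 3] = 5.0 + rng.normal(0, 1e-4, n)     # near-constant column
y = 2 * X[:, 0] + rng.normal(0, 0.5, n)    # only feature 0 matters

# Step 1: drop near-constant columns
keep = VarianceThreshold(threshold=1e-3).fit(X).get_support()
print(keep)  # [ True  True  True False]

# Step 2: sanity-check importances against what you know of the domain
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
print(model.feature_importances_.round(2))  # feature 0 should dominate
```

Repeating step 2 with, say, gradient boosting and comparing the rankings is the cross-algorithm check described above.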
4. Making your stakeholders wait forever
This is not DS specific and applies to software engineering projects as well.
However, from my experience, Data Scientists are more prone to committing this mistake. They wait until they get 'substantial' results and insights from the data before sharing anything with clients.
Try to take your clients along on the discovery journey; perhaps they have something to contribute to the problems you face, be it data availability, some incongruency you observe in the data, poor model results, whatever it is.
Set the expectations right: make clear what is in scope and what is out of scope, establish the boundary conditions where your models might fail, and meet with your stakeholders regularly. More importantly, present your model results to your immediate working team and get their inputs before presenting in a large, high-stakes meeting. Just my two cents.
5. Don't promise the same performance when your models go live
Your ML model might perform great in lab conditions on the in-sample, validation, and out-of-bag datasets.
However, once you've deployed it, expect things to degrade.
That's because it can be a while since you collected the data used for the project, and by the time the model goes to deployment, conditions could have changed.
There will always be cases in the real world that were not captured in the training data. Your model has not learnt them yet.
So expect some deterioration in the results right after deployment, and inform your teams and clients to expect it as well.
6. Include sufficient user testing as part of your project plan
Same as the heading. At the start of the project, you should know who your end users are and how they are going to use your solution.
Suppose you are building a computer vision solution that will be used as a mobile app, you need to consider the model’s latency when building your solution.
End users use your product in ways you might not anticipate. So be prepared and test with users as early and as frequently as possible.
7. Not factoring in Data Drifts
As time passes, the environment can change, and so do the characteristics of the incoming data.
As a result, the distribution of data can change over time, which we normally refer to as Data Drift. This can cause your model’s performance to degrade over time.
How to detect?
Check the distribution of the data at regular intervals and store the results. Metrics like the Population Stability Index (PSI) can be helpful as well.
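PSI compares the binned distribution of a baseline (training-time) sample against new data; values below roughly 0.1 are usually read as stable and above 0.25 as significant drift. A minimal implementation sketch:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a baseline sample and a new sample,
    using quantile bins derived from the baseline."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    actual = np.clip(actual, edges[0], edges[-1])  # keep values in range
    exp_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    act_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    eps = 1e-6  # avoid log(0) / division by zero in empty bins
    exp_pct = np.clip(exp_pct, eps, None)
    act_pct = np.clip(act_pct, eps, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)
psi_stable = population_stability_index(baseline, rng.normal(0, 1, 10_000))
psi_drift = population_stability_index(baseline, rng.normal(1, 1, 10_000))
print(psi_stable, psi_drift)  # small for same distribution, large for shifted
```

The 0.1 / 0.25 cut-offs are rules of thumb, not hard limits; pick thresholds that match how sensitive your use case is to drift.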
What to do about it?
If data drift is observed, you might have to retrain your models. It is better to establish a retraining frequency where the duration between successive trainings is not so large that you miss the data drifts.
It might be worthwhile to re-do the entire EDA process if there are significant changes in environment / business / market conditions.
8. Not retraining your models often enough
Same as the heading.
While there can be budget constraints on the retraining frequency of models, on account of personnel costs, computing resource costs, etc., especially for large data, you might want to automate the retrainings as much as you can. Make sure to monitor the model performance, the characteristics of the input data, as well as the outputs.
9. Choice of algorithm
It might be a better idea to use an ensemble algorithm like xgboost or random forest for high dimensional datasets instead of using logistic regression.
While logistic regression has advantages, such as providing naturally calibrated probability scores, looking beyond it might provide better model performance, especially in the absence of strong predictors (features).
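If you want to make this an empirical decision rather than a default, cross-validate both families on your own data. A sketch on a synthetic high-dimensional problem (whether the ensemble actually wins depends entirely on your data):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 100 features, only 10 of them informative
X, y = make_classification(n_samples=1000, n_features=100,
                           n_informative=10, random_state=0)

lr_score = cross_val_score(LogisticRegression(max_iter=1000),
                           X, y, cv=5).mean()
rf_score = cross_val_score(RandomForestClassifier(n_estimators=100,
                                                  random_state=0),
                           X, y, cv=5).mean()
print(f"logistic regression: {lr_score:.3f}, random forest: {rf_score:.3f}")
```

The same harness extends naturally to xgboost if it is available in your environment.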
10. Not providing Model Interpretation Tools
For business projects, you want to enable your clients to make sense of why a particular prediction was made for a particular data point.
Let them know the contribution (which could be positive or negative) of each predictor in arriving at a given prediction.
For example: let's say you are predicting sales of a particular product for a given month. When you share your prediction, it is better to also show why you are predicting so high (or low) for that month.
A high prediction could be because your sales forecast model had 'Digital spend' and 'Competitor's discount' among its significant variables. For that specific forecast month, if your marketing team is planning a large ad campaign (or your competitor has planned a massive discount), your sales are going to be impacted accordingly.
The magnitude of such impact can be conveniently calculated and visualized using something like SHAP values or LIME.
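For a plain linear model, those per-feature contributions can be read off directly (coefficient times the feature's deviation from a baseline), which is the intuition SHAP generalizes to arbitrary models. A hand-rolled sketch with hypothetical coefficients, not a substitute for the shap or lime libraries:

```python
import numpy as np

# Hypothetical fitted linear sales model: y = b0 + b1*spend + b2*discount
feature_names = ["digital_spend", "competitor_discount"]
coef = np.array([3.0, -2.0])   # spend helps sales, rival's discount hurts
intercept = 100.0

baseline_x = np.array([10.0, 5.0])   # an average month
x_new = np.array([25.0, 7.0])        # big ad campaign, deeper rival discount

# Contribution of each feature = coefficient * deviation from baseline
contributions = coef * (x_new - baseline_x)
prediction = intercept + coef @ x_new
baseline_pred = intercept + coef @ baseline_x

for name, c in zip(feature_names, contributions):
    print(f"{name}: {c:+.1f}")
# digital_spend: +45.0
# competitor_discount: -4.0
```

The contributions add up exactly to the gap between the prediction and the baseline prediction, which is the additivity property SHAP preserves for non-linear models.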
11. Failing to do an RCA for incorrect predictions
RCA stands for 'Root Cause Analysis'. Your models could be missing key changes in business conditions that may need to be factored in.
There are going to be instances / observations where your model predictions are way off.
However, when you aggregate the overall model performance (say, the overall F1 score) on the full test data, it might still be acceptable.
Even in such cases, it’s usually not a bad idea to understand the root cause of why certain specific predictions were off.
For example, say you are predicting housing prices in a region. Most of your predictions are very close. However, for certain instances, the predicted price is far from the actual.
In such cases, it is a good idea to examine and understand which offending variable caused the prediction to go bad.
You might be able to prevent greater mistakes early on.
How to detect it?
Incorporate techniques like SHAP values to show the contribution of individual variables to a given prediction.
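A simple starting point for the RCA is to rank predictions by absolute error and inspect the worst offenders first; the housing numbers below are made up:

```python
import pandas as pd

# Hypothetical housing predictions (prices in thousands)
df = pd.DataFrame({
    "actual":    [310, 450, 295, 800, 330],
    "predicted": [300, 440, 290, 520, 340],
})
df["abs_error"] = (df["actual"] - df["predicted"]).abs()

# Inspect the worst misses first, then dig into their feature values
worst = df.sort_values("abs_error", ascending=False).head(2)
print(worst)
```

From there, computing SHAP values for just those rows usually points at the offending variable quickly.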
12. Doing everything your stakeholder asks you to do without rationale
Make sure you are solving the right problem and using the right approach to solve the problem.
There are instances when your client will ask you to do regression modeling when in reality the problem requires combinatorial optimization methods. Don't commit to outcomes and deadlines prematurely without understanding the full extent of the problem. Sometimes it is better to build a proof of concept first and then go on to build a full-scale solution. Use your best judgement.
13. Don’t do everything yourself.
You may have full-stack data scientists on your team, but it might be a good idea to leave the work of deployment to DevOps / MLOps, and the work of sourcing the data and making it available to you to the Data Engineers.
While this luxury might not be available in all organizations, having organized systems will allow Data Scientists to do their job well.
See below for who does what:
1. DevOps: Manage test-to-deployment and further operations thereafter.
2. Data engineering and architecture: Configure the systems that source the data for consumption by ML models.
3. Project management: Manage the timelines, scrums, and various ceremonies of your project management methodology.
4. Product managers: Cross-functional collaboration and stakeholder communication plans.
5. Software engineers: Develop the frontend, backend, and mobile apps that will host / access your ML models.
6. Data Analysts: The role may depend on the organization / specific job role. They may help build Tableau / Power BI dashboards, perform data analysis for various use cases, and put together decks for various analyses and findings.
14. More Questions to ask
Let the stakeholders and the teams answer the following questions:
- Who are the end users and how will they practically use your model / solution / tool?
- How many end users are expected?
- Is there scope for replicating your solution for other product lines / regions / businesses?
- What is in scope and what is out of scope? Lay out explicitly what your solution can and cannot do, and in what conditions it may fail.
- What is the form in which the end solution will be delivered? It will probably be one of these: a Tableau / Power BI dashboard, a ready-made web app (R Shiny, Streamlit), a custom-built web app using frameworks like React or Django, a mobile app, or a model served from cloud platforms like SageMaker, Azure ML, or Google Vertex AI.
- What is the plan for end-user testing?
- What is the model refresh and monitoring strategy? If it is a critical project, do you need a triaging protocol?
- What is the model deployment strategy?
- What are the key milestones?
This is not meant to be a comprehensive list.
I hope you will find it helpful.