Today, I discuss the Data Science Roadmap, the missing guide to self-studying machine learning. I’ll cover exactly what you need to know and do in order to self-study Data Science / ML / AI / Stats. For each topic, I will point you to some of the best resources, explain why you need to learn it, and recommend a self-study plan.
STEP 1. Learn Programming (Python, R, SQL)
(Time Required: 30 to 45 days)
What to learn?
The main programming languages for Data Science:
 Python
 R Programming
 SQL
Which one to choose first?
Between Python and R, Python is more popular and more widely adopted. R is also popular in certain countries and domains. If you have to choose between the two to start off, pick Python.
It is a good idea to learn R as well, because it has a great ecosystem of ML/stats packages, and I’ve seen several companies use both R and Python.
When should you learn R?
You can learn R once you are completely done with Python. However, if your company or domain mainly uses R, just go with R.
Why learn programming?
Data Scientists are hands-on folks. Ultimately, you will need to do various analyses with data and build ML models.
Why Python?
Python is the default language for ML these days. Most new research and development is done in Python. With mature libraries for ML, stats, deep learning etc., it’s easy to find solutions quickly.
Why R?
R was the most popular language for data science before Python took over. It is built for data analysis, statistical modeling and ML. Data Scientists use R for its rich collection of packages in stats and ML.
Why SQL?
To collect and transform data from different databases. Most databases speak SQL. People write SQL queries to collect and transform data from various sources and assemble it into a master dataset to be used for data analysis and ML.
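To make the idea of assembling a master dataset concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The `customers` and `orders` tables and their columns are made up for illustration; a real project would query your company’s actual databases.

```python
import sqlite3

# In-memory database with two hypothetical source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'IN'), (2, 'US'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Join and aggregate the sources into one row per customer (a small "master" table).
query = """
    SELECT c.customer_id,
           c.country,
           COUNT(o.order_id)          AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_spend
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.country
    ORDER BY c.customer_id
"""
master_rows = conn.execute(query).fetchall()
for row in master_rows:
    print(row)
conn.close()
```

The LEFT JOIN keeps customers who have no orders, which is usually what you want when the master table should have one row per customer.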
Note: Another growing language worth mentioning is ‘Julia’. It has a great speed advantage and an expressive syntax that borrows the good parts of both R and Python. However, the Julia community is still small and maturing.
So, if you are getting started, I’d recommend starting with Python, then learning SQL, and then R. Optionally, you can look at Julia and do cool stuff.
What else to learn to go further?
You need to learn Git as well.
Most companies version control their projects with Git, using popular services such as GitHub, BitBucket or GitLab.
I recommend opening a GitHub account to maintain and showcase your ML project work. Once you have done a few good projects (I’ll talk about this in the Projects section below), you can add the GitHub link to your resume. It will increase your chances of getting shortlisted for interviews if people see you can code well.
Best Python Resources
Python Courses
 Corey Schafer YouTube – Nice YouTube channel with simple Python explanations. Covers a range of topics, but could be better structured.

The Python Course – ML+ (Paid) – I teach this course at ML+, simplify things as always. Organized and taught using Jupyter Notebook. Ideal for Data Science / ML practitioners. Use code ‘PYTHONFREE21’ at checkout.

Learn Python – Kaggle – Learn with practice with succinct lessons. Not complete though.

Telusko Python – Youtube – Another nice Youtube course to learn Python.
Python Books
 Python Crash Course 2nd Ed: Well written book on Introductory Python. Possibly (amongst) the best.
 Fluent Python 2nd Ed: Advanced Python book. Read this after you’ve become experienced.
Python Practice Books
 Automate the Boring Stuff – Practical approach to learning
 Python Workbook – Springer – Collection of nice practice questions.
 The Big Book of Small Python Projects – This is also good.
Python Practice Websites
 Exercism Python – Nice collection of exercises, nice UI.
 Hack in Science – Split by 3 different levels.
 Practical Python Projects – Beginner exercises.
Besides these, there are competitive coding websites like Leetcode, Codewars and Hackerrank that offer a large number of programming questions. I’d not recommend these for beginners at this stage, simply because our goal right now is to become sufficiently good in Python to do Data Science/ML.
Best SQL Resources
 SQL Tutorial with Mosh – Good one (video course).
 SQL Tutorial – Mode – Also good (text course).
SQL practice websites
 Advanced SQL Puzzles – Practice solving SQL.
 SQL Interview questions – YT videos
 SQL Questions – Hackerrank – For practice
Best R Programming Resources
R Courses
R Books
 Applied Predictive Modeling – Highly recommended
 The R Book 2nd Ed – Teaches concepts and how to apply in R.
 R in Action – Well written
Web books and blogs for R
 R for DataScience – Optimized for Data Science
 Advanced R – More advanced concepts
 Introduction to Statistical Learning (ISLR) – Standard book
 R Statistics – By Selva (Myself)
 Caret Package Page – Think of it as the scikit-learn of R
 QuickR – Nicely organized materials.
 R Bloggers – Blogs aggregator
Best Git Resources
Git Course
 Git Tutorial Youtube – 1 hour – Well explained.
Git Tutorials and Practice
 Git Tutorial by Vogella – Learn Git with one well written article.
 Git Tutorial by Atlassian – One of the best references.
 OhShitGit! – Important things about Git, with a bit of swearing.
 Git Howto – Learn by doing.
 Git Immersion – Another learn by doing. Uses Ruby Language though.
Recommended Learning Plan for Programming
Python
Step 1: The Python course – ML+ (Paid). Use coupon code ‘PYTHONFREE21’ at checkout.
Step 2: Revise and Practice Python in Kaggle. Stretch: Solve 15 easy to medium exercises here.
R Programming Route
 Base R Programming (Paid) – Watch the full course.
 R for DataScience – Read till chapter 21.
SQL
Step 3: SQL Course with Mosh
Step 4: SQL Questions – Hackerrank – Practice 15 questions.
Optional: If your job requires a lot of SQL proficiency, take these lessons as well. If you are not sure, then skip this.
Git
Step 5: Git Course
Step 6: Git Practice
STEP 2. Learn Data Wrangling
(Time Required: 30 to 45 days)
Why learn Data Wrangling?
In Data Science and machine learning, you will primarily work with data, both structured (tabular data) and unstructured (text, images etc). The first step is to gain mastery over wrangling tabular data. This is important.
You should be able to transform, create, merge, split and aggregate datasets; basically, be comfortable doing anything with them. You should also be able to create various types of plots and visualizations.
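As a rough illustration of what merging, creating, aggregating and splitting look like in practice, here is a small pandas sketch. The two tables and their column names are hypothetical toy data, not from any real project.

```python
import pandas as pd

# Hypothetical toy tables; the column names are made up for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 101, 103],
    "amount": [20.0, 35.0, 15.0, 50.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "segment": ["retail", "retail", "wholesale"],
})

# Merge: attach the customer segment to every order.
merged = orders.merge(customers, on="customer_id", how="left")

# Create: derive a new column from an existing one.
merged["is_large"] = merged["amount"] >= 30

# Aggregate: total and average order amount per segment.
summary = (
    merged.groupby("segment")["amount"]
    .agg(total="sum", average="mean")
    .reset_index()
)

# Split: separate the rows into two DataFrames by a condition.
large_orders = merged[merged["is_large"]]
small_orders = merged[~merged["is_large"]]

print(summary)
```

Most day-to-day wrangling is some combination of these four moves, so it pays to practice them until they feel automatic.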
What to learn?
If you’ve chosen Python as your programming language, learn the NumPy and Pandas libraries well. These are the two main libraries for data wrangling, especially Pandas.
Why Pandas?
It is the most popular package for data wrangling in Python, thoughtfully designed and optimized for performance by Wes McKinney and team.
What else to learn to go further?
 Learn matplotlib for making plots, and optionally the plotly library as well. Other options worth knowing about are Altair, Bokeh, Yellowbrick etc.

Learn Dask, built for parallel computing for Data analysis and ML. It enables you to work with large datasets and provides tools to make code faster.

Optionally, learn the Modin and Vaex packages. They are pandas alternatives that aim to make your data wrangling code run faster.
Best Data Wrangling Resources
 Pandas User Guide
 NumPy Basics
 NumPy Course – ML+
 Pandas Course – ML+
 Dask Tutorial, Dask Offl
 Modin Tutorial, Modin Offl
 Vaex Tutorial, Vaex Offl
Recommended Learning Plan – Data Wrangling
 Foundations course (Paid) – Use COUPON at checkout: FOUNDATIONSFREE25
 Pandas Course – ML+ (Paid) – Use COUPON at checkout: PANDASFREE25
 NumPy Course (Paid) – Use COUPON at checkout: NUMPYFREE25
 Matplotlib Course – ML+ (Coming Soon)
 Pandas Book
STEP 3. Learn Probability and Statistics
(Time Required: 21 days)
Why learn Probability and Statistics?
Probability and statistics are foundational concepts for Machine learning. Often statistical methods are used in tandem with machine learning for:
 Performing statistical significance tests to check claims and enable sound business decisions.
For example, statistical tests can help decide: (i) which is the better approach for customers asking for refunds, providing a full refund or giving cashback credits; (ii) whether placing a specific feature on a website increased how long customers stay on it.
 Support exploratory data analysis (EDA)

To understand the math behind Bayesian algorithms, basic knowledge of probability is required.
What to learn in ProbStats?
You need to learn various statistical concepts such as standard deviation, covariance, correlation etc. You also need to know how statistical significance tests work and how to interpret them, especially p-values, null and alternate hypotheses, test statistics, degrees of freedom etc.
Here is a list of essential probability and stats:
Sample vs Population; Measures of Central Tendency and Dispersion; Law of Large Numbers; Central Limit Theorem; Joint and Conditional Probability; Bayes Theorem; Frequency Distributions; Normal Distribution; Probability Density Function; Probability Mass Function; Hypothesis Testing; Type 1 and Type 2 Error; P-values; Confidence Interval calculation; T-Test; Chi-Squared Test; Mann-Whitney U Test; ANOVA; Standard Error
Python libraries: Statsmodels, scipy
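To see how a null hypothesis, a test statistic and a p-value fit together, here is a minimal sketch of a permutation test written with only the standard library (in practice you would more likely reach for scipy or statsmodels). The session durations are invented numbers, and the function name `permutation_test` is my own.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference and an approximate p-value.
    """
    rng = random.Random(seed)
    observed = mean(group_b) - mean(group_a)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical session durations (seconds) without and with a new feature.
control = [30, 32, 29, 35, 31, 28, 33, 30]
variant = [36, 38, 35, 40, 37, 34, 39, 36]

observed, p_value = permutation_test(control, variant, n_permutations=2000)
print(f"observed difference = {observed:.2f}s, p-value = {p_value:.4f}")
```

The p-value here is simply the share of label shuffles that produce a difference at least as large as the one observed, which makes the definition easier to remember than a formula.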
Best Statistics Resources
Stats Video Lessons
Stats Books
 Naked Statistics
 Practical statistics for Data Scientists
 Think Stats (Free web book)
 Think Bayes (Free web book)
Recommended Learning Plan
 Complete statsquest playlist
 Review permutations test in Practical statistics for Data Scientists
STEP 4. Practice Data Preprocessing, Exploratory Data Analysis (EDA) and Story Telling
(Time Required: 30 to 45 days)
Why learn EDA and Story Telling?
Once you’ve pieced the data together, you need to:
1. Preprocess Data
Involves handling missing values, identifying outliers and processing them if needed, running sanity checks on the various fields in the data, creating new features if appropriate, considering variable transformations etc.
2. Perform Exploratory Data Analysis (EDA)
Involves performing various data analyses to summarize the data and the relationships that exist between variables. You may have to conduct statistical tests, and bring out statistics that will be of interest.
3. Come up with insights and build the story (that the data narrates).
Have conversations or even working sessions with your stakeholders. It is good to be on the same page with the SME (subject matter expert) partner before sharing the findings with a larger audience.
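The preprocessing step described above (missing values and outliers) can be sketched in a few lines. This is only one possible approach, assuming median imputation and the 1.5 × IQR rule; the values are made up, and real projects typically do this with pandas.

```python
from statistics import median, quantiles

# Hypothetical column of order values; None marks a missing entry.
raw_values = [12.0, 15.0, None, 14.0, 13.0, 200.0, None, 16.0, 15.0]

# 1. Impute missing values with the median of the observed values.
observed_values = [v for v in raw_values if v is not None]
fill_value = median(observed_values)
imputed = [fill_value if v is None else v for v in raw_values]

# 2. Flag outliers with the 1.5 * IQR rule (one common convention, not the only one).
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in imputed if v < lower or v > upper]

print("fill value:", fill_value)
print("outliers:", outliers)
```

Whether a flagged value should be removed, capped or kept is a judgment call that is best made together with the subject matter expert.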
What to learn?
 Data Preprocessing Techniques
 Univariate and Bivariate analysis (put your stats learning to use here).
 Practice on real world datasets/problems
What else can be helpful?
The ability to build dashboards quickly with tools like Tableau / Power BI / Jupyter / Streamlit / Dash. Streamlit and Dash have the advantage of being flexible enough to use ML model outputs in the backend.
Best Resources for EDA and Story telling
 Data Preprocessing and EDA course
 Kaggle learning competitions
 Data Visualization practice
 Top 50 matplotlib visualizations for Data Analysis(Bookmark this page.)
 Python Graph Gallery
Recommended Learning Plan – EDA and Data Preprocessing
STEP 5. Learn Machine Learning Concepts and Algorithms
(Time Required: 60 days)
What to learn?
Learn the fundamental concepts behind machine learning. For example, concepts like cross validation, bias-variance tradeoff, hyperparameter tuning, overfitting etc. are needed to implement ML.
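As an example of one of these concepts, here is a minimal from-scratch sketch of k-fold cross validation around a simple least-squares line fit. The helper names `fit_line` and `k_fold_mse` and the synthetic data are my own; in real work you would normally use scikit-learn’s cross-validation utilities instead.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def k_fold_mse(xs, ys, k=5, seed=0):
    """Return the mean squared error on each of k held-out folds."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k roughly equal, disjoint folds

    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Fit only on the training part, then score on the unseen fold.
        intercept, slope = fit_line([xs[i] for i in train], [ys[i] for i in train])
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in held_out]
        scores.append(sum(errors) / len(errors))
    return scores

# Synthetic data: y = 2x + 1 plus Gaussian noise (made up for illustration).
rng = random.Random(42)
xs = [i / 2 for i in range(40)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

fold_scores = k_fold_mse(xs, ys, k=5)
print("per-fold MSE:", [round(s, 3) for s in fold_scores])
print("mean CV MSE:", round(sum(fold_scores) / len(fold_scores), 3))
```

The key point is that every observation is used for validation exactly once, and never in the same round in which it was used for fitting.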
Things you should know:
 ML fundamental concepts
 ML Algorithms – The math behind it
 Building ML in Python / R, tuning and validating
 Model performance improvement strategies
You can learn the math behind ML before or after learning to implement it, whatever works best for you. Just don’t make it an excuse for not becoming hands-on for too long.
Essential ML Concepts Topics
Generative vs Discriminative models; Cross Validation; OOB error; Bias-Variance Tradeoff; R-Squared; Adjusted R-Squared; RSS; TSS; Forward Selection; Backward Selection; Box-Cox Transforms; Entropy; Information Gain; Gini Index; Binary Cross Entropy; Variance Inflation Factor; Assumptions of Linear regression; Heteroscedasticity; Sigmoid function; Softmax; Cost Function; ROC Curve; Precision-Recall; Confusion Matrix and related concepts; Imbalanced Classification; Effect of outliers; Missing value treatment approaches; Expectation Maximization Algorithm; Maximum Likelihood Estimation.
Essential ML Algorithms
Linear Regression; Logistic Regression; Gradient Descent; Decision Trees; Gradient Boosting; Adaboost; Random Forest; SVM; k-Nearest Neighbors; k-Means Clustering; Hierarchical Clustering; Principal Components Analysis; Regularization; Ridge and LASSO regression; Naive Bayes; XGBoost
Note: Know the key hyperparameters of the various algorithms and their purpose.
Stretch: Variations of decision trees such as ID3, C4.5, CART and CHAID; other models such as MARS, CRF, HMM and Gaussian Mixture Models; and boosting libraries such as CatBoost and LightGBM.
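To give a feel for what ‘the math behind it’ means, here is a minimal sketch that ties together several items from the lists above: logistic regression, the sigmoid function, binary cross entropy and gradient descent. The data, learning rate and step count are arbitrary illustrative choices, not tuned values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(xs, ys, learning_rate=0.1, n_steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the mean binary cross-entropy with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def binary_cross_entropy(xs, ys, w, b):
    """Mean binary cross-entropy loss of the model on the data."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)

# Hypothetical data: hours studied vs. pass (1) / fail (0), with some overlap.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

w, b = train_logistic(hours, passed)
loss_before = binary_cross_entropy(hours, passed, 0.0, 0.0)
loss_after = binary_cross_entropy(hours, passed, w, b)
print(f"w = {w:.3f}, b = {b:.3f}")
print(f"loss: {loss_before:.3f} -> {loss_after:.3f}")
```

Libraries such as scikit-learn use more sophisticated solvers, but the loop above is the core idea: compute the gradient of the cost function and step against it.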
Why learn ML Algorithms and the math behind it?
Here are 5 solid reasons for you to learn the internals of how ML algorithms work.
What else can be helpful?
You need to be quite familiar with using the scikit-learn package for building ML models. In addition to this, pick up the statsmodels package as well.
It would also be great to know packages like ‘h2o’, which you can use with both R and Python.
Another important domain is Time Series forecasting, which deals primarily with forecasting the future values of time series data.
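For a first taste of forecasting, here is a sketch of simple exponential smoothing, one of the simplest forecasting methods, chosen here purely for illustration. The demand figures and the smoothing factor `alpha=0.5` are made-up values.

```python
def simple_exponential_smoothing(series, alpha):
    """Return the smoothed levels; the last level is the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]          # initialise with the first observation
    levels = [level]
    for value in series[1:]:
        # New level = weighted average of the latest value and the previous level.
        level = alpha * value + (1 - alpha) * level
        levels.append(level)
    return levels

# Hypothetical monthly demand figures.
demand = [100, 110, 105, 120, 115]
levels = simple_exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]
print("next-period forecast:", forecast)
```

A larger `alpha` reacts faster to recent changes, while a smaller one gives a smoother and slower-moving forecast.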
Later on, once you’ve completed the learning path, you can explore advanced modeling such as probabilistic modeling with packages such as PyMC3, Pyro, etc. Sometimes companies open-source their internal ML projects for a wider audience, and some of them are very good. For example, Facebook released Prophet for time series forecasting.
Best Machine Learning Courses
 ML+ Linear Regression Course (Paid)
 ML+ Logistic Regression Course (Paid)
 ML+ Supervised Learning Course (Paid)
 ML+ Ensemble Modeling Course (Paid)
 Prof. Andrew Ng ML Course
 Krish Naik – ML Playlist
Time Series Forecasting Resources
 Time Series Analysis Course
 Singular Spectrum Analysis Course (Coming soon, check the store)
 Time Series Forecasting Course (Coming Soon, check the store)
 Time Series Feature Engineering Course (Coming Soon, check the store)
Time Series Blogs
Recommended Learning Plan – ML
 ML+ Linear Regression Course (Paid)
 ML+ Logistic Regression Course (Paid)
 ML+ Supervised Learning Course (Paid)
 ML+ Ensemble Modeling Course (Paid)
 ML+ Feature Engineering Course (Coming Soon, check the store)
 ML+ Dimensionality Reduction Course (Coming Soon, check the store)
For the last two topics (feature engineering and dimensionality reduction), check out the corresponding lectures from Prof. Andrew Ng’s course.
STEP 6. Learn Deep Learning Concepts and Algorithms
(Time Required: 60 days)
Why learn Deep Learning?
Deep learning is a special branch of machine learning, based on innovative architectures of one type of ML algorithm called neural networks. A neural network is composed of an input layer, one or more hidden layers and an output layer of neurons. As the number of hidden layers increases, the network becomes ‘deep’, hence the name deep learning.
With innovative architectures, deep learning has been used to solve a variety of real-world problems such as speech translation, face detection, object tracking, self-driving, winning computer games, text generation, creating pictures from text etc.
Companies use deep learning for various such use cases.
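To show what input, hidden and output layers of neurons look like in code, here is a tiny forward pass written from scratch. The weights are hand-picked (not learned) so that the network computes XOR; a real network would learn its weights with back propagation in a framework like TensorFlow or PyTorch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases):
    """One fully connected layer followed by a sigmoid activation."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hand-picked (not learned) weights so that the network computes XOR.
HIDDEN_W = [[20.0, 20.0], [-20.0, -20.0]]   # neuron 1 ~ OR, neuron 2 ~ NAND
HIDDEN_B = [-10.0, 30.0]
OUTPUT_W = [[20.0, 20.0]]                   # output ~ AND of the two hidden neurons
OUTPUT_B = [-30.0]

def forward(x1, x2):
    hidden = dense([x1, x2], HIDDEN_W, HIDDEN_B)    # input -> hidden layer
    output = dense(hidden, OUTPUT_W, OUTPUT_B)      # hidden -> output layer
    return output[0]

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(forward(x1, x2)))
```

XOR cannot be computed by a single neuron, so this small example also shows why the hidden layer matters.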
What to learn?
 Deep learning architectures such as Multilayer Perceptron, Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), Long Short-Term Memory networks (LSTM), GANs, Encoder-Decoder, Transformers etc.

Deep Learning concepts such as Chain Rule and Back propagation, Activation functions, L1 and L2 regularization, Dropout layer, Early stopping, Optimizers like RMSProp, ADAM, AdaGrad, Batch Normalization, Pooling, Transfer Learning

Frameworks: Tensorflow, PyTorch or both.
What else can be helpful?

Once you have learnt the overall concepts, pick one area like NLP or Computer Vision and dive deeper.

For NLP, libraries like SpaCy, HuggingFace and Gluon NLP are the main ones.

For Computer Vision, start with OpenCV and check out the various applications on the Keras and PyTorch websites.
Best Deep Learning Resources
Best Deep Learning Courses
Deep Learning Books
 Deep Learning with Python, 2nd Ed
 Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition
 Deep Learning Book by I. Goodfellow, Y. Bengio and A. Courville
Recommended Learning Plan
STEP 6. Learn MLOps and Big Data
(Time Required: 30 to 45 days)
Why learn MLOps?
Traditionally, Statisticians would share the results of their models in a Dataset (Excel or store in database) or create a presentation to share their findings.
But now, Data Scientists deploy the ML models. Typically, it’s a dashboard or a REST API that takes an input (ex: voice) and gives back an output (ex: text).
Once models are deployed, their performance should be monitored and models refreshed periodically.
MLOps deals with practices that enable you to deploy, monitor and refresh ML models in production in a reliable way.
This requires DevOps and ML Engineers to work together. This area is still evolving, and companies (like AWS, Google Cloud, MS Azure, Databricks etc) strive to do the heavy lifting so that Data Scientists can take their models to production with minimal effort.
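To make the ‘model behind a REST API’ idea above concrete, here is a bare-bones sketch using only Python’s standard library. The `predict` function is a stand-in with made-up coefficients, not a real trained model, and in practice you would more likely use a web framework than `http.server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a hypothetical linear scoring rule."""
    weights = {"age": 0.02, "visits": 0.1}   # made-up coefficients
    score = sum(weights[name] * features.get(name, 0.0) for name in weights)
    return {"score": round(score, 4)}

def handle_request_body(body):
    """Validate a JSON request body and return (HTTP status, response payload)."""
    try:
        features = json.loads(body)
    except ValueError:
        return 400, {"error": "body must be valid JSON"}
    if not isinstance(features, dict):
        return 400, {"error": "body must be a JSON object"}
    return 200, predict(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, score it, and send back a JSON response.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request_body(self.rfile.read(length))
        data = json.dumps(payload).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

def serve(port=8000):
    """Start the server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()

# Exercising the request logic directly, without starting a server:
print(handle_request_body(b'{"age": 30, "visits": 4}'))
```

Calling `serve()` would start a local server that accepts POST requests. Keeping the validation and scoring logic in plain functions, separate from the HTTP handler, makes it easy to test before deployment.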
What to learn?
 Framework to take an ML model to a REST API. Ex: Python: Flask, Django, FastAPI. R: Plumber
 Docker: Solving for dependencies and reproducibility.
 Sagemaker: Flagship ML development, deployment and monitoring platform from AWS.
 Kubernetes / KubeFlow: Kubernetes enables automated deployment and scaling. KubeFlow makes deploying ML workflows on Kubernetes simple.
 Seldon Core: Deploy ML models quickly on Kubernetes at scale.
 CI/CD: Continuous Integration and Continuous Delivery of ML solutions. GitHub Actions, Jenkins, CircleCI, GitLab, Travis CI are popular tools.
 ML Orchestration: Create, schedule and monitor workflows as an end-to-end pipeline. Popular tools are AirFlow, Dagster, Prefect and Luigi. ML-focused pipeline tools include KubeFlow Pipelines, Metaflow and Flyte.
Certifications (Optional)
AWS and Google provide training and certifications.
AWS
Google Cloud
What else can be helpful?
Learn distributed computing with PySpark. Several companies use PySpark for Extract, Transform, Load (ETL) operations as well as for building ML models and applications on big datasets.
Databricks and Cloudera are leading providers. PySpark, the Python API for Apache Spark, is the package you need to install to start using it.
MLFlow: A popular platform to manage the ML life cycle. Language and library agnostic. Open source.
Best ML Ops Courses
 Deploying ML in AWS EC2 – ML+
 Docker Course – Mosh
 AWS Sagemaker course – ML+
 Deploy ML in AWS Lambda Course – ML+
 Full Stack Deep Learning – Fall 2019, 2022
 Docker Hands-On training
 CI/CD for Machine Learning – MadewithML
 GitHub Actions
Recommended Learning Plan – MLOps
 Deploying ML in AWS EC2 – ML+ (Paid)
 Docker Course – Mosh
 AWS Sagemaker course – ML+ (Paid)
 AWS Lambda Course – ML+ (Paid)
 KubeFlow Course – Google
 Pyspark Tutorial
Topics such as CI/CD pipelines and orchestration frameworks can change from company to company, and are easy to pick up on the job.
STEP 7. Do Portfolio Projects
(Time Required: 21 days+)
Why build ML projects?
The projects that you showcase in your resume speak volumes, more than certifications ever will.
But what type of projects to build?
The kind that companies actually execute, that is, Data science projects that help generate revenue or save costs.
What to build?
It’s ok to start with educational projects such as Kaggle’s Titanic competition or the Housing Price prediction competition. There are more such competitions for knowledge and practice.
Then solve projects that companies are interested in, such as estimating customer lifetime value, forecasting product demand, predicting customer churn, simulating optimal marketing budget spend, etc.
What else can be helpful?
Solve projects end-to-end with well-written code. Public datasets that represent real-world data are good enough. Host your code in GitHub repositories, maintain your repos well and add the GitHub link to your resume.
Best Projects Resources
Recommended Learning Plan – DS Portfolio Projects
 Kaggle’s Titanic competition
 Estimating CLTV – ML+
 Restaurant Visitor Forecasting
 Market Mix Modeling
If you are interested in more projects, check out Kaggle competitions, get practice, and host your code on GitHub.
Do You Want to Learn More About Data Science?
Data Science and ML have become integral to most companies, both IT and product companies. Machine Learning Plus has everything you need to make your Data Science Roadmap journey easier and achievable.
The Machine Learning Plus Complete ML Mastery Courses feature the ideal learning path if you want to succeed in a career in Data Science. If you are struggling to understand ‘tough’ ML and Stats concepts, at ML+ you will get the most clear, complete and straightforward explanations.
Complete ML Mastery covers vital data science topics, starting with the Foundations of ML course, followed by Python programming, R programming, machine learning and deep learning, and ends with portfolio machine learning projects that give you exposure and build your business acumen. You can showcase these projects in your resume and talk about them to win interviews.
According to Glassdoor, Data Scientists earn an annual average of $123,250. The world needs more data scientists. Companies offer attractive incentives and a stable, secure and fulfilling career path. If this sounds like your kind of profession, take those first few steps towards a new career. Start the ML Mastery Self Learning Path today!