Data Science Roadmap – How to become a Data Scientist? (6 month self study plan)

Today, I discuss the Data Science Roadmap, the missing guide to self-studying machine learning. I’ll cover what exactly you need to know and do in order to self-study Data Science / ML / AI / Stats. For each topic, I will explain why you need to learn it, point you to some of the best resources and recommend a self-study plan.

STEP 1. Learn Programming (Python, R, SQL)

(Time Required: 30 days-45 days)

What to learn?

The main programming languages for Data Science:

Python
R Programming
SQL

Which one to choose first?

Between Python and R, Python is more popular and widely adopted. R is also popular in certain countries and domains. If you have to choose between these two to start off, pick Python.

It is a good idea to learn R as well, because it has a great ecosystem of ML/stats packages and I’ve seen several companies using both R and Python.

When should you learn R?

You can learn R once you’re done with Python. However, if your company / domain mainly uses R, just go with it.

Why learn programming?

Data Scientists are hands-on folks. Ultimately, you will need to do various analyses with data and build ML models.

Why Python?

Python is the default language for ML these days. Most new research and development is done in Python. With mature libraries for ML, stats, deep learning etc., it’s easy to find solutions quickly.

Why R?

R was the most popular language for data science before Python took over. It is built for data analysis, statistical modeling and ML. Data scientists use R for its rich collection of stats and ML packages.

Why SQL?

To collect and transform data from various databases. Most databases speak SQL. People write SQL queries to collect and transform data from various sources and assemble it into a master dataset to be used for data analysis and ML.
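As a small illustration of assembling a master dataset with SQL, here is a sketch that uses Python’s built-in sqlite3 module. The table names, column names and values are made up for the example; a real project would query your company’s databases instead.

```python
import sqlite3

# In-memory database with two hypothetical source tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Pune'), (2, 'Austin'), (3, 'Lyon');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 35.0), (12, 2, 15.0);
""")

# Join and aggregate the sources into one row per customer.
query = """
    SELECT c.customer_id,
           c.city,
           COUNT(o.order_id)          AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_spent
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.city
    ORDER BY c.customer_id
"""
master = conn.execute(query).fetchall()
conn.close()

for row in master:
    print(row)
```

The LEFT JOIN keeps customers who have no orders, which is usually what you want in a master table.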

Note: Another growing language worth mentioning is ‘Julia’. It has a great speed advantage and an expressive syntax that borrows the good parts of both R and Python. However, the Julia community is still small and maturing.

So, if you are getting started, I’d recommend starting with Python. Then learn SQL. Then learn R. Optionally, you can look at Julia and do cool stuff.

What else to learn to go further?

You need to learn Git as well.

Most companies version control their projects with Git, using popular services such as GitHub, BitBucket or GitLab.

I recommend opening a GitHub account to maintain and showcase your ML project work. Once you have done a few good projects (I’ll talk about this in the Projects section below), you can add the GitHub link to your resume. It will increase your chances of getting shortlisted for interviews if people see you can code well.

Best Python Resources

Python Courses

Corey Schafer YouTube – Nice YouTube channel for simple Python explanations. Covers a range of topics but could be better structured.
The Python Course – ML+ (Paid) – I teach this course at ML+ and, as always, keep things simple. Organized and taught using Jupyter Notebook. Ideal for Data Science / ML practitioners. Use code ‘PYTHONFREE21’ at checkout.
Learn Python – Kaggle – Learn by practice with succinct lessons. Not complete though.
Telusko Python – YouTube – Another nice YouTube course to learn Python.

Python Books

Python Crash Course 2nd Ed: Well written book on Introductory Python. Possibly (amongst) the best.
Fluent Python 2nd Ed: Advanced Python book. Read this after you’ve become experienced.

Python Practice Books

Automate the Boring Stuff – Practical approach to learning
Python Workbook – Springer – Collection of nice practice questions.
The Big Book of Small Python Projects – This is also good.

Python Practice Websites

Exercism Python – Nice collection of exercises, nice UI.
Hack in Science – Split by 3 different levels.
Practical Python Projects – Beginner exercises.

Besides these, there are competitive coding websites like LeetCode, Codewars and HackerRank that offer a large number of programming questions. I’d not recommend these for beginners at this stage, simply because our goal right now is to become sufficiently good at Python to do Data Science/ML.

Best SQL Resources

SQL Tutorial with Mosh – Good one (video course).
SQL Tutorial – Mode – Also good (text course).

SQL practice websites

Advanced SQL Puzzles – Practice solving SQL.
SQL Interview questions – YT videos
SQL Questions – Hackerrank – For practice

Best R Programming Resources

R Courses

Machine Learning with R – Learning Path Courses at ML+ (Paid)
R Programming – Coursera

R Books

Applied Predictive Modeling – Highly recommended
The R Book 2nd Ed – Teaches concepts and how to apply in R.
R in Action – Well written

Web books and blogs for R

R for DataScience – Optimized for Data Science
Advanced R – More advanced concepts
Introduction to Statistical Learning (ISLR) – Standard book
R Statistics – By Selva (Myself)
Caret Package Page – Think of it as scikit-learn of R
Quick-R – Nicely organized materials.
R Bloggers – Blogs aggregator

Best Git Resources

Git Course

Git Tutorial Youtube – 1 hour – Well explained.

Git Tutorials and Practice

Git Tutorial by Vogella – Learn Git with one well written article.
Git Tutorial by Atlassian – One of the best references.
OhShitGit! – Important things about Git, with a bit of swearing.
Git Howto – Learn by doing.
Git Immersion – Another learn by doing. Uses Ruby Language though.

Recommended Learning Plan for Programming

Python

Step 1: The Python course – ML+ (Paid). Use coupon code ‘PYTHONFREE21’ at checkout.
Step 2: Revise and Practice Python in Kaggle. Stretch: Solve 15 easy to medium exercises here.

R Programming Route

Base R Programming (Paid) – Watch the full course.
R for DataScience – Read till chapter 21.

SQL

Step 3: SQL Course with Mosh
Step 4: SQL Questions – Hackerrank – Practice 15 questions.

Optional: If your job requires a lot of SQL proficiency, take these lessons as well. If you are not sure, then skip this.

Git

Step 5: Git Course
Step 6: Git Practice

STEP 2. Learn Data Wrangling

(Time Required: 30 days-45 days)

Why learn Data Wrangling?

In Data Science and machine learning, you will primarily work with data, both structured (tabular data) and unstructured (text, images etc). The first step is to gain mastery over wrangling tabular data. This is important.

You should be able to transform, create, merge, split and aggregate data; basically, be comfortable doing anything with datasets. You should also be able to create various types of plots and visualizations.

What to learn?

If you’ve chosen Python as your programming language, learn the NumPy and Pandas libraries well. These are the two main libraries for data wrangling, especially Pandas.
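To give a feel for what wrangling with Pandas looks like, here is a minimal sketch that creates a column, merges two tables, aggregates and splits. The tables and column names are invented for illustration.

```python
import pandas as pd

# Two small, made-up tables.
sales = pd.DataFrame({
    "store_id": [1, 1, 2, 2, 3],
    "units": [5, 3, 8, 2, 7],
    "unit_price": [10.0, 10.0, 12.5, 12.5, 9.0],
})
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "region": ["North", "South", "North"],
})

# Create a new column from existing ones.
sales["revenue"] = sales["units"] * sales["unit_price"]

# Merge the lookup table onto the transactions.
merged = sales.merge(stores, on="store_id", how="left")

# Aggregate revenue by region.
revenue_by_region = merged.groupby("region")["revenue"].sum()

# Split the data into one DataFrame per region.
parts = {region: df for region, df in merged.groupby("region")}

print(revenue_by_region)
```

If these four operations feel natural, most day-to-day tabular work will too.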

Why Pandas?

It is the most popular package for data wrangling in Python, thoughtfully designed and optimized for performance by Wes McKinney and team.

What else to learn to go further?

Learn matplotlib for making plots. Optionally learn the plotly library as well. Other options, just for knowledge, are Altair, Bokeh, Yellowbrick etc.
Learn Dask, built for parallel computing for data analysis and ML. It enables you to work with large datasets and provides tools to make code faster.
Optionally learn the Modin and Vaex packages. They are pandas alternatives that make your data wrangling code run faster.

Best Data Wrangling Resources

Recommended Learning Plan – Data Wrangling

Foundations course (Paid) – Use COUPON at checkout: FOUNDATIONSFREE25
Pandas Course – ML+ (Paid) – Use COUPON at checkout: PANDASFREE25
NumPy Course (Paid) – Use COUPON at checkout: NUMPYFREE25
Matplotlib Course – ML+ (Coming Soon)
Pandas Book

STEP 3. Learn Probability and Statistics

(Time Required: 21 days)

Why learn Probability and Statistics?

Probability and statistics are foundational for machine learning. Statistical methods are often used in tandem with machine learning to:

Perform statistical significance tests to verify claims and enable sound business decisions.

For example: (i) Statistical tests can help decide which is the better approach for customers asking for refunds: providing a full refund vs giving cashback credits. (ii) They can tell whether placing a specific feature on a website increased the time customers spend on it.

Support exploratory data analysis (EDA).
Understand the math behind Bayesian algorithms, which requires basic knowledge of probability.
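A website experiment like the one in example (ii) could be tested along these lines. This is only a sketch: the durations are simulated rather than real, and Welch’s t-test is one reasonable choice of test, not the only one.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated session durations in seconds (made-up numbers, not real data).
control = rng.normal(loc=50.0, scale=10.0, size=200)    # page without the feature
treatment = rng.normal(loc=60.0, scale=10.0, size=200)  # page with the feature

# Welch's t-test: does not assume equal variances in the two groups.
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

alpha = 0.05
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
if p_value < alpha:
    print("Reject the null hypothesis of equal mean durations.")
else:
    print("Not enough evidence to reject the null hypothesis.")
```

The null hypothesis here is that the feature makes no difference to the mean duration; a small p-value says the observed gap would be unlikely if that were true.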

What to learn in ProbStats?

You need to learn various statistical concepts such as standard deviation, covariance, correlation etc. You also need to learn how statistical significance tests work and how to interpret them, especially P-values, null and alternate hypotheses, test statistics, degrees of freedom etc.

Here is a list of essential probability and stats topics:

Sample vs Population; Measures of Central tendency and Dispersion; Law of large numbers; Central Limit Theorem; Joint and conditional probability; Bayes Theorem; Frequency Distributions; Normal distribution; Probability Density Function; Probability Mass function; Hypothesis testing; Type 1 and Type 2 error; P-values; Confidence interval calculation; T-Test; Chi-Squared Test; Mann Whitney U Test; ANOVA; Standard Error
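To make one of these topics concrete, here is a small worked example of Bayes Theorem in plain Python. The prevalence and test accuracy figures are hypothetical, chosen only to show the calculation.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    # Total probability of a positive test (law of total probability).
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive

# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161
```

Even with a fairly accurate test, a positive result here means only about a 16% chance of having the condition, because the condition is rare to begin with.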

Python libraries: Statsmodels, scipy

Best Statistics Resources

Stats Video Lessons

Stats Books

Naked Statistics
Practical statistics for Data Scientists
Think Stats (Free web book)
Think Bayes (Free web book)

Recommended Learning Plan

Complete the StatQuest playlist
Review permutation tests in Practical Statistics for Data Scientists

STEP 4. Practice Data Preprocessing, Exploratory Data Analysis (EDA) and Story Telling

(Time Required: 30 days-45 days)

Why learn EDA and Story Telling?

Once you’ve pieced the data together, you need to:

1. Preprocess Data

Involves handling missing values, identifying outliers and processing them if needed, running sanity checks on the various fields in the data, creating new features if appropriate, considering variable transformations etc.
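Here is a rough sketch of a few of these preprocessing steps in Pandas. The data is invented, and median imputation and the 1.5 * IQR rule are just common defaults; the right choices depend on your dataset.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value and one suspicious age.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29, 120],
    "income": [30000, 42000, 39000, 58000, 35000, 61000],
})

# Sanity check: count missing values per column.
print(df.isna().sum())

# Impute the missing age with the median, which is robust to outliers.
df["age"] = df["age"].fillna(df["age"].median())

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df["age_outlier"] = (df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)

# Variable transformation: log-transform a right-skewed column.
df["log_income"] = np.log1p(df["income"])

print(df)
```

Flagging outliers in a separate column, rather than dropping them right away, leaves the decision about how to treat them for later.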

2. Perform Exploratory Data Analysis (EDA)

Involves performing various data analyses to summarize the data and the relationships that exist between variables. You may have to conduct statistical tests, and bring out statistics that will be of interest.

3. Come up with insights and build the story (that the data narrates).

Have conversations or even working sessions with your stakeholders. It is good to be on the same page with your SME (subject matter expert) partner before sharing the findings with a larger audience.

What to learn?

Data Preprocessing Techniques
Univariate and Bivariate analysis (put your stats learning to use here).
Practice on real world datasets/problems
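As a starting point, univariate and bivariate analysis in Pandas can look like the sketch below. The dataset is made up; on a real problem you would run the same kind of summaries on your own columns.

```python
import pandas as pd

# Small made-up dataset.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "sales": [25, 41, 62, 79, 105, 118],
    "channel": ["web", "web", "tv", "tv", "web", "tv"],
})

# Univariate: summarize one variable at a time.
print(df["sales"].describe())
print(df["channel"].value_counts())

# Bivariate (numeric vs numeric): Pearson correlation.
corr = df["ad_spend"].corr(df["sales"])
print(f"correlation = {corr:.3f}")

# Bivariate (categorical vs numeric): compare group means.
mean_sales_by_channel = df.groupby("channel")["sales"].mean()
print(mean_sales_by_channel)
```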

What else can be helpful?

Ability to build dashboards quickly with tools like Tableau / Power BI / Jupyter / Streamlit / Dash. Streamlit and Dash have the advantage of being flexible enough to use ML model outputs in the backend.

Best Resources for EDA and Story telling

Data Preprocessing and EDA course
Kaggle learning competitions
Data Visualization practice
Top 50 matplotlib visualizations for Data Analysis (Bookmark this page.)
Python Graph Gallery

Recommended Learning Plan – EDA and Data Preprocessing

Data Preprocessing and EDA course
Kaggle learning competitions
1. Titanic Disaster Competition
2. Housing Prices Prediction

STEP 5. Learn Machine Learning Concepts and Algorithms

(Time Required: 60 days)

What to learn?

Learn the fundamental concepts behind machine learning. For example, concepts like cross validation, bias-variance tradeoff, hyperparameter tuning and overfitting are needed to implement ML.
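Two of these concepts, cross validation and hyperparameter tuning, can be tried out in a few lines of scikit-learn. This is a minimal sketch on the built-in iris dataset; the choice of model and the grid of C values are arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling lives inside the pipeline so each CV fold is scaled on its own training part.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross validation: one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")

# Hyperparameter tuning: search over the regularization strength C.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Putting the scaler inside the pipeline avoids leaking information from the validation fold into training, which is an easy mistake to make when preprocessing is done up front.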

Things you should know:

ML fundamental concepts
ML Algorithms – The math behind it
Building ML in Python / R, tuning and validating
Model performance improvement strategies

You can learn the math behind ML before or after learning to implement it, whichever works best for you. Just don’t make it an excuse to put off getting hands-on for too long.

Essential ML Concepts Topics

Generative vs Discriminative models, Cross Validation, OOB error; Bias-Variance Tradeoff; R-Squared; Adjusted R-Squared; RSS; TSS; Forward Selection; Backward Selection; Box-Cox Transforms; Entropy; Information Gain; Gini Index; Binary Cross Entropy; Variance Inflation Factor; Assumptions of Linear regression; Heteroscedasticity; Sigmoid function; Softmax; Cost Function; ROC Curve; Precision-Recall; Confusion Matrix and related concepts; Imbalanced Classification; Effect of outliers; Missing value treatment approaches; Expectation Maximization Algorithm; Maximum Likelihood Estimation.
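Several of these concepts are short formulas that are worth coding once by hand. As an example, here is a sketch of Entropy, Gini Index and Information Gain in plain Python, using toy labels invented for the illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini impurity."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left and right."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Toy labels: a balanced parent node split into two purer children.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print(entropy(parent))   # 1.0
print(gini(parent))      # 0.5
print(round(information_gain(parent, left, right), 3))  # 0.278
```

Decision trees pick the split with the largest information gain (or the largest drop in Gini impurity), so these small functions are the core of how a tree chooses its splits.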

Essential ML Algorithms

Linear Regression; Logistic Regression; Gradient Descent; Decision Trees; Gradient Boosting; Adaboost; Random Forest; SVM; k-Nearest Neighbors; k-Means Clustering; Hierarchical Clustering; Principal Components Analysis; Regularization; Ridge and LASSO regression; Naive Bayes; XGBoost
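To see how two items on this list fit together, here is a minimal sketch of Gradient Descent fitting a Linear Regression with NumPy. The data is synthetic, and the learning rate and number of iterations are arbitrary values that happen to work for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 3x + 2 plus a little noise (made up for illustration).
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

w, b = 0.0, 0.0        # initial slope and intercept
learning_rate = 0.1

for _ in range(1000):
    error = (w * x + b) - y
    # Gradients of the mean squared error with respect to w and b.
    grad_w = 2.0 * np.mean(error * x)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"w = {w:.2f}, b = {b:.2f}")
```

The fitted slope and intercept should land close to the true values of 3 and 2. In practice you would use scikit-learn for this, but writing the loop once makes the idea of a cost function and its gradient much less abstract.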

Note: Know the key hyperparameters of the various algorithms and their purpose.

Stretch: Variations of decision trees such as ID3, C4.5, CART and CHAID; other models such as MARS, CRF, HMM and Gaussian Mixture Models; boosting libraries such as CatBoost and LightGBM.

Why learn ML Algorithms and the math behind it?

Here are 5 solid reasons for you to learn the internals of how ML algorithms work.

What else can be helpful?

You need to be quite familiar with using the scikit-learn package for building ML models. In addition to this, pick up the statsmodels package as well.

It would also be great to know how to use packages like ‘h2o’. You can use it with both R and Python.

Another important domain is Time Series forecasting, which deals primarily with forecasting the future values of time series data.

Later on, once you’ve completed the learning path, you can explore advanced modeling such as probabilistic modeling with packages such as PyMC3, Pyro, etc. Sometimes companies open-source their internal ML projects for a wider audience, and some of them are very good. For example: Facebook released Prophet for time series forecasting.

Best Machine Learning Courses

ML+ Linear Regression Course (Paid)
ML+ Logistic Regression Course (Paid)
ML+ Supervised Learning Course (Paid)
ML+ Ensemble Modeling Course (Paid)
Prof. Andrew Ng ML Course
Krish Naik – ML Playlist

Time Series Forecasting Resources

Time Series Blogs

Recommended Learning Plan – ML

ML+ Linear Regression Course (Paid)
ML+ Logistic Regression Course (Paid)
ML+ Supervised Learning Course (Paid)
ML+ Ensemble Modeling Course (Paid)
ML+ Feature Engineering Course (Coming Soon, check the store)
ML+ Dimensionality Reduction Course (Coming Soon, check the store)

For topics 5 and 6, check out the corresponding lectures from Prof. Andrew Ng’s course.

STEP 6. Learn Deep Learning Concepts and Algorithms

(Time Required: 60 days)

Why learn Deep Learning?

Deep learning is a special branch of machine learning, based on innovative architectures of one type of ML algorithm called neural networks. A neural network is composed of input, hidden and output layers of neurons. When the number of hidden layers increases, the network becomes ‘deep’, hence the name deep learning.
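The input, hidden and output layers can be sketched as a forward pass in NumPy. The layer sizes, random weights and choice of ReLU and sigmoid activations below are arbitrary assumptions for illustration; nothing is trained here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Layer sizes are arbitrary choices: 3 inputs, 4 hidden neurons, 1 output.
W1 = rng.normal(size=(3, 4))   # input -> hidden weights
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))   # hidden -> output weights
b2 = np.zeros(1)

def forward(X):
    hidden = np.maximum(0.0, X @ W1 + b1)   # hidden layer with ReLU activation
    return sigmoid(hidden @ W2 + b2)        # output layer squashed into (0, 1)

X = rng.normal(size=(5, 3))    # batch of 5 examples with 3 features each
output = forward(X)
print(output.shape)            # (5, 1)
```

A deeper network simply repeats the hidden-layer step with more weight matrices between the input and the output.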

With innovative architectures, deep learning has been used to solve a variety of real world problems such as speech translation, face detection, object tracking, self driving, winning computer games, text generation, creating pictures from text etc.

Companies use deep learning for various such use cases.

What to learn?

Deep learning architectures such as Multi layer perceptron, Convolutional neural networks (CNN), Recurrent neural networks (RNN), Long Short Term Memory Networks (LSTM), GANs, Encoder-Decoder, Transformers etc.
Deep Learning concepts such as Chain Rule and Back propagation, Activation functions, L1 and L2 regularization, Dropout layer, Early stopping, Optimizers (RMSProp, Adam, AdaGrad), Batch Normalization, Pooling, Transfer Learning.
Frameworks: Tensorflow, PyTorch or both.
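Before relying on a framework’s automatic differentiation, it helps to work through the Chain Rule and Back propagation once by hand. The sketch below does this for a single sigmoid neuron with a squared error loss and checks the result against a numerical gradient. The weights, inputs and target are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w @ x + b) - y) ** 2

def gradients(w, b, x, y):
    # Forward pass, keeping intermediate values.
    z = w @ x + b
    a = sigmoid(z)
    # Backward pass: multiply local derivatives along the chain.
    dloss_da = 2.0 * (a - y)
    da_dz = a * (1.0 - a)
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz   # dz/dw = x, dz/db = 1

# Arbitrary example values.
w = np.array([0.5, -0.3])
b = 0.1
x = np.array([1.0, 2.0])
y = 1.0

grad_w, grad_b = gradients(w, b, x, y)

# Check the first weight's gradient against a central finite difference.
eps = 1e-6
step = np.array([eps, 0.0])
numeric = (loss(w + step, b, x, y) - loss(w - step, b, x, y)) / (2 * eps)
print(grad_w[0], numeric)
```

The two printed numbers should agree closely. Frameworks like Tensorflow and PyTorch apply the same chain of local derivatives automatically across many layers.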

What else can be helpful?

Once you’ve learnt the overall concepts, pick one area like NLP or Computer Vision and dive deeper.
For NLP, libraries like spaCy, Hugging Face and GluonNLP are the main ones.
For Computer Vision, start with OpenCV, and check out the various applications on the Keras and PyTorch websites.

Best Deep Learning Resources

Best Deep Learning Courses

Deep Learning Books

Recommended Learning Plan

Coursera – Deep Learning Specialization

STEP 6. Learn ML Ops and Big Data

(Time Required: 30 days-45 days)

Why learn MLOps?

Traditionally, statisticians would share the results of their models as a dataset (an Excel file or a database table) or create a presentation to share their findings.

But now, Data Scientists deploy the ML models. Typically, the deployment is a dashboard or a REST API that takes an input (ex: voice) and gives back an output (ex: text).
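To show the shape of such an API, here is a bare-bones sketch using only Python’s standard library. The predict function is a hypothetical stand-in for a trained model, and the JSON format and port are assumptions made for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a hard-coded linear score (hypothetical)."""
    weights = [0.4, 0.6]
    return sum(w * f for w, f in zip(weights, features))

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Expect a JSON body such as {"features": [1.0, 2.0]}.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"prediction": predict(payload["features"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def run_server(port=8000):
    """Start the service; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()
```

Calling run_server() starts the service locally, and a POST request with a JSON body returns the prediction as JSON. In a real deployment you would more likely use a framework such as Flask or FastAPI, which handle routing, input validation and error responses for you.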

Once models are deployed, their performance should be monitored and the models refreshed periodically.

MLOps deals with practices that enable you to deploy, monitor and refresh ML models in production in a reliable way.

This requires DevOps and ML Engineers to work together. This area is still evolving and companies (like AWS, Google Cloud, MS Azure, Databricks etc) strive to do the heavy lifting to enable Data Scientists to take their models to production with minimal effort.

What to learn?

Frameworks to turn an ML model into a REST API. Ex: Python: Flask, Django, FastAPI. R: Plumber
Docker: Solves for dependencies and reproducibility.
SageMaker: Flagship ML development, deployment and monitoring platform from AWS.
Kubernetes / KubeFlow: Kubernetes enables automated deployment and scaling. KubeFlow makes deploying ML workflows on Kubernetes simple.
Seldon Core: Deploy ML models quickly on Kubernetes at scale.
CI/CD: Continuous Integration and Continuous Delivery of ML solutions. GitHub Actions, Jenkins, CircleCI, GitLab, Travis CI are popular tools.
ML Orchestration: Create, schedule and monitor workflows as an end-to-end pipeline. Popular tools are: AirFlow, Dagster, Prefect, Luigi. ML focused pipeline tools include KubeFlow Pipelines, Metaflow, Flyte.

Certifications (Optional)

AWS and Google provide training and certifications.

AWS

Google Cloud

Google Cloud – Professional Machine Learning Engineer

What else can be helpful?

Learn distributed computing with PySpark. Several companies use PySpark for Extract, Transform, Load (ETL) operations as well as for building ML models and applications on big datasets.

Databricks and Cloudera are leading providers. PySpark, the Python API for Apache Spark, is the package you need to install and start using.

MLFlow: A popular open source platform to manage the ML life cycle. Language and library agnostic.

Best ML Ops Courses

Deploying ML in AWS EC2 – ML+
Docker Course – Mosh
AWS Sagemaker course – ML+
Deploy ML in AWS Lambda Course – ML+
Full Stack Deep Learning – Fall 2019, 2022
Docker Hands-On training
CI/CD for Machine Learning – MadewithML
GitHub Actions

Recommended Learning Plan – MLOps

Deploying ML in AWS EC2 – ML+ (Paid)
Docker Course – Mosh
AWS Sagemaker course – ML+ (Paid)
AWS Lambda Course – ML+ (Paid)
KubeFlow Course – Google
Pyspark Tutorial

Topics such as CI/CD pipelines and orchestration frameworks can change from company to company and are easy to pick up on the job.

STEP 7. Do Portfolio Projects

(Time Required: 21 days+)

Why build ML projects?

The projects that you showcase in your resume speak volumes, more than certifications ever will.

But what type of projects to build?

The kind that companies actually execute, that is, Data science projects that help generate revenue or save costs.

What to build?

It’s ok to start with educational projects such as the Kaggle’s Titanic competition or the Housing Price prediction competition. There are more such competitions for knowledge and practice.

Then solve projects that companies are interested in, like, estimating customer lifetime value, forecasting demand of products, predicting churning customers, simulating optimal marketing budget spend, etc.

What else can be helpful?

Solve projects end-to-end with well written code. Public datasets that represent real world data are good enough. Host your code in GitHub repositories, maintain your repos well and add the GitHub link to your resume.

Best Projects Resources

Recommended Learning Plan – DS Portfolio Projects

If you are interested in more projects, check out Kaggle competitions, get practice and host your code on GitHub.

Do You Want to Learn More About Data Science?

Data Science and ML have become integral to most companies, both IT and product companies. Machine Learning Plus has everything you need to make your Data Science Roadmap journey easier and achievable.

The Machine Learning Plus Complete ML Mastery Courses feature the ideal learning path if you want to succeed in a career in Data Science. If you are struggling to understand ‘tough’ ML and Stats concepts, at ML+ you will get clear, complete and straightforward explanations.

Complete ML Mastery covers vital data science topics, starting with the Foundations of ML course and continuing through Python programming, R programming, machine learning and deep learning. You also build portfolio machine learning projects, which give you exposure and build your business acumen. You can showcase these projects in your resume and talk about them to win interviews.

According to Glassdoor, Data scientists earn an annual average of $123,250. The world needs more data scientists. Companies offer attractive incentives and a stable, secure and fulfilling career path. If this sounds like your kind of profession, take those first few steps towards a new career. Start the ML Mastery Self Learning Path today!
