Data Science Roadmap – How to become a Data Scientist? (6 month self study plan)

Today, I discuss the Data Science Roadmap, the missing guide to self-studying machine learning. I’ll cover exactly what you need to know and do in order to self-study Data Science / ML / AI / Stats. For each topic, I will explain why you need to learn it, point you to some of the best resources, and recommend a self-study plan.

STEP 1. Learn Programming (Python, R, SQL)

(Time Required: 30 days-45 days)

What to learn?

The main programming languages for Data Science:

  1. Python
  2. R Programming
  3. SQL

Which one to choose first?

Between Python and R, Python is more popular and more widely adopted. R is also popular in certain countries and domains. If you have to choose between the two to start off, pick Python.

It is a good idea to learn R as well, because it has a great ecosystem of ML/stats packages, and I’ve seen several companies use both R and Python.

When should you learn R?

You can learn R once you are comfortable with Python. However, if your company or domain mainly uses R, start with R.

Why learn programming?

Data Scientists are hands-on folks. Ultimately, you will need to do various analyses with data and build ML models.

Why Python?

Python is the default language for ML these days. Most new research and development is done in Python. With mature libraries for ML, stats, deep learning etc, it’s easy to find solutions quickly.

Why R?

R was the most popular language before Python took over. It is built for data analysis, statistical modeling and ML. Data Scientists use R for its rich collection of packages in stats and ML.

Why SQL?

To collect and transform data from different databases. Most databases speak SQL. People write SQL queries to collect and transform data from various sources and assemble it into a master dataset to be used for data analysis and ML.
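As a rough illustration of that workflow, here is a minimal sketch that uses Python’s built-in sqlite3 module with an in-memory database. The table names (customers, orders), the columns and the values are hypothetical, made up only for this example; a real project would connect to an existing database instead.

```python
import sqlite3

# In-memory database; the tables and rows are made-up illustration data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ben');
    INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 30.0), (12, 2, 15.0);
""")

# Join the two sources and aggregate to one row per customer,
# the kind of "master" table you would hand over to analysis or ML.
query = """
    SELECT c.name, COUNT(o.order_id) AS n_orders, SUM(o.amount) AS total_spend
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY c.customer_id
"""
master_rows = conn.execute(query).fetchall()
print(master_rows)
```

The same JOIN and GROUP BY pattern carries over to most SQL databases, although the connection code differs.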

Note: Another growing language worth mentioning is ‘Julia’. It has a great speed advantage and an expressive syntax that borrows the good parts of both R and Python. However, the Julia community is still growing.

So, if you are getting started, I’d recommend starting with Python. Then learn SQL. Then learn R. Optionally, you can look at Julia and do cool stuff.

What else to learn to go further?

You need to learn Git as well.

Most companies version control their projects with Git, using popular services such as GitHub, BitBucket or GitLab.

I recommend opening a GitHub account to maintain and showcase your ML project work. Once you have done a few good projects (I’ll talk about this in the Projects section below), you can add the GitHub link to your resume. If people see you can code well, it will increase your chances of getting shortlisted for interviews.

Best Python Resources

Python Courses

  1. Corey Schafer YouTube – Nice YouTube channel with simple Python explanations. Covers a range of topics, but could be better structured.

  2. The Python Course – ML+ (Paid) – I teach this course at ML+, simplify things as always. Organized and taught using Jupyter Notebook. Ideal for Data Science / ML practitioners. Use code ‘PYTHONFREE21’ at checkout.

  3. Learn Python – Kaggle – Learn with practice with succinct lessons. Not complete though.

  4. Telusko Python – Youtube – Another nice Youtube course to learn Python.

Python Books

  1. Python Crash Course 2nd Ed: Well written book on Introductory Python. Possibly (amongst) the best.
  2. Fluent Python 2nd Ed: Advanced Python book. Read this after you’ve become experienced.

Python Practice Books

  1. Automate the Boring Stuff – Practical approach to learning
  2. Python Workbook – Springer – Collection of nice practice questions.
  3. The Big Book of Small Python Projects – This is also good.

Python Practice Websites

  1. Exercism Python – Nice collection of exercises, nice UI.
  2. Hack in Science – Split by 3 different levels.
  3. Practical Python Projects – Beginner exercises.

Besides these, there are competitive coding websites like Leetcode, Codewars and Hackerrank that offer a large number of programming questions. I would not recommend these for beginners at this stage, simply because our goal right now is to become sufficiently good at Python to do Data Science/ML.

Best SQL Resources

  1. SQL Tutorial with Mosh – Good one (video course).
  2. SQL Tutorial – Mode – Also good (text course).

SQL practice websites

  1. Advanced SQL Puzzles – Practice solving SQL.
  2. SQL Interview questions – YT videos
  3. SQL Questions – Hackerrank – For practice

Best R Programming Resources

R Courses

  1. Machine Learning with R – Learning Path Courses at ML+ (Paid)
  2. R Programming – Coursera

R Books

  1. Applied Predictive Modeling – Highly recommended
  2. The R Book 2nd Ed – Teaches concepts and how to apply in R.
  3. R in Action – Well written

Web books and blogs for R

  1. R for DataScience – Optimized for Data Science
  2. Advanced R – More advanced concepts
  3. Introduction to Statistical Learning (ISLR) – Standard book
  4. R Statistics – By Selva (Myself)
  5. Caret Package Page – Think of it as scikit-learn of R
  6. Quick-R – Nicely organized materials.
  7. R Bloggers – Blogs aggregator

Best Git Resources

Git Course

  1. Git Tutorial Youtube – 1 hour – Well explained.

Git Tutorials and Practice

  1. Git Tutorial by Vogella – Learn Git with one well written article.
  2. Git Tutorial by Atlassian – One of the best references.
  3. OhShitGit! – Important things about Git, with a bit of swearing.
  4. Git Howto – Learn by doing.
  5. Git Immersion – Another learn by doing. Uses Ruby Language though.

Recommended Learning Plan for Programming

Python

Step 1: The Python course – ML+ (Paid). Use coupon code ‘PYTHONFREE21’ at checkout.
Step 2: Revise and Practice Python in Kaggle. Stretch: Solve 15 easy to medium exercises here.

R Programming Route

  1. Base R Programming (Paid) – Watch the full course.
  2. R for DataScience – Read till chapter 21.

SQL

Step 3: SQL Course with Mosh
Step 4: SQL Questions – Hackerrank – Practice 15 questions.

Optional: If your job requires a lot of SQL proficiency, take these lessons as well. If you are not sure, then skip this.

Git

Step 5: Git Course
Step 6: Git Practice

STEP 2. Learn Data Wrangling

(Time Required: 30 days-45 days)

Why learn Data Wrangling?

In Data Science and machine learning, you will primarily work with data, both structured (tabular data) and unstructured (text, images etc). The first step is to gain mastery over wrangling tabular data. This is important.

You should be able to transform, create, merge, split and aggregate datasets; basically, be comfortable doing anything with them. You should also be able to create various types of plots and visualizations.
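To make those operations concrete, here is a minimal sketch in pandas, assuming pandas is installed. The two small tables (customers, orders) and their values are made up for illustration.

```python
import pandas as pd

# Made-up illustration data.
customers = pd.DataFrame({"customer_id": [1, 2, 3], "city": ["Pune", "Oslo", "Pune"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 3], "amount": [20.0, 30.0, 15.0, 5.0]})

# Merge: attach customer attributes to every order.
merged = orders.merge(customers, on="customer_id", how="left")

# Create: derive a new column from an existing one.
merged["is_large"] = merged["amount"] >= 20

# Aggregate: total and mean order amount per city.
summary = (
    merged.groupby("city", as_index=False)
    .agg(total_amount=("amount", "sum"), mean_amount=("amount", "mean"))
)
print(summary)
```

Merge, derive and aggregate cover a large share of day-to-day wrangling; splitting and reshaping build on the same DataFrame methods.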

What to learn?

If you’ve chosen Python as your programming language, learn the NumPy and Pandas libraries well. These are the two main libraries for data wrangling, especially Pandas.

Why Pandas?

It is the most popular package for data wrangling in Python, thoughtfully designed and optimized for performance by Wes McKinney and team.

What else to learn to go further?

  1. Learn matplotlib for making plots. Optionally, learn the plotly library as well. Other options, just for knowledge, are Altair, Bokeh, Yellowbrick etc.

  2. Learn Dask, built for parallel computing for Data analysis and ML. It enables you to work with large datasets and provides tools to make code faster.

  3. Optionally, learn the Modin and Vaex packages. They are pandas alternatives that can make your data wrangling code run faster.

Best Data Wrangling Resources

  1. Pandas User Guide
  2. NumPy Basics
  3. NumPy Course – ML+
  4. Pandas Course – ML+
  5. Dask Tutorial, Dask Offl
  6. Modin Tutorial, Modin Offl
  7. Vaex Tutorial, Vaex Offl

Recommended Learning Plan – Data Wrangling

  1. Foundations course (Paid) – Use COUPON at checkout: FOUNDATIONSFREE25
  2. Pandas Course – ML+ (Paid) – Use COUPON at checkout: PANDASFREE25
  3. NumPy Course (Paid) – Use COUPON at checkout: NUMPYFREE25
  4. Matplotlib Course – ML+ (Coming Soon)
  5. Pandas Book

STEP 3. Learn Probability and Statistics

(Time Required: 21 days)

Why learn Probability and Statistics?

Probability and statistics are foundational concepts for Machine learning. Often statistical methods are used in tandem with machine learning for:

  1. Performing statistical significance tests to check claims and enable sound business decisions.

For example: (i) Statistical tests can help decide which approach is better for customers asking for refunds: providing a full refund vs giving cashback credits. (ii) Whether placing a specific feature on a website improved how long customers stay on it, etc.

  2. Supporting exploratory data analysis (EDA).

  3. Understanding the math behind Bayesian algorithms, which requires basic knowledge of probability.

What to learn in ProbStats?

You need to learn various statistical concepts such as standard deviation, covariance, correlation etc. You also need to know how statistical significance tests work and how to interpret them, especially P-values, null and alternative hypotheses, test statistics, degrees of freedom etc.
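Here is a minimal sketch of one such significance test, using scipy (one of the Python libraries mentioned in this step). The two samples are made-up session durations for two hypothetical website variants, and the 0.05 significance level is an assumption chosen for the example.

```python
from scipy import stats

# Made-up session durations (seconds) for two website variants.
variant_a = [31, 28, 35, 30, 29, 33, 32, 27]
variant_b = [36, 39, 34, 41, 38, 35, 40, 37]

# Null hypothesis: both variants have the same mean duration.
# Welch's t-test (equal_var=False) does not assume equal variances.
result = stats.ttest_ind(variant_a, variant_b, equal_var=False)

alpha = 0.05  # significance level chosen for this sketch
reject_null = result.pvalue < alpha
print(f"t = {result.statistic:.2f}, p = {result.pvalue:.4f}, reject null: {reject_null}")
```

A P-value below alpha means a difference this large would be unlikely if the null hypothesis were true. It does not, by itself, say how large or how important the difference is.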

Here is a list of essential probability and stats topics:

Sample vs Population; Measures of Central tendency and Dispersion; Law of large numbers; Central Limit Theorem; Joint and conditional probability; Bayes Theorem; Frequency Distributions; Normal distribution; Probability Density Function; Probability Mass function; Hypothesis testing; Type 1 and Type 2 error; P-values; Confidence interval calculation; T-Test; Chi-Squared Test; Mann Whitney U Test; ANOVA; Standard Error
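To give a feel for one item on this list, here is a small worked example of Bayes Theorem in plain Python. The fraud-detection scenario and all of its numbers are made up for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(H | positive signal) via Bayes' theorem for a binary hypothesis H."""
    # Total probability of a positive signal, summed over H and not-H.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


# Made-up numbers: 1% of transactions are fraudulent, the detector flags
# 90% of fraud and wrongly flags 5% of legitimate transactions.
p_fraud_given_flag = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p_fraud_given_flag, 3))
```

Even with a fairly accurate detector, most flagged transactions in this scenario are legitimate, because fraud is rare to begin with.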

Python libraries: Statsmodels, scipy

Best Statistics Resources

Stats Video Lessons

  1. Statsquest – YT
  2. Statistics and Probability – Khan Academy

Stats Books

  1. Naked Statistics
  2. Practical statistics for Data Scientists
  3. Think Stats (Free web book)
  4. Think Bayes (Free web book)

Recommended Learning Plan

  1. Complete statsquest playlist
  2. Review permutation tests in Practical statistics for Data Scientists

STEP 4. Practice Data Preprocessing, Exploratory Data Analysis (EDA) and Story Telling

(Time Required: 30 days-45 days)

Why learn EDA and Story Telling?

Once you’ve pieced the data together, you need to:

1. Preprocess Data

This involves handling missing values, identifying outliers and treating them if needed, running sanity checks on the various fields in the data, creating new features where appropriate, considering variable transformations etc.

2. Perform Exploratory Data Analysis (EDA)

This involves performing various analyses to summarize the data and the relationships that exist between variables. You may have to conduct statistical tests and bring out statistics that will be of interest.

3. Come up with insights and build the story (that the data narrates).

Have conversations or even working sessions with your stakeholders. It is good to be on the same page with your SME (subject matter expert) partner before sharing the findings with a larger audience.
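The preprocessing and EDA steps above can be sketched as follows, assuming pandas and NumPy are installed. The dataset is made up, and the specific choices (median imputation, the 1.5 * IQR outlier rule, a log transform) are just common options, not the only valid ones.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value and one suspicious outlier.
df = pd.DataFrame({
    "segment": ["a", "a", "b", "b", "b"],
    "age": [25, 32, np.nan, 41, 29],
    "income": [40.0, 52.0, 48.0, 500.0, 45.0],
})

# --- Preprocessing ---
# Missing values: fill with the median (one of several possible strategies).
df["age"] = df["age"].fillna(df["age"].median())

# Outliers: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
df["income_outlier"] = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

# Variable transformation: log1p compresses the long right tail.
df["log_income"] = np.log1p(df["income"])

# --- EDA ---
# Univariate: summary statistics of one variable, flagged outliers excluded.
clean = df[~df["income_outlier"]]
print(clean["income"].describe())

# Bivariate: mean income per segment, and correlation between two variables.
income_by_segment = clean.groupby("segment")["income"].mean()
age_income_corr = clean["age"].corr(clean["income"])
print(income_by_segment)
print(round(age_income_corr, 2))
```

Whether to drop, cap or keep a flagged outlier is a judgment call that usually needs domain input, which is where the stakeholder conversations come in.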

What to learn?

  • Data Preprocessing Techniques
  • Univariate and Bivariate analysis (put your stats learning to use here).
  • Practice on real world datasets/problems

What else can be helpful?

The ability to build dashboards quickly with tools like Tableau / Power BI / Jupyter / Streamlit / Dash. Streamlit and Dash have the advantage of being flexible enough to use ML model outputs in the backend.

Best Resources for EDA and Story telling

  1. Data Preprocessing and EDA course
  2. Kaggle learning competitions
  3. Data Visualization practice
  4. Top 50 matplotlib visualizations for Data Analysis (Bookmark this page.)
  5. Python Graph Gallery

Recommended Learning Plan – EDA and Data Preprocessing

  1. Data Preprocessing and EDA course
  2. Kaggle learning competitions
    1. Titanic Disaster Competition
    2. Housing Prices Prediction

STEP 5. Learn Machine Learning Concepts and Algorithms

(Time Required: 60 days)

What to learn?

Learn the fundamental concepts behind machine learning. For example, concepts like cross validation, bias-variance tradeoff, hyperparameter tuning, overfitting etc are needed to implement ML.
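As a small illustration of two of these concepts, cross validation and hyperparameter tuning, here is a minimal sketch using scikit-learn and its bundled iris dataset. The choice of logistic regression and the grid of `C` values are assumptions made only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: train on 4 folds, score on the held-out fold, repeat.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())

# Hyperparameter tuning: try several regularization strengths (C),
# each one scored with the same 5-fold cross validation.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

Scoring on held-out folds rather than on the training data is what lets cross validation expose overfitting.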

Things you should know:

  1. ML fundamental concepts
  2. ML Algorithms – The math behind it
  3. Building ML in Python / R, tuning and validating
  4. Model performance improvement strategies

You can learn the math behind ML before or after learning to implement it, whichever works best for you. Just don’t make it an excuse for not getting hands-on for too long.

Essential ML Concepts Topics

Generative vs Discriminative models; Cross Validation; OOB error; Bias-Variance Tradeoff; R-Squared; Adjusted R-Squared; RSS; TSS; Forward Selection; Backward Selection; Box-Cox Transforms; Entropy; Information Gain; Gini Index; Binary Cross Entropy; Variance Inflation Factor; Assumptions of Linear regression; Heteroscedasticity; Sigmoid function; Softmax; Cost Function; ROC Curve; Precision-Recall; Confusion Matrix and related concepts; Imbalanced Classification; Effect of outliers; Missing value treatment approaches; Expectation Maximization Algorithm; Maximum Likelihood Estimation.

Essential ML Algorithms

Linear Regression; Logistic Regression; Gradient Descent; Decision Trees; Gradient Boosting; Adaboost; Random Forest; SVM; k-Nearest Neighbors; k-Means Clustering; Hierarchical Clustering; Principal Components Analysis; Regularization; Ridge and LASSO regression; Naive Bayes; XGBoost

Note: Know the key hyperparameters of the various algorithms and their purpose.

Stretch: Variations of decision trees such as ID3, C4.5, CART and CHAID; boosting libraries such as CatBoost and LightGBM; and other models such as MARS, CRF, HMM and Gaussian Mixture Models.

Why learn ML Algorithms and the math behind it?

Here are 5 solid reasons for you to learn the internals of how ML algorithms work.

What else can be helpful?

You need to be quite familiar with using the scikit-learn package for building ML models. In addition, pick up the statsmodels package as well.

It would also be great to know how to use packages like ‘h2o’, which you can use with both R and Python.

Another important domain is Time Series forecasting, which deals primarily with forecasting the future values of time series data.

Later on, once you’ve completed the learning path, you can explore advanced modeling such as probabilistic modeling with packages such as PyMC3, Pyro, etc. Sometimes companies open-source their internal ML projects for a wider audience, and some of them are very good. For example, Facebook released Prophet for time series forecasting.

Best Machine Learning Courses

  1. ML+ Linear Regression Course (Paid)
  2. ML+ Logistic Regression Course (Paid)
  3. ML+ Supervised Learning Course (Paid)
  4. ML+ Ensemble Modeling Course (Paid)
  5. Prof. Andrew Ng ML Course
  6. Krish Naik – ML Playlist

Time Series Forecasting Resources

  1. Time Series Analysis Course
  2. Singular Spectrum Analysis Course (Coming soon, check the store)
  3. Time Series Forecasting Course (Coming Soon, check the store)
  4. Time Series Feature Engineering Course (Coming Soon, check the store)

Time Series Blogs

  1. Time Series Analysis
  2. ARIMA Blog
  3. Otexts blog
  4. PennState tutorial
  5. Dr. Rob J Hyndman Blog

Recommended Learning Plan – ML

  1. ML+ Linear Regression Course (Paid)
  2. ML+ Logistic Regression Course (Paid)
  3. ML+ Supervised Learning Course (Paid)
  4. ML+ Ensemble Modeling Course (Paid)
  5. ML+ Feature Engineering Course (Coming Soon, check the store)
  6. ML+ Dimensionality Reduction Course (Coming Soon, check the store)

For topics 5 and 6, check out the corresponding lectures from Prof. Andrew Ng’s course.

STEP 6. Learn Deep Learning Concepts and Algorithms

(Time Required: 60 days)

Why learn Deep Learning?

Deep learning is a special branch of machine learning, based on innovative architectures of one type of ML algorithm called neural networks. A neural network is composed of input, hidden and output layers of neurons. When the number of layers increases, the network becomes ‘deep’, hence the name deep learning.
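To make the layered structure concrete, here is a minimal sketch of a forward pass through a small network in NumPy. The layer sizes, the random (untrained) weights and the choice of ReLU and softmax activations are all assumptions for illustration; real projects would use a framework such as TensorFlow or PyTorch and train the weights.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Layer sizes: 4 inputs -> two hidden layers of 8 neurons -> 3 output classes.
layer_sizes = [4, 8, 8, 3]
weights = [
    rng.normal(scale=0.1, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(x):
    """Pass a batch through every hidden layer, then the output layer."""
    a = x
    for w, b in zip(weights[:-1], biases[:-1]):
        a = relu(a @ w + b)  # hidden layer: linear step followed by activation
    return softmax(a @ weights[-1] + biases[-1])  # output: class probabilities


batch = rng.normal(size=(5, 4))  # 5 made-up samples with 4 features each
probs = forward(batch)
print(probs.shape)
```

Adding entries to `layer_sizes` adds hidden layers, which is all that "deeper" means structurally.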

Through innovative architectures, deep learning has been used to solve a variety of real world problems such as speech translation, face detection, object tracking, self driving, winning computer games, text generation, creating pictures from text etc.

Companies use deep learning for various such use cases.

What to learn?

  1. Deep learning architectures such as Multi layer perceptron, Convolutional neural networks (CNN), Recurrent neural networks (RNN), Long Short Term Memory Networks (LSTM), GANs, Encoder-Decoder, Transformers etc.

  2. Deep Learning concepts such as Chain Rule and Back propagation, Activation functions, L1 and L2 regularization, Dropout layer, Early stopping, Optimizers like RMSProp, Adam and AdaGrad, Batch Normalization, Pooling, Transfer Learning.

  3. Frameworks: Tensorflow, PyTorch or both.

What else can be helpful?

  • Once you have learnt the overall concepts, pick one area like NLP or Computer Vision and dive deeper.

  • For NLP, libraries like spaCy, HuggingFace and Gluon NLP are the main ones.

  • For Computer Vision, start with OpenCV, and check out the various applications on the Keras and PyTorch websites.

Best Deep Learning Resources

  1. D2L
  2. PyTorch Tutorials
  3. TensorFlow Tutorials
  4. Keras Examples

Best Deep Learning Courses

  1. Deep Learning Specialization – Coursera
  2. NYU Deep Learning

Deep Learning Books

  1. Deep Learning with Python, 2nd Ed
  2. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition
  3. Deep Learning Book by I. Goodfellow, Y. Bengio and A. Courville

Recommended Learning Plan

  1. Coursera – Deep Learning Specialization

STEP 6. Learn MLOps and Big Data

(Time Required: 30 days-45 days)

Why learn MLOps?

Traditionally, statisticians would share the results of their models as a dataset (in Excel or stored in a database) or create a presentation to share their findings.

But now, Data Scientists deploy the ML models. Typically, the deployment is a dashboard or a REST API that takes an input (ex: voice) and gives back an output (ex: text).
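Here is a minimal sketch of such a REST API using Flask, assuming Flask is installed. The `/predict` route, the feature names `x1` and `x2`, and the `predict` function with its made-up linear formula are all hypothetical stand-ins for a real trained model.

```python
from flask import Flask, jsonify, request

app = Flask("ml_api")


def predict(features):
    """Stand-in for a trained model: a made-up linear score."""
    return 0.5 * features["x1"] + 2.0 * features["x2"]


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    # Read the JSON input, run the model, and send the output back as JSON.
    payload = request.get_json()
    return jsonify({"prediction": predict(payload)})


# Exercise the endpoint in-process with Flask's test client (no server needed).
client = app.test_client()
response = client.post("/predict", json={"x1": 2.0, "x2": 1.0})
print(response.get_json())
```

In a real deployment, the model would be loaded once at startup and the app would run behind a production server rather than the test client.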

Once models are deployed, their performance should be monitored and models refreshed periodically.

MLOps deals with practices that enable you to deploy, monitor and refresh ML models in production in a reliable way.

This requires DevOps and ML Engineers to work together. This area is still evolving and companies (like AWS, Google Cloud, MS Azure, Databricks etc) strive to do the heavy lifting to enable Data Scientists to take their models to production with minimal effort.

What to learn?

  1. Framework to take an ML model to a REST API. Ex: Python: Flask, Django, FastAPI. R: Plumber
  2. Docker: Solving for dependencies and reproducibility.
  3. Sagemaker: Flagship ML development, deployment and monitoring platform from AWS.
  4. Kubernetes / KubeFlow: Kubernetes enables automated deployment and scaling. KubeFlow makes deploying ML workflows on Kubernetes simple.
  5. Seldon Core: Deploy ML models quickly on Kubernetes at scale.
  6. CI/CD: Continuous Integration and Continuous Delivery of ML solutions. GitHub Actions, Jenkins, CircleCI, GitLab, Travis CI are popular tools.
  7. ML Orchestration: Create, schedule and monitor workflows as an end-to-end pipeline. Popular tools are Airflow, Dagster, Prefect and Luigi. ML focused pipeline tools include KubeFlow Pipelines, Metaflow and Flyte.

Certifications (Optional)

AWS and Google provide training and certifications.

AWS

  1. AWS – Machine Learning Specialty
  2. AWS – Certified Developer Associate

Google Cloud

  1. Google Cloud – Professional Machine Learning Engineer

What else can be helpful?

Learn distributed computing with PySpark. Several companies use PySpark for Extract, Transform, Load (ETL) operations as well as for building ML models and applications on big datasets.

Databricks and Cloudera are leading providers. PySpark, the Python API for Apache Spark, is the package you need to install to get started.

MLFlow: A popular platform to manage the ML life cycle. Language and library agnostic. Open source.

Best ML Ops Courses

Recommended Learning Plan – MLOps

  1. Deploying ML in AWS EC2 – ML+ (Paid)
  2. Docker Course – Mosh
  3. AWS Sagemaker course – ML+ (Paid)
  4. AWS Lambda Course – ML+ (Paid)
  5. KubeFlow Course – Google
  6. Pyspark Tutorial

Topics such as CI/CD pipelines and orchestration frameworks can change from company to company and are easy to pick up on the job.

STEP 7. Do Portfolio Projects

(Time Required: 21 days+)

Why build ML projects?

The projects that you showcase in your resume speak volumes, more than certifications ever will.

But what type of projects to build?

The kind that companies actually execute, that is, Data science projects that help generate revenue or save costs.

What to build?

It’s ok to start with educational projects such as Kaggle’s Titanic competition or the Housing Price prediction competition. There are more such competitions for knowledge and practice.

Then solve projects that companies are interested in, like, estimating customer lifetime value, forecasting demand of products, predicting churning customers, simulating optimal marketing budget spend, etc.

What else can be helpful?

Solve projects end-to-end with well written code. Public datasets that represent real world data are good enough. Host your code in GitHub repositories, maintain your repos well and add the GitHub link to your resume.

Best Projects Resources

  1. Kaggle Competitions and associated kernel notebooks.
  2. ML+ Industry DS Project Courses

Recommended Learning Plan – DS Portfolio Projects

  1. Kaggle’s Titanic competition
  2. Estimating CLTV – ML+
  3. Restaurant Visitor Forecasting
  4. Market Mix Modeling

If you are interested in more projects, check out Kaggle competitions, get practice, and host your code on GitHub.

Do You Want to Learn More About Data Science?

Data Science and ML have become integral to most companies, both IT and product companies. Machine Learning Plus has everything you need to make your Data Science Roadmap journey easier and achievable.

The Machine Learning Plus Complete ML Mastery Courses feature the ideal learning path if you want to succeed in a career in Data Science. If you are struggling to understand ‘tough’ ML and Stats concepts, at ML+ you will get clear, complete and straightforward explanations.

Complete ML Mastery covers vital data science topics, starting with the Foundations of ML course, followed by Python programming, R programming, machine learning and deep learning, and ends with building portfolio machine learning projects that will give you exposure and build your business acumen. You can showcase these projects in your resume and talk about them to win interviews.

According to Glassdoor, Data scientists earn an annual average of $123,250. The world needs more data scientists. Companies offer attractive incentives and a stable, secure and fulfilling career path. If this sounds like your kind of profession, take those first few steps towards a new career. Start the ML Mastery Self Learning Path today!

Course Preview

Machine Learning A-Z™: Hands-On Python & R In Data Science

Free Sample Videos:

Machine Learning A-Z™: Hands-On Python & R In Data Science
