Today, I discuss the Data Science Roadmap, the missing guide to self-studying machine learning in about 6 months. I’ll cover exactly what you need to know and do in order to self-study Data Science / ML / AI / Stats. For each topic, I will explain why you need to learn it, point you to some of the best resources, and recommend a self-study plan.
(Time Required: 30 days-45 days)
The main programming languages for Data Science:
Which one to choose first?
Between Python and R, Python is more popular and widely adopted. R is also popular in certain countries and domains. If you have to choose between the two to start off, pick Python.
It is a good idea to learn R as well, because it has a great ecosystem of ML/stats packages and I’ve seen several companies using both R and Python.
When should you learn R?
You can learn R after you’re done with Python. However, if your company / domain mainly uses R, just go with R.
Data Scientists are hands-on folks. Ultimately, you will need to do various analyses with data and build ML models.
Why Python?
Python is the default language for ML these days. Much of the new research and development is done in Python. With mature libraries for ML, stats, deep learning etc., it’s easy to find solutions quickly.
Why R?
R was the most popular language before Python took over. It is built for data analysis, statistical modeling and ML. Data Scientists use R for its rich collection of packages in stats and ML.
Why SQL?
To collect and transform data from various databases. Most databases speak SQL. People write SQL queries to collect and transform data from various sources and assemble it into a master dataset to be used for data analysis and ML.
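Here is a minimal sketch of that idea, using Python’s built-in `sqlite3` module so it runs without a database server. The `customers` and `orders` tables and their rows are made up for illustration; the query joins and aggregates them into one master table with a row per customer.

```python
import sqlite3

# In-memory database with two hypothetical source tables (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'IN'), (2, 'US'), (3, 'US');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 15.0), (12, 2, 40.0);
""")

# Collect and transform: one row per customer with order count and total spend.
# LEFT JOIN keeps customers that have no orders yet.
master_query = """
    SELECT c.customer_id,
           c.country,
           COUNT(o.order_id)          AS n_orders,
           COALESCE(SUM(o.amount), 0) AS total_spend
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.country
    ORDER BY c.customer_id
"""
master_rows = conn.execute(master_query).fetchall()

for row in master_rows:
    print(row)
```

The same `SELECT ... JOIN ... GROUP BY` pattern carries over to most SQL databases, though details such as data types differ between them.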
Note: Another growing language worth mentioning is ‘Julia’. It has a great speed advantage and an expressive syntax that borrows the good parts of both R and Python. However, the Julia community is still small and maturing.
So, if you are getting started, I’d recommend starting with Python. Then learn SQL. Then learn R. Optionally, you can look at Julia and do cool stuff.
What else to learn to go further?
You need to learn Git as well.
Most companies version control their projects with Git, using popular services such as GitHub, BitBucket or GitLab.
I recommend opening a GitHub account to maintain and showcase your ML project work. Once you have done a few good projects (I’ll talk about this in the Projects section below), you can add the GitHub link to your resume. It will increase your chances of getting shortlisted for interviews if people can see that you code well.
The Python Course – ML+ (Paid) – I teach this course at ML+ and simplify things as always. It is organized and taught using Jupyter Notebook. Ideal for Data Science / ML practitioners. Use code ‘PYTHONFREE21’ at checkout.
Learn Python – Kaggle – Learn by practice with succinct lessons. Not complete though.
Telusko Python – YouTube – Another nice YouTube course to learn Python.
Besides these, there are competitive coding websites like LeetCode, Codewars and HackerRank that offer a large number of programming questions. I’d not recommend these to beginners at this stage, simply because our goal right now is to become sufficiently good in Python to do Data Science/ML.
Step 1: The Python course – ML+ (Paid). Use coupon code ‘PYTHONFREE21’ at checkout.
Step 2: Revise and Practice Python in Kaggle. Stretch: Solve 15 easy to medium exercises here.
Step 3: SQL Course with Mosh
Step 4: SQL Questions – Hackerrank – Practice 15 questions.
Optional: If your job requires a lot of SQL proficiency, take these lessons as well. If you are not sure, then skip this.
Step 5: Git Course
Step 6: Git Practice
(Time Required: 30 days-45 days)
In Data Science and machine learning, you will primarily work with data, both structured (tabular data) and unstructured (text, images etc.). The first step is to gain mastery over wrangling tabular data. This is important.
You should be able to transform, create, merge, split and aggregate datasets; basically, be comfortable doing anything with them. You should also be able to create various types of plots and visualizations.
If you’ve chosen Python as your programming language, learn the NumPy and Pandas libraries well. These are the two main libraries for data wrangling, especially Pandas.
Why Pandas?
It is the most popular package for data wrangling in Python, thoughtfully designed and optimized for performance by Wes McKinney and team.
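To give a feel for what wrangling with Pandas looks like, here is a small sketch of the merge, create, aggregate and split operations mentioned above. The two tables and their column names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Two small hypothetical tables (illustrative data only).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "segment": ["retail", "retail", "corporate"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [25.0, 15.0, 40.0, 100.0],
})

# Merge: attach customer attributes to every order.
merged = orders.merge(customers, on="customer_id", how="left")

# Create: derive a new column from an existing one.
merged["is_large_order"] = merged["amount"] >= 40

# Aggregate: total and average order amount per segment.
summary = (
    merged.groupby("segment")["amount"]
    .agg(total="sum", average="mean")
    .reset_index()
)

# Split: keep one DataFrame per segment.
by_segment = {name: group for name, group in merged.groupby("segment")}

print(summary)
```

Once these few verbs feel natural, most day-to-day data preparation becomes a matter of chaining them in the right order.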
Learn Dask, which is built for parallel computing in data analysis and ML. It enables you to work with large datasets and provides tools to make code faster.
Optionally, learn the Modin and Vaex packages. They are Pandas alternatives that can make your data wrangling code run faster.
(Time Required: 21 days)
Probability and statistics are foundational concepts for machine learning. Statistical methods are often used in tandem with machine learning.
For example: (i) Statistical tests can help decide which is the better approach for customers asking for refunds: providing a full refund vs. giving cashback credits. (ii) They can tell you whether placing a specific feature on a website improved how long customers linger.
To understand the math behind Bayesian algorithms, basic knowledge of probability is required.
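As a small illustration of the kind of probability involved, here is Bayes’ theorem applied to a toy spam filter. The numbers are hypothetical: I assume 1% of emails are spam, and that a certain keyword appears in 90% of spam and 5% of non-spam emails.

```python
def posterior(prior, likelihood, false_positive_rate):
    """P(A | B) via Bayes' theorem, for a binary event A and evidence B."""
    # Total probability of the evidence:
    # P(B) = P(B | A) * P(A) + P(B | not A) * P(not A)
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: P(spam) = 0.01, P(keyword | spam) = 0.90,
# P(keyword | not spam) = 0.05.
p_spam_given_keyword = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
print(round(p_spam_given_keyword, 3))
```

Even with a keyword that shows up in most spam, the probability that an email is spam stays well below one half here, because spam is rare to begin with. This interplay between the prior and the evidence is what Bayesian algorithms build on.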
You need to learn various statistical concepts such as standard deviation, covariance, correlation etc. You also need to learn how statistical significance tests work and how to interpret them, especially P-values, null and alternate hypotheses, test statistics, degrees of freedom etc.
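To make the idea of a significance test concrete, here is a sketch of a two-sided, two-proportion z-test written with only the standard library. The scenario and counts are made up: 200 of 2,000 visitors converted on the old page and 260 of 2,000 on the new one. The null hypothesis is that both pages have the same conversion rate.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for H0: both groups share the same success rate."""
    rate_a, rate_b = success_a / n_a, success_b / n_b
    # Under H0 the best estimate of the common rate pools both groups.
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_error  # the test statistic
    # P-value: probability of a |z| at least this large if H0 were true.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical A/B test counts (illustrative only).
z_stat, p_value = two_proportion_z_test(200, 2000, 260, 2000)
alpha = 0.05  # significance level chosen before looking at the data
reject_null = p_value < alpha
print(f"z = {z_stat:.2f}, p-value = {p_value:.4f}, reject H0: {reject_null}")
```

In practice you would usually call a ready-made test from SciPy or Statsmodels, but writing one out once makes the P-value much easier to interpret.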
Here is a list of essential probability and stats:
Sample vs Population; Measures of Central tendency and Dispersion; Law of large numbers; Central Limit Theorem; Joint and conditional probability; Bayes Theorem; Frequency Distributions; Normal distribution; Probability Density Function; Probability Mass function; Hypothesis testing; Type 1 and Type 2 error; P-values; Confidence interval calculation; T-Test; Chi-Squared Test; Mann-Whitney U Test; ANOVA; Standard Error
Python libraries: Statsmodels, SciPy
(Time Required: 30 days-45 days)
Once you’ve pieced the data together, you need to:
1. Preprocess Data
Involves handling missing values, identifying outliers and processing them if needed, running sanity checks on the various fields in the data, creating new features if appropriate, considering variable transformations etc.
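Here is a minimal sketch of two of these steps, missing value handling and outlier treatment, on a single made-up column. Median imputation and the 1.5 × IQR rule are just one common choice among several, and I use plain Python here so the logic stays visible; with real data you would typically do the same in Pandas.

```python
from statistics import median, quantiles

# Hypothetical column of order amounts; None marks a missing value.
amounts = [12.0, 15.0, None, 14.0, 13.0, 250.0, 16.0, None, 11.0]

# 1. Handle missing values: impute with the median of the observed values.
observed = [x for x in amounts if x is not None]
fill_value = median(observed)
imputed = [fill_value if x is None else x for x in amounts]

# 2. Identify outliers with the 1.5 * IQR rule.
q1, _, q3 = quantiles(imputed, n=4)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = [not (lower <= x <= upper) for x in imputed]

# 3. Process outliers if needed: here they are capped at the fences.
capped = [min(max(x, lower), upper) for x in imputed]

print("outliers:", [x for x, flag in zip(imputed, is_outlier) if flag])
```

Whether to cap, drop or keep an outlier depends on the business context, so treat the capping step as a placeholder for that decision.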
2. Perform Exploratory Data Analysis (EDA)
Involves performing various data analyses to summarize the data and the relationships that exist between variables. You may have to conduct statistical tests, and bring out statistics that will be of interest.
3. Come up with insights and build the story (that the data narrates).
Have conversations or even working sessions with your stakeholders. It is good to be on the same page with your SME (subject matter expert) partner before sharing the findings with a larger audience.
It also helps to be able to build dashboards quickly with tools like Tableau / Power BI / Jupyter / Streamlit / Dash. Streamlit and Dash have the advantage of being flexible enough to use ML model outputs in the backend.
(Time Required: 60 days)
Learn the fundamental concepts behind machine learning. For example, concepts like cross validation, bias-variance tradeoff, hyperparameter tuning, overfitting etc. are needed to implement ML.
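Cross validation is a good example of a concept that becomes clearer once you see the mechanics. The sketch below only builds the train/validation index splits for k-fold cross validation; the function name is my own, and it deliberately skips shuffling, which you would normally add for real data.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

Every sample lands in the validation set exactly once, so averaging the k validation scores gives an estimate of model performance that depends less on one lucky or unlucky split.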
Things you should know:
You can learn the math behind ML before or after learning to implement it. Whatever works best for you. Just don’t make it an excuse to put off getting hands-on for too long.
Generative vs Discriminative models; Cross Validation; OOB error; Bias-Variance Tradeoff; R-Squared; Adjusted R-Squared; RSS; TSS; Forward Selection; Backward Selection; Box-Cox Transforms; Entropy; Information Gain; Gini Index; Binary Cross Entropy; Variance Inflation Factor; Assumptions of Linear regression; Heteroscedasticity; Sigmoid function; Softmax; Cost Function; ROC Curve; Precision-Recall; Confusion Matrix and related concepts; Imbalanced Classification; Effect of outliers; Missing value treatment approaches; Expectation Maximization Algorithm; Maximum Likelihood Estimation.
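Several of the classification concepts in this list, such as the confusion matrix, precision and recall, are simple counts once you write them down. Here is a sketch for a binary classifier, with made-up labels and predictions (1 is the positive class).

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # misses
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly ignored
    return tp, fp, fn, tn

# Hypothetical ground truth and model predictions (illustrative only).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision = tp / (tp + fp)  # of everything flagged positive, how much was right
recall = tp / (tp + fn)     # of all actual positives, how much was found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

On imbalanced data, these two numbers usually tell you far more than plain accuracy does.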
Linear Regression; Logistic Regression; Gradient Descent; Decision Trees; Gradient Boosting; AdaBoost; Random Forest; SVM; k-Nearest Neighbors; k-Means Clustering; Hierarchical Clustering; Principal Components Analysis; Regularization; Ridge and LASSO regression; Naive Bayes; XGBoost
Note: Know the key hyperparameters of the various algorithms and their purpose.
Stretch: Variations of decision trees such as ID3, C4.5, CART and CHAID; other models such as MARS, CRF, HMM and Gaussian Mixture Models; boosting libraries such as CatBoost and LightGBM.
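Gradient descent from the list above is a good first algorithm to implement yourself, since so many others build on it. Here is a minimal sketch that fits a straight line by minimizing mean squared error. The data, learning rate and step count are my own illustrative choices, not tuned values.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = w * x + b by minimizing mean squared error with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Prediction error for every point under the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Gradients of the mean squared error with respect to w and b.
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Noise-free toy data generated from y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

w, b = fit_line_gradient_descent(xs, ys)
print(f"w = {w:.4f}, b = {b:.4f}")
```

The learning rate is a hyperparameter: too large and the updates diverge, too small and training crawls. Try changing it to see both failure modes.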
Here are 5 solid reasons for you to learn the internals of how ML algorithms work.
You need to be quite familiar with using the scikit-learn package for building ML models. In addition, pick up the statsmodels package as well.
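To show what a typical scikit-learn workflow looks like, here is a short sketch that scales the features, fits a logistic regression and scores it with 5-fold cross validation. The Iris dataset and the model choice are only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset, used here purely as an example.
X, y = load_iris(return_X_y=True)

# A pipeline keeps preprocessing inside each cross validation fold,
# so the scaler never sees the validation data during fitting.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Accuracy on each of the 5 validation folds.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean accuracy:", scores.mean().round(3))
```

Most scikit-learn estimators share this same `fit` / `predict` interface, so swapping in a different algorithm usually means changing one line.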
It would also be great to know how to use packages like ‘h2o’. You can use it with both R and Python.
Another important domain is Time Series forecasting, which deals primarily with forecasting the future values of time series data.
Later on, once you’ve completed the learning path, you can explore advanced modeling such as probabilistic modeling with packages such as PyMC3, Pyro, etc. Sometimes companies open source their internal ML projects for a wider audience, and some of these are very good. For example: Facebook released Prophet for time series forecasting.
For topics 5 and 6, check out the corresponding lectures from Prof. Andrew Ng’s course.
(Time Required: 60 days)
Deep learning is a special branch of machine learning, based on innovative architectures of one type of ML algorithm called neural networks. A neural network is composed of input, hidden and output layers of neurons. When the number of hidden layers increases, the network becomes ‘deep’, hence the name deep learning.
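Here is a tiny sketch of that layered structure: a forward pass through a network with 2 inputs, 3 hidden neurons and 1 output neuron. The weights are hand-picked, hypothetical values (a real network learns them during training), and I use the sigmoid activation throughout for simplicity.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One fully connected layer: weighted sum plus bias, then sigmoid, per neuron."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]

# Hypothetical, hand-picked parameters: 2 inputs -> 3 hidden neurons -> 1 output.
hidden_weights = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[0.7, -0.5, 0.2]]
output_biases = [0.05]

inputs = [1.0, 2.0]                                           # input layer
hidden = dense_layer(inputs, hidden_weights, hidden_biases)   # hidden layer
output = dense_layer(hidden, output_weights, output_biases)   # output layer

print("hidden activations:", [round(h, 3) for h in hidden])
print("network output:", round(output[0], 3))
```

Stacking more `dense_layer` calls between the input and the output is, conceptually, all it takes to make the network deeper.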
With innovative architectures, deep learning has been used to solve a variety of real world problems such as speech translation, face detection, object tracking, self driving, winning computer games, text generation, creating pictures from text etc.
Companies use deep learning for various such use cases.
Deep Learning concepts such as: Chain Rule and Backpropagation; Activation functions; L1 and L2 regularization; Dropout layer; Early stopping; Optimizers like RMSProp, Adam and AdaGrad; Batch Normalization; Pooling; Transfer Learning
Frameworks: TensorFlow, PyTorch or both.
Once you have learnt the overall concepts, pick one area like NLP or Computer Vision and dive deeper.
For NLP, libraries like spaCy, Hugging Face and GluonNLP are the main ones.
For Computer Vision: Start with OpenCV, and check out the various applications on the Keras and PyTorch websites.
(Time Required: 30 days-45 days)
Traditionally, statisticians would share the results of their models as a dataset (in Excel or stored in a database) or create a presentation to share their findings.
But now, Data Scientists deploy the ML models. Typically, it’s a dashboard or a REST API that takes an input (e.g. voice) and gives back an output (e.g. text).
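To make the REST API idea concrete, here is a bare-bones sketch using only Python’s standard library. Everything in it is a stand-in: `predict` is a hypothetical linear score instead of a real trained model, and the input is a list of numbers instead of voice. Production deployments typically use a web framework and a proper model server.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in for a trained model: a hypothetical linear score, for illustration only."""
    weights = [0.4, 0.6]
    return sum(w * x for w, x in zip(weights, features))

def handle_predict(body):
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)["features"]
        payload = {"prediction": predict(features)}
        return 200, json.dumps(payload).encode()
    except (ValueError, KeyError, TypeError):
        error = {"error": "body must be JSON with a 'features' list"}
        return 400, json.dumps(error).encode()

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the request body, run the model, send the result back as JSON.
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

def serve(port=8000):
    """Start the API on localhost; blocks until interrupted."""
    HTTPServer(("127.0.0.1", port), PredictHandler).serve_forever()

# Calling the request logic directly, without starting a server:
status, payload = handle_predict(b'{"features": [1.0, 2.0]}')
print(status, payload.decode())
```

Keeping the request logic in a plain function like `handle_predict` makes it easy to test the model’s behavior separately from the web server around it.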
Once models are deployed, their performance should be monitored and models refreshed periodically.
MLOps deals with practices that enable you to deploy, monitor and refresh ML models in production in a reliable way.
This involves DevOps and ML Engineers working together. This area is still evolving and companies (like AWS, Google Cloud, MS Azure, Databricks etc.) strive to do the heavy lifting to enable Data Scientists to take their models to production with minimal effort.
Certifications (Optional)
AWS and Google provide training and certifications.
AWS
Google Cloud
What else can be helpful?
Learn distributed computing with PySpark. Several companies use PySpark for Extract, Transform, Load (ETL) operations as well as for building ML applications on big datasets.
Databricks and Cloudera are leading providers. PySpark, the Python API for Apache Spark, is the package you need to install to start using it.
MLflow: A popular platform to manage the ML life cycle. Language and library agnostic. Open source.
Topics such as CI/CD pipelines and orchestration frameworks can change from company to company and are easy to pick up on the job.
(Time Required: 21 days+)
The projects that you showcase in your resume speak volumes. More than certifications ever will.
But what type of projects to build?
The kind that companies actually execute, that is, Data science projects that help generate revenue or save costs.
It’s ok to start with educational projects such as Kaggle’s Titanic competition or the Housing Price prediction competition. There are more such competitions for knowledge and practice.
Then solve projects that companies are interested in, like, estimating customer lifetime value, forecasting demand of products, predicting churning customers, simulating optimal marketing budget spend, etc.
What else can be helpful?
Solve projects end-to-end with well written code. Public datasets that represent real world data are good enough. Host your code in GitHub repositories, maintain your repos well and add the GitHub link to your resume.
If you are interested in more projects, check out Kaggle competitions to get practice.
Data Science and ML have become integral to most companies, both IT and product companies. Machine Learning Plus has everything you need to make your Data Science Roadmap journey easier and achievable.
The Machine Learning Plus Complete ML Mastery courses feature the ideal learning path if you want to succeed in a career in Data Science. If you are struggling with understanding ‘tough’ ML and Stats concepts, at ML+ you will get clear, complete and straightforward explanations.
Complete ML Mastery covers vital data science topics, starting with the Foundations of ML course and continuing through Python programming, R programming, machine learning and deep learning. You also build portfolio machine learning projects, which will give you exposure and build your business acumen. You can showcase these projects in your resume and talk about them to win interviews.
According to Glassdoor, Data Scientists earn an annual average of $123,250. The world needs more data scientists. Companies offer attractive incentives and a stable, secure and fulfilling career path. If this sounds like your kind of profession, take those first few steps towards a new career. Start the ML Mastery Self Learning Path today!