
# Machine Learning

## KL Divergence – What is it and mathematical details explained

At its core, KL (Kullback-Leibler) Divergence is a statistical measure that quantifies the dissimilarity between two probability distributions. Think of it like a mathematical ruler that tells us the “distance” or difference between two probability distributions. Remember, in data science, we’re often working with probabilities – the chances of events happening. So, if we have …
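As a rough illustration of the idea, the sketch below computes the KL divergence between two small discrete distributions in plain Python. The function name `kl_divergence` and the example probabilities are assumptions made for this sketch, and it assumes `q` is non-zero wherever `p` is non-zero.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions given as lists.

    Assumes both lists sum to 1 and q[i] > 0 wherever p[i] > 0.
    Terms with p[i] == 0 contribute nothing and are skipped.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative distributions (not taken from the article)
p = [0.5, 0.5]
q = [0.9, 0.1]

print(kl_divergence(p, q))  # divergence of Q from P
print(kl_divergence(q, p))  # swapping the arguments gives a different value
print(kl_divergence(p, p))  # identical distributions give 0.0
```

The two directions give different values, which is why the word "distance" is best read loosely here: the measure is not symmetric.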

## Cook’s Distance for Detecting Influential Observations

Cook’s distance measures the influence that each observation exerts on the trained model. It is computed by building a regression model and is therefore affected only by the X variables included in the model. What is Cook’s Distance? Cook’s distance measures the influence exerted by each data point (row / …
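A minimal sketch of the calculation, assuming a simple linear regression with a single predictor and an intercept (so the number of fitted parameters is 2). The function name `cooks_distance` and the data are hypothetical; the formula used is the standard one based on residuals and leverages.

```python
def cooks_distance(x, y):
    """Cook's distance for each point of a one-predictor least-squares fit."""
    n = len(x)
    p = 2  # fitted parameters: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)

    # Ordinary least-squares estimates
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)

    # Leverage of each point (diagonal of the hat matrix)
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    return [
        (e * e / (p * mse)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverages)
    ]

# Illustrative data: the last point sits far from the trend of the others
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 30.0]
distances = cooks_distance(x, y)
print(distances.index(max(distances)))  # index of the most influential point
```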

## How to detect outliers with z-score

The Z-score, also called the standard score, is used to scale the features in a dataset for machine learning model training. It can also be used to detect outliers. In this post, we will first see how to compute Z-scores and then use them to detect outliers. How is Z-score used in machine learning? Now, …
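The idea can be sketched in a few lines of plain Python: compute each value's Z-score and flag the values whose absolute Z-score exceeds a cutoff. The function name `zscore_outliers`, the sample data, and the cutoff of 2.5 are assumptions for illustration; a cutoff of 3 is a common convention, but in very small samples no point can reach it.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical, so nothing can be flagged
    return [v for v in values if abs((v - mean) / std) > threshold]

# Illustrative data with one obviously extreme value
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 95]
print(zscore_outliers(data, threshold=2.5))
```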

## How to detect outliers using Z score?

The Z-score is one of the most important concepts in statistics. It is also called the standard score. Typically, it is used to scale features for machine learning, but it can also be used to detect outliers. Also Read: How to detect outliers with IQR and Box Plots How is Z-score used in machine learning? Now, …

## PySpark OneHot Encoding – Mastering OneHot Encoding in PySpark and Unleash the Power of Categorical Data in Machine Learning

Let’s dive deep into OneHot Encoding in PySpark, exploring its benefits in machine learning and walking you through a practical example with code. As machine learning continues to gain traction in the world of data science, the need for efficient data preprocessing has never been more crucial. One such preprocessing technique is OneHot Encoding, which allows …

## PySpark Statistics Deciles and Quartiles – Understanding Deciles and Quartiles a Deep Dive with PySpark

Let’s dive into the concept of deciles and quartiles and how to calculate them in PySpark. When analyzing data, it’s important to understand the distribution of the data. One way to do this is by calculating the deciles and quartiles. What are Deciles? Deciles divide a set of data into 10 equal parts. For example, …
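To make the definitions concrete, here is a small sketch in plain Python rather than PySpark, using the standard library's `statistics.quantiles`. The data (the integers 1 to 100) is an assumption for illustration. Dividing data into 4 parts needs 3 cut points, and dividing it into 10 parts needs 9.

```python
import statistics

# Illustrative data: the integers 1 through 100
data = list(range(1, 101))

# n=4 gives the three quartile cut points; n=10 gives the nine decile cut points
quartiles = statistics.quantiles(data, n=4)
deciles = statistics.quantiles(data, n=10)

print(len(quartiles), quartiles)
print(len(deciles), deciles)
```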

## PySpark Statistics Mode – Calculating the Mode in PySpark a Comprehensive Guide for Everyone

Let’s explore different ways of calculating the Mode using PySpark, helping you become an expert. Mode is the value that appears most frequently in a dataset. It is a measure of central tendency, similar to mean and median, but focuses on the most common value(s) in the data. Mode can be applied to both numerical …
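As a quick illustration of the definition, outside PySpark, the standard library's `statistics.multimode` returns every value tied for the highest frequency. The sample values are made up for this sketch.

```python
import statistics

# Illustrative data: 2 and 3 both appear twice, more often than any other value
values = [1, 2, 2, 3, 3]
modes = statistics.multimode(values)
print(modes)  # → [2, 3]

# The same function works for non-numerical data
colors = ["red", "blue", "red", "green"]
print(statistics.multimode(colors))  # → ['red']
```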

## PySpark Statistics Median – Calculating the Median in PySpark a Comprehensive Guide for Everyone

Let’s explore different ways of calculating the Median using PySpark, helping you become an expert. As data continues to grow exponentially, efficient data processing becomes critical for extracting meaningful insights. PySpark, the Python API for Apache Spark, enables large-scale data processing in Python. How to Calculate Median? The median is a measure of central tendency that represents …
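For intuition, a minimal plain-Python sketch of the median: sort the values and take the middle one, or the average of the two middle ones when the count is even. The helper name `median` and the sample values are assumptions; this is not the PySpark approach itself.

```python
def median(values):
    """Median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: mean of the two middle values

print(median([7, 1, 5]))     # → 5
print(median([7, 1, 5, 3]))  # → 4.0
```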

## PySpark Mllib K-Means Clustering – Mastering K-means Clustering with PySpark MLlib and Example Code

Let’s explore K-means clustering using PySpark’s MLlib library in-depth. PySpark is an open-source Python library that facilitates distributed data processing and offers a simple way to run machine learning algorithms on large-scale data. K-means clustering is a widely-used unsupervised machine learning algorithm that partitions a dataset into K distinct clusters based on the features of …

## PySpark Gradient Boosting model – Building and Evaluating Gradient Boosting model using PySpark MLlib: A Step-By-Step Guide

Let’s discuss how to build and evaluate a Gradient Boosting model using PySpark MLlib and cover key aspects such as hyperparameter tuning and variable selection, providing example code to help you along the way. Gradient Boosting is a powerful machine learning technique that combines multiple weak learners to create a strong predictor. PySpark MLlib is a …

## PySpark Random Forest – Building and Evaluating Random Forest Models using PySpark MLlib: A Step-By-Step Guide

Let’s discuss how to build and evaluate Random Forest models using PySpark MLlib and cover key aspects such as hyperparameter tuning and variable selection, providing example code to help you along the way. Random Forest is an ensemble machine learning algorithm that can be used for both classification and regression tasks. PySpark is the Python …

## PySpark Lasso Regression – Building, Tuning, and Evaluating Lasso Regression with PySpark MLlib

Let’s explore how to build, tune, and evaluate a Lasso Regression model using PySpark MLlib, a powerful library for machine learning and data processing in Apache Spark. Lasso regression is a popular machine learning algorithm that helps to identify the most important features in a dataset, allowing for more effective model building. In this blog …

## PySpark Decision Tree – How to Build and Evaluate Decision Tree Model for Classification using PySpark MLlib

How to build and evaluate a Decision Tree model for classification using PySpark’s MLlib library. Decision Trees are widely used for solving classification problems due to their simplicity, interpretability, and ease of use. PySpark’s MLlib library provides an array of tools and algorithms that make it easier to build, train, and evaluate machine learning models …

## PySpark Linear Regression – How to Build and Evaluate Linear Regression Models using PySpark MLlib

MLlib, the machine learning library within PySpark, offers various tools and functions for machine learning algorithms, including linear regression. In this blog post, you will learn how to build and evaluate a linear regression model using PySpark MLlib with example code. Linear regression is a simple yet powerful machine learning algorithm used to predict a …

## Interpolation in Python – How to interpolate missing data, formula and approaches

Interpolation can be used to impute missing data. Let’s see the formula and how to implement it in Python. But you need to be careful with this technique and make sure you understand whether it is a valid choice for your data. Often, interpolation is applicable when the data is in a sequence or …
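A minimal sketch of linear interpolation for a sequence with gaps, assuming missing values are marked with `None` and the points are evenly spaced. The function name `interpolate_linear` is hypothetical. Each gap is filled along the straight line between the nearest known values on either side; leading or trailing gaps are left untouched because they have no neighbour on one side.

```python
def interpolate_linear(values):
    """Fill interior None gaps by linear interpolation between known neighbours."""
    result = list(values)
    known = [i for i, v in enumerate(result) if v is not None]

    # Walk over each pair of consecutive known positions and fill what lies between
    for left, right in zip(known, known[1:]):
        step = (result[right] - result[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = result[left] + step * (i - left)
    return result

print(interpolate_linear([1.0, None, None, 4.0]))  # → [1.0, 2.0, 3.0, 4.0]
print(interpolate_linear([None, 2.0, None, 6.0]))  # → [None, 2.0, 4.0, 6.0]
```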

## Missing Data Imputation Approaches | How to handle missing values in Python

Machine Learning works on the idea of garbage in – garbage out. If you put in useless junk data to the machine learning algorithm, the results will also be, well, ‘junk’. The quality and consistency of results depend on the data provided. Missing values in data degrade the quality. Why clean the data before training …

## Exploratory Data Analysis (EDA) – How to do EDA for Machine Learning Problems using Python

Exploratory Data Analysis, simply referred to as EDA, is the step where you understand the data in detail. You understand each variable individually by calculating frequency counts, visualizing the distributions, etc. You also examine the relationships between the various combinations of the predictor and response variables by creating scatterplots, correlations, etc. EDA is typically part of every …

## ML Modeling – Problem statement and Data description

ML modeling is the step where machine learning is used to find patterns in data and use that learned knowledge to predict an outcome. The type of ML modeling we are going to solve in this problem is called ‘Churn Modeling’. Let’s first understand the Churn modeling problem statement and then go over the data …

Adaboost is one of the earliest implementations of the boosting algorithm. It forms the base of other boosting algorithms, like gradient boosting and XGBoost. This tutorial will take you through the math behind implementing this algorithm and also a practical example of using the scikit-learn Adaboost API. Contents: What is boosting? What is Adaboost? Algorithm …

## How to formulate machine learning problem

Let’s understand how to define and formulate the machine learning problem (for predictive modeling) from a business problem. This structured approach should help you apply the process to most other types of predictive modeling problems at work. Introduction Often in ML teams, you will hear from the business/company departments about the problems and issues they …
