Interpolation can be used to impute missing data. Let’s see the formula and how to implement it in Python.
But you need to be careful with this technique and check whether it is a valid choice for your data. Interpolation is usually applicable when the data forms a sequence or a series, so that neighboring values tell you something about the missing one.
You should also know that multiple interpolation methods are available; the default is the linear method.
When to use interpolation for imputing missing data?
You can use interpolation when there is an order or a sequence and you want to estimate a missing value in the sequence. For example, say there are various classes of tickets in train travel, such as first class, second class, and so on. You would naturally expect a higher-class ticket to cost more than a lower-class one.
In that case, if the ticket price of an intermediate class is missing, you can use interpolation to estimate the missing value.
When not to use interpolation?
If there is no association between the order of the classes and the ticket fares, that is, if first class is not necessarily more expensive than second class, then interpolation might not be appropriate.
Let’s see this with an example.
import numpy as np
import pandas as pd
# class and ticket prices; the second_class fare is missing
fare = {'first_class': 100,
        'second_class': np.nan,
        'third_class': 60,
        'open_class': 20}
Convert it to a pandas series object to make interpolation convenient.
# store as pandas series
ser = pd.Series(fare)
ser
first_class 100.0
second_class NaN
third_class 60.0
open_class 20.0
dtype: float64
Now you can use ser.interpolate() to estimate the missing value. By default, ser.interpolate() will do a linear interpolation.
Important caveat before you apply interpolation
By default, linear interpolation in pandas ignores the index labels and treats the rows as equally spaced, so in effect the row position (0, 1, 2, ...) is the X and the column you want to interpolate is the Y. So, you need to make sure the rows are sorted in the intended order for this to work.
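To illustrate why row order matters, here is a small sketch that interpolates the same fares twice: once in the natural class order, and once with the last two rows swapped. The swapped ordering is a made-up scenario for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd

# Same fares, listed in the natural class order.
ordered = pd.Series({'first_class': 100, 'second_class': np.nan,
                     'third_class': 60, 'open_class': 20})

# Same fares, but with the last two rows swapped (hypothetical bad ordering).
shuffled = pd.Series({'first_class': 100, 'second_class': np.nan,
                      'open_class': 20, 'third_class': 60})

# The default linear method only looks at the neighboring rows by position.
ordered_fill = ordered.interpolate()['second_class']    # halfway between 100 and 60
shuffled_fill = shuffled.interpolate()['second_class']  # halfway between 100 and 20

print(ordered_fill, shuffled_fill)
```

The labels are identical in both series, yet the estimate changes because only the position of the neighboring rows is used.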
Given two known points (x1, y1) and (x2, y2), when ‘x’ is known, you can compute the value of ‘y’ using the following formula for linear interpolation: y = y1 + (x - x1) * (y2 - y1) / (x2 - x1).
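As a minimal sketch, linear interpolation can be written as a plain Python function. The function name linear_interpolate and its argument names are chosen here for illustration. The two known points are the fares at row positions 0 and 2 from the example series, and the missing fare sits at position 1.

```python
def linear_interpolate(x, x1, y1, x2, y2):
    """Estimate y at x on the straight line through (x1, y1) and (x2, y2)."""
    return y1 + (x - x1) * (y2 - y1) / (x2 - x1)

# first_class is at position 0 (fare 100), third_class at position 2 (fare 60).
# second_class is missing at position 1.
second_class = linear_interpolate(1, 0, 100, 2, 60)
print(second_class)  # 80.0
```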
Interpolation is not limited to a straight line between two points. With more known points you can fit a higher-degree polynomial through all of them, which is given by Lagrange’s interpolation polynomial.
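For reference, Lagrange's interpolation polynomial through a handful of known points can be sketched in plain Python as below. The helper name lagrange_interpolate is an assumption made for this illustration, and the three known points are the non-missing fares from the example at row positions 0, 2 and 3.

```python
def lagrange_interpolate(x, xs, ys):
    """Evaluate the Lagrange polynomial through the points (xs[i], ys[i]) at x."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis term: equals yi at xi and 0 at every other known x.
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total

# Known fares at positions 0, 2 and 3; estimate the missing fare at position 1.
estimate = lagrange_interpolate(1, [0, 2, 3], [100, 60, 20])
print(round(estimate, 6))
```

With three known points the polynomial is a parabola, so the estimate differs from the straight-line value.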
Implement linear interpolation
ser.interpolate(method='linear')
first_class 100.0
second_class 80.0
third_class 60.0
open_class 20.0
dtype: float64
linear interpolation may be more suitable if you assume the relationship between x (index) and y (value) to be linear. If not, you might want to try spline and cubicspline interpolation as well.
Spline interpolation
To use spline interpolation you need to make sure the index is reset to start from 0,1,2.. etc. So do a reset_index
first, then do interpolate
.
# order = 2
ser.reset_index(drop=True).interpolate(method='spline', order=2)
0 100.000000
1 86.666667
2 60.000000
3 20.000000
dtype: float64
Cubic spline
# cubic spline
ser.reset_index(drop=True).interpolate(method='cubicspline')
0 100.000000
1 86.666667
2 60.000000
3 20.000000
dtype: float64
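The order-2 spline and the cubic spline give the same estimate here. A likely explanation is that with only three known points, both reduce to the single parabola passing through those points. The sketch below checks that value independently with np.polyfit. It is a cross-check under that assumption, not what pandas runs internally.

```python
import numpy as np

# Known fares at row positions 0, 2 and 3 (position 1 is the missing one).
x_known = np.array([0, 2, 3])
y_known = np.array([100.0, 60.0, 20.0])

# Fit the unique degree-2 polynomial through the three points.
coeffs = np.polyfit(x_known, y_known, deg=2)

# Evaluate it at the missing position.
estimate = np.polyval(coeffs, 1)
print(round(float(estimate), 6))
```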