Missing values are common in real-world datasets and can bias an analysis if they are not handled properly. Linear Regression is a simple and effective technique for missing value imputation: it predicts each missing value from the other available features in the dataset. So, in this article, I'll take you through a step-by-step guide to missing value imputation using Linear Regression.
Missing Value Imputation using Linear Regression
The dataset we will be using is based on real-estate prices, where some house prices are either missing or incorrectly recorded as 0 due to system errors. You can download the dataset from here.
Now, let's get started with using Linear Regression to predict and fill in the missing values based on the other available features in the dataset.
Step 1: Identifying Null Values
First, we will load the dataset:
import pandas as pd
df = pd.read_csv("Real_Estate_with_Missing.csv")
print(df)

               Transaction date  House age  \
0    2012-09-02 16:42:30.519336       13.3
1    2012-09-04 22:52:29.919544       35.5
2    2012-09-05 01:10:52.349449        1.1
3    2012-09-05 13:26:01.189083       22.2
4    2012-09-06 08:29:47.910523        8.5
..                          ...        ...
409  2013-07-25 15:30:36.565239       18.3
410  2013-07-26 17:16:34.019780       11.9
411  2013-07-28 21:47:23.339050        0.0
412  2013-07-29 13:33:29.405317       35.9
413  2013-08-01 09:49:41.506402       12.0

     Distance to the nearest MRT station  Number of convenience stores  \
0                             4082.01500                             8
1                              274.01440                             2
2                             1978.67100                            10
3                             1055.06700                             5
4                              967.40000                             6
..                                   ...                           ...
409                            170.12890                             6
410                            323.69120                             2
411                            451.64190                             8
412                            292.99780                             5
413                             90.45606                             6

      Latitude   Longitude  House price of unit area
0    25.007059  121.561694                  6.488673
1    25.012148  121.546990                 24.970725
2    25.003850  121.528336                 26.694267
3    24.962887  121.482178                 38.091638
4    25.011037  121.479946                 21.654710
..         ...         ...                       ...
409  24.981186  121.486798                 29.096310
410  24.950070  121.483918                 33.871347
411  24.963901  121.543387                 25.255105
412  24.997863  121.558286                 25.285620
413  24.952904  121.526395                 37.580554

[414 rows x 7 columns]
Now, we will identify missing values and check if there are any 0s that should also be considered missing:
print(df.isnull().sum())

print((df == 0).sum())

Since the house price column contains 0s that cannot be genuine prices, we will replace them with NaN:
import numpy as np

df["House price of unit area"] = df["House price of unit area"].replace(0, np.nan)
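As a quick sanity check, you can confirm that every former 0 is now counted as missing. The sketch below uses a small synthetic price column standing in for the article's CSV:

```python
# Self-contained sketch (synthetic values, not the article's dataset):
# replace(0, np.nan) should turn every zero into a missing value,
# so the null count afterwards equals the zero count before.
import numpy as np
import pandas as pd

prices = pd.Series([24.9, 0.0, 37.6, 0.0, 21.7], name="House price of unit area")
n_zeros = int((prices == 0).sum())     # zeros before the replacement
prices = prices.replace(0, np.nan)
print(n_zeros, int(prices.isnull().sum()))  # the two counts should match
```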
Step 2: Define Features and Target Variable
Since we are using Linear Regression, we need to define the independent variables (features) and the dependent variable (target):
features = ["House age", "Distance to the nearest MRT station",
"Number of convenience stores", "Latitude",
"Longitude"]
target = "House price of unit area"

We will now create two separate DataFrames:
- df_complete: Rows where the target variable is available.
- df_missing: Rows where the target variable is missing.
df_complete = df[df[target].notna()]
df_missing = df[df[target].isna()]
Now, we will split df_complete into training and testing sets to train our Linear Regression model:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X_train, X_test, y_train, y_test = train_test_split(
df_complete[features], df_complete[target], test_size=0.2, random_state=42
)
model = LinearRegression()
model.fit(X_train, y_train)
Step 3: Predicting the Missing Values
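The split above reserves a test set, which we can use to gauge how well the imputation model predicts prices before trusting its output. The sketch below is self-contained, with synthetic data standing in for the article's five real-estate features:

```python
# Sketch of evaluating the imputation model on the held-out split.
# Synthetic data: 5 features with a known linear relationship plus noise,
# standing in for the real-estate features used in the article.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 1.5, 0.5, 4.0]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

pred = model.predict(X_test)
print(f"R^2: {r2_score(y_test, pred):.3f}")
print(f"MAE: {mean_absolute_error(y_test, pred):.3f}")
```

A low R² here would be a warning sign that regression-based imputation may introduce more noise than it removes.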
Now, we will use our trained model to predict the missing values:
df.loc[df[target].isna(), target] = model.predict(df_missing[features])
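One caveat worth checking before moving on: a linear model can extrapolate below zero, which is impossible for a price. A hedged sketch of clipping imputed values at a small positive floor (the values and the 0.01 floor below are hypothetical, not from the article):

```python
# Sketch: guard against impossible regression-imputed values by clipping
# at a small positive floor. `filled` stands in for the imputed column.
import pandas as pd

filled = pd.Series([24.9, -1.3, 37.6, 0.0])  # hypothetical imputed prices
floor = 0.01                                 # assumed minimum plausible price
cleaned = filled.clip(lower=floor)           # replace anything below the floor
print(cleaned.tolist())
```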
After filling in the missing values, we will now save the cleaned dataset for further use:
df.to_csv("Real_Estate_Cleaned.csv", index=False)
print(df.isnull().sum())
By following these steps, we successfully filled missing values using Linear Regression. This method is useful when the variable with missing values is correlated with the other features in the dataset.
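As an aside, the same regression-based idea is available off the shelf in scikit-learn's IterativeImputer, which models each column with missing values as a regression on the other columns. A minimal sketch on a toy array (not the article's dataset):

```python
# Sketch of regression-based imputation via scikit-learn's IterativeImputer.
# Toy data: the second column is exactly twice the first, so the missing
# entry in row 1 should be imputed close to 4.0.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [4.0, 8.0]])
imputed = IterativeImputer(random_state=0).fit_transform(X)
print(imputed)
```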
Summary
So, Linear Regression is a simple and effective technique for missing value imputation: it predicts each missing value from the other available features in the dataset. I hope you liked this article on missing value imputation. Feel free to ask your questions in the comments section below. You can follow me on Instagram for many more resources.