Training a machine learning model is not the difficult part; the difficult part is understanding when to use which parameter and how to choose the right value for it. So, let's talk about the linear regression algorithm. In this article, I'll take you through a guide to the parameters of the linear regression algorithm, which will help you train better models and perform better in your interviews.
Linear Regression Parameters Guide
In Python, linear regression is typically implemented with the scikit-learn library's LinearRegression class. Here are the parameters of linear regression you should know:
- fit_intercept
- normalize
- copy_X
- n_jobs
- positive
Let’s go through all these parameters in detail by understanding when to use them and how to determine their values.
fit_intercept
The fit_intercept parameter in a Linear Regression model determines whether the model should include an intercept term to represent the bias or baseline value of the response variable when all predictors are zero.

Look at the graph above. The difference between these two lines highlights the role of the intercept. The model with an intercept can adjust to fit the data more flexibly, which captures the baseline level of the response variable when all predictors are zero. In contrast, the model without an intercept assumes the line passes through the origin, which may not accurately reflect the underlying data relationship if there’s an inherent baseline level different from zero.
By default, this parameter is set to True, meaning the model will calculate the intercept. If fit_intercept=True, the model fits an additional parameter corresponding to the intercept, which accounts for the average value of the target variable when all feature values are zero. This is important when the data is not centred around the origin, as the intercept captures the baseline level of the target variable.
However, if you have centred your data, meaning you have adjusted the features and target so that their means are zero, you can set fit_intercept=False. It is because, in such a scenario, the intercept should theoretically be zero, and including it could introduce unnecessary complexity to the model. In practical terms, unless the data is preprocessed to be centred, fit_intercept should typically be left as True to provide a better fit.
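A minimal sketch of the difference, using toy data with a known nonzero baseline (the data here is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data with a clear baseline: y = 2x + 5, plus a little noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 5 + rng.normal(0, 0.1, size=100)

# Default behaviour: the intercept is fitted and recovers the baseline (~5)
with_intercept = LinearRegression(fit_intercept=True).fit(X, y)

# Forcing the line through the origin: the slope is distorted to compensate
no_intercept = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.coef_, with_intercept.intercept_)
print(no_intercept.coef_, no_intercept.intercept_)
```

With fit_intercept=False the intercept attribute is exactly 0.0 and the slope comes out larger than the true value of 2, because the line must absorb the baseline of 5 that it is no longer allowed to model.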
normalize
The normalize parameter in scikit-learn's Linear Regression model controlled whether the input features (regressors) were rescaled before fitting the model. Note that, despite the name, it did not standardize features to a mean of 0 and a standard deviation of 1; it subtracted the mean and divided each feature by its L2 norm. Also note that this parameter was deprecated in scikit-learn 1.0 and removed in version 1.2, so it is no longer available in current releases.
If set to True (in versions that still supported it), the model normalized the features internally so that features with larger scales did not receive undue weight.
However, normalization should be handled explicitly before fitting the model, using preprocessing techniques like StandardScaler from scikit-learn's preprocessing module. A separate preprocessing step gives you more control over the data transformation process and is crucial for proper model validation and interpretation. Therefore, even if features have different scales, it is recommended to preprocess the data independently rather than relying on the normalize parameter.
copy_X
The copy_X parameter in scikit-learn’s Linear Regression model determines whether the input features (predictor variables) should be copied before fitting the model. When copy_X=True, a copy of the input data is made, which ensures that the original data remains unchanged during the fitting process.
It is particularly useful when you want to preserve the integrity of the original dataset, especially if it is being used in other parts of the analysis or if multiple models are being trained.
On the other hand, setting copy_X=False allows the model to overwrite the original input data, which can save memory and computation time, especially for large datasets.
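A small sketch of the default behaviour, using invented data: with copy_X=True, fitting leaves the original array untouched (with copy_X=False, scikit-learn may instead centre X in place, depending on its dtype and memory layout):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
X_before = X.copy()

# Default copy_X=True: the model works on an internal copy of X
LinearRegression(copy_X=True).fit(X, y)

# The original array is unchanged after fitting
print(np.array_equal(X, X_before))  # True
```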
n_jobs
The n_jobs parameter in scikit-learn’s Linear Regression model specifies the number of CPU cores used for computation, which is particularly useful when dealing with large datasets or complex computations. By default, n_jobs = None (or equivalently n_jobs = 1), meaning the computation will run on a single core.
If n_jobs is set to a positive integer, such as 2 or 4, the computation will utilize that specific number of cores, which allows for parallel processing and potentially speeds up the model fitting process.
Setting n_jobs = -1 instructs the model to use all available CPU cores, which maximizes parallelism. Be aware, however, that for LinearRegression this only provides a speedup on sufficiently large problems with more than one target (n_targets > 1), and when X is sparse or positive=True; for a single target on dense data, extra cores make little difference.
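A quick sketch of the multi-target case, where n_jobs can actually parallelize the work (the data here is invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 10))
# Five target columns — the case where n_jobs parallelizes across targets
Y = X @ rng.normal(size=(10, 5))

# Use all available cores to fit one regression per target column
model = LinearRegression(n_jobs=-1).fit(X, Y)

print(model.coef_.shape)  # (5, 10): one coefficient row per target
```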
positive
The positive parameter in scikit-learn’s Linear Regression model ensures that all the coefficients of the features are non-negative, which means they are forced to be zero or positive. This parameter is particularly useful in situations where domain knowledge or theoretical considerations suggest that the relationship between the predictors and the target should not result in negative coefficients.
For example, increasing the spending on advertisements should either increase sales revenue or leave it unchanged, but it should never be modelled as decreasing it. By setting positive = True, the model constrains the coefficients during optimization so that they align with these expectations, which can lead to more interpretable models that better reflect the underlying relationships in the data.
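A minimal sketch, using invented data where the true relationship has one slightly negative coefficient; the constraint clips it to zero rather than letting it go negative:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 3))
# True relationship: strong positive, positive, and slightly negative effects
y = 3 * X[:, 0] + 1 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(0, 0.05, 200)

# Constrain all coefficients to be >= 0
model = LinearRegression(positive=True).fit(X, y)

print(model.coef_)  # all entries are non-negative
```

Note that the constraint does not make a genuinely negative effect positive; it simply prevents the model from expressing it, so it should only be used when domain knowledge justifies it.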
You can learn many more such concepts in detail from my book on Machine Learning Algorithms.
Summary
So, here are the parameters of the Linear Regression algorithm you should know:
- fit_intercept: Analyze your data. If your data is already centred, you might set it to False. Usually, it remains True.
- normalize: Removed in recent scikit-learn versions; preprocess your data yourself using StandardScaler or another preprocessing method.
- copy_X: Typically, this remains True unless you are handling memory issues manually.
- n_jobs: For large datasets, setting n_jobs=-1 can speed up the computation by utilizing all available processors. For smaller datasets, the default setting is usually sufficient.
- positive: Depending on the nature of your problem, if you need to constrain the coefficients to be positive, set it to True. Otherwise, keep it False.
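The summary above can be sketched as a single constructor call; the specific choices here are hypothetical and depend on your data:

```python
from sklearn.linear_model import LinearRegression

# One hypothetical configuration following the guidelines above
model = LinearRegression(
    fit_intercept=True,  # data is not centred, so fit the baseline
    copy_X=True,         # keep the original feature array intact
    n_jobs=-1,           # use all cores (helps on large multi-target problems)
    positive=False,      # no domain reason to forbid negative coefficients
)
print(model.get_params()["n_jobs"])  # -1
```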
I hope you liked this article on a guide to the parameters of the Linear Regression algorithm. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.