Regression is one of the main predictive modeling and data mining methods. It allows you to establish a relationship between variables in order to predict the development of a phenomenon in the future. For example, in this way you can find out how many products the store will sell in the coming months, how price changes will affect the flow of customers, and what proportion of employees may leave the company.
Even beginner analysts know about linear and logistic regression. The other models in this family are less well known, but a true data science professional needs to know what they are and what they are used for. This knowledge will be useful for frontend programmers, web developers and anyone who works with data in Python.
Today we will take a short tour of different types of regression and get acquainted with their capabilities and application features. Add this article to your cheat sheet collection and off you go. Tools for applying these models in Python are implemented in the NumPy, scikit-learn and statsmodels libraries.
Let's start with the simplest model, linear regression, which is used when relationships between variables are linear in nature. For example, linear regression will tell you how many call center operators are needed to handle the workload during the busy season, or how vehicle mileage affects the frequency of repairs.
If you have a single explanatory variable (descriptor), you are dealing with simple linear regression. If there are two or more independent variables, then this is multiple linear regression.
Linear regression works best when the dependent variable has no outliers that fall far from the trend and the results show little scatter around it. It also assumes that the independent variables are not correlated with each other (no multicollinearity).
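As a rough illustration of simple linear regression, here is a minimal sketch with scikit-learn. The mileage and repair figures are synthetic and the variable names are made up for the example; they are assumptions, not data from a real fleet.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): repairs per year grow linearly with mileage.
rng = np.random.default_rng(0)
mileage = rng.uniform(10, 200, size=100)  # thousand km
repairs = 0.5 + 0.02 * mileage + rng.normal(0, 0.2, size=100)

# scikit-learn expects a 2D feature array, hence the reshape.
model = LinearRegression()
model.fit(mileage.reshape(-1, 1), repairs)

slope = model.coef_[0]          # estimated extra repairs per thousand km
intercept = model.intercept_
predicted = model.predict(np.array([[150.0]]))  # forecast for 150 thousand km
print(slope, intercept, predicted)
```

With several explanatory variables the same call works unchanged: pass a feature array with one column per variable.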
The second most popular model, logistic regression, is used when the dependent variable is binary in nature, that is, falls into one of two categories. For example, you want to know how certain factors influence the user's decision to close the site or stay on the page. Or you need to assess each candidate's chances in an election (win / not win).
Logistic regression can also be applied if there are more than two possible outcomes. Let's say you want to assign students to Humanities, Engineering and Science tracks using school test scores. In this case, we are talking about multinomial logistic regression.
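The binary case can be sketched with scikit-learn's `LogisticRegression`. The data below is simulated under an invented rule (the longer a visitor has been on the page, the more likely they are to stay), so treat it as an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): probability of staying rises with time on page.
rng = np.random.default_rng(0)
seconds_on_page = rng.uniform(0, 60, size=200)
p_stay = 1 / (1 + np.exp(-(seconds_on_page - 30) / 5))
stayed = (rng.uniform(size=200) < p_stay).astype(int)  # 1 = stayed, 0 = left

clf = LogisticRegression()
clf.fit(seconds_on_page.reshape(-1, 1), stayed)

# Estimated probability of staying after 5 and after 55 seconds.
proba = clf.predict_proba(np.array([[5.0], [55.0]]))[:, 1]
print(proba)
```

The same estimator also accepts a target with more than two classes, which covers the multinomial case.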
Polynomial regression allows you to model nonlinear relationships using polynomial functions of the independent variables. To understand the difference between polynomial and linear regression, take a look at the graph below. The red curve describes the behavior of the dependent variable much better because its relationship to the descriptor is non-linear.
Polynomial regression helps analysts and developers solve the problem of underfitting, when the model fails to capture a significant part of the pattern in the data. On the other hand, it should be remembered that inappropriate use of this technique or the addition of unnecessary features creates a risk of overfitting: a model that shows good results on the training set turns out to be unusable on real data.
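The underfitting problem can be shown on a small synthetic example. In this sketch the true relationship is quadratic (an assumption built into the generated data), and a straight line is compared with a degree-2 polynomial fitted through a scikit-learn pipeline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (assumption): y depends on x quadratically, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 80).reshape(-1, 1)
y = 1.0 + 2.0 * x.ravel() ** 2 + rng.normal(0, 0.5, size=80)

# A straight line underfits; a degree-2 polynomial matches the shape.
linear = LinearRegression().fit(x, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

r2_linear = linear.score(x, y)  # R^2 on the training data
r2_poly = poly.score(x, y)
print(r2_linear, r2_poly)
```

Raising `degree` further will keep improving the training score, which is exactly the overfitting risk described above, so the degree is better chosen on held-out data.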
Quantile regression is used when the data is heavily skewed and outliers and random errors are common. In other words, it helps when the mean that linear regression works with does not accurately reflect the relationship between the variables. In these cases, quantile regression lets you model a chosen quantile of the dependent variable instead of its mean: a value below which a given share of the observations falls, such as the median or the 90th percentile.
To apply quantile regression in Python, you need the statsmodels package. With it, you can analyze information with customizable quantiles, allowing you to look at the data from different angles.
Lasso Regression / Ridge Regression
These two techniques are useful when you need to reduce the dimension of your data and eliminate the overfitting problem. There are two ways to do this:
- L1 regularization: adds a penalty proportional to the sum of the absolute values of the coefficients. This method is used in lasso regression.
- L2 regularization: adds a penalty proportional to the sum of the squared coefficients. This method is used in ridge regression.
In most cases, researchers and developers prefer L2 regularization because it is computationally more efficient. On the other hand, lasso regression can shrink some coefficients all the way to 0, that is, remove unnecessary variables from the model. This is useful if a phenomenon is influenced by thousands of factors and it makes no sense to consider all of them.
Both regularization methods are combined in the elastic net technique. It is best suited when the explanatory variables are highly correlated with each other. In these cases, the model applies the L1 and L2 penalties together, and a mixing parameter controls how much weight each of them gets.
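The difference between the penalties is easy to see on synthetic data in which only two of ten features actually matter (an assumption of this sketch). The `alpha` and `l1_ratio` values are arbitrary illustrative choices, not tuned recommendations.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, Ridge

# Synthetic data (assumption): only the first two of ten features affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

lasso = Lasso(alpha=0.2).fit(X, y)                    # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)                    # L2 penalty
enet = ElasticNet(alpha=0.2, l1_ratio=0.5).fit(X, y)  # mix of L1 and L2

# Lasso sets irrelevant coefficients exactly to zero; ridge only shrinks them.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(n_zero_lasso, n_zero_ridge)
print(enet.coef_.round(2))
```

In practice the penalty strength is usually picked by cross-validation, for example with scikit-learn's `LassoCV`, `RidgeCV` or `ElasticNetCV`.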
Principal Component Regression
Principal Component Analysis (PCA) is another way to reduce the dimension of data. It creates a small number of new explanatory variables, the principal components, that capture most of the variance in the original predictors. This makes it possible to build a regression model on highly noisy data. At the first stage, the analyst extracts the principal components, then fits a regression to them. This two-step approach is known as principal component regression (PCR).
It is important to understand that the principal components the analyst works with are in fact functions of all the original features. That is why we are talking about creating key variables, not selecting them from the existing ones. For this reason, PCA is not suitable for explaining the actual relationships between variables. It is better seen as a way of building a predictive model from known data about a particular phenomenon.
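The two stages described above (extract components, then regress on them) can be sketched as a scikit-learn pipeline. The data is synthetic: ten noisy predictors are generated from two hidden factors, which is an assumption made so that two components are enough.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (assumption): ten noisy predictors driven by two hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 10))
X = latent @ loadings + rng.normal(0, 0.1, size=(200, 10))
y = latent[:, 0] - latent[:, 1] + rng.normal(0, 0.1, size=200)

# Stage 1: standardize and extract two components. Stage 2: regress on them.
pcr = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
pcr.fit(X, y)

r2 = pcr.score(X, y)
explained = pcr.named_steps["pca"].explained_variance_ratio_.sum()
print(r2, explained)
```

Note that the components are chosen without looking at `y`, so a component that explains a lot of variance in the predictors is not guaranteed to be useful for the forecast.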
Partial Least Squares Regression
Unlike the previous technique, Partial Least Squares (PLS) takes the dependent variable into account when constructing the components. This allows you to build models with fewer components, which is very convenient in cases where the number of predictors greatly exceeds the number of observations, or if the predictors are highly correlated.
Technically, PLS is very similar to PCR – first, the hidden factors that explain the relationship of variables are determined, then a forecast is built using this data.
Ordinal regression allows you to study phenomena that are measured on an ordered scale. For example, users' attitude to the design of a site, from "don't like it at all" to "like it very much". Or, in medical research, it can show how a patient's condition changes (from "very severe pain" to "no pain at all").
Why can't linear regression be used for this? Because it treats the steps between the points of the scale as equal and ignores the difference in their meaning. Take, for example, three people 175 cm tall and weighing 55, 70 and 85 kg. The 15 kilograms that separate the lightest and the heaviest person from the one in the middle have the same value for a linear function. But from the point of view of sociology and medicine, this is the difference between being underweight, normal weight and overweight.
Poisson Regression / Negative Binomial Regression
Two more techniques are used for a special situation: when you need to count events that occur independently of each other during a given period of time. For example, to predict the number of shoppers' trips to the store for a specific product, or the number of critical errors on corporate computers. Such phenomena follow the Poisson distribution, from which the technique got its name.
The disadvantage of this method is that it assumes the variance of the dependent variable is equal to its mean. In reality, analysts often deal with data whose variance is much higher than the mean (overdispersion). For such cases, negative binomial regression is used.
The specifics of these regressions impose a requirement on the dependent variable: it must be a count, that is, a non-negative whole number.
The last model in our collection, Cox regression, is used to estimate the time until a certain event. What is the likelihood that an employee will work for the company for 10 years? How many rings is the customer willing to wait before hanging up? When will the patient have the next crisis?
The model works with two values per subject: one reflects the elapsed time, and the second is a binary indicator of whether the event has happened or not. This is similar to the mechanics of logistic regression, but that technique does not use time. The underlying assumptions of Cox regression are that the explanatory variables are not correlated with each other and that they all have a linear effect on the logarithm of the event risk (the hazard). In addition, the hazards of any two subjects must remain proportional to each other at every point in time.
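A minimal sketch of a Cox model using `PHReg` from statsmodels is shown below (the `lifelines` package is another common choice). The event and censoring times are simulated, and the single risk factor `x` is hypothetical.

```python
import numpy as np
from statsmodels.duration.hazard_regression import PHReg

# Synthetic data (assumption): the hazard is proportional to exp(0.7 * x).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)  # hypothetical risk factor
event_time = rng.exponential(1.0 / np.exp(0.7 * x))
censor_time = rng.exponential(2.0, size=n)  # when observation stops

# Parameter 1: observed time. Parameter 2: did the event happen (1) or not (0)?
time = np.minimum(event_time, censor_time)
status = (event_time <= censor_time).astype(int)

model = PHReg(time, x.reshape(-1, 1), status=status)
result = model.fit()

hazard_ratio = np.exp(result.params[0])  # risk multiplier per unit of x
print(result.params, hazard_ratio)
```

A hazard ratio above 1 means that higher values of the factor bring the event closer, and a ratio below 1 means they delay it.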
This is not a complete list of regressions that are available to Python developers and analysts. However, even this list gives an idea of what opportunities this language offers for studying a wide variety of data.
Feature image credit: Unsplash