Regression models from scratch
How to identify a Regression problem?
Regression is one of the key methods regularly used in data science to model relationships between variables, where the target variable (i.e. the value to be estimated) is a continuous number. Examples of Regression problems are:
- Forecasting sales for the next period.
- Predicting a student's grade in an exam.
- Predicting the price of a property.
Simple Linear Regression
Regression analysis consists of finding a function (F(X)), under a given set of assumptions, that best describes the relationship between the dependent variable (Y) and the independent variable (X).
When the number of independent variables is only one and the relationship between the dependent and independent variable is assumed to be a straight line, the type of regression analysis is called simple linear regression. The straight line relationship is called a regression line or line of best fit.
How can the regression line be determined for a given data set? A common method used to determine the regression line is called the least squares method.
The simple linear regression equation is as follows:
y ≈ B0 + B1*X
where B0 and B1 are unknown constants, representing the intercept and slope of the regression line, respectively.
The intercept is the value of the dependent variable (Y) when the independent variable (X) has a value of zero (0), or in other words, the value of the prediction in the absence of variables. The slope is a measure of how much the prediction value changes with a one-unit change in the independent variable, i.e. it measures the impact of the independent variable on the prediction. The unknown constants are called coefficients or parameters of the model.
Calculating the difference between the actual value of the dependent variable and the predicted value of the dependent variable gives an error commonly referred to as the residual (Ei).
By repeating this calculation for each data point in the sample, the residual (Ei) for each data point can be squared, to remove algebraic signs, and summed to obtain the sum of squares of the error (SSE). The least squares method seeks to minimise the SSE.
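The least squares procedure described above can be sketched in a few lines of Python. The data are made up for illustration, and the closed-form estimates for B0 and B1 follow the standard simple-linear-regression formulas:

```python
import numpy as np

# Made-up sample data (X: independent variable, Y: dependent variable)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Closed-form least squares estimates:
# B1 = cov(X, Y) / var(X),  B0 = mean(Y) - B1 * mean(X)
B1 = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
B0 = Y.mean() - B1 * X.mean()

Y_pred = B0 + B1 * X          # predictions of the fitted regression line
residuals = Y - Y_pred        # the residual Ei for each data point
SSE = np.sum(residuals ** 2)  # sum of squared errors, minimised by least squares
```

Any other choice of B0 and B1 would produce a larger SSE on this sample; that is exactly what "line of best fit" means here.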
Figure 1.1 illustrates graphically what is described above.
Multiple Linear Regression
In the simple linear regression discussed above, we have only one independent variable. If we include multiple independent variables in our analysis, we obtain a multiple linear regression model. Multiple linear regression is represented in a similar way to simple linear regression.
Consider a case where we want to fit a linear regression model that has three independent variables, X1, X2 and X3. The multiple linear regression equation will look like this:
y ≈ B0 + B1*X1 + B2*X2 + B3*X3
Each independent variable will have its own coefficient or parameter (i.e. B1, B2 or B3). Each coefficient B tells us how a change in its respective independent variable influences the dependent variable when all other independent variables remain unchanged.
Multiple regression coefficients are estimated using the same least squares method as in simple linear regression. To satisfy the least squares method, the chosen coefficients must minimise the sum of the squared residuals.
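The multiple-regression fit can be sketched with NumPy's least squares solver. The data below are fabricated so that Y follows y = 1 + 2*X1 + 1*X2 + 0.5*X3 exactly, so the least squares fit should recover those coefficients:

```python
import numpy as np

# Fabricated observations of the three independent variables X1, X2, X3
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 2.5],
    [5.0, 5.0, 2.0],
    [6.0, 4.0, 3.5],
])
# Y built as 1 + 2*X1 + 1*X2 + 0.5*X3 for each row
Y = np.array([5.25, 6.75, 11.5, 13.25, 17.0, 18.75])

# Prepend a column of ones so the intercept B0 is estimated as well
X_design = np.column_stack([np.ones(len(X)), X])

# Least squares solution of X_design @ B ≈ Y (minimises the sum of squared residuals)
B, *_ = np.linalg.lstsq(X_design, Y, rcond=None)
B0, B1, B2, B3 = B
```

Because Y was constructed without noise, the fit recovers B = [1, 2, 1, 0.5]; with real data the coefficients would instead be the noise-minimising estimates.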
Linear Regression Assumptions
Bear in mind that, for Linear Regression to model reality and produce good estimates, certain assumptions must be fulfilled. To keep this publication from getting too long, we will list them without elaborating on them. These assumptions are:
- The relationship between the dependent and independent variables must be linear and additive.
- The residual terms (Ei) must have a normal distribution.
- The residual terms (Ei) must have constant variance (homoscedasticity).
- The residual terms (Ei) must be uncorrelated.
- There should be no correlation between the independent variables (no multicollinearity).
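As a minimal illustration of the last assumption, the pairwise correlations between the independent variables can be inspected (the data here are made up; in practice you would pass your own predictor columns):

```python
import numpy as np

# Hypothetical observations of three independent variables (one per column)
X = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 1.0, 1.5],
    [3.0, 4.0, 1.0],
    [4.0, 3.0, 2.5],
    [5.0, 5.0, 2.0],
])

# Pearson correlation matrix between the columns of X;
# off-diagonal values near +/-1 would signal correlated predictors
corr = np.corrcoef(X, rowvar=False)
```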
Evaluation metrics for regression problems
The main objective of regression analysis is to find a model that explains the observed variability in a dependent variable of interest. Therefore, it is very important to have a quantity that measures how well a regression model explains this variability. A statistic or metric that does this is called R-squared (R2). While there are other commonly used metrics, such as the RMSE, we will discuss the R2, as it is perhaps the most familiar to those with a basic knowledge of statistics. The formula for R2 is as follows:
R2 = 1 – SSE/SST
- SSE = sum((actual value - predicted value)**2) = sum((Yi - Yi_pred)**2)
- SST = sum((actual value - mean value)**2) = sum((Yi - Y_mean)**2)
- SSR = sum((predicted value - mean value)**2) = sum((Yi_pred - Y_mean)**2)
The R-squared is the portion of variability explained by the model. In other words, it compares the error of my model against that of a model that always predicts the mean of the actual values. Therefore:
R2 = 1 - (my model's error)/(the error of a model that always predicts the mean)
The R2 can take values less than or equal to 1 (R2 ≤ 1), and the closer the R-squared is to 1, the better the model (R2 = 1 is a perfect fit). On the other hand, if I simply predict the mean, the R-squared is 0, since the SSE and the SST would have the same value, the division would equal 1, and 1 - 1 = 0. Finally, if the R-squared is negative, it means that my model is worse than just predicting the mean.
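The R2 formula above can be computed directly. The actual and predicted values below are made up, and the baseline model that always predicts the mean is included to show the R2 = 0 case:

```python
import numpy as np

# Hypothetical actual values and predictions from some regression model
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_pred = np.array([2.8, 5.3, 6.9, 9.2, 10.8])

SSE = np.sum((y_true - y_pred) ** 2)           # error of my model
SST = np.sum((y_true - y_true.mean()) ** 2)    # error of predicting the mean
r2 = 1 - SSE / SST

# A model that always predicts the mean: SSE equals SST, so R2 is exactly 0
baseline = np.full_like(y_true, y_true.mean())
r2_baseline = 1 - np.sum((y_true - baseline) ** 2) / SST
```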
Up to this point we have seen when Regression models are applied and discussed Simple and Multiple Linear Regression along with an evaluation metric for these models. In future publications we will include more advanced models that are more widely used.