## Classification model in a business context

### Motivation

Classification problems are among the most frequent use cases encountered in the real world. Unlike regression problems, where an actual numerical value is predicted, classification problems attempt to associate an example with a category. Classification problems can be further divided into binary and multiclass classification. The former is used when what is to be predicted or classified has only two possible outcomes, while the latter refers to three or more possible outcomes or categories.

Some examples for which a binary classification model would be used are as follows:

• Predict whether or not a customer will buy a certain product.
• Predict whether or not a customer will churn.
• Determine whether or not a student will pass an exam.

On the other hand, we would use a multi-class classification model to:

• Analyse text comments and capture the underlying emotion, such as happiness, anger, sadness or sarcasm.
• Predict whether a team will win, draw or lose the next match.
• Analyse images of fruit and classify them into three different categories according to the degree of aesthetic quality.

The best way to work with a concept is with an example to which it can be related. To understand the business context, consider the following example:

The marketing team of a bank wants to know the propensity of customers to purchase a certain investment product. To solve this problem, the probability of purchase of customers could be calculated to find out their propensity or inclination to purchase the product. In this way, customers could be segmented and marketing campaigns could be targeted to persuade those most likely to purchase the investment.

As mentioned in previous publications, the first step in a data science project is to understand the business. This is about understanding the various factors that influence the business problem. Knowing the drivers or levers of the business is important, as it will help formulate hypotheses about the business problem, which can be verified during exploratory data analysis.

Knowing that the product to be offered tends to be popular with risk-averse customers, the following hypotheses could be made:

• Would age be a factor, with a greater propensity shown by older people?
• Is there any relationship between employment status and the propensity to purchase such an investment product?
• Would a customer's asset portfolio (housing, loan or higher bank balance) influence the propensity to buy?
• Will demographics, such as marital status and education, influence the propensity to purchase the product? If so, how do demographics correlate with propensity to buy?

### Test the veracity of hypotheses with data

Based on the above, an exploratory data analysis would help us to test the veracity of the hypotheses raised with data. As an example, the following hypothesis could be defined:

The propensity to buy the investment product is higher for older customers than for younger ones.

We could chart the number of people who buy the product according to their age to see if there is a pattern that reflects our hypothesis. From the graph we can see that the highest number of purchases of the investment product is made by customers between 25 and 40 years old, and that the propensity to buy decreases with age.

However, we are overlooking an important detail here, we are taking the data based on the absolute count of clients in each age range. If the proportion of bank clients is highest in the 25-40 age range, then we are likely to get a graph like the one we have obtained. What we should really be plotting is the proportion of customers, within each age group, who buy the product in question.

We can see, in the graph on the left, that in the age group from 22 years (approx.) to 60 years, individuals are not inclined to buy the product. However, in the graph on the right, we see the opposite: the 60+ age group is much more inclined to buy the product.

Taking the proportion of users is the right approach to get the right perspective in which to view the data. This is more in line with the hypothesis we have put forward.
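To make this concrete, the following sketch shows how the purchase rate per age group could be computed with pandas. The data and the column names (`age`, `purchased`) are hypothetical, purely for illustration:

```python
import pandas as pd

# Hypothetical customer data: age and whether the product was bought (1/0).
df = pd.DataFrame({
    "age": [25, 31, 38, 45, 52, 63, 67, 71],
    "purchased": [0, 0, 1, 0, 0, 1, 1, 1],
})

# Absolute counts can mislead if one age band dominates the customer base,
# so compute the purchase *rate* within each age group instead.
df["age_group"] = pd.cut(df["age"], bins=[20, 40, 60, 80],
                         labels=["21-40", "41-60", "61-80"])
rate_by_group = df.groupby("age_group", observed=True)["purchased"].mean()
print(rate_by_group)
```

Plotting `rate_by_group` (rather than raw counts) is what produces the proportion view described above.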

### Predicting the probability of purchase with Logistic Regression

While there are several steps between the exploratory analysis and the application of a model (such as preprocessing the data for input to the algorithm, creating new variables to help improve the predictive capacity of the model, etc.), the focus of this publication is to address a classification model in a business context. That is why we will now explain how Logistic Regression works and how it adapts to the business objective in question. The desired business outcome, in our use case, is to identify the customers who are likely to buy the product.

In general terms, the goal of machine learning is to estimate a mapping function (f) between the input variables and an output variable. In mathematical form, this can be written as follows:

Y = f(X)

Y is the dependent variable, which is our prediction of whether a customer is likely to buy the product or not.

X is the independent variable(s), which are those attributes such as age, education, bank balance, asset portfolio, etc. that are part of the dataset.

f() is a function that relates various attributes of the data to the probability that a customer will or will not buy the product. This function is learned during the machine learning process. This function is a combination of different coefficients or parameters applied to each of the attributes (or variables) to obtain the probability of purchase.

For simplicity, let's assume we have only two attributes, age and bank balance. Let's assume the age is 62 and the balance is \$900. With these attribute values, let's assume that the mapping equation is as follows:

Y = B0 + B1_Age * Age + B2_Bank_balance * Bank_balance

Using the above equation, we obtain the following:

Y = 0.1 + 0.4 * 62 + 0.002 * 900

Y = 26.7
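The arithmetic above can be checked directly. A minimal sketch, using the illustrative coefficients from the example (in practice these would be learned during training):

```python
# Illustrative coefficients from the example (B0, B1_Age, B2_Bank_balance).
b0, b1_age, b2_balance = 0.1, 0.4, 0.002

# Attribute values for the customer in the example.
age, bank_balance = 62, 900

# Linear-regression-style score: a real number, not yet a probability.
score = b0 + b1_age * age + b2_balance * bank_balance
print(score)  # ≈ 26.7
```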

We see that the equation used corresponds to the Linear Regression we saw in the previous publication, and that the output gives us a real number. This is where Logistic Regression comes in, which is similar to Linear Regression, but applies a sigmoid function that reduces any real-valued number to a value between 0 and 1, which makes this function ideal for predicting probabilities.

To transform the real-valued output into a probability, we use the logistic function, which has the following form:

Y = (e^(X))/(1 + e^(X))

Here, e is the base of the natural logarithm (Euler's number).

Y = (e^(B0 + B1*X1 + B2*X2))/(1 + e^(B0 + B1*X1 + B2*X2))

Let us now look at the logistic regression function from the business problem we are trying to solve.

Y = (e^(B0 + B1_Age * Age + B2_Bank_balance * Bank_balance))/(1 + e^(B0 + B1_Age * Age + B2_Bank_balance * Bank_balance))

Y = (e^(0.1 + 0.4*62 + 0.002*900))/(1 + e^(0.1 + 0.4*62 + 0.002*900))

The exponent evaluates to 26.7, so applying the logistic function gives a value of Y very close to 1: with these illustrative coefficients, the model assigns a near-certain probability that the customer will buy the investment product. As discussed in the previous example, model coefficients such as 0.1, 0.4 and 0.002 are the ones we learn with the logistic regression algorithm during the training process.
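A minimal sketch of the logistic (sigmoid) function applied to the score from the worked example. Note that a score this large drives the output very close to 1; intermediate probabilities arise from scores nearer to zero:

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Score from the worked example: 0.1 + 0.4*62 + 0.002*900
score = 0.1 + 0.4 * 62 + 0.002 * 900
probability = sigmoid(score)
print(probability)  # very close to 1 for a score of this magnitude
```

In practice, a library such as scikit-learn would fit the coefficients and return these probabilities via its `predict_proba` method, rather than computing the sigmoid by hand.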

So far we have addressed issues such as: when to use a classification model, defining a business objective, posing and testing the veracity of a hypothesis from exploratory analysis, and finally applying the concept of logistic regression to the business problem. In the next publication we will focus on different measures to evaluate the performance of classification models.
