30 August

Intrinsically interpretable models: explaining a linear regression model

Frida Karvouni
Author

A linear regression model is intrinsically interpretable. It is straightforward to calculate the marginal contribution of its features.

“The aim of science is to seek the simplest explanations of complex facts”
― Alfred North Whitehead

A linear regression model is intrinsically interpretable. Its mechanics are transparent and it is fairly straightforward to calculate the marginal contributions of its features.

A black-box model, by contrast, is well suited to modelling complex relationships. Its input and output are clear, but its inner workings are difficult to inspect.

Black-box model vs. Interpretable model

This blog focuses on calculating the marginal contributions of the features in a linear regression model. For a linear model with independent features, these contributions coincide with the Shapley values.

Marginal contribution explained

The marginal contribution of a feature Xi for a given data point is the amount by which that feature moves the prediction for the data point away from the average prediction over the population.

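A linear regression model is explained by using its statistical components. Assume a linear regression:

$$y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$$

The marginal contribution of X1 for a data point with value x_1 is calculated as:

$$\phi_1 = \beta_1 \left( x_1 - \bar{X}_1 \right)$$

where \bar{X}_1 is the mean of X1 over the data.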
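As a minimal sketch of this definition, the snippet below computes each feature's contribution as its coefficient times the deviation of the feature value from the feature mean. The intercept, coefficients and data are hypothetical and chosen only for illustration; they are not taken from the model in this blog.

```python
import numpy as np

# Hypothetical intercept, coefficients and data, for illustration only.
intercept = 2.0
coef = np.array([1.5, -0.5])            # beta_1, beta_2
X = np.array([[1.0, 4.0],
              [2.0, 0.0],
              [3.0, 2.0]])              # rows = data points, columns = X1, X2

# Marginal contribution: beta_j * (x_ij - mean of X_j).
contributions = (X - X.mean(axis=0)) * coef

# Predictions of the linear model and their average.
predictions = intercept + X @ coef
average_prediction = predictions.mean()

# The contributions of a row add up to its deviation from the average prediction.
assert np.allclose(contributions.sum(axis=1), predictions - average_prediction)
print(contributions)
```

The assertion shows why the contributions are easy to read: summed over the features, they account exactly for the gap between one prediction and the average prediction.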

Explainability in practice

The modelling exercise below sheds some light on the contribution of temperature and holidays to ice-cream searches, using a linear regression model. The dataset combines Google Trends data on searches for the word “softijs” (Dutch for soft-serve ice-cream) with temperature and school holiday data.

Date            Searches    Temperature (°C)    Holiday
7 Jan 2018      5           6.1                 1
14 Jan 2018     14          3.9                 0
21 Jan 2018     10          4.1                 0
28 Jan 2018     10          7.4                 0
4 Feb 2018      15          4.6                 0
...             ...         ...                 ...
28 June 2020    51          20.6                0
5 July 2020     52          16.8                1
12 July 2020    44          15                  1
19 July 2020    48          17.1                1
26 July 2020    49          17.2                1

 

The searches dataset shows a clear pattern. Ice-cream searches are more popular during the summer months and less popular during the winter and autumn.

Searches for ice-cream vs. Temperature

Holidays coincide with a few of the spikes in the searches for ice-cream. However, they do not seem to contribute much to the overall number of searches.

Searches for ice-cream vs. Holidays

For the modelling part, a linear regression model is first fitted to the dataset. The most recent row is held out as a test set.

import pandas as pd
from sklearn.linear_model import LinearRegression


def run_model(pdf, target_col="searches", date_col="date"):
    # Sort chronologically and index by date (assumes date_col holds datetime values).
    pdf_model = pdf.copy()
    pdf_model = pdf_model.sort_values(by=[date_col], ascending=True)
    pdf_model = pdf_model.set_index(date_col)

    # Every column except the target is used as a feature.
    X = pdf_model.drop(target_col, axis=1)
    y = pdf_model[target_col]

    # Hold out the most recent row as the test set.
    x_train = X[:-1]
    x_test = X[-1:]
    y_train = y[:-1]
    y_test = y[-1:]

    model = LinearRegression()
    fitted_model = model.fit(x_train, y_train)
    return fitted_model, x_train, x_test, y_train


# pdf is the DataFrame holding the date, searches, temperature and holiday columns.
fitted_model, x_train, x_test, y_train = run_model(pdf)
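The fitted coefficients are what make the model readable, so it can help to inspect them directly. The sketch below refits the same kind of model on only the ten rows printed in the table above, which means its coefficients will differ from those behind the results in this blog. The column names `searches`, `temperature` and `holiday` are assumptions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Only the ten rows shown in the table above; the full dataset is not reproduced here.
sample = pd.DataFrame({
    "searches":    [5, 14, 10, 10, 15, 51, 52, 44, 48, 49],
    "temperature": [6.1, 3.9, 4.1, 7.4, 4.6, 20.6, 16.8, 15.0, 17.1, 17.2],
    "holiday":     [1, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

X = sample.drop("searches", axis=1)
y = sample["searches"]

# Same split as run_model: the last row is held out.
model = LinearRegression().fit(X[:-1], y[:-1])

# One coefficient per feature: the change in predicted searches per unit of the feature.
coefficients = dict(zip(X.columns, model.coef_))
print(coefficients, model.intercept_)

# Prediction for the held-out row.
held_out_prediction = model.predict(X[-1:])
print(held_out_prediction)
```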

 

Next, the marginal contribution of each feature is calculated:

def calculate_local_explainer(model, X):
    # Contribution per feature: coefficient * (feature value - feature mean).
    coef = model.coef_
    local_explainer_linear = (X - X.mean(axis=0)) * coef
    return local_explainer_linear


local_explainer_linear = calculate_local_explainer(model=fitted_model, X=x_train)
print(local_explainer_linear)
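A useful sanity check on such contributions is that they are additive: the average prediction plus the contributions of a row should reproduce the prediction for that row. The sketch below verifies this on synthetic data, which is generated only for illustration and is not the Google Trends dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only (not the Google Trends dataset).
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "temperature": rng.uniform(0, 25, size=60),
    "holiday": rng.integers(0, 2, size=60),
})
y = 2.0 * X["temperature"] + 5.0 * X["holiday"] + rng.normal(0, 3, size=60)

model = LinearRegression().fit(X, y)

# Same formula as calculate_local_explainer.
contributions = (X - X.mean(axis=0)) * model.coef_

# Baseline: the average prediction over the rows used to compute the feature means.
predictions = model.predict(X)
baseline = predictions.mean()

# Baseline plus the row-wise sum of contributions reproduces each prediction.
reconstructed = baseline + contributions.sum(axis=1)
assert np.allclose(reconstructed, predictions)
```

If the contributions are computed on different rows than the ones used for the feature means, the baseline has to be the prediction at those means rather than the average prediction of the new rows.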

 

Here are the results:

Date            Temperature    Holiday
7 Jan 2018      -10.6           3.4
14 Jan 2018     -15.3          -2.4
21 Jan 2018     -14.9          -2.4
28 Jan 2018     -8.1           -2.4
4 Feb 2018      -13.9          -2.4
...             ...            ...
21 June 2020    13.6           -2.4
28 June 2020    19.5           -2.4
5 July 2020     11.7            3.4
12 July 2020    7.9             3.4
19 July 2020    12.4            3.4

 

The marginal contribution of temperature for the week of 28 June 2020 is +19.5. This means that, because of the higher temperature, the estimated number of searches is 19.5 above average, measured in the same units as the searches variable. The average temperature for this month is 16.5°C. The positive impact makes sense, since ice-cream is a popular treat during the summer months when the temperature is high.
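Because each contribution equals the coefficient times the deviation from the mean, the temperature coefficient can be estimated from any two rows of the results. The sketch below does this with the rounded values printed in the tables above, so the outcome is approximate and shifts slightly depending on which rows are chosen.

```python
# Rounded values copied from the tables above
# (temperature in °C, contribution in units of the searches variable).
temp_a, contrib_a = 6.1, -10.6   # 7 Jan 2018
temp_b, contrib_b = 3.9, -15.3   # 14 Jan 2018

# contribution = coef * (temperature - mean), so the mean cancels in a difference.
implied_coef = (contrib_a - contrib_b) / (temp_a - temp_b)

# Solving the same formula for the mean gives the implied average training temperature.
implied_mean_temp = temp_a - contrib_a / implied_coef

print(round(implied_coef, 2), round(implied_mean_temp, 1))
```

With these two rows, the estimate comes out at a little over 2 additional searches per extra degree, and the contribution changes sign around an average temperature of roughly 11°C.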

Looking at the results for a colder month, the marginal contribution of temperature in January 2018 is negative, when the average temperature is 5.4°C. This fits the expectation that people are less interested in ice-cream during the coldest months of the year.

During the Christmas and summer school holidays, the marginal contribution of holidays is positive (+3.4), while in all other weeks it is negative (-2.4). A possible explanation is that dessert is often served at festive dinners, ice-cream being one of the options, and that children buy more ice-cream when they are on holiday.
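Since the holiday feature is either 0 or 1, its contribution can take only two values, and these two values are enough to recover the holiday coefficient and the share of holiday weeks in the training data. The sketch below uses the rounded holiday contributions from the results table, so the numbers are approximate.

```python
# Rounded holiday contributions copied from the results table above.
contrib_holiday, contrib_no_holiday = 3.4, -2.4

# Holiday is 0 or 1, so the gap between the two values is the holiday coefficient.
implied_coef = contrib_holiday - contrib_no_holiday

# The contribution at holiday = 0 is coef * (0 - mean), so mean = -contribution / coef.
implied_holiday_share = -contrib_no_holiday / implied_coef

print(round(implied_coef, 1), round(implied_holiday_share, 2))
```

Read this way, a holiday week adds close to 6 searches compared with a non-holiday week, and roughly four in ten training weeks are marked as holidays.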

Conclusion

Explaining a linear regression model is a straightforward process which is easily implemented. Calculating the marginal contributions gives a clear view of the mechanics of the model. This allows a data scientist to validate the output, explain the predictions to stakeholders with more confidence and tune the model based on the findings.

Sources

[1] “Comparing black-box vs. white-box modeling” by Tamanna: https://medium.com/@tam.tamanna18/comparing-black-box-vs-white-box-modeling-bd01575b7670
