Explainable and privacy-preserving artificial intelligence - Part 2

Predicting and understanding the success of bank telemarketing calls.

This is the second part of our article series on the topic of Explainable Artificial Intelligence (XAI).

In our first article, we learned about examples from everyday life where AI is already impacting decisions we are making (e.g. which book to buy) as well as decisions others make about us (e.g. whether we will be invited to a job interview).

One of the drawbacks of AI-based decisions is that they are generally difficult to explain in terms that humans can easily understand. As AI decision making enters important parts of our lives, such as disease detection and financial matters, there is an increasing need to make AI model outcomes interpretable and explainable.

This has been recognized not just by the general public but also by regulators. GDPR is only the latest regulation that is putting the “Right to explanation” at the forefront of all areas where algorithmic decisions are affecting humans.

Some of the organizations that are most impacted by these regulations and laws are financial institutions.

In this article we will show how to build and train a relatively complex machine learning model for predicting the success of bank telemarketing calls and then use XAI methods to explain its predictions.

We will be especially interested in two types of explanations: 1) Global interpretability or determining which input variables are most important for machine learning (ML) predictions (when considering all data instances), and 2) local interpretability, or explaining the individual predictions made by AI models.

Description of data

To illustrate the application of XAI, we will use an open-source data set from a Portuguese retail bank, covering its telemarketing campaign to sell long-term bank deposits.

We will build a supervised machine learning model using XGBoost to help identify those customers that are likely to subscribe to the long-term deposit of the bank.

In a bank environment, the machine learning model can help the bank managers select the high potential customers that should be contacted during bank marketing campaigns, leading to better return on marketing costs and minimizing the time required to achieve success. By focusing only on the customers with higher potential, the bank can also avoid unnecessary calls, which some of the customers may find intrusive.

The bank marketing data set was obtained from the UC Irvine Machine Learning Repository.
The data contains:

  • one labelled target (client subscribed to the long-term deposit: yes or no),
  • 20 input variables or features, describing customer, product, social and economic characteristics.

Loading the libraries

We will first load the models and other required libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as sstats
import matplotlib.pyplot as plt
import operator
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score,classification_report,confusion_matrix,accuracy_score,roc_auc_score,roc_curve,auc,log_loss
from xgboost import XGBClassifier
import eli5 
from lime.lime_tabular import LimeTabularExplainer
from pdpbox import pdp, get_dataset, info_plots
import shap
import xgboost as xgb
import lime
from IPython.display import display

Importing data

We will be using the final, most complete data set with 41,188 examples. Some columns will be renamed to more descriptive names.

The feature “duration” is not known before the call is performed. Since our goal is to build a predictive model, this variable will be dropped from the features, as suggested by the authors of the data set. Our target variable is named subscribed_to_deposit and denotes whether a customer has subscribed to the deposit (“yes”) or not (“no”).

df=pd.read_csv('bank-additional-full.csv',sep=';')
target_variable_name='subscribed_to_deposit'
df.rename(columns={'default':'credit_in_default','housing':'has_housing_loan','loan':'has_personal_loan','contact':'contact_type','month':'last_contact_month','day':'last_contact_day_of_week','day_of_week':'last_contact_day_of_week','marital':'marital_status','campaign':'contact_number_during_campaign','pdays':'days_from_last_contact','previous':'number_of_previous_contacts','poutcome':'outcome_from_previous_campaign','y':target_variable_name,'emp.var.rate':'employment_variation','cons.price.idx':'consumer_price_index','cons.conf.idx':'consumer_confidence_index','nr.employed':'employment index'},inplace=True)
df.drop(['duration'], inplace=True, axis=1)

Exploratory Data Analysis

We will separate the input variables into categorical and numerical features:

categorical_features=['job','marital_status','credit_in_default','has_housing_loan','has_personal_loan','contact_type','last_contact_month','last_contact_day_of_week','outcome_from_previous_campaign','education']
numerical_features=['age','contact_number_during_campaign','days_from_last_contact','number_of_previous_contacts', 'employment_variation', 'consumer_price_index','consumer_confidence_index', 'euribor3m', 'employment index']

Imbalance in the dataset with respect to the target variable

Our data set is imbalanced with respect to the target variable (subscription to the term deposit): negative cases outnumber positive cases by a ratio of about 7.9 to 1.

We would like to build an ML model that will minimize the number of false negative cases. Those are customers who would subscribe to the deposit, but who were incorrectly classified by the ML model as unlikely to subscribe (negative class). We therefore want to train a model that will have a high recall.
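To make the link between false negatives and recall concrete, here is a minimal sketch that counts the relevant cases on a handful of made-up labels. The helper name recall_and_precision and the toy labels are hypothetical and are not taken from the bank data.

```python
def recall_and_precision(y_true, y_pred):
    """Count confusion-matrix cells for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # subscribers we found
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # subscribers we missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # unnecessary calls
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Toy labels (hypothetical): 3 actual subscribers, 5 non-subscribers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

recall, precision = recall_and_precision(y_true, y_pred)
print(f"recall={recall:.2f}, precision={precision:.2f}")
```

Each false negative lowers recall directly, which is why a model meant to miss as few potential subscribers as possible is judged mainly on recall.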

Figure 1: Imbalance of the dataset

Correlation of numerical features

Correlation analysis shows a very high correlation between three features: employment variation, employment index and euribor3m (see Figure 2).

The data set considered was collected from May 2008 to November 2010. This was the period of the last financial crisis, in which euribor3m fell from around 4.5% in 2008 to a level of 0.5% to 1% in 2010 as the ECB responded to the crisis with monetary stimulus in the form of lower rates. In the same period, the Portuguese employment index was also decreasing, which explains the high correlation between euribor3m and the employment indices.

Figure 2: Correlation of numerical features

More detailed information on the interaction between the features and the target variable is provided by the scatter plots and kernel density plots. The charts below show that during the period of high employment and high euribor3m, the bank had less success in getting customers to subscribe to the deposits.

As the financial crisis progressed through 2009 and 2010, clients’ fear of the future was one of the reasons why the propensity to save increased, resulting in a much better conversion of calls that were marketing deposits.

The contrast is particularly striking for euribor3m. In normal times, deposit rates and the inclination to save have a slightly positive correlation, as higher rates usually entice customers to save more. During the financial crisis, fears of an economic depression increased customers’ interest in saving even though the euribor3m rate was falling drastically.

Figure 3: Pair-plot of numerical features relevant for economic activity and monetary policy

Figure 4: Kernel Density Plot (KDE) for feature of age

Some of the XAI methods that we will consider later can lead to misleading explanations if the ML model is trained on a data set which has co-dependent features. This is another reason why we want to prune those features with the highest correlations.

Based on correlation analysis, the two features that are highly correlated with euribor3m will be removed from further analysis: employment index and employment variation.

Correlation of categorical features

In examining the correlation of categorical features, we could in principle use one-hot encoding. However, the analysis would be difficult due to the large number of resulting features.

A better approach is to use Cramér’s V with bias correction (see also: Bergsma, Wicher, 2013, “A bias correction for Cramér’s V and Tschuprow’s T”, Journal of the Korean Statistical Society).
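As an illustration of the statistic, the following is a minimal pure-Python sketch of bias-corrected Cramér’s V following the correction described by Bergsma. The function name cramers_v_corrected and the toy inputs are hypothetical, and this is not necessarily the implementation used to produce the figures in this article.

```python
import math
from collections import Counter


def cramers_v_corrected(x, y):
    """Bias-corrected Cramér's V for two equal-length categorical sequences."""
    n = len(x)
    rows, cols, joint = Counter(x), Counter(y), Counter(zip(x, y))
    r, k = len(rows), len(cols)
    if n < 2 or r < 2 or k < 2:
        return 0.0
    # Pearson chi-squared statistic of the contingency table.
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (joint.get((a, b), 0) - expected) ** 2 / expected
    phi2 = chi2 / n
    # Bias correction: shrink phi^2 and the table dimensions.
    phi2_corr = max(0.0, phi2 - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(k_corr - 1, r_corr - 1)
    return math.sqrt(phi2_corr / denom) if denom > 0 else 0.0


# Toy data (hypothetical): one perfectly associated pair, one independent pair.
v_same = cramers_v_corrected(["a", "b"] * 50, ["u", "v"] * 50)
v_indep = cramers_v_corrected(["a", "a", "b", "b"] * 25, ["u", "v", "u", "v"] * 25)
print(v_same, v_indep)
```

The statistic ranges from 0 (no association) to 1 (one feature fully determines the other), which makes values for different feature pairs directly comparable.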

Cramér’s V shows a relatively high correlation between these pairs of categorical features:

  • education and job (Cramer_V=0.36)
  • has_housing_loan and has_personal_loan (0.71)
  • contact_type and last_contact_month (0.61)

Correlation between education and job is expected. It also seems that persons who have a personal loan tend to also have a housing loan and vice versa. Correlation between the third pair may be related to the specifics of the marketing campaign. We will keep all of the categorical features.

Figure 5: Cramer’s V of categorical features

Distribution of feature values for categorical features

In examining distributions for categorical features, several observations can be made from the charts:

  • persons who are single are more likely to subscribe to deposit than married or divorced ones,
  • clients who were contacted on cellular phone were much more likely to be receptive to deposit subscription than those contacted by telephone,
  • some months (September to December) were better for selling deposits than others. A possible reason for this is the overlap with the Lehman Brothers collapse, which occurred in autumn 2008,
  • success from the previous campaign had a positive effect on the outcome of the current campaign

Figure 6: Distribution by marital status

Figure 7: Distribution by contact type

Training and tuning the XGBoost model

After the initial exploratory data analysis of our data set, we next turn to building and training an XGBoost machine learning model for our problem. We first prepare the train and test data sets:

y = df[target_variable_name].map({"no":0, "yes":1})
X = df.drop(target_variable_name, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=.3, random_state=42)

We will one hot encode categorical features:

categorical_transformer = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(transformers=[('numerical', "passthrough", numerical_features),('categorical', categorical_transformer, categorical_features)])

Parameters of XGBoost model

As mentioned earlier, the imbalance between majority class and minority class instances is rather high, at 7.9. We will thus use the hyperparameter scale_pos_weight when building the XGBoost model, as it changes the weights of minority class instances. In selecting its value for imbalanced sets, one should consider three ratios: between majority and minority class instances in our data set, between the same classes in a real-world setting, and finally between the costs of a false negative and a false positive. We will assume the ratios are similar and equal to the imbalance in our data set, leading us to set scale_pos_weight to 7.9.
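The first of the three ratios can be read directly off the labels. The snippet below is a small sketch with a hypothetical helper, negative_to_positive_ratio, applied to made-up labels chosen so that the ratio matches the 7.9 quoted above. In the article's setting it would presumably be applied to y_train.

```python
def negative_to_positive_ratio(labels):
    """Ratio of majority (0) to minority (1) instances."""
    positives = sum(1 for value in labels if value == 1)
    negatives = sum(1 for value in labels if value == 0)
    if positives == 0:
        raise ValueError("no positive instances, ratio is undefined")
    return negatives / positives


# Toy labels (hypothetical): 79 negatives and 10 positives, so 79 / 10 = 7.9.
toy_labels = [0] * 79 + [1] * 10
ratio = negative_to_positive_ratio(toy_labels)
print(f"candidate scale_pos_weight: {ratio:.1f}")
```

If the real-world class ratio or the relative cost of a false negative differed from this value, the weight would need to be adjusted accordingly rather than copied from the training data.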

xgb_model = Pipeline([("preprocessor", preprocessor), ("model", XGBClassifier(scale_pos_weight=7.9, n_estimators=400,n_jobs=-1, random_state=42))])

After building the model, we perform the hyperparameter tuning for main parameters of the XGBoost model:

ht = GridSearchCV(xgb_model, {
    'model__min_child_weight': [5,10],
    'model__max_depth': [5,10],
    'model__gamma': [0.0,0.1],
    'model__n_estimators':[400],
    'model__scale_pos_weight': [7.9],
    # colsample_bytree must lie in (0, 1]; 0.5 is assumed where the original listed 5
    'model__colsample_bytree': [0.5, 1]}, n_jobs=5, cv=5, scoring="roc_auc", verbose=0)
_=ht.fit(X_train, y_train)
_=xgb_model.set_params(**ht.best_params_)
_=xgb_model.fit(X_train, y_train)
y_pred = xgb_model.predict(X_test)
y_pred_proba = xgb_model.predict_proba(X_test)

Evaluating the model – Confusion Matrix

The trained model achieves a recall of 64% on test data for target=1 (successfully subscribed to deposit):

Figure 8: Confusion Matrix

with an AUC score of 0.8:

Figure 9: ROC curve

Global interpretability

We will next turn to explainability of our trained XGBoost model. First, we want to learn about the global importance of features.

One of the earliest methods introduced for this purpose was permutation feature importance.

Permutation feature importance

The main idea behind this method is to estimate the importance of a feature by calculating the change in the model’s error after randomly permuting the feature’s values, which turns the feature into “noise”. If the random permutation leads to an increase in the model’s error, the feature is important for the model’s predictions.

Permutation feature importance can be misleading in the presence of co-dependent features. If, for example, we had two identical features, then after shuffling one of them the other would still contribute to the prediction. This would lead to the misleading conclusion that these two features are less important than they are.
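The procedure can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the model is a fixed toy function rather than our XGBoost pipeline, the error is the mean squared error rather than log-loss, and the helper name permutation_importance_sketch is hypothetical.

```python
import numpy as np


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Average increase in mean squared error after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffle one column only, breaking its link to the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(np.mean((predict(X_perm) - y) ** 2) - baseline)
        importances[j] = np.mean(increases)
    return importances


def toy_model(X):
    # Hypothetical model: strong use of column 0, weak use of column 1,
    # and no use of column 2 at all.
    return 3.0 * X[:, 0] + 0.5 * X[:, 1]


rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = toy_model(X)

importances = permutation_importance_sketch(toy_model, X, y)
print(importances)
```

Shuffling a column that the model ignores leaves the error unchanged, so its importance is exactly zero, while the heavily used column produces the largest increase in error.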

According to the generated permutation feature importance chart (we used log-loss as the error), the most important feature is euribor3m, followed by consumer_price_index, last_contact_month_oct and consumer_confidence_index.

Figure 10: Permutation feature importance

Examining feature importance with SHAP method

The SHAP method is based on Shapley values, introduced by Shapley in 1953 in the context of coalitional game theory. For more background on Shapley values and SHAP method please refer to the first article in our series on XAI.

Of all XAI methods considered in this article, the SHAP approach has the best theoretical foundations and is the preferred choice in the case of conflicting explanations from different approaches.

The Shapley value of a feature value for a given instance is the contribution of this feature value to the difference between the prediction and the mean baseline prediction of the model.
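For a model with only a few features, Shapley values can be computed exactly by enumerating all feature coalitions. The sketch below does this for a hypothetical three-feature toy model, replacing "absent" features with values from a small background sample. This is one common way to define the value of a coalition and is an assumption here; the SHAP library uses faster, tree-specific algorithms for models such as XGBoost.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, background):
    """Exact Shapley values of instance x relative to a background sample."""
    n = len(x)

    def value(subset):
        # Mean prediction when features in `subset` are fixed to x and
        # the remaining features are taken from each background row.
        total = 0.0
        for row in background:
            z = [x[i] if i in subset else row[i] for i in range(n)]
            total += model(z)
        return total / len(background)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (value(s | {i}) - value(s))
        phi.append(contribution)
    return phi


def model(z):
    # Hypothetical model with one additive term and one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 0.0, 1.0], [1.0, 2.0, 0.0]]
x = [3.0, 1.0, 2.0]

phi = shapley_values(model, x, background)
baseline = sum(model(row) for row in background) / len(background)
print(phi, sum(phi), model(x) - baseline)
```

The contributions add up to the difference between the prediction for x and the baseline prediction, which is the local accuracy property referred to later in this article.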

While the permutation feature importance considered above uses information about the decrease in model error, SHAP is based on values of feature contributions.

We can estimate the global feature importance by averaging the absolute Shapley values of each feature over all instances and then sorting:

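shap.initjs()
# TreeExplainer works on the fitted XGBClassifier, so the pipeline's preprocessing is applied separately
shap_explainer = shap.TreeExplainer(xgb_model.named_steps["model"])
X_train_transformed = preprocessor.transform(X_train)
shap_values = shap_explainer.shap_values(X_train_transformed)
# all_features holds the post-encoding column names; it is built by fetch_features_labels() in the ELI5 section
shap.summary_plot(shap_values, X_train_transformed, plot_type="bar", feature_names=all_features, max_display=20)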

leading to the following plot:

Figure 11: SHAP feature importance

According to the SHAP method, the most important feature is euribor3m, followed by contact_type_cellular, age, contact_number_during_campaign, consumer_confidence_index and days_from_last_contact.
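The ranking in a SHAP bar chart can be reproduced from a matrix of Shapley values by averaging their absolute values per feature. The sketch below uses a tiny made-up matrix; the helper name global_importance, the numbers and the three feature names are purely illustrative and are not results from our model.

```python
def global_importance(shap_rows, feature_names):
    """Rank features by the mean absolute Shapley value over all instances."""
    n = len(shap_rows)
    scores = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical Shapley values: one row per instance, one column per feature.
shap_rows = [
    [0.5, -0.1, 0.0],
    [-0.7, 0.2, 0.1],
    [0.6, -0.3, 0.0],
]
feature_names = ["euribor3m", "age", "job_student"]

ranking = global_importance(shap_rows, feature_names)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Taking absolute values matters: positive and negative contributions of the same feature would otherwise cancel out and understate its importance.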

ELI5

Another library that provides both global and local interpretability results is ELI5. We will use ‘gain’ value for parameter importance_type of explain_weights method, which means that the importance of the feature will be based on the average gain of the feature when it is used in trees.

def fetch_features_labels(preprocessor):
    return_features = []
    for transformers in preprocessor.transformers_:  
        a = transformers[1]
        b = transformers[2]
        for k in a:
            features = []
            check = getattr(k, "categories_", None)
            if check is not None:
                # In scikit-learn 1.2 and later, use k.get_feature_names_out(b) instead
                features.extend(k.get_feature_names(b))
            else:
                features = b
        return_features.extend(features)
    return return_features
all_features=fetch_features_labels(preprocessor)
 
eli5.explain_weights(xgb_model.named_steps["model"], feature_names=all_features,importance_type='gain',top=20)

Figure 12: Features importance calculated with ELI5

Analysis of feature importance results from different methods

All three approaches (SHAP, ELI5 and permutation feature importance) rank euribor3m as the first or second most important feature. Both SHAP and ELI5 place consumer_confidence_index and days_from_last_contact among the top 5 most important features.

We would give the most weight to the results of the SHAP method as it is well grounded in theory and has several excellent properties, such as local accuracy and consistency.

SHAP results point to euribor3m as the most important feature. Given the historical context of our data, the importance of euribor3m derives mainly from the fact that during the financial crisis and the prevalent fear during that time, clients had a higher inclination to save. Other important global features include contact type being cellular, age, number of contacts during campaign, consumer confidence index and days since last contact.

Partial Dependence Plot (PDP)

Global feature importance methods give us information about average importance of features for predicting target variable.

They are, however, lacking directional information about how a given feature influences the target prediction. Does a higher age increase the probability that a person will subscribe to a deposit or decrease it?

A partial dependence plot (PDP) helps us in this regard by showing us how predictions of the model depend on values of a small number of input variables. We will use pdpbox library to generate partial dependence plots:

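# Assumption: xgboost_pdp and X_train_df are not defined in the article; they presumably refer to
# the fitted XGBClassifier and to the preprocessed training data as a DataFrame with columns all_features.
feature_0='age'
pdp_i = pdp.pdp_isolate(model=xgboost_pdp, dataset=X_train_df, model_features=X_train_df.columns, feature=feature_0)
fig, axes = pdp.pdp_plot(pdp_i, feature_0, center=True, plot_lines=True, frac_to_plot=0.2, plot_pts_dist=True,show_percentile=True,x_quantile=True)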

Figure 13: Partial Dependence Plot for feature age

PDP for the feature of age (Figure 13) indicates that the age variable has the most influence on the probability of subscribing to deposit when clients are very old.
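The computation behind a one-way partial dependence curve is simple enough to sketch by hand. In the snippet below, the toy model, the random data and the helper name partial_dependence are hypothetical stand-ins for our XGBoost model and the bank data; pdpbox adds the plotting and the per-instance lines on top of the same idea.

```python
import numpy as np


def partial_dependence(predict, X, feature, grid):
    """Mean prediction when one feature is forced to each grid value."""
    curve = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        curve.append(predict(X_mod).mean())
    return np.array(curve)


def toy_model(X):
    # Hypothetical model: quadratic in column 0, linear in column 1.
    return X[:, 0] ** 2 + X[:, 1]


rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 2))
grid = np.linspace(0.0, 1.0, 5)

pd_curve = partial_dependence(toy_model, X, feature=0, grid=grid)
centered = pd_curve - pd_curve[0]  # comparable to center=True in the plot above
print(centered)
```

Centering the curve at its first grid point removes the average effect of the other features and leaves only the shape of the dependence on the chosen feature.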

2-way Partial Dependence Plots

We can explore the influence of 2 features on target variable as well as interaction between these features by plotting 2-way partial dependence plots.

feature0='consumer_price_index'
feature1='age'
interaction = pdp.pdp_interact(model=xgboost_pdp, dataset=X_train_df, model_features=X_train_df.columns, features=[feature0, feature1])
fig, axes = pdp.pdp_interact_plot(pdp_interact_out=interaction, feature_names=[feature0, feature1], plot_type='contour', x_quantile=True, plot_pdp=True)

PDP for the features age and consumer price index (Figure 14) indicates that a consumer price index (CPI) around 92.5 leads to higher values of the target response variable.

Figure 14: Contour Partial Dependence Plot for age and consumer price index

CPI around 94 makes subscription to the deposit less likely. This can be explained by the relevant economic chart for the monthly CPI of Portugal, which fell from around 94 in May 2008 (the period of lower propensity to save as indicated by data in the earlier discussion) to levels around 92.5 in the period from January 2009 to January 2010 (the period of higher propensity to save).

Dependence plots with SHAP method

Another way to explore the influence of features on the target variable is by using a scatter plot that shows feature values and their Shapley values. In Figure 15 below, the x-axis is the value of the feature and the y-axis is the Shapley value for this feature. A very useful option is to include the interaction effect with other features via the interaction_index parameter.

shap.dependence_plot(1, shap_values, X_train_df, interaction_index='euribor3m')

If we examine how the contact number during campaign influenced the Shapley values and thus the likelihood of subscribing, we see that as the number of calls increased, the effectiveness of the marketing campaign deteriorated. The plot also indicates that during the period of financial crisis (low euribor3m, blue dots) the bank rarely contacted a client more than 15 times.

Figure 15: SHAP dependence plot

Summary plot with SHAP

The SHAP summary plot shows us an even more detailed view of the effect of features. Each point on the SHAP summary plot chart (Figure 16) shows the Shapley value for the feature and an instance. The colour of the point indicates the relative value of the feature (red: high and blue: low). The points are also slightly jittered for better display of distribution of Shapley values.

shap.summary_plot(shap_values, X_train_df,  feature_names=all_features)

Figure 16: SHAP summary plot

As noted previously, high (low) values of euribor3m have a negative (positive) impact on model output.

The contact_type_cellular part of the chart indicates that the cellular contact type (value of 1, shown in red) contributes positively to the likelihood of subscribing. We already noted this effect when examining the distribution plots.

A notable feature is also contact_number_during_campaign. When the number of contacts is high (too many calls to the customer), it has a significant negative impact on the probability that they will subscribe to the deposit.

days_from_last_contact is another feature with a large impact on model outcome. A small value of this feature, indicating recent contact from the previous campaign, significantly increases the likelihood of subscribing. A high number of days from last contact, on the other hand, has a relatively muted, negative impact.

The month of contact has an important impact on the outcome, with October having a positive effect, and May and August a negative one.

A low level of education (basic 4y) also has a significant negative impact on the probability of subscribing to the deposit.

Local Interpretability

Global interpretability of ML models helps stakeholders who are involved in the application of machine learning models.

Data scientists can use its results to better understand and improve their models. Business decision makers can gain insights on what variables are impacting algorithmic decision making in their organizations.

However, clients impacted by the AI models are most interested in explanations of the decision that the AI model has made in their individual case.

LIME method

In the LIME approach, we learn about the importance of features in the prediction for an individual instance (or client) by slightly changing the input and observing the change in the results of the model.

Explanation is thus obtained by approximating our model with one that is linear, interpretable and that is learned locally around our prediction.

subscribed_instances=[i for i,val in enumerate(y_test) if val==1]
user_index=subscribed_instances[7]
 
columns=X_train.columns
categorical_labels = dict()
for feature in categorical_features:
    categorical_labels[columns.get_loc(feature)] = list(set(df[feature].to_list()))
 
def lime_to_dataframe(X, categorical_labels, cols):
    df_rec = pd.DataFrame(X, columns=cols)
    for a, b in categorical_labels.items():
        index_to_label = { i: j for i, j in enumerate(b) }
        df_rec.iloc[:, a] = df_rec.iloc[:, a].map(index_to_label)
    return df_rec
 
def dataframe_to_lime(X, categorical_labels):
    lime_array = X.copy()
    for a, b in categorical_labels.items():
        label_to_index = { j: i for i, j in enumerate(b) }
        lime_array.iloc[:, a] = lime_array.iloc[:, a].map(label_to_index)
    return lime_array
 
explainer = LimeTabularExplainer(dataframe_to_lime(X_train, categorical_labels).values,mode="classification",feature_names=X_train.columns.tolist(),categorical_names=categorical_labels,categorical_features=categorical_labels.keys(),discretize_continuous=True,random_state=42)
X_user = X_test.iloc[[user_index], :]
X_user_lime = dataframe_to_lime(X_user,categorical_labels).values[0]
def predictor(X):
    return xgb_model.predict_proba(lime_to_dataframe(X, categorical_labels, X_train.columns))
 
explanation = explainer.explain_instance(X_user_lime, predictor, num_features=8)
explanation.show_in_notebook(show_table=True, show_all=False)

Figure 17: LIME plot

Commentary of LIME explanation

The plot in Figure 17 shows contributions of most important individual features to the outcome probability. The contributions are based on the weights of the linear model that was fitted to the localised data set.

We have selected a random person from the test data set who has subscribed to the deposit. Feature values for the analysed client are as follows:

Figure 18: Features values for selected client

Our XGBoost model applied on this client’s data calculates the probability for deposit subscription as 0.97 and thus correctly identifies the client as highly likely to subscribe to the deposit.

The most important contributions to the positive prediction for this client were (according to Figure 17):

  • euribor3m - its value indicates that the call was made during a period of economic depression,
  • contact type - it was cellular, which is associated with a higher likelihood of subscription,
  • number of contacts during campaign - it was relatively low (2), which had a positive impact,
  • education (the client’s education is one of the lowest) and consumer price index, both of which had a negative impact.

LIME uses local perturbations of the data set to fit an approximate linear model. It is possible to control the degree of “locality” by changing the kernel_width parameter of LimeTabularExplainer. Decreasing the kernel width makes the LIME explanation more local.
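The effect of the kernel width can be illustrated with a stripped-down local surrogate. The sketch below is not the LIME implementation: it perturbs the instance with Gaussian noise, weights the neighbours with an exponential kernel and fits a weighted linear model. The toy model and the function name local_linear_explanation are hypothetical, and LIME itself additionally handles discretisation and categorical features.

```python
import numpy as np


def local_linear_explanation(predict, x, kernel_width, n_samples=5000, seed=0):
    """Slopes of a distance-weighted linear fit around the instance x."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(size=(n_samples, x.size))  # perturbed neighbours of x
    distances = np.linalg.norm(Z - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Weighted least squares: scale rows by the square root of the weights.
    design = np.hstack([np.ones((n_samples, 1)), Z - x])
    root_w = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(design * root_w[:, None], predict(Z) * root_w, rcond=None)
    return coef[1:]  # drop the intercept, keep one slope per feature


def toy_model(Z):
    # Hypothetical non-linear model; its gradient at x = (1, 0) is (3, 3).
    return Z[:, 0] ** 3 + 3.0 * Z[:, 1]


x = np.array([1.0, 0.0])
narrow = local_linear_explanation(toy_model, x, kernel_width=0.3)
wide = local_linear_explanation(toy_model, x, kernel_width=3.0)
print("narrow kernel:", narrow)
print("wide kernel:  ", wide)
```

With the narrow kernel the fitted slope for the first feature stays close to the true local gradient, while the wide kernel lets distant points pull it towards the average trend of the cubic term, so the explanation becomes less local.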

SHAP method

Finally, we will examine the results of the SHAP approach.

The SHAP method is rigorous and has several advantageous theoretical properties: local accuracy, consistency and contrastive explanations. Its explanations are also considerably more stable with respect to variations in input variables than some other methods (see discussion in part one of our XAI series). It is a preferred approach for many XAI practitioners.

For analysis, we have selected the example person that was also used for LIME results. We will set the link parameter to ‘logit’ to convert log-odds values to probabilities and make it easier to interpret results.
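The role of the logit link can be shown with plain arithmetic. For a binary XGBoost classifier the Shapley values are additive in log-odds, so the baseline and the contributions are summed first and the result is then mapped to a probability. All numbers and feature contributions below are hypothetical and do not come from our model.

```python
import math


def sigmoid(log_odds):
    """Inverse of the logit function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical baseline (expected model output) and per-feature contributions,
# all expressed in log-odds.
base_log_odds = -2.0
contributions = {
    "euribor3m": 2.5,
    "days_from_last_contact": 1.8,
    "education": -0.4,
}

total_log_odds = base_log_odds + sum(contributions.values())
probability = sigmoid(total_log_odds)
print(f"log-odds={total_log_odds:.2f}, probability={probability:.3f}")
```

Because the mapping is non-linear, the same contribution in log-odds changes the probability by a different amount depending on where the other contributions have already moved the prediction.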

user_index=subscribed_instances[7]
explainer = shap.TreeExplainer(xgb_model.named_steps["model"])
shap_values = explainer.shap_values(preprocessor.transform(X_test)[user_index])
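# Assumption: X_test_df (not defined in the article) holds the preprocessed test data as a DataFrame with columns all_features
shap.force_plot(explainer.expected_value, shap_values, X_test_df.iloc[user_index,:],link="logit")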

Figure 19: SHAP analysis for individual prediction

The most important contribution to the positive outcome is again the value of euribor3m, which indicates that the call was made during a period of economic depression. Other feature values that positively contribute to the positive outcome (subscription to deposit) are:

  • only a few days since last contact during previous campaign (6 days)
  • low age (the PDP chart also indicates that young clients have a more positive contribution to the probability of subscription than most other age groups, except for the group of very old clients)
  • outcome from previous campaign was a success
  • person is a student

A negative impact on the likelihood of subscription comes from the number of previous contacts, which was 3 (thus a bit high), and from relatively low education (basic 4y).

Conclusion

In this article, we built and trained an XGBoost machine learning model on an open-source data set to predict the success of bank telemarketing calls.

We used several XAI methods to present global interpretability results - which features are most important for the predictions of our ML model, considering all data instances.

Global interpretation of AI models can help data scientists better understand and improve their models. It also allows business decision-makers to gain insights on what variables are impacting algorithmic decision making in their organizations.

We also examined local interpretability methods, and showed how predictions of the ML model for individual customers can be explained in terms of the feature values of these customers.

Local interpretability frameworks are becoming an essential tool to help provide explanation of decisions made by AI models to clients impacted by those decisions. They also help financial and other organizations to comply with “Right to Explanation” regulations and other laws.

The influence of AI decision making on our lives will significantly widen in the future. For organizations to earn and maintain the trust of humans in this new type of decision making, a comparably strong effort will be required in both the development and the effective use of Explainable Artificial Intelligence solutions in client-facing interactions.

Written by:

Samo Plibersek, Ph.D.