Inconsistent estimation under correlation #232

Open
ruedika opened this issue May 10, 2021 · 2 comments

@ruedika

ruedika commented May 10, 2021

Hi,
I played around with some simulated GAM (generalized additive model) data with correlated features and found that EBM sometimes arrives at wrong conclusions about the risk-driver mappings. Small changes to the data may "fix" the issue, so I cannot say that I see this erroneous fit consistently. It looks as if the mapping for a single risk driver ends up somewhere between the actual mapping and the marginal conditional distribution (something like P(Y=1 | X1=x))?

Code Example

import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from scipy.stats import multivariate_normal
from interpret import show
from sklearn.metrics import roc_auc_score as auc
from pygam import LogisticGAM, s
import matplotlib.pyplot as plt

##construct data
np.random.seed(123)
n=100000
rho=0.9
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(np.pi+df.x2)))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'
auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2)
df=df.drop(df_test.index)

##EBM
clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0)
clf.fit(df[cat+num],df[target])
df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1]
auc(df_test[target],df_test.pred_ebm)
##AUC is less than expected

##X1 is not a line, X3 is not flat
ebm_global = clf.explain_global()
show(ebm_global)
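
##a rough sketch of one way to eyeball the "in between" hypothesis above:
##bin x1 and compare the empirical marginal log-odds of tar against the true
##x1 component (slope 1, intercept -2, with the x2 term set to 0); under
##rho=0.9 the marginal curve also absorbs part of the x2/x3 effect
bins=pd.qcut(df['x1'],50)
marginal=df.groupby(bins,observed=True)['tar'].mean()
centers=np.array([iv.mid for iv in marginal.index])
plt.plot(centers,np.log(marginal/(1-marginal)),label='empirical marginal log-odds')
plt.plot(centers,centers-2,label='true x1 component (x2 term at 0)')
plt.xlabel('x1')
plt.ylabel('log-odds')
plt.legend()
plt.show()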

##compare to PyGAM results
gam=LogisticGAM(s(0)+s(1)+s(2))
gam.fit(df[cat+num],df[target])
df_test['pred_pygam']=gam.predict_proba(pd.concat([np.clip(df_test[c],df[c].min(),df[c].max()) for c in cat]+[df_test[num]],axis=1))
auc(df_test[target],df_test.pred_pygam)

for i, term in enumerate(gam.terms):
    if term.isintercept:
        continue

    XX = gam.generate_X_grid(term=i)
    pdep, confi = gam.partial_dependence(term=i, X=XX, width=0.95)

    plt.figure()
    plt.plot(XX[:, term.feature], pdep)
    plt.plot(XX[:, term.feature], confi, c='r', ls='--')
    plt.title(repr(term))
    plt.show()

##change data slightly (delete pi)
np.random.seed(123)
n=100000
rho=0.9
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(df.x2))) # deleted np.pi
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'
auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2)
df=df.drop(df_test.index)

##EBM
clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0)
clf.fit(df[cat+num],df[target])
df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1]
auc(df_test[target],df_test.pred_ebm)

##X1 is a line, X2 is not as sinusoidal as expected, X3 still not flat
ebm_global = clf.explain_global()
show(ebm_global)

@interpret-ml
Collaborator

Hi @ruedika,

Thanks for the insightful question, and for taking the time to provide code! Unfortunately, no GAM fitting method can universally approximate all possible generator functions, and EBMs are no exception. As you've noticed, when the generator components are very simple and smooth (e.g. linear or sinusoidal), tree-based methods have a harder time than splines at sufficiently small sample sizes. However, in practice we've found that real data tends to be far messier, with occasional large step-function-like changes at significant thresholds (e.g. zero vs. non-zero is often highly significant in many datasets). In both benchmarks we've run and findings from the literature, tree-based methods tend to represent complex datasets better and outperform splines.

EBMs also differ from other tree-based GAM learning methods. Here's both EBM and a depth=1 LightGBM on your dataset (with rho=0.7 to highlight the differences a bit better):

[image: learned shape functions for x1, x2, x3 from EBM vs. a depth=1 LightGBM, rho=0.7]
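
A comparison along these lines can be reproduced with something like the sketch below (illustrative only, not necessarily the exact code behind the figure; the hyperparameters are placeholders, and it assumes the lightgbm package plus a recent scikit-learn for PartialDependenceDisplay):

import lightgbm as lgb
from sklearn.inspection import PartialDependenceDisplay

# each depth-1 tree splits on a single feature, so the ensemble is itself additive
lgb_clf = lgb.LGBMClassifier(max_depth=1, n_estimators=5000, learning_rate=0.01)
lgb_clf.fit(df[cat+num], df[target])

# one-dimensional partial dependence (on the probability scale here) gives a
# view of each learned per-feature shape
PartialDependenceDisplay.from_estimator(lgb_clf, df[cat+num], features=num)
plt.show()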

While x1 and x2 are (arguably) better represented by the EBM, LightGBM is able to completely ignore the redundant x3. Spreading credit across correlated features instead of dropping them is a deliberate design choice in EBMs, and it does come with tradeoffs. For example, here's a slight modification of your problem where I added a new feature "x4" that is perfectly correlated with "x2" and used equivalently in the label generation:

np.random.seed(123)
n=100000
rho=0.7
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
df['x4'] = df['x2']
num=['x1','x2','x3', 'x4']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-np.sin(np.pi+df.x2)-np.sin(np.pi+df.x4)))
df['tar']=np.random.binomial(n=1,p=df.prob_true)

And the corresponding learned functions of both models:
[image: learned shape functions for x1 through x4 from EBM vs. the depth=1 LightGBM]

EBM in this case learns almost identical functions for x2 and x4, whereas "greedy" tree-based methods will arbitrarily choose one of the two features to put all of the probability mass on. We find that proportionally allocating credit under high correlation (as opposed to putting as much mass into as few features as possible) helps users reason about the model and find bugs in their data (e.g. redundant copies of features that may cause problems later). Of course, we could still do better when it comes to regularizing redundant features out, and we're actively exploring research in this direction.
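
The near-identical x2/x4 shapes can be pulled straight out of the fitted model's global explanation, roughly like this (a sketch: it refits on the x1..x4 data above, and the exact keys of the explanation's data dict may vary a little across interpret versions):

clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0)
clf.fit(df[cat+num], df[target])
ebm_global = clf.explain_global()

for name in ['x2', 'x4']:
    shape = ebm_global.data(num.index(name))   # per-feature bins and scores
    scores = shape['scores']
    edges = shape['names'][:len(scores)]       # align bin labels with scores
    plt.step(edges, scores, where='post', label=name)
plt.xlabel('feature value')
plt.ylabel('learned term score (log-odds)')
plt.legend()
plt.show()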

If you're curious about looking into this further, many of these topics are explored in greater detail by Kingsley Chang and his co-authors in this paper: https://arxiv.org/pdf/2006.06466.pdf . In it, they rigorously benchmark how well different GAM fitting methods (including splines, XGBoost-based trees, EBMs, and variants of linear models) learn a variety of additive generator functions.

Hope this helps, and happy to continue the discussion further!
-InterpretML Team

@ruedika
Author

ruedika commented May 21, 2021

Hi,
Thanks for taking the time and for your comprehensive answer!
But I think the issue I've mentioned is still one of convergence. To speed up convergence in the above example, I've played with the learning rate.
First to set up the experiment again:

import numpy as np
import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier
from scipy.stats import multivariate_normal
from interpret import show
from sklearn.metrics import roc_auc_score as auc
from pygam import LogisticGAM, s
import matplotlib.pyplot as plt

np.random.seed(123)
n=100000
rho=0.9
df=pd.DataFrame(multivariate_normal(mean=[0,0,0],cov=[[1,rho,rho],[rho,1,rho],[rho,rho,1]]).rvs(size=n),columns=['x1','x2','x3'])
num=['x1','x2','x3']
cat=[]
df['prob_true']=1/(1+np.exp(2-df.x1-2*np.sin(np.pi+df.x2)))
df['tar']=np.random.binomial(n=1,p=df.prob_true)
target='tar'
auc(df.tar,df.prob_true)

df_test=df.sample(frac=0.2)
df=df.drop(df_test.index)

Note that the AUC of a good model should be around 0.69 (model predictions mimicking the true underlying probabilities). The EBM set up as in my first post got 0.64. So I've fitted EBMs with varying learning rates:
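
(For reference, that 0.69 is just the AUC of the true generating probabilities themselves, which can be checked directly on the held-out split:)

auc(df_test[target], df_test['prob_true'])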

auc_list=[]
for lr in np.arange(0.01,0.2,0.01):
    clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0, learning_rate=lr)
    clf.fit(df[cat+num],df[target])
    df_test['pred_ebm']=clf.predict_proba(df_test[cat+num])[:,1]
    auc_list.append(auc(df_test[target],df_test.pred_ebm))

plt.plot(np.arange(0.01,0.2,0.01),auc_list)
plt.xlabel('learning rate')
plt.ylabel('AUC')

If you plot this (sorry, I can't upload) you'll see that starting from a learning rate of 0.05, EBM actually gets the right shapes. You can also verify this directly:

clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0, random_state=0, learning_rate=0.1)
clf.fit(df[cat+num],df[target])

ebm_global = clf.explain_global()
show(ebm_global)

Overall I think that the round-robin boosting has convergence issues for this pathological dataset. One way to amend this, as you've done above, is to lower the correlation. How to deal with this within the algorithm, I don't know. Maybe default learning-rate schedules (instead of a fixed lr=0.01) could fix some cases?
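
For what it's worth, a crude user-side stand-in for such a schedule would be to pick the learning rate with the best AUC from a sweep like the one above and refit with it (just a sketch; ideally the selection would use a separate validation split rather than the test set):

lrs = np.arange(0.01, 0.2, 0.01)
best_lr = lrs[int(np.argmax(auc_list))]   # auc_list from the sweep above
clf = ExplainableBoostingClassifier(n_jobs=-1, interactions=0,
                                    random_state=0, learning_rate=best_lr)
clf.fit(df[cat+num], df[target])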

Thanks & also happy continue discussion!
