Today:¶

  1. What to optimize (the "scoring" parameter in cross_validate and GridSearchCV)
  2. How to optimize

What to optimize¶

Last time on "LendingClub2013"...¶

We fit two logistic regression models to predict loan default.

Model 1 used the default settings:

(figure: confusion matrix for Model 1)

Model 2 "improved" on this by using one possible fix for imbalanced classes:

(figure: confusion matrix for Model 2)

Exercises¶

Make (reasonable) assumptions as needed to compare the two models:

  1. How much money does this improved model save us, relative to the prior model, through reduced charge-offs?
  2. How much foregone profit does this "improved" model have relative to the prior model?
  3. How does this improved model do from the standpoint of a profit-maximizing lender? That is: would you rather use the first or the second model, and why?
  4. Write down a profit function for the firm using the cells of the confusion matrix. (The four cells of the matrix are TN, TP, FN, and FP.)
  5. Based on your profit function, which metric(s) do you want to maximize or minimize in this model? There might not be a clean answer; if so, discuss the candidates, what they capture correctly about the profit function, and what they miss.

How to optimize (hyperparameters)¶

As the book says,

  1. Set up your parameter grids: Start with a wide (and sparse) net.
  2. Set up GridSearchCV and then .fit() it.
  3. Plot the performance of the models.

Repeat the above steps as needed, adjusting the parameter grids to home in on the best models, until you've found an optimized model.

Let's try that...¶

I've loaded the LC data. Let's set up the model:

In [3]:
clf_logit = make_pipeline(preproc_pipe, 
                          LogisticRegression(class_weight='balanced'))
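As a quick aside on what class_weight='balanced' does: it reweights each class by n_samples / (n_classes * count of that class), so the rare class (defaults) counts for more in the loss. A small sanity check on toy labels (made up here, not the LC data):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# toy labels: 90% repaid (0), 10% default (1) -- illustrative, not the LC data
y = np.array([0] * 90 + [1] * 10)

# 'balanced' weight for class k = n_samples / (n_classes * count_k)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]),
                               y=y)
print(weights)  # rare defaults get ~9x the weight of repaid loans
```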

When I use cross_validate or GridSearchCV, I need to pick a scorer to optimize on.

Let's score the models directly on profit. Following the docs, I'll make a custom scorer:

In [4]:
# define the profit function

def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    '''
    Firm profit is this times the average loan size. We can
    ignore that term for the purposes of maximization. 
    '''
    TN = sum((y_pred==0) & (y==0)) # count loans made and actually paid back
    FN = sum((y_pred==0) & (y==1)) # count loans made and actually defaulting
    return TN*roa - FN*haircut

# so that we can use the fcn in sklearn, "make a scorer" out of that function

from sklearn.metrics import make_scorer
prof_score = make_scorer(custom_prof_score)
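Before trusting the scorer inside a grid search, it's worth a hand check on a tiny example. The four observations below are ones I made up for illustration:

```python
import numpy as np

def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    # same profit function as above (ignoring the average-loan-size scale)
    TN = sum((y_pred == 0) & (y == 0))  # loans made and repaid
    FN = sum((y_pred == 0) & (y == 1))  # loans made that defaulted
    return TN * roa - FN * haircut

y      = np.array([0, 0, 1, 1])  # truth: two repaid, two defaulted
y_pred = np.array([0, 1, 0, 1])  # predictions: one TN, one FN (plus FP, TP)

# TN = 1, FN = 1, so profit = 1*0.02 - 1*0.20 = -0.18
print(custom_prof_score(y, y_pred))
```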

In this example, we will see if the regularization parameter C in logistic regression matters.

Here, I'm going to try a lot of small values, and then some higher values.

In [5]:
parameters = {'logisticregression__C':
              list(np.linspace(0.00001, 0.5, 25)) +
              list(np.linspace(0.55, 5.55, 6))}

grid_search = GridSearchCV(estimator = clf_logit, 
                           param_grid = parameters,
                           cv = 5, scoring=prof_score
                           )
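Fitting the grid search and pulling out the scores might look like the sketch below. I substitute a small synthetic dataset for the LC data and drop the preprocessing pipe, so X, y, and the dataset itself are stand-ins, not the real setup:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    # profit function from above
    TN = sum((y_pred == 0) & (y == 0))
    FN = sum((y_pred == 0) & (y == 1))
    return TN * roa - FN * haircut

prof_score = make_scorer(custom_prof_score)

# synthetic, imbalanced stand-in for the LC data
X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)

grid_search = GridSearchCV(
    estimator=LogisticRegression(class_weight='balanced', max_iter=1000),
    param_grid={'C': np.linspace(0.01, 2, 10)},
    cv=5, scoring=prof_score)
grid_search.fit(X, y)

# cv_results_ holds the mean/std test score for every C tried;
# this is the table you'd plot in step 3
results = pd.DataFrame(grid_search.cv_results_)
print(grid_search.best_params_)
print(results[['param_C', 'mean_test_score', 'std_test_score']])
```

Plotting mean_test_score (with std_test_score bands) against C is the usual way to eyeball where to tighten the grid on the next pass.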

After fitting the grid_search, we examine the performance:

  • Do you like A or B more?
  • How should we adjust the parameters grid?

IMO:¶

  • Option A has nearly the same performance on average, with a reasonable reduction in variance
  • Option A is probably not the optimal value: Adjust the grid around Option A to look for possible improvements
  • To get large (not marginal) gains: We need to upgrade the model
    • Add X vars?
    • Different estimator?

Next time on "LendingClub2013"...¶

  • Upgrading this model