We had two logistic models to predict loan default.
Model 1 used the default settings:
Model 2 "improved" on this by using one possible fix for imbalanced classes:
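As a sketch, the two candidate pipelines might look like this. The names `clf_logit_1` and `clf_logit_2` are mine, and I'm using a bare `StandardScaler` as a stand-in for whatever preprocessing pipeline (`preproc_pipe`) was built earlier:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# placeholder for the real preprocessing pipeline built earlier
preproc_pipe = StandardScaler()

# Model 1: default settings
clf_logit_1 = make_pipeline(preproc_pipe, LogisticRegression())

# Model 2: reweight observations so the rare default class counts more
clf_logit_2 = make_pipeline(preproc_pipe,
                            LogisticRegression(class_weight='balanced'))
```

The only difference between the two is the `class_weight='balanced'` argument, which upweights the minority (default) class during fitting.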
Make (reasonable) assumptions as needed to compare the two models:
Set up a GridSearchCV and then .fit() it. Repeat the above steps as needed, adjusting the parameter grids to home in on the best models, until you've found an optimized model.
I've loaded the LC data. Let's set up the model:
clf_logit = make_pipeline(preproc_pipe,
                          LogisticRegression(class_weight='balanced'))
When I use cross_validate or GridSearchCV, I need to pick a scorer to optimize on.
Let's score the models directly on profit. Following the docs, I'll make a custom scorer:
# define the profit function
def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    '''
    Firm profit is this times the average loan size. We can
    ignore that term for the purposes of maximization.
    '''
    TN = sum((y_pred == 0) & (y == 0))  # count loans made and actually paid back
    FN = sum((y_pred == 0) & (y == 1))  # count loans made and actually defaulting
    return TN*roa - FN*haircut
# so that we can use the function in sklearn, "make a scorer" out of it
from sklearn.metrics import make_scorer
prof_score = make_scorer(custom_prof_score)
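As a quick sanity check on the scoring logic, here's the profit function on a toy example (repeating the definition so the snippet stands alone; the four-observation arrays are made up):

```python
import numpy as np

def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    TN = sum((y_pred == 0) & (y == 0))  # loans made and repaid
    FN = sum((y_pred == 0) & (y == 1))  # loans made that default
    return TN*roa - FN*haircut

# toy data: y = 1 means default; the model "makes the loan" where y_pred == 0
y      = np.array([0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1])

# 2 repaid loans earn 2% each; 1 approved loan defaults and loses 20%:
# 2*0.02 - 1*0.20 = -0.16
score = custom_prof_score(y, y_pred)
```

Note that false positives (loans we refused that would have been repaid) cost us nothing directly; only the loans we actually make enter the profit calculation.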
In this example, we will see if the regularization parameter C in logistic regression matters. Here, I'm going to try a lot of small values, and then some higher values.
parameters = {'logisticregression__C':
              list(np.linspace(0.00001, .5, 25)) + list(np.linspace(0.55, 5.55, 6))}

grid_search = GridSearchCV(estimator=clf_logit,
                           param_grid=parameters,
                           cv=5,
                           scoring=prof_score)
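To see the grid this actually builds, we can inspect the list (values below are what the two `linspace` calls produce):

```python
import numpy as np

parameters = {'logisticregression__C':
              list(np.linspace(0.00001, .5, 25)) + list(np.linspace(0.55, 5.55, 6))}

grid = parameters['logisticregression__C']
# 25 closely spaced values from 0.00001 to 0.5, then 6 coarser steps up to 5.55,
# for 31 candidate values of C in total
```

The dense low end matters because small C means strong regularization, which is where the model's behavior changes fastest.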
After fitting the grid_search, we examine performance across the parameter grid.
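A minimal end-to-end sketch of that inspection step, using a synthetic imbalanced dataset as a stand-in for the LC data and a plain `StandardScaler` in place of `preproc_pipe` (both assumptions, not the real setup):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in: ~80% non-default, ~20% default
X, y = make_classification(n_samples=500, weights=[0.8], random_state=0)

def custom_prof_score(y, y_pred, roa=0.02, haircut=0.20):
    TN = sum((y_pred == 0) & (y == 0))
    FN = sum((y_pred == 0) & (y == 1))
    return TN*roa - FN*haircut

prof_score = make_scorer(custom_prof_score)

clf_logit = make_pipeline(StandardScaler(),
                          LogisticRegression(class_weight='balanced'))

# a small grid just to illustrate the inspection step
parameters = {'logisticregression__C': np.linspace(0.01, 1, 5)}
grid_search = GridSearchCV(estimator=clf_logit,
                           param_grid=parameters,
                           cv=5,
                           scoring=prof_score)
grid_search.fit(X, y)

# best C, plus mean CV profit at every grid point
best = grid_search.best_params_
results = pd.DataFrame(grid_search.cv_results_)[
    ['param_logisticregression__C', 'mean_test_score']]
```

Plotting `mean_test_score` against C (or just eyeballing the `results` table) shows whether profit is still rising at the edge of the grid, which is the signal to widen or refine the grid and refit.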