Common joke: 80% of ML is just data cleaning and processing
Even when outliers aren't data errors, they can exert outsized influence on many estimators. Thus, scaling can be ESSENTIAL for some models: https://youtu.be/9rBc3rTsJsY?t=195
Options to deal with outliers and scaling: sklearn's built-in scalers and transformers, or a custom function wrapped in `FunctionTransformer`. Can often improve models greatly!
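A minimal sketch of the `FunctionTransformer` route. Here, `winsorize_df` is a hypothetical helper I'm defining for illustration (the worksheet may handle outliers differently):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def winsorize_df(X):
    """Hypothetical helper: clip each column at its 1st/99th percentiles."""
    X = pd.DataFrame(X)
    return X.clip(lower=X.quantile(0.01), upper=X.quantile(0.99), axis=1)

# Wrap the custom function so it can slot into an sklearn pipeline
winsorizer = FunctionTransformer(winsorize_df)

# Example: winsorize, then standardize a skewed, outlier-prone column
X = pd.DataFrame({'income': np.random.lognormal(mean=10, size=500)})
X_clean = StandardScaler().fit_transform(winsorizer.fit_transform(X))
```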
I'm going to load the classic Titanic dataset and try each of these methods.
**OLS: Predicting Titanic survival**

| Model includes | Note | R2 |
|---|---|---|
| Age + Gender | Baseline | 0.2911 |
| Child + Gender | Binning | 0.2994 |
| Child + Gender + Interaction | Interaction | 0.3093 |
| Poly deg 3 of (Age + Gender) | Polynomial | 0.3166 |
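A hedged sketch of how specifications like these could be estimated (using seaborn's copy of the Titanic data; exact numbers depend on the sample and variable coding, so they won't necessarily match the table):

```python
import seaborn as sns
import statsmodels.formula.api as smf

titanic = sns.load_dataset('titanic')
titanic['male'] = (titanic['sex'] == 'male').astype(int)
titanic['child'] = (titanic['age'] < 18).astype(int)  # binning age; missing ages coded as non-child here

# Baseline, binned, and interaction specifications
print(smf.ols('survived ~ age + male', data=titanic).fit().rsquared)
print(smf.ols('survived ~ child + male', data=titanic).fit().rsquared)
print(smf.ols('survived ~ child * male', data=titanic).fit().rsquared)  # adds child:male interaction
```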
sklearn has several tools to pick a subset of variables: `SelectKBest`, `RFE`, `SelectFromModel`, and `SequentialFeatureSelector` (SFS).
`'passthrough'`: these are do-nothing placeholder steps.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# preproc_pipe (imputation, scaling, one-hot encoding) is defined earlier
pipe = Pipeline([('columntransformer', preproc_pipe),
                 ('feature_create', 'passthrough'),
                 ('feature_select', 'passthrough'),
                 ('clf', LogisticRegression(class_weight='balanced'))
])
```
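Why placeholders? Because you can later swap a real transformer into a slot by its step name. A small illustration (not from the worksheet):

```python
from sklearn.decomposition import PCA

# Replace the 'feature_select' placeholder with a real transformer;
# 'passthrough' restores the do-nothing behavior
pipe.set_params(feature_select=PCA(n_components=10))
pipe.set_params(feature_select='passthrough')
```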
Visually:
```text
Pipeline(steps=[
  ('columntransformer',
   ColumnTransformer(transformers=[
     ('pipeline-1',
      Pipeline(steps=[('simpleimputer', SimpleImputer()),
                      ('standardscaler', StandardScaler())]),
      ['annual_inc', 'dti', 'fico_range_high', 'fico_range_low',
       'installment', 'int_rate', 'loan_amnt', 'mort_acc', 'open_acc',
       'pub_rec', 'pub_rec_bankruptcies', 'revol_bal', 'revol_util',
       'total_acc']),
     ('pipeline-2',
      Pipeline(steps=[('onehotencoder', OneHotEncoder())]),
      ['grade'])])),
  ('feature_create', 'passthrough'),
  ('feature_select', 'passthrough'),
  ('clf', LogisticRegression(class_weight='balanced'))])
```
First thing I tried: More variables (21 in total)
... But it does worse? Any guesses why?
| | param_columntransformer | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 3 vars (Last class) | -2.70 | 5.508829 |
| 1 | All numeric + Grade | -3.84 | 4.428237 |
Next thing I'll try is feature selection. First up: PCA.
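A sketch of how this comparison might be set up, assuming the `pipe` above plus `X_train`, `y_train`, and the `custom_prof_score` scorer from the lecture code:

```python
from sklearn.decomposition import PCA
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

# Candidate fillers for the 'feature_select' slot
param_grid = {'feature_select': ['passthrough',
                                 PCA(n_components=5),
                                 PCA(n_components=10),
                                 PCA(n_components=15)]}

grid = GridSearchCV(pipe, param_grid, cv=5,
                    scoring=make_scorer(custom_prof_score))
grid.fit(X_train, y_train)
```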
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 2 | All numeric vars + Grade | `PCA(n_components=5)` | -27.768 | 5.066180 |
| 3 | All numeric vars + Grade | `PCA(n_components=10)` | -8.232 | 3.808592 |
| 4 | All numeric vars + Grade | `PCA(n_components=15)` | -2.452 | 3.665048 |
Next up: `SelectKBest` and `SelectFromModel`.

These don't seem to be the answer here. I could try other `SelectFromModel` options, but let's move on.
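For reference, the candidates in the table below could be expressed as grid entries like this (same assumed setup as the PCA search above):

```python
from sklearn.feature_selection import SelectKBest, SelectFromModel, f_classif
from sklearn.linear_model import LassoCV
from sklearn.svm import LinearSVC

param_grid = {'feature_select': [
    SelectKBest(f_classif, k=15),                        # univariate screening
    SelectFromModel(estimator=LassoCV()),                # keep nonzero lasso coefs
    SelectFromModel(estimator=LinearSVC(class_weight='balanced',
                                        dual=False, penalty='l1'),
                    threshold='median'),                 # keep coefs above the median
]}
```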
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | `PCA(n_components=15)` | -2.452 | 3.665048 |
| 7 | All numeric vars + Grade | `SelectKBest(k=15)` | -2.580 | 5.209683 |
| 8 | All numeric vars + Grade | `SelectFromModel(estimator=LassoCV())` | -3.892 | 4.542028 |
| 9 | All numeric vars + Grade | `SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), threshold='median')` | -6.736 | 5.610268 |
Next up: `RFE` (recursive feature elimination).
Still no great answer.
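For reference, a hedged sketch of the RFECV candidates behind rows 10 and 11 (again assuming `custom_prof_score` from the worksheet):

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.svm import LinearSVC

# Recursively drop the weakest features; internal CV picks how many to keep
param_grid = {'feature_select': [
    RFECV(cv=2, estimator=LinearSVC(class_weight='balanced',
                                    dual=False, penalty='l1'),
          scoring=make_scorer(custom_prof_score)),
    RFECV(cv=2, estimator=LogisticRegression(class_weight='balanced'),
          scoring=make_scorer(custom_prof_score)),
]}
```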
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | `PCA(n_components=15)` | -2.452 | 3.665048 |
| 10 | All numeric vars + Grade | `RFECV(cv=2, estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), scoring=make_scorer(custom_prof_score))` | -5.924 | 4.217969 |
| 11 | All numeric vars + Grade | `RFECV(cv=2, estimator=LogisticRegression(class_weight='balanced'), scoring=make_scorer(custom_prof_score))` | -5.836 | 7.021697 |
Last up: `SequentialFeatureSelector`.
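A sketch of these grid entries (greedy forward selection; same assumptions as before):

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer

# Greedily add features one at a time until n_features_to_select are chosen
param_grid = {'feature_select': [
    SequentialFeatureSelector(estimator=LogisticRegression(class_weight='balanced'),
                              n_features_to_select=n, cv=2,
                              scoring=make_scorer(custom_prof_score))
    for n in (5, 10, 15)
]}
```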
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | `PCA(n_components=15)` | -2.452 | 3.665048 |
| 12 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score))` | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score))` | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score))` | 7.640 | 3.884904 |
And finally, we can create interaction terms within a pipeline. However, this doesn't work well here, especially when combined with PCA. A sketch of the setup follows.
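A sketch of filling both placeholder slots at once (interactions first, optional PCA after):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

param_grid = {
    'feature_create': [PolynomialFeatures(interaction_only=True)],
    'feature_select': ['passthrough',
                       PCA(n_components=15),
                       PCA(n_components=25)],
}
```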
```python
pretty.iloc[[0,1,-3,-2,-1],[0,2,1,3,4]]
```
| | param_columntransformer | param_feature_create | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|---|
| 0 | 3 vars (Last class) | | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | | -3.840 | 4.428237 |
| 15 | All numeric vars + Grade | `PolynomialFeatures(interaction_only=True)` | | -3.504 | 8.125135 |
| 16 | All numeric vars + Grade | `PolynomialFeatures(interaction_only=True)` | `PCA(n_components=15)` | -54.544 | 3.926330 |
| 17 | All numeric vars + Grade | `PolynomialFeatures(interaction_only=True)` | `PCA(n_components=25)` | -35.980 | 5.287525 |
Let's just go back to our happy place and stare at this happy thing:
```python
pretty.iloc[[0,1,4,12,13,14],[0,1,3,4]]
```
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | `PCA(n_components=15)` | -2.452 | 3.665048 |
| 12 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score))` | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score))` | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | `SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score))` | 7.640 | 3.884904 |
Are you predicting returns in your project???
Do NOT estimate any model like this:
$$ y_{t} = f(X_t) $$

You need to use today's info ($X_t$) to predict tomorrow's return ($y_{t+1}$):

$$ y_{t+1} = f(X_t) $$

Practically, this means:
Assuming `df` is sorted by asset, then time period:

```python
ret_tp1 = df.groupby('asset')['ret'].shift(-1)
```
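A tiny self-contained demo of why the `groupby` matters (made-up numbers):

```python
import pandas as pd

df = pd.DataFrame({'asset':  ['A', 'A', 'A', 'B', 'B', 'B'],
                   'period': [1, 2, 3, 1, 2, 3],
                   'ret':    [0.01, 0.02, 0.03, -0.01, -0.02, -0.03]})

# shift(-1) WITHIN each asset, so asset A's last row never borrows
# asset B's first return; it becomes NaN instead
df['ret_tp1'] = df.groupby('asset')['ret'].shift(-1)
print(df)
```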
- Play with the `param_grid` and create your own. (Good practice!) Try PCA, and then Poly(2).
- What if you don't want the `best_estimator_`? (Perhaps you prefer the model without the absolute highest mean test score!) Pick a model that isn't the `best_estimator_`, and figure out how you can use that model (how to `.fit()` it and `.predict()` with it). A sketch of this follows below.

The code for today's lecture is inside `handouts/ML/ML_worksheet3.ipynb`.
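For that last exercise, a minimal sketch of refitting a model you picked out of `cv_results_` yourself (the row index 13 is purely illustrative; `grid`, `pipe`, and the train/holdout data are assumed from the worksheet):

```python
from sklearn.base import clone

# Pull the parameter dict for the (non-best) model you prefer
chosen_params = grid.cv_results_['params'][13]   # 13 is illustrative

# Rebuild a fresh copy of the pipeline with those params, then fit/predict
model = clone(pipe).set_params(**chosen_params)
model.fit(X_train, y_train)
y_pred = model.predict(X_holdout)
```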