Common joke: 80% of ML is just data cleaning and processing

Even when outliers aren't data errors, they can still dominate a model's fit.
Thus, scaling can be ESSENTIAL for some models: https://youtu.be/9rBc3rTsJsY?t=195
Options to deal with outliers and scaling: winsorizing/clipping, scaling transformers, and custom transformations via sklearn's FunctionTransformer.
These can often improve models greatly!
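A minimal sketch of the idea, assuming purely numeric inputs (the clipping function, percentile, and pipeline name here are just illustrative, not the lecture's exact choices):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

def clip_upper_tail(X, q=99):
    """Cap each column at its q-th percentile (a crude one-sided winsorize).
    Note: this recomputes the cap on whatever data it sees; a cap learned only on
    the training fold would need a small custom transformer instead."""
    return np.minimum(X, np.nanpercentile(X, q, axis=0))

# cap outliers, compress right skew with log1p (only sensible for non-negative columns),
# then put everything on one scale
num_pipe = make_pipeline(FunctionTransformer(clip_upper_tail),
                         FunctionTransformer(np.log1p),
                         StandardScaler())
```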
I'm going to load the classic Titanic dataset and try each of these methods.
OLS: Predicting Titanic survival

| Model includes | Note | R2 |
|---|---|---|
| Age + Gender | Baseline | 0.2911 |
| Child + Gender | Binning X | 0.2994 |
| Child + Gender + Interaction | Interaction | 0.3093 |
| poly deg 3 of (Age + Gender) | Polynomial | 0.3166 |
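A hedged sketch of how rows like these could be produced. The exact variable construction (the child cutoff, the polynomial terms) isn't shown above, so the formulas and column names below are assumptions and the R2s will differ:

```python
import seaborn as sns
import statsmodels.formula.api as smf

# seaborn ships a copy of the Titanic data with 'survived', 'age', 'sex' columns
titanic = sns.load_dataset('titanic').dropna(subset=['age'])
titanic['child']  = (titanic['age'] < 16).astype(int)        # assumed binning cutoff
titanic['female'] = (titanic['sex'] == 'female').astype(int)

specs = {'Baseline':    'survived ~ age + female',
         'Binning':     'survived ~ child + female',
         'Interaction': 'survived ~ child * female',
         'Polynomial':  'survived ~ female + age + I(age**2) + I(age**3)'}

for note, formula in specs.items():
    print(note, round(smf.ols(formula, data=titanic).fit().rsquared, 4))
```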
We can use sklearn to pick a subset of variables: SelectKBest, RFE, SelectFromModel, SFS.

'passthrough': These are do-nothing placeholder steps.

pipe = Pipeline([('columntransformer', preproc_pipe),
                 ('feature_create', 'passthrough'),
                 ('feature_select', 'passthrough'),
                 ('clf', LogisticRegression(class_weight='balanced'))
                ])
Visually:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer', SimpleImputer()),
                                                                  ('standardscaler', StandardScaler())]),
                                                  ['annual_inc', 'dti', 'fico_range_high', 'fico_range_low',
                                                   'installment', 'int_rate', 'loan_amnt', 'mort_acc',
                                                   'open_acc', 'pub_rec', 'pub_rec_bankruptcies',
                                                   'revol_bal', 'revol_util', 'total_acc']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('onehotencoder', OneHotEncoder())]),
                                                  ['grade'])])),
                ('feature_create', 'passthrough'),
                ('feature_select', 'passthrough'),
                ('clf', LogisticRegression(class_weight='balanced'))])
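Side note: if you want the boxed, clickable diagram of this pipeline inside a notebook, sklearn's HTML repr can be turned on explicitly (available in sklearn 0.23+):

```python
from sklearn import set_config
set_config(display='diagram')   # pipelines now render as interactive HTML boxes in notebooks
pipe                            # displaying the object again shows the diagram
```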
First thing I tried: More variables (21 in total)
... But it does worse? Any guesses why?
| | param_columntransformer | mean_test_score | std_test_score |
|---|---|---|---|
| 0 | 3 vars (Last class) | -2.70 | 5.508829 |
| 1 | All numeric + Grade | -3.84 | 4.428237 |
Next thing I'll try is feature selection....
PCA
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 2 | All numeric vars + Grade | PCA(n_components=5) | -27.768 | 5.066180 |
| 3 | All numeric vars + Grade | PCA(n_components=10) | -8.232 | 3.808592 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452 | 3.665048 |
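For reference, here is a sketch of how those 'passthrough' slots get swapped out during the search. The scorer and the commented fit call are stand-ins (the lecture uses make_scorer(custom_prof_score) and its own training data):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

# each dict in the list swaps something into the pipeline's placeholder slot
param_grid = [
    {'feature_select': ['passthrough']},                  # rows 0-1 above: no selection step
    {'feature_select': [PCA(n_components=5),
                        PCA(n_components=10),
                        PCA(n_components=15)]},           # the PCA rows above
]

grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')   # stand-in scorer
# grid.fit(X_train, y_train)                                      # hypothetical training data
# pd.DataFrame(grid.cv_results_) is what produces tables like the one above
```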
Next up: SelectKBest, SelectFromModel
Doesn't seem to be the answer here. I could try other "SelectFromModel" options, but let's move on.
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452 | 3.665048 |
| 7 | All numeric vars + Grade | SelectKBest(k=15) | -2.580 | 5.209683 |
| 8 | All numeric vars + Grade | SelectFromModel(estimator=LassoCV()) | -3.892 | 4.542028 |
| 9 | All numeric vars + Grade | SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), threshold='median') | -6.736 | 5.610268 |
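For completeness, a sketch of how these selectors are constructed before being dropped into the 'feature_select' slot (the parameters mirror the table rows above):

```python
from sklearn.feature_selection import SelectKBest, SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.svm import LinearSVC

# univariate screening: keep the 15 columns with the strongest relationship to y
kbest = SelectKBest(k=15)

# model-based screening: keep columns whose coefficients survive an L1-penalized fit
sfm_lasso = SelectFromModel(estimator=LassoCV())
sfm_svc   = SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'),
                            threshold='median')
```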
Next up: RFE
Still no great answer.
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452 | 3.665048 |
| 10 | All numeric vars + Grade | RFECV(cv=2, estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), scoring=make_scorer(custom_prof_score)) | -5.924 | 4.217969 |
| 11 | All numeric vars + Grade | RFECV(cv=2, estimator=LogisticRegression(class_weight='balanced'), scoring=make_scorer(custom_prof_score)) | -5.836 | 7.021697 |
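RFECV works a bit differently: it refits the estimator repeatedly, dropping the weakest features each round, and cross-validates every subset size. A sketch, with a standard scorer standing in for the lecture's custom profit scorer:

```python
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# recursively eliminate the weakest coefficients; keep the subset size that cross-validates best
rfe = RFECV(estimator=LogisticRegression(class_weight='balanced'),
            cv=2, scoring='accuracy')
```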
Last up: SequentialFeatureSelector
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452 | 3.665048 |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |
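Since the 5-feature SequentialFeatureSelector wins by a wide margin, a natural follow-up is asking which five columns it kept. A sketch, assuming the fitted GridSearchCV object is called grid and its best pipeline is the SFS one (get_feature_names_out needs a recent sklearn):

```python
import numpy as np

best_pipe = grid.best_estimator_                                   # hypothetical fitted search object
selector  = best_pipe.named_steps['feature_select']                # the fitted SequentialFeatureSelector
names     = best_pipe.named_steps['columntransformer'].get_feature_names_out()
print(np.asarray(names)[selector.get_support()])                   # the surviving columns
```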

And finally, we can create interaction terms within a pipeline.
However, this doesn't work well here, especially not when combined with PCA.
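A sketch of how that step is slotted in: swap the 'feature_create' placeholder for PolynomialFeatures, optionally followed by PCA in 'feature_select' (the component counts mirror the rows below):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.decomposition import PCA

# interaction_only=True adds pairwise products of the preprocessed columns (no squares or cubes)
interact_grid = {
    'feature_create': [PolynomialFeatures(interaction_only=True)],
    'feature_select': ['passthrough', PCA(n_components=15), PCA(n_components=25)],
}
# passing this to GridSearchCV(pipe, interact_grid, ...) yields results like rows 15-17 below
```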

pretty.iloc[[0,1,-3,-2,-1],[0,2,1,3,4]]
| | param_columntransformer | param_feature_create | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|---|
| 0 | 3 vars (Last class) | | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | | -3.840 | 4.428237 |
| 15 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | | -3.504 | 8.125135 |
| 16 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=15) | -54.544 | 3.926330 |
| 17 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=25) | -35.980 | 5.287525 |
Let's just go back to our happy place and stare at this happy thing:
pretty.iloc[[0,1,4,12,13,14],[0,1,3,4]]
| | param_columntransformer | param_feature_select | mean_test_score | std_test_score |
|---|---|---|---|---|
| 0 | 3 vars (Last class) | | -2.700 | 5.508829 |
| 1 | All numeric vars + Grade | | -3.840 | 4.428237 |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452 | 3.665048 |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |
Are you predicting returns in your project???
Do NOT estimate any model like this:
$$ y_{t} = f(X_t) $$

You need to use today's info ($X_t$) to predict tomorrow's return ($y_{t+1}$):
$$ y_{t+1} = f(X_t) $$

Practically, this means:
- df is sorted by asset, then time period
- ret_tp1 = df.groupby('asset')['ret'].shift(-1) (a toy sketch of this shift is below)

Other things to try:
- Take the param_grid and create your own. (Good practice!) Try PCA, and then Poly(2).
- What is the best_estimator_? (Perhaps you prefer the model without the absolute highest mean test score!)
- Pick a model that isn't the best_estimator_, and figure out how you can use that model (to .fit() it and .predict() with it).

The code for today's lecture is inside handouts/ML/ML_worksheet3.ipynb
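A toy sketch of that shift (the asset, period, and return names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'asset':  ['A', 'A', 'A', 'B', 'B', 'B'],
                   'period': [1, 2, 3, 1, 2, 3],
                   'ret':    [0.01, -0.02, 0.03, 0.00, 0.05, -0.01]})

df = df.sort_values(['asset', 'period'])                 # shift(-1) assumes this ordering
df['ret_tp1'] = df.groupby('asset')['ret'].shift(-1)     # next period's return; NaN in each asset's last row

# now fit y_{t+1} = f(X_t): features measured at t, target is ret_tp1
```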