A new step right before preproc_pipe¶

The ColumnTransformer holds a list of pipelines, and each pipeline needs a list of variables to apply to. Options:

  1. Define manually: num_pipe_vars = ['A','B','C'], then (numer_pipe, num_pipe_vars)
    • Maybe tedious, but explicit (good)
  2. Let the CT find the numeric vars: (numer_pipe, make_column_selector(dtype_include=np.number))
    • grabs every numeric column; an analogous selector (dtype_include=object) works for cat vars
    • lets in all numeric/object variables, but often some of your columns shouldn't be used!
  3. Get pandas to list all numeric/cat vars, then drop those you don't want (example below)
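
A minimal sketch of option 3, assuming df is your loaded DataFrame and that 'member_id' and 'loan_status' are hypothetical columns you want to exclude:

import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ask pandas for the numeric columns, then drop the ones that
# shouldn't go into the model
num_pipe_vars = df.select_dtypes(include=np.number).columns.tolist()
num_pipe_vars = [v for v in num_pipe_vars if v not in ['member_id', 'loan_status']]

# same idea for the categorical columns
cat_pipe_vars = df.select_dtypes(include='object').columns.tolist()

numer_pipe = make_pipeline(SimpleImputer(), StandardScaler())
cat_pipe   = make_pipeline(OneHotEncoder())

preproc_pipe = make_column_transformer((numer_pipe, num_pipe_vars),
                                       (cat_pipe, cat_pipe_vars))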

Today: Preprocessing¶

  1. Discussion of preprocessing issues
  2. How to implement within pipelines + CV

Common joke: 80% of ML is just data cleaning and processing

  • Remember the Nate Silver quote?
  • Andrew Ng (Coursera, Stanford, Google Brain, Baidu...)

Outline of preprocessing topics¶

  • Missing values
  • Outliers
  • Feature construction
  • Feature transformation/scaling
  • Feature selection

Missing Values¶

  • Why is it missing?
  • Options: Drop, Impute, or Ignore
  • One dataset might need multiple solutions
  • Fancy (actually, easy) but can be powerful: KNN replacement (sketch below)
  • "Dropping the obs" might be fine... until you use the model in production

Outliers and variable scaling¶

Even when outliers aren't data errors, they

  • alter model estimates (sometimes very badly)
  • change inference
  • reduce predictive power

Thus, scaling can be ESSENTIAL for some models: https://youtu.be/9rBc3rTsJsY?t=195

Options to deal with outliers and scaling:

  • "Dropping the obs" might be fine... until you use the model in production
  • Transform the variables: options/illustration here and here
  • Winsorize (no current built-in option in sklearn)
    • Can do: you need to use FunctionTransformer (one possible sketch below)
    • Participation call: Lemme know if you figure it out!
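
Here is one way it might be done, as a sketch rather than the official answer (so the participation call still stands): wrap a column-wise clipping function in a FunctionTransformer.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def winsorize_cols(X, lower=0.01, upper=0.99):
    # clip each column at its own 1st/99th percentiles
    # (caveat: a production version should learn the cutoffs in fit()
    # and reuse them in transform(), which this simple sketch doesn't)
    X = np.asarray(X, dtype=float)
    lo = np.nanpercentile(X, lower * 100, axis=0)
    hi = np.nanpercentile(X, upper * 100, axis=0)
    return np.clip(X, lo, hi)

winsorizer = FunctionTransformer(winsorize_cols)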

Feature construction¶

Can often improve models greatly! (sklearn sketches follow the list below)

  • binning:
    • not profits as a #, but profit bins, e.g.: lowest, low, negative, zero, positive, high, highest
    • remember the HW problem on year vs year dummies?
    • tree-based models bin automatically
  • interactions, e.g.:
    • story example: woman or child, woman and first class, finance AND coding
    • tree-based, KNN, and NN models generate interactions automatically
  • polynomial expansions. If you have $X_1$ and $X_2$:
    • Poly of degree 2: $X_1$, $X_2$, $X_1^2$, $X_2^2$, $X_1 \cdot X_2$
    • Visual example
    • tree-based and NN models generate interactions automatically
  • extracting info from variables (e.g. date vars or text vars)
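
The first three ideas map onto built-in sklearn transformers; a quick sketch with illustrative parameters:

from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# binning: turn a continuous variable into 5 ordered bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal')

# interactions only (no squared terms)
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)

# full degree-2 expansion: X1, X2, X1^2, X2^2, X1*X2
poly = PolynomialFeatures(degree=2, include_bias=False)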

Feature construction in action¶

I'm going to load the classic Titanic dataset and try each of these methods.

OLS: Predicting Titanic survival
====================================================
Model includes                 Note          R2
----------------------------------------------------
Age + Gender                   Baseline      0.2911
Child + Gender                 Binning X     0.2994
Child + Gender + Interaction   Interaction   0.3093
poly deg 3 of (Age + Gender)   Polynomial    0.3166
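
For reference, a hedged sketch of how the binning + interaction row might be estimated with statsmodels (Survived, Age, and Sex are assumed column names in the loaded DataFrame):

import statsmodels.formula.api as smf

# bin age into a child dummy; code gender as a dummy
titanic['Child']  = (titanic['Age'] < 18).astype(int)
titanic['Female'] = (titanic['Sex'] == 'female').astype(int)

# 'Child * Female' expands to Child + Female + Child:Female,
# i.e. the "Child + Gender + Interaction" row above
m = smf.ols('Survived ~ Child * Female', data=titanic).fit()
print(f'R2: {m.rsquared:.4f}')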


Feature selection and/or extraction¶

  • If you have too many variables, you're likely to create an overfit model!
  • Options in sklearn to pick subset of vars
    • SelectKBest, RFE, SelectFromModel, SequentialFeatureSelector (SFS)
    • See link at top for more
  • An alternative to picking variables is "combining them" via PCA (quick sketch after this list)
    • Illustration
    • Use if: Lots of vars AND you suspect the "true" number of vars that matters is low
    • pros: reduces overfitting, quicker estimation
    • cons: hard (very!) to interpret what the model is doing
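
A quick sketch of PCA by itself (X here is a hypothetical feature matrix):

from sklearn.decomposition import PCA

# compress many correlated features into 5 orthogonal components
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

# share of total variance each component captures
print(pca.explained_variance_ratio_)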

Processing in action¶

  • Same as before: the Lending Club data (LC) is loaded, and the pipeline is the same
  • Notice 'passthrough': these are do-nothing placeholder steps
pipe = Pipeline([('columntransformer',preproc_pipe),
                 ('feature_create','passthrough'), 
                 ('feature_select','passthrough'), 
                 ('clf', LogisticRegression(class_weight='balanced'))
                ])
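
In the grid searches below, these placeholders get swapped out by step name. A hedged sketch of what such a param_grid might look like (the actual grid, including the custom profit scorer, is in the worksheet):

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

# each list entry replaces the matching 'passthrough' step by name
param_grid = {
    'feature_create': ['passthrough',
                       PolynomialFeatures(interaction_only=True)],
    'feature_select': ['passthrough',
                       PCA(n_components=15),
                       SelectKBest(k=15)],
}
grid = GridSearchCV(pipe, param_grid, cv=5)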

Visually:

Out[34]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['annual_inc', 'dti',
                                                   'fico_range_high',
                                                   'fico_range_low',
                                                   'installment', 'int_rate',
                                                   'loan_amnt', 'mort_acc',
                                                   'open_acc', 'pub_rec',
                                                   'pub_rec_bankruptcies',
                                                   'revol_bal', 'revol_util',
                                                   'total_acc']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['grade'])])),
                ('feature_create', 'passthrough'),
                ('feature_select', 'passthrough'),
                ('clf', LogisticRegression(class_weight='balanced'))])

First thing I tried: More variables (21 in total)

... But it does worse? Any guesses why?

Out[37]:
|   | param_columntransformer | mean_test_score | std_test_score |
|---|-------------------------|-----------------|----------------|
| 0 | 3 vars (Last class)     | -2.70           | 5.508829       |
| 1 | All numeric + Grade     | -3.84           | 4.428237       |

Next thing I'll try is feature selection....

PCA

  • This doesn't seem like the best setting for PCA
  • Key question: How many dimensions? (I try 5, 10, and 15 here)
  • Why is 15 best (among these)?
  • Why does model 4 (15 PCA components) beat model 1 (all 21 vars)?
Out[53]:
|   | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|---|--------------------------|----------------------|-----------------|----------------|
| 0 | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1 | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 2 | All numeric vars + Grade | PCA(n_components=5)  | -27.768         | 5.066180       |
| 3 | All numeric vars + Grade | PCA(n_components=10) | -8.232          | 3.808592       |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |

Next up: SelectKBest, SelectFromModel

These don't seem to be the answer here. I could try other SelectFromModel options, but let's move on.

Out[41]:
|   | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|---|--------------------------|----------------------|-----------------|----------------|
| 0 | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1 | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 7 | All numeric vars + Grade | SelectKBest(k=15)    | -2.580          | 5.209683       |
| 8 | All numeric vars + Grade | SelectFromModel(estimator=LassoCV()) | -3.892 | 4.542028 |
| 9 | All numeric vars + Grade | SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), threshold='median') | -6.736 | 5.610268 |

Next up: RFE

Still no great answer.

Out[43]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 10 | All numeric vars + Grade | RFECV(cv=2, estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), scoring=make_scorer(custom_prof_score)) | -5.924 | 4.217969 |
| 11 | All numeric vars + Grade | RFECV(cv=2, estimator=LogisticRegression(class_weight='balanced'), scoring=make_scorer(custom_prof_score)) | -5.836 | 7.021697 |

Last up: SequentialFeatureSelector

Out[66]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |

And finally, we can create interaction terms within a pipeline.

However, it doesn't work well here, and it does especially badly when combined with PCA.

In [69]:
pretty.iloc[[0,1,-3,-2,-1],[0,2,1,3,4]]
Out[69]:
|    | param_columntransformer  | param_feature_create | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      |                      | -3.840          | 4.428237       |
| 15 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) |         | -3.504          | 8.125135       |
| 16 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=15) | -54.544 | 3.926330 |
| 17 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=25) | -35.980 | 5.287525 |

Let's just go back to our happy place and stare at this happy thing:

In [70]:
pretty.iloc[[0,1,4,12,13,14],[0,1,3,4]]
Out[70]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |

OT 1¶

Are you predicting returns in your project???

Do NOT estimate any model like this:

$$ y_{t} = f(X_t) $$

  • In the X vector, a given row will have the market return for that day.
  • You can't use the return that occurred that day to predict the return of assets on the same day! ($y_t$)
  • Because: You won't know $X_t$ until the end of day $t$, but you'd need the prediction of $y_t$ at the start of the day to trade on it.
  • (If you can, pls lemme know... we'll talk!)

You need to use today's info ($X_t$) to predict tomorrow's return ($y_{t+1}$):

$$ y_{t+1} = f(X_t) $$

Practically, this means:

  • After you combine all your return data and other Xs
  • But before you create your holdout, create y (I like this name better):
    1. Make sure your df is sorted by asset, then time period
    2. Create y: ret_tp1 = df.groupby('asset')['ret'].shift(-1)
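
Putting those two steps together, a minimal pandas sketch ('asset', 'date', and 'ret' are assumed column names; adjust to your data):

# sort so that shift(-1) pulls each asset's NEXT-period return
df = df.sort_values(['asset', 'date'])
df['ret_tp1'] = df.groupby('asset')['ret'].shift(-1)

# the last period of each asset has no next return, so drop it
df = df.dropna(subset=['ret_tp1'])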

OT 2¶

  1. Wanna see how I ran all those models?
  2. Exercises:
    • Don't run my code as-is. It will take 15+ minutes.
    • Delete my param_grid and create your own. (Good practice!) Try PCA, and then Poly(2).
    • Try to examine and use the estimators after the grid search.
    • How can you use a model that isn't best_estimator_? (Perhaps you prefer a model without the absolute highest mean test score!) Pick a model that isn't the best_estimator_, and figure out how you can use that model (to .fit() it and .predict() with it). One possible approach is sketched below.
    • See how our profitable model does on the test sample!
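
For that second-to-last exercise, one possible approach as a hedged sketch (try it yourself before peeking!). It assumes grid is a fitted GridSearchCV object and X_train, y_train, X_holdout are your data splits:

from sklearn.base import clone

# pick (say) the 5th candidate from the search results...
params = grid.cv_results_['params'][4]

# ...rebuild an unfitted copy of the pipeline with those settings...
model = clone(grid.estimator).set_params(**params)

# ...then use it like any other estimator
model.fit(X_train, y_train)
preds = model.predict(X_holdout)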

The code for today's lecture is inside handouts/ML/ML_worksheet3.ipynb