A new step right before preproc_pipe¶

The ColumnTransformer holds a list of pipelines, and each pipeline needs a list of variables to apply to. Options:

  1. Define manually: num_pipe_vars = ['A','B','C'], then (numer_pipe, num_pipe_vars)
    • Maybe tedious, but explicit (good)
  2. Let the CT find the numeric vars: (numer_pipe, make_column_selector(dtype_include=np.number))
    • grabs every numeric column; an analogous selector (dtype_include=object) works for cat vars
    • lets in all numeric/object variables, but often some of your columns shouldn't be used!
  3. Get pandas to list all numeric/cat vars, then drop those you don't want (example below)
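
A minimal sketch of option 3, assuming df is your loaded DataFrame and that 'member_id' and 'loan_status' are hypothetical columns you want to exclude:

import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# ask pandas for the numeric columns, then drop the ones that
# shouldn't go into the model
num_pipe_vars = df.select_dtypes(include=np.number).columns.tolist()
num_pipe_vars = [v for v in num_pipe_vars if v not in ['member_id', 'loan_status']]

# same idea for the categorical columns
cat_pipe_vars = df.select_dtypes(include='object').columns.tolist()

numer_pipe = make_pipeline(SimpleImputer(), StandardScaler())
cat_pipe   = make_pipeline(OneHotEncoder())

preproc_pipe = make_column_transformer((numer_pipe, num_pipe_vars),
                                       (cat_pipe, cat_pipe_vars))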

Today: Preprocessing¶

  1. Discussion of preprocessing issues
  2. How to implement within pipelines + CV

Common joke: 80% of ML is just data cleaning and processing

  • Remember the Nate Silver quote?
  • Andrew Ng (Coursera, Stanford, Google Brain, Baidu...)

Outline of preprocessing topics¶

  • Missing values
  • Outliers
  • Feature construction
  • Feature transformation/scaling
  • Feature selection

Missing Values¶

  • Why is it missing?
  • Options: Drop, Impute, or Ignore
  • One dataset might need multiple solutions
  • Fancy (actually, easy) but can be powerful: KNN replacement (sketch below)
  • "Dropping the obs" might be fine... until you use the model in production

Outliers and variable scaling¶

Even when outliers aren't data errors, they

  • alter model estimates (sometimes very badly)
  • change inference
  • reduce predictive power

Thus, scaling can be ESSENTIAL for some models: https://youtu.be/9rBc3rTsJsY?t=195

Options to deal with outliers and scaling:

  • "Dropping the obs" might be fine... until you use the model in production
  • Transform the variables: options/illustration here and here
  • Winsorize (no current built-in option in sklearn)
    • Can do: you need to use FunctionTransformer (one possible sketch below)
    • Participation call: Lemme know if you figure it out!
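
Here is one way it might be done, as a sketch rather than the official answer (so the participation call still stands): wrap a column-wise clipping function in a FunctionTransformer.

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def winsorize_cols(X, lower=0.01, upper=0.99):
    # clip each column at its own 1st/99th percentiles
    # (caveat: a production version should learn the cutoffs in fit()
    # and reuse them in transform(), which this simple sketch doesn't)
    X = np.asarray(X, dtype=float)
    lo = np.nanpercentile(X, lower * 100, axis=0)
    hi = np.nanpercentile(X, upper * 100, axis=0)
    return np.clip(X, lo, hi)

winsorizer = FunctionTransformer(winsorize_cols)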

Feature construction¶

Can often improve models greatly! (sklearn sketches follow the list below)

  • binning:
    • not profits as a #, but profit bins, e.g.: lowest, low, negative, zero, positive, high, highest
    • remember the HW problem on year vs year dummies?
    • tree-based models bin automatically
  • interactions, e.g.:
    • story example: woman or child, woman and first class, finance AND coding
    • tree-based, KNN, and NN models generate interactions automatically
  • polynomial expansions. If you have $X_1$ and $X_2$:
    • Poly of degree 2: $X_1$, $X_2$, $X_1^2$, $X_2^2$, $X_1 \cdot X_2$
    • Visual example
    • tree-based and NN models generate interactions automatically
  • extracting info from variables (e.g. date vars or text vars)
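
The first three ideas map onto built-in sklearn transformers; a quick sketch with illustrative parameters:

from sklearn.preprocessing import KBinsDiscretizer, PolynomialFeatures

# binning: turn a continuous variable into 5 ordered bins
binner = KBinsDiscretizer(n_bins=5, encode='ordinal')

# interactions only (no squared terms)
interactions = PolynomialFeatures(degree=2, interaction_only=True,
                                  include_bias=False)

# full degree-2 expansion: X1, X2, X1^2, X2^2, X1*X2
poly = PolynomialFeatures(degree=2, include_bias=False)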

Feature construction in action¶

I'm going to load the classic Titanic dataset and try each of these methods.

OLS: Predicting Titanic survival
====================================================
Model includes                 Note          R2
----------------------------------------------------
Age + Gender                   Baseline      0.2911
Child + Gender                 Binning X     0.2994
Child + Gender + Interaction   Interaction   0.3093
poly deg 3 of (Age + Gender)   Polynomial    0.3166
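
For reference, a hedged sketch of how the binning + interaction row might be estimated with statsmodels (Survived, Age, and Sex are assumed column names in the loaded DataFrame):

import statsmodels.formula.api as smf

# bin age into a child dummy; code gender as a dummy
titanic['Child']  = (titanic['Age'] < 18).astype(int)
titanic['Female'] = (titanic['Sex'] == 'female').astype(int)

# 'Child * Female' expands to Child + Female + Child:Female,
# i.e. the "Child + Gender + Interaction" row above
m = smf.ols('Survived ~ Child * Female', data=titanic).fit()
print(f'R2: {m.rsquared:.4f}')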


Feature selection and/or extraction¶

  • If you have too many variables, you're likely to create an overfit model!
  • Options in sklearn to pick subset of vars
    • SelectKBest, RFE, SelectFromModel, SequentialFeatureSelector (SFS)
    • See link at top for more
  • An alternative to picking variables is "combining them" via PCA (quick sketch after this list)
    • Illustration
    • Use if: Lots of vars AND you suspect the "true" number of vars that matters is low
    • pros: reduces overfitting, quicker estimation
    • cons: hard (very!) to interpret what the model is doing
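
A quick sketch of PCA by itself (X here is a hypothetical feature matrix):

from sklearn.decomposition import PCA

# compress many correlated features into 5 orthogonal components
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)

# share of total variance each component captures
print(pca.explained_variance_ratio_)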

Processing in action¶

  • Same as before: the Lending Club data (LC) is loaded, and the pipeline is the same
  • Notice 'passthrough': these are do-nothing placeholder steps
pipe = Pipeline([('columntransformer',preproc_pipe),
                 ('feature_create','passthrough'), 
                 ('feature_select','passthrough'), 
                 ('clf', LogisticRegression(class_weight='balanced'))
                ])
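
In the grid searches below, these placeholders get swapped out by step name. A hedged sketch of what such a param_grid might look like (the actual grid, including the custom profit scorer, is in the worksheet):

from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import PolynomialFeatures

# each list entry replaces the matching 'passthrough' step by name
param_grid = {
    'feature_create': ['passthrough',
                       PolynomialFeatures(interaction_only=True)],
    'feature_select': ['passthrough',
                       PCA(n_components=15),
                       SelectKBest(k=15)],
}
grid = GridSearchCV(pipe, param_grid, cv=5)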

Visually:

Out[34]:
Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline-1',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer()),
                                                                  ('standardscaler',
                                                                   StandardScaler())]),
                                                  ['annual_inc', 'dti',
                                                   'fico_range_high',
                                                   'fico_range_low',
                                                   'installment', 'int_rate',
                                                   'loan_amnt', 'mort_acc',
                                                   'open_acc', 'pub_rec',
                                                   'pub_rec_bankruptcies',
                                                   'revol_bal', 'revol_util',
                                                   'total_acc']),
                                                 ('pipeline-2',
                                                  Pipeline(steps=[('onehotencoder',
                                                                   OneHotEncoder())]),
                                                  ['grade'])])),
                ('feature_create', 'passthrough'),
                ('feature_select', 'passthrough'),
                ('clf', LogisticRegression(class_weight='balanced'))])

First thing I tried: More variables (21 in total)

... But it does worse? Any guesses why?

Out[37]:
|   | param_columntransformer | mean_test_score | std_test_score |
|---|-------------------------|-----------------|----------------|
| 0 | 3 vars (Last class)     | -2.70           | 5.508829       |
| 1 | All numeric + Grade     | -3.84           | 4.428237       |

Next thing I'll try is feature selection....

PCA

  • This doesn't seem like the best setting for PCA
  • Key question: How many dimensions? (I try 5, 10, and 15 here)
  • Why is 15 best (among these)?
  • Why does model 4 (15 PCA components) beat model 1 (all 21 vars)?
Out[53]:
|   | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|---|--------------------------|----------------------|-----------------|----------------|
| 0 | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1 | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 2 | All numeric vars + Grade | PCA(n_components=5)  | -27.768         | 5.066180       |
| 3 | All numeric vars + Grade | PCA(n_components=10) | -8.232          | 3.808592       |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |

Next up: SelectKBest, SelectFromModel

These don't seem to be the answer here. I could try other SelectFromModel options, but let's move on.

Out[41]:
|   | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|---|--------------------------|----------------------|-----------------|----------------|
| 0 | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1 | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4 | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 7 | All numeric vars + Grade | SelectKBest(k=15)    | -2.580          | 5.209683       |
| 8 | All numeric vars + Grade | SelectFromModel(estimator=LassoCV()) | -3.892 | 4.542028 |
| 9 | All numeric vars + Grade | SelectFromModel(estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), threshold='median') | -6.736 | 5.610268 |

Next up: RFE

Still no great answer.

Out[43]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 10 | All numeric vars + Grade | RFECV(cv=2, estimator=LinearSVC(class_weight='balanced', dual=False, penalty='l1'), scoring=make_scorer(custom_prof_score)) | -5.924 | 4.217969 |
| 11 | All numeric vars + Grade | RFECV(cv=2, estimator=LogisticRegression(class_weight='balanced'), scoring=make_scorer(custom_prof_score)) | -5.836 | 7.021697 |

Last up: SequentialFeatureSelector

Out[66]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |

And finally, we can create interaction terms within a pipeline.

However, it doesn't work well here, and it does especially badly when combined with PCA.

In [69]:
pretty.iloc[[0,1,-3,-2,-1],[0,2,1,3,4]]
Out[69]:
|    | param_columntransformer  | param_feature_create | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      |                      | -3.840          | 4.428237       |
| 15 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) |         | -3.504          | 8.125135       |
| 16 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=15) | -54.544 | 3.926330 |
| 17 | All numeric vars + Grade | PolynomialFeatures(interaction_only=True) | PCA(n_components=25) | -35.980 | 5.287525 |

Let's just go back to our happy place and stare at this happy thing:

In [70]:
pretty.iloc[[0,1,4,12,13,14],[0,1,3,4]]
Out[70]:
|    | param_columntransformer  | param_feature_select | mean_test_score | std_test_score |
|----|--------------------------|----------------------|-----------------|----------------|
| 0  | 3 vars (Last class)      |                      | -2.700          | 5.508829       |
| 1  | All numeric vars + Grade |                      | -3.840          | 4.428237       |
| 4  | All numeric vars + Grade | PCA(n_components=15) | -2.452          | 3.665048       |
| 12 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=5, scoring=make_scorer(custom_prof_score)) | 23.580 | 2.599723 |
| 13 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=10, scoring=make_scorer(custom_prof_score)) | 11.048 | 4.400847 |
| 14 | All numeric vars + Grade | SequentialFeatureSelector(cv=2, estimator=LogisticRegression(class_weight='balanced'), n_features_to_select=15, scoring=make_scorer(custom_prof_score)) | 7.640 | 3.884904 |

OT 1¶

Are you predicting returns in your project???

Do NOT estimate any model like this:

$$ y_{t} = f(X_t) $$

  • In the X vector, a given row will have the market return for that day.
  • You can't use the return that occurred that day to predict the return of assets on the same day! ($y_t$)
  • Because: You won't know $X_t$ until the end of day $t$, but you'd need the prediction of $y_t$ at the start of the day to trade on it.
  • (If you can, pls lemme know... we'll talk!)

You need to use today's info ($X_t$) to predict tomorrow's return ($y_{t+1}$):

$$ y_{t+1} = f(X_t) $$

Practically, this means:

  • After you combine all your return data and other Xs
  • But before you create your holdout, create y (I like this name better):
    1. Make sure your df is sorted by asset, then time period
    2. Create y: ret_tp1 = df.groupby('asset')['ret'].shift(-1)
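
Putting those two steps together, a minimal pandas sketch ('asset', 'date', and 'ret' are assumed column names; adjust to your data):

# sort so that shift(-1) pulls each asset's NEXT-period return
df = df.sort_values(['asset', 'date'])
df['ret_tp1'] = df.groupby('asset')['ret'].shift(-1)

# the last period of each asset has no next return, so drop it
df = df.dropna(subset=['ret_tp1'])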

OT 2¶

  1. Wanna see how I ran all those models?
  2. Exercises:
    • Don't run my code as-is. It will take 15+ minutes.
    • Delete my param_grid and create your own. (Good practice!) Try PCA, and then Poly(2).
    • Try to examine and use the estimators after the grid search.
    • How can you use a model that isn't best_estimator_? (Perhaps you prefer a model without the absolute highest mean test score!) Pick a model that isn't the best_estimator_, and figure out how you can use that model (to .fit() it and .predict() with it). One possible approach is sketched below.
    • See how our profitable model does on the test sample!
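
For that second-to-last exercise, one possible approach as a hedged sketch (try it yourself before peeking!). It assumes grid is a fitted GridSearchCV object and X_train, y_train, X_holdout are your data splits:

from sklearn.base import clone

# pick (say) the 5th candidate from the search results...
params = grid.cv_results_['params'][4]

# ...rebuild an unfitted copy of the pipeline with those settings...
model = clone(grid.estimator).set_params(**params)

# ...then use it like any other estimator
model.fit(X_train, y_train)
preds = model.predict(X_holdout)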

The code for today's lecture is inside handouts/ML/ML_worksheet3.ipynb