Hello! Today:¶

Merging, and working on "handouts/Merging exercises.ipynb"
Announcements
- A2: The grades will be solid (take a breath!)
  - Q1-Q5 (EDA) - 81%... very good! (75% last year)
  - Q8-10 ("toy" Qs) - 60%
  - Q6, 7, 11, 12: tough! 20% (40% last year)
    - ... office hours are free
  - restart / run all (without error) > save > push > "free" points
    - Only 50% did this (75% last year)
- A3: figures are 💪💪💪, reviews starting
- A4: posting by Wednesday
- Readme, gitignore, etc: Important good habits, near 100% avg
  - Don't forget to change the title and make it informative
Roman Pearce "We Hungry" topics
- Lunch N Learn
- Donuts and coffee sometime before break

Today¶

Bonus discussion on viz
Merging
Wednesday: Missing values

Viz¶

Any questions about anything on this? (Last week we mostly talked about mechanics of making plots show up. We can discuss anything on the broad topic of viz here)

Warmup - thinking about viz¶

We also have time to discuss the AWESOME array of graphs submitted for A3.

Plaudits: which impressed you? How did they DO that?
Commentary (good natured only) on improving any?

Good-er visuals¶

Check out the resource page's visualization references for pointers to more material on effective visualization.

Check out 3.3.7 in the book for a quickstart on customizing your figures the next time you want to fine-tune plots (and then google a lot; no sane human memorizes matplotlib...)

Note: Updated viz chapter in the book.

Moving on...¶

Context - starting point: Remember, the class's first two objectives are to:

obtain, explore, groom, visualize, and analyze data

make all of that reproducible, reusable, and shareable

Context - right now: At this point, we've covered/added skills

✅ GitHub for collaboration and sharing
✅ GitHub for project management/development and version control
✅ Python: numpy, pandas, seaborn
✅ Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED (macroeconomic and regional time series data)
✅ Data scraping: Yes, you've done this already!
✅ Finance: Downloading stock prices, compounding returns over arbitrary time spans, building portfolios

We need to talk about a few more issues before we get properly ambitious.

Context - going forward: We need to introduce a few more skills before we start really running analytical models.

Merging datasets
What to do with missing values?
What to do with outliers?
How to get a world of data off the world wide web
Working with string data

Merging¶

Parameters of the merge function + practice
Tips / best practices

The merge function¶

In the "merging exercises" notebook, we have

import pandas as pd
left_df = pd.DataFrame({
                    'firm': ['Accenture','Citi','GS'],
                    'varA': ['A1', 'A2', 'A3']})

right_df = pd.DataFrame({
                    'firm': ['GS','Chase','WF'],
                    'varB': ['B1', 'B2', 'B3'],
                    'varc': ['C1', 'C2', 'C3']})

Let's use shift+tab to talk about the parameters.
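
As a minimal sketch of the call itself (not necessarily what the exercises notebook runs), here are a few parameters worth noticing in the shift+tab docstring: `on` names the key, `how` picks which rows survive (covered next), `validate` checks the kind of match you expect, and `indicator=True` adds a `_merge` column recording where each row came from. The frames are recreated so the cell runs by itself; the name `merged` is only for illustration.

```python
import pandas as pd

# Same toy frames as in the merging exercises notebook
left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})

right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

# Spell out the key and the join type instead of relying on defaults.
# validate raises an error if 'firm' is duplicated on either side.
# indicator adds a "_merge" column: left_only, right_only, or both.
merged = pd.merge(
    left_df,
    right_df,
    on='firm',
    how='outer',
    validate='one_to_one',
    indicator=True)

print(merged)
```

Checking `merged['_merge'].value_counts()` after a merge is a quick way to see how many rows matched and how many came from only one side.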

`how` : left v. right v. inner v. outer¶

| option | observations in resulting dataset |
| --- | --- |
| `how = "inner"` | Keys (`on` variables) that are in both datasets |
| `how = "left"` | "inner" + all unmatched obs in left |
| `how = "right"` | "inner" + all unmatched obs in right |
| `how = "outer"` | "inner" + all unmatched obs in left and right |
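
To see the four options in action, here is a small sketch that loops over them using the two example frames (recreated so the cell stands alone) and records which firms survive each merge. `firms_by_how` is a made-up name for this illustration.

```python
import pandas as pd

# Same toy frames as in the merging exercises notebook
left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})

right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

# For each join type, merge on 'firm' and keep the sorted list of firms
firms_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    merged = left_df.merge(right_df, on='firm', how=how)
    firms_by_how[how] = sorted(merged['firm'])
    print(how, firms_by_how[how])

# inner ['GS']
# left ['Accenture', 'Citi', 'GS']
# right ['Chase', 'GS', 'WF']
# outer ['Accenture', 'Chase', 'Citi', 'GS', 'WF']
```

Only GS is in both frames, so it is the sole survivor of the inner merge. The other options add back the unmatched firms from one or both sides, and the columns that had no match for those rows are filled with missing values.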

Practical merging tips¶

Next class¶

Missing values, outliers, and data wrangling outro

Summary¶

We covered

Time permitting: Some high level tips for good viz
Merging intro
Missing values