Hello! Today:¶
- Merging, and working on "handouts/Merging exercises.ipynb"
- Announcements
- A2: The grades will be solid (take a breath!)
- Q1-Q5 (EDA) - 81%... very good! (75% last year)
- Q8-10 ("toy" Qs) - 60%
- Q6, 7, 11, 12: tough! 20% (40% last year)
- ... office hours are free
- restart / run all (without error) > save > push > "free" points
- Only 50% did this (75% last year)
- A3: figures are 💪💪💪, reviews starting
- A4: posting by Wednesday
- Readme, gitignore, etc: Important good habits, near 100% avg
- Don't forget to change the title and make it informative
- Roman Pearce "We Hungry" topics
- Lunch N Learn
- Donuts and coffee sometime before break
Today¶
- Bonus discussion on viz
- Merging
- Wednesday: Missing values
Viz¶
Any questions about anything on this? (Last week we mostly talked about mechanics of making plots show up. We can discuss anything on the broad topic of viz here)
Warmup - thinking about viz¶
We also have time to discuss the AWESOME array of graphs submitted for A3.
- Plaudits: which impressed you? How did they DO that?
- Commentary (good natured only) on improving any?
Good-er visuals¶
Check out the resource page's visualization references for pointers to more resources on effective visualization.
Check out 3.3.7 in the book for a quickstart on customizing your figures the next time you want to fine-tune plots (and then google a lot; no sane human memorizes matplotlib...)
Note: Updated viz chapter in the book.
Moving on...¶
Context - starting point: Remember, the class's first two objectives are to:
- obtain, explore, groom, visualize, and analyze data
- make all of that reproducible, reusable, and shareable
Context - right now: At this point, we've covered/added skills
- ✅ GitHub for collaboration and sharing
- ✅ GitHub for project management/development and version control
- ✅ Python: numpy, pandas, seaborn
- ✅ Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED (macroeconomic and regional time series data)
- ✅ Data scraping: Yes, you've done this already!
- ✅ Finance: Downloading stock prices, compounding returns over arbitrary time spans, building portfolios
We need to talk about a few more issues before we get properly ambitious.
Context - going forward: We need to introduce a few more skills before we start really running analytical models.
- Merging datasets
- What to do with missing values?
- What to do with outliers?
- How to get a world of data off the world wide web
- Working with string data
Merging¶
- Parameters of the merge function + practice
- Tips / best practices
The merge function¶
In the "merging exercises" notebook, we have
import pandas as pd

left_df = pd.DataFrame({
    'firm': ['Accenture','Citi','GS'],
    'varA': ['A1', 'A2', 'A3']})

right_df = pd.DataFrame({
    'firm': ['GS','Chase','WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})
Let's use shift+tab to talk about the parameters.
how: left v. right v. inner v. outer¶
option | observations in resulting dataset |
---|---|
how = "inner" | Keys (on variables) that are in both datasets |
how = "left" | "inner" + all unmatched obs in left |
how = "right" | "inner" + all unmatched obs in right |
how = "outer" | "inner" + all unmatched obs in left and right |
Practical merging tips¶
More in the book!
Pick the "keys" you'll merge on
- What is the observation level in each dataset?
- Ex: firm merge, or firm-year merge?
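A quick sketch of why the observation level matters. The three toy frames below are made up for illustration (the firms, years, and numbers are not course data): two are firm-year panels and one is firm-level. Merging the two panels needs both `firm` and `year` as keys; merging on `firm` alone pairs every year with every other year of the same firm.

```python
import pandas as pd

# Hypothetical firm-year panel: one row per firm per year
sales = pd.DataFrame({
    'firm': ['GS', 'GS', 'Citi', 'Citi'],
    'year': [2019, 2020, 2019, 2020],
    'sales': [10, 12, 20, 18]})

# Hypothetical firm-year panel holding a second variable
assets = pd.DataFrame({
    'firm': ['GS', 'GS', 'Citi', 'Citi'],
    'year': [2019, 2020, 2019, 2020],
    'assets': [100, 110, 200, 190]})

# Hypothetical firm-level data: one row per firm
hq = pd.DataFrame({
    'firm': ['GS', 'Citi'],
    'hq_city': ['New York', 'New York']})

# Firm-year merge: both keys are needed to line rows up correctly
firm_year = pd.merge(sales, assets, on=['firm', 'year'], how='inner')

# Firm merge: the firm-level info repeats for every year of that firm
with_hq = pd.merge(sales, hq, on='firm', how='left')

# Wrong level: merging two panels on firm alone multiplies the rows
wrong_level = pd.merge(sales, assets, on='firm', how='inner')

print(len(firm_year), len(with_hq), len(wrong_level))  # 4 4 8
```

The jump from 4 rows to 8 in `wrong_level` is the usual symptom of merging on too few keys.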
Before (before!) your merge, examine the keys
- I provided a function that will help
- double check: len() before and after merge
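The helper function provided for the course is not reproduced here. As a rough stand-in, the hypothetical `describe_keys` below shows the kind of checks worth running on each dataframe before merging: are the keys unique, are any missing, and which keys overlap. It then compares `len()` before and after a left merge.

```python
import pandas as pd

def describe_keys(df, keys):
    """Hypothetical helper: summarize the merge keys of one dataframe."""
    return {
        'rows': len(df),
        'unique_keys': len(df.drop_duplicates(subset=keys)),
        'keys_are_unique': not df.duplicated(subset=keys).any(),
        'missing_keys': int(df[keys].isna().any(axis=1).sum()),
    }

left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})
right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

left_info = describe_keys(left_df, ['firm'])
right_info = describe_keys(right_df, ['firm'])
print(left_info)
print(right_info)

# Keys found in both frames, computed before merging
shared = set(left_df['firm']) & set(right_df['firm'])

# len() before and after: with unique right-side keys,
# a left merge should not change the number of rows
rows_before = len(left_df)
left_with_right = pd.merge(left_df, right_df, on='firm', how='left')
rows_after = len(left_with_right)
print(shared, rows_before, rows_after)
```

If `rows_after` is larger than `rows_before` in a left merge, the right-side keys were not unique and some left rows got duplicated.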
Always specify how, on, indicator, and validate inside pd.merge()
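Here is a minimal sketch of a merge with all four arguments spelled out, using the notebook's `left_df` and `right_df`. `indicator=True` adds a `_merge` column saying where each row came from, and `validate` makes pandas raise an error if the keys break the relationship you claimed. The names `firm_vars` and `dup_right` are made up for this example.

```python
import pandas as pd

left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})
right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

firm_vars = pd.merge(
    left_df, right_df,
    how='outer',              # keep unmatched rows from both sides
    on='firm',                # the key, stated explicitly
    indicator=True,           # adds the _merge column
    validate='one_to_one')    # error out if 'firm' repeats on either side

# _merge reports whether a row is left_only, right_only, or both
merge_counts = firm_vars['_merge'].value_counts()
print(merge_counts)

# Duplicate one key on the right side to show validate catching it
dup_right = pd.concat([right_df, right_df.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left_df, dup_right, how='outer', on='firm',
             indicator=True, validate='one_to_one')
    validate_failed = False
except pd.errors.MergeError:
    validate_failed = True
print(validate_failed)  # True
```

Tabulating `_merge` right after the merge is a cheap way to see how many rows matched and how many did not.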
After the merge, check that it did what you expected, and give the resulting dataframe a good name. Don't name it "merged"!!!
Next class¶
Missing values, outliers, and data wrangling outro