Teaching notes

This year:

Python exercises only: 80 median, 71 avg; substantially better than last year

The overall assignment grades will be higher because the readme/gitignore components averaged near 100%.

Hello! Today:¶

  1. Merging, and working on "Merging exercises.ipynb"
  2. Announcements
    • A2: strong grades! A tribute to hard work - bravo
    • A3: figures are 💪💪💪, reviews starting
    • A4: posting asap
    • Readme, gitignore, etc: Important good habits, near 100% avg
  3. Roman Pearce "We Hungry" topics
    • Lunch N Learn
    • Ain't forgetting: donuts and coffee

Today¶

  1. Bonus discussion on viz
  2. Merging
  3. Probably Wednesday: Missing values

Viz¶

Any questions about anything on this? (Last week we mostly talked about mechanics of making plots show up. We can discuss anything on the broad topic of viz here)

Warmup - thinking about viz¶

We also have time to discuss the AWESOME array of graphs submitted for A3.

  • Plaudits: which impressed you? How did they DO that?
  • Commentary (good natured only) on improving any?
  • Use .query() to drop outliers in graphs where seeing the pattern matters
  • Avoid oversaturation in scatterplots with .sample() or kind='hex'. Another option: Some students plotted various 2D density graphs

I provided examples of both in "plotting exercises" (see Q5 & Q6 answers)
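As a rough sketch of the two data-prep tips above (this is not the code from "plotting exercises"), the snippet below trims outliers with .query() and thins the data with .sample() before plotting. The dataframe, the column names x and y, the 1st/99th percentile cutoffs, and the sample size of 500 are all made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 points plus a handful of extreme y values
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=10_000),
                   "y": rng.normal(size=10_000)})
df.loc[:4, "y"] = 1_000  # five artificial outliers (rows 0 through 4)

# Tip 1: drop outliers with .query() so they don't stretch the axis
# (the 1st/99th percentile cutoffs are an arbitrary choice)
low, high = df["y"].quantile([0.01, 0.99])
trimmed = df.query("y >= @low and y <= @high")

# Tip 2: plot a random subset so the points don't pile up on each other
subset = trimmed.sample(n=500, random_state=0)

print(len(df), len(trimmed), len(subset))
```

Either trimmed or subset can then be handed to whatever plotting call you were already using.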

Good-er visuals¶

Check out the resource page's visualization references for pointers to more resources on effective visualization

Check out 3.3.5 in the book for a quickstart on customizing your figures the next time you want to fine-tune plots (and then google a lot, since no sane human memorizes matplotlib...)

Moving on...¶

Context - starting point: Remember, the class's first two objectives are to:

  1. obtain, explore, groom, visualize, and analyze data
  2. make all of that reproducible, reusable, and shareable

Context - right now: At this point, we've covered/added skills

  • ✅ GitHub for collaboration and sharing
  • ✅ GitHub for project management/development and version control
  • ✅ Python: numpy, pandas, seaborn
  • ✅ Datasets: CRSP (stock prices), Compustat (firm financial statements), FRED (macroeconomic and regional time series data)
  • ✅ Data scraping: Yes, you've done this already!
  • ✅ Finance: Downloading stock prices, compounding returns over arbitrary time spans, building portfolios

We need to talk about a few more issues before we get properly ambitious.

Context - going forward: We need to introduce a few more skills before we start really running analytical models.

  1. Merging datasets
  2. What to do with missing values?
  3. What to do with outliers?
  4. How to get a world of data off the world wide web
  5. Working with string data

Merging¶

  1. Parameters of the merge function + practice
  2. Tips / best practices

The merge function¶

In the "merging exercises" notebook, we have

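import pandas as pd

# Left dataset: one row per firm, with variable A
left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})

# Right dataset: one row per firm; only GS also appears in left_df
right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})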

Let's use shift+tab to talk about the parameters.

how : left v. right v. inner v. outer¶

option          observations in resulting dataset
how = "inner"   Keys (on variables) that are in both datasets
how = "left"    "inner" + all unmatched obs in left
how = "right"   "inner" + all unmatched obs in right
how = "outer"   "inner" + all unmatched obs in left and right
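To see the table in action, here is a small sketch that runs all four options on the two example dataframes from the notebook and records which firms survive each merge. The loop and the firms_by_how dictionary are just for illustration.

```python
import pandas as pd

# Same example data as in the "merging exercises" notebook
left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})
right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

# Run each merge type and collect the firms that end up in the result
firms_by_how = {}
for how in ['inner', 'left', 'right', 'outer']:
    result = pd.merge(left_df, right_df, on='firm', how=how)
    firms_by_how[how] = sorted(result['firm'])
    print(f"{how:>5}: {len(result)} rows, firms = {firms_by_how[how]}")
```

Only GS is in both datasets, so "inner" keeps one row, "left" and "right" keep three each, and "outer" keeps all five firms. Variables from the other dataset are filled with NaN for the unmatched rows.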

Practical merging tips¶

More in the book!

  1. Pick the "keys" you'll merge on
    • What is the observation level of each dataset?
    • Ex: firm merge, or firm-year merge?
  2. Before (before!) your merge, examine the keys
    • I provided a function that will help
    • double check: len() before and after merge
  3. Always specify how, on, indicator, and validate inside pd.merge()

  4. After the merge, check that it did what you expected, and give the resulting dataframe a good name. Don't name it "merged"!!!
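One way the checklist above might look in code, using the example dataframes from the notebook. This is a sketch, not the helper function mentioned in tip 2. The uniqueness check with .duplicated(), the name firm_vars, and the deliberately broken dup_left dataframe are all illustrative choices.

```python
import pandas as pd

left_df = pd.DataFrame({
    'firm': ['Accenture', 'Citi', 'GS'],
    'varA': ['A1', 'A2', 'A3']})
right_df = pd.DataFrame({
    'firm': ['GS', 'Chase', 'WF'],
    'varB': ['B1', 'B2', 'B3'],
    'varc': ['C1', 'C2', 'C3']})

# Before the merge: is the key unique in each dataset, and how long is each?
left_unique = not left_df['firm'].duplicated().any()
right_unique = not right_df['firm'].duplicated().any()
print(left_unique, right_unique, len(left_df), len(right_df))

# The merge: spell out how, on, indicator, and validate
firm_vars = pd.merge(left_df, right_df,
                     on='firm',
                     how='outer',
                     validate='one_to_one',  # raise an error if a key repeats
                     indicator=True)         # adds a "_merge" column

# After the merge: did it do what we expected?
print(len(firm_vars))
print(firm_vars['_merge'].value_counts())

# What validate buys you: a duplicated key raises an error instead of
# silently adding rows (dup_left is a deliberately broken example)
dup_left = pd.concat([left_df, left_df.iloc[[0]]], ignore_index=True)
try:
    pd.merge(dup_left, right_df, on='firm', how='left',
             validate='one_to_one', indicator=True)
    validate_caught_it = False
except pd.errors.MergeError:
    validate_caught_it = True
print(validate_caught_it)
```

The "_merge" column reports whether each row came from the left dataset only, the right dataset only, or both, which makes the after-merge check quick.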

Next class¶

Missing values, outliers, and data wrangling outro

Student demos¶

  • Task: I'll email after class
  • Morning: Scott (@vollmerjr) and John (@yonboyjohnjoy)
  • Afternoon: Matt (@mromano224) and Ian (@irr223)

Summary¶

We covered

  • Time permitting: Some high level tips for good viz
  • Merging intro
  • Missing values

Shut down: Restart your kernel and run code, save all open files, close JLab, commit and push your class notes repo.¶