Announcements and prelims¶

[Image: expo.png]

Peer review, assignment 1¶

Great averages so far:

  1. Python exercises: 87%
  2. Markdown: 81%
  3. Gitignore: 76%
  4. Reproducibility: 79%

Wednesday: A1 grades posted, A2 due BEFORE class, go over A2 in class

Peer review comments¶

Awesome feedback!

1

In if statement, when there is no alternative action needed, maybe there is no need to include "else: pass"
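A minimal sketch of that point, using a made-up function (`count_positives` is hypothetical, not from anyone's assignment): an `if` with no alternative action runs fine without an `else` branch.

```python
def count_positives(values):
    """Count the strictly positive numbers in a list."""
    count = 0
    for v in values:
        if v > 0:
            count += 1
        # No "else: pass" needed: when the condition is False,
        # the loop simply moves on to the next value.
    return count


print(count_positives([3, -1, 0, 7]))  # prints 2
```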

2

I enjoyed reading this code because it was very simple to follow and efficient. Nothing seemed unnecessary and the thought process/logic was clear. Something that I learned was how you accomplished Q7. I struggled with this a bit and it was great to learn how you did it. Nesting for loops seems like the most clear way to do this and now that I have seen it work in action I completely understand why it works.

EDA hacks¶

  1. Ad-hoc, fast: eda.py
  2. Thorough, slow: Pandas-profiling
    • Maybe: use when you get new data, and when you think you're done with cleaning / about to start analysis
    • You still need to be able to write and run EDA code manually (e.g., df.describe()); it is much faster
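A rough sketch of what "manual" EDA can look like. This is not the contents of eda.py; the DataFrame and its column names are made up for illustration, and only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a freshly loaded dataset (values are made up)
df = pd.DataFrame({
    "state": ["PA", "PA", "NJ", "NJ", "NY"],
    "unemployment": [4.1, 4.3, np.nan, 3.9, 4.8],
})

print(df.shape)                    # (rows, columns)
print(df.dtypes)                   # variable types
print(df.isna().sum())             # missing values per column
print(df.describe())               # summary stats for numeric columns
print(df["state"].value_counts())  # frequencies of a categorical variable
```

Each line runs instantly, so it is cheap to rerun after every cleaning step.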

"And now..."¶

[Image: Drake]

Someone(s): tell me some favorite plots you've seen?

Data Viz is not this...¶

It's this¶

In other words¶

The ability to plot large datasets is both powerful and exciting

Data viz discussion¶

(Take notes during this period)

Important: Data viz (and analysis) is iterative: you learn what's worth looking at only as you go

Overlaps with our ABCD rule: PLOT A LOT! A LOT! A LOT!

Plot¶

("sns" means the "seaborn" package)

  • To explore data, discover trends/comparisons/relationships, present results
  • To find relationships that differ by groups
  • To understand data issues
    • Pandas EDA page reveals a mystery!
    • Solved via ABCD / plotting
  • B/c summary stats only describe part of a distribution
    • Leverage (td_a) has a mean of 0.24 and std 0.38
  • B/c summary stats don't show relationships
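To see why a mean and std are not enough, here is a sketch with synthetic data (not the course's leverage variable). It borrows the 0.24 and 0.38 above only as targets: two very differently shaped variables are rescaled to share that mean and std, and `rescale` is a hypothetical helper written for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)


def rescale(x, mean=0.24, std=0.38):
    """Shift and scale x to the requested mean and (population) std."""
    return (x - x.mean()) / x.std() * std + mean


# Synthetic data: one symmetric variable, one right-skewed variable
df = pd.DataFrame({
    "bell": rescale(rng.normal(size=1000)),
    "skewed": rescale(rng.exponential(size=1000)),
})

print(df.agg(["mean", "std"]).round(2))          # same mean and std
print(df.quantile([0.01, 0.5, 0.99]).round(2))   # different quantiles
```

The first table cannot tell the two columns apart; the quantiles (or a histogram) can.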

Reasonable things to explore via plot¶

Read Chapter 3.3 for much more discussion of plotting (the whys and the hows)

  • Explore variation within variables (distributions)
  • Explore covariation between variables
  • Explore how distributions depend on groups
  • Explore how covariation depends on groups

Plotting process¶

| # | Step | Note |
|---|------|------|
| 0 | Ask a question about the data | Ex: What is the distribution of unemployment in each state? |
| 1 | Q > What the plot should look like | Draw it! Draw it on paper! |
| 2 | Plot appearance > which plot function/options to use | Find a pd or sns plot example that looks like that. |
| 3 | The function dictates how data should be formatted before you call the plot | Key: Wide or tall? |
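A small sketch of the "wide or tall" question in step 3, with made-up numbers. As a rule of thumb, pandas `.plot` draws each column of a wide table as its own series, while seaborn functions usually want tall data plus column names for `x`, `y`, and `hue`.

```python
import pandas as pd

# Wide: one row per state, one column per year (numbers are made up)
wide = pd.DataFrame({
    "state": ["PA", "NJ"],
    "2019": [4.5, 3.4],
    "2020": [9.1, 9.5],
})

# Tall: one row per state-year pair
tall = wide.melt(id_vars="state", var_name="year", value_name="unemployment")
print(tall)

# And back to wide: states as the index, one column per year
wide_again = tall.pivot(index="state", columns="year", values="unemployment")
print(wide_again)
```

`melt` and `pivot` are the two conversions you will reach for most often once the plot function tells you which shape it needs.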

Plotting quickstart:¶

import pandas as pd              # quick and dirty plots
import numpy as np
import seaborn as sns            # most of my plots
import matplotlib.pyplot as plt  # formatting

Key links: 3.3.3 and the links within

Now, exercises...