Hello! Today:¶

  1. Before class: Copy today's exercise file (pandas exercises.ipynb) and the Module 2 notes from the textbook into your class notes repo, and open both
  2. Discussing ASGN1 and peer reviews
  3. A brief discussion of numpy
  4. Getting dirty with pandas

Please fill out a short check-in survey soon: Link here, or on the dashboard's tasks

ASGN 1 Thoughts...¶

And some awards¶

The award for most commits goes to...

  • 9am: Dahee
  • 10am: Ian

Come on down!

And also, an award for favorite README meme (TOUGH competition)

  • 9am: Lana. Honorable Mention: Matt and John
  • 10am: Austen

Come on down!

ASGN 1 Peer reviews¶

  1. Peer review: A chance to learn and teach!
    • You were added to two classmates' repos today for reviewing
    • Go to github.com. On the left side of the page, under repositories, you'll see that you have access to assignments for two peers.
  2. For each student you're reviewing, open the answer key I put in the repo and click on the link to the survey
    • Most of the review can be done looking at the repo online, but...

⭐⭐⭐ MOST IMPORTANT: You must clone the repo to your computer and run the code ON YOUR COMPUTER (The essential ingredient of collaborative coding, and a fundamental takeaway from class) ⭐⭐⭐

If you have questions while doing reviews

  • if it is a general question ("Is XYZ correct?"): @classmates, but don't identify the reviewee
  • if it is sensitive: email Jason (and CC me)

Review demo + working through Q6¶

Volunteer? I'll submit a full feedback form for your assignment as well.

Numpy summary¶

To use numpy functions, add this to the beginning of your notebook: import numpy as np

Why is there a chapter in the book on it?

Numpy is great for:

  • simulations and derivatives
  • math operations, including ones pandas has no method for
    • Ex: np.median(), np.percentile(), np.floor()
  • features pandas doesn't have
    • Ex: np.nan (missing value)
  • doing all of that fast

np 🤝 pandas
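A minimal sketch of the functions named above, using made-up numbers (the arrays and the Series are hypothetical, not from the exercise file). It also shows that a numpy function can be applied straight to a pandas Series.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration
values = np.array([1.5, 2.5, 3.5, 4.5, 10.0])

print(np.median(values))          # middle value
print(np.percentile(values, 25))  # 25th percentile
print(np.floor(values))           # round each element down

# np.nan marks a missing value
with_gap = np.array([1.0, np.nan, 3.0])
print(np.isnan(with_gap))         # element-wise check for missing values
print(np.nanmedian(with_gap))     # median that skips the missing value

# numpy functions also accept a pandas Series and return a Series
prices = pd.Series([1.5, 2.5, 3.5], name="price")
floored = np.floor(prices)
print(floored)
```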

NP reference¶

numpy.org/doc has pages with

  • the "absolute basics" guide: start here (very good)
  • quickstart guide (next)
  • how-to section

In-class NP examples¶

  1. Indexing looks the same (at least for 1D arrays):
In [1]:
import numpy as np

myray = np.arange(15) # create array
print("myray:", myray) 
print("slice:", myray[6:11]) # pick the elements at index 6 through 10 (the end index is excluded)
# Q1: pick the odd elements
myray: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14]
slice: [ 6  7  8  9 10]
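One possible answer to Q1, as a sketch. "Odd" could mean odd positions or odd values; for this particular array the two coincide, so both approaches below return the same elements.

```python
import numpy as np

myray = np.arange(15)

# Approach 1: slice with a step of 2, starting at index 1 (odd positions)
odd_by_slice = myray[1::2]

# Approach 2: keep the elements whose value is odd (a preview of masking)
odd_by_mask = myray[myray % 2 == 1]

print("by slice:", odd_by_slice)
print("by mask: ", odd_by_mask)
```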
  2. Booleans and masking
In [2]:
# create a random vector (every run of this --> diff #s)
from numpy.random import default_rng
rg = default_rng()
myray = rg.standard_normal(5)
print("myray:", myray) 

# Q2: how can you always select the positive elements from this?
# prof demo: booleans, a single condition-->bool, 
#            using booleans on an array/list, 
#            indexing/filtering via booleans as "masks"
#            then answer
myray: [ 1.33145787  0.54058088 -0.06731178 -0.88356257 -0.09785319]
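A sketch of one way to answer Q2 with a boolean mask. The seed passed to default_rng is an assumption added here so the run is repeatable; the cell above leaves it out on purpose.

```python
import numpy as np
from numpy.random import default_rng

rg = default_rng(0)              # seed is an assumption, only for repeatability
myray = rg.standard_normal(5)

is_positive = myray > 0          # a single condition gives an array of True/False
positives = myray[is_positive]   # the mask keeps only the positions that are True

print("myray:    ", myray)
print("mask:     ", is_positive)
print("positives:", positives)
```

Because the condition is recomputed from whatever myray holds, the same two lines select the positive elements on every run.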

What we just learned about boolean masking works directly in Pandas to filter data based on criteria!¶
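For instance, here is the same masking idea on a dataframe. This is a sketch on a hypothetical toy table (the column names and numbers are made up, not from the exercise file).

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "state": ["PA", "CA", "MI", "PA"],
    "year": [2019, 2019, 2020, 2020],
    "income": [60, 75, 58, 63],
})

mask = df["state"] == "PA"   # a Series of True/False, one per row
pa_only = df[mask]           # keep only the rows where the mask is True

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
pa_2020 = df[(df["state"] == "PA") & (df["year"] == 2020)]

print(pa_only)
print(pa_2020)
```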

PANDAS¶

  • Reference: 3.2.0, 3.2.1
  • How do I do X? See 3.2.3 (!!!) and 3.2.7

Let's learn to use pandas by working with real data!

Exploring the incredible FRED dataset¶

FRED is https://fred.stlouisfed.org/

  • Unreal repository of data: Download, graph, and track 786,000 US and international time series from 103 sources.
  • Check out "at a glance" and "popular series"
  • Usable in other classes: easy to download, clean, modify, analyze, and plot, all in seconds of coding!
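A rough sketch of one way to pull a series into pandas. It is not necessarily how the exercise file does it. It assumes FRED's CSV download link for a graph (the URL pattern below) and uses UNRATE, the unemployment rate series, as the example ID. The helper names fred_url and load_fred are made up for this sketch. The pandas-datareader package is another common route.

```python
import pandas as pd

# Assumption: FRED serves a series as CSV at this address
FRED_CSV = "https://fred.stlouisfed.org/graph/fredgraph.csv?id={series_id}"

def fred_url(series_id: str) -> str:
    """Build the download address for one series (hypothetical helper)."""
    return FRED_CSV.format(series_id=series_id)

def load_fred(series_id: str) -> pd.DataFrame:
    """Download one series; the first CSV column (the date) becomes the index."""
    return pd.read_csv(fred_url(series_id), index_col=0, parse_dates=True)

print(fred_url("UNRATE"))

# The download needs an internet connection, so it is left commented out:
# unrate = load_fred("UNRATE")
# print(unrate.tail())
```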

Pandas basics¶

Now, let's go back to the exercise file and stop right before EDA.

Data Analysis, AI, and ML are Mostly Data Wrangling.¶

Can we make it fun?

No.

OK, but can we eliminate frustration?

Also no.

However, we can make it WORK. (Also, it's weirdly satisfying once you get into it.)

Data wrangling starts with EDA - Exploratory Data Analysis¶

Reference: 3.2.5

Let's try Q0-Q3.

The cookbook can help you slightly automate your EDA
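A minimal sketch of a first-pass EDA, assuming a small hypothetical dataframe (made-up columns and numbers). It is not the cookbook's recipe, just the kind of checks such a recipe usually bundles.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
df = pd.DataFrame({
    "state": ["PA", "CA", "MI", "PA", "CA"],
    "year": [2019, 2019, 2019, 2020, 2020],
    "income": [60.0, 75.0, np.nan, 63.0, 78.0],
})

print(df.shape)                    # how many rows and columns?
print(df.head())                   # what do the first rows look like?
df.info()                          # column types and non-missing counts
print(df.describe())               # summary stats for numeric columns
print(df.isna().sum())             # missing values per column
print(df.nunique())                # distinct values per column
print(df["state"].value_counts())  # how often each category appears
```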

Next class¶

We will resume the pandas exercises next class, ⭐⭐ during which we will have our first student-led demos. ⭐⭐

  • 9:20am: Joseph and Fan will demo their solutions to Q4 (bonus: Q5).
  • 10:45am: Owen and Noah will demo their solutions to Q4 (bonus: Q5).

Student demos¶

  • Everyone will do a couple throughout the semester
  • Describe your approach (pseudocode) and then show and explain code step by step
  • Low stakes - typically small problems
  • Varieties: single or team show-and-tell, compare/contrast/discuss

Also, the link to ASGN2 is on coursesite.

Q4 and Q5¶

Demos - walk us through your attempts!

Postscript: I'll build on the demos to show:

  • pseudo code / planning
  • iteratively developing the code
  • ⭐⭐ "for loop" in pandas = groupby ⭐⭐
  • chaining (on one line, over multiple lines)
    ( # anything between these parens
        # is "one" line of code
      )
    
  • assign + lambda
  • temporary vs permanent changes to a dataframe
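A sketch of the chaining, assign + lambda, and temporary vs permanent ideas from the list above, on a hypothetical toy dataframe (made-up columns and numbers, with income assumed to be in thousands of dollars).

```python
import pandas as pd

# Hypothetical toy data; income is assumed to be in thousands of dollars
df = pd.DataFrame({
    "state": ["PA", "PA", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "income": [60.0, 63.0, 75.0, 78.0],
})

# A chain over multiple lines: the outer parentheses make it "one" line of code
summary = (
    df
    .assign(income_dollars=lambda d: d["income"] * 1000)  # new variable from an old one
    .query("year == 2020")                                # keep only the 2020 rows
    .sort_values("income_dollars", ascending=False)
    .reset_index(drop=True)
)
print(summary)

# Temporary: the result is not saved anywhere, so df itself is untouched
df.assign(double=lambda d: d["income"] * 2)
temp_left_df_alone = "double" not in df.columns
print(temp_left_df_alone)

# Permanent: save the result back to the name df
df = df.assign(double=lambda d: d["income"] * 2)
print(df.columns.tolist())
```

The lambda matters inside a chain: d refers to the dataframe as it exists at that step, not to the original df.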

Prof demo: Q6 and Q7¶

The point here isn't that this all makes perfect sense.

Follow my process.

You can look into the specific code bits later.

This was a huge week¶

Pandas = $: No ML without EDA, no EDA without pandas

  • Working with dataframes: reshaping, creating vars, summarizing data, and more
  • Planning/pseudo code > Developing > Chains
  • EDA: some of the things to look for + a recipe
  • Remember: In pandas, "for-loops" = groupby (usually)
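To make the last point concrete, here is a sketch that computes the same per-state average twice, once with an explicit loop and once with groupby. The toy dataframe is hypothetical (made-up columns and numbers).

```python
import pandas as pd

# Hypothetical toy data, only for illustration
df = pd.DataFrame({
    "state": ["PA", "PA", "CA", "CA"],
    "year": [2019, 2020, 2019, 2020],
    "income": [60.0, 63.0, 75.0, 78.0],
})

# The "for loop" way: handle one state at a time
loop_result = {}
for state in df["state"].unique():
    loop_result[state] = df.loc[df["state"] == state, "income"].mean()

# The pandas way: split by state, apply mean, combine the results
groupby_result = df.groupby("state")["income"].mean()

print(loop_result)
print(groupby_result)
```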

Next week!¶

Bosses:

Drake

So we start learning how to produce visualizations!¶

Student demos next Monday¶

  • Task: plot how median income has evolved over time for PA
    • Bonus: plot CA, PA, and MI on one graph (arguably easier than separate!)
    • HINT for the bonus: Use import seaborn as sns to make the plot, and look up the hue option
  • 9:20am: Wasti and Lana
  • 10:45am: Jake and Cole