Hello! Today¶

Assignment chatter
Introducing assignment 5 as a roadmap
- Create a folder inside your Class-Notes repo called "Midterm sandbox"

A4 under way ... gentle reminders:

ledatascifi.github.io, google.com, and stackoverflow.com are your friends!
Pseudocode more (on paper!)
- draw mini versions of what you want the dataset to look like (example)
- I usually write what I want/need and work backwards
- Clarification: "Download the patent-level RETech data"

The assignment (5 aka midterm)¶

I call it an "assignment" for continuity, but it's the 10% "Midterm Project", meaning it's 3x the weight of an assignment

Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then exploring the data.

Implications:

3x the weight --> Will take more time
Real project: Fewer (but not zero) grading portions are objective right/wrong
The deliverable will be graded like an essay, so it also depends on your economic/business arguments

The Midterm (aka assignment 5)¶

Basic question: Do 10-K filings contain value-relevant information in the sentiment of the text?
- Sentiment: Does the word have a positive or negative tone (e.g. "confident" vs "against")
Specific question: Is the positive or negative sentiment in a 10-K associated with better/worse stock returns?
- This is called a "cross-sectional event study"
- Expected minimum output: Scatterplot (x = document's sentiment score, y = stock returns around the 10-K's release) with regression lines; formatted well
- Advanced output: Regression tables, heatmaps, better scatterplots/similar
New data science technique: Textual analysis and sentiment analysis.
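
To make "sentiment" concrete, here is a minimal sketch of one possible score: count positive and negative words and scale by document length. The tiny word lists, the function name `sentiment_score`, and the "net count over total words" formula are all assumptions for illustration, not the assignment's required method; a real project would load a full sentiment dictionary.

```python
import re

# Hypothetical mini word lists, for illustration only.
POSITIVE = {"confident", "strong", "improve"}
NEGATIVE = {"against", "loss", "decline"}

def sentiment_score(text):
    """Return (positive count - negative count) / total words, or 0.0 if empty."""
    # Lowercase the text and keep only runs of letters as "words".
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

# Made-up sentence, not taken from a real 10-K.
sample = "We are confident that sales will improve despite the loss last year."
print(sentiment_score(sample))
```

Scaling by the word count keeps long and short documents comparable; separate positive and negative scores are an equally reasonable choice.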

How do we solve this? As usual: Break the problem down into parts, working backwards. (Write down a mini version of what you want the dataset to look like!)

Work with your classmate to do that and then outline an approach (pseudocode even) and take notes - this will be useful!

I'll call on TBD and TBD to share their approach in 10 minutes.
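
For comparison with your own outline, here is a rough sketch of what the "mini version" of the final dataset might look like, plus the regression line that goes on the scatterplot. The column names (`firm`, `filing_date`, `sentiment`, `ret`), the helper `ols_line`, and every number below are made up to show the shape of the data; none of it is real.

```python
# Made-up rows: one row per 10-K, with a sentiment score and the
# stock return around the filing date.
target = [
    {"firm": "AAA", "filing_date": "2022-02-10", "sentiment": 0.02, "ret": 0.010},
    {"firm": "BBB", "filing_date": "2022-02-24", "sentiment": -0.01, "ret": -0.020},
    {"firm": "CCC", "filing_date": "2022-03-01", "sentiment": 0.00, "ret": 0.005},
]

def ols_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# x = sentiment score, y = return around the 10-K's release.
slope, intercept = ols_line(
    [row["sentiment"] for row in target],
    [row["ret"] for row in target],
)
print(slope, intercept)
```

Working backwards from this table tells you what each earlier step must produce: a list of firms and filing dates, the text of each 10-K (for `sentiment`), and returns data (for `ret`).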

Ok, now: Intro to wizardry (scraping data from the web)¶

We want to download 500ish 10-Ks

How can we download the 10-Ks?¶

Look at 4.1: https://ledatascifi.github.io/ledatascifi-2024/content/04/01_Intro_to_scraping.html
Look for solutions for the next 5 minutes...

(Last year's class found a good one in 4.)
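
Whatever tool you end up using, a bulk download usually has the same skeleton: loop over the files, skip anything already on disk, and pause between requests. The sketch below shows only that generic pattern and is not the course's solution. The names `download_all` and `fake_fetch`, the file-naming scheme, and the example.com URLs are placeholders; `fetch` stands in for whatever function actually retrieves a page.

```python
import tempfile
import time
from pathlib import Path

def download_all(urls, out_dir, fetch, pause=0.0):
    """Save each URL's content under out_dir, skipping files already on disk.

    `fetch` is any function that takes a URL and returns bytes.
    Returns the list of paths written during this run.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for i, url in enumerate(urls):
        # Simplification: name files by position in the list.
        dest = out_dir / f"filing_{i:04d}.html"
        if dest.exists():  # already downloaded, so don't hit the server again
            continue
        dest.write_bytes(fetch(url))
        written.append(dest)
        time.sleep(pause)  # be polite to the server between requests
    return written

def fake_fetch(url):
    """Stand-in for a real web request, so the sketch runs offline."""
    return f"<html>{url}</html>".encode()

urls = ["https://example.com/a", "https://example.com/b"]  # placeholder URLs
with tempfile.TemporaryDirectory() as tmp:
    first = download_all(urls, tmp, fake_fetch)
    second = download_all(urls, tmp, fake_fetch)  # nothing new to fetch
print(len(first), len(second))
```

The skip-if-exists check matters with 500ish files: if the loop crashes partway through, rerunning it picks up where it left off instead of starting over.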

Next class¶

A5 is live! We will continue onwards
Building a spider to download 10-Ks
Next week: "read those 10-Ks to measure sentiment"

Student demos¶

Task: Will be a continuation of something from today

Summary¶

We have planned and started some serious data analysis. Let's do this!