Hello! Today
- Assignment chatter
- Introducing assignment 5 as a roadmap
- Create a folder called "Midterm sandbox" inside your Class-Notes repo
A4 under way ... gentle reminders:
- ledatascifi.github.io, google.com, and stackoverflow.com are your friends!
- Write more pseudocode (on paper!)
- draw mini versions of what you want the dataset to look like (example)
- I usually write what I want/need and work backwards
- Clarification: "Download the patent-level RETech data"
The assignment (5 aka midterm)
I call it an "assignment" for continuity, but it's the 10% "Midterm Project", meaning it's worth 3x the weight of a regular assignment.
Really fun news: This is an end-to-end data science project! You will be downloading a lot of files, parsing/exploring/cleaning those files, and then exploring the data.
Implications:
- 3x the weight --> Will take more time
- Real project: Fewer (but not zero) grading portions are objective right/wrong
- The deliverable will be graded like an essay, so the grade also depends on your economic/business arguments
The Midterm (aka assignment 5)
- Basic question: Do 10-K filings contain value-relevant information in the sentiment of the text?
- Sentiment: Does the word have a positive or negative tone (e.g. "confident" vs "against")?
- Specific question: Is the positive or negative sentiment in a 10-K associated with better/worse stock returns?
- This is called a "cross-sectional event study"
- Expected minimum output: Scatterplot (x = document's sentiment score, y = stock returns around the 10-K's release) with regression lines; formatted well
- Advanced output: Regression tables, heatmaps, better scatterplots/similar
- New data science technique: Textual analysis and sentiment analysis.
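To make "sentiment score" concrete, here is a minimal sketch of one way to score a document. The word lists are tiny, made up for illustration, and the formula (positive minus negative, divided by total words) is an assumption, not necessarily the measure the assignment will specify.

```python
import re

# Hypothetical mini word lists, for illustration only. A real analysis
# would use a full sentiment dictionary suited to financial text.
POSITIVE_WORDS = {"confident", "strong", "improve"}
NEGATIVE_WORDS = {"against", "loss", "decline"}


def sentiment_score(text):
    """Return (positive word count - negative word count) / total words."""
    # Lowercase the text and keep only runs of letters as "words".
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    n_pos = sum(w in POSITIVE_WORDS for w in words)
    n_neg = sum(w in NEGATIVE_WORDS for w in words)
    return (n_pos - n_neg) / len(words)


sample = "We are confident that margins will improve despite the loss."
print(sentiment_score(sample))  # prints 0.1 (2 positive, 1 negative, 10 words)
```

Dividing by the total word count keeps long and short filings comparable.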
How do we solve this? As usual: Break the problem down into parts, working backwards. (Write down a mini version of what you want the dataset to look like!)
Work with your classmate to do that, then outline an approach (pseudocode even) and take notes - this will be useful!
I'll call on TBD and TBD to share their approach in 10 minutes.
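For comparison with your own notes, here is a rough sketch of what a "mini version" of the target dataset could look like: one row per 10-K, with the two variables the scatterplot needs. The column names and every value are made up, and the helper `ols_line` is a hypothetical stand-in for whatever regression tool you end up using.

```python
# Hypothetical mini dataset: one row per 10-K filing.
# Tickers, dates, and numbers are invented to show the layout only.
filings = [
    {"ticker": "AAA", "filing_date": "2022-02-01", "sentiment": 0.01, "ret": 0.02},
    {"ticker": "BBB", "filing_date": "2022-02-15", "sentiment": -0.02, "ret": -0.01},
    {"ticker": "CCC", "filing_date": "2022-03-01", "sentiment": 0.03, "ret": 0.04},
    {"ticker": "DDD", "filing_date": "2022-03-10", "sentiment": 0.00, "ret": 0.00},
]


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# x = sentiment score, y = return around the filing date.
xs = [row["sentiment"] for row in filings]
ys = [row["ret"] for row in filings]
intercept, slope = ols_line(xs, ys)
print(f"ret = {intercept:.4f} + {slope:.4f} * sentiment")
```

Working backwards from this table tells you what to build: a list of firms and filing dates, the 10-K text for each (to get `sentiment`), and stock returns around each date (to get `ret`).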
Ok, now: Intro to wizardry (scraping data from the web)
- We want to download 500ish 10-Ks
How can we download the 10-Ks?
- Look at 4.1: https://ledatascifi.github.io/ledatascifi-2024/content/04/01_Intro_to_scraping.html
- Look for solutions for the next 5 minutes...
(Last year's class found a good one in 4.)
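Whatever tool you pick, the download step tends to have the same shape: loop over a list of URLs, skip files you already have, identify yourself, and pause between requests. Below is a minimal standard-library sketch of that pattern. It is not the solution from 4.1; the function names, the file naming scheme, and the pause length are all assumptions, and you would supply the list of 10-K URLs yourself.

```python
import time
import urllib.request
from pathlib import Path


def download_file(url, dest, user_agent):
    """Save one URL to dest. Return False if the file was already there."""
    dest = Path(dest)
    if dest.exists():
        return False  # already downloaded, so rerunning the script is cheap
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Send a User-Agent so the server knows who is asking
    # (SEC EDGAR expects automated requests to identify themselves).
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request) as response:
        dest.write_bytes(response.read())
    return True


def download_all(urls, folder, user_agent, pause=0.5):
    """Download every URL into folder; return how many new files were saved."""
    folder = Path(folder)
    n_new = 0
    for i, url in enumerate(urls):
        # Hypothetical naming scheme: filing_0000.html, filing_0001.html, ...
        dest = folder / f"filing_{i:04d}.html"
        if download_file(url, dest, user_agent):
            n_new += 1
            time.sleep(pause)  # be polite: do not hammer the server
    return n_new
```

A call would look like `download_all(my_urls, "10k_files", "Your Name your@email.com")`, where `my_urls` is the list you build. Skipping existing files matters with 500ish documents: if the loop crashes partway, you can rerun it without starting over.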
Summary
We have planned and started some serious data analysis. Let's do this!