L13_parsing_a_10k slides

Before class¶

Add to your Class Notes/Midterm Sandbox folder:

build_sample_exercise.ipynb
NEAR_regex.py from the community codebook. Delete everything after the "return" line.
.gitignore with **10k_files/* in it.
10k_files.zip from my Class-Notes-1045/Midterm sandbox folder into a subfolder called 10k_files/
Copy everything in the assignment's input folder into the inputs folder here.
Optional: Install tqdm. (If you don't, remove it from the code below.)

Where were we...¶

download_text_files.ipynb - see this
build_sample.ipynb - review this and the Sandbox readme
1. create empty data, or load the sp500 wiki table
2. for each firm:
  1. find and open 10-K file
  2. define/measure sentiment vars
  3. add to dataset
3. save dataset
report.ipynb
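
The build_sample.ipynb outline above can be sketched as a short loop. This is only a minimal illustration, not the notebook's actual code: the tickers, the fake filing text, and the helper names load_10k_text and measure_sentiment are hypothetical stand-ins, and a plain dict is used so the sketch runs without extra packages.

```python
import csv
import io

# Hypothetical stand-in: the real notebook would find and open the firm's 10-K.
def load_10k_text(ticker):
    fake_filings = {
        "AAA": "strong growth and improved margins",
        "BBB": "losses and litigation risk",
    }
    return fake_filings[ticker]

# Hypothetical stand-in: fraction of words found in a toy positive word list.
def measure_sentiment(text):
    positive_words = {"strong", "growth", "improved"}
    words = text.split()
    return sum(word in positive_words for word in words) / len(words)

# 1. create the (empty) dataset, one entry per firm
dataset = {ticker: {} for ticker in ["AAA", "BBB"]}

# 2. for each firm: open the filing, measure sentiment, add it to the dataset
for ticker in dataset:
    text = load_10k_text(ticker)
    dataset[ticker]["positive_sentiment"] = measure_sentiment(text)

# 3. save the dataset (an in-memory CSV here; a real file works the same way)
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["ticker", "positive_sentiment"])
for ticker, row in dataset.items():
    writer.writerow([ticker, row["positive_sentiment"]])
```

The key move is step 2: each measured value is written into the row for that firm under a named column. If the dataset is a pandas DataFrame indexed by ticker, the analogous assignment is df.loc[ticker, 'positive_sentiment'] = value.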

Today¶

Some goals involving the build_sample.ipynb file:

Once we measure sentiment, how do we add that to the dataset?
How can we measure sentiment?
Working with real html files (inside a zip!)

Warmup¶

Load a sentiment dictionary.
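
One possible shape for the warmup, assuming the dictionary is a text file with one word per line. The file name and its contents below are made up so the sketch is self-contained; the real dictionary files may be laid out differently.

```python
import tempfile
from pathlib import Path

# Stand-in for a real dictionary file (assumed layout: one word per line).
tmp_dir = Path(tempfile.mkdtemp())
dict_path = tmp_dir / "toy_positive.txt"   # hypothetical file name
dict_path.write_text("Happy\nhopeful\n\nSMILES\n")

def load_word_list(path):
    """Read one word per line, lowercase it, and skip blank lines."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

positive_words = load_word_list(dict_path)
```

Storing the words in a lowercase set makes later membership checks fast and case-insensitive, provided the document text is lowercased too.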

Once we measure exposure, how do we add that to the dataset?¶

Demos

NLP (with our eyes)¶

'I am happy that you are here. I am all smiles. So hopeful!'

Positive sentiment: The fraction of words with a positive sentiment.

How many words?
How many positive sentiment words?

Which words convey "positive sentiment" depends on the context. We are going to use two dictionaries built from business corpora.
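
A minimal sketch of the calculation for the example sentence. The three "positive" words are picked by eye for illustration and are not taken from either dictionary, and the tokenizing regex is just one simple choice.

```python
import re

text = "I am happy that you are here. I am all smiles. So hopeful!"
positive_words = {"happy", "smiles", "hopeful"}   # toy list, chosen by eye

# Lowercase the text, then pull out runs of letters as words.
words = re.findall(r"[a-z']+", text.lower())

n_words = len(words)                                        # how many words?
n_positive = sum(word in positive_words for word in words)  # how many positive?
positive_sentiment = n_positive / n_words                   # the fraction
```

Swapping in a different word list changes n_positive but not the rest of the code, which is why the choice of dictionary matters so much.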

Work through the exercises Q0-Q6, then come back to the next slides.

Finding (context-specific) sentiment in a document¶

Your positive sentiment measures will detect when firms use more positive language. But what are they talking positively about?

An approach to that problem: Look for a topic you are interested in, and see if positive words are nearby.

NEAR_regex([string1,string2]) will look for string1 near string2
Seemingly simple, but:
1. it lets us explore foundational NLP approaches
2. you will use NEAR_regex() on your midterms
3. it's WAAAAAY better than writing your own regex (famously awful)
Surprisingly powerful:
- Since string1 will be used in a regex, it can exploit regex patterns
- string1 = '(patent|innovation)' will look for patent or innovation
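
To make the "near" idea concrete, here is a toy stand-in. It is not the codebook's NEAR_regex and its behavior may differ from it: the function name toy_near_pattern and the exact pattern are assumptions, kept small on purpose. It matches string1 and string2 in either order with at most max_words_between words in between.

```python
import re

def toy_near_pattern(string1, string2, max_words_between=5):
    """Toy 'near' regex: string1 within max_words_between words of string2."""
    # Up to N whole words (lazily), then the separator before the second term.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_words_between
    forward = rf"\b(?:{string1})\b{gap}\b(?:{string2})\b"
    backward = rf"\b(?:{string2})\b{gap}\b(?:{string1})\b"
    return re.compile(f"(?:{forward})|(?:{backward})", flags=re.IGNORECASE)

text = "Our patent portfolio drove strong growth."

# string1 and string2 are regex fragments, so alternation works inside them.
pattern = toy_near_pattern("(patent|innovation)", "(strong|growth)", 3)
hits = [m.group(0) for m in pattern.finditer(text.lower())]

# With a tighter window the same two terms are too far apart to match.
tight = toy_near_pattern("(patent|innovation)", "(strong|growth)", 1)
no_hit = tight.search(text.lower())
```

Two things this toy version makes visible: the window size decides what counts as "near", and because the gap is defined by non-word characters, a match can run across a sentence boundary.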

Remaining goals:¶

Open a 10-K from inside the zip
Process the HTML before doing any NLP or using NEAR_regex
Use NEAR_regex on the cleaned text and measure its BHR_negative sentiment, applying the earlier exercises
As of this point: Enough to get all sentiment variables measured
If time/Wednesday:
- Get the filing date (new web ninja skills)
- Start working on the returns problem (no new skills... just a problem of imagination and work)
- Conceptual questions (max_words_between, dictionary choices, etc)
- Report questions
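
The first two remaining goals (open a 10-K inside the zip, then clean the HTML) might look roughly like this. It is a sketch under stated assumptions: the zip is built in memory with a made-up file name and contents, and the standard library's html.parser is used only to keep the example dependency-free, so the live-coded version may use a different tool and different cleaning steps.

```python
import io
import re
import zipfile
from html.parser import HTMLParser

# Build a tiny in-memory zip so the sketch runs anywhere (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(
        "10k_files/fake_firm/10-K.html",
        "<html><body><p>Revenue   DECLINED.</p>"
        "<p>Impairment losses rose.</p></body></html>",
    )

class TextExtractor(HTMLParser):
    """Collect the text between tags and ignore the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def clean_html(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()    # drop tags, lowercase
    text = re.sub(r"[^\w\s]", " ", text)      # punctuation becomes a space
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# Open the filing without unzipping the archive to disk.
with zipfile.ZipFile(buffer) as zf:
    name = zf.namelist()[0]
    with zf.open(name) as f:
        html = f.read().decode("utf-8")

document = clean_html(html)
```

Cleaning first matters because tags, mixed case, and stray punctuation would otherwise be counted as words or hide matches when the regex and the word counts are applied.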

(No more slides, live coding from here on.)

News¶

My office hours this week
Class Wednesday

Student demos¶

Task: TBD, will be a continuation of something from today
Nobody scheduled, email me if you want to do a demo

Summary¶

We have essentially covered all the new skills needed for the midterm. There is obviously additional stuff left to figure out, but it's all solvable with your existing skills.

Main challenges:

Putting the lego pieces together
Getting the returns around the 10-K filing date

These aren't impossible, but aren't trivial: Don't wait to start on the midterm!