Before class¶
Add to your Class Notes/Midterm Sandbox folder:
- build_sample_exercise.ipynb
- NEAR_regex.py from the community codebook. Delete everything after the "return" line.
- .gitignore with **10k_files/* in it.
- 10k_files.zip from my Class-Notes-1045/Midterm sandbox folder, placed in a subfolder called 10k_files/
- Copy the things in the assignment's input folder into the inputs folder here.
- Optional: You can install tqdm. (If you don't, then remove it from the code below.)
Where were we...¶
- download_text_files.ipynb (see this)
- build_sample.ipynb (review this and the Sandbox readme)
  - create empty data, or load the sp500 wiki table
  - for each firm:
    - find and open 10-K file
    - define/measure sentiment vars
    - add to dataset
  - save dataset
- report.ipynb
Today¶
Some goals involving the build_sample.ipynb file:
- Once we measure sentiment, how do we add that to the dataset?
- How can we measure sentiment?
- Working with real html files (inside a zip!)
Warmup¶
Load a sentiment dictionary.
Once we measure exposure, how do we add that to the dataset?¶
Demos
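One possible way to add a measured variable to the dataset is sketched below. It assumes the dataset is a pandas DataFrame with one row per firm. The firm table, the `fake_documents` texts, the `positive_words` list, and the `positive_share()` helper are all hypothetical stand-ins, not the real sp500 table, 10-K files, or class dictionaries.

```python
import pandas as pd

# Hypothetical stand-in for the sp500 table: one row per firm.
firms = pd.DataFrame({'Symbol': ['AAA', 'BBB']})

# Hypothetical stand-ins for the cleaned 10-K text and a sentiment word list.
fake_documents = {'AAA': 'good great loss', 'BBB': 'loss loss gain good'}
positive_words = {'good', 'great', 'gain'}

def positive_share(document):
    """Fraction of words in the document that are in positive_words."""
    words = document.split()
    return sum(word in positive_words for word in words) / len(words)

for i in firms.index:
    symbol = firms.loc[i, 'Symbol']
    document = fake_documents[symbol]              # "find and open 10-K file"
    value = positive_share(document)               # "define/measure sentiment vars"
    firms.loc[i, 'positive_sentiment'] = value     # "add to dataset"

print(firms)
```

Assigning with `.loc[row, column]` creates the column the first time it is used, so the same line works for the first firm and for every firm after it.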
NLP (with our eyes)¶
'I am happy that you are here. I am all smiles. So hopeful!'
Positive sentiment: The fraction of words with a positive sentiment.
- How many words?
- How many positive sentiment words?
Which words entail "positive sentiment" depends on the context. We are going to use two dictionaries designed on business corpora.
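The two counting questions above can be checked with a few lines of Python. This is only a sketch: the `positive_words` set is a toy list picked by eye for this one sentence, not either of the business dictionaries.

```python
import re

text = 'I am happy that you are here. I am all smiles. So hopeful!'

# Toy word list chosen by eye (an assumption, not a real sentiment dictionary).
positive_words = {'happy', 'smiles', 'hopeful'}

# Lowercase the text and pull out the words, dropping punctuation.
words = re.findall(r"[a-z']+", text.lower())

n_words = len(words)                                        # How many words?
n_positive = sum(word in positive_words for word in words)  # How many positive words?
positive_sentiment = n_positive / n_words

print(n_words, n_positive, round(positive_sentiment, 3))  # 13 3 0.231
```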
- Work through the exercises Q0-Q6, then come back to the next slides
Finding (context-specific) sentiment in a document¶
Your positive sentiment measures will detect when firms use more positive language. But what are they talking positively about?
An approach to that problem: Look for a topic you are interested in, and see if positive words are nearby.
NEAR_regex([string1,string2]) will look for string1 near string2. It is seemingly simple, but:
- it lets us explore foundational NLP approaches
- you will use NEAR_regex() on your midterms
- it's WAAAAAY better than writing your own regex (famously awful)
- it's surprisingly powerful
- Since it will be used in a regex, string1 can exploit regex patterns: string1 = '(patent|innovation)' will look for patent or innovation
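To make the "string1 near string2" idea concrete, here is a simplified, hypothetical helper. It is not the codebook's NEAR_regex, and its name, arguments, and behavior are assumptions made for illustration; on the midterm, use the real function. The sketch builds a pattern that matches one term followed by the other with at most a few words in between, in either order.

```python
import re

def simple_near_pattern(term1, term2, max_words_between=5):
    """Hypothetical, simplified stand-in for the idea behind NEAR_regex.

    Returns a regex that matches term1 and term2 (in either order)
    separated by at most max_words_between words.
    """
    # Up to N intervening words, then the separator before the second term.
    gap = r'(?:\W+\w+){0,' + str(max_words_between) + r'}?\W+'
    first = r'\b(?:' + term1 + r')\b'
    second = r'\b(?:' + term2 + r')\b'
    return '(?:' + first + gap + second + ')|(?:' + second + gap + first + ')'

text = 'our new patent portfolio delivered strong growth this year'

# The terms are themselves regex patterns, so alternation works inside them.
pattern = simple_near_pattern('(patent|innovation)', '(strong|growth)', max_words_between=3)
hits = [m.group(0) for m in re.finditer(pattern, text)]
print(hits)  # ['patent portfolio delivered strong']

# With a tighter window, the same terms are no longer "near" each other.
tight = simple_near_pattern('(patent|innovation)', '(strong|growth)', max_words_between=1)
print(re.search(tight, text))  # None
```

The second call shows why the size of the window matters: the same two terms count as "near" with a window of three words but not with a window of one.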
Remaining goals:¶
- Open a 10-K from inside the zip
- Process the HTML before doing any NLP or using NEAR_regex
- Use NEAR_regex on this and measure its BHR_negative sentiment, applying the earlier exercises
- As of this point: Enough to get all sentiment variables measured
- If time/Wednesday:
  - Get the filing date (new web ninja skills)
  - Start working on the returns problem (no new skills... just a problem of imagination and work)
  - Conceptual questions (max_words_between, dictionary choices, etc)
  - Report questions

(No more slides, live coding from here on.)
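As a rough preview of the first two goals, the sketch below reads an HTML file from inside a zip without unzipping it, then strips the tags and punctuation. Everything specific here is an assumption: the zip is built in memory with one made-up file, the path `fake_firm/10-K.html` is hypothetical, and the standard-library HTML parser stands in for whatever parsing tool we use in the live coding.

```python
import io
import re
import zipfile
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between HTML tags and ignores the tags themselves."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_clean_text(html):
    """Strip tags, lowercase, drop punctuation, and collapse whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = ' '.join(parser.chunks).lower()
    text = re.sub(r'[^a-z0-9\s]', ' ', text)   # punctuation -> spaces
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

# Build a tiny in-memory zip as a stand-in for 10k_files.zip (made-up content).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('fake_firm/10-K.html',
                '<html><body><p>Sales were <b>strong</b>.</p><p>Risks remain.</p></body></html>')

# Open one file from inside the zip without extracting it to disk.
with zipfile.ZipFile(buffer) as zf:
    with zf.open('fake_firm/10-K.html') as f:
        html = f.read().decode('utf-8')

document = html_to_clean_text(html)
print(document)  # sales were strong risks remain
```

With a real zip on disk, `zipfile.ZipFile` takes the path to the zip instead of the in-memory buffer. Note that this simple cleaner would also keep the contents of any script or style tags, which real filings may contain.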
Summary¶
We have essentially covered all the new skills needed for the midterm. There is obviously additional stuff left to figure out, but it's all solvable with your existing skills.
Main challenges:
- Putting the lego pieces together
- Getting the returns around the 10-K filing date
These aren't impossible, but aren't trivial: Don't wait to start on the midterm!