Teaching notes:

Where were we...

  • download_text_files.ipynb
  • measure_risk.ipynb
    1. create an empty dataset, or load the S&P 500 wiki table
    2. for each firm:
      1. find and open 10-K file
      2. define/measure risk vars
      3. add to dataset
    3. save dataset
  • analysis.ipynb
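
The measure_risk.ipynb loop above can be sketched like this. Everything here is a placeholder (tickers, the fake 10-K text, the risk measure, the output filename) — it only illustrates the loop structure, not the actual course code:

```python
import pandas as pd

def measure_risk(text):
    # placeholder risk measure: count mentions of the word "risk"
    return text.lower().count("risk")

# step 1: start with a firm list (in class: the S&P 500 wiki table)
firms = ["TSLA", "AAPL"]
fake_10ks = {"TSLA": "Risk factors: supply risk...",   # stand-ins for
             "AAPL": "Risk: ..."}                      # real 10-K text

rows = []
for firm in firms:                                # step 2: for each firm
    text = fake_10ks[firm]                        # 2.1 find and open the 10-K
    risk = measure_risk(text)                     # 2.2 define/measure risk vars
    rows.append({"firm": firm, "risk": risk})     # 2.3 add to dataset

df = pd.DataFrame(rows)
df.to_csv("risk_measures.csv", index=False)       # step 3: save dataset
```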

Today

Three goals involving the measure_risk.ipynb file:

  1. Once we measure exposure, how do we add that to the dataset?
  2. How can we measure exposure: NLP with our eyes (for one firm)
  3. Translating that to code

Once we measure exposure, how do we add that to the dataset?

Demos:

  • 920am: Sherzod (@Sherzod-Esanov) and Jack (@jdean53)
  • 1045am: Theo (@Theo-Faucher) and Harrison (@SNEDDONH)

Today

Three goals involving the measure_risk.ipynb file:

  1. Once we measure exposure, how do we add that to the dataset?
  2. How can we measure exposure: NLP with our eyes (for one firm)
  3. Translating that to code

NLP (with our eyes)

  1. Let's look at the 2020 10-K for TSLA (which you have/can easily download now!)
  2. Group discussion: Let's decide whether this document discusses a given topic (TBD in class)
    1. Class brainstorm: Pick topics to look for
    2. Group work: See if 10-K discusses our topics

Questions

  • Main Q: how can we translate the way we (humans) looked for that risk in the document into steps/rules a computer can follow? (Next slide)
  • Secondary Q: how can we try to encode the "size" of exposure? (E.g. no/small/large exposure)
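
One simple (hypothetical) answer to the secondary question: count how often the topic's pattern hits, then bin the count. The text, pattern, and cutoffs below are all arbitrary illustrations, not a prescribed method:

```python
import re

# hypothetical text and topic; in class, `text` would be a cleaned 10-K
text = "Supply risks: our supply chain depends on lithium supply."

hits = len(re.findall(r"supply", text, flags=re.IGNORECASE))

# bin the raw count into no/small/large exposure (cutoffs are a judgment call)
size = "none" if hits == 0 else ("small" if hits <= 2 else "large")
```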

Finding (context-specific) topics in a document

In that exercise, we looked for context-specific topics, meaning: is a topic being discussed in a particular way?

  • Not supply chain risks overall, but those specifically due to input requirements
  • Not "risk" writ large, but risks relating to the environment
  • Not China writ large, but import competition originating in China

How could you ask a computer to look for context-specific topics?

  • Ask it to look for specific bigrams (or a list of bigrams)
  • Two words within a certain distance (or a list of words)

NEAR_regex() is an implementation of "two words within (some) distance"

  • a seemingly simple function, but
    1. it lets us explore foundational NLP issues and concepts
    2. you will use NEAR_regex() on your midterms
    3. it's WAAAAAY better than writing your own regex (famously awful)
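
To make the idea concrete, here is a minimal sketch of a "two words within (some) distance" pattern builder. This is NOT the course's actual NEAR_regex function — just an illustration of the kind of regex it constructs, with a hypothetical helper name:

```python
import re

def near_pattern(word1, word2, max_between=5):
    # hypothetical helper (not NEAR_regex itself): match word1,
    # then up to max_between other words, then word2
    return rf"{word1}(?:\s+\w+){{0,{max_between}}}\s+{word2}"

text = "risks arising from our global supply chain"

# "risks" and "supply" are within 5 words of each other, so this matches
print(bool(re.search(near_pattern("risks", "supply"), text)))  # True
```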

Today

Three goals involving the measure_risk.ipynb file:

  1. Once we measure exposure, how do we add that to the dataset?
  2. How can we measure exposure: NLP with our eyes (for one firm)
  3. Translating that to code

Remaining goals:

  1. In Python, open the TSLA 10-K file (and put its contents into a string variable)
  2. Process the HTML before doing any NLP or using NEAR_regex
  3. First attempts at using NEAR_regex
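
A hedged, stdlib-only sketch of those first two steps (the file path is hypothetical, and in practice a real HTML parser such as BeautifulSoup's get_text() is more robust than regex tag-stripping):

```python
import re

# goal 1: open the file into a string
# in class: html = open("10k_files/TSLA_10k.html").read()   # hypothetical path
html = "<html><body><p>Risk Factors: supply  chain\nrisks</p></body></html>"

# goal 2: process the HTML before any NLP
text = re.sub(r"<[^>]+>", " ", html)      # crude: drop anything inside <...>
text = text.lower()                       # lowercase for easier matching
text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace

# goal 3: the cleaned string is now ready for NEAR_regex-style searches
```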

(No more slides, live coding from here on.)

News

  • Canceling office hours this Friday - conference, ask Julio
  • Wednesday: continuing with NEAR_regex a bit

Student demos

  • Task: TBD, will be a continuation of something from today
  • Class 1: Yang (@yal521) and Harry (@hsh423)
  • Class 2: Austen (@arj221), Colin (@ColinGrimm), and Eric (@etstieber)

Summary

We have essentially covered all the new skills needed for the midterm. There is obviously additional stuff left to figure out, but it's all solvable with your existing skills.

Don't wait to start on the midterm!