Teaching notes:

These slides are used at the end of L12, in the last 10 minutes or so.

Working with strings / intro to NLP¶

  • Where we are
  • Where we are going
  • But why go there?
  • Ok, I'm convinced, let's learn that...

Where were we...¶

  1. download_text_files.ipynb
    1. where am I going to put the files? create folders.
    2. get a sample of firms
    3. download 10-Ks (don't push all these files to GitHub! --> gitignore!); a sketch of this step follows the list
  2. measure_risk.ipynb
  3. analysis_report.ipynb
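For reference, here's a minimal sketch of that notebook's flow. The folder name, tickers, and URL pattern are placeholders, not the assignment's exact spec:

```python
import os
import requests

# 1a. where do the files go? create a folder (safe to re-run)
os.makedirs('10k_files', exist_ok=True)

# 1b. a sample of firms (placeholder tickers)
firms = ['AAPL', 'MSFT', 'TSLA']

# 1c. download each filing; the URL pattern here is a placeholder,
#     NOT a real EDGAR path (those require each filing's accession number)
for firm in firms:
    url = f'https://example.com/10k/{firm}.html'
    r = requests.get(url)
    if r.ok:
        with open(f'10k_files/{firm}.html', 'w', encoding='utf-8') as f:
            f.write(r.text)

# ...and put "10k_files/" in .gitignore so these never get pushed to GitHub
```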

Where are we going next?¶

The main challenge: How can we "create a risk variable"?

  1. download_text_files.ipynb
  2. measure_risk.ipynb
    1. create an empty dataset, or load the S&P 500 wiki table
    2. for each firm: open 10-K, define/measure risk vars, add to dataset (loop sketched after this list)
    3. save dataset
  3. analysis.ipynb
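Schematically, measure_risk.ipynb will look something like this. The paths are assumptions and measure_one_firm is a placeholder; writing the real version of that function is the actual challenge:

```python
import os
import pandas as pd

firms = ['AAPL', 'MSFT', 'TSLA']   # or: load the S&P 500 wiki table instead

def measure_one_firm(text):
    # placeholder measure: does the filing mention "risk" at all?
    # the real version will use smarter, regex-based searches
    return {'mentions_risk': 'risk' in text.lower()}

rows = []
for firm in firms:
    with open(f'10k_files/{firm}.html', encoding='utf-8') as f:  # assumed path
        text = f.read()
    row = {'firm': firm}
    row.update(measure_one_firm(text))   # define/measure risk vars
    rows.append(row)

os.makedirs('output', exist_ok=True)
risk_df = pd.DataFrame(rows)
risk_df.to_csv('output/risk_measures.csv', index=False)   # save the dataset
```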

Strings¶

  • We have access to text/html files discussing the firms.
  • Those discussions are in a basic python object type called a string.
'I am a short string. Hello!'
  • So you'll need to have a sense of working with strings.
  • ONE way to search strings: "regular expression" aka "regex" (quick taste below)
    • flexible and powerful
    • somewhat costly to learn (but I've worked to minimize the pain!)
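For a first taste, here's a tiny sketch using only built-in string methods and Python's re module (the pattern is just a toy):

```python
import re

s = 'I am a short string. Hello!'

# strings come with handy built-in methods
print(s.lower())             # 'i am a short string. hello!'
print('hello' in s.lower())  # True: a simple substring search
print(len(s.split()))        # 6: a crude word count

# regex searches for *patterns*, not just exact substrings
match = re.search(r'short \w+', s)   # 'short' followed by any word
print(match.group())                 # 'short string'
```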

I know what you're thinking: "Strings... pass"

But...

Data = power¶

  • New personal computers usually have 1TB+ of hard drive space
  • As of 2017, IBM estimated 2.5 MILLION TERABYTES of data were created every day
  • Since then: probably at least a billion additional regular internet users
    • Posts, texts, likes, searches, emails, location data
    • 2020: 40 ZETTABYTES of web data (1 ZB = 1 billion TB, roughly the number of stars in the observable universe)
    • 80% of that is unstructured text data

Figuring out how to put structure on that text data is POWERFUL.¶

  • Google is one example
  • One common task: identifying topics in unstructured text (does the document discuss topic X?); a toy version is sketched below
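A toy version of that task (the term lists are made up for illustration, not a validated dictionary):

```python
import re

doc = ('The company faces significant competition and regulatory '
       'uncertainty in several foreign markets.')

# made-up term lists for two "topics"
topics = {
    'competition_risk': ['competition', 'competitor', 'rival'],
    'legal_risk':       ['regulatory', 'lawsuit', 'litigation'],
}

# does the document discuss topic X? count hits per topic
for topic, terms in topics.items():
    pattern = '|'.join(terms)   # e.g. 'competition|competitor|rival'
    hits = len(re.findall(pattern, doc.lower()))
    print(topic, 'discussed:', hits > 0, f'({hits} hits)')
```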

The #BigPicture of this module:¶

We will write 3 seemingly "too simple" files (and call these the midterm)

Yet these files form most of the foundation for very powerful analysis! You'll finish with code that is easy to adapt and extend, and that you can list on your resume:

  • Building a spider to assemble "big data"
  • Leveraging "natural language processing" to make risk assessments on a "large corpus"

It's exciting! My head canon for your futures is a three-act play...

... (bear with me) ...

When we're done, your next interview be like¶

They will be like¶

And you'll be like...¶

FIN¶

The scary thing is that this was originally longer.


Alt act 3: https://giphy.com/gifs/swag-money-make-it-rain-2u11zpzwyMTy8

Ok, back to (current) reality:¶

The main challenge: How can we "create a risk variable"?

  1. download_text_files.ipynb
  2. measure_risk.ipynb
    1. create an empty dataset, or load the S&P 500 wiki table
    2. for each firm: open 10-K, define/measure risk vars, add to dataset
    3. save dataset
  3. analysis.ipynb

That middle step (for each firm: open its 10-K and measure risk) will require us to programmatically search the document. So...

Next class¶

Just kidding...¶

  • Next time: just enough background on strings and regex to help you use the near_regex function better
  • Practice/understanding using near_regex (a rough usage sketch follows this list)
  • Discuss conceptual issues with measure_risk.py
  • Loading and cleaning HTML documents so they are nice strings
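To preview that workflow, here's a rough sketch. I'm assuming near_regex takes a list of terms plus a max-gap argument and returns a regex pattern string; check its actual signature and import path in the course files. The hand-rolled pattern is a stand-in so the sketch runs on its own:

```python
import re
from bs4 import BeautifulSoup
# from near_regex import near_regex   # assumed import path; use the course's

# load and clean an HTML document into a nice string
with open('10k_files/AAPL.html', encoding='utf-8') as f:   # assumed path
    soup = BeautifulSoup(f.read(), 'html.parser')
text = ' '.join(soup.get_text().split())   # drop tags, collapse whitespace

# goal: find "risk" near "supply chain"
# assumed usage: pattern = near_regex(['risk', 'supply chain'], max_words_between=10)
pattern = r'risk(?:\s+\w+){0,10}\s+supply chain'   # hand-rolled stand-in

hits = len(re.findall(pattern, text.lower()))   # one candidate risk measure
print('supply-chain risk mentions:', hits)
```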

Student demos¶

  • Task: Will be a continuation of something from today
  • 9:20am: Sherzod (@Sherzod-Esanov) and Jack (@jdean53)
  • 10:45am: Theo (@Theo-Faucher) and Harrison (@SNEDDONH)

Summary¶

  • Do you feel the power of data flowing through you?
  • Introduced one way to harness it (regex searches)