Teaching notes:

These slides are used at the end of L12, in the last 10 minutes or so.

Working with strings / intro to NLP¶

  • Where we are
  • Where we are going
  • But why go there?
  • Ok, I'm convinced, let's learn that...

Where were we...¶

  1. download_text_files.ipynb
    1. where am I going to put the files? create folders.
    2. get a sample of firms
    3. download 10-Ks (don't push all these files to GitHub! --> gitignore!); a sketch of this step follows the list
  2. measure_risk.ipynb
  3. analysis_report.ipynb
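For reference, here's a minimal sketch of that notebook's flow. The folder name, tickers, and URL pattern are placeholders, not the assignment's exact spec:

```python
import os
import requests

# 1a. where do the files go? create a folder (safe to re-run)
os.makedirs('10k_files', exist_ok=True)

# 1b. a sample of firms (placeholder tickers)
firms = ['AAPL', 'MSFT', 'TSLA']

# 1c. download each filing; the URL pattern here is a placeholder,
#     NOT a real EDGAR path (those require each filing's accession number)
for firm in firms:
    url = f'https://example.com/10k/{firm}.html'
    r = requests.get(url)
    if r.ok:
        with open(f'10k_files/{firm}.html', 'w', encoding='utf-8') as f:
            f.write(r.text)

# ...and put "10k_files/" in .gitignore so these never get pushed to GitHub
```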

Where are we going next?¶

The main challenge: How can we "create a risk variable"?

  1. download_text_files.ipynb
  2. measure_risk.ipynb
    1. create an empty dataset, or load the S&P 500 wiki table
    2. for each firm: open 10-K, define/measure risk vars, add to dataset (loop sketched after this list)
    3. save dataset
  3. analysis.ipynb
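Schematically, measure_risk.ipynb will look something like this. The paths are assumptions and measure_one_firm is a placeholder; writing the real version of that function is the actual challenge:

```python
import os
import pandas as pd

firms = ['AAPL', 'MSFT', 'TSLA']   # or: load the S&P 500 wiki table instead

def measure_one_firm(text):
    # placeholder measure: does the filing mention "risk" at all?
    # the real version will use smarter, regex-based searches
    return {'mentions_risk': 'risk' in text.lower()}

rows = []
for firm in firms:
    with open(f'10k_files/{firm}.html', encoding='utf-8') as f:  # assumed path
        text = f.read()
    row = {'firm': firm}
    row.update(measure_one_firm(text))   # define/measure risk vars
    rows.append(row)

os.makedirs('output', exist_ok=True)
risk_df = pd.DataFrame(rows)
risk_df.to_csv('output/risk_measures.csv', index=False)   # save the dataset
```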

Strings¶

  • We have access to text/html files discussing the firms.
  • Those discussions are in a basic python object type called a string.
'I am a short string. Hello!'
  • So you'll need to have a sense of working with strings.
  • ONE way to search strings: "regular expression" aka "regex" (quick taste below)
    • flexible and powerful
    • somewhat costly to learn (but I've worked to minimize the pain!)
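For a first taste, here's a tiny sketch using only built-in string methods and Python's re module (the pattern is just a toy):

```python
import re

s = 'I am a short string. Hello!'

# strings come with handy built-in methods
print(s.lower())             # 'i am a short string. hello!'
print('hello' in s.lower())  # True: a simple substring search
print(len(s.split()))        # 6: a crude word count

# regex searches for *patterns*, not just exact substrings
match = re.search(r'short \w+', s)   # 'short' followed by any word
print(match.group())                 # 'short string'
```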

I know what you're thinking: "Strings... pass"

But...

Data = power¶

  • New personal computers usually have 1TB+ of hard drive space
  • As of 2017, IBM estimated 2.5 MILLION TERABYTES of data were created every day
  • Since then: probably at least a billion additional regular internet users
    • Posts, texts, likes, searches, emails, location data
    • 2020: 40 ZETTABYTES of web data (1 ZB = 1 billion TB, roughly the number of stars in the observable universe)
    • 80% of that is unstructured text data

Figuring out how to put structure on that text data is POWERFUL.¶

  • Google is one example
  • One common task: identifying topics in unstructured text (does the document discuss topic X?); a toy version is sketched below
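A toy version of that task (the term lists are made up for illustration, not a validated dictionary):

```python
import re

doc = ('The company faces significant competition and regulatory '
       'uncertainty in several foreign markets.')

# made-up term lists for two "topics"
topics = {
    'competition_risk': ['competition', 'competitor', 'rival'],
    'legal_risk':       ['regulatory', 'lawsuit', 'litigation'],
}

# does the document discuss topic X? count hits per topic
for topic, terms in topics.items():
    pattern = '|'.join(terms)   # e.g. 'competition|competitor|rival'
    hits = len(re.findall(pattern, doc.lower()))
    print(topic, 'discussed:', hits > 0, f'({hits} hits)')
```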

The #BigPicture of this module:¶

We will write 3 seemingly "too simple" files (and call these the midterm)

Yet these files form most of the foundation for very powerful analysis! You'll finish with code that is easy to adapt and extend, and that you can list on your resume:

  • Building a spider to assemble "big data"
  • Leveraging "natural language processing" to make risk assessments on a "large corpus"

It's exciting! My head canon for your futures is a three-act play...

... (bear with me) ...

When we're done, your next interview be like¶

They will be like¶

And you'll be like...¶

FIN¶

The scary thing is that this was originally longer.


Alt act 3: https://giphy.com/gifs/swag-money-make-it-rain-2u11zpzwyMTy8

Ok, back to (current) reality:¶

The main challenge: How can we "create a risk variable"?

  1. download_text_files.ipynb
  2. measure_risk.ipynb
    1. create an empty dataset, or load the S&P 500 wiki table
    2. for each firm: open 10-K, define/measure risk vars, add to dataset
    3. save dataset
  3. analysis.ipynb

That middle step (for each firm: open its 10-K and measure risk) will require us to programmatically search the document. So...

Next class¶

Just kidding...¶

  • Next time: just enough background on strings and regex to help you use the near_regex function better
  • Practice/understanding using near_regex (a rough usage sketch follows this list)
  • Discuss conceptual issues with measure_risk.py
  • Loading and cleaning HTML documents so they are nice strings
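To preview that workflow, here's a rough sketch. I'm assuming near_regex takes a list of terms plus a max-gap argument and returns a regex pattern string; check its actual signature and import path in the course files. The hand-rolled pattern is a stand-in so the sketch runs on its own:

```python
import re
from bs4 import BeautifulSoup
# from near_regex import near_regex   # assumed import path; use the course's

# load and clean an HTML document into a nice string
with open('10k_files/AAPL.html', encoding='utf-8') as f:   # assumed path
    soup = BeautifulSoup(f.read(), 'html.parser')
text = ' '.join(soup.get_text().split())   # drop tags, collapse whitespace

# goal: find "risk" near "supply chain"
# assumed usage: pattern = near_regex(['risk', 'supply chain'], max_words_between=10)
pattern = r'risk(?:\s+\w+){0,10}\s+supply chain'   # hand-rolled stand-in

hits = len(re.findall(pattern, text.lower()))   # one candidate risk measure
print('supply-chain risk mentions:', hits)
```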

Student demos¶

  • Task: Will be a continuation of something from today
  • 9:20am: Sherzod (@Sherzod-Esanov) and Jack (@jdean53)
  • 10:45am: Theo (@Theo-Faucher) and Harrison (@SNEDDONH)

Summary¶

  • Do you feel the power of data flowing through you?
  • Introduced one way to harness it (regex searches)