Working with strings / intro to NLP¶
- Where we are
- Where we are going
- But why go there?
- Ok, I'm convinced, let's learn that...
Where were we...¶
- download_text_files.ipynb
- Where am I going to put the files? Create folders.
- get a sample of firms
- download 10ks (don't push all these files to github! --> gitignore!)
- build_sample.ipynb
- report.ipynb
Where are we going next?¶
The main challenge: How can we "create a sentiment variable"?
- download_text_files.ipynb
- build_sample.ipynb
- create empty data, or load the sp500 wiki table
- for each firm: open 10-K, define/measure sentiment vars, add to dataset
- save dataset
- report.ipynb
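The loop in step 2 can be sketched in a few lines. This is a minimal sketch, not the assignment's spec: the folder layout (`10k_files/`, `output/`) and the toy sentiment measure (counting one positive word) are assumptions for illustration.

```python
# Sketch of the build_sample.ipynb loop. The folder names and the
# "count one word" sentiment measure are placeholder assumptions.
import glob
import os
import pandas as pd

os.makedirs("output", exist_ok=True)

rows = []
for path in glob.glob("10k_files/*.html"):      # one saved 10-K per firm (assumed layout)
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    rows.append({
        "filename": path,
        "n_good": text.count("good"),           # toy sentiment variable
        "doc_length": len(text),
    })

sample = pd.DataFrame(rows)
sample.to_csv("output/sample.csv", index=False)  # save the dataset
```

The real version will replace `text.count("good")` with smarter searches, which is exactly why we need strings and regex.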
Strings¶
- We have access to text/html files discussing the firms.
- Those discussions are in a basic python object type called a string.
'I am a short string. Hello!'
- So you'll need to have a sense of working with strings.
- ONE way to search strings: "regular expression" aka "regex"
- flexible and powerful
- somewhat costly (but I've worked to minimize the pain!)
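To make the two approaches concrete, here is a tiny example using a toy stand-in for a 10-K's text: plain string methods first, then the same search as a regex.

```python
# Two ways to search a string: built-in string methods, and regex.
import re

text = "Revenue increased. We expect revenue growth to continue."

# plain string methods: substring tests and counts (after lowercasing)
print("revenue" in text.lower())        # True
print(text.lower().count("revenue"))    # 2

# regex: \b marks word boundaries, re.IGNORECASE drops case-sensitivity
hits = re.findall(r"\brevenue\b", text, flags=re.IGNORECASE)
print(len(hits))                        # 2
```

String methods are fine for exact substrings; regex pays off once you need word boundaries, alternatives, or "word A near word B".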
I know what you're thinking: "Strings... pass"
But...
Data = power¶
- New personal computers usually have 1TB+ of hard drive space
- As of 2017, IBM estimated that 2.5 MILLION TERABYTES of data were created every day
- Now: probably at least a billion additional regular internet users
- Posts, texts, likes, searches, emails, location data
- 2020: 40 ZETTABYTES of web data (1 ZB = 1 billion TB ≈ the number of stars in the observable universe)
- 80% of that is unstructured text data
Figuring out how to put structure on that text data is POWERFUL.¶
- Google is one example
- LLMs are the new framework
- One common task: identifying topics in unstructured text (does the document discuss topic X?)
- Until open-source LLMs can run quickly on the 25th-percentile student laptop, we will avoid the LLM path.
- (Running open source LLMs to answer some question: Good project idea!)
The #BigPicture of this module:¶
We will write 3 seemingly "too simple" files (and call these the midterm)
Yet these files form most of the foundation for very powerful analysis. You'll finish with code that is easy to adapt and extend, or to list on your resume:
- Building a spider to assemble "big data"
- Leveraging "natural language processing" to work on a "large textual corpus"
It's exciting! My head canon for your futures is a three-act play...
... (bear with me) ...
When we're done, your next interview will be like¶
They will be like¶
And you'll be like...¶
FIN¶
The scary thing is that this was originally longer.
Alt act 3: https://giphy.com/gifs/swag-money-make-it-rain-2u11zpzwyMTy8
Ok, back to (current) reality:¶
The main challenge: How can we "create a sentiment variable"?
- download_text_files.ipynb
- build_sample.ipynb
- create empty data, or load the sp500 wiki table
- for each firm: **open 10-K, define/measure sentiment vars, add to dataset**
- save dataset
- report.ipynb
That bolded step will require us to programmatically search the document. So...
Next class¶
Just kidding...¶
Next time,¶
- Before class: Read 4.4 and subpages
- Just enough background on strings and regex to build the sentiment scores and use the NEAR_regex function
- Practice/understanding using NEAR_regex
- Discuss conceptual issues with build_sample.py
- Loading and cleaning HTML documents so they are nice strings
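The idea behind a "near" search can be previewed with plain `re`. This is a hand-rolled illustration of the concept, not NEAR_regex's actual API or output; the words "risk" and "currency" and the gap of 5 words are arbitrary choices for the demo.

```python
# Hand-rolled "near" search: does "risk" appear within 5 words of
# "currency", in either order? (Illustrates the idea behind NEAR_regex.)
import re

max_gap = 5
pattern = (
    rf"\brisk\b(?:\W+\w+){{0,{max_gap}}}\W+currency\b"   # risk ... currency
    r"|"
    rf"\bcurrency\b(?:\W+\w+){{0,{max_gap}}}\W+risk\b"   # currency ... risk
)

text = "We face currency fluctuation risk in overseas markets."
print(bool(re.search(pattern, text, flags=re.IGNORECASE)))  # True
```

NEAR_regex builds this kind of pattern for you, so the reading for next class is about understanding and using it, not writing it from scratch.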
Student demos¶
- Task: Will be a continuation of something from today
Summary¶
- Do you feel the power of data flowing through you?
- Introduced one way to harness it (regex searches)