Working with strings / intro to NLP¶
- Where we are
- Where we are going
- But why go there?
- Ok, I'm convinced, let's learn that...
Where were we...¶
- download_text_files.ipynb
- Where am I going to put the files? Create folders.
- get a sample of firms
- download 10ks (don't push all these files to github! --> gitignore!)
- build_sample.ipynb
- report.ipynb
Where are we going next?¶
The main challenge: How can we "create a sentiment variable"?
- download_text_files.ipynb
- build_sample.ipynb
- create empty data, or load the sp500 wiki table
- for each firm: open 10-K, define/measure sentiment vars, add to dataset
- save dataset
- report.ipynb
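The loop in step 2 can be sketched in a few lines. This is a minimal sketch, not the assignment's spec: the folder layout (`10k_files/`, `output/`) and the toy sentiment measure (counting one positive word) are assumptions for illustration.

```python
# Sketch of the build_sample.ipynb loop. The folder names and the
# "count one word" sentiment measure are placeholder assumptions.
import glob
import os
import pandas as pd

os.makedirs("output", exist_ok=True)

rows = []
for path in glob.glob("10k_files/*.html"):      # one saved 10-K per firm (assumed layout)
    with open(path, encoding="utf-8") as f:
        text = f.read().lower()
    rows.append({
        "filename": path,
        "n_good": text.count("good"),           # toy sentiment variable
        "doc_length": len(text),
    })

sample = pd.DataFrame(rows)
sample.to_csv("output/sample.csv", index=False)  # save the dataset
```

The real version will replace `text.count("good")` with smarter searches, which is exactly why we need strings and regex.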
Strings¶
- We have access to text/html files discussing the firms.
- Those discussions are in a basic python object type called a string.
'I am a short string. Hello!'
- So you'll need to have a sense of working with strings.
- ONE way to search strings: "regular expression" aka "regex"
- flexible and powerful
- somewhat costly (but I've worked to minimize the pain!)
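To make the two approaches concrete, here is a tiny example using a toy stand-in for a 10-K's text: plain string methods first, then the same search as a regex.

```python
# Two ways to search a string: built-in string methods, and regex.
import re

text = "Revenue increased. We expect revenue growth to continue."

# plain string methods: substring tests and counts (after lowercasing)
print("revenue" in text.lower())        # True
print(text.lower().count("revenue"))    # 2

# regex: \b marks word boundaries, re.IGNORECASE drops case-sensitivity
hits = re.findall(r"\brevenue\b", text, flags=re.IGNORECASE)
print(len(hits))                        # 2
```

String methods are fine for exact substrings; regex pays off once you need word boundaries, alternatives, or "word A near word B".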
I know what you're thinking: "Strings... pass"
But...
Data = power¶
- New personal computers usually have 1TB+ of hard drive space
- As of 2017, IBM estimated that 2.5 MILLION TERABYTES of data were created every day
- Now: probably at least a billion additional regular internet users
- Posts, texts, likes, searches, emails, location data
- 2020: 40 ZETTABYTES of web data (1 ZB = 1 billion TB ≈ the number of stars in the observable universe)
- 80% of that is unstructured text data
Figuring out how to put structure on that text data is POWERFUL.¶
- Google is one example
- LLMs are the new framework
- One common task: identifying topics in unstructured text (does the document discuss topic X?)
- Until open-source LLMs can run quickly on the 25th-percentile student laptop, we will avoid the LLM path.
- (Running open source LLMs to answer some question: Good project idea!)
The #BigPicture of this module:¶
We will write 3 seemingly "too simple" files (and call these the midterm)
Yet these files form most of the foundation for very powerful analysis. You'll finish with code that is easy to adapt and extend, or to list on your resume:
- Building a spider to assemble "big data"
- Leveraging "natural language processing" to work on a "large textual corpus"
It's exciting! My head canon for your futures is a three-act play...
... (bear with me) ...
When we're done, your next interview will be like¶
They will be like¶
And you'll be like...¶
FIN¶
The scary thing is that this was originally longer.
Alt act 3: https://giphy.com/gifs/swag-money-make-it-rain-2u11zpzwyMTy8
Ok, back to (current) reality:¶
The main challenge: How can we "create a sentiment variable"?
- download_text_files.ipynb
- build_sample.ipynb
- create empty data, or load the sp500 wiki table
- for each firm: **open 10-K, define/measure sentiment vars, add to dataset**
- save dataset
- report.ipynb
That bolded step will require us to programmatically search the document. So...
Next class¶
Just kidding...¶
Next time,¶
- Before class: Read 4.4 and subpages
- Just enough background on strings and regex to build the sentiment scores and use the NEAR_regex function
- Practice/understanding using NEAR_regex
- Discuss conceptual issues with build_sample.py
- Loading and cleaning HTML documents so they are nice strings
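The idea behind a "near" search can be previewed with plain `re`. This is a hand-rolled illustration of the concept, not NEAR_regex's actual API or output; the words "risk" and "currency" and the gap of 5 words are arbitrary choices for the demo.

```python
# Hand-rolled "near" search: does "risk" appear within 5 words of
# "currency", in either order? (Illustrates the idea behind NEAR_regex.)
import re

max_gap = 5
pattern = (
    rf"\brisk\b(?:\W+\w+){{0,{max_gap}}}\W+currency\b"   # risk ... currency
    r"|"
    rf"\bcurrency\b(?:\W+\w+){{0,{max_gap}}}\W+risk\b"   # currency ... risk
)

text = "We face currency fluctuation risk in overseas markets."
print(bool(re.search(pattern, text, flags=re.IGNORECASE)))  # True
```

NEAR_regex builds this kind of pattern for you, so the reading for next class is about understanding and using it, not writing it from scratch.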
Student demos¶
- Task: Will be a continuation of something from today
Summary¶
- Do you feel the power of data flowing through you?
- Introduced one way to harness it (regex searches)