Web scraping - Twitter data
Stan Lee, one of America's most prolific comic book writers, died in Los Angeles at the age of ninety-five on November 12, 2018. Here, I aim to analyze social media's response, particularly Twitter, to his death.
- Explore tweets to reveal interesting insights about user activity after his death.
- Build a machine learning model that can accurately classify the sentiment of a tweet as positive, neutral, or negative.
- Libraries used include scikit-learn, NumPy, pandas, NLTK, TextBlob, Matplotlib, and Tweepy, among others.
Stan Lee was an American comic book writer, editor, publisher, and producer. He rose through the ranks of a family-run business to become Marvel Comics' primary creative leader for two decades, leading its expansion from a small division of a publishing house to a multimedia corporation that dominated the comics industry. Lee was inducted into the comic book industry's Will Eisner Award Hall of Fame in 1994 and the Jack Kirby Hall of Fame in 1995. He received the NEA's National Medal of Arts in 2008.
As a fan of Marvel Comics myself, I wanted to explore his life and work in greater detail using machine learning!
Data for the analysis was collected through Twitter's public APIs. (How to extract tweets using Twitter's public APIs)
- I used the following keywords to filter the extraction: Stan Lee, StanLee, Stanley Martin Lieber.
PS: Adil Moujahid does a great job introducing Text Mining using Twitter's streaming API and Python
- Refer to "Historical Tweets Extraction - Web Scrapping.ipynb" for steps to extract historical tweets as needed.
- Tools used include Python, Tableau, and Microsoft Power BI.
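To illustrate the keyword filter described above, here is a minimal offline sketch that checks tweet text against the three keywords. The matching rules (case-insensitive, whole words, flexible whitespace) and the names `KEYWORDS`, `PATTERNS`, and `matches_keywords` are assumptions made for illustration; this is not how Twitter's API matches keywords internally, and it is not necessarily what the notebooks do.

```python
import re

# Keywords taken from the list above; the matching rules are an assumption.
KEYWORDS = ["Stan Lee", "StanLee", "Stanley Martin Lieber"]


def _phrase_pattern(phrase):
    # All words of the phrase must appear in order, separated by whitespace.
    words = [re.escape(w) for w in phrase.split()]
    return re.compile(r"\b" + r"\s+".join(words) + r"\b", re.IGNORECASE)


PATTERNS = [_phrase_pattern(k) for k in KEYWORDS]


def matches_keywords(text):
    """Return True if the tweet text mentions any of the tracked keywords."""
    return any(p.search(text) for p in PATTERNS)


# Hypothetical sample tweets, not taken from the collected data.
sample = [
    "RIP Stan Lee, thank you for everything",
    "#StanLee forever",
    "Just had lunch in Los Angeles",
]
kept = [t for t in sample if matches_keywords(t)]
print(kept)
```

A local filter like this can be handy for re-checking already downloaded tweets, for example after merging streamed and historical data.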
English is the most common tweet language, followed by Spanish.
Tweet activity surges after his death.
The majority of the tweets were posted from a mobile device.
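Counts like the language and device breakdowns above could be produced roughly as follows. This is a sketch only: the sample records are made up, the `lang` and `source` keys mirror the fields of the same name in Twitter's v1.1 tweet JSON, and the "mobile" heuristic (`MOBILE_HINTS`) is an assumption rather than the rule used for the original charts.

```python
import re
from collections import Counter

# Hypothetical records; `source` holds the client name wrapped in an HTML anchor.
tweets = [
    {"lang": "en", "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'},
    {"lang": "en", "source": '<a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>'},
    {"lang": "es", "source": '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>'},
    {"lang": "en", "source": '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'},
]

# Assumed heuristic: a client counts as mobile if its name contains one of these.
MOBILE_HINTS = ("iphone", "android", "ipad", "mobile")


def client_name(source_html):
    """Strip the HTML tags around the client name."""
    return re.sub(r"<[^>]+>", "", source_html).strip()


def device_class(source_html):
    """Classify a tweet's client as 'mobile' or 'other' by name."""
    name = client_name(source_html).lower()
    return "mobile" if any(hint in name for hint in MOBILE_HINTS) else "other"


language_counts = Counter(t["lang"] for t in tweets)
device_counts = Counter(device_class(t["source"]) for t in tweets)
print(language_counts.most_common())
print(device_counts.most_common())
```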
Important words include:
- angeles
- awesome
- respect
- memorial
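One simple way to surface words like these is a frequency count over cleaned tweet text. The sketch below assumes a basic cleaning routine (lowercasing, dropping URLs and @mentions, keeping hashtag words) and a tiny hand-written stop-word list; `tokenize`, `top_words`, and `STOPWORDS` are illustrative names, and the notebook may rank words differently (for example with NLTK's stop words or TF-IDF weights).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption); NLTK ships a fuller English list.
STOPWORDS = {"a", "an", "the", "to", "for", "of", "in", "and", "you", "is", "rt", "at"}


def tokenize(text):
    """Lowercase a tweet, strip URLs and @mentions, and return content words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"@\w+", " ", text)          # drop @mentions
    text = text.replace("#", " ")              # keep hashtag words, drop the symbol
    words = re.findall(r"[a-z']+", text)
    return [w for w in words if w not in STOPWORDS and len(w) > 2]


def top_words(texts, n=10):
    """Return the n most frequent tokens across all texts."""
    counts = Counter()
    for t in texts:
        counts.update(tokenize(t))
    return counts.most_common(n)


# Hypothetical sample tweets, not taken from the collected data.
sample = [
    "RT @fan: Respect to Stan Lee, an awesome storyteller https://t.co/xyz",
    "A memorial for Stan Lee in Los Angeles #respect",
]
print(top_words(sample, 5))
```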
The majority of the tweets carried a positive sentiment.
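A common way to arrive at labels like these is to threshold a polarity score in the range [-1, 1], such as the value TextBlob exposes as `TextBlob(text).sentiment.polarity`. The sketch below assumes that approach and a threshold of 0; the function name `label_from_polarity` and the sample scores are hypothetical, and the notebook's actual labeling rule may differ.

```python
from collections import Counter


def label_from_polarity(polarity, threshold=0.0):
    """Map a polarity score in [-1, 1] to a sentiment label.

    Scores within +/- threshold of zero are treated as neutral.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"


# Hypothetical polarity scores, e.g. as produced by a lexicon-based scorer.
scores = [0.8, 0.35, 0.0, -0.4, 0.6]
label_counts = Counter(label_from_polarity(s) for s in scores)
print(label_counts.most_common())
```

Raising `threshold` slightly above zero widens the neutral band, which can help when many tweets score close to zero.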
For more findings, please go to the "Images" folder.
Text Analytics using NLP - Web Scrapping.ipynb: Contains the code to
- Extract the relevant tweets
- Pre-process and structure the data for analysis
- Carry out some descriptive analytics
- Perform sentiment analysis and build a model for sentiment classification
- Result: Logistic Regression performed best, with an accuracy of 98% and an average F1 score of 0.97
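For readers unfamiliar with the two metrics reported above, the sketch below computes accuracy and a macro-averaged F1 score by hand on a toy set of labels. It is unclear from this README whether the reported average is macro or weighted, so the macro average here is an assumption; the toy labels are made up and unrelated to the reported numbers. scikit-learn's `accuracy_score` and `f1_score(average="macro")` compute the same quantities.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN), computed separately for each label."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only.
y_true = ["positive", "positive", "neutral", "negative", "positive", "neutral"]
y_pred = ["positive", "positive", "neutral", "negative", "neutral", "neutral"]
print(accuracy(y_true, y_pred))
print(f1_per_class(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because most tweets in this dataset are positive, a per-class view like `f1_per_class` can show whether the smaller classes are being classified as well as the overall accuracy suggests.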