About:

In recent years, Machine Learning has had a significant impact on the performance of real-world applications. This has largely been made possible by deep learning models trained on datasets carefully curated by researchers around the world. However, getting these datasets ready for ML models is a tedious process, mainly for two reasons:

Image source: http://ai.stanford.edu/blog/weak-supervision/

To overcome this “bottleneck” of unlabeled data, weak supervision is used. Here, instead of a ground-truth labeled training set, we use:

Applications in social media:

Weak supervision finds its applications in tasks that require a labeled dataset but where only unlabeled data is available, possibly along with some form of supervision. This supervision can be noisy, biased, or both. In the social media context, depending on the task at hand, examples of weak social supervision may include:

Previously, weak social supervision has been used for:

Project Objective: Demonstrate the application of weak social supervision to classify the intent of Brexit-related tweets.

Dataset Information:

The dataset contains

Procedure:

I broadly followed these steps to implement this project:

Implementation:

Libraries Used:

Standard Python libraries required for an ML program, such as nltk, numpy, scipy, pandas, and sklearn.

In addition, I have used snorkel to provide weak supervision sources through labeling functions. The Snorkel project, started at Stanford in 2016, is a platform for building training data programmatically.
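As a minimal sketch of what such a labeling function looks like: in Snorkel, a plain Python function that votes a label or abstains is wrapped with the `@labeling_function()` decorator from `snorkel.labeling`. The label constants and keywords below are illustrative assumptions, not the project's actual labeling functions.

```python
# Hypothetical label constants; Snorkel's convention is -1 for abstain.
ABSTAIN = -1
LEAVE = 0
REMAIN = 1

# In Snorkel, this function would be decorated with
# @labeling_function() from snorkel.labeling and applied to the
# tweet DataFrame with a PandasLFApplier.
def lf_leave_keywords(tweet_text: str) -> int:
    """Vote LEAVE if the tweet contains pro-leave keywords, else abstain."""
    keywords = {"leaveeu", "takebackcontrol", "no deal"}
    text = tweet_text.lower()
    return LEAVE if any(k in text for k in keywords) else ABSTAIN
```

Many such functions, each noisy on its own, are combined by Snorkel's label model into a single probabilistic label per tweet.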

Image source: http://ai.stanford.edu/blog/weak-supervision/

Labeling functions used:

Note: The keywords and hashtags to look up can be further refined to improve the coverage of the label model. A simple approach is to do a prior exploratory data analysis (EDA) of the frequency of these keywords and hashtags. In this project, I performed this for hashtags, and the most frequent ones are looked up in the implementation.
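The hashtag-frequency EDA mentioned above can be sketched as follows; the sample tweets are made up for illustration, and in the project this would run over the full tweet column of the dataset.

```python
import re
from collections import Counter

def top_hashtags(tweets, n=10):
    """Return the n most frequent hashtags (case-insensitive) in the tweets."""
    counts = Counter()
    for tweet in tweets:
        # Extract word characters following a '#'
        counts.update(tag.lower() for tag in re.findall(r"#(\w+)", tweet))
    return counts.most_common(n)
```

The most frequent tags returned here are the ones worth encoding in hashtag-based labeling functions.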

Steps to run the project:

Results and Inferences:

Labeling functions’ performance
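In Snorkel, these per-function statistics (coverage, overlaps, conflicts) come from `LFAnalysis(...).lf_summary()` applied to the label matrix. As a sketch of what two of those metrics mean, assuming a label matrix L of shape (n_examples, n_lfs) with -1 for abstain:

```python
import numpy as np

ABSTAIN = -1

def lf_coverage(L):
    """Fraction of examples each labeling function labels (does not abstain on)."""
    return (L != ABSTAIN).mean(axis=0)

def lf_conflicts(L):
    """Fraction of examples where an LF's label disagrees with at least one
    other non-abstaining LF on the same example."""
    n, m = L.shape
    conflicts = np.zeros(m)
    for j in range(m):
        for i in range(n):
            if L[i, j] == ABSTAIN:
                continue
            others = L[i, np.arange(m) != j]
            if np.any((others != ABSTAIN) & (others != L[i, j])):
                conflicts[j] += 1
    return conflicts / n
```

Low coverage suggests the keyword lists are too narrow; high conflict flags functions that disagree with the rest and may need refining.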

Logistic Regression Classifier’s performance
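The downstream classifier stage can be sketched as below, under the assumption that the label model's outputs are converted to hard labels and a TF-IDF plus logistic regression pipeline is trained on them. The feature choices, hyperparameters, and tiny stand-in data are illustrative, not the project's exact settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny synthetic stand-in for tweets labeled by the label model
texts = [
    "we must leave the eu now",
    "remain in the eu",
    "leave means leave",
    "proud to remain european",
]
labels = [0, 1, 0, 1]  # hypothetical LEAVE=0 / REMAIN=1 hard labels

# TF-IDF features feeding a logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
preds = clf.predict(["i want to leave"])
```

Training the end classifier on the weakly generated labels is what lets it generalize beyond the keyword rules encoded in the labeling functions.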

References