Weakly Supervised Learning: Overture

Mohan Dogra
4 min read · Dec 5, 2020


Introduction

The domain of AI has grown drastically in recent years, producing state-of-the-art models. However, most of these models rely on massive sets of hand-labeled data. This manual labeling is extremely time-consuming and expensive: it can take person-months or even years to clean and assemble a dataset. And the data itself evolves in the real world, so labels may need updating from time to time.
For these reasons, practitioners are turning towards a weaker form of supervision: generating training data from patterns, rules, external knowledge, or other classifiers.

Methods for retrieving more labeled training data. Source

Weakly Supervised Learning Types

  • Incomplete
    Only a subset of the training data carries labels. With human intervention, active learning asks a domain expert to label the queried examples, so the labeling cost depends only on the number of queries. Without human intervention, semi-supervised methods such as generative models and transductive SVMs exploit the unlabeled data directly.
  • Inexact
    Only coarse-grained labels are available, the setting addressed by multi-instance learning (common in deep learning, especially with CNNs). The user provides bags of instances: a bag is labeled positive if at least one instance in it is positive, and negative if all instances in it are negative. The goal is to predict the labels of unseen bags (a minimal sketch of this bag-labeling rule follows the list).
  • Inaccurate
    The given labels themselves are noisy or wrong. Ensemble-style systems identify potentially mislabeled examples and verify them against the rest of the training set to correct them. Methods such as crowdsourcing are a cost-effective way to collect such (imperfect) labels.
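
To make the inexact setting concrete, here is a minimal sketch of the bag-labeling rule described above (plain Python with NumPy; the instance scores are made up and would normally come from a weak instance-level classifier):

import numpy as np

def label_bag(instance_scores, threshold=0.5):
    # Multi-instance rule: the bag is positive (1) if at least one
    # instance score crosses the threshold, otherwise negative (0).
    return int(np.any(np.asarray(instance_scores) >= threshold))

# Toy bags of per-instance scores
bags = {
    "bag_a": [0.1, 0.7, 0.2],   # one positive instance -> bag is positive
    "bag_b": [0.05, 0.2, 0.3],  # all instances negative -> bag is negative
}

for name, scores in bags.items():
    print(name, "->", label_bag(scores))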

Supervision: ML with Snorkel

Snorkel is a framework for learning from multiple sources of inaccurate or noisy labels. Instead of relying on hand-labeled data, it asks the user to write Labeling Functions (LFs) that label subsets of the unlabeled data. These LFs can encode patterns, external data resources, noisy labels, weak classifiers, etc.
The best part: if the goal of our data modeling changes, we can adapt quickly by tweaking our labeling functions instead of re-labeling data by hand.

Labeling Functions in ML. Source

Since the data is noisy, the outputs of different labeling functions can overlap and conflict. To handle this, Snorkel follows the pipeline below:

  1. Apply the LFs to the unlabeled data.
  2. Learn the accuracies of the LFs, without any ground-truth labels, and weight their outputs accordingly using a generative model.
  3. Use the generative model's output, a set of probabilistic training labels, to train a powerful discriminative model (a minimal end-to-end sketch follows this list).
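
Concretely, these three steps map onto a handful of Snorkel calls. The toy script below is only a sketch: the example texts, keyword heuristics, and LF names are invented for illustration, while the labeling_function decorator, PandasLFApplier, and LabelModel are the Snorkel pieces typically used for steps 1-3.

import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NEG, POS = -1, 0, 1

@labeling_function()
def lf_keyword_cause(x):
    # Weak heuristic: the word "cause" suggests a positive example
    return POS if "cause" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_keyword_not(x):
    # Weak heuristic: the word "not" suggests a negative example
    return NEG if "not" in x.text.lower() else ABSTAIN

# A tiny stand-in for the unlabeled training set
df_train = pd.DataFrame({"text": [
    "Smoking causes cancer",
    "Aspirin does not cause headaches",
    "Ibuprofen relieves pain",
    "Nicotine causes addiction",
]})

# 1. Apply the LFs to the unlabeled data -> label matrix (one column per LF, -1 = abstain)
applier = PandasLFApplier(lfs=[lf_keyword_cause, lf_keyword_not])
L_train = applier.apply(df=df_train)

# 2. Fit the generative label model; it estimates each LF's accuracy from
#    agreements and disagreements alone, with no ground-truth labels
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=200, seed=123)

# 3. Probabilistic training labels for a downstream discriminative model
probs_train = label_model.predict_proba(L=L_train)
print(probs_train)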

Example of Labeling Functions:

Example of Labeling Functions. Source

Here, lf1 sets a labeling condition on the presence of chemicals, labeling an example 1 if the condition holds and 0 otherwise. Our lf2 uses a regular-expression pattern to search for the presence of 'cause' and labels the subset accordingly.
In addition, there can be a range of task-specific labeling functions based on heuristics, regex patterns, and other label-generating strategies.
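
As a rough reconstruction of the figure (not the exact code shown in the image; the keyword set, the field name x.text, and the 1/0 label convention follow the description above and are assumptions), lf1 and lf2 might look like this in Snorkel:

import re
from snorkel.labeling import labeling_function

POSITIVE, NEGATIVE = 1, 0  # in practice, LFs often return -1 (abstain) instead of forcing a 0

CHEMICALS = {"aspirin", "ibuprofen", "nicotine"}  # assumed keyword list

@labeling_function()
def lf1_contains_chemical(x):
    # Label 1 if the text mentions a known chemical, 0 otherwise
    return POSITIVE if set(x.text.lower().split()) & CHEMICALS else NEGATIVE

@labeling_function()
def lf2_mentions_cause(x):
    # Label 1 if a "cause"/"causes" pattern appears in the text, 0 otherwise
    return POSITIVE if re.search(r"\bcauses?\b", x.text, flags=re.IGNORECASE) else NEGATIVE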

Example of Generative Model:

Generative models. Source

This estimated generative model is applied over the labeling functions' outputs to produce probabilistic labels, which are then used to train a noise-aware discriminative model. We minimize the discriminative model's expected loss with respect to the probabilistic labels P(y | L), i.e., the estimated distribution over the true label y given the labeling-function outputs L.
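
One way to read that objective: the discriminative model is trained against the probabilistic labels, minimizing the expected cross-entropy under the label model's distribution. A minimal NumPy sketch for the binary case (the numbers are made up):

import numpy as np

def noise_aware_loss(model_probs, prob_labels, eps=1e-9):
    # Expected cross-entropy of the discriminative model's predictions
    # under the probabilistic labels from the generative/label model
    p = np.clip(model_probs, eps, 1 - eps)   # discriminative model's P(y=1 | x)
    q = np.asarray(prob_labels)              # label model's P(y=1 | L)
    per_example = -(q * np.log(p) + (1 - q) * np.log(1 - p))
    return per_example.mean()

# Toy example: three training points
disc_preds = np.array([0.8, 0.3, 0.6])    # current discriminative model outputs
prob_labels = np.array([0.9, 0.1, 0.5])   # probabilistic labels from the label model
print(noise_aware_loss(disc_preds, prob_labels))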

Highlights of Weakly-Supervised Learning

  • A Stanford study compared the productivity of subject matter experts (SMEs) using Snorkel vs. spending the equivalent time hand-labeling data; it concluded that weak supervision built models 2.8x faster with 45.5% better predictive results.
  • For text and image tasks, Snorkel delivered a 132% improvement over baseline techniques.

Resources:
