Amazon Researchers Propose “ALLIE”: A Novel Framework to Address the Challenges of Active Learning on Large-Scale Imbalanced Graphs

This research summary article is based on the paper 'ALLIE: Active learning on large-scale imbalanced graphs'


Social network analysis, financial fraud detection, molecular design, search engines, and recommender systems all involve graph-structured data. In contrast to classical pointwise or pairwise models, graph neural networks (GNNs) have recently emerged as state-of-the-art models for these types of datasets because they can learn and aggregate complicated relationships across (k-hop) neighborhoods.

Despite these advantages, GNNs, like other deep learning models, require a large amount of labeled data for supervised training. In many domains, obtaining enough labeled data is time-consuming, labor-intensive, and expensive, which limits the use of GNNs.

Active Learning (AL) is a promising technique for obtaining labels faster and more cheaply and for training models efficiently. AL dynamically queries candidate samples for labeling to maximize the performance of the machine learning model within a labeling budget. Recent advances in AL on graphs have proven beneficial on various benchmark datasets, e.g., citation graphs and gene networks.
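
The query loop described above can be illustrated with a minimal pool-based sketch. It is not ALLIE's method: the one-dimensional data, the nearest-centroid "model", and the helper names (fit_centroids, predict_proba, active_learning) are all assumptions made for illustration, and the query rule is plain uncertainty sampling. The sketch also assumes the seed set contains at least one sample of each class.

```python
import math


def fit_centroids(xs, ys):
    """Toy stand-in for model training: the mean feature value of each class."""
    means = {}
    for c in (0, 1):
        members = [x for x, y in zip(xs, ys) if y == c]
        means[c] = sum(members) / len(members)  # assumes both classes are present
    return means


def predict_proba(means, x):
    """P(class 1) from a softmax over negative distances to the two centroids."""
    s0 = math.exp(-abs(x - means[0]))
    s1 = math.exp(-abs(x - means[1]))
    return s1 / (s0 + s1)


def active_learning(pool, oracle, seed_idx, budget):
    """Query up to `budget` labels, one per round, taking the most uncertain sample."""
    labeled = {i: oracle(pool[i]) for i in seed_idx}
    for _ in range(budget):
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        # Retrain on everything labeled so far.
        means = fit_centroids([pool[i] for i in labeled],
                              [labeled[i] for i in labeled])
        # Most uncertain sample: predicted probability closest to 0.5.
        query = min(unlabeled,
                    key=lambda i: abs(predict_proba(means, pool[i]) - 0.5))
        labeled[query] = oracle(pool[query])  # ask the (hypothetical) annotator
    return labeled


pool = [i / 10 for i in range(11)]
labels = active_learning(pool, oracle=lambda x: int(x > 0.5),
                         seed_idx=[0, 10], budget=2)
print(list(labels))  # indices in the order they were labeled
```

With this toy pool, both queries land next to the decision boundary rather than in regions the model already classifies confidently, which is the intended effect of spending a labeling budget on informative samples.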

However, little research has been done on AL approaches for large-scale imbalanced settings (e.g., discovering a small fraction of fake reviews on an e-commerce site). This motivates research into how to query the “most informative” data to reduce the training cost of GNNs and mitigate the impact of imbalance.

Training GNNs on imbalanced graphs with AL is not easy. Standard AL methods are unlikely to select the underrepresented positive samples, so a low prevalence of positives prevents them from learning the full data distribution. For example, finding abusive reviews on a shopping website can be modeled as a binary classification problem in which positive samples (i.e., abusive reviews) account for a very small fraction of the labeled data.

When an AL model is trained to select reviews for labeling, it will mostly return non-abusive reviews, which yields only a modest improvement in model performance. Most AL sampling strategies proposed in natural language processing and computer vision to balance the class distribution assume independent and identically distributed data. Because of the relational structure and dense links in graphs, these methods are not directly applicable to graph-structured data.

Creating an AL method for large graph data is also difficult. Popular social media platforms (like Facebook and Snapchat) have hundreds of millions of monthly active users, while e-commerce sites (like Amazon and Walmart) contain millions of products and process billions of transactions. At this scale, searching all unlabeled samples in the graph is impractical because the computational complexity of AL techniques grows exponentially with the size of the unlabeled set. Hence, it is crucial to reduce the search space for AL algorithms on large-scale graphs.

To address both of these issues, Amazon researchers propose Active Learning on Large-scale Imbalanced Graphs (ALLIE), a technique that combines AL on graphs with reinforcement learning for accurate and efficient node classification. Using multiple uncertainty measures as criteria, ALLIE can effectively select informative unlabeled samples for labeling. In addition, the method prioritizes less confident and “underrepresented” samples.
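
The article does not say which uncertainty measures ALLIE uses or how it combines them. As a hedged illustration, the sketch below implements three measures that are common in the active learning literature (entropy, least confidence, and margin) over a classifier's predicted class probabilities. The function names, the rank_by helper, and the example predictions are assumptions made for this sketch.

```python
import math


def entropy(probs):
    """Shannon entropy of the predicted distribution; higher means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def least_confidence(probs):
    """One minus the probability of the most likely class."""
    return 1.0 - max(probs)


def margin_uncertainty(probs):
    """One minus the gap between the two most likely classes."""
    top, second = sorted(probs, reverse=True)[:2]
    return 1.0 - (top - second)


def rank_by(measure, predictions):
    """Node ids sorted from most to least uncertain under `measure`."""
    return sorted(predictions, key=lambda n: measure(predictions[n]), reverse=True)


# Hypothetical predicted class probabilities for three unlabeled nodes.
predictions = {"a": [0.5, 0.5], "b": [0.9, 0.1], "c": [0.7, 0.3]}

for measure in (entropy, least_confidence, margin_uncertainty):
    print(measure.__name__, rank_by(measure, predictions))
```

In the binary case all three measures produce the same ranking. They can disagree once there are more than two classes, which is one reason to use several of them as criteria.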


The researchers equip ALLIE with a graph coarsening mechanism that groups similar nodes into clusters so that the approach scales to very large graphs. With a better representation of the nodes in each cluster, the search space for the AL algorithm shrinks. According to the researchers, this is the first study to model the imbalance problem in active learning on large-scale graphs.
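
The idea of shrinking the search space by clustering can be sketched as follows. The article does not specify ALLIE's coarsening algorithm, so this example swaps in a simple greedy neighbor matching (each unassigned node merges with one unassigned neighbor). The adjacency-dict representation and the function names are assumptions for illustration only.

```python
def coarsen_once(adj):
    """Greedy matching: adj maps node -> set of neighbours; returns node -> cluster id."""
    cluster = {}
    next_id = 0
    for u in sorted(adj):
        if u in cluster:
            continue
        cluster[u] = next_id
        # Merge u with its first neighbour that is not yet in any cluster.
        for v in sorted(adj[u]):
            if v not in cluster:
                cluster[v] = next_id
                break
        next_id += 1
    return cluster


def representatives(cluster):
    """One candidate node per cluster (the first node assigned to it)."""
    reps = {}
    for node, cid in cluster.items():
        reps.setdefault(cid, node)
    return reps


# A path graph 0-1-2-3-4-5 as a toy example.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
cluster = coarsen_once(adj)
reps = representatives(cluster)
print(len(adj), "nodes ->", len(reps), "candidates")
```

A query strategy would then score one candidate per cluster instead of every node. Applying the same step repeatedly to the coarsened graph would shrink the candidate set further.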

The team’s contributions are as follows:

Imbalance-aware, reinforcement-learning-based graph policy network: The team uses a reinforcement learning technique to discover a representative subset of the unlabeled dataset by optimizing the classifier’s performance. The queried nodes are more representative of the minority class.

Graph coarsening strategy to deal with large graph data: Existing approaches rarely consider scalability, making them inefficient in real-world scenarios. The researchers use a graph coarsening approach to shrink the action space of the policy network and reduce runtime.

Robust learning for more accurate node classification: The researchers build a node classifier with focal loss, which down-weights well-classified samples, in contrast to traditional approaches that do not distinguish between majority and minority classes when optimizing the objective function.
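
The down-weighting in the third contribution can be made concrete with the standard binary focal loss, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). The sketch below is a plain-Python illustration, not ALLIE's implementation. The defaults gamma=2.0 and alpha=0.25 are the commonly cited values from the original focal loss work and are an assumption here, since the article does not give ALLIE's settings.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted positive probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(max(p_t, 1e-12))


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p            # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)           # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)   # positive sample the model already gets right
hard = focal_loss(0.1, 1)   # positive sample the model gets wrong
print("focal hard/easy ratio:", hard / easy)
print("cross-entropy hard/easy ratio:", cross_entropy(0.1, 1) / cross_entropy(0.9, 1))
```

With gamma set to 0 the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy. Larger gamma values shift more of the total loss onto hard, misclassified samples.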

ALLIE has been tested on both balanced and imbalanced datasets. The balanced datasets are based on publicly available citation graphs, while the imbalanced datasets come from a private e-commerce website. On both kinds of datasets, the researchers report node classification performance.

According to the results, ALLIE improved on the best baseline by an average of 2.39 percent in Macro F1 and 2.71 percent in Micro F1 on the balanced graph datasets. On the e-commerce dataset, ALLIE improved the positive classes (abusive users and reviews) by an average of 4.75 percent in precision, 1.96 percent in recall, and 3.45 percent in F1 (relative improvements of 10.54 percent, 3.7 percent, and 7.71 percent, respectively) over the best baseline. The team also performed a detailed ablation study to highlight the importance of each component of ALLIE. According to additional experiments, ALLIE outperforms the baselines across a variety of initial training set sizes and query budgets.
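
Macro F1 and Micro F1 can diverge sharply when classes are imbalanced. The sketch below computes both from scratch for single-label predictions. The toy labels are invented for illustration and have nothing to do with the paper's data, and the helper names are assumptions.

```python
def _f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_all = fp_all = fn_all = 0
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        per_class.append(_f1(tp, fp, fn))
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class) / len(per_class)   # every class counts equally
    micro = _f1(tp_all, fp_all, fn_all)       # every sample counts equally
    return macro, micro


# Toy case: a model that always predicts the majority class 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
macro, micro = f1_scores(y_true, y_pred)
print(round(macro, 3), round(micro, 3))
```

In this toy case Micro F1 is 0.8 while Macro F1 is about 0.44, because the macro average gives the missed minority class the same weight as the majority class.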


In a recent study, Amazon researchers present ALLIE, a novel active learning framework for large imbalanced graphs. ALLIE uses a graph policy network to query candidate nodes for labeling by maximizing the long-term performance of the GNN classifier. Compared to numerous state-of-the-art approaches, ALLIE copes better with imbalanced data distributions thanks to two balancing mechanisms. ALLIE also features a graph coarsening module, making it scalable to large-scale applications. The strong performance of ALLIE is demonstrated by experiments on three benchmark datasets and one real-world e-commerce dataset.



