DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification (1609.02521v1)

Published 8 Sep 2016 in stat.ML and cs.LG

Abstract: Extreme multi-label classification refers to supervised multi-label learning involving hundreds of thousands or even millions of labels. Datasets in extreme classification exhibit a fit to the power-law distribution, i.e. a large fraction of labels have very few positive instances in the data distribution. Most state-of-the-art approaches for extreme multi-label classification attempt to capture correlation among labels by embedding the label matrix into a low-dimensional linear sub-space. However, in the presence of power-law distributed, extremely large and diverse label spaces, structural assumptions such as low rank can be easily violated. In this work, we present DiSMEC, a large-scale distributed framework for learning one-versus-rest linear classifiers coupled with explicit capacity control to limit model size. Unlike most state-of-the-art methods, DiSMEC does not make any low-rank assumptions on the label matrix. Using a double layer of parallelization, DiSMEC can learn classifiers for datasets consisting of hundreds of thousands of labels within a few hours. The explicit capacity control mechanism filters out spurious parameters, keeping the model compact in size without losing prediction accuracy. We conduct extensive empirical evaluation on publicly available real-world datasets consisting of up to 670,000 labels. We compare DiSMEC with recent state-of-the-art approaches, including SLEEC, a leading approach for learning sparse local embeddings, and FastXML, a tree-based approach optimizing a ranking-based loss function. On some of the datasets, DiSMEC can significantly boost prediction accuracies - 10% better compared to SLEEC and 15% better compared to FastXML, in absolute terms.

Citations (248)

Summary

  • The paper introduces a scalable one-versus-rest framework that leverages explicit capacity control and pruning to maintain compact models.
  • It employs a dual-layer parallelization strategy to enable fast training on datasets with up to 670,000 labels.
  • Empirical results demonstrate up to a 15% precision boost and orders of magnitude reduction in model size compared to competing methods.

Overview of "DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification"

The paper "DiSMEC - Distributed Sparse Machines for Extreme Multi-label Classification" by Rohit Babbar and Bernhard Schölkopf presents a novel framework specifically designed to tackle the challenges of Extreme Multi-label Classification (XMC). XMC involves supervised learning scenarios where the number of target labels can be in the hundreds of thousands or millions, calling for advanced computational strategies.

The Problem Context

In XMC, label frequencies follow a power-law distribution: a large fraction of labels have only a handful of positive instances. This compromises the efficacy of traditional methods, such as embedding approaches, which assume a low-rank structure across the label space. That assumption is often unrealistic given the diversity and sparsity of label assignments in extreme classification tasks.
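To make the power-law property concrete, the short sketch below counts positive instances per label and reports how many labels sit in the long tail. The matrix `Y` and the random data are illustrative assumptions, not a dataset from the paper.

```python
# Sketch: inspecting the label-frequency distribution of a multi-label dataset.
# Y is assumed to be a scipy.sparse CSR matrix of shape (n_samples, n_labels)
# with binary entries.
import numpy as np
import scipy.sparse as sp

def label_frequencies(Y: sp.csr_matrix) -> np.ndarray:
    """Number of positive instances per label, sorted in decreasing order."""
    counts = Y.getnnz(axis=0)          # positives per label (binary matrix)
    return np.sort(counts)[::-1]

# A heavy-tailed (power-law-like) distribution shows up as a long tail of
# labels with only a handful of positives.
Y = sp.random(10_000, 5_000, density=0.001, format="csr")
freqs = label_frequencies(Y)
print(f"labels with <= 5 positives: {np.mean(freqs <= 5):.1%}")
```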

Introduction to DiSMEC

DiSMEC introduces a scalable solution that leverages a distributed framework for learning one-versus-rest linear classifiers without relying on low-rank assumptions. Key innovations are its explicit capacity control mechanisms to avoid model overfitting and inefficiencies, and a dual-layer parallelization that enhances scalability over large datasets. The double parallelization enables rapid training, even on datasets with up to 670,000 labels, by strategically distributing computations across multiple cores and nodes.
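A minimal sketch of such a double layer of parallelization is shown below, assuming scikit-learn's LinearSVC as the per-label binary learner and joblib for distribution. The function names, batch size, and worker counts are illustrative stand-ins, not the paper's actual implementation (which distributes label batches across nodes and, within each node, across cores).

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC

def train_one_label(X, y_col):
    """Train one binary one-vs-rest classifier; skip labels with no positives."""
    if y_col.sum() == 0:
        return np.zeros(X.shape[1])
    clf = LinearSVC(C=1.0)
    clf.fit(X, y_col)
    return clf.coef_.ravel()

def train_label_batch(X, Y_batch, n_threads=4):
    """Inner layer: the labels of one batch are trained in parallel."""
    return Parallel(n_jobs=n_threads, prefer="threads")(
        delayed(train_one_label)(X, Y_batch[:, j])
        for j in range(Y_batch.shape[1])
    )

def train_all_labels(X, Y, batch_size=1000, n_workers=8):
    """Outer layer: label batches are farmed out to separate worker processes."""
    batches = [Y[:, s:s + batch_size] for s in range(0, Y.shape[1], batch_size)]
    weights = Parallel(n_jobs=n_workers)(
        delayed(train_label_batch)(X, B) for B in batches
    )
    return np.vstack([w for batch in weights for w in batch])
```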

Methodology

The DiSMEC framework operates by pruning spurious parameters, thus maintaining model compactness, which is crucial given the potentially massive size of the models involved. This pruning, discussed as a model sparsity step, leads to significant reductions in model size without a loss in predictive performance. The authors demonstrate this by applying DiSMEC to several large-scale public datasets and showcasing its predictive capabilities and computational efficiency.
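The pruning step can be pictured as a simple magnitude threshold applied after training, with the surviving weights stored in sparse form. The sketch below illustrates this idea; the threshold value and the random weight matrix are assumptions for demonstration, not the paper's exact procedure.

```python
import numpy as np
import scipy.sparse as sp

def prune_weights(W: np.ndarray, delta: float = 0.01) -> sp.csr_matrix:
    """Zero out near-zero weights and return a compact sparse matrix."""
    W_pruned = np.where(np.abs(W) < delta, 0.0, W)
    return sp.csr_matrix(W_pruned)

# Example: a dense (n_labels x n_features) weight matrix with many tiny
# entries shrinks dramatically once pruned and stored sparsely.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.005, size=(1000, 5000))
W[rng.random(W.shape) < 0.01] += 1.0   # a few informative weights survive pruning
W_sparse = prune_weights(W)
print(f"nonzeros kept: {W_sparse.nnz} of {W.size}")
```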

Empirical Results

DiSMEC outperforms other leading approaches, such as SLEEC and FastXML, both in precision (gains of up to 15% in absolute terms on some datasets) and in model size (reductions of orders of magnitude). This performance gain is attributed to DiSMEC's explicit capacity control via the pruning of non-informative weights, which, unlike in other one-vs-rest frameworks, does not hinder accuracy.
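Comparisons of this kind are typically reported as precision@k (P@1, P@3, P@5) in the XMC literature. The sketch below shows one common way to compute the metric; the array names and toy data are assumptions for illustration.

```python
import numpy as np

def precision_at_k(scores: np.ndarray, Y_true: np.ndarray, k: int = 5) -> float:
    """Average fraction of correct labels among each sample's top-k predictions."""
    topk = np.argsort(-scores, axis=1)[:, :k]        # k highest-scoring labels per sample
    hits = np.take_along_axis(Y_true, topk, axis=1)  # 1 where a top-k label is relevant
    return hits.sum(axis=1).mean() / k

scores = np.random.rand(4, 10)                       # toy classifier scores
Y_true = (np.random.rand(4, 10) > 0.7).astype(int)   # toy binary ground truth
print(f"P@5 = {precision_at_k(scores, Y_true, k=5):.3f}")
```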

Theoretical Contributions and Practical Implications

Practically, DiSMEC presents a viable model for industries that must process large-scale classification efficiently, including fields like e-commerce, document tagging, and recommendation systems. The compact models and the distributed prediction mechanism further suggest potential for real-time applications.

Theoretically, DiSMEC challenges the existing reliance on low-rank assumptions in large-scale multi-label learning and provides empirical evidence for alternative strategies in handling power-law-distributed data. It suggests future work in understanding label correlation degrees and fine-tuning sparsity controls for enhanced model performance.

Conclusion

Overall, DiSMEC contributes a methodologically sound and empirically validated approach to XMC. Its innovative use of distributed computing and effective model pruning mechanisms set it apart, highlighting promising pathways for the deployment of XMC algorithms in real-world, computationally constrained environments. Future directions could explore further optimizations in handling tail labels, developing criteria for choosing between different methodological approaches based on data distribution, and examining the impacts of power-law exponents more deeply.
