
Classification without labels: Learning from mixed samples in high energy physics (1708.02949v3)

Published 9 Aug 2017 in hep-ph, hep-ex, and stat.ML

Abstract: Modern machine learning techniques can be used to construct powerful models for difficult collider physics problems. In many applications, however, these models are trained on imperfect simulations due to a lack of truth-level information in the data, which risks the model learning artifacts of the simulation. In this paper, we introduce the paradigm of classification without labels (CWoLa) in which a classifier is trained to distinguish statistical mixtures of classes, which are common in collider physics. Crucially, neither individual labels nor class proportions are required, yet we prove that the optimal classifier in the CWoLa paradigm is also the optimal classifier in the traditional fully-supervised case where all label information is available. After demonstrating the power of this method in an analytical toy example, we consider a realistic benchmark for collider physics: distinguishing quark- versus gluon-initiated jets using mixed quark/gluon training samples. More generally, CWoLa can be applied to any classification problem where labels or class proportions are unknown or simulations are unreliable, but statistical mixtures of the classes are available.

Citations (198)

Summary

  • The paper introduces Classification without Labels (CWoLa) as a method that trains classifiers using mixed samples without individual event labels.
  • The approach relies on the likelihood ratio between two mixed samples being a monotonic function of the likelihood ratio between pure samples, so the optimal mixed-sample classifier is also optimal for the fully supervised task.
  • Empirical results on a Gaussian toy model and on quark/gluon discrimination show CWoLa performing competitively with fully supervised training.

Classification without Labels: Learning from Mixed Samples in High Energy Physics

The paper "Classification without labels: Learning from mixed samples in high energy physics" by Eric M. Metodiev, Benjamin Nachman, and Jesse Thaler introduces a classification strategy for collider physics known as Classification without Labels (CWoLa). The strategy addresses a limitation of traditional supervised learning: its dependence on labeled data, which are often available only from imperfect simulations. In high energy physics, these simulations can introduce artifacts that degrade model accuracy when the model is applied to real data.

The CWoLa approach leverages mixed, unlabeled samples of collider events to train classifiers effectively without requiring individual event labels or known class proportions. The authors demonstrate theoretically that under conditions where the only available data are statistical mixtures of signal and background classes, training a classifier to discriminate between these mixtures yields the same optimal classifier as would be obtained in a fully-supervised setting.

The result rests on a short likelihood-ratio argument. Specifically, the authors prove that the likelihood ratio for distinguishing two mixed samples is a monotonic function of the likelihood ratio for distinguishing pure signal from pure background, provided the samples have different class proportions. Because a monotonic rescaling does not change the ordering of events, the optimal classifier for the mixed samples is also optimal for the pure classes. A notable consequence is that neither the signal nor the background fraction needs to be known a priori for training, which simplifies the data requirements compared to other weak supervision methods such as Learning from Label Proportions (LLP).
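The argument can be written out in a few lines. The following is a sketch in notation chosen here for illustration (the symbols p_S, p_B, f_1, f_2 and L are assumptions of this sketch, not quoted from the paper): p_S and p_B are the signal and background densities, and f_1 > f_2 are the unknown signal fractions of the two mixed samples M_1 and M_2.

```latex
\begin{align}
  p_{M_i}(x) &= f_i\, p_S(x) + (1 - f_i)\, p_B(x), \qquad i = 1, 2, \\
  L_{S/B}(x) &= \frac{p_S(x)}{p_B(x)}, \\
  L_{M_1/M_2}(x) &= \frac{p_{M_1}(x)}{p_{M_2}(x)}
    = \frac{f_1\, L_{S/B}(x) + (1 - f_1)}{f_2\, L_{S/B}(x) + (1 - f_2)}, \\
  \frac{\partial L_{M_1/M_2}}{\partial L_{S/B}}
    &= \frac{f_1 - f_2}{\bigl(f_2\, L_{S/B} + 1 - f_2\bigr)^2} > 0
    \quad \text{for } f_1 > f_2 .
\end{align}
```

Since the mixed-sample ratio increases strictly with the pure-sample ratio whenever f_1 > f_2, thresholding one is equivalent to thresholding the other, and the values of f_1 and f_2 never need to be known to train.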

Empirical validation is provided first through a toy model using Gaussian-distributed samples, which shows that CWoLa is robust to class impurity, particularly when sufficient training data are available. The methodology is then applied to the practical challenge of quark/gluon discrimination, a significant problem where simulation limitations are consequential. In this context, a neural network is trained using input features based on generalized angularities, and its performance under CWoLa conditions is compared against fully-supervised models. The results indicate competitive performance, suggesting that CWoLa is practically viable even in high-dimensional feature spaces.
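A toy experiment in the spirit of the Gaussian example can be sketched as follows. This is a minimal illustration, not the authors' setup: the two-dimensional Gaussians, the signal fractions 0.8 and 0.2, the sample sizes, and the plain-NumPy logistic regression are all assumptions made here, and every function name is hypothetical.

```python
import numpy as np

MU_SIGNAL = np.array([1.0, 0.5])   # assumed signal mean (illustrative)
MU_BACKGROUND = -MU_SIGNAL         # assumed background mean (illustrative)

def make_mixed_sample(n, signal_fraction, rng):
    """Draw n events; each is signal with probability signal_fraction."""
    is_signal = rng.random(n) < signal_fraction
    means = np.where(is_signal[:, None], MU_SIGNAL, MU_BACKGROUND)
    return rng.normal(means, 1.0), is_signal

def fit_logistic(x, y, steps=500, lr=0.5):
    """Logistic regression by gradient descent on the mean cross-entropy."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        w -= lr * (x.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def auc(scores, is_signal):
    """Rank-based AUC: chance that a signal event outscores a background event."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_sig = is_signal.sum()
    n_bkg = len(is_signal) - n_sig
    return (ranks[is_signal].sum() - n_sig * (n_sig + 1) / 2) / (n_sig * n_bkg)

rng = np.random.default_rng(0)

# CWoLa training: the only "labels" are which mixed sample an event came from.
x1, _ = make_mixed_sample(20000, 0.8, rng)   # signal-rich mixture
x2, _ = make_mixed_sample(20000, 0.2, rng)   # signal-poor mixture
x_mixed = np.concatenate([x1, x2])
y_mixed = np.concatenate([np.ones(len(x1)), np.zeros(len(x2))])
w_cwola, b_cwola = fit_logistic(x_mixed, y_mixed)

# Fully supervised reference: same model trained on true per-event labels.
x_sup, truth_sup = make_mixed_sample(40000, 0.5, rng)
w_sup, b_sup = fit_logistic(x_sup, truth_sup.astype(float))

# Evaluate both on an independent sample using true labels.
x_test, truth_test = make_mixed_sample(20000, 0.5, rng)
cwola_auc = auc(x_test @ w_cwola + b_cwola, truth_test)
supervised_auc = auc(x_test @ w_sup + b_sup, truth_test)
print(f"CWoLa AUC: {cwola_auc:.3f}  supervised AUC: {supervised_auc:.3f}")
```

The true labels returned for the two mixed samples are deliberately discarded during CWoLa training; they are used only in the final evaluation step.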

The paper also discusses operational aspects of employing CWoLa. While label information is not necessary during training, some is required during testing to determine classifier operating points and performance metrics such as ROC curves. This can simply involve using a small set of labeled data or assuming class proportion estimates based on simpler, well-understood simulations or theoretical calculations.
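One way to set an operating point from a small labeled set is sketched below. It is an illustrative assumption rather than a procedure taken from the paper: the scores are synthetic stand-ins for classifier outputs, the 50% signal-efficiency target is arbitrary, and the function names are hypothetical.

```python
import numpy as np

def threshold_for_signal_efficiency(scores, is_signal, target_eff):
    """Score cut such that about target_eff of labeled signal has score > cut."""
    return np.quantile(scores[is_signal], 1.0 - target_eff)

def roc_point(scores, is_signal, cut):
    """Return (signal efficiency, background efficiency) for a given cut."""
    passed = scores > cut
    return passed[is_signal].mean(), passed[~is_signal].mean()

rng = np.random.default_rng(1)

# Stand-in for classifier scores on a small labeled calibration set (assumed).
is_signal = rng.random(2000) < 0.5
scores = rng.normal(np.where(is_signal, 1.0, -1.0), 1.0)

cut = threshold_for_signal_efficiency(scores, is_signal, target_eff=0.5)
tpr, fpr = roc_point(scores, is_signal, cut)
print(f"cut={cut:.3f}  signal eff={tpr:.3f}  background eff={fpr:.3f}")
```

Scanning `target_eff` over a grid of values and collecting the resulting pairs traces out a ROC curve from the same labeled set.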

The main implication is that CWoLa offers a path towards training classifiers directly on real data, which avoids dependence on simulation inaccuracies. It could therefore enable more reliable data analysis in high energy physics experiments, and it may extend to other scientific fields that face similar challenges with mixed or incompletely labeled data.

Future directions may include refining the CWoLa framework for more complex classification tasks, exploring its applicability across a broader range of datasets, and integrating it with adversarial learning techniques to improve classifier robustness. As machine learning becomes more widely used in scientific research, approaches like CWoLa that reduce dependence on simulations are likely to gain prominence.