Efficient semi-supervised inference for logistic regression under case-control studies (2402.15365v1)

Published 23 Feb 2024 in stat.ML and cs.LG

Abstract: Semi-supervised learning has received increasing attention in statistics and machine learning. In semi-supervised learning settings, a labeled data set with both outcomes and covariates and an unlabeled data set with covariates only are collected. We consider an inference problem in semi-supervised settings where the outcome in the labeled data is binary and the labeled data is collected by case-control sampling. Case-control sampling is an effective sampling scheme for alleviating imbalance in binary data. Under the logistic model assumption, case-control data can still provide a consistent estimator for the slope parameter of the regression model. However, the intercept parameter is not identifiable. Consequently, the marginal case proportion cannot be estimated from case-control data. We find that, with unlabeled data available, the intercept parameter can be identified in the semi-supervised learning setting. We construct the likelihood function of the observed labeled and unlabeled data and obtain the maximum likelihood estimator via an iterative algorithm. The proposed estimator is shown to be consistent, asymptotically normal, and semiparametrically efficient. Extensive simulation studies are conducted to show the finite sample performance of the proposed method. The results imply that the unlabeled data not only helps to identify the intercept but also improves the estimation efficiency of the slope parameter. Meanwhile, the marginal case proportion can be estimated accurately by the proposed method.

Summary

The paper introduces a novel semi-supervised method that integrates unlabeled data to resolve intercept identification challenges in case-control logistic regression.
It employs a maximum likelihood framework with an iterative algorithm, ensuring consistent, asymptotically normal, and semiparametrically efficient estimators.
Empirical results from simulations and the Pima Indians diabetes dataset confirm enhanced parameter estimation and improved prediction accuracy.

Efficient Semi-supervised Inference with Logistic Regression in Case-Control Studies

Introduction

Within semi-supervised learning, inference for logistic regression under case-control sampling presents both challenges and opportunities. Traditional methods use only the labeled data, so they discard whatever information the unlabeled covariates carry, even when those make up a large share of the available data. The paper shows how integrating unlabeled data improves the estimation of logistic regression parameters in case-control studies, where the labeled sample is drawn by biased sampling.

Problem Statement

Case-control studies are invaluable in epidemiology, especially when investigating rare diseases. However, they rely on biased sampling, which complicates the estimation of logistic regression parameters. The slope parameters remain identifiable from case-control data, but the intercept, and with it the marginal case proportion, does not, because the sample carries no information about how common cases are in the population. The paper sets out a semi-supervised learning framework in which unlabeled data both identifies the otherwise unidentifiable intercept and improves the efficiency of the slope estimates.
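The identifiability problem can be made concrete with a short, standard derivation. The notation here is an assumption, since this summary does not give the paper's own symbols: alpha and beta are the population intercept and slope, pi is the marginal case proportion, S = 1 marks inclusion in the case-control sample, and n_1 and n_0 are the numbers of sampled cases and controls.

```latex
\begin{align*}
P(Y = 1 \mid x) &= \frac{\exp(\alpha + \beta^\top x)}{1 + \exp(\alpha + \beta^\top x)},
  \qquad \pi = P(Y = 1), \\
\operatorname{logit} P(Y = 1 \mid x, S = 1)
  &= \alpha + \beta^\top x + \log \frac{P(S = 1 \mid Y = 1)}{P(S = 1 \mid Y = 0)} \\
  &= \underbrace{\alpha + \log \frac{n_1}{n_0} - \log \frac{\pi}{1 - \pi}}_{\alpha^*}
     + \beta^\top x .
\end{align*}
```

The sampled data therefore follow a logistic model with the same slope but a shifted intercept alpha*. Because pi is unknown and cannot be estimated from a sample whose case fraction is fixed by design, alpha cannot be recovered from alpha* alone. An unlabeled sample drawn from the marginal distribution of x supplies the missing link, since pi equals the population average of P(Y = 1 | x).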

Methodological Innovation

At the heart of the paper is a semi-supervised inference approach for a logistic regression model fitted to case-control data. The authors construct the likelihood function of the observed labeled and unlabeled data together, so that the estimator draws on the covariate information in the unlabeled sample. The maximum likelihood estimator (MLE) is computed with an iterative algorithm and is shown to be consistent, asymptotically normal, and semiparametrically efficient.
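The sketch below illustrates why unlabeled covariates identify the intercept. It is not the paper's likelihood or iterative algorithm, which this summary does not spell out. It uses a simpler plug-in argument instead: a logistic fit on case-control data returns the correct slope and a shifted intercept a* = alpha + log(n1/n0) - logit(pi), and the unlabeled covariates add the equation pi = mean of expit(alpha + beta*x), which is solved for pi by bisection. The function names, the single Gaussian covariate, and the sample sizes are all hypothetical choices made for the illustration.

```python
import numpy as np


def expit(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, iters=50):
    """Newton-Raphson logistic fit of y on (1, x); returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    theta = np.zeros(2)
    for _ in range(iters):
        p = expit(X @ theta)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, X.T @ (y - p))
        theta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    return theta[0], theta[1]


def solve_case_proportion(c, beta, x_unlab, iters=60):
    """Bisection root of g(pi) = mean(expit(c + logit(pi) + beta * x)) - pi."""
    lo, hi = 1e-6, 1 - 1e-6
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        g = np.mean(expit(c + np.log(mid / (1 - mid)) + beta * x_unlab)) - mid
        if g > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def run_demo(seed=0, alpha=-3.0, beta=2.0, n1=5000, n0=5000, n_unlab=20000):
    rng = np.random.default_rng(seed)
    # Simulated population following the logistic model (assumed setup).
    x_pop = rng.normal(size=200_000)
    y_pop = rng.binomial(1, expit(alpha + beta * x_pop))
    # Case-control sampling: fixed numbers of cases and controls.
    cases = rng.choice(np.flatnonzero(y_pop == 1), n1, replace=False)
    controls = rng.choice(np.flatnonzero(y_pop == 0), n0, replace=False)
    idx = np.concatenate([cases, controls])
    a_star, b_hat = fit_logistic(x_pop[idx], y_pop[idx].astype(float))
    # Unlabeled covariates drawn from the same marginal distribution.
    x_unlab = rng.normal(size=n_unlab)
    # Remove the known design offset, then solve for the case proportion.
    c = a_star - np.log(n1 / n0)
    pi_hat = solve_case_proportion(c, b_hat, x_unlab)
    a_hat = c + np.log(pi_hat / (1 - pi_hat))
    return {
        "slope": b_hat,
        "naive_intercept": a_star,
        "intercept": a_hat,
        "pi_hat": pi_hat,
        "pi_pop": y_pop.mean(),
    }


if __name__ == "__main__":
    for name, value in run_demo().items():
        print(f"{name}: {value:.3f}")
```

In this setup the naive intercept from the case-control fit sits far from the true value, while the corrected intercept and the estimated case proportion should land close to their population counterparts. The plug-in estimate makes no claim to the efficiency that the paper establishes for its MLE.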

Empirical Validation

The method is evaluated through extensive simulation studies and an application to the Pima Indians diabetes dataset. The simulations indicate gains in parameter estimation efficiency and prediction accuracy, and they highlight the role of unlabeled data in estimating the intercept parameter. The real-data application supports the method's practical usefulness for logistic regression analysis in case-control studies.

Theoretical Contributions

On the theoretical side, the paper addresses the identifiability problem for the intercept parameter in logistic regression under case-control sampling. It also derives the asymptotic properties of the proposed estimators, establishing their consistency, asymptotic normality, and semiparametric efficiency. These results extend the understanding of semi-supervised learning in this setting and provide a basis for further work on inference with partially labeled data in epidemiological studies and elsewhere.

Forward-looking Remarks

Unlabeled data is increasingly abundant, so semi-supervised methods are likely to grow in importance. The paper's findings show that unlabeled data can improve both the accuracy and the efficiency of inference in case-control studies. A natural next question is how the approach can be extended or adapted to other statistical models and sampling designs.

Conclusion

In conclusion, the paper proposes a semi-supervised inference method for logistic regression under case-control sampling that resolves the intercept identification problem and improves estimation efficiency. Its theoretical analysis, simulations, and real-data application together demonstrate the value of incorporating unlabeled data into the inference framework.