Toward Optimal Feature Selection in Naive Bayes for Text Categorization (1602.02850v1)

Published 9 Feb 2016 in stat.ML, cs.CL, cs.IR, and cs.LG

Abstract: Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on the Information Theory, which aims to rank the features with their discriminative capacity for classification. We first revisit two information measures: Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination ($MD$) and $MD-\chi^2$ methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.

Citations (229)

Summary

  • The paper introduces the Jeffreys-Multi-Hypothesis Divergence to optimize feature selection by maximizing class discrimination in text categorization.
  • It presents two algorithms, Maximum Discrimination (MD) and MD-χ², that efficiently reduce feature space while boosting classifier performance.
  • Results on datasets like 20-Newsgroups and Reuters demonstrate significant improvements in accuracy and F1 scores, validating the proposed approach.

Toward Optimal Feature Selection in Naive Bayes for Text Categorization

The paper presents an innovative framework for feature selection in text categorization with naive Bayes classifiers. The authors develop a method grounded in information theory that ranks features by their discriminative capacity, thereby enhancing classification performance. This work addresses the well-known challenge of high-dimensional text data by reducing the feature space while retaining, or even improving, the classifier's efficacy.

Key Contributions and Methodology

  1. Introduction of Jeffreys-Multi-Hypothesis Divergence (JMH-divergence): The paper extends traditional divergence measures that have predominantly been used for binary classification, such as the Kullback-Leibler (KL) and Jeffreys divergences. By generalizing the Jeffreys divergence to multiple class distributions, the JMH-divergence is introduced. This measure evaluates the discriminative information available across multiple categories and is thus particularly apt for the multi-class classification problems commonly encountered in text categorization.
  2. Novel Feature Selection Algorithms: Leveraging the JMH-divergence, the authors propose two efficient feature selection algorithms:
  • Maximum Discrimination (MD) Method: This approach ranks features by their contribution to maximizing the JMH-divergence, using a greedy strategy that iteratively selects the features contributing most to discrimination among the classes (a minimal illustrative sketch follows this list).
  • Asymptotic χ²-based Method (MD-χ²): Derived from the asymptotic distribution of the divergence measures, this method simplifies the selection process by exploiting the relationship between the divergence and the non-central χ² distribution, approximating the divergence score at much lower computational cost.
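The exact definition of the JMH-divergence and its χ² approximation are given in the paper itself; the sketch below is only a minimal illustration of the general recipe, assuming a multinomial naive Bayes model with Laplace smoothing and using each term's contribution to the pairwise Jeffreys divergence between class-conditional term distributions as a stand-in discrimination score. The function names and the synthetic data are hypothetical, not taken from the paper.

```python
import numpy as np

def md_style_ranking(X, y, alpha=1.0):
    """Rank terms by a divergence-based discrimination score.

    X     : (n_docs, n_terms) term-count matrix
    y     : (n_docs,) integer class labels
    alpha : Laplace smoothing constant

    Illustrative stand-in for the MD method: each term is scored by its
    contribution to the sum of pairwise Jeffreys divergences between the
    smoothed class-conditional term distributions, then terms are sorted
    from most to least discriminative.
    """
    classes = np.unique(y)
    n_terms = X.shape[1]

    # Smoothed multinomial estimates: theta[c, j] ~= P(term j | class c)
    theta = np.empty((len(classes), n_terms))
    for i, c in enumerate(classes):
        counts = X[y == c].sum(axis=0) + alpha
        theta[i] = counts / counts.sum()

    # Per-term contribution to the pairwise Jeffreys divergence
    # J(p, q) = sum_j (p_j - q_j) * log(p_j / q_j), accumulated over class pairs.
    scores = np.zeros(n_terms)
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            scores += (theta[a] - theta[b]) * np.log(theta[a] / theta[b])

    return np.argsort(scores)[::-1]  # term indices, highest score first


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = np.repeat(np.arange(3), 20)            # 3 classes, 20 documents each
    lam = rng.uniform(0.1, 3.0, size=(3, 15))  # class-specific term rates
    X = rng.poisson(lam[y])                    # synthetic 60 x 15 count matrix
    print(md_style_ranking(X, y)[:5])          # 5 most discriminative terms
```

The published MD and MD-χ² algorithms rank features against the JMH-divergence and its non-central χ² approximation rather than this simple pairwise sum, but the overall score-then-rank structure is the same.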

Numerical Results

Extensive experiments on several benchmark data sets, including 20-Newsgroups, Reuters, and TDT2, demonstrate the strong performance of the proposed methods. The experiments compare them with existing feature selection techniques such as Document Frequency (DF), the Chi-square (χ²) statistic, and others, showing substantial improvements in classification accuracy and F1 measures. Importantly, these results highlight the capability of the new methods to achieve similar or better performance with significantly fewer features.
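As a rough outline of how such a comparison can be run (not the paper's exact experimental protocol), the snippet below keeps the k top-ranked terms, trains scikit-learn's multinomial naive Bayes on the reduced matrix, and reports macro-averaged F1; the ranking argument could come from the hypothetical md_style_ranking sketch above.

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def f1_with_top_k(X, y, ranking, k):
    """Keep the k top-ranked terms, train multinomial naive Bayes, return macro-F1."""
    X_k = X[:, ranking[:k]]
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_k, y, test_size=0.3, random_state=0, stratify=y
    )
    clf = MultinomialNB().fit(X_tr, y_tr)
    return f1_score(y_te, clf.predict(X_te), average="macro")

# Example sweep over vocabulary sizes (using the synthetic data from the sketch above):
# ranking = md_style_ranking(X, y)
# for k in (5, 10, 15):
#     print(k, round(f1_with_top_k(X, y, ranking, k), 3))
```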

Theoretical and Practical Implications

Theoretically, this research offers a compelling framework for rethinking feature selection beyond traditional relevance-based approaches. By integrating divergence measures directly aligned with the classifier's objective, the proposed methods bridge feature selection and classification optimization. Practically, the results suggest substantial computational savings and efficiency in handling large-scale text data, which is especially pertinent given the growing volume of electronic text.

Future Directions

The paper opens several avenues for further work, such as integrating divergence-based feature selection with more complex machine learning models beyond naive Bayes. Future work could also investigate the proposed methods under other distributional assumptions, consider dependencies between features, and extend the approach to multi-label classification scenarios.

This work contributes a rigorous and effective solution to the feature selection problem in text categorization, laying a strong foundation for continued research and application in the field of text analytics and machine learning.