
TELEClass: Taxonomy Enrichment and LLM-Enhanced Hierarchical Text Classification with Minimal Supervision (2403.00165v3)

Published 29 Feb 2024 in cs.CL and cs.LG

Abstract: Hierarchical text classification aims to categorize each document into a set of classes in a label taxonomy; it is a fundamental web text mining task with broad applications such as web content analysis and semantic indexing. Most earlier works focus on fully or semi-supervised methods that require a large amount of human-annotated data, which is costly and time-consuming to acquire. To reduce human effort, in this paper we work on hierarchical text classification with a minimal amount of supervision: the sole class name of each node. Recently, Large Language Models (LLMs) have shown competitive performance on various tasks through zero-shot prompting, but this approach performs poorly in the hierarchical setting because it is ineffective to include a large, structured label space in a prompt. On the other hand, previous weakly-supervised hierarchical text classification methods only utilize the raw taxonomy skeleton and ignore the rich information hidden in the text corpus that can serve as additional class-indicative features. To tackle these challenges, we propose TELEClass (Taxonomy Enrichment and LLM-Enhanced weakly-supervised hierarchical text Classification), which combines the general knowledge of LLMs with task-specific features mined from an unlabeled corpus. TELEClass automatically enriches the raw taxonomy with class-indicative features for better label-space understanding, and uses novel LLM-based data annotation and generation methods tailored to the hierarchical setting. Experiments show that TELEClass significantly outperforms previous baselines while achieving performance comparable to zero-shot LLM prompting at drastically lower inference cost.
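To make the taxonomy-enrichment idea concrete, here is a minimal toy sketch of mining class-indicative terms from an unlabeled corpus. It is an illustrative assumption, not the paper's actual algorithm: it simply ranks words that co-occur with a class name in the same document by frequency, whereas TELEClass uses far more sophisticated LLM-enhanced scoring. All names (`enrich_taxonomy`, the sample corpus, the class names) are hypothetical.

```python
import re
from collections import Counter, defaultdict

def enrich_taxonomy(corpus, class_names, top_k=3):
    """Toy taxonomy enrichment: for each class, collect candidate
    class-indicative terms, i.e. words that co-occur with the class
    name in the same document, ranked by corpus frequency."""
    stop = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "with"}
    candidates = defaultdict(Counter)
    for doc in corpus:
        tokens = re.findall(r"[a-z]+", doc.lower())
        for cls in class_names:
            if cls in tokens:
                # count every non-stopword co-occurring with the class name
                candidates[cls].update(
                    t for t in tokens if t != cls and t not in stop
                )
    return {cls: [w for w, _ in candidates[cls].most_common(top_k)]
            for cls in class_names}

corpus = [
    "the quarterback threw a touchdown in the football game",
    "football fans cheered the touchdown",
    "the senate passed a budget bill in politics today",
    "politics news: the senate debated the budget",
]
enriched = enrich_taxonomy(corpus, ["football", "politics"])
print(enriched)
# e.g. "touchdown" surfaces as indicative of "football",
# "senate" and "budget" as indicative of "politics"
```

Such mined terms give each taxonomy node a richer, corpus-specific description than its bare class name, which is the intuition the abstract appeals to; the real system replaces the naive co-occurrence count with LLM-guided selection.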
