Abstract

Scientific document classification is a critical task for a wide range of applications, but the cost of collecting large amounts of human-labeled data can be prohibitive. To address this challenge, we propose a weakly supervised approach to scientific document classification that uses label names only. In scientific domains, label names often contain domain-specific concepts that may not appear in the document corpus, making it difficult to match labels and documents precisely. To tackle this issue, we propose WANDER, which leverages dense retrieval to perform matching in the embedding space and thereby capture the semantics of label names. We further design a label name expansion module to enrich the label name representations. Finally, a self-training step is used to refine the predictions. Experiments on three datasets show that WANDER outperforms the best baseline by 11.9% on average. Our code will be published at https://github.com/ritaranx/wander.
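
To make the matching step in the abstract concrete, here is a minimal sketch of label-name-based classification via dense retrieval in an embedding space. The encoder choice, the expansion terms, and the mean-similarity scoring are illustrative assumptions, not the authors' actual WANDER implementation.

```python
# Sketch: assign each document to the class whose (expanded) label names
# are closest to it in a shared embedding space, then keep the result as a
# pseudo-label that a later self-training step could refine.
# All model names, classes, and expansion terms below are hypothetical.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed off-the-shelf dense encoder

documents = [
    "We study convolutional networks for image recognition.",
    "A new catalyst improves hydrogen evolution reaction efficiency.",
]

# Hypothetical label name expansion: each class is represented by its name
# plus a few related terms to enrich the label representation.
expanded_labels = {
    "computer vision": ["computer vision", "image recognition", "object detection"],
    "chemistry": ["chemistry", "catalysis", "electrochemistry"],
}

doc_emb = encoder.encode(documents, normalize_embeddings=True)

pseudo_labels = []
for d in doc_emb:
    # Score each class by the mean cosine similarity between the document
    # embedding and the embeddings of its expanded label names.
    scores = {
        name: float(np.mean(encoder.encode(terms, normalize_embeddings=True) @ d))
        for name, terms in expanded_labels.items()
    }
    pseudo_labels.append(max(scores, key=scores.get))

print(pseudo_labels)  # pseudo-labels that could seed the self-training step
```

Because label names are compared to documents in embedding space rather than by exact string matching, classes whose names never appear verbatim in the corpus can still be assigned; this is the gap the retrieval-based matching is meant to close.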

