SCStory: Self-supervised and Continual Online Story Discovery (2312.03725v1)
Abstract: We present SCStory, a framework for online story discovery that helps people digest rapidly published news article streams in real time without human annotations. To organize news article streams into stories, existing approaches directly encode the articles and cluster them based on representation similarity. However, these methods yield noisy and inaccurate story discovery results because generic article embeddings do not effectively reflect the story-indicative semantics in an article and cannot adapt to rapidly evolving news article streams. SCStory employs self-supervised and continual learning with a novel idea of story-indicative adaptive modeling of news article streams. With a lightweight hierarchical embedding module that first learns sentence representations and then article representations, SCStory identifies the story-relevant information in news articles and uses it to discover stories. The embedding module is continuously updated to adapt to evolving news streams with a contrastive learning objective, backed by two techniques, confidence-aware memory replay and prioritized augmentation, which address the label absence and data scarcity problems. Thorough experiments on real and recent news data sets demonstrate that SCStory outperforms existing state-of-the-art algorithms for unsupervised online story discovery.
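To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes sentence vectors are already available from some encoder, pools them into an article vector with softmax attention weights, and assigns the article to the most similar existing story or opens a new one. The function names (`pool_sentences`, `assign_story`), the `query` vector, and the `threshold` value are hypothetical, and the continual contrastive update, memory replay, and augmentation steps are omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm else 0.0


def pool_sentences(sentence_vecs, query):
    """Pool sentence vectors into one article vector.

    Each sentence is scored against a (hypothetical) query vector, the
    scores are turned into softmax weights, and the article vector is
    the weighted sum of the sentence vectors.
    """
    scores = [dot(s, query) for s in sentence_vecs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(sentence_vecs[0])
    return [sum(w * s[i] for w, s in zip(weights, sentence_vecs))
            for i in range(dim)]


def assign_story(article_vec, story_vecs, threshold=0.5):
    """Return the index of the most similar story.

    If no story reaches the (assumed) similarity threshold, the article
    starts a new story, which is appended to story_vecs in place.
    """
    best, best_sim = None, threshold
    for idx, vec in enumerate(story_vecs):
        sim = cosine(article_vec, vec)
        if sim >= best_sim:
            best, best_sim = idx, sim
    if best is None:
        story_vecs.append(list(article_vec))
        best = len(story_vecs) - 1
    return best


# Usage: stream three toy articles, each given as a list of sentence vectors.
stories = []
query = [1.0, 1.0]
stream = [
    [[1.0, 0.0], [0.9, 0.1]],
    [[0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
labels = [assign_story(pool_sentences(a, query), stories) for a in stream]
print(labels)
```

In this sketch each story is represented by the vector of its first article only. Updating the story representation as articles arrive, and updating the encoder itself, are the parts the abstract attributes to SCStory's continual learning and are not modeled here.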
Authors: Susik Yoon, Yu Meng, Dongha Lee, Jiawei Han