Hierarchical Transformers for Long Document Classification (1910.10781v1)

Published 23 Oct 2019 in cs.CL, cs.LG, and stat.ML

Abstract: BERT, which stands for Bidirectional Encoder Representations from Transformers, is a recently introduced language representation model based upon the transfer learning paradigm. We extend its fine-tuning procedure to address one of its major limitations - applicability to inputs longer than a few hundred words, such as transcripts of human call conversations. Our method is conceptually simple. We segment the input into smaller chunks and feed each of them into the base model. Then, we propagate each output through a single recurrent layer, or another transformer, followed by a softmax activation. We obtain the final classification decision after the last segment has been consumed. We show that both BERT extensions are quick to fine-tune and converge after as little as 1 epoch of training on a small, domain-specific data set. We successfully apply them in three different tasks involving customer call satisfaction prediction and topic classification, and obtain a significant improvement over the baseline models in two of them.

Citations (230)

Summary

The paper extends BERT to long document classification with two hierarchical models: RoBERT, which adds an LSTM over the segment-level outputs, and ToBERT, which adds a second transformer.
It reports significant gains over baseline models on two of the three tasks, including 95.48% accuracy on the Fisher corpus with ToBERT.
The research underscores the potential of hierarchical approaches to enhance NLP tasks such as customer satisfaction analysis and large-scale text categorization.

Hierarchical Transformers for Long Document Classification

The paper presents a method for extending BERT (Bidirectional Encoder Representations from Transformers) to long document classification, addressing BERT's inability to handle inputs longer than a few hundred words. It focuses on tasks such as topic identification in spoken conversations and customer satisfaction prediction from call transcripts, where documents often exceed 5000 words.

Methodology

The proposed methods, Recurrence over BERT (RoBERT) and Transformer over BERT (ToBERT), stack a hierarchical model on top of BERT. A long document is split into smaller segments, each segment is processed by BERT, and the resulting segment-level outputs are combined into a single document-level prediction.
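The segmentation step can be illustrated with a small sliding-window helper. This is a minimal sketch, not the authors' code: the function name `split_into_segments` is hypothetical, and the default `size` and `shift` values are illustrative assumptions, since this summary does not state the segment length or overlap used in the paper.

```python
from typing import List, Sequence


def split_into_segments(
    tokens: Sequence[str], size: int = 200, shift: int = 50
) -> List[List[str]]:
    """Split tokens into windows of `size` tokens that start every `shift` tokens.

    The defaults are illustrative assumptions, not values taken from the summary.
    Consecutive windows overlap whenever shift < size; the last window may be shorter.
    """
    if size <= 0 or shift <= 0:
        raise ValueError("size and shift must be positive")
    if len(tokens) == 0:
        return []
    segments: List[List[str]] = []
    start = 0
    while True:
        segments.append(list(tokens[start:start + size]))
        # Stop once the current window reaches the end of the document.
        if start + size >= len(tokens):
            break
        start += shift
    return segments


if __name__ == "__main__":
    words = "the caller asked about a refund and the agent apologised".split()
    for segment in split_into_segments(words, size=4, shift=2):
        print(segment)
```

Each returned segment would then be passed to BERT separately, so the helper only has to guarantee that every window fits within the model's input limit.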

Recurrence over BERT (RoBERT): This technique utilizes a small LSTM (Long Short-Term Memory) layer to aggregate the sequence of segment-level representations obtained from BERT into a document-level representation. This setup allows the model to capture temporal dependencies essential for tasks like customer satisfaction prediction, where an emotional progression may occur throughout the call.
Transformer over BERT (ToBERT): Instead of an LSTM layer, ToBERT employs another transformer model on top of BERT. This method leverages the transformer’s ability to model long-range dependencies more effectively, which is particularly beneficial when the order of information is crucial.
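The two aggregation strategies can be contrasted with a toy, dependency-free sketch. Everything here is an assumption made for illustration: a plain tanh recurrence stands in for the LSTM, a single unparameterised scaled dot-product self-attention step with mean pooling stands in for the extra transformer, and the segment vectors are hand-written numbers rather than real BERT outputs. The function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def recurrent_pool(segments: List[Vector], w_in: List[Vector], w_rec: List[Vector]) -> Vector:
    """RoBERT-like pooling: h_t = tanh(W_in x_t + W_rec h_{t-1}); return the last state.

    A simple tanh recurrence is used here in place of an LSTM.
    """
    hidden = [0.0] * len(w_rec)
    for x in segments:
        from_input = matvec(w_in, x)
        from_state = matvec(w_rec, hidden)
        hidden = [math.tanh(a + b) for a, b in zip(from_input, from_state)]
    return hidden


def attention_pool(segments: List[Vector]) -> Vector:
    """ToBERT-like pooling: self-attention across segments, then a mean over segments.

    No learned projections or position information are used in this sketch.
    """
    dim = len(segments[0])
    scale = math.sqrt(dim)
    attended = []
    for query in segments:
        weights = softmax(
            [sum(q * k for q, k in zip(query, key)) / scale for key in segments]
        )
        attended.append(
            [sum(w * seg[i] for w, seg in zip(weights, segments)) for i in range(dim)]
        )
    return [sum(vec[i] for vec in attended) / len(attended) for i in range(dim)]


def classify(doc_vector: Vector, w_out: List[Vector]) -> Vector:
    """Linear layer followed by softmax over classes."""
    return softmax(matvec(w_out, doc_vector))


if __name__ == "__main__":
    segs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # stand-ins for BERT segment outputs
    eye = [[1.0, 0.0], [0.0, 1.0]]
    half = [[0.5, 0.0], [0.0, 0.5]]
    print(classify(recurrent_pool(segs, eye, half), eye))
    print(classify(attention_pool(segs), eye))
```

In this sketch the recurrent pooler depends on the order of the segments, while the attention pooler, having no position information, gives the same result for any ordering. That is a property of the toy code, not a finding reported in the paper.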

Experimental Results

The paper evaluates these methods across three datasets: CSAT (customer satisfaction), 20 Newsgroups (topic identification in written text), and the Fisher corpus (spoken conversations). Key findings include:

Fine-tuning BERT on task-specific datasets improves performance over using pre-trained BERT features. Significant accuracy improvements are documented, notably on the Fisher dataset where ToBERT achieves a reported state-of-the-art accuracy of 95.48%.
ToBERT outperforms RoBERT on most of the datasets, suggesting that the additional transformer layer handles long sequences of segments better than the LSTM.
The inclusion of position embeddings had minimal effect, except for a slight performance gain in CSAT, suggesting that temporal information can be crucial for tasks dependent on emotional or conversational progression.
Comparison with baseline models, such as MS-CNN and SVM-MCE, reveals ToBERT's superior performance, indicating robust capability in processing long-form text.
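To make the position-embedding finding concrete, the sketch below adds a position signal to each segment vector before aggregation. It is only an assumption-laden illustration: this summary does not say which kind of position embedding the paper used, so the fixed sinusoidal encoding from the original Transformer architecture is swapped in as one common choice, and both function names are hypothetical.

```python
import math
from typing import List


def sinusoidal_position(index: int, dim: int) -> List[float]:
    """Sinusoidal encoding of one position: sine on even dimensions, cosine on odd ones."""
    encoding = []
    for i in range(dim):
        angle = index / (10000 ** (2 * (i // 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_segment_positions(segments: List[List[float]]) -> List[List[float]]:
    """Add a position encoding to every segment vector, indexed by segment order."""
    if not segments:
        return []
    dim = len(segments[0])
    return [
        [value + pos for value, pos in zip(vec, sinusoidal_position(idx, dim))]
        for idx, vec in enumerate(segments)
    ]


if __name__ == "__main__":
    # Two identical stand-in segment vectors become distinguishable by position.
    for vec in add_segment_positions([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]):
        print(vec)
```

With such a signal added, an otherwise order-agnostic aggregator can tell early segments from late ones, which is the kind of temporal information the CSAT result points to.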

Implications and Future Work

The paper demonstrates that hierarchical transformer approaches like RoBERT and ToBERT can effectively manage long document classification tasks. By combining BERT’s contextual embeddings with an additional layer that aggregates across segments, these methods improve substantially on the simpler baseline models and on approaches limited to shorter inputs.

The research opens avenues for further exploration into directly training models end-to-end on long documents, optimizing segment size and overlap, and exploring positional encoding in greater depth. These methodologies have practical applications in fields requiring the extraction of insights from extensive textual data, such as automated customer service analytics and large-scale text categorization.

Overall, the paper contributes valuable insights into extending the applicability of transformer-based models to a broader range of natural language processing tasks requiring long document handling. These hierarchical approaches lay the groundwork for future innovations in the field, aiming to bridge the gap between advanced language understanding models and practical, real-world text processing scenarios.