Nearest Neighbor Speculative Decoding for LLM Generation and Attribution (2405.19325v3)
Abstract: LLMs often hallucinate and lack the ability to provide attribution for their generations. Semi-parametric LMs, such as kNN-LM, address these limitations by refining the output of an LM for a given prompt using its nearest neighbor matches in a non-parametric data store. However, these models often exhibit slow inference speeds and produce non-fluent text. In this paper, we introduce Nearest Neighbor Speculative Decoding (NEST), a novel semi-parametric language modeling approach that is capable of incorporating real-world text spans of arbitrary length into the LM generations and providing attribution to their sources. NEST performs token-level retrieval at each inference step to compute a semi-parametric mixture distribution and identify promising span continuations in a corpus. It then uses an approximate speculative decoding procedure that accepts a prefix of the retrieved span or generates a new token. NEST significantly enhances the generation quality and attribution rate of the base LM across a variety of knowledge-intensive tasks, surpassing the conventional kNN-LM method and performing competitively with in-context retrieval augmentation. In addition, NEST substantially improves the generation speed, achieving a 1.8x speedup in inference time when applied to Llama-2-Chat 70B. Code will be released at https://github.com/facebookresearch/NEST/tree/main.
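To make the two core ideas in the abstract concrete, here is a minimal sketch of (1) the kNN-LM-style semi-parametric mixture distribution and (2) a greedy prefix-acceptance check over a retrieved span. This is an illustrative simplification, not the authors' implementation: the interpolation weight `lam`, the acceptance `threshold`, and the function names are all assumptions, and NEST's actual procedure is an approximate speculative decoding step rather than a fixed per-token threshold.

```python
import numpy as np

def mixture_distribution(p_lm: np.ndarray, p_knn: np.ndarray, lam: float) -> np.ndarray:
    """Interpolate the parametric LM distribution with the retrieval
    (nearest-neighbor) distribution, kNN-LM style:
        p_mix = lam * p_knn + (1 - lam) * p_lm
    """
    return lam * p_knn + (1.0 - lam) * p_lm

def accept_span_prefix(span_token_ids, p_mix_per_step, threshold=0.3):
    """Accept the longest prefix of a retrieved span whose tokens each
    receive at least `threshold` probability under the mixture
    distribution at that step; stop at the first rejected token.
    (A simplified stand-in for NEST's relaxed speculative check.)"""
    accepted = []
    for token_id, p_mix in zip(span_token_ids, p_mix_per_step):
        if p_mix[token_id] >= threshold:
            accepted.append(token_id)
        else:
            break
    return accepted

# Toy vocabulary of 3 tokens.
p_lm = np.array([0.7, 0.2, 0.1])   # base LM distribution
p_knn = np.array([0.1, 0.8, 0.1])  # retrieval distribution
p_mix = mixture_distribution(p_lm, p_knn, lam=0.5)  # -> [0.4, 0.5, 0.1]

# A retrieved span [1, 0, 2], scored with the same mixture at each step.
prefix = accept_span_prefix([1, 0, 2], [p_mix, p_mix, p_mix])  # -> [1, 0]
```

Accepting a multi-token prefix in one step is what yields the speedup: the model verifies several corpus tokens at once instead of generating them one by one, while the span's corpus position provides the attribution.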