Generating Query Recommendations via LLMs (2405.19749v2)

Published 30 May 2024 in cs.IR

Abstract: Query recommendation systems are ubiquitous in modern search engines, assisting users in producing effective queries to meet their information needs. However, these systems require a large amount of data to produce good recommendations, such as a large collection of documents to index and query logs. In particular, query logs and user data are not available in cold start scenarios. Query logs are expensive to collect and maintain and require complex and time-consuming cascading pipelines for creating, combining, and ranking recommendations. To address these issues, we frame the query recommendation problem as a generative task, proposing a novel approach called Generative Query Recommendation (GQR). GQR uses an LLM as its foundation and does not need to be trained or fine-tuned to tackle the query recommendation problem. We design a prompt that enables the LLM to understand the specific recommendation task, even using a single example. We then improved our system by proposing a version that exploits query logs called Retriever-Augmented GQR (RA-GQR). RA-GQR dynamically composes its prompt by retrieving similar queries from query logs. GQR approaches reuse a pre-existing neural architecture, resulting in a simpler and more ready-to-market approach, even in a cold start scenario. Our proposed GQR obtains state-of-the-art performance in terms of NDCG@10 and clarity score against two commercial search engines and the previous state-of-the-art approach on the Robust04 and ClueWeb09B collections, improving on average the NDCG@10 performance up to ~4% on Robust04 and ClueWeb09B w.r.t. the previous best competitor. RA-GQR further improves NDCG@10, obtaining increases of ~11% and ~6% on Robust04 and ClueWeb09B, respectively, w.r.t. the best competitor. Furthermore, our system obtained ~59% of user preferences in a blind user study, proving that our method produces the most engaging queries.

Summary

The paper presents the GQR system leveraging GPT-3 to generate query recommendations without query logs, achieving up to 27% improvement in NDCG@10.
It employs both substitution and concat evaluation protocols across datasets like Robust04 and ClueWeb09B to validate its robust performance.
The system significantly enhances user engagement by securing 59% overall user preference and a 100% success rate for rare query recommendations.

Generating Query Recommendations without Query Logs

The paper introduces GQR (Generative Query Recommendation), a system that generates query recommendations without relying on traditional query logs. Built on an LLM, specifically GPT-3, the system is evaluated on several metrics and datasets against existing baseline methods and commercial query recommendation systems.

Evaluation Metrics and Datasets

To assess the performance of the GQR system, the authors employed two established corpora, Robust04 and ClueWeb09B, alongside a set of AOL query logs. They utilized two evaluation protocols:

Substitution Protocol: A direct comparison based on the individual query recommendation metrics.
Concat Protocol: An indirect evaluation that measures the improvement when user queries are supplemented with multiple recommendations.
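The two protocols can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are hypothetical, and joining queries with a space is an assumption, since the summary does not say how recommendations are concatenated.

```python
from typing import List


def substitution_queries(recommendations: List[str]) -> List[str]:
    """Substitution: each recommendation is issued on its own,
    in place of the user's query, and scored individually."""
    return list(recommendations)


def concat_queries(user_query: str, recommendations: List[str]) -> List[str]:
    """Concat: at rank k, the user's query is extended with the
    first k recommendations (space-joined here, by assumption)."""
    return [
        " ".join([user_query] + recommendations[:k])
        for k in range(1, len(recommendations) + 1)
    ]


if __name__ == "__main__":
    recs = ["jaguar car price", "jaguar animal habitat"]
    print(substitution_queries(recs))
    print(concat_queries("jaguar", recs))
```

Each resulting query string would then be run against the retrieval system and scored with the metrics described next.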

The two performance metrics are the Clarity Score (SCS), which measures how specific (unambiguous) a query is, and Normalized Discounted Cumulative Gain at rank 10 (NDCG@10), which measures the relevance of the top ten documents retrieved for it.
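Both metrics are simple to compute once relevance judgments and collection statistics are available. The sketch below rests on assumptions: it takes SCS to be the simplified clarity score (the divergence between the query's term distribution and the collection's), uses linear gains for NDCG, and skips query terms that never occur in the collection. The paper's exact variants may differ, and all names are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Optional


def ndcg_at_k(ranked_rels: List[float], k: int = 10,
              judged_rels: Optional[List[float]] = None) -> float:
    """NDCG@k with linear gain: DCG = sum(rel_i / log2(i + 1)), i from 1.

    ranked_rels: relevance grades of the retrieved documents, in rank order.
    judged_rels: grades of all judged documents for the query, used for the
                 ideal ranking; falls back to ranked_rels when omitted.
    """
    def dcg(rels: List[float]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    pool = judged_rels if judged_rels is not None else ranked_rels
    idcg = dcg(sorted(pool, reverse=True))
    return dcg(ranked_rels) / idcg if idcg > 0 else 0.0


def simplified_clarity_score(query_terms: List[str],
                             coll_freq: Dict[str, int],
                             coll_size: int) -> float:
    """SCS = sum over query terms w of P(w|Q) * log2(P(w|Q) / P(w|C)).

    P(w|Q) is the term's relative frequency in the query;
    P(w|C) is its relative frequency in the collection.
    """
    counts = Counter(query_terms)
    n = len(query_terms)
    score = 0.0
    for term, c in counts.items():
        p_q = c / n
        p_c = coll_freq.get(term, 0) / coll_size
        if p_c == 0:
            continue  # assumption: ignore terms unseen in the collection
        score += p_q * math.log2(p_q / p_c)
    return score
```

A higher SCS means the query's terms are rarer in the collection and so more specific; NDCG@10 reaches 1.0 only when the ten highest-graded documents are returned in ideal order.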

Key Results

Substitution Protocol

Clarity Score: The GQR (GPT-3) system yielded the highest average scores on both datasets. For Robust04, it achieved an average SCS of 10.65 with a very low standard deviation (±0.08), indicating stable performance. On ClueWeb09B, the SCS averaged 11.12 with a similarly low standard deviation (±0.10).
NDCG@10: The GQR (GPT-3) system significantly outperformed other systems. It showed an improvement of +23% and +27% in the NDCG@10 scores for the Robust04 and ClueWeb09B datasets, respectively, compared to the best competing system.

Concat Protocol

Clarity Score: Across varying rank levels (number of concatenated query recommendations), GQR (GPT-3) consistently outperformed other systems. For instance, at rank 6, it exhibited an increase in the SCS to 56.67 on Robust04.
NDCG@10: Query performance improved incrementally as recommendations were concatenated, with gains of 6% and 5% on the Robust04 and ClueWeb09B datasets, respectively.

User Engagement and Long Tail Query Analysis

The user study conducted across multiple annotator groups demonstrated the GQR system’s superiority in user engagement. The system received approximately 59% of the overall user preferences, compared to 26% and 15% for the two commercial systems.

Additionally, the GQR (GPT-3) system exhibited robust performance in generating recommendations for rare (long-tail) queries, achieving a 100% success rate in suggesting relevant queries. This contrasts with other systems that struggled to provide recommendations consistently for such queries.

Prompt Study Analysis

Two separate prompt studies addressed the impact of:

The number of examples in the prompt context.
The specific content of the examples in the prompt.

Results indicated that:

Number of Examples: The performance, in terms of SCS and NDCG@10, remained stable regardless of the number of examples (ranging from 1 to 10) in the prompt.
Specific Prompt Context: Different prompts with varying examples did not significantly alter the system’s effectiveness, suggesting that GQR (GPT-3) performs robustly irrespective of specific example content.
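To make the two variables of the prompt study concrete, here is a minimal sketch of assembling a few-shot prompt from a configurable number of examples. The prompt wording, the helper names, and the token-overlap (Jaccard) selection are all assumptions for illustration; the paper's actual prompt and the retriever used by RA-GQR are not reproduced in this summary.

```python
from typing import List, Tuple

# An example pairs a query with its recommended follow-up queries.
Example = Tuple[str, List[str]]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two queries."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def select_examples(query: str, pool: List[Example], k: int) -> List[Example]:
    """Pick the k pool examples most similar to the query.

    A stand-in for retrieval-based prompt composition; with a fixed,
    hand-written pool and k=1 it reduces to a single-example prompt.
    """
    return sorted(pool, key=lambda ex: jaccard(query, ex[0]), reverse=True)[:k]


def build_prompt(query: str, examples: List[Example]) -> str:
    """Render the examples followed by the target query (hypothetical wording)."""
    lines = ["Suggest related search queries for the given query.", ""]
    for q, recs in examples:
        lines.append(f"Query: {q}")
        lines.append("Recommendations: " + "; ".join(recs))
        lines.append("")
    lines.append(f"Query: {query}")
    lines.append("Recommendations:")
    return "\n".join(lines)


if __name__ == "__main__":
    pool = [
        ("cheap flights rome", ["rome flight deals", "low cost airlines italy"]),
        ("python list sort", ["python sorted vs sort", "sort list of dicts"]),
    ]
    chosen = select_examples("flights to rome", pool, k=1)
    print(build_prompt("flights to rome", chosen))
```

Varying `k` corresponds to the first study (number of examples), and swapping the contents of `pool` corresponds to the second (content of the examples).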

Discussion

The findings show that the GQR (GPT-3) system not only surpasses its commercial counterparts in both metric and user engagement evaluations but also demonstrates superior capability in handling queries of varying frequencies. The stability across different prompt configurations underlines the adaptability of the model.

Implications and Future Developments

Practically, the implementation of a GQR-like system can significantly enhance user search experiences by providing precise and valuable query suggestions without dependency on historical query logs, which may be sparse or unavailable. Theoretically, this approach eases the reliance on query histories, promoting privacy and potentially leveling the playing field for new search engines or domains with limited historical data.

Future work could investigate domain-specific training to improve the relevance of the generated queries, or refine the prompt engineering to optimize GQR's performance. Additionally, exploring more diverse datasets would offer greater insight into the system’s versatility across various use cases.

In conclusion, the proposed GQR (GPT-3) system substantially advances the state-of-the-art in query recommendation, demonstrating superior performance across multiple evaluation protocols and user studies.
