
Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability (2406.11424v1)

Published 17 Jun 2024 in cs.IR and cs.CL

Abstract: This paper presents an analysis of open-source LLMs and their application in Retrieval-Augmented Generation (RAG) tasks, specifically for enterprise-specific datasets scraped from company websites. With the increasing reliance on LLMs in natural language processing, it is crucial to evaluate their performance, accessibility, and integration within specific organizational contexts. This study examines various open-source LLMs, explores their integration into RAG frameworks using enterprise-specific data, and assesses the performance of different open-source embeddings in enhancing the retrieval and generation process. Our findings indicate that open-source LLMs, combined with effective embedding techniques, can significantly improve the accuracy and efficiency of RAG systems, offering a viable alternative to proprietary solutions for enterprises.

Citations (6)

Summary

  • The paper demonstrates that Llama3-8B outperforms Mistral 8x7B in metrics like contextual precision and answer relevancy.
  • It employs a hybrid retriever ensemble using FAISS and BM25, optimizing performance on enterprise-specific datasets.
  • The study highlights cost-efficiency and reduced inference times, positioning open-source LLMs as viable alternatives to proprietary models.

Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems

Introduction

The paper evaluates the performance of open-source LLMs within enterprise-specific Retrieval-Augmented Generation (RAG) systems. The aim is to provide a comparative analysis of how different open-source LLMs handle RAG tasks over enterprise-specific datasets, with particular attention to accuracy and efficiency relative to proprietary models. Notably, the work investigates Llama3-8B and Mistral 8x7B, assessed primarily as cost-effective alternatives to commercial solutions such as GPT-3.5.

Methodology

Data Collection

Data was collected from the enterprise site "i-venture.org" through web scraping, which involved parsing the site's sitemap and subsequent crawling to extract relevant content. This approach ensured a comprehensive acquisition of data critical for RAG tasks.
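The sitemap-driven step described above can be sketched as follows. This is a minimal, stdlib-only illustration, not the paper's actual scraper: the specific page URLs are hypothetical, and a real pipeline would fetch the live sitemap (and then each page) over HTTP before extracting content.

```python
import xml.etree.ElementTree as ET

# A sample sitemap document; in practice this would be fetched from
# the site (the page URLs below are illustrative, not real).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.i-venture.org/about</loc></url>
  <url><loc>https://www.i-venture.org/programs</loc></url>
</urlset>"""

def extract_urls(sitemap_xml: str) -> list:
    """Return every <loc> entry (page URL) from a sitemap document."""
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", ns)]

urls = extract_urls(SITEMAP_XML)
print(urls)
```

Each extracted URL would then be crawled and its main content stripped of navigation and markup before being passed to the preprocessing stage.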

Text Preprocessing

The data was split into textual chunks using the NLTKTextSplitter and RecursiveCharacterTextSplitter methods. This splitting was crucial for the retrieval component: it produces segments small enough to be matched efficiently against queries while preserving enough context for the generator.
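The recursive splitting idea can be sketched in pure Python. The paper uses LangChain's splitters; this simplified stand-in reproduces only the core behavior (try coarse separators first, recurse to finer ones when a piece is still too long) and omits the chunk-merging and overlap logic the real splitter applies.

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters,
    preferring paragraph, then line, then word boundaries."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut at chunk_size.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        if len(piece) <= chunk_size:
            if piece.strip():
                chunks.append(piece)
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    return chunks

doc = "First paragraph about the incubator.\n\n" + "A much longer second paragraph " * 8
chunks = recursive_split(doc, chunk_size=100)
print(len(chunks))
```

The real RecursiveCharacterTextSplitter additionally merges adjacent small pieces back up toward the target chunk size and supports overlapping chunks, which helps queries that straddle a boundary.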

Embedding and Vector Database

Embeddings were generated using models from Hugging Face, with a focus on BAAI/bge-large-en-v1.5 for its strong performance in semantic search contexts. These embeddings were stored in a FAISS vector database, which facilitated rapid similarity searches.
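What the FAISS index computes at query time can be illustrated with a brute-force nearest-neighbor search. The toy 3-d vectors and document names below are invented for illustration; the actual system uses 1024-d BAAI/bge-large-en-v1.5 embeddings, and FAISS replaces this linear scan with optimized index structures.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(index, query_vec, k=2):
    """Return the k nearest (score, doc_id) pairs by cosine similarity."""
    scored = [(cosine(vec, query_vec), doc_id) for doc_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Toy 3-d "embeddings" keyed by chunk id (illustrative only).
index = {
    "admissions": [0.9, 0.1, 0.0],
    "funding":    [0.1, 0.9, 0.2],
    "events":     [0.0, 0.2, 0.9],
}
hits = search(index, [1.0, 0.0, 0.1], k=2)
print(hits)
```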

LLM Integration and RAG Implementation

LLMs were integrated via the Perplexity API, which serves models such as Llama3-8B and Mistral 8x7B. A hybrid retriever ensemble combining FAISS (dense) and BM25 (sparse) retrieval was central to the RAG framework, improving retrieval accuracy and contextual relevance.
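The hybrid-ensemble idea can be sketched as score fusion between a sparse BM25 scorer and a dense retriever. This is a simplified stand-in, not the paper's implementation: the fusion weights, the min-max normalization, and the toy documents and dense scores are all assumptions made for illustration.

```python
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query terms with BM25."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / N
    scores = [0.0] * N
    for term in query.lower().split():
        df = sum(1 for toks in tokenized if term in toks)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for i, toks in enumerate(tokenized):
            tf = toks.count(term)
            scores[i] += idf * tf * (k1 + 1) / (
                tf + k1 * (1 - b + b * len(toks) / avgdl))
    return scores

def hybrid_rank(sparse, dense, w_sparse=0.5, w_dense=0.5):
    """Fuse min-max-normalized sparse and dense scores; return doc
    indices sorted best-first."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    fused = [w_sparse * s + w_dense * d for s, d in zip(norm(sparse), norm(dense))]
    return sorted(range(len(fused)), key=lambda i: -fused[i])

docs = ["incubation program for startups",
        "campus events calendar",
        "startup funding and mentorship"]
sparse = bm25_scores("startup funding", docs)
dense = [0.2, 0.1, 0.9]   # stand-in cosine scores from the dense retriever
ranking = hybrid_rank(sparse, dense)
print(ranking)
```

Combining the two signals lets exact keyword matches (BM25) compensate for dense-retrieval misses and vice versa, which is the motivation for the ensemble in the paper.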

Evaluation Metrics

The evaluation consisted of comparing the cosine similarity of generated answers with ground-truth answers and employing the DeepEval metrics, namely contextual precision, contextual recall, contextual relevancy, and answer relevancy, to comprehensively measure performance.

Figure 1: Mistral: Reason Dense

Results

Performance Analysis

The key findings indicate that Llama3-8B consistently outperformed Mistral 8x7B across the evaluated metrics, including unigram precision, contextual precision, and answer relevancy. Notably, Llama3-8B made better use of contextual information and maintained high precision with reduced inference times.

Cosine Similarity

The analysis showed that increasing top-k values led to diminishing returns in cosine similarity for both LLM models. This indicates a ceiling effect where additional retrieved documents no longer contribute significantly to improving answer accuracy.

DeepEval Metrics

Evaluation using the DeepEval framework showed that the Llama3-8B model also excelled in the contextual metrics, highlighting its robustness and flexibility in enterprise-specific RAG tasks relative to Mistral 8x7B.

Figure 2: Histogram of inference time using GPT 3.5: average response time 4.3 seconds

Discussion

The paper shows that open-source models, particularly Llama3-8B, deliver performance comparable to proprietary models with significant cost-efficiency benefits. The observed results challenge the assumption that more parameters (as in Mistral 8x7B) automatically yield better RAG performance. Furthermore, consistent performance across varying question densities underscores Llama3-8B's adaptability to multifaceted tasks.

Figure 3: Mistral: Reason Dense

Conclusion

In conclusion, Llama3-8B serves as a commendable open-source option for enterprises aiming to implement cost-effective and efficient RAG systems. This research illuminates the potential of accessible technologies for enterprise-specific NLP tasks and provides a foundation for deploying open-source LLMs in contexts traditionally served by proprietary models. The work encourages further exploration into optimizing hybrid retrieval methods and refining real-time inference to improve overall system performance.

Through extensive metric analysis and comparative evaluation, the paper reaffirms the efficacy of leveraging open-source tools in enterprise AI applications, setting a precedent for future innovations in this domain.


Authors (2)
