Evaluating the Performance of Retrieval-Augmented LLM Systems

Felix Fang
lastmile ai — blog
9 min read · Jun 29, 2023


Large Language Models (LLMs), the technology behind AI chatbots like ChatGPT, continue to gain popularity as more use cases arise for generative AI. In particular, Retrieval-Augmented Generation (RAG) systems, proposed in 2020 and popularized by tools such as langchain, power many practical applications, such as question answering over a local knowledge base.

Evaluating the performance and quality of these systems is crucial for assessing their capabilities and limitations. Understanding how reliable these systems are is top of mind for researchers, developers, and consumers alike.

In this blog post, we explore the various ways to evaluate a Retrieval-Augmented LLM system.

Retrieval-Augmented Large Language Models

This is the typical set of steps to perform a question answering task based on a local knowledge base:

  1. Build a vectorstore: We generate an embedding vector for each document in the local knowledge base and store the documents in a vector database indexed by their embedding vectors;
  2. Search for context: We embed the input question with the same embedding model and find the most relevant documents in the vectorstore;
  3. Feed the LLM: We combine the input question with the relevant documents as context and feed them to the LLM to get an answer grounded in the local knowledge base.

Voila! This QA system architecture works for almost any local knowledge base, from personal study notes and internal documents to company financial statements.

Diagram of a Typical RAG+LLM System (Image from https://blog.langchain.dev/retrieval/)
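To make the three steps above concrete, here is a minimal sketch of the pipeline. It assumes the sentence-transformers package for embeddings and uses a plain in-memory cosine-similarity search in place of a real vector database; call_llm is a placeholder for whichever LLM API you use.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Our Q1 revenue grew 12% year over year.",
    "The onboarding guide lives in the internal wiki.",
    "PTO requests must be submitted two weeks in advance.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# 1. Build a "vectorstore": one embedding vector per document.
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    # 2. Embed the question the same way and rank documents by cosine similarity.
    q_vec = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_vectors @ q_vec  # cosine similarity, since vectors are normalized
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a call to your LLM API of choice.
    raise NotImplementedError

def answer(question: str) -> str:
    # 3. Feed the LLM the question plus the retrieved context.
    context = "\n".join(retrieve(question))
    prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {question}"
    return call_llm(prompt)
```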

The aforementioned QA application, empowered by a Retrieval-Augmented LLM system, consists of two components:

  1. Embedding-based context retrieval given a question/query; and
  2. An LLM that generates a natural language response to the question augmented with the relevant context.

Let’s take a look at how to evaluate each of these components in the rest of the blog post. We start with a quick guide on the concept of embeddings, but if you are familiar with embeddings, feel free to skip to the following section.

Embedding 101

In the context of Natural Language Processing (NLP), embeddings are numerical representations of words in vector form, enabling the model to interpret their meaning. These vectors consist of multiple dimensions where each dimension represents different aspects of the word. The number of dimensions is predetermined and can differ depending on the embedding model used.

You can see below how words are translated into vectors, where each number represents the score for a particular dimension (e.g. living being, feline).

The vector representation is useful because of the concept of distance between vectors which can help determine closeness or similarity. While it’s hard to visualize a vector space with 7 dimensions from the example above, you can calculate the distance between these vectors using various distance measures like Euclidean and cosine distances. The smaller the distance between embeddings, the closer the corresponding words likely are in meaning.
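As a quick illustration, here is how those distances can be computed with numpy on a few made-up 7-dimensional embeddings (the numbers are invented for demonstration, not produced by a real embedding model).

```python
import numpy as np

# Hypothetical 7-dimensional embeddings (values invented for illustration).
cat = np.array([0.9, 0.8, 0.1, 0.2, 0.0, 0.3, 0.1])
kitten = np.array([0.85, 0.9, 0.1, 0.3, 0.0, 0.2, 0.1])
car = np.array([0.0, 0.1, 0.9, 0.1, 0.8, 0.0, 0.7])

def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# cat/kitten should be much closer than cat/car under both measures.
print(euclidean_distance(cat, kitten), euclidean_distance(cat, car))
print(cosine_distance(cat, kitten), cosine_distance(cat, car))
```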

One of the challenges with vector representation is having too many dimensions. When there are too many dimensions, computational complexity increases significantly. In addition, high dimensionality can also result in problems of overfitting where a model becomes too specialized to the training data and performs poorly on unseen data.

Dimension reduction is a process that reduces the number of dimensions in embeddings to overcome the issues that come with high dimensionality. Another benefit of dimensionality reduction is the ability to visualize your embeddings. For instance, see the process below converting a 7-dimensional embedding to a 2-dimensional embedding and how much easier it is to visualize the distances between the embeddings:

1/ Evaluation of Embedding-based Context Retrieval

Ideally, semantically similar entities should be closer to one another in the embedding space. One of the issues with embeddings as mentioned above is that the vector representation often has hundreds of dimensions, making it hard to visually grasp if semantically similar entities are close to each other when represented as embeddings.

Analyzing the Embedding Space

Dimension reduction is one way to evaluate the quality of the embedding model: reducing the dimensionality of the embeddings makes them easier to visualize and analyze. Dimension reduction techniques such as PCA (linear), t-SNE (non-linear), and UMAP (similar to t-SNE, better at capturing global structure) reduce n-dimensional embeddings to 2D or 3D while preserving certain properties. The lowered dimensionality makes visual exploration, clustering, and analysis of proximity and separation patterns much easier, all of which help with understanding the quality of the embedding model.
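Here is a minimal sketch of that workflow, assuming scikit-learn and matplotlib (swap in umap-learn's UMAP for the UMAP variant); the embeddings and labels below are random stand-ins for your real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 384)   # stand-in for your real document embeddings
labels = np.random.randint(0, 4, 200)   # e.g. known topic/category of each document

# Project the embeddings down to 2D with a linear (PCA) and a non-linear (t-SNE) method.
pca_2d = PCA(n_components=2).fit_transform(embeddings)
tsne_2d = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

for title, points in [("PCA", pca_2d), ("t-SNE", tsne_2d)]:
    plt.figure()
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=10)
    plt.title(f"{title} projection of the embeddings")
plt.show()
```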

Pairwise similarity distribution is another qualitative analysis tool for evaluating the embedding model. Pairwise similarity measures the degree of similarity or relatedness between pairs of embeddings. A good embedding model should capture semantic relationships, ensuring that similar entities have higher similarity scores. By analyzing the pairwise similarity distribution, we can assess whether the embeddings exhibit the desired semantic proximity: a well-performing embedding model will concentrate high similarity scores on pairs that are actually related and assign lower scores to unrelated pairs.
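A simple way to look at this distribution is to histogram the pairwise cosine similarities; the sketch below assumes scikit-learn and matplotlib, with random embeddings standing in for real ones.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.rand(200, 384)              # stand-in for your real embeddings
sims = cosine_similarity(embeddings)
pairwise = sims[np.triu_indices_from(sims, k=1)]   # upper triangle, excluding self-similarity

plt.hist(pairwise, bins=50)
plt.xlabel("cosine similarity")
plt.ylabel("number of pairs")
plt.title("Pairwise similarity distribution")
plt.show()
```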

Evaluating Embedding Retrieval

Precision and recall are popular evaluation metrics for information retrieval tasks like search. Assuming you have ground truth data, precision and recall are excellent for understanding how well the retrieval process is working. Read more about precision and recall in this blog post.

  • Precision measures the accuracy of the retrieved results, specifically the proportion of relevant items among the retrieved items. Precision@k is the variant commonly used for embedding retrieval, where k is the number of retrieved items.
    - Precision@k = (# of retrieved items @k that are relevant) / (# of retrieved items @k)
  • Recall measures the completeness of the retrieved results, essentially the proportion of relevant items that are successfully retrieved from the entire set of relevant items. Recall@k is the variant commonly used for embedding retrieval, where k is the number of retrieved items (see the sketch after this list).
    - Recall@k = (# of retrieved items @k that are relevant) / (total # of relevant items)
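Both metrics are only a few lines of code once you have ground truth labels. The document IDs below are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved items that are relevant.
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant items that appear in the top-k retrieved items.
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant)

retrieved = ["doc3", "doc7", "doc1", "doc9"]   # hypothetical ranked retrieval results
relevant = {"doc1", "doc3", "doc5"}            # hypothetical ground truth

print(precision_at_k(retrieved, relevant, k=3))  # 2/3
print(recall_at_k(retrieved, relevant, k=3))     # 2/3
```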

Recommendation:

  1. Use techniques like pairwise similarity distribution and dimension reduction to evaluate the quality of the embedding model being used.
  2. If you have ground truth data, calculate precision and recall of the embedding retrieval to get a quantifiable score for accuracy.

2/ Evaluation of Large Language Models

To evaluate the quality of the output, we need to have an idea of the expected output. With the ground truth output (the reference) and the actual output from the LLM (the candidate), we can assess performance with an approach of the form scoring_function(reference, candidate), sketched in code after the list below. The scoring function can be based on:

  1. Exact match. The candidate has to equal the reference. This is helpful for question answering tasks.
  2. Fuzzy match. The candidate needs to be semantically similar to the reference but not necessarily exact. This is applicable for summarization tasks.
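Here is a toy version of that idea: an exact-match scorer plus a simple character-overlap score from Python's difflib as an illustrative stand-in for the semantic fuzzy-match metrics covered in the next section.

```python
from difflib import SequenceMatcher

def exact_match(reference: str, candidate: str) -> float:
    # 1.0 if the candidate equals the reference (ignoring case/whitespace), else 0.0.
    return float(reference.strip().lower() == candidate.strip().lower())

def fuzzy_match(reference: str, candidate: str) -> float:
    # Character-overlap ratio in [0, 1]; a crude proxy for semantic similarity.
    return SequenceMatcher(None, reference, candidate).ratio()

print(exact_match("Paris", "paris"))                      # 1.0
print(fuzzy_match("The cat sat.", "A cat was sitting."))  # somewhere between 0 and 1
```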

Standard Evaluation Measures

Implementing fuzzy-match scoring functions for LLM outputs is ambiguous and challenging. There is a blog post as well as a comprehensive survey on Natural Language Generation (NLG) evaluation metrics; here are the most commonly used ones:

  1. BLEU (n-gram based, used for translation): Measures the precision of the candidate translation by counting the number of matching n-grams between reference translation and candidate translation and penalizes for excessive generation.
  2. ROUGE (n-gram based, used for summarization): Measures the overlap and completeness (recall) between the n-gram sequences of the reference summary and the candidate summary.
  3. BERTScore (embedding-based): Assesses the similarity between candidate text and reference text by averaging cosine similarity scores between their BERT token embeddings (paper, blog post, GitHub link, HuggingFace demo).
  4. BLEURT (learned using BERT): Predicts human ratings of text quality based on a set of reference and candidate text pairs. BLEURT leverages BERT’s contextual representations to compute similarity scores and provides an evaluation metric that aligns with human judgments of text quality (paper, blog post, GitHub link).

With regards to BLEURT in particular, it is pretrained with regression losses on signals such as BLEU, ROUGE, and BERTScore, and subsequently fine-tuned on human ratings. Thus, BLEURT potentially captures signal from the other three metrics as well. Note that BLEURT scores are not calibrated (see BLEURT score distribution for more details).
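All four metrics are available through the Hugging Face evaluate library, assuming the metric-specific backends (e.g. bert-score, and Google's bleurt package for BLEURT) are installed; the sentences below are illustrative.

```python
import evaluate

candidates = ["The cat sat on the mat."]
references = ["A cat was sitting on the mat."]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")
# bleurt = evaluate.load("bleurt")  # requires Google's bleurt package; scores are not calibrated

print(bleu.compute(predictions=candidates, references=[references]))  # one list of references per candidate
print(rouge.compute(predictions=candidates, references=references))
print(bertscore.compute(predictions=candidates, references=references, lang="en"))
# print(bleurt.compute(predictions=candidates, references=references))
```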

Other Evaluation Measures

  • GPT-4. Ideally, we would have humans judge whether the output quality is good. Human raters, however, are extremely resource-intensive and not practical at scale. GPT-4 has been used as a fairly good proxy for human raters (see the sketch after this list). You may want to consider prompt engineering techniques such as few-shot prompting, chain-of-thought, and self-consistency to generate more reliable evaluation results from GPT-4.
  • Reward Models from RLHF. Reinforcement learning from human feedback (RLHF) is a technique that learns from a “reward model” trained on human feedback. We can use such a reward model, e.g. reward-model-deberta-v3-large-v2, to either directly score the output from an LLM or fine-tune it for your specific application before scoring.
  • Word Perplexity. If you have access to the probabilities of each output token (the softmax output), e.g. with local LLMs, then you can also compute word perplexity (see this blog post for an explanation).
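Here is a minimal sketch of the GPT-4-as-judge approach, assuming the OpenAI Python client (v1+ interface) and an OPENAI_API_KEY in the environment; the grading rubric in the prompt is illustrative only.

```python
from openai import OpenAI

client = OpenAI()

def gpt4_judge(question: str, reference: str, candidate: str) -> str:
    # Ask GPT-4 to grade the candidate answer against the reference.
    prompt = (
        "You are grading an answer to a question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 (wrong) to 5 (fully correct and complete), "
        "then briefly explain your rating."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(gpt4_judge(
    question="Which company develops ChatGPT?",
    reference="OpenAI",
    candidate="ChatGPT is developed by OpenAI.",
))
```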

Recommendation:

  1. Compute BLEURT as a standard evaluation measure for the LLM being used.
  2. Use a top performing LLM like GPT-4 to evaluate/score the quality of the outputs (so meta!)

Where do we see these metrics used?

Here are some references where the aforementioned evaluation metrics for LLMs are used. You may want to look into them to learn more about how to apply these evaluation metrics to your specific LLM applications.

Academic Research Papers and Technical Reports

Table 7 from the QLoRA paper

Evaluation Harness (Tools to Facilitate Evaluation of NLP models)

Example of GPT-4 Evaluation on Alpaca-13B vs Vicuna-13B

Summary: Recommendation for Evaluation Metrics

Embedding-based context retrieval: we recommend visualization with dimension reduction and pairwise similarity distribution for qualitative analysis of the embeddings, and precision@k and recall@k for quantitative evaluation of the retrieval system.

Large Language Models: when ground truth data is available, we recommend BLEURT as the primary metric across all LLMs, with BLEU and ROUGE scores as supplementary metrics. For applications where human ratings are available, you may want to consider fine-tuning BLEURT on those ratings.

For cases where ground truth is not available, we recommend using GPT-4 as a proxy for an expert human rater, customized with prompt engineering techniques. Leveraging a reward model intended for RLHF to compute scores may also be worth investigating.

LastMile AI

We would love to hear how you're thinking about evaluation metrics; you can reach us at lastmileai.dev.

We are building a generative AI workshop at lastmileai.dev that lets you experiment with many different types of foundation models, including OpenAI's ChatGPT, Google's PaLM 2, and others. Evaluating which one is good for your use cases is important to us, and to you. Visit us at lastmileai.dev to learn more! Thank you for reading.
