Answer Relevancy

The answer relevancy metric measures the quality of your RAG pipeline's generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input. deepeval's answer relevancy metric is a self-explaining LLM-Eval, meaning it outputs a reason for its metric score.

tip

You can run real-time evaluations in production on metrics such as AnswerRelevancyMetric using Confident AI.

Required Arguments

To use the AnswerRelevancyMetric, you'll have to provide the following arguments when creating an LLMTestCase:

input
actual_output

Example

from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Replace this with the actual output from your LLM application
actual_output = "We offer a 30-day full refund at no extra cost."

metric = AnswerRelevancyMetric(
    threshold=0.7,
    model="gpt-4",
    include_reason=True
)
test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output=actual_output
)

metric.measure(test_case)
print(metric.score)
print(metric.reason)

# or evaluate test cases in bulk
evaluate([test_case], [metric])

There are five optional parameters when creating an AnswerRelevancyMetric:

[Optional] threshold: a float representing the minimum passing threshold, defaulted to 0.5.
[Optional] model: a string specifying which of OpenAI's GPT models to use, OR any custom LLM model of type DeepEvalBaseLLM. Defaulted to 'gpt-4-0125-preview'.
[Optional] include_reason: a boolean which when set to True, will include a reason for its evaluation score. Defaulted to True.
[Optional] strict_mode: a boolean which when set to True, enforces a binary metric score: 1 for perfection, 0 otherwise. It also overrides the current threshold and sets it to 1. Defaulted to False.
[Optional] async_mode: a boolean which when set to True, enables concurrent execution within the measure() method. Defaulted to True.

How Is It Calculated?

The AnswerRelevancyMetric score is calculated according to the following equation:

\text{Answer Relevancy} = \frac{\text{Number of Relevant Statements}}{\text{Total Number of Statements}}

The AnswerRelevancyMetric first uses an LLM to extract all statements made in the actual_output, before using the same LLM to classify whether each statement is relevant to the input.

Required Arguments​

Example​

How Is It Calculated?​

Required Arguments

Example

How Is It Calculated?