
Test Cases

Quick Summary

A test case is a blueprint provided by deepeval to unit test LLM outputs based on seven parameters:

  • input
  • actual_output
  • [Optional] expected_output
  • [Optional] context
  • [Optional] retrieval_context
  • [Optional] latency (float)
  • [Optional] cost (float)

Except for actual_output, all parameters should originate from your evaluation dataset (if you have one). Here's an example implementation of a test case:

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    expected_output="You're eligible for a 30 day refund at no extra cost.",
    actual_output="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30 day full refund at no extra cost."],
    retrieval_context=["Only shoes can be refunded."],
    latency=10.0
)

Note that only input and actual_output are mandatory.

However, depending on the specific metric you're evaluating your test cases on, you may or may not require a retrieval_context, expected_output and/or context as additional parameters. For example, you won't need expected_output and context if you're just measuring answer relevancy, but if you're evaluating hallucination you'll have to provide context in order for deepeval to know what the ground truth is.
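
For instance, here's a minimal sketch (using deepeval's AnswerRelevancyMetric and HallucinationMetric) of how the required parameters can differ between two metrics:

from deepeval.metrics import AnswerRelevancyMetric, HallucinationMetric
from deepeval.test_case import LLMTestCase

# Answer relevancy only needs an input and an actual_output
relevancy_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost."
)

# Hallucination additionally needs context to act as the ground truth
hallucination_test_case = LLMTestCase(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    context=["All customers are eligible for a 30 day full refund at no extra cost."]
)

AnswerRelevancyMetric().measure(relevancy_test_case)
HallucinationMetric(threshold=0.7).measure(hallucination_test_case)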

Let's go through the purpose of each parameter.

Input

An input mimics a user interacting with your LLM application. The input is usually a prompt composed of dynamic data embedded in a prompt template.

from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output="ruff ruff ruff"
)

Actual Output

An actual output is simply the output your LLM application returns for a given input. This is what your users are going to interact with. Typically, you would import your LLM application (or parts of it) into your test file and invoke it at runtime to get the actual output.

Expanding on the previous example:

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt)
)

Expected Output

An expected output is simply the ideal output you would want your LLM application to return for a given input. Note that this parameter is optional depending on the metric you want to evaluate.

The expected output doesn't have to exactly match the actual output in order for your test case to pass, since deepeval uses a variety of methods to evaluate non-deterministic LLM outputs. We'll go into more detail in the metrics section.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    expected_output="Me, ruff!"
)

Context

The context is an optional parameter that represents additional data received by your LLM application as supplementary sources of golden truth. You can view it as the ideal segment of your knowledge base relevant to a specific input. Context allows your LLM to generate customized outputs that are outside the scope of the data it was trained on.

In RAG applications, contextual information is typically stored in your selected vector database, whereas for a fine-tuning use case this data is usually found in the training datasets used to fine-tune your model. Providing the appropriate contextual information when constructing your evaluation dataset is one of the most challenging parts of evaluating LLMs, since the data in your knowledge base can be constantly changing.

Unlike other parameters, a context accepts a list of strings.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog named Rocky when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")

context = ["Rocky is a good boy."]

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    expected_output="Me, ruff!",
    context=context
)
note

People often confuse expected_output with context because both are expected to be factually accurate. However, while both are (or should be) factually correct, expected_output also takes aspects like tone and linguistic patterns into account, whereas context is strictly factual.
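
To make the distinction concrete, here's a minimal sketch reusing the hypothetical Rocky example: context is a list of bare facts, while expected_output is the full, ideally phrased reply.

# context: strictly factual statements from your knowledge base
context = ["Rocky is a good boy."]

# expected_output: the ideal reply, which also captures tone and phrasing
expected_output = "Me, ruff! I'm Rocky, a very good boy!"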

Retrieval Context

The retrieval_context is an optional parameter that represents your RAG pipeline's retrieval results at runtime. By providing retrieval_context, you can determine how well your retriever is performing using context as a benchmark.

# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog named Rocky when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")

# Replace this with the actual retrieved context from your RAG pipeline
retrieval_context = ["Rocky is a good cat."]

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    expected_output="Me, ruff!",
    retrieval_context=retrieval_context
)
note

Remember, context contains the ideal retrieval results for a given input and typically comes from your evaluation dataset, whereas retrieval_context contains your LLM application's actual retrieval results at runtime.
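
In practice, retrieval_context would come from your retriever at runtime. Here's a minimal sketch, assuming a hypothetical retrieve() function from a my_rag_pipeline module that queries your vector database:

# A hypothetical LLM application example
import chatbot
# Hypothetical retriever that queries your vector database
from my_rag_pipeline import retrieve
from deepeval.test_case import LLMTestCase

prompt = "Who's a good boy?"

test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    # The ideal retrieval results, from your evaluation dataset
    context=["Rocky is a good boy."],
    # The actual retrieval results, fetched at runtime
    retrieval_context=retrieve(prompt)
)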

Latency

The latency is an optional parameter that represents how long it took your LLM application to finish executing. However, if you're trying to measure something else, like the latency for only the retrieval part of your RAG pipeline, feel free to supply that number instead of the overall latency.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="...",
    actual_output="...",
    # Replace this with the actual latency it took your
    # LLM (application) to generate the actual output
    latency=10.4
)
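
For example, here's a minimal sketch of how you might time your LLM application (assuming the hypothetical chatbot module from earlier) using Python's time module:

import time
# A hypothetical LLM application example
import chatbot
from deepeval.test_case import LLMTestCase

prompt = "Who's a good boy?"

# Time the end-to-end call to your LLM application
start = time.perf_counter()
actual_output = chatbot.run(prompt)
latency = time.perf_counter() - start

test_case = LLMTestCase(
    input=prompt,
    actual_output=actual_output,
    latency=latency
)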
info

The only deepeval metric that uses the latency parameter is the LatencyMetric.

Cost

The cost is an optional parameter that represents the token cost for a given execution of your LLM application. However, similar to latency, the cost parameter does not strictly have to be the total completion cost (e.g., it could be the embedding cost), nor does it have to be in any set currency.

from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="...",
    actual_output="...",
    # Replace this with the actual cost it took your
    # LLM (application) to generate the actual output
    cost=0.78
)
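
As an illustration, here's a minimal sketch that derives a cost from token usage, assuming hypothetical per-token prices and token counts reported by your LLM provider:

from deepeval.test_case import LLMTestCase

# Hypothetical per-token prices in USD
INPUT_PRICE_PER_TOKEN = 0.000001
OUTPUT_PRICE_PER_TOKEN = 0.000002

# Replace these with the actual token counts reported by your LLM provider
input_tokens = 120
output_tokens = 45

cost = (
    input_tokens * INPUT_PRICE_PER_TOKEN
    + output_tokens * OUTPUT_PRICE_PER_TOKEN
)

test_case = LLMTestCase(
    input="...",
    actual_output="...",
    cost=cost
)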
info

Similar to the LatencyMetric, the CostMetric is the only deepeval metric that uses the cost parameter.

Assert A Test Case

Before we begin going through the final sections, we highly recommend you log in to Confident AI (the platform powering deepeval) via the CLI. This way, you can keep track of all the evaluation results generated each time you execute deepeval test run.

deepeval login

Similar to Pytest, deepeval allows you to assert any test case you create by calling the assert_test function and running deepeval test run via the CLI.

A test case passes only if all metrics pass. Depending on the metric, a combination of input, actual_output, expected_output, context, and retrieval_context is used to ascertain whether its criteria have been met.

test_assert_example.py
# A hypothetical LLM application example
import chatbot
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog named Rocky when replying to the text below.

{text}
"""

def test_case_1():
    prompt = prompt_template.format(text="Who's a good boy?")
    context = ["Rocky is a good boy."]

    test_case = LLMTestCase(
        input=prompt,
        # Replace this with your actual LLM application
        actual_output=chatbot.run(prompt),
        expected_output="Me, ruff!",
        context=context
    )
    metric = HallucinationMetric(threshold=0.7)
    assert_test(test_case, metrics=[metric])

def test_case_2():
    prompt = prompt_template.format(text="Who's a good boy?")
    context = ["Rocky is a good boy."]

    test_case = LLMTestCase(
        input=prompt,
        # Replace this with your actual LLM application
        actual_output=chatbot.run(prompt),
        expected_output="Me, ruff!",
        context=context
    )
    metric = HallucinationMetric(threshold=0.7)
    assert_test(test_case, metrics=[metric])

There are two mandatory and one optional parameter when calling the assert_test() function:

  • test_case: an LLMTestCase
  • metrics: a list of metrics of type BaseMetric
  • [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaults to True.
info

The run_async parameter overrides the async_mode property of all metrics being evaluated. The async_mode property, as you'll learn later in the metrics section, determines whether each metric can execute asynchronously.
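
For example, to force all metrics to run sequentially for a single assertion, you can set run_async to False (a minimal sketch reusing the test case and metric defined above):

assert_test(test_case, metrics=[metric], run_async=False)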

To execute the test cases, run deepeval test run via the CLI, which uses deepeval's Pytest integration under the hood to execute these tests. You can also include an optional -n flag followed by a number (which determines the number of processes to use) to run tests in parallel.

deepeval test run test_assert_example.py -n 4

Evaluate Test Cases in Bulk

Lastly, deepeval offers an evaluate function to evaluate multiple test cases at once, which is similar to assert_test but without the need for Pytest or the CLI.

# A hypothetical LLM application example
import chatbot
from deepeval import evaluate
from deepeval.metrics import HallucinationMetric
from deepeval.test_case import LLMTestCase

prompt_template = """
Impersonate a dog named Rocky when replying to the text below.

{text}
"""

prompt = prompt_template.format(text="Who's a good boy?")
context = ["Rocky is a good boy."]

first_test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    expected_output="Me, ruff!",
    context=context
)

second_test_case = LLMTestCase(
    input=prompt,
    # Replace this with your actual LLM application
    actual_output=chatbot.run(prompt),
    expected_output="Me, ruff!",
    context=context
)

test_cases = [first_test_case, second_test_case]

metric = HallucinationMetric(threshold=0.7)
evaluate(test_cases, [metric])

There are two mandatory and three optional arguments when calling the evaluate() function (see the sketch after this list):

  • test_cases: a list of LLMTestCases
  • metrics: a list of metrics of type BaseMetric
  • [Optional] run_async: a boolean which, when set to True, enables concurrent evaluation of all metrics. Defaults to True.
  • [Optional] show_indicator: a boolean which, when set to True, shows the progress indicator for each individual metric. Defaults to True.
  • [Optional] print_results: a boolean which, when set to True, prints the result of each evaluation. Defaults to True.
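
For instance, here's a minimal sketch (reusing the test_cases and metric variables from the example above) that keeps concurrent evaluation but silences the progress indicators and printed results:

evaluate(test_cases, [metric], show_indicator=False, print_results=False)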

Similar to assert_test, evaluate allows you to log and view test results on Confident AI. For more examples of evaluate, visit the datasets section.