
# Checks

Ready-to-use validation checks for common testing scenarios, including function-based checks, string matching, comparisons, and LLM-powered semantic validation.


## from_fn

Create a check from a callable function.

Module: `giskard.checks.builtin.fn`

```python
from giskard.checks import from_fn

# Simple boolean check
check = from_fn(
    lambda trace: trace.last.outputs is not None,
    name="has_output",
    success_message="Output was provided",
    failure_message="No output found",
)

# Check with custom logic
check = from_fn(
    lambda trace: len(trace.last.outputs) > 10,
    name="min_length_check",
    description="Validates minimum output length",
)

# Async check
async def validate_response(trace):
    response = trace.last.outputs
    # Perform async validation
    is_valid = await external_validator(response)
    return is_valid

check = from_fn(validate_response, name="async_validation")
```

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `fn` | `Callable` | required | Function taking a trace and returning `bool` or `CheckResult` |
| `name` | `str \| None` | `None` | Optional check name |
| `description` | `str \| None` | `None` | Optional description |
| `success_message` | `str \| None` | `None` | Message when the check passes |
| `failure_message` | `str \| None` | `None` | Message when the check fails |
| `details` | `dict \| None` | `None` | Additional details to include in the result |

Returns:

  • FnCheck: A check instance wrapping the function

## FnCheck

A Check whose logic is implemented as a Python callable.

Module: `giskard.checks.builtin.fn`

```python
from giskard.checks.builtin.fn import FnCheck

# Create directly
check = FnCheck(
    fn=lambda trace: "error" not in trace.last.outputs.lower(),
    name="no_errors",
    success_message="No errors detected",
    failure_message="Error found in output",
)
```

## StringMatching

Check that validates string patterns in trace values.

Module: `giskard.checks.builtin.string_matching`

```python
from giskard.checks import StringMatching

# Check if output contains specific text
check = StringMatching(
    keyword="success",
    text_key="trace.last.outputs",
    match_type="contains",
)

# Check with regex pattern
check = StringMatching(
    keyword=r"\d{3}-\d{3}-\d{4}",  # Phone number pattern
    text_key="trace.last.outputs.phone",
    match_type="regex",
)

# Case-insensitive matching
check = StringMatching(
    keyword="error",
    text_key="trace.last.outputs",
    match_type="contains",
    case_sensitive=False,
)
```

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `keyword` | `str` | required | Pattern to match (plain string or regex) |
| `text_key` | `str` | required | JSONPath to extract the value from the trace |
| `match_type` | `str` | `"contains"` | One of `"contains"`, `"equals"`, `"startswith"`, `"endswith"`, `"regex"` |
| `case_sensitive` | `bool` | `True` | Whether matching is case-sensitive |
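Keys such as `text_key` are dotted paths walked from the trace object. A plain-Python sketch of that lookup (illustrative only; the `resolve` helper and the trace stub below are hypothetical, and the library's actual JSONPath handling may be richer):

```python
from types import SimpleNamespace

def resolve(trace, path):
    # Walk a dotted path such as "trace.last.outputs.phone",
    # trying attribute access first, then dict-style lookup.
    obj = trace
    for part in path.split(".")[1:]:  # skip the leading "trace"
        obj = getattr(obj, part) if hasattr(obj, part) else obj[part]
    return obj

# Hypothetical trace-like object for illustration
trace = SimpleNamespace(last=SimpleNamespace(outputs={"phone": "555-123-4567"}))
resolve(trace, "trace.last.outputs.phone")  # "555-123-4567"
```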

Match Types:

| Type | Description | Example |
| --- | --- | --- |
| `contains` | Pattern appears anywhere in the text | `"hello" in "hello world"` |
| `equals` | Exact match | `"hello" == "hello"` |
| `startswith` | Text starts with the pattern | `"hello world".startswith("hello")` |
| `endswith` | Text ends with the pattern | `"hello world".endswith("world")` |
| `regex` | Regular expression match | `re.search(r"\d+", "test123")` |
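The match types above map directly onto standard Python string operations. A plain-Python equivalent of the matching logic, to make the semantics concrete (a sketch, not the library's implementation):

```python
import re

def matches(text, keyword, match_type="contains", case_sensitive=True):
    # Plain-Python equivalent of the documented match types (illustrative)
    if match_type == "regex":
        flags = 0 if case_sensitive else re.IGNORECASE
        return re.search(keyword, text, flags) is not None
    if not case_sensitive:
        text, keyword = text.lower(), keyword.lower()
    if match_type == "contains":
        return keyword in text
    if match_type == "equals":
        return text == keyword
    if match_type == "startswith":
        return text.startswith(keyword)
    if match_type == "endswith":
        return text.endswith(keyword)
    raise ValueError(f"Unknown match type: {match_type}")

matches("hello world", "hello")                 # True
matches("test123", r"\d+", match_type="regex")  # True
matches("Hello", "hello", match_type="equals",
        case_sensitive=False)                   # True
```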

## Comparison checks

Validate numeric and comparable values against expected thresholds.

### Equals

Check that extracted values equal an expected value.

Module: `giskard.checks.builtin.comparison`

```python
from giskard.checks import Equals

# Check exact value
check = Equals(
    expected_value=42,
    key="trace.last.outputs.count",
)

# Check string equality
check = Equals(
    expected_value="success",
    key="trace.last.outputs.status",
)

# Compare against a value from the trace
check = Equals(
    expected_value_key="trace.interactions[0].outputs.baseline",
    key="trace.last.outputs.result",
)
```

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `expected_value` | `Any \| None` | `None` | Static expected value |
| `expected_value_key` | `str \| None` | `None` | JSONPath to extract the expected value from the trace |
| `key` | `str` | required | JSONPath to extract the actual value |
| `normalization_form` | `str \| None` | `None` | Unicode normalization: `"NFC"`, `"NFD"`, `"NFKC"`, `"NFKD"` |
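The `normalization_form` option matters because visually identical strings can differ at the code-point level and fail a raw equality comparison. Python's standard `unicodedata` module shows the effect:

```python
import unicodedata

composed = "caf\u00e9"     # "café" with a precomposed é (U+00E9)
decomposed = "cafe\u0301"  # "cafe" followed by a combining acute accent (U+0301)

composed == decomposed  # False: the raw strings differ code point by code point

# After NFC normalization both collapse to the same composed form
unicodedata.normalize("NFC", composed) == unicodedata.normalize("NFC", decomposed)  # True
```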

### NotEquals

Check that extracted values do not equal an expected value.

```python
from giskard.checks import NotEquals

check = NotEquals(
    expected_value="error",
    key="trace.last.outputs.status",
)
```

### GreaterThan

Check that extracted values are greater than an expected value.

```python
from giskard.checks import GreaterThan

check = GreaterThan(
    expected_value=0.8,
    key="trace.last.metadata.confidence_score",
)
```

### GreaterEquals

Check that extracted values are greater than or equal to an expected value.

```python
from giskard.checks import GreaterEquals

check = GreaterEquals(
    expected_value=100,
    key="trace.last.outputs.user_count",
)
```

### LesserThan

Check that extracted values are less than an expected value.

```python
from giskard.checks import LesserThan

check = LesserThan(
    expected_value=500,
    key="trace.last.metadata.latency_ms",
)
```

### LesserThanEquals

Check that extracted values are less than or equal to an expected value.

```python
from giskard.checks import LesserThanEquals

check = LesserThanEquals(
    expected_value=1000,
    key="trace.last.metadata.token_count",
)
```

## LLM checks

Validation checks powered by Large Language Models for semantic understanding.

### BaseLLMCheck

Abstract base class for creating custom LLM-powered checks.

Module: `giskard.checks.judges.base`

BaseLLMCheck provides a framework for building checks that leverage Large Language Models for evaluation. It handles the LLM interaction, prompt rendering, and result parsing, so subclasses only need to define the evaluation prompt.

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `generator` | `BaseGenerator \| None` | `None` | LLM generator for evaluation; falls back to the global default if not provided |
| `name` | `str \| None` | `None` | Optional check name |
| `description` | `str \| None` | `None` | Optional description |
#### `get_prompt() -> str | Message | MessageTemplate | TemplateReference`

Returns the prompt to send to the LLM. Subclasses must implement this method.

Returns:

  • A string (automatically converted to a MessageTemplate)
  • A Message object
  • A MessageTemplate with Jinja2 templating
  • A TemplateReference pointing to a template file
#### `get_inputs(trace: Trace) -> dict[str, Any]`

Provides template variables for prompt rendering. Override to customize available variables.

Parameters:

  • trace (Trace): The trace containing interaction history

Returns:

  • dict[str, Any]: Template variables (default: {"trace": trace})

#### `run(trace: Trace) -> CheckResult`

Executes the LLM-based check (inherited; usually doesn't need to be overridden).

```python
from giskard.checks.judges.base import BaseLLMCheck
from giskard.agents.generators import Generator

@BaseLLMCheck.register("custom_llm_check")
class CustomLLMCheck(BaseLLMCheck):
    custom_instruction: str

    def get_prompt(self):
        return f"""
        Evaluate the interaction based on: {self.custom_instruction}

        Input: {{{{ trace.last.inputs }}}}
        Output: {{{{ trace.last.outputs }}}}

        Return passed=true if the interaction meets the criteria,
        passed=false otherwise. Include a reason for your decision.
        """

    def get_inputs(self, trace):
        # Optionally customize template variables
        return {
            "trace": trace,
            "custom_var": "additional context",
        }

# Usage
check = CustomLLMCheck(
    custom_instruction="Response must be concise and helpful",
    generator=Generator(model="openai/gpt-4"),
)
```

LLM checks expect the model to return structured output with:

  • passed (bool): Whether the check passed
  • reason (str, optional): Explanation of the result

The BaseLLMCheck automatically parses this structure into a CheckResult.
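For instance, a raw model response shaped like the following JSON would parse into a passing result (an illustration of the expected structure, not the exact wire format, which depends on the generator):

```python
import json

# Structured output the LLM is expected to produce
raw = '{"passed": true, "reason": "The answer is supported by the context."}'
result = json.loads(raw)

result["passed"]  # True
result["reason"]  # "The answer is supported by the context."
```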


### Result model

Default result model for LLM-based checks.

Module: `giskard.checks.judges.base`

| Attribute | Type | Description |
| --- | --- | --- |
| `passed` | `bool` | Whether the check passed |
| `reason` | `str \| None` | Optional explanation for the result |

This is the structured output format expected from the LLM when using BaseLLMCheck.


### Groundedness

Validates that answers are grounded in provided context documents.

Module: `giskard.checks.judges.groundedness`

Uses an LLM to determine if an answer is properly supported by the given context. This is crucial for RAG (Retrieval-Augmented Generation) systems to ensure responses don’t hallucinate information not present in the retrieved documents.

| Attribute | Type | Default | Description |
| --- | --- | --- | --- |
| `answer` | `str \| None` | `None` | The answer text to evaluate |
| `answer_key` | `str` | `"trace.last.outputs"` | JSONPath to extract the answer from the trace |
| `context` | `str \| list[str] \| None` | `None` | Context document(s) that should support the answer |
| `context_key` | `str` | `"trace.last.metadata.context"` | JSONPath to extract the context from the trace |
| `generator` | `BaseGenerator \| None` | `None` | LLM generator for evaluation |

Static values:

```python
from giskard.checks import Groundedness
from giskard.agents.generators import Generator

check = Groundedness(
    answer="The Eiffel Tower is in Paris.",
    context=["Paris is the capital of France.", "The Eiffel Tower is a famous landmark."],
    generator=Generator(model="openai/gpt-4"),
)
```

Extracting from trace:

```python
check = Groundedness(
    answer_key="trace.last.outputs.answer",
    context_key="trace.last.metadata.retrieved_docs",
    generator=Generator(model="openai/gpt-4"),
)

# Run against a trace
result = await check.run(trace)
```

### Conformity

Validates that interactions conform to a specified rule or requirement.

Module: `giskard.checks.judges.conformity`

Uses an LLM to evaluate whether an interaction (inputs, outputs, and metadata) conforms to a given rule. The rule supports Jinja2 templating, allowing for dynamic rules that reference trace data.

| Attribute | Type | Description |
| --- | --- | --- |
| `rule` | `str` | The conformity rule to evaluate. Supports Jinja2 templating with access to the `trace` object |
| `generator` | `BaseGenerator \| None` | LLM generator for evaluation (falls back to the default) |

Static rule:

```python
from giskard.checks import Conformity
from giskard.agents.generators import Generator

check = Conformity(
    rule="The response must be professional and polite",
    generator=Generator(model="openai/gpt-4"),
)
```

Dynamic rule with templating:

```python
check = Conformity(
    rule="The response must contain the keywords '{{ trace.last.inputs.required_keywords }}' and be concise",
    generator=Generator(model="openai/gpt-4"),
)

# The rule is rendered at runtime with access to trace data
result = await check.run(trace)
```

Accessing different trace elements:

```python
# Reference inputs
check = Conformity(
    rule="Respond to the user's query: '{{ trace.last.inputs }}'"
)

# Reference metadata
check = Conformity(
    rule="Use a {{ trace.last.metadata.tone }} tone in the response"
)

# Reference earlier interactions
check = Conformity(
    rule="Build upon the previous answer: '{{ trace.interactions[-2].outputs }}'"
)
```
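Because rules are ordinary Jinja2 templates, you can preview how one renders outside the check. A sketch assuming the `jinja2` package is installed; the `SimpleNamespace` object below is a hypothetical stand-in for a real trace:

```python
from types import SimpleNamespace
from jinja2 import Template

# Hypothetical stand-in for a real trace object
trace = SimpleNamespace(last=SimpleNamespace(metadata={"tone": "formal"}))

rule = "Use a {{ trace.last.metadata.tone }} tone in the response"
rendered = Template(rule).render(trace=trace)
print(rendered)  # Use a formal tone in the response
```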

### LLMJudge

General-purpose LLM-based validation with custom prompts.

Module: `giskard.checks.judges.judge`

The most flexible LLM check that allows you to define completely custom evaluation logic through prompts. Use this when the specialized checks (Groundedness, Conformity) don’t fit your needs.

| Attribute | Type | Description |
| --- | --- | --- |
| `prompt` | `str \| None` | Inline prompt content with Jinja2 templating support |
| `prompt_path` | `str \| None` | Path to a template file (e.g. `"checks::my_template.j2"`) |
| `generator` | `BaseGenerator \| None` | LLM generator for evaluation |

Inline prompt:

```python
from giskard.checks import LLMJudge
from giskard.agents.generators import Generator

check = LLMJudge(
    prompt="""
    Evaluate if the response is helpful and accurate.

    User Input: {{ trace.last.inputs }}
    AI Response: {{ trace.last.outputs }}

    Return passed=true if the response is helpful and accurate,
    passed=false otherwise. Provide a reason for your decision.
    """,
    generator=Generator(model="openai/gpt-4"),
)
```

Template file:

```python
# First, create a template file at templates/checks/safety.j2
check = LLMJudge(
    prompt_path="checks::safety.j2",
    generator=Generator(model="openai/gpt-4"),
)
```

Complex evaluation:

```python
check = LLMJudge(
    prompt="""
    Evaluate the multi-turn conversation quality.

    Conversation history:
    {% for interaction in trace.interactions %}
    User: {{ interaction.inputs }}
    Assistant: {{ interaction.outputs }}
    {% endfor %}

    Criteria:
    1. Consistency across turns
    2. Relevant responses
    3. Professional tone

    Return passed=true if all criteria are met, passed=false otherwise.
    Include specific reasons for any failures.
    """,
    generator=Generator(model="openai/gpt-4"),
)
```

The following variables are available in prompts:

| Variable | Description |
| --- | --- |
| `trace` | Full trace object with all interactions |
| `trace.interactions` | List of all interactions in order |
| `trace.last` | Most recent interaction (preferred) |
| `trace.last.inputs` | Inputs from the most recent interaction |
| `trace.last.outputs` | Outputs from the most recent interaction |
| `trace.last.metadata` | Metadata from the most recent interaction |

## SemanticSimilarity

Validate semantic similarity between outputs and expected content.

Module: `giskard.checks.builtin.semantic_similarity`

```python
from giskard.checks import SemanticSimilarity
from giskard.agents.generators import Generator

check = SemanticSimilarity(
    expected="The capital of France is Paris.",
    actual_key="trace.last.outputs",
    threshold=0.8,
    generator=Generator(model="openai/gpt-4"),
)
```

Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `expected` | `str` | required | Expected semantic content |
| `actual` | `str \| None` | `None` | Actual output to compare |
| `actual_key` | `str` | `"trace.last.outputs"` | JSONPath to extract the actual value |
| `threshold` | `float` | `0.8` | Similarity threshold (0.0 to 1.0) |
| `generator` | `BaseGenerator \| None` | `None` | LLM generator for evaluation |

## Using checks in scenarios

Checks can be attached to a scenario and evaluated against each interaction:

```python
from giskard.checks import Groundedness, Conformity, LLMJudge, Scenario

scenario = (
    Scenario()
    .interact(
        inputs="What is the capital of France?",
        outputs=lambda inputs: "Paris is the capital of France.",
    )
    .check(Groundedness(
        context=["France is a country in Europe.", "Paris is the capital."],
    ))
    .check(Conformity(
        rule="The response must be a complete sentence",
    ))
    .check(LLMJudge(
        prompt="Is the response educational and informative? Return passed=true/false.",
    ))
)
```
## Setting a default generator

Instead of passing a generator to every LLM check, set a global default once:

```python
from giskard.agents.generators import Generator
from giskard.checks import set_default_generator

# Set once, use everywhere
generator = Generator(model="openai/gpt-4", temperature=0.1)
set_default_generator(generator)

# No need to pass a generator anymore
check1 = Groundedness(answer="...", context=["..."])
check2 = Conformity(rule="...")
check3 = LLMJudge(prompt="...")
```
## Handling check results

Running a check returns a result whose status can be inspected:

```python
from giskard.checks import CheckStatus

result = await check.run(trace)

if result.status == CheckStatus.ERROR:
    print(f"Check failed with error: {result.message}")
elif result.status == CheckStatus.FAIL:
    print(f"Check failed: {result.message}")
    print(f"Details: {result.details}")
elif result.status == CheckStatus.PASS:
    print(f"Check passed: {result.message}")
```

## Custom checks

For validation logic that doesn’t fit the built-in checks, create a custom check:

```python
from giskard.checks import Check, CheckResult, Trace

@Check.register("custom_business_logic")
class CustomBusinessCheck(Check):
    threshold: float = 0.9
    allowed_categories: list[str] = []

    async def run(self, trace: Trace) -> CheckResult:
        # Extract data
        output = trace.last.outputs
        category = output.get("category")
        confidence = output.get("confidence", 0)

        # Validate category
        if category not in self.allowed_categories:
            return CheckResult.failure(
                message=f"Invalid category: {category}",
                details={"category": category, "allowed": self.allowed_categories},
            )

        # Validate confidence
        if confidence < self.threshold:
            return CheckResult.failure(
                message=f"Confidence {confidence} below threshold {self.threshold}",
                details={"confidence": confidence, "threshold": self.threshold},
            )

        return CheckResult.success(
            message="Validation passed",
            details={"confidence": confidence, "category": category},
        )

# Use the custom check
check = CustomBusinessCheck(
    threshold=0.85,
    allowed_categories=["sports", "news", "entertainment"],
)
result = await check.run(trace)
```