Skip to content
GitHubDiscord

Quickstart

This tutorial walks you through installing the SDK, connecting to the Hub, and running a complete evaluation against an LLM agent — from dataset creation to reading results.

  • Python 3.10 or later
  • A running Giskard Hub instance (cloud or self-hosted)
  • An API key from the Hub UI (User Settings → API Keys)
Terminal window
pip install giskard-hub

The SDK reads your Hub URL and API key from environment variables. Set them before running any code:

Terminal window
export GISKARD_HUB_BASE_URL="https://your-hub-instance.example.com"
export GISKARD_HUB_API_KEY="gsk_..."

Alternatively, pass them directly to the client constructor:

from giskard_hub import HubClient
hub = HubClient(
base_url="https://your-hub-instance.example.com",
api_key="gsk_...",
)

Projects are the top-level container for all your resources. Create one or retrieve an existing one:

# Create a new project
project = hub.projects.create(
name="Customer Support Bot",
description="Evaluation project for our support chatbot",
)
# Or list existing projects and pick one
projects = hub.projects.list()
project = projects[0]
print(f"Using project: {project.name} ({project.id})")

An agent points to your LLM application. The Hub calls this endpoint during evaluations.

agent = hub.agents.create(
project_id=project.id,
name="Support Bot v1",
description="GPT-4o-based customer support chatbot",
url="https://your-app.example.com/api/chat",
supported_languages=["en"],
headers=[{"name": "Authorization", "value": "Bearer <your-app-token>"}],
)
print(f"Agent registered: {agent.id}")

Before building a dataset, run a quick scan to surface security weaknesses in your agent:

import time
scan = hub.scans.create(
project_id=project.id,
agent_id=agent.id,
tags=["gsk:threat-type='prompt-injection'"],
)
print(f"Scan started: {scan.id}")
while scan.status.state == "running":
time.sleep(10)
scan = hub.scans.retrieve(scan.id)
print(f"Scan complete. Grade: {scan.grade}")

The grade ranges from A (no issues found) to D (critical vulnerabilities detected). See Vulnerability Scanning for the full tag catalogue, KB-grounded scans, and how to review probe results and turn successful attacks into test cases.

A dataset is a collection of test cases — conversations with expected outcomes and quality checks.

dataset = hub.datasets.create(
project_id=project.id,
name="Core Q&A Suite",
description="Basic correctness and tone checks",
)
# Add a test case
hub.test_cases.create(
dataset_id=dataset.id,
messages=[
{"role": "user", "content": "What is your return policy?"},
],
demo_output="We offer a 30-day return policy for all items.",
checks=[
{
"identifier": "correctness",
"params": {
"type": "correctness",
"reference": "We offer a 30-day return policy for all items."
},
},
],
)

The checks field controls which criteria are applied to each agent response — these can be LLM-judge, embedding similarity, or rule-based checks. See Checks & Metrics for the full list of built-in checks and how to define custom ones.

Now trigger an evaluation that sends every test case to your agent and scores the responses:

import time
evaluation = hub.evaluations.create(
project_id=project.id,
agent_id=agent.id,
criteria={
"dataset_id": dataset.id,
},
name="v1 baseline",
)
print(f"Evaluation started: {evaluation.id}")
# Poll until the evaluation completes
while evaluation.status.state == "running":
time.sleep(5)
evaluation = hub.evaluations.retrieve(evaluation.id)
print("Evaluation complete!")

Once complete, fetch the per-test-case results and inspect the metrics:

evaluation_results = hub.evaluations.results.list(evaluation.id)
for eval_result in evaluation_results:
print(f"Test case {eval_result.test_case.id}: {eval_result.state}")
for check_result in eval_result.results:
print(f" {check_result.name}: {'' if check_result.passed else ''}")

You can also view the full evaluation with aggregated metrics in the Hub UI.

  • Local agents: evaluate a Python function directly without an HTTP endpoint — see Evaluations
  • Generate test cases automatically: use scenarios or knowledge bases — see Datasets
  • Vulnerability scanning: find security weaknesses with Scans
  • Schedule recurring runs: see Scheduled Evaluations
  • Full API details: see the API Reference