Core Concepts
This page explains the mental model behind the Giskard Hub and how its resources relate to each other. Reading this before diving into the how-to guides will make everything click faster.
The big picture
Section titled “The big picture”Project├── Agents (your LLM applications)├── Knowledge Bases (document collections)├── Scans (automated vulnerability probing)├── Datasets (test case collections)│ └── Test Cases (individual conversation + checks)├── Checks (built-in and custom criteria)├── Evaluations (run an agent against a dataset)│ └── Results (per-test-case outcomes)├── Scheduled Evaluations└── Tasks (issues and follow-up items)Everything belongs to a Project. Projects are the organisational unit — your team can have one project per product, environment, or use case.
Projects
Section titled “Projects”A Project is a workspace that groups all related resources: agents, datasets, evaluations, and scans. It also holds Scenarios — reusable persona and behaviour templates used when auto-generating test cases.
SDK resource: hub.projects, hub.projects.scenarios
Agents
Section titled “Agents”An Agent represents your LLM application. It can be:
- A remote agent — an HTTP endpoint that the Hub calls with a list of chat messages and expects a response from.
- A local agent — a Python function you pass directly when calling
hub.evaluations.create_local(). Useful for evaluating models without exposing an HTTP endpoint.
Agents are configured with a URL, HTTP headers (for authentication), and the list of supported languages.
SDK resource: hub.agents
Knowledge Bases
Section titled “Knowledge Bases”A Knowledge Base is an indexed collection of documents. It has three primary uses:
- Document-based dataset generation — the Hub uses the documents as source material to auto-generate realistic test cases via
hub.datasets.generate_document_based(). - Grounded vulnerability scans — when you create a scan with a
knowledge_base_id, the probes are anchored to your actual document content, making attacks more realistic and specific. - Groundedness check context — retrieve relevant documents from the KB via
hub.knowledge_bases.search_documents()and pass them as thecontextfield of agroundednesscheck assertion. This verifies that your agent’s responses are grounded in your actual documents rather than hallucinated content.
Documents are stored as text chunks with optional topics/metadata.
SDK resource: hub.knowledge_bases
A Scan runs automated vulnerability probes against an agent to detect security and safety issues. Giskard covers the OWASP LLM Top 10 categories (Prompt Injection, Excessive Agency, Misinformation, …) as well as additional categories that go beyond the OWASP framework, such as Harmful Content Generation, Brand Damaging & Reputation, Legal & Financial Risk, and Misguidance & Unauthorized Advice. Each scan produces:
- Probe Results — grouped by vulnerability category.
- Probe Attempts — individual adversarial prompts and the agent’s responses.
- A Grade (A–D) summarising the overall security posture.
Scans can optionally be anchored to a Knowledge Base to generate attacks that are specific to your document content.
SDK resources: hub.scans, hub.scans.probes, hub.scans.attempts
Checks
Section titled “Checks”A Check is a criterion evaluated on an agent’s response. Checks belong to a project and can be reused across any dataset in that project. Not all checks use an LLM judge — some are purely rule-based:
| Identifier | How it evaluates | What it checks |
|---|---|---|
correctness | LLM judge | Is the response factually correct relative to the expected output? |
conformity | LLM judge | Does the response follow specified format, tone, or style rules? |
groundedness | LLM judge | Is the response grounded in the provided context (no hallucinations)? |
semantic_similarity | Embedding similarity | Is the response semantically close to the expected output? |
string_match | Rule-based | Does the response contain a specific keyword or substring? |
metadata | Rule-based | Do JSON path values in the response metadata meet specified conditions? |
You can also define custom checks via hub.checks.create() — a named, reusable configuration of any built-in check type with pre-set parameters, so you don’t have to repeat them across test cases.
SDK resource: hub.checks
Datasets
Section titled “Datasets”A Dataset is a named collection of Test Cases. Datasets can be built in several ways:
- Manually — create test cases one by one via
hub.test_cases.create(), useful when you have precise, hand-crafted scenarios. - From real conversation logs — import a JSONL or JSON file of recorded conversations with
hub.datasets.upload(), turning production traffic into a regression suite. - From Project Scenarios — define personas or behaviour patterns (scenarios) in your project and let the Hub auto-generate diverse test cases via
hub.datasets.generate_scenario_based(). - From a Knowledge Base — the Hub generates test cases whose questions and answers are grounded in your documents via
hub.datasets.generate_document_based(), ideal for RAG agents.
SDK resource: hub.datasets
Test Cases
Section titled “Test Cases”A Test Case represents a single conversation: a sequence of {role, content} messages (with optional metadata) exchanged between a user and an agent. The conversation does not have to end with an agent message — it can be as short as a single user turn. A list of checks is applied to the agent’s actual response at evaluation time.
SDK resources: hub.test_cases, hub.test_cases.comments
Evaluations
Section titled “Evaluations”An Evaluation is a run of an agent against all test cases in a dataset. For each test case, the Hub:
- Sends the messages to the agent and records the response.
- Runs each check on the conversation stored in the test case and the agent’s actual response.
- Marks the result as
passed,failed, orerror. - When a result fails, assigns it a failure category — a structured label (with an
identifier,title, anddescription) that classifies the nature of the failure at a higher level (e.g. “Hallucination”, “Off-topic response”). This makes it easier to triage and group failures across a large dataset.
Each individual outcome is stored as a Result (hub.evaluations.results).
Local evaluations
Section titled “Local evaluations”You can also run evaluations against a local Python function using hub.evaluations.create_local(). Your local process calls the agent and collects its responses, then submits them to the Hub, which orchestrates the check runs and stores the results.
SDK resources: hub.evaluations, hub.evaluations.results
Scheduled Evaluations
Section titled “Scheduled Evaluations”A Scheduled Evaluation is a recurring evaluation job. You configure the agent, dataset, and a frequency (daily, weekly, monthly), and the Hub runs it automatically on schedule. Results are delivered by email notification, and past runs are accessible via hub.scheduled_evaluations.list_evaluations().
SDK resource: hub.scheduled_evaluations
Tasks are a lightweight issue tracker built into the Hub. When you find a problem during an evaluation or scan, you can create a task to track the follow-up work. Each task has a title, a free-text description of what needs to be fixed, one or more assignees, a status (open, in_progress, resolved), and a priority (low, medium, high).
SDK resource: hub.tasks
Playground Chats
Section titled “Playground Chats”The Hub’s UI includes a Playground where you can chat with registered agents interactively. Each conversation is stored as a Playground Chat, which you can retrieve programmatically for analysis, export, or to turn into test cases.
SDK resource: hub.playground_chats
Audit Log
Section titled “Audit Log”Every significant action in the Hub (create, update, delete) is recorded in an Audit Log. You can search events by time range, user, entity type, or action, and retrieve the history for a specific resource.
SDK resource: hub.audit