Agent Evaluation
Agent evaluation lets you systematically test your AI agents against datasets and get structured, AI-generated insights on quality, errors, and regressions.
How it works
- Instrument your agent with tracing and dataset tagging
- Run the agent on your test inputs — traces are auto-collected
- Trigger an evaluation — Promptic analyzes the traces
- Review insights on quality, errors, patterns, and suggested fixes
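Putting those four steps together, here is a condensed sketch of the full loop, assembled from the snippets detailed in the steps below. The `agent` object and `test_queries` list are stand-ins for your own code, and the `comp_...`/`ds_...`/`run_...` IDs are placeholders:

```python
import promptic_sdk
from promptic_sdk import PrompticClient

promptic_sdk.init()

# 1-2. Tag traces with a dataset and run, then exercise the agent
with promptic_sdk.ai_component("support-agent", dataset="regression-tests", run="v2.1"):
    for query in test_queries:
        agent.run(query)

# 3. Trigger an evaluation and wait for it to complete
with PrompticClient() as client:
    evaluation = client.create_evaluation(
        component_id="comp_...", dataset_id="ds_...", run_id="run_...",
    )
    result = client.wait_for_evaluation(
        component_id="comp_...", evaluation_id=evaluation["id"],
    )

# 4. Review the generated insights
for insight in result["results"]["insights"]:
    print(f"[{insight['severity']}] {insight['title']}")
```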
Step 1: Run your agent with dataset tagging
Use the `ai_component` context manager with `dataset` and `run` parameters to tag traces:
```python
import promptic_sdk

promptic_sdk.init()

with promptic_sdk.ai_component("support-agent", dataset="regression-tests", run="v2.1"):
    for query in test_queries:
        agent.run(query)
```

All traces from this block are automatically grouped into the "regression-tests" dataset under a run called "v2.1".
Step 2: Trigger an evaluation
Using the SDK
```python
from promptic_sdk import PrompticClient

with PrompticClient() as client:
    # Create the evaluation
    evaluation = client.create_evaluation(
        component_id="comp_...",
        dataset_id="ds_...",
        run_id="run_...",        # optional: evaluate a specific run
        name="v2.1 regression",  # optional: human-readable name
    )

    # Wait for results (polls until complete, up to 5 min)
    result = client.wait_for_evaluation(
        component_id="comp_...",
        evaluation_id=evaluation["id"],
    )
```

Using the CLI
```bash
promptic evaluations run <component-id> \
  --dataset <dataset-id> \
  --run <run-id> \
  --name "v2.1 regression"
```

Step 3: Review insights
The evaluation returns structured insights:
```python
for insight in result["results"]["insights"]:
    print(f"[{insight['severity']}] {insight['title']}")
    print(f"  {insight['description']}")
    if insight.get("suggestedFix"):
        print(f"  Fix: {insight['suggestedFix']}")
```

Each insight includes:
| Field | Description |
|---|---|
| `type` | Category of the insight |
| `severity` | Impact level |
| `title` | Short summary |
| `description` | Detailed explanation |
| `frequency` | How often this issue occurs |
| `affectedRunIds` | Which runs are affected |
| `suggestedFix` | Recommended action |
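When an evaluation produces many insights, it can help to summarize them before reading each one. The sketch below tallies insights from the `result` fetched in Step 2 by severity and type, using only the fields listed above; the summary logic is illustrative and not part of the SDK:

```python
from collections import Counter

insights = result["results"]["insights"]

# Tally how many insights fall into each severity level and category
by_severity = Counter(i["severity"] for i in insights)
by_type = Counter(i["type"] for i in insights)

print("Insights by severity:", dict(by_severity))
print("Insights by type:", dict(by_type))
```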
The evaluation also provides aggregate metadata:
```python
meta = result["results"]["meta"]
print(f"Total runs: {meta['totalRuns']}")
print(f"Total tokens: {meta['totalTokens']}")
print(f"Total cost: ${meta['totalCostUsd']:.4f}")
print(f"Avg duration: {meta['averageDurationMs']}ms")
print(f"Error rate: {meta['errorRate']:.1%}")
```
Annotations
You can manually annotate traces within a run as positive or negative:
```python
with PrompticClient() as client:
    client.upsert_annotation(
        component_id="comp_...",
        run_id="run_...",
        trace_db_id="trace_...",
        rating="positive",  # or "negative"
        comment="Handled edge case well",
    )
```

Annotations help build ground truth for evaluating agent quality over time.
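Annotations can also be applied in bulk, for example after a manual review pass. A minimal sketch using the same `upsert_annotation` call, assuming you have already collected the trace database IDs and verdicts yourself (the `reviewed` list below is hypothetical):

```python
from promptic_sdk import PrompticClient

# (trace_db_id, rating, comment) gathered during a manual review -- hypothetical data
reviewed = [
    ("trace_...", "positive", "Correct refund policy cited"),
    ("trace_...", "negative", "Hallucinated an order number"),
]

with PrompticClient() as client:
    for trace_db_id, rating, comment in reviewed:
        client.upsert_annotation(
            component_id="comp_...",
            run_id="run_...",
            trace_db_id=trace_db_id,
            rating=rating,
            comment=comment,
        )
```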
Comparing runs
Create multiple runs within the same dataset to compare agent versions:
```python
# Run v1
with promptic_sdk.ai_component("support-agent", dataset="eval-set", run="v1"):
    for query in test_queries:
        agent_v1.run(query)

# Run v2
with promptic_sdk.ai_component("support-agent", dataset="eval-set", run="v2"):
    for query in test_queries:
        agent_v2.run(query)
```

Then trigger evaluations for each run and compare insights in the dashboard.
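If you prefer to compare programmatically rather than in the dashboard, you can evaluate each run with the SDK calls from Step 2 and diff the aggregate metadata. A rough sketch, assuming you have looked up the run IDs for the "v1" and "v2" runs (shown here as placeholders):

```python
from promptic_sdk import PrompticClient

run_ids = {"v1": "run_...", "v2": "run_..."}  # placeholder IDs for the two runs
summaries = {}

with PrompticClient() as client:
    for label, run_id in run_ids.items():
        # Evaluate this run and wait for the results
        evaluation = client.create_evaluation(
            component_id="comp_...",
            dataset_id="ds_...",
            run_id=run_id,
            name=f"{label} comparison",
        )
        result = client.wait_for_evaluation(
            component_id="comp_...",
            evaluation_id=evaluation["id"],
        )
        summaries[label] = result["results"]["meta"]

# Compare aggregate quality and latency across versions
for label, meta in summaries.items():
    print(f"{label}: error rate {meta['errorRate']:.1%}, avg {meta['averageDurationMs']}ms")
```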