Agent Evaluation
Agent evaluation lets you systematically test your AI agents against datasets and get structured, AI-generated insights on quality, errors, and regressions.
How it works
- Instrument your agent with tracing and dataset tagging
- Run the agent on your test inputs — traces are auto-collected
- Trigger an evaluation — Promptic analyzes the traces
- Review insights on quality, errors, patterns, and suggested fixes
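Putting those four steps together, here is a condensed sketch of the full loop, assembled from the snippets detailed in the steps below. The `agent` object and `test_queries` list are stand-ins for your own code, and the `comp_...`/`ds_...`/`run_...` IDs are placeholders:

```python
import promptic_sdk
from promptic_sdk import PrompticClient

promptic_sdk.init()

# 1-2. Tag traces with a dataset and run, then exercise the agent
with promptic_sdk.ai_component("support-agent", dataset="regression-tests", run="v2.1"):
    for query in test_queries:
        agent.run(query)

# 3. Trigger an evaluation and wait for it to complete
with PrompticClient() as client:
    evaluation = client.create_evaluation(
        component_id="comp_...", dataset_id="ds_...", run_id="run_...",
    )
    result = client.wait_for_evaluation(
        component_id="comp_...", evaluation_id=evaluation["id"],
    )

# 4. Review the generated insights
for insight in result["results"]["insights"]:
    print(f"[{insight['severity']}] {insight['title']}")
```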
Step 1: Run your agent with dataset tagging
Use the `ai_component` context manager with `dataset` and `run` parameters to tag traces:
```python
import promptic_sdk

promptic_sdk.init()

with promptic_sdk.ai_component("support-agent", dataset="regression-tests", run="v2.1"):
    for query in test_queries:
        agent.run(query)
```

All traces from this block are automatically grouped into the "regression-tests" dataset under a run called "v2.1".
Step 2: Trigger an evaluation
Using the SDK
```python
from promptic_sdk import PrompticClient

with PrompticClient() as client:
    # Create the evaluation
    evaluation = client.create_evaluation(
        component_id="comp_...",
        dataset_id="ds_...",
        run_id="run_...",        # optional: evaluate a specific run
        name="v2.1 regression",  # optional: human-readable name
    )

    # Wait for results (polls until complete, up to 5 min)
    result = client.wait_for_evaluation(
        component_id="comp_...",
        evaluation_id=evaluation["id"],
    )
```

Using the CLI
```bash
promptic evaluations run <component-id> \
  --dataset <dataset-id> \
  --run <run-id> \
  --name "v2.1 regression"
```

Step 3: Review insights
The evaluation returns structured insights:
```python
for insight in result["results"]["insights"]:
    print(f"[{insight['severity']}] {insight['title']}")
    print(f"  {insight['description']}")
    if insight.get("suggestedFix"):
        print(f"  Fix: {insight['suggestedFix']}")
```

Each insight includes:
| Field | Description |
|---|---|
| `type` | Category of the insight |
| `severity` | Impact level |
| `title` | Short summary |
| `description` | Detailed explanation |
| `frequency` | How often this issue occurs |
| `affectedRunIds` | Which runs are affected |
| `suggestedFix` | Recommended action |
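When an evaluation produces many insights, it can help to summarize them before reading each one. The sketch below tallies insights from the `result` fetched in Step 2 by severity and type, using only the fields listed above; the summary logic is illustrative and not part of the SDK:

```python
from collections import Counter

insights = result["results"]["insights"]

# Tally how many insights fall into each severity level and category
by_severity = Counter(i["severity"] for i in insights)
by_type = Counter(i["type"] for i in insights)

print("Insights by severity:", dict(by_severity))
print("Insights by type:", dict(by_type))
```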
The evaluation also provides aggregate metadata:
```python
meta = result["results"]["meta"]
print(f"Total runs: {meta['totalRuns']}")
print(f"Total tokens: {meta['totalTokens']}")
print(f"Total cost: ${meta['totalCostUsd']:.4f}")
print(f"Avg duration: {meta['averageDurationMs']}ms")
print(f"Error rate: {meta['errorRate']:.1%}")
```
Annotations
You can manually annotate traces within a run as positive or negative:
```python
with PrompticClient() as client:
    client.upsert_annotation(
        component_id="comp_...",
        run_id="run_...",
        trace_db_id="trace_...",
        rating="positive",  # or "negative"
        comment="Handled edge case well",
    )
```

Annotations help build ground truth for evaluating agent quality over time.
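Annotations can also be applied in bulk, for example after a manual review pass. A minimal sketch using the same `upsert_annotation` call, assuming you have already collected the trace database IDs and verdicts yourself (the `reviewed` list below is hypothetical):

```python
from promptic_sdk import PrompticClient

# (trace_db_id, rating, comment) gathered during a manual review -- hypothetical data
reviewed = [
    ("trace_...", "positive", "Correct refund policy cited"),
    ("trace_...", "negative", "Hallucinated an order number"),
]

with PrompticClient() as client:
    for trace_db_id, rating, comment in reviewed:
        client.upsert_annotation(
            component_id="comp_...",
            run_id="run_...",
            trace_db_id=trace_db_id,
            rating=rating,
            comment=comment,
        )
```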
Comparing runs
Create multiple runs within the same dataset to compare agent versions:
```python
# Run v1
with promptic_sdk.ai_component("support-agent", dataset="eval-set", run="v1"):
    for query in test_queries:
        agent_v1.run(query)

# Run v2
with promptic_sdk.ai_component("support-agent", dataset="eval-set", run="v2"):
    for query in test_queries:
        agent_v2.run(query)
```

Then trigger evaluations for each run and compare insights in the dashboard.
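If you prefer to compare programmatically rather than in the dashboard, you can evaluate each run with the SDK calls from Step 2 and diff the aggregate metadata. A rough sketch, assuming you have looked up the run IDs for the "v1" and "v2" runs (shown here as placeholders):

```python
from promptic_sdk import PrompticClient

run_ids = {"v1": "run_...", "v2": "run_..."}  # placeholder IDs for the two runs
summaries = {}

with PrompticClient() as client:
    for label, run_id in run_ids.items():
        # Evaluate this run and wait for the results
        evaluation = client.create_evaluation(
            component_id="comp_...",
            dataset_id="ds_...",
            run_id=run_id,
            name=f"{label} comparison",
        )
        result = client.wait_for_evaluation(
            component_id="comp_...",
            evaluation_id=evaluation["id"],
        )
        summaries[label] = result["results"]["meta"]

# Compare aggregate quality and latency across versions
for label, meta in summaries.items():
    print(f"{label}: error rate {meta['errorRate']:.1%}, avg {meta['averageDurationMs']}ms")
```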