Changelog
Tracing, Agent Evaluation, Python SDK, and richer prompts
This release introduces full LLM tracing, end-to-end agent evaluation, an official Python SDK, train/test split, multi-variable prompt support, chat message prompt templates, and configurable structured output evaluation.
Tracing
Promptic now provides automatic, OpenTelemetry-based observability for every LLM call your application makes. See request and response payloads, token counts, cost, and latency across OpenAI, Anthropic, Google, LangChain, and Cohere — without changing your application code.
- Drop-in instrumentation via `promptic_sdk.init()`
- Provider-agnostic span schema for cross-vendor comparison
- Full trace history powering Agent Evaluation and prompt optimization downstream
See the Tracing guide to get started.
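A minimal sketch of what drop-in instrumentation looks like, assuming `init()` takes an API key argument (the parameter name is an assumption; the Tracing guide has the exact signature):

```python
import os

import promptic_sdk
from openai import OpenAI

# Assumed parameter name -- see the Tracing guide for the exact init() signature.
promptic_sdk.init(api_key=os.environ["PROMPTIC_API_KEY"])

# Any supported provider client used after init() is traced automatically:
# request/response payloads, token counts, cost, and latency become spans.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize OpenTelemetry in one sentence."}],
)
print(response.choices[0].message.content)
```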
Agent Evaluation
Score multi-step agent runs end-to-end. Tag traces with dataset and run parameters, trigger an evaluation, and get AI-generated insights on quality, errors, and regressions — without writing custom evaluators.
See the Agent Evaluation guide for the full setup.
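One way the tagging step might look, assuming traces are tagged through ordinary OpenTelemetry span attributes (the attribute keys below are illustrative, not the SDK's documented schema; the Agent Evaluation guide has the real names):

```python
from opentelemetry import trace

tracer = trace.get_tracer("my-agent")

def run_agent(question: str) -> str:
    # Placeholder for your multi-step agent; each LLM call inside it
    # produces child spans under the parent "agent-run" span.
    return f"answer to: {question}"

# Wrap the agent run in a parent span and tag it with dataset/run parameters
# so the whole trace can later be scored by Agent Evaluation.
with tracer.start_as_current_span("agent-run") as span:
    # Illustrative attribute keys -- the real tag names are in the Agent Evaluation guide.
    span.set_attribute("promptic.dataset_id", "support-tickets-v2")
    span.set_attribute("promptic.run_id", "baseline-2024-06")
    answer = run_agent("How do I reset my password?")
```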
Python SDK
The official promptic_sdk package brings tracing, dataset tagging, and prompt optimization to Python applications, with first-class support for OpenAI, Anthropic, Google, LangChain, and Cohere.
- Auto-instrumented OpenTelemetry tracing
- Programmatic experiment management
- CLI for local workflows
See the SDK installation guide and the Python SDK reference.
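Installation plus a second-provider example, assuming the PyPI name matches the import name and that `init()` takes an API key (both assumptions; the installation guide is authoritative):

```python
# pip install promptic_sdk   <- assumed package name; confirm in the installation guide
import os

import anthropic
import promptic_sdk

promptic_sdk.init(api_key=os.environ["PROMPTIC_API_KEY"])  # assumed parameter name

# The same init() call covers every supported provider, so an Anthropic client
# is traced with the same span schema as an OpenAI one.
client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    messages=[{"role": "user", "content": "Name one benefit of a held-out evaluation set."}],
)
print(message.content[0].text)
```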
Train / test split
You can now split your dataset into a training set used for optimization and a held-out evaluation set used for unbiased scoring — preventing optimizers from overfitting to the same observations they're being graded on.
- Configure a `train_split_ratio` per experiment (e.g. `0.8` reserves 20% for evaluation)
- Split is deterministic per experiment, so reruns score against the same held-out set
- Available in both the experiment UI and the Python SDK / optimizer hyperparameters
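The split semantics, illustrated in plain Python rather than the SDK (in practice you set the ratio on the experiment; this sketch only shows why a per-experiment seed makes reruns score against the same held-out set):

```python
import random

def split_dataset(items, train_split_ratio=0.8, experiment_id="exp-42"):
    """Deterministic split: the same experiment_id always produces the same
    held-out evaluation set, so reruns are comparable."""
    rng = random.Random(experiment_id)  # seeded per experiment
    shuffled = list(items)
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * train_split_ratio)
    return shuffled[:cutoff], shuffled[cutoff:]  # (train, held-out eval)

observations = [f"example-{i}" for i in range(10)]
train, held_out = split_dataset(observations)      # 0.8 -> 8 train / 2 eval
assert split_dataset(observations)[1] == held_out  # rerun: identical held-out set
```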
Multi-variable prompt support
Define and optimize prompts that take more than one input — useful for RAG, multi-turn flows, and structured tasks where context, query, and instructions are passed as separate fields. Variables flow through optimization, evaluation, and predictions consistently and work with all evaluator types.
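For example, a RAG-style prompt with three separate inputs (plain Python `str.format` templating here; how variables are declared in the Promptic UI or SDK may differ):

```python
RAG_PROMPT = (
    "Instructions: {instructions}\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer using only the context above."
)

# Each dataset row supplies every variable; the same variables flow through
# optimization, evaluation, and predictions.
row = {
    "instructions": "Answer concisely and cite the snippet you used.",
    "context": "Promptic's train/test split holds out part of the dataset for scoring.",
    "query": "What is the held-out set used for?",
}
print(RAG_PROMPT.format(**row))
```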
Chat message prompt support
In addition to plain-text prompts, Promptic now optimizes structured chat messages — system, user, and assistant turns — so the format you optimize matches the format you ship. Per-turn variable interpolation is supported, with round-trip compatibility across providers that use the messages API.
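A chat-format prompt with per-turn interpolation might look like this (plain dictionaries in the provider-standard messages shape; Promptic's authoring format may differ):

```python
CHAT_TEMPLATE = [
    {"role": "system", "content": "You are a support assistant for {product_name}."},
    {"role": "user", "content": "{question}"},
]

def render(template, **variables):
    # Interpolate variables turn by turn, keeping the messages structure intact
    # so the optimized prompt ships in the same format the provider expects.
    return [
        {"role": m["role"], "content": m["content"].format(**variables)}
        for m in template
    ]

messages = render(CHAT_TEMPLATE, product_name="Promptic", question="How do I enable tracing?")
# `messages` can be passed straight to any provider that uses the messages API.
```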
Configurable structured output evaluation
Structured-output experiments can now be evaluated on only the fields that actually matter, with different matching strategies per field and judge configuration tuned to each field's role. Optional or noisy fields no longer drag down the score, and regressions are debuggable field-by-field instead of via a single aggregate number.
- Per-field `include`, `weight`, and `required` honored by the scoring aggregation
- Per-field strategy overrides (`FieldStrategy` for scalars, `ArrayStrategy` for lists, plus the new `ContainsStrategy`)
- Configurable LLM judge with three judge types, per-field prompts, and a cosine-similarity floor for fast filtering
- Per-field scoring diagnostics surfaced in the experiment view
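The kind of configuration this enables, sketched as a plain dictionary (the keys mirror the options above, but the concrete schema in the UI and SDK may differ):

```python
# Illustrative shape only -- the concrete schema lives in the experiment settings docs.
structured_output_eval = {
    "fields": {
        "title":   {"include": True,  "weight": 2.0, "required": True,  "strategy": "FieldStrategy"},
        "tags":    {"include": True,  "weight": 1.0, "required": False, "strategy": "ArrayStrategy"},
        "summary": {"include": True,  "weight": 1.5, "required": True,  "strategy": "ContainsStrategy"},
        "notes":   {"include": False},  # noisy/optional field excluded from the aggregate score
    },
    "judge": {
        "judge_type": "...",                  # one of the three judge types (see docs)
        "field_prompts": {"summary": "Does the summary capture the key decision?"},
        "cosine_similarity_floor": 0.75,      # fast filter before invoking the LLM judge
    },
}
```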