Prompt Optimization
Promptic runs automated experiments to find the best prompt for your LLM task. You provide training examples and evaluation criteria, and Promptic iterates through candidate prompts to maximize your score.
How it works
- Create a component — A container for your LLM feature
- Create an experiment — Define the target model, task type, and optimizer
- Add observations — Input variables and expected outputs for training
- Add evaluators — Scoring criteria (accuracy, LLM judge, similarity, etc.)
- Start — Promptic runs multiple iterations, testing and scoring candidate prompts
- Deploy — Push the best prompt to production
Full example
from promptic_sdk import PrompticClient
with PrompticClient() as client:
# 1. Create a component
comp = client.create_component("email-classifier")
# 2. Create an experiment
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
optimizer="prompticV2",
)
# 3. Add training data
client.create_observations(exp["id"], [
{"variables": {"message": "50% off all items today only!"}, "expected": "spam"},
{"variables": {"message": "Your order has shipped"}, "expected": "not_spam"},
{"variables": {"message": "Click here to claim your prize"}, "expected": "spam"},
{"variables": {"message": "Meeting tomorrow at 3pm"}, "expected": "not_spam"},
# Add 20+ observations for best results
])
# 4. Add an evaluator
client.create_evaluators(exp["id"], [
{"name": "accuracy", "type": "f1", "weight": 1.0},
])
# 5. Start the experiment
client.start_experiment(exp["id"])

Monitor progress in the dashboard or poll the API:
# Check the best iteration so far
best = client.get_best_iteration(exp["id"])
print(f"Score: {best['overallNormalizedScore']}")
# When `trainSplitRatio` is set, `evalNormalizedScore` reports the
# held-out eval performance. See "Train / eval split" below.
print(f"Eval score: {best['evalNormalizedScore']}")
print(f"Prompt: {best['prompt']}")

Task types
| Type | Description | Use when |
|---|---|---|
| classification | Maps inputs to discrete labels | Spam detection, sentiment analysis, categorization |
| textGeneration | Generates free-form text | Summarization, content writing, Q&A |
| structuredOutput | Produces structured JSON | Data extraction, form filling, API responses |
Evaluator types
| Type | Description | Best for |
|---|---|---|
| f1 | F1 score against expected labels | Classification tasks |
| referenceJudge | LLM scores predicted and expected independently, rewards matching | Intrinsic quality rubrics (e.g. "is this well-reasoned") |
| comparisonJudge | LLM sees predicted and expected together, rates how they compare | Rubrics that relate the two outputs (structural match) |
| generalJudge | User-defined multi-message prompt with template variables | Multi-turn judges, few-shot judges, dataset-column refs |
| similarity | Text similarity to expected output | Paraphrasing, translation |
| structuredOutput | Schema validation + field accuracy | Structured output tasks |
You can use multiple evaluators with different weights:
client.create_evaluators(exp["id"], [
{"name": "accuracy", "type": "f1", "weight": 0.7},
{
"name": "quality",
"type": "referenceJudge",
"weight": 0.3,
"scaleMin": 1,
"scaleMax": 5,
"config": {
"instructions": (
"Score the answer's factual accuracy. "
"5 = fully accurate and well-supported; "
"1 = incorrect or unsupported."
),
},
},
])

Judge evaluator configs
All three judge types accept a scaleMin/scaleMax range and require a config.

- referenceJudge / comparisonJudge take config.instructions (string): the rubric text. The reference judge scores each side of the pair independently against the rubric (caching the expected-side judgment) and rewards predictions that match or exceed the expected score. The comparison judge scores the predicted output directly against the expected in one prompt.
- generalJudge takes config.messages (list of {role, content}): the full judge prompt. role is system, user, or assistant. content can reference {input}, {expected}, {predicted}, or any dataset column name (e.g. {difficulty}). Unknown {tokens} are left as-is, so misreferenced variables are visible in the rendered prompt.
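The substitution behavior described above (known variables replaced, unknown {tokens} left untouched) can be sketched as follows. This is an illustration only; render_template is a hypothetical helper, not part of the SDK:

```python
import re

def render_template(content: str, variables: dict) -> str:
    """Replace {token} placeholders with values from `variables`.

    Tokens with no matching variable are left as-is, so a misreferenced
    dataset column shows up verbatim in the rendered judge prompt.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    return re.sub(r"\{(\w+)\}", substitute, content)
```

For example, rendering "Difficulty: {difficulty} / {oops}" with only a difficulty variable produces "Difficulty: hard / {oops}", making the typo easy to spot.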
# Comparison judge: structural match between predicted and expected
client.create_evaluators(exp["id"], [
{
"name": "structure",
"type": "comparisonJudge",
"weight": 1.0,
"scaleMin": 1,
"scaleMax": 3,
"config": {
"instructions": (
"3 = counts and contents match per table. "
"2 = counts match but some content differs. "
"1 = counts differ."
),
},
},
])
# General judge: custom multi-message prompt referencing dataset columns
client.create_evaluators(exp["id"], [
{
"name": "custom_judge",
"type": "generalJudge",
"weight": 1.0,
"scaleMin": 1,
"scaleMax": 5,
"config": {
"messages": [
{
"role": "system",
"content": (
"You are a strict evaluator. Reply with short "
"reasoning followed by an integer score."
),
},
{
"role": "user",
"content": (
"Difficulty: {difficulty}\n"
"Input:\n{input}\n\n"
"Expected:\n{expected}\n\n"
"Predicted:\n{predicted}"
),
},
],
},
},
])

Migration from judge

The legacy judge evaluator type was split in migration 0076. Existing judge rows were automatically converted to referenceJudge, which preserves the legacy per-side scoring semantics. New experiments must use one of the three explicit types above.
structuredOutput evaluator config
The structuredOutput evaluator scores a JSON-shaped prediction against an
expected reference. By default, scoring is derived from the
schema_definition:
- string fields → embedding similarity (semantic match via OpenAI text embeddings).
- enum / boolean / integer fields → exact equality.
- number fields → tolerance-based equality.
- nested object fields → recursive aggregation.
- array fields → content-aligned soft F1 (greedy alignment on a similarity matrix; not positional).
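As an illustration of the array default, here is a minimal sketch of content-aligned soft F1 under greedy alignment. The similarity function is a stand-in parameter (the real evaluator derives it from field types, e.g. embeddings for strings), so treat the shape of the computation, not the numbers, as the point:

```python
def soft_f1(predicted: list[str], expected: list[str], sim) -> float:
    """Greedy alignment on a similarity matrix, then soft F1.

    Pairs are taken highest-similarity first, so item order does not
    matter (alignment is by content, not position).
    """
    if not predicted or not expected:
        return 1.0 if predicted == expected else 0.0

    # All candidate (predicted, expected) pairs, best-first.
    pairs = sorted(
        ((sim(p, e), i, j)
         for i, p in enumerate(predicted)
         for j, e in enumerate(expected)),
        reverse=True,
    )
    used_p, used_e, matched = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_p and j not in used_e:
            used_p.add(i)
            used_e.add(j)
            matched += score  # soft credit: partial similarity counts

    precision = matched / len(predicted)
    recall = matched / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With an exact-equality sim (1.0 on match, 0.0 otherwise) this reduces to multiset-intersection F1, i.e. the exact array strategy described below.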
Override the per-field defaults via fields and add domain-specific
LLM-as-judge guidance via judge_instructions.
Per-field overrides (fields)
fields is keyed by the dotted JSON path of each field. Each entry can set:
| Key | Type | Description |
|---|---|---|
| include | bool | Skip the field entirely when false. Default true. |
| weight | float | Weight applied to this field's score in the aggregate. Default 1.0. |
| strategy | string | Override the scalar comparison strategy. One of exact, embedding, contains, judge. |
| array_strategy | string | Override the array aggregation strategy. One of exact, similarity, judge. |
Whether a field counts as required is read from the JSON schema's required array — there is no per-field required override here, so that decision lives in one place.
Scalar strategies:
| Value | Behavior |
|---|---|
| exact | Character-for-character equality (with float tolerance for numbers). |
| embedding | Cosine similarity via OpenAI text-embedding-3-small, with a calibrated floor (see Cosine similarity floor below). |
| contains | Case-insensitive substring match — expected must appear in predicted. |
| judge | LLM-as-judge per-pair scoring on string fields. Adds reasoning to the observation-details sheet. Requires the experiment's judge LLM. |
Array strategies:
| Value | Behavior |
|---|---|
| exact | Multiset-intersection F1 under exact equality. Suitable for arrays of enums / IDs. |
| similarity | Greedy alignment on the similarity matrix, soft F1. Default for string and object arrays. |
| judge | Whole-array single-call LLM judgment that returns F1-compatible counts. Suitable for arrays of strings/objects where paraphrase or semantic equivalence matters across items. |
Judge instructions
Set judge_instructions on the evaluator config to add domain-specific
guidance to every field whose strategy or array_strategy is judge. The
built-in rubric — "do these convey the same essential information?" —
always applies; your text is appended as domain-specific guidance and is
shared across every judge field on this evaluator.
client.create_evaluators(exp["id"], [
{
"name": "extracted-fields",
"type": "structuredOutput",
"weight": 1.0,
"config": {
"schema_definition": {
"type": "object",
"properties": {
"company": {"type": "string"},
"tags": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"},
},
},
"fields": {
# Paraphrase OK — score by semantic equivalence per pair.
"summary": {"strategy": "judge"},
# Match items by semantics, not embedding distance.
"tags": {"array_strategy": "judge"},
# Tighten the default to require an exact company match.
"company": {"strategy": "exact"},
},
# Optional: domain-specific guidance shared by every judge field.
# Leave unset to use the built-in rubric on its own.
"judge_instructions": (
"For tags, ignore casing and punctuation. "
"Treat 'planned' and 'in progress' as semantically equivalent."
),
},
},
])

When a judge-configured array exceeds 50 items on either side, the evaluator falls back to similarity aggregation and surfaces a warning badge on the observation-details sheet so you can spot the bypass.
Cosine similarity floor
The embedding strategy maps cosine similarity to [0, 1] with a
calibrated floor (0.15, tuned for text-embedding-3-small): unrelated
string pairs are clipped to 0.0 instead of the previous ~0.55 from a
naive (cos + 1) / 2 remap, and pairs above the floor are linearly
rescaled so an exact match still lands at 1.0. Re-running an older
experiment with string-heavy schemas will show lower scores on
unrelated-string observations and a sharper gradient on near-matches.
When fine-grained semantic scoring matters more than embedding distance
(e.g. paraphrase detection with tight tolerances), configure those fields
with "strategy": "judge" instead.
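The remap described above follows directly from the numbers in this section (floor at 0.15, linear rescale so cos = 1.0 maps to 1.0). A sketch of the formula, not the evaluator's actual code:

```python
FLOOR = 0.15  # calibrated for text-embedding-3-small

def remap_similarity(cos: float) -> float:
    """Map cosine similarity to [0, 1] with a calibrated floor.

    Pairs at or below the floor are clipped to 0.0; pairs above it are
    linearly rescaled so an exact match (cos = 1.0) still scores 1.0.
    """
    if cos <= FLOOR:
        return 0.0
    return (cos - FLOOR) / (1.0 - FLOOR)
```

Compare with the naive (cos + 1) / 2 remap, which would give an unrelated pair at cos ≈ 0.1 a score near 0.55 rather than 0.0.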
Optimizers
| Optimizer | Description |
|---|---|
| prompticV2 | Promptic's default optimizer. Recommended for most tasks. |
| promptic | Original Promptic optimizer. |
| miproV2 | DSPy MIPROv2 optimizer. Good for few-shot learning. |
| bootstrapFewShot | DSPy bootstrap few-shot optimizer. |
| gepa | Genetic/evolutionary prompt optimization. |
Hyperparameters
Customize the optimization process:
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
optimizer="prompticV2",
hyperparameters={
"epochs": 5, # Number of optimization rounds
"trainSplitRatio": 0.8, # Train/eval split (see below)
"numFewShots": 3, # Few-shot examples in prompt
"enableCot": True, # Chain-of-thought reasoning
},
)

Train / eval split
Set trainSplitRatio (0.1–0.95) to hold out part of your observations as an
eval set. The optimizer trains on the train split only, then scores candidate
prompts against the held-out eval split each iteration. This guards against
overfitting on small datasets and surfaces prompts that generalize.
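Concretely, with 20 observations and trainSplitRatio 0.8, the optimizer would train on 16 and score candidates on the 4 held out. The exact rounding rule is an assumption in this sketch:

```python
def split_sizes(n_observations: int, train_split_ratio: float) -> tuple[int, int]:
    """Return (train, eval) sizes for a given ratio; rounding is assumed."""
    n_train = round(n_observations * train_split_ratio)
    return n_train, n_observations - n_train
```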
Each iteration then reports two scores:
| Field | Meaning |
|---|---|
| overallNormalizedScore | Score on the train split (used to guide the search). |
| evalNormalizedScore | Score on the held-out eval split. null if no split configured. |
get_best_iteration ranks iterations by evalNormalizedScore when a split is
configured, falling back to overallNormalizedScore otherwise.
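The selection rule amounts to picking a sort key. An illustrative sketch, not the SDK's internals:

```python
def best_iteration(iterations: list[dict], split_configured: bool) -> dict:
    """Pick the best iteration, preferring the held-out eval score."""
    key = "evalNormalizedScore" if split_configured else "overallNormalizedScore"
    return max(iterations, key=lambda it: it[key])
```

In practice the client does this for you via get_best_iteration.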
best = client.get_best_iteration(exp["id"])
print(f"Train score: {best['overallNormalizedScore']}")
print(f"Eval score: {best['evalNormalizedScore']}")  # None if no split

Omit trainSplitRatio (or set it to null) to train and score on the full dataset.
Providing an initial prompt
If you already have a prompt, provide it as a starting point:
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
initial_prompt="Classify the following email as spam or not_spam.",
)

The optimizer uses this as a baseline and tries to improve upon it.
Continuing from a previous experiment
Once an experiment has finished, you can clone it to keep iterating without rebuilding the dataset:
# Duplicate: same observations + evaluators, starts from the source's initial prompt.
clone = client.duplicate_experiment(exp["id"])
# Continue: same observations + evaluators, starts from the source's best
# optimized iteration. Useful for chaining optimization runs.
next_run = client.duplicate_experiment(exp["id"], continue_from_optimized=True)
client.start_experiment(next_run["id"])

The CLI exposes the same flow via promptic experiments duplicate <id> and promptic experiments continue <id> (add --start to launch immediately).
Next steps
Once your experiment completes, deploy the optimized prompt to use it in production.