Prompt Optimization
Promptic runs automated experiments to find the best prompt for your LLM task. You provide training examples and evaluation criteria, and Promptic iterates through candidate prompts to maximize your score.
How it works
- Create a component — A container for your LLM feature
- Create an experiment — Define the target model, task type, and optimizer
- Add observations — Input variables and expected outputs for training
- Add evaluators — Scoring criteria (accuracy, LLM judge, similarity, etc.)
- Start — Promptic runs multiple iterations, testing and scoring candidate prompts
- Deploy — Push the best prompt to production
Full example
from promptic_sdk import PrompticClient
with PrompticClient() as client:
# 1. Create a component
comp = client.create_component("email-classifier")
# 2. Create an experiment
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
optimizer="prompticV2",
)
# 3. Add training data
client.create_observations(exp["id"], [
{"variables": {"message": "50% off all items today only!"}, "expected": "spam"},
{"variables": {"message": "Your order has shipped"}, "expected": "not_spam"},
{"variables": {"message": "Click here to claim your prize"}, "expected": "spam"},
{"variables": {"message": "Meeting tomorrow at 3pm"}, "expected": "not_spam"},
# Add 20+ observations for best results
])
# 4. Add an evaluator
client.create_evaluators(exp["id"], [
{"name": "accuracy", "type": "f1", "weight": 1.0},
])
# 5. Start the experiment
client.start_experiment(exp["id"])

Monitor progress in the dashboard or poll the API:
# Check the best iteration so far
best = client.get_best_iteration(exp["id"])
print(f"Score: {best['overallNormalizedScore']}")
# When `trainSplitRatio` is set, `evalNormalizedScore` reports the
# held-out eval performance. See "Train / eval split" below.
print(f"Eval score: {best['evalNormalizedScore']}")
print(f"Prompt: {best['prompt']}")

Task types
| Type | Description | Use when |
|---|---|---|
| classification | Maps inputs to discrete labels | Spam detection, sentiment analysis, categorization |
| textGeneration | Generates free-form text | Summarization, content writing, Q&A |
| structuredOutput | Produces structured JSON | Data extraction, form filling, API responses |
Evaluator types
| Type | Description | Best for |
|---|---|---|
| f1 | F1 score against expected labels | Classification tasks |
| referenceJudge | LLM scores predicted and expected independently, rewards matching | Intrinsic quality rubrics (e.g. "is this well-reasoned") |
| comparisonJudge | LLM sees predicted and expected together, rates how they compare | Rubrics that relate the two outputs (structural match) |
| generalJudge | User-defined multi-message prompt with template variables | Multi-turn judges, few-shot judges, dataset-column refs |
| similarity | Text similarity to expected output | Paraphrasing, translation |
| structuredOutput | Schema validation + field accuracy | Structured output tasks |
You can use multiple evaluators with different weights:
client.create_evaluators(exp["id"], [
{"name": "accuracy", "type": "f1", "weight": 0.7},
{
"name": "quality",
"type": "referenceJudge",
"weight": 0.3,
"scaleMin": 1,
"scaleMax": 5,
"config": {
"instructions": (
"Score the answer's factual accuracy. "
"5 = fully accurate and well-supported; "
"1 = incorrect or unsupported."
),
},
},
])

Judge evaluator configs
All three judge types accept a scaleMin/scaleMax range and require a config.

- referenceJudge / comparisonJudge take config.instructions (string): the rubric text. The reference judge scores each side of the pair independently against the rubric (caching the expected-side judgment) and rewards predictions that match or exceed the expected score. The comparison judge scores the predicted output directly against the expected in one prompt.
- generalJudge takes config.messages (list of {role, content}): the full judge prompt. role is system, user, or assistant. content can reference {input}, {expected}, {predicted}, or any dataset column name (e.g. {difficulty}). Unknown {tokens} are left as-is, so misreferenced variables are visible in the rendered prompt.
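The substitution behavior described above (known variables replaced, unknown {tokens} left untouched) can be sketched as follows. This is an illustration only; render_template is a hypothetical helper, not part of the SDK:

```python
import re

def render_template(content: str, variables: dict) -> str:
    """Replace {token} placeholders with values from `variables`.

    Tokens with no matching variable are left as-is, so a misreferenced
    dataset column shows up verbatim in the rendered judge prompt.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        return str(variables[key]) if key in variables else match.group(0)

    return re.sub(r"\{(\w+)\}", substitute, content)
```

For example, rendering "Difficulty: {difficulty} / {oops}" with only a difficulty variable produces "Difficulty: hard / {oops}", making the typo easy to spot.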
# Comparison judge: structural match between predicted and expected
client.create_evaluators(exp["id"], [
{
"name": "structure",
"type": "comparisonJudge",
"weight": 1.0,
"scaleMin": 1,
"scaleMax": 3,
"config": {
"instructions": (
"3 = counts and contents match per table. "
"2 = counts match but some content differs. "
"1 = counts differ."
),
},
},
])
# General judge: custom multi-message prompt referencing dataset columns
client.create_evaluators(exp["id"], [
{
"name": "custom_judge",
"type": "generalJudge",
"weight": 1.0,
"scaleMin": 1,
"scaleMax": 5,
"config": {
"messages": [
{
"role": "system",
"content": (
"You are a strict evaluator. Reply with short "
"reasoning followed by an integer score."
),
},
{
"role": "user",
"content": (
"Difficulty: {difficulty}\n"
"Input:\n{input}\n\n"
"Expected:\n{expected}\n\n"
"Predicted:\n{predicted}"
),
},
],
},
},
])

Migration from judge

The legacy judge evaluator type was split in migration 0076. Existing judge rows were automatically converted to referenceJudge, which preserves the legacy per-side scoring semantics. New experiments must use one of the three explicit types above.
structuredOutput evaluator config
The structuredOutput evaluator scores a JSON-shaped prediction against an
expected reference. By default, scoring is derived from the
schema_definition:
- string fields → embedding similarity (semantic match via OpenAI text embeddings).
- enum / boolean / integer fields → exact equality.
- number fields → tolerance-based equality.
- nested object fields → recursive aggregation.
- array fields → content-aligned soft F1 (greedy alignment on a similarity matrix; not positional).
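As an illustration of the array default, here is a minimal sketch of content-aligned soft F1 under greedy alignment. The similarity function is a stand-in parameter (the real evaluator derives it from field types, e.g. embeddings for strings), so treat the shape of the computation, not the numbers, as the point:

```python
def soft_f1(predicted: list[str], expected: list[str], sim) -> float:
    """Greedy alignment on a similarity matrix, then soft F1.

    Pairs are taken highest-similarity first, so item order does not
    matter (alignment is by content, not position).
    """
    if not predicted or not expected:
        return 1.0 if predicted == expected else 0.0

    # All candidate (predicted, expected) pairs, best-first.
    pairs = sorted(
        ((sim(p, e), i, j)
         for i, p in enumerate(predicted)
         for j, e in enumerate(expected)),
        reverse=True,
    )
    used_p, used_e, matched = set(), set(), 0.0
    for score, i, j in pairs:
        if i not in used_p and j not in used_e:
            used_p.add(i)
            used_e.add(j)
            matched += score  # soft credit: partial similarity counts

    precision = matched / len(predicted)
    recall = matched / len(expected)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

With an exact-equality sim (1.0 on match, 0.0 otherwise) this reduces to multiset-intersection F1, i.e. the exact array strategy described below.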
Override the per-field defaults via fields and add domain-specific
LLM-as-judge guidance via judge_instructions.
Per-field overrides (fields)
fields is keyed by the dotted JSON path of each field. Each entry can set:
| Key | Type | Description |
|---|---|---|
| include | bool | Skip the field entirely when false. Default true. |
| weight | float | Weight applied to this field's score in the aggregate. Default 1.0. |
| strategy | string | Override the scalar comparison strategy. One of exact, embedding, contains, judge. |
| array_strategy | string | Override the array aggregation strategy. One of exact, similarity, judge. |
Whether a field counts as required is read from the JSON schema's required array — there is no per-field required override here, so that decision lives in one place.
Scalar strategies:
| Value | Behavior |
|---|---|
| exact | Character-for-character equality (with float tolerance for numbers). |
| embedding | Cosine similarity via OpenAI text-embedding-3-small, with a calibrated floor (see Cosine similarity floor below). |
| contains | Case-insensitive substring match — expected must appear in predicted. |
| judge | LLM-as-judge per-pair scoring on string fields. Adds reasoning to the observation-details sheet. Requires the experiment's judge LLM. |
Array strategies:
| Value | Behavior |
|---|---|
| exact | Multiset-intersection F1 under exact equality. Suitable for arrays of enums / IDs. |
| similarity | Greedy alignment on the similarity matrix, soft F1. Default for string and object arrays. |
| judge | Whole-array single-call LLM judgment that returns F1-compatible counts. Suitable for arrays of strings/objects where paraphrase or semantic equivalence matters across items. |
Judge instructions
Set judge_instructions on the evaluator config to add domain-specific
guidance to every field whose strategy or array_strategy is judge. The
built-in rubric — "do these convey the same essential information?" —
always applies; your text is appended as domain-specific guidance and is
shared across every judge field on this evaluator.
client.create_evaluators(exp["id"], [
{
"name": "extracted-fields",
"type": "structuredOutput",
"weight": 1.0,
"config": {
"schema_definition": {
"type": "object",
"properties": {
"company": {"type": "string"},
"tags": {"type": "array", "items": {"type": "string"}},
"summary": {"type": "string"},
},
},
"fields": {
# Paraphrase OK — score by semantic equivalence per pair.
"summary": {"strategy": "judge"},
# Match items by semantics, not embedding distance.
"tags": {"array_strategy": "judge"},
# Tighten the default to require an exact company match.
"company": {"strategy": "exact"},
},
# Optional: domain-specific guidance shared by every judge field.
# Leave unset to use the built-in rubric on its own.
"judge_instructions": (
"For tags, ignore casing and punctuation. "
"Treat 'planned' and 'in progress' as semantically equivalent."
),
},
},
])

When a judge-configured array exceeds 50 items on either side, the evaluator falls back to similarity aggregation and surfaces a warning badge on the observation-details sheet so you can spot the bypass.
Cosine similarity floor
The embedding strategy maps cosine similarity to [0, 1] with a
calibrated floor (0.15, tuned for text-embedding-3-small): unrelated
string pairs are clipped to 0.0 instead of the previous ~0.55 from a
naive (cos + 1) / 2 remap, and pairs above the floor are linearly
rescaled so an exact match still lands at 1.0. Re-running an older
experiment with string-heavy schemas will show lower scores on
unrelated-string observations and a sharper gradient on near-matches.
When fine-grained semantic scoring matters more than embedding distance
(e.g. paraphrase detection with tight tolerances), configure those fields
with "strategy": "judge" instead.
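The remap described above follows directly from the numbers in this section (floor at 0.15, linear rescale so cos = 1.0 maps to 1.0). A sketch of the formula, not the evaluator's actual code:

```python
FLOOR = 0.15  # calibrated for text-embedding-3-small

def remap_similarity(cos: float) -> float:
    """Map cosine similarity to [0, 1] with a calibrated floor.

    Pairs at or below the floor are clipped to 0.0; pairs above it are
    linearly rescaled so an exact match (cos = 1.0) still scores 1.0.
    """
    if cos <= FLOOR:
        return 0.0
    return (cos - FLOOR) / (1.0 - FLOOR)
```

Compare with the naive (cos + 1) / 2 remap, which would give an unrelated pair at cos ≈ 0.1 a score near 0.55 rather than 0.0.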
Optimizers
| Optimizer | Description |
|---|---|
| prompticV2 | Promptic's default optimizer. Recommended for most tasks. |
| promptic | Original Promptic optimizer. |
| miproV2 | DSPy MIPROv2 optimizer. Good for few-shot learning. |
| bootstrapFewShot | DSPy bootstrap few-shot optimizer. |
| gepa | Genetic/evolutionary prompt optimization. |
Hyperparameters
Customize the optimization process:
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
optimizer="prompticV2",
hyperparameters={
"epochs": 5, # Number of optimization rounds
"trainSplitRatio": 0.8, # Train/eval split (see below)
"numFewShots": 3, # Few-shot examples in prompt
"enableCot": True, # Chain-of-thought reasoning
},
)

Train / eval split
Set trainSplitRatio (0.1–0.95) to hold out part of your observations as an
eval set. The optimizer trains on the train split only, then scores candidate
prompts against the held-out eval split each iteration. This guards against
overfitting on small datasets and surfaces prompts that generalize.
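Concretely, with 20 observations and trainSplitRatio 0.8, the optimizer would train on 16 and score candidates on the 4 held out. The exact rounding rule is an assumption in this sketch:

```python
def split_sizes(n_observations: int, train_split_ratio: float) -> tuple[int, int]:
    """Return (train, eval) sizes for a given ratio; rounding is assumed."""
    n_train = round(n_observations * train_split_ratio)
    return n_train, n_observations - n_train
```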
Each iteration then reports two scores:
| Field | Meaning |
|---|---|
| overallNormalizedScore | Score on the train split (used to guide the search). |
| evalNormalizedScore | Score on the held-out eval split. null if no split configured. |
get_best_iteration ranks iterations by evalNormalizedScore when a split is
configured, falling back to overallNormalizedScore otherwise.
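The selection rule amounts to picking a sort key. An illustrative sketch, not the SDK's internals:

```python
def best_iteration(iterations: list[dict], split_configured: bool) -> dict:
    """Pick the best iteration, preferring the held-out eval score."""
    key = "evalNormalizedScore" if split_configured else "overallNormalizedScore"
    return max(iterations, key=lambda it: it[key])
```

In practice the client does this for you via get_best_iteration.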
best = client.get_best_iteration(exp["id"])
print(f"Train score: {best['overallNormalizedScore']}")
print(f"Eval score: {best['evalNormalizedScore']}")  # None if no split

Omit trainSplitRatio (or set it to null) to train and score on the full dataset.
Providing an initial prompt
If you already have a prompt, provide it as a starting point:
exp = client.create_experiment(
ai_component_id=comp["id"],
target_model="gpt-4.1-nano",
task_type="classification",
initial_prompt="Classify the following email as spam or not_spam.",
)

The optimizer uses this as a baseline and tries to improve upon it.
Continuing from a previous experiment
Once an experiment has finished, you can clone it to keep iterating without rebuilding the dataset:
# Duplicate: same observations + evaluators, starts from the source's initial prompt.
clone = client.duplicate_experiment(exp["id"])
# Continue: same observations + evaluators, starts from the source's best
# optimized iteration. Useful for chaining optimization runs.
next_run = client.duplicate_experiment(exp["id"], continue_from_optimized=True)
client.start_experiment(next_run["id"])

The CLI exposes the same flow via promptic experiments duplicate <id> and promptic experiments continue <id> (add --start to launch immediately).
Next steps
Once your experiment completes, deploy the optimized prompt to use it in production.