Documentation

Everything you need to generate LLM training data, whether you're building your first fine-tuned model or scaling a research pipeline.

Introduction

StackAI generates synthetic training datasets for large language models. You describe what you want (a domain, a schema, a count) and the API returns a ready-to-use JSONL file in minutes. No labeling, no data collection, no privacy concerns.

🎓

Supervised Fine-Tuning

Teach a model to follow instructions, adopt a persona, or master a knowledge domain.

instruction_v1

⚖️

Preference Training

Generate chosen/rejected pairs for RLHF and DPO alignment training.

preference_v1

🧪

Evaluation Benchmarks

Build held-out test sets to measure model capability and track regressions.

eval_v1

🔀

Conflict (Alignment)

Multi-drive tension scenarios for alignment decision layers with resolution metadata.

conflict_v1

What Is Synthetic Data?

Synthetic training data is AI-generated text that mimics the examples a human would write for training, but produced at scale, in minutes, with full control over distribution and quality.

Why not just use real data? Real data is slow to collect, expensive to label, hard to balance across topics, and often contains PII. Synthetic data lets you generate exactly the distribution you need, including rare or adversarial cases that barely appear in real corpora.

Is it effective? Yes. Many production models use synthetic data for fine-tuning, alignment, and evaluation. State-of-the-art models like Claude and GPT-4 are trained with AI-assisted data generation as part of their pipeline. The key is quality control, which is why StackAI runs automatic checks on every job and offers optional LLM-as-judge scoring.

Beginner: How language models actually learn

A base LLM (like GPT-4 or Claude before fine-tuning) is trained to predict the next token in a document. It learns facts and language patterns, but it doesn't know how to behave. It won't reliably follow instructions, stay in character, or refuse harmful requests.

Fine-tuning on labeled examples teaches the model specific behaviors. Each training example shows the model: "when you see input X, produce output Y." After seeing thousands of these, the model learns the pattern.

instruction_v1 generates X→Y pairs for this. preference_v1 generates pairs where one Y is better than another, teaching the model to prefer better answers. eval_v1 generates test cases so you can measure whether your fine-tuned model actually improved.

Beginner: Why data quality matters more than quantity

"Garbage in, garbage out" is even more true for LLMs than traditional ML. A model trained on 1,000 high-quality, diverse examples often outperforms one trained on 10,000 repetitive or low-quality examples.

StackAI runs two quality checks on every job: diversity analysis (removing near-duplicate records that would cause the model to overfit) and format validation (ensuring every record is complete and well-formed). Add verified: true for an LLM-as-judge pass that scores relevance, accuracy, and completeness.

Quick Start

Generate your first dataset in under 2 minutes.

Step 1: Get your API key

Go to your dashboard → API Keys → Create Key. Copy the key; it won't be shown again.

Step 2: Create a generation job

terminal

curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "instruction_v1",
    "domain": "customer support for a SaaS product",
    "count": 50,
    "model": "balanced"
  }'

The API returns a job ID immediately. Generation runs asynchronously.

Step 3: Poll until complete, then download

poll-and-download.sh

# Poll job status
JOB_ID="syn_job_xxxxxxxxxxxxxxxxx"
curl https://api.stackai.app/v1/synthetic/jobs/$JOB_ID \
  -H "Authorization: Bearer YOUR_API_KEY"

# Once status is "succeeded", get a download URL
curl "https://api.stackai.app/v1/synthetic/jobs/$JOB_ID/results?format=url" \
  -H "Authorization: Bearer YOUR_API_KEY"
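
The curl call above checks the status once. A polling loop might look like the following minimal sketch. It assumes the job response is JSON with a "status" field, and that "failed" and "canceled" are terminal states; only "succeeded" is confirmed by this guide. The names fetch_job and wait_for_job are hypothetical helpers, not part of any official client.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.stackai.app/v1/synthetic"


def fetch_job(job_id: str, api_key: str) -> dict:
    """GET the job resource, mirroring the first curl call above."""
    req = urllib.request.Request(
        f"{BASE_URL}/jobs/{job_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)


def wait_for_job(job_id, api_key, fetch=fetch_job, interval=5.0,
                 max_polls=120, sleep=time.sleep):
    """Poll until the job succeeds, fails, or the poll budget runs out."""
    for _ in range(max_polls):
        job = fetch(job_id, api_key)
        status = job.get("status")  # assumed field name
        if status == "succeeded":
            return job
        if status in ("failed", "canceled"):  # assumed terminal states
            raise RuntimeError(f"Job {job_id} ended with status {status!r}")
        sleep(interval)
    raise TimeoutError(f"Job {job_id} did not finish after {max_polls} polls")
```

The fetch and sleep arguments are injectable so the loop can be exercised without network access.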

Beginner: What is an API key and how should I store it?

An API key is a secret token that identifies your account when you call the API programmatically. Think of it like a password: anyone who has it can make requests charged to your account.

Never commit API keys to git, paste them into Slack, or store them in localStorage. The safest approach is to store them as environment variables: export STACKAI_KEY="sk_...", then reference them in your code as process.env.STACKAI_KEY.
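
If you call the API from Python rather than Node, the same environment-variable pattern could be sketched as follows. The variable name STACKAI_KEY comes from the text above; load_api_key and auth_headers are hypothetical helper names used only for illustration.

```python
import os


def load_api_key(env_var: str = "STACKAI_KEY") -> str:
    """Read the API key from the environment and fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before calling the API")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the same headers the curl examples in this guide send."""
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

Failing at startup when the variable is missing is usually easier to debug than a 401 from the first request.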

Data Schemas

Choose the schema that matches your training goal. Each schema produces a different record structure optimized for a different type of model training.

instruction_v1

Supervised Fine-Tuning

Instruction-input-output triples for teaching a model to follow instructions, adopt a persona, or answer questions in a specific domain. The most common starting point for model customization.

instruction_v1 record

{
  "instruction": "Explain what a webhook is to a non-technical user",
  "input": "I keep seeing it mentioned in our integration docs",
  "output": "A webhook is like a doorbell for software...",
  "metadata": {},
  "provenance": { "job_id": "syn_job_abc123", "schema": "instruction_v1", ... }
}

Beginner: What is supervised fine-tuning (SFT)?

A base language model knows a lot (it has read the internet), but it doesn't know how to behave. SFT teaches it by showing it thousands of (instruction → correct response) examples. After training on your data, the model learns to respond in your desired style, persona, and domain.

Popular SFT frameworks: Hugging Face TRL, LLaMA-Factory, Axolotl. These all accept the instruction/input/output format that StackAI produces.

Rule of thumb: 500–5,000 high-quality examples is usually enough for a focused domain. More is only better if it's diverse, which is why StackAI automatically removes near-duplicates.

preference_v1

RLHF / DPO (Balanced, Diverse & Deep)

Prompt + chosen/rejected response pairs with scores and reasoning. Used to train a reward model or directly optimize with DPO. Available on balanced, diverse, and deep tiers. Both responses are scored by a critic pass; rejected answers are intentionally lower-quality but coherent.

preference_v1 record

{
  "prompt": "How should I handle authentication in a REST API?",
  "chosen": "Use JWT tokens with short expiry (15 min) stored in httpOnly cookies...",
  "rejected": "You can use JWT tokens and store them in localStorage for easy access...",
  "chosen_score": 9.0,
  "rejected_score": 4.5,
  "reasoning": "The chosen response correctly identifies the security implications...",
  "metadata": {},
  "provenance": { "schema": "preference_v1", ... }
}

Beginner: What are RLHF and DPO, and which should I use?

After SFT, a model follows instructions, but it may still produce mediocre answers. RLHF (Reinforcement Learning from Human Feedback) and DPO (Direct Preference Optimization) both use preference pairs to teach the model to prefer better responses over worse ones.

RLHF: Train a reward model on the preference pairs, then use PPO to optimize the LLM against the reward model. More complex, higher ceiling. Used by OpenAI and Anthropic for their flagship models.

DPO: Skip the reward model entirely. Directly train on the (prompt, chosen, rejected) triples. Simpler, faster, and often competitive with RLHF on focused domains. Recommended for most fine-tuning use cases.

For DPO, use the Hugging Face TRL DPOTrainer, which accepts the prompt/chosen/rejected format that StackAI produces.
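
Before handing the data to a trainer, you may want to strip each record down to the three fields DPO needs. The sketch below assumes the results file holds one preference_v1 record per line, with the top-level fields shown in the sample record above; load_preference_pairs and its min_margin parameter are hypothetical.

```python
import json


def load_preference_pairs(path: str, min_margin: float = 0.0) -> list[dict]:
    """Read a preference_v1 JSONL file and keep prompt/chosen/rejected only.

    Pairs whose score gap is below min_margin are skipped.
    """
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            rec = json.loads(line)
            margin = rec["chosen_score"] - rec["rejected_score"]
            if margin < min_margin:
                continue
            rows.append({
                "prompt": rec["prompt"],
                "chosen": rec["chosen"],
                "rejected": rec["rejected"],
            })
    return rows
```

The resulting list of dicts can then be wrapped in whatever dataset object your training framework expects.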

eval_v1

Benchmarking

Input/ideal-output pairs with evaluation metrics for building held-out test sets. Use these to measure model performance before and after fine-tuning, or to detect capability regressions during training runs.

eval_v1 record

{
  "input": "A user says their payment failed but their card was charged...",
  "ideal_output": "Apologize and assure the user we take this seriously...",
  "metrics": ["task_completion", "tone_appropriateness", "escalation_correct"],
  "metadata": {},
  "provenance": { "schema": "eval_v1", ... }
}

Beginner: Why you need held-out evaluation data

Never evaluate your model on data it was trained on. That's like testing a student with the exact exam questions they studied. You need a held-out test set: examples the model has never seen.

Use eval_v1 to build this test set before you generate your training data. Run it after each training checkpoint to catch regressions early. A model that improves on the training domain but degrades on the eval set is overfitting.

Combine with StackAI's hard negatives to generate adversarial eval cases that probe failure modes specifically.

conflict_v1

Alignment (Balanced, Diverse & Deep)

Multi-drive tension scenarios for alignment decision layers. Each record models tension between paired opposing drives and includes resolution metadata. Use these to train models that can reason about competing objectives (safety vs. autonomy, honesty vs. helpfulness) and make principled tradeoffs with explicit confidence scores and override conditions.

conflict_v1 record

{
  "record_id": "rec_abc123",
  "schema": "conflict_v1",
  "data": {
    "input": "A user asks the AI to help write a persuasive essay arguing against vaccinations...",
    "axis": "safety_autonomy",
    "tension_type": "educational_vs_safety",
    "responses": [
      { "drive": "safety", "output": "I should decline this request because...", "viable": true },
      { "drive": "autonomy", "output": "I should help with the assignment since...", "viable": true }
    ],
    "resolution": {
      "preferred_drive": "autonomy",
      "losing_drive": "safety",
      "confidence": 0.72,
      "override_condition": "If the essay will be published publicly rather than...",
      "reasoning": "The educational context provides sufficient safeguards..."
    }
  }
}

Beginner: What is multi-drive alignment and why does it matter?

Real-world AI alignment is not binary (safe vs. unsafe). Models face constant tension between competing drives: being helpful vs. being safe, being honest vs. being kind, following instructions vs. refusing harmful requests.

conflict_v1 generates structured scenarios where two valid drives are in tension. Each record includes both sides of the argument, a resolution with a confidence score, and an explicit override condition describing when the opposite resolution would be correct. This teaches models to reason about tradeoffs rather than applying rigid rules.

The confidence field (0.0 to 1.0) reflects how close the decision is. Values near 0.5 indicate genuinely ambiguous cases where reasonable people would disagree. Training on a range of confidence levels teaches the model calibrated uncertainty rather than false certainty.

Request Parameters

All parameters for POST /v1/synthetic/generate. Fields marked * are required.

Top-level fields

Parameter	Type	Description
`schema`*	string	Schema type: "instruction_v1", "preference_v1", "eval_v1", or "conflict_v1". Note: preference_v1 and conflict_v1 require Balanced or higher and have separate pricing ($4/$10/$30 per 1K).
`domain`*	string	Plain-English description of the subject area (e.g., "customer support for fintech"). Richer descriptions produce more focused data.
`count`*	integer	Number of records to generate. Must be a positive integer. Billed per accepted record.
`model`	string	Model shorthand: "fast", "balanced", "diverse", or "deep". Legacy aliases "economy", "standard", "premium" still accepted. Required unless using the verbose models object.
`models`	object	Explicit { generator, critic } objects each with provider and model. Alternative to the model shorthand.
`system_prompt`	string	Custom instructions appended to the built-in prompt (max 4,000 chars). Describe your AI's persona, tone, or constraints.
`verified`	boolean	Enable LLM-as-judge quality scoring. Each record is scored on relevance, accuracy, and completeness (1–10). Records below 6.0 are rejected. +$5/1K PAYG or 1.5× quota on subscriptions.
`advanced`	object	Advanced configuration: categories, coverage, invariants, response_policy, hard_negatives, output_split, and more. See Advanced Configuration.
`conflict_config`	object	Required for conflict_v1 schema. Configures drive pairs, axis distribution, resolution mode, and vocabulary overrides. See conflict_config object below.
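`constraints`	object	Optional generation constraints. The constraints object below documents language, tone, and difficulty; response length limits (min_tokens, max_tokens) are documented under advanced.response_constraints. One token is roughly 0.75 words.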
`license`	string	License to embed in the manifest (e.g., "cc-by-4.0", "mit", "proprietary").

constraints object

Optional nested object for controlling the style and language of generated content.

Field	Type	Description
`language`	string	ISO 639-1 language code (e.g., "en", "es", "fr", "zh"). Generates content in the specified language.
`tone`	string	Writing tone (e.g., "formal", "casual", "technical", "empathetic").
`difficulty`	string	Content complexity (e.g., "beginner", "intermediate", "expert").

conflict_config object

Required when schema is "conflict_v1". Configures the drive pairs, axis distribution, resolution behavior, and optional vocabulary overrides for multi-drive tension scenarios.

Field	Type	Description
`drive_pairs`*	array	Array of drive pair objects. Each pair has axis_name (string), optional description, and drives (tuple of exactly 2 objects with name and description). Example: [{ "axis_name": "safety_autonomy", "drives": [{ "name": "safety", "description": "Prioritize user protection" }, { "name": "autonomy", "description": "Respect user agency" }] }]. 1-20 pairs allowed.
`axis_distribution`	object	Optional. Percentage of records per axis (e.g., { "safety_autonomy": 60, "honesty_helpfulness": 40 }). Must sum to approximately 100%. If omitted, records are distributed evenly across drive pairs.
`resolution_mode`	string	Resolution annotation mode. "annotated" (default): full resolution with preferred_drive, confidence, override_condition, and reasoning. "graded": adds numeric scoring. "none": no resolution block in output.
`override_vocabulary`	array	Optional. Array of structured label strings for override conditions (e.g., ["safety_critical", "user_explicit_consent", "legal_requirement"]). Max 50 items.

conflict_config example

{
  "schema": "conflict_v1",
  "domain": "AI assistant alignment decisions",
  "count": 200,
  "model": "deep",
  "conflict_config": {
    "drive_pairs": [
      {
        "axis_name": "safety_autonomy",
        "drives": [
          { "name": "safety", "description": "Prioritize user protection..." },
          { "name": "autonomy", "description": "Respect user agency..." }
        ]
      },
      {
        "axis_name": "honesty_helpfulness",
        "drives": [
          { "name": "honesty", "description": "Provide accurate info..." },
          { "name": "helpfulness", "description": "Maximize usefulness..." }
        ]
      }
    ],
    "axis_distribution": { "safety_autonomy": 60, "honesty_helpfulness": 40 },
    "resolution_mode": "annotated"
  }
}

models object (verbose form)

Use this instead of the model shorthand when you need exact provider/model control.

verbose models object

{
  "schema": "preference_v1",
  "domain": "code review best practices",
  "count": 100,
  "models": {
    "generator": { "provider": "anthropic", "model": "claude-sonnet-4-6" },
    "critic":    { "provider": "openai", "model": "gpt-4o-mini" }
  }
}

The critic (used for preference_v1 scoring) defaults to GPT-4o Mini on all tiers for cost efficiency. You can override it via the models.critic object.

Advanced Configuration

Most advanced options are free. Pass them inside the advanced object. Omit the object entirely for simple jobs; nothing changes. The options with a cost are invariants (~$0.04/1K records for the LLM checker) and llm_assessed metadata fields (+$5/1K, waived when verified: true is enabled).

Model shorthand

Shorthand	Provider	Model	Price/1K
"fast"	OpenAI	gpt-4o-mini	$0.50
"balanced"	OpenAI	gpt-4.1-mini	$3.00
"diverse"	OpenAI	gpt-5.4-mini	$8.00
"deep"	Anthropic	claude-sonnet-4-6	$25.00

Legacy aliases "economy", "standard", and "premium" are still accepted and map to fast, balanced, and deep respectively. Note: deep is pay-as-you-go only and not included in subscription plans.

preference_v1 is available on balanced, diverse, and deep tiers. At least one of model or models is required.

Benchmark comparison

Tier	Diversity %	Avg Output Depth	Near-Dup Rate
fast	80%	~612 chars avg	~20%
balanced	94%	~600 chars avg	~6%
diverse	98%	~1,085 chars avg	~2%
deep	82%	~1,955 chars avg	~18%

Benchmarks measured on 1,000-record instruction_v1 jobs across 10 domains. Diversity % is the ratio of unique trigram sets after deduplication. Near-dup rate is the percentage of generated records removed by the diversity check (Jaccard > 0.7).

system_prompt

Append custom instructions to the built-in prompt. Maximum 4,000 characters. Your prompt is never allowed to replace or override the core format/safety instructions; it's added as "Additional Instructions from the User."

system_prompt example

{
  "schema": "instruction_v1",
  "domain": "mental health peer support",
  "count": 200,
  "model": "deep",
  "system_prompt": "You are a compassionate peer support assistant... Always validate feelings before offering advice. Never diagnose..."
}

advanced.categories

Distribute records across named categories by percentage. The worker allocates records using the largest-remainder method (no rounding errors). Each category's description is injected into the generation prompt, enriching diversity within the category.
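
To see why largest-remainder allocation avoids rounding drift, here is a minimal sketch of the method. It is an illustration, not the worker's actual code. The tie-breaking rule (earlier categories win) is an assumption, and allocate_largest_remainder is a hypothetical name.

```python
def allocate_largest_remainder(total: int, percentages: dict[str, int]) -> dict[str, int]:
    """Split `total` records across categories so the counts sum exactly to `total`."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts, remainders = {}, {}
    for name, pct in percentages.items():
        # Integer quota plus the leftover fraction (in hundredths of a record).
        counts[name], remainders[name] = divmod(total * pct, 100)
    leftover = total - sum(counts.values())
    # Give the unassigned records to the categories with the largest remainders.
    # sorted() is stable, so ties go to the category listed first (assumed rule).
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


print(allocate_largest_remainder(50, {"billing": 25, "account_access": 25, "bugs": 50}))
# {'billing': 13, 'account_access': 12, 'bugs': 25}
```

Naive rounding of 12.5 + 12.5 + 25 could yield 49 or 51 records; this method always returns exactly the requested count.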

advanced.response_policy & style_rules

Define behavioral rules and writing style injected into the system prompt verbatim. Keys in response_policy become labeled policy sections.

response policy

"advanced": {
  "response_policy": {
    "safe_allowed":      "Provide accurate, helpful information...",
    "unsafe_disallowed": "Never provide medical diagnoses...",
    "unclear_intent":    "Ask a clarifying question rather than assuming the worst"
  },
  "style_rules": [
    "Keep responses under 3 sentences when possible",
    "Use active voice"
  ]
}

advanced.hard_negatives

Generate adversarial examples designed to probe failure modes. Records are tagged with metadata.hard_negative: true and metadata.technique: "technique_name".

instruction_v1: Generates adversarial inputs: instructions designed to elicit harmful or incorrect outputs. The generated response shows the correct, safe handling. Use for safety and alignment training.

preference_v1: Generates hard negative responses: the rejected answer is plausible and well-written but contains subtle flaws (wrong facts, unsafe advice, logical errors). Essential for RLHF training where the model needs to distinguish good from subtly bad.

eval_v1: Generates adversarial test cases: trick questions, false premises, and edge cases designed to reveal model weaknesses.

Available techniques

educational_framing: Frames harmful requests as academic study

fictional_framing: Requests harmful info within a story context

hypothetical_framing: Uses 'what if' to lower the model's guard

authority_appeal: Claims professional authority to bypass refusals

emotional_manipulation: Uses urgency or distress to extract compliance

gradual_escalation: Starts safe and escalates incrementally

already_know_disclaimer: Claims prior knowledge to skip safety checks

misleading_context: Provides false context to justify unsafe requests

ambiguous_phrasing: Uses deliberate ambiguity to obtain harmful content

edge_cases: Probes boundary conditions and unusual inputs

factual_errors: Introduces false premises requiring correction

logical_fallacies: Tests whether the model accepts flawed reasoning

contradictions: Internal contradictions that confuse model behavior

incomplete_information: Leaves out critical context to induce errors

hard negatives

"advanced": {
  "hard_negatives": {
    "enabled": true,
    "percentage": 25,
    "techniques": [
      "educational_framing",
      "authority_appeal",
      "misleading_context",
      { "name": "competitor_framing", "description": "Claims a competitor's AI does this freely" }
    ]
  }
}

Techniques can be built-in names (strings) or custom objects with name and description. Hard negatives work with or without categories. The percentage field accepts values from 1 to 100 (default: 20).

Beginner: Why hard negatives make models more robust

A model trained only on clean, benign examples is brittle. It's never seen adversarial inputs, so it's easily fooled by even simple jailbreak attempts or misleading context.

Hard negatives are used extensively in safety research. Anthropic's Constitutional AI and RLHF pipelines both involve training on adversarially generated preference data. Adding 20–30% hard negatives to your SFT dataset typically improves robustness without degrading normal performance.

advanced.response_constraints

Control the length of generated responses. Both fields are optional; omit to use the model's natural response length.

response constraints

"advanced": {
  "response_constraints": {
    "min_tokens": 100,
    "max_tokens": 300
  }
}

Range: 1–10,000 tokens. Token counts are approximate (1 token ≈ 0.75 words).
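
If you think in words rather than tokens, the rule of thumb above can be turned into a quick conversion. This is a rough sketch only: real tokenizers vary by model and language, and both function names are hypothetical.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from this guide, not an exact ratio


def tokens_to_words(tokens: int) -> int:
    """Approximate word count for a token budget."""
    return round(tokens * WORDS_PER_TOKEN)


def words_to_tokens(words: int) -> int:
    """Approximate token budget needed for a target word count (rounded up)."""
    return math.ceil(words / WORDS_PER_TOKEN)


print(tokens_to_words(100), tokens_to_words(300))  # 75 225
```

Under this approximation, the 100 to 300 token range in the example above corresponds to roughly 75 to 225 words.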

advanced.output_split

Automatically shuffle and split results into named files. Useful for generating train/validation sets in a single API call. Shuffling is deterministic (seeded) for reproducibility.

output split

"advanced": {
  "output_split": { "train": 80, "validation": 15, "test": 5 }
}

Percentages must sum to exactly 100. You can define 2 to 5 named splits. Each value must be a positive integer. Download split files via GET /v1/synthetic/jobs/:jobId/splits.
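
For intuition, or to reproduce a similar split locally on a file you already downloaded, a seeded shuffle-and-split could be sketched like this. The seed value and the choice to give rounding leftovers to the last split are assumptions; the service's own seeding and rounding are not documented here. split_records is a hypothetical name.

```python
import random


def split_records(records, split_pcts: dict[str, int], seed: int = 0) -> dict[str, list]:
    """Shuffle deterministically, then cut into named splits by percentage."""
    if sum(split_pcts.values()) != 100:
        raise ValueError("split percentages must sum to exactly 100")
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # same seed gives the same order
    names = list(split_pcts)
    out, start = {}, 0
    for i, name in enumerate(names):
        if i == len(names) - 1:
            end = len(shuffled)  # last split absorbs any rounding leftover
        else:
            end = start + len(shuffled) * split_pcts[name] // 100
        out[name] = shuffled[start:end]
        start = end
    return out
```

Because the shuffle uses its own random.Random instance, the result does not depend on any other random state in your program.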

Beginner: Train/validation/test split best practices

Training set (70–80%): Used to update model weights. More is better.

Validation set (10–20%): Used to tune hyperparameters and decide when to stop training. Should never be seen by the model during training.

Test set (5–10%): Held out completely until final evaluation. Reports the true generalization performance of your final model. Touch it only once; if you use it to make decisions, it becomes a validation set.

Generating all three from the same API call ensures they're drawn from the same distribution, which matters for clean evaluation.

advanced.metadata_fields

Attach custom metadata to every record. All types except llm_assessed are free post-generation operations (no LLM calls).

type	Description	Options
auto_increment	Sequential integers (1, 2, 3…)	prefix: optional string prefix (e.g., "cs_")
from_category	Category name the record was generated for	None
uuid	Random UUID v4 per record	None
constant	Same fixed value on every record	value: required string/number/bool
llm_assessed	LLM-scored field (1–10 scale)	+$5/1K surcharge (waived if verified: true is already enabled, since it uses the same judge pass)

metadata fields

"advanced": {
  "metadata_fields": [
    { "name": "id",       "type": "auto_increment", "prefix": "cs_" },
    { "name": "category", "type": "from_category" },
    { "name": "run_id",   "type": "constant", "value": "march-2026-v1" },
    { "name": "uid",      "type": "uuid" }
  ]
}

advanced.invariants

Define rules that every generated record must satisfy. The worker runs a lightweight LLM check (GPT-4o Mini) on each record against your rules after generation. Two enforcement modes control what happens when a record violates a rule.

strict: The record is rejected outright if it violates the rule. It will not appear in your results. The rejection is counted in the quality report under invariant_violated.

soft: The record is kept but tagged with metadata.invariant_soft_flags listing which soft rules it violated. You can filter these out yourself during training if needed.

invariants

"advanced": {
  "invariants": [
    { "rule": "Responses must not contain personal medical advice", "enforcement": "strict" },
    { "rule": "All outputs should include a disclaimer when discussing financial topics", "enforcement": "soft" }
  ]
}

You can define 1 to 10 rules. Each rule text must be 10 to 500 characters. Invariant checking uses GPT-4o Mini and adds minimal cost (~$0.04 per 1K records). If the checker fails on a given record, the record passes through (fail-open design).
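
Since soft-flagged records stay in your results, filtering them is up to you. A minimal sketch, assuming metadata.invariant_soft_flags is a list of strings identifying the violated rules (the exact value format is not specified in this guide); drop_soft_flagged is a hypothetical helper.

```python
def drop_soft_flagged(records, ignore=()):
    """Keep records with no soft-invariant flags, other than flags listed in `ignore`."""
    kept = []
    for rec in records:
        # Missing metadata or a missing flag list both mean "no soft violations".
        flags = rec.get("metadata", {}).get("invariant_soft_flags") or []
        if any(flag not in ignore for flag in flags):
            continue
        kept.append(rec)
    return kept
```

Passing a set of flags to ignore lets you tolerate some soft rules while still dropping records that violate others.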

Beginner: When to use invariants vs. response_policy

response_policy tells the generator what to do and what to avoid. It is a best-effort instruction injected into the LLM prompt. Models usually follow it, but there is no enforcement after generation.

invariants are verified after generation by a separate LLM pass. A strict invariant rejects every record the checker flags as violating the rule. Because the checker is fail-open, a record can still pass through if the check itself fails on it. Use both together for defense-in-depth: the policy steers generation, and the invariant catches most of what slips through.

advanced.coverage

Ensure systematic coverage across multiple dimensions of variation. Instead of random sampling, coverage mode builds a structured grid and distributes records across it. Mutually exclusive with categories.

all_combinations: Computes the cross-product of all dimension values. In the example below, 3 difficulties x 3 topics = 9 cells. Records are distributed evenly across cells, with at least min_per_cell records in each. Maximum 100 cells.

each_value: Each value in each dimension appears at least once, but the full cross-product is not required. Use this when you have many dimensions and the combinatorial explosion would be too large.

coverage

"advanced": {
  "coverage": {
    "dimensions": [
      { "name": "difficulty", "values": ["easy", "medium", "hard"] },
      { "name": "topic", "values": ["safety", "privacy", "fairness"] }
    ],
    "mode": "all_combinations",
    "min_per_cell": 2
  }
}

1 to 5 dimensions allowed, each with 1 to 20 values. No additional cost; coverage is implemented through prompt enrichment. Each record's metadata includes the assigned coverage cell values (e.g., metadata.coverage_difficulty: "hard").
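
To check a grid's size before submitting a job, the all_combinations arithmetic can be sketched locally. This illustrates the cross-product and an even distribution; how the worker assigns leftover records to cells is an assumption, and build_grid and distribute_evenly are hypothetical names.

```python
from itertools import product

MAX_CELLS = 100  # limit stated in this guide for all_combinations


def build_grid(dimensions: list[dict]) -> list[dict]:
    """Return one dict per cell in the cross-product of all dimension values."""
    names = [d["name"] for d in dimensions]
    cells = [dict(zip(names, combo)) for combo in product(*(d["values"] for d in dimensions))]
    if len(cells) > MAX_CELLS:
        raise ValueError(f"{len(cells)} cells exceeds the {MAX_CELLS}-cell limit")
    return cells


def distribute_evenly(count: int, n_cells: int, min_per_cell: int = 1) -> list[int]:
    """Records per cell, as even as possible; earlier cells take any leftover (assumed)."""
    if count < n_cells * min_per_cell:
        raise ValueError("count is too small to satisfy min_per_cell in every cell")
    base, extra = divmod(count, n_cells)
    return [base + (1 if i < extra else 0) for i in range(n_cells)]
```

With the 3 x 3 example above, build_grid returns 9 cells, and a 20-record job with min_per_cell of 2 would put 3 records in two cells and 2 in the rest.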

Beginner: Coverage vs. categories: which to use

Categories give you a single flat dimension (e.g., 25% billing, 25% account access, 50% bugs). Good when you want manual control over one axis of variation.

Coverage gives you multi-dimensional grids. If you need every combination of (difficulty x topic x persona) to appear in your training set, coverage is the right tool. The worker automatically computes the grid, distributes records, and enriches prompts with the cell context.

You cannot use both at the same time. If you pass both categories and coverage, the API returns a 400 error.

Full advanced example (with categories + invariants)

advanced-full.sh

curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "instruction_v1",
    "domain": "AI safety, non-harm alignment training",
    "count": 500,
    "model": "deep",
    "system_prompt": "You are a safety-focused AI assistant...",
    "verified": true,
    "advanced": {
      "categories": {
        "safety_critical": { "percentage": 30 },
        "social_conflict":  { "percentage": 25 },
        "misinformation":   { "percentage": 25 },
        "benign":           { "percentage": 20 }
      },
      "invariants": [
        { "rule": "Responses must never encourage self-harm or violence", "enforcement": "strict" },
        { "rule": "Responses should acknowledge uncertainty when not clear-cut", "enforcement": "soft" }
      ],
      "hard_negatives": { "enabled": true, "percentage": 20 },
      "output_split":   { "train": 80, "validation": 20 },
      "metadata_fields": [
        { "name": "id",       "type": "auto_increment", "prefix": "nh_" },
        { "name": "category", "type": "from_category" }
      ]
    }
  }'

Full advanced example (with coverage)

advanced-coverage.sh

curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "eval_v1",
    "domain": "AI fairness evaluation",
    "count": 200, "model": "diverse",
    "advanced": {
      "coverage": {
        "dimensions": [
          { "name": "difficulty", "values": ["easy", "medium", "hard"] },
          { "name": "topic", "values": ["safety", "privacy", "fairness", "transparency"] },
          { "name": "persona", "values": ["expert", "beginner"] }
        ],
        "mode": "all_combinations",
        "min_per_cell": 2
      },
      "invariants": [
        { "rule": "Test cases must have exactly one correct answer", "enforcement": "strict" }
      ],
      "output_split": { "train": 80, "test": 20 }
    }
  }'

The coverage example above creates a 3 x 4 x 2 = 24-cell grid with at least 2 records per cell. Note that coverage replaces categories; you cannot use both in the same request.

Quality System

Every job runs automatic quality checks. Quality never blocks delivery; if checks fail, the data ships with the report attached.

Phase 1: Automatic checks (always on, free)

Format compliance

Validates field lengths, detects prompt leakage (model confusing instructions with output), truncation, and copy-paste errors.

format_invalid

Diversity (deduplication)

Computes trigram Jaccard similarity between all record pairs. Records with similarity > 0.7 are flagged as near-duplicates and removed.

near_duplicate
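
The diversity check can be approximated locally to preview how much of a dataset might be flagged. The sketch below uses word-level trigrams and a greedy first-seen-wins pass. Both are assumptions, since this guide does not say whether trigrams are word- or character-based or which record of a pair is removed.

```python
def trigrams(text: str) -> set:
    """Set of consecutive word triples, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def jaccard(a: set, b: set) -> float:
    """Intersection over union; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def drop_near_duplicates(texts: list[str], threshold: float = 0.7) -> list[str]:
    """Keep each text unless it is too similar to one already kept."""
    kept, kept_sets = [], []
    for text in texts:
        grams = trigrams(text)
        if any(jaccard(grams, seen) > threshold for seen in kept_sets):
            continue  # near-duplicate of an earlier record
        kept.append(text)
        kept_sets.append(grams)
    return kept
```

All-pairs comparison is quadratic in the number of records, which is fine for a few thousand records but slow beyond that.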

Preference checks

For preference_v1 only: validates that chosen/rejected scores differ by ≥ 2 points and that responses aren't too similar in wording.

low_margin / too_similar

Grading rubric

Grade	Pass Rate	Diversity Score	Verified: Mean Score
A	≥ 95%	≥ 0.80	≥ 8.0 / 10
B	≥ 85%	≥ 0.60	≥ 7.0 / 10
C	≥ 70%	≥ 0.40	≥ 5.0 / 10
D	Below C	Below C	Below C
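
The rubric does not spell out how the columns combine. One plausible reading, assumed in this sketch, is that a job earns the highest grade whose thresholds it meets in every applicable column, with the mean-score column applying only to verified jobs. overall_grade is a hypothetical function, not the service's implementation.

```python
# (grade, min pass rate, min diversity score, min mean judge score)
RUBRIC = [
    ("A", 0.95, 0.80, 8.0),
    ("B", 0.85, 0.60, 7.0),
    ("C", 0.70, 0.40, 5.0),
]


def overall_grade(pass_rate: float, diversity: float, mean_judge_score=None) -> str:
    """Highest grade whose thresholds are all met (assumed combination rule)."""
    for grade, min_pass, min_div, min_score in RUBRIC:
        if pass_rate < min_pass or diversity < min_div:
            continue
        # The judge-score column only applies when verified scoring was run.
        if mean_judge_score is not None and mean_judge_score < min_score:
            continue
        return grade
    return "D"
```

Under this reading, a single weak column caps the grade even when the other columns are strong.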

Phase 2: Verified Quality (+$5/1K records)

Add verified: true to enable an LLM-as-judge second pass (GPT-4o Mini). Each record is scored 1–10 on relevance, accuracy, and completeness. Records below 6.0 are rejected. Per-record scores are included in the results JSONL under provenance.quality.

Pricing: +$5.00 per 1,000 records (PAYG) or 1.5x credit consumption (subscriptions). For example, a 1,000-record Fast job normally costs $0.50; with verified quality it costs $5.50.
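
The pay-as-you-go arithmetic can be sketched as a small estimator. It assumes the per-1K prices from the model shorthand table and covers only the base price and the verified surcharge. preference_v1 and conflict_v1 are priced separately, and subscription quotas work differently. estimate_payg_cost is a hypothetical helper.

```python
# Per-1K prices from the model shorthand table (instruction_v1 / eval_v1 assumed).
PRICE_PER_1K = {"fast": 0.50, "balanced": 3.00, "diverse": 8.00, "deep": 25.00}
VERIFIED_SURCHARGE_PER_1K = 5.00


def estimate_payg_cost(accepted_records: int, tier: str, verified: bool = False) -> float:
    """Estimated pay-as-you-go cost in USD, billed per accepted record."""
    rate = PRICE_PER_1K[tier] + (VERIFIED_SURCHARGE_PER_1K if verified else 0.0)
    return round(accepted_records / 1000 * rate, 2)


print(estimate_payg_cost(1000, "fast"))                 # 0.5
print(estimate_payg_cost(1000, "fast", verified=True))  # 5.5
```

The two printed values match the Fast-tier example in the pricing paragraph above.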

verified job

curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "instruction_v1",
    "domain": "legal contract review",
    "count": 200,
    "model": "deep",
    "verified": true
  }'

Quality report

manifest.json (quality excerpt)

{
  "quality_report": {
    "version": "1.0",
    "phase": 2,
    "total_generated": 215,
    "total_accepted": 198,
    "total_rejected": 17,
    "pass_rate": 0.921,
    "diversity_score": 0.87,
    "mean_judge_score": 8.3,
    "overall_grade": "B",
    "rejection_breakdown": {
      "format_invalid": 4,
      "near_duplicate": 7,
      "judge_rejected": 6
    }
  }
}

Beginner: LLM-as-judge evaluation: what it is and why it works

Instead of relying only on rule-based checks, StackAI uses a separate LLM (GPT-4o Mini) to read each generated record and score it on three dimensions:

Relevance: Does the response answer the actual question?
Accuracy: Is the information factually correct?
Completeness: Does it cover the key points?

Research by Zheng et al. (MT-Bench, 2023) showed that GPT-4 judgments correlate strongly with human preferences. Using a judge model that's different from the generator reduces bias from the generator's "style matching" tendencies.

Use Cases

Real-world patterns for common training scenarios.

🛡️ AI safety & alignment training

Generate instruction data with categories covering unsafe inputs and hard negatives for robustness. Use verified: true for highest data quality.

ai-safety.sh

{
  "schema": "instruction_v1",
  "domain": "AI assistant safety and alignment",
  "count": 1000, "model": "deep", "verified": true,
  "advanced": {
    "categories": {
      "harmful_request_refusal": { "percentage": 35 },
      "misinformation_correction": { "percentage": 25 },
      "privacy_protection": { "percentage": 20 },
      "benign_helpfulness": { "percentage": 20 }
    },
    "hard_negatives": { "enabled": true, "percentage": 30 },
    "output_split": { "train": 80, "validation": 20 }
  }
}
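Before submitting a body like the one above, it can help to confirm that the percentage blocks add up. The sketch below assumes that categories and output_split are each expected to total 100, which every example on this page satisfies but which is not stated as a rule; the helper name is hypothetical.

```python
def percentages_total(block):
    """Sum percentages from {"name": {"percentage": n}} or {"name": n} mappings."""
    total = 0
    for value in block.values():
        total += value["percentage"] if isinstance(value, dict) else value
    return total


# The "advanced" block from the example above.
advanced = {
    "categories": {
        "harmful_request_refusal": {"percentage": 35},
        "misinformation_correction": {"percentage": 25},
        "privacy_protection": {"percentage": 20},
        "benign_helpfulness": {"percentage": 20},
    },
    "output_split": {"train": 80, "validation": 20},
}

for name in ("categories", "output_split"):
    total = percentages_total(advanced[name])
    status = "ok" if total == 100 else "check percentages"
    print(f"{name}: {total} ({status})")
```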

🔀 Multi-drive alignment conflict training

Generate structured tension scenarios where competing drives (safety vs. autonomy, honesty vs. helpfulness) must be resolved with confidence scores and override conditions. Available on balanced, diverse, and deep tiers.

conflict-training.sh

{
  "schema": "conflict_v1",
  "domain": "AI assistant alignment decisions in sensitive contexts",
  "count": 500, "model": "deep", "verified": true,
  "conflict_config": {
    "drive_pairs": [
      { "a": "safety", "b": "autonomy" },
      { "a": "honesty", "b": "helpfulness" },
      { "a": "privacy", "b": "transparency" }
    ],
    "axis_distribution": { "safety_autonomy": 40, "honesty_helpfulness": 35, "privacy_transparency": 25 },
    "resolution_mode": "annotated"
  },
  "advanced": {
    "output_split": { "train": 80, "validation": 20 }
  }
}

💬 Domain-specific chatbot fine-tuning

Build a customer support or specialist assistant. Use system_prompt to define the persona and response_policy to encode business rules.

chatbot-finetune.sh

{
  "schema": "instruction_v1",
  "domain": "technical support for cloud infrastructure",
  "count": 500, "model": "balanced",
  "system_prompt": "You are Aria, a friendly cloud support engineer at CloudCo...",
  "constraints": { "tone": "friendly", "difficulty": "mixed" },
  "advanced": {
    "categories": {
      "billing_and_costs":   { "percentage": 20 },
      "networking":          { "percentage": 25 },
      "compute_and_scaling": { "percentage": 30 },
      "storage":             { "percentage": 25 }
    }
  }
}

⚖️ RLHF / DPO preference training

Generate chosen/rejected pairs for DPO training. Available on balanced, diverse, and deep tiers. Use verified: true; quality matters most for preference data since bad preference pairs teach the wrong direction.

dpo-training.sh

{
  "schema": "preference_v1",
  "domain": "writing assistance and editing",
  "count": 300, "model": "deep", "verified": true,
  "advanced": {
    "hard_negatives": { "enabled": true, "percentage": 20 },
    "output_split": { "train": 90, "validation": 10 }
  }
}

🌍 Multi-language training data

Generate data in any language using the constraints.language field. Submit multiple jobs (one per language) for balanced multilingual training sets.

multilingual.sh

# Spanish dataset
{ "schema": "instruction_v1", "domain": "e-commerce customer support",
  "count": 200, "model": "balanced",
  "constraints": { "language": "es" },
  "advanced": { "metadata_fields": [{ "name": "lang", "type": "constant", "value": "es" }] }
}

# French dataset
{ "constraints": { "language": "fr" }, "advanced": { "metadata_fields": [{ "name": "lang", "type": "constant", "value": "fr" }] } }
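One way to produce the per-language request bodies is to copy a shared base and fill in the language fields. This is a sketch only: the function name is hypothetical, and it builds the bodies without sending them.

```python
import copy


def build_language_jobs(base, languages):
    """Return one request body per language code, leaving `base` untouched."""
    jobs = []
    for lang in languages:
        body = copy.deepcopy(base)
        body.setdefault("constraints", {})["language"] = lang
        # Tag every record with its language so the datasets can be merged later.
        body.setdefault("advanced", {})["metadata_fields"] = [
            {"name": "lang", "type": "constant", "value": lang}
        ]
        jobs.append(body)
    return jobs


base_job = {
    "schema": "instruction_v1",
    "domain": "e-commerce customer support",
    "count": 200,
    "model": "balanced",
}
jobs = build_language_jobs(base_job, ["es", "fr"])
print([job["constraints"]["language"] for job in jobs])
```

Each body in jobs can then be posted to /v1/synthetic/generate as its own job.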

Alignment Research

StackAI was built with alignment research as a first-class use case. This section explains how the platform's four schemas and advanced primitives map onto standard alignment training pipelines (RLHF/DPO, decision-layer training, safety-tuned SFT, policy adjudication).

The alignment data stack

A typical alignment pipeline needs four kinds of data, each produced by a different StackAI schema:

SFT base — instruction_v1 with response_policy and invariants to encode the behavior you want the base model to internalize before preference tuning.
Preference tuning (DPO/RLHF) — preference_v1 with hard_negatives.enabled so the rejected side contains plausible-but-flawed answers rather than obvious rejects. Trains the model to distinguish genuinely better responses from adjacent alternatives.
Decision-layer training — conflict_v1 with resolution_mode: "annotated" or "graded". Trains an adjudicator to pick between competing drives under specified override conditions.
Eval harness — eval_v1 for benchmark datasets that measure whether the tuned model actually behaves the way the training data intended.

Adversarial coverage via hard_negatives

The hard_negatives feature covers 14 adversarial techniques: educational framing, authority appeal, fictional framing, gradual escalation, already-know disclaimer, misleading context, emotional manipulation, factual errors, ambiguous phrasing, incomplete information, hypothetical framing, edge cases, logical fallacies, contradictions. You can override any with custom technique definitions.

Records are tagged with metadata.hard_negative: true and metadata.technique, so you can filter by technique during training, run per-technique eval, or compute robustness scores per attack category.
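Per-technique filtering could be sketched as below. The metadata keys come from the paragraph above; the technique strings in the sample records are hypothetical placeholders, since the exact values are not listed here.

```python
from collections import Counter


def technique_counts(records):
    """Count hard-negative records per metadata.technique value."""
    return Counter(
        record["metadata"]["technique"]
        for record in records
        if record.get("metadata", {}).get("hard_negative")
    )


# Hypothetical records; real technique identifiers may be spelled differently.
sample_records = [
    {"metadata": {"hard_negative": True, "technique": "authority_appeal"}},
    {"metadata": {"hard_negative": True, "technique": "fictional_framing"}},
    {"metadata": {"hard_negative": True, "technique": "authority_appeal"}},
    {"metadata": {"hard_negative": False}},
]
print(technique_counts(sample_records))
```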

Policy enforcement via invariants

Declare up to 10 invariants per job. Each rule is injected into the generator system prompt and verified after generation by a second LLM (batched GPT-4o Mini, ~$0.04/1K records). Strict violations reject records; soft violations tag them with metadata.invariant_soft_flags for downstream filtering or analysis.

Example strict invariant: "The assistant must never provide specific medication dosages." Example soft invariant: "The assistant should cite a source when stating medical claims." The quality report includes an invariant_compliance section with per-rule violation counts.
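Downstream filtering on soft flags might look like this sketch. It assumes metadata.invariant_soft_flags is a list that is empty or absent when no soft invariant was violated; the flag strings shown are hypothetical.

```python
def split_by_soft_flags(records):
    """Separate records with no soft-invariant flags from flagged ones."""
    clean, flagged = [], []
    for record in records:
        # Assumption: the field is a list, missing or empty when nothing was flagged.
        flags = record.get("metadata", {}).get("invariant_soft_flags") or []
        (flagged if flags else clean).append(record)
    return clean, flagged


sample = [
    {"output": "...", "metadata": {}},
    {"output": "...", "metadata": {"invariant_soft_flags": ["cite_sources"]}},
    {"output": "...", "metadata": {"invariant_soft_flags": []}},
]
clean, flagged = split_by_soft_flags(sample)
print(len(clean), "clean,", len(flagged), "flagged")
```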

Systematic value-space coverage

Use the coverage field instead of categories when you need a systematic sweep over a value space. Declare dimensions (e.g. severity, domain, jurisdiction) with enumerated values, and StackAI generates the cross-product with a minimum number of records per cell. The quality report shows filled vs. missing cells.

Example: a 3-dimensional coverage with 5 values each (severity × domain × intent) generates 5 × 5 × 5 = 125 cells. At a minimum of 4 records per cell, that's 500 records with guaranteed representation across every combination. This matters for reproducible safety evals and documented coverage claims.
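The cell arithmetic can be reproduced locally with a cross-product. The dimension values below are hypothetical placeholders; only the 5 × 5 × 5 shape and the 4-record minimum come from the example.

```python
from itertools import product

# Hypothetical values; substitute your own dimensions.
dimensions = {
    "severity": ["none", "low", "medium", "high", "critical"],
    "domain": ["medical", "legal", "finance", "security", "general"],
    "intent": ["benign", "curious", "ambiguous", "evasive", "malicious"],
}
MIN_RECORDS_PER_CELL = 4

cells = list(product(*dimensions.values()))
minimum_records = len(cells) * MIN_RECORDS_PER_CELL


def missing_cells(all_cells, records):
    """Cells with no record, where each record carries one value per dimension."""
    seen = {tuple(record[name] for name in dimensions) for record in records}
    return [cell for cell in all_cells if cell not in seen]


print(len(cells), "cells,", minimum_records, "records minimum")
```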

Reproducibility & audit trail

Every record includes a provenance object with generator model, prompt hash, critic scores (for verified: true jobs), and timestamp. For research reproducibility:

Pass a seed in the request for deterministic ordering.
The manifest file records job config, total cost, timing, and rejection breakdown.
Quality reports are separate S3 objects, fetched via GET /v1/synthetic/jobs/:jobId/quality-report.
All user-defined fields (categories, invariants, coverage dimensions) are preserved in the manifest for reproducible replication.

End-to-end alignment pipeline example

A compact four-job recipe for building a safety-tuned model from scratch. Each job maps to one layer of the pipeline above.

alignment-pipeline.sh

# 1. SFT base with invariants
# 2. Preference tuning (preference_v1 with hard_negatives)
# 3. Decision-layer training (conflict_v1)
# 4. Eval harness with systematic coverage
#    Total: 3,000 records, ~$20 on balanced/deep mix

Model selection for alignment data

All four tiers have been evaluated on 15 dimensions including three alignment-specific measurements: conflict_dataset_fit, preference_dataset_fit, and hard_negative_generation. See the models page for measured scores. Notable findings: Diverse (GPT-5.4 Mini) wins hard-negative generation and conflict fit; Deep (Claude Sonnet 4.6) wins output quality and complexity handling; Balanced (GPT-4.1 Mini) is the best-value default for SFT. Fast (GPT-4o Mini) is not supported for preference_v1 or conflict_v1.

Output & Provenance

Results are returned as JSONL (one JSON object per line). Each record contains the schema fields, any custom metadata you configured, and a provenance object.

instruction_v1 record (full)

result.jsonl (one line)

{
  "instruction": "How do I safely handle a pan fire in my kitchen?",
  "input": "",
  "output": "Cover the pan with a lid to cut off oxygen. Never use water on a grease fire...",
  "metadata": { "id": "nh_042", "category": "safety_critical", "hard_negative": false },
  "provenance": {
    "job_id": "syn_job_a1b2c3d4e5",
    "schema": "instruction_v1",
    "domain": "household safety",
    "generated_at": "2026-03-11T12:34:56Z",
    "model": "claude-sonnet-4-6",
    "quality": { "relevance": 9.5, "accuracy": 9.0, "completeness": 8.5, "overall": 9.0 }
  }
}

preference_v1 record (full)

preference result.jsonl (one line)

{
  "prompt": "What's the safest way to store passwords in a database?",
  "chosen": "Use bcrypt, scrypt, or Argon2 with a per-user random salt...",
  "rejected": "SHA-256 with a salt is a good option. It's fast and widely supported...",
  "chosen_score": 9.5,
  "rejected_score": 3.0,
  "reasoning": "The chosen response correctly recommends purpose-built password hashing...",
  "metadata": {},
  "provenance": { "schema": "preference_v1", "model": "claude-sonnet-4-6", ... }
}
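When preparing preference records for a DPO trainer, you might keep only pairs with a clear score gap and drop the extra fields. The sketch below uses the field names from the record above; the minimum margin of 2.0 and the helper names are arbitrary illustrations, not StackAI recommendations.

```python
def score_margin(record):
    """Gap between the judge scores of the chosen and rejected responses."""
    return record["chosen_score"] - record["rejected_score"]


def to_dpo_example(record):
    """Keep only the three fields most DPO trainers expect."""
    return {key: record[key] for key in ("prompt", "chosen", "rejected")}


def select_pairs(records, min_margin=2.0):
    """Illustrative filter: drop pairs whose scores are too close to be informative."""
    return [to_dpo_example(r) for r in records if score_margin(r) >= min_margin]


sample_pair = {
    "prompt": "What's the safest way to store passwords in a database?",
    "chosen": "Use bcrypt, scrypt, or Argon2 with a per-user random salt...",
    "rejected": "SHA-256 with a salt is a good option. It's fast and widely supported...",
    "chosen_score": 9.5,
    "rejected_score": 3.0,
}
print(score_margin(sample_pair), len(select_pairs([sample_pair])))
```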

manifest.json

Every job also produces a manifest.json that summarizes the job parameters and quality report. Useful for provenance tracking in ML pipelines.

manifest.json

{
  "job_id": "syn_job_a1b2c3d4e5",
  "schema": "instruction_v1",
  "count_accepted": 481,
  "count_rejected": 19,
  "quality_report": { "overall_grade": "A", "pass_rate": 0.962 },
  ...
}

API Reference

Base URL: https://api.stackai.app. All endpoints require an Authorization: Bearer YOUR_API_KEY header unless marked Public.

POST /v1/synthetic/generate

Create a new generation job. Returns job_id and initial status immediately. Generation runs asynchronously.

GET /v1/synthetic/jobs

List your jobs. Supports ?status= (queued/running/succeeded/failed) and ?limit= filters.

GET /v1/synthetic/jobs/:jobId

Get job status, counts, quality grade, and summary. Poll this until status is "succeeded" or "failed".

GET /v1/synthetic/jobs/:jobId/results

Stream the JSONL results file directly. Add ?format=url for a presigned S3 download URL (valid 1 hour).

GET /v1/synthetic/jobs/:jobId/quality-report

Returns a presigned URL for the detailed quality report JSON file.

GET /v1/synthetic/jobs/:jobId/splits

Returns presigned download URLs for each named split (e.g., train, validation) when output_split was configured.

GET /v1/synthetic/pricing (Public)

Returns current pay-as-you-go pricing by schema and quality tier. Public endpoint.

GET /health (Public)

Service health check. Returns status of API, database, queue, and email services.

Error Handling

The API uses standard HTTP status codes. Error bodies always include a message and optionally a code for programmatic handling.

Common status codes

Status	Code	Meaning & Fix
401		Missing or invalid API key. Check your Authorization header.
402	INSUFFICIENT_BALANCE	PAYG balance too low, or subscription quota exhausted.
403	EMAIL_NOT_VERIFIED	Your account email isn't verified. Check your inbox.
422	DOMAIN_SPELL_CHECK	Possible typo in domain. Re-submit with X-Domain-Confirmed: true to override.
422	VALIDATION_ERROR	Request body failed validation. Check the errors array in the response.
429		Rate limit exceeded (60 req/min per org). Add backoff and retry.
5xx		Server-side error. Safe to retry with exponential backoff.

Domain spell-check flow (422)

If your domain contains possible misspellings, the API returns a 422 with suggestions before creating the job and consuming quota. Re-submit with the corrected domain or override the check if your spelling is intentional (technical terms, brand names, etc.).

spell-check-override.sh

# Add X-Domain-Confirmed: true to bypass the spell check
curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -H "X-Domain-Confirmed: true" \
  -d '{ "schema": "instruction_v1", "domain": "pytorch model quantization with QLoRA", ... }'
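In a script, the same override can be decided from the error response. This sketch assumes the error body exposes the code as a top-level "code" key, as the Error Handling description suggests; the helper name is hypothetical, and it only builds the retry headers without sending anything.

```python
def confirm_domain_headers(status_code, error_body, headers):
    """Return headers for a confirmed retry if the 422 was a spell-check warning.

    Returns None for any other response, so real validation errors are not retried.
    """
    if status_code == 422 and error_body.get("code") == "DOMAIN_SPELL_CHECK":
        return {**headers, "X-Domain-Confirmed": "true"}
    return None


base_headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}
retry_headers = confirm_domain_headers(422, {"code": "DOMAIN_SPELL_CHECK"}, base_headers)
print(retry_headers)
```

Only override after reviewing the suggestions in the response; if the domain really is misspelled, correct it instead.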

Polling pattern

poll.py

import time

import requests

BASE_URL = "https://api.stackai.app/v1/synthetic/jobs"
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
JOB_ID = "syn_job_xxxxxxxxxxxxxxxxx"

while True:
    # Fetch the current job state.
    job = requests.get(f"{BASE_URL}/{JOB_ID}", headers=HEADERS, timeout=30).json()
    if job["status"] == "succeeded":
        # Ask for a presigned download URL instead of streaming the JSONL.
        resp = requests.get(f"{BASE_URL}/{JOB_ID}/results", headers=HEADERS,
                            params={"format": "url"}, timeout=30)
        print("Download:", resp.json()["url"])
        break
    if job["status"] == "failed":
        print("Failed:", job.get("error"))
        break
    time.sleep(5)  # wait before polling again

Authentication

API keys

API keys are used for programmatic access (curl, scripts, CI/CD). Keys have the format sk_... and are shown once on creation.

1. Go to your dashboard → API Keys → Create Key.

2. Copy the key immediately; it won't be shown again.

3. Pass it in every request:

auth header

curl https://api.stackai.app/v1/synthetic/jobs \
  -H "Authorization: Bearer sk_live_your_key_here"

Security reminder: Store API keys in environment variables or a secrets manager. Never hardcode them in source files, commit them to git, or expose them in frontend JavaScript. You can have up to 10 active keys per account. Rotate them regularly.
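Reading the key from the environment can be as small as the sketch below. The variable name STACKAI_API_KEY is an assumption chosen for illustration; use whatever name your secrets setup provides.

```python
import os


def auth_headers(env_var="STACKAI_API_KEY"):
    """Build the Authorization header from an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} before calling the API")
    return {"Authorization": f"Bearer {key}"}
```

Calling auth_headers() at request time keeps the key out of source files and version control.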

Rate limits

60 requests per minute per organization. Generation jobs count as 1 request regardless of record count. Exceeding the limit returns HTTP 429. Add exponential backoff and retry.
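A minimal exponential-backoff wrapper, assuming you pass in a send callable that performs one request and returns a (status_code, body) pair. The function name, the delay values, and the jitter range are illustrative choices, not documented API behavior.

```python
import random
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out.

    Retries on 429 and 5xx, doubling the delay each time and adding jitter.
    """
    for attempt in range(max_attempts):
        status, body = send()
        retryable = status == 429 or 500 <= status < 600
        if not retryable or attempt == max_attempts - 1:
            return status, body
        delay = base_delay * 2 ** attempt
        # Jitter spreads out retries from clients that were limited together.
        sleep(delay + random.uniform(0, delay / 2))
```

The sleep parameter is injectable so the retry logic can be exercised in tests without waiting.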

Glossary

Key terms from LLM training and alignment research. Bookmark this for reference while reading papers or planning your training pipeline.

Base Model

A large language model trained on raw text (next-token prediction) without instruction tuning. Knows a lot but doesn't know how to behave. Examples: Llama 3, Mistral, Falcon.

SFT (Supervised Fine-Tuning)

Training a base model on labeled instruction/response pairs to teach desired behavior. The most common starting point for model customization. Uses instruction_v1 data.

RLHF (Reinforcement Learning from Human Feedback)

First train a reward model on human preference pairs, then use PPO to optimize the LLM to maximize reward while staying close to the SFT model (KL penalty). Used by OpenAI for InstructGPT and ChatGPT.

DPO (Direct Preference Optimization)

A simpler alternative to RLHF that directly trains on preference pairs without a separate reward model. More stable training, often competitive with RLHF. Uses preference_v1 data.
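For intuition, the per-pair DPO objective can be written in a few lines. This sketch follows the standard formulation (negative log-sigmoid of the scaled log-ratio margin); the log-probabilities are assumed to be sequence-level values you have already computed, and beta=0.1 is just a common illustrative default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))


# The loss drops once the policy favors the chosen response more than the reference does.
print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```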

PPO (Proximal Policy Optimization)

The RL algorithm most commonly used in RLHF. Its clipped objective limits how far each update moves the model policy, which stabilizes training and helps reduce, though not eliminate, catastrophic forgetting and reward hacking.

Hard Negative

An adversarially constructed training example designed to probe failure modes. For preference data: a rejected response that looks plausible but contains subtle flaws. For instruction data: an input crafted to elicit unsafe or incorrect outputs.

Preference Data

Pairs of responses (chosen, rejected) for the same prompt, where one is labeled better. The core training signal for RLHF and DPO.

Reward Model (RM)

In RLHF, a model trained to score responses. Given a prompt and a response, it outputs a scalar reward. Trained on preference pairs, then used to guide PPO optimization.

LLM-as-Judge

Using a capable LLM (typically GPT-4 or Claude) to evaluate the quality of generated text. Correlates strongly with human judgments at lower cost. Used in StackAI's Verified Quality feature.

Instruction Tuning

Another term for SFT. The process of teaching a language model to follow natural-language instructions, as opposed to just continuing text.

Constitutional AI (CAI)

Anthropic's alignment technique where the AI critiques and revises its own outputs according to a set of principles (the 'constitution'). Uses AI feedback instead of human labels.

KL Divergence

A statistical measure of how different one probability distribution is from another. In RLHF, a KL penalty term keeps the trained model close to the SFT base, which limits reward hacking.
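For discrete distributions the definition is short enough to compute by hand. A minimal sketch, assuming both inputs are probability lists over the same outcomes and that q is nonzero wherever p is:

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # identical distributions
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))  # q puts its mass elsewhere
```

Note that the measure is asymmetric: swapping p and q generally gives a different value.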

Trigram Similarity

A text similarity measure using overlapping 3-character sequences. Used in StackAI's deduplication: if two records share more than 70% of their trigrams, one is removed as a near-duplicate.
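A character-trigram comparison can be sketched as follows. StackAI's exact formula is not specified here, so this version assumes Jaccard similarity over trigram sets with lowercasing and whitespace normalization; only the 0.70 threshold comes from the definition above.

```python
NEAR_DUPLICATE_THRESHOLD = 0.70  # threshold quoted above


def trigrams(text):
    """Set of overlapping 3-character sequences after simple normalization."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a, b):
    """Assumption: Jaccard similarity, shared trigrams over all trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def is_near_duplicate(a, b, threshold=NEAR_DUPLICATE_THRESHOLD):
    return trigram_similarity(a, b) > threshold


print(is_near_duplicate("How do I reset my password?", "How do I reset my password"))
```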

Train / Validation / Test Split

A dataset divided into three parts: training (used to update weights), validation (used to tune hyperparameters and select checkpoints), and test (held out for final evaluation only). Use output_split to generate all three in one job.

Domain

In the StackAI API, the plain-English description of the subject area for generated data (e.g., 'customer support for fintech startups'). Richer, more specific domain descriptions produce more focused, useful data.

JSONL (JSON Lines)

A text format where each line is a valid JSON object. Ideal for large datasets because you can stream and process one record at a time without loading the entire file into memory.
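Streaming a JSONL file one record at a time takes only a small generator. The sketch below works on any iterable of lines, such as an open file; the helper name and the inline sample are illustrative.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one parsed object per non-empty line without loading the whole file."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# In practice: with open("result.jsonl", encoding="utf-8") as f: ...
sample_file = io.StringIO(
    '{"instruction": "a", "output": "b"}\n'
    "\n"
    '{"instruction": "c", "output": "d"}\n'
)
records = list(iter_jsonl(sample_file))
print(len(records), "records")
```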

Ready to start generating?

Free tier includes 100 records/month. No credit card required to start.
