Synthetic Training Data for LLMs
Generate high-quality, schema-validated datasets with full provenance tracking. Perfect for fine-tuning, RLHF, and evaluation.
Three Data Schemas
Purpose-built schemas for supervised fine-tuning, preference learning, and evaluation.
instruction_v1
Instruction-response pairs for supervised fine-tuning. Each record includes system context, user instruction, and ideal response.
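As a rough illustration, an instruction_v1 record could be modeled as below. The field names `system`, `instruction`, and `response` are assumptions inferred from the description above, not the published schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class InstructionRecord:
    # Hypothetical field names; the real instruction_v1 schema may differ.
    system: str
    instruction: str
    response: str

    def validate(self) -> None:
        """Reject records with missing or blank text fields."""
        for name, value in asdict(self).items():
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"{name} must be a non-empty string")


record = InstructionRecord(
    system="You are a support agent for an online store.",
    instruction="How do I reset my password?",
    response="Open Settings, choose Security, then select Reset Password.",
)
record.validate()
line = json.dumps(asdict(record))  # one JSON object per record
```

Keeping validation next to the record type makes it easy to reject malformed rows before they reach a fine-tuning job.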
preference_v1
Paired responses with preference labels for RLHF. Includes chosen/rejected responses with quality scores and reasoning.
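A preference_v1 record might look like the sketch below. All field names (`prompt`, `chosen`, `rejected`, the two scores, `reasoning`) are assumptions based on the description above, and the consistency check is only one plausible sanity test.

```python
from dataclasses import dataclass


@dataclass
class PreferenceRecord:
    # Hypothetical field names; the real preference_v1 schema may differ.
    prompt: str
    chosen: str
    rejected: str
    chosen_score: float
    rejected_score: float
    reasoning: str

    def is_consistent(self) -> bool:
        """The chosen response should differ from and outscore the rejected one."""
        return self.chosen != self.rejected and self.chosen_score > self.rejected_score


pair = PreferenceRecord(
    prompt="Explain what a refund window is.",
    chosen="A refund window is the period after purchase during which you can return an item for your money back.",
    rejected="It is a window.",
    chosen_score=0.92,
    rejected_score=0.21,
    reasoning="The chosen response defines the term; the rejected one does not.",
)
```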
eval_v1
Evaluation datasets for benchmarking. Input-output pairs with configurable metrics like exact_match and semantic_similarity.
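To make the exact_match metric concrete, here is a minimal sketch. The normalization rules (lowercasing, collapsing whitespace) and the `expected_output` field name are assumptions; the service may define the metric differently. semantic_similarity is omitted because it depends on an embedding model the page does not specify.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing."""
    return " ".join(text.lower().split())


def exact_match(prediction: str, expected: str) -> float:
    """Return 1.0 when the normalized strings are identical, else 0.0."""
    return 1.0 if normalize(prediction) == normalize(expected) else 0.0


def mean_exact_match(examples: list[dict], predictions: list[str]) -> float:
    """Average exact_match over an eval set; 'expected_output' is a hypothetical key."""
    scores = [
        exact_match(pred, ex["expected_output"])
        for ex, pred in zip(examples, predictions)
    ]
    return sum(scores) / len(scores) if scores else 0.0
```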
One API Call Away
Generate production-quality training data with a simple REST API. Full schema validation, quality scoring, and provenance tracking built in.
- Schema-validated output guaranteed
- Critic scoring rejects low-quality records
- Deduplication across your dataset
- Full provenance metadata per record
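The page does not say how deduplication works. One simple approach, shown here purely as an illustration, is to hash a normalized form of each record and keep the first occurrence. The `instruction` key is an assumed field name.

```python
import hashlib


def fingerprint(record: dict) -> str:
    """Hash the lowercased, whitespace-collapsed instruction text."""
    text = " ".join(str(record.get("instruction", "")).lower().split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the first record for each fingerprint, preserving input order."""
    seen: set[str] = set()
    unique = []
    for record in records:
        key = fingerprint(record)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

This catches only exact duplicates after normalization; near-duplicate detection would need a fuzzier comparison.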
curl -X POST https://api.stackai.app/v1/synthetic/generate \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "schema": "instruction_v1",
    "domain": "customer_support",
    "count": 100
  }'
How It Works
Three steps from configuration to production-ready datasets.
Configure Your Job
Choose a schema, specify your domain, set constraints like language and difficulty, and pick your quality tier.
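A job configuration along these lines could be assembled in Python as sketched below. The endpoint and the `schema`, `domain`, and `count` fields come from the curl example on this page; the `constraints` and `tier` keys and their values are assumptions, since the request format for them is not shown.

```python
import json
import os
import urllib.request

API_URL = "https://api.stackai.app/v1/synthetic/generate"  # from the curl example


def build_job(schema: str, domain: str, count: int,
              language: str = "en", difficulty: str = "medium",
              tier: str = "standard") -> dict:
    return {
        "schema": schema,
        "domain": domain,
        "count": count,
        # Hypothetical keys: the page names these options but not their JSON form.
        "constraints": {"language": language, "difficulty": difficulty},
        "tier": tier,
    }


def build_request(job: dict, api_key: str) -> urllib.request.Request:
    return urllib.request.Request(
        API_URL,
        data=json.dumps(job).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


job = build_job("instruction_v1", "customer_support", 100)
request = build_request(job, os.environ.get("API_KEY", "test-key"))
# urllib.request.urlopen(request) would submit the job; omitted so the sketch runs offline.
```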
Generate with QC
Our pipeline generates data using your chosen LLMs, then applies critic scoring, heuristic validation, and safety filtering.
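The three quality-control stages can be pictured as successive filters. The sketch below is not the service's implementation: the score threshold, the minimum-length heuristic, the blocked-term list, and the `response` field are all placeholder assumptions, and the critic is passed in as a plain function.

```python
from typing import Callable

BLOCKED_TERMS = {"credit card number", "social security number"}  # placeholder list


def passes_heuristics(record: dict) -> bool:
    """Assumed heuristic: the response must contain at least three words."""
    return len(str(record.get("response", "")).split()) >= 3


def passes_safety(record: dict) -> bool:
    """Assumed safety filter: reject records mentioning any blocked term."""
    text = " ".join(str(value) for value in record.values()).lower()
    return not any(term in text for term in BLOCKED_TERMS)


def quality_control(records: list[dict],
                    critic: Callable[[dict], float],
                    min_score: float = 0.7) -> list[dict]:
    """Keep records that clear the critic threshold and both filters."""
    return [
        record for record in records
        if critic(record) >= min_score
        and passes_heuristics(record)
        and passes_safety(record)
    ]
```

In practice the critic would be another model call; a stub that returns a fixed score is enough to exercise the filtering logic.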
Download Dataset
Get your validated dataset in NDJSON or JSON format with full manifest and provenance metadata.
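NDJSON stores one JSON object per line, so a downloaded file can be read with the standard library alone. The sample records below are made up for illustration, and the manifest layout is not covered because the page does not describe it.

```python
import json


def read_ndjson(text: str) -> list[dict]:
    """Parse newline-delimited JSON, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Made-up sample content standing in for a downloaded dataset file.
sample = (
    '{"instruction": "How do I reset my password?", "response": "Use Settings."}\n'
    '{"instruction": "Where is my order?", "response": "Check the Orders page."}\n'
)
records = read_ndjson(sample)
```

For large datasets, iterating over the file line by line avoids loading everything into memory at once.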
Simple, Transparent Pricing
Pay per record or save with a subscription. No hidden fees.
Economy · $0.50 per 1K records · GPT-4o Mini · Fast prototyping
Standard · $5 per 1K records · GPT-4o · Balanced quality
Premium · $25 per 1K records · Claude Sonnet 4.5 · Production quality
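For budgeting, the per-record prices above translate into a simple calculation. This sketch assumes cost scales linearly with record count and ignores subscription discounts and any billing rounding, neither of which is specified here.

```python
from decimal import Decimal

# Prices per 1,000 records, taken from the tiers listed above.
PRICE_PER_1K = {
    "economy": Decimal("0.50"),
    "standard": Decimal("5"),
    "premium": Decimal("25"),
}


def estimate_cost(tier: str, records: int) -> Decimal:
    """Estimated pay-per-record cost in dollars, assuming linear proration."""
    return PRICE_PER_1K[tier] * Decimal(records) / Decimal(1000)
```

Under that assumption, `estimate_cost("standard", 10_000)` comes to $50 and the same volume on the premium tier comes to $250.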
Ready to Generate?
Start creating high-quality training data in minutes. No credit card required.
Get Started Free