Model Matrix & Project Recommender
Compare the four StackAI quality tiers across synthetic-data-specific dimensions — diversity, verbosity, creativity, complexity, structured output reliability, and cost efficiency — then get a deterministic recommendation for your dataset workflow.
Last updated: 2026-04-13
Project recommender
Pick your dataset type, tune the sliders for what matters to your workflow, or hit a preset. The compass marker drifts toward the best-fit tier. Share your tuning by copying the URL — the hash encodes your slider state.
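One way the shareable hash could work is sketched below. This is a hypothetical encoding, not the page's actual format: the `type` key, the slider names, and the two-decimal rounding are all assumptions made for illustration.

```python
from urllib.parse import urlencode, parse_qsl


def encode_state(dataset_type: str, sliders: dict[str, float]) -> str:
    """Serialize a dataset type and slider weights into a URL fragment.

    Hypothetical format: keys are sorted so the same state always
    produces the same fragment.
    """
    pairs = [("type", dataset_type)]
    pairs += [(name, f"{value:.2f}") for name, value in sorted(sliders.items())]
    return "#" + urlencode(pairs)


def decode_state(fragment: str) -> tuple[str, dict[str, float]]:
    """Invert encode_state: recover the dataset type and slider weights."""
    dataset_type = ""
    sliders: dict[str, float] = {}
    for key, value in parse_qsl(fragment.lstrip("#")):
        if key == "type":
            dataset_type = value
        else:
            sliders[key] = float(value)
    return dataset_type, sliders


fragment = encode_state("eval", {"speed": 0.5, "diversity": 1.0})
restored = decode_state(fragment)
```

Sorting the keys keeps the fragment stable, so two people with the same tuning end up with the same link.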
Best fit
Balanced · GPT-4.1 Mini
Production workhorse. Highest diversity and lowest near-dup rate per dollar across all four tiers.
Why Balanced
- Structured Output: scores 0.88 vs cohort mean 0.38
- Speed: scores 0.88 vs cohort mean 0.38
- Cost Efficiency: scores 0.88 vs cohort mean 0.38
Methodology — what's measured vs. judged
This matrix is tuned for synthetic data generation work, not general LLM benchmarks. To avoid overstating rigor, here's exactly which parts come from measurements, which come from judgment, and which are craft. Rows in the comparison matrix below are tagged with a measured or judgment badge so you know which is which at a glance.
Measured
15 of 15 dimensions — every row
Diversity, speed, cost efficiency, and structured output reliability come from direct measurement on 200-record pilot runs per tier. Verbosity control, instruction following, and consistency come from purpose-built prompt suites (consistency uses text-embedding-3-small to measure cross-run variance). Output quality, creativity, complexity handling, hard-negative generation, and the four dataset fits come from LLM-as-judge with GPT-4o-mini against committed rubric markdown files. Every row tagged measured has a tooltip with the raw value and sample size; results JSON embeds SHA-256 hashes of all inputs for reproducibility.
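The consistency figures in the matrix are reported in `cos` units. A minimal sketch of one plausible scorer follows, assuming the score is the mean pairwise cosine similarity between embeddings of outputs from repeated runs of the same prompt. That aggregation is an assumption, since the page only states that text-embedding-3-small is used to measure cross-run variance. The embedding vectors are taken as already fetched, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def consistency(embeddings: list[list[float]]) -> float:
    """Mean pairwise cosine similarity across runs (assumed aggregation).

    Higher values mean the repeated runs landed closer together.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings to compare runs")
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)
```

Under this reading, a score near 1 means repeated runs produce nearly interchangeable outputs, which is good for predictability but sits in tension with diversity.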
Craft
Recommender weights
The default weights behind each dataset type preset are hand-tuned constants, not derived from data. You can override them with sliders or a preset button. Rules-based and deterministic — the same inputs always produce the same ranking.
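A rules-based ranking of this kind can be sketched as a weighted average with a fixed tie-break. The code below is an illustration under assumptions, not the recommender's actual logic: the weights are made up, the scoring formula is a guess, and only the creativity and hard-negative scores are copied from the comparison matrix. The Fast exclusion for preference and conflict follows the rule stated on this page.

```python
TIERS = ("Fast", "Balanced", "Diverse", "Deep")

# Fast is excluded from preference and conflict recommendations (per this page).
EXCLUDED = {"preference": {"Fast"}, "conflict": {"Fast"}}

# Two rows copied from the comparison matrix (scores out of 1).
SCORES = {
    "Fast": {"creativity": 0.83, "hard_negative": 0.86},
    "Balanced": {"creativity": 0.85, "hard_negative": 0.9},
    "Diverse": {"creativity": 0.863, "hard_negative": 0.933},
    "Deep": {"creativity": 0.88, "hard_negative": 0.923},
}


def rank_tiers(scores, weights, dataset_type):
    """Return (tier, score) pairs, best first, using a weighted average."""
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")
    excluded = EXCLUDED.get(dataset_type, set())
    totals = {
        tier: sum(w * scores[tier][dim] for dim, w in weights.items()) / weight_sum
        for tier in TIERS
        if tier not in excluded
    }
    # Ties fall back to the fixed tier order, so equal inputs give equal rankings.
    return sorted(totals.items(), key=lambda kv: (-kv[1], TIERS.index(kv[0])))


# Hypothetical weights; the page's real preset constants are not published here.
ranking = rank_tiers(SCORES, {"creativity": 1.0, "hard_negative": 1.0}, "preference")
```

Because nothing in the function depends on randomness or iteration order, the same scores, weights, and dataset type always return the same list.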
- All four tiers are available on StackAI today. No bait-and-switch to models you can't actually run.
- Fast tier is excluded from preference and conflict recommendations because the quality floor for paired outputs is too low.
- Measurement data is reproducible: results JSON embeds per-file SHA-256 hashes of the prompts, rubrics, and dimension scorers so any number can be traced back to its exact inputs.
- A few non-obvious findings from the LLM-as-judge runs: Deep wins creativity and complexity handling but is the worst at eval dataset generation — Claude over-thinks benchmark questions. Diverse wins both eval and conflict dataset fit, while all four tiers tie on instruction dataset fit.
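The reproducibility note above describes per-file SHA-256 hashes embedded in the results JSON. A minimal sketch of building such a manifest with Python's standard `hashlib` follows. The function names and the manifest shape are assumptions, not the project's actual schema.

```python
import hashlib
import json
from pathlib import Path


def sha256_file(path, chunk_size: int = 65536) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(paths) -> dict[str, str]:
    """Map each input file (prompt, rubric, scorer) to its digest.

    Paths are sorted so the manifest serializes identically on every run.
    """
    return {str(p): sha256_file(p) for p in sorted(str(Path(p)) for p in paths)}


def manifest_json(paths) -> str:
    """Serialize the manifest in a stable key order for embedding in results."""
    return json.dumps(build_manifest(paths), sort_keys=True, indent=2)
```

To audit a number, recompute the digests of the local prompt, rubric, and scorer files and compare them against the manifest stored with the results.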
The four tiers
Each card shows pricing, pilot benchmarks, and a real sample record so you can see the depth difference at a glance.
Fast
Fastest · GPT-4o Mini · OpenAI
Best default for fast, budget-sensitive instruction and eval generation.
Diversity
80%
Output depth
612 chars
Near-dup
20%
Price per 1K records
Ideal when you need throughput and cost efficiency more than nuance. Not available for preference or conflict schemas — the quality floor for paired outputs is too low.
Balanced
Best value · GPT-4.1 Mini · OpenAI
Production workhorse. Highest diversity and lowest near-dup rate per dollar across all four tiers.
Diversity
94%
Output depth
600 chars
Near-dup
6%
Price per 1K records
The default recommendation for most production fine-tuning jobs. 94% diversity at $3/1K is the strongest value in the lineup and the cost-per-quality winner for instruction and eval workloads.
Diverse
Highest diversity · GPT-5.4 Mini · OpenAI
Highest diversity at scale. Rich scenario-based outputs with near-zero near-duplicate rate.
Diversity
98%
Output depth
1.1K chars
Near-dup
2%
Price per 1K records
Pick this when you need broad coverage or your domain is complex enough that you want longer, more scenario-driven outputs. 98% diversity and 2% near-dup rate — the best diversity numbers in the lineup by a wide margin.
Deep
Deepest analysis · PAYG only · Claude Sonnet 4.6 · Anthropic
Maximum per-record depth. Textbook-level analysis, averaging ~1,955 characters per output.
Diversity
82%
Output depth
2.0K chars
Near-dup
18%
Price per 1K records
PAYG-only. Best when record depth matters more than raw throughput — nuanced preference pairs, conflict scenarios, or long-form reasoning examples. Slower and pricier, but unmatched for hard synthetic data tasks.
Comparison matrix
The matrix emphasizes synthetic-data-specific tradeoffs. On mobile the table stays horizontally scrollable rather than collapsing dimensions away.
| Category | Fast · GPT-4o Mini | Balanced · GPT-4.1 Mini | Diverse · GPT-5.4 Mini | Deep · Claude Sonnet 4.6 |
|---|---|---|---|---|
| Behavioral Characteristics | | | | |
| Diversity (measured). How well the model produces varied examples without collapsing into repetitive patterns. | Excellent · 100% | Excellent · 100% | Excellent · 100% | Excellent · 100% |
| Verbosity Control (measured). How reliably the model matches the intended response length and level of detail. | Excellent · 94.2% | Strong · 89.7% | Limited · 79.1% | Strong · 83.3% |
| Creativity (measured). How well the model generates novel, less templated examples when variation matters. | Limited · 0.83/1 | Strong · 0.85/1 | Strong · 0.863/1 | Excellent · 0.88/1 |
| Instruction Following (measured). How reliably the model obeys prompt constraints, formatting rules, and generation requirements. | Limited · 80% | Strong · 86.7% | Excellent · 93.3% | Strong · 86.7% |
| Consistency (measured). How stable and predictable outputs are across similar prompts and batches. | Strong · 0.777 cos | Strong · 0.806 cos | Limited · 0.701 cos | Excellent · 0.857 cos |
| Quality Dimensions | | | | |
| Complexity Handling (measured). How well the model sustains layered, detailed, nuanced, multi-part examples without flattening them. | Limited · 0.9/1 | Strong · 0.913/1 | Strong · 0.913/1 | Excellent · 0.946/1 |
| Structured Output (measured). How well the model preserves schema shape, fields, and formatting expectations. | Limited · 89.6% | Excellent · 94.5% | Strong · 94.1% | Strong · 90% |
| Output Quality (measured). Overall coherence, usefulness, and polish of the generated records. | Strong · 0.979/1 | Strong · 0.988/1 | Limited · 0.975/1 | Excellent · 0.99/1 |
| Hard-Negative Generation (measured). How well the model can produce plausible but flawed, challenging, or contrastive examples. | Limited · 0.86/1 | Strong · 0.9/1 | Excellent · 0.933/1 | Strong · 0.923/1 |
| Dataset Fit | | | | |
| Instruction Dataset Fit (measured). Suitability for generating instruction_v1 supervised fine-tuning datasets. | Excellent · 1/1 | Excellent · 1/1 | Excellent · 1/1 | Excellent · 1/1 |
| Preference Dataset Fit (measured). Suitability for generating preference_v1 RLHF/DPO datasets. | Not supported · 0/1 | Limited · 0.78/1 | Excellent · 0.8/1 | Excellent · 0.8/1 |
| Eval Dataset Fit (measured). Suitability for generating eval_v1 benchmark datasets. | Strong · 0.967/1 | Strong · 0.92/1 | Excellent · 0.98/1 | Limited · 0.89/1 |
| Conflict Dataset Fit (measured). Suitability for generating conflict_v1 alignment decision datasets. | Not supported · 0/1 | Limited · 0.82/1 | Excellent · 0.85/1 | Strong · 0.83/1 |
| Operational | | | | |
| Speed (measured). Relative generation speed and responsiveness for iterative dataset work. | Strong · 3.642 s/rec | Excellent · 1.878 s/rec | Strong · 2.134 s/rec | Limited · 11.108 s/rec |
| Cost Efficiency (measured). Relative value for budget-sensitive synthetic data generation. | Strong · 0.1163¢/rec | Excellent · 0.1053¢/rec | Strong · 0.1598¢/rec | Limited · 0.9737¢/rec |
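The per-record speed and cost figures in the Operational rows can be scaled to a batch with simple arithmetic. The sketch below assumes records are generated one after another with no parallelism, and it converts the matrix's measured ¢/rec values only, which may not match the plan pricing shown on the tier cards. The function name is hypothetical.

```python
# Per-record values copied from the Speed and Cost Efficiency rows.
CENTS_PER_RECORD = {"Fast": 0.1163, "Balanced": 0.1053, "Diverse": 0.1598, "Deep": 0.9737}
SECONDS_PER_RECORD = {"Fast": 3.642, "Balanced": 1.878, "Diverse": 2.134, "Deep": 11.108}


def batch_estimate(tier: str, records: int) -> tuple[float, float]:
    """Return (dollars, minutes) for a sequential batch of `records`."""
    dollars = CENTS_PER_RECORD[tier] * records / 100  # cents to dollars
    minutes = SECONDS_PER_RECORD[tier] * records / 60  # seconds to minutes
    return round(dollars, 2), round(minutes, 1)


for tier in CENTS_PER_RECORD:
    dollars, minutes = batch_estimate(tier, 1000)
    print(f"{tier}: ${dollars:.2f} and {minutes:.1f} min per 1K records")
```

On these figures, a 1,000-record batch on Deep costs roughly nine times as much as on Balanced and takes about six times as long, which is the tradeoff the Limited labels in those two rows summarize.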
Ready to generate synthetic data?
Pick a model from the matrix, then head to StackAI to generate instruction, preference, eval, or conflict datasets.