SynthBench
Not all synthetic surveys are built equal
A single LLM will almost always validate your idea—even when it shouldn’t. Model choice, persona prompting, and how you aggregate responses decide whether synthetic respondents actually represent real human opinions. We measure that with a single number.
Survey Parity Score (SPS) measures how closely AI-generated survey responses match real human opinion distributions. 1.0 = perfect match.
v0.1.0 · Generated 4/21/2026
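For intuition, here is a minimal sketch of one way to score distribution parity, assuming answer distributions are represented as option-to-probability maps and using 1 minus total variation distance as the similarity measure. The benchmark's actual SPS formula may combine more components (the leaderboard also reports rank agreement), so treat this as an illustration, not SynthBench's implementation.

```python
# Illustrative only: one plausible distribution-parity measure,
# 1 - total variation distance. Not the benchmark's actual SPS formula.
def distribution_parity(human: dict[str, float], synth: dict[str, float]) -> float:
    """Return a score in [0, 1]; 1.0 means the two answer distributions match exactly."""
    options = human.keys() | synth.keys()
    tvd = 0.5 * sum(abs(human.get(o, 0.0) - synth.get(o, 0.0)) for o in options)
    return 1.0 - tvd

# Hypothetical data: human panel vs. synthetic respondents on one question.
human = {"agree": 0.42, "disagree": 0.31, "neutral": 0.19, "refused": 0.08}
synth = {"agree": 0.55, "disagree": 0.25, "neutral": 0.15, "refused": 0.05}
print(f"{distribution_parity(human, synth):.3f}")  # 0.870
```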
Who is this for?
SynthBench gives three groups a shared source of truth for synthetic survey quality.
ML researchers
Validate synthetic respondents before citing them. Compare models on distribution fidelity, rank agreement, and persona conditioning across public survey datasets.
Survey-tech companies
Prove your product actually mirrors human opinion—not just sounds plausible. Benchmark your pipeline against a public, reproducible standard.
Policy & research teams
Audit AI-generated research you commission or consume. Know whether synthetic samples are safe to brief on or likely to mislead the decisions they inform.
Best Model vs Random Baseline
Survey Parity Score (SPS) — higher is better. 1.0 = perfect match to human survey distributions.
Key Findings
The most surprising results from our benchmark runs
Ensemble Advantage
+6-7 SPS points
Blending three models beats any single model, at zero additional API cost. It is just arithmetic on existing responses, as sketched below.
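A minimal sketch of that arithmetic, assuming each model's output has already been collected as a per-question answer distribution (all model names and numbers here are hypothetical): the ensemble is a per-option average, renormalized.

```python
# Hypothetical sketch: blend three models' answer distributions for one
# question by simple averaging. No additional API calls are needed.
def blend(distributions: list[dict[str, float]]) -> dict[str, float]:
    options = set().union(*distributions)
    avg = {o: sum(d.get(o, 0.0) for d in distributions) / len(distributions)
           for o in options}
    total = sum(avg.values())  # renormalize in case of rounding drift
    return {o: p / total for o, p in avg.items()}

model_a = {"agree": 0.70, "disagree": 0.30}
model_b = {"agree": 0.40, "disagree": 0.60}
model_c = {"agree": 0.55, "disagree": 0.45}
print(blend([model_a, model_b, model_c]))  # agree 0.55, disagree 0.45
```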
Persona Prompting Asymmetry
2.2× gap
Prompting the model as a Republican shifts responses 2.2× more than prompting it as a Democrat, revealing the model's default progressive lean. The sketch below shows one way to measure the shift.
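A hedged sketch of how such an asymmetry can be measured, using total variation distance from an unconditioned baseline; the distributions below are hypothetical, chosen to mirror the 2.2× figure.

```python
# Hypothetical sketch: quantify how far each persona prompt moves the
# model's answer distribution away from its unconditioned baseline.
def tvd(p: dict[str, float], q: dict[str, float]) -> float:
    options = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

baseline   = {"support": 0.60, "oppose": 0.40}  # no persona prompt
republican = {"support": 0.27, "oppose": 0.73}  # "answer as a Republican"
democrat   = {"support": 0.75, "oppose": 0.25}  # "answer as a Democrat"

rep_shift = tvd(baseline, republican)  # 0.33
dem_shift = tvd(baseline, democrat)    # 0.15
print(f"asymmetry: {rep_shift / dem_shift:.1f}x")  # asymmetry: 2.2x
```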
Temperature Matters (Sometimes)
+4.5% for Gemini, ±0.7% for Haiku
Temperature sensitivity is model-specific, not universal; one setting does not fit all models. The sketch below shows one way to measure it.
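A minimal sketch for probing this in your own pipeline, assuming you supply a sampling function and a scoring function (both are stand-ins, not SynthBench APIs): sweep temperature and look at the spread in scores.

```python
# Hypothetical sketch: measure temperature sensitivity as the SPS spread
# across a small temperature sweep. `run_survey` and `score` are stand-ins
# for your own sampling and scoring functions.
from typing import Callable

def temperature_spread(
    run_survey: Callable[[float], dict],  # temperature -> collected responses
    score: Callable[[dict], float],       # responses -> Survey Parity Score
    temperatures: tuple[float, ...] = (0.0, 0.5, 1.0),
) -> float:
    """Return max(SPS) - min(SPS) across the sweep; near zero means insensitive."""
    scores = [score(run_survey(t)) for t in temperatures]
    return max(scores) - min(scores)
```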
Leaderboard Summary
Top 3 models per dataset by Survey Parity Score
| # | Model | Dataset | SPS | p_dist | p_rank |
|---|---|---|---|---|---|
| 1 | SynthPanel (GPT-4o-mini), conditioned | globalopinionqa | 0.786 | 0.689 | 0.694 |
| 2 | Gemini 2.5 Flash | globalopinionqa | 0.770 | 0.687 | 0.645 |
| 3 | Llama 3.3 70B | globalopinionqa | 0.762 | 0.635 | 0.672 |
| 1 | SynthPanel Ensemble (3-model) | opinionsqa | 0.835 | 0.833 | 0.837 |
| 2 | Gemini 2.5 Flash | opinionsqa | 0.829 | 0.738 | 0.761 |
| 3 | SynthPanel (Sonnet 4), conditioned | opinionsqa | 0.829 | 0.726 | 0.793 |
| 1 | SynthPanel Ensemble (3-model) | subpop | 0.833 | 0.871 | 0.795 |
| 2 | SynthPanel (Gemini Flash Lite), conditioned | subpop | 0.821 | 0.707 | 0.780 |
| 3 | SynthPanel (Haiku 4.5), conditioned | subpop | 0.809 | 0.712 | 0.757 |
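For readers skimming the columns: p_dist and p_rank are per-dataset distribution-level and rank-level agreement scores. A hedged sketch of plausible definitions follows, assuming p_dist is distribution similarity (1 minus total variation distance) and p_rank is Spearman rank agreement over option shares mapped to [0, 1]; the benchmark's exact definitions may differ.

```python
# Assumed interpretations of the leaderboard columns, for intuition only;
# the benchmark's exact p_dist / p_rank definitions may differ.
from scipy.stats import spearmanr

def p_dist(human: dict[str, float], synth: dict[str, float]) -> float:
    """Distribution agreement: 1 - total variation distance."""
    options = human.keys() | synth.keys()
    return 1.0 - 0.5 * sum(abs(human.get(o, 0.0) - synth.get(o, 0.0)) for o in options)

def p_rank(human: dict[str, float], synth: dict[str, float]) -> float:
    """Rank agreement: Spearman correlation of option shares, mapped to [0, 1]."""
    options = sorted(human.keys() | synth.keys())
    rho, _ = spearmanr([human.get(o, 0.0) for o in options],
                       [synth.get(o, 0.0) for o in options])
    return (rho + 1.0) / 2.0
```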
Run your first benchmark
Three commands to install the CLI, score your model against real survey data, and see where it lands on the leaderboard.
See the quickstart
Explore the methodology
How we score models, what the Survey Parity Score measures, and why distribution fidelity matters for synthetic respondents.
Read the methodology
Submit your model
Run the full benchmark on your model and add your results to the public leaderboard. Open to any provider or framework.
Submit results
Using SynthPanel for synthetic surveys? Get SynthPanel on GitHub