
SynthBench

Not all synthetic surveys are built equal

A single LLM will almost always validate your idea—even when it shouldn’t. Model choice, persona prompting, and how you aggregate responses decide whether synthetic respondents actually represent real human opinions. We measure that with a single number.

Survey Parity Score (SPS) measures how closely AI-generated survey responses match real human opinion distributions. 1.0 = perfect match.
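The exact SPS formula is covered on the methodology page; purely as an illustration of the kind of comparison involved, the Python sketch below scores one question by turning a model's answers into option shares and comparing them against the human shares with total variation distance. The helper names and the "1 minus TVD" scoring rule are assumptions made for this sketch, not SynthBench's actual definition, and the real score also reflects rank agreement (the p_dist and p_rank columns on the leaderboard).

```python
import numpy as np

def answer_distribution(responses, options):
    """Turn a list of chosen options into shares over the answer options
    (illustrative helper, not part of the SynthBench package)."""
    counts = np.array([responses.count(o) for o in options], dtype=float)
    return counts / counts.sum()

def parity_score(model_dist, human_dist):
    """Toy distribution-fidelity score: 1 minus total variation distance.
    1.0 means the model's answer shares exactly match the human shares."""
    tvd = 0.5 * np.abs(np.asarray(model_dist) - np.asarray(human_dist)).sum()
    return 1.0 - tvd

# Example: one 4-option survey question
options = ["Strongly agree", "Agree", "Disagree", "Strongly disagree"]
human = [0.30, 0.40, 0.20, 0.10]  # published human shares (made-up numbers)
model = answer_distribution(
    ["Agree"] * 55 + ["Strongly agree"] * 25 + ["Disagree"] * 20, options
)
print(round(parity_score(model, human), 3))  # 0.85 for this toy example
```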

v0.1.0 · Generated 4/21/2026

Who is this for?

SynthBench gives three groups a shared source of truth for synthetic survey quality.

ML researchers

Validate synthetic respondents before citing them. Compare models on distribution fidelity, rank agreement, and persona conditioning across public survey datasets.

Survey-tech companies

Prove your product actually mirrors human opinion—not just sounds plausible. Benchmark your pipeline against a public, reproducible standard.

Policy & research teams

Audit AI-generated research you commission or consume. Know whether synthetic samples are safe to brief on, or will mislead your decisions.

Best Model vs Random Baseline

Survey Parity Score (SPS) — higher is better. 1.0 = perfect match to human survey distributions.

[Bar chart: the top-performing model's SPS vs. the random-pick baseline on each dataset.]

Best model per dataset: globalopinionqa: SynthPanel (GPT-4o-mini) · opinionsqa: SynthPanel Ensemble (3-model) · subpop: SynthPanel Ensemble (3-model)

Key Findings

The most surprising results from our benchmark runs

Ensemble Advantage

+6-7 SPS points

Blending 3 models beats any single model, at zero additional API cost: it is just arithmetic on existing responses (see the sketch after these findings).

Persona Prompting Asymmetry

2.2× gap

Prompting the model as a Republican shifts responses 2.2× more than prompting as a Democrat, revealing the model's progressive default lean (see the measurement sketch after these findings).

Temperature Matters (Sometimes)

+4.5% for Gemini, ±0.7% for Haiku

Temperature sensitivity is model-specific, not universal. One size does not fit all.
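
What "just arithmetic on existing responses" could look like in practice: the sketch below blends three models' per-question answer distributions with a plain average and renormalises the result. The uniform weighting and the function name are assumptions for illustration, not the exact recipe behind the SynthPanel Ensemble entries.

```python
import numpy as np

def ensemble_distribution(model_dists, weights=None):
    """Blend per-question answer distributions from several models into one.
    A plain (optionally weighted) average: no extra API calls, only arithmetic
    on response distributions that were already collected."""
    dists = np.asarray(model_dists, dtype=float)   # shape: (n_models, n_options)
    if weights is None:
        weights = np.full(len(dists), 1.0 / len(dists))
    blended = (np.asarray(weights)[:, None] * dists).sum(axis=0)
    return blended / blended.sum()                 # renormalise for safety

# Three models' answer shares on the same 4-option question (made-up numbers)
model_a = [0.20, 0.55, 0.20, 0.05]
model_b = [0.35, 0.35, 0.20, 0.10]
model_c = [0.30, 0.45, 0.15, 0.10]
print(ensemble_distribution([model_a, model_b, model_c]))
```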
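
And one way the persona asymmetry could be measured, assuming the "shift" is quantified as the distance between a persona-conditioned answer distribution and an unconditioned baseline. The choice of total variation distance and all of the numbers below are illustrative assumptions, not the benchmark's recorded values.

```python
import numpy as np

def shift_magnitude(base_dist, persona_dist):
    """Illustrative shift measure: total variation distance between the
    unconditioned answer distribution and the persona-conditioned one."""
    return 0.5 * np.abs(np.asarray(persona_dist) - np.asarray(base_dist)).sum()

# Made-up answer shares for one question, only to show the shape of the calculation
baseline   = [0.45, 0.30, 0.15, 0.10]   # no persona in the prompt
republican = [0.28, 0.26, 0.26, 0.20]   # "Answer as a Republican ..."
democrat   = [0.51, 0.33, 0.11, 0.05]   # "Answer as a Democrat ..."

rep_shift = shift_magnitude(baseline, republican)
dem_shift = shift_magnitude(baseline, democrat)
print(f"asymmetry ratio: {rep_shift / dem_shift:.1f}x")  # ~2.3x on these toy numbers
```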

Leaderboard Summary

Top 3 models per dataset by Survey Parity Score. Select a row on the live leaderboard to open its configuration page.
#  Model                                        Dataset          SPS    p_dist  p_rank
1  SynthPanel (GPT-4o-mini), conditioned        globalopinionqa  0.786  0.689   0.694
2  Gemini 2.5 Flash                             globalopinionqa  0.770  0.687   0.645
3  Llama 3.3 70B                                globalopinionqa  0.762  0.635   0.672
1  SynthPanel Ensemble (3-model), ensemble      opinionsqa       0.835  0.833   0.837
2  Gemini 2.5 Flash                             opinionsqa       0.829  0.738   0.761
3  SynthPanel (Sonnet 4), conditioned           opinionsqa       0.829  0.726   0.793
1  SynthPanel Ensemble (3-model), ensemble      subpop           0.833  0.871   0.795
2  SynthPanel (Gemini Flash Lite), conditioned  subpop           0.821  0.707   0.780
3  SynthPanel (Haiku 4.5), conditioned          subpop           0.809  0.712   0.757

View full leaderboard → Explore all runs →

Run your first benchmark

Three commands to install the CLI, score your model against real survey data, and see where it lands on the leaderboard.

See the quickstart

Explore the methodology

How we score models, what the Survey Parity Score measures, and why distribution fidelity matters for synthetic respondents.

Read methodology

Submit your model

Run the full benchmark on your model and add your results to the public leaderboard. Open to any provider or framework.

Submit results

Using SynthPanel for synthetic surveys? Get SynthPanel on GitHub