SynthBench
Not all synthetic surveys are built equal
A single LLM will almost always validate your idea—even when it shouldn’t. Model choice, persona prompting, and how you aggregate responses decide whether synthetic respondents actually represent real human opinions. We measure that with a single number.
Survey Parity Score (SPS) measures how closely AI-generated survey responses match real human opinion distributions. 1.0 = perfect match.
v0.1.0 · Generated 4/21/2026
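For intuition, here is a minimal sketch of one way to score distribution parity, assuming answer distributions are represented as option-to-probability maps and using 1 minus total variation distance as the similarity measure. The benchmark's actual SPS formula may combine more components (the leaderboard also reports rank agreement), so treat this as an illustration, not SynthBench's implementation.

```python
# Illustrative only: one plausible distribution-parity measure,
# 1 - total variation distance. Not the benchmark's actual SPS formula.
def distribution_parity(human: dict[str, float], synth: dict[str, float]) -> float:
    """Return a score in [0, 1]; 1.0 means the two answer distributions match exactly."""
    options = human.keys() | synth.keys()
    tvd = 0.5 * sum(abs(human.get(o, 0.0) - synth.get(o, 0.0)) for o in options)
    return 1.0 - tvd

# Hypothetical data: human panel vs. synthetic respondents on one question.
human = {"agree": 0.42, "disagree": 0.31, "neutral": 0.19, "refused": 0.08}
synth = {"agree": 0.55, "disagree": 0.25, "neutral": 0.15, "refused": 0.05}
print(f"{distribution_parity(human, synth):.3f}")  # 0.870
```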
Who is this for?
SynthBench gives three groups a shared source of truth for synthetic survey quality.
ML researchers
Validate synthetic respondents before citing them. Compare models on distribution fidelity, rank agreement, and persona conditioning across public survey datasets.
Survey-tech companies
Prove your product actually mirrors human opinion—not just sounds plausible. Benchmark your pipeline against a public, reproducible standard.
Policy & research teams
Audit AI-generated research you commission or consume. Know whether synthetic samples are safe to brief on or likely to mislead the decisions they inform.
Best Model vs Random Baseline
Survey Parity Score (SPS) — higher is better. 1.0 = perfect match to human survey distributions.
Key Findings
The most surprising results from our benchmark runs
Ensemble Advantage
+6-7 SPS points
Blending three models beats any single model, at zero additional API cost. It is just arithmetic on existing responses, as sketched below.
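A minimal sketch of that arithmetic, assuming each model's output has already been collected as a per-question answer distribution (all model names and numbers here are hypothetical): the ensemble is a per-option average, renormalized.

```python
# Hypothetical sketch: blend three models' answer distributions for one
# question by simple averaging. No additional API calls are needed.
def blend(distributions: list[dict[str, float]]) -> dict[str, float]:
    options = set().union(*distributions)
    avg = {o: sum(d.get(o, 0.0) for d in distributions) / len(distributions)
           for o in options}
    total = sum(avg.values())  # renormalize in case of rounding drift
    return {o: p / total for o, p in avg.items()}

model_a = {"agree": 0.70, "disagree": 0.30}
model_b = {"agree": 0.40, "disagree": 0.60}
model_c = {"agree": 0.55, "disagree": 0.45}
print(blend([model_a, model_b, model_c]))  # agree 0.55, disagree 0.45
```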
Persona Prompting Asymmetry
2.2× gap
Prompting the model as a Republican shifts responses 2.2× more than prompting it as a Democrat, revealing the model's default progressive lean. The sketch below shows one way to measure the shift.
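A hedged sketch of how such an asymmetry can be measured, using total variation distance from an unconditioned baseline; the distributions below are hypothetical, chosen to mirror the 2.2× figure.

```python
# Hypothetical sketch: quantify how far each persona prompt moves the
# model's answer distribution away from its unconditioned baseline.
def tvd(p: dict[str, float], q: dict[str, float]) -> float:
    options = p.keys() | q.keys()
    return 0.5 * sum(abs(p.get(o, 0.0) - q.get(o, 0.0)) for o in options)

baseline   = {"support": 0.60, "oppose": 0.40}  # no persona prompt
republican = {"support": 0.27, "oppose": 0.73}  # "answer as a Republican"
democrat   = {"support": 0.75, "oppose": 0.25}  # "answer as a Democrat"

rep_shift = tvd(baseline, republican)  # 0.33
dem_shift = tvd(baseline, democrat)    # 0.15
print(f"asymmetry: {rep_shift / dem_shift:.1f}x")  # asymmetry: 2.2x
```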
Temperature Matters (Sometimes)
+4.5% for Gemini, ±0.7% for Haiku
Temperature sensitivity is model-specific, not universal; one setting does not fit all models. The sketch below shows one way to measure it.
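A minimal sketch for probing this in your own pipeline, assuming you supply a sampling function and a scoring function (both are stand-ins, not SynthBench APIs): sweep temperature and look at the spread in scores.

```python
# Hypothetical sketch: measure temperature sensitivity as the SPS spread
# across a small temperature sweep. `run_survey` and `score` are stand-ins
# for your own sampling and scoring functions.
from typing import Callable

def temperature_spread(
    run_survey: Callable[[float], dict],  # temperature -> collected responses
    score: Callable[[dict], float],       # responses -> Survey Parity Score
    temperatures: tuple[float, ...] = (0.0, 0.5, 1.0),
) -> float:
    """Return max(SPS) - min(SPS) across the sweep; near zero means insensitive."""
    scores = [score(run_survey(t)) for t in temperatures]
    return max(scores) - min(scores)
```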
Leaderboard Summary
Top 3 models per dataset by Survey Parity Score
| # | Model | Dataset | SPS | p_dist | p_rank |
|---|---|---|---|---|---|
| 1 | SynthPanel (GPT-4o-mini), conditioned | globalopinionqa | 0.786 | 0.689 | 0.694 |
| 2 | Gemini 2.5 Flash | globalopinionqa | 0.770 | 0.687 | 0.645 |
| 3 | Llama 3.3 70B | globalopinionqa | 0.762 | 0.635 | 0.672 |
| 1 | SynthPanel Ensemble (3-model) | opinionsqa | 0.835 | 0.833 | 0.837 |
| 2 | Gemini 2.5 Flash | opinionsqa | 0.829 | 0.738 | 0.761 |
| 3 | SynthPanel (Sonnet 4), conditioned | opinionsqa | 0.829 | 0.726 | 0.793 |
| 1 | SynthPanel Ensemble (3-model) | subpop | 0.833 | 0.871 | 0.795 |
| 2 | SynthPanel (Gemini Flash Lite), conditioned | subpop | 0.821 | 0.707 | 0.780 |
| 3 | SynthPanel (Haiku 4.5), conditioned | subpop | 0.809 | 0.712 | 0.757 |
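For readers skimming the columns: p_dist and p_rank are per-dataset distribution-level and rank-level agreement scores. A hedged sketch of plausible definitions follows, assuming p_dist is distribution similarity (1 minus total variation distance) and p_rank is Spearman rank agreement over option shares mapped to [0, 1]; the benchmark's exact definitions may differ.

```python
# Assumed interpretations of the leaderboard columns, for intuition only;
# the benchmark's exact p_dist / p_rank definitions may differ.
from scipy.stats import spearmanr

def p_dist(human: dict[str, float], synth: dict[str, float]) -> float:
    """Distribution agreement: 1 - total variation distance."""
    options = human.keys() | synth.keys()
    return 1.0 - 0.5 * sum(abs(human.get(o, 0.0) - synth.get(o, 0.0)) for o in options)

def p_rank(human: dict[str, float], synth: dict[str, float]) -> float:
    """Rank agreement: Spearman correlation of option shares, mapped to [0, 1]."""
    options = sorted(human.keys() | synth.keys())
    rho, _ = spearmanr([human.get(o, 0.0) for o in options],
                       [synth.get(o, 0.0) for o in options])
    return (rho + 1.0) / 2.0
```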
Run your first benchmark
Three commands to install the CLI, score your model against real survey data, and see where it lands on the leaderboard.
See the quickstart
Explore the methodology
How we score models, what the Survey Parity Score measures, and why distribution fidelity matters for synthetic respondents.
Read the methodology
Submit your model
Run the full benchmark on your model and add your results to the public leaderboard. Open to any provider or framework.
Submit results
Using SynthPanel for synthetic surveys? Get SynthPanel on GitHub