
SynthBench

Not all synthetic surveys are built equal

A single LLM will almost always validate your idea—even when it shouldn't. Model selection, conditioning, and sampling determine whether synthetic respondents actually represent real human opinions.

v0.1.0 · Generated 4/14/2026

Best Model vs Random Baseline

Survey Parity Score (SPS) — higher is better. 1.0 = perfect match to human survey distributions.
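The page does not give the SPS formula here, so the following is only a minimal sketch of one way a 0-to-1 distribution-match score could behave. It assumes a score of 1 minus the Jensen-Shannon divergence (base 2) and a uniform "random-pick" baseline; the function names and all numbers are hypothetical, not SynthBench's actual scoring code.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; 0 for identical distributions, at most 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0 here.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distribution_parity(human, synthetic):
    """ASSUMED score: 1.0 means the synthetic answers match the human distribution."""
    return 1.0 - js_divergence(human, synthetic)

# Made-up answer shares for one four-option survey question.
human = [0.50, 0.30, 0.15, 0.05]
model = [0.45, 0.35, 0.15, 0.05]
uniform = [0.25, 0.25, 0.25, 0.25]  # random-pick baseline

score_model = distribution_parity(human, model)
score_random = distribution_parity(human, uniform)
```

Under these assumptions a model whose answer shares track the human ones scores close to 1.0, while the uniform baseline scores lower.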

Figure: bar chart comparing the top-performing model's SPS against the random-pick baseline.

Best model by dataset: globalopinionqa: SynthPanel (Sonnet 4) · opinionsqa: SynthPanel Ensemble (3-model) · subpop: SynthPanel Ensemble (3-model)

Key Findings

The most surprising results from our benchmark runs

Ensemble Advantage

+6-7 SPS points

Blending 3 models beats any single model. Zero additional API cost—just arithmetic on existing responses.
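The "arithmetic on existing responses" can be sketched as an element-wise average of each model's answer distribution for a question. This is a minimal illustration that assumes an equal-weight mean; the page does not state the weighting, and the function name and numbers below are hypothetical.

```python
from typing import Sequence

def blend_distributions(dists: Sequence[Sequence[float]]) -> list[float]:
    """Average several answer-option distributions element-wise, then renormalize."""
    if not dists:
        raise ValueError("need at least one distribution")
    n_options = len(dists[0])
    if any(len(d) != n_options for d in dists):
        raise ValueError("all distributions must cover the same answer options")
    mean = [sum(d[i] for d in dists) / len(dists) for i in range(n_options)]
    total = sum(mean)
    return [p / total for p in mean]

# Made-up per-model answer shares for one four-option question.
model_a = [0.10, 0.20, 0.30, 0.40]
model_b = [0.20, 0.20, 0.40, 0.20]
model_c = [0.30, 0.20, 0.20, 0.30]

blended = blend_distributions([model_a, model_b, model_c])
# blended is approximately [0.2, 0.2, 0.3, 0.3]
```

No new model calls are needed: the blend reuses the distributions each model already produced.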

Conditioning Asymmetry

2.2× gap

Republican conditioning shifts responses 2.2× more than Democrat conditioning, revealing the model’s progressive default lean.
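One way to put a number on such an asymmetry is to measure how far each conditioned distribution moves from the unconditioned one and take the ratio. The sketch below assumes total variation distance as the shift measure, which the page does not confirm, and uses made-up numbers that are not benchmark results.

```python
def total_variation(p, q):
    """Total variation distance: half the summed absolute difference in shares."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical answer shares for one three-option question.
unconditioned = [0.40, 0.35, 0.25]
democrat = [0.45, 0.35, 0.20]
republican = [0.25, 0.35, 0.40]

shift_democrat = total_variation(unconditioned, democrat)
shift_republican = total_variation(unconditioned, republican)

# A ratio above 1 means Republican conditioning moves the answers further,
# i.e. the unconditioned default sits closer to the Democrat-conditioned answers.
asymmetry = shift_republican / shift_democrat
```

With these illustrative inputs the Republican shift is about three times the Democrat shift.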

Temperature Matters (Sometimes)

+4.5% for Gemini, ±0.7% for Haiku

Temperature sensitivity is model-specific, not universal. One size does not fit all.

Leaderboard Summary

Top 3 models per dataset by Survey Parity Score

| # | Model | Dataset | SPS | Range % | p_dist | p_rank |
| 1 | SynthPanel (Sonnet 4), conditioned | globalopinionqa | 0.797 | - | 0.910 | 0.500 |
| 2 | SynthPanel (GPT-4o-mini), conditioned | globalopinionqa | 0.786 | 15% | 0.689 | 0.694 |
| 3 | Gemini 2.5 Flash | globalopinionqa | 0.770 | 0% | 0.687 | 0.645 |
| 1 | SynthPanel Ensemble (3-model), ensemble | opinionsqa | 0.835 | - | 0.833 | 0.837 |
| 2 | Gemini 2.5 Flash | opinionsqa | 0.829 | 0% | 0.738 | 0.761 |
| 3 | SynthPanel (Sonnet 4), conditioned | opinionsqa | 0.829 | 21% | 0.726 | 0.793 |
| 1 | SynthPanel Ensemble (3-model), ensemble | subpop | 0.833 | - | 0.871 | 0.795 |
| 2 | SynthPanel (Gemini Flash Lite), conditioned | subpop | 0.821 | - | 0.707 | 0.780 |
| 3 | Llama 3.3 70B | subpop | 0.796 | 0% | 0.655 | 0.756 |


Try SynthPanel

Run synthetic surveys with built-in best practices. Install with pip or clone from GitHub.


Explore Methodology

How we score models, what SPS measures, and why distribution fidelity matters.


Submit Your Model

Run the benchmark on your model and submit results. Open to any provider or framework.
