Common-sense trick questions for LLMs. Can they see past the surface?
This benchmark consists of two questions: harder, modified variants of the well-known car wash trick question. The goal is to test whether models can reason past surface-level cues to reach the genuinely correct answer.
Given the small question set, saturation is expected relatively quickly. I may expand the benchmark with additional questions of the same style.
The ✅ Perfect / 🟡 Partial / ❌ Wrong / Ties columns count per-question majority-vote outcomes across the two questions; Raw Mean is the mean score summed over both questions.

| # | Model | Raw Mean | ✅ Perfect | 🟡 Partial | ❌ Wrong | Ties |
|---|---|---|---|---|---|---|
| 🥇 | Gemini 3.1 Pro | 1.50 | 1 | 0 | 0 | 1 |
| 🥈 | GLM 5.0 | 1.10 | 1 | 0 | 0 | 1 |
| 🥉 | Gemini 3 Flash | 0.30 | 0 | 0 | 2 | 0 |
| #4 | Claude 4.6 Opus (extended thinking) | 0.20 | 0 | 0 | 2 | 0 |
| #5 | GPT-5.4 Thinking (Medium) | 0.00 | 0 | 0 | 2 | 0 |
| #6 | o3 | 0.00 | 0 | 0 | 2 | 0 |
| #7 | Claude 4.6 Sonnet (extended thinking) | 0.00 | 0 | 0 | 2 | 0 |
| #8 | MiniMax 2.5 (max) | 0.00 | 0 | 0 | 2 | 0 |
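For reference, here is a minimal sketch of how the Raw Mean and Majority Vote columns could be computed from repeated samples per question. The grade labels, the point values (perfect = 1.0, partial = 0.5, wrong = 0.0), and the tie rule are illustrative assumptions, not the benchmark's published rubric.

```python
from collections import Counter

# Assumed point values per grade; the actual rubric is not published here.
POINTS = {"perfect": 1.0, "partial": 0.5, "wrong": 0.0}

def score_run(grades_per_question):
    """Aggregate per-sample grades into leaderboard-style columns.

    grades_per_question: one list of grade strings per question, e.g.
    [["perfect", "perfect", "partial"], ["wrong", "wrong"]].
    Returns (raw_mean, majority_counts), where majority_counts tallies the
    per-question majority-vote outcome, with "tie" when no grade wins.
    """
    raw_mean = 0.0
    majority = Counter({"perfect": 0, "partial": 0, "wrong": 0, "tie": 0})
    for grades in grades_per_question:
        # Raw mean: average points over samples, summed over questions.
        raw_mean += sum(POINTS[g] for g in grades) / len(grades)
        # Majority vote: most frequent grade; a tie if the top two grades
        # occur equally often.
        counts = Counter(grades).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            majority["tie"] += 1
        else:
            majority[counts[0][0]] += 1
    return raw_mean, majority

if __name__ == "__main__":
    # Hypothetical samples for a two-question run.
    demo = [["perfect", "perfect", "perfect"], ["perfect", "wrong", "partial"]]
    print(score_run(demo))
```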