🚗 CarWashBench v0.1

Common-sense trick questions for LLMs. Can they see past the surface?

Mode: Run Mean | Partial credit: 0.25 | Questions: 4 | Models: 12

This benchmark consists of 2 questions: harder, modified variants of the well-known car wash trick question. The goal is to test whether models can reason past surface-level cues to the genuinely correct answer.

Given the small question set, saturation is expected relatively quickly. I may expand the benchmark with additional questions of the same style.

Leaderboard

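Majority Vote columns: ✅ Perfect, 🟡 Partial, ❌ Wrong, Ties.

| # | Model | Score % | Raw Mean | ✅ Perfect | 🟡 Partial | ❌ Wrong | Ties |
|---|---|---|---|---|---|---|---|
| 🥇 | Gemini 3.1 Pro | 72.5% | 1.50 | 1 | 0 | 0 | 1 |
| 🥈 | GLM 5.0 | 47.5% | 1.10 | 1 | 0 | 0 | 1 |
| 🥉 | Gemma 4 31B | 20.0% | 0.60 | 0 | 1 | 1 | 0 |
| #4 | Gemma 4 31B (Q8) | 17.5% | 0.50 | 0 | 0 | 1 | 1 |
| #5 | GPT 5.5 (medium) | 17.5% | 0.50 | 0 | 0 | 1 | 1 |
| #6 | Gemini 3 Flash | 12.5% | 0.30 | 0 | 0 | 2 | 0 |
| #7 | Claude 4.6 Opus (extended thinking) | 10.0% | 0.20 | 0 | 0 | 2 | 0 |
| #8 | GPT-5.4 Thinking (Medium) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #9 | o3 | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #10 | Claude 4.6 Sonnet (extended thinking) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #11 | MiniMax 2.5 (max) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #12 | Qwen 3.5 27B | 0.0% | - | 0 | 0 | 0 | 0 |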
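
The Score % and Raw Mean columns appear to follow from the per-run grades listed in the Per-Model Breakdown. The sketch below is a reconstruction, not the benchmark's own code: it assumes each digit in a run strip is a grade where 2 is perfect, 1 is partial, and 0 is wrong, and that a partial answer earns the 0.25 credit stated in the header. The function names are hypothetical. Under those assumptions the sketch reproduces the published figures for the models with recorded runs.

```python
PARTIAL_CREDIT = 0.25  # taken from the header line "Partial credit: 0.25"

# Assumed mapping from a per-run grade to credit (inferred from the published numbers).
CREDIT = {2: 1.0, 1: PARTIAL_CREDIT, 0: 0.0}


def parse_runs(strip: str) -> list[int]:
    """Turn a run strip such as "02021" into a list of integer grades."""
    return [int(ch) for ch in strip]


def raw_mean(strips: list[str]) -> float:
    """Average of the per-question mean grades (each grade is 0, 1, or 2)."""
    per_question = [sum(parse_runs(s)) / len(s) for s in strips]
    return sum(per_question) / len(per_question)


def score_percent(strips: list[str]) -> float:
    """Average of the per-question mean credits, expressed as a percentage."""
    per_question = [sum(CREDIT[g] for g in parse_runs(s)) / len(s) for s in strips]
    return 100 * sum(per_question) / len(per_question)


# Run strips for Q1 and Q2 copied from the Per-Model Breakdown.
gemini_31_pro = ["02021", "22222"]
glm_50 = ["12010", "22102"]

print(f"{raw_mean(gemini_31_pro):.2f} {score_percent(gemini_31_pro):.1f}%")  # 1.50 72.5%
print(f"{raw_mean(glm_50):.2f} {score_percent(glm_50):.1f}%")  # 1.10 47.5%
```

Only questions with recorded runs are passed in. How the site treats questions without runs (Q3 and Q4 here) is not stated, so the sketch leaves them out.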

Per-Model Breakdown

🥇 Gemini 3.1 Pro 72.5%
Q1
02021
mean 1.00
Q2
22222
mean 2.00
Q3
-----
mean -
Q4
-----
mean -
🥈 GLM 5.0 47.5%
Q1
12010
mean 0.80
Q2
22102
mean 1.40
Q3
-----
mean -
Q4
-----
mean -
🥉 Gemma 4 31B 20.0%
Q1
00000
mean 0.00
Q2
11211
mean 1.20
Q3
-----
mean -
Q4
-----
mean -
#4 Gemma 4 31B (Q8) 17.5%
Q1
10000
mean 0.20
Q2
01102
mean 0.80
Q3
-----
mean -
Q4
-----
mean -
#5 GPT 5.5 (medium) 17.5%
Q1
01000
mean 0.20
Q2
01102
mean 0.80
Q3
-----
mean -
Q4
-----
mean -
#6 Gemini 3 Flash 12.5%
Q1
00000
mean 0.00
Q2
20001
mean 0.60
Q3
-----
mean -
Q4
-----
mean -
#7 Claude 4.6 Opus (extended thinking) 10.0%
Q1
00000
mean 0.00
Q2
00002
mean 0.40
Q3
-----
mean -
Q4
-----
mean -
#8 GPT-5.4 Thinking (Medium) 0.0%
Q1
00000
mean 0.00
Q2
00000
mean 0.00
Q3
-----
mean -
Q4
-----
mean -
#9 o3 0.0%
Q1
00000
mean 0.00
Q2
00000
mean 0.00
Q3
-----
mean -
Q4
-----
mean -
#10 Claude 4.6 Sonnet (extended thinking) 0.0%
Q1
00000
mean 0.00
Q2
00000
mean 0.00
Q3
-----
mean -
Q4
-----
mean -
#11 MiniMax 2.5 (max) 0.0%
Q1
00000
mean 0.00
Q2
00000
mean 0.00
Q3
-----
mean -
Q4
-----
mean -
#12 Qwen 3.5 27B 0.0%
Q1
-----
mean -
Q2
-----
mean -
Q3
-----
mean -
Q4
-----
mean -
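
The Majority Vote counts in the leaderboard look consistent with taking the most frequent grade among the five runs of each question, and counting the question as a tie when two grades share the top count. The sketch below illustrates that reading. The tie rule is inferred from the published numbers rather than documented, the 2/1/0 grade meaning is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Assumed meaning of each grade digit in a run strip.
LABELS = {2: "perfect", 1: "partial", 0: "wrong"}


def majority_label(strip: str) -> str:
    """Return the most frequent grade's label, or "tie" if the top count is shared."""
    counts = Counter(int(ch) for ch in strip).most_common()
    top_grade, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return "tie"
    return LABELS[top_grade]


def tally(strips: list[str]) -> dict[str, int]:
    """Count majority-vote outcomes across a model's questions."""
    result = Counter({"perfect": 0, "partial": 0, "wrong": 0, "tie": 0})
    result.update(majority_label(s) for s in strips)
    return dict(result)


# Q1 and Q2 run strips copied from the breakdown above.
print(tally(["02021", "22222"]))  # Gemini 3.1 Pro
print(tally(["10000", "01102"]))  # Gemma 4 31B (Q8)
```

For Gemini 3.1 Pro this gives one perfect question and one tie, and for Gemma 4 31B (Q8) one wrong question and one tie, matching the leaderboard rows for those models.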