🚗 CarWashBench v0.1

Common-sense trick questions for LLMs. Can they see past the surface?

Mode: Run Mean · Partial credit: 0.25 · Questions: 2 · Models: 8

This benchmark consists of 2 questions: harder, modified variants of the well-known car wash trick question. The goal is to test whether models can reason past surface-level cues to the genuinely correct answer.

Given the small question set, saturation is expected relatively quickly. I may expand the benchmark with additional questions of the same style.
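The reported percentages are consistent with a simple aggregation: each run is graded 0 (wrong), 1 (partial), or 2 (perfect), a partial run is worth the stated 0.25 of full credit, and the score is the mean credit across all runs and questions. A minimal sketch of that assumed scheme (the mapping and function names are illustrative, not from the benchmark's actual code):

```python
# Assumed credit mapping: run grade -> fraction of full credit.
# "Partial credit: 0.25" is taken to mean a partial (grade-1) run
# is worth 0.25; a perfect (grade-2) run is worth 1.0.
CREDIT = {0: 0.0, 1: 0.25, 2: 1.0}

def question_mean(runs):
    """Mean credit for one question across its runs."""
    return sum(CREDIT[r] for r in runs) / len(runs)

def score_percent(questions):
    """Benchmark score: mean over questions of per-question run means, as a %."""
    return 100 * sum(question_mean(q) for q in questions) / len(questions)

# Gemini 3.1 Pro's per-run grades from the breakdown below:
gemini_pro = [[0, 2, 0, 2, 1], [2, 2, 2, 2, 2]]
print(score_percent(gemini_pro))  # ≈ 72.5
```

Under this interpretation the formula reproduces every leaderboard percentage (e.g. GLM 5.0's runs give 47.5%), which is why it is offered here as the likely scheme.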

Leaderboard

The last four columns are per-question majority-vote outcomes across runs: ✅ Perfect / 🟡 Partial / ❌ Wrong / Ties.

| # | Model | Score % | Raw Mean | ✅ Perfect | 🟡 Partial | ❌ Wrong | Ties |
|---|-------|---------|----------|-----------|-----------|----------|------|
| 🥇 | Gemini 3.1 Pro | 72.5% | 1.50 | 1 | 0 | 0 | 1 |
| 🥈 | GLM 5.0 | 47.5% | 1.10 | 1 | 0 | 0 | 1 |
| 🥉 | Gemini 3 Flash | 12.5% | 0.30 | 0 | 0 | 2 | 0 |
| #4 | Claude 4.6 Opus (extended thinking) | 10.0% | 0.20 | 0 | 0 | 2 | 0 |
| #5 | GPT-5.4 Thinking (Medium) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #6 | o3 | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #7 | Claude 4.6 Sonnet (extended thinking) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
| #8 | MiniMax 2.5 (max) | 0.0% | 0.00 | 0 | 0 | 2 | 0 |
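The majority-vote columns suggest that each question's outcome is the modal run grade, with a tie declared when no single grade wins outright. A minimal sketch under that assumption (the function name and labels are illustrative):

```python
from collections import Counter

def majority_outcome(runs):
    """Modal run grade for one question; 'tie' when there is no unique mode.
    Assumes runs are graded 0 (wrong), 1 (partial), or 2 (perfect)."""
    counts = Counter(runs).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return {0: "wrong", 1: "partial", 2: "perfect"}[counts[0][0]]

print(majority_outcome([2, 2, 2, 2, 2]))  # prints "perfect"
print(majority_outcome([0, 2, 0, 2, 1]))  # prints "tie" (0 and 2 each appear twice)
```

This reproduces the table above, e.g. Gemini 3.1 Pro's Q1 runs (0, 2, 0, 2, 1) count as a tie while its Q2 runs (all 2s) count as perfect, giving the listed 1 Perfect / 1 Tie.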

Per-Model Breakdown

Per-run grades (0 = wrong, 1 = partial, 2 = perfect) across 5 runs per question.

| # | Model | Score | Q1 runs | Q1 mean | Q2 runs | Q2 mean |
|---|-------|-------|---------|---------|---------|---------|
| 🥇 | Gemini 3.1 Pro | 72.5% | 0 2 0 2 1 | 1.00 | 2 2 2 2 2 | 2.00 |
| 🥈 | GLM 5.0 | 47.5% | 1 2 0 1 0 | 0.80 | 2 2 1 0 2 | 1.40 |
| 🥉 | Gemini 3 Flash | 12.5% | 0 0 0 0 0 | 0.00 | 2 0 0 0 1 | 0.60 |
| #4 | Claude 4.6 Opus (extended thinking) | 10.0% | 0 0 0 0 0 | 0.00 | 0 0 0 0 2 | 0.40 |
| #5 | GPT-5.4 Thinking (Medium) | 0.0% | 0 0 0 0 0 | 0.00 | 0 0 0 0 0 | 0.00 |
| #6 | o3 | 0.0% | 0 0 0 0 0 | 0.00 | 0 0 0 0 0 | 0.00 |
| #7 | Claude 4.6 Sonnet (extended thinking) | 0.0% | 0 0 0 0 0 | 0.00 | 0 0 0 0 0 | 0.00 |
| #8 | MiniMax 2.5 (max) | 0.0% | 0 0 0 0 0 | 0.00 | 0 0 0 0 0 | 0.00 |