Common-sense trick questions for LLMs. Can they see past the surface?
This benchmark consists of two questions: harder, modified variants of the well-known car wash trick question. The goal is to test whether models can reason past surface-level cues to reach the genuinely correct answer.
Given the small question set, saturation is expected relatively quickly. I may expand the benchmark with additional questions of the same style.
The ✅ Perfect / 🟡 Partial / ❌ Wrong / Ties columns count per-question majority-vote outcomes across the two questions; Raw Mean is the mean score summed over both questions.

| # | Model | Raw Mean | ✅ Perfect | 🟡 Partial | ❌ Wrong | Ties |
|---|---|---|---|---|---|---|
| 🥇 | Gemini 3.1 Pro | 1.50 | 1 | 0 | 0 | 1 |
| 🥈 | GLM 5.0 | 1.10 | 1 | 0 | 0 | 1 |
| 🥉 | Gemini 3 Flash | 0.30 | 0 | 0 | 2 | 0 |
| #4 | Claude 4.6 Opus (extended thinking) | 0.20 | 0 | 0 | 2 | 0 |
| #5 | GPT-5.4 Thinking (Medium) | 0.00 | 0 | 0 | 2 | 0 |
| #6 | o3 | 0.00 | 0 | 0 | 2 | 0 |
| #7 | Claude 4.6 Sonnet (extended thinking) | 0.00 | 0 | 0 | 2 | 0 |
| #8 | MiniMax 2.5 (max) | 0.00 | 0 | 0 | 2 | 0 |
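For reference, here is a minimal sketch of how the Raw Mean and Majority Vote columns could be computed from repeated samples per question. The grade labels, the point values (perfect = 1.0, partial = 0.5, wrong = 0.0), and the tie rule are illustrative assumptions, not the benchmark's published rubric.

```python
from collections import Counter

# Assumed point values per grade; the actual rubric is not published here.
POINTS = {"perfect": 1.0, "partial": 0.5, "wrong": 0.0}

def score_run(grades_per_question):
    """Aggregate per-sample grades into leaderboard-style columns.

    grades_per_question: one list of grade strings per question, e.g.
    [["perfect", "perfect", "partial"], ["wrong", "wrong"]].
    Returns (raw_mean, majority_counts), where majority_counts tallies the
    per-question majority-vote outcome, with "tie" when no grade wins.
    """
    raw_mean = 0.0
    majority = Counter({"perfect": 0, "partial": 0, "wrong": 0, "tie": 0})
    for grades in grades_per_question:
        # Raw mean: average points over samples, summed over questions.
        raw_mean += sum(POINTS[g] for g in grades) / len(grades)
        # Majority vote: most frequent grade; a tie if the top two grades
        # occur equally often.
        counts = Counter(grades).most_common()
        if len(counts) > 1 and counts[0][1] == counts[1][1]:
            majority["tie"] += 1
        else:
            majority[counts[0][0]] += 1
    return raw_mean, majority

if __name__ == "__main__":
    # Hypothetical samples for a two-question run.
    demo = [["perfect", "perfect", "perfect"], ["perfect", "wrong", "partial"]]
    print(score_run(demo))
```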