SimpleBench

Where Everyday Human Reasoning Still Surpasses Frontier Models

SimpleBench Team

Introduction

We introduce SimpleBench, a multiple-choice text benchmark for LLMs on which individuals with unspecialized (high-school) knowledge outperform SOTA models. SimpleBench includes over 200 questions covering spatio-temporal reasoning, social intelligence, and what we call linguistic adversarial robustness (or trick questions). On the vast majority of text-based benchmarks, LLMs outperform a non-specialized human and, increasingly, exceed expert human performance. On SimpleBench, however, the non-specialized human baseline is 83.7% (based on our small sample of nine participants), outperforming all 13 LLMs tested in our initial evaluation, including o1-preview, which scored 41.7%. While we expect model performance to improve over time, the SimpleBench results show that the memorized knowledge and approximate reasoning retrieval used by frontier LLMs are not always enough to answer basic questions just yet.
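
To make the task format concrete, here is a minimal sketch of how a SimpleBench-style multiple-choice item and its exact-match scoring could be represented in Python. The field names, the toy question, and the accuracy helper are illustrative assumptions, not the official data schema or actual benchmark items.

```python
# Minimal sketch of a SimpleBench-style item and exact-match scoring.
# Field names and the toy question are illustrative, not the official schema.
from dataclasses import dataclass

@dataclass
class Item:
    question: str        # natural-language reasoning question
    options: list[str]   # multiple-choice answer options
    answer_idx: int      # index of the correct option
    category: str        # e.g. "spatio-temporal", "social", "trick question"

def accuracy(items: list[Item], predictions: list[int]) -> float:
    """Fraction of items where the predicted option index matches the key."""
    correct = sum(pred == item.answer_idx for item, pred in zip(items, predictions))
    return correct / len(items)

# Toy example: one item answered correctly, one not -> 0.5 accuracy.
toy = [
    Item("A ball is placed on a table, and the table is then flipped upside down. "
         "Where is the ball most likely to end up?",
         ["Balanced on the upturned table", "On the floor", "Stuck to the table"],
         1, "spatio-temporal"),
    Item("placeholder second question", ["option A", "option B"], 0, "social"),
]
print(accuracy(toy, [1, 1]))  # 0.5
```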

Use all of these models on the SimpleBench app: LMcouncil.ai

Leaderboard

Rank Model Score (AVG@5) Organization
- Highest Human Score* 95.4%
- Human Baseline* 83.7%
1st (NEW) Gemini 3.1 Pro Preview 79.6% Google
2nd Gemini 3 Pro Preview 76.4% Google
3rd (NEW) Claude Opus 4.6 67.6% Anthropic
4th Gemini 2.5 Pro (06-05) 62.4% Google
5th Claude Opus 4.5 62.0% Anthropic
6th GPT-5 Pro 61.6% OpenAI
7th Gemini 3 Flash Preview 61.1% Google
8th Grok 4 60.5% xAI
9th Claude 4.1 Opus 60.0% Anthropic
10th Claude 4 Opus 58.8% Anthropic
11th GPT-5.2 Pro (xhigh) 57.4% OpenAI
12th GPT-5 (high) 56.7% OpenAI
13th Grok 4.1 Fast 56.0% xAI
14th Claude 4.5 Sonnet 54.3% Anthropic
15th GPT-5.1 (high) 53.2% OpenAI
16th (NEW) GLM 5 53.2% Zhipu AI
17th o3 (high) 53.1% OpenAI
18th DeepSeek 3.2 Speciale 52.6% DeepSeek
19th Gemini 2.5 Pro (03-25) 51.6% Google
20th GLM 4.7 47.7% Zhipu AI
21st (NEW) Kimi K2.5 46.8% Moonshot AI
22nd Claude 3.7 Sonnet (thinking) 46.4% Anthropic
23rd GPT-5.2 (high) 45.8% OpenAI
24th Claude 4 Sonnet (thinking) 45.5% Anthropic
25th Claude 3.7 Sonnet 44.9% Anthropic
26th o1-preview 41.7% OpenAI
27th Claude 3.5 Sonnet 10-22 41.4% Anthropic
28th Gemini 2.5 Flash (latest) 41.2% Google
29th DeepSeek R1 05-28 40.8% DeepSeek
30th o1-2024-12-17 (high) 40.1% OpenAI
31st DeepSeek V3.1 40.0% DeepSeek
32nd Kimi K2 Thinking 39.6% Moonshot AI
33rd o4-mini (high) 38.7% OpenAI
34th o1-2024-12-17 (med) 36.7% OpenAI
35th Grok 3 36.1% xAI
36th MiniMax M2.1 34.7% MiniMax
37th GPT-4.5 34.5% OpenAI
38th Gemini-exp-1206 31.1% Google
39th Qwen3 235B-A22B 31.0% Alibaba
40th DeepSeek R1 30.9% DeepSeek
41st Gemini 2.0 Flash Thinking 30.7% Google
42nd Llama 4 Maverick 27.7% Meta
43rd Claude 3.5 Sonnet 06-20 27.5% Anthropic
44th DeepSeek V3 03-24 27.2% DeepSeek
45th Gemini 1.5 Pro 002 27.1% Google
46th GPT-4.1 27.0% OpenAI
47th Kimi K2 26.3% Moonshot AI
48th GPT-4 Turbo 25.1% OpenAI
49th MiniMax M2 25.0% MiniMax
50th Claude 3 Opus 23.5% Anthropic
51st Llama 3.1 405b instruct 23.0% Meta
52nd o3-mini (high) 22.8% OpenAI
53rd Grok 2 22.7% xAI
54th Mistral Large v2 22.5% Mistral
55th GPT-OSS 120B 22.1% OpenAI
56th Mistral Large 3 20.4% Mistral
57th Llama 3.3 70b instruct 19.9% Meta
58th DeepSeek V3 18.9% DeepSeek
59th Gemini 2.0 Flash Exp 18.9% Google
60th o1-mini 18.1% OpenAI
61st GPT-4o 08-06 17.8% OpenAI
62nd Command R+ 17.4% Cohere
63rd GPT-4o mini 10.7% OpenAI
temperature: 0.7, top-p: 0.95 (except o1 series)
*See the Human Evaluation section of the Report for details on how we calculated the Human Baseline.
**For select models we also test an engineered prompt to optimize benchmark-specific performance. See the LLM Eval section of the Report for details.
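
For readers who want to reproduce the headline numbers, the sketch below gives one plausible reading of the Score (AVG@5) column: run the full question set five times with the sampling parameters above and average the per-run accuracies. The query_model helper, the answer-extraction rule, and the question dictionary keys are hypothetical placeholders, not the exact harness behind the leaderboard.

```python
# Illustrative AVG@5 scoring: average accuracy over five independent runs
# at temperature 0.7 / top-p 0.95. query_model() is a hypothetical
# placeholder, not a real provider API.
import re
from statistics import mean

def query_model(prompt: str, temperature: float = 0.7, top_p: float = 0.95) -> str:
    """Placeholder for a call to the LLM under evaluation."""
    raise NotImplementedError("wire this up to the provider of your choice")

def extract_choice(response: str) -> str | None:
    """Take the last standalone answer letter (A-F) in the model's response."""
    letters = re.findall(r"\b[A-F]\b", response)
    return letters[-1] if letters else None

def avg_at_5(questions: list[dict]) -> float:
    """Run the full question set five times and average the per-run accuracies."""
    run_accuracies = []
    for _ in range(5):
        correct = 0
        for q in questions:
            choice = extract_choice(query_model(q["prompt"]))
            correct += (choice == q["gold_letter"])
        run_accuracies.append(correct / len(questions))
    return mean(run_accuracies)
```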

Video Summary

[Embedded video]

Evaluating Reasoning and Prompting

[Figure: Performance comparison of different models on selected benchmarks]

To assess LLMs fairly, we standardized the prompt across all models, directing them to reason step by step (chain of thought, CoT) and choose the most realistic answer. Additionally, we tested a benchmark-specific engineered prompt on select models. Prompt engineering showed only slight improvements, suggesting that while tailored prompts can aid performance, fundamental limitations remain. In the full report, we also hypothesize that the surprising underperformance of GPT-4o stems from optimizing for specific industrial applications (math and coding) at the expense of holistic reasoning.
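
To illustrate the two prompting conditions, the sketch below pairs a standardized chain-of-thought template with an engineered, benchmark-specific variant; the wording of both templates and the build_prompt helper are illustrative paraphrases, not the verbatim prompts used for the leaderboard.

```python
# Hypothetical paraphrases of the standardized CoT prompt and an engineered,
# benchmark-specific variant; wording is illustrative, not the exact prompts.
STANDARD_COT_PROMPT = (
    "You are answering a multiple-choice question. Think step by step, "
    "then choose the single most realistic answer.\n\n"
    "{question}\n{options}\n\n"
    "Finish with: 'Final answer: <letter>'."
)

ENGINEERED_PROMPT = (
    "These questions often contain distractors and trick phrasing. First restate "
    "what is literally being asked, consider what would happen in the real world, "
    "then think step by step and choose the single most realistic answer.\n\n"
    "{question}\n{options}\n\n"
    "Finish with: 'Final answer: <letter>'."
)

def build_prompt(question: str, options: list[str], engineered: bool = False) -> str:
    """Fill either template with the question text and lettered options."""
    lettered = "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options))
    template = ENGINEERED_PROMPT if engineered else STANDARD_COT_PROMPT
    return template.format(question=question, options=lettered)
```

In practice, the same standardized template is sent to every model, so differences on the leaderboard reflect the models themselves rather than per-model prompt tuning.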

For a deeper dive into our results and our methods, check out the full technical report here.