SimpleBench

Where Everyday Human Reasoning Still Surpasses Frontier Models

SimpleBench Team

Introduction

We introduce SimpleBench, a multiple-choice text benchmark for LLMs on which individuals with unspecialized (high school) knowledge outperform SOTA models. SimpleBench includes over 200 questions covering spatio-temporal reasoning, social intelligence, and what we call linguistic adversarial robustness (or trick questions). For the vast majority of text-based benchmarks, LLMs outperform a non-specialized human and, increasingly, exceed expert human performance. On SimpleBench, however, the non-specialized human baseline is 83.7%, based on our small sample of nine participants, outperforming all 13 tested LLMs, including o1-preview, which scored 41.7%. While we expect model performance to improve over time, the results of SimpleBench confirm that the memorized knowledge and approximate reasoning retrieval that frontier LLMs rely on are not always enough to answer basic questions just yet.
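
To make the setup concrete, here is a minimal sketch of how a SimpleBench-style multiple-choice item and the accuracy score reported on the leaderboard might be represented. The field names and record layout are illustrative assumptions, not the actual dataset schema.

```python
from dataclasses import dataclass

# Illustrative item layout; the real SimpleBench schema may differ.
@dataclass
class Question:
    prompt: str          # scenario / trick-question text
    choices: list[str]   # answer options, e.g. labelled A-F
    answer: str          # letter of the correct option

def accuracy(questions: list[Question], predictions: list[str]) -> float:
    """Fraction of questions answered correctly (the score shown on the leaderboard)."""
    correct = sum(q.answer == p for q, p in zip(questions, predictions))
    return correct / len(questions)
```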

Powered by Weave, from Weights & Biases.

Leaderboard

Rank Model Score (AVG@5) Organization
- Human Baseline* 83.7%
1st Gemini 3 Pro Preview 76.4% Google
2nd Gemini 2.5 Pro (06-05) 62.4% Google
3rd Claude Opus 4.5 62.0% Anthropic
4th GPT-5 Pro 61.6% OpenAI
5th Grok 4 60.5% xAI
6th Claude 4.1 Opus 60.0% Anthropic
7th Claude 4 Opus 58.8% Anthropic
8th (new) GPT-5.2 Pro (xhigh) 57.4% OpenAI
9th GPT-5 (high) 56.7% OpenAI
10th Grok 4.1 Fast 56.0% xAI
11th Claude 4.5 Sonnet 54.3% Anthropic
12th GPT-5.1 (high) 53.2% OpenAI
13th o3 (high) 53.1% OpenAI
14th (new) DeepSeek 3.2 Speciale 52.6% DeepSeek
15th Gemini 2.5 Pro (03-25) 51.6% Google
16th Claude 3.7 Sonnet (thinking) 46.4% Anthropic
17th (new) GPT-5.2 (high) 45.8% OpenAI
18th Claude 4 Sonnet (thinking) 45.5% Anthropic
19th Claude 3.7 Sonnet 44.9% Anthropic
20th o1-preview 41.7% OpenAI
21st Claude 3.5 Sonnet 10-22 41.4% Anthropic
22nd Gemini 2.5 Flash (latest) 41.2% Google
23rd DeepSeek R1 05/28 40.8% DeepSeek
24th o1-2024-12-17 (high) 40.1% OpenAI
25th DeepSeek V3.1 40.0% DeepSeek
26th Kimi K2 Thinking 39.6% Moonshot AI
27th o4-mini (high) 38.7% OpenAI
28th o1-2024-12-17 (med) 36.7% OpenAI
29th Grok 3 36.1% xAI
30th GPT-4.5 34.5% OpenAI
31st Gemini-exp-1206 31.1% Google
32nd Qwen3 235B-A22B 31.0% Alibaba
33rd DeepSeek R1 30.9% DeepSeek
34th Gemini 2.0 Flash Thinking 30.7% Google
35th Llama 4 Maverick 27.7% Meta
36th Claude 3.5 Sonnet 06-20 27.5% Anthropic
37th DeepSeek V3 03-24 27.2% DeepSeek
38th Gemini 1.5 Pro 002 27.1% Google
39th GPT-4.1 27.0% OpenAI
40th Kimi K2 26.3% Moonshot AI
41st GPT-4 Turbo 25.1% OpenAI
42nd MiniMax M2 25.0% MiniMax
43rd Claude 3 Opus 23.5% Anthropic
44th Llama 3.1 405b instruct 23.0% Meta
45th o3-mini (high) 22.8% OpenAI
46th Grok 2 22.7% xAI
47th Mistral Large v2 22.5% Mistral
48th GPT-OSS 120B 22.1% OpenAI
49th (new) Mistral Large 3 20.4% Mistral
50th Llama 3.3 70b instruct 19.9% Meta
51st DeepSeek V3 18.9% DeepSeek
52nd Gemini 2.0 Flash Exp 18.9% Google
53rd o1-mini 18.1% OpenAI
54th GPT-4o 08-06 17.8% OpenAI
55th Command R+ 17.4% Cohere
56th GPT-4o mini 10.7% OpenAI
Sampling settings: temperature 0.7, top-p 0.95 (except the o1 series)
*See the Human Evaluation section of the Report for details on how we calculated the Human Baseline.
**For select models, we also test an engineered prompt to optimize benchmark-specific performance. See the LLM Eval section of the Report for details.

Video Summary

Evaluating Reasoning and Prompting

Figure: Performance comparison of different models on selected benchmarks.

To assess LLMs fairly, we standardized prompts across all models, directing them to reason step-by-step (chain-of-thought) and choose the most realistic answer. Additionally, we tested a benchmark-specific engineered prompt for select models. Prompt engineering showed only slight improvements, suggesting that while tailored prompts can aid performance, fundamental limitations remain. In the full report, we also hypothesize that the surprising underperformance of GPT-4o stems from optimizing for specific industrial applications (math and coding) at the expense of holistic reasoning.
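
As a rough sketch of the evaluation protocol described above and in the leaderboard footnotes, the loop below samples each model five times at temperature 0.7 / top-p 0.95 and averages accuracy across runs (AVG@5). The `query_model` callable, the prompt wording, and the answer-extraction regex are illustrative placeholders, not the exact harness used for the report.

```python
import re
import statistics

# Illustrative standardized chain-of-thought instruction; the report's actual
# prompt wording may differ.
SYSTEM_PROMPT = (
    "Think step by step, then choose the single most realistic answer. "
    "Finish with 'Final answer: <letter>'."
)

def extract_letter(completion: str) -> str | None:
    """Pull the final answer letter out of a model completion."""
    match = re.search(r"Final answer:\s*([A-F])", completion, re.IGNORECASE)
    return match.group(1).upper() if match else None

def avg_at_5(questions, query_model, runs: int = 5) -> float:
    """Average accuracy over `runs` independent passes (AVG@5 on the leaderboard).

    `questions` is any iterable of items with `prompt` and `answer` attributes
    (as in the earlier sketch); `query_model` is a placeholder for whichever
    provider API is being called.
    """
    run_scores = []
    for _ in range(runs):
        correct = 0
        for q in questions:
            completion = query_model(
                system=SYSTEM_PROMPT,
                user=q.prompt,
                temperature=0.7,   # fixed for all models except the o1 series
                top_p=0.95,
            )
            if extract_letter(completion) == q.answer:
                correct += 1
        run_scores.append(correct / len(questions))
    return statistics.mean(run_scores)
```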

For a deeper dive into our results and our methods, check out the full technical report here.