SimpleBench

Introduction

We introduce SimpleBench, a multiple-choice text benchmark for LLMs on which individuals with unspecialized (high school) knowledge outperform SOTA models. SimpleBench includes over 200 questions covering spatio-temporal reasoning, social intelligence, and what we call linguistic adversarial robustness (trick questions). On the vast majority of text-based benchmarks, LLMs outperform non-specialized humans and increasingly exceed expert human performance. On SimpleBench, however, the non-specialized human baseline is 83.7%, based on our small sample of nine participants, outperforming all 13 tested LLMs, including o1-preview, which scored 41.7%. While we expect model performance to improve over time, the results of SimpleBench confirm that the memorized knowledge and approximate reasoning retrieval used by frontier LLMs are not yet always enough to answer basic questions.

Leaderboard

Rank	Model	Score (AVG@5)	Organization
-	Human Baseline*	83.7%
1st	Gemini 2.5 Pro (06-05)	62.4%	Google
2nd	Grok 4	60.5%	xAI
3rd	Claude 4.1 Opus	60.0%	Anthropic
4th	Claude 4 Opus (thinking)	58.8%	Anthropic
5th	GPT-5 (high)	56.7%	OpenAI
6th	o3 (high)	53.1%	OpenAI
7th	Gemini 2.5 Pro (03-25)	51.6%	Google
8th	Claude 3.7 Sonnet (thinking)	46.4%	Anthropic
9th	Claude 4 Sonnet (thinking)	45.5%	Anthropic
10th	Claude 3.7 Sonnet	44.9%	Anthropic
11th	o1-preview	41.7%	OpenAI
12th	Claude 3.5 Sonnet 10-22	41.4%	Anthropic
13th	DeepSeek R1 05/28	40.8%	DeepSeek
14th	o1-2024-12-17 (high)	40.1%	OpenAI
15th	o4-mini (high)	38.7%	OpenAI
16th	o1-2024-12-17 (med)	36.7%	OpenAI
17th	Grok 3	36.1%	xAI
18th	GPT-4.5	34.5%	OpenAI
19th	Gemini-exp-1206	31.1%	Google
20th	Qwen3 235B-A22B	31.0%	Alibaba
21st	DeepSeek R1	30.9%	DeepSeek
22nd	Gemini 2.0 Flash Thinking	30.7%	Google
23rd	Llama 4 Maverick	27.7%	Meta
24th	Claude 3.5 Sonnet 06-20	27.5%	Anthropic
25th	DeepSeek V3 03-24	27.2%	DeepSeek
26th	Gemini 1.5 Pro 002	27.1%	Google
27th	GPT-4.1	27.0%	OpenAI
28th	Kimi K2	26.3%	Kimi AI
29th	GPT-4 Turbo	25.1%	OpenAI
30th	Claude 3 Opus	23.5%	Anthropic
31st	Llama 3.1 405b instruct	23.0%	Meta
32nd	o3-mini (high)	22.8%	OpenAI
33rd	Grok 2	22.7%	xAI
34th	Mistral Large v2	22.5%	Mistral
35th	GPT-OSS 120B	22.1%	OpenAI
36th	Llama 3.3 70b instruct	19.9%	Meta
37th	DeepSeek V3	18.9%	DeepSeek
38th	Gemini 2.0 Flash Exp	18.9%	Google
39th	o1-mini	18.1%	OpenAI
40th	GPT-4o 08-06	17.8%	OpenAI
41st	Command R+	17.4%	Cohere
42nd	GPT-4o mini	10.7%	OpenAI

Sampling settings: temperature 0.7, top-p 0.95 (except o1 series).
*See the Human Evaluation section of the Report for details on how we calculated the Human Baseline.
**We also try an engineered prompt to optimize benchmark-specific performance. See the LLM Eval section of the Report for details.
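
As a rough illustration of how an AVG@5 score could be computed, the sketch below assumes that AVG@5 means the mean accuracy over five independent runs of the same question set; the page does not spell out the definition, so this reading is an assumption. The function names and the toy data are hypothetical and are not taken from the SimpleBench code.

```python
from statistics import mean


def run_accuracy(predictions, answers):
    """Fraction of questions answered correctly in a single run."""
    if not answers or len(predictions) != len(answers):
        raise ValueError("need one prediction per answer, and at least one answer")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)


def avg_at_k(runs, answers):
    """Mean of per-run accuracies (assumed meaning of AVG@k)."""
    return mean(run_accuracy(run, answers) for run in runs)


# Hypothetical toy data: 4 questions, 5 sampled runs of one model.
answers = ["A", "C", "B", "F"]
runs = [
    ["A", "C", "B", "F"],  # 4 of 4 correct
    ["A", "C", "D", "F"],  # 3 of 4 correct
    ["A", "B", "B", "E"],  # 2 of 4 correct
    ["A", "C", "B", "A"],  # 3 of 4 correct
    ["B", "C", "D", "F"],  # 2 of 4 correct
]

score = avg_at_k(runs, answers)
print(f"AVG@5 = {score:.1%}")
```

Averaging over several runs matters here because sampling at temperature 0.7 makes any single run noisy.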

Evaluating Reasoning and Prompting

Figure: Performance comparison of different models on selected benchmarks

To assess LLMs fairly, we standardized the prompt across all models, directing each model to reason step by step (chain-of-thought, CoT) and choose the most realistic answer. Additionally, we tested a benchmark-specific engineered prompt for select models. Prompt engineering yielded slight improvements, suggesting that while tailored prompts can aid performance, fundamental limitations remain. In the full report, we also hypothesize that the surprising underperformance of GPT-4o stems from optimizing for specific industrial applications (math and coding) at the expense of holistic reasoning.
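
The standardized-prompt setup described above might look roughly like the following sketch. The prompt wording, the `Final Answer: X` convention, the letter labels, and the helper names are all illustrative assumptions; the actual prompts are given in the full report and may differ.

```python
import re
import string


def build_prompt(question, options):
    """Format one multiple-choice question with a step-by-step instruction."""
    labels = string.ascii_uppercase[: len(options)]
    option_lines = [f"{label}. {text}" for label, text in zip(labels, options)]
    return (
        "Answer the following multiple-choice question. Think step by step, "
        "choose the most realistic answer, and end with a line of the form "
        "'Final Answer: X'.\n\n"
        f"{question}\n\n" + "\n".join(option_lines)
    )


def extract_answer(response):
    """Return the last 'Final Answer: X' letter in a response, or None."""
    matches = re.findall(r"Final Answer:\s*([A-Z])\b", response)
    return matches[-1] if matches else None


# Hypothetical usage with a made-up question.
prompt = build_prompt(
    "A glass is turned upside down over a coin. Where is the coin?",
    ["Inside the glass", "Under the glass, on the table"],
)
reply = "The glass is inverted, so the coin stays on the table.\nFinal Answer: B"
print(extract_answer(reply))
```

Taking the last match rather than the first lets a model revise its answer during its reasoning without being scored on an earlier guess.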

For a deeper dive into our results and our methods, check out the full technical report here.
