i made agi take the SAT

what happens when u give today’s frontier AI models a standardized test designed for 17 year olds? i tested 20 AI models from US and Chinese providers on the official SAT Practice Test #10 from College Board.

spoiler: they’re all above average. but none of them are getting into MIT.


robot taking test
Imagen 4 prompt: robot taking standardized test, stressed, pencil in hand, scantron sheet, realistic

tldr

  • Claude Opus 4.5 wins with 57.6% accuracy (630 score, 84th percentile)
  • Chinese models tied for 3rd place: DeepSeek V3.2, GLM-4.7, and Kimi K2.5 all hit 624
  • all models beat the human average of 530 but none cracked 650
  • OpenAI’s GPT-5.2 placed… 14th. fourteenth. behind six Chinese models.
  • Google’s Pro models refused to take the test bc “safety concerns” lmao

the setup

i built a benchmark script that tests models on the SAT’s Reading and Writing section (two modules, 66 questions total). this includes:
- vocabulary in context (what does “obfuscate” mean in paragraph 3)
- reading comprehension (what is the author’s main point)
- text structure (why did the author use this transition)
- grammar (semicolons are hard apparently)
- rhetorical synthesis (combine these two passages into a coherent argument)

to make sure results weren’t just random variance, each question was posed to each model 4-8 times, with the final answer determined by majority voting.
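
the voting step itself is tiny. a minimal sketch (the helper name and the agreement-as-confidence framing are mine, not the exact benchmark code):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """pick the most common answer letter; report agreement as a rough confidence."""
    votes = Counter(a for a in answers if a in "ABCD")
    if not votes:
        return "?", 0.0  # no parseable answer in any iteration
    letter, count = votes.most_common(1)[0]
    return letter, count / len(answers)

# five of six runs agreeing on "C" yields ("C", 0.833...)
print(majority_vote(["C", "C", "B", "C", "C", "C"]))
```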

providers tested:
- US: OpenAI, Anthropic, Google, xAI
- China: DeepSeek, Zhipu AI (GLM), Moonshot (Kimi), MiniMax, ByteDance, Baidu, Alibaba (Qwen)

methodology note: i used temperature=0.3 where supported, direct API calls (Silicon Flow for Chinese models), and no chain-of-thought prompting. just “here’s the passage, here’s the question, give me A/B/C/D.” like a real standardized test. no hand-holding.
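
concretely, each individual call looked roughly like this. a sketch assuming an OpenAI-compatible chat endpoint; the prompt wording and the ask_once name are my approximations of the setup above, not the real script:

```python
from openai import OpenAI

# for Chinese models, point base_url at Silicon Flow's OpenAI-compatible endpoint
client = OpenAI()

PROMPT = (
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n\n"
    "Answer with a single letter: A, B, C, or D."
)

def ask_once(model: str, q: dict) -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0.3,  # where supported
        messages=[{"role": "user", "content": PROMPT.format(**q)}],
    )
    text = resp.choices[0].message.content.strip().upper()
    return next((ch for ch in text if ch in "ABCD"), "?")

# 4-8 calls per question, then majority_vote() from above picks the final answer
```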


the results

at a glance

where do AI models fall on the bell curve?

below is the distribution of human SAT scores (Reading & Writing section), with the 20 AI models plotted at their equivalent positions. hover over the dots for deets.

all models scored above the human mean of 530, placing them between the 76th and 84th percentiles.

in other words: every AI model tested would beat roughly 3 out of 4 human test takers.

but also: none of them are getting into a top 20 school. the tightest clustering occurred between scores 612-630, suggesting some kind of ceiling for current architectures on this task.
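
side note on the percentile math: the numbers here are consistent with a plain normal approximation of the human Reading & Writing distribution. the 530 mean is from above; the ~100-point standard deviation is my assumption for illustration:

```python
from statistics import NormalDist

human = NormalDist(mu=530, sigma=100)  # sigma ~100 is an assumption, not a College Board figure

for score in (600, 612, 618, 624, 630):
    print(score, f"{human.cdf(score):.0%}")
# 600 -> 76%, 612 -> 79%, 618 -> 81%, 624 -> 83%, 630 -> 84%
```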

(charts: model ranking, score tiers, full results table - the full leaderboard is reproduced further down)


ok but what does this actually mean

anthropic takes the crown (barely)

Claude Opus 4.5 achieved the highest accuracy at 57.6%, translating to a 630 SAT score (84th percentile). this represents a narrow but consistent lead over the rest of the pack.

but here’s the thing: the margin is tiny. the difference between 1st place (57.6%) and 2nd place (56.2%) is literally one question (out of 66, each question is worth about 1.5 percentage points).

chinese models are competitive af

US vs China AI competition
Imagen 4 prompt: two groups of robots facing each other across a conference table, red vs blue lighting, holographic scoreboards

the biggest story here is how well Chinese models performed:

| Model | Provider | Score | Notes |
|---|---|---|---|
| DeepSeek V3.2 | DeepSeek | 624 | tied for 3rd among all models |
| GLM-4.7 | Zhipu AI | 624 | tied for 3rd among all models |
| Kimi K2.5 | Moonshot | 624 | tied for 3rd among all models |
| MiniMax M2.1 | MiniMax | 618 | beat GPT-5.2 |
| Seed-36B | ByteDance | 618 | beat GPT-5.2 |
| ERNIE-4.5 | Baidu | 618 | beat GPT-5.2 |

three chinese models tied with xAI’s Grok models for 3rd place. this is… not nothing. deepseek, zhipu, and moonshot are all producing models that outperform openai’s flagship on standardized tests.

wtf openai (still)

robot looking dejected at test results
Imagen 4 prompt: dejected robot at desk, crumpled test papers, other robots celebrating in background

the embarrassment continues: OpenAI’s flagship GPT-5.2 placed 14th overall at 53.0% accuracy - behind six Chinese models, three xAI models, two Google models, and two Anthropic models.

| Model | Accuracy | My Reaction |
|---|---|---|
| GPT-5.2 | 53.0% | beat by deepseek… |
| GPT-4.1 | 53.0% | same as 5.2??? |
| GPT-5.1 | 50.0% | tied with qwq-32b |

this suggests openai’s recent model improvements focused on… something else? coding? reasoning? certainly not standardized test performance.

deepseek is the real deal

DeepSeek V3.2 achieved 56.1% accuracy with 97% confidence - matching Grok 4 Fast and beating every OpenAI model. this is particularly notable because:

  1. deepseek is significantly cheaper than US alternatives
  2. they achieved this with an open-weights model
  3. the V3 architecture is apparently very good at language comprehension

alibaba’s qwen had issues

Qwen3-235B completely failed the benchmark (0% accuracy) - likely due to API issues or the model returning non-standard responses. QwQ-32B (the reasoning variant) managed 50%, tying with GPT-5.1 for last place among successful models.

the qwen team might want to look into that.
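
for what it’s worth, this is the kind of guard a harness needs so one chatty or malformed model doesn’t silently tank a run. a sketch with illustrative retry logic (ask_with_retries is my name, and ask_once comes from the setup sketch earlier):

```python
def ask_with_retries(model: str, q: dict, tries: int = 3) -> str:
    """retry when a model returns something that isn't a clean A/B/C/D answer."""
    for _ in range(tries):
        letter = ask_once(model, q)  # returns "?" when no A-D letter is found
        if letter in ("A", "B", "C", "D"):
            return letter
    return "INVALID"  # scored as wrong, and surfaced in the per-model failure stats
```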


the stats

  • 84th - Top Model Percentile
  • 630 - Highest SAT Score
  • 97% - Avg. Confidence
  • 20 - Models Tested

all models scored above the human average (530), but none achieved the 700+ scores that top human test-takers regularly hit.

for reference: if u got a 624-630 on SAT reading/writing, ur looking at state schools and maybe some lower-tier privates. MIT average is 760. harvard is 750.

AGI is above average but not ivy league material


us vs china: the leaderboard

(chart: by provider)

the full leaderboard

| Rank | Model | Provider | Country | Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | US | 630 |
| 2 | Gemini 3 Flash | Google | US | 625 |
| 3 | Grok 4 Fast | xAI | US | 624 |
| 3 | Grok 3 Mini | xAI | US | 624 |
| 3 | DeepSeek V3.2 | DeepSeek | China | 624 |
| 3 | GLM-4.7 | Zhipu AI | China | 624 |
| 3 | Kimi K2.5 | Moonshot | China | 624 |
| 8 | Claude Sonnet 4.5 | Anthropic | US | 618 |
| 8 | Gemini 2.5 Flash | Google | US | 618 |
| 8 | Grok 4 | xAI | US | 618 |
| 8 | MiniMax M2.1 | MiniMax | China | 618 |
| 8 | Seed-36B | ByteDance | China | 618 |
| 8 | ERNIE-4.5 | Baidu | China | 618 |
| 14 | GPT-5.2 | OpenAI | US | 612 |
| 14 | GPT-4.1 | OpenAI | US | 612 |
| 14 | Claude Sonnet 4 | Anthropic | US | 612 |
| 14 | Grok 3 | xAI | US | 612 |
| 18 | GPT-5.1 | OpenAI | US | 600 |
| 18 | Claude Haiku 4.5 | Anthropic | US | 600 |
| 18 | QwQ-32B | Alibaba | China | 600 |

takeaway: chinese models are genuinely competitive. 6 of the top 13 models are from chinese companies. the “china is behind in AI” narrative needs updating.


methodology notes for the nerds

futuristic lab with data visualizations
Imagen 4 prompt: bird’s eye view of futuristic lab, holographic charts, blue and purple neon accents

- **Test Source:** College Board SAT Practice Test #10
- **Questions Used:** 66 (Reading & Writing Modules 1 and 2)
- **Iterations Per Question:** 4-8 (majority voting for final answer)
- **Answer Selection:** Direct letter response (A/B/C/D)
- **Parallel Processing:** ThreadPoolExecutor with 3-10 concurrent workers
- **Chinese Models API:** Silicon Flow (OpenAI-compatible)
- **Temperature:** 0.3 where supported
- **Total API Calls:** ~10,000+
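
the parallel part is plain ThreadPoolExecutor, roughly this shape. run_benchmark and answer_question are placeholder names of mine (answer_question stands for the ask-several-times-and-vote loop sketched earlier), and the worker count gets dialed down for providers with tighter rate limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def answer_question(model: str, q: dict, iterations: int = 6) -> str:
    # ask_once / majority_vote from the earlier sketches
    letter, _confidence = majority_vote([ask_once(model, q) for _ in range(iterations)])
    return letter

def run_benchmark(model: str, questions: list[dict], workers: int = 5) -> dict[int, str]:
    """fan out the questions, a few in flight at once, collecting answers by index."""
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(answer_question, model, q): i for i, q in enumerate(questions)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```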

models that couldn’t even show up

| Model | Provider | Issue |
|---|---|---|
| Gemini 3 Pro | Google | Safety filter blocked questions |
| Gemini 2.5 Pro | Google | Safety filter blocked questions |
| o3 | OpenAI | API compatibility issues |
| o4-mini | OpenAI | API compatibility issues |
| Qwen3-235B | Alibaba | API/response format issues |

ok what did we learn

  • ☑️ Claude Opus 4.5 is currently the best at SAT reading comprehension
  • ☑️ Chinese models are genuinely competitive - 3 tied for 2nd place
  • ☑️ DeepSeek, GLM, and Kimi all outperform OpenAI’s flagships
  • ☑️ all frontier models perform above human average (76th-84th percentile)
  • ☑️ no model cracked the 650+ scores that top human performers achieve
  • ☑️ OpenAI’s GPT-5 series continues to underperform on benchmarks
  • ☑️ there appears to be a ~630 ceiling for current architectures

the meta-lesson: the AI race is genuinely global now. chinese models aren’t just “catching up” - on standardized language tests, they’re already competitive with (and sometimes beating) US alternatives.


further work

things id like to try:
- chain-of-thought prompting (let them reason out loud; rough prompt sketch after this list)
- math section benchmarking (need to handle diagrams/figures)
- more chinese models (yi-lightning, ernie-4, etc.)
- testing on adaptive difficulty formats like the real digital SAT
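
for the chain-of-thought item, the only change would be the prompt and the parsing. something like this, where the "Final answer:" convention is my assumption, not a tested setup:

```python
import re

COT_PROMPT = (
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n\n"
    "Think it through step by step, then end with 'Final answer: X' where X is A, B, C, or D."
)

def extract_final_answer(text: str) -> str | None:
    """pull the letter out of a free-form reasoning response."""
    match = re.search(r"final answer[:\s]+([ABCD])", text, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None
```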

if u have GPU credits to spare, feel free to run these experiments urself. benchmark code available on request.


Benchmark run on January 17-February 4, 2026. Chinese models tested via Silicon Flow API. No AI models were harmed in the making of this blog post, though OpenAI’s flagship was mildly embarrassed by its performance.