i made agi take the SAT
what happens when u give today’s frontier AI models a standardized test designed for 17-year-olds? i tested 20 AI models from US and Chinese providers on the official SAT Practice Test #10 from the College Board.
spoiler: they’re all above average. but none of them are getting into MIT.

Imagen 4 prompt: robot taking standardized test, stressed, pencil in hand, scantron sheet, realistic
tldr
- Claude Opus 4.5 wins with 57.6% accuracy (630 score, 84th percentile)
- Chinese models tied for 2nd place: DeepSeek V3.2, GLM-4.7, and Kimi K2.5 all hit 624
- all models beat the human average of 530 but none cracked 650
- OpenAI’s GPT-5.2 placed… 14th. fourteenth. behind six Chinese models.
- Google’s Pro models refused to take the test bc “safety concerns” lmao
the setup
i built a benchmark script that tests models on the SAT’s Reading and Writing sections (66 questions total). this includes:
- vocabulary in context (what does “obfuscate” mean in paragraph 3)
- reading comprehension (what is the author’s main point)
- text structure (why did the author use this transition)
- grammar (semicolons are hard apparently)
- rhetorical synthesis (combine these two passages into a coherent argument)
to make sure results weren’t just random variance, each question was posed to each model 4-8 times, with the final answer determined by majority voting.
providers tested:
- US: OpenAI, Anthropic, Google, xAI
- China: DeepSeek, Zhipu AI (GLM), Moonshot (Kimi), MiniMax, ByteDance, Baidu, Alibaba (Qwen)
methodology note: i used temperature=0.3 where supported, direct API calls (Silicon Flow for Chinese models), and no chain-of-thought prompting. just “here’s the passage, here’s the question, give me A/B/C/D.” like a real standardized test. no hand-holding.
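for the curious, here’s roughly what the harness does. this is a minimal sketch, not the exact script: `call_model()` is a stand-in for whatever client you point at a provider (an openai-compatible chat endpoint, Silicon Flow for the chinese models), and the prompt template, helper names, and 5-sample default are illustrative rather than verbatim.

```python
import re
from collections import Counter

PROMPT_TEMPLATE = (
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n\n"
    "Answer with a single letter: A, B, C, or D."
)

def call_model(model: str, prompt: str, temperature: float = 0.3) -> str:
    """Stand-in for the actual API call (an OpenAI-compatible chat endpoint;
    Silicon Flow for the Chinese models). Returns the raw text response."""
    raise NotImplementedError("plug in your provider client here")

def extract_choice(response: str) -> str | None:
    """Pull the first standalone A/B/C/D out of a response; None if absent."""
    match = re.search(r"\b([ABCD])\b", response.strip().upper())
    return match.group(1) if match else None

def answer_question(model: str, question: dict, n_samples: int = 5) -> str | None:
    """Ask the same question n_samples times and majority-vote the answers."""
    prompt = PROMPT_TEMPLATE.format(**question)
    votes = [extract_choice(call_model(model, prompt)) for _ in range(n_samples)]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None  # model never produced a parseable letter
    return Counter(votes).most_common(1)[0][0]

def score_model(model: str, questions: list[dict]) -> float:
    """Fraction of questions where the majority-voted answer matches the key."""
    correct = sum(answer_question(model, q) == q["answer"] for q in questions)
    return correct / len(questions)
```

ties in the vote just go to whichever answer showed up first, which at 4-8 samples per question basically never matters.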
the results
at a glance
where do AI models fall on the bell curve?
below is the distribution of human SAT scores (Reading & Writing section), with the 20 AI models plotted at their equivalent positions. hover over the dots for deets.
all models scored above the human mean of 530, placing them between the 76th and 84th percentiles.
in other words: every AI model tested would beat roughly 3 out of 4 human test takers.
but also: none of them are getting into a top 20 school. the tightest clustering occurred between scores 612-630, suggesting some kind of ceiling for current architectures on this task.
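(side note on the percentile math: with the human mean at 530 and a 630 landing at the 84th percentile, the implied standard deviation is roughly 100 points, since one sd above the mean sits at about the 84th percentile of a normal distribution. a plain normal approximation with those two numbers reproduces the 76th-84th range above - quick sketch below, with the sd being an inference from those data points rather than an official college board figure.)

```python
from math import erf, sqrt

# Normal approximation to the SAT Reading & Writing score distribution.
# Mean 530 is the human average cited above; sd ~100 is inferred from a 630
# sitting at roughly the 84th percentile - NOT an official College Board number.
MEAN, SD = 530, 100

def percentile(score: float) -> float:
    """Approximate percentile of a section score under a normal model."""
    z = (score - MEAN) / SD
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

for s in (600, 612, 618, 624, 630):
    print(s, round(percentile(s)))
# 600 -> ~76, 612 -> ~79, 618 -> ~81, 624 -> ~83, 630 -> ~84
```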
model ranking
score tiers
full results table
ok but what does this actually mean
anthropic takes the crown (barely)
Claude Opus 4.5 achieved the highest accuracy at 57.6%, translating to a 630 SAT score (84th percentile). this represents a narrow but consistent lead over the rest of the pack.
but here’s the thing: the margin is tiny. the difference between 1st place (57.6%) and 2nd place (56.1%) is literally one question.
chinese models are competitive af

Imagen 4 prompt: two groups of robots facing each other across a conference table, red vs blue lighting, holographic scoreboards
the biggest story here is how well Chinese models performed:
| Model | Provider | Score | Notes |
|---|---|---|---|
| DeepSeek V3.2 | DeepSeek | 624 | tied for 2nd among all models |
| GLM-4.7 | Zhipu AI | 624 | tied for 2nd among all models |
| Kimi K2.5 | Moonshot | 624 | tied for 2nd among all models |
| MiniMax M2.1 | MiniMax | 618 | beat GPT-5.2 |
| Seed-36B | ByteDance | 618 | beat GPT-5.2 |
| ERNIE-4.5 | Baidu | 618 | beat GPT-5.2 |
three chinese models tied with xAI’s Grok models for 2nd place. this is… not nothing. deepseek, zhipu, and moonshot are all producing models that outperform openai’s flagship on this standardized test.
wtf openai (still)

Imagen 4 prompt: dejected robot at desk, crumpled test papers, other robots celebrating in background
the embarrassment continues: OpenAI’s flagship GPT-5.2 placed 14th overall at 53.0% accuracy - behind six Chinese models, three xAI models, two Anthropic models, and both Gemini Flash models.
| Model | Accuracy | My Reaction |
|---|---|---|
| GPT-5.2 | 53.0% | beat by deepseek… |
| GPT-4.1 | 53.0% | same as 5.2??? |
| GPT-5.1 | 50.0% | tied with qwq-32b |
this suggests openai’s recent model improvements focused on… something else? coding? reasoning? certainly not standardized test performance.
deepseek is the real deal
DeepSeek V3.2 achieved 56.1% accuracy with 97% confidence - matching Grok 4 Fast and beating every OpenAI model. this is particularly notable because:
- deepseek is significantly cheaper than US alternatives
- they achieved this with an open-weights model
- the V3 architecture is apparently very good at language comprehension
alibaba’s qwen had issues
Qwen3-235B completely failed the benchmark (0% accuracy) - likely due to API issues or the model returning non-standard responses. QwQ-32B (the reasoning variant) managed 50%, tying with GPT-5.1 and Claude Haiku 4.5 for last place among models that actually finished the test.
the qwen team might want to look into that.
the stats
all models scored above the human average (530), but none achieved the 700+ scores that top human test-takers regularly hit.
for reference: if u got a 624-630 on SAT reading/writing, ur looking at state schools and maybe some lower-tier privates. MIT average is 760. harvard is 750.
AGI is above average but not ivy league material
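if u want to sanity-check how raw accuracy maps onto a scaled score here, below is a rough curve built by linearly interpolating the (accuracy, score) pairs quoted in this post - a fit to those reported numbers only, not the official college board raw-to-scaled conversion table.

```python
# Rough accuracy -> scaled-score curve, fit to the (accuracy %, score) pairs
# reported in this post. NOT the official College Board conversion table -
# just linear interpolation between the observed points.
POINTS = [(50.0, 600), (53.0, 612), (56.1, 624), (57.6, 630)]

def approx_score(accuracy: float) -> float:
    """Linearly interpolate a scaled score from percent-correct accuracy."""
    if accuracy <= POINTS[0][0]:
        return POINTS[0][1]
    if accuracy >= POINTS[-1][0]:
        return POINTS[-1][1]
    for (x0, y0), (x1, y1) in zip(POINTS, POINTS[1:]):
        if x0 <= accuracy <= x1:
            return y0 + (y1 - y0) * (accuracy - x0) / (x1 - x0)

print(round(approx_score(55.0)))  # ~620
```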
us vs china: the leaderboard
by provider
the full leaderboard
| Rank | Model | Provider | Country | Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | US | 630 |
| 2 | Gemini 3 Flash | Google | US | 625 |
| 3 | Grok 4 Fast | xAI | US | 624 |
| 3 | Grok 3 Mini | xAI | US | 624 |
| 3 | DeepSeek V3.2 | DeepSeek | China | 624 |
| 3 | GLM-4.7 | Zhipu AI | China | 624 |
| 3 | Kimi K2.5 | Moonshot | China | 624 |
| 8 | Claude Sonnet 4.5 | Anthropic | US | 618 |
| 8 | Gemini 2.5 Flash | Google | US | 618 |
| 8 | Grok 4 | xAI | US | 618 |
| 8 | MiniMax M2.1 | MiniMax | China | 618 |
| 8 | Seed-36B | ByteDance | China | 618 |
| 8 | ERNIE-4.5 | Baidu | China | 618 |
| 14 | GPT-5.2 | OpenAI | US | 612 |
| 14 | GPT-4.1 | OpenAI | US | 612 |
| 14 | Claude Sonnet 4 | Anthropic | US | 612 |
| 14 | Grok 3 | xAI | US | 612 |
| 18 | GPT-5.1 | OpenAI | US | 600 |
| 18 | Claude Haiku 4.5 | Anthropic | US | 600 |
| 18 | QwQ-32B | Alibaba | China | 600 |
takeaway: chinese models are genuinely competitive. 6 of the top 13 models are from chinese companies. the “china is behind in AI” narrative needs updating.
methodology notes for the nerds

Imagen 4 prompt: bird’s eye view of futuristic lab, holographic charts, blue and purple neon accents
models that couldn’t even show up
| Model | Provider | Issue |
|---|---|---|
| Gemini 3 Pro | Google | Safety filter blocked questions |
| Gemini 2.5 Pro | Google | Safety filter blocked questions |
| o3 | OpenAI | API compatibility issues |
| o4-mini | OpenAI | API compatibility issues |
| Qwen3-235B | Alibaba | API/response format issues |
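roughly speaking, these failures fall into two buckets: the API call itself errors out (the o-series compatibility issues), or the model responds but never yields a parseable A/B/C/D (safety-filter refusals, qwen’s format problems). a minimal sketch of that classification, reusing the hypothetical `call_model()` / `extract_choice()` helpers from the setup section:

```python
# Hedged sketch: figure out why a model produced no usable answer for a question.
# Reuses the hypothetical call_model() / extract_choice() helpers from the
# harness sketch in the setup section; labels mirror the table above.
def classify_failure(model: str, prompt: str) -> str:
    try:
        response = call_model(model, prompt)
    except Exception as exc:  # API errors, incompatible endpoints, hard blocks
        return f"api_error: {exc}"
    if extract_choice(response) is None:
        # refusal text / safety-filter messages and non-standard formats land here
        return "unparseable_response"
    return "ok"
```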
ok what did we learn
- ☑️ Claude Opus 4.5 is currently the best at SAT reading comprehension
- ☑️ Chinese models are genuinely competitive - 3 tied for 2nd place
- ☑️ DeepSeek, GLM, and Kimi all outperform OpenAI’s flagships
- ☑️ all frontier models perform above human average (76th-84th percentile)
- ☑️ no model cracked the 650+ scores that top human performers achieve
- ☑️ OpenAI’s GPT-5 series continues to underperform on benchmarks
- ☑️ there appears to be a ~630 ceiling for current architectures
the meta-lesson: the AI race is genuinely global now. chinese models aren’t just “catching up” - on standardized language tests, they’re already competitive with (and sometimes beating) US alternatives.
further work
things id like to try:
- chain-of-thought prompting (let them reason out loud)
- math section benchmarking (need to handle diagrams/figures)
- more chinese models (yi-lightning, ernie-4, etc.)
- testing on adaptive difficulty formats like the real digital SAT
if u have GPU credits to spare, feel free to run these experiments urself. benchmark code available on request.
Benchmark run on January 17-February 4, 2026. Chinese models tested via Silicon Flow API. No AI models were harmed in the making of this blog post, though OpenAI’s flagship was mildly embarrassed by its performance.