i made agi take the SAT

what happens when u give today’s frontier AI models a standardized test designed for 17 year olds? i tested 20 AI models from US and Chinese providers on the official SAT Practice Test #10 from College Board.

spoiler: they’re all above average. but none of them are getting into MIT.


robot taking test
Imagen 4 prompt: robot taking standardized test, stressed, pencil in hand, scantron sheet, realistic

tldr

  • Claude Opus 4.5 wins with 57.6% accuracy (630 score, 84th percentile)
  • Chinese models tied for 3rd place: DeepSeek V3.2, GLM-4.7, and Kimi K2.5 all hit 624
  • all models beat the human average of 530 but none cracked 650
  • OpenAI’s GPT-5.2 placed… 14th. fourteenth. behind six Chinese models.
  • Google’s Pro models refused to take the test bc “safety concerns” lmao

the setup

i built a benchmark script that tests models on the SAT’s Reading and Writing section (two modules, 66 questions total). this includes:
- vocabulary in context (what does “obfuscate” mean in paragraph 3)
- reading comprehension (what is the author’s main point)
- text structure (why did the author use this transition)
- grammar (semicolons are hard apparently)
- rhetorical synthesis (combine these two passages into a coherent argument)

to make sure results weren’t just random variance, each question was posed to each model 4-8 times, with the final answer determined by majority voting.
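
the voting step itself is tiny. a minimal sketch (the helper name and the agreement-as-confidence framing are mine, not the exact benchmark code):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> tuple[str, float]:
    """pick the most common answer letter; report agreement as a rough confidence."""
    votes = Counter(a for a in answers if a in "ABCD")
    if not votes:
        return "?", 0.0  # no parseable answer in any iteration
    letter, count = votes.most_common(1)[0]
    return letter, count / len(answers)

# five of six runs agreeing on "C" yields ("C", 0.833...)
print(majority_vote(["C", "C", "B", "C", "C", "C"]))
```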

providers tested:
- US: OpenAI, Anthropic, Google, xAI
- China: DeepSeek, Zhipu AI (GLM), Moonshot (Kimi), MiniMax, ByteDance, Baidu, Alibaba (Qwen)

methodology note: i used temperature=0.3 where supported, direct API calls (Silicon Flow for Chinese models), and no chain-of-thought prompting. just “here’s the passage, here’s the question, give me A/B/C/D.” like a real standardized test. no hand-holding.
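
concretely, each individual call looked roughly like this. a sketch assuming an OpenAI-compatible chat endpoint; the prompt wording and the ask_once name are my approximations of the setup above, not the real script:

```python
from openai import OpenAI

# for Chinese models, point base_url at Silicon Flow's OpenAI-compatible endpoint
client = OpenAI()

PROMPT = (
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n\n"
    "Answer with a single letter: A, B, C, or D."
)

def ask_once(model: str, q: dict) -> str:
    resp = client.chat.completions.create(
        model=model,
        temperature=0.3,  # where supported
        messages=[{"role": "user", "content": PROMPT.format(**q)}],
    )
    text = resp.choices[0].message.content.strip().upper()
    return next((ch for ch in text if ch in "ABCD"), "?")

# 4-8 calls per question, then majority_vote() from above picks the final answer
```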


the results

at a glance

where do AI models fall on the bell curve?

below is the distribution of human SAT scores (Reading & Writing section), with the 20 AI models plotted at their equivalent positions. hover over the dots for deets.

all models scored above the human mean of 530, placing them between the 76th and 84th percentiles.

in other words: every AI model tested would beat roughly 3 out of 4 human test takers.

but also: none of them are getting into a top 20 school. the tightest clustering occurred between scores 612-630, suggesting some kind of ceiling for current architectures on this task.
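
side note on the percentile math: the numbers here are consistent with a plain normal approximation of the human Reading & Writing distribution. the 530 mean is from above; the ~100-point standard deviation is my assumption for illustration:

```python
from statistics import NormalDist

human = NormalDist(mu=530, sigma=100)  # sigma ~100 is an assumption, not a College Board figure

for score in (600, 612, 618, 624, 630):
    print(score, f"{human.cdf(score):.0%}")
# 600 -> 76%, 612 -> 79%, 618 -> 81%, 624 -> 83%, 630 -> 84%
```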

(charts: model ranking, score tiers, full results table - the full leaderboard is reproduced further down)


ok but what does this actually mean

anthropic takes the crown (barely)

Claude Opus 4.5 achieved the highest accuracy at 57.6%, translating to a 630 SAT score (84th percentile). this represents a narrow but consistent lead over the rest of the pack.

but here’s the thing: the margin is tiny. the difference between 1st place (57.6%) and 2nd place (56.2%) is literally one question (out of 66, each question is worth about 1.5 percentage points).

chinese models are competitive af

US vs China AI competition
Imagen 4 prompt: two groups of robots facing each other across a conference table, red vs blue lighting, holographic scoreboards

the biggest story here is how well Chinese models performed:

| Model | Provider | Score | Notes |
|---|---|---|---|
| DeepSeek V3.2 | DeepSeek | 624 | tied for 3rd among all models |
| GLM-4.7 | Zhipu AI | 624 | tied for 3rd among all models |
| Kimi K2.5 | Moonshot | 624 | tied for 3rd among all models |
| MiniMax M2.1 | MiniMax | 618 | beat GPT-5.2 |
| Seed-36B | ByteDance | 618 | beat GPT-5.2 |
| ERNIE-4.5 | Baidu | 618 | beat GPT-5.2 |

three chinese models tied with xAI’s Grok models for 3rd place. this is… not nothing. deepseek, zhipu, and moonshot are all producing models that outperform openai’s flagship on standardized tests.

wtf openai (still)

robot looking dejected at test results
Imagen 4 prompt: dejected robot at desk, crumpled test papers, other robots celebrating in background

the embarrassment continues: OpenAI’s flagship GPT-5.2 placed 14th overall at 53.0% accuracy - behind six Chinese models, three xAI models, two Google models, and two Anthropic models.

| Model | Accuracy | My Reaction |
|---|---|---|
| GPT-5.2 | 53.0% | beat by deepseek… |
| GPT-4.1 | 53.0% | same as 5.2??? |
| GPT-5.1 | 50.0% | tied with qwq-32b |

this suggests openai’s recent model improvements focused on… something else? coding? reasoning? certainly not standardized test performance.

deepseek is the real deal

DeepSeek V3.2 achieved 56.1% accuracy with 97% confidence - matching Grok 4 Fast and beating every OpenAI model. this is particularly notable because:

  1. deepseek is significantly cheaper than US alternatives
  2. they achieved this with an open-weights model
  3. the V3 architecture is apparently very good at language comprehension

alibaba’s qwen had issues

Qwen3-235B completely failed the benchmark (0% accuracy) - likely due to API issues or the model returning non-standard responses. QwQ-32B (the reasoning variant) managed 50%, tying with GPT-5.1 for last place among successful models.

the qwen team might want to look into that.
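
for what it’s worth, this is the kind of guard a harness needs so one chatty or malformed model doesn’t silently tank a run. a sketch with illustrative retry logic (ask_with_retries is my name, and ask_once comes from the setup sketch earlier):

```python
def ask_with_retries(model: str, q: dict, tries: int = 3) -> str:
    """retry when a model returns something that isn't a clean A/B/C/D answer."""
    for _ in range(tries):
        letter = ask_once(model, q)  # returns "?" when no A-D letter is found
        if letter in ("A", "B", "C", "D"):
            return letter
    return "INVALID"  # scored as wrong, and surfaced in the per-model failure stats
```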


the stats

  • 84th - Top Model Percentile
  • 630 - Highest SAT Score
  • 97% - Avg. Confidence
  • 20 - Models Tested

all models scored above the human average (530), but none achieved the 700+ scores that top human test-takers regularly hit.

for reference: if u got a 624-630 on SAT reading/writing, ur looking at state schools and maybe some lower-tier privates. MIT average is 760. harvard is 750.

AGI is above average but not ivy league material


us vs china: the leaderboard

(chart: by provider)

the full leaderboard

| Rank | Model | Provider | Country | Score |
|---|---|---|---|---|
| 1 | Claude Opus 4.5 | Anthropic | US | 630 |
| 2 | Gemini 3 Flash | Google | US | 625 |
| 3 | Grok 4 Fast | xAI | US | 624 |
| 3 | Grok 3 Mini | xAI | US | 624 |
| 3 | DeepSeek V3.2 | DeepSeek | China | 624 |
| 3 | GLM-4.7 | Zhipu AI | China | 624 |
| 3 | Kimi K2.5 | Moonshot | China | 624 |
| 8 | Claude Sonnet 4.5 | Anthropic | US | 618 |
| 8 | Gemini 2.5 Flash | Google | US | 618 |
| 8 | Grok 4 | xAI | US | 618 |
| 8 | MiniMax M2.1 | MiniMax | China | 618 |
| 8 | Seed-36B | ByteDance | China | 618 |
| 8 | ERNIE-4.5 | Baidu | China | 618 |
| 14 | GPT-5.2 | OpenAI | US | 612 |
| 14 | GPT-4.1 | OpenAI | US | 612 |
| 14 | Claude Sonnet 4 | Anthropic | US | 612 |
| 14 | Grok 3 | xAI | US | 612 |
| 18 | GPT-5.1 | OpenAI | US | 600 |
| 18 | Claude Haiku 4.5 | Anthropic | US | 600 |
| 18 | QwQ-32B | Alibaba | China | 600 |

takeaway: chinese models are genuinely competitive. 6 of the top 13 models are from chinese companies. the “china is behind in AI” narrative needs updating.


methodology notes for the nerds

futuristic lab with data visualizations
Imagen 4 prompt: bird’s eye view of futuristic lab, holographic charts, blue and purple neon accents

- **Test Source:** College Board SAT Practice Test #10
- **Questions Used:** 66 (Reading & Writing Modules 1 and 2)
- **Iterations Per Question:** 4-8 (majority voting for final answer)
- **Answer Selection:** Direct letter response (A/B/C/D)
- **Parallel Processing:** ThreadPoolExecutor with 3-10 concurrent workers
- **Chinese Models API:** Silicon Flow (OpenAI-compatible)
- **Temperature:** 0.3 where supported
- **Total API Calls:** ~10,000+
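
the parallel part is plain ThreadPoolExecutor, roughly this shape. run_benchmark and answer_question are placeholder names of mine (answer_question stands for the ask-several-times-and-vote loop sketched earlier), and the worker count gets dialed down for providers with tighter rate limits:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def answer_question(model: str, q: dict, iterations: int = 6) -> str:
    # ask_once / majority_vote from the earlier sketches
    letter, _confidence = majority_vote([ask_once(model, q) for _ in range(iterations)])
    return letter

def run_benchmark(model: str, questions: list[dict], workers: int = 5) -> dict[int, str]:
    """fan out the questions, a few in flight at once, collecting answers by index."""
    results: dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(answer_question, model, q): i for i, q in enumerate(questions)}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```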

models that couldn’t even show up

| Model | Provider | Issue |
|---|---|---|
| Gemini 3 Pro | Google | Safety filter blocked questions |
| Gemini 2.5 Pro | Google | Safety filter blocked questions |
| o3 | OpenAI | API compatibility issues |
| o4-mini | OpenAI | API compatibility issues |
| Qwen3-235B | Alibaba | API/response format issues |

ok what did we learn

  • ☑️ Claude Opus 4.5 is currently the best at SAT reading comprehension
  • ☑️ Chinese models are genuinely competitive - 3 tied for 2nd place
  • ☑️ DeepSeek, GLM, and Kimi all outperform OpenAI’s flagships
  • ☑️ all frontier models perform above human average (76th-84th percentile)
  • ☑️ no model cracked the 650+ scores that top human performers achieve
  • ☑️ OpenAI’s GPT-5 series continues to underperform on benchmarks
  • ☑️ there appears to be a ~630 ceiling for current architectures

the meta-lesson: the AI race is genuinely global now. chinese models aren’t just “catching up” - on standardized language tests, they’re already competitive with (and sometimes beating) US alternatives.


further work

things id like to try:
- chain-of-thought prompting (let them reason out loud; rough prompt sketch after this list)
- math section benchmarking (need to handle diagrams/figures)
- more chinese models (yi-lightning, ernie-4, etc.)
- testing on adaptive difficulty formats like the real digital SAT
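
for the chain-of-thought item, the only change would be the prompt and the parsing. something like this, where the "Final answer:" convention is my assumption, not a tested setup:

```python
import re

COT_PROMPT = (
    "Passage:\n{passage}\n\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}\n\n"
    "Think it through step by step, then end with 'Final answer: X' where X is A, B, C, or D."
)

def extract_final_answer(text: str) -> str | None:
    """pull the letter out of a free-form reasoning response."""
    match = re.search(r"final answer[:\s]+([ABCD])", text, flags=re.IGNORECASE)
    return match.group(1).upper() if match else None
```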

if u have GPU credits to spare, feel free to run these experiments urself. benchmark code available on request.


Benchmark run on January 17-February 4, 2026. Chinese models tested via Silicon Flow API. No AI models were harmed in the making of this blog post, though OpenAI’s flagship was mildly embarrassed by its performance.