AI in Mathematics
A progress ladder showing how far AI has climbed in mathematical reasoning — from solving textbook problems to competing at the International Mathematical Olympiad, publishing in top journals, and the distant summit of open millennium-prize problems.
Automation Progress
AI systems now routinely score above 90% on standard benchmarks such as GSM8K (grade-school word problems) and MATH (high-school competition problems), and handle problems from competitions like AMC/AIME with ease. This level is considered solved.
GPT-4o scores 76.6% on MATH benchmark
OpenAI's GPT-4o achieved 76.6% on the MATH dataset of competition-level problems, up from the roughly 5% that the best language models managed when the benchmark was introduced in 2021.
OpenAI o1 reaches 94.8% on MATH
The o1 reasoning model achieved 94.8% on MATH and 83.3% on AIME 2024, demonstrating that chain-of-thought reasoning dramatically lifts mathematical performance.
AI has progressed from silver to gold at the IMO — widely considered the hardest pre-university math competition in the world, requiring deep creative reasoning and rigorous proof.
AlphaGeometry nears gold-medal level on IMO geometry
DeepMind's AlphaGeometry solved 25 of 30 historical IMO geometry problems, approaching gold-medal performance in geometry specifically.
AlphaProof + AlphaGeometry2 reach IMO silver (28/42)
DeepMind's combined system solved 4 of 6 problems at the 2024 IMO, scoring 28/42, making it the first AI to reach silver-medal standard (as graded by IMO gold medallists). It produced formal proofs in the Lean proof assistant, trained via reinforcement learning, though some solutions took days of computation.
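For a sense of what "formal" means here: a Lean proof is code that the proof assistant checks step by step, so correctness is machine-verified rather than judged by humans. Below is a toy Lean 4 sketch of our own, far simpler than any IMO problem and not AlphaProof's actual output:

```lean
import Mathlib

-- Toy example of a machine-checkable statement: the sum of two
-- squares of integers is non-negative. Lean verifies each step;
-- an incorrect proof simply fails to compile.
theorem sum_of_squares_nonneg (a b : ℤ) : 0 ≤ a ^ 2 + b ^ 2 :=
  add_nonneg (sq_nonneg a) (sq_nonneg b)
```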
Both OpenAI & DeepMind achieve IMO gold (35/42)
OpenAI's experimental reasoning model and DeepMind's Gemini Deep Think both scored 35/42 (5 of 6 problems) — gold-medal standard. Both used natural-language proofs completed within the 4.5-hour time limit, a major leap from 2024's multi-day formal approach.
Beyond solving known problems, AI has begun producing genuinely new mathematical discoveries — finding novel conjectures, beating decades-old bounds, and publishing results in Nature.
DeepMind AI discovers new knot-theory conjecture
Working with mathematicians from Oxford and Sydney, DeepMind's ML system discovered a previously unknown relationship between algebraic and geometric invariants of knots, leading to a new theorem. Published as a Nature cover story.
AlphaTensor discovers faster matrix multiplication algorithms
DeepMind's AlphaTensor used reinforcement learning to discover novel matrix-multiplication algorithms, beating Strassen's 50-year-old record for 4×4 matrices (in modular arithmetic) and improving on the best known algorithms for more than 70 other matrix sizes. Nature cover story.
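To make "fewer multiplications" concrete: Strassen's classical scheme multiplies 2×2 matrices with 7 scalar multiplications instead of the naive 8, and AlphaTensor searched for analogous decompositions at larger sizes. A minimal Python sketch of the classical baseline (our own illustration, not DeepMind's code):

```python
# Strassen's 2x2 scheme: 7 multiplications instead of the naive 8.
# Applied recursively to large matrices, the saving compounds into an
# O(n^2.81) algorithm; AlphaTensor found similar schemes beyond 2x2.
def strassen_2x2(A, B):
    (a, b), (c, d) = A
    (e, f), (g, h) = B
    m1 = (a + d) * (e + h)
    m2 = (c + d) * e
    m3 = a * (f - h)
    m4 = d * (g - e)
    m5 = (a + b) * h
    m6 = (c - a) * (e + f)
    m7 = (b - d) * (g + h)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]], matching the ordinary matrix product
```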
FunSearch breaks 20-year record on cap set problem
DeepMind's FunSearch paired an LLM with an automated evaluator to discover new constructions for the cap set problem in extremal combinatorics, finding the largest known cap set in dimension 8, the biggest advance on the problem in 20 years, and surpassing the best human-designed constructions. Published in Nature.
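For context, a cap set is a collection of points in n-dimensional space over the three-element field with no three points on a line, which is equivalent to no three distinct points summing to zero in every coordinate. FunSearch evolved programs that build candidate sets and scored them with an automated checker along these lines (a hypothetical Python sketch, not FunSearch's actual code):

```python
from itertools import combinations

def is_cap_set(points):
    """Check the cap-set property over F_3^n: no three distinct
    points may sum to zero (mod 3) in every coordinate, which is
    equivalent to no three points lying on a line."""
    for a, b, c in combinations(points, 3):
        if all((x + y + z) % 3 == 0 for x, y, z in zip(a, b, c)):
            return False
    return True

# Toy instance in dimension 2, where the maximum cap set has size 4;
# FunSearch's record-breaking constructions live in dimension 8.
print(is_cap_set([(0, 0), (0, 1), (1, 0), (1, 1)]))  # True
```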
AlphaProof paper published in Nature
The full AlphaProof system, demonstrating olympiad-level formal mathematical reasoning via reinforcement learning in Lean, was published in Nature — validating AI's ability to generate machine-verifiable proofs at the highest competition level.
Can AI tackle problems that take professional mathematicians hours or days? The FrontierMath benchmark tests exactly this — hundreds of original, expert-crafted research-level problems. The best models are still far from human expert performance, but progress is rapid.
FrontierMath benchmark: SOTA models solve <2%
At the benchmark's launch in late 2024, the best AI models (including o1) could solve fewer than 2% of FrontierMath's expert-level problems, which span number theory, algebraic geometry, and category theory. Fields Medalist Terence Tao called the problems "extremely challenging" even for expert mathematicians.
Rapid progress: GPT-5 series reaches ~26%, GPT-5.4 hits 47.6%
Within a year, performance surged: GPT-5 scored roughly 26.3%, and the latest GPT-5.4 solves 47.6% of FrontierMath problems, a remarkable leap, though still below expert human level on the hardest problems.
AI as research co-pilot (Tao's vision)
Fields Medalist Terence Tao envisions AI as a "co-pilot" for mathematical research — not replacing mathematicians but accelerating discovery by handling tedious computations, suggesting proof strategies, and formalizing arguments in proof assistants like Lean.
The summit: solving problems that have resisted all human efforts for decades or centuries. The seven Millennium Prize Problems (six remain unsolved) represent the pinnacle of mathematical difficulty. No AI has made meaningful direct progress on any of them. These problems likely require entirely new mathematical frameworks, not just better computation.
Riemann Hypothesis (1859)
Concerns the distribution of prime numbers. Unproven for over 160 years. Widely considered one of the deepest unsolved problems in all of mathematics. No AI system has made direct progress.
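For reference, the standard statement: every non-trivial zero of the Riemann zeta function lies on the critical line, which would pin down the error term in counting primes.

```latex
% Standard statement of the Riemann Hypothesis.
\[
\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^{s}}
\quad (\operatorname{Re}(s) > 1,\ \text{extended by analytic continuation}),
\]
\[
\zeta(s) = 0 \ \text{ and } \ 0 < \operatorname{Re}(s) < 1
\ \Longrightarrow\ \operatorname{Re}(s) = \tfrac{1}{2}.
\]
```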
P vs NP (1971)
The most important open question in theoretical computer science. Asks whether every problem whose solution can be quickly verified can also be quickly solved. Resolving it would reshape cryptography, optimization, and our understanding of computation.
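The verify-versus-solve asymmetry is easy to see in code. Subset Sum is a classic NP-complete problem: checking a claimed solution takes linear time, while every known general solver takes exponential time in the worst case. A minimal Python sketch of our own:

```python
from itertools import combinations

def verify(nums, target, indices):
    """Polynomial-time check of a claimed certificate (the 'NP' side)."""
    return sum(nums[i] for i in indices) == target

def solve(nums, target):
    """Brute-force search over all 2^n subsets (exponential time).
    No known algorithm does fundamentally better in the worst case."""
    for r in range(len(nums) + 1):
        for indices in combinations(range(len(nums)), r):
            if verify(nums, target, indices):
                return indices
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = solve(nums, 9)               # slow in general: exponential search
print(cert, verify(nums, 9, cert))  # (2, 4) True -- fast to check
```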
Birch and Swinnerton-Dyer Conjecture, Hodge Conjecture, and others
The remaining Millennium Problems — along with countless other major open conjectures — remain firmly beyond AI's current reach. These likely require not just pattern recognition but entirely novel mathematical insight.