From Language Models to Theorem Proving: How DeepSeek-Prover-V2 Bridges Natural Language and Formal Logic
DeepSeek made a surprise drop at the very end of the May Day holiday: DeepSeek-Prover-V2 (model on HuggingFace). Perhaps because of the holiday—or maybe because this release isn’t paired with flashy, record-breaking headlines like previous models—I haven’t seen much media discussion yet.
But I was deeply impressed by their earlier work, especially its line of reasoning. So I spent time during the break reading this new paper thoroughly, and the deeper I went, the more interesting it became. Here's a write-up of what I learned and thought was worth sharing:
This article walks through two angles:
A brief summary of the model’s core idea and technical approach;
A few aspects I personally found fascinating—especially how it connects human-style reasoning with machine-verifiable logic.
1. Why Formal Theorem Proving Is the Next Hard Problem for LLMs
Large language models (LLMs) have already demonstrated impressive performance in math problem solving, especially with Chain-of-Thought (CoT) prompting. But those step-by-step reasoning traces are still fundamentally informal—based on intuition, heuristics, and learned patterns.
Formal theorem provers like Lean or Coq, on the other hand, require every proof step to be explicit, rigorous, and mechanically verifiable. This raises a fundamental challenge: LLMs are great at imprecise, fuzzy language. Computers demand machine-level precision. So how do we bridge the gap?
2. Overview of the Method: From Natural Language to Recursive Lean Proofs
Step 1: Cold-start Data via CoT + Lean Subgoal Decomposition
Use the general-purpose model DeepSeek-V3 to understand theorems and decompose them into subgoals;
Generate a have ... := sorry skeleton to sketch the proof in Lean (see the illustrative sketch after this list);
Then, a lightweight 7B model recursively completes each subgoal, producing full formal Lean proofs;
This pipeline generates hundreds of synthetic cold-start examples that include both reasoning traces and verifiable proofs.
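To make this concrete, here is a minimal hand-written illustration of what such a sketch can look like in Lean 4 with Mathlib (the theorem and hypothesis names are my own toy example, not taken from the paper): each subgoal becomes a have, stubbed out with sorry for the smaller prover to fill in.

import Mathlib

-- Illustrative proof sketch: the theorem is reduced to two subgoals,
-- each left as `sorry` for the 7B prover model to complete later.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h₁ : 0 ≤ a ^ 2 := by sorry
  have h₂ : 0 ≤ b ^ 2 := by sorry
  exact add_nonneg h₁ h₂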
Step 2: Curriculum Learning + Reinforcement Learning (GRPO)
Introduce problems of increasing difficulty via curriculum learning;
Apply Group Relative Policy Optimization (GRPO) to reward not only correctness but also structural alignment with the original reasoning plan (a sketch of the GRPO idea follows this list);
Train two reasoning modes in parallel:
CoT mode: transparent, interpretable step-by-step reasoning;
Non-CoT mode: fast, efficient Lean code generation.
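For readers unfamiliar with GRPO: as introduced in DeepSeek's earlier DeepSeekMath work, it drops the separate value model used in PPO and instead scores each sampled proof relative to the other samples in its group. A sketch of the core advantage computation, in my own notation:

% Group-relative advantage of the i-th of G sampled proofs,
% where r_i is the (verification-based) reward of sample i.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}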
3. What I Found Fascinating
3.1 The Model Isn't Just Guessing—It's Really Completing Logical Loops
Past math models often felt like “getting the answer right, but not knowing why.” Here, the model produces Lean code that passes formal verification—every step must logically follow. This marks a shift from "language generation" to deductive construction.
3.2 Subgoal Decomposition Feels Like a Math PhD's Writing Process
Humans often sketch a proof outline first, then fill in each part. This system mimics that exact flow:
DeepSeek-V3 breaks the theorem down into lemmas and generates a lemma ... := sorry skeleton, and the 7B model fills in the blanks. This structure reduces the proof search space dramatically.
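Picking up my toy sketch from Section 2, the filled-in version would look like this (sq_nonneg and add_nonneg are standard Mathlib lemmas); once no sorry remains, Lean's kernel verifies the entire proof:

import Mathlib

-- Completed version of the earlier sketch: every `sorry` has been
-- replaced by an actual proof term, so the whole theorem now verifies.
theorem sum_sq_nonneg (a b : ℝ) : 0 ≤ a ^ 2 + b ^ 2 := by
  have h₁ : 0 ≤ a ^ 2 := sq_nonneg a
  have h₂ : 0 ≤ b ^ 2 := sq_nonneg b
  exact add_nonneg h₁ h₂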
3.3 The 7B Model Outperforms the 671B Model on Some Problems
On the Putnam benchmark, the 7B model solved 13 problems that the 671B couldn’t.
It used techniques like Cardinal.toNat, which weren't in the larger model's outputs, suggesting that smaller models can sometimes discover unique, non-mainstream but effective strategies.
3.4 The Gap Between Informal and Formal Reasoning Is Narrowing
DeepSeek-V3 solved 8 AIME problems via natural language CoT reasoning; DeepSeek-Prover-V2 formalized 6 of them in Lean.
This is perhaps the clearest side-by-side comparison of informal language-based reasoning vs. formal proof construction, and it's encouraging to see how closely they align.
4. Experimental Highlights
Benchmark (metric): DeepSeek-Prover-V2 result
miniF2F-test (Pass@8192): 88.9%
ProofNet-test (Pass@1024): 37.1%
PutnamBench (Pass@512): 49 / 658
AIME 24&25, formalized (solved): 6 / 15

They also introduced a new benchmark, ProverBench, containing 325 formalized problems drawn from textbooks and math competitions. A valuable addition to the field.
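A note on the metric, in my own words: Pass@K counts a problem as solved when at least one of K independently sampled proof attempts passes Lean verification, i.e.

% Fraction of benchmark problems for which at least one of the
% K sampled proof attempts is accepted by the Lean compiler.
\text{Pass@}K = \frac{\#\{\text{problems with} \geq 1 \text{ verified proof among } K \text{ samples}\}}{\#\{\text{problems}\}}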
5. From “Speaking” to “Proving”—LLMs Just Got a Lot Smarter
This paper isn’t about prompt hacking or brute-force answer searching. It’s about systematically building a path from intuitive reasoning to formal logic. We know language models operate by predicting the next token. But as someone with a background in mathematics, I see a deeper connection here: It’s like developing an intuitive number sense first—a fuzzy, directional understanding of a theorem—then, through training and structure, acquiring a rigorous framework to prove it. And that framework enables not just internal reasoning, but external communication and validation.
What DeepSeek-Prover-V2 achieves is making that transition—from intuition to structure—accessible to machines. It doesn't just look like it's reasoning. It really is proving like a mathematician. One day, we may witness an AlphaZero-style system for math—one that conquers IMO problems not by memorization, but by reasoning from first principles.