To be fair, however, these proofs use several of the automated proving tactics built into Lean, like linarith, nlinarith, field_simp, and so on. Each of these is hundreds of lines of manually tweaked heuristics.
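For a sense of what those built-in tactics handle on their own, here is a minimal sketch (assuming Lean 3 with mathlib, which is what this work targets); `linarith` closes goals that follow linearly from the hypotheses with no user-supplied search:

```lean
import data.real.basic

-- `linarith` discharges goals that are linear consequences of the hypotheses.
example (x y : ℝ) (h1 : 2 * x ≤ 3) (h2 : y ≤ 4) : 2 * x + y ≤ 7 :=
by linarith
```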
It's still quite interesting. I think that with a good integration of AI methods into the Lean runtime itself, the theorem prover could be excellent. You could have a cooperative model where the human mathematician writes down statements and the AI fleshes out a proof or catches typos you make. Right now Lean has similar functionality, but the automatic proving isn't as good as it could be.
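As a rough sketch of that cooperative workflow (Lean 3 / mathlib syntax assumed; the lemma name is purely illustrative): the human states the result and leaves a hole, and an AI-backed or built-in tactic is asked to fill it.

```lean
import data.real.basic

-- The human writes the statement; the `sorry` marks the hole that an
-- automated tactic would be asked to fill.
lemma cancel_example (x : ℝ) (hx : x ≠ 0) : x / x = 1 :=
sorry -- automation might propose, e.g., `div_self hx`
```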
I just wish the Lean community wasn't stuck on the Lean 3 -> Lean 4 migration for the past year. I'm getting flashbacks of Python 2 -> Python 3 ....
This is definitely something that we'll be looking into more closely in the future. Our old and busted `gptf` tactic was a good start but we can do much better!
I may be misunderstanding, but it seems like gptf can only apply theorems from the set you guys trained it on ... in which case I don't see how it could possibly help with the development of a theory beyond its first few statements. Have I got that right? And if so, is it something that might be addressed in future iterations?
How long (or how much compute) does it take to solve one of these IMO problems right now? My estimate is that just the time to run tactics like linarith across many different cases would be significant enough to make it too slow to use interactively, but I'm curious to check my guess.
It does indeed take a lot of CPU (Lean side) on top of GPU (neural side). But technically, when properly parallelized, it takes no more than 1-2 hours to crack the hardest problem we've solved so far.
As others in this thread have pointed out, some of those proofs involve calls to high-powered decision procedures, so I was curious to see whether, as a baseline, Z3 could solve the problems in the blog post with no guidance.
For 5/6 of the problems, Z3 solves them automatically and instantly, with the same problem encoding as the problems in the post. (Problem 2 involves factorial, so that one can't be straightforwardly translated to Z3.)
The average scores of high-school students who take the AMC and AIME are around 35% [1].
"We achieved a new state-of-the-art (41.2% vs 29.3%) on the miniF2F benchmark, a challenging collection of high-school olympiad problems." [2]
This system's performance is already better than the average scores of students who take these tests, who are usually very good at math.
The test/validation sets are drawn from AMC, AIME, and IMO which are competitive math competitions at the high school level (IMO being the hardest). Even though the topics are generally limited to high school curriculum, solving the problems often requires a degree of creative thinking and knowledge synthesis such that many/most college graduates cannot solve them.
Even though this is an amazing result, I am a bit disappointed, though not surprised, that most of these problems are equation/inequality problems. Those kinds of problems are among the easiest for computers; most inequalities like these can be handled with S.O.S. (sum-of-squares) techniques and the like. If it can solve the combinatorics problems on the IMO, I would certainly be impressed.
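To illustrate the S.O.S. point, here is a hedged sketch (Lean 3 / mathlib syntax assumed) of how a classic symmetric inequality falls to `nlinarith` once it is handed the relevant nonnegative squares, which is essentially a sum-of-squares certificate:

```lean
import data.real.basic

-- 2(a² + b² + c² - ab - bc - ca) = (a-b)² + (b-c)² + (c-a)² ≥ 0,
-- so passing the three squares as hints lets `nlinarith` finish.
example (a b c : ℝ) : a * b + b * c + c * a ≤ a ^ 2 + b ^ 2 + c ^ 2 :=
by nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a)]
```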
This work is interesting, but it doesn't seem to acknowledge a lot of the state-of-the-art work in this area. Possibly that's because much of it is done in Programming Languages venues, as opposed to Machine Learning venues. It can also be hard to compare works because there are several proof languages that folks use (the big ones are Coq, Isabelle/HOL, and Lean), with mutually exclusive benchmark sets. However, the paper here cites CoqGym and Gamepad, ML-based proof synthesis works that work in the Coq language, but doesn't reference TacTok (https://dl.acm.org/doi/10.1145/3428299) or Proverbot9001 (http://proverbot9001.ucsd.edu/), both of which are later works which significantly improve accuracy and performance on the same language (and for CoqGym, the same benchmarks). Disclaimer: I'm the author of Proverbot9001.
We’re just waiting for hardware engineers to produce faster hardware/software engineers to make the implementation more efficient. ;-)
Having said that, the problems solved here are fairly simple. I can do half of them in my head (how old are the students doing these quizzes?). Getting from the written form to a description that Lean can work with is probably harder than getting Lean to solve them. Skimming the paper, it doesn't look like they included that first step.
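As a sketch of what that missing first step looks like (Lean 3 / mathlib syntax assumed; the statement is chosen just for illustration): a written problem such as "show that n² + n is even for every natural number n" has to be rendered as a formal statement before any tactic can touch it, and the proof is left as a placeholder here.

```lean
import data.nat.parity

-- The formalization step: the English statement becomes a Lean goal.
-- Producing this statement is the step left to a human.
example (n : ℕ) : even (n ^ 2 + n) :=
sorry -- the proof search would start from here
```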
What problem can't be solved, given enough computation power?
For example, to determine the 3D description of a photographed scene (a difficult problem), you could just enumerate all possible scenes, raytrace them, and pick the most probable one (e.g. the one with the simplest description).
Copilot actually knows a bit of Lean and can be helpful when formalizing stuff. But it does not get the critical feedback we get from the formal systems (as it's not designed for formal systems).
It would be interesting to evaluate Copilot on the same benchmark, as I'm pretty sure it could still close some of the proofs.