Even more on AI and math: solution to one of my open problems and implications
Again a more serious post, one that fits less under “AI fails” than under “AI successes” (though, as we will see, the result leaves something to be desired, and I think there are real concerns about humans going out of the loop on mathematics, and more).
I recently posted two open problems in theoretical computer science (with applications in AI) that I first posed in 2015 in a program at the Simons Institute for the Theory of Computing. I had worked on them before that and have given them to various people asking for open problems over the years, but never got a solution (admittedly I’m not sure how hard people other than me tried). I tried ChatGPT 5.5 Pro on them, but while it made some partial progress, it didn’t solve them.
But a little over a week ago, my colleague Ryan O’Donnell sent me a solution to the first problem (I put it here), also generated by ChatGPT 5.5 Pro. Apparently Ryan’s prompting skills are better: in fact, it found the solution on the first try, though only after running for nearly an hour on it out of the gate! Ryan is an amazing theoretical computer scientist, but doesn’t work in the specific area of this problem. He did suggest that ChatGPT first try to prove NP-hardness (a standard concept in computer science, and one of the two ways the problem was likely to resolve), because he felt it’s likely better at such proofs. (The other problem I posed resisted solution.)
Now, the unfortunate aspect of the proof is that it is 7+ pages of math. I tried to understand it, but, even though this is my research area, I had a hard time seeing the intuition behind the proof or keeping all the introduced notation in my head at the same time. Ryan and I both threw the AI back at the proof to try to find faults in it, including by checking lots of examples, but it did not find any.
What was I to do? I felt I couldn’t just leave it at that. In the end, I decided the easiest thing to do would be to just prove it myself after all. Let’s pause for a second on that. This is a problem that I was unable to solve when I was younger and had far fewer commitments, that I gave to various people over the years, and that never got resolved. Meanwhile I had the AI-generated proof in hand, which I suspected, based mostly on the repeated AI interrogation, was correct (and I still do). And still I thought it would be easier to just prove it myself than to try to figure out the AI proof. Part of it was also that I care about this problem and was unsatisfied with it being resolved with such a proof. It did help to know which way the problem would resolve, of course.
What was it that made the AI proof so hard to read for me? Maybe it’s just my own incompetence at reading such a proof. Superficially, it’s a well-written proof, though little intuition is given (maybe there is some more than it gives explicitly?) and it introduces a lot of different notation. Intuitively it feels to me like a proof where the author powered through to an extreme extent, where most human mathematicians would at some point say: “This is just getting too complicated/unintuitive, let’s try something simpler.” Human mathematicians (myself included) sometimes also power through to some extent, which doesn’t tend to lead to the nicest proofs, but this seemed extreme.
Does this problem inherently require a very complicated proof? Well, at least not that complicated. In the end, I did manage to prove it myself with what I think is a significantly easier proof (I put it here) that actually proves something a bit stronger. It’s certainly shorter, at 2 pages with more generous margins (or a 1-page sketch that is basically complete). And at least I can keep it in my head, though of course people usually have an easier time keeping their own proofs in their heads than someone else’s! The proof is also closer in style to proofs I did in the past for related results.
Ryan and I threw ChatGPT 5.5 Pro back at my proof to critique it. While both copies were in the end satisfied with it, mine seemed more inclined to compliment me on the proof while Ryan’s was stingy with compliments and more inclined to point out places where I should be more precise (though mine did some of that too, and all those comments did improve the proof). In both cases I was impressed with the AI’s apparent depth of understanding of the proof.
So overall, at the very least, the AI (with Ryan’s prompting) gave me the motivation/pressure to find a proof, the general direction I should look in, confidence about what the right answer was, and good feedback on what I was doing. Maybe I also got a bit more detailed intuition from trying to read the AI proof (not sure). But the whole thing took me quite a while, causing me to fall behind on other things. I hesitate to point out that there are a bunch of variants of that problem listed in the original post…
This is where I’m worried about where math will go. It seems likely that there will be a flood of AI-generated math proofs. There is a lot of math to be done. I recently saw Dominik Peters posting here that he already has a whole bunch of AI-generated proofs of results but he just hasn’t had time to write them up properly yet. Dominik is an awesome researcher and I’m sure he won’t send out proofs into the world that he is not satisfied with. But I highly doubt everyone will be like that. And again, Ryan was able to get AI to produce a presumably correct proof on (what at least for me was) a challenging problem that isn’t even in Ryan’s precise area.
So what will we do with a flood of AI-generated proofs? The peer review system is already extremely strained. I imagine that there just isn’t enough human capacity to deeply understand all these proofs or find more human-understandable versions.
What is the purpose of a mathematical proof anyway? One is to make sure that the statement proved is correct. Another is to provide insight and understanding into why it is correct. Before computers, the two purposes didn’t come apart all that much — the best way to make sure that something was correct was to give everyone (or at least yourself) insight into why it was correct. This changed with computer-aided mathematics, and the proof of the four-color theorem left many uncomfortable. The LLM-generated proofs (when they are not using a formal language) are in a bit of a strange space between the purposes. I suspect many proofs will start to be generated that human beings will not understand (if only because they don’t have the time to go through all of them), and yet we’ll mostly trust that they are correct because the LLMs are pretty good at checking them. Once we humans are not keeping up, we’ll fall far behind — I don’t think I could have done this proof had I not spent a lot of time proving related results earlier in my life. Would I have done that if I could, back then, just have pressed a button and waited for an hour for the answer?
One possible conclusion is that, to do things right, we actually need more human mathematicians, or more generally to grow academia, just to keep up with the AI. This is a phenomenon that seems to be popping up in a variety of contexts. For example, it is much easier to see how we can use AI to significantly improve peer review if we ask the human reviewers to actually do more work than they were doing before (e.g., checking through detailed AI reviews). Probably there are lots of other examples, e.g., if human lawyers actually read through many related cases surfaced by AI, it would presumably improve their work. Unfortunately, at least in the case of academia, rapid growth does not seem to be in the cards in practice. Perhaps we can train the AI to write more human-friendly proofs (or more generally make more human-accessible contributions).
Perhaps another direction is to create a separate “track” for especially human mathematics. (I’ve recently been interested in that anyway, including a recent “proof by picture” result — in response to a question posed by the same Ryan O’Donnell! — that I’m still expanding on.) What would this look like in contexts other than math? Would it be a good thing? In general, what do you think?

