I was able to use an extended conversation with an AI https://chatgpt.com/share/68ded9b1-37dc-800e-b04c-97095c70eb29 to help answer a MathOverflow question https://mathoverflow.net/questions/501066/is-the-least-common-multiple-sequence-textlcm1-2-dots-n-a-subset-of-t/501125#501125 . I had already conducted a theoretical analysis suggesting that the answer to this question was negative, but needed some numerical parameters verifying certain inequalities in order to conclusively build a counterexample. Initially I asked the AI to supply Python code to search for a counterexample that I could run and adjust myself, but found that the run time was infeasible, and that the initial choice of parameters would have made the search doomed to failure anyway. I then switched strategies and instead engaged in a step-by-step conversation with the AI, in which it performed heuristic calculations to locate feasible choices of parameters. Eventually, the AI produced parameters that I could then verify separately (admittedly using Python code supplied by the same AI, but this was a simple 29-line program that I could visually inspect to confirm that it did what was asked, and it also produced numerical values in line with the previous heuristic predictions).
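For readers unfamiliar with the sequence in the MathOverflow question, the underlying object is L(n) = lcm(1, 2, …, n). The following is just an illustrative sketch of how one computes this sequence in Python; it is not the 29-line verification program from the conversation, whose details are not reproduced here.

```python
from math import gcd

def lcm_sequence(n_max):
    """Yield L(n) = lcm(1, 2, ..., n) for n = 1 .. n_max."""
    L = 1
    for n in range(1, n_max + 1):
        # lcm(L, n) = L * n / gcd(L, n); exact integer arithmetic throughout
        L = L * n // gcd(L, n)
        yield L

# First few terms of the sequence:
print(list(lcm_sequence(10)))
# → [1, 2, 6, 12, 60, 60, 420, 840, 2520, 2520]
```

Note that L(n) only changes when n is a prime power, which is why consecutive terms often repeat.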

Here, the AI tool was a significant time saver: doing the same task unassisted would likely have required multiple hours of manual coding and debugging (the AI was able to use the provided context to spot several mathematical mistakes in my requests, and fix them before generating code). Indeed, I would have been very unlikely to even attempt this numerical search without AI assistance (and would instead have sought a theoretical asymptotic analysis).

I encountered no issues with hallucinations or other AI-generated nonsense. I think the reason for this is that I already had a pretty good idea of the tedious computational tasks that needed to be performed, and could explain them in detail to the AI in a step-by-step fashion, with each step confirmed in conversation with the AI before moving on to the next. After switching to the conversational approach, external validation with Python was only used at the very end, when the AI generated numerical outputs that it claimed obeyed the required constraints (which they did).
@tao that said recent chain of thought work has made hallucination a lot better. I’m still curious what sorts of solutions will come from the neuro symbolic approach and how much incremental value these approaches will have.
@mchav When you say that hallucination is a lot better now, what do you mean? That it is less frequent? That it is more obviously wrong, so it is easier to spot? That it is as frequent as before but somehow less incorrect?
@oantolin @mchav As a rough analogy, without chain-of-thought the model is blurting out the first thing that comes to mind. With chain-of-thought, the model will think about the options, try different solutions, double-check its work, etc. And testing shows that the more it thinks, the more likely it is to get the right answer.
@erjiang So, of the various options I described you claim hallucinations are less frequent?
@oantolin Yes, and on chatgpt the reasoning models tend to search more which further reduces hallucinations. Although I’m not sure how you are defining hallucinations vs incorrectness.
@erjiang I don't know how to define "hallucinations" either, I use that word because I see other people use it. I don't know if people who use it actually know what they mean by it or not. Let's say for purposes of this discussion I meant "incorrectness" the entire time, which is the thing I actually care about.
@oantolin Ok yeah, so if we’re talking about the domain of mathematics, then the model’s probability of getting the answer right scales with the amount of thinking (aka “test-time compute”), though non-linearly and only up to a certain point. Since GPT-5 Pro spends a lot more compute than GPT-5 Thinking, it’s much less likely to give e.g. an incorrect derivation or result. Obviously this doesn’t mean infinite compute will solve a Millennium Problem though.
@oantolin If you don’t have a Pro subscription and you have any problems that you feel are a bit out of reach of GPT-5, feel free to send me a prompt and I’ll run it through GPT-5 Pro and see how it fares.