Are there useful or interesting ways to use LLMs other than prompting them? I feel like compressing all the text in the world via a hierarchically structured statistical model is probably useful, but that we're using it in a way that is unlikely to do what we'd hope.

Like everyone I'm impressed and amazed by what they can do, but also very frequently nonplussed by their stupidity. The amazing things convince me there's something important here, but the stupidity seems to have a consistent character that makes me think we're using them in a non-optimal way.

For example, I would expect them to be good at gathering text from their training data that is talking about the same thing in two different ways. This seems like it would be very helpful for synthesizing views on a complex question, but in practice I don't think they are.

I think that part of synthesizing multiple views is building a mental model of the underlying meaning, finding the points of disagreement, and putting that into a new framing. This feels like an inherently back-and-forth process that LLMs can't do by their very structure.

"Reasoning" models with chain of thought get a little way towards this but they feel like an overkill solution that also isn't enough to really address it. But I have to admit I don't know much about their internals and I've never really had the chance to use them myself.

Another aspect is that I'm not sure it's possible to train models to produce truth, in some sense. I feel like we learn this by living in the world and trying to use the imperfect pieces of knowledge and skills we have to achieve stuff. Without that connection, can it go beyond compression?

So that's why I'm wondering if there is another way to use LLMs that more clearly makes use of the fact that they're an incredible compression scheme? Search seems like one possibility but maintaining sources would undermine their compressing role I guess.

Maybe generating good keywords and alternative phrases that people use when talking about something, which could then be the starting point of a literature search? Has anyone tried using them in this way just via prompting? Or maybe there's another way to use the core model without prompting?
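As a concrete sketch of the keyword-generation idea, something like the following could work, assuming the openai Python client; the model name, prompt wording, and topic are placeholders rather than a recommendation:

```python
# Hypothetical sketch: ask a hosted LLM for alternative phrasings of a topic,
# then use them to seed a literature search. Model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

topic = "adaptation in auditory nerve fibres"
prompt = (
    "List 10 alternative keywords and phrases that researchers use when "
    f"writing about: {topic}. One per line, no commentary."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# Split the reply into a clean list of search terms
keywords = [line.strip("-• ").strip()
            for line in response.choices[0].message.content.splitlines()
            if line.strip()]
print(keywords)  # feed these into PubMed / Google Scholar / Semantic Scholar queries
```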

@neuralreckoning there certainly are lots of possible applications of llms-beyond-chatbots. It's early days but besides the "deep research" applications, https://arxiv.org/abs/2502.18864 is perhaps along the lines of what you describe for synthesizing research hypotheses, and https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf is one way of tackling the "external truth" problem (by using verifiable rewards).
Towards an AI co-scientist (arXiv.org)
@benm yeah I think both of those papers are precisely the ways I wouldn't expect LLMs to work well. I personally don't find them convincing, but I guess we'll see if anything interesting gets generated by them. I haven't seen it yet.

@neuralreckoning

I'll give you one tip.

Think of a model that's difficult to explain, like a mathematical model you want to build a good understanding of. Then prompt Gemini 2.5 Pro or ChatGPT to create a Google Colab that models that thing, along with knobs to change the different variables of the model, or perhaps to show you how slightly different variants of the model behave differently.

Then let someone twiddle with the knobs for a bit to understand the model.

Try explaining that as fast with other methods. It's an extremely fast way to communicate these sorts of ideas. The fact that it's a Google Colab makes it very portable: you can send a link to someone and boom, they have it on their computer without installing a bunch of libraries etc.
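For a sense of what such a notebook cell might contain, here is a minimal sketch using ipywidgets; the Hill function is just a stand-in, any two-parameter equation works the same way:

```python
# Minimal sketch of the "Colab with knobs" idea: interactive sliders over a
# two-parameter model. The Hill function is only a stand-in example.
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact, FloatSlider

x = np.linspace(0, 10, 400)

def plot_hill(K=2.0, n=1.0):
    """Plot the Hill function y = x^n / (K^n + x^n) for the chosen parameters."""
    y = x**n / (K**n + x**n)
    plt.figure(figsize=(5, 3))
    plt.plot(x, y)
    plt.ylim(0, 1.05)
    plt.xlabel("x")
    plt.ylabel("y")
    plt.title(f"Hill function, K={K:.1f}, n={n:.1f}")
    plt.show()

# In Colab/Jupyter this renders two sliders above the plot.
interact(plot_hill,
         K=FloatSlider(min=0.1, max=5.0, step=0.1, value=2.0),
         n=FloatSlider(min=0.5, max=4.0, step=0.1, value=1.0))
```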

@BlueBee @neuralreckoning could you link an example please?

@nsmarkov @neuralreckoning

I can give you a prompt, but not today. Not enough time left in the day.

I'll think of a good example.

@nsmarkov @neuralreckoning

So crazy!

But when attempting to do a little experimentation with this, I found that Gemini 2.5 Pro might be a better coder than ChatGPT, which was unexpected.

At least for the specific thing I was doing.

One issue I'm running into is that I have no real world problem I'm trying to solve that can serve as inspiration. I can't pull one from work for reasons...

Trying to fabricate one with no inspiration is proving difficult.

I'll try some more and get back to you.

But as some direction, if people have any ideas: think of equations you might put into a graphing calculator, where the point is being able to visualize how the variables affect the output. Sometimes we have these equations and it's hard to understand what they are doing, but once you can play with the variables and see the changes in real time you get a much more intuitive sense of them.

Writing these can sometimes be time-consuming, so being able to write a quick description of what you are trying to model and have Gemini 2.5 Pro pop out a website or a Colab that visualizes the equation or simulation can be a really good way to communicate the idea.
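As a rough illustration of the kind of standalone script you might ask for outside a notebook, here is a sketch using matplotlib's built-in sliders; the damped oscillator is just an arbitrary example equation:

```python
# Rough sketch: sliders over a damped oscillator x(t) = exp(-g*t) * cos(2*pi*f*t).
# The equation is only an illustrative choice.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

t = np.linspace(0, 10, 1000)

fig, ax = plt.subplots()
plt.subplots_adjust(bottom=0.25)  # leave room for the sliders below the plot
line, = ax.plot(t, np.exp(-0.5 * t) * np.cos(2 * np.pi * 1.0 * t))
ax.set_xlabel("t")
ax.set_ylabel("x(t)")

ax_damp = plt.axes([0.15, 0.10, 0.7, 0.03])
ax_freq = plt.axes([0.15, 0.05, 0.7, 0.03])
s_damp = Slider(ax_damp, "damping", 0.0, 2.0, valinit=0.5)
s_freq = Slider(ax_freq, "frequency", 0.1, 3.0, valinit=1.0)

def update(_):
    # Recompute the curve whenever either slider moves
    line.set_ydata(np.exp(-s_damp.val * t) * np.cos(2 * np.pi * s_freq.val * t))
    fig.canvas.draw_idle()

s_damp.on_changed(update)
s_freq.on_changed(update)
plt.show()
```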

@neuralreckoning Why do you say they aren't good at synthesizing different points of view? Is that from personal experience, or based on the world-model requirement you think should be involved? I think they provide a pretty good approximation of that based on the training corpus, though of course without a model or any guarantees.

To me it seems the design of the autoregressive LLM and its input data is pretty limiting for other applications. If words were tokens, you could look up commonly used synonyms from embedding distances, but I'm not sure how to do that for a subsequence of tokens.
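One possible workaround for the subsequence problem is to embed whole phrases rather than single tokens, for example with a sentence-embedding model; a rough sketch, where the model name and candidate phrases are placeholders:

```python
# Possible workaround: embed whole phrases with a sentence-embedding model and
# rank candidates by cosine similarity. Model and phrase list are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "adaptation in auditory nerve fibres"
candidates = [
    "spike-rate adaptation in the auditory nerve",
    "short-term plasticity at the hair cell synapse",
    "refractoriness of cochlear neurons",
    "visual cortex orientation tuning",  # unrelated, should rank low
]

query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)

# cos_sim returns a 1 x N matrix of similarities; take the single row
scores = util.cos_sim(query_emb, cand_embs)[0]
for phrase, score in sorted(zip(candidates, scores.tolist()), key=lambda p: -p[1]):
    print(f"{score:.2f}  {phrase}")
```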

@neuralreckoning
FWIW, the commercial offerings are heavily doctored experiences-of-LLMs, and the workplace ones reflect the home user product expectations in my opinion. So the overt uselessness, "what-do-you-think"ery and whiny tone of LLMs are manifestations of "corporate ethics" RLHF training sweatshops.

An interesting trail, especially before it got junkified by commercial marketing attention, was the goal of large models working on finding new formal mathematical proofs.

@nsmarkov maybe I can give an example. In the auditory neuroscience modelling world, there were two models of adaptation in the auditory nerve fibre that started off quite different from each other. Over time, each group added features to their model to account for things the other model captured that theirs didn't. At some point someone wrote a paper showing that, in their evolved forms, the two models had actually become mathematically equivalent, even though formally they look different.

I think that's an example of the sort of thing I wouldn't expect an LLM to be able to do without being explicitly given the idea that the models might be the same, because I can't see how it could. It would pick up on the fact that there were two models in the literature and that there is some disagreement between the two, but it couldn't realise that they were actually formally the same model, because that would involve having an intuition that they might be and then doing a multi-stage calculation unprompted.
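As a toy illustration of the kind of check involved (not the actual auditory-nerve models), here are two expressions that look different on the page but are algebraically identical; confirming that takes a deliberate symbolic calculation:

```python
# Toy illustration only: two superficially different forms of the same response.
# A computer algebra system can verify the equivalence mechanically.
import sympy as sp

t, tau, A = sp.symbols("t tau A", positive=True)

# "Model 1": written as a gated exponential
model_1 = A * (1 - sp.exp(-t / tau)) * sp.exp(-t / tau)

# "Model 2": the same response written as a difference of two exponentials
model_2 = A * sp.exp(-t / tau) - A * sp.exp(-2 * t / tau)

print(sp.simplify(model_1 - model_2))  # prints 0: the two forms are equivalent
```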

@neuralreckoning thank you for the example, I like it, and it highlights a complicated question well. And I agree, I can't imagine how the current LLM generation would do it.

Potentially, it could work like this: when you ask it for a summary of the field, it should understand that you're asking about the meaning of the models rather than what is reported in the literature (otherwise it would just report the same converged model back to you under different names). Then it should build an abstract representation of them (e.g. DAGs), use that to deduplicate the models, and unify the two.
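A toy sketch of what that deduplication step could look like, with each model encoded as a small labelled DAG and compared by graph isomorphism; the node labels and model structures are invented purely for illustration:

```python
# Toy sketch: encode each model as a labelled DAG of operations and test whether
# the two graphs match. Labels and structures are made up for illustration.
import networkx as nx
from networkx.algorithms.isomorphism import DiGraphMatcher

def build_model_graph(edges, ops):
    """Build a DAG whose nodes carry an 'op' label (e.g. 'gain', 'lowpass')."""
    g = nx.DiGraph()
    for node, op in ops.items():
        g.add_node(node, op=op)
    g.add_edges_from(edges)
    return g

# Two descriptions of the "same" model using different node names
model_a = build_model_graph(
    edges=[("in", "f1"), ("f1", "g1"), ("g1", "out")],
    ops={"in": "input", "f1": "lowpass", "g1": "gain", "out": "output"},
)
model_b = build_model_graph(
    edges=[("x", "filt"), ("filt", "scale"), ("scale", "y")],
    ops={"x": "input", "filt": "lowpass", "scale": "gain", "y": "output"},
)

# Structure and operation labels match, so the two descriptions are duplicates
matcher = DiGraphMatcher(model_a, model_b,
                         node_match=lambda a, b: a["op"] == b["op"])
print(matcher.is_isomorphic())  # True
```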

To me personally, extracting abstract representations from language and then reasoning in that abstract representation space are quite interesting directions, but honestly I haven't researched the literature on this.