Autoresearch on an old research idea

https://ykumar.me/blog/eclip-autoresearch/

Autoresearch on an old research idea | Blog | Yogesh Kumar

Yogesh Kumar's personal website

I often use LLMs to explore prior art and maybe find some alternative ways of thinking of problems. About 90% of what it tells me is useless or inapplicable to my domain due to a technicality it could not have known, but the other 10% is nice and has helped me learn some great new things.

I can’t imagine letting an agent try everything that the LLM chatbot had recommended ($$$). Often coming up in recommendations are very poorly maintained / niche libraries that have quite a lot of content written about them but what I can only imagine is very limited use in real production environments.

On the other hand, we have domain expert “consultants” in our leadership’s ears making equally absurd recommendations that we constantly have to disprove. Maybe an agent can occupy those consultants and let us do our work in peace.

> agent try everything that the LLM chatbot had recommended ($$$)

A lot depends on whether it is expensive to you. I use Claude Code for the smallest of whims and rarely run out of tokens on my Max plan.

Our experiments aren’t free. We use cloud infrastructure. An experiment costs on the order of tens of dollars, so massively parallelizing “spaghetti at wall” simulators is costly before we even talk about LLMs.

I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember or things where even being flat out wrong is okay and you just do it yourself.

For all the folks spending a lot of time and energy in setting up MCP servers, AGENTS.md, etc. I think this represents more that the LLM cannot do what it is being sold as by AI boosters and needs extreme amounts of guidance to reach a desired goal, if it even can. This is not an argument that the tech has no value. It clearly can be useful in certain situations, but this is not what OpenAI/Anthropic/Perplexity are selling and I don’t think the actual use cases have a sustainable business model.

People who spend the energy to tailor the LLMs to their specific workflows and get it to be successful, amazing. Does this scale? What’s going to happen if you don’t have massive amounts of money subsidizing the training and infrastructure? What’s the actual value proposition without all this money propping it up?

> I find LLMs useful in regurgitating one-liners that I can’t be bothered to remember

I found LLMs make a fabulous frontend for git :-D

ah, you've found the danger zone!

> I find LLMs useful in regurgitating one-liners

This was the case for me a year ago. Now Claude or Codex are routinely delivering finished & tested complete features in my projects. I move much, much faster than before and I don’t have an elaborate setup - just a single CLAUDE.md file with some basic information about the project and that’s it.

People keep saying this and I agree Claude has gotten a lot better even in my own experience, but I think the value is questionable.

What’s the point of adding features that are inscrutable? I have gotten Claude to make a feature and it mostly works and if it doesn’t work quite right I spend a massive amount of time trying to understand what is going on. For things that don’t matter too much, like prototyping, I think it’s great to just be able to get a working demo out faster, but it’s kind of terrifying when people start doing this for production stuff. Especially if their domain knowledge is limited. I can personally attest to seeing multiple insane things that are clearly vibe coded by people who don’t understand things. In one case, I saw API keys exposed because they were treating database users as regular user accounts for website login auth.

> I move much, much faster than before

This is a bad metric as has been attested multiple times in unrelated situations. Moving faster is not necessarily productivity nor is it value.

I think the main value lies in allowing the agent to try many things while you aren't working (when you are sleeping or doing other activities), so even if many tests are not useful, with many trials it can find something nice without any effort on your part.

This is, of course, only applicable if doing a single test is relatively fast. In my work a single test can take half a day, so I'd rather not let an agent spend a whole night doing a bogus test.

Try this if the main link is not responsive - https://archive.is/6xLiU

> “ The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in.”

Good lens.

The crux of the auto research repo is basically one file - program.md which is a system prompt that can be summarized as “do this in a loop: improve train.py, run the training, run evals, record result. Favor simplicity”. The other files are an arbitrary ML model that is being trained.

Take some working code. Ask an LLM to fix bugs. Measure performance and test coverage. Feed the results back into the LLM. Repeat.

This has been the standard approach for more complex LLM deployments for a while now in our shop.

Using different models across iterations is also something I've found useful in my own experiments. It's like getting a fresh pair of eyes.

Can we modify this approach to get LLMs that are good at specific programming languages or frameworks? That seems to be where local LLMs could really shine.
It's just RL-everything.

Would love to have a small local model that only knows about rails and mvc web development

Alternatively, a modular model with multiple “experts” that I could mix and match for my specific stack

I don’t need the model to know all of the Internet plus 20 different human languages. I just want it to be really good with the stack of the project

pretty cool experiment, i thought about someone maybe doing this and am happy you did it in this way. nice writeup too. this made me giggle a bit:
"At one point it got tired of waiting for training to finish and just ended the conversation. I wouldn’t give it full autonomy just yet :)"

thanks for sharing your results and the road to them!