0 Followers
0 Following
2 Posts
Fred Weitendorf

Founder at Accretional (accretional.com). Building an agent mesh based on open source

Sponsor for statue.dev and previously at Google working on Serverless Infrastructure for Cloud Run and Cloud Functions

fred @ company or https://www.linkedin.com/in/fred-weitendorf-40b505b6/
This account is a replica from Hacker News. Its author can't see your replies. If you find this service useful, please consider supporting us via our Patreon.

Officialhttps://
Support this servicehttps://www.patreon.com/birddotmakeup

To make the most of these architectures I think the key is essentially moving more of the knowledge/capabilities out of the "weights" and into the complimentary parts of the system in a way that's proportionate to the capabilities of the hardware.

In the past couple months there's been a kind of explosion in small-models that are occupying a niche in this kind of AI-transcoding space. What I'm hoping we're right on the cusp of achieving is a similar explosion in what I'd call tool-adaptation, where an LLM paired with some mostly-fixed suite of tools and problem cases can trade off some generality for a specialized (potentially hyper-specialized to the company or user) role.

The thing about more transcoding-related tasks is that they in general stay in sync with what the user of the device is actively doing, which will also typically be closely aligned with the capabilities of the user's hardware and what they want to do with their computer. So most people aren't being intentional about this kind of stuff right now, partly out of habit I think, because only just now does it make sense to think of personal computer as "stranded hardware" now that they can be steered/programmed somewhat autonomously.

I'm wondering if with the right approach to MoE on local devices (which local llms are heading towards) we could basically amortize the expensive hit from loading weights in and out of VRAM through some kind of extreme batch use case that users still find useful enough to be worth the latency. LoRa is already really useful for this but obviously sometimes you need more expertise/specialization than just a few layers' difference. Experimenting with this right now. It's the same basic principle as in the paper except less of a technical optimization and more workload optimization. Also it's literally the beginning of machine culture so that's kind of cool

It's more complicated than that. Small specialized LLMS are IMO better framed as "talking tools" than generalized intelligence. With that in mind, it's clear why something that can eg look at an image and describe things about it or accurately predict weather, then converse about it, is valuable.

There are hardware-based limitations in the size of LLMs you can feasibly train and serve, which imposes a limit in the amount of information you can pack into a single model's weights, and the amount of compute per second you can get out of that model at inference-time.

My company has been working on this specifically because even now most researchers don't seem to really understand that this is just as much an economics and knowledge problem (cf Hayek) as it is "intelligence"

It is much more efficient to strategically delegate specialized tasks, or ones that require a lot of tokens but not a lot of intelligence, to models that can be served more cheap. This is one of the things that Claude Code does very well. It's also the basis for MOE and some similar architectures with a smarter router model serving as a common base between the experts.