1. Is the given tok/s estimate the total node throughput, or what you can realistically expect to get? Or is it the worst-case throughput if everyone starts using it simultaneously?

2. What if I try to hog all of a node's resources by running some large data-processing job and making multiple queries in parallel? What if I try to resell the access by charging per token?

Edit: sorry if this comment sounds overly critical. I think that pooling money with other developers to collectively rent a server for LLM inference is a really cool idea. I also thought about it, but never found a satisfactory answer to question 2 above, so I decided it is infeasible in practice.

Yeah, I was also thinking about it A LOT.

We kinda already have a little bit of it, with some coding harnesses giving the model access to an LSP, but I think we could inject this knowledge at a lower level if we found a clever way to use it during sampling.

I think that there is a lot of low hanging fruit in this area.

And in general, I think people reach for LLMs too often to solve problems that can easily be solved by computationally cheaper and, more importantly, deterministic tools.

For example, back in the day when LLM-assisted coding was just becoming a thing, people very often complained about models generating syntactically incorrect code and inventing non-existent library methods.

Well, I, an experienced human programmer, would probably also make syntax mistakes and invent non-existent methods if you stripped me of my tools and made me write code in a bare text editor without syntax highlighting.

Thankfully, my IDE autocompletes real syntax and library methods that actually exist, and gives me immediate feedback if I make a mistake anyway. And all of that is achieved with reliable, deterministic code, free of the inherent issues of statistical models.

I think that it is really inefficient to reach for an expensive and unreliable tool when a cheap and reliable tool will do.

I've always thought that it is kinda weird that we spend exactly the same amount of compute to calculate both "fork" tokens and "lock" tokens.

I think that with grammar-aware sampling / constrained decoding [0][1] it is sometimes possible to skip calling the model altogether: when the grammar allows only one token, you can just insert it. But I don't think any of the current, widely used model/harness combinations do this, and it would only skip inference in rare edge cases anyway.
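
Rough sketch of what I mean, in Python. The grammar oracle, string "tokens", and logits callable here are made up for illustration; real implementations (llama.cpp grammars, Outlines, etc.) mask token-id logits instead, but the control flow is the same idea: only pay for a forward pass when the grammar actually leaves a choice open.

    import math
    import random
    from typing import Callable, Sequence

    def constrained_generate(
        prefix: Sequence[str],
        allowed_next: Callable[[Sequence[str]], set[str]],          # grammar oracle: prefix -> legal next tokens
        model_logits: Callable[[Sequence[str]], dict[str, float]],  # the expensive model call
        max_tokens: int = 32,
    ) -> list[str]:
        out = list(prefix)
        for _ in range(max_tokens):
            allowed = allowed_next(out)
            if not allowed:
                break                                # grammar says the output is complete
            if len(allowed) == 1:
                out.append(next(iter(allowed)))      # forced token: skip the model entirely
                continue
            logits = model_logits(out)               # real choice point: run the model
            legal = {t: v for t, v in logits.items() if t in allowed}
            if not legal:                            # model proposes nothing legal: fall back to the grammar
                out.append(next(iter(allowed)))
                continue
            weights = [math.exp(v) for v in legal.values()]
            out.append(random.choices(list(legal), weights=weights)[0])
        return out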

I wonder if there is a more general solution that lets models spend more compute on the important choices while making the "obvious" tokens cheaper and faster to generate.

[0] https://github.com/ggml-org/llama.cpp/blob/master/grammars/R...

[1] https://developers.redhat.com/articles/2025/06/03/structured...

> Claude often comments that my version is better and cleaner.

Every comment I make is a "really perceptive observation" according to Claude, and every question I ask is either "brilliant" or at least "good", so...