Mastodawn

@ramin_hal9001 @kentpitman @chiply @karthink @someodd @screwlisp as someone who maintains a local knowledge base, I can tell you that "just narrow" won't work. Personally, you can maintain a controlled vocabulary for keyword search, but even for a single person that breaks over time as your knowledge evolves. LLM tech is actually useful here. Not per se, but for semantic search (so-called embeddings). It is pretty good at capturing relevant results. But even that, if we scale up, is tricky.

@yantar92

«…you can maintain a controlled vocabulary for keyword search…that breaks…as…knowledge evolves. LLM tech is…useful here…for semantic search…pretty good at capturing relevant results. But even that, if we scale up, is tricky.…»

The history of search follows a parallel track. Keywording of web pages was useful but could not keep up. Full text search won out because it could address scale.

I sum this up differently than many people seem to, though. I think we sacrificed the notion of a right answer for a sort order, so that we could accomplish the search ourselves on well-ordered data. It still leaves us a bit vulnerable to the sort predicate, but at least we can inspect what's going on.

Note that there was for a time a flirtation with "I'm Feeling Lucky" to just say "give me the first item, ignore the others". It seems obvious to me that this did not win, since even the option went away.

But now LLMs offer us ONLY "I'm Feeling Lucky" (dressed up as "I'm trusting you to work in my best interest" -- what could possibly go wrong?). One cannot inspect the near misses.

Even on technological grounds, this clearly has a cost. RAG makes search faster and for "ordinary things" it may be better. But the web used to be a place where you could search for the obscure thing and it would search everything. Now it narrows you to "just the likely places" before even starting the search. In effect, the new "SEO" will now be about making sure you're in the RAG set, but, importantly, the democratizing effect of a search that would at least search everything is gone.

If you have a hobby site that is the only source of something but your metaphor is ill-chosen, you'll get searched in the wrong set because the coarse categorization is wrong from the outset, and you are expected to pay money to even be in the game. That's a big step backward.

cc @ramin_hal9001 @chiply @karthink @someodd @screwlisp

@kentpitman @ramin_hal9001 @chiply @karthink @someodd @screwlisp but RAG does a full search. It is just a ranking.

@yantar92

I asked ChatGPT 5.5 what it thought of this question we were discussing and it says what I was trying to say in a way that satisfies me and maybe gets my point across better than I was doing, especially the second paragraph:

«At a coarse level, RAG is not simply “the LLM searches everything.” A RAG system first uses a retrieval layer to narrow a large corpus down to a small set of candidate chunks likely to be relevant to the query. In vector or hybrid RAG, that narrowing is often based not on literal keyword matching alone, but on learned representations: embeddings, semantic similarity, metadata filters, and sometimes rerankers. The generator then answers using only that selected context, plus whatever is in the prompt/model.

So yes: retrieval involves ranking, but the ranking is doing architectural work. It is not merely producing an ordered list for the model to inspect exhaustively; it is selecting what enters the model’s context window at all. In that sense, RAG is better understood as relevance-based subsetting followed by generation, often with ranking and reranking inside the subsetting step.»

cc @ramin_hal9001 @chiply @karthink @someodd @screwlisp

@kentpitman @ramin_hal9001 @chiply @karthink @someodd @screwlisp sure. For the purposes of an LLM, the RAG ranked list should be trimmed. But RAG is nothing but a similarity score. It is a number assigned to each searched entry in the db. It does not have to be trimmed.
*Edit*: to be precise, the RAG abbreviation applies to LLM retrieval in particular. But what I am referring to is one of the steps, which is similarity ranking. A better term would be vector search.
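
The distinction drawn here, a similarity score for every entry versus a trimmed list, can be sketched roughly as follows. This is only an illustration: the three-dimensional vectors and document names are made up for the example, and a real system would get its vectors from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_all(query_vec, db):
    """Score every entry and return the whole list, best first, untrimmed."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in db.items()]
    return sorted(scored, reverse=True)

# Hypothetical "embeddings"; the values are invented for illustration.
db = {
    "banking-regulation": [0.9, 0.1, 0.0],
    "sourdough-recipe":   [0.0, 0.2, 0.9],
    "cooking-as-banking": [0.4, 0.1, 0.8],
}
query = [1.0, 0.0, 0.0]

full = rank_all(query, db)   # every entry keeps its score and position
top_1 = full[:1]             # trimming is a separate, optional step

for score, doc_id in full:
    print(f"{score:.3f}  {doc_id}")
```

Nothing in `rank_all` drops an entry; the cut only happens in the slice, which is the step an LLM pipeline adds to fit its context window.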

@yantar92

Right, and what I'm saying is that certain kinds of content don't compete because they are screened "for efficiency" in an initial "likely relevance" pass without being given the same sense of focus.

This favors "obvious searches" and disfavors "searches for obscure things".

I asked GPT 5.5 again to comment on this second round of what you said and what I was going to reply (which I have not edited subsequent to asking it) and it offered this summary:

«In practice, ranking becomes filtering once the system only surfaces the top candidates. Vector/semantic search is great for “things like this,” but it can be worse for obscure exact needles: a rare phrase, quote, error string, or idiosyncratic reference may not score highly under the embedding model, even though literal search would have found it. So the issue is not whether every entry can theoretically get a similarity score; it’s whether the target survives retrieval into the surfaced candidate set.»

(I said I would quote it directly «as long as you're not tailoring it on some theory that i've requested you to agree with me. i'm just seeking neutral points of view here» and it confirmed «That’s a fair use of it, and yes — the point is neutral rather than tailored to make your side “win.”»)
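
The point in the quoted summary, that ranking becomes filtering once only the top candidates are surfaced, can be sketched with a toy example. Everything here is an assumption for illustration: the `TOPICS` lexicon is a crude stand-in for a learned embedding model, and the documents and the cutoff `k=2` are invented.

```python
import math

# Hypothetical topic lexicon standing in for a learned embedding model.
TOPICS = {
    "banking": {"bank", "loan", "interest", "deposit"},
    "cooking": {"simmer", "stock", "oven", "recipe"},
}

DOCS = {
    "loans": "bank loan interest rates and deposit accounts",
    "savings": "interest on a bank deposit grows over time",
    "soup-blog": "simmer the stock in the oven and let the flavor earn compound interest",
}

def embed(text):
    """Count how many words of the text fall under each topic."""
    words = text.lower().split()
    return [sum(w in vocab for w in words) for vocab in TOPICS.values()]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_top_k(query, k):
    """Rank all documents by similarity, then surface only the first k."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)
    return ranked[:k]

def literal_search(query):
    """Plain substring match over the full text of every document."""
    return [d for d, text in DOCS.items() if query.lower() in text.lower()]

query = "let the flavor earn compound interest"
surfaced = semantic_top_k(query, k=2)   # the exact source does not make the cut
exact = literal_search(query)           # literal search finds it directly

print("surfaced:", surfaced)
print("literal: ", exact)
```

The soup post still gets a score, but because the toy model reads the query as "about banking", the post ranks below both banking documents and falls outside the surfaced set.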

cc @ramin_hal9001 @chiply @karthink @someodd @screwlisp

@kentpitman @ramin_hal9001 @chiply @karthink @someodd @screwlisp I agree. AFAIK, real vector searches often employ a mixed ranking of keyword search + similarity. That said, keyword matches alone are not good enough because terminology is not always the same. Terminology also changes over the years. So, you need to maintain keyword similarity or aliases on top to make search work. Maybe @publicvoit can comment
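
A minimal sketch of such a mixed ranking with aliases, under stated assumptions: the `ALIASES` table, the documents, the equal weighting `alpha=0.5`, and the `SIMILARITY` numbers (placeholders for what a vector search would return) are all hypothetical.

```python
# Hypothetical alias table: older or alternative terms for the same concept.
ALIASES = {"blog": {"blog", "weblog"}}

DOCS = {
    "old-post": "notes on weblog search engines",
    "new-post": "blog search tips",
    "other": "gardening notes",
}

# Placeholder similarity scores; a real system would compute these
# from embeddings of the query and each document.
SIMILARITY = {"old-post": 0.6, "new-post": 0.7, "other": 0.1}

def keyword_score(query, text, use_aliases=True):
    """Fraction of query terms found in the text, optionally via aliases."""
    terms = query.lower().split()
    words = set(text.lower().split())
    hits = 0
    for term in terms:
        variants = ALIASES.get(term, {term}) if use_aliases else {term}
        if variants & words:
            hits += 1
    return hits / len(terms) if terms else 0.0

def hybrid_rank(query, alpha=0.5):
    """Blend keyword and similarity scores; alpha weights the keyword part."""
    def score(doc_id):
        kw = keyword_score(query, DOCS[doc_id])
        return alpha * kw + (1 - alpha) * SIMILARITY[doc_id]
    return sorted(DOCS, key=score, reverse=True)

query = "blog search"
ranking = hybrid_rank(query)
print(ranking)
print(keyword_score(query, DOCS["old-post"], use_aliases=False))  # "weblog" is missed
print(keyword_score(query, DOCS["old-post"], use_aliases=True))   # alias recovers it
```

The alias table is the part that needs ongoing maintenance as terminology drifts, which is the cost being described.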

@yantar92 @ramin_hal9001 @chiply @karthink @someodd @screwlisp @publicvoit

The thing that bugs me is that it used to be that you could easily find a page that, for example, spoke about the banking industry as a metaphor for some other thing, let's say cooking. But now when you say "find me this exact quote" it first undoes the quote and says "oh, this is a quote about the banking industry, those are usually found over here in the corpus of literature about banks". Then it doesn't find exactly the thing that would have been so distinctive, because the keywords on the item will say it is about "cooking", not "banking". So even if items are well-keyworded, unless someone tagged the post as being about banking (when metaphor really is not "aboutness"), calling up a quote like that will not find it.