Progress report:

  • Wikipedia import works without hitches now
  • UI is progressing, search works, some drill-down options are WIP
  • Vector embedding is now a separate process that can be farmed out to community provided compute
  • Help pages written
  • DB optimized

Next up:

  • Implement the missing drill-down/search refinement options
  • Fix bugs in the web crawler before I unleash it onto domains that are not mine
  • Package the compute process so that anybody can run it (will need a Vulkan-capable GPU and 1 GB of VRAM)

Some screenshots attached (iPhone mobile view)

#diySearchEngine

Progress on the search result interface (Mockup, implementation follows tomorrow)

Two pictures attached: First one is a clean screenshot, second one has fields marked:

  • Red: Search rank, so you can see if the pages match better or worse
  • Green: Website-defined keywords, which act as a kind of web directory when you click on them. How the result page for a keyword is ranked is not defined yet
  • Blue: Refinement options; search for similar pages to this result, search who links to that result and look where this result links to
  • Brown: Search mode switch: exact does a full text search, similarity does a vector search
  • Pink Arrow: The search box is actually a textarea that can be expanded, because similarity search gets more useful with more context. Write an essay to find results if you want, or paste a paragraph you suspect of plagiarism to find the probable source.
  • Yellow Arrows: Additional metadata such as when the page was last updated and when it was last crawled (if the last crawl is too far in the past it will be highlighted in red to make clear that the result could be stale)

The idea for all the refinement options is to keep track of how you got to the final search result you leave the search engine with, so it will somehow present your meandering path through the results.
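A minimal sketch of how such a path could be recorded, assuming an append-only list of steps. The names `TrailStep`, `SearchTrail` and the action labels are hypothetical and only mirror the refinement options listed above; nothing here is the actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical action labels, mirroring the refinement options described above.
ACTIONS = {"query", "similar", "links_to", "linked_from", "keyword"}


@dataclass
class TrailStep:
    action: str                 # one of ACTIONS
    target: str                 # query text, keyword, or result URL
    mode: Optional[str] = None  # "exact" or "similarity" for query steps


@dataclass
class SearchTrail:
    steps: List[TrailStep] = field(default_factory=list)

    def add(self, action: str, target: str, mode: Optional[str] = None) -> None:
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action}")
        self.steps.append(TrailStep(action, target, mode))

    def render(self) -> str:
        # Present the meandering path as a single readable line.
        return " -> ".join(f"{s.action}({s.target})" for s in self.steps)


trail = SearchTrail()
trail.add("query", "matryoshka embeddings", mode="similarity")
trail.add("similar", "https://example.org/a")
trail.add("linked_from", "https://example.org/b")
print(trail.render())
```

Keeping the trail as plain data would make it easy to show it as breadcrumbs or to replay a step, but that is a design guess, not something decided in the posts.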

#diySearchEngine

Search engine project seems possible with the research I have done. What it should be in the end is something like Google search as it was before they dropped "don't be evil" from their motto. No ads, just search results, but enhanced with modern technology. I want to have some kind of "Research Mode" as well, so you can decide whether you want exact matches (full text search) or similar items (vector search), and then "drill down" from a search result to find results similar to the selected one (or pages that link there, etc.).

The plan for deploying the search engine is to keep the index on a shared PostgreSQL cluster and run some crawler nodes, with additional crawlers hosted by the community. The frontend could be expanded by community-hosted instances as well; the base implementation will basically be an API and a basic search frontend.

For vector embedding I think it would make sense to do it "Folding at Home"-style: do only the full text indexing on the PostgreSQL server and run (possibly GPU-enabled) worker processes for the embedding on nodes provided by the community (in addition to some slow CPU-only inference on the server).

So you could basically run an inference node on your desktop, which would use the compute power of your GPU to provide vector indexing resources to the search engine "to give back" something, or provide crawler nodes to spread the load. The search itself does not need a GPU, just the indexing.
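A rough sketch of how handing out embedding work to community nodes could look, assuming a lease-based job queue: a worker claims a batch of chunks, and chunks from workers that disappear become available again after a timeout. Everything here is hypothetical (the class `EmbeddingQueue`, its methods, the lease length) and it runs in memory only to keep the example self-contained.

```python
import time
from typing import Dict, List, Optional, Tuple


class EmbeddingQueue:
    """In-memory stand-in for a server-side job table (hypothetical)."""

    def __init__(self, chunk_ids: List[int], lease_seconds: float = 300.0):
        self.pending: List[int] = list(chunk_ids)
        self.leased: Dict[int, Tuple[str, float]] = {}  # chunk_id -> (worker, expiry)
        self.done: Dict[int, List[float]] = {}
        self.lease_seconds = lease_seconds

    def claim(self, worker: str, batch_size: int, now: Optional[float] = None) -> List[int]:
        now = time.time() if now is None else now
        # Put chunks whose lease ran out back onto the pending list.
        for cid, (_, expiry) in list(self.leased.items()):
            if expiry <= now:
                del self.leased[cid]
                self.pending.append(cid)
        batch, self.pending = self.pending[:batch_size], self.pending[batch_size:]
        for cid in batch:
            self.leased[cid] = (worker, now + self.lease_seconds)
        return batch

    def submit(self, worker: str, chunk_id: int, vector: List[float]) -> bool:
        lease = self.leased.get(chunk_id)
        if lease is None or lease[0] != worker:
            return False  # lease was reclaimed or belongs to another worker
        del self.leased[chunk_id]
        self.done[chunk_id] = vector
        return True


queue = EmbeddingQueue([1, 2, 3], lease_seconds=60)
batch = queue.claim("desktop-gpu", batch_size=2, now=0.0)
queue.submit("desktop-gpu", batch[0], [0.1, 0.2])  # vector values are made up
```

On a real PostgreSQL server the same idea is commonly built on `SELECT ... FOR UPDATE SKIP LOCKED`; whether this project does that is not stated in the posts.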

Would you contribute to such an effort?

#diySearchEngine

Not perfect but workable. User interface for the browser next, then the web crawler.

#diySearchEngine

For the search index I am saving:

  • Chunk text in Markdown
  • Matryoshka Embedding vectors (512 dims)
  • Embedding vectors are half precision
  • Postgres Full Text search Vector
  • Combined Document vector (same dims and precision)

That leads to a projected index size of 380 GB for just English Wikipedia... Much better than the 1000 GB predicted before, and the search results are fine.
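To make the storage numbers concrete, here is a small sketch of what "Matryoshka, 512 dims, half precision" means per vector: keep the first 512 components, rescale to unit length, and store each component in 2 bytes. The 768-dimensional input is made up and only stands in for real model output; the helper names are hypothetical.

```python
import math
import struct
from typing import List

DIMS = 512  # Matryoshka prefix length named in the post


def truncate_and_normalize(vec: List[float], dims: int = DIMS) -> List[float]:
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def to_half_bytes(vec: List[float]) -> bytes:
    """Pack as little-endian IEEE 754 half-precision floats (2 bytes each)."""
    return struct.pack(f"<{len(vec)}e", *vec)


# Made-up 768-dim vector standing in for real model output.
full = [math.sin(i + 1) for i in range(768)]
small = truncate_and_normalize(full)
packed = to_half_bytes(small)
print(len(small), len(packed))  # 512 components, 1024 bytes
```

So one stored vector is 1 KiB of raw data before any index overhead, compared with 2 KiB for the same 512 dimensions at single precision.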

Embedding model used right now is Jina-v5-nano-retrieval in Q4_K_M quantization.

Searching 310k documents and 1.5 million chunks still takes less than a millisecond (embedding the search query adds 8-10 milliseconds of overhead, the search itself is around 0.4 ms). Full text search with a tsvector is around 0.4 ms as well.
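The two search modes could map to queries roughly like the ones below. This is a sketch under several assumptions that are not in the posts: that the vectors live in the pgvector extension as `halfvec(512)`, a hypothetical table `chunks(id, body_md, fts, embedding)`, and psycopg-style named placeholders. The function only builds SQL strings, it does not talk to a database.

```python
from typing import List

# Hypothetical schema: chunks(id, body_md, fts tsvector, embedding halfvec(512)).


def build_query(mode: str, limit: int = 10) -> str:
    if mode == "exact":
        # PostgreSQL full text search against a stored tsvector column.
        return (
            "SELECT id, ts_rank(fts, q) AS score "
            "FROM chunks, websearch_to_tsquery('english', %(text)s) AS q "
            "WHERE fts @@ q "
            f"ORDER BY score DESC LIMIT {int(limit)}"
        )
    if mode == "similarity":
        # pgvector cosine distance operator; smaller distance means closer.
        return (
            "SELECT id, 1 - (embedding <=> %(vec)s::halfvec) AS score "
            "FROM chunks "
            f"ORDER BY embedding <=> %(vec)s::halfvec LIMIT {int(limit)}"
        )
    raise ValueError(f"unknown search mode: {mode}")


def to_vector_literal(vec: List[float]) -> str:
    """Format a query embedding as the text literal pgvector accepts."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"


print(build_query("exact", limit=5))
print(build_query("similarity", limit=5))
```

Ordering directly by the `<=>` expression matters in the similarity case, since that is the form an approximate vector index can serve.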

Index sizes for the import:

  • FTS chunk: 1.9 GB
  • Vector chunk: 1.9 GB
  • Vector document: 404 MB

So to fully index Wikipedia I expect the index to grow linearly (so around 9 GB for document vectors), which sounds doable with a 64 GB RAM server (including all the reference links)
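The linear extrapolation can be checked in a few lines. The index sizes and document count come from the numbers above; the total of roughly 7 million English articles is an assumption added here, not a figure from the posts.

```python
# Numbers from the post.
imported_docs = 310_000
doc_vector_index_mb = 404
chunk_vector_index_gb = 1.9
fts_index_gb = 1.9

# Assumption, not from the post: roughly 7 million English articles.
total_docs = 7_000_000

scale = total_docs / imported_docs
doc_vector_index_gb = doc_vector_index_mb * scale / 1000

print(f"scale factor: {scale:.1f}x")
print(f"document vectors: {doc_vector_index_gb:.1f} GB")
print(f"chunk vectors: {chunk_vector_index_gb * scale:.0f} GB")
print(f"FTS: {fts_index_gb * scale:.0f} GB")
```

Under that assumption the document vector index lands at about 9 GB, matching the estimate; the chunk-level indexes scale by the same factor.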

#diySearchEngine

And if I am bored: I already collected 3.2 million external links to crawl from about 300k Wikipedia articles.

Sounds like I really have to implement that web crawler now.

#diySearchEngine

Importing Wikipedia at 1200 articles/minute (from a dump, not crawling, I am not evil...)

Bottleneck seems to be something in llama.cpp: the CPU usage of the two inference processes is pegged at 100% but doesn't go over that, while the graphics cards only go to 70%.

Sounds like something that could be multi-threaded is only single-threaded, as parallelism for the inference is set to 4 and 16 for the GPUs and I rarely schedule a batch of chunks that is bigger than 6 or 8.

The crawlers are at 15% CPU, I have 16 cores at 5.6 GHz on this CPU so it's not the CPU either.

Yet another yak?

#yay #diySearchEngine

Indexing is slow as hell...

CUDA GPU is an Nvidia GeForce GTX 1080,
ROCm GPU is a Radeon RX 7900 XTX.

This is using EmbeddingGemma-300M... have to wait to see if the search results will be better than a faster BERT model.

#diySearchEngine

Proof you can absolutely use a PC as a space heater.

But I am just indexing a Wikipedia dump!

At least the workers are stable now, but the CPU has room... perhaps I'll start a third one with CPU only inference...

#diySearchEngine

Orrr turns out if you try to parse a language which has context with a regex it won't work that well... (Or in this case the regex runs into catastrophic backtracking, which is effectively an infinite loop)
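To illustrate the point: nested constructs such as `{{...}}` templates need a depth counter, which a plain regex does not have. The sketch below is not a WikiText parser and not the code used for the import. It is a hypothetical helper `strip_templates` that shows a single linear pass with no backtracking, and it ignores triple braces and everything else WikiText throws at you.

```python
def strip_templates(text: str) -> str:
    """Remove {{...}} templates, including nested ones, in one linear pass."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Only keep characters that are outside of every template.
            if depth == 0:
                out.append(text[i])
            i += 1
    return "".join(out)


sample = "Intro {{Infobox|name={{lang|de|Haus}}|x=1}} body text"
print(strip_templates(sample))
```

Every loop iteration consumes at least one character, so the runtime stays proportional to the input length no matter how deeply the templates are nested.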

Does nobody know how to write parsers anymore?

And no, I won't shave that yak, WikiText is just catastrophically bad...

#diySearchEngine