Mastodawn

The Old C Dog May 11, 2024

I'm on here looking for text indexers and everything is 'lightning fast exoscale terafloops that scales to enterprise quantawarbles with polytopplic performanations' and it would be great if this industry could breathe into a bag until it remembers that one person with one computer is a constituency that matters.

mhoye May 11, 2024

If your “open source software” requires a datacenter-scale strata and is optimized for, or maybe only meaningful to, datacenter-scale problems, it is not open source in any way that matters. “Free as in corporate risk management” and “free as in labor arbitrage” are not aspirations.

stephen May 11, 2024

@mhoye This is one of the problems with Kubernetes.

mhoye May 11, 2024

Free as in your first hit is always free.

Jack Jackson May 11, 2024

@dmaonR respectfully, my dinky homelab k3s cluster, running on a couple of Raspberries Pi, begs to differ. Perfectly feasible and practical to run k8s on personal-scale hardware.

Unless I've misunderstood what problem you are referring to? In which case, apologies.

(Ok so *technically* I have added a beefy PowerEdge node to that cluster too - but that's because I *wanted* to, not because of scale requirements/limitations)

Benjamin Kwiecień 🇵🇸 May 11, 2024

@mhoye it can still matter, I think

idlestate's SDF liason acct May 11, 2024

"too big to fork"

❓ucblockhead May 11, 2024

@mhoye what’s your alternative for people who make data center scale software? Are you saying they should not open source what they make?

The Doctor May 12, 2024

@mhoye Can I quote you on that?

mhoye May 12, 2024

@drwho I said it, so why not.

@mhoye Do you want recommendations for your text-indexers question?

In the full text search department:
Take a look at Meilisearch (https://github.com/meilisearch/meilisearch)
or Apache Solr (https://solr.apache.org/guide/solr/latest/index.html), which is like Elasticsearch but without all the licensing kerfuffle

GitHub - meilisearch/meilisearch: A lightning-fast search engine API bringing AI-powered hybrid search to your sites and applications.

mhoye May 11, 2024

@4censord I do, thank you. My goals here are minimalism of implementation and wholly-local computation.

@mhoye ah, neither of these is what I would consider minimalist. Meilisearch is smaller than Solr, but Solr in particular has all the enterprise features of an enterprise Java project from the late 2000s.

Also, both run as a separate service that your application connects to, instead of being built into your application.

If you would rather have something that is built into your application, maybe the SQLite full text search module is better suited. (https://www.sqlite.org/fts5.html)
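To make the "built into your application" point concrete, here is a minimal sketch of FTS5 used in-process through Python's standard sqlite3 module. It assumes the SQLite bundled with your Python was compiled with FTS5 (common, but not guaranteed); the table name, column names and sample rows are made up for illustration.

```python
import sqlite3

# Assumption: the SQLite library behind this Python build includes FTS5.
conn = sqlite3.connect(":memory:")  # pass a file path for a persistent index
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path, body)")

# Hypothetical sample documents standing in for local documentation.
conn.executemany(
    "INSERT INTO docs (path, body) VALUES (?, ?)",
    [
        ("notes/backup.txt", "How to rotate backups with rsync and cron."),
        ("notes/search.txt", "Full text search without running a separate service."),
        ("notes/cron.txt", "Cron syntax cheat sheet for scheduled jobs."),
    ],
)


def search(query):
    """Return the paths of matching rows, ordered by FTS5's built-in rank."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY rank", (query,)
    )
    return [path for (path,) in rows]


print(search("cron"))
```

There is no daemon and no network port here: the index is a single SQLite database that lives inside whatever program opens it.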

SQLite FTS5 Extension

Tanguy ⧓ Herrmann May 12, 2024

@mhoye @4censord and do you need an app that does it for you with some configuration? Or more of a library you can call any which way you want?

mhoye May 12, 2024

@dolanor @4censord Ideally the outcome of this is a self-hosted webpage that gives me a reasonably good search experience for the extensive documentation already on my computer. I'd prefer not to run a web server locally - that seems unnecessary - and even a periodic manual refresh rather than anything automatic is probably fine.

@mhoye @dolanor hmm, that does sound like the use case for something like the SQLite full text search module. (https://www.sqlite.org/fts5.html)
Or, if you are ok with having to use Java, check if Apache Lucene is an option; that's the search engine behind Apache Solr (https://lucene.apache.org/core/)

SQLite FTS5 Extension

mhoye May 12, 2024

@4censord @dolanor yeah, I think my initial approach will be something like pandoc -> SQLite plus a front end.
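One way the "pandoc -> SQLite" half of that plan could look, as a rough sketch rather than anyone's actual implementation. It assumes the `pandoc` executable is on PATH and that the local SQLite has FTS5; `build_index`, `pandoc_to_text`, and the `docs` table are hypothetical names. The converter is passed in as a parameter so it can be swapped out or stubbed.

```python
import sqlite3
import subprocess
from pathlib import Path


def pandoc_to_text(path):
    """Convert one document to plain text by shelling out to pandoc.

    Assumes pandoc is installed; it guesses the input format from the
    file extension.
    """
    result = subprocess.run(
        ["pandoc", "-t", "plain", str(path)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def build_index(doc_root, db_path, convert=pandoc_to_text):
    """Rebuild the whole index from scratch (a manual, periodic refresh)."""
    conn = sqlite3.connect(db_path)
    conn.execute("DROP TABLE IF EXISTS docs")
    # path is stored but not tokenized; only body is searchable.
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(path UNINDEXED, body)")
    for path in sorted(Path(doc_root).rglob("*")):
        if path.is_file():
            conn.execute(
                "INSERT INTO docs (path, body) VALUES (?, ?)",
                (str(path), convert(path)),
            )
    conn.commit()
    return conn
```

Calling `build_index("/path/to/docs", "docs.db")` by hand whenever the documentation changes would cover the refresh step; the front end is a separate problem this sketch does not address.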

Tanguy ⧓ Herrmann May 12, 2024

@mhoye @4censord by self hosted webpage without a web server, do you mean without some traditional webserver like Apache/nginx? Or no web server at all?

And if so, should the generated webpage hold all the data in it, to be searched via JavaScript in the page content itself?

Paul May 11, 2024

@mhoye Aside from being (I am reasonably certain) a line from Red Dwarf, you make a very good point.

When my parents moved from Windows to Linux some years back, it was not the open source or the Creative Commons that convinced them it was a good move; it was cost, ease of use and ease of maintenance.

mhoye May 11, 2024

@plwt Ubuntu no longer has the word "human" on their homepage and I put a few seconds into feeling sad about that every other week.

Paul May 11, 2024

@mhoye True, but all is not completely lost - https://ubuntu.com/community. I understand that they are looking to rejuvenate their community, and there is certainly one person there (whom I was fortunate to meet before they joined Canonical) who has the experience to see that happen.

The Ubuntu Community | Ubuntu

Ubuntu is an open source software operating system that runs from the desktop, to the cloud, to all your internet connected things.

alastair87 (old account) May 11, 2024

@mhoye It's difficult to know without knowing more about the application and data you have in mind.

alastair87 (old account) May 11, 2024

@mhoye If you just have a load of plain text files that aren't huge you may get on just fine with *nix terminal tools or PowerShell. It may not be worth the hassle of a separate index. But if you are asking the question, I doubt it's that simple.

Jef Poskanzer May 11, 2024

@mhoye My one person / one computer search toolbox:
- Just use egrep, or an equivalent linear search tool.
- If egrep is too slow, it might be due to searching lots of separate files, because filesystems are slow. Try pre-processing by concatenating all the files into one.
So far I have not needed a third tool.
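For anyone who wants to try the concatenation trick from the list above, this is one possible sketch of it in Python rather than shell. It assumes UTF-8-ish text files; `concatenate` and `linear_search` are made-up helper names. Each line is prefixed with its original path and line number so a hit in the combined file can still be traced back to its source.

```python
import re
from pathlib import Path


def concatenate(doc_root, out_path):
    """Copy every line under doc_root into one file as 'path:lineno:text'."""
    root = Path(doc_root)
    out_resolved = Path(out_path).resolve()
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(root.rglob("*")):
            # Skip directories, and the output file if it lives under root.
            if not path.is_file() or path.resolve() == out_resolved:
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            rel = path.relative_to(root).as_posix()
            for number, line in enumerate(text.splitlines(), start=1):
                out.write(f"{rel}:{number}:{line}\n")


def linear_search(combined_path, pattern):
    """Scan the combined file once; return every line matching the regex.

    As with grep on the same file, the pattern can also match the
    'path:lineno:' prefix, not just the original text.
    """
    regex = re.compile(pattern)
    with open(combined_path, encoding="utf-8") as combined:
        return [line.rstrip("\n") for line in combined if regex.search(line)]
```

The combined file is also a perfectly good target for egrep itself, which is closer to the original suggestion.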

Dan Riley May 11, 2024

@jef @mhoye back in the day I’d say glimpse, https://manpages.ubuntu.com/manpages/focal/man1/glimpse.1.html but it doesn’t look like there’s any active maintenance

Ubuntu Manpage: glimpse - search quickly through entire file systems

mhoye May 11, 2024

@dan131riley @jef Glimpse, Whoosh and a few others have been mostly abandoned, it looks like.

mhoye May 11, 2024

@jef I could gin something up with egrep without a ton of effort, sure, but that fundamentally presupposes that I already know what I'm looking for, as represented on disk in string form. I'm also interested in ease of access for people who haven't been neck deep in the shell for decades, but maybe more importantly, in _casual_ discoverability. I mean, who just opens a dictionary, looks up the word they were after and closes it again?

Eli the Bearded May 11, 2024

@jef @mhoye I use grep or ripgrep a lot of the time but it really sucks for the case of "find these three words in any order in the same paragraph" and I would like to do things like that sometimes.
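The "these words, any order, same paragraph" case is awkward for line-oriented tools but short to express once the text is split into paragraphs. A minimal sketch, assuming paragraphs are separated by blank lines and that whole-word, case-insensitive matching is what is wanted; `paragraphs_with_all` is a hypothetical helper name.

```python
import re


def paragraphs_with_all(text, words):
    """Return the paragraphs of text that contain every word, in any order."""
    patterns = [
        re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE) for word in words
    ]
    # A paragraph is any run of lines delimited by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", text)
    return [
        paragraph.strip()
        for paragraph in paragraphs
        if paragraph.strip() and all(rx.search(paragraph) for rx in patterns)
    ]


sample = "red green\nblue\n\nred blue\n\nBlue, green and RED."
print(paragraphs_with_all(sample, ["red", "green", "blue"]))
```

This is still a linear scan, so it shares grep's cost profile; it only changes the unit of matching from a line to a paragraph.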

Amir Livne Bar-on May 11, 2024

@mhoye if it's data usable by a single person, sqlite is almost always the easiest choice and is usually good enough. for text indexing you'd need a fts5 table, i've never used this extension though so take this advice with a grain of salt.

Ben Zanin May 11, 2024

@Pashhur @mhoye I've used FTS5 and its predecessors, they take just a little bit of schema creation work but then they just work perfectly, speedily, every time. You're offered a lot of very rich query functionality but you don't actually need to learn or use any of it to get good results out of the box.
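As a rough illustration of the "little bit of schema creation" and the richer query syntax mentioned above, here is a sketch using Python's sqlite3 module. It assumes an FTS5-enabled SQLite build; the `notes` table and its rows are invented for the example. The query forms shown (phrase, prefix, AND/NOT, NEAR, column filter) and the `highlight()` function are part of FTS5 itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The entire schema: one virtual table with two indexed text columns.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(title, body)")
conn.executemany(
    "INSERT INTO notes (title, body) VALUES (?, ?)",
    [
        ("fts5", "SQLite ships a full text search module that needs no server."),
        ("lucene", "Lucene is a Java search library; Solr is a server built on it."),
        ("grep", "Linear searching with grep needs no index at all."),
    ],
)


def titles(query):
    """Run an FTS5 query and return matching titles, best match first."""
    rows = conn.execute(
        "SELECT title FROM notes WHERE notes MATCH ? ORDER BY rank", (query,)
    )
    return [title for (title,) in rows]


def highlighted(query):
    """Return matching bodies with the matched terms wrapped in brackets."""
    rows = conn.execute(
        "SELECT highlight(notes, 1, '[', ']') FROM notes WHERE notes MATCH ?",
        (query,),
    )
    return [body for (body,) in rows]


print(titles('"full text"'))            # exact phrase
print(titles("search*"))                # prefix match
print(titles("search AND server"))      # both terms required
print(titles("server NOT java"))        # exclude a term
print(titles("NEAR(needs index, 2)"))   # terms within two tokens of each other
print(titles("title:lucene"))           # restrict the match to one column
print(highlighted("grep"))
```

A plain word or two passed to `titles()` already works, which matches the point that none of the extra syntax is needed to get started.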

CMDR Yojimbosan 🅅⁂ May 11, 2024

@mhoye I have been happy with Xapian, as visible in email clients/backends like notmuch. Inspired by sup, although I'm not so confident that I know what their reverse-index library is.

Andrew Radev May 11, 2024

@mhoye A friend of mine built minisearch, which might fit what you're looking for: https://github.com/lucaong/minisearch

GitHub - lucaong/minisearch: Tiny and powerful JavaScript full-text search engine for browser and Node

alastair87 (old account) May 11, 2024

@mhoye Have a look at Recoll - www.recoll.org. Linux or Windows.

Lou Katz May 11, 2024

@alastair @mhoye Works real well!

Dan Riley May 11, 2024

@alastair @mhoye Xapian (the back end for recoll) looks potentially interesting, will have to take a look at it...tx!

Ertain May 11, 2024

@mhoye When I see software that touts itself as scalable, that makes me think that it can easily grow with whatever the user needs. Works great for big organizations. But when the software is shrunk down to the size of the user's needs and it doesn't work as intended? Then it may not be scalable after all.

Ted Mielczarek May 11, 2024

@mhoye the Quickwit marketing text says all the enterprise things you dislike but the docs say that it's a single binary you can download and run on your machine: https://quickwit.io/

Search more with less | Quickwit

Sub-second search & analytics engine on cloud storage

hrbrmstr 🇺🇦 🇬🇱 🇨🇦 May 11, 2024

@tedmielczarek @mhoye this works super well on a single workstation.

The component that makes it work is also open source: https://github.com/quickwit-oss/tantivy

GitHub - quickwit-oss/tantivy: Tantivy is a full-text search engine library inspired by Apache Lucene and written in Rust

Fabrice Desré May 11, 2024

@tedmielczarek @mhoye Based on the same underlying indexer (Tantivy), this one has a http api: https://github.com/lnx-search/lnx

GitHub - lnx-search/lnx: ⚡ Insanely fast, 🌟 Feature-rich searching. lnx is the adaptable, typo tollerant deployment of the tantivy search engine.

Dan Riley May 11, 2024

@tedmielczarek @mhoye Tantivy does look useful (and might save me from trying to revive glimpse in my retirement years). I should confess that at work I do run a full ELK stack...

Parade du Grotesque 💀 May 11, 2024

I read part of your sentence as 'polytopic trepanation' and I thought of AI scraping data straight from the source, and I will be under my bed, crying softly if you don't mind.

Jan May 11, 2024

@mhoye I'm all in on “lightweight alternatives to Elasticsearch / Solr”. Let's see if I starred something useful…
https://github.com/valeriansaliou/sonic (Rust)
https://github.com/zincsearch/zincsearch (Go)
https://github.com/askorama/orama (JS)
https://github.com/CloudCannon/pagefind (specifically for static sites)
https://github.com/kbrsh/wade (Rust, library like Lucene)

GitHub - valeriansaliou/sonic: 🦔 Fast, lightweight & schema-less search backend. An alternative to Elasticsearch that runs on a few MBs of RAM.

jjude May 12, 2024

@jnv @mhoye Does Meilisearch (https://github.com/meilisearch/meilisearch) fit into this?

GitHub - meilisearch/meilisearch: A lightning-fast search engine API bringing AI-powered hybrid search to your sites and applications.

Jan May 12, 2024

@jjude https://unfug.social/@4censord/112422875957476448

4censord (@4censord@unfug.social)

Norm Tovey-Walsh May 11, 2024

@mhoye I’ve had success with https://github.com/projectEndings/staticSearch

GitHub - projectEndings/staticSearch: A codebase to support a pure JSON search engine requiring no backend for any XHTML5 document collection

Mark Finkle May 12, 2024

@mhoye Would you consider SQLite? I use its full text search for some projects.
https://www.sqlite.org/fts5.html

SQLite FTS5 Extension

mhoye May 12, 2024

@mfinkle I'm strongly inclined that way, definitely.