#searchengines #selfhost

I got tired of putting my trust in search engines that run on somebody else's machine.

DuckDuckGo claims to block all trackers, but it silently whitelists Microsoft's (because of its commercial agreement with Bing), and they failed to mention it until they were caught with their hands in the cookie jar.

Startpage is Dutch, it's notoriously privacy-sensitive, and it's been around for longer than Google, but it has now been acquired by an ads company and I don't see things going well for it.

Brave claims to be the ultimate solution for privacy, except when they put a crypto miner to run in your browser so they can make a bit of extra money on the side.

So I've decided to take this matter too into my own hands and run my own search engine. You can access it at https://search.fabiomanganiello.com. It runs SearXNG in a Docker container on one of my servers at home. It's a bit slower than the major search engines, but not by much, and that's a small price I'm happy to pay for freedom.
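For those curious about the setup, this is roughly what it looks like as a Compose file (a sketch, not my exact config: the image name is the official searxng/searxng one, while the hostname, port and volume path are illustrative placeholders):

```yaml
# Sketch of a self-hosted SearXNG running in Docker.
# searxng/searxng is the official image; hostname, port and
# volume path below are placeholders for illustration.
version: "3"
services:
  searxng:
    image: searxng/searxng:latest
    ports:
      - "8080:8080"
    volumes:
      - ./searxng:/etc/searxng
    environment:
      - SEARXNG_BASE_URL=https://search.example.com/
    restart: unless-stopped
```

A reverse proxy (nginx, Caddy, etc.) in front of port 8080 then takes care of TLS and the public hostname.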


SearXNG — a privacy-respecting, open metasearch engine

@blacklight Some searx instances are fed by a locally running #yacy which does the crawling, so you can also get a greater degree of independence from the giants. I noticed a lot of your results grab from both DDG & Qwant. Both DDG & Qwant are MS syndicates, so you could drop one of them and replace with #Mojeek or #Gigablast (which do their own independent crawling).
@blacklight I also suggest filtering out Cloudflare sites. This thread covers that: https://infosec.exchange/@bojkotiMalbona/108596156060149270
bojkotiMalbona (@[email protected])

The world’s most #privacy-respecting search service (#ombrelo) has been taken offline indefinitely. Really bad news for privacy enthusiasts! The API is still up, but probably not many people even knew it existed, much less how to use it.


@koherecoWatchdog this is a trade-off that I'm very happy to discuss.

I've thought of dropping DDG (and maybe Brave) results entirely. I'm a big fan of Mojeek, but there's a trade-off between privacy purism and relevance of the results that I'm still trying to strike here.

As an example: a search of my name on Mojeek won't return any links to my content on LinkedIn, Medium, or CF-powered websites like Hackernoon, dev.to, IoT4All, BetterProgramming. Nor any links to my books (because they are sold on the likes of Amazon, eBay, ebooks.com or springer.com), nor any links to my music (because, outside of Bandcamp, it's on the likes of Spotify, Tidal etc.), nor any links to my past talks (because they are mostly on YouTube).

Yes, it reports the links to my self-hosted websites and my apps on F-Droid, but those are only a small fraction of the content about me on the web. And the same considerations that apply to searching my name apply to any other search a user may want to perform. Even the most privacy-aware user wouldn't want to use a search engine that shows them only a fraction of the web. It should probably be their choice whether they want to click on a result hosted on Amazon or Medium. The search engine can ensure that their privacy is protected and that it doesn't collect any private data about users (or anything that could associate users with queries), but IMHO it shouldn't omit results that may be relevant to what the user actually needs to do.

If users can't find what they're looking for, they'll just fall back on another search engine, which defeats the whole purpose of running an alternative engine. And, if users fall back on another search engine too often, they'll eventually just stop using yours - a point that @thelinuxEXP made quite well in a recent video.

@blacklight
I see no trade-off if you include Mojeek results in /aggregate/ with other indexes. Doesn’t searx do a round-robin on results from the various sources? I’m glad to hear #Mojeek is coming up short on #Cloudflare sites -- that’s a very rare feature for privacy enthusiasts. Most self-proclaimed privacy-focused search services have failed to evolve beyond controlling their own privacy abuse & fall short of privacy-respecting results.
@thelinuxEXP
@blacklight
I don’t believe this for a second: “Even the most privacy-aware user wouldn't want to use a search engine that shows to them only a fraction of the web.” Seeing the wide open firehose of all possible web results is exactly what privacy-ambivalent users want. Privacy enthusiasts are burned out on seeing the garbage the web has become. IIRC a recent study showed that ~70k websites out of ~80k were sites clusterfucked w/js-trackers.
@thelinuxEXP

@blacklight
The primary job of a good search tool is to filter out (or down rank) the results you don’t want.

Searx instances that show Bing or Google results are a dime a dozen (regardless if they src from a syndicate). There’s nothing special about your instance or any other searx instance if it only proxies the giants. What makes a search svc stand out is
1→ unique sources (yacy, mojeek, etc), or
2→ unique filtering (anti-CF or anti-bloat).
@thelinuxEXP

@thelinuxEXP @blacklight If you’re not keen on offering privacy-respecting /results/, then another way to be distinguished from the rest is to offer bloat-free results. search·marginalia·nu and wiby.me both do that, but they’re both tor-hostile so I don’t use them. If a searx instance were to proxy those two I would bookmark it and use it. BTW, #exalead is another true search engine (in-house index)

@koherecoWatchdog @thelinuxEXP in Searx/SearXNG it's quite easy, even on the client side, to customize which engines you want to query. So if a user is bothered by results returned by Bing or DDG, they can simply turn off those engines and turn on only Mojeek/Gigablast instead. I would also love it if CF filtering were available as a feature that clients can toggle as they like.
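For reference, this is roughly what toggling engines looks like on the server side in SearXNG's settings.yml (a sketch assuming the stock engine names; the same toggles are exposed per-client in the preferences page):

```yaml
# Sketch of the engines section of SearXNG's settings.yml.
# Engine names assume the stock configuration; any engine
# can be enabled or disabled here by its name.
engines:
  - name: duckduckgo
    disabled: true
  - name: qwant
    disabled: true
  - name: mojeek
    disabled: false
```

Whatever is set here is only the default: each user can still flip the same switches in their own preferences.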

But, as admins, we shouldn't be too opinionated and make those decisions for the user. My priorities for a search engine are:

1. It should not track you, nor collect any data about you, nor show you unsolicited sponsored content.

2. It should be functional for its purpose - which is to return what the user is looking for in most cases.

A search engine like the one that you describe probably won't return results from more than half of the Web, and it's hard to tick the "be functional" box if you filter out so many results.

As a developer, a search engine that downranks results from StackOverflow, or doesn't return technical articles from Medium, dev.to or Hackernoon, would be almost useless.

A search engine that doesn't return (or downranks) results from the Guardian, the Economist or CNN (or even Fox) would provide a very incomplete view of the world when you search for news.

A search engine that doesn't return results from LinkedIn or Glassdoor won't be of much help to someone who is looking to hire or be hired.

A search engine that doesn't return results from any major sources for online shopping or reviews won't be very helpful if I want to buy a new product.

My point is that even the most privacy-enthusiast user sometimes needs to access websites that aren't 100% privacy respecting, because some content that they need may be only available on those websites.

And if my search engine doesn't return those results, then people will just go to another search engine that may have those results, but it's likely to be more privacy-invasive than mine. And that's exactly what I want to avoid: the purest search engine is literally of no use if people can't get relevant results out of it for their day-to-day activities.

@blacklight @thelinuxEXP Ah, right I forgot about the client side engine toggles. I think you’re using the term /relevancy/ differently than the industry does. E.g. Google does not determine relevancy purely as a function of raw content & search query. If a website has a non-white background, Google drops its relevancy. Relevancy is about knowing your audience. Cloudflare sites are irrelevant to privacy seekers.
@thelinuxEXP @blacklight Google’s audience is mainstream avg Joes who don’t give a shit about privacy, who use defenseless browsers (which work on most websites), who don’t want background color. If you simply run with Google’s rankings, you are also catering for those users. While privacy enthusiasts are not served well with that ranking b/c we have defensive browsers w/js disabled & a Tor IP.
@blacklight @thelinuxEXP The “It should be functional” box is most certainly not ticked when ¾ of the top ranking results lead to CF-pushed CAPTCHAs & sites so littered w/js they are dysfunctional in a secure browser. Those results /are/ irrelevant to privacy enthusiasts b/c we need sites that function, which don’t snoop on us. The privacy-hostile results are time-wasting pollution to privacy seekers
@thelinuxEXP @blacklight My workflow is to click down the list and open ~4—10 tabs & then run through the junk & hit control-w on the dysfunctional/garbage sites to get down to 1 or 2 that are fit. The search service should be doing that for me. Why doesn’t it? It’s because privacy seekers are in such a minority that no search service (except #Ombrelo) is willing to serve such a small audience.
@blacklight @thelinuxEXP Ombrelo saves me time because CF sites are treated as irrelevant. It still gives tor-hostile results though, so I have to go back & click on favicons to get mirrored versions of some sites. But #Ombrelo is the king of privacy respecting search because it knows the needs of the audience & no other search service has put user needs above Google & Microsoft.
@thelinuxEXP @blacklight If you were to design a search service that caters for privacy seekers, your userbase will of course shrink dramatically because we are a small marginalized group.

@koherecoWatchdog @thelinuxEXP "privacy seekers" is an umbrella term for people that fall on a wide spectrum. On one end, you have those who simply click on "Reject cookies", or use the browser's incognito mode, and they're fine with it. On the other end, you have people like you who uncompromisingly shape their whole online experience around the idea of absolute privacy. And you have a lot of shades of gray in between.

When I make decisions on how to shape the public services that I host, I try to aim at the middle of this spectrum - the guy who wants no ads, no trackers and no bloat, and as little JS as possible, but who doesn't mind reading news articles on a major outlet, or following artists on Spotify, or discovering companies and co-workers on LinkedIn, or searching for programming questions on StackOverflow, or reading blogs on Medium. These users may decide to surf these websites like a guy who wears four condoms, just in case (with a DNS sinkhole, ad/tracker blocker, Tor, NoScript, some alternative client for those services, or all of the above), but they may still be interested in content that is only available on these platforms, and leaving out results from these platforms will lead to a bad experience, functionally speaking. The Web is powerful only when it doesn't get too opinionated about which results should reach the end user.

I also try to give freedom of configuration to those on the more uncompromising end of the "privacy seekers" spectrum. For example, users can disable search engines that they don't want in their results, and I would also be happy to work on a PR for Searx/SearXNG to implement a user option to disable CF results if there's enough interest. But these settings shouldn't be the default, nor should the service be too opinionated and impose them on the user. If I do that, I may gain the trust of the more uncompromising users, but I'll lose everyone else in the middle of the spectrum. And those in the middle of the spectrum are likely to just go back to whatever crappy search engine they were using before, so they won't be better off.
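To give an idea of the scope of such an option, its core could be a small post-processing filter over the result list. A minimal sketch in Python - the domain list is a placeholder (a real implementation would need an actual dataset of CF-fronted domains), and `is_blocked`/`filter_results` are hypothetical names, not existing SearXNG APIs:

```python
from urllib.parse import urlparse

# Placeholder blocklist: a real plugin would need an up-to-date
# list of Cloudflare-fronted domains, e.g. from a
# community-maintained dataset. These entries are illustrative.
BLOCKED_DOMAINS = {"example-cf-site.com", "another-cf-site.org"}

def is_blocked(url: str, blocked: set = BLOCKED_DOMAINS) -> bool:
    """Return True if the result URL's host, or any parent domain
    of it, is in the blocklist."""
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    # Check the host itself and every parent domain
    # (blog.example.com -> example.com -> com)
    return any(".".join(parts[i:]) in blocked for i in range(len(parts)))

def filter_results(results: list, blocked: set = BLOCKED_DOMAINS) -> list:
    """Drop search results (dicts with a "url" key) whose URL
    belongs to a blocked domain."""
    return [r for r in results if not is_blocked(r.get("url", ""), blocked)]
```

Hooked into the result pipeline behind a per-user preference, something like this would let each client decide whether CF results are filtered, downranked, or left alone.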

If those who are more uncompromising decide that they won't get onboard with a solution just because it doesn't follow exactly their idea of how the Internet should work, then we'll keep getting fragmentation, forks and endless discussions, instead of creating solutions that appeal to the highest possible number of users and can cause a real dent in Big Tech's numbers.

When designing for privacy, I believe in striking reasonable trade-offs between privacy purism and providing a service that appeals to enough users (because, function-wise, it is similar to what they were using before, so migrating doesn't come with huge costs). If you're too purist and pretend that any website somehow touched by Big Tech doesn't exist, then you lose users. In many cases, those users will just keep using Google, Bing, DDG or whatever, so they won't be any better off. If we push people away with our purism, we may be able to build our little happy bubble where Big Tech is not allowed in any shape or form, but we won't be able to build a convincing case for others to join us. And that really doesn't align with my mission - which is to raise the privacy level for as many people as possible, not to provide a privacy-purist solution only for a niche.

@blacklight
I would not say you’re anywhere near the middle of that spectrum. At the most flimsy side of the range you have #DDG, #startpage, & Qwant, which deliver privacy-abusing results & also financially feed surveillance capitalists (#MCAG) by way of ad revenue, by paying for API access & maintaining link rankings, sending traffic where those tech giants intend.
@thelinuxEXP
@blacklight
Your #searx instance slightly improves on that by:
① avoiding ad revenue
② scraping results instead of paying tech giants for API access (guessing, as there is no mention of MS/Google relationships)
③ giving a cached link
Your service moves the needle in the right direction by like ~15%; no different than other searx instances.
@thelinuxEXP
@blacklight
Your instance has a looong way to go:
① No onion URL for the search page
② No onion URL replacements (nytimes·com should be replaced w/nytimesn7cgm…onion)
③ No clearnet URL replacements (medium·com should be replaced w/scribe.rip; mojeek does this but mojeek must be turned on & even then scribe.rip is ranked lower than startpage’s medium·com)
④ No option to filter out CF results or even downrank them
@thelinuxEXP
@blacklight
⑤ Cloudflare results are not even red-flagged
⑥ The rankings favor privacy abuse (e.g. the very 1st result & many shortly below give 403/462 to Tor users, several Amazon shopping links often appear on the 1st page)
⑦ Cached links are offered for sites that are archive.org-hostile, like Quora. Those sites should at least be downranked and there should be an archive.ph link instead, as well as a red-flag to inform users of the problem.
@thelinuxEXP
@blacklight
⑧ The source code and issue tracker directs users into MS’s walled garden of #Github. There is no in-band way for users to report search issues.
⑨ The “settings→engines” tab incorrectly lists DDG, Qwant, & Startpage as “engines”. They are not. They are /syndicates/. Calling them engines alongside Mojeek & Gigablast is not just misinfo, it’s also an injustice that encourages users to regard syndicates as equals of engines.
@thelinuxEXP
@blacklight
The tab should be called “data sources” or “search services”, & thereunder should be a section for “engines” which only lists search *engines*. A section below that should be “meta-search services” (or alternatively a label that Facebook has not hi-jacked). In that section it should be clear that DDG & Qwant are MS syndicates, and that Startpage is a Google syndicate.
@thelinuxEXP
@blacklight
For whatever reason there is a widespread attitude among normies that meta-search services are somehow inferior to search engines (like whisky snobs who insist that single malts always trump the blends). #Searx should take advantage of that & correctly tag the MS/Google-supporting syndicates with their true nature.
@thelinuxEXP

@koherecoWatchdog @thelinuxEXP what do you use when you have to search something to buy online? Or an answer to a programming question? Or the latest album by your favourite artist? Or news about a recent event? Or a video? Or the phone number of a restaurant?

These resources are, in most cases, hosted on websites that are either blocked or downranked on Mojeek, or that are CF-powered. Not to mention the case where you have to search for things for your actual work. This is not about the average Joe: this is about how every human uses a search engine and expects it to work.

Sure, there can be alternatives to Amazon, Medium, StackOverflow, Spotify, the Guardian, CNN, YouTube, Booking or OpenTable, but we all know that most of the alternatives don't have the same availability of content, and people sometimes have to get results from those websites to find what they're looking for.

So not returning those results would make a search engine overall less useful. There's no point in saying "everything we return is pure" when everything we return is probably irrelevant for a given search query. If the user is searching for those things, and we don't return them, then they will fall back on Google, and we've already lost our battle.

People can be sensitive about privacy issues without going all the way to the point where they would never click on a website just because it's proxied by CF, or even feel disturbed by its mere presence in search results. Sensitivity on privacy issues goes on a wide spectrum (from the guy who doesn't care at all to the guy who would never click on a CF link), and as an admin it's my job to provide users with as many options as possible to customize the experience to their liking, while ensuring that my websites themselves don't collect any data at all, ever. This is already a better situation than the one most of the people experience today, and I wouldn't dismiss this achievement.

As for the super privacy enthusiasts - people like you and me already browse the web with browsers heavily tuned for privacy, at least one or two client-side ad/tracker blockers, probably a Pi-hole to dump everything at the first DNS query, probably NoScript to disable JS by default, and probably everything proxied through Tor. So what exactly are you worried about? That my search engine would show you a result from a site that isn't 100% privacy-aware, and that if you click on it all of a sudden you get tracked?