How I Cut My Google Search Dependence in Half
108 points by asciimoo
That tool sounds helpful, but for the other 50% of the searches, try Kagi.
I've been using it for a while, and it's the best search engine I've used. The results are good, there's no advertising, they're not tracking me everywhere on the internet, I can opt out of AI, and they have filtering and blocking of sites from search results. It's not free, but I average between 700 and 1000 searches a month, so to me it's worth it.
I agree about Kagi, that it is currently the most decent solution we have - but who knows for how long. I've heard that there are controversies around the CEO, and after all, it is a profit-oriented business.
Yeah, I've seen the controversies and personally think they're overblown. Kagi paying for access to the Yandex index isn't making a meaningful contribution to Russia's invasion of Ukraine, and I think there are better ways to send a message than sacrificing search quality. And emailing that blogger was a little out of touch, but hardly a reason to stop using their search engine (IMO).
And I can't fault them for being profit oriented. I don't work for free, either. At least I'm paying for it, rather than some advertiser.
Kagi paying for access to the Yandex index isn't making a meaningful contribution to Russia's invasion of Ukraine.
The question here is how they access other indices. Do they forward search queries? Are there any information leaks in the data exchange? As far as I know we don't have many details in this regard, and we have no ability to verify their statements.
And I can't fault them for being profit oriented.
Me neither; it just signals caution that maybe one day the business goals are going to conflict with user interests.
I don't think a business being profit-oriented is a bad thing on its own, so long as the profit model aligns with user interests. The only way Kagi remains viable is by aligning with those interests.
I've heared that there are controversies around the CEO
What controversies?
I use DuckDuckGo for 99% of my search queries; it's fast, the results are much better than Google's, and they also claim to protect my privacy.
Looks great!
Is there anything preventing my secrets (like API Keys) from being scraped? (Usually they are displayed exactly once when creating them).
There is no specific safety net to prevent storing sensitive data. You can create rules to forbid indexing specific URLs, and it is also possible to delete URLs from the index. Do you have a better idea for resolving these cases?
Maybe add a step to filter tokens prior to indexing them? (I didn't look at your implementation, so no idea how feasible that is.) There seem to be existing libraries for secret scanning.
This might help a bit: https://github.com/s4nj1th/secr-cli/blob/main/internal/rules/rules.go (pattern-based api keys detection).
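Something in that spirit, run before text ever reaches the indexer, could be fairly small. Untested sketch; the patterns are just illustrative and the `indexer` package name is made up, this is not Hister's actual API:

```go
package indexer

import "regexp"

// Hypothetical pre-indexing filter: redact strings that look like
// well-known API key formats before the text reaches the index.
var secretPatterns = []*regexp.Regexp{
	regexp.MustCompile(`ghp_[A-Za-z0-9]{36}`),                // GitHub personal access token
	regexp.MustCompile(`AKIA[0-9A-Z]{16}`),                   // AWS access key ID
	regexp.MustCompile(`sk-[A-Za-z0-9_-]{20,}`),              // generic "sk-" style API key
	regexp.MustCompile(`-----BEGIN [A-Z ]*PRIVATE KEY-----`), // PEM private keys
}

// RedactSecrets replaces anything matching a known pattern with a placeholder.
func RedactSecrets(text string) string {
	for _, re := range secretPatterns {
		text = re.ReplaceAllString(text, "[REDACTED]")
	}
	return text
}
```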
You could maybe do some heuristics to determine if a page is a "settings page". Probably those are low value to have in the search index anyway, and I'd think that usually that's where API keys would pop up.
Probably hard to implement: prevent indexing words that have never been seen before and appear only once on a page (if a word has been seen previously or is repeated on the page, it is unlikely to be a secret).
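A rough sketch of that heuristic, assuming some `seenBefore` lookup backed by the existing index (which is my invention, not something Hister exposes):

```go
package indexer

// skipLikelySecrets drops tokens that occur exactly once on the page and
// have never been seen before -- the idea being that real words repeat,
// while a freshly generated API key does not.
func skipLikelySecrets(tokens []string, seenBefore func(string) bool) []string {
	counts := make(map[string]int, len(tokens))
	for _, t := range tokens {
		counts[t]++
	}
	kept := tokens[:0]
	for _, t := range tokens {
		if counts[t] == 1 && !seenBefore(t) {
			continue // likely a one-off secret; don't index it
		}
		kept = append(kept, t)
	}
	return kept
}
```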
The final product surely looks interesting; people often comment on how they keep a personal wiki as a knowledge base, and the article mentions this too.
What I do not like about the article and the repository is that they don't explain how this works at all; they only focus on what it's trying to solve. Why is it tagged "go", aside from supposedly being written in Go? Why does the article mention the browser extensions while the GitHub README doesn't?
Thanks for the feedback. There are indeed quite a few inconsistencies around the project, as you pointed out. The development is at a very early stage; I focused much more on the functionality than on everything else.
Why does the article mention the browser extensions while the GitHub README doesn't?
It is mentioned, but only in the "Build" section, which is misleading. I'm going to emphasize it more.
For the record, I mentioned the browser extensions since the article does not mention how it (automatically) works, so I assumed that the extensions message a background process in the host.
Remember that tags on lobsters are used for filtering; that's the main purpose they provide. When picking tags, it's tempting to think "who would be interested in reading this post?" but what you should be asking is "who would be interested in not reading this post?"
I almost missed this post because I filter the go tag, despite it being extremely relevant to my interests! (I built my own search engine last year: https://search.technomancy.us )
This is pretty nice. I'm not quite enough of a data/link hoarder to actually deploy this myself but I love the concept and can see where it would be useful. I was wondering what they used to manage the index and it looks like https://github.com/blevesearch/bleve.
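For anyone curious what that looks like, the bleve API is roughly this. A minimal, untested sketch; the document fields and index path are made up, not Hister's actual schema:

```go
package main

import (
	"fmt"
	"log"

	"github.com/blevesearch/bleve/v2"
)

// Page is a hypothetical document shape; the real schema may differ.
type Page struct {
	URL   string `json:"url"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

func main() {
	// Create a file-based index with the default mapping.
	index, err := bleve.New("pages.bleve", bleve.NewIndexMapping())
	if err != nil {
		log.Fatal(err)
	}
	defer index.Close()

	// Index a document keyed by its URL.
	page := Page{URL: "https://example.com", Title: "Example", Body: "some page text"}
	if err := index.Index(page.URL, page); err != nil {
		log.Fatal(err)
	}

	// Run a full-text query against the index.
	query := bleve.NewMatchQuery("page text")
	result, err := index.Search(bleve.NewSearchRequest(query))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result)
}
```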
I have a whole array of search engines lined up via XMonad's Search module, each just a key chord away. This includes nlab, arXiv, zbmath, wikipedia, github (both a general search and a "jump to repo"), several programming language docs (hoogle, crates.io, doc.rust-lang.org, clojure docs, and so on), openstreetmap, and a few more. It's extremely convenient not having the extra indirection of another search engine, and I would definitely recommend a setup like this to anyone (I think most browsers also support this via a prefix to the given search in the address bar). The fact that this also indexes the content of the given site is interesting, however, and definitely worth a look!
Interesting, I just installed this on my home server and I'm trying it out. It looks like short words aren't indexed (e.g. searching for "a few" after indexing this page returns no results) - presumably this is expected but I thought it was broken at first.
I also put this behind an nginx reverse proxy - if anyone else is trying this make sure you also proxy Connection and Upgrade headers.
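Something along these lines; the location and upstream port are placeholders for whatever your Hister instance uses:

```nginx
location / {
    proxy_pass http://127.0.0.1:8080;  # placeholder upstream
    proxy_http_version 1.1;
    # forward the WebSocket upgrade handshake
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```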
indexed (e.g. searching for "a few" after indexing this page returns no results) - presumably this is expected but I thought it was broken at first.
Yes, the default behavior of the indexer is to exclude common English words. Perhaps I should make it configurable, or at least emphasize it.
I also put this behind an nginx reverse proxy - if anyone else is trying this make sure you also proxy Connection and Upgrade headers.
I'm going to mention the reverse-proxy headers in the setup guide, thanks for the feedback.
This sounds great! I try to keep my web history somewhat curated by running a script on the places.sqlite (which is where Firefox stores history) so that I can use it as my primary source of completions in the address bar and completely disable web suggestions.
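The query side is trivial; a simplified sketch (not my actual script) using github.com/mattn/go-sqlite3 looks roughly like this. Work on a copy of places.sqlite, since Firefox keeps the live file locked:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "places-copy.sqlite")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// moz_places holds one row per known URL; last_visit_date is in microseconds.
	rows, err := db.Query(`
		SELECT url, IFNULL(title, ''), visit_count
		FROM moz_places
		WHERE visit_count > 0
		ORDER BY last_visit_date DESC
		LIMIT 100`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var url, title string
		var visits int
		if err := rows.Scan(&url, &title, &visits); err != nil {
			log.Fatal(err)
		}
		fmt.Printf("%4d  %s  %s\n", visits, url, title)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```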
I've always wanted to use this web history going back years to build something similar but never got around to it, because most of the time I'm either searching for something I've already found before, or I'm searching for things on specific websites (which you can set up in Firefox by using keyword search bookmarks). I'll try Hister out shortly. Cheers!
I like this, I'll probably start using it! I just set it up and it is blazing fast. Question about how it should be set up: should I have a single instance for all computers and point them at it, or does it make more sense to have one local instance per box?
I'd probably want my history available from all machines in a single place, but I'm not sure about the latency impact.
It depends on your use case. Don't worry about the latency, it is insignificant. If you have a work and a personal computer, it is probably better to run two distinct instances; otherwise one global instance is fine.
Be aware that there is no authentication implemented yet, so make sure that you run it behind a VPN or with an authenticating proxy to prevent unauthorized access.
I have been waiting for a replacement for pre-Chrome Opera's full-text history search for a long time. This might be it.
I'd suggest turning off text justification, at least on mobile. I couldn't make myself read the article, though I'm interested in the topic in principle.
This has been such a great tool, I often just search for things I have searched before, so this makes it a lot faster. Kudos!
Was this post written by an LLM? It has several stylistic and structural indicators of LLM writing.
If you don’t like manipulative SEO, AI suggestions, & lack of privacy, why would you host the code & bug tracker on Microsoft’s GitHub? That platform perpetuated “stars” meaning something, leading to “star hacking” (including a call to action at the bottom of this page), along with a sidebar of “suggested for you”; Copilot is embedded in everything now, & you make contributors have an account, agree to Microsoft’s ToS, & then all interactions are used to train the machine that they sell back to us as a product.
This is a free software project; everything around it is public. Therefore there is no privacy involved. The code is accessible without login and anybody can create a fork/mirror anywhere. But, you are right, I should provide an alternative source like Radicle for people who dislike GitHub.
But, you are right, I should provide an alternative source like Radicle for people who dislike GitHub.
In your shoes, I would rather spend the time programming, or having fun, than setting up a repo elsewhere just to appease some nagging person, honestly.
You can have your cake & eat it too by choosing a better long-term, ideology-aligned default for you & your potential community. If you want to mention privacy or freedom (this project is AGPL) as a reason for your tool, you shouldn’t be touching anything involving Microsoft for the tooling. Then when you are just “having fun” & doing projects, your default isn’t exposing yourself or others. I’ll quote it again:
Choosing proprietary tools and services for your free software project ultimately sends a message to downstream developers and users of your project that freedom of all users—developers included—is not a priority.
—Matt Lee
I've created a Codeberg repository as well: https://codeberg.org/asciimoo/hister
I hope that from now on you are going to be a happy Hister user and active contributor on Codeberg ;)
Why are you implying that the commenter is nagging? I get that the opinion on whether to spend time programming, having fun, or setting up repos (as if they don't have overlap, but I'm digressing) is a personal one.
But I wonder why you would consider the suggestion of a better code hosting provider "nagging". Would it also be "nagging" if they suggested a different lint style, a different IDE, or a different programming language?
A lot of what we post and discuss here isn't about programming, so I'm curious why this is irking you, and not the fact (as someone noticed in other comments) that the linked article says very little about programming and is mostly about the product.
Therefore there is no privacy involved.
The code is accessible (when the servers are online), but filing bugs or feature requests, or even submitting patches/pull requests, is not. Many overlook this part of the lock-in. You can avoid exposing folks to it by choosing a different host or offering other channels. Radicle would be cool.
This looks like a great project. How do you store the page contents? Does the web extension pass all that information to Hister? An alternative solution is to use a proxy to listen to all the traffic in two directions. That way, you also have an archive of all the information that you sent out.
With regards to GitHub, I agree with toastal. If you continue to use GitHub
How do you store the page contents?
The indexing is done by a Go library called bleve; it builds a file-based index.
Does the web extension pass all that information to Hister?
Yes. Alternatively you can use the command line tool to add URLs to the index.
An alternative solution is to use a proxy to listen to all the traffic in two directions.
It would require a MITM proxy to intercept HTTPS, and it has the disadvantage of missing browser-rendered content. What would be the benefit of capturing the request data in this case?
Rendered content would indeed be missed by a MITM proxy; I had not thought about that. Snapshotting the DOM via the extension does catch more. Are that snapshot and its assets (images, CSS, fonts) also available, and does it look like the site did when it was visited?
A MITM proxy would be able to capture content when using a browser without extension support or an in-app browser. Storing outgoing data is valuable too for cases where there's a dispute about what data is sent out or to track what data is leaked by web apps.
Honestly having SPA stuff left out of the search index would be a feature, not a bug in my book.
I know I should read the instructions instead of asking, so shame me. :)
How does the index work? More specifically, does it do a plain search over the HTML code? Does it understand what is visible to the user? E.g. it should not match "span" in <span> when it is used as HTML markup. Does it work for JS-heavy sites? I mean, if a site loads data asynchronously, is that saved too?
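To make the question concrete, I'd naively expect only text nodes to end up in the index, roughly like this. Untested sketch using golang.org/x/net/html; I have no idea whether Hister does anything similar:

```go
package main

import (
	"fmt"
	"log"
	"strings"

	"golang.org/x/net/html"
)

// visibleText walks the DOM and collects text nodes, skipping markup and
// anything inside <script> or <style>, so "span" only matches real content.
func visibleText(n *html.Node, out *[]string) {
	if n.Type == html.ElementNode && (n.Data == "script" || n.Data == "style") {
		return
	}
	if n.Type == html.TextNode {
		if s := strings.TrimSpace(n.Data); s != "" {
			*out = append(*out, s)
		}
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		visibleText(c, out)
	}
}

func main() {
	doc, err := html.Parse(strings.NewReader(`<p>hello <span>world</span></p><script>var x = 1;</script>`))
	if err != nil {
		log.Fatal(err)
	}
	var parts []string
	visibleText(doc, &parts)
	fmt.Println(strings.Join(parts, " ")) // prints: hello world
}
```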
An alternative solution is to use a proxy to listen to all the traffic in two directions.
One limitation of this approach is that it doesn't capture data that is rendered client-side. Unfortunately more and more websites are moving in this direction. In theory you could reconstruct the client-side state by effectively running a browser in an isolated environment and replaying network requests, but that would be pretty tedious to set up.
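A somewhat less tedious variant is to drive a headless browser and index the DOM it ends up with. Untested sketch using github.com/chromedp/chromedp (requires a local Chrome/Chromium); this is my own suggestion, not how Hister or the proxy approach works:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/chromedp/chromedp"
)

func main() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()
	ctx, cancel = context.WithTimeout(ctx, 30*time.Second)
	defer cancel()

	// Load the page in headless Chrome and grab the rendered DOM,
	// including content filled in client-side by JavaScript.
	var rendered string
	err := chromedp.Run(ctx,
		chromedp.Navigate("https://example.com"),
		chromedp.WaitReady("body"),
		chromedp.OuterHTML("html", &rendered),
	)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(rendered), "bytes of rendered HTML")
}
```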