I keep trying to find things like “making waffles from sourdough discard” and all the sites are the same: long, meandering paragraphs full of links to other things on the site, with dubious instructions.
Considering that at this point I can pretty much identify the type of site just by looking at it, are there good extensions or search engines that might remove them from search results?
No, because there’s no reliable way to distinguish AI-generated spam sites from human-made spam sites, and I don’t expect one to be forthcoming any time soon: any attempt at detection is going to run into improved generation systems, and that’s gonna happen even if those systems aren’t explicitly trying to evade detection. If it were easy, Google would have done it years back. I can recognize some of these sites now, but the SEO spam crowd creating them is trying hard to pollute search engine results, and if someone implements a generalized block that’s effective, they’re going to keep looking for alternatives until they find something that gets through.
On Kagi, I can restrict the date range on results to before the emergence of LLMs, but that cuts out a lot of material that I want to see. For some searches that might work, but it’s not really a general solution.
You can manually blacklist or deprioritize sites on Kagi. You could probably also run some sort of local proxy or a Greasemonkey-style userscript that does the same in-browser on any search engine, something like the sketch below. The problem is that people are creating these sites faster than you’re going to be able to ban them.
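For illustration, here’s roughly what the userscript route might look like (written as TypeScript; you’d strip the type annotations or compile it before dropping it into a userscript manager). Everything in it is an assumption on my part: the `@match` pattern, the result-page selectors, and the blocked domains are placeholders you’d have to adapt to whatever search engine and sites you’re actually dealing with.

```typescript
// ==UserScript==
// @name         Hide blocklisted search results
// @match        https://duckduckgo.com/*
// @grant        none
// ==/UserScript==

// Personal blocklist; these domains are placeholders, not real judgments.
const BLOCKED_DOMAINS = ["example-recipe-farm.invalid", "another-seo-site.invalid"];

function hideBlockedResults(): void {
  // Assumption: each organic result contains an <a> pointing at the target site.
  // The selectors will need adjusting for whichever search engine you use.
  document.querySelectorAll<HTMLAnchorElement>("a[href^='http']").forEach((link) => {
    const host = new URL(link.href).hostname;
    if (BLOCKED_DOMAINS.some((d) => host === d || host.endsWith("." + d))) {
      // Hide the whole result block, not just the link itself.
      const result = (link.closest("article, li, .result") ?? link) as HTMLElement;
      result.style.display = "none";
    }
  });
}

hideBlockedResults();
// Re-run when the page loads more results dynamically.
new MutationObserver(hideBlockedResults).observe(document.body, {
  childList: true,
  subtree: true,
});
```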
Kagi’s also got a “pin” and a “raised priority” feature for a list of sites, and I suppose you could whitelist some known-good sites with those. What Kagi’s blacklist/deprioritize/prioritize/pin feature doesn’t have is a way for users to exchange their lists with each other (and I imagine there’d be some privacy issues with doing so); the closest thing is Kagi running a “leaderboard” of the most-blacklisted, deprioritized, prioritized, and pinned sites. One could probably go the proxy or plugin route on other search engines as well. Any general solution is going to need some level of interchange, since requiring every individual user to maintain their own “killfile” of websites is going to be impractical. It may be that the human labor involved in curation just can’t keep up with how cheap it is to generate new websites; not sure.
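If list interchange like that did exist (which, as far as I know, it doesn’t today), I’d picture the shared format being something about this simple. Every type name and field here is made up for illustration; none of it is a real Kagi feature:

```typescript
// Hypothetical interchange format for sharing site preferences between users.
type SiteAction = "block" | "lower" | "raise" | "pin";

interface SiteRule {
  domain: string;    // e.g. "example-recipe-farm.invalid"
  action: SiteAction;
  addedBy?: string;  // optional attribution; omit for privacy
}

interface SharedList {
  name: string;
  updated: string;   // ISO 8601 date
  rules: SiteRule[];
}

// Merge several shared lists into one personal killfile, with your own local
// rules winning on any conflict.
function mergeLists(local: SiteRule[], shared: SharedList[]): SiteRule[] {
  const merged = new Map<string, SiteRule>();
  for (const list of shared) {
    for (const rule of list.rules) merged.set(rule.domain, rule);
  }
  for (const rule of local) merged.set(rule.domain, rule); // local overrides
  return Array.from(merged.values());
}
```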
At some point, I assume it may become practical to just maintain a conservative whitelist of non-spam sites and accept that many useful websites will be excluded simply because we can’t validate them as non-spam. That would probably require human curation, which means either volunteer labor or a commercial service.
There’s also a secondary problem: if you curate content at the domain level, Web 2.0 sites that let users post content (Reddit, Wikipedia, the Threadiverse, etc.) can still have individual users inserting AI-generated spam. So a general solution is probably going to need some sort of below-the-domain filtering for at least the major sites, something like the rules sketched below.
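Here’s a rough idea of what below-the-domain rules could look like, again entirely hypothetical; the paths are just examples of the kind of thing you’d want to be able to target, not accusations against anyone:

```typescript
// Rules that can go below the domain level: whole domains, or specific
// paths (a subreddit, a user page, and so on).
interface PathRule {
  domain: string;
  pathPrefix?: string; // e.g. "/r/SomeSubreddit" or "/user/someaccount"
  block: boolean;
}

const rules: PathRule[] = [
  { domain: "reddit.com", pathPrefix: "/r/some_spammy_subreddit", block: true },
  { domain: "reddit.com", block: false },                 // rest of the site stays visible
  { domain: "example-recipe-farm.invalid", block: true }, // whole-domain rule still works
];

// Most specific matching rule wins: longer path prefixes beat shorter ones.
function isBlocked(url: string): boolean {
  const { hostname, pathname } = new URL(url);
  const matches = rules.filter(
    (r) =>
      (hostname === r.domain || hostname.endsWith("." + r.domain)) &&
      pathname.startsWith(r.pathPrefix ?? "")
  );
  if (matches.length === 0) return false;
  matches.sort((a, b) => (b.pathPrefix?.length ?? 0) - (a.pathPrefix?.length ?? 0));
  return matches[0].block;
}
```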
And there’s also the wrinkle that a “trusted good” site or user can become a spammer at some point. Spammers and people who want to run influence operations have been buying high-karma Reddit accounts, and the reputation that comes with them, for years. Domains expire, or their operators change. Reputation has value, and it can be sold. So that also has to be addressed.
This isn’t really a qualitative change. People have hand-crafted spam websites to grab searchers before; it’s just that using a computer to do it is far more cost-efficient, which brings the cost way down and opens up a lot of spam opportunities that wouldn’t have made financial sense before. So what you’re really aiming to do is drive the cost of making a spam website back up.

One possibility, and I’m absolutely confident TLS certificate issuers would like this one, would be tiers of TLS certificates, with some tiers costing a lot more. Search engine indexers could check and validate a site’s certificate “cost tier” when indexing it. That artificially inflates the cost of running a website, and can do so to an arbitrary degree. It’s not a fantastic approach, since it also tends to cut out legitimate individual and low-budget websites, but for a large company the price is basically a rounding error, while for a spammer it could easily exceed what a super-cheap LLM-generated site would ever earn back. It could be one component in a system that takes other factors into account.
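To be clear, no such cost tier exists in TLS today; this is purely a sketch of how an indexer might fold a signal like that into ranking as one weighted input among several, rather than as a hard gate. The field names and weights are all invented:

```typescript
// Hypothetical ranking signals an indexer might combine. None of these fields
// or weights are real; the point is that certificate cost tier would be one
// input among several, not a spam verdict on its own.
interface SiteSignals {
  certCostTier: number;   // 0 = cheapest/free tier, higher = pricier tier
  domainAgeYears: number;
  userReports: number;    // e.g. spam reports against the site
}

function rankingAdjustment(s: SiteSignals): number {
  // Made-up weights: reward pricier certs and domain age, penalize reports.
  return 0.5 * s.certCostTier + 0.3 * Math.min(s.domainAgeYears, 10) - 1.0 * s.userReports;
}

// A cheap, newly registered, frequently reported site scores well below an
// established one, but neither number is definitive by itself.
console.log(rankingAdjustment({ certCostTier: 0, domainAgeYears: 0.2, userReports: 12 })); // negative
console.log(rankingAdjustment({ certCostTier: 2, domainAgeYears: 8, userReports: 0 }));    // positive
```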