I'm still amazed at how bad Google search is for certain purposes (not that there's anything better), and it's largely due to content farms. I've installed Google's Personal Blocklist Chrome extension, which lets me filter out specific domains (ask.com, ehow.com, answers.yahoo.com, wikihow.com, etc.), and that does help some. It's interesting to me that Google search used to have a preference for blocking domains, but they removed it.
I actually find myself narrowing my searches by sites I know will have reliable information, like this one, reddit, certain forums based on the search topic, etc. I think there would be some real value in creating a search engine that was very selective about the sites it crawls. Honestly, crawling the comments from the best user-participation sites on the web (reddit, HN, SO, quora, etc.) would probably make for a very useful search engine.
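That narrowing-by-site habit can be approximated with the standard `site:` operator; here's a minimal sketch that builds an OR'd whitelist query (the site list and function name are just illustrative):

```python
# Build a search query restricted to a hand-picked set of trusted sites.
# The whitelist here is an example, not a recommendation.
TRUSTED = ["news.ycombinator.com", "reddit.com", "stackoverflow.com"]

def restricted_query(terms: str, sites=TRUSTED) -> str:
    # OR together "site:" operators so results stay inside the whitelist.
    scope = " OR ".join(f"site:{s}" for s in sites)
    return f"{terms} ({scope})"

print(restricted_query("custom cabinetry"))
# → custom cabinetry (site:news.ycombinator.com OR site:reddit.com OR site:stackoverflow.com)
```

It's crude (no per-topic site sets, and long OR chains hit operator limits), but it gets you the "only sites I trust" behavior without a custom crawler.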
This was easy to fix before Google broke their personal exclusion lists. You used to be able to block a set of sites from your Google searches. I guess Google figured out they were losing potential advertisers.
Try blekko.com (we get rid of all those spammy sites). If you prefer the traditional look (without all the categorization displayed separately) you can use our legacy interface at http://edit.blekko.com
I looked at this minus-million thing again yesterday when my brother brought it up. I'd seen it before, but we were talking about search engine crappiness and alternatives.
Since I remembered Blekko (met Greg, the founder guy, in SF once), I pulled it up and we searched for yunnan, a province in China, as a test. Most of the results were for commercial tour operators or thinly veiled redirects to such.
The million-missing one on the other hand turned up more interesting or 'bespoke' content.
I am ignorant of such matters, but I would have thought a SpamAssassin-style Bayesian model based on sentiment analysis, advertising frequency, update frequency, hosting location, content originality, or any similar clump of readily obtainable metrics would be enough to usefully cull the vast majority of the useless modern stuff.
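The Bayesian idea above can be sketched in a few lines of naive Bayes, assuming each site has been reduced to a set of binary features (the feature names, labels, and tiny training set here are all invented for illustration):

```python
from collections import Counter
from math import log

# Toy training data: per-site feature sets with hand-assigned labels.
# Features like "heavy_ads" stand in for the metrics mentioned above.
train = [
    ({"heavy_ads", "scraped_text"}, "farm"),
    ({"heavy_ads", "rare_updates"}, "farm"),
    ({"original_text", "frequent_updates"}, "good"),
    ({"original_text", "few_ads"}, "good"),
]

def fit(data):
    labels = Counter(lbl for _, lbl in data)
    feats = {lbl: Counter() for lbl in labels}
    for fs, lbl in data:
        feats[lbl].update(fs)
    return labels, feats

def classify(features, labels, feats, vocab_size=10):
    best, best_lp = None, float("-inf")
    total = sum(labels.values())
    for lbl, n in labels.items():
        lp = log(n / total)  # class prior
        for f in features:
            # Laplace smoothing so an unseen feature doesn't zero out a class.
            lp += log((feats[lbl][f] + 1) / (n + vocab_size))
        if lp > best_lp:
            best, best_lp = lbl, lp
    return best

labels, feats = fit(train)
print(classify({"heavy_ads"}, labels, feats))  # → farm
```

Real metrics (update frequency, ad density) would be continuous rather than binary, and the hard part is labeling enough sites honestly, but the mechanics really are this simple.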
I mean, if I want commercial tour operators, I'll tell the search engine by typing something like "prices" or "companies" or "costs" or whatnot.
The other realization we had, doing this on an iPad, was that search engine interfaces positively suck. They're still stuck in the 90s. With a touch-based interface, there should be a more interactive model for query refinement than text editing. Like, uncheck [x] commercial sites.
I just noticed that I often go to reddit to search for things instead of Google, even though reddit's own search sucks. But it's better than 50 pages of Yahoo Answers and similar sites, plus some random forums.
Indeed, I had to switch back from DuckDuckGo to Google as I just couldn't deal with all the nonsense sites that polluted my search results, and DDG doesn't offer a personal blocklist feature to eliminate them.
For example, sites like yellowpages.com, whitepages.com, superpages.com, zillow.com, citysquares.com will all pollute basic searches that look like job descriptions ("custom cabinetry new york", for example).
(I complained to DDG about this a while back, and it looks like they have added some negative boosts to some spam sites.)
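A negative boost like the one DDG apparently added can be sketched as a re-ranking pass over scored results; the penalty table, score values, and URLs below are made up for illustration:

```python
from urllib.parse import urlparse

# Hypothetical per-domain penalties; values are illustrative only.
PENALTY = {"yellowpages.com": 0.5, "superpages.com": 0.5}

def rerank(results):
    """Re-sort (url, score) pairs, demoting penalized domains."""
    def adjusted(item):
        url, score = item
        domain = urlparse(url).netloc.removeprefix("www.")
        return score - PENALTY.get(domain, 0.0)
    return sorted(results, key=adjusted, reverse=True)

hits = [("https://www.yellowpages.com/ny", 0.9),
        ("https://example-cabinetmaker.com/about", 0.6)]
print(rerank(hits)[0][0])  # → https://example-cabinetmaker.com/about
```

The appeal of a demotion over a hard block is that a penalized site can still surface when nothing better matches the query.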
I searched for "180sx" on this site and the first ENTIRE PAGE of links were amazingly relevant and from sites I'd never seen before. Seriously useful information: local Australian body part suppliers, build logs.
I google car-related stuff all day and this is the most useful stuff I've seen in ages. From the first page of results.