One step further that I've prototyped is encrypted, distributed search. The VPN relays willing to take the traffic can also double as web crawlers. The VPN clients encrypt their search terms, the VPN relays encrypt their search indexes, and the two sides perform an ElGamal homomorphic private set intersection over MinHash values in an elliptic curve group. This gives results that are better than keyword search but worse than Google's current context-aware search, with strong elliptic curve privacy guarantees.
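To make the homomorphic comparison idea concrete, here is a minimal sketch of a private equality test built on exponential ElGamal. It is not the prototype's protocol: it assumes a tiny multiplicative group modulo a toy safe prime instead of an elliptic curve, treats items as small integers, and all names (`keygen`, `encrypt`, `blind_difference`, `is_match`) are hypothetical. The client encrypts a value, the relay homomorphically subtracts its own value and blinds the difference with a random exponent, and the client learns only whether the two values were equal.

```python
import secrets

# Toy parameters (assumption, NOT secure): P = 2*Q + 1 is a safe prime and
# G = 4 generates the subgroup of prime order Q. A real system would use an
# elliptic curve group of cryptographic size instead.
P = 2039
Q = 1019
G = 4

def keygen():
    """Client key pair: secret exponent x and public key h = G^x mod P."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)

def encrypt(h, m):
    """Exponential ElGamal: encrypt G^m so that plaintexts add in the exponent."""
    k = secrets.randbelow(Q - 1) + 1
    return pow(G, k, P), (pow(G, m % Q, P) * pow(h, k, P)) % P

def blind_difference(ciphertext, s):
    """Relay side: turn Enc(m) into Enc(r * (m - s)) for a random nonzero r."""
    c1, c2 = ciphertext
    r = secrets.randbelow(Q - 1) + 1
    g_neg_s = pow(G, (-s) % Q, P)          # G^(-s), subtracts s in the exponent
    return pow(c1, r, P), pow((c2 * g_neg_s) % P, r, P)

def is_match(x, ciphertext):
    """Client side: decrypt; the result is 1 exactly when m == s (mod Q)."""
    c1, c2 = ciphertext
    # c1 has order dividing Q, so c1^(Q - x) is the inverse of c1^x.
    return (c2 * pow(c1, Q - x, P)) % P == 1

# Usage: the client checks one encrypted term against three relay index entries.
secret_key, public_key = keygen()
query = encrypt(public_key, 42)
relay_index = [7, 42, 99]
matches = [is_match(secret_key, blind_difference(query, s)) for s in relay_index]
```

Because the relay multiplies the difference by a fresh random `r`, a non-matching entry decrypts to a random-looking group element rather than revealing how far apart the two values were. In this sketch the relay never sees the query in the clear, but the client does learn which positions matched.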
To give this type of search higher precision and recall, you would have to focus especially on the indexing part (e.g. improve NLU of concepts in the pages), right?
The training of such ML models could be federated across the nodes in a private way.
Indexing is important for sure. The problem is that, to preserve privacy without falling back to heavyweight general-purpose multi-party computation, we have to give up a bit of the precision and recall of modern search engines. MinHash, or more generally Locality-Sensitive Hashing (LSH), is a good first approximation (better than term frequency, worse than ML-based search). Right now much of the web is unqueryable; my first goal was to allow the deep web and Tor services to be searched, even at just a rudimentary level.
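For anyone unfamiliar with MinHash, a rough sketch of the plaintext part looks like the following. This is only an illustration under my own assumptions (word shingles, SHA-256 with a seed prefix as the hash family, 128 hash functions); the function names are hypothetical and none of it is taken from the prototype. The fraction of signature positions on which two documents agree estimates the Jaccard similarity of their shingle sets.

```python
import hashlib

def shingles(text, k=2):
    """Return the set of k-word shingles of a text (lowercased)."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def _hash(seed, item):
    """One member of the hash family: SHA-256 over seed || item, truncated to 64 bits."""
    data = seed.to_bytes(4, "big") + item.encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, item) for item in items) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Usage: compare a query against an indexed page.
page = minhash_signature(shingles("private search over the deep web and tor services"))
query = minhash_signature(shingles("search the deep web and tor services privately"))
similarity = estimate_jaccard(page, query)
```

The signature entries are fixed-size values, which is what makes them a convenient input for a set-intersection step: each side only has to compare a small, fixed number of hashes per document instead of the full text.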