Do I understand correctly that nearly all issues were related to counting (i.e. numerical operations)? That still makes it impressive, because you can do that client-side with the structured data.
I think this is just a propagation of the issues larger organizations may generally have with dependencies.
You don’t want your applications to be dependent on the update cycle of other organizations/individuals. What if the bug fix does not get merged into master fast enough? Do you fork? When do you recombine again?
But otherwise I agree: sharing code and making it better for everyone, as well as delegating complicated parts of the application to others, are good practices.
You are definitely right; there are numerous classic applications (i.e. outside of the cutting-edge CV/NLP stuff) that could greatly benefit from such a measure.
The question is: Why don’t people use these models?
While Bayesian Neural Networks might be tricky to deploy & debug for some people, Gaussian Processes etc. are readily available in sklearn and other implementations.
My theory: most people do not learn these methods in their "Introduction to Machine Learning" classes. Or is it a lack of scalability in practice?
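To the point that GPs are readily available: a minimal sketch of what that looks like in scikit-learn (the toy data and kernel choice here are just for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D regression problem with observation noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)

# RBF kernel plus a learned noise term; hyperparameters are
# optimized automatically during fit().
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)

# The selling point: predictions come with uncertainty estimates.
mean, std = gpr.predict([[0.0]], return_std=True)
```

So the barrier is probably not availability: this is a handful of lines, and `return_std=True` gives you calibrated-ish uncertainty for free.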
They often don’t scale, they are tricky to implement in frameworks people are familiar with, and, most importantly, they make crude approximations, meaning that after all this effort they often don’t beat simple baselines like the bootstrap. It’s an exciting area of research though.
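For comparison, here is a minimal sketch of that bootstrap baseline: train an ensemble on resampled data and read uncertainty off the spread of predictions (the base model and ensemble size are arbitrary choices here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Bootstrap ensemble: each model sees a resample (with replacement)
# of the training data.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

X_test = np.array([[0.0], [2.5]])
preds = np.stack([m.predict(X_test) for m in models])  # shape (25, 2)

# Mean as the prediction, spread across the ensemble as the
# uncertainty estimate.
mean, std = preds.mean(axis=0), preds.std(axis=0)
```

No special inference machinery, trivially parallel, and it is often the baseline a fancy approximate-inference method has to beat.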
Do you have an anecdotal guess on the scalability barrier, maybe? Like, does it take too long with more than 10,000 data points and 100 features? Just to get a feel.
Please don't quote me on that, as it was academic work in a given language and a given library and might not be representative of the whole ecosystem.
But in a nutshell, on OK-ish CPUs (Xeons a few generations old), we started seeing problems past a few thousand points with a few dozen features.
And not only was the training slow, but so was the inference: since we used the whole sampled chain of the weight distributions’ parameters, memory consumption was a sight to behold, and inference time quickly went through the roof whenever subsampling was not used.
And all that was on standard NNs, so no complexity added by e.g. convolution layers.
The main bottleneck in GP models is the inversion of an N×N covariance matrix, so training with the most straightforward algorithm has cubic time complexity (and quadratic memory complexity). 10k instances is what I’ve seen as the limit of tractability.
The input dimensionality doesn’t necessarily matter since it’s a kernel method, but if you have many features and want to do feature selection or optimize hyperparameters, things can really stack up.
There are scalable approximate inference algorithms, and pretty good library support (GPflow, GPyTorch, etc.), but it seems like they are not widely known, and there are definite tradeoffs to consider among the different methods.
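The cubic wall is easy to see even without a GP library: the core of exact GP training is a Cholesky factorization of the N×N kernel matrix. A rough plain-NumPy sketch (RBF kernel; the lengthscale and jitter values are illustrative):

```python
import numpy as np

def exact_gp_fit(X, y, lengthscale=1.0, noise=1e-2):
    # Pairwise squared distances, then RBF kernel: K is N x N,
    # so memory alone is already O(N^2).
    sq_norms = (X ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-0.5 * sq / lengthscale ** 2)

    # Cholesky factorization of (K + noise*I): O(N^3) time. This is
    # the step that makes doubling N roughly 8x more expensive.
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))

    # Solve (K + noise*I) alpha = y via the two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

rng = np.random.default_rng(0)
for n in (500, 1_000, 2_000):  # each doubling ~8x the Cholesky cost
    X = rng.normal(size=(n, 10))
    y = rng.normal(size=n)
    L, alpha = exact_gp_fit(X, y)
```

Scaling this to 100k points means a 100k×100k matrix (~80 GB in float64) before you even start factorizing, which is exactly the gap the inducing-point and variational methods in those libraries are trying to close.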
I'm not sure if I understand this correctly, but wouldn’t this prevent other sites from being able to emerge in the future? Basically a forced monopoly for the currently popular sites.
We're lamenting the fact that these authentic documentation sites are not the top results, instead being overtaken by bad copycat sites. I'm suggesting not relying on algorithmic improvements to do this and just pinning them to the top, since this is the wanted outcome anyway. The suggestion is not that these pins should be immutable for all time.
I'm not suggesting this be done for commercial sites, just neutral documentation sites for languages and frameworks. This would be strictly better from the perspective of the searcher.