Do I understand correctly that nearly all issues were related to counting (i.e. numerical operations)? That still makes it impressive, because you can do that client-side with the structured data.
I think this is just a propagation of the issues larger organizations may generally have with dependencies.
You don’t want your applications to be dependent on the update cycle of other organizations/individuals. What if the bug fix does not get merged into master fast enough? Do you fork? When do you recombine again?
But otherwise I agree: sharing code and making it better for everyone, as well as delegating complicated parts of the application to others, are good practices.
You are definitely right; there are numerous classic applications (i.e. outside of the cutting-edge CV/NLP stuff) that could greatly benefit from such a measure.
The question is: Why don’t people use these models?
While Bayesian Neural Networks might be tricky to deploy & debug for some people, Gaussian Processes etc. are readily available in sklearn and other implementations.
My theory: most people do not learn these methods in their "Introduction to Machine Learning" classes. Or is it a lack of scalability in practice?
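To the point that GPs are readily available: a minimal sketch of what that looks like in scikit-learn (the toy data and kernel choice here are just for illustration):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D regression problem with observation noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=50)

# RBF kernel plus a learned noise term; hyperparameters are
# optimized automatically during fit().
gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel()).fit(X, y)

# The selling point: predictions come with uncertainty estimates.
mean, std = gpr.predict([[0.0]], return_std=True)
```

So the barrier is probably not availability: this is a handful of lines, and `return_std=True` gives you calibrated-ish uncertainty for free.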
They often don’t scale, they are tricky to implement in frameworks people are familiar with, and, most importantly, they make crude approximations, meaning that after all this effort they often don’t beat simple baselines like the bootstrap. It’s an exciting area of research though.
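For comparison, here is a minimal sketch of that bootstrap baseline: train an ensemble on resampled data and read uncertainty off the spread of predictions (the base model and ensemble size are arbitrary choices here):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

# Bootstrap ensemble: each model sees a resample (with replacement)
# of the training data.
models = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    models.append(DecisionTreeRegressor(max_depth=5).fit(X[idx], y[idx]))

X_test = np.array([[0.0], [2.5]])
preds = np.stack([m.predict(X_test) for m in models])  # shape (25, 2)

# Mean as the prediction, spread across the ensemble as the
# uncertainty estimate.
mean, std = preds.mean(axis=0), preds.std(axis=0)
```

No special inference machinery, trivially parallel, and it is often the baseline a fancy approximate-inference method has to beat.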
Do you have an anecdotal guess on the scalability barrier, maybe? Like, does it take too long with more than 10,000 data points and 100 features? Just to get a feel.
Please don't quote me on that, as it was academic work in a given language and a given library and might not be representative of the whole ecosystem.
But in a nutshell, on OK-ish CPUs (Xeons a few generations old), we started seeing problems past a few thousand points with a few dozen features.
And not only was the training slow, but so was the inference: since we used the whole sampled chain of the weight distributions’ parameters, memory consumption was a sight to behold, and inference time quickly went through the roof whenever subsampling was not used.
And all that was on standard NNs, so no complexity added by e.g. convolution layers.
The main bottleneck in GP models is the inversion of an N×N covariance matrix, so training with the most straightforward algorithm has cubic time complexity (and quadratic memory complexity). 10k instances is what I’ve seen as the limit of tractability.
The input dimensionality doesn’t necessarily matter since it’s a kernel method, but if you have many features and want to do feature selection or optimize hyperparameters, things can really stack up.
There are scalable approximate inference algorithms, and pretty good library support (GPflow, GPyTorch, etc.), but it seems like they are not widely known, and there are definite tradeoffs to consider among the different methods.
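The cubic wall is easy to see even without a GP library: the core of exact GP training is a Cholesky factorization of the N×N kernel matrix. A rough plain-NumPy sketch (RBF kernel; the lengthscale and jitter values are illustrative):

```python
import numpy as np

def exact_gp_fit(X, y, lengthscale=1.0, noise=1e-2):
    # Pairwise squared distances, then RBF kernel: K is N x N,
    # so memory alone is already O(N^2).
    sq_norms = (X ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    K = np.exp(-0.5 * sq / lengthscale ** 2)

    # Cholesky factorization of (K + noise*I): O(N^3) time. This is
    # the step that makes doubling N roughly 8x more expensive.
    L = np.linalg.cholesky(K + noise * np.eye(len(X)))

    # Solve (K + noise*I) alpha = y via the two triangular solves.
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return L, alpha

rng = np.random.default_rng(0)
for n in (500, 1_000, 2_000):  # each doubling ~8x the Cholesky cost
    X = rng.normal(size=(n, 10))
    y = rng.normal(size=n)
    L, alpha = exact_gp_fit(X, y)
```

Scaling this to 100k points means a 100k×100k matrix (~80 GB in float64) before you even start factorizing, which is exactly the gap the inducing-point and variational methods in those libraries are trying to close.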
I'm not sure if I understand this correctly, but wouldn’t this prevent other sites from being able to emerge in the future? Basically a forced monopoly for the currently popular sites.
We're lamenting the fact that these authentic documentation sites are not the top results, instead being overtaken by bad copycat sites. I'm suggesting not relying on algorithmic improvements to do this and just pinning them to the top, since this is the wanted outcome anyway. The suggestion is not that these pins should be immutable for all time.
I'm not suggesting this be done for commercial sites, just neutral documentation sites for languages and frameworks. This would be strictly better from the perspective of the searcher.