Gartner estimates that by 2022, 40% of AI/ML models will be trained on synthetic data.
Even if it may have some drop in utility today relative to real data, there are all sorts of scenarios where that’s outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
The "mays" in this statement are doing a lot of work.
There are a few areas of current practice that share this pattern: a) the arguments & evidence that it is worse are pretty simple, and b) the arguments & evidence for the potential benefits are either weak or very convoluted. That is never a good sign.
I think this happens mostly because the reasons these things are being done are, for the most part, not technical. But the technically oriented people involved don't like to think about it that way, and would rather talk about technical solutions - and that is operating at the wrong level.
The business & cost cases behind not doing this "right" in some abstract sense are pretty clear too, though. I wish more people would just be clear about this, and spend less effort obfuscating and more on clearly quantifying the cost of these workarounds.
Any time you hear someone start off by saying things like "we don't really need good labels", "this synthetic data will be better, actually", "we'll use transfer from X because it's already done most of the work", etc., well, what follows is quite likely to be good fertilizer.
Note, I'm not saying these approaches don't have value, just that there is an awful lot of magical thinking going on around them, and a lot of failures due to that.
There’s only one “may” there but yes, it masks “potentially losing crucial information”. The post I linked to is pretty clear on that.
I totally agree that the business need is the driver, and that people miss the imperatives if they look at it purely through a technical or mathematical lens.
In a sense, synthetic data is the least bad actually viable solution. The democracy of privacy / data agility :p
I was referring to both "it may have some drop in utility today... " and "in future it may well perform better than ..."
My issue is not that people just miss the imperatives, but that they also misapply effort because of it. Accept there is a cost and try and quantify the impact. Make intelligent risk management decisions based on that. Sometimes that decision is "this is unlikely to work, what else can we do".
Even if it may have some drop in utility today relative to real data, there are all sorts of scenarios where that’s outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
See for example: https://hazy.com/blog/2019/12/09/data-science-on-test-data
Also, synthetic data can be augmented, rebalanced, etc., which is why in future it may well perform better than real data for data science work.
For example, think about This Person Does Not Exist, and then apply the same idea to business data.
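To make the "rebalanced" point concrete, here is a deliberately toy sketch (not Hazy's actual method) of topping up minority classes with synthetic rows. A real generative model would draw fresh rows from a learned distribution; this stand-in just resamples minority rows and jitters them with Gaussian noise:

```python
import numpy as np

def rebalance_with_synthetic(X, y, rng=None):
    """Oversample minority classes with jittered synthetic rows.

    Toy stand-in for a generative model: synthetic rows are resampled
    minority-class rows plus small Gaussian noise scaled to each
    feature's spread.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()  # grow every class up to the largest one
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        rows = X[y == cls]
        idx = rng.integers(0, len(rows), size=deficit)
        noise = rng.normal(scale=0.05 * rows.std(axis=0),
                           size=(deficit, X.shape[1]))
        X_parts.append(rows[idx] + noise)
        y_parts.append(np.full(deficit, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

For instance, a dataset with 90 rows of one class and 10 of another comes back with 90 of each, which is the kind of rebalancing that is awkward to do with real data alone.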
Disclosure: Hazy cofounder. We’ve been doing smart synthetic data for a few years now.