Gartner estimates that by 2022, 40% of AI/ML models will be trained on synthetic data.
Even if it may have some drop in utility today relative to real data, there are all sorts of scenarios where that’s outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
The "mays" in this statement are doing a lot of work.
There are a few areas of current practice that share this pattern: a) the arguments & evidence that it is worse are pretty simple, and b) the arguments & evidence for the potential benefits are either weak or very convoluted. That is never a good sign.
I think this happens mostly because the reasons these things are being done are, for the most part, not technical. But the technically oriented people involved don't like to think about it that way, and would rather talk about technical solutions - and that is operating at the wrong level.
The business & cost cases behind not doing this "right" in some abstract sense are pretty clear too, though. I wish more people would just be clear about this, and spend less effort obfuscating and more on clearly quantifying the cost of these workarounds.
Any time you hear someone start off by saying things like "we don't really need good labels", "this synthetic data will be better, actually", "we'll use transfer from X because it's already done most of the work", etc., well, what follows is quite likely to be good fertilizer.
Note, I'm not saying these approaches don't have value, just that there is an awful lot of magical thinking going on around them, and a lot of failures due to that.
There’s only one “may” there but yes, it masks “potentially losing crucial information”. The post I linked to is pretty clear on that.
I totally agree that the business need is the driver, and that people miss the imperatives if they look at it purely through a technical or mathematical lens.
In a sense, synthetic data is the least bad actually viable solution. The democracy of privacy / data agility :p
I was referring to both "it may have some drop in utility today... " and "in future it may well perform better than ..."
My issue is not that people just miss the imperatives, but that they also misapply effort because of it. Accept there is a cost and try and quantify the impact. Make intelligent risk management decisions based on that. Sometimes that decision is "this is unlikely to work, what else can we do".
Even if it may have some drop in utility today relative to real data, there are all sorts of scenarios where that’s outweighed by the speed and ease of working with fake data rather than real data tied up in red tape.
See for example: https://hazy.com/blog/2019/12/09/data-science-on-test-data
Also, synthetic data can be augmented, rebalanced, etc., which is why in future it may well perform better than real data for data science work.
For example, think about This Person Does Not Exist, and then apply the same idea to business data.
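To make the "rebalanced" point concrete, here is a deliberately toy sketch (not Hazy's actual method) of topping up minority classes with synthetic rows. A real generative model would draw fresh rows from a learned distribution; this stand-in just resamples minority rows and jitters them with Gaussian noise:

```python
import numpy as np

def rebalance_with_synthetic(X, y, rng=None):
    """Oversample minority classes with jittered synthetic rows.

    Toy stand-in for a generative model: synthetic rows are resampled
    minority-class rows plus small Gaussian noise scaled to each
    feature's spread.
    """
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()  # grow every class up to the largest one
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        rows = X[y == cls]
        idx = rng.integers(0, len(rows), size=deficit)
        noise = rng.normal(scale=0.05 * rows.std(axis=0),
                           size=(deficit, X.shape[1]))
        X_parts.append(rows[idx] + noise)
        y_parts.append(np.full(deficit, cls))
    return np.concatenate(X_parts), np.concatenate(y_parts)
```

For instance, a dataset with 90 rows of one class and 10 of another comes back with 90 of each, which is the kind of rebalancing that is awkward to do with real data alone.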
Disclosure: Hazy cofounder. We’ve been doing smart synthetic data for a few years now.