In health care AI there is some tendency to use generated data for training. The idea is that org A has real patient data but for privacy reasons cannot share it; if they build a sufficiently strong generator, they can share that instead, and org B can train their classifier without ever accessing sensitive data. Alternatively, it can also be used when you simply have too little data and need augmentation.
This just seems like something that will catastrophically fail. If you can build a good enough generator, you can just build the ML model internally. And if you can't, the statistics of what you provide are going to be off enough that any strong model is going to be wrong in strange ways.
This is an incredibly important point: in order for your synthetic data to be useful, your simulator must have already solved the problem at hand. In theory there is no need to even fool around with generating synthetic data and going through the charade of training a model on it; simply extract the outcome model from your simulator directly, as that's implicitly what you are doing anyway. For example, if you have a generative model that provides densities, you can simply compute P(Y | X) = P(X, Y) / P(X).
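To make that concrete, here is a minimal sketch assuming the easiest possible case: a toy generative model whose joint density is a discrete table (all numbers made up). The classifier falls straight out of Bayes' rule, with no synthetic-data detour.

```python
import numpy as np

# Hypothetical joint P(X, Y) over 3 feature bins and 2 labels; sums to 1.
p_xy = np.array([[0.30, 0.05],
                 [0.10, 0.20],
                 [0.05, 0.30]])

p_x = p_xy.sum(axis=1, keepdims=True)  # marginal P(X)
p_y_given_x = p_xy / p_x               # P(Y | X) = P(X, Y) / P(X)

print(p_y_given_x)  # each row is already a classifier for that x
```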
But this is not how generators work. They generally produce samples in the form
G: Q -> (X,Y)
where Q is some prior from which you are sampling. If the generator is not invertible then you straight up cannot get P(X, Y) out of it. Even if it is invertible, getting P(X) requires integrating out Y, which might be infeasible (the model may have no closed-form integral, and may change fast enough that you need very, very many samples to estimate it).
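A sketch of that failure mode, with a hypothetical `joint_log_density` standing in for what an invertible flow would give you via the change-of-variables formula. The joint is sharply peaked in Y, so a naive Monte Carlo estimate of P(X) wanders badly until the sample count gets huge:

```python
import numpy as np

rng = np.random.default_rng(0)

def joint_log_density(x, y):
    # Hypothetical peaked joint: p(x, y) = N(x; 0, 1) * N(y; mean=x, std=0.01).
    # Most y values contribute essentially zero to the integral over y.
    return -0.5 * (x**2 + ((y - x) / 0.01) ** 2) - np.log(2 * np.pi * 0.01)

def marginal_estimate(x, n_samples):
    # Importance-sample p(x) = E_{y ~ Uniform(-5, 5)}[p(x, y) * 10].
    ys = rng.uniform(-5, 5, size=n_samples)
    return np.exp(joint_log_density(x, ys)).mean() * 10.0

for n in (10**2, 10**4, 10**6):
    # True value is N(0; 0, 1), about 0.3989; small-n estimates are way off.
    print(n, marginal_estimate(0.0, n))
```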
Very true. If you've solved the labeling/extraction problem using a means other than ML, you can use that means to generate synthetic data. The situation at my company is exactly this.
Say you use regular expressions to extract sensitive data from standardized but highly varied form documents. The pieces of information extracted are very common classes of data: first name, last name, dates, physical locations.
During the extraction process you can save the complement of the extraction (the "leftovers") and insert generated data at the extraction points. Also, because you've extracted the actual sensitive data, you can exclude that from the set of values used for generation, if it's practical.
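A minimal sketch of that workflow, with hypothetical patterns and hypothetical substitute values: the regexes pull out the sensitive spans, the leftovers stay verbatim, and generated values are spliced in at the extraction points, excluding anything that was actually extracted.

```python
import re
import random

PATTERNS = {
    "date": re.compile(r"\b\d{2}/\d{2}/\d{4}\b"),
    "name": re.compile(r"(?<=Patient: )[A-Z][a-z]+ [A-Z][a-z]+"),
}

FAKE_VALUES = {
    "date": ["01/02/1990", "11/23/1985", "07/14/2001"],
    "name": ["Jane Roe", "John Doe", "Alex Smith"],
}

def synthesize(document: str) -> str:
    extracted = set()
    out = document
    for kind, pattern in PATTERNS.items():
        # Save what we extract so we can exclude it from generation.
        extracted.update(pattern.findall(out))
        pool = [v for v in FAKE_VALUES[kind] if v not in extracted] or ["[REDACTED]"]
        out = pattern.sub(lambda _m: random.choice(pool), out)
    return out

print(synthesize("Patient: Mary Major  DOB: 03/05/1972  Visit: 06/01/2020"))
```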
Sometimes people get so caught up in the math and theory that they fail to see the practical solutions.
I agree that this is very tricky. I think the most interesting synthetic healthcare data generation I've seen used causal inference (where SMEs can bake in a bunch of expert knowledge during skeleton construction) and then generated data by estimating the weights on the edges from a smaller dataset. At the same time, it is very hard to ensure that your synthetic dataset actually reflects the real world. On one hand, SME knowledge might give extra oomph to synthetic data generation (this knowledge is equivalent to some highly abstracted training), but if the "expert knowledge" is wrong then it's a recipe for disaster.
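For illustration, a hand-rolled sketch of that idea (all variable names and the tiny dataset are made up): the DAG skeleton, with smoker and age as parents of disease, is the expert knowledge; the conditional probabilities are then estimated from a small real dataset before sampling synthetic records.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small "real" dataset: columns are (smoker, old, disease), all binary.
real = np.array([
    [1, 1, 1], [1, 0, 1], [0, 1, 1], [0, 0, 0],
    [1, 1, 1], [0, 0, 0], [0, 1, 0], [1, 0, 0],
])

# Fit P(smoker), P(old), and P(disease | smoker, old) by counting,
# with add-one smoothing for rarely seen parent configurations.
p_smoker = real[:, 0].mean()
p_old = real[:, 1].mean()
p_disease = np.zeros((2, 2))
for s in (0, 1):
    for o in (0, 1):
        rows = real[(real[:, 0] == s) & (real[:, 1] == o)]
        p_disease[s, o] = (rows[:, 2].sum() + 1) / (len(rows) + 2)

def sample(n):
    # Ancestral sampling along the expert-specified skeleton.
    s = rng.random(n) < p_smoker
    o = rng.random(n) < p_old
    d = rng.random(n) < p_disease[s.astype(int), o.astype(int)]
    return np.column_stack([s, o, d]).astype(int)

print(sample(5))  # synthetic records respecting the expert skeleton
```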
> In health care AI there is some tendency to use generated data for training.
Which is part of the reason for the high failure rate.
Good governance and data access for health data is a very hard problem. Good labeling is also hard/expensive in this space.
So there is an incentive for people who want to do ML/AI without solving the above to try any shortcut they can think of. This incentive doesn't help solve any real problems.
The classic solution to "too little data" is to use a simpler and/or less discriminating model. It's still the only one with a good track record.
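As a quick illustration (arbitrary synthetic dataset and hyperparameters, using scikit-learn), this is how you'd compare a heavily regularized linear model against a much more flexible one on a tiny sample; on small data the simpler model is often competitive or better under cross-validation, though which wins depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Deliberately tiny sample relative to the feature count.
X, y = make_classification(n_samples=60, n_features=20, random_state=0)

simple = LogisticRegression(C=0.1, max_iter=1000)   # strong regularization
complex_ = RandomForestClassifier(n_estimators=500, random_state=0)

for name, model in [("simple", simple), ("complex", complex_)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean().round(3))
```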
Transfer learning is nothing like a silver bullet. True, it has become an important workaround, but it's no panacea and its track record is at best mixed.
People already use it quite a lot. More importantly, they misuse it a lot. I'd be less concerned with increasing the usage, and more concerned that the people using it understand the implications and trade-offs.