A little off-topic, since we’re not talking about true synthetic data here (vs. wiping PII), but the future of synthetic data may be no data at all, thanks to differentiable programming. Instead of a program dumping vast amounts of synthetic data to disk, the generator is written in a differentiable library or language, so gradients from the downstream model’s loss can flow straight back into the generation process. A few PyTorch libraries dealing with 3D modeling have been released lately that accomplish this, and a good deal of work in Julia is making promising advances. I’m curious to see how overfitting will be addressed, but there may come a time when large datasets are a thing of the past and low-level data generation is just another component of a model’s architecture.
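
To make the idea concrete, here’s a minimal, hypothetical sketch in PyTorch (not any particular library’s API): a tiny generator network plays the role of the differentiable “program”, and because the synthetic batch is produced inside the computation graph, the same backward pass that trains the model also updates the generator. The generator architecture, the quadratic toy target, and all hyperparameters are invented for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DifferentiableGenerator(nn.Module):
    """Maps random noise to synthetic samples; every op is differentiable."""
    def __init__(self, noise_dim=8, sample_dim=2):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 32),
            nn.Tanh(),
            nn.Linear(32, sample_dim),
        )

    def forward(self, batch_size):
        z = torch.randn(batch_size, self.noise_dim)
        return self.net(z)

generator = DifferentiableGenerator()
model = nn.Sequential(nn.Linear(2, 16), nn.ReLU(), nn.Linear(16, 1))

# A single optimizer over BOTH parameter sets: the model's loss
# backpropagates through the generated batch into the generator itself,
# so no dataset is ever materialized.
opt = torch.optim.Adam(
    list(generator.parameters()) + list(model.parameters()), lr=1e-3
)

for step in range(1_000):
    x = generator(64)                               # fresh synthetic batch, on the fly
    y = (x ** 2).sum(dim=1, keepdim=True).detach()  # toy ground-truth signal
    loss = F.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()                                 # gradients also reach generator weights
    opt.step()
```

Note that nothing in this sketch forces the generator to produce hard or realistic samples; left alone, it can drift toward inputs the model already predicts well, which is one face of the overfitting question above.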