Yup, people tend to conflate the concepts and refer to synthetic data as anonymised data. They are very different things.
Anonymised data or redacted data are transformations of a data set that one _hopes_ do not leak too much PII / sensitive data. People don’t use ML to anonymise directly, but they do use ML to classify PII as a first step before splatting or generalising it.
In that setup, it’s entirely expected that a classifier that isn’t 100% accurate will let PII leak through.
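A minimal sketch of that failure mode, using a toy regex-based "classifier" (real pipelines use ML/NER models, but the leak mechanism is identical: anything the model fails to flag passes through verbatim). The patterns and sample record here are hypothetical:

```python
import re

# Toy "classifier": regexes for two common PII formats. An ML model
# plays this role in practice, with the same blind-spot problem.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN, dashed form only
    re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"),  # simple email address
]

def redact(text: str) -> str:
    """Replace every span the classifier flags with [REDACTED]."""
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

record = "Contact jane@example.com, SSN 123-45-6789, alt SSN 123456789."
print(redact(record))
# The email and the dashed SSN are caught, but the undashed SSN is a
# format the classifier doesn't recognise, so it leaks into the output.
```

The point isn’t that regexes are bad and ML is good: swap the regexes for a 99%-accurate NER model and you still ship the 1% it misclassifies.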
This is a key reason why anonymisation and redaction are widely seen as problematic and are being replaced by synthetic data and, maybe in future, homomorphic encryption.
Homomorphic encryption, like any encryption-in-use technology, is no guarantee of privacy on its own. Synthetic data faces the same dilemma of utility vs anonymity as any other anonymisation tech.