Addendum: Most data contains some fraction of noise, or depends on information not fully captured by the training data.
In both cases, training to a perfect fit of the training data makes no sense.
Learning the particular noise in training data guarantees worse results on real data, where the noise will, by definition, be different.
The same is true for exactly reproducing data that depended on some unaccounted-for information: once the model is deployed, that unaccounted-for information has no predictive value.
So perfectly fitting data is usually a terrible idea, even when it can be done.
Training data captures a problem to be solved, as best it can. But it isn’t the same as the actual problem.
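To make this concrete, here is a minimal sketch, with a function, noise level & sample sizes of my own choosing: a degree-14 polynomial through 15 noisy points reproduces the training data essentially perfectly, yet predicts a fresh sample from the same process worse than a modest degree-3 fit.

```python
# A minimal sketch (numpy only; the function, noise level & sizes are
# illustrative choices) of chasing a perfect fit: with 15 points, a
# degree-14 polynomial interpolates the training set exactly, noise included.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + rng.normal(0, 0.3, n)      # signal + noise

x_train, y_train = sample(15)
x_test, y_test = sample(200)       # fresh draw: same signal, different noise

for degree in (3, 14):
    coeffs = np.polyfit(x_train, y_train, degree)        # least-squares fit
    err = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: train MSE {err(x_train, y_train):.4f}, "
          f"test MSE {err(x_test, y_test):.4f}")
```

The perfect fit has memorized the particular noise in those 15 points; the fresh sample, by definition, doesn't share it.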
——
Best practice (a code sketch follows below):
1. Use training data to optimize the model.
2. Use separate validation data to approximate generalization quality as training proceeds, and stop at (or revert to) the point where validation performance was best.
3. Use separate test data, not used for model design in any way, for a completely independent approximation of generalization performance.
Wide disparities between validation & test performance suggest problems. Ideally they should track each other fairly closely.
If not, more data is probably needed to characterize the problem more reliably.
NEVER just retrain, or tweak training parameters, until training plus the validation-based stopping point produce good test performance. That would mean the test data was actually used in the design, so it is no longer an independent measure of performance!
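Here is a sketch of that three-step protocol: disjoint splits, a validation-tracked stopping point, and a single final look at the test data. The model (a linear fit trained by gradient descent on synthetic data), the 60/20/20 split, the learning rate & epoch count are all illustrative assumptions; substitute your own pipeline.

```python
# A sketch of the three-step protocol on synthetic data; the split sizes,
# learning rate & epoch count are assumptions, not recommendations.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 0.5, 500)

# Three disjoint divisions: train optimizes, validation picks the stopping
# point, test is looked at exactly once at the very end.
idx = rng.permutation(len(X))
tr, va, te = idx[:300], idx[300:400], idx[400:]

w = np.zeros(8)
mse = lambda ind: np.mean((X[ind] @ w - y[ind]) ** 2)

best_val, best_w = np.inf, w.copy()
for epoch in range(200):
    grad = 2 * X[tr].T @ (X[tr] @ w - y[tr]) / len(tr)   # (1) fit training data
    w -= 0.05 * grad
    if mse(va) < best_val:                               # (2) track validation
        best_val, best_w = mse(va), w.copy()

w = best_w                                               # revert to best point
print(f"validation MSE {best_val:.4f}, test MSE {mse(te):.4f}")  # (3) one look
```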
——
If you are having trouble getting similar validation & test performance:
A good way to ensure test performance is real is to retrain the model in exactly the same way, several times, with different random divisions of the training, validation & test data. If test results are good regardless of the data division, they are reliable.
Then you can eliminate the dependency on test performance by randomly selecting any one of the models. Don't choose the one with the best test results!
That takes discipline!
In that case, the mean of all the test performances is your best estimate of generalization performance, regardless of which model you selected.
(Throwing out the models with the worst and best test performances is ok; it avoids outlier training failures and successes equally. Note I said “avoids”, not “eliminates”, since the worst & best test performances are estimates of generalization, not measures of actual generalization.)
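A sketch of that check, continuing the synthetic setup above. train_and_eval is a hypothetical stand-in for your full training procedure (including the validation-based stop); only the seed, and hence the random division, varies per run.

```python
# Hypothetical stand-in for the full training procedure above (including the
# validation-based stop); only the random division changes with the seed.
import numpy as np

def train_and_eval(X, y, seed):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[:300], idx[400:]      # middle 100 points left for validation
    w, *_ = np.linalg.lstsq(X[tr], y[tr], rcond=None)    # toy "training"
    return np.mean((X[te] @ w - y[te]) ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 8))
y = X @ rng.normal(size=8) + rng.normal(0, 0.5, 500)

scores = np.array([train_and_eval(X, y, seed) for seed in range(10)])
chosen = rng.integers(len(scores))     # deploy a RANDOM run, never the best
trimmed = np.sort(scores)[1:-1]        # drop the best & worst single runs
print(f"deploying run {chosen}; mean test MSE {scores.mean():.4f} "
      f"(trimmed mean {trimmed.mean():.4f}, spread {scores.std():.4f})")
```

If the spread across runs is small, the mean (or trimmed mean) is a defensible estimate of generalization for whichever run you deploy.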