Is this an example of where the old algorithm is capable of exploiting the information in its training database, but is not capable / not configured to ever explore? So by feeding it additional (context, recommendation, result) samples from the new algorithm, it is rapidly able to exploit the information to offer improved recommendations, even though it would never have proposed those recommendations?
More generally, it sounds like the old algorithm (and perhaps the new one too) is rigged to myopically make the best decision right now - to conservatively maximise the value of this one recommendation - without considering that there is value in the future in carrying out some ongoing experimental work to try new things and grow a diverse training dataset, which could pay off in subsequent rounds of recommendation.
A simple-to-describe but sub-optimal strategy to improve this would be an epsilon-greedy recommendation system: e.g. set epsilon=1%, so 99% of the time it makes a recommendation using the original algorithm, and 1% of the time it makes a recommendation at random (to gain novel information).
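The epsilon-greedy idea above is only a few lines. Here's a minimal sketch (the names `base_recommender`, `all_items`, etc. are illustrative, not from any particular system):

```python
import random

def epsilon_greedy_recommend(context, base_recommender, all_items, epsilon=0.01):
    """With probability epsilon, explore with a uniformly random item;
    otherwise exploit the base recommender's best-known choice."""
    if random.random() < epsilon:
        # Explore: generates novel (context, recommendation, result) samples
        return random.choice(all_items)
    # Exploit: the recommendation the existing algorithm would have made
    return base_recommender(context)
```

The random 1% looks wasteful round by round, but it is what keeps the training data diverse enough for later rounds to improve.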
I read a little about this kind of thing a few years ago: explore/exploit tradeoffs, online learning, regret minimisation, bandit algorithms, contextual bandits, upper confidence bounds, ...
It sounds like you're talking about the idea of introducing noise in order to prevent stagnation and make sure learning continues.
One of the trivial ways to do this with a recommender system is to change the priority of some search results so that, say, a page 5 result shows up on page 1.
You also do something similar when introducing noise in neural networks for image processing.
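The page-promotion trick mentioned above can be sketched in a few lines; this is a hypothetical illustration, not any production system's logic:

```python
import random

def promote_random_result(ranked_results, page_size=10, promote_prob=0.05):
    """Occasionally lift one deep result (e.g. from page 5) onto page 1,
    so the system keeps gathering feedback on items it would otherwise
    rarely or never show."""
    results = list(ranked_results)
    if len(results) > page_size and random.random() < promote_prob:
        # Pick a result from beyond page 1 and reinsert it within page 1
        i = random.randrange(page_size, len(results))
        promoted = results.pop(i)
        results.insert(random.randrange(page_size), promoted)
    return results
```

This is the same explore/exploit trade as epsilon-greedy, just expressed as a ranking perturbation rather than a wholesale random recommendation.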
A co-occurrence model isn't really meant to be used for exploration. I'm eliding a bunch of details, as it's just one of many different recommendation algorithms, and there's an exploration layer on top of the whole ensemble, which includes an epsilon-greedy component.
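For readers unfamiliar with the term: a co-occurrence model in its simplest form just counts how often items appear together and recommends the most frequent companions - purely exploitative, with no mechanism to propose anything outside the observed data. A toy sketch (details and names are my own, not the system described above):

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(sessions):
    """Count how often each pair of items appears in the same session."""
    co = Counter()
    for session in sessions:
        for a, b in combinations(set(session), 2):
            co[(a, b)] += 1
            co[(b, a)] += 1  # keep counts symmetric
    return co

def recommend(co, item, k=3):
    """Rank candidates by how often they co-occurred with `item`."""
    scores = Counter({b: n for (a, b), n in co.items() if a == item})
    return [b for b, _ in scores.most_common(k)]
```

Nothing here ever surfaces an item that hasn't already co-occurred with the query item, which is why exploration has to live in a separate layer on top.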
I may have missed the subtext/point of your earlier comment:
that the additional samples generated by the new candidate algorithm were visible to the existing algorithm, making a clean comparison of the two difficult, and that this visibility (or its consequences) was not initially anticipated.