> My guess is that probably not. It's more likely you had a streak of good luck in your earlier interactions and now you're observing regression to the mean.
Or, more straightforwardly, with "beginner's luck", which can be seen as a form of survivor bias. Most people, when they start gambling, win and lose close to the average. Some people, when they start gambling, lose more than average -- and as a result are much less likely to continue gambling. Others, when they start gambling, win more than average -- and as a result are much more likely to continue gambling. Most long-term / serious gamblers did win more than average when starting out, because the ones who lost more than average didn't become long-term / serious gamblers.
Almost certainly a similar effect is happening with GPT-4: people whose early interactions were better than average became avid users, and really are experiencing a drop in quality, purely by statistics; people whose early interactions were worse than average gave up and never became avid users.
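The selection effect described above is easy to demonstrate with a toy simulation (the numbers here are made up for illustration and have nothing to do with any real model): give every user i.i.d. random interaction-quality scores, keep only the users whose first few interactions beat the average, and compare their early scores to their later scores.

```python
# Toy simulation of survivorship bias: users who get lucky early "stay",
# and their later experience regresses to the population mean.
import random

random.seed(42)

N_USERS = 100_000
EARLY, LATER = 5, 20  # interactions in the early phase vs. afterwards

stayers_early, stayers_later = [], []
for _ in range(N_USERS):
    # Quality of each interaction is pure noise, mean 0 by construction.
    early = [random.gauss(0, 1) for _ in range(EARLY)]
    if sum(early) / EARLY > 0:  # better-than-average start -> becomes an avid user
        later = [random.gauss(0, 1) for _ in range(LATER)]
        stayers_early.extend(early)
        stayers_later.extend(later)

early_mean = sum(stayers_early) / len(stayers_early)
later_mean = sum(stayers_later) / len(stayers_later)
print(f"avid users' early mean: {early_mean:+.3f}")  # clearly above 0
print(f"avid users' later mean: {later_mean:+.3f}")  # back near 0
```

Nothing about the model changed between the two phases, yet the surviving users see their average experience drop, simply because they were selected on a lucky start.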
One could try to re-run the benchmarks that were mentioned in the OpenAI paper, and see how they fare; but it's not unlikely that OpenAI themselves are also running those benchmarks, and making efforts to keep them from falling.
Probably the best thing to do would be to go back and find a large corpus of older GPT-4 interactions, attempt to re-create them, and have people do a blind comparison of which interaction was better. If the older recorded interactions consistently fare better, then it's likely that ongoing tweaks (whatever the nature of those tweaks) have reduced effectiveness.