The hype around machine learning has reached a point where every time someone sees a moderately challenging problem they leap to some nebulous "ML" solution.
No, we've long had the technology to solve most of this. Clip correspondence is content-based image retrieval across a database of keyframes. Matching framing and position is an image registration problem (feature correspondence). Color balance seems like an almost trivial problem if you've solved the other ones, because the information _is_ there -- just modify each channel to match the histogram in the lower-resolution image [1].
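To make the color step concrete, here is a minimal NumPy sketch of per-channel histogram matching. The function names `match_channel` and `match_histograms` are made up for this example, and it assumes both frames are H x W x C arrays that already show the same content (i.e. the retrieval and registration steps are done). The two frames do not need the same resolution, since only the value distributions are compared.

```python
import numpy as np


def match_channel(source, reference):
    """Remap `source` values so their distribution matches `reference`."""
    # Unique values, where each pixel falls among them, and how often each occurs.
    src_values, src_idx, src_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True)
    ref_values, ref_counts = np.unique(reference.ravel(), return_counts=True)

    # Empirical cumulative distributions of both channels.
    src_cdf = np.cumsum(src_counts) / source.size
    ref_cdf = np.cumsum(ref_counts) / reference.size

    # For each source quantile, look up the reference value at that quantile.
    mapped = np.interp(src_cdf, ref_cdf, ref_values)
    return mapped[src_idx].reshape(source.shape)


def match_histograms(source, reference):
    """Apply match_channel independently to each channel (last axis)."""
    channels = [
        match_channel(source[..., c], reference[..., c])
        for c in range(source.shape[-1])
    ]
    return np.stack(channels, axis=-1)
```

Here `source` would be the new high-resolution frame and `reference` the existing low-resolution one. The result is floating point, so it would still need rounding and clipping before being written back out as 8-bit video.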
The challenge is that earlier stages in the pipeline need to be robust to inexact matches, and we don't want to rely on absolute color or pixel position. But I don't think that should be insurmountable for a slightly creative implementor: use local variation in color, gradient descriptors, pull in the motion vectors for an additional channel, etc.
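As one possible reading of "gradient descriptors", here is a small NumPy sketch of a keyframe descriptor built from gradient orientations rather than absolute pixel values. Everything here is an assumption for illustration: the name `gradient_descriptor`, the 4x4 grid, and the 8 orientation bins are arbitrary choices, not a tested recipe. Because it only uses gradients and is normalized, it ignores a uniform brightness offset or gain, and the coarse grid gives some tolerance to small shifts in framing.

```python
import numpy as np


def gradient_descriptor(gray, grid=4, bins=8):
    """Grid of magnitude-weighted gradient-orientation histograms, L2-normalized."""
    gray = np.asarray(gray, dtype=float)
    gy, gx = np.gradient(gray)  # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)

    # Unsigned orientation in [0, pi), quantized into `bins` buckets.
    orientation = np.mod(np.arctan2(gy, gx), np.pi)
    bin_idx = np.minimum((orientation / np.pi * bins).astype(int), bins - 1)

    # Assign every pixel to one cell of a coarse grid x grid layout.
    h, w = gray.shape
    rows = np.arange(h) * grid // h
    cols = np.arange(w) * grid // w
    cell = rows[:, None] * grid + cols[None, :]

    # Accumulate gradient magnitude per (cell, orientation bin).
    flat = (cell * bins + bin_idx).ravel()
    desc = np.bincount(flat, weights=magnitude.ravel(),
                       minlength=grid * grid * bins)

    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc
```

Two keyframes can then be compared with a dot product of their descriptors (cosine similarity, since both are unit length), with values near 1 suggesting the same shot. A real pipeline would presumably want something sturdier against cropping and scaling, but the idea is the same.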
Sure, you could go down the deep learning path by trying to reduce scenes to bags of labeled objects and semantic actions but that's bringing a water cannon to a squirt gun fight, which only makes sense if Google is giving you a free water cannon.
I'd probably try it if I really thought there was a fortune to be made here, but it's such a niche application. When something hasn't been done yet, there's usually a good economic reason, unfortunately.
Cropping a video to match another, and adjusting the color balance to match, seems simple enough, compared to asking a computer to do it without a reference. In other words, "make X look like Y" as compared to "make X look good".
Of course, the quality won't be as good as something done by a professional, but since we don't have an HD DS9, the question is whether the version produced by an automated system is noticeably better than what we have.
I've never done it myself, but based on the cost of the machines professionals use and the cost of employing people to do the work (both doing it and making the judgement calls), you're probably looking at $200k for feature film quality.
If you're aiming for network TV quality you can probably do an episode for $5k though.
Of course that's from scratch. The trouble is that using the original video as your source will have lost a lot of data, and that means making a lot of judgement calls about what the scene is meant to be doing, so you're not much nearer.