It's the same thing: predict the next pixel or next token (just as you would with regular images), or infill masked tokens (MAE is particularly cool lately). Those objectives induce the abstractions and understanding that downstream tasks tap into.
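To make the MAE-style objective concrete, here's a minimal sketch: mask most of an image's patches, have a model infill them from the visible ones, and score reconstruction only on the masked patches. All names here are illustrative, and the "model" is a trivial mean predictor; a real MAE uses a ViT encoder/decoder.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(img, p):
    """Split an (H, W) image into non-overlapping p x p patches, flattened."""
    h, w = img.shape
    patches = img.reshape(h // p, p, w // p, p).swapaxes(1, 2)
    return patches.reshape(-1, p * p)  # (num_patches, p*p)

def mae_loss(img, predict, p=4, mask_ratio=0.75):
    """Mask a random subset of patches and score reconstruction on them only."""
    patches = patchify(img, p)
    n = patches.shape[0]
    masked = rng.choice(n, size=int(n * mask_ratio), replace=False)
    visible = np.setdiff1d(np.arange(n), masked)
    # The model sees only the visible patches and must infill the rest.
    recon = predict(patches[visible], masked, p)
    # MSE on the masked patches only, as in MAE.
    return float(np.mean((recon - patches[masked]) ** 2))

def mean_predictor(visible_patches, masked_idx, p):
    """Stand-in model: predict the mean visible patch for every masked slot."""
    return np.tile(visible_patches.mean(axis=0), (len(masked_idx), 1))

img = rng.normal(size=(32, 32))
print(mae_loss(img, mean_predictor))
```

The same shape works for "predict the next pixel/token": swap the random mask for a causal one and the MSE for a cross-entropy over a pixel/token vocabulary.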
It's incredibly hard to disambiguate and accurately label images using the reports (this is my area of research).
Reports are also not a substitute for ground-truth labels, and you don't always have histopathologic or clinical outcomes.
You also have drift in knowledge and patient trends: people are on immunotherapy now, and we are seeing complications and patterns we didn't see five years ago. A renal cyst that would have warranted follow-up to exclude malignancy before 2018 is now considered definitively benign, so those reports are not directly usable.
You would have to connect this, non-trivially, to some form of knowledge base to disambiguate, and that knowledge base doesn't currently exist.
And then there's hallucination.
Right now, if you could even just extract actionable findings, accurately summarize reports, and integrate that into the workflow, you could have a billion-dollar company.
Nuance (now owned by Microsoft) can't even accurately autofill my dictation template by mapping free text to subject headings.
What's the medical imaging equivalent to "predict the next word"?