Curious whether anyone has found more details about the architecture of "mistral-ocr-latest". I have two points I'd like to check:

1. I initially thought this was a VLM-based parsing model, until I saw that it can extract images. Now I assume it is a pipeline: an image-extraction step plus a VLM, with their outputs combined to produce the final result.

2. If that is the case, benchmarking the pipeline's output against an end-to-end VLM such as Gemini 2.0 Flash might not be an apples-to-apples comparison.