netdur on March 25, 2025 | on: Qwen2.5-VL-32B: Smarter and Lighter

My understanding is that in multimodal models, text and image vectors are aligned to the same semantic space. That alignment seems to be the main difference from text-only models.
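To make that concrete, here is a toy sketch of what "same semantic space" could look like. It assumes image features are mapped into the text embedding space by a linear projection, which is one common design and not necessarily what Qwen2.5-VL does. The dimensions, weights, embeddings, and names (`project`, `projection`, `patch_feature`) are all made up for illustration.

```python
# Toy illustration only: every number and name here is hypothetical.
import math

TEXT_DIM = 3  # width of the (made-up) language model embedding space


def project(image_feature, weight):
    """Linear map from vision-encoder space into the text embedding space."""
    return [sum(w * x for w, x in zip(row, image_feature)) for row in weight]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical text token embeddings, already in the model's space.
text_embeddings = {"cat": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}

# Hypothetical vision-encoder output for one image patch (4 dimensions).
patch_feature = [0.9, 0.1, 0.0, 0.2]

# Hypothetical learned projection with shape TEXT_DIM x 4.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

# After projection the patch has the same width as a text token embedding,
# so both can sit in one input sequence and be compared directly.
patch_embedding = project(patch_feature, projection)
sequence = [patch_embedding, text_embeddings["cat"]]

# Which text token is closest to the image patch in the shared space?
nearest = max(
    text_embeddings,
    key=lambda token: cosine(patch_embedding, text_embeddings[token]),
)
```

The point is only that once both modalities have the same width and a shared geometry, the rest of the model can treat image patches and text tokens uniformly. A text-only model has no such projection step.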