Multimodal is the only image generation modality that matters going forward. Flu...

zaptrem · on May 8, 2025

Good chance a future Llama will output image tokens.

echelon · on May 8, 2025

That's my hope: that Llama or Qwen brings multimodal image generation capabilities to open source so we're not left in the dark.

If that happens, then I'm sure we'll see slimmer multimodal models over the course of the next year or so, and teams like Black Forest Labs will make more focused and performant multimodal variants.

We need the incredible instructivity of multimodality. That's without question. But we also need to be able to fine-tune, use ControlNets to guide diffusion, and compose these into workflows.