yeldarb · 2026-06-23T12:17:23 1782217043

Suspect choice for the paper to only include a single DETR from 2022 in the headline pareto chart and claim to have "the strongest AP–latency trade-off"... Clearly the authors were aware of models that exceed theirs given they even mentioned some of them in the introduction.

> In parallel, DETR [5] cast detection as end-to-end set prediction, and its real-time descendants (RT-DETR [98], D-FINE [55], DEIM [21], RF-DETR [62]) have narrowed the accuracy gap with CNN based detectors on standard benchmarks.

yeldarb · 2026-06-23T11:32:50 1782214370

It’s a big improvement if you’re already paying them but, given their aggressive approach to licensing, I can’t imagine why anyone would choose to use an Ultralytics model on a new project in 2026. You’re just asking to be shaken down and have to pay off a large bill down the line.

RF-DETR is both faster and more accurate and truly open source with an Apache 2.0 license: https://github.com/roboflow/rf-detr

Full disclosure: I’m one of the co-founders of Roboflow (we made RF-DETR, wrote this blog post, and are a sub-licensor of Ultralytics’ models.)

MrGLaDOS · 2026-06-23T12:42:41 1782218561

“RF-DETR is both faster and more accurate and truly open source with an Apache 2.0 license”

Misleading marketing statement.

The catch is that for image resolutions >= 700x700 pixels (most production use cases), the Roboflow license is actually PML 1.0 instead of Apache 2.0: https://github.com/roboflow/rf-detr#license

yeldarb · 2026-06-24T04:29:03 1782275343

That may be true for legacy CNNs but very few production use-cases require such a large resolution with DETRs. The latency scales quadratically with the resolution.

Regardless, you can do whatever resolution you want with the Apache 2.0 model. Just change the config at runtime; it was trained to be resolution agnostic.

You are correct that we also released larger models with a larger backbone under a different, non open-source license.
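
A rough illustration of the quadratic scaling mentioned above, assuming a ViT-style backbone that cuts a square input into fixed-size patches. The patch size of 14 is an assumption for illustration (it is the patch size DINOv2 uses), not a statement about any specific RF-DETR config, and real latency also depends on attention cost and hardware.

```python
def patch_tokens(side_px: int, patch_px: int = 14) -> int:
    """Number of patch tokens for a square side_px x side_px input."""
    per_side = side_px // patch_px
    return per_side * per_side


def relative_tokens(side_px: int, baseline_px: int, patch_px: int = 14) -> float:
    """Token count at side_px relative to the token count at baseline_px."""
    return patch_tokens(side_px, patch_px) / patch_tokens(baseline_px, patch_px)


if __name__ == "__main__":
    # Doubling the side length quadruples the number of tokens.
    for side in (448, 560, 672, 896):
        print(side, patch_tokens(side), round(relative_tokens(side, 448), 2))
```

The token count is only a lower bound on how cost grows: any global self-attention over those tokens grows faster than linearly in the token count.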

krapht · 2026-06-23T13:01:42 1782219702

> The catch is that for image resolutions >=700x700pixels (most production usecases)

Citation needed? It looks like 2XL goes up to 800x800 pixel inputs. This isn't the dealbreaker you say it is: all pipelines benefit from thoughtful cropping and rescaling before inference.

MrGLaDOS · 2026-06-23T14:49:41 1782226181

See the URL in my comment (search for the term rfdetr-2xlarge). 2XL does indeed go up to 800x800 and has a PML 1.0 license instead of Apache 2.0.

Rescaling is fine for some purposes but not for all. For many domain-specific (often less common and oddly dimensioned) objects, downscaling will severely reduce recall. There is a reason that Roboflow slaps a license that is not open source on those specific architectures.

In some cases tiled inferencing (for example with https://github.com/obss/sahi ) might do the job.
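
A minimal sketch of the tiling idea, for readers who have not used it. This is not SAHI's API; `tile_boxes` and its parameters are hypothetical names, and the tile size and overlap are arbitrary illustrative values.

```python
def tile_boxes(width: int, height: int, tile: int = 640, overlap_px: int = 128):
    """Return (x0, y0, x1, y1) crops that cover the image with overlapping tiles."""
    if not 0 <= overlap_px < tile:
        raise ValueError("overlap_px must be in [0, tile)")
    stride = tile - overlap_px

    def starts(size: int):
        # An axis shorter than one tile needs a single crop.
        if size <= tile:
            return [0]
        offsets = list(range(0, size - tile, stride))
        offsets.append(size - tile)  # last tile sits flush with the edge
        return offsets

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]


if __name__ == "__main__":
    for box in tile_boxes(1920, 1080):
        print(box)
```

Each crop is run through the detector at native resolution. A real pipeline would also shift the detections back by each tile's offset and merge duplicates in the overlap regions.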

yeldarb · 2026-06-24T17:35:13 1782322513

> See the url in my comment (search for the term rfdetr-2xlarge). 2XL does indeed go up to 800x800 and has PML1.0 license instead of apache 2.0.

All of the models, including the Apache 2.0 ones, can be configured to go higher than 800x800. The difference between the ones with the PML license and the Apache 2.0 ones is the backbone, not the resolution.

I'd suggest you read the ICLR paper[1] which shows clearly the difference between the backbones at various latencies in Figure 1.

> For many domain-specific (often less common and odd dimensioned) objects, downscaling will severely reduce recall.

We released an entire paper[2] at NeurIPS about the long-tail transferability of models across a multitude of domains and evaluated RF-DETR on that benchmark. The Apache 2.0 model is Pareto-optimal over the larger PML model at latencies less than the XL size.

(I'm one of the co-founders of Roboflow and worked on RF-DETR and RF100-VL.)

[1] https://arxiv.org/abs/2511.09554 [2] https://arxiv.org/abs/2505.20612

yeldarb · 2025-11-20T04:26:04 1763612764

Yes, it should.

exe34 · 2025-11-20T07:39:28 1763624368

thanks!

yeldarb · 2025-11-19T20:38:24 1763584704

We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision. The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.

Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).

We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).

We also have a playground[5] up where you can play with the model and compare it to other VLMs.

[1] https://github.com/autodistill/autodistill

[2] https://blog.roboflow.com/sam3/

[3] https://rapid.roboflow.com

[4] https://github.com/roboflow/rf-detr

[5] https://playground.roboflow.com
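
The distillation idea described above (a large model labels data, a small model trains on it) can be sketched roughly as follows. This is not autodistill's actual API; `auto_label`, the `teacher` callable, and the label format are all hypothetical, and the confidence threshold is an arbitrary illustrative value.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Box = Tuple[float, float, float, float]
Prediction = Tuple[str, Box, float]  # (class_name, box, confidence)


def auto_label(
    images: Iterable[str],
    teacher: Callable[[str], List[Prediction]],
    min_conf: float = 0.5,
) -> Dict[str, List[Tuple[str, Box]]]:
    """Label images with a large teacher model, keeping only confident predictions."""
    dataset: Dict[str, List[Tuple[str, Box]]] = {}
    for name in images:
        kept = [(cls, box) for cls, box, conf in teacher(name) if conf >= min_conf]
        if kept:  # skip images where the teacher found nothing usable
            dataset[name] = kept
    return dataset


def fake_teacher(name: str) -> List[Prediction]:
    """Stand-in for a foundation model; returns canned predictions."""
    if name == "empty.jpg":
        return []
    return [("container", (0.0, 0.0, 10.0, 10.0), 0.9),
            ("container", (5.0, 5.0, 8.0, 8.0), 0.2)]


if __name__ == "__main__":
    print(auto_label(["a.jpg", "empty.jpg"], fake_teacher))
```

The resulting dictionary stands in for the training set that would then be used to fine-tune a small realtime model.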

sorenjan · 2025-11-19T22:55:16 1763592916

SAM3 is probably a great model to distill from when training smaller segmentation models, but isn't their DINOv2 a better example of a large foundation model to distill from for various computer vision tasks? I've seen it used as a starting point for models doing segmentation and depth estimation. Maybe there's a v3 coming soon?

https://dinov2.metademolab.com/

nsingh2 · 2025-11-19T23:16:20 1763594180

DINOv3 was released earlier this year: https://ai.meta.com/dinov3/

I'm not sure if the work they did with DINOv3 went into SAM3. I don't see any mention of it in the paper, though I just skimmed it.

yeldarb · 2025-11-20T04:22:50 1763612570

We used DINOv2 as the backbone of our RF-DETR model, which is SOTA on realtime object detection and segmentation: https://github.com/roboflow/rf-detr

It makes a great target to distill SAM3 to.

sorenjan · 2025-11-20T19:27:59 1763666879

> It makes a great target to distill SAM3 to.

Could you expand on that? Do you mean you're starting with the pretrained DINO model and then using SAM3 to generate training data to make DINO into a segmentation model? Do you freeze the DINO weights and add a small adapter at the end to turn its output into segmentations?

dangoodmanUT · 2025-11-19T21:01:32 1763586092

I was trying to figure out from their examples, but how are you breaking up the different "things" that you can detect in the image? Are you just running it with each prompt individually?

rocauc · 2025-11-19T21:09:03 1763586543

The model supports batch inference, so all prompts are sent to the model, and we parse the results.

mchusma · 2025-11-19T23:47:24 1763596044

Thanks for the links! Can we run rf-detr in the browser for background removal? This wasn't clear to me from the docs.

yeldarb · 2025-11-20T04:24:16 1763612656

We have a JS SDK that supports RF-DETR: https://docs.roboflow.com/deploy/sdks/web-browser

yeldarb · 2025-11-19T16:19:27 1763569167

We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision.

The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.

Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).

We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).

We also have a playground[5] up where you can play with the model and compare it to other VLMs.

[1] https://github.com/autodistill/autodistill

[2] https://blog.roboflow.com/sam3/

[3] https://rapid.roboflow.com

[4] https://github.com/roboflow/rf-detr

[5] https://playground.roboflow.com

yeldarb · 2025-08-04T09:38:04 1754300284

Wonder if you can subtract these vectors to get the opposite effect and what that ends up being for things like sycophancy or hallucination.

I also wonder what other personality vectors exist. It would be cool to find an “intelligence” vector we could boost to get better outputs from the same model. Seems like this is likely to exist given how prompting it to cosplay as a really smart person can elicit better outputs.
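
A toy sketch of what "subtracting" such a vector would mean, assuming the vectors are directions in activation space that get added to a hidden state with a scalar coefficient. The function names and numbers are hypothetical, and this says nothing about what behavior the steered model would actually show.

```python
from typing import List


def steer(hidden: List[float], direction: List[float], alpha: float) -> List[float]:
    """Return hidden + alpha * direction; a negative alpha subtracts the direction."""
    if len(hidden) != len(direction):
        raise ValueError("hidden and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: List[float], direction: List[float]) -> float:
    """Scalar projection of hidden onto the (normalised) direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(h * d for h, d in zip(hidden, direction)) / norm


if __name__ == "__main__":
    hidden = [1.0, 2.0, 0.0]
    trait = [0.0, 1.0, 0.0]
    print(projection(hidden, trait))                      # before steering
    print(projection(steer(hidden, trait, -3.0), trait))  # after subtracting
```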

yeldarb · on June 1, 2025

Is this a new product or a marketing page tying together a bunch of the existing MediaPipe stuff into a narrative?

Got really excited then realized I couldn’t figure out what “Google AI Edge” actually _is_.

Edit: I think it’s largely a rebrand of this from a couple years ago: https://developers.googleblog.com/en/introducing-mediapipe-s...

yeldarb · on Jan 23, 2025

Hey all, sharing a project we made in 2 hours at the Vercel+NVIDIA hackathon last week.

While the app is cool, the thing that blew my mind is that the entire app was coded by Vercel's v0 agent. In other words: I did not write a single line of code to create the app (though my teammate did write the backend scraper & DB filler by hand).

[1] Writeup: https://blog.roboflow.com/nycerebro/

[2] Repo (including the generated code + initial meaty prompts): https://github.com/yeldarby/nycerebro

[3] v0 session: https://v0.dev/chat/nyc-erebro-app-RwzRUEMGveH?b=b_6AuWalvG7...

yeldarb · on Jan 23, 2025

I've been reflecting a bit on this and remembering what it used to be like when I did hackathons regularly a decade or so ago. This project seems on-par with the type of 48 hour hackathon project I used to do (assuming CLIP had existed), but now I was able to do it in 2 hours instead of 48.

I can't imagine someone non-technical building something like this with prompting. The success of the project was highly dependent on my direction of the model to do what I wanted it to do (even though I gave it leeway in exactly how to do it). It did feel a bit like managing another engineer to do something vs doing it myself.

I don't use agents like this in my day to day work yet (I experimented with OpenHands a couple of months ago but it was frustrating, expensive, and took just as long as doing the task myself). But I'm thinking I probably will be a year from now.

A few times when the model got stuck I copy/pasted some stuff into o1 and pasted its response back into v0 (felt kind of like "escalating" to a more senior engineer) and that helped it get unstuck. Future models will be even more capable than o1. I imagine there will likely need to be a UI for "bringing in the big guns" of a smarter model in the future even if the grunt-work is done by a fast+cheap base model.

There's probably also something to letting the model "speak its native tongue". I don't know next.js but letting the model work with patterns it's been trained on probably helped it be more effective (compared to having OpenHands work in my own codebase using a structure it's unfamiliar with).

yeldarb · on Nov 24, 2024

Is there any Docker alternative on Mac that can utilize the MPS device in a container? ML stuff is many times slower in a container on my Mac than running outside

habitue · on Nov 24, 2024

The issue you're running into is that to run Docker on a Mac, you have to run it in a VM. Docker is fundamentally a Linux technology, so first you emulate x86_64 Linux, then run the container. That's going to be slow.

There are native macOS containers, but they aren't very popular.

AbuAssar · on Nov 24, 2024

Docker can run ARM64 linux kernel, no need to emulate x86
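
One quick way to tell which case applies is to check the architecture from inside the container. A small sketch using only the standard library; `describe_arch` is a hypothetical helper name.

```python
import platform


def describe_arch(machine: str) -> str:
    """Normalise a platform.machine() string to 'arm64', 'x86_64', or 'other'."""
    m = machine.lower()
    if m in ("arm64", "aarch64"):
        return "arm64"
    if m in ("x86_64", "amd64"):
        return "x86_64"
    return "other"


if __name__ == "__main__":
    machine = platform.machine()
    print(machine, "->", describe_arch(machine))
```

On an Apple silicon host, an `x86_64` result inside the container suggests the image is running under emulation, while `arm64` means only the VM layer is involved.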

majormajor · on Nov 24, 2024

You still pay the VM penalty, though it's a lot less bad than it used to be. And the Arm MacBooks are fast enough that IME they generally compare well against Intel Linux laptops even so. But it sounds like first-class GPU access (not too surprisingly) isn't there yet.

fl0id · on Nov 24, 2024

Podman-Desktop can do it

yeldarb · on Nov 17, 2024

More context from Jeremy Howard (fast.ai): https://x.com/jeremyphoward/status/1857765905188651456