Meta Segment Anything Model 3 (ai.meta.com)
127 points by lukeinator42 5 hours ago | 28 comments
daemonologist 42 minutes ago | parent | next [-]

First impressions are that this model is extremely good - the "zero-shot" text prompted detection is a huge step ahead of what we've seen before (both compared to older zero-shot detection models and to recent general purpose VLMs like Gemini and Qwen). With human supervision I think it's even at the point of being a useful teacher model.

I put together a YOLO tune for climbing hold detection a while back (trained on 10k labels) and this is 90% as good out of the box - just misses some foot chips and low contrast wood holds, and can't handle as many instances. It would've saved me a huge amount of manual annotation though.
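The teacher-model workflow described above boils down to converting the big model's detections into training labels for a small model. As a minimal sketch (the detection tuple and class IDs here are hypothetical placeholders, not SAM 3's actual output format), here is the pixel-box-to-YOLO-label conversion that such a pipeline would run per detection:

```python
# Sketch of the distillation step: convert a teacher model's pixel-space
# bounding box into a YOLO-format label line (class, then normalized
# center x/y and width/height). Box format and class IDs are assumed.

def to_yolo_line(box, class_id, img_w, img_h):
    """Convert an (x1, y1, x2, y2) pixel box to a normalized YOLO label line."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / img_w   # normalized box center x
    yc = (y1 + y2) / 2 / img_h   # normalized box center y
    w = (x2 - x1) / img_w        # normalized width
    h = (y2 - y1) / img_h        # normalized height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# Example: one teacher detection of a climbing hold in a 640x480 frame.
line = to_yolo_line((100, 200, 140, 260), class_id=0, img_w=640, img_h=480)
```

Writing one such line per detection into a `.txt` file alongside each image is all a YOLO-style trainer needs, which is why a strong zero-shot teacher can replace so much manual annotation.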

gs17 an hour ago | parent | prev | next [-]

The 3D mesh generator is really cool too: https://ai.meta.com/sam3d/ It's not perfect, but it seems to handle occlusion very well (e.g. a person in a chair can be separated into a person mesh and a chair mesh) and it's very fast.

Animats an hour ago | parent [-]

It's very impressive. Do they let you export a 3D mesh, though? I was only able to export a video. Do you have to buy tokens or something to export?

modeless 4 minutes ago | parent | next [-]

The model is open weights, so you can run it yourself.

TheAtomic 23 minutes ago | parent | prev [-]

I couldn't download it. The model appears to be comparable to Sparc3D, Hunyuan, etc., but without a download, who can say? It is much faster, though.

clueless an hour ago | parent | prev | next [-]

With an average latency of 4 seconds, this still couldn't be used for real-time video, correct?

Etheryte 16 minutes ago | parent [-]

I didn't see where you got those numbers, but surely that's just a matter of throwing more compute at it? From the blog post:

> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.
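Taking the quoted figure at face value, the arithmetic works out to real-time rates; the ~4-second figure presumably reflects hosted-demo round-trip latency rather than raw inference time:

```python
# Back-of-the-envelope check of the real-time claim: at the quoted
# 30 ms per image (H200), single-stream throughput is ~33 fps,
# above the ~24-30 fps of typical video.
latency_s = 0.030        # quoted per-image inference time
fps = 1 / latency_s      # frames per second achievable at that latency
print(round(fps, 1))
```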

hodgehog11 an hour ago | parent | prev | next [-]

This is an incredible model. But once again, we find an announcement for a new AI model with highly misleading graphs. That SA-Co Gold graph is particularly bad. Looks like I have another bad graph example for my introductory stats course...

bangaladore 19 minutes ago | parent | prev | next [-]

Probably still can't get past a Google Captcha when on a VPN. Do I click the square with the shoe of the person who's riding the motorcycle?

rocauc an hour ago | parent | prev | next [-]

A brief history:

SAM 1 - Visual prompt to create pixel-perfect masks in an image. No video. No class names. No open vocabulary.

SAM 2 - Visual prompting for tracking on images and video. No open vocab.

SAM 3 - Open vocab concept segmentation on images and video.

Roboflow has been long on zero / few shot concept segmentation. We've opened up a research preview exploring a SAM 3 native direction for creating your own model: https://rapid.roboflow.com/

yeldarb 2 hours ago | parent | prev | next [-]

We (Roboflow) have had early access to this model for the past few weeks. It's really, really good. This feels like a seminal moment for computer vision. I think there's a real possibility this launch goes down in history as "the GPT Moment" for vision. The two areas I think this model is going to be transformative in the immediate term are for rapid prototyping and distillation.

Two years ago we released autodistill[1], an open source framework that uses large foundation models to create training data for training small realtime models. I'm convinced the idea was right, but too early; there wasn't a big model good enough to be worth distilling from back then. SAM3 is finally that model (and will be available in Autodistill today).

We are also taking a big bet on SAM3 and have built it into Roboflow as an integral part of the entire build and deploy pipeline[2], including a brand new product called Rapid[3], which reimagines the computer vision pipeline in a SAM3 world. It feels really magical to go from an unlabeled video to a fine-tuned realtime segmentation model with minimal human intervention in just a few minutes (and we rushed the release of our new SOTA realtime segmentation model[4] last week because it's the perfect lightweight complement to the large & powerful SAM3).

We also have a playground[5] up where you can play with the model and compare it to other VLMs.

[1] https://github.com/autodistill/autodistill

[2] https://blog.roboflow.com/sam3/

[3] https://rapid.roboflow.com

[4] https://github.com/roboflow/rf-detr

[5] https://playground.roboflow.com

dangoodmanUT an hour ago | parent [-]

I was trying to figure out from their examples, but how are you breaking up the different "things" that you can detect in the image? Are you just running it with each prompt individually?

rocauc an hour ago | parent [-]

The model supports batch inference, so all prompts are sent to the model in one call, and we parse the results back out per prompt.
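The parsing side of that approach is straightforward: the batched call returns detections for every prompt, which then get regrouped per concept. A minimal sketch, assuming a flat list of detection dicts (the `prompt`/`box`/`score` structure here is a hypothetical stand-in, not the real API's schema):

```python
# Sketch: send all text prompts in one batch, then group the flat
# detection list back by prompt so each concept gets its own results.

def group_by_prompt(detections):
    """Group a flat list of {'prompt', 'box', 'score'} dicts by prompt."""
    grouped = {}
    for det in detections:
        grouped.setdefault(det["prompt"], []).append(det)
    return grouped

# Example with detections returned from a hypothetical batched call:
results = group_by_prompt([
    {"prompt": "person", "box": (10, 10, 50, 90), "score": 0.97},
    {"prompt": "dog", "box": (60, 40, 110, 80), "score": 0.91},
    {"prompt": "person", "box": (120, 5, 160, 95), "score": 0.88},
])
```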

xfeeefeee an hour ago | parent | prev | next [-]

I can't wait until rotoscoping / greenscreening / masking this stuff out of videos is easily accessible. I tried Runway ML, but it was... lacking, and the web UI for fixing parts of it had similar issues.

I'm curious how this works for hair and transparent/translucent things. Probably not the best, but it doesn't seem to be mentioned anywhere. Presumably the mask is a hard polygon/vector boundary rather than an alpha matte?

rocauc an hour ago | parent | next [-]

I tried it on transparent glass mugs, and it does pretty well. At least better than other available models: https://i.imgur.com/OBfx9JY.png

Curious if you find interesting results - https://playground.roboflow.com

nodja an hour ago | parent | prev [-]

I'm pretty sure DaVinci Resolve does this already; you can even track it. I don't know if it's available in the free version.

fzysingularity 2 hours ago | parent | prev | next [-]

SAM3 is cool - you can already do this more interactively on chat.vlm.run [1], and do much more. It's built on our new Orion [2] model; we've been able to integrate with SAM and several other computer-vision models in a truly composable manner. Video segmentation and tracking is also coming soon!

[1] https://chat.vlm.run

[2] https://vlm.run/orion

visioninmyblood an hour ago | parent [-]

Wow this is actually pretty cool, I was able to segment out the people and dog in the same chat. https://chat.vlm.run/chat/cba92d77-36cf-4f7e-b5ea-b703e612ea...

fzysingularity 8 minutes ago | parent [-]

Nice, that's pretty neat.

HowardStark an hour ago | parent | prev | next [-]

Curious if anyone has done anything meaningful with SAM2 and streaming. SAM3 has built-in streaming support which is very exciting.

I’ve seen versions where people use an in-memory FS to write frames of the stream for SAM 2. Maybe that is good enough?
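The in-memory-FS trick can be sketched roughly as follows: write incoming frames as files into a RAM-backed directory (e.g. `/dev/shm` on Linux) so a predictor that scans a frame directory can consume them without disk I/O. This is a minimal illustration with dummy frame bytes, not SAM 2's actual loading code:

```python
# Sketch: buffer stream frames into a RAM-backed directory so a
# frame-directory-based video predictor can read them without disk I/O.
import os
import tempfile

# Use /dev/shm (tmpfs) when available; fall back to the default tmp dir.
shm = "/dev/shm" if os.path.isdir("/dev/shm") else None
frame_dir = tempfile.mkdtemp(prefix="frames_", dir=shm)

# Dummy JPEG-ish byte blobs standing in for encoded stream frames.
for i, frame_bytes in enumerate([b"\xff\xd8frame0", b"\xff\xd8frame1"]):
    # Zero-padded names keep frames sorted in arrival order.
    with open(os.path.join(frame_dir, f"{i:05d}.jpg"), "wb") as f:
        f.write(frame_bytes)

frames = sorted(os.listdir(frame_dir))  # what the predictor would scan
```

Whether this is "good enough" mostly comes down to the file-write and re-decode overhead per frame versus handing tensors to the model directly.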

dangoodmanUT an hour ago | parent | prev | next [-]

This model is incredibly impressive. Text is definitely the right modality, and now the ability to intertwine it with an LLM creates insane unlocks - my mind is already storming with ideas of projects that are now not only possible, but trivial.

sciencesama an hour ago | parent | prev | next [-]

Does the license allow for commercial purposes?

rocauc an hour ago | parent | next [-]

Yes. It's a custom license that permits commercial use, with an Acceptable Use Policy that prohibits military use and imposes export restrictions.

visioninmyblood an hour ago | parent | prev | next [-]

I just checked, and it seems to be commercially permissible. Companies like vlm.run and Roboflow are using it for commercial purposes, as shown by their comments elsewhere in the thread. So I guess it can be used commercially.

rocauc an hour ago | parent [-]

Yes. But also note that redistribution of SAM 3 requires using the same SAM 3 license downstream. So libraries that attempt to, e.g., relicense the model as AGPL are non-compliant.

colesantiago an hour ago | parent | prev [-]

Yes, the license allows you to grift for your “AI startup”

exe34 11 minutes ago | parent | prev | next [-]

Can anyone confirm whether this fits on a 3090? The files look to be about 3.5 GB, but I can't work out what the overall memory needs will be.

foota 5 minutes ago | parent | prev [-]

Obligatory xkcd: https://xkcd.com/1425/