If this model is so good at estimating depth from a single image, shouldn't it also be able to take multiple images as input and estimate even better? But after searching a bit, it looks like this is supposed to be single image to 3D only. I don't understand why it does not (cannot?) work with multiple images.

milleramp a day ago | parent | next [-]

It's using Apple's SHARP method, which is monocular. https://apple.github.io/ml-sharp/

MillionOClock a day ago | parent | prev | next [-]

I also feel like a heavily multimodal model could be very nice for this: allow multiple images from various angles, optionally some true depth data even if imperfect (like what a basic phone LIDAR would output), why not even photos of the same place from other sources at other times (just to gather more data), and based on that generate a 3D scene you can explore, using generative AI to fill in whatever is missing with plausible content.

voodooEntity a day ago | parent | prev | next [-]

If you have multiple images you could use photogrammetry.

In the end, if you want to "fill in the blanks", an LLM will always "make up" stuff, based on all of its training data.

With a technology like photogrammetry you can get much better results. Therefore, if you have images from multiple angles and don't really need to make stuff up, it's better to use that.

	TeMPOraL a day ago | parent | next [-]
		You could use both. Photogrammetry requires you to have a lot of additional information, and/or to make a lot of assumptions (e.g. about the camera, specific lens properties, medium properties, material composition and properties, etc. - and what a reasonable range of values is in context), if you want it to work well for general cases, as otherwise the problem you're solving is underspecified. In practice, even enumerating those assumptions is a huge task, much less defending them. That's why photogrammetry applications tend to be used for solving very specific problems in select domains. ML models, on the other hand, are, in a big way, intuitive assumption machines. Through training, they learn what's likely and what's not, given both the input measurements and the state of the world. They bake in knowledge of what kinds of cameras exist, what kinds of measurements are being made, and what results make sense in the real world. In the past I'd have said that for best results, we should combine the two approaches - have AI supply assumptions and estimates for an otherwise explicitly formal photogrammetric approach. Today, I'm no longer convinced that's the case - because relative to the fuzzy world-modeling part, the actual math seems trivial and well within the capabilities of ML models to do correctly. The last few years have demonstrated that ML models are capable of internally modeling calculations and executing them, so I now feel it's more likely that a sufficiently trained model will just do the photogrammetry calculations internally. See also: the Bitter Lesson.
	esafak a day ago | parent | prev [-]
		Surely this is not an LLM?

shrinks99 a day ago | parent | prev | next [-]

I'm going to guess this is because the image-to-depth data, while good, is not perfectly accurate and therefore cannot serve as a shared ground truth between multiple images. At that point what you want is a more traditional structure-from-motion workflow, which already exists and does a decent job.
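
To make the multi-view idea concrete: one basic geometric step in structure from motion is triangulation, where a 3D point is recovered from where two viewing rays (nearly) cross. Below is a minimal Python sketch of the midpoint method. It assumes the camera centers and ray directions are already known, which a real pipeline would first have to estimate from feature matches; the function name `triangulate_midpoint` is made up for illustration and is not from SHARP or any particular library.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two viewing rays.

    c1, c2: camera centers (3-tuples); d1, d2: ray directions (3-tuples).
    Returns the midpoint of the shortest segment connecting the two rays.
    """
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    # Vector between the ray origins.
    w = [p - q for p, q in zip(c1, c2)]

    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)

    denom = a * c - b * b
    if abs(denom) < 1e-12:
        # Parallel rays never converge, so depth is undefined.
        raise ValueError("rays are parallel; cannot triangulate")

    # Ray parameters minimizing |(c1 + s*d1) - (c2 + t*d2)|^2.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom

    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Two cameras one unit apart, both looking at the point (1, 2, 5).
point = triangulate_midpoint(
    (0.0, 0.0, 0.0), (1.0, 2.0, 5.0),
    (1.0, 0.0, 0.0), (0.0, 2.0, 5.0),
)
```

Note that the depth here comes from the known baseline between the two cameras rather than from a learned prior, which is part of why multi-view pipelines look so different from a monocular model.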

SequoiaHope a day ago | parent | prev | next [-]

Multi-view approaches tend to have a very different pipeline.

echelon a day ago | parent | prev [-]

Also, are we allowed to use this model? Apple had a very restrictive licence, IIRC?

	godelski 17 hours ago | parent [-]
		https://github.com/apple/ml-sharp/blob/main/LICENSE