> The insight driving the program, Naga said, is that the limiting factor for AV development is no longer the underlying technology. “The bottleneck is data,” he said. “[Companies like Waymo] need to go around and collect the data, collect different scenarios. You may be able to say: in San Francisco, ‘At this school intersection, I want some data at this time of day so I can train my models.’ The problem for all these companies is access to that data, because they don’t have the capital to deploy the cars and go collect all this information.”

You can’t be the CTO of Uber wanting to do AVs, and get the data collection requirement shockingly wrong.

Waymo’s bottleneck has never been data. When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate: https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-f...

Waymo is able to deploy with less (but targeted and high quality) data collection by having world class simulation capabilities. Not that they haven't collected huge amounts of data as it's no doubt important (I've heard their onboard storage is transferred and emptied every few days), it's just not a bottleneck. They have the most efficient operation in the AV industry.

The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.

KaiserPro 32 minutes ago | parent | next [-]

> The best example of why data collection isn’t the bottleneck is Tesla.

Exactly. Plus, any delivery company or dashcam company can provide a bunch of data wherever there is a sizeable population.

About 8 years ago, that data would have been really valuable, but now it's nice to have at best.

The only thing that is valuable is the breadth of different cars, but even then it's not that much of a differentiator.

simmonmt 2 hours ago | parent | prev | next [-]

> When they want data about a school intersection in SF at a certain time of day, they just... synthetically generate it and simulate

I think it's more about detecting changes to the world. You need boots on the ground, so to speak, to see that new speed limit sign or the new lane paint. The Waymo vehicle can no doubt react to changes in the world when it encounters them, relaying them back to the mothership, but it's better to know about them in advance.

ra7 an hour ago | parent | next [-]

Most AVs, definitely Waymo vehicles, are self-mapping. They can detect environment changes and relay them to the entire fleet. That's because they map using the same vehicles as the fleet.

MagicMoonlight an hour ago | parent | prev | next [-]

That’s dumb then. It shows it’s just brute force rather than AI.

A human doesn’t need to be shown every single road that exists in order to drive.

ThunderSizzle 34 minutes ago | parent [-]

Just a bunch of sophisticated if statements, I guess.

delfinom an hour ago | parent | prev [-]

> You need boots on the ground, so to speak, to see that new speed limit sign or the new lane paint.

It'll shock you to know that you can simply get this from governments; some even provide it in API form.

dmd an hour ago | parent | next [-]

It probably won't shock you to know that those sources of data can be months to even years delayed from what's actually out in the world.

KaiserPro 31 minutes ago | parent | prev | next [-]

No visual data, though; you need picture data for that. Companies like NC tech do it for something like $1m a city, or thereabouts.

paganel an hour ago | parent | prev [-]

> or the new lane paint.

I'd be surprised if this is a thing outside the biggest US (and European, for that matter) cities. Judging from Google StreetView, there are lots of streets in US cities/towns with almost no paint lines at all.

msm_ 41 minutes ago | parent [-]

Do you mean in the API? I live in a European country and I don't think I've ever seen an asphalt road without paint lines. This varies a lot between countries though.

ThunderSizzle 31 minutes ago | parent [-]

Many American roads don't have lines. Residential roads, parking lots, and many business driveways have limited markings. Then there are roads with just the center line markers and no road shoulder markings. Then there's a whole class of roads with lines painted over "demarked" old lines that weren't demarked well, or fading lines that should've been repainted a long time ago. I'm surprised you've never seen a non-perfect road?

suddenexample 2 hours ago | parent | prev | next [-]

Yeah I'm not so sure this CTO is on the mark here, but to be fair, I do think some of this IRL long tail/edge case data is important for Waymo. The simulation software is super interesting to me - the real world can be so chaotic, and even if they could generate every possible real life case, there needs to be validation on whether the Waymo driver is responding in the optimal way. They certainly haven't solved this problem, you can see some of their growing pains in all of these articles - floods in Austin, more and more interactions with emergency vehicles that first responders seem to believe are getting worse, etc.

Tesla, on the other hand, has billions of miles of data, yet because there is a limit to camera-only techniques, that data isn't that useful, is it? They have no ground truth data to evaluate their camera system on, which is why sometimes you see those Teslas driving around with lidar rigs mounted on them. Going camera-only is just asking for trouble.

ra7 2 hours ago | parent [-]

I agree real world data is important for Waymo. I didn't mean to say it wasn't, so I've edited my comment to reflect that. It's just that data is not some magic bullet to achieve self driving like Tesla and others suggest.

Of course, Waymo still has much more room for improvement. But it's much more efficient to supplement less but higher quality IRL data with large amounts of synthetic data, than to run a million data collection vehicles 24x7 because most IRL data is boring and useless.

Waymo said 6 years ago they simulate 20 million miles every single day [1]. Clearly, it's working for them given their scale of deployment right now.

[1] https://waymo.com/blog/2020/04/off-road-but-not-offline--sim...

skybrian an hour ago | parent [-]

Although most of the real-world data is probably boring, collecting more of it should make rare edge cases more likely to turn up. But since they happen rarely, I imagine that after discovering them, they would then need to figure out how to simulate them.

Sardtok an hour ago | parent | prev | next [-]

The biggest difference is that Uber has vehicles around the world. So there's more data from countries with different rules from the US. Signage is definitely different between the US and Europe.

cogman10 2 hours ago | parent | prev | next [-]

> The best example of why data collection isn’t the bottleneck is Tesla. They boast about billions of miles of data, yet they’re struggling to put out fully autonomous vehicles.

Well, TBF, the Tesla data was complete garbage with earlier vehicles. They had cheap and somewhat bad cameras in the earlier vehicles that were only somewhat recently updated. And even then, I don't think Tesla is at the end of their hardware journey. I think they don't think that either, which is why they've gone to a subscription-only model for self-driving vehicles.

Waymo, on the other hand, has gathered less data, but higher-quality data. They do the expensive mapping of a city, which is a big part of why their vehicles have been able to do some pretty impressive feats early on. The drawback is that getting that high-quality data takes a lot of time and resources.

kibwen 29 minutes ago | parent [-]

> And even then, I don't think Tesla is at the end of their hardware journey.

I dunno about that. Tesla seems completely adrift, pretending to pivot with random forays into humanoid robotics or whatever, to the point that I wouldn't be surprised if they exited the consumer vehicle space altogether within the next decade. They have no answer for Chinese competitors.

cogman10 21 minutes ago | parent [-]

Well, let me rephrase: Tesla's previously stated goals around self-driving cars can't be completed with the current hardware.

bobro an hour ago | parent | prev | next [-]

I find the idea of learning from simulated data so unintuitive. How can you radically improve your model with just your model? I take it people do it, so it must work, but I just don’t understand it at all.

ainch 4 minutes ago | parent | next [-]

They're two different models: you can use the world model to train (or test, like Wayve) a different car-driving model. The world model is basically intended as a more true-to-life simulator.

anon84873628 29 minutes ago | parent | prev | next [-]

I think people are skipping over the fact that Google has had cars driving around taking photos for 20 years. I imagine that was used to build the world model in the first place.

ianm218 an hour ago | parent | prev [-]

Well, there's a world simulation model and then the driving model. You can imagine improving, e.g., a specialized math model (problem in, theorem out) with a normal LLM that knows lots of problems and theorems generally.

gcheong 2 hours ago | parent | prev | next [-]

Didn't they need the data from the 200 million miles or so from actual driving before they could get to the generative model though? Data isn't everything, as you point out with Tesla (mainly because they decided to forgo using lidar, it would seem), but it is pretty fundamental.

ra7 2 hours ago | parent | next [-]

IIRC, they had clocked 20 million real world miles before starting to scale their deployment. But they were also driving 20 million miles in the simulator every day: https://waymo.com/blog/2020/04/off-road-but-not-offline--sim...

ninjagoo an hour ago | parent | prev [-]

> before they could get to the generative model though?

Is that the right kind of model for this particular application?

whiplash451 2 hours ago | parent | prev | next [-]

Waymo might very well be missing specific kinds of data (e.g. more incidents/accidents, near-collisions, etc.)

Also, Uber’s data might be useful for eval, not training (e.g. "here is how Waymo would behave vs human drivers, therefore it is safer")

ra7 an hour ago | parent [-]

> Waymo might very well be missing specific kinds of data (e.g more incidents/accidents, near-collisions etc)

Accidents and near-collisions are exactly the kind of scenarios perfect for simulation. You don't test them out in the real world and risk injuries/deaths. You need to have confidence they're handled before you deploy.

pishpash 37 minutes ago | parent [-]

Again, how do you know you've handled it correctly without ground truth? Simulation without ground truth is a garbage in, garbage out situation.

cyanydeez 10 minutes ago | parent | prev [-]

Yes, the way to make these things safer is to make up data and simulate on that.

Do you hear yourself?

ra7 7 minutes ago | parent [-]

That’s literally how it works right now, so yeah.