Why do you say it's not included? Why wouldn't they include it?

If every photo in Street View were included in the training data of a multimodal LLM, it would be like 99.9999% of the training data/resource costs.

It just isn't plausible that anyone has actually done that. I'm sure some people include a small sample of them, though.

bluefirebrand a day ago | parent | next [-]

Why would every photo in streetview be required in order to have Geoguessr's dataset in the training data?

bee_rider a day ago | parent [-]

I’m pretty sure they are saying that Geoguessr just pulls directly from Google Street View. There isn’t a separate Geoguessr dataset; it just pulls from Google’s API (at least that’s what Wikipedia says).

bluefirebrand a day ago | parent [-]

I suspect that Geoguessr's dataset is a subset of Google Street View, but maybe it really is just pulling everything directly.

bee_rider a day ago | parent [-]

My guess would be that they pull directly from street-view, maybe with some extra filtering for interesting locations. Why bother to create a copy, if it can be avoided, right?

clbrmbr 21 hours ago | parent | prev [-]

Yet.

This is a good rebuttal when someone quips that we “are about to run out of data”. There’s oh so much more, just not in the form of books and blogs.