sebzim4500 a day ago

If every photo in Street View were included in the training data of a multimodal LLM, it would account for something like 99.9999% of the training data and resource costs.

It just isn't plausible that anyone has actually done that. I'm sure some people include a small sample of them, though.
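
To make that percentage concrete, here is a minimal sketch of what a 99.9999% share implies. It is pure arithmetic on the figure quoted above, not a measurement of any real dataset; the function name `dominance_ratio` is made up for illustration.

```python
from fractions import Fraction


def dominance_ratio(share: Fraction) -> Fraction:
    """Units of Street View data per unit of all other data,
    if Street View makes up `share` of the total."""
    return share / (1 - share)


# 99.9999% expressed as an exact fraction to avoid float rounding.
ratio = dominance_ratio(Fraction("0.999999"))
print(ratio)  # 999999
```

In other words, a 99.9999% share means the Street View photos would outweigh everything else in the training set by about a million to one.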

bluefirebrand a day ago | parent | next [-]

Why would every photo in streetview be required in order to have Geoguessr's dataset in the training data?

bee_rider a day ago | parent [-]

I’m pretty sure they are saying that Geoguessr just pulls directly from Google Street View. There isn’t a separate Geoguessr dataset; it just pulls from Google’s API (at least that’s what Wikipedia says).

bluefirebrand a day ago | parent [-]

I suspect that Geoguessr's dataset is a subset of Google Street View, but maybe it really is just pulling everything directly.

bee_rider a day ago | parent [-]

My guess would be that they pull directly from Street View, maybe with some extra filtering for interesting locations.

Why bother to create a copy, if it can be avoided, right?

clbrmbr 21 hours ago | parent | prev [-]

Yet.

This is a good rebuttal when someone quips that we “are about to run out of data”. There’s oh so much more, just not in the form of books and blogs.