giamma 3 days ago

Love your project!

Maybe you've already considered this, but there are a number of collection libraries out there that are optimized for holding Java primitives and/or very large sets of data, which could save you significant memory. Eclipse Collections [0] and Fastutil [1] come to mind first, but there are many others [2].
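As a rough illustration of the difference (a sketch, assuming fastutil is on the classpath; the class name is from its longs package):

    import it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap;
    import java.util.HashMap;
    import java.util.Map;

    // Boxed java.util version: every entry allocates a Long key, a Long value
    // and a HashMap.Node, easily 50+ bytes per entry.
    Map<Long, Long> boxed = new HashMap<>();
    boxed.put(123_456_789L, 42L);

    // Primitive fastutil version: keys and values live in plain long[] arrays,
    // roughly 16 bytes per entry plus load-factor overhead, and no boxing.
    Long2LongOpenHashMap primitive = new Long2LongOpenHashMap();
    primitive.put(123_456_789L, 42L);
    long v = primitive.get(123_456_789L);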

[0] https://github.com/eclipse-collections/eclipse-collections [1] https://fastutil.di.unimi.it/ [2] https://github.com/carrotsearch/hppc/blob/master/ALTERNATIVE...

allg12 2 days ago

Thank you! I'm a total Java noob - actually, this is the first project where I've written any Java code (had to slightly modify the Graphhopper source code to suit my needs). Those libraries look very interesting. I'm saving this post for another battle with processing many GBs of OSM data :D

karussell 2 days ago

We already use carrotsearch internally, so you could replace the java.util classes like HashMap and ArrayList with it to reduce memory usage a bit. But it won't help much. E.g. growable data structures (in any standard library, btw) usually double their internal array at some point when they run out of space and then copy from the old array to the new one, which means you briefly need roughly 3x the current size. If that happens near the end of the import process, you have a problem. For that reason we developed DataAccess (in-memory or MMAP possible), which is basically a large List but 1. grows only segment by segment and 2. allows more than 2 billion items (beyond the signed int limit).
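To make the segment idea concrete, here is a minimal sketch (not the actual GraphHopper DataAccess API, just the growth strategy): grow by fixed-size segments so a resize never needs the old and the new array at the same time, and use long indices to get past the ~2 billion entry limit of a single array.

    import java.util.ArrayList;

    class SegmentedLongArray {
        // 1M longs per segment = 8 MB; the real segment size is a tuning knob.
        private static final int SEGMENT_SIZE = 1 << 20;
        private final ArrayList<long[]> segments = new ArrayList<>();

        void ensureCapacity(long index) {
            // Allocate one new segment at a time instead of doubling,
            // so peak memory stays close to the actual data size.
            while ((long) segments.size() * SEGMENT_SIZE <= index) {
                segments.add(new long[SEGMENT_SIZE]);
            }
        }

        void set(long index, long value) {
            ensureCapacity(index);
            segments.get((int) (index / SEGMENT_SIZE))[(int) (index % SEGMENT_SIZE)] = value;
        }

        long get(long index) {
            return segments.get((int) (index / SEGMENT_SIZE))[(int) (index % SEGMENT_SIZE)];
        }
    }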

Another trick for planet-size data structures could be to use a List instead of a Map, with the OSM ID as the index. The memory overhead of a Map compared to a List is huge (and you could use DataAccess for the List), and the OSM IDs for the planet are nearly adjacent, or at least don't have that many gaps (as those gaps are refilled, I think).
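A sketch of that trick, reusing the segmented array above (names are made up; the point is that a gap in the ID space only costs one empty slot, which is far cheaper than the per-entry overhead of a hash map):

    // Instead of Map<Long, Long> osmIdToNodeIndex, index directly by OSM ID.
    SegmentedLongArray osmIdToNodeIndex = new SegmentedLongArray();
    long osmId = 9_876_543_210L;               // planet node IDs exceed int range
    osmIdToNodeIndex.set(osmId, 42L);          // 42 = some internal node index
    long nodeIndex = osmIdToNodeIndex.get(osmId);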

All these tricks (there are more!) are rather fiddly and low level, but necessary for memory efficiency. A simpler way for your use case could be to just use a database for that, like MapDB or sqlite. But this might be (a lot) slower compared to in-memory stuff.
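For the database route, a minimal MapDB sketch could look like this (assuming MapDB 3.x on the classpath; the file and map names are made up):

    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.HTreeMap;
    import org.mapdb.Serializer;

    // File-backed, memory-mapped map: data spills to disk instead of
    // living on the heap, at the cost of slower lookups than in-memory.
    DB db = DBMaker.fileDB("osm-nodes.db").fileMmapEnable().make();
    HTreeMap<Long, Long> nodes = db
        .hashMap("nodes", Serializer.LONG, Serializer.LONG)
        .createOrOpen();
    nodes.put(123_456_789L, 42L);
    db.close();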