KaiserPro 4 hours ago

I worked in HPC adjacent fields for a while (up until 40gig ethernet was cheap enough to roll out to all the edge nodes)

There are a couple of big things that are difficult to get your head around:

1) when and where to dispatch and split jobs (ie what's the setup cost of spinning up n binaries on n machines vs threading on y machines?)

2) data exchange primitives. Shared file systems have quirks, and they differ from system to system. But most of the time it's better/easier/faster to dump shit to a file system than some fancy database/object store. Until it's not. Distributed queues are great, unless you're using them wrong. Most of the time you need to use them wrong. (shared-memory RPC is a whole other beast that fortunately I've never had to work with directly.)
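One pattern that makes "dump it to the file system" survive many concurrent writers is write-to-temp-then-rename, so readers never see a half-written file. A minimal sketch (paths are invented; this relies on POSIX rename being atomic within a single filesystem, which is why the temp file goes in the destination directory):

```python
import os
import tempfile

def atomic_write(path: str, data: bytes) -> None:
    """Write data to path so readers see either the old file or the
    complete new one, never a partial write."""
    dirpath = os.path.dirname(path) or "."
    # Temp file must live on the same filesystem as the target,
    # otherwise the final rename is a copy, not an atomic swap.
    fd, tmp = tempfile.mkstemp(dir=dirpath)
    try:
        os.write(fd, data)
        os.fsync(fd)          # flush to disk before publishing
    finally:
        os.close(fd)
    os.replace(tmp, path)     # atomic rename on POSIX (same filesystem)
```

Note the quirks point still applies: some shared file systems (NFS with certain mount options, for instance) weaken these guarantees, which is exactly the sort of per-system behaviour you have to check.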

3) dealing with odd failures. As the number of parallel jobs increases, the chance of getting at least one failure approaches 1. You need to bake in failure modes at the very start.
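The arithmetic behind that is worth internalising: with independent per-job failure probability p, the chance of at least one failure across n jobs is 1 - (1-p)^n, and it climbs fast.

```python
def p_any_failure(p: float, n: int) -> float:
    """Probability that at least one of n independent jobs fails,
    given each fails independently with probability p."""
    return 1 - (1 - p) ** n

# A 0.1% per-job failure rate looks harmless, until you run 10,000 jobs:
# p_any_failure(0.001, 10_000) is roughly 0.99995 -- near-certain failure
# somewhere in the batch, hence retries/checkpoints from day one.
```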

4) loading/saving data is often a bottleneck; lots of efficiency comes from being clever about what you load, and _where_ you load it. (ie you have data affinity, which might be location based or topology based, and you often don't have control over where your stuff is placed.)
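A toy illustration of data affinity: given a map of which nodes hold a copy of each input chunk, prefer a data-local node and fall back to a remote read only when no copy is nearby. The replica map and node names here are invented for the example; real schedulers (and the placement itself) are far less under your control, which is the point above.

```python
def place(chunk: str, replicas: dict[str, list[str]], nodes: list[str]) -> str:
    """Pick a node for a task that reads `chunk`: prefer a node that
    already holds a replica, otherwise accept a remote read."""
    local = [n for n in replicas.get(chunk, []) if n in nodes]
    return local[0] if local else nodes[0]

replicas = {"chunk-a": ["node2", "node3"]}
nodes = ["node1", "node2", "node3"]
# place("chunk-a", replicas, nodes) -> "node2" (data-local, no transfer)
# place("chunk-b", replicas, nodes) -> "node1" (no replica; remote read)
```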