▲ riffraff 3 hours ago
Perhaps one issue the article doesn't discuss is how easy it is to find devs? I've never worked in HPC, but it seems it should be relatively easy to find a C/C++ dev who can pick up OpenMP, or one who already knows it, compared to hiring people who know Chapel. The "scaling down" factor (how easy or interesting it is to use tool X for small jobs) seems like a disadvantage of HPC-only languages, which creates a barrier to entry and shrinks the available workforce.
▲ kinow 2 hours ago
I think HPC devs need an extra set of skills that are not so common: parallel file systems, batch schedulers, NUMA, InfiniBand, and probably some domain-specific knowledge for the apps they will develop. That knowledge also tends to be niche, like climate modelling, earthquake simulation, lidar data processing, and so on. And even knowing OpenMP or MPI may not suffice if the site uses older versions or heterogeneous approaches with CUDA, FPGAs, etc. Knowing the language and the shared/distributed memory libs helps, but if your project needs a new senior dev then they may be a bit hard to find (although the popularity of the company/HPC site, salary, and location also play a role).
▲ KaiserPro an hour ago
I worked in HPC-adjacent fields for a while (up until 40-gig Ethernet was cheap enough to roll out to all the edge nodes). There are a few big things that are difficult to get your head around:

1) When and where to dispatch and split jobs (i.e. what's the setup cost of spinning up n binaries on n machines vs threading on y machines).

2) Data exchange primitives. Shared file systems have quirks, and they differ from system to system. But most of the time it's better/easier/faster to dump shit to a file system than to some fancy database/object store. Until it's not. Distributed queues are great, unless you're using them wrong. Most of the time you need to use them wrong. (Shared-memory RPC is a whole other beast that fortunately I've never had to work with directly.)

3) Dealing with odd failures. As the number of parallel jobs increases, the chance of getting at least one failure approaches 1. You need to bake in failure handling at the very start.

4) Loading/saving data is often a bottleneck; a lot of efficiency comes from being clever about what you load and _where_ you load it (i.e. you have data affinity, which might be location-based or topology-based, and you often don't have control over where your stuff is placed).