shihab 6 hours ago
To be practically useful, we don't need to beat vendors; getting close would be enough, by virtue of being open-source (and often portable). But I found, as an example, PETSc to be ~10x slower than MKL on CPU and than CUDA on GPU; it still doesn't have native shared-memory parallelism support on CPU, etc.
bee_rider 6 hours ago
Oh dang, thanks for the heads up. I was looking at them for the “next version” of my code. The lack of “blas/lapack/sparse equivalents that can dispatch to GPU or CPU” is really annoying. You’d think this would be somewhat “easy” (lol, nothing is easy), in the sense that we’ve got a bunch of big chunky operations…
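A minimal sketch of the dispatch idea above, assuming the Python array-API convention: look up the module an array came from and call the operation through it, so NumPy arrays run on CPU and a library like CuPy (if installed), which mirrors NumPy's function names on its own `ndarray` type, would run the same call on GPU. The `gemm` wrapper name is illustrative, not from any library:

```python
import sys
import numpy as np

def gemm(a, b):
    # Dispatch on the array's origin module: "numpy" for np.ndarray (CPU);
    # a cupy.ndarray would resolve to "cupy" (GPU) with the same signature.
    xp = sys.modules[type(a).__module__.split(".")[0]]
    return xp.matmul(a, b)

a = np.arange(6.0).reshape(2, 3)
b = np.arange(6.0).reshape(3, 2)
c = gemm(a, b)  # runs on CPU here, since the inputs are NumPy arrays
```

This "array type carries its backend" pattern is roughly what the Python array API standard formalizes; it works precisely because the operations are big chunky ones (GEMM, factorizations) where a per-call dispatch check costs nothing relative to the kernel.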