aengelke | a day ago
> Also there are processors with larger vector length

How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

> Do you have more examples of this?

I ported some numeric simulation kernel to the A64FX some time ago; fixing the vector width gave a 2x improvement. Compilers probably/hopefully have gotten better in the meantime and I haven't redone the experiments since then, but I'd be surprised if this changed drastically. Spilling is sometimes unavoidable, e.g. due to function calls.

> Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve

I was specifically referring to dynamic vector sizes. This experiment uses sizes fixed at compile time; from the article:

> For the astcenc implementation of SVE I decided to implement a fixed-width 256-bit implementation, where the vector length is known at compile time.
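(A plausible mechanism behind that 2x, sketched in plain C rather than the original kernel, which I don't have: when the vector width is a compile-time constant, the inner loop has a constant trip count, so the compiler can fully unroll it and keep everything in registers; with a runtime width it must keep the loop and bounds checks. The `axpy` kernel and `VL = 8` here are illustrative assumptions, not the actual A64FX code.)

```c
#include <stddef.h>

/* Vector length known only at runtime (VLA style): the compiler has to
 * preserve the inner loop and re-test its bound on every iteration. */
void axpy_vla(float *y, const float *x, float a, size_t n, size_t vl) {
    for (size_t i = 0; i < n; i += vl)
        for (size_t j = i; j < i + vl && j < n; j++)
            y[j] += a * x[j];
}

/* Vector length fixed at compile time (VLS style): the inner loop has a
 * constant trip count of VL, so it can be fully unrolled/vectorized;
 * leftover elements are handled by a scalar tail loop. */
#define VL 8  /* e.g. 256-bit SVE holds 8 floats */
void axpy_vls(float *y, const float *x, float a, size_t n) {
    size_t i = 0;
    for (; i + VL <= n; i += VL)
        for (size_t j = 0; j < VL; j++)  /* constant bound */
            y[i + j] += a * x[i + j];
    for (; i < n; i++)                    /* scalar tail */
        y[i] += a * x[i];
}
```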
camel-cdr | 21 hours ago
> How do these fare in terms of absolute performance? The NEC TSUBASA is not a CPU.

The NEC is an attached accelerator, but IIRC it can run an OS in host mode. It's hard to tell how the others perform, because most don't have hardware available yet, or only they and partner companies have access. It's also hard to compare because they don't target the desktop market.

> I ported some numeric simulation kernel to the A64FX some time ago; fixing the vector width gave a 2x improvement.

Oh, wow. Was this autovectorized or handwritten intrinsics/assembly? Any chance it's of a small enough scope that I could try to recreate it?

> I was specifically referring to dynamic vector sizes.

Ah, sorry, yes, you are correct. It still shows that supporting VLA mechanisms in an ISA doesn't mean it's slower for fixed-size usage. I'm not aware of any proper VLA vs VLS comparisons. I once benchmarked a VLA vs a VLS mandelbrot implementation and saw no performance difference, but that's too simple an example to draw conclusions from.
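(For context, the reason mandelbrot is friendly to both styles: each iteration applies the same vector ops to a strip of pixels, with a per-lane "still iterating" mask standing in for SVE/RVV predicates, and nothing in the loop structure depends on whether the strip width `vl` is a runtime value or a constant. This is a scalar-C model of that pattern, not my actual benchmark; the lane cap of 64 is an arbitrary assumption.)

```c
#include <stddef.h>
#include <stdbool.h>

/* VLA-style mandelbrot strip: iterates z = z^2 + c for `vl` pixels at
 * once. `active[]` models a predicate register: lanes that escape
 * (|z|^2 > 4) are switched off and record their iteration count.
 * A VLS version would simply make `vl` a compile-time constant. */
void mandel_strip(int *iters, const float *cx, const float *cy,
                  size_t vl, int max_iter) {
    float zx[64] = {0}, zy[64] = {0};   /* assumes vl <= 64 */
    bool active[64];
    for (size_t l = 0; l < vl; l++) { active[l] = true; iters[l] = max_iter; }

    for (int it = 0; it < max_iter; it++) {
        bool any = false;
        for (size_t l = 0; l < vl; l++) {   /* one "vector op" per lane */
            if (!active[l]) continue;
            float x2 = zx[l] * zx[l], y2 = zy[l] * zy[l];
            if (x2 + y2 > 4.0f) {           /* lane escapes: mask it off */
                active[l] = false;
                iters[l] = it;
                continue;
            }
            zy[l] = 2.0f * zx[l] * zy[l] + cy[l];
            zx[l] = x2 - y2 + cx[l];
            any = true;
        }
        if (!any) break;                    /* all lanes inactive */
    }
}
```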