honestly the AMD-first bit surprised me - usually ROCm support is an afterthought or just broken outright.

curious about BVH traversal specifically. dynamic dispatch patterns across GPU backends can get weird fast. did KernelAbstractions hold up there or were there vendor-specific fallbacks needed for the heavier acceleration structure work?

simondanisch 6 hours ago | parent [-]

Well, I'm a bit of an AMD "fanboy" and really dislike NVIDIA's vendor lock-in. I'm not sure what you mean by dynamic dispatch across GPU backends - nothing should be dynamic there, and most of the simpler primitives map quite nicely between vendors (e.g. local memory, work groups etc). To be honest, the BVH/TLAS has been pretty simple in comparison to the wavefront infrastructure. We haven't done anything fancy yet, but the performance is still really good. I'm sure there are still lots of things we can do to improve it, but so far I've concentrated on getting something usable out. Right now, we're mostly matching pbrt-v4 performance, but I couldn't compare to their NVIDIA-only GPU acceleration without an NVIDIA GPU. I can just say that the performance is MUCH better than what I initially aimed for, and it feels as usable as some of the state-of-the-art renderers I've been using. A 1:1 comparison is still missing though, since it's hard to do a good one without comparing apples to oranges (even mapping materials and light types from one renderer to another is not trivial).

the_harpia_io 6 hours ago | parent [-]

pbrt-v4 parity is a solid baseline - that codebase already leans hard on NVIDIA so a fair comparison was always going to be messy. surprised wavefront was the harder bit though, i'd have expected BVH tuning to be the nightmare.

simondanisch 5 hours ago | parent [-]

To be fair, I was surprised too. But I made a relatively simple, straight port from the AMD rays SDK plus some input from the pbrt-v4 CPU BVH code, and it just worked relatively well out of the box... This is the main intersection function, which is quite simple: https://github.com/JuliaGeometry/Raycore.jl/blob/sd/multityp... I'm not even using local memory, since it was already fast enough ;) But I think we can still do quite a lot: large parts of the construction code are still very messy, and I want to polish and modularize the code over time.
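
For anyone who hasn't seen the pattern, here is a rough sketch of what a simple stack-based BVH closest-hit traversal usually looks like. This is NOT the Raycore.jl code and not a port of it: the names (`Node`, `hit_aabb`, `closest_hit`) are made up for illustration, it's plain Python instead of a GPU kernel, and the leaves just hold boxes where a real renderer would run a ray-triangle test.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    # Hypothetical flat node layout, for illustration only.
    lo: Vec3          # AABB min corner
    hi: Vec3          # AABB max corner
    left: int = -1    # child indices into the node list; -1 marks a leaf
    right: int = -1
    prim: int = -1    # primitive id stored in a leaf

def hit_aabb(origin: Vec3, inv_dir: Vec3, lo: Vec3, hi: Vec3, t_max: float):
    """Slab test: return the entry distance, or None if the ray misses the box."""
    t0, t1 = 0.0, t_max
    for a in range(3):
        ta = (lo[a] - origin[a]) * inv_dir[a]
        tb = (hi[a] - origin[a]) * inv_dir[a]
        if ta > tb:
            ta, tb = tb, ta
        t0 = max(t0, ta)
        t1 = min(t1, tb)
        if t0 > t1:
            return None
    return t0

def closest_hit(nodes: List[Node], origin: Vec3, direction: Vec3):
    """Iterative traversal with an explicit stack; returns (prim id, distance)."""
    inv_dir = tuple(1.0 / d if d != 0.0 else math.inf for d in direction)
    best_t, best_prim = math.inf, -1
    stack = [0]  # start at the root
    while stack:
        node = nodes[stack.pop()]
        # Passing best_t as t_max skips subtrees that start behind the current hit.
        t = hit_aabb(origin, inv_dir, node.lo, node.hi, best_t)
        if t is None:
            continue
        if node.left < 0:
            # Leaf: assumed here to be the box itself; swap in a real primitive test.
            if t < best_t:
                best_t, best_prim = t, node.prim
            continue
        stack.append(node.left)
        stack.append(node.right)
    return best_prim, best_t
```

The explicit stack stands in for recursion, which is the usual reason this style ports to GPU kernels without much trouble. A tuned version would typically visit the nearer child first, but the sketch skips that.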

the_harpia_io 4 hours ago | parent [-]

makes sense honestly - straight port from a solid SDK beats reimplementing everything from scratch. local memory optimization is one of those rabbit holes anyway. construction code being messy is just that stage of the project