yakiimo02 at March 30th, 2010 10:00 — #1
Description Hi, I wrote a DX11 DirectCompute implementation of a Buddhabrot/Nebulabrot fractal renderer. The submitted picture is a Nebulabrot with the max iteration values set to red: 10,000, green: 1,000, blue: 100. I rendered the above image (the original was 1592×1028) at around 14 fps (140,000 samples a second) for around 35 minutes.
On my HD5750, with the above max iteration values, I actually get frame rates around 42 fps, but I didn't change the default yielding behavior of DXUT when the application becomes inactive, so the above render was performed at 14 fps (doh!).
I wrote a CPU implementation (non-SIMD, single-threaded) as well, and my DirectCompute implementation is around 4-6 times faster than the CPU version. My CPU is an Intel Core 2 Quad Q6600 at 2.4 GHz (not overclocked). I had earlier written a Mandelbrot DirectCompute implementation, and that was 50+ times faster than the CPU. Since the Buddhabrot is more complex than the Mandelbrot, I guess reduced performance is to be expected. My guess is that the extensive scattered global memory writes of a Buddhabrot implementation are what slow down the DirectCompute version.
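To illustrate why the Buddhabrot scatters its writes where the Mandelbrot does not, here is a minimal CPU-side sketch in Python. It is not the DirectCompute source from the post; the function names, the viewport bounds, and the escape radius of 2 are assumptions chosen for illustration. A Mandelbrot renderer writes one value to the pixel it is computing, whereas a Buddhabrot sample increments a histogram cell for every point its orbit visits, so the write targets jump all over the image.

```python
import math

def escaping_orbit(c, max_iter):
    """Iterate z -> z*z + c. Return the visited points if c escapes, else []."""
    z = 0j
    orbit = []
    for _ in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:          # assumed escape radius
            return orbit
        orbit.append(z)
    return []                      # never escaped: contributes nothing

def to_pixel(z, width, height,
             re_min=-2.0, re_max=1.0, im_min=-1.5, im_max=1.5):
    """Map a complex point to (x, y), or None if it is outside the viewport."""
    x = math.floor((z.real - re_min) / (re_max - re_min) * width)
    y = math.floor((z.imag - im_min) / (im_max - im_min) * height)
    if 0 <= x < width and 0 <= y < height:
        return x, y
    return None

def accumulate_sample(hist, c, max_iter, width, height):
    """Scatter one sample's orbit into the histogram; return the write count."""
    writes = 0
    for z in escaping_orbit(c, max_iter):
        p = to_pixel(z, width, height)
        if p is not None:
            x, y = p
            hist[y * width + x] += 1   # the scattered write
            writes += 1
    return writes

WIDTH, HEIGHT = 64, 64
hist = [0] * (WIDTH * HEIGHT)
writes = accumulate_sample(hist, 0.5 + 0j, 100, WIDTH, HEIGHT)
```

On a GPU, each `hist[...] += 1` would become a write to a shared output buffer at an address that depends on the orbit, which is a plausible reason for the smaller speedup, though the sketch itself does not measure anything.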
For more details about my implementation (source & binary provided) see my blog post:
reedbeta at March 30th, 2010 12:41 — #2
Very cool! I've been meaning to spend some time checking out this compute shader stuff...
poita at March 30th, 2010 23:04 — #3
Have you tried writing it in a good ol' pixel shader for comparison?
roel at March 31st, 2010 04:50 — #4
Cool indeed! And subscribed to your blog
yakiimo02 at March 31st, 2010 10:05 — #5
Hi everyone. Thanks for the comments!
@Reedbeta Yeah, compute shaders seem pretty cool. Not everything, but lots of cool stuff can be sped up with them. Personally, I want to try a GI path tracer, fluid dynamics, and post effects in the future (those seem to be what other people have had success with so far).
@poita The Buddhabrot algorithm requires a lot of random scattered writes. The above Nebulabrot has an iteration max of 10,000, so in the worst case, 9,999+999+99 scattered writes to the output UAV buffer are going to occur in a single compute shader thread, and there are 10,000 threads executing in parallel. (The total thread count of 10,000 is not related to the iteration max. It's just a coincidence that I have 100 thread groups, each with 100 threads, for 10,000 total.) I think it would be hard and unnatural to implement in a pixel shader.
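The worst-case numbers above can be worked out with a few lines of Python. This is only back-of-the-envelope arithmetic using the figures quoted in this thread; the variable names are made up for illustration, and the "writes per dispatch" figure is an upper bound that assumes every thread hits the worst case at once, which will not happen in practice.

```python
# Per-channel iteration limits quoted in the original post.
MAX_ITER = {"red": 10_000, "green": 1_000, "blue": 100}

# Dispatch shape mentioned above: 100 thread groups of 100 threads each.
THREAD_GROUPS = 100
THREADS_PER_GROUP = 100

def worst_case_writes_per_thread(max_iter):
    """Each channel can contribute at most (limit - 1) scattered writes."""
    return sum(limit - 1 for limit in max_iter.values())

writes_per_thread = worst_case_writes_per_thread(MAX_ITER)   # 9,999 + 999 + 99
total_threads = THREAD_GROUPS * THREADS_PER_GROUP
upper_bound_writes = writes_per_thread * total_threads

print(writes_per_thread, total_threads, upper_bound_writes)
```

A pixel shader normally writes only to the pixel it was invoked for, so thousands of writes to arbitrary locations from a single invocation is the part that does not map naturally.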
@roel Thanks for subscribing!