I didn't have time to react here yet, but now...
first of all there is a huge difference in compiling OpenCL source (respectively CUDA) and shaders. Especially when the compiler has to take atomics, loops, huge branches, etc. into account. When you write same codes in *same* language (it's technically different), and compile it with 2 different compilers (both with different level of optimization), then you get 2 different resulting speeds, on single machine.
Also shader languages are more limited and they do some heavy optimizations (sometimes even "destructive"), that just can't be done in OpenCL/CUDA, because some1 might need that feature (in shader languages you say that specific features are unavailable)
Next thing - if your code is the same and there is more than 100 times difference, try very simple code - writing just some data to the texture (adding 2 textures together or so - in CUDA adding 2 buffers and show last buffer (texture) on screen through D3D/GL and CL/GL CL/D3D (respectively CUDA/GL CUDA/D3D interop). The timings should be similar (shader should be a bit faster though, but no more than like max. 5 times, probably u get around 2 times or so). If it is still 100 times slower, there is something going terribly wrong (bad workgroup sizes, etc. - F.e. you use just 1/16th of all processing units available on GPU or so).
i think i have to assign threads in a different way, actually im not sure how the blocks, threads and grids even work. how would you
assign threads for a raytracer?
I tried 2 approaches in recent time.
1st (faster on my AMDs in KD trees, not yet tried in BVHs) - use 1 single quite big CL kernel to create rays, traverse them through KD trees and test them against triangles. Quite fast, I get dozen(s) of MRays/s.
2nd - smaller CL kernels to generate rays, traverse them and shade. A bit slower (5% or so), but those kernels look more human that 1 big beast (with heavy use of #defines and other magic to be the fastest).
I also should try persistent threads that we work with in lab (but I haven't implemented them to my framework yet).
EDIT: I have slept a bit, and got some little time - so few hints on the code that might speed it up a bit (if you have already fixed the loop to be faster, these things might increase performance a little bit more):
// First of all, processing chars and integers on GPU is not as effective as using float
// so it's recommended to pass float4* d_output instead of uchar4* - it will be faster and a lot
// Note that here you probably should be writing directly to texture and not to buffer (that has to be
// read back if I remember correctly), this wastes a lot of time. You basically run compute kernel
// wait till it finishes, read back data it generates, and then write them back to GPU memory.
// Fix me if I'm wrong here, and if you're actually working with texture. I'm used to work with OpenCL
// and there you have buffers and textures separately (and from those few examples of CUDA I've tried
// long time ago I'm still thinking that it works the same way in CUDA).
// Second thing, pasting image dimensions as uints are also waste, you're just dividing by
// these values converted to floating point. Pass directly floats invImageW
// where invImageW = 1.0f / imageW, same goes for imageH
// and pass integer image width separately
__global__ void d_render(uchar4 *d_output, uint imageW, uint imageH, float lod)
uint x = blockIdx.x * blockDim.x + threadIdx.x;
uint y = blockIdx.y * blockDim.y + threadIdx.y;
float u = x / (float) imageW;
float v = y / (float) imageH;
// There is normalize in cuda, i can't say whether it's faster, but most likely compiler friendlier
// than this - so dir = normalize(dir); - it won't be slower than this, and most likely it'll be even faster
// I think that Stainless said everything here, the loop is basically done wrong (in terms of performance)
// Again, normalize, also don't write to color by components, it probably won't be done in single mov, but in multiple ones
// It would be better to write directly to texture here, but well, if you want it in buffer, then as I mentioned, use float4.
uint i = y * imageW + x;
d_output[i]=to_uchar4(color * 255.0);