I've been doing a CUDA project for a client, and I'm really surprised by the mess it is.
I've now got my head around a lot of the stuff that was confusing me, but I'm still amazed by what's missing.
I want to use CUDA for image compositing, so I am reading in values from multiple textures and outputting a single texture.
The pixel handling routines seem to have been excluded from CUDA.
I assumed that since CUDA runs in the GPU, we would be able to do things like
uchar4 pixel3 = pixel1 + pixel2;
uchar4 pixel4 = pixel3 * colour;
But it doesn't exist. I haven't tried
float4 pixel3 = pixel1 + pixel2;
float4 pixel4 = pixel3 * colour;
But I don't hold out much hope that it exists.
Anybody else looked at CUDA in detail?
Have I missed something? Are there pixel handling routines hidden away in the API somewhere?
CUDA is a GPGPU (general-purpose computing on GPUs) language, so expect no graphics-related concepts in there. In CUDA you need to copy your image data into GPU buffers (uchar4, float4; it depends on your source data type). Then write a kernel that, for the current thread, loads the data from all the images, combines them, and writes the output. Finally, read the data back from the GPU to the CPU.
To know what data to load you need to use the current thread and block indices: threadIdx & blockIdx and their sizes: gridDim (blocks in grid) & blockDim (threads in block).
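A minimal sketch of that copy, kernel, copy-back flow, assuming two same-sized RGBA images in tightly packed uchar4 buffers. The names average_pixels, composite_kernel and composite_on_gpu are made up for illustration, and the plain average is only a placeholder for whatever combine step you actually need.

```cuda
#include <cuda_runtime.h>

// Hypothetical combine step: plain average of two RGBA pixels.
__host__ __device__ inline uchar4 average_pixels(uchar4 a, uchar4 b)
{
    return make_uchar4((a.x + b.x) / 2, (a.y + b.y) / 2,
                       (a.z + b.z) / 2, (a.w + b.w) / 2);
}

// One thread per output pixel; the pixel coordinates come from the
// block index, the block size and the thread index within the block.
__global__ void composite_kernel(const uchar4* src_a, const uchar4* src_b,
                                 uchar4* dst, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // grid may overhang the image
    int i = y * width + x;
    dst[i] = average_pixels(src_a[i], src_b[i]);
}

// Host side: copy both inputs to the GPU, run the kernel, copy the result back.
cudaError_t composite_on_gpu(const uchar4* host_a, const uchar4* host_b,
                             uchar4* host_out, int width, int height)
{
    size_t bytes = size_t(width) * size_t(height) * sizeof(uchar4);
    uchar4 *dev_a = nullptr, *dev_b = nullptr, *dev_out = nullptr;

    cudaError_t err = cudaMalloc(&dev_a, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_b, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dev_out, bytes);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev_a, host_a, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(dev_b, host_b, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x,
                  (height + block.y - 1) / block.y);
        composite_kernel<<<grid, block>>>(dev_a, dev_b, dev_out, width, height);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(host_out, dev_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_out);
    return err;
}
```

The 16x16 block size is an arbitrary choice here; the bounds check in the kernel is what makes it safe when the image size is not a multiple of the block size.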
Look at the examples here: http://docs.nvidia.com/cuda/cuda-c-programming-guide/
Look at the cutil_math header. It has not been part of CUDA since version 5.0, so you'll have to either look at helper_math.h in the new SDK or download the older cutil_math.h, e.g. from http://code.google.com/p/cudpp/source/browse/trunk/common/inc/cutil_math.h?r=157
Note that you can also write your own math functions for packed format data types.
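For example, something along these lines would make the `pixel1 + pixel2` and `pixel3 * colour` lines from the first post compile for uchar4. This is only a sketch under my own assumptions: the add saturates at 255, the multiply treats 255 as 1.0, and clamp_to_byte is a made-up helper name.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: clamp an int into the 0..255 range of a byte.
__host__ __device__ inline unsigned char clamp_to_byte(int v)
{
    return static_cast<unsigned char>(v < 0 ? 0 : (v > 255 ? 255 : v));
}

// Component-wise add that saturates at 255 instead of wrapping around.
__host__ __device__ inline uchar4 operator+(uchar4 a, uchar4 b)
{
    return make_uchar4(clamp_to_byte(a.x + b.x), clamp_to_byte(a.y + b.y),
                       clamp_to_byte(a.z + b.z), clamp_to_byte(a.w + b.w));
}

// Component-wise modulate: 255 acts as 1.0, rounded to the nearest byte.
__host__ __device__ inline uchar4 operator*(uchar4 a, uchar4 c)
{
    return make_uchar4(clamp_to_byte((a.x * c.x + 127) / 255),
                       clamp_to_byte((a.y * c.y + 127) / 255),
                       clamp_to_byte((a.z * c.z + 127) / 255),
                       clamp_to_byte((a.w * c.w + 127) / 255));
}
```

Marking the operators `__host__ __device__` means the same code can be checked on the CPU before it goes anywhere near a kernel.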
Yes, I've been doing that.
Combining uchar4 values myself is easy enough. It just seems ridiculous to me that the hardware supports something useful, but the software does not expose it to the user.
I'll have a look at the new helper_math.h; I haven't looked in there yet.
It doesn't support uchar4, as that one can't be directly loaded and processed within a register on the GPU/CPU (from the hardware side, at least afaik*). GPUs and CPUs generally work with either 32-bit or 64-bit registers, and their vector types are generally packed 32-bit or packed 64-bit data types (either floating point or integral, like SSE/AVX on the CPU). That's probably why they don't expose the functionality directly. I also wonder whether the GPU allows vector execution on uchar (if someone has information on this, please share).
Anyway for image processing I strongly recommend using floats (or doubles, if high precision is critical), instead of uchar value types.
Edit: *Although, when I think about this, the driver might use one of several implementations that could process these in a high-performance manner:
Using a 32-bit register to store the whole uchar4, although I'm sure that would introduce some masking (which might slow down processing a bit).
Processing 4 times using an 8-bit control register is also possible, although that could be slower than the previous idea (e.g. it could also introduce some masking).
It is also possible that the driver just uses a SIMD register to store a single uchar4 in such a manner that: |---A|---B|---C|---D| And so it will ignore the first 3 bytes in each 32-bit register. Of course it is wasting some performance this way, but in the end, this might be one of the fastest and easiest solutions. Of course, if the driver were doing this, one could use whole 32-bit integers (int4) or floating points (float4) to do the same thing with more precision. The only disadvantage would be that textures stored as float4 use four times more memory than textures stored as uchar4.
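To make the first idea concrete, here is a sketch of what "a whole uchar4 in one 32-bit register, plus masking" can look like when written by hand. It is an illustration of the masking trick only, not a claim about what any driver or compiler actually generates; the function names are made up, and the add wraps around per byte rather than saturating.

```cuda
#include <cuda_runtime.h>

// Pack the four bytes into one 32-bit word, x in the lowest byte.
__host__ __device__ inline unsigned int pack_uchar4(uchar4 p)
{
    return (unsigned int)p.x | ((unsigned int)p.y << 8) |
           ((unsigned int)p.z << 16) | ((unsigned int)p.w << 24);
}

__host__ __device__ inline uchar4 unpack_uchar4(unsigned int v)
{
    return make_uchar4((unsigned char)(v & 0xFFu),
                       (unsigned char)((v >> 8) & 0xFFu),
                       (unsigned char)((v >> 16) & 0xFFu),
                       (unsigned char)((v >> 24) & 0xFFu));
}

// Add four bytes at once with a single 32-bit add. The top bit of every
// byte is masked off first so a carry can never spill into the next byte,
// then the top bits are patched back in with XOR.
__host__ __device__ inline unsigned int add_bytes_wrapping(unsigned int a,
                                                           unsigned int b)
{
    const unsigned int high_bits = 0x80808080u;
    unsigned int low_sum = (a & ~high_bits) + (b & ~high_bits);
    return low_sum ^ ((a ^ b) & high_bits);
}
```

CUDA's math API also lists per-byte SIMD intrinsics for device code, such as __vadd4 and __vaddus4. Whether either approach beats plain per-component code on a given card is something to measure rather than assume.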
Maybe if I find some time, I'll try to test whether it really brings a performance win, or whether it is just a memory win. (Or, in the worst case, whether it is even a performance loss.) ... Saving work and opening my favourite text editor ... Time to code!
Yes, GPUs are designed to work with floats natively. Integer types are supported but are quite possibly actually slower than floats (depending on what HW you have and what you're doing). Besides, most image processing operations are much more convenient to express in float form.
BTW, on current GPU architectures (both NVIDIA and AMD) a float3 will be stored in 3 registers and if you add float3 + float3 you generate 3 separate add instructions. We get parallelism by operating on many pixels (or other items) at once, not by operating on all 3 or 4 components of a vector at once.
Still, I agree it's annoying that CUDA doesn't provide math operators for float3 etc. built-in. After all, it's a handy programming model (for many fields, not just graphics) regardless of whether it actually maps to the hardware. I've gotten in the habit of just copying helper_math.h into each new CUDA project I start.
Haven't had a chance to look at helper_math.h yet; I'm fighting with getting the result of my compositing into an OpenGL texture for debugging. Turns out to be more complex than I thought.
Adding pixels in a shader does more than just adding the values together. Adding 0.75 to 0.75 ends up as 1.0 rather than 1.5, for instance.
Then there is alpha blending to take into account.
You know, I've always just used these facilities in shaders without really thinking about what code is being generated.
If I didn't need CUDA's hardware video encoding, I would be tempted to rewrite my code to use XNA and HLSL and see just what difference it makes. Actually, I might do it anyway and see if I can see any difference in the generated images.
I'm using uchar4 at the moment because I need to load in two textures for every image I generate. When designing my code I assumed that this would be the bottleneck rather than the CUDA part of the project. Loading a 1920 by 1080 texture as uchar4s is significantly faster than as float4s.
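For reference, a rough sketch of those two behaviours written out by hand in float. The assumptions are mine: colours are in the 0..1 range, alpha is straight (not premultiplied), the blend is the usual "source over destination", and the function names are made up. As far as I know, the clamping in a shader pipeline typically happens when the result is written to an 8-bit normalised target, so treat this as an approximation of the visible result rather than of the generated code.

```cuda
#include <cuda_runtime.h>

// Clamp a value to the 0..1 range.
__host__ __device__ inline float saturate01(float v)
{
    return v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
}

// Additive combine that clamps, so 0.75 + 0.75 gives 1.0 rather than 1.5.
__host__ __device__ inline float4 add_saturated(float4 a, float4 b)
{
    return make_float4(saturate01(a.x + b.x), saturate01(a.y + b.y),
                       saturate01(a.z + b.z), saturate01(a.w + b.w));
}

// "Source over destination" with straight alpha stored in .w.
__host__ __device__ inline float4 blend_over(float4 src, float4 dst)
{
    float a = src.w;
    float inv = 1.0f - a;
    return make_float4(src.x * a + dst.x * inv,
                       src.y * a + dst.y * inv,
                       src.z * a + dst.z * inv,
                       a + dst.w * inv);
}
```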
I think we need to do some research here, could be some interesting results
You can also store your textures in memory as uchar4, but convert them to float in the kernel after loading, do your operations, then convert them back before writing them out. That's ultimately what's happening when you sample a texture and output to a render target in HLSL, although there the hardware is doing the conversions for you.
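A minimal sketch of that load, convert, operate, convert back pattern. The helper names (to_float4, to_byte, to_uchar4) and the multiply used as the "operation" are placeholders of mine; the conversion maps 0..255 to 0..1 and rounds to the nearest byte on the way back.

```cuda
#include <cuda_runtime.h>

// 0..255 bytes to 0..1 floats.
__host__ __device__ inline float4 to_float4(uchar4 p)
{
    const float scale = 1.0f / 255.0f;
    return make_float4(p.x * scale, p.y * scale, p.z * scale, p.w * scale);
}

// 0..1 float back to a byte, clamped and rounded to nearest.
__host__ __device__ inline unsigned char to_byte(float v)
{
    v = v < 0.0f ? 0.0f : (v > 1.0f ? 1.0f : v);
    return (unsigned char)(v * 255.0f + 0.5f);
}

__host__ __device__ inline uchar4 to_uchar4(float4 p)
{
    return make_uchar4(to_byte(p.x), to_byte(p.y), to_byte(p.z), to_byte(p.w));
}

// Storage stays uchar4; all the arithmetic happens in float registers.
__global__ void multiply_kernel(const uchar4* a, const uchar4* b,
                                uchar4* out, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    float4 fa = to_float4(a[i]);
    float4 fb = to_float4(b[i]);
    out[i] = to_uchar4(make_float4(fa.x * fb.x, fa.y * fb.y,
                                   fa.z * fb.z, fa.w * fb.w));
}
```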
If there's a visual element to it, I find working directly with shaders more convenient. I only have experience with OpenCL, but there's a cost associated with host/device memory copies, which should also be true for CUDA. That's not something you want to do liberally in real-time. When I worked on some projects using both CL and GL, I had to micromanage the two systems so that they wouldn't starve each other for resources. AMD drivers were (still?) notoriously bad too, and under heavy loads it was very easy to lock up the GPU. Very annoying to sit there for 30 seconds until the driver times out and reboots itself. I had to lower the workload, sacrificing overall simulation speed for FPS and stability.
I wrote more image filters than Photoshop and GIMP combined. I have versions for both the CPU and the GPU, and I did a select few for OpenCL. Overall, processing image data with shaders is vastly better IMO. GPGPU is fast, naturally, but I also write a lot of visual tools and it's just much better to have everything running with the same API. GPGPU is really good for complex algorithms that would otherwise require hackish solutions with shaders. Under those circumstances, I'm willing to trade overall speed for productivity. So having said that, I would say it's worth porting some of your stuff to HLSL. I think you'll enjoy the exercise.
As for the uchar and floats thing, just convert your inputs and outputs, like Reed said. You'll have a universal set of image filters and you won't have to worry about format types. You'll get the same load-time performance and only trade off negligible runtime performance for the conversions.
The reason I decided to use CUDA was the hardware video encoding. The cards we are using support it, and it is much faster than software implementations.
Can't give details, but the basic idea is
Load a 2D layered texture
Loop over all input files
Combine input files using layered texture as an input to the equation
Add to video
The key to me was that I could do everything in CUDA without transferring the frame backwards and forwards from video ram.
The 2D layered texture, once loaded, stays in video RAM. The generated frame can be converted to YUV in video RAM. The video frame is generated in video RAM.
So the only host transfers in the inner loop are to load in the two input textures and write out the video frame.
I found the process of getting CUDA up and running in Visual Studio painless, really easy.
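As an illustration of the "convert to YUV in video RAM" step, here is a sketch of a kernel that converts a frame in place in device memory. The thread doesn't say which YUV variant or memory layout is used, so this assumes the full-range BT.601 (JPEG-style) coefficients and keeps a packed 4-byte layout for simplicity; a real encoder will most likely want a specific planar format instead. All names are hypothetical.

```cuda
#include <cuda_runtime.h>

// Clamp to 0..255 and round to the nearest byte.
__host__ __device__ inline unsigned char round_to_byte(float v)
{
    v = v < 0.0f ? 0.0f : (v > 255.0f ? 255.0f : v);
    return (unsigned char)(v + 0.5f);
}

// Assumed conversion: full-range BT.601. Y goes to .x, U to .y, V to .z,
// and alpha is passed through untouched.
__host__ __device__ inline uchar4 rgba_to_yuva(uchar4 rgba)
{
    float r = rgba.x, g = rgba.y, b = rgba.z;
    float y =  0.299f    * r + 0.587f    * g + 0.114f    * b;
    float u = -0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f;
    float v =  0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f;
    return make_uchar4(round_to_byte(y), round_to_byte(u),
                       round_to_byte(v), rgba.w);
}

// Converts the frame where it already sits in device memory,
// so no host transfer is involved.
__global__ void rgba_to_yuv_kernel(uchar4* frame, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) frame[i] = rgba_to_yuva(frame[i]);
}
```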
However, the details of writing CUDA code turned out to be a very acute pain. I found the documentation to be awful. A lot of things seem to have changed over the various versions, and you have the two schemes to work with: Total CUDA and CUDA C++.
When I was finally happy with my code, it took me a whole day to get it to compile. Not because I had done anything wrong; the code was correct. Hidden away in the project settings was a value that forced CUDA to target compute 1.0 and shader model 1.0.
Once I found and changed that, everything compiled fine.
But failed to run.
After more research and many internet searches I found out that my problem was not with the code. This time it was that my display drivers were older than the CUDA SDK I had installed. Why this should be a complete fail is beyond me, but hey ho.
Once I had updated my display drivers from 3.1 to 3.3, it finally ran!
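For anyone hitting the same wall, a small sketch that prints the CUDA version the installed driver supports, the runtime version the program was built against, and the compute capability of device 0. The "driver must be at least as new as the runtime" rule in driver_supports_runtime is my simplification, and both function names are made up.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Simplified rule of thumb (an assumption, not the full compatibility
// story): the driver's CUDA version should not be older than the runtime's.
// Both values are encoded as 1000 * major + 10 * minor, e.g. 5000 for 5.0.
inline bool driver_supports_runtime(int driver_version, int runtime_version)
{
    return driver_version >= runtime_version;
}

// Prints the versions and the compute capability of device 0.
// Returns false if a query fails or the driver looks too old.
bool report_cuda_setup()
{
    int driver_version = 0;
    int runtime_version = 0;
    if (cudaDriverGetVersion(&driver_version) != cudaSuccess) return false;
    if (cudaRuntimeGetVersion(&runtime_version) != cudaSuccess) return false;
    std::printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
                driver_version / 1000, (driver_version % 1000) / 10,
                runtime_version / 1000, (runtime_version % 1000) / 10);

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    std::printf("device 0: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);

    return driver_supports_runtime(driver_version, runtime_version);
}
```

Calling report_cuda_setup() once at startup gives a readable message instead of a kernel that silently never runs.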
I have done a hell of a lot of HLSL, not just for games. I've used HLSL as part of a multi touch remote input system, which was fun. Image filtering in HLSL is really fun to work with. I can see why you wrote a load of filters, it's amazing what a small equation can produce when applied to pixels.
Thinking about the whole pipeline I am working on, I know I am going to have to promote one of the textures from uchar4. I haven't decided yet what format to use. My instincts say to promote them to 16 bit, but I will have to see what difference it makes to load a 1920 by 1080 image in the various formats.
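The raw sizes are easy to work out up front. A small sketch, assuming four channels per pixel and tightly packed rows, with ushort4 standing in for "16 bit" (frame_bytes is a made-up helper):

```cuda
#include <cstddef>
#include <cuda_runtime.h>

constexpr size_t kWidth = 1920;
constexpr size_t kHeight = 1080;

// Bytes needed for one tightly packed frame at the given pixel size.
constexpr size_t frame_bytes(size_t bytes_per_pixel)
{
    return kWidth * kHeight * bytes_per_pixel;
}

// 2,073,600 pixels per frame:
static_assert(frame_bytes(sizeof(uchar4))  ==  8294400, "about 7.9 MiB");
static_assert(frame_bytes(sizeof(ushort4)) == 16588800, "about 15.8 MiB");
static_assert(frame_bytes(sizeof(float4))  == 33177600, "about 31.6 MiB");
```

So 16 bit per channel doubles the amount of data to load per texture and float4 quadruples it. How that translates into load time still needs measuring.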
Just had a thought (before anyone else says it, yes it was painful).
A while ago I wrote a Forth to HLSL/GLSL converter for a little project.
If you haven't had a look yet, have a look at http://forthsalon.appspot.com/
Basically what I did was write a Forth compiler that output shader code.
The end result was animated backdrops like this
Would you lot be interested in a little xmas competition to generate some animated graphics in shader code?
If there is a contest, I'm in (for any prizes, because that will earn all competitors even more eternal glory on DevMaster) ... So uhm, maybe create another thread for it?
trying to think of a prize
what would you like?
mmm, bragging rights.. And if you write your solution in Forth... dimension, you get a free ride to the funny farm
I like those nice young men in their clean white coats, oh they are coming to take me away again....
Well... I'd be in for no prize too, but hard to tell whether there would be more people who would be in for no prize.
So what do you think we should set as the parameters?
HLSL/GLSL? HTML5? C++?
Do we allow input textures, or just insist it be done purely in a single shader?
Mhm... hard to decide (I'm actually writing this for like 10 minutes)...
As for technology, we need to decide whether we want users to just do the magic with a single shader, or allow them to do quite complex stuff with shaders and some C++ (e.g. allowing them to actually write multi-pass techniques and combine them together).
I'd personally vote for procedural scene and textures (either made with C++ or in shaders), that could give us some magic code.