I am developing an application for work that deals with sensor performance. I need to calculate the standard deviation of an entire color buffer (a 1-channel, 16-bit floating-point frame buffer object) in OpenGL. For testing purposes, I am simply reading the buffer back into main memory and calculating the standard deviation on the CPU. Of course, as you might expect, this kills the FPS.
I have googled and googled and searched everywhere, but I cannot find any references or leads on this. The only reference I saw mentioned was an article by Horn that talks about calculating the sum of an entire buffer using shaders, but I could not find the article itself.
My first idea is to simply use the non-programmable pipeline and do additive blending onto another 16-bit floating-point surface: draw vertical lines on top of each other over and over again, with the texture coordinates for each vertical line referencing the buffer, once for each pixel in the x direction (I may skip pixels to just approximate the s.d.), and then draw single points on top of each other referencing the horizontal sums. This would give me the mean value; then I would repeat the process somehow using the squared differences from the mean. This doesn't seem terribly efficient, but it would be better than reading the buffer into main memory. Also, I don't know how to do this without having to ping-pong between textures when it comes to the second pass using the mean.
Anybody have better (more efficient) ideas for this?
Thanks in advance,
A couple of ideas:
(1) use the automatic mipmap generation capabilities of the hardware to generate the mean (you can read back the value of the topmost mip level to get the mean)
(2) render the texture to a second texture using a shader that calculates the squared difference at each pixel, and use automatic mipmap generation again to generate the mean
(3) then you can read back that second mean (which is the variance) and take the square root to get the standard deviation.
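To make the three steps concrete, here is a minimal CPU-side sketch in Python that mimics what the GPU would do. It is only an illustration, not OpenGL code: it assumes a square power-of-two image and assumes mipmap generation behaves like a plain 2x2 box filter (drivers may differ). The function names are made up for this example.

```python
import math

def downsample_2x2(img):
    """Average each 2x2 block, like one step of box-filter mipmap generation."""
    h, w = len(img), len(img[0])
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w // 2)
        ]
        for y in range(h // 2)
    ]

def top_mip_value(img):
    """Reduce a square power-of-two image to 1x1; that texel is the mean."""
    while len(img) > 1:
        img = downsample_2x2(img)
    return img[0][0]

def gpu_style_stddev(img):
    # Step (1): the mean is the value of the topmost mip level.
    mean = top_mip_value(img)
    # Step (2): a per-pixel "shader" writes the squared difference from the mean,
    # and a second mip chain averages those squared differences.
    sq_diff = [[(v - mean) ** 2 for v in row] for row in img]
    variance = top_mip_value(sq_diff)
    # Step (3): the square root of that second mean is the standard deviation.
    return math.sqrt(variance)
```

This computes the population standard deviation (dividing by N, not N-1), since averaging every pixel is what the mip chain does.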
BTW, all this falls under the category of general-purpose GPU computation (a phrase you may want to google). People have done much more complicated things with it, like performing simulations of cloth on the GPU using texels to store the position of vertices of the cloth surface, and shaders to update them at each timestep.
Thanks for the quick reply Reedbeta,
I am using ARB_texture_rectangle for the frame buffer object to support arbitrary buffer sizes (non-powers of two), and also because the coordinate referencing is easier (instead of 0.0 to 1.0, texels are referenced by 'pixel', i.e. 0 to 511 etc.). This texture type does not support mip-mapping. http://www.opengl.org/registry/specs/ARB/texture_rectangle.txt BTW, would generating mip-maps every frame be fast enough?
I have looked into this GPGPU stuff, but surprisingly I haven't found any info for such a simple task. There is a reference to a 'summation' method in the book GPU Gems 2, so I may just purchase a copy. I don't think the solution to this is trivial because of the parallel nature of shaders.
I think you haven't found a reference for a summation method because everyone does that with mipmaps :lol:
However, if you really must use ARB_texture_rectangle, you can still write a pixel shader to do a 2x2 box filter on the image and resample it to an image of half the size in each dimension. Then repeat until the image is small enough. Of course you have to figure out how to handle odd dimensions if you work with non-PO2 textures. However, I believe this is the most efficient way to do the resampling, taking advantage of the GPU parallelism to the greatest extent possible. And yes, you should be able to attain a decent FPS on this with a bit of optimization. It's not a very costly operation compared to some of the things people like to do, e.g. post-processing blur and bloom filters on every frame.
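One possible way to deal with odd dimensions, sketched on the CPU in Python: reduce by summing rather than averaging, treat texels that fall outside the image as zero, round the output size up, and divide by the original pixel count only at the very end. This is just one option I am assuming for illustration (the names are hypothetical), and each pass here stands in for one render-to-texture pass with a pixel shader.

```python
def reduce_sum_2x2(img):
    """One reduction pass: each output texel sums up to four input texels.

    Output size is rounded up, so an odd row or column is carried along
    instead of being dropped; out-of-range texels contribute nothing.
    """
    h, w = len(img), len(img[0])
    out_h, out_w = (h + 1) // 2, (w + 1) // 2
    out = []
    for y in range(out_h):
        row = []
        for x in range(out_w):
            total = 0.0
            for dy in (0, 1):
                for dx in (0, 1):
                    sy, sx = 2 * y + dy, 2 * x + dx
                    if sy < h and sx < w:
                        total += img[sy][sx]
            row.append(total)
        out.append(row)
    return out

def mean_by_reduction(img):
    """Repeat the pass until a single texel remains, then divide by the pixel count."""
    count = len(img) * len(img[0])
    while len(img) > 1 or len(img[0]) > 1:
        img = reduce_sum_2x2(img)
    return img[0][0] / count
```

Summing avoids the bias you would get from averaging partially filled 2x2 blocks at the edges. The trade-off is range: a 16-bit float tops out at 65504, so running sums may need a higher-precision intermediate target.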
I just found ARB_texture_non_power_of_two,
It appears that this DOES support mip-mapping; the only difference between this and ARB_texture_rectangle is the way coordinates are referenced.
I'll give both methods a try to see which is faster. At first I thought that mip-maps might be overkill (because I don't need any of the intermediate levels), but the full chain only adds about 1/3 of the original buffer size as overhead, because each level is a quarter the size of the one before it, so we'll see.
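For reference, the 1/3 figure follows from a geometric series, assuming each mip level has exactly one quarter as many texels as the previous one (for non-power-of-two sizes the rounding makes this only approximate). With N texels in the base level, the extra levels hold:

$$
\sum_{k=1}^{\infty} \frac{N}{4^{k}} = N \cdot \frac{1/4}{1 - 1/4} = \frac{N}{3}
$$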
thanks for your help,