Maybe I do not correctly understand the exact meaning of complexity denoted by O-notation. What I've done is taken the N-body force-interaction problem, which requires at least N(N-1)/2 forces to be calculated (typically in serial), and rearranged the pairs into groups such that it should be possible to calculate the forces N/2 at a time without any contention over shared memory (supposing you have N/2 cores). The total work is still O(N^2), but ideally the wall-clock time should therefore be linear in N.
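To make the grouping concrete, here is a sketch using the classic round-robin (circle-method) tournament schedule, which produces N-1 rounds of N/2 disjoint pairs that together cover all N(N-1)/2 interactions. I'm assuming this is equivalent to the rearrangement described above; the helper name `round_robin_pairs` is just for illustration:

```python
def round_robin_pairs(n):
    """Yield n-1 rounds; each round is a list of n//2 disjoint body pairs.

    Within a round no body appears twice, so all n//2 force
    calculations can run in parallel without touching the same memory.
    (Assumes n is even; for odd n, add a dummy body that sits out.)
    """
    assert n % 2 == 0, "pad with a dummy body if n is odd"
    ids = list(range(n))
    for _ in range(n - 1):
        # Pair first with last, second with second-to-last, etc.
        yield [(ids[i], ids[n - 1 - i]) for i in range(n // 2)]
        # Rotate every position except the first (the "circle" trick).
        ids = [ids[0]] + [ids[-1]] + ids[1:-1]

# Example: 6 bodies -> 5 rounds of 3 independent pairs each,
# covering all 6*5/2 = 15 interactions exactly once.
for rnd in round_robin_pairs(6):
    print(rnd)
```

With N/2 cores each round takes constant time, and there are N-1 rounds, which is where the linear wall-clock estimate comes from.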
Yesterday I spent a bit of time reading about GPGPU and was disappointed to find that GPUs are apparently quite limited in the kinds of problems they suit: one must avoid shared global memory as much as possible, work has to be broken into chunks whose size depends on the hardware, and transferring data to and from the GPU is slow. Unfortunately, it seems that the parallelization trick I came up with may only be practical on systems with many CPUs (i.e., a supercomputer) or on specialized FPGAs.