I've heard the advice many times to not bother with low level optimization. In the past, I've taken the position that clarity and organization of code and algorithm design is far more important. However, recently I've been working on a project that involves many-body interactions where performance is already starting to be an issue.
I spent a good half a day exploring and benchmarking various ways to make the functions for those interactions faster (they are called many millions of times per second). For example, I tried the fast inverse square root function that was used in Quake, and the fast sine/cosine functions the Nick shared with us here. I also tried several fast 3d distance equations of various crudeness.
Each fast math function was tested in isolation or near isolation and called 10 million times and compared. The results surprised me. A few of the functions were slightly faster that their "slow" counterparts, and quite a few of the "fast" functions were slower. Supposing I haven't made some stupid systematic mistake in my benchmarks, it seems that the GCC compiler produces really fast code on its own. I'm wondering if you have had similar experiences. Has anyone else actually benchmarked Nick's fast sine and cosine code?
what are you actually doing, an atomic reaction simulator? (just jesting, this is way higher than my league hehe)
reedbeta delete my post.
@Rouncer, your post got deleted? My project isn't anything exotic, just a simulation of many interacting "people". The AI for each "person" will (if I can work it out) be fairly computationally expensive. At the moment, I just have a bunch of cubes with collision forces between them, drag forces, and some simple biased Brownian motion to make them swarm. 1000 of them on my single core machine runs 20-25 frames a second. As mundane as that sounds, that's already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.
I've heard the advice many times to not bother with low level optimization.
That is totally wrong advice. You should not bother with low level optimization, 90% of the time. That 10% is critical for the overall performance of your application. The reason some people have stopped bothering about performance altogether, is that other libraries (and drivers) take care of some of the low-level work.
Has anyone else actually benchmarked Nick's fast sine and cosine code?
I have. It's the fastest algorithm I know of. Make sure you're not using any branch operations though.
Have you considered SIMD code (SSE/AVX)? Future x86 CPUs will feature AVX2 with gather and FMA support.
1000 of them on my single core machine runs 20-25 frames a second. As mundane as that sounds, that's already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.
Sort the objects in the x, y and z direction, to quickly determine which potentially collide.
When designing efficient algorithms, you should consider both big-O notation and low-level details of the algorithm. While faster sqrt/sin/cos may result in some improvents (maybe even few times the performance, if the algorithm is very heavy on those operations), big-O and good cache coherency can result in several orders of magnitude improvement in comparison. Those are the big fishes you should try to catch first, because other optimizations potentially pale in comparison and may be completely useless if you have to change the entire algorithm for sake of better big-O & cache coherency.
It sounds to me that you are struggling because of bad big-O and focusing on faster sqrt/sin/cos implementations potentially result only in quite modest improvements. You should try to figure out how you could improve the algorithm in the high-level instead. For example do you have to consider all objects for each object, or is it enough to use only the closest objects? You mentioned space partitioning for dynamic objects, so could you put the objects into a grid for example for faster queries & updates? Furthermore, since we are talking about a simulation, could you exploit frame-to-frame coherency somehow?
Thanks all for your replies.
How much faster did you find your sine and cose to be? I'll give your functions another look.
I've never tried writing assembly before, but would consider it. My machine is an old amd64 with sse2.
I really like your dimensional sorting idea. That would avoid a lot of conditional statements (which from my experience are quite slow) that a space partitioning trees would require, and could maintain some coherency from frame to frame and not need too much resorting. Will definitely give that a shot.
You're absolutely right about the importance of making my algorithm more efficient. There are still things I need to explore in that regard. This is the first time I've delved into a low-level approach, and I was surprised at just how ineffective it was. Still, at the end of the day, I'm also going to be calling a handful of math functions a very large number of times, so at some point, faster math functions would be nice.
It's not "don't bother with low level optimization", it's "don't bother with low level optimization first, above all or obsessively in all parts of your app."
You should code iteratively, refactoring sections for performance and only after benchmarking your whole program shows it to be necessary.
The raison d'etre for that motto is that many coders are perfectionists when it comes to their code. Some of us will spend days optimizing the number of angels on the head of a pin if given half the chance... B)
So, you tested each function with 10 million calls. The question I'd ask is will your app call that function 10 million times in such a way that the difference in timings would affect the end goal of the app?
You should try some spatial subdivision algorithms first before going to low level optimization. ASM-written bubble sort will be much slower then quick/heap/tree/subdivision sorting written in C/C++.
if you run on the gpu, its going to go faster, too.
After carefully testing your fast sine function again (the c version, not the assembly version), it does seem to be about twice as fast as the standard function. =)
So starting implementing your dimensional sorting suggestion. Using insertion-sort to sort the objects in each dimension is super fast since there's good coherency between frames. Then the idea was to take each object and use the sorted lists to figure out which other objects are nearby in each dimension. Across the 3 dimensions, there would be some duplicate nearby objects, so I'd cull those and then finally do interactions for what's left over.
However, dimensional sorting doesn't seem to work if my people/objects are different sizes, which is desirable for my project. Still thinking about workaround...
Further, for each dimension (say x), each object potentially collides with all the objects inside of a slice in the other dimensions (all the objects in the collision range in x, but with any y or z). This is really expensive if the object density is high. This isn't an issue for a 1D system, and less for a 2D system, but seems like it would be crippling in 3D. Is there something I'm missing?
When I get a chance I may finish up the sort thing and get some real numbers anyway.
@alphadog Well, I've already established that yes, certain functions are being called upwards of 10million times a second. Even if I can reduce the number of interaction checks I can do with spatial partitioning, those functions will be called a lot. For example, I just benchmarked my collision algorithm, and commenting out its contents doubles the performance of my application. Performance is ok now, but once I add artificial intelligence for each object, its going to get slow.
The idea of writing a sort-algorithm in assembly never even crossed my mind. I don't think I will try it. Do you have a suggestion for a spatial subdivision algorithm that works well for many dynamic objects? My intuition says the cost would outweigh the benefit because of all the updating, but I'm not so experienced with these things.
Adding an initial box-test to my collision algorithm boosted performance by about 15%, which is not groundbreaking, but nice for 3 extra lines of code.
I also benched the circular, doubly-linked list that I was storing my objects/people in versus ordinary arrays. Iterating through the objects for interactions and rendering is only between 0 and 20% faster (total application performance) with arrays. I'll stick with the convenience of the linked-list for now.
While faster sqrt/sin/cos may result in some improvents (maybe even few times the performance, if the algorithm is very heavy on those operations), big-O and good cache coherency can result in several orders of magnitude improvement in comparison. Those are the big fishes you should try to catch first...
I don't think you should perform one or the other optimization first. Algorithmic optimization may result in a significant performance improvement, but there are very few guarantees. I've known cases where people rewrote an entire library over the course of months, only to find out it was all in vain and they could have concentrated their effort elsewhere...
It's really a cost/gain/risk analysis. Sometimes low-level optimizations are quite straightforward and after implementing them you get a much clearer picture of the algorithmic complexity. At other times it's obvious that first rewriting things in assembly and then trying to optimize at a higher level is asking for trouble.
The simple truth is that there are no golden hammers in software optimization. You always have to evaluate every approach, and there's no substitute for prototyping and profiling.
if you run on the gpu, its going to go faster, too.
There's no guarantee of that. The brute force approach may run faster on a powerful GPU, but with spacial subdivision the CPU is a lot more efficient. Also note that the round-trip communication overhead can make GPGPU processing inferior.
And with AVX2 future CPUs will become a lot more powerful at throughput computing so you don't have to make any compromises.
that's already half a million interactions per frame for collisions. Since the system is highly dynamic, not so sure whether space partitioning would really help.
Well. I guess you just compare all pairs:
1000 persons \\^2 / 2 = 500000
With spatial subdivision of 10 * 10 cells, and uniform disribution (the ideal case, I know, but still), you get:
(1000 persons / 10\\^2 cells)\\^2 * 10\\^2 cells / 2 = 5000
An improvement by 2 orders of magnitude. Should cover the added complexity.
edit: Messed up the last formula.
you make a convincing argument
I tried out and tested the uniform spatial partitioning method you suggested. With 1000 uniformly sized randomly moving objects, I tuned the partition size for best performance and required area. The best grid had 20x20 partitions, where each partition was 4 times as wide as an object.
The partitioning method performs about twice as fast as the N*(N-1)/2 method (total application performance). Certainly, this is a decent gain, but not nearly as high as you predicted, and it comes at some cost in code clarity.
Determining interactions requires higher complexity than your calculation predicts:
numberOfCells * averageNumberOfCollisionsPerCell * averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated
= 400* \~(1000/400)\\^2/2 * \~50 = \~60,000
an order of magnitude slower than your estimate.
The various ideas here have got me thinking. There's another partitioning strategy I thought up today that should be faster, and require no boundaries. I will report back when that's done.
I'm unclear about what you are trying to do.
You talk about people, then talk about 3d.
People cannot fly, so taking a projection through the view volume simplifies your collision handling. I.e a plan of the volume.
If you have some kind of terrain with massive vertical displacements, you can then cull the 2d collisions based on a difference in the vertical displacement squared.
In general you should never need a square root in collision detection, it is only needed once you decide a collision has occured and you move into collision resolution.
If the "people" are of more or less a uniform size then you can treat the plan of the volume as a grid and then you have loads of techniques you can use to reduce the computational tasks.
You can treat the grid as a texture and move nearly everything into the gpu.
Use image recognition tricks
lot's of things.
Can you give a better description of what you are trying to do?
Then maybe we can give targeted advice rather than general case advice.
I'm building a collision system for a game that will involve a large number of both movable objects and moving "beings". Beings will come in various sizes. Although I have not made a final decision regarding terrain, my plan at this point is for the terrain to be a roughly spherical world (like in the game Populous). Regardless or whether I go that route, or with a more planar terrain, I have a lot of heterogeneous moving objects that need to collide with each other in either 2 or 3 dimensions, and soon, will also need to have AI interactions between objects and beings in proximity.
@everyone Since my creatures interact from a more macroscopic perspect, I've decided to do force-based interactions. I'm not doing traditional rigid-body collision detection followed by collision response. My collision forces are based on a fast s-curve formula, but also involves a single vector unitization, which means a square-root. Bounding box and distance squared checks are used to cull non-collisions first.
I've spent the past week trying out different partitioning strategies and to find something that works well for my application. I just roughed out a new partitioning scheme this morning and results look great. You guys were right: the performance advantage (of the right partitioning method) is indeed well beyond what any math optimization could accomplish.
Basically, I take all my objects and find the center-of-position (ie, unweighted center-of-mass) in the dimension where the objects are most spread out. Then I split at the center-of-position into two partitions, and put any object that lie on the partition edge into a separate bin. Doing this recursively, the partitions quickly get small . There is no penalty for objects moving, or what order they are in, no limits to how spread out the objects are, and this strategy finds optimal split planes. It's also quite fast. 4000 moving objects (brute-force that would be an impossible 8 million collisions per frame) goes at playable frame rates with just 8 smallest-level partitions. Haven't done any tuning yet for what the best number of partition levels is.
Thank you all for your advice and ideas!
It is possible you have too many cells in your partition. There is always an optimal number of cells, depending on the implementation, the distrubution of your objects and then number of objects.
Also, what is "averageNumberObjectsThatEachObjectNeedsToCheckWhetherCollisionWithHasAlreadyBeenCalculated"?
What I mean is the scheme I'm using, the algorithm does a nice job of choosing how to split partitions, and I'm going to make it so that it dynamically chooses how many times to split each frame for best performance. The only situation that this doesn't work well in is when all objects occupy the same space.
With the previous grid scheme you suggested, averageNumberObjectsThatEachObjectNeedsToCheckWhe therCollisionWithHasAlreadyBeenCalculated"? is because some objects may intersect more than one grid cell, so each object must store a list of all the objects it has already done collision with, and that list needs to be checked and updated for every collision. I suppose there are other ways of dealing with this.
next page →