These 20 guys are only getting to the screen at 10 fps, and I'm wondering why... it's not the vertex count, because if I use a LOD I get the exact same fps.
The thing is, each guy is 9 draw primitive calls (armour, body and weapons), and I think that might be why it's going slow.
How can I get the guys to the screen without overloading on draw primitive calls?
If I just render the body and head by themselves I get a much better frame rate, so how do I draw clothes and armour without slowing down?
I sort of got around the problem by baking all the clothing, accessories, etc. into the same model and drawing them all at once, and it worked.
The only problem is that materials are going to be a little harder now... oh well, thanks anyway.
9*20 = 180 draw calls shouldn't be a problem. Are you using an optimized build?
Ahhh! I changed it to release mode and yeah, it started working properly. Why was it going so slow under the debug build? I can't ever remember it happening before, that's why I was really confused...
Debug builds aren't optimized, so a lot of things run slower. I've seen this myself recently with text; I wrote a simple font renderer for my engine and drawing a couple paragraphs of text took something like 15 ms! Release build runs nice and quick, though.
Rendering only 20 characters in debug build shouldn't be that slow though. It's not that CPU taxing.
You're only half right - it also depends heavily on the compiler settings (e.g. running an application built with -ggdb3 -O0 and without -fomit-frame-pointer can be really slow), the target configuration, etc. - although you're right that 20 characters should be okay even in a debug build.
[Now comes heavy wizardry, black magic and compiler-related stuff]
First, some basic compiler flags:
Basically, for debug builds you want debugging symbols (so that during debugging you see, e.g., a variable's name instead of its address, and can tell what's going on) - -ggdb3 and -O0 are really good for this. Also, don't use -fomit-frame-pointer ("Don't keep the frame pointer in a register for functions that don't need one. This avoids the instructions to save, set up and restore frame pointers; it also makes an extra register available in many functions. It also makes debugging impossible on some machines." - as specified in the GCC manual).
For a release version, if you don't care about size, you don't need debugging symbols, nor do you need to keep the frame pointer - so instead of -ggdb3 -O0 it is better to use -O1 or -O2 together with -fomit-frame-pointer (O means an optimization set - O1 is the basic optimization set, O2 is O1 plus some others).
For a release version where you do care about size, it is best to use -Os (the optimization set for size).
And at last (my favourite) you can go "roarrrrr!", hit your compiler with a club and use -O3 (I don't know whether an equivalent is available in MSVC; also, the club part wasn't meant literally - don't do it, your compiler might not be tough enough to survive).
Anyway, for a detailed description see http://gcc.gnu.org/o...ze-Options.html - most of the flags are the same in most compilers (I don't know where the MSVC reference is, though lots of the flags are similar).
That's my point: regardless of how bad your compiler settings are, rendering 20 characters shouldn't be that CPU-taxing.
Yeah, I would tend to agree, but it does sometimes seem to be much slower, and I don't understand why myself. Like I said, in my case I saw text rendering being several times slower (on the GPU) in a debug build, even though what the GPU was doing should have been exactly the same. It wasn't due to using the debug runtime of D3D either; using the debug runtime in a release build caused no measurable slowdown.
Text rendering is kind of a special case, though, where you feed the GPU dynamic data (I assume you use a dynamic vertex/index buffer to send the text quads to the GPU). If you use the debug D3D DLL, it could verify the vertex/index data after Unlock() or something like that, making things appear much slower on the GPU side. But yeah, I have seen close-to-10x-slower debug builds in some f'ed-up game engines where every single thing is accessed through an accessor function or something like that. Still, 10 fps sounds pretty low, but meh.
Okay, so let's do some benchmarking on my side.
Testing machine: Core i3 + 3 GiB RAM + Radeon HD 5470 (i.e. my current laptop, where I'm sitting). Operating system: Debian Squeeze.
Testing pipeline: deferred renderer with a fast CPU ray tracer for computing VPL positions, using just a single spotlight, fully dynamic, at 1280x720. Note the numbers might not be the peak performance of the PC, as I'll be playing music during the tests (but during all of them!)
1.) simple scene, just some 150 triangles, 4 different materials
2.) Sibenik cathedral
1.) Debug build (compiler flags -O0 -ggdb3, i.e. no optimisations, and debugging info for gdb at level 3)
2.) Release build (compiler flags are -O1)
3.) Release build v2 (compiler flags are -O2)
4.) Release build v3 (compiler flags are -O3)
5.) Black magic build (my own compiler flags, starting with -O3 and containing lots of compiler magic stuff)
Scene/Build    Simple scene    Sibenik cathedral
Debug          57.050750       112.404224
Release v1     56.718257       110.835902
Release v2     56.137256       110.731450
Release v3     55.950382       110.438174
Black magic    55.079345       109.939037
As you can see, debug vs. release doesn't have that much impact for small scenes; the impact is slightly larger for large scenes, though it is still not THAT huge (of course, I'd need to profile the application to see how much time we spend rendering on the GPU versus waiting on the CPU to say exactly how much it gives/takes). Gimme a sec...
Interesting. It certainly doesn't make much difference in your case. And Jarkko, yeah, I guess dynamic-vertex-buffer-related overhead could be an issue. It's still odd that the poor performance was showing up on the GPU instead of the CPU, though (I time the CPU with QueryPerformanceCounter and the GPU with D3D11 timestamp queries). Something could be wrong with my timing code, though, or maybe the OS was consistently interrupting the GPU at the same point in my frame, or something crazy like that.
Okay, so here comes the profiling...
I've written my own profiler in my game engine, and it reports quite a lot of detail (it can measure percentages, milliseconds, even individual engine calls, etc.). Let's sum everything into two buckets, CPU and GPU. What goes where?
Under CPU I'm counting the ray-tracing part of course, draw calls, state changes (e.g. changing the FBO or shader), et cetera - i.e. CPU calls and CPU instructions generally.
Under GPU I'm counting just rendering time on the GPU (i.e. in the first case, the time we wait until the GPU actually finishes computing something; in the second case, the time we're actually doing GPU work).
EDIT: Why count it this way? Basically, we want to spend as much time on the GPU as possible, because there is a whole lot more stuff to do on the CPU (AI, physics, etc.). Also, the speed of GPUs grows a lot more quickly than the speed of CPUs.
Basically, let's use the same scenes and compiler settings as before... and measure where we are spending most of the time. (Again, the first pair of values is for the simple scene, the second pair for Sibenik.)
Build          Simple scene                      Sibenik cathedral
Debug          CPU 50.900325%, GPU 49.099675%    CPU 48.323266%, GPU 51.676734%
Release v1     CPU 50.612472%, GPU 49.387528%    CPU 45.059028%, GPU 54.940972%
Release v2     CPU 48.958393%, GPU 51.041607%    CPU 41.787474%, GPU 58.212526%
Release v3     CPU 48.578865%, GPU 51.421135%    CPU 41.175955%, GPU 58.824045%
Black magic    CPU 48.492081%, GPU 51.507919%    CPU 40.038613%, GPU 59.961387%
So here we see that while optimisations don't buy us much for simple scenes (some 2% more time spent waiting on the GPU - there is a boost, it's just less visible in overall performance), they give a pretty huge boost for large scenes (almost 10% more of the time is spent waiting on the GPU - i.e. it becomes time to optimize the GPU side). If I used a really large and very complex scene, something like the Power Plant, we would see an even larger gain on the CPU side from optimizations.
If we always wait until the GPU finishes (i.e. we compare absolute time spent on the GPU against absolute time on the CPU - let's force the wait with glFinish()), the results are:
Build          Simple scene                      Sibenik cathedral
Debug          CPU 14.074966%, GPU 85.925034%    CPU 8.012333%, GPU 91.987667%
Release v1     CPU 13.981151%, GPU 86.018849%    CPU 7.988983%, GPU 92.011017%
Release v2     CPU 13.893738%, GPU 86.106262%    CPU 7.966127%, GPU 92.033873%
Release v3     CPU 13.869523%, GPU 86.130477%    CPU 7.946651%, GPU 92.053349%
Black magic    CPU 13.503755%, GPU 86.496245%    CPU 7.845981%, GPU 92.154019%
Here it is pretty easy to see that in the simple scene the GPU is mostly relaxing (relatively more of the time goes to the CPU), while in the more complex scene it has a much harder time. It can also be seen (in the simple scene) that optimizations gained us some 0.5% overall (which means some 3.6% in CPU-only performance).
Also note that my whole engine is heavily optimized (it uses dynamic BVHs, scene graphs, visibility culling, intrinsics (especially in the ray-tracing part), etc.) - the effect would be even more visible in a less optimized application.
Anyway, what can we actually conclude from this (apart from the fact that I'm a heavy-optimizing guy)?
Basically, compiler optimizations give us the freedom to write more high-level and/or better-structured code - not licence to write algorithms with O(n^3) complexity and hope -O3 will solve the speed for us (it won't; it wasn't even designed to). We should rather reconsider whether to use an O(n log n) algorithm instead (even if it costs a few more days of work).
I'm also not saying that MSVC applications shouldn't be much faster when switched from Debug to Release mode (I'd also note that MSVC stores a huge amount of debugging symbols in Win32 debug builds).
And one last note: for this testing I used GCC 4.5 on Linux. The whole application is written directly against Xlib + OpenGL (for the graphics part).
And no post is complete without the image shining out of it (Simple scene - showing the GI stuff):