Okay, so here comes the profiling...
I've written my own profiler into my game engine, and it reports quite a lot of detail (it can measure percentages, milliseconds, even individual engine calls, etc.). Let's sum all of that into two buckets: CPU and GPU. What goes where?
Under CPU I'm counting the ray-tracing part of course, plus draw calls, state changes (e.g. switching the FBO or shader), et cetera - in general, CPU calls and CPU instructions.
Under GPU I'm counting just the rendering time on the GPU (i.e. in the first set of results below, the time we spend waiting until the GPU actually finishes computing something; in the second set, the time the GPU is actually doing work).
EDIT: Why count this way? Well, basically we want to spend as much time on the GPU as possible, because there is a whole lot more stuff to do on the CPU (AI, physics, etc.). And also, the speed of GPUs grows a lot more quickly than the speed of CPUs.
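I'm not showing the profiler code itself here, so below is only a minimal sketch of how such a per-frame CPU/GPU split could be measured, assuming a GL 3.3 context (or ARB_timer_query) and an extension loader such as GLEW; `render_frame` and the percentage print-out are illustrative placeholders, not the engine's actual code:

```cpp
#include <GL/glew.h>   // assumed extension loader for glGenQueries & friends
#include <chrono>
#include <cstdio>

// Sketch only: CPU time = wall-clock time spent issuing the frame,
// GPU time = GL_TIME_ELAPSED timer query around the same commands.
void profile_frame(void (*render_frame)())
{
    GLuint query = 0;
    glGenQueries(1, &query);

    auto cpu_start = std::chrono::steady_clock::now();
    glBeginQuery(GL_TIME_ELAPSED, query);

    render_frame();                       // draw calls, state changes, ...

    glEndQuery(GL_TIME_ELAPSED);
    auto cpu_end = std::chrono::steady_clock::now();

    // Blocks until the result is ready; a real profiler would poll
    // GL_QUERY_RESULT_AVAILABLE a frame later to avoid the stall.
    GLuint64 gpu_ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpu_ns);
    glDeleteQueries(1, &query);

    double cpu_ms = std::chrono::duration<double, std::milli>(cpu_end - cpu_start).count();
    double gpu_ms = gpu_ns / 1.0e6;
    double total  = cpu_ms + gpu_ms;
    std::printf("CPU %.6f%%, GPU %.6f%%\n",
                100.0 * cpu_ms / total, 100.0 * gpu_ms / total);
}
```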
Basically, let's use the same scenes and compiler settings as in the first case ... and measure where we're spending most of the time (first column is the simple scene, second is Sibenik):
              Simple scene                      Sibenik
Debug         CPU 50.900325%, GPU 49.099675%   CPU 48.323266%, GPU 51.676734%
Release v1    CPU 50.612472%, GPU 49.387528%   CPU 45.059028%, GPU 54.940972%
Release v2    CPU 48.958393%, GPU 51.041607%   CPU 41.787474%, GPU 58.212526%
Release v3    CPU 48.578865%, GPU 51.421135%   CPU 41.175955%, GPU 58.824045%
Black magic   CPU 48.492081%, GPU 51.507919%   CPU 40.038613%, GPU 59.961387%
So here we see that while optimisations don't buy us much for simple scenes (only about 2% more of the frame ends up waiting on the GPU - there is a boost, it just isn't very visible in overall performance), they give a pretty huge boost for large scenes: almost 10% more of the time is spent waiting for the GPU, which means it is time to optimize the GPU side. If I profiled a really large and complex scene, something like the Power Plant model, we would see an even larger CPU-side gain from the optimizations.
If we always wait until the GPU finishes (i.e. we compare the absolute time spent on the GPU to the absolute time spent on the CPU, by forcing a wait for the GPU with glFinish(); a small sketch of this measurement follows the table), the results are:
              Simple scene                      Sibenik
Debug         CPU 14.074966%, GPU 85.925034%   CPU 8.012333%, GPU 91.987667%
Release v1    CPU 13.981151%, GPU 86.018849%   CPU 7.988983%, GPU 92.011017%
Release v2    CPU 13.893738%, GPU 86.106262%   CPU 7.966127%, GPU 92.033873%
Release v3    CPU 13.869523%, GPU 86.130477%   CPU 7.946651%, GPU 92.053349%
Black magic   CPU 13.503755%, GPU 86.496245%   CPU 7.845981%, GPU 92.154019%
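The promised sketch of the blocking variant - again only an assumption about how the frame might be split, with `update_cpu` and `render_gpu` as hypothetical placeholder names rather than the engine's functions:

```cpp
#include <GL/gl.h>
#include <chrono>

static double elapsed_ms(std::chrono::steady_clock::time_point a,
                         std::chrono::steady_clock::time_point b)
{
    return std::chrono::duration<double, std::milli>(b - a).count();
}

// Sketch only: everything after the CPU work, up to glFinish() returning,
// is attributed to the GPU bucket (this also swallows the cost of issuing
// the GL calls, which is fine for an "absolute time on GPU" style measurement).
void profile_frame_blocking(void (*update_cpu)(), void (*render_gpu)(),
                            double& cpu_ms, double& gpu_ms)
{
    auto t0 = std::chrono::steady_clock::now();
    update_cpu();                 // AI, physics, CPU-side ray-tracing, ...
    auto t1 = std::chrono::steady_clock::now();

    render_gpu();                 // submit all GL work for the frame
    glFinish();                   // block until the GPU has actually finished
    auto t2 = std::chrono::steady_clock::now();

    cpu_ms = elapsed_ms(t0, t1);
    gpu_ms = elapsed_ms(t1, t2);
}
```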
Here it is pretty easy to see that in the simple scene the GPU is mostly relaxing (there is relatively more CPU utilization), while in the more complex scene it is working a lot harder. It can also be seen (here in the simple scene) that the optimizations gained us about 0.5% in total, which means roughly 3.6% in CPU-only performance.
Also note that my whole engine is heavily optimized (it uses dynamic BVHs, scenegraphs, visibility culling, intrinsics (especially in the ray-tracing part), etc.) - the effect would be even more visible in a less optimized application.
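To give a flavour of what "intrinsics in the ray-tracing part" typically means (this fragment is purely illustrative, not code from the engine): with SSE you can, for instance, evaluate one axis of a slab test for a packet of four rays at once.

```cpp
#include <xmmintrin.h>   // SSE intrinsics

// Illustration only: compute the slab distances t = (slab_x - origin_x) * inv_dir_x
// for a packet of four rays, one ray per SIMD lane.
__m128 slab_distances_x(__m128 origin_x, __m128 inv_dir_x, float slab_x)
{
    __m128 slab = _mm_set1_ps(slab_x);
    return _mm_mul_ps(_mm_sub_ps(slab, origin_x), inv_dir_x);
}
```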
Anyway, what can we technically take away from this (apart from the fact that I'm a heavy-optimizing guy)?
Basically, we can see that compiler optimizations give us the freedom to write more high-level and/or better-structured code - not that we can write an O(n^3) algorithm and expect -O3 to solve the speed for us (it won't; it wasn't even designed to). We should rather reconsider whether to use an O(n log n) algorithm instead, even if that means spending a few more days on the algorithm.
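A toy illustration (nothing to do with the engine): both functions below answer the same question, and -O3 will shave the constant factor off both of them, but only the algorithmic change in the second one alters the growth rate, which is what actually matters on large inputs.

```cpp
#include <algorithm>
#include <vector>

// O(n^2): no compiler flag changes the growth rate, only the constant factor.
bool has_pair_with_sum_quadratic(const std::vector<int>& v, int target)
{
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] + v[j] == target) return true;
    return false;
}

// O(n log n): sort once, then walk the ends inward; the better algorithm,
// not -O3, is what makes this scale.
bool has_pair_with_sum_sorted(std::vector<int> v, int target)
{
    std::sort(v.begin(), v.end());
    std::size_t lo = 0, hi = v.empty() ? 0 : v.size() - 1;
    while (lo < hi) {
        int s = v[lo] + v[hi];
        if (s == target) return true;
        if (s < target) ++lo; else --hi;
    }
    return false;
}
```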
I'm also not saying that MSVC applications shouldn't be much faster when switched from Debug to Release mode (I'd also like to note that MSVC stores a huge amount of debugging symbols in Win32 applications).
And one last note: for this testing I used GCC 4.5 on Linux. The whole application is written directly against Xlib + OpenGL (for the graphics part).
And no post is complete without an image to show it off (the simple scene, showing the GI stuff):