[This is about my C++ Profiler which may be found on Google Code under High Performance C++ Profiler]
My interest in code profiling started when I was making hudbot. What with code injection and patching, function hooking, data hijacking, and OpenGL, I knew I had relatively no experience in what I was attempting and that I could easily be producing some amazing slowdowns if I wasn’t careful.
Unfortunately, C++ profilers seem to come in three varieties, all of which have a fatal downside:
- Sampling Profilers which are fast, multi-threaded, but inaccurate and have decent output (sometimes too detailed). Some examples are VTune, CodeAnalyst, google-perftools and Sleepy.
- Instrumenting Profilers which are accurate, multi-threaded, but slow, and have decent output. Some examples are GlowCode and the now defunct DevPartner Profiler Community Edition.
- Instrumenting Profilers which are fast, accurate, but single threaded and have limited output. These range from extremely simple profilers like Peter Kankowski’s Poor Man’s Profiler to the more complicated and full-featured Shiny C++ Profiler.
The obvious outcome is that if you want fast and accurate, like I did, you’ll have to use an existing profiler or write it yourself and instrument your code manually. With a little work, fancy stuff like call trees can be added. Once you get it tested and working, you can start going crazy profiSegmentation fault.
Oh yeah, about that. There are no multi-threaded instrumented profilers that are open source, and depending on how your single threaded profiler works, the results when trying to use it in a multi-threaded environment can range from bad data to outright crashing. It’s possible to patch the profiler to only allow the main thread in, but this adds unnecessary slowdowns and doesn’t address how to profile other threads. This is where my profiler comes in!
Pieces of a high performance multi-threaded C++ profiler
The main piece of a high performance profiler is what mechanism is used to get the timestamps. High precision is the obvious main requirement, but it must also have as low a latency as possible. If you’re making millions of calls a second to your profiler, the timestamp mechanism could become the limiting factor in your app’s performance and make it so unresponsive that testing it is infeasible.
On an x86, this means you must go with rdtsc. It is low latency, high precision, and is portable to gcc. This choice is unfortunately not without it’s trade offs. rdtsc does not serialize, so unless you insert a serializing instruction like cpuid before it (and bloat the latency in the process) or use the new rtdscp instruction, the cycle count you receive may not be 100% accurate. rdtsc is not guaranteed to be sync’d across all CPUs in a multi-core / multi-CPU system, so even single threaded timing has the possibility of being incorrect if the thread is scheduled across multiple CPUs. But, and this is a big but, for what I want there is nothing else to use. If someone else has different needs they can replace the timer function, but for the volume of calls I’m interested in, latency needs to be the bare minimum.
This is the part nobody seems to know or care how to do so I was on my own. To avoid overhead, synchronization must be avoided at all costs for the general path.
Easy solution: Track each thread’s profiler state with thread-local storage and do the heavyweight statistics work on-demand instead of worrying about aggregating on the fly. A little more work needs to be done depending on the details of the profiler implementation, but nothing difficult.
The only change for the user is needing to explicitly enter & exit threads for the profiler. An important note is that while threads must be explicitly instrumented, if you do not instrument a thread, it will not hurt the system in any way; the thread will just not show up in the profile stats.
How the profiler data is stored is a function of how much detail is needed. In my case I wanted full call trees which means it has to be dynamically allocated and call nodes must be dynamically located. Obviously this will require a hash table, but it should not be a generic fully functional hash table! Because of the limited scope of what the profiler needs, corners can be cut so that it does what it needs to quickly.
Probably the easiest gain to be made is not hashing the function name when searching for it in the hash table! If you use compile time constant strings like __FUNCTION__, you can simply take the address of the string as the hash value and skip the O(n) hashing of the string. Small fixups may be needed (I use >> 5 on the address as the strings are possibly aligned), but in the end it’s a big win.
If using an open hash table like I am, there is never a need to delete individual entries, so you do not need a tombstone value for a deleted key or a check in your lookup against them. There’s either a value in a slot, or it’s empty.
I also use linear probing because the table should be empty enough and the hash dispersion should be spread out enough that the extra code to do quadratic or better probing would slow the search down in the general case. I do not inline the elements in the table however; The hash table is an array of pointers to individually created profiler nodes, with each node being its own hash table. The size of each node is nearing 64 bytes with 32 bit pointers, so I figured scanning them would produce cache misses regardless, and inlining them in the table was more trouble than it was worth.
If the goal is to profile an entire run of a program, then you don’t need synchronization. Profile in to the threadlocal profilers, wait for all threads to finish, and combine the results. Easy!
My goal is more difficult. I want to not only dump the profiling results at any time, I want to be able to reset the results at any time. When I’m profiling games, I often want to profile a specific action or sequence of actions. I don’t want to include the startup time and whatever is required to get to that specific thing I want to profile, e.g. running a timedemo or analyzing a horrible script compiler.
So how to synchronize without slowing everything down, while still being able to crawl a potentially mutating hash table safely? I compromised by accepting that a few call nodes may get extra calls by locking on any mutation to the hash table array, but not mutations to the timing information. When the dump/reset are using a thread’s hash table, it will take the thread’s lock, crawl it’s information, and release it. The thread is still free to profile to existing nodes during this time, but any attempt to insert a new node will block.
Reset is handled not by deleting a call tree, but by crawling the tree and resetting each node’s timer to zero (threads which are no longer active are of course fully deleted). This has the byproduct of not only being able to reset all threads when they’re at arbitrary positions in their call tree, but also to call reset from arbitrary positions in any call tree. I ran in to this need the hard way when profiling Tribes script system, and the call to reset the profiler is done with a script command, bad things happened. To keep reset-but-not-yet-active nodes out of dumps, it’s necessary to check that nodes have a non-zero time before walking them.
A small caveat with reset: Because there is no synchronization for existing nodes, it’s possible that
ticks += ( getticks() - started )
could execute at the same time the thread that is resetting everything executes
ticks = 0
Now, if the code is generated so the value is summed directly in to ticks, there is no issue. ticks will either get incremented, then stomped, or stomped, then incremented. This seems to be true in both gcc and msvc release builds, so there is no issue. But in debug builds, ticks can be read in to a temporary variable, summed, and written back to ticks, which could stomp the 0 written by the reset thread. Since this isn’t an issue in release builds and will only happen on a freak chance in debug builds, I don’t see it being an issue, but it’s something to be aware of.
A fully working profiler is still not useful if it’s output is incomplete or difficult to read. Look at the documentation for understanding gprof’s call graph. 5-6 pages of text for a glut of confusing numbers and diagrams when a well-made call tree/graph doesn’t need documentation and can be understood easily with visual examination. Google’s perf-tools require you to run the output through a perl script, and still the results are hideous and hard to read! Which is why I came up with:
I originally only had the ASCII dumper. For games (which use text consoles) or command line apps, text output is quick and easy. A real tree structure is rendered so you aren’t left playing guessing games as to how far each item is indented or trying to match up rows of numbers to their function names on the other side of the screen. Self time while inside the tree is only useful to that specific branch, so I also include the top callers overall by self time so global bottlenecks can be identified.
Text is great, but I thought it would also be useful to have a fancier dump with things like colored hotspots and mouse hover highlighting. Archiving the generated HTML files is also an excellent way to track and show improvements over time.
As an aside, the HTML dump is actually the main source of the code bloat in the profiler now. I chose to integrate the generation of the HTML dump in the C++ code versus post-processing data with a script because every step between you and the final product is another reason to not use it at all.
To test out the performance of the type of applications I’m interested in profiling, I ran Thierry Berger-Perrin’s sphereflake app through a variety of profilers on my E8400. sphereflake was built for a single thread so Shiny would run, while my profiler (and everyone else) was running in multi-threaded mode. The only difference in the general case between single and multi-threading mode in my profiler is whether the pointer to the local state is a) static, or b) a thread-local static. The overhead from tls handling is non-applicable, so it can be run in multi-threading mode at all times without worry.
Non-instrumented run time for sphereflake is about 3 seconds on my E8400, and the functions I profile result in around 10 million calls during that period. The breakdown of performance by profiler:
- My profiler is 1.09x slower than baseline.
- Sleepy (the sampling profiler) is inexplicably is 1.19x slower. Sleepy itself didn’t seem to use any CPU time, so I can only guess that something it did with the system was affecting context switches or whatnot.
- Shiny, switched to use rdtsc as it’s timing mechanism, is 1.27x slower.
- GlowCode was unable to profile all of the functions needed regardless of how many optimizations I tried to turn off, hence the ¹. I eventually got it up to around 4.5 million calls, at which point it was slowing the program down by 1.89x. Not too impressive for the “WORLD’S FASTEST PROFILER!”.
- Shiny, using QueryPerformanceCounter, is 1.95x slower. This is why QPC is bad for high volume profiling, the latency adds up fast.
- Visual Studio 2005’s “Profile Guided Optimization” (not pictured) not only ran 4.75x slower than baseline during profiling, but after it “optimized” the app from the profile data, the app ran slower!
So that’s how to do high performance profiling! Even with Tribes generating 480+ million calls during a 2 ½ minute timedemo of a 25 minute map, it still runs at 300fps+ and accurately reports what is using time where. (No, it’s not OpenGL::setupBitmapTexCoords. BAD VTune!)