originally written circa 2009
Here's the bug I found today:
Doesn't immediately seem menacing, does it?
But this was a showstopper for a major product release on SuSE 9 Linux. It only interfered with one feature—the deployment of services to an application server; but that's enough to hold up a significant segment of testing, and of course we can't ship with a major feature not working. The user-visible symptom: try to deploy, and the client reports that the server closed the connection without sending a response. Check the communications-server log: nothing out of the ordinary. Check the execution-server log, though, and you'll see that the communications server crashed due to SIGSEGV and was automatically restarted by the execution server. Unpleasant, particularly if you happened to be using the communications server for anything else at the time. The best-effort fatal-condition handling in the communications server wasn't catching it. I build the comms server for debug and run under gdb: the debugger catches the signal, but the backtrace is garbled. Try again, stopping before the error occurs and single-stepping into it, and I see that the signal is being raised in the function prologue—before any of my code was actually being reached. The icing on the cake was that this was code that hadn't been changed in nearly a year, and had been working fine.
Into the Breach
I took a look at the generated assembly (in gdb, that's "x/10i address", to see the 10 instructions around the faulting address, which gdb reports when it catches the signal). The function prologue on x86 Linux does some setup, subtracts a value from the stack pointer, and then calls the next instruction—seems a bit odd to me, but that's the ABI, apparently. The SIGSEGV was raised on the call. What's a call do? Typically, it pushes the current instruction pointer address onto the stack and loads a new address into the instruction pointer. My first thought was that the new address was invalid, and I was getting a SIGSEGV trying to read it; but no, it was fine. Then, I have to admit, I spent far too long trying other dead ends on the assumption that gdb was reporting bad information, before I took another look at the code and realized what the problem was. It was that bit just before the call where the prologue "subtracts a value from the stack pointer". That subtraction operation reserves room on the stack for automatic storage-class variables declared by the function. And in this case, it was subtracting a great big number. The comms server is aggressively multithreaded, so when it starts threads it starts them with a rather small stack, and a function that tries to allocate a whopping great chunk of it is going to overrun that. In this case, it put the address of the top of the stack on a page that wasn't writable, so when the call instruction tried to push the IP, it faulted. That's extremely dangerous, because it's hard to detect and hard to diagnose when you're lucky enough to detect it. Note how this works (in this implementation): the compiler doesn't check to see that the stack has room for what it's trying to allocate, it just adjusts the stack pointer by a value. That might land you on a non-writable page, as in this case. Or it might land you on a page that you can write to, and the function silently works, but trashes memory belonging to some other part of the program. And maybe while the function is in progress another thread gets a timeslice and overwrites the return address, and when the first function ends it returns into some other place entirely. Normally, I avoid this problem by not declaring giant auto-class objects. What had happened here was that the NGROUPS_MAX macro, which is set by a system header and in traditional Unixes was a relatively small value (10 or so), is 65536 in SuSE 9. And gid_t is a 4-byte type, so the declaration I quoted above was trying to grab a tidy 256KB of stack space. My quick fix was to replace NGROUPS_MAX with "20". This particular array is used only to iterate through the supplementary group list of the current effective user to determine if group permissions apply before performing file I/O, and that operation is actually only used by the comms server when deciding whether an HTTP request for a directory list can be satisfied, which is not even a feature we document. So I'm not worried about a false negative (which will fail safe, by not exposing information) here.
Looking at Listings
Perhaps more interesting is how I checked for more instances of the same problem. I hacked the makefile to compile everything to assembler (by changing "-c" to "-S" on the gcc command line), then did a quick grep through the assembler source for instructions featuring large values and the stack pointer:
grep '$[0-9][0-9][0-9][0-9][0-9]*, %esp' *.s
(That regex finds literals of four or more decimal digits being applied to the stack pointer, in the style of gcc's x86 assembly listing.) This caught one more case, a completely unnecessary 8KB buffer in the HTTP parser. At least that's code I inherited from someone else, not my own mistake. The nice thing about this technique is that it catches this problem even when it's not obvious in the C source, as with my NGROUPS_MAX bug, because in the assembly all macros have been expanded, constants folded, and so on. By the way, you can create nice combined-C-and-assembly listings with Microsoft Visual C on WIndows and do similar tricks with them. Use the "/FAcs" switch, and optionally direct the output to a file other than the default (filename.cod) with the "/Fapathname" switch. In the my group, we include assembly-listing generation as part of our Windows build process, and save the listings in source control along with the binaries. That makes debugging dumps much easier.