In this blog, I will log my stories on debugging adventures.
A few days ago, my program began to randomly crash after a few minutes. This was a multithreaded Win32 application and I dreaded trying to untwist the threads to find out where the memory corruption was happening.
The first clue was that my program was jumping to memory location 0x00000401, and it was always that same address no matter how long the program ran before crashing. I went through my code: there was one thing that used the constant 0x0401, so I tried changing the constant, but the crash still landed at 0x0401. One thing I realized early on was that when your program jumps to a random memory location, the likely cause is corruption of the stack: the return address is stored there, so when a subroutine returns through a corrupted one, it can jump anywhere. The other possibility is a corrupted virtual function table, but that is less likely. Stack corruption is pretty easy to cause.
I inspected the stack (looking at where ESP pointed at the time of the exception) and indeed there it was, a 01 04 00 00 sitting in memory. The trick now was figuring out how it got there. VS8 lets you put a breakpoint on data memory accesses, so I did that. However, that is when it got really weird. I discovered that the instruction writing to that part of the stack was writing a 0 to it. The exact assembly instruction was something like
MOV DWORD [EBP],0
which is supposed to write a 0 to the memory location at EBP. Well, the address in EBP was correct, but I certainly wasn't getting a zero! I figured one of the other threads must be doing it. But even when I froze all of the other threads, the same memory corruption would happen at the same ASM instruction! It seemed impossible.
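As an aside, the 01 04 00 00 seen in the memory window is simply 0x00000401 stored in little-endian byte order, which is why it matched the crash address. Here is a minimal sketch that shows the layout; the helper name bytes_in_memory is made up for illustration, and the byte order in the comment assumes a little-endian CPU such as x86.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical helper: returns the four bytes of `value` in the order
// they sit in memory, lowest address first.
std::array<unsigned char, 4> bytes_in_memory(std::uint32_t value) {
    std::array<unsigned char, 4> bytes{};
    std::memcpy(bytes.data(), &value, sizeof value);
    return bytes;
}

// On a little-endian machine, bytes_in_memory(0x00000401) yields
// 01 04 00 00: the least significant byte comes first.
```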
After talking to some people with more experience with Win32 x86 ASM, I thought maybe it had to do with a virtual memory (MMU) issue. That seemed unlikely, since it would probably take the entire WinXP OS down pretty quickly. But it got me thinking that, in addition to my application, the OS could be writing to this memory location in physical space, from a device driver or something.
Another day of tracking down this bug got me thinking that the problem was probably an orphaned pointer: a pointer to a variable that has gone out of scope but is still being used by the OS. I also had a rough idea where this orphaned pointer came from. If the corruption happens in the stack space of subroutine C, which is a child of subroutine A, then there is probably another child of A (let's call it B) that is the source of the orphaned pointer. That is, subroutine B creates some automatic stack variable, sends its address elsewhere, and then returns, destroying the variable. C is called later and its stack frame reuses the same memory. Well, sure enough, I found it:
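{
...
DWORD eventMask;   // automatic variable, lives on this function's stack
::WaitCommEvent(..., &eventMask, ...);   // may return before eventMask is written
...
return;   // eventMask's stack slot is released here
}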
This function can either fill in eventMask immediately OR IT CAN FILL IT IN LATER ON ITS OWN TIME. D'oh. Since eventMask was a stack variable that went out of scope at the return, the device driver behind ::WaitCommEvent was later writing its result into stack space that by then belonged to some other function call.
The solution was easy: just make eventMask a static variable, or otherwise make sure that it stays alive until the result of WaitCommEvent has been delivered.
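To make the lifetime rule concrete, here is a small portable sketch. It is not Win32 code: FakeDriver, CommWatcher and run_demo are made-up names, and FakeDriver only imitates a call that returns now and writes its result later. With the real API, the rough equivalent would be keeping both the DWORD and the OVERLAPPED structure alive until the pending operation is reported complete (for example through GetOverlappedResult).

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for a driver with asynchronous completion.
class FakeDriver {
public:
    // Returns immediately and remembers where the result must be written.
    void wait_event(std::uint32_t* mask) {
        pending_.push_back([mask] { *mask = 0x0401; });
    }

    // Performs every deferred write, as the OS would at completion time.
    void complete_all() {
        for (auto& write : pending_) write();
        pending_.clear();
    }

private:
    std::vector<std::function<void()>> pending_;
};

// Safe pattern: the result slot lives in an object that outlives the
// request, not in a local of the function that issues the request.
struct CommWatcher {
    std::uint32_t eventMask = 0;

    void start(FakeDriver& driver) { driver.wait_event(&eventMask); }
};

// Issues a request, lets the "driver" complete it, and returns the mask.
std::uint32_t run_demo() {
    FakeDriver driver;
    CommWatcher watcher;      // stays alive for the whole operation
    watcher.start(driver);    // returns before the mask is filled in
    driver.complete_all();    // the late write lands in watcher.eventMask
    return watcher.eventMask;
}
```

If start() had instead passed the address of one of its own local variables, complete_all() would have written through a dangling pointer, which is the bug described above.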
Oh, and the 0x0401 value was the event mask being written by the OS. It was just being written to the wrong spot.
That was a good 10 hours of debugging.