Hack Debugger: December 2008

Sunday, December 21, 2008

STR91x bootloader trouble, PFQ/BC

Most start-up code (91x_init.s for example) writes 0x191 to the SCU->SCR0 register, which among other things enables the PFQ/BC (Pre-Fetch Queue/Branch Cache). This is normally a good thing. Of the 16 entries in the Branch Cache, 15 are used for generic branches while the 16th one is used only for the special IRQ branch at address 0x18. It appears that this branch cache entry is initialized when the PFQ/BC is turned on AND is read-only after that. That means if the jump at 0x18 changes, the BC entry will be wrong.

Why would the jump at 0x18 ever change? It changes if you remap the banks, i.e. use BANK1 as the bootloader boot bank, and then switch to BANK0 as the main application. Of course, you shouldn't even do this if you are not using at least rev H of the STR912FAW, but that is a different story (the chip reset circuit has a bug that makes this impossible).

What happens if you don't do something special is a programmer's worst nightmare: jumps are random when you let the program run, but if you step through the ASM code, it works fine (because the cache is turned off when you are single-stepping). And the random jumps are not a software bug that you can fix.

But there is a fix. Change the boot code to disable the PFQ/BC, then reenable it later in your code.


; --- Enable 96K of RAM  & Enable DTCM & AHB wait-states, disable PFQ/BC until it is flushed. The bootloader has cached the IRQ already and that is BAD!
        LDR     R0, = SCU_BASE_Address  
        LDR     R1, = 0x0196  ; not 0x0191 as in other boot code!
        STR     R1, [R0, #SCU_SCR0_OFST]

Later you can reenable using the 91x_lib call (inside __low_level_init() if you are using IAR EWARM):


  SCU_PFQBCCmd(ENABLE);     // Enable the PFQ/Branch Cache feature of the STR91x.
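
The two steps can be modelled on a PC to see what ends up in the register. This is only a sketch: scr0_model is a plain variable standing in for SCU->SCR0, all function names are made up for illustration, and it ASSUMES that PFQ/BC enable is bit 0 of SCR0. Check the reference manual before relying on that.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for SCU->SCR0. On the target this is a hardware register. */
static volatile uint32_t scr0_model;

/* ASSUMPTION: PFQ/BC enable is bit 0 of SCR0. Verify against the STR91x manual. */
#define PFQBC_ENABLE_BIT 0x1u

/* Models the modified start-up code: RAM/DTCM/wait-states set, PFQ/BC left off. */
void boot_code_model(void)
{
    /* Sanity check: the boot value must not turn the PFQ/BC on. */
    assert((0x0196u & PFQBC_ENABLE_BIT) == 0u);
    scr0_model = 0x0196u;
}

/* Models a later SCU_PFQBCCmd(ENABLE): set the enable bit, keep the rest. */
void pfqbc_enable_model(void)
{
    scr0_model |= PFQBC_ENABLE_BIT;
}

/* Read back the modelled register value. */
uint32_t scr0_value(void)
{
    return scr0_model;
}
```

The point of the read-modify-write in the second step is that the other SCR0 settings made by the boot code are left alone.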

Friday, December 19, 2008

Anatomy of STR912 boot sequence in EWARM

Power-on: waits until voltages rise above Low-Voltage Detect, then sets PC to 0x00000000

First instruction jumps to Reset Handler. The jump is usually a LDR PC,[PC,#xxx] instruction that loads the PC from a table just past the end of the vector handlers.

Reset Handler is always written in ASM because it needs to do things that are unsafe in C, such as initializing the flash and RAM interfaces, and it needs access to the CPSR register, which you don't have from C. It usually sets up the flash memory interface, sets up stack space for each of the system modes, then jumps to ?main. In EWARM, the Reset Handler is installed at 0x180, and is given the symbol __iar_program_start. If you do not write your own, EWARM will pick one for you.

?main - This does basically 3 things:

1. Call __low_level_init(). You can write this in C, as long as you are careful not to use global variables. This is useful for setting the clock speed and other things that don't need to be done in ASM. EWARM has a default dummy __low_level_init().

2. Call __iar_init_memory. Not something the user should deal with. This is where the C global variables are initialized, and global C++ objects are constructed.

3. Call main. This is the traditional C entry point.

main - Your plain old C function where the real fun begins.

I've seen several examples (the code from ST) that do the low level init in main() and not earlier. This is not a big deal, but it means that the clock will be running much slower on power-up, so if you are doing a lot of C++ initialization, it will boot more slowly (perhaps 4x slower if you are using a 25 MHz crystal and can run the PLL at 96 MHz).
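
The "perhaps 4x" figure is just the ratio of the two clocks. A tiny sketch of the arithmetic (clock_ratio is a made-up helper, and it assumes the core runs straight off the crystal until the PLL is set up, ignoring flash wait-states):

```c
#include <assert.h>

/* Ratio between the PLL clock and the start-up clock, both in Hz. */
double clock_ratio(double fast_hz, double slow_hz)
{
    assert(slow_hz > 0.0);   /* guard against dividing by zero */
    return fast_hz / slow_hz;
}
```

For 96 MHz against 25 MHz this works out to 96/25 = 3.84, so initialization code that runs before the PLL is switched on takes a bit under four times as long.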

EWARM "unable to halt core" on STR912

There are a few reasons why you might be unable to halt the core using the Jlink debugger and EWARM, on a STR912 chip.

1. You wrote to the wrong registers and misconfigured the memory configuration. In this case, you might need to erase the chip and use the debugger to step through the code until you find the bad instruction. You can erase the chip using either ST's CAPS program, or you can do it with the command line utility "J-Link STR9 Commander" that you can download from Segger as part of the free jlink tool suite.

2. Your USB J-link debugger is plugged into a slow USB port, PCMCIA-to-USB hub or USB-to-USB hub. I had lots of trouble with my PCMCIA-to-USB hub. Using the built-in USB port works great.

3. You accidentally turned on the security bit using CAPS. Once it is turned on, your program will run, but you will not be able to connect to the device (except to erase the chip). You will need to do a full chip erase to regain control. See #1 for hints on erasing the chip.

Thursday, December 18, 2008

Quick and dirty Message Box from command line

If you are building a tool chain that is primarily command line scripting, but you need to alert the user to a problem, the simplest built-in Windows method is to use VBScript and the msgbox command.

Create a one line file whose name ends in *.VBS (failmsg.vbs for example), and contains something like this:


msgbox "Build failed. Please check Log files"

then in your scripts (i.e. a BAT file), call the vbs file using


cscript /nologo failmsg.vbs

You could also pass parameters to the vbs script, but since I know nothing about vbs except what you just saw, I don't know how.

Perl, BAT files and quoteless quoting.

As much as I hate DOS batch files (*.BAT), sometimes you are forced to use them for integration with other build tools.

One frustration I had with a BAT file was that the normal Unix rules for quoting, using nested single and double quotes, don't work.

For example, in unix (or Cygwin), you can do


perl -e 'print "Hello\n";'

But DOS doesn't know about single quotes. If you try


perl -e "print 'Hello\n';"

You will get


Hello\n%

... the newline is ignored since you used single quotes. (My version of Cygwin displays the % meaning that the output does not have an EOL character).

The workaround is to use the rarely used "qq" perl operator to do quote-less quoting. Thus the following example


perl -e "print qq{Hello\n};"

prints out the correct message with the newline.


Hello

Built-in "fix me" alarms in source code

Several times during development, it makes sense not to fully implement all of the features of a particular function and just leave a stub that gets your code to compile. Somewhere in the comments, I usually put "fixme" so that it is easy to find again later when I finish everything else.

However, as time passes, sometimes you don't remember to search for all the "fixme"s and these can creep into production builds.

A simple hack is to put time bombs in your code that refuse to compile after a certain time has elapsed, e.g. 2 weeks.

In your prebuild hook, write something that writes a time stamp to a header file, like this:


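#! c:/perl/bin/perl

use strict;
use warnings;
use POSIX;
# Write today's date as YYYYMMDD into buildtime.h
open my $fh, '>', 'buildtime.h' or die "Cannot write buildtime.h: $!";
print $fh POSIX::strftime(qq{#define BUILDTIME %Y%m%d\n}, localtime);
close $fh;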

Then in your code, put in hooks like this:


#include "buildtime.h"
#if BUILDTIME > 20090125
#error Fix this bug
#endif

where the time stamp is simply the concatenation of year, month and day (YYYYMMDD), in that order to get the greater-than comparison to work. In my example, the code will refuse to build after Jan 25, 2009.
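
If Perl is not available on the build machine, the same stamp could be produced by a small C tool. This is only a sketch; build_stamp and format_buildtime_line are names I made up, and writing the result into buildtime.h is left to the caller.

```c
#include <assert.h>
#include <time.h>

/* Pack a date into the YYYYMMDD integer used by the #if comparison.
   Year is the most significant part, so later dates always compare greater. */
long build_stamp(int year, int month, int day)
{
    assert(month >= 1 && month <= 12);
    assert(day >= 1 && day <= 31);
    return year * 10000L + month * 100L + day;
}

/* Format "#define BUILDTIME YYYYMMDD\n" into buf for the given date.
   Returns the number of characters written, or 0 if buf is too small. */
size_t format_buildtime_line(char *buf, size_t size, const struct tm *when)
{
    assert(buf != NULL && when != NULL);
    return strftime(buf, size, "#define BUILDTIME %Y%m%d\n", when);
}
```

A prebuild tool would call format_buildtime_line() with localtime() of the current time and write the buffer to buildtime.h. build_stamp() shows why the order matters: 20090126 is greater than 20090125, which would not hold for a DDMMYYYY layout.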

EWARM "bug" when dealing with ARM-based VIC

The VIC on the STR912 is a tricky beast if you use the debugger, because viewing the status of the registers will mess up the VIC logic. The way the VIC works is that it has 2 modes, active and inactive. To go from active to inactive, you read the VIC ADR register, and to return you write to the VIC ADR register. However, the debugger also reads the VIC registers when you halt the core. Thus, if you halt the processor at all while the VIC register display is visible, basically the VIC will stop working.

The only time it is safe to view the VIC is *after* it is in inactive mode, i.e. after the read from the VIC ADR register.

Otherwise, the symptom is that the VIC never triggers the IRQ any more, even if the register bits all show the interrupts are asserted.
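
Here is a toy model of that behaviour in C. It is not the real hardware, just the two-state description above turned into code; vic_model and all the function names are invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the two VIC modes described above (not real hardware). */
typedef struct {
    bool active;   /* true: VIC may raise an IRQ; false: waiting for an ADR write */
    bool pending;  /* an interrupt source is asserted */
} vic_model;

/* Any read of the ADR register (ISR or debugger) moves active -> inactive. */
unsigned vic_read_adr(vic_model *v)
{
    assert(v != NULL);
    v->active = false;
    return 0u;   /* the handler address is irrelevant for this model */
}

/* Writing the ADR register moves inactive -> active again. */
void vic_write_adr(vic_model *v)
{
    assert(v != NULL);
    v->active = true;
}

/* The core only sees an IRQ if a source is pending AND the VIC is active. */
bool vic_irq_delivered(const vic_model *v)
{
    assert(v != NULL);
    return v->pending && v->active;
}

/* Debugger register window reads ADR while halted; nobody ever writes it back. */
bool irq_after_debugger_read(void)
{
    vic_model v = { true, true };
    (void)vic_read_adr(&v);
    return vic_irq_delivered(&v);
}

/* Normal ISR: read ADR on entry, write ADR on exit, then a new IRQ arrives. */
bool irq_after_normal_isr(void)
{
    vic_model v = { true, false };
    (void)vic_read_adr(&v);
    vic_write_adr(&v);
    v.pending = true;
    return vic_irq_delivered(&v);
}
```

In this model irq_after_debugger_read() returns false even though the source is still pending, which matches the symptom: the bits show the interrupt asserted, but the IRQ never fires.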

Sunday, December 7, 2008

Linker Tools Warning LNK4217

If you get this warning and you are trying to link an EXE with a static LIB file, check that the code generation option is "static" for both the EXE and for the LIB. Don't select a DLL option for one and static for the other.

Saturday, December 6, 2008

Detecting whether a USB device is plugged in, via a Makefile or script

I needed a quick (5 minute) solution that would check that my USB dongle was inserted before running the scripted build. The dongle contains the license key for the compiler.

I found a small application supplied by Microsoft called DEVCON.EXE. You can get it from this blog or from Microsoft. The two key commands are 'devcon status *' or 'devcon find ...'. First you need to figure out what the Dongle is called in device language.
Using Cygwin, I ran devcon twice, once with the dongle inserted and once without.


devcon status '*' > a     # dongle inserted
devcon status '*' > b     # dongle removed
diff a b

That got me the crucial VID and PID values "USB\VID_04B9&PID_0300" and saved me from paging through hundreds of devices. I used the * wildcard to ignore everything after the PID. When you run 'devcon find', it will also give more information about the device. I use this additional information to make sure that I have the right device. In my case, the string "SafeNet" appears.

Then I constructed this target in the Makefile that would make sure that the dongle was inserted, or else fail the build immediately. I used the result of 'grep' (success if it finds a match, fail if it doesn't) to conditionally run the "echo" command.


all: donglecheck
	$(MAKE) otherstuff

donglecheck:
	@./common/tools/devcon find 'USB\VID_04B9&PID_0300*' | grep 'SafeNet' || ( echo "*** Please Insert IAR Dongle for EWARM ***" ;  false )

The reason for this is that the IAR compiler will hang for 5 to 10 minutes before timing-out if there is no dongle, and I can't wait that long.

Perl and DLLs

I had written a DLL to be used with SWIG and Perl. It worked fine on my laptop (with Visual Studio 8 installed). However, if I moved it to another laptop, perl would refuse to load the associated module file with an error that said something to the effect that "it could not load the module because the application was incorrectly configured". I should paste the exact text, but I don't have it with me right now.

To make a long story short, this means that either the main DLL is not there, or one of the linked DLLs is not there. I knew the main DLL was there, but it turned out that the problem was that I had used the DEBUG build, which uses the debug version of the CRT (C runtime) library. As a check, I copied the debug version of the MSVCR*D.DLL to the other laptop and it magically worked!

The long term solution here was to rebuild using the Release build. This means,

Properties->Configuration Properties->C/C++->Code Generation->Runtime Library = Multi-threaded DLL (/MD)

This option has to be set for ALL projects that end up in the final DLL. It is a compiler option, not a linker option, which doesn't make any sense to me. But it is.

Case closed.

So I thought.

Then I pressed onwards and installed on another (older) laptop. Same error message. But in this case, I realized that I was compiling with VS 2005 (Visual Studio 8), which is newer than WinXP, and that maybe the laptop did not have the latest service pack. So I copied the redistributable DLLs for the CRT (C:/Program Files/Microsoft Visual Studio 8/*/*/redist/*CRT*/*.DLL) and everything worked!

Case doubly closed!

Now the perl module loaded properly. But then when the perl script tried to launch another application (in the same directory), I got a new fault message! To make this long story short, yet again, I needed the MFC8 and MFCLOC8 libraries to get the application to run.

Unexplainable Memory Corruption -- Explained by local stack variables and OS call

In this blog, I will log my stories on debugging adventures.

A few days ago, my program began to randomly crash after a few minutes. This was a multithreaded Win32 application and I dreaded trying to untwist the threads to find out where the memory corruption was happening.

The first clue was that my program was jumping to memory location 0x00000401 and it was always the same location no matter how long it took to get to that address. I went through my code: there was one thing that used the constant 0x0401, so I tried changing the constant, but the error remained at 0x0401. One thing that I realized early on was that when your program jumps to random memory locations, it is likely to be corruption of the stack: the return address is stored there, so when your subroutine does a return, it can jump to a random location. The other possibility is corrupting your virtual function table, but that is less likely. Stack corruption is pretty easy.

I inspected the stack (looking at the ESP pointer at the point of exception) and indeed there it was, a 01 04 00 00 sitting in memory. The trick now was figuring out how it got there. VS8 lets you put a break point on data memory accesses, so I did that. However, that is when it got really weird. I discovered that the instruction that was writing to that part of the stack was writing a 0 to it. The exact assembly instruction was something like

MOV DWORD PTR [EBP],0

which is supposed to write a 0 to the memory location at EBP. Well, the memory location was correct in EBP, but I certainly wasn't getting a zero! I figured it must be one of the other threads that is doing it. But even if I froze all of the threads, the same memory corruption would happen at the same ASM instruction! It seemed impossible.

After talking to some people with more experience with Win32 x86 ASM, I thought maybe it had to do with a virtual memory (MMU) issue. That seems unlikely, since that would probably take the entire WinXP OS down pretty quick. But it got me thinking that in addition to my application, the OS could be writing this memory location in physical space, from a device driver or something.

Another day of tracking down this bug got me to thinking that this problem is probably an orphaned pointer, one that has gone out of scope but is still being used by the OS. Also, in round-about terms, I had a good idea where this orphaned pointer came from: if the corruption happens in the stack space of subroutine C which is a child of subroutine A, then there is probably another child of A (let's call it B) that is the source of the orphaned pointer (or rather, subroutine B creates some automatic stack variable, sends its address elsewhere, and then destroys the stack variable). Well sure enough I found it:

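{
...
DWORD eventMask;                        // automatic (stack) variable
::WaitCommEvent(..., &eventMask, ...);  // the OS keeps this address and may write to it later
...
return;                                 // eventMask is gone, but the OS still has its address
}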

This function can either fill in eventMask immediately OR IT CAN FILL IT IN LATER ON ITS OWN TIME. D'oh. Since eventMask was a stack variable that goes out of scope after the return, ::WaitCommEvent was the OS device driver call that was writing to my stack when I didn't want it.

The solution was easy: just make eventMask a static variable or otherwise make sure that it stays in scope until the result of WaitCommEvent is determined.

Oh, the 0x0401 value was the eventMask being written by the OS. Except it was written in the wrong spot.
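
The fix can be shown with a portable sketch. Nothing here is the real Win32 API: fake_wait_comm_event and fake_driver_completes are made-up stand-ins for an OS call that remembers the caller's pointer and writes through it later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the OS side: it remembers where the caller
   wants the event mask and fills it in at some later time. */
static uint32_t *pending_mask = NULL;

void fake_wait_comm_event(uint32_t *mask)
{
    assert(mask != NULL);
    pending_mask = mask;   /* the "OS" keeps the pointer after we return */
}

/* Called later, on the driver's own schedule. */
void fake_driver_completes(uint32_t value)
{
    if (pending_mask != NULL) {
        *pending_mask = value;   /* writes wherever the caller pointed */
        pending_mask = NULL;
    }
}

/* The fix: static storage outlives the function that starts the wait.
   A local "uint32_t event_mask;" inside start_wait() would be gone by the
   time the driver writes, and the write would land in someone else's frame. */
static uint32_t event_mask;

void start_wait(void)
{
    event_mask = 0;
    fake_wait_comm_event(&event_mask);
}

/* Read the mask after the driver has had a chance to complete. */
uint32_t last_event_mask(void)
{
    return event_mask;
}
```

With the static variable, the late write of 0x0401 lands in event_mask where it belongs instead of on top of some other function's stack frame.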

That was a good 10 hours of debugging.
