Strange Heisenbug

24 Jan 2011

Hi,

OK, this is a long shot, but I figured it couldn't hurt to post just on the slim chance somebody's seen similar strange behavior.

I have a snippet of code that looks like this:

typedef unsigned char byte;
const int NumLEDs = 4;
DigitalOut leds[NumLEDs] = { DigitalOut(LED1), DigitalOut(LED2),
                             DigitalOut(LED3), DigitalOut(LED4) };

void lcdnumber( CheapLCD * lcd, int y, int32_t value )
{
    HoldInterrupts noIR;
    leds[0] = 0;  // Adding these lines avoids
    leds[1] = 0;  // freeze in lcd I/O code below?!?!
    lcd->clear( BLACK, (byte)10, (byte)(y-2), (byte) 80, (byte)(y+12) );
    lcd->draw_number( WHITE, BLACK, (byte)10, (byte)y, value );
}

Now, I'd be calling lcdnumber from an interrupt routine (attached to PinDetect) and it worked fine. But then I tried to add another call to it from the main loop of the code, and it froze in lcdnumber(). I started turning the LEDs on at various places in the code to see where it failed.

Now here's the weird part. As soon as I set LED1 and LED2 in the lcdnumber() routine, it started working. Take either of those statements out, and it freezes up. More tracking down with printf() shows that the actual freeze is happening when it writes an SPI command to p11/p13 (that's what the "cheap LCD" is connected to). It actually doesn't matter what I set LED1 and LED2 to (zero or one), as long as I set both before I call the into LCD I/O code.

Can anybody fathom why setting the LEDs would magically cause the SPI::write() call to not freeze up? I'm willing to leave those lines in the code, but I'd sure like to know why they're needed...

24 Jan 2011

If you are calling the function from "main loop" code did you remember to remove the irq context callbacks that PinDetect will make as it sounds like a timing issue of some sort. What happens if you remove the setting of the led[0]/led[1] and replace it with a short wait_us(1000); (pr similar) type instead? It may not be loading the values with anything that is maling the difference, it maybe just timing related in some way.

24 Jan 2011

Hi Andy, thanks for the reply.

I tried substituting wait_us(1000), instead of setting the LEDs, but it still froze. I suspected that interrupts might be an issue, so I shut them off while talking to the LCD object; that's what the HoldInterrupts object does. The code for it looks like this:

// HoldInterrupts.h 
class HoldInterrupts {
public:
    HoldInterrupts();
    ~HoldInterrupts();
};

// HoldInterrupts.cpp
static int gHoldInterruptLevel = 0;

HoldInterrupts::HoldInterrupts() 
{
    if (gHoldInterruptLevel == 0)
        __disable_irq();
    gHoldInterruptLevel++;
}

HoldInterrupts::~HoldInterrupts() 
{
    gHoldInterruptLevel--; 
    if (gHoldInterruptLevel == 0)
        __enable_irq();
}

However, using HoldIterrupts had no effect. For now I've left it in, since shutting off interrupts while talking to SPI devices doesn't seem like a bad idea...

24 Jan 2011

Btw, I should point out, above you said "calling lcdnumber from an interrupt routine (attached to PinDetect) and it worked fine". It may have worked fine but you could have been on "unstable ground". Let me explain...

Since we are in a "tiny world" of the embedded micro, we don't have the luxury of a multitasking OS to manage our resources. So we need to do it ourselves. Most programs we write will start at main() and within there we find an infinite loop, basically liek this:-

int main() {
    // usually some setup code and then 
    while(1) {
        // this is where our program sits.
    }    
}

You can think of this part of the program "main context" or "user context". However, another context exists, the "interrupt context". This exists when some event fires an interrupt and it's assocate interrupt servive rountine is called. Knowing which context you are in is pretty important and I'll explain why.

I am not sure how Mbed have implemented their Serial but lets follow an often used example of serial output buffering. When you call putc(c) for the first time the hardware is idle, it's sending nothing so putc() can write the byte to the THR register and return. Takes very little time. However, a following putc(c) will find the hardware busy as it's still sending the first character. So it stores c in an output buffer and enables transmitter going idle interrupts so that when the first character has completed sending, the ISR gets the second c from teh output buffer and sends it. When the output buffer is empty it switches off the transmitter going idle interrupt. Simple enough yes? Remember, functions such as printf() will call putc() "under the bonnet" multiple times for the string it sending.

Ok, so with that in mind, this is where it gets important. In your user code it's possible to have callbacks from library components (eg InterruptIn, Ticker, Timeout, PinDetect uses Ticker callbacks internally, etc etc). When one of these callbacks are made you are in "interrupt context". Why is this important? Well, while you are in interrupt context any other interrupts that fire will not be serviced until the current interrupt is completed (ie once you have returned from the currect callback). Now, knowing this, if we block in a callback, serial sending will stop sending. Worse, if your callback does a printf()... well, you get the idea, that's going to end in deadlock if the output buffer fills and putc() blocks until there's space which there never will be because it can't send.

(note, I was writing this when you posted a followup)

I doubt SPI::write() uses an interrupt (although I may be wrong, we can't see the source). It's SSP hardware only has "half full" and "half empty" irqs so is generally pretty useless, so I would guess that it merely waits for the current data to be sent, reads DR and returns it.

Are any callbacks kicking off any SPI::writes() ? Basically (thinking aloud here) if you call an SPI::write() in main loop context it'll wait until it's sent the data and return DR the result. However, if a callback starts a new SPI::write while the first is still sending then it's write will go into the SPI output FIFO and it'll then return the DR result from the main context SPI::write(). Your interrupt context SPI::write will then return it's value to the main context SPI::write. That's a possible timing issue that could cause havoc if true. Maybe Simon et al could enlighten us on some of this?

24 Jan 2011

This might be the same problem that causes the malloc() problem and the floating-point problem. I have seen much more cases of such strange program hangs, and they go away if one manages to change the memory layout (e.g. adding or removing printf's, or adding / removing class members). If it is the same problem, it should go away with the next compiler update.