Arrrrgh…. I finally got it….
For months I have been trying to figure out why my cross-compiled kernel and rootfs won’t boot on the NSLU2 aka Slug. Since I didn’t want to solder pins for the serial I/O onto the board I tried a quite unique approach: Since the Slug has 4 LEDs that can be easily controlled through GPIO, I could hook up at some function and send all output through the LEDs. Sure this would be slow (~1 character per second, maybe 1.5) but that would be enough to at least read kernel panic messages while retaining full warranty on the hardware.
Finding out how LEDs were controlled took me about 1,5 hours, hooking up and testing first things another hour. I hooked up before the call of init (disk 2 on), right after init (disk 1 on) and on kernel panics (both disk leds on). So what happened: It got init and then crashed with a panic after a few seconds. I figured out I could hook up at uart_console_write() or panic() and then read any output by blinking a byte in 4 steps (one byte on each disk LED, signal indication by setting power to green and “clock” indication through power amber). Well it started blinking for hours and hours and hours… But all I could decode was just infinite rubbish, no matter what I tried. Even a comparison i
msleep() seems to suspend the currently running task, so it is non-blocking regarding the whole system.
mdelay() blocks the system (or at least the active CPU) if running single-threaded.
So why was that small change critical to my code? I don’t know exactly. But I know that panic() disables scheduling before any further action. So what happens if some code fragment used by panic() tries to relay on scheduling? Something seems to get corrupted very seriously, maybe some kind of heap or stack overflow happens. Maybe some process/scheduler data gets screwed up. I don’t know. But that seemingly tiny difference of blocking vs. non-blocking functions (what function does what isn’t always that clear if you’re new to Linux kernel programming) really makes a very big difference.
I finally recorded my kernel panic on dv tape and will decode it tomorrow using a simple tool I wrote. If it is finished I will make it available from this website.
Here’s a small excerpt from the ~20 minutes long message transmitting “BUG: sched[…]” (I did not decode more than that yet):
Edit: I decoded all 20 minutes. What was readable (the decoder was a quick and dirty solution since yet) led me to “BUG: scheduling while atomic:” in kernel source and was simply caused by a remaining msleep in my LED function. I got rid of it and now got a clean message “Attempted to kill init!”. Now, that’s where the debugging begins…