Saturday, July 21, 2018

Off-Target Testing And TDD For Embedded Systems

I've recently started reading things by James Grenning (Wingman-sw.com), one of the authors of the Agile Manifesto. My interest in his work relates to Test-Driven Development (TDD) for embedded systems.

A copy of his book Test Driven Development for Embedded C is currently winging its way to me. His site links to a webinar he gave last summer, Test-Driven Development for Embedded Software, that makes a great introduction to the topic.

I found one of his answers on Quora interesting. The question was: Can I perform a unit test when creating C firmware for ARM Cortex-M MCUs? The answer, of course, is yes. Specifically, testing can be done off-target (i.e. not on the target embedded system).

I wrote a long comment on the answer, and decided it might make an interesting blog post. So the remainder of this post reproduces it substantially as it appears there, with some cleanup. He very kindly asked if I would be interested in adding it to his Stories From The Field.

My Three Stories

I can offer three anecdotes that show why I give a big thumbs up to off-target testing. Off-target testing puts you on target!

The first case was back in 1995. I had recently transferred to the DEChub group at Digital Equipment Corporation to work on networking equipment.

They had a problem with their popular DECbridge 90 product, an office- or departmental-scale stackable Ethernet bridge running an in-house custom RTOS on Motorola 68K, all written in C. It would run for weeks or months at a customer site, then suddenly crash. That would throw the LAN into a tizzy as it went through a Spanning Tree Protocol (STP) reconfiguration event. Then the LAN would do it again once the bridge came back up and advertised its links.

So it could be very disruptive to the local network and everyone on it, and it was completely unpredictable. No one had been able to reproduce the problem in the lab.

I was tasked with finding and fixing it. This platform had very little in the way of crash dump and debug support, and software update was done by burning and replacing an EPROM. It did have an emulator pod, so that was how most debugging was done.

The problem here was the long run time between failures. That made trying to collect useful information from repeated test runs, whether on real hardware or via the emulator, impractical to the point of impossibility.

The one clue from the crash log was that it was an OOM (Out Of Memory) condition. The question was why. Other than supporting STP, which is a bit of complex behavior, a bridge is a pretty simple device, just L2 forwarding: a packet comes in, look up its destination in the bridge tables, forward it out the appropriate interfaces.

The key dynamic structure was the MAC address table. A bridge is a learning device in that it learns which MAC addresses are attached to which links. It builds up the table as it runs, learning the local network topology and participating in STP. So this table was certainly a prime suspect, but it had capacity for thousands of entries, yet it was crashing in LANs with only tens or hundreds of nodes.

The table used a B-tree implementation that was public-domain software from some university. We speculated that it was a memory leak, either in the B-tree itself or in our interfacing to it.

So I pulled out the B-tree code and built a test program for it that would go through tens of thousands of adds and deletes in various simple patterns. This is similar to the type of test fixture that Brian Kernighan and Rob Pike later talked about in their book The Practice Of Programming.
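For flavor, here is a minimal sketch of that kind of harness, written as C-style C++ to match the later examples. The btree_* names and header are hypothetical stand-ins for the actual public-domain API; the point is simply to hammer add/delete patterns in a tight loop somewhere you can watch process memory.

```cpp
// Minimal off-target exercise harness (sketch only).
// The BTree type and btree_* functions are hypothetical stand-ins for
// whatever API the extracted B-tree code actually exposes.
#include <cstdio>
#include "btree.h"   // the extracted code under test (hypothetical header)

int main() {
    BTree *table = btree_create();

    // One simple pattern: add a batch of keys (stand-ins for MAC addresses),
    // then delete them all, over and over. In steady state, process memory
    // should stay flat; monotonic growth means a leak somewhere.
    for (int cycle = 0; cycle < 10000; ++cycle) {
        for (long key = 0; key < 200; ++key) {
            btree_insert(table, key);
        }
        for (long key = 0; key < 200; ++key) {
            btree_remove(table, key);
        }
        if (cycle % 1000 == 0) {
            std::printf("cycle %d done\n", cycle);  // good spot to check memory use
        }
    }
    return 0;
}
```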

I ran this off-target, on a VAX/VMS. VMS supported a simple Ctrl-T key in the terminal interface that would show the process memory consumption, similar to what the Linux ps command shows. The turnaround time on playing with this setup was minutes, build and run, with the full support of an OS to help me chase things down, including good old printf logging and customized dumping of data structures (VMS also had a debugger similar to gdb).

With this I could see that under some patterns, memory consumption was monotonically increasing. So yeah, a memory leak. Further exploration allowed me to home in on the right area of the code.

It was right there in the B-tree memory release code: it would free the main B-tree nodes (the main large data element being managed), but not the associated pointer nodes it used for bookkeeping. So on every B-tree node release, it would leak 8 bytes of memory.

This was a case of a very slow memory leak that only manifested with lots of table changes. In a customer environment, it could take a long time to chew through the memory. In the lab, running on-target, it was even slower, since we didn't know the cause and therefore didn't know how to trigger and reproduce it.

Off-target, it took less than an hour to find. Code change: add two lines to free the pointer nodes. This was after many man-weeks of effort for all the people involved in trying to reproduce and chase down the problem, plus all the aggravation caused at customer sites. Ten minutes to code, build, and verify the fix.
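To illustrate the shape of the bug and the fix (this is a hypothetical node layout, not the original source), the release path just needed to walk and free the bookkeeping pointer nodes as well as the node itself:

```cpp
#include <cstdlib>

// Illustrative only: hypothetical structures showing the shape of the leak.
struct PointerNode { PointerNode *next; void *referent; };
struct BTreeNode   { long key; PointerNode *pointers; };

static void release_node(BTreeNode *node) {
    // The original release path freed the node but skipped its bookkeeping
    // pointer nodes, leaking a little memory on every release. The fix
    // amounted to also walking that list and freeing each entry:
    for (PointerNode *p = node->pointers; p != nullptr; ) {
        PointerNode *next = p->next;
        std::free(p);
        p = next;
    }
    std::free(node);
}

int main() {
    BTreeNode *node = static_cast<BTreeNode *>(std::malloc(sizeof(BTreeNode)));
    node->key = 42;
    node->pointers = static_cast<PointerNode *>(std::malloc(sizeof(PointerNode)));
    node->pointers->next = nullptr;
    node->pointers->referent = nullptr;
    release_node(node);   // frees both the node and its pointer node
    return 0;
}
```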

The second case was just recently. I implemented an FSM based directly on the book Models to Code: With No Mysterious Gaps, by Leon Starr, Andrew Mangogna, and Stephen J. Mellor (backed up by Executable UML: A Foundation for Model Driven Architecture, by Stephen J. Mellor and Marc J. Balcer, and Executable UML: How to Build Class Models by Leon Starr; I highly recommend the trio of books). Thank you, Leon, Andrew, Stephen, and Marc!

I was also reading Robert C. Martin's Clean Architecture: A Craftsman's Guide to Software Structure and Design and Clean Code: A Handbook of Agile Software Craftsmanship at the time, which heavily influenced the work (and finally motivated me to fully commit to TDD; Grenning contributed the chapter "Clean Embedded Architecture" to Clean Architecture). Mellor and Martin are both additional Agile Manifesto authors.

A product of all this reading, the FSM was a hand-built and -translated version of the MDA (Model-Driven Architecture) approach, in C on a PIC32 running a bare-metal superloop.

The FSM performed polymorphic control of cellular communication modules connected via a UART. The modules use the old Hayes modem "AT" command set to connect to the cell network and perform TCP/IP communications.

It was polymorphic because it had to support 4 different modules from 2 different vendors, each with its own variation of AT commands and patterns of asynchronous notifications (URCs, Unsolicited Result Codes).
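To make "polymorphic" concrete: in C, the usual way to do this (and roughly the shape I used, though all the names below are illustrative, not the project's actual identifiers) is a per-module table of function pointers that the FSM calls through, so the state machine logic never knows which vendor's AT dialect it is driving.

```cpp
// Sketch of per-module polymorphism in C-style code; every name here is a
// hypothetical illustration.
#include <cstdio>

typedef struct ModuleOps {
    const char *(*attach_cmd)(void);            // vendor-specific "attach to network" command
    int         (*classify_urc)(const char *);  // classify an unsolicited result code
} ModuleOps;

// Stand-in UART transmit; the real code queued bytes to the UART driver.
static void uart_send(const char *s) { std::printf("TX: %s\n", s); }

// One ops table per module variant (only one shown here).
static const char *vendorA_attach(void)       { return "AT+COPS=0"; }
static int vendorA_classify(const char *line) { return line[0] == '+'; }
static const ModuleOps vendorA_ops = { vendorA_attach, vendorA_classify };

// The FSM holds a pointer to the active module's table and calls through it,
// so the same state machine path works for every module.
static const ModuleOps *active_ops = &vendorA_ops;

static void fsm_request_attach(void) {
    uart_send(active_ops->attach_cmd());
}

int main() {
    fsm_request_attach();
    return 0;
}
```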

If you think LANs and WANs are squirrelly, just wait till you try cellular networks. I could hardly get two test runs to repeat the same path through the FSM. Worse, there were corner cases that the network would only trigger occasionally.

It was horribly non-deterministic. How can I be sure I've built the right thing when I can't stimulate the system to produce the behavior I want to exercise?

The solution: build a nearly-full-coverage test suite to run off-target. I built a trivial simulator with fake system clock and UART ISR that I ran on an Ubuntu VM on my Mac. That gave me full support for logging and gdb.

This wasn't quite TDD, but it was one step away: it was Test-After Development, and instead of Google Test/Mock or some other framework, I built my own ad-hoc fakes and EXPECT handling.
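As a rough idea of what I mean by ad-hoc fakes and EXPECT handling (every name below is illustrative, not the project's actual code): a fake clock the tests can advance, a fake UART receive path the tests can inject module responses into, and a tiny EXPECT macro that records failures.

```cpp
// Rough sketch of ad-hoc fakes and a minimal EXPECT; all names are illustrative.
#include <cstdio>
#include <cstring>
#include <cstdint>

static int g_failures = 0;
#define EXPECT(cond) \
    do { if (!(cond)) { std::printf("FAIL %s:%d: %s\n", __FILE__, __LINE__, #cond); ++g_failures; } } while (0)

// Fake system clock: the code under test reads it, the test advances it.
static uint32_t g_now_ms = 0;
static uint32_t clock_now_ms(void)      { return g_now_ms; }
static void     advance_ms(uint32_t ms) { g_now_ms += ms; }

// Fake UART receive line: the test injects what the ISR would have buffered.
static char g_rx_line[128];
static void inject_rx(const char *line) {
    std::strncpy(g_rx_line, line, sizeof g_rx_line - 1);
    g_rx_line[sizeof g_rx_line - 1] = '\0';
}

// Hypothetical code under test: poll for an "OK" response, with a timeout.
enum PollResult { PENDING, GOT_OK, TIMED_OUT };
static PollResult poll_for_ok(uint32_t start_ms, uint32_t timeout_ms) {
    if (std::strcmp(g_rx_line, "OK") == 0) return GOT_OK;
    if (clock_now_ms() - start_ms >= timeout_ms) return TIMED_OUT;
    return PENDING;
}

int main() {
    // Scenario 1: the module answers promptly.
    inject_rx("OK");
    EXPECT(poll_for_ok(clock_now_ms(), 5000) == GOT_OK);

    // Scenario 2: the module never answers; drive the fake clock past the timeout.
    inject_rx("");
    uint32_t start = clock_now_ms();
    advance_ms(6000);
    EXPECT(poll_for_ok(start, 5000) == TIMED_OUT);

    std::printf("%d failure(s)\n", g_failures);
    return g_failures != 0;
}
```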

With this I was able to create scenarios to test every path in the FSM, for all the module variants. Since I had control of the fake clock and ISR, I could drive all kinds of timing conditions and module responses. It did help that the superloop environment was pure RTC (Run To Completion, which coincidentally is required for Executable UML state machines), rather than preemptive multitasking/multithreading. But I could have faked that as well if necessary.

I was able to fix several bugs that would have been hell to reproduce and debug on-target. In all cases, just as with the B-tree, the code changes were trivial. The time-consuming and hard part is always the debug phase to figure out what's going wrong and what needs to be changed. Doing the actual changes is usually simple.

That debug phase is where non-TDD approaches run into trouble, especially when they have to be done on-target. It can consume unbounded amounts of development time. The time required to do TDD is far shorter, and for a significant number of problems can either completely eliminate the debug phase, or narrowly direct it to the specific code path of a failing test.

The third case was this past week, when I did my first true off-target TDD for some embedded code. The platform is embedded Linux on ARM, so a full OS, with cross-compiled C++.

I built the code in my Ubuntu VM and used Google Test/Mock, mocking out the file I/O (standard input stream and file output) and system clock. The code wasn’t particularly complex, but it did have a corner case dealing with a full buffer that represented the greatest bug risk.

I used very thin InputStreamInterface, OutputFileInterface, and ClockInterface classes as the OSAL (Operating System Abstraction Layer) to provide testability (thank you, Robert and James!).
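To give a sense of how thin those seams are: the interface is just a small pure-virtual class, and Google Mock supplies the test double. This is only a sketch; ClockInterface is the name from my code, but the mock, FlushIsDue, and the tests below are hypothetical stand-ins, not the actual component.

```cpp
#include <cstdint>
#include <gmock/gmock.h>
#include <gtest/gtest.h>

// Thin seam over the system clock: production code gets the real
// implementation, tests get a mock.
class ClockInterface {
public:
    virtual ~ClockInterface() = default;
    virtual uint32_t Milliseconds() = 0;
};

class MockClock : public ClockInterface {
public:
    MOCK_METHOD0(Milliseconds, uint32_t());
};

// Hypothetical code under test: decide whether the buffer is due to be
// flushed to the output file.
static bool FlushIsDue(ClockInterface& clock, uint32_t lastFlushMs, uint32_t periodMs) {
    return (clock.Milliseconds() - lastFlushMs) >= periodMs;
}

TEST(FlushPolicy, FlushesOncePeriodHasElapsed) {
    MockClock clock;
    EXPECT_CALL(clock, Milliseconds()).WillOnce(testing::Return(5000u));
    EXPECT_TRUE(FlushIsDue(clock, 0, 1000));
}

TEST(FlushPolicy, DoesNotFlushEarly) {
    MockClock clock;
    EXPECT_CALL(clock, Milliseconds()).WillOnce(testing::Return(500u));
    EXPECT_FALSE(FlushIsDue(clock, 0, 1000));
}
```

Link against gtest_main and gmock and the tests run on the development host, no target hardware involved.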

It was gloriously wonderful and liberating to build it TDD style, red-green-refactor, and I knew I had all the paths in the code covered, including the unusual ones. That instills great confidence in what I did. No more worrying that I got everything right. I was able to demonstrate that I did.

Did it take a little extra time? Sure, mostly because I'm still on the learning curve getting into the TDD flow. But if I hadn't used TDD and this code had produced failures, it would have taken me longer after the fact to chase down the bugs. Plus I was able to avoid the impact on all the other people in the organization affected by the development turnaround cycle.

And just today, I added more behavior to that component using the TDD method. I was able to work fully confident that I wasn't breaking any of the stuff I had already done, and just as confident in the new code.

So I'm definitely a believer in off-target testing, and from now on I'll be doing it TDD.

Another benefit of this off-target TDD model? Working out of that Ubuntu VM on my Mac, I'm totally portable. I can work anywhere, at a coffee shop, on the train commuting, at the airport, on a plane, at home. I can be just as productive as if I had my full embedded development environment in front of me. Then once I'm back at my full environment, I have tested, running code ready to go.

For reference, these are the books that taught me TDD while in different jobs, both highly recommended:
  • Test Driven Development: By Example, by Kent Beck (yet another Agile Manifesto author). I was introduced to the book and TDD in general by new coworker Steve Vinoski in 2007, whose cred in my eyes went way up when I noticed his name in the acknowledgements of James O. Coplien's Advanced C++ Programming Styles and Idioms.
  • Working Effectively With Legacy Code, by Michael C. Feathers. Amazon tells me I bought this in 2013. At the time I used it to start adding unit test coverage to our codebase at work. What makes this book particularly useful is the fact that nearly all software development requires working with legacy code to some degree, even on brand new projects. It also helps you avoid creating a legacy of code that future developers will curse.