Combining C With Assembly

The latest issue of Embedded Systems Design has an interesting article on combining C code with assembly code for DSP applications. In some cases, they show a tenfold speed-up for an assembly-plus-C implementation versus straight C code.

Now before anyone starts hollering, please remember that embedded programming is not at all like normal desktop application programming. Generally you're operating in a "slim" (i.e., resource-poor) environment. Heck, some embedded processors have only a few hundred bytes of RAM available, and you might be talking about a clock frequency of just a few megahertz. Squeezing everything as much as possible becomes paramount. For larger apps, C is the standard environment, and assembly is sometimes added for time-critical segments. In general, I don't like to add assembly because it reduces portability and readability, increases code maintenance, and is generally more time-consuming. But I understand that there are places where it is needed. What I think needs to be done, though, is to squeeze as much out of standard C as possible before going down the assembly path.

What follows is an example of two code variations on the ubiquitous string copy function. In C, strings are nothing more than a contiguous packed array of bytes. Both functions do the same thing: they copy bytes from the source string to a destination string until (and including) a null character is reached (which signifies the end of the string).

The first variation is the way a newbie would typically approach the problem. First, initialize an array index to 0, and then copy the bytes one at a time, incrementing the index and checking for the null at each iteration. When you reach the null, quit the loop and add a terminating null to the destination. Pretty simple, right?

Here's the C code:

void strcpy1( char d[], char s[] )
{
    unsigned long c = 0;

    while( s[c] != 0 )
    {
        d[c] = s[c];
        c++;
    }
    d[c] = 0;
}

OK, here's the (slightly modified) assembly output from an old Motorola 68000 series C compiler:

_strcpy1:
        link    a5,#-4
        movem.l reg,-(sp)
        clr.l   -4(a5)              ; c = 0
.1005
        move.l  -4(a5),d0           ; index
        move.l  12(a5),a0           ; base of source array
        tst.b   (a0,d0.l)
        beq     .1006
        move.l  -4(a5),d0
        move.l  12(a5),a0           ; base of source array
        move.l  -4(a5),d1
        move.l  8(a5),a1            ; base of dest array
        move.b  (a0,d0.l),(a1,d1.l)
        add.l   #1,-4(a5)           ; c++
        bra     .1005
.1006
        move.l  -4(a5),d0
        move.l  8(a5),a0
        clr.b   (a0,d0.l)           ; force the null termination
        movem.l (sp)+,reg
        unlk    a5
        rts

FYI: the 68000 has eight address registers (a0-a7) and eight data registers (d0-d7). In this version, a5 is used as the base for local variables, and sp is the stack pointer. Anyway, there's lots of moving and updating the index, accessing the arrays, and so forth. It's not a lot of assembly, but then again it's a pretty small function. There are 11 instructions in the main loop (from the .1005 label down to the unconditional branch). It doesn't seem like it could be tweaked much, does it?

Here's version two, the sort of thing an experienced programmer would do:

void strcpy2( register char *d, register char *s )
{
    while( *d++ = *s++ );
}

The C code is much more compact, but check out the assembly:

_strcpy2:
        movem.l reg,-(sp)
        move.l  4(sp),a2            ; dest ptr
        move.l  8(sp),a3            ; source ptr
.1007
        move.b  (a3)+,(a2)+
        tst.b   -1(a3)
        beq     .1008
        bra     .1007
.1008
        movem.l (sp)+,reg
        rts

Now that's a tiny loop: four instructions, including the branch statements. This version takes up less space and executes faster.

As I often tell my programming students, it's usually not a matter of whether you can implement an algorithm, but of discovering the most efficient way to do it in order to meet design performance goals. If some code rework lets you do that without resorting to assembly, so much the better.
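For instance, here's one more C-level rework you could try before reaching for assembly (a sketch of mine, not from the article's measurements): unrolling the copy loop to cut the per-iteration branch overhead. Whether it actually wins depends entirely on the target CPU and compiler, so profile it first.

void strcpy3( register char *d, register char *s )
{
    for( ;; )
    {
        /* four copies per pass; each test exits as soon as the null is copied */
        if( (*d++ = *s++) == 0 ) return;
        if( (*d++ = *s++) == 0 ) return;
        if( (*d++ = *s++) == 0 ) return;
        if( (*d++ = *s++) == 0 ) return;
    }
}

On some processors the saved branches matter; on others the larger loop body hurts the prefetch queue or instruction cache, which is exactly why the asm output (or a timing run) should get the final say.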

Elegant!

By Suesquatch (not verified) on 22 Dec 2007

Hi,

I think this example is misleading. Terrible, in fact. strcpy1 and strcpy2 implement identical algorithms in almost identical ways. This is not the kind of optimization humans are good at -- it is the kind that compilers are good at. So I wondered why you used a really old compiler, and I bet that if you use a modern compiler your results will be much different.

In fact, here, let me do it. Here is what I get from gcc 4.0.2 (not all that recent) compiling for x86, looking just at the loop body:

strcpy1 without compiler optimizations: 14 instructions
strcpy2 without compiler optimizations: 13 instructions
strcpy1 with full compiler optimizations: 6 instructions
strcpy2 with full compiler optimizations: 6 instructions

The loops in strcpy1 and strcpy2 resulted in almost identical assembly for the body when optimizations were turned on. And even with no optimizations, the two are about even, since they have to do almost exactly the same work in both cases.

Just to be complete, let me elaborate a bit on the differences between the two functions. The only real difference is that in strcpy1, the human didn't notice that the epilog (setting the null byte) might as well just be the last iteration of the loop, meaning that the loop can always execute at least once. The second one takes advantage of this. This results in a shorter prolog and epilog in the strcpy2 case (and by chance a 1-instruction gain in the unoptimized case). The compiler isn't (yet) smart enough to notice this optimization, so even with full optimization the assembly outside the loop body is different. But as you know, optimizing outside the loop body is not important, especially in a function like strcpy, where all the work is inside.
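In C terms, the rewrite I'm describing would look something like this (just a sketch; the name strcpy1b is made up):

void strcpy1b( char d[], char s[] )
{
    unsigned long c = 0;

    do
    {
        d[c] = s[c];    /* the terminating null is copied by the final pass */
    } while( s[c++] != 0 );
}

The body always runs at least once, which is safe here because even an empty string contains its terminating null.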

Optimizing isn't about using fewer characters in your code. It is almost entirely about changing the algorithms and datastructures (at a high level) and perhaps changing the structure and flow of the code (at a low level). These are places where humans can reason about things much better than a compiler. But when it comes to micro-optimizing blocks of a few lines of code, compilers are pretty good (they know about all sorts of architectural tradeoffs, caching issues, obscure gotchas, alternative instructions, etc., and can out-optimize most any casual programmer).
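To make that concrete, here is a toy example of my own (not from the post): the classic mistake of calling strlen() in a loop condition. No amount of micro-optimization rescues the first version; a human fixes it by changing the shape of the algorithm.

#include <string.h>

/* O(n^2): strlen() rescans the whole string on every iteration */
void upcase_slow( char s[] )
{
    size_t i;
    for( i = 0; i < strlen( s ); i++ )
        if( s[i] >= 'a' && s[i] <= 'z' )
            s[i] -= 'a' - 'A';
}

/* O(n): the length is computed once, outside the loop */
void upcase_fast( char s[] )
{
    size_t i, n = strlen( s );
    for( i = 0; i < n; i++ )
        if( s[i] >= 'a' && s[i] <= 'z' )
            s[i] -= 'a' - 'A';
}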

Sorry about the rant. This kind of thing sets me off for some reason. I deal with it at work all the time -- people who insist on coding stupid algorithms and stupid datastructures and ignoring the program structure, then jumping right into tiny micro-optimizations, all without bothering to turn on compiler optimizations.

Kevin, you're absolutely right - up to a point. As the OP says, programming a DSP or embedded microcontroller really is rather different. There are really two points that make the situation special:

* A microcontroller, and especially a DSP, tends to have a "weird and wonderful" architecture, with lots of special features and tricks built in. DSPs, for example, will often have a whole set of vector manipulation instructions. Compilers are typically not very good at exploiting such special-case features (or may even ignore them completely for code generation); they can require you to restructure the whole task quite a lot in order to get the full benefit, and compilers don't really have the bird's-eye view of the problem that would allow them to do so.

In fact, the resurgence of RISC architectures came from the realization that a compiler tends to ignore all but a fairly small set of instructions for code generation. But the whole point of a DSP (especially) is that it does have very specific instructions for doing some kinds of operations very, very fast.

* And a microcontroller can sometimes be _really_ space and time limited. I used, fairly recently, a controller with a total of 2 KB of EEPROM and 128 bytes of RAM (including the execution stack, of course), and every single byte you save is important. Also, a hand-coded routine may be the difference between managing an operation within a set time limit or not. This is especially true when you end up having to fudge/approximate/cheat in order to make it, and you need to see exactly how many cycles you're spending and what kind of simplification will get you below the limit.

I do use C when I can - I'm not a masochist. But I'm not about to throw away a tool that can be powerful and may sometimes be necessary either.

Kevin, with something like a PowerPC or Pentium III with multiple pipelines and out-of-order execution, compiler optimization will out-perform human optimization almost every time, but when you're dealing with something like a basic 8086 or 68000 (or - god forbid - a Z80) that's simply no longer true. Remember that this is discussing embedded programming, not desktop applications or information processing. You also need to consider that when writing for embedded systems, the compilers you use often have nothing like the optimizers available on modern desktop C compilers.

As Janne has mentioned, there's one challenge that exists in embedded systems that simply does not exist in desktop systems: having a routine that must execute within a specific time limit. On a desktop system, you can't even think about doing this; the operating system has much more say over your execution than your code does. But on an embedded system, you can - and often have to - calculate precisely how long a routine will take; back when I did embedded programming, I fixed at least one routine by adding NOP instructions. (If I recall correctly, it was writing a value to a memory-mapped EEPROM location, then setting an I/O output when it had been successfully written.)
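For what it's worth, the trick looked roughly like this (reconstructed from memory as a sketch; the names and the GCC-style inline assembly are stand-ins, and the actual number of NOPs came from the part's timing spec):

#define NOP() __asm__ __volatile__ ( "nop" )

void write_then_signal( volatile unsigned char *eeprom_cell,
                        unsigned char value,
                        volatile unsigned char *out_port )
{
    *eeprom_cell = value;       /* memory-mapped EEPROM write */
    NOP();                      /* pad cycles until the write settles */
    NOP();                      /* (count taken from the datasheet) */
    NOP();
    *out_port |= 0x01;          /* then raise the "written" output */
}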

> Optimizing isn't about using fewer characters in your code. It is almost entirely about changing the algorithms and datastructures (at a high level) and perhaps changing the structure and flow of the code (at a low level).
I agree wholeheartedly, and I did not mean to imply that version 2 was necessarily better because its C source is smaller. I chose this example because it was small and easy to see, and didn't use a sophisticated CPU (by today's standards). What I am getting at is that sometimes you need to look a little deeper (in this case, at the asm output), and it may well be possible to achieve your goal without resorting to assembly. As others have mentioned, I'm not talking about a modern desktop processor (with appropriately sophisticated tools). In such a case, I would think very hard about my algorithm(s) and compare the compiler optimizations via a timing profile (sometimes I don't care whether version X has fewer asm instructions; what I care about is execution speed, and fewer instructions doesn't always mean faster). In any case, I don't completely trust compiler optimizations. I always check the results against unoptimized code (a habit after being bitten by MSVC a few years ago on a desktop app).
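As a sketch of what I mean by a timing profile (a quick harness I'd throw together, not production code; the iteration count and test string are arbitrary), something like this lets you compare the two versions directly:

#include <stdio.h>
#include <time.h>

void strcpy1( char d[], char s[] );
void strcpy2( register char *d, register char *s );

int main( void )
{
    char src[] = "a moderately long test string for the copy loop";
    char dst[ sizeof src ];
    long i;
    clock_t t0, t1;

    t0 = clock();
    for( i = 0; i < 1000000L; i++ )
        strcpy1( dst, src );
    t1 = clock();
    printf( "strcpy1: %.3f s\n", (double)( t1 - t0 ) / CLOCKS_PER_SEC );

    t0 = clock();
    for( i = 0; i < 1000000L; i++ )
        strcpy2( dst, src );
    t1 = clock();
    printf( "strcpy2: %.3f s\n", (double)( t1 - t0 ) / CLOCKS_PER_SEC );

    printf( "%s\n", dst );  /* use dst so the copies can't be optimized away */
    return 0;
}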

Bottom line: If the C code doesn't meet size/speed expectations on the first try, don't assume that you have to go to assembly. Rethinking the code may get you there.