Strange Loops: Ken Thompson and the Self-referencing C Compiler

By goodmath on April 15, 2007.

I'm currently reading "I am a Strange Loop" by Douglas Hofstadter. I'll be posting a review of it after I finish it. A "strange loop" is Hofstadter's term for a Gödel-esque self-referential cycle. A strange loop doesn't have to involve Gödel style problems - any self-referential cycle is a strange loop.

Reading this book reminded me of my favorite strange-loop story. It's actually
a story about software security, and the kinds of stunts you can play with
software if you're clever and subtle. It's the story of the Unix C compiler, and the virtually invisible back-door security hole inserted into it by Ken Thompson - a story he told in his Turing award lecture..

(Idiot that I am, I originally said it was Dennis Ritchie who did this... Leave it to me to link to the original lecture, and not notice that I got the author wrong!)

In Unix systems, there's a program named "login". login is the code that takes your username and password, verifies that the password you gave is the correct one for the username you gave, and if so, logs you in to the system.

For debugging purposes, Thompson put a back-door into "login". The way he did it was by modifying the C compiler. He took the code pattern for password verification, and embedded it into the C compiler, so that when it saw that pattern, it would actually generate code
that accepted either the correct password for the username, or Thompson's special debugging password. In pseudo-Python:

  def compile(code):
    if (looksLikeLoginCode(code)):
      generateLoginWithBackDoor()
    else:
      compileNormally(code)

With that in the C compiler, any time that anyone compiles login,
the code generated by the compiler will include Ritchie's back door.

Now comes the really clever part. Obviously, if anyone saw code like what's in that
example, they'd throw a fit. That's insanely insecure, and any manager who saw that would immediately demand that it be removed. So, how can you keep the back door, but get rid of the danger of someone noticing it in the source code for the C compiler? You hack the C compiler itself:

  def compile(code):
    if (looksLikeLoginCode(code)):
      generateLoginWithBackDoor(code)
    elif (looksLikeCompilerCode(code)):
      generateCompilerWithBackDoorDetection(code)
    else:
      compileNormally(code)

What happens here is that you modify the C compiler code so that when it compiles itelf, it inserts the back-door code. So now when the C compiler compiles login, it will insert the back door code; and when it compiles
the C compiler, it will insert the code that inserts the code into both login and the C compiler.

Now, you compile the C compiler with itself - getting a C compiler that includes the back-door generation code explicitly. Then you delete the back-door code from the C compiler source. But it's in the binary. So when you use that binary to produce a new version of the compiler from the source, it will insert the back-door code into
the new version.

So you've now got a C compiler that inserts back-door code when it compiles itself - and that code appears nowhere in the source code of the compiler. It did exist in the code at one point - but then it got deleted. But because the C compiler is written in C, and always compiled with itself, that means thats each successive new version of the C compiler will pass along the back-door - and it will continue to appear in both login and in the C compiler, without any trace in the source code of either.

I've tried to illustrate this with the diagram to the right. The arrows with solid heads are inputs to a compiler; the lines with hollow heads are outputs from the compiler. The solid lines are the original compilation paths: the C compiler source code (without back-door) gets compiled by the C compiler, generating a new (backdoor-free) C-compiler executable; the login source code gets compiled by the C compiler, generating a new login executable without any back door; and the C compiler with backdoor code generates a new C compiler executable with the back door.

Now the tricky part. The dotted lines are what happens after you compile the C compiler with the back-door code included. In thofse paths, code is getting compiled by the C compiler with the back-door. When you compile the login source code with the new C compiler, you get a new login executable which contains back-door code that wasn't in the source; and when you compiler the C compiler source code with the back-door C compiler, it will generate a C compiler executable with the back-door code included.

So you've got a loop now: the C compiler will recognize the logic source and insert a back door into it; and using self-reference, the C-compiler will recognize itself, and insert the back door into any versions of itself.

More like this

The Basic Balanced Search Tree: Red-Black trees.

During my Haskell tutorial, I used balanced binary search trees as an example. I've had people write to me asking me to write about that in a non-functional programming post, because the Haskell one got too tied up in how to do things without assignments, and with tail recursion. Binary search…

Peer Reviewed Bad ID Math

In comments to [my recent post about Gilder's article][gilder], a couple of readers asked me to take a look at a [DI promoted][dipromote] paper by Albert Voie, called [Biological function and the genetic code are interdependent][voie]. This paper was actually peer reviewed and accepted by a journal…

Google's New Language: Go

I've been being peppered with questions about Go, the new programming language just released as open-source by Google. Yes, I know about it. And yes, I've used it. And yes, I've got some strong opinions about it. Go is an interesting language. I think that there are many fantastic things about it…

The Nasty Little Truth about Idiots Who Don't Understand Dimensions

I managed to trash yet another laptop - the city commute through the subways seems to be pretty hard on computers! - so while I'm sitting and slowly restoring my backups, I was looking through the folder where I keep links to crankpots that I'd like to mock someday. I noticed one that I found…

Ken Thompson, not DR.

You've missed a close-italics tag somewhere.

This article nicely shows the fact that computing security is founded in the very basic building blocks of computer systems.
Should people write independent and minimal compilers in C (or even assembly) for compiling compilers to prevent the scenario in the article? Or how do we make sure that this sort of thing isn't likely to happen in large scale in real life?

There's an analogy to the von Neumann-Ulam solution to the abstract problem of self-reproduction without infinite regress. See it?

This article, http://www.acsa-admin.org/2005/abstracts/47.html, describes a way to counter the attack described by Ken Thompson. Of course, you still need either a fully trusted compiler, or two malicious compilers which produce different backdoors.

> I first read of the possibility of such a Trojan horse in an Air Force critique ... of the security of an early implementation of Multics.

This is the most amazing line in the article. Truly the old masters had a deep and subtle understanding of all these issues.

Some anonymous reviewer tipped Thompson off to this fundamental, beautiful security "hack".

I wish we knew more about this anonymous reviewer.

Unrelated, but similarly amazing to me, is how modern a design VMS possessed in 1980. Virtualization and completely plug-and-play security. We are barely catching up.

"... the security of an early implementation of Multics... "

The security of Multics was famously bad. The idiot in charge of this parlayed his uselessness by becoming the dark horse candidate of a graduate department Chairmanship in Computer Science. The outgoing Chairman, who insulted all the faculty by calling them children, in a pre-emptive strike on the fact that he was younger than them, and constantly boasting about his textbook royalties, and was busy with wholesale plagiarism, stepped aside.

All the candidates were polarized in selection, with many strong supports and many strong oppositions. The luke-warm Multics dude was nobody's first choice, but had no strong opposition, so he became chairman.

His textbook was riddled with errors, and applied to a different mainframe than the department had. Half the students in his Operating Systems class dropped out, citing textbook and the professor's attitude.

He fired the seasoned veteran grad student who was his T.A., and was there to teach him the departmental culture and interface with students.

The firing was overturned by a blue-ribbon panel that included a federal mediator.

The chairman, annoyed, lashed out at grad students. He refused to let the dyslexic student (paper work proved it) take PhD qualifying exam as he took all other exams, with his wife reading him the written questions. He kicked out the grad student from France, as he didn't like French universities. He alienated the student from Cambridge University. He insisted on several qualified people being denied PhDs. A third of the faculty left the university, to go somepl=ce saner. Anywhere.

And to this day, since the faculty ousted him from Chairmanship with a Vote of No Confidence, he boasts about what a genius he was in designing the security of Multics.

I've mentioned no names here. But I'd swear any of this under oath.

One of the nice things about that Turing lecture is that he starts it with a brief discussion of quines (though he doesn't use that word); if you've never thought about them before, thinking about them and writing one or two is a useful exercise in getting your head around what's going on technically in the rest of the talk.

Ineffably sneaky. No wonder I gave up my subscription to 'Computer Design' in the late 70's. Wouldn't trust anything that can't run on a 4040.

Don't worry, I'm sure we could run this on a 4040. (You can fit a C compiler into 64K, right?)

There's still a Ritchie reference left at the end of the paragraph after the first pseudo-Python block.

Just have to say, Douglas Hofstadter is my hero, along with John H. Conway!

I agree with you Susan B. They are great at writing about math in an entertaining way.

verify the binary size expected for the complier.
as with any security or hack,, make the verify ck a secret.

When Assembly Language was King,,
The hack was to remove bytes and replace them with yours.
The Security was to inspect for the changes and output them
disassembled.

Never ask a security question that you do not already know the answer to.

Just a thought in passing,,

.,.

Comment #4, referring to David A. Wheeler's paper "Countering Trusting Trust through Diverse Double-Compiling", didn't separate the URL from its comma. So here's that URL on its own: http://www.acsa-admin.org/2005/abstracts/47.html . There was a discussion of Wheeler's paper at Lambda the Ultimate: http://lambda-the-ultimate.org/node/1184 about a year and a half ago.

Small typo: "So you've got a loop now: the C compiler will recognize the __logic__ source and insert a back door into it;"

Should be "login" I assume. Great post, by the way!

I have compiled GNU-Linux from scratch. IE only source code. (see www.linuxfromscratch.org) When you compile the compiler you then use the compiled compiler to compile the compiler and then you check to make sure they are the same. ie

Compile compiles compile(1)
compile(1) compiles compile(2)
you then diff compile(1) and compile(2)
and if they are the same you use
compile(2) to compile compile(3)
and then diff them to make sure they are the same.

So this backdoor would not work ....
Unless you put a hack in diff .....

Incidently this is done automatically via make
in GCC.

Stephen George: Sorry, diff will not work. The original compile already contains the backdoor, and compile(1) is identical to it, as are all the following versions since they have identical source.

You would not compromise the GCC source code, but the distro you use to bootstrap Linux-from-scratch.

compile contains the backdoor, and builds it into compile(1). compile(1) now also contains the backdoor, and compiles it into compile(2). The compiler check would pass, with no hacking of the gcc source required. Every program, including login, is now compiled with the compromised GCC.

Stephen: This doesn't help if your original compiler contains the backdoor. Both subsequent compilers would contain the same backdoor and thus be the same.

Unless you trust the original compiler, you can't be sure.

One way to defeat this attack is to write a program that takes a C program as input, obfuscates it (permuting source file, variable, and function names, inserting sequences of instructions that are guaranteed to have no effect on the function they are inserted into, but which the compiler would not be able to optimize away, ...). Run the source for your compiler through the obfuscator, compile it with the back-doored-compiler binary (which doesn't recognize the source as the source for the compiler anymore) and you obtain a non-back-doored (albeit perhaps inefficient due to the nonsense sequences of instructions inserted by the obfuscator) compiler binary, which you can now use to recompile your compiler source once again to obtain a non-back-doored, efficient compiler binary.

Now, if the attacker knows exactly how your program obfuscator works, she could rewrite the compiler source detector to take it's obfuscation into account and still insert the back door. I am guessing it is easier to write an obfuscator than a detector, however, so the attacker will have a much harder time pulling this off than a defender would have defeating it.

Wow You guys are completely right.
I am not saying wow because of the back door.

I am saying wow because I said something stupid and no one was rude to me.

Now that is really something

Thanks guys.

Chris Marshall-
Code obfuscation would not work if the transform is not done on a simple text matching, but is done at a transform level internal to compiler structures. For example, the original code might be set up so that it transforms to a very special parse tree, and that tree is what triggers if backdoor. It is not hard to do this, and as long as the code semantics (not syntax) match the tree, the backdoor can be inserted.

Mark:
What did you think of Godel, Escher Bach?

I am assuming you read it! 8^)

Chris Marshall writes that the back-doored-compiler could operate on parse trees, thus defeating the obfuscator. I disagree. I actually think Chris Marshall's solution is brilliant (and far more realistic than the usual solution of using a different compiler). The obfuscator need not simply alter variable names... it can introduce odd behavior the compiler is not clever enough to resolve. For instance, replacing "for i := 1 to list.length()" with "for i := 1 to list.length + new list(2,0,45).sort.first()" would only be defeated by a back-doored-compiler that can sort lists at compile time. If the compiler is that smart, then I can beat it by introducing spurious values using prime factorization of large numbers.

KeithB:

GEB is my personal favorite non-fiction book. It's pure gold, all the way through. That's why I'm reviewing "I am a Strange Loop": when I heard that Hoftstadter had written a sequel, I immediately wrote to the publisher, and asked for a review copy. I was delighted when they sent me one.

As I said, I'll post the review here as soon as I finish it. So far, I think it's excellent, but not up there with GEB.

Michael:

Your idea of using prime factorization of large numbers is a neat way of pointing out how hard the compiler's detection algorithm can be made to work by an obfuscator.

In the ACM paper, Thompson ominously wonders how far down such a trojan might be placed (I think he goes so far to muse about the microcode for a CPU). I think the further down you try to place such a trojan, the easier it would be to fool it through obfuscation.

If a compiler with access to the libraries of a modern operating system can't defeat an obfuscation algorithm, what chance does a less complex computing level like microcode have?

Assuming it's only the C compiler and nothing lower-level that has the backdoor, one could get around it by writing a C compiler in Pascal, B, Smalltalk, or something else that compiles to machine code without using C as an interpreter or intermediary, preferably something whose compiler was not written in C, though I imagine that a level of indirection would be enough to throw off the backdoor-inserting code; after all, the backdoor code recognizes C compiler sources; presumably it wouldn't recognize Pascal compiler sources. So compile your Pascal compiler, write a C compiler in Pascal, and then any C compiler compiled with THAT compiler should be safe.

Michael:You're correct that there are more complicated obfuscation techniques that could likely thwart this particular kind of back door, but Chris' point - that simple variable renaming (and likely other things like loop unrolling) - wouldn't do it if the back door was implemented on a parse tree. move it after optimization and it gets even harder (although you may have to implement one back door for each optimization level/option, perhaps).Dan:The language the original compiler is written in isn't really relevant. The first-level hack (hiding the login translation hack in the C compiler binary) would be implemented by your Pascal/Smalltalk/whatever program and then that binary would still produce corrupted versions of the login program. Also, no reason why it couldn't also implement the first-order hack on C source.The point you're almost getting to, I think, is that this is pretty code-dependent, but that comes back to the first posted work-around: use multiple, independent compilers. Now if only everyone would stop using those stupid gcc-only language extensions... ;-)Personally, I thought the best part of Thompson's speech was the way he introduced what he'd done: this wasn't a nefarious capability added for the purpose of being malicious or giving cool talks, it was just the extension of an inherent capability with lots of important uses elsewhere.

Anthony:

I understood the parse tree point when I wrote my original post, which is why I talked about inserting sequences of instructions (that don't change the result of the function they occur in) that the compiler could not to optimize away. File, variable, and function renaming would not be sufficient, but it would be necessary (lest you leave a specially named varaible in place that the compiler detector happened to be keying off of).

Regarding the multiple compiler/language approach, I think you can thwart a more powerful adversary with the same level of effort put into obfuscation than you could by writing more cross compilers. It is much harder to write compilers than obfuscators.

Chris Marshall

Mark,

You've written your summary as though Ken Thompson actually did the hack he wrote about.

For debugging purposes, Thompson put a back-door into "login". The way he did it was by modifying the C compiler.

But it was just a thought experiment. He never actually put a backdoor in "login", much less the double-trojan in the C compiler.

-- Don

OK, nevermind, I take it back. Reading the ACM lecture more carefully, I see that Thompson apparently is claiming that he actually installed the trojan C compiler at a real site, as a proof of concept if nothing else.

This also shows that while Internet security has become a big topic in the last 10 years - that the core systems and core internet have been around since late 1960's. Considering there have been upgrades, etc. over the last 30-40 years, one would assume that backdoors from the 1960's-1980's were all found and removed. But suppose this is not the case or suppose many systems - even if OS and hardware were upgraded had core aspects of the system that were replicated and contain "hidden wonders" and "backdoors". So how secure would all the telephone switching systems (early on built on Unix), banking systems, etc really be today - supposing that there are many hidden backdoors or maybe dormant agents lying around? Everyone seems to be looking at the present security holes - this is the first time I have seen someone look back in computing history to point out that trojans, viruses, and backdoors existed long before commercial internet. Great post and great research Mark!

Wow - fascinating!

Is this a case for a compiler written in an interpreted language like Python rather than C? (Probably horrendously slow, I imagine.) That way you could inspect the code that was being used to compile directly...

Or would that not help, as the interpreter itself could be comprised?

I can't believe you used "pseudo-Python" in an article about C.

When I was in college, my mother brought home -- from a used book store -- Metamagical Themas. I loved that collection and went on to get GEB. Since then, I've eagerly anticipated all of Hofstadter's books. I'm halfway through the current one, and I like it, but it doesn't yet have the magic of some of his others.

Despite its less scientific bent, Le Ton Beau de Marot remains my favorite. I even penned my own poem based on the title work as a marriage proposal to my wife.

I have great respect for Dennett, Dawkins, etc., but Hofstadter has a charm that is very personal.

There's a new review of "I Am a Strange Loop" at americanscientist.org that begins:

Self Assembly
Margaret A. Boden

I Am a Strange Loop. Douglas R. Hofstader. xxiii + 384 pp. Basic Books, 2007. $26.95.

Douglas Hofstadter suffers from the grave disadvantage of having written a masterpiece as a young man: the utterly unique GÃ¶del, Escher, Bach: An Eternal Golden Braid. This exhilarating intellectual and rhetorical extravaganza, published in 1979, was focused on the new ways of studying life and minds that were being offered by cognitive science. The book spanned mathematical logic, artificial intelligence, artificial life, psychology, neuroscience and the philosophy of mind. Along the way, it provided deep insights into mathematics, music and creativity--plus countless deliciously outrageous puns. Despite the puns, it was translated many times and became a cult book worldwide.

[truncated]

By the way, today, 28 April 2007, is the 101st birthday of Kurt GÃ¶del. This comment refers to itself.

THe article is not about C as such, it merely uses C as the example. Since psudo-python is clear to anyone who reads it, even non-computer scientists, it is a good choice.

The solution to the "compromised C compiler" problem is to write your own partial C interpreter in assembly language. (I say "partial" because it only needs to be able to understand enough of the language to interpret the compiler source code and it doesn't matter how long it takes to do its job as long as it finishes eventually.) You then feed your compiler source code to the interpreter. Now the combination is behaving as a compiler but it's only doing what the source code says it will do. You're sure of this because you wrote the interpreter (so you trust it) and you read the source code of the compiler and didn't find anything nasty in it.

Now you're not using a compromised compiler to compile the compiler, so the compiler you compile on the compiler running interpretatively (which you will then use to compile the login command) is clean. (Only the compromised compiler, which you would have had to use to compile the compiler if you had never written your interpreter, will put the extra code into any compiler that it compiles.)

Or is it? Because all modern Intel and AMD 80x86-type processors are actually using a RISC processor internally, which runs a hard-coded program to emulate 80x86 instructions. So there's no way to be really sure that, say, an instruction such as ADC AL,BL is only adding BL + C to AL and not doing something else on the sidelines ..... You'd have to build your own processor from discrete components to be really sure. Although, taken right to the extreme, you can't even prove Ohm's Law is true ..... every single voltmeter and ammeter ever made already relies on it.

This, incidentally, is why I would never use a compiler where the Source Code was not made available for viewing -- or anything compiled with it. It's still a lot easier to tamper with than something like GCC, which is available for all to view.

I can't believe you used "pseudo-Python" in an article about C.http://www.sourcingmap.com/bottle-opener-c-1158_1776.html

There is a relatively straight-forward way of getting back to having a clean compiler again, even if your current compiler is compromised: pass the source code through an obfuscator before compiling it. Provided the ken-hacked compiler's context-sensitive checks are based purely on lexical checks, it will no longer be able to self-recognise and so it won't be able to insert the malicious patch into your new code. This new compiler can be used to recompile the original sources (which can be assumed to be and/or checked to be free of backdoors). If the resulting compiler is different from the one you were originally using, then you know that something wicked had been going on (assuming, reasonably, that a compiler is purely deterministic and should always produce the same output for a given input).

Unfortunately, this method isn't foolproof. "Ken" (convenient name for the attacker) could come up with an anti-anti-ken compiler hack. If he knows how your obfuscator works then he can get around it in some way (either undo the obfuscation before parsing, or self-identify on some attributes that are invariant under your obfuscation, eg, shift from lexical recognition to recognition of patterns in the intermediate code representation). Dr. Fred Cohen (the man who popularised the term "computer virus") has written about the impossibility (in general) of recognising viral code in setups like this. In practice, however, even a simple anti-ken hack like this will defeat practically any naive implementation of the Ken hack. Given how abstruse the attack is in the first place, and how unlikely it is to appear in the wild (given that there are easier attack vectors available to exploit), I don't really see how an arms race between Ken, anti-Ken, anti-anti-Ken and so on would develop.

Another, parallel method of detecting/defeating the Ken hack is simply to build your compiler program using a different machine and/or compiler. As with the obfuscation method, you can use your temporary (clean) compiler to recompile the original sources leaving you with a binary that can be compared directly with the original (suspect) binary. A combination of the two provides defense in depth, and should be sufficient for even the most paranoid. Those people can then get back to worrying about their linker, kernel, communications links, etc ;-) (or, for the truly paranoid, a self-aware "Kuang" class of virus).

Do You realize that what You obtain is kind of monad? This diagrams may be joined one by one to produce new compilers with bacdors and new login with bacdors as they product. Probably there is method to check if this kind of monadic composition may be found and resolved...

mark,
the link u mentioned does not exist now. i found a copy of the same at http://cm.bell-labs.com/who/ken/trust.html

Btw, good explanation. i came to know a few new things.

- ankit

Strange Loops: Ken Thompson and the Self-referencing C Compiler

More like this

The Basic Balanced Search Tree: Red-Black trees.

Peer Reviewed Bad ID Math

Google's New Language: Go

The Nasty Little Truth about Idiots Who Don't Understand Dimensions

Moving on

Goodbye, Scienceblogs

Seed, Conflicts of Interest, and Sleaze

Searching for Topics

Saturday Recipe: Ginger Scallion Sauce

Acupuncture bait and switch: Electrified hot flash edition

Protecting and Exploring Mars from Earth

Ask Ethan: Does Dark Energy Mean We're Losing Information About The Universe? (Synopsis)