I came across an article yesterday about programming languages, which hit on one of my major peeves, so I can't resist responding. The article is at greythumb.org, and it's called Programmer's rant: what should and should not be added to C/C++.
It's a variation on the extremely common belief that C and C++ are the best languages to use when you need code to run fast. They're not. They're good at things that need to get very close to the hardware - not in the efficiency sense, but in the sense of needing to be able to fairly directly munge the stack, address specific hardware registers, etc. But they are dreadful languages for writing real scientific and/or numerical code.
To quote the part of the article that set me off:
First of all, these fears are nonsense. C and C++ are never going to disappear. Why? Because there are classes of programming problems that are still and will always be CPU bound and there is still no language as fast as C or C++ for these problems. I highly doubt that there ever will be.
I'm talking about things like: scientific number crunching, game/simulation physics, raytracing, real-time 3d graphics, audio processing, codec implementation, high-speed network packet routing, evolutionary computation (my personal favorite :), and of course implementing all these high-level languages' runtimes. There are also problems like OS and hardware driver implementation where you need something "close to the metal" that can interact closely with and even embed assembly language. C is basically shorthand for assembler, which is why it's the preferred language for that kind of thing.
For these tasks, premature optimization at the level of language and framework choice is not evil. In some cases it's a requirement. I predict that at least some of these tasks will still be done in C, C++, or some language with similar characteristics 50 years from now. To give you an idea of just how much faster C can be for tasks like this, I have found that evolvable instruction set based evolutionary computation is almost twice as fast when competently implemented in C than a similar competent implementation in Java.
Here's the problem. C and C++ suck rocks as languages for numerical computing. They are not the fastest, not by a longshot. In fact, the fundamental design of them makes it pretty much impossible to make really good, efficient code in C/C++. There's a good reason that Fortran is still the language of choice for real, intense scientific applications that require the absolute best performance that can be drawn out of our machines - applications like computational fluid dynamics.
Making real applications run really fast is something that's done with the help of a compiler. Modern architectures have reached the point where people can't code effectively in assembler anymore - switching the order of two independent instructions can have a dramatic impact on performance in a modern machine, and the constraints that you need to optimize for are just more complicated than people can generally deal with.
So for modern systems, writing an efficient program is sort of a partnership. The human needs to careful choose algorithms - the machine can't possibly do that. And the machine needs to carefully compute instruction ordering, pipeline constraints, memory fetch delays, etc. The two together can build really fast systems. But the two parts aren't independent: the human needs to express the algorithm in a way that allows the compiler to understand it well enough to be able to really optimize it.
And that's where C and C++ fall down. C and C++ are strongly pointer-based languages. The real
semantics of almost anything interesting end up involving pretty much unrestricted pointers. In C and
C++, there's no such thing as an array - there's just pointers, which you can subscript and a
shorthand for pointer arithmetic and indirection(x[n] in C/C++ is the same thing as *(x+n).)
That pointer based nature means that in a C or C++ program, it's very hard for a compiler to figure out what things are independent. It comes down to a problem called alias detection. Alias detection is identifying when two variables might be referencing the same location. Alias detection becomes a horrific mess in the presence of unrestricted pointers. Let me show you an example:
for (int i=0; i < 20000) {
for (int j=0; j < 20000) {
x[i][j] = y[i-2][j+1] * y[i+1][j-2];
}
}
If you look at that loop, it can be parallelized or vectorized without any problem if and only if the array pointed to by x and the array pointed to by y are completely distinct with no overlap. But there's no way to write code in C or C++ that guarantees that. If it were Fortran-77, you could easily check if they were distinct. If it were Fortran-98, you could check if x or y were declared as possible pointer targets, and the programmer could make it obvious that they didn't overlap if they wanted to. But you can't do that in C or C++. (And Fortran isn't even the best - an experimental language called Sisal from Lawrence Livermore labs used to be able to beat Fortran by around 20% on typical code!)
That example involves parallelization of code, but alias related problems aren't just an issue for parallelism; it's just easiest to show an example for parallelism. The aliasing issues in C and C++ have a very direct impact on real code.
Let me tell you about a concrete example of this, and then I'll stop ranting. About six years ago, I was working on a project where I needed to implement a rather messy algorithm to compute something called the "longest common subsequence" (LCS) of two arrays. The standard algorithm for computing LCS is using something called dynamic programming; it's O*(n3) time, and O(n2) space. There's an algorithm that was designed by people doing computational biology that can do it in the same time, but using on average O(n) space.
I didn't know what language to use for this project, so I decided to do an experiment. I wrote the LCS algorithm in a bunch of different languages, to compare how complex the code was, and how fast it ran. I wrote the comp bio algorithm in C, C++, OCaml, Java, and Python, and recorded the results. What I got timing-wise for running the programs on arrays of 2000 elements each was:
- C: 0.8 seconds.
- C++: 2.3 seconds.
- OCaml: 0.6 seconds interpreted, 0.3 seconds fully compiled.
- Java: 1 minute 20 seconds.
- Python: over 5 minutes.
About a year later, testing a new JIT for Java, the Java time was down to 0.7 seconds to run the code, plus about 1 second for the JVM to start up. (The startup times for C, C++, and Ocaml weren't really measurable - they were smaller than the margin of error for the measurements.)
The Objective-Caml bytecode interpreter was faster than the carefully hand-optimized C program! Why? Because the OCaml compiler could recognize that the arrays were completely independent - it didn't need to worry about one iteration of the loop stepping on the values used by another. The C compiler couldn't apply a lot of useful optimizations, because it couldn't be sure that they were valid.
And it's not just non-assignment based functional languages where you can see supposedly less-efficient high level languages crushing the performance of C/C++. CMU CommonLisp can beat C/C++ on numeric code. There was a paper a few years back documenting it: using a Sun SPARC workstation, if you use the optional type declarations, and write scientific/numeric code in Lisp, using vectors (Lisp arrays) and assignments to implement exactly the same algorithm as C, the CMU CommonLisp code will perform better than C code generated by either the Solaris C compiler or GCC with maximum optimization.


Comments
Thanks for the breakdown. I *used* to be a C/C++ purist, but I got over it.
All languages have their place in the Tao, but do not use COBOL if you can avoid it. :-)
Posted by: Keith Sader | November 2, 2006 10:13 AM
Mark
What is your take on languages/packages such as MATLAB, Mathematica and Maple?
Posted by: Gaurav | November 2, 2006 10:19 AM
I work with the ACM programming competition team here at Cornell. Lots of CPU bound programs in those competitions, and your program has to finish in 2 minutes to be accepted.
Except for the start-up time, I have never seen any difference between Java/C/C++ that couldn't be accounted for runtime initialization. GUI and graphics code is a whole 'nother deal, for obvious reasons.
OCAML is damn fast. I have seen some reall nice network stack code written in it. But I wouldn't want to be doing applications in it.
Posted by: Walker | November 2, 2006 10:29 AM
Gaurav:
I haven't personally used MATLAB, Maple, etc., so I don't know much about their performance characteristics. I know my father used MATLAB for data analysis a few years ago, but the stuff he was doing wasn't supercomputer complexity, but statistical
analysis of semiconductor production runs, so expressiveness was more important than efficiency.
Posted by: Mark Chu-Carroll | November 2, 2006 10:37 AM
Walker:
In recent Java VMs, it seems that the performance of most things written in Java do end up being comparable to, or even better than, C/C++. The main thing in Java is the highly expensive startup - the JVM takes quite a bit of time to start up, *and* the performance ramps up gradually as the JIT compiler optimizes the code. Once the JIT stabilizes, the performance is often quite excellent.
OCaml is definitely screamingly fast, but I don't see why you wouldn't want to write apps in it. I've done a *lot* of programming in OCaml, including writing several rather substantial applications. (The open-source Stellation project which I led had its first complete prototype implemented in OCaml. It needed to be rewritten in Java because we wanted to submit it to Eclipse.org. The Java version was about six times longer for equivalent function, and it was much slower.)
Posted by: Mark Chu-Carroll | November 2, 2006 10:46 AM
Thanks for the post, Mark, and specially for the experiment; I usually program sequence stuff, and discrete dynamics (automata, boolean networks), and only rarely truly numerical things. I'm a bit lazy about changing languages, though, but I'll keep this in mind.
...and, I have an (only slightly OT) question, which I guess some fellow commenter could answer (?). I usually program in C++, and I like the Visual Studio IDE. Since I had to give up illegal software, I'm abandoning VS6 and moving to VS Express. What I don't know is what they mean with the "managed" code of VC++ Express... Does it somehow go through MS' virtual machine, like VB and C#?? Will my little hard-working console apps run as fast now as they did with VC6?
Posted by: dileffante | November 2, 2006 10:47 AM
dileffante:
The managed code is code that compiles to the CLR, which is MSs virtual machine. But it optimizes through a JIT just like Java - the startup time will be slightly higher, but the JIT will rapidly make up for it. Long running applications should perform *better* than the old-fashioned C++ compiler, because the JIT compiler has more information available - it can use dynamic information to guide the optimization.
Posted by: Mark Chu-Carroll | November 2, 2006 10:52 AM
While I know that C compilers are really conservative about this, doesn't it matter as to how x and y are declared? After all if they are dclared as:
int x[1000], y[1000];
then the compiler can know that they are independent.
However, if they are declared as part of the function definition, then yes, they could be overlapping.
In other languages though, couldn't you have the same problem with sub arrays passed to functions?
Posted by: KeithB | November 2, 2006 10:53 AM
Horseshit! Bollocks! :)
It's entirely possible to tell the C or C++ compiler that you never alias or that you alias in a function-by-function basis. That, and alias analysis gets better in these compilers all the time. You've really overstated your case by a long shot on the aliasing front, Mark. Send me your LCS code, and I'll take a crack at it.
Also, if you have core routines that are really important and take up a large fraction of your code's run time, you should probably write a hand-coded in assembly routine for each architecture that is important to you. In fact, you should probably have an expert from the CPU vendor do the coding for you, unless you happen to have an assembly expert laying around (e.g., K. Goto's work on the BLAS). At that point, the glue language comes down to personal preference and expressiveness.
BTW, I've seen a number of these comparisions where an expert in one language does an implementation in several languages and, lo and behold, discovers that their favorite language wins out. It's all the same computer underneath, so if you aren't getting fast code in one language (that's sufficiently expressive) IT'S YOUR FAULT (whether because you aren't a good enough programmer in that language, don't know your compiler well enough, or aren't using the right compiler (the old Programming Language Shootout used GCC, which often gets its butt kicked by the Intel compiler on Intel machines)) not the language's.
Finally, I've got a wicked fast parallel, unstructured (mesh), 3D, time-dependent CFD code that's written in C++. It was originally a C/Fortan code (Fortran for the tight, important loops), and then it became a C-only and got about 5% faster. When we refactored it into C++ it gained another 5%. When we refactored the main matrix formation routines to be more compiler-like, we gained 5% in that part of the code.
Posted by: billb | November 2, 2006 11:01 AM
KeithB:
`int x[1000]` in C or C++ doesn't have much meaning. It's semantically equivalent to declaring "x" as `int *x`. You can reassign X any time you want, to make it point to a different location.
And your point WRT function definitions is dead on: the compiler can't do much optimization if the parameters to the function are pointers.
The reason why it's often better in other languages is because they make it much harder to create situation where there's ambigous aliasing. You *can* create cases where you can't prove independence in other languages - but in C++, you can almost *never* prove independence; in Fortran, you can *usually* prove independence.
Posted by: Mark Chu-Carroll | November 2, 2006 11:02 AM
@ dileffante
The "managed" C++ code means that it is managed by the .Net VM. This also means that it will be compiled into MSIL (the .net intermediate language). There are benefits for doing this, the VM will watch for buffer overflows, and it can also talk with non-managed code. How fast will this be? I am not sure, but I think that it is more directed for developing safe code than it is for speed.
Posted by: Jake Ham | November 2, 2006 11:09 AM
To Mark Chu-Carroll:
I think you're mistaken about int x[1000] and int *x being equivalent. I just tried two different things with GCC, and they both failed to compile:
int x[1000];
int y[1000];
x = y; /* fails with error: incompatible types in assignment */
x++; /* fails with error: wrong type argument to increment */
Posted by: Anthony Arkles | November 2, 2006 11:17 AM
I hope I am not teaching my grandmother to suck eggs, but the C faq has a whole section on pointers and arrays:
http://c-faq.com/aryptr/index.html
6.1:
I'm still mystified. Is a pointer a kind of array, or is an array a kind of pointer?
Answer:
An array is not a pointer, nor vice versa. An array reference (that is, any mention of an array in a value context), turns into a pointer (see questions 6.2 and 6.3).
There are perhaps three ways to think about the situation:
Pointers can simulate arrays (though that's not all; see question 4.1).
There's hardly such a thing as an array (it is, after all, a ``second-class citizen''); the subscripting operator [] is in fact a pointer operator.
At a higher level of abstraction, a pointer to a block of memory is effectively the same as an array (though this says nothing about other uses of pointers).
But, to reiterate, here are two ways not to think about it:
4. ``They're completely the same.'' (False; see question 6.2.)
5. ``Arrays are constant pointers.'' (False; see question 6.9.)
See also question 6.8.
Posted by: KeithB | November 2, 2006 11:39 AM
billb:
How do you tell a C++ compiler that there's no aliasing between two three dimensional arrays of floats?
I can't give you the code I did the LCS tests on - as I said, it was six years ago! It's somewhere on a backup tape, and I'd need to get permission from my employer to release it.
At the time, I had just come off of a project where I was part of a team implementing a C++ compiler. C++ was definitely my strongest language at the time. The code in all of the languages was as carefully hand-optimized as I could make it. (The 0.8 seconds for C code for the kind of LCS I was doing was amazing performance, much better than I expected from *any* of the implementations.)
I went into the experiment expecting C and C++ to both kick the butts of everything else, and I was trying to find out just how much penalty I'd pay by using one of the "higher level" languages. One of the big questions that I went into the experiment with was how much penalty would I pay for using C++ instead of C? (That is, I viewed C++ as higher level than C, because of things like templates, better typing, cleaner memory allocation/management, etc; and I wanted to know how much penalty I'd get from it compared to
plain C, and where it fit into the scale of performance from what I expected to be fastest (C) to what I expected to be slowest (Python).
And as I said, I did my best to optimize the C, because I wanted the numbers to be *accurate*. I didn't want to spend two years writing a system, and then get beaten up by my management because of poor performance relative to what I could have gotten from C. I seriously wanted to make sure that I wasn't going to screw myself over, so I really wanted to compare the best that each language had to offer.
As I said above, at the time, I was a serious C++ coder; the thing I coded immediately before the Stellation experiments was a very hairy template analysis for a C++ compiler - roughly 10,000 lines of code fitting into a 1.5million line codebase.
And at the time, I had *never* used OCaml; I had used its predecessor, Caml-Light for some experiments about 4 years earlier. But I was *far* from a highly skilled Caml hacker.
The results were extremely surprising to me, and I did spend some time profiling to try to figure out just why the OCaml was so much faster. The specific reason was that the Caml code did some really clever stuff - it basically did something like local constant propagation that were based on
be able to identify relations between subscripts used to access different arrays, and having done that, it could do some dramatic code rewriting that made it possible to merge loops, and hoist some local constants out of the restructured merged loop.
Posted by: Mark Chu-Carroll | November 2, 2006 11:46 AM
Thanks, Mark and Jake!
Posted by: dileffante | November 2, 2006 11:47 AM
if a loop contains code that can be broken into parallel paths, but you don't notice it, or you notice it but don't change your algorithm to make use of that knowledge, it's not the language's fault - it's your fault.
Posted by: cleek | November 2, 2006 11:51 AM
Nice to read this. All the youngsters around here make fun of me because I use Fortran; most of them use C++. I am not a programmer, but anyone who works in my field has to be able to program to do the work and to make sense of the results of programs others write and provide.
Posted by: Mark | November 2, 2006 11:53 AM
@Guarav:
Mathematica was legendary among MIT physics undergrads for being impossible to operate. I don't think I got a single thing to work with it that I didn't redo more quickly in another language. Of course, my sampling was undoubtedly biased, and your actual mileage may vary.
MATLAB struck me as being the wrong tool for every problem. I was able to do just about everything I wanted, but laboriously. Handling strings in MATLAB code was downright painful: for example, IIRC, an array of strings must have all members the same length.
Posted by: Blake Stacey | November 2, 2006 11:56 AM
I dunno, "-fno-alias" on the command line, perhaps?
BTW, C++ compiler writing in C++ may make you an expert in the parts of C++ that are useful for , but it doesn't make you and expert in numerically (floating-point/scientific) intensive C++ coding. But I didn't really intend to question your bona fides.
My basic point was that you didn't really get the best that the languages had to offer, but you did get some of the best that the languages had to offer to a few days (or weeks maybe) of optimization work.
Posted by: billb | November 2, 2006 11:57 AM
There is a well known way to write efficient C code for numerically intensive applications;
write C wrappers to handle I/O and user interfacing, and have them call fortran subroutines...
Works.
Posted by: Steinn Sigurdsson | November 2, 2006 12:03 PM
> for (int i=0; i > for (int j=0; j > x[i][j] = y[i-2][j+1] *
> y[i+1][j-2];
> }
> }
I can't see why the above code can't be optimized wrt parallel computing. Now with global optimization, a good C++ compiler could easily figure out that whether x and y points to the same array, or whether they overlap. Doesn't it?
Posted by: weapon | November 2, 2006 12:12 PM
"How do you tell a C++ compiler that there's no aliasing between two three dimensional arrays of floats?"
I don't know about C++, but in C99 you have the "restrict" keyword:
"...it serves as a 'no alias' hint to the C compiler. This means that the pointer is, at the moment, the only way to access the object to which it points."
This is from Harbison and Steele, 5th edition.
Posted by: KeithB | November 2, 2006 12:15 PM
Blake: That's why MATLAB eventually got cell arrays (a kind of meta-array for those who don't know MATLAB). MATLAB is a great language for prototyping small problems that involve linear algebra or fourier transforms where you don't care to spend the time implementing these core routines yourself of call a more complicated library. In my experience anything outside of 1-D is best done somewhere else.
However, I still do all my line and scatter plotting in MATLAB. I've just gotten used to the way they look.
Posted by: billb | November 2, 2006 12:17 PM
billb:
-fnoalias isn't the kind of thing I was talking about. That's an assertion that you should *always* assume that things are independent unless you can prove that they aren't (ie, the opposite of normal). The big problem with a lot of programming in C/C++ is that there are *some* loops where all of the values are independent; and there are some where *some* of the values are independent, and there are *some* where nothing is independent. In Fortran, you can often recognize which variables are definitely independent; in SISAL, you can always recognize which variables are definitely independent; in C/C++ it's very hard to recognize which variables are independent.
In Fortran-98, there's now support for pointers. But the way that they work, when you declare or allocate a value, you have to declare that it can be assigned in a way that produces aliases. There's no equivalent for that in C/C++. (I know there was an attempt to get a "noalias" keywork into C++, but it failed.)
Posted by: Mark Chu-Carroll | November 2, 2006 12:23 PM
Mark Chu-Carroll:
Did your OCaml implementation have problems with parallelism on an SMP machine? From what I read (and understand), OCaml does not allow multi-threading within the same process regarding accessing shared-memory. So, a numerical analysis algorithm written in OCaml might be limited to a single-processor machine if efficiency is desired.
Source:
http://caml.inria.fr/pub/ml-archives/caml-list/2002/11/64c14acb90cb14bedb2cacb73338fb15.en.html
Sorry for the long URL.
Posted by: Astrochicken | November 2, 2006 12:27 PM
weapon:
Sometimes, it's easy to recognize whether two arrays are distinct; sometimes it isn't. If you've got code where the
declarations are right there, then it's easy. But there's lots of code where it's very hard to tell.
When you start to see code like there - where the arrays are getting passed around to lots of different functions, and some of the arrays are return values from functions, and those functions are called from multiple call points... Tractable global analysis just doesn't hold up well. There's just too much potential aliasing going on.
Posted by: Mark Chu-Carroll | November 2, 2006 12:30 PM
Your choice of language probably depends a lot on whether the most time-consuming part of the operation is the part that happens in the computer or the part that happens in your head, trying to a) implement your algorithm and b) understand what it did afterwards. I have simulation codes in C++ that could probably be made faster in Fortran, and parsers in Python that could probably be faster in C++, but in both cases it's not the computer that's the rate-determining step in my workflow, it's me. Also if I want to write code so that someone else understands what it's doing... I'll trade a little speed for some clarity.
Posted by: Stephen Wells | November 2, 2006 12:32 PM
Astrochicken:
I have no idea how the code would have run on an SMP. The project was building a fast, lightweight SCM systems for developing software on workstations. So the tests were run on
an IBM desktop PC. It's long enough ago that I don't recall which pentium generation it was, or what the processor clock speed was. But it was a single processor x86-family machine.
Posted by: Mark Chu-Carroll | November 2, 2006 12:33 PM
Mark Chu-Carroll:
"But it was a single processor x86-family machine."
That was your first mistake. But, that is a religious war for another thread.
Regarding SMP and OCaml, I think we can conclude that the hardware platform constrains the development language decision as much as the problem set.
Posted by: Astrochicken | November 2, 2006 12:43 PM
Mark, no one disagrees that there are aliasing problems in C/C++, but I think that you'd find that compilers have come a long way in the last 6 years. A very long way. So much so that I think you would reevaluate your claim that they are dreadful languages for writing scientific code in if you took another look.
Posted by: billb | November 2, 2006 12:44 PM
Astrochicken:
The machine wasn't my choice. In industry, we're often constrained by all sorts of external realities - things like hardware and operating systems (and even programming languages) are often dictated by the requirements of whoever's paying for the research.
Posted by: Mark Chu-Carroll | November 2, 2006 12:47 PM
Well, the de-facto aliasing rules in Fortran are convienient for optimizing compilers. Two parameters to a subroutine are de-facto assumed to be independent, so no fancy dependence checking algorithms are needed. Of course if you know enough of the ins and outs of a specific implementation of (say C), you can probably find ways to change the compilers aliasing (or other) assumption, either via compile line options, or pragmas.
The more serious problems in wringing performance out of systems has to do with the hidden, but very strongly non-uniform performance of memory. In general the languages aren't designed to allow programmers to control things like bytes/words/cache-lines/ and memory pages, yet
the hardware uses or abuses these constructs based upon the low level details, largely the sequence of memory addresses that comes from the program, and whether it makes efficient -or sometimes very inefficient usage of the hidden memory management layers of the hardware.
Especially if your goal is to use SSE/Altivec, whose instructions can handle multiple ops/instruction, things like address alignment
with respect to natural memory boundaries, and/or the compilers ability to know if two arrays, have the same alignment wrt these boundaries can be crucial to its ability to exploit such features.
Posted by: bigTom | November 2, 2006 1:15 PM
billb's right about dealing with aliasing, although it obviously depends on the machine and compiler you're using since it is outside the language spec.
For example, I used SGI's C/C++ compilers back in the 90s and even then you had the choice of
(i) at the function level, declaring variables to be non-aliasing
(ii) for individual loops, adding a pragma to say it was safe to parallelize
It's a pain to have to do this, but it's not such a pain as to rule out C/C++ for numerical programming, not by a long shot.
Posted by: jonmoore | November 2, 2006 1:27 PM
But there's no way to write code in C or C++ that guarantees that.
Sure there is -- use the Standard Template Library, and use a vector or some such.
I have no clue if the compilers are smart enough to recognize that and do things that wouldn't be safe to use with pure pointers, but in principle it could be done.
I program mostly in C++, and I *almost never* use pointers any more. C++ still has its own quirks and problems, but the argument that "C++ is entirely a pointer-based language" is obselete in the face of the STL.
-Rob
Posted by: Rob Knop | November 2, 2006 1:28 PM
Rob:
The STL doesn't get rid of pointers - it just disguises them. STL is implemented using pointers! And in fact, much of the STL code is some of the most hideous gorp that I've ever had the misfortune to have to read. Might be fine from a users perspective (although I don't even like it from *that* point o view), but it's a mess internally. Optimizing for alias elimination is *not* a reason to use STL.
Posted by: Mark Chu-Carroll | November 2, 2006 1:34 PM
Gang:
Please try to remember that my point isn't that C and C++ suck; it's that the following statement from the linked article is *incorrect*:
My point is that not only is it not true that there is no language as fast as C and C++, but that there are languages that are often *faster*. The fallacy in the title of this post is that using C/C++ is *the only way* to write code if speed is your primary concern.
Posted by: Mark Chu-Carroll | November 2, 2006 1:37 PM
For non-real time systems, how much a performance hit is ok? If my app is 10% slower in language X than C++, but I can implement it twice as fast, then I get more time down the pub.
I do quite a bit of simulation work, and have given up on fighting compilers and languages and avoiding stupid bugs arising from trying to do clever things. Its not worth the hassle. Although, I guess your mileage will vary.
For me, writing bug free code is very important. C++ (and similar languages) offer too many places to make mistakes. In fiddly scientific computing this is a big issue. I wonder how many papers contain duff results because someone was accessing beyond array bounds in C?
Posted by: Simon | November 2, 2006 1:53 PM
Every C/C++ compiler that's used by people who care about performance has a version of restrict, __restrict, or __restrict__. The aliasing argument is certainly true if you've never written performant code and don't really know what you're talking about, but otherwise it's pretty bogus.
Second, you've got to be joking about OCaml. It's capped at 25% of my current PCs CPU resources, and that doesn't seem about to change, sadly: http://caml.inria.fr/pub/ml-archives/caml-list/2005/04/3fe33497071e1d9a065533d825ba848b.en.html
Distribution across multiple CPU resources will be the most important optimization consideration within 2-3 years, if it isn't already.
Posted by: Scott | November 2, 2006 2:32 PM
cleek:
Sometimes rewriting a loop in a way that makes it faster is the wrong thing to do.
The example that I mentioned with OCaml is a good one. The original code had two separate loops; OCaml merged them and then extracted local constants.
Could I have done the same thing by hand in the C code? Yes. But it would have been a stupid thing to do. The code was ultimately going to be used as a part of a large complicated system. The algorithm was based on taking two passes over the input arrays. The results of the first pass are used as inputs to the second pass. Semantically, the two are *really* very different.
It happens that the compiler can figure out that on the n+1th iteration of loop one, you have enough information to run the nth iteration of loop two, provided you play a few tricks - which is what the compiler did.
If I were to write the code as the merged loop, it would make it significantly harder to understand, and significantly harder to debug if there were a problem.
In writing real code, that's a common tradeoff. Even if the compiler *can't* do that for you, swapping a 20% increase in execution time for a significant improvement in readability and maintainability is a *very* good exchange. There are even common cases where a 100% performance penalty is acceptable in exchange for better code clarity. (For example, in that prototype of Stellation, there were some parts of the code that were highly IO bound. Processing a line of input takes *much* less time - on the order of 1/100th the time - than *reading* a line of input. So making the code to process a line take twice as long in order to make what's being done to it is fine, since the main effect of it is to reduce the amount of idle time the code spends waiting for the next line of input.
You don't want to deliberately *waste* cycles. But if you've got a choice between writing comprehensible, maintainable code, and writing the most perfectly optimal fast code, often you want to pick the clean stuff.
So going back to that original loop example: merging the two passes into a single pass that plays pointer tricks would be a terrible thing to do. It would take the code from a very clear two-pass algorithm into a very hard-to-follow one-pass.
Posted by: Mark C. Chu-Carroll | November 2, 2006 2:39 PM
wrt to aliasing why not just use the restrict keyword or some variant thereof that should do the trick. no ? also, i guess, compilers in general are conservative while doing code generation. it is _much_ better to have _slow_ but correct code, instead of fast but incorrect one. if you can nudge the compiler onto the right path, it should go a long way in improving code performance. but, when everthing is exhausted, you stil have assembly !
Posted by: anupam | November 2, 2006 2:55 PM
Any optimizations your C compiler is incapable of identifying you can always rectify with some inline assembly code!
Seriously, there's a reason we like compilers to optimize for us, and any C programmer should be well versed in what C can't do for you rather than just assert anything can be done. Of course anything can be done -- you can inline assembler.
Posted by: BMurray | November 2, 2006 2:56 PM
Sometimes rewriting a loop in a way that makes it faster is the wrong thing to do.
true. but only if speed is not your primary concern.
i write 2D image processing software professionally, and aside from correct results, speed is all my customers care about. they don't care if i have to use MMX/SSE or twist the logic into knots, they just want it to run faster than my competitors. they regularly send me benchmarks of my stuff vs. Lead and Pegasus - if i'm slow, they're unhappy. they don't care if the code is unmaintainable - they're not maintaining it, i am. so i do my best to keep it commented, so i'll at least know why i did what i did.
clarity is great. at my day job, i primarily use C#, and i go out of my way to simplify and clarify. but speed's not so high on the priority list, there. we can always buy a faster web server.
so... i agree, code clarity is important, but not always.
Posted by: cleek | November 2, 2006 3:02 PM
BMurray:
Actually, writing inline assembly code really isn't a good thing on modern architectures. On modern architectures,
making code run fast is incredibly hard - probably beyond the ability of almost any human being in a reasonable amount of time. You need to work out the memory access times, to figure out when to issue fetches and stores; you need to arrange instructions to keep the pipelines full without blocking, etc. A few years ago, we just routinely said we could throw in some hand-optimized assembler when we needed to; that's really not true anymore. Now, most hand-written assembler will be *worse* than what the compiler generates, because the compiler can optimize for for cache, pipeline, and memory and the tradeoffs between them.
Posted by: Mark C. Chu-Carroll | November 2, 2006 3:08 PM
Mark, you've switched arguments. You declared C and C++ unusable for scientific purposes, but now you're saying that we should simply consider other languages that might be slower or might be faster. What gives?
Posted by: billb | November 2, 2006 3:11 PM
billb:
I didn't intend to switch arguments :-). I just went a bit overboard in how I stated it. I never intended my point to be that no scientific code should be written in C/C++; I was responding to the idea in the linked article that scientific/numeric code should *only* be written in C++, and that *nothing else* is *as fast* as C/C++, and that further nothing else *will ever be* as fast as C/C++. But in fact, not only is C/C++ always the fastest thing, but it's often *not* the fastest, and there are numerous choices of languages that are at least as fast as C++.
Personally, *I* would *never* write scientific or numeric code in C or C++. But then, these days, I can't think of much of anything that I would voluntarily write in C++. I still use C from time to time, and think it's a very good language for low level OS-type code; I don't think it's a good choice for numeric/scientific. I've come to *despise* C++, and personally, I don't think it's suitable for much of anything. It's a giant wretched pile of muck.
You *can* write scientific code in C or C++. But it's a whole lot harder to make it run as fast as possible using a language like C/C++ than it is in a lot of other languages.
It really is a lot easier to write most kinds of numeric/scientific code in Fortran than in C++, and it's much easier for the compiler to generate optimal code for Fortran than for C/C++. Sisal routinely beat Fortran by 20%, and Fortran routinely beats C and C++.
Posted by: Mark C. Chu-Carroll | November 2, 2006 3:30 PM
Considering what can be done in C++, I'd say the level of muck (from negative muck to decidedly no muck to considerable muck) is up to the author. Of course, I'm getting ready to replace my input parser with embedded lua, so what do I know?
I'm not a C++ bigot by far, but I can make it do what I like, and it interfaces seamlessly with the C and POSIX standard libraries on the UNIX-like sytems I do all my work on (large clusters). Those two together make it hard for me to beat with anything else.
Posted by: billb | November 2, 2006 3:43 PM
"Actually, writing inline assembly code really isn't a good thing on modern architectures. On modern architectures,
making code run fast is incredibly hard - probably beyond the ability of almost any human being in a reasonable amount of time. You need to work out the memory access times, to figure out when to issue fetches and stores; you need to arrange instructions to keep the pipelines full without blocking, etc"
This is true but not true - for small functional units, especially in constrained systems, we have never found the compiler better than hand doing it. The compiler is often good enough that it isn't worth the effort, but once you get down to less than 50 "important" instruction units, people can iterate and get things as good or faster. You learn how many times you can unroll with a given workload, you can usually throw out some instructions, you really can do a better job than a compiler most of the time in getting superscalar order. There aren't really that many ways to be out of order that matter, and cache is of course really of supreme importance - data cache mostly.
Of course the amount of code this matters for is almost zero really. If we had just dumped doing this I don't know if anyone would really have noticed, ecept for not having hard limits, but it did make the programmers aware of things. (This was a couple years ago, but nothing has changed much I hear.)
Posted by: Markk | November 2, 2006 3:45 PM
for those interested, here is a bunch of straight forward (as in not part of huge projects) algorithms common in many fields of computer science, though you mustn't take it as some absolute benchmark like 3dmark is for example:
http://shootout.alioth.debian.org/gp4sandbox/
this does at times show the superiority of non C languages, though in grand majority of cases, C family holds the crown of performance.
Though what I find interesting about that benchmark is the "alternative" implementation section (multiple implementation in the same laguage). I particularly like the example of alt algorithm in Lua taking about 80% less time than "naive" version. The trick was that for computing alternating sum it's better to do 2 loops, one for plusses and one for minuses, and then subtract the results, than it is to do a one loop alternating summation. Though apparently that was interpreter's fault and has been fixed now, but there are other (less dramatic in performance diff) examples there.
That's the type of thing I love about that site.
Posted by: Alexei K | November 2, 2006 3:48 PM
hi there,
nice post.
"About a year later, testing a new JIT for Java, the Java time was down to 0.7 seconds to run the code, plus about 1 second for the JVM to start up."
I'm wondering what type of hardware you had. Even now, a JVM on a Pentium 4 2.x GHz 2 GB RAM on windows xp, it takes more than a second for the JVM to load and give control to the main() method.
Also, it should be noted that the C startup times can be as big/small as you want. You can trim down the libraries that get initialized and called to what you really need in the application. For the hardware that was popular 6 years ago (say year 2000), the C startup time can be less than 0.1 second.
BR,
~A
Posted by: anjan bacchu | November 2, 2006 3:59 PM
Are there any languages out there that are designed to take advantage of SMP machines? It would seem that would be awfully useful for scientific work.
Posted by: John | November 2, 2006 4:11 PM
anjan:
I don't remember the exact model. It was the IBM distro of the JVM (IBMs JIT was better than Suns), running under Linux on a single processor intel PC.
I'm surprised that you see such a slow startup for Java. My machine (a brand new MacBook Pro, Intel Core Duo 2Ghz) can run Java code much faster than that... Quick testing looks like it's under 0.2 seconds to main. Hard to nail it down any more precisely than that.
Posted by: Mark C. Chu-Carroll | November 2, 2006 4:12 PM
FWIW, here are some language benchmarks.
http://shootout.alioth.debian.org/gp4/benchmark.php?test=all&lang=all
C places first, OCaml places seventh, Java places 12th.
Posted by: Derek | November 2, 2006 4:14 PM
John:
There are lots of languages that try to take advantage of SMP. A few examples are: Sisal, which I mentioned above, was designed for parallel processing, and did a great job. One of the design points in Fortran-98 was to make it easy to write programs that could be compiled for various kinds of parallel machines, including SMP. SCL is a common lisp designed for scientific computing, and which has support for SMP. Lots of C/C++ compilers provide SMP support.
Posted by: Mark C. Chu-Carroll | November 2, 2006 4:22 PM
Nice thread.
Even though it's admittedly flawed, check out the Programming Languages Shootout (http://shootout.alioth.debian.org/). If you dive deep, you can see some surprising results.
However, I must admit that C is still not necessarily slower than anything else given manually inserted aliasing pragmas, profile based branching optimizations, CPU manufacturer provided compilers, etc.
Fortunately, only a tiny percentage use C like that. I bet most people don't even care enough to set the CPU arch flags right.
However, a lot of people do care about things like threading and multiprocessing to get performance. C can be in serious trouble here.
Posted by: Koray | November 2, 2006 4:24 PM
@John: that's why we have the all mighty MPI and its lesser brethren (by lesser, I'm talking about my opinion of them, not their theoretical ability). Languages as such don't really have parrallelism built in in any way I'd call serious, but some are made receptive to integration with other API, of which MPI is one. (you might be more familiar with Thread like api's that most languages have built in, but I don't like that programming paradigm and would never code using it)
@Derek: owned ya there by 26 min :)
Posted by: Alexei K | November 2, 2006 4:26 PM
Wha? There's plenty of evidence that C/C++ runs quickly in a lot of different situations. In fact, you yourself state:
* C: 0.8 seconds.
* C++: 2.3 seconds.
* OCaml: 0.6 seconds interpreted, 0.3 seconds fully compiled.
* Java: 1 minute 20 seconds.
* Python: over 5 minutes.
In this list, C/C++ aren't the fastest, but they're a whole lot better than some other languages on the list. Derek's link says the same thing: http://shootout.alioth.debian.org/gp4/benchmark.php?test=all〈=all
Further, you're making a mistake by saying that because C/C++ handles parallelism bad, it's all-around bad.
And you're showing that you're a language snob with that statement. All you're doing is trying to justify your own personal preferences by trying to find situations where C/C++ does badly and acting like it's all-around bad.
BTW, I've seen C++ code which is optimized for different CPUs, and setup to take advantage of the parallel processing inherent in those CPUs - you can get a big speed boost out of that kind of code. You're making the same mistake that so many other people make: you don't understand the language well enough or don't want to write the opimized routines, and then complain that the language is just inherently inferior, justifying your own preconceived ideas. There are lots of ways to judge languages (ease of use, personal preferences, ability to create programs quickly, speed, memory usage, etc), and even I don't like all programming languages, but I'm not going to construct some bad argument about why you shouldn't use language X, that I happen dislike.
Posted by: BC | November 2, 2006 4:42 PM
SMP ain't enough parallelism.
Posted by: billb | November 2, 2006 4:46 PM
This is totally random, but I'd like to point out that there was a game for Playstation 2 (can't remember the title, but it was one of the major blockbasters of the time) that was written *entirely* in Scheme, but it was some sort of compiled version of Scheme that actually produced assembler code for PS2.
I don't really mean this as a proof of anything, just thought it was a nice thing, and it made me for the first time consider that perhaps C family isn't the be-all-end-all of programming.
Posted by: Alexei K | November 2, 2006 4:50 PM
comments about stl efficiency are a little too hasty, imho.
have you seen blitz++?
http://www.oonumerics.org/blitz/
also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
"""Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices."""
""" Goals of the Blitz++ library
The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran's performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival -- and in some cases even exceed -- the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
Recent benchmarks show C++ encroaching steadily on Fortran's high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. """
Posted by: los alamos | November 2, 2006 5:00 PM
comments about stl efficiency are a little too hasty, imho.
have you seen blitz++?
http://www.oonumerics.org/blitz/
also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
"""Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices."""
""" Goals of the Blitz++ library
The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran's performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival -- and in some cases even exceed -- the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
Recent benchmarks show C++ encroaching steadily on Fortran's high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. """
Posted by: los alamos | November 2, 2006 5:02 PM
comments about stl efficiency are a little too hasty, imho.
have you seen blitz++?
http://www.oonumerics.org/blitz/
also see http://www.oonumerics.org and http://acts.nersc.gov/pooma/
"""Blitz++ is a C++ class library for scientific computing which provides performance on par with Fortran 77/90. It uses template techniques to achieve high performance. The current versions provide dense arrays and vectors, random number generators, and small vectors and matrices."""
""" Goals of the Blitz++ library
The C++ programming language offers many features useful for tackling complex scientific computing problems: inheritance, polymorphism, generic programming, and operator overloading are some of the most important. Unfortunately, these advanced features came with a hefty performance pricetag: until recently, C++ lagged behind Fortran's performance by anywhere from 20% to a factor of ten. As a result, the adoption of C++ for scientific computing has been slow.
Is there a way to soup up C++ so that we can keep the advanced language features but ditch the poor performance? This is the goal of the Blitz++ project: to develop techniques which will enable C++ to rival -- and in some cases even exceed -- the speed of Fortran for numerical computing, while preserving an object-oriented interface. The Blitz++ Numerical Library is being constructed as a testbed for these techniques.
Recent benchmarks show C++ encroaching steadily on Fortran's high-performance monopoly, and for some benchmarks, C++ is even faster than Fortran! These results are being obtained not through better optimizing compilers, preprocessors, or language extensions, but through the use of template techniques. By using templates cleverly, optimizations such as loop fusion, unrolling, tiling, and algorithm specialization can be performed automatically at compile time.
Another goal of Blitz++ is to extend the conventional dense array model to incorporate new and useful features. Some examples of such extensions are flexible storage formats, tensor notation and index placeholders. """
Posted by: los alamos | November 2, 2006 5:03 PM
MarkCC wrote:
"The example that I mentioned with OCaml is a good one. The original code had two separate loops; OCaml merged them and then extracted local constants.
Could I have done the same thing by hand in the C code? Yes. But it would have been a stupid thing to do. The code was ultimately going to be used as a part of a large complicated system. The algorithm was based on taking two passes over the input arrays. The results of the first pass are used as inputs to the second pass. Semantically, the two are *really* very different. "
Is their any reason why this is unique to OCaml? Couldn't *any* compiler do the same thing?
If this thread keeps going are you going to make C/C++ Friday's Pathological Language?
8^)
Posted by: KeithB | November 2, 2006 5:27 PM
KeithB:
Lots of compilers *could* do something like what the OCaml did; the key is whether or not it can determine enough information about the semantics of the program to be certain that the transformation is valid.
Most functional language compilers would do the equivalent transformation. I'm pretty sure that most of the really good CommonLisp compilers would; Sisal *definitely* did that kind of transformation (Sisal was designed specifically to make transformations like that easy); SML/NJ *might*.
I know that some Java JITs do that kind of transformation on a dynamic basis; the information to ensure its validity isn't there in a form that can be easily detected statically in Java, but the JIT gets to use dynamic information as well, and most JITs can recognize a loop as a hot spot in the code, analyze and generate specialized code with a very fast dynamic independence check.
Pretty much *every* high performance fortran compiler would do the equivalent transformation; most Fortran code would be tractable to recognize the validity.
I won't go so far as to say that no C++ compiler does it, but it's *very* hard to get enough static information about multidimensional arrays in C++ to be able to do something like that safely.
Posted by: Mark C. Chu-Carroll | November 2, 2006 5:41 PM
And WRT the question about whether I'd make C++ the weeks pathological language... In general, I don't consider C++ to be particularly pathological. Poorly designed? Yes. Pathological? No.
(The reason I said "in general" is because the template system in C++ can get pretty damned pathological. The C++ template system is, I *think*, Turing complete. I know for certain that it's at least primitive recursive. One of the tests that I saw a member of the C++ compiler team write was a program that made the compiler compute a factorial *at compile time*. Templates are very close to being a functional programming language of a particularly bizarre sort.)
Posted by: Mark C. Chu-Carroll | November 2, 2006 5:56 PM
There's nothing stopping the compiler from writing two versions of the function, a slow version for the aliased case and a fast, parallel version for the more common case. Or simpler, just throw a giant if statement around the whole function:
if (overlap(x,y)) {// ... slow ...
} else {
//... fast, parallel ...
}
If there are 20,000*20,000 iterations, then this optimization would be well worth it.
Posted by: Dan Vanderkam | November 2, 2006 6:42 PM
Erwin Unruh showed in 1994 (in connection with the ANSI C++ commitee) that you could use C++ templates to compute an arbitrary number of primes at compile time, and output them as compiler error messages.
More recently, Todd Veldhuizen produced something more direct, showing how to embed Turing machines as template instantiations.
As to your comment about templates being close to a bizarre functional programming language, it only takes a couple of lines of template code to produce a template providing the typical functional way of calculating factorials -- at compile time.
Posted by: JJS | November 2, 2006 7:15 PM
JJS:
Thanks for the pointer. I didn't know about Veldhuizen; his paper is a piece of diabolical genius. For anyone who's interested, here's the URL for a paper showing how to build a Turing machine using C++ templates, so that the compiler needs to run the TM simulation to compile the program!
http://osl.iu.edu/~tveldhui/papers/2003/turing.pdf
Brilliant!
Posted by: Mark C. Chu-Carroll | November 2, 2006 7:28 PM
My pet peeve is lumping C and C++ into the same basket. C++ is at its most horrid when it's programmed like you would program in C.
In the case of numeric programming, for example, C++ has some very nice BLAS bindings these days, like Boost.uBLAS. And unlike C or Fortran, they're even nice to use, because you can use the addition and multiplication operators instead of having to remember what SAXPY means.
Incidentally, try writing the LCSS algorithm in Haskell some time. Dynamic programming in Haskell is really, REALLY elegant. You set up a data structure full of unevaluated thunks which depend on other thunks, then just evaluate the element you want. Lazy evaluation takes care of the rest.
You can even do this with infinite data structures.
Posted by: Pseudonym | November 2, 2006 8:27 PM