Last time, we dug into a new debate: are languages really decoupled from the hardware that runs them? Do the semantics of a language model mainstream hardware, and should they? Can we escape to new models of hardware, and new models of semantics?
We saw that Backus made a good point when he argued that languages have grown fat and entrenched in the Von Neumann architecture; we can fairly wonder, then: have Von Neumann machines grown fat and entrenched in Von Neumann languages as well?
It’s a very reasonable idea: hardware and languages have evolved in parallel with each other.
A glorified Von Neumann language
Now tell me, which language comes to mind with the information I’ve given you so far? A low-level, performant, essentially glorified Von Neumann language: which one could that be… If you’re thinking of the C language, then we’re in sync. If not, let me convince you.
This language was designed to be, essentially, pretty assembly for a V.N. machine [1]: a variable corresponds to a memory cell, if statements are mere test-and-jump instructions, and assignments are the quintessential fetch-and-store instruction. And if we consider the industry doctrine of backwards compatibility, it’s easy to see how new development on architectures was bound to remain compatible with whatever code was already written. Following this course of events, we’re at a point where I would say that, quite blatantly, C is no longer a low-level and performant language per se: we have just built hardware to keep it that way.
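To make that mapping concrete, here is a minimal sketch; the assembly in the comments is illustrative pseudo-assembly of my own, not any particular ISA:

```c
/* Illustrative only: how a naive compiler could map C almost
 * one-to-one onto a Von Neumann machine. The "assembly" in the
 * comments is pseudo-assembly, not any specific instruction set. */

int counter = 0;            /* a variable: one memory cell, at some address */

void tick(int limit)
{
    counter = counter + 1;  /* fetch-and-store:  LOAD  r1, [counter]
                                                 ADD   r1, 1
                                                 STORE [counter], r1 */
    if (counter >= limit)   /* test-and-jump:    CMP   r1, limit
                                                 JLT   skip          */
        counter = 0;        /*                   STORE [counter], 0  */
                            /* skip: ...                             */
}
```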
Distancing itself from actual hardware
In David Chisnall’s words: “In the wake of the recent Meltdown and Spectre vulnerabilities [3], it’s worth spending some time looking at root causes. […] The features that led to these vulnerabilities, along with several others, were added to let C programmers continue to believe they were programming in a low-level language, when this hasn’t been the case for decades.”
Hasn’t been the case for decades? Why?
Modern VN Hardware
We all know it: Moore’s Law is hitting a snag. Clocks can’t get any faster, for many reasons, the main one being thermodynamics: higher frequencies mean temperatures rising faster than cooling systems can keep up with. There’s also quantum mechanics: if the clock goes faster, the area a pulse can reach within one tick gets smaller, so components must shrink, and quantum uncertainty hits: electrons start behaving unpredictably.
An alternative path to keeping Moore’s Law alive had to be found beyond just raising clock speeds, and it had to keep old code (read: C) running ever faster. Modern microprocessors do the job through many complex tricks, but all of them are in subtle conflict with the classic C machine.
Keep the CPU busy
Multiple cores is the option we all know: add more cores so that each one can execute another thread. A parallel (pun intended) improvement is ILP, short for Instruction-Level Parallelism, where several instructions are executed simultaneously. There are many ways to achieve this. One is register renaming: the processor detects sequences of instructions that are independent of each other and executes them in parallel, assigning a different set of physical registers to each. Another very old trick is pipelining: a single instruction takes several clock cycles to execute, but each stage uses a different part of the processor, so several instructions can be in flight at once, each occupying a different stage at every clock cycle. A related trick to keep code fast is branch prediction: a pipeline that has loaded the wrong instructions must be flushed, which is expensive, so the processor tries to predict which branch will be taken and executes it speculatively, or even executes both branches and flushes the wrong one once the condition is decided, thereby keeping the pipeline full.
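To get a feel for how much this matters, here is the classic branch-prediction experiment, sketched in plain C. It’s a sketch under assumptions: sizes are arbitrary, timings depend entirely on your CPU, and at high optimisation levels the compiler may vectorise the branch away, so compile with something like -O1 to see the effect. The very same loop runs much faster over sorted data, simply because its branch becomes predictable.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sum all elements >= 128: one data-dependent branch per element. */
static long sum_big(const int *a, int n)
{
    long sum = 0;
    for (int i = 0; i < n; i++)
        if (a[i] >= 128)       /* predictable on sorted input, ~random otherwise */
            sum += a[i];
    return sum;
}

static int cmp(const void *x, const void *y)
{
    return *(const int *)x - *(const int *)y;
}

int main(void)
{
    int *a = malloc(N * sizeof *a);
    for (int i = 0; i < N; i++)
        a[i] = rand() % 256;

    clock_t t0 = clock();
    long s1 = sum_big(a, N);           /* unsorted: the predictor guesses badly */
    clock_t t1 = clock();

    qsort(a, N, sizeof *a, cmp);       /* sorting makes the branch predictable */

    clock_t t2 = clock();
    long s2 = sum_big(a, N);           /* sorted: predicted almost perfectly */
    clock_t t3 = clock();

    printf("unsorted: %.3fs (sum=%ld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, s1);
    printf("sorted:   %.3fs (sum=%ld)\n", (double)(t3 - t2) / CLOCKS_PER_SEC, s2);
    free(a);
    return 0;
}
```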
Keep the Memory busy
The memory bottleneck is tackled with caching: data is staged in smaller, faster memory close to the processor, and highly complex cache-coherency mechanisms are built into the chip to keep all threads of execution in sync. Memory is fetched continuously and loaded into the faster cache, and guessing which segment of memory to fetch next again requires complex prediction mechanisms.
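And here is what using the cache well buys you, or costs you, in a minimal sketch (again illustrative: the measured gap depends on your cache hierarchy). Both loops sum the same matrix, but one walks it in the row-major order C lays it out in, while the other strides across memory and misses the cache on nearly every access.

```c
#include <stdio.h>
#include <time.h>

#define ROWS 4096
#define COLS 4096

int main(void)
{
    static int m[ROWS][COLS];   /* zero-initialised; ~64 MiB, far bigger than cache */
    long sum = 0;
    clock_t t;

    t = clock();
    for (int i = 0; i < ROWS; i++)      /* row-major: consecutive addresses, */
        for (int j = 0; j < COLS; j++)  /* cache lines fully reused          */
            sum += m[i][j];
    printf("row-major:    %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < COLS; j++)      /* column-major: 16 KiB stride, */
        for (int i = 0; i < ROWS; i++)  /* roughly one miss per access  */
            sum += m[i][j];
    printf("column-major: %.3fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sum & 1);              /* keep sum alive past the optimiser */
}
```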
C is not like that
But not all that glitters is gold: C threads are traditionally heavy, context switches take a lot of time, and creating and destroying threads is known to be expensive. Besides, C essentially models a sequential machine with a global flat memory, where parallelism must be expressed through painful locking mechanisms, and traditional code hasn’t scaled proportionally to the number of cores added.
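For the record, this is roughly what that model looks like in its POSIX incarnation; a minimal sketch with error handling omitted (compile with -pthread):

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS    1000000

static long counter = 0;                        /* shared, flat memory */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        pthread_mutex_lock(&lock);              /* without this: a data race, UB */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);   /* heavyweight OS threads */
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("%ld\n", counter);       /* correct total, yet hardly NTHREADS times faster */
    return 0;
}
```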
Even if we stick to one stream of execution, average C code has a branch roughly every seven instructions, which makes pipeline flushes far too probable and branch prediction ever more important; even worse when we consider that executing the wrong branch might read illegal memory before the pipeline flush comes in time [4]. C code believes it is executed sequentially, we all know the pain of its shared-memory model for multithreaded communication, and power-hungry machinery like the register-renaming engine is known to make some optimisations unsafe.
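This is why branch-avoiding idioms exist at all. A small sketch below, as an illustration rather than advice: a modern compiler will often turn both versions into a branchless conditional move on its own at high optimisation levels.

```c
#include <stdint.h>

/* Branchy version: one branch per call, at the mercy of the predictor. */
int32_t max_branchy(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branchless version: straight-line arithmetic, typically compiled to a
 * conditional move, so there is nothing to mispredict or flush. */
int32_t max_branchless(int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(a > b);   /* all ones if a > b, else zero */
    return (a & mask) | (b & ~mask);
}
```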
And regarding the flat memory model of C, we quote David Chisnall’s paper: “Efficient use of the cache is one of the most important ways of making code run quickly on a modern processor, yet this is completely hidden by the abstract machine, and programmers must rely on knowing implementation details of the cache […] to write efficient code.”
Modern C Compilers
If the language achieved its performance through proximity to the underlying machine, the compiler would be a mere translator. Indeed, if you had told a C programmer from the times of Ritchie (or a Fortran or assembly programmer, for that matter) that their compiler would reorder and heavily transform their code, they would have rejected the whole idea outright. But today we all rush to pass all those fancy flags to our compilers to make our programs faster, smaller, safer, and whatnot, distancing C code further and further from the idea that its performance comes from being close to the machine.
To take Chisnall’s example, a classic thing to do in C is to process a sequence of data in a loop: optimising this requires proving that the iterations are independent of each other and then trying to vectorise them, as vector code is much faster than scalar code [5]. But the compiler is not always free to reorder, for fear of breaking C layout guarantees such as padding and alignment.
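A minimal sketch of that tension, assuming a C99 compiler: without further information the compiler must assume the two pointers may alias, so it has to stay scalar-safe; `restrict` is the programmer’s promise that the iterations really are independent, and with it gcc and clang will typically vectorise the loop at -O2 or -O3.

```c
#include <stddef.h>

/* The compiler must assume dst and src may overlap: one iteration could
 * overwrite the next one's input, so it cannot freely use vector code. */
void scale_maybe_aliased(float *dst, const float *src, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}

/* restrict promises no aliasing: iterations are provably independent,
 * and the loop becomes a straightforward candidate for SIMD. */
void scale_independent(float *restrict dst, const float *restrict src,
                       size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * k;
}
```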
C is not like that either
To make things worse, some optimisations can even run into undefined behaviour. Consider loop unswitching: if branching is expensive under the risk of pipeline flushing, a loop with a condition inside is a plain nightmare. The compiler therefore transforms this torture into a conditional with two loops, one on each side, to facilitate vectorisation and reduce pipeline flushing, thereby changing the control flow from what the programmer had in mind. Now imagine a loop that ends up iterating zero times: if the branch condition depends on a variable that is only initialised when the loop actually runs, the hoisted test now reads that variable uninitialised: “Some dead code has now been transformed into undefined behaviour”.
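Sketched in C, following Chisnall’s argument (the function and variable names are mine):

```c
/* Loop unswitching: the optimiser hoists the loop-invariant test out of
 * the loop, trading a branch per iteration for a single branch guarding
 * two clean, vectorisable loops. */
void unswitch_demo(int *a, int n, int threshold)
{
    int mode;                       /* deliberately left uninitialised   */
    if (n > 0)
        mode = a[0] > threshold;    /* only set when the loop will run   */

    /* Original shape: `mode` is read only inside the loop body, so when
     * n == 0 that read is dead code and never executes. */
    for (int i = 0; i < n; i++) {
        if (mode)
            a[i] += 1;
        else
            a[i] -= 1;
    }

    /* After unswitching, the compiler effectively rewrites the loop as:
     *
     *     if (mode) { for (...) a[i] += 1; }
     *     else      { for (...) a[i] -= 1; }
     *
     * and `mode` is now read even when n == 0: dead code has become a
     * read of an uninitialised variable, i.e. undefined behaviour. */
}
```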
Old Lies and New Perspectives
Chisnall offers one more interesting sentence, one that for once feels perhaps too bold to me, but is nonetheless very interesting food for thought: “Perhaps it’s time to stop trying to make C code fast and instead think about what programming models would look like on a processor designed to be fast.” What would such a processor look like?
Next, in a series of monographs, we will explore these hardware architectures: the nitty-gritty details, the advantages and disadvantages, and the alternatives.
- [1] I’ll quote this paper extensively today.
- [3] I’ll talk about these catastrophic vulnerabilities in the near future.
- [4] This is the essence of the recent Meltdown and Spectre vulnerabilities, a point that Chisnall really tries to make in his paper. Consider the over-simplified `if (untrusted_input < 20) return kernel_data[untrusted_input];`, which might temporarily save, in some register, data from what would usually be an access violation.