How do we widen the processing bottleneck? If the solution were trivial, it wouldn’t merit a post; in fact, the workarounds are of the kind that’s easy to invent but hard to formalise.
Just as megapixels are not a true indicator of the quality of a camera, and decibels are not a true indicator of the quality of a speaker, clock speed is not a true indicator of the performance of a processor. Let’s see just a little of what’s going on under the hood. Remember, this is all going to be just the tip of the iceberg.
More than just the GHz 1
Recently, I analysed the case of hardware being developed to make C code fast. To make the point, we need to remember what C code execution looks like, and think about how we can make it faster.
The abstract machine that the C language codes to is, as I said before in On the levels of abstraction, a sequential Von Neumann machine, where memory is flat and lives far away from the arithmetic unit, its cells are addressed by name and fetched randomly, and instructions are executed one-at-a-time. These are the three essential points we need to improve upon.
Hierarchical Memory
In the last decades, just as batteries haven’t improved at the pace our cellphones have, memory hasn’t sped up as much as processors have. This is inherently a trade-off problem: we need bigger and bigger pools of memory that don’t draw too much power, don’t run too hot, and are cheap. The solution is DRAM, whose memory cells are cheap and small enough to allow for high densities: that’s the only reason you can afford 32 GB of RAM in your laptop.
So, since we need big memories at the expense of making them harder to speed up, we resort to a solution that Von Neumann already sketched: caches. We do the same when we memoise the Fibonacci function: instead of recomputing an exponentially expensive value, we remember the previous results, making the next calculation very easy.
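To see the idea in code, here is a minimal sketch in C of that memoised Fibonacci (the table size and names are mine, purely for illustration): the first computation of each value is expensive, and every later request becomes a cheap table lookup, which is exactly the job of a cache.

[ccne_cpp]
#include <stdio.h>

/* A tiny "cache": results already computed are served from the table
   instead of being recomputed exponentially often. */
static long long cache[93]; /* fib(92) is the largest that fits in a signed 64-bit */

long long fib(int n)
{
    if (n < 2)
        return n;
    if (cache[n] != 0)                          /* cache hit: cheap fetch    */
        return cache[n];
    return cache[n] = fib(n - 1) + fib(n - 2);  /* cache miss: compute, store */
}

int main(void)
{
    printf("%lld\n", fib(90)); /* instantaneous; the naive recursion would not be */
    return 0;
}
[/ccne_cpp]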
We therefore build a layered architecture, where often-needed information sits ready in smaller but faster memories, and less-needed information must be fetched from afar. As an analogy, imagine a craftsman at his desk. Next to him are his main tools: some pencils, a ruler, knives, plenty of different inks, and standard sheets of paper… On a shelf across the room he can find several hundred more inks, though he keeps only his favourite few dozen on the desk, while another shelf holds many different kinds of paper. Downstairs in the cellar, he keeps the dangerous chemicals he rarely needs, together with his printing machine, and an enormous stock of paper of all kinds and formats, just in case he runs out of one upstairs, or suddenly needs a really weird kind of paper.
His desk is a hand away; he has to stand up and walk a bit for the hundreds of inks; and going down to the cellar for some unique sheet of paper takes him a long time. But he obviously can’t fit everything on the desk.
Modern processors usually have three levels of cache living on-chip, together with the cores: L1 and L2 are private to each core, while one big L3 is shared among them all.
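To actually feel the hierarchy from software, here is a small, unscientific sketch of mine (it assumes 4-byte ints and the common 64-byte cache line, and the timings will of course vary per machine). Both loops perform the same additions over the same array, but the second walks it one cache line at a time in sixteen passes, so every line is fetched sixteen times with only four of its sixty-four bytes used per visit:

[ccne_cpp]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)   /* 16 Mi ints = 64 MB, far bigger than any cache */

int main(void)
{
    int *a = malloc(N * sizeof *a);
    long long sum = 0;
    clock_t t;

    if (!a) return 1;
    for (int i = 0; i < N; i++) a[i] = i;

    t = clock();
    for (int i = 0; i < N; i++)         /* sequential: each fetched line   */
        sum += a[i];                    /* serves the next 15 accesses too */
    printf("sequential: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < 16; j++)        /* strided: one int per cache line, */
        for (int i = j; i < N; i += 16) /* sixteen passes over data too     */
            sum += a[i];                /* big to stay cached               */
    printf("strided:    %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(a);
    return sum == 0; /* use sum so the compiler keeps the loops */
}
[/ccne_cpp]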
Pipelines
Instructions executed one at a time can only be improved by executing each one faster, which is a very unexciting path of research. Alternatively (and sticking to a single core, so as to preserve the sequentiality of C), it’s easy to guess that executing many things at the same time would also be a big improvement, right?
An easy first step is to understand how a simple execution engine works: an instruction is fetched, subsequently decoded, and then executed; and the cycle repeats. But the parts of the processor that do the fetching, decoding, and executing are essentially three different pieces: hence, we can pipeline instructions. We fetch the first one, and while it’s being decoded, the fetcher is free to fetch a second. When the first instruction is decoded and sent to execution, the second moves to the decoding phase, and a third enters the fetching phase… This way, we’re loading instructions ahead of time: before the first one has even finished, two others are already moving.
Modern processors take this quite far (but not too far). Pipelines have anywhere from five or six stages to a dozen or more, and more than 20 instructions can be in flight concurrently, doing plenty of work ahead of time. A nice improvement.
But two problems arise with pipelines. First, an instruction might need data produced by the preceding one, and hence must wait for it. Second, an instruction might direct a far jump, making the instructions already fetched behind it wasted work. In these cases, the pipeline is usually flushed and restarted, which is why you wouldn’t normally see much deeper pipelines than the current ones: the price of flushing is indeed big. But fixes exist to keep pipelines busy, and even to raise single-core concurrency. 2
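To make the two problems concrete, here is a hypothetical C fragment (the names are illustrative, not from any real codebase), annotated with where a pipeline would stall or flush:

[ccne_cpp]
int classify(const int *table, int i, int threshold)
{
    /* Data hazard: the add needs x, and the load just above hasn't
       delivered it yet, so the later stages must wait for it. */
    int x = table[i];
    int y = x + 1;

    /* Control hazard: until the comparison resolves, the fetcher
       doesn't know which side of the branch to keep feeding into the
       pipeline; a wrong guess flushes everything fetched after the if. */
    if (x > threshold)
        y = y * 2;
    else
        y = y / 2;
    return y;
}
[/ccne_cpp]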
Out-of-order execution
What if an instruction is totally independent of the preceding ones? What if the second is a much faster instruction than the first? Instead of the requested sequential order, they can be executed in any other order whatsoever: in parallel, or even inverted. As long as their results become visible to the outside world in the original sequential order, the processor is free to execute them as it pleases.
Consider the following program 3:
[ccne_cpp]
e = a + b   // operation 1
f = c + d   // operation 2
m = e * f   // operation 3
[/ccne_cpp]
Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
Register Renaming 4
To push out-of-order execution even further, cores can keep spare register sets, hidden from the software, and (taking again a Wikipedia example) make interesting improvements on code like this:
[ccne_cpp]
R1 = M[1024]   // instruction 1
R1 = R1 + 2    // instruction 2
M[1032] = R1   // instruction 3
R1 = M[2048]   // instruction 4
R1 = R1 + 4    // instruction 5
M[2056] = R1   // instruction 6
[/ccne_cpp]
Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish instruction 4 until instruction 3 is done, otherwise instruction 3 would write the wrong value. This restriction can be eliminated by renaming some of the registers, after which the two halves can run in parallel:
[cce_cpp]
R1 = M[1024]   // instruction 1
R1 = R1 + 2    // instruction 2
M[1032] = R1   // instruction 3
[/cce_cpp]

[cce_cpp]
R2 = M[2048]   // instruction 4
R2 = R2 + 4    // instruction 5
M[2056] = R2   // instruction 6
[/cce_cpp]
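The same trick is available by hand at the source level. A classic manual optimisation (a sketch of mine, not from the Wikipedia article) splits one accumulator, which forms a single long dependency chain, into two independently named ones whose additions the core can overlap: in spirit, register renaming done yourself.

[ccne_cpp]
long long sum_one(const int *a, int n)
{
    long long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                /* every add waits on the previous one */
    return s;
}

long long sum_two(const int *a, int n)
{
    long long s0 = 0, s1 = 0;     /* two "renamed" accumulators        */
    for (int i = 0; i + 1 < n; i += 2) {
        s0 += a[i];               /* these two chains are independent  */
        s1 += a[i + 1];           /* and can execute in overlap        */
    }
    if (n % 2)
        s0 += a[n - 1];           /* leftover element when n is odd    */
    return s0 + s1;
}
[/ccne_cpp]

Compilers will often perform this split themselves for integer sums; for floating point, where reassociation changes the result, they generally won’t unless you explicitly ask for it.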
Speculative execution
If you’re the craftsman I talked about above these lines, then tell me: on your limited desk, what would you choose to have? Or imagine you go down to the cellar to pick up that weird letter-size papyrus for your next masterpiece: would you take only one sheet? Or if you need an H3 pencil from the shelf, now that you’ve already stood up from your seat, wouldn’t you intuitively take a few H5 and H2 and HB as well? And if, back in your chair, you find you need yet another grade, and the next day yet another… wouldn’t you next time simply take all of your pencils?
It’s called locality of reference: temporal locality and spatial locality. What you’ve needed before, you’re likely to need again; and what you need right now, you’re likely to soon need the thing just next to it.
Now, based on these ideas, the processor doesn’t know what your program will need, but it attempts to guess. If you ask for an address in memory, the processor loads the entire cache line (that address and its neighbouring bytes, typically 64 in total), because chances are you’ll need them. If a condition has always come out true, the next time it arises the processor may execute the true path ahead of the answer, so that when the answer finally arrives from some slower medium, the job is already done for you. And if it guessed wrong, we just fetch memory again or flush out the spurious results, respectively, and continue as if nothing had happened.
For a very useful note on how branch prediction might make your code a ton faster, please read this fantastic SO answer.
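For the impatient, here is the gist of that answer, reconstructed as a rough sketch (exact numbers will differ per machine, and an aggressive optimiser may remove the branch altogether): the same loop runs twice over the same values, and only the predictability of the branch changes.

[ccne_cpp]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

static int cmp(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b; /* safe: values are 0..255 */
}

int main(void)
{
    int *data = malloc(N * sizeof *data);
    long long sum = 0;
    clock_t t;

    if (!data) return 1;
    srand(42);
    for (int i = 0; i < N; i++) data[i] = rand() % 256;

    t = clock();
    for (int k = 0; k < 100; k++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)      /* taken at random: mispredicts often */
                sum += data[i];
    printf("unsorted: %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    qsort(data, N, sizeof *data, cmp);

    t = clock();
    for (int k = 0; k < 100; k++)
        for (int i = 0; i < N; i++)
            if (data[i] >= 128)      /* a long run of false then true:   */
                sum += data[i];      /* almost always predicted correctly */
    printf("sorted:   %.2fs\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    free(data);
    return sum == 0; /* use sum so the compiler keeps the loops */
}
[/ccne_cpp]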
Back to the metal
Have we made the processor faster, then? Where “instructions are executed one-at-a-time”, we got not just multiple cores, but pipelines and concurrency within a single stream of execution, which is what C represents. Where “memory is flat and lives far away from the arithmetic unit”, we erected a hierarchical model trading proximity and speed against sheer size. Where “the memory cells are addressed by name and fetched randomly”, we put together out-of-order and speculative execution, to cope with the risk of arbitrary jumps flushing the pipeline.
All sounds fine, right? Not to mention the many cores in our processors!
Nōn omne quod nitet aurum est
All that glitters is not gold: the L3 is a global state shared among all cores and threads, speculative execution can end up guessing dangerous things, and inter-core communication in this memory model is perpetually bound to contention and deadlocks. Next, we will see what these issues are, why they arise, and what their consequences are, and we will explore some alternatives and modern research.
1. http://www.lighterra.com/papers/modernmicroprocessors/
2. For more information, this article was quite pleasant.
3. https://en.wikipedia.org/wiki/Instruction-level_parallelism
4. It must be noted that low-powered chips such as the Cortex, Atom, or AVR are in-order, because logic like register renaming is extremely power-hungry!