Processors, Part One: the Past. The Architecture

How do we widen the processing bottleneck? If the solution were trivial, it wouldn’t merit a post – in fact the workarounds are of the kind that’s easy to invent but hard to formalise.

Just as megapixels are not a true indicator of the quality of a camera, and decibels are not a true indicator of the quality of a speaker, clock speed is not a true indicator of the performance of a processor. Let’s see just a little of what’s going on under the hood. Remember, this is all going to be just the tip of the iceberg.

More than just the GHz 1

Recently, I analysed the case of hardware being developed to make C code fast. To make the point, we need to remember what C code execution looks like, and think about how we can make it faster.

The abstract machine that the C language targets is, as I have said before, a sequential Von Neumann machine: memory is flat and lives far away from the arithmetic unit, its cells are addressed by name and fetched randomly, and instructions are executed one at a time. These are the three essential points we need to enrich.

On the levels of abstraction

Hierarchical Memory

In the last decades, just like batteries have not improved at the same speed our cellphones have, memory hasn’t sped up as much as processors. This is inherently a trade-off problem: we need bigger and bigger pools of memory that don’t draw too much power, don’t run too hot, and stay cheap. The solution is DRAM, whose memory cells are cheap and small enough to allow for high densities; that’s the only reason you can afford 32 GB of RAM in your laptop.

So if we need big memories, at the expense of making it harder to make them fast, we resort to a solution that Von Neumann already sketched: caches. We do the same when we memoise the Fibonacci function: instead of recomputing an exponentially expensive value, we remember the previous ones, making the next calculation very easy.
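As an aside, here is a minimal sketch of that memoisation idea in C (the table size and the fib helper are just for illustration):

[ccne_cpp]
#include <stdint.h>

/* Naive recursion recomputes the same values exponentially many times;
   remembering already-computed results turns each call into a couple of
   table lookups and one addition. Valid for n <= 93, the last Fibonacci
   number that fits in 64 bits. */
static uint64_t memo[94];

uint64_t fib(unsigned n) {
    if (n < 2)
        return n;
    if (memo[n] == 0)               /* not computed yet: compute and remember */
        memo[n] = fib(n - 1) + fib(n - 2);
    return memo[n];
}
[/ccne_cpp]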

We therefore build a layered architecture, where often-needed information is kept ready in smaller but faster memories, and less-needed information must be fetched from afar. As an analogy, imagine a craftsman at his desk. Next to him are his main tools: some pencils, a ruler, knives, plenty of different inks, and standard sheets of paper… On the shelf across the room he can find several hundred more inks, but he keeps his favourite few dozen on his desk, while on the other shelf across the room he keeps many different kinds of paper. Downstairs in the cellar, he keeps the dangerous chemicals he rarely needs, together with his printing machine, and an enormous stock of paper of all kinds and formats, just in case he runs out of a needed one upstairs, or suddenly needs a really weird kind of paper.

His desk is an arm’s reach away, he needs to stand up and walk a bit for the hundred inks, and going for some unique paper sheet in the cellar will take him a long time. But he obviously can’t fit everything on the desk.

Modern processors usually have three levels of cache living in-house, together with the cores. L1 and L2 are private to each core, while one big L3 is shared among them.
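As a hedged aside: on a Linux system with glibc you can ask the C library for these sizes directly (the _SC_LEVEL* names are a glibc extension, and some machines simply report 0):

[ccne_cpp]
#include <stdio.h>
#include <unistd.h>

/* Print the per-core L1/L2 sizes, the shared L3 size, and the line size.
   sysconf returns -1 or 0 where the value is not reported. */
int main(void) {
    printf("L1 data cache: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    printf("Cache line:    %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}
[/ccne_cpp]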

Pipelines

Instructions, being executed one at a time, can only be improved by executing them faster, which is a very unexciting path to research. Alternatively (and sticking to a single core, so as to keep the sequentiality of C), it’s easy to guess that executing many things at the same time would also be a big improvement, right?

An easy first step is to understand how a simple execution engine works: an instruction is fetched, it’s subsequently decoded, and then it’s executed. And the cycle repeats. But the parts of the processor that do the fetching, decoding, and executing are essentially three different pieces: hence, we can pipeline instructions. We fetch the first one, and while it’s being decoded, the fetcher is free to fetch a second. When the first instruction is decoded and sent to execution, the second instruction moves to the decoding phase, and a third one enters the fetching phase… This way, we’re loading instructions ahead of time: before the first one has even finished, two others are already moving.

Modern processors take this quite far (but not too far). Pipelines have 5 to 6 different stages, and more than 20 instructions can be concurrently in flight, therefore doing plenty of work ahead of time. A nice improvement.
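To make the overlap concrete, here is a toy sketch (not modelled on any real microarchitecture) of a three-stage pipeline, printing which instruction sits in which stage at each cycle:

[ccne_cpp]
#include <stdio.h>

#define N_INSTR  5
#define N_STAGES 3

int main(void) {
    const char *stage_name[N_STAGES] = { "fetch", "decode", "execute" };

    /* Instruction i occupies stage s during cycle i + s, so a new
       instruction enters the pipe every cycle even though each one
       needs three cycles end to end. */
    for (int cycle = 0; cycle < N_INSTR + N_STAGES - 1; ++cycle) {
        printf("cycle %d:", cycle);
        for (int i = 0; i < N_INSTR; ++i) {
            int s = cycle - i;
            if (s >= 0 && s < N_STAGES)
                printf("  I%d:%s", i, stage_name[s]);
        }
        printf("\n");
    }
    return 0;
}
[/ccne_cpp]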

But two problems arise with pipelines. First, an instruction might need data produced by the preceding one, and hence must wait for it. Second, an instruction might direct a far jump, turning the work already done on the incoming instructions into waste. In these cases the pipeline is usually flushed and started over, which is the reason you wouldn’t normally see much deeper pipelines than the current ones, as the price of flushing is indeed big. But fix-ups can be proposed to keep pipelines busy, and even to raise single-core concurrency. 2

Out-of-order execution

What if an instruction is totally independent from the preceding ones? What if the second is a much faster instruction than the first? They can be executed, instead of in the sequentially requested order, in any other order whatsoever: in parallel, or even in inverted order. As long as their results are visible to the outside world in the original sequential order, a processor is free to execute them as it pleases.

Consider the following program 3:

[ccne_cpp]
e = a + b   // operation 1
f = c + d   // operation 2
m = e * f   // operation 3: needs the results of 1 and 2
[/ccne_cpp]

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
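A minimal sketch of why this matters even in plain C (the two-way split and the function names are just for illustration): summing an array with a single accumulator creates exactly such a dependency chain, while splitting the sum into two independent accumulators gives the out-of-order core two chains it can overlap:

[ccne_cpp]
#include <stddef.h>

/* One accumulator: every addition must wait for the previous one. */
double sum_one_acc(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Two accumulators: the two chains are independent, so the core can
   execute additions from both of them concurrently. */
double sum_two_acc(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n)
        s0 += a[i];   /* odd tail element */
    return s0 + s1;
}
[/ccne_cpp]

(Note that with floating-point numbers the compiler is generally not allowed to do this reordering for you, since it changes rounding; the hardware, however, happily overlaps the two chains we wrote by hand.)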

Register Renaming 4

To push out-of-order execution even further, cores can reserve extra register sets, hidden from the software, and, taking again a Wikipedia example, make interesting improvements on code like this:

[ccne_cpp]
R1 = M[1024]    // 1
R1 = R1 + 2     // 2
M[1032] = R1    // 3
R1 = M[2048]    // 4
R1 = R1 + 4     // 5
M[2056] = R1    // 6
[/ccne_cpp]

Instructions 4, 5, and 6 are independent of instructions 1, 2, and 3, but the processor cannot finish instruction 4 until 3 is done, otherwise instruction 3 would write the wrong value. This restriction can be eliminated by changing the names of some of the registers:

[ccne_cpp]
R1 = M[1024]    // 1
R1 = R1 + 2     // 2
M[1032] = R1    // 3
[/ccne_cpp]
[ccne_cpp]
R2 = M[2048]    // 4
R2 = R2 + 4     // 5
M[2056] = R2    // 6
[/ccne_cpp]

Speculative execution

If you’re the craftsman I talked about above these lines, then tell me: on your limited desk, what would you choose to have? Or imagine you go to the cellar to pick up that weird letter-size papyrus for your next masterpiece: would you take only one sheet? Or if you need an H3 pencil from the wardrobe, now that you’ve already stood up from your seat, wouldn’t you just intuitively take a few H5 and H2 and HB as well? And if after sitting back in your chair you later need an H3, and the next day you need a B5… wouldn’t you next time simply take all of your pencils?

It’s called locality of reference: temporal locality and spatial locality. What you’ve needed before, you’re likely to need again. And what you need right now, you’re likely to need the thing just next to it.

Now, based on these ideas, the processor doesn’t know what your program might need, but it attempts to guess. If you need an address in memory, the processor will load the entire cache line it belongs to (that address together with its neighbours, typically 64 bytes), just in case, because it knows that chances are you’ll need them. If, on a condition, you always take the true branch, the next time that condition arises the processor might just execute the true path ahead of the answer, so that when the answer finally arrives from some slower medium, the job is already done for you. And if it was wrong, we just fetch memory again or flush out the spurious results, respectively, and continue as if nothing had happened.
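A small sketch of the spatial-locality point, assuming a matrix stored row-major as one flat array: the row-by-row walk uses every byte of each fetched cache line, while the column-by-column walk asks for a new line on almost every access, and is typically much slower for the same arithmetic:

[ccne_cpp]
#include <stddef.h>

/* Consecutive addresses: each fetched cache line is fully used. */
double sum_row_major(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

/* Strided addresses: almost every access touches a different line. */
double sum_col_major(const double *m, size_t rows, size_t cols) {
    double s = 0.0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
[/ccne_cpp]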

For a very useful note on how branch prediction might make your code a ton faster, please read this fantastic SO answer.
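As a hedged sketch of the effect described there (the function name and the 128 threshold are made up for illustration): the loop does identical work either way, but sorting the data first makes the branch predictable, so the speculated path is almost always the right one:

[ccne_cpp]
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Same data, same branch; sorting first makes the branch predictable,
   so speculative execution rarely has to throw work away. */
long sum_big_values(int *data, size_t n, int sort_first) {
    if (sort_first)
        qsort(data, n, sizeof *data, cmp_int);
    long sum = 0;
    for (size_t i = 0; i < n; ++i)
        if (data[i] >= 128)        /* hard to predict on random data */
            sum += data[i];
    return sum;
}
[/ccne_cpp]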

Back to the metal

Have we then made the processor faster? Where “instructions are executed one at a time”, we got not just multi-cores, but pipelines and concurrency within a single stream of execution, which is what C represents. Where “memory is flat and lives far away from the arithmetic unit”, we erected a hierarchical model built on proximity and speed versus sheer size. Where “the memory cells are addressed by name and fetched randomly”, we put together out-of-order and speculative execution, to cope with the risk of arbitrary jumps flushing the pipeline.

All sounds fine, right? Not to mention the many cores in our processors, right?

Nōn omne quod nitet aurum est

All that glitters is not gold: the L3 is a global state shared among all cores and threads, speculative execution can lead to guessing dangerous things, and inter-core communication in this memory model is forever prone to contention and deadlocks. Next, we will see what these issues are, why they arise, and what their consequences are, and we will explore some alternatives and modern research.


 

  1. http://www.lighterra.com/papers/modernmicroprocessors/
  2. For more information, this article was quite pleasant.
  3. https://en.wikipedia.org/wiki/Instruction-level_parallelism
  4. It must be noted that low-powered chips such as Cortex, Atom, or AVR are in-order, because logic like register renaming is extremely power-hungry!

