# Processors, Part Three: the Future. New models.

If one big complex thread executed ever faster is risky, as we saw — because there's a physical limit to sheer clock speed, and cheating optimisations are unsafe — then we all know the other option: what about doing many small, simple things independently? To put it differently: if we cannot make the clock any faster, nor speed up an essentially sequential stream of instructions, why don't we put many actors in many roles to work together instead?

In future posts, I want to explore different models of computation apart from the Turing Machine, which will bring with them different models of semantics and programming paradigms. But today, after going on at length reviewing our modern models of hardware, I want to set down some ideas about what different, perhaps futuristic architectures beyond the von Neumann model might look like.

In Chisnall's article from a few posts back on the C language, the researcher presents us with a few final ideas on how to imagine a non‑C processor, together with a set of further reading and bibliography.

## Amdahl's Law

It has been said many times, by me and by everyone else: we're going to make our computers faster by adding more cores. But no discussion of speed-up in parallel computation would be complete without reference to Amdahl's Law [1]:

$latex \displaystyle S(n) = \frac{1}{(1 - f) + \frac{f}{n}}$

where $latex S(n)$ is the speedup that can be achieved when parallelising a fraction $latex f$ of a program across $latex n$ cores. It's easy to see, then, that to achieve higher speedup we require a lower denominator, which means that we need a higher parallelisable proportion $latex f$ to reduce the first term, $latex (1 - f)$, and to allow the second term, $latex f/n$, to be reduced as $latex n$ grows.

$latex \displaystyle \lim_{n \to \infty} S(n) = \frac{1}{1 - f}$

But the model fails for its simplicity: it doesn't take into account the price of context switches, nor the contention for resources among threads. An extended version was presented by Eyerman and Eeckhout [2], including a contention probability $latex P_{ctn}$, and dividing the parallelisable fraction into two parts: $latex f_{ctn}$, which contends, and $latex f_{ncnt}$, which doesn't, where $latex f_{ctn} + f_{ncnt} = f$ and the sequential part is $latex 1 - f$. In this new model, as the number of cores grows, the equation has the following behaviour:

$latex \displaystyle \lim_{n \to \infty} S(n) = \frac{1}{(1 - f) + f_{ctn} \cdot P_{ctn}}$

Hence, we're required not only to increase the parallelisable part $latex f$, but also to reduce the probability of contention $latex P_{ctn}$ and the contending fraction $latex f_{ctn}$ that pays its price. A method to achieve this is finer-grained abstractions: the finer each sequential stream of instructions, and the greater the number of streams, the more parallelised the program is, while contention is reduced.

## Fine-grained software

#### Software communication

Contention is essentially a by-product of shared resources in the broad sense: because a given resource is shared among more than one process, its usage needs to be arranged to ensure collaboration without mutual harm. Programs spread across threads will need access to eventually limited resources, and to some form of intercommunication. Two major models are presented: the Shared Memory Model, and the Message Passing Model.

##### Shared Memory

In this model, processes communicate across defined boundaries of shared memory. Two processes have access to the same memory frame, and mechanisms must be provided to ensure that data races don't occur: locks, condition variables, and atomics. This is the model that almost every major language follows (C/C++, Java, JavaScript, Python, everything), and it is what makes cache coherency so complicated: if a given shared memory line is cached by two cores and one of them modifies it, a protocol needs to be designed to keep all caches in sync. This model is easy to extend from a sequential von Neumann concept, where a process can point, fetch and push to the whole memory pool, and its stream of execution is essentially sequential. A different abstraction than locking can be used, though: transactions, explained below.

##### Message passing

In contrast, in the message-passing model, processes are strictly isolated from one another, and communication can only happen through specified message-passing channels. By enforcing this, there's no possibility of contention, though a process might be forced to wait for a message to arrive.

In current implementations, threads are known to be expensive, carrying a lot of state along. If a program is to be divided into arbitrarily many threads, two problems arise: first, context switches must be cheap, and second, thread scheduling must be efficient. While OSes should provide lighter abstractions, today often supplied by user-level runtimes (the C++ runtime, the JVM, and whatnot), much research is being done on cheaper context switches, such as lazy context switching or cached virtual memory maps.

#### STM [3]

Locking mechanisms are an effective and simple way to resolve resource contention, but their complexity doesn't scale as the number of threads scales, especially when used in a fine-grained manner. Transactional Memory, instead, is an old lock-free technique coming from the database world: accesses to shared regions are declared as transactions, and any writing transaction executes optimistically (believing it's the exclusive owner of the resource), but if it finds conflicts at runtime, it can backtrack, abort the transaction without committing, and execute it again. The runtime responsible for this administration needs hardware support to be efficient [4].

## Fine-grained hardware

#### Architectural communication

In any given multi-core chip, communication between cores is essential. The standard implementation reigning today is the shared bus, where all communications happen through one global shared bus. Its major advantage is its simplicity: only one thing may travel through it at a time, naturally giving a strong ordering of messages, and communications are broadcast, removing the need for addressing protocols. But this method doesn't scale with increasing core counts, as traffic volume increases and the bus becomes a bottleneck; with growing chip size, the bus wires extend farther than a single clock cycle can reach, and the energy requirements quickly hit the power wall.

Networks-on-Chip, instead, are designed the way the internet works: a collection of routers, each one connected to a number of others, with messages sent across the network until they reach their destination. Routing protocols, the coexistence of physical networks with virtual ones, and network buffering are all ongoing fields of research. Furthermore, NoCs are a perfect foundation for in-hardware message-passing support and for more efficient and correct cache coherency protocols.

#### SIMD

Single Instruction Multiple Data is a technique to exploit data-level parallelism rather than concurrency: if we have, for example, an array of bytes, and we want to add 1 to each one of them, a SIMD instruction would apply the $latex +1$ operation to many bytes of the array simultaneously. This is especially useful for applications like graphics processing, where we often want to add, say, a bit of blue to every single pixel in the picture. Support for this kind of execution is extensive in modern hardware, but vectorisation of legacy code is often not possible.

#### Cache coherency

The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course. Consider, in contrast, an Erlang-style abstract machine, where every object is either thread-local or immutable (Erlang simplifies even this, with only one mutable object per thread). A cache coherency protocol for such a system would have two cases: mutable or shared. A software thread migrated to a different processor would need its cache explicitly invalidated, but that's a relatively uncommon operation. [5]