Processors, Part Three: the Future. New models.

If, as we saw, making one big complex thread run ever faster is risky, because there is a physical limit to sheer clock speed and the optimisations used to cheat around it are unsafe, then we all know the other option: doing many small, simple things independently. To put it differently: if we cannot make the clock any faster, nor speed up an essentially sequential stream of instructions, why don't we put many actors in many roles to work together instead?

In future posts, I want to explore models of computation other than the Turing Machine, which lead to different semantics and programming paradigms. But today, after going on at length reviewing our modern models of hardware, I want to set out some ideas about what different, perhaps futuristic, architectures away from the Von Neumann model might look like.

In Chisnall's article on the C language, referenced a few posts back, the researcher presents a few final ideas on how to imagine a non-C processor, together with a set of further readings and bibliography.

Amdahl’s Law

It has been said many times, by me and by everyone else: we're going to make our computers faster by adding more cores. But no discussion of speed-up in parallel computation would be complete without reference to Amdahl's Law 1.

$latex \displaystyle S = \frac{1}{(1-p) + \frac{p}{n}}$

where $latex S$ is the speedup that can be achieved when parallelising a fraction $latex p$ of a program across $latex n$ cores. It's easy to see, then, that to achieve a higher speedup we need a lower denominator, which means we need a higher parallelisable proportion $latex p$ to reduce the first term, and to let the second term shrink as $latex n$ grows.

$latex \displaystyle \lim_{n \to \infty} S = \frac{1}{1-p}$
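To get a feel for how punishing this is, here is a tiny sketch (the value of $latex p$ and the core counts are made up purely for illustration) that evaluates the formula:

```cpp
#include <cstdio>

// Amdahl's law: speedup when a fraction p of the work is parallelised
// across n cores, while the remaining (1 - p) stays sequential.
double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main() {
    const double p = 0.95;  // hypothetical: 95% of the program parallelises
    for (int n : {2, 8, 64, 1024}) {
        std::printf("n = %4d  ->  speedup = %.2f\n", n, amdahl(p, n));
    }
    // Even with p = 0.95, the speedup can never exceed 1 / (1 - p) = 20.
    std::printf("limit as n -> infinity: %.2f\n", 1.0 / (1.0 - p));
}
```

Even with 95% of the program parallelised, a thousand cores buy us less than a 20x speedup.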

But the model is limited by its simplicity: it doesn't take into account the price of context switches, or the contention for resources among threads. An extended version was presented by Eyerman and Eeckhout 2, including a contention probability $latex P_{ctn}$, and dividing the parallelisable fraction into two parts: $latex p_{ctn}$, which contends, and $latex p_{fr}$, which doesn't, where $latex p = p_{ctn} + p_{fr}$ and the sequential part is $latex s = 1-p$. In this new model, as the number of cores grows, the equation behaves as follows:

$latex \displaystyle \lim_{n \to \infty} S = \frac{1}{s + p_{ctn} \cdot P_{ctn}}$

Hence, we're required not only to increase the parallelisable part $latex p$, but also to reduce the probability of contention $latex P_{ctn}$, and the price $latex p_{ctn}$ paid for it. One way to achieve this is finer-grained abstractions: the finer each sequential stream of instructions, and the more streams there are, the more parallelised the program becomes, while contention is reduced.
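Sticking with made-up numbers, the sketch below shows how even a modest contention probability eats into the ceiling that plain Amdahl's Law promises:

```cpp
#include <cstdio>

int main() {
    // Hypothetical split of a program: sequential, contention-free parallel,
    // and contending parallel fractions (s + p_fr + p_ctn = 1).
    const double s     = 0.05;  // sequential part
    const double p_fr  = 0.75;  // parallel part that never contends
    const double p_ctn = 0.20;  // parallel part that may contend
    const double P_ctn = 0.30;  // probability that a contending access actually contends

    // Plain Amdahl limit: only the sequential part survives as n -> infinity.
    double amdahl_limit   = 1.0 / s;
    // Extended limit: the contending fraction keeps hurting us in proportion
    // to how often it actually contends.
    double extended_limit = 1.0 / (s + p_ctn * P_ctn);

    std::printf("Amdahl limit:   %.1f\n", amdahl_limit);    // 20.0
    std::printf("Extended limit: %.1f\n", extended_limit);  // ~9.1
}
```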

Fine-grained software

Software communication

Contention is essentially a by-product of sharing resources in the broad sense: because a given resource is shared among more than one process, its usage needs to be coordinated to ensure collaboration without mutual harm. Programs spread across threads will need access to ultimately limited resources, and to some form of intercommunication. Two major models exist: the Shared Memory Model, and the Message Passing Model.

Shared Memory

In this model, processes communicate through defined regions of shared memory. Two processes have access to the same memory region, and mechanisms must be provided to ensure that data races don't occur: locks, condition variables, and atomics. This is the model that almost every major language follows (C/C++, Java, JS, Python, everything), and it is what makes cache coherency so complicated: if a given shared memory line is cached by two cores and one of them modifies it, a protocol to keep all caches in sync needs to be designed. This model is easy to extend from a sequential Von Neumann conception, where a process can point, fetch and push to the whole memory pool, and its stream of execution is essentially sequential. A different abstraction than locking can be used, though: transactions, explained below.
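As a minimal sketch of this model in plain standard C++ (the counts are just illustrative), two threads mutate one shared counter and a lock keeps the increments from racing:

```cpp
#include <iostream>
#include <mutex>
#include <thread>

int counter = 0;          // shared, mutable state visible to both threads
std::mutex counter_lock;  // protects every access to `counter`

void work() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> guard(counter_lock);  // released at scope exit
        ++counter;  // without the lock, this read-modify-write would be a data race
    }
}

int main() {
    std::thread a(work), b(work);
    a.join();
    b.join();
    std::cout << counter << '\n';  // always 200000; the racy version could print less
}
```

A `std::atomic<int>` counter would remove the explicit lock, but the memory is still shared and the cache line still bounces between cores.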

Message passing

In the message passing model, in contrast, processes are strictly isolated from one another, and communication can only happen through specified channels. By enforcing this, there is no possibility of contention over shared data, though a process might be forced to wait for a message to arrive.
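C++ has no built-in channels, so the sketch below fakes a minimal one (the `Channel` class and its names are my own, built on a queue, a mutex and a condition variable); the point is that the two threads never touch each other's data, they only send and receive, even though the channel itself hides a lock internally:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// A hypothetical, minimal unbounded channel: the only way the two threads
// below communicate is by sending values through it.
template <typename T>
class Channel {
    std::queue<T> items_;
    std::mutex lock_;
    std::condition_variable ready_;

public:
    void send(T value) {
        { std::lock_guard<std::mutex> g(lock_); items_.push(std::move(value)); }
        ready_.notify_one();
    }
    T receive() {  // blocks until a message arrives
        std::unique_lock<std::mutex> g(lock_);
        ready_.wait(g, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }
};

int main() {
    Channel<int> ch;
    std::thread producer([&] { for (int i = 0; i < 3; ++i) ch.send(i); });
    std::thread consumer([&] { for (int i = 0; i < 3; ++i) std::cout << ch.receive() << '\n'; });
    producer.join();
    consumer.join();
}
```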

Lightweight threads

In current implementations, threads are known to be expensive, carrying a lot of state along with them. If a program is to be divided into arbitrarily many threads, two problems arise: first, context switches must be cheap, and second, thread scheduling must be efficient. While OSes ought to provide lighter abstractions, today these usually come from third-party runtimes at user level (the C++ runtime, the JVM, and whatnot), and much research is being done on cheaper context switches, such as lazy context switches or cached virtual memory maps.
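To make the "lighter abstraction at user level" idea concrete, here is a toy sketch of the usual workaround (not any particular runtime's implementation): thousands of small tasks multiplexed onto a handful of OS threads, so we never pay an OS-level thread creation or context switch per task:

```cpp
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Toy illustration: many cheap "tasks" run on only a few OS threads.
int main() {
    std::queue<std::function<void()>> tasks;
    std::mutex lock;

    for (int i = 0; i < 10000; ++i)            // many small units of work...
        tasks.push([i] { volatile int x = i * i; (void)x; });

    auto worker = [&] {
        for (;;) {
            std::function<void()> task;
            {
                std::lock_guard<std::mutex> g(lock);
                if (tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                            // ...executed by one of few OS threads
        }
    };

    std::vector<std::thread> pool;
    for (int i = 0; i < 4; ++i) pool.emplace_back(worker);  // only 4 OS threads
    for (auto& t : pool) t.join();
    std::cout << "done\n";
}
```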

STM 3

Locking mechanisms are an effective and simple way to resolve resource contention, but their complexity doesn't scale as the number of threads grows, especially when used in a fine-grained manner. Transactional Memory is instead an old, lock-free technique coming from the database world: accesses to shared regions are declared as transactions, and a writing transaction executes optimistically (believing it is the exclusive owner of the resource); if it detects conflicts at runtime, it can backtrack, abort the transaction without committing, and execute it again. The runtime responsible for this bookkeeping needs hardware support to be efficient 4.
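Real STM tracks whole sets of reads and writes, but the optimistic run-detect-retry shape can be sketched for a single shared cell with a compare-and-swap loop; this is a toy of my own, not a real STM runtime:

```cpp
#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

// Toy illustration of the optimistic pattern behind STM, for one shared cell:
// read a snapshot, compute the new value as if we were the exclusive owner,
// then commit only if nobody changed the cell in the meantime; otherwise
// back off and re-execute the "transaction".
std::atomic<int> balance{0};

void deposit(int amount) {
    for (;;) {
        int snapshot = balance.load();           // read phase
        int updated  = snapshot + amount;        // compute optimistically
        if (balance.compare_exchange_weak(snapshot, updated))
            return;                              // commit succeeded
        // conflict detected: another thread committed first, so retry
    }
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < 4; ++i)
        threads.emplace_back([] { for (int j = 0; j < 100000; ++j) deposit(1); });
    for (auto& t : threads) t.join();
    std::cout << balance << '\n';  // 400000, despite no locks and concurrent retries
}
```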

Fine-grained hardware

Architectural communication

In any given multi-core chip, communication between cores is essential. The dominant implementation today is the shared bus, where all communication happens through one global shared bus. Its major advantage is its simplicity: only one thing may travel through it at a time, naturally giving a strong ordering of messages, and communications are broadcast, removing the need for addressing protocols. But this method doesn't scale with increasing core counts: the traffic volume grows and the bus becomes a bottleneck; with growing chip size, the bus wires extend farther than a single clock cycle can reach; and the energy requirements quickly hit the power wall.

Networks-on-Chip, rather, are designed much like the internet: a NoC consists of a collection of routers, each one connected to a number of others, and messages are sent across the network until they reach their destination. Routing protocols, the coexistence of physical networks with virtual ones, and network buffering are all ongoing fields of research. Furthermore, NoCs are a perfect foundation for in-hardware message passing support and for more efficient and correct cache coherency protocols.

SIMD

Single Instruction, Multiple Data is a technique to exploit data-level parallelism rather than concurrency: if we have, for example, an array of bytes and we want to add 1 to each of them, a SIMD instruction applies the $latex +1$ operation to many bytes of the array simultaneously. This is especially useful for applications like graphics processing, where we often want to add, say, a bit of blue to every single pixel in the picture. Support for this kind of execution is extensive in modern hardware, but vectorisation of legacy code is often not possible.
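For instance, with SSE2 intrinsics (present on essentially every x86-64 machine; the buffer and its size are made up), adding 1 to sixteen bytes per instruction looks roughly like this, and a modern compiler will often auto-vectorise the equivalent plain loop on its own:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstdio>

int main() {
    alignas(16) std::uint8_t pixels[32];
    for (int i = 0; i < 32; ++i) pixels[i] = static_cast<std::uint8_t>(i);

    const __m128i ones = _mm_set1_epi8(1);          // sixteen copies of the byte 1
    for (int i = 0; i < 32; i += 16) {
        __m128i v = _mm_load_si128(reinterpret_cast<__m128i*>(pixels + i));
        v = _mm_add_epi8(v, ones);                  // sixteen +1 additions in one instruction
        _mm_store_si128(reinterpret_cast<__m128i*>(pixels + i), v);
    }

    std::printf("%d %d ... %d\n", pixels[0], pixels[1], pixels[31]);  // 1 2 ... 32
}
```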

Cache coherency

The cache coherency protocol is one of the hardest parts of a modern CPU to make both fast and correct. Most of the complexity involved comes from supporting a language in which data is expected to be both shared and mutable as a matter of course. Consider in contrast an Erlang-style abstract machine, where every object is either thread-local or immutable (Erlang has a simplification of even this, where there is only one mutable object per thread). A cache coherency protocol for such a system would have two cases: mutable or shared. A software thread being migrated to a different processor would need its cache explicitly invalidated, but that's a relatively uncommon operation 5.

Hardware Threading

As we mentioned, the overhead of the OS scheduler combined with a context switch is generally too high to scale across increasing core counts. Proposals have been made in academia, for example: to introduce hardware support for task queues that cores can push work onto, including priority queues; to make context switches lazy (currently supported for floating-point state in most standard Intel chips); and to make cores able to handle arbitrarily many threads at once, thereby reducing the frequency of context switches.

Further Reads

Most of the ideas in this post are extracted from Chisnall's and Chadwick's papers, each of which cites much more bibliography of its own. While I'm not an expert in ongoing hardware research, I think it's important to raise awareness of the problems and promises the field holds. After all, you need some hardware to run your programs, and questions like how decoupled the two could be from each other, how any given semantics should model a machine, and how your hardware can make your language better are important to every computer fan.

  1. https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-832.pdf [§3.1]
  2. Stijn Eyerman and Lieven Eeckhout. Modeling critical sections in Amdahl's law and its implications for multicore design.
  3. https://www.schoolofhaskell.com/school/advanced-haskell/beautiful-concurrency/3-software-transactional-memory
  4. Cascaval et al. Software Transactional Memory.
  5. https://queue.acm.org/detail.cfm?id=3212479
