On so called low-level and performant languages

Last time, we dug into a new debate: are lan­guages really decoupled from the hard­ware that runs them? Are the semantics of a lan­guage mod­el­ling main­stream hard­ware, and should they? Can we escape to new mod­els of hard­ware, and new mod­els of semantics?

We saw that Backus made a good point arguing that lan­guages have grown and fattened entrenched in the Von Neumann Architecture: we can fairly won­der then, have Von Neumann machines grown and fattened entrenched in Von Neumann lan­guages as well?

It’s a very reas­on­able idea: Hardware and Languages evolved in par­al­lel to each oth­er.

A glorified Von Neumann language

Now tell me, which lan­guage comes to your mind with the info I have giv­en to you so far? A low-level, per­form­ant, and essen­tially a glor­i­fied Von Neumann lan­guage, which one could that be… If you’re think­ing of the C lan­guage, then we’re on sync. Else, let me con­vince you.

This lan­guage was designed to be essen­tially a pretty assembly for a V.N. machine 1: a vari­able cor­res­ponds to a memory cell, if state­ments are mere test-and-jump instruc­tions, and assign­ments are the quint­es­sen­tial fetch-and-store instruc­tion. And if we con­sider the industry doc­trine of back­wards com­pat­ib­il­ity, it’s easy to see how new devel­op­ment on archi­tec­tures was set to remain com­pat­ible with whatever code was already writ­ten. And in this course of events we’re at a point where I would say that, quite blatantly, C is no longer a low-level and per­form­ant lan­guage per se, we have just build hard­ware to keep it that way.

Distancing itself from actual hardware

In David Chisnall’s paper 2 in the ACM Queue magazine, “C is Not a Low-Level Language”, the research­er gives a very inter­est­ing point of view.

In the wake of the recent Meltdown and Spectre vul­ner­ab­il­it­ies 3, it’s worth spend­ing some time look­ing at root causes. […] The fea­tures that led to these vul­ner­ab­il­it­ies, along with sev­er­al oth­ers, were added to let C pro­gram­mers con­tin­ue to believe they were pro­gram­ming in a low-level lan­guage, when this hasn’t been the case for dec­ades.

Hasn’t been the case for dec­ades? Why?

Modern VN Hardware

We all know it. Moore’s Law is hit­ting a snag. Clocks can’t get any faster for many reas­ons: the main one, ther­mo­dy­nam­ics. Higher fre­quen­cies mean tem­per­at­ures rais­ing faster than what cool­ing sys­tems can keep up with. There’s also the quantum-mech­an­ics: if the clock goes faster, the area that a pulse can reach in a tick gets smal­ler, there­fore the size shrinks, and quantum uncer­tainty hits: elec­trons start hav­ing unpre­dict­able beha­viour.

An altern­at­ive path to keep Moore’s Law act­ive was to be found oth­er than just incre­ment­ing clock speed, and it had to keep old code (read, C) faster and faster. Modern micro­pro­cessors do the job by many com­plex tricks, but they all are in slight con­flict with a clas­sic C machine.

Keep the CPU busy

Multiple cores is the option we all know: put more cores so each one can execute one more thread. A par­al­lel (pun inten­ded) improve­ment is ILP, short for Instruction Level Parallelism, where sev­er­al instruc­tions are executed sim­ul­tan­eously. There are many ways to achieve this, like register renam­ing: a pro­cessor can detect sequences of exe­cu­tion that are inde­pend­ent of each oth­er, and execute them in par­al­lel assign­ing to their exe­cu­tion dif­fer­ent register sets. Another very old trick is pipelin­ing, where if one instruc­tion takes sev­er­al clock cycles to execute, but each tick requires a dif­fer­ent part of the pro­cessor, we can execute sev­er­al instruc­tions at the same time if each one is, at every single clock cycle, depend­ing on a dif­fer­ent pro­cessor part. Another trick to keep code faster, related to pipelines, is branch pre­dic­tion: a pipeline that loads the wrong instruc­tion must be flushed, which is expens­ive, so the pro­cessor tries to pre­dict with branch will be the cor­rect one and execute it, or even executes all branches until one of them is decided to be flushed, there­fore keep­ing the pipeline full.

Keep the Memory busy

The Memory bot­tle­neck is cached in faster memory, and highly com­plex mech­an­isms of cache coher­ency are built into the chip to keep all threads of exe­cu­tion in sync. Memory is fetched con­tinu­ously and loaded into the faster cache, where guess­ing the seg­ment of memory to fetch next requires again com­plex pre­dic­tion mech­an­isms.

C is not like that

But not all that glit­ters is gold: C threads are tra­di­tion­ally heavy, con­text switches take a lot of time, and cre­ation and destruc­tion of threads is known to be expens­ive. Besides, C is essen­tially a sequen­tial machine with glob­al flat memory, where mod­el­ling par­al­lel­iz­a­tion is done through pain­ful lock­ing mech­an­isms, and tra­di­tion­al code hasn’t scale pro­por­tion­ally to the num­bers of cores added.

Even if we stick to one stream of exe­cu­tion, the aver­age C code has a branch­ing every sev­en instruc­tions, ren­der­ing pipeline flush­ing far too prob­able and mak­ing branch pre­dic­tion ever more import­ant; even worse when we con­sider that execut­ing the wrong branch might read illeg­al memory and pipeline flush­ing might not come in time 4. C code believes to be executed sequen­tially and we all know the pain of its shared memory mod­el for mul­ti­th­read­ing com­mu­nic­a­tions, and power-hungry engines like the register renam­ing one is known to make some optim­isa­tions unsafe.

And regard­ing the flat memory mod­el of C, we quote David Chisnell’s paper: “Efficient use of the cache is one of the most import­ant ways of mak­ing code run quickly on a mod­ern pro­cessor, yet this is com­pletely hid­den by the abstract machine, and pro­gram­mers must rely on know­ing imple­ment­a­tion details of the cache […] to write effi­cient code.

Modern C Compilers

If the lan­guage achieves its per­form­ance due to its prox­im­ity to the under­ly­ing machine, that should mean that a com­piler should be a mere trans­lat­or. Indeed, if you tell a C pro­gram­mer from the times of Ritchie, or of Fortran or Assembly for that mat­ter, that their com­piler would reorder and make strong changes to their code, they would have thrown the baby out with the bathwa­ter. But today, we all rush to put all these fancy flags to our com­pilers to make our pro­gram faster, smal­ler, safer, and what­not; dis­tan­cing C code from its reputed low-level idea of per­form­ance achieve­ment.

To quote Chisnall’s example, a clas­sic thing to do in C is pro­cessing a sequence of data through loop­ing: optim­ising this requires assur­ing that the iter­a­tions are inde­pend­ent and then try­ing to vec­tor­ise them, as vec­tor code is much faster than scal­ar code 5. But the com­piler is not always free to reorder, in fear of break­ing C lay­out guar­an­tees like pad­ding and align­ment.

C is not like that either

To make things worse, some optim­isa­tions can even run into undefined beha­viour. Consider loop unswitch­ing: if branch­ing is expens­ive under the risk of pipeline flush­ing, a loop with a con­di­tion is a plain night­mare. Hence, a com­piler attempts to trans­form this tor­ture into a con­di­tion­al with two loops, one on each side, in order to facil­it­ate vec­tor­isa­tion and reduce pipeline flush­ing, hence­forth chan­ging con­trol flow opposed to what the pro­gram­mer had in mind. And now ima­gine a loop that ends up being iter­ated zero times: if the branch depends on the loop index, then this index is now unini­tial­ised: “Some dead code has now been trans­formed into undefined beha­viour”.

Old Lies and New Perspectives

Chisnall says one more inter­est­ing sen­tence, that for once feels per­haps too bold to me, but it’s non­ethe­less a very inter­est­ing food of thought: “Perhaps it’s time to stop try­ing to make C code fast and instead think about what pro­gram­ming mod­els would look like on a pro­cessor designed to be fast.” How would such a pro­cessor look like?

Next, we will explore in a series of mono­graphs these hard­ware archi­tec­tures, the nitty-gritty details, the advant­ages and dis­ad­vant­ages, and the altern­at­ives.


 

  1. https://en.wikipedia.org/wiki/Von_Neumann_programming_languages
  2. I’ll quote this paper extens­ively today.
  3. I’ll talk about these cata­stroph­ic vul­ner­ab­il­it­ies in the near future.
  4. This is the essence of the recent Meltdown and Spectre vul­ner­ab­il­it­ies, a point that Chisnall really tries to make in his paper. Consider the over-sim­pli­fied if (untrusted_input < 20) return kernel_data[untrusted_input];, which might tem­por­ar­ily save in some register data from what would usu­ally be an access viol­a­tion.
  5. https://en.wikipedia.org/wiki/SIMD

2 Replies to “On so called low-level and performant languages”

  1. […] about our tech­no­lo­gic­al stack: GC? Multi-thread­ing? Hardware? Going “low-level” is per­haps not a thing of the present any­more, but work­ing more closely to our tool-chains will be import­ant: do know your com­pilers, […]

  2. […] Recently, I ana­lysed the case of hard­ware being developed to make C code fast. To make the point, we need to remem­ber how C code exe­cu­tion looks like, and think how can we make it faster. […]

Leave a Reply