Blog


Slowly getting there, perhaps

Just uploaded the 0.5 release of the simpleADL tools and associated documentation.

The toolkit itself seems to be able to do what it says on the can, although there are still no example programs which do any memory write operations. Shocking, really.

But this release adds some implementation stuff. Article 3 talks about how a processor is implemented, and presents time-tagging, a simple addition to an architectural simulator which provides quite useful performance estimation. It also discusses the need for caches, and exposes the key elements of the changes to the generated model for the r32 architecture needed to support cache simulation and time-tagging.

What’s time-tagging? It’s a simple technique - every resource in the system that can be written and read has an associated 64-bit time tag value. The interpreter has a notion of current time, stored in the variable now. We start off with every tag zero. When the interpreter performs an ‘add’ instruction, for example, it needs to read the two source registers and write the destination register, and the operation takes some time - for an add, there’s a latency of 1 clock. So the interpreter works out when the instruction can start executing - it’s the maximum of now and the time tags of the registers involved. It sets now to that result, then adds the latency of the operation to now and writes the sum into the time tag of the destination register.
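
In pseudo-C, the heart of the scheme looks something like this (a minimal sketch - the names and structure are illustrative, not the generated model’s actual code):

    #include <stdint.h>

    /* Illustrative sketch of time-tagging; not the generated model's actual code. */
    typedef struct {
        uint64_t value;   /* architectural contents */
        uint64_t tag;     /* time at which this value becomes available */
    } reg_t;

    uint64_t now = 0;     /* the interpreter's notion of current time */

    /* execute 'add rd, rs1, rs2', which has a latency of 1 clock */
    void do_add(reg_t *rd, const reg_t *rs1, const reg_t *rs2) {
        uint64_t start = now;                     /* earliest possible start...  */
        if (rs1->tag > start) start = rs1->tag;   /* ...limited by when the      */
        if (rs2->tag > start) start = rs2->tag;   /* source values are ready     */
        now = start;
        rd->value = rs1->value + rs2->value;      /* the architectural effect    */
        rd->tag = now + 1;                        /* result ready one clock later */
    }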

When we have caches, each line has its own tag, and so does main memory. 

Function units need a tag if they’re not completely pipelined.
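
In the same spirit, a load can be timed against the tags of the cache line and of main memory (again a sketch, with made-up latencies; a non-pipelined function unit is handled the same way, with an operation waiting on the unit’s tag):

    #include <stdint.h>

    /* Sketch: timing a load through a cache with per-line time tags.
       The names and latencies are illustrative. */
    enum { HIT_LATENCY = 2, MISS_LATENCY = 20 };   /* clocks */

    typedef struct {
        uint64_t tag;      /* time at which this line's data is available */
        /* ...address tag, valid bit, data... */
    } line_t;

    extern uint64_t now;          /* interpreter's current time */
    extern uint64_t memory_tag;   /* time tag for main memory   */

    static uint64_t max64(uint64_t a, uint64_t b) { return a > b ? a : b; }

    /* returns the time at which the loaded value is available */
    uint64_t time_load(line_t *line, int hit) {
        if (hit)
            return max64(now, line->tag) + HIT_LATENCY;
        /* miss: wait for memory, refill the line, then deliver the data */
        line->tag = max64(now, memory_tag) + MISS_LATENCY;
        return line->tag;
    }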

The whole scheme works quite well, without severely compromising the runtime performance of the interpreter.

Its weakness, of course, is that it doesn’t work so well when there are multiple agents - like multiple processors - making use of shared resources. But that’s a subject for the future.

Phew….

At last, a somewhat tidied version of the simpleADL tools and documentation has been published.

This release is still rather alpha-y, but seems to work with the two example architectures and their two example programs.

We’ve found problems with Xcode and symbolic links, and with make and symbolic links (macOS make doesn’t seem to follow the link back to the touch date of the original file, so doesn’t do what we’d like), and so we’ve fallen back on an odious install script which populates the various folders with the necessary files, builds things, and installs them in /usr/local/bin.


Architecture

Computer architecture has been annoyingly boring for a long time - on the whole, everybody’s doing RISC with a classical MMU and some form of cache coherence. There are honourable exceptions - the Mill is worth reading about - but in general it’s all samey-samey.

This is about to change as systems which need lots of performance increasingly adopt multicore designs, with core counts running into the tens, hundreds, or thousands. There are a bunch of examples:

  • Rex Computing has a (4-way VLIW) multicore solution which eschews caches and uses interrupts and ‘remote writes’ for intercore communication. 
  • Adapteva has a similar approach.
  • Kalray does likewise.

What these all seem to lack is a basic understanding that these things need to be programmed, and “thinking parallel” when all you have is a sequential programming language is hard. 

While there is progress being made (C++ atomics etc.), in general current widely-used programming languages either don’t support parallelism, or do it in an under-the-counter way (such as providing a threads library or equivalent, along with locks and so forth). This generally makes writing correct, efficient, comprehensible code difficult.

An alternative approach is to notice that much “high performance computing” code is nested loops, and that when you’re lucky the iterations of the loops are independent, so you can skip identifying the parallelism yourself and let a compiler do it for you. Or you can use an approach like OpenCL - you identify those loops, rewrite your code as program + kernels (where the kernels capture the loops), and arrange to run the kernels on some collection of compute resources.
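
To make the lucky case concrete, here’s the classic example (a minimal sketch; the pragma is OpenMP’s way of telling the compiler the iterations are independent):

    #include <stddef.h>

    /* saxpy: each iteration touches only its own x[i] and y[i], so there
       are no loop-carried dependences and the iterations can all run in
       parallel - this is exactly what a parallelising compiler looks for. */
    void saxpy(size_t n, float a, const float *x, float *y) {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }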

A more promising approach is to think of what the physics of this sort of parallel computing rewards and punishes. These multicore systems generally have a rectangular array of cores (each with some local memory) connected up with some form of mesh network - the network may provide one or more independent communications ‘planes’ to allow different sorts of data to proceed concurrently (examples - it can be helpful to have an ACK network separate from a data network, and if you’re moving cachelines-worth of data, it might also help to have an address network and a large data network).

This sort of system is capable of delivering data directly from one core into appropriate resources of another; you’re immediately tempted to think that an architecture which allows a program to wait until some new data has arrived, and then process it, would be a natural fit. And so it seems - this style of programming is generally known as message-passing, and has the advantage that you don’t need locks or other shared-data-protecting structures. Also, by providing message queues/buffers in memory or registers, the architecture can speed up communication and synchronisation substantially while reducing power consumption at the same time.
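
A minimal sketch of the style, with hypothetical msg_send/msg_recv primitives standing in for whatever the hardware queues actually provide:

    #include <stdint.h>

    /* Hypothetical message-passing primitives; on the machines described
       above these would map onto hardware queues or registers. */
    typedef struct { int src; uint64_t payload; } msg_t;

    extern void  msg_send(int core, msg_t m);  /* enqueue to another core   */
    extern msg_t msg_recv(void);               /* block until data arrives  */

    /* A worker just waits for work; no locks are needed, because the only
       shared state is the message itself. */
    void worker(int self) {
        for (;;) {
            msg_t m = msg_recv();                       /* wait for new data */
            uint64_t result = m.payload * 2;            /* ...process it...  */
            msg_send(m.src, (msg_t){ self, result });   /* send the reply    */
        }
    }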

But to include these features as an ordinary part of user mode program support you really want some new architecture. And the architecture tradeoffs for a very high core count system which leverages hardware message passing can be quite different from current mainstream practice.

And so there’s a new day dawning in processor/system architecture. Now all we need is a toolkit to go explore this new world.

As a first step in providing such a toolkit, we’re initiating a series of articles and software tools. The initial release provides an existence proof of how to define an architecture and automatically generate an executable model and an assembler from that specification. The capabilities of this initial tool are exceedingly limited - despite the conversation above, it’s single-processor only, for example. But most architects don’t use toolkits like this - they seem to work on squared paper or in text editors, writing plain text documents which seek to describe the architecture, and relying on tool makers to interpret what’s written as intended when constructing implementations, verification models and software tools. This inevitably leads to tears, so showing a wider audience that this stuff needn’t be black magic seems a useful goal.

Look for the simpleADL toolkit to appear here quite soon.


Filling In

Slowly adding to the site. First addition, a few words about myself (under People).

First Post

Creating this new Kiva Design website, kicking off the creation of the LLC…

The paperwork from the State arrived yesterday, making us official…



© kiva design groupe • 2017