Blog


Gemini


We have just received our Planet Computers Gemini: an ARM-based computer with screen and keyboard, strongly reminiscent of a Psion Series 5 machine (and, indeed, with some of the same people involved).

Details at https://www.planetcom.co.uk/?page_id=8



In short, a 10-core machine (A53s and A72s) with 4GB RAM and 64GB storage, expandable with an SD card of up to 256GB, and quad-bootable.

Ours has the Community Edition of Sailfish OS installed; it’s a Linux with cellphone support (the Gemini is a cellphone, too) and it isn’t Android. If Sailfish turns out to be a Bad Idea, there’s always Debian, and apparently Ubuntu too, one day soonish.

Started a separate page on getting things done with the Gemini. The idea, obviously, is to have the thing be a useful portable demonstrator of the (eventual) Good Stuff on this site, which will mean porting llvm to the machine and building thereon…

Progress….


Yep, had to rip up the instruction decode representation, but the new form is simpler to build and to walk. It’s essentially just a direct representation of a decode tree, with a tree of structures; in each structure, the level below is captured as an array of pointers to the decode nodes below. The bottom level, below which there are only instructions, holds a pointer to a single queue of instructions.

We capture the instructions as they arrive, and insert them into the single queue in field order (there’s always a ‘final opcode field’ to define this instruction rather than that, and that’s the field we insertion sort on).

Insertion sorting means duplicates are easily found.
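To make the shape of this concrete, here is a minimal sketch of such a decode node and of the sorted insert at the bottom level. All names here (DecodeNode, Instr, leaf_insert, FANOUT) are hypothetical, invented for illustration, and the fixed fan-out is an assumption; teq's real structures may well differ. The point is only that a sorted insert lands right next to any instruction with the same final opcode, so the duplicate check costs nothing extra.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout: a sketch, not teq's actual structures. */
#define FANOUT 4                      /* children per decode node (assumed) */

typedef struct Instr {
    const char   *mnemonic;
    uint32_t      final_opcode;       /* value of the 'final opcode field' */
    struct Instr *next;               /* single-forward-linked queue */
} Instr;

typedef struct DecodeNode {
    struct DecodeNode *child[FANOUT]; /* the level below; all NULL at the bottom */
    Instr             *instrs;        /* bottom level only: queue sorted by final_opcode */
} DecodeNode;

/* Insert ins into the leaf's queue in ascending final_opcode order.
 * Returns 0 on success, or -1 if an instruction with the same final
 * opcode is already queued (a decode clash); ins is then not inserted. */
int leaf_insert(DecodeNode *leaf, Instr *ins)
{
    assert(leaf != NULL && ins != NULL);

    Instr **link = &leaf->instrs;
    while (*link != NULL && (*link)->final_opcode < ins->final_opcode)
        link = &(*link)->next;

    /* Any duplicate must sit exactly at the insertion point. */
    if (*link != NULL && (*link)->final_opcode == ins->final_opcode)
        return -1;

    ins->next = *link;
    *link = ins;
    return 0;
}
```

A caller would walk the child[] arrays down to the bottom level using the earlier opcode fields, then call leaf_insert() and report an error on -1.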

Also looked at the error reporting, and tidied that up extensively. The whole thing seems much more robust now - trying it on an increasing number of architectures. In particular, we want an architecture whose instruction set is as close as possible to one-to-one with the abstract architecture implied by the triples - if they were instructions, what would they look like? This led to a new field type being needed - an ‘operand’, whose definition includes its base register. So now we can represent accumulator machines simply, too.

fields {
    // the name of a field, for assembler definition, is the name given here
    op8[0:7]       opcode;   // 4 bits of major opcode unsigned value
    op9[9:16]      opcode;
    l_dest[9:15]   opndsel fptr srcdest;
    l_src1[16:23]  opndsel fptr src;
    l_src2[24:31]  opndsel fptr src;
}
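Hand-assembling a program against fields like these needs some way to pack values into an instruction word and pull them out again. A minimal sketch in C follows, under two loud assumptions: that name[lo:hi] denotes an inclusive bit range in a 32-bit word, and that bit 0 is the least significant bit. Neither is stated above, and the helper and macro names (field_mask, GET_L_SRC1 and friends) are invented here; they are not what teq generates.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: [lo:hi] is an inclusive bit range, bit 0 = least significant. */
static inline uint32_t field_mask(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 32);
    unsigned width = hi - lo + 1;
    /* Shifting a 32-bit value by 32 is undefined, so treat it specially. */
    uint32_t ones = (width >= 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return ones << lo;
}

/* Extract the field value, right-justified. */
static inline uint32_t field_get(uint32_t word, unsigned lo, unsigned hi)
{
    return (word & field_mask(lo, hi)) >> lo;
}

/* Return word with the field replaced by value (excess bits of value dropped). */
static inline uint32_t field_put(uint32_t word, unsigned lo, unsigned hi,
                                 uint32_t value)
{
    uint32_t mask = field_mask(lo, hi);
    return (word & ~mask) | ((value << lo) & mask);
}

/* Hypothetical per-field macros for l_src1[16:23]. */
#define L_SRC1_LO 16
#define L_SRC1_HI 23
#define GET_L_SRC1(w)    field_get((w), L_SRC1_LO, L_SRC1_HI)
#define PUT_L_SRC1(w, v) field_put((w), L_SRC1_LO, L_SRC1_HI, (v))
```

With these, PUT_L_SRC1(word, 0x5A) sets bits 16 to 23 of word and leaves the rest alone. If the real bit numbering runs the other way, only field_mask() needs to change.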

teq


Oof..

I’m generating correctly-executable code for an arbitrarily large number of processors from an architecture description and a ‘hand-assembled’ program (using generated field macros).

And now I begin to see the lacunae….

For example, in thinking about how to incrementally add instructions, I realized that there were no checks in place for the uniqueness of instruction decode. Bah, humbug…

Now to think hard about whether to change the instruction encoding data structures; or to faff with a collection of instructions and discern the decode tree after the fact (as I did with the very first version of this stuff, dealing with instructions presented in more or less alphabetical order of mnemonics); or to work out how to insert things into the current structure.

This last seems most attractive at the moment.

It doesn’t fix the ‘uniqueness’ thingy, but that might fall out, or be amenable to a simple ‘symbol table’ solution.


A Digression on an Aspect of Performance


teq (and simpleADL) are built from odds and sods of software I’ve had lying around for years.

One of these is the symbol package. The original version was essentially the example in Kernighan and Ritchie. This used a collection of queues, each a single-forward-linked list of symbol data structures. To insert a new symbol, you created it, filled in the name, computed a hash, and used the hash to select which queue to put it in.

(The Second Edition of The C Programming Language has the example starting on page 143, under section 6.6 Table Lookup).

A key operation in any compiler or equivalent is looking up a name. You’ve just read a token, and eventually you’ll need to know what it is (if it’s been defined). So you’ll look it up in the name table. The simplest organization is a long list (single-forward chained queue) of all the names of interest. If this is, at some point, 280 entries long, you’ll need to make on average 140 string comparisons to find a name that is present. Since names vary from one character in length to 20 or more, these comparisons can take a long time.

A first step is to reduce the average work by any amount you like: instead of one queue, have N. If the names spread evenly across the queues, this reduces the search to 140/N comparisons. How to choose which queue to use for a given name? Compute a small integer in the range 0..N-1 from the name.

K&R offer a simple function to compute a hash; here’s the code from the second edition:

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

We note that the % operator is relatively expensive on many processors, but we’ll ignore this because we don’t do it very often. But when looking up a name, we need to do a string compare between the name we’re looking up and each entry in a queue. This can be expensive. Luckily, the standard library has the strcmp() function, generally optimised as much as practical. But we still need to look at every name in the queue until a match is found or we reach the end of the queue, indicating no match.

strcmp() looks like this, in non-weird form (from: https://opensource.apple.com/source/Libc/Libc-262/ppc/gen/strcmp.c)

/* ANSI sez:
 * The `strcmp' function compares the string pointed to by `s1' to the    
 * string pointed to by `s2'.
 * The `strcmp' function returns an integer greater than, equal to, or less
 * than zero, according as the string pointed to by `s1' is greater than,
 * equal to, or less than the string pointed to by `s2'. [4.11.4.2]
 */
int strcmp(const char *s1, const char *s2) {
    for ( ; *s1 == *s2; s1++, s2++)
        if (*s1 == '\0')
            return 0;
    return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1);
}

With long names, that’s a lot of comparisons. Is there any way we could speed it up?

Probably. We have used one form of (intended) speedup for a long time. The structure holding the name also includes the length of the string; not much point comparing the strings for equality if they’re different lengths, eh?

But that’s not all. Every name we encounter means we need to compute the hash. We could store the hash with each symbol, and get a much stronger way of avoiding the string compare: if the hashes differ, the names differ. Naturally, we’d store the value computed as hashval in the hash() function, before it’s range-reduced.
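As a minimal sketch of the idea, here is a K&R-style table whose entries carry both the length and the un-reduced hash, with strcmp() called only when both already match. The names Symbol, raw_hash, sym_lookup and sym_install are made up for this illustration and are not our symbol package's actual interface; HASHSIZE is 101 as in K&R, and, unlike K&R, the hash reads characters as unsigned char.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define HASHSIZE 101               /* table size used in K&R's example */

typedef struct Symbol {
    struct Symbol *next;           /* next entry in this queue */
    char          *name;
    size_t         len;            /* cached strlen(name) */
    unsigned       hashval;        /* full hash, before % HASHSIZE */
} Symbol;

static Symbol *table[HASHSIZE];

/* K&R's hash without the range reduction (chars read as unsigned). */
static unsigned raw_hash(const char *s)
{
    unsigned h = 0;
    for (; *s != '\0'; s++)
        h = (unsigned char)*s + 31 * h;
    return h;
}

/* Compare the cheap integers first; strcmp() runs only if both agree. */
Symbol *sym_lookup(const char *name)
{
    assert(name != NULL);
    unsigned h = raw_hash(name);
    size_t len = strlen(name);
    for (Symbol *p = table[h % HASHSIZE]; p != NULL; p = p->next)
        if (p->hashval == h && p->len == len && strcmp(p->name, name) == 0)
            return p;
    return NULL;
}

/* Return the existing symbol, or create one; NULL if out of memory. */
Symbol *sym_install(const char *name)
{
    Symbol *p = sym_lookup(name);
    if (p != NULL)
        return p;
    p = malloc(sizeof *p);
    if (p == NULL)
        return NULL;
    p->len = strlen(name);
    p->name = malloc(p->len + 1);
    if (p->name == NULL) {
        free(p);
        return NULL;
    }
    memcpy(p->name, name, p->len + 1);
    p->hashval = raw_hash(name);
    p->next = table[p->hashval % HASHSIZE];
    table[p->hashval % HASHSIZE] = p;
    return p;
}
```

Two entries in the same queue share hashval % HASHSIZE but usually not hashval itself, which is why the stored hash rejects more candidates than the length alone.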

Testing the hypothesis that these changes improve performance requires a fairly large experiment, because performance all depends on the names being looked up. Random names aren’t the same as English-language names; and these aren’t the same as names used in programs (‘not the same as’ means the distributions of length and of characters in the names differ).

So we performed a basic reality test. We created an array of pointers to names, used malloc() to get space for each name, and created each name using our uniform() function to choose successive letters (‘a’ to ‘z’ more or less) and a separate call to uniform() to choose the length of each name. We installed them all, in order of creation, in a hash table, then ran through the array in order, looking up each one.
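For anyone wanting to repeat this, the name generation might look something like the sketch below. It is a reconstruction, not our test code: the uniform() here is a stand-in built on rand() (ours is a separate function), and make_names() and its length limits are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for uniform(): an integer in [lo, hi].
 * Built on rand(), so slightly biased; good enough for a reality test. */
static int uniform(int lo, int hi)
{
    return lo + rand() % (hi - lo + 1);
}

/* Build n random lower-case names with lengths in [min_len, max_len].
 * Returns a malloc()ed array of malloc()ed strings, or NULL on failure. */
char **make_names(size_t n, int min_len, int max_len)
{
    assert(min_len >= 0 && min_len <= max_len);

    char **names = malloc(n * sizeof *names);
    if (names == NULL)
        return NULL;

    for (size_t i = 0; i < n; i++) {
        int len = uniform(min_len, max_len);   /* one call picks the length */
        names[i] = malloc((size_t)len + 1);
        if (names[i] == NULL) {
            while (i > 0)                      /* undo what we built so far */
                free(names[--i]);
            free(names);
            return NULL;
        }
        for (int j = 0; j < len; j++)          /* one call per letter */
            names[i][j] = (char)uniform('a', 'z');
        names[i][len] = '\0';
    }
    return names;
}
```

Calling srand() with a fixed seed first makes a run repeatable, which helps when comparing the three lookup variants on the same set of names.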

We did two experiments. In the first, we chose a letter at random and used that for the first three characters of each name, a rough’n’ready nod to the fact that names are often ‘categorised’ thus. Results:

  • Using Raw, looking up 50,000 names in our symbol table took 1021692 microseconds
  • Using Length, looking up 50,000 names in our symbol table took 959724 microseconds
  • Using Hash, looking up 50,000 names in our symbol table took 901598 microseconds 


And then, with no pre-string, we got:

  • Using Raw, looking up 50,000 names in our symbol table took 1012510 microseconds
  • Using Length, looking up 50,000 names in our symbol table took 990427 microseconds
  • Using Hash, looking up 50,000 names in our symbol table took 1020117 microseconds 


Why isn’t hash better than length - or raw - for the random strings? Probably because the string compare can decide that the strings are different after looking at the first letter of the string. To do a proper test, we need a more realistic set of names.

So, yep. The thinking was good. But the effects are so small as to be unreliable.

From this we are reminded once again that optimizing without measuring is oft-times a Bad Idea...


teq


Well, we’re about a year further on. simpleADL (sadl) has not been touched in all that time, because it’s a rather limited piece of work.

Instead, we’ve been working on teq, a new toolkit. 

Note to the unwary - we don’t think teq contains major intellectual breakthroughs - its purpose is threefold: to provide a useful tool for architecture exploration (and to show by example how to construct such things); to present (eventually) some novel architectural thoughts aimed at IoT; and to provide us with (another) useful and pleasurable activity.

teq is an architecture description language, with (hopefully) the shortfalls of sadl as a language and as a tool fixed. 

The project is just coming to life - we can consume architecture models and detect errors and provide useful error messages and build the data structures which represent the architecture, and are beginning to be able to generate the necessary .c and .h files. In addition to the architecture of a processor, teq will also be able to describe an implementation of an architecture (in the sense that the clock rate of an implementation, and the effects of the micro-architecture will be specifiable) so that performance studies can be done.

teq will also be able to describe large multiprocessor systems, in which many processors operate concurrently and interact. 

Such systems will need software, so teq will also be a concurrent programming language. At the moment, we envisage this as being a slightly simplified C, with concurrency extensions more or less along the lines of occam (that is, channel-like concurrency, and static, block structured concurrency specifications) with aspects of go (that is, channels and dynamic concurrency) and a touch of csp (that is, perhaps no channels :-).

But writing code requires a compiler, so we’ve started working on how to automatically generate a compiler from the architecture and some extra information. At least initially, the compilers won’t be very good, but they’ll be of similar quality for all architectures and so not all comparisons will be useless. We’re thinking of a structure which compiles code into a simple intermediate language, being able to distribute code in that form, and having a sort of just-in-time converter to convert this into per-architecture loadable formats. But these are early days, and undoubtedly there will be gotchas.

And such systems also need useful widgets like DMA engines and other independently-executing non-programmable engines. teq will allow these to be defined; the approach will be to specify the behaviour of the state machine in a version of the teq programming language, with the code being executed on an appropriate abstract architecture. And, clearly, system descriptions will contain both processors and engines.

Finally, teq will allow the definition of a limited set of interconnect topologies, perhaps limited to regular meshes with some number of layers.

As each phase of this Grand Project gets to apparent usability, we shall upload the source and executable (for Mac OS X only, for the time being) along with some documentation.

The license for the teq stuff is as follows:

Copyright (c) 2018, Kiva Design Groupe LLC All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

The views and conclusions contained in the software and documentation are those of the authors and should not be interpreted as representing official policies, either expressed or implied, of anyone else.

Slowly getting there, perhaps


Just uploaded the 0.5 release of the simpleADL tools and associated documentation.

The toolkit itself seems to be able to do what it says on the can, although there are still no example programs which do any memory write operations. Shocking, really.

But this release adds some implementation stuff. Article 3 talks about how a processor is implemented, and presents time-tagging, a simple addition to an architectural simulation which provides quite useful performance estimation. It also discusses the need for caches, and exposes the key elements of the changes to the generated model for the r32 architecture needed to support cache simulation and time-tagging.

What’s time-tagging? It’s a simple technique - every resource in the system that can be written and read has an associated 64 bit time tag value. The interpreter has a notion of current time, stored in the variable now. We start off with every tag zero. When the interpreter performs an ‘add’ instruction, for example, it needs to read the two source registers and write the destination register, and the operation takes some time - for an add, there’s a latency of 1 clock. So the interpreter works out when the instruction can start executing - it’s the maximum of now and the time tags of the registers involved. It sets the value of now to that result, and then adds the latency of the operation to now and writes that sum into the time tag of the destination register.
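In code, the add case might look roughly like the sketch below. This is an illustration of the rule as just described, not the generated r32 model: the Cpu structure, exec_add() and the register count are invented here. Two readings are assumed and flagged in the comments: that ‘the registers involved’ includes the destination as well as the sources, and that now itself is left at the start time rather than advanced by the latency.

```c
#include <assert.h>
#include <stdint.h>

#define NREGS       32   /* register count: an assumption for this sketch */
#define ADD_LATENCY 1    /* an add has a latency of 1 clock */

typedef struct Cpu {
    uint32_t reg[NREGS];
    uint64_t tag[NREGS]; /* time at which each register's value is ready */
    uint64_t now;        /* the interpreter's current time */
} Cpu;

static uint64_t max64(uint64_t a, uint64_t b)
{
    return a > b ? a : b;
}

/* rd = rs1 + rs2, with time-tag bookkeeping. */
void exec_add(Cpu *cpu, unsigned rd, unsigned rs1, unsigned rs2)
{
    assert(rd < NREGS && rs1 < NREGS && rs2 < NREGS);

    /* Start time: the maximum of now and the tags of the registers involved.
     * Assumption: the destination counts as 'involved' too. */
    uint64_t start = max64(cpu->now, max64(cpu->tag[rs1], cpu->tag[rs2]));
    start = max64(start, cpu->tag[rd]);

    cpu->now = start;                        /* now moves to the start time */
    cpu->reg[rd] = cpu->reg[rs1] + cpu->reg[rs2];

    /* Assumption: only the destination's tag receives now + latency. */
    cpu->tag[rd] = cpu->now + ADD_LATENCY;
}
```

A dependent add that follows immediately finds its source tag ahead of now, and so starts one clock later; an independent one does not wait.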

When we have caches, each line has its own tag, and so does main memory. 

Function units need a tag if they’re not completely pipelined.

The whole scheme works quite well, without severely compromising the runtime performance of the interpreter.

Its weakness, of course, is that it doesn’t work so well when there are multiple agents - like multiple processors - making use of shared resources. But that’s a subject for the future.

Phew….


At last, a somewhat tidied version of the simpleADL tools and documentation is published.

This release is still rather alpha-y, but seems to work with the two example architectures and their two example programs.

We’ve found problems with Xcode and symbolic links, and with make and symbolic links (macOS make doesn’t seem to follow the link back for the touch date of the original file, so doesn’t do what we’d like), and so we’ve fallen back on an odious install script which populates the various folders with necessary files, builds things, and installs them in /usr/local/bin.


Architecture


Computer architecture has been annoyingly boring for a long time - on the whole, everybody’s doing RISC with a classical MMU and some form of cache coherence. There are honourable exceptions (the Mill is worth reading about), but in general it’s all samey-samey.

This is about to change as systems which need lots of performance increasingly adopt multicore systems, with core counts running into the tens, hundreds, or thousands. There’s a bunch of examples:

  • Rex Computing has a (4-way VLIW) multicore solution which eschews caches and uses interrupts and ‘remote writes’ for intercore communication. 
  • Adapteva has a similar approach.
  • Kalray also.

What these all seem to lack is a basic understanding that these things need to be programmed, and “thinking parallel” when all you have is a sequential programming language is hard. 

While there is progress being made (C++ atomics etc), in general current widely-used programming languages either don’t support parallelism, or do it in an under-the-counter way (such as providing a threads library or equivalent, along with locks and so forth). This generally makes writing correct, efficient, comprehensible code difficult.

An alternative approach is to notice that much “high performance computing” code is nested loops, and when you’re lucky, the iterations of the loops are independent, so that you can ignore identifying parallelism and let a compiler do it for you. Or, you can use an approach like OpenCL - then you identify those loops, rewrite your code as program + kernels (where the kernels capture the loops) and arrange to run the kernels on some collection of compute resource.

A more promising approach is to think of what the physics of this sort of parallel computing rewards and punishes. These multicore systems generally have a rectangular array of cores (each with some local memory) connected up with some form of mesh network - the network may provide one or more independent communications ‘planes’ to allow different sorts of data to proceed concurrently (examples - it can be helpful to have an ACK network separate from a data network, and if you’re moving cachelines-worth of data, it might also help to have an address network and a large data network).

This sort of system is capable of delivering data directly from one core into appropriate resources of another; you’re immediately tempted to think that an architecture which allows a program to wait until some new data has arrived, and then process it, would be a natural fit. And so it seems - this style of programming is generally known as message-passing, and has the advantage that you don’t need locks or other shared-data-protecting structures. Also, by providing message queues/buffers in memory or registers, the architecture can speed up communication and synchronisation substantially while reducing power consumption at the same time.

But to include these features as an ordinary part of user mode program support you really want some new architecture. And the architecture tradeoffs for a very high core count system which leverages hardware message passing can be quite different from current mainstream practice.

And so there’s a new day dawning in processor/system architecture. Now all we need is a toolkit to go explore this new world.

As a first step in providing such a toolkit, we’re initiating a series of articles and software tools. The initial stuff provides an existence proof of how to define an architecture, and generate an executable model and an assembler automatically from that specification. The capabilities of this initial tool are exceedingly limited - despite the conversation above, it’s single processor only, for example. But most architects don’t use toolkits like this - they seem to work on squared paper or in text editors, and write plain text documents which seek to describe the architecture, and rely on tool makers to understand what’s written in the way they meant when constructing implementations, verification models and software tools. This inevitably leads to tears, so showing to a wider audience that this stuff needn’t be black magic seems a useful goal.

Look for the simpleADL toolkit to appear here quite soon.

Filling In


Slowly adding to the site. First addition, a few words about myself (under People).

First Post


Creating this new Kiva Design website, kicking off the creation of the LLC…

The paperwork from the State arrived yesterday, making us official…



© kiva design groupe • 2017