A Digression on an Aspect of Performance

teq (and simpleADL) are built from odds and sods of software I’ve had lying around for years.

One of these is the symbol package. The original version was essentially the example in Kernighan and Ritchie. This used a collection of queues (singly forward-linked lists) to hold the symbol data structures. To insert a new symbol, you created it, filled in the name, computed a hash, and used the hash to select which queue to put it in.

(The second edition of The C Programming Language has the example starting on page 143, in section 6.6, Table Lookup.)

A key operation in any compiler or equivalent is looking up a name. You’ve just read a token, and eventually you’ll need to know what it is (if it’s been defined). So you’ll look it up in the name table. The simplest organization is one long list (a singly forward-chained queue) of all the names of interest. If this is, at some point, 280 entries long, you’ll need to make on average 140 string comparisons. Since names vary from one character in length to 20 or more, these comparisons can take a long time.

A first step is to reduce the average work by any amount you like: instead of one queue, have N. This reduces the search to 140/N comparisons on average. How do we choose which queue to use for a given name? Compute a small integer in the range 0..N-1 from the name.

K&R offer a simple function to compute a hash; here’s the code from the second edition:

/* hash: form hash value for string s */
unsigned hash(char *s)
{
    unsigned hashval;

    for (hashval = 0; *s != '\0'; s++)
        hashval = *s + 31 * hashval;
    return hashval % HASHSIZE;
}

We note that the % operator is relatively expensive on many processors, but we’ll ignore that here because we don’t do it very often. When looking up a name, though, we need a string compare between the name we’re looking up and every entry in its queue, and that can be expensive. Luckily the standard library has the strcmp() function, generally optimised as much as practical. But we still need to look at every name in the queue until a match is found or we reach the end of the queue, indicating no match.
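(If the cost of % ever did matter, one common dodge — an assumption on our part, not something the original code does — is to make the table size a power of two, so the range reduction becomes a mask:)

```c
#define HASHSIZE 128   /* must be a power of two for the mask trick */

/* hash: as before, but the final reduction is a mask, not a divide. */
unsigned hash(const char *s)
{
    unsigned hashval;
    for (hashval = 0; *s != '\0'; s++)
        hashval = (unsigned char)*s + 31 * hashval;
    return hashval & (HASHSIZE - 1);   /* same result as % HASHSIZE here */
}
```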

strcmp() looks like this, in non-weird form (from https://opensource.apple.com/source/Libc/Libc-262/ppc/gen/strcmp.c):

/* ANSI sez:
 * The `strcmp' function compares the string pointed to by `s1' to the
 * string pointed to by `s2'.
 * The `strcmp' function returns an integer greater than, equal to, or less
 * than zero, according as the string pointed to by `s1' is greater than,
 * equal to, or less than the string pointed to by `s2'. []
 */
int strcmp(const char *s1, const char *s2) {
    for ( ; *s1 == *s2; s1++, s2++)
        if (*s1 == '\0')
            return 0;
    return ((*(unsigned char *)s1 < *(unsigned char *)s2) ? -1 : +1);
}

With long names, that’s a lot of comparisons. Is there any way we could speed it up?

Probably. We have used one form of (intended) speedup for a long time: the structure holding the name also includes the length of the string; not much point comparing the strings for equality if they’re different lengths, eh?

But that’s not all. For every name we encounter, we must compute the hash anyway. We could store that hash with the symbol and avoid the string compare far more often. Naturally, we’d store the value computed as hashval in the hash() function, before it’s range-reduced: two different names in the same queue necessarily share the reduced value, but only rarely the full one.
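A sketch of a lookup comparison using both short-circuits (the struct and field names are illustrative; this shows the idea, not the original code):

```c
#include <string.h>

/* A symbol entry that caches the name's length and the full hash value
   (hashval before the % HASHSIZE reduction), both set at install time. */
struct sym {
    struct sym *next;
    unsigned fullhash;  /* full, un-reduced hash of name */
    size_t len;         /* strlen(name) */
    const char *name;
};

/* matches: compare a queue entry against a lookup key, cheapest test
   first; strcmp() runs only when both integer tests pass. */
static int matches(const struct sym *np, const char *name,
                   size_t len, unsigned fullhash)
{
    if (np->fullhash != fullhash)   /* almost always differs on a miss */
        return 0;
    if (np->len != len)             /* cheap integer comparison */
        return 0;
    return strcmp(np->name, name) == 0;
}
```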

Testing the hypothesis that these changes improve performance requires a fairly large experiment, because performance depends entirely on the names being looked up. Random names aren’t the same as English-language names, and neither is the same as names used in programs (‘not the same as’ meaning that the distributions of length and of characters in the names differ).

So we performed a basic reality test. We created an array of pointers to names, malloc()’d space for each name, and built each name using our uniform() function to choose successive letters (‘a’ to ‘z’, more or less), with a separate call to uniform() to choose each name’s length. We installed them all, in order of creation, in a hash table, then ran through the array in order, looking up each one.
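The name generation can be sketched like this; uniform() is our own function, reconstructed here on top of rand() purely as an assumption for the sketch:

```c
#include <stdlib.h>

/* uniform: a uniformly distributed integer in [lo, hi]; a rand()-based
   stand-in for the original uniform() function. */
static int uniform(int lo, int hi)
{
    return lo + rand() % (hi - lo + 1);
}

/* make_name: allocate and fill one random name of 1..maxlen letters,
   each drawn from 'a'..'z'. */
static char *make_name(int maxlen)
{
    int len = uniform(1, maxlen);
    char *s = malloc((size_t)len + 1);
    for (int i = 0; i < len; i++)
        s[i] = (char)uniform('a', 'z');
    s[len] = '\0';
    return s;
}
```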

We did two experiments. In the first, we chose a letter at random and used that for the first three characters of each name, a rough’n’ready nod to the fact that names are often ‘categorised’ thus. Results:

  • Using Raw, looking up 50,000 names in our symbol table took 1,021,692 microseconds
  • Using Length, looking up 50,000 names in our symbol table took 959,724 microseconds
  • Using Hash, looking up 50,000 names in our symbol table took 901,598 microseconds

And then, with no pre-string, we got:

  • Using Raw, looking up 50,000 names in our symbol table took 1,012,510 microseconds
  • Using Length, looking up 50,000 names in our symbol table took 990,427 microseconds
  • Using Hash, looking up 50,000 names in our symbol table took 1,020,117 microseconds

Why isn’t hash better than length, or even raw, for the random strings? Probably because the string compare can usually decide that two random strings differ after looking at just the first character. To do a proper test, we need a more realistic set of names.

So, yep. The thinking was good. But the effects are so small as to be unreliable.

From this we are reminded once again that optimizing without measuring is oft-times a Very Bad Idea...

© kiva design groupe • 2017