Saturday, 18 October 2014

Indeterminate hash table traversal

On and off for a few weeks I've been trying to figure out why the gnetlist testsuite has been failing. I could've sworn that I regenerated the golden output two years ago and painstakingly checked that the differences were only in the order of output with the help of a special-purpose script.

I hadn't run the testsuite in a while, so this was probably the first time I'd run it since getting a new hard disk and moving to a newer OS. (I'm using Debian jessie now.) When I compared the golden outputs with the currently produced outputs, it seemed clear that the differences were only in the order of output, and not due to the output being incorrect. At least, not more incorrect than the golden output.

Think things through, don't draw hasty conclusions

I pretty quickly focused on the possibility that GLib's hash table implementation had changed between Ubuntu 10.10's version and jessie's. But at first I was convinced that there was no material change in the hash table implementation: all the hash table code was still doing the same thing, albeit in a more cache-friendly way. I even copied GLib's ghash.c from version 2.26.1 (the one that Ubuntu 10.10 used), renamed a few functions and used those instead of the jessie-native GLib functions, but the output still came out in the new order.

Still, I was convinced that there must have been a change in hash table traversal order. What else could be reordering the output? I've barely worked on gEDA since getting my new hard disk, and in any case I also ran the testsuite of the very commit that last touched the golden outputs: still reordered output.

The simplicity (and reliability) of blunt tools

My next step was to use a blunt tool: duplicating more exactly the Ubuntu 10.10 environment. I downloaded the relevant debs and unpacked them in a staging directory, and built gEDA against those versions of GLib and a few others. (That required editing the pkg-config package metadata files to point at locations within the staging directory, then setting $PKG_CONFIG_PATH and also using some LDFLAGS=... and $LD_LIBRARY_PATH trickery.)

Finally! Linking to the Ubuntu 10.10 version of GLib reproduced the golden output. That meant that the change in output was a result of the change in computing environment, not some subtle bug I've introduced with my patches. Now I can confidently make some more order-changing patches (to decouple program output order from hash table implementation). Going from a known-good to a presumed-good state is a lot less scary than going from unknown-if-good to presumed-no-worse. I'll probably still check the regenerated golden output with my order-aware diff-checking script, just to be sure that I don't add bugs.
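The order-aware check mentioned above can be approximated in a few lines. This is only a minimal sketch, not my actual script: it assumes that reordering happens at the level of whole lines, and the function names and the sample netlist lines are made up for illustration. Output that is also reordered within a line would need a smarter comparison.

```python
from collections import Counter


def same_lines_any_order(expected: str, actual: str) -> bool:
    """Return True if both texts contain the same lines, ignoring order."""
    return Counter(expected.splitlines()) == Counter(actual.splitlines())


def line_differences(expected: str, actual: str):
    """Return (missing, unexpected) lines, each as a sorted list.

    Counter subtraction keeps duplicates, so a line that appears twice in
    the golden output but once in the new output is reported as missing.
    """
    want = Counter(expected.splitlines())
    got = Counter(actual.splitlines())
    missing = sorted((want - got).elements())
    unexpected = sorted((got - want).elements())
    return missing, unexpected


if __name__ == "__main__":
    golden = "R1 1k\nC1 10n\nU1 7400\n"    # hypothetical golden output
    current = "U1 7400\nR1 1k\nC1 10n\n"   # same lines, different order
    print(same_lines_any_order(golden, current))
    print(line_differences(golden, "R1 1k\nC1 22n\n"))
```

If the first function returns True, the regenerated golden output differs from the old one only in line order, which is the property I want to confirm before committing it.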

Mystery solved

With the gEDA source code validated (it now passes the testsuite, given the right libraries), I still wanted to satisfy my curiosity: why did the output order change, when it seemed like the GLib hash table implementation hadn't changed? Because it had changed after all, and I had overlooked the part that mattered: the hashing function itself, g_str_hash, which transforms a string into a hash value.

What I failed to realize during my first diff-hunt was that GLib 2.26.1's g_str_hash lived in a different file than the rest of the hash table implementation, so I simply didn't see the old version (and never noticed it was missing) while looking at changes to glib/ghash.c.
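To see why swapping only the hash function is enough to reorder output, here is a toy illustration. It is a sketch under stated assumptions: the bucket table is a deliberately simple chained table, not GLib's real (and more elaborate) implementation, and the two hash functions are generic multiply-by-31 and djb2-style string hashes chosen for illustration, not copies of either GLib version's g_str_hash.

```python
def hash_times_31(s: str) -> int:
    """Illustrative multiply-by-31 string hash, truncated to 32 bits."""
    h = 0
    for byte in s.encode("utf-8"):
        h = (h * 31 + byte) & 0xFFFFFFFF
    return h


def hash_djb2(s: str) -> int:
    """Illustrative djb2-style string hash, truncated to 32 bits."""
    h = 5381
    for byte in s.encode("utf-8"):
        h = (h * 33 + byte) & 0xFFFFFFFF
    return h


def traversal_order(keys, hash_func, n_buckets=8):
    """Insert keys into a toy chained table and list them in bucket order.

    Traversal walks the buckets from first to last, so the result depends
    on which bucket each key hashes to, not on insertion order alone.
    """
    buckets = [[] for _ in range(n_buckets)]
    for key in keys:
        buckets[hash_func(key) % n_buckets].append(key)
    return [key for bucket in buckets for key in bucket]


if __name__ == "__main__":
    refdes = ["R1", "C1", "U1"]  # hypothetical component names
    print(traversal_order(refdes, hash_times_31))  # ['U1', 'C1', 'R1']
    print(traversal_order(refdes, hash_djb2))      # ['R1', 'C1', 'U1']
```

The table code is identical in both calls; only the hash function differs, and that alone changes the traversal order. That is the same trap I fell into when I diffed only glib/ghash.c.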

Don't depend on the order of hash table traversal!
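One common way to follow that advice is to sort the keys before emitting anything. The sketch below assumes the output is one line per key; the function name and sample data are hypothetical. Python dictionaries happen to preserve insertion order, so the example uses two different insertion orders to stand in for two different hash table implementations. In C with GLib, the rough equivalent would be to collect the keys (for example with g_hash_table_get_keys) and sort that list before iterating.

```python
def emit_components(components: dict) -> list:
    """Return one output line per component, in sorted key order.

    Sorting the keys makes the output independent of how the underlying
    table happens to order its entries.
    """
    return [f"{refdes} {components[refdes]}" for refdes in sorted(components)]


if __name__ == "__main__":
    # Same contents, different internal order (hypothetical data).
    table_a = {"U1": "7400", "R1": "1k", "C1": "10n"}
    table_b = {"C1": "10n", "U1": "7400", "R1": "1k"}
    print(emit_components(table_a) == emit_components(table_b))
    print(emit_components(table_a))
```

With output generated this way, golden files stay valid no matter which library version the testsuite runs against.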
