Saturday, 18 October 2014

Indeterminate hash table traversal

On and off for a few weeks I've been trying to figure out why the gnetlist testsuite has been failing. I could've sworn that I had regenerated the golden output two years ago and, with the help of a special-purpose script, painstakingly checked that the differences were only in the order of output.

I hadn't run the testsuite in a while, so it's likely that this was the first time I'd run it since I got a new hard disk and started using a newer OS. (I'm using debian jessie now.) When I looked at the differences between the golden outputs and the currently produced outputs, it seemed clear that the differences were only in the order of output, and not due to the output being incorrect. At least, no more incorrect than the golden output.

Think things through, don't draw hasty conclusions



I pretty quickly focused on the possibility that GLib's hash table implementation had changed between Ubuntu 10.10's version and jessie's. But at first I was convinced that there was no material change in the hash table implementation: all the hash table stuff was still doing the same thing, albeit in a more cache-friendly way. I even copied GLib's ghash.c from version 2.26.1 (the one that Ubuntu 10.10 used), renamed a few functions and used those instead of the jessie-native GLib's functions, but still the output stayed in the different order.

Still, I was convinced that there must have been a change in hash table traversal order. What else could be reordering the output? I'd barely worked on gEDA since getting my new hard disk, and in any case I also ran the testsuite at the very commit that last touched the golden outputs: still reordered output.

The simplicity (and reliability) of blunt tools



My next step was to use a blunt tool: duplicating the Ubuntu 10.10 environment more exactly. I downloaded the relevant debs and unpacked them in a staging directory, and built gEDA against those versions of GLib and a few other libraries. (That required editing the pkg-config metadata files to point at locations within the staging directory, then setting $PKG_CONFIG_PATH and also using some LDFLAGS=... and $LD_LIBRARY_PATH trickery.)

Finally! Linking to the Ubuntu 10.10 version of GLib reproduced the golden output. That meant that the change in output was a result of the change in computing environment, not some subtle bug I'd introduced with my patches. Now I can confidently make some more order-changing patches (to decouple program output order from hash table implementation). Going from a known-good to a presumed-good state is a lot less scary than going from unknown-if-good to presumed-no-worse. I'll probably still check the regenerated golden output with my order-aware diff-checking script, just to be sure I don't add bugs.
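Here's roughly the shape of that decoupling - a sketch only, assuming string keys and string values, not the actual gnetlist code: instead of printing entries in whatever order g_hash_table_foreach() hands them out, grab the keys, sort them, and emit output in that order.

/* Sketch: emit hash table contents in a stable, sorted order instead of
 * whatever order the installed GLib happens to produce.
 * Assumes string keys and string values.
 * Build with: cc -c sorted_output.c $(pkg-config --cflags glib-2.0) */
#include <glib.h>

static void print_sorted(GHashTable *table)
{
    GList *keys = g_hash_table_get_keys(table);          /* unspecified order */
    keys = g_list_sort(keys, (GCompareFunc) g_strcmp0);  /* now deterministic */

    for (GList *l = keys; l != NULL; l = l->next) {
        const char *key = l->data;
        const char *value = g_hash_table_lookup(table, key);
        g_print("%s=%s\n", key, value);
    }

    g_list_free(keys);  /* the keys themselves still belong to the table */
}

The sort costs O(n log n) per dump, but the payoff is that the golden outputs stop caring which GLib - or which hash function - happens to be installed.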

Mystery solved



With the gEDA source code validated (it now passes the testsuite, given the right libraries), I still wanted to satisfy my curiosity: why did the output order change, when it seemed like the GLib hash table implementation hadn't changed? Because, well, it had changed, and I had simply overlooked where: in the hashing function itself, g_str_hash, which maps a string to a hash value.

What I failed to realize during my first diff-hunt was that GLib 2.26.1's g_str_hash lived in a different file than the rest of the hash table implementation, so I simply didn't see the old version (and never noticed it was missing) while looking at changes to glib/ghash.c.
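To make the culprit concrete, here's a toy demonstration - my own sketch, nothing to do with gEDA's or GLib's actual code - that the same keys come back in different orders from two hash tables that differ only in hash function:

/* Demonstration: traversal order depends on the hash function.
 * Build with: cc order_demo.c $(pkg-config --cflags --libs glib-2.0) */
#include <glib.h>

static guint toy_hash(gconstpointer v)   /* deliberately not g_str_hash */
{
    const char *p = v;
    guint h = 0;
    while (*p)
        h = h * 131 + (guchar) *p++;
    return h;
}

static void print_key(gpointer key, gpointer value, gpointer user_data)
{
    (void) value; (void) user_data;      /* unused */
    g_print("%s ", (const char *) key);
}

int main(void)
{
    const char *keys[] = { "refdes", "netname", "value", "footprint" };
    GHashTable *a = g_hash_table_new(g_str_hash, g_str_equal);
    GHashTable *b = g_hash_table_new(toy_hash, g_str_equal);

    for (gsize i = 0; i < G_N_ELEMENTS(keys); i++) {
        g_hash_table_insert(a, (gpointer) keys[i], NULL);
        g_hash_table_insert(b, (gpointer) keys[i], NULL);
    }

    g_hash_table_foreach(a, print_key, NULL);   /* one order...       */
    g_print("\n");
    g_hash_table_foreach(b, print_key, NULL);   /* ...usually another */
    g_print("\n");

    g_hash_table_unref(a);
    g_hash_table_unref(b);
    return 0;
}

Swap the hash function and the traversal order (usually) changes, even though the table's contents are identical - which is exactly what the new g_str_hash did to gnetlist's output.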

Don't depend on the order of hash table traversal!

Tuesday, 24 June 2014

Words from football matches

So I'm football-illiterate; I thought that FIFA's 3-letter abbreviation for Nigeria was NIG, but it's actually NGA. (NIG is Niger.) So we won't be seeing the delightfully offensive NIG - GER scoreboard during this World Cup.

We can still search for other words. Grepping for the Cartesian product of all the participating countries' codes in /usr/share/dict/words yields a few matches, although no pair of codes forms a complete word on its own, let alone an offensive one:
  • apalaCHICOLa
  • hITACHI
  • siNGAPORe
  • BELCHIng
  • ganGRENED
  • GREENGrocer
  • maCHINED
  • miCROCHIp
Chile features a lot!

Wednesday, 30 April 2014

Getting my old Thunderbird profile into Icedove

Do we have to play these branding games with directory names - especially hidden-by-convention ones?

I just wasted half an hour or so trying to coax Icedove into seeing the account that I used with Thunderbird just a month or two ago. I'd have been happy if it could just magically see it, or even just let me import the old profile into a new one. Whatever, I just wanted my mail!

Making software easy to use for non-geeks is a laudable goal, but I don't think you reach that goal by hiding so many controls that the configuration dialogs appear empty. It reminds me of something one of our computer science lecturers said once, long ago: "Emacs is an operating system masquerading as an editor; Vi is an editor masquerading as nothing." "Nothing" is not a good user interface. It has no affordances, and so while it might not be an intimidating interface, it is undiscoverable (yay, start clicking random elements whose appearance suggests little about their function).

After some helpless flailing around and a quick look through some unhelpful forum threads (I swear, they're becoming less and less useful with every passing year as each generation of cargo cult users passes down its placebo magic spells to the next) I got some help on Freenode, in #debian:
<mtn> berndj-blackout: rename the folder to .icedove ;)
TL;DR at this point: doing exactly that solved my problem.

Unfortunately I had already started icedove before I got to IRC. So when I blindly did
$ mv ~/.thunderbird ~/.icedove
hoping it would simply rename the directory, it actually moved ~/.thunderbird into ~/.icedove, since by then the latter directory already existed, and mv(1) interprets that as a request to put the source into the destination as a subdirectory:
stat("/tmp/a", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/tmp/b", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 lstat("/tmp/a/b", 0x7fff31da12e0) = -1 ENOENT (No such file or directory) rename("/tmp/b", "/tmp/a/b") = 0
(Can you spot the race condition? You can't use mv(1) as a synchronization primitive.)

It took me a while to realize what had happened. This was the clue:
$ du -sh ~/.icedove/2qeibe1d.default/
16M	/home/berndj/.icedove/2qeibe1d.default/
$ du -sh ~/.icedove/
5.7G	/home/berndj/.icedove/
Where are those 5.7GB? Hiding in ~/.icedove/.thunderbird, of course!
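Incidentally, the move-into-the-destination fallback is something mv(1) implements in userspace; the underlying rename(2) system call is atomic and never appends the source's basename. Here's a minimal sketch (mine, with illustrative paths - not part of mv or Icedove) of the rename I wish I'd done:

/* Sketch: rename a directory without mv(1)'s "move into existing directory"
 * fallback. rename(2) is a single atomic call; if the destination is an
 * existing non-empty directory it fails instead of quietly nesting the
 * source inside it. */
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    const char *src = "/home/berndj/.thunderbird";   /* illustrative paths */
    const char *dst = "/home/berndj/.icedove";

    if (rename(src, dst) != 0) {
        fprintf(stderr, "rename %s -> %s: %s\n", src, dst, strerror(errno));
        return 1;
    }
    return 0;
}

(GNU mv's -T / --no-target-directory option gets much the same behaviour without writing any C.)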

Wednesday, 5 March 2014

Simulating filesystem errors with gdb

A prospective client needs to get a bunch of files from in-field gadgets onto the Internet. s3fs / s3fuse seem to be a convenient way to get the files onto Amazon's S3. The application demands that the in-field gadget keep retrying until it knows that a file has finally reached the mothership; to that end, I am to write a program that does the copying and is adequately paranoid about possible failure modes.

One easily-overlooked failure mode is that close(2) can fail. Earlier write(2) operations can appear to succeed, because they may simply be writing the data to local buffers, not yet checking if the data has reached the other side of the network.

My initial assumption was that moving the files across the network using a shell script would fail to take care of all the weird corner cases, such as error-on-close. Why speculate though, when we have gdb?
$ gdb /bin/cp
GNU gdb (GDB) 7.2-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/cp...(no debugging symbols found)...done.
(gdb) break close
Breakpoint 1 at 0x402930
(gdb) run /tmp/a /tmp/c
Starting program: /bin/cp /tmp/a /tmp/c

Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82
82	../sysdeps/unix/syscall-template.S: No such file or directory.
	in ../sysdeps/unix/syscall-template.S
(gdb) cont
Continuing.
...
Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82
82	in ../sysdeps/unix/syscall-template.S
(gdb) p errno
$1 = 0
(gdb) set errno = 5
(gdb) p errno
$2 = 5
(gdb) return -1
Return value type not available for selected stack frame.
Please use an explicit cast of the value to return.
(gdb) return (int) -1
Make selected stack frame return now? (y or n) y
#0  0x0000000000406697 in ?? ()
(gdb) cont
Continuing.
/bin/cp: closing `/tmp/a': Input/output error

Program exited with code 01.

Bingo! It does work (err, it does fail?). I still have some homework to do: is it necessary to call fsync(2) before closing the file, in order to really make sure that all pending errors get reported? That's one thing that cp(1) doesn't do, according to strace(1).
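My working plan for the paranoid copier is roughly this - a sketch under my current assumptions, not the final program: treat write(2), fsync(2) and close(2) as three separate places where the kernel (or s3fs underneath it) may first report a problem, and only claim success if all three do.

/* Sketch: write a buffer to a file and report success only if write(2),
 * fsync(2) and close(2) all succeed. Returns 0 on success, -1 with errno
 * set on any failure. */
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <stddef.h>

static int paranoid_write(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;                 /* interrupted; just retry */
            int saved = errno;
            close(fd);                    /* best effort; keep the write error */
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    if (fsync(fd) != 0) {                 /* flush, and surface pending errors */
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }

    return close(fd);                     /* close(2) itself can fail too */
}

Whether the fsync(2) is strictly necessary on top of checking close(2)'s return value is exactly the homework above; with a FUSE filesystem like s3fs I'd rather pay for the extra call than find out the hard way.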

Friday, 27 December 2013

Reading QR codes

I wanted to get some testnet bitcoin onto my Android phone (running Andreas Schildbach's wallet app) but struggled to get coin from TP's faucet onto the phone without laboriously typing the address into the faucet's input box. (Ironically, I spent more time trying to avoid typing it in manually than it would have taken to just do that.)

My first strategy was to use libdecodeqr-examples's libdecodeqr-webcam utility. It pretended to work, showing a view of what my laptop's camera sees, and then drawing a green box framing the QR code that it recognized. But despite my attempts to help it see better by placing a converging lens in front of the camera (or the attempt to use another camera which hates being in focus), libdecodeqr-webcam just wasn't displaying the correct bitcoin URI. Sometimes the tool would show a string, and some of it would even look right, but invariably there'd be some corruption.

So I just left it for a few days. I thought the channel between the phone's display and the laptop's webcam output was just too noisy to reliably scan a QR code. Not really what I expected from QR codes (they use an error-correcting code) but hey, who am I to argue with the decoding tool?

It turns out that the tool just isn't up to the task. Even running libdecodeqr-simpletest on a locally-generated image fails, and outputs only a line of control characters. Back to searching, where I found an answer on askubuntu referring to zbar-tools. I had previously overlooked it because the short description made no mention of QR codes, only barcodes.

With zbar-tools installed, I ran zbarcam and it was able to read the QR code from the phone's display immediately - even without the extra lens. Problem solved!

Saturday, 16 November 2013

Could we have had Bitcoin 20 years ago?

I remarked the other day on #bitcoin that on a technical level there isn't really anything new about the parts which constitute Bitcoin. I think people misunderstood me a little, and thought I had made a straw man version of what I was trying to express. Maybe they read, "Bitcoin is totally lame and not new at all and Satoshi Nakamoto is just a Johnny-come-lately!" That was definitely not my intent.

What's truly new about Bitcoin is the synthesis of several technologies into a coherent whole. It reminds me of an episode of The Outer Limits, "Final Exam". The story alludes to how, sometimes, the world is just not ready for an idea, despite the availability of all the parts needed to make it happen. And then when it is time, suddenly the idea spontaneously realizes in multiple places independently. Perhaps 2009 was ripe for Bitcoin, while 1992 wasn't.

In 1992 we knew about hash-based proof-of-work systems. We'd considered using them to make email spam uneconomic. In the 1980s we had some ideas about how to achieve Byzantine fault tolerance. We knew of ways to do public-key cryptography well before then, with hints appearing as early as 1974. Elliptic curve cryptography was perhaps still a bit "too new" until fairly recently, but having short keys is a practical matter, and 512-bit RSA keys don't seem like they would've made a cryptocurrency impossible in 1992.

Someone on #bitcoin remarked that peer-to-peer networks are quite new, and that their application to payment systems certainly is. That's true, but I don't consider that truly central to making Bitcoin Bitcoin. Also, the Internet itself was a peer-to-peer system when it started (now maybe less so, being more heavily concentrated in big subnetworks). We also had Usenet, with news servers exchanging posts using a peer-to-peer protocol.

So yes, I'm still convinced that we could have had Bitcoin as much as 20 years ago, if Satoshi Nakamoto had happened to apply his/her/their mind to the problem of creating a decentralized cryptocurrency. The constituent technologies were available, if maybe a bit primitive, even awkward and inconvenient. But back then we were probably still happier with fiat currencies. We hadn't yet lost confidence in the concept as a whole, even if some local currencies did suffer a dramatic loss of confidence. We didn't have "too big to fail" yet, and we didn't have the US trying to print itself out of economic stagnation in the de-facto world currency, the US dollar.

Monday, 21 October 2013

Bitcoin mining in hexdumps

"Mining" is about the one Bitcoin term that I don't find confusing or misleading in some way. It involves sorting through piles of rock (the nonce domain) to find specks of gold (nonces that cause the block hash to satisfy a very strict condition).

The Bitcoin block header format consists of just a few fields. One of them, the nonce, is intended as the primary degree of freedom when searching for solutions to the riddle, "which block will be the next in the chain?" Here's an example, block 125552's block header, with each field hexdumped on its own line:
version:        01000000
prev block:     81cd02ab7e569e8bcd9317e2fe99f2de44d49ab2b8851ba4a308000000000000
Merkle root:    e320b6c2fffc8d750423db8b1eb942ae710e951ed797f7affc8892b0f1fc122b
timestamp:      C7F5D74D
target (bits):  F2B9441A
nonce:          42A14695
The nonce is the last field in the header. With this value (42A14695) the block header hashes to 1dbd981fe6985776b644b173a4d0385ddc1aa2a829688d1e0000000000000000 - which satisfied the network difficulty condition for this block to become block 125552. Almost any other value leaves you with a bunch of non-zero bits at the end of the (double) SHA-256 output. (Things are a little confusing with how Bitcoin does endianness - the test is notionally hash < target, so all those pretty zeroes occupy the most significant bit positions.)
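To watch the numbers line up, here's a small sketch (mine, not anything from the Bitcoin codebase) that hex-decodes the 80-byte header above, hashes it twice with OpenSSL's SHA256(), and prints the digest. The output should match the hash quoted above, trailing zeroes and all; change the nonce at the end and they vanish.

/* Sketch: a Bitcoin block hash is SHA256(SHA256(80-byte header)).
 * Build with: cc blockhash.c -lcrypto */
#include <stdio.h>
#include <openssl/sha.h>

int main(void)
{
    /* Block 125552's header fields, concatenated as hex. */
    const char *hex =
        "01000000"
        "81cd02ab7e569e8bcd9317e2fe99f2de44d49ab2b8851ba4a308000000000000"
        "e320b6c2fffc8d750423db8b1eb942ae710e951ed797f7affc8892b0f1fc122b"
        "c7f5d74d" "f2b9441a" "42a14695";

    unsigned char header[80];
    for (size_t i = 0; i < sizeof header; i++)
        sscanf(hex + 2 * i, "%2hhx", &header[i]);   /* hex -> raw bytes */

    unsigned char d1[SHA256_DIGEST_LENGTH], d2[SHA256_DIGEST_LENGTH];
    SHA256(header, sizeof header, d1);
    SHA256(d1, sizeof d1, d2);                      /* hash the hash */

    for (size_t i = 0; i < sizeof d2; i++)          /* same byte order as above */
        printf("%02x", d2[i]);
    printf("\n");
    return 0;
}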

Astute readers may point out that 32 degrees of one-bit freedom aren't enough to find a block header whose hash is constrained to 60-odd leading zero bits (the example above has 64). That's correct, and indeed there are more degrees of freedom available - they just aren't quite as easy to vary as the nonce field. There's the extraNonce field in coinbase transactions (the ones that bring new coin into circulation), which offers an essentially unlimited number of degrees of freedom and affects the block hash by changing the Merkle root. There's some freedom in choosing the block's timestamp too (C7F5D74D in this example), but not all that much wiggle room, due to limits on how far out of sync the timestamp may be with network consensus time.
