Tuesday, March 4, 2014

Simulating filesystem errors with gdb

A prospective client needs to get a bunch of files from in-field gadgets onto the Internet. s3fs / s3fuse seem to be a convenient way to get the files onto Amazon's S3. The application demands that the in-field gadget keep retrying until it knows that a file has finally reached the mothership; to do this, I am to write a copying program that is adequately paranoid about possible failure modes.

One easily-overlooked failure mode is that close(2) can fail. Earlier write(2) operations can appear to succeed, because they may simply be writing the data to local buffers, not yet checking if the data has reached the other side of the network.

My initial assumption was that moving the files across the network using a shell script would fail to take care of all the weird corner cases, such as error-on-close. Why speculate though, when we have gdb?
$ gdb /bin/cp
GNU gdb (GDB) 7.2-ubuntu
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
For bug reporting instructions, please see:
...
Reading symbols from /bin/cp...(no debugging symbols found)...done.
(gdb) break close
Breakpoint 1 at 0x402930
(gdb) run /tmp/a /tmp/c
Starting program: /bin/cp /tmp/a /tmp/c

Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82
82      ../sysdeps/unix/syscall-template.S: No such file or directory.
        in ../sysdeps/unix/syscall-template.S
(gdb) cont
Continuing.
...
Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82
82      in ../sysdeps/unix/syscall-template.S
(gdb) p errno
$1 = 0
(gdb) set errno = 5
(gdb) p errno
$2 = 5
(gdb) return -1
Return value type not available for selected stack frame.
Please use an explicit cast of the value to return.
(gdb) return (int) -1
Make selected stack frame return now? (y or n) y
#0  0x0000000000406697 in ?? ()
(gdb) cont
Continuing.
/bin/cp: closing `/tmp/a': Input/output error

Program exited with code 01.

Bingo! It does work (err, it does fail?) I still have some homework to do: is it necessary to use fsync(2) before closing the file, in order to really make sure that all pending errors get reported? That's one thing that cp(1) doesn't do, according to strace(1).
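To make sure the eventual copier gets this right, the fsync-then-close discipline can be sketched in a few lines of Python. This is a sketch only - the name paranoid_copy is hypothetical and this isn't the client's program, just the shape of the error handling:

```python
import os

def paranoid_copy(src, dst):
    """Copy src to dst, reporting success only if every write, the
    fsync and both closes succeed; any OSError propagates to the
    caller, which can then retry."""
    in_fd = os.open(src, os.O_RDONLY)
    try:
        out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(in_fd, 1 << 16)
                if not chunk:
                    break
                while chunk:
                    # os.write may write fewer bytes than asked for.
                    written = os.write(out_fd, chunk)
                    chunk = chunk[written:]
            # Flush to stable storage first, so pending I/O errors
            # surface here instead of being lost in a buffer.
            os.fsync(out_fd)
        finally:
            os.close(out_fd)  # raises OSError (e.g. EIO) if close fails
    finally:
        os.close(in_fd)
```

On s3fs-style filesystems it's typically the flush at fsync/close time that finally talks to the network, which is why neither return value can be ignored.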

Friday, December 27, 2013

Reading QR codes

I wanted to get some testnet bitcoin onto my Android phone (running Andreas Schildbach's wallet app) but struggled to get coin from TP's faucet onto the phone without laboriously typing the address into the faucet's input box. (Ironically, I spent more time trying to avoid typing it in manually than it would have taken to just do that.)

My first strategy was to use libdecodeqr-examples's libdecodeqr-webcam utility. It pretended to work, showing a view of what my laptop's camera sees, and then drawing a green box framing the QR code that it recognized. But despite my attempts to help it see better by placing a converging lens in front of the camera (or by trying another camera, which refused to stay in focus), libdecodeqr-webcam just wasn't displaying the correct bitcoin URI. Sometimes the tool would show a string, and some of it would even look right, but invariably there'd be some corruption.

So I just left it for a few days. I thought the channel between the phone's display and the laptop's webcam output was just too noisy to reliably scan a QR code. Not really what I expected from QR codes (they use an error-correcting code) but hey, who am I to argue with the decoding tool?

It turns out that the tool just isn't up to the task. Even running libdecodeqr-simpletest on a locally-generated image fails, and outputs only a line of control characters. Back to searching, where I found an answer on askubuntu referring to zbar-tools. I had previously overlooked it because the short description made no mention of QR codes, only barcodes.

With zbar-tools installed, I ran zbarcam and it was able to read the QR code from the phone's display immediately - even without the extra lens. Problem solved!

Saturday, November 16, 2013

Could we have had Bitcoin 20 years ago?

I remarked the other day on #bitcoin that, on a technical level, there isn't really anything new about the parts which constitute Bitcoin. I think people misunderstood me a little, and responded to a straw-man version of what I was trying to express. Maybe they read it as, "Bitcoin is totally lame and not new at all, and Satoshi Nakamoto is just a Johnny-come-lately!" That was definitely not my intent.

What's truly new about Bitcoin is the synthesis of several technologies into a coherent whole. It reminds me of an episode of The Outer Limits, "Final Exam". The story alludes to how, sometimes, the world is just not ready for an idea, despite the availability of all the parts needed to make it happen. And then when it is time, suddenly the idea spontaneously realizes in multiple places independently. Perhaps 2009 was ripe for Bitcoin, while 1992 wasn't.

In 1992 we knew about hash-based proof-of-work systems. We'd considered using them to make email spam uneconomic. In the 1980s we had some ideas about how to achieve Byzantine fault tolerance. We knew of ways to do public-key cryptography well before then, with hints appearing as early as 1974. Elliptic curve cryptography was perhaps still a bit "too new" until fairly recently, but having short keys is a practical matter, and 512-bit RSA keys don't seem like they would've made a cryptocurrency impossible in 1992.

Someone on #bitcoin remarked that peer-to-peer networks are quite new - as is, in any case, their application to payment systems. That's true, but I don't consider peer-to-peer networking truly central to making Bitcoin Bitcoin. Besides, the Internet itself was a peer-to-peer system when it started (now perhaps less so, being more heavily concentrated in big subnetworks). We also had Usenet, with news servers exchanging posts over a peer-to-peer protocol.

So yes, I'm still convinced that we could have had Bitcoin as much as 20 years ago, had Satoshi Nakamoto happened to apply his/her/their mind to the problem of creating a decentralized cryptocurrency then. The constituent technologies were available, if a bit primitive, even awkward and inconvenient. But back then we were probably still happier with fiat currencies. We hadn't yet lost confidence in the concept as a whole, even if some local currencies did suffer dramatic losses of confidence. We didn't yet have "too big to fail", and we didn't have the US trying to print itself out of economic stagnation using the de-facto world currency, the US dollar.

Sunday, October 20, 2013

Bitcoin mining in hexdumps

"Mining" is about the one Bitcoin term that I don't find confusing or misleading in some way. It involves sorting through piles of rock (the nonce domain) to find specks of gold (nonces that cause the block hash to satisfy a very strict condition).

The Bitcoin block header format consists of just a few fields. One of these, the nonce, is the one intended as the primary degree of freedom when searching for solutions to the riddle, "which block will be the next in the chain?" Here's an example, block 125552's block header, shown as a hexdump:
01000000                                                          version
81cd02ab7e569e8bcd9317e2fe99f2de44d49ab2b8851ba4a308000000000000  previous block hash
e320b6c2fffc8d750423db8b1eb942ae710e951ed797f7affc8892b0f1fc122b  merkle root
C7F5D74D                                                          timestamp
F2B9441A                                                          bits (difficulty target)
42A14695                                                          nonce
The nonce is the last field in the header. With this value (42A14695) the block header hashes to 1dbd981fe6985776b644b173a4d0385ddc1aa2a829688d1e0000000000000000 - which satisfied the network difficulty condition for this block to become block 125552. Any other value leaves you with a bunch of non-zero bits at the end of the SHA256 output. (Things are a little confusing with how Bitcoin does endianness - the test is notionally hash < target, so all those pretty zeroes occupy the msb position.)
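This is easy to verify with a few lines of Python - hashlib stands in here for whatever a real miner would use, and the field comments follow the hexdump above:

```python
import hashlib

# Block 125552's 80-byte header, fields concatenated as in the hexdump.
header = bytes.fromhex(
    "01000000"                                                          # version
    "81cd02ab7e569e8bcd9317e2fe99f2de44d49ab2b8851ba4a308000000000000"  # prev block
    "e320b6c2fffc8d750423db8b1eb942ae710e951ed797f7affc8892b0f1fc122b"  # merkle root
    "c7f5d74d"                                                          # timestamp
    "f2b9441a"                                                          # bits
    "42a14695"                                                          # nonce
)

# Bitcoin hashes the header twice with SHA-256.
block_hash = hashlib.sha256(hashlib.sha256(header).digest()).digest()
print(block_hash.hex())
# → 1dbd981fe6985776b644b173a4d0385ddc1aa2a829688d1e0000000000000000
```

Change the nonce by one and the trailing zeroes almost certainly vanish.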

Astute readers may point out that 32 one-bit degrees of freedom aren't enough to find a block header whose hash is constrained to 60-odd leading zero bits (as the difficulty demands in late 2013). That's correct, and indeed there are more degrees of freedom available - they just aren't quite as easy to vary as the nonce field. There's the extraNonce field in coinbase transactions (the ones that bring new coin into circulation), which has an essentially unlimited number of degrees of freedom, and affects the block hash by changing the Merkle root. There's some freedom also in choosing the timestamp of the block (C7F5D74D in this example), but there's not all that much wiggle room due to how far out of sync the timestamp may be with network consensus time.

Wednesday, April 24, 2013

Function pointer type compatibility

I've been wondering how function pointers get passed around inside the GObject framework. Some example code seems to play fast and loose with function pointer types, relying on the fact that C's Undefined Behaviour can also include doing what one hopes. After some source-diving I finally found how GLib calls the callback. It does it through a function pointer of this type in one example:
typedef void (*GMarshalFunc_VOID__UINT_POINTER) (gpointer instance, guint arg_0, gpointer arg_1, gpointer data);
That's for a signal handler whose signature is void handler(gpointer instance, guint x, gpointer y, gpointer userdata). Because GtkCellRendererText * and void * are not compatible types, it's actually wrong (it invokes undefined behaviour) to simply copy the signal handler signatures from the GTK+ documentation! For this example, the correct function declaration would have to be:
void user_function(gpointer renderer, gpointer path, gpointer new_text, gpointer user_data);
I'm not sure if I want to be that pure. Too much boilerplate type conversion code. Maybe the reasonable compromise is to continue using pointers to specific types, but to make sure that at least the number of arguments matches what the marshaller functions demand. I think it's far more likely that a C implementation will be sensitive to a mismatched number of arguments (consider how cdecl vs pascal calling convention specifiers in some compilers determine a function's activation record) than that void * will have a different representation than GtkCellRendererText *.

Tuesday, February 12, 2013

Re-redesign the gEDA slotting mechanism

The slotting mechanism that is the primary subject of my gEDA fork seems to work, and solves the opamp problem, presumably also the transistor problem, and supports heterogeneous slots, unlike the more inflexible [1] stock gEDA mechanism.

But the design of the mechanism is broken: it doesn't play nicely with hierarchical designs, especially not ones that re-use schematics as distinct copies of a functional block. John Doty pointed this use case out to me; he's probably one of the heavier users of gEDA's hierarchical nature.

Essentially, the attributes my slotting mechanism currently uses point the wrong way. It is the symbols that point to the slots they inhabit, thereby pointing "up" in the hierarchy of schematics, which is a graph (hopefully an acyclic one) and not a tree. Because the hierarchy is a graph, a schematic may have multiple parents - schematics that contain a symbol with a source= attribute pointing to it. While it would be possible to store multiple upwards-pointing attributes in a sub-schematic, doing so would damage the utility of that page as a reusable element, since it would accumulate slotting-related attributes from all the projects which used it as a sub-schematic.

So the slotting attributes can't point "up". Can they point "down" instead? That would be better, but perhaps still too inflexible: schematics with sub-schematics can themselves be sub-schematics, and gEDA's hierarchy of schematics isn't limited in depth [2]. Only a toplevel schematic for a particular assembly [3] can sensibly assign slots in concrete parts to the symbols below that need them: only it has no super-schematic, and therefore it cannot appear in a project in multiple instances.

Then if slots need to point "down" to the symbols occupying them, we'll need not just a pointer to the symbol object, but a path through the hierarchy by which to reach it. Without the path, it would be impossible to disambiguate references to the same symbol when its sub-schematic is used by multiple parents. Something like this should do:

slotsymbol=opamp3:48aa3670-55de-4dad-9587-f54e9f196837/c2490daf-a9e8-4ec8-9941-62845fc9bb29

Interpretation: 48aa3670-55de-4dad-9587-f54e9f196837 is the UUID of a hierarchical symbol (a "COMPLEX" in libgeda jargon) - perhaps a block symbol for a bandpass filter. One of its source= attributes will point to a sub-schematic containing another symbol (perhaps a generic opamp triangle symbol) whose UUID is c2490daf-a9e8-4ec8-9941-62845fc9bb29. That particular opamp function is assigned to the slot "opamp3".

Having these paths encoded in the slotting associations will allow gschem to show different pin numbers on the same symbol, depending on which instance of a sub-schematic one is looking at. gschem cannot do this yet, of course; I will have to code this extra behaviour.
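For concreteness, decoding such an attribute could look like the following Python sketch (illustrative only: gEDA itself is C, and both the attribute format and the function name here are my own invention):

```python
from collections import namedtuple

# slot: the slot name; path: UUIDs from the outermost hierarchical
# symbol down to the symbol occupying the slot.
SlotAssignment = namedtuple("SlotAssignment", ["slot", "path"])

def parse_slotsymbol(value):
    """Decode a slotsymbol= value such as
    'opamp3:<uuid-of-hierarchical-symbol>/<uuid-of-occupying-symbol>'."""
    slot, sep, path = value.partition(":")
    if not sep or not slot or not path:
        raise ValueError("malformed slotsymbol value: %r" % value)
    return SlotAssignment(slot, path.split("/"))

a = parse_slotsymbol("opamp3:48aa3670-55de-4dad-9587-f54e9f196837"
                     "/c2490daf-a9e8-4ec8-9941-62845fc9bb29")
# a.slot is "opamp3"; a.path lists the two UUIDs, outermost first.
```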

I should reverse the associations now, before anyone really starts using my fork for its slotting mechanism. Patches welcome - it's a big job.

[1] Stock gEDA doesn't understand heterogeneous slots, conflates the symbol for a function (a NAND gate, for example) with the symbol for a part (correspondingly, a 74LS00 in this example), relies on the user to manually track slot assignments, and relies on fragile hacks to produce the correct netlist (setting identical refdes= attributes - or, worse, using lowercase suffixes to trigger special-case treatment in PCB). You end up with multiple symbols with a split identity between function and part, that can easily take on incompatible attributes. Imagine two NAND gates intended to be gates in the same chip, but each carries a mutually incompatible footprint= attribute. Let's not even think of the different pin numbering schemes of the various package styles - stock gEDA would demand that you edit the slotdef= attributes. How baroque!

[2] The hierarchy of schematics needs to be acyclic though; I can't see any good coming from a cycle of schematics. There is a similar issue in the component library: the gEDA file format allows any object to appear as part of the graphical representation of a symbol, including other symbols, and specifically including itself. Semantically invalid, but syntactically okay. A hare and tortoise algorithm would be able to detect cycles, and one day I'll get around to adding such a check.

[3] We could have a design consisting of a backplane and a set of identical daughterboards; each of these daughterboards would have identically-numbered parts, and this wouldn't be a problem, because the daughterboard is an entire sub-project. The important bit here is that only the topmost schematic of a particular assembly, subproject, whatever you want to call a distinct domain of refdes values, should carry the slot assignments for all the abstract symbols and slots below it.

Tuesday, January 29, 2013

dpkg MD5 checksums

My OpenOffice installation stopped working a few days ago after I Changed Nothing (tm) [1], so one of my avenues in investigating the breakage was to check for any unexplained changes to installed files. That happened to me once before [2], back when I worked at Prism, so "obviously" I felt I should check out that possibility again:

$ md5sum -c --quiet /var/lib/dpkg/info/*.md5sums

After much disk grinding (sometimes I'm sure I'm about to see a puff of hard disk powder come out of the fan exhaust), what seems to be a smoking gun:

usr/bin/gnuplot: FAILED
md5sum: WARNING: 1 of 45 computed checksums did NOT match

This is interesting! So I download the deb for gnuplot-x11 and unpack it manually (with binutils' ar), and find the same "wrong" checksum. A friend repeated the procedure and found the same "wrong" checksum, so I'm no longer suspecting a fancy worm/virus that infects new gnuplot binaries as they appear on the filesystem.

It turns out that these mismatching packages have preinst scripts that "divert" files, invalidating the naive checksum. The diverted files are still around, but their names no longer match what's in the lists of MD5 checksums.
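The idea behind a diversion-aware check can be sketched in Python. This is a sketch under assumptions - the dpkg-divert --list line format is paraphrased from dpkg-divert(8), and the function names are hypothetical; debsums is the real tool:

```python
import hashlib
import subprocess

def diversions():
    """Map original path -> diverted path. dpkg-divert --list prints
    lines like 'diversion of /usr/bin/gnuplot to /usr/bin/gnuplot.real
    by gnuplot-x11'; the parsing here is deliberately naive."""
    table = {}
    out = subprocess.run(["dpkg-divert", "--list"], capture_output=True,
                         text=True, check=True).stdout
    for line in out.splitlines():
        words = line.split()
        if len(words) >= 5 and words[:2] == ["diversion", "of"]:
            table[words[2]] = words[4]
    return table

def check(md5sums_path, divert_table):
    """Verify one package's md5sums list, following diversions;
    return the paths whose checksums don't match."""
    bad = []
    with open(md5sums_path) as f:
        for line in f:
            want, name = line.split(None, 1)
            # md5sums files store paths without the leading slash.
            path = "/" + name.strip()
            path = divert_table.get(path, path)
            with open(path, "rb") as g:
                if hashlib.md5(g.read()).hexdigest() != want:
                    bad.append(path)
    return bad
```

With that, something like check("/var/lib/dpkg/info/gnuplot-x11.md5sums", diversions()) would stop flagging the diverted gnuplot binary.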

And that's where laziness bit me in the behind: I knew by the time I started on my wild goose chase that debsums(1) checked checksums, but since I didn't have it installed and felt too lazy to install it, I decided to just run the checksum files through md5sum(1). After all that effort to explain the mismatched checksums, I installed debsums(1) anyway - and discovered that it knows how to follow diversions!

Now, I'm back to still wanting to know why OpenOffice stopped working.

[1] I upgraded google-chrome, but that update involved only its own package. ooffice seemed to stop working after I tried to open some document that caused it to crash, but I no longer recall the exact sequence of events.

[2] It was almost ten years ago when gethostbyname(3) or some nearby interface seemed to stop working. Suddenly no programs could connect to anything on the Internet anymore. After a bit of bug-chasing I noticed that libc's contents had changed. I don't remember what led me to check that with rpm, but I did. I must have suspected cosmic rays, because I made a copy of libc before rebooting, in order to freeze the corrupted memory contents onto stable storage. Sure enough, after the reboot libc was fine (clearly having been reloaded from the uncorrupted copy on disk), and a diff of a hexdump showed some six bytes that differed, right inside gethostbyname(3). To this day I don't know how I might have forced the kernel to re-read what must have been a very frequently-accessed page.