Wednesday 5 March 2014

Simulating filesystem errors with gdb

A prospective client needs to get a bunch of files from in-field gadgets onto the Internet. s3fs / s3fuse seem to be a convenient way to get the files onto Amazon's S3. The application demands that the in-field gadget keep retrying until it knows that a file has finally reached the mothership; to do this, I am to write a program that does the copying, that is adequately paranoid about possible failure modes.

One easily-overlooked failure mode is that close(2) can fail. Earlier write(2) operations can appear to succeed, because they may simply be writing the data to local buffers, not yet checking if the data has reached the other side of the network.

My initial assumption was that moving the files across the network using a shell script would fail to take care of all the weird corner cases, such as error-on-close. Why speculate though, when we have gdb?
$ gdb /bin/cp GNU gdb (GDB) 7.2-ubuntu Copyright (C) 2010 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-linux-gnu". For bug reporting instructions, please see: ... Reading symbols from /bin/cp...(no debugging symbols found)...done. (gdb) break close Breakpoint 1 at 0x402930 (gdb) run /tmp/a /tmp/c Starting program: /bin/cp /tmp/a /tmp/c Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82 82 ../sysdeps/unix/syscall-template.S: No such file or directory. in ../sysdeps/unix/syscall-template.S (gdb) cont Continuing. ... Breakpoint 1, close () at ../sysdeps/unix/syscall-template.S:82 82 in ../sysdeps/unix/syscall-template.S (gdb) p errno $1 = 0 (gdb) set errno = 5 (gdb) p errno $2 = 5 (gdb) return -1 Return value type not available for selected stack frame. Please use an explicit cast of the value to return. (gdb) return (int) -1 Make selected stack frame return now? (y or n) y #0 0x0000000000406697 in ?? () (gdb) cont Continuing. /bin/cp: closing `/tmp/a': Input/output error Program exited with code 01.

Bingo! It does work (err, it does fail?) I still have some homework to do: is it necessary to use fsync(2) before closing the file, in order to really make sure that all pending errors get reported? That's one thing that cp(1) doesn't do, according to strace(1).