Tobin Fricke's Lab Notebook (@ 2004-04-06 16:11:00)
Yay! I think I've recovered nearly all of the images from the accidentally-formatted Compact Flash card. Using the program dd (also available for Windows), I made a direct copy of the CF card to a file on my local filesystem, so that I could analyse it later and use the CF card for other things in the meantime.
My first attempt to get at the data was to write a program to decode the FAT16 filesystem and to read directory entries, but this proved more-or-less useless, since the FAT (file allocation table) was all zeroed out (naturally). It was a little useful, because it could read the directory entries after I located them with a hex editor, but still, no go.
Something that I learned here is that you can point the Linux fdisk program at a file containing a disk image just as well as you can point it at a block device like /dev/hda. Makes sense, but I hadn't thought of that! In any case that is not useful here, since it turns out that the CF card doesn't have a partition table. My FAT16 decoding program dumped this info about the filesystem:
>./fat16
Volume identification information
bytes per sector: 512 (512 expected)
sectors per cluster: 4 (power of two)
reserved sectors: 1 (usually 0x20)
file allocation tables: 2 (should be 2)
executable marker aa55 (should be aa55)
signature: FAT16
sectors in partition: 253408
partition size: 129744896 bytes
clusters in partition: 63352 clusters
size of FAT: 247 sectors
entries in FAT: 63232
max root dir entries: 512
Looking at FAT table 0
63232 clusters
0 + 10 in use
0 bad
0 reserved
63222 available
Looking at FAT table 1
63232 clusters
0 + 10 in use
0 bad
0 reserved
63222 available
Root Directory (starts at byte offset 0xf6f6):
00. ...v.a.. 43 CANON_DC. 0 cluster 0 date 2004-03-28 18:48:20
01. .h...a.. 4e NIKON001.DSC 512 cluster 2 date 2004-03-31 23:55:16
02. ....d... 4d MISC . 0 cluster 3 date 2004-03-31 23:55:16
03. ....d... 44 DCIM . 0 cluster 5 date 2004-03-31 23:55:20
05. ....d... 54 TRASHE~1. 0 cluster 3284 date 2004-04-01 08:08:14
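For anyone who wants to reproduce the first part of that dump, here is a minimal sketch in Python of how the volume information can be decoded from the first 512 bytes of the image. It is not the fat16.c program; the function name `parse_fat16_boot_sector` and the dictionary keys are made up for illustration, and the field offsets are the standard FAT12/16 boot sector layout.

```python
import struct

def parse_fat16_boot_sector(sector: bytes) -> dict:
    """Decode the volume fields from a FAT16 boot sector (hypothetical helper)."""
    if len(sector) < 512:
        raise ValueError("need at least 512 bytes")
    # Little-endian fields packed back to back, starting at byte 11.
    (bytes_per_sector, sectors_per_cluster, reserved_sectors, fat_count,
     max_root_entries, total_sectors_16, _media,
     sectors_per_fat) = struct.unpack_from("<HBHBHHBH", sector, 11)
    # The 16-bit sector count is zero on volumes with more than 65535 sectors;
    # the 32-bit count at byte 32 is used instead.
    (total_sectors_32,) = struct.unpack_from("<I", sector, 32)
    total_sectors = total_sectors_16 or total_sectors_32
    # Bytes 0x55 0xAA at offset 510 read as the little-endian word 0xAA55.
    (boot_signature,) = struct.unpack_from("<H", sector, 510)
    return {
        "bytes_per_sector": bytes_per_sector,
        "sectors_per_cluster": sectors_per_cluster,
        "reserved_sectors": reserved_sectors,
        "fat_count": fat_count,
        "max_root_entries": max_root_entries,
        "sectors_per_fat": sectors_per_fat,
        "total_sectors": total_sectors,
        "partition_bytes": total_sectors * bytes_per_sector,
        "fs_type": sector[54:62].decode("ascii", "replace").rstrip(),
        "boot_signature": boot_signature,
    }
```

Typical use would be something like `parse_fat16_boot_sector(open("flash.dsk", "rb").read(512))`, where the file name is whatever the dd copy was saved as.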
All of this data is legitimate data, because after the CF card was formatted, the camera wrote a new filesystem to the card. (So some of my image data was overwritten.) I was pleased to see how easy it was to decode even the date and time information from the directory entries.
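As an illustration of how little work the date and time decoding takes, here is a short Python sketch for one 32-byte directory entry. The function name `decode_dir_entry` is hypothetical and this is not taken from fat16.c; it assumes the standard short-name entry layout (name in bytes 0-10, time and date words at bytes 22 and 24, first cluster at byte 26, size at byte 28).

```python
import struct
from datetime import datetime

def decode_dir_entry(entry: bytes) -> dict:
    """Decode the short-name fields of one 32-byte FAT directory entry."""
    if len(entry) != 32:
        raise ValueError("a directory entry is exactly 32 bytes")
    name = entry[0:8].decode("ascii", "replace").rstrip()
    ext = entry[8:11].decode("ascii", "replace").rstrip()
    time_word, date_word, first_cluster, size = struct.unpack_from("<HHHI", entry, 22)
    # Date word: 7 bits of year since 1980, 4 bits of month, 5 bits of day.
    # Time word: 5 bits of hours, 6 bits of minutes, 5 bits of seconds / 2.
    # datetime() raises ValueError if the fields are zeroed or out of range.
    modified = datetime(
        1980 + (date_word >> 9), (date_word >> 5) & 0x0F, date_word & 0x1F,
        time_word >> 11, (time_word >> 5) & 0x3F, (time_word & 0x1F) * 2,
    )
    return {
        "name": f"{name}.{ext}" if ext else name,
        "attributes": entry[11],
        "first_cluster": first_cluster,
        "size": size,
        "modified": modified,
    }
```

Because seconds are stored divided by two, timestamps only ever come out with even seconds, which matches the dates in the listing.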
See where it says "0 + 10 in use"? In the file allocation table, there are two kinds of entries for clusters which are in use. If the file continues past this cluster into another one, then the FAT entry holds the number of that next cluster. But if this cluster is (1) in use, and (2) the last cluster of its file, then the entry holds a special end-of-chain value. So "0 + 10" means that no entries point to a next cluster and ten entries are end-of-chain markers: there are ten files (or directories) in the filesystem, each using exactly one cluster.
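A tally like the one in the dump can be sketched in a few lines of Python. This is an illustration rather than the logic of fat16.c: the names `classify_fat16_entry` and `summarize_fat16` are invented here, the value ranges are the commonly documented FAT16 ones, and the sketch counts entries 0 and 1 like any other entry even though they don't correspond to data clusters.

```python
import struct
from collections import Counter

def classify_fat16_entry(value: int) -> str:
    """Name the state that a 16-bit FAT entry describes."""
    if value == 0x0000:
        return "available"
    if value == 0xFFF7:
        return "bad"
    if value >= 0xFFF8:
        return "end_of_chain"   # in use, last cluster of its file
    if value >= 0xFFF0 or value == 0x0001:
        return "reserved"
    return "chained"            # in use, value is the next cluster number

def summarize_fat16(fat: bytes) -> Counter:
    """Count the entries of each kind in a raw FAT16 table."""
    count = len(fat) // 2
    entries = struct.unpack("<%dH" % count, fat[: count * 2])
    return Counter(classify_fat16_entry(v) for v in entries)
```

In these terms, "0 + 10 in use" would correspond to `chained` being 0 and `end_of_chain` being 10.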
My successful approach was simpler. I piped the 128 MB file through the unix program 'strings' to identify bursts of legible text, using the "-t x" option to include the offsets at which strings were found, and then piped that through grep, searching for the string "Exif", which occurs in the JPEG headers of files written by this camera. Presto! A list of candidate image file locations:
>cat ~/flash.dsk | strings -t x | grep Exif
107a06 Exif
221a06 Exif
224206 Exif
3a1206 Exif
46ea06 Exif
514206 Exif
6eea06 Exif
72f206 Exif
7df206 Exif
7e0a06 Exif
885a06 Exif
Note how they are all at relatively "round number" locations. The 'Exif' string occurs at offset 6 in a JPEG file written by this camera, so this means that there are JPEG files at locations 0x221a00, 0x224200, etc. Numbers ending in 00 in hexadecimal are divisible by 256, and every one of these offsets is in fact divisible by 512 (0x200), which is just what we expect, since this FAT16 filesystem uses 512-byte sectors!
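The same search can be done without strings and grep. Here is a rough Python equivalent, assuming the whole disk image fits in memory; the function name `find_jpeg_candidates` and the `sector_size` parameter are invented for this sketch. It adds two sanity checks on top of the grep: the candidate start must be sector-aligned, and it must begin with the bytes 0xFF 0xD8.

```python
def find_jpeg_candidates(image: bytes, sector_size: int = 512) -> list:
    """Return offsets where an Exif JPEG plausibly starts in a raw disk image."""
    candidates = []
    pos = image.find(b"Exif")
    while pos != -1:
        start = pos - 6  # "Exif" sits 6 bytes into the file
        if (start >= 0
                and start % sector_size == 0
                and image[start:start + 2] == b"\xff\xd8"):
            candidates.append(start)
        pos = image.find(b"Exif", pos + 1)
    return candidates
```

Something like `[hex(o) for o in find_jpeg_candidates(open("flash.dsk", "rb").read())]` would then print the candidate start offsets directly.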
Okay, now we know where the JPEG files start, but in order to extract them, we need to know where they end. It turns out that JPEG files are encoded as a chain of blocks, where each block consists of the byte 0xFF followed by a "marker indicator," which tells you what kind of block this is. After this, there is a two-byte field (most significant byte first) that tells you how long this block is, counting the two length bytes themselves. The last block before the image data is the "Start of Scan" (SOS) block, identified by the marker 0xDA. After the SOS block comes the image data itself. There's no indication in advance of how long the image data will be, but it's terminated by the sequence 0xFF 0xD9. That sequence is guaranteed not to occur in the image data: whenever the Huffman-encoded data happens to contain a 0xFF byte, the encoder follows it with a 0x00 byte.
So, here is the whole procedure to read a JPEG file, ignoring all the data:
read a word W and check that W = 0xFF 0xD8 ("Start of Image")
repeat
read a byte, check that it is indeed 0xFF. if not, abort
read a byte M, this is the marker type.
read a word (two bytes, most significant first) W
read W - 2 bytes, and throw them away (W counts its own two bytes)
until M == 0xDA ("Start of Scan")
repeat
read a byte B, remembering the previous byte P
until P B == 0xFF 0xD9 ("End of Image")
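The procedure above can be sketched in Python as follows. This is an illustration under a few assumptions, not the code actually used for the recovery: the name `jpeg_length` is made up, the length word is taken to be big-endian and to count its own two bytes (as the JPEG format defines it), the end marker is searched for byte by byte so that it is found at odd offsets too, and the file is assumed to contain a single scan.

```python
import struct

def jpeg_length(data: bytes, start: int = 0) -> int:
    """Return the length in bytes of the JPEG that begins at data[start]."""
    if data[start:start + 2] != b"\xff\xd8":
        raise ValueError("no Start of Image marker")
    pos = start + 2
    while True:
        if pos + 4 > len(data) or data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %#x" % pos)
        marker = data[pos + 1]
        # Big-endian length that includes the two length bytes themselves.
        (length,) = struct.unpack_from(">H", data, pos + 2)
        if length < 2:
            raise ValueError("bad block length at offset %#x" % pos)
        pos += 2 + length          # skip the marker and the whole block
        if marker == 0xDA:         # Start of Scan: image data follows
            break
    end = data.find(b"\xff\xd9", pos)
    if end == -1:
        raise ValueError("no End of Image marker")
    return end + 2 - start
```

Extracting a file is then just a slice, for example `image[start:start + jpeg_length(image, start)]` for each candidate start offset.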
This assumes that the data will be stored contiguously on disk, i.e., that the JPEG file exists in one big chunk. In general that will not be the case, since FAT works by dividing up the filesystem into chunks of several kilobytes called clusters, and then keeping a table to remember the chain of clusters occupied by any file. As a FAT filesystem is used, it will tend to become "fragmented," with each file stored in many separate pieces. But if you start with an empty FAT filesystem and only add files to it, the files will each be contiguous, which is what I'm relying on for the above to work.
One of the morals of this story is that, if you want to destroy your data, formatting is not enough. Anyways, here's one of the rescued pictures (cropped and scaled down by 30%), from our tidepooling expedition to Bodega Bay:
![[tidepooling at bodega bay]](http://splorg.org/people/tobin/pictures/bodega_tidepools.jpg)
My reference for the structure of the FAT16 filesystem was a nice document written by Jack Dobiash. My (totally unpolished) program to read a FAT16 filesystem is available as projects/filesystem/fat16.c. I found a document on the EXIF File Format that described the JPEG format adequately for my needs. Another interesting find is the program Foremost; written by the United States Air Force Office of Special Investigations, it scans through a mess of data (e.g., a disk image) looking for the signatures of various files, and it extracts them if it can. But it is very simplistic, relying on a short "start of file" signature and a short "end of file" signature, so it can't quite work for JPEG files as described above (because the end-of-file signature might appear spuriously before the end of the variable-length header).
Now maybe I should do some Actual Work™.