Saturday, October 29, 2005
Touring El-Torito (viewing Minix 3 sources)
Briefly, the Minix 3 CD takes a very liberal view of the standard CD formats, and appends raw filesystem images (/ and /usr) after the ECMA-119/ISO-9660 data. The rip3 utility accesses this data, which is otherwise hidden.
The rip3 project was fun to write, and had a number of unexpected hurdles. It started as an excuse to read the ECMA-119 (ISO-9660) and El-Torito specifications, both of which are quite palatable, and the initial intent was to read data directly from the CD as a raw block device.
The ECMA-119 specification divides the CD into a system area and a data area, with the system area occupying the first 16 sectors of the disc (sectors 0 to 15). The fun starts with the data area, which contains all of the file data and supporting structures, including a catalogue of 'volume descriptors' at its head. ECMA-119 makes very few assumptions about the ordering of the descriptors, whereas El-Torito asserts very definitely that the primary volume descriptor will be at sector 16, and that the boot record descriptor will be at sector 17! These assumptions allow huge simplifications in the rip3 code.
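To make the fixed-sector assumption concrete, here is a minimal sketch (not the actual rip3 code) of checking the descriptors at sectors 16 and 17. The function names and the SECTOR_OFFSET macro are made up for illustration. The field positions are the standard ones: a type byte, the "CD001" identifier, and, for the El-Torito boot record, the boot system identifier at byte 7 and a little-endian boot catalogue pointer at byte 0x47.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 2048L
/* Byte offset of sector n in an ISO image with 2048-byte sectors. */
#define SECTOR_OFFSET(n) ((long)(n) * SECTOR_SIZE)

enum { VD_BOOT_RECORD = 0, VD_PRIMARY = 1 };

/* Returns 1 if the 2048-byte sector is a volume descriptor of the given type. */
int is_volume_descriptor(const uint8_t *sector, int type)
{
    assert(sector != NULL);
    return sector[0] == type && memcmp(sector + 1, "CD001", 5) == 0;
}

/* Returns 1 if the sector is an El-Torito boot record, and stores the
 * sector number of the boot catalogue in *catalog_lba. Returns 0 otherwise. */
int el_torito_boot_catalog(const uint8_t *sector, uint32_t *catalog_lba)
{
    static const char id[] = "EL TORITO SPECIFICATION";

    if (!is_volume_descriptor(sector, VD_BOOT_RECORD))
        return 0;
    if (memcmp(sector + 7, id, sizeof id - 1) != 0)
        return 0;
    /* 32-bit little-endian pointer at byte offset 0x47 */
    *catalog_lba = (uint32_t)sector[0x47]
                 | (uint32_t)sector[0x48] << 8
                 | (uint32_t)sector[0x49] << 16
                 | (uint32_t)sector[0x4A] << 24;
    return 1;
}
```

With an image file, the two sectors would be read from SECTOR_OFFSET(16) and SECTOR_OFFSET(17) and passed to these checks, with no need to walk the descriptor list.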
The primary volume descriptor contains all sorts of interesting information about the CD, including the volume space size ('the number of logical blocks in which the volume space of the volume is recorded'). Now all of the interesting Minix file data is stored after the volume space, and my initial thought was that, after opening the CD as a raw device (cf. CreateFile( "\\.\D:", ... ) with a suitably aligned buffer, as required by FILE_FLAG_NO_BUFFERING, which is assumed for raw access), I could treat the CD as a block device and randomly access data anywhere on the disc. This failed, as the Microsoft CD device driver prevents access beyond the end of the volume... quite different from Unix, where dd can slurp data freely... Fortunately, almost no effort was required to change the code to access an ISO image file instead, where seeking 'beyond the bounds' was quite acceptable!
Moral of the story: don't trust the metadata! Seems to apply to the semantic web as well :-)
An ugly consequence of attempting to access the raw CD device was the heavy use of Windows-specific functions (e.g. CreateFile), so I next turned my attention to cleaning up the code to use just the standard C I/O functions, fopen() etc... The code changes were quite simple, but for some reason I started getting short reads, where fread() returned less data than requested. I've not diagnosed the problem, but a characteristic of the code was the use of fseek() prior to each read. Note to self: create a simple test case for the problem... In the meantime, the code still uses the Windows API, and hence won't be directly portable to other systems (apart from the case-insensitivity issues discussed below).
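For the promised test case, a seek-then-read helper along the following lines might be a starting point. This is a sketch, not a diagnosis: the short reads above remain unexplained. One well-known cause of short fread() results on Windows is opening a binary file without "b" in the mode string, since text mode translates CR-LF pairs and treats Ctrl-Z (0x1A) as end of file, so the sketch assumes the stream was opened with "rb". The name read_at is invented here.

```c
#include <assert.h>
#include <stdio.h>

/* Seek to `offset` and read up to `len` bytes into `buf`.
 * Returns the number of bytes actually read; a result smaller than
 * `len` means end of file or an error (check feof()/ferror()).
 * Assumes `fp` was opened in binary mode, e.g. fopen(path, "rb").
 * Note: a long offset limits this to images under 2 GB on Windows. */
size_t read_at(FILE *fp, long offset, void *buf, size_t len)
{
    unsigned char *out = buf;
    size_t total = 0;

    assert(fp != NULL && buf != NULL);
    if (fseek(fp, offset, SEEK_SET) != 0)
        return 0;
    while (total < len) {
        size_t n = fread(out + total, 1, len - total, fp);
        if (n == 0)
            break;              /* end of file or read error */
        total += n;
    }
    return total;
}
```

Running this against an image containing 0x1A and CR-LF bytes, once opened with "rb" and once with "r", would show whether text mode explains the symptom.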
Finally, I was able to access the raw Minix file systems. Writing the code to interpret the inodes, etc..., and access the file data was fun. Despite 'knowing' the theory for years, getting down and dirty with the code certainly tested my knowledge. To stick with the moral of today's post, I made the fatal mistake of trusting the metadata - in this case, the following comment in /usr/src/servers/fs/super.h:
/* Super block table.  The root file system and every mounted file system
 * has an entry here.  The entry holds information about the sizes of the bit
 * maps and inodes.  The s_ninodes field gives the number of inodes available
 * for files and directories, including the root directory.  Inode 0 is
 * on the disk, but not used.  Thus s_ninodes = 4 means that 5 bits will be
 * used in the bit map, bit 0, which is always 1 and not used, and bits 1-4
 * for files and directories.  The disk layout is:
 *
 *    Item          # blocks
 *    boot block    1
 *    super block   1    (offset 1kB)
 *    inode map     s_imap_blocks
 *    zone map      s_zmap_blocks
 *    inodes        (s_ninodes + 'inodes per block' - 1)/'inodes per block'
 *    unused        whatever is needed to fill out the current zone
 *    data zones    (s_zones - s_firstdatazone) << s_log_zone_size
 *
 * A super_block slot is free if s_dev == NO_DEV.
 */
The statement 'Inode 0 is on the disk, but not used' is only partially true: inode 0 has a bit in the inode map, but no entry in the blocks of inodes that follow the zone map. In other words, the first entry in the inode blocks is inode 1, which anchors the root directory of the file system. Prior to this observation, the extraction was reading weird data (i.e. it didn't work!) from all over the image.
The final hurdle was working around the (brain-dead) case-insensitivity of the Windows file system. Since the Minix file system is case-sensitive (like any reasonable system), conflicts can arise when a directory contains two names differing only in case. Luckily, in the Minix 3.1.1 release this only occurs once, inside the nvi sources. This is well enough away from the interesting kernel source code that it is unlikely to affect anybody. The code also handles the MS-DOS 'reserved names' (e.g. aux, con, prn, etc...) safely - I only wish I'd written this code earlier, since the original ACK code contained files inconveniently named 'aux.c' that refused to be extracted on Windows. Whilst not a big deal to work around, it was certainly an inconvenience.
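Both checks are simple to sketch. The helpers below are illustrative only (the names are invented and this is not the rip3 implementation): one flags names whose stem, up to the first dot, is a reserved DOS device name, which is why 'aux.c' is a problem, and the other flags two distinct names that differ only in case.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if `name` maps onto a reserved DOS device name
 * (con, prn, aux, nul, com1-com9, lpt1-lpt9), ignoring case and
 * anything after the first dot. */
int is_dos_reserved(const char *name)
{
    static const char *const reserved[] = { "con", "prn", "aux", "nul" };
    char stem[5];
    size_t len, i;

    assert(name != NULL);
    len = strcspn(name, ".");           /* length of the part before the dot */
    if (len < 3 || len > 4)
        return 0;
    for (i = 0; i < len; i++)
        stem[i] = (char)tolower((unsigned char)name[i]);
    stem[len] = '\0';

    if (len == 3) {
        for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
            if (strcmp(stem, reserved[i]) == 0)
                return 1;
        return 0;
    }
    /* four characters: com1..com9 or lpt1..lpt9 */
    return (strncmp(stem, "com", 3) == 0 || strncmp(stem, "lpt", 3) == 0)
        && stem[3] >= '1' && stem[3] <= '9';
}

/* Returns 1 if `a` and `b` are different names that differ only in case. */
int collides_ignoring_case(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    if (strcmp(a, b) == 0)
        return 0;                       /* identical, so not a collision */
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        a++;
        b++;
    }
    return *a == *b;                    /* both must end together */
}
```

An extractor could rename any entry for which either check fires, for example by adding a suffix, before creating the file on Windows.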
In summary, the tour through El-Torito was all too short. I'd love to return to those exotic environs, and other exciting places, in the near future...