<!DOCTYPE Article PUBLIC "-//Davenport//DTD DocBook V3.0//EN"> <Article> <ArtHeader> <Title>The extended-2 filesystem overview</Title> <AUTHOR > <FirstName>Gadi Oxman, tgud@tochnapc2.technion.ac.il</FirstName> </AUTHOR > <PubDate>v0.1, August 3 1995</PubDate> </ArtHeader>
<Sect1> <Title>Preface</Title>
<Para> This document attempts to present an overview of the internal structure of the ext2 filesystem. It was written in the summer of 1995, while I was working on the <Literal remap="tt">ext2 filesystem editor project (EXT2ED)</Literal>. </Para>
<Para> In the process of constructing EXT2ED, I acquired knowledge of the various design aspects of the ext2 filesystem. This document is the result of an effort to document this knowledge. </Para>
<Para> This is only the initial version of this document. It is obviously neither error-free nor complete, but at least it provides a starting point. </Para>
<Para> In the process of learning the subject, I have used the following sources and tools: <ItemizedList> <ListItem> <Para> Experimenting with EXT2ED, as it was developed. </Para> </ListItem> <ListItem> <Para> The ext2 kernel sources: <ItemizedList> <ListItem> <Para> The main ext2 include file, <FILENAME>/usr/include/linux/ext2_fs.h</FILENAME> </Para> </ListItem> <ListItem> <Para> The contents of the directory <FILENAME>/usr/src/linux/fs/ext2</FILENAME>. </Para> </ListItem> <ListItem> <Para> The VFS layer sources (only a bit). </Para> </ListItem> </ItemizedList> </Para> </ListItem> <ListItem> <Para> The slides: The Second Extended File System, Current State, Future Development, by <personname><firstname>Remy</firstname> <surname>Card</surname></personname>. </Para> </ListItem> <ListItem> <Para> The slides: Optimisation in File Systems, by <personname><firstname>Stephen</firstname> <surname>Tweedie</surname></personname>. </Para> </ListItem> <ListItem> <Para> The various ext2 utilities. 
</Para> </ListItem> </ItemizedList> </Para> </Sect1>
<Sect1> <Title>Introduction</Title>
<Para> The <Literal remap="tt">Second Extended File System (Ext2fs)</Literal> is very popular among Linux users. If you use Linux, chances are that you are using the ext2 filesystem. </Para>
<Para> Ext2fs was designed by <personname><firstname>Remy</firstname> <surname>Card</surname></personname> and <personname><firstname>Wayne</firstname> <surname>Davison</surname></personname>. It was implemented by <personname><firstname>Remy</firstname> <surname>Card</surname></personname> and was further enhanced by <personname><firstname>Stephen</firstname> <surname>Tweedie</surname></personname> and <personname><firstname>Theodore</firstname> <surname>Ts'o</surname></personname>. </Para>
<Para> The ext2 filesystem is still under development. I will document here version 0.5a, which is distributed along with Linux 1.2.x. At the time of writing, the most recent version of Linux is 1.3.13, and the version of the ext2 kernel source is 0.5b. A lot of fancy enhancements are planned for the ext2 filesystem in Linux 1.3, so stay tuned. </Para> </Sect1>
<Sect1> <Title>A filesystem - Why do we need it?</Title>
<Para> Before we dive into the various small details, I would like to spend a few minutes discussing filesystems from a general point of view. </Para>
<Para> The word <Literal remap="tt">filesystem</Literal> consists of two words - <Literal remap="tt">file</Literal> and <Literal remap="tt">system</Literal>. </Para>
<Para> Everyone knows the meaning of the word <Literal remap="tt">file</Literal> - A bunch of data put somewhere. Where? This is an important question. I, for example, usually throw almost everything into a single drawer, and have difficulties finding something later. 
</Para>
<Para> This is where the <Literal remap="tt">system</Literal> comes in - Instead of just throwing the data onto the device, we generalize and construct a <Literal remap="tt">system</Literal> which presents a nice and ordered virtual structure in which we can arrange our data, in much the same way as books are arranged in a library. The purpose of the filesystem, as I understand it, is to make it easy for us to update and maintain our data. </Para>
<Para> Normally, by <Literal remap="tt">mounting</Literal> filesystems, we just use the nice and logical virtual structure. However, the disk knows nothing about that - The device driver views the disk as a large continuous sheet of paper on which we can write notes wherever we wish. It is the task of the filesystem management code to store the bookkeeping information which the kernel uses to show us the nice and ordered virtual structure. </Para>
<Para> In this document, we consider one particular administrative structure - The Second Extended Filesystem. </Para> </Sect1>
<Sect1> <Title>The Linux VFS layer</Title>
<Para> When Linux was first developed, it supported only one filesystem - The <Literal remap="tt">Minix</Literal> filesystem. Today, Linux has the ability to support several filesystems concurrently. This was done by the introduction of another layer between the kernel and the filesystem code - The Virtual File System (VFS). </Para>
<Para> The kernel "speaks" with the VFS layer. The VFS layer passes the kernel's request to the proper filesystem management code. I did not need to study the VFS layer for the construction of EXT2ED, so I can't elaborate on it. Just be aware that it exists. </Para> </Sect1>
<Sect1> <Title>About blocks and block groups</Title>
<Para> In order to ease management, the ext2 filesystem logically divides the disk into small units called <Literal remap="tt">blocks</Literal>. A block is the smallest unit which can be allocated. 
Each block in the filesystem can be <Literal remap="tt">allocated</Literal> or <Literal remap="tt">free</Literal>. <FOOTNOTE> <Para> The Ext2fs source code refers to the concept of <Literal remap="tt">fragments</Literal>, which I believe are supposed to be sub-block allocations. As far as I know, fragments are currently unsupported in Ext2fs. </Para> </FOOTNOTE> The block size can be selected to be 1024, 2048 or 4096 bytes when creating the filesystem. </Para>
<Para> Ext2fs groups together a fixed number of sequential blocks into a <Literal remap="tt">blocks group</Literal>. The resulting situation is that the filesystem is managed as a series of blocks groups. This is done in order to keep related information physically close on the disk and to ease the management task. As a result, much of the filesystem management reduces to management of a single blocks group. </Para> </Sect1>
<Sect1> <Title>The view of inodes from the point of view of a blocks group</Title>
<Para> Each file in the filesystem is assigned its own <Literal remap="tt">inode</Literal>. I don't want to explain inodes now. Rather, I would like to treat the inode as another resource, much like a <Literal remap="tt">block</Literal> - Each blocks group contains a limited number of inodes, and any specific inode can be <Literal remap="tt">allocated</Literal> or <Literal remap="tt">unallocated</Literal>. </Para> </Sect1>
<Sect1> <Title>The group descriptors</Title>
<Para> Each blocks group is accompanied by a <Literal remap="tt">group descriptor</Literal>. The group descriptor summarizes some necessary information about the specific blocks group. 
The definition of the group descriptor, as found in <FILENAME>/usr/include/linux/ext2_fs.h</FILENAME>, follows: </Para>
<Para> <ProgramListing>
struct ext2_group_desc
{
        __u32   bg_block_bitmap;        /* Blocks bitmap block */
        __u32   bg_inode_bitmap;        /* Inodes bitmap block */
        __u32   bg_inode_table;         /* Inodes table block */
        __u16   bg_free_blocks_count;   /* Free blocks count */
        __u16   bg_free_inodes_count;   /* Free inodes count */
        __u16   bg_used_dirs_count;     /* Directories count */
        __u16   bg_pad;
        __u32   bg_reserved[3];
};
</ProgramListing> </Para>
<Para> The three counters <Literal remap="tt">bg_free_blocks_count, bg_free_inodes_count and bg_used_dirs_count</Literal> provide statistics about the use of the three resources in a blocks group - The <Literal remap="tt">blocks</Literal>, the <Literal remap="tt">inodes</Literal> and the <Literal remap="tt">directories</Literal>. I believe that they are used by the kernel for balancing the load between the various blocks groups. </Para>
<Para> <Literal remap="tt">bg_block_bitmap</Literal> contains the block number of the <Literal remap="tt">block allocation bitmap block</Literal>. This is used to allocate / deallocate each block in the specific blocks group. </Para>
<Para> <Literal remap="tt">bg_inode_bitmap</Literal> is fully analogous to the previous variable - It contains the block number of the <Literal remap="tt">inode allocation bitmap block</Literal>, which is used to allocate / deallocate each specific inode in the blocks group. </Para>
<Para> <Literal remap="tt">bg_inode_table</Literal> contains the block number of the start of the <Literal remap="tt">inode table of the current blocks group</Literal>. The <Literal remap="tt">inode table</Literal> is just the actual inodes which are reserved for the current blocks group. </Para>
<Para> The block bitmap block, inode bitmap block and the inode table are created when the filesystem is created. </Para>
<Para> The group descriptors are placed one after the other. 
Together they make the <Literal remap="tt">group descriptors table</Literal>. </Para>
<Para> Each blocks group contains the entire table of group descriptors in its second block, right after the superblock. However, only the first copy (in group 0) is actually used by the kernel. The other copies are there for backup purposes and can be of use if the main copy gets corrupted. </Para> </Sect1>
<Sect1> <Title>The block bitmap allocation block</Title>
<Para> Each blocks group contains one special block which is actually a map of all the blocks in the group, with respect to their allocation status. Each <Literal remap="tt">bit</Literal> in the block bitmap indicates whether a specific block in the group is used or free. </Para>
<Para> The format is actually quite simple - Just view the entire block as a series of bits. For example: </Para>
<Para> Suppose the block size is 1024 bytes. As such, there is room for 1024*8=8192 blocks in a blocks group. This number is one of the fields in the filesystem's <Literal remap="tt">superblock</Literal>, which will be explained later. </Para>
<Para> <ItemizedList> <ListItem> <Para> Block 0 in the blocks group is managed by bit 0 of byte 0 in the bitmap block. </Para> </ListItem> <ListItem> <Para> Block 7 in the blocks group is managed by bit 7 of byte 0 in the bitmap block. </Para> </ListItem> <ListItem> <Para> Block 8 in the blocks group is managed by bit 0 of byte 1 in the bitmap block. </Para> </ListItem> <ListItem> <Para> Block 8191 in the blocks group is managed by bit 7 of byte 1023 in the bitmap block. </Para> </ListItem> </ItemizedList> </Para>
<Para> A value of "<Literal remap="tt">1</Literal>" in the appropriate bit signals that the block is allocated, while a value of "<Literal remap="tt">0</Literal>" signals that the block is unallocated. 
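The mapping between a block and its bit can be sketched in C. This is only an illustration under stated assumptions, not code taken from the kernel: the function name <Literal remap="tt">block_is_allocated</Literal> and its parameters are hypothetical, and bit 0 is assumed to be the least significant bit of each byte.
<ProgramListing><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the ext2 sources): returns 1 if block
 * number `block`, counted from the start of its blocks group, is marked
 * allocated in `bitmap`, and 0 if it is marked free.
 * Assumption: bit 0 is the least significant bit of each byte. */
int block_is_allocated(const uint8_t *bitmap, size_t bitmap_bytes,
                       uint32_t block)
{
    uint32_t byte_index = block / 8;    /* which byte holds the bit */
    uint32_t bit_index  = block % 8;    /* which bit inside that byte */

    assert(bitmap != NULL);
    assert(byte_index < bitmap_bytes);  /* stay inside the bitmap block */

    return (bitmap[byte_index] >> bit_index) & 1;
}
```
]]></ProgramListing>
With a 1024 byte bitmap block, calling the helper with block 8191 looks at bit 7 of byte 1023, in agreement with the list above.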
</Para>
<Para> You will probably notice that typically, all the bits in a byte contain the same value, making the byte's value <Literal remap="tt">0</Literal> or <Literal remap="tt">0xff</Literal>. This is done by the kernel on purpose in order to group related data in physically close blocks, since the physical device is usually optimized to handle such a close relationship. </Para> </Sect1>
<Sect1> <Title>The inode allocation bitmap</Title>
<Para> The format of the inode allocation bitmap block is exactly like the format of the block allocation bitmap block. The explanation above is valid here, with the word <Literal remap="tt">block</Literal> replaced by <Literal remap="tt">inode</Literal>. Typically, there are far fewer inodes than blocks in a blocks group and thus only part of the inode bitmap block is used. The number of inodes in a blocks group is another variable which is listed in the <Literal remap="tt">superblock</Literal>. </Para> </Sect1>
<Sect1> <Title>On the inode and the inode tables</Title>
<Para> The inode is a central resource in the ext2 filesystem. It is used for various purposes, but the main two are: <ItemizedList> <ListItem> <Para> Support of files </Para> </ListItem> <ListItem> <Para> Support of directories </Para> </ListItem> </ItemizedList> </Para>
<Para> Each file, for example, will allocate one inode from the filesystem resources. </Para>
<Para> An ext2 filesystem has a total number of available inodes which is determined while creating the filesystem. When all the inodes are used, you will not be able to create an additional file even though there may still be free blocks on the filesystem. </Para>
<Para> Each inode takes up 128 bytes in the filesystem. By default, <Literal remap="tt">mke2fs</Literal> reserves an inode for each 4096 bytes of the filesystem space. </Para>
<Para> The inodes are placed in several tables, each of which contains the same number of inodes and is placed in a different blocks group. 
The goal is to place inodes and their related files in the same blocks group, for reasons of locality. </Para>
<Para> The number of inodes in a blocks group is available in the superblock variable <Literal remap="tt">s_inodes_per_group</Literal>. For example, if there are 2000 inodes per group, group 0 will contain the inodes 1-2000, group 1 will contain the inodes 2001-4000, and so on. </Para>
<Para> Each inode table is accessed from the group descriptor of the specific blocks group which contains the table. </Para>
<Para> The structure of an inode in Ext2fs follows: </Para>
<Para> <ProgramListing>
struct ext2_inode {
        __u16   i_mode;         /* File mode */
        __u16   i_uid;          /* Owner Uid */
        __u32   i_size;         /* Size in bytes */
        __u32   i_atime;        /* Access time */
        __u32   i_ctime;        /* Creation time */
        __u32   i_mtime;        /* Modification time */
        __u32   i_dtime;        /* Deletion Time */
        __u16   i_gid;          /* Group Id */
        __u16   i_links_count;  /* Links count */
        __u32   i_blocks;       /* Blocks count */
        __u32   i_flags;        /* File flags */
        union {
                struct {
                        __u32  l_i_reserved1;
                } linux1;
                struct {
                        __u32  h_i_translator;
                } hurd1;
                struct {
                        __u32  m_i_reserved1;
                } masix1;
        } osd1;                         /* OS dependent 1 */
        __u32   i_block[EXT2_N_BLOCKS];/* Pointers to blocks */
        __u32   i_version;      /* File version (for NFS) */
        __u32   i_file_acl;     /* File ACL */
        __u32   i_dir_acl;      /* Directory ACL */
        __u32   i_faddr;        /* Fragment address */
        union {
                struct {
                        __u8    l_i_frag;       /* Fragment number */
                        __u8    l_i_fsize;      /* Fragment size */
                        __u16   i_pad1;
                        __u32   l_i_reserved2[2];
                } linux2;
                struct {
                        __u8    h_i_frag;       /* Fragment number */
                        __u8    h_i_fsize;      /* Fragment size */
                        __u16   h_i_mode_high;
                        __u16   h_i_uid_high;
                        __u16   h_i_gid_high;
                        __u32   h_i_author;
                } hurd2;
                struct {
                        __u8    m_i_frag;       /* Fragment number */
                        __u8    m_i_fsize;      /* Fragment size */
                        __u16   m_pad1;
                        __u32   m_i_reserved2[2];
                } masix2;
        } osd2;                         /* OS dependent 2 */
};
</ProgramListing> </Para>
<Sect2> <Title>The allocated blocks</Title>
<Para> The basic functionality of an inode is to group together a series of allocated blocks. 
There is no limitation on the allocated blocks - Any block can be allocated to any inode. Nevertheless, block allocation will usually be done in series to take advantage of the locality principle. </Para>
<Para> The inode is not always used in that way. I will now explain the allocation of blocks, assuming that the current inode type indeed refers to a list of allocated blocks. </Para>
<Para> It was found experimentally that many of the files in the filesystem are actually quite small. To take advantage of this effect, the kernel provides storage of up to 12 block numbers in the inode itself. Those blocks are called <Literal remap="tt">direct blocks</Literal>. The advantage is that once the kernel has the inode, it can directly access the file's blocks, without an additional disk access. Those 12 blocks are directly specified in the variables <Literal remap="tt">i_block[0] to i_block[11]</Literal>. </Para>
<Para> <Literal remap="tt">i_block[12]</Literal> is the <Literal remap="tt">indirect block</Literal> - The block pointed to by i_block[12] will <Literal remap="tt">not</Literal> be a data block. Rather, it will just contain a list of block numbers of data blocks. For example, if the block size is 1024 bytes, since each block number is 4 bytes long, there will be room for 256 block numbers. That is, counting from 1, blocks 13 through 268 of the file will be accessed by the <Literal remap="tt">indirect block</Literal> method. The penalty in this case, compared to the direct blocks case, is that an additional access to the device is needed - We need <Literal remap="tt">two</Literal> accesses to reach the required data block. </Para>
<Para> In much the same way, <Literal remap="tt">i_block[13]</Literal> is the <Literal remap="tt">double indirect block</Literal> and <Literal remap="tt">i_block[14]</Literal> is the <Literal remap="tt">triple indirect block</Literal>. </Para>
<Para> <Literal remap="tt">i_block[13]</Literal> points to a block which contains pointers to indirect blocks. 
Each one of them is handled in the way described above. </Para>
<Para> In much the same way, the triple indirect block is just an additional level of indirection - It will point to a list of double indirect blocks. </Para> </Sect2>
<Sect2> <Title>The i_mode variable</Title>
<Para> The i_mode variable is used to determine the <Literal remap="tt">inode type</Literal> and the associated <Literal remap="tt">permissions</Literal>. It is best described by representing it as an octal number. Since it is a 16 bit variable, there will be 6 octal digits. Those are divided into two parts - The rightmost 4 digits and the leftmost 2 digits. </Para>
<Sect3> <Title>The rightmost 4 octal digits</Title>
<Para> The rightmost 4 digits are <Literal remap="tt">bit options</Literal> - Each bit has its own purpose. </Para>
<Para> The last 3 digits (Octal digits 0,1 and 2) are just the usual permissions, in the known form <Literal remap="tt">rwxrwxrwx</Literal>. Digit 2 refers to the user, digit 1 to the group and digit 0 to everyone else. They are used by the kernel to grant or deny access to the object presented by this inode. <FOOTNOTE> <Para> A <Literal remap="tt">smarter</Literal> permissions control is one of the enhancements planned for Linux 1.3 - The ACL (Access Control Lists). Actually, from browsing the kernel source, some of the ACL handling is already done. </Para> </FOOTNOTE> </Para>
<Para> Bit number 9 signals that the file (I'll refer to the object presented by the inode as file even though it can be a special device, for example) is <Literal remap="tt">set VTX</Literal>. This is the bit commonly known as the <Literal remap="tt">sticky bit</Literal>. </Para>
<Para> Bit number 10 signals that the file is <Literal remap="tt">set group id</Literal>, which means that when the file is executed, it will run with the effective group id of the file's group. </Para>
<Para> Bit number 11 signals that the file is <Literal remap="tt">set user id</Literal>, which means that when the file is executed, it will run with the effective user id of the file's owner (often root). 
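The rightmost four octal digits can be picked apart with a few masks. The following C sketch is only an illustration: the macro and function names are hypothetical and are not the names used in the kernel headers.
<ProgramListing><![CDATA[
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative masks for the low 12 bits of i_mode, written in octal.
 * The names are hypothetical and exist only for this sketch. */
#define MODE_VTX_BIT   01000u   /* bit 9  */
#define MODE_SGID_BIT  02000u   /* bit 10 */
#define MODE_SUID_BIT  04000u   /* bit 11 */

/* Write the nine permission bits of i_mode into `out` as a string of
 * the form "rwxr-xr-x". `out` must have room for 10 bytes. */
void mode_to_rwx(uint16_t i_mode, char out[10])
{
    static const char letters[] = "rwxrwxrwx";
    int i;

    assert(out != NULL);
    memcpy(out, "---------", 10);       /* nine dashes plus the NUL */
    for (i = 0; i < 9; i++) {
        /* Bit 8 is the user's "r"; bit 0 is the "x" of everyone else. */
        if (i_mode & (1u << (8 - i)))
            out[i] = letters[i];
    }
}

int mode_has_vtx(uint16_t i_mode)     { return (i_mode & MODE_VTX_BIT)  != 0; }
int mode_has_set_gid(uint16_t i_mode) { return (i_mode & MODE_SGID_BIT) != 0; }
int mode_has_set_uid(uint16_t i_mode) { return (i_mode & MODE_SUID_BIT) != 0; }
```
]]></ProgramListing>
For an i_mode whose rightmost four octal digits are 4755, this sketch produces the permission string rwxr-xr-x and reports that the set user id bit is on.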
</Para> </Sect3>
<Sect3> <Title>The leftmost two octal digits</Title>
<Para> Note that the leftmost octal digit can only be 0 or 1, since the total number of bits is 16. </Para>
<Para> Those digits, as opposed to the rightmost 4 digits, are not bit mapped options. They determine the type of the "file" to which the inode belongs: <ItemizedList> <ListItem> <Para> <Literal remap="tt">01</Literal> - The file is a <Literal remap="tt">FIFO</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">02</Literal> - The file is a <Literal remap="tt">character device</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">04</Literal> - The file is a <Literal remap="tt">directory</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">06</Literal> - The file is a <Literal remap="tt">block device</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">10</Literal> - The file is a <Literal remap="tt">regular file</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">12</Literal> - The file is a <Literal remap="tt">symbolic link</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">14</Literal> - The file is a <Literal remap="tt">socket</Literal>. </Para> </ListItem> </ItemizedList> </Para> </Sect3> </Sect2>
<Sect2> <Title>Time and date</Title>
<Para> Linux records the last time at which various operations occurred on the file. The time and date are saved in the standard C library format - The number of seconds which passed since 00:00:00 GMT, January 1, 1970. The following times are recorded: <ItemizedList> <ListItem> <Para> <Literal remap="tt">i_ctime</Literal> - The time at which the inode itself was last changed, for example by a change of permissions or ownership. The header comment calls it the creation time, but the field is updated on later inode changes as well. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">i_mtime</Literal> - The time at which the file was last modified. 
</Para> </ListItem> <ListItem> <Para> <Literal remap="tt">i_atime</Literal> - The time at which the file was last accessed. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">i_dtime</Literal> - The time at which the inode was deallocated. In other words, the time at which the file was deleted. </Para> </ListItem> </ItemizedList> </Para> </Sect2>
<Sect2> <Title>i_size</Title>
<Para> <Literal remap="tt">i_size</Literal> contains information about the size of the object presented by the inode. If the inode corresponds to a regular file, this is just the size of the file in bytes. In other cases, the interpretation of the variable is different. </Para> </Sect2>
<Sect2> <Title>User and group id</Title>
<Para> The user and group id of the file are simply saved in the variables <Literal remap="tt">i_uid</Literal> and <Literal remap="tt">i_gid</Literal>. </Para> </Sect2>
<Sect2> <Title>Hard links</Title>
<Para> Later, when we discuss the implementation of directories, it will be explained that each <Literal remap="tt">directory entry</Literal> points to an inode. It is quite possible that a <Literal remap="tt">single inode</Literal> will be pointed to from <Literal remap="tt">several</Literal> directories. In that case, we say that there exist <Literal remap="tt">hard links</Literal> to the file - The file can be accessed from each of the directories. </Para>
<Para> The kernel keeps track of the number of hard links in the variable <Literal remap="tt">i_links_count</Literal>. The variable is set to "1" when first allocating the inode, and is incremented with each additional link. Deletion of a file will delete the current directory entry and will decrement the number of links. Only when this number reaches zero will the inode actually be deallocated. 
</Para>
<Para> The name <Literal remap="tt">hard link</Literal> is used to distinguish the alias method described above from another alias method called <Literal remap="tt">symbolic linking</Literal>, which will be described later. </Para> </Sect2>
<Sect2> <Title>The Ext2fs extended flags</Title>
<Para> The ext2 filesystem associates additional flags with an inode. The extended attributes are stored in the variable <Literal remap="tt">i_flags</Literal>. <Literal remap="tt">i_flags</Literal> is a 32 bit variable. Only the 7 rightmost bits are defined. Of them, only 5 bits are used in version 0.5a of the filesystem. Specifically, the <Literal remap="tt">undelete</Literal> and the <Literal remap="tt">compress</Literal> features are not implemented, and are to be introduced in Linux 1.3 development. </Para>
<Para> The currently defined flags are: <ItemizedList> <ListItem> <Para> bit 0 - Secure deletion. When this bit is on, the file's blocks are zeroed when the file is deleted. With this bit off, they will just be left with their original data when the inode is deallocated. </Para> </ListItem> <ListItem> <Para> bit 1 - Undelete. This bit is not supported yet. It will be used to provide an <Literal remap="tt">undelete</Literal> feature in future Ext2fs developments. </Para> </ListItem> <ListItem> <Para> bit 2 - Compress file. This bit is also not supported. The plan is to offer "compression on the fly" in future releases. </Para> </ListItem> <ListItem> <Para> bit 3 - Synchronous updates. With this bit on, the meta-data will be written synchronously to the disk, as if the filesystem was mounted with the "sync" mount option. </Para> </ListItem> <ListItem> <Para> bit 4 - Immutable file. When this bit is on, the file will stay as it is - It cannot be changed, deleted or renamed, and no hard links to it can be created, until the bit is cleared. </Para> </ListItem> <ListItem> <Para> bit 5 - Append only file. With this option active, data will only be appended to the file. 
</Para> </ListItem> <ListItem> <Para> bit 6 - Do not dump this file. I think that this bit is used by the port of dump to Linux (ported by <Literal remap="tt">Remy Card</Literal>) to check whether the file should be skipped. </Para> </ListItem> </ItemizedList> </Para> </Sect2>
<Sect2> <Title>Symbolic links</Title>
<Para> The <Literal remap="tt">hard links</Literal> presented above are just additional pointers to the same inode. The important aspect is that the inode number is <Literal remap="tt">fixed</Literal> when the link is created. This means that the implementation details of the filesystem are visible to the user - In a purely abstract usage of the filesystem, the user should not care about inodes. </Para>
<Para> The above causes several limitations: <ItemizedList> <ListItem> <Para> Hard links can be made only within the same filesystem. This is obvious, since a hard link is just an inode number in some directory entry, and the above elements are filesystem specific. </Para> </ListItem> <ListItem> <Para> You cannot "replace" the file which is pointed to by the hard link after the link creation. "Replacing" the file in one directory will still leave the original file in the other directory - The "replacement" will not deallocate the original inode, but rather allocate another inode for the new version, and the directory entry at the other place will just point to the old inode number. </Para> </ListItem> </ItemizedList> </Para>
<Para> A <Literal remap="tt">symbolic link</Literal>, on the other hand, is analyzed at <Literal remap="tt">run time</Literal>. A symbolic link is just a <Literal remap="tt">pathname</Literal> which is accessible from an inode. As such, it "speaks" in the language of the abstract filesystem. When the kernel reaches a symbolic link, it will <Literal remap="tt">follow it at run time</Literal> using its normal way of reaching directories. 
</Para>
<Para> As such, a symbolic link can be made <Literal remap="tt">across different filesystems</Literal> and a replacement of a file with a new version will automatically be active on all its symbolic links. </Para>
<Para> The disadvantage is the cost: a hard link consumes no space except for a small directory entry. A symbolic link, on the other hand, consumes at least an inode, and can also consume one block. </Para>
<Para> When the inode is identified as a symbolic link, the kernel needs to find the path to which it points. </Para>
<Sect3> <Title>Fast symbolic links</Title>
<Para> When the pathname fits in the 60 bytes occupied by the <Literal remap="tt">i_block[0] - i_block[14]</Literal> variables, it can be saved directly in the inode, since those variables are not needed for block numbers in that case. This is called a <Literal remap="tt">fast</Literal> symbolic link. It is fast because the pathname resolution can be done using the inode itself, without accessing additional blocks. It is also economical, since it allocates only an inode. The length of the pathname is stored in the <Literal remap="tt">i_size</Literal> variable. </Para> </Sect3>
<Sect3> <Title>Slow symbolic links</Title>
<Para> When the pathname is too long to fit in the inode, an additional block is allocated (by the use of <Literal remap="tt">i_block[0]</Literal>) and the pathname is stored in it. It is called slow because the kernel needs to read an additional block to resolve the pathname. The length is again saved in <Literal remap="tt">i_size</Literal>. </Para> </Sect3> </Sect2>
<Sect2> <Title>i_version</Title>
<Para> <Literal remap="tt">i_version</Literal> is used in connection with the Network File System. I don't know its exact use. </Para> </Sect2>
<Sect2> <Title>Reserved variables</Title>
<Para> As far as I know, the variables which are connected to ACL and fragments are not currently used. They will be supported in future versions. </Para>
<Para> Ext2fs is being ported to other operating systems. 
As far as I know, at least in Linux, the OS dependent variables are also not used. </Para> </Sect2>
<Sect2> <Title>Special reserved inodes</Title>
<Para> The first ten inodes on the filesystem are special inodes: <ItemizedList> <ListItem> <Para> Inode 1 is the <Literal remap="tt">bad blocks inode</Literal> - I believe that its data blocks contain a list of the bad blocks in the filesystem, which should not be allocated. </Para> </ListItem> <ListItem> <Para> Inode 2 is the <Literal remap="tt">root inode</Literal> - The inode of the root directory. It is the starting point for reaching a known path in the filesystem. </Para> </ListItem> <ListItem> <Para> Inode 3 is the <Literal remap="tt">acl index inode</Literal>. Access control lists are currently not supported by the ext2 filesystem, so I believe this inode is not used. </Para> </ListItem> <ListItem> <Para> Inode 4 is the <Literal remap="tt">acl data inode</Literal>. Of course, the above applies here too. </Para> </ListItem> <ListItem> <Para> Inode 5 is the <Literal remap="tt">boot loader inode</Literal>. I don't know its usage. </Para> </ListItem> <ListItem> <Para> Inode 6 is the <Literal remap="tt">undelete directory inode</Literal>. It is also a foundation for future enhancements, and is currently not used. </Para> </ListItem> <ListItem> <Para> Inodes 7-10 are <Literal remap="tt">reserved</Literal> and currently not used. </Para> </ListItem> </ItemizedList> </Para> </Sect2> </Sect1>
<Sect1> <Title>Directories</Title>
<Para> A directory is implemented in the same way as files are implemented (with the direct blocks, indirect blocks, etc) - It is just a file which is formatted with a special format - A list of directory entries. 
</Para>
<Para> The definition of a directory entry follows: </Para>
<Para> <ProgramListing>
struct ext2_dir_entry {
        __u32   inode;                  /* Inode number */
        __u16   rec_len;                /* Directory entry length */
        __u16   name_len;               /* Name length */
        char    name[EXT2_NAME_LEN];    /* File name */
};
</ProgramListing> </Para>
<Para> Ext2fs supports file names of varying lengths, up to 255 bytes. The <Literal remap="tt">name</Literal> field above just contains the file name. Note that it is <Literal remap="tt">not zero terminated</Literal>; Instead, the variable <Literal remap="tt">name_len</Literal> contains the length of the file name. </Para>
<Para> The variable <Literal remap="tt">rec_len</Literal> is provided because the directory entries are padded with zeroes so that the next entry will be at an offset which is a multiple of 4. The resulting directory entry size is stored in <Literal remap="tt">rec_len</Literal>. If the directory entry is the last in the block, it is padded with zeroes till the end of the block, and rec_len is updated accordingly. </Para>
<Para> The <Literal remap="tt">inode</Literal> variable points to the inode of the above file. </Para>
<Para> Deletion of a directory entry is done by appending the space of the deleted entry to the previous entry, that is, by enlarging the previous entry's <Literal remap="tt">rec_len</Literal>. </Para> </Sect1>
<Sect1> <Title>The superblock</Title>
<Para> The <Literal remap="tt">superblock</Literal> is a block which contains information which describes the state of the internal filesystem. </Para>
<Para> The superblock is located at the <Literal remap="tt">fixed offset 1024</Literal> in the device. Its length is also 1024 bytes. </Para>
<Para> The superblock, like the group descriptors, is copied on each blocks group boundary for backup purposes. However, only the main copy is used by the kernel. 
</Para> <Para> The superblock contains three types of information: <ItemizedList> <ListItem> <Para> Filesystem parameters which are fixed and which were determined when this specific filesystem was created. Some of those parameters can be different in different installations of the ext2 filesystem, but cannot be changed once the filesystem has been created. </Para> </ListItem> <ListItem> <Para> Filesystem parameters which are tunable - Can always be changed. </Para> </ListItem> <ListItem> <Para> Information about the current filesystem state. </Para> </ListItem> </ItemizedList> </Para> <Para> The superblock definition follows: </Para> <Para> <ProgramListing>
struct ext2_super_block {
        __u32   s_inodes_count;         /* Inodes count */
        __u32   s_blocks_count;         /* Blocks count */
        __u32   s_r_blocks_count;       /* Reserved blocks count */
        __u32   s_free_blocks_count;    /* Free blocks count */
        __u32   s_free_inodes_count;    /* Free inodes count */
        __u32   s_first_data_block;     /* First Data Block */
        __u32   s_log_block_size;       /* Block size */
        __s32   s_log_frag_size;        /* Fragment size */
        __u32   s_blocks_per_group;     /* # Blocks per group */
        __u32   s_frags_per_group;      /* # Fragments per group */
        __u32   s_inodes_per_group;     /* # Inodes per group */
        __u32   s_mtime;                /* Mount time */
        __u32   s_wtime;                /* Write time */
        __u16   s_mnt_count;            /* Mount count */
        __s16   s_max_mnt_count;        /* Maximal mount count */
        __u16   s_magic;                /* Magic signature */
        __u16   s_state;                /* File system state */
        __u16   s_errors;               /* Behaviour when detecting errors */
        __u16   s_pad;
        __u32   s_lastcheck;            /* time of last check */
        __u32   s_checkinterval;        /* max. time between checks */
        __u32   s_creator_os;           /* OS */
        __u32   s_rev_level;            /* Revision level */
        __u16   s_def_resuid;           /* Default uid for reserved blocks */
        __u16   s_def_resgid;           /* Default gid for reserved blocks */
        __u32   s_reserved[235];        /* Padding to the end of the block */
};
</ProgramListing> </Para> <Sect2> <Title>Superblock identification</Title> <Para> The ext2 filesystem's superblock is identified by the <Literal remap="tt">s_magic</Literal> field. The current ext2 magic number is 0xEF53. I presume that "EF" means "Extended Filesystem". In versions of the ext2 filesystem prior to 0.2B, the magic number was 0xEF51. Those filesystems are not compatible with the current versions; specifically, the group descriptors definition is different. I doubt that any such installation still exists. </Para> </Sect2> <Sect2> <Title>Filesystem fixed parameters</Title> <Para> By using the word <Literal remap="tt">fixed</Literal>, I mean fixed with respect to a particular installation. Those variables are usually not fixed with respect to different installations. </Para> <Para> The <Literal remap="tt">block size</Literal> is determined by using the <Literal remap="tt">s_log_block_size</Literal> variable. The block size is 1024*pow(2,s_log_block_size) bytes and should be between 1024 and 4096; that is, the available options are 1024, 2048 and 4096. </Para> <Para> <Literal remap="tt">s_inodes_count</Literal> contains the total number of inodes in the filesystem, free and used. </Para> <Para> <Literal remap="tt">s_blocks_count</Literal> contains the total number of blocks in the filesystem, free and used. </Para> <Para> <Literal remap="tt">s_first_data_block</Literal> specifies the <Literal remap="tt">device block</Literal> in which the <Literal remap="tt">superblock</Literal> is located. The superblock is always present at the fixed offset 1024, but the device block numbering can differ. For example, if the block size is 1024, the superblock will be at <Literal remap="tt">block 1</Literal> with respect to the device.
However, if the block size is 4096, offset 1024 is included in <Literal remap="tt">block 0</Literal> of the device, and in that case <Literal remap="tt">s_first_data_block</Literal> will contain 0. At least this is how I understood this variable. </Para> <Para> <Literal remap="tt">s_blocks_per_group</Literal> contains the number of blocks which are grouped together as a blocks group. </Para> <Para> <Literal remap="tt">s_inodes_per_group</Literal> contains the number of inodes available in a blocks group. I think that this is always the total number of inodes divided by the number of blocks groups. </Para> <Para> <Literal remap="tt">s_creator_os</Literal> contains a code number which specifies the operating system which created this specific filesystem: <ItemizedList> <ListItem> <Para> <Literal remap="tt">Linux</Literal> :-) is specified by the value <Literal remap="tt">0</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">Hurd</Literal> is specified by the value <Literal remap="tt">1</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">Masix</Literal> is specified by the value <Literal remap="tt">2</Literal>. </Para> </ListItem> </ItemizedList> </Para> <Para> <Literal remap="tt">s_rev_level</Literal> contains the major version of the ext2 filesystem. Currently this is always <Literal remap="tt">0</Literal>, as the most recent version is 0.5B. It will probably take some time until we reach version 1.0. </Para> <Para> As far as I know, fragments (sub-block allocations) are currently not supported and hence a block is equal to a fragment. As a result, <Literal remap="tt">s_log_frag_size</Literal> and <Literal remap="tt">s_frags_per_group</Literal> are always equal to <Literal remap="tt">s_log_block_size</Literal> and <Literal remap="tt">s_blocks_per_group</Literal>, respectively.
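</Para> <Para> The block size rule and the device block number of the superblock, as described in this section, can be sketched in C as follows. This is only an illustration: the function names <Literal remap="tt">ext2_block_size</Literal> and <Literal remap="tt">ext2_superblock_block</Literal> are hypothetical, and the second function simply encodes the interpretation given above, namely that <Literal remap="tt">s_first_data_block</Literal> is the device block which contains offset 1024. </Para> <Para> <ProgramListing><![CDATA[
#include <stdint.h>

/*
 * Block size in bytes: 1024 * 2^s_log_block_size.
 * Returns 0 for a value outside the range 0..2, since the text
 * only allows block sizes of 1024, 2048 and 4096 bytes.
 */
uint32_t ext2_block_size(uint32_t s_log_block_size)
{
    if (s_log_block_size > 2)
        return 0;
    return (uint32_t)1024 << s_log_block_size;
}

/*
 * Number of the device block which contains byte offset 1024,
 * that is, the block holding the main superblock.
 * block_size must be non-zero (use a value from ext2_block_size).
 */
uint32_t ext2_superblock_block(uint32_t block_size)
{
    return 1024 / block_size;
}
]]></ProgramListing> </Para> <Para> With a block size of 1024 the second function yields block 1, and with a block size of 4096 it yields block 0, which matches the two cases discussed for <Literal remap="tt">s_first_data_block</Literal>.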
</Para> </Sect2> <Sect2> <Title>Ext2fs error handling</Title> <Para> The ext2 filesystem error handling is based on the following philosophy: <OrderedList> <ListItem> <Para> Identification of problems is done by the kernel code. </Para> </ListItem> <ListItem> <Para> The correction task is left to an external utility, such as <Literal remap="tt">e2fsck by Theodore Ts'o</Literal> for <Literal remap="tt">automatic</Literal> analysis and correction, or perhaps <Literal remap="tt">debugfs by Theodore Ts'o</Literal> and <Literal remap="tt">EXT2ED by myself</Literal>, for <Literal remap="tt">hand</Literal> analysis and correction. </Para> </ListItem> </OrderedList> </Para> <Para> The <Literal remap="tt">s_state</Literal> variable is used by the kernel to pass the identification result to third-party utilities: <ItemizedList> <ListItem> <Para> <Literal remap="tt">bit 0</Literal> of s_state is cleared when the partition is mounted and set when the partition is unmounted. Thus, a value of 0 on an unmounted filesystem means that the filesystem was not unmounted properly - The filesystem is not "clean" and probably contains errors. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">bit 1</Literal> of s_state is set by the kernel when it detects an error in the filesystem. A value of 0 doesn't mean that there isn't an error in the filesystem, just that the kernel didn't find any. </Para> </ListItem> </ItemizedList> </Para> <Para> The kernel behavior when an error is found is determined by the user tunable parameter <Literal remap="tt">s_errors</Literal>: <ItemizedList> <ListItem> <Para> The kernel will ignore the error and continue if <Literal remap="tt">s_errors=1</Literal>. </Para> </ListItem> <ListItem> <Para> The kernel will remount the filesystem in read-only mode if <Literal remap="tt">s_errors=2</Literal>. </Para> </ListItem> <ListItem> <Para> A kernel panic will be issued if <Literal remap="tt">s_errors=3</Literal>.
</Para> </ListItem> </ItemizedList> </Para> <Para> The default behavior is to ignore the error. </Para> </Sect2> <Sect2> <Title>Additional parameters used by e2fsck</Title> <Para> Of course, <Literal remap="tt">e2fsck</Literal> will check the filesystem if errors were detected or if the filesystem is not clean. </Para> <Para> In addition, each time the filesystem is mounted, <Literal remap="tt">s_mnt_count</Literal> is incremented. When s_mnt_count reaches <Literal remap="tt">s_max_mnt_count</Literal>, <Literal remap="tt">e2fsck</Literal> will force a check on the filesystem even though it may be clean. It will then zero s_mnt_count. <Literal remap="tt">s_max_mnt_count</Literal> is a tunable parameter. </Para> <Para> E2fsck also records the last time at which the filesystem was checked in the <Literal remap="tt">s_lastcheck</Literal> variable. The user tunable parameter <Literal remap="tt">s_checkinterval</Literal> contains the number of seconds which are allowed to pass since <Literal remap="tt">s_lastcheck</Literal> before a check is forced again. A value of <Literal remap="tt">0</Literal> disables time-based checks. </Para> </Sect2> <Sect2> <Title>Additional user tunable parameters</Title> <Para> <Literal remap="tt">s_r_blocks_count</Literal> contains the number of disk blocks which are reserved for root, the user whose id number is <Literal remap="tt">s_def_resuid</Literal> and the group whose id number is <Literal remap="tt">s_def_resgid</Literal>. The kernel will refuse to allocate those last <Literal remap="tt">s_r_blocks_count</Literal> blocks if the user is not one of the above. This is done so that the filesystem will usually not be 100% full, since 100% full filesystems can affect various aspects of operation. </Para> <Para> <Literal remap="tt">s_def_resuid</Literal> and <Literal remap="tt">s_def_resgid</Literal> contain the id of the user and of the group who can use the reserved blocks in addition to root.
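</Para> <Para> The reservation rule described above can be sketched in C as follows. This is a simplified model and not the kernel code: the structure <Literal remap="tt">ext2_reserve_info</Literal> and the function <Literal remap="tt">ext2_may_allocate</Literal> are hypothetical names, root is assumed to be the user with id 0, and the kernel's actual test may differ in its details, for instance in how group membership is checked. </Para> <Para> <ProgramListing><![CDATA[
#include <stdint.h>

/* The superblock fields used by the reservation rule (hypothetical subset). */
struct ext2_reserve_info {
    uint32_t s_free_blocks_count;   /* current number of free blocks */
    uint32_t s_r_blocks_count;      /* number of reserved blocks */
    uint16_t s_def_resuid;          /* user allowed to use reserved blocks */
    uint16_t s_def_resgid;          /* group allowed to use reserved blocks */
};

/*
 * Returns 1 if a process with the given user and group id may allocate
 * one more block, and 0 otherwise. Assumption: uid 0 is root.
 */
int ext2_may_allocate(const struct ext2_reserve_info *sb,
                      uint16_t uid, uint16_t gid)
{
    if (sb->s_free_blocks_count == 0)
        return 0;   /* the filesystem is completely full */
    if (sb->s_free_blocks_count > sb->s_r_blocks_count)
        return 1;   /* blocks outside the reserved pool remain */
    /* Only the reserved blocks are left. */
    return uid == 0 || uid == sb->s_def_resuid || gid == sb->s_def_resgid;
}
]]></ProgramListing> </Para> <Para> In this model an ordinary user starts to be refused as soon as the number of free blocks drops to <Literal remap="tt">s_r_blocks_count</Literal>, while root and the designated user and group can continue until no free block is left.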
</Para> </Sect2> <Sect2> <Title>Filesystem current state</Title> <Para> <Literal remap="tt">s_free_blocks_count</Literal> contains the current number of free blocks in the filesystem. </Para> <Para> <Literal remap="tt">s_free_inodes_count</Literal> contains the current number of free inodes in the filesystem. </Para> <Para> <Literal remap="tt">s_mtime</Literal> contains the time at which the filesystem was last mounted. </Para> <Para> <Literal remap="tt">s_wtime</Literal> contains the last time at which something was changed in the filesystem. </Para> </Sect2> </Sect1> <Sect1> <Title>Copyright</Title> <Para> This document contains source code which was taken from the Linux ext2 kernel source code, mainly from <FILENAME>/usr/include/linux/ext2_fs.h</FILENAME>. The original copyright follows: </Para> <Para> <ProgramListing>
/*
 *  linux/include/linux/ext2_fs.h
 *
 * Copyright (C) 1992, 1993, 1994, 1995
 * Remy Card (card@masi.ibp.fr)
 * Laboratoire MASI - Institut Blaise Pascal
 * Universite Pierre et Marie Curie (Paris VI)
 *
 *  from
 *
 *  linux/include/linux/minix_fs.h
 *
 *  Copyright (C) 1991, 1992  Linus Torvalds
 */
</ProgramListing> </Para> </Sect1> <Sect1> <Title>Acknowledgments</Title> <Para> I would like to thank the following people, who were involved in the design and implementation of the ext2 filesystem kernel code and support utilities: <ItemizedList> <ListItem> <Para> <Literal remap="tt">Remy Card</Literal>, who designed, implemented and maintains the ext2 filesystem kernel code, and some of the ext2 utilities. <Literal remap="tt">Remy Card</Literal> is also the author of several helpful slides concerning the ext2 filesystem. Specifically, he is the author of <Literal remap="tt">File Management in the Linux Kernel</Literal> and of <Literal remap="tt">The Second Extended File System - Current State, Future Development</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">Wayne Davison</Literal>, who designed the ext2 filesystem.
</Para> </ListItem> <ListItem> <Para> <Literal remap="tt">Stephen Tweedie</Literal>, who helped design the ext2 filesystem kernel code and wrote the slides <Literal remap="tt">Optimizations in File Systems</Literal>. </Para> </ListItem> <ListItem> <Para> <Literal remap="tt">Theodore Ts'o</Literal>, who is the author of several ext2 utilities and of the ext2 library <Literal remap="tt">libext2fs</Literal> (which I didn't use, simply because I didn't know it existed when I started to work on my project). </Para> </ListItem> </ItemizedList> </Para> <Para> Lastly, I would like to thank, of course, <Literal remap="tt">Linus Torvalds</Literal> and the <Literal remap="tt">Linux community</Literal> for providing all of us with such a great operating system. </Para> <Para> Please contact me with error reports, suggestions, or just about anything concerning this document. </Para> <Para> Enjoy, </Para> <Para> Gadi Oxman &lt;tgud@tochnapc2.technion.ac.il&gt; </Para> <Para> Haifa, August 95 </Para> </Sect1> </Article>