Note: There’s a limited number of pointers (e.g., 15 above)
What to do when a file is bigger than 15 blocks?
→ Indirection - Multi-level index
Multi-level indexing
One of the data pointers points to a data-block which contains another pointer table (an indirect block)
So if we have 15 pointers in an inode and repurpose one as the indirect pointer: each pointer is 4 bytes, so an indirect block holds 4K / 4 = 1024 pointers, giving 14 + 1024 = 1038 pointers in total, which means our files can reach 1038 * 4K = 4152K (roughly 4M)
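The arithmetic can be double-checked with a few lines (same 4K-block / 4-byte-pointer assumptions as above):

```python
BLOCK_SIZE = 4096                          # 4K blocks
PTR_SIZE = 4                               # 4-byte block pointers
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE    # 1024 pointers per indirect block

direct = 14                                # 15 inode pointers, one repurposed
total_ptrs = direct + PTRS_PER_BLOCK       # 14 + 1024 = 1038
max_file_bytes = total_ptrs * BLOCK_SIZE   # 1038 * 4K

print(total_ptrs, max_file_bytes // 1024)  # 1038 4152
```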
What if we need bigger files?
→ Double indirection
Add a pointer to a block that contains pointers to blocks with pointers
we get 13 + 1024 = 1037 pointers from direct and single indirection (one more inode pointer now serves as the double-indirect pointer), plus an additional 1024 * 1024 pointers
More needed?
→ Triple indirection
This approach is used by ext2, ext3, the original UNIX fs, etc.
Note that we get a very imbalanced tree
But this is okay - most files are small!
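The capacities per level of indirection can be sketched as below; the split of the 15 pointers into 12 direct + 1 single + 1 double + 1 triple follows ext2 and is an assumption here:

```python
BLOCK = 4096
P = BLOCK // 4                  # 1024 pointers per indirect block

direct = 12                     # ext2-style split of the 15 pointers (assumption)
single = P                      # one single-indirect pointer
double = P * P                  # one double-indirect pointer
triple = P * P * P              # one triple-indirect pointer

max_blocks = direct + single + double + triple
max_bytes = max_blocks * BLOCK  # a bit over 4 TB
print(max_blocks)               # 1074791436
```

Note how skewed the tree is: the triple-indirect subtree holds over a thousand times more blocks than everything else combined, yet small files never pay for it.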
Extents
If a file is stored as a set of contiguous blocks, it seems wasteful to point to each separately
Some FSs (e.g., ext4) use “extents” - a pointer together with a size in blocks
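A minimal sketch of the idea, using a hypothetical Extent record (field names are illustrative, not ext4's on-disk format):

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """One (start, length) pair covers a whole run of contiguous blocks."""
    start_block: int   # first physical block of the run
    length: int        # number of contiguous blocks

# A 1000-block contiguous file needs 1000 pointers in a pointer-based
# scheme, but a single extent here:
file_extents = [Extent(start_block=5000, length=1000)]

def blocks_of(extents):
    """Expand extents back into the individual block numbers they cover."""
    return [e.start_block + i for e in extents for i in range(e.length)]

assert len(blocks_of(file_extents)) == 1000
```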
Directories
A directory also has an inode (file type = directory)
Data blocks contain file entries: an i-number + the file name + length of the name
The root directory (/) has a fixed, well-known i-number (e.g., 2 in ext2)
Example:
inum | reclen | strlen | name
-----+--------+--------+-----------------------------
  5  |   12   |   2    | .
  2  |   12   |   3    | ..
 12  |   12   |   4    | foo
 13  |   12   |   4    | bar
 24  |   36   |  28    | foobar_is_a_pretty_longname
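The table above can be read as a list of (inum, reclen, strlen, name) records; a sketch of directory lookup over it (a simplified in-memory view, not the on-disk byte layout):

```python
# A directory's data block is just a packed list of entries.
entries = [
    # (inum, reclen, strlen, name) -- strlen includes the NUL terminator
    (5,  12,  2, "."),
    (2,  12,  3, ".."),
    (12, 12,  4, "foo"),
    (13, 12,  4, "bar"),
    (24, 36, 28, "foobar_is_a_pretty_longname"),
]

def lookup(entries, name):
    """Name -> i-number resolution, the core step of path traversal."""
    for inum, _reclen, _strlen, entry_name in entries:
        if entry_name == name:
            return inum
    return None

assert lookup(entries, "foo") == 12
```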
Performance
With our FS, what is the number of I/Os when accessing a file?
It depends on the length of the path (at least two reads per directory)
For write/create operations, bitmaps and inodes also need to be modified
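As a worked example, a rough count of the reads needed to open a path, under the simplifying assumptions that nothing is cached and each directory fits in one data block:

```python
def reads_to_open(path):
    """Rough I/O count for open(path): each directory on the way costs
    an inode read plus one data-block read; the target file costs one
    final inode read. Ignores caching; assumes one block per directory."""
    parts = [p for p in path.split("/") if p]
    # Directories traversed: the root plus every component except the last,
    # which is the same count as the number of components.
    ndirs = len(parts)
    return 2 * ndirs + 1

print(reads_to_open("/foo/bar"))  # 5: two directories * 2 reads + bar's inode
```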
→ Caching
Most file systems use main memory as a cache to store frequently accessed blocks
Cache for reads: can prevent most I/Os
Cache for writes:
Impairs reliability (buffered writes are lost on a crash)
Most FS cache writes between 5 and 30 seconds
Better I/O scheduling
Merge writes (e.g., for the bitmaps)
Consistency
What happens if a computer/disk loses power while writing data?
Let’s say that we are trying to update a file (the name is not important) that currently occupies a single data block
The update involves adding another 20 bytes of data
This means we have to add another block
What needs to be updated?
a block needs to be allocated (block bitmap)
the inode has to be updated (size + block2)
the data block needs to be written
The power may be cut during any of these tasks
This gives us a few possible crash scenarios depending on whether just one task was completed or two
One task completed:
A block was allocated
No inode tracks it -> space leak
Data lost, inconsistency
The inode was updated
Inode points to the disk address where data was about to be written
But there’s just garbage, or contents of a previous file
Even worse: Block could be allocated twice!!
Data lost, inconsistency
The data block was written
No inode tracks it, it’s not marked as allocated
Data lost (as if it was never written), but no inconsistency
Two tasks completed:
Inode and bitmap: yes / Data: no
Data lost, no inconsistency
Inode points to garbage
Inode and data block: yes / Bitmap: no
Inode points to correct data
But: data block may be allocated and overwritten via two different files
Inconsistency, high probability of data loss
Bitmap and data block: yes / Inode: no
No inode tracks the data block -> space leak
Inconsistency
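The six partial-crash scenarios above can be enumerated as a table in code (outcome strings paraphrase the notes):

```python
from itertools import combinations

# Completed writes -> outcome; "bitmap", "inode", "data" are the three
# writes of the append operation.
outcomes = {
    frozenset({"bitmap"}):          "space leak; data lost; inconsistent",
    frozenset({"inode"}):           "inode points to garbage; block may be allocated twice; inconsistent",
    frozenset({"data"}):            "data lost, but no inconsistency (as if never written)",
    frozenset({"bitmap", "inode"}): "inode points to garbage; data lost; no inconsistency",
    frozenset({"inode", "data"}):   "block may be allocated again via another file; inconsistent",
    frozenset({"bitmap", "data"}):  "space leak (no inode tracks the block); inconsistent",
}

writes = ["bitmap", "inode", "data"]
partial = [frozenset(c) for k in (1, 2) for c in combinations(writes, k)]
assert all(p in outcomes for p in partial)  # every partial crash is covered
```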
There are generally two solutions people have come up with:
Let bad things happen and check afterwards (fsck)
Write-ahead Logging (journalling)
File System Checking (fsck)
Idea: Bad things happen, let’s see if we can fix them by checking the disk post-hoc
Usually, this means that a utility (fsck) is run regularly
Scans the whole disk, looking for inconsistencies
Bitmaps: trust inodes and mark blocks used by inodes as not free
Also, if an inode looks used, mark it as used
Inodes
Check for corruption - do they look okay?
E.g., is the size consistent with the number of blocks?
If the inode cannot be fixed, clear it and deallocate
Inode ref count
Check that there is a corresponding number of directory entries
If not, update the ref count
If ref count == 0, deallocate
Duplicates: No two inodes should point to the same block
Might delete a bad inode
Or could create a copy of the block so each inode has its own
Bad blocks: a pointer outside of the valid disk range is cleared
Directory checks: e.g., ensure “.” and “..” are the first entries and that each referenced inode is allocated
Problem: takes a long time for a big disk, since everything needs to be scanned
Might be too late to fix a problem
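The bitmap pass described above (“trust the inodes”) can be sketched as follows; representing inodes as a dict of block lists is a simplification for illustration:

```python
def rebuild_block_bitmap(inodes, nblocks):
    """fsck-style pass: a block is 'used' iff some live inode points at
    it. `inodes` maps i-number -> list of block numbers (an in-memory
    simplification, not the on-disk format)."""
    bitmap = [False] * nblocks
    for blocks in inodes.values():
        for b in blocks:
            if 0 <= b < nblocks:   # bad-block check: pointer on the disk?
                bitmap[b] = True
    return bitmap

# Two inodes are live; whatever the old bitmap said, only their blocks
# end up marked as used.
inodes = {2: [10, 11], 5: [12]}
bm = rebuild_block_bitmap(inodes, 16)
assert [i for i, used in enumerate(bm) if used] == [10, 11, 12]
```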
Journalling
Journal - an area of disk where the file system logs what it’s about to do
Transactions
Journal write: Write the contents of the transaction (containing TxB and the contents of the update) to the log; wait for these writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction is now committed.
Checkpoint: Write the contents of the update to their final locations within the file system.
Free: Some time later, mark the transaction free in the journal by updating the journal superblock
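The point of the ordering above is that recovery has a simple rule: replay a transaction only if its commit block (TxE) reached the log. A sketch, modeling the journal as a list of records (an assumption of this sketch, not the on-disk format):

```python
def recover(journal):
    """journal: records like ("TxB", tid), ("update", tid, blk),
    ("TxE", tid). Return the tids whose updates are safe to replay
    (checkpoint) after a crash."""
    return {rec[1] for rec in journal if rec[0] == "TxE"}

# Crash after the journal write but before the commit: nothing is replayed,
# and the file system simply skips the incomplete transaction.
assert recover([("TxB", 1), ("update", 1, 42)]) == set()

# Crash after the commit: transaction 1 is replayed at recovery.
assert recover([("TxB", 1), ("update", 1, 42), ("TxE", 1)]) == {1}
```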
Metadata Journalling
Data write: Write data to its final location; wait for completion (the wait is optional).
Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
Free: Later, mark the transaction free in the journal superblock