Note: There’s a limited number of pointers (e.g., 15 above)
What to do when a file is bigger than 15 blocks?
→ Indirection - Multi-level index
Multi-level indexing
One of the data pointers points to a data-block which contains another pointer table (an indirect block)
So if we have 15 pointers in an inode and repurpose one as the indirect pointer: each pointer is 4 bytes, so an indirect block holds 4K / 4 = 1024 pointers, giving 14 + 1024 = 1038 pointers in total, which means our files can reach 1038 * 4K = 4152K (roughly 4M)
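The arithmetic can be double-checked with a few lines (same 4K-block / 4-byte-pointer assumptions as above):

```python
BLOCK_SIZE = 4096                          # 4K blocks
PTR_SIZE = 4                               # 4-byte block pointers
PTRS_PER_BLOCK = BLOCK_SIZE // PTR_SIZE    # 1024 pointers per indirect block

direct = 14                                # 15 inode pointers, one repurposed
total_ptrs = direct + PTRS_PER_BLOCK       # 14 + 1024 = 1038
max_file_bytes = total_ptrs * BLOCK_SIZE   # 1038 * 4K

print(total_ptrs, max_file_bytes // 1024)  # 1038 4152
```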
What if we need bigger files?
→ Double indirection
Add a pointer to a block that contains pointers to blocks with pointers
we get 13 + 1024 = 1037 pointers from direct and single indirection (one more inode pointer now serves as the double-indirect pointer), plus an additional 1024 * 1024 pointers
More needed?
→ Triple indirection
This approach is used by ext2, ext3, the original UNIX fs, etc.
Note that we get a very imbalanced tree
But this is okay - most files are small!
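The capacities per level of indirection can be sketched as below; the split of the 15 pointers into 12 direct + 1 single + 1 double + 1 triple follows ext2 and is an assumption here:

```python
BLOCK = 4096
P = BLOCK // 4                  # 1024 pointers per indirect block

direct = 12                     # ext2-style split of the 15 pointers (assumption)
single = P                      # one single-indirect pointer
double = P * P                  # one double-indirect pointer
triple = P * P * P              # one triple-indirect pointer

max_blocks = direct + single + double + triple
max_bytes = max_blocks * BLOCK  # a bit over 4 TB
print(max_blocks)               # 1074791436
```

Note how skewed the tree is: the triple-indirect subtree holds over a thousand times more blocks than everything else combined, yet small files never pay for it.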
Extents
If a file is stored as a set of contiguous blocks, it seems wasteful to point to each separately
Some FSs (e.g., ext4) use “extents” - a pointer together with a size in blocks
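A minimal sketch of the idea, using a hypothetical Extent record (field names are illustrative, not ext4's on-disk format):

```python
from dataclasses import dataclass

@dataclass
class Extent:
    """One (start, length) pair covers a whole run of contiguous blocks."""
    start_block: int   # first physical block of the run
    length: int        # number of contiguous blocks

# A 1000-block contiguous file needs 1000 pointers in a pointer-based
# scheme, but a single extent here:
file_extents = [Extent(start_block=5000, length=1000)]

def blocks_of(extents):
    """Expand extents back into the individual block numbers they cover."""
    return [e.start_block + i for e in extents for i in range(e.length)]

assert len(blocks_of(file_extents)) == 1000
```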
Directories
A directory also has an inode (file type = directory)
Data blocks contain file entries: an i-number + the file name + length of the name
The root directory (/) has a fixed, well-known i-number (e.g., 2 in ext2)
Example:
inum | reclen | strlen | name
-----+--------+--------+-----------------------------
  5  |   12   |   2    | .
  2  |   12   |   3    | ..
 12  |   12   |   4    | foo
 13  |   12   |   4    | bar
 24  |   36   |  28    | foobar_is_a_pretty_longname
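The table above can be read as a list of (inum, reclen, strlen, name) records; a sketch of directory lookup over it (a simplified in-memory view, not the on-disk byte layout):

```python
# A directory's data block is just a packed list of entries.
entries = [
    # (inum, reclen, strlen, name) -- strlen includes the NUL terminator
    (5,  12,  2, "."),
    (2,  12,  3, ".."),
    (12, 12,  4, "foo"),
    (13, 12,  4, "bar"),
    (24, 36, 28, "foobar_is_a_pretty_longname"),
]

def lookup(entries, name):
    """Name -> i-number resolution, the core step of path traversal."""
    for inum, _reclen, _strlen, entry_name in entries:
        if entry_name == name:
            return inum
    return None

assert lookup(entries, "foo") == 12
```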
Performance
With our FS, what is the number of I/Os when accessing a file?
It depends on the length of the path (at least two reads per directory)
For write/create operations, bitmaps and inodes also need to be modified
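As a worked example, a rough count of the reads needed to open a path, under the simplifying assumptions that nothing is cached and each directory fits in one data block:

```python
def reads_to_open(path):
    """Rough I/O count for open(path): each directory on the way costs
    an inode read plus one data-block read; the target file costs one
    final inode read. Ignores caching; assumes one block per directory."""
    parts = [p for p in path.split("/") if p]
    # Directories traversed: the root plus every component except the last,
    # which is the same count as the number of components.
    ndirs = len(parts)
    return 2 * ndirs + 1

print(reads_to_open("/foo/bar"))  # 5: two directories * 2 reads + bar's inode
```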
→ Caching
Most file systems use main memory as a cache to store frequently accessed blocks
Cache for reads: can prevent most I/Os
Cache for writes:
Impairs reliability (buffered writes are lost on a crash)
Most FS cache writes between 5 and 30 seconds
Better I/O scheduling
Merge writes (e.g., for the bitmaps)
Consistency
What happens if a computer/disk loses power while writing data?
Let’s say that we are trying to update a file (the name is not important) that currently occupies a single data block
The update involves adding another 20 bytes of data
This means we have to add another block
What needs to be updated?
a block needs to be allocated (block bitmap)
the inode has to be updated (size + block2)
the data block needs to be written
The power may be cut during any of these tasks
This gives us a few possible crash scenarios depending on whether just one task was completed or two
One task completed:
A block was allocated
No inode tracks it -> space leak
Data lost, inconsistency
The inode was updated
Inode points to the disk address where data was about to be written
But there’s just garbage, or contents of a previous file
Even worse: Block could be allocated twice!!
Data lost, inconsistency
The data block was written
No inode tracks it, it’s not marked as allocated
Data lost (as if it was never written), but no inconsistency
Two tasks completed:
Inode and bitmap: yes / Data: no
Data lost, no inconsistency
Inode points to garbage
Inode and data block: yes / Bitmap: no
Inode points to correct data
But: data block may be allocated and overwritten via two different files
Inconsistency, high probability of data loss
Bitmap and data block: yes / Inode: no
No inode tracks the data block -> space leak
Inconsistency
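The six partial-crash scenarios above can be enumerated as a table in code (outcome strings paraphrase the notes):

```python
from itertools import combinations

# Completed writes -> outcome; "bitmap", "inode", "data" are the three
# writes of the append operation.
outcomes = {
    frozenset({"bitmap"}):          "space leak; data lost; inconsistent",
    frozenset({"inode"}):           "inode points to garbage; block may be allocated twice; inconsistent",
    frozenset({"data"}):            "data lost, but no inconsistency (as if never written)",
    frozenset({"bitmap", "inode"}): "inode points to garbage; data lost; no inconsistency",
    frozenset({"inode", "data"}):   "block may be allocated again via another file; inconsistent",
    frozenset({"bitmap", "data"}):  "space leak (no inode tracks the block); inconsistent",
}

writes = ["bitmap", "inode", "data"]
partial = [frozenset(c) for k in (1, 2) for c in combinations(writes, k)]
assert all(p in outcomes for p in partial)  # every partial crash is covered
```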
There are generally two solutions people have come up with:
Let bad things happen and check afterwards (fsck)
Write-ahead Logging (journalling)
File System Checking (fsck)
Idea: Bad things happen, let’s see if we can fix them by checking the disk post-hoc
Usually, this means that a utility (fsck) is run regularly
Scans the whole disk, looking for inconsistencies
Bitmaps: trust inodes and mark blocks used by inodes as not free
Also, if an inode looks used, mark it as used
Inodes
Check for corruption - do they look okay?
E.g., is the size consistent with the number of blocks?
If the inode cannot be fixed, clear it and deallocate
Inode ref count
Check that there is a corresponding number of directory entries
If not, update the ref count
If ref count == 0, deallocate
Duplicates: No two inodes should point to the same block
Might delete a bad inode
Or could create a copy of the block so each inode has its own
Bad blocks: a pointer outside of the valid disk range is cleared
Directory checks: e.g., ensure “.” and “..” are the first entries and that each referenced inode is allocated
Problem: takes a long time for a big disk, since everything needs to be scanned
Might be too late to fix a problem
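The bitmap pass described above (“trust the inodes”) can be sketched as follows; representing inodes as a dict of block lists is a simplification for illustration:

```python
def rebuild_block_bitmap(inodes, nblocks):
    """fsck-style pass: a block is 'used' iff some live inode points at
    it. `inodes` maps i-number -> list of block numbers (an in-memory
    simplification, not the on-disk format)."""
    bitmap = [False] * nblocks
    for blocks in inodes.values():
        for b in blocks:
            if 0 <= b < nblocks:   # bad-block check: pointer on the disk?
                bitmap[b] = True
    return bitmap

# Two inodes are live; whatever the old bitmap said, only their blocks
# end up marked as used.
inodes = {2: [10, 11], 5: [12]}
bm = rebuild_block_bitmap(inodes, 16)
assert [i for i, used in enumerate(bm) if used] == [10, 11, 12]
```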
Journalling
Journal - an area of disk where the file system logs what it’s about to do
Transactions
Journal write: Write the contents of the transaction (containing TxB and the contents of the update) to the log; wait for these writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction is now committed.
Checkpoint: Write the contents of the update to their final locations within the file system.
Free: Some time later, mark the transaction free in the journal by updating the journal superblock
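The point of the ordering above is that recovery has a simple rule: replay a transaction only if its commit block (TxE) reached the log. A sketch, modeling the journal as a list of records (an assumption of this sketch, not the on-disk format):

```python
def recover(journal):
    """journal: records like ("TxB", tid), ("update", tid, blk),
    ("TxE", tid). Return the tids whose updates are safe to replay
    (checkpoint) after a crash."""
    return {rec[1] for rec in journal if rec[0] == "TxE"}

# Crash after the journal write but before the commit: nothing is replayed,
# and the file system simply skips the incomplete transaction.
assert recover([("TxB", 1), ("update", 1, 42)]) == set()

# Crash after the commit: transaction 1 is replayed at recovery.
assert recover([("TxB", 1), ("update", 1, 42), ("TxE", 1)]) == {1}
```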
Metadata Journalling
Data write: Write data to its final location; wait for completion (the wait is optional).
Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
Free: Later, mark the transaction free in the journal superblock