The Design and Implementation of a Log-Structured File System
Rosenblum and Ousterhout

One line summary: improve write performance by buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in one large write.

File system design:
* Technology trends: CPU speed is improving at a much faster rate than disk speed, and disk transfer bandwidth is improving faster than disk access time. Main memories are now large enough to make big file caches possible. Together these trends mean disk traffic will be dominated by writes, since reads are increasingly absorbed by the cache.
* Workloads: office and engineering workloads are dominated by small files, which produce small random disk I/Os. The creation and deletion times for these files are often dominated by updates to file system metadata.

What's wrong with the other file systems? (FFS)
* Lots of small writes scattered across the disk.
* Synchronous writes for file system metadata structures (e.g. directories and inodes) to simplify crash recovery; these defeat the file cache and tie application performance to disk access time.
* Each small-file operation results in many separate disk I/Os (to the file's inode and data, and to the directory's inode and data).

Basic idea of LFS:
* Write all information (data, attributes, index blocks, directories, etc.) to a log. Writes become large, asynchronous, sequential transfers that can utilize nearly 100% of the raw disk bandwidth.
* File retrieval: the same as in Unix once the file's inode is found. In FFS, inodes sit at fixed locations; in LFS, inodes are written to the log whenever they are modified, so LFS maintains an inode map that tracks the current location of each inode. The inode map blocks are themselves written to the log, and a fixed checkpoint region on disk records where they are.
* Free space management. Two simple approaches, and the combination LFS actually uses:
  o Compact live data to squeeze out the dead space. Problem: long-lived data gets copied over and over again.
  o Thread the log through the free space. Problem: free space becomes fragmented, large writes can no longer be contiguous, and LFS's performance degenerates to that of a traditional file system.
  o Combination: a segmented log. Divide the disk into large, fixed-size segments, chosen large enough that the time to read or write a whole segment is much greater than the cost of a seek (so the full disk bandwidth is still utilized). Compaction happens within segments, and the log is threaded segment by segment.

Cleaning: clean segments periodically: read in a number of segments, identify the live data, and write it back to a smaller number of clean segments, leaving the rest free for new data.
* Cleaning cost rises with the utilization of the segment being cleaned, because more of its data must be read and rewritten.
* Want a bimodal distribution: most segments nearly full, a few empty or nearly empty, so the cleaner almost always works with the nearly empty ones.
* Greedy policy (the uniform model): always clean the least-utilized segments.
* Hot-and-cold model: same selection policy as greedy, but write the live data back sorted by age, so that cold (long-lived) data is grouped together and separated from hot (short-lived) data.
* Surprise: hot-and-cold performed worse than greedy! Why? A cold segment's utilization drops very slowly, so it ties up free blocks for a long time before it becomes cheap to clean.
* Solution: a cost-benefit policy that allows cold segments to be cleaned at a much higher utilization than hot segments. Each segment is rated by benefit/cost = ((1 - u) * age) / (1 + u), where u is the segment's utilization and age (the time since its youngest block was written) predicts stability: the older the data, the "colder" the segment. See the sketch after this list.
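To make the cost-benefit policy concrete, here is a minimal C sketch of how a cleaner might pick its victim segment. The seg_usage structure and its fields are illustrative stand-ins, not Sprite LFS's actual segment usage table; only the rating formula comes from the paper.

    #include <stddef.h>

    /* One entry of a (hypothetical) per-segment usage table. */
    struct seg_usage {
        double util;   /* fraction of live bytes in the segment, 0.0..1.0     */
        double age;    /* time since the segment's youngest block was written */
    };

    /* The paper's rating: cleaning reads the whole segment (cost 1) and
     * writes the live fraction back (cost u), freeing (1 - u) of a segment
     * whose data is expected to stay stable for roughly its age. */
    static double cost_benefit(const struct seg_usage *s)
    {
        return (1.0 - s->util) * s->age / (1.0 + s->util);
    }

    /* Clean the segment with the highest benefit-to-cost ratio. */
    size_t pick_victim(const struct seg_usage *tab, size_t nsegs)
    {
        size_t best = 0;
        for (size_t i = 1; i < nsegs; i++)
            if (cost_benefit(&tab[i]) > cost_benefit(&tab[best]))
                best = i;
        return best;
    }

Greedy is the special case that ignores age: ranking by (1 - u) / (1 + u) alone orders segments exactly by lowest utilization. The age factor is what lets a cold segment sit until its utilization is much higher before it becomes the best candidate for cleaning.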
Crash recovery: easy in LFS. FFS must scan all the metadata on disk to reconstruct a consistent state; LFS just reads the last checkpoint and rolls forward through the portion of the log written after it.
* Checkpoint: LFS periodically writes a checkpoint recording a consistent and complete picture of the file system state in the log (including the locations of the inode map blocks and the current end of the log).
* Roll forward: scanning the log past the checkpoint lets LFS recover data written after the checkpoint instead of discarding it. (A sketch of this recovery loop appears at the end of these notes.)

Performance:
* LFS is faster than Unix FFS except when files are read sequentially after being written randomly, because their blocks end up scattered through the log.
* Although LFS concentrates on speeding up small writes, it also performs well on large file accesses, since it can utilize the disk bandwidth more fully.
* Recovery takes seconds, versus the minutes to hours of traditional file systems.

Lesson: optimize the critical path (again).
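A minimal, self-contained C sketch of the checkpoint-plus-roll-forward idea, using a toy in-memory log: the record layout, sequence numbers, and inode map below are illustrative inventions, not Sprite LFS's on-disk structures (the real system uses segment summary blocks and two alternating checkpoint regions).

    #include <stdio.h>

    #define NINODES 8

    /* One (hypothetical) log record: an inode was rewritten at a new address. */
    struct log_rec {
        long seqno;   /* position of the record in the log */
        int  inum;    /* inode number that was updated     */
        long addr;    /* new disk address of that inode    */
    };

    /* Inode map: inode number -> disk address of its latest inode. */
    static long imap[NINODES];

    static void recover(const long chk_imap[NINODES], long chk_seqno,
                        const struct log_rec *log, int nrecs)
    {
        /* 1. The checkpoint gives a consistent snapshot of the inode map. */
        for (int i = 0; i < NINODES; i++)
            imap[i] = chk_imap[i];

        /* 2. Roll forward: re-apply only records newer than the checkpoint,
         *    so data written after the checkpoint is not lost. */
        for (int r = 0; r < nrecs; r++)
            if (log[r].seqno > chk_seqno)
                imap[log[r].inum] = log[r].addr;
    }

    int main(void)
    {
        long chk[NINODES] = {100, 101, 102, 103, 104, 105, 106, 107};
        struct log_rec log[] = {
            {40, 2, 102},   /* already reflected in the checkpoint */
            {51, 2, 230},   /* written after checkpoint seqno 50   */
            {52, 5, 231},
        };
        recover(chk, 50, log, 3);
        printf("inode 2 -> %ld, inode 5 -> %ld\n", imap[2], imap[5]);
        return 0;
    }

Real roll-forward does more than this (it also replays directory operation log records so directory entries and inode reference counts stay consistent), but the key point survives: recovery touches only the tail of the log, not the whole disk.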