The Design and Implementation of a Log-Structured File System
Rosenblum and Ousterhout

One line summary: improve write performance by buffering a sequence of file system changes in the file cache and then writing all the changes to disk sequentially in one large write.

File system design:
* Technology trends: CPU speed is improving at a much faster rate than disk speed, and disk transfer bandwidth is improving faster than disk access time. Main memories are now large enough to make big file caches possible. Together these trends mean disk traffic will be dominated by writes, since reads are increasingly absorbed by the cache.
* Workloads: office and engineering workloads are dominated by small files, which produce small random disk I/Os. The creation and deletion times for these files are often dominated by updates to file system metadata.

What's wrong with the other file systems? (FFS)
* Lots of small writes scattered across the disk.
* Synchronous writes for file system metadata structures (e.g. directories and inodes) to simplify crash recovery; these defeat the file cache and tie application performance to disk access time.
* Each small-file operation results in many separate disk I/Os (to the file's inode and data, and to the directory's inode and data).

Basic idea of LFS:
* Write all information (data, attributes, index blocks, directories, etc.) to a log. Writes become large, asynchronous, sequential transfers that can utilize nearly 100% of the raw disk bandwidth.
* File retrieval: the same as in Unix once the file's inode is found. In FFS, inodes sit at fixed locations; in LFS, inodes are written to the log whenever they are modified, so LFS maintains an inode map that tracks the current location of each inode. The inode map blocks are themselves written to the log, and a fixed checkpoint region on disk records where they are.
* Free space management. Two simple approaches, and the combination LFS actually uses:
  o Compact live data to squeeze out the dead space. Problem: long-lived data gets copied over and over again.
  o Thread the log through the free space. Problem: free space becomes fragmented, large writes can no longer be contiguous, and LFS's performance degenerates to that of a traditional file system.
  o Combination: a segmented log. Divide the disk into large, fixed-size segments, chosen large enough that the time to read or write a whole segment is much greater than the cost of a seek (so the full disk bandwidth is still utilized). Compaction happens within segments, and the log is threaded segment by segment.

Cleaning: clean segments periodically: read in a number of segments, identify the live data, and write it back to a smaller number of clean segments, leaving the rest free for new data.
* Cleaning cost rises with the utilization of the segment being cleaned, because more of its data must be read and rewritten.
* Want a bimodal distribution: most segments nearly full, a few empty or nearly empty, so the cleaner almost always works with the nearly empty ones.
* Greedy policy (the uniform model): always clean the least-utilized segments.
* Hot-and-cold model: same selection policy as greedy, but write the live data back sorted by age, so that cold (long-lived) data is grouped together and separated from hot (short-lived) data.
* Surprise: hot-and-cold performed worse than greedy! Why? A cold segment's utilization drops very slowly, so it ties up free blocks for a long time before it becomes cheap to clean.
* Solution: a cost-benefit policy that allows cold segments to be cleaned at a much higher utilization than hot segments. Each segment is rated by benefit/cost = ((1 - u) * age) / (1 + u), where u is the segment's utilization and age (the time since its youngest block was written) predicts stability: the older the data, the "colder" the segment. See the sketch after this list.
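To make the cost-benefit policy concrete, here is a minimal C sketch of how a cleaner might pick its victim segment. The seg_usage structure and its fields are illustrative stand-ins, not Sprite LFS's actual segment usage table; only the rating formula comes from the paper.

    #include <stddef.h>

    /* One entry of a (hypothetical) per-segment usage table. */
    struct seg_usage {
        double util;   /* fraction of live bytes in the segment, 0.0..1.0     */
        double age;    /* time since the segment's youngest block was written */
    };

    /* The paper's rating: cleaning reads the whole segment (cost 1) and
     * writes the live fraction back (cost u), freeing (1 - u) of a segment
     * whose data is expected to stay stable for roughly its age. */
    static double cost_benefit(const struct seg_usage *s)
    {
        return (1.0 - s->util) * s->age / (1.0 + s->util);
    }

    /* Clean the segment with the highest benefit-to-cost ratio. */
    size_t pick_victim(const struct seg_usage *tab, size_t nsegs)
    {
        size_t best = 0;
        for (size_t i = 1; i < nsegs; i++)
            if (cost_benefit(&tab[i]) > cost_benefit(&tab[best]))
                best = i;
        return best;
    }

Greedy is the special case that ignores age: ranking by (1 - u) / (1 + u) alone orders segments exactly by lowest utilization. The age factor is what lets a cold segment sit until its utilization is much higher before it becomes the best candidate for cleaning.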
Crash recovery: easy in LFS. FFS must scan all the metadata on disk to reconstruct a consistent state; LFS just reads the last checkpoint and rolls forward through the portion of the log written after it.
* Checkpoint: LFS periodically writes a checkpoint recording a consistent and complete picture of the file system state in the log (including the locations of the inode map blocks and the current end of the log).
* Roll forward: scanning the log past the checkpoint lets LFS recover data written after the checkpoint instead of discarding it. (A sketch of this recovery loop appears at the end of these notes.)

Performance:
* LFS is faster than Unix FFS except when files are read sequentially after being written randomly, because their blocks end up scattered through the log.
* Although LFS concentrates on speeding up small writes, it also performs well on large file accesses, since it can utilize the disk bandwidth more fully.
* Recovery takes seconds, versus the minutes to hours of traditional file systems.

Lesson: optimize the critical path (again).
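A minimal, self-contained C sketch of the checkpoint-plus-roll-forward idea, using a toy in-memory log: the record layout, sequence numbers, and inode map below are illustrative inventions, not Sprite LFS's on-disk structures (the real system uses segment summary blocks and two alternating checkpoint regions).

    #include <stdio.h>

    #define NINODES 8

    /* One (hypothetical) log record: an inode was rewritten at a new address. */
    struct log_rec {
        long seqno;   /* position of the record in the log */
        int  inum;    /* inode number that was updated     */
        long addr;    /* new disk address of that inode    */
    };

    /* Inode map: inode number -> disk address of its latest inode. */
    static long imap[NINODES];

    static void recover(const long chk_imap[NINODES], long chk_seqno,
                        const struct log_rec *log, int nrecs)
    {
        /* 1. The checkpoint gives a consistent snapshot of the inode map. */
        for (int i = 0; i < NINODES; i++)
            imap[i] = chk_imap[i];

        /* 2. Roll forward: re-apply only records newer than the checkpoint,
         *    so data written after the checkpoint is not lost. */
        for (int r = 0; r < nrecs; r++)
            if (log[r].seqno > chk_seqno)
                imap[log[r].inum] = log[r].addr;
    }

    int main(void)
    {
        long chk[NINODES] = {100, 101, 102, 103, 104, 105, 106, 107};
        struct log_rec log[] = {
            {40, 2, 102},   /* already reflected in the checkpoint */
            {51, 2, 230},   /* written after checkpoint seqno 50   */
            {52, 5, 231},
        };
        recover(chk, 50, log, 3);
        printf("inode 2 -> %ld, inode 5 -> %ld\n", imap[2], imap[5]);
        return 0;
    }

Real roll-forward does more than this (it also replays directory operation log records so directory entries and inode reference counts stay consistent), but the key point survives: recovery touches only the tail of the log, not the whole disk.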