The Zebra Striped Network File System
John H. Hartman and John K. Ousterhout

One-line summary: A distributed file system that performs RAID-like striping in software on a per-client basis, using LFS techniques.

Overview/Main Points

* Building blocks:
  o RAID - disk array; transfers are divided into striping units, each unit going to a different disk in the array. A set of consecutive striping units is a stripe, which includes a parity unit. Small writes are 4 times as expensive as in a disk array without parity (read old data, read old parity, write new data, write new parity). Memory and I/O bandwidth are also potential performance bottlenecks.
  o Network file systems with per-file striping - each file in its own set of stripes. Small files are either striped across all servers, in which case network and disk overhead dominates, or placed on a single server, in which case parity consumes as much space as the file itself. Data and parity writes must be done as a single atomic operation.
  o LFS - Zebra is LFS plus per-client striping: each client's log is striped across the storage servers.
* Zebra components:
  o Clients: contact the file manager to get block pointers, then contact the storage servers to get data. On writes, buffer data until a full fragment (512 KB!) has accumulated.
  o Storage servers: operate on stripe fragments, indexed by the opaque identifier (client identifier, stripe sequence number, fragment offset within stripe). Can store (synchronously), append to (atomically, so crashes are OK), retrieve, delete, or find the most recent fragment. (A sketch of the fragment identifier and these operations appears at the end of these notes.)
  o File manager: stores all file metadata (protection info, block pointers to file data, directories, symlinks, special files for I/O devices, ...). Does name lookup and cache consistency. Only stores block pointers, never file data. Implemented on Sprite LFS, so all of Sprite LFS's consistency, fault tolerance, etc. come for free.
  o Stripe cleaner: like the LFS segment cleaner. A user-level process. Uses the LFS cleaning policy.
* Operational details:
  o Deltas - describe changes to blocks in a file. Contain file ID, file version number (for delta ordering across logs during crash recovery), block number, old block pointer, and new block pointer. (Old and new pointers are used to detect race conditions between clients and the cleaner; see the delta sketch at the end of these notes.) Deltas are stored in the client's log.
  o Writing files - batch a fragment write plus an update delta, and increment the file version number. Fragments are transferred to all servers concurrently via asynchronous RPC; the client computes parity (see the parity sketch at the end of these notes) and handles stripe deltas. Parity writing is delayed for small files - a disk crash is OK as long as the client survives, since it still holds the parity.
  o Reading files - the file manager must know about all opens/closes for cache consistency. The client fetches block pointers, then fetches data - 2 RPC round trips. Prefetching of data for large files, and of whole files for small files (locality in file access is assumed).
  o Stripe cleaning - utilization is computed by the stripe cleaner by processing deltas from the client logs and appending all deltas that refer to a given stripe to a "stripe status file", which is used to identify live blocks without searching through all the logs. Cleaning is similar to a read+write of a block, but doesn't do a file open, doesn't do cache consistency, doesn't do user/kernel data copying, doesn't update modify times or version numbers, and generates a "cleaner delta" instead of an update delta.
  o File access/cleaning conflicts: optimistic approach. The cleaner does its work and issues a cleaner delta.
    The file manager looks at both cleaner deltas and update deltas, detects conflicts by comparing the old block pointer in each delta with its current metadata, and favours client update deltas over cleaner deltas, issuing a "reject delta" so the cleaner knows about the conflict. If a block is cleaned after a client fetches its metadata but before the client reads the block, the client gets an error and retries the metadata fetch.
  o Adding storage servers on the fly - need to keep track of which stripe group (how many storage servers) each file's stripes belong to. Cleaning migrates files to the larger stripe group over time.
* Consistency after crashes: three issues not present in LFS/Sprite.
  1. Internal stripe consistency - fragments in the process of being written may be missing or partial. Partial fragments are detected with a checksum. If a storage server misses fragments while it is down, parity plus the neighbouring fragments are used to recover them.
  2. Stripes vs. metadata - the file manager keeps track of its current position in each client's log and periodically checkpoints its metadata. After a file manager crash, all deltas written since the checkpoint are reprocessed. Version numbers give idempotency and an ordering of deltas across all clients' logs.
  3. Stripes vs. cleaner - if the stripe cleaner crashes, it needs to recover its state, so the cleaner's state is checkpointed as well.
* Performance:
  o 4-5x improvement for large reads/writes because of parallelism.
  o Negligible improvement for small reads/writes because clients contact a central point on each open/close. (Name caching would fix this; Zebra doesn't give concurrent write sharing anyway.)
  o Large file writes - with a single server, the disk is the bottleneck; at 4 servers, FDDI saturation stops linear scalability. Parity computation by the client is expensive - like doing N/(N-1) writes for N storage servers.
  o Large file reads - 2 servers saturate a single client (data copies between application, cache, and network).
  o File manager and client CPUs are the bottleneck for small writes because of the synchronous RPCs to open/close files.

Relevance

Ideas in Zebra were generalized to produce xFS. Distributed, reliable, parallel file system - one day could become the InternetFS?

Flaws

* The file manager as a centralization point is clearly flawed.
* Why must the file manager track all opens/closes if Zebra doesn't promise concurrent write-sharing consistency?
* Clients do much of the work that could be pushed into the infrastructure.
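The sketches referenced above follow; they are illustrative C written from these notes, not Zebra's actual interfaces.

First, the storage server's view of a fragment: the three-part identifier from the notes packed into a struct, and the five listed operations collected as a hypothetical dispatch table (the function names and signatures are assumptions).

    #include <stddef.h>
    #include <stdint.h>

    /* The three-part fragment identifier; servers treat it as opaque. */
    struct fragment_id {
        uint32_t client_id;    /* which client's log the fragment belongs to */
        uint32_t stripe_seq;   /* stripe sequence number within that log     */
        uint32_t frag_offset;  /* fragment's offset within the stripe        */
    };

    /* The five operations the notes list, as a hypothetical dispatch table. */
    struct storage_server_ops {
        int (*store)   (struct fragment_id id, const void *buf, size_t len); /* synchronous write     */
        int (*append)  (struct fragment_id id, const void *buf, size_t len); /* atomic across crashes */
        int (*retrieve)(struct fragment_id id, void *buf, size_t len);
        int (*remove)  (struct fragment_id id);
        int (*newest)  (uint32_t client_id, struct fragment_id *out);        /* most recent fragment  */
    };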
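Next, the client-side parity computation: a minimal sketch assuming whole fragments are resident in memory; compute_parity is my name for it, and the real write path ships fragments to the servers by asynchronous RPC rather than just filling a buffer.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define FRAGMENT_SIZE (512 * 1024)   /* one stripe fragment, per the notes */

    /* XOR the N-1 data fragments of a stripe into the parity fragment.
     * A fragment lost on a crashed server can be rebuilt later by XOR-ing
     * the parity with the surviving fragments ("parity + neighbours"). */
    void compute_parity(const uint8_t *const data[], size_t n_data,
                        uint8_t parity[FRAGMENT_SIZE])
    {
        memset(parity, 0, FRAGMENT_SIZE);
        for (size_t f = 0; f < n_data; f++)
            for (size_t i = 0; i < FRAGMENT_SIZE; i++)
                parity[i] ^= data[f][i];
    }

This is also where the N/(N-1) write overhead noted under Performance comes from: every N-1 data fragments written cost one extra parity fragment.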
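Finally, the delta record and the file manager's conflict check: a sketch assuming a block pointer packs into 64 bits, with apply_delta as a hypothetical name for the point where update and cleaner deltas meet the metadata. The rule from the notes is that a cleaner delta whose old block pointer no longer matches the metadata has lost the race and is answered with a reject delta, while client updates always win.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t block_ptr_t;   /* stands in for (fragment id, offset) */

    enum delta_kind { UPDATE_DELTA, CLEANER_DELTA, REJECT_DELTA };

    struct delta {
        enum delta_kind kind;
        uint32_t    file_id;
        uint32_t    version;    /* orders deltas across client logs on recovery */
        uint32_t    block_no;
        block_ptr_t old_ptr;    /* where the block used to live */
        block_ptr_t new_ptr;    /* where this log write put it  */
    };

    /* Returns true if the delta was applied. A cleaner delta whose old
     * pointer no longer matches the metadata raced with a client update
     * and loses; the file manager answers it with a reject delta. */
    bool apply_delta(block_ptr_t *metadata_ptr, const struct delta *d)
    {
        if (d->kind == CLEANER_DELTA && d->old_ptr != *metadata_ptr)
            return false;            /* conflict: caller issues a reject delta */
        *metadata_ptr = d->new_ptr;  /* client update deltas always win */
        return true;
    }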