Recovery Management in QuickSilver

Haskin, Malachi, Sawdon and Chan

One line summary:

Use atomic transactions as the mechanism for failure recovery in the
QuickSilver client-server structured distributed system.

Some recovery techniques:

   * Timeouts: clients set timeouts on their requests to servers. Problem:
     a timeout cannot distinguish a slow server from a crashed one, which
     can lead to inconsistencies in the system.
   * Connectionless protocols: servers are stateless and connectionless,
     and their operations are idempotent. Problems: some actions cannot be
     made idempotent; quitting in the middle of a request and retrying can
     also lead to inconsistencies.
   * Virtual circuits: failures are detected by the communication system
     employing connection-oriented protocols. Problem: cannot achieve
     multiserver atomicity because virtual circuits can fail independently.
   * Replication: e.g. the NonStop kernel. Problems: too expensive, and a
     transaction system is still needed underneath to provide recovery.
   * Transactions: the basic idea is that everything belongs to some
     transaction, and transactions are identified by globally unique
     transaction ids. A transaction has one owner process and any number
     of other participant processes. The owner may commit or abort the
     transaction, but a participant can only abort it.
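
The ownership rule in the last bullet can be sketched in a few lines. This is
only an illustration, not QuickSilver's interface: the class name, the method
names and the use of uuid as a stand-in for a globally unique transaction id
are all assumptions made for the example.

```python
import uuid


class Transaction:
    """Toy model: one owner, any number of participants, one global id."""

    def __init__(self, owner):
        self.tid = uuid.uuid4().hex  # stand-in for a globally unique id
        self.owner = owner
        self.participants = set()
        self.state = "active"

    def join(self, process):
        # A process becomes a participant while the transaction is active.
        if self.state != "active":
            raise RuntimeError("transaction already " + self.state)
        self.participants.add(process)

    def abort(self, process):
        # Both the owner and any participant may abort.
        if process != self.owner and process not in self.participants:
            raise PermissionError(f"{process} is not part of {self.tid}")
        if self.state == "active":
            self.state = "aborted"
        return self.state

    def commit(self, process):
        # Only the owner may ask for commit; an earlier abort wins.
        if process != self.owner:
            raise PermissionError("only the owner may commit")
        if self.state == "active":
            self.state = "committed"
        return self.state
```

In this sketch a participant that calls commit gets an error, while a
participant that calls abort decides the outcome: a later commit by the owner
simply returns "aborted".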

Transaction-based recovery manager:

   * There is one recovery manager per host. It has three components:
        o Transaction manager (TM): keeps track of the transactions that
          the processes on its host participate in. It also manages the
          commit protocol across the distributed hosts.
             + The TMs keep track of the topology of a transaction in a
               distributed way. For each transaction that has a
               participant server process on its host, a TM only needs to
               know its superior and subordinate TMs.
             + A TM can terminate or fail a transaction for several
               reasons, for example when the owner calls commit or abort,
               or when a permanent connection failure is detected.
               Participant servers can specify whether their own failure
               causes the transaction to fail or to terminate. This allows
               early resource reclamation for subordinates, while still
               allowing errors to be seen and reported by the superior.
             + Termination causes commit or abort processing to proceed
               immediately. A failure is only remembered; the transaction
               is aborted when it eventually terminates.
        o Log manager (LM): a common recovery log shared by the TM's commit
          log records and the servers' recovery data.
             + LM provides several optional log services. The server can
               tell its LM what services it needs, so that it won't be
               penalized by services it does not need, nor by services used
               by other servers.
             + Server decides what recovery strategy to use. The LM does not
               interpret the data.
        o Deadlock detector: (not implemented).
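
The difference between failing and terminating a transaction can be sketched
as follows. The class, its method names and the "fail" / "terminate" policy
strings are hypothetical; the sketch also assumes that a termination triggered
by a participant's death aborts the transaction, which these notes do not
state explicitly.

```python
class TransactionRecord:
    """One TM's view of a single transaction (all names are hypothetical)."""

    def __init__(self, tid):
        self.tid = tid
        self.failed = False  # a failure was seen but not yet acted upon
        self.outcome = None  # "committed" or "aborted" once terminated

    def report_failure(self):
        # A failure is only remembered; the transaction keeps running.
        if self.outcome is None:
            self.failed = True

    def terminate(self, want_commit):
        # Termination starts commit or abort processing right away.
        # A remembered failure turns a requested commit into an abort.
        if self.outcome is None:
            if want_commit and not self.failed:
                self.outcome = "committed"
            else:
                self.outcome = "aborted"
        return self.outcome

    def participant_died(self, policy):
        # policy is what the participant registered: "fail" or "terminate".
        # Assumption: termination caused by a death aborts the transaction.
        if policy == "terminate":
            return self.terminate(want_commit=False)
        self.report_failure()
        return self.outcome
```

With the "fail" policy the transaction stays open after the death, so the
superior can still observe and report the error; the abort happens only when
the owner finally asks for commit or abort.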

Commit Processing:

   * One phase: used by volatile servers, i.e. servers that maintain only
     volatile state and no permanent storage. The TM sends an end request
     to each one-phase participant.
   * Two phase: used by servers that maintain recoverable state.
     Participants vote on the commit:
        o vote-abort: the participant undoes its actions, and the second
          phase is used to announce the abort to everybody else.
        o vote-commit-read-only: the participant has not modified any
          recoverable resources and asks not to be included in phase two
          of the commit.
        o vote-commit-volatile: same as vote-commit-read-only, but the
          participant wants to be notified of the outcome.
        o vote-commit-recoverable: the participant has modified recoverable
          state, so it needs to be informed of the outcome in phase two.
   * Rules are needed to handle special cases such as: commit before
     participation; cycles in the transaction graph; new requests arriving
     after a participant has become prepared; reappearance of a forgotten
     transaction, etc.
   * The coordinator is at the transaction birth-site, which usually means
     a user workstation that is likely to fail. To improve reliability,
     the coordinator can be migrated, replicated, or both.
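
The four votes of the two-phase protocol can be sketched as a small decision
function. This is a simplified illustration under stated assumptions, not the
QuickSilver implementation: the function name and its return shape are made
up, votes are assumed to be already collected, and logging, timeouts and the
special cases listed above are ignored.

```python
from enum import Enum


class Vote(Enum):
    ABORT = "vote-abort"
    COMMIT_READ_ONLY = "vote-commit-read-only"
    COMMIT_VOLATILE = "vote-commit-volatile"
    COMMIT_RECOVERABLE = "vote-commit-recoverable"


def two_phase_commit(votes):
    """Decide the outcome from the phase-one votes.

    votes maps a participant name to the Vote it returned.
    Returns (outcome, names of participants to notify in phase two).
    """
    # A single vote-abort is enough to abort the whole transaction.
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"

    notify = []
    for name, vote in votes.items():
        if vote is Vote.ABORT:
            continue  # already undid its own actions
        if vote is Vote.COMMIT_READ_ONLY:
            continue  # asked to be left out of phase two
        # Volatile and recoverable voters both want to hear the outcome.
        notify.append(name)
    return outcome, sorted(notify)
```

For example, with a recoverable file server, a volatile window server and a
read-only name server all voting to commit, the function returns "commit" and
notifies only the first two; the read-only voter drops out after phase one.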

Performance:

   * The paper gives a lot of detailed numbers, but what is the big
     picture? The approach should be applicable to real systems, because
     QuickSilver is used in IBM as their production system.