Recovery Management in QuickSilver
Haskin, Malachi, Sawdon and Chan

One line summary: Use atomic transactions as the mechanism for failure recovery in QuickSilver, a distributed system structured as clients and servers.

Some recovery techniques:
  * Timeouts: clients set timeouts on their requests to servers. Problem: a timeout cannot distinguish a slow server from a crashed one, which can lead to inconsistencies in the system.
  * Connectionless protocols: servers are stateless, connectionless and idempotent. Problems: some actions cannot be made idempotent; quitting in the middle of a request and retrying can also lead to inconsistencies.
  * Virtual circuits: failures are detected by the communication system, which employs connection-oriented protocols. Problem: multiserver atomicity cannot be achieved, because virtual circuits can fail independently.
  * Replication: e.g. Nonstop Kernel. Problem: too expensive; a transaction system is still needed underneath to provide recovery.
  * Transactions: the approach QuickSilver takes.

Basic Idea:
Everything belongs to some transaction, and transactions are designated by globally unique transaction ids. A transaction has an owner process and multiple other participant processes. The owner may commit or abort the transaction, but a participant can only abort.

Transaction-based recovery manager:
  * There is one recovery manager per host, and it has three components:
    o Transaction manager (TM): keeps track of the transactions that the processes on the host participate in. It also manages the commit protocols across the distributed hosts.
      + The TM keeps track of the topology of a transaction in a distributed way. For each transaction that has a participant server process on its host, a TM needs to manage only its superior and subordinate TMs.
      + The TM can terminate or fail a transaction in several ways, for example when the owner calls commit or abort, or when a permanent connection failure is detected. Participant servers can specify whether their failure causes transaction failure or termination. This allows early resource reclamation for subordinates, while allowing errors to be seen and reported by the superior.
      + Termination causes commit/abort to proceed immediately. Failure is remembered, and the transaction is aborted when it does terminate.
    o Log manager (LM): a common recovery log for the TM's commit log and the servers' recovery data.
      + The LM provides several optional log services. A server tells its LM which services it needs, so that it is not penalized by services it does not need, nor by services used by other servers.
      + The server decides what recovery strategy to use. The LM does not interpret the data.
    o Deadlock detector: (not implemented).

Commit Processing:
  * One phase: used by servers that maintain only volatile state. The TM sends an end request to each one-phase participant. A volatile server is one that does not maintain permanent storage.
  * Two phase: used by servers that maintain recoverable state. Participants vote on the commit:
    o vote-abort: the participant undoes its actions, and the second phase is used to announce the abort to everybody else.
    o vote-commit-read-only: the participant has not modified any recoverable resources, and requests not to be included in phase two of the commit.
    o vote-commit-volatile: same as vote-commit-read-only, but the participant wants to be notified of the result.
    o vote-commit-recoverable: the participant has modified recoverable state, so it needs to be informed of the result of phase two.
  * Rules are needed to handle special cases such as: commit before participate; cycles in the transaction graph; new requests after becoming prepared; reappearance of a forgotten transaction, etc.
  * The coordinator is at the transaction birth-site, which usually means a user workstation that is likely to fail. To improve reliability, the coordinator can be migrated and/or replicated.

Performance:
  * The paper gives a lot of detailed numbers. What is the big picture? The approach should be applicable in real systems, because QuickSilver is used within IBM as a production system.
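
Sketch of the two-phase voting rules:
The four votes listed under Commit Processing can be illustrated with a small Python sketch. This is not QuickSilver code: the names Vote, WANTS_OUTCOME and decide, and the participant names in the example, are made up for these notes. It assumes a single coordinator that has already collected every vote, and it assumes that abort voters and read-only voters are simply left out of the phase-two notification.

```python
from enum import Enum


class Vote(Enum):
    """Phase-one votes as named in the notes above."""
    ABORT = "vote-abort"
    COMMIT_READ_ONLY = "vote-commit-read-only"
    COMMIT_VOLATILE = "vote-commit-volatile"
    COMMIT_RECOVERABLE = "vote-commit-recoverable"


# Votes whose senders want to hear the outcome in phase two.
# Assumption: read-only voters opted out, and abort voters have
# already undone their own work, so neither is notified.
WANTS_OUTCOME = {Vote.COMMIT_VOLATILE, Vote.COMMIT_RECOVERABLE}


def decide(votes):
    """Return (outcome, names of participants to notify in phase two).

    `votes` maps a hypothetical participant name to its phase-one Vote.
    """
    # A single vote-abort is enough to abort the whole transaction.
    outcome = "abort" if Vote.ABORT in votes.values() else "commit"
    notify = sorted(name for name, vote in votes.items() if vote in WANTS_OUTCOME)
    return outcome, notify


if __name__ == "__main__":
    example = {
        "file-server": Vote.COMMIT_RECOVERABLE,
        "window-manager": Vote.COMMIT_VOLATILE,
        "name-server": Vote.COMMIT_READ_ONLY,
    }
    print(decide(example))
```

In this sketch the read-only name-server drops out after phase one, which is the saving the vote-commit-read-only vote is meant to give; the real protocol also has to deal with the special cases listed above (cycles, late requests, forgotten transactions), which are not modeled here.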