Recovery Techniques for Database Systems

Joost Verhofstad

One-line summary: This paper presents seven techniques commonly used for
recovery in database systems.

Overview/Main Points

   * Definitions:

        o Failure: An event at which the system does not perform according
          to specifications. There are three kinds of failures:
            1. failure of a program or transaction
            2. failure of the total system
            3. hardware failure
        o Recovery Data: Data required by the recovery system for the
          recovery of the primary data. In very high reliability systems,
          this data might also need to be covered by a recovery mechanism...
          Recovery data is divided into two categories: 1) data required
          to keep current values, and 2) data to make the restoration of
          previous values possible.
        o Transaction: The base unit of locking and recovery (for undo,
          redo, or completion); it appears atomic to the user.
        o Database: A collection of related storage objects together with
          controlled redundancy that serves one or more applications. Data
          is stored in a way that is independent of programs using it, with
          a single approach used to add, modify, or retrieve data.
        o Correct State: Information in the database consists of the most
          recent copies of data put in the database by users and contains no
          data deleted by users.
        o Valid State: The database contains part of the information of the
          correct state. There is no spurious data, although pieces may be
          missing.
        o Consistent State: In a valid state, with the information contained
          satisfying user consistency constraints. Varies depending on the
          database and users.
        o Crash: A failure of a system that is covered by a recovery
          technique.
        o Catastrophe: A failure of a system not covered by a recovery
          technique.

   * Possible Levels of Recovery:

       1. Recovery to the correct state.
       2. Recovery to a checkpointed (past) correct state.
       3. Recovery to a possible previous state.
       4. Recovery to a valid state.
       5. Recovery to a consistent state.
       6. Crash resistance (prevention).

     The bigger the damage, the cruder the recovery technique used.

   * Recovery Techniques:

       1. Salvation program: Run after a crash to attempt to restore the
          system to a valid state. No recovery data used. Used when all
          other techniques fail or were not used. Good for cases where
          buffers were lost in a crash and one wants to reconstruct what was
          lost...(4,5)
       2. Incremental dumping: Modified files copied to archive after job
          completed or at intervals. (3,4)
       3. Audit trail: Sequences of actions on files are recorded. Optimal
          for "backing out" of transactions. (Ideal if trail is written out
          before changes). (1,2,3)
       4. Differential files: Separate file is maintained to keep track of
          changes, periodically merged with the main file. (2,3)
       5. Backup/current version: Present files form the current version of
          the database. Files containing previous values form a consistent
          backup version. (2,3)
       6. Multiple copies: Multiple active copies of each file are
          maintained during normal operation of the database. In cases of
          failure, comparison between the versions can be used to find a
          consistent version. (6)
       7. Careful replacement: Nothing is updated in place, with the
          original only being deleted after operation is complete. (2,6)

     (The numbers in parentheses indicate which of the recovery levels
     listed above each technique supports.)
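     As an illustration of the audit trail technique (3), the sketch below
     records a before-image for every write and uses the trail to back out
     one transaction. It is a minimal in-memory sketch, not taken from the
     paper: the class AuditedStore, its methods, and the transaction ids
     are hypothetical names chosen for the example, and a real system
     would write the trail to stable storage before applying each change.

```python
class AuditedStore:
    """Toy key-value store with an in-memory audit trail (hypothetical)."""

    _MISSING = object()  # marks "key did not exist before the write"

    def __init__(self):
        self.data = {}
        self.trail = []  # entries: (txn_id, key, before_image)

    def write(self, txn_id, key, value):
        # Record the before-image first, then apply the change.
        before = self.data.get(key, self._MISSING)
        self.trail.append((txn_id, key, before))
        self.data[key] = value

    def back_out(self, txn_id):
        # Undo the transaction's writes in reverse order of occurrence.
        for entry_txn, key, before in reversed(self.trail):
            if entry_txn != txn_id:
                continue
            if before is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = before
        # Drop the backed-out transaction's entries from the trail.
        self.trail = [e for e in self.trail if e[0] != txn_id]


store = AuditedStore()
store.write("t1", "balance", 100)
store.write("t2", "balance", 250)
store.write("t2", "owner", "ann")
store.back_out("t2")
print(store.data)  # {'balance': 100}
```

     Undoing in reverse order matters when a transaction writes the same
     key more than once: the oldest before-image is restored last.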

     Combinations of two techniques can be used to offer similar protection
     against different kinds of failures. The techniques above, when
     implemented, force changes to the following (numbers here refer to
     the techniques listed above, not to recovery levels):

        o The way data is structured (4,5,6).
        o The way data is updated and manipulated (7).
        o Nothing (available as utilities) (1,2,3).
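     To show how careful replacement (7) changes the way data is updated,
     the sketch below writes the new contents to a temporary file and only
     then switches it in for the original, so the old copy is never
     modified in place. This is one possible file-level rendering under
     stated assumptions, not the paper's own method: the function name
     careful_replace and the demo file name are hypothetical, and the
     switch relies on os.replace being atomic, which holds on POSIX file
     systems when both paths are on the same file system.

```python
import os
import tempfile


def careful_replace(path, new_bytes):
    """Write new_bytes to path without updating the file in place."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the new copy next to the original (same file system).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_bytes)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the new copy to stable storage
        os.replace(tmp_path, path)  # switch to the new copy in one step
    except BaseException:
        # On failure the original is untouched; discard the partial copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "record.dat")
careful_replace(demo_path, b"version 1")
careful_replace(demo_path, b"version 2")
with open(demo_path, "rb") as f:
    print(f.read())  # b'version 2'
```

     A crash before the final switch leaves the previous version intact,
     at the cost of writing a full new copy for every update.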

   * Examples and bits of wisdom:

        o Original Multics system: all disk files updated or created by the
          user are copied when the user signs off. All newly created or
          modified files not previously dumped are copied to tapes once per
          hour. High reliability, but very high overhead. Changed to a
          system using a mix of incremental dumping, full checkpointing, and
          salvage programs.
        o Several other systems maintain backup copies of data through the
          paging system (keep backups in the swap space).
        o Use of buffers is dangerous for consistency.
        o Intention lists: record the intended changes (in effect, the
          audit trail) before the changes actually occur.
        o Recovery among interacting processes is hard. You can either
          prevent the interaction or synchronize with respect to recovery.
        o Error detection is difficult, and can be costly.
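     The intention list idea above can be sketched as a two-step commit:
     first record everything that is about to be done, then do it. If a
     crash interrupts the second step, recovery reapplies the recorded
     list. The sketch is in-memory and hypothetical throughout (the class
     IntentionStore, the crash_after parameter, and the simulated crash
     are illustrative assumptions); in a real system the list would live
     on stable storage.

```python
class IntentionStore:
    """Toy store that records an intention list before applying it."""

    def __init__(self):
        self.data = {}          # stands in for the database
        self.intentions = None  # stands in for a durable intention list

    def commit(self, updates, crash_after=None):
        self.intentions = list(updates.items())  # step 1: record intentions
        self._apply(crash_after)                 # step 2: carry them out

    def _apply(self, crash_after=None):
        for i, (key, value) in enumerate(self.intentions):
            if crash_after is not None and i >= crash_after:
                raise RuntimeError("simulated crash")
            self.data[key] = value  # idempotent: safe to repeat
        self.intentions = None      # all intentions carried out

    def recover(self):
        # Finish any commit that was interrupted by a crash.
        if self.intentions is not None:
            self._apply()


istore = IntentionStore()
try:
    istore.commit({"a": 1, "b": 2}, crash_after=1)
except RuntimeError:
    pass
print(istore.data)  # {'a': 1}
istore.recover()
print(istore.data)  # {'a': 1, 'b': 2}
```

     Because each recorded action sets a key to a fixed value, replaying
     the whole list after a partial application gives the same result as
     an uninterrupted commit.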

Relevance

Recovery from failure is a critical factor in databases. In case of
disaster, it is very important that as much as possible (if not everything)
is recovered. This paper surveys the methods that were in use at the time
for data recovery.

Flaws

This paper is excessively verbose; it could and should have been shorter
and more concise. The examples especially could have been clearer and less
involved. It might have been more valuable to give a single complete view
of several systems than to detail the migration of (now obsolete) systems
over time. The terminology and categories presented at the beginning were
useful and potentially timeless, which the examples were not.