The git index race condition

The index is partly responsible for git’s supernatural speed. It is a cache for file stats and hashes, so most of the time, we can avoid examining file contents to determine if a file has been modified and look at its stats only.

Unfortunately, the mtime typically has one second resolution only. A second is a long time in operating systems. Notably, the hash of a file could be cached, and the file modified within the same second without changing its hash, and we could no longer trust the cache for that file. Git’s Documentation/technical/racy-git.txt and comments within read-cache.c describe this problem and a partial solution. I’ll attempt a full description here.

I’ll use the time-honoured trick of couching my statements in mathematical terms to make them sound more convincing and important.

"Racily clean" files

Assume the filesystem time is nondecreasing. We shall refer to its value by the word "time", which which differs somewhat from its usual sense because it is possble to have one event follow another yet possess an identical timestamp. By abuse of terminology, I shall also use time in its usual sense, that is, as a point of time in an interval. The reader will have to determine the correct meaning.

Denote the stats of file \(F\) at time \(T\) by \(S(T, F)\), and the hash of its contents by \(H(T, F)\). Denote their cached versions by \(S_C(T, F)\), and \(H_C(T, F)\). When clear from context, we omit one or both parameters.

Lemma: Suppose a file \(F\) has cached mtime \(M\) at any point during time \(T > M\), and \(H_C(T) = H(T)\). Then at any point in time \(T_1 \ge T\),

\[ S_C(T_1) = S(T_1) \implies H_C(T_1) = H(T_1). \]

Namely, if the stats match, then the hash matches. We call such a cached hash trustworthy.

Proof: We show the contrapositive. Suppose \(H_C(T_1) \ne H(T_1)\) for \(T_1 \ge T > M\). Let \(M_1\) be the mtime of \(F\) at \(T_1\). We must have \(M_1 \ge M\) because \(H_C(M) = H(M)\) but \(H_C(M) \ne H(T_1)\).

If \(M_1 = M\), then \(H_C(T) \ne H(T)\) since \(T > M\), and \(F\) has not changed since \(M_1\). Otherwise, if \(M_1 > M\), since mtime is a stat, $S_C(T_1) \ne S(T_1)$. QED.

In other words, if we compute the hash of a file \(F\) at a time strictly larger than its mtime, the lemma guarantees the index is accurate for \(F\).

When \(M = T\), all bets are off. If we modify \(F\) fast enough, its mtime remains unchanged, and we now have the incorrect hash at time \(T\). Until updated, the cache entry of \(F\) cannot be trusted even if \(M < T\) later. We call \(F\) a "racily clean" file.

Our solution

On cache entry update for a file \(F\), mark it "dodgy" if and only if its mtime equals the current time. When we read the cache entry for \(F\), if it is "dodgy" we clear the "dodgy" flag if possible, namely if the stats and hash match and the mtime is in the past.

Theorem: If \(S_C = S\) and \(F\) is not dodgy, then \(H_C = H\).

Proof: Suppose \(F\) is not dodgy. Then \(F\) was indexed or reexamined during a time \(T\) strictly larger than its mtime \(M\). By the lemma, if \(S_C = S\) then \(H_C = H\). QED

Note this solution attains the minimal number of file reads, for any given sequence of index operations.

Git’s solution

If a given file \(F\) has mtime \(M\) equal to the index timestamp \(T\), git checks \(H(F)\) as well as \(S(F)\) to determine if it has changed. Optionally, Git can simply assume \(F\) has changed in this case.

Git also attempts to ensure cache entries with \(T > M\) are always trustworthy, so that the lemma can be used even for entries originally created with \(T = M\).

On index write, for every \(T = M\) file \(F\), if the stats match the filesystem, but the contents do not, the cached size of \(F\) is zeroed. Most of the time the file is nonempty and the discrepancy automatically causes a cache entry update next time around, getting rid of the possibly faulty entry.

Git zeros the size only if the file contents mismatch to handle a case where a file really is emptied immediately after a cache write: here, you don’t want to zero the cached size as well, otherwise the stats will happen to match later on (but the hash will be incorrect).

Unfortunately this introduces another race condition, which has since been fixed by exploiting the fact that the SHA1 of an empty file is a known sequence of 20 bytes.

Theorem: If \(T > M\) and \(S_C = S\) for a nonempty file \(F\), then \(H_C = H\).

Proof: Unpleasant exercise.

Filesystem/clock skew

In practice, the system clock and filesystem clock can be skewed, and writing to the index may span across more than one time interval.

Thus when writing the index, we note its timestamp \(T\) at file creation, which eventually becomes its timestamp when we are finished. Additionally, we perform \(T \le M\) checks, not merely \(T = M\).

This allows files to be concurrently modified as we are indexing.

Faulty solutions

The above theorems ignore the which hashes we actually want to store in the index. By modifiying cached hashes, we can construct schemes resulting in provably trustworthy cached hashes.

We want users to have complele control over the cached hashes in the index. Therefore all valid solutions to the problem can only modify stats, or introduce auxiliary data.

Ben Lynn 💡