Rarely is the full behavior of the NULL value in SQL taught or described in detail, and with good reason: Some of the SQL rules surrounding NULL can be surprising or unintuitive.
Unfortunately, if you have deal with NULL in real databases, the results can be downright frustrating. The SQLite project, for example, uses trial and error to determine how a database behaves in the presence of NULL values.
Fortunately, Date and Darwen's A Guide to the SQL Standard (fourth edition) [1] describes SQL's rules concerning NULL in good detail.
Intuitively, NULL approximately represents an unknown value.
IS NULL
or IS NOT NULL
.
WHERE
expression
evaluates to true, and groups whose HAVING
clause
evaluates to true.
COUNT(*)
counts all NULL and non-NULL tuples;
COUNT
(attribute) counts all tuples whose attribute value
is not NULL.
Other SQL aggregate functions ignore NULL values in their computation.
[4]
Here is what the SQL standard mandates for some operations involving sets and multisets.
For a simple relation R
CREATE TABLE R (a INTEGER);the following queries attempt to reliably determine the maximum known value of the attribute
a
in the table R
.
For brevity below, let
This is the obvious query:
SELECT MAX(a) FROM R
R
is empty,
the query must return NULL, not empty.
[5]
R
is a one-row table holding NULL,
Date and Darwen don't specifically declare what MAX()
should return if its argument consists of exclusively NULL values.
R
is a table holding NULL and non-NULL integers,
NULLs are ignored, and MAX()
returns the maximum integer.
This expression of the maximum seems consistent with mathematical logic, but fails completely in SQL:
SELECT DISTINCT a FROM R WHERE a >= ALL (SELECT * FROM R)
R
is empty,
the query returns empty. The >= ALL
test is vacuously true
with an empty subquery
[6],
but there is no value of a to exploit the test.
R
holds a NULL value,
the query returns empty, because the test a >= ALL(...)
returns unknown (not false!) for any NULL or maximum non-NULL integer
value of a
if the subquery includes a NULL
value.
[7]
This expression is one derivation of maximum as computed in
relational algebra:
subtract all the non-maximum values from the table R
,
leaving the maximal ones:
(SELECT DISTINCT * FROM R) EXCEPT (SELECT R.a FROM R, R AS S WHERE R.a < S.a)
R
is empty,
the query returns empty.
R
holds a NULL value,
the query returns NULL, in addition to whatever maximal integer is present
(if any).
The lower subquery never includes NULL,
so NULL is never subtracted from R
.
This expression is another writing of maximum as computed in
relational algebra:
find values not in the non-maximum values of R
:
SELECT DISTINCT * FROM R WHERE a NOT IN (SELECT R.a FROM R, R AS S WHERE R.a < S.a)This writing turns out to be subtly different from the last one.
R
is empty,
the query returns empty.
R
holds one integer, and at least one NULL value,
the query returns NULL, in addition to whatever maximal integer is present
(if any).
In this case, the subquery is always empty; the one available
integer is compared only to NULL values, so do not participate
in the subquery's result.
NULL NOT IN
(empty) is vacuously true as it is in
mathematics
[8], so
NULL is selected as part of the result.
R
holds more than one integer,
the query returns the maximal integer.
In this case, the subquery includes at least one value.
Now that the subquery is not empty,
NULL NOT IN
(nonempty result)
evaluates to unknown (not false!), and is no longer selected as
part of the result. As an aside,
NULL NOT IN
(nonempty result) returns unknown
even if the nonempty
result includes NULL.
[9]
Because it is somewhat awkward to have an expression for MAX
return two rows whose values do not equal, the following expression
adjusts the EXCEPT expression to exclude NULL from the answer:
(SELECT DISTINCT * FROM R) EXCEPT (SELECT R.a FROM R, R AS S WHERE R.a < S.a OR R.a IS NULL)
R
is empty,
the query returns empty.
R
holds NULL,
the query returns the maximal integer, or
empty if R
has no integers.
EXCEPT
will remove NULL from the result
if NULL appears in the bottom subquery, even though NULL is not
equal to NULL.
[10] Similarly,
DISTINCT
, UNION
, and INTERSECT
always returns at most one NULL.
Where SQL mandates a behavior for a query above, PostgreSQL complies.
MAX()
of NULLs only: returns NULL
(consistent with ignoring NULLs,
then computing the MAX()
of an empty remainder).
Where SQL mandates a behavior for a query above, Oracle complies.
MAX()
of NULLs only: returns NULL
(consistent with ignoring NULLs,
then computing the MAX()
of an empty remainder).
No pair of the queries from the list above are equivalent when faced with NULLs in relational data, despite their conceptual similarity.
The two implementations tested, PostgreSQL and Oracle, seem to comply with NULL behavior for the set and multiset operations tested here, even when such behavior is sometimes subtle or unintuitive.
Consider the above a good reason to define away NULLs from relational schema whenever possible.