A must-know about NOT IN in SQL – more antijoin optimization

I will try to make it short and clear: if you are writing SQL queries with “NOT IN” like
SELECT … WHERE x NOT IN (SELECT y FROM …)
you have to be sure to first understand what happens when “x” or “y” are NULL: it might not be what you want! And if it is not, I will tell you here how to fix it.

First, the simple case: if “x” and “y” are table columns which have been created with the NOT NULL clause, then they are never NULL and you can relax.

Let us consider the other cases. The complexity stems from the fact that NULL may be understood as “unspecified, might be anything” and thus SQL’s point of view is that it cannot know if NULL is equal to, for example, “coal”. Such a question yields an answer which is neither TRUE, nor FALSE: it yields UNKNOWN, which MySQL prints as a NULL:

Before starting, we need to remember two more details of SQL:

  • a WHERE clause tests a condition against a row and lets the row pass if and only if this condition yields TRUE (it rejects FALSE and UNKNOWN).
  • NOT(TRUE) is FALSE, NOT(FALSE) is TRUE, and NOT(UNKNOWN) is UNKNOWN.

Now that we are ready, let us look at this example:

it is a list of houses, and for every house, we know the type of energy used for heating it (“coal”, “wood”, “gas”, etc, or NULL for no heating).

Show me all houses which are heated with either coal or wood:

and show me all other houses:

We have a house A without heating and one B using oil:

When we test house A:

heating IN (“coal”, “wood”) -> UNKNOWN
because, says SQL, heating is NULL, and NULL might be coal, or might be wood, or not, we do not know…

heating NOT IN (“coal”, “wood”) -> UNKNOWN
because NOT IN is applying NOT to IN, and IN is UNKNOWN, and NOT(UNKNOWN) is UNKNOWN.

As a consequence:

as WHERE eliminates rows for which the condition is not TRUE, so eliminates house A.

The two SELECT’s results above are correct from SQL’s point of view. Now it is your turn to decide if they are what you expected.

If yes, then things are fine.

But I know that for some people, it is not what they expected. For example, some are shocked to see that both IN and NOT IN miss house A, like if A were in neither of the two groups (the coal-wood and the others); A seems to be invisible, kind of  a ghost…

The crux of the matter is that when I designed the table of houses, I meant NULL to be “none”, “no heating”.
Unlike SQL which means NULL to be “perhaps coal or gas or etc or nothing”.
Therefore, in my intent, NULL cannot possibly be coal or wood, so I expect IN to not return A (it does not, alright), and I expect NOT IN to return A… and it does not.

So what should I do, to make NOT IN behave as I expect?

Simple! I just have to better express what I want, in SQL.
I can change NOT IN to IN IS NOT TRUE:

this will let pass houses where IN returns FALSE or UNKNOWN; so A and B will pass, as I expect.

The same problem occurs with “NOT IN(subquery)”. Let us add this table:

Show me houses for which heating does not generate carbon-dioxide:

-> NO ROW.

A is missing, again. And again the solution is:

Now I get A.
This rewrite to IN IS NOT TRUE works pretty well.
I can instead rewrite to NOT EXISTS, which is more editing work though:

That returns A too.

If I do any of the two rewrites, I am somehow declaring to MySQL that I want NULLs to be clear-cut matches for my NOT IN.

An extra benefit is that this also allows MySQL to optimize more “aggressively”. Indeed, when any side of NOT IN is a nullable column (our case here),

SELECT … WHERE heating NOT IN (SELECT name …)

cannot be converted to an antijoin (a new feature of MySQL 8.0.17), precisely because the behaviour of NOT IN with NULLs does not match the definition of an antijoin in relational algebra. MySQL is thus limited in ways to execute this query.

But,

SELECT … WHERE heating IN (SELECT name …) IS NOT TRUE

can be converted to an antijoin. And that is also true for the NOT EXISTS rewrite.

We can check this in EXPLAIN; first we have the initial NOT IN, with a query plan showing one subquery execution per house, with a table scan each time (which is rather inefficient):

Now here are the rewritten queries, which properly use an antijoin and can thus benefit from our new hash-based join algorithm (introduced in version 8.0.18 for inner joins, and extended to semijoins, antijoins and outer joins in 8.0.20):

The antijoin plan is indeed faster; to check this experimentally, let us create one million random houses:

RAND() returns a number between 0 and 1; ROUND() rounds it to an integer from 0 to 5; 0 to 4 get a real energy source while 5 gets NULL (as 5 is not specified in CASE).

To get to one million houses, I only have to repeat the last INSERT a few times. And now the times for my search query are:

The antijoin plan returns more rows (including, as expected, the NULLs) in twenty per-cent less time.

Take-aways from this are: when using NOT IN, and if NULLs cannot be avoided, ask yourself if the behaviour with NULLs is what you want; if yes, alright; if not, consider alternatives IN IS NOT TRUE or NOT EXISTS.

Thank you for using MySQL!

Featured image by Pixabay, from pexels.com.

2 thoughts on “A must-know about NOT IN in SQL – more antijoin optimization

Leave a Reply