InnoDB Data Locking – Part 4 “Scheduling”

In this blog series, I’m describing how InnoDB locks data (tables and rows) in order to provide illusion to clients that their queries are executed one after another, and how this was improved in recent releases.

As we’ve already seen, the order in which server pretends transactions are happening (the serialization order) is tied to the order in which locks are granted to transactions. Thus the order in which locks are granted can influence the order in which transaction appear to happen, and as we shall see, performance.

Perhaps it’s worth stating explicitly, that when a resource is currently not locked by anyone and a transaction requests an access right to it, then InnoDB grants it immediately. One could imagine a different strategy in which transaction’s ID, timestamp, or previously locked resources somehow play a role, but InnoDB simply grants an uncontended resource to whoever needs it right away.

However, when a transaction finishes and thus no longer requires access to a resource, we have an opportunity to grant access to this resource to one of the other transactions which were waiting for it. The question is: which one to choose? This is the flexibility server has in shaping the serialization order and performance.

Approach 1. FIFO

The simplest and relatively good approach is to first grant the access right to the transaction which waits for it the longest, which can be accomplished by using First In First Out queue. There is one such FIFO for each resource. When a transaction requests an access to a resource we first check if it can be granted immediately or not. This can be as simple as scanning the list of transactions which currently posses access rights to this resource and check if the access rights are in conflict or not. However, to avoid starvation it makes sense to also check against transactions which were already waiting in the queue before us. We saw in InnoDB Data Locking – Part 2 “Locks” that the rules for checking for a conflict between two lock requests can bet quite complex, but eventually we should be able to decide if our new request can be granted immediately or if we have to wait. In case we have to wait, we append our request at the end of the queue specific to a given resource.

When a transaction finishes we release each of its locks one by one, each time inspecting the corresponding queue of waiters, and consider waiters one by one in the FIFO order, checking if the lock can be granted to them or not. Note that it may happen that none of them is eligible to be granted (for example, if our transaction had a shared lock and there are other holders of such access right still) or there could be multiple requests granted at the same time (for example, if our transaction had an exclusive lock on the resource, and several transactions awaited a shared lock for it). It is also quite possible, that granting the request to one of the waiters effectively prohibits granting a request to waiters later in the queue (for example, if the first of the two waiters needs an exclusive lock), which is why the order in the queue matters at all.

This algorithm was the default for many years in InnoDB, until cooperation with academia brought us improvements described below.

Variance Aware Transaction Scheduling

In their paper “Identifying the Major Sources of Variance in TransactionLatencies: Towards More Predictable Databases” Jiamin Huang, Barzan Mozafari, Grant Schoenebeck, and Thomas Wenisch from University of Michigan proposed an idea aimed at minimizing variance caused by scheduling. The idea was that FIFO while may at first seem fair in that the newcomers do not bypass the transactions already in the queue, it doesn’t really push the notion of “older transactions should have priority over fresher transactions” to their logical conclusion: that the order in the queue should be determined by the “age” of the transaction – the time it has already spent in the system. To see that FIFO does not really sort transactions by “age”, realize that a single transaction can request several resources during its lifetime, thus visiting several queues, and each time it is treated as a “newcomer” even if it has already spent a lot of time in previous queues. To make this more fair, in the patch they’ve proposed before granting a lock to waiter, they are rearranged based on birth-date (see line 2723).

AFAIK this version of the idea didn’t make it to official mysql release, because soon another improvement was proposed by the researchers from Michigan.

Contention-Aware Lock Scheduling

In their next paper “Contention-Aware Lock Scheduling for TransactionalDatabases”  Boyu Tian, Jiamin Huang, Barzan Mozafari, and Grant Schoenebeck proposed an idea to sort the waiters using a different criteria: how many (other) transactions are being (transitively) blocked because a given transaction has to wait. You see: it may happen that a waiting transaction already amassed request rights to many resources before it got to wait for a resource, and now other transactions have to wait for it to release them. If we force such a transaction to wait, we indirectly also force all those other waiting for it, which means that impact of additional “unit of waiting” is multiplied by the size of the crowd waiting behind.
Conceptually this is similar to assigning “weight” to each transaction, proportional to the size of the “subtree” of waiters waiting for it and waiters waiting for waiters waiting for it etc. The concept of “wait-for graph” is described in InnoDB Data Locking – Part 3 “Deadlocks”, but roughly speaking you can picture the waiting transactions as having arrows pointing to transactions which cause them to wait for a resource. The picture in general is a directed acyclic graph, not a tree, and thus the “weight” should be the size of a “subgraph” not a “subtree”.

You can also read about this in our earlier article “Contention-Aware Transaction Scheduling Arriving in InnoDB to Boost Performance”

This is the algorithm which was released with MySQL 8.0.3 under a slightly cooler acronym CATS (Contention Aware Transaction Scheduling).

There were some difficulties to implement the idea from the paper directly in a performant AND correct way. At first it might seem quite easy to keep track of weight of each transaction by increasing or decreasing it whenever an edge in the wait-for-graph appears or disappears – it seems tempting to assume that you simply have to add or subtract the already computed weight of one transaction to another reusing the already done computations. However on closer reflection, you can realize that not only you have to propagate the update the weight of all the nodes which are reachable from the edge’s endpoint (their weight has also changed!) – it’s not even clear what the updated value should be! You see: the definition of the “weight” depends on “is B reachable from A?” not on “what is the number of different ways you can reach B from A?”. This might be a inconsequential difference for trees, but not for more complicated graphs in which there might be more than one way to reach one node from another, and thus it is not immediately clear if adding an edge or removing an edge changes the reachability or not. For example, it might very well be the case that adding a new edge from node having weight 5 to another node doesn’t bring 5 new followers, but just, say, 3 new followers, because the other 2 already could reach the node in question. Similarly, removing a single edge doesn’t necessarily mean you’ve lost connectivity between nodes on opposite sides.

It’s tempting to ignore this discrepancy (between “number of nodes which can reach me” and “number of paths which can reach me”), brushing it off as not a big deal in practice. And this is indeed what the initial implementation did for performance reasons. Yes, we were adding and subtracting numbers which are not exactly the “weight” as defined in the paper, but they still seemed to be (co)related to “contention” caused by a given transaction. Yes, you could create wait-for graphs with exponential number of paths and thus cause an overflow, but this could rarely happen in reality and could be patched by clamping the value.

However, there was a more subtle bug lurking in this simplified implementation: it didn’t took into account that the graph itself changes over time and the number you add when the edge appears is not necessarily the same you subtract when it disappears which could lead to surprising results such as negative weight, or overflows. These again were fixed with clamping, but it become more and more apparent, that to get “the real” value one should rather recompute all values from scratch instead of doing point updates.

However this was prohibitive from performance perspective: processing the whole wait-for graph in topological order each time an edge appears or disappears stopping the whole world just to make scheduling “more fair” would be a bad trade off. Fortunately,  we could reuse ideas and implementation needed for faster deadlock detection which leads to solution we have today…

Atomic CATS!

As explained in InnoDB Data Locking – Part 3 “Deadlocks” in InnoDB we now have a place in code which makes a snapshot of a trimmed down version of the waits-for relation between transactions. It is done in a background thread, and doesn’t require stopping the whole world. This means that we can estimate the “weight” of each transaction based on this snapshot very cheaply. The computed value is still not exactly what the paper authors wanted, as in the trimmed down version of the graph we have only at most one outgoing edge per node. But, the approximation is good enough in practice, and is always recomputed from scratch, so errors if any, do not compound over time (which was leading to under- or overflows in old implementation). The computed values are stored latch-free in transaction objects using C++ atomics, and they are read the same way just before sorting the waiters.

What this change means is that the thread which performs transaction does not have to stop the word just because it has to perform update weights in the whole “downstream” part of the graph, which removes a painful bottleneck we had and makes the CATS algorithm practical. Historically we had a heuristic to choose between CATS and FIFO, because CATS was too slow in some scenarios to be worth using. With Atomic CATS we could finally ditch FIFO implementation and now use CATS in every situation. (You can “emulate” FIFO with CATS by commenting out lock_wait_compute_and_publish_weights_except_cycles and lock_wait_update_weights_on_cycle which will effectively cause all transactions to have the same weight – we don’t have a switch for that, as CATS was always better in our testing)

You can inspect currently computed values by quering INFORMATION_SCHEMA.INNODB_TRX.TRX_SCHEDULE_WEIGHT column (not to be confused with TRX_WEIGHT column, which is used to determine a the lightest deadlock victim).

The sorting of the waiters exploits the fact that very common value of weight is 1 (nobody else waits for the transaction), so before running the O(NlgN) sort, we first do O(N) scan to filter out all transactions with weight 1, which should be processed after those with schedule_weight>1. Often this means there is nothing left to sort at all.

But this is not the end of the possible optimizations of the Lock System. Quite the opposite! This was just a prerequisite, which enabled us to finally tackle the much bigger issue – the scalability of the Lock System, which is the topic of the next part InnoDB Data Locking – Part 5 “Concurrent queues”.

Thank you for using MySQL!

Leave a Reply