MySQL-5.7.4- Change master without stopping slave altogether

At MySQL, we have been working on simplifying the failover process
making it faster, more flexible and easier to use. In MySQL 5.6 we added
support for Global Transaction Identifiers (GTID), which was a huge leap in the
direction of easing the failover process hiding the details about
replication logs and positions. With MySQL 5.7.4, we are introducing a
new feature that further adds to flexibility and onliness- the user can
only shut down components that he needs to re-configure. 

What we allow with this new feature is to execute CHANGE MASTER TO
command without stopping slave altogether. We realized that stopping
slave altogether is not mandatory in all cases and doing that was more
of a cautious approach to switching master restricting more than what’s
required at times.

Lets dive deep into this to see what can be relaxed here. For this, lets
break the replication process into two modules:

M1) Receiver module (concerned with IO thread) and
M2) Applier module (concerning SQL thread or coordinator and worker
threads, whichever be the case)

We can now divide options under the command ‘CHANGE MASTER TO’ into
three groups based on
 the above classification: 
 
G1) Options that change a receiver configuration. 

G2) Options that change an applier configuration. 
G3) Options that relate to both (1) and (2).



For the precise division look at the picture below. Note that the illustration takes into account all the CHANGER MASTER TO options present currently (MySQL-5.7.4). 

 

 

Note that given its current usage, we could put the MASTER_AUTO_POSITION
option under group G1(i.e., receive side). Currently only the receiver
module uses GTID positioning but we foresee that in future it will be
good to allow the applier module to use GTID positioning. We thus keep
the master_auto_position option under group G3 to keep things
future-proof. Worried that obstructs the failover process again like
before? Well that’s not a problem as MASTER_AUTO_POSITION is a slave’s
configuration that you only set once. Then it affects all future times
that you redirect to a new immediate master. So you don’t specify it on
fail-over.

With the classifications stated above, we propose a 3-point rule stated as:

R1) For CHANGE MASTER TO options under group G1, stop only receiver
module (M1) using the command STOP SLAVE IO_THREAD command.
R2) For CHANGE MASTER TO options under group G2, stop only applier
module (M2) using the command STOP SLAVE SQL_THREAD command.
R3) For CHANGE MASTER TO options under group G3, stop both receiver (M1)
and applier modules (M2) using the command STOP SLAVE.

HOW DOES THIS RULE WORK?

Lets explore more about our 3-point rule(R1-R3):

  • Starting with rule R1, we stated that we only need to stop the
    receiver thread to change receive options. What happens to the applier
    module? Well the applier module keeps applying pending transactions,
    if any. If you have a situation where the slave was lagging behind with
    a lot of transactions queued into the slave’s logs, you can allow the
    applier module to catch up while you switch masters or change a
    configuration on the receive side keeping the master same.
  • Under rule R2, we stated that we can change configuration of applier
    module after stopping the applier threads ONLY, receiver module can be
    running while you do this. So while you fine-tune your slave applier
    module the receiver module keeps reading master’s log and copying
    transactions to the slave’s log. These could then be applied in-parallel
    by the slave when the applier module is up and running.
  • Under rule R3, you stop both receiver and applier modules. So, this is
    analogous to
    STOP SLAVE;
    CHANGE MASTER TO <master_def>;
    used before this feature was available.

Worried how relay log purge would be handled now? Well its pretty
simple- Under rules R1 and R2, we do not purge logs implicitly on
executing the CHANGE MASTER command so that the receiver or applier
whichever is running just keeps processing/adding relay logs as it would
do if no replication thread was stopped.

Finally note that you need not always look at the figure above to find
which options are allowed which thread being stopped. You just need to
ask yourself if the parameter is related to receiver thread and stop the
concerned thread. And if you go wrong there are error messages to guide
you on the right path. Look at the next section for the usage and the
errors.

EXAMPLES OF USAGE

example 1:

Previously, to change master heartbeat period, you would do a

STOP SLAVE;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;
START SLAVE;

Now, with this feature you just have to stop the receiver (io) thread as
heartbeat has nothing to do with the applier thread(s).

STOP SLAVE IO_THREAD;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;
START SLAVE IO_THREAD;

Note that the applier thread keeps executing the transactions in the
relay log while you change the heartbeat period for the master-slave
connection. Likewise, you could do this with all the attributes mentioned in group G1 in the figure above.

example 2:

Similarly, to change applier thread attributes, you just have to stop
the applier threads.

So instead of

STOP SLAVE;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE;

it is enough to do the following with this feature.

STOP SLAVE SQL_THREAD;
CHANGE MASTER TO MASTER_DELAY=<value>;
START SLAVE SQL_THREAD;

Lastly, if you go wrong there are nicely worded error message to guide
you. So in the first case, if your receiver module is active and you
execute a

CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= <value>;

you get an error saying:

This operation cannot be performed with a running slave io thread; run STOP SLAVE IO_THREAD first.

and if you forgot changing applier module when it was required, the
server will say:

This operation cannot be performed with a running slave sql thread; run STOP SLAVE SQL_THREAD first.

Lastly you still have the error message saying both the threads should
stop appearing only for MASTER_AUTO_POSITION option now:

This operation cannot be performed with a running slave; run STOP SLAVE first.

Lets see some examples once again:

example 3:

slave>START SLAVE;
slave>CHANGE MASTER TO MASTER_DELAY= 10;
ERROR 1900 (HY000): This operation cannot be performed with a running
slave sql thread; run STOP SLAVE SQL_THREAD first

example 4:

mysql> CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD= 10;
ERROR 1904 (HY000): This operation cannot be performed with a running
slave io thread; run STOP SLAVE IO_THREAD first.

example 5:

mysql> CHANGE MASTER TO MASTER_AUTO_POSITION= 0;
ERROR 1198 (HY000): This operation cannot be performed with a running
slave; run STOP SLAVE first

SIDE-EFFECTS?

While implementing this, we have taken special care to make sure we dont
break anything for a user switching masters like:

STOP SLAVE;
CHANGE MASTER to <master_def>;
START SLAVE.

There are absolutely NO side-effects to worry you. Note that as stated
before, CHANGE MASTER TO will not delete relay logs if one of the
receiver or applier thread is running.

Try it out and give us your feedback. As always, we look forward to
hearing from you to improve this feature. Enjoy :)

Leave a Reply

Your email address will not be published. Required fields are marked *


2 + = ten

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code class="" title="" data-url=""> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre class="" title="" data-url=""> <span class="" title="" data-url="">