How does MySQL Replication really work?


While we do have many posts on replication on our blog, such as on replication being single-threaded, on semi-synchronous replication or on estimating replication capacity, I don't think we have one that covers the very basics of how MySQL replication really works at a high level. Or it was so long ago that I can't even find it. So, I decided to write one now.
Of course, there are many aspects of MySQL replication, but my main focus will be the logistics: how replication events are written on the master, how they are transferred to the replication slave and then how they are applied there. Note that this is NOT a HOWTO on setting up replication, but rather a howstuffworks type of thing.
Replication events
I say replication events in this article because I want to avoid discussion about different replication formats. These are covered pretty well in the MySQL manual. Put simply, the events can be one of two types:
Statement based - in which case these are write queries
Row based - in this case these are changes to records, sort of row diffs if you will
But other than that, I won't be going back to differences in replication with different replication formats, mostly because there's very little that's different when it comes to transporting the data changes.
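To make the two event types a bit more concrete, here is a toy Python sketch of the same logical change expressed both ways. The class names, the `accounts` table and the `changed_columns` helper are all made up for illustration; this is not the actual binary log format, just the idea of "a write query" versus "a row diff".

```python
from dataclasses import dataclass


@dataclass
class StatementEvent:
    """Statement based: the write query itself is what gets shipped."""
    sql: str


@dataclass
class RowEvent:
    """Row based: the change to a record is shipped as before/after images."""
    table: str
    before: dict
    after: dict


# The same logical change, as it might look in each format (hypothetical data).
statement_event = StatementEvent(
    sql="UPDATE accounts SET balance = balance - 10 WHERE id = 7"
)
row_event = RowEvent(
    table="accounts",
    before={"id": 7, "balance": 100},
    after={"id": 7, "balance": 90},
)


def changed_columns(event: RowEvent) -> dict:
    """Return the 'row diff': columns whose value differs between the images."""
    return {
        col: (event.before.get(col), new_value)
        for col, new_value in event.after.items()
        if event.before.get(col) != new_value
    }
```

Either way, what travels from master to slave is just a stream of such events, which is why the rest of this post does not care about the format.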
On the master
So now let me start with what is happening on the master. For replication to work, first of all the master needs to be writing replication events to a special log called the binary log. This is usually a very lightweight activity (assuming events are not synchronized to disk), because writes are buffered and because they are sequential. The binary log file stores the data that the replication slave will be reading later.
Whenever a replication slave connects to a master, the master creates a new thread for the connection (similar to the one that's used for just about any other server client) and then it does whatever the client, the replication slave in this case, asks. Most of that is going to be (a) feeding the replication slave with events from the binary log and (b) notifying the slave about newly written events in its binary log.
Slaves that are up to date will mostly be reading events that are still cached in the OS cache on the master, so there are not going to be any physical disk reads on the master in order to feed binary log events to the slave(s). However, when you connect a replication slave that is a few hours or even days behind, it will initially start reading binary logs that were written hours or days ago. The master may no longer have these cached, so disk reads will occur. If the master does not have free IO resources, you may feel a bump at that point.
On the slave
Now let's see what is happening on the slave. When you start replication, two threads are started on the slave:

  1. IO thread
    This process, called the IO thread, connects to a master, reads binary log events from the master as they come in and just copies them over to a local log file called the relay log. That's all.
    Even though there's only one thread reading the binary log from the master and one writing the relay log on the slave, copying of replication events is very rarely the slower element of replication. There could be a network delay, causing a steady delay of a few hundred milliseconds, but that's about it.
    If you want to see where the IO thread currently is, check the following in "show slave status\G":
    Master_Log_File - last file copied from the master (most of the time it would be the same as the last binary log written by the master)
    Read_Master_Log_Pos - the binary log from the master is copied over to the relay log on the slave up until this position.
    And then you can compare it to the output of "show master status\G" from the master.
  2. SQL thread
    The second process, the SQL thread, reads events from a relay log stored locally on the replication slave (the file that was written by the IO thread) and then applies them as fast as possible.
    This thread is what people often blame for being single-threaded. Going back to "show slave status\G", you can get the current status of the SQL thread from the following variables:
    Relay_Master_Log_File - binary log from the master that the SQL thread is "working on" (in reality it is working on the relay log, so it's just a convenient way to display information)
    Exec_Master_Log_Pos - the position in the master binary log that is being executed by the SQL thread.
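The whole flow described so far (binary log on the master, IO thread, relay log, SQL thread) can be sketched as a toy model in Python. Everything here is a simplification I made up for illustration: events are `(key, value)` pairs, positions are list indexes rather than byte offsets, and the "threads" are plain method calls rather than real threads.

```python
class Master:
    def __init__(self):
        self.binlog = []  # toy binary log; list index plays the role of a position

    def write(self, event):
        """Append a replication event to the binary log."""
        self.binlog.append(event)

    def dump(self, from_pos):
        """What the master-side thread does for a slave: feed events from a position."""
        return self.binlog[from_pos:]


class Slave:
    def __init__(self):
        self.relay_log = []
        self.read_pos = 0  # loosely analogous to Read_Master_Log_Pos
        self.exec_pos = 0  # loosely analogous to Exec_Master_Log_Pos
        self.data = {}

    def io_thread_step(self, master):
        """IO thread: copy new events from the master into the relay log. That's all."""
        events = master.dump(self.read_pos)
        self.relay_log.extend(events)
        self.read_pos += len(events)

    def sql_thread_step(self, max_events=1):
        """SQL thread: apply up to max_events events from the relay log, in order."""
        applied = 0
        while self.exec_pos < len(self.relay_log) and applied < max_events:
            key, value = self.relay_log[self.exec_pos]
            self.data[key] = value
            self.exec_pos += 1
            applied += 1


master = Master()
for event in [("a", 1), ("b", 2), ("a", 3)]:
    master.write(event)

slave = Slave()
slave.io_thread_step(master)  # copies all three events to the relay log
slave.sql_thread_step()       # applies only the first of them
```

At this point the toy slave has copied everything but applied only one event, which is the typical shape of replication lag: the copy position is ahead of the execute position.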

Replication lag
Now I want to briefly touch on the subject of replication lag in this context. When you are dealing with replication lag, the first thing you want to know is which of the two replication threads is behind. Most of the time it will be the SQL thread, but it still makes sense to double check. You can do that by comparing the replication status variables mentioned above to the master binary log status from the output of "show master status\G" on the master.
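As a rough sketch of that comparison, the Python function below takes the two status outputs as plain dictionaries (how you fetch them is up to you) and reports which thread looks behind. The function names are mine; the dictionary keys are the column names from "show master status" and "show slave status". It assumes binary log file names end in a numeric sequence such as `mysql-bin.000042`, and keep in mind that the two outputs are never taken at exactly the same instant, so small differences on a busy master are normal.

```python
def log_coordinate(file_name, position):
    """Turn a (binary log file, position) pair into a comparable tuple.

    Assumes the file name ends in a numeric sequence, e.g. mysql-bin.000042.
    """
    return (int(file_name.rsplit(".", 1)[1]), int(position))


def threads_behind(master_status, slave_status):
    """Return the names of the replication threads that appear to be behind."""
    written = log_coordinate(master_status["File"], master_status["Position"])
    copied = log_coordinate(
        slave_status["Master_Log_File"], slave_status["Read_Master_Log_Pos"]
    )
    executed = log_coordinate(
        slave_status["Relay_Master_Log_File"], slave_status["Exec_Master_Log_Pos"]
    )

    behind = []
    if copied < written:
        behind.append("IO thread")   # events not yet copied to the relay log
    if executed < copied:
        behind.append("SQL thread")  # events copied but not yet applied
    return behind


# Hypothetical values: everything is copied, but the SQL thread is a file behind.
example_master = {"File": "mysql-bin.000042", "Position": 1000}
example_slave = {
    "Master_Log_File": "mysql-bin.000042",
    "Read_Master_Log_Pos": 1000,
    "Relay_Master_Log_File": "mysql-bin.000041",
    "Exec_Master_Log_Pos": 500,
}
```

With the made-up values above, `threads_behind(example_master, example_slave)` reports only the SQL thread.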
If it happens to be the IO thread, which, as I mentioned above, is very rare, one thing you may want to try to get that fixed is enabling the slave compressed protocol (the slave_compressed_protocol option).
Otherwise, if you are sure it is the SQL thread, then you want to understand the reason, and that you can usually observe with vmstat. Monitor server activity over time and see if it is the "r" or the "b" column that is "scoring" most of the time. If it is "r", replication is CPU-bound; otherwise it is IO-bound. If it is not conclusive, mpstat will give you better visibility by CPU thread.
Note this assumes that there is no other activity happening on the server. If there is some activity, then you may also want to look at diskstats or even do a query review for the SQL thread to get a good picture.
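The "r" versus "b" check can be sketched in a few lines of Python. This is only an illustration of the rule of thumb above: the function name is mine, the sample output is made up, and it simply sums the two columns over all samples it is given.

```python
def likely_bottleneck(vmstat_output):
    """Sum the 'r' and 'b' columns of vmstat output and name the dominant one."""
    r_total = b_total = 0
    r_idx = b_idx = None
    for line in vmstat_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if "r" in fields and "b" in fields:
            # Column header line (vmstat repeats it periodically).
            r_idx, b_idx = fields.index("r"), fields.index("b")
            continue
        if r_idx is None or not fields[0].isdigit():
            continue  # banner line or anything else that is not a sample
        r_total += int(fields[r_idx])
        b_total += int(fields[b_idx])

    if r_total > b_total:
        return "CPU-bound"
    if b_total > r_total:
        return "IO-bound"
    return "inconclusive"


# Made-up sample in the shape of `vmstat 1` output.
SAMPLE_VMSTAT = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  1      0 812340  10240 402112    0    0  5120   340  900 1500  3  1 46 50  0
 0  2      0 811900  10240 402300    0    0  6100   280  950 1600  2  1 45 52  0
 1  1      0 811500  10240 402480    0    0  5800   310  920 1550  4  1 44 51  0
"""
```

For this sample the "b" column dominates, so the function calls it IO-bound. On a server with other activity the result says nothing about replication specifically.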
If you find that replication is CPU bound, this may be very helpful.
If it is IO bound, then fixing it may not be as easy (or rather, as cheap). If replication is IO bound, most of the time that means that the SQL thread is unable to read fast enough because reads are single-threaded. Yes, you got that right: it is reads that are limiting replication performance, not writes. Let me explain this further.
Assume you have a RAID10 with a bunch of disks and a write-back cache. Writes, even though they are serialized, will be fast because they are buffered in the controller cache and because internally the RAID card can parallelize writes to disks. Hence a replication slave with similar hardware can write just as fast as the master can.
Now, reads. When your working set does not fit in memory, the data that is about to get modified has to be read from disk first, and this is where replication is limited by its single-threaded nature, because one thread will only ever read from one disk at a time.
That being said, one solution to fix IO-bound replication is to increase the amount of memory so the working set fits in memory. Another is to get an IO device that can do many more IO operations per second even with a single thread: the fastest traditional disks can do up to 250 IOPS, SSDs on the order of 10,000 IOPS.
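To put rough numbers on this, here is a back-of-the-envelope sketch in Python. It is my own simplification, not a benchmark: it assumes every applied change needs a fixed number of page reads, that a given fraction of those reads is served from memory, and that the single SQL thread can only issue one disk read at a time.

```python
def max_changes_per_second(single_thread_iops, reads_per_change=1.0, cache_hit_ratio=0.0):
    """Upper bound on changes a single thread can apply when reads are the limit.

    single_thread_iops: reads per second the device serves to ONE thread
    reads_per_change:   assumed page reads needed per applied change
    cache_hit_ratio:    assumed fraction of those reads served from memory
    """
    disk_reads_per_change = reads_per_change * (1.0 - cache_hit_ratio)
    if disk_reads_per_change == 0:
        return float("inf")  # working set fits in memory: reads are no longer the limit
    return single_thread_iops / disk_reads_per_change


traditional_disk = max_changes_per_second(250)                    # nothing cached
ssd = max_changes_per_second(10_000)                              # nothing cached
half_cached = max_changes_per_second(250, cache_hit_ratio=0.5)    # more memory
```

Under these assumptions a traditional disk caps the slave at about 250 changes per second, an SSD at about 10,000, and caching half of the reads doubles the disk figure to 500. Both fixes attack the same bottleneck from different sides.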
Questions? Comments? Concerns?