Execute SHOW GLOBAL STATUS like ‘slave_running’ and SHOW SLAVE STATUS periodically and store it in a file. Check the contents of the file periodically to see if any of your replicas have stopped replicating.
On Linux, you can have a cron job setup which executes these queries periodically and even configure to send email alert if slave_running value is ëNOí.
Slave_IO_Running tells us if the Slave is able to connect to the Master. Slave_SQL_Running indicates if data received from the Master is being processed. Last_IO_Errno, Last_IO_Error, Last_SQL_Errno, Last_SQL_Error, are all pretty much what they say, the last error number and error from the IO or SQL threads Seconds_Behind_Master shows difference between last timestamp read in the binlogs and current time. This is important to understand. It does not report directly the delay between when information is updated/inserted on the master and recorded on the slave. A slow network can cause an artificially inflated number, as well as long running queries or blocking/locking operations.
Until recently, we had been relying on Seconds_Behind_Master to tell us if Replication was working, and if it was behind the Master by any appreciable level. And, of course, we found ourselves in a perfect storm situation where Replication had silently failed. Data was being sent over to the Slave, was being read by the IO thread, but even though the SQL thread was reporting no errors, the data was not inserted into the Slave. Due to the binlogs being read, Seconds_Behind_Master was reporting 0.