SQL Zone is brought to you in partnership with:

For the past eight(8) years Schalk Neethling has been working as a freelance developer under the pseudo of Volume4 and is now the president of Overt Strategy Consulting. During this period he has completed over 300 projects ranging from full web application development to complete branding. As president and lead developer of Overt Strategy Consulting, Schalk Neethling and his team has released a 100% Java standards based content management system called AlliedBridge and business document exchange and review system, called Doc-Central. Schalk Neethling is also actively involved on a daily basis in the open source, web standards and accessibility areas and is a current active member of the Web Standards Group. Schalk is also the co-founder and president of the non-profit The South Web Standards and Accessibility Group, which aims to actively educate and raise awareness of web standards and accessibility to both the developer society as well as business large and small. Schalk also has a long relationship with DZone and is currently zone leader for both the web builder, css.dzone.com, as well as the .NET zone, dotnet.dzone.com, and you can find a lot of his writing there as well as on his blog located at schalkneethling.alliedbridge.com. Schalk is constantly expanding on his knowledge of various aspects of technology and loves to stay in touch with the latest happenings. For Schalk web development and the internet is not just a job, it is a love, a passion and a life style. Schalk has posted 173 posts at DZone. View Full User Profile

Troubleshooting Relay Log Corruption in MySQL

09.09.2008
| 15147 views |
  • submit to reddit

Have you ever seen the replication stopped with message like this:

Last_Error: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.

This is relay relay log corruption and you can check details in the MySQL Error log file. The error message describes few reasons and indeed because there is little validation (ie no checksums) in the replication there are multiple reasons for bad event to show up in relay logs.

Really this is only one of various error messages you could see if relay log corrupted. You could also see malformed queries (with some junk), complaining about event to big etc if there is a garbage in relay logs.

If relay logs are corrupted it is surely worth to check what could cause it - it could be network (especially if replicating over unreliable long distance networks), MySQL bugs on master or slave, hardware problems and few others. In any case it is worth investigating.

Investigating is what you do later but how do you fix the problem first ? The important question you need to have answered - are logs corrupted on the master ? If logs on the master are OK you can just run SHOW SLAVE STATUS on slave experiencing error and use CHANGE MASTER TO to re-point replication to Relay_Master_Log_File:Exec_Master_Log_Pos:

localhost:(none)> slave stop;
Query OK, 0 rows affected (0.00 sec)
localhost:(none)> CHANGE master TO master_log_file=Relay_Master_Log_File,master_log_pos=Exec_Master_Log_Pos
Query OK, 0 rows affected (1.16 sec)
localhost:(none)> slave start;
Query OK, 0 rows affected (0.00 sec)

This will purge existing relay logs re-fetch all events which have not been executed yet. Doing this command make sure your master is operational and it still has all the logs needed to re-fetch events.

How would you know if logs are OK on the master ? Well in this case there were probably 5 another slaves which did not have the problem - which means Master is most likely OK. In any case it is little harm to try restarting from the same position - if logs are bad on the master you would get the same error message again and can continue with investigation.

What if logs on the master are corrupted ?

In this case you have couple of choices (and you also potentially have multiple slaves to deal with). You can use mysqlbinlog (or you favorite hex editor if mysqlbinlog does not work) to find the next event start and potentially recover "corrupted" event to be manually executed on the slaves.

Skipping around event makes master and slave potentially inconsistent and you should access the risks depending on applications (and on amount of events which were corrupted) you may want to let replication continue from the new position or resync the slaves to the master.

How can you recover the slave ?

As all slaves are likely to be affected in this case you can't clone another slave. You also can't use classical method of recovery from backup - because you would need relay logs to roll forward, and they are corrupted. You can either re-clone the data from Master. (This is where LVM or similar techniques can help you a lot) or skip bad events as described and when use Maatkit mk-table-checksum to check what tables are out of sync and when use mk-table-sync to resync them.

Last method works in particularly well in case you can afford to run for a while with slaves which are a bit out of sync, which is quite often better than having just master available (also having extra load of data copied from it).

Original Author

Original Article By Peter Zaitsev

References
Published at DZone with permission of its author, Schalk Neethling. (source)

(Note: Opinions expressed in this article and its replies are the opinions of their respective authors and not those of DZone, Inc.)