[MDEV-4179] Server crash / memory corruption with MariaDB-Galera Created: 2013-02-18 Updated: 2019-06-04 Resolved: 2018-04-11 |
|
| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | Galera |
| Affects Version/s: | 5.5.28a-galera |
| Fix Version/s: | N/A |
| Type: | Bug | Priority: | Critical |
| Reporter: | Aleksey Sanin (Inactive) | Assignee: | Sachin Setiya (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Environment: | CentOS 5, Ubuntu 11.10 |
| Attachments: | my.cnf |
| Description |
|
130218 7:14:08 [Note] Slave I/O thread: connected to master 'replication@10.240.170.40:53306',replication started in log 'mysql-bin.001384' at position 796660420
To report this bug, see http://kb.askmonty.org/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
Server version: 5.5.28a-MariaDB-log
Thread pointer: 0x0x56ceb010
Trying to get some variables.
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=on,mrr_cost_based=on,mrr_sort_keys=on,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=off
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
[root@WPDB03 ~]# *** glibc detected *** /usr/sbin/mysqld: malloc(): memory corruption (fast): 0x0000000058a58cd0 *** |
| Comments |
| Comment by Aleksey Sanin (Inactive) [ 2013-02-18 ] |
|
100% reproducible (3/3 tries). After the core dump, the main mysqld hangs and does not respond. After kill -9, the data is corrupted: the default crash recovery fails and the slave doesn't start. |
| Comment by Elena Stepanova [ 2013-02-18 ] |
|
Hi Aleksey,

How big is the dump? Can you upload it to our ftp? (ftp.askmonty.org; you can choose the private section if you wish)

Are you using the thread pool in this installation as well? I ask because the described procedure involves a node restart, and, as you already reported in another bug, in 5.5.28a that doesn't work with the thread pool; so I'm wondering whether it could be related in any way. |
| Comment by Aleksey Sanin (Inactive) [ 2013-02-19 ] |
|
Hi Elena,

Unfortunately this bug happened on production data, and we cannot share it for various reasons. Besides, the gzipped mysqldump is about 500G, so that's not practical either. I'll get the config files uploaded tomorrow, though exactly the same configs (ignoring different memory sizes, thread counts and SSL settings) run fine on our dev environment.

One more piece of the puzzle: we did another test. I know it's not much, but with 500G of data the setup for the test takes almost 20 hours and it gets into an unrecoverable state afterwards, so it's really hard to do any debugging.

Best, Aleksey |
| Comment by Elena Stepanova [ 2013-02-19 ] |
|
Hi Aleksey,

Yes, you are right, uploading 500Gb wouldn't be practical even if the data were not private. Meanwhile, back to the initial description:

130218 7:14:08 [Note] Slave I/O thread: connected to master 'replication@10.240.170.40:53306',replication started in log 'mysql-bin.001384' at position 796660420

Or are you using both Galera replication and traditional replication on the same server? When you restart the slave after the crash, does it again attempt to start from the same position? If so, can you check the binary log to see what kind of event is at that position?

You also said "After core dump, the main mysqld is hanging and is not responding." What is the main mysqld; do you mean the master?

Thanks |
| Comment by Aleksey Sanin (Inactive) [ 2013-02-19 ] |
|
my.cnf |
| Comment by Aleksey Sanin (Inactive) [ 2013-02-19 ] |
|
The mysql config file is attached. It has the Galera options enabled, but at the moment of the crash all of those options were disabled. Since there were no other logs, that's the only log file we have.

Sorry, I was not clear about what is going on; let me try again. We currently have a Master->Slave setup that we are trying to convert to a Galera cluster. As a first step, we tried to simply swap the MariaDB binaries with the MariaDB-Galera binaries on the Slave, using the same my.cnf config (all the Galera options disabled), and continue replication as before. This is when we got the crash. Again, I'll stress that there was no Galera replication yet, just regular slave replication.

Regarding the thread pool, it was already disabled. The other bug I ran into occurred purely during shutdown; I didn't see any issue with it during normal operation. So this is not a likely candidate.

During Galera startup, it starts some kind of "recovery" as a separate process:

mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.XXXXXX

I believe this is the one that crashed (though I am not 100% sure). The "main" mysqld was still running but did not respond to anything but kill -9. This is not the master (in the replication) but the main mysqld started by mysqld_safe.

After restarting from the crash, it doesn't even get to the slave restart. The process dies complaining about the usual "InnoDB: Database page corruption on disk or a failed". We tried to recover it with innodb_force_recovery, but after recovery pt-checksum reported numerous diffs, so we decided it was not worth it (note that we ran pt-checksum just before the upgrade to the MariaDB-Galera binaries and it was clean).

Not sure I answered all your questions; let me know if I missed something. |
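For reference, the forced InnoDB recovery mentioned above is enabled via a my.cnf setting. This is a minimal sketch using the standard option name, not the reporter's actual configuration:

```ini
# my.cnf fragment (sketch, not the reporter's actual file).
# innodb_force_recovery accepts levels 1-6; start at 1 and raise only
# as needed. Levels 4 and above can permanently lose or corrupt data,
# so the usual advice is to dump the tables and reload them into a
# fresh datadir rather than keep running in this mode.
[mysqld]
innodb_force_recovery = 1
```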
| Comment by Elena Stepanova [ 2013-02-19 ] |
|
Hi Aleksey, Thank you for the information, it clarifies a lot. |
| Comment by Aleksey Sanin (Inactive) [ 2013-02-19 ] |
|
Great! Do you know which options caused it? |
| Comment by Elena Stepanova [ 2013-02-19 ] |
|
I don't think there was anything wrong with your configuration as such; it's just a bug that needs to be fixed. Before starting the mysqld server for real, mysqld_safe from the Galera distribution runs it with the wsrep-recover option to obtain the wsrep start position. This happens unconditionally, even if no wsrep options are set in the server configuration.

That said, if I understand the reason for the crash correctly, you could have avoided it if you had skip-slave-start in your server config and started replication manually instead; but since experiments in your environment are so expensive due to the size of the database, it might make sense to wait until the Galera developers confirm the theory.

The database corruption is another story; I haven't seen it in my tests. Probably it's due to the huge size of your on-disk data or the memory footprint (since you have 28 GB for the buffer pool) that the crashing server hangs, or just takes too long dying, and you have to kill it in a dirty way. Still, it's not particularly clear why the data would have been corrupted. I have a few wild guesses about that, but unfortunately from the log excerpt we can't see what InnoDB had been doing at the moment of the crash, or what kind of event the slave was applying, if any (e.g. if it was an ALTER TABLE, the data could indeed have been damaged). So I'm afraid this part might remain a mystery for the time being. |
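The workaround described above can be sketched as a my.cnf fragment (the option name is the standard MariaDB one; this is an illustration, not the reporter's actual file):

```ini
# my.cnf fragment (sketch): keep the slave I/O and SQL threads from
# starting automatically, so the unconditional wsrep-recover pre-run
# cannot trip over an active slave.
[mysqld]
skip-slave-start
```

Once the server is up and healthy, replication would then be started manually with `START SLAVE;` from a client session.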
| Comment by Aleksey Sanin (Inactive) [ 2013-02-19 ] |
|
Thanks for the update. Since I am very interested in figuring this out and since the whole slave setup was a temporarily thing for the upgrade anyway, I am going to ask our guys to try skip-slave-start tonight. In the best case we get a working galera cluster, in the worst thing the slave DB server will have to load our data one more time Regarding data corruption, the 28G value was actually a test. Since the error is in NULL pointer, we tried to shrink the pool size to see if there is an OOM error in the wsrep or something. The actual value is 80G which covers our "active" data set pretty well. Lastly, I've actually remembered that we've seen similar issue on dev environment though stack trace was different: https://mariadb.atlassian.net/browse/MDEV-4158 It was the same upgrade process though it didn't crash 100% of the time. May be it is a timing issue somewhere? Anyway, thanks again for your help. |
| Comment by Elena Stepanova [ 2013-02-19 ] |
|
Hi Seppo, To reproduce the crash, I do the following (on Ubuntu 11.10 64-bit, maria-5.5-galera revno 3378, debug build; can't use 3380 due to
mysqld --no-defaults --server-id=2 --datadir=<datadir2> --basedir=<basedir> --port=8307 --loose-lc-messages-dir=<basedir>/sql/share --loose-language=<basedir>/sql/share/english/english --socket=<socket2> wsrep-recover
safe_mutex: Trying to lock unitialized mutex at /home/elenst/maria-5.5-galera/sql/slave.cc, line 4267
Thread 5 (Thread 0x7fb924414740 (LWP 6745)):
Thread 1 (Thread 0x7fb9243c9700 (LWP 6766)): |
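For context, the wsrep-recover invocation above mirrors what mysqld_safe itself does: it runs mysqld once with the recover option, then parses the recovered position out of the temporary error log. A minimal sketch of that parsing step, assuming the usual "WSREP: Recovered position: &lt;uuid&gt;:&lt;seqno&gt;" log-line format; the log file here is simulated so the sketch is self-contained:

```shell
# Sketch of mysqld_safe's position-recovery step (assumption: the
# "WSREP: Recovered position: <uuid>:<seqno>" log format; the log is
# simulated rather than produced by a real mysqld run).
log=$(mktemp)
cat > "$log" <<'EOF'
130218  7:14:08 [Note] WSREP: Recovered position: d07d44b1-acab-11e2-0800-a4b4a09fdbf6:95459
EOF
# Extract "<uuid>:<seqno>" -- this is the value the real server is then
# started with as its wsrep start position.
start_position=$(grep 'WSREP: Recovered position:' "$log" | sed 's/.*Recovered position: *//')
echo "$start_position"
rm -f "$log"
```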
| Comment by Seppo Jaakola [ 2013-02-26 ] |
|
There are two issues here. Issue #2 has been reported in https://bugs.launchpad.net/codership-mysql/+bug/1132974, and the fix was merged as part of revision http://bazaar.launchpad.net/~maria-captains/maria/maria-5.5-galera/revision/3383. |
| Comment by Xiaoqiao Guo (Inactive) [ 2013-04-25 ] |
|
My MariaDB cluster, 5.5.29 with Galera 23.2.4, also crashed, and I cannot get stack info.

130425 9:04:25 [Warning] WSREP: last seen seqno below limit for trx source: d07d44b1-acab-11e2-0800-a4b4a09fdbf6 version: 2 local: 1 state: CERTIFYING flags: 129 conn_id: 5 trx_id: 23378603 seqnos (l: 89318, g: 95459, s: 95448, d: -1, ts: 1366851863915610973)
To report this bug, see http://kb.askmonty.org/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
Server version: 5.5.29-MariaDB-log
Thread pointer: 0x0x7f86c00428d0
Trying to get some variables.
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=off
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains

and my.cnf content:

[mysqld]
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
general_log=1
skip-name-resolve
#log-bin=/var/log/mysql/mysql-bin
innodb_buffer_pool_size = 512M
[mysqld_safe] |
| Comment by Xiaoqiao Guo (Inactive) [ 2013-04-25 ] |
|
It crashed after running for only a short time; it has crashed 3 times this week. |
| Comment by Seppo Jaakola [ 2013-05-27 ] |
|
@Xiaoqiao, it looks like your issue is not related to the slave-processing bug reported in this tracker. To troubleshoot your crash further, first make sure that you have binlog_format=ROW. |
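As a sketch of the check Seppo suggests (standard MariaDB option and SQL, not taken from the reporter's setup):

```ini
# my.cnf fragment (sketch): Galera replicates via row events only,
# so the binary log format must be ROW on every cluster node.
[mysqld]
binlog_format = ROW
```

The running value can be verified from a client with `SHOW GLOBAL VARIABLES LIKE 'binlog_format';`.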
| Comment by Sergei Golubchik [ 2018-04-11 ] |
|
No feedback, so closing |