[MDEV-10283] SEGV on ANALYZE LOCAL TABLE __ PERSISTENT FOR ALL

Created: 2016-06-24 Updated: 2021-04-28 Resolved: 2021-04-28

| Status: | Closed |
| Project: | MariaDB Server |
| Component/s: | OTHER |
| Affects Version/s: | 10.0.25 |
| Fix Version/s: | N/A |
| Type: | Bug |
| Priority: | Major |
| Reporter: | Tom |
| Assignee: | Unassigned |
| Resolution: | Incomplete |
| Votes: | 0 |
| Labels: | None |
| Environment: | Linux hv3 3.13.0-86-generic #131-Ubuntu SMP Thu May 12 23:33:13 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux |
| Description |
|
This happened on two different servers that ran the same command on copies of the same MyISAM table. Command was: ANALYZE LOCAL TABLE ac_schematic PERSISTENT FOR ALL.
The table has 66126857 rows, 1.8GiB data length, and 5.1GiB index length. The same command on two other, somewhat smaller tables did not crash.
|
| Comments |
| Comment by Elena Stepanova [ 2016-06-25 ] |
|
So far I couldn't reproduce the problem with the same (mostly) config file, same table structure and size and random data, on a Galera 10.0.25 node.
| Comment by Elena Stepanova [ 2016-06-25 ] |
|
Meanwhile, psergey, cvicentiu, could you please take a look at the stack trace? Even though it's not from a debug build, it's readable; maybe you'll have some ideas.
| Comment by Tom [ 2016-06-25 ] |
|
Hi Elena, I'll do my best to answer your questions. Some are easier than others; I'll start with the easy ones. I was not able to reproduce the crash with the simple tests given below. But I want to reiterate that the same crash happened on two servers, hv3 and hv4, at the same point in a weekly process, about 10 minutes apart, which corresponds to the staggered update start times. There is more possibly relevant detail in 5) below.

1) The ac_schematic table definition is correct. I typed ac2_schematic into the report by mistake (now corrected). The software using the table is called ac2, but the table is called ac_schematic. There is no table called ac2_schematic.

2) No.

3) This would be very nervous-making. I'm not sure it's worth the trouble given the result in 2) above.

4) There is nothing unusual in the error log. Every hour, node hv5 desyncs itself for about 4 minutes to make a backup. hv3 and hv4 (the two servers on which the same crash happened executing the same command) get a note in the error log every time this happens, which serves as a timestamp in the error log, and which looks like this:
For a while I considered that the size of the error log on hv3 might be causing a problem; it was 4.7GiB. But then the same crash happened on hv4 a few minutes later, and its error log was a reasonable size.

5) The MyISAM data and index files for this database are prepared elsewhere and copied to the server. After the copy to a temp dir outside datadir, the script runs in a shell: myisamchk --silent --force --update-state --key_buffer_size=1G --sort_buffer_size=1G --read_buffer_size=16M --write_buffer_size=16M *.MYI. After this the files are copied into datadir. myisamchk did not log any errors. Then the same script runs CHECK TABLE ... CHANGED on each table immediately prior to ANALYZE. It did not log any errors either.

6) Yes, I'll upload those files if I can manage.
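For clarity, the weekly refresh described in 5) above can be sketched roughly as follows. This is a dry-run reconstruction, not the reporter's actual script: the database name, datadir path, and the echo-only wrapper are assumptions added for illustration.

```shell
#!/bin/sh
# Dry-run sketch of the weekly refresh pipeline from 5) above.
# The database name and datadir path are hypothetical; each step
# is echoed rather than executed.
refresh_pipeline() {
    db=autocomplete1                    # hypothetical database name
    # 1) Rebuild/verify indexes in a temp dir outside datadir.
    echo "myisamchk --silent --force --update-state" \
         "--key_buffer_size=1G --sort_buffer_size=1G" \
         "--read_buffer_size=16M --write_buffer_size=16M *.MYI"
    # 2) Copy the prepared files into datadir (path is hypothetical).
    echo "cp ./*.MYI ./*.MYD /var/lib/mysql/$db/"
    # 3) Verify each table, then collect statistics.
    for t in ac_schematic; do
        echo "mysql $db -e 'CHECK TABLE $t CHANGED'"
        echo "mysql $db -e 'ANALYZE LOCAL TABLE $t PERSISTENT FOR ALL'"
    done
}
refresh_pipeline
```

The LOCAL keyword (a synonym for NO_WRITE_TO_BINLOG) keeps the ANALYZE out of the binary log, which matters in a replicated setup like this one.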
| Comment by Tom [ 2016-06-25 ] |
|
I want to draw attention to these lines in my.cnf:
The ac_* tables in the autocomplete1 and autocomplete2 databases are read-only as far as the app is concerned. They are recreated once a week, double-buffer style. They are MyISAM because creating the tables is extremely slow in InnoDB (tens of hours vs hours). The recreation and updates use massive DML queries that I don't dare replicate in Galera. I would prefer to have these tables in a separate mysqld from the rest of the app but haven't had time to do that yet.

I introduced the ANALYZE LOCAL TABLE command into the weekly process (as described in 5) above) for the first time this week on the production servers. I did this because, after a MariaDB update a few weeks ago, the Sphinx indexer sometimes selects far fewer rows from the autocomplete tables than it should. This error corresponded to incorrect table stats as shown in MySQL Workbench, and the Sphinx indexer ran correctly after I analyzed the tables from MySQL Workbench. My suspicion is that the problem could be related to how DDL like ANALYZE TABLE (statement replication) behaves with replicate_ignore_* and binlog_ignore_* in a Galera cluster. Just speculation. For now I will remove "PERSISTENT FOR ALL" and see if "ANALYZE LOCAL TABLE ac_schematic" crashes any servers next week.
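As a purely hypothetical illustration, replication ignore rules of the following general shape are the kind of settings the speculation above refers to; the database and table names here are invented and are not the reporter's actual my.cnf values.

```ini
# Hypothetical example only -- not the reporter's actual configuration.
[mysqld]
binlog_ignore_db            = autocomplete1
binlog_ignore_db            = autocomplete2
replicate_ignore_db         = autocomplete1
replicate_ignore_db         = autocomplete2
```

Such rules filter statements out of the binlog and out of replication, and their interaction with statement-replicated DDL in a Galera cluster is exactly what the reporter is speculating about.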
| Comment by Tom [ 2016-06-25 ] |
|
I believe I have uploaded the requested files via ftp. Please confirm.
| Comment by Elena Stepanova [ 2016-06-25 ] |
|
Yes, the files have arrived, thanks. Maybe we'll have some luck with them, although of course, since it's not reliably reproducible on your side, that hope is rather optimistic.
| Comment by Tom [ 2016-06-26 ] |
|
ANALYZE LOCAL TABLE __ PERSISTENT FOR ALL can use a surprising amount of filesystem space. While trying to recreate the crash, I had a test VM with 55.7GiB used and 42.6GiB free in the filesystem holding datadir and tmpdir. While the ANALYZE command ran, usage gradually reached 98.3GiB and free space reached 0 (zero), then both returned to the initial state within the next 5 seconds. In the command output there was a line
The largest data file analyzed was 6.6GiB and the largest index file was 5.1GiB, so it's surprising that mysqld used more than 42GiB of additional file space to analyze the tables. The docs do provide a warning: "Currently, engine-independent statistics is collected by doing full table and full index scans, and can be quite expensive." But this implies I/O and CPU expense, not disk space.
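One way to observe a transient tmp-space spike like this is to sample the available space on the filesystem holding tmpdir once a second while the ANALYZE runs. A minimal sketch, where the directory and sample count are placeholders:

```shell
#!/bin/sh
# Sample available space (in KiB) on the filesystem holding a directory.
# The directory and sample count are placeholders; run this alongside
# the ANALYZE TABLE to catch the peak usage.
monitor_free_kb() {
    dir=$1
    samples=$2
    i=0
    while [ "$i" -lt "$samples" ]; do
        # POSIX df -P: column 4 of the data row is available 1024-blocks.
        df -kP "$dir" | awk 'NR==2 {print $4}'
        i=$((i + 1))
        # sleep 1    # enable for real once-per-second sampling
    done
}
monitor_free_kb /tmp 3
```

Piping the output through `sort -n | head -n 1` afterwards gives the low-water mark, i.e. how close the filesystem came to 0 during the scan.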
| Comment by Elena Stepanova [ 2016-06-26 ] |
|
The note actually implies everything (maybe the wording is poor, though). We have a task about the tmp space problem: MDEV-6529.
| Comment by Sergei Petrunia [ 2016-06-26 ] |
|
The stack trace unfortunately doesn't give me any clue about what could be happening.
| Comment by Tom [ 2016-06-26 ] |
|
The test VM I used had equivalent tables, data, and filesystem dimensions. It used more than 42GiB to do the ANALYZE TABLE. The server that crashed had about 40GiB free, so I think I can infer that it did run out of space. The test VM did not crash, but it was also not running Galera with a steady stream of wsrep transactions going both up and down.
| Comment by Elena Stepanova [ 2016-06-26 ] |
|
Thanks, that's helpful. I suppose it's possible that if the server runs out of space at an unfortunate moment, it can crash rather than complain about it, as it did in your test VM and as was happening in the other bug report. I will see if I can reproduce the crash this way.
| Comment by Elena Stepanova [ 2016-06-29 ] |
|
Looking further into it, I take back my words about this being the same problem as MDEV-6529. While with INT columns the disk usage is also somewhat higher than desired, it should be nowhere near the values that you observe. I ran the same ANALYZE on your table and, as expected, saw a peak tmp disk usage of ~300 MB (measurements were taken every second, so the peak couldn't have been missed). So, if your server really does use over 40 GB executing the same query on the same table, that is very strange and worrisome. If you can give us temporary access to your test server, psergey is willing to investigate the disk usage problem "on site". If he discovers the reason, maybe it will turn out to be the root cause of the initial crash as well.
| Comment by Elena Stepanova [ 2016-07-27 ] |
|
thefsb, are you still experiencing the problem? And if you are, can you provide the access as suggested above?
| Comment by Tom [ 2016-07-30 ] |
|
Hi Elena, I changed my app, removing the PERSISTENT FOR ALL clause, and I no longer experience this problem. I cannot easily set up the access you would need, so that's all rather unhelpful. Sorry. On the plus side, you can probably close this bug, because I think I got some information wrong in the report. My app runs ANALYZE TABLE on about 10 tables in sequence, and I think I was wrong about which of them triggered this error. I now believe it was the previous table, which does have VARCHAR columns.