[MCOL-888] WriteEngineServ segfault .. in libstdc++.so.6.0 Created: 2017-08-22  Updated: 2021-01-16  Resolved: 2021-01-16

Status: Closed
Project: MariaDB ColumnStore
Component/s: writeengine
Affects Version/s: 1.0.10
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Nivesh Assignee: Unassigned
Resolution: Won't Fix Votes: 0
Labels: None
Environment:

CentOS 7
MariaDB ColumnStore 1.0.10
2 separate UMs
4 separate PMs
ext4 on LVM


Attachments: critlog.tar.gz

 Description   

We have moved our environment to CentOS 7 and are getting a WriteEngineServ crash:

[Aug22 09:46] traps: WriteEngineServ[31598] general protection ip:7f967c8ecd77 sp:7f9671cdc5c8 error:0 in libstdc++.so.

The error occurs every couple of hours on every node in the cluster.

I've attached the crit log, which contains errors such as the following:

Aug 22 09:46:24 controllernode[14284]: 24.230055 |0|0|0| C 29 CAL0000: ExtentMap::getLocalHWM(): invalid OID requested: -1839475712
Aug 22 09:46:24 PrimProc[14284]: 24.230451 |0|0|0| C 28 CAL0000: Invalid Range from HWM for lbid 4955611136, range size should be <= blocksReadAhead: HWM 0, dbroot 10055, highfbo 1215999, lowfbo 1215488, blocksReadAhead 512, range size -1215487
Aug 22 09:46:24 controllernode[14284]: 24.236529 |0|0|0| C 29 CAL0000: ExtentMap::getLocalHWM(): invalid OID requested: -1839475712
Aug 22 09:46:24 PrimProc[14284]: 24.236773 |0|0|0| C 28 CAL0000: Invalid Range from HWM for lbid 4955611137, range size should be <= blocksReadAhead: HWM 0, dbroot 10055, highfbo 1215999, lowfbo 1215488, blocksReadAhead 512, range size -1215487

I have only attached the crit log, as it is the only log I have sanitized.
Please let me know which other log files are required.



 Comments   
Comment by David Hill (Inactive) [ 2017-08-22 ]

Not a happy system. It looks like the Extent Map entries aren't matching the database, which is causing the issue. The Extent Map is kept in the DBRM files in the pm1 data1 directory.
You report that you moved to a CentOS 7 system, so my best guess is that when the database was backed up or copied from the old system to the new system, the Extent Map got out of sync.
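For reference, a minimal shell sketch of where those files live, assuming the default install prefix /usr/local/mariadb/columnstore (the paths in your logs suggest a custom prefix under /home/mariadb, so adjust accordingly):

# On PM1: the BRM/Extent Map images live under data1/systemFiles/dbrm
ls -l /usr/local/mariadb/columnstore/data1/systemFiles/dbrm/
# BRM_saves_em holds the saved Extent Map; back up the whole directory
# before attempting any repair
tar czf /tmp/dbrm-backup.tar.gz -C /usr/local/mariadb/columnstore/data1/systemFiles dbrm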
Possible ways to recover:
1. Do you still have the previous system you were running on? If so, you could shut that system down, take a backup, and restore it to the CentOS 7 system.
2. Or, if it is just a table or two that is out of sync, you could look at the UM debug logs and map a specific query to the outage. If that is possible, you could try dropping that table and re-creating it with its data (see the sketch just below).
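A rough sketch of option 2, with hypothetical names (bad_table in schema mydb, reloaded from a previously exported tab-delimited file):

# Drop the out-of-sync table and re-create it
mcsmysql mydb -e "DROP TABLE bad_table;"
mcsmysql mydb -e "CREATE TABLE bad_table (id INT, val VARCHAR(255)) ENGINE=Columnstore;"
# Reload the data with cpimport
cpimport mydb bad_table /tmp/bad_table.tsv -s '\t'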

One question: how did you go about moving from the old environment to this new CentOS 7 environment? What procedure did you follow, and what was the old environment (CentOS 6, for example)?

Comment by Nivesh [ 2017-08-23 ]

Hi David, we moved every database and table manually, row by row, from the CentOS 6 InfiniDB system to the new CentOS 7 MariaDB ColumnStore system.

Here are some more errors I got while testing only user load on this environment, with no table loads or DDL happening.
I am trying to isolate a specific query/table from the debug log; however, all queries in the test cases within the times of the failures execute properly.

ERROR 1815 (HY000) at line 1412: Internal error: An unexpected condition within the query caused an internal processing error within InfiniDB. Please check the log files for more details. Additional Information: error in BatchPrimitiveProces
ERROR 1815 (HY000) at line 1412: Internal error: An unexpected condition within the query caused an internal processing error within InfiniDB. Please check the log files for more details. Additional Information: error in BatchPrimitiveProces

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:11 PrimProc[12168]: 11.786726 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51204; decompress retry for /home/mariadb/mariadb/columnstore//data2/000.dir/000.dir/200.dir/004.dir/004.dir/FILE002.cdf decompRetry chunk 9 code=-1

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:11 PrimProc[12168]: 11.786726 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51204; decompress retry for /home/mariadb/mariadb/columnstore//data2/000.dir/000.dir/200.dir/004.dir/004.dir/FILE002.cdf decompRetry chunk 9 code=-1

==> /var/log/mariadb/columnstore/crit.log <==
Aug 23 11:10:14 PrimProc[12168]: 14.687204 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51204; Error decompressing block 9 code=-1 part=4 seg=2
Aug 23 11:10:14 PrimProc[12168]: 14.687415 |0|0|0| C 28 CAL0000: Error decompressing block 9 code=-1 part=4 seg=2

==> /var/log/mariadb/columnstore/err.log <==
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687204 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51204; Error decompressing block 9 code=-1 part=4 seg=2
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687415 |0|0|0| C 28 CAL0000: Error decompressing block 9 code=-1 part=4 seg=2

==> /var/log/mariadb/columnstore/warning.log <==
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687204 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51204; Error decompressing block 9 code=-1 part=4 seg=2
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687415 |0|0|0| C 28 CAL0000: Error decompressing block 9 code=-1 part=4 seg=2

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:14 PrimProc[12168]: 14.687204 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51204; Error decompressing block 9 code=-1 part=4 seg=2
Aug 23 11:10:14 PrimProc[12168]: 14.687415 |0|0|0| C 28 CAL0000: Error decompressing block 9 code=-1 part=4 seg=2

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687204 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51204; Error decompressing block 9 code=-1 part=4 seg=2
Aug 23 11:10:14 s4-totus-pm-02 PrimProc[12168]: 14.687415 |0|0|0| C 28 CAL0000: Error decompressing block 9 code=-1 part=4 seg=2

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:13 s4-totus-pm-03 PrimProc[30135]: 13.184373 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51243; decompress retry for /home/mariadb/mariadb/columnstore//data3/000.dir/000.dir/200.dir/043.dir/004.dir/FILE003.cdf decompRetry chunk 0 code=-3

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:13 PrimProc[30135]: 13.184373 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51243; decompress retry for /home/mariadb/mariadb/columnstore//data3/000.dir/000.dir/200.dir/043.dir/004.dir/FILE003.cdf decompRetry chunk 0 code=-3

==> /var/log/mariadb/columnstore/crit.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399164 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/err.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399164 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/warning.log <==
Aug 23 11:10:16 s4-totus-pm-03 PrimProc[30135]: 16.399164 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:16 s4-totus-pm-03 PrimProc[30135]: 16.399164 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399164 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/crit.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399389 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/err.log <==
Aug 23 11:10:16 s4-totus-pm-03 PrimProc[30135]: 16.399389 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/warning.log <==
Aug 23 11:10:16 s4-totus-pm-03 PrimProc[30135]: 16.399389 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399389 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:16 PrimProc[30135]: 16.399389 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:18 PrimProc[30135]: 18.584765 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51243; decompress retry for /home/mariadb/mariadb/columnstore//data3/000.dir/000.dir/200.dir/043.dir/004.dir/FILE003.cdf decompRetry chunk 0 code=-3

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:18 PrimProc[30135]: 18.584765 |0|0|0| I 28 CAL0061: PrimProc error reading file for OID 51243; decompress retry for /home/mariadb/mariadb/columnstore//data3/000.dir/000.dir/200.dir/043.dir/004.dir/FILE003.cdf decompRetry chunk 0 code=-3

==> /var/log/mariadb/columnstore/crit.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169421 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/err.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169421 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3
Aug 23 11:10:22 s4-totus-pm-03 PrimProc[30135]: 22.169801 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/warning.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169421 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3
Aug 23 11:10:22 s4-totus-pm-03 PrimProc[30135]: 22.169801 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/info.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169421 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3
Aug 23 11:10:22 s4-totus-pm-03 PrimProc[30135]: 22.169801 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/debug.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169421 |0|0|0| C 28 CAL0061: PrimProc error reading file for OID 51243; Error decompressing block 0 code=-3 part=4 seg=3
Aug 23 11:10:22 s4-totus-pm-03 PrimProc[30135]: 22.169801 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/crit.log <==
Aug 23 11:10:22 PrimProc[30135]: 22.169801 |0|0|0| C 28 CAL0000: Error decompressing block 0 code=-3 part=4 seg=3

==> /var/log/mariadb/columnstore/err.log <==

==> /var/log/mariadb/columnstore/warning.log <==

Comment by Daniel Lee (Inactive) [ 2017-08-23 ]

A few questions:

1) What was the old configuration (How many UMs and PMs)?
2) What procedure did you follow to migrate the database to the new environment?
3) Approximately how much data (GBs) is in the database?
4) Is this a production or test environment?

Thx

Comment by Nivesh [ 2017-08-24 ]

Hi David.

1) The configuration from old -> new:
Old: 2 UMs, 4 PMs with 2 DBRoots per PM
New: 2 UMs, 4 PMs with 1 DBRoot per PM

2) Method used to migrate: I'm pasting the script below (a generalized sketch follows this list).
mcsmysql -e 'select column1,column2 from table' | cpimport tableschema table_name -s '\t' -d2 -e......

3) 16 TB of data was moved.

4) This is a production environment.
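For context, a generalized sketch of that per-table pipeline; the schema name and table list are placeholders, and the real script also passed -e with an error threshold:

# Hypothetical wrapper around the one-liner above, run once per table
SCHEMA=tableschema
for TBL in table1 table2 table3; do
    mcsmysql --batch --skip-column-names -e "SELECT * FROM ${SCHEMA}.${TBL}" \
        | cpimport "$SCHEMA" "$TBL" -s '\t' -d2
done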

I have discovered that only certain tables have this issue,
and all of the problem tables were being loaded in production when a bad query caused excessive memory usage and crashed the database.
I'm testing to see if I can reproduce the load-crash and whether it affects the extent map.

Comment by Nivesh [ 2017-08-27 ]

I have been able to reproduce the error. Here are the scripts:

CREATE TABLE `test3` (
ID int(11) DEFAULT NULL,
test varchar(8000) DEFAULT NULL,
test2 varchar(8000) DEFAULT NULL
) ENGINE=Columnstore;

CREATE TABLE `test3_stg` (
ID int(11) DEFAULT NULL,
test varchar(8000) DEFAULT NULL,
test2 varchar(8000) DEFAULT NULL
) ENGINE=Columnstore;

insert into test3_stg values (1,'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.!@#$%()?><:{}[];,./|=', 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.!@#$%()?><:{}[];,./|=');

-- repeat the above insert 200,000 times (see the doubling trick in the sketch after these scripts)

insert into test3 select * from test3_stg;

-- Crash WriteEngine and PrimProc via memory exhaustion, or, faster, kill the PrimProc and WriteEngine processes on any node.
-- That node then does not roll back the extents correctly, corrupting the test3 table.
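For completeness, a shell sketch of the whole reproduction, assuming schema test and the tables above; the doubling loop, the sleep, and the kill step are illustrative assumptions rather than a confirmed procedure:

# Bulk up the staging table: doubling 18 times turns 1 row into ~262k rows
# (if self-referencing INSERT...SELECT misbehaves, stage through a copy table)
for i in $(seq 1 18); do
    mcsmysql test -e "INSERT INTO test3_stg SELECT * FROM test3_stg;"
done

# Start the large INSERT...SELECT (it spawns cpimport under the hood)
mcsmysql test -e "INSERT INTO test3 SELECT * FROM test3_stg;" &

# On any PM, kill the processes mid-load to simulate the crash
sleep 10                  # let the load get underway
pkill -9 PrimProc
pkill -9 WriteEngineServ
# Once the process monitor restarts them, test3 is left with corrupted extents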

Comment by Nivesh [ 2017-08-28 ]

FYI, we run the database on ext4 with no journaling or caching.

Comment by Andrew Hutchings (Inactive) [ 2017-08-29 ]

For the INSERT...SELECT, are you doing it as part of a transaction or with autocommit? If it is part of a transaction, can you please try it with autocommit?
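To make the distinction concrete, a generic client sketch using the repro tables (nothing ColumnStore-specific beyond the mcsmysql wrapper):

# Autocommit: the statement commits on its own
mcsmysql test -e "SET autocommit=1; INSERT INTO test3 SELECT * FROM test3_stg;"

# Explicit transaction: nothing is committed until COMMIT
mcsmysql test -e "START TRANSACTION; INSERT INTO test3 SELECT * FROM test3_stg; COMMIT;"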

Comment by Nivesh [ 2017-08-30 ]

The INSERT...SELECT is run with autocommit, not in a transaction, and the MySQL flags that make INSERT INTO use cpimport are set, i.e.:

infinidb_use_import_for_batchinsert=1
infinidb_import_for_batchinsert_delimiter=7

When monitoring the INSERT...SELECT, I see a cpimport process running (see the sketch below).
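A quick sketch of how this can be checked from the shell; the LIKE pattern and the grep are assumptions, the variable names are as above:

# Confirm the batch-insert flags are set
mcsmysql -e "SHOW VARIABLES LIKE 'infinidb%batchinsert%';"

# While the INSERT...SELECT runs, the spawned cpimport should be visible
ps -ef | grep '[c]pimport'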

This is just the easiest way to recreate the issue:
on any cpimport of more than 200,000 rows, killing PrimProc on any node creates instability and eventually crashes the DB.
This then corrupts the table that was being loaded.

Comment by Andrew Hutchings (Inactive) [ 2017-08-30 ]

Ah, I see. Yes, the PrimProc crash would happen depending on the amount of system RAM. I am not quite sure why WriteEngine would crash at that point; it shouldn't be used here. But in any case there appears to be corruption of the column files, which we should look into.

Comment by Nivesh [ 2017-09-04 ]

Hi

Do you have any idea what is causing this? Will it be fixed soon?

We have had to roll back to InfiniDB, and we see the same PrimProc/WriteEngine
crashes there (generally caused by very bad queries from the users);
however, they do not corrupt the data, because the extent map rolls back those segments.

Comment by Andrew Hutchings (Inactive) [ 2017-09-04 ]

Hi,

There is no update at this time; we will try to reproduce it soon. Thanks for letting us know about the InfiniDB testing.

Comment by Nivesh [ 2017-11-02 ]

Hi

Has anyone been able to replicate this issue?
I will try reviving our migration from InfiniDB to MariaDB during our slow time in December.

Comment by Roman [ 2020-03-21 ]

This issue could potentially have been caused by the default Linux memory allocator used before 1.2.3.
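For anyone still on an old build, a heavily hedged sketch of forcing an alternative allocator via LD_PRELOAD; the tcmalloc library path is an assumption (gperftools-libs on CentOS 7), it only affects processes launched from this shell, and it is illustrative rather than a confirmed workaround:

# Assumption: gperftools-libs is installed; the library path varies by distro
export LD_PRELOAD=/usr/lib64/libtcmalloc.so.4
# Processes started from this shell now use tcmalloc instead of glibc malloc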
