[MDEV-26816] Galera cluster received "mariadbd[2354817]: segfault" Error Created: 2021-10-13  Updated: 2022-01-26  Resolved: 2021-11-29

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Aria
Affects Version/s: 10.6.4
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Pon Suresh Pandian (Inactive) Assignee: Michael Widenius
Resolution: Cannot Reproduce Votes: 0
Labels: Aria, need_feedback
Environment:

Red Hat Linux 8


Attachments: Text File alf_node (1).txt     Text File alf_node_aspects (1).txt     File alf_node_aspects_data.sql     File alf_node_data.sql     Text File alf_node_properties (1).txt     File alf_node_properties_data.sql     Text File db02dmesg.txt     Text File full_bt_all_threads.txt    
Issue Links:
Relates

 Description   

Hi Team,

The customer got segfault errors on the DB nodes. I have attached the backtrace report. It looks like two complex SELECTs were involved, and one of them crashed while accessing an invalid memory area in the Aria temporary-table code.

Please check the attached backtrace report.

mariadbd[2354817]: segfault at 7f5ac5fb8862 ip 00007f5a8d127764 sp 00007f5a80438468 error 4 in libc-2.28.so[7f5a8cfc7000+1bc000]
 
[Tue Oct  5 15:24:37 2021] Core dump to |/opt/dynatrace/oneagent/agent/rdp -p 3204035 -P 3204035 -e mariadbd -s 11 pipe failed
[Tue Oct  5 15:24:54 2021] mariadbd[3205298]: segfault at 3dd0 ip 000055ca9186ca60 sp 00007f855c1fe508 error 4 in mariadbd[55ca91147000+1536000]
[Tue Oct  5 15:24:54 2021] Code: 00 00 00 00 00 00 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 48 83 bb 50 02 00 00 00 0f 85 00 ff ff ff e9 4e ff ff ff 0f 1f 44 00 00 <48> 8b 87 d0 3d 00 00 c3 0f 1f 84 00 00 00 00 00 8b 97 f4 3d 00 00
[Tue Oct  5 15:24:54 2021] Core dump to |/opt/dynatrace/oneagent/agent/rdp -p 3205291 -P 3205291 -e mariadbd -s 11 pipe failed
[Tue Oct  5 15:25:14 2021] mariadbd[3206416]: segfault at 3dd0 ip 0000555794a11a60 sp 00007fdffc0c1508 error 4 in mariadbd[5557942ec000+1536000]
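
The kernel lines above can be decoded mechanically. As a minimal sketch (the function name decode_segfault and the returned field names are illustrative, not part of any MariaDB or kernel tool), the following Python parses one such line and computes the offset of the instruction pointer inside the mapped object. It also decodes the low three bits of the x86 page-fault error code, which the kernel prints in hexadecimal.

```python
import re

# Kernel log format on x86:
#   comm[pid]: segfault at ADDR ip IP sp SP error ERR in OBJECT[BASE+SIZE]
# All numeric fields except the pid are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing one kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    err = int(m["err"], 16)
    return {
        "comm": m["comm"],
        "pid": int(m["pid"]),
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        # Offset of the faulting instruction from the start of the mapping.
        "offset": ip - base,
        # x86 page-fault error code bits 0..2.
        "page_not_present": not (err & 0x1),
        "write_access": bool(err & 0x2),
        "user_mode": bool(err & 0x4),
    }
```

Applied to the lines above, the two mariadbd crashes at 15:24:54 and 15:25:14 both give offset 0x725a60, so the restarted server appears to have died at the same instruction each time. The libc-2.28.so crash gives offset 0x160764. Error 4 decodes as a user-mode read of an unmapped page. The offset is relative to the start of the mapping reported by the kernel, so resolving it to a function needs the exact same binary with symbols, and possibly an adjustment for where that mapping starts in the file.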



 Comments   
Comment by Pon Suresh Pandian (Inactive) [ 2021-11-30 ]

Hi Julien,

I have tested this in my environment and was not able to reproduce the issue. I have attached the DDL/DML files.

MariaDB [test]> select count(*) from alf_node;
+----------+
| count(*) |
+----------+
|   321777 |
+----------+
1 row in set (0.060 sec)
 
MariaDB [test]> select count(*) from alf_node_aspects;
+----------+
| count(*) |
+----------+
|  8192000 |
+----------+
1 row in set (17.464 sec)
 
MariaDB [test]> select count(*) from alf_node_properties;
+----------+
| count(*) |
+----------+
|   216000 |
+----------+
1 row in set (0.659 sec)
 
MariaDB [test]> select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC));
Empty set (0.001 sec)
 
Explain Plan :
------------
 
MariaDB [test]> explain select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC))\G
*************************** 1. row ***************************
           id: 1
  select_type: PRIMARY
        table: aspect
         type: ref
possible_keys: fk_alf_nasp_n,fk_alf_nasp_qn
          key: fk_alf_nasp_qn
      key_len: 8
          ref: const
         rows: 1
        Extra: Start temporary
*************************** 2. row ***************************
           id: 1
  select_type: PRIMARY
        table: node
         type: eq_ref
possible_keys: PRIMARY,fk_alf_node_store,idx_alf_node_mdq
          key: PRIMARY
      key_len: 8
          ref: test.aspect.node_id
         rows: 1
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: PRIMARY
        table: PROP
         type: ref
possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
          key: idx_alf_nprop_s
      key_len: 137
          ref: const,const
         rows: 1
        Extra: Using where
*************************** 4. row ***************************
           id: 1
  select_type: PRIMARY
        table: PROP
         type: ref
possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
          key: fk_alf_nprop_qn
      key_len: 8
          ref: const
         rows: 1
        Extra: Using where
*************************** 5. row ***************************
           id: 1
  select_type: PRIMARY
        table: PROP_0
         type: ref
possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
          key: fk_alf_nprop_n
      key_len: 8
          ref: test.aspect.node_id
         rows: 252
        Extra: Using where; End temporary
5 rows in set (0.001 sec)

Comment by Ramesh Sivaraman [ 2021-12-07 ]

Roel, I could not reproduce the issue using the provided dummy data.

node1:root@localhost> select count(1) from alf_node_properties;   
+-----------+
| count(1)  |
+-----------+
| 206600536 |
+-----------+
1 row in set (13 min 30.436 sec)
 
node1:root@localhost> 
node1:root@localhost> select count(1) from alf_node;
+-----------+
| count(1)  |
+-----------+
| 105317344 |
+-----------+
1 row in set (4 min 31.001 sec)
 
node1:root@localhost> select count(1) from alf_node_aspects;
+-----------+
| count(1)  |
+-----------+
| 180000451 |
+-----------+
1 row in set (7 min 32.753 sec)
 
node1:root@localhost> 
node1:root@localhost> select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC));
+----+
| id |
+----+
| 32 |
+----+
1 row in set (0.002 sec)
 
node1:root@localhost> 

Comment by Michael Widenius [ 2022-01-10 ]

I have examined the stack trace in detail, but unfortunately this is an optimized build and some of the vital information is not available.
The crash is in Aria's write_block_record(), when copying fixed-length record columns to the record page. It is not clear why the buffer is overrun. As this is very old and stable code, it is unclear how things could go wrong here.

A non-optimized build would be more helpful, as it gives more information in gdb traces that could show the issue.
In this particular case it is very likely that a full backtrace from a non-optimized build would show which structure the problem is in.

I am not sure that an ASAN/UBSAN build will help, as it is not clear whether this is a logical error in record length counting or a stray write into another memory structure that causes the fault. It is very likely that it would fail at exactly the same point without any additional information.

One way to find out what is going on is to give me remote access to the computer with gdb, the core and server source.
I am willing to log in to the customer site and do the debugging there to ensure we don't copy any sensitive data.
This would enable me to find out which internal structure is wrong and what could have caused it.

It would also help to get the mysqld.err file attached to this ticket (or at least all information related to this failure).
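
Before arranging a gdb session, it may be worth confirming that the node can produce a usable core file at all. The kernel log in the description shows "Core dump to |/opt/dynatrace/oneagent/agent/rdp ... pipe failed", which suggests that kernel.core_pattern on that node pipes cores to a helper program and that the pipe failed for those crashes. A minimal sketch of that check follows. The function names are illustrative, and reading /proc/sys/kernel/core_pattern assumes a Linux host.

```python
from pathlib import Path


def describe_core_pattern(pattern):
    """Classify a kernel.core_pattern value.

    A leading '|' means the kernel pipes the core to the named program
    instead of writing a file. Returns ("pipe", helper_path) or
    ("file", pattern).
    """
    pattern = pattern.strip()
    if pattern.startswith("|"):
        parts = pattern[1:].split()
        helper = parts[0] if parts else ""
        return ("pipe", helper)
    return ("file", pattern)


def read_core_pattern(path="/proc/sys/kernel/core_pattern"):
    """Read the current core pattern from procfs (Linux only)."""
    return Path(path).read_text()
```

If describe_core_pattern(read_core_pattern()) reports a pipe, the core is handed to that helper and whether a file is kept depends on the helper, so the setting may need to be changed temporarily before the next crash can be captured for gdb.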

Comment by Michael Widenius [ 2022-01-10 ]

Regarding the non-optimized build: note that the customer ONLY needs a copy of the mariadbd executable that he can use to temporarily replace the failing one. There is no need to build a full rpm for him.

In theory, someone could even log into the customer machine and compile it there. This would be the fastest way to get a quick turnaround for finding the problem (as several compile + fix + test cycles may be needed to find this bug...)
