  1. MariaDB Server
  2. MDEV-26816

Galera cluster received "mariadbd[2354817]: segfault" Error

Details

    Description

      Hi Team,

      A customer got segfault errors on the DB nodes of a Galera cluster. I have attached the backtrace report. It looks like two complex SELECTs were running, and one of them crashed while accessing an invalid memory area in the Aria temporary table code.

      mariadbd[2354817]: segfault at 7f5ac5fb8862 ip 00007f5a8d127764 sp 00007f5a80438468 error 4 in libc-2.28.so[7f5a8cfc7000+1bc000]
       
      [Tue Oct  5 15:24:37 2021] Core dump to |/opt/dynatrace/oneagent/agent/rdp -p 3204035 -P 3204035 -e mariadbd -s 11 pipe failed
      [Tue Oct  5 15:24:54 2021] mariadbd[3205298]: segfault at 3dd0 ip 000055ca9186ca60 sp 00007f855c1fe508 error 4 in mariadbd[55ca91147000+1536000]
      [Tue Oct  5 15:24:54 2021] Code: 00 00 00 00 00 00 5b 41 5c 5d c3 0f 1f 80 00 00 00 00 48 83 bb 50 02 00 00 00 0f 85 00 ff ff ff e9 4e ff ff ff 0f 1f 44 00 00 <48> 8b 87 d0 3d 00 00 c3 0f 1f 84 00 00 00 00 00 8b 97 f4 3d 00 00
      [Tue Oct  5 15:24:54 2021] Core dump to |/opt/dynatrace/oneagent/agent/rdp -p 3205291 -P 3205291 -e mariadbd -s 11 pipe failed
      [Tue Oct  5 15:25:14 2021] mariadbd[3206416]: segfault at 3dd0 ip 0000555794a11a60 sp 00007fdffc0c1508 error 4 in mariadbd[5557942ec000+1536000]
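
The kernel lines above can be decoded a little further. Two observations, both derived only from the log: the first crash has its instruction pointer inside libc-2.28.so, while the two later crashes are inside mariadbd itself at the same module offset and with the same fault address 3dd0. The faulting bytes `<48> 8b 87 d0 3d 00 00` decode to `mov 0x3dd0(%rdi),%rax`, which is consistent with a read through a NULL pointer in `rdi`. The following is a minimal sketch, not part of the original report, that extracts those numbers; the function name `decode_segfault` and the returned field names are made up for illustration, and the regular expression assumes the x86-64 Linux message format shown above.

```python
import re

# Matches x86-64 Linux kernel messages of the form seen in the dmesg excerpt:
#   <comm>[<pid>]: segfault at <addr> ip <ip> sp <sp> error <code> in <module>[<base>+<size>]
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<module>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing one segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m["err"], 16)  # the kernel prints the page-fault error code in hex
    return {
        "process": m["comm"],
        "pid": int(m["pid"]),
        "fault_addr": int(m["addr"], 16),
        "module": m["module"],
        # Offset of the faulting instruction from the module's load address;
        # this stays the same across restarts even with address randomization.
        "module_offset": int(m["ip"], 16) - int(m["base"], 16),
        "page_present": bool(err & 1),          # bit 0: 0 = page not mapped
        "access": "write" if err & 2 else "read",   # bit 1
        "mode": "user" if err & 4 else "kernel",    # bit 2
    }


if __name__ == "__main__":
    lines = [
        "mariadbd[2354817]: segfault at 7f5ac5fb8862 ip 00007f5a8d127764 sp 00007f5a80438468 error 4 in libc-2.28.so[7f5a8cfc7000+1bc000]",
        "[Tue Oct  5 15:24:54 2021] mariadbd[3205298]: segfault at 3dd0 ip 000055ca9186ca60 sp 00007f855c1fe508 error 4 in mariadbd[55ca91147000+1536000]",
        "[Tue Oct  5 15:25:14 2021] mariadbd[3206416]: segfault at 3dd0 ip 0000555794a11a60 sp 00007fdffc0c1508 error 4 in mariadbd[5557942ec000+1536000]",
    ]
    for line in lines:
        info = decode_segfault(line)
        print(info["module"], hex(info["module_offset"]), info["access"], info["mode"])
```

Given the exact mariadbd binary and its debug symbols, the module offset could then be mapped to a function, which may help tell whether the later crashes are the same bug as the first one or a separate failure.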
      

      Attachments

        1. alf_node_aspects_data.sql
          10 kB
        2. alf_node_aspects (1).txt
          0.5 kB
        3. alf_node_data.sql
          111 kB
        4. alf_node_properties_data.sql
          54 kB
        5. alf_node_properties (1).txt
          1 kB
        6. alf_node (1).txt
          2 kB
        7. db02dmesg.txt
          18 kB
        8. full_bt_all_threads.txt
          392 kB

        Activity

          Hi Julien,

          I tested this in my environment and was not able to reproduce the issue. I have attached the DDL/DML files.

          MariaDB [test]> select count(*) from alf_node;
          +----------+
          | count(*) |
          +----------+
          |   321777 |
          +----------+
          1 row in set (0.060 sec)
           
          MariaDB [test]> select count(*) from alf_node_aspects;
          +----------+
          | count(*) |
          +----------+
          |  8192000 |
          +----------+
          1 row in set (17.464 sec)
           
          MariaDB [test]> select count(*) from alf_node_properties;
          +----------+
          | count(*) |
          +----------+
          |   216000 |
          +----------+
          1 row in set (0.659 sec)
           
          MariaDB [test]> select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC));
          Empty set (0.001 sec)
           
          Explain Plan :
          ------------
           
          MariaDB [test]> explain select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC))\G
          *************************** 1. row ***************************
                     id: 1
            select_type: PRIMARY
                  table: aspect
                   type: ref
          possible_keys: fk_alf_nasp_n,fk_alf_nasp_qn
                    key: fk_alf_nasp_qn
                key_len: 8
                    ref: const
                   rows: 1
                  Extra: Start temporary
          *************************** 2. row ***************************
                     id: 1
            select_type: PRIMARY
                  table: node
                   type: eq_ref
          possible_keys: PRIMARY,fk_alf_node_store,idx_alf_node_mdq
                    key: PRIMARY
                key_len: 8
                    ref: test.aspect.node_id
                   rows: 1
                  Extra: Using where
          *************************** 3. row ***************************
                     id: 1
            select_type: PRIMARY
                  table: PROP
                   type: ref
          possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
                    key: idx_alf_nprop_s
                key_len: 137
                    ref: const,const
                   rows: 1
                  Extra: Using where
          *************************** 4. row ***************************
                     id: 1
            select_type: PRIMARY
                  table: PROP
                   type: ref
          possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
                    key: fk_alf_nprop_qn
                key_len: 8
                    ref: const
                   rows: 1
                  Extra: Using where
          *************************** 5. row ***************************
                     id: 1
            select_type: PRIMARY
                  table: PROP_0
                   type: ref
          possible_keys: fk_alf_nprop_n,fk_alf_nprop_qn,idx_alf_nprop_s,idx_alf_nprop_l,idx_alf_nprop_b,idx_alf_nprop_f,idx_alf_nprop_d
                    key: fk_alf_nprop_n
                key_len: 8
                    ref: test.aspect.node_id
                   rows: 252
                  Extra: Using where; End temporary
          5 rows in set (0.001 sec)
          

          Comment above added by Pon Suresh Pandian (Inactive) (ponsuresh.pandians).

          Roel, I could not reproduce the issue using the provided dummy data.

          node1:root@localhost> select count(1) from alf_node_properties;   
          +-----------+
          | count(1)  |
          +-----------+
          | 206600536 |
          +-----------+
          1 row in set (13 min 30.436 sec)
           
          node1:root@localhost> 
          node1:root@localhost> select count(1) from alf_node;
          +-----------+
          | count(1)  |
          +-----------+
          | 105317344 |
          +-----------+
          1 row in set (4 min 31.001 sec)
           
          node1:root@localhost> select count(1) from alf_node_aspects;
          +-----------+
          | count(1)  |
          +-----------+
          | 180000451 |
          +-----------+
          1 row in set (7 min 32.753 sec)
           
          node1:root@localhost> 
          node1:root@localhost> select node.id as id from alf_node node left outer join alf_node_properties PROP_0 on (PROP_0.node_id = node.id) AND (43 = PROP_0.qname_id) where node.type_qname_id <> 149 AND node.store_id = 6 AND node.type_qname_id IN (35) AND node.id IN (select PROP.node_id from alf_node_properties PROP where 919 = PROP.qname_id AND PROP.string_value = 'PKKM' AND node.id IN (select aspect.node_id from alf_node_aspects aspect where aspect.qname_id IN (1146) AND node.id IN (select PROP.node_id from alf_node_properties PROP where (1144 = PROP.qname_id) AND PROP.boolean_value = 1) order by PROP_0.string_value ASC));
          +----+
          | id |
          +----+
          | 32 |
          +----+
          1 row in set (0.002 sec)
           
          node1:root@localhost> 
          

          Comment above added by Ramesh Sivaraman (ramesh).

          monty Michael Widenius added a comment (edited):

          I have examined the stack trace in detail, but unfortunately this is an optimized build and some of the vital information is not available.
          The crash is in Aria's write_block_record(), while copying fixed-length columns to the record page. It is not clear why the buffer is overrun. As this is very old and stable code, it is unclear how things could go wrong here.

          A non-optimized build would be more helpful, as the gdb traces would then contain more information that could show the issue.
          In this particular case, a full backtrace from an unoptimized build would very likely show which structure the problem is in.

          I am not sure that an ASAN/UBSAN build will help, as it is not clear whether this is a logical error in record length counting or a stray write into another memory structure that causes the fault. It is very likely to fail at exactly the same point without any additional information.

          One way to find out what is going on is to give me remote access to the computer with gdb, the core and the server source.
          I am willing to log in to the customer site and do the debugging there to ensure we don't copy any sensitive data.
          This would enable me to find out which internal structure is wrong and what could have caused it.

          It would also help to get the mysqld.err file attached to this ticket (or at least all information related to this failure).
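
For reference, a full backtrace of all threads (like the attached full_bt_all_threads.txt) is usually collected by running gdb in batch mode against the server binary and the core file. The sketch below only assembles that command line; the helper name `gdb_backtrace_argv` and the example paths are hypothetical, and the gdb options used (`--batch`, `-ex`, `set pagination off`, `thread apply all bt full`) are standard gdb features rather than anything specific to this ticket.

```python
import shlex


def gdb_backtrace_argv(binary, core):
    """Build the argv for a non-interactive gdb run that dumps every thread's stack.

    `binary` must be the exact mariadbd executable that produced `core`;
    with an unoptimized build, `bt full` also shows local variables that
    are typically reported as <optimized out> otherwise.
    """
    return [
        "gdb", "--batch",
        "-ex", "set pagination off",        # do not stop for paging prompts
        "-ex", "thread apply all bt full",  # full backtrace of all threads
        binary, core,
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    argv = gdb_backtrace_argv("/usr/sbin/mariadbd", "/var/tmp/core.mariadbd")
    print(" ".join(shlex.quote(a) for a in argv))
```

The printed command can be run on the customer machine and its output redirected to a file, so that only the backtrace, and not the core itself, has to leave the site.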


          About the non-optimized build: note that the customer ONLY needs a copy of the mariadbd executable that he can use to temporarily replace the failing one. There is no need to do a full rpm for him.

          In theory someone could even log into the customer machine and compile it there. This would be the fastest way to get a quick turnaround for finding the problem (as several compile + fix + test cycles may be needed to find this bug...)

          Comment above added by Michael Widenius (monty).

          People

            monty Michael Widenius
            ponsuresh.pandians Pon Suresh Pandian (Inactive)