Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-32348

Random missing rows from result with jemalloc

Details

    Description

      Hello,
      after enabling jemalloc, we started to get random missing rows from result - sometimes 4 rows, sometimes 3 rows, sometime 2 rows ....

      I was able to create minimal reproducible that shows error both with mariadb:10.6.15 and current master quay.io/mariadb-foundation/mariadb-devel:10.6 with jemalloc turned on. Without jemalloc everything works.

      This is preview of repro: https://dbfiddle.uk/rXJGvk_h - e-shop with order items, every item has link to product and each product has more sections

      The main cause seems to be this part of query. We use subselect in JOIN to join always only one section with highest priority:

      ...
      LEFT JOIN sections s ON s.id = (
      SELECT s2.id
      FROM sections s2
      LEFT JOIN products_sections ps ON s2.id = ps.id_section
      WHERE ps.id_product = p.id
      ORDER BY s2.priority DESC
      LIMIT 1
      )
      ...

      when I remove ORDER BY it behaves normally. The problem seems to be that subselect matches two rows with same priority, so it picks random row. And sometimes it leads to completely missing row in result - it is an error, LEFT JOIN should never remove rows from result. When I run this query 10 times, I get 5 times 4 rows, 5 times only 3 rows.

      I'm able to test this on newer Mariadb versions if it will help.

      Attachments

        Activity

          ShiroDN Tomas Leypold added a comment -

          I just tried to run your mtr test on 10.4.31 (on Ubuntu 20.04) to be sure that 10.4 is really affected, but I can't reproduce it on this version, what patch version of mariadb and OS did you use @Sergei?

          main.b 'innodb'                          [ 1998 pass ]     84
          main.b 'innodb'                          [ 1999 pass ]     78
          main.b 'innodb'                          [ 2000 pass ]    113
          --------------------------------------------------------------------------
          The servers were restarted 0 times
          Spent 167.609 of 317 seconds executing testcases
           
          Completed: All 2000 tests were successful.
          

          Then I tested on two other versions (10.6, 11.2) to be sure that I tested correctly:

          main.b 'innodb'                          [ 196 pass ]     49
          main.b 'innodb'                          [ 197 fail ]
                  Test ended at 2023-10-12 17:05:48
           
          CURRENT_TEST: main.c
          --- /usr/share/mysql/mysql-test/main/b.result   2023-10-12 16:56:55.047760483 +0200
          +++ /usr/share/mysql/mysql-test/main/b.reject   2023-10-12 17:05:47.912103398 +0200
          @@ -21,6 +21,5 @@
           p1     s1      s2
           1      #       #
           2      #       #
          -3      #       #
           4      #       #
           drop tables ps,s,o,p;
           
          mysqltest: Result length mismatch
           
           - skipping '/usr/share/mysql/mysql-test/var/log/main.b-innodb/'
          main.b 'innodb'                          [ 198 pass ]     53
          main.b 'innodb'                          [ 199 pass ]     50
          main.b 'innodb'                          [ 200 pass ]     58
          --------------------------------------------------------------------------
          The servers were restarted 1 times
          Spent 9.841 of 50 seconds executing testcases
           
          Completed: Failed 35/200 tests, 82.50% were successful.
          

          ShiroDN Tomas Leypold added a comment - I just tried to run your mtr test on 10.4.31 (on Ubuntu 20.04) to be sure that 10.4 is really affected, but I can't reproduce it on this version, what patch version of mariadb and OS did you use @Sergei? main.b 'innodb' [ 1998 pass ] 84 main.b 'innodb' [ 1999 pass ] 78 main.b 'innodb' [ 2000 pass ] 113 -------------------------------------------------------------------------- The servers were restarted 0 times Spent 167.609 of 317 seconds executing testcases   Completed: All 2000 tests were successful. Then I tested on two other versions (10.6, 11.2) to be sure that I tested correctly: main.b 'innodb' [ 196 pass ] 49 main.b 'innodb' [ 197 fail ] Test ended at 2023-10-12 17:05:48   CURRENT_TEST: main.c --- /usr/share/mysql/mysql-test/main/b.result 2023-10-12 16:56:55.047760483 +0200 +++ /usr/share/mysql/mysql-test/main/b.reject 2023-10-12 17:05:47.912103398 +0200 @@ -21,6 +21,5 @@ p1 s1 s2 1 # # 2 # # -3 # # 4 # # drop tables ps,s,o,p;   mysqltest: Result length mismatch   - skipping '/usr/share/mysql/mysql-test/var/log/main.b-innodb/' main.b 'innodb' [ 198 pass ] 53 main.b 'innodb' [ 199 pass ] 50 main.b 'innodb' [ 200 pass ] 58 -------------------------------------------------------------------------- The servers were restarted 1 times Spent 9.841 of 50 seconds executing testcases   Completed: Failed 35/200 tests, 82.50% were successful.

          I had the latest 10.4 from github. And a debug build — that's just what I happened to have ready.

          May be it depends on the compiler? I've got gcc 13.2.1 20230826.

          It doesn't matter much where it repeats, when we'll have the fix, we'll see if it applies to 10.4 too or only to 10.5 and up

          serg Sergei Golubchik added a comment - I had the latest 10.4 from github. And a debug build — that's just what I happened to have ready. May be it depends on the compiler? I've got gcc 13.2.1 20230826. It doesn't matter much where it repeats, when we'll have the fix, we'll see if it applies to 10.4 too or only to 10.5 and up
          cuchac Cuchac added a comment -

          Hello,
          is there anything I can prepare to help you debug this issue?

          cuchac Cuchac added a comment - Hello, is there anything I can prepare to help you debug this issue?
          brunobear Bruno Bear added a comment -

          We really need to try jemalloc to avoid OOMs on our 10.6.12 MariaDB Servers.
          Now we are scared to use it in production, because of this bug. Less because of this specific query, but more of the possibility that there are more querys that give wrong results if we use jemalloc.
          Did maybe someone from MariaDB already look into this issue, found the reason and can therefore say if its really only this specific edge case query with left join, subselect, order by and limit?

          brunobear Bruno Bear added a comment - We really need to try jemalloc to avoid OOMs on our 10.6.12 MariaDB Servers. Now we are scared to use it in production, because of this bug. Less because of this specific query, but more of the possibility that there are more querys that give wrong results if we use jemalloc. Did maybe someone from MariaDB already look into this issue, found the reason and can therefore say if its really only this specific edge case query with left join, subselect, order by and limit?
          brunobear Bruno Bear added a comment -

          Still exists in 10.11.7

          brunobear Bruno Bear added a comment - Still exists in 10.11.7

          People

            Unassigned Unassigned
            cuchac Cuchac
            Votes:
            2 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.