  1. MariaDB Server
  2. MDEV-135

failures in buildbot in 5.5 on kvm-deb-debian5-amd64

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 5.5.20
    • None
    • None

    Description

      Failing test(s): rpl.rpl_checksum_cache rpl.rpl_heartbeat_basic main.ps_3innodb main.ps main.subselect_mat_cost main.select_pkeycache main.multi_update main.union

          Activity

            Simple test case:

            CREATE TABLE t1 (i INT, INDEX(i));
            INSERT INTO t1 VALUES (1);
            SELECT AVG(i) FROM t1;
            DROP TABLE t1;

            The problem seems to be in my_decimal_div(). This dump is from
            Item_sum_avg::val_decimal():

            XXX3: SQLCOM_SELECT: SELECT AVG(i) FROM t1
            XXX12: Item_sum_avg::val_str()
            XXX11: Item_sum_avg::val_decimal()
            XXX11: using decimal ...
            XXX11: values: 1 / 1
            XXX11: sum_dec=9.0: 1 0 0 0 0 0 0 0 0
            XXX11: count=9.0: 1 0 54436864 0 1609087657 32688 0 0 31
            XXX11: sum/count=9.9: 1 999999999 1 0 11794296 0 6144224 0 1608708904
            XXX12: decimal -> 2.0000
            XXX13 Item::send(Protocol *, ...) buffer=2.0000

            This means that Item_sum_avg::val_decimal() is computing 1/1 with
            my_decimal_div(). The result becomes 1.999999999.
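To make the dump easier to read: the decimal buffer holds base-10^9 words, so the first two words of the sum/count line (1 and 999999999) are one integer word and one fractional word. The sketch below is an illustration only; `format_words` and `words_to_double` are hypothetical helpers, not functions from strings/decimal.c, and they handle just one word on each side of the decimal point.

```c
#include <stdint.h>
#include <stdio.h>

#define DIG_PER_DEC1 9            /* decimal digits stored per 32-bit word */
#define DIG_BASE     1000000000   /* 10^9 */

/* Hypothetical helper: print one integer word and one fractional word,
   zero-padding the fractional word to 9 digits. */
int format_words(char *out, size_t n, int32_t intg_word, int32_t frac_word)
{
  return snprintf(out, n, "%d.%0*d", (int) intg_word, DIG_PER_DEC1,
                  (int) frac_word);
}

/* Hypothetical helper: approximate the same two words as a double,
   which is enough to see what rounding to 4 places produces. */
double words_to_double(int32_t intg_word, int32_t frac_word)
{
  return (double) intg_word + (double) frac_word / DIG_BASE;
}
```

Formatting the words (1, 999999999) gives 1.999999999, and rounding that to four decimal places gives the 2.0000 seen in the dump. A correct 1/1 would leave the words (1, 0), that is 1.000000000.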

            Unfortunately, the bug occurrence is extremely fragile.

            I can repeat on VM vm-debian5-amd64-build by copying in source tarball and
            running debian/autobake-deb.sh. If I then add a single line fprintf() in
            do_div_mod() and `make -j2`, the problem disappears. If I remove the single
            line again and `make -j2`, the problem is still gone ...

            weird ...

            knielsen Kristian Nielsen added a comment

            I discovered that the problem occurs when strings/decimal.c is built with DEB_BUILD_HARDENING=1.
            The problem disappears when that file is compiled with that variable not set.

            knielsen Kristian Nielsen added a comment

            Bug is triggered when strings/decimal.c is compiled with -D_FORTIFY_SOURCE=2 (or =1).

            knielsen Kristian Nielsen added a comment

            Ok, I analysed this in detail. My conclusion is that this is a bug in the old
            GCC version on Debian 5 "lenny" (4.3.2).

            The code does this:

            if (unlikely(dcarry == 0 && *start1 < *start2))
            ...
            buf1=start1+len2;
            ...
            SUB2(*buf1, *buf1, lo, carry);
            ...
            dcarry= *start1;

            len2 can be zero (and is, when I see the failure). SUB2 assigns to *buf1.

            Checking the disassembled GCC output, what it does is cache the value of
            *start1 from the top in register %r15d:

            a92390: 44 8b 7e fc mov -0x4(%rsi),%r15d # *start1

            and it uses this variable to assign to dcarry:

            a92512: 44 89 fb mov %r15d,%ebx # dcarry=*start1

            This is wrong, as the value in %r15d is stale. *start1 has a new value from the SUB2().

            I do not see any problems with the code in terms of violation of strict
            aliasing or other issues. My conclusion is that GCC is doing the wrong thing
            here.
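The aliasing at the heart of this can be shown in isolation. The function below is a simplified stand-in, not the real do_div_mod(): `subtract_then_reload` is a hypothetical name, and a plain subtraction replaces the SUB2() macro. It only illustrates why the final read of *start1 must come from memory rather than from a value loaded earlier.

```c
#include <stdint.h>

typedef int32_t dec1;   /* 32-bit word type, as used for decimal digits */

/* Simplified stand-in for the pattern in the comment above. */
dec1 subtract_then_reload(dec1 *start1, int len2, dec1 lo)
{
  if (*start1 < lo)            /* early read of *start1 (may be cached) */
    return -1;

  dec1 *buf1 = start1 + len2;  /* when len2 == 0, buf1 aliases start1 */
  *buf1 -= lo;                 /* stands in for SUB2(*buf1, *buf1, lo, carry) */

  return *start1;              /* must observe the store made through buf1 */
}
```

With len2 == 0 and *start1 == 5, lo == 3, a correct compiler returns 2. Returning the cached 5 would correspond to the stale %r15d value described above.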

            I do not think there is a point in trying to report this as a GCC bug. This is
            in a very old version of the compiler, and we do not see this problem on any
            other host/gcc version. It is probably already fixed long ago.

            I will add an #ifdef so that the Debian package build can work around the
            problem on Debian 5.
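One way such an #ifdef could look is sketched below. This is an assumption for illustration, not the change that was actually committed: the macro name `HYPOTHETICAL_GCC_RELOAD_WORKAROUND`, the `RELOAD` macro and the function `next_dcarry` are all invented here. The idea is that a read through a volatile-qualified pointer forces the compiler to emit a real load instead of reusing a register.

```c
#include <stdint.h>

typedef int32_t dec1;

/* Hypothetical switch: the packaging build would define this on affected hosts. */
#ifdef HYPOTHETICAL_GCC_RELOAD_WORKAROUND
#define RELOAD(p) (*(volatile dec1 *) (p))   /* forces a fresh load from memory */
#else
#define RELOAD(p) (*(p))                     /* ordinary read */
#endif

/* Hypothetical reduction of the loop step: store through buf1, then
   pick up the carry word from *start1. */
dec1 next_dcarry(dec1 *start1, int len2, dec1 lo)
{
  dec1 *buf1 = start1 + len2;   /* may alias start1 when len2 == 0 */
  *buf1 -= lo;                  /* stands in for SUB2() */
  return RELOAD(start1);        /* dcarry = *start1, re-read after the store */
}
```

Both variants of `RELOAD` give the same result on a correct compiler, so guarding the volatile read behind a macro keeps the cost confined to builds that need it.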

            knielsen Kristian Nielsen added a comment (edited)

            Buildbot confirms that workaround eliminates the failure.

            knielsen Kristian Nielsen added a comment

            People

              knielsen Kristian Nielsen