Ok, I analysed this in detail. My conclusion is that this is a bug in the old
GCC version on Debian 5 "lenny" (4.3.2).
The code does this:
if (unlikely(dcarry == 0 && *start1 < *start2))
...
buf1=start1+len2;
...
SUB2(*buf1, *buf1, lo, carry);
...
dcarry= *start1;
len2 can be zero (and is, when I see the failure), in which case buf1 == start1.
SUB2 assigns to *buf1, so in that case it also changes *start1.
Checking the disassembled GCC output, what it does is cache the value of
*start1 from the top in register %r15d:
a92390: 44 8b 7e fc mov -0x4(%rsi),%r15d # *start1
and it uses this variable to assign to dcarry:
a92512: 44 89 fb mov %r15d,%ebx # dcarry=*start1
This is wrong, as the value in %r15d is stale. *start1 has a new value from the SUB2().
I do not see any problems with the code in terms of violation of strict
aliasing or other issues. My conclusion is that GCC is doing the wrong thing
here.
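
The pattern described above can be reduced to a small sketch. This is not the actual code from do_div_mod(): the function name reload_after_store(), the int32_t digit type, and the plain subtraction standing in for SUB2() are all assumptions made for illustration. It only shows why the final read of *start1 has to observe the store through buf1 when len2 is zero.

```c
#include <stdint.h>

typedef int32_t dec1;  /* assumed digit type for this sketch */

/*
 * Hypothetical reduction of the read / store / re-read sequence.
 * Returns the value that would be assigned to dcarry.
 */
static dec1 reload_after_store(dec1 *start1, const dec1 *start2,
                               int len2, dec1 lo)
{
    /* Early read of *start1, as in the unlikely(...) condition. */
    int first_is_smaller = (*start1 < *start2);

    /* buf1 points at the same object as start1 when len2 == 0. */
    dec1 *buf1 = start1 + len2;

    /* Simplified stand-in for SUB2(*buf1, *buf1, lo, carry). */
    *buf1 -= lo;

    /* This read must see the store above whenever len2 == 0;
       reusing the value from the early read would be stale. */
    dec1 dcarry = *start1;

    (void) first_is_smaller;  /* only present to mirror the early read */
    return dcarry;
}
```

With len2 == 0 a correct compiler has to return the updated digit; with len2 != 0 the store lands elsewhere and the original value of *start1 is returned.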
I do not think there is a point in trying to report this as a GCC bug. This is
in a very old version of the compiler, and we do not see this problem on any
other host/gcc version. It has probably been fixed long ago.
I will add an #ifdef so that the Debian package build can work around the
problem on Debian 5.
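
One possible shape for such a guard is sketched here. The comment above does not say what the real work-around looks like, so everything in this block is an assumption: the macro name RELOAD_DIGIT, the choice of a volatile read to force a fresh load, and the version test. Only __GNUC__ and __GNUC_MINOR__ are standard GCC predefined macros.

```c
#include <stdint.h>

typedef int32_t dec1;  /* assumed digit type for this sketch */

/* Hypothetical guard: only the affected compiler series gets the
   forced reload; every other compiler keeps the plain read. */
#if defined(__GNUC__) && __GNUC__ == 4 && __GNUC_MINOR__ == 3
#define RELOAD_DIGIT(p) (*(volatile dec1 *)(p))
#else
#define RELOAD_DIGIT(p) (*(p))
#endif

/* Stores through buf1, then re-reads *start1 via the guarded macro. */
static dec1 store_then_reload(dec1 *start1, int len2, dec1 lo)
{
    dec1 *buf1 = start1 + len2;  /* aliases start1 when len2 == 0 */
    *buf1 -= lo;                 /* simplified stand-in for SUB2() */
    return RELOAD_DIGIT(start1); /* value that would go into dcarry */
}
```

The volatile access prevents the compiler from substituting a previously loaded register value for the read, at the cost of one extra load on the affected compiler only.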
Simple test case:
CREATE TABLE t1 (i INT, INDEX
);
INSERT INTO t1 VALUES (1);
SELECT AVG
FROM t1;
DROP TABLE t1;
The problem seems to be in my_decimal_div(). This dump is from
Item_sum_avg::val_decimal():
XXX3: SQLCOM_SELECT: SELECT AVG
FROM t1
XXX12: Item_sum_avg::val_str()
XXX11: Item_sum_avg::val_decimal()
XXX11: using decimal ...
XXX11: values: 1 / 1
XXX11: sum_dec=9.0: 1 0 0 0 0 0 0 0 0
XXX11: count=9.0: 1 0 54436864 0 1609087657 32688 0 0 31
XXX11: sum/count=9.9: 1 999999999 1 0 11794296 0 6144224 0 1608708904
XXX12: decimal -> 2.0000
XXX13 Item::send(Protocol *, ...) buffer=2.0000
This means that Item_sum_avg::val_decimal() is computing 1/1 with
my_decimal_div(). The result becomes 1.999999999.
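
To see how a quotient of 1.999999999 ends up displayed as 2.0000, here is a small sketch. It assumes that the first two numbers after "sum/count=9.9:" are the integer word and the first fractional word, each holding nine decimal digits, and that the conversion to four decimals rounds half up. The function name format_4dp() is hypothetical and is not part of the decimal library.

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Formats one integer word and one nine-digit fractional word
 * with four decimals, rounding half up (assumed rounding mode).
 */
static void format_4dp(int32_t int_word, int32_t frac_word,
                       char *out, size_t out_size)
{
    int32_t kept = frac_word / 100000;     /* first four fractional digits */
    int32_t dropped = frac_word % 100000;  /* remaining five digits */

    if (dropped >= 50000)
        kept++;                            /* round half up */
    if (kept == 10000) {                   /* carry into the integer part */
        kept = 0;
        int_word++;
    }
    snprintf(out, out_size, "%d.%04d", (int) int_word, (int) kept);
}
```

Calling format_4dp(1, 999999999, buf, sizeof buf) produces "2.0000", matching the "decimal -> 2.0000" line in the dump, while the correct quotient words (1, 0) would give "1.0000".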
Unfortunately, the bug occurrence is extremely fragile.
I can repeat on VM vm-debian5-amd64-build by copying in source tarball and
running debian/autobake-deb.sh. If I then add a single line fprintf() in
do_div_mod() and `make -j2`, the problem disappears. If I remove the single
line again and `make -j2`, the problem is still gone ...
weird ...