[MDEV-18737] Spider "Out of memory" on armv7hl Created: 2019-02-25  Updated: 2019-07-25  Resolved: 2019-07-25

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - Spider, Tests
Affects Version/s: 10.4.3, 10.3.13
Fix Version/s: 10.4.7

Type: Bug Priority: Critical
Reporter: Michal Schorm Assignee: Kentoku Shiba (Inactive)
Resolution: Fixed Votes: 0
Labels: spider, tests
Environment:

Fedora build system - Fedora 28, 29, Rawhide (development branch)
ARMv7HL only


Attachments: HTML File F28_test_log     HTML File F29_test_log     HTML File F30_test_log    

 Description   

Hello from the Fedora project,

With the recent 10.3.13 release a new bug has emerged; however, it triggers only on the armv7hl architecture.

Each time the Spider SE test suite is run, most of the tests fail:

Failed 47/51 tests

The bug doesn't seem to exist in previous versions, on other architectures (as far as I can test), or in other parts of the test suite.

All of the tests fail with the same error, where only the number differs (somewhere between 2.5 GB and 3.2 GB):

[ERROR] mysqld: Out of memory (Needed 3062627376 bytes)

The behaviour is 100% reproducible in our build system on Fedora 28 through Rawhide (the latest development version). That suggests it is not caused by something like a new GCC or other build requirements.

Looking at 10.3.13 changelog, I'd blame this one:
https://jira.mariadb.org/browse/MDEV-16520
... or any other Spider related commits.


The Fedora web interface to package sources can be found here:
https://src.fedoraproject.org/rpms/mariadb
files here:
https://src.fedoraproject.org/rpms/mariadb/tree/master

I made a pull request containing the rebase to 10.3.13:
https://src.fedoraproject.org/rpms/mariadb/pull-request/11#request_diff

The builds you can examine, for respective Fedora versions, can be found here:
F30: https://koji.fedoraproject.org/koji/taskinfo?taskID=33033562
F29: https://koji.fedoraproject.org/koji/taskinfo?taskID=33033608
F28: https://koji.fedoraproject.org/koji/taskinfo?taskID=33033914

The list of failing tests:

    spider/bg.spider3_fixes
    spider/bg.spider3_fixes_part
    spider/bg.spider_fixes
    spider/bg.spider_fixes_part
    spider/bg.basic_sql
    spider/bg.basic_sql_part
    spider/bg.direct_aggregate
    spider/bg.direct_aggregate_part
    spider/bg.direct_update
    spider/bg.direct_update_part
    spider/bg.function
    spider/bg.ha
    spider/bg.ha_part
    spider.spider3_fixes
    spider.spider3_fixes_part
    spider.spider_fixes
    spider.spider_fixes_part
    spider.auto_increment
    spider.basic_sql
    spider.basic_sql_part
    spider.checksum_table_with_quick_mode_3
    spider.direct_aggregate
    spider.direct_aggregate_part
    spider.direct_join
    spider.direct_join_using
    spider.direct_left_join
    spider.direct_left_join_nullable
    spider.direct_left_right_join_nullable
    spider.direct_left_right_left_join_nullable
    spider.direct_right_join
    spider.direct_right_join_nullable
    spider.direct_right_left_join_nullable
    spider.direct_right_left_right_join_nullable
    spider.direct_update
    spider.direct_update_part
    spider.function
    spider.ha
    spider.ha_part
    spider.partition_cond_push
    spider.partition_fulltext
    spider.partition_join_pushdown_for_single_partition
    spider.partition_mrr
    spider.quick_mode_1
    spider.quick_mode_2
    spider.quick_mode_3
    spider.slave_trx_isolation
    spider.timestamp

The parameters the Spider tests were run with:

perl ./mysql-test-run.pl
  --parallel=auto --force --retry=1 --suite-timeout=900 --testcase-timeout=30
  --mysqld=--binlog-format=mixed --force-restart --shutdown-timeout=60
  --max-test-fail=5  --skip-ssl --big-test --mem --suite=spider,spider/bg 
  --max-test-fail=999 || :

Attaching the test logs for the respective Fedora versions.



 Comments   
Comment by Michal Schorm [ 2019-02-26 ]

The just-released 10.4.3 RC version is affected as well!

Here is the build log for examination:
https://koji.fedoraproject.org/koji/getfile?taskID=33061689&volume=DEFAULT&name=build.log

Comment by Ian Gilfillan [ 2019-04-01 ]

Noting that this bug is preventing 10.3.13 (and will also do so for the imminent 10.3.14) from being packaged into Fedora, so increasing priority.

Comment by Kentoku Shiba (Inactive) [ 2019-04-07 ]

mschorm
I could not reproduce this issue on an ARM environment. Would it be possible for me to log in and check it directly on the armv7hl environment? Or would it be possible for you to change the source code of Spider and test it?

Comment by Michal Schorm [ 2019-04-07 ]

I will try to get you an affected environment.

UPDATE 1:

  • I got access to the ARMv7hl machine
  • I was able to reproduce the error there with a mock rebuild of the source package used in Koji (the Fedora build system)
  • Now I'm negotiating with the machine admin about access for you

UPDATE 2:

  • To access the machine, you will need a FAS (Fedora Account System) account.
    You will upload your public ssh key there, and it will be used to authenticate you to the machine.
    Once the account is ready, I'll ask the admin to add it to the 'upstream-test' group, which will grant you access to the machines.
    https://admin.fedoraproject.org/accounts/user/new
    .
    I suggest you make a single account for the whole MariaDB upstream.

Let me know once you have it ready, or if there's any problem,
... or if you feel this approach won't work for you at all.

Comment by Kentoku Shiba (Inactive) [ 2019-04-09 ]

mschorm
I just added an account and uploaded my public ssh key. My account name is "kentoku".
Please let me know if I need to do something more.

Comment by Michal Schorm [ 2019-04-11 ]

Kentoku
Yes, there's one more thing required: signing the Contributor Agreement.
You can find it there as well.

Comment by Kentoku Shiba (Inactive) [ 2019-04-30 ]

mschorm
I'm sorry, but I cannot see what kind of contribution I would be making just by investigating the details of this issue, and I think signing should not be required. Is there any other way that does not involve signing? Do you know which options are used for cmake or the build?

Comment by Kentoku Shiba (Inactive) [ 2019-05-01 ]

mschorm
What specific device do you use?

Comment by Michal Schorm [ 2019-06-04 ]

Kentoku
I have good news!
After discussions between the sysadmins and Fedora Legal, a decision was made that you (or your case in general) don't need the FPCA signed for this purpose.

So you should now have been granted access to the machine.
Test Machine Fedora Resources

I set it up on arm03-packager00.cloud.fedoraproject.org machine.

Steps to reproduce:

sudo su -s /usr/bin/bash mysql

cd /usr/share/mysql-test/
./mysql-test-run --do-test=spider3_fixes --big-test --mem

That will result in a message like:

mysqltest: At line 162: query 'SELECT MAX(id) FROM t1' failed: 128: Out of memory (Needed 3062130920 bytes)

Comment by Kentoku Shiba (Inactive) [ 2019-06-04 ]

mschorm
Thank you for the news!
I just tried to log in to the arm03-packager00.cloud.fedoraproject.org machine, but I got the following error from the ssh command. Would you please check the permissions of my public key file?

$ ssh kentoku@arm03-packager00.cloud.fedoraproject.org
Enter passphrase for key :
Permission denied (publickey).

Comment by Michal Schorm [ 2019-06-10 ]

I contacted the admins.
An update of keys on the server was forced and your account is listed as enabled and with a public key.

If it still doesn't work, we can check that you are using the key the server expects (one of us will send it to the other; the question is through which channel. It is only a public key, but you may not want to paste it here).

Comment by Kentoku Shiba (Inactive) [ 2019-06-10 ]

mschorm
I could log in and reproduce the issue. Thanks.
When I tried to use gdb for debugging, it did not work and gave the following error. Would you please check it? Alternatively, would a "Debug" build be possible?

bash-4.4$ gdb /usr/libexec/mysqld
GNU gdb (GDB) Fedora 8.2-7.fc29
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "armv7hl-redhat-linux-gnueabi".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.
 
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/libexec/mysqld...Reading symbols from /usr/lib/debug/usr/libexec/mysqld-10.3.13-1.fc29.arm.debug...done.
done.
(gdb) run
Starting program: /usr/libexec/mysqld
warning: the debug information found in "/usr/lib/debug//lib/ld-2.28.so.debug" does not match "/lib/ld-linux-armhf.so.3" (CRC mismatch).
 
warning: the debug information found in "/usr/lib/debug//usr/lib/ld-2.28.so.debug" does not match "/lib/ld-linux-armhf.so.3" (CRC mismatch).
 
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-30.fc29.armv7hl
 
[3]+  Stopped                 gdb /usr/libexec/mysqld
bash-4.4$

Comment by Michal Schorm [ 2019-06-12 ]

I created a debug build of 10.3.15 and installed it there (10.3.13 was installed previously).
It took longer than expected because I hit MDEV-19740.
I updated the glibc debuginfo, so you can run mysqld in gdb.
I verified the test still fails.
It doesn't say "Out of memory" anymore, but rather "2013: Lost connection to MySQL server during query", which sounds a bit more debug-friendly.

You - same as me - should be able to use "sudo" on your account, so you can install / upgrade / downgrade any tool or package you may need.

Comment by Kentoku Shiba (Inactive) [ 2019-06-17 ]

mschorm
I finally fixed and pushed it. Would you please try to use the following branch?
bb-10.4-MDEV-18737

This looks like a memory alignment issue with the variable arguments passed to my_multi_malloc(). That function assumes its variable arguments are pairs of a pointer and a uint, but the callers were passing pairs of a pointer and some other integer type. On armv7hl, this mismatch causes the failure. To fix it, I added a cast to uint for every length value. Thank you for your cooperation.
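
For readers unfamiliar with this class of bug, here is a minimal sketch of the calling pattern described above. It is not the Spider or mysys code: toy_multi_malloc() and demo_cast_fix() are hypothetical names, and the 8-byte slice rounding is an assumption made for illustration. The callee reads each length with va_arg(args, unsigned int), so every caller has to pass exactly that type. One plausible way for the mismatch to show up only on 32-bit ARM is a length expression with a 64-bit type, which that ABI places in an 8-byte-aligned argument slot, so the callee would read padding or half of the value.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Round a length up to a multiple of 8 (illustrative choice). */
#define TOY_ALIGN(n) (((n) + 7u) & ~(size_t)7u)

/*
 * Hypothetical stand-in for a my_multi_malloc()-style function.
 * Variable arguments are (char **ptr, unsigned int length) pairs,
 * terminated by a (char **) NULL. One block is allocated and each
 * *ptr is set to its slice of that block.
 */
static void *toy_multi_malloc(int flags, ...)
{
    va_list args;
    char **ptr;
    size_t total = 0;
    (void) flags;

    /* Pass 1: sum the rounded lengths. */
    va_start(args, flags);
    while ((ptr = va_arg(args, char **)) != NULL) {
        /* The callee always reads an unsigned int here. */
        unsigned int len = va_arg(args, unsigned int);
        total += TOY_ALIGN((size_t) len);
    }
    va_end(args);

    char *block = malloc(total ? total : 1);
    if (block == NULL)
        return NULL;

    /* Pass 2: hand out the slices. */
    char *cur = block;
    va_start(args, flags);
    while ((ptr = va_arg(args, char **)) != NULL) {
        unsigned int len = va_arg(args, unsigned int);
        *ptr = cur;
        cur += TOY_ALIGN((size_t) len);
    }
    va_end(args);
    return block;
}

/* Returns 0 when both slices land where the callee's layout says. */
static int demo_cast_fix(void)
{
    char *name = NULL;
    char *payload = NULL;
    size_t name_len = strlen("spider") + 1;  /* 7 bytes */
    uint64_t payload_len = 100;              /* 64-bit length held by the caller */

    /*
     * Each length is cast to unsigned int to match what the callee reads.
     * Passing payload_len without the cast would be undefined behaviour.
     */
    void *block = toy_multi_malloc(0,
                                   &name, (unsigned int) name_len,
                                   &payload, (unsigned int) payload_len,
                                   (char **) NULL);
    if (block == NULL)
        return -1;

    memcpy(name, "spider", name_len);
    memset(payload, 0xAB, (size_t) payload_len);

    int ok = name == (char *) block &&
             payload == name + 8 &&          /* 7 rounded up to 8 */
             strcmp(name, "spider") == 0;
    free(block);
    return ok ? 0 : 1;
}
```

Calling demo_cast_fix() from a test program exercises the cast version. The cast matters because a variadic call does not convert arguments to the type the callee expects, so the compiler cannot catch the mismatch.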

Comment by Michal Schorm [ 2019-06-18 ]

The very first test looks promising.

So far, I did one run of the released 10.3.15 tarball patched with the commit you made.

In the following days, I aim to prepare updates in Fedora to 10.3.15 and 10.4.5,
or rather to 10.3.16, which just came out, and 10.4.6, which should follow soon.

Even though I think it would be nice to have this fixed in the first 10.4 GA release, with this patch in hand I'm in no rush to have it there.

I'll post an update here once I have successfully created those updates and verified that the issue is gone.

.

Thanks a lot.

Comment by Michal Schorm [ 2019-07-19 ]

10.3.16 and 10.4.6 look fine with the patch.

The ticket may be closed.

Comment by Kentoku Shiba (Inactive) [ 2019-07-25 ]

Merged into 10.4 tree. Thanks.
