Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34533

asan error about stack overflow when writing record in Aria

Details

    Description

      Compile with clang-18 and -DWITH_ASAN=1, this was a RelWithDebInfo build.

      main.union

      main.union                               w1 [ fail ]
              Test ended at 2024-07-05 00:52:49
       
      CURRENT_TEST: main.union
      mysqltest: At line 879: query 'CREATE TABLE t3 SELECT REPEAT(a,20000000) AS a FROM t1 UNION SELECT b FROM t2' failed: <Unknown> (2013): Lost connection to server during query
       
      The result from queries just before the failure was:
      < snip >
      SELECT left(a,100000000) FROM t1 UNION  SELECT b FROM t2;
      left(a,100000000)
      a
      b
      create table t3 SELECT left(a,100000000) FROM t1 UNION  SELECT b FROM t2;
      show create table t3;
      Table	Create Table
      t3	CREATE TABLE `t3` (
        `left(a,100000000)` longtext DEFAULT NULL
      ) ENGINE=MyISAM DEFAULT CHARSET=latin1 COLLATE=latin1_swedish_ci
      drop tables t1,t2,t3;
      SET @tmp_max= @@global.max_allowed_packet;
      SET @@global.max_allowed_packet=25000000;
      Warnings:
      Warning	1292	Truncated incorrect max_allowed_packet value: '25000000'
      connect  newconn, localhost, root,,;
      CREATE TABLE t1 (a mediumtext);
      CREATE TABLE t2 (b varchar(20));
      INSERT INTO t1 VALUES ('a');
      CREATE TABLE t3 SELECT REPEAT(a,20000000) AS a FROM t1 UNION SELECT b FROM t2;
       
      More results from queries before failure can be found in /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/mysql-test/var/1/log/union.log
      

      10.11-3ee6f69d49a58422f994f51a0bd7a0cb1242a3dd

      #6  0x000055fff57cb5eb in __sanitizer::Abort() ()
      #7  0x000055fff57c9775 in __sanitizer::Die() ()
      #8  0x000055fff57a9e9f in __asan::ScopedInErrorReport::~ScopedInErrorReport() ()
      #9  0x000055fff57acf25 in __asan::ReportGenericError(unsigned long, unsigned long, unsigned long, unsigned long, bool, unsigned long, unsigned int, bool) ()
      #10 0x000055fff57a2fba in __asan_memcpy ()
      #11 0x000055fff66070e2 in _ma_rec_pack (info=info@entry=0x529000dbb218, to=0x7fec5922307a "", to@entry=0x7fec59223078 "", from=0x519000234ce1 "", from@entry=0x519000234ce0 "\376") at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/storage/maria/ma_dynrec.c:1013
      #12 0x000055fff6608878 in _ma_write_blob_record (info=0x529000dbb218, record=0x519000234ce0 "\376") at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/storage/maria/ma_dynrec.c:266
      #13 0x000055fff66da4fe in maria_write (info=0x529000dbb218, record=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/storage/maria/ma_write.c:284
      #14 0x000055fff5d1c758 in handler::ha_write_tmp_row (this=0x51d0002544b8, buf=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_class.h:7649
      #15 select_unit::write_record (this=0x529000e36948) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_union.cc:418
      #16 0x000055fff5d1c055 in select_unit::send_data (this=0x529000e36948, values=...) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_union.cc:157
      #17 0x000055fff5b9fa2e in select_result_sink::send_data_with_check (this=0x2a50f, items=..., u=<optimized out>, sent=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_class.h:5912
      #18 end_send (join=join@entry=0x529000e36a38, join_tab=join_tab@entry=0x0, end_of_records=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_select.cc:23628
      #19 0x000055fff5baebdd in do_select (join=join@entry=0x529000e36a38, procedure=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_select.cc:21815
      #20 0x000055fff5baced7 in JOIN::exec_inner (this=0x529000e36a38) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_select.cc:4912
      #21 0x000055fff5bab0d8 in JOIN::exec (this=0x529000e36a38) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_select.cc:4690
      #22 0x000055fff5d192a3 in st_select_lex_unit::exec (this=0x52b0000c12f0) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_union.cc:2248
      #23 0x000055fff5d12cdf in mysql_union (thd=thd@entry=0x52b0000bd218, lex=<optimized out>, result=result@entry=0x529000e36800, unit=0x52b0000c12f0, setup_tables_done_option=setup_tables_done_option@entry=0) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_union.cc:42
      #24 0x000055fff5b52827 in handle_select (thd=thd@entry=0x52b0000bd218, lex=0x2a596, lex@entry=0x52b0000c1218, result=result@entry=0x529000e36800, setup_tables_done_option=setup_tables_done_option@entry=0) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_select.cc:576
      #25 0x000055fff5cf5128 in Sql_cmd_create_table_like::execute (this=<optimized out>, thd=0x52b0000bd218) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_table.cc:12694
      #26 0x000055fff5ab6c60 in mysql_execute_command (thd=0x52b0000bd218, is_called_from_prepared_stmt=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_parse.cc:6125
      #27 0x000055fff5aab1f3 in mysql_parse (thd=thd@entry=0x52b0000bd218, rawbuf=<optimized out>, length=<optimized out>, parser_state=parser_state@entry=0x7fec5965c7c0) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_parse.cc:8126
      #28 0x000055fff5aa6cb5 in dispatch_command (command=<optimized out>, thd=0x52b0000bd218, packet=<optimized out>, packet_length=<optimized out>, blocking=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_parse.cc:1894
      #29 0x000055fff5aabc92 in do_command (thd=0x52b0000bd218, blocking=true) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_parse.cc:1407
      #30 0x000055fff5e0c61e in do_handle_one_connection (connect=<optimized out>, put_in_cache=<optimized out>) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_connect.cc:1415
      #31 0x000055fff5e0c021 in handle_one_connection (arg=0x508000002738) at /home/buildbot/amd64-ubuntu-2404-clang18-asan/build/sql/sql_connect.cc:1317
      #32 0x000055fff57a2b3d in asan_thread_start(void*) ()
      #33 0x00007fec64649a94 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
      #34 0x00007fec646d6a34 in clone () from /lib/x86_64-linux-gnu/libc.so.6
      

      ref: https://buildbot.dev.mariadb.org/#/builders/371/builds/7/steps/7/logs/stdio

      Also present on 10.5-9e8546e2bda81c2c97acb1e9f973ff466a82dca7

      Attachments

        Issue Links

          Activity

            monty Michael Widenius added a comment - - edited

            The problem here is that when using clang + asan, we do not get a correct value for the thread stack as some local variables are not allocated at the normal stack.

            Here follows a longer description and a solution for this issue.

            Taking the address of a variable on stack is not always working.

            It looks like that for example clang 18.1.3, when compiling with
            -O2 -fsanitize=addressan it puts local variables and things allocated by
            alloca() in other areas than on the stack.

            The following code shows the issue

            Thread 6 "mariadbd" hit Breakpoint 3, do_handle_one_connection (connect=0x5080000027b8, 
                put_in_cache=<optimized out>) at /work/server/sql/sql_connect.cc:1399
             
            THD *thd;
            1399      thd->thread_stack= (char*) &thd;
            (gdb) p &thd
            (THD **) 0x7fffedee7060
            (gdb) p $sp
            (void *) 0x7fffef4e7bc0
            

            The address of thd is 24M away from the stack pointer

            (gdb) info reg
            rax            0x52b000080998      90915868248472
            rbx            0x7fffef4e7bc0      140737208286144
            rcx            0xa5600010133       11364483531059
            rdx            0x4                 4
            rsi            0x7ffff4f000d0      140737302757584
            rdi            0x52b00007e218      90915868238360
            rbp            0x7fffef4e7c50      0x7fffef4e7c50
            rsp            0x7fffef4e7bc0      0x7fffef4e7bc0
            r8             0x60                96
            r9             0x524000060000      90434831777792
            r10            0x7fffffffffff      140737488355327
            r11            0x4                 4
            r12            0x555558e5de60      93825052040800
            r13            0x7fffedee7060      140737185214560
            r14            0xe1ad14c050        969271459920
            r15            0x52b00007e218      90915868238360
            rip            0x5555565c07f7      0x5555565c07f7
            

            r13 is pointing to the address of the thd. Probably some kind of "local stack"
            used by the sanitizer

            I have verified this with gdb on a recursive call that calls alloca() in a loop.
            In this case all objects are stored in a local heap, not on the stack.
            This makes the sanitizer a bit unpredictable when working with code that
            uses alloca() as it does not behave in the same way as code in production that
            may fail because of stack over usage.

            To get the stack pointer safely, one can use asm functions like:

            #include <stdint.h>
             
            uint64_t getsp( void )
            {
                uint64_t sp;
                asm( "mov %%rsp, %0" : "=rm" ( sp ));
                return sp;
            }
            

            To solve this issue in a portable way, we should add two functions:

            my_get_stack_pointer() that will return the address of the current stack pointer.
            The proposed solution is using asm instructions for intel 32/64 bit, powerpc, arm 32/64 bit
            and sparc 32/64 bit. Supported compilers are gcc and clang and MSCV.
            For MSCV 64 bit we are using _AddressOfReturnAddress()
            As a fallback for other compilers/arch we use the address of a local variable.

            my_get_stack_bounds() that will return the address of the base stack and stack size using
            pthread_attr_getstack() or NtCurrentTed() with fallback to using the address of a local
            variable and user provided stack size.

            Server changes are:

            • Moving setting of thread_stack to THD::store_globals() using my_get_stack_bounds().
            • Removing setting of thd->thread_stack, except in functions that allocates a lot on
              the stack before calling store_globals().
              When using estimates for stack start, we reduce stack_size with
              MY_STACK_SAFE_MARGIN (8192) to take into account the stack used before
              calling store_globals().
            monty Michael Widenius added a comment - - edited The problem here is that when using clang + asan, we do not get a correct value for the thread stack as some local variables are not allocated at the normal stack. Here follows a longer description and a solution for this issue. Taking the address of a variable on stack is not always working. It looks like that for example clang 18.1.3, when compiling with -O2 -fsanitize=addressan it puts local variables and things allocated by alloca() in other areas than on the stack. The following code shows the issue Thread 6 "mariadbd" hit Breakpoint 3, do_handle_one_connection (connect=0x5080000027b8, put_in_cache=<optimized out>) at /work/server/sql/sql_connect.cc:1399   THD *thd; 1399 thd->thread_stack= (char*) &thd; (gdb) p &thd (THD **) 0x7fffedee7060 (gdb) p $sp (void *) 0x7fffef4e7bc0 The address of thd is 24M away from the stack pointer (gdb) info reg rax 0x52b000080998 90915868248472 rbx 0x7fffef4e7bc0 140737208286144 rcx 0xa5600010133 11364483531059 rdx 0x4 4 rsi 0x7ffff4f000d0 140737302757584 rdi 0x52b00007e218 90915868238360 rbp 0x7fffef4e7c50 0x7fffef4e7c50 rsp 0x7fffef4e7bc0 0x7fffef4e7bc0 r8 0x60 96 r9 0x524000060000 90434831777792 r10 0x7fffffffffff 140737488355327 r11 0x4 4 r12 0x555558e5de60 93825052040800 r13 0x7fffedee7060 140737185214560 r14 0xe1ad14c050 969271459920 r15 0x52b00007e218 90915868238360 rip 0x5555565c07f7 0x5555565c07f7 r13 is pointing to the address of the thd. Probably some kind of "local stack" used by the sanitizer I have verified this with gdb on a recursive call that calls alloca() in a loop. In this case all objects are stored in a local heap, not on the stack. This makes the sanitizer a bit unpredictable when working with code that uses alloca() as it does not behave in the same way as code in production that may fail because of stack over usage. To get the stack pointer safely, one can use asm functions like: #include <stdint.h>   uint64_t getsp( void ) { uint64_t sp; asm( "mov %%rsp, %0" : "=rm" ( sp )); return sp; } To solve this issue in a portable way, we should add two functions: my_get_stack_pointer() that will return the address of the current stack pointer. The proposed solution is using asm instructions for intel 32/64 bit, powerpc, arm 32/64 bit and sparc 32/64 bit. Supported compilers are gcc and clang and MSCV. For MSCV 64 bit we are using _AddressOfReturnAddress() As a fallback for other compilers/arch we use the address of a local variable. my_get_stack_bounds() that will return the address of the base stack and stack size using pthread_attr_getstack() or NtCurrentTed() with fallback to using the address of a local variable and user provided stack size. Server changes are: Moving setting of thread_stack to THD::store_globals() using my_get_stack_bounds(). Removing setting of thd->thread_stack, except in functions that allocates a lot on the stack before calling store_globals(). When using estimates for stack start, we reduce stack_size with MY_STACK_SAFE_MARGIN (8192) to take into account the stack used before calling store_globals().
            danblack Daniel Black added a comment - - edited

            Note the fallbacks of my_get_stack_pointer need optimization disabled (like what MDEV-21248 (35c277851972) MDEV-31605 ( 6eed99f1f3ad (clang)) did).

            On s390x - r15 is the stack pointer (ref1 ref2) since we have that in CI.

            Assignment of thread_stack in sql_base.cc should be and assignment rather than char * cast.

            danblack Daniel Black added a comment - - edited Note the fallbacks of my_get_stack_pointer need optimization disabled (like what MDEV-21248 (35c277851972) MDEV-31605 ( 6eed99f1f3ad (clang)) did). On s390x - r15 is the stack pointer (ref1 ref2 ) since we have that in CI. Assignment of thread_stack in sql_base.cc should be and assignment rather than char * cast.
            monty Michael Widenius added a comment - - edited

            As far as I have found out and the mtr test agrees, the fallback we have is good enough even for high optimized code. Even the old code that took address of a local variable worked on all installations, except for clang + asan.
            The included unittest should fail if there is any problems with the platform. If it fails, we can then
            fix the code for that platform.

            s390x is supported by the current code.

            Please provide test case for any setup where using the my_get_stack_pointer() fails.
            For testing purposes, you can remove the hardware specific code for other cases than clang + asan.

            I have fixed all assignments to thread_stack to use (void*)

            monty Michael Widenius added a comment - - edited As far as I have found out and the mtr test agrees, the fallback we have is good enough even for high optimized code. Even the old code that took address of a local variable worked on all installations, except for clang + asan. The included unittest should fail if there is any problems with the platform. If it fails, we can then fix the code for that platform. s390x is supported by the current code. Please provide test case for any setup where using the my_get_stack_pointer() fails. For testing purposes, you can remove the hardware specific code for other cases than clang + asan. I have fixed all assignments to thread_stack to use (void*)

            it seems it would be much simpler and more robust to use __builtin_frame_address() that both gcc anf clang support.

            An address of a local variable in the inline function is strange concept anyway, I'm not even sure it'll do what you expect.

            serg Sergei Golubchik added a comment - it seems it would be much simpler and more robust to use __builtin_frame_address() that both gcc anf clang support. An address of a local variable in the inline function is strange concept anyway, I'm not even sure it'll do what you expect.

            Serg and I concluded that __builtin_frame_address() is not safe as it gives the start of the stack in the current function, not the stack after all variables are allocated.
            We also checked if we could use __builtin_stack_address (), but decided to not use it as this is not supported by clang or intel compiler.
            For functions that are using a large stack, we are using a local variable on the caller for calculation the stack pointer for not supported compilers and cpus.
            This should ensure that the return value is reasonable accurate.

            monty Michael Widenius added a comment - Serg and I concluded that __builtin_frame_address() is not safe as it gives the start of the stack in the current function, not the stack after all variables are allocated. We also checked if we could use __builtin_stack_address (), but decided to not use it as this is not supported by clang or intel compiler. For functions that are using a large stack, we are using a local variable on the caller for calculation the stack pointer for not supported compilers and cpus. This should ensure that the return value is reasonable accurate.

            Code pushed to bb-10.5-monty
            This is part of a series of commits, of which one is to be tested by Elena this week.
            Will be pushed next week.

            monty Michael Widenius added a comment - Code pushed to bb-10.5-monty This is part of a series of commits, of which one is to be tested by Elena this week. Will be pushed next week.

            Pushed to 10.5 tree

            monty Michael Widenius added a comment - Pushed to 10.5 tree

            In the changeset it does not look like any optimizations were disabled as suggested; at least I see none of the following:

            • function attribute optimize(0) (GCC) or optnone (clang)
            • #pragma optimize( "", off ) (MSVC)

            To my understanding, none of our CI systems enable link-time optimization (cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON). Some GNU/Linux distributions that package MariaDB Server do that, and they might now hit some trouble.

            If I understood it correctly, the following kind of a predicate should work if used in the appropriate way:

            #ifdef _MSC_VER
            # include <intrin.h>
            # define my_get_stack_pointer() _AddressOfReturnAddress()
            #elif defined __GNUC__ /* GCC or clang or derivatives */
            # define my_get_stack_pointer() __builtin_frame_address(0)
            #else
            # error "don't know how to determine the stack pointer"
            #endif
            

            Note: Above, the predicate takes no parameters. The appropriate usage would seem to be to write a thin wrapper function that do not define any local variables, so that nothing needs to be allocated from the stack. This wrapper function would invoke the stack check and finally tail-call the function that would then be guaranteed to have enough stack space available.

            Such usage would require some larger refactoring; for example, the macro alloc_on_stack would be replaced with straight use of alloca, and the logic to check for stack overflow would have to be moved to the above mentioned wrapper function, which might call an alternative implementation that would allocate the buffer from heap.

            marko Marko Mäkelä added a comment - In the changeset it does not look like any optimizations were disabled as suggested; at least I see none of the following: function attribute optimize(0) (GCC) or optnone (clang) #pragma optimize( "", off ) ( MSVC ) To my understanding, none of our CI systems enable link-time optimization ( cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON ). Some GNU/Linux distributions that package MariaDB Server do that, and they might now hit some trouble. If I understood it correctly, the following kind of a predicate should work if used in the appropriate way: #ifdef _MSC_VER # include <intrin.h> # define my_get_stack_pointer() _AddressOfReturnAddress() #elif defined __GNUC__ /* GCC or clang or derivatives */ # define my_get_stack_pointer() __builtin_frame_address(0) #else # error "don't know how to determine the stack pointer" #endif Note: Above, the predicate takes no parameters. The appropriate usage would seem to be to write a thin wrapper function that do not define any local variables, so that nothing needs to be allocated from the stack. This wrapper function would invoke the stack check and finally tail-call the function that would then be guaranteed to have enough stack space available. Such usage would require some larger refactoring; for example, the macro alloc_on_stack would be replaced with straight use of alloca , and the logic to check for stack overflow would have to be moved to the above mentioned wrapper function, which might call an alternative implementation that would allocate the buffer from heap.

            This seems to have broken the build on IBM AIX, as follows:

            10.5 9b3413c71f16a3d738af358e288a0168f4c15827

            …
            2024-10-22  8:40:17 0 [Note] Plugin 'unix_socket' is disabled.
            ERROR: 1436  Thread stack overrun:  1073789920 bytes used of a 299008 byte stack, and 32000 bytes needed. Consider increasing the thread_stack system variable.
            2024-10-22  8:40:17 0 [ERROR] Aborting
            program finished with exit code 1
            elapsedTime=13.761456
            

            marko Marko Mäkelä added a comment - This seems to have broken the build on IBM AIX, as follows: 10.5 9b3413c71f16a3d738af358e288a0168f4c15827 … 2024-10-22 8:40:17 0 [Note] Plugin 'unix_socket' is disabled. ERROR: 1436 Thread stack overrun: 1073789920 bytes used of a 299008 byte stack, and 32000 bytes needed. Consider increasing the thread_stack system variable. 2024-10-22 8:40:17 0 [ERROR] Aborting program finished with exit code 1 elapsedTime=13.761456
            danblack Daniel Black added a comment -

            Unit test also fails:

            10.5

            10.5
            buildbot@p8-aix1-mariadb:[/home/buildbot/ppc64be-aix-71/build/build]./unittest/mysys/stack_allocation-t
            1..4
            not ok 1 - stack checking failed
            not ok 2 - stack checking failed
            not ok 3 - stack checking failed
            not ok 4 - stack checking failed
            # Failed 4 tests!
            

            IBM Engineers have reached out and I've provided details on building this unit test and running it.

            I'm hoping they can help with some other remain AIX test failures too.

            A less than fully awake attempt to debug this:

            gdb ./unittest/mysys/stack_allocation-t 
            (gdb) b main
            Breakpoint 1 at 0x10000728: file /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c, line 104.
            (gdb) r
            Starting program: /home/buildbot/ppc64be-aix-71/build/build/unittest/mysys/stack_allocation-t 
            [New Thread 1]
            [Switching to Thread 1]
             
            Thread 2 hit Breakpoint 1, main (argc=1, argv=0xffffffffffffa60) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:104
            104       MY_INIT(argv[0]);
            (gdb) list
            99      {
            100       pthread_attr_t thr_attr;
            101       pthread_t check_thread;
            102       void *value;
            103     
            104       MY_INIT(argv[0]);
            105     
            106       plan(4);
            107       test_stack_detection(3, STACK_ALLOC_SMALL_BLOCK_SIZE-1);
            108       test_stack_detection(4, STACK_ALLOC_SMALL_BLOCK_SIZE+1);
            (gdb) n
            106       plan(4);
            (gdb) n
            1..4
            107       test_stack_detection(3, STACK_ALLOC_SMALL_BLOCK_SIZE-1);
            (gdb) s
            test_stack_detection (stage=3, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:75
            75        my_get_stack_bounds(&stack_start, &stack_end,
            (gdb) bt
            #0  test_stack_detection (stage=3, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:75
            #1  0x0000000100000778 in main (argc=<optimized out>, argv=<optimized out>)
                at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:107
            (gdb) info locals
            stack_start = 0xbadc0ffee0ddf00d
            stack_end = 0xbadc0ffee0ddf00d
            res = <optimized out>
            (gdb) list
            70      
            71      void test_stack_detection(int stage, size_t stack_allocation)
            72      {
            73        void *stack_start, *stack_end;
            74        int res;
            75        my_get_stack_bounds(&stack_start, &stack_end,
            76                            (void*) &stack_start, my_stack_size);
            77        stack_allocation_total= 0;
            78        res= test_stack(stack_start, stack_end, 1, stack_allocation);
            79        if (!res)
            (gdb) s
            my_get_stack_bounds (stack_start=0xffffffffffff918, stack_end=0xffffffffffff910, fallback_stack_start=0xffffffffffff918, fallback_stack_size=299008)
                at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/mysys/my_stack.c:48
            48        pthread_t thread= pthread_self();
            (gdb) n
            52        if (pthread_getattr_np(thread, &attr) == 0)
            (gdb) 
            55          pthread_attr_getstack(&attr, &stack_base, &stack_size);
            (gdb) 
            60          *stack_start= stack_base - stack_size * STACK_DIRECTION;
            (gdb) 
            61          pthread_attr_destroy(&attr); /* Clean up */
            (gdb) 
            79        *stack_end= *stack_start + stack_size * STACK_DIRECTION;
            (gdb) 
            test_stack_detection (stage=<optimized out>, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:77
            77        stack_allocation_total= 0;
            (gdb) info locals
            stack_start = 0x100000003ffff1c0
            stack_end = 0xffffffffffff9e0
            res = <optimized out>
            (gdb) info registers r1
            r1             0xffffffffffff8a0   1152921504606845088
            (gdb) list
            72      {
            73        void *stack_start, *stack_end;
            74        int res;
            75        my_get_stack_bounds(&stack_start, &stack_end,
            76                            (void*) &stack_start, my_stack_size);
            77        stack_allocation_total= 0;
            78        res= test_stack(stack_start, stack_end, 1, stack_allocation);
            79        if (!res)
            80          ok(1, "%llu bytes allocated on stack of size %ld with %lu alloc size",
            81             (unsigned long long) stack_allocation_total,
            (gdb) s
            78        res= test_stack(stack_start, stack_end, 1, stack_allocation);
            (gdb) s
            test_stack (stack_start=0x100000003ffff1c0, stack_end=0xffffffffffff9e0, iteration=1, stack_allocation=4095)
                at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/include/my_stack_alloc.h:44
            44      /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/include/my_stack_alloc.h: No such file or directory.
            

            danblack Daniel Black added a comment - Unit test also fails: 10.5 10.5 buildbot@p8-aix1-mariadb:[/home/buildbot/ppc64be-aix-71/build/build]./unittest/mysys/stack_allocation-t 1..4 not ok 1 - stack checking failed not ok 2 - stack checking failed not ok 3 - stack checking failed not ok 4 - stack checking failed # Failed 4 tests! IBM Engineers have reached out and I've provided details on building this unit test and running it. I'm hoping they can help with some other remain AIX test failures too. A less than fully awake attempt to debug this: gdb ./unittest/mysys/stack_allocation-t (gdb) b main Breakpoint 1 at 0x10000728: file /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c, line 104. (gdb) r Starting program: /home/buildbot/ppc64be-aix-71/build/build/unittest/mysys/stack_allocation-t [New Thread 1] [Switching to Thread 1]   Thread 2 hit Breakpoint 1, main (argc=1, argv=0xffffffffffffa60) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:104 104 MY_INIT(argv[0]); (gdb) list 99 { 100 pthread_attr_t thr_attr; 101 pthread_t check_thread; 102 void *value; 103 104 MY_INIT(argv[0]); 105 106 plan(4); 107 test_stack_detection(3, STACK_ALLOC_SMALL_BLOCK_SIZE-1); 108 test_stack_detection(4, STACK_ALLOC_SMALL_BLOCK_SIZE+1); (gdb) n 106 plan(4); (gdb) n 1..4 107 test_stack_detection(3, STACK_ALLOC_SMALL_BLOCK_SIZE-1); (gdb) s test_stack_detection (stage=3, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:75 75 my_get_stack_bounds(&stack_start, &stack_end, (gdb) bt #0 test_stack_detection (stage=3, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:75 #1 0x0000000100000778 in main (argc=<optimized out>, argv=<optimized out>) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:107 (gdb) info locals stack_start = 0xbadc0ffee0ddf00d stack_end = 0xbadc0ffee0ddf00d res = <optimized out> (gdb) list 70 71 void test_stack_detection(int stage, size_t stack_allocation) 72 { 73 void *stack_start, *stack_end; 74 int res; 75 my_get_stack_bounds(&stack_start, &stack_end, 76 (void*) &stack_start, my_stack_size); 77 stack_allocation_total= 0; 78 res= test_stack(stack_start, stack_end, 1, stack_allocation); 79 if (!res) (gdb) s my_get_stack_bounds (stack_start=0xffffffffffff918, stack_end=0xffffffffffff910, fallback_stack_start=0xffffffffffff918, fallback_stack_size=299008) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/mysys/my_stack.c:48 48 pthread_t thread= pthread_self(); (gdb) n 52 if (pthread_getattr_np(thread, &attr) == 0) (gdb) 55 pthread_attr_getstack(&attr, &stack_base, &stack_size); (gdb) 60 *stack_start= stack_base - stack_size * STACK_DIRECTION; (gdb) 61 pthread_attr_destroy(&attr); /* Clean up */ (gdb) 79 *stack_end= *stack_start + stack_size * STACK_DIRECTION; (gdb) test_stack_detection (stage=<optimized out>, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/unittest/mysys/stack_allocation-t.c:77 77 stack_allocation_total= 0; (gdb) info locals stack_start = 0x100000003ffff1c0 stack_end = 0xffffffffffff9e0 res = <optimized out> (gdb) info registers r1 r1 0xffffffffffff8a0 1152921504606845088 (gdb) list 72 { 73 void *stack_start, *stack_end; 74 int res; 75 my_get_stack_bounds(&stack_start, &stack_end, 76 (void*) &stack_start, my_stack_size); 77 stack_allocation_total= 0; 78 res= test_stack(stack_start, stack_end, 1, stack_allocation); 79 if (!res) 80 ok(1, "%llu bytes allocated on stack of size %ld with %lu alloc size", 81 (unsigned long long) stack_allocation_total, (gdb) s 78 res= test_stack(stack_start, stack_end, 1, stack_allocation); (gdb) s test_stack (stack_start=0x100000003ffff1c0, stack_end=0xffffffffffff9e0, iteration=1, stack_allocation=4095) at /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/include/my_stack_alloc.h:44 44 /home/buildbot/ppc64be-aix-71/build/mariadb-10.5.27/include/my_stack_alloc.h: No such file or directory.

            Answering @marko:

            "In the changeset it does not look like any optimizations were disabled as suggested; at least I see none of the following:
            function attribute optimize(0) (GCC) or optnone (clang)
            #pragma optimize( "", off ) (MSVC)"

            This was tested without and with buildbot and was not needed. The code seams to work perfectly with and without optimization.
            There is an unit test that verifies this.

            "To my understanding, none of our CI systems enable link-time optimization (cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON). Some GNU/Linux distributions that package MariaDB Server do that, and they might now hit some trouble."

            I have been told that MariaDB build systems already uses compilation with link time optimization for some platforms.
            We have not seen any indication of any problem with the stack optimization with these builds.
            The way link time optimization works is not different from running -O2/-O3 on one compilation unit. As this works (and the functions are inlined already), I do not see any reason why link time optimization would cause any different results.

            If I understood it correctly, the following kind of a predicate should work if used in the appropriate way:

            #ifdef _MSC_VER

            1. include <intrin.h>
            1. define my_get_stack_pointer() _AddressOfReturnAddress()

            #elif defined _GNUC_ /* GCC or clang or derivatives */

            1. define my_get_stack_pointer() __builtin_frame_address(0)

            As was already concluded in an earlier comment for this Jira entry, the above does not work. frame pointer is the address of the stack pointer before any local variables (or variables from inline functions) are allocated on stack. As link time optimization adds even more variables on the stack, using the above approach will cause failures that is less likely to happen with the current code.

            monty Michael Widenius added a comment - Answering @marko: "In the changeset it does not look like any optimizations were disabled as suggested; at least I see none of the following: function attribute optimize(0) (GCC) or optnone (clang) #pragma optimize( "", off ) (MSVC)" This was tested without and with buildbot and was not needed. The code seams to work perfectly with and without optimization. There is an unit test that verifies this. "To my understanding, none of our CI systems enable link-time optimization (cmake -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON). Some GNU/Linux distributions that package MariaDB Server do that, and they might now hit some trouble." I have been told that MariaDB build systems already uses compilation with link time optimization for some platforms. We have not seen any indication of any problem with the stack optimization with these builds. The way link time optimization works is not different from running -O2/-O3 on one compilation unit. As this works (and the functions are inlined already), I do not see any reason why link time optimization would cause any different results. If I understood it correctly, the following kind of a predicate should work if used in the appropriate way: #ifdef _MSC_VER include <intrin.h> define my_get_stack_pointer() _AddressOfReturnAddress() #elif defined _ GNUC _ /* GCC or clang or derivatives */ define my_get_stack_pointer() __builtin_frame_address(0) As was already concluded in an earlier comment for this Jira entry, the above does not work. frame pointer is the address of the stack pointer before any local variables (or variables from inline functions) are allocated on stack. As link time optimization adds even more variables on the stack, using the above approach will cause failures that is less likely to happen with the current code.
            danblack Daniel Black added a comment -

            The explicit Debian/Ubuntu disabling has only just been lifted (MDEV-25633). Debian LTO is opt in https://wiki.debian.org/LTO, so the disable was around Ubuntu usage. Ubuntu LTO is https://wiki.ubuntu.com/ToolChain/LTO, so effectively 22.04+ if we include maintained versions that are in BB. 22.04+ builds are for 10.6+ so it wasn't until this was merged last Monday to 10.6 that this was even had the first exposure to LTO testing.

            Since this is merged up to bb-11.4-release
            https://buildbot.mariadb.org/#/builders/352/builds/17055/steps/7/logs/stdio
            unit.stack_allocation w2 [ pass ] 51

            For values of 24.04 LTO usage, it applies to bb-10.11-release, which was first tested 2 days ago. Ubuntu 24.10 is from the bb-11.4-release, first tested 2 days ago still passes.

            The OpenSUSE LTO example MDEV-21248 is for OpenSUSE Tumbleweed (https://en.opensuse.org/openSUSE:LTO) which isn't in CI. It currently has a gcc version above Ubuntu 24.10 (by a minor gcc release number), but overall the combination of other compilation options may affect results too.

            There are some older clang test exercising this, possibly not with LTO.

            monty you are obviously right in that inline makes this, optionally (its still a likely compiler choice), not dependant on LTO. The functions of this commit are verging into the territory of undefined behaviour and while tested on our current platforms it can easily fail elsewhere, like it has for AIX (which isn't fixed by O0 or fallback btw).

            Disabling optimization of a few inline instructions was a risk management strategy in light of pervious bug reports and the considerable freedoms that optimizations have when inline.

            danblack Daniel Black added a comment - The explicit Debian/Ubuntu disabling has only just been lifted ( MDEV-25633 ). Debian LTO is opt in https://wiki.debian.org/LTO , so the disable was around Ubuntu usage. Ubuntu LTO is https://wiki.ubuntu.com/ToolChain/LTO , so effectively 22.04+ if we include maintained versions that are in BB. 22.04+ builds are for 10.6+ so it wasn't until this was merged last Monday to 10.6 that this was even had the first exposure to LTO testing. Since this is merged up to bb-11.4-release https://buildbot.mariadb.org/#/builders/352/builds/17055/steps/7/logs/stdio unit.stack_allocation w2 [ pass ] 51 For values of 24.04 LTO usage, it applies to bb-10.11-release, which was first tested 2 days ago. Ubuntu 24.10 is from the bb-11.4-release, first tested 2 days ago still passes. The OpenSUSE LTO example MDEV-21248 is for OpenSUSE Tumbleweed ( https://en.opensuse.org/openSUSE:LTO ) which isn't in CI. It currently has a gcc version above Ubuntu 24.10 (by a minor gcc release number), but overall the combination of other compilation options may affect results too. There are some older clang test exercising this, possibly not with LTO. monty you are obviously right in that inline makes this, optionally (its still a likely compiler choice), not dependant on LTO. The functions of this commit are verging into the territory of undefined behaviour and while tested on our current platforms it can easily fail elsewhere, like it has for AIX (which isn't fixed by O0 or fallback btw). Disabling optimization of a few inline instructions was a risk management strategy in light of pervious bug reports and the considerable freedoms that optimizations have when inline.
            danblack Daniel Black added a comment -

            AIX fixed e147f8a5ed8607810b39fa613db2deadd483ce93 - thanks wlad

            danblack Daniel Black added a comment - AIX fixed e147f8a5ed8607810b39fa613db2deadd483ce93 - thanks wlad

            MDEV-31151 (https://github.com/MariaDB/server/pull/2541) had fixed a bug in the stack usage check earlier. To fix MDEV-33209, I simply used more direct debug injection instead of trying to tweak the use of alloca(). If the stack overflow check is now supposed to be accurate, it could make sense to revisit that.

            marko Marko Mäkelä added a comment - MDEV-31151 ( https://github.com/MariaDB/server/pull/2541 ) had fixed a bug in the stack usage check earlier. To fix MDEV-33209 , I simply used more direct debug injection instead of trying to tweak the use of alloca() . If the stack overflow check is now supposed to be accurate, it could make sense to revisit that.

            People

              monty Michael Widenius
              danblack Daniel Black
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.