[MDEV-30665] MariaDB 10.6.12 hang 2 days after upgrade Created: 2023-02-16  Updated: 2023-04-17  Resolved: 2023-04-17

Status: Closed
Project: MariaDB Server
Component/s: Data Manipulation - Insert, Replication, Server
Affects Version/s: 10.6.12
Fix Version/s: 11.1.1, 10.11.3, 11.0.2, 10.6.13, 10.8.8, 10.9.6, 10.10.4

Type: Bug Priority: Critical
Reporter: Oli Sennhauser Assignee: Marko Mäkelä
Resolution: Duplicate Votes: 0
Labels: hang
Environment:

Ubuntu 20.04, n.a.


Attachments: File mdev_30665_1.tar.gz     File test_backtrace.trc    
Issue Links:
Duplicate
duplicates MDEV-29835 Partial server freeze Closed
Relates
relates to MDEV-30481 Hard lock up with queries in "Opening... Closed
relates to MDEV-30637 Mariadb server hangs Closed

 Description   

Today our production server was suddenly hanging. The only thing that changed was an upgrade from 10.6.11 to 10.6.12 2 days ago. So I assume it has something to do with the new release because we never experienced this before.

How did we find: Script complained that it was already running. So previous run of script did not terminate for long time.

Database shows:

SQL> SHOW PROCESSLIST;
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id     | User        | Host                     | db     | Command     | Time   | State                                                         | Info                                                                                                 | Progress |
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|      5 | system user |                          | NULL   | Slave_IO    | 222766 | Waiting for master to send event                              | NULL                                                                                                 |    0.000 |
|      6 | system user |                          | NULL   | Slave_SQL   |      3 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                 |    0.000 |
|      7 | zabbix      | localhost:58656          | zabbix | Sleep       |      1 |                                                               | NULL                                                                                                 |    0.000 |
|      8 | zabbix      | localhost:58666          | zabbix | Sleep       |      2 |                                                               | NULL                                                                                                 |    0.000 |
|      9 | zabbix      | localhost:58674          | zabbix | Sleep       |      1 |                                                               | NULL                                                                                                 |    0.000 |
|     10 | zabbix      | localhost:58686          | zabbix | Sleep       |      3 |                                                               | NULL                                                                                                 |    0.000 |
|     13 | zabbix      | localhost:58708          | zabbix | Sleep       |     51 |                                                               | NULL                                                                                                 |    0.000 |
|     16 | zabbix      | localhost:58718          | zabbix | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
|     32 | zabbix      | localhost:39622          | zabbix | Sleep       |     35 |                                                               | NULL                                                                                                 |    0.000 |
|    146 | replication | maas1.fromdual.com:35856 | NULL   | Binlog Dump | 222729 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |    0.000 |
|    150 | zabbix      | localhost:46088          | zabbix | Sleep       |   7903 |                                                               | NULL                                                                                                 |    0.000 |
|    367 | zabbix      | localhost:60340          | zabbix | Sleep       |   8268 |                                                               | NULL                                                                                                 |    0.000 |
|    588 | zabbix      | localhost:54340          | zabbix | Sleep       |   7958 |                                                               | NULL                                                                                                 |    0.000 |
|   1269 | zabbix      | localhost:45482          | zabbix | Sleep       |   8028 |                                                               | NULL                                                                                                 |    0.000 |
|   1484 | zabbix      | localhost:40008          | zabbix | Sleep       |   8101 |                                                               | NULL                                                                                                 |    0.000 |
|  67247 | zabbix      | localhost:47262          | zabbix | Query      |  10473 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (66915,1676576402,1,3145344),(66916,16765764 |    0.000 |
| 413224 | zabbix      | localhost:44396          | zabbix | Query       |  10473 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (26687,1676576404,32,7631),(26836,1676576403 |    0.000 |
| 413226 | zabbix      | localhost:44426          | zabbix | Query       |  10473 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (23187,1676576401,66,0),(23188,1676576401,67 |    0.000 |
| 590933 | zabbix      | localhost:51912          | zabbix | Query       |  10473 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (23177,1676576401,153,0),(23178,1676576401,1 |    0.000 |
| 652992 | zabbix      | localhost:42022          | zabbix | Query       |  10473 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 652997 | zabbix      | localhost:42034          | zabbix | Query       |  10473 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 652999 | zabbix      | localhost:42046          | zabbix | Query       |  10473 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 674125 | zabbix      | localhost:52324          | zabbix | Query       |     54 | Opening tables                                                | insert into history_uint (itemid,clock,ns,value) values (66915,1676576402,1,3145344),(66916,16765764 |    0.000 |
| 674245 | root        | localhost                | NULL   | Query       |      0 | starting                                                      | show processlist                                                                                     |    0.000 |
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
 
SQL> status;
--------------
mysql  Ver 15.1 Distrib 10.6.12-MariaDB, for linux-systemd (x86_64) using readline 5.1
 
Connection id:          674245
Current database:
Current user:           root@localhost
SSL:                    Not in use
Current pager:          stdout
Using outfile:          ''
Using delimiter:        ;
Server:                 MariaDB
Server version:         10.6.12-MariaDB-log MariaDB Server
Protocol version:       10
Connection:             Localhost via UNIX socket
Server characterset:    latin1
Db     characterset:    latin1
Client characterset:    utf8mb4
Conn.  characterset:    utf8mb4
UNIX socket:            /var/run/mysqld/mysql-33006.sock
Uptime:                 2 days 13 hours 53 min 38 sec
 
Threads: 32  Questions: 15509507  Slow queries: 3764  Opens: 973  Open tables: 789  Queries per second avg: 69.606

Killing did not help:

SQL> SHOW PROCESSLIST;
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id     | User        | Host                     | db     | Command     | Time   | State                                                         | Info                                                                                                 | Progress |
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+
|      5 | system user |                          | NULL   | Slave_IO    | 222809 | Waiting for master to send event                              | NULL                                                                                                 |    0.000 |
|      6 | system user |                          | NULL   | Slave_SQL   |     15 | Slave has read all relay log; waiting for more updates        | NULL                                                                                                 |    0.000 |
|      7 | zabbix      | localhost:58656          | zabbix | Sleep       |      1 |                                                               | NULL                                                                                                 |    0.000 |
|      8 | zabbix      | localhost:58666          | zabbix | Sleep       |     45 |                                                               | NULL                                                                                                 |    0.000 |
|      9 | zabbix      | localhost:58674          | zabbix | Sleep       |      3 |                                                               | NULL                                                                                                 |    0.000 |
|     10 | zabbix      | localhost:58686          | zabbix | Sleep       |      1 |                                                               | NULL                                                                                                 |    0.000 |
|     13 | zabbix      | localhost:58708          | zabbix | Sleep       |     33 |                                                               | NULL                                                                                                 |    0.000 |
|     16 | zabbix      | localhost:58718          | zabbix | Sleep       |      0 |                                                               | NULL                                                                                                 |    0.000 |
|     32 | zabbix      | localhost:39622          | zabbix | Sleep       |     18 |                                                               | NULL                                                                                                 |    0.000 |
|    146 | replication | maas1.fromdual.com:35856 | NULL   | Binlog Dump | 222771 | Master has sent all binlog to slave; waiting for more updates | NULL                                                                                                 |    0.000 |
|    150 | zabbix      | localhost:46088          | zabbix | Sleep       |   7945 |                                                               | NULL                                                                                                 |    0.000 |
|    367 | zabbix      | localhost:60340          | zabbix | Sleep       |   8311 |                                                               | NULL                                                                                                 |    0.000 |
|    588 | zabbix      | localhost:54340          | zabbix | Sleep       |   8001 |                                                               | NULL                                                                                                 |    0.000 |
|   1269 | zabbix      | localhost:45482          | zabbix | Sleep       |   8071 |                                                               | NULL                                                                                                 |    0.000 |
|   1484 | zabbix      | localhost:40008          | zabbix | Sleep       |   8144 |                                                               | NULL                                                                                                 |    0.000 |
|  67247 | zabbix      | localhost:47262          | zabbix | Killed      |  10516 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (66915,1676576402,1,3145344),(66916,16765764 |    0.000 |
| 413224 | zabbix      | localhost:44396          | zabbix | Killed      |  10516 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (26687,1676576404,32,7631),(26836,1676576403 |    0.000 |
| 413226 | zabbix      | localhost:44426          | zabbix | Killed      |  10516 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (23187,1676576401,66,0),(23188,1676576401,67 |    0.000 |
| 590933 | zabbix      | localhost:51912          | zabbix | Killed      |  10516 | Update                                                        | insert into history_uint (itemid,clock,ns,value) values (23177,1676576401,153,0),(23178,1676576401,1 |    0.000 |
| 652992 | zabbix      | localhost:42022          | zabbix | Query       |  10516 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 652997 | zabbix      | localhost:42034          | zabbix | Query       |  10516 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 652999 | zabbix      | localhost:42046          | zabbix | Query       |  10516 | Statistics                                                    | SELECT itemid,COUNT(*) AS count,AVG(value) AS avg,MIN(value) AS min,MAX(value) AS max,round(500* MOD |    0.000 |
| 674125 | zabbix      | localhost:52324          | zabbix | Query       |     96 | Opening tables                                                | insert into history_uint (itemid,clock,ns,value) values (66915,1676576402,1,3145344),(66916,16765764 |    0.000 |
| 674245 | root        | localhost                | NULL   | Query       |      0 | starting                                                      | show processlist                                                                                     |    0.000 |
| 674255 | zabbix      | localhost:53326          | zabbix | Query       |     28 | Opening tables                                                | insert into history_uint (itemid,clock,ns,value) values (26687,1676576404,32,7631),(26836,1676576403 |    0.000 |
| 674370 | zabbix      | localhost:60396          | zabbix | Query       |     12 | Opening tables                                                | insert into history_uint (itemid,clock,ns,value) values (23177,1676576401,153,0),(23178,1676576401,1 |    0.000 |
| 674372 | zabbix      | localhost:37496          | zabbix | Query       |      6 | Opening tables                                                | insert into history_uint (itemid,clock,ns,value) values (23187,1676576401,66,0),(23188,1676576401,67 |    0.000 |
| 674373 | zabbix      | localhost:37508          | zabbix | Sleep       |      1 |                                                               | NULL                                                                                                 |    0.000 |
+--------+-------------+--------------------------+--------+-------------+--------+---------------------------------------------------------------+------------------------------------------------------------------------------------------------------+----------+

We did not see any message in the error log.

A proper shutdown (-15) was NOT successful. The database did just not terminate in a decent amount of time:

Status information:

Current dir: /home/mysql/database/prod2/data/
Running threads: 26  Cached threads: 7  Stack size: 299008
 
Key caches:
default
Buffer_size:       2097152
Block_size:           1024
Division_limit:        100
Age_threshold:         300
Partitions:              0
blocks used:             0
not flushed:             0
w_requests:              0
writes:                  0
r_requests:              0
reads:                   0
 
handler status:
read_key:   1148401981
read_next:  1015362765
read_rnd     346076179
read_first:     572208
write:        33409058
delete          556427
update:         485556
 
Table status:
Opened tables:        973
Open tables:          789
Open files:            65
Open streams:           4
 
Alarm status:
Active alarms:   0
Max used alarms: 0
Next alarm time: 0
 
Memory status:
Non-mmapped space allocated from system: 276168704
Number of free chunks:                   2878
Number of fastbin blocks:                136
Number of mmapped regions:               58
Space in mmapped regions:                538791936
Maximum total allocated space:           0
Space available in freed fastbin blocks: 9552
Total allocated space:                   99193792
Total free space:                        176974912
Top-most, releasable space:              58768
Estimated memory (with thread stack):    824827904
Global memory allocated by server:       59489400
Memory allocated by threads:             4525024
 
Events status:
LLA = Last Locked At  LUA = Last Unlocked At
WOC = Waiting On Condition  DL = Data Locked
 
Event scheduler status:
State      : INITIALIZED
Thread id  : 0
LLA        : stop:638
LUA        : stop:688
WOC        : NO
Workers    : 0
Executed   : 0
Data locked: NO
 
Event queue status:
Element count   : 0
Data locked     : NO
Attempting lock : NO
LLA             : drop_schema_events:393
LUA             : drop_schema_events:395
WOC             : NO
Next activation : never
2023-02-16 23:38:16 0 [Note] bin/mysqld (initiated by: unknown): Normal shutdown
2023-02-16 23:38:16 6 [Note] Error reading relay log event: slave SQL thread was killed
2023-02-16 23:38:16 6 [Note] Slave SQL thread exiting, replication stopped in log 'prod1_binlog.002843' at position 32730719, master: 116.203.122.76:33006
2023-02-16 23:38:16 5 [Note] Slave I/O thread exiting, read up to log 'prod1_binlog.002843', position 32730719, master 116.203.122.76:33006
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 674372 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 674370 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 674255 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 674125 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 652999 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 652997 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 652992 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 590933 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 413226 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 413224 (user : 'zabbix') did not exit
2023-02-16 23:38:36 0 [Warning] bin/mysqld: Thread 67247 (user : 'zabbix') did not exit

Kill -9 helped.
Start of the DB shows 7 transactions rolled back:

2023-02-16 23:42:15 0 [Note] Starting MariaDB 10.6.12-MariaDB-log source revision 4c79e15cc3716f69c044d4287ad2160da8101cdc as process 2319572
2023-02-16 23:42:15 0 [Note] mysqld: Aria engine: starting recovery
tables to flush: 3 2 1 0
 (0.0 seconds); 
2023-02-16 23:42:15 0 [Note] mysqld: Aria engine: recovery done
2023-02-16 23:42:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.12
2023-02-16 23:42:15 0 [Note] InnoDB: Number of pools: 1
2023-02-16 23:42:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-02-16 23:42:15 0 [Note] InnoDB: Using Linux native AIO
2023-02-16 23:42:15 0 [Note] InnoDB: Initializing buffer pool, total size = 25769803776, chunk size = 134217728
2023-02-16 23:42:15 0 [Note] InnoDB: Completed initialization of buffer pool
2023-02-16 23:42:15 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=37225590795933,37225598153720
2023-02-16 23:42:20 0 [Note] InnoDB: 8 transaction(s) which must be rolled back or cleaned up in total 2381 row operations to undo
2023-02-16 23:42:20 0 [Note] InnoDB: Trx id counter is 6897314056
2023-02-16 23:42:20 0 [Note] InnoDB: Starting final batch to recover 22118 pages from redo log.
2023-02-16 23:42:21 0 [Note] InnoDB: Last binlog file '/home/mysql/database/prod2/binlog/prod2_binlog.002865', position 127260559
2023-02-16 23:42:21 0 [Note] InnoDB: 128 rollback segments are active.
2023-02-16 23:42:21 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-02-16 23:42:21 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-02-16 23:42:21 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-02-16 23:42:21 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-02-16 23:42:21 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-02-16 23:42:21 0 [Note] InnoDB: 10.6.12 started; log sequence number 37225787347580; transaction id 6897314057
2023-02-16 23:42:21 0 [Note] InnoDB: Loading buffer pool(s) from /home/mysql/database/prod2/data/ib_buffer_pool
2023-02-16 23:42:21 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-02-16 23:42:21 0 [Note] InnoDB: Rolled back recovered transaction 6897313229
2023-02-16 23:42:21 0 [Note] Loaded 'metadata_lock_info.so' with offset 0x7f21d41fd000
2023-02-16 23:42:21 0 [Note] Recovering after a crash using /home/mysql/database/prod2/binlog/prod2_binlog
2023-02-16 23:42:21 0 [Note] Starting table crash recovery...
2023-02-16 23:42:21 0 [Note] Crash table recovery finished.
2023-02-16 23:42:21 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-02-16 23:42:21 0 [Note] Server socket created on IP: '::'.
2023-02-16 23:42:21 0 [Note] Reading of all Master_info entries succeeded
2023-02-16 23:42:21 0 [Note] Added new Master_info '' to hash table
2023-02-16 23:42:21 5 [Note] Slave I/O thread: Start asynchronous replication to master 'replication@116.203.122.76:33006' in log 'prod1_binlog.002843' at position 32730719
2023-02-16 23:42:21 0 [Note] bin/mysqld: ready for connections.
Version: '10.6.12-MariaDB-log'  socket: '/var/run/mysqld/mysql-33006.sock'  port: 33006  MariaDB Server
2023-02-16 23:42:21 6 [Note] Slave SQL thread initialized, starting replication in log 'prod1_binlog.002843' at position 32730719, relay log './prod2-relay-log.006987' position: 508
2023-02-16 23:42:21 5 [Note] Slave I/O thread: connected to master 'replication@116.203.122.76:33006',replication started in log 'prod1_binlog.002843' at position 32730719
2023-02-16 23:42:21 0 [Note] InnoDB: Rolled back recovered transaction 6897313491
2023-02-16 23:42:21 0 [Note] InnoDB: Rolled back recovered transaction 6897312943
2023-02-16 23:42:21 8 [Warning] Aborted connection 8 to db: 'unconnected' user: 'unauthenticated' host: 'localhost' (This connection closed normally without authentication)
2023-02-16 23:42:21 0 [Note] InnoDB: Rolled back recovered transaction 6897265003
2023-02-16 23:42:22 0 [Note] InnoDB: Rolled back recovered transaction 6897313487
2023-02-16 23:42:22 0 [Note] InnoDB: Rolled back recovered transaction 6897265005
2023-02-16 23:42:22 0 [Note] InnoDB: Rolled back recovered transaction 6897265007
2023-02-16 23:42:22 0 [Note] InnoDB: Rolled back recovered transaction 6897265039
2023-02-16 23:42:22 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-02-16 23:42:28 0 [Note] InnoDB: Buffer pool(s) load completed at 230216 23:42:28

Please let me know what I should gather as information if it happens the next time...



 Comments   
Comment by Marko Mäkelä [ 2023-02-17 ]

oli, can you enable a core dump of the server and get a full stack trace of all threads the next time when it happens? Please be sure to install a -debuginfo or -dbgsym package, so that the function call parameters and their values will be available in the stack traces.

I am currently aware of 2 hanging bugs in InnoDB, which I have mentioned in MDEV-30637. Without having stack trace output, it is impossible for me to say if this hang is either of them.

Comment by Oli Sennhauser [ 2023-02-17 ]

We are working with binary tarballs. I did not find debuginfo or dbgsym tar.gz on your site (or the mirrors) and there was also not such binary in the normal tar.gz:
https://mirror.mva-n.net/mariadb/mariadb-10.6.12/
Further it is not clear to me how to "provoke" the full or partial stack trace because the documentation is not too clear to me as a non-developer. Is a backtrace the same as a stacktrace? I thougth a backtrace is from a living process and a stacktrace from a crashing process? Is it this what you want me to do? Our process is not crashing it is just hanging...
sudo gdb --batch --eval-command="set print frame-arguments all" --eval-command="thread apply all bt full" /usr/sbin/mariadbd $(pgrep -xn mariadbd) > mariadbd_full_bt_all_threads.txt

Comment by Marko Mäkelä [ 2023-02-17 ]

If the server hangs and does not crash by itself, you can attach GDB to the running process to produce stack traces of all threads. I looked up and adapted the following command from an earlier ticket:

sudo gdb -ex 'set height 0' -ex 'thread apply all backtrace' -ex detach -ex quit -p $(pgrep -x mariadbd)

It might be helpful to collect multiple successive samples, to more reliably determine which threads are actually hung, in case some polling loops are involved.

For a normal non-stripped executable, I don’t think that there will be detailed debugging information available, only function names, but no source code line numbers or information about variables. That might still be enough to identify it as a duplicate of a known bug.

Comment by Oli Sennhauser [ 2023-02-17 ]

I had the idea to extract the mariadbd + the debugsymbols from the ubuntu package and replace the tarball mariadbd with it. But there are so many files in the ddeb packages. And I have no clue how choose the right files:

https://mirror.mva-n.net/mariadb/repo/10.6/ubuntu/pool/main/m/mariadb-10.6/mariadb-server-10.6-dbgsym_10.6.12+maria~ubu2004_amd64.ddeb
https://mirror.mva-n.net/mariadb/repo/10.6/ubuntu/pool/main/m/mariadb-10.6/mariadb-server-core-10.6-dbgsym_10.6.12+maria~ubu2004_amd64.ddeb
https://mirror.mva-n.net/mariadb/repo/10.6/ubuntu/pool/main/m/mariadb-10.6/mariadb-server-10.6_10.6.12+maria~ubu2004_amd64.deb
https://mirror.mva-n.net/mariadb/repo/10.6/ubuntu/pool/main/m/mariadb-10.6/mariadb-server-core-10.6_10.6.12+maria~ubu2004_amd64.deb

Any suggestion?

How many samples and how big should the interval be?

Comment by Marko Mäkelä [ 2023-02-17 ]

I have not checked what exactly is in the -dbgsym packages. For GDB, I think that the main piece of interest are the /usr/lib/debug/.build-id//.debug files. I am not completely sure, but it might be that only the file that matches the BuildID reported by file mariadbd is needed.

You could start with 5 samples, taken a few seconds apart.

Comment by Oli Sennhauser [ 2023-02-17 ]

Hmm. I have completely destroyed my production and I have no clue why! Daemon when starting says port is already in use... I thought it is a timing issue but reverting the changes did not help. So need to failover.

For the notes:
readelf -x.gnu_debuglink mariadbd
readelf: Warning: Section '.gnu_debuglink' was not dumped because it does not exist!

Files must be copied here: /usr/lib/debug/.build-id//.debug

readelf -x.gnu_debuglink mariadbd

Hex dump of section '.gnu_debuglink':
0x00000000 34373065 65646361 30373632 64643431 470eedca0762dd41
0x00000010 38343935 65616363 66313739 36643238 8495eaccf1796d28
0x00000020 33336331 63662e64 65627567 00000000 33c1cf.debug....
0x00000030 c7a3241d ..$.

Comment by Oli Sennhauser [ 2023-02-17 ]

So I finally manage to make it work... Can you please check if this is what your want? I will failback on Monday and then wait for the prey...

Comment by Oli Sennhauser [ 2023-02-17 ]

test_backtrace.trc

Comment by Marko Mäkelä [ 2023-02-21 ]

oli, I am sorry for the delay.

In test_backtrace.trc, I see only one InnoDB thread, the { {buf_flush_page_cleaner()}} that is in an untimed wait, waiting for an explicit signal when a page is added to buf_pool.flush_list so that the wakeup condition is met. All tpool threads are idle, waiting for a task, and tpool::aio_linux::getevent_thread_routine() is blocked inside the io_getevents() system call. In other words, InnoDB is completely idle and has no work to do.

I see that Thread 15 is waiting for something in mysql_binlog_send() and MYSQL_BIN_LOG::wait_for_update_binlog_end_pos(). Also Thread 11 is waiting in MYSQL_BIN_LOG::wait_for_update_relay_log() as part of exec_relay_log_event(). Thread 10 is apparently reading something from a connection. Thread 8 (binlog_background_thread() is in a timed wait. I think that these threads must be the reason for the hang.

Can you describe your replication topology? Did you see any errors reported from anything that could be connected to Thread 8?

Comment by Andrei Elkin [ 2023-02-21 ]

oli, marko: the replication threads including 11,15 are probably in non-disturbed state that is never having received KILLed signal.
The fact that the graceful shutdown did not work out hints at something wrong in user threads.

oli, can you record thread apply all backtrace at time of SHUTDOWN sql returns (after timeout?)?

Comment by Oli Sennhauser [ 2023-02-21 ]

As Andrei commented correctly this was just a test to see if this is what you want and need it case it happens again. I wait now for the event if it happens again.
I will take some snapshots before kill, when I killed them and during shutdown...

Comment by Oli Sennhauser [ 2023-03-09 ]

Hello all

This morning we had the case again and I hope I could collect all the needed information:

  • The Problem started at: 2023-03-08 15:10
  • Time now is: 2023-03-09 08:23
  • The same 4 statements are hooked up
    SQL> SHOW PROCESSLIST;
    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Id User Host db Command Time State Info Progress

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

    5 system user   NULL Slave_IO 170209 Waiting for master to send event NULL 0.000
    6 system user   NULL Slave_SQL 17 Slave has read all relay log; waiting for more updates NULL 0.000
    156 replication maas1.fromdual.com:46668 NULL Binlog Dump 170151 Master has sent all binlog to slave; waiting for more updates NULL 0.000
    526 zabbix localhost:47818 zabbix Sleep 37   NULL 0.000
    531 zabbix localhost:47854 zabbix Sleep 41   NULL 0.000
    532 zabbix localhost:47858 zabbix Sleep 58   NULL 0.000
    533 zabbix localhost:47870 zabbix Query 63458 Update insert into history_uint (itemid,clock,ns,value) values (26927,1678260544,6,8),(26928,1678260544,5,2 0.000
    534 zabbix localhost:47878 zabbix Sleep 4   NULL 0.000
    536 zabbix localhost:47892 zabbix Sleep 1   NULL 0.000
    547 zabbix localhost:47984 zabbix Sleep 1   NULL 0.000
    550 zabbix localhost:48012 zabbix Sleep 0   NULL 0.000
    79768 zabbix localhost:54928 zabbix Query 63458 Update insert into history_uint (itemid,clock,ns,value) values (23177,1678282201,153,0),(23178,1678282201,1 0.000
    106839 zabbix localhost:46210 zabbix Query 63458 Update insert into history_uint (itemid,clock,ns,value) values (23190,1678282201,0,1),(23340,1678282201,1,1 0.000
    204630 zabbix localhost:57764 zabbix Query 63457 Update insert into history_uint (itemid,clock,ns,value) values (26936,1678260545,17,0),(26946,1678260545,6, 0.000
    236698 focmm 159.69.54.136:46316 NULL Query 48154 NULL SELECT table_schema as 'table_schema', SUM(data_length+data_free) AS 'SUM(data_length)', SUM(index_l 0.000
    286703 root localhost:50742 NULL Query 21279 Opening tables SELECT s.schema_name AS 'schema_name', IFNULL(t.engine, '') AS engine, IF(engine IS NULL, 0, COUNT(* 0.000
    317330 focmm_owner localhost:43668 focmm Sleep 4954   NULL 0.000
    317430 focmm 159.69.54.136:46226 NULL Query 4954 Opening tables SELECT table_schema as 'table_schema', SUM(data_length+data_free) AS 'SUM(data_length)', SUM(index_l 0.000
    326360 zabbix_ro localhost:46284 zabbix Query 280 Opening tables SELECT i.itemid, i.hostid, i.`key_`
    , MAX(h.clock) AS max_clock, FROM_UNIXTIME(MAX(h.clock)) AS
    0.000
    326836 root localhost NULL Query 0 starting show processlist 0.000

    -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    20 rows in set (0.000 sec)

  • It happened again short after DB restart (reboot of the whole machine after upgrade):
    SQL> status;
    --------------
    mysql Ver 15.1 Distrib 10.6.12-MariaDB, for linux-systemd (x86_64) using readline 5.1

Connection id: 326836
Current database:
Current user: root@localhost
SSL: Not in use
Current pager: stdout
Using outfile: ''
Using delimiter: ;
Server: MariaDB
Server version: 10.6.12-MariaDB-1:10.6.12+maria~ubu2004-log mariadb.org binary distribution
Protocol version: 10
Connection: Localhost via UNIX socket
Server characterset: latin1
Db characterset: latin1
Client characterset: utf8mb4
Conn. characterset: utf8mb4
UNIX socket: /var/run/mysqld/mysql-33006.sock
Uptime: 1 day 23 hours 17 min 53 sec

Threads: 36 Questions: 4538579 Slow queries: 2893 Opens: 736 Open tables: 730 Queries per second avg: 26.654

  • I run the following command (result is attached):

for i in $(seq 5) ; do
gdb -ex 'set height 0' -ex 'thread apply all backtrace' -ex detach -ex quit -p $(pgrep -x mysqld) > /tmp/gdb_$i.trc
sleep 5
done

  • Restart of database (kill -9 was necessary) shows again 4 trx:

2023-03-09 8:19:33 0 [Note] bin/mysqld (initiated by: unknown): Normal shutdown
2023-03-09 8:19:33 6 [Note] Error reading relay log event: slave SQL thread was killed
2023-03-09 8:19:33 6 [Note] Slave SQL thread exiting, replication stopped in log 'prod1_binlog.002944' at position 127971388, master: 116.203.122.76:33006
2023-03-09 8:19:33 5 [Note] Slave I/O thread exiting, read up to log 'prod1_binlog.002944', position 127971388, master 116.203.122.76:33006
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 326360 (user : 'zabbix_ro') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 317430 (user : 'focmm') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 286703 (user : 'root') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 236698 (user : 'focmm') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 204630 (user : 'zabbix') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 106839 (user : 'zabbix') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 79768 (user : 'zabbix') did not exit
2023-03-09 8:19:54 0 [Warning] bin/mysqld: Thread 533 (user : 'zabbix') did not exit


2023-03-09 8:21:39 0 [Note] Starting MariaDB 10.6.12-MariaDB-1:10.6.12+maria~ubu2004-log source revision 4c79e15cc3716f69c044d4287ad2160da8101cdc as process 1342234
2023-03-09 8:21:39 0 [Note] InnoDB: Compressed tables use zlib 1.2.11
2023-03-09 8:21:39 0 [Note] InnoDB: Using transactional memory
2023-03-09 8:21:39 0 [Note] InnoDB: Number of pools: 1
2023-03-09 8:21:39 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2023-03-09 8:21:39 0 [Note] InnoDB: Using Linux native AIO
2023-03-09 8:21:39 0 [Note] InnoDB: Initializing buffer pool, total size = 25769803776, chunk size = 134217728
2023-03-09 8:21:40 0 [Note] InnoDB: Completed initialization of buffer pool
2023-03-09 8:21:40 0 [Note] InnoDB: Starting crash recovery from checkpoint LSN=37250963749232,37250972062794
2023-03-09 8:21:46 0 [Note] InnoDB: 4 transaction(s) which must be rolled back or cleaned up in total 2136 row operations to undo
2023-03-09 8:21:46 0 [Note] InnoDB: Trx id counter is 6922689151
2023-03-09 8:21:46 0 [Note] InnoDB: Starting final batch to recover 33239 pages from redo log.
2023-03-09 8:21:48 0 [Note] InnoDB: Last binlog file '/home/mysql/database/prod2/binlog/prod2_binlog.002993', position 81326734
2023-03-09 8:21:48 0 [Note] InnoDB: 128 rollback segments are active.
2023-03-09 8:21:48 0 [Note] InnoDB: Removed temporary tablespace data file: "./ibtmp1"
2023-03-09 8:21:48 0 [Note] InnoDB: Creating shared tablespace for temporary tables
2023-03-09 8:21:48 0 [Note] InnoDB: Setting file './ibtmp1' size to 12 MB. Physically writing the file full; Please wait ...
2023-03-09 8:21:48 0 [Note] InnoDB: Starting in background the rollback of recovered transactions
2023-03-09 8:21:48 0 [Note] InnoDB: File './ibtmp1' size is now 12 MB.
2023-03-09 8:21:48 0 [Note] InnoDB: 10.6.12 started; log sequence number 37251264384497; transaction id 6922689152
2023-03-09 8:21:48 0 [Note] Plugin 'FEEDBACK' is disabled.
2023-03-09 8:21:48 0 [Note] InnoDB: Loading buffer pool(s) from /home/mysql/database/prod2/data/ib_buffer_pool
2023-03-09 8:21:48 0 [ERROR] mysqld: Can't open shared library '/home/mysql/product/mariadb-10.6.12-linux-systemd-x86_64/lib/mysql/plugin/metadata_lock_info.so' (errno: 0, cannot open shared object file: No such file or directory)
2023-03-09 8:21:48 0 [Note] Recovering after a crash using /home/mysql/database/prod2/binlog/prod2_binlog
2023-03-09 8:21:48 0 [Note] Starting table crash recovery...
2023-03-09 8:21:48 0 [Note] Crash table recovery finished.
2023-03-09 8:21:48 0 [Note] Server socket created on IP: '0.0.0.0'.
2023-03-09 8:21:48 0 [Note] Server socket created on IP: '::'.
2023-03-09 8:21:48 0 [Note] Reading of all Master_info entries succeeded
2023-03-09 8:21:48 0 [Note] Added new Master_info '' to hash table
2023-03-09 8:21:48 0 [Note] InnoDB: Rolled back recovered transaction 6922409938
2023-03-09 8:21:48 5 [Note] Slave I/O thread: Start asynchronous replication to master 'replication@116.203.122.76:33006' in log 'prod1_binlog.002944' at position 127971388
2023-03-09 8:21:48 0 [Note] bin/mysqld: ready for connections.
Version: '10.6.12-MariaDB-1:10.6.12+maria~ubu2004-log' socket: '/var/run/mysqld/mysql-33006.sock' port: 33006 mariadb.org binary distribution
2023-03-09 8:21:48 6 [Note] Slave SQL thread initialized, starting replication in log 'prod1_binlog.002944' at position 127971388, relay log './prod2-relay-log.007212' position: 508
2023-03-09 8:21:48 5 [Note] Slave I/O thread: connected to master 'replication@116.203.122.76:33006',replication started in log 'prod1_binlog.002944' at position 127971388
2023-03-09 8:21:48 9 [Warning] Aborted connection 9 to db: 'unconnected' user: 'unauthenticated' host: 'localhost' (This connection closed normally without authentication)
2023-03-09 8:21:49 0 [Note] InnoDB: Rolled back recovered transaction 6922409936
2023-03-09 8:21:49 0 [Note] InnoDB: Rolled back recovered transaction 6922409937
2023-03-09 8:21:49 0 [Note] InnoDB: Rolled back recovered transaction 6922410016
2023-03-09 8:21:49 0 [Note] InnoDB: Rollback of non-prepared transactions completed
2023-03-09 8:21:56 0 [Note] InnoDB: Buffer pool(s) load completed at 230309 8:21:56
2023-03-09 8:22:01 44 [Warning] Hostname 'maas2.fromdual.com' does not resolve to '159.69.54.136'.
2023-03-09 8:22:01 44 [Note] Hostname 'maas2.fromdual.com' has the following IP addresses:
2023-03-09 8:22:01 44 [Note] - 127.0.1.1
2023-03-09 8:22:01 44 [Warning] Aborted connection 44 to db: 'unconnected' user: 'unauthenticated' host: '159.69.54.136' (This connection closed normally without authentication)
2023-03-09 8:22:01 45 [Warning] Hostname 'maas2.fromdual.com' does not resolve to '159.69.54.136'.
2023-03-09 8:22:01 45 [Note] Hostname 'maas2.fromdual.com' has the following IP addresses:
2023-03-09 8:22:01 45 [Note] - 127.0.1.1

  • I hop this helps. Please let me know if you need some more information...
Comment by Oli Sennhauser [ 2023-03-09 ]

mdev_30665_1.tar.gz

Comment by Daniel Black [ 2023-03-09 ]

The backtraces looks very much like those analyzed yesterday in MDEV-30481. Without a "thread apply all bt full" ref I haven't quite got enough details to the exact match that Marko did.

Comment by Oli Sennhauser [ 2023-03-14 ]

Happened again this morning. Collected some data. It seems to be somehow related to the machine restart:

14. Feb 09:30 restart machine + database
16. Feb 20:30 problem started (threads running) + 59 h
16. Feb 23:35 restart database
17. Feb 17:45 restart database + machine
17. Feb 18:45 restart database
17. Feb 20:25 restart database + machine

07. Mar 08:45 restart machine + database
08. Mar 15:10 problem started (threads running) + 29.5 h

09. Mar 08:10 restart database
13. Mar 19:45 restart database + machine
14. Mar 10:45 problem started (threads running) + 15h
14. Mar 11:55 restart database

I will restart the machine again now. So I expect the problem to appear again within the next 48 hours.

If you provide me a fixed binary I can try if problem goes away.

For the note: I should try if the problem does not happen when just the DB is restarted...

Comment by Andrei Elkin [ 2023-03-16 ]

I walked through one of the traces to find few threads in a waiting state like in the following:

Thread 44 (Thread 0x7f6e18b0d700 (LWP 239680)):
#0  0x00007f744349573d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005611cf2f7378 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7f7403800738) at ./storage/innobase/sync/srw_lock.cc:238
#2  srw_mutex_impl<true>::wait_and_lock (this=this@entry=0x7f7403800738) at ./storage/innobase/sync/srw_lock.cc:326
#3  0x00005611cf2f74de in srw_mutex_impl<true>::wr_lock (this=0x7f740380073c) at /usr/include/c++/9/bits/atomic_base.h:549
#4  ssux_lock_impl<true>::rd_wait (this=this@entry=0x7f7403800738) at ./storage/innobase/sync/srw_lock.cc:392
#5  0x00005611cec02865 in ssux_lock_impl<true>::rd_lock (this=0x7f7403800738) at ./include/mysql/psi/mysql_thread.h:786
#6  sux_lock<ssux_lock_impl<true> >::s_lock (this=<optimized out>) at ./storage/innobase/include/sux_lock.h:362
#7  buf_page_get_low (page_id=..., zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at ./storage/innobase/buf/buf0buf.cc:2826
#8  0x00005611cf33c09e in btr_cur_t::search_leaf (this=this@entry=0x7f6e18b0a7a0, tuple=tuple@entry=0x7f6d3401b3f0, mode=mode@entry=PAGE_CUR_LE, latch_mode=latch_mode@entry=BTR_MODIFY_LEAF, mtr=mtr@entry=0x7f6e18b0ae10) at ./storage/innobase/btr/btr0cur.cc:1089
#9  0x00005611cf2b47ed in btr_pcur_open (mtr=0x7f6e18b0ae10, cursor=0x7f6e18b0a7a0, latch_mode=BTR_MODIFY_LEAF, mode=PAGE_CUR_LE, tuple=0x7f6d3401b3f0) at ./storage/innobase/include/btr0pcur.h:431
#10 row_ins_clust_index_entry_low (flags=0, mode=BTR_MODIFY_LEAF, index=0x7f6dfc009f70, n_uniq=0, entry=0x7f6d3401b3f0, n_ext=0, thr=0x7f6d78046260) at ./storage/innobase/row/row0ins.cc:2713
#11 0x00005611cf2b9853 in row_ins_clust_index_entry (index=0x7f6dfc009f70, entry=0x7f6d3401b3f0, thr=0x7f6d78046260, n_ext=0) at ./storage/innobase/row/row0ins.cc:3271
#12 0x00005611cf2ba15e in row_ins_index_entry (thr=0x7f6d78046260, entry=<optimized out>, index=<optimized out>) at ./storage/innobase/row/row0ins.cc:3397
#13 row_ins_index_entry_step (thr=0x7f6d78046260, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3565
#14 row_ins (thr=0x7f6d78046260, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3690
#15 row_ins_step (thr=thr@entry=0x7f6d78046260) at ./storage/innobase/row/row0ins.cc:3819
#16 0x00005611cf2c9ae5 in row_insert_for_mysql (mysql_rec=mysql_rec@entry=0x7f6d78038cb8 "\377\070i", prebuilt=0x7f6d78045870, ins_mode=ROW_INS_NORMAL) at ./storage/innobase/row/row0mysql.cc:1308
#17 0x00005611cf220a1a in ha_innobase::write_row (this=0x7f6d7803c3d0, record=0x7f6d78038cb8 "\377\070i") at ./storage/innobase/handler/ha_innodb.cc:7912
#18 0x00005611cef2efd0 in handler::ha_write_row (this=0x7f6d7803c3d0, buf=buf@entry=0x7f6d78038cb8 "\377\070i") at ./sql/handler.cc:7614
#19 0x00005611cf159a79 in ha_partition::write_row (this=0x7f6d78039160, buf=0x7f6d78038cb8 "\377\070i") at ./sql/ha_partition.cc:4575
#20 0x00005611cef2f1cf in handler::ha_write_row (this=0x7f6d78039160, buf=0x7f6d78038cb8 "\377\070i") at ./sql/handler.cc:7614
#21 0x00005611cecac85d in write_record (thd=thd@entry=0x7f6dcc008028, table=table@entry=0x7f6d78038838, info=info@entry=0x7f6e18b0bc10, sink=sink@entry=0x0) at ./sql/sql_insert.cc:2156
#22 0x00005611cecb31a2 in mysql_insert (thd=thd@entry=0x7f6dcc008028, table_list=<optimized out>, fields=..., values_list=..., update_fields=..., update_values=..., duplic=<optimized out>, ignore=<optimized out>, result=<optimized out>) at ./sql/sql_insert.cc:1128
 
 
Thread 38 (Thread 0x7f6df07ad700 (LWP 1716)):
#0  0x00007f744349573d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005611cf2f72c8 in srw_mutex_impl<false>::wait (lk=<optimized out>, this=0x5611d1cfc410) at ./storage/innobase/sync/srw_lock.cc:238
#2  srw_mutex_impl<false>::wait_and_lock (this=this@entry=0x5611d1cfc410) at ./storage/innobase/sync/srw_lock.cc:326
#3  0x00005611cf2f755e in srw_mutex_impl<false>::wr_lock (this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:549
#4  ssux_lock_impl<false>::rd_wait (this=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:392
#5  0x00005611cf2f766b in ssux_lock_impl<false>::rd_lock (this=<optimized out>) at /usr/include/c++/9/bits/atomic_base.h:448
#6  srw_lock_impl<false>::psi_rd_lock (this=0x5611d1cfc408, file=file@entry=0x5611cf6c01a0 "/home/buildbot/buildbot/build/mariadb-10.6.12/storage/innobase/include/fil0fil.h", line=line@entry=1071) at ./storage/innobase/sync/srw_lock.cc:453
#7  0x00005611cf21f156 in srw_lock_impl<false>::rd_lock (line=1071, file=0x5611cf6c01a0 "/home/buildbot/buildbot/build/mariadb-10.6.12/storage/innobase/include/fil0fil.h", this=0x5611d1cfc408) at ./storage/innobase/handler/ha_innodb.cc:14546
#8  fil_space_t::s_lock (this=0x5611d1cfc368) at ./storage/innobase/include/fil0fil.h:1071
#9  ha_innobase::info_low (this=0x7f6d28ba0040, flag=282, is_analyze=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:14839
#10 0x00005611cf224e07 in ha_innobase::info (flag=282, this=0x7f6d28ba0040) at ./storage/innobase/handler/ha_innodb.cc:15041
#11 ha_innobase::open (this=<optimized out>, name=<optimized out>) at ./storage/innobase/handler/ha_innodb.cc:6191
#12 0x00005611cef2641a in handler::ha_open (this=0x7f6d28ba0040, table_arg=<optimized out>, name=name@entry=0x7f6df07a71e0 "./zabbix/history_uint#P#p2023_kw11", mode=2, test_if_locked=1042, mem_root=mem_root@entry=0x0, partitions_to_open=0x0) at ./sql/handler.cc:3351
#13 0x00005611cf15e9e6 in ha_partition::open_read_partitions (this=this@entry=0x7f6d288887b0, name_buff=name_buff@entry=0x7f6df07a71e0 "./zabbix/history_uint#P#p2023_kw11", name_buff_size=name_buff_size@entry=513) at ./sql/ha_partition.cc:8832
#14 0x00005611cf15f1e5 in ha_partition::open (this=0x7f6d288887b0, name=0x7f6da4bfe2f0 "./zabbix/history_uint", test_if_locked=<optimized out>, mode=<optimized out>) at ./sql/ha_partition.cc:3812
#15 0x00005611cef2641a in handler::ha_open (this=0x7f6d288887b0, table_arg=table_arg@entry=0x7f6d28afd948, name=0x7f6da4bfe2f0 "./zabbix/history_uint", mode=mode@entry=2, test_if_locked=test_if_locked@entry=18, mem_root=mem_root@entry=0x0, partitions_to_open=0x0) at ./sql/handler.cc:3351
#16 0x00005611cedd0413 in open_table_from_share (thd=thd@entry=0x7f6d28000c58, share=share@entry=0x7f6da4bfde40, alias=alias@entry=0x7f6d28b1f638, db_stat=db_stat@entry=33, prgflag=prgflag@entry=8, ha_open_flags=18, outparam=<optimized out>, is_create_table=<optimized out>, partitions_to_open=<optimized out>) at ./sql/table.cc:4444
#17 0x00005611cec7ca9b in open_table (thd=0x7f6d28000c58, table_list=0x7f6d28b1f5f0, ot_ctx=0x7f6df07a7c30) at ./sql/sql_base.cc:2036
#18 0x00005611cec7fa02 in open_and_process_table (ot_ctx=0x7f6df07a7c30, has_prelocking_list=false, prelocking_strategy=0x7f6df07a7cc8, flags=1090, counter=0x7f6df07a7cc4, tables=0x7f6d28b1f5f0, thd=0x7f6d28000c58) at ./sql/sql_base.cc:3847
#19 open_tables (thd=thd@entry=0x7f6d28000c58, options=..., start=start@entry=0x7f6df07a7cb8, counter=counter@entry=0x7f6df07a7cc4, flags=1090, prelocking_strategy=prelocking_strategy@entry=0x7f6df07a7cc8) at ./sql/sql_base.cc:4330
#20 0x00005611cec80463 in open_tables (prelocking_strategy=0x7f6df07a7cc8, flags=<optimized out>, counter=0x7f6df07a7cc4, tables=0x7f6df07a7cb8, thd=0x7f6d28000c58) at ./sql/sql_base.h:265
#21 open_normal_and_derived_tables (thd=thd@entry=0x7f6d28000c58, tables=<optimized out>, tables@entry=0x7f6d28b1f5f0, flags=<optimized out>, dt_phases=dt_phases@entry=3) at ./sql/sql_base.cc:5367
#22 0x00005611cec804fe in open_tables_only_view_structure (thd=thd@entry=0x7f6d28000c58, table_list=table_list@entry=0x7f6d28b1f5f0, can_deadlock=can_deadlock@entry=false) at ./sql/sql_base.cc:5417
#23 0x00005611ced5b11f in fill_schema_table_by_open (thd=0x7f6d28000c58, mem_root=<optimized out>, is_show_fields_or_keys=<optimized out>, table=0x7f6d282af060, schema_table=0x5611cfc5aa80 <schema_tables+2176>, orig_db_name=0x7f6d28af6570, orig_table_name=0x7f6d28acee80, open_tables_state_backup=0x7f6df07a9df0, can_deadlock=false) at ./sql/sql_show.cc:4671
#24 0x00005611ced722e2 in get_all_tables (thd=0x7f6d28000c58, tables=0x7f6d28015aa8, cond=<optimized out>) at ./sql/sql_show.cc:5342
#25 0x00005611ced73a5e in get_schema_tables_result (join=join@entry=0x7f6d282a74a8, executed_place=executed_place@entry=PROCESSED_BY_JOIN_EXEC) at ./sql/sql_show.cc:8881
#26 0x00005611ced569c7 in JOIN::exec_inner (this=0x7f6d282a74a8) at ./sql/sql_select.cc:4761
#27 0x00005611ced57229 in JOIN::exec (this=this@entry=0x7f6d282a74a8) at ./sql/sql_select.cc:4583
#28 0x00005611ced5536a in mysql_select (thd=0x7f6d28000c58, tables=0x7f6d28015340, fields=..., conds=0x7f6d28018200, og_num=4, order=0x7f6d28018fa0, group=0x7f6d28018450, having=0x0, proc_param=0x0, select_options=<optimized out>, result=0x7f6d282a7480, unit=0x7f6d28004e08, select_lex=0x7f6d280135b8) at ./sql/sql_select.cc:5062
#29 0x00005611ced55b27 in handle_select (thd=thd@entry=0x7f6d28000c58, lex=lex@entry=0x7f6d28004d40, result=result@entry=0x7f6d282a7480, setup_tables_done_option=setup_tables_done_option@entry=0) at ./sql/sql_select.cc:554
#30 0x00005611cece3581 in execute_sqlcom_select (thd=0x7f6d28000c58, all_tables=0x7f6d28015340) at ./sql/sql_parse.cc:6261
#31 0x00005611cecf1bb0 in mysql_execute_command (thd=0x7f6d28000c58, is_called_from_prepared_stmt=<optimized out>) at ./sql/sql_parse.cc:3949
 
Thread 33 (Thread 0x7f6e0812c700 (LWP 1643)):
#0  0x00007f744349573d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005611cf2f7378 in srw_mutex_impl<true>::wait (lk=<optimized out>, this=0x7f7233866d58) at ./storage/innobase/sync/srw_lock.cc:238
#2  srw_mutex_impl<true>::wait_and_lock (this=this@entry=0x7f7233866d58) at ./storage/innobase/sync/srw_lock.cc:326
#3  0x00005611cec0284c in srw_mutex_impl<true>::wr_lock (this=0x7f7233866d58) at ./include/mysql/psi/mysql_thread.h:786
#4  ssux_lock_impl<true>::wr_lock (this=0x7f7233866d58) at ./storage/innobase/include/srw_lock.h:257
#5  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f7233866d58) at ./storage/innobase/include/sux_lock.h:428
#6  buf_page_get_low (page_id=..., zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at ./storage/innobase/buf/buf0buf.cc:2849
#7  0x00005611cf33c09e in btr_cur_t::search_leaf (this=this@entry=0x7f6e081297a0, tuple=tuple@entry=0x7f6d84030270, mode=mode@entry=PAGE_CUR_LE, latch_mode=latch_mode@entry=BTR_MODIFY_LEAF, mtr=mtr@entry=0x7f6e08129e10) at ./storage/innobase/btr/btr0cur.cc:1089
#8  0x00005611cf2b47ed in btr_pcur_open (mtr=0x7f6e08129e10, cursor=0x7f6e081297a0, latch_mode=BTR_MODIFY_LEAF, mode=PAGE_CUR_LE, tuple=0x7f6d84030270) at ./storage/innobase/include/btr0pcur.h:431
#9  row_ins_clust_index_entry_low (flags=0, mode=BTR_MODIFY_LEAF, index=0x7f6dfc009f70, n_uniq=0, entry=0x7f6d84030270, n_ext=0, thr=0x7f6d78056f10) at ./storage/innobase/row/row0ins.cc:2713
#10 0x00005611cf2b9853 in row_ins_clust_index_entry (index=0x7f6dfc009f70, entry=0x7f6d84030270, thr=0x7f6d78056f10, n_ext=0) at ./storage/innobase/row/row0ins.cc:3271
#11 0x00005611cf2ba15e in row_ins_index_entry (thr=0x7f6d78056f10, entry=<optimized out>, index=<optimized out>) at ./storage/innobase/row/row0ins.cc:3397
#12 row_ins_index_entry_step (thr=0x7f6d78056f10, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3565
#13 row_ins (thr=0x7f6d78056f10, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3690
#14 row_ins_step (thr=thr@entry=0x7f6d78056f10) at ./storage/innobase/row/row0ins.cc:3819
#15 0x00005611cf2c9ae5 in row_insert_for_mysql (mysql_rec=mysql_rec@entry=0x7f6d78049d08 "\377\230S\001", prebuilt=0x7f6d78056520, ins_mode=ROW_INS_NORMAL) at ./storage/innobase/row/row0mysql.cc:1308
#16 0x00005611cf220a1a in ha_innobase::write_row (this=0x7f6d7804d0e0, record=0x7f6d78049d08 "\377\230S\001") at ./storage/innobase/handler/ha_innodb.cc:7912
#17 0x00005611cef2efd0 in handler::ha_write_row (this=0x7f6d7804d0e0, buf=buf@entry=0x7f6d78049d08 "\377\230S\001") at ./sql/handler.cc:7614
#18 0x00005611cf159a79 in ha_partition::write_row (this=0x7f6d7804a0b0, buf=0x7f6d78049d08 "\377\230S\001") at ./sql/ha_partition.cc:4575
#19 0x00005611cef2f1cf in handler::ha_write_row (this=0x7f6d7804a0b0, buf=0x7f6d78049d08 "\377\230S\001") at ./sql/handler.cc:7614
#20 0x00005611cecac85d in write_record (thd=thd@entry=0x7f6d34000c58, table=table@entry=0x7f6d78049888, info=info@entry=0x7f6e0812ac10, sink=sink@entry=0x0) at ./sql/sql_insert.cc:2156
#21 0x00005611cecb31a2 in mysql_insert (thd=thd@entry=0x7f6d34000c58, table_list=<optimized out>, fields=..., values_list=..., update_fields=..., update_values=
 
 
								      Thread 32 (Thread 0x7f6e08177700 (LWP 1642)):
#0  0x00007f744349573d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005611cf2f7418 in ssux_lock_impl<true>::wait (lk=<optimized out>, this=0x7f7233866df8) at ./storage/innobase/sync/srw_lock.cc:244
#2  ssux_lock_impl<true>::wr_wait (this=this@entry=0x7f7233866df8, lk=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:378
#3  0x00005611cec023af in ssux_lock_impl<true>::wr_lock (this=0x7f7233866df8) at /usr/include/c++/9/bits/atomic_base.h:413
#4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f7233866df8) at ./storage/innobase/include/sux_lock.h:428
#5  buf_page_get_low (page_id=..., zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at ./storage/innobase/buf/buf0buf.cc:2849
#6  0x00005611cf325f1d in btr_block_get (index=..., page=<optimized out>, mode=<optimized out>, merge=<optimized out>, mtr=<optimized out>, err=0x7f6e08173d3c) at ./storage/innobase/include/buf0types.h:95
#7  0x00005611cf32b3b1 in btr_insert_into_right_sibling (mtr=0x7f6e08174e10, n_ext=0, tuple=0x7f6d48091390, heap=<optimized out>, offsets=0x7f6e08174058, cursor=0x7f6e08174060, flags=7) at ./storage/innobase/include/btr0cur.h:744
#8  btr_page_split_and_insert (flags=flags@entry=7, cursor=cursor@entry=0x7f6e08174060, offsets=offsets@entry=0x7f6e08174058, heap=heap@entry=0x7f6e08174050, tuple=tuple@entry=0x7f6d48091390, n_ext=0, mtr=0x7f6e08174e10, err=0x7f6e08173fb0) at ./storage/innobase/btr/btr0btr.cc:2998
#9  0x00005611cf33efcd in btr_cur_pessimistic_insert (flags=flags@entry=7, cursor=cursor@entry=0x7f6e08174060, offsets=offsets@entry=0x7f6e08174058, heap=heap@entry=0x7f6e08174050, entry=entry@entry=0x7f6d48091390, rec=rec@entry=0x7f6e08174048, big_rec=0x7f6e08174040, n_ext=<optimized out>, thr=0x0, mtr=0x7f6e08174e10) at ./storage/innobase/include/buf0types.h:135
#10 0x00005611cf328dd5 in btr_insert_on_non_leaf_level (flags=7, flags@entry=0, index=index@entry=0x7f6dfc009f70, level=level@entry=1, tuple=0x7f6d48091390, mtr=mtr@entry=0x7f6e08174e10) at ./storage/innobase/btr/btr0btr.cc:2473
#11 0x00005611cf32be83 in btr_attach_half_pages (mtr=0x7f6e08174e10, direction=111, new_block=0x7f7233866e80, split_rec=<optimized out>, block=0x7f7233866c00, index=0x7f6dfc009f70, flags=0) at ./storage/innobase/btr/btr0btr.cc:2605
#12 btr_page_split_and_insert (flags=flags@entry=0, cursor=cursor@entry=0x7f6e081747a0, offsets=offsets@entry=0x7f6e08174788, heap=heap@entry=0x7f6e08174780, tuple=tuple@entry=0x7f6d540078d0, n_ext=0, mtr=0x7f6e08174e10, err=0x7f6e08174690) at ./storage/innobase/btr/btr0btr.cc:3121
#13 0x00005611cf33efcd in btr_cur_pessimistic_insert (flags=flags@entry=0, cursor=cursor@entry=0x7f6e081747a0, offsets=offsets@entry=0x7f6e08174788, heap=heap@entry=0x7f6e08174780, entry=entry@entry=0x7f6d540078d0, rec=rec@entry=0x7f6e08174798, big_rec=0x7f6e08174778, n_ext=<optimized out>, thr=0x7f6d78035da0, mtr=0x7f6e08174e10) at ./storage/innobase/include/buf0types.h:135
#14 0x00005611cf2b51fa in row_ins_clust_index_entry_low (flags=<optimized out>, mode=<optimized out>, index=0x7f6dfc009f70, n_uniq=<optimized out>, entry=0x7f6d540078d0, n_ext=<optimized out>, thr=<optimized out>) at ./storage/innobase/row/row0ins.cc:2898
#15 0x00005611cf2b9917 in row_ins_clust_index_entry (index=0x7f6dfc009f70, entry=0x7f6d540078d0, thr=0x7f6d78035da0, n_ext=0) at ./storage/innobase/row/row0ins.cc:3288
#16 0x00005611cf2ba15e in row_ins_index_entry (thr=0x7f6d78035da0, entry=<optimized out>, index=<optimized out>) at ./storage/innobase/row/row0ins.cc:3397
#17 row_ins_index_entry_step (thr=0x7f6d78035da0, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3565
#18 row_ins (thr=0x7f6d78035da0, node=<optimized out>) at ./storage/innobase/row/row0ins.cc:3690
#19 row_ins_step (thr=thr@entry=0x7f6d78035da0) at ./storage/innobase/row/row0ins.cc:3819
#20 0x00005611cf2c9ae5 in row_insert_for_mysql (mysql_rec=mysql_rec@entry=0x7f6d780214b8 "\377^\271", prebuilt=0x7f6d780353b0, ins_mode=ROW_INS_NORMAL) at ./storage/innobase/row/row0mysql.cc:1308
#21 0x00005611cf220a1a in ha_innobase::write_row (this=0x7f6d78017420, record=0x7f6d780214b8 "\377^\271") at ./storage/innobase/handler/ha_innodb.cc:7912
#22 0x00005611cef2efd0 in handler::ha_write_row (this=0x7f6d78017420, buf=buf@entry=0x7f6d780214b8 "\377^\271") at ./sql/handler.cc:7614
#23 0x00005611cf159a79 in ha_partition::write_row (this=0x7f6d78021860, buf=0x7f6d780214b8 "\377^\271") at ./sql/ha_partition.cc:4575
#24 0x00005611cef2f1cf in handler::ha_write_row (this=0x7f6d78021860, buf=0x7f6d780214b8 "\377^\271") at ./sql/handler.cc:7614
#25 0x00005611cecac85d in write_record (thd=thd@entry=0x7f6d48000c58, table=table@entry=0x7f6d780077c8, info=info@entry=0x7f6e08175c10, sink=sink@entry=0x0) at ./sql/sql_insert.cc:2156
#26 0x00005611cecb31a2 in mysql_insert (thd=thd@entry=0x7f6d48000c58, table_list=<optimized out>, fields=..., values_list=..., update_fields=..., update_values=

marko certainly has take a look.

Comment by Oli Sennhauser [ 2023-03-23 ]

Happened again last night at 2023-03-22 21:14 After about 9 days. Restarted database expecting that it does not happen again when database is restarted without server reboot.
New theory: It happens after some select when clicking in the GUI.

Comment by Oli Sennhauser [ 2023-03-28 ]

I definitely happens also WITHOUT machine reboot. So this can be excluded. We also used the GUI heavily the last days and it did not trigger the problem. So also this is not the reason of the problem.

Comment by Marko Mäkelä [ 2023-03-30 ]

oli, a user just today reported that with a custom build that includes a fix of MDEV-29835, he had an uptime of 11 days before he had to restart the database to move it to larger storage. The previous record (without the fix) was 5 days, and usually it hung every day or every second day.

Can you try the same? It was https://ci.mariadb.org/33215/ for the original custom build, but the fix is included in any recent development snapshot of the 10.6 branches or later. Example: click on the leftmost link next to tarball-docker in https://buildbot.mariadb.org/#/grid?branch=10.6 to see the build number: https://buildbot.mariadb.org/#/builders/1/builds/33837 for the currently latest 10.6 branch. Then, navigate to https://ci.mariadb.org/33837/ or right to your platform, say, https://ci.mariadb.org/33837/amd64-ubuntu-2004-deb-autobake/debs/ and download the packages there.

Comment by Marko Mäkelä [ 2023-03-30 ]

I checked the file gdb_1.trc in mdev_30665_1.tar.gz. In that file, Thread 32 is performing a re-entrant call to btr_cur_pessimistic_insert(). This matches stack traces that have been posted to MDEV-29835. If any of Thread 13, 33, 44, 45 are operating on the same index tree and same index pages, then this would be a duplicate of MDEV-29835.

There is not enough information to say for sure if this is a duplicate of MDEV-29835. If you had issued the following commands, we could say it for sure, because then we would have the buffer page descriptor addresses:

set print frame-arguments all
thread apply all backtrace

Some more detail:

mariadb-10.6.12

Thread 32 (Thread 0x7f6e08177700 (LWP 1642)):
#0  0x00007f744349573d in syscall () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00005611cf2f7418 in ssux_lock_impl<true>::wait (lk=<optimized out>, this=0x7f7233866df8) at ./storage/innobase/sync/srw_lock.cc:244
#2  ssux_lock_impl<true>::wr_wait (this=this@entry=0x7f7233866df8, lk=<optimized out>) at ./storage/innobase/sync/srw_lock.cc:378
#3  0x00005611cec023af in ssux_lock_impl<true>::wr_lock (this=0x7f7233866df8) at /usr/include/c++/9/bits/atomic_base.h:413
#4  sux_lock<ssux_lock_impl<true> >::x_lock_upgraded (this=0x7f7233866df8) at ./storage/innobase/include/sux_lock.h:428
#5  buf_page_get_low (page_id=..., zip_size=<optimized out>, rw_latch=<optimized out>, guess=<optimized out>, mode=<optimized out>, mtr=<optimized out>, err=<optimized out>, allow_ibuf_merge=<optimized out>) at ./storage/innobase/buf/buf0buf.cc:2849

On 64-bit platforms, the block descriptor address in 10.6 or later branches should be 0x18 bytes less than the address of the lock. In this case, it would have been 0x7f7233866de0. We’d have to find the mtr_t::memo entries for this address. This is the block that Thread 32 (the likely culprit) is trying to acquire a lock on. Then, we’d have to check the mtr_t::m_memo of Thread 32 for the block descriptor addresses on which at least one blocked thread in btr_cur_t::search() is waiting for a lock. This should confirm the hang to be a duplicate of MDEV-29835.

Comment by Oli Sennhauser [ 2023-03-30 ]

Hi Marko

I used this one: https://ci.mariadb.org/33837/amd64-ubuntu-2004/mariadb-10.6.13-linux-x86_64.tar.gz

After restart:

-----------------------------------------------------------------+

Variable_name Value

-----------------------------------------------------------------+

version 10.6.13-MariaDB-log
version_comment Source distribution
version_compile_machine x86_64
version_compile_os Linux
version_malloc_library system
version_source_revision 0760ad3336521b85454b522390eb972a606def23
version_ssl_library OpenSSL 1.1.1f 31 Mar 2020

-----------------------------------------------------------------+

Comment by Marko Mäkelä [ 2023-03-30 ]

Hi Oli, yes, that is the current head of the 10.6 branch. It also includes some performance fixes (MDEV-26055, MDEV-26827).

Comment by Oli Sennhauser [ 2023-04-06 ]

Hi Marko, It is stable so far until now. I am in vacation now. Hope it does not happen during my vacation again...

Comment by Oli Sennhauser [ 2023-04-17 ]

Hi Marko, back from vacation and it did NOT happen any more. So I consider this issues fixed with 10.6.13. Thanks, Oli

Comment by Marko Mäkelä [ 2023-04-17 ]

Thank you, oli, for confirming the fix. I have the feeling that the partial fix in 10.6.12 (MDEV-30400) increased the probability of the hangs. The fix of MDEV-21136 in 10.6.9 made the hangs more probable, but we did not get that many external reports for 10.6.10 or 10.6.11.

Generated at Thu Feb 08 10:17:58 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.