[MDEV-31141] mariadbd hangs Created: 2023-04-27  Updated: 2023-05-30  Resolved: 2023-05-30

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 11.0.1
Fix Version/s: N/A

Type: Bug Priority: Critical
Reporter: Denis Chernyaev Assignee: Unassigned
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

CentOS 8
Kernel 4.18.0-486



 Description   

Yesterday we updated from 10.11 to 11.0.1. About 8 hours later the system slowed to a crawl and journalctl was full of messages like this:

Apr 26 22:45:30 LSN-D4179 kernel: INFO: task mariadbd:17017 blocked for more than 120 seconds.
Apr 26 22:45:30 LSN-D4179 kernel:       Not tainted 4.18.0-486.el8.x86_64 #1
Apr 26 22:45:30 LSN-D4179 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:45:30 LSN-D4179 kernel: task:mariadbd        state:D stack:    0 pid:17017 ppid:     1 flags:0x00000080
Apr 26 22:45:30 LSN-D4179 kernel: Call Trace:
Apr 26 22:45:30 LSN-D4179 kernel:  __schedule+0x2d1/0x870
Apr 26 22:45:30 LSN-D4179 kernel:  schedule+0x55/0xf0
Apr 26 22:45:30 LSN-D4179 kernel:  io_schedule+0x12/0x40
Apr 26 22:45:30 LSN-D4179 kernel:  migration_entry_wait_on_locked+0x1ea/0x290
Apr 26 22:45:30 LSN-D4179 kernel:  ? filemap_fdatawait_keep_errors+0x50/0x50
Apr 26 22:45:30 LSN-D4179 kernel:  do_swap_page+0x5b0/0x710
Apr 26 22:45:30 LSN-D4179 kernel:  ? pmd_devmap_trans_unstable+0x2e/0x40
Apr 26 22:45:30 LSN-D4179 kernel:  ? handle_pte_fault+0x5d/0x880
Apr 26 22:45:30 LSN-D4179 kernel:  __handle_mm_fault+0x365/0x6c0
Apr 26 22:45:30 LSN-D4179 kernel:  handle_mm_fault+0xca/0x2a0
Apr 26 22:45:30 LSN-D4179 kernel:  __do_page_fault+0x1f0/0x450
Apr 26 22:45:30 LSN-D4179 kernel:  do_page_fault+0x37/0x130
Apr 26 22:45:30 LSN-D4179 kernel:  ? page_fault+0x8/0x30
Apr 26 22:45:30 LSN-D4179 kernel:  page_fault+0x1e/0x30
Apr 26 22:45:30 LSN-D4179 kernel: RIP: 0033:0x55a8c6c05938
Apr 26 22:45:30 LSN-D4179 kernel: Code: Unable to access opcode bytes at RIP 0x55a8c6c0590e.
Apr 26 22:45:30 LSN-D4179 kernel: RSP: 002b:0000145840bb84d0 EFLAGS: 00010206
Apr 26 22:45:30 LSN-D4179 kernel: RAX: 0000000000000000 RBX: 000014606063fe50 RCX: 00000000000000e7
Apr 26 22:45:30 LSN-D4179 kernel: RDX: 0000000080000001 RSI: 0000000000000000 RDI: 000014585fa517a0
Apr 26 22:45:30 LSN-D4179 kernel: RBP: 0000145840bb8560 R08: 00001457f81b8960 R09: 00001457ec069e78
Apr 26 22:45:30 LSN-D4179 kernel: R10: 000055a8c70b96b0 R11: 000014579ca220c8 R12: 00000000000000e7
Apr 26 22:45:30 LSN-D4179 kernel: R13: 000055a8c70b96b0 R14: 0000000000000002 R15: 00001a5400013700

The kernel was also updated (from 4.18.0-448), so I am curious whether this is a kernel error or a MariaDB error.
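
When journalctl holds many such reports, it can help to reduce each one to the blocked task and the first kernel function below the scheduler frames. The following is only a minimal sketch, assuming the log lines have the same layout as the excerpt above; the function name summarize_hung_tasks and the choice of which frames count as scheduler frames are illustrative assumptions, not part of any MariaDB or kernel tool.

```python
import re

# Header printed by the kernel hung-task detector, e.g.
# "kernel: INFO: task mariadbd:17017 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"kernel: INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)
# A stack frame such as "kernel:  io_schedule+0x12/0x40".
# Frames printed with a leading "? " do not match and are skipped.
FRAME_RE = re.compile(r"kernel:\s+(\w[\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")

# Assumption: these generic scheduler entry points say little about the cause.
SCHED_FRAMES = {"__schedule", "schedule", "schedule_preempt_disabled", "io_schedule"}


def summarize_hung_tasks(lines):
    """Return one dict per hung-task report found in the given log lines."""
    reports = []
    current = None
    for line in lines:
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header.group(1),
                "pid": int(header.group(2)),
                "seconds": int(header.group(3)),
                "frames": [],
            }
            reports.append(current)
            continue
        if current is None:
            continue
        frame = FRAME_RE.search(line)
        if frame:
            current["frames"].append(frame.group(1))
    for report in reports:
        # First frame that is not a plain scheduler entry point, if any.
        report["waiting_in"] = next(
            (fn for fn in report["frames"] if fn not in SCHED_FRAMES), None
        )
    return reports
```

A possible use is to pipe the output of journalctl -k into a script that calls summarize_hung_tasks(sys.stdin) and prints the pid and waiting_in fields of each report.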



 Comments   
Comment by Denis Chernyaev [ 2023-04-27 ]

Here is another trace:

Apr 26 22:45:28 LSN-D4179 kernel: INFO: task mariadbd:16530 blocked for more than 120 seconds.
Apr 26 22:45:28 LSN-D4179 kernel:       Not tainted 4.18.0-486.el8.x86_64 #1
Apr 26 22:45:28 LSN-D4179 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 26 22:45:28 LSN-D4179 kernel: task:mariadbd        state:D stack:    0 pid:16530 ppid:     1 flags:0x00000080
Apr 26 22:45:28 LSN-D4179 kernel: Call Trace:
Apr 26 22:45:28 LSN-D4179 kernel:  __schedule+0x2d1/0x870
Apr 26 22:45:28 LSN-D4179 kernel:  schedule+0x55/0xf0
Apr 26 22:45:28 LSN-D4179 kernel:  schedule_preempt_disabled+0xa/0x10
Apr 26 22:45:28 LSN-D4179 kernel:  rwsem_down_read_slowpath+0x26e/0x3f0
Apr 26 22:45:28 LSN-D4179 kernel:  down_read+0x95/0xa0
Apr 26 22:45:28 LSN-D4179 kernel:  do_madvise.part.30+0x2c3/0xa40
Apr 26 22:45:28 LSN-D4179 kernel:  ? syscall_trace_enter+0x1ff/0x2d0
Apr 26 22:45:28 LSN-D4179 kernel:  ? __x64_sys_futex+0x145/0x1f0
Apr 26 22:45:28 LSN-D4179 kernel:  ? __x64_sys_madvise+0x26/0x30
Apr 26 22:45:28 LSN-D4179 kernel:  __x64_sys_madvise+0x26/0x30
Apr 26 22:45:28 LSN-D4179 kernel:  do_syscall_64+0x5b/0x1b0
Apr 26 22:45:28 LSN-D4179 kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xc6
Apr 26 22:45:28 LSN-D4179 kernel: RIP: 0033:0x14657e763a4b
Apr 26 22:45:28 LSN-D4179 kernel: Code: Unable to access opcode bytes at RIP 0x14657e763a21.
Apr 26 22:45:28 LSN-D4179 kernel: RSP: 002b:00001458256a6c78 EFLAGS: 00000206 ORIG_RAX: 000000000000001c
Apr 26 22:45:28 LSN-D4179 kernel: RAX: ffffffffffffffda RBX: 00001458254a7000 RCX: 000014657e763a4b
Apr 26 22:45:28 LSN-D4179 kernel: RDX: 0000000000000004 RSI: 00000000001fb000 RDI: 00001458254a7000
Apr 26 22:45:28 LSN-D4179 kernel: RBP: 0000000000000000 R08: 000000014572804e R09: 0000145728082532
Apr 26 22:45:28 LSN-D4179 kernel: R10: 0000000002bed1f0 R11: 0000000000000206 R12: 00001458268fa9ae
Apr 26 22:45:28 LSN-D4179 kernel: R13: 00001458268fa9af R14: 00001458256a7700 R15: 00001458256a6d40
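
The register dump in this second trace can be read as a system call: on x86-64 Linux, ORIG_RAX holds the syscall number and RDI, RSI and RDX hold the first three arguments. The sketch below decodes those values under that assumption. The helper name decode_madvise and the small lookup tables are illustrative only; they cover just the values needed here, not the full syscall or advice lists.

```python
# x86-64 Linux syscall number for madvise (matches __x64_sys_madvise in the trace).
X86_64_SYSCALLS = {28: "madvise"}

# A few advice values from the Linux madvise(2) interface.
MADV_NAMES = {
    0: "MADV_NORMAL",
    1: "MADV_RANDOM",
    2: "MADV_SEQUENTIAL",
    3: "MADV_WILLNEED",
    4: "MADV_DONTNEED",
}


def decode_madvise(orig_rax, rdi, rsi, rdx):
    """Interpret a register dump as madvise(addr, length, advice)."""
    if X86_64_SYSCALLS.get(orig_rax) != "madvise":
        raise ValueError(f"syscall {orig_rax} is not madvise")
    return {
        "addr": rdi,
        "length": rsi,
        "advice": MADV_NAMES.get(rdx, f"unknown({rdx})"),
    }


# Values copied from the ORIG_RAX, RDI, RSI and RDX fields of the trace above.
call = decode_madvise(0x1C, 0x1458254A7000, 0x1FB000, 0x4)
print(call["advice"], call["length"])
```

Read this way, the thread appears to have been blocked inside a madvise call on a region of about 2 MB, waiting for a read lock in the kernel.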

Comment by Denis Chernyaev [ 2023-05-25 ]

The issue can be closed; it looks like it was a hardware problem.

Comment by Sergei Golubchik [ 2023-05-30 ]

Thanks for the info!
