[MDEV-26873] Partial server hang when using many threads

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Do
Affects Version/s: 10.2(EOL), 10.3(EOL), 10.4(EOL), 10.5, 10.6, 10.7(EOL)
Fix Version/s: N/A
Component/s: Locking
Labels: hang

Description

Split from ~~MDEV-26381~~. This ticket logs a simplified, easily reproducible overview of the problem. There are likely more aspects to these hangs, some of which are described in that ticket.

Execute the attached hang.sql (identical to MDEV-26381_OTHER_1.sql from ~~MDEV-26381~~) against the test database using 10k threads, with every thread replaying the statements in random order.
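The replay described above can be sketched roughly as follows. This is not the tooling actually used for this report, only a minimal illustration under stated assumptions: the helper names, the socket path, and the user are hypothetical, statements are split naively on `;`, and the `mariadb` command-line client is assumed to be on the PATH (older versions ship it as `mysql`). Spawning 10k client processes from one script may itself hit OS limits, so treat it as a starting point.

```python
import random
import subprocess
import threading


def split_statements(sql_text):
    """Naively split SQL text on ';'.

    Assumption: no semicolons inside string literals or stored routines.
    """
    return [s.strip() + ";" for s in sql_text.split(";") if s.strip()]


def shuffled(statements, seed):
    """Return a shuffled copy of the statements; the seed makes a run repeatable."""
    rng = random.Random(seed)
    out = list(statements)
    rng.shuffle(out)
    return out


def client_command(client="mariadb", user="root",
                   socket_path="/tmp/mysql.sock", database="test"):
    """Build the client command line (socket path and user are placeholders).

    --force keeps the client going after SQL errors such as ERROR 1146.
    """
    return [client, "--force", f"--user={user}",
            f"--socket={socket_path}", database]


def replay(statements, seed, cmd):
    """Feed one randomly ordered copy of the statements to one client process."""
    script = "\n".join(shuffled(statements, seed))
    return subprocess.run(cmd, input=script, text=True, capture_output=True)


def run_all(statements, n_threads, cmd):
    """Start n_threads concurrent replays, each with its own ordering."""
    threads = [threading.Thread(target=replay, args=(statements, i, cmd))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Example (requires a running server and the attached hang.sql):
#   with open("hang.sql") as f:
#       run_all(split_statements(f.read()), 10_000, client_command())
```

Each thread gets its own seed, so every replay uses a different but repeatable statement order.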

After a few minutes, even on optimized builds, partial hangs start to show. A SHOW FULL PROCESSLIST capture from 10.7 is attached as show_full_processlist.txt as an example of such an occurrence. The issue is very easy to reproduce.

When errors (such as ERROR 1146 (42S02) at line 1: Table 'test.t2' doesn't exist) are logged to the screen, it is easy to see when the server starts locking up after 1-5 minutes: the error rate either stops abruptly or slows down significantly. The server then stays in that semi-hung state for 30+ minutes, sometimes unlocking partially, with some threads continuing to process transactions while others remain hung.
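The "error rate drops" observation can also be checked mechanically instead of by eye. The sketch below is only an illustration: the function names, the 10-second window, and the 10% threshold are arbitrary assumptions, not values taken from this report. It buckets the timestamps of logged errors into fixed windows and reports the first window whose count falls far below the early baseline.

```python
def events_per_window(timestamps, window_s=10.0, end=None):
    """Count events per fixed-size window, starting at the first timestamp.

    Pass `end` (time the observation stopped) so that silent windows after
    the last event are counted as zero instead of being dropped.
    """
    if not timestamps:
        return []
    start = min(timestamps)
    last = max(timestamps) if end is None else max(end, max(timestamps))
    n_windows = int((last - start) // window_s) + 1
    counts = [0] * n_windows
    for t in timestamps:
        counts[int((t - start) // window_s)] += 1
    return counts


def first_stall(counts, baseline_windows=3, drop_ratio=0.1):
    """Return the index of the first window whose count is below
    drop_ratio * (average of the first baseline_windows), or None.
    """
    if len(counts) <= baseline_windows:
        return None
    baseline = sum(counts[:baseline_windows]) / baseline_windows
    for i in range(baseline_windows, len(counts)):
        if counts[i] < drop_ratio * baseline:
            return i
    return None
```

With timestamps collected from the client error output, `first_stall(events_per_window(ts, end=now))` gives the window in which the error rate collapsed, which should line up with the 1-5 minute mark described above.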

The machine is not OOM, not OOS, and not busy (nothing else is running), and it is not challenged by the 10k threads (low load average in htop). In other words, as far as I can tell, this is not related to server hardware or capacity in any way.

Tested version/revision was 10.7.1 b4911f5a34f8dcfb642c6f14535bc9d5d97ade44 (Optimized)

Attachments

- hang.sql (2 kB, 2021-10-21 04:17)
- MDEV-26873_GDB_10.6_OPT.txt.tar.xz (105 kB, 2021-10-21 06:00)
- MDEV-26873_GDB_FULL_10.6_OPT.txt.tar.xz (4.88 MB, 2021-10-21 06:01)
- MDEV-26873_SHOW_FULL_PROCESSLIST_10.6_OPT.txt (4.89 MB, 2021-10-21 05:03)
- MDEV-26873.cc (4 kB, 2021-10-26 22:46)
- show_full_processlist.txt (4.88 MB, 2021-10-21 04:45)

Issue Links

relates to: MDEV-26935 Improve MDL to be scalable with many thousands OS threads (Open)

split from: MDEV-26381 InnoDB: Failing assertion: fsize != os_offset_t(-1) on CREATE TABLE with many partitions / ENOMEM (out of kernel memory) from fstat handling (Closed)

People

Assignee: Vladislav Vaintroub

Reporter: Roel Van de Paar

Votes: 0

Watchers: 5

Dates

Created: 2021-10-21 04:55

Updated: 2021-10-29 20:09

Resolved: 2021-10-29 10:06
