[MXS-401] Possible MaxScale hangs? Created: 2015-10-08  Updated: 2016-05-31  Resolved: 2016-05-31

Status: Closed
Project: MariaDB MaxScale
Component/s: Core
Affects Version/s: 1.2.1
Fix Version/s: 2.0.0

Type: Bug Priority: Major
Reporter: Kolbe Kegel (Inactive) Assignee: Johan Wikman
Resolution: Cannot Reproduce Votes: 0
Labels: None
Environment:

Galera, RWSplit, CentOS 7, MS Azure


Attachments: File maxscale.cnf    

 Description   

I'm trying to do some benchmarking through MaxScale using linkbench and I'm running into a problem where the linkbench threads will occasionally fail to close. I really can't tell whether this is a problem in MariaDB, a problem in MaxScale, or a problem in the Java client program (which I obviously did not write and have not studied). I don't think I ever see this hang when I have linkbench connected straight to the backend (bypassing MaxScale).

Right now, I have a state where I've manually killed all app connections across all backends, but MaxScale still shows open sessions that are in "Session ready for routing" with associated DCBs in "DCB in the polling loop". Does it seem right that MaxScale sessions/DCBs would be in those states even when the associated threads on the backends have been killed?

[mdbe@mdbec-max1 ~]$ maxadmin show sessions
Session 8182 (0x1fa56f0)
       State:                  Session ready for routing
       Service:                CLI (0x1f947d0)
       Client DCB:             0x1fa5540
       Client Address:         127.0.0.1
       Connected:              Thu Oct  8 07:01:40 2015
       Idle:                           0 seconds
Session 8002 (0x7f91084c1c60)
       State:                  Session ready for routing
       Service:                RW Split Router (0x1f95080)
       Client DCB:             0x7f910c4aefa0
       Client Address:         myapp@10.0.1.7
       Connected:              Thu Oct  8 05:44:41 2015
       Idle:                           4443 seconds
Session 7874 (0x7f9108016b90)
       State:                  Session ready for routing
       Service:                RW Split Router (0x1f95080)
       Client DCB:             0x7f9108016830
       Client Address:         myapp@10.0.1.7
       Connected:              Thu Oct  8 05:44:40 2015
       Idle:                           4443 seconds
Session 5 (0x1f93740)
       State:                  Listener Session
       Service:                CLI (0x1f947d0)
       Client DCB:             0x1f933a0
       Connected:              Wed Oct  7 13:48:33 2015
Session 4 (0x1f93300)
       State:                  Listener Session
       Service:                RW Split Router (0x1f95080)
       Client DCB:             0x1f93860
       Connected:              Wed Oct  7 13:48:33 2015
Session 3 (0x1f93b10)
       State:                  Listener Session
       Service:                RW Split Router (0x1f95080)
       Client DCB:             0x1f92ee0
       Connected:              Wed Oct  7 13:48:33 2015
Session 2 (0x1f92d30)
       State:                  Listener Session
       Service:                Read Connection Router (0x1f96150)
       Client DCB:             0x1f92b60
       Connected:              Wed Oct  7 13:48:30 2015
Session 1 (0x1f92ac0)
       State:                  Listener Session
       Service:                Read Connection Router (0x1f96150)
       Client DCB:             0x1fa0280
       Connected:              Wed Oct  7 13:48:30 2015

$ for h in mdbec-db{1,2,3}; do ssh $h mysql -t -u root -p5mA6txkHXo7vAXn/hDOn -e "'show processlist'"; done
+-------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
| Id    | User            | Host            | db   | Command | Time  | State              | Info             | Progress |
+-------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
|     1 | system user     |                 | NULL | Sleep   |  4077 | committed 27226826 | NULL             |    0.000 |
|     2 | system user     |                 | NULL | Sleep   | 62845 | wsrep aborter idle | NULL             |    0.000 |
|     4 | system user     |                 | NULL | Sleep   |  4077 | committed 27226820 | NULL             |    0.000 |
|     5 | system user     |                 | NULL | Sleep   |  4077 | committed 27226823 | NULL             |    0.000 |
|     6 | system user     |                 | NULL | Sleep   |  4077 | committed 27226825 | NULL             |    0.000 |
|     7 | system user     |                 | NULL | Sleep   |  4077 | committed 27226822 | NULL             |    0.000 |
|     8 | system user     |                 | NULL | Sleep   |  4077 | committed 27226827 | NULL             |    0.000 |
|     9 | system user     |                 | NULL | Sleep   |  4077 | committed 27226824 | NULL             |    0.000 |
|    10 | system user     |                 | NULL | Sleep   |  4077 | committed 27226821 | NULL             |    0.000 |
| 16122 | root            | localhost       | NULL | Sleep   |  1766 |                    | NULL             |    0.000 |
| 16132 | maxscalemonitor | 10.0.1.11:38129 | NULL | Sleep   |     0 |                    | NULL             |    0.000 |
| 16133 | root            | localhost       | NULL | Query   |     0 | init               | show processlist |    0.000 |
+-------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
| Id   | User            | Host            | db   | Command | Time  | State              | Info             | Progress |
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
|    1 | system user     |                 | NULL | Sleep   | 62441 | wsrep aborter idle | NULL             |    0.000 |
|    2 | system user     |                 | NULL | Sleep   |  7108 | committed 25608867 | NULL             |    0.000 |
|    4 | system user     |                 | NULL | Sleep   |  7108 | committed 25608866 | NULL             |    0.000 |
|    5 | system user     |                 | NULL | Sleep   |  7108 | committed 25608871 | NULL             |    0.000 |
|    6 | system user     |                 | NULL | Sleep   |  7108 | committed 25608869 | NULL             |    0.000 |
|    7 | system user     |                 | NULL | Sleep   |  7108 | committed 25608870 | NULL             |    0.000 |
|    8 | system user     |                 | NULL | Sleep   |  7108 | committed 25608872 | NULL             |    0.000 |
|   10 | system user     |                 | NULL | Sleep   |  7109 | committed 25608865 | NULL             |    0.000 |
|   11 | system user     |                 | NULL | Sleep   |  7108 | committed 25608868 | NULL             |    0.000 |
| 8183 | root            | localhost       | NULL | Sleep   |  1772 |                    | NULL             |    0.000 |
| 8186 | maxscalemonitor | 10.0.1.11:34503 | NULL | Sleep   |     1 |                    | NULL             |    0.000 |
| 8188 | root            | localhost       | NULL | Query   |     0 | init               | show processlist |    0.000 |
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
| Id   | User            | Host            | db   | Command | Time  | State              | Info             | Progress |
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
|    1 | system user     |                 | NULL | Sleep   | 61417 | wsrep aborter idle | NULL             |    0.000 |
|    2 | system user     |                 | NULL | Sleep   |  4080 | committed 27226821 | NULL             |    0.000 |
|    4 | system user     |                 | NULL | Sleep   |  4080 | committed 27226826 | NULL             |    0.000 |
|    5 | system user     |                 | NULL | Sleep   |  4080 | committed 27226822 | NULL             |    0.000 |
|    6 | system user     |                 | NULL | Sleep   |  4080 | committed 27226825 | NULL             |    0.000 |
|    7 | system user     |                 | NULL | Sleep   |  4080 | committed 27226820 | NULL             |    0.000 |
|    8 | system user     |                 | NULL | Sleep   |  4080 | committed 27226824 | NULL             |    0.000 |
|    9 | system user     |                 | NULL | Sleep   |  4080 | committed 27226823 | NULL             |    0.000 |
|   11 | system user     |                 | NULL | Sleep   |  4080 | committed 27226827 | NULL             |    0.000 |
|   12 | maxscalemonitor | 10.0.1.11:56629 | NULL | Sleep   |     0 |                    | NULL             |    0.000 |
| 8180 | root            | localhost       | NULL | Sleep   |   552 |                    | NULL             |    0.000 |
| 8183 | root            | localhost       | NULL | Query   |     0 | init               | show processlist |    0.000 |
+------+-----------------+-----------------+------+---------+-------+--------------------+------------------+----------+
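One quick way to separate genuinely idle sessions from active ones in output like the above is to filter "maxadmin show sessions" by the Idle counter. This is only an illustrative sketch against the output layout shown above (the session-ID and Idle fields as printed by MaxScale 1.2.x); the 3600-second threshold is an arbitrary example value:

```shell
# Print sessions idle longer than the threshold (seconds).
# Feed it the output of "maxadmin show sessions"; assumes the layout above.
maxadmin show sessions | awk -v limit=3600 '
  /^Session/ { sess = $2 }                                       # remember session id
  /Idle:/    { if ($2 + 0 > limit) print "Session " sess " idle " $2 "s" }
'
```

Against the listing above this would flag sessions 8002 and 7874 (idle 4443 seconds) while skipping the listener sessions, which have no Idle line.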



 Comments   
Comment by Johan Wikman [ 2015-11-22 ]

A couple of locking issues have recently been uncovered.

  • The binlog router failed to release locks under certain conditions.
  • The non-threadsafe localtime was used instead of the thread-safe localtime_r. It appeared that the use of localtime could cause not only races but even lockups.

Both of these have now been fixed in develop.

Comment by Johan Wikman [ 2015-11-24 ]

The localtime issue was a false alarm. There was another issue that only made it appear as if localtime could also cause lockups.

Comment by Dipti Joshi (Inactive) [ 2015-12-01 ]

Is this duplicate of MXS-388 ?

Comment by Johan Wikman [ 2015-12-15 ]

With some limited testing this could not be repeated. However, as a more thorough attempt at reproducing the behaviour may be warranted, the issue is tentatively moved to 1.3.1.

Comment by Johan Wikman [ 2016-05-31 ]

I'll close this: it was reported against 1.2.1, we could not reproduce it, and we are now at 2.0.0.

If something similar is detected, please reopen this one or create a new issue.

Generated at Thu Feb 08 03:59:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.