[MCOL-529] DBRM message queue clients need to be pooled Created: 2017-01-23 Updated: 2017-12-01 Resolved: 2017-05-10 |
|
| Status: | Closed |
| Project: | MariaDB ColumnStore |
| Component/s: | None |
| Affects Version/s: | 1.0.7 |
| Fix Version/s: | 1.0.9, 1.1.0 |
| Type: | New Feature | Priority: | Critical |
| Reporter: | Andrew Hutchings (Inactive) | Assignee: | David Hill (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Attachments: |
|
| Sprint: | 2017-7, 2017-8, 2017-9, 2017-10 |
| Description |
|
When running test007 a second time, we get the following error many times towards the end:
ERROR 1836 (HY000) at line 1: Running in read-only mode
The error log at the time is attached. |
| Comments |
| Comment by Andrew Hutchings (Inactive) [ 2017-01-23 ] |
|
Attached debug log from the time it happened. Seems to start at timestamp 17:49:22 |
| Comment by Andrew Hutchings (Inactive) [ 2017-01-23 ] |
|
The cause of this is twofold: 1. With the 1.0.7 source there is some acceleration with regard to new connections. test007 creates and destroys many connections per second; with the performance changes in 1.0.7 this is around 300 per second. The ports go into TCP TIME_WAIT for a short time before they can be reused. CentOS 7 has a maximum of around 30,000 ports by default, and we now have a high chance of just going over that. The workaround is to increase the number of available ports as follows, or to change the TCP TIME_WAIT kernel settings:
/sbin/sysctl net.ipv4.ip_local_port_range="1024 65000" |
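The port arithmetic in the comment above can be sketched roughly in Python. The function names are hypothetical, and the 60-second TIME_WAIT duration is an assumption based on the usual Linux value; the ticket only says "a short time". The sketch counts the ports in an ip_local_port_range setting and estimates how many client ports sit in TIME_WAIT at a steady connection rate.

```python
def port_count(port_range: str) -> int:
    """Number of ports in a 'low high' range string, bounds inclusive."""
    low, high = (int(p) for p in port_range.split())
    return high - low + 1


def ports_in_time_wait(conns_per_sec: float, time_wait_secs: float = 60.0) -> int:
    """Steady-state estimate: every closed connection holds its client
    port for time_wait_secs (assumed 60 s here, not stated in the ticket)."""
    return int(conns_per_sec * time_wait_secs)


def headroom(port_range: str, conns_per_sec: float,
             time_wait_secs: float = 60.0) -> int:
    """Ports left over after the TIME_WAIT estimate; negative means exhausted."""
    return port_count(port_range) - ports_in_time_wait(conns_per_sec, time_wait_secs)


if __name__ == "__main__":
    # Range used by the workaround in this ticket.
    print(port_count("1024 65000"))
    # Rate quoted in this ticket for test007 on 1.0.7.
    print(ports_in_time_wait(300))
    print(headroom("1024 65000", 300))
```

The estimate covers only the connections counted by conns_per_sec; other traffic on the same host may draw from the same range, so real headroom is likely smaller.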
| Comment by Andrew Hutchings (Inactive) [ 2017-01-24 ] |
|
The 300 connections/sec come primarily from DBRM connections, as each DBRM instance requires a new connection. We should look at pooling these connections / instances. We do use SO_REUSEADDR on connection, but the TIME_WAIT state is at the client end, which isn't affected by this socket option. The error handling is correct, so there is nothing we need to do there. Changing the subject of the ticket to reflect the DBRM optimisations required. |
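The pooling idea proposed above could look roughly like the following Python sketch. This is not the ColumnStore implementation (which is not shown in this ticket); ClientPool, factory and max_idle are hypothetical names chosen for illustration. The point is only that a released client goes back to an idle list and is reused, so repeated short-lived users do not each open a new socket.

```python
import threading
from contextlib import contextmanager


class ClientPool:
    """Minimal thread-safe pool of reusable client objects (illustrative only)."""

    def __init__(self, factory, max_idle=8):
        self._factory = factory      # callable that opens a new client
        self._max_idle = max_idle    # cap on idle clients kept around
        self._idle = []
        self._lock = threading.Lock()
        self.created = 0             # how many clients were actually opened

    def acquire(self):
        with self._lock:
            if self._idle:
                return self._idle.pop()   # reuse an existing connection
            self.created += 1
        return self._factory()            # none idle: open a new one

    def release(self, client):
        with self._lock:
            if len(self._idle) < self._max_idle:
                self._idle.append(client)
                return
        # Pool is full: close the surplus client if it supports close().
        close = getattr(client, "close", None)
        if close is not None:
            close()

    @contextmanager
    def client(self):
        c = self.acquire()
        try:
            yield c
        finally:
            self.release(c)


if __name__ == "__main__":
    pool = ClientPool(factory=object)   # object() stands in for a real client
    for _ in range(300):
        with pool.client():
            pass
    print(pool.created)
```

A real pool would also need to discard clients whose connection has failed rather than returning them to the idle list; that is left out here to keep the sketch short.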
| Comment by Andrew Hutchings (Inactive) [ 2017-04-14 ] |
|
Pull request for develop and develop-1.0 available. test007 should now pass with this patch. |
| Comment by Daniel Lee (Inactive) [ 2017-05-05 ] |
|
Assigned it to Mr. Hill for regression testing. I still need to set up my regression test. |
| Comment by David Hill (Inactive) [ 2017-05-10 ] |
|
test007 without the workaround now passes regression test without app failures. |