[MXS-3374] MaxScale fails to update IP for an existing node that reappears with an IP change Created: 2021-01-14 Updated: 2021-01-26 Resolved: 2021-01-25 |
|
| Status: | Closed |
| Project: | MariaDB MaxScale |
| Component/s: | xpandmon |
| Affects Version/s: | 2.5.6 |
| Fix Version/s: | 2.5.7 |
| Type: | Bug | Priority: | Major |
| Reporter: | Manjinder Nijjar | Assignee: | Johan Wikman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Environment: |
Sky-GCP |
||
| Attachments: |
|
| Sprint: | MXS-SPRINT-123 |
| Description |
|
In SkySQL, when a node misbehaves for any reason, K8s kills the node and starts a new instance with the same hostname but a different IP. When xpandmon is configured to send traffic directly to Xpand nodes, it seems the replacement node is not identified correctly, since it reappears with a different IP. As a result, MaxScale stops connecting new sessions and returns errors on existing sessions. Here is an example to demonstrate this behavior. We have a 3-node cluster running in Sky-GCP. In this configuration, Xpand runs with the MariaDB server in the same pod (1:1 config). MaxScale is configured for both the frontend (MariaDB nodes) and the backend (Xpand nodes).
We then kill one of the pods to mimic K8s behavior when a node misbehaves:
When the new pod appears, its IP changes to 10.32.3.11; however, the Xpand config still points to the older IP, 10.32.3.10.
However, the Xpand cluster identifies this correctly and comes up fine:
List of pods in this setup:
|
| Comments |
| Comment by Manjinder Nijjar [ 2021-01-14 ] |
|
This is a blocker, since it will prevent some of our customers from doing a POC in Sky, where they will be accessing Xpand nodes directly via MaxScale. |
| Comment by Johan Wikman [ 2021-01-21 ] |
|
msnijjar I am confused by this output
because presumably Xpand-Bootstrap should correspond to one of those @@Xpand... nodes and hence have the same state? |
| Comment by Johan Wikman [ 2021-01-21 ] |
|
msnijjar Could you provide the maxscale.log? |
| Comment by Jens Röwekamp (Inactive) [ 2021-01-21 ] |
|
Hi johan.wikman. Attached you'll find the two maxscale logs maxscale-before.log Steps taken before getting the log files:
|
| Comment by Johan Wikman [ 2021-01-21 ] |
|
I'm also confused by
If 10.32.3.10 is the wrong IP, I wonder how it is possible that the state is Master, Running. If no one is answering the health-check ping, the state should be Down. |
| Comment by Johan Wikman [ 2021-01-21 ] |
|
I think I know why the state of @@Xpand-Monitor:node-3 is correctly shown as Master, Running, although the wrong address 10.32.3.10 is seemingly used. Based on code review, it would appear that xpandmon is internally aware of and using the correct address, but the address in the corresponding internal Server object that routers use is not updated. @msnijjar Could you repeat the test and, as a final step, stop and restart MaxScale? If that causes the correct address to be shown in the output, it would confirm my hypothesis. |
| Comment by Johan Wikman [ 2021-01-21 ] |
|
And could you also before and after do
and paste the output here. |
| Comment by Jens Röwekamp (Inactive) [ 2021-01-21 ] |
|
Restarting MaxScale will be a bit hard, since the whole container in which the MaxScale process runs is restarted once the MaxScale process is killed. Since there isn't any persistent storage attached to the MaxScale container, it will default to its initial configuration, which only contains the Xpand-Bootstrap node. From this Xpand-Bootstrap node, MaxScale is able to acquire the correct IP addresses of the Xpand cluster. But I was able to query the xpand_nodes-v1.db database as you asked.
|
| Comment by Johan Wikman [ 2021-01-21 ] |
|
Thanks, this confirms it. The list servers output shows the incorrect 10.32.0.133, but the sqlite database contains the correct 10.32.0.134. So, internally xpandmon uses the correct IP, which is why the state is correctly shown, but the address in the server object used by routers is not updated, which then causes the routing to fail. |
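The mismatch described above can be illustrated with a minimal sketch. This is not MaxScale code: the `Server` and `Monitor` classes, their fields, and the method names are all hypothetical, chosen only to show how a monitor's own bookkeeping can hold the new address while the object read by routers keeps the old one.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical stand-in for the server object that routers read."""
    name: str
    address: str


@dataclass
class Monitor:
    """Hypothetical monitor that tracks node addresses on its own."""
    node_addresses: dict = field(default_factory=dict)  # node name -> address
    servers: dict = field(default_factory=dict)         # node name -> Server

    def add_node(self, name, address):
        self.node_addresses[name] = address
        self.servers[name] = Server(name, address)

    def on_address_change_buggy(self, name, new_address):
        # Only the monitor's own record is refreshed, so health checks
        # succeed while routers still see the old address.
        self.node_addresses[name] = new_address

    def on_address_change_fixed(self, name, new_address):
        # Refresh the monitor's record and the router-facing object together.
        self.node_addresses[name] = new_address
        self.servers[name].address = new_address

    def stale_servers(self):
        # Names of servers whose router-facing address has fallen behind.
        return sorted(
            name
            for name, server in self.servers.items()
            if server.address != self.node_addresses[name]
        )
```

Calling `stale_servers()` after the buggy update reports the affected node, which corresponds to the symptom seen here: correct state, wrong address.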
| Comment by Johan Wikman [ 2021-01-21 ] |
|
What is the IP address of j2-mxp-0.j2-mxp? |
| Comment by Jens Röwekamp (Inactive) [ 2021-01-21 ] |
|
That would be: 10.32.0.163. If we use the IP address directly instead of the DNS name for the bootstrap node, it shows itself as running. |
| Comment by Johan Wikman [ 2021-01-21 ] |
|
In all list servers output, Xpand-Bootstrap is listed as being Down. So how does the Xpand monitor get going? At first startup, the Xpand monitor uses the bootstrap server defined in the configuration in order to get in contact with the Xpand cluster. It then figures out the cluster configuration and stores that information in the sqlite database. On subsequent startups, the monitor uses the data in the sqlite database and effectively ignores the information in the configuration file. Anyway, to get going, the Xpand monitor must at some point be able to connect to an Xpand node using the bootstrap server information from the configuration file. So how does that happen if you use an address that the Xpand monitor cannot connect to? |
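The startup behavior described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the xpandmon implementation: the `nodes` table, the `load_nodes` function, and the `discover` callback are hypothetical names, and the real schema of the sqlite database is not shown in this ticket.

```python
import sqlite3


def load_nodes(db, bootstrap_servers, discover):
    """Return the node addresses the monitor should start from.

    db                -- an open sqlite3 connection (hypothetical schema)
    bootstrap_servers -- addresses taken from the configuration file
    discover          -- hypothetical callback: address -> {node_id: ip},
                         raising ConnectionError if the address is unreachable
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS nodes (id INTEGER PRIMARY KEY, ip TEXT NOT NULL)"
    )
    rows = db.execute("SELECT id, ip FROM nodes ORDER BY id").fetchall()
    if rows:
        # Subsequent startups: persisted data wins, configuration is ignored.
        return dict(rows)

    # First startup: one bootstrap address must be reachable.
    for address in bootstrap_servers:
        try:
            nodes = discover(address)
        except ConnectionError:
            continue
        db.executemany("INSERT INTO nodes (id, ip) VALUES (?, ?)", nodes.items())
        db.commit()
        return nodes

    raise RuntimeError("no persisted nodes and no reachable bootstrap server")
```

The point of the sketch is the ordering: the bootstrap address is only needed while the persisted table is empty, which is why a container without persistent storage goes back to the bootstrap node on every restart.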
| Comment by Jens Röwekamp (Inactive) [ 2021-01-22 ] |
|
Thanks a lot Johan. |
| Comment by Johan Wikman [ 2021-01-22 ] |
|
First run install_build_deps.sh in BUILD; that should install all dependencies. |
| Comment by Johan Wikman [ 2021-01-22 ] |
|
I'm not at my laptop so this comes from memory...
I rarely build packages, but I have a vague recollection that you at some point had to run it twice in a row on some OS. Hope this works... |
| Comment by Jens Röwekamp (Inactive) [ 2021-01-22 ] |
|
Thanks. Unfortunately install_build_deps.sh didn't resolve all dependencies. Here is the Dockerfile for reference.
|
| Comment by Jens Röwekamp (Inactive) [ 2021-01-22 ] |
|
Hi Johan. I think it works as expected. The MaxScale logfile is attached.
The MaxScale image that was used is: mariadb/skysql-maxscale-dev:2.5.6-1-quizlet-dev-6189-61d2bf |
| Comment by Jens Röwekamp (Inactive) [ 2021-01-22 ] |
|
The Xpand-Bootstrap node in the example above is a load balancer for all Xpand nodes, and it doesn't return pings. |
| Comment by Johan Wikman [ 2021-01-25 ] |
|
jens.rowekamp So it is not part of the Xpand cluster? If that's the case, then I just don't understand how the Xpand-monitor gets going. |
| Comment by Johan Wikman [ 2021-01-25 ] |
|
jens.rowekamp I found the reason why the traffic is routed correctly, even though list servers shows the old address. For performance reasons the address is stored in two places: one that is used for routing and another that is used when the list servers output is generated. In that fix of mine, only the former was updated when the address of the Xpand node had changed. What packages did install_build_deps.sh not install? We'll update the script. |
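As a rough illustration of the two-copy issue described above (a sketch only; `ServerInfo` and its methods are hypothetical and do not reflect the actual MaxScale classes), keeping both copies behind a single setter avoids one of them being missed:

```python
class ServerInfo:
    """Hypothetical server record that keeps the address in two places."""

    def __init__(self, address):
        self._routing_address = address  # read when opening backend connections
        self._display_address = address  # read when rendering `list servers`

    def set_address_partial(self, address):
        # Mirrors the incomplete fix: only the routing copy changes, so
        # traffic is routed correctly but the listing shows the old address.
        self._routing_address = address

    def set_address(self, address):
        # Updating both copies in one place keeps them from drifting apart.
        self._routing_address = address
        self._display_address = address

    def routing_address(self):
        return self._routing_address

    def display_address(self):
        return self._display_address
```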
| Comment by Jens Röwekamp (Inactive) [ 2021-01-25 ] |
|
Hi Johan. I needed to install the following packages prior to executing install_build_deps.sh to be able to compile MaxScale in a CentOS 8 container:
Regarding the Xpand-Bootstrap node that is used: |
| Comment by Johan Wikman [ 2021-01-26 ] |
|
jens.rowekamp Now I finally understand your bootstrap setup. That's brilliant. |