[MXS-3380] MaxScale crash loop with cache filter + Redis Created: 2021-01-21  Updated: 2021-04-19  Resolved: 2021-02-12

Status: Closed
Project: MariaDB MaxScale
Component/s: cache
Affects Version/s: 2.5.6
Fix Version/s: 2.5.8

Type: Bug Priority: Major
Reporter: Todd Coker (Inactive) Assignee: Johan Wikman
Resolution: Fixed Votes: 0
Labels: None
Environment:

CentOS Linux release 7.7.1908 (Core)


Attachments: File datacheck-maxscale.cnf     Text File max-redis_error.txt     PNG File redis-screenshot.png    
Issue Links:
Problem/Incident
is caused by MXS-3403 The Cache Filter cannot deal with bat... Closed
Relates
relates to MXS-3400 maxscale crash with memcache on ubunt... Closed
Sprint: MXS-SPRINT-124

 Description   

After enabling the cache filter with Redis on an RDBA customer's server (datacheck), MaxScale crashes. Log file attached (max-redis_error.txt). Customer runs Redis 5.0.6.

Another MaxScale server (test) shows some similar errors but no crash. Same Redis version. The difference between datacheck and test is that datacheck has two app servers hitting it via a load balancer. They are not using cluster mode.

2021-01-20 22:28:37   error  : [storage_redis] Expected status message in the context of MULTI, but received a NIL.
2021-01-20 22:28:37   error  : [storage_redis] Expected status message in the context of queued command, but received a STRING.
2021-01-20 22:28:37   warning: [storage_redis] Unexpected redis redis return type (STATUS) received.
2021-01-20 22:28:37   error  : [storage_redis] Expected status message in the context of MULTI, but received a STRING.
2021-01-20 22:28:37   warning: [storage_redis] Unexpected redis redis return type (STATUS) received.
2021-01-20 22:28:37   error  : [storage_redis] Expected status message 'OK' in the context of MULTI, but received 'QUEUED'.
2021-01-20 22:28:37   error  : [storage_redis] Failed when reading response to MULTI: no error, 
2021-01-20 22:28:37   error  : [cache] Could not store new cache value, deleting a possibly existing old value.
2021-01-20 22:28:37   warning: [storage_redis] Unexpected redis return type (ARRAY) received.
2021-01-20 22:28:37   error  : [cache] Could not delete cache item, the value may now be stale.
2021-01-20 22:28:41   error  : [storage_redis] Expected status message in the context of MULTI, but received a NIL.
2021-01-20 22:28:41   warning: [storage_redis] Unexpected redis redis return type (STATUS) received.
2021-01-20 22:28:41   error  : [storage_redis] Expected status message 'OK' in the context of MULTI, but received 'QUEUED'.
2021-01-20 22:28:41   error  : [storage_redis] Failed when reading response to MULTI: no error, 
2021-01-20 22:28:41   error  : [cache] Could not store new cache value, deleting a possibly existing old value.
2021-01-20 22:28:41   warning: [storage_redis] Unexpected redis return type (ARRAY) received.
2021-01-20 22:28:41   error  : [cache] Could not delete cache item, the value may now be stale.
2021-01-20 22:28:42   error  : [storage_redis] Expected status message in the context of MULTI, but received a STRING.
2021-01-20 22:28:42   warning: [storage_redis] Unexpected redis redis return type (STATUS) received.
2021-01-20 22:28:42   error  : [storage_redis] Expected status message 'OK' in the context of MULTI, but received 'QUEUED'.
2021-01-20 22:28:42   error  : [storage_redis] Failed when reading response to MULTI: no error, 
2021-01-20 22:28:42   error  : [cache] Could not store new cache value, deleting a possibly existing old value.
2021-01-20 22:28:42   warning: [storage_redis] Unexpected redis return type (ARRAY) received.
2021-01-20 22:28:42   error  : [cache] Could not delete cache item, the value may now be stale.
2021-01-20 22:28:42   error  : [storage_redis] Expected status message in the context of MULTI, but received a STRING.
2021-01-20 22:28:42   warning: [storage_redis] Unexpected redis redis return type (STATUS) received.

2021-01-20 22:32:57   error  : [cache] Could not delete cache item, the value may now be stale.
*** Error in `/usr/bin/maxscale': free(): invalid pointer: 0x00007f8c08000c92 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x81679)[0x7f8ebb6d7679]
/usr/lib64/maxscale/libstorage_redis.so(redisBufferWrite+0x245)[0x7f8e683a26a5]
/usr/lib64/maxscale/libstorage_redis.so(redisGetReply+0x83)[0x7f8e683a27a3]
/usr/lib64/maxscale/libstorage_redis.so(redisvCommand+0x27)[0x7f8e683a2a37]
/usr/lib64/maxscale/libstorage_redis.so(+0x5bfd)[0x7f8e68396bfd]
/usr/lib64/maxscale/libstorage_redis.so(+0x9a6c)[0x7f8e6839aa6c]
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x182dcb)[0x7f8ebe3b0dcb]
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(_ZN7maxbase10ThreadPool6Thread4mainEv+0x155)[0x7f8ebe2cf425]
/usr/lib64/maxscale/libmaxscale-common.so.1.0.0(+0x228a0f)[0x7f8ebe456a0f]
/lib64/libpthread.so.0(+0x7e65)[0x7f8ebd448e65]
/lib64/libc.so.6(clone+0x6d)[0x7f8ebb75488d]



 Comments   
Comment by Johan Wikman [ 2021-02-02 ]

I have tried to repeat the problem using the following setups:

  • 1 MaxScale
  • 16 sysbench threads hitting in parallell
    and
  • 2 MaxScales with the same backends and the same Redis server
  • 16 sysbench threads hitting one MaxScale and 16 sysbench threads hitting the other.

But everything works just fine.

Coker Is the Redis server used by anybody else but MaxScale?

Comment by Johan Wikman [ 2021-02-02 ]

I did find a bug that caused MaxScale to not immediately give up after

 
2021-01-20 22:28:37   error  : [storage_redis] Expected status message in the context of MULTI, but received a NIL.

which will be fixed in 2.5.8.

However, I suspect the cascading errors are caused by the unexpected NIL reply causing the reply handling to be aborted in the the middle. That then leads to the Redis requests performed by MaxScale and the responses it receives to get out of sync.

Comment by Johan Wikman [ 2021-02-05 ]

I will close this now, even though I have not been able to repeat the behavior.

However, storage_redis should now be much more resilient:

  • Hiredis Redis client library updated to most recent GA version.
  • Bug that caused storage_redis not to bail out at first unexpected reply from Redis fixed.
  • NIL reply from Redis explicitly handled and, as per recommendation from Hiredis developer, causes storage_redis to disconnect and then reconnect.
  • All I/O errors are detected and causes disconnect followed by reconnect.
  • If there are unexpected replies from Redis, as much information about them as possible is logged, to make it easier to figure out the cause for problems.
Comment by Johan Wikman [ 2021-02-10 ]

Coker What connector are they using?

Comment by Johan Wikman [ 2021-02-11 ]

I could repeat the behaviour when using ConnectorJ and batching.

Comment by Johan Wikman [ 2021-02-12 ]

Fixed by MXS-3403

Generated at Thu Feb 08 04:21:00 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.