MariaDB Server / MDEV-31818

Server crashes in choose_best_splitting

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.11.2, 10.11.4
    • Fix Version/s: 10.11
    • Component/s: Server
    • Environment: ProLiant DL360 Gen10, 48 cores, 128GB memory, CentOS 8

    Description

      We have a primary with 2 replicas and have seen both replicas crash almost simultaneously on the same query. At other times during the day the same query has run without problems, and unfortunately we have not managed to reproduce the crash.
      So far we have seen exactly the same behaviour only once before, two weeks ago, when a very similar query crashed both replication servers.
      In both cases a bulk import running on the primary server was being replicated to the crashing replicas, but on a different schema and table than the ones requested in the query.

          Activity

            cbefin Christian Braeuner added a comment -

            To clarify our setup: one replica runs MariaDB 10.11.2, the other 10.11.4. The replicas are used exclusively for reading. The master is not used for queries and only imports data.
            We had another crash of both replicas today, and again a bulk import was being replicated from the master in parallel, although on an unrelated schema and a different table than the last time the replicas crashed. The master imports run relatively frequently, so this could still be a coincidence.

            cbefin Christian Braeuner added a comment -

            We are still experiencing the crashes at random intervals, on average 3-4 times a month.
            I have uploaded the latest crash dump from today, although there is nothing new and it is the same function, "choose_best_splitting". So far we have only ever seen the crashes when a table import on the primary is being replicated to the replica at the time the problematic query is executed there.
            Is there anything we can try to help find the problem?

            rdem Richard DEMONGEOT added a comment -

            Hello cbefin,

            Could you read the issue https://jira.mariadb.org/browse/MDEV-32064? I think it is very similar.

            If so, a patch has been written but not yet released. It should be in the next release.

            Regards,

            cbefin Christian Braeuner added a comment -

            Hi Richard,
            thanks. I have experimented with the in_predicate_conversion_threshold setting, and while I can trigger a crash with the steps given in that report, I was not able to reproduce a crash with our own problematic query under load, even with different values of in_predicate_conversion_threshold. The query takes much longer when the value is set too low, but it does not crash the database.

            cbefin Christian Braeuner added a comment -

            We changed our configuration last week to also use the primary node for requests, in an attempt to eliminate replication as one of the factors. Today we had a simultaneous crash of the primary and one replica, which tells us that replication is not causing the instability. The crash again happened during a bulk import of an unrelated table in a separate schema.
            alice Alice Sherepa added a comment -

            Is it possible for you to upgrade to a recent MariaDB version? It might be the same as MDEV-31440, and with the test case that was provided there, the crash does not happen anymore.

            cbefin Christian Braeuner added a comment -

            Hi, in the meantime we have changed the query so that it no longer uses a subselect with DISTINCT. Since then we have had no crashes. As this crash was only ever observed in our production environment, we do not want to put the dangerous query back in order to test later versions of MariaDB.

            People

              Assignee: psergei Sergei Petrunia
              Reporter: cbefin Christian Braeuner
              Votes: 2
              Watchers: 7
