Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-34493

memory leak when using galera node with only replication thread

Details

    • Bug
    • Status: Needs Feedback (View Workflow)
    • Critical
    • Resolution: Unresolved
    • 10.6.16
    • 10.6
    • Galera, Server
    • None
    • redhat 8 on vmware

    Description

      we are getting memory leak in one DB server.

      situation:

      • our architecture is 3 nodes galera cluster - 2 DB nodes + 1 arbitrator in different locations. Plus running haproxy as the front end
      • haproxy redirects traffic to one DB node in normal stage. Another DB node is configured as backup in haproxy
      • the backup DB node used up its operating system available memory in 1 or 2 days. The DB node is running on 64GB memory VM and 8GB buffer pool
      • after available memory used up in backup DB node, OS killed the DB process and available memory got released
      • the backup DB node got 128 galera replication threads

      Since haproxy does not route traffic to backup DB node in normal situation, only replication and Percona PMM are using backup DB node.

      Screenshot:
      uploaded top command output, keeping decreasing available memory and output of memory_summary_global_by_event_name and memory_summary_by_thread_by_event_name

      Question:
      Any method to continue the troubleshooting? Kindly advise "memory/sql/thd::main_mem_root" is related to the memory leak or not.

      Regards,
      William Wong

      Attachments

        Activity

          LawrenceMan Lawrence Man added a comment -

          Hi support team,

          A high level summary of the problem:

          At 1:xx a.m. daily, some transactions take place at node 1, with select statements involved and transaction isolation level are repeatable read.

          Node 2 slave thread memory consumption bumped up by ~2GB per day during this batch job execution, and such memory is not released.

          We are unable to capture what was running at node 2. The information_summary.processlist.info_binary is NULL throughout the observation period.

          Thanks and best regards,

          Lawrence

          LawrenceMan Lawrence Man added a comment - Hi support team, A high level summary of the problem: At 1:xx a.m. daily, some transactions take place at node 1, with select statements involved and transaction isolation level are repeatable read. Node 2 slave thread memory consumption bumped up by ~2GB per day during this batch job execution, and such memory is not released. We are unable to capture what was running at node 2. The information_summary.processlist.info_binary is NULL throughout the observation period. Thanks and best regards, Lawrence

          frelist Is wsrep-slave-thread=128 in every node ?

          janlindstrom Jan Lindström added a comment - frelist Is wsrep-slave-thread=128 in every node ?
          LawrenceMan Lawrence Man added a comment -

          Yes, but we recently we have reduced it match with the # of vCPUs and the situation remains the same.

          LawrenceMan Lawrence Man added a comment - Yes, but we recently we have reduced it match with the # of vCPUs and the situation remains the same.
          LawrenceMan Lawrence Man added a comment -

          I got a small reproducible case as in the attached file:

          1. A table called t2 is created.
          2. An insert w/ 10mil records via the stored procedure InsertIntoLknmT2_bind_savept() caused increase in node 2's memory utilization, whenever the SP was executed for the first & second time.
          3. An SP Update_LknmT2_bind_savept was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit.
          4. An SP Update_LknmT2_bind was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit.

          It appears that the committing after running large batch of INSERT or UPDATE can lead to such situation, regardless of the presence of savepoint / release savepoint.

          Thanks and best regards,

          Lawrence

          Investigation-2025-04-03.txt

          LawrenceMan Lawrence Man added a comment - I got a small reproducible case as in the attached file: 1. A table called t2 is created. 2. An insert w/ 10mil records via the stored procedure InsertIntoLknmT2_bind_savept() caused increase in node 2's memory utilization, whenever the SP was executed for the first & second time. 3. An SP Update_LknmT2_bind_savept was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit. 4. An SP Update_LknmT2_bind was executed twice. Each time, there was rapid increase in memory utilization at node 2, during commit. It appears that the committing after running large batch of INSERT or UPDATE can lead to such situation, regardless of the presence of savepoint / release savepoint. Thanks and best regards, Lawrence Investigation-2025-04-03.txt
          LawrenceMan Lawrence Man added a comment -

          In the previously attached file, the same insert / update is executed 10mil times, in a single transaction. The SQLs are written using bind variable. In this situation, will the memory used to parse the SQL be allocated for the first SQL, and re-used for the rest of SQLs within that same transaction, or will the memory used to parse the SQL be multiplied by 10mil times, since they are in the same transaction?

          I raise the above query because it can be observed that there is a sharp rise in memory utilization, even though only a single INSERT/UPDATE SQL skeleton is used.

          LawrenceMan Lawrence Man added a comment - In the previously attached file, the same insert / update is executed 10mil times, in a single transaction. The SQLs are written using bind variable. In this situation, will the memory used to parse the SQL be allocated for the first SQL, and re-used for the rest of SQLs within that same transaction, or will the memory used to parse the SQL be multiplied by 10mil times, since they are in the same transaction? I raise the above query because it can be observed that there is a sharp rise in memory utilization, even though only a single INSERT/UPDATE SQL skeleton is used.

          People

            janlindstrom Jan Lindström
            frelist William Wong
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:

              Git Integration

                Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.