Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-13195

Connect Engine: INSERT FROM SELECT (or CREATE TABLE AS SELECT) memory leak

Details

    Description

      Scenario:

      Bash console (it's mandatory to execute from the bash):

      1) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS c1;'
      2) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'CREATE TABLE c1 ENGINE=CONNECT TABLE_TYPE=MYSQL DBNAME='\''information_schema'\'' OPTION_LIST='\''host=127.0.0.1,port=33235,user=user,password=password'\'' `tabname`='\''tables'\'''
      3) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'CREATE TABLE t1 AS SELECT * FROM c1'

      This scenario leads to spending an additional 500 MB (equal to value of "connect_work_size" variable) of virtual memory for each repetition. This memory is not released until you restart the mysql process. Remote database name and a table - can be any.

      Attachments

        Issue Links

          Activity

            Sergey.Antonyuk,

            You've said "for each repetition". Do you mean that on the same running server you get the memory keep growing indefinitely by just repeating DROP/CREATE/CREATE as above? Can you show it – results of some memory monitor, e.g. top (or whatever you use)?

            I can of course observe the 500MB growth with the corresponding connect_work_size, but it's a one-time event for me.

            elenst Elena Stepanova added a comment - Sergey.Antonyuk , You've said "for each repetition". Do you mean that on the same running server you get the memory keep growing indefinitely by just repeating DROP/CREATE/CREATE as above? Can you show it – results of some memory monitor, e.g. top (or whatever you use)? I can of course observe the 500MB growth with the corresponding connect_work_size , but it's a one-time event for me.

            I can confirm, calling the commands from the OP in a loop (something like while true ; do sleep 1 ; mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS t1; CREATE TABLE t1 AS SELECT * FROM c1' ; done) causes an unlimited memory leakage for me.

            I've spent some time debugging and I've found the possible problem in the GetPlug/GetUser functions in ha_connect.cc. It looks like under certain circumstances they can allocate a new user_connect object and overwrite the pointer to the old object with a pointer to the new one without deallocating the old object. Another issue I've found is that a global linked list with its head in user_connect::to_users seems to be accessed by multiple threads without any mutex.

            After "fixing" these problems (to the degree I understand them) I eliminated the memory leak, but now I've got occasional (once a day) segfaults, so I'm clearly still missing something.

            ekarav Egor Karavaev (Inactive) added a comment - I can confirm, calling the commands from the OP in a loop (something like while true ; do sleep 1 ; mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS t1; CREATE TABLE t1 AS SELECT * FROM c1' ; done ) causes an unlimited memory leakage for me. I've spent some time debugging and I've found the possible problem in the GetPlug / GetUser functions in ha_connect.cc . It looks like under certain circumstances they can allocate a new user_connect object and overwrite the pointer to the old object with a pointer to the new one without deallocating the old object. Another issue I've found is that a global linked list with its head in user_connect::to_users seems to be accessed by multiple threads without any mutex. After "fixing" these problems (to the degree I understand them) I eliminated the memory leak, but now I've got occasional (once a day) segfaults, so I'm clearly still missing something.

            Thanks Egor for working on this problem. I was just looking at a similar case reported by another customer. Indeed this code was imported from a legacy application and is more than 10 years old and has not been really tested in a multi-user configuration.

            Could you please publish the fixes you've done so I can test them and try to find the cause of the segfaults.

            bertrandop Olivier Bertrand added a comment - Thanks Egor for working on this problem. I was just looking at a similar case reported by another customer. Indeed this code was imported from a legacy application and is more than 10 years old and has not been really tested in a multi-user configuration. Could you please publish the fixes you've done so I can test them and try to find the cause of the segfaults.

            By the way, trying your example, I could not reproduce the bug with MariaDB 10.2.6 on Windows.

            Indeed, after executing 'CREATE TABLE t1 AS SELECT * FROM c1' the work area is not freed and is reused for instance when executing 'SELECT * FROM t1' but this is as designed. However, before re-executing the 'DROP TABLE c1' the memory is released and no leak occurs.

            This does not impact what Egor has found, and it is true that using a mutex must be added.

            bertrandop Olivier Bertrand added a comment - By the way, trying your example, I could not reproduce the bug with MariaDB 10.2.6 on Windows. Indeed, after executing 'CREATE TABLE t1 AS SELECT * FROM c1' the work area is not freed and is reused for instance when executing 'SELECT * FROM t1' but this is as designed. However, before re-executing the 'DROP TABLE c1' the memory is released and no leak occurs. This does not impact what Egor has found, and it is true that using a mutex must be added.

            I'm using MariaDB 10.1.24, and the command I've posted results in an unlimited memory usage growth for me, maybe it has been fixed in 10.2.
            Here are my current fixes: mariadb_10.1.24_connect.diff, this version has been working without segfaults for several days now, but it's more like a proof of concept and not an actual solution.

            ekarav Egor Karavaev (Inactive) added a comment - I'm using MariaDB 10.1.24, and the command I've posted results in an unlimited memory usage growth for me, maybe it has been fixed in 10.2. Here are my current fixes: mariadb_10.1.24_connect.diff , this version has been working without segfaults for several days now, but it's more like a proof of concept and not an actual solution.

            I don't think there are any differences between 10.1 and 10.2 concerning this memory handling.

            Looking at your fixes, I am wondering about the GetPlug changes and the PCONNECT_Guard class. This function is used to retrieve an existing user work area. The problem is that MariaDB uses many handlerton classes and they are constantly deleted and re-created. Currently, all work areas are in the list pointed by the Users variable but I did not find a way to clearly indicate to what user they belong. This is why I am using a test on the THD assuming each user is using the same base THD.

            Normally this fonction should just retrieve the same one except if a user is using several thread with different THD. It seems that you have found that sometimes this can lead to a lost of memory. However, in the other end, your fix just release the memory of another thread when GetPlug is called with a different THD. But the other thread is still alive and this can lead to unnecessary reallocation or worse crash if the other thread later try to use that memory.

            I understand that all this is unclear, and really admit that this memory handling is a mess as it is. Perhaps a better solution would be to retrieve the g pointer from the share if it is really a structure shared by all handlers of the same user. I am not familiar enough with MariaDB internals and I don't want to make such a change if it make all CONNECT engines to crash.

            So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end?

            Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.

            bertrandop Olivier Bertrand added a comment - I don't think there are any differences between 10.1 and 10.2 concerning this memory handling. Looking at your fixes, I am wondering about the GetPlug changes and the PCONNECT_Guard class. This function is used to retrieve an existing user work area. The problem is that MariaDB uses many handlerton classes and they are constantly deleted and re-created. Currently, all work areas are in the list pointed by the Users variable but I did not find a way to clearly indicate to what user they belong. This is why I am using a test on the THD assuming each user is using the same base THD. Normally this fonction should just retrieve the same one except if a user is using several thread with different THD. It seems that you have found that sometimes this can lead to a lost of memory. However, in the other end, your fix just release the memory of another thread when GetPlug is called with a different THD. But the other thread is still alive and this can lead to unnecessary reallocation or worse crash if the other thread later try to use that memory. I understand that all this is unclear, and really admit that this memory handling is a mess as it is. Perhaps a better solution would be to retrieve the g pointer from the share if it is really a structure shared by all handlers of the same user. I am not familiar enough with MariaDB internals and I don't want to make such a change if it make all CONNECT engines to crash. So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end? Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.
            ekarav Egor Karavaev (Inactive) added a comment - - edited

            First, the PCONNECT_Guard class is an artifact of my earlier attempt to manage leaks in 10.1.16, looks like in 10.1.24 it should indeed be replaced with PopUser.

            So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end?

            I've tested different configurations, the mutex alone doesn't fix the leak, the mimial fix is

            if(old_xp!=NULL && old_xp!=lxp) {
              PopUser(old_xp);
            }
            

            in GetPlug. I believe the other places where PCONNECT_Guard is used can cause leaks too, but in different circumstances.

            Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.

            As I understand it, the problem is GetPlug overwriting the pointer in ha_connect::xp (since it takes it by reference) with a new value without calling PopUser on it, which can happen if no user_connect object with the current thd has been found in the user_connect::to_users list. If the current value of ha_connect::xp is not NULL, this leads to the memory leak, despite the object still being accessable via the list, since the reference count won't be decremented in ha_connect::~ha_connect. The same logic applies to the other 3 places where xp= GetUser(ha_thd(), xp); is used in the code outside of GetPlug.

            It's interesting that without this change the engine doesn't crash, despite the list not being protected by a mutex, it's only after I made this change MariaDB started to crash with SIGSEGV, because threads started to free each other's data. This led me to add the mutex and gradually put more and more code inside it until it stopped to crash.

            ekarav Egor Karavaev (Inactive) added a comment - - edited First, the PCONNECT_Guard class is an artifact of my earlier attempt to manage leaks in 10.1.16, looks like in 10.1.24 it should indeed be replaced with PopUser . So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end? I've tested different configurations, the mutex alone doesn't fix the leak, the mimial fix is if (old_xp!=NULL && old_xp!=lxp) { PopUser(old_xp); } in GetPlug . I believe the other places where PCONNECT_Guard is used can cause leaks too, but in different circumstances. Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions. As I understand it, the problem is GetPlug overwriting the pointer in ha_connect::xp (since it takes it by reference) with a new value without calling PopUser on it, which can happen if no user_connect object with the current thd has been found in the user_connect::to_users list. If the current value of ha_connect::xp is not NULL, this leads to the memory leak, despite the object still being accessable via the list, since the reference count won't be decremented in ha_connect::~ha_connect . The same logic applies to the other 3 places where xp= GetUser(ha_thd(), xp); is used in the code outside of GetPlug . It's interesting that without this change the engine doesn't crash, despite the list not being protected by a mutex, it's only after I made this change MariaDB started to crash with SIGSEGV, because threads started to free each other's data. This led me to add the mutex and gradually put more and more code inside it until it stopped to crash.

            The cause of the leak was finally found thanks to the help of Egor Karavaev.

            Indeed the leak was caused by replacing the xp pointer in the handler class structure. As it is freed when the class is deleted, only the new xp was freed. The fix is in the GetUser function:

              if (xp) {
                if (thd == xp->thdp)
                  return xp;
             
                PopUser(xp);		// Avoid memory leak
              } // endif xp
            

            Doing this, there is nothing wrong with the GetPlug function.

            In addition, muti-user parts are now secured by mutex.

            bertrandop Olivier Bertrand added a comment - The cause of the leak was finally found thanks to the help of Egor Karavaev. Indeed the leak was caused by replacing the xp pointer in the handler class structure. As it is freed when the class is deleted, only the new xp was freed. The fix is in the GetUser function: if (xp) { if (thd == xp->thdp) return xp;   PopUser(xp); // Avoid memory leak } // endif xp Doing this, there is nothing wrong with the GetPlug function. In addition, muti-user parts are now secured by mutex.

            People

              bertrandop Olivier Bertrand
              Sergey.Antonyuk Sergey Antonyuk
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.