[MDEV-13195] Connect Engine: INSERT FROM SELECT (or CREATE TABLE AS SELECT) memory leak - Jira

Details

Type: Bug
Status: Closed (View Workflow)
Priority: Major
Resolution: Fixed
Affects Version/s: 10.1.24
Fix Version/s: 10.1.26, 10.0.32, 10.2.8
Component/s: Storage Engine - Connect
Labels:
None
Environment:
3.2.0-4-amd64 #1 SMP Debian 3.2.35-2 x86_64 GNU/Linux

Description

Scenario:

Bash console (it's mandatory to execute from the bash):

1) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS c1;'
2) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'CREATE TABLE c1 ENGINE=CONNECT TABLE_TYPE=MYSQL DBNAME='\''information_schema'\'' OPTION_LIST='\''host=127.0.0.1,port=33235,user=user,password=password'\'' `tabname`='\''tables'\'''
3) mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'CREATE TABLE t1 AS SELECT * FROM c1'

This scenario leads to spending an additional 500 MB (equal to value of "connect_work_size" variable) of virtual memory for each repetition. This memory is not released until you restart the mysql process. Remote database name and a table - can be any.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

mariadb_10.1.24_connect.diff
8 kB
2017-07-21 09:05

Issue Links

is part of

MDEV-10496 Connect Engine: CREATE TABLE memory leak

Closed

Activity

Ascending order - Click to sort in descending order

Elena Stepanova added a comment - 2017-07-03 17:31

Sergey.Antonyuk,

You've said "for each repetition". Do you mean that on the same running server you get the memory keep growing indefinitely by just repeating DROP/CREATE/CREATE as above? Can you show it – results of some memory monitor, e.g. top (or whatever you use)?

I can of course observe the 500MB growth with the corresponding connect_work_size, but it's a one-time event for me.

Elena Stepanova added a comment - 2017-07-03 17:31 Sergey.Antonyuk , You've said "for each repetition". Do you mean that on the same running server you get the memory keep growing indefinitely by just repeating DROP/CREATE/CREATE as above? Can you show it – results of some memory monitor, e.g. top (or whatever you use)? I can of course observe the 500MB growth with the corresponding connect_work_size , but it's a one-time event for me.

Egor Karavaev (Inactive) added a comment - 2017-07-19 10:03

I can confirm, calling the commands from the OP in a loop (something like while true ; do sleep 1 ; mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS t1; CREATE TABLE t1 AS SELECT * FROM c1' ; done) causes an unlimited memory leakage for me.

I've spent some time debugging and I've found the possible problem in the GetPlug/GetUser functions in ha_connect.cc. It looks like under certain circumstances they can allocate a new user_connect object and overwrite the pointer to the old object with a pointer to the new one without deallocating the old object. Another issue I've found is that a global linked list with its head in user_connect::to_users seems to be accessed by multiple threads without any mutex.

After "fixing" these problems (to the degree I understand them) I eliminated the memory leak, but now I've got occasional (once a day) segfaults, so I'm clearly still missing something.

Egor Karavaev (Inactive) added a comment - 2017-07-19 10:03 I can confirm, calling the commands from the OP in a loop (something like while true ; do sleep 1 ; mysql -h 127.0.0.1 --port=3306 -u user -ppassword db -e 'DROP TABLE IF EXISTS t1; CREATE TABLE t1 AS SELECT * FROM c1' ; done ) causes an unlimited memory leakage for me. I've spent some time debugging and I've found the possible problem in the GetPlug / GetUser functions in ha_connect.cc . It looks like under certain circumstances they can allocate a new user_connect object and overwrite the pointer to the old object with a pointer to the new one without deallocating the old object. Another issue I've found is that a global linked list with its head in user_connect::to_users seems to be accessed by multiple threads without any mutex. After "fixing" these problems (to the degree I understand them) I eliminated the memory leak, but now I've got occasional (once a day) segfaults, so I'm clearly still missing something.

Olivier Bertrand added a comment - 2017-07-20 15:12

Thanks Egor for working on this problem. I was just looking at a similar case reported by another customer. Indeed this code was imported from a legacy application and is more than 10 years old and has not been really tested in a multi-user configuration.

Could you please publish the fixes you've done so I can test them and try to find the cause of the segfaults.

Olivier Bertrand added a comment - 2017-07-20 15:12 Thanks Egor for working on this problem. I was just looking at a similar case reported by another customer. Indeed this code was imported from a legacy application and is more than 10 years old and has not been really tested in a multi-user configuration. Could you please publish the fixes you've done so I can test them and try to find the cause of the segfaults.

Olivier Bertrand added a comment - 2017-07-20 21:15

By the way, trying your example, I could not reproduce the bug with MariaDB 10.2.6 on Windows.

Indeed, after executing 'CREATE TABLE t1 AS SELECT * FROM c1' the work area is not freed and is reused for instance when executing 'SELECT * FROM t1' but this is as designed. However, before re-executing the 'DROP TABLE c1' the memory is released and no leak occurs.

This does not impact what Egor has found, and it is true that using a mutex must be added.

Olivier Bertrand added a comment - 2017-07-20 21:15 By the way, trying your example, I could not reproduce the bug with MariaDB 10.2.6 on Windows. Indeed, after executing 'CREATE TABLE t1 AS SELECT * FROM c1' the work area is not freed and is reused for instance when executing 'SELECT * FROM t1' but this is as designed. However, before re-executing the 'DROP TABLE c1' the memory is released and no leak occurs. This does not impact what Egor has found, and it is true that using a mutex must be added.

Egor Karavaev (Inactive) added a comment - 2017-07-21 09:22

I'm using MariaDB 10.1.24, and the command I've posted results in an unlimited memory usage growth for me, maybe it has been fixed in 10.2.
Here are my current fixes: mariadb_10.1.24_connect.diff, this version has been working without segfaults for several days now, but it's more like a proof of concept and not an actual solution.

Egor Karavaev (Inactive) added a comment - 2017-07-21 09:22 I'm using MariaDB 10.1.24, and the command I've posted results in an unlimited memory usage growth for me, maybe it has been fixed in 10.2. Here are my current fixes: mariadb_10.1.24_connect.diff , this version has been working without segfaults for several days now, but it's more like a proof of concept and not an actual solution.

Olivier Bertrand added a comment - 2017-07-21 10:49

I don't think there are any differences between 10.1 and 10.2 concerning this memory handling.

Looking at your fixes, I am wondering about the GetPlug changes and the PCONNECT_Guard class. This function is used to retrieve an existing user work area. The problem is that MariaDB uses many handlerton classes and they are constantly deleted and re-created. Currently, all work areas are in the list pointed by the Users variable but I did not find a way to clearly indicate to what user they belong. This is why I am using a test on the THD assuming each user is using the same base THD.

Normally this fonction should just retrieve the same one except if a user is using several thread with different THD. It seems that you have found that sometimes this can lead to a lost of memory. However, in the other end, your fix just release the memory of another thread when GetPlug is called with a different THD. But the other thread is still alive and this can lead to unnecessary reallocation or worse crash if the other thread later try to use that memory.

I understand that all this is unclear, and really admit that this memory handling is a mess as it is. Perhaps a better solution would be to retrieve the g pointer from the share if it is really a structure shared by all handlers of the same user. I am not familiar enough with MariaDB internals and I don't want to make such a change if it make all CONNECT engines to crash.

So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end?

Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.

Olivier Bertrand added a comment - 2017-07-21 10:49 I don't think there are any differences between 10.1 and 10.2 concerning this memory handling. Looking at your fixes, I am wondering about the GetPlug changes and the PCONNECT_Guard class. This function is used to retrieve an existing user work area. The problem is that MariaDB uses many handlerton classes and they are constantly deleted and re-created. Currently, all work areas are in the list pointed by the Users variable but I did not find a way to clearly indicate to what user they belong. This is why I am using a test on the THD assuming each user is using the same base THD. Normally this fonction should just retrieve the same one except if a user is using several thread with different THD. It seems that you have found that sometimes this can lead to a lost of memory. However, in the other end, your fix just release the memory of another thread when GetPlug is called with a different THD. But the other thread is still alive and this can lead to unnecessary reallocation or worse crash if the other thread later try to use that memory. I understand that all this is unclear, and really admit that this memory handling is a mess as it is. Perhaps a better solution would be to retrieve the g pointer from the share if it is really a structure shared by all handlers of the same user. I am not familiar enough with MariaDB internals and I don't want to make such a change if it make all CONNECT engines to crash. So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end? Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.

Egor Karavaev (Inactive) added a comment - 2017-07-21 12:28 - edited

First, the PCONNECT_Guard class is an artifact of my earlier attempt to manage leaks in 10.1.16, looks like in 10.1.24 it should indeed be replaced with PopUser.

So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end?

I've tested different configurations, the mutex alone doesn't fix the leak, the mimial fix is

if(old_xp!=NULL && old_xp!=lxp) {

  PopUser(old_xp);

in GetPlug. I believe the other places where PCONNECT_Guard is used can cause leaks too, but in different circumstances.

Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions.

As I understand it, the problem is GetPlug overwriting the pointer in ha_connect::xp (since it takes it by reference) with a new value without calling PopUser on it, which can happen if no user_connect object with the current thd has been found in the user_connect::to_users list. If the current value of ha_connect::xp is not NULL, this leads to the memory leak, despite the object still being accessable via the list, since the reference count won't be decremented in ha_connect::~ha_connect. The same logic applies to the other 3 places where xp= GetUser(ha_thd(), xp); is used in the code outside of GetPlug.

It's interesting that without this change the engine doesn't crash, despite the list not being protected by a mutex, it's only after I made this change MariaDB started to crash with SIGSEGV, because threads started to free each other's data. This led me to add the mutex and gradually put more and more code inside it until it stopped to crash.

Egor Karavaev (Inactive) added a comment - 2017-07-21 12:28 - edited First, the PCONNECT_Guard class is an artifact of my earlier attempt to manage leaks in 10.1.16, looks like in 10.1.24 it should indeed be replaced with PopUser . So are you sure that the call to PCONNECT_Guard by GetPlug is what made the memory leak to end? I've tested different configurations, the mutex alone doesn't fix the leak, the mimial fix is if (old_xp!=NULL && old_xp!=lxp) { PopUser(old_xp); } in GetPlug . I believe the other places where PCONNECT_Guard is used can cause leaks too, but in different circumstances. Normally, memory is released by PopUser when MariaDB deletes the handlerton and if not, it can be just a bad handling of the count variable which should be fixed by your mutex additions. As I understand it, the problem is GetPlug overwriting the pointer in ha_connect::xp (since it takes it by reference) with a new value without calling PopUser on it, which can happen if no user_connect object with the current thd has been found in the user_connect::to_users list. If the current value of ha_connect::xp is not NULL, this leads to the memory leak, despite the object still being accessable via the list, since the reference count won't be decremented in ha_connect::~ha_connect . The same logic applies to the other 3 places where xp= GetUser(ha_thd(), xp); is used in the code outside of GetPlug . It's interesting that without this change the engine doesn't crash, despite the list not being protected by a mutex, it's only after I made this change MariaDB started to crash with SIGSEGV, because threads started to free each other's data. This led me to add the mutex and gradually put more and more code inside it until it stopped to crash.

Olivier Bertrand added a comment - 2017-07-22 15:49

The cause of the leak was finally found thanks to the help of Egor Karavaev.

Indeed the leak was caused by replacing the xp pointer in the handler class structure. As it is freed when the class is deleted, only the new xp was freed. The fix is in the GetUser function:

  if (xp) {

    if (thd == xp->thdp)

      return xp;

    PopUser(xp);		// Avoid memory leak

  } // endif xp

Doing this, there is nothing wrong with the GetPlug function.

In addition, muti-user parts are now secured by mutex.

Olivier Bertrand added a comment - 2017-07-22 15:49 The cause of the leak was finally found thanks to the help of Egor Karavaev. Indeed the leak was caused by replacing the xp pointer in the handler class structure. As it is freed when the class is deleted, only the new xp was freed. The fix is in the GetUser function: if (xp) { if (thd == xp->thdp) return xp; PopUser(xp); // Avoid memory leak } // endif xp Doing this, there is nothing wrong with the GetPlug function. In addition, muti-user parts are now secured by mutex.

People

Assignee:: Olivier Bertrand

Reporter:: Sergey Antonyuk

Votes:: 0 Vote for this issue

Watchers:: 5 Start watching this issue

Dates

Created:: 2017-06-27 21:21

Updated:: 2017-07-22 15:49

Resolved:: 2017-07-22 15:49

Git Integration

Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.

MariaDB Server

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration