[MDEV-13025] Server hang at NO Errors in error logs Created: 2017-06-07  Updated: 2017-07-05  Resolved: 2017-07-05

Status: Closed
Project: MariaDB Server
Component/s: Platform Windows, Server
Affects Version/s: 10.1.18
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Su, Jun-Ming Assignee: Vladislav Vaintroub
Resolution: Not a Bug Votes: 0
Labels: None
Environment:

Windows Server 2012 R2 x 2 with Microsoft Cluster Services (Active-Standby Mode)


Attachments: Zip Archive mysqld.exe_170628_093512.dmp.zip     Zip Archive mysqld.exe_170628_093523.dmp.zip     File server_audit.c     Zip Archive server_audit.zip     Text File unique_callstacks_2.txt     Text File unique_callstacks_3.txt    

 Description   

I have a strange problem, that is my server sometimes (about 1 time per 2-3 weeks) hang, and they are:

1. Can not connect remotely.
2. Can not connect locally.
3. The mysqld process still exists.

And then I reboot the active node in Microsoft Cluster Services, change active node to another Windows Server, and the service is back.

I tried to find the error message at Windows Event service and error log, there was NO error logs for this problem. (only normally shutdown and startup mysqld)

How can I solve this problem or find the root cause?



 Comments   
Comment by Vladislav Vaintroub [ 2017-06-07 ]

if there is a hang, you can examine what the server is doing by
a ) looking at it in debugger. can you also collect minidump and attach it to the bug report?
b ) if you do not find anything interesting there, maybe it is cllient that hangs, in this case you would need to look into client server communication with wireshark.

I suggest to start with a) . I'm not sure which significance it has that you run in MS cluster environment. We did not do any specific testing for that here.

Comment by Su, Jun-Ming [ 2017-06-07 ]

Thanks, and how to do minidump at windows?
I will provide the dump data to debug.

I think b is not problem is that I can not login at DB server itself.

Comment by Vladislav Vaintroub [ 2017-06-07 ]

you can take minidump with procdump https://technet.microsoft.com/en-us/sysinternals/dd996900.aspx
some kind of remote access would be necessary, even command line with be it psexec should be fine (but you'd need to upload the minidump, so also access to file system would be necessary)

Comment by Su, Jun-Ming [ 2017-06-08 ]

OK, I will get minidump for mysqld at server hang next time, thanks.

Comment by Su, Jun-Ming [ 2017-06-28 ]

Hi, It happened again this morning, and I got the minidump for mysqld, please help me to find out the reason, thanks.

(There are 2 files that I did 2 times for procdump at that moment.)

Comment by Elena Stepanova [ 2017-07-03 ]

wlad, could you please take a look to see if you can work some magic on it? I am not getting much from the dumps, everything looks like this:

--------------------------------------------------------------------------------
 1      [External Code]
 2      [Frames may be missing, no binary loaded for mysqld.exe]
 3      mysqld.exe!00007ff7c0d3a7cc()
 
Callstack for Thread 2 (Thread Id: 4784 (0x12b0)):
 Index  Function
--------------------------------------------------------------------------------
 1      [External Code]
 2      [Frames may be missing, no binary loaded for mysqld.exe]
 3      mysqld.exe!00007ff7c0d3adca()
 
Callstack for Thread 3 (Thread Id: 4780 (0x12ac)):
 Index  Function
...
--------------------------------------------------------------------------------
 1      [External Code]
 2      [Frames may be missing, no binary loaded for server_audit.dll]
 3      server_audit.dll!00007ff8f97f18a9()
 
Callstack for Thread 335 (Thread Id: 7712 (0x1e20)):
 Index  Function
--------------------------------------------------------------------------------
 1      [External Code]
 2      [Frames may be missing, no binary loaded for server_audit.dll]
 3      server_audit.dll!00007ff8f97f18a9()

There are lots of threads in server_audit.dll, so it could be one of suspects.

Comment by Vladislav Vaintroub [ 2017-07-03 ]

elenst, the secret is putting correct .pdb, .exe, .dll files, next to .dmp, otherwise at least VS debugger does not want to work. Let me check what's here

Comment by Su, Jun-Ming [ 2017-07-03 ]

Hi, is there something wrong with my server_audit.dll ?

In order to our company audit requirements, I modify original server_audit.dll, and to meet the requirements for audit policies.

I upload the modified source code, dll file, and pdb file.

I modified (in codes there are included from /* modify start / to / modify end */, and sent out more QUERY_XXX tags for operations.

static int log_statement_ex(const struct connection_info *cn,
                            time_t ev_time, unsigned long thd_id,
                            const char *query, unsigned int query_len,
                            int error_code, const char *type)
{
  size_t csize;
 ...
 int result;
/* modify start */ 
 char *modtype = type;
/* modify end */ 
 
...
if (query && !(events & EVENT_QUERY_ALL) &&
      (events & EVENT_QUERY))
  {
    const char *orig_query= query;
   
   /* modify start */ 
    if (!filter_query_type(query, not_ddl_keywords) &&
	  filter_query_type(query, ddl_keywords))
	  modtype = "QUERY_DDL";
   if (filter_query_type(query, dml_keywords))
	  modtype = "QUERY_DML";
   if (filter_query_type(query, dcl_keywords))
	  modtype = "QUERY_DCL";
  /* modify end */
...
 
  return 0;
do_log_query:
    query= orig_query;
  }
 
/* modify start */ 
  csize= log_header(message, message_size-1, &ev_time,
                    servhost, servhost_len,
                    cn->user, cn->user_length,cn->host, cn->host_length,
                    cn->ip, cn->ip_length, thd_id, query_id, modtype);
/* modify end */ 
 
...

I had used this modified plugin on 4 single servers (2 for replication slaves, 2 for test databases) and there is no problem like this before.

Comment by Vladislav Vaintroub [ 2017-07-03 ]

I know you modified the audit plugin, because we do not have matching symbols for that. Without matching symbols it is hard to debug. This would be nice if you could provide server_audit.dll and server_audit.pdb that you compiled

But yes, it looks like it is the audit plugin that makes problems. It looks that one thread in the audit plugin is writing something ,and others are waiting

from https://jira.mariadb.org/secure/attachment/43802/unique_stacks%2Ctxt.txt

"writing" thread

.161 Id: 770.1970 Suspend: 0 Teb: 00007ff7`c014e000 Unfrozen
Start: ntdll!TppWorkerThread (00007ff9`0a7789b0)
Priority: 0 Priority class: 32 Affinity: ffff
Child-SP RetAddr Call Site
0000002e`d9f7d828 00007ff9`079e74e9 ntdll!NtWriteFile+0xa
0000002e`d9f7d830 00007ff7`c1244abb KERNELBASE!WriteFile+0x10c
0000002e`d9f7d8b0 00007ff8`f97f3199 mysqld!my_win_write+0x8b [d:\winx64-packages\build\src\mysys\my_winfile.c @ 452]
0000002e`d9f7d910 00000000`00000455 server_audit+0x3199

"waiting" threads

.165 Id: 770.c98 Suspend: 0 Teb: 00007ff7`c0146000 Unfrozen
Start: ntdll!TppWorkerThread (00007ff9`0a7789b0)
Priority: 0 Priority class: 32 Affinity: ffff
Child-SP RetAddr Call Site
0000002e`da33d248 00007ff9`0a77ddb9 ntdll!NtWaitForSingleObject+0xa
0000002e`da33d250 00007ff9`0a77b784 ntdll!RtlpWaitOnCriticalSection+0xe1
0000002e`da33d320 00007ff8`f97f18a9 ntdll!RtlpEnterCriticalSectionContended+0xa4
0000002e`da33d360 00000000`0000000f server_audit+0x18a9

Comment by Su, Jun-Ming [ 2017-07-03 ]

Thank you for helping me, and Yes, It is hard to debug for no dll and pdb. I am sorry, and I had uploaded them, please help me to figure out that, and I have no idea about using this modified plugin at another server is fine.

Comment by Vladislav Vaintroub [ 2017-07-03 ]

ok , better resolved stacks are in https://jira.mariadb.org/secure/attachment/43803/unique_callstacks_2.txt now

Comment by Vladislav Vaintroub [ 2017-07-03 ]

added callstacks for another .dmp

Comment by Vladislav Vaintroub [ 2017-07-03 ]

I'm not familiar with audit plugin, but what is suspicious here is that in both cases there is one thread that is stuck in WriteFile

0000002e`d9f7d828 00007ff9`079e74e9 ntdll!NtWriteFile+0xa
0000002e`d9f7d830 00007ff7`c1244abb KERNELBASE!WriteFile+0x10c
0000002e`d9f7d8b0 00007ff8`f97f3199 mysqld!my_win_write+0x8b [d:\winx64-packages\build\src\mysys\my_winfile.c @ 452]
0000002e`d9f7d910 00007ff8`f97f5009 server_audit!loc_write+0x9 [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 143]
0000002e`d9f7d940 00007ff8`f97f3d2f server_audit!write_log+0x29 [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 1224]
0000002e`d9f7d970 00007ff8`f97f3818 server_audit!log_statement_ex+0x50f [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 1721]
0000002e`d9f7de70 00007ff8`f97f197a server_audit!log_statement+0x28 [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 1735]
0000002e`d9f7dec0 00007ff7`c0e6ecb8 server_audit!auditing+0x10a [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 2021]
0000002e`d9f7def0 00007ff7`c0e6e770 mysqld!plugins_dispatch+0x28 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 511]
0000002e`d9f7df20 00007ff7`c0e6e499 mysqld!event_class_dispatch+0x90 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 545]
0000002e`d9f7df70 00007ff7`c0e6ebc6 mysqld!general_class_handler+0xa9 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 90]
0000002e`d9f7e010 00007ff7`c0e23ce6 mysqld!mysql_audit_notify+0x46 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 241]
0000002e`d9f7e060 00007ff7`c0e209e0 mysqld!mysql_audit_general+0x1e6 [d:\winx64-packages\build\src\sql\sql_audit.h @ 158]
0000002e`d9f7e350 00007ff7`c0e20d67 mysqld!dispatch_command+0x1300 [d:\winx64-packages\build\src\sql\sql_parse.cc @ 1952]
0000002e`d9f7f220 00007ff7`c0d91489 mysqld!do_command+0x177 [d:\winx64-packages\build\src\sql\sql_parse.cc @ 1108]

And every other thread stuck in plugin is waiting for some critical section, which is probably held by the file writing thread, like this

0000002e`da33d248 00007ff9`0a77ddb9 ntdll!NtWaitForSingleObject+0xa
0000002e`da33d250 00007ff9`0a77b784 ntdll!RtlpWaitOnCriticalSection+0xe1
0000002e`da33d320 00007ff8`f97f18a9 ntdll!RtlpEnterCriticalSectionContended+0xa4
0000002e`da33d360 00007ff7`c0e6ecb8 server_audit!auditing+0x39 [e:\mariadb-10.1.18\plugin\server_audit\server_audit.c @ 1974]
0000002e`da33d390 00007ff7`c0e6e770 mysqld!plugins_dispatch+0x28 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 511]
0000002e`da33d3c0 00007ff7`c0e6e637 mysqld!event_class_dispatch+0x90 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 545]
0000002e`da33d410 00007ff7`c0e6ebc6 mysqld!table_class_handler+0xc7 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 140]
0000002e`da33d4d0 00007ff7`c0d4fc9a mysqld!mysql_audit_notify+0x46 [d:\winx64-packages\build\src\sql\sql_audit.cc @ 241]
0000002e`da33d520 00007ff7`c0d4bc20 mysqld!mysql_audit_external_lock+0xea [d:\winx64-packages\build\src\sql\sql_audit.h @ 240]
0000002e`da33d5d0 00007ff7`c0f1ff82 mysqld!handler::ha_external_lock+0xd0 [d:\winx64-packages\build\src\sql\handler.cc @ 5848]
0000002e`da33d650 00007ff7`c0f20b42 mysqld!lock_external+0x62 [d:\winx64-packages\build\src\sql\lock.cc @ 364]
0000002e`da33d690 00007ff7`c0f20a42 mysqld!mysql_lock_tables+0xc2 [d:\winx64-packages\build\src\sql\lock.cc @ 310]
0000002e`da33d6e0 00007ff7`c0e5c182 mysqld!mysql_lock_tables+0x72 [d:\winx64-packages\build\src\sql\lock.cc @ 275]
0000002e`da33d710 00007ff7`c0e5d2af mysqld!lock_tables+0xd2 [d:\winx64-packages\build\src\sql\sql_base.cc @ 5430]
0000002e`da33d740 00007ff7`c0e216e4 mysqld!open_and_lock_tables+0x6f [d:\winx64-packages\build\src\sql\sql_base.cc @ 5258]

Now, WriteFile() would not block for long time, when writing to regular file. But if file is a pipe, and the pipe is full, then blocking occurs. Can this be the case here?

Comment by Su, Jun-Ming [ 2017-07-05 ]

Hi, the audit files will be written on network share drive on different server, and I think although there are something wrong on server audit, It still can not block the database server itself.

Comment by Vladislav Vaintroub [ 2017-07-05 ]

audit plugin blocks for the entire duration of "auditing()" . It waits on a internal lock.I cannot comment whether this decision is wise, but as it stands , I would do is to try to avoid long running operations inside the auditing(), since one thread can run it at the same time. It looks like the writes to network drive take a long time in your environment

Comment by Su, Jun-Ming [ 2017-07-05 ]

Ok, maybe it was designed like this. Not liking MSSQL TraceFile method, there will be another thread to do that.

I have another question,that is the database server is operated 7x24, why it was not happened everyday or periodically, is it the randomly happened?

And is there anyway to set WriteFile() function to nonblocking or async?

Comment by Vladislav Vaintroub [ 2017-07-05 ]

It was definitely not designed for the network access in mind. If I was doing it, I would implement it to write file locally, and maybe have a batch job scheduled periodically to transfer new records to some network place.
I cannot tell why it is randomly happens for you. Network access is something where every failure is seemingly random. Again, I doubt service_audit was written to function reliably with networking file systems.

Comment by Su, Jun-Ming [ 2017-07-05 ]

Ok, thank you all for helping this issue, I will modify the design of my site.

Generated at Thu Feb 08 08:02:19 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.