[MCOL-540] Nonroot installation: PrimProc restarts when creating tables Created: 2017-02-01  Updated: 2019-07-10  Resolved: 2019-07-10

Status: Closed
Project: MariaDB ColumnStore
Component/s: PrimProc
Affects Version/s: 1.0.7
Fix Version/s: Icebox

Type: Bug
Priority: Minor
Reporter: Daniel Lee (Inactive)
Assignee: Andrew Hutchings (Inactive)
Resolution: Won't Fix
Votes: 0
Labels: None


 Description   

Build tested: 1.0.7-1, Binary distribution package for Ubuntu 16.04

This issue occurs only under the following conditions:

Nonroot installation
Local query enabled (1um2pm configuration)
Binary package
Ubuntu 16.04 OS

When creating a table:

UM1:
MariaDB [(none)]> create database mytest
-> ;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use mytest
Database changed
MariaDB [mytest]> create table t1 (c1 int, c2 char(50)) engine=columnstore;
ERROR 1815 (HY000): Internal error: CAL0009: IDB-2045: At least one PrimProc closed the connection unexpectedly.
MariaDB [mytest]>

PM1

crit.log and err.log have the same entries:

Feb 1 20:41:01 vagrant controllernode[5053]: 01.215946 |0|0|0| C 29 CAL0000: BRMShmImpl::BRMShmImpl(): retrying on size==0
Feb 1 20:41:01 vagrant joblist[7135]: 01.252495 |0|0|0| C 05 CAL0000: /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/dbcon/joblist/distributedenginecomm.cpp @ 382 DEC: lost connection to 10.0.0.21
Feb 1 20:41:02 vagrant ProcessMonitor[3794]: 02.993197 |0|0|0| C 18 CAL0000: *****Calpont Process Restarting: PrimProc, old PID = 5053

The same test with root install was successful.

I stopped the system, set <MysqlRep> to n in the Columnstore.xml file and started the system again. I was able to create the table successfully.
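The configuration change described above can be scripted. The following is a minimal sketch, assuming Columnstore.xml is well-formed XML that contains a MysqlRep element somewhere in its tree. The function name set_mysql_rep and the sample document in the usage note are hypothetical, and the system still has to be stopped before the edit and started again afterwards.

```python
import xml.etree.ElementTree as ET


def set_mysql_rep(xml_text: str, value: str = "n") -> str:
    """Return xml_text with the text of every <MysqlRep> element set to value.

    Assumption: the element is named exactly "MysqlRep", as quoted in the
    report. Its position in the tree is not assumed, so all matches are updated.
    """
    root = ET.fromstring(xml_text)
    matches = list(root.iter("MysqlRep"))
    if not matches:
        raise ValueError("no <MysqlRep> element found")
    for elem in matches:
        elem.text = value
    return ET.tostring(root, encoding="unicode")
```

For example, set_mysql_rep("<Columnstore><Installation><MysqlRep>y</MysqlRep></Installation></Columnstore>") returns the same document with the element text changed to n (the nesting shown here is only an illustration). ElementTree drops comments and the XML declaration when serializing, so a careful in-place edit of the real file may be preferable.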



 Comments   
Comment by Andrew Hutchings (Inactive) [ 2017-02-01 ]

Hi Daniel,

Do you have a core file for PrimProc for when this happens? Also is there anything useful in DEBUG/INFO logs at the time?

Comment by Daniel Lee (Inactive) [ 2017-02-01 ]

Debug, info and warnings did not have any more useful info.

While trying to enable core files, I noticed that if I stop ColumnStore and start it again without making any changes, ColumnStore would be in an operational state, and I was able to create the table afterwards.

To enable core dump (default is disabled), I enabled the flag in Columnstore.xml right after I untarred the binary package. Therefore the system came up with core dump enabled. But no core file was generated when primproc crashed.

Using new, clean VMs, I installed ColumnStore and attached gdb to PrimProc. When PrimProc crashed on creating a table, I got the following (this is for PrimProc on PM1):

root@vagrant:~# ps -ef |grep -i primproc
guest 5437 4152 0 23:41 ? 00:00:00 [PrimProc]
root 17920 15738 0 23:46 pts/0 00:00:00 grep --color=auto -i primproc
root@vagrant:~# gdb -p 5437
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
Attaching to process 5437
[New LWP 5487]
[New LWP 5488]
[New LWP 5490]
[New LWP 5491]
[New LWP 5492]
[New LWP 5493]
[New LWP 5494]
[New LWP 5495]
[New LWP 5496]
[New LWP 5497]
[New LWP 5498]
[New LWP 5499]
[New LWP 5500]
[New LWP 5501]
[New LWP 5502]
[New LWP 5503]
[New LWP 5504]
[New LWP 5505]
[New LWP 5506]
[New LWP 5507]
[New LWP 5508]
[New LWP 5509]
[New LWP 5510]
[New LWP 5511]
[New LWP 5512]
[New LWP 5513]
[New LWP 7637]
[New LWP 7640]
[New LWP 7648]
[New LWP 7649]
[New LWP 7672]
[New LWP 7673]
[New LWP 8011]
[New LWP 8168]

warning: File "/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
add-auto-load-safe-path /lib/x86_64-linux-gnu/libthread_db-1.0.so
line to your configuration file "/root/.gdbinit".
To completely disable this security protection add
set auto-load safe-path /
line to your configuration file "/root/.gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual. E.g., run from the shell:
info "(gdb)Auto-loading safe path"

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: File "/lib/x86_64-linux-gnu/libthread_db-1.0.so" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
0x00007f7612e813a0 in pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:219
219 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) g
Ambiguous command "g": gcore, generate-core-file, goto-bookmark, gr, gu, guile, guile-repl.
(gdb) c
Continuing.

(create table was done from UM1 at this point.)

Thread 16 "PrimProc" received signal SIGABRT, Aborted.
[Switching to LWP 5502]
__GI_raise (sig=5437) at ../sysdeps/unix/sysv/linux/raise.c:37
37 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0 __GI_raise (sig=5437) at ../sysdeps/unix/sysv/linux/raise.c:37
#1 0x00007f761183501a in __GI_abort () at abort.c:87
#2 0x00007f76037f31d8 in ?? ()
#3 0x0000000000000000 in ?? ()
(gdb)

Comment by Andrew Hutchings (Inactive) [ 2017-02-02 ]

OK, so the problem is that your /dev/shm is not writable by your non-root user (which it isn't by default). This causes the "BRMShmImpl::BRMShmImpl(): retrying on size==0" message; it then throws an exception which is uncaught and fires a SIGABRT.

This would have been fixed by our post-install script if it was run as root.

Suggestions for fix:

1. Update documentation to state that post-install should be run as root (or sudo)
2. Move away from /dev/shm to the per-user systemd based tmpfs paths (not sure on CentOS 6 solution)
3. Make error log messages more explicit when this happens

Point 3 should be addressed in this ticket (as well as documentation in point 1). You can fix your own installation by either running post-install as root (or sudo) or using chmod.
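The diagnosis above can be checked from the non-root account before starting the system. The following is a rough sketch, not ColumnStore code: the function name probe_shm and the wording of its message are hypothetical. It tries to create a file in the directory and, on failure, returns a message that names the directory, the effective uid, and the directory mode, which is the kind of explicit message point 3 asks for.

```python
import os
import tempfile


def probe_shm(path: str = "/dev/shm") -> str:
    """Return "ok" if a file can be created in path, else an explicit message."""
    try:
        # Creating (and automatically deleting) a real file exercises
        # permission checks similar to those hit when a shared-memory
        # segment is created under this directory.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError as exc:
        try:
            mode = oct(os.stat(path).st_mode & 0o7777)
        except OSError:
            mode = "unknown"
        return (
            f"cannot create files in {path} as uid {os.geteuid()} "
            f"(mode {mode}): {exc.strerror}"
        )
    return "ok"
```

Run as the install user, print(probe_shm()) either prints ok or a message showing which permission to fix. A result of ok does not rule out other causes of the crash.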

Comment by Daniel Lee (Inactive) [ 2017-02-02 ]

Thanks.

I made the /dev/shm directory writable by the guest user in the base VM, then did the test again. It still failed with the same error. After the test failed, I verified that the guest user was able to write to the /dev/shm directory.

Comment by David Thompson (Inactive) [ 2017-02-05 ]

Tried reproducing this manually with 3 VMs (2 new Ubuntu 16 with latest updates) and cannot. Non-root install, and verified local query enabled and working. Also deliberately reset /dev/shm permissions beforehand, and postCfg does update this to be 777. One annoyance that took time to resolve was understanding that LD_LIBRARY_PATH needs to be set at the top of .bashrc to avoid install failures, because the ssh remote install uses a non-interactive login shell.
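The .bashrc ordering matters because the default Ubuntu .bashrc returns early when the shell is not interactive, so anything below that guard is never executed for the remote install. The following is a heuristic sketch only: the function name ld_path_set_before_guard is hypothetical, and it simply checks whether an LD_LIBRARY_PATH assignment appears before the first return statement in the file text.

```python
import re


def ld_path_set_before_guard(bashrc_text: str) -> bool:
    """Heuristic: is LD_LIBRARY_PATH assigned before the first `return`?"""
    for line in bashrc_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # ignore comment lines
        if re.search(r"\bLD_LIBRARY_PATH=", stripped):
            return True  # assignment reached before any early return
        if re.search(r"\breturn\b", stripped):
            return False  # a non-interactive shell would stop here
    return False
```

It can be applied to the contents of the install user's .bashrc on each node. The check does not evaluate shell logic, so a conditional return or a sourced file can fool it.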

Comment by Andrew Hutchings (Inactive) [ 2017-02-05 ]

The LD_LIBRARY_PATH is unavoidable until a data directory can be configured (I think there is a Jira for that). After that the ColumnStore libs and binaries could be installed in a standard path (as part of apt/yum) and the data in the non-root area.

I think this will be nearly impossible to reproduce without an exact way of duplicating Daniel's environment so that we can figure out why /dev/shm is not writable in his case.

We can, however, artificially create this problem easily which is what I will do when improving the error messages.

Comment by David Thompson (Inactive) [ 2017-06-05 ]

Improve the error message and documentation on this for now.

Comment by Andrew Hutchings (Inactive) [ 2017-07-27 ]

Changed priority and version due to this just being a case of sorting out an error message

Comment by David Hill (Inactive) [ 2017-08-18 ]

On 1.0.11 testing, found that PrimProc was crashing on this same test...

But when I set it up to capture a core file, it didn't crash, but the create table did hang.

I got this from the PrimProc gdb session:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185 ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S: No such file or directory.
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007fa64a620555 in boost::condition_variable_any::wait<boost::unique_lock<boost::mutex> > (m=...,
this=0x7ffe307b1a68) at /usr/include/boost/thread/pthread/condition_variable.hpp:184
#2 threadpool::ThreadPool::wait (this=this@entry=0x7ffe307b1a00)
at /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/utils/threadpool/threadpool.cpp:113
#3 0x0000000000498875 in primitiveprocessor::PrimitiveServer::start (this=this@entry=0x7ffe307b1a00)
at /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/primitiveserver.cpp:2139
#4 0x0000000000439122 in main (argc=<optimized out>, argv=<optimized out>)
at /home/builder/mariadb-columnstore-server/mariadb-columnstore-engine/primitives/primproc/primproc.cpp:629
(gdb) info threads
Id Target Id Frame
  * 1 Thread 0x7fa65073c740 (LWP 12237) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    2 Thread 0x7fa646e90700 (LWP 12270) "PrimProc" 0x00007fa64ae06c1d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:84
    3 Thread 0x7fa64668f700 (LWP 12271) "PrimProc" 0x00007fa64ae06c1d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:84
    4 Thread 0x7fa645e8e700 (LWP 12273) "PrimProc" 0x00007fa64ae06c1d in nanosleep ()
    at ../sysdeps/unix/syscall-template.S:84
    5 Thread 0x7fa64568d700 (LWP 12274) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    6 Thread 0x7fa644e8c700 (LWP 12275) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    7 Thread 0x7fa63ffff700 (LWP 12276) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    8 Thread 0x7fa63f7fe700 (LWP 12277) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    9 Thread 0x7fa63effd700 (LWP 12278) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    10 Thread 0x7fa63e7fc700 (LWP 12279) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    11 Thread 0x7fa63dffb700 (LWP 12280) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    12 Thread 0x7fa63d7fa700 (LWP 12281) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    13 Thread 0x7fa63cff9700 (LWP 12282) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    14 Thread 0x7fa63c7f8700 (LWP 12283) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    15 Thread 0x7fa63bff7700 (LWP 12284) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    16 Thread 0x7fa63b7f6700 (LWP 12285) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    17 Thread 0x7fa63aff5700 (LWP 12286) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    18 Thread 0x7fa63a7f4700 (LWP 12287) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    19 Thread 0x7fa639ff3700 (LWP 12288) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    20 Thread 0x7fa6397f2700 (LWP 12289) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    21 Thread 0x7fa638ff1700 (LWP 12290) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    22 Thread 0x7fa6387f0700 (LWP 12291) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    23 Thread 0x7fa637fef700 (LWP 12292) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    24 Thread 0x7fa6377ee700 (LWP 12293) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    25 Thread 0x7fa636fed700 (LWP 12294) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    26 Thread 0x7fa109e36700 (LWP 12295) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    27 Thread 0x7fa109635700 (LWP 12296) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    28 Thread 0x7fa108e34700 (LWP 12297) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    29 Thread 0x7fa103fff700 (LWP 12298) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    30 Thread 0x7fa1037fe700 (LWP 12299) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    31 Thread 0x7fa1016a1700 (LWP 12300) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    32 Thread 0x7fa100ea0700 (LWP 12301) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    33 Thread 0x7fa0efb53700 (LWP 12302) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    34 Thread 0x7fa0eef51700 (LWP 12303) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    35 Thread 0x7fa0edea3700 (LWP 12304) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    36 Thread 0x7fa0ed6a2700 (LWP 12305) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    37 Thread 0x7fa0ec9f5700 (LWP 12306) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    38 Thread 0x7fa0e61f7700 (LWP 12307) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    39 Thread 0x7fa0e55f5700 (LWP 12308) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    40 Thread 0x7fa0c7fff700 (LWP 12309) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    41 Thread 0x7fa0bf7fe700 (LWP 12310) "PrimProc" pthread_cond_wait@@GLIBC_2.3.2 ()
    at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
    42 Thread 0x7fa0c62a3700 (LWP 12311) "PrimProc" do_sigwait (sig=0x7fa0c62a2c4c, set=<optimized out>)
    at ../sysdeps/unix/sysv/linux/sigwait.c:64
    43 Thread 0x7fa0c51f5700 (LWP 12312) "PrimProc" 0x00007fa64ae0676d in accept ()
    at ../sysdeps/unix/syscall-template.S:84
    44 Thread 0x7fa0c49f4700 (LWP 12503) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    45 Thread 0x7fa0bffff700 (LWP 12506) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    46 Thread 0x7fa0beffd700 (LWP 12509) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    47 Thread 0x7fa0be7fc700 (LWP 12513) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    48 Thread 0x7fa0bdffb700 (LWP 12546) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    49 Thread 0x7fa0bd7fa700 (LWP 12548) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    50 Thread 0x7fa0bcff9700 (LWP 12555) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    51 Thread 0x7fa097fff700 (LWP 12584) "PrimProc" 0x00007fa64ae0651d in read ()
    at ../sysdeps/unix/syscall-template.S:84
    (gdb) c
    Continuing.
Comment by Andrew Hutchings (Inactive) [ 2017-08-18 ]

discussed on Slack, but mentioning here for tracking: that backtrace shows an idle PrimProc with no in-progress commands. Whatever caused the hang wasn't PrimProc and is unlikely to be related to the /dev/shm permissions that this ticket is for.
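One way to confirm that reading of a long "info threads" listing is to tally the innermost frame of each thread. The following is a small sketch, with the pattern ENTRY and the function tally_frames being hypothetical names. It assumes the entry layout seen in the gdb output pasted in this ticket. Threads sitting in calls such as pthread_cond_wait, nanosleep, read, accept, or do_sigwait are blocked waiting rather than executing a command.

```python
import re
from collections import Counter

# Matches the first line of one `info threads` entry, for example:
#   2 Thread 0x7f... (LWP 12270) "PrimProc" 0x00007f... in nanosleep ()
ENTRY = re.compile(
    r'^\W*\d+\s+Thread\s+\S+\s+\(LWP \d+\)\s+"[^"]*"\s+'
    r'(?:0x[0-9a-fA-F]+\s+in\s+)?([A-Za-z_][\w@.]*)'
)


def tally_frames(info_threads: str) -> Counter:
    """Count threads by the function name of their innermost frame."""
    counts = Counter()
    for line in info_threads.splitlines():
        match = ENTRY.match(line)
        if match:
            # Drop the glibc symbol-version suffix (e.g. "@@GLIBC_2.3.2").
            counts[match.group(1).split("@")[0]] += 1
    return counts
```

Passing the saved gdb output to tally_frames and printing the result gives a per-function thread count. Continuation lines such as "at ../sysdeps/..." do not match the pattern and are ignored.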

Comment by Daniel Lee (Inactive) [ 2018-08-16 ]

The issue also affected root installation. 1.0.15-1 is also affected.
