[MDEV-15488] Crashes after upgrade from 5.5.52 to 10.2.12 Created: 2018-03-06  Updated: 2021-04-06  Resolved: 2021-04-06

Status: Closed
Project: MariaDB Server
Component/s: Storage Engine - InnoDB
Affects Version/s: 10.2.12, 10.2.14
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: Doug Van Hollen Assignee: Marko Mäkelä
Resolution: Incomplete Votes: 1
Labels: None
Environment:

Linux x86_64


Attachments: File _schema_changes.frm     File corp_user_group_log.frm     File ibsqlstagea04_my.cnf     HTML File mdev15488_bt_full     Text File schema_data_20180307.log     Text File server_errorlog_ibsqlstagea04_201803061043.txt    
Issue Links:
Relates
relates to MDEV-14915 Attempted to upgrade from 5.5.52 to 1... Open
relates to MDEV-16923 10.3.8 crashes Open

 Description   

In two different environments, we have experienced periodic crashes after upgrading from 5.5.52 to 10.2.12. The first wave of investigation was covered in https://jira.mariadb.org/browse/MDEV-14915 but was written off as an upgrade procedure problem, since the crashes eventually stopped.

But now in a new environment, we have experienced three identical crashes within the first 24 hours of the upgrade, all with the same blame as before:

0x7fd2900b0700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/que/que0que.cc line 567

We need a way to prevent these crashes before we proceed with upgrading the other ten servers in our environment.

Attached is the error log detailing the upgrade at 2018-03-05 16:20, and the crashes at 180306 8:12:14, 2018-03-06 09:59:37, and 180306 10:39:04.



 Comments   
Comment by Elena Stepanova [ 2018-03-07 ]

From the error log:

2018-03-06 08:12:14 0x7fd2900b0700  InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/que/que0que.cc line 567
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
180306  8:12:14 [ERROR] mysqld got signal 6 ; This could be because you hit a bug. It is also possible that this binary or one of the libraries it was linked against is corrupt, improperly built, or misconfigured. This error can also be caused by malfunctioning hardware.
 
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
 
We will try our best to scrape up some info that will hopefully help diagnose the problem, but since we have already crashed, something is definitely wrong and this may fail.
 
Server version: 10.2.12-MariaDB-log
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=51
max_threads=502
thread_count=60
It is possible that mysqld could use up to key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1365189 K  bytes of memory Hope that's ok; if not, decrease some variables in the equation.
 
Thread pointer: 0x7fd28c3183d8
Attempting backtrace. You can use the following information to find out where mysqld died. If you see no messages after this, something went terribly wrong...
stack_bottom = 0x7fd2900afe18 thread_stack 0x49000 /home/fds/ferdb/mysql/bin/mysqld(my_print_stacktrace+0x2e)[0xde610e]
/home/fds/ferdb/mysql/bin/mysqld(handle_fatal_signal+0x471)[0x7de601]
/lib64/libpthread.so.0[0x363e80f7e0]
/lib64/libc.so.6(gsignal+0x35)[0x363e032495]
/lib64/libc.so.6(abort+0x175)[0x363e033c75]
/home/fds/ferdb/mysql/bin/mysqld[0xbafc13]
/home/fds/ferdb/mysql/bin/mysqld[0xad7f63]
/home/fds/ferdb/mysql/bin/mysqld[0xb0e9e3]
/home/fds/ferdb/mysql/bin/mysqld[0xa2b73a]
/home/fds/ferdb/mysql/bin/mysqld(_Z8closefrmP5TABLE+0x11e)[0x6a811e]
/home/fds/ferdb/mysql/bin/mysqld(_Z12tc_add_tableP3THDP5TABLE+0x333)[0x76b053]
/home/fds/ferdb/mysql/bin/mysqld(_Z10open_tableP3THDP10TABLE_LISTP18Open_table_context+0xb83)[0x59d9c3]
/home/fds/ferdb/mysql/bin/mysqld(_Z11open_tablesP3THDRK14DDL_options_stPP10TABLE_LISTPjjP19Prelocking_strategy+0xe77)[0x59f0b7]
/home/fds/ferdb/mysql/bin/mysqld(_Z20open_and_lock_tablesP3THDRK14DDL_options_stP10TABLE_LISTbjP19Prelocking_strategy+0x4b)[0x59f56b]
/home/fds/ferdb/mysql/bin/mysqld[0x5e6fe2]
/home/fds/ferdb/mysql/bin/mysqld(_Z21mysql_execute_commandP3THD+0x187c)[0x5eab4c]
/home/fds/ferdb/mysql/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x3a2)[0x5f4272]
/home/fds/ferdb/mysql/bin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcjbb+0x26f0)[0x5f7190]
/home/fds/ferdb/mysql/bin/mysqld(_Z10do_commandP3THD+0x15d)[0x5f84cd]
/home/fds/ferdb/mysql/bin/mysqld(_Z24do_handle_one_connectionP7CONNECT+0x235)[0x6dc865]
/home/fds/ferdb/mysql/bin/mysqld(handle_one_connection+0x3f)[0x6dca2f]
/home/fds/ferdb/mysql/bin/mysqld[0xa054a9]
/lib64/libpthread.so.0[0x363e807aa1]
/lib64/libc.so.6(clone+0x6d)[0x363e0e8bcd]
 
...
 
Status: NOT_KILLED

Queries from the error log which presumably caused the crash:

select 'UID:CT62258805942:',  (select SecurityName.inner_text   from    contool_utf8.tags SecurityName   where    SecurityName.submission_id = securityNode.submission_id    and SecurityName.element_id = 10156    and SecurityName.parent_id  = securityNode.node_id) as RealName,   issuerNameValue.inner_text as company_name,   (select CUSIP.inner_text   from    contool_utf8.tags securityId    inner join contool_utf8.tags CUSIPType force key (submission_id)     on CUSIPType.submission_id = securityId.submission_id     and CUSIPType.parent_id = securityId.node_id     and CUSIPType.element_id = 10151     and CUSIPType.inner_text in ('CUSIP','CINS')    inner join contool_utf8.tags CUSIP force key (submission_id)     on CUSIP.submission_id = CUSIPType.submission_id     and CUSIP.element_id = 10152     and CUSIPType.parent_id = CUSIP.parent_id   where    securityId.submission_id = issuerNode.submission_id    and securityId.element_id = 10154    and securityId.parent_id = securityNode.node_id   ) as CUSIP,   issuerPI.inner_text as primary_indicator,   estimateAction.inner_text as estimateAction,   coverageAction.inner_text as coverageAction,   targetAction.inner_text as targetAction,   ratingAction.inner_text as ratingAction,   tradingCountryCodeValue.inner_text as tradingCountry,   privateEntityID.inner_text as entityID,   tickerValue.inner_text as ticker,   (select RIC.inner_text   from    contool_utf8.tags securityId    inner join contool_utf8.tags RICType force key (submission_id)     on RICType.submission_id = securityId.submission_id     and RICType.parent_id = securityId.node_id     and RICType.element_id = 10151     and RICType.inner_text = 'RIC'    inner join contool_utf8.tags RIC force key (submission_id)     on RIC.submission_id = RICType.submission_id     and RIC.element_id = 10152     and RICType.parent_id = RIC.parent_id   where    securityId.submission_id = issuerNode.submission_id    and securityId.element_id = 10154    and securityId.parent_id = 
securityNode.node_id  ) as RIC,  (select ISIN.inner_text   from    contool_utf8.tags securityId    inner join contool_utf8.tags ISINType force key (submission_id)     on ISINType.submission_id = securityId.submission_id     and ISINType.parent_id = securityId.node_id     and ISINType.element_id = 10151     and ISINType.inner_text = 'ISIN'    inner join contool_utf8.tags ISIN force key (submission_id)     on ISIN.submission_id = ISINType.submission_id     and ISIN.element_id = 10152     and ISINType.parent_id = ISIN.parent_id   where    securityId.submission_id = issuerNode.submission_id    and securityId.element_id = 10154    and securityId.parent_id = securityNode.node_id  ) as ISIN, (select SEDOL.inner_text   from    contool_utf8.tags securityId    inner join contool_utf8.tags SEDOLType force key (submission_id)     on SEDOLType.submission_id = securityId.submission_id     and SEDOLType.parent_id = securityId.node_id     and SEDOLType.element_id = 10151    inner join contool_utf8.tags SEDOL  force key (submission_id)     on SEDOL.submission_id = SEDOLType.submission_id     and SEDOL.element_id = 10152     and SEDOLType.parent_id = SEDOL.parent_id   where    securityId.submission_id = issuerNode.submission_id    and securityId.element_id = 10154    and securityId.parent_id = securityNode.node_id    and SEDOLType.inner_text = 'SEDOL'   ) as SEDOL,  (select bloom.inner_text   from    contool_utf8.tags securityId    inner join contool_utf8.tags bloomType force key (submission_id)     on bloomType.submission_id = securityId.submission_id     and bloomType.parent_id = securityId.node_id     and bloomType.element_id = 10151     and bloomType.inner_text = 'Bloomberg'    inner join contool_utf8.tags bloom force key (submission_id)     on bloom.submission_id = bloomType.submission_id     and bloom.element_id = 10152     and bloomType.parent_id = bloom.parent_id   where    securityId.submission_id = issuerNode.submission_id    and securityId.element_id = 10154    and 
securityId.parent_id = securityNode.node_id   ) as Bloomberg ,   (select exchange.inner_text    from contool_utf8.tags  exchange    where exchange.submission_id = issuerNode.submission_id     and exchange.parent_id = tickerValue.parent_id     and exchange.element_id = 10150) as Exchange from  contool_utf8.tags issuerNode  inner join contool_utf8.tags  issuerPI   on issuerPI.submission_id = issuerNode.submission_id   and issuerPI.parent_id = issuerNode.node_id   and issuerPI.element_id = 10061  inner join contool_utf8.tags   issuerName   on issuerName.submission_id = issuerNode.submission_id   and issuerName.parent_id = issuerNode.node_id   and issuerName.element_id = 10165  inner join contool_utf8.tags issuerNameValue   on issuerNameValue.submission_id = issuerNode.submission_id   and issuerNameValue.parent_id = issuerName.node_id   and issuerNameValue.element_id = 10163  inner join contool_utf8.tags securityDetails   on securityDetails.submission_id = issuerNode.submission_id   and securityDetails.parent_id = issuerNode.node_id   and securityDetails.element_id = 10161  inner join contool_utf8.tags   securityNode   on securityNode.submission_id = issuerNode.submission_id   and securityNode.element_id = 10160   and securityNode.parent_id = securityDetails.node_id  left outer join contool_utf8.tags  estimateAction   on estimateAction.submission_id = issuerNode.submission_id   and estimateAction.element_id = 10159   and estimateAction.parent_id = securityNode.node_id  left outer join  contool_utf8.tags  coverageAction   on coverageAction.submission_id = issuerNode.submission_id   and coverageAction.element_id = 10182   and coverageAction.parent_id = securityNode.node_id  left outer join contool_utf8.tags  ratingAction   on ratingAction.submission_id = issuerNode.submission_id   and ratingAction.element_id = 10183   and ratingAction.parent_id = securityNode.node_id  left outer join  contool_utf8.tags  targetAction   on targetAction.submission_id = 
issuerNode.submission_id   and targetAction.element_id = 10158   and targetAction.parent_id = securityNode.node_id  LEFT OUTER JOIN contool_utf8.tags  publisherDefinedValue   ON publisherDefinedValue.submission_id = issuerNode.submission_id   AND publisherDefinedValue.element_id = 10002   AND publisherDefinedValue.inner_text = 'Ticker'   AND publisherDefinedValue.rgt < securityNode.rgt   AND publisherDefinedValue.lft > securityNode.lft  LEFT OUTER JOIN contool_utf8.tags  tickerNode   ON tickerNode.submission_id = issuerNode.submission_id   AND tickerNode.element_id = 10151   AND tickerNode.inner_text = 'ExchangeTicker'   AND tickerNode.rgt < securityNode.rgt   AND tickerNode.lft > securityNode.lft  LEFT OUTER JOIN contool_utf8.tags tickerValue   ON tickerValue.submission_id = issuerNode.submission_id   AND tickerValue.parent_id = IFNULL(tickerNode.parent_id,publisherDefinedValue.parent_id)   AND tickerValue.element_id = 10152  LEFT OUTER JOIN contool_utf8.tags  tradingCountryCodeValue   ON tradingCountryCodeValue.submission_id = issuerNode.submission_id   AND tradingCountryCodeValue.element_id = 10153   AND tradingCountryCodeValue.parent_id =  tickerValue.parent_id  LEFT OUTER JOIN contool_utf8.tags  privateEntityID   ON privateEntityID.submission_id = issuerNode.submission_id   AND privateEntityID.element_id = 10209   AND privateEntityID.parent_id =  tickerValue.parent_id where  issuerNode.submission_id ='201801290348364557'  and issuerNode.element_id = 10171 order by company_name

show create table `entitlements_diffs`.`_schema_changes`

show create table `entdata_utf8`.`corp_user_group_log`
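Incidentally, the memory figure in the quoted crash report comes from the simple heuristic printed alongside it. A sketch of the arithmetic, with sort_buffer_size assumed (it is not shown in the log excerpt, which is why the result differs slightly from the log's 1365189 K):

```python
# Heuristic worst-case memory estimate printed by the crash handler:
#   key_buffer_size + (read_buffer_size + sort_buffer_size) * max_threads
key_buffer_size = 268_435_456        # bytes, from the log
read_buffer_size = 131_072           # bytes, from the log
sort_buffer_size = 2 * 1024 * 1024   # bytes, ASSUMED (a common default)
max_threads = 502                    # from the log

estimate_k = (key_buffer_size
              + (read_buffer_size + sort_buffer_size) * max_threads) // 1024
print(f"mysqld could use up to {estimate_k} K bytes of memory")
```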

Comment by Elena Stepanova [ 2018-03-07 ]

You could have continued using MDEV-14915; it's still open (and even if it hadn't been, we could have re-opened it). Now we need to close one of them; which one do you prefer to keep?

Comment by Elena Stepanova [ 2018-03-07 ]

Meanwhile, could you please attach .frm files for `entitlements_diffs`.`_schema_changes` and `entdata_utf8`.`corp_user_group_log` ?
I don't dare ask you to run SHOW CREATE TABLE on them, because apparently it causes (or might cause) a crash; but if there is a point in time when you can risk it (e.g. your instance is about to be shut down, or you have a staging environment, or something similar), then the queries below might be interesting as well:

select * from information_schema.tables where table_schema = 'entitlements_diffs' and table_name = '_schema_changes';
select * from information_schema.tables where table_schema = 'entdata_utf8' and table_name = 'corp_user_group_log';
show create table `entitlements_diffs`.`_schema_changes`;
show create table `entdata_utf8`.`corp_user_group_log`;

Also, the server is apparently a slave. Do the crashes also happen on the master, or do you upgrade slaves first, and the master runs the old version so far?
Were the tables created on the master and replicated, or were they created directly on the slave?

Comment by Doug Van Hollen [ 2018-03-07 ]

1. (My colleague requested a fresh start from January; please close that other ticket.)

2. FRM files attached. _schema_changes.frm corp_user_group_log.frm

3. Results of requested SELECTs and SHOWs are attached (no crashes resulted from running them; applications are failed away from this server so we can do what we need to for this investigation). schema_data_20180307.log

4. This server is the "primary" part of a master-master pair, so while it is configured to slave from ibsqlstagea05, this server (ibsqlstagea04) is (until we failed over this morning via VIP) the only server that takes application reads and writes.

The upgrade procedure was (in a nutshell):

  • Fail over VIP to stagea05 (running 5.5)
  • Stop stagea05's slave
  • Shut down stagea04
  • Turn off slaving and replication/binlogging in stagea04 cnf (to avoid queuing upgrade statements for rep)
  • Upgrade stagea04
  • Turn back on whatever we turned off for stagea04 and restart
  • Start stagea05's slave
  • Failback VIP to set stagea04 as primary.

5. _schema_changes is recreated each day on stagea04 as part of an ETL process (this db does not replicate to stagea05). corp_user_group_log was created on stagea04 back in 2014 (like 3 upgrades ago) and replicated to stagea05 normally.

Comment by Doug Van Hollen [ 2018-03-07 ]

Also, lest we be distracted by red herrings, here is a list of all the queries fingered by crash recovery immediately after each of the (now) 7 crashes since the upgrade:

  • select 'UID:CT62258805942:', (select SecurityName.inner_text from contool_utf8.tags SecurityName where SecurityName.submission_id = securityNode.submission_id and SecurityName.element_id = 10156 and SecurityName.parent_id = securityNode.node_id) as RealName,...
  • show create table `entitlements_diffs`.`_schema_changes`
  • show create table `entdata_utf8`.`corp_user_group_log`
  • show create table `entdata_utf8`.`discount_schedule`
  • show create table `entdata_utf8`.`delayed_readership_log`
  • show create table `entitlements_diffs`.`_load_status`
  • show create table `entdata_utf8`.`delayed_readership`
  • show create table `entdata_utf8`.`page_price_range_active`

Please be aware that our DBAs use SQLyog to interact with the server, which runs thousands of "show creates" on startup.

My point being that these may be just queries that happened to have been running at the time of the crash, not necessarily "causes".

Comment by Elena Stepanova [ 2018-03-07 ]

Thanks for the data.

My point being that these may be just queries that happened to have been running at the time of the crash, not necessarily "causes".

Crash reports attempt to extract a query from the crashing thread, not just from any thread that happens to be running at the time; and usually, when they are able to find the query at all, the findings are accurate and the query has something to do with the crash. We don't see any signs of corruption in the log, so while it is not a 100% guarantee, the chances are that the stack trace and the queries are accurate.
You are right, however, in the sense that the tables in the queries have nothing to do with the problem. It seems that things go wrong when your table open cache gets full (which is likely to happen when you issue bulk SHOW CREATE TABLE statements, but also upon any other query involving tables) and the server attempts to evict a table, most likely some particular InnoDB table which didn't upgrade well or is otherwise corrupt. To test this theory, you might try raising table-open-cache significantly, if resources (file descriptors) allow, maybe even above the number of tables in your system, and see if it reduces the frequency of the crashes. It's in no way a fix or even a viable workaround, since we still need to find the actual reason, but it might buy some time for you and some information for us.

Comment by Doug Van Hollen [ 2018-03-07 ]

Very interesting; thank you for the illumination.

I have upped table-open-cache from 256 to 2048. We'll try to abuse the server a bit and I'll let you know if/when we get a crash.
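The change above would look roughly like this in the server configuration (a sketch only; the surrounding options from the attached ibsqlstagea04_my.cnf are not reproduced, and the open_files_limit value is an assumption):

```ini
[mysqld]
# Raised from 256 per the suggestion above:
table_open_cache = 2048

# Assumption: open_files_limit may also need headroom, since each
# cached table can hold file descriptors open.
# open_files_limit = 8192
```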

Comment by Doug Van Hollen [ 2018-03-08 ]

The new setting allowed us to survive some load testing, and we even failed back our staging-tier applications without incident. However a few hours later, we executed an ALTER TABLE statement on a replicated database, and the server crashed again. Below are the relevant log lines. Is this of any help?

2018-03-07 17:02:25 0x7f8cefcc9700 InnoDB: Assertion failure in file /home/buildbot/buildbot/build/storage/innobase/que/que0que.cc line 567
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to https://jira.mariadb.org/
...
Server version: 10.2.12-MariaDB-log
key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=30
max_threads=502
thread_count=62
...
stack_bottom = 0x7f8cefcc8e28 thread_stack 0x49000
/home/fds/ferdb/mysql/bin/mysqld(my_print_stacktrace+0x2e)[0xde610e]
/home/fds/ferdb/mysql/bin/mysqld(handle_fatal_signal+0x471)[0x7de601]
/lib64/libpthread.so.0[0x363e80f7e0]
/lib64/libc.so.6(gsignal+0x35)[0x363e032495]
/lib64/libc.so.6(abort+0x175)[0x363e033c75]
/home/fds/ferdb/mysql/bin/mysqld[0xbafc13]
/home/fds/ferdb/mysql/bin/mysqld[0xad7f63]
/home/fds/ferdb/mysql/bin/mysqld[0xb0e9e3]
/home/fds/ferdb/mysql/bin/mysqld[0xa2b73a]
/home/fds/ferdb/mysql/bin/mysqld(_Z8closefrmP5TABLE+0x11e)[0x6a811e]
/home/fds/ferdb/mysql/bin/mysqld(_Z16tdc_remove_tableP3THD26enum_tdc_remove_table_typePKcS3_b+0x3c1)[0x76b4e1]
/home/fds/ferdb/mysql/bin/mysqld(_Z24wait_while_table_is_usedP3THDP5TABLE17ha_extra_function+0x82)[0x594822]
/home/fds/ferdb/mysql/bin/mysqld(_Z17mysql_alter_tableP3THDPcS1_P14HA_CREATE_INFOP10TABLE_LISTP10Alter_infojP8st_orderb+0x392e)[0x68906e]
/home/fds/ferdb/mysql/bin/mysqld(_ZN19Sql_cmd_alter_table7executeEP3THD+0x5bb)[0x6e0d9b]
/home/fds/ferdb/mysql/bin/mysqld(_Z21mysql_execute_commandP3THD+0x173a)[0x5eaa0a]
/home/fds/ferdb/mysql/bin/mysqld(_Z11mysql_parseP3THDPcjP12Parser_statebb+0x3a2)[0x5f4272]
/home/fds/ferdb/mysql/bin/mysqld(_ZN15Query_log_event14do_apply_eventEP14rpl_group_infoPKcj+0x6fd)[0x91ef1d]
/home/fds/ferdb/mysql/bin/mysqld[0x562993]
/home/fds/ferdb/mysql/bin/mysqld[0x570096]
/home/fds/ferdb/mysql/bin/mysqld(handle_slave_sql+0x1487)[0x571cd7]
/home/fds/ferdb/mysql/bin/mysqld[0xa054a9]
/lib64/libpthread.so.0[0x363e807aa1]
/lib64/libc.so.6(clone+0x6d)[0x363e0e8bcd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f8cc4169989): Alter table `Readership`.`RealTimeResearchFeeds` change `ProcessStatus` `ProcessStatus` tinyint(1) DEFAULT 0 NULL
Connection ID (thread ID): 11
Status: NOT_KILLED

Comment by Elena Stepanova [ 2018-03-08 ]

Thanks for the update.

It somewhat confirms the theory (which was pretty straightforward anyway) that the problem occurs when a table is removed from the cache. Unfortunately, increasing table_open_cache is expected to be just a temporary and weak band-aid, not a cure. The value is server-wide, while every connection opens tables independently, so it's just a matter of time before enough concurrent connections open enough tables simultaneously. Besides, statements like FLUSH TABLES are likely to be fatal, since they close tables regardless of the size of the cache.

We need to find the root cause – either something is wrong with a particular table, or some specific workflow opens (or keeps open) a table in such a way that closing it causes the failure. The first wild theory in our ranks was about memory corruption, but I doubt it, because the failure is always the same, always very specific, and always happens in the same situation.

You mentioned earlier that it's currently possible to do all kinds of experimenting on the instance. Would it be possible, for example, to run a debug binary there for one session till the next crash, and configure it to produce a coredump? There is a chance it will be enough to figure out what's happening.

Comment by Doug Van Hollen [ 2018-03-08 ]

We have some sort of core dump from each of our last two crashes, but they're each 13G; not sure how to get it to you.

I'm happy to run a "debug binary", but I'm not sure what that is, so I'll need some instructions.

Comment by Elena Stepanova [ 2018-03-08 ]

Let's try your coredump first. Could you please run

gdb --batch --eval-command="thread apply all bt full" <path to mysqld> <path to coredump> > mdev15488_bt_full

It will produce a text file of reasonable size, probably within 1 MB, which you'll be able to attach here. I'm not sure that it will contain all the information we need, but at least we'll be able to see whether it makes sense to fetch your coredump and work with it, or whether we indeed need a debug one. In the latter case, I can build and provide you a binary or a package, or point you at instructions for building one locally, whichever you prefer. Alternatively, if you have some sort of "staging" machine where the problem occurs and can give us access to it, we can do it there ourselves.

Comment by Doug Van Hollen [ 2018-03-08 ]

Core digest attached mdev15488_bt_full

(Of your options, we would prefer to use a binary you provide, if it comes to that.)

Comment by Elena Stepanova [ 2018-03-08 ]

Which packages do you normally use? (OS, version, architecture, type of packages – rpm/deb vs bintar)

Comment by Doug Van Hollen [ 2018-03-09 ]

We use the Linux x86_64 bintar, e.g. the one from https://downloads.mariadb.com/MariaDB/mariadb-10.2.12/bintar-linux-x86_64/mariadb-10.2.12-linux-x86_64.tar.gz

Comment by Elena Stepanova [ 2018-03-15 ]

marko, is there anything helpful you can gather from mdev15488_bt_full that might point at the reason for this problem?

dvanhollen, sorry for the delay; I've had a bit of trouble building a debug version of this particular package. It's a legacy build, compiled on an ancient CentOS 5 and provided for compatibility reasons. If you specify which system you actually use, I'll build on the same one.

Comment by Doug Van Hollen [ 2018-03-16 ]

@Elena Stepanova, we use Red Hat 6.9.

Comment by Elena Stepanova [ 2018-03-16 ]

Okay, I've built a debug bintar which will hopefully work for you: ftp://ftp.askmonty.org/public/mdev15488/mariadb-10.2.12-linux-x86_64.tar.gz
Please let me know if it doesn't work, there can still be some problems with dependencies.

Please enable coredumps for the server; here are the instructions:
https://mariadb.com/kb/en/library/how-to-produce-a-full-stack-trace-for-mysqld/

When it crashes again, you can upload the compressed coredump to ftp.askmonty.org/private.

Thanks.
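A minimal sketch of the coredump setup the KB article above describes (the option and commands are standard, but the paths and values below are examples, not taken from this server):

```ini
[mysqld]
# Ask mysqld to write a core file on a crash:
core-file

# Before starting mysqld, the shell or init script also needs:
#   ulimit -c unlimited
# and kernel.core_pattern should point at a writable location, e.g.:
#   sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
```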

Comment by Doug Van Hollen [ 2018-03-19 ]

Unfortunately this binary crashed without starting, with a different message from before (tried it twice before rolling back):

2018-03-19 9:06:30 139642722227968 [Note] InnoDB: If the mysqld execution user is authorized, page cleaner thread priority can be changed. See the man page of setpriority().
2018-03-19 9:06:30 139655833720800 [Note] InnoDB: Highest supported file format is Barracuda.
mysqld: /home/buildbot/buildbot/build/storage/innobase/dict/dict0load.cc:1444: ulint dict_check_sys_tables(bool): Assertion `!((flags & ((~(~0U << 1)) << 0)) >> 0) || flags2 & 16U' failed.
180319 9:06:30 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
...

I've uploaded a core just for completeness. Let me know what else you need.
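As an aside, the bit expression in that dict_check_sys_tables assertion simplifies to a check on two flag bits. A sketch in Python (the simplification follows from the expression itself; what the bits mean inside InnoDB is not asserted here):

```python
# Failed assertion: !((flags & ((~(~0U << 1)) << 0)) >> 0) || flags2 & 16U
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned arithmetic

# (~(~0U << 1)) << 0 reduces to a mask of bit 0:
mask = (~((MASK32 << 1) & MASK32)) & MASK32

def assertion_holds(flags: int, flags2: int) -> bool:
    """True when the assertion passes: if bit 0 of `flags` is set,
    then bit 4 (value 16) of `flags2` must be set as well."""
    return not (flags & mask) or bool(flags2 & 16)
```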

Comment by Elena Stepanova [ 2018-03-19 ]

Thanks.
This last one is MDEV-15507.
I'll pass it over to marko to decide how to approach/debug it further, if we can't use the debug binary.

Comment by Doug Van Hollen [ 2018-04-30 ]

Just an update that this also occurs when in-place upgrading to 10.2.14 (same file mentioned in error message, line 563).

Comment by Marko Mäkelä [ 2018-04-30 ]

dvanhollen, can you try a debug version of a 10.2 development snapshot? That would contain a fix of MDEV-15507.

Comment by Doug Van Hollen [ 2018-05-17 ]

@Marko Sorry I dragged my feet and now that link is dead.

I can't commit to a timeline for delivery but I can at least grab the binaries to be tested. Any specific dev snapshot, or just the latest?

Comment by Marko Mäkelä [ 2018-05-18 ]

dvanhollen, you could as well try 10.2.15, which was just released yesterday.

Comment by Doug Van Hollen [ 2018-05-21 ]

@Marko Mäkelä Okay; just to confirm, we're no longer looking to obtain debugging info to help us investigate these crashes; now our hypothesis is that the release form of 10.2.15 may have solved our problem, right?

Comment by Marko Mäkelä [ 2018-05-23 ]

dvanhollen, I cannot be certain that the problem that you experienced was MDEV-15507. If the non-debug version of 10.2.15 does the upgrade fine, then there probably is no need to try out the debug version.

Comment by Marko Mäkelä [ 2018-12-31 ]

dvanhollen, was an upgrade to 10.2.15 or later helpful, or are you still facing this issue?

Comment by Doug Van Hollen [ 2019-01-02 ]

@marko Due to the expense of the project overall, we were forced to temporarily unresource this project back in May. This is back on our radar for 2019. Please stand by.

Comment by Marko Mäkelä [ 2019-09-30 ]

dvanhollen, did you get back to the upgrade project?

Comment by Doug Van Hollen [ 2019-09-30 ]

@Marko We have abandoned the idea of an in-place upgrade and instead are focused on a cloud transition, which would circumvent this problem. I'm afraid we can't resource additional R&D at this time.

Comment by Marko Mäkelä [ 2021-04-06 ]

The requested data was never provided. It is possible (but not certain) that this problem was addressed in MDEV-15507.

Comment by Marko Mäkelä [ 2021-04-06 ]

Another possibility is that this was fixed by MDEV-12023.

Generated at Thu Feb 08 08:21:42 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.