[MDEV-5901] EITS: killing the server leaves statistical tables in "marked as crashed" state Created: 2014-03-19  Updated: 2014-03-19  Resolved: 2014-03-19

Status: Closed
Project: MariaDB Server
Component/s: None
Affects Version/s: 10.0.9
Fix Version/s: 10.0.10

Type: Bug Priority: Major
Reporter: Sergei Petrunia Assignee: Sergei Petrunia
Resolution: Fixed Votes: 0
Labels: eits


 Description   

If one performs the following sequence of operations:

  • perform some action that updates statistical tables (e.g. ANALYZE TABLE ... PERSISTENT FOR ALL)
  • kill the server
  • start the server again

then any action that attempts to read from EITS tables will no longer be able to open them. Opening the table fails with a "table marked as crashed" error.

This task is about making EITS tables more resilient to this scenario.

There are two things to be done:
1. Flush statistical tables to disk as soon as any modification is made (similar to what is done for mysql.proc).
2. Enable auto-repair for statistical tables, as happens with regular MyISAM tables.



 Comments   
Comment by Sergei Petrunia [ 2014-03-19 ]

Hint from Monty: check out the code in sp.cc:

    if (table->file->ha_write_row(table->record[0]))
      ret= SP_WRITE_ROW_FAILED;
    /* Make change permanent and avoid 'table is marked as crashed' errors */
    table->file->extra(HA_EXTRA_FLUSH);

Note the HA_EXTRA_FLUSH call. We will need to add the same call after writes to EITS tables.
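To illustrate why flushing right after the write helps, here is a toy model, not the server's code. It assumes the "crashed" mark comes from an "in use" counter stored in the table header that is only cleared by a clean close or a flush. ToyTable, its members, and crashed_after_kill() are hypothetical names made up for this sketch; flush() stands in for table->file->extra(HA_EXTRA_FLUSH).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a MyISAM-like table. All names are hypothetical.
struct ToyTable {
  // State that survives a server kill ("on disk").
  std::vector<std::string> disk_rows;
  int disk_open_count = 0;  // nonzero means "not closed properly"

  // State that is lost on a kill ("in memory").
  std::vector<std::string> cached_rows;
  bool changed = false;

  // The first modification marks the on-disk header as "in use".
  void write_row(const std::string &row) {
    if (!changed) {
      changed = true;
      ++disk_open_count;
    }
    cached_rows.push_back(row);
  }

  // Analogue of extra(HA_EXTRA_FLUSH): persist rows, clear the in-use mark.
  void flush() {
    for (const std::string &r : cached_rows) disk_rows.push_back(r);
    cached_rows.clear();
    if (changed) {
      --disk_open_count;
      changed = false;
    }
    assert(disk_open_count >= 0);
  }

  // Simulate killing the server: memory vanishes, disk state stays.
  void kill_server() {
    cached_rows.clear();
    changed = false;
  }

  // On the next open, a leftover in-use mark is reported as "crashed".
  bool open_reports_crashed() const { return disk_open_count > 0; }
};

// Write one row, optionally flush, kill the server, and report
// whether the next open would see the table as crashed.
bool crashed_after_kill(bool flush_after_write) {
  ToyTable t;
  t.write_row("stat row");
  if (flush_after_write) t.flush();
  t.kill_server();
  return t.open_reports_crashed();
}
```

In this model, crashed_after_kill(false) reports a crashed table while crashed_after_kill(true) does not, which is the effect item 1 of the description is after.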

Comment by Sergei Petrunia [ 2014-03-19 ]

I'm also trying to investigate what is needed for auto-repair.

1. Auto-repair doesn't work for the mysql.proc table.
I run a CREATE PROCEDURE ... and kill the server right after ha_myisam::write_row(). If I then restart the server and attempt to use mysql.proc again, I get:

mysql> create procedure p4() begin select now(); end //
ERROR 145 (HY000): Table './mysql/proc' is marked as crashed and should be repaired

2. Auto-repair does work for regular tables.

mysql> insert into t21 values (2);
ERROR 2013 (HY000): Lost connection to MySQL server during query  
^^ -- I intentionally kill the server
 
mysql> select * from t21;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: test
 
+------+
| a    |
+------+
|    1 |
|    2 |
+------+
2 rows in set, 7 warnings (18 min 49.03 sec)
 
mysql> show warnings\G
Message: Table './test/t21' is marked as crashed and should be repaired
Message: Table 't21' is marked as crashed and should be repaired
Message: 1 client is using or hasn't closed the table properly
Message: Size of datafile is: 14       Should be: 7
Message: Record-count is not ok; is 2   Should be: 1
Message: Found 2 key parts. Should be: 1
Message: Number of rows changed from 1 to 2
7 rows in set (0.00 sec)

Code-wise, auto-repair happens in open_table() and open_tables(). In open_table(), there is this code:

      else if (share->crashed)
        (void) ot_ctx->request_backoff_action(Open_table_context::OT_REPAIR,
                                              table_list);

and open_tables() has:

      error= open_and_process_table(thd, thd->lex, tables, counter,
                                    flags, prelocking_strategy,
                                    has_prelocking_list, &ot_ctx,
                                    &new_frm_mem);
 
      if (error)
      {
        if (ot_ctx.can_recover_from_failed_open())

Comment by Sergei Petrunia [ 2014-03-19 ]

When I try debugging a failure to open a statistical table, I see a difference in this call:

Open_table_context::request_backoff_action (this=0x7ffff7e9ede0, action_arg=Open_table_context::OT_REPAIR,

Here,

(gdb) print action_arg
$312 = Open_table_context::OT_REPAIR
(gdb) print m_has_locks
$313 = true

and because of that we don't take any action.

Comment by Sergei Petrunia [ 2014-03-19 ]

The reason is that open_and_lock_tables() is structured like this:

  if (open_tables(thd, &tables, &counter, flags, prelocking_strategy))
    goto err;
  ...
  if (lock_tables(thd, tables, counter, flags))
    goto err;
 
  (void) read_statistics_for_tables_if_needed(thd, tables);

Statistical tables are opened after the regular tables have been opened and locked (I wonder why we can't open them at the same time). Because the statement already holds locks at that point (m_has_locks is true), the deadlock prevention logic refuses the repair.
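The interaction can be sketched with a toy model. This is a simplification under assumptions, not the real Open_table_context: the actual condition inside request_backoff_action() is not quoted in this ticket, so the sketch simply assumes that a repair request is refused whenever locks are already held. ToyOpenContext, Action, and crashed_table_gets_repaired() are hypothetical names.

```cpp
#include <cassert>

// Hypothetical, reduced set of back-off actions.
enum class Action { NoAction, Repair };

// Toy stand-in for Open_table_context; only the part relevant here.
struct ToyOpenContext {
  bool has_locks;  // plays the role of m_has_locks from the gdb session
  Action pending = Action::NoAction;

  explicit ToyOpenContext(bool locks_held) : has_locks(locks_held) {}

  // Returns true on failure, following the server's convention.
  // Assumption: with locks already held, the request is refused,
  // because backing off would mean releasing those locks.
  bool request_backoff_action(Action a) {
    if (has_locks) return true;
    pending = a;
    return false;
  }

  // Recovery is possible only if a back-off action was recorded.
  bool can_recover_from_failed_open() const {
    return pending != Action::NoAction;
  }
};

// Open a crashed table, given whether the statement already holds
// table locks, and report whether a repair would be attempted.
bool crashed_table_gets_repaired(bool locks_already_held) {
  ToyOpenContext ctx(locks_already_held);
  (void) ctx.request_backoff_action(Action::Repair);
  return ctx.can_recover_from_failed_open();
}
```

A regular table opened by open_tables() corresponds to crashed_table_gets_repaired(false), where the repair is scheduled. A statistical table opened from read_statistics_for_tables_if_needed(), after lock_tables(), corresponds to crashed_table_gets_repaired(true), where no action is taken.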

Comment by Sergei Petrunia [ 2014-03-19 ]

If I force execution in Open_table_context::request_backoff_action() to allow repair, I hit the assertion

thd->mdl_context.is_lock_owner(MDL_key::TABLE, table->s->db.str, table->s->table_name.str, MDL_SHARED)

in close_thread_table() for table test.t10.

Comment by Sergei Petrunia [ 2014-03-19 ]

It seems that auto-repair (item #2) is difficult to do. For now, I will only implement flushing (item #1).
