[MDEV-5901] EITS: killing the server leaves statistical tables in "marked as crashed" state

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 10.0.9
Fix Version/s: 10.0.10
Component/s: None
Labels:
- eits

Description

If one performs the following sequence of operations:

1. Perform some action that updates the statistical tables (e.g. ANALYZE TABLE ... PERSISTENT FOR ALL).
2. Kill the server.
3. Start the server again.

then any action that attempts to read from the EITS tables will no longer be able to open them. Opening a table fails with a "table marked as crashed" error.

This task is about making the EITS tables more resilient to this scenario.

There are two things to be done:
1. Flush a statistical table to disk as soon as it has been modified (similar to what is done for mysql.proc).
2. Enable auto-repair for statistical tables, as happens with regular MyISAM tables.

Activity

Sergei Petrunia added a comment - 2014-03-19 13:30

Hint from Monty: check out the code in sp.cc:

    if (table->file->ha_write_row(table->record[0]))
      ret= SP_WRITE_ROW_FAILED;
    /* Make change permanent and avoid 'table is marked as crashed' errors */
    table->file->extra(HA_EXTRA_FLUSH);

Note the HA_EXTRA_FLUSH call. We will need to add it to EITS tables.
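
Why the flush helps can be illustrated with a toy model. The sketch below is not MyISAM code: `DiskHeader`, `ToyTable`, `marked_as_crashed` and `crashed_after_kill` are hypothetical names, and it assumes the crash mark comes from a persisted count of writers that modified the table without flushing or closing it (compare the "1 client is using or hasn't closed the table properly" warning MyISAM reports for such tables). Under that assumption, flushing right after the write brings the count back to zero before the server can be killed.

```cpp
#include <cassert>

// Toy model only, NOT the real MyISAM implementation.
// The "disk" header keeps a count of writers that changed the table
// and have not yet flushed or closed it. It survives a server kill.
struct DiskHeader {
  int open_count = 0;
};

struct ToyTable {
  DiskHeader &disk;    // persisted state
  bool dirty = false;  // in-memory state, lost when the server is killed

  explicit ToyTable(DiskHeader &d) : disk(d) {}

  // Stand-in for ha_write_row(): the first write after open or flush
  // bumps the persisted counter.
  void write_row() {
    if (!dirty) {
      ++disk.open_count;
      dirty = true;
    }
  }

  // Stand-in for extra(HA_EXTRA_FLUSH): make the change permanent and
  // drop this writer from the persisted counter.
  void flush() {
    if (dirty) {
      assert(disk.open_count > 0);
      --disk.open_count;
      dirty = false;
    }
  }
};

// What the next server start would conclude from the persisted header.
bool marked_as_crashed(const DiskHeader &d) { return d.open_count != 0; }

// Write one row, optionally flush, then "kill the server" (no orderly
// close happens) and report what the restarted server would see.
bool crashed_after_kill(bool flush_after_write) {
  DiskHeader disk;
  {
    ToyTable t(disk);
    t.write_row();
    if (flush_after_write)
      t.flush();
    // Server is killed here: the table is never closed.
  }
  return marked_as_crashed(disk);
}
```

In this model `crashed_after_kill(false)` reports a crashed table, while `crashed_after_kill(true)` does not, which is the behaviour the flush in sp.cc is meant to achieve.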

Sergei Petrunia added a comment - 2014-03-19 13:40

I'm also trying to investigate what is needed for auto-repair.

1. Auto-repair doesn't work for the mysql.proc table.
I run a CREATE PROCEDURE ... and kill the server right after ha_myisam::write_row(). If I restart the server and attempt to use mysql.proc again, I get:

mysql> create procedure p4() begin select now(); end //
ERROR 145 (HY000): Table './mysql/proc' is marked as crashed and should be repaired

2. Auto-repair does work for regular tables.

mysql> insert into t21 values (2);
ERROR 2013 (HY000): Lost connection to MySQL server during query

^^ -- I intentionally kill the server

mysql> select * from t21;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    3
Current database: test

+------+
| a    |
+------+
|    1 |
|    2 |
+------+
2 rows in set, 7 warnings (18 min 49.03 sec)

mysql> show warnings\G
Message: Table './test/t21' is marked as crashed and should be repaired
Message: Table 't21' is marked as crashed and should be repaired
Message: 1 client is using or hasn't closed the table properly
Message: Size of datafile is: 14       Should be: 7
Message: Record-count is not ok; is 2   Should be: 1
Message: Found 2 key parts. Should be: 1
Message: Number of rows changed from 1 to 2
7 rows in set (0.00 sec)

Code-wise, auto-repair happens in open_table() and open_tables(). In open_table, there is this code:

      else if (share->crashed)
        (void) ot_ctx->request_backoff_action(Open_table_context::OT_REPAIR,
                                              table_list);

and open_tables() has:

      error= open_and_process_table(thd, thd->lex, tables, counter,
                                    flags, prelocking_strategy,
                                    has_prelocking_list, &ot_ctx,
                                    &new_frm_mem);
      if (error)
        if (ot_ctx.can_recover_from_failed_open())

Sergei Petrunia added a comment - 2014-03-19 13:45

When I try debugging a failure to open a statistical table, I see a difference in this call:

Open_table_context::request_backoff_action (this=0x7ffff7e9ede0, action_arg=Open_table_context::OT_REPAIR,

Here,

(gdb) print action_arg
$312 = Open_table_context::OT_REPAIR
(gdb) print m_has_locks
$313 = true

and because of that we don't take any action.

Sergei Petrunia added a comment - 2014-03-19 14:02

The reason is that open_and_lock_tables() is structured like this:

  if (open_tables(thd, &tables, &counter, flags, prelocking_strategy))
    goto err;
...
  if (lock_tables(thd, tables, counter, flags))
    goto err;
  (void) read_statistics_for_tables_if_needed(thd, tables);

Statistical tables are opened after the regular tables have been opened and locked (I wonder why we can't open them at the same time). Because of that, the deadlock prevention logic prevents repair.
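
The effect of this ordering on repair can be sketched with a small model. It is an illustration only: `ToyOpenContext`, `request_repair` and `repair_possible` are hypothetical stand-ins, and the only behaviour modelled is the one seen in the debugger, namely that a repair request is ignored when `m_has_locks` is true.

```cpp
#include <cassert>

enum class Action { None, Repair };

// Hypothetical stand-in for Open_table_context. It models a single rule:
// once the statement holds table locks, a repair request is refused.
struct ToyOpenContext {
  bool has_locks;   // plays the role of m_has_locks
  Action pending;   // back-off action scheduled for the failed open

  explicit ToyOpenContext(bool locks)
      : has_locks(locks), pending(Action::None) {}

  // Returns true when the repair request was accepted.
  bool request_repair() {
    if (has_locks)
      return false;  // repairing now could deadlock, so do nothing
    pending = Action::Repair;
    return true;
  }

  bool can_recover_from_failed_open() const {
    return pending != Action::None;
  }
};

// A crashed table is found while opening. Can the open be retried
// after a repair, given whether other tables are already locked?
bool repair_possible(bool tables_already_locked) {
  ToyOpenContext ctx(tables_already_locked);
  bool accepted = ctx.request_repair();
  assert(accepted == ctx.can_recover_from_failed_open());
  return accepted;
}
```

Here `repair_possible(false)` corresponds to a regular table opened by open_tables() before any lock is taken, and `repair_possible(true)` to a statistical table opened after lock_tables() has run.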

Sergei Petrunia added a comment - 2014-03-19 14:21

If I force execution in Open_table_context::request_backoff_action to allow repair, I get an assertion failure on

thd->mdl_context.is_lock_owner(MDL_key::TABLE, table->s->db.str, table->s->table_name.str, MDL_SHARED)

in close_thread_table() for table test.t10.

Sergei Petrunia added a comment - 2014-03-19 14:22

It seems that auto-repair (item #2) is difficult to do. For now, I will only implement flushing (item #1).

People

Assignee: Sergei Petrunia

Reporter: Sergei Petrunia

Votes: 0

Watchers: 2

Dates

Created: 2014-03-19 12:20

Updated: 2014-03-19 16:05

Resolved: 2014-03-19 14:38

MariaDB Server