[MDEV-27180] Fully atomic partitioning DDL operations - Jira

Details

Type: New Feature
Status: In Review (View Workflow)
Priority: Critical
Resolution: Unresolved
Fix Version/s: None
Component/s: Partitioning, Server
Labels:
None

Description

Atomic DDL for partitioning operations covers crash-safety but it does not cover failures. F.ex. if error happens during the rename or drop of partitions the alter operation does not return the table to its original state. More details what states are possible after failed alter operation are described in handle_alter_part_error(). Let's take a couple of exerpts from it:

Manual drop of remaining partitions is required
The frm file is in an unknown state, and a backup is required.

Implementation of full atomicity for partitioning DDL takes into consideration the following modifications:

1. For partition drop when it is already in storage engine use DDL_LOG_DROP_TABLE_ACTION instead of DDL_LOG_DELETE_ACTION.

2. frm_action flag based on handler_name equal to reg_ext should not be used. Instead ddl_log_entry->name should contain filename with extension. So if .par file delete required we use second DDL_LOG_DELETE_ACTION.

That is: DDL_LOG_DELETE_ACTION is used only for file delete (maybe rename to DDL_LOG_RM_ACTION). For table drop DDL_LOG_DROP_TABLE_ACTION must be used.

3. Similarly, use DDL_LOG_RENAME_TABLE_ACTION instead of DDL_LOG_RENAME_ACTION. For partition rename update straight to DDL_RENAME_PHASE_TABLE because no triggers and stat processing is needed.

For DDL_LOG_REPLACE_ACTION log DDL_LOG_DROP_TABLE_ACTION first and then DDL_LOG_RENAME_TABLE_ACTION.

That is: DDL_LOG_RENAME_ACTION is used only for file rename. DDL_LOG_REPLACE_ACTION is deprecated. For file replace log DDL_LOG_RM_ACTION first.

4. mysql_drop_partitions(), mysql_rename_partitions() are not needed. This can be done via log replay. This refactoring deprecates partition_element::log_entry data.

5. Log replay must handle errors and should be crash-safe and atomic itself. That brings a couple of new modes for ddl_log_revert(): DDL_LOG_ERR_REPORT, DDL_LOG_ERR_ROLLBACK. In DDL_LOG_ERR_REPORT mode it fails log replay at the first encountered error and prints the error message. In DDL_LOG_ERR_ROLLBACK mode it also reverts back all the revertible changes it done before the error happens.

That is: DDL_LOG_ERR_ROLLBACK means DDL log replay writes new DDL log chain which covers the atomicity during the replay.

6. Implement the rest of ~~MDEV-22166~~ minor refactorings (see todo.txt).

P. 2. adds new feature of removing arbitrary files via DDL log (and not just .frm and .par files). This feature is required for MDEV-16417 atomicity. It was already implemented there as a quick hack, now is the time for unified solution.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

todo.txt
2021-12-06 19:34
6 kB
Aleksey Midenkov

Issue Links

blocks

MDEV-16417 Store Foreign Key metadata outside of InnoDB

In Review

MDEV-29558 Crash recovery: SHOW KEYS .. harvests 1696: Failed to read from the .par file

Open

MDEV-31212 Range partition: allow ADD PARTITION into the middle of ranges list

Open

MDEV-31213 CONVERT TABLE TO PARTITION: allow copy rows to avoid partition restrictions

Open

is blocked by

MDEV-25292 Atomic CREATE OR REPLACE TABLE

Stalled

is duplicated by

MDEV-27618 Atomic DDL is not very atomic on partitioned tables

Confirmed

relates to

MDEV-22165 CONVERT TABLE: move in partition from existing table

Closed

MDEV-22166 CONVERT PARTITION: move out partition into a table

Closed

MDEV-27618 Atomic DDL is not very atomic on partitioned tables

Confirmed

MDEV-29566 Failed to read from the .par file upon concurrent DDL/SELECT with partition pruning

Stalled

MDEV-22168 Supporting multiple engines with table partitioning

Stalled

MDEV-28801 Partitioning CONVERT: implement data copy

Open

MDEV-29558 Crash recovery: SHOW KEYS .. harvests 1696: Failed to read from the .par file

Open

(1 is duplicated by, 7 relates to)

Activity

Ascending order - Click to sort in descending order

Aleksey Midenkov added a comment - 2022-03-02 07:03 - edited

You can see that if you run

mtr --parts.partition_debug

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019

parts.partition_debug                   [ fail ] Found warnings/errors in server log file!

       Test ended at 2022-03-01 21:05:41

line

Warning: Table: ./test/t1#P#p2 is open on rename new table2

Warning: Table: ./test/t1#P#p2 is open on rename new table2

This is because in you call ddl_log_revert (), which does a rename, when the table is open. This is not allowed and could cause crashes and lost data for some engines.

The reason for the error is this code in sql_partition.cc:

    if (mysql_write_frm(lpt, WFRM_WRITE_CONVERTED_TO) ||

        ERROR_INJECT("convert_partition_1") ||

        write_log_drop_shadow_frm(lpt) ||

        ERROR_INJECT("convert_partition_2") ||

        mysql_write_frm(lpt, WFRM_WRITE_SHADOW) ||

        ERROR_INJECT("convert_partition_3") ||

        wait_while_table_is_used(thd, table, HA_EXTRA_NOT_USED) ||

        ERROR_INJECT("convert_partition_4") ||

        write_log_convert_partition(lpt) ||

        ERROR_INJECT("convert_partition_5") ||

        alter_close_table(lpt) ||

If we fail at convert_partition_5, then the table will not be closed and we have a request in the ddl log to do the rename.

Suggested fixes:

Don't have an error injection after write_log_convert_partition().

Set a variable before write_log_convert_partition(lpt) that is cleared after alter_close_table(). In the error handler, call alter_close_table() if the variable is set.

Aleksey Midenkov added a comment - 2022-03-02 07:03 - edited You can see that if you run mtr --parts.partition_debug worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 parts.partition_debug [ fail ] Found warnings/errors in server log file! Test ended at 2022-03-01 21:05:41 line Warning: Table: ./test/t1#P#p2 is open on rename new table2 Warning: Table: ./test/t1#P#p2 is open on rename new table2 This is because in you call ddl_log_revert (), which does a rename, when the table is open. This is not allowed and could cause crashes and lost data for some engines. The reason for the error is this code in sql_partition.cc: if (mysql_write_frm(lpt, WFRM_WRITE_CONVERTED_TO) || ERROR_INJECT("convert_partition_1") || write_log_drop_shadow_frm(lpt) || ERROR_INJECT("convert_partition_2") || mysql_write_frm(lpt, WFRM_WRITE_SHADOW) || ERROR_INJECT("convert_partition_3") || wait_while_table_is_used(thd, table, HA_EXTRA_NOT_USED) || ERROR_INJECT("convert_partition_4") || write_log_convert_partition(lpt) || ERROR_INJECT("convert_partition_5") || alter_close_table(lpt) || If we fail at convert_partition_5, then the table will not be closed and we have a request in the ddl log to do the rename. Suggested fixes: Don't have an error injection after write_log_convert_partition(). Set a variable before write_log_convert_partition(lpt) that is cleared after alter_close_table(). In the error handler, call alter_close_table() if the variable is set.

Aleksey Midenkov added a comment - 2022-03-02 22:08 - edited

@@ -1287,7 +1294,8 @@ static void rename_in_stat_tables(THD *thd, DDL_LOG_ENTRY *ddl_log_entry,

*/

 static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,

-                                  DDL_LOG_ENTRY *ddl_log_entry)

+                                  DDL_LOG_ENTRY *ddl_log_entry,

+                                  bool report_error)

Please add a function comment about the meaning of the above parameters

and also explain when report_error is set.

> @@ -1320,7 +1328,8 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,

>      DBUG_RETURN(FALSE);

>    handler_name=    ddl_log_entry->handler_name;

> -  thd->push_internal_handler(&no_such_table_handler);

> +  if (!report_error)

> +    thd->push_internal_handler(&no_such_table_handler);

Add a comment before the 'if' (or function comment), something like:

/*

   If report_error is set we want the user to see all errors as we expect no

   errors. If report_error is false, we are in recovery and we should only

   print unexpected errors, not errors for trying to delete files that may

   not exists.

*/

> @@ -1459,11 +1472,11 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,

>      /* fall through */

>      case DDL_RENAME_PHASE_TABLE:

>        /* Restore frm and table to original names */

> -      execute_rename_table(ddl_log_entry, file,

> -                           &ddl_log_entry->db, &ddl_log_entry->name,

> -                           &ddl_log_entry->from_db, &ddl_log_entry->from_name,

> -                           0,

> -                           from_path, to_path);

> +      flags= report_error ? FN_FROM_IS_TMP : 0;

The above needs a comment. It should explain why 'flag' is depending

on report_error and not ddl_log_entry->flags!

I don't understand why the execution should be different runtime

compared to during startup recovery.

>        if (increment_phase(entry_pos))

>          break;

>        /* Fall through */

> @@ -1765,31 +1785,6 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,

>      }

>      strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS);

>      mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));

We should probably set error if the above fails as we may not have a

error handler to catch this anymore.

> @@ -2296,11 +2291,27 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,

>    }

>  end:

> +  if (report_error)

> +  {

> +    if (error && file)

> +    {

> +      TABLE_SHARE share;

  Better to bzero the above, just in case.

> +      share.db=               ddl_log_entry->db;

> +      share.table_name=       ddl_log_entry->name;

> +      share.normalized_path=  ddl_log_entry->tmp_name;

> +      /* TODO: make TABLE_SHARE-independent handler::print_error()? */

Cannot be done as the lower level may not know the user visible table name.

However, we could create a helper function to do the error printing when

we only have a file pointer (if we have multiple cases when this is needed).

> +      file->change_table_ptr(NULL, &share);

> +      file->print_error(error, MYF(0));

Shouldn't some errors already have been reported?

At least errors from mysql_file_delete() are already reported and should not

be reported again.

For example, in your patch you set error to -1, which should not be sent to

print_error.

Better to have two variables, 'handler_error' and 'error' and you only do the

above if 'handler_error' is set.

> @@ -2395,11 +2406,13 @@ void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry)

>      @retval FALSE              Success

>  */

> -static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)

> +static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry,

> +                                          bool report_error)

This function has @param, please add 'report_error' there!

> @@ -2415,24 +2428,29 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)

>      DBUG_ASSERT(ddl_log_entry.entry_type == DDL_LOG_ENTRY_CODE ||

>                  ddl_log_entry.entry_type == DDL_LOG_IGNORE_ENTRY_CODE);

> -    if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry))

> +    if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry, report_error))

>      {

>        uint action_type= ddl_log_entry.action_type;

>        if (action_type >= DDL_LOG_LAST_ACTION)

>          action_type= 0;

> -      /* Write to error log and continue with next log entry */

> -      sql_print_error("DDL_LOG: Got error %d when trying to execute action "

> -                      "for entry %u of type '%s'",

> -                      (int) my_errno, read_entry,

> -                      ddl_log_action_name[action_type]);

> +      if (!report_error)

> +      {

> +        /* Write to error log and continue with next log entry */

> +        sql_print_error("DDL_LOG: Got error %d when trying to execute action "

> +                        "for entry %u of type '%s'",

> +                        (int) my_errno, read_entry,

> +                        ddl_log_action_name[action_type]);

> +      }

> +      result= true;

>        break;

>      }

The above code is confusing.  A reader will see it as

'If we should not report error, then we report an error'

Maybe better to change the name to 'interactive_mode' or 'in_recovery'

I checked the code and we only set report_error to true in the case of:

  /* NOTE: holds "drop old table; rename tmp table"  */

  result= ddl_log_revert(thd, ddl_log_state_rm, true);

I wonder, shouldn't we have ddl_log_revert() always set report_error to true

as for this call we don't expect any errors?

> @@ -2930,7 +2948,7 @@ void ddl_log_complete(DDL_LOG_STATE *state)

>    This is called for failed rename table, create trigger or drop trigger.

>  */

> -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state)

> +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state, bool report_error)

>  {

See comment above. What would happen if we would remove 'report_error' parameter

from this and do 'report_error= true' in this function?

As there should never be any errors when doing a ddl_log_revert() this should

be ok.

We still have to take care of following cases in ddl_log_revert():

- The original query failed and we to keep the errors state on exit.

  (thd->is_error() was set)

- The original query succeded but revert failed. In this case we have to

  convert the error to a warning as the original DDL succeded and we may

  have already sent an 'ok' to the end user.

  This can be done with some code like:

  save=thd->is_error();

  ... execute_code

  if (thd->is_error() && !save)

    convert_error_to_warning(thd));

>  /*

>    Setting ddl_log_entry.phase to this has the same effect as setting

> @@ -248,6 +248,7 @@ typedef struct st_ddl_log_state

>    */

>    DDL_LOG_MEMORY_ENTRY *main_entry;

>    uint16 flags;                                 /* Cache for flags */

> +  uint master_chain_pos;

>    bool is_active() { return list != 0; }

>  } DDL_LOG_STATE;

> @@ -273,7 +274,8 @@ bool ddl_log_write_execute_entry(uint first_entry,

>  bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry);

>  void ddl_log_complete(DDL_LOG_STATE *ddl_log_state);

> -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state);

> +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state,

> +                    bool report_error= false);

Please don't use optional arguments! It can easily be confusing when

looking at different pieces of code and not knowing about the extra argument

(could be another overloaded function).  Better to change all callers!

Aleksey Midenkov added a comment - 2022-03-02 22:08 - edited @@ -1287,7 +1294,8 @@ static void rename_in_stat_tables(THD *thd, DDL_LOG_ENTRY *ddl_log_entry, */ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, - DDL_LOG_ENTRY *ddl_log_entry) + DDL_LOG_ENTRY *ddl_log_entry, + bool report_error) Please add a function comment about the meaning of the above parameters and also explain when report_error is set. > @@ -1320,7 +1328,8 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > DBUG_RETURN(FALSE); > > handler_name= ddl_log_entry->handler_name; > - thd->push_internal_handler(&no_such_table_handler); > + if (!report_error) > + thd->push_internal_handler(&no_such_table_handler); Add a comment before the 'if' (or function comment), something like: /* If report_error is set we want the user to see all errors as we expect no errors. If report_error is false, we are in recovery and we should only print unexpected errors, not errors for trying to delete files that may not exists. */ > @@ -1459,11 +1472,11 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > /* fall through */ > case DDL_RENAME_PHASE_TABLE: > /* Restore frm and table to original names */ > - execute_rename_table(ddl_log_entry, file, > - &ddl_log_entry->db, &ddl_log_entry->name, > - &ddl_log_entry->from_db, &ddl_log_entry->from_name, > - 0, > - from_path, to_path); > + flags= report_error ? FN_FROM_IS_TMP : 0; The above needs a comment. It should explain why 'flag' is depending on report_error and not ddl_log_entry->flags! I don't understand why the execution should be different runtime compared to during startup recovery. > if (increment_phase(entry_pos)) > break; > /* Fall through */ > @@ -1765,31 +1785,6 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > } > strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS); > mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT)); We should probably set error if the above fails as we may not have a error handler to catch this anymore. > @@ -2296,11 +2291,27 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > } > > end: > + if (report_error) > + { > + if (error && file) > + { > + TABLE_SHARE share; Better to bzero the above, just in case. > + share.db= ddl_log_entry->db; > + share.table_name= ddl_log_entry->name; > + share.normalized_path= ddl_log_entry->tmp_name; > + /* TODO: make TABLE_SHARE-independent handler::print_error()? */ Cannot be done as the lower level may not know the user visible table name. However, we could create a helper function to do the error printing when we only have a file pointer (if we have multiple cases when this is needed). > + file->change_table_ptr(NULL, &share); > + file->print_error(error, MYF(0)); Shouldn't some errors already have been reported? At least errors from mysql_file_delete() are already reported and should not be reported again. For example, in your patch you set error to -1, which should not be sent to print_error. Better to have two variables, 'handler_error' and 'error' and you only do the above if 'handler_error' is set. > @@ -2395,11 +2406,13 @@ void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry) > @retval FALSE Success > */ > > -static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry) > +static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry, > + bool report_error) This function has @param, please add 'report_error' there! > @@ -2415,24 +2428,29 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry) > DBUG_ASSERT(ddl_log_entry.entry_type == DDL_LOG_ENTRY_CODE || > ddl_log_entry.entry_type == DDL_LOG_IGNORE_ENTRY_CODE); > > - if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry)) > + if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry, report_error)) > { > uint action_type= ddl_log_entry.action_type; > if (action_type >= DDL_LOG_LAST_ACTION) > action_type= 0; > > - /* Write to error log and continue with next log entry */ > - sql_print_error("DDL_LOG: Got error %d when trying to execute action " > - "for entry %u of type '%s'", > - (int) my_errno, read_entry, > - ddl_log_action_name[action_type]); > + if (!report_error) > + { > + /* Write to error log and continue with next log entry */ > + sql_print_error("DDL_LOG: Got error %d when trying to execute action " > + "for entry %u of type '%s'", > + (int) my_errno, read_entry, > + ddl_log_action_name[action_type]); > + } > + result= true; > break; > } The above code is confusing. A reader will see it as 'If we should not report error, then we report an error' Maybe better to change the name to 'interactive_mode' or 'in_recovery' I checked the code and we only set report_error to true in the case of: /* NOTE: holds "drop old table; rename tmp table" */ result= ddl_log_revert(thd, ddl_log_state_rm, true); I wonder, shouldn't we have ddl_log_revert() always set report_error to true as for this call we don't expect any errors? > @@ -2930,7 +2948,7 @@ void ddl_log_complete(DDL_LOG_STATE *state) > This is called for failed rename table, create trigger or drop trigger. > */ > > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state) > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state, bool report_error) > { See comment above. What would happen if we would remove 'report_error' parameter from this and do 'report_error= true' in this function? As there should never be any errors when doing a ddl_log_revert() this should be ok. We still have to take care of following cases in ddl_log_revert(): - The original query failed and we to keep the errors state on exit. (thd->is_error() was set) - The original query succeded but revert failed. In this case we have to convert the error to a warning as the original DDL succeded and we may have already sent an 'ok' to the end user. This can be done with some code like: save=thd->is_error(); ... execute_code if (thd->is_error() && !save) convert_error_to_warning(thd)); > > /* > Setting ddl_log_entry.phase to this has the same effect as setting > @@ -248,6 +248,7 @@ typedef struct st_ddl_log_state > */ > DDL_LOG_MEMORY_ENTRY *main_entry; > uint16 flags; /* Cache for flags */ > + uint master_chain_pos; > bool is_active() { return list != 0; } > } DDL_LOG_STATE; > > @@ -273,7 +274,8 @@ bool ddl_log_write_execute_entry(uint first_entry, > bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry); > > void ddl_log_complete(DDL_LOG_STATE *ddl_log_state); > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state); > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state, > + bool report_error= false); Please don't use optional arguments! It can easily be confusing when looking at different pieces of code and not knowing about the extra argument (could be another overloaded function). Better to change all callers!

Marko Mäkelä added a comment - 2022-09-19 09:05

MDEV-29566 demonstrates that ha_innobase::delete_table() may fail for some partitions in DROP TABLE. How would this design take that into account?

Marko Mäkelä added a comment - 2022-09-19 09:05 MDEV-29566 demonstrates that ha_innobase::delete_table() may fail for some partitions in DROP TABLE . How would this design take that into account?

Aleksey Midenkov added a comment - 2022-12-13 06:15

Rebased to 10.11 in bb-10.11-midenok-MDEV-27180

Aleksey Midenkov added a comment - 2022-12-13 06:15 Rebased to 10.11 in bb-10.11-midenok-MDEV-27180

Aleksey Midenkov added a comment - 2023-08-16 17:49

marko If something fails everything is rolled back.

Aleksey Midenkov added a comment - 2023-08-16 17:49 marko If something fails everything is rolled back.

MariaDB Server

Fully atomic partitioning DDL operations

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates

Git Integration