Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-27180

Fully atomic partitioning DDL operations

Details

    • New Feature
    • Status: In Review (View Workflow)
    • Critical
    • Resolution: Unresolved
    • None
    • Partitioning, Server
    • None

    Description

      Atomic DDL for partitioning operations covers crash-safety but it does not cover failures. F.ex. if error happens during the rename or drop of partitions the alter operation does not return the table to its original state. More details what states are possible after failed alter operation are described in handle_alter_part_error(). Let's take a couple of exerpts from it:

      • Manual drop of remaining partitions is required
      • The frm file is in an unknown state, and a backup is required.

      Implementation of full atomicity for partitioning DDL takes into consideration the following modifications:

      1. For partition drop when it is already in storage engine use DDL_LOG_DROP_TABLE_ACTION instead of DDL_LOG_DELETE_ACTION.

      2. frm_action flag based on handler_name equal to reg_ext should not be used. Instead ddl_log_entry->name should contain filename with extension. So if .par file delete required we use second DDL_LOG_DELETE_ACTION.

      That is: DDL_LOG_DELETE_ACTION is used only for file delete (maybe rename to DDL_LOG_RM_ACTION). For table drop DDL_LOG_DROP_TABLE_ACTION must be used.

      3. Similarly, use DDL_LOG_RENAME_TABLE_ACTION instead of DDL_LOG_RENAME_ACTION. For partition rename update straight to DDL_RENAME_PHASE_TABLE because no triggers and stat processing is needed.

      For DDL_LOG_REPLACE_ACTION log DDL_LOG_DROP_TABLE_ACTION first and then DDL_LOG_RENAME_TABLE_ACTION.

      That is: DDL_LOG_RENAME_ACTION is used only for file rename. DDL_LOG_REPLACE_ACTION is deprecated. For file replace log DDL_LOG_RM_ACTION first.

      4. mysql_drop_partitions(), mysql_rename_partitions() are not needed. This can be done via log replay. This refactoring deprecates partition_element::log_entry data.

      5. Log replay must handle errors and should be crash-safe and atomic itself. That brings a couple of new modes for ddl_log_revert(): DDL_LOG_ERR_REPORT, DDL_LOG_ERR_ROLLBACK. In DDL_LOG_ERR_REPORT mode it fails log replay at the first encountered error and prints the error message. In DDL_LOG_ERR_ROLLBACK mode it also reverts back all the revertible changes it done before the error happens.

      That is: DDL_LOG_ERR_ROLLBACK means DDL log replay writes new DDL log chain which covers the atomicity during the replay.

      6. Implement the rest of MDEV-22166 minor refactorings (see todo.txt).

      P. 2. adds new feature of removing arbitrary files via DDL log (and not just .frm and .par files). This feature is required for MDEV-16417 atomicity. It was already implemented there as a quick hack, now is the time for unified solution.

      Attachments

        1. todo.txt
          6 kB
          Aleksey Midenkov

        Issue Links

          Activity

            midenok Aleksey Midenkov added a comment - - edited

            You can see that if you run
            mtr --parts.partition_debug
            worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019
            parts.partition_debug                   [ fail ] Found warnings/errors in server log file!
                   Test ended at 2022-03-01 21:05:41
            line
            Warning: Table: ./test/t1#P#p2 is open on rename new table2
            Warning: Table: ./test/t1#P#p2 is open on rename new table2
            This is because in you call ddl_log_revert (), which does a rename, when the table is open. This is not allowed and could cause crashes and lost data for some engines.
            The reason for the error is this code in sql_partition.cc:
                if (mysql_write_frm(lpt, WFRM_WRITE_CONVERTED_TO) ||
                    ERROR_INJECT("convert_partition_1") ||
                    write_log_drop_shadow_frm(lpt) ||
                    ERROR_INJECT("convert_partition_2") ||
                    mysql_write_frm(lpt, WFRM_WRITE_SHADOW) ||
                    ERROR_INJECT("convert_partition_3") ||
                    wait_while_table_is_used(thd, table, HA_EXTRA_NOT_USED) ||
                    ERROR_INJECT("convert_partition_4") ||
                    write_log_convert_partition(lpt) ||
                    ERROR_INJECT("convert_partition_5") ||
                    alter_close_table(lpt) ||
            If we fail at convert_partition_5, then the table will not be closed and we have a request in the ddl log to do the rename.
            Suggested fixes:
            Don't have an error injection after write_log_convert_partition().
            Set a variable before write_log_convert_partition(lpt) that is cleared after alter_close_table(). In the error handler, call alter_close_table() if the variable is set.
            

            midenok Aleksey Midenkov added a comment - - edited You can see that if you run mtr --parts.partition_debug worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 16000..16019 parts.partition_debug [ fail ] Found warnings/errors in server log file! Test ended at 2022-03-01 21:05:41 line Warning: Table: ./test/t1#P#p2 is open on rename new table2 Warning: Table: ./test/t1#P#p2 is open on rename new table2 This is because in you call ddl_log_revert (), which does a rename, when the table is open. This is not allowed and could cause crashes and lost data for some engines. The reason for the error is this code in sql_partition.cc: if (mysql_write_frm(lpt, WFRM_WRITE_CONVERTED_TO) || ERROR_INJECT("convert_partition_1") || write_log_drop_shadow_frm(lpt) || ERROR_INJECT("convert_partition_2") || mysql_write_frm(lpt, WFRM_WRITE_SHADOW) || ERROR_INJECT("convert_partition_3") || wait_while_table_is_used(thd, table, HA_EXTRA_NOT_USED) || ERROR_INJECT("convert_partition_4") || write_log_convert_partition(lpt) || ERROR_INJECT("convert_partition_5") || alter_close_table(lpt) || If we fail at convert_partition_5, then the table will not be closed and we have a request in the ddl log to do the rename. Suggested fixes: Don't have an error injection after write_log_convert_partition(). Set a variable before write_log_convert_partition(lpt) that is cleared after alter_close_table(). In the error handler, call alter_close_table() if the variable is set.
            midenok Aleksey Midenkov added a comment - - edited

            @@ -1287,7 +1294,8 @@ static void rename_in_stat_tables(THD *thd, DDL_LOG_ENTRY *ddl_log_entry,
             */
             
             static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
            -                                  DDL_LOG_ENTRY *ddl_log_entry)
            +                                  DDL_LOG_ENTRY *ddl_log_entry,
            +                                  bool report_error)
             
            Please add a function comment about the meaning of the above parameters
            and also explain when report_error is set.
             
            > @@ -1320,7 +1328,8 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
            >      DBUG_RETURN(FALSE);
            >  
            >    handler_name=    ddl_log_entry->handler_name;
            > -  thd->push_internal_handler(&no_such_table_handler);
            > +  if (!report_error)
            > +    thd->push_internal_handler(&no_such_table_handler);
             
            Add a comment before the 'if' (or function comment), something like:
             
            /*
               If report_error is set we want the user to see all errors as we expect no
               errors. If report_error is false, we are in recovery and we should only
               print unexpected errors, not errors for trying to delete files that may
               not exists.
             */
             
            > @@ -1459,11 +1472,11 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
            >      /* fall through */
            >      case DDL_RENAME_PHASE_TABLE:
            >        /* Restore frm and table to original names */
            > -      execute_rename_table(ddl_log_entry, file,
            > -                           &ddl_log_entry->db, &ddl_log_entry->name,
            > -                           &ddl_log_entry->from_db, &ddl_log_entry->from_name,
            > -                           0,
            > -                           from_path, to_path);
            > +      flags= report_error ? FN_FROM_IS_TMP : 0;
             
            The above needs a comment. It should explain why 'flag' is depending
            on report_error and not ddl_log_entry->flags!
             
            I don't understand why the execution should be different runtime
            compared to during startup recovery.
             
            >        if (increment_phase(entry_pos))
            >          break;
            >        /* Fall through */
            > @@ -1765,31 +1785,6 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
            >      }
            >      strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS);
            >      mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));
             
            We should probably set error if the above fails as we may not have a
            error handler to catch this anymore.
             
            > @@ -2296,11 +2291,27 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root,
            >    }
            >  
            >  end:
            > +  if (report_error)
            > +  {
            > +    if (error && file)
            > +    {
            > +      TABLE_SHARE share;
              Better to bzero the above, just in case.
            > +      share.db=               ddl_log_entry->db;
            > +      share.table_name=       ddl_log_entry->name;
            > +      share.normalized_path=  ddl_log_entry->tmp_name;
            > +      /* TODO: make TABLE_SHARE-independent handler::print_error()? */
             
            Cannot be done as the lower level may not know the user visible table name.
            However, we could create a helper function to do the error printing when
            we only have a file pointer (if we have multiple cases when this is needed).
             
            > +      file->change_table_ptr(NULL, &share);
            > +      file->print_error(error, MYF(0));
             
            Shouldn't some errors already have been reported?
            At least errors from mysql_file_delete() are already reported and should not
            be reported again.
            For example, in your patch you set error to -1, which should not be sent to
            print_error.
            Better to have two variables, 'handler_error' and 'error' and you only do the
            above if 'handler_error' is set.
             
            > @@ -2395,11 +2406,13 @@ void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry)
            >      @retval FALSE              Success
            >  */
            >  
            > -static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)
            > +static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry,
            > +                                          bool report_error)
             
            This function has @param, please add 'report_error' there!
             
            > @@ -2415,24 +2428,29 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry)
            >      DBUG_ASSERT(ddl_log_entry.entry_type == DDL_LOG_ENTRY_CODE ||
            >                  ddl_log_entry.entry_type == DDL_LOG_IGNORE_ENTRY_CODE);
            >  
            > -    if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry))
            > +    if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry, report_error))
            >      {
            >        uint action_type= ddl_log_entry.action_type;
            >        if (action_type >= DDL_LOG_LAST_ACTION)
            >          action_type= 0;
            >  
            > -      /* Write to error log and continue with next log entry */
            > -      sql_print_error("DDL_LOG: Got error %d when trying to execute action "
            > -                      "for entry %u of type '%s'",
            > -                      (int) my_errno, read_entry,
            > -                      ddl_log_action_name[action_type]);
            > +      if (!report_error)
            > +      {
            > +        /* Write to error log and continue with next log entry */
            > +        sql_print_error("DDL_LOG: Got error %d when trying to execute action "
            > +                        "for entry %u of type '%s'",
            > +                        (int) my_errno, read_entry,
            > +                        ddl_log_action_name[action_type]);
            > +      }
            > +      result= true;
            >        break;
            >      }
             
            The above code is confusing.  A reader will see it as
            'If we should not report error, then we report an error'
             
            Maybe better to change the name to 'interactive_mode' or 'in_recovery'
             
            I checked the code and we only set report_error to true in the case of:
              /* NOTE: holds "drop old table; rename tmp table"  */
              result= ddl_log_revert(thd, ddl_log_state_rm, true);
             
            I wonder, shouldn't we have ddl_log_revert() always set report_error to true
            as for this call we don't expect any errors?
             
            > @@ -2930,7 +2948,7 @@ void ddl_log_complete(DDL_LOG_STATE *state)
            >    This is called for failed rename table, create trigger or drop trigger.
            >  */
            >  
            > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state)
            > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state, bool report_error)
            >  {
             
            See comment above. What would happen if we would remove 'report_error' parameter
            from this and do 'report_error= true' in this function?
             
            As there should never be any errors when doing a ddl_log_revert() this should
            be ok.
             
            We still have to take care of following cases in ddl_log_revert():
             
            - The original query failed and we to keep the errors state on exit.
              (thd->is_error() was set)
            - The original query succeded but revert failed. In this case we have to
              convert the error to a warning as the original DDL succeded and we may
              have already sent an 'ok' to the end user.
              This can be done with some code like:
             
              save=thd->is_error();
              ... execute_code
              if (thd->is_error() && !save)
                convert_error_to_warning(thd));
             
            >  
            >  /*
            >    Setting ddl_log_entry.phase to this has the same effect as setting
            > @@ -248,6 +248,7 @@ typedef struct st_ddl_log_state
            >    */
            >    DDL_LOG_MEMORY_ENTRY *main_entry;
            >    uint16 flags;                                 /* Cache for flags */
            > +  uint master_chain_pos;
            >    bool is_active() { return list != 0; }
            >  } DDL_LOG_STATE;
            >  
            > @@ -273,7 +274,8 @@ bool ddl_log_write_execute_entry(uint first_entry,
            >  bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry);
            >  
            >  void ddl_log_complete(DDL_LOG_STATE *ddl_log_state);
            > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state);
            > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state,
            > +                    bool report_error= false);
             
             
            Please don't use optional arguments! It can easily be confusing when
            looking at different pieces of code and not knowing about the extra argument
            (could be another overloaded function).  Better to change all callers!
            
            

            midenok Aleksey Midenkov added a comment - - edited @@ -1287,7 +1294,8 @@ static void rename_in_stat_tables(THD *thd, DDL_LOG_ENTRY *ddl_log_entry, */ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, - DDL_LOG_ENTRY *ddl_log_entry) + DDL_LOG_ENTRY *ddl_log_entry, + bool report_error)   Please add a function comment about the meaning of the above parameters and also explain when report_error is set.   > @@ -1320,7 +1328,8 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > DBUG_RETURN(FALSE); > > handler_name= ddl_log_entry->handler_name; > - thd->push_internal_handler(&no_such_table_handler); > + if (!report_error) > + thd->push_internal_handler(&no_such_table_handler);   Add a comment before the 'if' (or function comment), something like:   /* If report_error is set we want the user to see all errors as we expect no errors. If report_error is false, we are in recovery and we should only print unexpected errors, not errors for trying to delete files that may not exists. */   > @@ -1459,11 +1472,11 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > /* fall through */ > case DDL_RENAME_PHASE_TABLE: > /* Restore frm and table to original names */ > - execute_rename_table(ddl_log_entry, file, > - &ddl_log_entry->db, &ddl_log_entry->name, > - &ddl_log_entry->from_db, &ddl_log_entry->from_name, > - 0, > - from_path, to_path); > + flags= report_error ? FN_FROM_IS_TMP : 0;   The above needs a comment. It should explain why 'flag' is depending on report_error and not ddl_log_entry->flags!   I don't understand why the execution should be different runtime compared to during startup recovery.   > if (increment_phase(entry_pos)) > break; > /* Fall through */ > @@ -1765,31 +1785,6 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > } > strxnmov(to_path, sizeof(to_path)-1, path.str, reg_ext, NullS); > mysql_file_delete(key_file_frm, to_path, MYF(MY_WME|MY_IGNORE_ENOENT));   We should probably set error if the above fails as we may not have a error handler to catch this anymore.   > @@ -2296,11 +2291,27 @@ static int ddl_log_execute_action(THD *thd, MEM_ROOT *mem_root, > } > > end: > + if (report_error) > + { > + if (error && file) > + { > + TABLE_SHARE share; Better to bzero the above, just in case. > + share.db= ddl_log_entry->db; > + share.table_name= ddl_log_entry->name; > + share.normalized_path= ddl_log_entry->tmp_name; > + /* TODO: make TABLE_SHARE-independent handler::print_error()? */   Cannot be done as the lower level may not know the user visible table name. However, we could create a helper function to do the error printing when we only have a file pointer (if we have multiple cases when this is needed).   > + file->change_table_ptr(NULL, &share); > + file->print_error(error, MYF(0));   Shouldn't some errors already have been reported? At least errors from mysql_file_delete() are already reported and should not be reported again. For example, in your patch you set error to -1, which should not be sent to print_error. Better to have two variables, 'handler_error' and 'error' and you only do the above if 'handler_error' is set.   > @@ -2395,11 +2406,13 @@ void ddl_log_release_memory_entry(DDL_LOG_MEMORY_ENTRY *log_entry) > @retval FALSE Success > */ > > -static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry) > +static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry, > + bool report_error)   This function has @param, please add 'report_error' there!   > @@ -2415,24 +2428,29 @@ static bool ddl_log_execute_entry_no_lock(THD *thd, uint first_entry) > DBUG_ASSERT(ddl_log_entry.entry_type == DDL_LOG_ENTRY_CODE || > ddl_log_entry.entry_type == DDL_LOG_IGNORE_ENTRY_CODE); > > - if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry)) > + if (ddl_log_execute_action(thd, &mem_root, &ddl_log_entry, report_error)) > { > uint action_type= ddl_log_entry.action_type; > if (action_type >= DDL_LOG_LAST_ACTION) > action_type= 0; > > - /* Write to error log and continue with next log entry */ > - sql_print_error("DDL_LOG: Got error %d when trying to execute action " > - "for entry %u of type '%s'", > - (int) my_errno, read_entry, > - ddl_log_action_name[action_type]); > + if (!report_error) > + { > + /* Write to error log and continue with next log entry */ > + sql_print_error("DDL_LOG: Got error %d when trying to execute action " > + "for entry %u of type '%s'", > + (int) my_errno, read_entry, > + ddl_log_action_name[action_type]); > + } > + result= true; > break; > }   The above code is confusing. A reader will see it as 'If we should not report error, then we report an error'   Maybe better to change the name to 'interactive_mode' or 'in_recovery'   I checked the code and we only set report_error to true in the case of: /* NOTE: holds "drop old table; rename tmp table" */ result= ddl_log_revert(thd, ddl_log_state_rm, true);   I wonder, shouldn't we have ddl_log_revert() always set report_error to true as for this call we don't expect any errors?   > @@ -2930,7 +2948,7 @@ void ddl_log_complete(DDL_LOG_STATE *state) > This is called for failed rename table, create trigger or drop trigger. > */ > > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state) > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *state, bool report_error) > {   See comment above. What would happen if we would remove 'report_error' parameter from this and do 'report_error= true' in this function?   As there should never be any errors when doing a ddl_log_revert() this should be ok.   We still have to take care of following cases in ddl_log_revert():   - The original query failed and we to keep the errors state on exit. (thd->is_error() was set) - The original query succeded but revert failed. In this case we have to convert the error to a warning as the original DDL succeded and we may have already sent an 'ok' to the end user. This can be done with some code like:   save=thd->is_error(); ... execute_code if (thd->is_error() && !save) convert_error_to_warning(thd));   > > /* > Setting ddl_log_entry.phase to this has the same effect as setting > @@ -248,6 +248,7 @@ typedef struct st_ddl_log_state > */ > DDL_LOG_MEMORY_ENTRY *main_entry; > uint16 flags; /* Cache for flags */ > + uint master_chain_pos; > bool is_active() { return list != 0; } > } DDL_LOG_STATE; > > @@ -273,7 +274,8 @@ bool ddl_log_write_execute_entry(uint first_entry, > bool ddl_log_disable_execute_entry(DDL_LOG_MEMORY_ENTRY **active_entry); > > void ddl_log_complete(DDL_LOG_STATE *ddl_log_state); > -bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state); > +bool ddl_log_revert(THD *thd, DDL_LOG_STATE *ddl_log_state, > + bool report_error= false);     Please don't use optional arguments! It can easily be confusing when looking at different pieces of code and not knowing about the extra argument (could be another overloaded function). Better to change all callers!

            MDEV-29566 demonstrates that ha_innobase::delete_table() may fail for some partitions in DROP TABLE. How would this design take that into account?

            marko Marko Mäkelä added a comment - MDEV-29566 demonstrates that ha_innobase::delete_table() may fail for some partitions in DROP TABLE . How would this design take that into account?

            Rebased to 10.11 in bb-10.11-midenok-MDEV-27180

            midenok Aleksey Midenkov added a comment - Rebased to 10.11 in bb-10.11-midenok-MDEV-27180

            marko If something fails everything is rolled back.

            midenok Aleksey Midenkov added a comment - marko If something fails everything is rolled back.

            People

              monty Michael Widenius
              midenok Aleksey Midenkov
              Votes:
              1 Vote for this issue
              Watchers:
              12 Start watching this issue

              Dates

                Created:
                Updated:

                Git Integration

                  Error rendering 'com.xiplink.jira.git.jira_git_plugin:git-issue-webpanel'. Please contact your Jira administrators.