[MDEV-29499] Improving the 'Can't execute init_slave query' error message with the actual failure Created: 2022-09-09  Updated: 2023-12-29

Status: Open
Project: MariaDB Server
Component/s: Replication
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11
Fix Version/s: 10.4, 10.5, 10.6

Type: Bug Priority: Major
Reporter: Roel Van de Paar Assignee: Brandon Nesterenko
Resolution: Unresolved Votes: 0
Labels: affects-tests, beginner-friendly

Issue Links:
Blocks

 Description   

I have been keeping track of what errors are reported by the server and noticed the following. For basically the same issue:

Can't execute init_slave query 

I have seen all these errors:

Internal MariaDB error code: 1046
Internal MariaDB error code: 1064
Internal MariaDB error code: 1193
Internal MariaDB error code: 1231
Internal MariaDB error code: 1054
Internal MariaDB error code: 1237

This is confusing, and it affects testing as other issues are being missed due to all these errors being filtered. Here is perror output for these errors:

$ ./bin/perror 1046
MariaDB error code 1046 (ER_NO_DB_ERROR): No database selected
$ ./bin/perror 1064
MariaDB error code 1064 (ER_PARSE_ERROR): %s near '%-.80T' at line %d
$ ./bin/perror 1193
MariaDB error code 1193 (ER_UNKNOWN_SYSTEM_VARIABLE): Unknown system variable '%-.*s'
$ ./bin/perror 1231
MariaDB error code 1231 (ER_WRONG_VALUE_FOR_VAR): Variable '%-.64s' can't be set to the value of '%-.200T'
$ ./bin/perror 1054
MariaDB error code 1054 (ER_BAD_FIELD_ERROR): Unknown column '%-.192s' in '%-.192s'
$ ./bin/perror 1237
MariaDB error code 1237 (ER_SLAVE_IGNORED_TABLE): Slave SQL thread ignored the query because of replicate-*-table rules

This bug report proposes creating a single new error code which is raised whenever the slave cannot execute the init_slave query, for any reason.

Testing-wise, that error can then be filtered and all the errors above can be unfiltered.

For users, some resolution (the underlying cause for the Can't execute init_slave query failure) may be missed this way, but in return there will be a single automatable error.

Then again, perhaps there is another way to retain this resolution? For example, how about (fictive):

$ ./bin/perror 9999
MariaDB error code 9999 (ER_INIT_SLAVE_ERROR): Can't execute init_slave query due to MariaDB error code [1046/1064/.../1237/...]: Check ./bin/perror [1046/1064/.../1237/...] for more info



 Comments   
Comment by Roel Van de Paar [ 2022-09-09 ]

A somewhat random example which generates error code 1046:

SET GLOBAL init_slave='CREATE VIEW v1 SELECT f1()';
CHANGE MASTER TO master_host='a';
START SLAVE SQL_THREAD;

Changing the query to SELECT 1 does not generate an error, even though there is no way it can be executed on the slave (master/slave not configured).

A somewhat random example which generates error code 1064:

SET GLOBAL init_slave='a';
CHANGE MASTER TO master_host='a';
START SLAVE SQL_THREAD;

Comment by Roel Van de Paar [ 2022-09-09 ]

Apparently, one of the teachings in protocol design is to abstract things, i.e. it does not actually matter whether a packet was lost or corrupted: just drop any corrupted packets and let the retransmission logic take care of it, which is how TCP/IP works. With thanks to marko for the insightful points/discussion.

Comment by Roel Van de Paar [ 2023-07-17 ]

Thinking about this again, it would be good if we can achieve both ends (as noted towards the end of the original description): have a single error code for when init_slave cannot be executed, yet have some form of reason provided within the error message, i.e. something like (9999 is a fictive error code for ER_INIT_SLAVE_ERROR):

MariaDB error code 9999 (ER_INIT_SLAVE_ERROR): Can't execute init_slave query due to MariaDB error code ...some other code...

Comment by Kristian Nielsen [ 2023-12-09 ]

What does "all these errors being filtered" mean? Where are they being filtered, and why?
What is an example of the exact error message (from the command line/START SLAVE as well as the slave error log) when this occurs?

This error seems like it would be very specific to (1) a START SLAVE run while (2) --init-slave is in effect. Filtering events in this specific case doesn't seem like something that should affect any other testing. And besides, replacing all errors with a generic 9999 error, which is then filtered, will have the exact same effect, it seems?

It sounds like the real problem here is some filtering in some test tool that is insufficiently precise to the particular circumstances (1) and (2)? If so, the test tool should probably be fixed.
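
As a rough sketch of what a more precise filter could look like, a test tool could match on the init_slave message text instead of on the bare error codes. The function name `is_init_slave_failure` and the assumption that the tool inspects one log line at a time are hypothetical; the exact log format is not shown in this report.

```cpp
#include <string>

// Hypothetical test-tool predicate (not part of any existing tool):
// returns true only for lines that mention the init_slave failure itself,
// so that errors such as 1046 or 1064 raised elsewhere are not suppressed.
bool is_init_slave_failure(const std::string &log_line)
{
  return log_line.find("Can't execute init_slave query") != std::string::npos;
}
```

With such a predicate, a line containing only "Internal MariaDB error code: 1046" would not be filtered unless it also carries the init_slave text.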

On the other hand, I agree the best behaviour would be if the error message describes both that there was an error running the --init-slave query string, as well as the actual error that occurred.
We do have the ability to present multiple error codes, at least with SHOW WARNINGS.

Comment by Andrei Elkin [ 2023-12-09 ]

> I agree the best behaviour would be if the error message describes both that there was an error running the --init-slave query string, as well as the actual error that occurred

Me too.
What do you think about reusing Slave_SQL_Running_State to keep an error status that would mention the execution phase in which the error occurred? Currently, the field is empty in an error case.

Comment by Roel Van de Paar [ 2023-12-12 ]

https://mariadb.zulipchat.com/#narrow/stream/118759-general/topic/Unifying.20'Can't.20execute.20init_slave.20query'

Comment by Roel Van de Paar [ 2023-12-12 ]

I agree that moving to a generic error is not the best solution. The "best behaviour would be if the error message describes both that there was an error running the --init-slave query string, as well as the actual error that occurred" indeed sounds like the best way forward, and makes it much clearer for users and testers what happened, without having to check error codes. Let's proceed that way, if feasible.

Comment by Andrei Elkin [ 2023-12-12 ]

Actually, the state transitions of Slave_SQL_Running_State reflect the SQL thread's THD statuses, so it can't help directly. But that makes me wonder whether SHOW SLAVE STATUS is indeed missing something like
Last_SQL_Error_state, which would be a copy of Slave_SQL_Running_State at the time of the error.

Comment by Gulshan Kumar Prasad [ 2023-12-28 ]

As per the discussion,
for the part "there was an error running the --init-slave query string", an error is already reported:
```
rli->report(ERROR_LEVEL, thd->get_stmt_da()->sql_errno(), NULL,
"Slave SQL thread aborted. Can't execute init_slave query");
goto err;
```
and only the part "the actual error that occurred" needs to be added, right?
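
A minimal standalone sketch of what appending "the actual error that occurred" could look like. It does not use the server's types: `format_init_slave_error` and its parameters are hypothetical stand-ins for the error number and message text held by the statement's diagnostics area, and the message wording is only an assumption, not an agreed format.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper, not part of the MariaDB source: builds a single
// message that names the init_slave failure and the underlying error.
std::string format_init_slave_error(unsigned int sql_errno,
                                    const char *sql_message)
{
  char buf[512];
  // snprintf truncates safely if the underlying message is very long.
  std::snprintf(buf, sizeof(buf),
                "Slave SQL thread aborted. Can't execute init_slave query: "
                "error %u (%s)",
                sql_errno,
                sql_message ? sql_message : "no message available");
  return std::string(buf);
}
```

For example, `format_init_slave_error(1046, "No database selected")` yields `Slave SQL thread aborted. Can't execute init_slave query: error 1046 (No database selected)`, which keeps a single recognisable prefix while still stating the cause.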

Comment by Roel Van de Paar [ 2023-12-29 ]

ultravoilet26 Thank you. If feasible, yes.
