[MDEV-29499] Improving the 'Can't execute init_slave query' error message with the actual failure Created: 2022-09-09 Updated: 2023-12-29 |
|
| Status: | Open |
| Project: | MariaDB Server |
| Component/s: | Replication |
| Affects Version/s: | 10.3, 10.4, 10.5, 10.6, 10.7, 10.8, 10.9, 10.10, 10.11 |
| Fix Version/s: | 10.4, 10.5, 10.6 |
| Type: | Bug | Priority: | Major |
| Reporter: | Roel Van de Paar | Assignee: | Brandon Nesterenko |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | affects-tests, beginner-friendly | ||
| Description |
|
I have been keeping track of what errors are reported by the server and noticed the following. For basically the same issue:
I have seen all these errors:
This is confusing, and it affects testing as other issues are being missed due to all these errors being filtered. Here is perror output for these errors:
This bug report proposes creating a single new error code that is raised whenever the slave cannot execute the init_slave query, for any reason. Testing-wise, that error can then be filtered and all the errors above can be unfiltered. For users, some resolution (i.e. the underlying cause of the Can't execute init_slave query failure) may be lost this way, but in return there will be a single automatable error. Then again, perhaps there is another way to keep this resolution? For example, how about (fictive):
|
| Comments |
| Comment by Roel Van de Paar [ 2022-09-09 ] | ||||||
|
A somewhat random example which generates error code 1046:
Changing the query to SELECT 1 does not generate an error, even though there is no way it can be executed on the slave (master/slave not configured).

A somewhat random example which generates error code 1064:
| ||||||
| Comment by Roel Van de Paar [ 2022-09-09 ] | ||||||
|
Apparently, one of the teachings in protocol design is to abstract things: it does not actually matter whether a packet was lost or corrupted. Just drop any corrupted packets and let the retransmission logic take care of it, which is how TCP/IP works. Thanks to marko for the insightful points/discussion. | ||||||
| Comment by Roel Van de Paar [ 2023-07-17 ] | ||||||
|
Thinking about this again, it would be good if we can achieve both ends (as noted towards the end of the original description): have a single error code for when init_slave cannot be executed, yet have some form of reason provided within the error message, i.e. something like (9999 is a fictive error code for ER_INIT_SLAVE_ERROR):
| ||||||
| Comment by Kristian Nielsen [ 2023-12-09 ] | ||||||
|
What does it mean "all these errors being filtered"? Where are they being filtered, and why?

This error seems it would be very specific to (1) a START SLAVE run while (2) --init-slave being in effect. Filtering events in this specific case doesn't seem like something that should affect any other testing. And besides, replacing all errors with a generic 9999 error which is then filtered, will have the exact same effect, it seems?

It sounds like the real problem here is some filtering in some test tool that is insufficiently precise to the particular circumstances (1) and (2)? If so, the test tool should probably be fixed.

On the other hand, I agree the best behaviour would be if the error messages describes both that there was an error running the --init-slave query string, as well as the actual error that occurred. | ||||||
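To illustrate the point about filter precision, here is a minimal Python sketch of a test-tool filter that keys on the init_slave context rather than on a bare error code. Everything here is an assumption made for illustration: the `split_log_lines` helper, the regular expression, and the sample log lines are invented and do not reproduce real server output or any existing test tool.

```python
import re

# Hypothetical pattern: match only lines that mention the init_slave failure,
# instead of filtering every line that carries a given error code.
INIT_SLAVE_FAILURE = re.compile(r"Can't execute init_slave query")


def split_log_lines(lines):
    """Return (filtered, kept): init_slave failures versus everything else."""
    filtered, kept = [], []
    for line in lines:
        if INIT_SLAVE_FAILURE.search(line):
            filtered.append(line)
        else:
            kept.append(line)
    return filtered, kept


# Invented sample lines, for illustration only (not real server output).
sample = [
    "[ERROR] Slave SQL: Can't execute init_slave query, error code: 1046",
    "[ERROR] No database selected, error code: 1046",
]
filtered, kept = split_log_lines(sample)
```

With a context-based filter of this kind, an unrelated occurrence of the same error code stays visible to the tester instead of being filtered along with the init_slave failure.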
| Comment by Andrei Elkin [ 2023-12-09 ] | ||||||
|
> I agree the best behaviour would be if the error messages describes both that there was an error running the --init-slave query string, as well as the actual error that occurred

Me 2. | ||||||
| Comment by Roel Van de Paar [ 2023-12-12 ] | ||||||
|
I agree that moving to a generic error is not the best solution. The "best behaviour would be if the error messages describes both that there was an error running the --init-slave query string, as well as the actual error that occurred" indeed sounds like the best way forward, and makes it much clearer for users and testers what happened, without having to check error codes. Let's proceed that way, if feasible. | ||||||
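As a rough sketch of what a combined message could look like, the Python snippet below joins the init_slave context with the underlying error. The function name `format_init_slave_error` and the message layout are hypothetical, chosen only for illustration; they are not the server's actual implementation or an agreed format.

```python
def format_init_slave_error(errno: int, reason: str) -> str:
    """Combine the init_slave context with the underlying error (hypothetical layout)."""
    return f"Can't execute init_slave query: {reason} (error code {errno})"


# 1046 is the "No database selected" error mentioned earlier in this report.
message = format_init_slave_error(1046, "No database selected")
```

A message of this shape lets a tester match on the fixed init_slave prefix while a user still sees the actual cause.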
| Comment by Andrei Elkin [ 2023-12-12 ] | ||||||
|
Actually, the state transitions of Slave_SQL_Running_State reflect the SQL thread's THD statuses, so it can't help immediately. But that makes me wonder whether SHOW SLAVE STATUS is indeed missing something like | ||||||
| Comment by Gulshan Kumar Prasad [ 2023-12-28 ] | ||||||
|
As per the discussion, | ||||||
| Comment by Roel Van de Paar [ 2023-12-29 ] | ||||||
|
ultravoilet26 Thank you. If feasible, yes. |