[MDEV-16768] Replication aborts with ER_KEY_NOT_FOUND on table with key on BLOB and without PK Created: 2018-07-16  Updated: 2023-04-27

Status: Stalled
Project: MariaDB Server
Component/s: Replication, Storage Engine - RocksDB
Affects Version/s: 10.2, 10.3
Fix Version/s: 10.4

Type: Bug Priority: Major
Reporter: Elena Stepanova Assignee: Sergei Petrunia
Resolution: Unresolved Votes: 0
Labels: None


 Description   

Run with --mysqld=--plugin-load-add=ha_rocksdb

--source include/have_binlog_format_row.inc
--source include/master-slave.inc
 
CREATE TABLE t1 (b BLOB, i INT, KEY(b(8))) ENGINE=RocksDB;
INSERT INTO t1 VALUES (REPEAT('a',9),1);
 
UPDATE t1 SET i = 2;
 
--sync_slave_with_master
 
# Cleanup
--connection master
DROP TABLE t1;
--source include/rpl_end.inc

10.2 1cc1d0429da

2018-07-16 19:11:54 140503152559872 [ERROR] mysqld: Can't find record in 't1'
2018-07-16 19:11:54 140503152559872 [ERROR] Slave SQL: Could not execute Update_rows_v1 event on table test.t1; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 923, Gtid 0-1-3, Internal MariaDB error code: 1032
2018-07-16 19:11:54 140503152559872 [Warning] Slave: Can't find record in 't1' Error_code: 1032
2018-07-16 19:11:54 140503152559872 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000001' position 726

Reproducible regardless of parallel replication or parallel replication mode.
Also reproducible on bb-10.2-mdev16428 b528b069a16.

Not reproducible with InnoDB.



 Comments   
Comment by Sergei Petrunia [ 2018-07-16 ]

Debugging both MyRocks and InnoDB. In both cases, the slave tries to find the record for update in Rows_log_event::find_row using key=0 (that is, KEY(b(8)).

The lookup key is

(gdb) x/11c m_key
  0x7fff94015460:	0 '\000'	8 '\b'	0 '\000'	97 'a'	97 'a'	97 'a'	97 'a'	97 'a'
  0x7fff94015468:	97 'a'	97 'a'	97 'a'

    if ((error= table->file->ha_index_read_map(table->record[0], m_key, 
                                               HA_WHOLE_KEY,
                                               HA_READ_KEY_EXACT)))

InnoDB finds the row and returns error=0 here, while MyRocks returns error=120.

Comment by Sergei Petrunia [ 2018-07-16 ]

Not reproducible on the upstream.
Debugging it in MyRocks, the issue looks a bit similar to MDEV-16155.
Upstream has a slightly different fix for MDEV-16155, could it be that their fix covers this issue while ours doesnt?

Comment by Sergei Petrunia [ 2018-07-16 ]

...no. Reverting our fix for MDEV-16155 and applying theirs doesn't fix the issue.

Comment by Ming Lin [ 2018-08-05 ]

Hi, this patch fixed it.
https://github.com/MariaDB/server/pull/837

Comment by Sergei Petrunia [ 2018-08-19 ]

Looking at the patch...

A testcase without replication:

CREATE TABLE t2 (b BLOB, i INT, KEY(b(8))) ENGINE=RocksDB;
INSERT INTO t2 VALUES (REPEAT('a',22),1);
 
explain 
select * from t2 where b=repeat('a', 22);
select * from t2 where b=repeat('a', 22);
select * from t2 ignore index (b) where b=repeat('a', 22);
drop table t2;

MariaDB [j22]> explain select * from t2 where b=repeat('a', 22);
+------+-------------+-------+------+---------------+------+---------+-------+------+-------------+
| id   | select_type | table | type | possible_keys | key  | key_len | ref   | rows | Extra       |
+------+-------------+-------+------+---------------+------+---------+-------+------+-------------+
|    1 | SIMPLE      | t2    | ref  | b             | b    | 11      | const |    1 | Using where |
+------+-------------+-------+------+---------------+------+---------+-------+------+-------------+
1 row in set (0.03 sec)
 
MariaDB [j22]> select * from t2 where b=repeat('a', 22);
Empty set (0.00 sec)
 
MariaDB [j22]> select * from t2 ignore index (b) where b=repeat('a', 22);
+------------------------+------+
| b                      | i    |
+------------------------+------+
| aaaaaaaaaaaaaaaaaaaaaa |    1 |
+------------------------+------+
1 row in set (0.01 sec)

Comment by Sergei Petrunia [ 2018-08-19 ]

Looking at the on-disk data for the above example.

The execution here

(gdb) wher
  #0  Field_blob::sort_string (this=0x7fff70049680, to=0x7fff70006645 "aaaaaaaa", length=8) at /home/psergey/dev-git/10.2-r14/sql/field.cc:8326
  #1  0x00007fffe7668505 in myrocks::Rdb_key_def::pack_with_make_sort_key (this=0x7fff7003f160, fpi=0x7fff700466d0, field=0x7fff70049680, buf=0x7fff70046a50 '\245' <repeats 24 times>, "h4z\025", '\217' <repeats 12 times>, "\245", dst=0x7ffff413eea0, pack_ctx=0x7ffff413eed0) at /home/psergey/dev-git/10.2-r14/storage/rocksdb/rdb_datadic.cc:1264
  #2  0x00007fffe76676d3 in myrocks::Rdb_key_def::pack_field (this=0x7fff7003f160, field=0x7fff70049680, pack_info=0x7fff700466d0, tuple=0x7fff70006645 "aaaaaaaa", packed_tuple=0x7fff70006640 "", pack_buffer=0x7fff70046a50 '\245' <repeats 24 times>, "h4z\025", '\217' <repeats 12 times>, "\245", unpack_info=0x7fff7003ef38, n_null_fields=0x0) at /home/psergey/dev-git/10.2-r14/storage/rocksdb/rdb_datadic.cc:993
  #3  0x00007fffe7667e38 in myrocks::Rdb_key_def::pack_record (this=0x7fff7003f160, tbl=0x7fff70043480, pack_buffer=0x7fff70046a50 '\245' <repeats 24 times>, "h4z\025", '\217' <repeats 12 times>, "\245", record=0x7fff70049648 "\374\026", packed_tuple=0x7fff70006640 "", unpack_info=0x7fff7003ef38, should_store_row_debug_checksums=@0x7ffff413f03b: false, hidden_pk_id=@0x7ffff413f1a0: 3, n_key_parts=2, n_null_fields=0x0, ttl_pk_offset=0x0, ttl_bytes=0x7fff7003efe8 '\245' <repeats 16 times>, "\360<\002p\377\177") at /home/psergey/dev-git/10.2-r14/storage/rocksdb/rdb_datadic.cc:1149
  #4  0x00007fffe75e7701 in myrocks::ha_rocksdb::update_sk (this=0x7fff7003ea28, table_arg=0x7fff70043480, kd=..., row_info=...) at /home/psergey/dev-git/10.2-r14/storage/rocksdb/ha_rocksdb.cc:9630

produces this secondary key:

(gdb) print new_packed_size
  $97 = 23
(gdb) print m_sk_packed_tuple
  $98 = (uchar *) 0x7fff70006640 ""
(gdb) x/23c m_sk_packed_tuple
  0x7fff70006640:	0 '\000'	0 '\000'	1 '\001'	4 '\004'	1 '\001'	97 'a'	97 'a'	97 'a'
                        |======= indexnr ======================================|  |= not null == |   |====== value (3 chars) ==
  0x7fff70006648:	97 'a'	97 'a'	97 'a'	97 'a'	97 'a'	0 '\000'	22 '\026'	0 '\000'
                        |===== value (5 chars ) ==============| |===== length ============ |   |== hidden_pk byte #1 ==
  0x7fff70006650:	0 '\000'	0 '\000'	0 '\000'	0 '\000'	0 '\000'	0 '\000'	2 '\002'
                        |======== hidden_pk bytes 2 - 8. Hidden pk has 8 bytes ================================================|

note that it stores length=22. This is incorrect. Ming's patch fixes this.

Concerns:

  • this will change on-disk data format.
  • Will changing Field_blob::sort_string affect something else in the server?
Comment by Sergei Petrunia [ 2018-08-19 ]

Upstream is not affected, they store 8 as length:

(gdb) print new_packed_size
  $7 = 23
(gdb)  x/23c m_sk_packed_tuple
  0x7fffe00196c0:	0 '\000'	0 '\000'	1 '\001'	6 '\006'	1 '\001'	97 'a'	97 'a'	97 'a'
  0x7fffe00196c8:	97 'a'	97 'a'	97 'a'	97 'a'	97 'a'	0 '\000'	8 '\b'	0 '\000'
  0x7fffe00196d0:	0 '\000'	0 '\000'	0 '\000'	0 '\000'	0 '\000'	0 '\000'	1 '\001'

Field_blob::make_sort_key() has this piece:

      uint key_length= MY_MIN(blob_length, length);

It was added by this patch:

https://github.com/facebook/mysql-5.6/commit/fecef54c24b5b64037af1ecf2f8e6ced5a90a3ac

commit fecef54c24b5b64037af1ecf2f8e6ced5a90a3ac
Author: Manuel Ung <mung@fb.com>
Date:   Tue Jun 14 09:50:58 2016 -0700
 
    Fix length in Field_blob::make_sort_key
    
    Summary:
    For the binary collation in Field_blob::make_sort_key, we write out the length of the entire blob, even though the key itself may be shorter. We can miss out on duplicate key errors because of this:
...

Comment by Sergei Petrunia [ 2018-08-19 ]

Another question is, why does MyRocks not use variable-length-varchar encoding in this case?
Consider two cases:

create table t4 ( pk int not null primary key, b blob, key(b(100)))engine=rocksdb;
insert into t4 values (1, 'abc');

Here, secondary key length is 111 bytes.

create table t5 ( pk int not null primary key, b varbinary(120), key(b(100)))engine=rocksdb;
insert into t5 values (1, 'abc');

Here, key length is 18 bytes. (Which IIRC is 18= (indexnr 4) + (null_flag 1) + (8 chunk_size) + (4 pk_value))

Comment by Sergei Petrunia [ 2018-08-21 ]

The above is not a critical limitation.

Comment by Sergei Golubchik [ 2018-09-15 ]

There's a bunch of problems with the commit 87609324e0ee:

  • it breaks sys_vars.max_sort_length_func. The test result is correct, commit introduced a bug. Because Field_blob::make_sort_key works correctly, exactly as it should, although not how RocksDB wants it to. The correct fix would be to fix RocksDB not to expect Field_blob::make_sort_key to do something it should not.
  • It comes with a complex replication test, instead of a simple test, as in the comment above.
  • it changes on-disk format
  • it's in 10.3 only, so 10.2 is still not fixed

I've fixed first two points in my commit 66750323e57 (it's only in bb-10.3-serg now, so commit hash might still change). Don't know if it's possible to avoid on-disk format changes.

Comment by Sergei Golubchik [ 2020-10-19 ]

It's not in review, as far as I can see. There is nothing to review, there was an old patch that was wrong and was reverted two years ago. Since then I don't see any other attempts to fix the bug.

Generated at Thu Feb 08 08:31:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.