[MDEV-24797] Column Compression - ERROR 1265 (01000): Data truncated for column Created: 2021-02-06  Updated: 2023-08-15  Resolved: 2023-08-15

Status: Closed
Project: MariaDB Server
Component/s: Data Definition - Alter Table, Data Manipulation - Insert
Affects Version/s: 10.3, 10.4, 10.5, 10.6, 10.7
Fix Version/s: 10.4.32, 10.5.23, 10.6.16, 10.10.7, 10.11.6, 11.0.4

Type: Bug Priority: Major
Reporter: Nuno Assignee: Alexander Barkov
Resolution: Duplicate Votes: 0
Labels: None

Issue Links:
Duplicate
duplicates MDEV-31724 Compressed varchar values lost on joi... Closed
Problem/Incident
is caused by MDEV-11371 Big column compressed Closed
Relates
relates to MDEV-22367 Remove write support for ROW_FORMAT=C... Closed
relates to MDEV-22839 ROW_FORMAT=COMPRESSED vs PAGE_COMPRES... Open

 Description   

Hello,

I am testing MariaDB 10.5, and this is my first time implementing column compression as part of an upgrade.

Converting normal columns to COMPRESSED columns seems to work OK.

However, when I run another ALTER TABLE later, it appears that I have problems with the compressed columns, even if I'm not changing them.

Example:

MariaDB [mydb]> ALTER TABLE `mytable`
CHANGE `question` `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL,
CHANGE `col2` `col2` VARCHAR(50) COMPRESSED CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

Query OK, 3708285 rows affected (1 min 8.067 sec)
Records: 3708285 Duplicates: 0 Warnings: 0

MariaDB [mydb]> ALTER TABLE `mytable`
CHANGE `col2` `col2` VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

ERROR 1265 (01000): Data truncated for column 'question' at row 8362

MariaDB [mydb]> ALTER TABLE `mytable`
CHANGE `question` `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL,
CHANGE `col2` `col2` VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

ERROR 1265 (01000): Data truncated for column 'question' at row 8362

Any ideas why?



 Comments   
Comment by Nuno [ 2021-02-06 ]

marko greenman serg GeoffMontee
Hey guys,
Let me know if any of you has any idea about this!
Have a great weekend.
Thank you very much.

Comment by Rick James [ 2021-08-02 ]

What was the text that was truncated?

In experiments (with other compression techniques), I have found that text strings shorter than 10 bytes expanded instead of shrinking.

Would any "expansion" cause the error? If so, I suggest there is a flaw in the overall design.
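For what it's worth, the expansion observation is easy to reproduce with zlib directly. A minimal Python sketch (plain zlib, not MariaDB's column-compression wrapper, whose header format differs):

```python
import zlib

# Short strings expand: zlib adds a 2-byte header and a 4-byte Adler-32
# checksum, so tiny inputs always come out larger than they went in.
short = b"hi"
assert len(zlib.compress(short)) > len(short)

# Longer repetitive data shrinks as expected.
long_run = b"x" * 127
assert len(zlib.compress(long_run)) < len(long_run)

print(len(short), len(zlib.compress(short)),
      len(long_run), len(zlib.compress(long_run)))
```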

Comment by Nuno [ 2021-08-02 ]

What was the text that was truncated?

No idea... what is the "row 8362" anyway?
The error was too vague to me.

Also, I didn't change the column length. It was a VARCHAR(500), and I kept it as a VARCHAR(500), but with the COMPRESSED option.

As I didn't want to risk my data, I gave up on using this feature.

Comment by Marko Mäkelä [ 2021-08-17 ]

nunop, given that the ALTER TABLE operation ought to have been executed with ALGORITHM=COPY, I think that you could have extracted the data for "row 8362" simply by something like

SELECT * FROM mytable LIMIT 1 OFFSET 8362;

I would expect this bug to be repeatable with a table containing only 1 row (and possibly only 1 column), either on INSERT or on ALTER TABLE. This is because the MDEV-11371 column compression feature is operating on a single column at a time. The error ought to be reported before the data even reaches the storage engine layer. Hence, any other rows that might exist in the table should not matter.

Can you try to produce a repeatable test case?

Comment by Marko Mäkelä [ 2021-08-17 ]

rjasdfiii, I investigated a little to check if your suggestion about expansion is plausible.

In zlib, there is the function compressBound() that returns an upper bound for data expansion. I do not see it being invoked in the column compression code. There is both a constant overhead (I remember that zlib can store a minimal or a larger header) and a variable overhead. Information theory says that n bits of arbitrary input have n bits of entropy, and thus can represent 2^n different values. If a compression algorithm can encode some of those 2^n values in less than n bits each, it must necessarily expand the representation of some values. If some input is "uncompressible", the compression algorithm would have to encode something like "n bits of literal data follows" followed by the n bits of uncompressed payload.

I had a quick look at the implementation and I cannot see anything obviously wrong in the function compress_zlib(). The stream.avail_out seems to correctly set the maximum output length. If the data does not fit, the function should return 0, which I suppose means that the column will be stored uncompressed.

I think that we definitely need a test case before we can fix this.
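The fallback described above (store the value uncompressed when compression does not pay off) can be sketched as follows. This is an illustrative Python model only; the one-byte 0/1 prefix is an assumed stand-in for MariaDB's real column header, not its actual on-disk format:

```python
import zlib

def pack_column(data: bytes) -> bytes:
    """Try to compress; if the result is not smaller, store as-is.
    The leading flag byte (1 = compressed, 0 = raw) is illustrative."""
    compressed = zlib.compress(data)
    if len(compressed) < len(data):
        return b"\x01" + compressed
    return b"\x00" + data

def unpack_column(packed: bytes) -> bytes:
    flag, payload = packed[0], packed[1:]
    return zlib.decompress(payload) if flag else payload

repetitive = b"x" * 127             # compresses well -> flag 1
incompressible = bytes(range(256))  # no redundancy -> stored raw, flag 0
for value in (repetitive, incompressible):
    assert unpack_column(pack_column(value)) == value
print(pack_column(repetitive)[0], pack_column(incompressible)[0])  # 1 0
```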

Comment by Nuno [ 2021-08-18 ]

Can you try to produce a repeatable test case?

I just tried on my snapshot database.

I altered "question" and "col2" to COMPRESSED, then changed "col2" back to uncompressed, just like in the initial example.

MariaDB > ALTER TABLE `mytable` CHANGE `col2` `col2` VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
ERROR 1265 (01000): Data truncated for column 'question' at row 693
MariaDB > SELECT question FROM mytable LIMIT 1 OFFSET 693;
+---------------------------+
| question                  |
+---------------------------+
| I Like Being Dark At Home |
+---------------------------+
1 row in set (0.001 sec)

Note that "question" is a VARCHAR(500).
I don't know whether this is really the "row 693" whose truncation it is complaining about. It doesn't make sense.

Comment by Marko Mäkelä [ 2021-08-19 ]

nunop, thank you. Can you try to produce a self-contained test case? My hypothesis is that this should be repeatable with only that "row 693" present in the table. I hope that my use of LIMIT and OFFSET are correct. For example, it could be off by one, depending on whether the first row is counted as row number 0 or 1 in each context. I would expect the self-contained test case to be something like this (with all the … replaced):

CREATE TABLE mytable (col2 …, question VARCHAR(…) COMPRESSED …);
INSERT INTO mytable SET col2=…, question=…;
ALTER TABLE mytable CHANGE col2 col2 VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

Comment by Nuno [ 2021-09-23 ]

marko

Just tried this again and got the error on the same row (693).

ALTER TABLE `mytable`
CHANGE `question` `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL,
CHANGE `col2` `col2` VARCHAR(50) COMPRESSED CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;
 
ALTER TABLE `mytable`
CHANGE `col2` `col2` VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

> ERROR 1265 (01000): Data truncated for column 'question' at row 693

I tried this too:

SELECT tid, question FROM mytable LIMIT 5 OFFSET 690;

So I can see more rows around 693.

CREATE TABLE mytable2 (
  col2 VARCHAR(50) COMPRESSED CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL
);
 
INSERT INTO mytable2
SELECT col2, question FROM mytable LIMIT 5 OFFSET 690;
 
SELECT * FROM mytable2;
 
ALTER TABLE mytable2 CHANGE col2 col2 VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

> ERROR 1265 (01000): Data truncated for column 'question' at row 3

I'll come back to you soon.

Comment by Nuno [ 2021-09-23 ]

OK, so this gives me (one of) the troublesome rows:

SELECT tid, question FROM forums_topics LIMIT 1 OFFSET 692;

The OFFSET has to be N-1, since offsets are zero-based.

It has 129 characters, so there's no reason for it to be "truncated".
And the characters are normal latin characters. No emojis or anything...

So I did the test in another way:

DROP TABLE mytable2;
 
CREATE TABLE mytable2 (
  col2 VARCHAR(50) COMPRESSED CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL
);
 
INSERT INTO mytable2
set col2='...',
question='...';
 
ALTER TABLE mytable2 CHANGE col2 col2 VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

And I get the same:

> #1265 - Data truncated for column 'question' at row 1

Even converting all to "xxx" does it:

DROP TABLE mytable2;
 
CREATE TABLE mytable2 (
  col2 VARCHAR(50) COMPRESSED CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `question` VARCHAR(500) COMPRESSED CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NULL DEFAULT NULL
);
 
INSERT INTO mytable2
set col2='...',
question='Xxx xxx xxx xxxxx xxxxx xxxx xxx xxxxx xxx xxxxx xxxx xx xxxx xxxx xxxx xxxxx? .X Xxxx Xxx Xxxxx Xxx Xxxxxx Xxxxxxxxx Xx X Xxx.';
 
ALTER TABLE mytable2 CHANGE col2 col2 VARCHAR(50) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL;

> #1265 - Data truncated for column 'question' at row 1

(you can use the last one as a test to try to reproduce on your side)

MariaDB 10.5.12 on CentOS 7.
(this test was on a Docker container with CentOS 7, but when I reproduced it the first time was on a baremetal CentOS 7)

Comment by Marko Mäkelä [ 2021-10-25 ]

Here is a simpler test case:

CREATE TABLE t1 (a VARCHAR(500) COMPRESSED CHARACTER SET utf8mb3);
INSERT INTO t1 SET a=REPEAT('x',127);
ALTER TABLE t1 FORCE;
DROP TABLE t1;

The warning is being issued in:

10.6 1193a793c40b806c6f1f007bbd87f4d9a73e686d

#1  0x0000555c2cc7d492 in Field::set_warning (this=0x7f9ac4072fc8, 
    level=level@entry=Sql_state_errno_level::WARN_LEVEL_WARN, 
    code=code@entry=1265, cut_increment=cut_increment@entry=1, 
    current_row=current_row@entry=0) at /mariadb/10.6/sql/field.cc:11107
#2  0x0000555c2cc905f8 in do_varstring2_mb (copy=0x7f9ac4017c60)
    at /mariadb/10.6/sql/field_conv.cc:597
#3  0x0000555c2cc8feba in do_copy_null (copy=<optimized out>)
    at /mariadb/10.6/sql/field_conv.cc:246
#4  0x0000555c2ca900b2 in copy_data_between_tables

The problem appears to be an incorrect condition in do_varstring2_mb() that does not take column compression into account:

  uint char_length= (copy->to_length - HA_KEY_BLOB_LENGTH) / cs->mbmaxlen;
  uint from_length= uint2korr(copy->from_ptr);
  const uchar *from_beg= copy->from_ptr + HA_KEY_BLOB_LENGTH;
  Well_formed_prefix prefix(cs, (char*) from_beg, from_length, char_length);
  if (prefix.length() < from_length)
  {
    if (current_thd->count_cuted_fields > CHECK_FIELD_EXPRESSION)
      copy->to_field->set_warning(Sql_condition::WARN_LEVEL_WARN,
                                  WARN_DATA_TRUNCATED, 1);
  }  

Note: copy->to_length is 1503 and from_length (apparently, the length of the compressed column) is 8 in this case. It was 40 for nunop’s example.
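The effect of running a character well-formedness check over compressed payload can be modelled in a few lines of Python. This is only an illustration of the diagnosis, not the server code: plain zlib stands in for MariaDB's column format, and utf8_well_formed_prefix_len is a rough stand-in for Well_formed_prefix:

```python
import zlib

def utf8_well_formed_prefix_len(b: bytes) -> int:
    """Length in bytes of the longest prefix of b that is valid UTF-8
    (a rough model of Well_formed_prefix for this illustration)."""
    for end in range(len(b), -1, -1):
        try:
            b[:end].decode("utf-8")
            return end
        except UnicodeDecodeError:
            continue
    return 0

# The value from the test case: 127 characters, well within VARCHAR(500).
value = ("x" * 127).encode("utf-8")

# What do_varstring2_mb() actually sees for a COMPRESSED column is the
# compressed representation (modelled here with plain zlib).
stored = zlib.compress(value)

# Compressed bytes are essentially never valid UTF-8, so the "well-formed
# prefix" comes out shorter than from_length, and the condition raises
# WARN_DATA_TRUNCATED even though the decompressed value fits the column.
prefix = utf8_well_formed_prefix_len(stored)
print(prefix, len(stored), prefix < len(stored))
```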

Comment by Rick James [ 2022-01-20 ]

Another theory: Could it be that zlib (or the wrapper around it) is returning 0 when the string cannot be compressed? That is, the (500) is not really what it is testing against.

Meanwhile, `REPEAT('x',127)` is highly compressible, hence not useful for borderline tests.

Note that .pdf and .jpg files are already compressed, thereby making them good candidates for 'large' test cases that are likely to expand when compressed.

Comment by Marko Mäkelä [ 2022-09-02 ]

For the record, zlib provides the function compressBound(sourceLen) that gives the worst-case compressed size (the expanded size) corresponding to an input data length sourceLen. In MDEV-27634 I learned that a zlib fork exists that makes use of the IBM System Z (s390x) instruction dfltcc. In that version, also compressBound() was changed. That could explain why some non-InnoDB compression tests are failing on s390x.

Generated at Thu Feb 08 09:32:44 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.