[MDEV-7807] information_schema.processlist truncates queries with binary strings Created: 2015-03-20  Updated: 2015-05-13  Due: 2015-04-20  Resolved: 2015-05-07

Status: Closed
Project: MariaDB Server
Component/s: Character Sets
Affects Version/s: 10.0.16
Fix Version/s: 10.1.5

Type: Bug Priority: Major
Reporter: Ralf Neubauer Assignee: Alexander Barkov
Resolution: Fixed Votes: 0
Labels: None


 Description   

When I start a statement with binary data the column INFO of information_schema.processlist is truncated at the first byte > 128. show full processlist shows the full query in it's Info column.

My test query is generated and run by this perl script:

perl -e "use DBI; $dbh = DBI->connect(qw(dbi:mysql:host=myhost;database=mydb myuser mypass)); $dbh->do(qq{select sleep(30+0*?)}, undef, join '', map chr, 0..255);"

While the sleep is still asleep, use

select * from information_schema.processlist;
show full processlist;

or look at the info column specifically:

select info from information_schema.processlist where state = 'User sleep';

. With

select char_length(info), length(info) from information_schema.processlist where state = 'User sleep';

which results in

154 154

you can verify that the values are truncated inside the server, this is not a problem with the client connection or the client.

For comparison use

perl -e "use DBI; $dbh = DBI->connect(qw(dbi:mysql:host=myhost;database=mydb myuser mypass)); $dbh->do(qq{select sleep(30+0*?)}, undef, join '', map chr, map $_ % 128, 0..255);"

and

select char_length(info), length(info) from information_schema.processlist where state = 'User sleep';

with the result

291 291

The character set settings are:

'character_set_client', 'utf8'
'character_set_connection', 'utf8'
'character_set_database', 'latin1'
'character_set_filesystem', 'binary'
'character_set_results', 'utf8'
'character_set_server', 'latin1'
'character_set_system', 'utf8'



 Comments   
Comment by Sergei Golubchik [ 2015-03-20 ]

I think that's what happening: your character_set_client is utf8, the query is expected to be a valid utf8 string. But it is not. The 'info' column of the information_schema.processlist is also utf8. When the server tries to convert an invalid utf8 string to a valid utf8 string — you get a truncation at the first invalid character.

Comment by Elena Stepanova [ 2015-03-20 ]

There is also a warning when you select from information_schema.processlist:

MariaDB [test]> show warnings;
+---------+------+----------------------------------------------------------------------------------+
| Level   | Code | Message                                                                          |
+---------+------+----------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\x80\x81\x82\x83\x84\x85...' for column 'INFO' at row 1 |
+---------+------+----------------------------------------------------------------------------------+
1 row in set (0.00 sec)

Comment by Ralf Neubauer [ 2015-03-23 ]

That's what I thought. But notice that:

  • The string is truncated before being sent to the client, as verified by selecting LENGTH(Info) (or OCTET_LENGTH(info)). It LENGTH(Info) varies with the client connection charset, I would consider this a bug.
  • It works with show full processlist.
  • The query is totally legal (binary data is a fact of life – the same would happen with an insert of e.g. a PNG image into a blob column) and it should be possible to monitor any legal query. (In my case the perl script runs with character_set_client latin1 and character_set_connection latin1 – according to documentation 0x80 is a legal character in MySQL's latin1 (the Euro sign), so the query string is not only valid as binary data, but as text).
  • I see the warning, but I didn't do anything wrong, I think. The warning is helpful in diagnosing, what goes wrong in the Info column, but not helpful for seeing the running queries. Or for diagnosing, what queries an application is sending to the server (I detected the problem, because an application sent an UTF-8 byte sequence in a comment field on table creation).
Comment by Sergei Golubchik [ 2015-03-23 ]

Yes, the string is truncated before it is sent to client. It's truncated when it's stored in the INFORMATION_SCHEMA.PROCESSLIST table (well, not “stored” per se, as there is no real table, but something similar happens anyway). Because the column is utf8 and it can only return valid utf8 data.

This issue cannot be possibly solved completely. INFORMATION_SCHEMA.PROCESSLIST returns the utf8 representation of the statement. And it is possible to send a statement that cannot be represented in utf8. We will never be able to claim that INFORMATION_SCHEMA.PROCESSLIST always shows all statements in their original unmodified form.

But truncation is wrong, the correct behavior is to replace invalid characters with question marks. We have fixed it in 10.1 — the query string in INFORMATION_SCHEMA.PROCESSLIST is not truncated anymore.

Comment by Ralf Neubauer [ 2015-03-23 ]

Wouldn't it be more useful for some use cases to add columns Info_binary and Info_character_set with the binary octet string and the corresponding encoding as used by the client having started the query? This way no information would get lost, you could even extract binary data which made the query run slow, as for example a PNG bytestream.

Comment by Sergei Golubchik [ 2015-03-23 ]

Yes, you're completely right. We were discussing this issue internally just now and this idea was one of the possibilities. Another one would be to print binary strings in hex, like select sleep(30+0*x'8081828384'). Would that work for you?

Comment by Ralf Neubauer [ 2015-03-23 ]

Hex strings would work technically, but they make it harder to see what is going on, if you encounter them – I envision myself decoding 1000s of hex strings manually, and then maybe the character encoding of this decoded hex strings, while looking at processlists. The contents of strings are as important as the queries themselves, to see what the application is doing now, and as a German whose umlauts tend to poison text as 'binary' (depending on the circumstances), I fear many interesting strings could get 'hex-encrypted' and hardly readable. It certainly helps if the contents are human readable most of the time, even if they contain strange bytes. I think it would be easier to decode the complete statement text than to decode a handful of contained hex strings. But if there would be a 'human readable' column (UTF8 with question marks) and a 'machine readable' column (with hex strings), both would be possible at the same time.

Comment by Alexander Barkov [ 2015-03-26 ]

Ralf, we are thinking of implementing your idea with adding a new column INFO_BINARY
into INFORMATION_SCHEMA.PROCESSLIST. This change is fine for 10.1.
But we're afraid it's too late to change structures of INFORMATION_SCHEMA tables in 10.0, which is GA.

We can still add the new column into 10.0, under conditional compilation.
So one will be able to build MariaDB-10.0 with the new column support using an extra command line parameter, like this:

cmake -DPROCESSLIST_INFO_BINARY=1
make
make install

Would this solve the problem for you? Or, do you use binary packages?

Comment by Ralf Neubauer [ 2015-03-26 ]

We use binary packages, but I think we can live with the "show full processlist; and extract the interesting row in the client" workaround or the question marks, until 10.1 is released.

Comment by Alexander Barkov [ 2015-03-26 ]

Okey, I'm changing "fix version" to 10.1 then. Thanks.

Generated at Thu Feb 08 07:22:25 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.