[MXS-1180] cdc.py not producing anything with JSON format, but does with AVRO Created: 2017-03-13  Updated: 2017-03-17  Resolved: 2017-03-17

Status: Closed
Project: MariaDB MaxScale
Component/s: N/A
Affects Version/s: 2.0.5
Fix Version/s: 2.0.6

Type: Bug Priority: Major
Reporter: Josh Becker Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 14.04


Sprint: 2017-30

 Description   

I'm using cdc.py to print out changes for a particular table. Unfortunately, specifying no format (or JSON) produces one line and then nothing whatsoever, while the CPU for the process stays pegged at 100%.

/usr/bin/cdc.py -u cdc_user -ppassword -h 127.0.0.1 -P 4001 -f JSON ebth_production.items

If I switch the format to AVRO with the same command, it does print out information quite frequently (but it is binary, so not really useful).

It is worth noting that this particular table (items) has a LOT of changes constantly. If I choose another table (like users) it does print changes in JSON properly, but users doesn't have as many changes as items.

Let me know what information you need to further debug.



 Comments   
Comment by Josh Becker [ 2017-03-13 ]

If I change the JSON format to use read_avro() within cdc.py, it prints a lot of information.

I believe the overhead of using the JSONDecoder slows the entire process down so that no data is printed. Maybe eventually it would be printed, but after 5 minutes there is still nothing, and the CPU for the python3 process is pegged at 100%.

Comment by markus makela [ 2017-03-13 ]

Have you looked at the network output between the cdc.py client and MaxScale itself? Can you see the JSON data being streamed?

It sounds a bit like a parsing problem where malformed JSON data prevents the proper processing of the stream.

Comment by Josh Becker [ 2017-03-14 ]

Yes I believe that is it!

It fails on this line:

data = decoder.raw_decode(rbuf.decode('ascii'))

If I use read_avro instead here is a dump of the raw data that is causing the problem:
https://gist.github.com/Geesu/fb3030c8748604810dfcc31f09bfe53e

I'm not entirely sure why it won't decode it.

Comment by Josh Becker [ 2017-03-14 ]

And if I simply print out rbuf before the raw_decode it looks like this: https://gist.github.com/Geesu/5957122ae6a6f1e50ad584399b5f3a77

I'm not sure why there is a b' at the start of the string. When you remove both the leading b' and the trailing ', it parses just fine.

The actual exception:

'ascii' codec can't decode byte 0xc2 in position 2197: ordinal not in range(128)
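That exception can be reproduced in isolation (the byte values below are illustrative, not taken from the actual stream): 0xC2 begins a two-byte UTF-8 sequence, which is outside the ASCII range, so decoding the buffer as ASCII fails while decoding it as UTF-8 succeeds.

```python
# 0xC3 0xA9 is "é" and 0xC2 0xA0 is a non-breaking space in UTF-8;
# both contain bytes >= 0x80, which ASCII cannot represent.
raw = b'{"data": "caf\xc3\xa9\xc2\xa0here"}'

try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position ...

print(raw.decode('utf-8'))  # decodes cleanly
```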

Comment by markus makela [ 2017-03-14 ]

Ah, it's possibly caused by a non-ASCII character being encountered in the stream. As the data in Avro files is UTF-8, the \xc2\xa characters found in the data are valid.

Comment by Josh Becker [ 2017-03-14 ]

Is there any easy solution for this? I attempted to change decode to use 'utf-8' but no dice.

Interestingly enough, when the error is handled it continues to loop, but only one line is ever printed.
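One possible explanation for the single printed line (my assumption; the thread doesn't pin this down): json.JSONDecoder.raw_decode parses only the first JSON value in the string and returns the index where it stopped, so if the buffer is never advanced past that index, the same first record is the only one ever decoded.

```python
import json

decoder = json.JSONDecoder()
# Two newline-delimited JSON records, as MaxScale streams them.
buf = '{"sequence": 1}\n{"sequence": 2}\n'

obj, end = decoder.raw_decode(buf)
print(obj)  # {'sequence': 1} -- only the first record is parsed
print(end)  # 15 -- the caller must skip past this index itself
```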

Comment by Josh Becker [ 2017-03-14 ]

I wrote the raw bytes to a file during the second while True:

https://gist.github.com/Geesu/9ee61ab333762d3261dcd78b19a7dfa7

The bytes: https://dl.dropboxusercontent.com/u/15806454/cdc_raw_bytes.zip

Hopefully that will help?

Comment by markus makela [ 2017-03-15 ]

MaxScale seems to correctly handle the UTF-8 values but the client side cdc.py script fails to process it properly.

You could try testing with an alternative CDC client that I've written for NodeJS: https://github.com/markus456/cdc-funnel

I tested MaxScale with the cdc-funnel by inserting strings with UTF-8 characters in it:

create table test.t3(data text) character set "utf8";
insert into test.t3 values ("This is a space.");
insert into test.t3 values ("⦿☏☃☢");
insert into test.t3 values ("äöåǢ");

They were streamed correctly apart from the utf8mb4 characters:

data: {"domain":0,"server_id":3000,"sequence":7090,"event_number":1,"timestamp":1489569954,"event_type":"insert","data":"This is a space.","table":"test.t3"}
data: {"domain":0,"server_id":3000,"sequence":7091,"event_number":1,"timestamp":1489571182,"event_type":"insert","data":"⦿☏☃☢????????","table":"test.t3"}
data: {"domain":0,"server_id":3000,"sequence":7092,"event_number":1,"timestamp":1489571210,"event_type":"insert","data":"äöåǢ","table":"test.t3"}

Note: The emoticons replaced with the ???? values weren't accepted by Jira so I can't post them here.
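The mangled emoji are consistent with MySQL's legacy 3-byte "utf8" charset: characters outside the Basic Multilingual Plane need four bytes in UTF-8, which that charset cannot store (the four-byte-capable charset is utf8mb4). A quick illustration in Python:

```python
# The characters in the test strings above all fit in 3 UTF-8 bytes...
for ch in "⦿☏☃☢äöåǢ":
    assert len(ch.encode("utf-8")) <= 3

# ...but emoji and other astral-plane characters need 4 bytes,
# which MySQL's 3-byte "utf8" charset cannot store.
print(len("\U0001F600".encode("utf-8")))  # 4
```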

Comment by Josh Becker [ 2017-03-16 ]

Attempting to use the node app but it's not going well:

ubuntu@ip-172-30-0-204:~/cdc-funnel$ nodejs funnel.js

/home/ubuntu/cdc-funnel/funnel.js:29
return new Promise((resolve, reject) => {
^

Do you expect this to be fixed at some point in the python app? Seems like something it should support, no?

Comment by markus makela [ 2017-03-17 ]

Yes, the problems with the Python script will be fixed for the next release.

I suspect the problems with the NodeJS application are caused by an older version of NodeJS. Older versions of it don't support the Promise type.

Comment by markus makela [ 2017-03-17 ]

Printing the raw output sent by MaxScale appears to solve all formatting errors in addition to fixing the hangup problem that's caused by the JSON parsing failing for Unicode characters. As MaxScale already formats the output into newline delimited JSON, there's no real need to do that a second time inside the cdc.py script.

The updated script can be found here: https://github.com/mariadb-corporation/MaxScale/blob/2.0-cdc-fix/server/modules/protocol/examples/cdc.py

Comment by markus makela [ 2017-03-17 ]

The data from the script is now processed with the assumption that the JSON that MaxScale sends is valid. This removes the need to parse the output as MaxScale already sends newline delimited JSON.
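That approach can be sketched as follows (a simplified illustration with hypothetical names, not the actual cdc.py code): since MaxScale already emits one valid JSON document per line, the client can forward each line verbatim instead of re-parsing it.

```python
import sys

def stream_events(sock_file):
    """Forward newline-delimited JSON events without re-parsing them.

    sock_file is a file-like object wrapping the CDC socket,
    e.g. sock.makefile('r', encoding='utf-8').
    """
    for line in sock_file:
        # MaxScale already sends valid newline-delimited JSON,
        # so printing the raw line avoids the decoder entirely.
        sys.stdout.write(line)
        sys.stdout.flush()
```

This sidesteps both the ascii/UTF-8 decoding failure and the buffer-advancing concerns of incremental JSON parsing.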

Generated at Thu Feb 08 04:04:49 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.