[MXS-1081] Avro data file corruption Created: 2017-01-16  Updated: 2017-03-09  Resolved: 2017-03-07

Status: Closed
Project: MariaDB MaxScale
Component/s: avrorouter
Affects Version/s: 2.0.3
Fix Version/s: 2.0.5, 2.1.1

Type: Bug Priority: Major
Reporter: Josh Becker Assignee: markus makela
Resolution: Fixed Votes: 0
Labels: None
Environment:

Ubuntu 14.04.5 LTS. mysql Ver 15.1 Distrib 10.1.20-MariaDB


Attachments: File ebth_production.followed_items_users.000001.avro     File ebth_production.followed_items_users.000001.avsc     File ebth_production.items.000001.avro     File ebth_production.items.000001.avsc     Zip Archive generated_avro_files.zip     File items.sql     File my.cnf    
Sprint: 2017-26, 2017-27, 2017-28, 2017-29

 Description   

We set up MaxScale on its own server to connect to our Galera cluster. Pretty straightforward configuration, taken directly from your guides:

https://gist.github.com/Geesu/949a9660a3068e9f48b2877ec395da06

We then used cdc_schema.go to create schemas from our existing MariaDB database (we don't have all the binlogs to replay).

So far so good, but once we started up MaxScale, the data written to the Avro files didn't make any sense whatsoever. The data is being corrupted somehow, and I suspect the problem is in the avrorouter.

An example Avro file converted to JSON with avro-tools:
java -jar ~/avro-tools-1.8.1.jar tojson ebth_production.followed_items_users.000001.avro

{"domain":0,"server_id":1,"sequence":28480571,"event_number":1,"timestamp":1484234657,"event_type":"insert","user_id":2158615,"item_id":4966469,"id":149400690,"deleted":0,"created_at":"34568-489768447-32668 489689435:32668:446036064","updated_at":"34568-489768447-32668 489689435:32668:446036064"}
{"domain":0,"server_id":1,"sequence":28480571,"event_number":2,"timestamp":1484234657,"event_type":"insert","user_id":2158615,"item_id":1480089267,"id":18,"deleted":0,"created_at":"34568-489768447-32668 489689435:32668:446036064","updated_at":"34568-489768447-32668 489689435:32668:446036064"}

This data is all garbage. For other Avro files the strings are completely mangled as well, although if you cat the file you can sometimes see text that should be in the record (but never the entire string).

I've attached the schema and the avro file.

I suspected it was using snappy compression, but the getmeta command (avro-tools) says the codec is null:

avro.codec null
avro.schema {"type":"record","name":"ChangeRecord","namespace":"MaxScaleChangeDataSchema.avro","fields":[{"name":"domain","type":{"type":"int"}},{"name":"server_id","type":{"type":"int"}},{"name":"sequence","type":{"type":"int"}},{"name":"event_number","type":{"type":"int"}},{"name":"timestamp","type":{"type":"int"}},{"name":"event_type","type":{"type":"enum","name":"EVENT_TYPES","symbols":["insert","update_before","update_after","delete"]}},{"name":"user_id","type":{"type":"int"}},{"name":"item_id","type":{"type":"int"}},{"name":"id","type":{"type":"int"}},{"name":"deleted","type":{"type":"int"}},{"name":"created_at","type":{"type":"string"}},{"name":"updated_at","type":{"type":"string"}}]}



 Comments   
Comment by markus makela [ 2017-01-17 ]

Can you provide the SHOW CREATE TABLE output for that table?

Comment by Josh Becker [ 2017-01-17 ]

CREATE TABLE `followed_items_users` (
`user_id` int(11) DEFAULT NULL,
`item_id` int(11) DEFAULT NULL,
`id` int(11) NOT NULL AUTO_INCREMENT,
`deleted` tinyint(1) DEFAULT '0',
`created_at` datetime DEFAULT NULL,
`updated_at` datetime DEFAULT NULL,
PRIMARY KEY (`id`),
UNIQUE KEY `index_followed_items_users_on_user_id_and_item_id` (`user_id`,`item_id`),
KEY `index_followed_items_users_on_user_id` (`user_id`),
KEY `index_followed_items_users_on_item_id` (`item_id`),
KEY `index_followed_items_users_on_deleted` (`deleted`),
KEY `index_followed_items_users_on_user_id_and_deleted` (`user_id`,`deleted`),
KEY `index_followed_items_users_on_user_id_and_item_id_and_deleted` (`user_id`,`item_id`,`deleted`),
CONSTRAINT `_fk_rails_0b88d46caf` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE,
CONSTRAINT `_fk_rails_202d32481c` FOREIGN KEY (`item_id`) REFERENCES `items` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=150756797 DEFAULT CHARSET=utf8

Comment by markus makela [ 2017-01-17 ]

This seems to be related to the temporal types. I've found a small bug which caused wrong values to be stored for the year part, but the other parts of datetime values seem to be working. Testing was done with MariaDB 10.1.19.

Comment by Josh Becker [ 2017-01-17 ]

I don't believe this is unique to temporal types. In a different table we have strings and they are complete garbage; you can see the Unicode characters being rendered as /x1239-style escapes in some of them.

Comment by markus makela [ 2017-01-18 ]

Can you provide an example table and data for it that reproduces this problem?

Comment by markus makela [ 2017-01-18 ]

Also if possible, please provide the my.cnf from the backend server where the binlogs were read from.

Comment by Josh Becker [ 2017-01-18 ]

Check out the attached ebth_production.items.000001.avsc, ebth_production.items.000001.avro, items.sql and my.cnf! Trying to print the ebth_production.items.000001.avro file as JSON fails with avro-tools.

Comment by markus makela [ 2017-02-06 ]

Managed to reproduce the corruption of the Avro files. Inserting empty values (INSERT INTO items VALUES ()) seems to trigger it, but it is not likely the root cause.

Comment by markus makela [ 2017-02-11 ]

I've done some testing with the utf8mb4 character set and this list of Unicode characters. The avrorouter converts the characters correctly if the column type is TEXT or another BLOB-like type. If the column is defined as VARCHAR, the conversion seems to fail and the values stored in the Avro files are empty. This seems to be caused by something in the Avro C library we use to store the values. A workaround when using Unicode characters is to declare the column as TEXT instead of VARCHAR.

Comment by Josh Becker [ 2017-02-11 ]

For the corruption we're seeing, it's just in utf8, not utf8mb4 specifically. Also the dates are wildly incorrect.

Comment by markus makela [ 2017-02-16 ]

We've done some fixes on the 2.0 branch which fix some of the problems mentioned here. Can you test if those changes fix the problems you are seeing? If building from source is not possible, we might be able to build a debug package for you to test.

Comment by Josh Becker [ 2017-02-20 ]

Do you have a detailed guide on compiling the 2.x branch? Unfortunately, linking MaxScale against the Avro C library is failing.

I've documented the process here: https://gist.github.com/Geesu/a4e02b029309e58bfa009b5c46022a95

Comment by markus makela [ 2017-02-21 ]

That error is caused by the lack of the -fPIC compiler flag. Here is the script we use to build MaxScale for testing on Debian-based systems. You can use it as a guide on how to set up your environment.

Comment by markus makela [ 2017-02-21 ]

We've built packages from the latest 2.0 commit that you can use for testing. You can find the packages here: http://max-tst-01.mariadb.com/ci-repository/2.0-feb21/mariadb-maxscale/

The packages were built from commit f7cc548093b037a573c3340f10db2284349b1c69.

Comment by Josh Becker [ 2017-02-27 ]

Unfortunately this didn't completely fix the problem :/ And it looks like data is being associated with the wrong columns. Example: https://gist.github.com/Geesu/2b7de05e733c8c596b742f17454eae3b

Also, the application is now crashing immediately. Log: https://gist.github.com/Geesu/d0d86d9e530bdce2d6e24ab4b642b543

Note: I do have to edit avro-conversion.ini manually given I don't have binlog 1. I'm using cdc_schema to dump the schema, then using the CHANGE MASTER command to start from the most recent binlog at position 4.

Comment by markus makela [ 2017-02-27 ]

You should add the start_index to router_options to start from a custom binlog position. I'd recommend stopping MaxScale, adding the start_index option to the configuration, removing the avro-conversion.ini and avro.index files located in /var/lib/maxscale/ and then starting MaxScale again. This will reset the conversion process and it should give you a clean start.
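For example, a minimal sketch of such a configuration (the service name, binlog directory and index value below are placeholders, not taken from this issue; check them against your own setup and the avrorouter documentation):

```ini
# Hypothetical example -- adjust names and values to your environment.
[avro-service]
type=service
router=avrorouter
# Start the conversion from binlog index 4 instead of index 1.
router_options=binlogdir=/var/lib/maxscale,start_index=4
```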

If this doesn't work, can you provide the CREATE TABLE statement and one INSERT statement that reproduce this problem? With this information, we should be able to pinpoint which part of the table definition is creating the problems.

Comment by Josh Becker [ 2017-02-27 ]

Thanks, I made that change. Interestingly enough, the avrorouter never seems to actually process the binlog once I make this change.

If i restart the service again, then it tries to run and fails: https://gist.github.com/Geesu/76e7d94849dda5f37f5c35baa56a7454

Comment by Josh Becker [ 2017-02-27 ]

The error with debug on: https://gist.github.com/Geesu/2ff7fc9378a52145a4d8cf05a71b7899

Configuration if it is helpful: https://gist.github.com/Geesu/e05e0f50c0464b558cc435d8140de5ea

Comment by Josh Becker [ 2017-02-27 ]

And here is an example avro file mentioned in the above messages. It seems that the avrorouter is creating an invalid avro file for some reason. avro-tools can't parse it either (but it can show the schema). ebth_production.items.000001.avro

Comment by markus makela [ 2017-03-02 ]

I managed to reproduce the corruption of the strings. The VARCHAR type lengths were being improperly interpreted, which caused the string corruption.
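For context, this kind of corruption matches how VARCHAR values are encoded in row-based binlog events: each value is length-prefixed, and the prefix is one byte when the column's maximum byte length fits in 255 bytes, two bytes (little-endian) otherwise. A minimal sketch of the decoding logic (not the actual MaxScale code):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one VARCHAR value from a row event. Returns the number of bytes
 * consumed from `ptr` (prefix + payload), copies the payload into `out`
 * (which must hold at least max_len + 1 bytes) and stores the payload
 * length in `value_len`. */
size_t decode_varchar(const uint8_t *ptr, size_t max_len,
                      char *out, size_t *value_len)
{
    size_t prefix;
    size_t len;

    if (max_len > 255)
    {
        /* Two-byte little-endian length prefix. */
        prefix = 2;
        len = ptr[0] | ((size_t)ptr[1] << 8);
    }
    else
    {
        /* One-byte length prefix. */
        prefix = 1;
        len = ptr[0];
    }

    memcpy(out, ptr + prefix, len);
    out[len] = '\0'; /* terminate so string APIs see the real end */
    *value_len = len;
    return prefix + len;
}
```

Reading a one-byte prefix where a two-byte prefix was written (or vice versa) shifts the read pointer, which then corrupts every column that follows in the row.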

We've fixed this bug and you can find packages containing the fix here: http://max-tst-01.mariadb.com/ci-repository/2.0-mar2/mariadb-maxscale/

The packages were built from commit 075ca42482d6f8265dab03999d49ad6b2bfdde05.

Comment by Josh Becker [ 2017-03-03 ]

Thanks for putting this together. Unfortunately I just tried this build and it is still failing. Here is the crash log w/all debug values on:
https://gist.github.com/Geesu/91f697d47a152abf824ab00e28102d2a

Interestingly enough, items doesn't error like it did before (but avro-tools still can't decode it to JSON).

And I've attached the avro files. I believe the avro file is getting corrupted somehow when it is created.

If it's helpful I'd be happy to give you access to the machine, or we could do a screenshare (Zoom, Screenhero, etc...).

generated_avro_files.zip

Comment by markus makela [ 2017-03-03 ]

Thank you for the update. I think there might be something in the combination of the table definition and the data inserted into it that causes this to happen. We already have the table definition, but we don't have the data for the table that causes these problems.

I've used the following SQL statements to create the data for testing this. Can you reproduce the crash with the following SQL?

CREATE TABLE `items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL,
  `aasm_state` varchar(255) DEFAULT NULL,
  `barcode` varchar(255) DEFAULT NULL,
  `description` text,
  `active` tinyint(1) DEFAULT '0',
  `sort_order` int(11) DEFAULT '0',
  `condition_type_id` int(11) DEFAULT NULL,
  `condition_details` text,
  `shippable` tinyint(1) DEFAULT '1',
  `shipping_weight` decimal(10,3) DEFAULT NULL,
  `rate_inherited_from` varchar(255) DEFAULT NULL,
  `commission_type_id` int(11) DEFAULT '1',
  `commission_rate` decimal(5,2) DEFAULT '35.00',
  `delta` tinyint(1) NOT NULL DEFAULT '1',
  `sale_id` int(11) DEFAULT NULL,
  `seller_id` int(11) DEFAULT NULL,
  `sale_ends_at` datetime DEFAULT NULL,
  `created_at` datetime NOT NULL,
  `updated_at` datetime(6) DEFAULT NULL,
  `bids_count` int(11) DEFAULT '0',
  `following_users_count` int(11) DEFAULT '0',
  `buyer_invoice_id` int(11) DEFAULT NULL,
  `expected_sale_price` int(11) DEFAULT '0',
  `fragile` tinyint(1) DEFAULT '0',
  `lot_size` int(11) DEFAULT '1',
  `seller_invoice_id` int(11) DEFAULT NULL,
  `weight` decimal(10,3) DEFAULT NULL,
  `width` decimal(10,3) DEFAULT NULL,
  `height` decimal(10,3) DEFAULT NULL,
  `depth` decimal(10,3) DEFAULT NULL,
  `measurement_details` text,
  `close_order` int(11) DEFAULT '0',
  `attachments_count` int(11) DEFAULT '0',
  `cached_completeness_score` decimal(10,3) DEFAULT '0.000',
  `warehouse_pickup` tinyint(1) DEFAULT '0',
  `warehouse_fee` int(11) DEFAULT NULL,
  `taxable` tinyint(1) DEFAULT '1',
  `extended` tinyint(1) DEFAULT '0',
  `location` varchar(255) DEFAULT NULL,
  `high_bid_amount` int(11) DEFAULT '0',
  `disclaimer` text,
  `classification_value` int(11) DEFAULT NULL,
  `original_sale_ends_at` datetime DEFAULT NULL,
  `nlg_source` int(11) DEFAULT NULL,
  `origin_type` int(11) DEFAULT NULL,
  `yank_reason` text,
  `able_to_pickup_at_dc` tinyint(1) DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `index_items_on_sale_id` (`sale_id`),
  KEY `index_items_on_buyer_invoice_id` (`buyer_invoice_id`),
  KEY `index_items_on_seller_id` (`seller_id`),
  KEY `index_items_on_seller_invoice_id` (`seller_invoice_id`),
  KEY `index_items_on_sort_order` (`sort_order`),
  KEY `sort_order_asc_items` (`sort_order`) USING BTREE,
  KEY `index_items_on_active_and_aasm_state` (`active`,`aasm_state`),
  KEY `index_items_on_barcode` (`barcode`),
  KEY `index_items_on_delta` (`delta`),
  KEY `index_items_on_cached_completeness_score` (`cached_completeness_score`),
  KEY `idx_items_on_active_aasm_state_bids_count` (`active`,`aasm_state`,`bids_count`)
) ENGINE=InnoDB AUTO_INCREMENT=5166011 DEFAULT CHARSET=utf8;
 
INSERT INTO `items` (
 name, aasm_state, barcode, description, active, sort_order, condition_type_id,
 condition_details, shippable, shipping_weight, rate_inherited_from,
 commission_type_id, commission_rate, delta, sale_id, seller_id, sale_ends_at,
 created_at, updated_at, bids_count, following_users_count, buyer_invoice_id,
 expected_sale_price, fragile, lot_size, seller_invoice_id, weight, width,
 height, depth, measurement_details, close_order, attachments_count,
 cached_completeness_score, warehouse_pickup, warehouse_fee, taxable, extended,
 location, high_bid_amount, disclaimer, classification_value,
 original_sale_ends_at, nlg_source, origin_type, yank_reason,
 able_to_pickup_at_dc
) VALUES (
 "this is a test", "some state", "11010111011", "This is a description of this
 item. It is longer than other text fields in this row. Usually descriptions
 have a more verbose description of the item in question but on rare occasions,
 the descriptions are not as verbose as other fields.", 0, 1, 1, "This is a
 shorter text field", 1, 10.23, "short text field", 1, 35.0, 1, 2, 3, now(),
 now(), now(), 2, 1, 1, 1, 1, 1, 1, 10.1, 10.1, 20.2, 30.3, "this text field
 describes the measurements", 1, 0, 0.000, 1, 2, 1, 1, "Finland", 1, "This is
 the disclaimer, you should read it", 1, now(), 1, 1, "Why was this yanked?",
 0);

If the data is correctly processed by the avrorouter, then I would guess that something in the inserted data is causing the problems. In that case, I would like to ask you to try to provide similar insert statements that will cause MaxScale to fail.

Comment by Josh Becker [ 2017-03-03 ]

Sadly I'm not sure which insert is triggering it. Is there a way I could privately send you the binlog in question? Or would that not be helpful?

I can also add more debug logging to the Avro portion of the code to print out what it reads from MariaDB, then attach the output?

Comment by markus makela [ 2017-03-03 ]

The actual binary logs would be very helpful. You can upload the binary logs to the MariaDB FTP server: http://ftp.mariadb.com/upload

Please name it MXS-1081 so we know it relates to this issue.

Comment by Josh Becker [ 2017-03-03 ]

I'm getting a 404 on that link.

Comment by Josh Becker [ 2017-03-03 ]

Also I'm diving into this a bit. I believe the issue may be in the BLOB processing. Note: The field is of type TEXT(65535) utf8.

For example, I have an item.description with the value of this in MariaDB:

"A late federal, black painted side chair with classical stenciled decoration, over a balloon seat with woven rush. Circa 1825."

If I simply use MXS_INFO to print that value out within the process_row_event_data function, I get the following result:

2017-03-03 19:22:21 info : blob: 'A late federal, black painted side chair with classical stenciled decoration, over a balloon seat with woven rush. Circa 1825.
=I' Len: 126

My guess is you need to null-terminate the end of the string, similar to what you did when checking for a variable string?
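For reference, the stray `=I` after the string is consistent with printing a length-prefixed, non-null-terminated buffer through a plain `%s`. A minimal sketch of the suggested fix (hypothetical helper, not the actual MaxScale code):

```c
#include <stdlib.h>
#include <string.h>

/* Copy a length-prefixed value into a freshly allocated, null-terminated
 * buffer so that string functions and log macros stop at the real end of
 * the value instead of running into adjacent row data. Caller frees. */
char *copy_terminated(const char *data, size_t len)
{
    char *out = malloc(len + 1);
    if (out != NULL)
    {
        memcpy(out, data, len);
        out[len] = '\0';
    }
    return out;
}
```

Alternatively, `printf("%.*s", (int)len, data)` prints exactly `len` bytes without needing a copy at all.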

Comment by Josh Becker [ 2017-03-03 ]

Also I believe there is still a bug in the section a fix was just pushed for: column_is_variable_string(map->column_types[i])

I have an ip_address and a user_agent column and sz is > 4000 in length, which doesn't make any sense.

2017-03-03 19:31:30 info : avro_value_get_by_name 'ip_address'
2017-03-03 19:31:30 info : variable_string_found: '' Len: 4696
2017-03-03 19:31:30 info : avro_value_get_by_name 'user_agent'
2017-03-03 19:31:30 info : variable_string_found: '' Len: 32581

The only change I made was printing the value:

https://gist.github.com/Geesu/773f4b52ba2fae09267e40d3c7355224

Comment by markus makela [ 2017-03-03 ]

Sorry, the link was wrong. Here's how to upload the file to the private FTP server using curl:

curl -T my-file.txt ftp://ftp.mariadb.com/private/

Just change my-file.txt to the name of the binlog file.

Comment by markus makela [ 2017-03-03 ]

Thank you for taking an in-depth look at the code. I think the wrong lengths could be caused by previous errors that cause the data pointer to be set at the wrong offset. I'll continue testing the code with various combinations of BLOB and non-BLOB data to see how they behave.

Comment by Josh Becker [ 2017-03-03 ]

I've uploaded both. And yes, I think that is exactly it. I'm not sure which if block actually adds the wrong length to ptr; I'm playing around as well, but obviously you have more domain knowledge.

The file names: MXS-1081-mysqld-bin.001293 and MXS-1081-mysqld-bin.001294

Comment by markus makela [ 2017-03-03 ]

Thank you, I'll take a look at them immediately.

Comment by Josh Becker [ 2017-03-03 ]

I also just uploaded all of the .avsc files, thought maybe you would need them. MXS-1081-avsc.zip

Comment by markus makela [ 2017-03-03 ]

I spotted that the binlogs contained an older format of the DATETIME type. This wasn't handled and it was causing a small miscalculation in the pointer offsets.

The data for the table described in items.sql still fails to be created: for some reason the file contains no values even though it should. I'll continue investigating this issue.
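For background, the pre-MySQL-5.6 DATETIME is stored as an eight-byte integer whose decimal digits spell out YYYYMMDDHHMMSS, whereas the newer format is a five-byte packed value plus optional fractional seconds; consuming the wrong number of bytes shifts the pointer offsets exactly as described above. A minimal decoding sketch (not the actual MaxScale code):

```c
#include <stdint.h>
#include <stdio.h>

/* Decode an old-format DATETIME value into "YYYY-MM-DD HH:MM:SS".
 * The integer 20170303192221 represents 2017-03-03 19:22:21. */
void decode_old_datetime(uint64_t value, char *out, size_t outlen)
{
    uint64_t date = value / 1000000; /* YYYYMMDD */
    uint64_t time = value % 1000000; /* HHMMSS */

    snprintf(out, outlen, "%04llu-%02llu-%02llu %02llu:%02llu:%02llu",
             (unsigned long long)(date / 10000),
             (unsigned long long)(date / 100 % 100),
             (unsigned long long)(date % 100),
             (unsigned long long)(time / 10000),
             (unsigned long long)(time / 100 % 100),
             (unsigned long long)(time % 100));
}
```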

Comment by markus makela [ 2017-03-05 ]

I managed to resolve the problem with the failure to create table data. This appeared to be caused by a possible bug in the Avro C library when inserting BLOBs of zero length. A workaround for this is to store a single zero byte whenever a NULL value would need to be stored as an Avro bytes type. With these changes the table is correctly created and processed.
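A sketch of the workaround described above (names are hypothetical, not the actual MaxScale code): before handing a value to the Avro C library as the bytes type, substitute a single zero byte whenever the value would otherwise be empty:

```c
#include <stddef.h>
#include <stdint.h>

/* One static zero byte used as a stand-in for NULL/empty BLOB values. */
static const uint8_t null_byte = 0;

/* Return the buffer and length to store as an Avro bytes value. For a
 * NULL or zero-length BLOB, return a single zero byte so the Avro C
 * library never receives an empty bytes value. */
const uint8_t *blob_to_avro_bytes(const uint8_t *data, size_t len,
                                  size_t *out_len)
{
    if (data == NULL || len == 0)
    {
        *out_len = 1;
        return &null_byte;
    }
    *out_len = len;
    return data;
}
```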

I've built packages from commit f2fc9b9d9f32596364f1eda639e9f837d8a9f050 which can be found here. Please test if this fixes the problems you are seeing.

Comment by Josh Becker [ 2017-03-05 ]

That did it!!! No crashes so far and it's creating the avro files properly!

Are there tests anywhere for this portion of the code base? I'd be happy to help contribute so these cases are handled.

Comment by markus makela [ 2017-03-05 ]

The tests for MaxScale are in a separate repository; the Avro data type related test is here. We appreciate all contributions that make MaxScale more stable, especially to the avrorouter. Sadly, the testing environment is not very simple to set up, so guaranteeing that the tests work isn't straightforward.

In any case, we'll start adding the test cases relating to this bug (probably at the start of next week) now that we know how to reproduce it and that it's fixed. I'd like to thank you for providing the valuable information that helped us debug this issue.

Comment by Josh Becker [ 2017-03-08 ]

I'm trying to compile this release w/CDC support and am running into issues. Do you know when 2.0.5 will be published?

Comment by markus makela [ 2017-03-08 ]

We've just begun the process of releasing the 2.0.5 packages; they should be on the mariadb.com website sometime later today. Meanwhile, you can get the packages straight from our testing repository: http://max-tst-01.mariadb.com/ci-repository/maxscale-2.0.5-tt1-release/mariadb-maxscale/

(The Downloads site on mariadb.com hasn't been updated yet: https://downloads.mariadb.com/MaxScale/)

Comment by Josh Becker [ 2017-03-08 ]

Thanks so much!

Comment by Josh Becker [ 2017-03-08 ]

Thanks, that package worked great. But now I am receiving this:

2017-03-08 16:29:46 error : Failed to read field value 'event_type', type 'enum' at file offset 24583511, record numer 37443.
2017-03-08 16:29:46 error : Reading Avro file failed with error 'MAXAVRO_ERR_MEMORY'.
2017-03-08 16:29:46 info : Notified 1 clients about new data.

I can open a new issue if it is unrelated, but wanted to let you know in case this fix introduced something else. The largest Avro file hit 26 MB within seconds.

Comment by markus makela [ 2017-03-09 ]

That does seem like a different thing.

The error in question is only returned when a call to malloc fails. This shouldn't happen with the default kernel settings, as memory is overcommitted, which in turn should prevent malloc from failing. It's also possible that an impossibly large amount of memory was requested, leading to the allocation failure.

Generated at Thu Feb 08 04:04:05 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.