[MDEV-5165] Duplicate MDEV-4452 Created: 2013-10-21  Updated: 2014-10-17  Resolved: 2014-10-17

Status: Closed
Project: MariaDB Server
Component/s: OTHER
Affects Version/s: 10.0.4, 5.5.33a
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: VAROQUI Stephane Assignee: Elena Stepanova
Resolution: Duplicate Votes: 0
Labels: None
Environment:

Debian Squeeze


Issue Links:
Duplicate
duplicates MDEV-4452 Problem with FederatedX between two l... Closed

 Description   

Facing the same MDEV-4452 issue form a trigger that is triggered form a replication .

sur MYSQL1 :

répli "ccmstats_lucifer"

Error 'Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED' on query. Default database: 'ccmstats_shard03'. Query: 'replace into `ccmstats_shard03`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(3363428860,'2013-10-19 22:27:41','/download/start/descargar-14103-driver-de-video-de-lenovo-ibm-thinkpad-t30','','es.kioskea.net','http://static.ak.facebook.com/connect/xd_arbiter.php?version=27',9175071638627673410)'

répli "ccmstats_mysql1"

Error 'Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED' on query. Default database: 'ccmstats_shard07'. Query: 'replace into `ccmstats_shard07`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(1947240397,'2013-10-19 09:33:46','/sites/details/1089563.jjwxc.net','','www.commentcamarche.net','http://www.quanneiren.com/seo/?page=1068&url=1089563.jjwxc.net',2180121150729504318)'

sur LUCIFER :

répli "ccmstats_gertrude"

Error 'Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED' on query. Default database: 'ccmstats_shard13'. Query: 'replace into `ccmstats_shard13`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(1323859475,'2013-10-19 11:47:46','/forum/affich-1573832-pourquoi-mon-timer-ne-s-execute-pas','','codes-sources.commentcamarche.net','http://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fcodes-sources.commentcamarche.net%2Fforum%2Faffich-1573832-pourquoi-mon-timer-ne-s-execute-pas&ei=NVViUoq0PMa90QXl9ICACQ&usg=AFQjCNH10nmC3K-H9cMsSbl1rj7V7M_V7Q&bv',2736937289062304772)'

répli "ccmstats_lucifer"

Error 'Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED' on query. Default database: 'ccmstats_shard10'. Query: 'replace into `ccmstats_shard10`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(2903347022,'2013-10-19 09:27:37','/news/12118-firefox-la-nuova-versione-nel-play-store','','it.kioskea.net','http://184.84.222.35/news/12118-Firefox%2C+la+nuova+versione+nel+Play+Store',9684943911008351985)'

sur GERTRUDE :

Error 'Got timeout reading communication packets' on query. Default database: 'ccmstats_shard24'. Query: 'replace into `ccmstats_shard24`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(3192578384,'2013-10-19 04:04:09','/download/start/descargar-16307-driver-de-audio-de-placa-base-pcchips-p27g','','es.kioskea.net','http://static.ak.facebook.com/connect/xd_arbiter.php?version=27',6488389282267718615)'

The master is 5.5.33a

CREATE TABLE `domaine_federated` (
`domaine` varchar(128) DEFAULT NULL,
`idDomaine` tinyint(4) NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`idDomaine`),
UNIQUE KEY `idx_domaine` (`domaine`)

) ENGINE=federated CONNECTION='PUMA/domaine'
DEFAULT CHARSET=latin1 ;

CREATE TABLE `url_federated` (
`url` char(255) DEFAULT NULL,
`idUrl` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`idUrl`),
UNIQUE KEY `idx_url` (`url`)
) ENGINE=federated CONNECTION='PUMA/url'
DEFAULT CHARSET=latin1 ;

The slave have a before insert trigger

define like this

CREATE TABLE `ccmreferers` (
`ip` int(4) unsigned NOT NULL,
`date` datetime NOT NULL,
`firstseenon` varchar(255) NOT NULL,
`keyword` varchar(128) NOT NULL,
`domaine` varchar(128) NOT NULL,
`referer` varchar(255) NOT NULL,
`keyword_crc64` bigint(20) unsigned NOT NULL,
PRIMARY KEY (`ip`,`keyword`),
KEY `date` (`date`),
KEY `referer` (`referer`,`firstseenon`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

DECLARE l_idUrl INT unsigned DEFAULT 0;
DECLARE i_idDom TINYINT unsigned DEFAULT 0;
DECLARE c_kw VARCHAR(128);
IF NOT EXISTS (SELECT 1 FROM ccmreferers
WHERE keyword = NEW.keyword AND ip = NEW.ip)
THEN
SET l_idUrl = ccmstats.GetIdUrl(NEW.firstseenon);
SET i_idDom = ccmstats.GetIdDomaine(NEW.domaine);
INSERT INTO stats_url_cur
SET keyword_crc64 = NEW.keyword_crc64,
DATE = NEW.date,
idUrl = l_idUrl,
idDomaine = i_idDom,
nb = 1
ON DUPLICATE KEY UPDATE nb=nb+1;
IF LENGTH(NEW.keyword) > 0 THEN
SET c_kw = REPLACE(TRIM(NEW.keyword),' ',' ');
INSERT INTO stats_url_kw_cur
SET keyword_crc64 = NEW.keyword_crc64,
DATE = NEW.date,
idUrl = l_idUrl,
keyword = c_kw,
idDomaine = i_idDom,
nb = 1
ON DUPLICATE KEY UPDATE nb=nb+1;
END IF;
END IF;

delimiter //
drop function GetIdUrl //

CREATE DEFINER=`root`@`%` FUNCTION `GetIdUrl`(l_strUrl varchar(255)) RETURNS int(11)
READS SQL DATA
BEGIN
DECLARE l_IdUrl int default 0;
SELECT IdUrl into l_IdUrl FROM `url` where Url = l_strUrl;
IF l_IdUrl=0 then
INSERT IGNORE INTO url_federated (Url) VALUES (l_strUrl);
SELECT IdUrl into l_IdUrl FROM `url_federated` where Url = l_strUrl ;
END IF;
RETURN l_IdUrl;
END;
//

delimiter //
drop function GetIdDomaine //

CREATE DEFINER=`root`@`%` FUNCTION `GetIdDomaine`(l_strDomaine char(50)) RETURNS int(11)
READS SQL DATA
BEGIN
DECLARE l_IdDomaine int default 0;
SELECT IdDomaine into l_IdDomaine FROM domaine where Domaine = l_strDomaine;
IF l_IdDomaine=0 THEN
INSERT IGNORE INTO domaine_federated (Domaine) VALUES (l_strDomaine);

SELECT IdDomaine into l_IdDomaine FROM domaine_federated where Domaine = l_strDomaine;

END IF;
RETURN l_IdDomaine;
END;
//



 Comments   
Comment by VAROQUI Stephane [ 2013-10-23 ]

Helena , you can notice that all slaves are pointing to the same remote table that slaves just break randomly at no predefine time . We have set a cron to select from the federated table to be able to produce other activity on the remote table .

That cron also get the same error .

Notice as well that the time that each slave breaks is very random

Just restarting the replication works so replaying the same query at a later time is fine

Comment by VAROQUI Stephane [ 2013-10-23 ]

Server that get the physical table is in reality MariaDB 5.5.31

Threads_cached 49
Threads_connected 20
Threads_created 900
Threads_running 6
back_log 128

Uptime =9077854
Uptime_since_flush_status =9077854

Variables

thread_cache_size = 64
thread_handling = one-thread-per-connection

net.core.somaxconn = 4096
open_files_limit =132079

Comment by Elena Stepanova [ 2013-10-23 ]

Hi Stephane,

Do you understand correctly from your comment above that the problem is sporadic, not reliably reproducible with the provided structures and queries?
If so, was there any particular reason why you chose to file a new bug rather than add a comment to MDEV-4452 which as you said it is a duplicate of?

Comment by VAROQUI Stephane [ 2013-10-23 ]

Elena Yes sporadic issue and i don't really know about the policy in case of duplicate issue that does not provide any solution at the end .

I'm more coming to you for help in a methodology to help founding the cause

The error clearly state it could be a network issue on the remote server but the applications using that same server does not suffer the same issues. that server that hold the federated table is a master so constantly used and every error from the application is logged into syslog .

We only have one error per day happening every day at the same time . The client is investigating this. but it does not match timming with our sporadic issue

codes-sources.commentcamarche.net web18 2013-10-19 04:12:00 /profile/user/Bul3:/var/www/vhosts/www.commentcamarche.net/include/ccmfunctions.php3:737 - Acc?s ? la base

To come back to network potential isues i have the client to come back to a more conservative tcp setting

From this setup to
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 60
net.ipv4.tcp_tw_reuse = 1

to this setup
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fin_timeout = 25
net.ipv4.tcp_tw_reuse = 0

And we will see....

On the master looking at the query monitoring we do need see any queries popping at special unexpected response time.
when errors happened.

The only suboptimal status are

Aborted_connects | 157 |
Aborted_clients | 2932873 |

If we state the issue is not on the server that hold the table but more on servers that use the federated table

what we observe is an deadlock on the slave and 30 minutes later the replication is broken
That's a good hint

Oct 19 11:21:13 lucifer mysqld: 131019 11:21:13 [ERROR] Master 'ccmstats_gertrude': Slave SQL: Error 'Deadlock found when trying to get lock; try restarting transaction' on query. Default database: 'ccmstats_shard10'. Query: 'replace into `ccmstats_shard10`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(1315980540,'2013-10-19 11:21:13','/forum/affich-4000625-prime-noel-condition-d-attribution','','droit-finances.commentcamarche.net','https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&ved=0CFwQFjAE&url=http%3A%2F%2Fdroit-finances.commentcamarche.net%2Fforum%2Faffich-4000625-prime-noel-condition-d-attribution&ei=KE1iUsfIHIeu0QXt5YGYBg&usg=AFQjCNGauJn6T9vDHk4LSLUQOI43iD_NwQ&s',5321267248370821353)', Internal MariaDB error code: 1213

Oct 19 11:47:46 lucifer mysqld: 131019 11:47:46 [ERROR] Master 'ccmstats_gertrude': Slave SQL: Error 'Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED' on query. Default database: 'ccmstats_shard13'. Query: 'replace into `ccmstats_shard13`.`ccmreferers`(`ip`,`date`,`firstseenon`,`keyword`,`domaine`,`referer`,`keyword_crc64`)values(1323859475,'2013-10-19 11:47:46','/forum/affich-1573832-pourquoi-mon-timer-ne-s-execute-pas','','codes-sources.commentcamarche.net','http://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&ved=0CDQQFjAB&url=http%3A%2F%2Fcodes-sources.commentcamarche.net%2Fforum%2Faffich-1573832-pourquoi-mon-timer-ne-s-execute-pas&ei=NVViUoq0PMa90QXl9ICACQ&usg=AFQjCNH10nmC3K-H9cMsSbl1rj7V7M_V7Q&bv',2736937289062304772)', Internal MariaDB error code: 1296
Oct 19 11:47:46 lucifer mysqld: 131019 11:47:46 [Warning] Master 'ccmstats_gertrude': Slave: Got error 10000 'Error on remote system: 2006: MySQL server has gone away' from FEDERATED Error_code: 1296
Oct 19 11:47:46 lucifer mysqld: 131019 11:47:46 [ERROR] Master 'ccmstats_gertrude': Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.000017' position 17754866

Asking the client to check if this predictable after a deadlock a replication break .

Now in any case . Looking at federatedX code you can see that federated have auto reconnect flag on . the question is what happen if it failed, does it retry ? and if the query is inside a replication thread so what ? Any way to make this stable dispite in our case i'm really not suspecting a network issue .

Thanks

Comment by VAROQUI Stephane [ 2013-10-24 ]

More input found on some various error log

We sporadicly get the following error as well on some MariaDB 10 slave that get the federated tables

mysqldump: Couldn't execute 'FLUSH /*!40101 LOCAL */ TABLES': Got an error writing communication packets (1160)

It could related to this bug reported on MySQL and to the same issue we are facing

http://bugs.mysql.com/bug.php?id=66184

Comment by VAROQUI Stephane [ 2013-10-30 ]

We have replace FederatedX with Spider Engine and the issue has show up again but getting replication stopped on different error messages .

One server have Last_SQL_Error: Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'ccmstats_shard12'. Query: 'replace into `ccmstats_shard12`.`ccmreferers

And an other one have

Last_SQL_Error: Error 'Remote MySQL server has gone away' on query. Default database: 'ccmstats_shard05'. Query: 'replace into `ccmstats_shard05`.`ccmreferers

Now one other server is fine and never stopped his replication and the difference is that this server does not have load on it . So this state that the issue is not on the remote server but more caused by activity on the one that his holding the linked table.

In a more generic way to fixe this we have slave-skip-errors, can we have slave-retry-errors

Comment by VAROQUI Stephane [ 2013-12-19 ]

Fixed that was an issue in the trigger code

Comment by Elena Stepanova [ 2014-10-17 ]

I added a test case to MDEV-4452, so I'm closing this bug as a duplicate.

Generated at Thu Feb 08 07:02:11 UTC 2024 using Jira 8.20.16#820016-sha1:9d11dbea5f4be3d4cc21f03a88dd11d8c8687422.