[MDEV-14971] MariaDB stops working when second Galera node joins Created: 2018-01-17  Updated: 2022-01-24  Resolved: 2022-01-24

Status: Closed
Project: MariaDB Server
Component/s: Galera SST
Affects Version/s: 10.2.12, 10.3.5
Fix Version/s: N/A

Type: Bug Priority: Major
Reporter: TAO ZHOU Assignee: Jan Lindström (Inactive)
Resolution: Incomplete Votes: 1
Labels: None
Environment:

FreeBSD 11.1
CentOS 7.4


Issue Links:
Relates
relates to MDEV-15399 Galera catches exception and terminat... Closed

 Description   

I have just upgraded MariaDB to 10.2.12 and enabled Galera.
When the second node joins, the first master node stops working and needs to be restarted to recover.

Here's my error log:

2018-01-17 12:01:33 34426956544 [Note] WSREP: (3153ad69, 'tcp://0.0.0.0:4567') connection established to f61103ed tcp://192.168.62.211:4567
2018-01-17 12:01:33 34426956544 [Note] WSREP: (3153ad69, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2018-01-17 12:01:33 34426956544 [Note] WSREP: (3153ad69, 'tcp://0.0.0.0:4567') connection established to ba2863c0 tcp://192.168.62.201:4567
2018-01-17 12:01:33 34426956544 [Note] WSREP: declaring ba2863c0 at tcp://192.168.62.201:4567 stable
2018-01-17 12:01:33 34426956544 [Note] WSREP: declaring f61103ed at tcp://192.168.62.211:4567 stable
2018-01-17 12:01:33 34426956544 [Warning] WSREP: 3153ad69 conflicting prims: my prim: view_id(PRIM,3153ad69,1) other prim: view_id(PRIM,ba2863c0,16)
2018-01-17 12:01:33 34426956544 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=3153ad69,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=53547,checksum=0,instances=
        3153ad69,prim=1,un=0,last_seq=53547,last_prim=view_id(PRIM,3153ad69,1),to_seq=53546,weight=1,segment=0
,state_msgs=
        3153ad69,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 3153ad69,prim=1,un=0,last_seq=53547,last_prim=view_id(PRIM,3153ad69,1),to_seq=53546,weight=1,segment=0
}}
,current_view=view(view_id(REG,3153ad69,17) memb {
        3153ad69,0
        ba2863c0,0
        f61103ed,0
} joined {
        ba2863c0,0
        f61103ed,0
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,3153ad69,1) memb {
        3153ad69,0
} joined {
} left {
} partitioned {
}),mtu=32636}
2018-01-17 12:01:33 34426956544 [Note] WSREP: {v=0,t=1,ut=255,o=4,s=0,sr=0,as=-1,f=4,src=ba2863c0,srcvid=view_id(REG,3153ad69,17),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=36492262,nl=(
)
} 64
2018-01-17 12:01:33 34426956544 [ERROR] WSREP: exception caused by message: {v=0,t=3,ut=255,o=1,s=0,sr=-1,as=0,f=4,src=f61103ed,srcvid=view_id(REG,3153ad69,17),insvid=view_id(UNKNOWN,00000000,0),ru=00000000,r=[-1,-1],fs=8,nl=(
)
}
 state after handling message: evs::proto(evs::proto(3153ad69, OPERATIONAL, view_id(REG,3153ad69,17)), OPERATIONAL) {
current_view=view(view_id(REG,3153ad69,17) memb {
        3153ad69,0
        ba2863c0,0
        f61103ed,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
fifo_seq=56867,
last_sent=0,
known:
3153ad69 at
{o=1,s=0,i=1,fs=-1,}
ba2863c0 at tcp://192.168.62.201:4567
{o=1,s=0,i=1,fs=36492264,}
f61103ed at tcp://192.168.62.211:4567
{o=1,s=0,i=1,fs=8,}
 }2018-01-17 12:01:33 34426956544 [ERROR] WSREP: exception from gcomm, backend must be restarted: 3153ad69 aborting due to conflicting prims: older overrides (FATAL)
         at gcomm/src/pc_proto.cpp:handle_state():982
2018-01-17 12:01:33 34426956544 [Note] WSREP: gcomm: terminating thread
2018-01-17 12:01:33 34426956544 [Note] WSREP: gcomm: joining thread
2018-01-17 12:01:33 34426956544 [Note] WSREP: gcomm: closing backend
2018-01-17 12:01:33 34426956544 [Note] WSREP: Forced PC close
2018-01-17 12:01:33 34426956544 [Warning] WSREP: discarding 2 messages from message index
2018-01-17 12:01:33 34426956544 [Note] WSREP: gcomm: closed
2018-01-17 12:01:33 35628642304 [Note] WSREP: Received self-leave message.
2018-01-17 12:01:33 35628642304 [Note] WSREP: comp msg error in core 53
2018-01-17 12:01:33 38097805824 [Warning] WSREP: Send action {0x0, 2338, TORDERED} returned -53 (Software caused connection abort)
2018-01-17 12:01:33 38099474176 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:33 35628642304 [Note] WSREP: Closing send monitor...
2018-01-17 12:01:33 35628642304 [Note] WSREP: Closed send monitor.
2018-01-17 12:01:33 35628642304 [Note] WSREP: Closing replication queue.
2018-01-17 12:01:33 35628642304 [Note] WSREP: Closing slave action queue.
2018-01-17 12:01:33 38099134208 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:33 38099472896 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:33 38099139328 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:33 38099467776 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 38099475456 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 38083915264 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 38099460096 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 38099654912 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 35628644864 [Note] WSREP: applier thread exiting (code:6)
2018-01-17 12:01:34 38095302656 [Note] WSREP: applier thread exiting (code:6)
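
The key line in the log above is the "conflicting prims" warning: this node reports its own primary view as view_id(PRIM,3153ad69,1) while the peers report view_id(PRIM,ba2863c0,16), which suggests that both sides already considered themselves part of a primary component when they met. A minimal sketch for pulling that pair out of an error log follows. The regular expression and the function name find_conflicting_prims are illustrative assumptions and not part of any Galera tooling.

```python
import re

# Matches the Galera "conflicting prims" warning as it appears in this log.
# The pattern is an illustration written for this report, not a Galera API.
CONFLICT_RE = re.compile(
    r"conflicting prims: my prim: view_id\(PRIM,(?P<my_uuid>[0-9a-f]+),(?P<my_seq>\d+)\)"
    r" other prim: view_id\(PRIM,(?P<other_uuid>[0-9a-f]+),(?P<other_seq>\d+)\)"
)


def find_conflicting_prims(log_text):
    """Return one dict per 'conflicting prims' warning found in log_text."""
    results = []
    for match in CONFLICT_RE.finditer(log_text):
        results.append({
            "my_prim": (match.group("my_uuid"), int(match.group("my_seq"))),
            "other_prim": (match.group("other_uuid"), int(match.group("other_seq"))),
        })
    return results


# The warning line copied from the log in this report.
SAMPLE = (
    "2018-01-17 12:01:33 34426956544 [Warning] WSREP: 3153ad69 conflicting prims: "
    "my prim: view_id(PRIM,3153ad69,1) other prim: view_id(PRIM,ba2863c0,16)"
)

if __name__ == "__main__":
    for conflict in find_conflicting_prims(SAMPLE):
        print(conflict)
```

Running the same function over the error logs of the other two nodes would show whether they report the mirror image of this conflict.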



 Comments   
Comment by TAO ZHOU [ 2018-01-17 ]

Before upgrading, I was running 10.2.10. Galera was enabled but running with only one node.
After upgrading, I changed safe_to_bootstrap to 1 and started with the --wsrep-new-cluster option.
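
For reference, the flag mentioned above is presumably the safe_to_bootstrap entry in Galera's grastate.dat file, which holds simple "key: value" lines. Below is a minimal sketch, assuming that layout, for checking the flag before starting a node with --wsrep-new-cluster. The helper names and the sample contents (including the all-zero uuid) are made up for illustration and are not taken from the reporter's system.

```python
def parse_grastate(text):
    """Parse the 'key: value' lines of a grastate.dat file into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the leading '# GALERA saved state' comment.
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state


def is_safe_to_bootstrap(text):
    """Return True only if the file marks this node as safe to bootstrap from."""
    return parse_grastate(text).get("safe_to_bootstrap") == "1"


# Hypothetical file contents; the values are placeholders.
SAMPLE_GRASTATE = """\
# GALERA saved state
version: 2.1
uuid:    00000000-0000-0000-0000-000000000000
seqno:   -1
safe_to_bootstrap: 1
"""

if __name__ == "__main__":
    print(parse_grastate(SAMPLE_GRASTATE))
    print("safe to bootstrap:", is_safe_to_bootstrap(SAMPLE_GRASTATE))
```

Starting a node with --wsrep-new-cluster creates a new primary component, so if the other nodes already form one, the result may resemble the "conflicting prims" situation in the log.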

Comment by Jan Lindström (Inactive) [ 2021-12-23 ]

laocius, does this problem still occur with a more recent version of MariaDB Server and the Galera library?
