After the restart, schema synchronization failed. I restarted the rejoining node several times. The alpha leader's log shows that every time the leader sends a synchronization message, it sends it to the IP the node had before the restart, which results in a timeout.
Before restart:
Name: dgraph-alpha-2
Namespace: crm-test
Priority: 0
Node: yq01-qianmo-f12-ssd-com-61-169-35.yq01.baidu.com/10.61.169.35
Start Time: Fri, 19 Mar 2021 14:48:21 +0800
Labels: app=dgraph-alpha
        controller-revision-hash=dgraph-alpha-7759bb686f
        statefulset.kubernetes.io/pod-name=dgraph-alpha-2
Annotations: cni.projectcalico.org/podIP: 192.168.177.30/32
             cni.projectcalico.org/podIPs: 192.168.177.30/32
             kubectl.kubernetes.io/restartedAt: 2021-03-16T22:41:01+08:00
             sidecar.istio.io/inject: false
Status: Running
IP: 192.168.177.30
Controlled By: StatefulSet/dgraph-alpha
Alpha leader log:
W0319 06:48:30.798970 19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.28:7080: i/o timeout"
W0319 06:48:31.874221 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:41.974243 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:52.074262 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
After restart:
Name: dgraph-alpha-2
Namespace: crm-test
Priority: 0
Node: yq01-qianmo-f12-ssd-com-61-169-35.yq01.baidu.com/10.61.169.35
Start Time: Fri, 19 Mar 2021 14:56:41 +0800
Labels: app=dgraph-alpha
        controller-revision-hash=dgraph-alpha-7759bb686f
        statefulset.kubernetes.io/pod-name=dgraph-alpha-2
Annotations: cni.projectcalico.org/podIP: 192.168.177.36/32
             cni.projectcalico.org/podIPs: 192.168.177.36/32
             kubectl.kubernetes.io/restartedAt: 2021-03-16T22:41:01+08:00
             sidecar.istio.io/inject: false
Status: Running
IP: 192.168.177.36
Controlled By: StatefulSet/dgraph-alpha
Alpha leader log (includes the entries from before the restart):
W0319 06:48:30.798970 19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.28:7080: i/o timeout"
W0319 06:48:31.874221 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:41.974243 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:48:52.074262 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
I0319 06:49:03.885935 19 log.go:34] Block cache metrics: hit: 58342 miss: 1581724 keys-added: 340853 keys-updated: 15 keys-evicted: 184922 cost-added: 1510751754 cost-evicted: 812820598 sets-dropped: 0 sets-rejected: 1240578 gets-dropped: 21568 gets-kept: 1584064 gets-total: 1640066 hit-ratio: 0.04
I0319 06:54:03.885941 19 log.go:34] Block cache metrics: hit: 58342 miss: 1581724 keys-added: 340853 keys-updated: 15 keys-evicted: 184922 cost-added: 1510751754 cost-evicted: 812820598 sets-dropped: 0 sets-rejected: 1240578 gets-dropped: 21568 gets-kept: 1584064 gets-total: 1640066 hit-ratio: 0.04
W0319 06:56:55.091999 19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.30:7080: i/o timeout"
W0319 06:56:56.174319 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:57:06.274300 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
W0319 06:57:16.374332 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection
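
To make the mismatch easier to see, here is a minimal sketch that pulls the dialed addresses out of the leader log and compares them with the pod's current IP. It is not part of Dgraph; the helper names (`dialed_ips`, `stale_ips`) are my own, and the regular expression assumes the dial errors keep the `dial tcp <ip>:<port>` form shown in the log above.

```python
import re

# Matches the address in gRPC dial errors such as
# 'Error while dialing dial tcp 192.168.177.30:7080: i/o timeout'.
# Assumption: the error text keeps this exact "dial tcp <ip>:<port>" form.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dialed_ips(log_lines):
    """Return the distinct IPs the leader tried to dial, in order of first appearance."""
    seen = []
    for line in log_lines:
        match = DIAL_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


def stale_ips(log_lines, current_pod_ip):
    """Return dialed IPs that differ from the pod's current IP."""
    return [ip for ip in dialed_ips(log_lines) if ip != current_pod_ip]


# Three lines copied from the leader log above.
leader_log = [
    'W0319 06:48:30.798970 19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.28:7080: i/o timeout"',
    'W0319 06:56:55.091999 19 node.go:420] Unable to send message to peer: 0x5. Error: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.177.30:7080: i/o timeout"',
    'W0319 06:56:56.174319 19 node.go:420] Unable to send message to peer: 0x5. Error: Unhealthy connection',
]

# IP of dgraph-alpha-2 after the restart, from the pod description above.
current_ip = "192.168.177.36"

print(stale_ips(leader_log, current_ip))
```

With the lines above, the leader never dials 192.168.177.36: at 06:56:55 it is still dialing 192.168.177.30, the IP the pod had before the restart.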