For anyone else who decides to go down this route: we implemented a quick-and-dirty system that automatically reconnects to other Dgraph nodes, reducing downtime to virtually nothing. Here is the methodology we used.
We resolve a given set of names via the following:
const dns = require('dns')

// e.endpoint is one of the names in the configured set
dns.resolve4(e.endpoint, (err, addrs) => {
  /* implementation here */
})
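To turn those lookups into a pool of addresses we can sample from, the callback can be promisified and run over every name. Here is a minimal sketch of that step; DGRAPH_HOSTNAMES, resolveServers, and the resulting servers array are placeholder names of ours, not part of the original code:

const dns = require('dns')
const { promisify } = require('util')
const resolve4 = promisify(dns.resolve4)

// Hypothetical list of Dgraph Alpha hostnames to resolve.
const DGRAPH_HOSTNAMES = ['alpha1.internal', 'alpha2.internal', 'alpha3.internal']

// Resolve every hostname and flatten the results into one pool of
// IPv4 addresses; failed lookups are simply skipped here.
async function resolveServers(hostnames) {
  const results = await Promise.all(
    hostnames.map((name) => resolve4(name).catch(() => []))
  )
  return results.flat()
}

// servers ends up as something like ['10.0.1.12', '10.0.2.7', ...]
resolveServers(DGRAPH_HOSTNAMES).then((servers) => console.log(servers))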
We then create a client using a stub from a randomly picked server in our region:
const _ = require('lodash')
const grpc = require('grpc')
const dgraph = require('dgraph-js')

// Pick one address at random from the resolved pool
const endpoint = _.sample(servers)

const dgraph_stub_pkg = {
  endpoint,
  stub: new dgraph.DgraphClientStub(
    `${endpoint}:9080`,
    grpc.credentials.createInsecure(),
  ),
  in_flight: false, /* used for metrics */
  closing: false    /* and clean-up */
}
dgraph_stub_pkg.client =
  new dgraph.DgraphClient(dgraph_stub_pkg.stub)
We then test whether a given query/mutation fails, using the following:
// resolve and reject come from the enclosing Promise
dgraph_op_func(dgraph_stub_pkg.client).then((res) => {
  resolve(res) /* success, resolve promise here */
}).catch((err) => {
  // Reference: https://github.com/grpc/grpc/blob/master/doc/statuscodes.md
  // 14 => UNAVAILABLE
  if (err && err.code && err.details && err.code === 14) {
    retry(dgraph_op_func) /* retry logic here */
      .then((res) => resolve(res))
      .catch((err) => reject(err))
  } else {
    reject(err) /* some other error */
  }
})
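The retry() placeholder above is where we swap in a different server. A minimal sketch of what such a helper could look like, assuming a makeStubPkg() factory that wraps the stub-creation code above and a bounded number of attempts (both placeholder details of ours):

// Hypothetical retry helper: rebuild the client against a freshly sampled
// server and re-run the operation, up to maxAttempts times.
async function retry(dgraph_op_func, maxAttempts = 3) {
  let lastErr
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // makeStubPkg() would wrap the stub-creation code shown earlier,
    // sampling a different endpoint from the healthy pool.
    const pkg = makeStubPkg()
    try {
      return await dgraph_op_func(pkg.client)
    } catch (err) {
      lastErr = err
      if (!(err && err.code === 14)) throw err // only retry UNAVAILABLE
    }
  }
  throw lastErr
}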
We also keep a cache of temporarily unavailable addresses and have the system periodically retest the connection to those addresses, as sketched below.
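A minimal sketch of that bookkeeping, where the unavailable map and the checkHealth() probe are placeholders of ours rather than part of the original system:

// Addresses that recently returned UNAVAILABLE, mapped to when they failed.
const unavailable = new Map()

function markUnavailable(endpoint) {
  unavailable.set(endpoint, Date.now())
}

// Only sample from addresses not currently flagged as unavailable.
function healthyServers(servers) {
  return servers.filter((s) => !unavailable.has(s))
}

// Periodically retest flagged addresses and put them back into rotation
// once a probe succeeds. checkHealth() is a hypothetical probe, e.g. a
// trivial query against that endpoint.
setInterval(() => {
  for (const endpoint of unavailable.keys()) {
    checkHealth(endpoint)
      .then(() => unavailable.delete(endpoint))
      .catch(() => { /* still down, keep it flagged */ })
  }
}, 30 * 1000)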
We tested the technique by repeatedly killing random Docker containers. An image of our tests is attached.
