Changing replication

As @aman-bansal said, /removeNode is meant for replacing unhealthy nodes. It should not be used to remove the current leader of a group. If the leader is suddenly removed and the group no longer has a majority, the remaining members can get stuck trying to connect.

The process is to call /removeNode only on followers (leaders are presumably active and healthy). This applies to both Dgraph Zero and Dgraph Alpha groups.
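As a rough sketch of the "followers only" rule, the helper below checks a parsed /state response before building the /removeNode URL. The function name, the default Zero address (localhost:6080), and the assumed layout of the /state JSON (a "zeros" map, a "groups" map with "members", and a "leader" flag on each member) are my assumptions for illustration, so check them against the output of your own cluster.

```python
from urllib.parse import urlencode


def remove_node_url(state, group_id, node_id, zero_addr="http://localhost:6080"):
    """Build a /removeNode URL, refusing when the node is its group's leader.

    `state` is the parsed JSON from Zero's /state endpoint (layout assumed).
    group_id 0 is taken to mean the Zero group; other ids are Alpha groups.
    """
    if group_id == 0:
        members = state.get("zeros", {})
    else:
        members = state.get("groups", {}).get(str(group_id), {}).get("members", {})

    member = members.get(str(node_id))
    if member is None:
        raise KeyError(f"node {node_id} not found in group {group_id}")

    # Only followers should be removed; removing a leader can leave the
    # group without a majority.
    if member.get("leader", False):
        raise ValueError(f"node {node_id} is the leader of group {group_id}")

    query = urlencode({"id": node_id, "group": group_id})
    return f"{zero_addr}/removeNode?{query}"
```

You would fetch /state from a Zero, pass the parsed JSON in, and only request the returned URL if no exception is raised.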

Manual recovery

If your cluster is still stuck, you can wipe the volumes and restore. Alternatively, you can recover manually by keeping one of the Alpha p directories and following these steps:

  1. Check /state for the maxLeaseId and maxTxnTs information (see docs about /state).
  2. Keep one Alpha p directory and remove the other volumes.
  3. Start the Zeros.
    • Call /assign?what=uids&num=N where num is set to the value for maxLeaseId from step 1. This sets the UID lease for blank UID assignment.
    • Call /assign?what=timestamps&num=N where num is set to the value for maxTxnTs from step 1. This sets the latest txn timestamp.
  4. Copy the p directory to the respective Alpha volumes.
  5. Start the Alphas.
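Step 3 above can be sketched as a small helper that turns the values saved in step 1 into the two /assign URLs. The function name and the default Zero address (localhost:6080) are assumptions for illustration. It also assumes that maxLeaseId and maxTxnTs appear as top-level fields of the /state JSON, possibly encoded as strings.

```python
def assign_urls(state, zero_addr="http://localhost:6080"):
    """Return the /assign URLs for the UID lease and the txn timestamp.

    `state` is the parsed /state JSON captured before wiping the volumes.
    int() accepts both numeric and string-encoded values.
    """
    max_lease_id = int(state["maxLeaseId"])
    max_txn_ts = int(state["maxTxnTs"])
    return [
        # Sets the UID lease for blank UID assignment.
        f"{zero_addr}/assign?what=uids&num={max_lease_id}",
        # Sets the latest transaction timestamp.
        f"{zero_addr}/assign?what=timestamps&num={max_txn_ts}",
    ]
```

Requesting each URL against the freshly started Zero leader, before starting the Alphas, corresponds to the two bullets under step 3.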

This is similar to the steps for bulk loading, where the bulk loader outputs a p directory that you then copy to the Alpha instances (step 4).
