Hi,
I have a Dgraph cluster with 3 Zero nodes and 3 Alpha nodes running the latest version.
Some time ago, one of the Alpha nodes failed to start. Now, at least one Alpha node contains a dataset that differs from the others.
As a result, my Go application receives inconsistent query results depending on which Alpha node serves the query.
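To confirm the divergence, I run the same query against each Alpha directly and diff the responses. A minimal Go sketch of that check (the sample payloads below are hypothetical, not actual data from my cluster):

```go
package main

import (
	"encoding/json"
	"fmt"
	"reflect"
)

// sameResult reports whether two JSON query responses are semantically
// equal, ignoring key order and whitespace.
func sameResult(a, b []byte) (bool, error) {
	var va, vb interface{}
	if err := json.Unmarshal(a, &va); err != nil {
		return false, err
	}
	if err := json.Unmarshal(b, &vb); err != nil {
		return false, err
	}
	return reflect.DeepEqual(va, vb), nil
}

func main() {
	// Hypothetical responses from two different Alpha nodes
	// for the same query.
	respAlpha1 := []byte(`{"users":[{"name":"alice"},{"name":"bob"}]}`)
	respAlpha2 := []byte(`{"users":[{"name":"alice"}]}`)

	same, err := sameResult(respAlpha1, respAlpha2)
	if err != nil {
		panic(err)
	}
	fmt.Println("responses match:", same) // prints "responses match: false"
}
```

In my case this reliably reports a mismatch for the same query depending on which Alpha answers.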
Steps I Took to Fix It:
- Scaled down all Alphas to 0.
- Deleted the disk of Alpha 0.
- Scaled Alphas back to 3.
- Manually terminated Alpha 0 to ensure that Alpha 1 and Alpha 2 started first.
- Alpha 0 then started with a fresh disk.
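In kubectl terms, the steps above looked roughly like this. The StatefulSet and PVC names here are assumptions for illustration; your manifests may use different names:

```shell
# Scale all Alphas down to 0 (assuming a StatefulSet named dgraph-alpha).
kubectl scale statefulset dgraph-alpha --replicas=0

# Delete Alpha 0's disk by removing its PersistentVolumeClaim
# (PVC names typically follow the <volume>-<statefulset>-<ordinal> pattern).
kubectl delete pvc datadir-dgraph-alpha-0

# Scale back up to 3 replicas.
kubectl scale statefulset dgraph-alpha --replicas=3

# Terminate Alpha 0 once so that Alphas 1 and 2 come up first;
# it then restarts with a fresh, empty disk.
kubectl delete pod dgraph-alpha-0
```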
However, the issue persists.
My Expectations:
- If an Alpha or Zero node crashes, no data should be lost (which appears to be the case).
- Once new instances come up, data should automatically resynchronize across Alphas to ensure consistency.
Additional Issue:
I attempted to remove the faulty Alpha using Ratel, but the command mistakenly ran against my local Docker-based cluster instead of the intended staging Kubernetes cluster (which I reach via port-forwarding). As a result, my local cluster now fails to start.
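For context, the removal I triggered from Ratel corresponds, as far as I understand, to a call to Zero's /removeNode HTTP endpoint, which is why it acted on whatever cluster localhost pointed at. Roughly (the group and id values here are hypothetical, and 6080 is Zero's default HTTP port):

```shell
# Permanently remove an Alpha from the cluster via Zero's HTTP endpoint.
# group and id are hypothetical placeholders for the faulty node.
curl "http://localhost:6080/removeNode?group=1&id=3"
```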
Request for Guidance:
How can I properly resynchronize the staging environment without recreating the cluster? I could recreate it, but I'm concerned that this issue might eventually occur in production, where I want to avoid rebuilding everything even though I have backups.
Any suggestions on how to resolve this and ensure proper resynchronization?
Thanks!