Alpha is out of sync

Hi,

I have a Dgraph cluster with 3 Zero nodes and 3 Alpha nodes running the latest version.

Some time ago, one of the Alpha nodes failed to start. Now, at least one Alpha node contains a dataset that differs from the others.

As a result, my Go application receives inconsistent query results, depending on which Alpha node is selected for querying.

Steps I Took to Fix It:

  1. Scaled down all Alphas to 0.
  2. Deleted the disk of Alpha 0.
  3. Scaled Alphas back to 3.
  4. Manually terminated Alpha 0 to ensure that Alpha 1 and Alpha 2 started first.
  5. Alpha 0 then started with a fresh disk.

However, the issue persists.

My Expectations:

  • If an Alpha or Zero node crashes, no data should be lost (which appears to be the case).
  • Once new instances come up, data should automatically resynchronize across Alphas to ensure consistency.

Additional Issue:

I attempted to remove the faulty Alpha using Ratel, but the command was mistakenly executed on my local Docker-based cluster instead of the intended staging Kubernetes cluster (connected via port-forwarding). As a result, my local cluster is now failing to start.

Request for Guidance:

How can I properly resynchronize the staging environment without recreating the cluster? While I could recreate it, I’m concerned that this issue might eventually occur in production, where I want to avoid rebuilding everything—even though I have backups.

Any suggestions on how to resolve this and ensure proper resynchronization?

Thanks!

Dgraph has no way of knowing that you have replaced the p directory; it just assumes there's no data. If you want to remove a bad Alpha, use the /removeNode API on Zero to remove it, then add a new one. If you go this route, the data is synchronised automatically.
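For reference, a minimal Go sketch of that call, assuming Zero's HTTP endpoint is reachable on its default port 6080; the id and group values are placeholders you would first look up in Zero's /state output:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	// Dgraph Zero exposes removeNode on its HTTP port (6080 by default).
	// id and group below are placeholders: read them from /state first.
	zero := "http://localhost:6080"
	resp, err := http.Get(fmt.Sprintf("%s/removeNode?id=%d&group=%d", zero, 3, 1))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // Zero reports whether the member was removed.
}
```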
But if you don't want to do all that and want a quick-and-dirty solution, rebuilding the database can be faster. All you need to do is delete the p and w directories from all the Alphas and put your desired new p directory in place (it could come from one of the "good" Alphas); a sketch of the copy step follows below.
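If you go that manual route, the copy itself is just a recursive file copy. A minimal sketch, assuming hypothetical volume paths /volumes/alpha1/p (a good Alpha) and /volumes/alpha0/p (the rebuilt one):

```go
package main

import (
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

// copyDir copies the contents of src into dst, creating directories as needed.
func copyDir(src, dst string) error {
	return filepath.WalkDir(src, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(src, path)
		if err != nil {
			return err
		}
		target := filepath.Join(dst, rel)
		if d.IsDir() {
			return os.MkdirAll(target, 0o755)
		}
		in, err := os.Open(path)
		if err != nil {
			return err
		}
		defer in.Close()
		out, err := os.Create(target)
		if err != nil {
			return err
		}
		defer out.Close()
		_, err = io.Copy(out, in)
		return err
	})
}

func main() {
	// Placeholder paths: the p directory from a "good" Alpha's volume,
	// copied onto the rebuilt Alpha after its old p and w directories
	// have been deleted and while the Alphas are stopped.
	if err := copyDir("/volumes/alpha1/p", "/volumes/alpha0/p"); err != nil {
		log.Fatal(err)
	}
}
```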

So you mean that if an Alpha is in a faulty state, I need to remove it from the cluster and create a new one?
Because Dgraph does not automatically sync its nodes when one Alpha is empty?

It seems to be like this: I have now removed all volumes besides one, and the result is that all Alphas start, but nearly all data is gone.

This does not make sense to me.
If one node is left with correct data, it should synchronize to the others.
Now it seems some data is synced and most data is lost, but that is not how I understand HA.
It means I do not need HA in this scenario; I could just run a single instance with the same effect.
Also, what I notice on my local machine is that when a node has issues starting, the Dgraph connection is very slow.

So you mean that if an Alpha is in a faulty state, I need to remove it from the cluster and create a new one?

Yes. If you have a bad Alpha and want to fix it, don't delete the p directory. Simple restarts and restarts with config changes are fine, but don't touch the p and w directories.

Because Dgraph does not automatically sync its nodes when one Alpha is empty?

As I said before, Dgraph has no way of knowing that all the data is gone by mistake (for all it knows, it could just be a new DB, or someone dropped the data). So you should remove the Alpha from its Raft group and then add a new one. In that case, a new snapshot is automatically streamed to it.
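To find the id and group of the faulty member, you can read Zero's /state endpoint. A small sketch, assuming Zero is reachable on localhost:6080 (e.g. via a port-forward); it decodes into a generic map and just dumps the groups section for inspection:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Zero's /state endpoint lists every group and its members; the bad
	// Alpha's id and group are what /removeNode needs. Decoding into a
	// generic map keeps this sketch independent of the exact schema.
	resp, err := http.Get("http://localhost:6080/state")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var state map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&state); err != nil {
		log.Fatal(err)
	}
	out, _ := json.MarshalIndent(state["groups"], "", "  ")
	fmt.Println(string(out)) // Inspect the member entries to find the faulty Alpha.
}
```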

It seems to be like this: I have now removed all volumes besides one, and the result is that all Alphas start, but nearly all data is gone.

You need to manually copy the good data from the good Alpha to the bad Alpha if you are rebuilding the Dgraph cluster yourself.

This does not make sense to me.
If one node is left with correct data, it should synchronize to the others.
Now it seems some data is synced and most data is lost, but that is not how I understand HA.
It means I do not need HA in this scenario; I could just run a single instance with the same effect.

Dgraph does provide you HA if you remove the node from the Alpha group and then add a new node. When you just delete the p directory, we currently don't have any way to detect that. If we could detect it, we could fetch a snapshot from the leader; that is essentially what removing and adding a new node does: it triggers a new snapshot.

Also, what I notice on my local machine is that when a node has issues starting, the Dgraph connection is very slow.

What issues?

Ah, ok, thanks a lot.
So my main issue was that I hadn't used the API to remove an Alpha. Got it.

As for the local machine, I sometimes notice this when I kill Docker before shutting down the containers. In that case some Alphas run into problems, and so far the only solution has been to recreate the local cluster.