Ok, well, after 40h of dgraph outage, my team and I were able to apply enough patches to dgraph to get it back up and exporting.
If anyone from dgraph ever looks at this:
- we wrapped the raftwal storage interface with one that filters out raft peers that had already been removed according to the cluster membership. This allowed raft elections to succeed. (There's a rough sketch of the wrapper after this list.)
- the real issue is that the custom raftwal implementation (or something near it) was not removing peers that had been removed via the removeNode endpoint, even though the dgraph side of things (as opposed to the etcd/raft side) knew those peers were gone.
- fundamentally, the change we applied may be an acceptable safeguard if it turns out to be impractical to figure out why the peers were not being removed from storage in the first place.
- after this, we were still stalled on that group: it was tens of thousands of transactions (at least from its point of view) away from a usable readTS. This was very confusing; the effect was that only best-effort queries succeeded, and only if you hit a member of that group directly. For some reason it did not appear to be making any progress on advancing that timestamp.
- we then applied another patch that allowed an export to be taken without waiting for the readTS to catch up to the latest one according to the zeros. This let a full cluster export succeed, where previously it would wait indefinitely to reach a current readTS. (See the second sketch after this list.)
- we probably lost some changes in the WAL on that group, but after a couple of days of partial downtime we had to accept a slightly destructive solution over none at all.
- after all of the above, I was able to use the export to rebuild the 12-node cluster.
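
For anyone curious, here is roughly what the first patch looked like. This is a minimal sketch, not dgraph's actual code: it assumes the WAL-backed storage satisfies etcd's `raft.Storage` interface, and the names `filteredStorage` and `removed` are ours, not dgraph's.

```go
package raftwalpatch

import (
	"go.etcd.io/etcd/raft"
	pb "go.etcd.io/etcd/raft/raftpb"
)

// filteredStorage delegates to the real WAL-backed storage, but strips the
// IDs in removed from any ConfState it hands back to raft, so that peers
// dgraph already removed via removeNode no longer count toward elections.
type filteredStorage struct {
	raft.Storage                  // Entries, Term, FirstIndex, LastIndex pass through
	removed      map[uint64]bool  // peer IDs that the membership says are gone
}

// filter drops removed IDs from a peer list in place.
func (s *filteredStorage) filter(ids []uint64) []uint64 {
	out := ids[:0]
	for _, id := range ids {
		if !s.removed[id] {
			out = append(out, id)
		}
	}
	return out
}

// InitialState is what raft consults for the peer set on startup; removed
// peers listed here still get asked for votes and can stall elections.
// (Older raftpb versions call the voter list Nodes instead of Voters.)
func (s *filteredStorage) InitialState() (pb.HardState, pb.ConfState, error) {
	hs, cs, err := s.Storage.InitialState()
	if err != nil {
		return hs, cs, err
	}
	cs.Voters = s.filter(cs.Voters)
	cs.Learners = s.filter(cs.Learners)
	return hs, cs, nil
}

// Snapshot metadata also carries a ConfState; filter it the same way.
func (s *filteredStorage) Snapshot() (pb.Snapshot, error) {
	snap, err := s.Storage.Snapshot()
	if err != nil {
		return snap, err
	}
	snap.Metadata.ConfState.Voters = s.filter(snap.Metadata.ConfState.Voters)
	snap.Metadata.ConfState.Learners = s.filter(snap.Metadata.ConfState.Learners)
	return snap, nil
}
```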
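
The second patch is harder to show without quoting dgraph internals, but the shape of the change was roughly the following. Everything here is a hypothetical illustration: `waitForReadTs`, `exportReadTs`, `appliedTs`, and `skipWait` are names we made up for this post, not dgraph functions.

```go
package exportpatch

import (
	"context"
	"fmt"
	"time"
)

// waitForReadTs blocks until appliedTs() reaches target. This is the
// behaviour that never completed for us, because the stalled group had
// stopped advancing its applied timestamp.
func waitForReadTs(ctx context.Context, target uint64, appliedTs func() uint64) error {
	for appliedTs() < target {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(100 * time.Millisecond):
		}
	}
	return nil
}

// exportReadTs picks the timestamp to export at. With skipWait set (our
// patch), it settles for whatever timestamp the group has actually applied
// instead of blocking forever on the one the zeros handed out.
func exportReadTs(ctx context.Context, zeroTs uint64, appliedTs func() uint64, skipWait bool) (uint64, error) {
	if skipWait {
		ts := appliedTs()
		fmt.Printf("exporting at stale readTs %d (zero wanted %d)\n", ts, zeroTs)
		return ts, nil
	}
	if err := waitForReadTs(ctx, zeroTs, appliedTs); err != nil {
		return 0, err
	}
	return zeroTs, nil
}
```

The trade-off is the one mentioned above: exporting at a stale timestamp means any transactions committed after it are not in the export, which is why we treated this as a destructive last resort.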
All in all, this was a massive pain; it's quite unfortunate that we had to read dgraph code for two days to try to figure it out ourselves.