In PR #8774, to fix a WAL replay issue, the Rollup process uses maxCommitTs + 1 as the new version when saving to Badger for any readTs other than math.MaxUint64.
What exactly is the WAL replay issue here? Could it be described in more detail?
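For reference, my understanding of that behaviour can be sketched roughly like this (illustrative only, not the actual code from the PR; pickRollupVersion is a made-up name):

```go
// Illustrative sketch only, not the actual rollup code from the PR.
package rollup

import "math"

// pickRollupVersion captures the behaviour described above: for any readTs
// other than math.MaxUint64, the merged posting list is written back to
// Badger at version maxCommitTs + 1.
func pickRollupVersion(readTs, maxCommitTs uint64) uint64 {
	if readTs != math.MaxUint64 {
		return maxCommitTs + 1
	}
	// The math.MaxUint64 case is not covered above; keeping maxCommitTs here
	// is only a placeholder assumption.
	return maxCommitTs
}
```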
Risk Analysis of Rollup Version+1
Following the fix in PR #8774, the following scenario may still result in data loss:
Timeline:
----------------------------------------------------------
T1 | [Rollup] Start (readTs = 5) for key "a"
| Merge deltas -> a = 3
| maxCommitTs = 3
----------------------------------------------------------
T2 | [Rollup] Prepare to write (a = 3, version = 4)
----------------------------------------------------------
T3 | [Mutation] Commit request
| set a = 5, commitTs = 4
----------------------------------------------------------
T4 | [Mutation] Write to Badger:
| (a = 5, version = 4)
----------------------------------------------------------
T5 | [Rollup] Finish and write to Badger:
| (a = 3, version = 4) <-- Overwrites mutation
----------------------------------------------------------
T6 | Final state in Badger: a = 3
| Mutation's update (a = 5) LOST
----------------------------------------------------------
Assume a readTs of 5 is issued and Rollup is performed on key a. After merging deltas, we get a = 3, with maxCommitTs = 3.
Rollup then writes a = 3 to Badger with version = 4 (maxCommitTs + 1).
At the same time, a Mutation with commitTs = 4 wants to set a = 5.
If the Mutation commits to Badger before the Rollup writes, then the later Rollup write will overwrite the Mutation, resulting in data loss.
In this flow, Rollup and Mutation share the same version number 4. If Rollup starts earlier but commits later, it can overwrite a Mutation that has already been committed, causing data loss.
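To make the concern concrete, here is a minimal sketch against Badger's managed-mode API (not Dgraph code; error handling is kept simple, and which value actually survives for the shared version is exactly the question):

```go
// Minimal sketch (not Dgraph code) of the scenario above, using Badger's
// managed mode: a "mutation" and a "rollup" both write key "a" at version 4.
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	db, err := badger.OpenManaged(badger.DefaultOptions("/tmp/badger-race-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// T4: the mutation commits (a = 5) at commitTs = 4.
	wb := db.NewWriteBatchAt(4)
	if err := wb.Set([]byte("a"), []byte("5")); err != nil {
		log.Fatal(err)
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}

	// T5: the rollup finishes later and writes its merged value (a = 3)
	// at the same version, maxCommitTs + 1 = 4.
	wb = db.NewWriteBatchAt(4)
	if err := wb.Set([]byte("a"), []byte("3")); err != nil {
		log.Fatal(err)
	}
	if err := wb.Flush(); err != nil {
		log.Fatal(err)
	}

	// A reader at readTs = 5 sees whichever entry ends up winning for
	// (key = "a", version = 4).
	txn := db.NewTransactionAt(5, false)
	defer txn.Discard()
	item, err := txn.Get([]byte("a"))
	if err != nil {
		log.Fatal(err)
	}
	val, _ := item.ValueCopy(nil)
	fmt.Printf("a = %s at version %d\n", val, item.Version())
}
```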
So in Badger, only the commit ts matters. When you write something at 4, it doesn't change anything written before it; a transaction spanning 4-5 is stored at 5.
In your timeline, the problem is that for a readTs of 5 on key a, we are going to wait until everything with a lower timestamp has been committed and written to disk. So T3 will always happen before T1.
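As a rough illustration of that ordering guarantee (a simplified model, not Dgraph's actual oracle/watermark code), the idea is that a rollup at readTs blocks until everything at or below readTs is durable:

```go
// Simplified model (not Dgraph's actual oracle/watermark code) of the
// guarantee described above: a rollup at readTs only starts once every
// commit with a timestamp <= readTs has been applied and written out.
package main

import (
	"fmt"
	"sync"
)

type waterMark struct {
	mu      sync.Mutex
	cond    *sync.Cond
	doneTil uint64 // highest ts up to which all commits are applied
}

func newWaterMark() *waterMark {
	w := &waterMark{}
	w.cond = sync.NewCond(&w.mu)
	return w
}

// setDoneUntil is called once everything <= ts has been written to disk.
func (w *waterMark) setDoneUntil(ts uint64) {
	w.mu.Lock()
	if ts > w.doneTil {
		w.doneTil = ts
	}
	w.mu.Unlock()
	w.cond.Broadcast()
}

// waitForTs blocks until all commits <= readTs are durable, which is why,
// in the timeline above, T3 always happens before T1.
func (w *waterMark) waitForTs(readTs uint64) {
	w.mu.Lock()
	for w.doneTil < readTs {
		w.cond.Wait()
	}
	w.mu.Unlock()
}

func main() {
	w := newWaterMark()
	go func() {
		// Simulate commits at ts 1..5 being applied and persisted.
		for ts := uint64(1); ts <= 5; ts++ {
			w.setDoneUntil(ts)
		}
	}()
	w.waitForTs(5) // a rollup with readTs = 5 would only start after this
	fmt.Println("all commits <= 5 are on disk; safe to roll up key a")
}
```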
Thank you very much for your reply. We are currently experiencing data loss issues while using Dgraph.
We suspect that the rollup process writing to Badger with version + 1 might conflict with the commitTs of new mutations, resulting in data loss.
This is because the commitTs for mutations is uniquely allocated by Zero, whereas the rollup process sets the Badger version to maxCommitTs + 1.
In addition, I’m curious about the previous fix for the WAL replay issue that involved using version + 1. From what I see in the code, when restarting a node, it first recovers from a snapshot and then replays the WAL starting from the snapshot index.
So in what scenario did the earlier WAL replay problem actually occur?
The WAL Replay issue happens something like this:
Timestamp 3: Transaction 1 finishes and delta changes are written to disk
We rollup and write complete changes at timestamp 3.
Dgraph restarts before we could take a snapshot.
When WAL replay happens, at timestamp 3, we again write the delta changes.
Normally this isn’t an issue; we will just roll up again and store it. However, when a key is split, we write split posting lists as well. These split posting lists are updated whenever there’s a rollup. So the data looks something like this before and after WAL replay:
At TS=3, before rollup:
Main PL:  (TS:1, Complete) (TS:2, Delta) (TS:3, Delta)
Split PL: (TS:1, Complete)

At TS=3, after rollup:
Main PL:  (TS:1, Complete) (TS:2, Delta) (TS:3, Complete)
Split PL: (TS:1, Complete) (TS:3, Complete)

After WAL replay:
Main PL:  (TS:1, Complete) (TS:2, Delta) (TS:3, Delta)
Split PL: (TS:1, Complete) (TS:3, Complete)
Now in Badger, if we have two different Complete postings for one key, we delete the first (older) one. So after the next compaction, the data would look like:

Main PL:  (TS:1, Complete) (TS:2, Delta) (TS:3, Delta)
Split PL: (TS:3, Complete)

This causes an issue when we now try to read the main PL: when we read it, we find that the split PLs are empty at ts=1.
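As a toy model of that compaction behaviour (not Dgraph or Badger code), you can think of it as dropping every Complete entry that is older than the newest Complete for the same key:

```go
// Toy model (not Dgraph or Badger code) of the state above: each key holds a
// list of (ts, Complete/Delta) versions, and compaction drops every Complete
// that is older than the newest Complete for that key.
package main

import "fmt"

type version struct {
	ts       uint64
	complete bool // true = Complete posting list, false = Delta
}

func compact(versions []version) []version {
	var newestComplete uint64
	for _, v := range versions {
		if v.complete && v.ts > newestComplete {
			newestComplete = v.ts
		}
	}
	out := versions[:0]
	for _, v := range versions {
		if v.complete && v.ts < newestComplete {
			continue // the older Complete gets deleted
		}
		out = append(out, v)
	}
	return out
}

func main() {
	// State after WAL replay, as listed above.
	mainPL := []version{{1, true}, {2, false}, {3, false}}
	splitPL := []version{{1, true}, {3, true}}

	mainPL = compact(mainPL)   // unchanged: it has only one Complete, at ts=1
	splitPL = compact(splitPL) // the ts=1 Complete is dropped

	fmt.Println("main PL :", mainPL)
	fmt.Println("split PL:", splitPL)
	// Reading the main PL still starts from its Complete at ts=1 and expects
	// split data at ts=1, which no longer exists -> the split PLs look empty.
}
```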
You mentioned the old fix. The old fix also worked on the principle of writing to +1, but it suffered from a major problem. The core idea behind the fix was that no two transactions would end at ts and ts+1: for the second transaction to end at ts+1, it would have to start before ts, and hence there would be a conflict between the two transactions. The problem was that, in the case of a list predicate, there wouldn’t be a conflict between the two transactions, and this was causing data loss.

Our solution waits for ts+1 and only writes to a timestamp after making sure that no data has been written to it. So technically we are not writing to ts+1 anymore; we are writing to latestTimestamp + 1, and we also make sure that this latestTimestamp + 1 has no data.
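A minimal sketch of the "make sure the chosen version has no data" part (not the actual fix; the waiting for pending commits described above is omitted, and versionIsFree is a made-up helper):

```go
// Minimal sketch (not the actual fix) of the "chosen version has no data"
// check, against Badger's managed-mode API.
package rollup

import (
	"errors"

	badger "github.com/dgraph-io/badger/v4"
)

// versionIsFree reports whether no entry for key exists at exactly
// candidateTs, so a rollup could safely be written there.
func versionIsFree(db *badger.DB, key []byte, candidateTs uint64) (bool, error) {
	txn := db.NewTransactionAt(candidateTs, false)
	defer txn.Discard()

	item, err := txn.Get(key)
	switch {
	case errors.Is(err, badger.ErrKeyNotFound):
		return true, nil // nothing stored at or below candidateTs
	case err != nil:
		return false, err
	default:
		// Get returns the newest version <= candidateTs, so the slot is
		// taken only if that version is candidateTs itself.
		return item.Version() < candidateTs, nil
	}
}
```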
Could you let me know details like which predicates the data loss is seen in? Is the data loss in data keys, or index keys? If you could give me more details, that would be helpful. Also, if you have any way of reproducing the issue, I can take it up and solve it.
Thank you very much for your detailed response. I now understand the scenario for the original fix and the logic behind the version+1 handling. This approach indeed prevents data loss (all transactions before readTs are committed and written to disk) and also resolves the WAL replay issue.
Regarding the historical data loss we observed, it potentially affected both data keys and index keys, and could occur on any Alpha node. We have checked the logs and found no errors, and we also reviewed the relevant metrics—raft_applied_index is consistent across all Alpha nodes in each group.
We have attempted to reproduce the issue but were unable to do so locally. The data loss issue is more likely to occur under very high concurrency, and the problem originally occurred on some relatively large predicates. Therefore, we suspect that the previous data loss may have been caused by overwrites or ordering conflicts during transaction writes.