Steps to reproduce the issue (command/config used to run Dgraph).
Normal helm install from chart.
While troubleshooting slow inserts, I am seeing these errors in the logs of all 3 alphas:
W0316 05:37:45.883308 21 mvcc.go:132] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 00000c6a622d68756e742e6e616d65020a633161: could not read list part with key 04000c6a622d68756e742e6e616d65020a6331610000000000000001: Key not found rolling up key [0 0 12 106 98 45 104 117 110 116 46 110 97 109 101 2 10 99 49 97]
W0316 05:37:48.406653 21 mvcc.go:132] Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key 00000c6a622d68756e742e6e616d65020a633239: could not read list part with key 04000c6a622d68756e742e6e616d65020a6332390000000000000001: Key not found rolling up key [0 0 12 106 98 45 104 117 110 116 46 110 97 109 101 2 10 99 50 57]
(I have included a couple of lines here to show that it affects many different keys.) There were ~1800 of these log entries in 2 hours.
I have traced this through the v20.11.0 code and arrived at this function; it seems to be a badgerdb error surfaced by txn.Get(). The cluster is 84 days old.
I am wondering if this means whatever is at that key is corrupt. I have seen this before on my development system and ended up just wiping it and trying again, but this is our production system for a live application, so any help would be appreciated.
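For reference, a minimal sketch (not Dgraph's code path, just badger used directly) of how a missing key surfaces from txn.Get() as badger.ErrKeyNotFound, which is what gets wrapped into the "could not read list part with key ...: Key not found" message:

```go
package main

import (
	"errors"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Throwaway badger instance; Dgraph's p directory uses badger underneath.
	db, err := badger.Open(badger.DefaultOptions("/tmp/badger-demo"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Reading a key that was never written (or whose versions were dropped)
	// returns badger.ErrKeyNotFound.
	err = db.View(func(txn *badger.Txn) error {
		_, err := txn.Get([]byte("missing-list-part-key"))
		return err
	})
	if errors.Is(err, badger.ErrKeyNotFound) {
		fmt.Println("badger reports: Key not found")
	}
}
```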
I am seeing the Error failed when calling List.rollup on a cluster I spun up 2 weeks ago as well. Given where this error comes from (reading the posting list), it would surprise me if it were benign.
Yep: ~4100 times in 8h on one of my alphas, 51 times on another (over 30m, actually), and 0 on the third (in the non-rotated logs, anyway).
Note: this is the same cluster as the issue where one alpha is completely behind on its Raft applied index. The 4100 count is not from the alpha whose Raft state is far behind.
Hi @iluminae,
Not sure if you are still having this issue, but I was also running into it. In my situation, the (corrupt?) keys narrowed down to certain types that had a reverse predicate pointing towards them. I dropped the reverse edge index and recreated/reindexed it. Not an ideal solution, as I have no idea what caused it in the first place, but it got everything running as it should again.
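For reference, dropping and re-adding the @reverse directive is just a schema alter; a rough sketch of that with the dgo client (the predicate name here is hypothetical, not from this thread):

```go
package main

import (
	"context"
	"log"

	"github.com/dgraph-io/dgo/v210"
	"github.com/dgraph-io/dgo/v210/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))
	ctx := context.Background()

	// Re-declare the predicate without @reverse to drop the reverse index...
	if err := dg.Alter(ctx, &api.Operation{Schema: `belongsTo: [uid] .`}); err != nil {
		log.Fatal(err)
	}
	// ...then add @reverse back, which forces the reverse edges to be rebuilt.
	if err := dg.Alter(ctx, &api.Operation{Schema: `belongsTo: [uid] @reverse .`}); err != nil {
		log.Fatal(err)
	}
}
```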
I am also on v21.03. That is interesting though, I guess I wasn’t quite as lucky with my upgrade.
Edited:
Hmm, looks like the same issue popped back up. I can do the same “trick” again and it seems to temporarily fix the issue.
If helpful:
What version of Dgraph are you using?
Dgraph codename : rocket
Dgraph SHA-256 : b4e4c77011e2938e9da197395dbce91d0c6ebb83d383b190f5b70201836a773f
Commit SHA-1 : a77bbe8ae
Commit timestamp : 2021-04-07 21:36:38 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
For Dgraph official documentation, visit https://dgraph.io/docs.
For discussions about Dgraph , visit http://discuss.hypermode.com.
For fully-managed Dgraph Cloud , visit https://dgraph.io/cloud.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2021 Dgraph Labs, Inc.
Have you tried reproducing the issue with the latest release?
@balazshoranyi I actually just started seeing this again on my current cluster, exactly as originally posted.
Really, I do not even know if it is causing a devastating issue or not! It would be very difficult for me to identify if a few of my billions of values were missing or something.
Current cluster specs:
Dgraph version : v21.03.1
Dgraph codename : rocket-1
Dgraph SHA-256 : a00b73d583a720aa787171e43b4cb4dbbf75b38e522f66c9943ab2f0263007fe
Commit SHA-1 : ea1cb5f35
Commit timestamp : 2021-06-17 20:38:11 +0530
Branch : HEAD
Go version : go1.16.2
jemalloc enabled : true
For me there were no errors, just these log lines. I have no idea if it was affecting the result set or not. But I have rebuilt my cluster maybe 15 times since this issue was filed.
Error failed when calling List.rollup: while encoding: cannot iterate through the list: cannot initialize iterator when calling List.iterate: cannot read initial list part for list with base key
I was interested to know whether the data was in fact there, and I had previously applied the following approach, which does succeed but is extremely slow, as I’m effectively searching back in time for the key. In one attempt the key was 40k timestamps behind.
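Roughly, the lookup looks like this (just a sketch against badger's managed-mode API; the key bytes and starting timestamp are placeholders, not values from this cluster):

```go
package main

import (
	"errors"
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open the p directory read-only in managed mode, as Dgraph does.
	db, err := badger.OpenManaged(badger.DefaultOptions("./p").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	key := []byte("...split key bytes from the log...") // placeholder
	startTs := uint64(1_000_000)                        // placeholder: a recent read timestamp

	// Walk the read timestamp backwards until the key resolves.
	for ts := startTs; ts > 0; ts-- {
		txn := db.NewTransactionAt(ts, false)
		item, err := txn.Get(key)
		if err == nil {
			fmt.Printf("key visible at readTs=%d, version=%d\n", ts, item.Version())
			txn.Discard()
			return
		}
		txn.Discard()
		if !errors.Is(err, badger.ErrKeyNotFound) {
			log.Fatal(err)
		}
	}
	fmt.Println("key not found at any timestamp")
}
```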
I found that by changing my snapshot-after setting, memTableSize, and disk performance (10,000 IOPS & 1,000 MBps), I was able to reduce the pressure that causes it to occur.
i.e. --snapshot_after=1000 --badger=memTableSize=536870912
Has it got something to do with compaction and rollups occurring simultaneously?
@Tom_Hollingworth This was an issue before v22. If you are on the latest release, it shouldn’t happen to you anymore. I will explain the reason: when a posting list grows in size, it gets split. We write these split keys individually in badger, and the main (original) key just stores the list of the split keys. Whenever we write new split keys, we delete the old split keys. Here was the bug: if there is a cluster restart just after the old split keys are deleted, the main key gets overwritten due to WAL replay. It forgets that it made new splits and tries to go back to the old splits, which have been deleted.
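To illustrate that sequence, here is a toy model of the bookkeeping (these are not Dgraph's real types, just a stand-in for the main key, its split parts, and the overwrite on replay):

```go
package main

import "fmt"

// Toy model: the main key records which split parts the posting list has;
// each part lives under its own key in the store.
type postingList struct {
	Splits []uint64
}

func main() {
	store := map[string]interface{}{} // stand-in for badger

	// Initial state: main key points at split part 1.
	store["main"] = postingList{Splits: []uint64{1}}
	store["split-1"] = "old part data"

	// A rollup writes a new split part and deletes the old one.
	store["main"] = postingList{Splits: []uint64{2}}
	store["split-2"] = "new part data"
	delete(store, "split-1")

	// The pre-v22 bug: a restart right after this replays the WAL and
	// overwrites the main key with its pre-rollup value, which still
	// references the deleted split part 1.
	store["main"] = postingList{Splits: []uint64{1}}

	// Reading the list now fails for the old part.
	for _, s := range store["main"].(postingList).Splits {
		if _, ok := store[fmt.Sprintf("split-%d", s)]; !ok {
			fmt.Printf("could not read list part split-%d: Key not found\n", s)
		}
	}
}
```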
There was a fix for WAL replay in v22, but it was causing other inconsistency issues instead. Either way, the essence of the fix is the same: we prevent this overwrite during restart.
There’s a way to fix this corruption. I wrote a script to fix the p directory about 3 years ago; let me see if I can find it, but it’s mostly lost. Basically, what you need to do is find the new timestamp for the lost split key, then update the main key to point to that new timestamp.
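The "find the new timestamp" half of that can be sketched with badger's AllVersions key iterator (the split key bytes are a placeholder; rewriting the main key to point at that timestamp is the part the script handled and is not shown here):

```go
package main

import (
	"fmt"
	"log"
	"math"

	badger "github.com/dgraph-io/badger/v3"
)

func main() {
	// Open the p directory read-only in managed mode.
	db, err := badger.OpenManaged(badger.DefaultOptions("./p").WithReadOnly(true))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	splitKey := []byte("...split key bytes from the log...") // placeholder

	// Read at the maximum timestamp so every remaining version is visible.
	txn := db.NewTransactionAt(math.MaxUint64, false)
	defer txn.Discard()

	itOpts := badger.DefaultIteratorOptions
	itOpts.AllVersions = true
	itOpts.PrefetchValues = false

	it := txn.NewKeyIterator(splitKey, itOpts)
	defer it.Close()

	// Versions come back newest first; the first one is the timestamp the
	// main key would need to be updated to point at.
	for it.Rewind(); it.Valid(); it.Next() {
		fmt.Printf("split key exists at version (commit ts) %d\n", it.Item().Version())
		break
	}
}
```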