Hello, we have a v24.0.5 instance that has been running continuously for about half a year, but it has now gone into an error loop. Looking at the logs I see thousands of lines with this error:
W0430 08:13:12.608632 31 log.go:35] [Compactor: 1] LOG Compact FAILED with error: MANIFEST removes non-existing table 2308251: {span:0xc03411b320 compactorId:1 t:{baseLevel:4 targetSz:[0 10485760 10485760 10955185 109551850 1095518501 10955185010] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]} p:{level:0 score:1 adjusted:1.0177845533553695 dropPrefixes:[] t:{baseLevel:4 targetSz:[0 10485760 10485760 10955185 109551850 1095518501 10955185010] fileSz:[67108864 2097152 2097152 4194304 8388608 16777216 33554432]}} thisLevel:0xc0001d84e0 nextLevel:0xc0001d8660 top:[0xc020eb00c0 0xc01cebd5c0 0xc012afdd40 0xc01c5f2240 0xc0145c1b00] bot:[0xc0099883c0 0xc068ca2540 0xc0145c0fc0 0xc040cc5140 0xc0328726c0 0xc00798bc80 0xc018c6f8c0 0xc00c2bb200 0xc007eada40 0xc018c6e240 0xc029e04000 0xc01106a6c0 0xc01f810d80 0xc00eec9440 0xc016a40000 0xc020edd200 0xc021aa6c00] thisRange:{left:[0 0 0 0 0 0 0 0 0 0 10 82 101 99 111 114 100 46 114 97 119 0 0 0 0 0 0 253 40 125 0 0 0 0 0 0 0 0] right:[4 0 0 0 0 0 0 0 0 0 35 79 114 103 97 110 105 122 97 116 105 111 110 82 101 115 111 108 118 101 100 69 110 116 105 116 121 46 117 112 100 97 116 101 84 83 2 67 7 233 0 1 0 10 0 22 0 0 0 0 1 68 154 136 255 255 255 255 255 255 255 255] inf:false size:0} nextRange:{left:[0 0 0 0 0 0 0 0 0 0 10 82 101 99 111 114 100 46 114 97 119 0 0 0 0 0 1 149 237 115 0 0 0 0 0 0 0 0] right:[0 0 0 0 0 0 0 0 0 0 61 79 114 103 97 110 105 122 97 116 105 111 110 82 101 115 111 108 118 101 100 69 110 116 105 116 121 46 114 101 103 105 115 116 101 114 101 100 65 100 100 114 101 115 115 67 111 117 110 116 114 121 78 111 114 109 97 108 105 115 101 100 0 0 0 0 0 0 254 3 71 255 255 255 255 255 255 255 255] inf:false size:0} splits:[] thisSize:0 dropPrefixes:[]}
I have seen similar issues reported in the past, but back then the fix was an upgrade to badger v2, so this must have somehow reappeared since. Replicating it is going to be a challenge, as it took many months of continuous load to trigger.
Looking at the relevant badger code in manifest.go around line 446: given that the purpose of the change is to remove that table, should it be a non-error if the table is already missing? Or is the concern that stale entries could be left behind in build.Levels? I guess that could be handled by scanning every level for the table when it is missing from build.Tables (see the edit below). Otherwise I think this should be a hard error, because as a retried warning it turns into an error loop.
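For context, this is roughly what I understand the DELETE case in manifest.go to do today; I'm paraphrasing from my reading, so names and details may not be exact:

case pb.ManifestChange_DELETE:
	tm, ok := build.Tables[tc.Id]
	if !ok {
		// The path we keep hitting: the table id from the manifest
		// change is not in build.Tables, so the compaction fails.
		return fmt.Errorf("MANIFEST removes non-existing table %d", tc.Id)
	}
	delete(build.Levels[tm.Level].Tables, tc.Id)
	delete(build.Tables, tc.Id)
	build.Deletions++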
Also, root cause aside, is there a way to repair this instance? It is not part of a replicated cluster. It comes back fine after a restart but eventually hits the loop again. Do you think a backup, wipe and restore would fix the underlying tables? Has this been attempted in the past?
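If a badger-level backup and restore is the recommended path, this is the kind of minimal sketch I had in mind, assuming I can take the instance down and open the data directory directly with the Go API (DB.Backup / DB.Load). The paths are placeholders, and the import would need to match whatever badger major version the instance embeds:

package main

import (
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// 1. Stream a full backup of the existing (wedged) store to a file.
	src, err := badger.Open(badger.DefaultOptions("/data/old-badger")) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	f, err := os.Create("/data/badger.bak") // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	if _, err := src.Backup(f, 0); err != nil { // since=0 => full backup
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}
	if err := src.Close(); err != nil {
		log.Fatal(err)
	}

	// 2. Restore into a fresh directory, which rebuilds the tables and MANIFEST.
	dst, err := badger.Open(badger.DefaultOptions("/data/new-badger")) // placeholder path
	if err != nil {
		log.Fatal(err)
	}
	defer dst.Close()
	bak, err := os.Open("/data/badger.bak")
	if err != nil {
		log.Fatal(err)
	}
	defer bak.Close()
	if err := dst.Load(bak, 16); err != nil { // 16 = max pending writes
		log.Fatal(err)
	}
}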
Let me know if I can provide any more info, and thanks in advance for any help!
Edit: Here is what I came up with from looking at the manifest.go file:
case pb.ManifestChange_DELETE:
	tm, ok := build.Tables[tc.Id]
	if !ok {
		// Table is already gone from build.Tables: instead of failing,
		// scrub any stale references to it from every level.
		for _, level := range build.Levels {
			delete(level.Tables, tc.Id)
		}
	} else {
		// Normal path: remove it from its level and from the table map.
		delete(build.Levels[tm.Level].Tables, tc.Id)
		delete(build.Tables, tc.Id)
	}
	build.Deletions++
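The idea being that when the table is already gone from build.Tables, the DELETE just scrubs any stale references to it out of the level maps and carries on, rather than failing on every compaction attempt.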