Not sure if this is a bug or user error, but I accidentally used the latest image tag on my 3-node Kubernetes deployment. It upgraded from v20.07.2 to v20.11.0 yesterday and I ran into errors about version number 7 vs. 8. I understand that’s because of storage compatibility; that isn’t the issue I’m reporting.
I’ve scaled both the Alpha and Zero replicas down to 0, downgraded the StatefulSets back to v20.07.2, and then scaled my Zero StatefulSet back to 3 replicas. Now I get an error on the first Zero; here are the logs:
Dgraph version : v20.07.2
Dgraph codename : shuri-2
Dgraph SHA-256 : a927845127dab735c24727d5a24af411168771b55236aec50f0b987e8c0ac910
Commit SHA-1 : a7bc16d56
Commit timestamp : 2020-10-22 10:17:53 -0700
Branch : HEAD
Go version : go1.14.4
For Dgraph official documentation, visit https://dgraph.io/docs/.
For discussions about Dgraph , visit http://discuss.hypermode.com.
To say hi to the community , visit https://dgraph.slack.com.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.
I0123 21:51:45.158537 19 run.go:127] Setting up grpc listener at: 0.0.0.0:5080
I0123 21:51:45.159438 19 run.go:127] Setting up http listener at: 0.0.0.0:6080
I0123 21:51:45.160751 19 run.go:298] Opening zero BadgerDB with options: {Dir:zw ValueDir:zw SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0xc000
badger 2021/01/23 21:51:45 INFO: All 1 tables opened in 5ms
badger 2021/01/23 21:51:45 INFO: Replaying file id: 6 at offset: 42785522
badger 2021/01/23 21:51:45 INFO: Replay took: 4.921µs
I0123 21:51:45.208055 19 node.go:149] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc000112050 Applied:1444061 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864
I0123 21:51:45.208368 19 node.go:307] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:1444061 Term:3 XXX_unrecognized:[]}
I0123 21:51:45.208468 19 node.go:318] Found hardstate: {Term:7 Vote:0 Commit:1445344 XXX_unrecognized:[]}
I0123 21:51:45.209336 19 node.go:327] Group 0 found 1283 entries
I0123 21:51:45.209362 19 raft.go:515] Restarting node for dgraphzero
I0123 21:51:45.209427 19 node.go:186] Setting conf state to nodes:1 nodes:2 nodes:3
I0123 21:51:45.209593 19 pool.go:160] CONNECTING to dgraph-alpha-0.dgraph-alpha.default.svc.cluster.local:7080
I0123 21:51:45.209619 19 pool.go:160] CONNECTING to dgraph-alpha-1.dgraph-alpha.default.svc.cluster.local:7080
I0123 21:51:45.209642 19 pool.go:160] CONNECTING to dgraph-alpha-2.dgraph-alpha.default.svc.cluster.local:7080
I0123 21:51:45.209665 19 pool.go:160] CONNECTING to dgraph-zero-1.dgraph-zero.default.svc.cluster.local:5080
I0123 21:51:45.209724 19 pool.go:160] CONNECTING to dgraph-zero-2.dgraph-zero.default.svc.cluster.local:5080
I0123 21:51:45.210211 19 log.go:34] 1 became follower at term 7
I0123 21:51:45.210517 19 log.go:34] newRaft 1 [peers: [1,2,3], term: 7, commit: 1445344, applied: 1444061, lastindex: 1445344, lastterm: 7]
[Sentry] 2021/01/23 21:51:45 Sending fatal event [1118306852724b18a70f595e9e64d2cc] to o318308.ingest.sentry.io project: 1805390
2021/01/23 21:51:45 proto: ZeroProposal: illegal tag 0 (wire type 0)
github.com/dgraph-io/dgraph/x.Check
/ext-go/1/src/github.com/dgraph-io/dgraph/x/error.go:42
github.com/dgraph-io/dgraph/dgraph/cmd/zero.run
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:324
github.com/dgraph-io/dgraph/dgraph/cmd/zero.init.0.func1
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/zero/run.go:79
github.com/spf13/cobra.(*Command).execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830
github.com/spf13/cobra.(*Command).ExecuteC
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914
github.com/spf13/cobra.(*Command).Execute
/go/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/dgraph-io/dgraph/dgraph/cmd.Execute
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/cmd/root.go:70
main.main
/ext-go/1/src/github.com/dgraph-io/dgraph/dgraph/main.go:78
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1373
Also, my mistake of using the latest tag happened on 5 different clusters. Following the downgrade steps described above worked on 4 of them; just this 1 is getting the error.
Hey, this is happening because the raftwal formats of v20.07 and v20.11 are not compatible. Upgrading across a major version should be done via the backup/restore process.
In your case, this is likely what happened: the cluster had Zeros on different versions at the same time, so a Zero at v20.11 would have propagated a ZeroProposal to a Zero at v20.07. When that proposal is read back, you see this error because of the incompatibility of the raftwal across versions.
Hey @maaft, I’m not sure why this might have happened in your case, since you’ve already mentioned that the Dgraph version was kept at v21.03.1 throughout the cluster’s lifetime.
One way to fix it would be to recreate a cluster from the existing p directory. Basically, you would copy over the p directory, then bump the UID lease, the timestamps, and the namespace ID (if you are using multi-tenancy) on Zero via its HTTP endpoints: https://dgraph.io/docs/deploy/dgraph-zero/#endpoints. Then you can enable traffic on the cluster.
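To make the bumping step concrete, here is a rough, hypothetical sketch of those calls in Go, assuming Zero’s default HTTP port 6080 and the /assign endpoint described at the docs link above (what=uids|timestamps|nsids). The num values are placeholders that should come from the old cluster’s /state output (see the next step), and localhost stands in for your Zero service address.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

// assign asks Zero to lease out IDs of the given kind, which has the
// effect of bumping the corresponding max value past anything already
// present in the copied p directory.
func assign(what string, num uint64) {
	url := fmt.Sprintf("http://localhost:6080/assign?what=%s&num=%d", what, num)
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("assign %s: %v", what, err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Printf("assign %s -> %s\n", what, body)
}

func main() {
	assign("uids", 10_000_000)       // placeholder: use at least maxUID from /state
	assign("timestamps", 10_000_000) // placeholder: use at least maxTxnTs from /state
	assign("nsids", 100)             // only if multi-tenancy is enabled; use maxNsId
}
```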
Do you by any chance know the output of the latest /state call? If your Alpha is still running, you can query /state and get maxUID, maxTxnTs, and maxNsId from there.
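For reference, here is a minimal sketch of pulling those values from /state, assuming it is reachable on Zero’s HTTP port 6080 (some versions also proxy /state on Alpha’s HTTP port) and that the JSON field names are maxUID, maxTxnTs, and maxNsID. Older releases use different names (e.g. maxLeaseId), so treat the struct tags as an assumption and check the raw JSON first.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// state holds only the lease-related fields we care about; the real
// /state payload contains much more (group membership, tablets, etc.).
type state struct {
	MaxUID   uint64 `json:"maxUID"`
	MaxTxnTs uint64 `json:"maxTxnTs"`
	MaxNsID  uint64 `json:"maxNsID"`
}

func main() {
	resp, err := http.Get("http://localhost:6080/state")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var s state
	if err := json.NewDecoder(resp.Body).Decode(&s); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("maxUID=%d maxTxnTs=%d maxNsID=%d\n", s.MaxUID, s.MaxTxnTs, s.MaxNsID)
}
```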
Otherwise, we would have to get those values with a custom script that iterates over the DB.
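As a starting point for such a script, here is a rough sketch (not an official tool) that opens a copy of the Alpha’s p directory with Badger v2 (the engine used by v20.07/v20.11) and walks the keys. Decoding each key into a UID requires Dgraph’s internal key format (the x package in the Dgraph source tree), so only the iteration skeleton is shown; run it against a copy, never against the live directory.

```go
package main

import (
	"fmt"
	"log"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Open the copied p directory read-only so the data cannot be mutated.
	opts := badger.DefaultOptions("./p").WithReadOnly(true)
	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.View(func(txn *badger.Txn) error {
		itOpts := badger.DefaultIteratorOptions
		itOpts.PrefetchValues = false // keys are enough for this purpose
		it := txn.NewIterator(itOpts)
		defer it.Close()

		count := 0
		for it.Rewind(); it.Valid(); it.Next() {
			key := it.Item().KeyCopy(nil)
			// TODO: parse the key with Dgraph's key helpers (x package)
			// to extract the UID and track the maximum seen.
			fmt.Printf("%x\n", key)
			count++
		}
		fmt.Println("total keys:", count)
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```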
Pardon me if I don’t understand correctly, but Alpha doesn’t seem to have a /state endpoint.
Zero is, of course, down due to the error reported above.
Can you please give me an example of how I can iterate over the DB to get those values?
I still have my zw folder for Zero (with *.wal and wal.meta files). Can I somehow extract the IDs by introspecting these files? If so, what tools would I need to use?