We’re in the process of migrating from GCP to AWS, and we’re struggling to get Dgraph running properly on AWS.
We have a relatively small dataset (60M RDF triples, which we plan to expand to ~150M after the migration). We are deploying on Kubernetes using the Helm chart, with 1 Dgraph group and 3 replicas, on Dgraph version 20.03.1.
alpha: 40GB io1-fast-retain storage class (EBS)
zero: 4GB io1-fast-retain storage class (EBS)
We’ve been having issues with the Dgraph pods crashing and eventually entering a restart loop while loading data. We have tried loading data both with the live loader and with our internal data-generation script (inserting directly into Dgraph in batches of 5,000 with the Python client).
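For reference, our internal loader batches mutations roughly like this (a minimal sketch, not our exact code: the `client` is assumed to be a pydgraph.DgraphClient, and the retry/backoff policy here is illustrative):

```python
import itertools
import time

def batched(iterable, size=5_000):
    """Yield successive lists of up to `size` items from any iterable."""
    it = iter(iterable)
    while True:
        batch = list(itertools.islice(it, size))
        if not batch:
            return
        yield batch

def load_nquads(client, nquads, batch_size=5_000, retries=3):
    """Insert N-Quad lines in batches of `batch_size`, one transaction
    per batch, retrying each batch a few times on transient errors."""
    for batch in batched(nquads, batch_size):
        for attempt in range(retries):
            txn = client.txn()
            try:
                txn.mutate(set_nquads="\n".join(batch))
                txn.commit()
                break
            except Exception:
                if attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)  # simple exponential backoff
            finally:
                txn.discard()  # no-op if the commit succeeded
```

Each batch is its own transaction, so a crash mid-load loses at most one in-flight batch.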
The live loader reports a healthy load and then, after about 40 minutes, prints Killed and exits. When we use our internal loader, one or more of the pods eventually crash, and then keep crashing after they are restarted. We see the following in the logs:
zero:
I0604 23:04:00.515506 20 log.go:34] raft.node: 3 elected leader 2 at term 4
W0604 23:07:08.266729 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 257ms. Breakdown: [{disk 257ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:25:34.598864 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 288ms. Breakdown: [{disk 288ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:33:46.928453 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 418ms. Breakdown: [{disk 0s} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:35:51.495267 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 385ms. Breakdown: [{disk 385ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0604 23:37:53.830954 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 221ms. Breakdown: [{disk 221ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 00:09:01.627581 20 node.go:674] [0x3] Read index context timed out
W0605 00:10:42.361160 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 351ms. Breakdown: [{disk 351ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 00:18:57.302604 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 292ms. Breakdown: [{advance 292ms} {disk 0s} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 00:31:13.819653 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 400ms. Breakdown: [{advance 400ms} {disk 0s} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 00:38:52.627478 20 node.go:674] [0x3] Read index context timed out
W0605 00:38:55.807380 20 pool.go:254] Connection lost with dgraph-live-dgraph-alpha-0.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080. Error: rpc error: code = Unavailable desc = transport is closing
W0605 00:48:19.630822 20 node.go:674] [0x3] Read index context timed out
W0605 00:50:25.627825 20 node.go:674] [0x3] Read index context timed out
W0605 01:12:19.534424 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 524ms. Breakdown: [{proposals 524ms} {disk 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:12:20.310197 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 382ms. Breakdown: [{disk 382ms} {proposals 1ms} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:20:32.594602 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 356ms. Breakdown: [{disk 356ms} {proposals 0s} {advance 0s}]. Num entries: 0. MustSync: false
W0605 01:32:47.306791 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 224ms. Breakdown: [{advance 212ms} {disk 12ms} {proposals 0s}]. Num entries: 0. MustSync: false
W0605 01:38:58.913003 20 raft.go:733] Raft.Ready took too long to process: Timer Total: 203ms. Breakdown: [{proposals 203ms} {disk 0s} {advance 0s}]. Num entries: 0. MustSync: false
I0605 01:47:36.127451 20 log.go:34] 3 is starting a new election at term 4
alpha:
E0605 18:30:03.489600 15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:09.477349 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:30:19.577293 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
I0605 18:30:28.077278 15 draft.go:1307] Found 1 old transactions. Acting to abort them.
I0605 18:30:28.088283 15 draft.go:1268] TryAbort 1 txns with start ts. Error: <nil>
I0605 18:30:28.088301 15 draft.go:1284] TryAbort: No aborts found. Quitting.
I0605 18:30:28.088308 15 draft.go:1310] Done abortOldTransactions for 1 txns. Error: <nil>
W0605 18:30:29.677443 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:30:31.490179 15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:39.777415 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:30:44.977219 15 node.go:420] Unable to send message to peer: 0x3. Error: EOF
W0605 18:30:49.877320 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:30:59.490827 15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
W0605 18:30:59.977333 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:31:10.077329 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
W0605 18:31:20.177292 15 node.go:420] Unable to send message to peer: 0x1. Error: Unhealthy connection
E0605 18:31:27.491457 15 groups.go:994] While proposing delta with MaxAssigned: 21082 and num txns: 1. Error=Server overloaded with pending proposals. Please retry later. Retrying...
I0605 18:31:28.077215 15 draft.go:1307] Found 1 old transactions. Acting to abort them.
I0605 18:31:28.088069 15 draft.go:1268] TryAbort 1 txns with start ts. Error: <nil>
another alpha on restart:
Dgraph version : v20.03.1
Dgraph SHA-256 : 6a40b1e084205ae9e29336780b3458a3869db45c0b96b916190881c16d705ba8
Commit SHA-1 : c201611d6
Commit timestamp : 2020-04-24 13:53:41 -0700
Branch : HEAD
Go version : go1.14.1
For Dgraph official documentation, visit https://docs.dgraph.io.
For discussions about Dgraph , visit http://discuss.hypermode.com.
To say hi to the community , visit https://dgraph.slack.com.
Licensed variously under the Apache Public License 2.0 and Dgraph Community License.
Copyright 2015-2020 Dgraph Labs, Inc.
I0605 18:30:36.903089 18 run.go:609] x.Config: {PortOffset:0 QueryEdgeLimit:1000000 NormalizeNodeLimit:10000}
I0605 18:30:36.903138 18 run.go:610] x.WorkerConfig: {ExportPath:export NumPendingProposals:256 Tracing:1 MyAddr:dgraph-live-dgraph-alpha-1.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080 ZeroAddr:[dgraph-live-dgraph-zero-0.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080] RaftId:0 WhiteListedIPRanges:[] MaxRetries:-1 StrictMutations:false AclEnabled:false AbortOlderThan:5m0s SnapshotAfter:10000 ProposedGroupId:0 StartTime:2020-06-05 18:30:36.226962605 +0000 UTC m=+0.089461880 LudicrousMode:false BadgerKeyFile:}
I0605 18:30:36.903188 18 run.go:611] worker.Config: {PostingDir:p BadgerTables:mmap BadgerVlog:mmap BadgerKeyFile: BadgerCompressionLevel:3 WALDir:w MutationsMode:0 AuthToken: AllottedMemory:2048 HmacSecret:[] AccessJwtTtl:0s RefreshJwtTtl:0s AclRefreshInterval:0s}
I0605 18:30:36.907167 18 server_state.go:75] Setting Badger Compression Level: 3
I0605 18:30:36.907185 18 server_state.go:84] Setting Badger table load option: mmap
I0605 18:30:36.907191 18 server_state.go:96] Setting Badger value log load option: mmap
I0605 18:30:36.907202 18 server_state.go:141] Opening write-ahead log BadgerDB with options: {Dir:w ValueDir:w SyncWrites:false TableLoadingMode:1 ValueLogLoadingMode:2 NumVersionsToKeep:1 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1048576 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:10485760 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:10000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 18:30:38.793573 18 log.go:34] All 6 tables opened in 1.73s
I0605 18:30:38.805952 18 log.go:34] Replaying file id: 0 at offset: 957849371
I0605 18:30:49.142937 18 log.go:34] Replay took: 10.336955479s
I0605 18:30:49.145182 18 log.go:34] Replaying file id: 1 at offset: 0
I0605 18:31:20.287500 18 log.go:34] Replay took: 31.142296612s
I0605 18:31:20.293362 18 log.go:34] Replaying file id: 2 at offset: 0
I0605 18:31:48.684231 18 log.go:34] Replay took: 28.390477856s
I0605 18:31:48.685961 18 log.go:34] Replaying file id: 3 at offset: 0
I0605 18:31:58.447189 18 log.go:34] Replay took: 9.760887038s
I0605 18:31:58.450924 18 server_state.go:75] Setting Badger Compression Level: 3
I0605 18:31:58.451229 18 server_state.go:84] Setting Badger table load option: mmap
I0605 18:31:58.451471 18 server_state.go:96] Setting Badger value log load option: mmap
I0605 18:31:58.451775 18 server_state.go:160] Opening postings BadgerDB with options: {Dir:p ValueDir:p SyncWrites:false TableLoadingMode:2 ValueLogLoadingMode:2 NumVersionsToKeep:2147483647 ReadOnly:false Truncate:true Logger:0x282e510 Compression:2 InMemory:false MaxTableSize:67108864 LevelSizeMultiplier:10 MaxLevels:7 ValueThreshold:1024 NumMemtables:5 BlockSize:4096 BloomFalsePositive:0.01 KeepL0InMemory:true MaxCacheSize:1073741824 MaxBfCacheSize:0 LoadBloomsOnOpen:false NumLevelZeroTables:5 NumLevelZeroTablesStall:10 LevelOneSize:268435456 ValueLogFileSize:1073741823 ValueLogMaxEntries:1000000 NumCompactors:2 CompactL0OnClose:true LogRotatesToFlush:2 ZSTDCompressionLevel:3 VerifyValueChecksum:false EncryptionKey:[] EncryptionKeyRotationDuration:240h0m0s BypassLockGuard:false ChecksumVerificationMode:0 managedTxns:false maxBatchCount:0 maxBatchSize:0}
I0605 18:32:04.256014 18 log.go:34] 43 tables out of 83 opened in 3.026s
I0605 18:32:07.326065 18 log.go:34] 67 tables out of 83 opened in 6.096s
I0605 18:32:09.507096 18 log.go:34] All 83 tables opened in 8.277s
I0605 18:32:09.628863 18 log.go:34] Replaying file id: 73 at offset: 0
I0605 18:32:27.423486 18 log.go:34] Replay took: 17.794596998s
I0605 18:32:27.434057 18 log.go:34] Replaying file id: 74 at offset: 0
I0605 18:32:55.694783 18 log.go:34] Replay took: 28.260207343s
I0605 18:32:55.698983 18 log.go:34] Replaying file id: 75 at offset: 0
I0605 18:33:07.481943 18 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:[]}
I0605 18:33:07.482013 18 log.go:34] Running for level: 0
I0605 18:33:11.063700 18 log.go:34] Replay took: 15.364686825s
I0605 18:33:11.065335 18 log.go:34] Replaying file id: 76 at offset: 0
I0605 18:33:17.924071 18 log.go:34] Replay took: 6.858697639s
I0605 18:33:17.924636 18 log.go:34] Replaying file id: 77 at offset: 0
I0605 18:33:24.894635 18 log.go:34] Replay took: 6.969981227s
I0605 18:33:24.897840 18 log.go:34] Replaying file id: 78 at offset: 0
I0605 18:33:31.694015 18 log.go:34] Replay took: 6.796152436s
I0605 18:33:31.695474 18 log.go:34] Replaying file id: 79 at offset: 0
I0605 18:33:38.523017 18 log.go:34] Replay took: 6.827319391s
I0605 18:33:38.529853 18 log.go:34] Replaying file id: 80 at offset: 0
I0605 18:33:51.051489 18 log.go:34] Replay took: 12.521373342s
I0605 18:33:51.052336 18 log.go:34] Replaying file id: 81 at offset: 0
I0605 18:33:52.044519 18 log.go:34] Replay took: 991.928658ms
I0605 18:33:54.088433 18 groups.go:107] Current Raft Id: 0x1
I0605 18:33:54.103640 18 run.go:480] Bringing up GraphQL HTTP API at 0.0.0.0:8080/graphql
I0605 18:33:54.103678 18 run.go:481] Bringing up GraphQL HTTP admin API at 0.0.0.0:8080/admin
I0605 18:33:54.103709 18 run.go:512] gRPC server started. Listening on port 9080
I0605 18:33:54.103722 18 run.go:513] HTTP server started. Listening on port 8080
I0605 18:33:54.103746 18 worker.go:96] Worker listening at address: [::]:7080
I0605 18:33:54.192151 18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-0.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.303595 18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-1.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.317525 18 groups.go:135] Connected to group zero. Assigned group: 0
I0605 18:33:54.317546 18 groups.go:137] Raft Id after connection to Zero: 0x1
I0605 18:33:54.317601 18 pool.go:160] CONNECTING to dgraph-live-dgraph-alpha-0.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080
I0605 18:33:54.317627 18 pool.go:160] CONNECTING to dgraph-live-dgraph-alpha-2.dgraph-live-dgraph-alpha-headless.live.svc.cluster.local:7080
I0605 18:33:54.317673 18 pool.go:160] CONNECTING to dgraph-live-dgraph-zero-2.dgraph-live-dgraph-zero-headless.live.svc.cluster.local:5080
I0605 18:33:54.317741 18 draft.go:200] Node ID: 0x1 with GroupID: 1
I0605 18:33:54.318455 18 node.go:148] Setting raft.Config to: &{ID:1 peers:[] learners:[] ElectionTick:20 HeartbeatTick:1 Storage:0xc1481c7b00 Applied:115 MaxSizePerMsg:262144 MaxCommittedSizePerReady:67108864 MaxUncommittedEntriesSize:0 MaxInflightMsgs:256 CheckQuorum:false PreVote:true ReadOnlyOption:0 Logger:0x282e510 DisableProposalForwarding:false}
I0605 18:33:56.470893 18 node.go:306] Found Snapshot.Metadata: {ConfState:{Nodes:[1 2 3] Learners:[] XXX_unrecognized:[]} Index:115 Term:2 XXX_unrecognized:[]}
I0605 18:33:56.470933 18 node.go:317] Found hardstate: {Term:52 Vote:0 Commit:6907 XXX_unrecognized:[]}
I0605 18:33:59.115242 18 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0605 18:33:59.115275 18 admin.go:520] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.
I0605 18:34:04.121342 18 query.go:123] Dgraph query execution failed : Dgraph query failed because Please retry again, server is not ready to accept requests
I0605 18:34:04.121386 18 admin.go:520] Error reading GraphQL schema: Dgraph query failed because Dgraph query failed because Please retry again, server is not ready to accept requests.
We’re trying to figure out what to try next. It seems possible that the volumes are not fast enough, but we have no logs or metrics that confirm this. Another option would be splitting the cluster into multiple groups.
Three questions:
(1) How do we know if we need faster volumes?
(2) Does adding more groups to spread out predicates make the cluster more stable?
(3) SaaS is checked off on your GitHub roadmap for the first half of this year… is it out yet?
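On (1), while we wait for an answer: the zero logs above attribute 200–400ms to the disk phase of Raft.Ready, so one rough probe we’ve considered is timing small synced writes from inside the pod against the same volume. A minimal sketch (no substitute for proper fio/iostat benchmarking, and any latency threshold is a guess on our part):

```python
import os
import statistics
import tempfile
import time

def fsync_latencies(path=None, writes=100, block=4096):
    """Append `writes` blocks of `block` bytes to a temp file under
    `path`, fsync after each write, and return per-write latencies
    in milliseconds."""
    fd, tmp = tempfile.mkstemp(dir=path)
    latencies = []
    try:
        payload = b"\0" * block
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)
            latencies.append((time.perf_counter() - start) * 1000)
    finally:
        os.close(fd)
        os.unlink(tmp)
    return latencies

# Point `path` at the volume mount (e.g. the alpha's p/w directory parent).
lat = fsync_latencies()
print(f"p50={statistics.median(lat):.2f}ms max={max(lat):.2f}ms")
```

If synced 4KB writes routinely take tens or hundreds of milliseconds, that would line up with the `{disk 257ms}` breakdowns in the zero logs; single-digit-ms fsyncs would point the investigation elsewhere.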
Thanks!