Finally, after crashing Dgraph about 100 times, I came up with a rather ugly solution.
Most of the errors came from compaction and from the "Raft.Ready took too long" warnings.
Sometimes Dgraph was in a very annoying state after a crash, stuck in an endless loop of trying to replay its files. I had to rm -rf the data directories every time.
To load the ~16,000 JSON files, each with just a few hundred mutations, I wrote a file walker and simply called a dgo mutation for each file. This first approach crashed just like the live loader did.
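In case it helps, here is a stripped-down sketch of the walker (localhost:9080 and ./data are placeholders for my actual endpoint and folder, and error handling is trimmed):

```go
package main

import (
	"context"
	"io/ioutil"
	"log"
	"os"
	"path/filepath"
	"strings"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	ctx := context.Background()
	err = filepath.Walk("./data", func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".json") {
			return err
		}
		b, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}
		// One transaction per file; each file holds a few hundred mutations.
		_, err = dg.NewTxn().Mutate(ctx, &api.Mutation{SetJson: b, CommitNow: true})
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```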
On a bigger machine (with a latest-generation NVMe drive) I gave Docker 16 CPUs and 32 GB of RAM - it almost got through the whole folder in one go, but crashed again near the finish line.
But with more RAM and more CPUs it definitely got 4-5 times further before crashing than with fewer resources.
Adding a time.Sleep(10 * time.Second) every 100 files, so that Raft and compaction can do their thing and catch up during the pause, did the trick for me - but the whole import now takes over 16 minutes, and most of that time the machine is just idling around.
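The working version is the same walker with a counter added (100 files and 10 seconds are simply the values that happened to work for me, nothing principled); it replaces the loop above once time is added to the imports:

```go
// mutateAll walks dir, sends one mutation per JSON file, and pauses
// every 100 files so that Raft and compaction can catch up.
func mutateAll(ctx context.Context, dg *dgo.Dgraph, dir string) error {
	count := 0
	return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() || !strings.HasSuffix(path, ".json") {
			return err
		}
		b, err := ioutil.ReadFile(path)
		if err != nil {
			return err
		}
		if _, err := dg.NewTxn().Mutate(ctx, &api.Mutation{SetJson: b, CommitNow: true}); err != nil {
			return err
		}
		count++
		if count%100 == 0 {
			time.Sleep(10 * time.Second) // let Raft and compaction breathe
		}
		return nil
	})
}
```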
One thing worth mentioning is that the processing time gets worse and worse with every compaction/Raft round, even though the machine is pretty much idle.
Below are the logs from the complete run, where you can see this behaviour and the Raft timeouts - it cannot come from the disk/RAM/CPU etc. - why is this happening?
We are a bit concerned about whether to use Dgraph in production - it seems a bit unstable/cumbersome to deploy and manage, especially if there can be data loss or errors like the ones described, which can only be “solved” by rm -rf. We had thought of deploying it to 8 GB RAM machines with 4 cores. This is just a part of the whole dataset, and it must be synchronized every now and then - perhaps that is not even possible on this kind of machine.
alpha_1 | I0411 21:59:36.575115 14 draft.go:136] Operation started with id: opRollup
alpha_1 | I0411 22:04:30.501918 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:04:30.501976 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:04:30.860490 14 log.go:34] LOG Compact 0->1, del 6 tables, add 1 tables, took 358.500295ms
alpha_1 | I0411 22:04:30.860523 14 log.go:34] Compaction for level: 0 DONE
alpha_1 | I0411 22:07:59.471448 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:07:59.471495 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:08:00.875596 14 log.go:34] LOG Compact 0->1, del 6 tables, add 4 tables, took 1.404088412s
alpha_1 | I0411 22:08:00.875636 14 log.go:34] Compaction for level: 0 DONE
alpha_1 | I0411 22:11:08.471385 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:11:08.471433 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:11:11.284801 14 log.go:34] LOG Compact 0->1, del 9 tables, add 7 tables, took 2.813354703s
alpha_1 | I0411 22:11:11.284847 14 log.go:34] Compaction for level: 0 DONE
alpha_1 | I0411 22:14:37.470818 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:14:37.470934 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:14:40.903792 14 log.go:34] LOG Compact 0->1, del 12 tables, add 10 tables, took 3.432809903s
alpha_1 | I0411 22:14:40.903836 14 log.go:34] Compaction for level: 0 DONE
alpha_1 | I0411 22:15:18.764163 14 draft.go:523] Creating snapshot at index: 23470. ReadTs: 33307.
zero_1 | W0411 22:15:25.388043 16 node.go:671] [0x1] Read index context timed out
zero_1 | W0411 22:15:27.407629 16 node.go:671] [0x1] Read index context timed out
zero_1 | I0411 22:15:27.934150 16 oracle.go:107] Purged below ts:33307, len(o.commits):80, len(o.rowCommit):622962
zero_1 | W0411 22:15:27.934243 16 raft.go:733] Raft.Ready took too long to process: Timer Total: 4.547s. Breakdown: [{proposals 4.547s} {disk 0s} {sync 0s} {advance 0s}]. Num entries: 1. MustSync: true
zero_1 | I0411 22:15:28.088899 16 oracle.go:107] Purged below ts:33307, len(o.commits):80, len(o.rowCommit):622962
zero_1 | I0411 22:15:33.284670 16 raft.go:616] Writing snapshot at index: 11887, applied mark: 12060
alpha_1 | I0411 22:18:40.470942 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:18:40.470994 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:18:40.766790 14 log.go:34] Got compaction priority: {level:0 score:1 dropPrefix:}
alpha_1 | I0411 22:18:40.766842 14 log.go:34] Running for level: 0
alpha_1 | I0411 22:18:41.137025 14 log.go:34] LOG Compact 0->1, del 6 tables, add 2 tables, took 370.164792ms
alpha_1 | I0411 22:18:41.137073 14 log.go:34] Compaction for level: 0 DONE
alpha_1 | I0411 22:18:46.234204 14 log.go:34] LOG Compact 0->1, del 15 tables, add 13 tables, took 5.763193925s
alpha_1 | I0411 22:18:46.234250 14 log.go:34] Compaction for level: 0 DONE
