After thinking this issue over, we started storing less data, sharded the data, and doubled the cluster size; for now the system seems more stable. Tip 58 from The Pragmatic Programmer comes to mind: “Random Failures Are Often Concurrency Issues”. I will update this post if anything significant happens - fingers crossed.
Spoke too soon: it happened again, and one node is out of whack:
dgraph[2197758]: W0924 12:57:02.928687 2197825 log.go:36] While running doCompact: MANIFEST removes non-existing table 18308720