I have reduced our ingestion to 2 threads and have not encountered the issue since (after having to restore from backup 3 times on Christmas day). So it seems to be related to disk contention/IO throttling.
This is why we were so affected by the BadgerDB manifest corruption too - our disks are extremely throttled (GCP pd-ssd @ 2TB, the second highest IOPS tier for pd-ssd on GCP), and even 2 threads inserting is enough to trigger throttling on our disks.
What configuration does Dgraph Cloud use on GCP to avoid throttling? I mean, the real issue needs to be solved in Dgraph, but I am tired of rebuilding my cluster. At least backup/restore is OSS now - if I had to bulk load each of these times it would be even more unpalatable.