Corrupt database - unable to restart Dgraph

For this, I literally started via a copy/paste of the startup docker command - nothing tricky. I have an AWS instance, installed Docker, then ran Dgraph. Nothing complex or different from the ‘out of the box’ command. If it’s any consolation, I’ve never seen this before - our code base has been stable and unmodified for over 12 months on this particular project.

Also, it seems very variable. I can’t find a particular trigger that seems to cause it.

@mikehawkes it sounds like the issue happens only on the AWS instance. Can you try running Dgraph for a long period (maybe a week) on a local machine and see if you face the same issue?

Ok, so I’ve been on vacation for the past 2 weeks. I ran this for those 2 weeks, during which time I had 5 collapses with the same error. So far, it hasn’t corrupted the database to the point of being irrecoverable again - that has happened twice in the past 3 months.

I’ll run up the same service on my Mac server and see whether that also collapses.

This was a fresh AWS instance, running the standalone:latest Docker image.

@mikehawkes Let me know how it goes on the local instance. Please do save all the logs (before/after any crash, corruption, or other issue that you see). I am still looking for a bug/crash that we can fix.
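For example, something along these lines will dump a container’s full log (with timestamps) to a dated file - the container name here is just a placeholder:

# Save the full container log with timestamps; "dgraph" is a placeholder name.
docker logs --timestamps dgraph > dgraph-$(date +"%d-%m-%Y_%T").log 2>&1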

Will do - I’m putting together a test script to poll the DB every 5 minutes and restart the container on failure. I want to see if there’s a timing pattern here … I do get core files in the data directory, so I think something’s tripping up Docker. I get a core dump - but not every time.
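Roughly along these lines (just a sketch - the container name and Alpha HTTP port 8080 are placeholders, and it assumes Alpha’s /health endpoint):

#!/bin/bash
# Poll Dgraph every 5 minutes; restart the container if the health check fails.
while true; do
  if ! curl -sf --max-time 10 localhost:8080/health > /dev/null; then
    echo "$(date +"%d-%m-%Y_%T") health check failed - restarting" >> watchdog.log
    docker restart dgraph
  fi
  sleep 300
done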


Hi - I’m suspecting memory here. I’ve had another couple of crashes - is there a way to restrict RAM usage when using a containerised version of Dgraph? If so, I’ll set that and see if it still crashes out.

Hey @mikehawkes, there isn’t a way to limit the memory in Dgraph itself, but you can limit the memory of the Docker container - see “Runtime options with Memory, CPUs, and GPUs” in the Docker documentation.
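For example, something like this caps the standalone container at 6 GB (the limit, ports, and volume path are just example values):

# Run the standalone image with a hard 6 GB memory cap and swap disabled.
docker run -d --name dgraph \
  --memory=6g --memory-swap=6g \
  -p 8080:8080 -p 9080:9080 -p 8000:8000 \
  -v ~/dgraph:/dgraph \
  dgraph/standalone:latest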

We also have memory metrics exposed via Dgraph that you can use: https://dgraph.io/docs/deploy/metrics/#memory-metrics
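For example, assuming the default Alpha HTTP port of 8080, you can scrape the memory metrics like this:

# Dump the memory-related Prometheus metrics from Alpha (default port assumed).
curl -s localhost:8080/debug/prometheus_metrics | grep -i memory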

It panicked yesterday with an OOM message. Everything collapsed and I got a core dump. It’s the first time I’ve seen this in the logs for a while. Luckily the DB remained intact.

Do you have the logs or memory profile?

Please see the attached: log.txt (143.9 KB)

I see this:

fatal error: runtime: cannot allocate memory
fatal error: runtime: cannot allocate memory

goroutine 3 [running]:
runtime.throw(0x1bb6a44, 0x1f)
	/usr/local/go/src/runtime/panic.go:1116 +0x72 fp=0xc0000595e0 sp=0xc0000595b0 pc=0xa1af82
runtime.newArenaMayUnlock(0x2ba2c00)
	/usr/local/go/src/runtime/mheap.go:1933 +0xda fp=0xc000059618 sp=0xc0000595e0 pc=0xa0e46a
runtime.newMarkBits(0x33, 0x33)
	/usr/local/go/src/runtime/mheap.go:1853 +0xc3 fp=0xc000059660 sp=0xc000059618 pc=0xa0dff3
runtime.(*mspan).sweep(0x7fb8015742d8, 0xc00007e000, 0xa48a01)
	/usr/local/go/src/runtime/mgcsweep.go:341 +0x4c9 fp=0xc000059740 sp=0xc000059660 pc=0xa09b19
runtime.sweepone(0x1c18980)
	/usr/local/go/src/runtime/mgcsweep.go:136 +0x284 fp=0xc0000597a8 sp=0xc000059740 pc=0xa093e4
runtime.bgsweep(0xc00007e000)
	/usr/local/go/src/runtime/mgcsweep.go:73 +0xba fp=0xc0000597d8 sp=0xc0000597a8 pc=0xa090aa
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1373 +0x1 fp=0xc0000597e0 sp=0xc0000597d8 pc=0xa4eb21
created by runtime.gcenable
	/usr/local/go/src/runtime/mgc.go:214 +0x5c

How much RAM does your machine have?

This machine is an 8 GiB AWS instance. It’s only running Dgraph and a 10 MiB process, accessing a relatively small data set.

Do you run Dgraph Alpha and Zero on the same machine? Dgraph uses caches to improve performance, and the caches themselves take about 1.5 GB. The w directory contents are kept in memory, so all the .sst files in the w directory would be in memory.
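As a rough check of how much that contributes, something like this works (assuming the standalone image’s default /dgraph data directory and a container named dgraph - both are examples):

# Size of the w directory and the number of .sst files in it.
docker exec dgraph du -sh /dgraph/w
docker exec dgraph sh -c 'ls /dgraph/w/*.sst 2>/dev/null | wc -l'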

8 GB is very little for running Dgraph Alpha and Zero on a single machine. You will need at least a 16 GB machine if you want to run both of them together.


This is a test instance running the Dgraph standalone Docker image. It had been stable for over a year, then started falling over. I actually wonder if this is two issues: I can now get it to fail reliably with a panic by running a query, whereas the failure without an evident panic happens randomly after a few days. I’m not sure which of them causes the DB corruption (with no way to recover the data - that concerns me), as that seems only to have occurred 2-3 times over the past few months.

An OOM crash doesn’t cause data corruption. Dgraph can OOM and restart again (provided the machine has sufficient RAM); this would never cause data corruption. I’ve only seen panics in the logs you have shared - I haven’t seen any data corruption yet.

@mikehawkes I will need a memory profile to help you. You can run the following to collect memory profiles every minute. You might need to change the zero and alpha ports.

while true; do for i in {6180,8180}; do curl localhost:$i/debug/pprof/heap -o $(date +"%d-%m-%Y_%T")-$i-heap; done; sleep 60; done
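Once you have a few of those, something like this (standard go tool pprof usage; the filename is just an example of what the loop above produces) shows the top allocations in a given profile:

# Show the heaviest allocators in one collected heap profile.
go tool pprof -top 21-07-2021_09:15:01-8180-heap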