For this, I’ve literally started via a copy/paste of the startup docker command - nothing tricky. I’ve an AWS instance, installed Docker, then ran Dgraph. Nothing complex or different to the ‘out of box’ command. If it’s any consolation, I’ve never seen this before - our code base has been stable and unmodified for over 12 months on this particular project.
Also, it seems very variable. I can’t find a particular trigger that seems to cause it.
@mikehawkes it sounds like the issue happens only on the AWS instance. Can you try running Dgraph for a long period (maybe a week) on a local machine and see if you face the same issue?
Ok, so I’ve been on vacation for the past 2 weeks. I ran this for 2 weeks, during which time I have had 5 collapses with the same error. So far, it hasn’t corrupted the database to make it irrecoverable again. That has happened twice in the past 3 months.
I’ll run up the same service on my Mac server and see whether that also collapses.
This was a fresh AWS instance, running the standalone:latest Docker image.
@mikehawkes Let me know how it goes on the local instance. Please do save all the logs from before and after any crash, corruption, or other issue that you see. I am still looking for a bug/crash that we can fix.
Will do - I’m putting together a test script to poll the DB every 5 minutes and restart the container on failure. I want to see if there’s a timing pattern here … I do get core files in the data directory, so I think something’s tripping out Docker. I get a core dump - but not every time.
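For anyone following along, a polling script like the one described might look roughly like the sketch below. It assumes the container is named `dgraph` and that the alpha HTTP port is the default 8080, where Dgraph serves `/health`; the function names, log file name, and 10-second timeout are illustrative choices, not anything from this thread.

```shell
#!/usr/bin/env bash
# Watchdog sketch. All values below are assumptions; adjust to your setup.
CONTAINER="${CONTAINER:-dgraph}"                           # assumed container name
HEALTH_URL="${HEALTH_URL:-http://localhost:8080/health}"   # alpha HTTP port (default 8080)
INTERVAL="${INTERVAL:-300}"                                # poll every 5 minutes
LOG_FILE="${LOG_FILE:-watchdog.log}"                       # hypothetical log file

# Returns 0 if the health endpoint answers with a 2xx status within 10 seconds.
is_healthy() {
  curl --silent --fail --max-time 10 --output /dev/null "$HEALTH_URL"
}

# Checks once. On failure: record the time, keep the tail of the container
# log for later analysis, then restart the container.
check_once() {
  if is_healthy; then
    return 0
  fi
  echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) health check failed, restarting $CONTAINER" >> "$LOG_FILE"
  docker logs --tail 200 "$CONTAINER" >> "$LOG_FILE" 2>&1
  docker restart "$CONTAINER"
}

# Polls forever; call this after sourcing the file.
watch_forever() {
  while true; do
    check_once
    sleep "$INTERVAL"
  done
}
```

Calling `watch_forever` starts the loop. The timestamps in the log file are what would show a timing pattern, if there is one.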
Hi - I’m suspecting memory here. I’ve had another couple of crashes - is there a way to restrict RAM usage when using a containerised version of Dgraph? If so, I’ll set that and see if it still crashes out.
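On the RAM question: Docker itself can cap a container's memory with the `--memory` flag of `docker run`. The sketch below assumes a 6g limit, a container named `dgraph`, the default standalone ports, and a host data directory of `./dgraph-data` mounted at `/dgraph`; all of these are placeholders rather than values from this thread. Note that the cap does not make Dgraph use less memory: the kernel kills the process once it exceeds the limit, so this is mainly a way to confirm whether memory is the trigger.

```shell
# Hypothetical helper: start the standalone image with a hard memory cap.
# MEM_LIMIT, the container name, and the data path are assumptions.
MEM_LIMIT="${MEM_LIMIT:-6g}"

run_dgraph_limited() {
  # Setting --memory-swap equal to --memory prevents the container from using swap.
  docker run -d --name dgraph \
    --memory "$MEM_LIMIT" --memory-swap "$MEM_LIMIT" \
    -p 8080:8080 -p 9080:9080 \
    -v "$PWD/dgraph-data:/dgraph" \
    dgraph/standalone:latest
}
```

`docker stats dgraph` then shows the container's memory usage against that limit.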
It panicked yesterday with an OOM message. Everything collapsed and I got a core dump. It’s the first time I’ve seen this in the logs for a while. Luckily the DB remained intact.
Do you run Dgraph alpha and zero on the same machine? Dgraph uses caches to improve performance, and the caches themselves would take about 1.5 GB. The w directory contents are kept in memory, so all the .sst files in the w directory would be in memory.
8 GB is too little to run Dgraph alpha and zero on a single machine. You will need at least a 16 GB machine if you want to run both of them together.
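To get a rough feel for how much memory the w directory accounts for, the .sst file sizes can be summed. The sketch below is only an estimate based on the description above; the function name `sst_bytes` is made up, and the path to the w directory is passed in because it depends on where the data volume is mounted.

```shell
# Prints the total size in bytes of all .sst files under the given directory.
# Usage: sst_bytes /path/to/w
sst_bytes() {
  find "$1" -type f -name '*.sst' -exec wc -c {} \; \
    | awk '{s += $1} END {printf "%.0f\n", s}'
}
```

Adding the roughly 1.5 GB of cache mentioned above to this figure gives a lower bound to compare against the machine's RAM.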
This is a test instance - running the Dgraph standalone Docker image. It’s been stable for over a year, then started falling over. I actually wonder if this is two issues: I can now get this to fail reliably with a panic by running a query, whereas the other failure, with no evident panic, happens randomly after a few days of running. I’m not sure which causes the DB corruption (without any way to recover the data - that concerns me), as that seems only to have occurred 2-3 times over the past few months.
An OOM crash doesn’t cause data corruption. Dgraph can OOM and restart again (provided the machine has sufficient ram). This would never cause data corruption. I’ve only seen panics in the logs you have shared. I haven’t seen any data corruption yet.
@mikehawkes I will need a memory profile to help you. You can run the following to collect memory profiles every minute. You might need to change the zero and alpha ports.
while true; do for i in {6180,8180}; do curl localhost:$i/debug/pprof/heap -o "$(date +"%d-%m-%Y_%T")-$i-heap"; done; sleep 60; done