Query that crashes a Dgraph server

I have set up a Dgraph cluster in Kubernetes with 5 Zeros and 30 Servers, with replicas set to 3.

Each Dgraph server is started with --memory_mb 3036.
I'm running a long query that consists of approximately 200 blocks, and some blocks are expected to return thousands or even hundreds of thousands of results.

About 20 minutes into the query, the Dgraph server pod handling it crashed and logged:

2018/01/31 01:48:48 node.go:400: WARN: A tick missed to fire. Node blocks too long!
2018/01/31 01:48:48 node.go:400: WARN: A tick missed to fire. Node blocks too long!
2018/01/31 01:48:48 node.go:400: WARN: A tick missed to fire. Node blocks too long!
2018/01/31 01:48:48 node.go:400: WARN: A tick missed to fire. Node blocks too long!
2018/01/31 01:48:48 node.go:400: WARN: A tick missed to fire. Node blocks too long!

Is this a result of memory usage exceeding the allocated memory? I.e., if I increase the memory setting, would that solve the issue?

Is the CPU usage very high when you run this query? Can you please share the heap and CPU profiles?
This can happen if all the CPU is being eaten by the query, or if the process becomes too slow (e.g. Dgraph using swap memory, though that shouldn't happen on Kubernetes).
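In case it helps, here's a minimal sketch of how those profiles could be grabbed, assuming the Dgraph server exposes Go's standard /debug/pprof handlers on its HTTP port (the localhost:8080 address below is an assumption; adjust it to your deployment):

```go
// Minimal sketch: save a heap profile and a 30-second CPU profile from a
// Dgraph server, assuming it serves Go's standard /debug/pprof endpoints
// on its HTTP port (localhost:8080 is an assumption; adjust as needed).
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func save(url, file string) {
	resp, err := http.Get(url)
	if err != nil {
		log.Fatalf("fetching %s: %v", url, err)
	}
	defer resp.Body.Close()

	out, err := os.Create(file)
	if err != nil {
		log.Fatalf("creating %s: %v", file, err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatalf("writing %s: %v", file, err)
	}
	log.Printf("wrote %s", file)
}

func main() {
	base := "http://localhost:8080/debug/pprof"
	save(base+"/heap", "heap.pb.gz")              // current heap profile
	save(base+"/profile?seconds=30", "cpu.pb.gz") // CPU profile over 30 seconds
}
```

The saved files can then be inspected with `go tool pprof heap.pb.gz` / `go tool pprof cpu.pb.gz`.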

Yes, the CPU usage is relatively high while the query runs, spiking to around 62.5% (10 of 16 cores).
Each node is an n1-standard-16 (16 vCPUs, 60 GB memory).

This is the CPU usage at the time of the crash (~14:50):
[CPU usage screenshot]

Memory usage remains high even after the ingestion phase, at 93% (56 GB / 60 GB).
The green line represents the overall usage.
The yellow line represents usage from the default namespace, where the Dgraph pods are deployed.
[memory usage screenshot]

Does that mean the memory config for the Dgraph server has no effect on Dgraph's performance? I thought that during a long query a Dgraph server passes the result of a query block to another server that holds the relevant predicate. If the result is too big to be stored in memory, will it crash, or would it log another error/warning message?

Yes, that's true. Dgraph does pass the query to other servers to execute, but the results ultimately have to be aggregated on the node that received the initial request. If the result is too big to be stored in memory, the server may go out of memory, but the logs should say that. Do you have any logs from the crashing server?
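To make the aggregation point concrete, here's a toy Go sketch (not Dgraph's actual code) of the fan-out/aggregate pattern described above. It shows why the coordinating node is the memory bottleneck: every block's results are collected there before the response is built.

```go
// Toy illustration (not Dgraph's actual code) of the fan-out/aggregate
// pattern: the coordinating node sends each query block out for execution,
// then gathers all results in its own memory before responding. This is
// why the coordinator can run out of memory even if the servers handling
// individual blocks are fine.
package main

import (
	"fmt"
	"sync"
)

// executeBlock stands in for shipping one query block to the server that
// owns the relevant predicate and waiting for its result set.
func executeBlock(block string) []string {
	return []string{block + ":result"} // placeholder result
}

func main() {
	blocks := []string{"block1", "block2", "block3"} // ~200 blocks in the real query

	var (
		mu      sync.Mutex
		results []string // everything accumulates on the coordinating node
		wg      sync.WaitGroup
	)
	for _, b := range blocks {
		wg.Add(1)
		go func(b string) {
			defer wg.Done()
			res := executeBlock(b)
			mu.Lock()
			results = append(results, res...)
			mu.Unlock()
		}(b)
	}
	wg.Wait()

	fmt.Printf("aggregated %d results on the coordinating node\n", len(results))
}
```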

More memory should definitely improve performance, but turn off swap space if you haven't already.

The logs I pasted above were the only odd logs I found when the pod crashed.

Is this some kind of config I can set on the GCE node, or via the Dgraph binary? I'm not sure how to turn swap off, or why that would help performance.

sudo swapoff -a can be used to turn off swap. I'm not sure whether your system is using swap, but if it is, it makes the program very slow since disk is used instead of RAM. It's better to let the program go OOM than to let it use swap.
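If you want to check quickly whether a node is using swap at all, here's a minimal sketch that reads SwapTotal/SwapFree from /proc/meminfo on Linux (roughly what `free -m` shows); if SwapTotal is 0 kB, swap is disabled:

```go
// Minimal sketch: report swap usage on a Linux node by reading
// SwapTotal/SwapFree from /proc/meminfo. Values are in kB; a SwapTotal
// of 0 means swap is disabled.
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "SwapTotal:") || strings.HasPrefix(line, "SwapFree:") {
			fmt.Println(line)
		}
	}
	if err := scanner.Err(); err != nil {
		log.Fatal(err)
	}
}
```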

Cheers, swap is set to 0 by default on GCE VM instances.
