One Node in Dgraph Cluster Showing Unusual Resource Usage

We have two Dgraph environments with almost identical data distribution. In one of them, there's a machine with abnormally high CPU usage. We've observed that its GC frequency, the amount of data read from the LSM tree, and its memory allocation rate are all higher than those of the other members in the same group. Our queries are fully load-balanced.
Does anyone know what might be causing this?

The first thing I would look at is the nature of the queries and mutations going into the two clusters. Is it possible that the resource-hungry cluster is receiving more (or broader) queries, or a higher write rate?

We use the two environments in a primary-backup setup, so after switching, they receive identical writes. We have been monitoring the Alpha query metrics, and both request and write loads are balanced.

At one point, we suspected that the leader node was consuming more resources due to the additional snapshot responsibilities. However, even after this machine was unexpectedly restarted and became a follower, it continued to use significantly more resources. It allocated more memory and performed more frequent garbage collection.

We’ve monitored several metrics, and this machine showed much higher values compared to the others, including:

  • go_memstats_alloc_bytes_total
  • go_gc_cycles_total_gc_cycles_total
  • badger_read_bytes_lsm
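To compare these counters side by side, one option is to scrape each Alpha's Prometheus endpoint and extract just the metrics of interest. Below is a minimal sketch; the `/debug/prometheus_metrics` path is Dgraph's metrics endpoint, but the host list and the exact metric names as exposed by your build are assumptions you should verify against your own `/debug/prometheus_metrics` output.

```python
# Sketch: pull the counters listed above from each alpha's Prometheus
# text endpoint and compare them. Hosts below are placeholders.
import urllib.request

COUNTERS = (
    "go_memstats_alloc_bytes_total",
    "go_gc_cycles_total_gc_cycles_total",
    "badger_read_bytes_lsm",
)

def parse_counters(metrics_text, names=COUNTERS):
    """Extract the value of each named counter from Prometheus text format."""
    values = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        parts = line.split()
        if len(parts) < 2:
            continue
        metric = parts[0].split("{")[0]  # drop any label set
        if metric in names:
            values[metric] = float(parts[-1])
    return values

def scrape(host):
    """Fetch and parse one node's metrics (assumes default HTTP port)."""
    with urllib.request.urlopen(f"http://{host}/debug/prometheus_metrics") as r:
        return parse_counters(r.read().decode())

# Example: print the counters for every alpha to eyeball the outlier.
# for host in ("alpha-1:8080", "alpha-2:8080", "alpha-3:8080"):
#     print(host, scrape(host))
```

Since these are cumulative counters, comparing their rate of change (e.g. two scrapes a minute apart) is more meaningful than comparing absolute values on nodes with different uptimes.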

We've identified the issue: a schema inconsistency. Alpha-3 believed that certain predicates had indexes, when in fact they did not.
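An inconsistency like this can be surfaced by querying each node's schema and diffing the per-predicate index definitions. Here is a rough sketch using the DQL `schema {}` query against the `/query` endpoint; the hosts are placeholders, and the response shape (a `tokenizer` list per indexed predicate) should be checked against your Dgraph version.

```python
# Sketch: fetch the schema from two alphas and report predicates whose
# index (tokenizer) definitions disagree. Hosts are placeholders.
import json
import urllib.request

def fetch_schema(host):
    """Run a DQL `schema {}` query and return the list of predicate entries."""
    req = urllib.request.Request(
        f"http://{host}/query",
        data=b"schema {}",
        headers={"Content-Type": "application/dql"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)["data"]["schema"]

def index_map(schema):
    """Map predicate name -> sorted tokenizer list (empty if unindexed)."""
    return {p["predicate"]: sorted(p.get("tokenizer", [])) for p in schema}

def diff_indexes(schema_a, schema_b):
    """Return {predicate: (indexes_on_a, indexes_on_b)} where they differ."""
    ma, mb = index_map(schema_a), index_map(schema_b)
    return {
        pred: (ma.get(pred), mb.get(pred))
        for pred in set(ma) | set(mb)
        if ma.get(pred) != mb.get(pred)
    }

# Example usage against a live cluster:
# print(diff_indexes(fetch_schema("alpha-1:8080"), fetch_schema("alpha-3:8080")))
```

An empty diff means the nodes agree; any entry pinpoints a predicate whose index state has drifted, which is the kind of mismatch that would make one node do far more LSM reads and allocation than its peers.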
