Dgraph fails to start on restarts with Kind (Kubernetes)

@slotlocker2 @joaquin Thanks for your replies. I tried @slotlocker2’s suggestion today and here are the results.

  1. Using hostPath without any provisioner (same gist as shared) and restarting - Worked well
  2. Mounting /var/local-path-provisioner from the Kind node onto the host using extraMounts and restarting the cluster - Did not work
  3. Modifying the gist shared by @slotlocker2 to use the standard storage class instead of hostPath and restarting - Worked well
  4. Using the Dgraph helm chart with standard storage class and restarting - Did not work

I will go ahead with what is working for now (the helm chart for prod and the yaml for dev), but to help you reproduce the problem I am facing, I have added gists here which share the config:

THIS DOES NOT WORK ON RESTARTS: Dgraph-Kind restart repro · GitHub

THIS WORKS ON RESTARTS: Dgraph-Kind restart repro · GitHub

So, it looks like it's not an issue with Kind or the Local path provisioner but has something to do with the helm chart.

The significant difference I see is the command being used to start zero.

Helm chart has:

- bash
- "-c"
- |
  set -ex
  [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
  ordinal=${BASH_REMATCH[1]}
  idx=$(($ordinal + 1))
  if [[ $ordinal -eq 0 ]]; then
    exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 5
  else
    exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 5
  fi

and the yaml has:

- bash
- "-c"
- |
  set -ex
  dgraph zero --my=$(hostname -f):5080

The other difference is that the helm chart splits the deployment across multiple pods, while the yaml runs multiple containers within one pod.
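
To make the helm chart's command easier to reason about, here is a minimal sketch of how it derives the value passed to `--idx` from the pod hostname. The function name `zero_idx_from_hostname` is hypothetical, and it uses plain parameter expansion rather than the chart's `[[ ... =~ ... ]]` check so it can be tried in any POSIX shell outside a pod; it is only meant to show the mapping, not to replace the chart's script.

```bash
# Hypothetical helper (not part of the helm chart): maps a StatefulSet pod
# hostname such as dgraph-dgraph-zero-2 to the number the chart passes to --idx.
zero_idx_from_hostname() (
  host="$1"
  ordinal="${host##*-}"                   # text after the last '-'
  [ "$ordinal" != "$host" ] || exit 1     # no '-' in the name at all
  case "$ordinal" in
    ''|*[!0-9]*) exit 1 ;;                # suffix is not a plain number
  esac
  echo $(($ordinal + 1))                  # same arithmetic as the chart: ordinal + 1
)

zero_idx_from_hostname "dgraph-dgraph-zero-0"
zero_idx_from_hostname "dgraph-dgraph-zero-2"
```

With the chart's logic, the pod whose ordinal is 0 starts without `--peer`, and every other pod points `--peer` at `dgraph-dgraph-zero-0`. The yaml version passes neither `--idx` nor `--peer`.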

Just to see the difference before and after a restart in the pods created by the helm chart, I diffed the env, and it looks like the only thing that changes across restarts is the PPID.

Zero before and after: Saved diff 4XLv6ADk - Diff Checker
Alpha before and after: Saved diff AblTFiz1 - Diff Checker

Everything else remains the same, as you can see.
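
For anyone who wants to repeat the comparison locally instead of through a diff site, a rough sketch is below. The helper name `env_diff` and the file names are hypothetical; it assumes two dumps of the pod environment have already been saved to files, one before and one after the restart.

```bash
# Hypothetical helper: print only the lines that differ between two env dumps.
env_diff() {
  # Sort copies first so that ordering differences do not show up as changes.
  sort "$1" > "$1.sorted"
  sort "$2" > "$2.sorted"
  # diff exits with status 1 when the files differ; that is expected here.
  diff "$1.sorted" "$2.sorted"
}

# One assumed way to capture the dumps (requires access to the cluster):
#   kubectl exec dgraph-dgraph-zero-0 -- env > zero-before.txt
#   ... restart the cluster ...
#   kubectl exec dgraph-dgraph-zero-0 -- env > zero-after.txt
#   env_diff zero-before.txt zero-after.txt
```

Lines prefixed with `<` come from the first dump and lines prefixed with `>` from the second.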

UPDATE: GCE just killed my VM and the dgraph cluster restarted without any issues now. So, we might have to see how to fix the helm chart to match this: Dgraph-Kind restart repro · GitHub

Thanks again.
