@slotlocker2 @joaquin Thanks for your replies. I tried @slotlocker2’s suggestion today and here are the results.
- Using hostPath without any provisioner (the same gist as shared) and restarting - Worked well
- Mounting `/var/local-path-provisioner` from the Kind node onto the host using extraMounts (see the sketch after this list) and restarting the cluster - Did not work
- Changing the same gist shared by @slotlocker2 to use the `standard` storage class instead of hostPath and restarting - Worked well
- Using the Dgraph Helm chart with the `standard` storage class and restarting - Did not work
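For context, this is roughly the Kind config I mean by "extraMounts"; the host path here is a placeholder, not the exact one from my setup:

```yaml
# kind-config.yaml - sketch only; hostPath is an assumed placeholder
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
  - role: control-plane
    extraMounts:
      - hostPath: /tmp/kind-local-path             # directory on the host machine (assumed)
        containerPath: /var/local-path-provisioner # where local-path-provisioner stores PV data inside the node
```

The cluster is then created with `kind create cluster --config kind-config.yaml`.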
I will go ahead with what is working for now (the Helm chart for prod and the YAML for dev), but to actually help you reproduce the problem I am facing, I have added gists here with the configs:
THIS DOES NOT WORK ON RESTARTS: Dgraph-Kind restart repro · GitHub
THIS WORKS ON RESTARTS: Dgraph-Kind restart repro · GitHub
So it looks like it's not an issue with Kind or the local-path provisioner, but something to do with the Helm chart.
The significant difference I see is the command used to start Zero.
The Helm chart has:
```yaml
- bash
- "-c"
- |
  set -ex
  [[ `hostname` =~ -([0-9]+)$ ]] || exit 1
  ordinal=${BASH_REMATCH[1]}
  idx=$(($ordinal + 1))
  if [[ $ordinal -eq 0 ]]; then
    exec dgraph zero --my=$(hostname -f):5080 --idx $idx --replicas 5
  else
    exec dgraph zero --my=$(hostname -f):5080 --peer dgraph-dgraph-zero-0.dgraph-dgraph-zero-headless.${POD_NAMESPACE}.svc.cluster.local:5080 --idx $idx --replicas 5
  fi
```
and the YAML has:

```yaml
- bash
- "-c"
- |
  set -ex
  dgraph zero --my=$(hostname -f):5080
```
Another difference is that the Helm chart splits the deployment into multiple pods, while the YAML runs multiple containers within a single pod (a rough sketch of that layout is below).
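To illustrate the layout difference: the chart runs one Dgraph process per pod via a StatefulSet, whereas the working YAML puts Zero and Alpha as sibling containers in one pod spec. A minimal sketch of the latter (names, image tag, and flags are assumptions, not copied from the gist):

```yaml
# Sketch only - not the actual gist; names and image tag are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: dgraph
spec:
  containers:
    - name: zero
      image: dgraph/dgraph:v21.03.2
      command: ["bash", "-c", "dgraph zero --my=$(hostname -f):5080"]
    - name: alpha
      image: dgraph/dgraph:v21.03.2
      command: ["bash", "-c", "dgraph alpha --my=$(hostname -f):7080 --zero=localhost:5080"]
```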
Just to see the difference in the pods before and after a restart with the Helm chart, I did a diff of the env, and it looks like the only thing that changes across restarts is the PPID:
Zero before and after: Saved diff 4XLv6ADk - Diff Checker
Alpha before and after: Saved diff AblTFiz1 - Diff Checker
Everything else remains the same, as you can see.
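In case it helps reproduce that comparison, this is roughly how such an env diff can be captured (the pod name is hypothetical and depends on your release and namespace):

```sh
# Hypothetical pod name; adjust to your release/namespace.
kubectl exec dgraph-dgraph-zero-0 -- env | sort > zero-env-before.txt
# ...restart the Kind node/cluster and wait for the pod to come back...
kubectl exec dgraph-dgraph-zero-0 -- env | sort > zero-env-after.txt
diff zero-env-before.txt zero-env-after.txt   # only PPID differed in my case
```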
UPDATE: GCE just killed my VM, and the Dgraph cluster restarted without any issues this time. So we might have to see how to fix the Helm chart to match this: Dgraph-Kind restart repro · GitHub
Thanks again.