(NO I WAS WRONG) Data lost after crash

Hi all. I have experienced this about three times now. My MacBook’s battery is running out of juice, so from time to time it just shuts down without any warning if the charger is not connected.

Well, it happens that after I power it up again, queries that previously ran fine start to return no results.

See, I can’t say for sure that the data is lost, because (and this is a feature I miss A LOT) I do not know how to ask the database whether there is anything in there at all, like a simple “select *”.
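The closest thing I could figure out is something like the query below, guessing at the built-in _predicate_ edge, but I have no idea if it is the right approach:

{
  everything(func: has(_predicate_), first: 100) {
    uid
    expand(_all_)
  }
}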

This is freaking me out, ’cause I’m moving forward with Dgraph at full speed on my project.

Another (partially) related issue is with expand(_all_). Why is it such a drama queen? If I run the query below after the crash and no data is found anymore, I would expect a simple “no results” and not an error.

{
  view(func: has(__is_view)) @filter(eq(view, "home")) {
    uid
    view
    template
    infos {
      uid
      expand(_all_)
    }
    widgets {
      uid
      expand(_all_)
    }
  }
}
(returns: Unhandled intern.node expand with parent infos)

while this

{
  view(func: has(__is_view)) @filter(eq(view, "home")) {
    uid
  }
}
(returns: Your query did not return any results)

Thanks for any help, and keep up this awesome work!

Ok, new crash.

Here is the terminal log after I restart:

MacLuiz:vshark labs$ cd domains/_common/_dgraph/
MacLuiz:_dgraph labs$ ./start
MacLuiz:_dgraph labs$ 2018/04/24 10:17:47 Listening on port 8000...
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/04/24 10:17:47 node.go:246: Found hardstate: {Term:2 Vote:1 Commit:105 XXX_unrecognized:[]}
2018/04/24 10:17:47 gRPC server started.  Listening on port 9080
2018/04/24 10:17:47 HTTP server started.  Listening on port 8080
2018/04/24 10:17:47 node.go:258: Group 0 found 105 entries
2018/04/24 10:17:47 raft.go:411: Restarting node for dgraphzero
2018/04/24 10:17:47 worker.go:99: Worker listening at address: [::]:7080
2018/04/24 10:17:47 groups.go:86: Current Raft Id: 1
2018/04/24 10:17:47 raft.go:567: INFO: 1 became follower at term 2
2018/04/24 10:17:47 raft.go:316: INFO: newRaft 1 [peers: [], term: 2, commit: 105, applied: 0, lastindex: 105, lastterm: 2]
Running Dgraph zero...
2018/04/24 10:17:47 node.go:127: Setting conf state to nodes:1 
2018/04/24 10:17:47 pool.go:118: == CONNECT ==> Setting localhost:7080
2018/04/24 10:17:47 pool.go:118: == CONNECT ==> Setting localhost:5080
2018/04/24 10:17:47 zero.go:333: Got connection request: id:1 addr:"localhost:7080" 
2018/04/24 10:17:47 zero.go:430: Connected
2018/04/24 10:17:47 groups.go:109: Connected to group zero. Assigned group: 0
2018/04/24 10:17:47 draft.go:139: Node ID: 1 with GroupID: 1
2018/04/24 10:17:47 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1 Term:1 XXX_unrecognized:[]}
2018/04/24 10:17:47 node.go:246: Found hardstate: {Term:2 Vote:1 Commit:11 XXX_unrecognized:[]}
2018/04/24 10:17:47 node.go:258: Group 1 found 10 entries
2018/04/24 10:17:47 draft.go:717: Restarting node for group: 1
2018/04/24 10:17:47 raft.go:567: INFO: 1 became follower at term 2
2018/04/24 10:17:47 raft.go:316: INFO: newRaft 1 [peers: [1], term: 2, commit: 11, applied: 1, lastindex: 11, lastterm: 2]
2018/04/24 10:17:47 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:47 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:47 mutation.go:190: Done schema update predicate:"_predicate_" value_type:STRING list:true 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_views" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_widgets" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_infos" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 groups.go:316: Asking if I can serve tablet for: slot
2018/04/24 10:17:48 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:48 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:48 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:49 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:49 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:49 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:50 raft.go:749: INFO: 1 is starting a new election at term 2
2018/04/24 10:17:50 raft.go:580: INFO: 1 became candidate at term 3
2018/04/24 10:17:50 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 3
2018/04/24 10:17:50 raft.go:621: INFO: 1 became leader at term 3
2018/04/24 10:17:50 node.go:301: INFO: raft.node: 1 elected leader 1 at term 3
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"slot" value_type:INT 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"view" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"widget" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"intent" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"label" value_type:STRING directive:INDEX tokenizer:"exact" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"info" value_type:STRING directive:INDEX tokenizer:"fulltext" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"pergunta" value_type:STRING directive:INDEX tokenizer:"fulltext" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"mixins" value_type:STRING list:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"template" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"widgets" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_users" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_perfis" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_regras" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"perfis" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"regras" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 groups.go:316: Asking if I can serve tablet for: users
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"users" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"pwd" value_type:PASSWORD 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"descricao" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"valor" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"valor_max" value_type:STRING 
2018/04/24 10:17:50 groups.go:316: Asking if I can serve tablet for: _share_hash_
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"_share_hash_" value_type:STRING directive:INDEX tokenizer:"exact" 
2018/04/24 10:17:50 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:51 raft.go:749: INFO: 1 is starting a new election at term 2
2018/04/24 10:17:51 raft.go:580: INFO: 1 became candidate at term 3
2018/04/24 10:17:51 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 3
2018/04/24 10:17:51 raft.go:621: INFO: 1 became leader at term 3
2018/04/24 10:17:51 node.go:301: INFO: raft.node: 1 elected leader 1 at term 3

MacLuiz:_dgraph labs$ 2018/04/24 10:18:37 oracle.go:87: purging below ts:10006, len(o.commits):2, len(o.aborts):0, len(o.rowCommit):0


The contents of ./start:

#!/bin/bash

dgraph zero &
dgraph server --memory_mb 4000 &
dgraph-ratel &

I see, so the problem is not directly linked to Dgraph itself, right? The crash is not caused by a common activity like queries, alters, or mutations, right?

In this situation, as I picture it, I believe Dgraph was not programmed to anticipate this. It is assumed that the Dgraph instance would be safe on a machine with 99.99% uptime. (PS: I am assuming this, speaking for myself.)

To keep your data safe, you need to at least issue a shutdown command (via HTTP) and use “Ctrl+C” in the terminal to finalize Dgraph correctly. And always make an export as a backup.
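Something like this, assuming the Dgraph server’s default HTTP port 8080 (I am going from memory on these endpoints, so double-check the docs):

curl localhost:8080/admin/export
curl localhost:8080/admin/shutdown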

If you do not follow this procedure, Dgraph will keep some lock files around and prevent a new instance from accessing the data (in its view). I believe it would be an interesting feature to have a safe process for recovering from a bad shutdown in cases like this. But at the moment there is no other way out, I believe.
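Also, note that your ./start puts all three processes in the background, so a “Ctrl+C” in that terminal never reaches them. A rough sketch of a version that stays alive and passes the signal along (untested, just an idea):

#!/bin/bash

dgraph zero &
dgraph server --memory_mb 4000 &
dgraph-ratel &

# forward Ctrl+C / kill to the three background daemons
trap 'kill $(jobs -p) 2>/dev/null' SIGINT SIGTERM
# keep the script alive so the daemons are not orphaned
wait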

Well, thanks for the update, but that seems like TERRIBLY BAD news.

How can I put customers’ data in a database that can’t survive a server crash? Power outages, user mistakes, all those things happen all the time.

One cannot assume you will be in the cloud at all times. There are many “on-premises” scenarios out there.

To me, it does not seem to be a server crash. As you said, your machine ran out of power and shut down abruptly.

The one assuming this is me; I’m just telling you what I know. Perhaps @pawan can better enlighten us on some process for recovering from a power loss.

But if you are going to run a public service, you have to ensure the machine has 99.99% uptime, and then take the measures I’ve outlined above to make sure the data is not locked or lost.

I believe that if you have not written over Dgraph’s folders, your data is still there, just locked. Let’s wait for the Dgraph guys to tell you something about it.

Thanks, Michel.

By “server crash” I did not mean a “Dgraph server” crash, but a “hardware server” crash.

Some of my projects are for customers that run their own datacenters, and those are not always a pretty thing to look at, so this kind of failure can and WILL happen; it’s just a matter of when.

This is stuff that a relational database is used to, and while sometimes you end up with a corrupted database only recoverable from backup, that is not the case most of the time.

I do expect the same kind of fault tolerance from Dgraph; for me, at least, this is a tremendous no-no.

I see your point about the locks, and I have a feeling that you’re right. Let’s wait for someone to drop in and shed more light on this.

Thank you for your thoughts.

But you can always delete the locks. They are just simple files. You have to do it before bringing the instance back up.

Fresh new day, fresh new crash =]

You’re right. Deleting the log files makes the data show up again.

P.S.: accepting donations for a new MacBook battery =] =] =]


Dgraph is not supposed to lose any data. Everything is written to a write-ahead log before returning to the user, so Dgraph should recover fine from crashes. What log file are you talking about?

The expand(_all_) not working sometimes is a bug. Can you file a GitHub issue for it so we can fix it?

Hi Pawan. Sorry, I typed it wrong. I meant “LOCK” files. After the last MacBook crash caused by the battery issue, I deleted the LOCK files inside p, z, and zw before restarting, and the data was OK.
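For reference, that was basically something like this, run from the Dgraph data folder (using the directory names on my machine):

rm -f p/LOCK z/LOCK zw/LOCK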

I will let some more crashes happen just to run more tests, and I will report back, OK?

Data loss is a blasphemous thing to suggest, I know. =]

Thanks for your good work.


Hi. After a bunch more crashes I couldn’t see any data loss, nor even the need to delete the LOCK files. I was probably messing things up like a good dumb newbie would.

Really sorry for crying wolf.

Thanks.


Nice! I’m happy to see it’s all alright.