(NO I WAS WRONG) Data lost after crash

Hi all. I have experienced this about three times now. My MacBook’s battery is running out of juice, so from time to time it just shuts down without any warning if the charger is not connected.

Well, it happens that after I power it up again, queries that previously ran fine start to return no results.

See, I can’t say for sure that the data is lost, because (and this is a feature I miss A LOT) I do not know how to ask the database whether there is anything in there at all, like a simple “select *”.
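The closest thing I could figure out is something like the query below, guessing at the built-in _predicate_ edge, but I have no idea if it is the right approach:

{
  everything(func: has(_predicate_), first: 100) {
    uid
    expand(_all_)
  }
}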

This is freaking me out, ’cause I’m moving forward with Dgraph at full speed on my project.

Another (partially) related issue is with expand(_all_). Why is it such a drama queen? If I run the query below after the crash and no data is found anymore, I would expect a simple “no results” and not an error.

{
  view(func: has(__is_view)) @filter(eq(view, "home")) {
    uid
    view
    template
    infos {
      uid
      expand(_all_)
    }
    widgets {
      uid
      expand(_all_)
    }
  }
}
(returns: Unhandled intern.node expand with parent infos)

while this

{
  view(func: has(__is_view)) @filter(eq(view, "home")) {
    uid
  }
}
(returns: Your query did not return any results)

Thanks for any help, and keep up this awesome work!

Ok, new crash.

Here is the terminal log after I restart:

MacLuiz:vshark labs$ cd domains/_common/_dgraph/
MacLuiz:_dgraph labs$ ./start
MacLuiz:_dgraph labs$ 2018/04/24 10:17:47 Listening on port 8000...
Setting up grpc listener at: 0.0.0.0:5080
Setting up http listener at: 0.0.0.0:6080
2018/04/24 10:17:47 node.go:246: Found hardstate: {Term:2 Vote:1 Commit:105 XXX_unrecognized:[]}
2018/04/24 10:17:47 gRPC server started.  Listening on port 9080
2018/04/24 10:17:47 HTTP server started.  Listening on port 8080
2018/04/24 10:17:47 node.go:258: Group 0 found 105 entries
2018/04/24 10:17:47 raft.go:411: Restarting node for dgraphzero
2018/04/24 10:17:47 worker.go:99: Worker listening at address: [::]:7080
2018/04/24 10:17:47 groups.go:86: Current Raft Id: 1
2018/04/24 10:17:47 raft.go:567: INFO: 1 became follower at term 2
2018/04/24 10:17:47 raft.go:316: INFO: newRaft 1 [peers: [], term: 2, commit: 105, applied: 0, lastindex: 105, lastterm: 2]
Running Dgraph zero...
2018/04/24 10:17:47 node.go:127: Setting conf state to nodes:1 
2018/04/24 10:17:47 pool.go:118: == CONNECT ==> Setting localhost:7080
2018/04/24 10:17:47 pool.go:118: == CONNECT ==> Setting localhost:5080
2018/04/24 10:17:47 zero.go:333: Got connection request: id:1 addr:"localhost:7080" 
2018/04/24 10:17:47 zero.go:430: Connected
2018/04/24 10:17:47 groups.go:109: Connected to group zero. Assigned group: 0
2018/04/24 10:17:47 draft.go:139: Node ID: 1 with GroupID: 1
2018/04/24 10:17:47 node.go:231: Found Snapshot, Metadata: {ConfState:{Nodes:[1] XXX_unrecognized:[]} Index:1 Term:1 XXX_unrecognized:[]}
2018/04/24 10:17:47 node.go:246: Found hardstate: {Term:2 Vote:1 Commit:11 XXX_unrecognized:[]}
2018/04/24 10:17:47 node.go:258: Group 1 found 10 entries
2018/04/24 10:17:47 draft.go:717: Restarting node for group: 1
2018/04/24 10:17:47 raft.go:567: INFO: 1 became follower at term 2
2018/04/24 10:17:47 raft.go:316: INFO: newRaft 1 [peers: [1], term: 2, commit: 11, applied: 1, lastindex: 11, lastterm: 2]
2018/04/24 10:17:47 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:47 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:47 mutation.go:190: Done schema update predicate:"_predicate_" value_type:STRING list:true 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_views" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_widgets" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 mutation.go:159: Done schema update predicate:"__is_infos" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:47 groups.go:316: Asking if I can serve tablet for: slot
2018/04/24 10:17:48 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:48 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:48 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:49 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:49 raft.go:1070: INFO: 1 no leader at term 2; dropping index reading msg
2018/04/24 10:17:49 groups.go:694: Error in oracle delta stream. Error: rpc error: code = Unknown desc = Node is no longer leader.
2018/04/24 10:17:50 raft.go:749: INFO: 1 is starting a new election at term 2
2018/04/24 10:17:50 raft.go:580: INFO: 1 became candidate at term 3
2018/04/24 10:17:50 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 3
2018/04/24 10:17:50 raft.go:621: INFO: 1 became leader at term 3
2018/04/24 10:17:50 node.go:301: INFO: raft.node: 1 elected leader 1 at term 3
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"slot" value_type:INT 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"view" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"widget" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"intent" value_type:STRING directive:INDEX tokenizer:"exact" upsert:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"label" value_type:STRING directive:INDEX tokenizer:"exact" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"info" value_type:STRING directive:INDEX tokenizer:"fulltext" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"pergunta" value_type:STRING directive:INDEX tokenizer:"fulltext" tokenizer:"term" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"mixins" value_type:STRING list:true 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"template" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"widgets" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_users" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_perfis" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"__is_regras" value_type:INT directive:INDEX tokenizer:"int" 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"perfis" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"regras" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 groups.go:316: Asking if I can serve tablet for: users
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"users" value_type:UID directive:REVERSE 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"pwd" value_type:PASSWORD 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"descricao" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"valor" value_type:STRING 
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"valor_max" value_type:STRING 
2018/04/24 10:17:50 groups.go:316: Asking if I can serve tablet for: _share_hash_
2018/04/24 10:17:50 mutation.go:159: Done schema update predicate:"_share_hash_" value_type:STRING directive:INDEX tokenizer:"exact" 
2018/04/24 10:17:50 groups.go:472: Unable to sync memberships. Error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
2018/04/24 10:17:51 raft.go:749: INFO: 1 is starting a new election at term 2
2018/04/24 10:17:51 raft.go:580: INFO: 1 became candidate at term 3
2018/04/24 10:17:51 raft.go:664: INFO: 1 received MsgVoteResp from 1 at term 3
2018/04/24 10:17:51 raft.go:621: INFO: 1 became leader at term 3
2018/04/24 10:17:51 node.go:301: INFO: raft.node: 1 elected leader 1 at term 3

MacLuiz:_dgraph labs$ 2018/04/24 10:18:37 oracle.go:87: purging below ts:10006, len(o.commits):2, len(o.aborts):0, len(o.rowCommit):0


The contents of ./start:

#!/bin/bash

dgraph zero &
dgraph server --memory_mb 4000 &
dgraph-ratel &

I see, so the problem is not directly linked to Dgraph itself, right? The crash is not caused by a common activity like queries, alters, or mutations, right?

In this situation, as I picture it, I believe Dgraph was not programmed to anticipate this. It is assumed that the Dgraph instance would be safe on a machine with 99.99% uptime. (PS: I am assuming this, speaking for myself.)

To keep your data safe, you need to at least issue a shutdown command (via HTTP) and use “Ctrl+C” in the terminal to finalize Dgraph correctly. And always make an export as a backup.
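Something like this, assuming the Dgraph server’s default HTTP port 8080 (I am going from memory on these endpoints, so double-check the docs):

curl localhost:8080/admin/export
curl localhost:8080/admin/shutdown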

If you do not follow this procedure, Dgraph will keep some lock files around and prevent a new instance from accessing the data (in its view). I believe it would be an interesting feature to have a safe process for recovering from a bad shutdown in cases like this. But at the moment there is no other way out, I believe.
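Also, note that your ./start puts all three processes in the background, so a “Ctrl+C” in that terminal never reaches them. A rough sketch of a version that stays alive and passes the signal along (untested, just an idea):

#!/bin/bash

dgraph zero &
dgraph server --memory_mb 4000 &
dgraph-ratel &

# forward Ctrl+C / kill to the three background daemons
trap 'kill $(jobs -p) 2>/dev/null' SIGINT SIGTERM
# keep the script alive so the daemons are not orphaned
wait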

Well, thanks for the update, but that seems like TERRIBLY BAD news.

How can I put customers’ data in a database that can’t survive a server crash? Power outages, user mistakes, all those things happen all the time.

One cannot assume you will be in the cloud at all times. There are many “on-premises” scenarios out there.

To me, it does not seem to be a server crash. As you said, your machine ran out of power and shut down abruptly.

The one assuming this is me; I’m just telling you what I know. Perhaps @pawan can better enlighten us on some process for recovering from a power loss.

But if you are going to run a public service, you have to ensure the machine has 99.99% uptime, and then take the measures I’ve outlined above to make sure the data is not locked or lost.

I believe that if you have not written over Dgraph’s folders, your data is still there, just locked. Let’s wait for the Dgraph guys to tell you something about it.

Thanks, Michel.

By “server crash” I did not mean a “Dgraph server” crash, but a “hardware server” crash.

Some of my projects are for customers that run their own datacenters, and those are not always a pretty thing to look at, so this kind of failure can and WILL happen; it’s just a matter of when.

This is stuff that a relational database is used to, and while sometimes you end up with a corrupted database only recoverable from backup, that is not the case most of the time.

I do expect the same kind of fault tolerance from Dgraph; for me, at least, this is a tremendous no-no.

I see your point about the locks, and I have a feeling that you’re right. Let’s wait for someone to drop in and shed more light on this.

Thank you for your thoughts.

But you can always delete the locks. They are just simple files. You have to do it before bringing the instance back up.

Fresh new day, fresh new crash =]

You’re right. Deleting the log files makes the data show up again.

P.S.: accepting donations for a new MacBook battery =] =] =]


Dgraph is not supposed to lose any data. Everything is written to a write-ahead log before returning to the user, so Dgraph should recover fine from crashes. What log file are you talking about?

The expand(_all_) not working sometimes is a bug. Can you file a GitHub issue for it so we can fix it?

Hi Pawan. Sorry, I typed it wrong. I meant “LOCK” files. After the last MacBook crash caused by the battery issue, I deleted the LOCK files inside p, z, and zw before restarting, and the data was OK.
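For reference, that was basically something like this, run from the Dgraph data folder (using the directory names on my machine):

rm -f p/LOCK z/LOCK zw/LOCK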

I will let some more crashes happen just to run more tests, and I will report back, OK?

Data loss is a blasphemous thing to suggest, I know. =]

Thanks for your good work.


Hi. After a bunch more crashes I couldn’t see any data loss, nor even the need to delete the LOCK files. I was probably messing things up like a good dumb newbie would.

Really sorry for crying wolf.

Thanks.


Nice! I’m happy to see it’s all alright.