Persistent Raft Logs

Hey @xiang90,

I’m trying to implement persistence for RAFT logs. We run many RAFT groups per server, so we want a single WAL to handle all the groups. The plan is to keep using the Memory Store based WAL that you guys provide, and use the persistent one to bring the in-memory ones (one per group) back in sync on a restart.

So, the main questions are:

  • When we take a snapshot, do we need to transmit it to the followers? In other words, do snapshots need to be synced across nodes so that the RAFT logs on the leader and the followers are exactly the same? If not, each member of the group can just snapshot on its own and there’s no need for communication.
  • How often do we need to sync to disk? I reckon every time, right? That would avoid the case where the server crashes and restarts, and we aren’t able to bring the memory state back to its last position as recorded by the leader.
  • Say we’re replaying the logs from the persistent store into memory. If we encounter a snapshot entry, we can just discard all the previous entries for the memory store, right? This would help keep the memory usage low.
  • Our snapshots don’t really contain any data, they’re just a way for us to discard RAFT entries. We noticed that you guys use a snapshotter which seems to do something similar, just storing the Index and Term. Do you store snapshots separately from the logs? Do you only replay logs from the index and term specified in the snapshots? Would that make sense to do?

In general, any suggestions you have for implementing persistent storage for RAFT logs would be very useful.

Thanks!

Cheers,
Manish

Hey @xiang90,

Before the meeting, could you maybe respond to some of these? Would help me get unstuck and move along.

No. Snapshots are node-independent (compaction is node-independent too). I think we have discussed this a little before.

It is actually up to your application. You just need to ensure that you fsync before the node tells other nodes about its state. Basically, fsync before you decide to send out a raft message. You can intentionally drop raft messages or hold them back for batching purposes.
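To make the ordering concrete, here is a minimal sketch in Go of "fsync before you send". It is not etcd's code: `Entry`, `durableLog`, `fakeLog`, `persistThenSend`, the 24-byte record header, and the use of plain strings for messages are all hypothetical names and choices made up for illustration. The only real API it relies on is that `*os.File` has `Write` and `Sync`, where `Sync` flushes the file to stable storage.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"os"
)

// Entry is a hypothetical, simplified stand-in for a Raft log entry.
type Entry struct {
	Group uint32 // Raft group ID; one WAL is shared by all groups
	Term  uint64
	Index uint64
	Data  []byte
}

// durableLog is the subset of *os.File used here. Sync maps to fsync.
type durableLog interface {
	Write(p []byte) (int, error)
	Sync() error
}

// fakeLog is an in-memory durableLog for experiments; it never touches disk.
type fakeLog struct {
	buf      []byte
	syncs    int
	failSync bool
}

func (l *fakeLog) Write(p []byte) (int, error) {
	l.buf = append(l.buf, p...)
	return len(p), nil
}

func (l *fakeLog) Sync() error {
	if l.failSync {
		return errors.New("simulated fsync failure")
	}
	l.syncs++
	return nil
}

// persistThenSend appends entries to the WAL, fsyncs once for the whole
// batch, and only then hands the outgoing messages to send.
func persistThenSend(wal durableLog, entries []Entry, msgs []string, send func(string)) error {
	for _, e := range entries {
		var hdr [24]byte // group, term, index, payload length (made-up layout)
		binary.LittleEndian.PutUint32(hdr[0:4], e.Group)
		binary.LittleEndian.PutUint64(hdr[4:12], e.Term)
		binary.LittleEndian.PutUint64(hdr[12:20], e.Index)
		binary.LittleEndian.PutUint32(hdr[20:24], uint32(len(e.Data)))
		if _, err := wal.Write(append(hdr[:], e.Data...)); err != nil {
			return err
		}
	}
	if err := wal.Sync(); err != nil {
		return err // not durable, so nothing is sent
	}
	for _, m := range msgs {
		send(m)
	}
	return nil
}

func main() {
	f, err := os.CreateTemp("", "raft-wal-*.log")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())
	defer f.Close()

	entries := []Entry{{Group: 1, Term: 2, Index: 7, Data: []byte("x")}}
	err = persistThenSend(f, entries, []string{"append-response"}, func(m string) {
		fmt.Println("sending", m)
	})
	if err != nil {
		panic(err)
	}
}
```

Syncing once per batch rather than once per entry is what makes holding messages back for batching worthwhile: the cost of one fsync is shared by every entry and message in the batch.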

Yes.

Yes. We store snapshots separately. Yes, we only replay log entries after the snapshot index. It makes sense.

Yeah, that’s how we are set up right now. Each node snapshots independently. But we weren’t sure how the leader would sync logs to a follower after the follower crashes and restarts. If the follower took a snapshot after the leader and then lost all its entries, making the snapshot the last thing the follower had, would the leader replay logs from that point onwards?

Between Snapshot, HardState and Entries in the same Ready() event, what’s the order in which they should be stored in the persistent log, so that replaying gives the same end result? I reckon Snapshot has to be first, but what about HardState and Entries? Or does it not matter, because MemoryStore is only concerned with the last HardState it encounters? (I’m not clear about the role that HardState plays.)

Yes. It will. The follower will tell the leader its offset. If the leader has all the logs starting from that offset, it sends all of them. If not, the leader sends the most recent snapshot, then all the logs following that snapshot.
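As a sketch of that decision in Go (hypothetical names throughout: `CatchUp` and `planCatchUp` are not part of any library, and "offset" is taken to mean the last index the follower still has):

```go
package main

import "fmt"

// CatchUp describes what the leader sends to a lagging follower.
type CatchUp struct {
	SendSnapshot bool   // send the most recent snapshot first
	FromIndex    uint64 // first log index to send afterwards
}

// planCatchUp decides how to bring a follower up to date.
// leaderFirst is the oldest index the leader still holds in its log,
// leaderSnap is the index of the leader's most recent snapshot, and
// followerLast is the last index the follower reports having.
func planCatchUp(leaderFirst, leaderSnap, followerLast uint64) CatchUp {
	next := followerLast + 1
	if next >= leaderFirst {
		// The leader still has every entry the follower is missing.
		return CatchUp{SendSnapshot: false, FromIndex: next}
	}
	// Some needed entries were compacted away: snapshot, then the rest.
	return CatchUp{SendSnapshot: true, FromIndex: leaderSnap + 1}
}

func main() {
	// Leader compacted up to index 10 and holds entries from 11 onwards.
	fmt.Printf("%+v\n", planCatchUp(11, 10, 15))
	fmt.Printf("%+v\n", planCatchUp(11, 10, 4))
}
```

In the scenario from the question, the follower's snapshot index is what it reports after losing its entries, so the leader either resumes right after it or falls back to its own snapshot.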

Read the doc here: https://github.com/coreos/etcd/tree/master/raft
