The blog post “Introducing Badger: A fast key-value store written purely in Go” describes why a completely new key-value store was written for Dgraph, one better suited to SSDs that stores values separately from keys (the WiscKey design).
As I understand it, a key lookup in Badger has logarithmic time cost.
Neo4j claims its Node Store allows speedy lookup by ID, by calculating the record's offset within the store file.
Wouldn't it be better to build a storage layer for Dgraph similar to Neo4j's, with constant-time access?
Is it possible to use an approach similar to Neo4j's in a distributed data store?
Could you point us to the Neo4j documentation?
We keep optimizing Badger wherever we see the opportunity. In general, I think Badger is highly optimized and works really well for Dgraph's use case, providing efficient lookups for the data and indexes we need.
Records are the format in which Neo4j represents nodes and relationships on disk. A node record is always 14 bytes, fixed size, and points to the node's first relationship and first property.
How is a node record implemented?
The node record lives on disk. It is loaded by the NodeStore and represented as a NodeRecord instance in Neo4j. These NodeRecords are then used to load information about the node into a NodeImpl object.
Why is a Node Record relevant to Neo4j?
Fixed-size blocks allow direct, fast access by internal id: for example, record #1000 is found at position 14,000 (1000 × 14). Whole regions of the store files are mapped into memory; the operating system makes portions of a file available in memory and takes care of syncing to disk, so node records can be accessed even faster. The node record is the database structure (and starting point) for the graph element of a node.
Neo4j’s storage is organized into record-based files, one per data structure – nodes, relationships, properties, labels, and so on. Each node and relationship record is directly addressable by its id.
The approach to data storage chosen in Neo4j has one very useful consequence: since all records are strictly the same size, accessing a record by identifier is cheap, because it requires no associative mapping from identifiers to record locations (hash table, tree, or the like); identifiers simply act as indexes into “arrays” of records.