Dgraph internals

Greetings,

I am interested in understanding the specifics of Dgraph’s internals and concepts. Apologies in advance if any of these questions is obvious (and perhaps should have been obvious to me), or if there are design documents that already answer them. I mean to study the codebase to figure things out eventually, but I haven’t gotten around to it yet, and it should be better to get this information from the folks who are actually designing and implementing it.

  • Are edges represented as posting lists? Take “Bruce Lee” (subject) - starred (edge) - “Enter the Dragon” (object). Would there be a posting list for a key that encodes (subject, edge) => IDs of movies (i.e. bruce_lee.starred => [IDs of movies Bruce Lee was in])? If so, how is the edge/subject encoded into a key name? edge_type_id:subject_id:version, or something like that?
  • What happens when new movies are added to (bruce lee, starred), or removed from that list? Would Dgraph fetch the existing posting list, decode/materialise it, update it in memory, and then encode it and persist it back? Would updates be queued in memory and persisted every so often? If so, would that require doing what I described above, or does Dgraph allow creating additional ‘sub’ posting lists, so that at query time multiple posting lists for the same (subject, predicate) are processed?
  • How are labels/attributes modelled on the KV store? E.g. for a node/subject “Michael Jordan” you could have attributes like height=value, team=value, country=value, etc. Are posting lists used for that as well?
  • What happens when a posting list is too long (say millions of IDs long)? Is the whole thing retrieved and then processed, or is there some scheme where parts of it can be skipped, not retrieved, or retrieved in chunks (so that you can intersect one chunk at a time with another posting list)?

Thank you very much in advance

Yes. We use <predicate, subject UID> → list of postings, where postings can have UIDs, or values.
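To make the <predicate, subject UID> → posting list mapping concrete, here is a minimal sketch in Go. The `Posting` type, the `DataKey` function, and the byte layout of the key (predicate bytes, a zero separator, then a big-endian UID) are assumptions made up for illustration; they are not Dgraph's actual types or on-disk format.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Posting is a hypothetical posting-list entry: it carries either the UID of
// a target node or a literal value.
type Posting struct {
	UID   uint64 // target node UID; 0 when the posting carries a value
	Value []byte // literal value; nil when the posting points to a node
}

// DataKey builds an illustrative key for <predicate, subject UID>.
// The layout here is an assumption for this sketch only.
func DataKey(predicate string, subject uint64) []byte {
	key := make([]byte, 0, len(predicate)+1+8)
	key = append(key, predicate...)
	key = append(key, 0x00) // separator between predicate and UID
	var uid [8]byte
	binary.BigEndian.PutUint64(uid[:], subject)
	return append(key, uid[:]...)
}

func main() {
	// A plain map stands in for the underlying key-value store.
	store := map[string][]Posting{}
	const bruceLee = 0x1 // hypothetical UID

	// UID postings: the movies reachable over the "starred" edge.
	starredKey := string(DataKey("starred", bruceLee))
	store[starredKey] = append(store[starredKey], Posting{UID: 0x10}, Posting{UID: 0x11})

	// Value posting: a literal stored under another predicate.
	nameKey := string(DataKey("name", bruceLee))
	store[nameKey] = []Posting{{Value: []byte("Bruce Lee")}}

	fmt.Println(len(store[starredKey]), string(store[nameKey][0].Value))
}
```

Putting the predicate first in the key keeps all subjects of one predicate adjacent in a sorted key space, which is one plausible reason for this ordering; the thread itself does not state the motivation.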

Every update is a mutation. We do read the existing posting list (PL), but we do not mutate it immediately. We use the PL to calculate the indices that need to be updated, and make them part of the txn. We then create multiple delta mutations (for data / indices), and write them back to disk when the txn gets committed.

Periodically, we read all the deltas and the last state to generate a new state, i.e. a new posting list. So, our reads remain fast.
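The delta-then-rollup flow described above can be sketched roughly as follows. `Delta`, `Op`, and `Rollup` are hypothetical names for this illustration, and the sketch only handles UID postings held in memory; it is not how Dgraph's code is actually organised.

```go
package main

import (
	"fmt"
	"sort"
)

// Op is the kind of change a delta records.
type Op int

const (
	Set Op = iota // add a UID to the posting list
	Del           // remove a UID from the posting list
)

// Delta is a hypothetical record of one committed change to a posting list.
type Delta struct {
	Op  Op
	UID uint64
}

// Rollup applies deltas, in commit order, to the last full state and returns
// the new full state as a sorted UID list.
func Rollup(base []uint64, deltas []Delta) []uint64 {
	set := make(map[uint64]struct{}, len(base))
	for _, uid := range base {
		set[uid] = struct{}{}
	}
	for _, d := range deltas {
		switch d.Op {
		case Set:
			set[d.UID] = struct{}{}
		case Del:
			delete(set, d.UID)
		}
	}
	out := make([]uint64, 0, len(set))
	for uid := range set {
		out = append(out, uid)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	base := []uint64{10, 11, 12}            // last complete posting list
	deltas := []Delta{{Set, 13}, {Del, 11}} // committed since then
	fmt.Println(Rollup(base, deltas))       // prints [10 12 13]
}
```

Under this scheme a write only appends a small delta, while a read has to merge the last state with the deltas that have not been rolled up yet; the periodic rollup keeps that pending set short.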

Those labels are called facets in Dgraph. They’re stored in the posting corresponding to that edge.

Currently, we don’t have any special mechanisms to deal with long posting lists. But, we do plan to binary split long posting lists.
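As a rough idea of what binary splitting could look like, the sketch below recursively halves a sorted UID list until every part fits under a size limit. `Split` and `maxLen` are hypothetical; since the feature is only planned at this point, this says nothing about how Dgraph will actually implement it.

```go
package main

import "fmt"

// Split recursively halves a sorted UID list until every part holds at most
// maxLen entries. Parts stay in sorted order, so each one covers a
// contiguous UID range.
func Split(uids []uint64, maxLen int) [][]uint64 {
	if maxLen < 1 {
		maxLen = 1 // guard against a non-positive limit
	}
	if len(uids) <= maxLen {
		return [][]uint64{uids}
	}
	mid := len(uids) / 2
	return append(Split(uids[:mid], maxLen), Split(uids[mid:], maxLen)...)
}

func main() {
	uids := []uint64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
	for _, part := range Split(uids, 3) {
		// The first UID of each part could serve as a lookup boundary.
		fmt.Println(part[0], part)
	}
}
```

Because each part covers a contiguous UID range, an intersection with another list would only need to load the parts whose ranges overlap it, which is the kind of chunked retrieval the question asks about.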


Thank you very much.

Could you perhaps point me to the released bits in the codebase that deal with facets? If you have a subject with two attributes, (country=us) and (color=blue), would country and color be two different predicates, so that you’d need two keys, (country.subject) and (color.subject)? And what would the PL for each of those hold?

Thanks again

Facets don’t become predicates. They are just attachments, and stored within the postings of the posting list.
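A minimal sketch of that idea in Go: facets live on the posting itself, so attaching them creates no new keys. The `Posting` struct, the `AttachFacet` helper, the `friend` predicate, and the string key format are all hypothetical and only illustrate the shape of the data.

```go
package main

import "fmt"

// Posting is a hypothetical posting: one edge target plus any facets
// attached to that edge.
type Posting struct {
	UID    uint64
	Facets map[string]string
}

// AttachFacet sets a facet on the posting that points at target.
// It reports whether such a posting exists in the list.
func AttachFacet(pl []Posting, target uint64, key, value string) bool {
	for i := range pl {
		if pl[i].UID == target {
			if pl[i].Facets == nil {
				pl[i].Facets = map[string]string{}
			}
			pl[i].Facets[key] = value
			return true
		}
	}
	return false
}

func main() {
	// One key for <predicate, subject>; the key format is made up here.
	store := map[string][]Posting{
		"friend|0x1": {{UID: 0x2}},
	}
	pl := store["friend|0x1"]
	AttachFacet(pl, 0x2, "country", "us")
	AttachFacet(pl, 0x2, "color", "blue")

	// Still a single key; both facets sit on the one posting.
	fmt.Println(len(store), len(store["friend|0x1"][0].Facets)) // prints 1 2
}
```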
