SubGraph refactoring: introducing QueryBlock

@minions: Here are some thoughts about the SubGraph structure:

There are several graph traversal languages: GraphQL, Gremlin, Cypher, SPARQL. Our backend should be flexible enough to accommodate all these languages eventually.

As with all languages, it is generally true that “ease of use” comes at the expense of “being restrictive”. GraphQL is easy to understand and use, but it feels limited. It is great for a “query structure” that keeps fanning out, but for anything else, it is not so great.

The above is a graph DB schema taken from here.

Fix a (person, company). Say we want to see if that person applies to that company. Say we cannot add reverse edges. How would you do it in GraphQL? Say the jobs are all entities, not values. Essentially we want to intersect person.fanout("completes").fanout("appliesTo") with company.fanout("creates"). It is not clear at all how to do that in GraphQL.

Our backend should have a structure which you can intersect easily. This structure (which we shall call QueryBlock) should be as simple and as small as possible, so that any operations, say an intersection or even a groupBy have fewer fields to track and maintain.

Let’s take a look at the current SubGraph structure.

type SubGraph struct {
	Attr     string
	Count    int
	Offset   int
	AfterUid uint64
	GetCount uint16
	Children []*SubGraph
	IsRoot   bool
	GetUid   bool
	isDebug  bool
	Query  []byte // Contains list of source UIDs.
	Result []byte // Contains UID matrix or list of values for child attributes.
}

There are very few fields that QueryBlock needs. In particular, we only need: Attr, Query, Result. There is no need to keep Children.

Imagine we have this new structure

type QueryBlock struct {
	Attr	string
	Query	[]byte
	Result	[]byte
}

Actually, since in both postTraverse (for JSON) and preTraverse (for proto), we need to parse and create task.Query and task.Result, and parsing is very fast for flatbuffers, why not just cache these very handy structures?

type QueryBlock struct {
	Attr	string
	Query	*task.Query
	Result	*task.Result
	QueryB	[]byte
	ResultB	[]byte
}

There should be another structure which is like a list of sorted entities. Actually, that is just task.Query. The intersect function should return a task.Query or something similar. Now, to perform the intersection, we need two sorted list of unique UIDs from each QueryBlock. There are two choices here. Either we store the sorted list of unique destination UIDs in QueryBlock, or we don’t. I am not sure about this. Say we do store the dest UIDs in QueryBlock and get:

type QueryBlock struct {
	Attr	string
	Src	*task.Query
	Link	*task.Result
	SrcB	[]byte
	LinkB	[]byte
	Dest	*task.Query
}

Dest can be lazily initialized. I have taken the liberty of renaming Query as Src because what we care about in QueryBlock is the sorted list of unique source UIDs. And task.Result is a link between Src and Dest UIDs. Finally, this is how intersect can look like:

func Intersect(q1, q2 *QueryBlock) *task.Query

You might wonder: should we return []byte instead? Since flatbuffers parse really fast, and we often need its parsed structure, what if we just wrap it and make it smarter and stopping asking questions like this? Developer productivity is also important. The structure will look like the following.

type wrappedTaskQuery struct {
	b	[]byte		// b for bytes.
	p	*task.Query	// p for "parsed"
}

p can be lazily initialized, so that we parse it at most once. Now, QueryBlock can be just:

type QueryBlock struct {
	Attr	string
	Src	*wrappedTaskQuery
	Link	*wrappedTaskResult
	Dest	*wrappedTaskQuery
}

What about the rest of SubGraph? It seems to me that SubGraph is all about fanning out. It can be a companion to QueryBlock. Ideally it should look like this:

type SubGraph struct {
	Attr		string
	Children	[]*SubGraph
	Qb		*QueryBlock
}

To summarize, this is starting to look like a simple refactoring of task.Query, task.Result out of SubGraph so that we can build more logic around QueryBlock and not have to worry about the actual graph structure or query structure. I have swept a lot under the rug, e.g., Count, Offset, etc, but let’s defer that.

The task.Query pointer is a derivative information, so I’m not sure of the value of storing it in the struct itself. If the parsing was slow, like in protocol buffer, then it might sense to cache it. But, in the case of Flatbuffers, parsing is as simple as generating a single pointer variable. So, there’s no need to cache it.

uo := flatbuffers.GetUOffsetT(buf)
q := new(task.Query)
q.Init(buf, uo)

https://github.com/dgraph-io/dgraph/blob/master/task/Query.go#L12

If you think it would help, you could just create a helper function around this 3 line code in x.go.

We still do need the byte arrays because it’s helpful for the network communication. Whenever we need to send something off, we just copy the bytes over to network connection and we’re done.

So far, we’ve avoided having the next query within the same struct, because they belong to the children of the current struct. This ensures each struct instance contains the query it needs to execute, and the result it has. So, the parent preps up it’s children by creating the next query for them in their structs, as opposed to storing it in it’s own. I’d argue that’s a cleaner design.

Regarding intersect, I think the problem that you’re facing with SubGraph struct is that you have 2 structs which are on the same level, as opposed to having a parent-child relationship. The solution to me is relatively simple, have a list of siblings.

type SubGraph struct {
  Children []*SubGraph
  Siblings []*SubGraph
  // Other fields
}

So, when creating the queries for the children, you’d just intersect the results of the siblings first. Though, at some point we might want to do unions, so see my updated struct below.

I do see merit in creating a new struct because SubGraph seems to be spilling. This is what I’d propose.

type Params struct {
  Count int
  Offset int
  AfterUid uint64
  GetCount uint16  // 16 bits?
  GetUid bool
  isDebug bool
}

type SubGraph struct {
  Attr string
  Params Params
  Children []*SubGraph
  Intersect []*SubGraph
  // Later maybe: Union []*SubGraph
  IsRoot bool  // This field stays here because it's inherently attached to the subgraph node and not a parameter.
  Query []byte
  Result []byte
}

You could create one more struct out of this, which holds the query and the result, but that doesn’t help much, IMO. You replace 2 lines with 1 line – it’s not much of a saving in the SubGraph struct.

Adding siblings / intersect is making SubGraph even more complicated.

Imagine a query language that allows user to do something like:

b = a.fanout(..)
c = b.fanout(..)
bb = aa.fanout(..)
cc = bb.fanout(..)
d = intersect(bb, cc)
dd = a.fanout(..)
intersect(bb, dd)

Or random things like that. This is still simple stuff. What if you start to do more fancy things like groupBy? You are going to keep adding more fields to SubGraph?

If your backend has primitive operations, then it is very easy to implement the above. And to answer GraphQL queries, you just run along and execute these primitive operations. It is easier to test and easier to optimize more primitive operations. Have a new query language? No problem, just reduce it again.

Regarding wrapping around protos, I just see a lot of extra lines to convert between byte and task.Query. I find it a bit distracting. If we pass a task.Query, why not just pass it one form? Returning a wrappedTaskQuery type is more illuminating than returning a byte. Whoever needs it in byte, takes its byte version. Whoever needs it in task.Query, take its parsed version.

hmm… I think the example you’ve given makes the graph look more like a list. For e.g., the assignment of variables with b = a.fanout(..), could mean these things:

  • We refer to the subgraph generated from a, pointed to by b.
  • We return back the subgraph generated from a to the user, and store it on the client, refer to it as b.

Both cause more problems and complexity. I think the right way to think about the query language for graphs is to look for the end result. What is it the user wants, is it intersect(bb, dd)? If so, the query should contain all the steps until the intersect. Because if we’re returning the results of each step back to the user, that’d suffer from fanout problems and would be pretty inefficient.

Also, I’m not sure what you’re proposing. The initial proposal looked more like moving fields out of SubGraph, which I agree with and is what I’ve expanded on. But, you seem to be saying more than your proposal. So, I’m not clear.

What’s wrong with helper functions? x.GetQuery([]byte) *task.Query This should give you what you want, right? I feel you’re just unfamiliar with Flatbuffers and seem to think there’s a deeper problem here? We definitely could use more helper functions around them. Their API is cumbersome.

No, we are not returning anything to the user. QueryBlock is not responsible for that. It is just to store intermediate results on the backend. It is totally internal, shielded away from the user.

This reminds me of Google Flume. There is this idea of “deferred evaluation” which is similar to what you said — you start with what the user wants. But before you get there, you build some graph structure (but no processing yet) and then you backpropagate to identify which operations need to be executed (still no processing), do some plan optimization, then you start executing the optimized plan and returning stuff. This is complicated and I’m not suggesting that we implement this “deferred evaluation” feature yet.

The idea of “fanout” is the same as what we are currently doing for GraphQL. This is just a refactoring. We are not returning anything to the user at this point. For example, ToJSON will process all these QueryBlocks, join them using the structure it is given from GraphQL, and then return to the user. For other query languages, the process of massaging QueryBlocks into results for users might be different.

But at the level of QueryBlocks, we do not worry about graph structure much. We are only concerned about primitive operations around them such as intersect, union, diff, inner join, outer join, etc etc.

As for flatbuffers, yes I haven’t been using x.GetQuery. Yes, there’s nothing wrong with helper functions. I don’t think there is a big problem here. I think I am just not used to seeing byte being passed around as if they are proto buffers. Just ignore me for this one.

I suspect the QueryBlock concept would eventually converge to what we already have, i.e. SubGraph. But, if you think that’s not the case, happy to have a look at a PR which implements the intersection with QueryBlock. I think the hard part here isn’t the internal structure, it’s expressing the intersect in GraphQL, which doesn’t have that concept.

Regarding other languages, they’re simpler than GraphQL (I’m thinking of Gremlin and Cypher – both deal with lists of things, and not subgraphs). So, I wouldn’t worry too much about them. Let’s focus on one language and make that work first.

Regarding intersect in GraphQL, I am hesitant about hacking the language to make intersections work. I actually think Gremlin is more versatile than GraphQL. For example, you can “intersect” like this:

I am not too familiar with the language yet. My feeling is that the most powerful query language is something like what I wrote above where users have access to low level operatives and we do some optimization for them. For example, if something is not used at all, we don’t process it. If some operations can be combined, we combine them, etc.

GraphQL is more powerful in the sense that you can expand out by many attributes at one time. However, when you need to treat different attributes differently under the same parent, it is very clumsy. You have the best of both world with a language like the above.

If I do write the PR, I will try my best not to make it become SubGraph. I hope you are not thinking that I am doing a major rewrite. It is just a small refactoring so that all these operations can be implemented more naturally with less clutter (in my opinion…)

Gremlin doesn’t support the type system, schema validation, or fragments. It’s a very basic language which works at the edge expansion level. At least based on my understanding.

GraphQL, OTOH, can make diverse edges attached to one node, feel like querying a document.

Actor(_uid_: 0x01) {
}

Something simple like this can give you all the relevant fields corresponding to Actor type in a document like format. Doing this in Gremlin requires setting all the fields that you need by hand, and then validating them to ensure that each entity does contain all the fields mandated by Actor type; discarding the entities which don’t.

I’d encourage you to read the GraphQL spec and then make your mind about whether you think GraphQL is less versatile than Gremlin.

The thing that’s going on for Gremlin is that it is a proper graph query language, and hence readily allows you to do intersections. I think if we spend some time thinking about it, we can easily make GraphQL do that.

Hmm… So, you’re proposing creating a new language? I think you’ll have a better chance of executing extending the GraphQL spec to do what you want.

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.