Some generic questions

We are about to add some reasonably traditional social networking features to our app. Think timelines, fan-out-on-read newsfeeds, likes, shared posts, etc. As such, we are currently evaluating backend tech for this.

In the past we have always used several layers to implement these sorts of features: Elasticsearch/Redis/Hazelcast managing relationships over distributed Bigtable-style stores like Cassandra/DynamoDB, etc. However, the idea of doing away with that complexity and simply having a single graph database do it all is exactly what we would like to look at going forward.

Current, more mature graph databases suffer from compromises in one area or another, the main one being that they are not distributed, or, in the case of Titan, not always performant enough to stand alone and serve user-facing queries without adding in some caching-layer magic.

Dgraph is ‘exactly’ what we are looking for, at least in vision and concept.

But given the early stages of this project we have a few questions before we start playing with it.

  1. Given that Dgraph is not yet feature-complete, would it be possible to model a simple time-ordered, fan-out-on-read newsfeed of friends’ activities/posts?

  2. Would you expect Dgraph to perform at scale while serving such feeds directly to users, without any kind of caching layer between the app and the database?

  3. Does Dgraph suffer from the ‘supernode’ problem that most other graph databases do? I.e. how does it deal with a single entity that has 1 million+ edges connected to it? Does having so many relationships on a single entity slow down all traversals across that entity? What about different edge/relationship types? If we have 1 million [SHARED] edges on a single node/entity, does it affect the performance of traversing the [LIKED] edges? Titan uses vertex-centric indexes to work around this issue, but from what little I’ve read on Dgraph, your data structure prevents this from becoming an issue. Is my assumption correct?

  4. How does the initial starting entity lookup work? Most graph databases allow you to index or add unique constraints to starting entities so that the initial lookup, before traversal, is pretty much instantaneous at any size. How does this work in Dgraph?

  5. I can see that support for limiting and paging results is in place, but what about sorting results by timestamp, or by UTC date? If not, can it be expected soon?

  6. Development on Dgraph appears to be progressing quite quickly. How would the upgrade procedure work? Would our graph data remain intact as we update releases?

  7. Dgraph is designed from the ground up to be distributed, hence our interest. But what does the horizontal scale-out procedure look like? I can see that you have auto-discovery on your roadmap, which is great, but what about re-sharding existing data? How would that work as we add new nodes to our cluster?

Sorry for the long post and the huge number of questions. But we need to balance our enthusiasm with pragmatism, and really understand what we can ‘currently’ do with Dgraph before investing hours getting to grips with it.

Regardless, we find this project incredibly exciting and will be watching closely.

Cheers!


Hi @GordyR,

Seems like a very apt use case.

Your enthusiasm for the project definitely makes me very glad!

On your first question, the time-ordered newsfeed: there are multiple ways of doing this. Do you want pagination from the database? If so, @jchiu is working on that code as we speak. If you don’t need pagination and can sort the whole thing in the client, then that works right now.

The very basic query to do this is:

{
  me(_uid_: 0x0a) {
    friend {
      name
      post
      like
    }
  }
}
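
If you go the client-side sorting route for now, you could fetch a timestamp along with each post and order the feed in your app. A minimal sketch, assuming posts are modeled as entities with text and timestamp edges (those edge names are illustrative, not prescribed by Dgraph):

{
  me(_uid_: 0x0a) {
    friend {
      name
      post {
        text
        timestamp
      }
    }
  }
}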

On serving feeds without a caching layer: yes, I’d expect that. Dgraph on principle doesn’t cache query results. But it does cache the underlying posting lists, as long as they fit within the RAM budget allocation (set by default to 4GB) – we’ve noticed that this improves query latency considerably.

On the supernode question: it depends. These 1+ million edges, do they have the same predicate? If not, then definitely not a problem. If they all have the same predicate (friend / share / like, etc.), then we’ll have a big posting list of 1+ million uint64s, which is a whopping 8MB! Just kidding: at 8 bytes per uint64, a million edges is only about 8MB, which is not that much, and Dgraph is designed to support very large posting lists. So, no, that shouldn’t be a problem.

As for whether so many relationships on a single entity slow down all traversals across that entity: no. We’re predicate-sharded (relationship-sharded), not entity-sharded.

Nor do 1 million [SHARED] edges affect traversal of the [LIKED] edges. Different predicates don’t affect each other directly (one could argue about machine utilization being an indirect effect, but that’s a separate point).

So your assumption about our data structure is largely correct. I can’t think of any particular reason for Dgraph to have performance issues under this workload.

On the initial entity lookup: that depends on how you structure your data. You could have a root node with an _xid_:root, which is easier to remember than a 64-bit integer, and point things outwards from this root to other data, to allow for easier queries.
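
A minimal sketch of that root pattern (the user edge name is ours, purely illustrative):

mutation {
  set {
    <root> <user> <john.smith> .
    <root> <user> <jane.doe> .
  }
}

Queries can then start at _xid_: root and fan outwards from there.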

If you can talk about how specifically you’re going to implement this, I can give you better suggestions and ideas about how well it would work.

On sorting: @jchiu is working on it and has a pending PR. It should be included in the next release, which we’re aiming for in a couple of weeks or so.
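
Once that lands, a sorted and paginated feed query might look something like this; treat the syntax as tentative while the PR is pending (orderdesc and first here are guesses at the final argument names):

{
  me(_uid_: 0x0a) {
    friend {
      post(orderdesc: timestamp, first: 20) {
        text
        timestamp
      }
    }
  }
}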

On upgrades: we’ll allow backups of Dgraph into the standard RDF format, which any version of Dgraph can readily accept. That way, you won’t have to worry about backwards compatibility of the underlying binary files. Having said that, starting from v1.0 we’ll be more careful about backwards compatibility.
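
Such an export is just plain RDF triples, one per line. For example (with made-up UIDs):

<0x01> <name> "John" .
<0x01> <friend> <0x02> .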

On horizontal scale-out: I’ve been working on implementing RAFT for Dgraph over the past month. What you can do is start one server, and then point others to it. They’d automatically replicate the data and start getting write updates from each other.

When you add new nodes to the cluster, the way it’s currently being designed for the next release is that you tell each node to pick up certain RAFT groups. Multiple Dgraph predicate shards constitute one RAFT group, and the constitution is decided by a schema that you, the user, can specify. For example, if you want all names to be handled by only one group, separate from all the other data, you can do that, and then assign servers which only handle the name group. This also allows you to have more or fewer replicas for different groups, as you deem fit.

Overall, we try not to be too smart and allow a lot of freedom in terms of how many groups are present and which servers serve these groups, at the cost of slight inconvenience. If you don’t want to care about any of this, then just use the defaults, which puts everything in one group, and all the servers would be complete replicas of each other. Sorry for the long explanation.
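
To make the grouping idea concrete, a group-assignment schema might look something like this; the exact format is still being designed, so none of this syntax is real yet:

# illustrative only; the real format is still being designed
group 1: name           # all <name> data served by its own group
group 2: friend, post   # keep the social edges together
default: group 3        # everything else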

Re-sharding – you’ll have to clarify what you mean. For me, that means changing the groups that predicate shards are part of, which would incur downtime. But if you just mean allowing more or fewer replicas, that part is very easy.

This seems like a very good use case, so I’m eager to help. If you guys want to have a VC with the team, happy to do that as well.

Cheers,
Manish


Thanks for the fantastic, prompt, and detailed reply, @mrjn.

That pretty much covers all our initial questions, and has me extremely excited.

Regarding the ‘entry’ entity: I’m still not quite grasping your answer here. The majority of our queries will be starting from a “User” entity:

(User: {userid: John.Smith})-[:FRIEND]-(myfriends)-[:FRIEND]-(friendsoffriends)

In databases like Neo4j or TitanDB we would add an index to all “User” entities on the “userid” property. That way we can go straight to (User: {userid: John.Smith}) and begin our graph traversal from there.

In our app’s case, our userids are currently unique strings based on a user’s actual real-life name, so that they map closely to their profile pages’ pretty URLs, just like Facebook. So the userids are john.smith, jane.doe, john.smith1, john.smith321; we simply increment the number suffix as names are repeated.

As I was writing this I went back through the documentation, and I think I have just answered my own question. It appears that the _xid_ is something that “we” ourselves assign, right?

And under the hood Dgraph assigns its own _uid_?

In which case these _xid_s that we assign need to be unique, I take it?

So we should simply be able to do:

mutation {
  set {
    <john.smith321> <type> <User> .
    <john.smith321> <firstname> "John" .
    <john.smith321> <lastname> "Smith" .
  }
}

With the “User” part of the schema being like so:

type User {
  firstname: string
  lastname: string
  friends: Friend
  likes: Like
}

And we can go straight to the entity john.smith321 via its _xid_ very efficiently, without any sort of scan, right? What sort of time complexity is this lookup? Is it the same for an _xid_ as for a _uid_ lookup?

Sorry for all the questions; although the learning curve for Dgraph seems quite gentle, it is still a little different conceptually from anything I’ve used before.

There’s no concept of indexing entities in Dgraph. Indexing for us means indexing on values, which only comes into play when we need to tokenize strings for search, or to allow sorting on values (like datetimes, integers, floats, etc.).

You can use _xid_ to represent your userids, and just run queries directly like:

# This would return firstname and lastname for john.smith.
{
  User(_xid_: john.smith) {
  }
}

Note that _xid_s are converted internally to UIDs via a fingerprint method (to uint64s), which, as you can imagine, is prone to collisions. We make no effort to detect those. This shouldn’t be a problem in typical use cases, but it is something to keep in mind. Also note that all results come back as UIDs, and we can’t convert them back to XIDs. If you really need the userids, then I’d advise you to store them along with the entities, as an extra edge, like so: <john.smith> <userid> "john.smith" .

This behavior was put in place after the last release, to avoid complexity in the RAFT implementation, and also because we consider XIDs to be second-class citizens.
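
To make that concrete, a small sketch of storing the userid as an extra edge and reading it back (using the userid edge suggested above):

mutation {
  set {
    <john.smith> <userid> "john.smith" .
  }
}

{
  User(_xid_: john.smith) {
    userid  # returned as a stored value; result UIDs can't be mapped back to XIDs
  }
}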

Note that the type system is solely there to aid in data validation and retrieval for the end user; it has little effect on how we store data internally. By specifying the type User in the schema, the way you describe, you can query for all scalar User properties without having to explicitly ask for scalar fields like firstname, lastname, etc. Also, we’ll do validation to ensure that we can find both the firstname and the lastname for an entity to consider it of type ‘User’. If some entity only has a firstname, but not a lastname, that entity won’t be returned.

This is because Dgraph’s type system is dynamic in nature. If we find all the edges mentioned in the type schema, then that entity is of that type; otherwise, it is not. We don’t treat the edge <john.smith321> <type> <User> . as holding any special meaning.
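
For example (our own illustration), given your User schema above, an entity with only a firstname edge would fail that validation and not come back as a User:

mutation {
  set {
    <jane> <type> <User> .
    <jane> <firstname> "Jane" .
    # no <lastname> edge, so <jane> won't be returned as a User,
    # despite the explicit <type> edge
  }
}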

You can have a look at the Person example here:
https://wiki.dgraph.io/Queries_and_Mutations#Schema_File

XID -> UID is a simple hash conversion, which is a pure CPU operation. All further lookups are then based on predicates, which is when we start hitting the persistence layer.


Fantastic, I realised this halfway through writing my reply, and you’ve just confirmed it.

Thanks again @mrjn, you’ve been a great help.


Hey @GordyR,

Curious whether you’ve had a chance to play with Dgraph, and what your experience has been.
