Gremlin is a graph traversal language. It uses Pipes to perform complex graph traversals.
Here are a few Gremlin examples to get the hang of it.
- g.v(1); // get vertex by Id
- g.V[1..100] // range filter: get the vertices at positions 1 to 100 of the stream (by position, not by id)
- g.v(1).firstName; // get the attribute of vertex by id
- g.V('firstName','John'); // get vertices whose firstName is John
- g.V('firstName','John').count(); // get the count of those vertices
- g.v(1).outE('friend'); // outgoing friend edges
- g.v(1).inE('friend'); // incoming friend edges
- g.V.and(_().has("age", T.gt, 25), _().has("age", T.lt, 35)); // find all people with age between 25 and 35
- g.V.interval("age", 25, 35); // almost the same as above, but interval() includes the lower bound (25 <= age < 35)
- g.V.has('email', null) // people who don't have email addresses
In addition to the above, Gremlin supports graph manipulation queries: adding edges and nodes, and creating indexes.
But a graph algorithm is not just about getting data. It's about seeing how paths intersect, overlay, and loop amongst themselves and each other. Paths tell you a lot about the "structure" of your data, i.e. its topological statistics. PageRank, Betweenness, Eccentricity, Eigenvectors, Recommendations, Spreading Activation, and so on: these algorithms are all about the paths, not the data.
Gremlin and Cypher let you specify such queries, which we can't do in GraphQL.
http://tinkerpop.apache.org/docs/3.2.1-SNAPSHOT/recipes/#shortest-path
g.V(1).repeat(out().simplePath()).until(hasId(5)).path().limit(1)
The traversal starts at the vertex with the identifier "1" and repeatedly traverses out edges until it finds a vertex with the identifier "5".
Here simplePath() ensures that a traverser does not revisit a vertex that is already on its path, so it cannot loop.
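To make the semantics of that traversal concrete, here is a minimal Python sketch of the same idea, not Gremlin's actual implementation. It assumes a small made-up adjacency map (`OUT_EDGES`) and a hypothetical helper `first_simple_path`; the comments map each line to the Gremlin step it loosely mirrors.

```python
from collections import deque

# Hypothetical directed graph: vertex id -> ids reachable over out edges.
OUT_EDGES = {
    1: [2, 3, 4],
    2: [],
    3: [],
    4: [3, 5],
    5: [],
    6: [3],
}


def first_simple_path(out_edges, start, target):
    """Breadth-first search over simple paths; returns the first path reaching target."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in out_edges.get(path[-1], []):  # out()
            if nxt in path:                      # simplePath(): drop paths that revisit a vertex
                continue
            new_path = path + [nxt]
            if nxt == target:                    # until(hasId(target))
                return new_path                  # path().limit(1)
            queue.append(new_path)               # repeat(...)
    return None


print(first_simple_path(OUT_EDGES, 1, 5))
```

Because the search is breadth-first, the first path that reaches the target is also one of the shortest ones, which is why limit(1) is enough in the recipe.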
Gremlin supports many types of traversals.
Gremlin also supports a declarative form of query in addition to the imperative form.
example: “Who created a project named ‘lop’ that was also created by someone who is 29 years old? Return the two creators.”
g.V().match(
__.as('creators').out('created').has('name', 'lop').as('projects'), //(1)
__.as('projects').in('created').has('age', 29).as('cocreators')). //(2)
select('creators','cocreators').by('name')
explanation: Find vertices that created something and match them as creators, then find out what they created which is named lop and match these vertices as projects. Using these projects vertices, find out their creators aged 29 and remember these as cocreators. Return the name of both creators and cocreators.
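As a rough illustration of what the two match patterns compute, here is a small Python sketch. The sample data is an assumption modeled on TinkerPop's "modern" toy graph, vertices are identified by name for brevity, and `creators_and_cocreators` is a hypothetical helper, not part of any Gremlin API.

```python
# Assumed sample data, modeled on TinkerPop's "modern" toy graph.
AGES = {"marko": 29, "vadas": 27, "josh": 32, "peter": 35}
CREATED = [  # (person, project) pairs, i.e. "created" edges
    ("marko", "lop"),
    ("josh", "lop"),
    ("josh", "ripple"),
    ("peter", "lop"),
]


def creators_and_cocreators(ages, created, project_name, cocreator_age):
    """Pair every creator of project_name with each co-creator of the given age."""
    results = []
    for creator, project in created:
        if project != project_name:          # pattern (1): creators -> projects named project_name
            continue
        for cocreator, other_project in created:
            # pattern (2): projects <- cocreators with the requested age
            if other_project == project and ages.get(cocreator) == cocreator_age:
                results.append({"creators": creator, "cocreators": cocreator})
    return results


for row in creators_and_cocreators(AGES, CREATED, "lop", 29):
    print(row)
```

The nested loops fix an evaluation order by hand. The declarative match() form leaves that ordering to Gremlin.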
Gremlin is both a language and a virtual machine, much like Java and the JVM. All graph languages supported by Gremlin are converted to Gremlin bytecode, which is executed by the Gremlin VM.
TinkerPop3 Documentation (Instruction set for gremlin bytecode)
With a TinkerPop-enabled stack, any TinkerPop-enabled graph language can be used with Dgraph. Any TinkerPop-enabled graph processor (Spark or Hadoop Giraph) can be used to run OLAP queries over Dgraph, such as computing PageRank.
A Gremlin traversal machine has a collection of traversal strategies. Some of these traversal strategies are specific to Gremlin (e.g. optimization strategies) and some are specific to the underlying graph system (e.g. provider optimization strategies). Gremlin-specific traversal strategies rewrite a traversal into a semantically-equivalent, though (typically) more optimal form. Similarly, provider-specific strategies mutate a traversal so as to leverage vendor-specific features such as graph-centric indices, vertex-centric indices, push-down predicates, batch-retrieval, schema validation, etc. Now that a traversal is represented in the graph as vertices and edges, Gremlin can traverse it and thus, rewrite it.
A classic example of a provider-specific strategy is index lookups: g.V().has('name','marko') can be a single index call as opposed to iterating over all vertices and filtering them one by one.
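The idea of such a strategy can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a traversal is a list of step tuples, and `index_lookup_strategy`, `execute`, and the `indexLookup` step are hypothetical names, not TinkerPop classes. The rewrite replaces a full scan plus filter with one index call and yields the same result.

```python
from collections import defaultdict

# Hypothetical vertex store.
VERTICES = [
    {"id": 1, "name": "marko"},
    {"id": 2, "name": "vadas"},
    {"id": 3, "name": "josh"},
]


def build_index(vertices, key):
    """Build a simple property-value -> vertices index."""
    index = defaultdict(list)
    for vertex in vertices:
        index[vertex[key]].append(vertex)
    return index


NAME_INDEX = build_index(VERTICES, "name")


def index_lookup_strategy(steps):
    """Rewrite V() followed by has('name', x) into a single indexLookup step."""
    if (len(steps) >= 2 and steps[0] == ("V",)
            and steps[1][0] == "has" and steps[1][1] == "name"):
        return [("indexLookup", "name", steps[1][2])] + steps[2:]
    return steps


def execute(steps):
    """Run the toy traversal; returns (matching vertices, number of vertices scanned)."""
    current, scanned = [], 0
    for step in steps:
        if step[0] == "V":                    # full scan of all vertices
            current = list(VERTICES)
            scanned += len(VERTICES)
        elif step[0] == "has":                # filter one by one
            _, key, value = step
            current = [v for v in current if v.get(key) == value]
        elif step[0] == "indexLookup":        # single index call, no scan
            _, _key, value = step
            current = list(NAME_INDEX.get(value, []))
    return current, scanned


plain = [("V",), ("has", "name", "marko")]
optimized = index_lookup_strategy(plain)
print(execute(plain))
print(execute(optimized))
```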
The implementation of TinkerPop's core API (Java) and its validation via the gremlin-test suite is all that is required of a graph system provider wishing to provide a TinkerPop3-enabled graph engine. The core API consists of:
- Structure API: Graph, Element, Vertex, Edge, Property and Transaction (if transactions are supported).
- Process API: TraversalStrategy instances for optimizing Gremlin traversals to the provider's graph system (e.g. TinkerGraphStepStrategy).
I was looking through the various ways in which Dgraph can be integrated with the TinkerPop stack.
There is a discussion on the gremlin-users Google Group regarding this.
Option 1: The implementation of TinkerPop's core API would be a Java client that optimizes and translates Gremlin steps into GraphQL and passes them to Dgraph. Here we can take advantage of Gremlin's parser and only need to implement a traversal strategy.
This approach is used by Sqlg (GitHub - pietermartin/sqlg: TinkerPop graph over SQL).
Not everything can be expressed in subgraph format.
Option 2: Implement the RemoteConnection interface in Dgraph, which accepts Gremlin bytecode via GraphSON (JSON), and let Dgraph internally convert it to a subgraph and process the query.
Same issues as Option 1.
Option 3: Wrap Dgraph in a Java API that implements TinkerPop's structure API (there would be an issue of Java-Go binding, though) and run a Gremlin Server on the host for remote connections. Traversal here would be controlled by Gremlin through the Java APIs. Gremlin Server is the part of the TinkerPop stack that accepts Gremlin bytecode from Gremlin language variants and performs the traversal. Without Gremlin Server there would be too many network calls. This option is the normal implementation of their core API. Alternatively, if we connect to the database directly from the client without Gremlin Server, we would need better strategies for bulk operations to avoid network calls.
We would immediately be TinkerPop-compliant, and over time we can create more and better strategies to do bulk operations and limit network bandwidth.
We won't have much control over the traversal. Take simplePath() as an example: since predicates are sharded in Dgraph, all the edges of a given predicate reside on a single machine, so Dgraph can find a path between two nodes with only one network call.
For simple queries, I think there is no issue in letting Gremlin control the traversal.
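To illustrate the network-call concern, here is a toy Python sketch that counts round trips. `CountingStore` is a hypothetical stand-in for a remote graph store, not a Dgraph or TinkerPop API. The client-driven search pays one call per adjacency lookup, while the store-side search is modeled as a single call.

```python
from collections import deque


def bfs_path(neighbors, start, target):
    """Breadth-first search over simple paths using the given neighbor function."""
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        for nxt in neighbors(path[-1]):
            if nxt in path:
                continue
            if nxt == target:
                return path + [nxt]
            queue.append(path + [nxt])
    return None


class CountingStore:
    """Hypothetical remote graph store that counts round trips."""

    def __init__(self, out_edges):
        self._out = out_edges
        self.calls = 0

    def out(self, vertex_id):
        self.calls += 1  # one round trip per adjacency lookup
        return list(self._out.get(vertex_id, []))

    def find_path(self, start, target):
        self.calls += 1  # the whole search runs store-side in one round trip
        return bfs_path(lambda v: self._out.get(v, []), start, target)


EDGES = {1: [2, 3, 4], 2: [], 3: [], 4: [3, 5], 5: [], 6: [3]}

client_side = CountingStore(EDGES)
print(bfs_path(client_side.out, 1, 5), client_side.calls)

store_side = CountingStore(EDGES)
print(store_side.find_path(1, 5), store_side.calls)
```

On this tiny graph the gap is small, but the number of client-driven calls grows with the number of vertices expanded, which is the cost that bulk-operation strategies or store-side traversal aim to avoid.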
According to Marko, you can always implement optimizations in Gremlin. Gremlin's compiler model (TraversalStrategies) is very flexible.
Edit:
Latest comment from Marko:
Yes! That is the point. Implement the TinkerPop structure API first and Gremlin will just work. Stage 1 complete. At this point, Apache TinkerPop will have full control of the execution. However, when you want to optimize, you write a strategy. This will delegate certain aspects of the traversal’s execution over to the vendor. For things the vendor can do better than Apache TinkerPop, it does. For things that the vendor can’t do better, Apache TinkerPop handles it.
I don't know how many strategies Sqlg has, but I believe DSEGraph has 3: one for global index use (g.V().has('name','marko')), one for vertex-centric index use (outE().has('time',gt(2001))), and one for batch "get" of vertex properties.
HTH,
Marko.
It seems TinkerPop is flexible enough to let the graph vendor control some parts of the traversal.
Option 4: Implement the Gremlin machine in Go.
This is tedious and a very big project. If we go down this path, Marko Rodriguez has decided to help with the implementation of a Gremlin mini.