Query variable scalability and performance

Hi,

I’m trying to build a multilingual dictionary. For now, it is made up of Dictionaries, Words and Meanings.

An entry in an English-German dictionary for the word “apple” would be like:

{
	"content": "apple",
	"meaning": [
	  {
	    "translation": [
	      {
	        "content": "Apfel"
	      }
	    ]
	  }
	]
}

The translation for “apple” in the English-Hungarian dictionary would be like:

{
	"content": "apple",
	"meaning": [
	  {
	    "translation": [
	      {
	        "content": "alma"
	      }
	    ]
	  }
	]
}

Now, my idea is that every Word and Meaning may belong to multiple dictionaries. Here “apple” belongs to my eng-ger and eng-hun dict, “Apfel” belongs to eng-ger, “alma” belongs to eng-hun, and the Meaning node belongs to both dictionaries.

This is my schema:

<content>: string @index(fulltext, term, trigram) .
<dgraph.graphql.schema>: string .
<dgraph.graphql.xid>: string @index(exact) @upsert .
<dict>: [uid] @reverse .
<indexWord>: [uid] .
<language>: string .
<meaning>: [uid] @reverse .
<name>: string .
<pos>: string .
<shortName>: string @index(exact) .
<translation>: [uid] @reverse .

This is my dataset:

{
  "set": [
    {
      "dgraph.type": "Dictionary",
      "uid": "_:eng-hun-dict",
      "name": "English-Hungarian dictionary",
      "shortName": "eng-hun-dict",
      "indexWord": [
        {
          "dgraph.type": "Word",
          "uid": "_:apple-word",
          "dict": [
            {
              "uid": "_:eng-hun-dict"
            }
          ],
          "language": "eng",
          "pos": "noun",
          "content": "apple",
          "meaning": [
            {
              "dgraph.type": "Meaning",
              "uid": "_:apple-meaning",
              "dict": [
                {
                  "uid": "_:eng-hun-dict"
                }
              ],
              "translation":[
                {
                  "dgraph.type": "Word",
                  "uid": "_:alma-word",
                  "dict": [
                    {
                      "uid": "_:eng-hun-dict"
                    }
                  ],
                  "language": "hun",
                  "pos": "noun",
                  "content": "alma"
                }
              ]
            }
          ]
        }
      ]
    },
    {
      "dgraph.type": "Dictionary",
      "uid": "_:eng-ger-dict",
      "name": "English-German dictionary",
      "shortName": "eng-ger-dict",
      "indexWord": [
        {
          "dgraph.type": "Word",
          "uid": "_:apple-word",
          "dict": [
            {
              "uid": "_:eng-ger-dict"
            }
          ],
          "language": "eng",
          "pos": "noun",
          "content": "apple",
          "meaning": [
            {
              "dgraph.type": "Meaning",
              "uid": "_:apple-meaning",
              "dict": [
                {
                  "uid": "_:eng-ger-dict"
                }
              ],
              "translation": [
                {
                  "dgraph.type": "Word",
                  "uid": "_:apfel-word",
                  "dict": [
                    {
                      "uid": "_:eng-ger-dict"
                    }
                  ],
                  "language": "ger",
                  "pos": "noun",
                  "content": "apfel"
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

Now, when I search for a word and its translation in a specific dictionary, I want to generate these dictionary entries by traversal so that the result only includes nodes belonging to the currently searched dictionary.

I use the following query. The parameter $dict can be either “eng-ger-dict” or “eng-hun-dict”. Based on this it generates either the dictionary entry with the translation “Apfel” or “alma”:

query dentry($dict: string){
  var(func: eq(shortName, $dict)) {
    ~dict @filter(eq(dgraph.type, "Meaning")) {
      MEANING_UID as uid
    }
  }

  var(func: eq(shortName, $dict)) {
    ~dict @filter(eq(dgraph.type, "Word")) {
      WORD_UID as uid
    }
  }

  dentry(func: allofterms(content, "apple")) @filter(uid(WORD_UID)) {
    content
    meaning @filter(uid(MEANING_UID)) {
      translation @filter(uid(WORD_UID)) {
        content
      }
    }
  }
}

And now my actual question(s): I wonder how scalable this solution is. Once my dictionary is fully populated, it will have millions of nodes belonging to the same dictionary. If I understand correctly how vars work, MEANING_UID and WORD_UID are calculated before my “dentry” block runs, and they could contain millions of uids which are then evaluated in the @filter expressions of “dentry()”.

How will this affect query performance? Can Dgraph handle it? How much memory will it use?

(Note: I know that for this toy database I could use uid_in(dict, 0xsome_dict_uid), which would be more efficient:

query dentry($dict: int){
  dentry(func: allofterms(content, "apple")){
    content
    meaning @filter(uid_in(dict, $dict)) {
      translation @filter(uid_in(dict, $dict)) {
        content
      }
    }
  }
} 

But for now I am trying to understand how vars work and how much I could/should use them in more advanced queries.)

Thanks!

Hi,

Did you consider using the i18n support that Dgraph provides for predicates? You could probably have a much simpler schema, something like a single predicate “word” which is a multi-language string.
This tutorial illustrates this nicely.

Yes, I am aware of the i18n string support, but in my case the Words will need all kinds of other metadata as well; the string content and Dgraph’s implicit localization alone will not be enough here. But I will take a deeper look at this. In any case, this alone would not resolve my concerns regarding query variables.
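
(If I understand the tutorial correctly, a minimal sketch of that approach would be a single predicate marked with @lang; the names here are just illustrative, not my actual schema:

<content>: string @index(fulltext, term) @lang .

{
  q(func: allofterms(content@en, "apple")) {
    content@en
    content@de
    content@hu
  }
}

This stores every localization on one node, which is exactly why it doesn’t fit my Words-with-metadata model.)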

Can you please share the definition of types in the schema as well? It seems you have defined type “Dictionary”, “Word”. It would help me understand the schema better.

Sure (although I had some trouble figuring out from the documentation how to do this correctly, and in Ratel’s “Types” editor I’m unable to set the type for fields of my own types):

type Dictionary {
  name
  shortName
  indexWord
}

type Meaning {
  dict
  translation
}

type Word {
  dict
  content
  pos
}

I, too, have stumbled upon this problem, which I call filtering based on nested node value.

I tried to decompose the problem, and this is what I came up with:

  • let’s assume you have the Dictionary as a PREDICATE; you can then @filter(eq(shortName, "value")), which is indexed and efficient.
  • we have it, however, as a NODE, and unfortunately there is no @filter(eq(dict.shortName, "value")) as far as I am aware
  • this can be somewhat simulated with cascade. Below is a query equivalent to yours using the cascade directive; however, I’m not sure how efficient cascades are. I think they’re better than filtering based on huge uid lists, but I’d love to have a better alternative for this use case
query dentry($dict: string){
  DICT as var(func: eq(shortName, $dict))

  dentry(func: allofterms(content, "apple")) @cascade {
    content
    dict @filter(uid(DICT))
    meaning {
      dict @filter(uid(DICT))
      translation {
        dict @filter(uid(DICT))
        content
      }
    }
  }
}

Thank you @ppp225, that’s a cool solution! I was not really aware of how @cascade works.

And you are right, I was also looking for a way to do something like

@filter(eq(dict.shortName, "value"))

or even

@filter(eq(dict.someOtherNode.evenDeeperNode.someValue, "value")).

It just seems natural to me to be able to define constraints traversing any direction in the graph. (Hasura is not a graph database, but such things can be done nicely there.)

I also wonder how efficient @cascade is, but it is probably more efficient than a var with millions of uids.

I also liked how you used uid_in; however, we don’t always know the uid when writing the query. What would really solve this issue is the ability to use uid_in with variables, like so. I think this would be the most efficient way to run this query.

query dentry($dict: string){
  DICT as var(func: eq(shortName, $dict))

  dentry(func: allofterms(content, "apple")) @filter(uid_in(dict, DICT)) {
    content
    meaning @filter(uid_in(dict, DICT)) {
      translation @filter(uid_in(dict, DICT)) {
        content
      }
    }
  }
}

I think that @cascade could, in theory, optimize it this way. The docs say:

With the @cascade directive, nodes that don’t have all predicates specified in the query are removed.

Based on this definition, the cascade version of the query traverses all possible combinations, even nested ones, and then removes incomplete results at the end. But it could optimize the query by removing them during query execution, basically working like uid_in.


Yep, but unfortunately this is not possible at the moment; uid_in only accepts a static <uid> value. In my case it would be OK to look up the dictionary first by name (or whatever) and use its uid with uid_in, but again, uid_in is just not flexible enough.

For now, what I would really like to investigate is the filtering-based-on-nested-node-value mechanism you described earlier, and it seems that query variables could be the way to do it for any level of nesting. The question is: how efficient is this, and what performance problems might we run into later?


Cascade is a relatively low-cost operation that happens towards the end of processing a sub-query. It simply drops nodes from the query result that are missing a predicate specified in the query, so using cascade shouldn’t affect the performance of your query.
A filter operation has a cost of O(number of results) in that sub-query, because we iterate over all results and keep only the ones that match the condition. So every new filter you add will add a small cost to the query.

Feel free to ask if you have any further questions. Please do mark this as resolved if it answers your questions regarding query performance.


Thanks for your answer!

Just one follow-up question if I may: is this true for all @filter operations, i.e. are they all executed sequentially?

So, in case of my original query:

query dentry($dict: string){
  var(func: eq(shortName, $dict)) {
    ~dict @filter(eq(dgraph.type, "Meaning")) {
      MEANING_UID as uid
    }
  }

  var(func: eq(shortName, $dict)) {
    ~dict @filter(eq(dgraph.type, "Word")) {
      WORD_UID as uid
    }
  }

  dentry(func: allofterms(content, "apple")) @filter(uid(WORD_UID)) {
    content
    meaning @filter(uid(MEANING_UID)) {
      translation @filter(uid(WORD_UID)) {
        content
      }
    }
  }
}

are @filter(uid(MEANING_UID)) and @filter(uid(WORD_UID)) both applied sequentially, without using any index?

No, it’s a tree that executes in parallel within a level, while the levels themselves are processed sequentially.
The original query is divided into sub-queries called sub-graphs, which have child sub-graphs inside them. This forms a forest of trees in which each tree node is a “sub-graph”. The query processor runs the query for all children of a sub-graph in parallel, but the levels are executed recursively and hence sequentially. This parallelization is achieved using goroutines.
Filters within a sub-graph are also processed in parallel.
So in your example,

var(func: eq(shortName, $dict)) {
  ~dict @filter(eq(dgraph.type, "Meaning")) {
    MEANING_UID as uid
  }
}

and

var(func: eq(shortName, $dict)) {
  ~dict @filter(eq(dgraph.type, "Word")) {
    WORD_UID as uid
  }
}

would run in parallel, and

meaning @filter(uid(MEANING_UID)) {
  translation @filter(uid(WORD_UID)) {
    content
  }
}

the two filters in this part will be sequential, because @filter(uid(WORD_UID)) is in the sub-graph of a child of the meaning predicate.


I see, thank you for the details!

Yes, you can do that now. See this PR: https://github.com/dgraph-io/dgraph/pull/5320 for an example.
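
If I read the PR correctly, a uid variable can now be passed to uid_in wrapped in uid(), so the earlier query could be written something like this (untested sketch based on the PR, not verified against a running cluster):

query dentry($dict: string){
  DICT as var(func: eq(shortName, $dict))

  dentry(func: allofterms(content, "apple")) @filter(uid_in(dict, uid(DICT))) {
    content
    meaning @filter(uid_in(dict, uid(DICT))) {
      translation @filter(uid_in(dict, uid(DICT))) {
        content
      }
    }
  }
}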
