Upsert: Will duplicate nodes be generated when updating data from multiple threads using the upsert block?

Hi @tss,

This is a popular question, so I will try to explain at length.

Dgraph does not have a notion of unique attribute values across nodes. What it does guarantee is that every node has a distinct uid (an attribute controlled by Dgraph). So when we talk about duplicates, we can only mean nodes that share the same attribute value.

Let’s imagine that Dgraph has the predicates fullName and accountBalance, and we declare that fullName must be unique within our dataset. The simplest way to avoid duplicates is to follow the logic below.

Path A
If a node X with the given fullName exists
update node X with the new accountBalance
Path B
If a node X with the given fullName does not exist
create node X
set fullName and accountBalance on node X

This, of course, is the upsert block, a construct supported by Dgraph. The upsert block can be invoked via Ratel as well as Dgraph clients.
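As a rough illustration, Path A and Path B can be folded into a single DQL upsert block. The sketch below only builds the request text; the helper name build_upsert_block is made up for this example, and the json.dumps quoting is an assumption rather than a vetted way to escape values.

```python
import json


def build_upsert_block(full_name: str, account_balance: float) -> str:
    """Return a DQL upsert block that updates or creates the node for full_name."""
    # json.dumps is used only as a simple way to quote and escape the name;
    # treat this as an assumption, not a vetted sanitizer.
    name = json.dumps(full_name, ensure_ascii=False)
    # If the query matches a node, uid(v) refers to it (Path A).
    # If nothing matches, uid(v) behaves like a blank node and a new
    # node is created (Path B).
    return f"""upsert {{
  query {{
    q(func: eq(fullName, {name})) {{
      v as uid
    }}
  }}
  mutation {{
    set {{
      uid(v) <fullName> {name} .
      uid(v) <accountBalance> "{account_balance}" .
    }}
  }}
}}"""


if __name__ == "__main__":
    print(build_upsert_block("Alice", 100.0))
```

The resulting text can be pasted into Ratel's mutate tab or sent through a client. Note that eq(fullName, ...) needs an index on fullName to work.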

Multi-threading / Concurrency
If our transactions are spaced out, with no concurrency, the upsert block alone is enough to avoid duplicates. But if transactions run concurrently, we could still end up with duplicates: two transactions can both run their query before either has committed, both find no node with the given fullName, and both take Path B. We need an additional mechanism for this concurrent update scenario.
This is exactly where the @upsert directive in the schema helps. When a predicate carries the @upsert directive, Dgraph checks at commit time whether concurrent transactions are modifying nodes with the same value for that predicate and, if so, aborts one of them. In our scenario, we set the @upsert directive on the fullName predicate.
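For our scenario, the schema could look roughly like the sketch below. The exact index and the float type for accountBalance are my assumptions for illustration; apply_schema is a hypothetical helper that assumes the pydgraph client is installed and that you already have a connected DgraphClient.

```python
# Schema sketch: fullName is indexed so eq() lookups work, and carries
# @upsert so concurrent commits on the same value are checked for conflicts.
SCHEMA = """
fullName: string @index(exact) @upsert .
accountBalance: float .
"""


def apply_schema(client):
    """Apply SCHEMA through a pydgraph DgraphClient (assumes pydgraph is installed)."""
    import pydgraph  # imported lazily so this snippet loads without the client library

    return client.alter(pydgraph.Operation(schema=SCHEMA))
```

The same schema text can also be entered through Ratel's schema tab instead of a client.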

From the client's perspective, all it needs to do when a transaction is aborted is retry. When duplicate transactions arrive concurrently, the first one to commit takes Path B, and the one that is aborted and retried takes Path A.
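To make the abort-and-retry flow concrete, here is a toy in-memory model. It is not how Dgraph is implemented and does not talk to Dgraph at all; ToyStore, TransactionAborted and upsert_with_retry are all hypothetical names, with TransactionAborted standing in for whatever error your client raises on an aborted transaction.

```python
class TransactionAborted(Exception):
    """Stand-in for the error a client raises when a transaction is aborted."""


class ToyStore:
    """Toy model of conflict detection on a unique key; not Dgraph's implementation."""

    def __init__(self):
        self.nodes = {}     # uid -> {"fullName": ..., "accountBalance": ...}
        self.versions = {}  # fullName -> number of commits that touched this value
        self.next_uid = 1

    def begin(self, full_name):
        # Query step: record which nodes match and the key's version at read time.
        matches = [u for u, n in self.nodes.items() if n["fullName"] == full_name]
        return {"name": full_name, "matches": matches,
                "seen_version": self.versions.get(full_name, 0)}

    def commit(self, txn, balance):
        name = txn["name"]
        # Abort if another transaction committed on the same value since our read.
        if self.versions.get(name, 0) != txn["seen_version"]:
            raise TransactionAborted(name)
        if txn["matches"]:  # Path A: the node exists, update it
            for uid in txn["matches"]:
                self.nodes[uid]["accountBalance"] = balance
            path = "A"
        else:               # Path B: no node yet, create it
            self.nodes[self.next_uid] = {"fullName": name, "accountBalance": balance}
            self.next_uid += 1
            path = "B"
        self.versions[name] = txn["seen_version"] + 1
        return path


def upsert_with_retry(store, full_name, balance, stale_txn=None, max_attempts=3):
    """Commit the upsert, starting a fresh transaction after each abort."""
    txn = stale_txn if stale_txn is not None else store.begin(full_name)
    for _ in range(max_attempts):
        try:
            return store.commit(txn, balance)
        except TransactionAborted:
            txn = store.begin(full_name)  # retry: re-run the query, then commit again
    raise TransactionAborted(full_name)


if __name__ == "__main__":
    store = ToyStore()
    # Both transactions run their query before either one commits.
    t1, t2 = store.begin("Alice"), store.begin("Alice")
    print(store.commit(t1, 100.0))                                   # first commit
    print(upsert_with_retry(store, "Alice", 250.0, stale_txn=t2))   # aborted, then retried
    print(len(store.nodes))
```

In this model the second transaction's first commit attempt is rejected because the value it read has changed, and its retry sees the node created by the first transaction, so only one node ends up in the store.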
Here is a video on the upsert directive.
