I’m performance tuning an application and have achieved huge speed-ups by batching records into thousands of JSON documents per insert transaction, instead of running individual transactions on multiple threads. The nodes I am inserting must be unique by a value, so I have made this value an upsert index.
By implementing batches, however, I had to reduce parallelism in my app, as this unique value can occur simultaneously in separate batches on separate threads, which causes the entire batch to fail. Retrying is then expensive enough to make it slower than a single thread. (It is not possible to route the same values to the same thread before insert.)
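For context, the batching currently looks roughly like the sketch below, assuming the dgo Go client. The `Record` type, the `email` predicate, and the connection details are placeholders, and the per-record existence check is omitted; the point is that thousands of records go into one mutation in one transaction, against a predicate marked `@upsert` so that concurrent writes to the same value are detected as a conflict.

```go
package main

import (
    "context"
    "encoding/json"
    "log"

    "github.com/dgraph-io/dgo/v210"
    "github.com/dgraph-io/dgo/v210/protos/api"
    "google.golang.org/grpc"
)

// Record is a placeholder payload; "email" stands in for the value that must stay unique.
type Record struct {
    UID   string `json:"uid,omitempty"`
    Email string `json:"email"`
    Name  string `json:"name,omitempty"`
}

// insertBatch sends thousands of records as a single JSON mutation in a single
// transaction, instead of one transaction per record.
func insertBatch(ctx context.Context, dg *dgo.Dgraph, batch []Record) error {
    payload, err := json.Marshal(batch)
    if err != nil {
        return err
    }
    txn := dg.NewTxn()
    defer txn.Discard(ctx)
    if _, err := txn.Mutate(ctx, &api.Mutation{SetJson: payload}); err != nil {
        return err
    }
    return txn.Commit(ctx)
}

func main() {
    ctx := context.Background()
    conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

    // The unique-by-value predicate is marked @upsert so that two concurrent
    // transactions writing the same value conflict rather than going unnoticed.
    if err := dg.Alter(ctx, &api.Operation{
        Schema: `email: string @index(hash) @upsert .`,
    }); err != nil {
        log.Fatal(err)
    }

    batch := []Record{{Email: "a@example.com"}, {Email: "b@example.com"}}
    if err := insertBatch(ctx, dg, batch); err != nil {
        log.Fatal(err)
    }
}
```

Two such batches running on separate threads that happen to share a single email value will abort one another, which is exactly the failure described above.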
Is there any strategy to have parallel batch inserts, with the transactional guarantees of non-duplicate nodes, e.g. letting only the duplicated inserts fail and the rest of the batch commit successfully? Or any other way to batch load data whilst keeping nodes unique by a value? I also cannot use the bulk loader as this process is part of a high volume streaming application.
To rephrase, the docs say the following about upserts:
Upsert operations are intended to be run concurrently, as per the needs of the application. As such, it’s possible that two concurrently running operations could try to add the same node at the same time. For example, both try to add a user with the same email address. If they do, then one of the transactions will fail with an error indicating that the transaction was aborted.
If this happens, the transaction is rolled back and it’s up to the user’s application logic to retry the whole operation. The transaction has to be retried in its entirety, all the way from creating a new transaction.
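In my case that whole-batch retry amounts to something like the sketch below. It builds on the hypothetical insertBatch helper from the sketch above; maxAttempts and the linear backoff are arbitrary illustrative choices, and `time` needs adding to the imports.

```go
// retryWholeBatch re-runs the entire batch from a brand-new transaction on
// every attempt, since an aborted transaction cannot be resumed.
func retryWholeBatch(ctx context.Context, dg *dgo.Dgraph, batch []Record, maxAttempts int) error {
    var err error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        if err = insertBatch(ctx, dg, batch); err == nil {
            return nil
        }
        // Back off before starting over with a fresh transaction.
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(time.Duration(attempt) * 100 * time.Millisecond):
        }
    }
    return err
}
```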
The issue is that inserts/upserts of individual nodes are painfully slow compared to bulk inserts of thousands of documents. Bulk upserts in parallel, however, almost always fail, as the chance of key collisions is very high.
Your understanding is correct. If two transactions have even a single value in common, Dgraph will fail one of them. We don’t have support for entry-level failure yet; a single transaction succeeds or fails as a whole unit.
Thanks for the response @ashishgoswami. I think the ability to support parallel “batches” while guaranteeing uniqueness would be a really great feature; any chance you are evaluating this?
For anyone else hitting this issue in the future: the idea of “best effort” for a property exists in Dgraph; you need to set the @noconflict directive in your schema. https://dgraph.io/docs/master/query-language/#noconflict-directive . With this I can avoid the transaction failures caused by simultaneous batched updates, since I don’t need strict guarantees for these properties.
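For reference, applying the directive through the Go client looks roughly like this sketch. The `email` predicate name and the index choice are placeholders, `dg` is the *dgo.Dgraph client from the earlier sketch, and the linked docs describe the exact directive semantics.

```go
// Mark the predicate with @noconflict so writes to it are excluded from
// conflict detection; uniqueness then becomes "best effort" only.
op := &api.Operation{Schema: `email: string @index(hash) @noconflict .`}
if err := dg.Alter(context.Background(), op); err != nil {
    log.Fatal(err)
}
```

Note the trade-off: concurrent batches no longer abort on shared values, so any remaining duplicates are the application’s problem, which is acceptable for properties that don’t need strict guarantees.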