Dgraph queries without besteffort flag timing out

Hi

Our application (dockerized) uses the Dgraph dgo client (dgraph/dgo/v200) to connect to and run queries and mutations against the Dgraph server (v20.11.1, also running in Docker).

We noticed that after a while the queries without best-effort start timing out (60s), but the ones with best-effort set return fine. If we restart our application, they seem to work fine again without timeouts. We have observed this consistently after the application has been running for a while. There may be some mutations going on during the queries, but fewer than 10 mutations run concurrently.

If we restart the application, query behavior returns to normal without any timeouts. It looks as though there is some state maintained by the Dgraph server per client, which could explain the behavior.

I am curious if anybody has seen similar behavior.

Update: the application restarts actually have no effect on the behavior; what we saw earlier was just a coincidence.

Hi @Vijay1

The way best-effort queries work is that the Alpha node does not have to fetch the ReadTs from Zero; it uses its own ReadTs to read from the p directory.

Best-effort transactions are a variant of read-only transactions, which would use an Alpha’s observed MaxAssigned timestamp as the read timestamp. Thus, the receiver Alpha does not have to block at all and can continue to process the query.

— from the Dgraph paper
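For reference, here is a minimal sketch (in Go, using the dgo v200 client) of the two kinds of read-only queries being discussed. The endpoint, the hostname predicate, and the 60-second timeout are assumptions taken from this thread, not prescriptions:

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
	"google.golang.org/grpc"
)

func main() {
	// Assumed Alpha gRPC endpoint; adjust to your Docker port mapping.
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	// Same 60s client-side timeout mentioned in this thread.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	q := `{ q(func: has(hostname), first: 5) { uid } }` // placeholder query

	// Strict read-only query: gets its read timestamp via Zero, so the
	// Alpha blocks until it has applied all writes up to that timestamp.
	strict, err := dg.NewReadOnlyTxn().Query(ctx, q)
	fmt.Println(strict, err)

	// Best-effort query: reads at the Alpha's own MaxAssigned timestamp,
	// so it does not wait on in-flight transactions.
	be, err := dg.NewReadOnlyTxn().BestEffort().Query(ctx, q)
	fmt.Println(be, err)
}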

Looking at the alpha logs you shared via email, I can see several mutations being served (and some of them are getting aborted, which is fine).

Since Dgraph reads are linearizable, in a scenario like yours where a lot of reads and writes are going on simultaneously, all reads are supposed to block until the Alpha has seen all the writes up to the read timestamp.

That said, you mentioned in your email that when you downgraded your Dgraph version to v20.07.3 you didn’t notice any major issues. Were your normal queries being served while mutations were running?

Also, are you sending queries against nodes or predicates that are being updated simultaneously?

Maybe you can provide the following so we can inspect further:

  • increase log verbosity to 2 for Alpha and Zero (see the instructions here)
  • while the queries are timing out, send a GET request to /health?all on an Alpha's HTTP port and share the response here (see the docs; a minimal sketch follows this list)
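A sketch of that health check in Go, assuming the Alpha HTTP port is published on localhost:8080 (adjust for your Docker port mapping):

package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	// Poll the Alpha HTTP endpoint while the queries are timing out.
	resp, err := http.Get("http://localhost:8080/health?all")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body))
}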

Best,

Hi Omar

Here is an update:
Yesterday we started a new Dgraph instance with version 20.11.3 and observed the above issue within 10 hours of operation, while the 20.07.3 instance keeps working without any issue.

Also, are you sending queries against nodes or predicates that are being updated simultaneously?

Yes, our application is constantly upserting nodes while the queries are in progress (it is more write-heavy). We have around 200k-400k nodes but are constantly doing upsert mutations even if there are net new nodes being created.
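For context, each iteration of our workload performs a single upsert, roughly like the sketch below (Go with dgo v200; the hostname predicate and value are placeholders, and dg is a connected dgo client as in the snippet earlier in this thread):

import (
	"context"
	"fmt"

	"github.com/dgraph-io/dgo/v200"
	"github.com/dgraph-io/dgo/v200/protos/api"
)

// upsertHost is a hypothetical helper showing the shape of one upsert:
// match an existing node by predicate, reuse its uid if found, otherwise
// create a new node, and commit immediately.
func upsertHost(ctx context.Context, dg *dgo.Dgraph, host string) error {
	q := fmt.Sprintf(`query { node as var(func: eq(hostname, %q)) }`, host)
	mu := &api.Mutation{
		SetNquads: []byte(fmt.Sprintf(`uid(node) <hostname> %q .`, host)),
	}
	req := &api.Request{
		Query:     q,
		Mutations: []*api.Mutation{mu},
		CommitNow: true,
	}
	_, err := dg.NewTxn().Do(ctx, req)
	return err
}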

The alpha health is fine:
[{"instance":"zero","address":"zero:5080","status":"healthy","group":"0","version":"v20.11.3","uptime":71590,"lastEcho":1617378252},{"instance":"alpha","address":"alpha:7080","status":"healthy","group":"1","version":"v20.11.3","uptime":71591,"lastEcho":1617378253,"ongoing":["opRollup"],"ee_features":["backup_restore"],"max_assigned":1825770}]

I have attached the logs as you requested, with the verbosity level set to 2: dgraph_logs.tgz (157.2 KB)

Correction to the above post:


We have around 200k-400k nodes but are constantly doing upsert mutations even if there are no net new nodes being created.

Hi Vijay,

Thanks, I’m looking into your logs now. Can you also enable query logging so we can see what requests the Alpha is receiving? You can see the instructions here. Once enabled, please attach the logs, as they will have information about queries and mutations (maybe wait some time to let the logs be captured).

It still sounds odd to me that this behavior is not showing up in 20.07.3; I will keep digging.

Also, can you provide Jaeger traces for the query that is timing out? This will help in analyzing this behavior further.

You can enable Jaeger tracing by following the instructions here.

Best,

Hi @Vijay1

Did you have a chance to go over my previous messages and see if you can provide what we asked for, so we can make some progress on the issue reported here?

Best

Omar

All the queries without the best-effort flag are timing out. We have not set up Jaeger tracing yet, but we have the zpages output, which shows “deadline exceeded” for some of the queries. Let me know if this helps.

Thanks. zpages.html (126.1 KB)

Thanks Vijay,

Will look into zpages and get back to you

Best,

Error traces on the raft heartbeat:

Span Name: pb.Raft.Heartbeat

5 Requests

When                       Elapsed (sec)
----------------------------------------
2021/04/09-13:15:38.692961      .   255 trace_id: d3a7949526218641cf10acdfcd2e9e75 span_id: b476afef79df1e2a parent_span_id: c5cd97886355372f
                                        Status{canonicalCode=UNKNOWN, description="No node has been set up yet"}
                                        Attributes:{Client=false, FailFast=false}
           13:15:38.693163      .   201 received message [0 bytes, 0 compressed bytes]
2021/04/09-13:15:39.695572      .   112 trace_id: a3a7707996157ecf0a4b01bb07fd9a01 span_id: b8590a12dcc70d71 parent_span_id: 2c286cce8ee7602b
                                        Status{canonicalCode=UNKNOWN, description="No node has been set up yet"}
                                        Attributes:{Client=false, FailFast=false}
           13:15:39.695668      .    95 received message [0 bytes, 0 compressed bytes]
2021/04/09-13:15:40.697826      .    79 trace_id: 5dd3e2925469d288b0b2f985e992ee70 span_id: 3acb37230d3c8594 parent_span_id: 2219c150d3eed729
                                        Status{canonicalCode=UNKNOWN, description="No node has been set up yet"}
                                        Attributes:{Client=false, FailFast=false}
           13:15:40.697890      .    63 received message [0 bytes, 0 compressed bytes]
2021/04/09-13:15:41.699575      .    59 trace_id: 2b019a4e7c5d0904867fd7672fc6f4cb span_id: bc3c65343eb0fcb7 parent_span_id: 0efb6a555cfdc526
                                        Status{canonicalCode=UNKNOWN, description="No node has been set up yet"}
                                        Attributes:{Client=false, FailFast=false}
           13:15:41.699621      .    46 received message [0 bytes, 0 compressed bytes]
2021/04/09-13:15:42.701570      .    93 trace_id: 1bc38579ac219cfc28d3ff85e8811305 span_id: 3eae92456f2474db parent_span_id: 04ecbfd7a0043d25
                                        Status{canonicalCode=UNKNOWN, description="No node has been set up yet"}
                                        Attributes:{Client=false, FailFast=false}
           13:15:42.701646      .    76 received message [0 bytes, 0 compressed bytes]

Can you please share the traces for processTask…hostname that are >1 min?

Here are the traces for everything >10s, which should include the >1m traces (the >1m bucket is empty for some reason). We are using a timeout of 60s on our requests to Dgraph.

Span Name: processTask…hostname

20 Requests

When                       Elapsed (sec)
----------------------------------------
2021/04/13-18:51:30.442462    59.999992 trace_id: 39579be200ed754a8343a443230ea9ce span_id: e077943a3784b0b0 parent_span_id: 5e0667290610398d
           18:51:30.442468      .     5 Start.  Attributes:{funcId=314771968, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:51:30.442472      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:52:30.442442    59.999970 End. Took 59.999970383s  Attributes:{funcId=314771968, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:52:30.444057    59.999846 trace_id: 29bc765bfa17c2a1bc71ea2071e7f1a2 span_id: add93b2a76f73817 parent_span_id: 6aaf77902cc90562
           18:52:30.444061      .     4 Start.  Attributes:{funcId=730138160, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:52:30.444065      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:53:30.443849    59.999785 End. Took 59.999782393s  Attributes:{funcId=730138160, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:53:30.445193    59.999990 trace_id: 87756ae31a54e6331d82e2071da25233 span_id: f8c9b50884f6495a parent_span_id: b59ff16e3ac816a5
           18:53:30.445198      .     5 Start.  Attributes:{funcId=1837794798, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:53:30.445203      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:54:30.445172    59.999969 End. Took 59.999957972s  Attributes:{funcId=1837794798, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:54:30.446832    59.999799 trace_id: d5c1162577bfb1ce253d97e361cbfbcc span_id: 43ba2fe791f55a9d parent_span_id: c14802d66081e379
           18:54:30.446836      .     4 Start.  Attributes:{funcId=829134364, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:54:30.446839      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:55:30.446616    59.999777 End. Took 59.999776727s  Attributes:{funcId=829134364, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:44:30.432948    59.999859 trace_id: 9fde74792a96f20b64a5c58ce58f8343 span_id: 0e4a4d7a5be88e02 parent_span_id: 8cd81f692a7417df
           18:44:30.432953      .     5 Start.  Attributes:{funcId=1539709383, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:44:30.432958      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:45:30.432700    59.999742 End. Took 59.999740572s  Attributes:{funcId=1539709383, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:45:30.434013    59.999132 trace_id: 3565f1675c0cdcfa468374ac72a62262 span_id: 593ac75869e79f45 parent_span_id: d7c8994738732822
           18:45:30.434016      .     3 Start.  Attributes:{funcId=520251098, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:45:30.434020      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:46:30.433115    59.999096 End. Took 59.999093483s  Attributes:{funcId=520251098, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:46:30.435384    59.999294 trace_id: 749d6dfd9b11c9645218df00f99eb5ba span_id: 269c6e48a85a28ac parent_span_id: a42a413777e6b088
           18:46:30.435388      .     3 Start.  Attributes:{funcId=865218393, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:46:30.435391      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:47:30.434648    59.999256 End. Took 59.999255851s  Attributes:{funcId=865218393, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:47:30.437236    59.999518 trace_id: bfb385b2111290e86fd136cfe703c0dd span_id: 718ce826b65939ef parent_span_id: ef1abb1585e5c1cb
           18:47:30.437243      .     6 Start.  Attributes:{funcId=1451722271, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:47:30.437248      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:48:30.436735    59.999487 End. Took 59.999484814s  Attributes:{funcId=1451722271, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:49:30.439985    59.999753 trace_id: 5b2f189f81b5bedec1dd3bdbe56c55d7 span_id: c825736cea111707 parent_span_id: 46b4455bb99d9fe3
           18:49:30.439991      .     6 Start.  Attributes:{funcId=1896495473, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:49:30.439997      .     5 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:50:30.439711    59.999715 End. Took 59.999712095s  Attributes:{funcId=1896495473, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:50:30.441858    59.999441 trace_id: cfbb05908f3eb8fac0fdb19a286604b1 span_id: d4ce83d310cbe3db parent_span_id: 525d56c2df566cb8
           18:50:30.441864      .     6 Start.  Attributes:{funcId=39532766, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:50:30.441869      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:51:30.441268    59.999399 End. Took 59.999396706s  Attributes:{funcId=39532766, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:06:30.380374    60.000070 trace_id: 2d6ec7f1ebd859339cfe8b3913001499 span_id: 91224e0cb66b3caf parent_span_id: 0fb120fb84f7c48b
           18:06:30.380379      .     4 Start.  Attributes:{funcId=1982365020, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:06:30.380382      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:07:30.380415    60.000033 End. Took 1m0.000032459s  Attributes:{funcId=1982365020, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:12:30.389180    60.000115 trace_id: 45b8e4fa8b85ab2941552a27abae9fb1 span_id: 57a784656b4e9188 parent_span_id: d53557543ada1965
           18:12:30.389185      .     4 Start.  Attributes:{funcId=1269399879, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:12:30.389188      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:13:30.389218    60.000030 End. Took 1m0.000027294s  Attributes:{funcId=1269399879, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:28:30.411134    60.000101 trace_id: fcd23aca7022e5adaf8f831bc895680f span_id: 135534b46ff86d8e parent_span_id: 91e306a33e84f66a
           18:28:30.411140      .     5 Start.  Attributes:{funcId=1125737798, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:28:30.411145      .     5 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:29:30.411222    60.000075 End. Took 1m0.000075767s  Attributes:{funcId=1125737798, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:35:30.421807    60.000063 trace_id: a8728527f003a2e572f67d5b209a2802 span_id: 24cae4eb32dad3aa parent_span_id: a258b7da01665c87
           18:35:30.421813      .     6 Start.  Attributes:{funcId=1309314620, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:35:30.421818      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:36:30.421848    60.000031 End. Took 1m0.00002847s  Attributes:{funcId=1309314620, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:48:30.438069    60.000192 trace_id: db35f5cc8ca8b38ad3f0ee7ca7985876 span_id: 3eee8f16f5ccc155 parent_span_id: fbc3cb7cab9e8ea0
           18:48:30.438073      .     4 Start.  Attributes:{funcId=1916132436, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:48:30.438076      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:49:30.438248    60.000171 End. Took 1m0.000171361s  Attributes:{funcId=1916132436, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-17:49:30.355497    60.000136 trace_id: 7065dd7a08ef925dcd5f73733c9ea742 span_id: 8a8424dfa3c24e66 parent_span_id: 0813f7cd724ed742
           17:49:30.355501      .     4 Start.  Attributes:{funcId=1913044995, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           17:49:30.355504      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           17:50:30.355617    60.000112 End. Took 1m0.00011245s  Attributes:{funcId=1913044995, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-17:52:30.359664    60.000060 trace_id: 52c659550486c296772623270482364a span_id: edc6bf8bfe33f952 parent_span_id: 6b55927acdbf812f
           17:52:30.359670      .     5 Start.  Attributes:{funcId=1799733158, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           17:52:30.359674      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           17:53:30.359689    60.000015 End. Took 1m0.000013156s  Attributes:{funcId=1799733158, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-17:54:30.362457    60.000325 trace_id: 451481cec2632c56c747cf75cfb91cf5 span_id: 44604ad132ecd66a parent_span_id: c2ee1cc001785f47
           17:54:30.362463      .     5 Start.  Attributes:{funcId=2074338048, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           17:54:30.362468      .     4 Waiting for startTs: 3051854 at node: 1, gid: 1
           17:55:30.362744    60.000277 End. Took 1m0.000273937s  Attributes:{funcId=2074338048, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-17:56:30.365425    60.000178 trace_id: e3aa0579338435a77975e20b6a676eb4 span_id: 5cb26b9f7f5e7014 parent_span_id: da403e8e4eeaf8f0
           17:56:30.365430      .     5 Start.  Attributes:{funcId=873933302, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           17:56:30.365434      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           17:57:30.365581    60.000147 End. Took 1m0.0001446s  Attributes:{funcId=873933302, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
2021/04/13-18:00:30.371685    60.000063 trace_id: 46cb402fb04c243ca4c094b9afdc7705 span_id: 0ae5802ae8ce2b44 parent_span_id: 88735319b75ab420
           18:00:30.371690      .     4 Start.  Attributes:{funcId=1275705929, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}
           18:00:30.371693      .     3 Waiting for startTs: 3051854 at node: 1, gid: 1
           18:01:30.371709    60.000016 End. Took 1m0.000013407s  Attributes:{funcId=1275705929, funcName="processTask\x00\x00\x00\x00\x00\x00\x00\x00hostname"}


The traces suggest that the node is waiting for the pending txns to be committed before it can serve a query with a readTs greater than the posting Ts of the node.
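To illustrate (this is a conceptual sketch in Go, not Dgraph's actual code): a strict read cannot return until the server's applied timestamp catches up with the query's readTs, so a transaction that never commits or aborts keeps every such read waiting until the client's deadline fires.

import (
	"context"
	"sync/atomic"
	"time"
)

// waitForReadTs illustrates the blocking behavior seen in the traces: the
// read polls until the applied watermark reaches readTs, or gives up when
// the caller's context (the client's 60s timeout) expires.
func waitForReadTs(ctx context.Context, applied *uint64, readTs uint64) error {
	for atomic.LoadUint64(applied) < readTs {
		select {
		case <-ctx.Done():
			return ctx.Err() // surfaces as "deadline exceeded"
		case <-time.After(10 * time.Millisecond):
		}
	}
	return nil
}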

We’ll try to simulate the upsert workload to figure out why the upsert mutations are slower on 20.11.

Also, @Vijay1, can you please try running your workload on master?

Hi @ahsan

We just started the upsert workload with the latest master and will let you know the result of the test. From what we have seen so far, the problem is not that the upserts are slower; rather, one of the upsert mutations blocks/fails for some reason (we are issuing just one upsert mutation at a time, with no concurrent updates), which probably causes the subsequent non-best-effort queries to block forever.

Thanks

Hi @ahsan

We ran our workload against master and ran into the issue after close to 100k upserts (serial mutations). Now the queries are timing out after 60s, which is our default.

We have created a task for this issue. We’ll let you know once resolved.