[WIP] Performance bottleneck on 1 TB data

I tried running the bulk loader while skipping the map phase. It still goes OOM; in the logs I can see it reach OOM after inserting ~450M RDFs.

[09:29:41Z] REDUCE 18m55s edge_count:449.4M edge_speed:396.1k/sec plist_count:411.5M plist_speed:362.7k/sec. Num Encoding: 0
writeCh: Time taken to write req: 1.259s
[09:29:42Z] REDUCE 18m56s edge_count:449.5M edge_speed:395.8k/sec plist_count:411.5M plist_speed:362.4k/sec. Num Encoding: 2
[09:29:43Z] REDUCE 18m57s edge_count:449.7M edge_speed:395.6k/sec plist_count:411.7M plist_speed:362.2k/sec. Num Encoding: 2
[09:29:45Z] REDUCE 18m59s edge_count:450.1M edge_speed:395.4k/sec plist_count:412.1M plist_speed:362.0k/sec. Num Encoding: 1
writeCh: Time taken to write req: 1.15s
[09:29:46Z] REDUCE 19m00s edge_count:450.2M edge_speed:395.1k/sec plist_count:412.2M plist_speed:361.7k/sec. Num Encoding: 1
writeCh: Time taken to write req: 1.168s
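A quick sanity check on these numbers: at the 18m59s mark, 450.1M edges over ~1,139 s works out to roughly 395k edges/s, which matches the reported edge_speed, so edge_speed is the cumulative average (edge_count divided by elapsed time) rather than an instantaneous rate.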

The heap profile for the run above is shown below; most of the heap is held by mapIterator. (A sketch of how such a profile can be captured follows the profile output.)

Type: inuse_space
Time: Jul 17, 2020 at 11:02am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 17.45GB, 98.24% of 17.76GB total
Dropped 68 nodes (cum <= 0.09GB)
Showing top 10 nodes out of 32
      flat  flat%   sum%        cum   cum%
   10.95GB 61.69% 61.69%    10.98GB 61.81%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching.func1
    4.15GB 23.39% 85.08%    15.13GB 85.20%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching
    0.66GB  3.69% 88.77%     0.66GB  3.69%  github.com/dgraph-io/dgraph/protos/pb.(*PostingList).Marshal
    0.50GB  2.82% 91.60%     0.50GB  2.82%  github.com/dgraph-io/ristretto/z.(*Bloom).Size
    0.37GB  2.07% 93.66%     0.51GB  2.88%  github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
    0.34GB  1.90% 95.56%     0.34GB  1.90%  github.com/dgraph-io/badger/v2/table.NewTableBuilder
    0.21GB  1.20% 96.75%     0.90GB  5.05%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).toList.func2
    0.14GB  0.81% 97.57%     0.14GB  0.81%  github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
    0.10GB  0.55% 98.12%     0.10GB  0.55%  compress/flate.(*dictDecoder).init (inline)
    0.02GB  0.12% 98.24%     0.12GB  0.67%  compress/flate.NewReader
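For reference, a heap profile like the one above can be pulled from any Go process that exposes the standard net/http/pprof endpoints; a Dgraph Alpha should already serve them on its HTTP port. A minimal sketch, with localhost:6060 as an assumed example address:

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on the default mux
)

func main() {
	// A real program does its work elsewhere; this just serves the
	// profiling endpoints so `go tool pprof` can scrape the heap.
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```

Running `go tool pprof -inuse_space http://localhost:6060/debug/pprof/heap` and typing `top` at the (pprof) prompt gives output in the same format as above.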

Since the bulk loader goes OOM even though we don't have a lot of xids (we know that because the map phase completes), I switched to the live loader.
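To make the mapIterator observation more concrete: one plausible reading of startBatching dominating inuse_space is that batches are produced faster than the reducer drains them. The sketch below is purely illustrative (it is not Dgraph's mapIterator code); it shows the general producer/consumer batching pattern, where the capacity of the buffered channel is what bounds how many undrained batches can sit on the heap at once.

```go
package main

import "fmt"

type batch []int

func main() {
	const batchSize = 3
	// Small buffer: the producer blocks once two batches are waiting,
	// so memory held by pending batches stays bounded.
	batches := make(chan batch, 2)

	// Producer: groups items into fixed-size batches.
	go func() {
		defer close(batches)
		cur := make(batch, 0, batchSize)
		for i := 0; i < 10; i++ {
			cur = append(cur, i)
			if len(cur) == batchSize {
				batches <- cur
				cur = make(batch, 0, batchSize)
			}
		}
		if len(cur) > 0 {
			batches <- cur
		}
	}()

	// Consumer: its drain rate, together with the channel capacity,
	// determines how much batched data can accumulate in memory.
	for b := range batches {
		fmt.Println(b)
	}
}
```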

I ran the live loader on a 1-alpha, 1-zero cluster.

Live loader first run

The live loader was able to insert ~1B RDFs in a bit more than 4 hours before going OOM.
The size of the generated p directory was around 124GB. (A quick throughput check follows the log snippet below.)

[22:36:29Z] Elapsed: 04h14m01s Txns: 1056125 N-Quads: 1056125000 N-Quads/s [last 5s]: 86400 Aborts: 154
[22:36:34Z] Elapsed: 04h14m05s Txns: 1056184 N-Quads: 1056184000 N-Quads/s [last 5s]: 11800 Aborts: 154
[22:36:39Z] Elapsed: 04h14m10s Txns: 1056562 N-Quads: 1056562000 N-Quads/s [last 5s]: 75600 Aborts: 154
[22:36:44Z] Elapsed: 04h14m15s Txns: 1056943 N-Quads: 1056943000 N-Quads/s [last 5s]: 76200 Aborts: 154
[22:36:49Z] Elapsed: 04h14m20s Txns: 1057313 N-Quads: 1057313000 N-Quads/s [last 5s]: 74000 Aborts: 154
[22:36:54Z] Elapsed: 04h14m25s Txns: 1057655 N-Quads: 1057655000 N-Quads/s [last 5s]: 68400 Aborts: 154
[22:36:59Z] Elapsed: 04h14m30s Txns: 1058024 N-Quads: 1058024000 N-Quads/s [last 5s]: 73800 Aborts: 154
[22:37:04Z] Elapsed: 04h14m35s Txns: 1058334 N-Quads: 1058334000 N-Quads/s [last 5s]: 62000 Aborts: 154
[22:37:09Z] Elapsed: 04h14m40s Txns: 1058653 N-Quads: 1058653000 N-Quads/s [last 5s]: 63800 Aborts: 154
[22:37:14Z] Elapsed: 04h14m45s Txns: 1059000 N-Quads: 1059000000 N-Quads/s [last 5s]: 69400 Aborts: 154
[22:37:19Z] Elapsed: 04h14m50s Txns: 1059319 N-Quads: 1059319000 N-Quads/s [last 5s]: 63800 Aborts: 154
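A quick check on the numbers of this run: N-Quads divided by Txns is exactly 1,000, i.e. every transaction carries 1,000 N-Quads (the live loader's default batch size, if I recall correctly), and 1,059,000,000 N-Quads over 04h14m45s (~15,285 s) averages out to ~69k N-Quads/s, consistent with the per-5s rates shown above.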

Live loader second run

The second run of the live loader (on a fresh cluster) is very slow; I am still investigating why. After running for around 4 hours it has inserted only ~25M RDFs, and the p directory is only ~1.6GB. Note the abort count in the log snippet below (a retry sketch follows it).

[12:40:15Z] Elapsed: 04h04m35s Txns: 25576 N-Quads: 25576000 N-Quads/s [last 5s]:   800 Aborts: 2276414
[12:40:20Z] Elapsed: 04h04m40s Txns: 25583 N-Quads: 25583000 N-Quads/s [last 5s]:  1400 Aborts: 2276846
[12:40:25Z] Elapsed: 04h04m45s Txns: 25586 N-Quads: 25586000 N-Quads/s [last 5s]:   600 Aborts: 2277396
[12:40:30Z] Elapsed: 04h04m50s Txns: 25592 N-Quads: 25592000 N-Quads/s [last 5s]:  1200 Aborts: 2278224
[12:40:35Z] Elapsed: 04h04m55s Txns: 25598 N-Quads: 25598000 N-Quads/s [last 5s]:  1200 Aborts: 2279403
[12:40:40Z] Elapsed: 04h05m00s Txns: 25605 N-Quads: 25605000 N-Quads/s [last 5s]:  1400 Aborts: 2280209
[12:40:45Z] Elapsed: 04h05m05s Txns: 25614 N-Quads: 25614000 N-Quads/s [last 5s]:  1800 Aborts: 2280740
[12:40:50Z] Elapsed: 04h05m10s Txns: 25618 N-Quads: 25618000 N-Quads/s [last 5s]:   800 Aborts: 2281330
[12:40:55Z] Elapsed: 04h05m15s Txns: 25624 N-Quads: 25624000 N-Quads/s [last 5s]:  1200 Aborts: 2282145
[12:41:00Z] Elapsed: 04h05m20s Txns: 25630 N-Quads: 25630000 N-Quads/s [last 5s]:  1200 Aborts: 2283247
[12:41:05Z] Elapsed: 04h05m25s Txns: 25633 N-Quads: 25633000 N-Quads/s [last 5s]:   600 Aborts: 2284115
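What stands out in this run is the abort count: ~2.28M aborts against only ~25.6k committed transactions, so the vast majority of commit attempts are being aborted due to conflicts, which by itself explains the ~1k N-Quads/s rate. The live loader appears to retry aborted batches itself (which is why progress continues at all); for anyone driving mutations through a client instead, below is a hedged sketch of retrying aborted transactions with dgo (written against the dgo v2 API as I understand it; the abort sentinel may differ between client versions).

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

// mutateWithRetry retries a mutation a few times when the commit is aborted
// due to a transaction conflict, with a simple linear backoff.
func mutateWithRetry(ctx context.Context, dg *dgo.Dgraph, nquads []byte, attempts int) error {
	var err error
	for i := 1; i <= attempts; i++ {
		txn := dg.NewTxn()
		if _, err = txn.Mutate(ctx, &api.Mutation{SetNquads: nquads}); err == nil {
			err = txn.Commit(ctx)
		}
		_ = txn.Discard(ctx) // safe no-op if the commit succeeded
		if err == nil {
			return nil
		}
		if err != dgo.ErrAborted {
			return err // non-conflict error: do not retry
		}
		time.Sleep(time.Duration(i) * 100 * time.Millisecond)
	}
	return err
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	if err := mutateWithRetry(context.Background(),
		dg, []byte(`_:n <name> "example" .`), 5); err != nil {
		log.Fatalf("mutation failed after retries: %v", err)
	}
}
```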