I tried running bulk loader while skipping map phase. It is still getting OOM. In logs I can see it reaches OOM after inserting ~450M RDFs.
[09:29:41Z] REDUCE 18m55s edge_count:449.4M edge_speed:396.1k/sec plist_count:411.5M plist_speed:362.7k/sec. Num Encoding: 0
writeCh: Time taken to write req: 1.259s
[09:29:42Z] REDUCE 18m56s edge_count:449.5M edge_speed:395.8k/sec plist_count:411.5M plist_speed:362.4k/sec. Num Encoding: 2
[09:29:43Z] REDUCE 18m57s edge_count:449.7M edge_speed:395.6k/sec plist_count:411.7M plist_speed:362.2k/sec. Num Encoding: 2
[09:29:45Z] REDUCE 18m59s edge_count:450.1M edge_speed:395.4k/sec plist_count:412.1M plist_speed:362.0k/sec. Num Encoding: 1
writeCh: Time taken to write req: 1.15s
[09:29:46Z] REDUCE 19m00s edge_count:450.2M edge_speed:395.1k/sec plist_count:412.2M plist_speed:361.7k/sec. Num Encoding: 1
writeCh: Time taken to write req: 1.168s
Heap profile for above is like below. Most of heap is taken by mapIterator.
Type: inuse_space
Time: Jul 17, 2020 at 11:02am (UTC)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 17.45GB, 98.24% of 17.76GB total
Dropped 68 nodes (cum <= 0.09GB)
Showing top 10 nodes out of 32
flat flat% sum% cum cum%
10.95GB 61.69% 61.69% 10.98GB 61.81% github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching.func1
4.15GB 23.39% 85.08% 15.13GB 85.20% github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapIterator).startBatching
0.66GB 3.69% 88.77% 0.66GB 3.69% github.com/dgraph-io/dgraph/protos/pb.(*PostingList).Marshal
0.50GB 2.82% 91.60% 0.50GB 2.82% github.com/dgraph-io/ristretto/z.(*Bloom).Size
0.37GB 2.07% 93.66% 0.51GB 2.88% github.com/dgraph-io/badger/v2/pb.(*TableIndex).Unmarshal
0.34GB 1.90% 95.56% 0.34GB 1.90% github.com/dgraph-io/badger/v2/table.NewTableBuilder
0.21GB 1.20% 96.75% 0.90GB 5.05% github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*reducer).toList.func2
0.14GB 0.81% 97.57% 0.14GB 0.81% github.com/dgraph-io/badger/v2/pb.(*BlockOffset).Unmarshal
0.10GB 0.55% 98.12% 0.10GB 0.55% compress/flate.(*dictDecoder).init (inline)
0.02GB 0.12% 98.24% 0.12GB 0.67% compress/flate.NewReader
Since bulk loader is going OOM (even if we don’t have lot of xids, since we are able to complete map phase), I switched to live loader.
I ran live loader on 1 alpha, 1 zero cluster.
Live loader first run
live loader able to insert around ~1B of RDFs before getting OOM in bit more than 4 hours.
Size of generated p directory was around 124GB.
[22:36:29Z] Elapsed: 04h14m01s Txns: 1056125 N-Quads: 1056125000 N-Quads/s [last 5s]: 86400 Aborts: 154
[22:36:34Z] Elapsed: 04h14m05s Txns: 1056184 N-Quads: 1056184000 N-Quads/s [last 5s]: 11800 Aborts: 154
[22:36:39Z] Elapsed: 04h14m10s Txns: 1056562 N-Quads: 1056562000 N-Quads/s [last 5s]: 75600 Aborts: 154
[22:36:44Z] Elapsed: 04h14m15s Txns: 1056943 N-Quads: 1056943000 N-Quads/s [last 5s]: 76200 Aborts: 154
[22:36:49Z] Elapsed: 04h14m20s Txns: 1057313 N-Quads: 1057313000 N-Quads/s [last 5s]: 74000 Aborts: 154
[22:36:54Z] Elapsed: 04h14m25s Txns: 1057655 N-Quads: 1057655000 N-Quads/s [last 5s]: 68400 Aborts: 154
[22:36:59Z] Elapsed: 04h14m30s Txns: 1058024 N-Quads: 1058024000 N-Quads/s [last 5s]: 73800 Aborts: 154
[22:37:04Z] Elapsed: 04h14m35s Txns: 1058334 N-Quads: 1058334000 N-Quads/s [last 5s]: 62000 Aborts: 154
[22:37:09Z] Elapsed: 04h14m40s Txns: 1058653 N-Quads: 1058653000 N-Quads/s [last 5s]: 63800 Aborts: 154
[22:37:14Z] Elapsed: 04h14m45s Txns: 1059000 N-Quads: 1059000000 N-Quads/s [last 5s]: 69400 Aborts: 154
[22:37:19Z] Elapsed: 04h14m50s Txns: 1059319 N-Quads: 1059319000 N-Quads/s [last 5s]: 63800 Aborts: 154
Live loader second run
In second run of live loader(on fresh cluster), it is running very slow. I am still investigating this. After running for around 4 hours it has only inserted ~25M RDFs. Current p directory size is ~1.6GB.
[12:40:15Z] Elapsed: 04h04m35s Txns: 25576 N-Quads: 25576000 N-Quads/s [last 5s]: 800 Aborts: 2276414
[12:40:20Z] Elapsed: 04h04m40s Txns: 25583 N-Quads: 25583000 N-Quads/s [last 5s]: 1400 Aborts: 2276846
[12:40:25Z] Elapsed: 04h04m45s Txns: 25586 N-Quads: 25586000 N-Quads/s [last 5s]: 600 Aborts: 2277396
[12:40:30Z] Elapsed: 04h04m50s Txns: 25592 N-Quads: 25592000 N-Quads/s [last 5s]: 1200 Aborts: 2278224
[12:40:35Z] Elapsed: 04h04m55s Txns: 25598 N-Quads: 25598000 N-Quads/s [last 5s]: 1200 Aborts: 2279403
[12:40:40Z] Elapsed: 04h05m00s Txns: 25605 N-Quads: 25605000 N-Quads/s [last 5s]: 1400 Aborts: 2280209
[12:40:45Z] Elapsed: 04h05m05s Txns: 25614 N-Quads: 25614000 N-Quads/s [last 5s]: 1800 Aborts: 2280740
[12:40:50Z] Elapsed: 04h05m10s Txns: 25618 N-Quads: 25618000 N-Quads/s [last 5s]: 800 Aborts: 2281330
[12:40:55Z] Elapsed: 04h05m15s Txns: 25624 N-Quads: 25624000 N-Quads/s [last 5s]: 1200 Aborts: 2282145
[12:41:00Z] Elapsed: 04h05m20s Txns: 25630 N-Quads: 25630000 N-Quads/s [last 5s]: 1200 Aborts: 2283247
[12:41:05Z] Elapsed: 04h05m25s Txns: 25633 N-Quads: 25633000 N-Quads/s [last 5s]: 600 Aborts: 2284115