Bulk loader still OOM during reduce phase

Thanks for sharing the dataset over DM. Your dataset has about 109 million unique predicate names in the schema. That's far too many predicates, and it explains the OOM you're hitting here.

If you can store the "unique" part of the edge as a value (e.g., <0x1> <id> "abc-123-567" .) instead of encoding it in the predicate name, that would greatly reduce the number of schema entries.
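
For illustration, here's roughly what that change looks like. The predicate names below are made up, not taken from your data:

# One predicate per unique ID -- every distinct ID adds another schema entry:
<0x1> <id.abc-123-567> "true" .
<0x2> <id.def-890-123> "true" .

# Unique ID stored as the value of a single <id> predicate:
<0x1> <id> "abc-123-567" .
<0x2> <id> "def-890-123" .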


Here’s a heap profile of the bulk loader with your dataset during the map phase:

File: dgraph
Build ID: 6fee491f94a8999d0e85fe9ea8ccb3321f7ce0d0
Type: inuse_space
Time: Jul 23, 2021 at 2:07pm (PDT)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 44.24GB, 99.88% of 44.30GB total
Dropped 73 nodes (cum <= 0.22GB)
Showing top 10 nodes out of 12
      flat  flat%   sum%        cum   cum%
      19GB 42.88% 42.88%       19GB 42.88%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*schemaStore).validateType
   14.15GB 31.95% 74.83%    14.15GB 31.95%  github.com/dgraph-io/dgraph/x.NamespaceAttr
   10.83GB 24.45% 99.28%    10.83GB 24.45%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*shardMap).shardFor
    0.27GB   0.6% 99.88%     0.27GB   0.6%  bytes.makeSlice
         0     0% 99.88%     0.27GB   0.6%  bytes.(*Buffer).Write
         0     0% 99.88%     0.27GB   0.6%  bytes.(*Buffer).grow
         0     0% 99.88%     0.27GB   0.6%  github.com/dgraph-io/dgraph/chunker.(*rdfChunker).Chunk
         0     0% 99.88%    43.99GB 99.31%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func1
         0     0% 99.88%     0.27GB   0.6%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*loader).mapStage.func2
         0     0% 99.88%       19GB 42.89%  github.com/dgraph-io/dgraph/dgraph/cmd/bulk.(*mapper).createPostings

Most of the memory usage is driven by the large number of predicates in the schema: the top three entries (validateType, NamespaceAttr, and shardFor) all scale with the number of distinct predicates and together account for nearly all of the in-use space.
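
Concretely, with the current modeling the bulk loader has to hold one schema entry per unique ID, i.e., something of this shape (hypothetical names and types, just to show the pattern):

# ~109 million entries like these:
id.abc-123-567: bool .
id.def-890-123: bool .

With the ID stored as a value, a single entry covers all of them, e.g.:

id: string @index(exact) .

and the schema stays small no matter how many distinct IDs the data contains. (The exact type and index tokenizer depend on how you query the ID; exact above is just an example.)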