Bulk loader xidmap memory optimization

Very well-written post, @harshil_goel.

Some suggestions:

1. --dry-run option

While you convert this into a library, I'd also suggest adding a --dry-run option to the bulk loader (or the library). In this mode, the bulk loader does a dry run without allocating any UIDs, and produces a report and/or dumps it to a file. During the dry run, it simply reads the input and keeps counts of unique xids, total xids, and total RDFs. You will need a map from xid → bool, but its memory needs will be lower than the actual run's by a factor of ~8 (because the value is a bool, not a uint64). A rough sketch follows.
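Something like this, as a minimal sketch: the whitespace splitting and the `_:` blank-node check are naive stand-ins for the loader's real chunker/parser, and `dryRun`/`dryRunReport` are made-up names, not existing bulk loader code.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// dryRunReport is what the dry run would print and/or dump to a file.
type dryRunReport struct {
	UniqueXids int
	TotalXids  int
	TotalRDFs  int
}

// dryRun reads N-Quad-ish lines and counts RDFs and xids without
// allocating any UIDs; only a membership bit is kept per xid.
func dryRun(sc *bufio.Scanner) dryRunReport {
	seen := make(map[string]bool) // xid -> bool instead of xid -> uint64
	var rep dryRunReport
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and comments
		}
		rep.TotalRDFs++
		for _, f := range strings.Fields(line) {
			if strings.HasPrefix(f, "_:") { // blank node carrying an xid
				rep.TotalXids++
				seen[f] = true
			}
		}
	}
	rep.UniqueXids = len(seen)
	return rep
}

func main() {
	rep := dryRun(bufio.NewScanner(os.Stdin))
	fmt.Printf("unique xids: %d, total xids: %d, total RDFs: %d\n",
		rep.UniqueXids, rep.TotalXids, rep.TotalRDFs)
}
```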

The report could include how many xids exist in the data set, total RDFs, etc. This would allow the admin to tweak the bulk loader, or the bulk loader itself could read the report file and intelligently decide whether to run in --limitMemory mode.

We can also take this a step further: --dry-run <time>, which scans the input data for that duration and reports the xids seen so far (a sketch follows). This limits the cost of a dry run on very large datasets while still giving a guesstimate of the xid/RDF ratio; since it only samples a prefix of the data, it is not 100% foolproof.
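The time-bounded variant could look like this sketch, reusing `dryRunReport` and the imports from above plus `time`; the 10k-RDF deadline check interval is an arbitrary tuning choice.

```go
// timedDryRun stops scanning once the deadline passes; the counts are then
// an estimate of the xid/RDF ratio rather than exact totals.
func timedDryRun(sc *bufio.Scanner, d time.Duration) (rep dryRunReport, truncated bool) {
	deadline := time.Now().Add(d)
	seen := make(map[string]bool)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		rep.TotalRDFs++
		for _, f := range strings.Fields(line) {
			if strings.HasPrefix(f, "_:") {
				rep.TotalXids++
				seen[f] = true
			}
		}
		// Check the clock only every 10k RDFs to keep the overhead negligible.
		if rep.TotalRDFs%10000 == 0 && time.Now().After(deadline) {
			truncated = true
			break
		}
	}
	rep.UniqueXids = len(seen)
	return rep, truncated
}
```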

2. Adaptive limitMemory

Instead of keeping a hard-coded threshold of 100K, we could take into account the system's total RAM, the data file size, etc., to figure out a threshold. Sketch below.
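For example, something along these lines. This is a sketch only: the 50% RAM budget, the 48-byte per-entry cost, and the 16-byte minimum xid size are made-up tuning constants, and in practice the total system RAM would come from the OS (e.g. via a library like gopsutil) rather than being hard-coded.

```go
package main

import (
	"fmt"
	"os"
)

// xidThreshold returns how many xidmap entries we can afford to keep fully
// in memory before switching to the limitMemory (disk-backed) mode.
func xidThreshold(totalRAMBytes, dataFileBytes int64) int64 {
	budget := totalRAMBytes / 2 // leave half the RAM for everything else
	const perEntry = 48         // rough bytes per xid -> uid map entry
	maxEntries := budget / perEntry
	// Cap by a crude upper bound on xids derivable from the input size,
	// assuming an xid occupies at least ~16 bytes on disk.
	if upper := dataFileBytes / 16; upper < maxEntries {
		return upper
	}
	return maxEntries
}

func main() {
	fi, err := os.Stat("data.rdf") // hypothetical input file
	if err != nil {
		panic(err)
	}
	const totalRAM = 16 << 30 // pretend 16 GiB; query the OS in practice
	fmt.Println("threshold:", xidThreshold(totalRAM, fi.Size()))
}
```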

3. CLI option for the unique xid / total xid ratio

In addition to, or instead of, (1), we can just ask the user to provide their best guess of the unique xid count and the total xid count, or the ratio between them. The presumption here is that they generally know their data sets and may be able to do some estimation. This would let us decide whether to go into --limitMemory mode or not, as in the sketch below.
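A sketch of how that decision could be wired up; the flag names --total-xids and --unique-xid-ratio are hypothetical, and the real flag surface would follow the bulk loader's existing conventions.

```go
package main

import (
	"flag"
	"fmt"
)

func main() {
	totalXids := flag.Int64("total-xids", 0, "user's estimate of total xids in the data set")
	ratio := flag.Float64("unique-xid-ratio", 0, "estimated unique xids / total xids")
	flag.Parse()

	// Decide on limitMemory mode from the estimated unique xid count.
	uniqueXids := int64(float64(*totalXids) * (*ratio))
	const threshold = 100_000 // the hard-coded threshold from the post
	if uniqueXids > threshold {
		fmt.Println("enabling limitMemory mode")
	} else {
		fmt.Println("keeping the full in-memory xidmap")
	}
}
```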

Finally, why is the ratio of unique xids to total xids important? I think the absolute number of unique xids will drive most of the decision making, not the ratio, right?