Bulk loader xidmap memory optimization

Edit: I think HyperLogLog would be a better approach (see "HyperLogLog" on Wikipedia).
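For anyone who wants to see what that would look like, here is a rough HyperLogLog sketch in Go. It is not taken from the bulk loader; the type `hll`, the precision `hllP`, and the choice of FNV-1a plus a MurmurHash3-style finalizer are all assumptions made for illustration. It uses a fixed 16 KiB of registers no matter how many xids are fed in.

```go
package main

import (
	"hash/fnv"
	"math"
	"math/bits"
)

const (
	hllP = 14        // precision (hypothetical choice): 2^14 registers
	hllM = 1 << hllP // number of registers
)

// hll is a minimal HyperLogLog: one small register per bucket.
type hll struct{ reg [hllM]uint8 }

// mix64 is the MurmurHash3 64-bit finalizer, used to spread FNV output
// evenly across all bits before splitting it into index and rank.
func mix64(x uint64) uint64 {
	x ^= x >> 33
	x *= 0xff51afd7ed558ccd
	x ^= x >> 33
	x *= 0xc4ceb9fe1a85ec53
	x ^= x >> 33
	return x
}

// Add records one xid. Adding the same xid again never changes the state.
func (h *hll) Add(xid string) {
	f := fnv.New64a()
	f.Write([]byte(xid))
	x := mix64(f.Sum64())
	idx := x >> (64 - hllP) // top hllP bits select the register
	// Remaining bits, with a sentinel bit so the rank is bounded.
	w := x<<hllP | (1 << (hllP - 1))
	rank := uint8(bits.LeadingZeros64(w)) + 1
	if rank > h.reg[idx] {
		h.reg[idx] = rank
	}
}

// Estimate returns the approximate number of distinct xids added so far.
func (h *hll) Estimate() float64 {
	sum, zeros := 0.0, 0
	for _, r := range h.reg {
		sum += math.Ldexp(1, -int(r)) // 2^-r
		if r == 0 {
			zeros++
		}
	}
	m := float64(hllM)
	alpha := 0.7213 / (1 + 1.079/m)
	e := alpha * m * m / sum
	if e <= 2.5*m && zeros > 0 {
		// Small-range correction: fall back to linear counting.
		e = m * math.Log(m/float64(zeros))
	}
	return e
}
```

Calling `Add` once per xid and `Estimate` at the end gives the unique count with a typical error of about 1% at this precision. Duplicates cost nothing extra.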

A Bloom filter with an atomic counter would be cheaper in terms of memory (but slower) than storing a map just to keep track of unique xids. Something like the following should work:

if the xid is not in the Bloom filter, increment uniq and add the xid to the Bloom filter
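A minimal Go sketch of that idea follows. Everything in it is hypothetical (the `xidCounter` type, `newXidCounter`, `Observe`, `Count`, and the FNV-based double hashing); it is not the bulk loader's code, just one way the step above could be written. The filter is guarded by a mutex and the counter is read atomically, so `Count` can be called from other goroutines without taking the lock.

```go
package main

import (
	"hash/fnv"
	"sync"
	"sync/atomic"
)

// xidCounter approximates the number of distinct xids with a Bloom filter
// and an atomic counter. All names here are hypothetical.
type xidCounter struct {
	uniq uint64 // accessed atomically; kept first for 64-bit alignment
	mu   sync.Mutex
	bits []uint64
	m    uint64 // number of bits in the filter
	k    uint64 // number of hash probes per xid
}

func newXidCounter(mBits, k uint64) *xidCounter {
	return &xidCounter{bits: make([]uint64, (mBits+63)/64), m: mBits, k: k}
}

// hashPair derives two hashes from one FNV-1a pass for double hashing.
func hashPair(xid string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(xid))
	h1 := h.Sum64()
	h.Write([]byte{0xff}) // extend the input to get a second value
	h2 := h.Sum64() | 1   // force odd so the probe step is never zero
	return h1, h2
}

// Observe adds xid to the filter and reports whether it looked new.
// uniq is incremented only when at least one probed bit was unset.
func (c *xidCounter) Observe(xid string) bool {
	h1, h2 := hashPair(xid)
	c.mu.Lock()
	defer c.mu.Unlock()
	isNew := false
	for i := uint64(0); i < c.k; i++ {
		pos := (h1 + i*h2) % c.m
		word, mask := pos/64, uint64(1)<<(pos%64)
		if c.bits[word]&mask == 0 {
			isNew = true
			c.bits[word] |= mask
		}
	}
	if isNew {
		atomic.AddUint64(&c.uniq, 1)
	}
	return isNew
}

// Count returns the current approximate number of distinct xids.
func (c *xidCounter) Count() uint64 { return atomic.LoadUint64(&c.uniq) }
```

Because a Bloom filter can report a false positive, a genuinely new xid is occasionally treated as already seen, so the count can only be an underestimate. Sizing the filter for the expected number of xids keeps that error small.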

We can also use a count-min sketch to get an approximate length for each xid->id pair.

I prefer the --dry-run option over the other two because it is probably the simplest for the end user. The other two options would require some prior knowledge.
