Storing triplets instead of posting lists?

jchiu · September 2, 2016, 4:18am

We love things to be sorted. And we have been trying to keep things sorted.

Yet another random idea that you probably have considered:

What if the key is just (pred, sub, obj) instead of (pred, sub)? To get the “posting list” from RockDB, we just seek to the prefix (pred, sub) and keep reading? The iterating might be painful, as opposed to a simple key-value fetch. But since RocksDB operates in blocks of about 4K, we can expect that once we hit a (pred, sub, obj), the other objs should be in the same block.

mrjn · September 2, 2016, 7:10am

Did you miss some words here?

jchiu · September 3, 2016, 6:54am

I have corrected the post but didn’t reply here…

jchiu · September 9, 2016, 10:12am

So I tried to do some benchmark to see how slow a row scan is compared to a point query of the same size.

Code is at: GitHub - jchiu0/rdbbenchmark

Here are the results.

BenchmarkPointQuery7-8    	 2000000	       835 ns/op
BenchmarkPointQuery8-8    	 2000000	       826 ns/op
BenchmarkPointQuery9-8    	 2000000	       790 ns/op
BenchmarkPointQuery10-8   	 2000000	      1109 ns/op
BenchmarkPointQuery11-8   	 1000000	      1303 ns/op
BenchmarkPointQuery12-8   	  500000	      2094 ns/op
BenchmarkRowScan7-8       	  200000	     10698 ns/op
BenchmarkRowScan8-8       	  100000	     18835 ns/op
BenchmarkRowScan9-8       	   50000	     35982 ns/op
BenchmarkRowScan10-8      	   20000	     74257 ns/op
BenchmarkRowScan11-8      	   10000	    150855 ns/op
BenchmarkRowScan12-8      	    5000	    296073 ns/op

A row scan seems really really slow. I can try tweaking some settings. Would suspect that a lot of other graph databases might be storing rdf triplets instead of what we’re doing here. If so, this might be one reason why Dgraph can shine, and might help with some marketing, maybe.

jchiu · September 12, 2016, 9:15am

I updated the code. Have tried quite a number of different optimizations for prefix scans but the best I got so far is not too great.

BenchmarkPointQuery7-8    	 1000000	      1208 ns/op
BenchmarkPointQuery8-8    	 2000000	      1099 ns/op
BenchmarkPointQuery9-8    	 2000000	      1158 ns/op
BenchmarkPointQuery10-8   	 1000000	      1259 ns/op
BenchmarkPointQuery11-8   	 1000000	      1514 ns/op
BenchmarkPointQuery12-8   	 1000000	      2257 ns/op
BenchmarkRowScan7-8       	  200000	      7531 ns/op
BenchmarkRowScan8-8       	  100000	     13375 ns/op
BenchmarkRowScan9-8       	   50000	     23786 ns/op
BenchmarkRowScan10-8      	   30000	     44632 ns/op
BenchmarkRowScan11-8      	   20000	     86536 ns/op
BenchmarkRowScan12-8      	   10000	    182612 ns/op

The hope is that row scan doesn’t take too much longer than point queries and we can get rid of gotomic hash, mutation layers and perhaps greatly simplify the codebase. Another benefit is that RocksDB can take care of super long posting lists automatically. Unfortunately, the benchmarks don’t look good.

(In case you wonder why point queries slowed down a bit, it’s because we force a flush from memtable and force everybody to re-read. That is the case for a typical query for a typical database.)

mrjn · September 12, 2016, 9:17am

Thanks for testing this hypothesis via benchmarks and profiling. That’s the right and scientific way to prove and disprove stuff. Good job!

system · November 28, 2017, 1:00am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Data == postings lists == indexes and resource utilization Dgraph Cloud	6	704	May 13, 2021
Some topics about RocksDB tuning Users	1	3676	November 28, 2017
Sparse adjacency matrices instead of posting lists? Misc	3	1016	November 13, 2018
21 million seems small Users	14	1505	November 28, 2017
Scale the shit out of this! - Dgraph Blog Blog	0	1329	August 18, 2017

Storing triplets instead of posting lists?

Related topics