Sorting and indexing

I remembered now the problem I was thinking about with pagination yesterday. So, say the query is [first 10 actors by dob]. Say there’s a node which all actors connect to, so getting all actors would give us a UID Matrix, with one element, and that element contains say 100K actors. We don’t really want to retrieve all their dobs before we pick the top 10.

So, what we can do is to execute this query in 2 stages:

  1. Send the entire UID Matrix (and not just a sorted unique uids) to the worker which serves the dob index. There, we intersect this UID Matrix, PL by PL (i.e. each Ai), with the dob index starting from the first year that we have records for. I say PL by PL, because it’s simple and it’s not going to be expensive, because after the first RocksDB lookup, we keep the retrieved PL in memory.
    We retrieve the first 10 for each Ai (in this case i is 0, but assume the scenario where i can be greater than zero), and also ensure that we pick each bucket’s intersection in its entirety. We shouldn’t stop after retrieving 10 intersection results in the middle of bucket, because within the bucket the ordering is based on uids, and not dobs.
    Return that result.

  2. Now we have a trimmed version of UID Matrix, we do what we do generally, i.e. convert it into a list of unique uids, and send to the worker which serves the original date of birth predicate. We retrieve all the dobs, and either send them back, if asked for, or sort the uids and send them back. The coordinator then re-generates the UID matrix based on these results and trims each Ai individually.

So, note that there’re 2 trims going on here, 1 in each step. And at each Ai, in each step.

This avoids the |A| network calls, while still allowing pagination without dropping any results on the floor.

Note that one might want to make this a step process in case UID matrix is small, but again, we shouldn’t optimize for small result sets. Assume that UID Matrix is huge, and just focus on a solution for that. This solution assumes that UID Matrix is huge. And that’s why the 2 step process makes sense.

1 Like