Recall we have a UID matrix A. Denote by A_i the i-th posting list. We want to sort all of them by some attribute, say dob. Let B be the merged list. Let N be the total number of elements in the UID matrix, |A_i| the size of the i-th posting list, and |B| the size of B, i.e., the number of unique values. Note N = sum_i |A_i|. Let |A| be the number of posting lists.
There are two extreme scenarios to keep in mind. In the first, the posting lists overlap completely, and |B| = N/|A|. In the second, the posting lists do not overlap at all, and |B| = N.
There are many ways we can sort. For now, don't worry about pagination; we will come back to it.
Method 1: Worker sorts B and returns permutation. Build posting lists using permutation.
Here we make only one network call instead of |A| network calls. The complexity lies in how we use the ordering of B to order each posting list A_i.
Here is one way to do it. As we merge the posting lists to form B, keep enough information to go from B back to each A_i: for each element of B, store which posting lists the element comes from. For example, if A_0=(1,2,3) and A_1=(2,3,4), then B=(1,2,3,4), and we keep a list of lists (or something equivalent and more memory-efficient) Q := ((0), (0,1), (0,1), (1)). The first element of Q is just (0) because B_0=1 is contained in posting list A_0 only, while the second element is (0,1) because B_1=2 is contained in both A_0 and A_1.
Now, send B to be sorted. Say the final sorted B by dob is (4,3,2,1). Then the worker will return the permutation (3,2,1,0), as B_3 is the first element in the sorted B. Now loop over (3,2,1,0). First we see a 3. We look up Q[3] and see that B_3 is contained only in A_1. Hence, we append B_3=4 to C_1, where C is like A (a UID matrix) but will contain the posting lists sorted by dob. Continuing through (3,2,1,0), the next element is 2. Looking up Q[2], we see that B_2 is contained in both A_0 and A_1. Hence, we append B_2=3 to both C_0 and C_1. Now, C_0=(B_2)=(3) while C_1=(B_3,B_2)=(4,3). We can continue this process and obtain C_0=(3,2,1) and C_1=(4,3,2).
Total running time is O(|B| log |B| + N). The first term is the worker sorting B; the second is the coordinator building each sorted posting list from the returned permutation.
One downside is that we need to allocate extra memory for Q, and garbage collection is not cheap.
Method 2: Like Method 1 but do binary search instead of keeping Q.
You could try this. Say we have B sorted by dob, which is (4,3,2,1). For each element of the sorted B, run through each posting list and do a binary search to check whether the element is inside; if it is, append it to C_i. This is actually quite expensive: the work needed to rebuild all posting lists (after getting the sorted B) is O(|A| |B|) ignoring log factors for the binary searches, and |B| can be as large as N. This looks bad. Maybe you all have some other ways of using binary search?
Method 3: Worker sorts B, then coordinator sorts each posting list by their position in sorted B
Each posting list will need to store its elements' positions in B. In the example above, A_0=(1,2,3) stores (0,1,2) while A_1=(2,3,4) stores (1,2,3). The worker then returns, for each B_i, its position in the sorted list. In this case it returns (3,2,1,0) as before, because B_0=1 is at position 3 of the sorted list (4,3,2,1), B_1=2 is at position 2, and so on. Now sort each posting list by this mapping.
There is actually no memory overhead here, because after this we have the ordering of each posting list directly. The disadvantage is that the coordinator still has to sort per posting list, which negates the advantage of sorting B once when the posting lists overlap heavily. Running time is roughly O(|B| log |B| + N log(N/|A|)).
Method 4: Worker sorts each posting list (simplest)
This is the simplest method. Running time is roughly O(N log(N/|A|)). The downside is that we need to make |A| network calls instead of 1, and the time spent on actual sorting can be worse by a factor of |A| in the case of maximal overlap. And we can no longer tell people that Dgraph is more efficient because we make one network call per predicate… However, in terms of asymptotic behavior it is actually not that bad, as Method 1 still has the O(N) term for building the sorted posting lists in the coordinator.
Pagination
Say we want just the top few entries of each posting list. With Method 4, the worker can trim the results and return early: it iterates over RocksDB until it has enough results per posting list, then returns.
However, for the other methods, which sort B, this is a lot harder: you can't really do any trimming. For example, take A_0=(0,1,2,3,4,5,6,7,8,9,10) and A_1=(11), and ask for just one result. B=(0,1,2,3,4,5,6,7,8,9,10,11), and say the order doesn't change when sorted by dob. To get the top result for A_1, we still have to return the full sorted B. If the worker sorts each posting list instead, it could have just returned (0) for A_0 and (11) for A_1.
Questions
Am I missing something basic here? Do you all have other approaches in mind? I am leaning towards Method 4. It can be worse by a log factor, which I think is palatable given that pagination can sometimes save a lot of work.
What do you all think? @core-devs
(In another post, I can write about whether we want to keep keys of index in memory.)