Sorting and indexing

So, in this case, the worker isn’t just going to return the sorted list; it’s going to return the indices into the sorted list. In practice, it won’t be quite that exact.

Sorting is essentially a two-step process. The first step is to get a list of results that is granular at the bucket level, i.e. we always pick up all the intersections from a bucket, because entities in a bucket are sorted by uid, not by the specified sort order. So we pick them all up and send them out to the worker which holds the dob. That worker then retrieves their dob values, sorts them, and trims the list down to the requested number N. Note that this N will almost always be smaller than the list that was sent. So we should not return indices; we should return the actual final list.
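A minimal sketch of that worker step, in Go. The names here (sortAndTrim, fetchDOB) are illustrative assumptions, not existing APIs: the worker receives the full bucket-granular uid list, looks up each entity’s dob, sorts by it, and returns only the top N uids rather than indices.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// fetchDOB stands in for reading the dob value stored on this worker for a
// uid; the real lookup would hit the local posting store.
func fetchDOB(uid uint64) time.Time {
	// Placeholder: derive a fake dob from the uid so the example runs.
	return time.Unix(int64(uid)*86400, 0)
}

// sortAndTrim sorts the incoming uids by dob and trims the result to n,
// returning the final uid list rather than indices into the input.
func sortAndTrim(uids []uint64, n int) []uint64 {
	type entry struct {
		uid uint64
		dob time.Time
	}
	entries := make([]entry, 0, len(uids))
	for _, uid := range uids {
		entries = append(entries, entry{uid: uid, dob: fetchDOB(uid)})
	}
	sort.Slice(entries, func(i, j int) bool {
		return entries[i].dob.Before(entries[j].dob)
	})
	if n > len(entries) {
		n = len(entries)
	}
	out := make([]uint64, n)
	for i := 0; i < n; i++ {
		out[i] = entries[i].uid
	}
	return out
}

func main() {
	// All intersections from a bucket arrive unsorted by dob; ask for top 3.
	fmt.Println(sortAndTrim([]uint64{42, 7, 19, 3, 88}, 3))
}
```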

Also, this particular step smells of an optimization that breaks the current semantics, which is that every query and reply contains uids, not indices or anything else. Something would have to convert the list of uids into indices under this scheme, and I’d rather have that conversion done closer to the metal than by another worker.

Method 2.

Shouldn’t this be N * log B? You do avoid creating an in-memory structure and allocating any more memory to store indices etc. in Q. In fact, even when you do store indices, you still need to convert a given UID in B to its index in Q. That effort is not being captured here; I think your assumption is that this part would be done elsewhere in the worker. See the comments above about doing that conversion here.
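A rough sketch of the uid-to-index conversion this complexity argument glosses over. Assuming the query list Q is sorted by uid, each of the N uids coming back can be located with a binary search, so the conversion alone costs on the order of N * log |Q|. The helper name (uidToIndex) is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// uidToIndex maps each uid to its position in the sorted list q, returning
// -1 for uids that are not present. Each lookup is a binary search, so the
// whole conversion costs on the order of N * log |Q|.
func uidToIndex(q []uint64, uids []uint64) []int {
	indices := make([]int, len(uids))
	for i, uid := range uids {
		idx := sort.Search(len(q), func(j int) bool { return q[j] >= uid })
		if idx < len(q) && q[idx] == uid {
			indices[i] = idx
		} else {
			indices[i] = -1
		}
	}
	return indices
}

func main() {
	q := []uint64{3, 7, 19, 42, 88}    // query list, sorted by uid
	sorted := []uint64{42, 3, 88}      // uids as returned in sort order
	fmt.Println(uidToIndex(q, sorted)) // [3 0 4]
}
```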

But we do, right? You just stored, for each posting list, another list that contains the index of each element. So you have essentially recreated the entire UID matrix in memory.
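To make the memory point concrete, here is an illustrative sketch (buildIndexMatrix is a made-up name): building one index slice per posting list allocates a structure with exactly the same shape as the uid matrix itself, so nothing is saved.

```go
package main

import "fmt"

// buildIndexMatrix creates one index slice per posting list. The total
// allocation mirrors the original uid matrix element for element, so the
// result set is effectively duplicated in memory.
func buildIndexMatrix(uidMatrix [][]uint64) [][]int {
	indexMatrix := make([][]int, len(uidMatrix))
	for i, plist := range uidMatrix {
		indexMatrix[i] = make([]int, len(plist))
		for j := range plist {
			// Placeholder: the real indices would come from the sort step.
			indexMatrix[i][j] = j
		}
	}
	return indexMatrix
}

func main() {
	uidMatrix := [][]uint64{{5, 9, 11}, {2, 7}}
	fmt.Println(buildIndexMatrix(uidMatrix)) // [[0 1 2] [0 1]]
}
```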

That basically kills our entire idea of keeping the number of network calls proportional to the complexity of the query, not to the size of the results. So, discard that idea immediately.

I think you’re applying the concept of pagination in the wrong place. You’re starting off with a defined set of uids and then just re-arranging them. Once you do the re-arranging at each posting list in the UIDmatrix (Ai), you can do the pagination individually at the coordinator. So this doesn’t seem like much of an issue.
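A minimal sketch of what pagination at the coordinator could look like, assuming each posting list Ai has already been re-arranged into the requested sort order. The paginate helper is a hypothetical name, not an existing function.

```go
package main

import "fmt"

// paginate applies an offset/count window to a single re-arranged posting
// list; the coordinator can do this per Ai without further worker calls.
func paginate(ai []uint64, offset, count int) []uint64 {
	if offset >= len(ai) {
		return nil
	}
	end := offset + count
	if end > len(ai) {
		end = len(ai)
	}
	return ai[offset:end]
}

func main() {
	uidMatrix := [][]uint64{
		{11, 5, 9, 2}, // A1, already in sort order
		{7, 3},        // A2
	}
	for i, ai := range uidMatrix {
		fmt.Printf("A%d page: %v\n", i+1, paginate(ai, 1, 2))
	}
}
```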

You’re trying to save CPU and memory costs while adding the cost of |A| network calls and stragglers. CPU is always faster than the network, any way you look at it. So this approach is definitely not better.