Yeah, it’s essentially your method 4; the main difference is that it avoids the |A| network calls. Also, note the 2-step approach.
I’d say just start with the 2-step approach. If you assume everything is on one machine, you’d essentially have to rewrite the thing later anyway. You’ll find that ProcessTaskOverNetwork already takes care of everything you need, so it’s not much more effort. It also saves you from writing custom code for the assumption that the same worker serves both.
One more thing: you might want a new RPC call for the first step of sort-trim, because that step sends out a UIDMatrix, which isn’t what a typical Query expects. Having a separate call might keep things clean.
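To make the "separate RPC" idea concrete, here is a rough sketch of what the dedicated request could look like next to a typical query. Everything in it is hypothetical: Query, SortTrimRequest, SortTrimResult and handleSortTrim are illustrative names, not existing types, and the trim rule (keep at most Offset+Count UIDs per already-sorted row) is only an assumption about what the first step does.

```go
package main

import "fmt"

// Query is a hypothetical stand-in for the typical request: one flat UID list.
type Query struct {
	Attr string
	UIDs []uint64
}

// SortTrimRequest is a hypothetical dedicated request for the first step of
// sort-trim. Unlike Query, it carries a whole UID matrix (one list per row).
type SortTrimRequest struct {
	Attr      string
	UIDMatrix [][]uint64
	Offset    int
	Count     int
}

// SortTrimResult keeps the matrix shape so each output row lines up with its input row.
type SortTrimResult struct {
	UIDMatrix [][]uint64
}

// handleSortTrim ASSUMES each row is already sorted by the sort attribute and
// keeps at most Offset+Count UIDs per row (the candidates a later step needs).
func handleSortTrim(req SortTrimRequest) SortTrimResult {
	limit := req.Offset + req.Count
	out := make([][]uint64, len(req.UIDMatrix))
	for i, row := range req.UIDMatrix {
		if len(row) > limit {
			row = row[:limit]
		}
		// Copy so the result does not alias the request's backing arrays.
		out[i] = append([]uint64(nil), row...)
	}
	return SortTrimResult{UIDMatrix: out}
}

func main() {
	req := SortTrimRequest{
		Attr:      "name",
		UIDMatrix: [][]uint64{{5, 2, 9, 1}, {7}},
		Offset:    1,
		Count:     1,
	}
	res := handleSortTrim(req)
	fmt.Println(res.UIDMatrix) // prints [[5 2] [7]]
}
```

The point of the split is just that the request and response are shaped as a matrix end to end, so the regular Query path doesn’t have to special-case it.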