I have a bunch of nodes that resemble a file system: file and folder nodes. There are also user nodes, which connect to some of the file and folder nodes. If a user is connected to a folder node, then all of its child nodes (including nested ones) are accessible to that user. Each file has a vector associated with it. What is the best way to perform a vector search over only the files the user has access to? Note that the graph can scale to billions of nodes, and the number of files a single user has access to can scale to hundreds of millions. It is essential that I pre-filter on what the user has access to before the vector search runs.
I am wondering what the most efficient way to do this is.
This is a great opportunity to use DQL var blocks. In an initial var block, define a variable holding the uid list of the nodes the user has access to. In the final block, perform your vector search restricted to that variable.
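To make the shape of that two-step approach concrete, here is a minimal in-memory sketch in Python. It is not DQL and does not call Dgraph. The names `grants`, `children`, `vectors`, `accessible_files`, and `search` are made up for illustration, and a brute-force cosine scan stands in for a real vector index. The first function plays the role of the var block (collect what the user can reach), and the second restricts the similarity search to that set.

```python
from collections import deque
import heapq
import math


def accessible_files(user, grants, children, vectors):
    """Step 1 (the 'var block'): collect every file reachable from the user's grants.

    grants:   user -> directly granted file/folder ids (hypothetical layout)
    children: folder id -> ids of its direct children
    vectors:  file id -> embedding; only files have an entry here
    """
    seen, files = set(), set()
    queue = deque(grants.get(user, ()))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        if node in vectors:  # treat any node with a vector as a file
            files.add(node)
        queue.extend(children.get(node, ()))  # descend into nested folders
    return files


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(user, query, k, grants, children, vectors):
    """Step 2: score only the pre-filtered files and keep the k most similar."""
    allowed = accessible_files(user, grants, children, vectors)
    return heapq.nlargest(k, allowed, key=lambda f: cosine(query, vectors[f]))


# Toy data: alice is granted "docs"; "other" (and c.txt) is outside her access.
grants = {"alice": ["docs"]}
children = {"docs": ["a.txt", "sub"], "sub": ["b.txt"], "other": ["c.txt"]}
vectors = {"a.txt": [1.0, 0.0], "b.txt": [0.0, 1.0], "c.txt": [1.0, 0.1]}

result = search("alice", [1.0, 0.0], 2, grants, children, vectors)
print(result)  # ['a.txt', 'b.txt']
```

Note that `c.txt` is close to the query but never gets scored, because the filter runs before the similarity step. The whole `allowed` set is materialized in memory here, which is fine for a toy example but matters at the scale described in the question.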
If the user potentially has access to hundreds of millions of nodes, then that uid list will be held in memory, correct? Wouldn't that become a problem when multiple search requests run at once? It would put a huge burden on compute resources, right?
Understood. How are the results being used? Are they being presented in a UI? If so, can you find a way to further refine the first set so that it’s manageable? For example, you could sort and paginate the nodes and limit each vector search to that page.
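As a rough illustration of the sort-and-paginate idea, here is a small in-memory Python sketch. It is not Dgraph code. The names `page_of`, `search_page`, and the `modified` timestamps are assumptions made up for the example, and a plain dot product stands in for whatever similarity metric your index uses. In DQL itself, the equivalent narrowing would presumably be done with ordering and `first`/`offset` on the block that builds the candidate set.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def page_of(candidates, sort_key, page, page_size):
    """Sort the candidate ids and return one fixed-size page of them."""
    ordered = sorted(candidates, key=sort_key)
    start = page * page_size
    return ordered[start:start + page_size]


def search_page(query, k, candidates, vectors, sort_key, page, page_size):
    """Run the similarity search over a single page instead of the full set."""
    subset = page_of(candidates, sort_key, page, page_size)
    return heapq.nlargest(k, subset, key=lambda f: dot(query, vectors[f]))


# Toy data: four accessible files with hypothetical last-modified timestamps.
modified = {"a": 3, "b": 1, "c": 2, "d": 4}
vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.9, 0.1], "d": [0.5, 0.5]}


def most_recent_first(f):
    return -modified[f]


first_page = search_page([1.0, 0.0], 1, modified, vectors, most_recent_first, 0, 2)
second_page = search_page([1.0, 0.0], 1, modified, vectors, most_recent_first, 1, 2)
print(first_page, second_page)  # ['a'] ['c']
```

The trade-off is that each search only sees one page, so the best match overall can sit on a page you never query. That is acceptable when the sort order already reflects what the user cares about (recency, a chosen folder, and so on), and less so when you need a true global top-k.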
If the intended use is for some sort of overall graph analysis, then exporting your data and loading it into a graph analysis library is probably the best route.