Hi @martwetzels,
Thanks for joining the call with us today; it is always a pleasure talking to you. Please find below a summary of the call and the agreed action item.
- Data mutation: A lot of data arrives every minute or so, and some of it may be redundant. The goal is to insert only the new data. The approach we discussed is an upsert: first find out which data points do not already exist, then insert only those. This is possible and efficient.
- Querying all the data points for a time range
- Querying all the data points for a time range for a user
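
To illustrate the upsert idea from the data mutation point, here is a minimal in-memory sketch in Python. It is not Dgraph code: the `store` dictionary, the `(user_id, provider_id, timestamp)` key and the `upsert_new_points` helper are all hypothetical names chosen for the example, and it assumes that this triple uniquely identifies a data point.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical identity of a data point: (user_id, provider_id, timestamp).
Key = Tuple[str, str, int]


def upsert_new_points(
    store: Dict[Key, float], batch: Iterable[Tuple[Key, float]]
) -> List[Key]:
    """Insert only the points whose key is not yet in the store.

    Returns the keys that were actually inserted.
    """
    inserted: List[Key] = []
    for key, value in batch:
        if key not in store:  # the "does it already exist?" check
            store[key] = value
            inserted.append(key)
    return inserted


# One point already stored, then a batch that repeats it and adds a new one.
store: Dict[Key, float] = {("u1", "p1", 100): 1.0}
batch = [(("u1", "p1", 100), 1.0), (("u1", "p1", 160), 2.0)]
new_keys = upsert_new_points(store, batch)
```

After this runs, `new_keys` contains only the point at timestamp 160, and the redundant point at 100 is left untouched.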
Your system is such that a User remains active for about 1-2 years at most. After that, no more data comes in for that user. However, new users keep arriving, so the data linked to Providers will keep growing.
So, you are concerned about having millions or billions of edges from a single Provider node to many Data nodes, and about how to traverse them efficiently. The most important thing here is to build a graph that can be traversed efficiently to answer the queries you are running. To that end, we discussed how you can design your schema. We discussed two major approaches:
- Have User, Provider and Data as types, and give Data nodes a predicate that stores the timestamp and is indexed. Also link each Data node to its corresponding User and Provider nodes. That way you will be able to answer both queries with minimal graph traversal, with the help of the index on the timestamp. This is where you will need the `between` function, which will make your queries more efficient than they are now. We have already forwarded this feature request internally and expressed your interest.
- Have User, Provider and Data as types, but this time don’t store the timestamp in Data. Instead, have separate types called Year, Month, Day, Hour, …, Second. Since we know a priori that a year has only 12 months, each Year node will have 12 edges, one to each Month. The same applies to the other types, such as Month and Day.
  Then, for a User, link it to its data through the hierarchy Year → Month → … → Second.
  Finally, each Data node will have an edge back to its User and Provider.
  This way, you will need only small indexes, and this should be more efficient for your queries than the previous approach.
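
The two approaches can be compared in a small, self-contained Python sketch. This is only an in-memory analogy, not Dgraph schema or query code: `TimestampIndex` and `TimeTree` are hypothetical classes, the tree stops at the Hour level for brevity, and the hour-by-hour walk is the simplest possible traversal rather than a tuned one. It says nothing about which approach is faster in Dgraph.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


class TimestampIndex:
    """Approach 1: every data point stores its timestamp in one sorted index."""

    def __init__(self):
        self._ts = []    # sorted timestamps
        self._rows = []  # (timestamp, user, value), parallel to self._ts

    def add(self, ts, user, value):
        pos = bisect_right(self._ts, ts)
        self._ts.insert(pos, ts)
        self._rows.insert(pos, (ts, user, value))

    def between(self, start, end, user=None):
        """Return rows with start <= timestamp <= end, optionally for one user."""
        lo = bisect_left(self._ts, start)
        hi = bisect_right(self._ts, end)
        return [r for r in self._rows[lo:hi] if user is None or r[1] == user]


class TimeTree:
    """Approach 2: per user, a hierarchy year -> month -> day -> hour -> points."""

    def __init__(self):
        self._root = {}

    def add(self, ts, user, value):
        node = self._root.setdefault(user, {})
        for part in (ts.year, ts.month, ts.day):
            node = node.setdefault(part, {})
        node.setdefault(ts.hour, []).append((ts, value))

    def between(self, start, end, user=None):
        users = [user] if user is not None else list(self._root)
        rows = []
        for u in users:
            # Visit each hour bucket that overlaps the requested range.
            hour = start.replace(minute=0, second=0, microsecond=0)
            while hour <= end:
                node = self._root.get(u, {})
                for part in (hour.year, hour.month, hour.day):
                    node = node.get(part, {})
                for ts, value in node.get(hour.hour, []):
                    if start <= ts <= end:
                        rows.append((ts, u, value))
                hour += timedelta(hours=1)
        return sorted(rows)


# Twelve made-up points, 20 minutes apart, alternating between two users.
base = datetime(2020, 1, 1)
index, tree = TimestampIndex(), TimeTree()
for i in range(12):
    ts, user = base + timedelta(minutes=20 * i), ("u1" if i % 2 == 0 else "u2")
    index.add(ts, user, float(i))
    tree.add(ts, user, float(i))

start, end = base + timedelta(hours=1), base + timedelta(hours=2)
all_rows = tree.between(start, end)
u1_rows = tree.between(start, end, user="u1")
```

Both structures answer the two queries from the summary (all points in a time range, and all points in a time range for one user) with the same results. They differ only in how the time dimension is organised, which is what the benchmarks are meant to evaluate.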
Agreed Action Item
- You are going to try both approaches and run some benchmarks. Then you will ping us back with the results, and we can see whether they can be improved further.
Best
Omar & Abhimanyu