Hello Everyone!
Apologies for the long post… I have been dipping my toes into Dgraph over the weekend and have enjoyed it tremendously.
I am trying to play with a data set of 320 million rows (the .com zone file) and its high interconnectivity. So far I have something like this:
type Tld struct {
	Uid  string `json:"uid,omitempty"`
	Name string `json:"tld.name,omitempty"`
}

type Domain struct {
	Uid         string       `json:"uid,omitempty"`
	Tld         *Tld         `json:"domain.tld,omitempty"`
	Name        string       `json:"domain.name,omitempty"`
	Nameservers []Nameserver `json:"domain.nameservers,omitempty"`
}

type Nameserver struct {
	Uid    string  `json:"uid,omitempty"`
	Name   string  `json:"nameserver.name,omitempty"`
	Ttl    int     `json:"nameserver.ttl,omitempty"`
	Domain *Domain `json:"nameserver.domain,omitempty"`
}
As you can see, there are a lot of relationships between the different objects. I can load the data in with this schema:
tld.name: string @index(exact) @upsert .
domain.tld: uid .
domain.name: string @index(exact) @upsert .
domain.nameservers: [uid] @reverse .
nameserver.domain: uid @reverse .
nameserver.name: string @index(exact) @upsert .
nameserver.ttl: int .
So far so good.
The issue is speed. It's currently running on 2 x 250GB SSDs in software RAID 0, with 1 Zero, 1 Alpha, and 1 Ratel, all running under docker-compose.
Round trips for a full row insert (ensuring the TLD, the domain, the nameserver, and the nameserver's domain all exist) take ~10ms. At ~10ms per row, 320 million rows works out to roughly 37 days of serial inserts, so as you can imagine the initial import would take a long, long time, and the idea of ingesting multiple zone files daily suddenly becomes impossible.
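Roughly, each row currently goes through a get-or-create per name before the edges are written. Here is a simplified sketch of that pattern with the dgo client (v2 here; the endpoint, values, and error handling are illustrative rather than my exact code, but the predicate names match the schema above):

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

// getOrCreate returns the UID of the node whose pred equals name,
// creating a blank node when none exists. The @upsert directive on the
// index makes concurrent transactions conflict rather than silently
// creating duplicates.
func getOrCreate(ctx context.Context, txn *dgo.Txn, pred, name string) (string, error) {
	q := fmt.Sprintf(`{ q(func: eq(%s, %q)) { uid } }`, pred, name)
	resp, err := txn.Query(ctx, q)
	if err != nil {
		return "", err
	}
	var res struct {
		Q []struct {
			Uid string `json:"uid"`
		} `json:"q"`
	}
	if err := json.Unmarshal(resp.Json, &res); err != nil {
		return "", err
	}
	if len(res.Q) > 0 {
		return res.Q[0].Uid, nil // node already exists, reuse its UID
	}
	mu := &api.Mutation{SetNquads: []byte(fmt.Sprintf("_:n <%s> %q .", pred, name))}
	assigned, err := txn.Mutate(ctx, mu)
	if err != nil {
		return "", err
	}
	return assigned.Uids["n"], nil
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

	ctx := context.Background()
	txn := dg.NewTxn()
	defer txn.Discard(ctx)

	// Each call below is at least one query round trip, plus a mutation
	// when the node is new; this is where the ~10ms per row goes.
	// (Error handling elided for brevity.)
	tld, _ := getOrCreate(ctx, txn, "tld.name", "com")
	dom, _ := getOrCreate(ctx, txn, "domain.name", "example.com")
	ns, _ := getOrCreate(ctx, txn, "nameserver.name", "ns1.example.com")

	edges := fmt.Sprintf(`
		<%s> <domain.tld> <%s> .
		<%s> <domain.nameservers> <%s> .
		<%s> <nameserver.domain> <%s> .`, dom, tld, dom, ns, ns, dom)
	if _, err := txn.Mutate(ctx, &api.Mutation{SetNquads: []byte(edges)}); err != nil {
		log.Fatal(err)
	}
	if err := txn.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}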
I know that batching would be ideal, and some early experiments with batching increased ingestion speed dramatically, but with this much interconnectivity it's hard to avoid creating duplicates.
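Those experiments looked roughly like the sketch below: reuse one blank-node label per distinct name within a batch, record the real UIDs Dgraph assigns at commit time, and use those UIDs in later batches. (This is illustrative; the batcher type and its helpers are made up for this post.)

package main

import (
	"context"
	"fmt"
	"log"
	"strings"

	"github.com/dgraph-io/dgo/v2"
	"github.com/dgraph-io/dgo/v2/protos/api"
	"google.golang.org/grpc"
)

type batcher struct {
	dg     *dgo.Dgraph
	cache  map[string]string // "pred|name" -> UID (0x..) or blank-node label
	quads  []string
	serial int
}

// node returns an identifier usable in an N-Quad for the given name: a
// real <0x..> UID if a previous batch created it, otherwise a blank-node
// label (emitting its name triple the first time the name is seen).
func (b *batcher) node(pred, name string) string {
	key := pred + "|" + name
	if id, ok := b.cache[key]; ok {
		if strings.HasPrefix(id, "0x") {
			return "<" + id + ">"
		}
		return "_:" + id
	}
	b.serial++
	label := fmt.Sprintf("b%d", b.serial)
	b.cache[key] = label
	b.quads = append(b.quads, fmt.Sprintf("_:%s <%s> %q .", label, pred, name))
	return "_:" + label
}

func (b *batcher) addRow(domain, tld, ns string) {
	d, t, n := b.node("domain.name", domain), b.node("tld.name", tld), b.node("nameserver.name", ns)
	b.quads = append(b.quads,
		fmt.Sprintf("%s <domain.tld> %s .", d, t),
		fmt.Sprintf("%s <domain.nameservers> %s .", d, n),
		fmt.Sprintf("%s <nameserver.domain> %s .", n, d),
	)
}

// flush commits the accumulated quads in one round trip, then swaps the
// blank-node labels in the cache for the UIDs Dgraph assigned to them.
func (b *batcher) flush(ctx context.Context) error {
	if len(b.quads) == 0 {
		return nil
	}
	mu := &api.Mutation{CommitNow: true, SetNquads: []byte(strings.Join(b.quads, "\n"))}
	resp, err := b.dg.NewTxn().Mutate(ctx, mu)
	if err != nil {
		return err
	}
	for key, id := range b.cache {
		if uid, ok := resp.Uids[id]; ok {
			b.cache[key] = uid
		}
	}
	b.quads = b.quads[:0]
	return nil
}

func main() {
	conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	b := &batcher{dg: dgo.NewDgraphClient(api.NewDgraphClient(conn)), cache: map[string]string{}}
	b.addRow("foo.com", "com", "ns1.example.com")
	b.addRow("bar.com", "com", "ns1.example.com") // tld and nameserver deduped via the cache
	if err := b.flush(context.Background()); err != nil {
		log.Fatal(err)
	}
}

The catch is that the cache only knows about nodes this client has created; anything already in the graph from an earlier run still needs a lookup first, which is exactly where the duplicates creep in.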
One method I'm thinking of exploring is to reserve UIDs from Zero and do ad hoc reads from the cluster to populate in-memory caches (injecting reserved UIDs as and when I find new objects), then insert everything once some threshold is reached.
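Sketched out, that idea looks something like this. Zero exposes an HTTP endpoint that leases blocks of UIDs (GET /assign?what=uids&num=N on the Zero HTTP port, 6080 by default); the response parsing below is from memory, so it would need verifying against the running version:

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"strconv"
)

type uidAllocator struct {
	zeroAddr string // e.g. "http://localhost:6080"
	next     uint64
	end      uint64
	byName   map[string]uint64 // cache: "pred|name" -> reserved UID
}

// lease asks Zero for a fresh block of n UIDs.
func (a *uidAllocator) lease(n int) error {
	resp, err := http.Get(fmt.Sprintf("%s/assign?what=uids&num=%d", a.zeroAddr, n))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	var out struct {
		StartId string `json:"startId"`
		EndId   string `json:"endId"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return err
	}
	a.next, _ = strconv.ParseUint(out.StartId, 10, 64)
	a.end, _ = strconv.ParseUint(out.EndId, 10, 64)
	return nil
}

// uid returns the reserved UID for a name, drawing a new one from the
// leased range the first time the name is seen.
func (a *uidAllocator) uid(pred, name string) (uint64, error) {
	key := pred + "|" + name
	if u, ok := a.byName[key]; ok {
		return u, nil
	}
	if a.next == 0 || a.next > a.end { // no range yet, or range exhausted
		if err := a.lease(100000); err != nil {
			return 0, err
		}
	}
	u := a.next
	a.next++
	a.byName[key] = u
	return u, nil
}

func main() {
	a := &uidAllocator{zeroAddr: "http://localhost:6080", byName: map[string]uint64{}}
	d, err := a.uid("domain.name", "example.com")
	if err != nil {
		log.Fatal(err)
	}
	t, _ := a.uid("tld.name", "com")
	// Explicit-UID N-Quads like these can be batched with no lookups at all:
	fmt.Printf("<%#x> <domain.name> %q .\n", d, "example.com")
	fmt.Printf("<%#x> <domain.tld> <%#x> .\n", d, t)
}

That would keep inserts lookup-free for anything this process has already seen, though names already in the graph from previous days would still need reconciling via those ad hoc reads.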
Does anyone have any other suggestions? Am I approaching the schema in an overly complicated way, and is that the main hindrance? Or is my hardware the root cause?