Go, Dgraph and 320 million rows need a bit of help to import faster

Hello Everyone!

Apologies for the long post… I have been dipping my toes into Dgraph over the weekend and have enjoyed it tremendously.

I am trying to play with a data set of 320 million rows (the .com zone file) and its high interconnectivity. So far I have something like this:

type Tld struct {
        Uid  string `json:"uid,omitempty"`
        Name string `json:"tld.name,omitempty"`
}

type Domain struct {
        Uid         string       `json:"uid,omitempty"`
        Tld         *Tld         `json:"domain.tld,omitempty"`
        Name        string       `json:"domain.name,omitempty"`
        Nameservers []Nameserver `json:"domain.nameservers,omitempty"`
}

type Nameserver struct {
        Uid    string  `json:"uid,omitempty"`
        Name   string  `json:"nameserver.name,omitempty"`
        Ttl    int     `json:"nameserver.ttl,omitempty"`
        Domain *Domain `json:"nameserver.domain,omitempty"`
}

As you can see there are a lot of relationships between the different objects. I can load the data in with this schema:

tld.name: string @index(exact) @upsert .
domain.tld: uid .
domain.name: string @index(exact) @upsert .
domain.nameservers: [uid] @reverse .
nameserver.domain: uid @reverse .
nameserver.name: string @index(exact) @upsert .
nameserver.ttl: int .
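
For reference, applying that schema from Go with dgo v2 looks roughly like this minimal sketch (the Alpha address is just a placeholder for my setup):

// imports: context, log, github.com/dgraph-io/dgo/v2, github.com/dgraph-io/dgo/v2/protos/api, google.golang.org/grpc
conn, err := grpc.Dial("localhost:9080", grpc.WithInsecure()) // Alpha's gRPC port
if err != nil {
        log.Fatal(err)
}
defer conn.Close()

dg := dgo.NewDgraphClient(api.NewDgraphClient(conn))

schema := `
        tld.name: string @index(exact) @upsert .
        domain.tld: uid .
        domain.name: string @index(exact) @upsert .
        domain.nameservers: [uid] @reverse .
        nameserver.domain: uid @reverse .
        nameserver.name: string @index(exact) @upsert .
        nameserver.ttl: int .
`
if err := dg.Alter(context.Background(), &api.Operation{Schema: schema}); err != nil {
        log.Fatal(err)
}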

So far so good.

The issue is speed. Currently it’s running on 2 x 250GB SSDs in software RAID 0, with 1 Zero, 1 Alpha and 1 Ratel all running under docker-compose.

A round trip for a full row insert (ensuring the tld, the domain, the nameserver and the nameserver’s domain) takes ~10ms. As you can imagine, that will take a long, long time to import, and the idea of ingesting daily (multiple zone files) suddenly becomes impossible.

I know that batching would be ideal, and early on some experiments with batching increased ingestion dramatically, but with the interconnectivity it’s hard to avoid creating duplicates.
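
By batching I mean buffering records and sending them as a single mutation, roughly like this dgo v2 sketch (the helper is illustrative; nested Tld/Nameserver values without known uids are exactly where the duplicates creep in):

// rough sketch: send a whole slice of domains in one mutation
// (imports: context, encoding/json, dgo v2 and its protos/api package)
func insertBatch(ctx context.Context, dg *dgo.Dgraph, batch []Domain) error {
        payload, err := json.Marshal(batch)
        if err != nil {
                return err
        }

        txn := dg.NewTxn()
        defer txn.Discard(ctx)

        // one network round trip for the whole batch instead of one per record;
        // any nested Tld/Nameserver without a Uid becomes a brand new node
        _, err = txn.Mutate(ctx, &api.Mutation{
                SetJson:   payload,
                CommitNow: true,
        })
        return err
}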

One method I’m thinking of exploring is to reserve UIDs from Zero and do ad-hoc reads to build up caches (injecting reserved UIDs as and when I find new objects), then insert the accumulated batch once it reaches some threshold.
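
For the UID reservation part, Zero exposes an HTTP /assign endpoint; something like this sketch (port 6080 is the default and an assumption about my compose setup):

// sketch: lease a block of uids from Zero's HTTP endpoint
// (imports: fmt, io/ioutil, log, net/http)
func reserveUids(n int) error {
        resp, err := http.Get(fmt.Sprintf("http://localhost:6080/assign?what=uids&num=%d", n))
        if err != nil {
                return err
        }
        defer resp.Body.Close()

        body, err := ioutil.ReadAll(resp.Body)
        if err != nil {
                return err
        }
        // the JSON body contains the start/end of the leased uid range;
        // exact field names depend on the Dgraph version, so just log it here
        log.Printf("leased uids: %s", body)
        return nil
}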

Does anyone have any other suggestions? Am I approaching the schema in an overly complicated way, and is that the main hindrance? Or is my hardware the root cause?

I’m a bit confused here. What is your process for inserting rows? How much data per second do you need to insert?

If you’re importing 320 million N-Quads via the Bulk Loader, it would probably take ~2 hours (maybe 76 min, depending on your indexes).

Thanks for your reply. The Bulk Loader would indeed be faster, but I believe it’s only for initialising a cluster, and .com zone files (other TLDs have zone files too, plus there is additional data I would like to import) are published daily.

Currently the process has one goroutine read the file and fan records out to 100 workers. Processing a record looks something like this:

        // create nameserver
        nameserverUid, err := p.ensureNameserver(ctx, nsName, int(header.Ttl))
        if err != nil {
                return errors.Wrap(err, "ensureNameserver")
        }

        nameservers := []Nameserver{
                Nameserver{Uid: nameserverUid},
        }

        // create domain
        _, err = p.ensureDomain(ctx, domainName, nameservers)
        if err != nil {
                return errors.Wrap(err, "ensureDomain")
        }

The ensure* function looks like this:

    // check cache for the name we are looking for if it exists, return uid
    // do a query for the uid, if it exists set it in cache, return uid 
    // ensure any dependencies (tld, other domains)
    // do an upsert
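
In Go terms the shape is roughly this; an illustrative sketch of the simplest case (a tld) using dgo v2’s txn.Do upsert, not my actual code (the receiver type and sync.Map cache are stand-ins):

func (p *processor) ensureTld(ctx context.Context, name string) (string, error) {
        // 1. check the cache for the name; if present, return the uid
        if uid, ok := p.tldCache.Load(name); ok {
                return uid.(string), nil
        }

        // 2. + 4. look up and conditionally create in a single upsert request
        txn := p.dg.NewTxn()
        defer txn.Discard(ctx)

        req := &api.Request{
                Query: fmt.Sprintf(`{ q(func: eq(tld.name, %q)) { u as uid } }`, name),
                Mutations: []*api.Mutation{{
                        Cond:      `@if(eq(len(u), 0))`, // only create when the query found nothing
                        SetNquads: []byte(fmt.Sprintf("_:tld <tld.name> %q .", name)),
                }},
                CommitNow: true,
        }
        resp, err := txn.Do(ctx, req)
        if err != nil {
                return "", err
        }

        // created just now: the blank-node mapping comes back in resp.Uids
        if uid, ok := resp.Uids["tld"]; ok {
                p.tldCache.Store(name, uid)
                return uid, nil
        }

        // already existed: the uid is in the query part of the response
        var out struct {
                Q []struct {
                        Uid string `json:"uid"`
                } `json:"q"`
        }
        if err := json.Unmarshal(resp.Json, &out); err != nil {
                return "", errors.Wrap(err, "unmarshal tld lookup")
        }
        if len(out.Q) == 0 {
                return "", errors.New("tld neither found nor created")
        }
        p.tldCache.Store(name, out.Q[0].Uid)
        return out.Q[0].Uid, nil
}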

One of the problems with this is that a domain can have a nameserver which is on another domain in another TLD. So one domain can miss all the caches and need to insert three records, whereas another might only need to insert itself.

I’m not expecting it to complete in 2 hours, but if it can’t ingest a daily snapshot of data in under 24 hours then there is something wrong with my setup, or this isn’t the correct tool for the job :frowning:

A couple of suggestions:

It looks like you can simplify your schema here given that you have domain.nameservers edges from domains to nameservers and nameserver.domain edges from nameservers to a domain.

Since you’re already maintaining these reverse edges yourself, the @reverse directives are duplicated effort that requires more writes than necessary per mutation.

Looks like you’re already using Dgraph v1.1.0 (given the [uid] in your schema). You can make use of the Upsert Block, which combines query, mutate and commit in a single network call.

You could also try out the Live Loader if you already know the UIDs; it loads data concurrently in batches (10 goroutines and 1,000 N-Quads per batch, by default).

Thanks for the reply dmai.

Unfortunately I don’t think I can do that. For example:

;; ANSWER SECTION:
dgraph.io.		86400	IN	NS	clyde.ns.cloudflare.com.
dgraph.io.		86400	IN	NS	mimi.ns.cloudflare.com.

dgraph.io's nameservers are [clyde.ns.cloudflare.com, mimi.ns.cloudflare.com], so it would need to create two nameserver nodes which share the domain cloudflare.com. Then there are the two TLDs [io, com] to be created.

I was also using the upsert block, which was speeding things up, but the issue is that each record is still a round trip, and my attempts to batch these were tricky.

I think I have found a good method, since reads are blazing fast: I parse the file twice, once to create/ensure all the TLDs and domains exist, then a second time to create the nameserver links (roughly the two-pass shape sketched below). Thanks for all the help!
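
Roughly, the two passes look like this (the walkZoneFile/record/linkNameserver names are illustrative, not the real code):

func (p *processor) load(ctx context.Context, path string) error {
        // pass 1: ensure every tld and domain node exists, warming the uid caches
        if err := p.walkZoneFile(path, func(r record) error {
                if _, err := p.ensureTld(ctx, r.tld); err != nil {
                        return err
                }
                _, err := p.ensureDomain(ctx, r.domain, nil)
                return err
        }); err != nil {
                return err
        }

        // pass 2: with every uid cached, linking nameservers is write-only work
        return p.walkZoneFile(path, func(r record) error {
                return p.linkNameserver(ctx, r.domain, r.nameserver, r.ttl)
        })
}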

Ah, thanks for correcting me. I misunderstood how the edges are used here, so if you’re making use of @reverse that’s great.

Glad you found a solution that works. I’d be interested in hearing the loading times you end up with.
