Xid mapping in `live` ingest

abourget · July 31, 2018, 7:15pm

Hi there…

I’m pleased to see that bulk loading can read URIs and map to UIDs directly, while saving an xid edge…

I’d like to keep my reference data in that format, but then I can’t easily live load incremental data through the mutations API. What I need to do is crawl through my file, query dgraph to find the xid → uid mappings that already exist, assign new blank nodes to those that don’t, and do all of that outside the database. That’s a lot slower than anything… Wouldn’t there be a way to live load some RDF that has <whatever:node_id> as subject, and have the program do the mapping and the assignment of new UIDs?

Why the two different behaviors?

btw, thanks for the great database

atul_nasir · January 6, 2025, 3:18am

import json
import requests

# Configuration
DGRAPH_ENDPOINT = "http://localhost:8080"

def get_existing_xid_mappings():
    query = """
    {
      q(func: has(xid)) {
        uid
        xid
      }
    }
    """
    # /query accepts a JSON body of the form {"query": ...}; results come back nested under "data"
    headers = {'Content-Type': 'application/json'}
    response = requests.post(f"{DGRAPH_ENDPOINT}/query", headers=headers, data=json.dumps({"query": query}))
    response.raise_for_status()
    result = response.json()
    mappings = {node['xid']: node['uid'] for node in result['data']['q']}
    return mappings

def create_new_uids(xids, existing_mappings):
    new_mappings = {}
    for xid in xids:
        if xid not in existing_mappings:
            # Here you would typically use Dgraph's mutation to create a new node with the xid
            mutation = {
                "set": [
                    {
                        "uid": "_:new",
                        "xid": xid
                    }
                ]
            }
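            # JSON mutations are sent as application/json; commitNow=true commits the transaction immediately
            headers = {'Content-Type': 'application/json'}
            response = requests.post(f"{DGRAPH_ENDPOINT}/mutate?commitNow=true", headers=headers, data=json.dumps(mutation))
            response.raise_for_status()
            # The assigned UIDs are nested under "data" in the response
            new_uid = response.json()['data']['uids']['new']
            new_mappings[xid] = new_uid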
    return new_mappings

def handle_rdf_data(rdf_data):
    # Parse RDF to extract XIDs (this is a simplification; actual RDF parsing would be more complex)
    xids = set()
    for line in rdf_data.splitlines():
        stripped = line.strip()  # Drop indentation so the subject starts at index 0
        if stripped.startswith('<'):
            xid = stripped.split('>')[0][1:]  # Subject only; very basic, might not work for all RDF formats
            xids.add(xid)

    existing_mappings = get_existing_xid_mappings()
    new_mappings = create_new_uids(xids, existing_mappings)

    # Now update your RDF with the correct UIDs
    updated_rdf = []
    for line in rdf_data.splitlines():
        stripped = line.strip()
        if stripped.startswith('<'):
            xid = stripped.split('>')[0][1:]
            uid = new_mappings.get(xid, existing_mappings.get(xid))
            if uid:
                updated_line = line.replace(f"<{xid}>", f"<{uid}>")
                updated_rdf.append(updated_line)
            else:
                updated_rdf.append(line)  # Keep the line if we couldn't map the XID
        else:
            updated_rdf.append(line)
    
    return '\n'.join(updated_rdf)

def main():
    # Sample RDF data for demonstration
    sample_rdf = """
    <node1> <name> "Node One" .
    <node2> <name> "Node Two" .
    """
    
    updated_rdf = handle_rdf_data(sample_rdf)
    print(updated_rdf)
    
    # Mutation with updated RDF: wrap the triples in a set block and commit immediately
    headers = {'Content-Type': 'application/rdf'}
    response = requests.post(f"{DGRAPH_ENDPOINT}/mutate?commitNow=true", headers=headers, data=f"{{ set {{\n{updated_rdf}\n}} }}")
    response.raise_for_status()
    print("Mutation completed:", response.text)

if __name__ == "__main__":
    main()

Raphael · January 6, 2025, 4:29pm

Hello Alexandre, thanks for the nice note about Dgraph.

To load incremental data with dgraph live you have to use an xid predicate. This is the purpose of this feature.

The predicate used to store external IDs can be anything (in our example we use xid as the name of the predicate).

Simply set the predicate name using --upsertPredicate in dgraph live:

docker exec -it dgraph-donors  dgraph live -f /dgraph/donors-CA.rdf.gz -s /dgraph/donors.schema --upsertPredicate xid

dgraph live will check that predicate and perform only upserts. For newly created subjects, it will also save the xid predicate with the value of the blank node, for example _:whatever#node_id.
Note that the _: prefix of the blank node is stored as part of the value, which is a bit confusing.
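For readers who want the same find-or-create behavior through the mutations API rather than dgraph live, here is a minimal sketch that builds a DQL upsert block for one external ID. It is not how dgraph live is implemented internally; the function name `build_upsert_request` and its parameters are hypothetical, and it assumes the xid predicate is indexed in the schema so that `eq` can be used on it.

```python
import json


def build_upsert_request(xid, facts, xid_predicate="xid"):
    """Return a DQL upsert block that finds or creates the node with this xid.

    `facts` maps predicate names to string values (hypothetical helper input).
    """
    quoted_xid = json.dumps(xid)  # json.dumps gives a double-quoted, escaped literal
    triples = [f"uid(v) <{xid_predicate}> {quoted_xid} ."]
    for predicate, value in facts.items():
        triples.append(f"uid(v) <{predicate}> {json.dumps(value)} .")
    body = "\n      ".join(triples)
    return (
        "upsert {\n"
        f"  query {{ v as var(func: eq({xid_predicate}, {quoted_xid})) }}\n"
        "  mutation {\n"
        "    set {\n"
        f"      {body}\n"
        "    }\n"
        "  }\n"
        "}"
    )


if __name__ == "__main__":
    print(build_upsert_request("node1", {"name": "Node One"}))
```

The resulting string would be posted to /mutate?commitNow=true with Content-Type application/rdf. When the query variable matches nothing, `uid(v)` creates a new node, so the xid edge is written on first sight and reused afterwards.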

There is also an option to speed up the process by maintaining an xidmap.
dgraph live can store the xidmap (in a folder, in compressed format) and reuse it during the next load.
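To illustrate the idea only, here is a toy sketch of a persistent identifier map: it remembers which UID was given to each blank-node identifier and reloads that on the next run. It is not Dgraph's on-disk xidmap format; the `XidMap` class, the JSON file, and the local counter are all assumptions made for the example (in a real cluster, UIDs are assigned by Dgraph, not by a local counter).

```python
import json
from pathlib import Path


class XidMap:
    """Toy persistent map from blank-node identifier to UID (illustrative only)."""

    def __init__(self, path):
        self.path = Path(path)
        # Reload mappings saved by a previous run, if any
        self.mapping = json.loads(self.path.read_text()) if self.path.exists() else {}
        # Stand-in for UID allocation: continue after the highest UID seen so far
        self.next_uid = max((int(u, 16) for u in self.mapping.values()), default=0) + 1

    def assign(self, xid):
        """Return the known UID for xid, or allocate a new one."""
        if xid not in self.mapping:
            self.mapping[xid] = hex(self.next_uid)
            self.next_uid += 1
        return self.mapping[xid]

    def save(self):
        """Persist the mappings so the next load can reuse them."""
        self.path.write_text(json.dumps(self.mapping))
```

The point of the design is that a second load sees the same identifier and gets the same UID back, without querying the database for it.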

I think we can combine both features (I have not tested it): use the xidmap to replace blank nodes with known UIDs, and use the xid predicate to upsert in case data has been injected with a different tool.
Note that the other tool (or API) has to set the xid predicate correctly.

Hope that helps.

rarvikar · January 14, 2025, 2:28pm

Hi,

FWIW, we can also use xidmap and store the “UID → blank-node-identifier” mapping using the --xidmap feature in the Live Loader. For subsequent incremental loads, just specify the path to the folder used to store the xidmap, created during the initial load.
More importantly, there’s no need to create an XID predicate in your schema for this feature to work. In the context of the Live Loader, the term xidmap only describes the UID → blank-node mapping, not a UID → XID predicate mapping, which is easy to misunderstand.
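As a rough sketch of that workflow, the helper below builds the dgraph live command line so the initial and incremental loads point at the same xidmap folder. The function name `live_load_command` and the file paths are hypothetical; only the -f, -s and --xidmap flags come from this thread.

```python
import shlex


def live_load_command(rdf_file, schema_file, xidmap_dir):
    """Build a dgraph live invocation that reads and updates the given xidmap folder."""
    return [
        "dgraph", "live",
        "-f", rdf_file,
        "-s", schema_file,
        "--xidmap", xidmap_dir,
    ]


if __name__ == "__main__":
    xidmap_dir = "/dgraph/xidmap"  # hypothetical folder, reused across loads
    initial = live_load_command("/dgraph/initial.rdf.gz", "/dgraph/data.schema", xidmap_dir)
    incremental = live_load_command("/dgraph/update.rdf.gz", "/dgraph/data.schema", xidmap_dir)
    print(shlex.join(initial))
    print(shlex.join(incremental))
```

Each list can be passed to `subprocess.run` as is; the only thing that has to stay constant between loads is the xidmap folder.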

HTH!
Best of luck!
