Deleting data when a variable has over a million UIDs

Hi, my code looks like this:

query := `
  query {
    var(func: has(resourceId)) @filter(not eq(resourceId, "11111111")) {
      uids as uid
    }
  }
`

mu := &api.Mutation{
	DelNquads: []byte(`uid(uids) * * .`),
}
req := &api.Request{
	Query:     query,
	Mutations: []*api.Mutation{mu},
	CommitNow: true,
}
_, err := txn.Do(context.Background(), req)

But when there is a lot of data in the database, it fails with this error:

rpc error: code = Unknown desc = var [uids] has over million UIDs

So, how can I bypass the one-million-UID limit with only one txn.Do?

My cluster config:

dgraph zero --my=127.0.0.1:5080 --bindall=false
dgraph alpha --my=127.0.0.1:7080 --zero=127.0.0.1:5080 --bindall=false

thx.

Don’t bypass it. Paginate the results (Pagination - Query language) and do multiple requests.

This is unfortunately a limitation of Dgraph, and it could lead to corrupt data if you don’t carefully manage the pagination to ensure all expected requests complete. Take a look at upserts, which might help you do the next pagination set without needing to control the pagination externally.

This same limitation applies at an even smaller scale when updating data.

Good point. I think amaster means that if you retrieve UIDs to delete, especially in parallel, in pages of 1,000, and those pages are retrieved at different times, they may not line up perfectly due to additions or deletions that occur between the various page queries.

Even single-threaded, if you run a sequence of page queries you may get an incomplete list due to concurrent changes.

To avoid issues, consider using after: and first: (rather than offset: and first:) on UIDs to do the pagination. UIDs are sequential (or at least semi-sequential via block allocation - not sure), so the UIDs in earlier pages won’t be perturbed as you work through your millions of deletes.
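For the delete in the original question, the bounded query could look something like this sketch (lastUID is a hypothetical cursor variable that the client sets to the highest UID returned by the previous page):

// Hypothetical cursor: "0x0" starts from the beginning; alternatively omit after: on the first page.
lastUID := "0x0"

query := fmt.Sprintf(`
  query {
    var(func: has(resourceId), first: 10000, after: %s) @filter(not eq(resourceId, "11111111")) {
      uids as uid
    }
    page(func: uid(uids)) {
      uid
    }
  }
`, lastUID)

The extra page block returns the UIDs of the current chunk in the response, so the client can take the largest one as the after: cursor for the next request; the mutation stays uid(uids) * * . as in the original post.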

Alternatively, for any batch operation you can run single-threaded and write the selection query so that it only matches unprocessed items. For deletes that is easier, since a “processed” item is gone, so there is no need to worry about processing it twice.
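For the delete in the original post, a single-threaded loop along those lines might look roughly like this (a sketch using the dgo client from the original snippet; dg is assumed to be an already-connected client, and the 10,000 chunk size is arbitrary):

// Each pass matches at most one chunk of nodes, deletes them, and reports how
// many it matched; once the count reaches zero, everything is gone.
const chunkQuery = `
  query {
    var(func: has(resourceId), first: 10000) @filter(not eq(resourceId, "11111111")) {
      uids as uid
    }
    matched(func: uid(uids)) {
      count(uid)
    }
  }
`

for {
	req := &api.Request{
		Query: chunkQuery,
		Mutations: []*api.Mutation{{
			DelNquads: []byte(`uid(uids) * * .`),
		}},
		CommitNow: true, // each chunk commits in its own transaction
	}

	resp, err := dg.NewTxn().Do(context.Background(), req)
	if err != nil {
		log.Fatal(err)
	}

	// "matched" counts the UIDs this pass found (and deleted).
	var result struct {
		Matched []struct {
			Count int `json:"count"`
		} `json:"matched"`
	}
	if err := json.Unmarshal(resp.Json, &result); err != nil {
		log.Fatal(err)
	}
	if len(result.Matched) == 0 || result.Matched[0].Count == 0 {
		break // nothing left to delete
	}
}

If your Dgraph version applies the first: limit before the @filter on a has() root function, an excluded node can keep occupying a slot in every chunk, so verify the chunking behaviour on your version or fall back to the after: cursor variant described above.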

General approach for bulk edits/deletes, in your language of choice:

for i = 1 to (totalNum div CHUNK_SIZE) + 5      // add 5 extra chunks for safety
    queryUnprocessed( your query for UIDs here, first CHUNK_SIZE UIDs )
    processThem( your mutation using the UIDs above )

E.g. to rename a field from firstName to givenName, your query would look for the first 1,000 items that do not yet have givenName, so you are never reprocessing anything and don’t have to page.
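A sketch of that firstName-to-givenName rename as an upsert (predicate names as in the example above; the surrounding loop and client setup would be the same as in the delete sketch earlier):

// Select up to 1,000 nodes that still have firstName and no givenName yet,
// copy each value across, and delete the old predicate. Rerun the same
// request until it matches nothing.
req := &api.Request{
	Query: `
	  query {
	    var(func: has(firstName), first: 1000) @filter(not has(givenName)) {
	      nodes as uid
	      names as firstName
	    }
	  }
	`,
	Mutations: []*api.Mutation{{
		SetNquads: []byte(`uid(nodes) <givenName> val(names) .`),
		DelNquads: []byte(`uid(nodes) <firstName> * .`),
	}},
	CommitNow: true,
}
_, err := dg.NewTxn().Do(context.Background(), req)

Because the mutation removes firstName from every node it touches, the next run of the same query automatically skips the already-migrated nodes, which is the “never reprocessing” property described above.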