Just for reference, here's a short summary of how the backup feature works, so you have a better idea of what could go wrong and what information would help diagnose this.
- When you send a backup request, you are sending it to a particular alpha. That alpha does not have access to all of the data, so it sends requests to all the groups to execute a backup of their data.
- The leader of each group receives the request, processes the arguments, connects to S3/minio or to a network drive, and calls the backup API in Badger (a rough sketch of this step follows this list). Because most of the actual backup work happens inside Badger and is common to S3 and filesystem backups (which are working as expected), it's very unlikely that the issue is happening here.
- Each group writes its data to the destination independently, in a folder of its own. The folder contains a `manifest.json` file. This file holds the information needed to create an incremental backup the next time a backup is taken. It doesn't contain any personal or private information, so it can be shared with us without any privacy concerns.
- Each group reports back to the alpha that received the original backup request. I think there's an issue with this part, since your backup is clearly wrong but Dgraph reported that it succeeded. Hopefully, this is the only issue.
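To make that flow a bit more concrete, here is a minimal sketch of what each group leader conceptually does with Badger's `DB.Backup` API. This is not Dgraph's actual backup code (the real implementation streams straight to S3/minio and records more metadata in `manifest.json`); the paths and file names below are placeholders:

```go
// backup_sketch.go — conceptual only, not Dgraph's real backup code.
// Each group leader streams its Badger data to a destination writer and
// remembers the returned version so the next backup can be incremental.
package main

import (
	"log"
	"os"

	badger "github.com/dgraph-io/badger/v2"
)

func main() {
	// Placeholder path to a group's data directory.
	db, err := badger.Open(badger.DefaultOptions("/path/to/group/p"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Full backup: since = 0 means "back up everything".
	full, err := os.Create("full.backup")
	if err != nil {
		log.Fatal(err)
	}
	defer full.Close()

	since, err := db.Backup(full, 0)
	if err != nil {
		log.Fatal(err)
	}
	// Dgraph records this version so the next request only needs to back up
	// entries written after it.
	log.Printf("full backup done, next incremental starts at version %d", since)

	// Later: an incremental backup containing only entries newer than `since`.
	incr, err := os.Create("incremental.backup")
	if err != nil {
		log.Fatal(err)
	}
	defer incr.Close()

	if _, err := db.Backup(incr, since); err != nil {
		log.Fatal(err)
	}
}
```

The key point is that the version recorded after the first backup is what makes the second one incremental, which is essentially the information `manifest.json` exists to preserve.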
If you could do the following, it would help me further debug the issue. Sharing text files is better than sharing images since the former is searchable:
- As stated above, each group leader connects to S3 independently. In practice, this means each Dgraph alpha needs to have the proper credentials. The backup is writing to S3, so at least one of the servers has the proper credentials; however, if not all the alphas have them, that could explain the issues you are seeing.
- Go into each alpha's environment (physical machine or container/jail, whatever you are using to run Dgraph) and try to ping the S3 endpoint to rule out any connectivity issues. If possible, you could also try to write a dummy file to the S3 endpoint using Amazon's tools to rule out any issues with AWS itself (one way to script this check is sketched right after this list).
- Create a backup in an empty bucket. Share the logs for each alpha around the time the backup is created (no need to share the entire log for now). EDIT: Also important to share are the structure of the S3 bucket after the backup completes (the names of the folders and the files inside them) and the contents of the `manifest.json` file. The contents of the backups themselves are not needed at this moment. The same applies to the item below.
- Make a few dummy changes to your database that you can easily revert. Using the same bucket, create another backup. This should create an incremental backup that contains only the changes you made. Share the logs for all the alphas around the time of this backup as well.
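For the credential and connectivity checks above, something like the small program below could be run from inside each alpha's environment. It uses the minio Go client (the same library family Dgraph's S3 backup path uses) with credentials taken from the standard AWS environment variables; the endpoint and bucket name are placeholders, and this is only a sketch of one way to do the check, not an official tool:

```go
// connectivity_check.go — a rough, unofficial sketch of a per-alpha check.
// It verifies that this machine's environment has working S3 credentials
// and can reach the bucket, by writing a small dummy object.
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder values — replace with your real endpoint and bucket.
	const endpoint = "s3.amazonaws.com"
	const bucket = "my-dgraph-backups"

	// NewEnvAWS reads AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and
	// AWS_SESSION_TOKEN, so this checks that credentials are actually
	// present in *this* alpha's environment.
	client, err := minio.New(endpoint, &minio.Options{
		Creds:  credentials.NewEnvAWS(),
		Secure: true,
	})
	if err != nil {
		log.Fatalf("cannot create S3 client: %v", err)
	}

	ctx := context.Background()

	// Rule out bucket visibility/permission problems.
	ok, err := client.BucketExists(ctx, bucket)
	if err != nil || !ok {
		log.Fatalf("cannot see bucket %q: exists=%v err=%v", bucket, ok, err)
	}

	// Try writing a dummy object, similar in spirit to what a backup does.
	payload := []byte("dgraph backup connectivity test")
	if _, err := client.PutObject(ctx, bucket, "connectivity-test.txt",
		bytes.NewReader(payload), int64(len(payload)),
		minio.PutObjectOptions{}); err != nil {
		log.Fatalf("cannot write dummy object: %v", err)
	}
	log.Println("credentials and connectivity look fine from this alpha")
}
```

If this succeeds on every alpha but the backup is still incomplete, that would point the investigation back at the reporting step described above rather than at credentials or connectivity.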
In the meantime, I will try to get an S3 bucket of my own and test this feature on my end. I believe this was done by another engineer in the company before the 1.1 release, but it doesn't hurt to check again. Also, writing a test that tries to back up a large dataset should help spot issues earlier, so I'll try to get it prioritized.