I have a database I initialized from JSON. The database on the filesystem grew to 11G. I then exported it to RDF and imported it back into another database. The new database size is only 1.53G! This is very welcome, but strange.
These are the stats from the live loader:
Number of TXs run : 76087
Number of N-Quads processed : 76086024
Time spent : 33m27.698715369s
N-Quads processed per second : 37910
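For reference, the reimport was done with the live loader, roughly like this (the file names and endpoints are placeholders for my actual setup):

# Stream the gzipped RDF export and schema into a running alpha.
# -f: RDF file(s), -s: schema file, -a: alpha gRPC endpoint, -z: zero gRPC endpoint.
dgraph live -f export.rdf.gz -s export.schema.gz -a localhost:9080 -z localhost:5080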
I have 2 questions:
1. What is the reason for the 11G database to become just 1.53G when reimported? (The data content seems to be the same; at least the same number of uids are found.)
2. Is there some other mechanism to compact a database without exporting and reimporting?
I don’t think it really matters, but I ran the initial loading of the database, the export, and the reimport natively on macOS with a single alpha and a single zero, so no clustering or anything like that.
When I run the following query on the 11G database in Ratel, it times out because of the 20s timeout limit; on the 1.5G database it does not.
{
all(func: has(content)) {
count(uid)
}
}
The actual times when I run the same query with curl are 30s vs 18s.
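For completeness, the curl invocation looked roughly like this (assuming the alpha’s HTTP endpoint on the default port 8080; older Dgraph versions take Content-Type: application/graphql+-, newer ones also accept application/dql):

curl -s -H "Content-Type: application/graphql+-" localhost:8080/query -d '
{
  all(func: has(content)) {
    count(uid)
  }
}'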
The two databases run on different machines: the bigger one runs on a Mac Pro with 64G of memory and 16 cores, while the smaller one runs on my MacBook Pro with 32G of memory and 12 cores.
@beepsoft A DB consists of valid data (which you would query for) and deleted/expired data (which your queries won’t be able to see). The deleted/expired data is removed eventually by badger compactions.
I believe your first DB instance loaded data via the live loader or mutations. In this case we write each entry to the vlog (this is the write-ahead log) and to sst files (which store the keys and the values). Eventually, your vlog and ssts will contain data that could be removed, but compactions or value log GC haven’t cleared it yet (these are background processes which are supposed to clean things up).
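If you want to see where the space goes, you can compare the on-disk footprint of the two kinds of files yourself. A rough sketch, assuming the alpha’s p directory is in your current working directory:

# Total size of the sst files (keys and values) vs. the value log.
du -ch p/*.sst | tail -1
du -ch p/*.vlog | tail -1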
When you do an export, we export only valid data (not the deleted entries). When this export is imported via the bulk loader, the bulk loader won’t create a vlog (wal file) unless it needs to. This is why you see less data on disk now.
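As a rough sketch (not exact commands for every version), the export-and-reimport cycle looks something like this; paths, file names, and endpoints below are placeholders:

# Trigger an export from the running alpha. Depending on the Dgraph version this is
# a plain GET on /admin/export or a GraphQL mutation on the /admin endpoint.
curl localhost:8080/admin/export
# Rebuild from the export with the bulk loader against a fresh zero; the generated
# out/0/p directory then becomes the p directory of a new alpha.
dgraph bulk -f export/g01.rdf.gz -s export/g01.schema.gz --zero localhost:5080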
To summarize, an export gives you only valid data, and bulk loading that data gives you the minimal set of sst and vlog files needed. Both DBs have the same amount of valid data, but the old one still carries deleted/expired data while the new one doesn’t.
This is a side effect of having stale data in the LSM tree (sst files). The new DB has only valid data, so it has to do less work while reading. Less stale data == faster reads.
Right now, we don’t have a way to do this, but we’re working on it: we’re separating the vlog file so that the clean-up process becomes simpler (Support WAL mode by jarifibrahim · Pull Request #1445 · dgraph-io/badger · GitHub). We’re planning to ship this in the Dgraph v20.11 release (in November). We’re also working on adding some tooling to badger so that it can be used to clean up disk space faster.
There are two hacky ways to clean things up. Please don’t try any of these unless you’re sure what you’re doing.
We have a badger flatten command which will compact your ssts (please don’t run this command on the p directory or you’ll lose data); see the sketch at the end of this list.
Snapshot - If you have a 3-alpha cluster, you could delete the p and w directories from one of the alphas and it will get a new snapshot. Snapshots have only valid data and work similarly to the bulk loader.
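For the flatten route, a minimal sketch (the directory path is a placeholder, and the available flags can differ between badger versions, so check badger flatten --help first; per the warning above, don’t point it at the p directory):

# Stop whatever process owns the directory before running this.
badger flatten --dir /path/to/badger/dir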