Hi,
Today, I realized that my development environment was close to running out of disk space.
It’s a test system with around 10 users, so the data volume should be relatively small. However, the disk usage had reached 14.5 GB, while the configured volume size was 15 GB.
Backup & Storage Update:
- I created a backup, which came to only 40 MB zipped (400 MB unzipped).
- I moved the storage from the local machine disk to a slower mounted volume, increasing the available space to 300 GB.
Live Import Performance:
I then restored the data via live import; the loader reported:
- Number of transactions: 6,355
- N-Quads processed: 6,354,243
- Total time: 2m 31.8s
- Processing speed: 42,081 N-Quads/sec
This seems quite slow given the relatively small dataset (a quick sanity check of the arithmetic is sketched below).
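Here is a small Python sketch of that sanity check. It only reuses the figures already quoted above; the small gap to the reported 42,081 is presumably because the loader computes its rate over the loading window rather than the total elapsed time.

```python
# Quick sanity check of the live-loader figures quoted above.
# Only the numbers already given in this post are used here.

n_quads = 6_354_243            # N-Quads processed
txns = 6_355                   # transactions
total_sec = 2 * 60 + 31.8      # total time: 2m 31.8s

rate = n_quads / total_sec
print(f"overall rate: {rate:,.0f} N-Quads/sec")   # ~41,860, close to the reported 42,081
print(f"batch size:   {n_quads / txns:,.0f} N-Quads per transaction")  # ~1,000
```

So the loader was pushing roughly 1,000 N-Quads per transaction, which I believe is the live loader’s default batch size.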
Disk Space Usage After Import:
After the import, I checked the disk usage again, and it was only 400 MB, which matches the unzipped backup size.
Key Question:
Why did the previous setup use 14.5 GB for the same data, while the new setup uses only 400 MB?
Production Concerns:
I’m trying to understand how this will behave in production. The recommended disk size per Alpha is 750 GB, but:
- My provider doesn’t offer 750 GB SSDs.
- I’m unsure how many instances and shards would be needed for production, where we expect 100,000 users instead of 10 (a naive back-of-the-envelope projection is sketched after this list).
- For comparison, at my previous company we used MySQL with over 100 million users; while the dataset was significantly larger, the total disk space used was only ~1 TB. The main scaling challenge was CPU and RAM, not disk usage.
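To make the sizing question concrete, here is a rough Python sketch of what a naive linear extrapolation would predict. The assumptions are exactly the things I am unsure about: that on-disk size grows roughly linearly with user count, and that the 400 MB figure after a fresh import is representative for 10 users.

```python
# Naive linear extrapolation from the test system to production.
# Assumptions (unverified, and exactly what I'm asking about):
#   - on-disk size grows roughly linearly with the number of users
#   - the 400 MB after a fresh import is representative for 10 users
#   - 750 GB is the per-Alpha disk to plan around (ignoring replication)

test_users, test_disk_gb = 10, 0.4      # 400 MB after fresh import
prod_users = 100_000
per_alpha_gb = 750                      # recommended disk per Alpha

prod_disk_gb = test_disk_gb * prod_users / test_users
print(f"projected data size: {prod_disk_gb:,.0f} GB (~{prod_disk_gb / 1024:.1f} TB)")
print(f"Alphas at {per_alpha_gb} GB each: {prod_disk_gb / per_alpha_gb:.1f}")
```

Under the same naive assumption, plugging in the 14.5 GB the old setup actually consumed (instead of the 400 MB after a fresh import) multiplies that projection by roughly 36x, which is exactly why I want to understand where the extra space went.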
Concern About Dgraph’s Disk Usage:
Based on these observations, Dgraph appears to consume far more disk space than the underlying data would suggest, which could significantly increase infrastructure costs.
Would appreciate any insights into:
- Why the old setup used 14.5 GB for the same data.
- Whether Dgraph’s disk usage scales linearly with data size.
- Best practices for optimizing disk usage in production.
Thanks!