In addition to the 1 TB data test, we could add some other datasets to the benchmark, since small projects certainly won't reach that scale anyway.
This project seems to collect blockchain data. GitHub - citp/BlockSci: A high-performance tool for blockchain science and exploration
Its dataset can take around 300 GB on disk.
https://citp.github.io/BlockSci/readme.html
Current tools for blockchain analysis depend on general-purpose databases that provide “ACID” guarantees. But that’s unnecessary for blockchain analysis where the data structures are append-only. We take advantage of this observation in the design of our custom in-memory blockchain database as well as an analysis library. BlockSci’s core infrastructure is written in C++ and optimized for speed. (For example, traversing every transaction input and output on the Bitcoin blockchain takes only 1 second on our r4.2xlarge EC2 machine.) To make analysis more convenient, we provide Python bindings and a Jupyter notebook interface.
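To make the append-only argument concrete, here is a minimal sketch of the underlying idea, not BlockSci's actual on-disk format or API: because records are only ever appended and never updated, they can live in a flat fixed-width file that is scanned sequentially, with no ACID transaction machinery in the way.

```python
import os
import struct
import tempfile

# Hypothetical fixed-width record: (input_count, output_count) per transaction.
# This is an illustration of the append-only flat-file idea, not BlockSci's format.
REC = struct.Struct("<II")

def append_tx(f, inputs, outputs):
    # Append-only: records are written once and never updated in place.
    f.write(REC.pack(inputs, outputs))

def total_inputs_outputs(path):
    # Sequential scan over the whole file, analogous to traversing
    # every transaction input and output.
    total_in = total_out = 0
    with open(path, "rb") as f:
        data = f.read()
    for inputs, outputs in REC.iter_unpack(data):
        total_in += inputs
        total_out += outputs
    return total_in, total_out

path = os.path.join(tempfile.mkdtemp(), "txes.bin")
with open(path, "ab") as f:
    append_tx(f, 2, 3)
    append_tx(f, 1, 1)

print(total_inputs_outputs(path))  # -> (3, 4)
```

A scan like this is pure sequential I/O, which is why the real tool can afford to traverse the entire chain in seconds once the data is memory-resident.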
We recommend using an instance with 60 GB of memory or more for optimal performance (r5.2xlarge). As of August 2019, the default disk size of 500 GB may no longer suffice, so we recommend choosing a larger disk size (e.g., 600 GB) when you create the instance. On boot, a Jupyter Notebook running BlockSci will launch immediately. To access the notebook, you must set up port forwarding to your computer. Inserting the name of your private key file and the domain of your EC2 instance into the following command will make the Notebook available on your machine.
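The port-forwarding step looks roughly like the following; `your_key.pem` and `your-ec2-domain` are placeholders you substitute with your own values.

```shell
# Forward the instance's Jupyter port (8888) to localhost:8888.
# -N: no remote command, just forwarding; -i: your private key file.
ssh -N -L 8888:localhost:8888 -i your_key.pem ubuntu@your-ec2-domain
```

After this, the notebook is reachable at http://localhost:8888 on your machine.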
I'm going to add more examples we could append.