This is an entry in our long-running series, “Accessing Data 3.0”, where we talk about the “whats” and the “hows” of working with data in web3. Enjoy!
There’s an often forgotten question in Data 3.0 - where do we actually put all our large data? That image of your favorite cat, videos from the last family trip, the unpublished book you’re working on - what is “home” for all that data? It’s easy to think “well if it’s web3, then it must be on-chain”, but that’s not always true, nor does it need to be! There’s a whole, growing world of decentralized data that has no tie back to a blockchain.
The simplest explanation is that putting data on-chain is expensive. Blockchains are, well, chains of “blocks”. Each of those blocks has a set of transactions, which in turn can include some amount of data. Each server participating in the network must then store all the data for the blocks it helps decentralize. For example, in Ethereum the default for many servers is to store the last year of blocks. Furthermore, all data added to a block gets hashed to secure the given blockchain. The combination of these two requirements leads to limits on the amount of raw data within a block. This in turn creates competition for “block space”, and web3 users end up paying fees (often referred to as “gas”) that rise with the total number of users.
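To make that competition for block space a bit more concrete, here is a toy sketch. It is not how any real blockchain prices transactions; the `clearing_fee` function, the one-transaction-per-bid simplification, and the "highest bids win" rule are all assumptions made for illustration.

```python
from typing import List


def clearing_fee(bids: List[int], block_capacity: int) -> int:
    """Return the lowest fee that still makes it into a block.

    Toy model (an assumption, not a real fee market): every bid is the fee
    one user offers for one fixed-size transaction, the block holds at most
    `block_capacity` transactions, and the block producer takes the
    highest bids first.
    """
    if not bids:
        return 0
    if len(bids) <= block_capacity:
        # Everyone fits, so even the lowest bidder gets in.
        return min(bids)
    included = sorted(bids, reverse=True)[:block_capacity]
    return included[-1]


# Same block size, different numbers of users bidding for space.
quiet_network = clearing_fee([1, 2, 3], block_capacity=5)
busy_network = clearing_fee(list(range(1, 11)), block_capacity=5)
print("fee needed on a quiet network:", quiet_network)
print("fee needed on a busy network:", busy_network)
```

With the block size held fixed, the only thing that changes between the two calls is the number of bidders, and the fee needed to get in goes up with it.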
Now, it’s important to circle back to that “large” term we used at the beginning. What does it mean to have a large piece of data? Take for instance the data required to record moving funds from one account to another. This is often measured in a unit called “bytes” and is plenty small enough to keep on-chain. After all, this is the original use case for blockchains! As you move into, say, a school paper though, you begin measuring data in “kilobytes”, or thousands of bytes. An image often runs into “megabytes” (millions) and videos get well into the “gigabyte” (billions) range. As a web3 user, it’s safe to assume that anything measured in more than bytes is too large. It's either impossible to store on-chain (due to block limits) or it's too expensive to do so.
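As a quick illustration of those units and the rule of thumb, here is a small sketch. The function names and the sample sizes are made up for the example, and the cutoff is simply the "more than bytes" rule from above, not a limit taken from any specific chain.

```python
UNITS = ["bytes", "kilobytes", "megabytes", "gigabytes"]


def describe_size(num_bytes: int) -> str:
    """Express a byte count in the largest unit that keeps it at 1 or above."""
    size = float(num_bytes)
    unit_index = 0
    # Step up one unit for every factor of a thousand.
    while size >= 1000 and unit_index < len(UNITS) - 1:
        size /= 1000
        unit_index += 1
    return f"{size:g} {UNITS[unit_index]}"


def fits_rule_of_thumb(num_bytes: int) -> bool:
    """Rule of thumb from the text: only data still measured in bytes goes on-chain."""
    return num_bytes < 1000


# Hypothetical sizes, chosen only to land in each unit.
examples = {
    "funds transfer": 200,
    "school paper": 45_000,
    "cat image": 3_500_000,
    "family video": 2_000_000_000,
}
for name, size in examples.items():
    verdict = "on-chain is plausible" if fits_rule_of_thumb(size) else "store it elsewhere"
    print(f"{name}: {describe_size(size)} -> {verdict}")
```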
Thankfully, the innovation in Data 3.0 hasn’t left us high and dry. Let’s take a look at a few of the popular solutions for storing “large” data today:
The Interplanetary File System (IPFS) is a free, “peer-to-peer” protocol for decentralizing data. IPFS was one of the earliest adopted solutions for storing data in web3 and continues to be a favorite for many. Getting started is easy (they even have a browser extension!) and the broader network improves the more users it has.
That reliance on adoption has been both a defining factor and a sort of Achilles' heel for the project. Users are only required to store data that they actually want to use themselves. For instance, there’s likely no [good] reason for me to want to store your family videos, and so I won’t. But if a meme is going viral and shared via IPFS, then every single viewer of that meme would also be sharing that data. From a practical perspective, this makes IPFS decentralized, but only temporarily so. In short, IPFS provides an easy, decentralized way to share data with others who want it.
Explore the IPFS desktop application and other ways to get started, or try out a hosted “pinning” provider like Pinata.
Built by Protocol Labs, the team behind IPFS, Filecoin aims to solve the "temporary" nature of IPFS by providing "contract-based" storage. Servers offer their storage capacity to the network and users pay to host their data for a fixed period of time. Fees get determined by the size of the data stored and the length of the “contract”. This storage market is then powered by a dedicated blockchain and currency, $FIL.
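Here is a rough sketch of how a fee that depends on size and contract length could be worked out. The linear formula, the `storage_deal_cost` name, and the price used below are all assumptions for illustration; real Filecoin pricing is set by the storage market and is not taken from this example.

```python
def storage_deal_cost(size_gib: float, duration_days: int, price_per_gib_day: float) -> float:
    """Estimate a fixed-term storage fee as size x duration x unit price.

    Assumes the fee scales linearly in both size and contract length,
    which is a simplification made for this sketch.
    """
    if size_gib < 0 or duration_days < 0 or price_per_gib_day < 0:
        raise ValueError("size, duration, and price must be non-negative")
    return size_gib * duration_days * price_per_gib_day


# Hypothetical price, not a real market quote.
PRICE_PER_GIB_DAY = 0.001

half_year = storage_deal_cost(size_gib=10, duration_days=180, price_per_gib_day=PRICE_PER_GIB_DAY)
full_year = storage_deal_cost(size_gib=10, duration_days=360, price_per_gib_day=PRICE_PER_GIB_DAY)
print(f"10 GiB for 180 days: {half_year:.2f} tokens")
print(f"10 GiB for 360 days: {full_year:.2f} tokens")
```

Under this assumption, doubling either the amount of data or the length of the contract doubles the fee.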
And behind this marketplace, the servers paid to store your data are all doing so via IPFS. That means that adoption of Filecoin is also adoption of IPFS.
Check out web3.storage or Fleek to explore early consumer applications for Filecoin.
Much like Filecoin, Arweave has a storage market with a dedicated blockchain and token ($AR). Rather than doing fixed-term contracts though, Arweave promises permanent data storage. One upfront fee, storage forever.
Arweave accomplishes this permanence by gamifying data storage (details in the yellow paper). Each server in the network can choose to store whatever data it wants. For instance, it could avoid storing illegal content by censoring what's stored. But those servers are also incentivized to store data that isn’t sufficiently decentralized. In other words, it's worth more to store data that lives on only a few servers than data that lives on thousands. Across the network as a whole, this results in all data always being stored.
Check out ArDrive to give it a spin.
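The incentive described above can be sketched as a toy simulation. This is not Arweave's actual mechanism (the yellow paper has that); the reward formula and the greedy "pick the best-paying item" behavior are assumptions invented for illustration.

```python
from typing import Dict, List


def reward_for_storing(replica_count: int) -> float:
    """Toy reward: the fewer copies already exist, the more a new copy is worth."""
    return 1.0 / (replica_count + 1)


def simulate(items: List[str], servers: int, slots_per_server: int) -> Dict[str, int]:
    """Let each server fill its slots with whichever items pay the most right now."""
    replicas = {item: 0 for item in items}
    for _ in range(servers):
        chosen = set()
        for _ in range(min(slots_per_server, len(items))):
            candidates = [item for item in items if item not in chosen]
            best = max(candidates, key=lambda item: reward_for_storing(replicas[item]))
            chosen.add(best)
            replicas[best] += 1
    return replicas


# Four pieces of data, four servers that can each hold only two of them.
result = simulate(["a", "b", "c", "d"], servers=4, slots_per_server=2)
print(result)
```

No server in this sketch holds everything, yet every item ends up with copies, because under-replicated data is always the most attractive thing to pick up next.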
(fun fact - this blog is hosted on Arweave via Mirror.xyz)
Storj is another competitor in the web3 storage space, but focuses on developers. It boasts full compatibility with AWS S3 so most developers can leverage decentralized storage out of the box. The end result is fast, reliable cloud storage that's also decentralized.
In general, servers in solutions like Filecoin and Arweave are only rewarded if they store the full chunk of data (e.g. an image). Storj is different. It takes a given chunk of data, encrypts it, and then shares smaller pieces with its network of servers. When a user wants to retrieve data, only 29 of those pieces are required to reconstruct the full chunk of data.
In this way, servers are incapable of being aware of the data they’re storing. This in turn allows Storj to control data at a network level, optimizing for speed and privacy along the way.
Storj does have a hosted interface for consumers to leverage its network. And of course there's documentation for developers to get started as well.
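The "only some of the pieces are needed" idea can be sketched with a tiny erasure code. This is a toy for illustration and not Storj's implementation: it skips the encryption step entirely, uses simple polynomial interpolation over the prime 257, and runs with 3-of-5 pieces instead of the 29 mentioned above. The `split` and `reconstruct` names are hypothetical.

```python
from typing import Dict, List, Tuple

P = 257  # smallest prime above 255, so every byte value fits in the field


def _interpolate(points: List[Tuple[int, int]], x: int) -> int:
    """Evaluate at x the unique polynomial through `points`, modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def split(data: bytes, k: int, n: int) -> Dict[int, List[int]]:
    """Split data into n pieces such that any k of them can rebuild it."""
    if not 1 <= k <= n < P:
        raise ValueError("need 1 <= k <= n < 257")
    padded = data + bytes(-len(data) % k)  # zero-pad to a multiple of k
    pieces: Dict[int, List[int]] = {x: [] for x in range(1, n + 1)}
    for start in range(0, len(padded), k):
        # The k data bytes are the polynomial's values at x = 1..k.
        points = [(j + 1, padded[start + j]) for j in range(k)]
        for x in pieces:
            pieces[x].append(_interpolate(points, x))
    return pieces


def reconstruct(pieces: Dict[int, List[int]], k: int, length: int) -> bytes:
    """Rebuild the original bytes from any k surviving pieces."""
    if len(pieces) < k:
        raise ValueError("not enough pieces to reconstruct")
    chosen = list(pieces.items())[:k]
    out = bytearray()
    for g in range(len(chosen[0][1])):
        points = [(x, values[g]) for x, values in chosen]
        out.extend(_interpolate(points, j) for j in range(1, k + 1))
    return bytes(out[:length])


original = b"family video"
all_pieces = split(original, k=3, n=5)
# Pretend the servers holding pieces 1 and 3 went offline.
survivors = {x: all_pieces[x] for x in (2, 4, 5)}
print(reconstruct(survivors, k=3, length=len(original)) == original)
```

Each piece here is roughly a third of the original size, and losing any two of the five servers still leaves enough to rebuild the data.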
Here’s the skinny on when to use different Data 3.0 storage solutions today:
IPFS - free, easy to use, and temporary file sharing
Filecoin - fixed-length storage for a fee
Arweave - permanent storage for an upfront cost
Storj - developer-centric alternative to AWS S3