What it takes to run a blockchain at 100 TPS for 2 weeks
Recently, the Lugano-based Jelurida blockchain lab released its loadtest report for the Ardor platform.
In a nutshell, we were able to run an “on blockchain” loadtest for more than 14 days in a row while maintaining an average throughput of more than 100 transactions per second.
In this document I’ll discuss the significance of this achievement, what we learnt in the process, and where we are going from here.
When looking at blockchain products and their scaling reports, you see a strange dichotomy: on one hand the existing production blockchains are stuck at around 15 transactions per second with no significant advancement over the last several years. On the other hand, when you look at project whitepapers you see promises of theoretical scaling to even millions of transactions per second. For some reason, it seems to be difficult to bridge the gap between the existing mediocre results and the fantastic promised results. So why is a blockchain so difficult to scale?
Basic blockchain operations like signing transactions, generating blocks and so on usually do not represent real scaling barriers. Most of these operations can be performed millions of times per second even on a commodity laptop. The limit must be elsewhere. Let’s explore.
Full Consistency vs. Eventual Consistency
When Donald Trump tweets his “words of wisdom” to his millions of followers, he does not care if they receive his message within a second, a minute or an hour, as long as they receive it eventually. This is what we call “eventual consistency”: the guarantee that a given action will take place eventually. Eventual consistency is usually complemented by another guarantee called “read your own writes”, which makes sure that at least old Donald himself can see his tweet immediately before he heads to bed, regardless of whether his followers have already seen it or not.
Pretty much all social networks follow this consistency guarantee, which lets them scale considerably better than traditional databases that follow the “full consistency” guarantee. Intuitively, full consistency means that everyone sees the exact same data at any given time.
It is quite obvious that a blockchain which manages a token of value cannot use eventual consistency. If such a blockchain were ever implemented, users would be able to exploit the time interval it takes a transaction to propagate between nodes to double spend their tokens while other nodes are not yet aware of the existence of an initial transaction.
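The race described above can be sketched in a few lines. This is only a toy model under my own assumptions: the `Node` class, the account names and the balances are hypothetical, and each node validates a payment against nothing but its own local view.

```python
class Node:
    """Toy node that validates payments against its own local balances only."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def accept(self, sender, recipient, amount):
        """Accept a payment if the sender can afford it in this node's view."""
        if self.balances.get(sender, 0) < amount:
            return False
        self.balances[sender] -= amount
        self.balances[recipient] = self.balances.get(recipient, 0) + amount
        return True


genesis = {"alice": 10}  # hypothetical starting state
node_a, node_b = Node(genesis), Node(genesis)

# Alice sends the same 10 tokens to two recipients through two different
# nodes, before either node has heard about the other payment.
paid_bob = node_a.accept("alice", "bob", 10)
paid_carol = node_b.accept("alice", "carol", 10)
double_spent = paid_bob and paid_carol

# When the payments finally propagate, each node has to reject the other one,
# so the two nodes are left with conflicting views of who was paid.
late_at_a = node_a.accept("alice", "carol", 10)
late_at_b = node_b.accept("alice", "bob", 10)
print(double_spent, late_at_a, late_at_b)
```

Both payments are accepted locally even though Alice only ever owned 10 tokens, and no amount of waiting makes the two views converge on their own.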
Conclusion: blockchain products must use full consistency models.
Full Replication vs Partial Replication
Even a fully consistent system can achieve much more than 100 TPS. For example, the JIS mainframe integration product at Software AG, whose development I managed for many years, was able to scale to around 2000 TPS on a strong server while maintaining a fully consistent view of the application state on the mainframe. The same goes for many database applications. With the JIS product, the limit was the amount of memory available, CPU consumption or network congestion, either at the middleware level or on the mainframe itself. We achieved this using partial replication: both load and data were split between several servers, each one holding only part of the data and serving only part of the load.
The difference with a blockchain is that every node must independently agree on the same state as every other node. There is no single “mainframe” system or “database” (or “coordinator”) which represents the correct state. Every node needs to reach this state on its own by processing all transactions and agreeing with every other node on the existing state of the blockchain, i.e. reaching consensus. But this requires full replication of data between nodes, which is much more expensive than the partial replication used by scaled centralized systems. Ultimately, this massive (full) replication requirement limits the scaling potential of any blockchain compared to a centralized database.
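To make the cost of full replication concrete, here is a back-of-the-envelope sketch. It assumes an idealized system where load splits evenly and coordination is free; the function name and the numbers are illustrative only, not measurements from JIS or Ardor.

```python
def system_capacity_tps(node_capacity_tps, nodes, full_replication):
    """Idealized total throughput of a cluster of identical nodes.

    Assumes an even split of load and no coordination overhead.
    """
    if full_replication:
        # Every node processes every transaction, so adding nodes adds nothing.
        return node_capacity_tps
    # Each node serves only its own share of the data and the load.
    return node_capacity_tps * nodes


# Hypothetical numbers: 4 nodes, each able to process 500 TPS.
partial = system_capacity_tps(500, 4, full_replication=False)
full = system_capacity_tps(500, 4, full_replication=True)
print(partial, full)
```

Under partial replication, capacity grows with the number of servers. Under full replication it stays pinned to what a single node can process, and that is before accounting for the cost of reaching consensus.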
Lessons from the Ardor 100 TPS loadtest
In a fully decentralized blockchain, scaling is first and foremost limited by the consensus algorithm itself. Even if your servers did not max out CPU, memory, disk, network or their local software, you are still limited in your transaction processing capacity by the consensus itself. If too many transactions are processed too quickly, nodes can no longer agree on the order of transactions and how to organize them into blocks.
This has several implications:
1. Single node performance is usually not a limiting factor. Therefore, it makes more sense to implement a blockchain node using a high-level language which provides better security and is easier to maintain (e.g. Java) than using a low-level language which provides better performance (e.g. C/C++).
2. The ability of nodes to reach consensus is a limiting factor, therefore the consensus algorithm must remain as simple as possible. It must be based on a simple formula which can be calculated quickly and efficiently. I predict that complex PoS algorithms which require multiple rounds of voting will not scale well. Similarly, a consensus algorithm which needs to reach consensus too often, like a DAG or a blockchain with a very short block time, will tend to diverge quickly under load.
3. Under heavy load, nodes generate blocks independently at the same time causing the blockchain to fork. Resolving these forks is a resource intensive operation since the node switching to a better fork needs to undo all its existing blocks up to a common block, then switch to a better fork provided by another node. Ideally, forking should be limited as much as possible by increasing the block time enough to reduce the chance of two or more simultaneously generated blocks.
4. Block size should be kept under control. A long block time creates a performance bottleneck of its own: blocks become larger, and the number of unconfirmed transactions grows during the long interval between blocks. If not kept under control, these large blocks and many unconfirmed transactions can drain the node’s resources, causing out-of-memory problems and undesired performance peaks.
5. Based on these trade-offs, we found the sweet spot for block time in our existing decentralized network to be between 5 and 10 seconds.
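The trade-off between points 3 and 4 can be illustrated with a rough model. Everything here is an assumption of mine rather than something taken from the loadtest report: block generation is modeled as a Poisson process (a common approximation that may not match Ardor's forging algorithm), and the 0.5 second propagation delay is a hypothetical value.

```python
import math


def fork_probability(block_time_s, propagation_delay_s):
    """Chance that a competing block appears while the latest one is still
    propagating, assuming Poisson block arrivals with mean block_time_s."""
    return 1.0 - math.exp(-propagation_delay_s / block_time_s)


def transactions_per_block(tps, block_time_s):
    """Average number of transactions that accumulate between two blocks."""
    return tps * block_time_s


# Hypothetical propagation delay of 0.5 seconds at a load of 100 TPS.
for block_time in (1, 5, 10, 60):
    p_fork = fork_probability(block_time, propagation_delay_s=0.5)
    backlog = transactions_per_block(100, block_time)
    print(f"block time {block_time:>2}s: fork chance {p_fork:.3f}, "
          f"transactions per block {backlog}")
```

In this model, a short block time drives the fork chance up, while a long block time makes every block (and the pool of unconfirmed transactions behind it) proportionally larger. The sweet spot sits somewhere in between.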
Consensus algorithms and their scaling potential
PoW algorithms differ from PoS algorithms: in PoS the wait time between blocks is mostly idle, while in PoW it is dedicated to hashing in an attempt to generate the next block. Therefore, PoW chains will always remain CPU intensive. This means node operators will always need to upgrade their hardware to compete for the generation of the next block, so the usage of commodity hardware is impossible.
DPoS can theoretically scale efficiently since forking should happen less frequently when the sequence of block producers (delegates) is known in advance. A fork can still happen in DPoS if a block produced by the current delegate does not reach the next delegate in time for its own block production time slot. The limiting factor seems to be the load on the current delegate and the block propagation time. DPoS also has the additional overhead of the voting transactions themselves. Under extreme load the blockchain might not be able to process votes fast enough to agree on the order of delegates. I’m curious to see loadtest results for some DPoS chains to be able to compare.
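As a minimal sketch of that idea, assuming a plain round-robin schedule and fixed-length slots (not the rules of any specific DPoS chain; the delegate names and timings are hypothetical):

```python
def scheduled_delegate(delegates, slot):
    """Round-robin schedule: the producer of every slot is known in advance."""
    return delegates[slot % len(delegates)]


def next_delegate_forks(arrival_delay_s, slot_length_s):
    """True if a block produced at the start of its slot reaches the next
    delegate only after the next slot has begun, so the next delegate
    builds on a stale tip and creates a fork."""
    return arrival_delay_s > slot_length_s


delegates = ["delegate-1", "delegate-2", "delegate-3"]  # hypothetical names
schedule = [scheduled_delegate(delegates, slot) for slot in range(5)]
print(schedule)

# A block that arrives within the 3 second slot causes no fork; a late one does.
print(next_delegate_forks(arrival_delay_s=0.4, slot_length_s=3.0))
print(next_delegate_forks(arrival_delay_s=3.5, slot_length_s=3.0))
```

With a fixed schedule there is never a race between producers, so the only way to fork in this model is for propagation to take longer than a slot.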
DAG implementations are most susceptible to consensus related problems since they need to reach consensus for every transaction, not for every block. This is probably the reason why we always see some form of centralization in DAG implementations.
Other consensus algorithms typically found in permissioned blockchains, such as variants of BFT, usually assume a trusted agreement on the sequence of node generators thus eliminating forking at the expense of requiring trust in the node operators.
Once other teams execute similar load testing runs to the ones performed at Jelurida’s blockchain laboratory, I predict we’ll see the following trends:
PoW and PoS will scale to roughly the same levels, although with PoW this will require much stronger hardware and waste a large amount of energy.
DPoS has the potential to scale better than PoS and PoW due to less frequent forking, at the expense of some centralization.
DAG is not expected to scale better than a blockchain.
Among the various PoS algorithms, I expect the trend to be: the simpler the algorithm, the better it scales.
Are we stuck with blockchain scaling limited to around 100 TPS?
Not necessarily. There are two promising directions to resolve the consensus bottleneck:
Off chain transactions
Off chain transactions, which involve locking funds on the main chain while performing most of the work off chain, are in my view an awkward solution. However, they can prove useful for limited applications such as the Lightning Network payment channels. I don’t see how they can be used for contract execution or more complex applications, though.
Divide and Conquer
Approaches such as the Ardor sub-nets research project seem like a more feasible scaling solution. They separate the blockchain into loosely dependent chains, thus removing the requirement for each node to process every transaction, which is the “full replication” limiting factor.
In addition to this, we will see plenty of hybrid blockchain solutions providing some level of trade-off between decentralization and performance.
Why is it so difficult to run a reliable loadtest?
Loadtesting requires an extremely stable product. When running at 100 TPS, you cannot bluff. Your database must be fully optimized. The node networking communication code must work flawlessly. You need to debug and resolve all race conditions, memory leaks and synchronization bugs. Being able to run a stable product under load is a testament to the quality of the product.
You cannot run a loadtest on a production mainnet or an online testnet since it is impossible to achieve the necessary lab conditions required for testing in these environments. Instead, you must be able to set up a new blockchain instance with a clean Genesis block. This means that your product should be able to decouple the blockchain code from the specific genesis block, which is a difficult task in itself.
Next, you need to understand how to effectively loadtest your blockchain and master the required tools and techniques. Getting this right is not a trivial task; it requires a deep understanding of how your blockchain works.
Monitoring your blockchain during a loadtest is another required task. You need to monitor not only the server capacity but also the software itself.
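One of the simplest software-level metrics is the confirmed throughput itself. The sketch below assumes you have already collected a list of `(timestamp, transaction_count)` pairs for consecutive blocks, for example from your node's API or logs; the function name and the sample data are hypothetical.

```python
def average_tps(blocks):
    """Average confirmed transactions per second over a run of blocks.

    blocks: list of (timestamp_seconds, transaction_count), ordered by height.
    """
    if len(blocks) < 2:
        raise ValueError("need at least two blocks to measure an interval")
    elapsed = blocks[-1][0] - blocks[0][0]
    if elapsed <= 0:
        raise ValueError("block timestamps must be increasing")
    # The first block only marks the start of the measured window, so its
    # own transactions are not counted.
    confirmed = sum(count for _, count in blocks[1:])
    return confirmed / elapsed


# Hypothetical sample: four blocks, 10 seconds apart.
sample = [(0, 0), (10, 1000), (20, 1100), (30, 950)]
print(round(average_tps(sample), 1))
```

Measuring confirmed transactions per block, rather than submitted transactions, is a deliberate choice: it only counts what the network actually agreed on.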
Combining all this into a stable loadtest that runs consistently for more than 14 days is a formidable task. No wonder Ardor is the only blockchain undergoing these types of tests.
To give you an example, when we started load testing, we were able to achieve just about 20 TPS before the servers started to max out their CPU. Running some diagnostics revealed that the default mode of bundling transactions as soon as they are submitted was causing excessive database access. To resolve this, we configured the servers to perform bundling only before generating a block. This resolved the bottleneck but moved us to the next problem: under similar load, nodes started blacklisting each other, causing each node to continue on its own fork. Resolving this required a lengthy debug session of the node communication code, which resulted in fixing several bugs and race conditions and updating some configurations. Only then were we able to run the tests at full capacity.
Load testing is a critical requirement for mainstream adoption!
Deploying a blockchain on mainnet requires a full understanding of its scaling capabilities. Just as organizations should loadtest a mission-critical web application before deploying it to production, blockchain developers should benchmark their blockchain performance regularly and test their blockchain for stability under load over long periods of time.
This post was first published on https://medium.com/@lyaffe/our-blockchain-scaling-experience-bd2be41f06c8