Peer Review: CBC Casper

Get writings by email.

Get writings and updates by email.

December 6, 2018

TLDR: Liveness and safety are two inseparable properties of consensus protocols. Liveness is a guarantee that a protocol will do something useful even in the face of failures. CBC Casper, a family of consensus protocols, leaves out liveness from their analysis. Treating 1 out of 2 doesn't get you 50% of the way to a full protocol; it leaves you with almost nothing.

The crypto ecosystem is rapidly growing and working on innovative technical solutions for fundamental computer science problems. These problems touch distributed systems, applied cryptography, and game theory, amongst others. Each of these scientific fields has rich literature, peer-review processes, and established concepts and terminology that can be used in crypto protocols. But so far we haven't seen a healthy peer-review ecosystem emerge in the crypto industry for various reasons.

The norm in the industry is for engineers to publish independent (non-peer-reviewed) whitepapers for their projects, and there are no established channels for getting feedback on the technical work. We firmly believe that peer review processes are net positive for the ecosystem and improve quality and accuracy of technical work; we had our work on blockchains peer-reviewed by the distributed systems community which resulted in a publication at USENIX'16 among others. While Satoshi may not have had the Bitcoin whitepaper peer-reviewed, it would be a mistake to say that this is good practice for the myriad of white papers produced over the past few years. An honest, constructive dialog on the theoretical and practical soundness of research proposals can improve the quality and accuracy of work, and help this industry reach maturation.
It is sensible that we should all work together towards finding solutions to underlying fundamental problems that are holding us back from widespread practical adoption of decentralized systems.

Ethereum has done a great job building a community of developers. Below is our effort to provide constructive feedback in the form of a public review of CBC Casper. We commend both Vitalik and Vlad for welcoming a public review, discussing protocol details with us, and providing feedback on earlier drafts of this review. We hope they find our input useful.

Overview of Review:

Ethereum Research recently released a paper called "Introducing the "Minimal CBC Casper" Family of Consensus Protocols". Ethereum scalability research is an important topic as the network experienced severe scalability problems on the production network in 2017. These were due to fundamental design issues in Ethereum itself (as predicted in 2016 for example).

A close examination of the CBC Casper paper reveals a major shortcoming that calls in to question whether the paper contributes to the material advancement of working consensus protocols and scalability efforts.

Liveness and safety are inseparable:

Specifically, the paper attempts to treatand provide Byzantine safety without respect to liveness. However, liveness and safety are inseparable properties of Byzantine fault-tolerant protocols for achieving consensus. Because of this, we believe that the approach presented in this paper is fundamentally the wrong approach to start the design of consensus protocols.

Correctness is a guarantee that once a decision has been reached, it will remain decided. Liveness is a guarantee that a protocol will do something useful — i.e., process transactions — even in the face of failures. Without the liveness guarantee, users cannot actually do useful things with the protocol, because they do not know when they can consider that transactions have been processed successfully. For example, in Bitcoin, users need to know when they can consider a transaction confirmed.

This paper attempts to prove correctness without considering liveness. This is a problem because proving correctness without liveness is not practically useful, and most of the hard problems show up precisely when you consider liveness — treating 1 out of 2 doesn't get you 50% of the way to a full protocol; it leaves you with almost nothing.

Further, the failure to address liveness leads to other problems for achieving Byzantine fault tolerance:

a) CBC Casper deals with only a subset of Byzantine faults. Any blockchain network needs to be Byzantine fault tolerant since it is comprised of mutually-distrustful nodes. Nodes can fail for arbitrary reasons, including malice on the nodes' parts. This paper only discusses one type of failure — the failure when nodes "equivocate" i.e., send self-conflicting messages. However, such failures are only a subset of faulty behavior: there are other failures that are harder to deal with that this paper does not address e.g., a malicious node that intentionally tries to prevent consensus (but does not equivocate).

b) CBC Casper does not specify concrete decision-making mechanisms. Example protocols in the paper depend on an "estimator function" whose consequences in the protocol are not clear, and not considered in the proofs. While we suspect from their examples that the estimator functions are used to inform a decision-making mechanism, the paper does not discuss the consequences of this nor does it consider them in its correctness proofs. As a consequence, the safety proof is not informative — it does not make significant progress towards reasoning about useful protocols. Most of the important elements are left unproven.

c) CBC Casper does not consider other relevant, hard problems for achieving consensus in blockchains. Examples include reconciling chain forks, selecting which forks are valid, how reorganizations are detected and handled, recovering from accidental forks, and so on. This omission is problematic because these are precisely the problems to which CBC Casper protocols will be implemented to address. The safety proof does not yield any insights into how to address these issues because these are fundamentally liveness properties, which are left out of the proofs.

The proofs in the paper are useful to the extent an empty set of properties is useful. An empty set of properties might be "safe" but makes no progress towards having a practical protocol.

The paper's definition of Byzantine fault tolerance as "BFT safety but without liveness and only for equivocation faults" is unconventional and, in our view, fundamentally the wrong approach to start the design of consensus protocols.
Below is a more detailed technical review:

Full Technical Review:

This paper sets out to provide proofs of Byzantine fault-tolerant consensus safety for the entire Minimal CBC Casper family of protocols. Notably, this paper attempts to provide a proof of consensus safety without concern for a liveness guarantee. This is an unusual goal and makes it difficult to reason about the correctness and usefulness of protocols in the described family. This is because, as we will discuss, liveness is a fundamental property of BFT protocols. A safe Byzantine fault tolerant protocol with fewer than some threshold t of faulty nodes provides correctness and liveness simultaneously, which means that reasoning about one without the other is the wrong approach. Importantly, the ways in which protocols provide liveness affect the methods for achieving safety. Defining safety without considering liveness makes safety much easier to achieve, but also makes it less applicable in practice; liveness is a strict requirement.

As a first step, we'll briefly describe what liveness is, and why it is important. Roughly speaking, liveness is a guarantee that a network will make progress as long as some fraction of the network is non-faulty. Making progress in protocols is important because in order for the network to accomplish anything, it needs to make progress. For example, in the Bitcoin network, it is important that users' transactions be accepted or rejected. Without a guarantee of progress, users cannot actually do anything useful with the network, because they do not know if their transactions will be processed successfully. Guaranteeing progress in such networks is hard, and the majority of complexity in the design, verification, and proofs of correctness for BFT networks is exactly in guaranteeing that property. Because the Casper paper does not address this property, it does not make significant progress towards practical consensus protocols.

What the CBC Casper Paper Achieves and Why You Cannot Treat Correctness Without Liveness

The CBC Casper paper provides a definition for a family of protocols, called the "Minimal CBC Casper" family. CBC stands for "correct-by-construction". Once the family of protocols is defined, the authors use those definitions to demonstrate that:

Nodes in such a system will share common future protocol states (in the face of a fault-tolerance level t)
Subsequent decisions will be consistent (because they share future protocol states)
This leads to fault tolerance.

The paper concludes by asserting that they have "partially specified a family of consensus protocols and shown that every member of this family has asynchronous, Byzantine fault-tolerant consensus safety" (Section 5).

Despite correctly proving properties 1 and 2 of their described protocol family, the authors do NOT prove that the CBC Casper family of protocols is Byzantine fault tolerant in either practice or theory. Consensus safety is inseparable from liveness. To see why this is the case, let's look at a simple example of a consensus protocol that implements a global lock (mutex).

Example: Deciding Locks
In this example, participants are just attempting to resolve the holder of a lock and the first client to claim the lock is the owner. Suppose two nodes participating in the protocol receive two different messages at exactly the same time — one sees Alice claim the lock and the other sees Bob claim the lock. Note that these two nodes will have two different protocol histories even though there is no equivocation whatsoever.

Now, we consider two properties proven in the paper's safety proof (Section 3). First, Theorem 1 shows that any two protocol states without any equivocation (or rather, fewer that t faults in the union of their states, which is true for any states with no equivocation) have a shared future. Now, Theorem 3 shows that for any property which has been decided in one protocol state, its negation cannot be decided in another protocol state without any equivocation. How can these theorems be true for the case of a lock? Neither node has equivocated, and so they must share a future, according to Theorem 1. Each node's protocol state has the property that a given lock is held by either Alice or Bob, and these properties cannot both be true. So, for p defined as the property "Alice holds lock", we have state s_1 where p is decided, and state s_2 where not p is decided (i.e., the negation).

How can we reconcile this with the correctness proofs in the paper? Well, according to the definition of "Decided" provided in the paper, neither of these properties is actually decided. But how can we know if a given property is decided or not? That is a question of liveness, not correctness as defined in the paper. But any design of a practical protocol would need to answer the question "When is a property decided?", and cannot rely on recursion or collapsing infinite protocol states, and because of this, it fundamentally affects the decisions about correctness, and subsequently proofs of correctness. In blockchains, how this question is answered informs decisions about conflict resolution, fork reorganization, chain selection, mining power, and more. For this reason, it is an absolute necessity to treat liveness simultaneously with correctness.

Being able to resolve basic scenarios like "which party has the lock" is some of the basic use-cases of consensus protocols. A protocol that cannot do this is not practically useful. The locks example has similarities with token ownership on blockchains ("do these tokens belong to Alice or Bob?").

Liveness Imposes Stricter Fault Thresholds than Safety

While the safety proofs in the CBC Casper paper hold true for all estimator functions and any fault thresholds, any validator behavior which provides liveness would necessarily place tighter practical bounds on fault threshold t than what the safety proof suggests. What we mean by this is that while validators may make consistent decisions on properties of consensus values for all estimators and for any t, they only do so on useful properties for a more-constrained range of t. This is not discussed in the paper but is demonstrated as follows.

When considering liveness, two additional constraints will apply which tell us more about what CBC Casper protocols can actually do:

There exists T_s such that validators cannot guarantee persistence of consistent validator decisions on useful properties of consensus values for 0 <= t <= T_s < sum(W(v)). That is, there is a fault threshold lower than the sum(W(v)) whereby a decision on a useful property can appear to clients to get reversed from what it was observed to be earlier — it can go from consistently-decided to not-consistently-decided.
There exists T_l such that validators cannot make additional decisions on useful properties of consensus values for 0 <= t <= T_l < sum(W(v)). That is, there is a fault threshold lower than the sum(W(v)) whereby no matter what they do, validators cannot add more properties to the set of consistent decisions on properties of consensus values.

(Note that W(v) is the validator weight function, and v in V are validators).

While the paper does not speak to these thresholds, they are nevertheless important when it comes to reasoning about what CBC Casper protocols can do in practice. In particular, T_s and T_l are strictly less than the sum(W(v)) in the context of BFT protocols. To see why, consider a formulation of the classic BFT binary decision problem in CBC Casper:

The consensus values are {0, 1}
W(v) = 1 for all v.
The protocol states encode the sequences of authenticated messages each validator received from each other validator (let us assume for now that messages cannot be forged, replayed, or arbitrarily delayed).
The estimator function looks at a validator's protocol state and decides whether or not a consensus value was agreed upon by a BFT quorum of validators in V (i.e. at least 2f + 1 votes agree on the same property for a consensus value).
The useful consensus value property that requires an argument of liveness is "all correct validators decide c", where c is a consensus value.

From the safety proof, we can expect that if we execute this protocol, then in the absence of sufficiently many faults, either "all correct validators decide 0" or "all correct validators decide 1" will be true, and whichever one that is will be in the set of consistent decisions made by all correct validators on {0, 1}.

However, what the safety proof will not show is that given this formulation, the correct validators will only be able to consistently decide this property if the following additional bounds on t are met:

t <= T_s <= 1/3 * sum(W(v)) < sum(W(v))
t <= T_l <= 1/2 * sum(W(v)) < sum(W(v))

These thresholds are much lower than sum(W(v)), and follow directly from the known bounds on BFT protocols. First, a BFT protocol cannot make progress if over 1/3 of its replicas are faulty. In CBC Casper terms, this means that the useful property won't be consistently decided by validators at faults higher than 1/3 of the votes. Second, a BFT protocol cannot guarantee that the decisions it has made will be persistent if over half of its replicas are faulty. In CBC Casper terms, this means that from the client's perspective, any previously-observed consistent decision on the useful property is not guaranteed to remain so at faults higher than 1/2 of the votes.

What this means is that while the safety proof is correct regardless of the estimator function and the fault threshold, it is not informative. The tight bounds on the protocol's fault thresholds are determined by the protocol's liveness properties, not the safety proof. Without a proof of liveness, the reader learns nothing useful about the actual behavior of CBC Casper protocols from the theorems of this paper. The makes the safety result trivial.

Conclusion

The paper claims to show that the entire Minimal CBC Casper family of protocols exhibits Byzantine fault tolerance safety. However, the definitions and proofs provided in this paper result in neither a theoretically sound nor practically useful treatment of Byzantine fault-tolerance. We believe that considering liveness without correctness is a fundamentally wrong approach. Importantly, it remains unclear if the definition of the Casper protocol family provides any meaningful safety guarantees for blockchains, or if this approach is applicable to scaling blockchains at all.

Given what the paper states, it would have been more accurate for the paper to say that CBC Casper is NOT Byzantine fault-tolerant and that the goal of the paper is to provide theoretical proofs for a subset of Byzantine safety faults instead. It would then be useful to first discuss if such limited theoretical proofs (different from Byzantine fault-tolerance) are useful or not. In our view, this general approach of considering safety without liveness and limiting faults to only equivocation faults is not useful, to begin with.

We'd like to thank Vlad and Vitalik for diligently answering our questions and clarifying certain concepts that helped with this review. Their participation in this review signals openness to critical feedback and we applaud them. It'd be a great outcome if peer-review processes become a norm for our industry; we're certainly open to public reviews of our works and to potentially continue occasional reviews of other protocols.

Together we can all work towards making the vision for decentralized computing a reality.

— Muneeb Ali, Jude Nelson, and Aaron Blankstein

Disclaimer: The views presented in this post do not represent the views of Blockstack PBC and are personal views of the respective authors.

Comments? Tweet them @muneeb

Muneeb Ali
Co-founder Stacks, a Bitcoin layer for smart contracts. CEO Trust Machines, building Bitcoin apps. → Learn more