Distributed Systems For Fun and Profit
My notes
A great book about distributed systems, written from the programmer's (not the mathematician's) point of view. I love the approach to abstractions and modelling, and the great rationale behind the different models and abstraction levels. It explains the difference between the CAP and FLP impossibility results. Excellent conceptualisation and great metaphors. Great discussion of the different types of consistency: strong, weak, client-centric, linearizable, sequential, eventual, etc. I love the philosophical debates and the clear definitions. Big plus for mentioning Nietzsche :) I'm missing a chapter about Byzantine fault-tolerant systems. Nevertheless, I'm looking for more books written in such a concise, neat, and intuitive way.
My highlights:
- [[Good abstraction makes working with a system easier to understand, while capturing the factors that are relevant for a particular purpose]]
- [[Abstractions, fundamentally, are fake, but they make the world manageable]]
- [[Models that hide details are easier to understand, models that expose details are more performant]]
- [[Modelling reality requires removing as many details as possible to make it universally applicable, yet not too much to make it impractical]]
- The "C" in the CAP theorem stands for "strong consistency", but "consistency" itself is not a synonym for "strong consistency".
- The eventual consistency guarantee is as trivial as "people are eventually dead" — it means nothing unless we specify more constraints.
- Order is important because distributed systems that preserve order can be treated as a single computer.
- Time is a source of order: it allows us to define the order of events.
- Clocks allow us to synthesize an approximation of the same counter without communicating it directly.
- Causal clocks track only the relevant events (dependencies) that influenced the current state.
- Causal clocks are a generalisation of timestamps, because a timestamp is a specific instance which assumes that everything was relevant.
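A Lamport clock is the simplest causal timestamp. A minimal sketch (class and method names are my own, not from the book):

```python
# Minimal Lamport clock sketch (illustrative).
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        # Increment before each local event.
        self.time += 1
        return self.time

    def receive(self, remote_time):
        # On message receipt, jump past the sender's timestamp,
        # so the receive is always ordered after the send.
        self.time = max(self.time, remote_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
send_ts = a.tick()            # a sends a message at time 1
recv_ts = b.receive(send_ts)  # b receives it at time 2
```

The `max(...) + 1` rule is what makes the timestamp causal: any event that depends on a received message is guaranteed a larger timestamp than the send.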
- [[Assuming order is one way to reduce the space of possible executions and possible occurrences]]
- People can reason about ordered events; otherwise, there are too many permutations to consider.
- People tend to think in terms of total order. It's easier; however, distributed systems require thinking in terms of partial orders, where things happen concurrently.
- Depending on the system model: 1) the synchronous system model has a global clock; 2) the partially synchronous model has a local clock; 3) the asynchronous system model cannot use clocks at all.
- Using time has two benefits: defining order across a system without communication, and defining timeouts.
- A vector clock is an extension of Lamport clocks. Each node maintains a distinct counter per node, incrementing its own counter on each internal event and merging counters on message receipt.
- The issue with vector clocks is that they can be quite large in large systems.
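A vector-clock sketch along those lines (names are mine); the per-node dict is exactly what grows with system size:

```python
# Vector clock sketch: one counter per node; comparing clocks yields a
# partial order, so some pairs of events are simply concurrent.
class VectorClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.clock = {}  # node_id -> counter; grows with the system

    def tick(self):
        # Local/internal event: bump only our own counter.
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1

    def merge(self, other):
        # Message receipt: element-wise max with the sender's clock, then tick.
        for node, count in other.clock.items():
            self.clock[node] = max(self.clock.get(node, 0), count)
        self.tick()

    def happened_before(self, other):
        # True iff every counter is <= the other's and the clocks differ.
        return (all(v <= other.clock.get(k, 0) for k, v in self.clock.items())
                and self.clock != other.clock)

a, b, c = VectorClock("a"), VectorClock("b"), VectorClock("c")
a.tick()    # event on a
b.merge(a)  # b receives a message from a: a's event happened before b's
c.tick()    # independent event on c: concurrent with a's event
```

Note that `a` and `c` are incomparable in either direction, which is the partial-order behaviour a single Lamport timestamp cannot express.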
- Completeness is about maximizing detectability and is achieved by just not waiting forever.
- Accuracy is about minimizing false positives. Strong accuracy is possible only in synchronous systems and is achieved by setting hard time boundaries. In asynchronous systems, only weak accuracy is possible.
- Weak completeness can be transformed into strong completeness by broadcasting information about the suspected process.
- Even the eventually weak failure detector allows for solving consensus problems.
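A timeout-based heartbeat detector is the usual way to get completeness at the cost of accuracy. A sketch with names of my own:

```python
# Heartbeat failure detector sketch (illustrative). It is complete,
# because we never wait forever, but only weakly accurate, because a
# slow node can be falsely suspected.
class HeartbeatFailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node_id -> time of last heartbeat

    def heartbeat(self, node_id, now):
        self.last_seen[node_id] = now

    def suspected(self, now):
        # Suspect every node whose last heartbeat is older than the timeout.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}

fd = HeartbeatFailureDetector(timeout=5)
fd.heartbeat("a", now=0)
fd.heartbeat("b", now=3)
fd.suspected(now=7)  # suspects "a"; "b" may just be slow, not crashed
```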
- The easiest way to achieve total order is to nominate a single node (leader), which dictates the order of operations to other nodes. The downside is that the system is as slow as the slowest node, and the current leader is a bottleneck.
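The leader-as-sequencer idea fits in a few lines (hypothetical names):

```python
# Leader-based total order sketch: one node assigns sequence numbers,
# and every replica applies operations sorted by that number.
class Leader:
    def __init__(self):
        self.seq = 0

    def order(self, op):
        # Every operation passes through the leader: the bottleneck.
        self.seq += 1
        return (self.seq, op)

leader = Leader()
log = [leader.order(op) for op in ["set x=1", "set y=2", "set x=3"]]
# Any replica that applies `log` sorted by sequence number ends up in
# the same state as every other replica.
```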
- [[Failure-detector is essential because it is not possible to tell whether a remote node has crashed or is experiencing high latency. Failed nodes can be ignored, but partitioned ones can not as they may cause divergence]]
- The distinction is crucial for achieving consistency: crashed nodes can be safely ignored (they won't diverge), but partitioned ones cannot, as they may cause divergence.
- Mutual exclusion, leader election, multicast and atomic broadcast are all instances of the more general problem of consensus
- [[Partition tolerance requires a method for breaking symmetry. The partitions MUST not be symmetric (equal). There must be a way to distinguish which one to follow and ignore the other]]. Otherwise, it is not possible to prevent divergence.
- Majority vote consensus requires an odd number of parties to achieve asymmetry: in any two-way split, exactly one partition contains a majority.
- [[The majority consensus protocols are great because they prevent partitioning even in case of disagreement]]. The protocol may stop, but it will not diverge.
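The quorum arithmetic behind this can be made concrete (function names are mine):

```python
# Quorum sketch: with an odd number of nodes, a two-way partition always
# has exactly one side with a majority; that side may proceed, while the
# minority side stops rather than diverging.
def majority(cluster_size):
    return cluster_size // 2 + 1

def can_proceed(partition_size, cluster_size):
    return partition_size >= majority(cluster_size)

# 5 nodes split 3/2: only the 3-node side holds quorum.
sides = [can_proceed(3, 5), can_proceed(2, 5)]  # [True, False]
```

With an even cluster size a symmetric 2/2 split leaves no side with a majority, so the protocol stops entirely, which is exactly the "may stop, but will not diverge" property.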
- [[CRDTs achieve eventual convergence by limiting a set of operations that are order-independent and insensitive to message duplication/re-delivery]]. Operations are associative, commutative, and idempotent. In mathematics, the structures are called join or meet semilattices.
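A grow-only counter (G-Counter) is the classic example of such a semilattice; a minimal sketch:

```python
# G-Counter CRDT sketch: merge is an element-wise max, which is
# associative, commutative, and idempotent, so replicas converge
# regardless of message order, duplication, or re-delivery.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> increments observed from that node

    def increment(self):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + 1

    def merge(self, other):
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment(); a.increment()  # two increments on replica a
b.increment()                 # one increment on replica b
a.merge(b); b.merge(a)        # exchange state in both directions
b.merge(a)                    # a duplicated delivery changes nothing
```

Keeping one slot per node (rather than a single shared number) is what makes `max` safe: increments from different replicas can never overwrite each other.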
- Inferences made within a monotonic framework, once deductively valid, cannot be invalidated by new information. Inferences made within a non-monotonic framework can.
- [[Defeasible reasoning is non-monotonic logic because assertions made utilizing partial information can be invalidated by new knowledge]]. For example, if we learn that Tweety is a bird, we'll assume that Tweety can fly; but if we later learn that Tweety is a penguin, then we'll have to revise our conclusion.
- [[Monotonic logic is great because we don't have to reconsider the validity of inferences]]. In other words, we don't have to re-compute data once we get new information. Once we know that Tweety is a bird, we can safely conclude that Tweety can fly, and nothing we learn can invalidate that conclusion.
- Adding two numbers is monotonic; aggregation is non-monotonic. The reason is that aggregation does not only calculate a sum but also asserts that it has seen all of the values, and the only way to assert that is via coordination across nodes.
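The contrast can be made concrete with two illustrative functions of my own:

```python
# Monotonic vs non-monotonic sketch. A sum of two known numbers is
# monotonic: no later fact invalidates it. An aggregate over "all
# values" is non-monotonic: a late-arriving value revises the answer,
# which is why asserting completeness requires coordination.
def monotonic_sum(a, b):
    return a + b  # valid forever once computed

def aggregate(values_seen):
    # Implicitly claims "these are ALL the values": only safe after
    # coordination confirms no more values are in flight.
    return sum(values_seen)

provisional = aggregate([1, 2])  # 3, but only provisionally...
revised = aggregate([1, 2, 4])   # ...a late value revises it to 7
```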