most devs are not the audience for the CAP theorem

eric brewer presented CAP at a distributed computing conference in 2000 to designers of distributed systems, most of whom were familiar with relational databases and their consistency guarantees. His intention was to start a conversation about the trade-off between consistency and availability in the design space for early cloud-based storage systems. His main argument was that for cloud storage systems to be highly available, some level of consistency needed to be sacrificed.

most developers like myself are using distributed systems – not designing them. more importantly, i find that i’m often required to think about an app’s data usage patterns and how its storage system supports those patterns at a level far more granular than what CAP offers. as a distributed systems user, CAP can basically be reduced to “there’s a trade-off between availability and consistency”. that’s a concept i find much easier to digest on its own.

more useful questions / concepts than CAP to grapple with are…

  • is replication synchronous or asynchronous? Is this adjustable?
  • In the event of node or network failures, what is the data recovery process like? what writes get rolled back?
  • is there support for transactions? what level of isolation is supported (are dirty reads possible?)
  • can I read my own writes? (see the sketch after this list)
  • how do I scale reads and writes as my system grows? what does the process of adding additional nodes to the system look like?
  • how are concurrency conflicts handled when there is contention over shared data?
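
Some of these questions map directly to driver-level knobs. As one concrete example of “can I read my own writes?”, here’s a rough sketch using MongoDB’s Python driver (pymongo) and a causally consistent session; the connection string and database/collection names are placeholders:

```python
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

# Placeholder connection string and names – adjust for your own replica set.
client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
orders = client.app.get_collection(
    "orders",
    read_preference=ReadPreference.SECONDARY_PREFERRED,
    read_concern=ReadConcern("majority"),
    write_concern=WriteConcern(w="majority"),
)

# A causally consistent session makes reads in the session observe the
# session's own earlier writes, even if they land on a lagging secondary
# (full guarantees require majority read/write concern, as set above).
with client.start_session(causal_consistency=True) as session:
    orders.insert_one({"_id": 1, "status": "created"}, session=session)
    doc = orders.find_one({"_id": 1}, session=session)  # sees the write above
```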

mongoDB read preferences

MongoDB read preferences give you control over read behavior when using replica sets. Writes always go to the primary, but reads can be configured to go to a secondary or the primary based on various criteria. In most versions of mongo, the read preference defaults to primary in the client, but you should check your driver version for the default.
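
As a concrete sketch (using pymongo, with placeholder hostnames and database/collection names), read preference can be set on the connection string or per database / collection:

```python
from pymongo import MongoClient, ReadPreference

# Read preference can be set in the connection string...
client = MongoClient(
    "mongodb://host1,host2,host3/?replicaSet=rs0&readPreference=secondaryPreferred"
)

# ...or per database / collection, overriding the client default.
db = client.get_database("app", read_preference=ReadPreference.PRIMARY)
reporting = db.get_collection("events", read_preference=ReadPreference.SECONDARY)
```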

primary

Read from the primary. if the primary is down, the read will fail. This is useful if you need the latest successful write and have zero tolerance for stale data. The terms “latest” and “successful” are a bit of a sliding scale depending on the read concern. For example, with “local” read concern on a primary, it’s possible to read other clients’ writes before those writes have been acknowledged to them. In practical terms, this means the data you read could later be lost in a rollback, since it may not have been replicated to a majority of nodes yet.

Regardless of a write’s write concern, other clients using "local" or "available" read concern can see the result of a write operation before the write operation is acknowledged to the issuing client.

https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/#read-uncommitted

But I thought mongo maintained a WAL using an on-disk journal? Why would data be rolled back at all?

Yes, the journal is used for data recovery in the event of a node shutdown. But when a primary node goes down, a new node is elected as the primary. When the previous primary rejoins the cluster as a secondary, it may have writes that are ahead of the new primary. Since the new primary is the source of truth, those writes need to be rolled back to be consistent with the new primary!

If you want data with the highest durability level, use majority read concern. With majority read concern, reads only return data that has been acknowledged by a majority of nodes. However, this does sacrifice some level of data freshness, because the latest write from a client may not have been replicated to a majority of nodes yet, so you may be reading an older value even though you’re reading from the primary.
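
Roughly what the two read concerns look like with pymongo (the collection name and connection string are placeholders):

```python
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
orders = client.app.orders

# "local": fastest, may return data that hasn't replicated to a majority
# of nodes yet and could be rolled back after a failover.
local_reads = orders.with_options(
    read_preference=ReadPreference.PRIMARY, read_concern=ReadConcern("local")
)

# "majority": only returns majority-acknowledged data, at the cost of
# possibly not seeing the very latest write.
durable_reads = orders.with_options(
    read_preference=ReadPreference.PRIMARY, read_concern=ReadConcern("majority")
)

doc = durable_reads.find_one({"_id": 1})
```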

Another factor to consider when choosing this setting is your primary node’s capacity and current load – if your primary is close to capacity on CPU or memory, directing all reads to the primary may impact writes.

primaryPreferred

Read from the primary, but if the primary is unavailable, the read goes to a secondary. This setting is appropriate for cases where you want the most recent write but can tolerate occasional stale reads from a secondary. Replication lag can range pretty widely, from single-digit milliseconds to minutes depending on write volume, so keep in mind that during periods of high replication lag you’re more likely to be reading stale data.
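
If you go this route, you can also cap how stale a fallback read is allowed to be. A hedged pymongo sketch using PrimaryPreferred with max_staleness (the 120-second value is arbitrary; MongoDB requires at least 90 seconds):

```python
from pymongo import MongoClient
from pymongo.read_preferences import PrimaryPreferred

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")

# Prefer the primary, but allow falling back to a secondary that is at
# most ~120 seconds behind.
db = client.get_database(
    "app", read_preference=PrimaryPreferred(max_staleness=120)
)
```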

secondary

All reads go to secondaries. Not all secondaries though – just one based on a server selection process. The server selection works like this:

First, there’s a default threshold value of 15ms (configurable as localThresholdMS). This isn’t the whole window used for server selection though – it’s only part of it. The other part is the lowest average network round trip time (RTT) among the eligible nodes. So if the closest node has an RTT of 100ms, the eligible set is all nodes whose RTT falls within 100 + 15 = 115ms. The driver then randomly picks one node from that set to forward the request to.
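
In rough Python pseudocode (the node names and RTT numbers are made up):

```python
import random

LOCAL_THRESHOLD_MS = 15  # driver default, configurable via localThresholdMS

def pick_secondary(rtt_by_node: dict[str, float]) -> str:
    """Pick a random secondary whose average RTT falls within the
    selection window: lowest RTT + local threshold."""
    lowest = min(rtt_by_node.values())
    window = lowest + LOCAL_THRESHOLD_MS
    eligible = [node for node, rtt in rtt_by_node.items() if rtt <= window]
    return random.choice(eligible)

# Example: the closest secondary is 100ms away, so the window is 115ms
# and only node-a and node-b are eligible.
print(pick_secondary({"node-a": 100, "node-b": 110, "node-c": 200}))
```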

This read preference can make sense if your system is very read heavy AND you can tolerate stale reads. Scaling out with many secondary nodes is one way to spread your read load and improve read availability. At the same time, adding secondaries can put extra load on your primary (due to the increased oplog syncing), increase inconsistency and the likelihood of stale reads because there are now more nodes that can fall behind on writes, and potentially increase write latency, particularly for writes that require majority acknowledgement, because a majority means N / 2 + 1 acknowledgements and N just got bigger.

secondaryPreferred

All reads go to secondaries, but if no secondary is available, the read goes to the primary. The same concerns as the “secondary” read preference apply here. The additional factor to consider is whether your primary can afford to take on the extra read load when secondaries are unavailable.

nearest

Doesn’t matter whether it’s the primary or a secondary – the driver just picks a low-latency member based on the server selection algorithm I mentioned in the secondary read preference section.
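
A small sketch of widening that selection window for nearest reads by raising the local threshold (the 30ms value is arbitrary, and the hosts are placeholders):

```python
from pymongo import MongoClient

# Any member (primary or secondary) within 30ms of the fastest one is
# fair game for reads.
client = MongoClient(
    "mongodb://host1,host2,host3/?replicaSet=rs0"
    "&readPreference=nearest&localThresholdMS=30"
)
```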

why the words in “CAP theorem” are confusing

let’s go through each one, ok? my goal is to make you as confused as i am

consistency

the word “consistency” is super overloaded in the realm of distributed systems. In CAP, consistency means (informally) that every read reflects the most recent write. This is also known as single-copy consistency or strict / strong consistency. Data is replicated to multiple nodes in these systems, but reads against the storage system should give the illusion that reads and writes are actually going to the same node, even in a multi-user / concurrent environment. Put another way – there are multiple copies of data in practice, but readers always see the latest (single) copy, by some loose definition of “latest”.

unfortunately, “strong” consistency is a grey area with lots of informal definitions. Some distributed systems researchers such as Martin Kleppmann equate CAP consistency with linearizability, which, depending on your definition of strong consistency, is either a weaker form of it or a more precise and formal definition of it. Does consistency mean that every write is instantaneously reflected on all nodes? Or, in the case of linearizability, that all operations appear to take effect in a single order consistent with real time, so that once a write completes, every later read reflects it, even if the write isn’t literally applied to every node at the same instant?

data stores that support transactions are also characterized by their ACID properties (Atomicity, Consistency, Isolation, Durability). ACID consistency has nothing to do with ensuring that readers get the latest write; it’s about preserving invariants in the data itself (such as keeping debits and credits balanced) within a transaction, and that is often the responsibility of the application rather than the storage system. Relational databases support specific forms of invariant enforcement, such as referential integrity, but many invariants cannot easily be enforced by the storage system. Are the two notions related at all? Well, they’re related insofar as CAP consistency trade-offs affect how hard it is to maintain ACID consistency, but they’re still completely different notions of consistency! For example, if a transaction can read data that another transaction hasn’t committed yet (a dirty read), or data that is stale, it becomes much harder for the application to maintain its invariants.
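
To make the “application’s responsibility” point concrete, here’s a rough sketch of a debit/credit transfer as a MongoDB multi-document transaction in pymongo. The invariant (debits and credits stay balanced, balances never go negative) is checked by application code, not by the database; all names are placeholders:

```python
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
accounts = client.bank.accounts  # placeholder database / collection

def transfer(src: str, dst: str, amount: int) -> None:
    with client.start_session() as session:
        with session.start_transaction(
            read_concern=ReadConcern("snapshot"),
            write_concern=WriteConcern("majority"),
        ):
            src_doc = accounts.find_one({"_id": src}, session=session)
            if src_doc is None or src_doc["balance"] < amount:
                raise ValueError("insufficient funds")  # app-enforced invariant
            accounts.update_one(
                {"_id": src}, {"$inc": {"balance": -amount}}, session=session
            )
            accounts.update_one(
                {"_id": dst}, {"$inc": {"balance": amount}}, session=session
            )
            # leaving the block commits; an exception aborts the transaction
```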

availability

availability in CAP has a very strict, binary definition. CAP availability states that every non-failing node (even if isolated from the rest of the system) can respond successfully to requests. This is confusing because most site operators talk about availability in terms of uptime and percentages. For example, consider a system that currently serves most reads from secondary nodes. let’s assume that most of the secondary nodes fail and can no longer serve requests. The system switches all reads to the primary and is still technically “available” from a user standpoint. This fails the CAP definition of availability, but passes the more practical operational definition of availability. One way to distinguish one from the other when they’re discussed at the same time is to refer to CAP availability as “big A” and practical availability / uptime as “little a”.

in practice, availability (“little a”) isn’t black and white (can EVERY non-failing node respond?). Most real-world systems don’t even offer that strong a guarantee. The CAP definition might make sense for a theoretical system and make it easier to reason about formally / mathematically, but in reality many storage systems let users choose between different levels of availability, and for different operations. For example, some databases let users tune availability separately for reads and writes using quorums. A system that prioritizes data durability in the face of failures may sacrifice some availability for writes: if a write cannot be propagated to a majority of nodes, the system returns an error. The same system may also prioritize availability for reads, so any response from any replica is considered successful, even if that data might be stale. If you’re trying to reason about the availability of a storage system, these details matter and are far more relevant than whether or not the system is capable of achieving big A (CAP availability).
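
MongoDB, for example, lets you express this per operation with write concerns and read preferences. A rough pymongo sketch (placeholder names; the 5-second timeout is arbitrary):

```python
from pymongo import MongoClient, ReadPreference
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
events = client.app.events

# Writes: require majority acknowledgement, and fail (rather than wait
# forever) if a majority can't be reached within 5 seconds.
durable_writes = events.with_options(
    write_concern=WriteConcern(w="majority", wtimeout=5000)
)
durable_writes.insert_one({"type": "signup"})

# Reads: accept an answer from whichever member responds, stale or not.
available_reads = events.with_options(read_preference=ReadPreference.NEAREST)
latest = available_reads.find_one(sort=[("_id", -1)])
```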

partition tolerance

partition tolerance is confusing on multiple levels.

the word “partition” itself is confusing because partitions in a distributed system usually refer to splits or shards of a data store maintained for scalability. In CAP, a partition refers to a network partition, which is basically a network failure that disrupts node-to-node communication. if a system remains operational in the presence of a network partition, it is considered partition tolerant in the CAP model.

second, CAP appears to present partition tolerance as something a distributed system either has or does not have. but because networks are fundamentally unreliable, system designers need to design systems to tolerate network errors. there’s not really a choice or trade-off to be made – it’s more of a requirement. unlike availability and consistency (which are actual trade-offs that need to be made when there is a network partition), partition tolerance is not something you can relax in practice. so designers are really choosing between consistency and availability, assuming that there will be network partitions. this is why some people say CAP is really just CP and AP: you either choose consistency with partition tolerance or availability with partition tolerance.

finally, partitions in the network are relatively rare events. two data centers or nodes within a data center may lose communication, but that’s rare compared to individual node failures (disk full, over-capacity, random crashes) that happen all the time in a large network. In the CAP model, however, a single node failure isn’t a network partition, so it falls outside the stated rules. eric brewer has clarified, though, that in practice, nodes that fail to respond within some time window can be treated as conceptually equivalent to an actual partition event.