ACID transactions in a globally distributed database (fauna.com)
123 points by jchanimal 11 months ago | 40 comments

In a globally distributed database, the main problem is time control because different locations have different clocks. For example, assume that one table of the database is located on Earth while another table is stored on Mars (4-13 minutes delay). Now we want to commit a transaction with these two tables from the moon while the result will be queried from Venus. I could not understand from this post if Fauna is able to correctly work under such conditions and whether it was designed for such use cases.

Much like the Google Spanner database. I was at a presentation earlier this month where they went into more detail. Very interesting product!


I think the moment the internet goes galactic, we will need to implement special relativity algorithms to properly sync machine states, unless quantum computation takes off and we get a quantum entangled network!

"Globally distributed" should imply that it's confined to one planet.

More generally, the point is that if you have two locations where the fastest you can send a unit of data is t, then the shortest possible time interval you can pass a transaction from one location to another is t.

That puts an upper (EDIT: corrected from lower) bound on throughput of many types of operations, e.g. if you need to guarantee strict ordering of transactions (EDIT: in the worst case, where transactions are being issued from multiple nodes; best case you can optimize in a variety of ways).

So while we won't have to deal with interplanetary latencies, it still matters greatly for throughput and consistency guarantees.
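To put rough numbers on that bound, here is a minimal sketch. It assumes the worst case described above: every transaction depends on the previous one and has to reach the other node before the next can be issued. The function name and the latency figures are illustrative assumptions, not measurements.

```python
def max_ordered_throughput(one_way_latency_s: float) -> float:
    """Upper bound on dependent transactions per second when each one must
    reach the remote node before the next one can be issued."""
    if one_way_latency_s <= 0:
        raise ValueError("latency must be positive")
    return 1.0 / one_way_latency_s


# Assumed one-way latencies in seconds (illustrative only).
LATENCIES = {
    "same datacenter": 0.0005,
    "cross-continent": 0.1,
    "Earth to Mars (4 minutes)": 240.0,
}

for name, latency in LATENCIES.items():
    bound = max_ordered_throughput(latency)
    print(f"{name}: at most {bound:.4f} dependent txn/s")
```

Uncontended or batched workloads are not limited this way; the bound only applies to strictly serialized, cross-node dependencies.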

Even assuming you meant "upper bound" instead of "lower bound", this isn't strictly true in theory. To achieve high throughput with large latency between nodes, all you have to do is group transactions into large batches and reach consensus on how they should be ordered (for example, you can run Paxos to decide what a batch consists of, and then use any method you like to order within the batch). Running Paxos at interplanetary latencies would obviously take some time, but since you can have multiple rounds in progress at once (or increase the size of each batch to be much larger than the latency), throughput isn't limited by latency. In practice, most people care a lot about latency, and achieving the desired latency puts a limit on throughput. See the Calvin DB paper for a more thorough explanation.
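A back-of-the-envelope sketch of the batching argument, assuming a fixed consensus latency per round and that rounds can overlap freely. The function and the numbers are hypothetical and are not taken from the Calvin paper.

```python
def batched_throughput(batch_size: int, rounds_in_flight: int,
                       consensus_latency_s: float) -> float:
    """Transactions ordered per second when each consensus round orders a
    whole batch and several rounds are in progress at once."""
    if batch_size < 1 or rounds_in_flight < 1 or consensus_latency_s <= 0:
        raise ValueError("arguments must be positive")
    return batch_size * rounds_in_flight / consensus_latency_s


CONSENSUS_LATENCY_S = 0.2  # assumed time for one consensus round

for batch_size, in_flight in [(1, 1), (1000, 1), (1000, 8)]:
    tps = batched_throughput(batch_size, in_flight, CONSENSUS_LATENCY_S)
    print(f"batch={batch_size} in_flight={in_flight}: {tps:.0f} txn/s")
```

Throughput grows with batch size and pipelining depth, but each individual transaction still waits at least one consensus round before it is ordered, which is the latency trade-off mentioned above.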

Yes, upper bound. Lower bound on time it takes for the transaction to be globally available.

You're right that there are operations that you can increase throughput of, and I was imprecise in that you are right you can order them "after the fact", but that's not very interesting. The scenario I had in mind was where clients talking to different nodes are issuing transactions that depend on the same items of data.

In that case you either need to obtain a lock, in which case you need, in the best case, to wait for the latency interval (plus a margin for the largest possible clock skew) to see if the other side wants a lock on the same object, or you need to optimistically fire off transactions, but the best case then is that you just get your transaction in right before the remote side starts operating on it, and they happen to just fire off another transaction, and so on. In reality there'd be slowdowns.

If you're dealing with uncontended objects, you can do much better on average, but you can't guarantee better than the latency between nodes.

There are certainly tons of special cases where you can optimize.

@sleepydog, that was an example. Even on one planet, RTT on the internet can be quite severe for realtime applications.

Even then, the round-trip time between continents can be in the range of several hundred milliseconds, which is far too much for many real-time transactional systems.

Interesting but ultimately irrelevant since it appears to not be FLOSS. We have already seen what happened to FoundationDB.

They are primarily targeting the cloud database market. There are many examples of non-free software surviving just fine in this market. Both Amazon and Google run their own proprietary cloud databases like this.

And it is very easy to host your own instance of Amazon or Google :-)

> industry experience has taught us that availability—in the strict sense that it is defined in the CAP theorem—is overrated: In practice, the uptime of systems that favor availability have not proven greater than what can be achieved with consistent systems.

The answer to any distributed consistent database is always the same: we gave up on one letter in CAP. In this case, it was 'A', availability.

And hey, cool, that's certainly an option. But I'd prefer if they had simply made that the subtitle.

What they're trying to explain is that so-called AP systems do not have 100% availability either. The CAP theorem applies to very specific scenarios that are not necessarily common - it is a much narrower theorem than "CP vs AP" implies. For example, if your network partitions are usually small (i.e. there exists a majority outside the partition) and you can avoid routing requests to the minority partition (e.g. because you peer with ISPs and can route somewhere else, or internally avoid routing to a rack that is down, etc.) then you won't observe a loss of A.
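A minimal sketch of that majority argument, assuming a simple majority-quorum system. The function names and the five-node cluster are made up for illustration.

```python
def can_serve(partition_size: int, cluster_size: int) -> bool:
    """A majority-quorum system can make progress only on the side of a
    partition that still holds a strict majority of the nodes."""
    return partition_size > cluster_size // 2


def serving_sides(partition_sizes: list[int]) -> list[bool]:
    """For each side of a partition, report whether it can still serve."""
    cluster_size = sum(partition_sizes)
    return [can_serve(size, cluster_size) for size in partition_sizes]


# Five nodes, one rack of two nodes cut off: the three-node side keeps serving.
print(serving_sides([3, 2]))
# An even split leaves no majority anywhere, so nothing can serve.
print(serving_sides([2, 2]))
```

If clients can be routed to the side where `can_serve` holds, they never notice the partition, which is the "no observed loss of A" case.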

For more information on this perspective you can read _Spanner, TrueTime and the CAP theorem_ by the fella that coined the term CAP: https://static.googleusercontent.com/media/research.google.c...

Yammer concurred with this perspective, stating:

"At Yammer we have experience with AP systems, and we’ve seen loss of availability for both Cassandra and Riak for various reasons. Our AP systems have not been more reliable than our CP systems, yet they have been more difficult to work with and reason about in the presence of inconsistencies. Other companies have also seen outages with AP systems in production. So in practice, AP systems are just as susceptible as CP systems to outages due to issues such as human error and buggy code, both on the client side and the server side."


"In practice, the uptime of systems that favor availability have not proven greater than what can be achieved with consistent systems."

This claim is much stronger than just saying AP systems sometimes fail. Therefore observing some AP system failures is not enough to prove it. For the statement to hold true, AP systems would have to fail just as often as CP systems. However, such a claim is totally unfounded. The article doesn't link to any research showing AP systems have the same (or worse) availability than CP systems.

On the other hand, most CP systems, even centralized ones, are not truly consistent as required by ACID either. Most RDBMS systems run at Read Committed level for performance reasons, which is ACD, not ACID.

It is also ignoring the fact that most systems are neither AP nor CP, and that this distinction is not very useful to describe database systems.

Yep, the Yammer article is just opinion. It may or may not match your experiences/beliefs :) Check the Spanner paper for slightly more data (but it's still pretty informal) from Google. They don't talk about other systems, but they explain why Spanner has such high availability. It would be interesting to see an empirical study comparing availability but I'm not holding my breath because doing that fairly and realistically would be challenging.

I'm not sure about the claim about CP systems, but I probably have a different opinion on "common" CP systems (etcd, consul, Spanner-likes, which do offer consistency in the sense of ACID.) But it's definitely worth considering things weaker than that - e.g. Azure Blob Storage and Google Cloud Storage (S3-likes) offer strong consistency but no cross-key transactions. These databases are very applicable but weaker than Spanner-likes like the one from TFA.

Agreed that AP vs. CP isn't a good way to describe databases :)

Note that this piece, which is mostly just opinion, is also very guilty of attempting to poison the well.

"AP systems are not 100% available in practice"

Nothing is 100% available, it is like absolute zero, you can get close but never reach it.

I am not sure if the intent on this was to justify the uptime problems that Yammer had this year, or if it was to just self-justify engineering choices.

The real flaw is ignoring the horses-for-courses reality of systems needs. Automated teller machine transactions are not strongly consistent, yet we manage to get by.

Spanner is a "shared nothing" architecture, which has advantages for availability and recoverability. But you do pay a price for the high level of consistency.

Each write is mediated by a leader, and has to be committed on at least two of the three regional nodes, which does have a cost.

Yammer (and I guess rayokota) chose to use PostgreSQL, which is not a horrible choice for a traditional RDBMS these days, but we all know the pain of a master node failover, re-sync, etc. And with most streaming replication methods there will be a serious issue with the nodes that may have been on a partition with the original master and the fail-over node that is promoted.
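To make that cost concrete, here is a rough sketch that assumes a write commits once a quorum of replicas has acknowledged it, so commit latency is set by the slowest replica inside the quorum. The function name and the acknowledgement times are hypothetical; this is not Spanner's actual protocol.

```python
def quorum_commit_latency(ack_times_ms: list[float], quorum: int) -> float:
    """Time until `quorum` replicas have acknowledged a write, given the
    time at which each replica's acknowledgement reaches the leader."""
    if not 1 <= quorum <= len(ack_times_ms):
        raise ValueError("quorum must be between 1 and the replica count")
    return sorted(ack_times_ms)[quorum - 1]


# Assumed ack times: the leader's own disk, a nearby replica, a distant one.
acks = [1.0, 12.0, 80.0]

print(quorum_commit_latency(acks, quorum=1))  # leader only, no replication
print(quorum_commit_latency(acks, quorum=2))  # 2 of 3: wait for nearby replica
print(quorum_commit_latency(acks, quorum=3))  # all 3: wait for the slowest
```

With 2 of 3, the write pays for the second-fastest acknowledgement rather than the slowest, which is why replica placement matters so much for commit latency.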

Distributed systems are hard, but I am having a hard time really finding much value in this post outside of a fairly opaque opinion that there is no perfect product for all needs, which will always be true.

But as for why I chose to respond to your comment, jacobparker: the thing to remember about strong consistency is that locks, two-stage commits, single mediators, etc. will all limit the theoretical speedup in latency you can have by adding resources.

While not perfect this means that you should engineer with Amdahl's law and hope for Gustafson's law.
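For reference, a small sketch of the two laws, where `parallel_fraction` is the share of the work that can be spread across workers. The 95% figure and the worker counts are made-up examples.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup for a fixed-size problem. The serial part
    (locks, single mediators, commit protocols) caps the result."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


def gustafson_speedup(parallel_fraction: float, workers: int) -> float:
    """Gustafson's law: scaled speedup when the problem grows with the
    number of workers and the serial part stays constant."""
    serial = 1.0 - parallel_fraction
    return serial + parallel_fraction * workers


for workers in (8, 64, 1024):
    a = amdahl_speedup(0.95, workers)
    g = gustafson_speedup(0.95, workers)
    print(f"workers={workers}: Amdahl {a:.1f}x, Gustafson {g:.1f}x")
```

With 5% serialized work, Amdahl's speedup can never exceed 20x no matter how many workers are added, which is the ceiling that strongly consistent coordination tends to impose.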

As a practical example of how this becomes an issue with scaling, consider the following article about the limitations of single disk queues in the Linux kernel on systems with multiple CPUs


Note the extreme falloff on Figure 4.

This doesn't mean that you can't start out with a simple strong consistency model, but if you don't at least try to avoid the need for ACID-type transactions it will be very hard to scale later. Worse, most teams I have been on try to implement complex and fragile sharding schemes which in effect replicate a BASE-style datastore. If we think managing distributed systems is hard, writing them is much more difficult.

But really there are no universal truths in this area, and all decisions should be made on use case and not by selecting the product before defining the need (which is our most common method, it seems).

> For the statement to hold true, AP systems would have to fail just as often as CP systems.

I think all that would need to be true is that there exists one CP system that has uptime figures comparable to AP systems. Then you could say that targeting AP is pointless, when you could either just use that system; or engineer your own new CP system to do the same things to achieve uptime as that system.

You can achieve high uptime in a CP system by assuring P will never happen, ending up with some kind of an AC system, but in practice it may be very costly and hard to set up and operate.

Can you imagine every ATM having redundant network connectivity and two independent power lines?

There is no free lunch. Comparing just availability figures is therefore not enough. Still, an AP system may be preferable if it has the same availability at a fraction of the cost of that unicorn available and consistent system.
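As a rough illustration of what that redundancy buys, here is a sketch that assumes component failures are independent (a strong assumption in practice, since shared causes are common). The 99% figure and the function names are made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(single: float, redundant_copies: int) -> float:
    """Availability of N redundant components when only one needs to be up,
    assuming failures are independent."""
    return 1.0 - (1.0 - single) ** redundant_copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


for copies in (1, 2, 3):
    a = combined_availability(0.99, copies)
    print(f"{copies} link(s): {a:.6f} available, "
          f"{downtime_minutes_per_year(a):.1f} min/year down")
```

Each extra link removes a lot of expected downtime, but it also multiplies the hardware and operations bill, which is the cost side of the comparison.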

Note that a "CP system with high uptime" isn't suddenly an AC system. It still doesn't attempt to have 100% uptime, and its ecosystem of clients still must be built around the idea that the server can go down, if just for a few minutes a year.

This is different from an AP system, because an AP system has to be built from the ground up to assume that stupid corruptions will happen during netsplits and that it'll have to re-integrate them; while the CP system can just assume it'll all go down, and not have to have any of that extra code or infrastructure.

A good example of a high-uptime CP system that I'm aware of, is telecom switches. They're usually architected as simple pairs: one live, and one hot standby. And—even in platforms that don't have "hot upgrade" functionality—you can still upgrade both nodes without downtime by just intentionally causing connections to fail-over back and forth between the nodes.

Telecom switches have very little downtime. But they do go down, rather than becoming inconsistent. And the resulting architecture is much cheaper (in both hardware/networking costs, ops costs, and software development/maintenance costs) than the equivalent architecture you'd need to serve calls under AP guarantees.

For many applications repeatable read, or even read committed, is plenty good enough Isolation.

For many applications BASE is plenty good consistency.

Does your explanation also hold when you look at read and write availability separately? I'd only see that to hold for the read "view" here. E.g., Cassandra is used because you can write your data almost always to Cassandra, while with a CP-oriented system (HBase?) you are stuck "waiting" for consistency.

Yes; in vanilla Paxos you need a majority of nodes to be up/reachable to service either a read or a write. There isn't really a distinction between read/write in terms of availability (for better or worse.) I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.

> I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.

Consider a bank account. I can consistently service writes that add or subtract from the balance as long as I can guarantee durability and am willing to blindly update without a consistent view of the balance.

I can only service consistent reads when I can guarantee that I have seen every committed transaction.

In this case we opt for availability over consistency when presenting the current balance, but for consistency when e.g. cutting statements (through the brute-force method of just waiting until all settlement for the relevant dates has occurred).
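A toy sketch of that bank-account model, assuming every transaction has a unique id and that amounts are signed integers. The `Ledger` class and its methods are hypothetical and do not describe any real banking system.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """One node's view of an account (hypothetical model for illustration)."""
    deltas: dict = field(default_factory=dict)  # txn_id -> signed amount

    def apply(self, txn_id: str, amount: int) -> None:
        # Blind write: no balance check, and re-applying an id is a no-op.
        self.deltas.setdefault(txn_id, amount)

    def merge(self, other: "Ledger") -> None:
        # Settlement: pull in transactions first recorded on another node.
        for txn_id, amount in other.deltas.items():
            self.deltas.setdefault(txn_id, amount)

    def available_balance(self) -> int:
        # Always answerable, but may miss transactions not yet merged.
        return sum(self.deltas.values())

    def statement_balance(self, expected_txn_ids: set) -> int:
        # Consistent read: refuse until every expected transaction is here.
        missing = expected_txn_ids - set(self.deltas)
        if missing:
            raise RuntimeError(f"settlement incomplete, missing {sorted(missing)}")
        return sum(self.deltas[txn_id] for txn_id in expected_txn_ids)


branch, atm = Ledger(), Ledger()
branch.apply("t1", 100)   # deposit recorded at the branch
atm.apply("t2", -30)      # withdrawal recorded at a disconnected ATM
print(branch.available_balance())                 # stale view, missing t2
branch.merge(atm)                                 # settlement catches up
print(branch.statement_balance({"t1", "t2"}))     # consistent once settled
```

Writes stay available on both nodes because additions and subtractions commute; only the statement read has to wait for settlement.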

I think you misread my comment; I said it's unlikely to have a consistent system that can service reads but not writes during some manner of disruption.

> I think it's pretty unlikely (in terms of db design) that you'd have a consistent system that could service reads but not writes during some manner of outage.

It's very easy to imagine this scenario. All you need is a read replica of your CP RDBMS.

For most systems consistency is only required on write transactions and eventual consistency on read only transactions is acceptable.

Paxos is about ensuring consistency. So why would the original arguments hold for a write-available database like Cassandra - or about log/journal storage, then?

This is an insightful observation. Designing more and more complex systems to achieve high availability may end up not achieving anything or even being negative because those HA systems are too complex to implement, configure and operate in a way that actually achieves HA.

I think the same may end up being true of globally distributed databases that try to provide generic ACID transactions. Will they actually ever achieve the performance and reliability needed to serve massive scale global work loads?

Or will this be a Mongo situation where all of the scale promises turn out to be unfulfilled once you actually reach the scale where you need them?

The Mongo situation is the best-case scenario. They went IPO while others failed.

I started using CockroachDB [1] recently and I'm really impressed with it overall. The Postgres compatibility is genius in my opinion, and despite some hurdles it's been a joy to work with (I'm using xo [2] to generate Go code from custom templates). I've yet to deploy it in production, but the best part is that if the performance sucks I can simply deploy postgres instead. As someone who used to be a big RethinkDB supporter, this is my new favorite pet oss project.

edit: oh, and it's written in Go, my main language, which is definitely a plus if you're a gopher

1: https://www.cockroachlabs.com/tags/acid/

2: https://github.com/xo/xo

I feel the same way regarding the importance of a migration back to Postgres if needed. I haven't been given a reason to doubt cockroachdb's stability, but you never know what could go wrong. The fact that I can have a break glass contingency plan to restore data to Postgres if I need to is huge. I doubt I'll ever need it, but the fact that it's an option made using a younger project like cockroachdb an acceptable risk.

Postgres compatibility? Really? Reference? I want to believe, but aren't you comparing apples to oranges? Last time I checked, the JSON datatype was not supported on Cockroach.

CockroachDB explicitly aims to offer a Postgres-compatible interface. Is their homepage reference enough for that? https://www.cockroachlabs.com/product/cockroachdb/ or https://www.cockroachlabs.com/docs/stable/build-an-app-with-...

No, not even a JSON datatype is supported right now. I cannot use a database without JSON for my use case: https://github.com/cockroachdb/cockroach/issues/2969 A compatible interface is not compatibility, and JSON, as an example, is not the biggest piece of magic inside PostgreSQL.

JSONB support is coming, fuck yeah!


Give the dudes time. Postgres has a lot of stuff.

JSONB support is on our 2.0 roadmap and is actively under development.

Roadmap: https://github.com/cockroachdb/cockroach/wiki/Roadmap

Confirmed Development (I see you already reference this issue above): https://github.com/cockroachdb/cockroach/issues/2969
