Absolute consistency

classic Classic list List threaded Threaded
28 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Justin Sheehy

On Jan 10, 2012, at 9:42 PM, Les Mikesell wrote:

> How do things like mongo and elasticsearch manage atomic operations
> while still being redundant?

Most such systems use some variant of primary copy replication, also known as master/slave replication.

That approach can provide consistency, but has much weaker availability properties.

-Justin



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Les Mikesell
On Tue, Jan 10, 2012 at 8:52 PM, Justin Sheehy <[hidden email]> wrote:
>
> On Jan 10, 2012, at 9:42 PM, Les Mikesell wrote:
>
>> How do things like mongo and elasticsearch manage atomic operations
>> while still being redundant?
>
> Most such systems use some variant of primary copy replication, also known as master/slave replication.
>
> That approach can provide consistency, but has much weaker availability properties.

Doesn't riak need some kind of partition owner/master concept to
control migration?  And if it has that, why can't the client request
that an operation happens on the partition owner/master first for
things that need consistency?

--
   Les Mikesell
     [hidden email]

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Ian Plosker
Les,

In Riak, there is no single primary copy considered the canonical version. For each key, there will be N (3 by default) partitions responsible for storing the associated value. In effect, there are N primaries for any key. This is how Riak makes its availability guarantees, as well as why "absolute consistency" is difficult.

-- 
Ian Plosker <[hidden email]>
Developer Advocate
Basho Technologies, Inc.

On Tuesday, January 10, 2012 at 10:09 PM, Les Mikesell wrote:

On Tue, Jan 10, 2012 at 8:52 PM, Justin Sheehy <[hidden email]> wrote:

On Jan 10, 2012, at 9:42 PM, Les Mikesell wrote:

How do things like mongo and elasticsearch manage atomic operations
while still being redundant?

Most such systems use some variant of primary copy replication, also known as master/slave replication.

That approach can provide consistency, but has much weaker availability properties.

Doesn't riak need some kind of partition owner/master concept to
control migration? And if it has that, why can't the client request
that an operation happens on the partition owner/master first for
things that need consistency?

--
Les Mikesell

_______________________________________________
riak-users mailing list


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Vishal Shah
To add to Ian's comment, for me personally, this specific characteristic is in fact a very important distinguishing feature of Riak vs other scalable KV systems. To me, this is what separates 

My understanding is that Riak borrows this from Dynamo which talks about "decentralization" - master nodes often times are a cause of bottlenecks and/or single point of failure for certain operations.

Vishal

On Jan 11, 2012, at 8:39 AM, Ian Plosker wrote:

Les,

In Riak, there is no single primary copy considered the canonical version. For each key, there will be N (3 by default) partitions responsible for storing the associated value. In effect, there are N primaries for any key. This is how Riak makes its availability guarantees, as well as why "absolute consistency" is difficult.

-- 
Ian Plosker <[hidden email]>
Developer Advocate
Basho Technologies, Inc.

On Tuesday, January 10, 2012 at 10:09 PM, Les Mikesell wrote:

On Tue, Jan 10, 2012 at 8:52 PM, Justin Sheehy <[hidden email]> wrote:

On Jan 10, 2012, at 9:42 PM, Les Mikesell wrote:

How do things like mongo and elasticsearch manage atomic operations
while still being redundant?

Most such systems use some variant of primary copy replication, also known as master/slave replication.

That approach can provide consistency, but has much weaker availability properties.

Doesn't riak need some kind of partition owner/master concept to
control migration? And if it has that, why can't the client request
that an operation happens on the partition owner/master first for
things that need consistency?

--
Les Mikesell

_______________________________________________
riak-users mailing list

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Les Mikesell
In reply to this post by Ian Plosker
On Wed, Jan 11, 2012 at 10:39 AM, Ian Plosker <[hidden email]> wrote:
> Les,
>
> In Riak, there is no single primary copy considered the canonical version.
> For each key, there will be N (3 by default) partitions responsible for
> storing the associated value. In effect, there are N primaries for any key.
> This is how Riak makes its availability guarantees, as well as why "absolute
> consistency" is difficult.

But who makes the decision that a partition needs to migrate and where
a key is at any time during that migration?  That isn't independently
decided by each node, is it?  And if you have an authority for that,
why can't a client ask that authority to control ordering of
operations for certain things where the client is willing to trade the
time it might take for atomicity.

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Les Mikesell
In reply to this post by Vishal Shah
On Wed, Jan 11, 2012 at 10:56 AM, Vishal Shah <[hidden email]> wrote:
> To add to Ian's comment, for me personally, this specific characteristic is
> in fact a very important distinguishing feature of Riak vs other scalable KV
> systems. To me, this is what separates

Yes, but it makes it unusable for anything that requires atomic
operations.  A group here just went with elasticsearch - partly
because of the full lucene indexer and range queries, but mostly
because they wanted redundant data feeds and an atomic operation to
reject duplicates.  And as a nice side effect, you can run a
client-only node that doesn't store data on the same box with the
application so you don't need to go through a separate load balancer
to deal with failures of the node the app is configured to use.

--
  Les Mikesell
    [hidden email]

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Joseph Blomstedt-3
If ElasticSearch is a better fit, then using ElasticSearch is the
right thing to do. The whole "NoSQL movement" is really about choice.
At scale there will never be a single solution that is best for
everyone.  Riak is intentionally focused on high availability,
reliability, and fault-tolerance. If you have critical data and/or
care most about optimizing for "it just works" (even during a server
failure, even at 2 AM), then Riak is likely the best choice. If other
priorities dominate, then you may want to look elsewhere.

Riak is based on Dynamo. Dynamo is an eventually consistent system
that embraces sloppy quorums and hinted handoff. In terms of CAP, it
is an AP system rather than a CP system. And, by embracing sloppy
quorums, Riak/Dynamo takes AP to its limit by optimizing for "always
writable" over "read-your-own writes" consistency. In comparison,
Project Voldemort is another Dynamo inspired system that chooses to go
with strict quorums. This provides RYOW eventual consistency, but
leads to lower availability guarantees during failures/partitions.

For Riak 1.0, we introduced PW/PR. The intention behind PW/PR was to
add support for strict quorums. Requests with R/W settings would use
the standard sloppy quorum logic, requests with PR/PW would use strict
quorum logic. As Jon mentioned in his earlier email, the current
implementation that went into 1.0 turns out to have a corner case were
strict quorums aren't enforced. Addressing this corner case and
strengthening the guarantees of PW/PR are "on the list".

To illustrate sloppy vs strict, consider a 5-node cluster: A/B/C/D/E.
And a W=Quorum/N=3 write request is issued on a key that is owned by
nodes A/B/C. However, nodes B and C are currently offline.

With sloppy quorums/hinted handoff, Riak will re-route the requests
meant for B/C to nodes D/E. This allows the write to succeed with the
guarantee of having 3-replicas in the cluster. In the future, D will
eventually handoff it's replica back to B whenever it comes online,
and E will do the same to C. This is hinted handoff.

With strict quorums (ie. PW=Quorum), the desired behavior is to only
send requests to the primary nodes. Since B/C are down, only A would
respond, and therefore the write would fail because it did not fulfill
the requested replica requirement. Again, in the current
implementation, there are cases where this is not guaranteed and a
request may be sent to a fallback node. This leads to the behavior
discussed earlier.

Of course, even if we had perfect PW/PR semantics, Riak still only
gives you a limited form of "read your own writes" consistency. The
labels "absolute consistency", "strong consistency" or even "atomic
operation" are vague when discussing distributed systems with multiple
clients.

Some thoughts to ponder:

1. Do you allow multiple clients to write to Riak at the same time?
With concurrent writers, "atomic" can mean multiple things. Do you
want linearizability? Do you want one writer to fail? Is optimistic
concurrency control or MVCC your solution? You could route everything
though a single writer, but then you introduce a
single-point-of-failure (SPOF). Adding an SPOF in-front of a highly
available system is non-ideal. Or, perhaps a single writer with leader
election / failover?

2. What about write failure? In Riak, a write failure does not mean
the value won't later show-up in a read. If you issue a PW=3 write, it
may fail because it succeeded to write to 1 replica, but not the other
2. However, the 1-replica that does have the value will eventually
propagate it to the other 2 replicas through read repair. Thus, you
may eventually read the failed-to-write value. Of course, the write
could have failed to all 3 replicas in which case you won't ever read
it. Which type of failure was it? You don't know, because actual
failure and "didn't reply before I responded to the client" are
indistinguishable without something similar to a 2-phase commit. Does
your client handle this case? Perhaps just re-issues writes until they
succeed? What if your client dies while re-issuing requests, is the
value lost? What consistency guarantees do you want to provide in this
scenario?

In general, distributed consistency is non-trivial. Even master/slave
systems have choices to make. Synchronous vs asynchronous replication.
ElasticSearch is synchronous (at least to secondary RAM), while
MongoDB is asynchronous. If you have multiple slaves, what consistency
guarantees are there between all of them? If the master crashes during
a write that was replicated to some but not all slaves, is it possible
to get different values on a read if different subset of slaves crash
as well? For the strongest replication guarantees, you end up with
protocols with higher latency and lower availability guarantees. It's
always a tradeoff game.

As an aside, around March 2010, I started to investigate strong
consistency in Riak. Part of that work lead to an implementation of
riak_zab (http://github.com/jtuple/riak_zab), an atomic, 2-phase
commit protocol built on riak_core that I released last year. I have
unreleased code that provided a strongly consistent riak_kv layer
(riakual) on top of riak_zab. This was one of many possible ways to
add stronger guarantees to Riak.

As Jon mentioned, stronger consistency is a research area for 2012.
While CAP dictates that you can't have C/A/P at once, there's no
reason you can't have a product that provides both AP requests and CP
requests. Perhaps there will be more to discuss on that point later on
this year.

The main take-away of this long email is that providing different
guarantees in the presence of node failures and network partitions is
a non-trivial problem.  If the goal is high-availability and no SPOF,
the problem is even more challenging.  I would recommend against
anyone trying to implementing client-side strong consistency on-top of
Riak, unless you understand the scope of the problem and are
intentionally limiting yourself to a subset of strong consistency (eg.
"assume writes always succeed", "assume only one writer always", etc).
Or, if you understand how you would do so, leverage something like
Zookeeper to provide consistent replicated state that is backed by
Riak. (That's essentially what riakual/riak_zab did). Or, better yet,
look at how you can reformulate your problem to work in an eventually
consistent system. Less coordination will always provide faster, more
predictable performance. Options like statebox, knockbox, and
meangirls can help with certain problem domains:
https://github.com/mochi/statebox
http://reiddraper.com/introducing-knockbox/
https://github.com/aphyr/meangirls

Finally, if you truly need atomic, strong consistency, today and not
tomorrow then consider other options. Seriously, if Riak isn't the
right fit, "Don't use my database":
http://www.slideshare.net/BashoTechnologies/basho-and-riak-at-goto-stockholm-dont-use-my-database

If you do look at other options, take the time to truly understand the
guarantees provided. What does atomic, multi-master replication mean
for a given product?  What failure conditions can it tolerate? Can you
ever lose data under rare, but not necessarily uncommon scenarios? If
you're unsure, ask questions. It's not about good or bad. Different
products have different guarantees.  The goal is to ensure that
whatever features/guarantees you need for your project are provided by
the choice you decide upon.

-Joe
@jtuple


On Wed, Jan 11, 2012 at 11:07 AM, Les Mikesell <[hidden email]> wrote:

> On Wed, Jan 11, 2012 at 10:56 AM, Vishal Shah <[hidden email]> wrote:
>> To add to Ian's comment, for me personally, this specific characteristic is
>> in fact a very important distinguishing feature of Riak vs other scalable KV
>> systems. To me, this is what separates
>
> Yes, but it makes it unusable for anything that requires atomic
> operations.  A group here just went with elasticsearch - partly
> because of the full lucene indexer and range queries, but mostly
> because they wanted redundant data feeds and an atomic operation to
> reject duplicates.  And as a nice side effect, you can run a
> client-only node that doesn't store data on the same box with the
> application so you don't need to go through a separate load balancer
> to deal with failures of the node the app is configured to use.
>
> --
>  Les Mikesell
>    [hidden email]
>
> _______________________________________________
> riak-users mailing list
> [hidden email]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



--
Joseph Blomstedt <[hidden email]>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Absolute consistency

Les Mikesell
On Wed, Jan 11, 2012 at 9:14 PM, Joseph Blomstedt <[hidden email]> wrote:
>
> Some thoughts to ponder:
>
> 1. Do you allow multiple clients to write to Riak at the same time?
> With concurrent writers, "atomic" can mean multiple things. Do you
> want linearizability? Do you want one writer to fail?

In our particular case we have redundant feeds with upstream
processing that will provide a unique key per item.  So we expect two
concurrent writers and want one to fail, the other to succeed and be
indexed.  There's a bit of messiness involved because some (few) of
the items will be updates to existing data.

> Is optimistic
> concurrency control or MVCC your solution? You could route everything
> though a single writer, but then you introduce a
> single-point-of-failure (SPOF). Adding an SPOF in-front of a highly
> available system is non-ideal. Or, perhaps a single writer with leader
> election / failover?

The point of the design is to avoid any SPOF.  The upstream data comes
from multiple sources, so the redundant feeds are expected to cover
individual source failures to the extent possible, letting the DB drop
the duplicates but add its own internal redundancy for storage.

> 2. What about write failure? In Riak, a write failure does not mean
> the value won't later show-up in a read. If you issue a PW=3 write, it
> may fail because it succeeded to write to 1 replica, but not the other

We expect one to fail.  If it fails for any reason other than the key
already existing, the writer client can queue and retry within limits
(its a feed, not interactive).

> In general, distributed consistency is non-trivial. Even master/slave
> systems have choices to make. Synchronous vs asynchronous replication.
> ElasticSearch is synchronous (at least to secondary RAM), while
> MongoDB is asynchronous. If you have multiple slaves, what consistency
> guarantees are there between all of them? If the master crashes during
> a write that was replicated to some but not all slaves, is it possible
> to get different values on a read if different subset of slaves crash
> as well? For the strongest replication guarantees, you end up with
> protocols with higher latency and lower availability guarantees. It's
> always a tradeoff game.

I don't expect to trust any of these schemes 100%.  We will have
another instance of the cluster in another location and a tool that
compares recently created keys between them (one of the reasons for
wanting efficient range queries).    We are still kicking around the
relative reliability and other tradeoffs of trying to inject all 4
feeds into each DB vs. the two local copies vs. crisscrossed.
Crisscrossed will probably win.

> As Jon mentioned, stronger consistency is a research area for 2012.
> While CAP dictates that you can't have C/A/P at once, there's no
> reason you can't have a product that provides both AP requests and CP
> requests. Perhaps there will be more to discuss on that point later on
> this year.

Yes, even understanding that you can't guarantee both at once, you
could let the client decide which it wants.  Some things can wait.

--
   Les Mikesell
     [hidden email]

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
12