Absolute consistency


Thomas Bakketun
Hello,

Riak is based on the concept of eventual consistency, but sometimes
absolute consistency is desirable. I thought it would be possible to
implement this on top of Riak with the help of the pr and pw parameters of
fetch and store, but when testing this I sometimes get worrying results.

I have a test setup with Riak nodes in three virtual machines running
on my desktop machine. The test bucket has n_val=3 and allow_mult=true.

On my desktop machine I constantly read from every node like this:
watch curl -v "'http://192.168.100.21:8098/riak/test/foo?pr=quorum'"

When I pause two of the virtual machines, the remaining one starts to
fail with the error message "PR-value unsatisfied: 1/2", just as expected.
But occasionally it returns 200 OK for about 10 seconds before it
starts returning "PR-value unsatisfied: 1/2". I can't reproduce this
predictably, but it seems to happen when I resume a machine that was
paused while the other machines were still running.

Pausing virtual machines to simulate downtime might be a bit
unrealistic. Still, it concerns me that Riak can return stale data. Here
is an example:

First I wrote a value while all nodes were up. Then I paused one of the
nodes and wrote a new value. Then I paused the two remaining nodes and
resumed the first node. For a few seconds the first node returned
200 OK with the old value, even though I supplied pr=quorum.

I have also been able to do successful writes with w, dw and pw=quorum
with only one machine running. I think the node also returned 200 OK
when the object was read with pr=quorum right after.
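
The writes were done roughly like this (the value here is just a
placeholder):

curl -v -X PUT -H "Content-Type: text/plain" -d 'new value' \
  'http://192.168.100.21:8098/riak/test/foo?w=quorum&dw=quorum&pw=quorum'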

The tests were done with Riak 1.0.0-1 running on Ubuntu 10.10 64 bit.

--
Thomas Bakketun, Senior software developer
T. +47 21 53 69 40
Copyleft Solutions
www.copyleftsolutions.com



Re: Absolute consistency

Alexander Sicular

When n=3 in a three-node cluster, sometimes two of those n replicas land on the same node.

@siculars
http://siculars.posterous.com

Sent from my rotary phone.


Re: Absolute consistency

Runar Jordahl
In reply to this post by Thomas Bakketun
As Alexander pointed out, N does not mean "a separate PC". There was a
thread about this earlier:

http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-February/003316.html

Kind regards
Runar Jordahl
blog.epigent.com


Re: Absolute consistency

Tim Robinson
Ouch.

I'm shocked that is not considered a major bug. At minimum, that kind of stuff should be front and center in their wiki/docs. Here I am thinking n=2 on a 3-node cluster means I'm covered when in fact I am not. It's the whole reason I gave Riak consideration.

Tim


Re: Absolute consistency

Alexander Sicular
You're right, there should be better notice about this. I think the number of physical machines needs to be greater than n to keep this from happening. I also think I've seen a specific algorithm floating around that defines this behavior, but I can't conjure it up at the moment. I also think the recommended default config is four nodes, and if it's not, it should be.

-Alexander Sicular

@siculars


Re: Absolute consistency

Alexander Sicular
In reply to this post by Tim Robinson
Shocked? Dude, chillax. Shocked would be like finding out the pope's got kids. Or that we didn't land on the moon. Or did we? There's so much magic abstraction going on here that, just sending this email to you, I'm shocked every time anything actually works.

-Alexander Sicular

@siculars


Re: Absolute consistency

Tim Robinson

The kind of people who gravitate toward Riak are ones who take redundancy pretty seriously, so forgive me if I try to choose words that I feel are on par with the expectation.



Re: Absolute consistency

Aphyr
In reply to this post by Tim Robinson
On 01/05/2012 11:44 AM, Tim Robinson wrote:

> Here I am thinking n=2 on a 3-node cluster means I'm covered when in
> fact I am not.

I think you may have this backwards. N=3 on 2 nodes would mean one node
has 1 copy, and 1 node has 2 copies, of any given piece of data. For n=2
on 3 nodes, there should be no overlap.

The other thing to consider is that for certain combinations of
partition count P and node count N, distributing partitions mod N can
result in overlaps at the edge of the ring. This means zero to n
preflists can overlap on some nodes. That means n=3 can, *with the wrong
choice of N and P*, result in a minimum of 2 machines having copies of
any given key, assuming P > N.

There are also failure modes to consider. I haven't read the new key
balancing algo, so my explanation may be out of date.

--Kyle


Re: Absolute consistency

Tim Robinson
Thank you for this info. I'm still somewhat confused.

Why would anyone ever want 2 copies on one physical PC? Correct me if I am wrong, but part of the sales pitch for Riak is that the cost of hardware is lessened by distributing your data across a cluster of less expensive machines, as opposed to having it all reside on one enormous server with very little redundancy.

The 2 copies of data on one physical PC provide no redundancy, but increase hardware costs quite a bit.

Right?

Thanks,
Tim


Re: Absolute consistency

Aphyr
On 01/05/2012 12:12 PM, Tim Robinson wrote:

> Why would anyone ever want 2 copies on one physical PC?

Because in the case you expressed shock over, the pigeonhole
principle makes it *impossible* to store three copies of information in
two places without overlap. The alternative is lying to you about the
replica semantics. That would be bad.

In the second case I described, it's an artifact of a simplistic but
correct vnode sharding algorithm which uses the partition ID modulo node
count to assign the node for each partition. When the node count does not
evenly divide the partition count, the last and the first (or second,
etc., you do the math) partitions can wind up on the same node. In that
case, the proportion of data that overlaps on one node is on the order of
1/64 to 1/1024 of the keyspace. This is not a significant operational
cost.

This *does* reduce fault tolerance: losing those two "special" nodes
(but not two arbitrary nodes) can destroy those special keys even though
they were stored with N=3. As the probability of losing two *particular*
nodes simultaneously compares favorably with the probability of losing
*any three* nodes simultaneously, I haven't been that concerned over it.
It takes roughly six hours for me to allocate a new machine and restore
the destroyed node's backup to it. Anecdotally, I think you're more
likely to see *cluster* failure than *dual node* failure in a small
distributed system, but that's a long story.

The Riak team has been aware of this since at least June 2010
(https://issues.basho.com/show_bug.cgi?id=228), and there are
operational workarounds involving target_n_val. As I understand it,
solving the key distribution problem is... nontrivial.

--Kyle


Re: Absolute consistency

Tim Robinson
So in the original thread, with N=3 on 3 nodes, the developer believed each node was getting a copy, when in fact 2 copies went to a single node. So yes, there's redundancy and the "shock" value can go away :) My apologies.

That said, I have no way to assess how much data space that is wasting, but it seems like potentially 1/3 - correct?

Another way to look at it, using the above noted case, is that I need to double[1] the amount of hardware needed to achieve a single amount of redundancy.

[1] not specifically, but effectively.



Re: Absolute consistency

Jeremiah Peschka
I seem to recall reading somewhere, but can't find it now, that when n = 3, three physical nodes is the minimum and five nodes is the recommended configuration.

Jeremiah Peschka
Founder, Brent Ozar PLF, LLC



Re: Absolute consistency

Les Mikesell
In reply to this post by Tim Robinson
On Thu, Jan 5, 2012 at 2:53 PM, Tim Robinson <[hidden email]> wrote:
> So in the original thread, with N=3 on 3 nodes, the developer believed each node was getting a copy, when in fact 2 copies went to a single node. So yes, there's redundancy and the "shock" value can go away :) My apologies.
>
> That said, I have no way to assess how much data space that is wasting, but it seems like potentially 1/3 - correct?

If you want to write 3 copies while one of your 3 nodes is down.

> Another way to look at it, using the above noted case, is that I need to double[1] the amount of hardware needed to achieve a single amount of redundancy.

Or perhaps that you want to use more nodes so having one down doesn't
impact the others so much.

--
   Les Mikesell
     [hidden email]


Re: Absolute consistency

Aphyr
In reply to this post by Tim Robinson
On 01/05/2012 12:53 PM, Tim Robinson wrote:

> That said, I have no way to assess how much data space that is
> wasting, but it seems like potentially 1/3 - correct?
>
> Another way to look at it, using the above noted case, is that I need
> to double the amount of hardware needed to achieve a single amount of
> redundancy.

For the third time, no. Please read
http://wiki.basho.com/What-is-Riak%3F.html.

For 256 partitions, N=3:

Part  Node
0     0
1     1
2     2
3     0
4     1
...
253   1
254   2
255   0

With n = 3, a key assigned to partition 0 will be stored on partitions 0,
1, and 2.

Key  Preflist     Nodes
0    0,1,2        0,1,2
1    1,2,3        1,2,0
...
253  253,254,255  1,2,0
254  254,255,0    2,0,0  <-- overlap!
255  255,0,1      0,0,1  <-- overlap!

Only 1/128 of the data will reside on only two nodes. 127/128 of the
data will be distributed to three nodes. No data resides on only one
node. Data loss requires the simultaneous destruction of:

a. Any three nodes
b. Node 0 and 1
c. Node 0 and 2

This is true regardless of the number of nodes, so long as the node
count does not evenly divide the partition count AND you are not using
the workaround I linked to in my previous post. If you do either of
those things (use 4, 8, 16, or 32 nodes instead of 3, or use the n_val
workaround), the distribution will be even.
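
(If you want to check these numbers yourself, here is a minimal Python
sketch of the naive mod-N placement described above; it is an
illustration, not Riak's actual claim code.)

# Naive placement: partition ID modulo node count; each key is
# replicated to the next N_VAL partitions in the ring.
RING_SIZE = 256  # partitions
NODES = 3        # physical nodes
N_VAL = 3        # replicas per key

overlaps = 0
for p in range(RING_SIZE):
    preflist = [(p + i) % RING_SIZE for i in range(N_VAL)]
    owners = [q % NODES for q in preflist]
    if len(set(owners)) < N_VAL:
        overlaps += 1
        print(p, preflist, owners)  # prints only partitions 254 and 255

print(overlaps, "/", RING_SIZE)  # 2/256 == 1/128 of the preflists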

--Kyle


Re: Absolute consistency

Joseph Blomstedt-3
In reply to this post by Tim Robinson
The Internet went down as I was writing an email. It looks like everyone
already did a great job answering the availability issues, but I might
as well chime in as a Basho engineer.

On a side note, it looks like we've completely hijacked the
"Absolute consistency" question initially proposed. I'll email out
some thoughts on that later.

> Why would anyone ever want 2 copies on one physical PC?

You wouldn't want 2 copies on one machine. But, if a user requests 3
replicas, and there are only 2 machines, then that third replica has
to go somewhere.

This is a common scenario during initial data loading, development,
and testing/evaluation of Riak. For example, people often start up a
single-node Riak cluster, load up some data, and then add 2 or more
nodes to turn it into a properly sized cluster. Before the new nodes
are added, all the replicas live on the single node. As nodes are
added, replicas will move to the additional nodes to ensure
availability.

Simply put, unless you have enough machines to hold all your requested
replicas, there isn't much Riak can do for you. You could certainly
argue that replicas could be merged, with only 2 written in the
reduced-node case and a third copied out once enough machines are
added to match/exceed the replica count. But that's additional
complexity for very minor gain. I would rather Riak have a single,
easily understood operating mode that you can rely on to understand
your availability guarantees than have alternative operating modes
depending on cluster sizing.

If you want to run a Riak cluster with N=3, you should have 3+ nodes.

With that said, Kyle correctly mentioned edge cases where having more
nodes than N could still lead to reduced availability. Specifically,
if the number of nodes does not cleanly divide the ring size, then
there may be reduced-availability preference lists at the wrap-around
point of the Riak ring. For example, a 64-partition ring with 4 nodes
won't have this problem, but a 64-partition ring with 3 nodes may.
This is considered a bug, and is documented at:
https://issues.basho.com/show_bug.cgi?id=228

As listed on the issue, there are operational workarounds that ensure
this doesn't occur, such as going with 64/4 rather than 64/3. Fixing
the issue entirely is something we at Basho are working towards. The
new ring claim algorithm in the upcoming release of Riak makes the
wrap-around issue much less likely anytime you have more than N nodes.
A future release will address the issue more directly.
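
(To see the wrap-around concretely: the same naive mod-N placement Kyle
sketched, applied to a 64-partition ring in Python. A hedged
illustration, not Riak's actual claim algorithm.)

# Count preference lists (n_val = 3) that repeat a node under naive
# "partition ID modulo node count" placement.
def overlapping_preflists(ring_size, nodes, n_val=3):
    bad = 0
    for p in range(ring_size):
        owners = {((p + i) % ring_size) % nodes for i in range(n_val)}
        if len(owners) < n_val:
            bad += 1
    return bad

print(overlapping_preflists(64, 4))  # 0 -- 4 divides 64 evenly
print(overlapping_preflists(64, 3))  # 2 -- wrap-around preflists repeat a node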

-Joe

--
Joseph Blomstedt <[hidden email]>
Software Engineer
Basho Technologies, Inc.
http://www.basho.com/


Re: Absolute consistency

Tim Robinson
Got it. Thanks for the responses and the patience.
Tim


Re: Absolute consistency

Thomas Bakketun
On 2012-01-05 22:52, Tim Robinson wrote:

> On a side note, it looks like we've completely highjacked the
> "Absolute consistency" question initially proposed.

Yes, the answers so far don't explain the behaviour I have observed.
If that particular key happened to have two of its primary replicas
on the same node, then that single node would be able to answer fetch
requests with pr=quorum even if it's the only node currently running. But
that node would also have to be online for a pr=quorum read of the
key to be successful. This is not what I observed: quorum reads were
successful when any two of the three nodes were online. Also, the
surprising results only last for a few seconds, and it's not one
particular node that gives them.

Another thing I observed: sometimes the vclock of an object starts to
flap between two different values while the object value stays the same.
Here's an example:

a85hYGBgzmDKBVIc6TXMFn6HL3ZmMCUxMLC65bEyzM6pPsEHlRWoNl7nd/jyeohsLVB2JZKsoK2Av9+qLz4Q2UNA2U8g2SwA


a85hYGBgzmDKBVIcAtXG6/wOX16fwZTEwMBam8fKsDKn+gQfVFbQVsDfb9UXH4jsIaDsJyTZ9BpmC7/DFzshsm5A2dkg2SwA

No writes are done when this happens. One node always returns the first
vclock, while the two others occasionally return the second one. I
have not supplied any client ID when writing to the key.

--
Thomas Bakketun, Senior software developer
T. +47 21 53 69 40
Copyleft Solutions
www.copyleftsolutions.com



Re: Absolute consistency

Andrew Thompson
In reply to this post by Thomas Bakketun
Thomas,

I just replicated your setup (at least for the PR gets), and you can
indeed violate PR/PW when you pause a node in a VM. The reason this
happens is that Riak's check for PR/PW simply looks at the ring's
preflist for a partition and checks that the required number of
partitions in that preflist are marked as primaries.

Now, when you pause a VM you interrupt any TCP connections that node has
open, just like if you unplugged the network cable, but not like if the
OS shut down or Riak itself crashed. In those cases a FIN packet is sent,
so the other Erlang nodes notice that their persistent connections
to that node have been reset; they will then reassign ownership of the
partitions owned by the downed node, and PR/PW will start to fail.

However, since FIN packets are not generated when you pause the VM, it
takes a few moments for the Erlang network heartbeat machinery to notice
that the node is down, so the preflists aren't recalculated. This is the
window where you see the mysterious behaviour.

Now, this is arguably a bug, although fixing it might be challenging.
I've filed https://issues.basho.com/show_bug.cgi?id=1318 to track this.

I don't have a workaround that I can think of offhand, unfortunately.

Andrew


Re: Absolute consistency

Jon Meredith
Hi list,

After some discussion internally, we've agreed that setting PR=R, PW=W=DW and PR+PW > N is insufficient to guarantee reading your writes.
In the case where PR=quorum and PW=quorum (say, for N=3, that would mean PR=R=PW=W=DW=2), there is at least one case where you would not be *guaranteed* to read your write.

It goes like this:
  1. Write to a preference list containing primary1, primary2, primary3.
     The write succeeds on primary1 and primary2 and fails on primary3
     (say, out of disk space).
  2. Primary2 goes down.
  3. Read from a preference list containing primary1, fallback2,
     primary3. Fallback2 and primary3 answer first, fulfilling R and
     giving stale data.
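
To make the failure window concrete, here is a toy Python sketch of that
sequence (the names are invented for illustration; this is not Riak
code):

# Toy model of the scenario above.
replicas = {"primary1": "old", "primary2": "old", "primary3": "old"}

# 1. Write with PW=W=DW=2: succeeds on primary1 and primary2, fails on
#    primary3 (out of disk space). The write is acknowledged.
replicas["primary1"] = "new"
replicas["primary2"] = "new"

# 2. primary2 goes down; an empty fallback takes its slot in the
#    preference list.
del replicas["primary2"]
replicas["fallback2"] = None

# 3. Read with PR=R=2: the preference list still contains two primaries
#    (primary1, primary3), so the PR check passes. fallback2 and
#    primary3 answer first, satisfying R=2 with stale data.
first_two_replies = [replicas["fallback2"], replicas["primary3"]]
print(first_two_replies)  # [None, 'old'] -- the acknowledged write is unseen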

The current best option I can think of is to set PR=R=PW=W=DW=N to ensure that all primary replicas, and only primary replicas, are involved. Any failure will make access to those objects unavailable, but that may be better than nothing for some use cases.

I've added a task for us to investigate whether applying the PR/PW constraint to the replies received from the vnodes is sufficient to make it more useful; however, with the way that handoff works, there are likely to be windows of inconsistency with PR/PW set to anything other than N. Riak is at its heart an eventually consistent database, but investigating alternate approaches to stronger consistency is a very interesting research topic for us going into 2012.

Best regards, 
Jon


--
Jon Meredith
Platform Engineering Manager
Basho Technologies, Inc.



Re: Absolute consistency

Les Mikesell
On Tue, Jan 10, 2012 at 4:51 PM, Jon Meredith <[hidden email]> wrote:
> Hi list,
>
> After some discussion internally we've agreed that setting PR=R, PW=W=DW and
> PR+PW > N is insufficient to guarantee reading your writes.

How do things like Mongo and Elasticsearch manage atomic operations
while still being redundant?

--
  Les Mikesell
    [hidden email]
