'not found' after join

'not found' after join

Greg Nelson
Hello riak users!

I have a 4 node cluster that started out as 3 nodes.  ring_creation_size = 2048, target_n_val is default (4), and all buckets have n_val = 3.

When I joined the 4th node, for a few minutes some GETs were returning 'not found' for data that was already in riak.  Eventually the data was returned, due to read repair I would assume.  Is this expected?  It seems that 'not found' and read repairs should only happen when something goes wrong, like a node goes down.  Not when adding a node to the cluster, which is supposed to be part of normal operation!

Any help or insight is appreciated!

Greg

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: 'not found' after join

Alexander Sicular
Search the list for a lengthy discussion on very large ring_creation_size. I'm wagering that 2048 is too large. If you can't imagine having more than 50 physical nodes, your ring size should not be more than 512.

Cheers,
Alexander

Sent from my Verizon Wireless BlackBerry

-----Original Message-----
From: Greg Nelson <[hidden email]>
Sender: [hidden email]
Date: Mon, 2 May 2011 18:48:36
To: <[hidden email]>
Subject: 'not found' after join


Re: 'not found' after join

Greg Nelson
That was my post re: large ring_creation_size. I can imagine having more than 50 nodes, as discussed in that thread. :)

But my question about riak's behavior after a node joins applies to any ring_creation_size...

Sent from my iPhone

On May 2, 2011, at 7:13 PM, [hidden email] wrote:


Re: 'not found' after join

Aphyr
In reply to this post by Greg Nelson
I'd like to chime in here by noting that it would be incredibly nice if
the client could distinguish between a record that is missing because
the vnode is unavailable, and a record that truly does not exist. My
consistency-repair system was running during partition handoff,
determined that several thousand users were "deleted", and removed their
following relationships.

--Kyle

On 05/02/2011 06:48 PM, Greg Nelson wrote:


Re: 'not found' after join

Ryan Zezeski-2
In reply to this post by Greg Nelson
Greg,

Your expectations are fair; just because you added a node doesn't mean Riak should return notfounds.  Unfortunately, we aren't quite there yet.  This is a side effect of how Riak currently implements handoff: it immediately updates/gossips the ring, causing many partitions to hand off immediately.  If a request comes in that relies on these partitions, it will get a notfound and perform read repair.  Your situation is amplified by the fact that you are going from 3 nodes to 4; more vnode shuffling occurs because of the small cluster size.

We're well aware of this and have it on our radar for improvement in a future release.

All this said, your data will be eventually consistent.  That is, all your data will eventually be handed off and things will work as normal.  It's only during handoff that you _may_ encounter notfounds.  Given that, it is best to add a new node to your cluster at the lowest-load times, and if you can spare the additional hardware, starting with a few more nodes is an even easier option.

-Ryan

On Mon, May 2, 2011 at 9:48 PM, Greg Nelson <[hidden email]> wrote:

Re: 'not found' after join

Greg Nelson
Ok, thank you Ryan.  I'm glad that what I'm seeing is expected behavior, although it is a little surprising.

As Kyle asked in a parallel reply on this thread: is it possible for the client to distinguish between these different scenarios where a notfound can be returned?  My initial thought is that when a node is being added, clients can do reads with r=1 and retry GETs that 404...?
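The retry half of that idea can be sketched like this (a minimal sketch: the `fetch` callable and the backoff numbers are hypothetical stand-ins, not a real Riak client; a real implementation would issue an HTTP GET with r=1 here):

```python
import time

def get_with_retry(fetch, key, retries=5, backoff=1.0):
    """Retry a GET that may transiently 404 during handoff.

    `fetch` is any callable returning (status, body); a real client
    would perform a Riak HTTP GET with r=1 at this point.
    """
    delay = backoff
    for _ in range(retries):
        status, body = fetch(key)
        if status != 404:
            return status, body
        time.sleep(delay)
        delay *= 2  # exponential backoff between attempts
    return 404, None

# Stub that 404s twice and then succeeds, mimicking read repair catching up.
responses = iter([(404, None), (404, None), (200, b"value")])
status, body = get_with_retry(lambda k: next(responses), "mykey", backoff=0.01)
```

Whether r=1 actually distinguishes the two notfound cases is a separate question, of course.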

On Monday, May 2, 2011 at 8:14 PM, Ryan Zezeski wrote:


Re: 'not found' after join

Greg Nelson
Actually, I just found this bug: https://issues.basho.com/show_bug.cgi?id=992

So it looks like the r=1 idea won't work.

On Monday, May 2, 2011 at 8:36 PM, Greg Nelson wrote:


Re: 'not found' after join

Nico Meyer
In reply to this post by Ryan Zezeski-2
Hi everyone,

I just want to note that I observed similar behaviour with somewhat larger clusters of 10 or so nodes. I first noticed that handoff activity after a node join (or leave, for that matter) involved many more partitions than I would have expected. By comparing the old and the new ring file, I found that more than 80 percent of the partitions had to be moved to another node.

My naive expectation was that joining a node to a cluster of size X would result in roughly ring_creation_size/(X+1) partitions being handed off, which is also the minimum if one expects a balanced cluster afterwards.

Furthermore, it would in theory be possible to move partitions in such a way that at least one partition from each preflist stays on the same node. Maybe for X>N it would even be possible to guarantee this for a basic quorum of each preflist, eliminating the notfound problem completely, but I am not sure about that.
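Concretely, with the numbers from this thread (the 80% figure is the observation above; applying it to Greg's 2048-partition ring is just for illustration):

```python
ring_size = 2048      # ring_creation_size from Greg's cluster
nodes_before = 3      # cluster size X before the join

# Minimum partitions that must move for a balanced cluster after the join:
expected_moves = ring_size // (nodes_before + 1)   # 2048 / 4 = 512

# Applying the >80% observation to the same ring size, for comparison:
observed_moves = int(ring_size * 0.80)             # ~1638
```

That is roughly a threefold gap between the balanced-cluster minimum and what was actually handed off.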

I may be able to provide some ring files to analyze this behaviour if
someone from basho is interested.

Cheers,
Nico

On Monday, 02.05.2011 at 23:14 -0400, Ryan Zezeski wrote:


Re: 'not found' after join

Greg Nelson
I just added node #5 to our cluster, and once again the experience during the subsequent 60-minute handoff period was pretty awful!  I just don't understand why this would be expected behavior while adding a node.  There doesn't seem to be any realistic way to join a node to an online cluster.  As far as I'm concerned this is a huge defect in Riak.

Read-repair didn't seem to kick in immediately for data.  My application was configured to retry GETs (with a few seconds of backoff), and still got 404s.  I manually requested an object repeatedly for over 20 minutes until finally getting a result.

I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the defect, but I'm wondering if there is more to it than this?  Especially since read-repair didn't quite seem to work.

Could what Daniel describes on that bug ("Only return not found when all vnodes have reported not found (or error)") be implemented as a configurable option?  Maybe something one could kick in when a node joins until all handoffs are complete?

What can we do to remedy this before I add node #6, #7, etc.?  We're storing huge amounts of data, which means that a) we'll be adding nodes often, and b) the amount of data to hand off will be large, which means long periods of handoff during which we don't want downtime.

Greg

On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:


Re: 'not found' after join

Kresten Krab Thorup
I've also been asking for this, and the current master has code to remedy these things, but it's not in an official release yet.

At the Erlang level, you can pass options to RiakClient:get as follows:


-type option() :: {r, pos_integer()} |         %% Minimum number of successful responses
                  {pr, non_neg_integer()} |    %% Minimum number of primary vnodes participating
                  {basic_quorum, boolean()} |  %% Whether to use basic quorum (return early
                                               %% in some failure cases)
                  {notfound_ok, boolean()} |   %% Count notfound responses as successful
                  {timeout, pos_integer() | infinity}. %% Timeout for vnode responses

So to get the semantics I think you're asking for, do a GET (assuming N=3) with

[{r,2},{pr, 2}, {basic_quorum, false}, {notfound_ok, true}]

So this will work as you want as long as there is only one node down.

During handoff you may see a new kind of error:

HTTP 503 / {error r_value_unsatisfied ...}

which is the behavior when basic quorum is disabled, i.e. the alternative to getting a notfound just because some node did not have the value.

Each of those is also available as a query parameter when doing an HTTP GET (quote the URL so the shell does not interpret the `&` characters):

curl 'http://127.0.0.1:8091/riak/buck/key?r=2&basic_quorum=false&notfound_ok=true'
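For clients that build such requests programmatically, the same options map straightforwardly onto query parameters. A small sketch (the host, port, bucket, and key are placeholders; the option names are the ones from the type spec above):

```python
from urllib.parse import urlencode

def riak_get_url(host, port, bucket, key, **opts):
    """Build the URL for a Riak HTTP GET with per-request options.

    opts may include r, pr, basic_quorum, notfound_ok, timeout;
    booleans are lowered to 'true'/'false' as the HTTP API expects.
    """
    params = {k: (str(v).lower() if isinstance(v, bool) else v)
              for k, v in opts.items()}
    return f"http://{host}:{port}/riak/{bucket}/{key}?{urlencode(params)}"

url = riak_get_url("127.0.0.1", 8091, "buck", "key",
                   r=2, basic_quorum=False, notfound_ok=True)
# Pass `url` to urllib.request.urlopen or any HTTP client.
```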

I'm also looking forward to a release that includes this, and I'm hoping the defaults can somehow be simplified/strengthened so that people new to Riak aren't so surprised by these things.

Kresten

On May 5, 2011, at 8:18 AM, Greg Nelson wrote:


Re: 'not found' after join

Ryan Zezeski-2
In reply to this post by Greg Nelson
Hi Greg,

You're right, the current situation for handoff is not optimal.  We are aware of these shortcomings and are working on solutions.  That said, there may be some knobs you can turn to minimize the amount of time you spend in handoff.

Since you are using a ring size of 2048 you are possibly testing the limits of certain subsystems inside Riak.  I'm not saying it shouldn't work, just that many things in Riak were probably coded with an expectation of a smaller ring size.

1. For example, riak_core has a `handoff_concurrency` setting that determines how many vnodes can concurrently handoff on a given node.  By default this is set to 4.  That's going to take a while with your 2048 vnodes and all :)

2. Also, by default, riak_core sets the `vnode_inactivity_timeout` to 60s.  That means that the vnode must be inactive for 60s before it will begin handoff.  You could try lowering this to expedite handoff.

3. _Do not_ make constant calls to `riak-admin transfers`.  Every time you call this you are resetting the vnode activity and stalling handoff.  I know, I know, how could anyone possibly be expected to know that?  Along with everything else, it's something we plan to address.

In summary, you could try adding the following to your app.config and adjusting accordingly.  And keep in mind not to hit `riak-admin transfers` too often, to give the vnodes a chance to hand off.

[
  %% Riak Core Config
  {riak_core, [
      ...
      %% settings to adjust handoff latency/throughput
      {handoff_concurrency, N},
      {vnode_inactivity_timeout, M},
      ...
  ]},
  ...
].

HTH,
-Ryan

On Thu, May 5, 2011 at 2:18 AM, Greg Nelson <[hidden email]> wrote:
I just added node #5 to our cluster, and once again the experience during the subsequent 60-minute handoff period was pretty awful!  I just don't understand why this would be expected behavior while adding a node.  There doesn't seem to be any realistic way to join a node to an online cluster.  As far as I'm concerned this is a huge defect in Riak.

Read-repair didn't seem to kick in immediately for data.  My application was configured to retry GETs (with a few seconds of backoff), and still got 404s.  I manually requested an object repeatedly for over 20 minutes until finally getting a result.

I think bug #992 (https://issues.basho.com/show_bug.cgi?id=992) describes the defect, but I'm wondering if there is more to it than this?  Especially since read-repair didn't quite seem to work.

Could what Daniel describes on that bug ("Only return not found when all vnodes have reported not found (or error)") be implemented as a configurable option?  Maybe something one could kick in when a node joins until all handoffs are complete?

What we can do to remedy this before I add node #6, #7, etc.  We're storing huge amounts of data, which means that a) we'll be adding nodes often, and b) the amount of data handoff will be large, which means long periods of handoff where we don't want to have downtime.

Greg

On Tuesday, May 3, 2011 at 2:30 AM, Nico Meyer wrote:

Hi everyone,

I just want to note that I observed similar behaviour with a somewhat
larger clusters of 10 or so nodes. I first noticed that handoff activity
after node join (or leave for that matter) involved a lot more
partitions than I would have expected. By comparing the old and the new
ring file, I found out that more than 80 percent of partitions had to be
moved to another node.
My naive expectation was that joining a node to a cluster of size X
would result in roughly ring_creation_size/(X+1) partitions to be handed
off, which would also be the minimum if one expects a balanced cluster
afterwards.
Furthermore it would in theory be possible to move partitions in such a
way that at least one partition from each preflist stays on the same
node. Maybe for X>N it should even be possible to guarantee this for a
basic quorum of each preflist, eliminating the notfound problem
completely, but I am not sure about that.

I may be able to provide some ring files to analyze this behaviour if
someone from basho is interested.

Cheer Nico

Am Montag, den 02.05.2011, 23:14 -0400 schrieb Ryan Zezeski:
Greg,


Your expectations are fair; just because you added a node doesn't mean
Riak should return notfounds. Unfortunately, we aren't quite there
yet. This is a side effect of how Riak currently implements handoff:
it immediately updates and gossips the new ring, causing
many partitions to hand off at once. If a request comes in that
relies on these partitions then it will get a notfound and perform
read repair. Your situation is amplified by the fact that you are
going from 3 nodes to 4; more vnode shuffling occurs because of the
small cluster size.


We're well aware of this and have it on our radar for improvement in a
future release.


All this said, your data will be eventually consistent. That is, all
your data will eventually be handed off and things will work as
normal. It's only during the handoff that you _may_ encounter
notfounds. In this case it would be best to add a new node to your
cluster at your lowest-load times; if you can spare additional
hardware, starting with a few more nodes is an even easier option.


-Ryan




_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: 'not found' after join

John D. Rowell
Hi Ryan, Greg,

2011/5/5 Ryan Zezeski <[hidden email]>
1. For example, riak_core has a `handoff_concurrency` setting that determines how many vnodes can concurrently handoff on a given node.  By default this is set to 4.  That's going to take a while with your 2048 vnodes and all :)

Won't that make the handoff situation potentially worse? From the thread I understood that the main problem was that the cluster was shuffling too much data around and thus becoming unresponsive and/or returning unexpected results (like "not founds"). I'm attributing the concerns more to an excessive I/O situation than to how long the handoff takes. If the handoff can be made transparent (no or little side effects) I don't think most people will really care (e.g. the "fix the cluster tomorrow" anecdote).

How about using a percentage of available I/O to throttle the vnode handoff concurrency? Start with 1 and monitor the node's I/O (kinda like 'atop' does, collecting CPU, disk and network metrics); if it is below the expected usage, increase the vnode handoff concurrency, and vice-versa.

I for one would be perfectly happy if the handoff took several hours (even days) if we could maintain the core riak_kv characteristics intact during those events. We've all seen looooong RAID rebuild times, and it's usually better to just sit tight and keep the rebuild speed low (slower I/O) while keeping all of the dependent systems running smoothly.
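The throttle sketched above is essentially an AIMD (additive-increase, multiplicative-decrease) loop, like TCP congestion control. A hypothetical sketch with invented thresholds; riak_core exposed no such adaptive knob at the time, only the static handoff_concurrency setting:

```python
# Hypothetical AIMD throttle for handoff concurrency, in the spirit of the
# proposal above. The 0.7 utilization target and cap of 16 are invented;
# riak_core only exposed a static handoff_concurrency setting at the time.

def next_concurrency(current, io_utilization, target=0.7, max_conc=16):
    """Pick the handoff concurrency for the next sampling interval."""
    if io_utilization > target:
        return max(1, current // 2)    # multiplicative decrease under pressure
    return min(max_conc, current + 1)  # additive increase while there is headroom

conc = 1
for util in [0.3, 0.4, 0.5, 0.9, 0.4, 0.5]:  # sampled I/O utilization
    conc = next_concurrency(conc, util)
print(conc)  # -> 4: ramped to 4, halved at the 0.9 spike, then probed back up
```

The key property is that a saturated node sheds handoff load quickly but regains it only gradually, which is exactly the "slow rebuild, smooth service" RAID behaviour described above.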

cheers
-jd


Re: 'not found' after join

Ryan Zezeski-2
John,

All great points.  The problem is that the ring changes immediately when a node is added.  So now, all of a sudden, the preflist is potentially pointing to nodes that don't have the data, and they won't have that data until handoff occurs.  The faster that data gets transferred, the less time your clients have to hit 'notfound'.

However, I agree completely with what you're saying.  This is just a side effect of how the system currently works.  In a perfect world we wouldn't care how long handoff takes and we would also do some sort of automatic congestion control akin to TCP Reno or something.  The preflist would still point to the "old" partitions until all data has been successfully handed off, and then and only then would we flip the switch for that vnode.  I'm pretty sure that's where we are heading (I say "pretty sure" b/c I just joined the team and haven't been heavily involved in these specific talks yet).

It's all coming down the pipe...

As for your specific I/O question re handoff_concurrency, you might be right.  I would think it depends on hardware/platform/etc.  I was offering it as a possible stopgap to minimize Greg's pain.  It's certainly a cure to a symptom, not the problem itself.

-Ryan


Re: 'not found' after join

Ben Tilly
OK, I've been sitting here watching this thread, and I'd really like
to understand what happens when a node leaves/joins.  I can't find any
really good documents describing it.  Based on the conversation as
I've followed it, here is a detailed description of my garbled
misunderstanding.  Please correct this.

Suppose that we have a ring with 9 nodes, named 1 through 9.  A new node joins.

1. The new node is placed as node 10.

2. The preflist for everyone gets updated.  Modulo roundoff:
     - 10% of node 1's vnodes are now pointed at node 2.
     - 20% of node 2's vnodes are now pointed at node 3.
     - 30% of node 3's vnodes are now pointed at node 4.
     - 40% of node 4's vnodes are now pointed at node 5.
     - 50% of node 5's vnodes are now pointed at node 6.
     - 60% of node 6's vnodes are now pointed at node 7.
     - 70% of node 7's vnodes are now pointed at node 8.
     - 80% of node 8's vnodes are now pointed at node 9.
     - 90% of node 9's vnodes are now pointed at node 10.
   All told about half of the vnodes are pointed to the wrong place.
   If you've replicated data 3 times, then keys whose first copy
   appears in the ranges (1/10, 1/9), (1/5, 2/9), (3/10, 1/3) have
   all 3 vnodes pointed to the wrong place.  That's 1/15 of the
   total data set.

3. With default settings, if 2 of your vnodes are on nodes that do
   not know about your data, then you will be told that data does
   not exist.  With other settings that can be chosen (notfound ok,
   basic_quorum false) you can make data that any node knows about
   findable.  Data that has all 3 vnodes in the wrong place is
   temporarily unavailable.

4. Nodes are now responding for vnodes that they don't have data
   for, and start asking for vnodes that have not been
   properly handed off.  They will ask for, by default, a maximum
   of 4 vnodes at a time.  And they cannot ask for a vnode until
   all activity on that vnode has stopped for a minimum of, by
   default, 60 seconds.

5. Once all vnodes are handed over, all data is available and no
   data was actually lost.

Please correct.  Or if there is a document somewhere that is likely to
clear up some of my misunderstandings, point me there so that I can
understand better what happens.  (I'm pretty sure that my
understanding has to be incomplete, because I know that target_n_val
exists, and it looks like it should be involved somehow.)


Re: 'not found' after join

Ben Tilly
I forgot to include in my previous email that in searching for
information I ran across
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2010-November/thread.html#2388
which suggests that handoff gets massively more complicated if
multiple nodes join/leave at once.  However it also looked like bugs
were filed, and it might have gotten fixed.

Is that resolved?  Here is my particular case of interest.  We're
looking at using Riak in clusters where there is a non-trivial risk
that 2 nodes could disappear without warning in rapid succession.
(Stuff running on one machine takes it down, fails over to a second
one, takes that down as well.)  If data has been replicated 5 times,
is Riak likely to survive somewhat gracefully?
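A back-of-envelope check of the n=5 scenario above, assuming the default quorum of r = w = n_val // 2 + 1 and ignoring fallback vnodes (which in practice take over for missing primaries and give extra slack):

```python
# Back-of-envelope for the question above: with n_val replicas and the
# default quorum of r = w = n_val // 2 + 1, how many primary nodes can
# vanish before a quorum read fails? This ignores fallback vnodes, which
# in practice stand in for missing primaries and help further.

def tolerated_losses(n_val):
    quorum = n_val // 2 + 1
    return n_val - quorum

print(tolerated_losses(3))   # -> 1
print(tolerated_losses(5))   # -> 2: two nodes can disappear and quorum holds
```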


Re: 'not found' after join

Alexander Sicular
In reply to this post by Ryan Zezeski-2
I'm really loving this thread. Generating great ideas for the way
things should be... in the future. It seems to me that "the ring
changes immediately" is actually the problem as Ryan astutely
mentions. One way the future could look is :

- a new node comes online
- introductions are made
- candidate vnodes are selected for migration (<- insert pixie dust magic here)
- the number of simultaneous migrations are configurable, fewer for
limited interruption or more for quicker completion
- vnodes are migrated
- once migration is completed, ownership is claimed
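The steps above imply a small state machine in which the ring only flips after handoff completes. A minimal sketch (hypothetical design, not how riak_core handled joins at the time of this thread):

```python
# Sketch of the join flow outlined above: a vnode's ownership flips only
# after its migration completes, and the number of simultaneous migrations
# is capped. Hypothetical design, not riak_core's behaviour.

from enum import Enum

class VnodeState(Enum):
    CANDIDATE = 1   # selected for migration to the new node
    MIGRATING = 2   # data streaming; the old owner still serves requests
    OWNED = 3       # handoff complete; the preflist now points at the new node

class Migration:
    def __init__(self, vnodes, max_concurrent=2):
        self.state = {v: VnodeState.CANDIDATE for v in vnodes}
        self.max_concurrent = max_concurrent

    def in_flight(self):
        return [v for v, s in self.state.items() if s is VnodeState.MIGRATING]

    def start_next(self):
        """Start migrations until the configurable concurrency cap is hit."""
        for v, s in self.state.items():
            if s is VnodeState.CANDIDATE and len(self.in_flight()) < self.max_concurrent:
                self.state[v] = VnodeState.MIGRATING

    def complete(self, vnode):
        self.state[vnode] = VnodeState.OWNED   # only now does the ring change

m = Migration(['v1', 'v2', 'v3'])
m.start_next()
print(m.in_flight())   # -> ['v1', 'v2']: never more than max_concurrent
```

The vnode-selection step (the pixie dust) is deliberately absent here; the point is only that ownership claims happen strictly after migration.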

Selecting vnodes for migration is where the unicorn cavalry attack the
dragon's den. If done right(er) the algorithm could be swappable to
optimize for different strategies. Don't ask me how to implement it,
I'm only a yellow belt in erlang-fu.

Cheers,
Alexander


Re: 'not found' after join

Andy Gross

Alex's description roughly matches up with some of our plans to address this issue. 

As with almost anything, this comes down to a tradeoff between consistency and availability.   In the case of joining nodes, making the join/handoff/ownership claim process more "atomic" requires a higher degree of consensus from the machines in the cluster.  The current process (which is clearly non-optimal) allows nodes to join the ring as long as they can contact one current ring member.  A more atomic process would introduce consensus issues that might prevent nodes from joining in partitioned scenarios.

A good solution would probably involve some consistency knobs around the join process to deal with a spectrum of failure/partition scenarios.

This is something of which we are acutely aware and are actively pursuing solutions for a near-term release.

- Andy



Re: 'not found' after join

Greg Nelson
In reply to this post by Alexander Sicular
The future I'd like to see is basically what I initially expected.  That is, I can add a single node to an online cluster and clients should not even see any effects of this or need to know that it's even happening -- except of course the side effects like the added load on the cluster incurred by gossiping new ring state, handing off data, etc.  But if no data has actually been lost, I don't believe data should ever be unavailable, temporarily or not.  And I'd like to be able to, as someone else mentioned, add a node and throttle the handoffs and let it trickle over hours or even days.

Waving hands and saying that eventually the data will make it is true in principle, but in practice, if you are following a read/modify/write pattern for some objects, you could easily lose data.  E.g., my application writes JSON arrays to certain objects, and when it wishes to append something to an array, it will read/append/write back.  If that initial read returns 404, then a new empty array is created.  This is normal operation.  But if that 404 is not a "normal" 404, it will happily create a new empty array, append, and write back a single-element array to that key.  Of course there could have been a 100-element array in Riak that was just unavailable at the time, which is now effectively lost.
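The failure mode described above is easy to reproduce with a toy in-memory store (plain Python, not a Riak client):

```python
# Reproducing the failure mode described above with a toy in-memory store:
# a read that spuriously reports "not found" during handoff turns a
# read/append/write-back cycle into silent data loss. Toy code, not a
# Riak client.

store = {"doc": list(range(100))}        # a 100-element array already stored

def read(key, available=True):
    """Simulate a GET; available=False models a notfound during handoff."""
    return store.get(key) if available else None   # both look like a 404

def append_item(key, item, available=True):
    current = read(key, available) or []  # 404 -> start a fresh array
    store[key] = current + [item]         # write back, clobbering the old value

append_item("doc", "new", available=False)  # handoff in progress: spurious 404
print(len(store["doc"]))                    # -> 1; the 100 old elements are gone
```

Since the client cannot distinguish "no such key" from "replicas temporarily elsewhere", the overwrite looks entirely legitimate to Riak, and read repair cannot undo it.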

Anyhow, I do understand the importance of knowing what will happen when doing something operationally like adding a node, and I understand that one can't naively expect everything to just work like magic.  But the current behavior is pretty poorly documented and surprising.  I don't think it was even mentioned in the operations webinar!  (Ok, I'll stop beating a dead horse.  :))

On Thursday, May 5, 2011 at 12:22 PM, Alexander Sicular wrote:

I'm really loving this thread. Generating great ideas for the way
things should be... in the future. It seems to me that "the ring
changes immediately" is actually the problem as Ryan astutely
mentions. One way the future could look is :

- a new node comes online
- introductions are made
- candidate vnodes are selected for migration (<- insert pixie dust magic here)
- the number of simultaneous migrations are configurable, fewer for
limited interruption or more for quicker completion
- vnodes are migrated
- once migration is completed, ownership is claimed

Selecting vnodes for migration is where the unicorn cavalry attack the
dragons den. If done right(er) the algorithm could be swappable to
optimize for different strategies. Don't ask me how to implement it,
I'm only a yellow belt in erlang-fu.

Cheers,
Alexander

On Thu, May 5, 2011 at 13:33, Ryan Zezeski <[hidden email]> wrote:
John,
All great points.  The problem is that the ring changes immediately when a
node is added.  So now, all the sudden, the preflist is potentially pointing
to nodes that don't have the data and they won't have that data until
handoff occurs.  The faster that data gets transferred, the less time your
clients have to hit 'notfound'.
However, I agree completely with what you're saying.  This is just a side
effect of how the system currently works.  In a perfect world we wouldn't
care how long handoff takes and we would also do some sort of automatic
congestion control akin to TCP Reno or something.  The preflist would still
point to the "old" partitions until all data has been successfully handed
off, and then and only then would we flip the switch for that vnode.  I'm
pretty sure that's where we are heading (I say "pretty sure" b/c I just
joined the team and haven't been heavily involved in these specific talks
yet).
It's all coming down the pipe...
As for your specific I/O question re handoff_concurrency, you might be right.
 I would think it depends on hardware/platform/etc.  I was offering it as a
possible stopgap to minimize Greg's pain.  It's certainly a cure to a
symptom, not the problem itself.
-Ryan

On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <[hidden email]> wrote:

Hi Ryan, Greg,

2011/5/5 Ryan Zezeski <[hidden email]>

1. For example, riak_core has a `handoff_concurrency` setting that
determines how many vnodes can concurrently handoff on a given node.  By
default this is set to 4.  That's going to take a while with your 2048
vnodes and all :)
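For reference, the knob Ryan mentions is, as far as I know, set in the riak_core section of app.config (sketch below with the default of 4 he quotes; treat the exact placement as an assumption and check the docs for your release):

```erlang
%% app.config (sketch -- assumed riak_core placement, verify for your release)
{riak_core, [
    %% Maximum number of vnodes allowed to hand off concurrently on this
    %% node. Higher values finish handoff sooner at the cost of more
    %% disk and network I/O.
    {handoff_concurrency, 4}
]}.
```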

Won't that make the handoff situation potentially worse? From the thread I
understood that the main problem was that the cluster was shuffling too much
data around and thus becoming unresponsive and/or returning unexpected
results (like "not founds"). I'm attributing the concerns more to an
excessive I/O situation than to how long the handoff takes. If the handoff
can be made transparent (no or little side effects) I don't think most
people will really care (e.g. the "fix the cluster tomorrow" anecdote).

How about using a percentage of available I/O to throttle the vnode
handoff concurrency? Start with 1 and monitor the node's I/O (kinda like
'atop' does, collecting CPU, disk and network metrics); if it is below the
expected usage, increase the vnode handoff concurrency, and vice-versa.
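A rough sketch of that feedback loop (hypothetical names throughout; additive-increase/multiplicative-decrease in the spirit of the TCP Reno comparison made earlier in the thread):

```python
# Sketch of the proposed adaptive throttle: grow handoff concurrency
# additively while measured I/O stays under budget, back off
# multiplicatively when it doesn't (AIMD, akin to TCP Reno).
# The sampled utilization list stands in for real atop-style metrics.

def adjust_concurrency(current, io_utilization, budget=0.7,
                       max_concurrency=16):
    """Return the next handoff concurrency value."""
    if io_utilization < budget:
        return min(current + 1, max_concurrency)   # additive increase
    return max(current // 2, 1)                    # multiplicative decrease

# Simulated control loop over sampled I/O utilization (0.0 - 1.0):
samples = [0.3, 0.4, 0.5, 0.9, 0.6, 0.2]
concurrency = 1
history = []
for io in samples:
    concurrency = adjust_concurrency(concurrency, io)
    history.append(concurrency)
print(history)   # [2, 3, 4, 2, 3, 4]
```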

I for one would be perfectly happy if the handoff took several hours (even
days) if we could maintain the core riak_kv characteristics intact during
those events. We've all seen looooong RAID rebuild times, and it's usually
better to just sit tight and keep the rebuild speed low (slower I/O) while
keeping all of the dependent systems running smoothly.

cheers
-jd


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: 'not found' after join

Mike Oxford
In reply to this post by Andy Gross
As someone not familiar with Riak's internals...

You have xN replication.
Make the transition nodes part of an x(N+1) set and write-only (e.g., they don't count toward the read quorum).

If you're set up for x3 replication, then the transition bucket ends up as part of an x4 replication.
As more queries come in, you will get up to 4 responses, but the write-only response gets tossed.
A split would give you up to a 2x2 where it's really 2x1+1 and you can rebuild normally.
Once your x4 replication is consistent, you remove one node from the replication set, taking you back to 3.

This, while not trivial, seems to leverage many of the pieces you already have in place and avoids the
problem of "slow replication fubars requests for indeterminate amounts of time."
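A toy model of the proposal (illustrative Python only, not Riak internals): writes fan out to the N replicas plus the write-only joiner, while the read quorum only ever consults the original N, so the under-populated newcomer can never produce a spurious 'not found'.

```python
# Toy model of the "x(N+1) with a write-only transition replica" idea.
# Illustrative Python only -- these are not Riak internals.

from collections import Counter

def write(replicas, write_only, key, value):
    """Writes fan out to the N established replicas *plus* the joiner."""
    for node in replicas + [write_only]:
        node[key] = value

def read(replicas, key, r=2):
    """Read quorum over the original N replicas only; the joiner's
    (possibly still-empty) copy is never consulted."""
    votes = Counter(node.get(key) for node in replicas)
    value, count = votes.most_common(1)[0]
    if count < r:
        raise RuntimeError("read quorum not met")
    return value

n3 = [{}, {}, {}]   # established x3 replica set
joining = {}        # transition node: written to, never read from

write(n3, joining, "k", "v1")
print(read(n3, "k"))    # v1 -- reads never depend on the joiner
print(joining)          # {'k': 'v1'} -- but the joiner is being populated
```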

Just a suggestion from the peanut gallery... :)

-mox



On Thu, May 5, 2011 at 1:06 PM, Andy Gross <[hidden email]> wrote:

Alex's description roughly matches up with some of our plans to address this issue. 

As with almost anything, this comes down to a tradeoff between consistency and availability.   In the case of joining nodes, making the join/handoff/ownership claim process more "atomic" requires a higher degree of consensus from the machines in the cluster.  The current process (which is clearly non-optimal) allows nodes to join the ring as long as they can contact one current ring member.  A more atomic process would introduce consensus issues that might prevent nodes from joining in partitioned scenarios.

A good solution would probably involve some consistency knobs around the join process to deal with a spectrum of failure/partition scenarios.

This is something of which we are acutely aware and are actively pursuing solutions for a near-term release.

- Andy


On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <[hidden email]> wrote:
I'm really loving this thread. Generating great ideas for the way
things should be... in the future. It seems to me that "the ring
changes immediately" is actually the problem as Ryan astutely
mentions. One way the future could look is:

- a new node comes online
- introductions are made
- candidate vnodes are selected for migration (<- insert pixie dust magic here)
- the number of simultaneous migrations are configurable, fewer for
limited interruption or more for quicker completion
- vnodes are migrated
- once migration is completed, ownership is claimed

Selecting vnodes for migration is where the unicorn cavalry attack the
dragon's den. If done right(er), the algorithm could be swappable to
optimize for different strategies. Don't ask me how to implement it,
I'm only a yellow belt in erlang-fu.

Cheers,
Alexander

On Thu, May 5, 2011 at 13:33, Ryan Zezeski <[hidden email]> wrote:
> John,
> All great points.  The problem is that the ring changes immediately when a
> node is added.  So now, all of a sudden, the preflist is potentially pointing
> to nodes that don't have the data and they won't have that data until
> handoff occurs.  The faster that data gets transferred, the less time your
> clients have to hit 'notfound'.
> However, I agree completely with what you're saying.  This is just a side
> effect of how the system currently works.  In a perfect world we wouldn't
> care how long handoff takes and we would also do some sort of automatic
> congestion control akin to TCP Reno or something.  The preflist would still
> point to the "old" partitions until all data has been successfully handed
> off, and then and only then would we flip the switch for that vnode.  I'm
> pretty sure that's where we are heading (I say "pretty sure" b/c I just
> joined the team and haven't been heavily involved in these specific talks
> yet).
> It's all coming down the pipe...
> As for your specific I/O question re handoff_concurrency, you might be right.
>  I would think it depends on hardware/platform/etc.  I was offering it as a
> possible stopgap to minimize Greg's pain.  It's certainly a cure to a
> symptom, not the problem itself.
> -Ryan
>
> On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <[hidden email]> wrote:
>>
>> Hi Ryan, Greg,
>>
>> 2011/5/5 Ryan Zezeski <[hidden email]>
>>>
>>> 1. For example, riak_core has a `handoff_concurrency` setting that
>>> determines how many vnodes can concurrently handoff on a given node.  By
>>> default this is set to 4.  That's going to take a while with your 2048
>>> vnodes and all :)
>>
>> Won't that make the handoff situation potentially worse? From the thread I
>> understood that the main problem was that the cluster was shuffling too much
>> data around and thus becoming unresponsive and/or returning unexpected
>> results (like "not founds"). I'm attributing the concerns more to an
>> excessive I/O situation than to how long the handoff takes. If the handoff
>> can be made transparent (no or little side effects) I don't think most
>> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
>>
>> How about using a percentage of available I/O to throttle the vnode
>> handoff concurrency? Start with 1, and monitor the node's I/O (kinda like
>> 'atop' does, collecting CPU, disk and network metrics), if it is below the
>> expected usage, then increase the vnode handoff concurrency, and vice-versa.
>>
>> I for one would be perfectly happy if the handoff took several hours (even
>> days) if we could maintain the core riak_kv characteristics intact during
>> those events. We've all seen looooong RAID rebuild times, and it's usually
>> better to just sit tight and keep the rebuild speed low (slower I/O) while
>> keeping all of the dependent systems running smoothly.
>>
>> cheers
>> -jd


Re: 'not found' after join

Ben Tilly
In reply to this post by Andy Gross
On Thu, May 5, 2011 at 1:06 PM, Andy Gross <[hidden email]> wrote:

>
> Alex's description roughly matches up with some of our plans to address this
> issue.
> As with almost anything, this comes down to a tradeoff between consistency
> and availability.   In the case of joining nodes, making the
> join/handoff/ownership claim process more "atomic" requires a higher degree
> of consensus from the machines in the cluster.  The current process (which
> is clearly non-optimal) allows nodes to join the ring as long as they can
> contact one current ring member.  A more atomic process would introduce
> consensus issues that might prevent nodes from joining in partitioned
> scenarios.

Why would the whole ring need to know about the join?  Suppose that,
when you ask a node for a piece of data, it had the option to reply, "I
don't have it; this data moved to X, and the boundary between us is at
Y."  Then the only nodes that need to know about a handoff are the two
involved, and everyone else can find out about it lazily.

Add to that the ability for any node to add another node to its right,
and any node to tell a node that it is on their left.  You only need
to reach one node to be able to join the ring.
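Sketched in toy form (illustrative only, nothing like Riak's actual preflist machinery), the redirect reply might look like:

```python
# Toy sketch of lazy handoff discovery: a node that handed off a key range
# replies with a redirect ("moved to X, boundary Y") instead of 'not found'.
# Illustrative only -- nothing like Riak's actual preflist machinery.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.moved = None          # (boundary_key, successor) after handoff

    def get(self, key):
        # Redirect instead of 'not found' for a range we no longer own:
        if self.moved is not None and key >= self.moved[0]:
            boundary, successor = self.moved
            return ("moved", successor, boundary)
        return ("value", self.data.get(key))

def lookup(node, key):
    """Clients chase redirects, learning new ownership lazily; a real
    client would cache the (boundary, successor) hint it just learned."""
    reply = node.get(key)
    while reply[0] == "moved":
        reply = reply[1].get(key)
    return reply[1]

a, b = Node("a"), Node("b")
a.data = {10: "x"}
b.data = {90: "y"}
a.moved = (50, b)                  # keys >= 50 were handed off to b

print(lookup(a, 10), lookup(a, 90))   # x y
```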

> A good solution would probably involve some consistency knobs around the
> join process to deal with a spectrum of failure/partition scenarios.
> This is something of which we are acutely aware and are actively pursuing
> solutions for a near-term release.
> - Andy

How near term?

I'm in the early stages of a development effort that is aiming to
release something this year.  We were hoping to use Riak, but this
could be a show stopper for us.

> On Thu, May 5, 2011 at 12:22 PM, Alexander Sicular <[hidden email]>
> wrote:
>>
>> I'm really loving this thread. Generating great ideas for the way
>> things should be... in the future. It seems to me that "the ring
>> changes immediately" is actually the problem as Ryan astutely
>> mentions. One way the future could look is:
>>
>> - a new node comes online
>> - introductions are made
>> - candidate vnodes are selected for migration (<- insert pixie dust magic
>> here)
>> - the number of simultaneous migrations are configurable, fewer for
>> limited interruption or more for quicker completion
>> - vnodes are migrated
>> - once migration is completed, ownership is claimed
>>
>> Selecting vnodes for migration is where the unicorn cavalry attack the
>> dragon's den. If done right(er), the algorithm could be swappable to
>> optimize for different strategies. Don't ask me how to implement it,
>> I'm only a yellow belt in erlang-fu.
>>
>> Cheers,
>> Alexander
>>
>> On Thu, May 5, 2011 at 13:33, Ryan Zezeski <[hidden email]> wrote:
>> > John,
>> > All great points.  The problem is that the ring changes immediately when
>> > a
>> > node is added.  So now, all of a sudden, the preflist is potentially
>> > pointing
>> > to nodes that don't have the data and they won't have that data until
>> > handoff occurs.  The faster that data gets transferred, the less time
>> > your
>> > clients have to hit 'notfound'.
>> > However, I agree completely with what you're saying.  This is just a
>> > side
>> > effect of how the system currently works.  In a perfect world we
>> > wouldn't
>> > care how long handoff takes and we would also do some sort of automatic
>> > congestion control akin to TCP Reno or something.  The preflist would
>> > still
>> > point to the "old" partitions until all data has been successfully
>> > handed
>> > off, and then and only then would we flip the switch for that vnode.
>> >  I'm
>> > pretty sure that's where we are heading (I say "pretty sure" b/c I just
>> > joined the team and haven't been heavily involved in these specific
>> > talks
>> > yet).
>> > It's all coming down the pipe...
>> > As for your specific I/O question re handoff_concurrency, you might be
>> > right.
>> >  I would think it depends on hardware/platform/etc.  I was offering it
>> > as a
>> > possible stopgap to minimize Greg's pain.  It's certainly a cure to a
>> > symptom, not the problem itself.
>> > -Ryan
>> >
>> > On Thu, May 5, 2011 at 1:10 PM, John D. Rowell <[hidden email]> wrote:
>> >>
>> >> Hi Ryan, Greg,
>> >>
>> >> 2011/5/5 Ryan Zezeski <[hidden email]>
>> >>>
>> >>> 1. For example, riak_core has a `handoff_concurrency` setting that
>> >>> determines how many vnodes can concurrently handoff on a given node.
>> >>>  By
>> >>> default this is set to 4.  That's going to take a while with your 2048
>> >>> vnodes and all :)
>> >>
>> >> Won't that make the handoff situation potentially worse? From the
>> >> thread I
>> >> understood that the main problem was that the cluster was shuffling too
>> >> much
>> >> data around and thus becoming unresponsive and/or returning unexpected
>> >> results (like "not founds"). I'm attributing the concerns more to an
>> >> excessive I/O situation than to how long the handoff takes. If the
>> >> handoff
>> >> can be made transparent (no or little side effects) I don't think most
>> >> people will really care (e.g. the "fix the cluster tomorrow" anecdote).
>> >>
>> >> How about using a percentage of available I/O to throttle the vnode
>> >> handoff concurrency? Start with 1, and monitor the node's I/O (kinda
>> >> like
>> >> 'atop' does, collecting CPU, disk and network metrics), if it is below
>> >> the
>> >> expected usage, then increase the vnode handoff concurrency, and
>> >> vice-versa.
>> >>
>> >> I for one would be perfectly happy if the handoff took several hours
>> >> (even
>> >> days) if we could maintain the core riak_kv characteristics intact
>> >> during
>> >> those events. We've all seen looooong RAID rebuild times, and it's
>> >> usually
>> >> better to just sit tight and keep the rebuild speed low (slower I/O)
>> >> while
>> >> keeping all of the dependent systems running smoothly.
>> >>
>> >> cheers
>> >> -jd