Understanding Riak's rebalancing and handoff behaviour

Understanding Riak's rebalancing and handoff behaviour

Sven Riedel
Hi,
I'm currently assessing how well riak fits our needs as a large scale data store.

In the course of testing riak, I've set up a cluster in Amazon's cloud with 6 nodes across two EC2 instances (m2.xlarge). After seeing surprisingly bad write performance (which I'll write more on in a separate post once I've finished my tests), I wanted to migrate the cluster to instances with better IO performance.

Let's call the original EC2 instances A and B. The plan was to migrate the cluster to new EC2 instances called C and D. During the following actions no other processes were reading/writing from/to the cluster. All instances are in the same availability zone.

What I did so far was to tell all riak nodes on B to leave the ring and let the ring re-stabilize. One surprising behaviour here was that the riak nodes on A suddenly all went into deep sleep mode (process state D) for about 30 minutes, and all riak-admin status/transfer calls claimed all nodes were down when in fact they weren't and were quite busy. But left to themselves they sorted everything out in the end.

Then I set up 3 new riak nodes on C and told them to join the cluster.

So far everything went well. riak-admin transfers showed me that both the nodes on A and the nodes on C were waiting for handoffs. However, the handoffs didn't start. I gave the cluster an hour, but no data transfer got initiated to the new nodes.

Since I didn't find any way to manually trigger the handoff, I told all the nodes on A (riak01, riak02 and riak03) to leave the cluster, and after the last node on A left the ring, the handoffs started.
After all the data in riak01 got moved to the nodes on C, the master process shut down and the handoff of the remaining data from riak02 and riak03 stopped. I tried restarting riak01 manually; however, riak-admin ringready claims that riak01 and riak04 (on C) disagree on the partition owners. riak-admin transfers still lists the same number of partitions awaiting handoff as when the handoff to the nodes on C started.

My current data distribution is as follows (via du -c):
On A:
1780 riak01/data
188948 riak02/data
3766736 riak03/data

On C:
13215908 riak04/data
1855584 riak05/data
5745076 riak06/data

riak04 and riak05 are awaiting the handoff of 341 partitions, riak06 of 342 partitions.

The ring_creation_size is 512, n_val for the bucket is 3, w is set to 1.

My questions at this point are:
1. What would normally trigger a rebalancing of the nodes?
2. Is there a way to manually trigger a rebalancing?
3. Did I do anything wrong with the procedure described above to be left in the current odd state by riak?
4. How would I rectify this situation in a production environment?

Regards,
Sven

------------------------------------------
Scoreloop AG, Brecherspitzstrasse 8, 81541 Munich, Germany, www.scoreloop.com
[hidden email]

Sitz der Gesellschaft: München, Registergericht: Amtsgericht München, HRB 174805
Vorstand: Dr. Marc Gumpinger (Vorsitzender), Dominik Westner, Christian van der Leeden, Vorsitzender des Aufsichtsrates: Olaf Jacobi


Re: Understanding Riak's rebalancing and handoff behaviour

Alexander Sicular
Mainly, I'm of the impression that you should join/leave a cluster one
node at a time.

-Alexander.


Re: Understanding Riak's rebalancing and handoff behaviour

Justin Sheehy
On Tue, Nov 9, 2010 at 10:30 AM, Alexander Sicular <[hidden email]> wrote:
> Mainly, I'm of the impression that you should join/leave a cluster one
> node at a time.

This impression is correct.

I believe that in the not-too-distant future a feature may be added to
enable stable addition of many nodes at once, but at this time the
right approach is to add a node, allow the ringstate to stabilize
through gossip, then repeat as needed.

-Justin

Re: Understanding Riak's rebalancing and handoff behaviour

Sven Riedel
In reply to this post by Alexander Sicular

On Nov 9, 2010, at 4:30 PM, Alexander Sicular wrote:

> Mainly, I'm of the impression that you should join/leave a cluster one
> node at a time.
>

Of course that is what I did when I told all nodes to join/leave.
Tell node1 to leave. Wait until riak-admin ringready returns true.
Tell node2 to leave. Wait until riak-admin ringready returns true.
etc.

Regards,
Sven


Re: Understanding Riak's rebalancing and handoff behaviour

Nico Meyer
In reply to this post by Sven Riedel
Hi,

I have also seen similar (though not quite the same) problems when
removing nodes from the cluster. Normally what happens is that most
partitions are moved away from the node that was removed, but the handoff
of some partitions will always produce an error like this:

=ERROR REPORT==== 10-Nov-2010::11:13:28 ===
Handoff sender riak_kv_vnode 1097553475690883148533821910506661759878332153856 failed error:{badmatch,{error,closed}}

On the node to which the data is handed off, an error like this occurs:


=ERROR REPORT==== 10-Nov-2010::11:36:14 ===
** Generic server <0.9664.1698> terminating
** Last message in was {tcp,#Port<0.12814448>,
                            [1|
                             <<1,119,2,136,253,10,6,101,118,101,110,116,115,
                               18,10,56,56,51,50,49,49,53,57,48,52,26,224,4,
                               .....
                               181,29,211,248,13,58,118,211,2>>]}
** When Server state == {state,#Port<0.12814448>,
                               1097553475690883148533821910506661759878332153856,
                               riak_kv_vnode,<0.9665.1698>,0}
** Reason for termination ==
** {normal,{gen_fsm,sync_send_all_state_event,
                    [<0.9665.1698>,
                     {handoff_data,<<1,119,2,136,253,10,6,101,118,101,110,116,
                                     115,18,10,56,56,51,50,49,49,53,57,48,52,
                                     .....
                                     211,2>>},
                     60000]}}



I cut the large binaries here for the sake of brevity.

Digging a little deeper I noticed that the distribution of partitions on
the node that was removed, as shown by riak-admin status, was always a
little different from the other nodes, which were still members of the
ring. Usually the partitions are quite unbalanced on the removed node.

So I compared the ring states manually using the console, and in the
ring state on the removed node quite a few partitions were assigned to
different nodes than what the other nodes thought.
After I manually synced the ring on the leaving node with the rest of
the cluster by doing this on the console:

{ok,Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
get_my_ring, []).
riak_core_ring_manager:set_my_ring(R).

the remaining partitions could be handed off successfully. Of course the
nodename in the ring is wrong after that.
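
For reference, spotting such a disagreement from the console might look
something like the sketch below. 'riak@otherNode' is a placeholder, and I
am assuming riak_core_ring:all_owners/1 is available in your riak_core
version, so treat it as an illustration rather than a recipe:

%% Fetch this node's ring and the other node's ring, then list every
%% partition index the two rings assign to different owners.
{ok, MyRing} = riak_core_ring_manager:get_my_ring().
{ok, OtherRing} = rpc:call('riak@otherNode', riak_core_ring_manager,
                           get_my_ring, []).
[{Idx, Here, There} ||
    {{Idx, Here}, {Idx, There}} <-
        lists:zip(riak_core_ring:all_owners(MyRing),
                  riak_core_ring:all_owners(OtherRing)),
    Here =/= There].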


My analysis of what happens was as follows:

- When the node is removed, the ring state is updated accordingly and sent
  to all nodes, including the one that was just removed.

- During some later round of gossip, the ring is changed again. But this
  time it is not updated on the node that was just removed, since only
  members of the ring are considered for gossiping.

- The leaving node then hands off some partitions to nodes which don't own
  those partitions (anymore).

- Now two things can happen:
  - The (wrong) node happens to have data for the partition in question
    (so the vnode is running) -> the handoff works, although the data is
    shipped off again shortly afterwards to the correct node. This is very
    inefficient, but it works.
  - The node has no data for that partition -> the vnode is started on
    demand and immediately exits again, because it tries to do a handoff
    itself (after a timeout of 0 seconds when first started) which
    immediately completes.


I am not sure why the ring changes twice. Maybe it has to do with the
fact that the target_n_val setting is not consistent within at least one
of our clusters, which I just discovered myself. It doesn't seem to
depend on the number of nodes or the ring_creation_size, as I have
seen this problem consistently even after removing several nodes in a
row and on two different clusters with a ring size of 1024 and 64
respectively.

Also riak-admin ringready will not recognize this problem, as far as I
read the code, because only the ring states of the current ring members
are compared. I haven't tried it, cause I am still on 0.12.0.
The same is apparently true for riak-admin transfers, which might tell
you that there are no handoffs left, even if the removed node still has
data.

I discovered another problem while debugging this. If you restart a node
(or it crashes) that you removed from the cluster and which still has
data, it won't start handing off its data afterwards. The reason is that
the node watcher does not get notified that the other nodes are up, so
all of them are considered down. This too can only be worked around
manually via the Erlang console.
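
To give a rough idea of the kind of console intervention I mean (only a
sketch: the node names are placeholders, and whether re-establishing the
connections is enough on its own to satisfy the node watcher will depend
on your version):

%% From the console of the restarted node, reconnect to the other cluster
%% members by hand and check that they show up again.
net_adm:ping('riak@nodeC1').
net_adm:ping('riak@nodeC2').
nodes().    %% should list the other cluster members again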

If any of these problems have been fixed in a recent version, I am sorry
for taking up your time and bandwidth :-). A short skimming of the
relevant modules didn't reveal anything hinting that way.



Cheers,
Nico



Re: Understanding Riak's rebalancing and handoff behaviour

Sven Riedel
Hi,
thanks for the detailed reply. So you would suggest that somehow the partition allocation got into an inconsistent state across nodes. I'll have to check the logs to see if anything similar to your dump pops up.

> So I compared the ring states manually using the console, and in the
> ring state on the removed node quite a few partitions were assigned to
> different nodes than what the other nodes thought.
> After I manually synced the ring on the leaving node with the rest of
> the cluster by doing this on the console:
>
> {ok,Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
> get_my_ring, []).
> riak_core_ring_manager:set_my_ring(R).


That ought to be 

riak_core_ring_manager:set_my_ring( Ring ).

right? Just verifying because my Erlang is rather rudimentary :)

> Also riak-admin ringready will not recognize this problem, as far as I
> read the code, because only the ring states of the current ring members
> are compared. I haven't tried it, cause I am still on 0.12.0.
> The same is apparently true for riak-admin transfers, which might tell
> you that there are no handoffs left, even if the removed node still has
> data.

I'm running 0.13.0, so if we're stumbling over the same cause it's still there.


> I discovered another problem while debugging this. If you restart a node
> (or it crashes) that you removed from the cluster and which still has
> data, it won't start handing off its data afterwards. The reason is that
> the node watcher does not get notified that the other nodes are up, so
> all of them are considered down. This too can only be worked around
> manually via the Erlang console.

Why would that have to be worked around at all? My understanding is that, through the data replication within the ring, a single node encountering a messy and fatal accident shouldn't destabilize the entire ring. The nodes which contain the duplicate data would just take over until a replacement node gets added and the newly dead node is removed (ok, via console).

So this still leaves me with some of my original questions open:

1. What would normally trigger a rebalancing of the nodes?
2. Is there a way to manually trigger a rebalancing?
3. Did I do anything wrong with the procedure described above to be left in the current odd state by riak?

Regards,
Sven



Re: Understanding Riak's rebalancing and handoff behaviour

Nico Meyer
Hi,

On Thursday, 11.11.2010, at 07:59 +0100, Sven Riedel wrote:

> Hi,
> thanks for the detailed reply. So you would suggest that somehow the
> partition allocation got into an inconsistent state across nodes. I'll
> have to check the logs to see if anything similar to your dump pops up.
>
> > So I compared the ring states manually using the console, and in the
> > ring state on the removed node quite a few partitions were assigned to
> > different nodes than what the other nodes thought.
> > After I manually synced the ring on the leaving node with the rest of
> > the cluster by doing this on the console:
> >
> > {ok,Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
> > get_my_ring, []).
> > riak_core_ring_manager:set_my_ring(R).
>
> That ought to be
>
> riak_core_ring_manager:set_my_ring( Ring ).
>
> right? Just verifying because my Erlang is rather rudimentary :)

You are right.
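
So, in one piece, the corrected console session reads:

{ok, Ring} = rpc:call('riak@otherNode', riak_core_ring_manager,
                      get_my_ring, []).
riak_core_ring_manager:set_my_ring(Ring).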

> > Also riak-admin ringready will not recognize this problem, as far as I
> > read the code, because only the ring states of the current ring members
> > are compared. I haven't tried it, cause I am still on 0.12.0.
> > The same is apparently true for riak-admin transfers, which might tell
> > you that there are no handoffs left, even if the removed node still has
> > data.
>
> I'm running 0.13.0, so if we're stumbling over the same cause it's
> still there.
>
> > I discovered another problem while debugging this. If you restart a node
> > (or it crashes) that you removed from the cluster and which still has
> > data, it won't start handing off its data afterwards. The reason is that
> > the node watcher does not get notified that the other nodes are up, so
> > all of them are considered down. This too can only be worked around
> > manually via the Erlang console.
>
> Why would that have to be worked around at all? My understanding is that,
> through the data replication within the ring, a single node encountering
> a messy and fatal accident shouldn't destabilize the entire ring. The
> nodes which contain the duplicate data would just take over until a
> replacement node gets added and the newly dead node is removed (ok, via
> console).

It's not a problem right away. But since the replicated data is not
actively synchronized in the background, the keys that were not copied
before the node died have one less replica. That is, until they are read
at least once, at which point read repair replicates the key again.
So it depends on your setup and requirements whether this is acceptable
or not.
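
If you want to trigger that read repair proactively, reading the keys once
from the console should do it. A minimal sketch, assuming the old
riak:local_client/0 console API of this era; the bucket and key are only
placeholders, and 3 matches the n_val mentioned in this thread:

%% Reading a key triggers read repair for any replicas that are missing.
{ok, C} = riak:local_client().
{ok, _Obj} = C:get(<<"events">>, <<"some-key">>, 3).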

>
> So this still leaves me with some of my original questions open:
>
> > 1. What would normally trigger a rebalancing of the nodes?
> > 2. Is there a way to manually trigger a rebalancing?
> > 3. Did I do anything wrong with the procedure described above to be
> >    left in the current odd state by riak?
>

Every vnode, which is responsible for one partition, checks after 60
seconds of inactivity whether it is residing on the node where it should
be, according to the ring state. If not, the data is sent to the correct
node. So the rebalancing of data is triggered by rebalancing the
partitions among the nodes in the ring.
The ring is balanced during the gossiping of the ring state, which every
node does with another random node every 0-60 (also randomly chosen)
seconds.
In the worst case it could take some minutes before the ring stabilizes,
but it's statistically likely to converge faster.
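
In other words, the per-vnode decision boils down to an ownership lookup
against the current ring. A rough console sketch of the same check (not the
actual riak_kv code, and it assumes riak_core_ring:all_owners/1 exists in
your version):

%% Partition indices the current ring assigns to some other node; any vnode
%% still running locally for one of these indices is a handoff candidate.
{ok, Ring} = riak_core_ring_manager:get_my_ring().
[Idx || {Idx, Owner} <- riak_core_ring:all_owners(Ring), Owner =/= node()].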

So there's nothing to trigger manually, really. One problem I see has
to do again with restarting a node which still has data that should be
handed off to another node. Initially only the vnodes that are owned by
a node are started, which by definition don't include the ones to be
handed off. But if the vnodes are never started, they won't perform the
handoff.
It kind of works anyway, but the vnodes are started and transferred
sequentially. Normally four partitions are transferred in parallel, so I
don't know if this is by design or by accident. The details are
convoluted enough to suspect the latter.
In any case this also has the effect that those partitions won't show up
in the output of riak-admin transfers, since only running vnodes are
considered.

I also forgot that I patched another problem in my own version of riak,
which will prevent any further handoff of data after four of them have
failed with an error or timeout. This probably happened in your case, if
your A nodes became unresponsive for 30 minutes (did the machine swap,
by the way?).
I should probably create a bug report for this, with my patch attached.
Stupid laziness!

After reading your original post again, I think almost all of the things
you saw can be explained by the bug that I mentioned in my first answer
(the ring status of removed nodes is not synchronized with the remaining
nodes). The problem obviously becomes worse if you remove several nodes
at a time.


I should really file some bug reports for all the problems I found, but
it is just too much effort for me right now. I have fixed the problems
that bugged us the most myself, so I should at least provide the patches
like a good open source citizen :-)

Bye,

Nico





Re: Understanding Riak's rebalancing and handoff behaviour

Scott Lystig Fritchie
In reply to this post by Nico Meyer
Nico Meyer <[hidden email]> wrote:

nm> I discovered another problem while debugging this. If you restart a node
nm> (or it crashes) that you removed from the cluster and which still has
nm> data, it won't start handing off its data afterwards. The reason is that
nm> the node watcher does not get notified that the other nodes are up, so
nm> all of them are considered down. This too can only be worked around
nm> manually via the Erlang console.

Nico, I've opened ticket 878 after scripting your scenario and
duplicating it on an Ubuntu9 32-bit box using the Riak package
riak_0.13.0-2_i386.deb.

    https://issues.basho.com/show_bug.cgi?id=878

On to Sven's problem that started this thread ... I've a larger script
that attempts to reproduce his problem, using 12 nodes installed on a
single Ubuntu9 32-bit machine (though reading carefully, Sven doesn't
get around to using EC2 instance number D, so only 9 nodes are used).

I have the script and output available at
http://www.snookles.com/scotttmp/riedel-scenario.tar.gz.  Sorry, I don't
have the rest of the basho_expect infrastructure available to outside
users right now(*), so it isn't possible for outsiders to re-run the
test, but it should show what's being done at a high level (the Python
script) and the detailed output (the other file, search for the regexp
"\*\*" for major section headings).

Sven, if I've made a major mistake on the script, please let me know
outside of the mailing list.  I'll try to fix the script and, if
necessary, open another Bugzilla ticket.

-Scott

(*) Releasing basho_expect with a reasonable open source license is on
the Basho todo list.

Re: Understanding Riak's rebalancing and handoff behaviour

Sven Riedel
In reply to this post by Nico Meyer

On Nov 11, 2010, at 4:05 PM, Nico Meyer wrote:

>>> I discovered another problem while debugging this. If you restart a node
>>> (or it crashes) that you removed from the cluster and which still has
>>> data, it won't start handing off its data afterwards. The reason is that
>>> the node watcher does not get notified that the other nodes are up, so
>>> all of them are considered down. This too can only be worked around
>>> manually via the Erlang console.
>>
>> Why would that have to be worked around at all? My understanding is that,
>> through the data replication within the ring, a single node encountering
>> a messy and fatal accident shouldn't destabilize the entire ring. The
>> nodes which contain the duplicate data would just take over until a
>> replacement node gets added and the newly dead node is removed (ok, via
>> console).
>
> It's not a problem right away. But since the replicated data is not
> actively synchronized in the background, the keys that were not copied
> before the node died have one less replica. That is, until they are read
> at least once, at which point read repair replicates the key again.
> So it depends on your setup and requirements whether this is acceptable
> or not.

So if the relevant data isn't read in a while and two more nodes go down (with
an n_val of 3), there is a chance that some data is lost.

>
>> So this still leaves me with some of my original questions open:
>>
>>> 1. What would normally trigger a rebalancing of the nodes?
>>> 2. Is there a way to manually trigger a rebalancing?
>>> 3. Did I do anything wrong with the procedure described above to be
>>>    left in the current odd state by riak?
>
> Every vnode, which is responsible for one partition, checks after 60
> seconds of inactivity whether it is residing on the node where it should
> be, according to the ring state. If not, the data is sent to the correct
> node. So the rebalancing of data is triggered by rebalancing the
> partitions among the nodes in the ring.
> The ring is balanced during the gossiping of the ring state, which every
> node does with another random node every 0-60 (also randomly chosen)
> seconds.
> In the worst case it could take some minutes before the ring stabilizes,
> but it's statistically likely to converge faster.

Ah, if the partition ownership changes, the data should get relocated straight away.
I had gossip pegged as handling only the immediate topology changes (which worked fine in my case; it just never got around to redistributing the data).

>
> So there's nothing to trigger manually, really. One problem I see has
> to do again with restarting a node which still has data that should be
> handed off to another node. Initially only the vnodes that are owned by
> a node are started, which by definition don't include the ones to be
> handed off. But if the vnodes are never started, they won't perform the
> handoff.

This sounds less confidence inspiring. On the off-chance that the node (or hardware) crashes between telling a node to leave the ring and the data handoff having been completed, you'll have more work on your hands than just restarting the node, and you have to know about this as well. This isn't a scenario that will happen often, but from the point of view of someone who wants the least amount of manual intervention when something goes wrong, I'd prefer resilience over performance.

> It kind of works anyway, but the vnodes are started and transferred
> sequentially. Normally four partitions are transferred in parallel, so I
> don't know if this is by design or by accident. The details are
> convoluted enough to suspect the latter.
> In any case this also has the effect that those partitions won't show up
> in the output of riak-admin transfers, since only running vnodes are
> considered.

From what I saw in my case, the number of handoffs was displayed correctly in the beginning; however, the numbers didn't decrease (or change at all) as data got handed around.

>
> I also forgot that I patched another problem in my own version of riak,
> which will prevent any further handoff of data after four of them have
> failed with an error or timeout. This probably happened in your case, if
> your A nodes became unresponsive for 30 minutes (did the machine swap,
> by the way?).

A was up and responsive the entire time. No swapping; the machines don't have any swap space configured. However, they were rather weak in the IO department, so the load did rise to 10 and above on bulk inserts, and to around 6 during the handoffs IIRC (on a two core machine). One of the reasons I wanted to check things out on instances with more IO performance.

> I should probably create a bug report for this, with my patch attached.
> Stupid laziness!
>
> After reading your original post again, I think almost all of the things
> you saw can be explained by the bug that I mentioned in my first answer
> (the ring status of removed nodes is not synchronized with the remaining
> nodes). The problem obviously becomes worse if you remove several nodes
> at a time.

Which means that I shouldn't just wait for riak-admin ringready to return TRUE, but for the data handoffs to have completed as well before changing the network topology again?

Thanks for your answers, they have been enlightening.

Regards,
Sven



Re: Understanding Riak's rebalancing and handoff behaviour

Nico Meyer
On Friday, 12.11.2010, at 08:43 +0100, Sven Riedel wrote:

> > It's not a problem right away. But since the replicated data is not
> > actively synchronized in the background, the keys that were not copied
> > before the node died have one less replica. That is, until they are read
> > at least once, at which point read repair replicates the key again.
> > So it depends on your setup and requirements whether this is acceptable
> > or not.
>
> So if the relevant data isn't read in a while and two more nodes go down
> (with an n_val of 3), there is a chance that some data is lost.
>

Correct. This might be highly unlikely though.

>
> >> So this still leaves me with some of my original questions open:
> >>
> >>> 1. What would normally trigger a rebalancing of the nodes?
> >>> 2. Is there a way to manually trigger a rebalancing?
> >>> 3. Did I do anything wrong with the procedure described above to be
> >>>    left in the current odd state by riak?
> >
> > Every vnode, which is responsible for one partition, checks after 60
> > seconds of inactivity whether it is residing on the node where it should
> > be, according to the ring state. If not, the data is sent to the correct
> > node. So the rebalancing of data is triggered by rebalancing the
> > partitions among the nodes in the ring.
> > The ring is balanced during the gossiping of the ring state, which every
> > node does with another random node every 0-60 (also randomly chosen)
> > seconds.
> > In the worst case it could take some minutes before the ring stabilizes,
> > but it's statistically likely to converge faster.
>
> Ah, if the partition ownership changes, the data should get relocated
> straight away. I had gossip pegged as handling only the immediate
> topology changes (which worked fine in my case; it just never got around
> to redistributing the data).
> > So there's nothing to trigger manually, really. One problem I see has
> > to do again with restarting a node which still has data that should be
> > handed off to another node. Initially only the vnodes that are owned by
> > a node are started, which by definition don't include the ones to be
> > handed off. But if the vnodes are never started, they won't perform the
> > handoff.
>
> This sounds less confidence inspiring. On the off-chance that the node
> (or hardware) crashes between telling a node to leave the ring and the
> data handoff having been completed, you'll have more work on your hands
> than just restarting the node, and you have to know about this as well.
> This isn't a scenario that will happen often, but from the point of view
> of someone who wants the least amount of manual intervention when
> something goes wrong, I'd prefer resilience over performance.
As I mentioned in my next paragraph (not very clearly, I must confess),
the handoff would eventually start, if not for the node watcher problem
I also mentioned in my first post. Scott Lystig Fritchie created a bug
report for this (https://issues.basho.com/show_bug.cgi?id=878). But this
problem only exists for nodes that are to be removed.
Other nodes might also need to hand off data, if some of their partitions
should be moved to another node. This also happens if you add a node.

The handoffs will eventually start, but it will take longer than normal.
At the start of the node and at every ring update a random vnode for a
partition not owned by the node is started, which might be empty, and
performs a handoff, which consequently might not have to do anything.
When the handoff finishes another random vnode is started (it could be the
same vnode though).

Effectively all partitions in the ring are eventually tested, one after
another, and transferred if necessary. But how long it will take to
transfer all partitions is somewhat unpredictable because of the random
part. Also, only one at a time will be transferred, which might take
longer than usual.


> > It kind of works anyway, but the vnodes are started and transferred
> > sequentially. Normally four partitions are transferred in parallel, so I
> > don't know if this is by design or by accident. The details are
> > convoluted enough to suspect the latter.
> > In any case this also has the effect that those partitions won't show up
> > in the output of riak-admin transfers, since only running vnodes are
> > considered.
>
> From what I saw in my case, the number of handoffs was displayed correctly
> in the beginning; however, the numbers didn't decrease (or change at all)
> as data got handed around.
>

If the node is not restarted, all vnodes are still running, so the
number would be correct. Most likely no handoffs were being done
anymore, because four of them crashed and the locks are not cleared in
this case. By default only four handoffs are allowed at the same time.
Have you looked for the error messages I mentioned in your logs?
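
If you want to check that limit on your own build, dumping the application
environments from the console is a safe first step. Which application owns
the setting in 0.13 (riak_kv or riak_core) and the exact key name are
assumptions on my part, so look before you change anything:

%% Look for a handoff-related key (something like handoff_concurrency).
application:get_all_env(riak_kv).
application:get_all_env(riak_core).
%% Once found, it can be raised at runtime (not persisted to app.config),
%% e.g. application:set_env(riak_kv, handoff_concurrency, 8).
%% Again, that the key exists under this name is only an assumption here.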

> >
> > I also forgot that I patched another problem in my own version of riak,
> > which will prevent any further handoff of data after four of them have
> > failed with an error or timeout. This probably happened in your case, if
> > your A nodes became unresponsive for 30 minutes (did the machine swap,
> > by the way?).
>
> A was up and responsive the entire time. No swapping; the machines don't
> have any swap space configured. However, they were rather weak in the IO
> department, so the load did rise to 10 and above on bulk inserts, and to
> around 6 during the handoffs IIRC (on a two core machine). One of the
> reasons I wanted to check things out on instances with more IO
> performance.
>

You wrote that riak-admin status said the nodes were down, that's what I
meant. It means at least that the Erlang VM was unresponsive.

> > I should probably create a bug report for this, with my patch attached.
> > Stupid laziness!
> >
> > After reading your original post again, I think almost all of the things
> > you saw can be explained by the bug that I mentioned in my first answer
> > (the ring status of removed nodes is not synchronized with the remaining
> > nodes). The problem obviously becomes worse if you remove several nodes
> > at a time.
>
> Which means that I shouldn't just wait for riak-admin ringready to return TRUE, but for the data handoffs to have completed as well before changing the network topology again?
>

Yes. You can check by using e.g. 'df' or by listing the directories in
your bitcask dir, which should be empty if everything went well.
But with the two bugs that are still present it might never finish.
The workaround for this is quite involved at the moment.
I will try to create bug reports in the next few days.
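
A quick way to do that last check from the console, assuming the default
bitcask backend (data_root is usually relative to the node's working
directory):

%% Find the configured bitcask directory and list what is left in it; the
%% per-partition subdirectories should be empty (or gone) once all
%% handoffs have really finished.
{ok, DataRoot} = application:get_env(bitcask, data_root).
file:list_dir(DataRoot).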




Re: Understanding Riak's rebalancing and handoff behaviour

Sven Riedel

On Nov 12, 2010, at 3:57 PM, Nico Meyer wrote:

> On Friday, 12.11.2010, at 08:43 +0100, Sven Riedel wrote:
>
>>> It's not a problem right away. But since the replicated data is not
>>> actively synchronized in the background, the keys that were not copied
>>> before the node died have one less replica. That is, until they are read
>>> at least once, at which point read repair replicates the key again.
>>> So it depends on your setup and requirements whether this is acceptable
>>> or not.
>>
>> So if the relevant data isn't read in a while and two more nodes go down
>> (with an n_val of 3), there is a chance that some data is lost.
>
> Correct. This might be highly unlikely though.

I agree. But it's still good to know, to be fully informed about the risks involved when choosing a small n_val with large, write-heavy datasets on few nodes :)


>>> It kind of works anyway, but the vnodes are started and transferred
>>> sequentially. Normally four partitions are transferred in parallel, so I
>>> don't know if this is by design or by accident. The details are
>>> convoluted enough to suspect the latter.
>>> In any case this also has the effect that those partitions won't show up
>>> in the output of riak-admin transfers, since only running vnodes are
>>> considered.
>>
>> From what I saw in my case, the number of handoffs was displayed correctly
>> in the beginning; however, the numbers didn't decrease (or change at all)
>> as data got handed around.
>
> If the node is not restarted, all vnodes are still running, so the
> number would be correct. Most likely no handoffs were being done
> anymore, because four of them crashed and the locks are not cleared in
> this case. By default only four handoffs are allowed at the same time.
> Have you looked for the error messages I mentioned in your logs?

I just had a look and I see a slightly different exception: 
On riak01 (which had been sending out data, and subsequently shut down):

ERROR: "Handoff receiver for partition ~p exiting abnormally after processing ~p objects: ~p\n" - [ 0,
{ noproc, 
  {gen_fsm,
   sync_send_all_state_event,
   [<0.24196.2404>,
    {handoff_data, 
    <<...binary...>>
   },
   60000]}}]

Nothing interesting that I can see on the receiving end(s).




>>> I should probably create a bug report for this, with my patch attached.
>>> Stupid laziness!
>>>
>>> After reading your original post again, I think almost all of the things
>>> you saw can be explained by the bug that I mentioned in my first answer
>>> (the ring status of removed nodes is not synchronized with the remaining
>>> nodes). The problem obviously becomes worse if you remove several nodes
>>> at a time.
>>
>> Which means that I shouldn't just wait for riak-admin ringready to return
>> TRUE, but for the data handoffs to have completed as well before changing
>> the network topology again?
>
> Yes. You can check by using e.g. 'df' or by listing the directories in
> your bitcask dir, which should be empty if everything went well.
> But with the two bugs that are still present it might never finish.
> The workaround for this is quite involved at the moment.
> I will try to create bug reports in the next few days.


I see, this wasn't clear to me. This was then probably what caused everything to get off track. 
I think I'll throw away the test cluster then and start over, this time not just waiting for ringready. Maybe this should be mentioned more prominently in the documentation alongside the description of riak-admin ringready. Having a "settled ring" implied to me that it's ok to proceed with topology changes :)


Regards,
Sven
