Fwd: Re: riak cluster suddenly became unresponsive

Fwd: Re: riak cluster suddenly became unresponsive

Ingo Rockel
and the riak-users mailer-daemon should really set a "reply-to"...

-------- Original Message --------
Subject: Re: riak cluster suddenly became unresponsive
Date: Tue, 19 Mar 2013 15:40:12 +0100
From: Ingo Rockel <[hidden email]>
To: Mark Phillips <[hidden email]>

Hi Mark,

thanks!

The 1.3 update is already planned.

But we will add the zdbbl setting first, as we ran into the same issue
again yesterday.

Ingo

On 19.03.2013 15:04, Mark Phillips wrote:

> Hi Ingo,
>
> Sorry for the delay in getting back to you.
>
> This looks symptomatic of some of the scheduler issues we fixed in 1.3.
> A few of the eleveldb issues in the release notes [1] provide
> precise details. Is upgrading a possibility?
>
> Tweaking your zdbbl in vm.args should alleviate some of the issues with
> busy buffers but upgrading is probably your best path here.
>
> Hope that helps. Keep us posted.
>
> Mark
>
> [1] https://github.com/basho/riak/blob/master/RELEASE-NOTES.md
>
> On Friday, March 15, 2013, Ingo Rockel wrote:
>
>     Hi,
>
>     we have a 12-node cluster running Riak 1.2.1 which went live a week
>     ago. Yesterday, from one minute to the next, put_fsm_time_95 and
>     get_fsm_time_95 rose from below 100 ms to several seconds. This
>     went on for about 25 minutes and then went away.
>
>     Checking the riak-logs of the nodes, I find a lot of these:
>
>     2013-03-14 17:48:06.388 [info]
>     <0.62.0>@riak_core_sysmon_handler:handle_event:89 Monitor got
>     {suppressed,port_events,1}
>     2013-03-14 17:48:06.889 [info]
>     <0.62.0>@riak_core_sysmon_handler:handle_event:85 monitor
>     busy_dist_port <0.7156.1>
>     [{initial_call,{riak_core_vnode,init,1}},{almost_current_function,{erlang,bif_return_trap,1}},{message_queue_len,1}]
>     {#Port<0.9083226>,'riak@172.22.3.22'}
>
>     These messages are logged all day, but only once every few minutes.
>     In the problematic time frame between 17:45 and 18:17, however, they
>     were logged several times per second. The node IP differs, but
>     it seems only three nodes were involved.
>
>     Except for these three nodes, CPU utilisation dropped by half on all
>     other nodes during this period. On the three nodes there was only a
>     slight drop.
>
>     We are using leveldb as storage backend. I also checked some of the
>     LOG files of leveldb and there are compactions logged, but these are
>     logged all the day every few hours.
>
>     During this time our software was quite unresponsive too, so I would
>     like to know what was causing this and what I might do to stop it.
>     Any ideas or hints?
>
>     I found this:
>
>     https://groups.google.com/forum/?fromgroups=#!topic/nosql-databases/GqbaeiKCSYE
>
>     where Jon Meredith suggests raising the buffer size to get rid of
>     the busy buffers by adding +zdbbl 16384 to the vm.args file. Might
>     this help?
>
>     Regards,
>
>              Ingo
>     --
>     Software Architect
>
>     Blue Lion mobile GmbH
>     Tel. +49 (0) 221 788 797 14
>     Fax. +49 (0) 221 788 797 19
>     Mob. +49 (0) 176 24 87 30 89
>
>     [hidden email]
>      >>> qeep: Hefferwolf
>
>     www.bluelionmobile.com
>     www.qeep.net
>
>     _______________________________________________
>     riak-users mailing list
>     [hidden email]
>     http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>


--
Software Architect

Blue Lion mobile GmbH
Tel. +49 (0) 221 788 797 14
Fax. +49 (0) 221 788 797 19
Mob. +49 (0) 176 24 87 30 89

[hidden email]
>>> qeep: Hefferwolf

www.bluelionmobile.com
www.qeep.net




Re: Re: riak cluster suddenly became unresponsive

Evan Vigil-McClanahan
Also, in the meantime, adding +swt very_low to your vm.args can help
lessen the incidence of this issue.
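
For reference, a minimal sketch of how the two settings mentioned in this thread might look together in vm.args. The values are the ones quoted in the thread (16384 from the linked Jon Meredith post, very_low from the suggestion above); they are not tuned recommendations, and the comments are illustrative rather than taken from any shipped Riak file. vm.args is read when the node starts, so each node would need a restart for the change to take effect.

```
## Distribution buffer busy limit, in kilobytes (Erlang VM default: 1024).
## 16384 is the value quoted earlier in this thread.
+zdbbl 16384

## Scheduler wakeup threshold; very_low is the setting suggested above.
+swt very_low
```

Whether these values suit a given cluster is an assumption to verify under load, for example by watching how often busy_dist_port appears in the logs afterwards.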


mailing list headers (was Re: riak cluster suddenly became unresponsive)

Justin Sheehy
Hi, Ingo.

On Mar 19, 2013, at 10:41 AM, Ingo Rockel wrote:

> and the riak-users mailer-daemon should really set a "reply-to"…

Most email client programs have two well-understood controls for replies, one for "reply (to sender)" and one for "reply to all."

We are not going to make one of them broken.

-Justin



