Complete Riak failure.

Complete Riak failure.

Richard Heycock-2
I've been running Riak for about a month, using it as a
persistent cache. Everything was fine until tonight, when
it appears to have had a catastrophic failure. Basically, when I make a
request using the HTTP interface I get a failure. I cannot be more
specific than that as there is no response code back; I think it's
getting a TCP reset ...

... That was about 10 minutes ago. When I tried to find out whether I was
getting a TCP reset I restarted the program and everything was fine.

Anyway most of the errors look like this:


=ERROR REPORT==== 9-Aug-2010::13:23:27 ===
{mochiweb_socket_server,256,{acceptor_error,{error,accept_failed}}}

=ERROR REPORT==== 9-Aug-2010::13:23:27 ===
    application: mochiweb
    "Accept failed error"
    "{error,emfile}"

And every so often I get:

=ERROR REPORT==== 9-Aug-2010::13:23:56 ===
** State machine <0.17333.0> terminating
** Last event in was {riak_vnode_req_v1,
                      182687704666362864775460604089535377456991567872,
                      {fsm,undefined,<0.17335.0>},
                      {riak_kv_put_req_v1,
                       {<<"uris">>,
                        <<"5ea5dc023dd73b5f711efde3d50b9e91da4c5027">>},
                       {r_object,<<"uris">>,
                        <<"5ea5dc023dd73b5f711efde3d50b9e91da4c5027">>,
                        [{r_content,
                          {dict,5,16,16,8,80,48,
                           {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                           {{[],[],
                             [[<<"Links">>]],
                             [],[],[],[],[],[],[],
                             [[<<"content-type">>,97,112,112,108,105,99,97,
                               116,105,111,110,47,106,115,111,110],
                              [<<"X-Riak-VTag">>,49,102,84,99,107,121,67,100,
                               81,67,83,85,102,67,102,118,68,55,80,50,82,118]],
                             [],[],
                             [[<<"X-Riak-Last-Modified">>|
                               {1281,360236,222890}]],
                             [],
                             [[<<"X-Riak-Meta">>]]}}},
                          <<"{\"uri\":\"http://feedproxy.google.com/~r/time/politics/~3/SGIQlFYkIy8/0,8599,2006898,00.html\",\"download_date\":\"20100809132314\"}">>}],
                        [{<<2,235,89,160>>,{1,63448579436}}],
                        {dict,1,16,16,8,80,48,
                         {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                         {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                           [[clean|true]],
                           []}}},
                        undefined},
                       118194806,63448579436,
                       [{returnbody,true}]}}
** When State == active
**      Data  == {state,182687704666362864775460604089535377456991567872,
                        riak_kv_vnode,
                        {state,182687704666362864775460604089535377456991567872,
                               riak_kv_bitcask_backend,
                               {#Ref<0.0.0.23192>,
                                "/var/lib/riak/bitcask/182687704666362864775460604089535377456991567872"},
                               [],false},
                        undefined,none}
** Reason for termination =
** {bad_return_value,{error,{write_locked,emfile}}}
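
As a reading aid for the reports above: the integer lists in the object metadata are Erlang strings printed as character codes, the X-Riak-Last-Modified triple appears to be an Erlang {MegaSecs, Secs, MicroSecs} timestamp, and emfile is the lowercase form of the POSIX errno name EMFILE. A minimal Python sketch for decoding them follows; the helper names are made up for illustration and are not part of Riak.

```python
import errno
import os
from datetime import datetime, timezone


def decode_charlist(codes):
    """Turn an Erlang-style list of character codes into a string."""
    return "".join(chr(c) for c in codes)


def decode_now(mega, secs, micro):
    """Convert an Erlang {MegaSecs, Secs, MicroSecs} triple to a UTC datetime."""
    whole = datetime.fromtimestamp(mega * 1_000_000 + secs, tz=timezone.utc)
    return whole.replace(microsecond=micro)


# Values copied from the state machine report above.
content_type = decode_charlist(
    [97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 47, 106, 115, 111, 110]
)
vtag = decode_charlist(
    [49, 102, 84, 99, 107, 121, 67, 100, 81, 67, 83, 85,
     102, 67, 102, 118, 68, 55, 80, 50, 82, 118]
)
modified = decode_now(1281, 360236, 222890)

print(content_type)          # application/json
print(vtag)                  # 1fTckyCdQCSUfCfvD7P2Rv
print(modified.isoformat())  # 2010-08-09T13:23:56.222890+00:00
# Human-readable text for EMFILE; the exact wording depends on the platform.
print(os.strerror(errno.EMFILE))
```

The decoded timestamp matches the time of the error report, which suggests the put that crashed the vnode was the one being logged.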


I've put a full log here:

    http://stuff.roughage.com.au/riak-failure.log.gz

It's nearly midnight here and I'm going to bed, but if anyone can shed any
light on this that'd be cool.

rgh


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Complete Riak failure.

Dmitry Demeshchuk
The error you get means that too many sockets are open
simultaneously. I guess there are two main possible reasons:

1. You are being spammed by someone
2. Your REST requests to Riak are somehow handled more slowly than new
requests arrive, so the number of open HTTP connections increases over
time and you start getting this message.

Some time ago I had exactly the same problem with YAWS (nothing
related to Riak, though). I migrated to mochiweb and
optimized some of our server-side code to reduce the response delay,
and that helped.

--
Best regards,
Dmitry Demeshchuk

Re: Complete Riak failure.

Bob Ippolito
On Mon, Aug 9, 2010 at 10:07 PM, Dmitry Demeshchuk <[hidden email]> wrote:
> The error you get means that too many sockets are opened
> simultaneously. I guess there are two main possible reasons:
>
> 1. You are being spammed by someone
> 2. Your REST requests to Riak are somehow handled slower than new
> requests come. So, the number of opened HTTP connections increase over
> time and you start getting this message.

Well, file handles are used for both sockets and files, so there might be a
backend/schema combination that likes lots of open files. Adding a
"ulimit -n SOME_BIG_NUMBER" to your start script would make sure you
have plenty of filenos to spare.
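
The limit that "ulimit -n" sets in a shell can also be inspected and raised from inside a process. The sketch below uses Python's standard resource module purely as an illustration, assuming a Unix-like host; for Riak itself the ulimit line in the start script is the natural place. raise_nofile_soft_limit is a hypothetical helper name, and 4096 is an arbitrary example value.

```python
import resource


def raise_nofile_soft_limit(target):
    """Raise the soft RLIMIT_NOFILE toward `target`, never past the hard limit.

    Returns the (soft, hard) pair in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no hard cap"; otherwise clamp to the hard limit.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    # Only ever raise; leave an already generous soft limit alone.
    if soft == resource.RLIM_INFINITY or target <= soft:
        return soft, hard
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = raise_nofile_soft_limit(4096)
print("open-file limit: soft=%s hard=%s" % (soft, hard))
```

An unprivileged process can move its soft limit anywhere up to the hard limit; going beyond the hard limit needs root or a change to the system configuration.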

-bob

Re: Complete Riak failure.

Dmitry Demeshchuk
Greetings, Bob.

I thought that the ulimit command concerns only the number of possible
connections to _one_ file/socket.
When I wrote my own MySQL client in Erlang, I sometimes got the
{error, einval} error because I had too many open
connections to the MySQL server, and ulimit solved this problem.
So the "ulimit" problem seems to me like a problem for a single
client that tries to make many connections to a server.

However, as far as I understand the explanation of the {error, emfile}
error, it means that too many _different_ files are open at once (I
guess it's some limit of the file system). I didn't find any
universal way to increase this limit.
This seems to be a problem of a server that has opened too many
simultaneous sockets. And these sockets are really different files, so
I'm not sure ulimit can solve this problem.

Am I wrong?

--
Best regards,
Dmitry Demeshchuk

Re: Complete Riak failure.

Bob Ippolito
On Mon, Aug 9, 2010 at 10:21 PM, Dmitry Demeshchuk <[hidden email]> wrote:
>
> I thought that the ulimit command concerns only the number of possible
> connections to _one_ file/socket.

That is not how it works.
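
To illustrate the point: on POSIX systems the "ulimit -n" value (RLIMIT_NOFILE) caps the total number of descriptors a single process may hold, and files and sockets draw from the same pool. The following is a small Python sketch, assuming a Unix-like host; open_until_emfile and the limit of 64 are illustrative choices and have nothing to do with Riak's own code.

```python
import errno
import os
import resource
import socket


def open_until_emfile(limit=64):
    """Lower the soft descriptor limit, then alternate between opening a file
    and creating a socket until the OS refuses.

    Returns (files_opened, sockets_opened, errno_name).
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft != resource.RLIM_INFINITY:
        limit = min(limit, soft)  # never raise the limit in this demo
    fds, socks = [], []
    err_name = None
    resource.setrlimit(resource.RLIMIT_NOFILE, (limit, hard))
    try:
        while err_name is None:
            try:
                if (len(fds) + len(socks)) % 2 == 0:
                    fds.append(os.open(os.devnull, os.O_RDONLY))
                else:
                    socks.append(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
            except OSError as exc:
                # Record the symbolic errno name of the first failure.
                err_name = errno.errorcode.get(exc.errno, str(exc.errno))
    finally:
        # Release everything and restore the original soft limit.
        for fd in fds:
            os.close(fd)
        for sock in socks:
            sock.close()
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return len(fds), len(socks), err_name


files, sockets, err = open_until_emfile(64)
print("opened %d files and %d sockets before %s" % (files, sockets, err))
```

The failure arrives once files plus sockets (plus whatever the process already had open, such as stdin, stdout and stderr) reach the limit, regardless of which kind of descriptor is requested next. That is the same EMFILE the Erlang VM reports as emfile.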

-bob

Re: Complete Riak failure.

Dave Smith
In reply to this post by Richard Heycock-2
Richard,

If you see this error again, could you please run lsof on the Erlang process before restarting it? Obviously, the Erlang VM ran out of file handles, but it's impossible to tell why without some information from the running process. We have fixed a bug involving file handle leaks in recent releases of bitcask; please make sure you're running 0.12.1.
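
When lsof is not at hand, a rough equivalent on Linux is to read /proc/<pid>/fd. The following is only a sketch under that assumption (Linux, plus permission to inspect the target process); fd_census is a hypothetical helper, and the pid you pass would be that of the Erlang VM process.

```python
import os
from collections import Counter


def fd_census(pid):
    """Classify the open descriptors of `pid` by reading /proc/<pid>/fd (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listing and inspection.
            continue
        if target.startswith("socket:"):
            counts["socket"] += 1
        elif target.startswith("pipe:"):
            counts["pipe"] += 1
        elif target.startswith("/"):
            counts["file"] += 1  # regular files, devices, deleted files
        else:
            counts["other"] += 1  # e.g. anon_inode entries such as epoll
    return counts


if os.path.isdir("/proc/self/fd"):
    # Inspect this interpreter as a stand-in for the real target process.
    print(dict(fd_census(os.getpid())))
```

A count dominated by "socket" would point toward connections piling up, while a large "file" count would point toward the storage backend holding many data files open.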

Hope that helps,

D.
