Bad MapReduce job brings Riak to a screeching halt?


Bad MapReduce job brings Riak to a screeching halt?

Brad Heller
Hello Riak world,

I've been experimenting with migrating some of our OLAP data into Riak recently. I'm still learning about the…particulars…of Riak, so apologies if the solution to this is obvious or this is an overt n00b question.

I'm developing on a three-node Riak cluster on my machine (OS X 10.8.1). I'm primarily using Ruby + Ripple, but I've done a lot of exploration with curl too. I'm also using Rekon as a way to peek into the data I'm storing.

The issue I'm facing: I tried to run an improperly-formatted MapReduce job against a bucket with about 45k keys in it and it seemed to crash Riak. Here's the job itself:

1.9.3p194 :065 > puts job.to_json
{"inputs":{"bucket":"raw_statistics","key_filters":[["starts_with","some_string"],["and",[[["tokenize",":",4]],[["between",1345197700,1345697700,true]]]]]},"query":[{"map":{"language":"javascript","keep":true,"name":"Riak.mapValuesJson"}}]}

I would expect about 2.5k matches to the map. Here's the output from one of the vnodes' error.log:

2012-08-29 19:27:52.908 [error] <0.420.0>@riak_pipe_vnode:new_worker:766 Pipe worker startup failed:fitting was gone before startup
2012-08-29 19:45:41.739 [error] <0.959.0> gen_fsm <0.959.0> in state active terminated with reason: no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 
2012-08-29 19:45:41.773 [error] <0.594.0> gen_fsm <0.594.0> in state active terminated with reason: no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 
2012-08-29 19:45:41.778 [error] <0.594.0> CRASH REPORT Process <0.594.0> with 1 neighbours exited with reason: no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 in gen_fsm:terminate/7 line 611 
2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]
2012-08-29 19:45:41.814 [error] <0.141.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.594.0> exit with reason no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 in context child_terminated
2012-08-29 19:45:41.818 [error] <0.959.0> CRASH REPORT Process <0.959.0> with 1 neighbours exited with reason: no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 in gen_fsm:terminate/7 line 611 
2012-08-29 19:45:41.822 [error] <0.141.0> Supervisor riak_core_vnode_sup had child undefined started with {riak_core_vnode,start_link,undefined} at <0.959.0> exit with reason no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 in context child_terminated
2012-08-29 19:45:41.943 [error] <0.962.0> gen_fsm <0.962.0> in state ready terminated with reason: no match of right hand value {error,{bad_filter,[<<"tokenize">>,<<":">>,4]}} in riak_kv_mapred_filters:'-logical_and/1-fun-0-'/1 line 176 

For what it's worth, the format of my keys is as follows (if anyone has any suggestions on a smarter way to format these, I'm all ears).

<some piece of user data>:<user ID>:<some other piece of data>:<timestamp in seconds>
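
For concreteness, the keys get built roughly like this (the field names are placeholders, not our real schema):

    # Illustrative only: placeholder fields in the layout above, i.e.
    # <user data>:<user ID>:<other data>:<timestamp in seconds>.
    def stat_key(user_data, user_id, other_data, time = Time.now)
      [user_data, user_id, other_data, time.to_i].join(":")
    end

    stat_key("page_view", 42, "web")  #=> "page_view:42:web:1345697700", say

The fourth ":"-separated field is the timestamp the tokenize filter above was meant to pull out.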

So my question is: Why did this completely kill Riak? This makes me pretty nervous--a bug in our app has the potential to bring down the ring! Is there anything we can do to protect against this?

And a bonus question: What is a reasonable way to query this? I can't maintain links as there will potentially be hundreds of thousands of these objects to query over (each one is pretty small). Is this a good candidate for a compound secondary index?
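
To make the question concrete, here's what I have in mind (riak-client-flavored Ruby, untested, with a made-up index name, so treat it as a sketch; exact method names may differ by client version):

    require 'riak'

    client    = Riak::Client.new                  # local node defaults
    bucket    = client.bucket("raw_statistics")
    user_id   = "0000000042"                      # zero-padded so _bin ranges sort correctly
    timestamp = 1345697700

    obj = bucket.new("page_view:#{user_id}:web:#{timestamp}")
    obj.data = { "views" => 1 }
    obj.indexes["user_ts_bin"] << "#{user_id}:#{timestamp}"   # composite 2i term
    obj.store

    # Range-query one user's time window instead of MapReducing the whole bucket:
    keys = bucket.get_index("user_ts_bin", "#{user_id}:1345197700".."#{user_id}:1345697700")

Would something like that be reasonable?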

Thanks for any help.

Cheers,

Brad Heller | Engineering Lead | Cloudability.com | 541-231-1514 | Skype: brad.heller | @bradhe | @cloudability




Re: Bad MapReduce job brings Riak to a screeching halt?

Alexander Sicular
What's your "ulimit -n" ?

I think you ran out of file descriptors. I cite the "IO error: lock" mumbo jumbo. 

-Alexander

@siculars

Sent from my iRotaryPhone

On Aug 29, 2012, at 23:07, Brad Heller <[hidden email]> wrote:

…snip…

Re: Bad MapReduce job brings Riak to a screeching halt?

Brad Heller
Interesting. 

$ ulimit -n
2560

I remember seeing something in the documentation about this… I also see Riak returning HTTP 405s every now and then when it's under load. Perhaps that's related?

On Aug 29, 2012, at 8:43 PM, Alexander Sicular <[hidden email]> wrote:

…snip…

Re: Bad MapReduce job brings Riak to a screeching halt?

Alexander Sicular
With three instances running, you should definitely bump it up. Try "ulimit -n 16384" (or more) in a terminal and then start your three instances from that terminal. Rerun your tests. 


@siculars

Sent from my iRotaryPhone

On Aug 30, 2012, at 2:10, Brad Heller <[hidden email]> wrote:

…snip…

Re: Bad MapReduce job brings Riak to a screeching halt?

bryan-basho
In reply to this post by Brad Heller
On Wed, Aug 29, 2012 at 11:07 PM, Brad Heller <[hidden email]> wrote:
> The issue I'm facing: I tried to run an improperly-formatted MapReduce job
> against a bucket with about 45k keys in it and it seemed to crash Riak.

…snip…

> So my question is: Why did this completely kill Riak? This makes me pretty
> nervous--a bug in our app has the potential to bring down the ring! Is there
> anything we can do to protect against this?

Hi, Brad. Indeed, you have found a bug around our validation of
keyfilters. I've filed an issue to track it:

https://github.com/basho/riak_kv/issues/387

The short version is that nested keyfilters (those inside and/or/not
clauses) are not validated until execution time. Because of the way
they are executed, any failure they raise happens on every vnode that
processes them, which is why there is so much error handling and
logging in your output.
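
For what it's worth, transform filters like tokenize chain in sequence
with whatever predicates follow them, so I believe the same intent can
be expressed as a flat pipeline with no "and" clause at all. A rough,
untested sketch of the inputs (note the between bounds as strings,
since tokenize yields string tokens):

    # Hedged sketch: each filter feeds the next, so no nesting is needed.
    job_inputs = {
      "bucket"      => "raw_statistics",
      "key_filters" => [
        ["starts_with", "some_string"],               # predicate on the whole key
        ["tokenize", ":", 4],                         # transform: 4th ':'-separated field
        ["between", "1345197700", "1345697700", true] # predicate on that token
      ]
    }

That sidesteps the unvalidated nesting entirely until the fix lands.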

I don't think this should have "crashed" Riak, though. The query would
have hung until its timeout, and there would have been quite a spew in
the logs, but Riak should have remained running and able to handle
other requests (barring a second problem, possibly related to KV vnode
workers dying like this during a fold operation). Could you share more
details about what you meant by "crash", please?

Thanks for reporting this.

Cheers,
Bryan


Re: Bad MapReduce job brings Riak to a screeching halt?

Kelly McLaughlin
In reply to this post by Brad Heller

On Aug 29, 2012, at 9:07 PM, Brad Heller <[hidden email]> wrote:
>
> So my question is: Why did this completely kill Riak? This makes me pretty nervous--a bug in our app has the potential to bring down the ring! Is there anything we can do to protect against this?
>

Riak 1.2 included a lot of changes to leveldb, and one of those was a switch to flock() instead of fcntl(F_SETLK) to try to make the locking a bit saner. Previously, using fcntl, multiple processes in the Erlang VM could get a lock on the same leveldb instance, and that could obviously lead to some conflicts. However, a consequence of the change to flock is that when a vnode crashes, the resources can still be locked by the previous process, and that results in this message:

        2012-08-29 19:45:41.785 [error] <0.23924.70>@riak_kv_vnode:init:265 Failed to start riak_kv_multi_backend Reason: [{riak_kv_eleveldb_backend,{db_open,"IO error: lock ../../tmp/riak/instance1/leveldb/0/LOCK: Resource temporarily unavailable"}}]

Currently we do not attempt to wait or retry the vnode restart, and this can cause the node to crash. I can understand you being a little nervous, but we are aware of this and are taking steps on two fronts to address it. First, as Bryan mentioned previously, we're looking at fixing the error conditions that cause the vnode to crash when it really should not. Second, we're looking at adding some retry logic for when the vnode does crash and the resources are still locked. Thanks for the report!
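
The distinction is easy to demonstrate outside of Riak (Ruby here purely for illustration): flock locks travel with the open file handle, so a second handle is refused even inside a single process, which is exactly the EWOULDBLOCK ("Resource temporarily unavailable") in the log line above:

    # flock: a second handle on the same LOCK file is refused, even within
    # one process. fcntl locks, by contrast, are held per-process, which is
    # how multiple openers inside one VM could previously all "hold" the lock.
    a = File.open("/tmp/demo.LOCK", File::RDWR | File::CREAT)
    a.flock(File::LOCK_EX | File::LOCK_NB)   #=> 0 (lock acquired)

    b = File.open("/tmp/demo.LOCK", File::RDWR | File::CREAT)
    b.flock(File::LOCK_EX | File::LOCK_NB)   #=> false (EWOULDBLOCK)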

Kelly

Re: Bad MapReduce job brings Riak to a screeching halt?

Brad Heller
Hey Kelly, Bryan,

Thanks for the replies. Good to hear this is being worked on! And sorry I didn't elaborate on "crashed." In this instance, crashed meant "stopped taking connections on the HTTP interface." I didn't check to see if the beam processes died (I think they did, as load decreased).

I bumped my ulimit -n based on the previous suggestions and that seemed to help. If/when I run into this again I will indeed post more details!

Thanks,
Brad

On Aug 30, 2012, at 3:16 PM, Kelly McLaughlin <[hidden email]> wrote:

…snip…


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com