"forward_preflist" timeout error after 1.1 upgrade


Matthew A. Brown
Hi all,

We're seeing a new timeout error after upgrading our cluster to Riak
1.1. The error message:

{"phase":0,"error":"[timeout]","input":"{{<<\"service_profiles\">>,<<\"8fh/2\">>},{struct,[{<<\"key\">>,<<\"s-10925\">>}]}}","type":"forward_preflist","stack":"[]"}

Sean indicates that the "forward_preflist" error type suggests the
problem is with inter-node communication, but I am wondering if anyone
could provide more detail on how we might go about digging up the root
cause?

For what it's worth, we're setting a 1-hour timeout on the map/reduce
itself, but it's erroring out way before that interval has expired.
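
For reference, a minimal sketch of how a query-level timeout is usually
passed to Riak's HTTP MapReduce interface: the JSON request body carries
a `timeout` field in milliseconds.  The bucket, key, and key data below
are taken from the error message; the map phase (`Riak.mapValuesJson`)
and the helper name `build_mapred_request` are placeholders assumed for
illustration, not the actual job.

```python
import json

ONE_HOUR_MS = 60 * 60 * 1000  # the MapReduce timeout is given in milliseconds


def build_mapred_request(inputs, timeout_ms=ONE_HOUR_MS):
    """Build a JSON body suitable for a POST to Riak's /mapred endpoint.

    The single map phase is a stand-in (assumption); substitute the real
    phases of the query.
    """
    return {
        "inputs": inputs,  # list of [bucket, key, keydata] triples
        "query": [
            {"map": {"language": "javascript", "name": "Riak.mapValuesJson"}}
        ],
        "timeout": timeout_ms,
    }


body = build_mapred_request([["service_profiles", "8fh/2", {"key": "s-10925"}]])
print(json.dumps(body, indent=2))
```

This timeout bounds the query as a whole, so an error that appears well
before it expires has to come from somewhere else.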

Thanks for any help!

Mat

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: "forward_preflist" timeout error after 1.1 upgrade

bryan-basho
On Wed, Feb 22, 2012 at 11:26 AM, Matthew A. Brown
<[hidden email]> wrote:
> Hi all,
>
> We're seeing a new timeout error after upgrading our cluster to Riak
> 1.1. The error message:
>
> {"phase":0,"error":"[timeout]","input":"{{<<\"service_profiles\">>,<<\"8fh/2\">>},{struct,[{<<\"key\">>,<<\"s-10925\">>}]}}","type":"forward_preflist","stack":"[]"}

Hi, Matthew.  This is, indeed, a bit of a bug.  I've filed an issue:
https://github.com/basho/riak_kv/issues/290

The general problem is that inputs arrive at the fetch half of your map
phase faster than it can pump outputs to the processing half.  The
fetchers' queues back up, which leaves no room for retry requests when
fetchers run into not-founds and the like.  Those stalled retries
produce errors (internal timeouts of a sort, unrelated to the timeout
you've set), and the errors cause the MapReduce endpoint to kill the
query.

If you're willing to recompile Riak, you can modify the `q_limit`
fields in the pipe spec I mentioned in `riak_kv_mrc_pipe`.  Otherwise,
the most likely fix (until we resolve the issue) is to make the map
function (and anything else downstream) as fast as possible so it will
keep up.
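
The mechanism described above can be illustrated with a toy model.  The
sketch below is not Riak Pipe's implementation: it is a single bounded
queue with a producer, a consumer, and retries that give up if no slot
frees up in time.  Only the name `q_limit` comes from the message above;
every other name and number is an assumption chosen for illustration,
and retries are assumed to succeed on their second attempt.

```python
from collections import deque


def simulate(num_inputs, q_limit, consume_every,
             produce_every=2, retry_every=10, retry_wait=5):
    """Toy model of one fetcher queue feeding a slower or faster consumer."""
    queue = deque()
    pending = num_inputs   # inputs the producer has not enqueued yet
    next_ready = 0         # tick at which the producer has its next input
    waiting = []           # deadlines of retries waiting for a free slot
    fetched = handled = timeouts = 0
    tick = 0
    while pending or queue or waiting:
        tick += 1
        # Consumer: drains one item every `consume_every` ticks.
        if queue and tick % consume_every == 0:
            item = queue.popleft()
            if item == "input":
                fetched += 1
                if fetched % retry_every == 0:
                    # Pretend this fetch hit a not-found and needs a retry.
                    waiting.append(tick + retry_wait)
                else:
                    handled += 1
            else:
                handled += 1  # a retry that made it through the queue
        # Producer: a blocked sender takes any free slot before retries do.
        if pending and tick >= next_ready and len(queue) < q_limit:
            queue.append("input")
            pending -= 1
            next_ready = tick + produce_every
        # Retries: need a free slot before their deadline, else they time out.
        still_waiting = []
        for deadline in waiting:
            if len(queue) < q_limit:
                queue.append("retry")
            elif tick >= deadline:
                timeouts += 1
            else:
                still_waiting.append(deadline)
        waiting = still_waiting
    return {"handled": handled, "timeouts": timeouts}


slow = simulate(100, q_limit=4, consume_every=6)            # consumer lags
fast = simulate(100, q_limit=4, consume_every=1)            # consumer keeps up
big_queue = simulate(100, q_limit=200, consume_every=6)     # lagging, more room
print(slow, fast, big_queue)
```

In this model only the first configuration reports timeouts: the queue
stays full, so retries never find a slot.  Either a consumer that keeps
up or a queue limit large enough to absorb the backlog lets every retry
through, which mirrors the two workarounds suggested above.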

Hope that helps,
Bryan
