Riak 1.2.1 Crash During Rolling Upgrade to 1.3.1


Shane McEwan-2
G'day!

I upgraded our production 4-node Riak cluster from 1.2.1 to 1.3.1 on the
weekend. It didn't go as smoothly as expected.

After starting Riak on the first upgraded node, node01, I started
getting error messages on two nodes that had not yet been upgraded,
node02 and node03:

2013-06-08 21:22:50.596 [error] <0.149.0> gen_server
riak_core_handoff_manager terminated with reason: no match of right hand
value
{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}
in riak_core_handoff_manager:receive_handoff/1 line 492
2013-06-08 21:22:50.604 [error] <0.149.0> CRASH REPORT Process
riak_core_handoff_manager with 0 neighbours exited with reason: no match
of right hand value
{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}
in riak_core_handoff_manager:receive_handoff/1 line 492 in
gen_server:terminate/6 line 747
2013-06-08 21:22:50.605 [error] <0.143.0> Supervisor
riak_core_handoff_sup had child riak_core_handoff_manager started with
riak_core_handoff_manager:start_link() at <0.149.0> exit with reason no
match of right hand value
{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}
in riak_core_handoff_manager:receive_handoff/1 line 492 in context
child_terminated
2013-06-08 21:22:50.606 [error] <0.147.0> gen_server <0.147.0>
terminated with reason:
{{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...}
2013-06-08 21:22:50.608 [error] <0.147.0> CRASH REPORT Process
riak_core_handoff_listener with 1 neighbours exited with reason:
{{{badmatch,{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},{supervisor,do_start_child_i,3,[{file,"supervisor.erl"},{line,319}]},{supervisor,handle_call,3,[{file,"supervisor.erl"},{line,344}]},{gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,588}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,227}]}]}}}},[{riak_core_handoff_manager,receive_handoff,1,[{file,"src/riak_core_handoff_manager.erl"},{line,492}]},{riak_core_handoff_manager,handle_call,3,[{...},...]},...]},...}
in gen_server:terminate/6 line 747
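
As a rough aid for reading the traces above: in an Erlang `undef` error the
first stack frame names the function the VM could not find, followed by
the argument list it was called with. Here that frame is
riak_core_handoff_receiver:start_link called with a single argument, [].
The Python sketch below only pulls the module and function names out of
such log text. The regular expression and the helper name
`find_undef_calls` are illustrative assumptions, not part of Riak or its
tooling.

```python
import re

# Matches the head of an Erlang `undef` stack frame as it appears in the log:
#   {undef,[{Module,Function,Args,...
# Hypothetical pattern for illustration; it does not parse the argument list.
UNDEF_RE = re.compile(r"\{undef,\[\{(\w+),(\w+),")


def find_undef_calls(log_text):
    """Return the sorted, de-duplicated (module, function) pairs reported as undef."""
    return sorted(set(UNDEF_RE.findall(log_text)))


# A shortened copy of one of the log entries above.
sample = (
    "{error,{'EXIT',{undef,[{riak_core_handoff_receiver,start_link,[[]],[]},"
    "{supervisor,do_start_child_i,3,[{file,\"supervisor.erl\"},{line,319}]}]}}}"
)

print(find_undef_calls(sample))
# prints [('riak_core_handoff_receiver', 'start_link')]
```

Run over a whole log file, this gives a quick list of which calls were
missing on a node, without having to read each wrapped trace by hand.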

Eventually, after 5 minutes, Riak on node02 and node03 crashed
completely with:

2013-06-08 21:27:47.029 [error] <0.17586.989> Supervisor
riak_core_handoff_listener_sup had child riak_core_handoff_listener
started with riak_core_handoff_listener:start_link() at undefined exit
with reason bad return value: {error,eaddrinuse} in context start_error
2013-06-08 21:27:47.030 [error] <0.17583.989> Supervisor
riak_core_handoff_sup had child riak_core_handoff_listener_sup started
with riak_core_handoff_listener_sup:start_link() at undefined exit with
reason shutdown in context start_error
2013-06-08 21:27:47.031 [error] <0.139.0> Supervisor riak_core_sup had
child riak_core_handoff_sup started with
riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit
with reason shutdown in context start_error
2013-06-08 21:27:47.032 [error] <0.17594.989> CRASH REPORT Process
riak_core_handoff_listener with 1 neighbours exited with reason: bad
return value: {error,eaddrinuse} in gen_server:init_it/6 line 332
2013-06-08 21:27:47.033 [error] <0.17593.989> Supervisor
riak_core_handoff_listener_sup had child riak_core_handoff_listener
started with riak_core_handoff_listener:start_link() at undefined exit
with reason bad return value: {error,eaddrinuse} in context start_error
2013-06-08 21:27:47.034 [error] <0.17590.989> Supervisor
riak_core_handoff_sup had child riak_core_handoff_listener_sup started
with riak_core_handoff_listener_sup:start_link() at undefined exit with
reason shutdown in context start_error
2013-06-08 21:27:47.034 [error] <0.139.0> Supervisor riak_core_sup had
child riak_core_handoff_sup started with
riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit
with reason shutdown in context start_error
2013-06-08 21:27:47.035 [error] <0.139.0> Supervisor riak_core_sup had
child riak_core_handoff_sup started with
riak_core_handoff_sup:start_link() at {restarting,<0.17470.989>} exit
with reason reached_max_restart_intensity in context shutdown
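
The last entry, reached_max_restart_intensity, is the standard OTP
supervisor behaviour: a supervisor gives up and terminates once more than
a configured number of child restarts happen within a configured time
window. The sketch below models only that rule in Python. The function
name and the limits used in the examples are hypothetical and are not
Riak's actual settings.

```python
from collections import deque


def supervisor_gives_up(restart_times, max_restarts, period_seconds):
    """Return the time at which an OTP-style supervisor would give up, or None.

    Simplified model: the supervisor terminates once more than
    `max_restarts` restarts fall inside a sliding window of
    `period_seconds`.
    """
    window = deque()
    for t in restart_times:
        window.append(t)
        # Drop restarts that have fallen out of the sliding window.
        while window and t - window[0] > period_seconds:
            window.popleft()
        if len(window) > max_restarts:
            return t
    return None


# Hypothetical limits: at most 2 restarts per 10 seconds.
# Three failed restarts within a few milliseconds exceed the limit.
print(supervisor_gives_up([0.000, 0.003, 0.004], 2, 10))  # prints 0.004

# The same number of restarts spread a minute apart stays under the limit.
print(supervisor_gives_up([0, 60, 120], 2, 10))  # prints None
```

This matches the shape of the log above: the listener fails to start
with eaddrinuse several times within a few milliseconds, so the
restarts land in the same window and the supervisor shuts down.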

And on node01 I finally got some messages indicating something was wrong:

2013-06-08 21:27:43.849 [error]
<0.9407.0>@riak_core_handoff_sender:start_fold:226 hinted_handoff
transfer of riak_kv_vnode from 'riaknode01@10.1.1.13'
685078892498860742907977265335757665463718379520 to
'riaknode02@10.1.1.10' 685078892498860742907977265335757665463718379520
failed because of
exit:{noproc,{riak_core_gen_server,call,[{riak_kv_handoff_listener,'riaknode02@10.1.1.10'},handoff_port,infinity]}}
[{riak_core_gen_server,call,3,[{file,"src/riak_core_gen_server.erl"},{line,214}]},{riak_core_handoff_sender,start_fold,5,[{file,"src/riak_core_handoff_sender.erl"},{line,84}]}]

The fourth node, node04, kept running fine. I assume this is because it
doesn't hold any of node01's vnode replicas, so it wasn't involved in
any handoffs.

Anyway, I continued with the upgrades without any further incident,
upgrading the crashed nodes next and, finally, node04.

Everything seems to be running fine. Thankfully we were in a maintenance
window and I wasn't relying on the rolling upgrade capability to ensure
service continuity. But should I be worried that the crash left
something in a bad state? Or that something was already wrong and
caused the crash?

I have crash dumps if they're of any use.

Thanks!

Shane.

_______________________________________________
riak-users mailing list
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com