We've been running with 1.3.1 for most of this week. Generally it's been
going well. We especially feel happier knowing that Active Anti-Entropy
is keeping an eye on things. As we mostly use MapReduce queries, we
rarely triggered any read repairs, so it's good that we'll be getting
repairs from now on. Nice work!

However, a few things have popped up that I'd be interested in getting
some advice about.

Firstly, as mentioned in an earlier message (that seems to have
fallen on deaf ears :-) ) we had a couple of 1.2.1 nodes crash when I
upgraded one of the other nodes to 1.3.1. The current theory is that I
made the mistake of installing the new Riak package on all the nodes
before starting the upgrade. When I restarted the first node it started
doing its handoff checks. The two 1.2.1 nodes that had vnode replicas of
the new 1.3.1 node tried to start their riak_core_handoff_receiver
functions. The only thing I can think of is that the 1.2.1 nodes didn't
actually have those functions in memory so went to disk to load them.
Because I'd upgraded the Riak software, but hadn't restarted it yet, it
couldn't find the module files it was expecting so it failed. That's the
theory, anyway. So, tip of the day: don't install the new package on a
node until you're ready to restart that node!
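
For what it's worth, the per-node ordering I should have followed can be sketched roughly as below. This is only an illustration in Python: the node names and the package file name are made up, `dpkg -i` assumes a Debian-style package, and the function just builds the list of commands rather than running anything. The point is that each node is stopped before its package is replaced, and the next node isn't touched until the previous one is back up.

```python
# Sketch only: builds a rolling-upgrade plan, one node at a time.
# Node names and the package path are hypothetical placeholders.

def rolling_upgrade_plan(nodes, package_path):
    """Return (node, command) steps for upgrading nodes one by one.

    Each node is stopped BEFORE its package is installed, so a running
    node never has its module files replaced underneath it.
    """
    plan = []
    for node in nodes:
        plan.append((node, "riak stop"))
        # Assumes a Debian-style package; swap in your package manager.
        plan.append((node, "dpkg -i " + package_path))
        plan.append((node, "riak start"))
        # Check for outstanding handoffs before moving to the next node.
        plan.append((node, "riak-admin transfers"))
    return plan


if __name__ == "__main__":
    steps = rolling_upgrade_plan(["riak1", "riak2"], "riak_1.3.1_amd64.deb")
    for node, command in steps:
        print(node + ": " + command)
```

Had I done it this way, the 1.2.1 nodes would still have had their own module files on disk when the first 1.3.1 node started its handoffs.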

Secondly, we've noticed a significant change in our FSM times since
upgrading (see the attached graph). The red-ish lines are the 95th percentile
"puts" from our four nodes. The blue-ish lines are "gets". We were
averaging a stable sub-2ms for puts before the upgrade and now we're
closer to 4ms with a lot of jitter. The gets are unchanged. Is this
related to active anti-entropy? The AAE trees have been indexed but
we're still seeing that puts are slower.
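
In case it helps anyone compare numbers, here is a rough Python sketch of how figures like these can be read out of a node's stats. It assumes the stats have already been fetched as a dictionary (for example from the node's /stats JSON), and that the 95th percentile stats are named node_put_fsm_time_95 and node_get_fsm_time_95 and reported in microseconds. The sample values are invented to resemble what we see, not real measurements.

```python
import json


def fsm_p95_ms(stats):
    """Convert the 95th percentile FSM times from microseconds to milliseconds.

    `stats` is assumed to be a dict holding the node_put_fsm_time_95 and
    node_get_fsm_time_95 values for one node.
    """
    return {
        "put": stats["node_put_fsm_time_95"] / 1000.0,
        "get": stats["node_get_fsm_time_95"] / 1000.0,
    }


if __name__ == "__main__":
    # Invented sample values, standing in for one node's stats output.
    sample = json.loads(
        '{"node_put_fsm_time_95": 4100, "node_get_fsm_time_95": 1500}'
    )
    print(fsm_p95_ms(sample))
```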

Finally, we've started seeing the following error pop up occasionally:
[error] <0.212.0> Supervisor riak_pipe_fitting_sup had child undefined
started with riak_pipe_fitting:start_link() at <0.4459.767> exit with
reason noproc in context shutdown_error
According to riak_pipe issue #49 on GitHub, the problem has been
around since 1.1.2, but we've only seen it since upgrading to 1.3.1. It
doesn't seem to be load related, we don't get any associated errors in
our application, and it happens less than once per day. Is it anything
we should be worrying about?