Riak 1.3.1 crash when directory used by AAE is full

Riak 1.3.1 crash when directory used by AAE is full

Dave Brady
Greetings,

Some background: I have been testing AAE in our backup ring.  I did not want the AAE data to sit on our (expensive and comparatively limited) SSD disks, so I created an LV for it on the systems' SAS disks.

All seemed well for a few weeks.

I got to doing other stuff for a bit, and when I came back to this ring today, I noticed that the filesystem used by AAE was full on all nodes.  I then noticed that Riak had crashed on every node because of this problem.

I was not expecting that AAE issues would be able to kill Riak.

Anyone else had this happen?
--
Dave Brady

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Riak 1.3.1 crash when directory used by AAE is full

Mark Phillips-4
Hi Dave, 

Thanks for the info. A few follow up questions:

* How much total data is in the cluster?
* Have you changed the AAE default settings at all? If so, to what?
* How much space was allocated for the AAE FS?

> I was not expecting that AAE issues would be able to kill Riak.

So, we've never seen this before in testing, but, admittedly, we never tested the case where AAE wasn't given enough disk space. While it's a sub-optimal behavior, it's no different than Riak (or any other db daemon) running out of storage space and dying. That said, the docs are *very* sparse on AAE, and we only lay out the config defaults as part of the KV section in the app.config file [0]. At the very least we should have the expected system needs for AAE storage documented. There's probably a middle-ground mitigated with documentation in the short term. We're trying to freeze for 1.4 at the moment, but I'll make sure this gets some discussion time after that's done.

Mark  

On Mon, May 27, 2013 at 10:22 AM, Dave Brady <[hidden email]> wrote:
> Greetings,
>
> Some background: I have been testing AAE in our backup ring.  I did not want the AAE data to sit on our (expensive and comparatively limited) SSD disks, so I created an LV for it on the systems' SAS disks.
>
> All seemed well for a few weeks.
>
> I got to doing other stuff for a bit, and when I came back to this ring today, I noticed that the filesystem used by AAE was full on all nodes.  I then noticed that Riak had crashed on every node because of this problem.
>
> I was not expecting that AAE issues would be able to kill Riak.
>
> Anyone else had this happen?
> --
> Dave Brady

Re: Riak 1.3.1 crash when directory used by AAE is full

Shane McEwan-2
On 29/05/13 06:50, Mark Phillips wrote:
> At the very least we should have the
> expected system needs for AAE storage documented. There's probably a
> middle-ground mitigated with documentation in the short term. We're trying
> to freeze for 1.4 at the moment, but I'll make sure this gets some
> discussion time after that's done.

I was a little surprised when I started getting low disk space alerts
from my staging cluster after upgrading to 1.3.1 and turning on AAE. I
checked and along with the 68GB leveldb directory I now also have a
9.7GB anti_entropy directory. That's a 14% overhead.

We've got the space (and this is why we test on staging before deploying
to production) so it didn't break anything but people running close to
full will be caught out by this.
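
The overhead figure above can be reproduced with a short script. This is only a rough sketch: the helper names `dir_size_bytes` and `overhead_percent` are made up for illustration, and the actual locations of the leveldb and anti_entropy directories depend on how each node is configured.

```python
import os


def dir_size_bytes(path):
    """Sum the sizes of all regular files under path (symlinks skipped)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                if not os.path.islink(full):
                    total += os.path.getsize(full)
            except OSError:
                # A file may vanish mid-walk (e.g. during compaction); skip it.
                continue
    return total


def overhead_percent(aae_size, data_size):
    """AAE size as a percentage of the primary data size (same units)."""
    if data_size <= 0:
        raise ValueError("data_size must be positive")
    return 100.0 * aae_size / data_size


if __name__ == "__main__":
    # Figures quoted above: 9.7 GB anti_entropy next to 68 GB leveldb.
    print(f"{overhead_percent(9.7, 68):.1f}%")  # prints 14.3%
```

On a real node, the two arguments would come from `dir_size_bytes()` called on the node's own data and anti_entropy directories.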

Shane.

Re: Riak 1.3.1 crash when directory used by AAE is full

Dave Brady
Hi Mark,

* How much total data is in the cluster?

   About 850 GB (750 GB in Bitcask + 100 GB in AAE) at the time of the crash.

* Have you changed the AAE default settings at all? If so, to what?

   No, everything was left with the defaults.

* How much space was allocated for the AAE FS?

   20 GB.  Each node was using about 9.5 GB after AAE finished its first build of the trees.
   I had loaded one bucket (using Dan Kerrigan's wonderful riak-data-migrator), and the ring was not updated afterwards.
   The amount of space used by AAE remained at 9.5 GB for at least a couple of weeks.
   I do not know when it began to chew up more space.
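
Since the growth went unnoticed until the filesystem was full, a periodic free-space check on the AAE mount point would have flagged it earlier. The following is a minimal sketch, not anything Riak provides: the function names, the 25% threshold, and the path passed on the command line are all assumptions for illustration.

```python
import shutil
import sys


def free_fraction(path):
    """Fraction (0.0 to 1.0) of the filesystem holding path that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def needs_alert(path, min_free=0.25):
    """True when free space on path's filesystem drops below min_free."""
    return free_fraction(path) < min_free


if __name__ == "__main__":
    # Hypothetical usage: pass the directory where the AAE LV is mounted.
    aae_dir = sys.argv[1] if len(sys.argv) > 1 else "."
    if needs_alert(aae_dir):
        print(f"WARNING: {aae_dir} is {free_fraction(aae_dir):.0%} free")
        sys.exit(1)
    print(f"OK: {aae_dir} is {free_fraction(aae_dir):.0%} free")
```

Run from cron or a monitoring agent, the non-zero exit status can drive an alert well before the filesystem fills.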

Thanks for the assist!

--
Dave Brady