Riak CS: avoiding RAM overflow and OOM killer

Riak CS: avoiding RAM overflow and OOM killer

Daniel Miller
Hi,

I have a Riak CS cluster up and running, and am anticipating exponential growth in the number of key/value pairs over the next few years. From reading the documentation and from experience, I've concluded that the default configuration of CS (with riak_cs_kv_multi_backend) keeps all keys in RAM. The OOM killer strikes when Riak uses too much RAM, which is not good for my sanity or sleep. Because of the amount of growth I am anticipating, it seems unlikely that I can allocate enough RAM to keep up with the load. Disk, on the other hand, is less constrained.

A little background on the data set: I have a sparsely accessed key set. By that I mean that after a key is written, the more time passes without that key being accessed, the less likely it is to be accessed any time soon. At any given time, most keys will be dormant. However, any given key _could_ be accessed at any time, so it should always be possible to retrieve it.

I am currently running a smaller cluster (with smaller nodes: less RAM, smaller disks) than I expect to use eventually. I am starting to hit some growth-related issues that are prompting me to explore more options before it becomes a dire situation.

My question: Are there ways to tune Riak (CS) to support this scenario gracefully? That is, are there ways to make Riak not load all keys into RAM? It looks like leveldb is just what I want, but I'm a little nervous switching over to only leveldb when the default/recommended config uses the multi backend.
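
For concreteness, here is the kind of change I'm imagining -- an untested sketch based on the documented multi-backend setup in advanced.config, with the blocks backend pointed at leveldb instead of bitcask (the paths and riak_cs version below are illustrative, not from a working config):

    %% advanced.config sketch (untested): the stock CS multi-backend layout,
    %% but with the blocks backend switched from bitcask to eleveldb so that
    %% keys live on disk rather than in RAM.
    {riak_kv, [
        {add_paths, ["/usr/lib/riak-cs/lib/riak_cs-2.1.1/ebin"]},
        {storage_backend, riak_cs_kv_multi_backend},
        {multi_backend_prefix_list, [{<<"0b:">>, be_blocks}]},
        {multi_backend_default, be_default},
        {multi_backend, [
            {be_default, riak_kv_eleveldb_backend,
                [{data_root, "/var/lib/riak/leveldb"}]},
            %% normally riak_kv_bitcask_backend in the default config
            {be_blocks, riak_kv_eleveldb_backend,
                [{data_root, "/var/lib/riak/leveldb_blocks"}]}
        ]}
    ]}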

As a stop-gap measure, I enabled swap (with swappiness = 0), which I anticipated would kill performance, but was pleasantly surprised to see it return to effectively no-swap performance levels after a short period of lower performance. I'm guessing this is not a good long-term solution as my dataset grows. The problem with using large amounts of swap is that each time Riak starts it needs to read all keys into RAM. Long term, as our dataset grows, the amount of time needed to read keys into RAM will cause a very long restart time (and thus period of unavailability), which could endanger availability for a prolonged period if multiple nodes go down at once.

Thanks!
Daniel Miller
Dimagi, Inc.


Re: Riak CS: avoiding RAM overflow and OOM killer

Daniel Miller
I found a similar question from over a year ago (http://lists.basho.com/pipermail/riak-users_lists.basho.com/2015-July/017327.html), and it sounds like leveldb is the way to go, although possibly not well tested. Has anything changed with regard to Basho's (or anyone else's) experience with using the leveldb backend instead of the multi backend for CS?

Re: Riak CS: avoiding RAM overflow and OOM killer

Alexander Sicular
Hi Daniel,

How many nodes?
-You should be using 5 minimum if you're using the default config. There are reasons.

How much ram per node?
-As you noted, in Riak CS, 1MB file chunks are stored in bitcask.
Their key names and some overhead consume memory.

How many objects (files)? What is the average file size?
-If your size distribution significantly skews < 1MB that means you
will have a bunch of files in bitcask eating up ram.

Kota is a former Basho engineer who worked on CS... That said, Basho may not support a non-standard deployment.

-Alexander

Re: Riak CS: avoiding RAM overflow and OOM killer

Daniel Miller
Hi Alexander,

Thanks for responding.

> How many nodes?

We currently have 9 nodes in our cluster.

> How much ram per node?

Each node has 4GB of ram and 4GB of swap. The memory levels (ram + swap) on each node are currently between 4GB and 5.5GB.

> How many objects (files)? What is the average file size?

We currently have >30 million objects. I analyzed the average object size before we migrated data into the cluster; it was about 4KB/object, with some objects much larger (multiple MB). Is there an easy way to get this information from a running cluster so I can give you more accurate information?


Re: Riak CS: avoiding RAM overflow and OOM killer

Alexander Sicular-2
Hi Daniel,

Ya, I'm not surprised you're having issues. 4GB of ram is woefully under-specced. 😔

🤓Stupid math:

3e7 x 3 (replication) / 9 = 1e7 minimum objects per node (absolutely more in practice, since objects larger than 1MB are split into multiple 1MB chunks)

1e7 x ~400 bytes per object in ram = 4e9 bytes of ram per node just for bitcask. Aka 4 GB.

You already hit your limit. We can stop here. Done. End of. ☠️
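
Spelled out as a quick script, for anyone who wants to plug in their own numbers (every input below is one of the approximations above, not a measured value):

    # Rough bitcask key-RAM estimate per node (all inputs approximate).
    objects = 3e7          # ~30 million CS objects (Daniel's number)
    n_val = 3              # default replication factor
    nodes = 9              # physical nodes in the cluster
    bytes_per_key = 400    # ballpark bitcask per-key RAM overhead

    keys_per_node = objects * n_val / nodes       # ~1e7 keys per node
    ram_per_node = keys_per_node * bytes_per_key  # ~4e9 bytes, i.e. ~4 GB

    print(f"~{keys_per_node:.0e} keys/node -> "
          f"~{ram_per_node / 1e9:.1f} GB of RAM for bitcask alone")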

🤔But let's continue for funzies😋. 

Assuming defaults:

Default ring_size = 64 / 9 nodes ~ 7 virtual nodes per physical node. 

Default leveldb ram allocation = 70%

Leveldb operates, aka consumes resources including ram, on a per-vnode basis. It likes to consume ram on the order of 300MB to 2.5GB per vnode, with performance increasing until it caps. Even if you did switch everything to leveldb you'd still be redlined. 
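
To put numbers on that for this cluster -- a rough sketch, on my reading that the leveldb memory budget (the 70% above) gets divided across a node's vnodes; treat every figure as ballpark:

    # Rough leveldb RAM picture on one 4 GB node, assuming the defaults above.
    ring_size = 64
    nodes = 9
    vnodes = ring_size / nodes            # ~7 vnodes per physical node

    node_ram = 4e9
    leveldb_budget = 0.70 * node_ram      # default leveldb ram allocation
    per_vnode = leveldb_budget / vnodes   # ~400 MB/vnode: the very bottom of
                                          # the 300MB-2.5GB comfortable range

    print(f"wants {vnodes * 300e6 / 1e9:.1f}-{vnodes * 2.5e9 / 1e9:.1f} GB, "
          f"budget is {leveldb_budget / 1e9:.1f} GB "
          f"(~{per_vnode / 1e6:.0f} MB per vnode)")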

Bottom line is that bitcask, leveldb and your OS are fighting for ram all day 'ery day😡. Why you hate them and make them fight like that?😩 Not nice! (Trumpisms!)🤓

-Alexander

ps. You probably want to bump to a ring size of 128. More vnodes equals more parallelism, but also more resource consumption. You probably want a minimum of 8 (v)CPUs and 16GB of ram. YMMV, check my math. 

pps. If you don't want to double your per-VM cost (aws ec2, etc) you could add nodes to the cluster. Because Riak uniformly distributes data around the cluster, adding nodes increases the cluster's total resources and reduces the number of objects allocated to each node. The converse is also true: if you double your node size you could halve your node count. That said, systems like Riak prefer more nodes. It's just a math game. 
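
The node-count math, sketched with the same approximate inputs as above:

    # Per-node bitcask key RAM vs. cluster size (same approximate inputs).
    objects, n_val, bytes_per_key = 3e7, 3, 400
    for nodes in (9, 12, 18):
        gb = objects * n_val / nodes * bytes_per_key / 1e9
        print(f"{nodes} nodes -> ~{gb:.1f} GB of key RAM per node")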

@siculars

Sent from my iRotaryPhone

Re: Riak CS: avoiding RAM overflow and OOM killer

DeadZen
Ok, I loled at this. Then got worried Trump could win a node election.

Anyways, 24 gigs per Riak server is not a bad safe bet.
Erlang in general is ram heavy. It uses ram more effectively than most languages wrt concurrency, but ram is the fuel for concurrency and the buffer for operations, especially dumb operations involving large orange loud-mouth objects. As pointed out, you can increase cumulative resources by adding more physical nodes, and there is a trade-off for adding more virtual nodes. There's also an IPC trade-off when adding more than a few dozen physical nodes in the same cluster, and possibly a trade-off to using Riak TS and Riak KV in the same cluster. Nothing but trade-offs.

Re: Riak CS: avoiding RAM overflow and OOM killer

Alexander Sicular
Hello DeadZen,

Yes, networking interconnect becomes a bigger issue with more nodes in the cluster. A Riak cluster is actually a fully meshed network of Erlang virtual machines. Multiple 1/10 gig nics dedicated to inter-/intra-cluster networking are your friends. That said, we have many customers running 50+ node clusters. 

Happy thanksgiving ppl! 
-Alexander 


@siculars

Sent from my iRotaryPhone

Re: Riak CS: avoiding RAM overflow and OOM killer

Daniel Miller
Hi Alexander,

Thanks a lot for your input. I have a few follow-up questions.
 
> Stupid math:
>
> 3e7 x 3 (replication) / 9 = 1e7 minimum objects per node (absolutely more due to obj > 1MB size)
>
> 1e7 x ~400 bytes per obj in ram = 4e9 ram per node just for bitcask. Aka 4 GB.
>
> You already hit your limit.

It makes sense that this is hitting the limit when all keys are in ram (i.e., with bitcask), but I thought leveldb did not keep all keys in ram, so it wouldn't have as high ram requirements. Does leveldb require the same (or more) ram as the multi backend?


> Assuming defaults:
>
> Default ring_size = 64 / 9 nodes ~ 7 virtual nodes per physical node.
>
> Default leveldb ram allocation = 70%

These are good assumptions. I'm running a mostly default configuration.

Aside: I realized too late that I should have used a larger ring size, but I don't see an easy way to change it short of writing custom data-migration code. FWIW, I can easily spin up a new cluster with a new ring size. Is there a supported way to change the ring size on a cluster with data in it, or to migrate to a new cluster with a larger ring size, without writing a custom migration that pulls data out of the old (small-ring) cluster and stuffs it into a new (larger-ring) cluster?

Also, it sounds like increasing the ring size might increase my ram requirements, so maybe in my situation (with limited ram) it's actually better to have a smaller ring? Is that true?


> Leveldb operates, aka consumes resources including ram, on a vnode basis. It likes to consume ram on the order of 300MB to 2.5GB per vnode

In my case it sounds like I need somewhere between 2GB and 17.5GB of ram/node (with 7 vnodes per physical node) for leveldb. Can you explain this 300MB-2.5GB ram/vnode requirement in conjunction with the 70% ram allocation setting? How does the 70% setting affect ram usage?

Remember that I have a sparsely accessed dataset, so it is not necessary for all keys to be kept in ram; most of them will not be accessed frequently, and it's fine if it takes longer to access a key that has not been accessed for a while. My constraints are mainly in hardware, so I'm trying to find a configuration that will run acceptably on minimal hardware. So far (using swap) I have acceptable performance, but I'd prefer to eliminate swap and switch to the leveldb backend if I can get comparable performance that way.

Thanks so much for your help!

~ Daniel
 
