A script to check bitcask keydir sizes


Aphyr
I'm trying to track some basic metrics so we can plan for cluster
capacity, monitor transfers, etc. Figured this might be of interest to
other riak admins. Apologies if my erlang is nonidiomatic, I'm still
learning. :)

#!/usr/bin/env escript
%%! -name riakstatuscheck -setcookie riak

%% Usage: ./riakstatus [node]
%% Sums the bitcask key counts across all vnodes on the given Riak node
%% (default riak@127.0.0.1) and prints the total.

main([]) -> main(["riak@127.0.0.1"]);
main([Node]) ->
   io:format("~w\n", [
     lists:foldl(
       %% key_counts/0 returns one {Partition, KeyCount} pair per vnode
       %% on the node; fold over them for a node-wide total.
       fun({_VNode, Count}, Sum) -> Sum + Count end,
       0,
       rpc:call(list_to_atom(Node), riak_kv_bitcask_backend, key_counts, [])
     )
   ]).


$ ./riakstatus riak@127.0.0.1
18729

Re: A script to check bitcask keydir sizes

Anthony Molinaro
So, a question about when to add new nodes.  I'm looking at the output of
this script and the output of riak-admin status to try to figure out
whether it's time to grow the cluster.

I have 4 nodes, 1024 partitions, and replication factor 3, currently with
a single bucket in a single bitcask, where both the key and the value are
36 bytes.

According to the bitcask capacity planning spreadsheet, the overhead per
key is 40 bytes.

The current key counts and memory figures are:

            key_counts   mem_total      mem_allocated   (key_count*76)
node1       22381785     25269010432    21015953408     1701015660
node2       22378092     25269010432    14076137472     1700734992
node3       22373770     25269010432    21565509632     1700406520
node4       22382394     25269010432    21493731328     1701061944
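
For reference, the last column is key_count * 76 bytes, i.e. the 40 bytes
of per-key keydir overhead plus the 36-byte key; a quick check in an
Erlang shell:

  1> 22381785 * (40 + 36).
  1701015660
  2> 22381785 * (40 + 36) / 1.0e9.
  1.70101566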

node2 failed at some point and was replaced with a new node.

So there is some oddness here I don't understand.  According to the
calculated value I should see about 1.7GB used per box; instead I see
21GB on most machines, except for the one which was restarted, which has
14GB.  From looking at memory, it seems like I should be adding some nodes
real soon, or the amount allocated will hit the total.  Or maybe there's
a memory leak, and restarting frees the memory (as with node2)?

I'm just trying to figure out why I seem to be almost out of memory with
23 million documents, when the Bitcask capacity planning spreadsheet seems
to suggest I should be able to hold 282 million with 20 GiB of free RAM.

Confused,

-Anthony


--
------------------------------------------------------------------------
Anthony Molinaro                           <[hidden email]>


Re: A script to check bitcask keydir sizes

Nico Meyer
Hi Anthony,

are you sure you are not including the filesystem cache in your
mem_allocated values? It will grow to use all of the free memory or the
total size of your bitcask data files, whichever is smaller.
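
One way to check from a shell attached to the riak node is os_mon's
memsup, which on Linux reports cache and buffer sizes separately (a
sketch; the exact keys in the proplist are platform-dependent):

  %% How much of the OS's "allocated" memory is really filesystem cache?
  Data = memsup:get_system_memory_data(),
  Cached = proplists:get_value(cached_memory, Data, 0),
  Buffered = proplists:get_value(buffered_memory, Data, 0),
  io:format("cache+buffers: ~p bytes~n", [Cached + Buffered]).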

We have about 100 million keys per node, and riak uses about 7GB of RAM.

Cheers,
Nico



Re: A script to check bitcask keydir sizes

Anthony Molinaro
Hi Nico,

   It's unclear.  riak-admin status eventually calls riak_kv_stat:get_stats,
which states:

%%</dd><dt> mem_total
%%</dt><dd> The first element of the tuple returned by
%%          {@link memsup:get_memory_data/0}.
%%
%%</dd><dt> mem_allocated
%%</dt><dd> The second element of the tuple returned by
%%          {@link memsup:get_memory_data/0}.
%%

The man page for memsup states

  Returns  the  result  of the latest memory check, where Total is
  the total memory size and Allocated the allocated  memory  size,
  in bytes.

which doesn't tell me whether that includes the filesystem cache or not.
According to top, the riak beam.smp is using 3G of virtual and 2.7G of
resident memory on node1, which doesn't match any of the values from
before.  Attaching to the riak node and running memory(), I see

[{total,2392208240},
 {processes,15012600},
 {processes_used,12797856},
 {system,2377195640},
 {atom,824297},
 {atom_used,812024},
 {binary,898656},
 {code,8336856},
 {ets,556632}]
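
One way to line these figures up, from the same attached shell (a sketch:
memsup:get_memory_data/0 is what riak_kv_stat reads, while erlang:memory/1
covers only the VM's own allocations, excluding the filesystem cache):

  {Total, Allocated, _Worst} = memsup:get_memory_data(),
  io:format("os total=~p, os allocated=~p, vm total=~p~n",
            [Total, Allocated, erlang:memory(total)]).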

The memory() numbers seem to reflect what top claims.  I'm just curious
what to look at to determine when I need to add new nodes.  I'm currently
capturing the statistics riak provides and putting them into rrds, and mean
response time is great (the 95th, 99th, and 100th percentiles spike quite
regularly, which I still don't fully understand, but mean/median is pretty
good, <1ms).  But I'm wondering how to detect when the whole thing will
come crashing down.

I've used Cassandra in production for the last 20 months and had the same
issue: it works great, then it falls over, and unfortunately with such
evenly spaced data, everything tends to fall over at once.  I just don't
want that to happen with my riak cluster, so I'm wondering how to tell
when you're close to needing to grow.

Anyone have any ideas?

-Anthony



--
------------------------------------------------------------------------
Anthony Molinaro                           <[hidden email]>


Re: A script to check bitcask keydir sizes

Nico Meyer
Hi Anthony,

watching the memory that riak consumes is certainly the most important
metric. If riak runs out of memory you are screwed. On a dedicated node
just use the 'free' command to get that information:

                   total       used       free     shared    buffers      cached
Mem:            66106100   49834164   16271936          0     232460    41492552
-/+ buffers/cache:          8109152   57996948
Swap:            1951856         36    1951820


The most important line is the second one.  There should be "enough" free
memory, and swap used should always be near zero.  What's enough is not so
clear cut: if latency is really important to you, there should be enough
free memory that your working set fits into it.  Then the disk is mostly
used for writing.

One way to eyeball whether you have enough memory left for caching is with
a tool like 'iostat' (in the sysstat package on Debian), which shows you
the disk utilization of your system.  I usually call it like this:

iostat -x 2


Of course you should also watch CPU and network utilization, but usually
disk or memory becomes a problem first.

Cheers,
Nico





Re: A script to check bitcask keydir sizes

Greg Nelson
Maybe a slightly different topic, but it's always seemed a little strange to me that bucket names are prefixed to every key in the keydir -- or rather, that they are part of the key at that layer.  Wouldn't the common case be that there are relatively few buckets?  And so wouldn't it save a lot of memory to keep a reference to an interned bucket name string in each entry, instead of the whole bucket name?

This seems like it would only be more expensive (in terms of memory usage) if either a) your bucket names are very short, like fewer than 4 bytes, or b) your ratio of keys to buckets is near 1.

I'm sure I'm missing something, though; I haven't poked into that part of the code yet.
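
A minimal sketch of the interning idea (purely illustrative -- this is not
how Bitcask's keydir actually works, and it is not concurrency-safe):
intern each bucket name in an ETS table once, and store a one-word integer
id per entry instead of the bucket binary:

  init() ->
      ets:new(buckets, [named_table, set, public]).

  %% Return a small integer id for Bucket, interning it on first sight.
  intern(Bucket) ->
      case ets:lookup(buckets, Bucket) of
          [{_, Id}] -> Id;
          [] ->
              Id = ets:info(buckets, size) + 1,
              ets:insert(buckets, {Bucket, Id}),
              Id
      end.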


Re: A script to check bitcask keydir sizes

Justin Sheehy
Hi, Greg.

On Thu, Mar 24, 2011 at 10:17 AM, Greg Nelson <[hidden email]> wrote:
> Wouldn't it be the common case that
> there are relatively few buckets?  And so wouldn't it save a lot of memory
> to keep a reference to an interned bucket name string in each entry, instead
> of the whole bucket name?

One reason this isn't done is that bitcask is an independent application,
used by, rather than part of, Riak.  It's just a local kv store, and knows
nothing of higher-level concepts like buckets.  Another reason is that
there are also users with very many buckets in use, a situation that makes
the proposed solution uncomfortable.

In cases where there are truly few buckets and one knows it would stay
that way, one could plausibly modify riak_kv_bitcask_backend (the part
of Riak that talks to Bitcask) to use a bitcask per bucket on each
vnode instead of a single bitcask per vnode.  One downside of that
approach would be that if the number of buckets did grow then the file
descriptor consumption would be large and the node-wide I/O profile
might be much worse as well.

Everything has tradeoffs.

-Justin


Re: A script to check bitcask keydir sizes

Nico Meyer
In reply to this post by Greg Nelson
You are right.  But on the other hand, the savings are really small.
Unless you have some good reason to give your bucket a very long name, you
can just choose a short bucket name to begin with.

The bigger concern for me would be the way the bucket/key tuple is
serialized:

Eshell V5.8  (abort with ^G)
1> iolist_size(term_to_binary({<<>>,<<>>})).
13

That's 13 bytes of overhead per key, where only 2 bytes are needed with
reasonable bucket/key length limits of 256 bytes each.  Or, if that is not
enough, one could use a variable-length encoding, so buckets/keys can be
arbitrarily large and the most common cases (less than 128 bytes) still
only use 2 bytes of overhead.

That is still only about 1GB per 100 million keys, so even that is not
really relevant in today's world of cheap 64GB machines.
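
A sketch of the fixed-prefix variant described above (illustrative only,
not Bitcask's actual format; a single length byte caps each component at
255 bytes rather than 256):

  %% Two bytes of overhead total: one length byte each for bucket and key.
  encode({Bucket, Key})
    when byte_size(Bucket) =< 255, byte_size(Key) =< 255 ->
      <<(byte_size(Bucket)):8, Bucket/binary,
        (byte_size(Key)):8,    Key/binary>>.

  decode(<<BLen:8, Bucket:BLen/binary, KLen:8, Key:KLen/binary>>) ->
      {Bucket, Key}.

For the empty bucket and key this produces 2 bytes where term_to_binary
produces 13.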






Re: A script to check bitcask keydir sizes

Justin Sheehy
On Thu, Mar 24, 2011 at 1:51 PM, Nico Meyer <[hidden email]> wrote:

> The bigger concern for me would be the way the bucket/key tuple is
> serialized:
>
> Eshell V5.8  (abort with ^G)
> 1> iolist_size(term_to_binary({<<>>,<<>>})).
> 13
>
> That's 13 bytes of overhead per key, where only 2 bytes are needed with
> reasonable bucket/key length limits of 256 bytes each.  Or, if that is not
> enough, one could use a variable-length encoding, so buckets/keys can be
> arbitrarily large and the most common cases (less than 128 bytes) still
> only use 2 bytes of overhead.

I've made a branch of bitcask that effectively does this.  It uses 3
bytes per record instead of 13, saving 10 bytes (both in RAM and on
disk) per element stored.

The tricky thing, however, is backward compatibility.  There are many
Riak installations out there with data stored in bitcask using the old
key encoding, and we shouldn't force them all to do a very costly
full-sweep of their existing data in order to get these savings.  When
we sort out the best way to manage a smooth upgrade, I would happily
push out the smaller encoding.

-Justin


Re: A script to check bitcask keydir sizes

Nico Meyer
Hi Justin,

I wanted to write this earlier, but I just had too much on my plate.


I think the possible gains of this change are fairly limited.  Shaving off
about 10 bytes per key, compared to 43 bytes of overhead plus, let's say,
at least 10 bytes for bucket and key combined, is already less than 20
percent savings.

The savings seem even smaller if you consider the overhead imposed by the
memory allocator.  I wrote a small test program in C++ which allocates one
million blocks of memory of a given size and prints the overhead for each
allocation.  It turns out the overhead ranges from 8 to 23 bytes in a
sawtooth-like pattern (on a 64-bit Linux machine):

size=56: overhead=8
size=57: overhead=23
size=58: overhead=22
size=59: overhead=21
size=60: overhead=20
size=61: overhead=19
size=62: overhead=18
size=63: overhead=17
size=64: overhead=16
size=65: overhead=15
size=66: overhead=14
size=67: overhead=13
size=68: overhead=12
size=69: overhead=11
size=70: overhead=10
size=71: overhead=9
size=72: overhead=8

Not much you can do about that, unless one wants to use unaligned
memory, which one doesn't.



Cheers,
Nico



Re: A script to check bitcask keydir sizes

Mike Oxford
> Nico Meyer wrote:
>
> The savings seem even smaller if you consider the overhead imposed by the
> memory allocator.  I wrote a small test program in C++ which allocates one
> million blocks of memory of a given size and prints the overhead for each
> allocation.  It turns out the overhead ranges from 8 to 23 bytes in a
> sawtooth-like pattern (on a 64-bit Linux machine):
>
--snip
>
> Not much you can do about that, unless one wants to use unaligned memory,
> which one doesn't.

Memory pools.
Preallocation+slicing.

Games to be played.

-mox
