Pruning (merging) after storage reaches a certain size?

Steve Webb
Hello there.

I'm loading a 2-node (1GB mem, 20GB storage, VMware VMs) riaksearch
cluster with the Twitter Spritzer feed.  I used the bitcask 'expiry_secs'
setting to expire data after 3 days.

I'm curious - I'm up to about 10GB of storage and I'm guessing that I'll
be full in 3-4 more days of ingesting data.  I have no idea if/when a
merge will run to expire the older data.

Q: Is there a method or command to force a merge at any time?
Q: Is there a way to run a merge when the storage size reaches a specific
threshold?

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Justin Sheehy
Hi, Steve.

Check out this page: http://wiki.basho.com/Bitcask-Configuration.html#Disk-Usage-and-Merging-Settings

Basically, a "merge trigger" must be met in order to have the merge process occur.  When it does occur, it will affect all existing files that meet a "merge threshold."

One note that is relevant for your specific use: the expiry_secs parameter will cause a given item to disappear from the client API immediately after expiry, and to be cleaned if it is in a file already being merged, but will not currently contribute toward merge triggers or thresholds on its own if not otherwise "dead".
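For reference, a rough sketch of what the merge-related section of app.config
can look like, using the parameter names from that wiki page (the values below
are illustrative only, not recommendations):

  %% Bitcask merge settings (illustrative sketch -- see the wiki page above)
  {bitcask, [
              {data_root, "/var/lib/riaksearch/bitcask"},
              %% Triggers: a merge starts once any non-active file meets one
              {frag_merge_trigger, 60},               %% percent of dead keys in a file
              {dead_bytes_merge_trigger, 536870912},  %% bytes of dead data in a file
              %% Thresholds: which non-active files get pulled into that merge
              {frag_threshold, 40},
              {dead_bytes_threshold, 134217728},
              {small_file_threshold, 10485760},
              %% The active file only becomes mergeable after it rolls over
              {max_file_size, 16#80000000},
              %% Expired items vanish from the API but do not count as "dead"
              {expiry_secs, 86400}
            ]},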

-Justin


Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
Justin -

My current bitcask settings are:

  %% Bitcask Config
  {bitcask, [
              {data_root, "/var/lib/riaksearch/bitcask" },
              {dead_bytes_merge_trigger, 10242880 },
              {dead_bytes_threshold, 5242880 },
              {expiry_secs, 86400}
            ]},

My understanding of these settings is that the data should auto-expire
after one day.  Also, each bitcask file in
.../riaksearch/bitcask/xxx/*.data should be merged once it has 10M of
"dead" or expired data in it, right?

I'm collecting the Spritzer Twitter stream and loading it into two buckets
(one non-indexed bucket holds the full tweet, one indexed bucket holds the
tweet string, id, date and username).  I used to see about 10GB of data
total, but it's growing and is currently at 26GB.

I'm seeing these in the logs:

INFO REPORT==== 13-Jun-2011::08:28:19 ===
Pid <0.6844.0> compacted 3 segments for 942232 bytes in 4.900694 seconds,
0.18 MB/sec

=INFO REPORT==== 13-Jun-2011::08:29:01 ===
Pid <0.6267.0> compacted 3 segments for 1721790 bytes in 9.690511 seconds,
0.17 MB/sec

=INFO REPORT==== 13-Jun-2011::08:31:23 ===
Pid <0.6924.0> compacted 3 segments for 6988416 bytes in 44.659753
seconds, 0.15 MB/sec

... but I'm not seeing any "merging" related entries.

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Dan Reverri
Hi Steve,

This Knowledge Base article may be related:
https://help.basho.com/entries/20141178-why-does-it-seem-that-bitcask-merging-is-only-triggered-when-a-riak-node-is-restarted

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[hidden email]


Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
Dan -

I've got dead_bytes_threshold=5242880 (5M) and
dead_bytes_merge_trigger=10242880.  My bitcask *.data files are 250-ish MB
in size:

root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280# ls -lah
total 771M
drwxr-xr-x  2 riak riak 4.0K 2011-06-12 01:08 .
drwxr-xr-x 34 riak riak 4.0K 2011-06-12 01:10 ..
-rw-------  1 riak riak 229M 2011-06-08 13:11 1307415077.bitcask.data
-rw-r--r--  1 riak riak 4.3M 2011-06-08 13:11 1307415077.bitcask.hint
-rw-------  1 riak riak 276M 2011-06-10 13:30 1307562153.bitcask.data
-rw-r--r--  1 riak riak 5.1M 2011-06-10 13:30 1307562153.bitcask.hint
-rw-------  1 riak riak 1.4M 2011-06-08 13:45 1307562333.bitcask.data
-rw-r--r--  1 riak riak  27K 2011-06-08 13:45 1307562333.bitcask.hint
-rw-------  1 riak riak 246M 2011-06-13 15:34 1307862506.bitcask.data
-rw-r--r--  1 riak riak 9.4M 2011-06-13 15:34 1307862506.bitcask.hint
-rw-------  1 riak riak  107 2011-06-12 01:08 bitcask.write.lock

I'm pretty sure that 50% or more of the data in these files should've
aged off by now and the merge trigger should've fired.  The article
explains why merges happen when a restart is done, but it doesn't really
explain why merges don't happen during normal runtime.

I really don't want to restart riak every day to merge files.

Q: What are some good trigger settings for my use case?

I want to collect and store 1 day's worth of tweets from the Twitter
Spritzer feed and have the data files auto-merge once in a while (once a
day or more frequently) when they've gotten 10% of 'dead' data in them
(i.e., the tweets expire after 1 day).

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Dan Reverri
Hi Steve,

The article points out that the active data file is not considered during merge checks. Your 250-ish MB data file is the active file, so it is not considered during the merge check. The file will eventually roll over to a non-active file when it hits 2 GB in size. Once the file is no longer active, it will be considered during the merge check and merging will take place.

The 2 GB file size is configurable via the max_file_size parameter:
https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L22
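As a rough sketch, the default at that link and a lowered app.config override
would look something like this (the lowered value here is just an example):

  %% default shipped with bitcask (2GB)
  {max_file_size, 16#80000000},

  %% app.config override to roll files over sooner, e.g. at ~80MB
  {bitcask, [
              {max_file_size, 83886080}
            ]},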

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[hidden email]


Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
Dan -

Q: What does the syntax 16#80000000 represent in the max_file_size
parameter?  It's supposed to be 2GB, but I can't see how that means 2GB.

Even if that meant 16 files of 80MB each, that only comes out to slightly
over 1GB.

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Chad DePue
That's a hex number: 2147483648 in decimal, which is 2GB.
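For anyone curious, Erlang's base#value integer syntax (it works for any base
from 2 to 36) is easy to check in an Erlang shell; a quick sketch:

  1> 16#80000000.
  2147483648
  2> 16#80000000 div (1024*1024*1024).
  2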

Chad DePue
inakanetworks.com - development consulting | skype cdepue | @chaddepue
+1 206.866.5707 



Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
In reply to this post by Steve Webb
Ahh, that's hex notation in Erlang.  Sorry for the stupid question.

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Justin Sheehy
In reply to this post by Steve Webb
Hi, Steve.

The key to your situation was in my earlier email:

    One note that is relevant for your specific use: the expiry_secs
    parameter will cause a given item to disappear from the client
    API immediately after expiry, and to be cleaned if it is in a file
    already being merged, but will not currently contribute toward
    merge triggers or thresholds on its own if not otherwise "dead".

That is, bitcask wasn't originally designed around the expiry-centric
way of removing old data, and data that has simply expired (but not
actively been deleted) will not be counted as garbage toward
thresholds or triggers at this time.  It will be cleaned up in a
merge, but will not contribute toward causing the merge in the first
place.  In a use case where you only add items and never actually
delete anything, a merge will never be dynamically triggered.

It is plausible that we could add some expiry-statistics measurement
and triggering to bitcask, but today that's the state of things.  You
could manually trigger merges, but that currently requires a bit of
Erlang.
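
To make that concrete, here is a rough, untested sketch of the sort of
Erlang involved, run from 'riak attach'; it assumes bitcask:merge/1 can
simply be pointed at each partition's bitcask directory, so please check
with Basho before trying anything like this against a live node:

  %% walk the bitcask data_root and ask each partition directory to merge
  DataRoot = "/var/lib/riaksearch/bitcask",
  Dirs = filelib:wildcard(filename:join(DataRoot, "*")),
  [bitcask:merge(Dir) || Dir <- Dirs, filelib:is_dir(Dir)].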

I hope that this helps.

-Justin

Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
I'm going to experiment with the bitcask max_file_size and reduce it to
80MB or so (my current files are 200+MB), so hopefully this will force a
merge on the files earlier and discover the expired records.  I'll let
you know how it goes.

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
In reply to this post by Justin Sheehy
My current app.config bitcask section is looking like this:

%% Bitcask Config
  {bitcask, [
              {data_root, "/var/lib/riaksearch/bitcask" },
              {dead_bytes_merge_trigger, 10242880 },
              {dead_bytes_threshold, 5242880 },
              {max_file_size, 80000000 },
              {expiry_secs, 86400}
            ]},
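
(Side note on the value, purely as arithmetic: 80000000 is a decimal byte
count, i.e. 80MB in powers of ten or roughly 76MB as "ls -h" reports it.
An exact 80MB in binary units would be either of:

  {max_file_size, 83886080 },     %% 80 * 1024 * 1024
  {max_file_size, 16#5000000 },   %% the same value in Erlang hex notation

either spelling works, and both are well under the 2GB default.)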

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
In reply to this post by Dan Reverri
Q: It looks like I have files in my bitcask directories that are not being
actively used (I've restarted, they still seem to be a pretty decent
size, and the mtime is several days old):

root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280# ls -la
total 791992
drwxr-xr-x  2 riak riak      4096 2011-06-13 16:49 .
drwxr-xr-x 34 riak riak      4096 2011-06-13 16:47 ..
-rw-------  1 riak riak 239369130 2011-06-08 13:11 1307415077.bitcask.data
-rw-r--r--  1 riak riak   4458434 2011-06-08 13:11 1307415077.bitcask.hint
-rw-------  1 riak riak 288686080 2011-06-10 13:30 1307562153.bitcask.data
-rw-r--r--  1 riak riak   5347188 2011-06-10 13:30 1307562153.bitcask.hint
-rw-------  1 riak riak   1431867 2011-06-08 13:45 1307562333.bitcask.data
-rw-r--r--  1 riak riak     27162 2011-06-08 13:45 1307562333.bitcask.hint
-rw-------  1 riak riak 259423130 2011-06-13 15:55 1307862506.bitcask.data
-rw-r--r--  1 riak riak   9814878 2011-06-13 15:55 1307862506.bitcask.hint
-rw-------  1 riak riak    950009 2011-06-13 16:22 1308003767.bitcask.data
-rw-r--r--  1 riak riak     35768 2011-06-13 16:22 1308003767.bitcask.hint
-rw-------  1 riak riak    561579 2011-06-13 16:55 1308005359.bitcask.data
-rw-r--r--  1 riak riak     21024 2011-06-13 16:55 1308005359.bitcask.hint
-rw-------  1 riak riak       107 2011-06-13 16:49 bitcask.write.lock
root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#

How come these older files aren't being considered for merge?

Again, my bitcask settings are:

  %% Bitcask Config
  {bitcask, [
              {data_root, "/var/lib/riaksearch/bitcask" },
              {dead_bytes_merge_trigger, 10242880 },
              {dead_bytes_threshold, 5242880 },
              {max_file_size, 80000000 },
              {expiry_secs, 86400}
            ]},

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

Re: Pruning (merging) after storage reaches a certain size?

Dan Reverri
Hi Steve,

Those files are not being merged because of the point Justin made earlier: expiry on its own is not a condition that affects the merge check.

As Justin stated earlier:
That is, bitcask wasn't originally designed around the expiry-centric
way of removing old data, and data that has simply expired (but not
actively been deleted) will not be counted as garbage toward
thresholds or triggers at this time.  It will be cleaned up in a
merge, but will not contribute toward causing the merge in the first
place.  In a use case where you only add items and never actually
delete anything, a merge will never be dynamically triggered.

It is plausible that we could add some expiry-statistics measurement
and triggering to bitcask, but today that's the state of things.  You
could manually trigger merges, but that currently requires a bit of
Erlang.
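
For anyone curious, here is a rough sketch of what that bit of Erlang might look like, run from "riak attach" on a node. Treat it as a sketch under assumptions rather than a recipe: it assumes bitcask:merge/1 is exported, that each partition keeps its own subdirectory under the configured data_root, and that the path below matches your data_root. Bitcask guards merges with a lock file, so in principle this can run against a live node, but trying it on a test node first would be wise.

  %% From "riak attach" on the node (Ctrl-D to detach when done).
  %% Walk every partition directory under the bitcask data_root and
  %% ask bitcask to merge it.
  DataRoot = "/var/lib/riaksearch/bitcask".
  Dirs = [D || D <- filelib:wildcard(filename:join(DataRoot, "*")),
               filelib:is_dir(D)].
  [bitcask:merge(D) || D <- Dirs].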

I hope that this helps.

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[hidden email]


On Mon, Jun 13, 2011 at 3:57 PM, Steve Webb <[hidden email]> wrote:
Q: It looks like I have files in my bitcask directories that are not being actively used (I've restarted, and they still seem to be a pretty decent size, and the mtime is several days old):


root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280# ls -la
total 791992
drwxr-xr-x  2 riak riak      4096 2011-06-13 16:49 .
drwxr-xr-x 34 riak riak      4096 2011-06-13 16:47 ..
-rw-------  1 riak riak 239369130 2011-06-08 13:11 1307415077.bitcask.data
-rw-r--r--  1 riak riak   4458434 2011-06-08 13:11 1307415077.bitcask.hint
-rw-------  1 riak riak 288686080 2011-06-10 13:30 1307562153.bitcask.data
-rw-r--r--  1 riak riak   5347188 2011-06-10 13:30 1307562153.bitcask.hint
-rw-------  1 riak riak   1431867 2011-06-08 13:45 1307562333.bitcask.data
-rw-r--r--  1 riak riak     27162 2011-06-08 13:45 1307562333.bitcask.hint
-rw-------  1 riak riak 259423130 2011-06-13 15:55 1307862506.bitcask.data
-rw-r--r--  1 riak riak   9814878 2011-06-13 15:55 1307862506.bitcask.hint
-rw-------  1 riak riak    950009 2011-06-13 16:22 1308003767.bitcask.data
-rw-r--r--  1 riak riak     35768 2011-06-13 16:22 1308003767.bitcask.hint
-rw-------  1 riak riak    561579 2011-06-13 16:55 1308005359.bitcask.data
-rw-r--r--  1 riak riak     21024 2011-06-13 16:55 1308005359.bitcask.hint
-rw-------  1 riak riak       107 2011-06-13 16:49 bitcask.write.lock

root@ha2:/data/riaksearch/bitcask/1027618338748291114361965898003636498195577569280#

Why aren't these older files being considered for merging?

Again, my bitcask settings are:


 %% Bitcask Config
 {bitcask, [
            {data_root, "/var/lib/riaksearch/bitcask" },
            {dead_bytes_merge_trigger, 10242880 },
            {dead_bytes_threshold, 5242880 },
            {max_file_size, 80000000 },
            {expiry_secs, 86400}
          ]},

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

On Mon, 13 Jun 2011, Dan Reverri wrote:

Hi Steve,

The article points out that the active data file is not considered during
merge checks. Your 250-ish MB data file is the active file and not
considered during the merge check. The file will eventually role over to a
non-active file when it hits 2 GB in size. Once the file is not active it
will be considered during the merge check and merging will take place.

The 2 GB file size is configurable via the max_file_size parameter:
https://github.com/basho/bitcask/blob/master/ebin/bitcask.app#L22

Thanks,
Dan

Daniel Reverri
Developer Advocate
Basho Technologies, Inc.
[hidden email]


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Pruning (merging) after storage reaches a certain size?

Steve Webb
In reply to this post by Justin Sheehy
Just to be clear ...

If I make max_file_size small and set the expiry_secs value to something
small, but never explicitly delete anything, will the non-active files be
considered for merging (and have their expired data pruned) just because
they are inactive, even though they don't trip any other merge-detection
criteria?

If not, how would I configure riak as a system that I can continuously
insert data into and keep only roughly the last day's worth of data?
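
For concreteness, a hypothetical bitcask section along those lines might look like the following. The values are placeholders rather than recommendations, and the frag_* names are the fragmentation-based merge settings as I understand the bitcask defaults; per Justin's earlier note, expired-but-undeleted data still would not trip any of these triggers on its own, so a periodic manual merge would probably still be needed.

  %% Hypothetical rolling one-day-window config (placeholder values)
  {bitcask, [
             {data_root, "/var/lib/riaksearch/bitcask" },
             {max_file_size, 104857600},     %% ~100MB, so files go non-active quickly
             {expiry_secs, 86400},           %% expire anything older than one day
             {frag_merge_trigger, 40},       %% percent fragmentation that triggers a merge
             {frag_threshold, 20},           %% percent fragmentation that pulls a file into a merge
             {dead_bytes_merge_trigger, 10242880},
             {dead_bytes_threshold, 5242880}
            ]},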

- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb

On Mon, 13 Jun 2011, Justin Sheehy wrote:

> Hi, Steve.
>
> The key to your situation was in my earlier email:
>
>    One note that is relevant for your specific use: the expiry_secs
>    parameter will cause a given item to disappear from the client
>    API immediately after expiry, and to be cleaned if it is in a file
>    already being merged, but will not currently contribute toward
>    merge triggers or thresholds on its own if not otherwise "dead".
>
> That is, bitcask wasn't originally designed around the expiry-centric
> way of removing old data, and data that has simply expired (but not
> actively been deleted) will not be counted as garbage toward
> thresholds or triggers at this time.  It will be cleaned up in a
> merge, but will not contribute toward causing the merge in the first
> place.  In a use case where you only add items and never actually
> delete anything, a merge will never be dynamically triggered.
>
> It is plausible that we could add some expiry-statistics measurement
> and triggering to bitcask, but today that's the state of things.  You
> could manually trigger merges, but that currently requires a bit of
> Erlang.
>
> I hope that this helps.
>
> -Justin
>

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|

Re: Pruning (merging) after storage reaches a certain size?

Jeremiah Peschka
The only other time I've seen this documented is in a thread from 2011 about pruning (merging) after deletes. Here's the message in question in case any historians are lurking: http://markmail.org/message/4nakwhixkw3jcxvi
---
Jeremiah Peschka
Managing Director, Brent Ozar PLF, LLC


On Thu, Jul 28, 2011 at 9:18 AM, Steve Webb <[hidden email]> wrote:
Just to be clear ...

If I make max_file_size small and set the expiry_secs value to something small, but never explicitly delete anything, will the non-active files be considered for merging (and have their expired data pruned) just because they are inactive, even though they don't trip any other merge-detection criteria?

If not, how would I configure riak as a system that I can continuously insert data into and keep only roughly the last day's worth of data?


- Steve

--
Steve Webb - Senior System Administrator for gnip.com
http://twitter.com/GnipWebb



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com