multi-get (yet again)


Jeremy Dunck
I'm new to Riak and need multi-get (that is, getting the values and/or
existence of many keys within a single network round trip).

What is the latency of the map-reduce approach described here?
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-February/003229.html

Alternatively, has anyone tried scaling concurrent gets (perhaps with
evented io) to do many concurrent requests and combining results on
the client?
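
For illustration, the client-side approach could look roughly like the sketch below. It uses a standard-library thread pool rather than gevent, and `fetch` is a hypothetical callable (not part of any Riak client) that takes one key and returns its value, or None when the key is missing. In real use it would wrap a single Riak GET.

```python
from concurrent.futures import ThreadPoolExecutor


def multiget(fetch, keys, max_workers=32):
    """Fetch many keys concurrently and return {key: value_or_None}.

    `fetch` is a hypothetical callable: one key in, its value (or None) out.
    """
    keys = list(keys)
    if not keys:
        return {}
    with ThreadPoolExecutor(max_workers=min(max_workers, len(keys))) as pool:
        # pool.map preserves input order, so zip pairs each key with its result.
        return dict(zip(keys, pool.map(fetch, keys)))
```

With a plain dict standing in for the store, `multiget({"a": 1}.get, ["a", "b"])` maps "a" to 1 and "b" to None. Total latency is then bounded by the slowest single request plus scheduling overhead, not by the sum of all requests.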

I am toying with a python+gevent multiget function.  If the stance is
still that a multiget operation doesn't belong in core, I'm a little
surprised that there doesn't seem to at least be a nice client-lib API
func to do it.  It sure seems useful...

In my use-case, the immediate need is to know whether a db insert
needs to be done.  We're handling too many keys to want to store in
memory (so no redis, etc), and we don't want to go to the db more than
we need to, so it seems riak would be good here.  But we're getting
1000s of potential insert keys and want to whittle those down to
relatively few db inserts.

So I was thinking riak key-per-id, and insert to the db iff the riak
key doesn't exist, then add the riak key.  We'll get some race
conditions on the insert, but that's OK in our case.
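
To make the flow concrete, here is a minimal sketch of that check-then-insert loop. All three callables are assumptions for illustration: `exists_many` stands for whatever batched existence check ends up being used (multi-get or map-reduce), `db_insert` for the database write, and `mark_seen` for storing the Riak key.

```python
def insert_if_unseen(ids, exists_many, db_insert, mark_seen):
    """Insert only the ids not yet recorded; return the ids that were inserted.

    exists_many(ids) -> {id: bool}   (hypothetical batched existence check)
    db_insert(id), mark_seen(id)     (hypothetical side-effecting callables)
    """
    ids = list(dict.fromkeys(ids))  # drop duplicates within the batch, keep order
    seen = exists_many(ids)         # one batched check for the whole batch
    new_ids = [i for i in ids if not seen.get(i, False)]
    for i in new_ids:
        db_insert(i)  # the expensive step we are trying to minimize
        mark_seen(i)  # record the key so later batches skip this id
    return new_ids
```

Two workers handling the same id at the same time can both pass the check and both insert, which is the race mentioned above.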

We do need low latency on the riak check, though, hence either
multiplexing w/ eventing or map-reduce (if that latency is actually
good).

Am I doing it wrong?

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: multi-get (yet again)

Parnell Springmeyer
Jeremy,

I was looking for something similar and first built an extra handler onto an internal Erlang Cowboy API server that used maelstrom (my own worker pool OTP application).

The client made a simple POST containing a string of the {bucket, key} pairs, and the server would GET them concurrently, combine the results, and send them back. This was very fast (thousands of keys fetched in milliseconds).

Since that seemed gross, I then decided (based on some input from someone else on the list) to try a simple Map/Reduce job that uses the built-in Erlang functions instead of JavaScript (since those are going to be really fast and take better advantage of Erlang's concurrency than the JavaScript VMs do).

In Python, you can run that type of M/R job without writing any Erlang code:

import riak

client = riak.RiakClient()

# Add your KNOWN bucket and key pairs (you can do this in a loop)
query = client.add(bucket, key)
query.add(bucket, key)
query.add(bucket, key)
# ...and so on, as many as you like

# Now tell the map and reduce phases to use the Erlang module "riak_kv_mapreduce" and its
# functions "map_object_value" and "reduce_set_union". Chain the phases onto `query`
# (not `client`) so they run over the inputs added above.
results = query.map(["riak_kv_mapreduce", "map_object_value"]) \
               .reduce(["riak_kv_mapreduce", "reduce_set_union"]) \
               .run()

The above returns results faster for me than the brokered multi-get approach I used. I guarantee my brokered multi-get is faster than anything you can do with Python + gevent, so if M/R beats it, the M/R phase is definitely the route you want to go.

So IMHO, it is very fast as long as you know the buckets and keys you want to get.
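
For anyone not using the Python client, the same job can be written out as a plain data structure. The sketch below assumes the JSON job format accepted by Riak's HTTP MapReduce endpoint ("inputs" as a list of [bucket, key] pairs, "query" as a list of phases); check it against the docs for your Riak version. The bucket name "ids" and the keys are made up for the example.

```python
import json


def build_multiget_job(pairs):
    """Build a MapReduce job (as a dict) over known (bucket, key) pairs."""
    return {
        "inputs": [[bucket, key] for bucket, key in pairs],
        "query": [
            # Built-in Erlang map function: emits each object's value.
            {"map": {"language": "erlang",
                     "module": "riak_kv_mapreduce",
                     "function": "map_object_value"}},
            # Built-in Erlang reduce function: unions the emitted values.
            {"reduce": {"language": "erlang",
                        "module": "riak_kv_mapreduce",
                        "function": "reduce_set_union"}},
        ],
    }


job = build_multiget_job([("ids", "1001"), ("ids", "1002")])
body = json.dumps(job)  # request body to POST as application/json
```

Note that reduce_set_union returns a set of values, so identical values collapse and the result does not say which key produced which value.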

On Aug 9, 2012, at 12:11 AM, Jeremy Dunck wrote:

> [...]

Re: multi-get (yet again)

Kresten Krab Thorup
The only issue with this approach is AFAIK that M/R effectively runs with R=1, i.e. it doesn't ensure that a value is consistent across replicas.  
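
To show why that matters for an existence check, here is a toy model in Python. It is not how Riak is implemented (no vector clocks, no preference lists); it only assumes each replica holds either nothing or a (version, value) pair, and that a read consults the first `r` replicas.

```python
def read(replicas, r):
    """Toy read: consult the first `r` replicas and return the newest value.

    Each replica is a (version, value) tuple, or None if it has no copy.
    Returns None when none of the consulted replicas has the key.
    """
    answers = [rep for rep in replicas[:r] if rep is not None]
    if not answers:
        return None
    return max(answers, key=lambda rep: rep[0])[1]


# Three replicas of one key; the first has not yet received the write.
replicas = [None, (1, "seen"), (1, "seen")]
r1_result = read(replicas, 1)  # consults only the lagging replica
r2_result = read(replicas, 2)  # also consults a replica that has the write
```

In this model the R=1 read reports the key as missing while the R=2 read finds it, so an R=1 existence check can trigger a db insert that was not needed.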

IMHO riak_kv_mapreduce should have a map_get_object_value that does a proper RiakClient:get. It will be slower, but it will honour the bucket's default R value. Something like this:

map_get_object_value({error, notfound}=NF, KD, Action) ->
    notfound_map_action(NF, KD, Action);
map_get_object_value(RO, KD, Action) ->
    {ok, RiakClient} = riak:local_client(),
    %% Re-read the object through the regular client so the bucket's R value applies.
    case RiakClient:get(riak_object:bucket(RO), riak_object:key(RO)) of
        {error, notfound}=NF ->
            notfound_map_action(NF, KD, Action);
        {ok, RiakObject} ->
            [riak_object:get_value(RiakObject)]
    end.

Kresten


On Aug 9, 2012, at 10:46 AM, Parnell Springmeyer <[hidden email]> wrote:

> [...]



Mobile: + 45 2343 4626 | Skype: krestenkrabthorup | Twitter: @drkrab
Trifork A/S  |  Margrethepladsen 4  | DK- 8000 Aarhus C |  Phone : +45 8732 8787  |  www.trifork.com
 




Re: multi-get (yet again)

Eric Moritz
In reply to this post by Parnell Springmeyer

I toyed with a pmap in Python a while back to try to speed up multiple HTTP requests to our web services layer at work. You may want to try that with gevent.

Here's the code I wrote, which is probably not production-ready: https://github.com/ericmoritz/pmap

On Aug 9, 2012 4:46 AM, "Parnell Springmeyer" <[hidden email]> wrote:
> [...]

Re: multi-get (yet again)

bryan-basho
Administrator
In reply to this post by Kresten Krab Thorup
On Thu, Aug 9, 2012 at 5:11 AM, Kresten Krab Thorup <[hidden email]> wrote:
> The only issue with this approach is AFAIK that M/R effectively runs with R=1, i.e. it doesn't ensure that a value is consistent across replicas.
>
> IMHO riak_kv_mapreduce should have a map_get_object_value, which does a proper RiakClient:get, i.e. something like this: [will be slower, but will honour the bucket's default R value].

I recently realized that this would be a fairly small and easy thing
to do since MR has been ported to Riak Pipe. I'm frying other fish at
the moment, but if any of you are interested, read on.

In Riak Pipe, an MR "map" phase is broken into two steps: "get" and
"transform". The "get" phase is what reads the value from Riak. It is
currently implemented in riak_kv_pipe_get, in the riak_kv application.

If you read riak_kv_pipe_get.erl, you'll see that all of the fetching
logic is in the process/3 function. Modifying this code to do a
regular riak_client:get instead of talking directly to a single vnode
should be easy.

We would like to keep the existing implementation as the default, at
least for now. So, my suggestion would be to add the new behavior as
an option, with flags to control it. This could be accomplished either
by modifying riak_kv_pipe_get to look for a flag in its argument, or
by modifying riak_kv_mrc_pipe to use a new fitting instead of
riak_kv_pipe_get.

With either modification, you'll want to also change riak_kv_mrc_pipe
to pass the map arguments through to the "get" fitting. These
arguments are the only place available to external clients to specify
any of the R-value tuning parameters. Yes, that means a map function
implementation will have to ignore them, but hopefully that's not
insurmountable. See the reduce_batch_size and reduce_phase_only_1
optional "reduce" phase arguments for examples on how to do this.

There are probably other ways to fit this kind of fetching behavior in
as well. While Kresten's map-function implementation is good, I think
this behavior is useful in more cases than resolving a
notfound. Hopefully what I've written above is enough to get one or
more of you started down a path.

Cheers,
Bryan

Re: multi-get (yet again)

Parnell Springmeyer
I'm interested in this; I'll fork the repo and see what I can get added in there.

On Aug 10, 2012, at 7:52 AM, Bryan Fink wrote:

> [...]
