Speed of map/reduce queries


John Lynch
Hello all,

We have been running some simple tests to try to understand the performance characteristics of Riak, and we are getting some strange results. Basically, we spin up a small EC2 instance (Ubuntu 9.10, ami-4d742508), install Riak 0.9.1 from the deb package using the DETS backend, and populate it with 10,000 JSON objects (~3K each) in a single bucket with Riak-assigned keys. Then we run this map/reduce query (from a separate client machine) using Ripple to find some keys:

results = Riak::MapReduce.new(client)
          .add("RiakLoader")
          .map("function(value,keyData,arg) {
                var re = new RegExp(arg);

                return value.key.match(re) ? [value.key] : [];
               }", :keep => true, :arg => "^12").run


We see the beam process run at 38% CPU and 2% RAM for about 60 seconds before returning the results, which were about 7 keys. The time stays the same each time we run the query. 60 seconds seems like an awfully long time to search 10,000 keys. Are we doing something wrong, or is that an expected result?


Regards,

John Lynch, CTO
Rigel Group, LLC
[hidden email]



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Speed of map/reduce queries

Kevin Smith-5
John -

Thanks for your question. I've done some experimenting and I think I understand the behavior you're seeing. Riak currently uses JSON as the marshaling format for passing data between the Erlang and JavaScript VMs. While JSON is easy to understand and debug, it isn't particularly fast or compact: each of your 10,000 objects must be JSON encoded and decoded on its way from Erlang to the SpiderMonkey VM during the map phase. Even though map phases are highly parallel, this creates a fair amount of work and can become CPU bound, especially in the single-node case.
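To get a feel for how that overhead scales, here is a rough stand-alone Ruby microbenchmark (my own sketch, not Riak's actual code path; the document sizes and field names are made up) that round-trips 10,000 ~3K documents through JSON once each way:

```ruby
require 'json'
require 'benchmark'

# Hypothetical stand-in for the workload: 10,000 documents whose
# serialized size is roughly 3K each.
doc  = { "id" => 0, "payload" => "x" * 2_900, "tags" => %w[a b c] }
docs = Array.new(10_000) { |i| doc.merge("id" => i) }

elapsed = Benchmark.realtime do
  # One encode plus one decode per object, single-threaded.
  docs.each { |d| JSON.parse(JSON.generate(d)) }
end

puts format("10k JSON round-trips: %.2fs", elapsed)
```

Riak does a comparable encode/decode for every object that crosses the Erlang/JavaScript boundary, on top of the SpiderMonkey execution itself, so on a small EC2 instance the cost compounds quickly.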

Based on my tests using your map phase and similar data, it's clear the marshaling overhead is causing the performance you're seeing. I believe we can certainly improve marshaling performance, but, as with many things, it's a question of priorities and resources. In the meantime, you might consider storing the bucket and key values in a separate bucket and mapping over those to avoid the object marshaling overhead.
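A sketch of that workaround with the Ruby riak-client that Ripple builds on (the stub bucket name and the choice to store the key itself as the stub's body are illustrative assumptions, not a prescribed pattern):

```ruby
require 'riak'  # riak-client gem; assumes a reachable Riak node

client = Riak::Client.new

# 1) At write time, mirror each key into a lightweight stub bucket
#    ("RiakLoaderKeys" is an invented name) whose body is just the key,
#    so only a few bytes get marshaled instead of ~3K per object.
key  = "12345"                        # example Riak-assigned key
stub = client.bucket("RiakLoaderKeys").new(key)
stub.data = key
stub.store

# 2) Run the same map phase over the stub bucket instead:
results = Riak::MapReduce.new(client).
          add("RiakLoaderKeys").
          map("function(value, keyData, arg) {
                 var re = new RegExp(arg);
                 return value.key.match(re) ? [value.key] : [];
               }", :keep => true, :arg => "^12").run
```

The trade-off is that the stub bucket must be kept in sync with the data bucket on every write and delete.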

--Kevin

Re: Speed of map/reduce queries

John Lynch
Kevin,

Thanks for looking into it. What you say makes sense, of course. It would, however, be nice to have a way to map over just the keyspace without having to marshal all the actual objects themselves. Some way to pass a map function to the list_keys function, perhaps? Is that possible in JavaScript or Erlang? I am curious how an Erlang map function would perform for the same thing. I will try that next...


Regards,
 
- John


Re: Speed of map/reduce queries

Kevin Smith-5
John -

Currently there isn't a way to map over bucket/key pairs or even key metadata in Riak; you have to take the entire object on each map function call. I'm getting ready to head out of town to teach a class at Erlang Factory next week, but I'm going to give this some more thought and see if there's something we can do to improve the situation in the short term.

--Kevin

Re: Speed of map/reduce queries

Kevin Smith-5
After giving this some more thought and reading some code last night, I think there are two approaches we could take to improve performance in the short term:

1) Replace erlang_js's JSON with a better encoding -- JSON is easy to read and debug, but it isn't especially fast or efficient, and replacing it could help performance. BERT (http://bert-rpc.org/) is a good candidate here. Rusty Klophaus, a fellow Basho-ite, has written a BERT encoder/decoder in JavaScript, which would certainly speed implementation.

2) Add an option to map/reduce job definitions letting phases specify whether they want the entire object body. This would save the cost of encoding and decoding potentially large object bodies for phases that aren't interested in the data.
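For option #2, the knob might look like a per-phase flag in the HTTP map/reduce job definition. The "include_data" field below is purely hypothetical, invented here to illustrate the idea; no such option exists today:

```json
{"inputs": "RiakLoader",
 "query": [
   {"map": {
      "language": "javascript",
      "source": "function(v, keyData, arg) { var re = new RegExp(arg); return v.key.match(re) ? [v.key] : []; }",
      "arg": "^12",
      "keep": true,
      "include_data": false
   }}
 ]}
```

A phase marked this way would receive only the bucket/key pair, which is enough for a key-matching function like this one, and the object-body encode/decode would be skipped entirely.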

I'm going to have some down time at Erlang Factory this week and plan to prototype #1 and possibly #2 while I'm there. I'll be sure to post my progress to the list if I get something working.

--Kevin

Re: Speed of map/reduce queries

Kevin Smith-5
Yup. I've had my eye on msgpack too, but I'm not sure I'll have time to write a msgpack impl in Erlang and do the additional prototyping while I'm at Erlang Factory. Certainly something to investigate, though.

Sent from my iPhone

On Mar 21, 2010, at 4:14 AM, Alvaro Videla <[hidden email]> wrote:

> Hi,
>
> On the MongoDB ML a discussion about encoding formats also came up, in that case related to their BSON format.
>
> Someone suggested "msgpack" (http://msgpack.sourceforge.net/), which looks pretty interesting.
>
> While there are no Erlang clients yet, I believe it is worth looking into.
>
> Regards,
>
> Alvaro

Re: Speed of map/reduce queries

Tuncer Ayaz

Kota UENISHI wrote a msgpack implementation:
http://bitbucket.org/kuenishi/messagepack-for-erlang/

Just to complete the list and raise awareness, I'll mention other alternatives:
 - Rusty's BinaryVice (Erlang impl available)
 - Mauricio Fernández's extprot
 - BitTorrent's bencoding (Erlang impl available)


Re: Speed of map/reduce queries

Alexander Sicular
In reply to this post by Kevin Smith-5
Might I suggest Apache's Avro? I believe it is being primed as the default encoding for Hadoop.

--
Sent from my mobile device


Re: Speed of map/reduce queries

John Lynch-2
In reply to this post by Kevin Smith-5
Kevin, I think option #2 would be very valuable for a lot of use cases. As they say, "No encoding is faster than no encoding".

Sent from my iPhone

On Mar 21, 2010, at 1:00 AM, Kevin Smith <[hidden email]> wrote:

> After giving this some more thought and reading some code last night  
> I think there are two approaches we could take to improve  
> performance in the short term:
>
> 1) Replace erlang_js' JSON with a better encoding -- JSON is easy to  
> read and debug but it isn't especially fast or efficient. Replacing  
> it with a better encoding could help performance. BERT (http://bert-rpc.org/ 
> ) is a good candidate here. Rusty Klophaus, a fellow Basho-ite, has  
> written a BERT encoder/decoder in Javascript which would certainly  
> speed implementation.
>
> 2) Add an option to map/reduce job definitions to allow phases to  
> specify when they do and do not want the entire object body. This  
> would save the cost of encoding and decoding potentially large  
> object bodies for phases which aren't interested in the data.
>
> I'm going to have some down time at Erlang Factory this week and  
> plan to prototype #1 and possibly #2 while I'm there. I'll be sure  
> to post my progress to the list if I get something working.
>
> --Kevin
Re: Speed of map/reduce queries

Jim Roepcke
+1 on both option #1 and (especially) option #2!

Jim

On Sun, Mar 21, 2010 at 10:18 AM, John Lynch <[hidden email]> wrote:

> Kevin, I think option #2 would be very valuable for a lot of use cases. As
> they say, "No encoding is faster than no encoding".
>
> Sent from my iPhone

Re: Speed of map/reduce queries

Jeff Hammerbacher
In reply to this post by Alexander Sicular
FWIW, an implementation of Avro in Erlang was begun but never completed, as no one really needed one at the time. If there is serious interest from the Basho community, I'm sure the project could be revived and pushed through to completion fairly quickly.

On Sun, Mar 21, 2010 at 9:30 AM, Alexander Sicular <[hidden email]> wrote:
Might I suggest Apache's Avro, which I believe is being primed as the default encoding for Hadoop.
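[Editor's note: to put rough numbers on the marshaling cost the thread keeps returning to, here is a self-contained Javascript sketch that JSON round-trips 10,000 ~3KB objects (standing in for the Erlang<->Spidermonkey hop) versus scanning keys alone. The timings are machine-dependent and this is not Riak code; only the relative gap matters.]

```javascript
// Rough sketch of the overhead under discussion: JSON-encoding and
// decoding 10,000 ~3KB objects versus touching only their keys.
// The JSON round trip stands in for the Erlang<->Spidermonkey boundary.

var objects = [];
for (var i = 0; i < 10000; i++) {
  objects.push({
    key: String(i),
    payload: new Array(3000).join("x") // ~3KB of body per object
  });
}

// Path 1: full marshal -- every object body is encoded and decoded.
var t0 = Date.now();
var hits1 = [];
objects.forEach(function (o) {
  var decoded = JSON.parse(JSON.stringify(o)); // simulated VM boundary
  if (/^12/.test(decoded.key)) hits1.push(decoded.key);
});
var fullMs = Date.now() - t0;

// Path 2: keys only -- no body ever crosses the boundary.
var t1 = Date.now();
var hits2 = [];
objects.forEach(function (o) {
  if (/^12/.test(o.key)) hits2.push(o.key);
});
var keysMs = Date.now() - t1;

console.log("full marshal: " + fullMs + "ms; keys only: " + keysMs +
            "ms; matches: " + hits1.length);
```

Both paths find the same matches; the difference is purely the per-object encode/decode work, which is what a faster encoding (BERT, Avro) would shrink and what option #2 would avoid entirely.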
