MR Output Changes

Jeremiah Peschka
While building out functionality in CorrugatedIron to coincide with the Riak 1.0.0 release, I've run into something a bit odd.

MapReduce phases with keep set to true are returning results in multiple RpbMapRedResp messages. I expected this, especially since the API says it's possible. What's causing the issue is the output format.

When I MR via the HTTP API, I get results like this:

[["riak_index_tests","5"],["riak_index_tests","6"],["riak_index_tests","7"],["riak_index_tests","4"],["riak_index_tests","8"],["riak_index_tests","3"],["riak_index_tests","2"],["riak_index_tests","0"],["riak_index_tests","1"],["riak_index_tests","9"]]

When I execute the same MR through the PB API I get results that look like this:

[["riak_index_tests","1"]][["riak_index_tests","6"]][["riak_index_tests","0"]][["riak_index_tests","7"]][["riak_index_tests","8"]][["riak_index_tests","4"]][["riak_index_tests","9"]][["riak_index_tests","2"]][["riak_index_tests","3"]][["riak_index_tests","5"]]
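To make the difference concrete, here is a minimal Python sketch (not part of the original exchange) showing that the PB output above is a run of back-to-back JSON arrays rather than one JSON document, and that flattening the arrays gives the same shape as the HTTP result. The helper name split_concatenated_json is hypothetical, the sample string is shortened to three chunks, and gluing the chunks into one string is only an assumption about how a client might end up with output like this.

```python
import json


def split_concatenated_json(text):
    """Yield each top-level JSON value from a string of back-to-back documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it ended.
        value, pos = decoder.raw_decode(text, pos)
        yield value


# Shortened sample in the shape of the PB output shown above.
pb_output = '[["riak_index_tests","1"]][["riak_index_tests","6"]][["riak_index_tests","0"]]'

chunks = list(split_concatenated_json(pb_output))
# Flatten the per-chunk arrays into one list, like the HTTP result.
merged = [pair for chunk in chunks for pair in chunk]
print(merged)
```

Calling json.loads on the whole string fails, because anything after the first array counts as extra data.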

The MR query itself, just to make sure I haven't done something wrong, is this:

curl -H "Content-Type: application/json" $RIAK/mapred -d @-
{"inputs":{"bucket":"riak_index_tests","index":"age_int","key":32},"query":[{"reduce":{"language":"erlang","module":"riak_kv_mapreduce","function":"reduce_identity","keep":true}}]}

and this:

{"inputs":{"bucket":"riak_index_tests","index":"age_int","key":32},"query":[{"reduce":{"language":"erlang","module":"riak_kv_mapreduce","function":"reduce_identity","keep":true}}]}

Is there a reason that Riak is returning discrete JSON chunks instead of streaming back raw bytes to the client? Is there a less-documented change in Riak 1.0 that will prevent this type of behavior (which, incidentally, didn't crop up until the 1.0 pre-release builds)?
---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: MR Output Changes

bryan-basho
Administrator
On Sat, Sep 24, 2011 at 4:39 PM, Jeremiah Peschka
<[hidden email]> wrote:
> MapReduce phases with keep set to true are returning results in multiple RpbMapRedResp messages. I expected this, especially since the API says it's possible. What's causing the issue is the output format.

> Is there a reason that Riak is returning discrete JSON chunks instead of streaming back raw bytes to the client? Is there a less-documented change in Riak 1.0 that will prevent this type of behavior (which, incidentally, didn't crop up until the 1.0 pre-release builds)?

Hi, Jeremiah.  We recently found a similar misunderstanding in the Java
client.  You read the API correctly, but the pre-1.0 MapReduce system
never exploited that feature, so clients implementing that behavior
didn't have much to test against.

The intent of returning discrete JSON chunks is to allow each response
chunk to be fully consumable without further information.  You should
be able to parse each response independently, instead of having to
wait for a "done" signal.  This encapsulation is also important when
multiple phases set keep=true, because it allows them to interleave
results in the response, instead of blocking on each other.

The HTTP interface uses a similar setup when streamed results are
requested, using the query parameter chunked=true.  Since all PB
results are "streamed", the same encapsulation is used.
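
As an illustration of consuming each chunk on its own, here is a minimal Python sketch of a client-side loop that parses every response as it arrives and groups results by phase, so interleaved phases do not block each other. MapRedResp is a hypothetical stand-in for a decoded RpbMapRedResp message (assumed to carry phase, response, and done fields), collect_by_phase is a made-up helper name, and the sketch assumes each response body is a JSON array.

```python
import json
from collections import defaultdict
from typing import Iterable, NamedTuple, Optional


class MapRedResp(NamedTuple):
    """Hypothetical stand-in for a decoded RpbMapRedResp message."""
    phase: Optional[int] = None
    response: Optional[bytes] = None
    done: bool = False


def collect_by_phase(messages: Iterable[MapRedResp]) -> dict:
    """Parse each chunk independently and append its items to that phase's list."""
    results = defaultdict(list)
    for msg in messages:
        if msg.response is not None:
            # Each chunk is complete JSON on its own (assumed here to be an array).
            results[msg.phase].extend(json.loads(msg.response))
        if msg.done:
            break
    return dict(results)


# Two phases with keep=true, interleaving their chunks in one response stream.
messages = [
    MapRedResp(phase=0, response=b'[["riak_index_tests","1"]]'),
    MapRedResp(phase=1, response=b'[1]'),
    MapRedResp(phase=0, response=b'[["riak_index_tests","6"]]'),
    MapRedResp(done=True),
]
by_phase = collect_by_phase(messages)
print(by_phase)
```

A client that wants to hand results to the caller as they arrive could yield the phase number and parsed chunk for each message instead of accumulating them.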

-Bryan

Re: MR Output Changes

Jeremiah Peschka
This makes sense to me.

Now I shall use this knowledge to take over the world. Or to re-write part of the CorrugatedIron MapReduce functionality.
---
Jeremiah Peschka - Founder, Brent Ozar PLF, LLC
Microsoft SQL Server MVP

On Sep 26, 2011, at 6:51 AM, Bryan Fink wrote:

> On Sat, Sep 24, 2011 at 4:39 PM, Jeremiah Peschka
> <[hidden email]> wrote:
>> MapReduce phases with keep set to true are returning results in multiple RpbMapRedResp messages. I expected this, especially since the API says it's possible. What's causing the issue is the output format.
> …
>> Is there a reason that Riak is returning discrete JSON chunks instead of streaming back raw bytes to the client? Is there a less-documented change in Riak 1.0 that will prevent this type of behavior (which, incidentally, didn't crop up until the 1.0 pre-release builds)?
>
> Hi, Jeremiah.  We recently found a similar misunderstanding in the Java
> client.  You read the API correctly, but the pre-1.0 MapReduce system
> never exploited that feature, so clients implementing that behavior
> didn't have much to test against.
>
> The intent of returning discrete JSON chunks is to allow each response
> chunk to be fully consumable without further information.  You should
> be able to parse each response independently, instead of having to
> wait for a "done" signal.  This encapsulation is also important when
> multiple phases set keep=true, because it allows them to interleave
> results in the response, instead of blocking on each other.
>
> The HTTP interface uses a similar setup when streamed results are
> requested, using the query parameter chunked=true.  Since all PB
> results are "streamed", the same encapsulation is used.
>
> -Bryan

