A function as an input for map/reduce

Mikhail Sobolev
Hi,

I watched the recorded webinar "MapReducing Big Data With Riak and
Luwak"[1].  It relies on the ability to use a function ("dynamic
inputs" @ ~8:04 in the video) to generate the inputs for the query.

It raised two questions:

1. Is there any documentation about this kind of function?  (parameters,
   requirements, limitations, general comments)

2. Is there more information about "it can throw a few keys at a time,
   and the map/reduce chain would go ahead and start doing the
   processing on whatever keys it gets as soon as it gets them, it does
   not have to wait for the whole list of that function" (@ ~9:54 in the
   video)?  My concern is a chain of map/map/map/reduce/reduce phases.
   How is the processing actually performed?  What are the
   synchronization points?

--
Misha

[1] http://vimeo.com/20074937

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: A function as an input for map/reduce

Justin Sheehy
Hi, Mikhail.

On Tue, May 3, 2011 at 5:55 PM, Mikhail Sobolev <[hidden email]> wrote:

>   Is there more information about "it can throw a few keys at a time,
>   and the map/reduce chain would go ahead and start doing the
>   processing on whatever keys it gets as soon as it gets them, it does
>   not have to wait for the whole list of that function" (@ ~9:54 in the
>   video)?  My concern is a chain of map/map/map/reduce/reduce phases.
>   How is the processing actually performed?  What are the
>   synchronization points?

The "map" part of the MapReduce programming paradigm is not only
inherently parallel, it also does not impose a point of order on the
overall dataflow and thus does not introduce a concurrency barrier.
In practical terms this means that individual data items can be
processed as soon as they arrive, and the results can be immediately
pushed on to the next phase of the overall job without waiting for all
other data to make it through the map.

The "reduce" part does not have this pleasant property, as that phase
is present in order to perform exactly the kinds of operations (such
as counting) that do require waiting.
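
To make the difference concrete, here is a minimal sketch in plain Erlang. It is a hypothetical model (the module mr_flow_sketch and its functions are made up for illustration), not Riak's implementation: it runs sequentially and treats the reduce as a single call over the complete list, which is a simplification. What it shows is that each input can be carried through every map phase without knowing about the other inputs, while the reduce has to see everything the maps produced.

```erlang
-module(mr_flow_sketch).
-export([run/3]).

%% Hypothetical model of a map/map/.../reduce job; not Riak's implementation.
%% MapFuns   :: [fun((Item) -> [Result])], one fun per map phase
%% ReduceFun :: fun(([Result]) -> [Result])
run(Inputs, MapFuns, ReduceFun) ->
    %% Each input is carried through every map phase on its own; nothing
    %% in this step needs to know about the other inputs.
    Mapped = lists:flatmap(
               fun(Input) -> through_maps([Input], MapFuns) end,
               Inputs),
    %% The reduce is the synchronization point: it is handed the complete
    %% list of map results in one call.
    ReduceFun(Mapped).

%% Feed the results of one map phase into the next one.
through_maps(Items, []) ->
    Items;
through_maps(Items, [MapFun | Rest]) ->
    through_maps(lists:flatmap(MapFun, Items), Rest).
```

For example, mr_flow_sketch:run([1,2,3], [fun(X) -> [X, X*10] end, fun(X) -> [X+1] end], fun(L) -> [length(L)] end) returns [6]: every input yields two values after the two map phases, and the counting reduce can only answer once it has all six.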

-Justin

Re: A function as an input for map/reduce

Mikhail Sobolev
Hi Justin,

On Thu, May 05, 2011 at 10:26:19AM -0400, Justin Sheehy wrote:

> The "map" part of the MapReduce programming paradigm is not only
> inherently parallel, it also does not impose a point of order on the
> overall dataflow and thus does not introduce a concurrency barrier.
> In practical terms this means that individual data items can be
> processed as soon as they arrive, and the results can be immediately
> pushed on to the next phase of the overall job without waiting for all
> other data to make it through the map.
>
> The "reduce" part does not have this pleasant property, as that phase
> is present in order to perform exactly the kinds of operations (such
> as counting) that do require waiting.

Thank you for the description.  I now wonder whether a map function,
instead of returning the whole list of results, could do something that
Riak would take as "ah! another map result, let's pass it to the next
phase"?  In other words, instead of something like

    my_map_function(Object, _, _) ->
        object_to_list_of_values(Object).

do something like

    my_map_function(Object, _, _) ->
        produce_one(Object).

    produce_one(Object) ->
        ...
        emit(....), % following CouchDB syntax
        ...
        produce_one(Object').

--
Misha

Re: A function as an input for map/reduce

Justin Sheehy
Hi, Mikhail.

On Thu, May 5, 2011 at 5:15 PM, Mikhail Sobolev <[hidden email]> wrote:

> Thank you for the description.  I now wonder whether a map function,
> instead of returning the whole list of results, could do something that
> Riak would take as "ah! another map result, let's pass it to the next
> phase"?

It is quite possible in Riak to have a map phase followed by another
map phase.  You simply have to declare the job as having those phases,
each with its own map function.

The way you showed it wouldn't quite work, as it is the return value
-- not a side effect -- that a map function passes on to the following
phase.
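
As a rough sketch of the return-value style, assuming the stored values are binaries: the module name my_mr and the helper split_words/1 are hypothetical (the latter stands in for object_to_list_of_values/1, applied to the stored value), and the {error, notfound} clause reflects my understanding of what a map function is handed for a missing key. phases/0 shows how such a function could be declared as one phase of a job; additional map phases would be further {map, ...} entries in the same list.

```erlang
-module(my_mr).
-export([my_map_function/3, split_words/1, phases/0]).

%% Assumption: a missing key reaches the map function as {error, notfound}.
my_map_function({error, notfound}, _KeyData, _Arg) ->
    [];
my_map_function(Object, _KeyData, _Arg) ->
    %% The returned list, not any side effect, is what the next phase gets.
    %% riak_object is only available when this runs inside a Riak node.
    split_words(riak_object:get_value(Object)).

%% Hypothetical helper: split a binary value on spaces, dropping empty parts.
split_words(Value) when is_binary(Value) ->
    binary:split(Value, <<" ">>, [global, trim_all]).

%% Phase list in the {Type, FunSpec, Arg, Keep} form of the Erlang interface:
%% the map results feed a counting reduce, and only the reduce result is kept.
phases() ->
    [{map, {modfun, ?MODULE, my_map_function}, none, false},
     {reduce, {modfun, riak_kv_mapreduce, reduce_count_inputs}, none, true}].
```

Because each call returns a complete list for one object, there is no place for an emit-style call; results still flow onward per object, since the next phase can start on one object's list without waiting for the others.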

-Justin
