mapreduce with non-existent keys


mapreduce with non-existent keys

Mark Boyd ソフトウェア 建築家

I’ve got a set of bucket/key pairs that may reference items that no longer exist in riak. Is it possible to pass that set to map/reduce and explicitly tell riak to ignore any pairs which aren’t current, i.e. which aren’t found? For example, if I have compiled a list of pairs, but one or more of those items was removed from the database before I passed the list, then my map/reduce appears to fail since it doesn’t find the referenced item. Can riak be told to ignore such missing items if they are encountered?

 

Thanks.

 

Mark



NOTICE: This email message is for the sole use of the intended recipient(s) and may contain confidential and privileged information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

RE: mapreduce with non-existent keys

Mark Boyd ソフトウェア 建築家

Never mind. I found the archive search page and this same question posted earlier here:

 

http://riak-users.197444.n3.nabble.com/Map-Reduce-behavior-when-key-not-found-td3641739.html

 

Mark

 


Re: mapreduce with non-existent keys

Mark Phillips-4
This info should be in the docs. I just added an issue to track it [1]. Thanks for digging. :) 

Mark
twitter.com/pharkmillups

[1] https://github.com/basho/riak_wiki/issues/316


RE: mapreduce with non-existent keys

Mark Boyd ソフトウェア 建築家

For those wishing to know how I solved this, and to shed some light on debugging map/reduce, here is what I did.

 

Background:

 

I’m dealing with a set of keys that are mirrored in two buckets, an authorization expression bucket and a protected objects bucket. My goal is to use map/reduce to evaluate the authz expressions for the passed-in keys and return only those protected objects for which a user is authorized. But authz expressions and protected objects themselves may not exist since they could be deleted while references to them may not have been cleaned up as yet.

 

For my input I have this array of bucket/key pairs from other processing. I have authz expressions for v1 to v4 but not for v5 to v8, and protected objects for v1 to v3 but not for v4 to v8.

 

[["authz", "v1"], ["authz", "v2"], ["authz", "v3"], ["authz", "v4"], … ["authz", "v8"]]

 

I’m using riak-js in node.js. My map reduce looked like this to begin with:

 

db.add(pairs)
    .map(evaluation.toMapReduceForm, { 'obj-bucket' : 'v2.tv', 'user-atts' : userAtts })
    .map('Riak.mapValuesJson') // converts the buckets and keys array into array of json objects
    .run(function(err, listOfViews) {
        if (err) {
            console.log("ERROR: Unable to obtain tvs for id '" + id + "'. Detail: " + JSON.stringify(err));
            send500ToClient(response);
            return;
        }
        callback(listOfViews);
    });

 

This results in the err object being the unhelpful {"statusCode":500}. Fortunately, I have an HTTP proxy that I wrote, “google wamulator”, which I configured so that all riak-js HTTP traffic to riak passes through it, exposing what goes across the wire. Here is what I saw:

 

{
  "phase":0,
  "error":"function_clause",
  "input":"{{error,notfound},{<<"v2.tv.authz">>,<<"v5">>},{struct,[{<<"type">>,<<"FALSE">>}]}}",
  "type":"error",
  "stack":"[{riak_kv_pipe_get,bkey,[{not_found,{<<"v2.tv.authz">>,<<"v5">>},{struct,[{<<"type">>,<<"FALSE">>}]}}]},{riak_kv_pipe_get,bkey_chash,1},{riak_pipe_vnode,queue_work,4},{riak_kv_mrc_map,send_results,2},{riak_pipe_vnode_worker,process_input,3},{riak_pipe_vnode_worker,wait_for_input,2},{gen_fsm,handle_msg,7},{proc_lib,init_p_do_apply,3}]"
}

 

This is where it got interesting. It appeared that riak wasn’t finding the authz object for the v5 key, so I assumed it was failing before even hitting my first map function. On the contrary, _it wasn’t_. On a whim I commented out the second map and the reduce portions and ran the query again. The following array was returned.

 

[
  [
    "v2.tv",
    "v1"
  ],
  {
    "not_found": {
      "bucket": "v2.tv.authz",
      "key": "v5",
      "keydata": "undefined"
    }
  },
  … more not_found objects, one for each missing key,
  [
    "v2.tv",
    "v3"
  ]
]

 

This gave me some great information:

 

1)      If I don’t have a reduce phase my objects returned from a map phase make it back to the client as-is. We can use that for debugging!

2)      I was getting these weird not_found objects included with my two objects (of three) for which the user was authorized.
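
When looking at such a raw phase result on the client, it can help to separate the marker objects from the real results. The following is only a minimal sketch, assuming the result shape shown above; partitionNotFound and the sample data are hypothetical and not part of riak-js.

```javascript
// Hypothetical helper, not part of riak-js: separates real phase results
// from the not_found marker objects seen in the raw output.
function partitionNotFound(results) {
  const found = [];
  const missing = [];
  for (const item of results) {
    const isMarker = item !== null && typeof item === 'object' &&
      !Array.isArray(item) && item.not_found !== undefined;
    if (isMarker) {
      // Record the missing item as "bucket/key" for easy logging.
      missing.push(item.not_found.bucket + '/' + item.not_found.key);
    } else {
      found.push(item);
    }
  }
  return { found, missing };
}

// Sample data modelled on the array returned by the single map phase.
const sample = [
  ['v2.tv', 'v1'],
  { not_found: { bucket: 'v2.tv.authz', key: 'v5', keydata: 'undefined' } },
  ['v2.tv', 'v3'],
];

const partitioned = partitionNotFound(sample);
console.log('found:', partitioned.found);
console.log('missing:', partitioned.missing);
```

Logging the missing list next to the found list makes it obvious which input pairs were stale.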

 

Now where did those not_found objects come from? After _much_ trial and error I came to the conclusion that each phase is passed an array. A map phase interprets that array as a list of bucket/key pairs. For each pair, the map phase looks up the corresponding item. If the item is not found, the map phase puts one of these not_found objects in its output array. If the item _is_ found, the phase passes it to the map function and puts any returned object into the output array.

 

Note that I said “returned object”, not “bucket/key pairs”. As noted in item 1 above, a map phase appears to craft the next input array without interpreting it; interpretation belongs to the next phase. If there is no next phase, the array propagates back to the client as-is, including any not_found objects for bucket/key pairs in the map’s input that had no matching item. In contrast, a reduce phase appears to take the incoming array as-is without treating its elements as bucket/key pairs.

 

Now back to my original error. The not_found error for the v5 key was coming from the second map phase, the mapValuesJson part. As noted, it tries to interpret the incoming array as bucket/key pair array objects and sees those not_found items and throws the error.
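
The behaviour described above can be mimicked with a toy model in plain JavaScript. This is only a sketch of my conclusions, not riak's implementation: the in-memory store, mapPhase, reducePhase and the 'objs' bucket name are all made up for illustration.

```javascript
// Toy model only; `store` is a hypothetical stand-in for riak lookups.
const store = {
  'authz/v1': { type: 'TRUE' },
  'authz/v2': { type: 'TRUE' },
  'objs/v1': { name: 'one' },
  'objs/v2': { name: 'two' },
};

// A map phase treats every input as a [bucket, key] pair, looks it up, and
// emits either the map function's results or a not_found marker.
function mapPhase(inputs, mapFn) {
  const out = [];
  for (const input of inputs) {
    if (!Array.isArray(input)) {
      // A marker left over from an earlier phase is not a bucket/key pair.
      throw new Error('cannot resolve input: ' + JSON.stringify(input));
    }
    const [bucket, key] = input;
    const value = store[bucket + '/' + key];
    if (value === undefined) {
      out.push({ not_found: { bucket: bucket, key: key } });
    } else {
      out.push(...mapFn(value, bucket, key));
    }
  }
  return out;
}

// A reduce phase receives the whole array as-is.
function reducePhase(values, reduceFn) {
  return reduceFn(values);
}

const dropNotFound = (values) => values.filter((v) => v.not_found === undefined);
const firstMap = (value, bucket, key) => [['objs', key]]; // emits new pairs
const secondMap = (value) => [value];                     // emits the object

const inputs = [['authz', 'v1'], ['authz', 'v5'], ['authz', 'v2']];
const firstOut = mapPhase(inputs, firstMap);

// Feeding the unfiltered output into a second map phase fails on the marker.
let secondMapFailed = false;
try {
  mapPhase(firstOut, secondMap);
} catch (e) {
  secondMapFailed = true;
}

// Filtering in a reduce phase between the two maps avoids the failure.
const finalOut = mapPhase(reducePhase(firstOut, dropNotFound), secondMap);
console.log(firstOut, secondMapFailed, finalOut);
```

In this model the failure only shows up in the second map phase, which matches where the error in my original query came from.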

 

So how did I solve this problem?

 

Riak has some predefined JavaScript functions that can be used in map/reduce, defined at https://github.com/basho/riak_kv/blob/master/priv/mapred_builtins.js. I noted that one of these, filterNotFound, had a single argument with a plural name, values. That led me to believe that it was meant solely for use in a reduce phase. So here is what I did. Notice that after each map phase that may be passed keys which don’t exist, I have a reduce phase that uses the filterNotFound function to pull those not_found objects from the array. The last one is there so that I don’t get not_found objects in the array returned from riak.

 

db.add(pairs)
    .map(evaluation.toMapReduceForm, { 'obj-bucket' : 'v2.tv', 'user-atts' : userAtts })
    .reduce('Riak.filterNotFound')
    .map('Riak.mapValuesJson') // converts the buckets and keys array into array of json objects
    .reduce('Riak.filterNotFound')
    .run(function(err, listOfViews) { // process on client the list of returned array objects
        if (err) {
            console.log("ERROR: Unable to obtain tvs for id '" + id + "'. Detail: " + JSON.stringify(err));
            send500ToClient(response);
            return;
        }
        callback(listOfViews);
    });

 

Yes, you can have multiple reduce steps and that solves the “not found” issue. Hope this helps.
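
For anyone comparing this against what a proxy shows on the wire, the chained call above should correspond roughly to a job document like the one built below for riak's HTTP /mapred endpoint. Treat it as a sketch under assumptions: buildJob is a hypothetical helper, the map source string is a placeholder for my real function, and the exact body that riak-js emits may differ.

```javascript
// Hypothetical helper: builds a map/reduce job document with a
// Riak.filterNotFound reduce phase after each map phase.
function buildJob(inputs, mapSource, mapArg) {
  const filterPhase = {
    reduce: { language: 'javascript', name: 'Riak.filterNotFound' },
  };
  return {
    inputs: inputs,
    query: [
      { map: { language: 'javascript', source: mapSource, arg: mapArg } },
      filterPhase,
      { map: { language: 'javascript', name: 'Riak.mapValuesJson' } },
      filterPhase,
    ],
  };
}

const job = buildJob(
  [['authz', 'v1'], ['authz', 'v5']],
  // Placeholder source; the real map function is not shown in this thread.
  'function(value, keyData, arg) { return []; }',
  { 'obj-bucket': 'v2.tv' }
);

console.log(JSON.stringify(job, null, 2));
```

Dropping trailing entries from the query array is the wire-level equivalent of commenting out later phases while debugging.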

 

Mark

 


RE: mapreduce with non-existent keys

Mark Boyd ソフトウェア 建築家

Can anyone familiar with the innards of riak describe how distribution of a map/reduce is handled when there are multiple reduce phases, as in the solution from my previous message (repeated below)? I’m assuming that the first map phase would spread to the nodes containing data for the incoming bucket/key combinations, with their output pulled back to the coordinating node for the first reduce phase. Then the second map phase would spread to (potentially different) nodes containing data for that phase’s incoming bucket/key combinations, with their output pulled back to the coordinating node for the final reduce phase.

 

Is that correct?

 

db.add(pairs)
    .map(evaluation.toMapReduceForm, { 'obj-bucket' : 'v2.tv', 'user-atts' : userAtts })
    .reduce('Riak.filterNotFound')
    .map('Riak.mapValuesJson') // converts the buckets and keys array into array of json objects
    .reduce('Riak.filterNotFound')
    .run(function(err, listOfViews) { // process on client the list of returned array objects
        if (err) {
            console.log("ERROR: Unable to obtain tvs for id '" + id + "'. Detail: " + JSON.stringify(err));
            send500ToClient(response);
            return;
        }
        callback(listOfViews);
    });

 

Mark

 


Re: mapreduce with non-existent keys

bryan-basho
Administrator
Wow, this question slipped by while I wasn't looking. Sorry about that.

On Mon, Jul 16, 2012 at 4:47 PM, Mark Boyd ソフトウェア 建築家
<[hidden email]> wrote:
> Can anyone familiar with the innards of riak describe how distribution of a
> map/reduce is handled when there are multiple reduce phases included as in
> this solution copied from below. I’m assuming that the first map phase would
> spread to nodes containing data for incoming bucket/key combinations and
> their output pulled back to the coordinating node for the first reduce
> phase. Then the second map phase would spread to (potentially different)
> nodes containing data for that phase’s incoming bucket/key combinations and
> their output pulled back to the coordinating node for the final reduce
> phase.

Exactly correct, Mark. Map is always spread to vnodes holding the
objects to be read/transformed. Reduce is always brought back to a
single node for aggregation. So you would have a
scatter-gather-scatter-gather pattern, just as you described.

Javascript map is quite limited in its handling of errors, as you
found. Erlang map phases, however, get an opportunity to handle the
notfound themselves. An example of what this looks like can be found
in the riak_kv_mapreduce:map_object_value/3 function:

https://github.com/basho/riak_kv/blob/master/src/riak_kv_mapreduce.erl#L81-99

So, if you find the intermediate aggregation of that filtering reduce
to be a problem, you could consider migrating to Erlang for your map
phase.

-Bryan
