Help with a complex Map Reduce query

Help with a complex Map Reduce query

Julien Genestoux
Hello,

We store the following in our Riak cluster:
- feeds, as a list of 10 keys pointing to entries. All entry keys look like this: feedKey-entryKey
- entries, as complex JSON objects.

We try to avoid losing track of any entryKey by deleting it from the feed object only once the corresponding entry object has been deleted.
Yet, due to a bug in our implementation, we have 'lost' some entries. In other words, some feedKey-entryKey elements are not referenced by any feed object.

We're now trying to find the best way to "clean" that mess :)

Our initial solution was to list all the feed keys and then, for each one, issue a MapReduce job listing all entries whose keys start with that feedKey.
We can then compare the expected list of entryKeys (stored in the feed object) with the actual list of feedKey-* elements and delete the extra ones.
In practice, that would mean about 500,000 MapReduce jobs. We suspect this is not the right solution: since each MapReduce job takes about
10 seconds to complete, it could literally take weeks.
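
For reference, the per-feed comparison step would look roughly like this (a Python sketch assuming the 2.x Riak Python client; the bucket names and the 'entries' field inside the feed object are guesses about our own layout):

    import riak

    client = riak.RiakClient()
    feeds = client.bucket('feeds')      # assumed bucket name
    entries = client.bucket('entries')  # assumed bucket name

    def clean_feed(feed_key, actual_keys):
        """Delete entry objects the feed no longer references.

        actual_keys is the list of feedKey-entryKey keys actually found
        in the entries bucket (e.g. by the per-feed MapReduce job).
        """
        feed = feeds.get(feed_key)
        expected = set(feed.data['entries'])  # field name is assumed
        for key in set(actual_keys) - expected:
            entries.delete(key)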

We're now thinking there may be a better way. Maybe a single MapReduce job that iterates over all the entry keys and only keeps track
of the feedKeys that have more than 10 elements? That would cut the number of follow-up MapReduce jobs down very significantly, since we would run
them only on the few feedKeys (maybe 1%?) that have 'lost' entries.
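
Something like this, maybe? (A sketch using the Python client with JavaScript phases; untested, and the 'entries' bucket name is a placeholder. It assumes an entryKey never contains a '-', so everything before the last '-' is the feedKey.)

    import riak

    client = riak.RiakClient()

    # Map: emit {feedKey: 1} for every entry key.
    map_js = """
    function(v) {
      var i = v.key.lastIndexOf('-');
      var counts = {};
      counts[v.key.substring(0, i)] = 1;
      return [counts];
    }
    """

    # Reduce: merge the maps by summing counts. Safe under re-reduce
    # because the output has the same shape as the input.
    reduce_js = """
    function(values) {
      var acc = {};
      values.forEach(function(m) {
        for (var k in m) { acc[k] = (acc[k] || 0) + m[k]; }
      });
      return [acc];
    }
    """

    mr = riak.RiakMapReduce(client).add_bucket('entries')
    result = mr.map(map_js).reduce(reduce_js).run()[0]

    suspect_feeds = [fk for fk, n in result.items() if n > 10]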

Or maybe there is an even better way? Any ideas?

Thanks



Re: Help with a complex Map Reduce query

bryan-basho
Hi, Julien.

On Sat, Jun 1, 2013 at 5:27 PM, Julien Genestoux
<[hidden email]> wrote:
> Yet, due to a bug in our implementation, we have 'lost' some entries. In
> other words, some feedKey-entryKey elements are not referenced by any feed object.

> Our initial solution was to list all the feed keys and then, for each one,

Is it possible that there are feedKey-entryKey objects for which there
is no feed object? The problem as you described it made it sound like
the feed object always exists, but may just be missing an entry. I ask
if the feed object might be missing entirely, because if it is then
the initial solution you describe (listing all feed keys) won't work,
regardless of speed, because it won't find some of the entry key
prefixes. If this is the case, you have no choice but to list all
entry keys.

> We're now thinking there may be a better way. Maybe a single MapReduce
> job that iterates over all the entry keys and only keeps track of the
> feedKeys that have more than 10 elements? That would cut the number of
> follow-up MapReduce jobs down very significantly, since we would run
> them only on the few feedKeys (maybe 1%?) that have 'lost' entries.
>
> Or maybe there is an even better way? Any ideas?

I might suggest removing MapReduce from the equation entirely, and
listing keys straight to the client for processing. Trying to find
anything with "more than X instances" in a Riak MapReduce is a
difficult task, because you will have to build the entire result set
on one node. There is no way to trim it down as work progresses,
because you can't know whether or not you have seen all entries for a
feed until you have seen all entries, period. Thus, ignoring feeds
with 10 or fewer elements can't be done until the end of processing. If
the total number of feed objects is small, this may be possible, but
if not, then managing the large result set will be tricky at best (due
to timeouts, retries, etc.), and impossible with a JS reduce phase
(because of the time required to transfer the encoded data out to
SpiderMonkey and back).

Streaming all keys to a client is also expensive, but managing retries
after a timeout, or bugs in your sorting/filtering logic, will be much
simpler, since you won't have to worry about hammering the Riak cluster.
You can sort and re-sort that list locally, test the idea of finding
feedKeys with more than 10 elements, and compare it with other plans
before committing to additional cluster time.
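
A minimal sketch of that client-side pass (Python, assuming the 2.x client; the bucket name and the assumption that entryKeys contain no '-' are placeholders):

    from collections import Counter

    import riak

    client = riak.RiakClient()
    entries = client.bucket('entries')  # assumed bucket name

    counts = Counter()
    for batch in entries.stream_keys():  # streams keys in batches
        for key in batch:
            # Everything before the last '-' is the feedKey.
            counts[key.rsplit('-', 1)[0]] += 1

    # Feeds with more than 10 entry objects have 'lost' entries.
    suspects = sorted(fk for fk, n in counts.items() if n > 10)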

In addition, if you're using the eleveldb backend, then the next
release of Riak will bring the ability to paginate 2i results. So, you
could make streaming all keys to a client less punishing by requesting
just a few keys at a time from the '$bucket' index. This capability is
committed to our master branches, linked from
https://github.com/basho/riak_kv/pull/540
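
With that in place, the key-listing loop could look something like this (a sketch against the paginated 2i API in the newer Python client; the bucket name and the process() handler are hypothetical):

    import riak

    client = riak.RiakClient()
    entries = client.bucket('entries')  # assumed bucket name

    continuation = None
    while True:
        # '$bucket' is the implicit index containing every key in the
        # bucket; max_results caps each page (eleveldb backend only).
        page = entries.get_index('$bucket', entries.name,
                                 max_results=1000,
                                 continuation=continuation)
        for key in page:
            process(key)  # hypothetical per-key handler

        continuation = page.continuation
        if continuation is None:
            break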

HTH,
Bryan
