mapreduce and UTF8 strings - text oriented

mapreduce and UTF8 strings - text oriented

Tux Racer
Hello Riak Users!

I am new to Riak (and Erlang) and would like to know the best way to
deal with UTF-8 (or, more generally, Unicode) words.

For instance, if I take the wordcount example described at:


https://wiki.basho.com/display/RIAK/MapReduce

and assume I have a non English text PUT at

http://localhost:8098/riak/alice/p1

http://localhost:8098/riak/alice/p2

http://localhost:8098/riak/alice/p5

As non-English text we could take a French, Italian or German text, which contains non-ASCII characters (or even Chinese ;) but tokenizing Chinese is a research topic, isn't it?)

How could I tokenize the words?
It seems to me that both language options (JavaScript and Erlang) available for writing the mapreduce code have very poor Unicode support.
Also, I would like to normalize words by removing diacritics and accents:

Garçon -> garçon
Cet été là -> cet ete la

I know how to normalize those phrases in Java or Python, but not in JavaScript or Erlang. (Maybe the real question here is why JavaScript was chosen as the mapreduce language instead of a more powerful one, like Python?)

Also, do you have an estimated release date for the Riak search code? Is it based on Lucene or completely new?

Also it is not clear to me how/where map reduce results are stored and what happens when the documents are updated: for instance if I update
http://localhost:8098/riak/alice/p1

will the mapreduce job automatically update the word count results?

Sorry if some questions are trivial to answer, but I am still building Erlang, as Riak does not compile on my Debian machine :( See
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-December/000315.html for details. I have had no chance to test a Riak install yet.

Cheers
TuX



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: mapreduce and UTF8 strings - text oriented

bryan-basho
Administrator
On Tue, Mar 23, 2010 at 12:58 PM, TuX RaceR <[hidden email]> wrote:

> It seems to me that both language options (JavaScript and Erlang) available
> for writing the mapreduce code have very poor Unicode support.
> Also, I would like to normalize words by removing diacritics and accents:
>
> Garçon -> garçon
> Cet été là -> cet ete la
>
> I know how to normalize those phrases in Java or Python, but not in
> JavaScript or Erlang. (Maybe the real question here is why JavaScript was
> chosen as the mapreduce language instead of a more powerful one, like
> Python?)

Hi, TuX.  Other languages do have more library support for Unicode
string mangling.  Erlang and JavaScript have basic support for slicing
& dicing various UTF encodings, but as far as normalization and the
like go, they are a little lacking.

One option for Erlang is to use the code from the Starling project, a
driver for ICU:
http://code.google.com/p/starling/
http://site.icu-project.org/
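
For comparison, here is a minimal sketch of the normalization TuX describes, written in Python, one of the languages he says he already knows. It is not Riak code and not how Starling or ICU work internally; the function names `strip_accents` and `tokenize` are made up for illustration. It decomposes each character (NFD), drops the combining marks, then lowercases and splits on runs of word characters. Note that this also strips the cedilla, so "Garçon" becomes "garcon".

```python
import re
import unicodedata


def strip_accents(text):
    """Remove diacritics by decomposing and dropping combining marks."""
    # NFD splits e.g. "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


def tokenize(text):
    """Lowercase, strip accents, and split into word tokens."""
    # On Python 3 str objects, \w matches Unicode letters and digits.
    return re.findall(r"\w+", strip_accents(text).lower())


if __name__ == "__main__":
    print(tokenize("Cet été là"))
    print(tokenize("Garçon"))
```

Splitting on `\w+` is only reasonable for languages that separate words with spaces or punctuation; it does nothing useful for Chinese, as the original question already suspects.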

As for, "Why not use some other language?" the main answer is that
Javascript hit enough sweet spots that it made sense to start there:
trivial dynamic loading of code, VMs meant for embedding, native
handling of a common serialization format (JSON), wide-spread
developer familiarity.  Support for other languages in m/r queries has
not been ruled out at all.  In fact, the attachment points for
evaluating m/r functions are simple enough that I'd expect a
determined community member could probably whip up a pretty wicked
patch to support a new one.  ;)

> Also, do you have an estimated release date for the Riak search code?
> Is it based on Lucene or completely new?

We are not announcing any dates at this time.  It can be considered a
completely new thing, though lessons, interfaces, and/or libraries
from lucene may be included.

> Also it is not clear to me how/where map reduce results are stored and what
> happens when the documents are updated: for instance if I update
> http://localhost:8098/riak/alice/p1
>
> will the mapreduce job automatically update the word count results?

Map/reduce results are not stored beyond temporary caching.  The
results are computed at request time.  So, yes, the map/reduce result
will change as soon as you update an object it depends on.
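
To make the "computed at request time" point concrete, here is a toy in-memory model in Python. It is only an illustration of the idea, not Riak's API: the dict `store` stands in for the `alice` bucket, and `map_words`, `reduce_counts` and `word_count` are hypothetical names. Because nothing is persisted between calls, updating a document changes the next result.

```python
from collections import Counter


def map_words(text):
    """Map phase: one table of word counts per document."""
    return Counter(text.lower().split())


def reduce_counts(partials):
    """Reduce phase: merge the per-document tables into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(store, keys):
    """Run the whole job; every call re-reads the current values."""
    return reduce_counts(map_words(store[key]) for key in keys)


# A dict standing in for the bucket (assumption, not a Riak client).
store = {"p1": "down the rabbit hole", "p2": "the pool of tears"}
before = word_count(store, ["p1", "p2"])

store["p1"] = "the the the"  # update one document
after = word_count(store, ["p1", "p2"])  # recomputed from scratch
```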

> Sorry if some questions are trivial to answer, but I am still building
> Erlang, as Riak does not compile on my Debian machine :( See
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-December/000315.html
> for details. I have had no chance to test a Riak install yet.

I recommend abandoning the various build processes, and instead
downloading a pre-built release from http://downloads.basho.com/  The
correct Erlang distribution is included in those releases, so you
don't even need to build/install your own Erlang.

Hope that helps,
Bryan


Re: mapreduce and UTF8 strings - text oriented

Tux Racer
Thank you very much, Bryan, for the detailed answer.

About the languages: do you have any data on the performance
overhead of using JavaScript instead of Erlang?

Bryan Fink wrote:
> Map/reduce results are not stored beyond temporary caching.  The
> results are computed at request time.  So, yes, the map/reduce result
> will change as soon as you update an object it depends on.
>  

As the results are computed at query time, could you please say what
the typical maximum size of a mapreduce job written in Erlang would be
to still get a 'live' response (i.e. a response built in less than 0.1
second)? If I take the example mentioned earlier:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-October/000051.html
where the mapreduce job sorts links by date, what would be the typical
maximum number of links for responsive web browsing?
I am trying to understand the use cases of Riak mapreduce.
Also, if the mapreduce job is too big (too slow), I guess it would be
feasible to store the results in Riak itself to get faster answers (say,
by running the mapreduce job daily).

Also, still on the example
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-October/000051.html,
the date is encoded as a string, "2009-10-08 09:34:06". Is there not a
better way to store the link tags, e.g. using a 32-bit integer (Unix
timestamp) or a 64-bit integer (Java timestamp) if one wants better
time resolution?
Can Riak understand types other than strings, or are all parameters
assumed to be strings?

Thanks
TuX



Re: mapreduce and UTF8 strings - text oriented

bryan-basho
Administrator
On Thu, Mar 25, 2010 at 8:46 AM, TuX RaceR <[hidden email]> wrote:
> Thank you very much, Bryan, for the detailed answer.
>
> About the languages: do you have any data on the performance overhead of
> using JavaScript instead of Erlang?

This depends a great deal on the size of your data, as the
JSON-encoding in and out of Javascript can be one of the more costly
steps.  We're considering some options for improving this.

In the meantime, for many object sizes, we've found the "jsfun" syntax
(where the Javascript functions are defined in a file loaded at
startup) to be almost as fast as the "modfun" syntax (where the Erlang
functions are defined in a module loaded at startup).

The real overhead is best observed by loading some sample data from
your dataset, and benchmarking some representative queries against it.

>
> Bryan Fink wrote:
>>
>> Map/reduce results are not stored beyond temporary caching.  The
>> results are computed at request time.  So, yes, the map/reduce result
>> will change as soon as you update an object it depends on.
>>
>
> As the results are computed at query time, could you please say what
> the typical maximum size of a mapreduce job written in Erlang would be to
> still get a 'live' response (i.e. a response built in less than 0.1 second)?
> If I take the example mentioned earlier:
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-October/000051.html
> where the mapreduce job sorts links by date, what would be the typical
> maximum number of links for responsive web browsing?

This will vary widely depending on your cluster deployment.  On a
*very* early version of Riak (pre-open-source release), I saw a
four-node cluster serve link queries that made two to three hops with
intermediate objects each having 10-50 links (so, say 200-1000 objects
in the result set, typically).  The queries were fast enough that they
were computed and served directly to the app at request time.  After
profiling, we found that the [ajax] app spent more time rendering the
results than it did receiving them.

> I am trying to understand the use cases of Riak mapreduce.
> Also, if the mapreduce job is too big (too slow), I guess it would be
> feasible to store the results in Riak itself to get faster answers (say,
> by running the mapreduce job daily).

Also an excellent option.  There is some caching done by the
map/reduce framework, but if you can determine exactly what kind of
caching your app really needs, you'll almost certainly improve on it.
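
As a rough illustration of the application-level caching discussed here, the following Python sketch stores a computed result together with the time it was computed and reuses it until it is older than a chosen age. Everything in it is an assumption for illustration: the dict `store` stands in for a Riak bucket, and `cached_result` is a hypothetical helper, not part of any Riak client.

```python
import time


def cached_result(store, key, compute, max_age_seconds, now=None):
    """Return a stored result if it is fresh enough, else recompute it.

    `store` is any dict-like object (a stand-in for a bucket),
    `compute` is a zero-argument function that runs the expensive job.
    """
    now = time.time() if now is None else now
    entry = store.get(key)
    if entry is not None and now - entry["computed_at"] < max_age_seconds:
        return entry["value"]
    value = compute()
    # Save the result next to its timestamp so freshness can be checked.
    store[key] = {"value": value, "computed_at": now}
    return value
```

With `max_age_seconds` set to 86400, this behaves like the "run a mapreduce daily" idea: the first request after a day pays for the recomputation, and all other requests read the stored value.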

> Also, still on the example
> http://lists.basho.com/pipermail/riak-users_lists.basho.com/2009-October/000051.html,
> the date is encoded as a string, "2009-10-08 09:34:06". Is there not a better
> way to store the link tags, e.g. using a 32-bit integer (Unix timestamp) or a
> 64-bit integer (Java timestamp) if one wants better time resolution?
> Can Riak understand types other than strings, or are all parameters assumed
> to be strings?

Riak can absolutely store types other than strings.  Object data is
just a pile of bits.  However, when accessing Riak over the HTTP
interface, link tags are only allowed to be strings because of the
header encoding used for them.
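
Since the tag has to be a string over HTTP either way, one option is to keep the readable form or to store the Unix timestamp as a decimal string. The Python sketch below converts between the two; it assumes the tag is in UTC, which the original example does not state, and the names `tag_to_epoch` and `epoch_to_tag` are hypothetical. It is worth noting that fixed-width "YYYY-MM-DD HH:MM:SS" strings already sort chronologically when compared as plain strings.

```python
from datetime import datetime, timezone

TAG_FORMAT = "%Y-%m-%d %H:%M:%S"


def tag_to_epoch(tag):
    """Parse a date tag (assumed UTC) into whole seconds since 1970."""
    parsed = datetime.strptime(tag, TAG_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def epoch_to_tag(seconds):
    """Format seconds since 1970 back into the readable UTC tag."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(TAG_FORMAT)


tags = ["2009-10-08 09:34:06", "2009-02-01 00:00:00", "2010-03-25 08:46:00"]
# Plain string order and numeric timestamp order agree for this format.
same_order = sorted(tags) == sorted(tags, key=tag_to_epoch)
# The integer still has to travel as a string when used as a link tag.
epoch_tag = str(tag_to_epoch(tags[0]))
```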

-Bryan
