Map/Reduce, UTF-8 and Swedish high ASCII characters

Mårten Gustafson
Howdy chaps!

I've been struggling back and forth with trying to run map/reduce over
one of our datasets and I've stumbled on the error message at the
bottom of this mail. The message itself is pretty clear, I think:
"{bad_return_value,invalid_utf8}". The dataset is in Swedish and hence
we have a couple of high ASCII characters present, namely:

134: å
143: Å
132: ä
142: Ä
148: ö
153: Ö
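The numbers in this list match the single-byte positions of these letters in the old DOS code pages (CP437/CP850); UTF-8 encodes each of them as two bytes instead. A minimal sketch, assuming a Node.js runtime with the global TextDecoder available. The helper name isValidUtf8 is made up for illustration and is not part of Riak or node.js:

```javascript
// Hypothetical helper: true if the given byte values decode as strict UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (err) {
    return false;
  }
}

// UTF-8 byte sequences for the six letters from the list
// (written as escapes: å Å ä Ä ö Ö).
const letters = ['\u00e5', '\u00c5', '\u00e4', '\u00c4', '\u00f6', '\u00d6'];
const utf8Bytes = letters.map((ch) => Array.from(Buffer.from(ch, 'utf8')));

// The single-byte values from the list. A lone byte >= 128 is never valid UTF-8.
const singleByteCodes = [134, 143, 132, 142, 148, 153];
const singleByteResults = singleByteCodes.map((code) => isValidUtf8([code]));

console.log(utf8Bytes);
console.log(singleByteResults);
```

So if any of the listed byte values ever reached a strict UTF-8 decoder on its own, a rejection would be expected. Whether that is what happens here is not established at this point in the thread.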

The data comes from CouchDB, where it displays correctly in the browser
(Firefox detects UTF-8 and renders accordingly). I then ran it through
a node.js script that extracts the data from CouchDB and stores it in
Riak.
Pointing Firefox to the Riak URL for a given entry renders it
correctly (again, UTF-8 detection). However, when running my map/reduce
job, it bails out in the first phase, which applies the built-in
"Riak.mapValuesJson".
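For reference, a job of the kind described can be sketched as the JSON body sent to Riak's HTTP /mapred resource. The bucket name letterboxes and the key are taken from the error report further down. The function mapValuesJsonSketch is a hypothetical stand-in that only approximates what the built-in is understood to do (parse each stored value as JSON); it is not Riak's actual source:

```javascript
// Job description for Riak's HTTP map/reduce interface (sent via POST to /mapred).
const job = {
  inputs: 'letterboxes',
  query: [
    { map: { language: 'javascript', name: 'Riak.mapValuesJson' } },
  ],
};
const jobBody = JSON.stringify(job);

// Hypothetical stand-in for the built-in: parse the data of every value
// (sibling) of the object handed to the map phase.
function mapValuesJsonSketch(riakObject) {
  return riakObject.values.map((v) => JSON.parse(v.data));
}

// Assumed shape of the object a JavaScript map function receives:
// { bucket, key, values: [{ metadata, data }] }.
const sample = {
  bucket: 'letterboxes',
  key: '86288',
  values: [{ metadata: {}, data: '{"address":{"street":"Albyv\\u00e4gen 6"}}' }],
};
const mapped = mapValuesJsonSketch(sample);

console.log(jobBody);
console.log(mapped[0].address.street);
```

Run locally, the stand-in parses the sample without complaint, which suggests the JSON text itself is well formed and the failure sits somewhere between Riak and the JavaScript VM rather than in the parse step.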

As you can see below in the JSON data the value of the "street"
property is: "Albyv\\u00e4gen 6" where "u00e4" is
http://www.fileformat.info/info/unicode/char/00e4/index.htm

So there's a Unicode escape sequence there.
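To illustrate the escape sequence: the JSON text as it appears in the report is plain ASCII, and the ä only materializes once the text is parsed. The sketch below, assuming Node.js, also shows that serializing the parsed value again produces the raw character rather than the escape, so anything handling that output has to cope with real multi-byte UTF-8. Whether that is where the error arises is not confirmed in this thread:

```javascript
// Raw JSON text with the escape as stored: backslash, 'u', four hex digits.
const rawJson = '{"street":"Albyv\\u00e4gen 6"}';

// Every character of the raw text is below 128, so it is plain ASCII.
const rawIsAscii = [...rawJson].every((c) => c.charCodeAt(0) < 128);

// Parsing resolves the escape to the real character U+00E4.
const street = JSON.parse(rawJson).street;

// JSON.stringify does not re-escape non-ASCII characters, so the
// re-serialized text now contains the raw letter.
const reserialized = JSON.stringify(JSON.parse(rawJson));
const reserializedIsAscii = [...reserialized].every((c) => c.charCodeAt(0) < 128);

// In UTF-8 that one letter takes two bytes, so the byte count exceeds
// the character count by one.
const charCount = reserialized.length;
const byteCount = Buffer.byteLength(reserialized, 'utf8');

console.log(rawIsAscii, street, reserializedIsAscii, charCount, byteCount);
```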

So my humble question is, might there be a problem with the M/R and
"high ASCII characters"?



best, Mårten.


=ERROR REPORT==== 17-Feb-2010::15:37:17 ===
** Generic server <0.16457.0> terminating
** Last message in was {'$gen_cast',
                        {dispatch,<0.199.0>,
                         {<0.16457.0>,#Ref<0.0.0.49966>},
                         {<7082.27342.0>,
                          {map,{jsfun,<<"Riak.mapValuesJson">>},none,false},
                          {r_object,<<"letterboxes">>,<<"86288">>,
                           [{r_content,
                             {dict,5,16,16,8,80,48,
                              {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                               []},
                              {{[],[],
                                [[<<"Links">>]],
                                [],[],[],[],[],[],[],
                                [[<<"content-type">>,97,112,112,108,105,99,97,
                                  116,105,111,110,47,106,115,111,110],
                                 [<<"X-Riak-VTag">>,121,114,110,51,106,84,51,
                                  52,88,79,113,109,53,49,74,79,105,86,112,103,
                                  119]],
                                [],[],
                                [[<<"X-Riak-Last-Modified">>|
                                  {1266,416075,572221}]],
                                [],
                                [[<<"X-Riak-Meta">>]]}}},
                             <<"{\"id\":\"86288\",\"key\":\"86288\",\"value\":{\"rev\":\"1-f37125b006cad12ad53f211d868ede54\"},\"doc\":{\"_id\":\"86288\",\"_rev\":\"1-f37125b006cad12ad53f211d868ede54\",\"family\":\"letterboxes\",\"address\":{\"street\":\"Albyv\\u00e4gen 6\",\"streetInfo\":\"Raymons Spel\",\"zipcode\":14559,\"city\":\"Norsborg\"},\"east\":1616410.07,\"north\":6570347.22,\"boxes\":[{\"id\":109131,\"active\":{\"startDate\":\"20091019\",\"endDate\":\"\"},\"features\":{\"driveIn\":true,\"handicap\":true,\"lastMinute\":true,\"season\":true},\"emptied\":{\"weekday\":\"0\",\"weekend\":\"1800\"},\"localTime\":\"0\",\"regionalZipCode\":\"\",\"localZipCode\":\"\",\"exemptionText\":\"\",\"weekdays\":{\"monday\":false,\"tuesday\":false,\"wednesday\":true,\"thursday\":true,\"friday\":true}}]}}">>}],
                           [{<<4,133,159,90>>,{1,63433635275}}],
                           {dict,1,16,16,8,80,48,
                            {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                            {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                              [[clean|true]],
                              []}}},
                           undefined},
                          undefined,
                          {<<"letterboxes">>,<<"86288">>}}}}
** When Server state == {state,<0.100.0>,#Port<0.10797>,undefined,undefined}
** Reason for termination ==
** {bad_return_value,invalid_utf8}
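Parts of the report above are easier to read once decoded. Erlang prints some strings as lists of character codes, and the X-Riak-Last-Modified value looks like an Erlang now()-style {MegaSecs, Secs, MicroSecs} tuple (an assumption based on its shape). A small sketch, with helper names invented for illustration:

```javascript
// Turn an Erlang-style list of character codes into a string.
function charlistToString(codes) {
  return String.fromCharCode(...codes);
}

// Convert a {MegaSecs, Secs, MicroSecs} tuple into a JavaScript Date
// (millisecond precision, so the last three microsecond digits are dropped).
function nowTupleToDate([megaSecs, secs, microSecs]) {
  return new Date((megaSecs * 1e6 + secs) * 1000 + Math.floor(microSecs / 1000));
}

// Values copied from the metadata dict in the report.
const contentType = charlistToString(
  [97, 112, 112, 108, 105, 99, 97, 116, 105, 111, 110, 47, 106, 115, 111, 110]);
const vtag = charlistToString(
  [121, 114, 110, 51, 106, 84, 51, 52, 88, 79, 113, 109, 53, 49, 74, 79, 105,
   86, 112, 103, 119]);
const lastModified = nowTupleToDate([1266, 416075, 572221]);

console.log(contentType);
console.log(vtag);
console.log(lastModified.toISOString());
```

The content type decodes to application/json, so the object was stored with the expected media type, and the timestamp falls on the same day as the error report.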

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Map/Reduce, UTF-8 and Swedish high ASCII characters

Kevin Smith
Marten -

Is there any way I could get a small set of test data to use for debugging purposes? I have to step out for a bit but I'd like to dig into this problem soon.

--Kevin
On Feb 17, 2010, at 12:28 PM, Mårten Gustafson wrote:

> Howdy chaps!

Re: Map/Reduce, UTF-8 and Swedish high ASCII characters

Kevin Smith
I've been able to reproduce the error on my own. I'm investigating it now and will update the list when I have more info.

--Kevin
On Feb 17, 2010, at 12:36 PM, Kevin Smith wrote:

> Marten -
>
> Is there any way I could get a small set of test data to use for debugging purposes? I have to step out for a bit but I'd like to dig into this problem soon.
>
> --Kevin