Some questions about Riak Search and Riak itself

Dmitry Demeshchuk
Greetings.

I spent some time this morning trying out Riak Search and have a few questions:

1. I tried to put some Erlang terms into a Riak bucket that is
indexed by Riak Search. I hoped that key-value lists like this

[
      {"name", <<"John Doe">>},
      {"location", <<"unknown">>},
      {"age", 64}
]

would be indexed, since they can be treated as a list of columns (and
they are almost the same as JSON decoded by mochijson2:decode()).
However, when I tried to put them, I got a pre-commit hook error:

DEBUG: riak_indexed_doc:196 - "{ expected_binaries , InFieldName , FieldValue }"

Is there a way to send Erlang proplists into Riak and process them
using Riak Search? Our model isn't very good for storing raw JSON
because generally we need to perform additional operations with the
values (filter some fields, change them and so on).


2. Is there a way to query the indexes of Erlang buckets through any API
other than the REST API? The only way I found to query the bucket was

/solr/some_bucket/select

and my attempts to use the Riak Search shell and the Erlang API just failed.


3. Is there a way to write custom analyzers in non-Java languages? I
saw the same question asked before, and the answer was that the analyzer
automatically tries to start a JVM for its needs. The problem is that we
don't have good Java and JVM developers, so it would be better to use
some other solution (OCaml or even C, for example). Also, I'm kinda
suspicious about the performance of Java analyzers.

4. Do you have any tips and advice about working with Unicode in Riak Search?

Thank you.

--
Best regards,
Dmitry Demeshchuk

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Some questions about Riak Search and Riak itself

bryan-basho
On Tue, Oct 12, 2010 at 3:16 AM, Dmitry Demeshchuk <[hidden email]> wrote:
> 1. I tried to put some Erlang terms into Riak bucket that is being
> indexed by Riak Search. I hoped that key-value lists like this
…snip…
> Is there a way to send Erlang proplists into Riak and process them
> using Riak Search?

Hi, Dmitry.  We've filed a bug for doing exactly this:

https://issues.basho.com/show_bug.cgi?id=788

In the meantime, you could also write your own extractor.  See the
"Other Data Encodings" section of using_search.org:

http://bitbucket.org/basho/riak_search/src/d1f10b876cae/doc/using_search.org#cl-985

Or on the wiki:

http://wiki.basho.com/display/RIAK/Riak+Search+-+Indexing+and+Querying+Riak+KV+Data#RiakSearch-IndexingandQueryingRiakKVData-OtherDataEncodings
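
To make the extractor idea a little more concrete: the pre-commit error
quoted in the first message (expected_binaries) suggests that the indexer
wants both field names and field values as binaries. Below is a minimal
sketch of that flattening step, written in Python purely for illustration.
The name extract_fields is hypothetical and not part of the Riak Search
API; a real extractor is an Erlang function whose signature is described
in the docs linked above.

```python
def extract_fields(pairs):
    """Flatten (name, value) pairs into (bytes, bytes) pairs.

    Hypothetical helper: it mimics what a custom extractor would have to
    do for a proplist such as [{"name", <<"John Doe">>}, {"age", 64}].
    """
    fields = []
    for name, value in pairs:
        # Field names arrive as text here; the indexer is assumed to want bytes.
        name_bytes = name if isinstance(name, bytes) else str(name).encode("utf-8")
        # Non-binary values (for example the integer age) are rendered as text first.
        value_bytes = value if isinstance(value, bytes) else str(value).encode("utf-8")
        fields.append((name_bytes, value_bytes))
    return fields


record = [("name", b"John Doe"), ("location", b"unknown"), ("age", 64)]
fields = extract_fields(record)
```

Whether integers should be indexed as their decimal text is a design
choice made here for the sketch, not something the thread specifies.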

> 2. Is there a way to query Erlang buckets indexes using any other APIs
> than REST API? The only way to query the bucket I found was
>
> /solr/some_bucket/select
>
> and my attempts of using Riak Search shell and Erlang API just failed.

If you could post details about the ways in which your attempts
failed (error messages, etc.), we might be able to help you
troubleshoot them.

The other main way of querying Search indexes is using the map/reduce
Search input.  The "Querying via HTTP/Curl" section has an example of
how to hook this up:

http://bitbucket.org/basho/riak_search/src/d1f10b876cae/doc/using_search.org#cl-783

http://wiki.basho.com/display/RIAK/Riak+Search+-+Querying#RiakSearch-Querying-QueryingviaHTTP%2FCurl

And it's also possible to specify the same map/reduce input using any
of the Erlang clients (native, protocol buffer, or http).  Though
there is a small bug with the non-streaming native Erlang client at
the moment (https://issues.basho.com/show_bug.cgi?id=803).  For an
example of using that syntax, have a look at the Wriaki project:

http://bitbucket.org/basho/wriaki/src/d2334be214ce/apps/wriaki/src/wiki_resource.erl#cl-267
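
For readers who want to see the shape of such a job, here is a hedged
sketch in Python that only builds the request body; it does not contact a
server. The module and function names in "inputs" and the /mapred
endpoint are the ones that appear in the curl example later in this
thread. The helper name search_mapred_body and the query string
"title:hello" are made up for the example.

```python
import json


def search_mapred_body(bucket, query):
    """Build a map/reduce job whose input is a Riak Search query."""
    return {
        "inputs": {
            "module": "riak_search",
            "function": "mapred_search",
            "arg": [bucket, query],
        },
        "query": [
            {"map": {"language": "javascript",
                     "source": "Riak.mapValues",
                     "keep": True}}
        ],
    }


body = search_mapred_body("posts", "title:hello")
# UTF-8 bytes, ready to POST to /mapred with Content-Type: application/json.
payload = json.dumps(body, ensure_ascii=False).encode("utf-8")
```

Passing ensure_ascii=False keeps non-ASCII query terms as UTF-8 in the
payload instead of \u escapes; either form is valid JSON.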

> 3. Is there a way to write custom analyzers in non-java languages? I
> saw the same question and found an answer that analyzer automatically
> tries to start JVM for its needs. The problem is that we don't have
> good Java and JVM developers so it would be better to use some other
> solutions (like OCaml or even C, for example). Also, I'm kinda
> suspicious about Java analyzers performance.

At the moment, the only non-Java language supported for custom
analyzers is Erlang.  You can specify an Erlang analyzer by adding an
"analyzer_factory" entry to your schema, of the form:

   {analyzer_factory, {erlang, my_module, my_function}}

Other formats for the analyzer_factory setting are:

   {erlang, my_module, my_function, Arguments}
   {java, FullyQualifiedClassNameAsString}
   {java, FullyQualifiedClassNameAsString, Arguments}
   FullyQualifiedClassNameAsString

The last format is demonstrated in the "Defining a Schema" section of the docs:

http://bitbucket.org/basho/riak_search/src/d1f10b876cae/doc/using_search.org#cl-193

http://wiki.basho.com/display/RIAK/Riak+Search+-+Schema#RiakSearch-Schema-DefiningaSchema

Unfortunately, we haven't written much documentation about what an
analyzer is expected to do, but hopefully between the comments in
qilr_analyzer, and the default Erlang analyzer,
text_analyzers:default_analyzer_factory/2, you'll be able to work out
some of what you need.

http://bitbucket.org/basho/riak_search/src/d1f10b876cae/apps/qilr/src/qilr_analyzer.erl#cl-53

http://bitbucket.org/basho/riak_search/src/d1f10b876cae/apps/qilr/src/text_analyzers.erl

> 4. Do you have any tips and advice about working with Unicode in Riak Search?

Encode everything in UTF-8.  There may still be a few bugs we need to
work out, but our intended goal is to have everything in that
department "just work" once you're using UTF-8 everywhere.

-Bryan

Re: Some questions about Riak Search and Riak itself

Dmitry Demeshchuk
Hi, Fink. Thank you for your reply. Here are some inline comments.

On Tue, Oct 12, 2010 at 9:42 PM, Bryan Fink <[hidden email]> wrote:

> On Tue, Oct 12, 2010 at 3:16 AM, Dmitry Demeshchuk <[hidden email]> wrote:
>> 2. Is there a way to query Erlang buckets indexes using any other APIs
>> than REST API? The only way to query the bucket I found was
>>
>> /solr/some_bucket/select
>>
>> and my attempts of using Riak Search shell and Erlang API just failed.
>
> If you could post details about the ways in which your attempts
> failed (error messages, etc.), we might be able to help you
> troubleshoot them.

I worked it out. Both the shell and command-line search work well. It
seems I was doing something wrong before.

>> 4. Do you have any tips and advice about working with Unicode in Riak Search?
>
> Encode everything in UTF-8.  There may still be a few bugs we need to
> work out, but our intended goal is to have everything in that
> department "just work" once you're using UTF-8 everywhere.

I'm not sure I'm doing everything right, but here is a step-by-step
description of my actions:

1.  curl -v -d "{\"title\":\"Статья 1\", \"tags\":\"псто, лытдыбр\", \"body\":\"Я что-то здесь написал\"}" \
      -H "Content-Type: application/json" \
      http://127.0.0.1:8098/riak/posts

(Note the Cyrillic characters.)

2. curl -X POST -H "content-type: application/json" \
     http://localhost:8098/mapred \
     -d '{"inputs":"posts", "query":[{"map":{"language":"javascript","source":"Riak.mapValues", "keep":true}}]}'

The result is:

["{\"title\":\"\u0421\u0442\u0430\u0442\u044c\u044f 1\",
\"tags\":\"\u043f\u0441\u0442\u043e,
\u043b\u044b\u0442\u0434\u044b\u0431\u0440\", \"body\":\"\u042f
\u0447\u0442\u043e-\u0442\u043e \u0437\u0434\u0435\u0441\u044c
\u043d\u0430\u043f\u0438\u0441\u0430\u043b\"}"]

So the Cyrillic strings were encoded properly by Riak itself (I'm not
sure whether that happens at the mochiweb level or somewhere else).
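
A quick way to confirm that the escaped output is the same text as the
original input is to decode it. The sketch below uses Python's standard
json module on the title value copied from the result above; it says
nothing about where in Riak the escaping happens.

```python
import json

# The title value from the map/reduce result above, as a JSON string literal.
escaped = r'"\u0421\u0442\u0430\u0442\u044c\u044f 1"'

# Decoding the \u escapes gives back the original Cyrillic text.
title = json.loads(escaped)

# The same text as UTF-8 bytes: two bytes per Cyrillic letter, one per ASCII character.
utf8_bytes = title.encode("utf-8")
```

So the \u escapes are only a JSON serialization detail; the stored text
is intact.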

3. curl -X POST -H "content-type: application/json" \
     http://localhost:8098/mapred \
     -d '{"inputs":{"module":"riak_search", "function":"mapred_search", "arg": ["posts", "title:Статья*"]}, "query":[{"map":{"language":"javascript","source":"Riak.mapValues", "keep":true}}]}'

This is a map-reduce Riak Search request. It's expected to return the
previously posted document. However, it returns an empty list.

4. I tried both the shell and command-line search, with the same result.

5. If I try to reproduce the same using Latin characters, everything
works fine. The JSON data may be partially Cyrillic; in that
case search works on the Latin fields only.


Am I doing something wrong? Should I encode the characters somehow before
I send them to Riak Search?


Thanks.


--
Best regards,
Dmitry Demeshchuk

Re: Some questions about Riak Search and Riak itself

bryan-basho
2010/10/13 Dmitry Demeshchuk <[hidden email]>:
> I worked it out. Both shell and command-line search work good. Seems
> like I've been doing something wrong before.

Excellent - good to hear.

>>> 4. Do you have any tips and advice about working with Unicode in Riak Search?
>>
>> Encode everything in UTF-8.  There may still be a few bugs we need to
>> work out, but our intended goal is to have everything in that
>> department "just work" once you're using UTF-8 everywhere.
>
> I'm not sure if I do everything right but here's the step-by step
> description of my actions:
>
> 1.  curl -v -d "{\"title\":\"Статья 1\", \"tags\":\"псто, лытдыбр\",
> \"body\":\"Я что-то здесь написал\"}" -H "Content-Type:
> application/json" http://127.0.0.1:8098/riak/posts
>
> (Note, there are cyrillic symbols)
>
...snip...
>
> 3. curl -X POST -H "content-type: application/json"
> http://localhost:8098/mapred -d '{"inputs":{"module":"riak_search",
> "function":"mapred_search", "arg": ["posts", "title:Статья*"]},
> "query":[{"map":{"language":"javascript","source":"Riak.mapValues",
> "keep":true}}]}'
>
...snip...
>
> Am I doing something wrong? Should I encode characters somehow before
> I send them into RiakSearch?

Thanks for the excellent test case - very easy to reproduce.  I
apologize for my delay in responding, but I wanted to make sure I had
all of my ducks in a row first.

So, no, you're doing nothing wrong.  The default, Erlang-based
analyzer is, in fact, just ignoring non-ASCII characters.  I've
created an issue to track the fix to that analyzer here:

   https://issues.basho.com/show_bug.cgi?id=814
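
To illustrate the failure mode (this is not the actual analyzer code,
only a toy model of it), compare a tokenizer that keeps ASCII letters and
digits with one that keeps word characters from any script. Both function
names are invented for the sketch, which is in Python.

```python
import re


def ascii_only_tokens(text):
    """Toy analyzer that keeps only runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unicode_tokens(text):
    """Toy analyzer that keeps runs of word characters in any script."""
    return re.findall(r"\w+", text.lower())


title = "Статья 1"
ascii_result = ascii_only_tokens(title)
unicode_result = unicode_tokens(title)
```

With the ASCII-only version the Cyrillic word never becomes a token, so
a query such as title:Статья* has nothing to match, which is consistent
with the empty result reported above.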

In the meantime, the easiest way to fix this issue is to use the
Java-based "DefaultAnalyzerFactory", which handles non-ASCII
characters correctly (in my tests, at least; I look forward to yours).
To use this analyzer, edit your schema file, and add the following
line to the first list in the schema:

   {analyzer_factory, "com.basho.search.analysis.DefaultAnalyzerFactory"}

(The example schemas on the wiki and in doc/using_search.org
demonstrate the proper placement of this line).  After editing, use
the bin/search-cmd script to update the schema:

   $RIAK/bin/search-cmd set_schema posts /path/to/your/schema.def

Riak Search should reindex any documents you have stored using the new
analyzer.  Try your map/reduce query again, and I think you'll find it
working.

-Bryan

Re: Some questions about Riak Search and Riak itself

Dmitry Demeshchuk
Hi, Bryan.

Thanks, works just perfect!

--
Best regards,
Dmitry Demeshchuk

Re: Some questions about Riak Search and Riak itself

bryan-basho
In reply to this post by bryan-basho
2010/10/13 Bryan Fink <[hidden email]>:
>   $RIAK/bin/search-cmd set_schema posts /path/to/your/schema.def
>
> Riak Search should reindex any documents you have stored using the new
> analyzer.

Oops.  Dan called me out on being inaccurate here.  After changing the
schema, Riak Search will reindex documents you have stored using the
new analyzer **the next time you store them**.  It doesn't
automatically reanalyze every document immediately after a schema
change.
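
The practical consequence is that existing documents have to be written
again before the new analyzer applies to them. The following toy model in
Python shows that behavior with an in-memory index; ToyIndex and both
analyzer functions are invented for the illustration and are not Riak
Search code. Against a real cluster the equivalent step would be to read
each object and store it back.

```python
import re


class ToyIndex:
    """In-memory stand-in for a search index; not Riak Search code."""

    def __init__(self, analyzer):
        self.analyzer = analyzer
        self.docs = {}      # key -> stored text
        self.postings = {}  # key -> set of indexed tokens

    def set_analyzer(self, analyzer):
        # Changing the analyzer does not touch existing postings.
        self.analyzer = analyzer

    def put(self, key, text):
        # Analysis happens at store time, with whatever analyzer is current.
        self.docs[key] = text
        self.postings[key] = set(self.analyzer(text))

    def search(self, token):
        return sorted(k for k, toks in self.postings.items() if token in toks)


def ascii_only(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def unicode_aware(text):
    return re.findall(r"\w+", text.lower())


index = ToyIndex(ascii_only)
index.put("post1", "Статья 1")
index.set_analyzer(unicode_aware)
before = index.search("статья")  # still indexed with the old analyzer

# Re-store every document so that it is analyzed again.
for key, text in list(index.docs.items()):
    index.put(key, text)
after = index.search("статья")
```

In the model, the search only finds the document after the re-store
loop, which mirrors the correction above.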

Sorry for the confusion.

-Bryan
