Solr error message


Solr error message

Jim Raney
Hello,

We're seeing the following error in Riak/Yokozuna:

2016-04-11 19:36:18.803 [error]
<0.23120.8>@yz_pb_search:maybe_process:84 "Failed to determine Solr port
for all nodes in search plan"
[{lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,448}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,421}]},
 {lager_trunc_io,alist,3,[{file,"src/lager_trunc_io.erl"},{line,418}]},
 {lager_trunc_io,print,3,[{file,"src/lager_trunc_io.erl"},{line,168}]}]

This is a 7-node cluster running the 2.1.3 RPM on CentOS 7, in Google
Cloud, on 16-CPU/60GB-RAM VMs.  They are configured with LevelDB, with
a 500GB SSD for the first four tiers and a 2TB magnetic disk for the
remainder.  IOPS/throughput are not an issue for our application.

There is a uWSGI-based REST service that sits in front of Riak and
contains all of the application logic.  The testing suite (Locust) loads
binary data files that the uWSGI service processes and inserts into
Riak.  As part of that processing, Yokozuna indexes get searched.

We find that ~40 minutes to an hour into load testing we start seeing
the above error logged (leading to 500s from Locust's perspective).  It
corresponds with a rise in the Search Query Fail Count stat, which we
graph with Zabbix.  Over time the number gets larger and larger, and
after about an hour of load testing it starts to curve upwards sharply.
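
(The stat we graph is search_query_fail_count, pulled from Riak's HTTP
stats endpoint - this assumes the default HTTP listener on port 8098:

curl -s http://localhost:8098/stats | python -m json.tool | grep search_query_fail

)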

In riak.conf we have:

search = on
search.solr.start_timeout = 120s
search.solr.port = 8093
search.solr.jmx_port = 8985
search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache
-XX:+UseCompressedOops

and we are using java-1.7.0-openjdk-1.7.0.99-2.6.5.0.el7_2.x86_64 from
the CentOS repos.  I've been graphing JMX stats with Zabbix and nothing
looks untoward: the heap gradually climbs in size but never skyrockets,
and certainly doesn't come close to the 16GB cap (it barely gets above
3GB before things really go south).  With jconsole I see the same
numbers, with a gradually increasing cumulative garbage-collection time
(last recorded was "23.751 seconds on PS Scavenge (640 collections)"),
although it's hard to tell whether there are any long GC pauses.
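
One thing I haven't tried yet is turning on the JVM's own GC log, so
pauses would show up directly rather than being inferred from JMX.
These are standard HotSpot 1.7 flags; the log path is just an example:

search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseStringCache
-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails
-XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
-Xloggc:/var/log/riak/solr_gc.log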

We graph a bunch of additional stats in zabbix, and the boxes in the
cluster never get close to capping out CPU or running out of RAM.

I googled around and couldn't find any reference to the logged error.
Does it have to do with Solr having a problem contacting other nodes in
the cluster?  Or is it some kind of node-lookup issue?

--
Jim Raney


Re: Solr error message

Fred Dushin
Hi Jim,

Interesting problem.

That error is raised where yokozuna compares the Solr port mapping it
has collected against the set of nodes in the search plan: it fires
because length(Mapping) and length(UniqNodes) are unequal.  This might
be because you are getting timeouts trying to query the port on remote
nodes; there is a hard-wired 1-second timeout on that RPC call, which
could account for why you only see this failure well into a load run.
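
The shape of the failing code is roughly the following (an illustrative
Erlang sketch, not the literal yokozuna source; the module and function
in the rpc call are approximations):

%% Ask every node in the coverage plan for its Solr port. Nodes that
%% fail to answer within the hard-wired 1000 ms land in the second
%% element of the result and are absent from Mapping, so the length
%% check below fails and the error is logged.
UniqNodes = lists:usort(Nodes),
{Mapping, _DownNodes} =
    rpc:multicall(UniqNodes, yz_solr_proc, get_port, [], 1000),
case length(Mapping) =:= length(UniqNodes) of
    true  -> {ok, Mapping};
    false -> {error, "Failed to determine Solr port for all nodes in search plan"}
end.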

You might try to rebuild a version of this module with an increased timeout, to see if that gets you over the hump, or consider making the timeout configurable.
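
A configurable version could pull the timeout from the application
environment instead of the literal (again just a sketch: the
solr_port_rpc_timeout key is a name I made up, and wiring it through to
riak.conf via cuttlefish would be a separate step):

%% app_helper:get_env/3 is riak_core's env lookup with a default.
Timeout = app_helper:get_env(yokozuna, solr_port_rpc_timeout, 1000),
{Mapping, _DownNodes} =
    rpc:multicall(UniqNodes, yz_solr_proc, get_port, [], Timeout),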

Riak 2.1.3 ships with yokozuna 2.1.2, whose git SHA is 3520d11ec21ee08b7c18478fbbe1b61d7e3d8e0f, so you'd want to branch off that point of the tree if you care to experiment.

If you rebuild the module, you can place the generated .beam file in the lib/basho-patches directory of each of your Riak installs and restart Riak (or manually re-load the module on each node via the Riak console, if you need to keep your Riak nodes up and running).
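
For the re-load route, something like this from `riak attach` on each
node should do it (substitute whichever module you actually rebuilt for
yz_solr_proc; basho-patches sits at the front of Riak's code path, so
load_file should pick up the patched beam):

(riak@node1)1> code:purge(yz_solr_proc).
false
(riak@node1)2> code:load_file(yz_solr_proc).
{module,yz_solr_proc}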

Let us know what you find or if you need more assistance.

-Fred

On Apr 11, 2016, at 4:11 PM, Jim Raney <[hidden email]> wrote:

Failed to determine Solr port for all nodes in search plan


Re: Solr error message

Jim Raney


On Apr 11, 2016, at 3:35 PM, Fred Dushin <[hidden email]> wrote:

> [full quote trimmed; see Fred's message above]

Fred,

Thanks for the quick response.  After you basically verified that it was a Solr timeout issue, I rebuilt the cluster with 14 nodes to see what would happen.  The amount of time it took for the query fails (and associated log entries) to appear basically doubled as well.

I -could- try increasing the hard-coded timeout, but I don't think that's the route we want to go: it is likely this system will have that much data or more being pushed into it, and long query times won't work.  I imagine there is probably some Solr tuning we can do - any ideas on what we could look at that we could pass through the Riak config?

I'm going to try an Oracle 1.8 JDK with it later and see if any GC tuning helps, in case there are long GC pauses.
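
As a starting point I'll probably swap the collector for G1 and set a
pause target, something like the below.  (Note for anyone following
along: I believe -XX:+UseStringCache was removed in Java 8, so it has
to be dropped or the JVM won't start.  The G1 settings here are just a
first guess, not a recommendation.)

search.solr.jvm_options = -d64 -Xms2g -Xmx16g -XX:+UseCompressedOops
-XX:+UseG1GC -XX:MaxGCPauseMillis=200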

--
Jim Raney


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com