Solr search response time spikes

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Solr search response time spikes

sean mcevoy
Hi List,

We have a standard riak cluster with 5 nodes and at the minute the traffic levels are fairly low. Each of our application nodes has 25 client connections, 5 to each riak node which get selected in a round robin.

Our application level requests involve multiple riak requests so our traffic tends to make requests in small bursts. Everything works fine for KV gets, puts & deletes but we're seeing timeouts & weird response time spikes on solr search operations.

In the past 36 hours (the only period I have riak stats for) I see one response time of 38.8 seconds, 3 hours earlier a response time of 20.8 seconds, and the third biggest spike is an acceptable 3.5 seconds.

See below all search_query stats for the minute of the 38 sec sample. In the application request we made 5 riak search requests to the same index in parallel, which happens for each request of this type and normally doesn't have an issue. But in this case all 5 timed out, and one timed out again on retry with the other 4 succeeding.

Anyone ever seen anything like this before? Is there any known deadlock in solr that I might hit if I make the same request on another connection before the first has completed? This is what we do when our riak client times out after 2 seconds and immediately retries.

Any advice or pointers welcomed.
Thanks,
//Sean.


Riak node 1
search_query_throughput_one: 14
search_query_throughput_count: 259
search_query_latency_min: 2776
search_query_latency_median: 69411
search_query_latency_mean: 4900973
search_query_latency_max: 38887902
search_query_latency_999: 38887902
search_query_latency_99: 38887902
search_query_latency_95: 2046215
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 2
search_query_throughput_one: 22
search_query_throughput_count: 564
search_query_latency_min: 4006
search_query_latency_median: 8800
search_query_latency_mean: 11834
search_query_latency_max: 25509
search_query_latency_999: 25509
search_query_latency_99: 25509
search_query_latency_95: 24035
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 3
search_query_throughput_one: 6
search_query_throughput_count: 298
search_query_latency_min: 3200
search_query_latency_median: 15391
search_query_latency_mean: 18062
search_query_latency_max: 31759
search_query_latency_999: 31759
search_query_latency_99: 31759
search_query_latency_95: 31759
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 4
search_query_throughput_one: 8
search_query_throughput_count: 334
search_query_latency_min: 2404
search_query_latency_median: 7230
search_query_latency_mean: 10211
search_query_latency_max: 22502
search_query_latency_999: 22502
search_query_latency_99: 22502
search_query_latency_95: 22502
search_query_fail_one: 0
search_query_fail_count: 0

Riak node 5
search_query_throughput_one: 0
search_query_throughput_count: 0
search_query_latency_min: 0
search_query_latency_median: 0
search_query_latency_mean: 0
search_query_latency_max: 0
search_query_latency_999: 0
search_query_latency_99: 0
search_query_latency_95: 0
search_query_fail_one: 0
search_query_fail_count: 0


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Solr search response time spikes

Fred Dushin-2
It's pretty strange that you are seeing no search latency measurements on node 5.  Are you sure your round robining is working?  Are you favoring node 1?

In general, I don't think which node you hit for query should make a difference, but I'd have to stare at the code some to be sure.  In essence, all the node that services the query does is convert the query into a sharded Solr query based on a coverage plan, which changes every minute or so, and then runs the sharded query on the local Solr node.  The Solr node then distributes the query to the rest of the nodes in the cluster, but that's all Solr comms -- Riak is out of the picture, by then.

Now, if you have a lot of sharded queries accumulating on one node, that might make a difference to Solr.  I am not a Solr expert, and I don't even play one on TV.  But maybe the fact that you are not hitting node 5 is relevant for that reason?

Can you do more analysis on your client, to make sure you are not favoring node 1?

-Fred

> On Jun 22, 2017, at 10:20 AM, sean mcevoy <[hidden email]> wrote:
>
> Hi List,
>
> We have a standard riak cluster with 5 nodes and at the minute the traffic levels are fairly low. Each of our application nodes has 25 client connections, 5 to each riak node which get selected in a round robin.
>
> Our application level requests involve multiple riak requests so our traffic tends to make requests in small bursts. Everything works fine for KV gets, puts & deletes but we're seeing timeouts & weird response time spikes on solr search operations.
>
> In the past 36 hours (the only period I have riak stats for) I see one response time of 38.8 seconds, 3 hours earlier a response time of 20.8 seconds, and the third biggest spike is an acceptable 3.5 seconds.
>
> See below all search_query stats for the minute of the 38 sec sample. In the application request we made 5 riak search requests to the same index in parallel, which happens for each request of this type and normally doesn't have an issue. But in this case all 5 timed out, and one timed out again on retry with the other 4 succeeding.
>
> Anyone ever seen anything like this before? Is there any known deadlock in solr that I might hit if I make the same request on another connection before the first has completed? This is what we do when our riak client times out after 2 seconds and immediately retries.
>
> Any advice or pointers welcomed.
> Thanks,
> //Sean.
>
>
> Riak node 1
> search_query_throughput_one: 14
> search_query_throughput_count: 259
> search_query_latency_min: 2776
> search_query_latency_median: 69411
> search_query_latency_mean: 4900973
> search_query_latency_max: 38887902
> search_query_latency_999: 38887902
> search_query_latency_99: 38887902
> search_query_latency_95: 2046215
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 2
> search_query_throughput_one: 22
> search_query_throughput_count: 564
> search_query_latency_min: 4006
> search_query_latency_median: 8800
> search_query_latency_mean: 11834
> search_query_latency_max: 25509
> search_query_latency_999: 25509
> search_query_latency_99: 25509
> search_query_latency_95: 24035
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 3
> search_query_throughput_one: 6
> search_query_throughput_count: 298
> search_query_latency_min: 3200
> search_query_latency_median: 15391
> search_query_latency_mean: 18062
> search_query_latency_max: 31759
> search_query_latency_999: 31759
> search_query_latency_99: 31759
> search_query_latency_95: 31759
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 4
> search_query_throughput_one: 8
> search_query_throughput_count: 334
> search_query_latency_min: 2404
> search_query_latency_median: 7230
> search_query_latency_mean: 10211
> search_query_latency_max: 22502
> search_query_latency_999: 22502
> search_query_latency_99: 22502
> search_query_latency_95: 22502
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 5
> search_query_throughput_one: 0
> search_query_throughput_count: 0
> search_query_latency_min: 0
> search_query_latency_median: 0
> search_query_latency_mean: 0
> search_query_latency_max: 0
> search_query_latency_999: 0
> search_query_latency_99: 0
> search_query_latency_95: 0
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> _______________________________________________
> riak-users mailing list
> [hidden email]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Solr search response time spikes

sean mcevoy
Hi Fred,

Thanks for taking the time!
Yes, I noticed that unbalance yesterday when writing, looked into it after sending and found our config is corrupt with one node ommitted and another in there twice.
But, with such low traffic levels and the spikes being on the non-favoured node I'm not currently ranking that as a likely factor.


Another interesting case from last night, this sample was taken at 2017-6-23 06:04:09

Riak node 1
"search_query_throughput_one": 27
"search_query_latency_max": 10417

Riak node 2
"search_query_throughput_one": 49
"search_query_latency_max": 8952

Riak node 3
"search_query_throughput_one": 18
"search_query_throughput_count": 2507
"search_query_latency_min": 1757
"search_query_latency_median": 14775
"search_query_latency_mean": 5628361
"search_query_latency_max": 18298854
"search_query_latency_999": 18298854
"search_query_latency_99": 18298854
"search_query_latency_95": 16539782

Riak node 4
"search_query_throughput_one": 25
"search_query_latency_max": 10217


Brushing up my maths and focussing on node 3, from the 99 & 95% figures we can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in total.
And from the request count for the minute & the mean we can tell that in total these 18 requests spent a total of 101,310 ms being processed.
From the median & min we know the 9 quickest took between 18 & 265 ms in total.
This leaves in the region of 66 sec for the other 7 requests, enough for all 7 to have timed out.


Cross referencing with our application logs I can see:

On application node 1 at 2017-06-23 06:03:17 we had 3 search request timeouts to index A with 3 different filters, one field of which, lets call it field X, had the same value.
We immediately retried these and at 2017-06-23 06:03:19 2 of those timed out again and were retried again.
They all succeeded on this retry, so this suggests that the same requests sent to other riak nodes was fine, but to this riak node at this time was a problem.

On application node 2 at:
2017-06-23 06:03:27
2017-06-23 06:03:29
2017-06-23 06:03:31
2017-06-23 06:03:33

we had 4 more timeouts on search requests to index A, these requests had 2 different filters but in both cases field X had the same value as in the previous example.


So these application logs show 9 riak timeouts, that must correlate with the riak stats.
I can't definitively say that no other search requests went to this riak node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it had a problem for 18 seconds, which is quiet a big window.


The index that all these requests were directed at currently has 490K entries with 8 different fields defined in each. The corresponding riak bucket has allow_mult = false, if that's relevant.

We see a similar pattern on our test system, I'm going to setup a test to repeatedly do searches and see if I can trigger this consistently. Will let ye know if anything interesting comes out of it.

I know it's relatively new to the product, do we know is riak solr used much in production systems?
I assume no one else has seen these spikes?

//Sean.


On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin <[hidden email]> wrote:
It's pretty strange that you are seeing no search latency measurements on node 5.  Are you sure your round robining is working?  Are you favoring node 1?

In general, I don't think which node you hit for query should make a difference, but I'd have to stare at the code some to be sure.  In essence, all the node that services the query does is convert the query into a sharded Solr query based on a coverage plan, which changes every minute or so, and then runs the sharded query on the local Solr node.  The Solr node then distributes the query to the rest of the nodes in the cluster, but that's all Solr comms -- Riak is out of the picture, by then.

Now, if you have a lot of sharded queries accumulating on one node, that might make a difference to Solr.  I am not a Solr expert, and I don't even play one on TV.  But maybe the fact that you are not hitting node 5 is relevant for that reason?

Can you do more analysis on your client, to make sure you are not favoring node 1?

-Fred

> On Jun 22, 2017, at 10:20 AM, sean mcevoy <[hidden email]> wrote:
>
> Hi List,
>
> We have a standard riak cluster with 5 nodes and at the minute the traffic levels are fairly low. Each of our application nodes has 25 client connections, 5 to each riak node which get selected in a round robin.
>
> Our application level requests involve multiple riak requests so our traffic tends to make requests in small bursts. Everything works fine for KV gets, puts & deletes but we're seeing timeouts & weird response time spikes on solr search operations.
>
> In the past 36 hours (the only period I have riak stats for) I see one response time of 38.8 seconds, 3 hours earlier a response time of 20.8 seconds, and the third biggest spike is an acceptable 3.5 seconds.
>
> See below all search_query stats for the minute of the 38 sec sample. In the application request we made 5 riak search requests to the same index in parallel, which happens for each request of this type and normally doesn't have an issue. But in this case all 5 timed out, and one timed out again on retry with the other 4 succeeding.
>
> Anyone ever seen anything like this before? Is there any known deadlock in solr that I might hit if I make the same request on another connection before the first has completed? This is what we do when our riak client times out after 2 seconds and immediately retries.
>
> Any advice or pointers welcomed.
> Thanks,
> //Sean.
>
>
> Riak node 1
> search_query_throughput_one: 14
> search_query_throughput_count: 259
> search_query_latency_min: 2776
> search_query_latency_median: 69411
> search_query_latency_mean: 4900973
> search_query_latency_max: 38887902
> search_query_latency_999: 38887902
> search_query_latency_99: 38887902
> search_query_latency_95: 2046215
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 2
> search_query_throughput_one: 22
> search_query_throughput_count: 564
> search_query_latency_min: 4006
> search_query_latency_median: 8800
> search_query_latency_mean: 11834
> search_query_latency_max: 25509
> search_query_latency_999: 25509
> search_query_latency_99: 25509
> search_query_latency_95: 24035
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 3
> search_query_throughput_one: 6
> search_query_throughput_count: 298
> search_query_latency_min: 3200
> search_query_latency_median: 15391
> search_query_latency_mean: 18062
> search_query_latency_max: 31759
> search_query_latency_999: 31759
> search_query_latency_99: 31759
> search_query_latency_95: 31759
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 4
> search_query_throughput_one: 8
> search_query_throughput_count: 334
> search_query_latency_min: 2404
> search_query_latency_median: 7230
> search_query_latency_mean: 10211
> search_query_latency_max: 22502
> search_query_latency_999: 22502
> search_query_latency_99: 22502
> search_query_latency_95: 22502
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 5
> search_query_throughput_one: 0
> search_query_throughput_count: 0
> search_query_latency_min: 0
> search_query_latency_median: 0
> search_query_latency_mean: 0
> search_query_latency_max: 0
> search_query_latency_999: 0
> search_query_latency_99: 0
> search_query_latency_95: 0
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> _______________________________________________
> riak-users mailing list
> [hidden email]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: Solr search response time spikes

sean mcevoy
Hi List, Fred,

After a week of going cross-eyed looking at stats & trying to engineer a test case to make this happen in the test env I think I've made a breakthrough.

We have a low but steady level of riak traffic but our application level actions that result in solr reads are actually fairly infrequent. And when one of these actions occur it results in multiple parallel reads to our solr indexes.

What I've observed is that our timeouts are most easily reproduced after a period of inactivity. And once I see a timeout after 2 seconds I kick off multiple other reads to random keys and observe that some return instantly while others can take several seconds, but then all return at the same time.

It's almost as if some shards in the java VM have gone to sleep due to inactivity and we see a cluster of timeouts when we try to read from it.

I'm setting up a "pinger" script in our prod env to keep these awake and see if our observed timeout rate reduces.

If this is actually our problem are there any JVM config options we can use to keep the index active all the time?

//Sean.

On Fri, Jun 23, 2017 at 1:48 PM, sean mcevoy <[hidden email]> wrote:
Hi Fred,

Thanks for taking the time!
Yes, I noticed that unbalance yesterday when writing, looked into it after sending and found our config is corrupt with one node ommitted and another in there twice.
But, with such low traffic levels and the spikes being on the non-favoured node I'm not currently ranking that as a likely factor.


Another interesting case from last night, this sample was taken at 2017-6-23 06:04:09

Riak node 1
"search_query_throughput_one": 27
"search_query_latency_max": 10417

Riak node 2
"search_query_throughput_one": 49
"search_query_latency_max": 8952

Riak node 3
"search_query_throughput_one": 18
"search_query_throughput_count": 2507
"search_query_latency_min": 1757
"search_query_latency_median": 14775
"search_query_latency_mean": 5628361
"search_query_latency_max": 18298854
"search_query_latency_999": 18298854
"search_query_latency_99": 18298854
"search_query_latency_95": 16539782

Riak node 4
"search_query_throughput_one": 25
"search_query_latency_max": 10217


Brushing up my maths and focussing on node 3, from the 99 & 95% figures we can tell the 2 slowest response times were 18,298 & 16,539ms, 34,837 ms in total.
And from the request count for the minute & the mean we can tell that in total these 18 requests spent a total of 101,310 ms being processed.
From the median & min we know the 9 quickest took between 18 & 265 ms in total.
This leaves in the region of 66 sec for the other 7 requests, enough for all 7 to have timed out.


Cross referencing with our application logs I can see:

On application node 1 at 2017-06-23 06:03:17 we had 3 search request timeouts to index A with 3 different filters, one field of which, lets call it field X, had the same value.
We immediately retried these and at 2017-06-23 06:03:19 2 of those timed out again and were retried again.
They all succeeded on this retry, so this suggests that the same requests sent to other riak nodes was fine, but to this riak node at this time was a problem.

On application node 2 at:
2017-06-23 06:03:27
2017-06-23 06:03:29
2017-06-23 06:03:31
2017-06-23 06:03:33

we had 4 more timeouts on search requests to index A, these requests had 2 different filters but in both cases field X had the same value as in the previous example.


So these application logs show 9 riak timeouts, that must correlate with the riak stats.
I can't definitively say that no other search requests went to this riak node between 06:03:15 & 06:03:33 but the circumstantial evidence is that it had a problem for 18 seconds, which is quiet a big window.


The index that all these requests were directed at currently has 490K entries with 8 different fields defined in each. The corresponding riak bucket has allow_mult = false, if that's relevant.

We see a similar pattern on our test system, I'm going to setup a test to repeatedly do searches and see if I can trigger this consistently. Will let ye know if anything interesting comes out of it.

I know it's relatively new to the product, do we know is riak solr used much in production systems?
I assume no one else has seen these spikes?

//Sean.


On Thu, Jun 22, 2017 at 9:40 PM, Fred Dushin <[hidden email]> wrote:
It's pretty strange that you are seeing no search latency measurements on node 5.  Are you sure your round robining is working?  Are you favoring node 1?

In general, I don't think which node you hit for query should make a difference, but I'd have to stare at the code some to be sure.  In essence, all the node that services the query does is convert the query into a sharded Solr query based on a coverage plan, which changes every minute or so, and then runs the sharded query on the local Solr node.  The Solr node then distributes the query to the rest of the nodes in the cluster, but that's all Solr comms -- Riak is out of the picture, by then.

Now, if you have a lot of sharded queries accumulating on one node, that might make a difference to Solr.  I am not a Solr expert, and I don't even play one on TV.  But maybe the fact that you are not hitting node 5 is relevant for that reason?

Can you do more analysis on your client, to make sure you are not favoring node 1?

-Fred

> On Jun 22, 2017, at 10:20 AM, sean mcevoy <[hidden email]> wrote:
>
> Hi List,
>
> We have a standard riak cluster with 5 nodes and at the minute the traffic levels are fairly low. Each of our application nodes has 25 client connections, 5 to each riak node which get selected in a round robin.
>
> Our application level requests involve multiple riak requests so our traffic tends to make requests in small bursts. Everything works fine for KV gets, puts & deletes but we're seeing timeouts & weird response time spikes on solr search operations.
>
> In the past 36 hours (the only period I have riak stats for) I see one response time of 38.8 seconds, 3 hours earlier a response time of 20.8 seconds, and the third biggest spike is an acceptable 3.5 seconds.
>
> See below all search_query stats for the minute of the 38 sec sample. In the application request we made 5 riak search requests to the same index in parallel, which happens for each request of this type and normally doesn't have an issue. But in this case all 5 timed out, and one timed out again on retry with the other 4 succeeding.
>
> Anyone ever seen anything like this before? Is there any known deadlock in solr that I might hit if I make the same request on another connection before the first has completed? This is what we do when our riak client times out after 2 seconds and immediately retries.
>
> Any advice or pointers welcomed.
> Thanks,
> //Sean.
>
>
> Riak node 1
> search_query_throughput_one: 14
> search_query_throughput_count: 259
> search_query_latency_min: 2776
> search_query_latency_median: 69411
> search_query_latency_mean: 4900973
> search_query_latency_max: 38887902
> search_query_latency_999: 38887902
> search_query_latency_99: 38887902
> search_query_latency_95: 2046215
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 2
> search_query_throughput_one: 22
> search_query_throughput_count: 564
> search_query_latency_min: 4006
> search_query_latency_median: 8800
> search_query_latency_mean: 11834
> search_query_latency_max: 25509
> search_query_latency_999: 25509
> search_query_latency_99: 25509
> search_query_latency_95: 24035
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 3
> search_query_throughput_one: 6
> search_query_throughput_count: 298
> search_query_latency_min: 3200
> search_query_latency_median: 15391
> search_query_latency_mean: 18062
> search_query_latency_max: 31759
> search_query_latency_999: 31759
> search_query_latency_99: 31759
> search_query_latency_95: 31759
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 4
> search_query_throughput_one: 8
> search_query_throughput_count: 334
> search_query_latency_min: 2404
> search_query_latency_median: 7230
> search_query_latency_mean: 10211
> search_query_latency_max: 22502
> search_query_latency_999: 22502
> search_query_latency_99: 22502
> search_query_latency_95: 22502
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> Riak node 5
> search_query_throughput_one: 0
> search_query_throughput_count: 0
> search_query_latency_min: 0
> search_query_latency_median: 0
> search_query_latency_mean: 0
> search_query_latency_max: 0
> search_query_latency_999: 0
> search_query_latency_99: 0
> search_query_latency_95: 0
> search_query_fail_one: 0
> search_query_fail_count: 0
>
> _______________________________________________
> riak-users mailing list
> [hidden email]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com



_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Loading...