Long delays when trying to recover from errors in Java client at startup

Toby Corkindale-2
Hi,
I've been trying to make a JVM-based app have better error recovery when the Riak cluster is still in a starting-up state.
I have a fairly naive wait-loop that tries to connect and list buckets, and if there's an exception, retry again after a short delay.

However, once the Riak cluster comes up, the Java client hangs on the first operation it makes for a really long time: minutes.
 -- in particular, at com.basho.riak.client.core.RiakCluster.retryOperation(RiakCluster.java:479)

I've tried shutting down and recreating the RiakClient between attempts, but this doesn't seem to help.
I guess the node manager has its own back-offs and delays. Is there a way to reduce these timeouts?

Thanks,
Toby

"pool-4-thread-1" #102 prio=5 os_prio=0 tid=0x00007f813478e000 nid=0x5517 waiting on condition [0x00007f8110e2d000]
  java.lang.Thread.State: WAITING (parking)
       at sun.misc.Unsafe.park(Native Method)
       - parking to wait for  <0x000000076d707b38> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
       at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
       at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
       at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
       at com.basho.riak.client.core.RiakCluster.retryOperation(RiakCluster.java:479)
       at com.basho.riak.client.core.RiakCluster.access$1000(RiakCluster.java:44)
       at com.basho.riak.client.core.RiakCluster$RetryTask.run(RiakCluster.java:580)
       at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
       at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       at java.lang.Thread.run(Thread.java:748)
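
The dump above shows the retry thread parked in `LinkedBlockingQueue.take()`, which waits with no upper bound until another thread adds an element. As a minimal standalone sketch (this is not the Riak client's code; the class and method names are made up for illustration), the following contrasts it with `poll(timeout, unit)`, which gives up after a deadline:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class BoundedWaitDemo {
    // Hypothetical helper: waits at most timeoutMillis for an element and
    // returns null on timeout, instead of parking indefinitely like take().
    static String pollWithTimeout(LinkedBlockingQueue<String> queue, long timeoutMillis)
            throws InterruptedException {
        return queue.poll(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<>();

        // Calling queue.take() here would block until another thread adds an element.
        String fromEmpty = pollWithTimeout(queue, 200);
        System.out.println("empty queue -> " + fromEmpty);

        queue.put("operation");
        String fromNonEmpty = pollWithTimeout(queue, 200);
        System.out.println("non-empty queue -> " + fromNonEmpty);
    }
}
```

This only illustrates why a thread in this state can sit idle for a long time; it does not say what the client is waiting for.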


_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Long delays (8min) when trying to recover from errors in Java client at startup

Toby Corkindale-2
Thought I'd fill in some extra info in case it helps:

$ time ./target/pack/bin/launcher
WARNING: Riak connection bad, retrying... 30 tries remaining
Node state: HEALTH_CHECKING
WARNING: Riak connection bad, retrying... 29 tries remaining
Node state: HEALTH_CHECKING
WARNING: Riak connection bad, retrying... 28 tries remaining
Node state: RUNNING
.....
Then we hang for almost exactly eight minutes!
Then things continue fine!

It's those eight minutes of sleeping that I really want to fix -- as timeouts go, that is just way too long!

Thanks for any advice,
Toby

Re: Long delays when trying to recover from errors in Java client at startup

Magnus Kessler
Hi Toby,

Using bucket listing as a method to determine liveness is a really bad idea. Bucket listing, just like key listing, requires a coverage query across ALL objects stored in the cluster, and will take a really long time if the cluster contains many objects.

A better alternative would be to have a canary object with a known key that can be read quickly.
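
As a rough sketch of how such a check could be wrapped in a wait-loop, the Java below retries a probe a limited number of times and abandons any single attempt that takes longer than a per-attempt timeout. Everything here is an assumption for illustration: `ReadinessWaiter`, `waitUntilReady`, and the parameter names are hypothetical, and the probe is a plain `Callable<Boolean>` standing in for a read of the canary key. It bounds how long the caller waits; it does not change any retry behaviour inside the Riak client itself.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class ReadinessWaiter {
    // Runs the probe up to maxAttempts times. An attempt counts as failed if it
    // returns false, throws, or exceeds attemptTimeoutMillis.
    static boolean waitUntilReady(Callable<Boolean> probe, int maxAttempts,
                                  long attemptTimeoutMillis, long delayMillis)
            throws InterruptedException {
        // Daemon threads, so a probe that never returns cannot keep the JVM alive.
        ExecutorService executor = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "readiness-probe");
            t.setDaemon(true);
            return t;
        });
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<Boolean> result = executor.submit(probe);
                try {
                    if (Boolean.TRUE.equals(
                            result.get(attemptTimeoutMillis, TimeUnit.MILLISECONDS))) {
                        return true;
                    }
                } catch (TimeoutException e) {
                    result.cancel(true); // interrupt the stuck attempt and move on
                } catch (ExecutionException e) {
                    // The probe threw; treat it as "not ready yet".
                }
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical probe: fails twice, then succeeds. A real probe would
        // read the canary object and return true on success.
        Callable<Boolean> probe = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new IllegalStateException("not ready");
            }
            return true;
        };
        boolean ready = waitUntilReady(probe, 30, 1000, 100);
        System.out.println("ready=" + ready + " after " + calls.get() + " attempts");
    }
}
```

Running each attempt on its own thread means a probe that blocks inside the client library cannot stall the loop; the trade-off is that an attempt that ignores interruption keeps its thread until it returns.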

In startup scripts that need to wait until Riak KV is operational, we recommend using `riak-admin wait-for-service riak_kv`.

Kind Regards,

Magnus

--
Magnus Kessler
Client Services Engineer
Basho Technologies Limited

Registered Office - 8 Lincoln’s Inn Fields London WC2A 3BP Reg 07970431

Re: Long delays when trying to recover from errors in Java client at startup

Toby Corkindale-2
Hi Magnus,
I can't use riak-admin in a script like this, because Riak KV is not installed on the client containers.

Ah, I forgot that bucket listing was the same as key listing. I'll switch that to a different test.
In this use case, though, it's not too important -- I'm trying to deal with unit tests that start up Riak fixtures via Docker.

So there is literally no content in them to start up. Unfortunately, doing a simple HTTP request to /ping isn't sufficient to really tell if the Riak fixture has spun up properly, as sometimes it'll return OK to /ping, but still return errors on other requests.

Do you know why there is this eight-minute delay before RiakClient sees the node as good?
It's not the Riak KV instance -- because if I cheat and put a ten second delay in, so that the very first attempt to scan for buckets succeeds, then things proceed instantly.
