Performance problem: put operation takes seconds

ks
Hi there,

We're running a Riak 1.3.1 cluster with 3 nodes (CentOS 64-bit, Intel® Xeon® E3-1245 quad-core, 16 GB DDR3 ECC RAM, 2 x 3 TB SATA 6 Gb/s HDD 7200 rpm (software RAID 1)) connected to the same 10 Gb router.

We hold a lot of small key-value pairs in the database and perform about 500 get and about 500 put operations per second via the Java client library from several client machines.

The problem is that the 95th-percentile put latency (according to Riak statistics) often jumps to 50000 ms for a couple of minutes. This is far from what we expected in terms of performance.

We've applied the settings from http://docs.basho.com/riak/latest/cookbooks/Linux-Performance-Tuning/ (except for the noatime flag). Even so, a 50-second put latency is extreme, and I doubt that noatime alone would fix it.

The log files were not clear enough for us to track down the problem.

Is there anything else we can try?

Thanks a lot!

Re: Performance problem: put operation takes seconds

Christian Dahlqvist
Hello,

Could you please attach your app.config and vm.args files, as well as a zipped-up log directory from one of the nodes?

Best regards,

Christian


On 22 Jul 2013, at 12:40, ks <[hidden email]> wrote:

> --
> View this message in context: http://riak-users.197444.n3.nabble.com/Performance-problem-put-operation-takes-seconds-tp4028478.html
> Sent from the Riak Users mailing list archive at Nabble.com.
>
> _______________________________________________
> riak-users mailing list
> [hidden email]
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com


Re: Performance problem: put operation takes seconds

ks
Hi Christian!

Sure:
vm.args
app.config

Thanks!
ks

Re: Performance problem: put operation takes seconds

Christian Dahlqvist
Hi,

A reasonably common cause of sudden latency spikes is that the buffers used for internal communication get exhausted. This tends to show up as a large number of 'busy_dist_port' messages in the logs. It is especially common if you have large objects or objects with lots of small siblings.

Check your console logs for busy_dist_port messages. If you find any, add the following line to the vm.args file:

+zdbbl 16384

This raises the buffer size from its default to 16 MB (the value is given in kilobytes), but it may need to be increased further depending on the size distribution of your data.
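
As a rough illustration of the log check described above, the following Python sketch counts lines that mention busy_dist_port. The helper names are hypothetical and not part of Riak, and the location of the console log depends on how Riak was installed.

```python
from typing import Iterable


def count_busy_dist_port(lines: Iterable[str]) -> int:
    """Count log lines that mention busy_dist_port."""
    return sum(1 for line in lines if "busy_dist_port" in line)


def count_in_file(path: str) -> int:
    """Scan one log file; the caller supplies the path."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, "r", errors="replace") as handle:
        return count_busy_dist_port(handle)
```

For example, count_in_file("/var/log/riak/console.log") would scan a single node's console log, assuming that is where the logs live on your system. A count of zero suggests the distribution buffer is not the bottleneck.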

Also monitor the statistics to see whether you have large objects (node_get_fsm_objsize_100) and/or lots of siblings (node_get_fsm_siblings_100). Ideally, keep object size below 4-5 MB, and if you have siblings enabled, make sure your application resolves them, as unresolved siblings can otherwise cause objects to grow uncontrollably.
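
A minimal sketch of the statistics check suggested above, assuming the stats have already been loaded into a dictionary (for example from the JSON served by a node's /stats endpoint). The two stat names come from the message above. The 4 MB limit mirrors the lower end of the suggested range, the object size is assumed to be reported in bytes, and the sibling limit and the function name are illustrative assumptions.

```python
from typing import Dict, List

MAX_OBJECT_BYTES = 4 * 1024 * 1024  # lower end of the 4-5 MB guideline
MAX_SIBLINGS = 1                    # assumption: more than one sibling deserves a look


def find_object_warnings(stats: Dict[str, int]) -> List[str]:
    """Return warnings for oversized objects or sibling build-up."""
    warnings = []
    largest = stats.get("node_get_fsm_objsize_100", 0)
    if largest > MAX_OBJECT_BYTES:
        warnings.append(f"largest object read was {largest} bytes")
    siblings = stats.get("node_get_fsm_siblings_100", 0)
    if siblings > MAX_SIBLINGS:
        warnings.append(f"up to {siblings} siblings seen on a single object")
    return warnings
```

An empty list means neither limit was exceeded in the window that the stats cover.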

Best regards,

Christian



On 22 Jul 2013, at 14:10, ks <[hidden email]> wrote:

> Hi Christian!
>
> Sure:
> vm.args <http://riak-users.197444.n3.nabble.com/file/n4028480/vm.args>  
> app.config <http://riak-users.197444.n3.nabble.com/file/n4028480/app.config>  
>
> Thanks!
> ks

Re: Performance problem: put operation takes seconds

ks
Thanks,

There are no such messages in the logs. We operate on small objects: keys and values are literally less than 20 bytes.

Re: Performance problem: put operation takes seconds

Christian Dahlqvist
Hi,

If you do not have siblings enabled and are certain ALL values are small, there are a few other things that can cause performance problems:

- Are you running key or bucket listings?
- Are you running secondary index queries or mapreduce jobs?
- Can you confirm you have disabled swap?
- Are you monitoring riak statistics? Do you have a graph of the latencies you can share?

Also, can you provide the complete output from statistics when the slowdown happens for us to look at?
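
To double-check the swap question above on a Linux node, one option is to read /proc/meminfo. This is only a sketch: the helper names are hypothetical, and a SwapTotal of 0 kB means that no swap space is configured at all, which is a stricter condition than merely discouraging swapping through kernel settings.

```python
def swap_total_kb(meminfo_text: str) -> int:
    """Extract the SwapTotal value (in kB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # The line has the form "SwapTotal:       0 kB".
            return int(line.split()[1])
    raise ValueError("SwapTotal not found")


def swap_is_disabled(meminfo_path: str = "/proc/meminfo") -> bool:
    """True when the kernel reports no configured swap space."""
    with open(meminfo_path) as handle:
        return swap_total_kb(handle.read()) == 0
```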

Best regards,

Christian


On 22 Jul 2013, at 16:57, ks <[hidden email]> wrote:

> Thanks,
>
> There's no such a message in logs. We operate on small objects: key and
> values are literally less than 20 bytes.

Re: Performance problem: put operation takes seconds

Vahric Muhtaryan
In reply to this post by ks
Hi,

Why do you need RAID 1? Use RAID 0 on each node; this should increase performance. Riak also keeps 3 copies of each value, so there is no need to worry about disk-level redundancy. The problem might also be the SATA disks: if they are desktop-grade drives, latency can be higher, and because a SATA interface supports fewer commands per second than a SAS interface, that may also be causing an issue.

Just wanted to share my ideas after reading your request.
Regards
VM

Re: Performance problem: put operation takes seconds

ks
In reply to this post by Christian Dahlqvist
Hi Christian,

> Are you running key or bucket listings?
> Are you running secondary index queries or mapreduce jobs?
Nope, we only call com.basho.riak.client.bucket.Bucket.fetch() and .store(). There are no other uses of the Riak client library in our code.

> Can you confirm you have disabled swap?
Yes.

> Are you monitoring riak statistics? Do you have a graph of the latencies you can share?
We're monitoring Riak stats by parsing the response from http://<node>:8098/stats.
These graphs cover the last 3 hours. Red, green, and blue correspond to the three nodes in the cluster.

riak1.png
riak2.png

If needed, I can share complete output from the stats.
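
The polling described above can be sketched roughly as follows. The endpoint and port are the ones quoted in the post. The stat name node_put_fsm_time_95 and its unit (microseconds) are assumptions that should be checked against the actual /stats output, and the host name in the usage note is a placeholder.

```python
import json
from urllib.request import urlopen


def put_latency_p95_ms(stats: dict) -> float:
    """Convert the 95th-percentile put FSM time to milliseconds.

    Assumes node_put_fsm_time_95 is reported in microseconds.
    """
    return stats["node_put_fsm_time_95"] / 1000.0


def fetch_stats(node: str, port: int = 8098, timeout: float = 5.0) -> dict:
    """Fetch and decode the JSON document served at http://<node>:<port>/stats."""
    with urlopen(f"http://{node}:{port}/stats", timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling put_latency_p95_ms(fetch_stats("riak1.example.com")) would return the current value for one node. Polling each of the three nodes at a fixed interval yields the kind of per-node series plotted in the graphs.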

Thanks,
 Kirill

Re: Performance problem: put operation takes seconds

ks
In reply to this post by Vahric Muhtaryan
Thanks Vahric,

Agreed, we'll reconfigure the nodes as RAID 0. Still, I don't believe a RAID misconfiguration can be the cause of such response times.

Best,
  Kirill

Re: Performance problem: put operation takes seconds

Christian Dahlqvist
In reply to this post by ks
Hi Kirill,

Raw output from stats and graphs showing trends would be very useful. Access to the log files would also help.

Best regards,

Christian



On 22 Jul 2013, at 20:36, ks <[hidden email]> wrote:

> Hi Christian,
>
>> Are you running key or bucket listings?
>> Are you running secondary index queries or mapreduce jobs?
> Nope, we only do com.basho.riak.client.bucket.Bucket.fetch() and .store().
> There's no other usages of Riak client library in our code.
>
>> Can you confirm you have disabled swap?
> Yes.
>
>> Are you monitoring riak statistics? Do you have a graph of the latencies
>> you can share?
> We're monitoring Riak stats by parsing http://<node>:8098/stats response.
> These are graphs taken for the last 3 hours. Red-green-blue mean three nodes
> in the cluster.
>
> If needed, I can share complete output from the stats.
>
> Thanks,
> Kirill