What do you do when Riak freezes?

Michael Dillon
I've run into a problem with Riak freezing completely on one node running Ubuntu 12.04 LTS on a Xen VM (EC2). If I ssh into the node and run "ps ax", that shell session also freezes. I also tried "netstat -lnp" in another ssh session to find the process ID to kill, but that froze too.

I must admit that I saw a similar problem with RabbitMQ running on Ubuntu 10 LTS on an OpenVPS VM a few years ago.

I suppose this is an Erlang issue of some sort, but I would really like a way to kill the Riak processes without a reboot, if possible.

--
PageFreezer.com
#200 - 311 Water Street
Vancouver,  BC  V6B 1B8

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: What do you do when Riak freezes?

Matthew Von-Maszewski

Any chance you are overflowing into swap?  Or, in the case of Xen, have you exceeded the guaranteed RAM for the VM and moved into the disk-backed portion of "ram"?

What backend do you use within Riak?  Do you have memory statistics from before and after the seizure/freeze?

Matthew



Re: What do you do when Riak freezes?

Michael Dillon
We are using Amazon EC2 m3.2xlarge nodes, and while the freeze is occurring "free" reports:

             total       used       free     shared    buffers     cached
Mem:      30623232    8818792   21804440          0      88092    4411832
-/+ buffers/cache:    4318868   26304364
Swap:            0          0          0
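
For readers less familiar with this older "free" layout: the "-/+ buffers/cache" row is derived from the "Mem:" row by treating buffers and page cache as reclaimable. The short check below reproduces that arithmetic from the figures above (KiB, the default unit of "free"). It is only an illustration of how to read the output, but it suggests that roughly 25 GiB was still effectively available, so plain memory exhaustion looks unlikely.

```python
# Figures copied from the "Mem:" row of the free output above (KiB).
total, used, free = 30623232, 8818792, 21804440
buffers, cached = 88092, 4411832

# The "-/+ buffers/cache" row counts buffers and page cache as reclaimable:
# they are subtracted from "used" and added to "free".
used_by_apps = used - buffers - cached
effectively_free = free + buffers + cached

print(used_by_apps)        # 4318868, matches the second row
print(effectively_free)    # 26304364, matches the second row
print(round(effectively_free / 1024**2, 1), "GiB effectively free")
```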

The Erlang processes seem to be unkillable: even "shutdown -r now" hangs. Right now these nodes are only used for testing, but eventually we will go into production, and I really need a plan for how to detect and then deal with these Erlang freezes. Better yet, I'd like a way to avoid them, even if that means detecting some condition in advance and then rebooting the node.





Re: What do you do when Riak freezes?

Matthew Von-Maszewski
I thought I knew the cause of this problem.  I do not.  We need to await input from others.

My apologies.

Other basic questions will be:  what version of Riak, what is your app.config, how many servers/nodes, any reason this one node is "different"?

Matthew



Re: What do you do when Riak freezes?

Michael Dillon
I'm running Riak 2.0pre11, but I keep mentioning Erlang because I saw a similar hang with RabbitMQ a couple of years ago. I suspect that even if a Riak bug is involved, there is probably an Erlang problem as well.

I have now discovered that "pstree -p" shows me the process IDs, so I tried killing the processes. No luck. I cannot even kill the hanging "ps ax" or "netstat -lnp" processes. Then I tried "kill -9", and they are still stuck.

Two years ago we had to do a hard reset of the VM host server (i.e. kill all the VMs on the same box) in order to resolve this. I'm going to try the EC2 control panel to stop or terminate the VM, but even if that works it is not really a satisfactory solution to the problem. I'd be really interested if anyone else has seen this kind of a hang in production and how you cope with it.
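
Processes that ignore "kill -9" are typically in uninterruptible sleep (state "D"), blocked inside a kernel call; the signal is not acted on until that call returns. As a rough detection aid, the sketch below lists such processes by reading only /proc/<pid>/stat rather than running "ps". It assumes a Linux /proc filesystem, and it assumes (without guarantee) that reading the stat file is less likely to block than the files "ps" reads. The function names are made up for this example.

```python
import os

def parse_stat(line):
    """Return (pid, comm, state) from the text of a /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left), comm, tail.split()[0]

def uninterruptible(proc_root="/proc"):
    """List (pid, comm) for processes currently in state D."""
    if not os.path.isdir(proc_root):
        return []  # not a Linux-style /proc; nothing to scan
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as fh:
                pid, comm, state = parse_stat(fh.read())
        except OSError:
            continue  # process exited between listdir() and open()
        if state == "D":
            stuck.append((pid, comm))
    return stuck

if __name__ == "__main__":
    for pid, comm in uninterruptible():
        print(pid, comm)
```

Run periodically from cron or a monitoring agent, a non-empty result that persists across several samples would be a reasonable trigger for an alert. A process that is briefly in state D during normal disk I/O is not a problem by itself.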




Re: What do you do when Riak freezes?

Adam Lindsay
I'm trying to remember some of the pain points of running Riak in production on EC2 about 2.5 years ago. Riak is a different product now, but EC2 is still a challenging environment.

Do you have any monitoring on the state of network interfaces? Is it possible the IP of one of the nodes changed from underneath you?
In general, random EC2 network problems (which are plentiful) result in Erlang problems, and they have resulted in Riak getting confused.



Re: What do you do when Riak freezes?

Shane McEwan-2
In reply to this post by Michael Dillon
On 19/03/14 20:56, Michael Dillon wrote:
> I've run into a problem with Riak freezing completely on one node
> running on Ubuntu 12.04 LTS on a XEN VM (EC2). If I ssh into the node
> and run "ps ax" that shell session also freezes. I also tried another
> ssh session with "netstat -lnp" to see if I could find the process ID to
> kill, but that also froze.

If 'ps' and 'netstat' are hanging then you've got more fundamental
problems with the system than just Riak and Erlang.

It's almost like a network drive is missing and if that drive is
mentioned in your PATH any commands you type hang because the shell is
traversing into a missing mount searching for the command. But you say
that other commands like 'pstree' and 'kill' work so it's not that.

Does 'dmesg' or 'tail /var/log/syslog' show anything interesting?

If you can get the PID of any of the hung processes you could try
'strace -f -p PID' to see if they're hanging on a particular system
call. Although I suspect the strace will hang as well.
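
If strace does hang, the kernel's own view of the process may still be readable. The sketch below assumes a Linux /proc filesystem and reads /proc/<pid>/wchan, which names the kernel function a sleeping task is waiting in (it can read "0" when the task is not blocked or when symbol names are not exposed). The helper name is hypothetical.

```python
import os

def wait_channel(pid, proc_root="/proc"):
    """Return the contents of /proc/<pid>/wchan, or None if it cannot be read."""
    try:
        with open(os.path.join(proc_root, str(pid), "wchan")) as fh:
            return fh.read().strip() or None
    except OSError:
        # No such process, no /proc, or permission denied.
        return None

# Example: inspect the current process (usually not blocked).
print(wait_channel(os.getpid()))
```

Calling it with the PID of a hung "ps" or Erlang process may show whether the task is stuck in a filesystem, block-device, or network call, which would help narrow down where the hang originates.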

I reckon a reboot will probably fix the problem but that won't be much
help if the problem happens again.

Shane.
