Reg:Continuous Periodic crashes after long operation

Steven Joseph-2
Hi,

We have a cluster of 5 nodes, which are continuously being queried for
new data through Solr. We have been having some issues with Riak/Solr
that seem to occur after long periods of operation. It starts on one
node and, after a while, seems to happen on all nodes.

We tried upgrading to the latest version of Riak hoping that it would
solve the issue, but no luck.

The only thing that stops the crashes is a full, staggered restart of
the cluster.

Please find the logs below. Any help would be much appreciated.

Riak Logs:

2017-01-26T07:53:03.262Z hawk5| ** Last message in was tick
2017-01-26T07:53:10.197Z hawk5|
2017-01-26T07:53:10.197Z hawk5| 2017-01-26 07:53:08.183 [error] emulator Error in process <0.22701.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:10.263Z hawk5| Error in process <0.22701.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:10.263Z hawk5| 2017-01-26 07:53:08 =ERROR REPORT====
2017-01-26T07:53:17.198Z hawk5|
2017-01-26T07:53:17.208Z hawk5| 2017-01-26 07:53:13.472 [error] emulator Error in process <0.12549.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:17.263Z hawk5| Error in process <0.12549.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:17.263Z hawk5| 2017-01-26 07:53:13 =ERROR REPORT====
2017-01-26T07:53:18.198Z hawk5| 2017-01-26 07:53:17.861 [error] emulator Error in process <0.2254.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:18.208Z hawk5|
2017-01-26T07:53:18.208Z hawk5| 2017-01-26 07:53:17.861 [error] emulator Error in process <0.2254.73> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
2017-01-26T07:53:18.264Z hawk5|


Python client traces:

2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/dist-packages/riak/client/transport.py", line 179, in wrapper
2017-01-26T10:20:44.517Z hawk5| return self._client.fulltext_search(search_index, query, **params)
2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/dist-packages/riak/bucket.py", line 476, in search
2017-01-26T10:20:44.517Z hawk5| raise e.args[0]
2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/dist-packages/riak/client/transport.py", line 134, in _with_retries
2017-01-26T10:20:44.517Z hawk5| return self._with_retries(pool, thunk)
2017-01-26T10:20:44.543Z hawk5| RiakError: 'recv_into returned zero bytes unexpectedly'


Regards

Steven Joseph

CTO, StreetHawk Pty Ltd

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
Re: Reg:Continuous Periodic crashes after long operation

Shaun McVey
Hi Steven,

Based on that log output, it looks like you're running into issues with system limits, probably open file limits.  You can check the value that Riak has available by connecting to one of the nodes with riak attach, then executing:

```
os:cmd("ulimit -n").
```

(Afterwards, disconnect with Ctrl+G, then q, then Enter.)

It should be at least 65,536 ideally, although the bigger the better.

If you find it's lower, then follow this doc to increase it.

http://docs.basho.com/riak/kv/2.0.2/using/performance/open-files-limit/

Have a check and let us know what the output was.
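
As a cross-check that does not need `riak attach`, the limits of an already running process can also be read from `/proc/<pid>/limits` on Linux. The sketch below is only an illustration: the function names are made up for this example, and the pid would come from something like `riak getpid`.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    Values are returned as strings because either one may be 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Row layout: Max open files <soft> <hard> files
            fields = line.split()
            return fields[3], fields[4]
    raise ValueError("no 'Max open files' row found")


def max_open_files(pid):
    """Read the open-files limit of a live process (Linux only)."""
    with open("/proc/%d/limits" % pid) as f:
        return parse_max_open_files(f.read())
```

Calling `max_open_files(pid)` with the Riak pid should report the same number that `ulimit -n` prints inside the attached shell, if the limit was applied to the Riak process itself.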

Kind Regards,
Shaun

Re: Reg:Continuous Periodic crashes after long operation

Steven Joseph
Hi Shaun,

I have already set this to a very high value:

([hidden email])1> os:cmd("ulimit -n").
"20000500\n"


So the issue is not that the limit is low, but maybe a resource leak? As I mentioned, our application processes continuously run queries on the cluster.

Kind Regards

Steven

Re: Reg:Continuous Periodic crashes after long operation

Luke Bakken
Steven,

You may be able to get information via the lsof command as to what
process(es) are using many file handles (if that is the cause).
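
If lsof is slow or unavailable, roughly the same breakdown can be had on Linux by reading the symlinks under `/proc/<pid>/fd`. This is a minimal sketch, not a replacement for lsof: the function names and the four categories are choices made for this example, and reading another user's process normally needs root.

```python
import os
from collections import Counter


def classify_fd_target(target):
    """Map a /proc/<pid>/fd symlink target to a coarse category."""
    if target.startswith("socket:"):
        return "socket"
    if target.startswith("pipe:"):
        return "pipe"
    if target.startswith("anon_inode:"):
        return "anon_inode"
    # Regular files, directories and device nodes all show up as paths.
    return "file"


def summarize_fds(pid):
    """Count the open descriptors of a process by category (Linux only)."""
    fd_dir = "/proc/%d/fd" % pid
    counts = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        counts[classify_fd_target(target)] += 1
    return counts
```

Running `summarize_fds(pid)` a few minutes apart shows whether one category, for example sockets, keeps growing.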

I searched for that particular error and found this GH issue:
https://github.com/emqtt/emqttd/issues/426

Which directed me to this page:
https://github.com/emqtt/emqttd/wiki/linux-kernel-tuning

Basho also has a set of recommended tuning parameters:
http://docs.basho.com/riak/kv/2.2.0/using/performance/

Do you have other error entries in any of Riak's logs at around the
same time as these messages? Particularly crash.log.

--
Luke Bakken
Engineer
[hidden email]

Re: Reg:Continuous Periodic crashes after long operation

Matthew Von-Maszewski
FYI:  this is the function that is crashing:

get_uint32_measurement(Request, #internal{os_type = {unix, linux}}) ->
    {ok,F} = file:open("/proc/loadavg",[read,raw]),                  %% <--- crash line
    {ok,D} = file:read(F,24),
    ok = file:close(F),
    {ok,[Load1,Load5,Load15,_PRun,PTotal],_} = io_lib:fread("~f ~f ~f ~d/~d", D),
    case Request of
        ?avg1  -> sunify(Load1);
        ?avg5  -> sunify(Load5);
        ?avg15 -> sunify(Load15);
        ?ping -> 4711;
        ?nprocs -> PTotal
    end;

Is there something unique about that open?
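
For anyone who does not read Erlang, the parsing half of that function can be mirrored in a few lines of Python. This is only an illustration of what the code does with `/proc/loadavg`; the name `parse_loadavg` is made up here. Note that in the traces above the failure is on the open itself, before any of this parsing runs.

```python
def parse_loadavg(text):
    """Parse the leading fields of /proc/loadavg.

    Returns (load1, load5, load15, running, total), the same five
    values the Erlang code extracts with "~f ~f ~f ~d/~d".
    """
    load1, load5, load15, procs = text.split()[:4]
    running, total = procs.split("/")
    return float(load1), float(load5), float(load15), int(running), int(total)
```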

Matthew

Re: Reg:Continuous Periodic crashes after long operation

Steven Joseph-2
I've had this issue again. This time I checked the output of lsof, and it seems the number of established connections is far too high. I've configured my application tasks to exit and clean up connections periodically. That should solve it.
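
A rough sketch of the kind of cleanup meant here, assuming the client object exposes a `close()` method. `FakeClient` and `run_query` are hypothetical stand-ins written for this example and are not the Riak client API; check what your client version actually provides before relying on this.

```python
from contextlib import closing


class FakeClient(object):
    """Hypothetical stand-in for a database client that owns sockets."""

    def __init__(self):
        self.closed = False

    def search(self, index, query):
        if self.closed:
            raise RuntimeError("client is closed")
        return {"index": index, "query": query, "docs": []}

    def close(self):
        # A real client would release its pooled connections here.
        self.closed = True


def run_query(make_client, index, query):
    """Run one search and always close the client, even if search raises."""
    with closing(make_client()) as client:
        return client.search(index, query)
```

The point of `contextlib.closing` is that the connection is released on every exit path, so a long-running worker cannot accumulate sockets through error cases.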

Thanks guys.

Steven

Re: Reg:Continuous Periodic crashes after long operation

Steven Joseph-2
In reply to this post by Shaun McVey
Hi Shaun,

I'm having this issue again. This time I have captured the system limits
while Riak is still crashing.

Please note the lsof and prlimit outputs at the bottom.


steven@hawk5:log/riak:» tail error.log    [0]  07:17:05

2017-01-31 19:21:37.391 [error] emulator Error in process <0.7964.15> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}

2017-01-31 19:21:40.868 [error] <0.25635.14> gen_server yz_cover terminated with reason: no match of right hand value error in mochiglobal:compile/2 line 51
2017-01-31 19:21:40.868 [error] <0.25635.14> CRASH REPORT Process yz_cover with 0 neighbours exited with reason: no match of right hand value error in mochiglobal:compile/2 line 51 in gen_server:terminate/6 line 744
2017-01-31 19:21:40.868 [error] <0.1215.0> Supervisor yz_general_sup had child yz_cover started with yz_cover:start_link() at <0.25635.14> exit with reason no match of right hand value error in mochiglobal:compile/2 line 51 in context child_terminated
2017-01-31 19:21:41.811 [error] emulator Error in process <0.18111.15> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}

2017-01-31 19:21:47.363 [error] emulator Error in process <0.2866.15> on node '[hidden email]' with exit value: {{badmatch,{error,system_limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}

steven@hawk5:log/riak:» sudo lsof -a -p `riak getpid` |wc -l    [0]  07:17:10
48446
steven@hawk5:log/riak:» sudo prlimit -n --noheadings -o soft -p `riak getpid`    [0]  07:17:27
20000500
steven@hawk5:log/riak:» sudo prlimit -n --noheadings -o hard -p `riak getpid`    [0]  07:17:32
20000500
steven@hawk5:log/riak:»
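
To put those two numbers side by side, here is the arithmetic as a small sketch (the helper name is made up for this example). The lsof line count also includes the header line and non-descriptor entries such as cwd, txt and mem, so if anything it overstates the number of open descriptors.

```python
def fd_usage_ratio(open_count, soft_limit):
    """Fraction of the per-process open-files limit currently in use."""
    return open_count / float(soft_limit)


ratio = fd_usage_ratio(48446, 20000500)
print("%.2f%% of the limit in use" % (ratio * 100))  # prints 0.24% of the limit in use
```

At well under one percent, the per-process descriptor limit reported by prlimit does not look like the limit being hit.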


Python trace:

2017-01-31T20:20:52.004Z hawk4| return self._client.fulltext_search(search_index, query, **params)
2017-01-31T20:20:52.004Z hawk4| **skwargs
2017-01-31T20:20:52.004Z hawk4| return self._with_retries(pool, thunk)
2017-01-31T20:20:52.004Z hawk4| **kwargs
2017-01-31T20:20:52.004Z hawk4| File "/usr/local/lib/python2.7/dist-packages/riak/client/transport.py", line 179, in wrapper
2017-01-31T20:20:52.004Z hawk4| File "/usr/local/lib/python2.7/dist-packages/riak/bucket.py", line 476, in search
2017-01-31T20:20:52.004Z hawk4| File "/usr/local/lib/python2.7/dist-packages/riak/client/transport.py", line 134, in _with_retries
2017-01-31T20:20:52.004Z hawk4| File "/opt/streethawk/cloud/core/riakdb/models.py", line 528, in search
2017-01-31T20:20:52.005Z hawk4| RiakError: 'recv_into returned zero bytes unexpectedly'
2017-01-31T20:20:52.005Z hawk4| raise e.args[0]



Regards

Steven


Shaun McVey <[hidden email]> writes:

> Hi Steven,
>
> Based on that log output, it looks like you're running into issues with
> system limits, probably open file limits.  You can check the value that
> Riak has available by connecting to one of the nodes with riak attach, then
> executing:
>
> ```
> os:cmd("ulimit -n").
> ```
>
> (After, disconnect with ctrl+g, then q, then Enter).
>
> It should be at least 65,536 ideally, although the bigger the better.
>
> If you find it's lower, then follow this doc to increase it.
>
> http://docs.basho.com/riak/kv/2.0.2/using/performance/open-files-limit/
>
> Have a check and let us know what the output was.
>
> Kind Regards,
> Shaun
>
> On Thu, Jan 26, 2017 at 10:34 AM, Steven Joseph <[hidden email]>
> wrote:
>
>> Hi,
>>
>> We have a cluster of 5 nodes, which are continuously being queried for
>> new data through solr. We have been having some issues with riak/solr
>> which seems to be happening after longer periods of operation. It starts
>> off with one node and it seems to be happening on all node after a
>> while.
>>
>> We tried upgrading to the latest version of riak hoping that it would
>> solve the issue, but no luck.
>>
>> Only thing that stops the crashes is a full cluster staggered restart.
>>
>> Please find the logs below. Any help would be much appreciated.
>>
>> Riak Logs:
>>
>> 2017-01-26T07:53:03.262Z hawk5| ** Last message in was tick
>> 2017-01-26T07:53:10.197Z hawk5|
>> 2017-01-26T07:53:10.197Z hawk5| 2017-01-26 07:53:08.183 [error] emulator
>> Error in process <0.22701.73> on node '[hidden email]' with
>> exit value: {{badmatch,{error,system_limit}},[{cpu_sup,g
>> et_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}
>> ]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:10.263Z hawk5| Error in process <0.22701.73> on node '
>> [hidden email]' with exit value: {{badmatch,{error,system_
>> limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.e
>> rl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{
>> file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:10.263Z hawk5| 2017-01-26 07:53:08 =ERROR REPORT====
>> 2017-01-26T07:53:17.198Z hawk5|
>> 2017-01-26T07:53:17.208Z hawk5| 2017-01-26 07:53:13.472 [error] emulator
>> Error in process <0.12549.73> on node '[hidden email]' with
>> exit value: {{badmatch,{error,system_limit}},[{cpu_sup,g
>> et_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}
>> ]},{cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:17.263Z hawk5| Error in process <0.12549.73> on node '
>> [hidden email]' with exit value: {{badmatch,{error,system_
>> limit}},[{cpu_sup,get_uint32_measurement,2,[{file,"cpu_sup.e
>> rl"},{line,223}]},{cpu_sup,measurement_server_loop,1,[{
>> file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:17.263Z hawk5| 2017-01-26 07:53:13 =ERROR REPORT====
>> 2017-01-26T07:53:18.198Z hawk5| 2017-01-26 07:53:17.861 [error] emulator
>> Error in process <0.2254.73> on node '[hidden email]' with
>> exit value: {{badmatch,{error,system_limit}},[{cpu_sup,g$
>> t_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{
>> cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:18.208Z hawk5|
>> 2017-01-26T07:53:18.208Z hawk5| 2017-01-26 07:53:17.861 [error] emulator
>> Error in process <0.2254.73> on node '[hidden email]' with
>> exit value: {{badmatch,{error,system_limit}},[{cpu_sup,g$
>> t_uint32_measurement,2,[{file,"cpu_sup.erl"},{line,223}]},{
>> cpu_sup,measurement_server_loop,1,[{file,"cpu_sup.erl"},{line,585}]}]}
>> 2017-01-26T07:53:18.264Z hawk5|
>>
>>
>> Python client traces:
>>
>> 2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/
>> dist-packages/riak/client/transport.py", line 179, in wrapper
>> 2017-01-26T10:20:44.517Z hawk5| return self._client.fulltext_search(search_index,
>> query, **params)
>> 2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/dist-packages/riak/bucket.py",
>> line 476, in search
>> 2017-01-26T10:20:44.517Z hawk5| raise e.args[0]
>> 2017-01-26T10:20:44.517Z hawk5| File "/usr/local/lib/python2.7/
>> dist-packages/riak/client/transport.py", line 134, in _with_retries
>> 2017-01-26T10:20:44.517Z hawk5| return self._with_retries(pool, thunk)
>> 2017-01-26T10:20:44.543Z hawk5| RiakError: 'recv_into returned zero bytes
>> unexpectedly'
>>
>>
>> Regards
>>
>> Steven Joseph
>>
>> CTO, StreetHawk Pty Ltd
>>
>> _______________________________________________
>> riak-users mailing list
>> [hidden email]
>> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
>>
>>

_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

Re: Reg:Continuous Periodic crashes after long operation

Luke Bakken
Hi Steven,

What is the output of this command on your systems?

$ sysctl fs.file-max

Mine is:

fs.file-max = 1620211

--
Luke Bakken
Engineer
[hidden email]


On Tue, Jan 31, 2017 at 12:22 PM, Steven Joseph <[hidden email]> wrote:
> Hi Shaun,
>
> Im having this issue again, this time I have captured the system limits,
> while riak is still crashing.
>
> Please note lsof and prlimit outputs at bottom.


Re: Reg:Continuous Periodic crashes after long operation

Steven Joseph-2
Hi Luke,

Here's the output of 

$ sysctl fs.file-max

fs.file-max = 20000500
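
For context on the number above: fs.file-max is the kernel's system-wide ceiling on open file handles, while every process also has its own RLIMIT_NOFILE limit (the value that prlimit and ulimit -n report). A large fs.file-max therefore does not rule out a single process running into its own limit. Below is a minimal sketch, assuming Linux and Python 3, that reads both values for the current process. The helper name describe_fd_limits is made up for illustration.

```python
import os
import resource


def describe_fd_limits(file_max_path="/proc/sys/fs/file-max"):
    """Return the system-wide and per-process file descriptor limits (Linux)."""
    system_wide = None
    if os.path.exists(file_max_path):
        # Same value that `sysctl fs.file-max` prints.
        with open(file_max_path) as fh:
            system_wide = int(fh.read().strip())
    # Soft and hard limits on open files for *this* process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return {"fs.file-max": system_wide, "nofile_soft": soft, "nofile_hard": hard}


if __name__ == "__main__":
    for name, value in describe_fd_limits().items():
        print(name, "=", value)
```

For an already running process such as the Riak node, the same per-process numbers can be read from /proc/<pid>/limits.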

Regards

Steven

On Wed, Feb 1, 2017 at 9:30 AM Luke Bakken <[hidden email]> wrote:
> Hi Steven,
>
> What is the output of this command on your systems?
>
> $ sysctl fs.file-max
>
> Mine is:
>
> fs.file-max = 1620211
>
> --
> Luke Bakken
> Engineer
> [hidden email]
>
>
> On Tue, Jan 31, 2017 at 12:22 PM, Steven Joseph <[hidden email]> wrote:
> > Hi Shaun,
> >
> > Im having this issue again, this time I have captured the system limits,
> > while riak is still crashing.
> >
> > Please note lsof and prlimit outputs at bottom.


Re: Reg:Continuous Periodic crashes after long operation

Luke Bakken
Hi Steven,

At this point I suspect you're using the Python client in such a way
that too many connections are being created. Are you re-using the
RiakClient object or repeatedly creating new ones? Can you provide any
code that reproduces your issue?

--
Luke Bakken
Engineer
[hidden email]


On Tue, Jan 31, 2017 at 7:47 PM, Steven Joseph <[hidden email]> wrote:

> Hi Luke,
>
> Here's the output of
>
> $ sysctl fs.file-max
>
> fs.file-max = 20000500
>
> Regards
>
> Steven


Re: Reg:Continuous Periodic crashes after long operation

Steven Joseph-2
Hi Luke,

Yes, I am creating new client objects for each of my tasks.

Please see this GitHub issue against the Python client for some
background as to why.

https://github.com/basho/riak-python-client/issues/497

Basically, I ran into issues with concurrency when processes are forked.

I might experiment with using process IDs as keys to access a
process-specific Riak client in each forked child?


Regards

Steven

Luke Bakken <[hidden email]> writes:

> Hi Steven,
>
> At this point I suspect you're using the Python client in such a way
> that too many connections are being created. Are you re-using the
> RiakClient object or repeatedly creating new ones? Can you provide any
> code that reproduces your issue?
>
> --
> Luke Bakken
> Engineer
> [hidden email]


Re: Reg:Continuous Periodic crashes after long operation

Luke Bakken
Thanks for the information. Yes, one RiakClient instance per Unix
process is correct.

I will see if there is a way for you to keep track of connections from
the client to Riak. Off the top of my head, I don't believe the Python
client has the ability to set connection limits.
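
In the meantime, one rough way to watch connection growth from the client side is to count the sockets the Python process currently holds open. This is only a sketch, assuming Linux (it reads /proc/self/fd) and Python 3; the function name count_open_sockets is made up, and the count includes every socket in the process, not just connections to Riak.

```python
import os
import socket
import stat


def count_open_sockets():
    """Count this process's open file descriptors that are sockets (Linux)."""
    count = 0
    for name in os.listdir("/proc/self/fd"):
        try:
            mode = os.fstat(int(name)).st_mode
        except OSError:
            # The descriptor was closed between listdir() and fstat(),
            # e.g. the one listdir() itself used to read the directory.
            continue
        if stat.S_ISSOCK(mode):
            count += 1
    return count


if __name__ == "__main__":
    before = count_open_sockets()
    s = socket.socket()  # stands in for one client connection
    print("sockets opened:", count_open_sockets() - before)
    s.close()
```

Logging this number periodically from each worker would show whether sockets accumulate as tasks create new clients.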

--
Luke Bakken
Engineer
[hidden email]

On Wed, Feb 1, 2017 at 1:59 PM, Steven Joseph <[hidden email]> wrote:

> Hi Luke,
>
> Yes, I am creating new client objects for each of my tasks.
>
> Please see this GitHub issue against the Python client for some
> background as to why.
>
> https://github.com/basho/riak-python-client/issues/497
>
> Basically, I ran into issues with concurrency when processes are forked.
>
> I might experiment with using process IDs as keys to access a
> process-specific Riak client in each forked child?
>
>
> Regards
>
> Steven
