Core Claim and Property-Based Tests

Core Claim and Property-Based Tests

Martin Sumner

I've raised an issue with Core today (https://github.com/basho/riak_core/issues/908), related to the claim algorithms.  

There's a long-read associated with this, which provides a broader analysis of how claim works with the ring:

https://github.com/martinsumner/riak_core/blob/mas-claimv2issues/docs/ring_claim.md

I believe the long-read explains some of the common mysterious issues which can occur with claim.

We're in the process of fixing up the property-based tests for riak_core_claim.erl, and will then be looking to make some improvements to claim v2 to try and pass the improved tests.
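
The core invariant these tests assert is that every preference list should span target_n_val distinct physical nodes.  A toy sketch of that kind of property (a simplified ring model, not the actual riak_core_claim test code), with a naive sequential claim standing in for the claim function under test:

```
-module(claim_prop_sketch).
-include_lib("proper/include/proper.hrl").
-export([prop_preflists_span_target_n/0]).

%% All wrap-around windows of length N over the ownership list
%% (a toy stand-in for the real preference lists).
preflists(Owners, N) ->
    Q = length(Owners),
    Extended = Owners ++ lists:sublist(Owners, N - 1),
    [lists:sublist(Extended, I, N) || I <- lists:seq(1, Q)].

%% Naive sequential claim: partitions assigned round-robin to nodes.
sequential_claim(Q, Nodes) ->
    S = length(Nodes),
    [lists:nth((I rem S) + 1, Nodes) || I <- lists:seq(0, Q - 1)].

%% Every preference list must span TargetN distinct physical nodes.
%% Sequential assignment violates this whenever Q is not a multiple
%% of S, so quickcheck will report counterexamples for it.
prop_preflists_span_target_n() ->
    TargetN = 4,
    ?FORALL({Q, S}, {oneof([16, 32, 64]), range(4, 10)},
            begin
                Nodes = [list_to_atom("n" ++ integer_to_list(N))
                         || N <- lists:seq(1, S)],
                Owners = sequential_claim(Q, Nodes),
                lists:all(fun(PL) -> length(lists:usort(PL)) =:= TargetN end,
                          preflists(Owners, TargetN))
            end).
```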

Big question is though, how can we progress any contribution we make into the Riak codebase?  What is the plan going forward for open-source contributions to Riak?  Do Basho have any contingency plans for smoothly handing over open-source code to the community, before the list of Basho's Github people (https://github.com/orgs/basho/people) who still work at Basho is reduced to zero?

Is this something of concern to others?

Regards

Martin


Re: Core Claim and Property-Based Tests

Tom Santero
I'm aware of a few other companies and individuals who are interested in continued development and support in a post-Basho world. Ideally the community can come together and contribute to a single, canonical fork.

Semi-related, there's a good chance this mailing list won't last much longer, either. I'm happy to personally contribute time and resources to help maintain the community. 

Tom

Re: Core Claim and Property-Based Tests

Christopher Meiklejohn
For what it's worth, the Lasp community is looking at doing a fork of
Riak Core replacing all communication with our Partisan library and
moving it completely off of distributed Erlang.  We'd love to hear
from more folks that are interested in this work.

- Christopher

Re: Core Claim and Property-Based Tests

Tom Santero
I'd love to see riak_core on partisan. I'm eyeing using it in an upcoming internal project. </$0.02>

Re: Core Claim and Property-Based Tests

Martin Sumner
In reply to this post by Christopher Meiklejohn
Chris,

Is this only the communications part, so the core concepts like the Ring, preflists, the Claimant role, the claim algo etc will remain the same?

Where's the best place to start reading about Partisan? I'm interested in the motivation for changing that part of Core.  Is there a special use case or problem you're focused on (e.g. gossip problems in much larger clusters)?

Ta

Martin

Re: Core Claim and Property-Based Tests

Christopher Meiklejohn
We're mainly looking at leveraging Partisan to change the underlying
communication structure -- we hope to have via support in Partisan
soon, along with connection multiplexing, so that we can avoid
bottlenecks related to head-of-line blocking in distributed Erlang,
support SSL/TLS more easily for intra-cluster communication, and
have more robust visibility into how the cluster is operating.

One thing we learned from Riak MDC is that the single connections
used in distributed Erlang are a bottleneck, and are difficult to
apply flow and congestion control to -- we believe a solution based
entirely on gen_tcp would be more flexible.

[Keep in mind this is a ~1 year vision at the moment.]

Thanks,
- Christopher

Re: Core Claim and Property-Based Tests

DeadZen
I'd like to keep the core project going; it just depends on how much interest there is.
There are a lot of separate issues and stalled initiatives, if anyone would like to discuss them. Some have to do simply with scaling distributed Erlang.
There's a riak_core mailing list as well that could probably use some fresh air.

Thanks,
Pedram

Re: Core Claim and Property-Based Tests

Russell Brown
Back to the original post, the important point for me is that this is not really about riak-core, but Riak, the database.

The OP in TL;DR form:

1. A thorough report of a long-lived bug in claim that means many node/ring combos end up with multiple replicas on one physical node, silently!
2. A proposed fix (which I consider very important for anyone running Riak).
3. The most important question: if the OP fixes this, how can everyone benefit?

I had some data-loss fixes merged by Basho in March/April; will they ever be released?
Will each of the major users hard-fork Riak and struggle to benefit from each other's work?

Cheers

Russell

Re: Core Claim and Property-Based Tests

andrei zavada
In reply to this post by Martin Sumner
> ... before the list of Basho's Github people (https://github.com/orgs/basho/people) who still work at Basho is reduced to zero?

Just a note on that list: these are the (few) people who took the
trouble to flip the visibility of their membership in their profiles.
Github seems to have changed the default to be "private", which means
others simply won't see the organisation(s) a user belongs to.

Re: Core Claim and Property-Based Tests

Martin.Cox
In reply to this post by Martin Sumner
Apologies in advance if this doesn't quite submit correctly to the list.

We [bet365] are very much interested in the continued development of Riak in its current incarnation, with Core continuing to be underpinned by distributed Erlang. We are very keen to help to build / shape / support the community around the project. Internally, we have assembled a team to continue the development of Riak, along a roadmap, and are also looking to bring more expertise into the business to help support this. Whilst the Lasp / Partisan project sounds really interesting, and something that could probably be of interest to us in the future, our immediate focus is around stabilising and securing the project in its current form. We’re looking to take Riak forward by contributing to a renewed community effort.

In summary, we're committed to continuing the development of Riak (we've already assembled, and are growing, a team to do so) and are happy to engage with, and support, the community in order to move the project forward.

Thanks

Martin Cox
Software Developer
Hillside (Technology) Limited
e: [hidden email]
bet365.com

Re: Core Claim and Property-Based Tests

DeadZen
Unreleased data loss patches. Sigh. 


Re: Core Claim and Property-Based Tests

Daniel Abrahamsson
In reply to this post by Martin Sumner
Thanks for the writeup and detailed investigation, Martin.

We ran into these issues a few months ago when we expanded a 5-node cluster into an 8-node cluster. We ended up rebuilding the cluster and writing a small escript to verify that the generated Riak ring lived up to our requirements (which were 1: to survive an AZ outage, and 2: to survive any 2 nodes going down at the same time).
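
A sketch of that kind of check (illustrative only, not the actual script; the ring here is just a list of owners in ring order plus a node-to-AZ map):

```
-module(ring_check_sketch).
-export([check/2]).

%% Owners: one node name per partition, in ring order.
%% NodeToAZ: map of node name -> availability zone.
%% Checks every n_val=3 preference list (wrap-around windows of 3).
check(Owners, NodeToAZ) ->
    NVal = 3,
    Q = length(Owners),
    Extended = Owners ++ lists:sublist(Owners, NVal - 1),
    Preflists = [lists:sublist(Extended, I, NVal) || I <- lists:seq(1, Q)],
    lists:all(fun(PL) ->
                  Nodes = lists:usort(PL),
                  AZs = lists:usort([maps:get(N, NodeToAZ) || N <- PL]),
                  length(Nodes) >= 3           %% survives any 2 nodes down
                      andalso length(AZs) >= 2 %% survives one AZ outage
              end, Preflists).
```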

This will be a great document to refer to when explaining the subtleties of setting up a Riak cluster.

//Daniel

Re: Core Claim and Property-Based Tests

Jon Meredith

Thanks for the excellent writeup.  

I have a few notes on your writeup and then a little history to help
explain the motivation for the v3 work.

The Claiming Problem

  One other property of the broader claim algorithm + claimant + handoff
  manager group of processes that's worth mentioning is safety during
  transition.  The cluster should ensure that target N-val copies
  are always available even during transitions.  Much earlier in Riak's
  life the claim would just execute and ownership transfer immediately,
  without putting the data in place (fine, it's eventually consistent, right?)
  but that meant if more than two vnodes in a preference list changed
  ownership then clients would read not found until at least one of the
  objects it was receiving had transferred. The claimant now shepherds those
  transitions so it should be safe.  The solution of transferring the
  data before ownership has fixed the notfound problem, but Riak lost
  agility in adding capacity to the cluster - existing data has to transfer
  to new nodes before they are freed up, and they continue to grow
  while waiting.  In hindsight, Ryan Zezeski's plan of just adding new
  capacity and proxying back to the original vnode is probably a better
  option.

  Predicting load on the cluster is also difficult with the single
  ring with a target n-val set at creation time being used for all
  buckets despite their n-value.  To compute the operations sent to
  each vnode you need to know the proportion of access to each N-value.
  
  There's also the problem that if a bucket is created with an N-value
  larger than target N all bets are off about the number of physical nodes
  values are written to (*cough* strong consistency N=5)

  Having a partitioning-scheme-per-N-value is one way of sidestepping the
  load prediction and max-N problems.

Proximity of Vnodes

  An alternate solution to the target_n_val problem is to change the way
  fallback partitions are added and apply an additional uniqueness constraint
  as target nodes are added.  That provides safety against multiple node
  failures (although it can potentially cause loading problems).  I think
  you imply this at a couple of points when you talk about 'at runtime'.

Proximity of vnodes as the partition list wraps

  One kludge I considered for solving the wraparound problem is to go from
  a ring to a 'spiral', where you add target_n_val-1 additional
  vnodes that alias the first few vnodes in the ring.

  Using the pathologically bad (vnodes) Q=4, (nodes) S=3, (nval) N=3
```
  v0 | v1 | v2 | v3
  nA | nB | nC | nA

  p0 = [ {v1, nB} {v2, nC} {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA} {v0, nA} ] <<< Bad
  p2 = [ {v3, nA} {v0, nA} {v1, nB} ] <<< Bad
  p3 = [ {v0, nA} {v1, nB} {v2, nC} ]
```
  You get 2/4 preflists violating target_n_val=3.

  If you extend the ring to allow aliasing (i.e. go beyond 2^160) but
  only use it for assignment
  
```
  v0 | v1 | v2 | v3 | v0' | v1'
  nA | nB | nC | nA | nB  | nC

  p0 = [ {v1, nB} {v2, nC}  {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA}  {v0', nB} ]
  p2 = [ {v3, nA} {v0', nB} {v1', nC} ]
  p3 = [ {v0, nA} {v1, nB}  {v2, nC} ]
```
  The additional vnodes can never be hashed directly, just during
  wraparound.
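
  A toy sketch of that aliasing rule (plain lists standing in for the
  real ring structure; illustrative only, not riak_core code):
```
  -module(spiral_sketch).
  -export([preflists/3]).

  %% Owners: one node per real vnode, in ring order.
  %% Aliases: owners of the target_n_val-1 aliased copies of v0, v1, ...
  %% which are only ever reached when a preflist runs off the end.
  preflists(Owners, Aliases, TargetN) ->
      Q = length(Owners),
      Extended = Owners ++ Aliases,
      [begin
           Start = (P + 1) rem Q,  %% first vnode after the key's partition
           lists:sublist(Extended, Start + 1, TargetN)
       end || P <- lists:seq(0, Q - 1)].
```
  Calling spiral_sketch:preflists([nA,nB,nC,nA], [nB,nC], 3) returns
  [[nB,nC,nA],[nC,nA,nB],[nA,nB,nC],[nA,nB,nC]], i.e. all four preference
  lists span three distinct nodes, matching the table above.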


As you say, the v3 algorithm was written (by me) a long time ago and
never made it to production.  It was due to a few factors, partly
the non-determinism, partly because I didn't like the (very stupid)
optimization system tying up the claimant node for multiple seconds,
but more troublingly when we did some commissioning tests for a large
customer that ran with a ring size of 256 with 60 nodes we experienced
a performance drop of around 5% when the cluster was maxed out for
reads.  The diversity measurements were much 'better' in that the
v3 claimed cluster was far more diverse and performed better during
node failures, but the (unproven) fear that having a greater number
of saturated disterl connections between nodes dropped performance
without explanation stopped me from promoting it to default.

The reason the v3 algorithm was created was to resolve problems with
longer-lived clusters created with the v2 claim that had had nodes
added and removed over time.  I don't remember all the details now,
but I think the cluster had a ring size of 1024 (to future-proof it,
as there was no 2I/listkeys use on that cluster) and somewhere between 15-30 nodes.

In that particular configuration, the v2 algorithm had left the original
sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
new nodes in place, but that left many places where the original sequential
assignments still existed.

What we hadn't realized at the time is that sequential node assignment
is the *worst* possible plan for handling fallback load.

With N=3, if a node goes down, all of the responsibility for that
node is shifted to another single node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3

With all nodes up, the number of times each node appears in a preflist
is equal: 6 * n1, 6 * n2, 6 * n3, 6 * n4 (N*Q/S, with N=3 here)

But during single node failure
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.
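
Just to make that arithmetic explicit, counting the slots in the n4-down
column of the table above (data copied from the table, not computed from
a model of fallback selection):

```
Preflists = [[n2,n3,n1],[n3,n1,n1],[n1,n1,n2],[n1,n2,n3],
             [n2,n3,n1],[n3,n1,n1],[n1,n1,n2],[n1,n2,n3]],
lists:foldl(fun(Node, Acc) -> maps:update_with(Node, fun(C) -> C + 1 end, 1, Acc) end,
            #{}, lists:append(Preflists)).
%% => #{n1 => 12, n2 => 6, n3 => 6}
```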

In the real scenario, although it was no longer sequentially assigned
there were still a large number of very similar preference lists to
the original assignment (as growing a few nodes on that ring size
only reassigns preference lists in proportion to the new nodes claiming
partitions).

The production cluster was running fairly close to capacity, so the
increased loading during failure, even though it wasn't as bad as doubled,
was enough to push it over the performance 'step', raising tail latencies
and slowing it down enough to overload the vnodes and exhaust memory,
crashing the next node and causing a cascade.  This was before vnodes had
overload protection, so it would present differently now.


Another pre-claimant problem, which shaped some of the earlier claim
code's vnode 'want' thresholds, was that when the nodes were individually
allowed to say if they wanted to claim more vnodes (with the
wants_claim function, before calling choose_claim), there were some states
the cluster would get into where two nodes both decided they were under
capacity and continually tried to claim, causing the vnode to flip-flop
back and forth between them (that was a reason for writing one of the early
QuickCheck tests).


I'm not sure if you've encountered it or not, but riak_core_claim_sim
is also a good tool for testing the behavior of the claim functions and
the claimant.  You don't mention it in your write-up, but one of the
important functions of the claimant is to make sure it only performs
safe transitions between rings.  It makes sure that the n_val is not
violated during handoff.



What to do?

  Fixing the claim algorithm is one way of doing things, but I worry
  it has a number of problems that are hard to solve (multi-AZ, multi-Nval
  etc).

  One more radical option is to dump the ring and just publish a table
  per-vnode of the nodes and vnode hash you'd like to service them.
  Riak doesn't really need consistent hashing - it doesn't *really* use
  its original form (the Dynamo A scheme), and is more of a hybrid
  of the B/C schemes.

  Use cluster metadata and publish out the tables, update riak_core_apl
  to take the new data and serve up the preference lists.  Obviously
  it trickles into things like the vnode and handoff managers, but it
  may be possible.

  That gives you the advantage of no longer being constrained in how
  you assign the nodes - a separation of policy and execution.  You
  could keep the existing ring based algorithms, or you could do something
  better.
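
  As a toy sketch of that separation (not riak_core code - the table is
  just a map from partition index to node list, and the key-to-partition
  mapping here is a simple modulo rather than the ring-increment scheme
  Riak actually uses):
```
  -module(table_apl_sketch).
  -export([get_apl/2]).

  %% Table: partition index (0..Q-1) => list of nodes for that preflist.
  %% The table could be published via cluster metadata; here it is just
  %% an in-memory map.
  get_apl(Key, Table) ->
      Q = maps:size(Table),
      <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
      maps:get(HashInt rem Q, Table).
```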

  It may be interesting to change the number of vnodes/hashing algorithm
  too.  Jordan West was a big fan of Consistent Jump Hashing at one point.
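
  For reference, the jump consistent hash function being referred to
  (Lamping & Veach, 2014) is tiny; a direct Erlang translation (sketch
  only) that takes a 64-bit integer key and returns a bucket in
  0..NumBuckets-1:
```
  -module(jump_hash_sketch).
  -export([jump_hash/2]).

  %% Jump consistent hash, translated from the published C sketch.
  jump_hash(Key, NumBuckets) when NumBuckets > 0 ->
      jump_hash(Key, NumBuckets, -1, 0).

  jump_hash(_Key, NumBuckets, B, J) when J >= NumBuckets ->
      B;
  jump_hash(Key, NumBuckets, _B, J) ->
      Key1 = (Key * 2862933555777941757 + 1) band 16#FFFFFFFFFFFFFFFF,
      J1 = trunc((J + 1) * ((1 bsl 31) / ((Key1 bsr 33) + 1))),
      jump_hash(Key1, NumBuckets, J, J1).
```
  (When the bucket count grows from N to N+1, only about 1/(N+1) of keys
  move; it only maps key to bucket, so any preference-list structure
  would still have to be layered on top.)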

  The thing you give up if you lose the power-of-2 partitioning scheme
  is the ability to split and combine partitions.  Each partition in
  a 64 vnode ring maps to exactly two (non-consecutive) partitions in a 128
  vnode ring.  Which is a very nice for replicating between clusters
  with different ring sizes and localizing where to look for data.

Good luck!

Re: Core Claim and Property-Based Tests

Martin Sumner
In reply to this post by Martin.Cox
Martin,

Great to hear the commitment from bet365.

For everyone's information, there are a number of Riak-related things being worked on at the NHS, and a number of things we plan to work on.  Obviously we're focused only on the needs of the NHS, not of the broader Riak community, but in case this information is of use to others below are some lists just to give an idea of what is:

- currently being worked on at NHS;
- currently prioritised to look into next;
- the Riak features we currently use;
- the bits of Riak we have no current intent to use.

Current WIP:

- looking at core/claim fixes.
- an AAE investigation (similar to the core/claim deep-dive) motivated by a need to understand some issues we see in production with rebuilds after joins/leaves.
- rabl, an open-source real-time replication approach for Riak (that is not hard embedded) and depends on RabbitMQ with shovel to manage inter-site comms.
- OTP19/rebar3 scoping, to try and understand the remaining barriers and challenges to moving the underlying Erlang platform for Riak forward.

(roughly) Ordered Roadmap:

Our roughly ordered roadmap of what we're intending to look at next:

- a repeatable test and build process for open-source riak (e.g. with no magic configuration that you never change once it works).
- physical run-time promises, a way of confirming writes have been written to more than one node that is more highly available than setting pw=2.
- smoothing of aae rebuilds either though support of leveled backend, or getting kv_sweeper production ready (or both).
- alternate behaviour during force repair to try and avoid coverage query issues.
- native support for big sets.
- Riak on OTP19.
- migrating to leveled (as a pure erlang backend that is easier for us to support, and potentially resolves some issues around backups, performance consistency, TCP incast via HEAD request support, large object issues etc).
- selective flush to disk on write (changing flush behaviour per bucket or per object, also allowing for flush once at coordinator not just deferred-flush/flush-all).
- a full-sync replacement for validating changes between clusters (we already use a NHS-developed alternative to full-sync rather than the Basho proprietary full-sync, but it needs improvement).
- observability improvements.
- object expiry.

This is based on a value analysis, not a value/effort analysis - so these priorities will change as we decide to schedule.

The Riak we use:

KV Store (focused on mid-sized objects)
Secondary indexes (including return_terms, term_regex support)
CRDTs (maps and sets)
Erlang Map/Reduce
Real-time Repl (Riak Enterprise)
AAE enabled
Leveldb backend
Multiple 5-20 node clusters
Dual-site bi-directional replication of all clusters
Various backup strategies (some bespoke, some based on leveldb backup)
HTTP API (normally not using Riak clients)
Ubuntu OS

This bits of Riak we have no plans to use:

Riak Search
Riak TS
Riak CS
Riak Control
Javascript Map/Reduce
Riak clients



Regards

Martin


Re: Core Claim and Property-Based Tests

Martin Sumner
In reply to this post by Jon Meredith
Jon,

Many thanks for taking the time to look at this.  You've given me lots to think about, so I will take some time before updating my write-up to take account of your feedback.

I need to go back and look at the safe transfer issues.  I spent some time trying to work out how the claimant transitions from having a plan to committing the plan, and just kept getting lost in the code ... and put it to one side.  I will be brave and dive back in, and have a look at the simulator as well.

The time it takes to expand a cluster, and the cost of that expansion on the existing nodes in the cluster (both during the fold, and the legacy of the page-cache impact after the fold) is something we're worried about.  One of the motivations behind the leveled backend was perhaps to have an alternative to handoff/fold_objects when moving vnodes, whereby the backend could just ship the WAL files instead (and the receiving vnode on the joining node would rebuild the KeyStore from the shipped WAL files).  Perhaps a vnode proxy solution might be better.

With the physical run-time promises, I wasn't thinking of anything clever with fallbacks that would guarantee in more cases that a write reaches two physical nodes; just that the writing client could be made aware when it has been written to two physical nodes, even when two primary vnodes in the preflist are unavailable.  Currently the NHS uses pw=2 as a proxy for guaranteeing something has been written to two physical nodes, but this kills availability when two nodes fail, even though, with target_n_val at 4, the PUT coordinator may be able to wait and confirm that the write reached two physical nodes in all cases.

I'm going to read the jump consistent hash paper - it looks interesting.  Thoughts about radical changes that would support things like AZ-awareness are shoved to the back of my mind at present: even if they are possible, it feels like transitioning old clusters to a radically changed ring (or even ringless) algorithm would just be too hard.  One related point is that we're assuming any future open-source full-sync replication will need to be de-coupled from the need to have consistent ring-sizes (or perhaps even consistent ways of calculating the ring).  In-cluster ring-resizing has now been dropped as a documented feature, having only ever been an experimental one, so the only real option for changing ring-size will be to migrate to a new cluster through replication.


Thanks again

Martin

On 17 May 2017 at 16:34, Jon Meredith <[hidden email]> wrote:

Thanks for the excellent writeup.  

I have a few notes on your writeup and then a little history to help
explain the motivation for the v3 work.

The Claiming Problem

  One other property of the broader claim algorithm + claimant + handoff
  manager group of processes that's worth mentioning is safety during
  transition.  The cluster should ensure that target N-val copies
  are always available even during transitions.  Much earlier in Riak's
  life the claim would just execute and ownership transfer immediately,
  without putting the data in place (fine, it's eventually consistent, right?),
  but that meant that if more than two vnodes in a preference list changed
  ownership, clients would read notfound until at least one of the
  objects being received had transferred. The claimant now shepherds those
  transitions so it should be safe.  The solution of transferring the
  data before ownership has fixed the notfound problem, but Riak lost
  agility in adding capacity to the cluster - existing data has to transfer
  to new nodes before they are freed up, and they continue to grow
  while waiting.  In hindsight, Ryan Zezeski's plan of just adding new
  capacity and proxying back to the original vnode is probably a better
  option.

  Predicting load on the cluster is also difficult with the single
  ring with a target n-val set at creation time being used for all
  buckets regardless of their N-value.  To compute the operations sent to
  each vnode you need to know the proportion of access to each N-value.
  
  There's also the problem that if a bucket is created with an N-value
  larger than target N, all bets are off about the number of physical nodes
  values are written to (*cough* strong consistency N=5)

  Having a partitioning-scheme-per-N-value is one way of sidestepping the
  load prediction and max-N problems.

Proximity of Vnodes

  An alternate solution to the target_n_val problem is to change the way
  fallback partitions are added and apply an additional uniqueness constraint
  as target nodes are added.  That provides safety against multiple node
  failures (although it can potentially cause loading problems).  I think
  you imply this at a couple of points when you talk about 'at runtime'.

Proximity of vnodes as the partition list wraps

  One kludge I considered for solving the wraparound problem is to go from
  a ring to a 'spiral', where you add an extra target_n_val-1
  vnodes that alias the first few vnodes in the ring.

  Using the pathologically bad (vnodes) Q=4, (nodes) S=3, (nval) N=3
```
  v0 | v1 | v2 | v3
  nA | nB | nC | nA

  p0 = [ {v1, nB} {v2, nC} {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA} {v0, nA} ] <<< Bad
  p2 = [ {v3, nA} {v0, nA} {v1, nB} ] <<< Bad
  p3 = [ {v0, nA} {v1, nB} {v2, nC} ]
```
  You get 2/4 preflists violating target_n_val=3.

  If you extend the ring to allow aliasing (i.e. go beyond 2^160) but
  only use it for assignment
  
```
  v0 | v1 | v2 | v3 | v0' | v1'
  nA | nB | nC | nA | nB  | nC

  p0 = [ {v1, nB} {v2, nC}  {v3, nA} ]
  p1 = [ {v2, nC} {v3, nA}  {v0', nB} ]
  p2 = [ {v3, nA} {v0', nB} {v1', nC} ]
  p3 = [ {v0, nA} {v1, nB}  {v2, nC} ]
```
  The additional vnodes can never be hashed directly, just during
  wraparound.
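
  A purely illustrative sketch of the kind of check involved (the module and
  function names here are made up, this is not riak_core code): it counts the
  preflists, with wraparound, whose members do not land on target_n_val
  distinct nodes.
```
%% Sketch: count preflists (with wraparound) whose members are not on
%% TargetN distinct nodes, i.e. target_n_val violations for an owners list.
-module(violation_sketch).
-export([violations/2]).

violations(Owners, TargetN) ->
    Q = length(Owners),
    Preflists = [[lists:nth(((P + I) rem Q) + 1, Owners)
                  || I <- lists:seq(1, TargetN)]
                 || P <- lists:seq(0, Q - 1)],
    length([Pl || Pl <- Preflists, length(lists:usort(Pl)) < TargetN]).
```
  For the Q=4, S=3 example above, violation_sketch:violations([nA, nB, nC, nA], 3)
  returns 2, matching the 2/4 violating preflists.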


As you say, the v3 algorithm was written (by me) a long time ago and
never made it to production.  That was due to a few factors: partly
the non-determinism, partly because I didn't like the (very stupid)
optimization system tying up the claimant node for multiple seconds,
but more troublingly, when we did some commissioning tests for a large
customer running a ring size of 256 with 60 nodes, we experienced
a performance drop of around 5% when the cluster was maxed out for
reads.  The diversity measurements were much 'better' in that the
v3 claimed cluster was far more diverse and performed better during
node failures, but the (unproven) fear that having a greater number
of saturated disterl connections between nodes dropped performance
without explanation stopped me from promoting it to default.

The reason the v3 algorithm was created was to resolve problems with
longer lived clusters created with the v2 claim that had had nodes
added and removed over time.  I don't remember all the details now,
but I think the cluster had a ring size of 1024 (to future proof,
as no 2I/listkey on that cluster) and somewhere between 15-30 nodes.

In that particular configuration, the v2 algorithm had left the original
sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
new nodes in place, but that left many places where the original sequential
assignments still existed.

What we hadn't realized at the time is that sequential node assignment
is the *worst* possible plan for handling fallback load.

With N=3, if a node goes down, all of the responsibility for that
node is shifted to another single node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3

With all nodes up, the number of times each node appears in a preflist
is equal.  6 * n1, 6 * n2, 6 * n3, 6 * n4 each appears (TN*Q/S)

But during single node failure
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.

In the real scenario, although the ring was no longer sequentially assigned,
there were still a large number of preference lists very similar to
the original assignment (as growing by a few nodes at that ring size
only reassigns preference lists in proportion to the partitions the new
nodes claim).

The production cluster was running fairly close to capacity, so the
increased loading during failure, even though it wasn't as bad as doubled,
was enough to push it over the performance 'step': tail latencies worsened,
and it slowed down enough to overload the vnodes and exhaust memory,
crashing the next node and causing a cascade.  This was before vnodes had
overload protection, so it would present differently now.


Another pre-claimant problem, which shaped some of the vnode 'want'
thresholds in the earlier claim code, was that when the nodes were individually
allowed to say if they wanted to claim more vnodes (with the
wants_claim function, before calling choose_claim), there were some states
the cluster would get into where two nodes both decided they were under
capacity and continually tried to claim, causing the vnode to flip/flop
back and forth between them (that was a reason for writing one of the early
QuickCheck tests).


I'm not sure if you've encountered it or not, but the riak_core_claim_sim
is also a good tool for testing the behavior of the claim functions and
the claimant.  You don't mention it in your write up, but one of the
important functions of the claimant is to make sure it only performs
safe transitions between rings.  It makes sure that the n val is not
violated during handoff.



What to do?

  Fixing the claim algorithm is one way of doing things, but I worry
  it has a number of problems that are hard to solve (multi-AZ, multi-Nval
  etc).

  One more radical option is to dump the ring and just publish a table
  per-vnode of the nodes and vnode hash you'd like to service them.
  Riak doesn't really need consistent hashing - it doesn't *really* use
  its original form (the Dynamo A scheme), and is more of a hybrid
  of the B/C schemes.

  Use cluster metadata and publish out the tables, update riak_core_apl
  to take the new data and serve up the preference lists.  Obviously
  it trickles into things like the vnode and handoff managers, but it
  may be possible.
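
  As a purely illustrative sketch of that idea (a made-up module, not a
  riak_core API): keys would hash as they do today, but the preference list
  becomes a plain lookup into a published table rather than a walk around the
  ring, separating placement policy from lookup.
```
%% Illustrative sketch only: preference lists served from an explicit,
%% published table keyed by partition index.
-module(table_apl_sketch).
-export([preflist/3]).

%% Table :: #{PartitionIndex => [node()]}; RingSize is a power of two.
preflist(Key, RingSize, Table) ->
    <<HashInt:160/integer>> = crypto:hash(sha, term_to_binary(Key)),
    Partition = HashInt div ((1 bsl 160) div RingSize),
    maps:get(Partition, Table).
```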

  That gives you the advantage of no longer being constrained in how
  you assign the nodes - a separation of policy and execution.  You
  could keep the existing ring based algorithms, or you could do something
  better.

  It may be interesting to change the number of vnodes/hashing algorithm
  too.  Jordan West was a big fan of Consistent Jump Hashing at one point.
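
  For reference, a minimal sketch of the jump consistent hash function from
  the Lamping & Veach paper (just an illustration of the algorithm, not a
  proposal for riak_core; the module name is made up).
```
%% Sketch of jump consistent hashing (Lamping & Veach, 2014): maps a 64-bit
%% key to one of NumBuckets buckets, moving few keys when buckets are added.
-module(jump_hash_sketch).
-export([bucket/2]).

-define(MASK64, 16#FFFFFFFFFFFFFFFF).

bucket(Key, NumBuckets) when is_integer(Key), NumBuckets > 0 ->
    jump(Key band ?MASK64, NumBuckets, -1, 0).

jump(_Key, NumBuckets, B, J) when J >= NumBuckets ->
    B;
jump(Key0, NumBuckets, _B, J) ->
    Key = (Key0 * 2862933555777941757 + 1) band ?MASK64,  %% 64-bit LCG step
    NextJ = trunc((J + 1) * ((1 bsl 31) / ((Key bsr 33) + 1))),
    jump(Key, NumBuckets, J, NextJ).
```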

  The thing you give up if you lose the power-of-2 partitioning scheme
  is the ability to split and combine partitions.  Each partition in
  a 64 vnode ring maps to exactly two (non-consecutive) partitions in a 128
  vnode ring, which is very nice for replicating between clusters
  with different ring sizes and for localizing where to look for data.

Good luck!
  



On Wed, May 17, 2017 at 6:37 AM Daniel Abrahamsson <[hidden email]> wrote:
Thanks for the writeup and detailed investigation, Martin.

We ran into these issues a few months ago when we expanded a 5 node cluster into an 8 node cluster. We ended up rebuilding the cluster and writing a small escript to verify that the generated Riak ring lived up to our requirements (which were: 1. to survive an AZ outage, and 2. to survive any 2 nodes going down at the same time).

This will be a great document to refer to when explaining the subtleties of setting up a Riak cluster.

//Daniel

Re: Core Claim and Property-Based Tests

Matt Davis
In reply to this post by Martin.Cox
I don't contribute to this list as much as I lurk in #riak (craque), but it's really great to see this kind of community support somewhere, especially at a large place that is heavily invested in riak itself.

I have considered posting some of the operational lessons I've learned over the past five years on riak-based systems. If there will be an organized effort around these types of things, I'm here to help and would love to be involved as well.

-matt


On Wed, May 17, 2017 at 3:19 AM, <[hidden email]> wrote:
Apologies in advance if this doesn't quite submit correctly to the list.

We [bet365] are very much interested in the continued development of Riak in its current incarnation, with Core continuing to be underpinned by distributed Erlang. We are very keen to help to build / shape / support the community around the project. Internally, we have assembled a team to continue the development of Riak, along a roadmap, and are also looking to bring more expertise into the business to help support this. Whilst the Lasp / Partisan project sounds really interesting, and something that could probably be of interest to us in the future, our immediate focus is around stabilising and securing the project in its current form. We’re looking to take Riak forward by contributing to a renewed community effort.

In summary, we're committed to continuing the development of Riak (we've already assembled, and are growing, a team to do so) and are happy to engage with, and support, the community in order to move the project forward.

Thanks

Martin Cox
Software Developer
Hillside (Technology) Limited
e: [hidden email]
bet365.com



Re: Core Claim and Property-Based Tests

Martin Sumner
In reply to this post by Jon Meredith-2
Jon,

With regards to this snippet below, I think I get your point, but I don't think the example is valid:

>>>>>>>

With N=3, if a node goes down, all of the responsibility for that
node is shifted to another single node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3

With all nodes up, the number of times each node appears in a preflist
is equal.  6 * n1, 6 * n2, 6 * n3, 6 * n4 each appears (TN*Q/S)

But during single node failure
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.

>>>>>>>


This is not how I understood fallback election to work.  My understanding is that the fallback is the node which owns the first vnode after the preflist whose node is up, not the node which owns the first vnode after the unavailable vnode.  That is to say, to find fallbacks for a preflist, Riak iterates around the ring starting from after the primaries, not from after the unavailable vnode.

So I would expect the following arrangement on n4 going down.

ring = n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4     (Q=8 S=4,TargetN4)

Partition   All Up
(position)
    0       n2       n3       n4 (p3)
    1       n3       n4 (p3)  n1
    2       n4 (p3)  n1       n2
    3       n1       n2       n3
    4       n2       n3       n4 (p7)
    5       n3       n4 (p7)  n1
    6       n4 (p7)  n1       n2
    7       n1       n2       n3

Partition   n4 down
(position)
    0       n2  n3  n1 (fb for p3)
    1       n3  n1  n2 (fb for p3)
    2       n1  n2  n3 (fb for p3)
    3       n1  n2  n3 
    4       n2  n3  n1 (fb for p7)
    5       n3  n1  n2 (fb for p7)
    6       n1  n2  n3 (fb for p7)  
    7       n1  n2  n3 

So there isn't a biasing of load in this case: all nodes take 33.3% more load.  Interestingly, we do go from having 8 vnodes live on 4 nodes to having 12 vnodes live on 3 nodes when one node fails in this case - so the number of active vnodes does double on all surviving nodes (not sure if the dynamic memory allocation in leveldb handles this?).
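
To make that concrete, here's a rough sketch of the fallback rule as I've described it above (illustrative only - the module and function names are made up, and this is not the riak_core_apl code):
```
%% Rough sketch of the fallback rule described above (not riak_core_apl):
%% the preflist for partition P is the next N owners around the ring; any
%% primary on a down node is replaced by walking onwards from after the
%% primaries, taking owners whose nodes are up.
-module(fallback_sketch).
-export([preflist/4]).

preflist(P, Owners, N, Down) ->
    Q = length(Owners),
    Walk = [lists:nth(((P + I) rem Q) + 1, Owners) || I <- lists:seq(1, Q)],
    {Primaries, Rest} = lists:split(N, Walk),
    UpPrimaries = [Node || Node <- Primaries, not lists:member(Node, Down)],
    Fallbacks = lists:sublist([Node || Node <- Rest,
                                       not lists:member(Node, Down)],
                              N - length(UpPrimaries)),
    UpPrimaries ++ Fallbacks.
```
For the ring above, fallback_sketch:preflist(0, [n1, n2, n3, n4, n1, n2, n3, n4], 3, [n4]) gives [n2, n3, n1], matching the first row of the n4-down table.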

I still think you have a valid point about simple sequenced allocations being bad though.  If we have a ring-size of 16 and a node count of 8:

ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8    (Q=16 S=8,TargetN4)

Then on failure of node n4, nodes n5, n6 and n7 each have a 33% increase in load, but all other nodes remain at their previous load.  This is true for any single failure in this diagonalised ring.

Whereas if we shuffle the ring this way:

ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n4 | n3 | n2 | n1 | n8 | n7 | n6 | n5    (Q=16 S=8,TargetN4)

Now n4 going down will lead to n1, n2, n3, n5, n6 and n7 each having 16.7% extra load.  This more even balance of load is also true for the failure of any node in this ring.
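
(To check these figures, the fallback sketch earlier in this mail can be extended with a small tally - again purely illustrative:)
```
%% Tally preflist entries per node, using the fallback_sketch module above
%% (illustrative helper, not riak_core code).
load_per_node(Owners, N, Down) ->
    Q = length(Owners),
    Entries = lists:append([fallback_sketch:preflist(P, Owners, N, Down)
                            || P <- lists:seq(0, Q - 1)]),
    lists:foldl(fun(Node, Acc) -> orddict:update_counter(Node, 1, Acc) end,
                orddict:new(), Entries).
```
Running this over the two layouts above with N=3 and n4 down gives 8 entries each for n5, n6 and n7 in the diagonalised ring, and 7 entries each for n1, n2, n3, n5, n6 and n7 in the shuffled ring, against a baseline of 6 per node.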

So I think your point is valid - but I think the example is wrong.

And claim v3 does do a good job on these two rings, as the scoring for the second ring is much better:

> ScoreFun = fun(L, N) -> riak_core_claim_util:score_am(lists:sort(riak_core_claim_util:adjacency_matrix_from_al(riak_core_claim_util:adjacency_list(L))),N) end.
#Fun<erl_eval.12.80484245>

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n1, n2, n3, n4, n5, n6, n7, n8], 4).
109.7142857142855

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n4, n3, n2, n1, n8, n7, n6, n5], 4).
61.71428571428584

Nor can I think of a way of bettering this by improving claim v2.  However, as mentioned in the long read, one of the issues with claim v3 might be that rings with bad properties (such as loss of physical diversity on dual node failures) can also get good scores:

> ScoreFun([n1, n2, n3, n4, n1, n2, n3, n4, n5, n6, n7, n8, n5, n6, n7, n8], 4).
66.0


Regards

Martin


Re: Core Claim and Property-Based Tests

Jon Meredith-2
That's what I get for doing things from memory and not running the simulator :)
I think you're right about the actual operation of the preference lists - but I haven't had a chance to look over the code or run some simulations. The effect isn't quite as severe, but as you say it unevenly loads the cluster, which is best avoided.

I took a quick look at the v3 code from my phone and it looks like the plans are ordered by target nval violations, node ownership balance and finally diversity.
It shouldn't ever pick a plan that increases violations; however, if the starting plan has violations that aren't resolved (which is very possible for lots of combinations of Q and S), then it will start optimizing on balance and then diversity. Maybe there's a bug somewhere.

One thing I thought about but never implemented was adding some freedom to the algorithm to move a small number of unnecessary partitions to see if that would free things up and get better balance/diversity.

For improving v2 you might try seeing if there's a way to deterministically shuffle the point where it starts to look for partitions to take - rather than always from the beginning or end.

Jon



Re: Core Claim and Property-Based Tests

Martin Sumner
Jon,

BTW, I wasn't suggesting that claim v3 would choose a plan with violations over one without - so I don't think there is a bug.

The plan below which I said scored well, but was "worse" than a purely sequential plan, is worse only in the sense that it does not cope as well with dual node failures (which is not something the algorithm even tries to score for).  So in the two plans below, the first plan scores better for diversity, but there are numerous dual-node failures (e.g. n1 and n2) that would lead to writes to some partitions being stored on just two physical nodes.  There are no target_n_val violations in either cluster.

> ScoreFun([n1, n2, n3, n4, n1, n2, n3, n4, n5, n6, n7, n8, n5, n6, n7, n8], 4).
66.0

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n1, n2, n3, n4, n5, n6, n7, n8], 4).
109.7142857142855

Perhaps the algorithm could be expanded to include other factors, such as the minimum spacing for any node, but then I assume other anomalies will occur.  I suspect the scoring algorithm could be made ever more complex without ever managing to please all of the people all of the time.  It is because of this that I think claim v3 is perhaps best suited as a secondary option rather than the default algorithm, and that we should work on ironing out some of the v2 issues rather than just switching to v3.
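
To illustrate the sort of extra factor I have in mind, something along these lines (purely a sketch, with made-up names) would show, for each pair of nodes, how many preflists would lose two primaries at once if that pair failed together:
```
%% Sketch only: for each pair of nodes, count the preflists containing both,
%% i.e. the preflists that would lose two primaries if that pair failed.
-module(pair_check_sketch).
-export([pair_exposure/2]).

pair_exposure(Owners, N) ->
    Q = length(Owners),
    Preflists = [[lists:nth(((P + I) rem Q) + 1, Owners)
                  || I <- lists:seq(1, N)] || P <- lists:seq(0, Q - 1)],
    Nodes = lists:usort(Owners),
    [{A, B, length([Pl || Pl <- Preflists,
                          lists:member(A, Pl),
                          lists:member(B, Pl)])}
     || A <- Nodes, B <- Nodes, A < B].
```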

Thanks

Martin


On 19 May 2017 at 05:06, Jon Meredith <[hidden email]> wrote:
That's what I get for doing things from memory and not running the simulator :)
I think you're right about the actual operation of the preference lists - but i haven't had a chance to look over the code or run some simulations. The effect isn't quite as severe, but as you say unevenly loads the cluster which is best avoided.

I took a quick look at the v3 code from my phone and it looks like the plans are ordered by target nval violations, node ownership balance and finally diversity.
It shouldn't ever pick a plan that increases violations, however if the starting plan has violations that aren't resolved (which is very possible for lots of combinations of Q and S) then it will start optimizing on the balance and then diversity. Maybe there's a bug somewhere.

One thing I thought about but never implemented was adding some freedom to the algorithm to move a small number of unnecessary partitions to see if that would free things up and get better balance/diversity.

For improving v2 you might try seeing if there's a way to deterministically shuffle the starting point where it starts to look for patitions to take - rather than always from the beginning or end. 

Jon


On Thu, May 18, 2017 at 12:21 PM Martin Sumner <[hidden email]> wrote:
Jon,

With regards to this snippet below, I think I get your point, but I don't think the example is valid:

>>>>>>>

If with N=3 if a node goes down, all of the responsibility for that
node is shift to another single node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8 S=4,TargetN4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3

With all nodes up, the number of times each node appears in a preflist
is equal.  6 * n1, 6 * n2, 6 * n3, 6 * n4 each appears (TN*Q/S)

But during single node failure
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.

>>>>>>>


This is not how I understood fallback election works.  My understanding is that the fallback is the node which owns the first vnode after the preflist, where the node is up.  Not the node which owns the first vnode after the unavailable vnode.  That is to say to find fallbacks for a preflist, Riak will iterate around the ring from after the primaries, not iterate around the ring from after the unavailable vnode.

So I would expect the following arrangement on n4 going down.

ring = n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4     (Q=8 S=4,TargetN4)

Partition   All Up
(position)
    0       n2       n3       n4 (p3)  
    1       n3       n4 (p3) n1
    2       n4 (p3) n1       n2
    3       n1       n2        n3 
    4       n2       n3        n4 (p7)
    5       n3       n4 (p7) n1
    6       n4 (p7) n1       n2
    7       n1        n2       n3

Partition   n4 down
(position)
    0       n2  n3  n1 (fb for p3)
    1       n3  n1  n2 (fb for p3)
    2       n1  n2  n3 (fb for p3)
    3       n1  n2  n3 
    4       n2  n3  n1 (fb for p7)
    5       n3  n1  n2 (fb for p7)
    6       n1  n2  n3 (fb for p7)  
    7       n1  n2  n3 

So there isn't a biasing of load in this case, all nodes 33.3% more load.  Interestingly we do go from having 8 vnodes live on 4 nodes, to having 12 vnodes live on 3 nodes when one node fails in this case - so the number of vnodes active does double on all nodes (not sure if the dynamic memory allocation in leveldb handles this?).  

I still think you have a valid point about about simple sequenced allocations being bad though.  If we have a ring-size of 16 and a node count of 8:

ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8    (Q=16 S=8,TargetN4)

Then on failure of node 4 - n5, n6, n7 have a 33% increase in load, but all other nodes remain with their previous load.  This is true for any failure in this diagonalised ring.

Whereas if we shuffle the ring this way:

ring = n1 | n2 | n3 | n4 | n5 | n6 | n7 | n8 | n4 | n3 | n2 | n1 | n8 | n7 | n6 | n5    (Q=16 S=8,TargetN4)

Now node 4 down will lead to n1, n2, n3, n5, n6, and n7 each having 16.7% extra load each.  This more even balance of load is also true for the failure of any node in this ring.

So I think your point is valid - but I think the example is wrong.

And claim v3 does do a good job on these two rings, as the scoring for the second ring is much better:

> ScoreFun = fun(L, N) -> riak_core_claim_util:score_am(lists:sort(riak_core_claim_util:adjacency_matrix_from_al(riak_core_claim_util:adjacency_list(L))),N) end.
#Fun<erl_eval.12.80484245>

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n1, n2, n3, n4, n5, n6, n7, n8], 4).
109.7142857142855

> ScoreFun([n1, n2, n3, n4, n5, n6, n7, n8, n4, n3, n2, n1, n8, n7, n6, n5], 4).
61.71428571428584

Neither can I think of a way of bettering this by improving claim v2.  However as mentioned in the long read, one of the issues with claim v3 might be that rings with bad properties (such as loss of physical diversity on dual node failures) can also get good scores:

> ScoreFun([n1, n2, n3, n4, n1, n2, n3, n4, n5, n6, n7, n8, n5, n6, n7, n8], 4).
66.0


Regards

Martin
On 17 May 2017 at 16:34, Jon Meredith <[hidden email]> wrote:

Thanks for the excellent writeup.  

I have a few notes on your writeup and then a little history to help
explain the motivation for the v3 work.

The Claiming Problem

  One other property of the broader claim algorithm + claimant + handoff
  manager group of processes that's worth mentioning is safety during
  transition.  The cluster should ensure that target N-val copies
  are always available even during transitions.  Much earlier in Riak's
  life the claim would just execute and ownership transfer immediately,
  without putting the data in place (fine, it's eventually consistent, right?)
  but that meant if more than two vnodes in a preference list changed
  ownership then clients would read not found until at least one of the
  objects it was receiving had transferred. The claimant now shepherds those
  transitions so it should be safe.  The solution of transferring the
  data before ownership has fixed the notfound problem, but Riak lost
  agility in adding capacity to the cluster - existing data has to transfer
  to new nodes before they are freed up, and they continue to grow
  while waiting.  In hindsight, Ryan Zezeski's plan of just adding new
  capacity and proxying back to the original vnode is probably a better
  option.

  Predicting load on the cluster is also difficult with the single
  ring with a target n-val set at creation time being used for all
  buckets despite their n-value.  To compute the operations sent to
  each vnode you need to know the proportion of access to each N-value.
  
  There's also the problem that if a bucket is created with an N-value
  larger than target N all bets are off about the number of physical nodes
  values are written to (*cough* strong consistency N-5)

  Having a partitioning-scheme-per-N-value is one way of sidestepping the
  load prediction and max-N problems.

Promixity of Vnodes

  An alternate solution to the target_n_val problem is to change the way
  fallback partitions are added and apply an additional uniqueness constraint
  as target nodes are added.  That provides safety against multiple node
  failures (although can potentially cause loading problems).  I think
  you imply this a couple of points when you talk about 'at runtime'.

Proximity of vnodes as the partition list wraps

  One kludge I considered solving the wraparound problem is to go from
  a ring to a 'spiral' where you add extra target_n_val-1 additional
  vnodes that alias the few vnodes in the ring.

  Using the pathalogically bad (vnodes) Q=4, (nodes) S=3, (nval) N=3
```
  v0 | v1 | v2 | v3
  nA | nB | nC | nA

  p0 = [ {v1, nB} {v2, Nc} {v3, nA} ]
  p1 = [ {v2, Nc} {v3, nA} {v0, nA} ] <<< Bad
  p2 = [ {v3, nA} {v0, nA} {v1, nB} ] <<< Bad
  p3 = [ {v0, nA} {v1, nB} {v2, nC} ]
```
  You get 2/4 preflists violating target_n_val=3.

  If you extend the ring to allow aliasing (i.e. go beyond 2^160) but
  only use it for assignment
  
```
  v0 | v1 | v2 | v3 | v0' | v1'
  nA | nB | nC | nA | nB  | nC

  p0 = [ {v1, nB} {v2, Nc}  {v3, nA} ]
  p1 = [ {v2, Nc} {v3, nA}  {v0', nB} ]
  p2 = [ {v3, nA} {v0', nB} {v1', nB} ]
  p3 = [ {v0, nA} {v1, nB}  {v2, nC} ]
```
  The additional vnodes can never be hashed directly, just during
  wraparound.


As you say, the v3 algorithm was written (by me) a long time ago and
never made it to production.  It was due to a few factors, partly
the non-determinism, partly because I didn't like the (very stupid)
optimization system tying up the claimant node for multiple seconds,
but more troublingly when we did some commissioning tests for a large
customer that ran with a ring size of 256 with 60 nodes we experienced
a performance drop of around 5% when the cluster was maxed out for
reads.  The diversity measurements were much 'better' in that the
v3 claimed cluster was far more diverse and performed better during
node failures, but the (unproven) fear that having a greater number
of saturated disterl connections between nodes dropped performance
without explanation stopped me from promoting it to default.

The reason the v3 algorithm was created was to resolve problems with
longer lived clusters created with the v2 claim that had had nodes
added and removed over time.  I don't remember all the details now,
but I think the cluster had a ring size of 1024 (to future proof,
as no 2I/listkey on that cluster) and somewhere between 15-30 nodes.

In that particular configuration, the v2 algorithm had left the original
sequential node assignment (n1, n2, ..., n15, n1, n2, ...) and assigned
new nodes in place, but that left many places were the original sequential
assignments still existed.

What we hadn't realized at the time is that sequential node assignment
is the *worst* possible plan for handling fallback load.

If with N=3 if a node goes down, all of the responsibility for that
node is shift to another single node in the cluster.

n1 | n2 | n3 | n4 | n1 | n2 | n3 | n4    (Q=8, S=4, TargetN=4)

Partition   All Up     n4 down
(position)
    0       n2 n3 n4   n2 n3 n1
    1       n3 n4 n1   n3 n1 n1
    2       n4 n1 n2   n1 n1 n2
    3       n1 n2 n3   n1 n2 n3
    4       n2 n3 n4   n2 n3 n1
    5       n3 n4 n1   n3 n1 n1
    6       n4 n1 n2   n1 n1 n2
    7       n1 n2 n3   n1 n2 n3

With all nodes up, the number of times each node appears in a preflist
is equal: 6 * n1, 6 * n2, 6 * n3, 6 * n4 (each node appears N*Q/S times).

But during a single node failure:
12 * n1, 6 * n2, 6 * n3, n4 down.

The load on n1 is doubled.
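
A quick sketch that reproduces that tally, using the same simplified
fallback model as the table above (a down node's slot is taken by the
owner of the next partition in ring order - a simplification, not what
riak_core_apl actually does; module name hypothetical):

```
%% Sketch: count how often each node appears across the N preference
%% lists of a ring, with the simplified rule that a down owner's slot
%% is taken by the owner of the following partition.
-module(fallback_tally_sketch).
-export([tally/3]).

tally(Owners, DownNode, N) ->
    Q = length(Owners),
    Nodes = [owner(Owners, Q, DownNode, (Pos + Off) rem Q)
             || Pos <- lists:seq(0, Q - 1), Off <- lists:seq(1, N)],
    lists:foldl(fun(Node, Counts) ->
                        maps:update_with(Node, fun(C) -> C + 1 end, 1, Counts)
                end, #{}, Nodes).

%% Owner of position Idx (0-based), replaced by the next position's
%% owner when it is the down node.
owner(Owners, Q, DownNode, Idx) ->
    case lists:nth(Idx + 1, Owners) of
        DownNode -> lists:nth(((Idx + 1) rem Q) + 1, Owners);
        Node     -> Node
    end.
```

For the ring above, tally([n1,n2,n3,n4,n1,n2,n3,n4], n4, 3) gives
#{n1 => 12, n2 => 6, n3 => 6}, while the same call with no node down
gives 6 appearances per node.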

In the real scenario, although the ring was no longer sequentially
assigned, there were still a large number of preference lists very
similar to the original assignment (growing by a few nodes at that ring
size only reassigns preference lists in proportion to the partitions
the new nodes claim).

The production cluster was running fairly close to capacity, so the
increased loading during failure, even though it wasn't as bad as
doubled, was enough to push it over the performance 'step': tail
latencies grew, the vnodes became overloaded, memory was exhausted,
and the next node crashed, causing a cascade.  This was before vnodes
had overload protection, so it would present differently now.


Another pre-claimant problem, which shaped some of the earlier claim
code's vnode 'want' thresholds, was that when nodes were individually
allowed to say whether they wanted to claim more vnodes (with the
wants_claim function, called before choose_claim), there were some
states the cluster could get into where two nodes both decided they
were under capacity and continually tried to claim, causing a vnode to
flip-flop back and forth between them (that was a reason for writing
one of the early QuickCheck tests).


I'm not sure if you've encountered it or not, but riak_core_claim_sim
is also a good tool for testing the behavior of the claim functions and
the claimant.  You don't mention it in your write-up, but one of the
important functions of the claimant is to make sure it only performs
safe transitions between rings: it makes sure that the n-val is not
violated during handoff.
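
A much-simplified sketch of the kind of invariant involved (hypothetical
module; the real claimant checks are more involved and also have to
consider the transitional states during handoff): verify that every run
of TargetN consecutive owners in a proposed ring spans TargetN distinct
nodes.

```
%% Sketch: check that every window of TargetN consecutive owners in a
%% proposed owners list contains TargetN distinct nodes.  A stand-in
%% for the safety checks the claimant performs, not the real code.
-module(ring_invariant_sketch).
-export([meets_target_n/2]).

meets_target_n(Owners, TargetN) ->
    Q = length(Owners),
    lists:all(
      fun(Pos) ->
              Window = [lists:nth(((Pos + Off) rem Q) + 1, Owners)
                        || Off <- lists:seq(0, TargetN - 1)],
              length(lists:usort(Window)) =:= TargetN
      end,
      lists:seq(0, Q - 1)).
```

For example, meets_target_n([nA, nB, nC, nA], 3) returns false, which is
the Q=4, S=3 violation discussed earlier.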



What to do?

  Fixing the claim algorithm is one way of doing things, but I worry
  that approach still has a number of problems that are hard to solve
  (multi-AZ, multi-N-value, etc.).

  One more radical option is to dump the ring and just publish a
  per-vnode table of the nodes (and vnode hashes) you'd like to service
  them.  Riak doesn't really need consistent hashing - it doesn't
  *really* use its original form (the Dynamo A scheme), and is more of
  a hybrid of the B/C schemes.

  Use cluster metadata to publish out the tables, and update
  riak_core_apl to take the new data and serve up the preference lists.
  Obviously it trickles into things like the vnode and handoff managers,
  but it may be possible.
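
  A sketch of what the lookup side could look like (hypothetical module;
  how the table is built, agreed and distributed is hand-waved here, and
  the real riak_core_apl derives preference lists from the ring):

```
%% Sketch: serve preference lists straight from a published table keyed
%% by partition index, rather than deriving them from the consistent
%% hash ring.  The table is assumed to have been distributed by some
%% other mechanism (e.g. cluster metadata).
-module(table_apl_sketch).
-export([get_apl/3]).

%% Table    :: #{PartitionIndex => [{PartitionIndex, Node}]}
%% DocIdx   :: integer() key hash in [0, 2^160)
%% RingSize :: number of partitions
get_apl(DocIdx, Table, RingSize) ->
    Interval = (1 bsl 160) div RingSize,
    %% simplified: map the hash onto the partition interval it falls in
    Partition = (DocIdx div Interval) * Interval,
    maps:get(Partition, Table).
```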

  That gives you the advantage of no longer being constrained in how
  you assign the nodes - a separation of policy and execution.  You
  could keep the existing ring-based algorithms, or you could do
  something better.

  It may be interesting to change the number of vnodes and/or the
  hashing algorithm too.  Jordan West was a big fan of jump consistent
  hashing at one point.
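
  For reference, a sketch of jump consistent hashing (Lamping & Veach,
  "A Fast, Minimal Memory, Consistent Hash Algorithm"); how it would be
  wired into vnode assignment in riak_core is left entirely open:

```
%% Sketch: jump consistent hash.  Maps a 64-bit key to one of
%% NumBuckets buckets; when NumBuckets grows by one, only ~1/NumBuckets
%% of the keys move.
-module(jump_hash_sketch).
-export([bucket/2]).

-define(MASK64, 16#FFFFFFFFFFFFFFFF).

bucket(Key, NumBuckets) when NumBuckets > 0 ->
    jump(Key band ?MASK64, NumBuckets, -1, 0).

jump(_Key, NumBuckets, B, J) when J >= NumBuckets ->
    B;
jump(Key0, NumBuckets, _B, J) ->
    %% 64-bit linear congruential step, masked to stay within 64 bits
    Key = (Key0 * 2862933555777941757 + 1) band ?MASK64,
    NextJ = trunc((J + 1) * ((1 bsl 31) / ((Key bsr 33) + 1))),
    jump(Key, NumBuckets, J, NextJ).
```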

  The thing you give up if you lose the power-of-2 partitioning scheme
  is the ability to split and combine partitions.  Each partition in
  a 64-vnode ring maps to exactly two (non-consecutive) partitions in a
  128-vnode ring, which is very nice for replicating between clusters
  with different ring sizes and for localizing where to look for data.

Good luck!
  



On Wed, May 17, 2017 at 6:37 AM Daniel Abrahamsson <[hidden email]> wrote:
Thanks for the writeup and detailed investigation, Martin.

We ran into these issues a few months ago when we expanded a 5-node cluster into an 8-node cluster. We ended up rebuilding the cluster and writing a small escript to verify that the generated Riak ring lived up to our requirements (which were: 1, to survive an AZ outage; and 2, to survive any 2 nodes going down at the same time).

This will be a great document to refer to when explaining the subtleties of setting up a Riak cluster.

//Daniel
_______________________________________________
riak-users mailing list
[hidden email]
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com

