Data modeling a write-intensive comment storage cluster


Data modeling a write-intensive comment storage cluster

fxmy wang

Greetings List,

I'm a newcomer who only has some experience with RDBMSs, so please enlighten me if I'm doing something silly.

So I'm trying to use Riak for storing video comments - small records, but a huge amount of data.
Prerequisites:

- One bucket for one video.
- Keys will consist of a timestamp and userID.
- Values will be plain text, containing a short comment and some tags.
 Should not be larger than 10 KB.
- Values are seldom modified.
- Write-intensive: some hot videos may have ~100,000 people watching at the same time.
- There will be multiple Erlang-pb clients doing writes.

Then here are my questions:
1) To get better write throughput, is it right to set w=1?
2) What's the best way to query these comments? In this use case I don't need to retrieve all the comments in a bucket, just the latest few hundred comments (if there are that many), ordered by the time they were posted.

Right now I'm thinking of using link-walking and keeping track of the latest comment, so I can trace backwards along the chain to get the latest 500 comments (for example). When a new comment comes in, it points to the old latest comment, and then the latest-comment marker is updated to the new one.
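To make that concrete, roughly something like this (only a sketch, assuming the official Erlang pb client riakc; the <<"latest">> marker key and the helper name are made up, and the previous key is stored inside the value instead of as a Riak link):

%% Rough sketch of the chain idea above (riakc assumed).
add_comment(Pid, VideoBucket, CommentKey, Text) ->
    %% 1. Read the per-video "latest" marker.
    PrevKey = case riakc_pb_socket:get(Pid, VideoBucket, <<"latest">>) of
                  {ok, Marker}      -> riakc_obj:get_value(Marker);
                  {error, notfound} -> <<"none">>
              end,
    %% 2. Write the new comment, pointing back at the previous latest.
    Comment = riakc_obj:new(VideoBucket, CommentKey,
                            term_to_binary({Text, PrevKey})),
    ok = riakc_pb_socket:put(Pid, Comment),
    %% 3. Move the marker forward. This read-modify-write of a shared key
    %%    is exactly where the race described below can happen.
    ok = riakc_pb_socket:put(Pid,
                             riakc_obj:new(VideoBucket, <<"latest">>, CommentKey)).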

So in the scenario above, is it possible that after one client has written on node A and updated the latest marker, another client on node B does not yet see the change, points its new comment at the old latest, and thereby creates a "branch" in the chain?
If this can happen, what can be done to avoid it? Are there better ways to store and query these comments? Any reply is appreciated.

B.R.


Re: Data modeling a write-intensive comment storage cluster

Jeremiah Peschka

Responses inline

---
sent from a tiny portion of the hive mind...
in this case, a phone
On Jan 25, 2014 5:16 PM, "fxmy wang" <[hidden email]> wrote:
>
> Greetings List,
>
> I'm a newcomer who only has some experience with RDBMSs, so please enlighten me if I'm doing something silly.
>
> So I'm trying to use Riak for storing video comments - small records, but a huge amount of data.
> Prerequisites:
>
> - One bucket for one video.

As long as you keep a list of all videos elsewhere, this should be good. The new CRDTs in Riak 2.0 should work well for keeping a list of all videos.
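For example, something along these lines (a sketch only; it assumes a Riak 2.0 bucket type named "sets" created with datatype = set, and the bucket and key names are made up):

%% Sketch: track every video in a Riak 2.0 set CRDT (riakc assumed).
{ok, Pid} = riakc_pb_socket:start_link("127.0.0.1", 8087),
Set = riakc_set:add_element(<<"video_12345">>, riakc_set:new()),
ok = riakc_pb_socket:update_type(Pid,
                                 {<<"sets">>, <<"videos">>},  %% {bucket type, bucket}
                                 <<"all_videos">>,            %% key holding the set
                                 riakc_set:to_op(Set)).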

> - Keys will consist of a timestamp and userID.
> - Values will be plain text, containing a short comment and some tags.
>  Should not be larger than 10 KB.
> - Values are seldom modified.
> - Write-intensive: some hot videos may have ~100,000 people watching at the same time.
> - There will be multiple Erlang-pb clients doing writes.
>
> Then here are my questions:
> 1) To get better write throughput, is it right to set w=1?

This will improve perceived throughput at the client, but it won't improve throughput at the server.
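With the Erlang pb client, w is just a per-request option on the put; a minimal sketch (the bucket, key and value are made up, and Pid is a riakc_pb_socket connection as in the set example above):

%% The cluster still writes n_val replicas; w=1 only means the client is
%% acked as soon as one replica confirms the write.
Obj = riakc_obj:new(<<"video_12345">>,                    %% one bucket per video
                    <<"20140126T093000.123456_user42">>,  %% timestamp + userID
                    <<"nice video! #cats #funny">>),
ok = riakc_pb_socket:put(Pid, Obj, [{w, 1}]).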

> 2) What's the best way to query these comments? In this use case I don't need to retrieve all the comments in a bucket, just the latest few hundred comments (if there are that many), ordered by the time they were posted.
>
> Right now I'm thinking of using link-walking and keeping track of the latest comment, so I can trace backwards along the chain to get the latest 500 comments (for example). When a new comment comes in, it points to the old latest comment, and then the latest-comment marker is updated to the new one.
>

I wouldn't use link-walking. IIRC this uses MapReduce under the covers. You could use a single key to store the most recent comment.

You can get the most recent n keys using secondary index queries on the $bucket index, sorting, and pagination.
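Something along these lines (a sketch, not tested; it assumes the Erlang client, its index_results record from riakc.hrl, and a 2i-capable backend such as eLevelDB):

%% Sketch: page through a bucket's keys via the special $bucket index.
-include_lib("riakc/include/riakc.hrl").

latest_page(Pid, VideoBucket, PageSize) ->
    {ok, ?INDEX_RESULTS{keys = Keys, continuation = Cont}} =
        riakc_pb_socket:get_index_eq(Pid, VideoBucket,
                                     <<"$bucket">>, VideoBucket,
                                     [{max_results, PageSize}]),
    %% Keys come back sorted ascending by key; pass Cont back in a
    %% {continuation, Cont} option to fetch the next page.
    {Keys, Cont}.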

> So in the scenario above, is it possible that after one client has written on node A and updated the latest marker, another client on node B does not yet see the change, points its new comment at the old latest, and thereby creates a "branch" in the chain?
> If this can happen, what can be done to avoid it? Are there better ways to store and query these comments? Any reply is appreciated.

You can avoid siblings by serializing all of your writes through a single writer. That's not a great idea since you lose many of Riak's benefits.

You could also use a CRDT with a register type. These tend toward the last writer.
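In Riak 2.0 a register lives inside a map data type; a sketch with the Erlang client (assuming a bucket type "maps" created with datatype = map; the bucket and field names are illustrative):

%% Sketch: keep the per-video "latest comment key" in a register inside a
%% Riak 2.0 map. Concurrent writes to a register resolve as last-write-wins.
point_latest(Pid, VideoId, CommentKey) ->
    Map = riakc_map:update({<<"latest">>, register},
                           fun(R) -> riakc_register:set(CommentKey, R) end,
                           riakc_map:new()),
    riakc_pb_socket:update_type(Pid,
                                {<<"maps">>, <<"video_meta">>},  %% {bucket type, bucket}
                                VideoId,
                                riakc_map:to_op(Map)).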

The point is that you need to decide how you want to deal with this type of scenario - it's going to happen. In the worst case, you lose a write briefly.

>
> B.R.



Re: Data modeling a write-intensive comment storage cluster

fxmy wang
Thanks for the response, Jeremiah.



> > Then here are my questions:
> > 1) To get better write throughput, is it right to set w=1?
>
> This will improve perceived throughput at the client, but it won't improve throughput at the server.

Thank you for clarifying this for me :D

> > 2) What's the best way to query these comments? In this use case I don't need to retrieve all the comments in a bucket, just the latest few hundred comments (if there are that many), ordered by the time they were posted.
> >
> > Right now I'm thinking of using link-walking and keeping track of the latest comment, so I can trace backwards along the chain to get the latest 500 comments (for example). When a new comment comes in, it points to the old latest comment, and then the latest-comment marker is updated to the new one.
> >
>
> I wouldn't use link-walking. IIRC this uses MapReduce under the covers. You could use a single key to store the most recent comment.

What's bad about MapReduce?
Since there will be another cache layer on top of the cluster, reads
will be relatively infrequent. That's why I chose link-walking.

> You can get the most recent n keys using secondary index queries on the $bucket index, sorting, and pagination.
I'm not sure what you mean here =.=
How can I query the most recent n keys using 2i? Should I put a
timestamp ----- say, bucketed by hour ----- in a 2i on incoming comments,
and then at query time hit the 2i hour segment by segment? This seems a
little blind, because some videos can go a long time before being
commented on again. Querying based on time segmentation seems like
shooting in the dark to me :\

And the docs say the list-keys operation should not be used in
production, so that's a no-go either :\


> > So in the scenario above, is it possible that after one client has written on node A and updated the latest marker, another client on node B does not yet see the change, points its new comment at the old latest, and thereby creates a "branch" in the chain?
> > If this can happen, what can be done to avoid it? Are there better ways to store and query these comments? Any reply is appreciated.
>
> You can avoid siblings by serializing all of your writes through a single writer. That's not a great idea since you lose many of Riak's benefits.
> You could also use a CRDT with a register type. These tend toward the last writer.

My goal is to form a kind of single-chain relationship through the
keys, based on timestamp, under high concurrent write pressure. Through
this relationship I can easily pick out the last hundreds or thousands
of comments.
As Jeremiah said, serializing all writes through a single writer avoids
siblings entirely. And note that we don't have a key-clashing problem
here ------ every comment holds a unique key. What we want is the
single-chain relationship. So how about this:

The multiple Erlang pb clients just do the writes and don't care about
the chaining.
A post-commit hook notifies one special globally registered process
(which should run inside the Riak cluster?) that "here comes a new
comment, chain it up when appropriate".
Is this feasible? And if it is, how should I prepare for the cluster
partition and rejoin scenario when the network fails?
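Roughly what I have in mind (only a sketch; the module, the process name and the message format are made up, and the hook module would have to be on every Riak node's code path):

%% Sketch: a post-commit hook that forwards new comment keys to a globally
%% registered "liner" process, which serializes the chain updates.
-module(comment_hook).
-export([notify_liner/1]).

notify_liner(RiakObject) ->
    Bucket = riak_object:bucket(RiakObject),
    Key    = riak_object:key(RiakObject),
    case global:whereis_name(comment_liner) of
        undefined -> ok;                            %% liner not up; drop the event
        LinerPid  -> LinerPid ! {new_comment, Bucket, Key}
    end,
    ok.  %% post-commit hooks run after the write; the return value is ignored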

> The point is that you need to decide how you want to deal with this type of scenario - it's going to happen. In the worst case, you lose a write briefly.

Hopefully the method above could avoid this :)

Please, everyone, share your thoughts. _(:3JZ)_

B.R.


Re: Data modeling a write-intensive comment storage cluster

Jeremiah Peschka


---
Jeremiah Peschka - Founder, Brent Ozar Unlimited
MCITP: SQL Server 2008, MVP
Cloudera Certified Developer for Apache Hadoop


On Sun, Jan 26, 2014 at 10:27 PM, fxmy wang <[hidden email]> wrote:
Thanks for the response, Jeremiah.



> > Then here are my questions:
> > 1) To get better write throughput, is it right to set w=1?
>
> This will improve perceived throughput at the client, but it won't improve throughput at the server.

Thank you for clarifying this for me :D

> > 2) What's the best way to query these comments? In this use case I don't need to retrieve all the comments in a bucket, just the latest few hundred comments (if there are that many), ordered by the time they were posted.
> >
> > Right now I'm thinking of using link-walking and keeping track of the latest comment, so I can trace backwards along the chain to get the latest 500 comments (for example). When a new comment comes in, it points to the old latest comment, and then the latest-comment marker is updated to the new one.
> >
>
> I wouldn't use link-walking. IIRC this uses MapReduce under the covers. You could use a single key to store the most recent comment.

What's bad about MapReduce?
Since there will be another cache layer on top of the cluster, reads
will be relatively infrequent. That's why I chose link-walking.

Even when you run a MapReduce query over a single bucket, MapReduce has to contact a majority of nodes in the cluster to perform a coverage query. In effect, you're scanning all of the keys to make sure you find only the keys in a single bucket. MapReduce can work for limited scenarios (e.g. mutating the state of a large number of objects or running batched analytics that write to a separate set of buckets/keys) but people have reported unsatisfactory results when trying to use MapReduce for live querying.

This sort of thing may be possible with the Riak Search 2.0 functionality as well. I haven't played around with it enough to know whether it would be a good fit or not.
 

> You can get the most recent n keys using secondary index queries on the $bucket index, sorting, and pagination.
I'm not sure what you mean here =.=
How can I query the most recent n keys using 2i? Should I put a
timestamp ----- say, bucketed by hour ----- in a 2i on incoming comments,
and then at query time hit the 2i hour segment by segment? This seems a
little blind, because some videos can go a long time before being
commented on again. Querying based on time segmentation seems like
shooting in the dark to me :\

"Keys will consist of a timestamp and userID."

Sounds like you could sort on that to me.

The $bucket index is a special index that only contains a list of the keys in a bucket. Querying $bucket is cheaper than a list keys operation. 

There are a number of ways you can solve this problem that are all implementation dependent. 
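For example, one implementation-dependent trick (my assumption, nothing Riak dictates): since paginated index results come back in ascending key order, you could build keys from a zero-padded, inverted timestamp so the first page holds the newest comments.

%% Sketch: inverted millisecond timestamp + userID, so ascending key order
%% (what $bucket pagination returns) is newest-first.
-define(MAX_MILLIS, 99999999999999).  %% comfortably above any current epoch ms

comment_key(UserId) ->
    {Mega, Sec, Micro} = os:timestamp(),
    Millis   = (Mega * 1000000 + Sec) * 1000 + Micro div 1000,
    Inverted = ?MAX_MILLIS - Millis,
    iolist_to_binary(io_lib:format("~14..0B_~s", [Inverted, UserId])).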
 

And the docs say the list-keys operation should not be used in
production, so that's a no-go either :\

A list keys is not a $bucket index query. See "Retrieve all Bucket Keys via $bucket Index" at http://docs.basho.com/riak/latest/dev/using/2i/


> > So in the scenario above, is it possible that after one client has written on node A and updated the latest marker, another client on node B does not yet see the change, points its new comment at the old latest, and thereby creates a "branch" in the chain?
> > If this can happen, what can be done to avoid it? Are there better ways to store and query these comments? Any reply is appreciated.
>
> You can avoid siblings by serializing all of your writes through a single writer. That's not a great idea since you lose many of Riak's benefits.
> You could also use a CRDT with a register type. These tend toward the last writer.

My goal is to form a kind of single-chain relationship through the
keys, based on timestamp, under high concurrent write pressure. Through
this relationship I can easily pick out the last hundreds or thousands
of comments.
As Jeremiah said, serializing all writes through a single writer avoids
siblings entirely. And note that we don't have a key-clashing problem
here ------ every comment holds a unique key. What we want is the
single-chain relationship. So how about this:

The multiple Erlang pb clients just do the writes and don't care about
the chaining.
A post-commit hook notifies one special globally registered process
(which should run inside the Riak cluster?) that "here comes a new
comment, chain it up when appropriate".
Is this feasible? And if it is, how should I prepare for the cluster
partition and rejoin scenario when the network fails?

It sounds to me like you're doing an awful lot of work to do something that a relational database handles remarkably well.
 

> The point is that you need to decide how you want to deal with this type of scenario - it's going to happen. In the worst case, you lose a write briefly.

Hopefully the method above could avoid this :)

Please, everyone, share your thoughts. _(:3JZ)_

B.R.

