shard vbucket
See:
- https://github.com/gaiaops/gaia_core_php/blob/master/lib/gaia/shard/vbucket.php
- https://github.com/gaiaops/gaia_core_php/blob/master/examples/shard/vbucket.t
Social networking sites run into the same problem over and over. They generate a lot of traffic. And they don't always generate a lot of revenue. When your site becomes successful, how can you make it scale without spending all of your money on more powerful servers? The goal is to add commodity hardware as load increases and get a linear increase in throughput. This works well for webservers. Just add additional servers to your farm behind a load balancer. But what about the database?
The easiest thing to do when hitting scaling problems on the database is to create silos of information. I have a user database, a forums database, a profile comments database, and so on. This means that each database only has to deal with one domain of information.
This can help a lot. But if one particular domain of information is being hit hard, what to do then? For example, as my site grows, so does the number of rows in my user table. So does my volume of requests to each table. This creates all kinds of bottlenecks. Adding another database server won't help if all of my user information is on one server.
Some sites solve this by adding slave database servers. Writes go to a single master database server and reads go to the redundant slave database servers. As long as all your slaves stay up-to-date, you can insert a new row into the master database, and read it from the slave. If replication lags, this creates all kinds of syncing issues.
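In code, that usually means routing by query type. A minimal sketch, assuming hypothetical connection names 'master' and 'slave1':
// writes always go to the single master ...
$db = Gaia\DB\Connection::instance('master');
$db->query('UPDATE `users` SET `last_login` = %i WHERE `id` = %i', time(), $user_id);

// ... reads go to a slave, which may lag behind the master.
$db = Gaia\DB\Connection::instance('slave1');
$rs = $db->query('SELECT * FROM `users` WHERE `id` = %i', $user_id);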
But adding slave databases only alleviates database reads. All of the writes still must go through a single master. You can set up redundant write masters on some database platforms, but every write must still be replicated to the other masters, and as the dataset grows, that replication overhead grows with it.
Instead of having one single table for all users, why not split the user information across many different tables? Most web applications access data by a single key. Even for something like a list of threads in a discussion forum, there is usually one key on which you can easily split the data up horizontally.
See: http://en.wikipedia.org/wiki/Shard_%28database_architecture%29 and http://en.wikipedia.org/wiki/Partition_%28database%29 for more information on sharding and partitioning data.
Sharding and horizontal partitioning are very effective at making it possible to add more database servers and continue to scale on commodity hardware. But what strategy should be used to determine which user belongs on which shard?
One approach often used is to create a lookup table. Each time you add a new user, you decide which shard the user belongs to, create the mapping in the lookup table, and thereafter just refer to that mapping. But that means your mapping table must hold one row for every user in the system. In addition, I now have to do 2 queries instead of 1 to find out user information: one to look up the mapping, and another to perform the actual query. I can cache the lookup, but even then it is inefficient. It also creates the headache of having to move users from one shard to another to re-balance the load. The lookup-table approach often produces uneven distribution over time that needs to be adjusted, and writing the code to move a user from one shard to another is tricky and error prone.
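A sketch of the lookup-table flow; the `user_shard_map` table and its columns are hypothetical:
// query 1: find out which shard holds this user.
$rs = $db->query('SELECT `shard` FROM `user_shard_map` WHERE `user_id` = %i', $user_id);
$row = $rs->fetch_assoc();

// query 2: the query I actually wanted to run.
$db = Gaia\DB\Connection::instance('server' . $row['shard'] );
$rs = $db->query('SELECT * FROM `users` WHERE `id` = %i', $user_id);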
Another strategy is to hash the user id against the list of existing shards. This eliminates the round trip to the lookup table, since the mapping is deterministic and can always be calculated. But what if you need to add another shard? A new shard in the list changes the hashing, and all the previously hashed ids need to be moved to their new locations. Some people allocate far more shards than they will ever need, to support projected growth and avoid the pain of redistributing data. But this is usually a waste of resources and more difficult to manage, especially for small data sets.
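The pain is easy to see with a plain modulo hash (a sketch of the naive approach, not the library's code):
// naive hashing: map the id directly onto the list of physical shards.
$shards = array('server1', 'server2', 'server3');
$shard  = $shards[ abs( crc32( (string) $user_id ) ) % count( $shards ) ];

// add a fourth server and count($shards) changes from 3 to 4:
// most ids now hash to a different server than the one that
// actually holds their rows.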
The vbucket strategy combines the two. I still hash the user id, but against a fixed pool of virtual buckets (vbuckets), pre-allocating far more buckets than I will ever need physical shards. The number of vbuckets remains constant regardless of how many physical shards I have, which means id x always maps to the same vbucket. From there, it is simply a matter of mapping that vbucket to a physical shard in the database. Since I am unlikely to ever need more than 1000 shards, and a 1000-entry map fits easily in PHP memory, I can cache the entire mapping table in APC and avoid any network round-trips to determine which shard a user belongs to.
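The idea in miniature (this assumes a crc32-based hash for illustration; the actual hash in vbucket.php may differ):
// step 1: id -> vbucket. the bucket count (1000) never changes,
// so a given id always lands in the same vbucket.
$vbucket = abs( crc32( (string) $user_id ) ) % 1000;

// step 2: vbucket -> physical shard, via a small in-memory map.
$shard = $map[ $vbucket ];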
The VBucket class formalizes this mapping and makes it easy to work with. Let's say I want to start out by splitting my data between server1 and server2, and spread the vbuckets evenly over them:
$vb = new Gaia\Shard\VBucket( array('server1', 'server2') );
$shard = $vb->shard( $user_id );
Inside the class is an array with 1000 elements in it, with the shards spread out evenly across ranges of virtual buckets. The numeric keys are the virtual bucket numbers, and the values are the names passed into the constructor: keys 0-499 map to server1, and keys 500-999 map to server2. Usually it is a good idea to keep this lookup table as small as possible so it consumes less memory. That is why I usually use numbers as shard names:
$vb = new Gaia\Shard\VBucket( array(1,2) );
$shard = 'server' . $vb->shard( $user_id );
This produces the same outcome as before, but with less overhead.
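Conceptually, the constructor spreads the shard names evenly across the 1000 slots, something like this sketch (not the library's actual code):
// spread N shard names evenly across 1000 vbucket slots.
$shards = array( 1, 2 );
$per_shard = 1000 / count( $shards );
$map = array();
for( $vbucket = 0; $vbucket < 1000; $vbucket++ ){
    $map[ $vbucket ] = $shards[ (int) floor( $vbucket / $per_shard ) ];
}
// $map[0] ... $map[499] => 1, and $map[500] ... $map[999] => 2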
The $vb object also allows me to export this mapping so I can store it in the database, or in the cache.
foreach( $vb->export() as $vbucket => $shard ){
    $db->query('INSERT INTO `user_vbuckets` (`vbucket`, `shard`) VALUES (%i, %i)', $vbucket, $shard);
}
Later on, when I need to read the values I exported from the database lookup table, I can simply grab all 1000 vbucket -> shard pairs and pass them into the constructor:
$rs = $db->query('SELECT `vbucket`, `shard` FROM `user_vbuckets`');
$list = array();
while( $row = $rs->fetch_assoc() ){
    $list[ $row['vbucket'] ] = $row['shard'];
}
$vb = new Gaia\Shard\VBucket( $list );
I can cache the entire VBucket object in APC or memcache, and use it until I need to refresh it from the database.
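For example, with the legacy APC extension (the cache key and TTL here are arbitrary choices):
// try the cache first; rebuild from the database only on a miss.
$vb = apc_fetch('user_vbucket_map');
if( $vb === FALSE ){
    $rs = $db->query('SELECT `vbucket`, `shard` FROM `user_vbuckets`');
    $list = array();
    while( $row = $rs->fetch_assoc() ){
        $list[ $row['vbucket'] ] = $row['shard'];
    }
    $vb = new Gaia\Shard\VBucket( $list );
    apc_store('user_vbucket_map', $vb, 300);
}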
After a while, I may decide that I need 4 shards instead of just two. So, I set up two new database servers as slaves of my original two servers. Then, I go to my lookup table and update the mappings: vbuckets 0-249 stay on server1, vbuckets 250-499 go to server3, vbuckets 500-749 stay on server2, and vbuckets 750-999 go to server4. Now traffic will be split up evenly across all 4 database servers.
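The remapping itself is just two updates to the lookup table (ranges per the quarter split above; $db is the lookup-table connection):
// move the top half of server1's range to server3,
// and the top half of server2's range to server4.
$db->query('UPDATE `user_vbuckets` SET `shard` = %i WHERE `vbucket` BETWEEN %i AND %i', 3, 250, 499);
$db->query('UPDATE `user_vbuckets` SET `shard` = %i WHERE `vbucket` BETWEEN %i AND %i', 4, 750, 999);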
Once I disconnect replication, I can easily write a cron job that crawls through the shards and removes the rows that are no longer used. On each shard, I would crawl the tables with queries like the following:
foreach( range(1,4) as $shard ){
    $db = Gaia\DB\Connection::instance('server' . $shard );
    $rs = $db->query('SELECT `id` FROM `users`');
    while( $row = $rs->fetch_assoc() ){
        // rows that still map to this shard stay; everything else gets deleted.
        if( $vb->shard( $row['id'] ) == $shard ) continue;
        $db->query('DELETE FROM `users` WHERE `id` = %i', $row['id'] );
    }
}
This would remove any data I no longer need. It is a simplistic example; there are lots of things you could do to speed it up, like queueing up the deletes and removing rows 100 at a time. You could also batch through ranges of user ids to make the select more efficient and limit the number of rows pulled down, as sketched below. This is just to illustrate the logic of how to clean up your data. The whole operation can happen without any downtime, and I can run the cleanup leisurely during off-peak hours to lessen the load.
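One way to batch, replacing the select/delete body of the per-shard loop above (the range size and id bound are arbitrary):
// walk the id space in fixed-size ranges instead of selecting every row at once.
$step = 10000;
$max_id = 1000000; // hypothetical upper bound; in practice, SELECT MAX(`id`) per shard.
for( $start = 1; $start <= $max_id; $start += $step ){
    $rs = $db->query('SELECT `id` FROM `users` WHERE `id` BETWEEN %i AND %i', $start, $start + $step - 1);
    $ids = array();
    while( $row = $rs->fetch_assoc() ){
        if( $vb->shard( $row['id'] ) != $shard ) $ids[] = (int) $row['id'];
    }
    // delete all misplaced rows in this range with a single statement.
    // the ids are cast to int above, so inlining them here is safe.
    if( $ids ) $db->query('DELETE FROM `users` WHERE `id` IN (' . implode(',', $ids) . ')');
}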