8 Sharding And Consistent Hashing

Sharding

The process of splitting a dataset into partitions.

  • lends itself data parallelism

Hashing

Uniform hash functions map inputs uniformly across the hash output space.

Naive approach:

H(key) % number_of_partitions

Lots of costly shard migration when server crash or added!

To solve this:

Consistent Hashing

  1. The hash output space is viewed as a “ring”
  2. Servers are allocated to some location on the hash output space ring
  3. The server is responsible for a piece of data is the next server clockwise from H(key)

However, its possible for shards to be unbalanced

  • decrease granularity of the segments
  • having each physical server represented multiple times on the hash output ring as a token