Selecting a MongoDB Shard Key

Scalability is an important non-functional requirement for systems using MongoDB as a back-end component.  A system has to be able to scale when a single server running MongoDB cannot handle a large dataset size and/or high level of data processing. In general, there are two standard approaches addressing scalability:

  • Vertical scaling
  • Horizontal scaling

Vertical scaling is achieved by adding more computing resources (CPU, memory, storage) to a single machine. This is considered to be an expensive option. At the same time computing resources on a single machine have physical limitations.

Horizontal scaling achieves scalability by horizontally extending the system by adding commodity machines in order to distribute data processing across all these machines. This option is considered to be less expensive and in most cases meets computing needs for physical resources without limitations. MongoDB supports horizontal scaling by implementing sharding (data partitioning) across all machines (shards) clustered in a MongoDB cluster.

The success of the MongoDB sharding depends on a selected shard key that is used to partition data across multiple shards. MongoDB distributes data by a shard key at the collection level.

You should carefully amalyze all options before selecting the shard key since it can significantly affect your system performance and it cannot be changed after data is inserted in MongoDB.  Shard keys cannot be arrays and you cannot shard on a geospatial index. When selecting a shard key, you should also keep in mind that its values cannot be updated. However if you still have to change the value, you will have to remove the document first, change the key value, and reinsert it.

MongoDB divides data into chunks based on values of a shard key and distribute them evenly across the shards.

Sharding

Usually you will not shard all collections but only collections that need data to be distributed over shards to improve read and/or write performance. All un-sharded collections will be held in only one shard that is called primary shard (e.g., Shard A in the picture above). The primary shard can also contain sharded collections.

MongoDB supports three types of sharding:

  • Range-based sharding
  • Hash-based sharding
  • Tag-aware sharding

With the range-based sharding MongoDB divides datasets into ranges determined by the shard key values. With the hash-based sharding MongoDB creates chunks via hash values it computes from the field’s values of the shard key. In general, range-based sharding provides better support for range queries that need query isolation while the hash-based sharding supports write operations more efficiently.

With tag-aware sharding users associate shard key values with specific shards. This type of sharding is usually used to optimize physical locations of documents for location-based applications.

In order to properly select a shard key for your MongoDB sharded cluster, it is important to understand how your application reads and writes data. Actually the main question is

        What is more critical, query isolation, or write scaling, or both?

For the query isolation an ideal situation is when the queries are routed to a single shard or a small subset of shards. In order to select an optimal shard key for query isolation you must take into consideration the following:

  • Analyze what query operations are most performance dependent;
  • Determine which fields are used the most in these operations and include them in the shard key;
  • Make sure that the selected shard key enable even (balanced) distribution of data across shards;
  • A high cardinality field is preferable. Low cardinality fields tend to group documents on a small number of shards what would require frequent rebalancing of the chunks.

MongoDB query router (mongos) will route queries to a shard or subset of shards only when a shard key or a prefix of the shard key is used in the query. Otherwise mongos will route the query to all shards. Also all sharded collections must have an index that starts with a shard key. All documents having the same value for the shard key will reside on the same shard.

For an efficient write scaling, choose a shard key that has both high cardinality and enables even distribution of write operations across the shards.

You should keep in mind that whatever shard key you choose it should be easily divisible to enable even distribution of data across shards when data grows. Shard keys that have a limited number of possible values can result in chunks that are “unsplittable.”.

The most common techniques people use to distribute data are:

  • Ascending key distribution – The shard key field is usually of Date, Timestamp or Objectld type. With this pattern all writes are routed to one shard which MongoDB will keep splitting and spending lots of time migrating data between shards to keep data distribution relatively balanced across the shards. This pattern is not definitely good for the write scaling.
  • Random distribution – This pattern is achieved by fields that do not have an identifiable pattern in the dataset. For example, these fields include usernames, UUIDs, email addresses, or any field which value has a high level of randomness. This is a preferable pattern for write scaling since it enables balanced distribution of write operations and data across the shards. However this pattern does not work well for the query  isolation if the critical queries must retrieve large amount of “close” data based on range criteria  in which case the query will be spread across the most of the shards in the cluster.
  • Location-based distribution – The idea around the location-based data distribution pattern is that the documents with some location-related similarity will fall into the same  range. The location related field could be postal address, IP, postal code, latitude and longitude, etc.
  • Compound Shard Key – Combine more than one field into a shard key in order to come up with optimal shard key  values for high cardinality and balanced distribution of data for an efficient  write scaling and query isolation.
  • Data modeling  to the rescue – Design a data model to include a field that will be exclusively used to enable balanced distribution of data with good support for write scaling and query isolation. First analyze your application read and write operations to get a full understanding of its writing and data retrieval patterns.

The table below lists key considerations for a shard key selection regarding the query isolation and write scaling requirements.

Query isolation importance
Write scaling importance
Shard Key Selection
high
low
  • Range shard key
  • If the selected key does not provide relatively even distribution of data you can either
  • use a compound shard key (containing more than one document filed); or
  • add a special purpose field to your data model that will be used as a shard key. This is an example when data modeling comes to the rescue; or
  • for location-based applications you can manually associate specific ranges of a shard key with a specific shard or subset of shards.
low
high
  • Hashed shard key with high cardinality that will efficiently distribute write operations across the shards.
  • Having a high cardinality does not guarantee an appropriate write scaling all the time. The ascending key distribution is a good example. Write operations that require a high level of scaling should be carefully analyzed to find the best field candidate for the shard key.
  • If a selected key does not provide relatively even distribution of data you can add a special purpose field to your data model that will be used as a shard key.
high
high
  • A shard key enabling mid-high randomness and relatively even distribution of data.  A compound shard keys are usually good candidates.
  • Since an ideal shard key is almost impossible in this case, determine what shard key has the least performance affect on the most critical use cases for both query isolation and write scaling.
  • Data modeling can also help with embedding, referencing and hybrid model options to consider for improving  performance.
  • If a selected key does not provide relatively even distribution of data you can add a special purpose field to your data model that will be used as a shard key.
Advertisements