What is Database Sharding?

Introduction

When a critical business asset such as an application, website or platform begins to experience heavy usage, it can become sluggish as the supporting database becomes overwhelmed. To keep the database running smoothly, it has to be scaled up.

One of the best ways to achieve this is through an approach known as database sharding, which has been enjoying a lot of use in recent years. Let's look at what database sharding is, the working, as well as the strengths that are attracting so many developers and organizations to this database scaling technique.

What is database sharding?

Database sharding is a technique used to scale databases by breaking data in the original database into smaller portions which are then stored on multiple machines or database servers. Since all database servers usually carry the same foundational architecture, the shards can be processed together without any detriment to quality.

Sharding can be used with any type of database, but it is especially useful for databases that contain large amounts of data or that are accessed by many clients. In a sharded system, each server contains a portion of the total data and is responsible for servicing requests from clients that are associated with that portion of the data.

Benefits of database sharding

The foremost advantage of database sharding is that it eliminates the issue of degraded application performance as usage increases. As an app grows, the number of users and the data it contains both grow. The database slows down as the volume of data and the number of concurrent users accessing the application increases. If this problem is left to persist, it'll eventually compromise customer experience. Sharding fixes this predicament by allowing parallel processing of smaller data sets.

With the growth of data-intensive businesses and popular applications, database sharding has emerged as a go-to solution for improving scalability and performance.

Here are the rest of the benefits that organizations can enjoy by embracing sharding.

1. Increased performance

Database sharding helps to reduce latency by spreading out workloads across multiple servers. For example, if each application server had its own database instance instead of being part of one huge DBMS cluster, workloads would be expedited since individual app servers don’t have to compete for resources.

2. Improved security

Segmenting a database helps to maintain reasonable levels of access control and security. Instead of having all sensitive data stored under one database server, you can segment it into multiple shards that deliver targeted access control in order to protect sensitive information from malicious attacks.

3. Scalability

Database sharding is an ideal solution for businesses experiencing rapid growth or sudden spikes in traffic. By simply adding more shards, your system will be able to scale flexibly with changing needs.

Sharding also provides an excellent platform for future scalability as businesses and their datasets grow larger over time. As individual shards become more populated, they can be easily scaled up - or even split off - without any substantial disruption or downtime to the services consuming them.

4. Regional data distribution

Organizations can use database sharding to distribute large amounts of data across different regions in order to meet user requirements by increasing the speed of complex queries. Likewise, it allows applications to easily switch between databases based on the most relevant geographic region or usage pattern instead of relying on rigidly defined paths with complicated logic around them.

5. Eliminates total outage

In the past, when the application’s database computer failed, the application would also be down. But with database sharding, data is spread across multiple hosts so that if one fails, the application won't go offline.

6. Shorter response times

Over time, single databases typically take longer to access since the system requires looking through the entire big database. It takes a while to bring results based on user queries. However, when using sharding, there is less data to go through, so it drastically reduces the length of the search and effectively reduces the response time.

How database sharding works

Basically, a dataset is broken up into smaller datasets, or 'shards'. Each shard has its own unique data, which is stored on separate computers or nodes. Even though these shards are stored and run separately from the nodes, they still share the same schema of the original database.

Developers use a special type of key known as a shard key to establish how to shard the dataset. In essence, the shard key functions as a kind of guidebook defining where each piece of data falls and how they are linked to one another.

Each shard works on its own based on the shared-nothing architecture - meaning it is unaware of the other shards and functions autonomously. Requests for data only trigger an individual shard to process the information that is stored in it.

When to shard database

To shard or not to shard depends on the size and complexity of the database. But generally speaking, you should do sharding when an application’s dataset becomes too large or too complex for a single database to handle.

Here are some general guidelines to help you decide if it’s time to do sharding.

When the database has grown too large and needs to be split into multiple databases to improve performance.
When your database is experiencing contention issues caused by too many requests from simultaneous users, which leads to challenges such as timeouts.

Please note that sharding is not a substitute for proper database design, however. Even as you shard, the datasets should still make sense for how they will be used.

Database sharding methods

There are different sharding methods that apply distinct rules to the shard key in order to decide the correct node for each set of data.

The common methods include range based sharding, directory based sharding, hashed sharding and geo sharding.

1. Range based sharding

Range-based sharding, also known as dynamic sharding, splits databases according to a range with similar values. The database team will then assign a shard key for every range. For example, imagine dividing a product database according to size ranges or a customer database according to a range based on their age. Each range forms a shard which is then stored on a separate database server.

2. Directory based sharding

The shards are organized into a tree structure. The root of the tree is called the "directory", and each branch of the tree corresponds to a shard. When a client requests data from the database, the directory determines which branch of the tree to query in order to find the relevant shards. A lookup table is employed to associate the database data with its respective shard.

Directory-based sharding is typically used when there is a large number of shards involved. One disadvantage of directory-based sharding is that it can be more complicated to set up and manage than other types of sharding.

3. Hashed sharding

Hashed sharding uses hashes to divide the data into shards. Each shard is given a unique hash value. The data is then divided into chunks based on these hash values. The hash value is used to determine which shard will contain which value. This allows the data to be distributed evenly.

4. Geo sharding

Geo sharding breaks and stores information according to geographical location. For example, a social networking platform might use different shards to save users according to country, city, etc. So when a user is located in, say, Los Angeles, their requests would be routed to the server that stores the shard for Los Angeles. This can improve performance because it reduces network latency between the client and the server.

Challenges of database sharding

The benefits of sharding are certainly great for any business. But this technique also comes with a couple of challenges that you need to be aware of.

1. Querying issues

Database sharding adds querying complications since the querying process may often need to go through multiple shards to obtain the info needed, and then piece together the info from those different shards to get the requested result. This happens when the data required for a specific query is spread across multiple shards.

Fortunately, this can be resolved by using automated databases which will make managing and querying operations much simpler.

2. Implementation challenges

Many DBMS lack built-in features to enable sharding. As a result, database teams and developers will have to shard the databases manually, which can get quite overwhelming. In order to properly use sharding, an application developer must modify their code in order to split the queries and spread them across multiple shards. This means that more work is required on the development side before the shards can be utilized.

You can overcome this challenge by using a graph database system like NebulaGraph that supports automated horizontal scaling.

3. Infrastructure expenses

If you are going to use physical shards, you can end up incurring higher costs to pay for the infrastructure. The more machines you add, the more you will spend.

You can overcome this challenge by sharding on the cloud and taking advantage of the virtual infrastructure. The NebulaGraph Cloud helps users conveniently deploy NebulaGraph clusters without having to take care of infrastructure.

4. Unbalanced distribution

When high volumes of requests or queries are sent to certain shards, the performance can suffer due to a bottleneck. Uneven data distribution is usually the cause of unbalanced shards, particularly in physical shards. For example, if there is a sudden surge in demand for a product whose data is stored on one shard, that shard will require more resources.

To avoid this issue, optimizing the sharding process is key so that datasets likely to experience heavy usage receive priority.

5. Building back is complex

Once a database has been sharded, it’s tricky to rebuild it back to its previous unshaded status. By the time you want to rebuild, new data will have been written to the shards. This means the backups of the original database will have to be merged with the new data from the shards. This can consume a lot of developer time, not forgetting the added costs if you have to go for more developers and invest in new tools.

Getting the most out of database sharding: Beware of these scenarios

Many companies battle with ensuring that all shards are operating at the optimal level and that some shards are not underutilized while others are maxed. This slows down the access process, thus weakening the very main purpose of sharding.

Watch out for these scenarios and avoid them by taking the time to plan the sharding carefully especially when choosing the shard keys.

1. Attribute frequency

This refers to the amount of data with identical attributes that is stored in one shard as compared to others. For example, if the data team opts for 'color' as their shard key for a women's fashion website, and one color proves to be more popular than the rest, then it may cause data hotspots. Thus, it is essential to take frequency into account when selecting a shard key. Try as much as possible to choose a shard key that is unlikely to lead to an overloading of records in one shard while leaving the others underutilized.

2. Number of shards

The number of shards that can be made from the database is also dependent on the chosen shard key value. For instance, if the data team chooses YES/NO or HIGH/LOW as the shard key, then the total number of shards will only be two. Now, depending on your data volume and diversity, two shards could just be the perfect number or turn out to be too few and surprise you when the workload suddenly shoots up.

3. Changes over time

As the database increases, the shard key should not change so considerably that a huge amount of one data type goes to one shard only. Let's say the shard key is based on FINISHED or UNFINISHED orders. Over time as you streamline your processes, the shard key for finished orders will increase steadily, which means the dataset for finished orders will outstrip its shard.

Are there alternatives to database sharding?

Yes. Database sharding is basically one among several techniques that are used to scale databases. It’s based on the horizontal scaling concept, where additional computers are added to share an application’s workload as usage increases. By doing this, organizations can benefit from the advantage of fault tolerant architecture, meaning whenever one database fails, the application is not entirely disrupted.

But besides sharding, there are more techniques that organizations can explore to scale their databases. Here are some of the more popular ones:

1. Partitioning

Partitioning involves splitting up the data in a database into smaller, more manageable parts. This can be done in a number of ways, but the two common approaches are vertical and horizontal partitioning. Vertical partitioning uses columns to partition the database while horizontal splitting partitions the database by the rows.

For example, let's say you have a table that stores customer information and there is a column in there that arranges the customers according to countries. You could partition the table vertically by dividing it into different parts based on the customer's country of residence. This would create a separate table for each country, and would make it easier to manage and query the data.

2. Replication

Database replication makes copies of a database and keeps the copies in different machines. So when data is written to the primary database, it is also written to the secondary databases. This provides high availability because whenever the active database server fails, the next machine with the exact database copy swings into action.

Some organizations use a combination of replication and sharding depending on the type of data they are handling as well as customer needs.

4. Vertical scaling

Here, scaling is done by increasing the capacity of one database server instead of adding more servers as in the case of horizontal scaling. More CPU, memory and disk capacity are added to a single server in order to handle increased loads.

The advantage of vertical scaling is that it's simpler and faster to implement than database sharding. The disadvantage is that once you reach the limits of a single server, you’ll need to add another server in order to continue scaling.

How can NebulaGraph help?

Organizations of all sizes are using NebulaGraph to build and manage modern databases for their applications. NebulaGraph offers high availability at scale with tools that support horizontal scaling. With NebulaGraph, you can build a distributed database that is resistant to failure and can handle large workloads. You can also create databases that are easy to manage and query with a graphical user interface. Simply start by importing data into NebulaGraph.

Conclusion

Modern businesses are driven by technology which requires a lot of data to power their platforms. It’s no wonder that big data is projected to drive business growth at a speed that may not have been possible before the age of data. This data is stored in machines that have limits on data storage, processing, and serving users. When this limit is reached as a result of growing use, then scaling becomes inevitable.

The first step is to choose the scaling technique that will work best for your scenario. Database sharding is one such technique. And as we've just seen, it's a popular technology that many organizations are adopting. But be sure to do it in a logical way that will minimize latency and maximize performance. Otherwise, you may end up with even worse performance than before. Keep an eye on how you select the shard keys because this will have a huge impact on how the shards perform over time. The goal is to attain the kind of balance that ensures the workload is well balanced across all shards.

If you don't feel like sharding is the right fit for your use case, you can always look into the other alternatives and pick whichever one best meets your current needs.

See How NebulaGraph Database Shared-Nothing Structure Works

An Introduction to NebulaGraph's Storage Engine