Scaling Databases with Sharding
Introduction to Sharding
Sharding is the process of scaling a database by spreading data across multiple servers, or shards. It is the go-to solution for large organizations managing data at petabyte scale. Industry leaders like Uber, Shopify, Slack, and OpenAI all leverage sharding to manage their massive datasets.
In a typical small-scale application, one or more app servers connect to a single, monolithic database. This server stores all persistent data, from user accounts to application state. However, as data grows, this single point of failure and bottleneck must be addressed.
Sharded Architecture
In a sharded setup, we divide the total data into portions, each hosted on a separate database server.
Initially, your application code might try to manage these shards directly—keeping track of which row lives where and maintaining multiple open connections. While manageable with two shards, this approach becomes a maintenance nightmare when dealing with hundreds.
The Proxy Layer
A more robust solution is to use an intermediary proxy. Application servers connect only to this proxy, which then routes queries to the correct shard.
However, proxies introduce their own challenges:
- Throughput Limits: If a proxy reaches its capacity, queries are queued, adding latency.
- Scalability: To handle high volumes, you must deploy multiple proxy servers to prevent them from becoming the bottleneck.
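To make the proxy's role concrete, here is a minimal sketch in Python. The class name, the callable "connections", and the pluggable strategy are all hypothetical illustrations, not a real proxy implementation:

```python
# Sketch of a routing proxy: app servers talk only to the proxy,
# which owns the shard connections and the placement strategy.
class ShardProxy:
    def __init__(self, shards, strategy):
        self.shards = shards        # one connection-like object per shard
        self.strategy = strategy    # maps a shard key -> shard index

    def execute(self, shard_key, query):
        # Route the query to the shard chosen by the strategy.
        shard = self.shards[self.strategy(shard_key)]
        return shard(query)
```

Because the strategy is injected, the same proxy skeleton works for range, hash, or lookup sharding; only the routing function changes.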
Sharding Strategies
The sharding strategy—the rules determining data placement—is critical for performance and balance. This usually involves a shard key: the column(s) used to route data.
1. Range Sharding
Data is routed based on predefined ranges of values. For example, IDs 1-25 might go to Shard A, 26-50 to Shard B, and so on.
[!WARNING] Naive range-based sharding with monotonically increasing IDs often leads to “Hot Shards”. If you insert IDs 1 to 25 sequentially, only the first shard is active while others remain idle.
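A range router can be sketched in a few lines, using the example ranges from above (the shard names and bounds are illustrative):

```python
import bisect

# Inclusive upper bound of each shard's ID range: 1-25 -> A, 26-50 -> B, ...
BOUNDS = [25, 50, 75, 100]
SHARDS = ["A", "B", "C", "D"]

def route_by_range(row_id: int) -> str:
    # bisect_left finds the first range whose upper bound fits the ID.
    idx = bisect.bisect_left(BOUNDS, row_id)
    return SHARDS[idx]
```

With sequential inserts of IDs 1 through 25, every call returns "A" — exactly the hot-shard behavior the warning above describes.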
2. Hash Sharding
The proxy computes a hash of the shard key for each row. Each shard is then responsible for a specific range of hash values.
- Best Practice: Choose a key with high cardinality (e.g., user_id).
- Avoid: Columns like name, where popular values can still create hotspots despite hashing.
- Optimization: Hashing fixed-size integers (user_id) is generally faster than hashing variable-width strings.
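A hash router is equally simple to sketch. MD5 is used here only for determinism and illustration; production proxies typically prefer faster non-cryptographic hashes, and the shard count is a hypothetical value:

```python
import hashlib

NUM_SHARDS = 4

def route_by_hash(user_id: int) -> int:
    # Hash the fixed-width integer key, then map the digest
    # onto one of the shards. Equal keys always land together.
    digest = hashlib.md5(user_id.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

Note that a plain modulo over the hash reshuffles most keys when NUM_SHARDS changes, which is why many systems layer consistent hashing on top of this idea.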
3. Lookup Sharding
A separate mapping table tracks exactly which data belongs on which shard. This offers maximum flexibility but requires an additional lookup for every query.
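The mapping table can be sketched as a simple dictionary (in practice it would live in a replicated metadata store; the keys and shard names here are hypothetical):

```python
# Lookup-router sketch: an explicit table records each key's shard.
lookup_table = {"user_1": "shard_a", "user_2": "shard_b"}

def route_by_lookup(key: str, default: str = "shard_a") -> str:
    # Every query pays this extra lookup before reaching its shard,
    # but rebalancing is just an edit to the table.
    return lookup_table.get(key, default)
```
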
Real-World Case Study: PostgreSQL and ChatGPT
While sharding solves many scale problems, specific database architectures like PostgreSQL’s MVCC (Multiversion Concurrency Control) introduce unique write penalties that companies like OpenAI have had to navigate.
The “Copy-on-Write” Penalty
In Postgres, updates are not performed “in-place.” Updating even one byte results in Write Amplification, where the entire row is copied to create a new version. This strains I/O and leads to Read Amplification, as queries must scan through “dead” versions (old rows) to find live ones.
The “Bloat” Problem
Old row versions (Dead Tuples) don’t disappear instantly, leading to table bloat and increased autovacuum overhead. If writes outpace reclamation, performance collapses. Every update also requires updating all indexes to point to the new physical row location, adding CPU stress.
Strategies from the OpenAI Engineering Team
To ensure services like ChatGPT and their API remain responsive during massive write spikes, several strategies are employed:
- Minimizing Primary Load: Read traffic is offloaded to replicas whenever possible. Queries that must remain on the primary (e.g., those part of write transactions) are strictly optimized for efficiency.
- Selective Migration: Shardable, write-heavy workloads are migrated to systems like Azure CosmosDB.
- Application-Level Optimizations: Redundant writes are eliminated, and “lazy writes” are introduced to smooth out traffic spikes.
- Rate Limiting: Strict limits are enforced during background tasks, such as backfilling table fields, to prevent excessive write pressure.
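The rate-limiting idea in the last bullet is often implemented as a token bucket. The sketch below is a generic illustration, not OpenAI's implementation; the rate and burst parameters are hypothetical and would be tuned per workload:

```python
import time

# Token-bucket sketch for throttling a backfill job's writes.
class TokenBucket:
    def __init__(self, rate: float, burst: float):
        self.rate, self.capacity = rate, burst   # tokens/sec, max tokens
        self.tokens, self.last = burst, time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill tokens based on elapsed time, then spend if possible.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The backfill loop calls allow() before each batch of writes and sleeps when it returns False, so background work never outpaces the configured write budget.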
Optimization & Best Practices
Query Optimization
Avoid “OLTP anti-patterns” that can degrade services:
- Simplify Joins: A query joining 12 tables (as seen in some historical ChatGPT SEVs) can crash a service during a spike. Move complex join logic to the application layer.
- ORM Awareness: Object-Relational Mapping tools can generate inefficient SQL; always review the output.
- Timeout Management: Configure idle_in_transaction_session_timeout to prevent idle transactions from blocking critical processes like autovacuum.
Cross-Shard Penalties
Queries spanning multiple shards add excessive network and CPU overhead. Aim for single-shard queries whenever possible. Additionally, avoid shard keys that change frequently, as moving rows between shards to maintain strategy integrity is expensive.
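The cost of a cross-shard query comes from scatter-gather: the proxy must fan a query out to every shard and merge the results. A minimal sketch, with shards modeled as plain lists:

```python
# Scatter-gather sketch: a query without a shard key must visit
# every shard (one network round trip each) and merge the results.
def scatter_gather(shards, predicate):
    results = []
    for shard in shards:          # N shards -> N round trips
        results.extend(row for row in shard if predicate(row))
    return results
```

A query that includes the shard key touches exactly one shard; this one touches all of them, which is why single-shard access patterns are the goal.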
Infrastructure & Latency
Adding a proxy introduces a network hop, typically adding ~1ms of latency.
- Server Proximity: If proxies and shards are in the same data center, this latency is negligible.
- Proven Success: Slack uses Vitess to manage massive sharded clusters with an average query latency of just 2ms.
High Availability
Replicas aren’t just for reads; they are your safety net. If a primary fails, traffic can be instantly failed over to a replica, preventing hours of downtime.