<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Distributed_systems on Vivek's Field Notes</title><link>https://heyyviv.github.io/tags/distributed_systems/</link><description>Recent content in Distributed_systems on Vivek's Field Notes</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 28 Feb 2026 20:16:56 +0530</lastBuildDate><atom:link href="https://heyyviv.github.io/tags/distributed_systems/index.xml" rel="self" type="application/rss+xml"/><item><title>Scaling Databases with Sharding</title><link>https://heyyviv.github.io/blog/scaling-databases-with-sharding/</link><pubDate>Sat, 28 Feb 2026 20:16:56 +0530</pubDate><guid>https://heyyviv.github.io/blog/scaling-databases-with-sharding/</guid><description>&lt;h2 id="introduction-to-sharding">Introduction to Sharding&lt;/h2>
&lt;p>Sharding is the process of scaling a database by spreading data across multiple servers, or &lt;strong>shards&lt;/strong>. It is the go-to solution for large organizations managing data at a petabyte scale. Industry leaders like Uber, Shopify, Slack, and OpenAI all leverage sharding to manage their massive datasets.&lt;/p>
&lt;p>In a typical small-scale application, one or more app servers connect to a single, monolithic database. This server stores all persistent data, from user accounts to application state. However, as data grows, this single point of failure and bottleneck must be addressed.&lt;/p>
&lt;h2 id="sharded-architecture">Sharded Architecture&lt;/h2>
&lt;p>In a sharded setup, we divide the total data into portions, each hosted on a separate database server.&lt;/p>
&lt;p>Initially, your application code might try to manage these shards directly—keeping track of which row lives where and maintaining multiple open connections. While manageable with two shards, this approach becomes a maintenance nightmare when dealing with hundreds.&lt;/p>
&lt;h3 id="the-proxy-layer">The Proxy Layer&lt;/h3>
&lt;p>A more robust solution is to use an &lt;strong>intermediary proxy&lt;/strong>. Application servers connect only to this proxy, which then routes queries to the correct shard.&lt;/p>
&lt;p>However, proxies introduce their own challenges:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Throughput Limits:&lt;/strong> If a proxy reaches its capacity, queries are queued, adding latency.&lt;/li>
&lt;li>&lt;strong>Scalability:&lt;/strong> To handle high volumes, you must deploy multiple proxy servers to prevent them from becoming the bottleneck.&lt;/li>
&lt;/ul>
&lt;h2 id="sharding-strategies">Sharding Strategies&lt;/h2>
&lt;p>The sharding strategy—the rules determining data placement—is critical for performance and balance. This usually involves a &lt;strong>shard key&lt;/strong>: the column(s) used to route data.&lt;/p>
&lt;h3 id="1-range-sharding">1. Range Sharding&lt;/h3>
&lt;p>Data is routed based on predefined ranges of values. For example, IDs 1-25 might go to Shard A, 26-50 to Shard B, and so on.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Warning:&lt;/strong>
Naive range-based sharding with monotonically increasing IDs often leads to &lt;strong>&amp;ldquo;hot shards&amp;rdquo;&lt;/strong>. If you insert IDs 1 to 25 sequentially, only the first shard is active while the others remain idle.&lt;/p>
&lt;/blockquote>
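&lt;p>As a rough sketch (with hypothetical shard boundaries), range routing reduces to a search over sorted boundaries:&lt;/p>

```python
# Hypothetical sketch of range sharding: route a row to a shard by its ID,
# using bisect over the sorted upper bounds of each range.
import bisect

# IDs 1-25 -> shard 0, 26-50 -> shard 1, 51-75 -> shard 2, 76-100 -> shard 3
RANGE_UPPER_BOUNDS = [25, 50, 75, 100]

def route_by_range(row_id):
    """Return the index of the shard whose range contains row_id."""
    return bisect.bisect_left(RANGE_UPPER_BOUNDS, row_id)
```

&lt;p>With sorted boundaries the lookup stays logarithmic even with hundreds of shards, which is why proxies can afford to do it on every query.&lt;/p>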
&lt;h3 id="2-hash-sharding">2. Hash Sharding&lt;/h3>
&lt;p>The proxy computes a hash of the shard key for each row (a fast, non-cryptographic hash function is typically sufficient). Each shard is then responsible for a specific range of hash values.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Best Practice:&lt;/strong> Choose a key with &lt;strong>high cardinality&lt;/strong> (e.g., &lt;code>user_id&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Avoid:&lt;/strong> Columns like &lt;code>name&lt;/code>, where popular values can still create hotspots despite hashing.&lt;/li>
&lt;li>&lt;strong>Optimization:&lt;/strong> Hashing fixed-size integers (&lt;code>user_id&lt;/code>) is generally faster than hashing variable-width strings.&lt;/li>
&lt;/ul>
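&lt;p>A minimal sketch of hash routing. SHA-256 is used here only as a stable, well-spread stand-in; real routers often prefer faster non-cryptographic hashes:&lt;/p>

```python
# Hypothetical sketch of hash sharding: hash the shard key and reduce the
# digest modulo the shard count to pick a shard.
import hashlib

NUM_SHARDS = 4

def route_by_hash(user_id):
    digest = hashlib.sha256(str(user_id).encode()).digest()
    # take the first 8 bytes of the digest and reduce modulo the shard count
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

&lt;p>Because a good hash spreads keys uniformly, sequential IDs no longer pile onto one shard the way they do with naive range sharding.&lt;/p>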
&lt;h3 id="3-lookup-sharding">3. Lookup Sharding&lt;/h3>
&lt;p>A separate mapping table tracks exactly which data belongs on which shard. This offers maximum flexibility but requires an additional lookup for every query.&lt;/p>
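&lt;p>A toy sketch of lookup sharding. The least-loaded placement policy is one hypothetical choice; the point is that the mapping table, not a formula, decides placement:&lt;/p>

```python
# Hypothetical sketch of lookup sharding: a mapping table records exactly
# which shard owns each key, at the cost of an extra lookup per query.
from collections import Counter

NUM_SHARDS = 3
shard_map = {}                                   # the lookup table: key -> shard
load = Counter({s: 0 for s in range(NUM_SHARDS)})

def route_by_lookup(key):
    if key not in shard_map:
        # flexible placement: put new keys on the least-loaded shard
        shard = min(load, key=load.get)
        shard_map[key] = shard
        load[shard] += 1
    return shard_map[key]
```

&lt;p>The flexibility comes from being able to move a key by editing one table entry, but every query now pays for the indirection.&lt;/p>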
&lt;hr>
&lt;h2 id="real-world-case-study-postgresql-and-chatgpt">Real-World Case Study: PostgreSQL and ChatGPT&lt;/h2>
&lt;p>While sharding solves many scale problems, specific database architectures like PostgreSQL&amp;rsquo;s &lt;strong>MVCC (Multiversion Concurrency Control)&lt;/strong> introduce unique write penalties that companies like OpenAI have had to navigate.&lt;/p>
&lt;h3 id="the-copy-on-write-penalty">The &amp;ldquo;Copy-on-Write&amp;rdquo; Penalty&lt;/h3>
&lt;p>In Postgres, updates are not performed &amp;ldquo;in-place.&amp;rdquo; Updating even one byte results in &lt;strong>Write Amplification&lt;/strong>, where the entire row is copied to create a new version. This strains I/O and leads to &lt;strong>Read Amplification&lt;/strong>, as queries must scan through &amp;ldquo;dead&amp;rdquo; versions (old rows) to find live ones.&lt;/p>
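&lt;p>A toy model (not actual Postgres internals) makes the two amplification effects concrete: every update appends a full copy of the row, and reads must skip the dead versions:&lt;/p>

```python
# Toy model of MVCC copy-on-write: an update never mutates a row in place;
# it marks the old version dead and appends a full copy with the change.
table = []   # list of row versions, oldest first

def insert(row_id, data):
    table.append({"id": row_id, "data": dict(data), "live": True})

def update(row_id, changes):
    # Write amplification: a one-field change still copies the whole row.
    for row in reversed(table):
        if row["id"] == row_id and row["live"]:
            row["live"] = False                        # old version becomes a dead tuple
            table.append({"id": row_id, "data": dict(row["data"], **changes), "live": True})
            return

def read(row_id):
    # Read amplification: scan past dead versions to find the live one.
    for row in reversed(table):
        if row["id"] == row_id and row["live"]:
            return row["data"]

insert(1, {"url": "cat-video", "plays": 0})
update(1, {"plays": 1})
update(1, {"plays": 2})
dead_tuples = sum(1 for r in table if not r["live"])
```

&lt;p>After two updates the table holds three versions of one logical row; the two dead tuples linger until something like &lt;code>autovacuum&lt;/code> reclaims them.&lt;/p>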
&lt;h3 id="the-bloat-problem">The &amp;ldquo;Bloat&amp;rdquo; Problem&lt;/h3>
&lt;p>Old row versions (Dead Tuples) don&amp;rsquo;t disappear instantly, leading to table bloat and increased &lt;code>autovacuum&lt;/code> overhead. If writes outpace reclamation, performance collapses. Every update also requires updating all indexes to point to the new physical row location, adding CPU stress.&lt;/p>
&lt;h3 id="strategies-from-the-openai-engineering-team">Strategies from the OpenAI Engineering Team&lt;/h3>
&lt;p>To ensure services like ChatGPT and their API remain responsive during massive write spikes, several strategies are employed:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Minimizing Primary Load:&lt;/strong> Read traffic is offloaded to replicas whenever possible. Queries that must remain on the primary (e.g., those part of write transactions) are strictly optimized for efficiency.&lt;/li>
&lt;li>&lt;strong>Selective Migration:&lt;/strong> Shardable, write-heavy workloads are migrated to systems like &lt;strong>Azure CosmosDB&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Application-Level Optimizations:&lt;/strong> Redundant writes are eliminated, and &amp;ldquo;lazy writes&amp;rdquo; are introduced to smooth out traffic spikes.&lt;/li>
&lt;li>&lt;strong>Rate Limiting:&lt;/strong> Strict limits are enforced during background tasks, such as backfilling table fields, to prevent excessive write pressure.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="optimization--best-practices">Optimization &amp;amp; Best Practices&lt;/h2>
&lt;h3 id="query-optimization">Query Optimization&lt;/h3>
&lt;p>Avoid &amp;ldquo;OLTP anti-patterns&amp;rdquo; that can degrade services:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplify Joins:&lt;/strong> A query joining 12 tables (as seen in some historical ChatGPT severe incidents, or &amp;ldquo;SEVs&amp;rdquo;) can crash a service during a spike. Move complex join logic to the application layer.&lt;/li>
&lt;li>&lt;strong>ORM Awareness:&lt;/strong> Object-Relational Mapping tools can generate inefficient SQL; always review the output.&lt;/li>
&lt;li>&lt;strong>Timeout Management:&lt;/strong> Configure &lt;code>idle_in_transaction_session_timeout&lt;/code> to prevent idle queries from blocking critical processes like autovacuum.&lt;/li>
&lt;/ul>
&lt;h3 id="cross-shard-penalties">Cross-Shard Penalties&lt;/h3>
&lt;p>Queries spanning multiple shards add excessive network and CPU overhead. Aim for single-shard queries whenever possible. Additionally, avoid shard keys that change frequently, as moving rows between shards to maintain strategy integrity is expensive.&lt;/p>
&lt;h2 id="infrastructure--latency">Infrastructure &amp;amp; Latency&lt;/h2>
&lt;p>Adding a proxy introduces a network hop, typically adding ~1ms of latency.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Server Proximity:&lt;/strong> If proxies and shards are in the same data center, this latency is negligible.&lt;/li>
&lt;li>&lt;strong>Proven Success:&lt;/strong> Slack uses Vitess to manage massive sharded clusters with an average query latency of just &lt;strong>2ms&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="high-availability">High Availability&lt;/h2>
&lt;p>Replicas aren&amp;rsquo;t just for reads; they are your safety net. If a primary fails, traffic can be instantly failed over to a replica, preventing hours of downtime.&lt;/p></description></item><item><title>Storage and Retrieval</title><link>https://heyyviv.github.io/blog/storage-and-retrival/</link><pubDate>Wed, 25 Feb 2026 23:32:36 +0530</pubDate><guid>https://heyyviv.github.io/blog/storage-and-retrival/</guid><description>&lt;p>In particular, there is a big difference between storage engines that are optimized for
transactional workloads and those that are optimized for analytics.&lt;/p>
&lt;p>An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.&lt;/p>
&lt;h2 id="hash-index">Hash Index&lt;/h2>
&lt;p>Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.&lt;/p>
&lt;p>This may sound simplistic, but it is a viable approach. In fact, this is essentially what Bitcask (the default storage engine in Riak) does [3]. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept completely in memory. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek. If that part of the data file is already in the filesystem cache, a read doesn’t require any disk I/O at all.&lt;/p>
&lt;p>A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.&lt;/p>
&lt;p>To avoid eventually running out of disk space, the log is broken into segments of a certain size; compaction, which throws away duplicate keys in a segment and keeps only the most recent update for each key, is performed on closed segments. Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction, as shown in Figure 3-3. Segments are never modified after they have been written, so the merged segment is written to a new file. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.&lt;/p>
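&lt;p>A minimal sketch of the idea, with newline-delimited &lt;code>key,value&lt;/code> records as a simplification (real Bitcask uses binary records with checksums):&lt;/p>

```python
# Hypothetical minimal sketch of a Bitcask-style store: an append-only
# log file plus an in-memory hash map from key to byte offset.
import os
import tempfile

class HashIndexStore:
    def __init__(self, path):
        self.index = {}                  # key -> byte offset of the latest record
        self.f = open(path, "a+b")

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset         # point the index at the new record

    def get(self, key):
        # One seek per lookup; stale records earlier in the file are simply skipped.
        self.f.seek(self.index[key])
        record = self.f.readline().decode().rstrip("\n")
        return record.partition(",")[2]

path = os.path.join(tempfile.mkdtemp(), "segment.log")
store = HashIndexStore(path)
store.put("kitty00001", "10")
store.put("kitty00001", "11")   # the old record stays in the file as a dead entry
```

&lt;p>Compaction in this model would just rewrite the latest record for each key in the index into a fresh segment file and swap the offsets over.&lt;/p>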
&lt;p>Each segment now has its own in-memory hash table, mapping keys to file offsets. In order to find the value for a key, we first check the most recent segment’s hash map; if the key is not present we check the second-most-recent segment, and so on. The merging process keeps the number of segments small, so lookups don’t need to check many hash maps.&lt;/p>
&lt;p>Lots of detail goes into making this simple idea work in practice. Briefly, some of the issues that are important in a real implementation are:&lt;/p>
&lt;ul>
&lt;li>The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [5].&lt;/li>
&lt;li>Range queries are not efficient. For example, you cannot easily scan over all keys between kitty00000 and kitty99999—you’d have to look up each key individually in the hash maps.&lt;/li>
&lt;/ul></description></item><item><title>Docker &amp; kubernetes</title><link>https://heyyviv.github.io/blog/docker-kubernetes/</link><pubDate>Fri, 14 Nov 2025 12:38:31 +0530</pubDate><guid>https://heyyviv.github.io/blog/docker-kubernetes/</guid><description>&lt;h1 id="docker">Docker&lt;/h1>
&lt;p>Open Source&lt;/p></description></item><item><title>Mapper Reducer</title><link>https://heyyviv.github.io/blog/mapper-reducer/</link><pubDate>Wed, 30 Jul 2025 22:28:21 +0530</pubDate><guid>https://heyyviv.github.io/blog/mapper-reducer/</guid><description>&lt;h1 id="map-reduce">Map Reduce&lt;/h1>
&lt;h3 id="problem-faced-by-google">Problem faced by google:&lt;/h3>
&lt;p>Large Data like crawled pages over WWW. They need to do some analysis over this data. It&amp;rsquo;s really not possible to store all this data in one system and to analyse this data serially wil take a lots of time. So they created MapReduce
Issue:&lt;/p>
&lt;ul>
&lt;li>parallelize Computation&lt;/li>
&lt;li>distribute the data&lt;/li>
&lt;li>handle failure cases&lt;/li>
&lt;li>load balancing&lt;/li>
&lt;/ul>
&lt;h3 id="programming-model">Programming Model&lt;/h3>
&lt;p>For example, suppose we want to count the number of occurrences of each word in a large collection of documents.
Map:
receives a document&lt;/p>
&lt;pre tabindex="0">&lt;code>map(String key, String value):
	// key: document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, &amp;#34;1&amp;#34;);
&lt;/code>&lt;/pre>&lt;p>Reduce:
receives the list of intermediate values collected for each key&lt;/p>
&lt;pre tabindex="0">&lt;code>reduce(String key, Iterator values):
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));
&lt;/code>&lt;/pre>&lt;p>The mapper produces pairs such as
(the, 1)
(map, 1)
(function, 1)
The reducer receives
(&amp;ldquo;the&amp;rdquo;, {1,1,1,1,2,3})
and produces
(&amp;ldquo;the&amp;rdquo;, 9)&lt;/p>
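&lt;p>The paper&amp;rsquo;s pseudocode translates to a small runnable sketch; the in-memory grouping step here stands in for the shuffle:&lt;/p>

```python
# Runnable toy of the word-count example: map emits (word, 1) pairs, the
# "shuffle" groups values by key, and reduce sums the list for each key.
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    return (key, sum(values))

docs = {"d1": "the map function", "d2": "the the map"}

# shuffle: group intermediate pairs by key
groups = defaultdict(list)
for name, text in docs.items():
    for k, v in map_fn(name, text):
        groups[k].append(v)

counts = dict(reduce_fn(k, vs) for k, vs in groups.items())
# counts["the"] == 3
```

&lt;p>In the real system the grouping happens across machines, but the types are exactly the ones shown below.&lt;/p>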
&lt;pre tabindex="0">&lt;code>map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)
&lt;/code>&lt;/pre>&lt;h3 id="implementation">Implementation&lt;/h3>
&lt;ul>
&lt;li>We split the input data into M splits; each split is processed in parallel by a different machine.&lt;/li>
&lt;li>We split the intermediate key space into R pieces using a partitioning function specified by the user, e.g. &lt;code>hash(key) mod R&lt;/code>.&lt;/li>
&lt;/ul>
&lt;p>Steps:&lt;/p>
&lt;ul>
&lt;li>The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;/li>
&lt;li>One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;/li>
&lt;li>A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory.&lt;/li>
&lt;li>Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.&lt;/li>
&lt;li>When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;/li>
&lt;li>The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;/li>
&lt;li>When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;/li>
&lt;/ul>
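&lt;p>The steps above can be simulated in a few lines: each map task buffers its output into R partitions via &lt;code>hash(key) mod R&lt;/code>, and each reduce task pulls its partition from every map task and sorts by key:&lt;/p>

```python
# Toy simulation of the M-map / R-reduce data flow for word count.
R = 2    # number of reduce partitions
M = 3    # number of map tasks (one per input split)

inputs = [["a b", "b c"], ["c a"], ["a a b"]]

map_outputs = []                         # one list of R partition buffers per map task
for split in inputs[:M]:
    partitions = [[] for _ in range(R)]
    for line in split:
        for word in line.split():
            # the partitioning function decides which reduce task gets this key
            partitions[hash(word) % R].append((word, 1))
    map_outputs.append(partitions)

results = {}
for r in range(R):
    # reduce worker r reads region r from every map worker, then sorts by key
    pairs = sorted(p for out in map_outputs for p in out[r])
    for word, one in pairs:              # sorted pairs group equal keys together
        results[word] = results.get(word, 0) + one
```

&lt;p>Every key lands in exactly one partition, so the reduce tasks never need to coordinate with each other.&lt;/p>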
&lt;p>for each completed map task,the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;/p>
&lt;h3 id="fault-tolerance">Fault Tolerance&lt;/h3>
&lt;p>The master pings every worker periodically. If a worker fails to respond in time, the master marks its tasks as failed and reschedules them on other workers.
Completed map tasks are re-executed on failure because their intermediate output is stored on the failed machine&amp;rsquo;s local disk. Completed reduce tasks do not need to be re-executed, because their output is stored in the global file system.&lt;/p>
&lt;p>It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore our current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.
Each map task produces R intermediate files (one per reduce partition), and each reduce task produces one output file.
When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message.
When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file. If the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file, and the atomic rename guarantees that the final file contains the output of exactly one execution.&lt;/p>
&lt;p>We split the map phase into M tasks and the reduce phase into R tasks; ideally, M and R should be much larger than the number of workers.&lt;/p>
&lt;ul>
&lt;li>Having each worker perform many different tasks improves dynamic load balancing&lt;/li>
&lt;li>speeds up recovery when a worker fails&lt;/li>
&lt;li>In practice, we tend to choose M so that each individual task is roughly 16 MB to 64 MB of input data (so that the locality optimization described above is most effective).&lt;/li>
&lt;/ul>
&lt;p>One of the common causes that lengthens the total time taken for a MapReduce operation is a &amp;ldquo;straggler&amp;rdquo;: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as completed whenever either the primary or the backup execution completes.&lt;/p></description></item></channel></rss>