<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Notes on Vivek's Field Notes</title><link>https://heyyviv.github.io/tags/notes/</link><description>Recent content in Notes on Vivek's Field Notes</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 28 Feb 2026 20:16:56 +0530</lastBuildDate><atom:link href="https://heyyviv.github.io/tags/notes/index.xml" rel="self" type="application/rss+xml"/><item><title>Scaling Databases with Sharding</title><link>https://heyyviv.github.io/blog/scaling-databases-with-sharding/</link><pubDate>Sat, 28 Feb 2026 20:16:56 +0530</pubDate><guid>https://heyyviv.github.io/blog/scaling-databases-with-sharding/</guid><description>&lt;h2 id="introduction-to-sharding">Introduction to Sharding&lt;/h2>
&lt;p>Sharding is the process of scaling a database by spreading data across multiple servers, or &lt;strong>shards&lt;/strong>. It is the go-to solution for large organizations managing data at a petabyte scale. Industry leaders like Uber, Shopify, Slack, and OpenAI all leverage sharding to manage their massive datasets.&lt;/p>
&lt;p>In a typical small-scale application, one or more app servers connect to a single, monolithic database. This server stores all persistent data, from user accounts to application state. However, as data grows, this single point of failure and bottleneck must be addressed.&lt;/p>
&lt;h2 id="sharded-architecture">Sharded Architecture&lt;/h2>
&lt;p>In a sharded setup, we divide the total data into portions, each hosted on a separate database server.&lt;/p>
&lt;p>Initially, your application code might try to manage these shards directly—keeping track of which row lives where and maintaining multiple open connections. While manageable with two shards, this approach becomes a maintenance nightmare when dealing with hundreds.&lt;/p>
&lt;h3 id="the-proxy-layer">The Proxy Layer&lt;/h3>
&lt;p>A more robust solution is to use an &lt;strong>intermediary proxy&lt;/strong>. Application servers connect only to this proxy, which then routes queries to the correct shard.&lt;/p>
&lt;p>However, proxies introduce their own challenges:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Throughput Limits:&lt;/strong> If a proxy reaches its capacity, queries are queued, adding latency.&lt;/li>
&lt;li>&lt;strong>Scalability:&lt;/strong> To handle high volumes, you must deploy multiple proxy servers to prevent them from becoming the bottleneck.&lt;/li>
&lt;/ul>
&lt;h2 id="sharding-strategies">Sharding Strategies&lt;/h2>
&lt;p>The sharding strategy—the rules determining data placement—is critical for performance and balance. This usually involves a &lt;strong>shard key&lt;/strong>: the column(s) used to route data.&lt;/p>
&lt;h3 id="1-range-sharding">1. Range Sharding&lt;/h3>
&lt;p>Data is routed based on predefined ranges of values. For example, IDs 1-25 might go to Shard A, 26-50 to Shard B, and so on.&lt;/p>
&lt;blockquote>
&lt;p>[!WARNING]
Naive range-based sharding with monotonically increasing IDs often leads to &lt;strong>&amp;ldquo;Hot Shards&amp;rdquo;&lt;/strong>. If you insert IDs 1 to 25 sequentially, only the first shard is active while others remain idle.&lt;/p>
&lt;/blockquote>
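&lt;p>As a sketch, range routing can be an ordered scan over shard bounds (the shard names and ranges below are illustrative, not any particular proxy&amp;rsquo;s API):&lt;/p>

```go
package main

import "fmt"

// shardRange assigns a contiguous range of IDs to one shard.
type shardRange struct {
	maxID int    // highest ID this shard owns (inclusive)
	name  string // in a real proxy this would be a connection handle
}

var ranges = []shardRange{
	{25, "shard-a"}, // IDs 1-25
	{50, "shard-b"}, // IDs 26-50
	{75, "shard-c"}, // IDs 51-75
}

// shardForID returns the first shard whose upper bound covers the ID.
func shardForID(id int) string {
	for _, r := range ranges {
		if id <= r.maxID {
			return r.name
		}
	}
	return "shard-overflow" // IDs beyond the last configured range
}

func main() {
	fmt.Println(shardForID(10)) // shard-a
	fmt.Println(shardForID(42)) // shard-b
}
```

&lt;p>A production proxy would binary-search the range list and route to a live connection rather than returning a name.&lt;/p>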
&lt;h3 id="2-hash-sharding">2. Hash Sharding&lt;/h3>
&lt;p>The proxy computes a hash of the shard key for each row. Each shard is then responsible for a specific range of hash values.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Best Practice:&lt;/strong> Choose a key with &lt;strong>high cardinality&lt;/strong> (e.g., &lt;code>user_id&lt;/code>).&lt;/li>
&lt;li>&lt;strong>Avoid:&lt;/strong> Columns like &lt;code>name&lt;/code>, where popular values can still create hotspots despite hashing.&lt;/li>
&lt;li>&lt;strong>Optimization:&lt;/strong> Hashing fixed-size integers (&lt;code>user_id&lt;/code>) is generally faster than hashing variable-width strings.&lt;/li>
&lt;/ul>
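&lt;p>A sketch of hash routing (FNV-1a and the shard count of 4 are illustrative choices, not a specific proxy&amp;rsquo;s algorithm). Note the fixed-size integer key, per the optimization above:&lt;/p>

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const numShards = 4 // illustrative shard count

// hashShard hashes a fixed-width integer key and maps the hash onto
// one of numShards buckets.
func hashShard(userID uint64) uint32 {
	var buf [8]byte
	binary.LittleEndian.PutUint64(buf[:], userID)
	h := fnv.New64a()
	h.Write(buf[:])
	return uint32(h.Sum64() % numShards)
}

func main() {
	for _, id := range []uint64{1, 2, 3, 1} {
		fmt.Printf("user %d -> shard %d\n", id, hashShard(id))
	}
}
```

&lt;p>The same key always lands on the same shard; a high-cardinality key spreads load across all buckets.&lt;/p>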
&lt;h3 id="3-lookup-sharding">3. Lookup Sharding&lt;/h3>
&lt;p>A separate mapping table tracks exactly which data belongs on which shard. This offers maximum flexibility but requires an additional lookup for every query.&lt;/p>
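&lt;p>A minimal sketch of lookup sharding (the mapping contents below are illustrative; in production the mapping table lives in its own highly available store):&lt;/p>

```go
package main

import "fmt"

// shardFor is the explicit mapping table: it records exactly which shard
// holds each key, trading an extra lookup per query for full control
// over data placement.
var shardFor = map[string]string{
	"user:1001": "shard-a",
	"user:1002": "shard-c",
}

func lookupShard(key string) (string, bool) {
	shard, ok := shardFor[key]
	return shard, ok
}

func main() {
	if shard, ok := lookupShard("user:1002"); ok {
		fmt.Println(shard) // shard-c
	}
}
```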
&lt;hr>
&lt;h2 id="real-world-case-study-postgresql-and-chatgpt">Real-World Case Study: PostgreSQL and ChatGPT&lt;/h2>
&lt;p>While sharding solves many scale problems, specific database architectures like PostgreSQL&amp;rsquo;s &lt;strong>MVCC (Multiversion Concurrency Control)&lt;/strong> introduce unique write penalties that companies like OpenAI have had to navigate.&lt;/p>
&lt;h3 id="the-copy-on-write-penalty">The &amp;ldquo;Copy-on-Write&amp;rdquo; Penalty&lt;/h3>
&lt;p>In Postgres, updates are not performed &amp;ldquo;in-place.&amp;rdquo; Updating even one byte results in &lt;strong>Write Amplification&lt;/strong>, where the entire row is copied to create a new version. This strains I/O and leads to &lt;strong>Read Amplification&lt;/strong>, as queries must scan through &amp;ldquo;dead&amp;rdquo; versions (old rows) to find live ones.&lt;/p>
&lt;h3 id="the-bloat-problem">The &amp;ldquo;Bloat&amp;rdquo; Problem&lt;/h3>
&lt;p>Old row versions (Dead Tuples) don&amp;rsquo;t disappear instantly, leading to table bloat and increased &lt;code>autovacuum&lt;/code> overhead. If writes outpace reclamation, performance collapses. Every update also requires updating all indexes to point to the new physical row location, adding CPU stress.&lt;/p>
&lt;h3 id="strategies-from-the-openai-engineering-team">Strategies from the OpenAI Engineering Team&lt;/h3>
&lt;p>To ensure services like ChatGPT and their API remain responsive during massive write spikes, several strategies are employed:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Minimizing Primary Load:&lt;/strong> Read traffic is offloaded to replicas whenever possible. Queries that must remain on the primary (e.g., those part of write transactions) are strictly optimized for efficiency.&lt;/li>
&lt;li>&lt;strong>Selective Migration:&lt;/strong> Shardable, write-heavy workloads are migrated to systems like &lt;strong>Azure CosmosDB&lt;/strong>.&lt;/li>
&lt;li>&lt;strong>Application-Level Optimizations:&lt;/strong> Redundant writes are eliminated, and &amp;ldquo;lazy writes&amp;rdquo; are introduced to smooth out traffic spikes.&lt;/li>
&lt;li>&lt;strong>Rate Limiting:&lt;/strong> Strict limits are enforced during background tasks, such as backfilling table fields, to prevent excessive write pressure.&lt;/li>
&lt;/ul>
&lt;hr>
&lt;h2 id="optimization--best-practices">Optimization &amp;amp; Best Practices&lt;/h2>
&lt;h3 id="query-optimization">Query Optimization&lt;/h3>
&lt;p>Avoid &amp;ldquo;OLTP anti-patterns&amp;rdquo; that can degrade services:&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Simplify Joins:&lt;/strong> A query joining 12 tables (as seen in some historical ChatGPT SEVs) can crash a service during a spike. Move complex join logic to the application layer.&lt;/li>
&lt;li>&lt;strong>ORM Awareness:&lt;/strong> Object-Relational Mapping tools can generate inefficient SQL; always review the output.&lt;/li>
&lt;li>&lt;strong>Timeout Management:&lt;/strong> Configure &lt;code>idle_in_transaction_session_timeout&lt;/code> to prevent idle queries from blocking critical processes like autovacuum.&lt;/li>
&lt;/ul>
&lt;h3 id="cross-shard-penalties">Cross-Shard Penalties&lt;/h3>
&lt;p>Queries spanning multiple shards add excessive network and CPU overhead. Aim for single-shard queries whenever possible. Additionally, avoid shard keys that change frequently, as moving rows between shards to maintain strategy integrity is expensive.&lt;/p>
&lt;h2 id="infrastructure--latency">Infrastructure &amp;amp; Latency&lt;/h2>
&lt;p>Adding a proxy introduces a network hop, typically adding ~1ms of latency.&lt;/p>
&lt;ul>
&lt;li>&lt;strong>Server Proximity:&lt;/strong> If proxies and shards are in the same data center, this latency is negligible.&lt;/li>
&lt;li>&lt;strong>Proven Success:&lt;/strong> Slack uses Vitess to manage massive sharded clusters with an average query latency of just &lt;strong>2ms&lt;/strong>.&lt;/li>
&lt;/ul>
&lt;h2 id="high-availability">High Availability&lt;/h2>
&lt;p>Replicas aren&amp;rsquo;t just for reads; they are your safety net. If a primary fails, traffic can be instantly failed over to a replica, preventing hours of downtime.&lt;/p></description></item><item><title>Storage and Retrieval</title><link>https://heyyviv.github.io/blog/storage-and-retrival/</link><pubDate>Wed, 25 Feb 2026 23:32:36 +0530</pubDate><guid>https://heyyviv.github.io/blog/storage-and-retrival/</guid><description>&lt;p>In particular, there is a big difference between storage engines that are optimized for
transactional workloads and those that are optimized for analytics.&lt;/p>
&lt;p>An index is an additional structure that is derived from the primary data. Many databases allow you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest possible write operation. Any kind of index usually slows down writes, because the index also needs to be updated every time data is written.&lt;/p>
&lt;h2 id="hash-index">Hash Index&lt;/h2>
&lt;p>Let’s say our data storage consists only of appending to a file, as in the preceding example. Then the simplest possible indexing strategy is this: keep an in-memory hash map where every key is mapped to a byte offset in the data file—the location at which the value can be found, as illustrated in Figure 3-1. Whenever you append a new key-value pair to the file, you also update the hash map to reflect the offset of the data you just wrote (this works both for inserting new keys and for updating existing keys). When you want to look up a value, use the hash map to find the offset in the data file, seek to that location, and read the value.&lt;/p>
&lt;p>This may sound simplistic, but it is a viable approach. In fact, this is essentially what Bitcask (the default storage engine in Riak) does [3]. Bitcask offers high-performance reads and writes, subject to the requirement that all the keys fit in the available RAM, since the hash map is kept completely in memory. The values can use more space than there is available memory, since they can be loaded from disk with just one disk seek. If that part of the data file is already in the filesystem cache, a read doesn’t require any disk I/O at all.&lt;/p>
&lt;p>A storage engine like Bitcask is well suited to situations where the value for each key is updated frequently. For example, the key might be the URL of a cat video, and the value might be the number of times it has been played (incremented every time someone hits the play button). In this kind of workload, there are a lot of writes, but there are not too many distinct keys—you have a large number of writes per key, but it’s feasible to keep all keys in memory.&lt;/p>
&lt;p>To avoid eventually running out of disk space, the log is broken into segments, and compaction—throwing away duplicate keys in a segment and keeping only the most recent update for each key—reclaims the space. Moreover, since compaction often makes segments much smaller (assuming that a key is overwritten several times on average within one segment), we can also merge several segments together at the same time as performing the compaction, as shown in Figure 3-3. Segments are never modified after they have been written, so the merged segment is written to a new file. The merging and compaction of frozen segments can be done in a background thread, and while it is going on, we can still continue to serve read and write requests as normal, using the old segment files. After the merging process is complete, we switch read requests to using the new merged segment instead of the old segments—and then the old segment files can simply be deleted.&lt;/p>
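&lt;p>The append-only log plus in-memory offset index described above can be sketched in a few lines of Go (an in-memory byte slice stands in for the data file; this is an illustration, not Bitcask&amp;rsquo;s actual implementation):&lt;/p>

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// logStore appends length-prefixed values to a log and keeps an
// in-memory index mapping each key to the byte offset of its most
// recent record.
type logStore struct {
	data  []byte
	index map[string]int // key -> byte offset of latest value
}

func newLogStore() *logStore {
	return &logStore{index: make(map[string]int)}
}

func (s *logStore) Set(key, value string) {
	offset := len(s.data)
	var lenBuf [4]byte
	binary.BigEndian.PutUint32(lenBuf[:], uint32(len(value)))
	s.data = append(s.data, lenBuf[:]...) // append the record to the "file"
	s.data = append(s.data, value...)
	s.index[key] = offset // point the hash map at the new record
}

func (s *logStore) Get(key string) (string, bool) {
	offset, ok := s.index[key]
	if !ok {
		return "", false
	}
	n := binary.BigEndian.Uint32(s.data[offset : offset+4])
	return string(s.data[offset+4 : offset+4+int(n)]), true
}

func main() {
	s := newLogStore()
	s.Set("cat-video-url", "1")
	s.Set("cat-video-url", "2") // old record stays in the log; index moves on
	v, _ := s.Get("cat-video-url")
	fmt.Println(v) // 2
}
```

&lt;p>Stale records accumulate in the log exactly as described, which is what compaction later reclaims.&lt;/p>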
&lt;p>Each segment now has its own in-memory hash table, mapping keys to file offsets. In
order to find the value for a key, we first check the most recent segment’s hash map;
if the key is not present we check the second-most-recent segment, and so on. The
merging process keeps the number of segments small, so lookups don’t need to check
many hash maps.
Lots of detail goes into making this simple idea work in practice. Briefly, some of the
issues that are important in a real implementation are:&lt;/p>
&lt;ul>
&lt;li>The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck. In principle, you could maintain a hash map on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require fiddly logic [5].&lt;/li>
&lt;li>Range queries are not efficient. For example, you cannot easily scan over all keys between &lt;code>kitty00000&lt;/code> and &lt;code>kitty99999&lt;/code>—you’d have to look up each key individually in the hash maps.&lt;/li>
&lt;/ul></description></item><item><title>AI Agents Notes</title><link>https://heyyviv.github.io/blog/ai-agents-notes/</link><pubDate>Tue, 10 Feb 2026 21:24:16 +0530</pubDate><guid>https://heyyviv.github.io/blog/ai-agents-notes/</guid><description>&lt;h1 id="worlflow">Workflow&lt;/h1>
&lt;p>Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks on any intermediate steps to ensure that the process is still on track.&lt;/p>
&lt;p>When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.&lt;/p>
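&lt;p>A minimal sketch of prompt chaining with a programmatic gate check between steps, in Go; &lt;code>callLLM&lt;/code> is a hypothetical stand-in for a real model call:&lt;/p>

```go
package main

import (
	"fmt"
	"strings"
)

// callLLM is a placeholder for a real model call (hypothetical).
func callLLM(prompt string) string {
	return "OUTLINE: " + prompt
}

// chain runs each step on the previous step's output and aborts if the
// gate check fails, so a bad intermediate result never reaches later steps.
func chain(input string, steps []func(string) string, gate func(string) bool) (string, error) {
	out := input
	for i, step := range steps {
		out = step(out)
		if !gate(out) {
			return "", fmt.Errorf("gate failed after step %d", i+1)
		}
	}
	return out, nil
}

func main() {
	steps := []func(string) string{
		func(s string) string { return callLLM("Outline a post about: " + s) },
		func(s string) string { return callLLM("Draft the post from: " + s) },
	}
	gate := func(s string) bool { return strings.TrimSpace(s) != "" } // simple non-empty check
	result, err := chain("database sharding", steps, gate)
	fmt.Println(err == nil, len(result) > 0)
}
```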
&lt;p>Routing classifies an input and directs it to a specialized followup task. This workflow allows for separation of concerns, and building more specialized prompts. Without this workflow, optimizing for one kind of input can hurt performance on other inputs.&lt;/p>
&lt;p>When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.&lt;/p>
&lt;p>LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:&lt;/p>
&lt;ul>
&lt;li>Sectioning: Breaking a task into independent subtasks run in parallel.&lt;/li>
&lt;li>Voting: Running the same task multiple times to get diverse outputs.&lt;/li>
&lt;/ul>
&lt;p>When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.&lt;/p>
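&lt;p>The sectioning variant maps naturally onto goroutines. A sketch (&lt;code>callLLM&lt;/code> is a hypothetical stand-in for a real model call):&lt;/p>

```go
package main

import (
	"fmt"
	"sync"
)

// callLLM is a placeholder for a real model call (hypothetical).
func callLLM(prompt string) string {
	return "answer to: " + prompt
}

// parallelize runs independent subtasks concurrently and aggregates
// their outputs in the original order.
func parallelize(subtasks []string) []string {
	results := make([]string, len(subtasks))
	var wg sync.WaitGroup
	for i, t := range subtasks {
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			results[i] = callLLM(t) // each goroutine writes only its own slot
		}(i, t)
	}
	wg.Wait()
	return results
}

func main() {
	out := parallelize([]string{"check tone", "check facts", "check policy"})
	fmt.Println(len(out), out[0])
}
```

&lt;p>Voting is the same fan-out with identical prompts, followed by an aggregation step over the results.&lt;/p>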
&lt;p>In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results.&lt;/p>
&lt;p>When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). Whereas it’s topographically similar, the key difference from parallelization is its flexibility—subtasks aren&amp;rsquo;t pre-defined, but determined by the orchestrator based on the specific input.&lt;/p>
&lt;p>In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.&lt;/p>
&lt;p>When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.&lt;/p>
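&lt;p>A sketch of the evaluator-optimizer loop in Go; &lt;code>generate&lt;/code> and &lt;code>evaluate&lt;/code> are hypothetical stand-ins for real model calls, with a toy acceptance criterion:&lt;/p>

```go
package main

import "fmt"

// generate is a placeholder for the drafting model call (hypothetical).
func generate(task, feedback string) string {
	return task + " [revised per: " + feedback + "]"
}

// evaluate is a placeholder for the evaluator call (hypothetical).
// Toy criterion: accept once the draft has grown past a length threshold.
func evaluate(draft string) (ok bool, feedback string) {
	if len(draft) > 60 {
		return true, ""
	}
	return false, "add more detail"
}

// refine loops generate -> evaluate until the evaluator accepts the
// draft or the round budget is exhausted.
func refine(task string, maxRounds int) (string, bool) {
	draft := generate(task, "first attempt")
	for i := 0; i < maxRounds; i++ {
		ok, feedback := evaluate(draft)
		if ok {
			return draft, true
		}
		draft = generate(draft, feedback) // feed evaluator output back in
	}
	return draft, false
}

func main() {
	out, accepted := refine("Summarize the sharding post", 5)
	fmt.Println(accepted, len(out) > 0)
}
```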
&lt;h1 id="text-to-sql-in-pinterest">Text to SQL in Pinterest&lt;/h1>
&lt;p>The user asks an analytical question, choosing the tables to be used.&lt;/p>
&lt;ul>
&lt;li>The relevant table schemas are retrieved from the table metadata store.&lt;/li>
&lt;li>The question, selected SQL dialect, and table schemas are compiled into a Text-to-SQL prompt.&lt;/li>
&lt;li>The prompt is fed into the LLM.&lt;/li>
&lt;li>A streaming response is generated and displayed to the user.&lt;/li>
&lt;/ul>
&lt;p>The table schema acquired from the metadata store includes:&lt;/p>
&lt;ul>
&lt;li>Table name&lt;/li>
&lt;li>Table description&lt;/li>
&lt;li>Columns
&lt;ul>
&lt;li>Column name&lt;/li>
&lt;li>Column type&lt;/li>
&lt;li>Column description&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h3 id="low-cardinality-columns">Low-Cardinality Columns&lt;/h3>
&lt;p>Certain analytical queries, such as “how many active users are on the ‘web’ platform”, may generate SQL queries that do not conform to the database’s actual values if generated naively. For example, the where clause in the response might be &lt;code>where platform=&amp;#39;web&amp;#39;&lt;/code> as opposed to the correct &lt;code>where platform=&amp;#39;WEB&amp;#39;&lt;/code>. To address such issues, unique values of low-cardinality columns which would frequently be used for this kind of filtering are processed and incorporated into the table schema, so that the LLM can make use of this information to generate precise SQL queries.&lt;/p>
&lt;h3 id="context-window-limit">Context Window Limit&lt;/h3>
&lt;p>Extremely large table schemas might exceed the typical context window limit. To address this problem, we employed a few techniques:&lt;/p>
&lt;ul>
&lt;li>Reduced version of the table schema: This includes only crucial elements such as the table name, column name, and type.&lt;/li>
&lt;li>Column pruning: Columns are tagged in the metadata store, and we exclude certain ones from the table schema based on their tags.&lt;/li>
&lt;/ul>
&lt;pre tabindex="0">&lt;code>you are a {dialect} expert.

Please help to generate a {dialect} query to answer the question. Your response should ONLY be based on the given context and follow the response guidelines and format instructions.

===Tables
{table_schemas}

===Original Query
{original_query}

===Response Guidelines
1. If the provided context is sufficient, please generate a valid query without any explanations for the question. The query should start with a comment containing the question being asked.
2. If the provided context is insufficient, please explain why it can&amp;#39;t be generated.
3. Please use the most relevant table(s).
5. Please format the query before responding.
6. Please always respond with a valid well-formed JSON object with the following format

===Response Format
{{
 &amp;#34;query&amp;#34;: &amp;#34;A generated SQL query when context is sufficient.&amp;#34;,
 &amp;#34;explanation&amp;#34;: &amp;#34;An explanation of failing to generate the query.&amp;#34;
}}

===Question
{question}
&lt;/code>&lt;/pre>&lt;p>spider dataset : &lt;a href="https://arxiv.org/pdf/2204.00498">https://arxiv.org/pdf/2204.00498&lt;/a>&lt;/p>
&lt;p>An offline job is employed to generate a vector index of tables’ summaries and historical queries against them.&lt;/p>
&lt;ul>
&lt;li>If the user does not specify any tables, their question is transformed into embeddings, and a similarity search is conducted against the vector index to infer the top N suitable tables.&lt;/li>
&lt;li>The top N tables, along with the table schema and analytical question, are compiled into a prompt for the LLM to select the top K most relevant tables.&lt;/li>
&lt;li>The top K tables are returned to the user for validation or alteration.&lt;/li>
&lt;li>The standard Text-to-SQL process is resumed with the user-confirmed tables.&lt;/li>
&lt;/ul>
&lt;h3 id="offline-vector-index-creation">Offline Vector Index Creation&lt;/h3>
&lt;h3 id="table-summarization">Table Summarization&lt;/h3>
&lt;p>There is an ongoing table standardization effort at Pinterest to add tiering for the tables. We index only top-tier tables, promoting the use of these higher-quality datasets. The table summarization generation process involves the following steps:&lt;/p>
&lt;ul>
&lt;li>Retrieve the table schema from the table metadata store.&lt;/li>
&lt;li>Gather the most recent sample queries utilizing the table.&lt;/li>
&lt;li>Based on the context window, incorporate as many sample queries as possible into the table summarization prompt, along with the table schema.&lt;/li>
&lt;li>Forward the prompt to the LLM to create the summary.&lt;/li>
&lt;li>Generate and store embeddings in the vector store.&lt;/li>
&lt;/ul>
&lt;p>The table summary includes a description of the table, the data it contains, as well as potential use scenarios. Here is the current prompt we are using for table summarization:&lt;/p>
&lt;pre tabindex="0">&lt;code>prompt_template = &amp;#34;&amp;#34;&amp;#34;
You are a data analyst that can help summarize SQL tables.

Summarize below table by the given context.

===Table Schema
{table_schema}

===Sample Queries
{sample_queries}

===Response guideline
 - You shall write the summary based only on provided information.
 - Note that above sampled queries are only small sample of queries and thus not all possible use of tables are represented, and only some columns in the table are used.
 - Do not use any adjective to describe the table. For example, the importance of the table, its comprehensiveness or if it is crucial, or who may be using it. For example, you can say that a table contains certain types of data, but you cannot say that the table contains a &amp;#39;wealth&amp;#39; of data, or that it is &amp;#39;comprehensive&amp;#39;.
 - Do not mention about the sampled query. Only talk objectively about the type of data the table contains and its possible utilities.
 - Please also include some potential usecases of the table, e.g. what kind of questions can be answered by the table, what kind of analysis can be done by the table, etc.
&amp;#34;&amp;#34;&amp;#34;
&lt;/code>&lt;/pre>&lt;h3 id="query-summarization">Query Summarization&lt;/h3>
&lt;p>Besides their role in table summarization, sample queries associated with each table are also summarized individually, including details such as the query’s purpose and utilized tables. Here is the prompt we are using:&lt;/p>
&lt;pre tabindex="0">&lt;code>prompt_template = &amp;#34;&amp;#34;&amp;#34;
You are a helpful assistant that can help document SQL queries.

Please document below SQL query by the given table schemas.

===SQL Query
{query}

===Table Schemas
{table_schemas}

===Response Guidelines
Please provide the following list of descriptions for the query:
-The selected columns and their description
-The input tables of the query and the join pattern
-Query&amp;#39;s detailed transformation logic in plain english, and why these transformation are necessary
-The type of filters performed by the query, and why these filters are necessary
-Write very detailed purposes and motives of the query in detail
-Write possible business and functional purposes of the query
&amp;#34;&amp;#34;&amp;#34;
&lt;/code>&lt;/pre>&lt;h3 id="nlp-table-search">NLP Table Search&lt;/h3>
&lt;p>When a user asks an analytical question, we convert it into embeddings using the same embedding model. Then we conduct a search against both table and query vector indices. We’re using OpenSearch as the vector store, relying on its built-in similarity search ability.&lt;/p>
&lt;p>Considering that multiple tables can be associated with a query, a single table could appear multiple times in the similarity search results. Currently, we utilize a simplified strategy to aggregate and score them. Table summaries carry more weight than query summaries, a scoring strategy that could be adjusted in the future.&lt;/p>
&lt;p>Other than being used in the Text-to-SQL, this NLP-based table search is also used in the general table search in Querybook.&lt;/p>
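&lt;p>The retrieval step can be sketched as cosine similarity over stored summary embeddings (the vectors below are toy values, not a real embedding model or OpenSearch&amp;rsquo;s API):&lt;/p>

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// cosine returns the cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

type scored struct {
	table string
	score float64
}

// topTables scores every indexed table summary against the question
// embedding and returns the n best matches.
func topTables(question []float64, index map[string][]float64, n int) []string {
	var all []scored
	for t, v := range index {
		all = append(all, scored{t, cosine(question, v)})
	}
	sort.Slice(all, func(i, j int) bool { return all[i].score > all[j].score })
	if n > len(all) {
		n = len(all)
	}
	out := make([]string, n)
	for i := 0; i < n; i++ {
		out[i] = all[i].table
	}
	return out
}

func main() {
	index := map[string][]float64{
		"user_activity": {0.9, 0.1},
		"ads_revenue":   {0.1, 0.9},
	}
	fmt.Println(topTables([]float64{0.8, 0.2}, index, 1)) // [user_activity]
}
```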
&lt;h1 id="rag">RAG&lt;/h1></description></item><item><title>Docker &amp; kubernetes</title><link>https://heyyviv.github.io/blog/docker-kubernetes/</link><pubDate>Fri, 14 Nov 2025 12:38:31 +0530</pubDate><guid>https://heyyviv.github.io/blog/docker-kubernetes/</guid><description>&lt;h1 id="docker">Docker&lt;/h1>
&lt;p>Open Source&lt;/p></description></item><item><title>Go Lang</title><link>https://heyyviv.github.io/blog/go-lang/</link><pubDate>Sun, 10 Aug 2025 15:24:40 +0530</pubDate><guid>https://heyyviv.github.io/blog/go-lang/</guid><description>&lt;h1 id="interface">Interface&lt;/h1>
&lt;p>An interface type in Go is kind of like a definition. It defines and describes the exact methods that some other type must have.&lt;/p>
&lt;pre tabindex="0">&lt;code>type Stringer interface {
 String() string
}
&lt;/code>&lt;/pre>&lt;p>We say that something satisfies this interface (or implements this interface) if it has a method with the exact signature String() string.&lt;/p>
&lt;pre tabindex="0">&lt;code>type Book struct {
 Title string
 Author string
}

func (b Book) String() string {
 return fmt.Sprintf(&amp;#34;Book: %s - %s&amp;#34;, b.Title, b.Author)
}
&lt;/code>&lt;/pre>&lt;p>Wherever you see declaration in Go (such as a variable, function parameter or struct field) which has an interface type, you can use an object of any type so long as it satisfies the interface.&lt;/p>
&lt;pre tabindex="0">&lt;code>func WriteLog(s fmt.Stringer) {
 log.Print(s.String())
}
&lt;/code>&lt;/pre>&lt;p>Because this WriteLog() function uses the fmt.Stringer interface type in its parameter declaration, we can pass in any object that satisfies the fmt.Stringer interface.&lt;/p>
&lt;pre tabindex="0">&lt;code>package main

import (
 &amp;#34;fmt&amp;#34;
 &amp;#34;strconv&amp;#34;
 &amp;#34;log&amp;#34;
)

// Declare a Book type which satisfies the fmt.Stringer interface.
type Book struct {
 Title string
 Author string
}

func (b Book) String() string {
 return fmt.Sprintf(&amp;#34;Book: %s - %s&amp;#34;, b.Title, b.Author)
}

// Declare a Count type which satisfies the fmt.Stringer interface.
type Count int

func (c Count) String() string {
 return strconv.Itoa(int(c))
}

// Declare a WriteLog() function which takes any object that satisfies
// the fmt.Stringer interface as a parameter.
func WriteLog(s fmt.Stringer) {
 log.Print(s.String())
}

func main() {
 // Initialize a Book object and pass it to WriteLog().
 book := Book{&amp;#34;Alice in Wonderland&amp;#34;, &amp;#34;Lewis Carrol&amp;#34;}
 WriteLog(book)

 // Initialize a Count object and pass it to WriteLog().
 count := Count(3)
 WriteLog(count)
}
&lt;/code>&lt;/pre>&lt;p>output:&lt;/p>
&lt;pre tabindex="0">&lt;code>2009/11/10 23:00:00 Book: Alice in Wonderland - Lewis Carrol
2009/11/10 23:00:00 3
&lt;/code>&lt;/pre>&lt;p>Advantage&lt;/p>
&lt;ul>
&lt;li>To help reduce duplication or boilerplate code.&lt;/li>
&lt;li>To make it easier to use mocks instead of real objects in unit tests.&lt;/li>
&lt;li>As an architectural tool, to help enforce decoupling between parts of your codebase.&lt;/li>
&lt;/ul>
&lt;p>the empty interface type interface{} is kind of like a wildcard. Wherever you see it in a declaration (such as a variable, function parameter or struct field) you can use an object of any type.&lt;/p>
&lt;pre tabindex="0">&lt;code>package main

import &amp;#34;fmt&amp;#34;


func main() {
 person := make(map[string]interface{}, 0)

 person[&amp;#34;name&amp;#34;] = &amp;#34;Alice&amp;#34;
 person[&amp;#34;age&amp;#34;] = 21
 person[&amp;#34;height&amp;#34;] = 167.64

 fmt.Printf(&amp;#34;%+v&amp;#34;, person)
}
&lt;/code>&lt;/pre>&lt;h1 id="error-handling">Error Handling&lt;/h1>
&lt;p>The error type is an interface type. An error variable represents any value that can describe itself as a string. Here is the interface’s declaration:&lt;/p>
&lt;pre tabindex="0">&lt;code>type error interface {
 Error() string
}
&lt;/code>&lt;/pre></description></item><item><title>Mapper Reducer</title><link>https://heyyviv.github.io/blog/mapper-reducer/</link><pubDate>Wed, 30 Jul 2025 22:28:21 +0530</pubDate><guid>https://heyyviv.github.io/blog/mapper-reducer/</guid><description>&lt;h1 id="map-reduce">Map Reduce&lt;/h1>
&lt;h3 id="problem-faced-by-google">Problem faced by google:&lt;/h3>
&lt;p>Large Data like crawled pages over WWW. They need to do some analysis over this data. It&amp;rsquo;s really not possible to store all this data in one system and to analyse this data serially wil take a lots of time. So they created MapReduce
Issue:&lt;/p>
&lt;ul>
&lt;li>Parallelize computation&lt;/li>
&lt;li>Distribute the data&lt;/li>
&lt;li>Handle failure cases&lt;/li>
&lt;li>Load balancing&lt;/li>
&lt;/ul>
&lt;h3 id="programming-model">Programming Model&lt;/h3>
&lt;p>For example, suppose we want to count the number of occurrences of each word in a large collection of documents.
Map: receives a document and emits each word with a count of 1.&lt;/p>
&lt;pre tabindex="0">&lt;code>map(String key, String value):
	// key: document name
	// value: document contents
	for each word w in value:
		EmitIntermediate(w, &amp;#34;1&amp;#34;);
&lt;/code>&lt;/pre>&lt;p>Reduce: receives all intermediate values for a key and combines them.&lt;/p>
&lt;pre tabindex="0">&lt;code>reduce(String key, Iterator values):
	// key: a word
	// values: a list of counts
	int result = 0;
	for each v in values:
		result += ParseInt(v);
	Emit(AsString(result));
&lt;/code>&lt;/pre>&lt;p>The mapper produces pairs such as (the, 1), (map, 1), (function, 1). The reducer then receives all counts for a key, e.g. (&amp;ldquo;the&amp;rdquo;, {1,1,1,1,2,3}), and produces the total: (&amp;ldquo;the&amp;rdquo;, 9).&lt;/p>
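&lt;p>The word-count example above can be run single-process as a sketch (real MapReduce distributes these phases across machines and performs the shuffle via intermediate files):&lt;/p>

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mapPhase emits (word, "1") for every word in the document,
// mirroring EmitIntermediate(w, "1") in the pseudocode.
func mapPhase(doc string) [][2]string {
	var pairs [][2]string
	for _, w := range strings.Fields(doc) {
		pairs = append(pairs, [2]string{w, "1"})
	}
	return pairs
}

// reducePhase sums the intermediate counts for one key.
func reducePhase(key string, values []string) int {
	result := 0
	for range values {
		result++ // each value is "1"
	}
	return result
}

func main() {
	docs := []string{"the map function", "the reduce function"}
	grouped := map[string][]string{} // shuffle: group intermediate pairs by key
	for _, d := range docs {
		for _, p := range mapPhase(d) {
			grouped[p[0]] = append(grouped[p[0]], p[1])
		}
	}
	keys := make([]string, 0, len(grouped))
	for k := range grouped {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("(%s, %d)\n", k, reducePhase(k, grouped[k]))
	}
}
```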
&lt;pre tabindex="0">&lt;code>map (k1,v1) → list(k2,v2)

reduce (k2,list(v2)) → list(v2)
&lt;/code>&lt;/pre>&lt;h3 id="implementation">Implementation&lt;/h3>
&lt;ul>
&lt;li>We Split the data into M split. Input split is processed in parallel by different machine.&lt;/li>
&lt;li>We split the intermediate key space into R pieces using a partitioning function specified by the user, e.g. hash(key) % R.&lt;/li>
&lt;/ul>
&lt;p>Steps:&lt;/p>
&lt;ul>
&lt;li>The MapReduce library in the user program first splits the input files into M pieces of typically 16 megabytes to 64 megabytes (MB) per piece (controllable by the user via an optional parameter). It then starts up many copies of the program on a cluster of machines.&lt;/li>
&lt;li>One of the copies of the program is special – the master. The rest are workers that are assigned work by the master. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task.&lt;/li>
&lt;li>A worker who is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input data and passes each pair to the user-defined Map function. The intermediate key/value pairs produced by the Map function are buffered in memory&lt;/li>
&lt;li>Periodically, the buffered pairs are written to local disk, partitioned into R regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, who is responsible for forwarding these locations to the reduce workers.&lt;/li>
&lt;li>When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the local disks of the map workers. When a reduce worker has read all intermediate data, it sorts it by the intermediate keys so that all occurrences of the same key are grouped together. The sorting is needed because typically many different keys map to the same reduce task. If the amount of intermediate data is too large to fit in memory, an external sort is used.&lt;/li>
&lt;li>The reduce worker iterates over the sorted intermediate data and for each unique intermediate key encountered, it passes the key and the corresponding set of intermediate values to the user’s Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.&lt;/li>
&lt;li>When all map tasks and reduce tasks have been completed, the master wakes up the user program. At this point, the MapReduce call in the user program returns back to the user code.&lt;/li>
&lt;/ul>
&lt;p>For each completed map task, the master stores the locations and sizes of the R intermediate file regions produced by the map task. Updates to this location and size information are received as map tasks are completed. The information is pushed incrementally to workers that have in-progress reduce tasks.&lt;/p>
&lt;h3 id="fault-tolerance">Fault Tolerance&lt;/h3>
&lt;p>The master pings every worker periodically. If a worker fails to respond in time, the master marks it as failed and reschedules its tasks on other workers.
Completed map tasks are re-executed on failure because their intermediate output is stored on the failed machine&amp;rsquo;s local disk and is therefore inaccessible. Completed reduce tasks do not need to be re-executed, since their output is stored in the global file system.&lt;/p>
&lt;p>It is easy to make the master write periodic checkpoints of the master data structures described above. If the master task dies, a new copy can be started from the last checkpointed state. However, given that there is only a single master, its failure is unlikely; therefore, the current implementation aborts the MapReduce computation if the master fails. Clients can check for this condition and retry the MapReduce operation if they desire.&lt;/p>
&lt;p>Each map task produces R intermediate files (one per reduce task), while each reduce task produces a single output file. When a map task completes, the worker sends a message to the master and includes the names of the R temporary files in the message. When a reduce task completes, the reduce worker atomically renames its temporary output file to the final output file; if the same reduce task is executed on multiple machines, multiple rename calls will be executed for the same final output file, and the atomic rename guarantees that the final file contains the output of just one execution.&lt;/p>
&lt;p>We split the map phase into M tasks and the reduce phase into R tasks. Ideally, M and R should be much larger than the number of worker machines.&lt;/p>
&lt;ul>
&lt;li>Having each worker perform many different tasks improves dynamic load balancing&lt;/li>
&lt;li>Speeds up recovery when a worker fails, since its many completed map tasks can be spread out across all the other workers&lt;/li>
&lt;/ul>
&lt;p>In practice, we tend to choose M so that each individual task handles roughly 16 MB to 64 MB of input data (so that the locality optimization described above is most effective).&lt;/p>
&lt;p>One of the common causes that lengthens the total time taken for a MapReduce operation is a “straggler”: a machine that takes an unusually long time to complete one of the last few map or reduce tasks in the computation.
When a MapReduce operation is close to completion, the master schedules backup executions of the remaining in-progress tasks. A task is marked as completed whenever either the primary or the backup execution completes.&lt;/p></description></item><item><title>About LLM part 1</title><link>https://heyyviv.github.io/blog/about-llm-part-1/</link><pubDate>Thu, 26 Jun 2025 16:53:53 +0530</pubDate><guid>https://heyyviv.github.io/blog/about-llm-part-1/</guid><description>&lt;h1 id="root-mean-square-layer-normalization">Root Mean Square Layer Normalization&lt;/h1>
&lt;p>Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network.
LayerNorm was widely adopted because of its simplicity, its independence across training cases, and its ability to handle variable-length inputs, unlike BatchNorm.
Unfortunately, incorporating LayerNorm raises computational overhead. Although this is negligible for small and shallow neural models with few normalization layers, the problem becomes severe as underlying networks grow larger and deeper. As a result, the efficiency gain from faster and more stable training (in terms of number of training steps) is counterbalanced by an increased computational cost per training step, which diminishes the net efficiency.
One major feature of LayerNorm widely regarded as contributing to this stabilization is its re-centering invariance property.&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://heyyviv.github.io/rmsnorm_1.png">
&lt;/figure>

RMSNorm focuses only on re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic:
$$
\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}
$$&lt;/p>
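&lt;p>A minimal NumPy sketch of the statistic above (names are illustrative: &lt;code>gain&lt;/code> stands in for the learnable scale, and &lt;code>eps&lt;/code> is a small constant for numerical stability):&lt;/p>

```python
import numpy as np

def rms_norm(x, gain, eps=1e-8):
    # RMS(x) = sqrt(mean(x_i^2)); note there is no mean subtraction,
    # which is exactly the re-centering step RMSNorm drops from LayerNorm.
    rms = np.sqrt(np.mean(np.square(x), axis=-1, keepdims=True) + eps)
    return x / rms * gain
```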
&lt;p>A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance
property. The former enables the model to be insensitive to shift noises on both inputs and weights,
and the latter keeps the output representations intact when both inputs and weights are randomly
scaled&lt;/p>
&lt;h1 id="positional-encoding">Positional Encoding&lt;/h1>
&lt;p>Desirable Properties&lt;/p>
&lt;ul>
&lt;li>Each position needs a unique encoding that remains consistent regardless of sequence length&lt;/li>
&lt;li>The relationship between positions should be mathematically simple. If we know the encoding for position p, it should be straightforward to compute the encoding for position p+k, making it easier for the model to learn positional patterns.&lt;/li>
&lt;li>It would be ideal if our positional encodings could be drawn from a deterministic process. This should allow the model to learn the mechanism behind our encoding scheme efficiently.&lt;/li>
&lt;/ul>
&lt;p>Drawbacks of absolute positional encoding:&lt;/p>
&lt;ul>
&lt;li>Don&amp;rsquo;t capture relative position between tokens&lt;/li>
&lt;li>While absolute positional encoding captures the positional information for a word, it does not capture the positional information for the entire sentence&lt;/li>
&lt;/ul>
&lt;p>Rotary Positional Encoding is a type of position encoding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in self-attention formulation&lt;/p>
&lt;p>Previously, we generated a separate positional encoding vector and added it to our token embedding prior to the Q, K, and V projections. By adding the positional information directly to the token embedding, we pollute the semantic information with positional information. RoPE instead applies a position-dependent rotation:
$$
R(m\theta) =
\begin{bmatrix}
\cos(m\theta) &amp;amp; -\sin(m\theta) \\
\sin(m\theta) &amp;amp; \cos(m\theta)
\end{bmatrix}
$$
&lt;figure>&lt;img src="https://heyyviv.github.io/rope_2.png">
&lt;/figure>

The challenge with this rotation is that it is defined only in 2D. Hence, the authors apply it to consecutive pairs of embedding dimensions, rotating each pair independently. This is why RoPE requires embeddings of even dimension.&lt;/p>
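&lt;p>A rough NumPy sketch of applying the rotation blockwise to consecutive dimension pairs, assuming the usual per-pair frequencies theta_i = base^(-2i/d); the function and variable names are illustrative:&lt;/p>

```python
import numpy as np

def rope_rotate(x, position, base=10000.0):
    # Rotate each (even, odd) dimension pair of x by the position-
    # dependent angle m * theta_i, as in rotary positional embeddings.
    # Assumes an even embedding dimension.
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out
```

Because each pair undergoes a pure rotation, the norm of every pair (and hence of the whole vector) is preserved, which is why the semantic content of the embedding is not "polluted" the way additive encodings can be.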
&lt;figure>&lt;img src="https://heyyviv.github.io/rope_1.png">
&lt;/figure>

&lt;h1 id="kv-caching">KV Caching&lt;/h1>
&lt;p>&lt;figure>&lt;img src="https://heyyviv.github.io/kv_1.png">
&lt;/figure>

When we implement an LLM text generation function, we typically only use the last generated token from each step. However, the visualization above highlights one of the main inefficiencies on a conceptual level. This inefficiency (or redundancy) becomes more clear if we zoom in on the attention mechanism itself.
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_2.png">
&lt;/figure>

LLMs generate one word (or token) at a time. Suppose the LLM generated the word “fast” so that the prompt for the next round becomes “Time flies fast”. This is illustrated in the next figure below:
As we can see, based on comparing the previous 2 figures, the keys and value vectors for the first two tokens are exactly the same, and it would be wasteful to recompute them in each next-token text generation round.&lt;/p>
&lt;p>Now, the idea of the KV cache is to implement a caching mechanism that stores the previously generated key and value vectors for reuse, which helps us to avoid these unnecessary recomputations.&lt;/p>
&lt;p>Notice the redundancy: tokens “Time” and “flies” are recomputed at every new generation step. The KV cache resolves this inefficiency by storing and reusing previously computed key and value vectors:&lt;/p>
&lt;ul>
&lt;li>Initially, the model computes and caches key and value vectors for the input tokens.&lt;/li>
&lt;li>For each new token generated, the model only computes key and value vectors for that specific token.&lt;/li>
&lt;li>Previously computed vectors are retrieved from the cache to avoid redundant computations.&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_3.png">
&lt;/figure>

&lt;p>As sequence length increases, the benefits and downsides of a KV cache become more pronounced in the following ways:&lt;/p>
&lt;ul>
&lt;li>[Good] Computational efficiency increases: Without caching, the model must recompute keys and values for all t previous tokens at every generation step, so the cumulative key/value work scales quadratically, O(n²). With a cache, each key and value is computed once and then reused, reducing that cumulative work to linear, O(n).&lt;/li>
&lt;li>[Bad] Memory usage increases linearly: Each new token appends to the KV cache. For long sequences and larger LLMs, the cumulative KV cache grows larger, which can consume a significant or even prohibitive amount of (GPU) memory. As a workaround, we can truncate the KV cache, but this adds even more complexity (but again, it may well be worth it when deploying LLMs.)&lt;/li>
&lt;/ul>
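&lt;p>The caching mechanism described in the bullets above can be sketched as a simple append-only store per attention layer (an illustrative sketch, not a production implementation):&lt;/p>

```python
import numpy as np

class KVCache:
    # Minimal per-layer KV cache: key/value vectors for past tokens are
    # stored once and reused, so each step only projects the new token.
    def __init__(self):
        self.keys = []
        self.values = []

    def update(self, k_new, v_new):
        # Append this step's key/value vectors and return the full
        # history for the attention computation.
        self.keys.append(k_new)
        self.values.append(v_new)
        return np.stack(self.keys), np.stack(self.values)
```

Note that the stored history grows by one entry per generated token, which is exactly the linear memory growth described in the [Bad] bullet above.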
&lt;h1 id="grouped-query-attention">Grouped Query Attention&lt;/h1>
&lt;p>Grouped-query attention (GQA) is a simple approach that blends elements of multi-head attention (MHA) and multi-query attention (MQA) to create a more efficient attention mechanism. The mathematical framework of GQA can be understood as follows:&lt;/p>
&lt;p>Division into Groups: In GQA, the query heads (Q) from a traditional multi-head model are divided into G groups. Each group is assigned a single key (K) and value (V) head. This configuration is denoted as GQA-G, where G represents the number of groups.
We mean-pool the key and value projection matrices of the original heads within each group to convert a multi-head model into a GQA model. This technique averages the projection matrices of each head in a group, resulting in a single key and value projection for that group.&lt;/p>
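&lt;p>The mean-pooling conversion described above can be sketched as follows. This is an illustrative NumPy sketch; the projection-tensor layout (heads, d_model, d_head) is an assumption for clarity, not a fixed convention:&lt;/p>

```python
import numpy as np

def group_kv_heads(k_proj, num_groups):
    # k_proj: (num_heads, d_model, d_head) per-head key projection
    # matrices. Mean-pool the heads within each group so that each
    # group shares a single key projection, as in GQA uptraining.
    num_heads = k_proj.shape[0]
    assert num_heads % num_groups == 0
    heads_per_group = num_heads // num_groups
    grouped = k_proj.reshape(num_groups, heads_per_group, *k_proj.shape[1:])
    return grouped.mean(axis=1)   # (num_groups, d_model, d_head)
```

The same pooling is applied to the value projections; the query heads are left untouched and simply share their group&amp;rsquo;s key/value head.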
&lt;p>By utilizing GQA, the model maintains a balance between MHA quality and MQA speed. Because there are fewer key-value pairs, memory bandwidth and data loading needs are minimized. The choice of G presents a trade-off: more groups (closer to MHA) result in higher quality but slower performance, whereas fewer groups (near to MQA) boost speed at the risk of sacrificing quality. Furthermore, as the model size grows, GQA allows for a proportional decrease in memory bandwidth and model capacity, corresponding with the model’s scale. In contrast, for bigger models, the reduction to a single key and value head can be unduly severe in MQA.
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_4.png">
&lt;/figure>
&lt;/p></description></item><item><title>Training_LLM</title><link>https://heyyviv.github.io/blog/training_llm/</link><pubDate>Fri, 23 May 2025 15:12:55 +0530</pubDate><guid>https://heyyviv.github.io/blog/training_llm/</guid><description>&lt;h1 id="training-on-one-gpu">Training on One GPU&lt;/h1>
&lt;p>When a model is trained, there are three phases:&lt;/p>
&lt;ul>
&lt;li>A forward pass, which passes inputs through the model to yield its outputs&lt;/li>
&lt;li>A backward pass to compute the gradients&lt;/li>
&lt;li>An optimization step using the gradients to update the parameters&lt;/li>
&lt;/ul>
&lt;p>The batch size (bs) is one of the important hyperparameters for model training; it affects both model convergence and throughput.&lt;/p>
&lt;p>A small batch size can be useful early in training to quickly move through the training landscape to reach an optimal learning point. However, further along in the model training, small batch sizes will keep gradients noisy, and the model may not be able to converge to the most optimal final performance. At the other extreme, a large batch size, while giving very accurate gradient estimations, will tend to make less use of each training token, rendering convergence slower and potentially wasting compute resources.&lt;/p>
&lt;p>Batch size also affects the time it takes to train on a given text dataset: a small batch size will require more optimizer steps to train on the same amount of samples. Optimizer steps are costly (in compute time), and the total time to train will thus increase compared to using a larger batch size. That being said, note that the batch size can often be adjusted quite widely around the optimal batch size without major impact on the performance of the model - that is, the sensitivity of final model performance to the exact batch size value is usually rather low around the optimal batch size.
In the LLM pretraining community, batch sizes are commonly reported in terms of tokens rather than number of samples: the batch size in tokens (bst) is the batch size in samples (bs) multiplied by the model input sequence length (seq), i.e. bst = bs * seq.
Llama 1 was trained with a batch size of ~4M tokens for 1.4 trillion tokens, while DeepSeek was trained with a batch size of ~60M tokens for 14 trillion tokens.&lt;/p>
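&lt;p>Plugging in illustrative numbers (hypothetical values, not taken from any particular training run) makes the formula concrete:&lt;/p>

```python
bs = 1024           # sequences per batch (illustrative)
seq = 4096          # tokens per sequence (illustrative)
bst = bs * seq      # batch size in tokens
print(bst)          # 4194304, i.e. roughly a ~4M-token batch
```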
&lt;p>We can&amp;rsquo;t calculate the exact memory usage of a model because:&lt;/p>
&lt;ul>
&lt;li>CUDA kernels typically require 1-2 GB of GPU memory&lt;/li>
&lt;li>Some memory is used for buffers and intermediate results, and some memory can&amp;rsquo;t be used due to fragmentation&lt;/li>
&lt;/ul>
&lt;p>Why can we face out-of-memory (OOM) issues when training large models? When training a neural network, we store several items in memory:&lt;/p>
&lt;ul>
&lt;li>Model weights&lt;/li>
&lt;li>Model gradients&lt;/li>
&lt;li>Optimizer states&lt;/li>
&lt;li>Activations needed to compute the gradients&lt;/li>
&lt;/ul>
&lt;p>First the activations increase quickly as we do the forward pass, then during the backward pass the gradients build up, and as the backward pass propagates, the stored activations used to compute the gradients are progressively cleared. Finally, we perform optimization, during which we need all the gradients, and then update the optimizer states before we start the next forward pass.&lt;/p>
&lt;p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part that will blow up when we increase our batch size or train with longer sequences.&lt;/p>
&lt;p>These graphs tell a striking story: for short sequences (or small batch sizes), memory usage for activations is almost negligible, but from around 2-4k tokens they start to take up a significant amount of memory, while usage for parameters, gradients, and optimizer states (as we’ll discuss later) is roughly independent of the sequence length and batch size.
The general idea behind activation recomputation – also called gradient checkpointing or rematerialization – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute these on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g., feedforward, LayerNorm, etc.), so that we can use them during the backward pass to compute gradients. When we use recomputation, we typically only store activations at a few key points in the model architecture, discarding the rest of the activations and recomputing them on the fly during the backward pass from the nearest saved activations. Basically, we perform a sub-part of the forward pass again, to trade off memory for compute.&lt;/p>
&lt;ul>
&lt;li>FULL : We checkpoint activations at the transition point between each layer of the Transformer model. This is usually called the “full” strategy since it requires a forward pass through each layer, essentially adding a full forward pass during the backward pass. This strategy saves the most memory but is the most expensive one in terms of compute. It typically increases the compute cost and time by up to 30-40%, which is very noticeable.&lt;/li>
&lt;li>Selective: In general, we can do better than full. The authors of the recomputation paper did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of floating-point operations per second (FLOPS). It turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model, this means a 70% activation memory reduction at a 2.7% compute cost.&lt;/li>
&lt;/ul>
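&lt;p>The checkpoint-and-recompute idea can be sketched in plain Python (a conceptual sketch with no autograd: layers are arbitrary callables, and &lt;code>every&lt;/code> controls the checkpoint spacing):&lt;/p>

```python
def forward_with_checkpoints(x, layers, every):
    # Store activations only every `every` layers; the rest are
    # discarded and recomputed on demand during the backward pass.
    checkpoints = {0: x}
    h = x
    for i, layer in enumerate(layers, start=1):
        h = layer(h)
        if i % every == 0:
            checkpoints[i] = h
    return h, checkpoints

def recompute_activation(target, checkpoints, layers, every):
    # Recompute the activation at layer `target` by re-running the
    # forward pass from the nearest stored checkpoint at or below it.
    start = (target // every) * every
    h = checkpoints[start]
    for i in range(start, target):
        h = layers[i](h)
    return h
```

The memory saved is the set of discarded activations; the cost is the extra partial forward pass, exactly the trade-off described above.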
&lt;p>Gradient accumulation is a very straightforward method to avoid memory explosion that consists of splitting a batch into micro-batches. We then perform forward and backward passes successively on each micro-batch, compute the gradients, and, as the name suggests, sum the gradients of all micro-batches before we perform optimization. In practice, the optimization step is conducted not on the sum but on the average of the gradients, so that the result is independent of the number of gradient accumulation steps.
Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by processing smaller micro-batches sequentially. This reduces stored activations and gradients since only one micro-batch&amp;rsquo;s worth of activations needs to be kept in memory at a time, which helps reduce the overall activation memory footprint.
One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training.&lt;/p>
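&lt;p>A minimal sketch of the accumulation loop (framework-agnostic; &lt;code>grad_fn&lt;/code> stands in for a forward/backward pass that returns a gradient array for one micro-batch):&lt;/p>

```python
import numpy as np

def accumulate_gradients(micro_batches, grad_fn):
    # Sum the per-micro-batch gradients, then average so the result
    # is independent of the number of accumulation steps.
    total = None
    for micro_batch in micro_batches:
        g = grad_fn(micro_batch)
        total = g if total is None else total + g
    return total / len(micro_batches)
```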
&lt;h1 id="data-parallelism">Data Parallelism&lt;/h1>
&lt;p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replicas “model instances”) and run forward and backward passes on different micro-batches of data in parallel on each GPU - hence the name data parallelism.
Using a different micro-batch for each GPU means we’ll have different gradients on each GPU, so to keep the model instances in sync across the different GPUs, we&amp;rsquo;ll average the gradients from the model instances using an operation called “all-reduce.” This operation takes place during the backward pass, before the optimizer step.&lt;/p></description></item></channel></rss>