<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>LLM on Vivek's Field Notes</title><link>https://heyyviv.github.io/tags/llm/</link><description>Recent content in LLM on Vivek's Field Notes</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Tue, 10 Feb 2026 21:24:16 +0530</lastBuildDate><atom:link href="https://heyyviv.github.io/tags/llm/index.xml" rel="self" type="application/rss+xml"/><item><title>AI Agents Notes</title><link>https://heyyviv.github.io/blog/ai-agents-notes/</link><pubDate>Tue, 10 Feb 2026 21:24:16 +0530</pubDate><guid>https://heyyviv.github.io/blog/ai-agents-notes/</guid><description>&lt;h1 id="worlflow">Workflow&lt;/h1>
&lt;p>Prompt chaining decomposes a task into a sequence of steps, where each LLM call processes the output of the previous one. You can add programmatic checks on any intermediate steps to ensure that the process is still on track.&lt;/p>
&lt;p>When to use this workflow: This workflow is ideal for situations where the task can be easily and cleanly decomposed into fixed subtasks. The main goal is to trade off latency for higher accuracy, by making each LLM call an easier task.&lt;/p>
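&lt;p>Prompt chaining with programmatic checks can be sketched in a few lines of Python; &lt;code>llm&lt;/code> here is a hypothetical stand-in for a real model call, and the gates are plain Python predicates:&lt;/p>

```python
def run_chain(llm, steps, gates, task):
    """Run each step's prompt on the previous step's output.
    A gate is an optional predicate checked after its step; if it fails,
    the chain stops early instead of compounding the error."""
    output = task
    for step, gate in zip(steps, gates):
        output = llm(f"{step}\n\nInput:\n{output}")
        if gate is not None and not gate(output):
            raise ValueError(f"Gate failed after step: {step!r}")
    return output
```

&lt;p>A gate might check that an intermediate output parses as JSON or stays under a length budget, which is the kind of programmatic check described above.&lt;/p>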
&lt;p>Routing classifies an input and directs it to a specialized follow-up task. This workflow allows for separation of concerns and for building more specialized prompts. Without it, optimizing for one kind of input can hurt performance on other inputs.&lt;/p>
&lt;p>When to use this workflow: Routing works well for complex tasks where there are distinct categories that are better handled separately, and where classification can be handled accurately, either by an LLM or a more traditional classification model/algorithm.&lt;/p>
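&lt;p>A minimal routing sketch; &lt;code>classify&lt;/code> stands in for an LLM or traditional classifier returning a category, and each category maps to its own specialized handler:&lt;/p>

```python
def route(classify, handlers, request):
    """Classify the request, then dispatch it to the handler (specialized
    prompt) registered for that category, falling back if none matches."""
    category = classify(request)
    handler = handlers.get(category, handlers["fallback"])
    return handler(request)
```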
&lt;p>LLMs can sometimes work simultaneously on a task and have their outputs aggregated programmatically. This workflow, parallelization, manifests in two key variations:&lt;/p>
&lt;ul>
&lt;li>Sectioning: Breaking a task into independent subtasks run in parallel.&lt;/li>
&lt;li>Voting: Running the same task multiple times to get diverse outputs.&lt;/li>
&lt;/ul>
&lt;p>When to use this workflow: Parallelization is effective when the divided subtasks can be parallelized for speed, or when multiple perspectives or attempts are needed for higher confidence results. For complex tasks with multiple considerations, LLMs generally perform better when each consideration is handled by a separate LLM call, allowing focused attention on each specific aspect.&lt;/p>
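&lt;p>The voting variant can be sketched as below; &lt;code>llm&lt;/code> is again a stand-in for a (non-deterministic) model call, and the majority answer wins:&lt;/p>

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def vote(llm, prompt, n=5):
    """Run the same prompt n times in parallel and majority-vote the answers.
    Returns the winning answer and the fraction of runs that agreed with it."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        answers = list(pool.map(llm, [prompt] * n))
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n
```

&lt;p>The agreement fraction doubles as a rough confidence signal: low agreement suggests the task is ambiguous or the prompt needs work.&lt;/p>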
&lt;p>In the orchestrator-workers workflow, a central LLM dynamically breaks down tasks, delegates them to worker LLMs, and synthesizes their results. When to use this workflow: This workflow is well-suited for complex tasks where you can’t predict the subtasks needed (in coding, for example, the number of files that need to be changed and the nature of the change in each file likely depend on the task). While it’s topographically similar to parallelization, the key difference is its flexibility: subtasks aren&amp;rsquo;t predefined, but are determined by the orchestrator based on the specific input.&lt;/p>
&lt;p>In the evaluator-optimizer workflow, one LLM call generates a response while another provides evaluation and feedback in a loop.
When to use this workflow: This workflow is particularly effective when we have clear evaluation criteria, and when iterative refinement provides measurable value. The two signs of good fit are, first, that LLM responses can be demonstrably improved when a human articulates their feedback; and second, that the LLM can provide such feedback. This is analogous to the iterative writing process a human writer might go through when producing a polished document.&lt;/p>
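&lt;p>A minimal sketch of the evaluator-optimizer loop, with &lt;code>generate&lt;/code> and &lt;code>evaluate&lt;/code> as hypothetical stand-ins for the two LLM calls:&lt;/p>

```python
def refine(generate, evaluate, task, max_rounds=3):
    """One call drafts; another critiques. Regenerate with the feedback
    until the evaluator accepts the draft or we run out of rounds."""
    draft = generate(task, feedback=None)
    for _ in range(max_rounds):
        accepted, feedback = evaluate(task, draft)
        if accepted:
            break
        draft = generate(task, feedback=feedback)
    return draft
```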
&lt;h1 id="text-to-sql-in-pinterest">Text to SQL in Pinterest&lt;/h1>
&lt;p>The user asks an analytical question, choosing the tables to be used.&lt;/p>
&lt;ul>
&lt;li>The relevant table schemas are retrieved from the table metadata store.&lt;/li>
&lt;li>The question, selected SQL dialect, and table schemas are compiled into a Text-to-SQL prompt.&lt;/li>
&lt;li>The prompt is fed into the LLM.&lt;/li>
&lt;li>A streaming response is generated and displayed to the user.&lt;/li>
&lt;/ul>
&lt;p>The table schema acquired from the metadata store includes:&lt;/p>
&lt;ul>
&lt;li>Table name&lt;/li>
&lt;li>Table description&lt;/li>
&lt;li>Columns, each with:
&lt;ul>
&lt;li>Column name&lt;/li>
&lt;li>Column type&lt;/li>
&lt;li>Column description&lt;/li>
&lt;/ul>
&lt;/li>
&lt;/ul>
&lt;h2 id="low-cardinality-columns">Low-Cardinality Columns&lt;/h2>
&lt;p>Certain analytical queries, such as “how many active users are on the ‘web’ platform”, may generate SQL queries that do not conform to the database’s actual values if generated naively. For example, the where clause in the response might be &lt;code>where platform='web'&lt;/code> as opposed to the correct &lt;code>where platform='WEB'&lt;/code>. To address such issues, unique values of low-cardinality columns which would frequently be used for this kind of filtering are processed and incorporated into the table schema, so that the LLM can make use of this information to generate precise SQL queries.&lt;/p>
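&lt;p>A sketch of this enrichment step, assuming a simple dict-based schema; the field names are illustrative, not Pinterest's actual metadata format:&lt;/p>

```python
def annotate_low_cardinality(schema, distinct_values, max_values=20):
    """Attach the distinct values of low-cardinality columns to the schema
    so the prompt can show the LLM the literal values (e.g. 'WEB', not 'web').
    Columns with many distinct values (or none known) are left untouched."""
    for col in schema["columns"]:
        values = distinct_values.get(col["name"])
        if values is not None and len(values) <= max_values:
            col["values"] = sorted(values)
    return schema
```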
&lt;h2 id="context-window-limit">Context Window Limit&lt;/h2>
&lt;p>Extremely large table schemas might exceed the typical context window limit. To address this problem, we employed a few techniques:&lt;/p>
&lt;ul>
&lt;li>Reduced version of the table schema: this includes only crucial elements such as the table name, column name, and column type.&lt;/li>
&lt;li>Column pruning: columns are tagged in the metadata store, and we exclude certain ones from the table schema based on their tags.&lt;/li>
&lt;/ul>
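&lt;p>Both techniques can be sketched together; the tag names and schema layout here are assumptions for illustration:&lt;/p>

```python
def prune_schema(schema, exclude_tags=frozenset({"deprecated", "internal"}), reduced=False):
    """Shrink a table schema to fit the context window: drop columns whose
    tags are excluded, and optionally keep only name/type (the 'reduced'
    version of the schema)."""
    cols = [c for c in schema["columns"]
            if not (set(c.get("tags", [])) & exclude_tags)]
    if reduced:
        cols = [{"name": c["name"], "type": c["type"]} for c in cols]
    return {**schema, "columns": cols}
```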
&lt;pre tabindex="0">&lt;code>You are a {dialect} expert.

Please help to generate a {dialect} query to answer the question. Your response should ONLY be based on the given context and follow the response guidelines and format instructions.

===Tables
{table_schemas}

===Original Query
{original_query}

===Response Guidelines
1. If the provided context is sufficient, please generate a valid query without any explanations for the question. The query should start with a comment containing the question being asked.
2. If the provided context is insufficient, please explain why it can&amp;#39;t be generated.
3. Please use the most relevant table(s).
4. Please format the query before responding.
5. Please always respond with a valid well-formed JSON object in the following format.

===Response Format
{{
 &amp;#34;query&amp;#34;: &amp;#34;A generated SQL query when context is sufficient.&amp;#34;,
 &amp;#34;explanation&amp;#34;: &amp;#34;An explanation of failing to generate the query.&amp;#34;
}}

===Question
{question}
&lt;/code>&lt;/pre>&lt;p>Spider dataset: &lt;a href="https://arxiv.org/pdf/2204.00498">https://arxiv.org/pdf/2204.00498&lt;/a>&lt;/p>
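&lt;p>Assuming the template above is stored as a Python string, the doubled braces in the response-format section are what let &lt;code>str.format&lt;/code> emit literal JSON braces. This abridged version (not the full Pinterest prompt) shows the mechanics:&lt;/p>

```python
# Abridged template: {{ and }} survive str.format as literal braces.
TEMPLATE = """You are a {dialect} expert.

===Tables
{table_schemas}

===Question
{question}

===Response Format
{{
 "query": "A generated SQL query when context is sufficient.",
 "explanation": "An explanation of failing to generate the query."
}}"""

def build_prompt(dialect, table_schemas, question):
    """Fill the Text-to-SQL template with the user's dialect, schemas, and question."""
    return TEMPLATE.format(dialect=dialect, table_schemas=table_schemas, question=question)
```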
&lt;p>An offline job generates a vector index of tables’ summaries and historical queries against them. If the user does not specify any tables:&lt;/p>
&lt;ul>
&lt;li>The question is transformed into embeddings, and a similarity search is conducted against the vector index to infer the top N suitable tables.&lt;/li>
&lt;li>The top N tables, along with the table schemas and the analytical question, are compiled into a prompt for the LLM to select the top K most relevant tables.&lt;/li>
&lt;li>The top K tables are returned to the user for validation or alteration.&lt;/li>
&lt;li>The standard Text-to-SQL process resumes with the user-confirmed tables.&lt;/li>
&lt;/ul>
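&lt;p>The retrieval step can be illustrated with a toy in-memory index and cosine similarity; the production system uses OpenSearch, so this is only a sketch of the idea:&lt;/p>

```python
import math

def top_n_tables(question_vec, index, n=3):
    """Rank tables by cosine similarity between the question embedding and
    each table's summary embedding; `index` maps table name -> embedding."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))
    return sorted(index, key=lambda t: cos(question_vec, index[t]), reverse=True)[:n]
```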
&lt;h2 id="offline-vector-index-creation">Offline Vector Index Creation&lt;/h2>
&lt;h3 id="table-summarization">Table Summarization&lt;/h3>
&lt;p>There is an ongoing table standardization effort at Pinterest to add tiering for the tables. We index only top-tier tables, promoting the use of these higher-quality datasets. The table summarization generation process involves the following steps:&lt;/p>
&lt;ul>
&lt;li>Retrieve the table schema from the table metadata store.&lt;/li>
&lt;li>Gather the most recent sample queries utilizing the table.&lt;/li>
&lt;li>Based on the context window, incorporate as many sample queries as possible into the table summarization prompt, along with the table schema.&lt;/li>
&lt;li>Forward the prompt to the LLM to create the summary.&lt;/li>
&lt;li>Generate and store embeddings in the vector store.&lt;/li>
&lt;/ul>
&lt;p>The table summary includes a description of the table, the data it contains, and potential use scenarios. Here is the current prompt we are using for table summarization:&lt;/p>
&lt;pre tabindex="0">&lt;code>prompt_template = &amp;#34;&amp;#34;&amp;#34;
You are a data analyst that can help summarize SQL tables.

Summarize below table by the given context.

===Table Schema
{table_schema}

===Sample Queries
{sample_queries}

===Response guideline
 - You shall write the summary based only on provided information.
 - Note that above sampled queries are only small sample of queries and thus not all possible use of tables are represented, and only some columns in the table are used.
 - Do not use any adjective to describe the table. For example, the importance of the table, its comprehensiveness or if it is crucial, or who may be using it. For example, you can say that a table contains certain types of data, but you cannot say that the table contains a &amp;#39;wealth&amp;#39; of data, or that it is &amp;#39;comprehensive&amp;#39;.
 - Do not mention about the sampled query. Only talk objectively about the type of data the table contains and its possible utilities.
 - Please also include some potential usecases of the table, e.g. what kind of questions can be answered by the table, what kind of analysis can be done by the table, etc.
&amp;#34;&amp;#34;&amp;#34;
&lt;/code>&lt;/pre>&lt;h3 id="query-summarization">Query Summarization&lt;/h3>
&lt;p>Besides their role in table summarization, the sample queries associated with each table are also summarized individually, including details such as the query’s purpose and the tables it uses. Here is the prompt we are using:&lt;/p>
&lt;pre tabindex="0">&lt;code>prompt_template = &amp;#34;&amp;#34;&amp;#34;
You are a helpful assistant that can help document SQL queries.

Please document below SQL query by the given table schemas.

===SQL Query
{query}

===Table Schemas
{table_schemas}

===Response Guidelines
Please provide the following list of descriptions for the query:
-The selected columns and their description
-The input tables of the query and the join pattern
-Query&amp;#39;s detailed transformation logic in plain english, and why these transformation are necessary
-The type of filters performed by the query, and why these filters are necessary
-Write very detailed purposes and motives of the query in detail
-Write possible business and functional purposes of the query
&amp;#34;&amp;#34;&amp;#34;
&lt;/code>&lt;/pre>&lt;h2 id="nlp-table-search">NLP Table Search&lt;/h2>
&lt;p>When a user asks an analytical question, we convert it into embeddings using the same embedding model, then conduct a search against both the table and query vector indices. We use OpenSearch as the vector store, relying on its built-in similarity search.&lt;/p>
&lt;p>Considering that multiple tables can be associated with a query, a single table could appear multiple times in the similarity search results. Currently, we utilize a simplified strategy to aggregate and score them. Table summaries carry more weight than query summaries, a scoring strategy that could be adjusted in the future.&lt;/p>
&lt;p>Other than being used in the Text-to-SQL, this NLP-based table search is also used in the general table search in Querybook.&lt;/p>
&lt;h1 id="rag">RAG&lt;/h1></description></item><item><title>About LLM part 1</title><link>https://heyyviv.github.io/blog/about-llm-part-1/</link><pubDate>Thu, 26 Jun 2025 16:53:53 +0530</pubDate><guid>https://heyyviv.github.io/blog/about-llm-part-1/</guid><description>&lt;h1 id="root-mean-square-layer-normalization">Root Mean Square Layer Normalization&lt;/h1>
&lt;p>Layer normalization (LayerNorm) has been successfully applied to various deep neural networks to help stabilize training and boost model convergence because of its capability in handling re-centering and re-scaling of both inputs and weight matrix. However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network.
LayerNorm was widely adopted because of its simplicity, its independence across training cases, and its ability to handle variable-length inputs, unlike BatchNorm.
Unfortunately, incorporating LayerNorm raises computational overhead. Although this is negligible for small and shallow neural models with few normalization layers, the problem becomes severe as the underlying networks grow larger and deeper. As a result, the efficiency gain from faster and more stable training (in terms of number of training steps) is counterbalanced by the increased computational cost per training step, which diminishes the net efficiency.
One major feature of LayerNorm widely regarded as contributing to this stabilization is its invariance to re-centering and re-scaling; RMSNorm hypothesizes that only the re-scaling invariance actually matters.&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://heyyviv.github.io/rmsnorm_1.png">
&lt;/figure>

RMSNorm keeps only the re-scaling invariance and regularizes the summed inputs simply according to the root mean square (RMS) statistic, scaling the result by a learnable gain $g_i$:
$$
\mathrm{RMS}(\mathbf{x}) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2},
\qquad
\bar{x}_i = \frac{x_i}{\mathrm{RMS}(\mathbf{x})}\, g_i
$$&lt;/p>
&lt;p>A well-known explanation of the success of LayerNorm is its re-centering and re-scaling invariance
property. The former enables the model to be insensitive to shift noises on both inputs and weights,
and the latter keeps the output representations intact when both inputs and weights are randomly
scaled.&lt;/p>
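&lt;p>A minimal pure-Python RMSNorm following the RMS equation above; the small &lt;code>eps&lt;/code> for numerical stability is a standard implementation detail, not part of the equation:&lt;/p>

```python
import math

def rms_norm(x, gain=None, eps=1e-8):
    """RMSNorm: rescale x by its root mean square. Unlike LayerNorm there is
    no mean subtraction; `gain` is the learnable per-dimension scale g."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    gain = gain or [1.0] * len(x)
    return [g * v / rms for g, v in zip(gain, x)]
```

&lt;p>Note the re-scaling invariance: multiplying every input by a constant leaves the output unchanged, which is exactly the property RMSNorm retains from LayerNorm.&lt;/p>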
&lt;h1 id="positional-encoding">Positional Encoding&lt;/h1>
&lt;p>Desirable Properties&lt;/p>
&lt;ul>
&lt;li>Each position needs a unique encoding that remains consistent regardless of sequence length&lt;/li>
&lt;li>The relationship between positions should be mathematically simple. If we know the encoding for position p, it should be straightforward to compute the encoding for position p+k, making it easier for the model to learn positional patterns.&lt;/li>
&lt;li>It would be ideal if our positional encodings could be drawn from a deterministic process. This should allow the model to learn the mechanism behind our encoding scheme efficiently.&lt;/li>
&lt;/ul>
&lt;p>Drawbacks of absolute positional encoding:&lt;/p>
&lt;ul>
&lt;li>Don&amp;rsquo;t capture relative position between tokens&lt;/li>
&lt;li>While absolute positional encoding captures the positional information for a word, it does not capture the positional information for the entire sentence&lt;/li>
&lt;/ul>
&lt;p>Rotary Positional Encoding is a type of position encoding that encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in self-attention formulation&lt;/p>
&lt;p>Previously, we generated a separate positional encoding vector and added it to the token embedding prior to the Q, K, and V projections. By adding the positional information directly to the token embedding, we pollute the semantic information with the positional information. RoPE instead rotates the query and key vectors by an angle proportional to the token's position m:
$$
R(m\theta) =
\begin{bmatrix}
\cos(m\theta) &amp;amp; -\sin(m\theta) \\
\sin(m\theta) &amp;amp; \cos(m\theta)
\end{bmatrix}
$$
&lt;figure>&lt;img src="https://heyyviv.github.io/rope_2.png">
&lt;/figure>

The challenge is that this rotation is only defined for 2D vectors. Hence, the authors apply it to consecutive pairs of dimensions, which is why RoPE requires an even embedding dimension.&lt;/p>
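&lt;p>A sketch of applying RoPE to a single token embedding, rotating each consecutive pair of dimensions; the frequency schedule $\theta_i = \mathrm{base}^{-2i/d}$ follows the RoFormer paper:&lt;/p>

```python
import math

def rope(x, m, base=10000.0):
    """Rotate each pair (x[2i], x[2i+1]) by angle m * theta_i, encoding the
    absolute position m so that dot products depend only on relative position."""
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even embedding dimension"
    out = []
    for i in range(d // 2):
        theta = m * base ** (-2 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        x1, x2 = x[2 * i], x[2 * i + 1]
        out += [x1 * c - x2 * s, x1 * s + x2 * c]
    return out
```

&lt;p>Because each rotation is orthogonal, the norm of the embedding is preserved, so the positional information does not distort the magnitude of the semantic vector.&lt;/p>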
&lt;figure>&lt;img src="https://heyyviv.github.io/rope_1.png">
&lt;/figure>

&lt;h1 id="kv-caching">KV Caching&lt;/h1>
&lt;p>&lt;figure>&lt;img src="https://heyyviv.github.io/kv_1.png">
&lt;/figure>

When we implement an LLM text generation function, we typically only use the last generated token from each step. However, the visualization above highlights one of the main inefficiencies on a conceptual level. This inefficiency (or redundancy) becomes more clear if we zoom in on the attention mechanism itself.
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_2.png">
&lt;/figure>

LLMs generate one word (or token) at a time. Suppose the LLM generated the word “fast”, so that the prompt for the next round becomes “Time flies fast”.
Comparing the two figures above, the key and value vectors for the first two tokens are exactly the same, and it would be wasteful to recompute them in each next-token generation round.&lt;/p>
&lt;p>Now, the idea of the KV cache is to implement a caching mechanism that stores the previously generated key and value vectors for reuse, which helps us to avoid these unnecessary recomputations.&lt;/p>
&lt;p>Notice the redundancy: tokens “Time” and “flies” are recomputed at every new generation step. The KV cache resolves this inefficiency by storing and reusing previously computed key and value vectors:&lt;/p>
&lt;ul>
&lt;li>Initially, the model computes and caches key and value vectors for the input tokens.&lt;/li>
&lt;li>For each new token generated, the model only computes key and value vectors for that specific token.&lt;/li>
&lt;li>Previously computed vectors are retrieved from the cache to avoid redundant computations.&lt;/li>
&lt;/ul>
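&lt;p>The caching logic in the steps above can be sketched as follows; &lt;code>wk&lt;/code> and &lt;code>wv&lt;/code> stand in for the real key/value weight-matrix multiplies:&lt;/p>

```python
class KVCache:
    """Toy single-head KV cache: project each new token exactly once and
    reuse all previously cached key/value vectors."""
    def __init__(self, wk, wv):
        self.wk, self.wv = wk, wv
        self.keys, self.values = [], []

    def step(self, token_embedding):
        # Only the newest token is projected; cached K/V are reused as-is.
        self.keys.append(self.wk(token_embedding))
        self.values.append(self.wv(token_embedding))
        return self.keys, self.values
```

&lt;p>This is also where the memory cost comes from: the two lists grow by one entry per generated token, which is the linear growth described above.&lt;/p>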
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_3.png">
&lt;/figure>

&lt;p>As sequence length increases, the benefits and downsides of a KV cache become more pronounced in the following ways:&lt;/p>
&lt;ul>
&lt;li>[Good] Computational efficiency increases: Without caching, the attention at step t must compare the new query with t previous keys, so the cumulative work scales quadratically, O(n²). With a cache, each key and value is computed once and then reused, reducing the total per-step complexity to linear, O(n).&lt;/li>
&lt;li>[Bad] Memory usage increases linearly: Each new token appends to the KV cache. For long sequences and larger LLMs, the cumulative KV cache grows larger, which can consume a significant or even prohibitive amount of (GPU) memory. As a workaround, we can truncate the KV cache, but this adds even more complexity (but again, it may well be worth it when deploying LLMs.)&lt;/li>
&lt;/ul>
&lt;h1 id="grouped-query-attention">Grouped Query Attention&lt;/h1>
&lt;p>Grouped-query attention (GQA) is a simple approach that blends elements of multi-head attention (MHA) and multi-query attention (MQA) to create a more efficient attention mechanism. The mathematical framework of GQA can be understood as follows:&lt;/p>
&lt;p>Division into Groups: In GQA, the query heads (Q) from a traditional multi-head model are divided into G groups. Each group is assigned a single key (K) and value (V) head. This configuration is denoted as GQA-G, where G represents the number of groups.
We mean-pool the key and value projection matrices of the original heads within each group to convert a multi-head model into a GQA model. This technique averages the projection matrices of each head in a group, resulting in a single key and value projection for that group.&lt;/p>
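&lt;p>The mean-pooling conversion can be sketched for one projection (keys or values), with heads represented as equal-sized nested lists rather than real tensors:&lt;/p>

```python
def group_kv_heads(k_heads, n_groups):
    """Convert MHA projection matrices to GQA-G by averaging the matrices of
    the heads within each group, yielding one matrix per group.
    len(k_heads) must be divisible by n_groups."""
    per_group = len(k_heads) // n_groups
    grouped = []
    for g in range(n_groups):
        chunk = k_heads[g * per_group:(g + 1) * per_group]
        # elementwise average of the matrices in this group
        grouped.append([
            [sum(m[r][c] for m in chunk) / per_group for c in range(len(chunk[0][0]))]
            for r in range(len(chunk[0]))
        ])
    return grouped
```

&lt;p>With G equal to the number of query heads this degenerates to MHA, and with G=1 to MQA, matching the trade-off described above.&lt;/p>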
&lt;p>By utilizing GQA, the model maintains a balance between MHA quality and MQA speed. Because there are fewer key-value pairs, memory bandwidth and data loading needs are minimized. The choice of G presents a trade-off: more groups (closer to MHA) result in higher quality but slower performance, whereas fewer groups (near to MQA) boost speed at the risk of sacrificing quality. Furthermore, as the model size grows, GQA allows for a proportional decrease in memory bandwidth and model capacity, corresponding with the model’s scale. In contrast, for bigger models, the reduction to a single key and value head can be unduly severe in MQA.
&lt;figure>&lt;img src="https://heyyviv.github.io/kv_4.png">
&lt;/figure>
&lt;/p></description></item><item><title>Training_LLM</title><link>https://heyyviv.github.io/blog/training_llm/</link><pubDate>Fri, 23 May 2025 15:12:55 +0530</pubDate><guid>https://heyyviv.github.io/blog/training_llm/</guid><description>&lt;h1 id="training-on-one-gpu">Training on One GPU&lt;/h1>
&lt;p>When a model is trained, there are three phases:&lt;/p>
&lt;ul>
&lt;li>A forward pass, which passes inputs through the model to yield its outputs&lt;/li>
&lt;li>A backward pass to compute the gradients&lt;/li>
&lt;li>An optimization step using the gradients to update the parameters&lt;/li>
&lt;/ul>
&lt;p>The batch size (bs) is one of the important hyperparameters for model training; it affects both model convergence and throughput.&lt;/p>
&lt;p>A small batch size can be useful early in training to quickly move through the training landscape to reach an optimal learning point. However, further along in the model training, small batch sizes will keep gradients noisy, and the model may not be able to converge to the most optimal final performance. At the other extreme, a large batch size, while giving very accurate gradient estimations, will tend to make less use of each training token, rendering convergence slower and potentially wasting compute resources.&lt;/p>
&lt;p>Batch size also affects the time it takes to train on a given text dataset: a small batch size will require more optimizer steps to train on the same amount of samples. Optimizer steps are costly (in compute time), and the total time to train will thus increase compared to using a larger batch size. That being said, note that the batch size can often be adjusted quite widely around the optimal batch size without major impact on the performance of the model - that is, the sensitivity of final model performance to the exact batch size value is usually rather low around the optimal batch size.
In the LLM pretraining community, batch sizes are commonly reported in terms of tokens rather than number of samples: the batch size in tokens (bst) is the batch size in samples (bs) multiplied by the model input sequence length (seq), i.e. bst = bs * seq.
Llama 1 was trained with a batch size of ~4M tokens for 1.4 trillion tokens, while DeepSeek was trained with a batch size of ~60M tokens for 14 trillion tokens.&lt;/p>
&lt;p>We can&amp;rsquo;t calculate the exact memory usage of a model because:&lt;/p>
&lt;ul>
&lt;li>CUDA kernels typically require 1-2 GB of GPU memory&lt;/li>
&lt;li>Some memory is used for buffers and intermediate results, and there&amp;rsquo;s some memory that can&amp;rsquo;t be used due to fragmentation.&lt;/li>
&lt;/ul>
&lt;p>Why, then, do we face out-of-memory (OOM) issues when training these large models? When training a neural network model, we store several items in memory:&lt;/p>
&lt;ul>
&lt;li>Model weights&lt;/li>
&lt;li>Model gradients&lt;/li>
&lt;li>Optimizer states&lt;/li>
&lt;li>Activations needed to compute the gradients&lt;/li>
&lt;/ul>
&lt;p>First the activations increase quickly as we do the forward pass, then during the backward pass the gradients build up, and as the backward pass propagates, the stored activations used to compute the gradients are progressively cleared. Finally, we perform optimization, during which we need all the gradients, and then update the optimizer states before we start the next forward pass.&lt;/p>
&lt;p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part that will blow up when we increase our batch size or train with longer sequences.&lt;/p>
&lt;p>These graphs tell a striking story: for short sequences (or small batch sizes), memory usage for activations is almost negligible, but from around 2-4k tokens they start to take up a significant amount of memory, while usage for parameters, gradients, and optimizer states (as we’ll discuss later) is roughly independent of the sequence length and batch size.&lt;/p>
&lt;p>The general idea behind activation recomputation – also called gradient checkpointing or rematerialization – is to discard some activations during the forward pass to save memory and spend some extra compute to recompute these on the fly during the backward pass. Without recomputation, we store every hidden state between two learnable operations (e.g., feedforward, LayerNorm, etc.), so that we can use them during the backward pass to compute gradients. When we use recomputation, we typically only store activations at a few key points in the model architecture, discarding the rest of the activations and recomputing them on the fly during the backward pass from the nearest saved activations. Basically, we perform a sub-part of the forward pass again, to trade off memory for compute.&lt;/p>
&lt;ul>
&lt;li>FULL : We checkpoint activations at the transition point between each layer of the Transformer model. This is usually called the “full” strategy since it requires a forward pass through each layer, essentially adding a full forward pass during the backward pass. This strategy saves the most memory but is the most expensive one in terms of compute. It typically increases the compute cost and time by up to 30-40%, which is very noticeable.&lt;/li>
&lt;li>Selective: In general, we can do better than full. The authors of the recomputation paper did a detailed analysis studying which activations grow the largest and have the cheapest recomputation cost in terms of floating-point operations (FLOPs). It turns out that the attention computations fall in that category, and thus we can usually discard them and focus on checkpointing the expensive feedforward computations. For a GPT-3 (175B) model, this means a 70% activation memory reduction at a 2.7% compute cost.&lt;/li>
&lt;/ul>
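&lt;p>The checkpoint-and-recompute idea can be illustrated without autograd by treating layers as plain functions; this toy sketch only shows which activations are stored and how missing ones are rebuilt from the nearest checkpoint:&lt;/p>

```python
def forward_with_checkpoints(layers, x, every=2):
    """Run the forward pass but keep only every `every`-th activation
    (plus the input), discarding the rest to save memory."""
    saved = {0: x}
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % every == 0:
            saved[i + 1] = x
    return x, saved

def recompute(layers, saved, target):
    """Rebuild the activation after layer `target` by re-running the forward
    pass from the nearest checkpoint at or before it, as the backward pass would."""
    start = max(i for i in saved if i <= target)
    x = saved[start]
    for layer in layers[start:target]:
        x = layer(x)
    return x
```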
&lt;p>Gradient accumulation is a very straightforward method to avoid memory explosion that consists of splitting a batch into micro-batches. We then perform forward and backward passes successively on each micro-batch, compute the gradients, and, as the name suggests, sum the gradients of all micro-batches before we perform optimization. In practice, the optimization step is conducted not on the sum but on the average of the gradients, so that the result is independent of the number of gradient accumulation steps.
Gradient accumulation allows us to reduce activation memory, which grows linearly with batch size, by processing smaller micro-batches sequentially. This reduces stored activations and gradients since only one micro-batch&amp;rsquo;s worth of activations needs to be kept in memory at a time, which helps reduce the overall activation memory footprint.
One drawback, however, is that gradient accumulation requires multiple consecutive forward/backward passes per optimization step, thereby increasing the compute overhead and slowing down training.&lt;/p>
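&lt;p>A sketch of the accumulation loop; &lt;code>grad_fn&lt;/code> stands in for a full forward/backward pass that returns per-parameter gradients for one micro-batch:&lt;/p>

```python
def accumulate_gradients(grad_fn, micro_batches):
    """Sum gradients over micro-batches, then average, so one optimizer step
    on the result matches a single large-batch step regardless of how many
    accumulation steps were used."""
    total = None
    for mb in micro_batches:
        g = grad_fn(mb)
        total = g if total is None else [t + gi for t, gi in zip(total, g)]
    return [t / len(micro_batches) for t in total]
```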
&lt;h1 id="data-parallelism">Data Parallelism&lt;/h1>
&lt;p>The idea behind data parallelism (DP) is to replicate the model on several GPUs (we call the replicas “model instances”) and run forward and backward passes on different micro-batches of data in parallel on each GPU - hence the name data parallelism.
Using a different micro-batch for each GPU means we’ll have different gradients on each GPU, so to keep the model instances in sync across the different GPUs, we&amp;rsquo;ll average the gradients from the model instances using an operation called “all-reduce.” This operation takes place during the backward pass, before the optimizer step.&lt;/p></description></item></channel></rss>