A Deep Dive into In-Context Learning | by Aris Tsakpinis | May, 2024
Stepping out of the “comfort zone” — part 2/3 of a deep-dive into domain adaptation approaches for LLMs
Are you exploring how to adapt large language models (LLMs) to your specific domain or use case? This 3-part blog post series explains the motivation for domain adaptation and dives deep into the various options for doing so. Further, it provides a detailed guide for mastering the entire domain adaptation journey, covering the popular tradeoffs along the way.
Part 1: Introduction into domain adaptation — motivation, options, tradeoffs
Part 2: A deep dive into in-context learning — You’re here!
Part 3: A deep dive into fine-tuning
Note: All images, unless otherwise noted, are by the author.
In the first part of this blog post series, we discussed the rapid advancements in generative AI and the emergence of powerful generative models like Claude, GPT-4, Meta LLaMA, and Stable Diffusion. These models have demonstrated remarkable capabilities in content creation, sparking both enthusiasm and concerns about potential risks. We highlighted that while these AI models are powerful, they also have inherent limitations and “comfort zones”: areas where they excel, and areas where their performance can degrade when pushed outside their expertise. This can lead to model responses that fall below the expected quality, potentially resulting in hallucinations, biased outputs, or other undesirable behaviors.
To address these challenges and enable the strategic use of generative AI in enterprises, we introduced three key design principles: Helpfulness, Honesty, and Harmlessness. We also discussed how domain adaptation techniques, such as in-context learning and fine-tuning, can be leveraged to overcome the “comfort zone” limitations of these models and create enterprise-grade, compliant generative AI-powered applications. In this second part, we will dive deeper into the world of in-context learning, exploring how these techniques can be used to transform tasks and move them back into the models’ comfort zones.
In-context learning aims to make use of external tooling to modify the task to be solved in a way that moves it back (or closer) into a model’s comfort zone. In the world of LLMs, this can be done through prompt engineering, which involves infusing source knowledge through the model prompt to transform the overall complexity of a task. It can be executed in a rather static manner (e.g. few-shot prompting), but more sophisticated, dynamic prompt engineering techniques like retrieval-augmented generation (RAG) or Agents have proven to be powerful.
In part 1 of this blog post series, we saw with the example depicted in figure 1 how adding static context like a speaker bio can reduce the complexity of the task the model has to solve, leading to better results. In what follows, we will dive deeper into more advanced concepts of in-context learning.
“The measure of intelligence is the ability to change.” (Albert Einstein)
While the above example with static context infusion works well for static use cases, it does not scale across diverse and complex domains. Assume the scope of our closed Q&A task were not limited to a single person, but covered all speakers of a huge conference and hence hundreds of speaker bios. In this case, manual identification and insertion of the relevant piece of context (i.e., the speaker bio) becomes cumbersome, error-prone, and impractical. In theory, recent models come with huge context windows of 200k tokens or more, fitting not only those hundreds of speaker bios but entire books and knowledge bases. In practice, however, there are plenty of reasons why this is not a desirable approach, such as cost under pay-per-token pricing, compute requirements, and latency.
Luckily, there are plenty of optimized content-retrieval approaches for dynamically identifying exactly the piece of context best suited for ingestion: some are deterministic in nature (e.g., SQL queries on structured data), others are powered by probabilistic systems (e.g., semantic search). Chaining these two components together into an integrated closed Q&A approach with dynamic context retrieval and infusion has proven to be extremely powerful. A huge variety of data sources can be connected this way, from relational or graph databases over vector stores to enterprise systems or real-time APIs. To accomplish this, the identified context piece(s) of highest relevance are extracted and dynamically ingested into the prompt template used against the generative decoder model to accomplish the desired task. Figure 2 shows this exemplarily for a user-facing Q&A application (e.g., a chatbot).
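To make this pattern more tangible, here is a minimal sketch of dynamic context infusion in Python. Note that retrieve_context and call_llm are hypothetical placeholders for whatever retrieval system and model endpoint you actually use; the template itself is only an illustration.

```python
# Minimal sketch of dynamic context infusion. retrieve_context and call_llm
# are hypothetical placeholders for your retrieval system and model endpoint.

PROMPT_TEMPLATE = """You are a helpful conference assistant.
Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

def answer_question(question: str) -> str:
    # 1) Retrieve the most relevant piece(s) of context for this question,
    #    e.g. via a SQL query, semantic search, or an API call.
    context = retrieve_context(question)          # hypothetical retriever
    # 2) Infuse the retrieved context into the prompt template,
    #    turning the open Q&A task into a closed one.
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    # 3) Send the grounded prompt to the generative decoder model.
    return call_llm(prompt)                       # hypothetical LLM client
```

The key design choice is that the prompt template stays fixed while the context slot is filled at request time, so the same application can serve hundreds of speakers (or products, documents, …) without manual prompt changes.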
By far the most popular approach to dynamic prompt engineering is RAG. It works well when context originating from large full-text knowledge bases has to be ingested dynamically. It combines two probabilistic components: an open Q&A task is augmented with dynamic context retrieved via semantic search, turning it into a closed Q&A task.
First, the documents are sliced into chunks of digestible size. Then, an encoder LLM is used to create contextualised embeddings of these snippets, encoding the semantics of every chunk as a vector in a mathematical space. This information is stored in a vector database, which acts as our knowledge base: the vector is used as the primary key, while the text itself, together with optional metadata, is stored alongside it.
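As an illustration, the ingestion side of such a pipeline could look roughly like the sketch below, assuming the sentence-transformers and faiss packages are available; the model name and chunking parameters are arbitrary examples, not recommendations.

```python
# One possible ingestion pipeline: chunk documents, embed the chunks with an
# encoder model, and store vectors plus raw text in a vector index.
from sentence_transformers import SentenceTransformer
import faiss

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder model

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

documents = ["<speaker bio 1>", "<speaker bio 2>"]  # placeholder corpus
chunks = [c for doc in documents for c in chunk(doc)]

# Encode every chunk into the shared vector space
# (normalized, so inner product equals cosine similarity).
embeddings = encoder.encode(chunks, normalize_embeddings=True).astype("float32")

# The vector serves as the lookup key; the raw text is kept alongside for later prompting.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(embeddings)
chunk_store = dict(enumerate(chunks))
```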
(0) When a user submits a question, the input is cleaned and encoded by the very same embedding model, creating a semantic representation of the question in the knowledge base’s vector space.
(1) This embedding is subsequently used for carrying out a similarity search based on vector distance metrics over the entire knowledge base — with the hypothesis that the k snippets with the highest similarity to the user’s question in the vector space are likely best suited for grounding the question with context.
(2) In the next step, these top k snippets are passed to a decoder generative LLM as context alongside the user’s initial question, forming a closed Q&A task.
(3) The LLM answers the question in a grounded way in the style instructed by the application’s system prompt (e.g., chatbot style).
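Continuing the ingestion sketch from above, steps (0) to (3) could be wired up as follows; call_llm again stands in for whichever decoder model and SDK you actually use.

```python
def rag_answer(question: str, k: int = 3) -> str:
    # (0) Encode the user question with the very same embedding model.
    q_emb = encoder.encode([question], normalize_embeddings=True).astype("float32")
    # (1) Similarity search: retrieve the k closest chunks in the vector space.
    _scores, ids = index.search(q_emb, k)
    context = "\n\n".join(chunk_store[i] for i in ids[0])
    # (2) Pass the top-k snippets as context alongside the question (closed Q&A task).
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    # (3) The decoder LLM answers the question in a grounded way.
    return call_llm(prompt)  # hypothetical LLM client
```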
Knowledge Graph-Augmented Generation (KGAG) is another dynamic prompting approach that integrates structured knowledge graphs to transform the task to be solved and hence enhance the factual accuracy and informativeness of language model outputs. This integration can be achieved in several ways.
One such approach, the KGAG framework proposed by Kang et al. (2023), consists of three key components:
(1) The context-relevant subgraph retriever retrieves a relevant subgraph Z from the overall knowledge graph G given the current dialogue history x. To do this, the model defines a retrieval score for each individual triplet z = (eh, r, et) in the knowledge graph, computed as the inner product between embeddings of the dialogue history x and the candidate triplet z. The triplet embeddings are generated using Graph Neural Networks (GNNs) to capture the relational structure of the knowledge graph. The retrieval distribution p(Z|x) is then computed as the product of the individual triplet retrieval scores p(z|x), allowing the model to retrieve only the most relevant subgraph Z for the given dialogue context.
(2) The model needs to encode the retrieved subgraph Z along with the text sequence x for the language model. A naive approach would be to simply prepend the tokens of entities and relations in Z to the input x, but this violates important properties like permutation invariance and relation inversion invariance. To address this, the paper proposes an “invariant and efficient” graph encoding method. It first sorts the unique entities in Z and encodes them, then applies a learned affine transformation to perturb the entity embeddings based on the graph structure. This satisfies the desired invariance properties while also being more computationally efficient than prepending all triplet tokens.
(3) The model uses a contrastive learning objective to ensure the generated text is consistent with the retrieved subgraph Z. Specifically, it maximizes the similarity between the representations of the retrieved subgraph and the generated text, while minimizing the similarity to negative samples. This encourages the model to generate responses that faithfully reflect the factual knowledge contained in the retrieved subgraph.
By combining these three components — subgraph retrieval, invariant graph encoding, and graph-text contrastive learning — the KGAG framework can generate knowledge-grounded responses that are both fluent and factually accurate.
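To make component (1) a bit more concrete, here is a deliberately simplified sketch of the triplet-scoring idea: each candidate triplet is scored against the dialogue history via an inner product of their embeddings, and the top-scoring triplets form the retrieved subgraph Z. The encode_dialogue and encode_triplet functions are hypothetical stand-ins for the paper’s text encoder and GNN-based triplet encoder.

```python
import numpy as np

def retrieve_subgraph(dialogue_history: str,
                      triplets: list[tuple[str, str, str]],
                      k: int = 5) -> list[tuple[str, str, str]]:
    x = encode_dialogue(dialogue_history)                # hypothetical encoder, shape (d,)
    Z = np.stack([encode_triplet(t) for t in triplets])  # hypothetical GNN embeddings, shape (n, d)
    scores = Z @ x                                       # inner-product retrieval scores for p(z|x)
    top = np.argsort(-scores)[:k]
    return [triplets[i] for i in top]                    # most relevant subgraph Z
```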
KGAG is particularly useful in dialogue systems, question answering, and other applications where generating informative and factually accurate responses is important. It can be applied in domains where there is access to a relevant knowledge graph, such as encyclopaedic knowledge, product information, or domain-specific facts. By combining the strengths of language models and structured knowledge, KGAG can produce responses that are both natural and trustworthy, making it a valuable tool for building intelligent conversational agents and knowledge-intensive applications.
Chain-of-Thought (CoT) prompting is a prompt engineering approach introduced by Wei et al. (2022). By providing the model with either instructions or few-shot examples of structured reasoning steps towards a problem solution, it significantly reduces the complexity of the problem the model has to solve.
The core idea behind CoT prompting is to mimic the human thought process when solving complicated multi-step reasoning tasks. Just as humans decompose a complex problem into intermediate steps and solve each step sequentially before arriving at the final answer, CoT prompting encourages language models to generate a coherent chain of thought — a series of intermediate reasoning steps that lead to the final solution. Figure 5 showcases an example where the model produces a chain of thought to solve a math word problem it would have otherwise gotten incorrect.
The paper highlights several attractive properties of CoT prompting. Firstly, it allows models to break down multi-step problems into manageable intermediate steps, allocating additional computation to problems requiring more reasoning steps. Secondly, the chain of thought provides an interpretable window into the model’s reasoning process, enabling debugging and understanding where the reasoning path might have gone astray. Thirdly, CoT reasoning can be applied to various tasks such as math word problems, commonsense reasoning, and symbolic manipulation, making it potentially applicable to any task solvable via language. Finally, sufficiently large off-the-shelf language models can readily generate chains of thought simply by including examples of such reasoning sequences in the few-shot prompting exemplars.
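For illustration, a few-shot CoT prompt can be as simple as prepending a worked exemplar whose answer spells out the intermediate steps; the exemplar below mirrors the widely cited math word problem from the paper.

```python
# A minimal few-shot chain-of-thought prompt: the exemplar's answer demonstrates
# the intermediate reasoning the model should imitate for the new question.
cot_prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls.
5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?
A:"""
# A sufficiently large model typically continues with the reasoning chain
# ("They started with 23, used 20, leaving 3. 3 + 6 = 9. The answer is 9.")
# instead of guessing the final number directly.
```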
ReAct prompting is another novel technique introduced by Yao et al. (2023) that goes one step further by enabling language models to synergize reasoning and acting in a seamless manner for general task-solving. The core idea is to augment the action space of the model to include not just domain-specific actions but also free-form language “thoughts” that allow the model to reason about the task, create plans, track progress, handle exceptions, and incorporate external information.
In ReAct, the language model is prompted with few-shot examples of human trajectories in which thoughts/reasoning steps trigger actions taken in the environment. For tasks where reasoning is the primary focus, thoughts and actions alternate, allowing the model to reason before acting. For more open-ended decision-making tasks, thoughts can occur sparsely and asynchronously as needed to create high-level plans, adjust based on observations, or query external knowledge.
ReAct synergizes the strengths of large language models for multi-step reasoning (like recursive chain-of-thought prompting) with their ability to act and interact in environments. By grounding reasoning in an external context and allowing information to flow bidirectionally between reasoning and acting, ReAct overcomes key limitations of prior work that treated them in isolation.
The paper shows that ReAct enables strong few-shot performance across question answering, fact verification, text games, and web navigation tasks. Compared to chain-of-thought prompting, which relies solely on the model’s internal knowledge, ReAct allows the model to incorporate up-to-date information from external sources into its reasoning trace through actions. These actions perform dynamic context retrieval, integrating approaches like RAG or KGAG, or even web searches and API calls. This makes the reasoning process more robust and less prone to hallucinations. Conversely, injecting reasoning into an acting-only approach allows for more intelligent long-term planning, progress tracking, and flexible adjustment of strategies, going beyond simple action prediction.
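The following is a minimal sketch of what such a thought-action-observation loop could look like in application code; call_llm and run_tool are hypothetical placeholders for the model client and the tool (e.g., a search API), and the stop-sequence handling is simplified.

```python
def react_answer(question: str, max_steps: int = 5) -> str:
    # Few-shot exemplars of Thought/Action/Observation trajectories would be
    # prepended here in a real prompt; omitted for brevity.
    prompt = (
        "Answer the question by interleaving Thought, Action and Observation steps. "
        "Finish with 'Final Answer: <answer>'.\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        # The model emits its next Thought and Action, then generation pauses.
        step = call_llm(prompt, stop=["Observation:"])   # hypothetical LLM client
        prompt += step
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Execute the proposed action against an external tool (e.g. a search API)
        # and feed the result back as an Observation for the next reasoning step.
        action = step.split("Action:")[-1].strip()
        observation = run_tool(action)                   # hypothetical tool execution
        prompt += f"\nObservation: {observation}\n"
    return "No final answer produced within the step budget."
```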
Figure 6 (illustration by Google) shows examples of different prompt engineering techniques (system prompts including few-shot examples and instructions are hidden) trying to solve a Q&A problem from the HotpotQA dataset (Yang et al., 2018). As opposed to the other options, ReAct demonstrates strong performance on the task by combining reasoning and acting in a recursive manner.
In this blog post we explored in-context learning as a powerful approach to domain adaptation. After understanding its underlying mechanisms, we discussed commonly used static and dynamic prompt engineering techniques and their applications.
In the third part of this blog post series, we will dive into fine-tuning and discuss its different approaches.
Part 1: Introduction into domain adaptation — motivation, options, tradeoffs
Part 2: A deep dive into in-context learning — You’re here!
Part 3: A deep dive into fine-tuning