In this post, you will get a very dense overview of what **Retrieval-Augmented Generation (RAG)** is, and a rich collection of links and resources to learn more and take a deep dive into the topic, up-to-date for the start of 2026.
[As Phil Schmid wrote](https://www.philschmid.de/context-engineering), **context engineering** drives the quality of all agentic workflows. [Tobi Lutke](https://x.com/tobi/status/1935533422589399127) defined context engineering as "the art of providing all the context for the task to be plausibly solvable by the LLM." In other words, context engineering is the set of techniques and methods you use to evolve the retrieval‑augmented generation (RAG) pattern, and those techniques will be the focus of this post.
## RAG: Why, what and when?
You only need RAG when your LLM must consider data beyond its knowledge horizon *and* when that data doesn’t fit into the context window. In RAG setups, the LLM uses a lookup function to retrieve relevant data from an index. A basic RAG simply generates an answer by augmenting the context window with the retrieved data. In contrast, an agentic setup makes further decisions on how to proceed given the retrieved data, such as triggering additional actions or further retrievals. For example, if you instruct your coding agent to look up information in API docs if needed, that is RAG. For a RAG to work well, the design and the data matter far more than the choice of LLM.

*The essence of the RAG pattern.* Retrieving latent information given a (user) prompt to augment the model's response. (Note: responses can be further tool use, too, not just answer generation!)
## Don't forgo your evals!
While I will not dive deep into evals here, [they are a must](https://hamel.dev/blog/posts/evals/). Just as with any LLM project, you need to track progress and quality by continually evaluating your system’s output. Hamel Husain has put together a fantastic [everything-you-need-to-know guide](https://hamel.dev/blog/posts/evals/) on evals to understand the essential concepts.
If you are going to create a RAG, Jason Liu suggests evaluating your RAG system by [tracking retrieval and generation separately](https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/#level-4-evaluations), and even [offers a course on the topic](https://maven.com/applied-llms/rag-playbook). A starter recommendation could be [Ragas](https://docs.ragas.io/en/latest/), a popular RAG evaluation framework, as it [integrates nicely into LangFuse](https://langfuse.com/guides/cookbook/evaluation_of_rag_with_ragas), my [observability platform](#Observability) of choice.
## Knowledge chunking
To enable retrieval of knowledge, you split the relevant data or documents into small chunks. This later lets the agent retrieve only what matters. Chunking is a craft. Large chunks confuse retrieval and bloat the context. Tiny splits break up concepts across chunks. Some ideas fit in a sentence; others need pages, so you might need variable chunk sizes. A robust chunker matters more than which [embedding model](https://spotintelligence.com/2025/09/18/embedding-models/) you choose to represent those chunks in the query index.
For example, when splitting Markdown documents, [keep headlines attached to the following paragraph](https://docs.langchain.com/oss/python/integrations/splitters/markdown_header_metadata_splitter), not the other way around (chunks that end with the next section's headline). And keep paragraphs and tables intact instead of splitting text at arbitrary points. With the LangChain [`MarkdownTextSplitter`](https://reference.langchain.com/python/langchain_text_splitters/#langchain_text_splitters.MarkdownTextSplitter) you get a specialized text splitter that splits at natural breakpoints in Markdown text. And you can run that `MarkdownTextSplitter` over the chunks from the `MarkdownHeaderTextSplitter`, which gives you even cleaner chunks than using the `RecursiveCharacterTextSplitter` the linked LangChain doc recommends.
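As a minimal sketch of that two-stage splitting, assuming the `langchain-text-splitters` package (the sample document, header names, and chunk sizes are placeholders):
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter, MarkdownTextSplitter

markdown_doc = "# Guide\n\n## Setup\n\nInstall the package...\n\n## Usage\n\nRun the CLI..."

# First pass: split on headlines and carry them along as metadata for the text below.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
)
sections = header_splitter.split_text(markdown_doc)

# Second pass: split long sections at Markdown-friendly breakpoints (paragraphs, lists).
chunker = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = chunker.split_documents(sections)

for chunk in chunks:
    print(chunk.metadata, chunk.page_content[:60])
```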
## Retrieval techniques
Retrieval is at the heart of the RAG pattern. There are many choices, and we will discuss the most common ones. That said, ID lookup, keyword search, and semantic search are often all you need.
### Keyword search
Once you have a robust chunker to segment your data, you can focus on retrieval. Sometimes you only need a clever identity (hash) lookup. Otherwise, you would typically start with **keyword search** (I recommend [BM25s](https://bm25s.github.io/)), and later add query expansion or create variations ("multi-query"), or represent your chunks as numeric vectors ("embeddings", see [Semantic search via embeddings](#Semantic%20search%20via%20embeddings) below). If you just use keyword search, you probably want to filter out words that are nearly ubiquitous in your text, such as "a", "an", "it", "the", "and", "or", "am", "was", etc. in English texts. That is known as **stop-word filtering**: you simply remove those words from the query before you run the lookup. A more elaborate technique is **stemming** or **lemmatizing**, where you strip a word's inflection to get its normal form (is → be, went → go, chunks → chunk, etc.). Luckily, [BM25s has your back](https://github.com/xhluca/bm25s?tab=readme-ov-file#flexibility) and you can add both stop-word filtering and stemming easily.
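Here is a rough sketch following the BM25s README (the corpus is a stand-in; stop-word filtering and stemming are just tokenizer arguments):
```python
import bm25s
import Stemmer  # PyStemmer, an optional dependency of bm25s

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "Chunking went well and the chunks were indexed.",
]

# Stop words are dropped and tokens are stemmed before indexing.
stemmer = Stemmer.Stemmer("english")
corpus_tokens = bm25s.tokenize(corpus, stopwords="en", stemmer=stemmer)

retriever = bm25s.BM25()
retriever.index(corpus_tokens)

# The query goes through the same tokenization pipeline.
query_tokens = bm25s.tokenize("how did the chunk indexing go?", stopwords="en", stemmer=stemmer)
results, scores = retriever.retrieve(query_tokens, corpus=corpus, k=2)
print(results[0], scores[0])
```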
### Advanced keyword search
**Query expansion** adds synonyms, hyponyms, related keywords, or normalized ("stemmed") word forms to the keyword lookup. If you have experience with a search engine like Solr or Elastic, you can [handle query expansion at index time](https://library.brown.edu/create/digitaltechnologies/using-synonyms-in-solr/) by providing synonym lists. Using a proper search engine gives you many more advantages, such as the ability to [analyze](https://www.elastic.co/docs/reference/text-analysis/analyzer-reference) and [normalize](https://www.elastic.co/docs/reference/text-analysis/normalizers) the text. That will let you figure out that "AbcCorp", "ABC Corp.", and "ABC-Corporation" all refer to the same keyword.
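If you don't run a search engine, a poor man's query expansion can be as simple as a hand-maintained synonym dictionary applied before the keyword lookup (a toy sketch; the synonym lists are made up):
```python
# Toy query expansion: append hand-maintained synonyms before the keyword lookup.
SYNONYMS = {
    "abccorp": ["abc corp", "abc corporation"],  # made-up normalization example
    "invoice": ["bill", "receipt"],
}

def expand_query(query: str) -> str:
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        expanded.extend(SYNONYMS.get(term, []))
    return " ".join(expanded)

print(expand_query("Find the ABCCorp invoice"))
# -> "find the abccorp invoice abc corp abc corporation bill receipt"
```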
**Query rewriting** (aka query reformulation) [asks the LLM to rewrite the query](https://dev.to/rogiia/build-an-advanced-rag-app-query-rewriting-h3p) for better retrieval results ([LangChain example](https://docs.langchain.com/oss/python/langgraph/agentic-rag#5-rewrite-question)). In a **multi-query** setup, you ask the LLM to generate multiple variations of the original query and run each as its own lookup, instead of expanding a single query with those variants (the extra LLM calls can be slow, though!). Rewritten queries can later be embedded (see [next section](#Semantic%20search%20via%20embeddings)), as they can produce better embeddings than the original query.
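As a sketch of the multi-query idea, assuming `langchain-openai` and an `OPENAI_API_KEY` are set up (the prompt wording and model name are arbitrary choices):
```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)

def multi_query(user_query: str, n: int = 3) -> list[str]:
    """Ask the LLM for n reformulations of the user query, one per line."""
    prompt = (
        f"Rewrite the following search query in {n} different ways, "
        f"one per line, without commentary:\n\n{user_query}"
    )
    response = llm.invoke(prompt)
    return [line.strip() for line in response.content.splitlines() if line.strip()]

variants = multi_query("why does my FAISS search return irrelevant chunks?")
# Run each variant through the lookup and merge the hits.
```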
### Semantic search via embeddings
**Embeddings** are numerical representations of data like text, objects, or images as vectors. That representation captures their semantic meaning, because vectors of similar content point in similar directions of the embedding space. If you generate embeddings for your chunks, you also embed the query as a vector, and retrieve the chunks that are most similar to the query in embedding space.
An effective baseline is [FAISS](https://faiss.ai/). For text-based lookups over fewer than 100,000 chunks, a flat inner-product index over L2-normalized embeddings should suffice:
```python
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
# Inner product over L2-normalized vectors is equivalent to cosine similarity.
index = faiss.IndexFlatIP(embedder.get_sentence_embedding_dimension())

text_chunks = ["Your text data here..."]
embeddings = embedder.encode(text_chunks, convert_to_numpy=True, normalize_embeddings=True)
index.add(embeddings)
```
And the lookup then works as follows:
```python
query = "User prompt here..."
top_k = 10  # number of hits to fetch

q_emb = embedder.encode([query], convert_to_numpy=True, normalize_embeddings=True)
scores, indices = index.search(q_emb, top_k)  # scores are cosine similarities
selected = [text_chunks[i] for i in indices[0] if i != -1]  # FAISS pads with -1 if there are fewer hits
# ...add the `selected` text chunks to the context window...
```
FAISS should take you a long way, and can be [integrated with LangChain](https://docs.langchain.com/oss/python/integrations/vectorstores/faiss) or [LlamaIndex](https://developers.llamaindex.ai/python/examples/vector_stores/faissindexdemo/). It should suffice as long as you can fit your vectors in memory. (That said, FAISS supports horizontal sharding.) And a FAISS index can be [persisted to disk](https://github.com/facebookresearch/faiss/wiki/Index-IO,-cloning-and-hyper-parameter-tuning#io-and-deep-copying-indexes).
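Persisting and reloading the index built above is a one-liner each (the file name is a placeholder):
```python
# Persist the FAISS index built above, and load it back later.
faiss.write_index(index, "chunks.faiss")
index = faiss.read_index("chunks.faiss")
# Note: store the `text_chunks` list separately (e.g. as JSON); FAISS only persists the vectors.
```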
If you need concurrent access, want to add metadata, or need easy horizontal scaling, a vector database might be worth considering. Vector databases are a topic of their own, but [Chroma](https://docs.trychroma.com/docs/overview/getting-started) is a robust, hassle-free starting point (you can simply pip-install it).
### Hybrid search
To combine keyword and semantic search, you can use [hybrid search](https://www.elastic.co/what-is/hybrid-search). Hybrid search fuses the results from different lookup methods. A very effective fusion method, [relative score fusion](https://weaviate.io/blog/hybrid-search-fusion-algorithms#relativescorefusion) (also known as **normalized linear fusion**), often outperforms either lookup alone. With relative score fusion, you normalize the original scores from the different lookups into a common [0,1] range. Then, for every result (chunk) you retrieve, its fusion score is the (optionally weighted) sum of those normalized scores.
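Here is a minimal sketch of relative score fusion over, say, the BM25 and FAISS hits from earlier; the chunk IDs, scores, and the 50/50 weighting are arbitrary:
```python
def relative_score_fusion(
    keyword_hits: dict[int, float],  # chunk id -> BM25 score
    vector_hits: dict[int, float],   # chunk id -> cosine similarity
    alpha: float = 0.5,              # weight of the keyword side
) -> list[tuple[int, float]]:
    """Min-max normalize each result list to [0, 1], then sum the weighted scores."""
    def normalize(scores: dict[int, float]) -> dict[int, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # guard against all-equal scores
        return {cid: (s - lo) / span for cid, s in scores.items()}

    kw, vec = normalize(keyword_hits), normalize(vector_hits)
    fused = {cid: alpha * kw.get(cid, 0.0) + (1 - alpha) * vec.get(cid, 0.0)
             for cid in kw.keys() | vec.keys()}
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

print(relative_score_fusion({1: 12.3, 2: 7.1}, {2: 0.83, 3: 0.75}))
# chunks 1 and 2 tie at 0.5, chunk 3 scores 0.0 with the default 50/50 weighting
```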
### And beyond... with a warning
There are [many more retrieval techniques](https://arxiv.org/html/2407.21022v1). An esoteric example is [hypothetical document embeddings (HyDE)](https://zilliz.com/learn/improve-rag-and-information-retrieval-with-hyde-hypothetical-document-embeddings), which embeds hypothetical responses for the query instead of the query itself. You let the LLM generate the "answer" to the user "question", and embed that hypothetical answer. That can create better query vectors than embedding the user queries directly. To make HyDE more robust, you let the LLM generate several answers and use an averaged query vector calculated across those answer embeddings. That said, HyDE is normally a *bad idea*: it requires multiple LLM calls, makes retrieval non-deterministic, is challenging to debug, and hallucinations might lead to wrong answers. But as HyDE comes up often in the RAG context, it is worth discussing here.
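For completeness, a HyDE sketch that reuses the `llm`, `embedder`, and `index` objects from the earlier sketches (so treat it as pseudo-runnable); the averaging step implements the robustness trick just mentioned:
```python
import numpy as np

def hyde_query_vector(question: str, n_answers: int = 3) -> np.ndarray:
    """Embed LLM-generated hypothetical answers instead of the question itself."""
    prompt = f"Write a short, plausible answer to this question:\n\n{question}"
    answers = [llm.invoke(prompt).content for _ in range(n_answers)]
    answer_embs = embedder.encode(answers, convert_to_numpy=True, normalize_embeddings=True)
    # Average the answer embeddings and re-normalize to get a single query vector.
    mean_emb = answer_embs.mean(axis=0)
    return (mean_emb / np.linalg.norm(mean_emb)).astype("float32")

scores, indices = index.search(hyde_query_vector("What is HyDE good for?").reshape(1, -1), 10)
```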
## Observability
Always consider **observability** from the get-go: track what data the LLM ingests and generates and which actions it takes. Capture reasoning tokens and internal agent state. Most frameworks support the [OpenTelemetry](https://opentelemetry.io/docs/) standard. I personally like the self‑hosted [LangFuse](https://langfuse.com/docs) for visualizing your LLM-related traces, logs, and metrics (including costs). It [integrates with](https://langfuse.com/integrations) LlamaIndex, LangChain, Pydantic AI, AutoGen, Dify, Langflow, CrewAI, and many others. If you already run LangChain fully provisioned in their cloud, [LangSmith](https://www.langchain.com/langsmith/observability) is also great, of course.
Observability platforms like LangFuse also offer [built-in evaluation frameworks](https://langfuse.com/docs/evaluation/overview) to manage your [evals (mentioned initially)](#Don't%20forgo%20your%20evals!). And they let you [safety-monitor your LLM's inputs and outputs](https://langfuse.com/guides/cookbook/example_llm_security_monitoring). In summary, full observability into the state, inputs, and outputs of your agent is essential, and you get an evaluation framework along with it.
## Advanced RAG
### Re-ranking: Cross-encoders
Finally, if your retrieved results are not ranked well, the relevant chunks might not make it into the context window, or the best data is not at the top. That is where [re-ranking algorithms](https://vizuara.substack.com/p/75c719b6-5490-4f3c-be53-68a58e40b198) come in. Typically, re-ranking is handled by a **cross-encoder** that predicts a better relevance score for each retrieved candidate. Comparing the query embedding to the indexed embeddings with a similarity metric such as cosine similarity is known as the **bi-encoder** pattern. A cross-encoder goes a step further: the model looks at the query and the candidate text *simultaneously* to produce a better score.
```mermaid
flowchart TD
classDef tiny padding:10px
subgraph Cross-encoder
X[Text A]@{shape: doc} --> Z
Y[Text B]@{shape: doc} --> Z
Z[Transformer] --> W
W@{shape: subproc, label: "Classifier"} --> V
V@{ shape: text, label: "[0, 1]" }
end
subgraph Bi-encoder
A[Text A]@{shape: doc} --> C
B[Text B]@{shape: doc} --> D
C[Transformer] --> E
D[Transformer] --> F
E@{shape: lean-r, label: "Embedding" } --> G
F@{shape: lean-r, label: "Embedding" } --> G
G@{ shape: text, label: "Cosine Similarity" }
end
```
*Comparing a bi-encoder setup versus a cross-encoder.* The cross-encoder is typically a larger model, and requires labeled training data to train the classifier.
To fully benefit from re-ranking, you ideally have a training dataset to [fine-tune your own encoder](https://huggingface.co/blog/train-reranker). A training dataset consists of query-response pairs with at least binary relevance labels. You can use [pre-trained cross-encoders](https://huggingface.co/cross-encoder) if they were trained on data similar to your use-case, such as text passages.
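As a sketch, re-ranking the `selected` chunks from the FAISS lookup above with a pre-trained cross-encoder from Sentence Transformers (the MS MARCO checkpoint is just one sensible default):
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# Score each (query, candidate) pair jointly, then re-sort the candidates by that score.
pair_scores = reranker.predict([(query, chunk) for chunk in selected])
reranked = [chunk for chunk, _ in sorted(zip(selected, pair_scores),
                                         key=lambda pair: pair[1], reverse=True)]
```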
### Multi-tool use
Another choice is how much agency you give the RAG agent. As already discussed, an agent can [rewrite queries (Hugging Face tutorial)](https://huggingface.co/learn/cookbook/en/agent_rag). But it could also create structured lookups (SQL queries, for example), or switch between different tools, such as with [OpenAI's Responses API](https://cookbook.openai.com/examples/responses_api/responses_api_tool_orchestration). For example, you could let an agent decide between doing web research, researching your company's Confluence pages, and querying your Salesforce data before providing the final response. [Andrew Ng provides a great course](https://www.deeplearning.ai/courses/agentic-ai/) teaching this level of agentic AI.
Agentic AI requires state management, so you need a framework to track it. Be aware that this makes response latency much worse than a vanilla RAG, as your agent is "thinking". [LangGraph](https://docs.langchain.com/oss/python/langgraph/overview) is production‑ready as of v1.0 for creating stateful agents and is part of LangChain. [Pydantic AI](https://ai.pydantic.dev/) is very promising, too, but has fewer bells and whistles than the LangChain ecosystem (maybe a good thing?). Another option is UI-based agentic workflow builders: [CrewAI](https://www.crewai.com/), [Dify](https://dify.ai/), and [Langflow](https://www.langflow.org/) are all very popular.
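To make the tool-switching idea concrete, here is a minimal Pydantic AI sketch with two stand-in tools. The tool bodies, model name, and prompts are placeholders, and attribute names can differ between Pydantic AI versions, so check the docs for your installed release:
```python
from pydantic_ai import Agent

agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Answer user questions. Call a tool whenever you need internal data.",
)

@agent.tool_plain
def search_confluence(query: str) -> str:
    """Search the company's Confluence pages."""
    return "...Confluence search results..."  # placeholder lookup

@agent.tool_plain
def query_salesforce(soql: str) -> str:
    """Run a read-only Salesforce query."""
    return "...Salesforce rows..."  # placeholder lookup

result = agent.run_sync("Which accounts did we onboard last week?")
print(result.output)  # older Pydantic AI versions expose this as `result.data`
```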
### Fine-tuning
Model fine‑tuning with [Unsloth](https://unsloth.ai/) or [Axolotl](https://axolotl.ai/) can help for very bespoke use-cases, such as training a custom SQL generator. Axolotl shines for training larger models across multiple GPUs, while Unsloth is easier to get started with on a single GPU (see this [Unsloth text-to-SQL tuning tutorial](https://www.stephendiehl.com/posts/unsloth/), for example).
A famous example is the [fine-tuning of Mistral](https://christianjmills.com/posts/mastering-llms-course-notes/workshop-002/#honeycomb-case-study-fine-tuning-llms-for-natural-language-querying) that Hamel Husain did to generate Honeycomb's query language (HQL) from natural language. Hamel Husain also has a great [tutorial on Axolotl](https://youtu.be/mmsa4wDsiy0?si=FqxjNXsFkFn7liE2) on YouTube.
Claude can even [automatically fine‑tune a model through Hugging Face Skills](https://huggingface.co/blog/hf-skills-training). But fine‑tuning should be the last step on your journey to your perfect RAG agent. Query reformulation is much simpler.
### Memory
One of the more confusing challenges is handling dynamic memory. LLMs cannot reliably absorb giant prompts or extract scattered facts from them, and I believe this won’t improve quickly. Advanced agents need ways to store and recall information across prompts. A simple but highly effective method is [context compaction](https://platform.claude.com/cookbook/tool-use-automatic-context-compaction). That is, you let the LLM clean up the context window at regular intervals, [for example with Google's Agent Development Kit (ADK)](https://google.github.io/adk-docs/context/compaction/#configure-context-compaction). You might have seen this with Claude Code, as it can compact the current chat to keep relevant details and drop noise.
Often, you can store all facts in a simple Markdown file. And you can inject those memories into the prompt every time, if they are small enough. I speculate that even big LLM labs like OpenAI that provide [personalized memory](https://help.openai.com/en/articles/8590148-memory-faq) features conceptually don’t need to keep much more than a plain-text store of your preferences. If you have more data than fits into the context window, you can use a keyword or embedding index. That allows you to query memories on the fly, as discussed in [keyword search](#Keyword%20search), or to [keep track of the IDs](https://github.com/facebookresearch/faiss/wiki/Pre--and-post-processing#the-indexidmap) of the [embeddings](#Semantic%20search%20via%20embeddings) you add to your FAISS index.
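For the FAISS route, wrapping a flat index in an `IndexIDMap` lets you add and retrieve memories under your own IDs; a small sketch reusing the `embedder` and `faiss` imports from the semantic search section (the memories and IDs are made up):
```python
import numpy as np

# Wrap a flat index so memories can be added and looked up under our own IDs.
dim = embedder.get_sentence_embedding_dimension()
memory_index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))

memories = {101: "User prefers concise answers.", 102: "User works in the Berlin office."}
vectors = embedder.encode(list(memories.values()), convert_to_numpy=True, normalize_embeddings=True)
memory_index.add_with_ids(vectors, np.array(list(memories.keys()), dtype="int64"))

# Retrieve the single closest memory for a new prompt.
query_vec = embedder.encode(["where is the user located?"], convert_to_numpy=True, normalize_embeddings=True)
_, ids = memory_index.search(query_vec, 1)
print(memories[int(ids[0][0])])  # likely the Berlin office memory
```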
If you expect to store vast amounts of memories (in the millions or more), you might use a vector database or a search engine. Beyond size, specialized RAG memory stores like [Mem0](https://docs.mem0.ai/introduction), [Zep](https://help.getzep.com/overview), or [memU](https://github.com/NevaMind-AI/memU) are growing in popularity. A RAG memory store provides hierarchical, relational memory or a knowledge graph ([GraphRAG](https://graphrag.com/concepts/intro-to-graphrag/)). That allows the agent to track and reason over relationships, such as across past customer support tickets, and prevents the context loss that plagues stateless vector stores. In fact, such memory systems seem to be growing like mushrooms: [Hindsight](https://github.com/vectorize-io/hindsight) and [EverMemOS](https://evermind.ai/) are nascent examples, but there are many more.

*A possible schematic of a research agent.* A user query gets resolved by the agent deciding to either read from the memory store or conduct web research (tool use). The agent can also follow user instructions or web research to edit memory. Any retrieved context and the system instructions are then added to the context window. Finally, the agent's LLM is prompted with the user query to produce the definitive answer.
Just be aware that reaching for a persistent vector database or a RAG memory store for your agentic memory *by default* [is probably inefficient](https://dariuszsemba.com/blog/why-autogpt-engineers-ditched-vector-databases/). For getting started, memory stores are a rabbit hole you want to avoid until you have validated why you need one.
## Retrospective
To pull this discussion back to the basics of RAG: You probably can get pretty far with a simple BM25 keyword lookup or a FAISS embedding index if you have clean, properly chunked data. And your coding agent can even quickly set up relative score fusion for the results, if you need it. Regarding persistence, building the BM25 index is so fast you can probably do it on the fly. And with FAISS, you can read and write your vector index to disk for persistence.
First ensure you are creating clean chunks, then tweak the retrieval, and finally identify the best model, all driven by evals to validate progress. Only if you can show that this is insufficient for your use case should you consider the more advanced RAG agent setups in this post. Rest assured: you can build a pretty strong baseline RAG without a search engine, vector databases, or state management.
I hope this post inspired you to at least experiment with your own agents. It is an excellent way to learn about the fascinating arena of AI engineering. And if coding up agents isn’t your thing, maybe one of the UI-based agent builders gets you started: [Dify](https://dify.ai/), [Langflow](https://www.langflow.org/), and [Rivet](https://rivet.ironcladapp.com/) are all options, with Rivet being the easiest for non-engineers to start with a local instance, and Dify probably having the most bells and whistles.
If you want to go deeper, [Eugene Yan’s Patterns for Building LLM-based Systems & Products](https://eugeneyan.com//writing/llm-patterns/) is a fantastic resource for further study. As Hamel Husain puts it: [Stop saying RAG is dead](https://hamel.dev/notes/llm/rag/not_dead.html).