RAG interview questions covering document ingestion, chunking, embeddings, vector search, reranking, grounding, evaluation, and production pitfalls.
Retrieval-Augmented Generation, or RAG, is an architecture where an application retrieves relevant external information and gives it to a generative model as context before generating an answer. Example: a company policy chatbot retrieves the leave policy paragraph and asks the LLM to answer only from that paragraph instead of relying on its pretrained memory.
RAG is used because LLMs may not know private, current, or domain-specific information. RAG lets the system use fresh documents without retraining the model. Example: instead of fine-tuning every time a refund policy changes, index the latest policy and retrieve it at answer time.
A RAG system usually has document ingestion, text extraction, chunking, embedding generation, an index or vector database, retrieval, optional reranking, prompt construction, generation, citation handling, and evaluation. In production, it also needs permissions, monitoring, refresh jobs, and fallback behavior.
The basic workflow is: collect documents, split them into chunks, convert chunks into embeddings, store them in a searchable index, retrieve relevant chunks for a user query, add those chunks to the prompt, and ask the LLM to answer from that context. Example: question -> retrieve top 5 policy chunks -> generate answer with source citations.
Document ingestion is the process of bringing source data into the RAG pipeline. It includes reading PDFs, web pages, Word files, database rows, tickets, or wiki pages; extracting clean text; attaching metadata; and preparing content for chunking. Example: ingest HR PDFs with metadata like department, policy_version, region, and effective_date.
Useful metadata includes document ID, title, source URL, owner, department, permissions, language, version, timestamp, section heading, page number, product, region, and expiry date. Example: storing page_number and source_url lets the answer show "Source: Leave Policy, page 4."
Chunking splits documents into smaller passages before indexing. Chunks should be large enough to preserve meaning but small enough to retrieve precise context. Example: a 40-page policy PDF may be split into 500-token chunks with headings preserved so the retriever can find the exact leave-eligibility section.
Choose chunk size based on document type, question style, model context window, and retrieval quality. Small chunks improve precision but may lose context; large chunks preserve context but may retrieve irrelevant text. Example: FAQ pages may work with 200-400 tokens, while legal contracts may need 700-1000 tokens with section headings.
Chunk overlap repeats some text between adjacent chunks so important context is not lost at boundaries. Example: if a definition starts at the end of one chunk and conditions appear in the next, a 50-token overlap can keep the answer retrievable in one or both chunks.
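The fixed-size splitting with overlap described above can be sketched in a few lines. This is a minimal illustration rather than a production chunker: it assumes whitespace-separated words stand in for tokens, and the name `chunk_with_overlap` and its parameters are hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reached the end of the document
    return chunks


# Toy input: words stand in for tokens (an assumption for illustration).
tokens = "a b c d e f g h i j".split()
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=4` and `overlap=1`, each chunk begins with the last token of the previous chunk. A real pipeline would more likely count tokens with the embedding model's own tokenizer and try to respect sentence or heading boundaries.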
Semantic chunking splits text by meaning rather than fixed length. It tries to keep paragraphs, sections, or related ideas together. Example: instead of cutting every 500 tokens, a semantic chunker keeps a full "Refund Eligibility" section together because splitting it would break the rule.
An embedding is a numeric vector that represents the semantic meaning of text. Similar meanings produce vectors close together. In RAG, both document chunks and user queries are embedded so the system can retrieve chunks that are semantically related to the question.
A vector database stores embeddings and supports similarity search. Examples include Pinecone, Weaviate, Milvus, Qdrant, and Chroma; FAISS (a similarity-search library) and pgvector (a PostgreSQL extension) fill the same role without being standalone databases. In a RAG app, the vector store returns the chunks that are closest to the query embedding.
Similarity search finds vectors closest to a query vector. Common metrics are cosine similarity, dot product, and Euclidean distance. Example: a query about "maternity leave" may retrieve chunks containing "parental leave" even if the exact words differ.
Cosine similarity measures the cosine of the angle between two vectors, so it ranges from -1 (opposite direction) to 1 (same direction). It is commonly used for embeddings because it compares direction rather than magnitude. In RAG, higher cosine similarity usually means the query and chunk are more semantically related.
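As a small worked example, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below uses plain Python lists and made-up two-dimensional vectors; real embeddings have hundreds or thousands of dimensions, and returning 0.0 for a zero-length vector is a convention chosen here, not a standard.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])  # magnitude differs
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
print(same_direction, orthogonal, opposite)
```

The first pair shows why direction matters more than magnitude: doubling a vector's length leaves its cosine similarity unchanged.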
Top-k retrieval returns the k most relevant chunks for a query. Example: top-k = 5 retrieves the five closest chunks. A low k may miss context, while a high k may add noise and increase token cost.
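Top-k retrieval can be illustrated with a brute-force scan over a tiny in-memory index. This is only a sketch: the chunk IDs and vectors are invented, `top_k` is a hypothetical helper, and a real vector database would typically use an approximate nearest-neighbor index instead of scoring every chunk.

```python
import heapq
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=5):
    """Return the k (score, chunk_id) pairs with the highest cosine similarity."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items())
    return heapq.nlargest(k, scored)


# Toy index: chunk ID -> made-up 2-dimensional embedding.
index = {
    "leave": [0.9, 0.1],
    "refund": [0.1, 0.9],
    "parental": [0.8, 0.3],
}
results = top_k([1.0, 0.0], index, k=2)
top_ids = [chunk_id for _, chunk_id in results]
print(top_ids)
```

Raising `k` returns more candidates at the cost of more tokens in the prompt, which is the trade-off described above.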
Hybrid search combines vector search with keyword search, often BM25. It is useful when exact terms matter, such as product IDs, error codes, law names, or policy section numbers. Example: searching "Form 16" should match the exact phrase, not only semantically related tax documents.
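One common way to combine a vector ranking with a keyword ranking is reciprocal rank fusion (RRF), which only needs each list's rank order, not comparable scores. The text above does not prescribe a fusion method, so treat this as one possible option. The document IDs are invented, and `k=60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the query "Form 16".
vector_ranking = ["tax_guide", "form16_howto", "payroll_faq"]
keyword_ranking = ["form16_howto", "form16_deadlines", "tax_guide"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
print(fused)
```

Documents that appear near the top of both lists rise to the top of the fused list, while documents found by only one retriever are still kept as candidates.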
BM25 is a keyword-based ranking algorithm used in search engines. It scores documents based on term frequency, inverse document frequency, and document length. In RAG, BM25 helps retrieve exact-match content that embeddings might miss.
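To make the three ingredients concrete, here is a compact sketch of BM25 scoring. It assumes lowercase whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the IDF variant with `+ 1` inside the logarithm that keeps scores non-negative. Search engines differ in these details, so this should not be read as the exact formula of any particular engine.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document in `docs` against `query` with a basic BM25."""
    if not docs:
        return []
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter()  # number of documents containing each term
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized through `b`.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "submit form 16 to claim tax deductions",
    "income tax overview for new employees",
    "office parking rules",
]
scores = bm25_scores("form 16", docs)
best = scores.index(max(scores))
print(best, scores)
```

Only the first document contains the exact terms "form" and "16", so it is the only one with a non-zero score, which is the exact-match behavior embeddings can miss.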
Reranking takes initially retrieved chunks and reorders them using a stronger relevance model. Example: retrieve top 30 chunks with fast vector search, then rerank to pick the best 5. This improves answer quality when initial retrieval is noisy.
A cross-encoder reranker scores a query and document together, usually with higher accuracy than vector similarity alone. It is slower but useful after initial retrieval. Example: use vector search for candidates, then cross-encoder reranking for final context selection.
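The retrieve-then-rerank pattern can be sketched with a pluggable scoring function. In the example below, `overlap_score` is a deliberately simple stand-in for a cross-encoder (it just measures word overlap); in practice you would pass a function that calls a real reranking model. All names and sample chunks are hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Reorder (chunk_id, text) candidates by score_fn(query, text), best first."""
    scored = [(score_fn(query, text), chunk_id) for chunk_id, text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [chunk_id for _, chunk_id in scored[:top_n]]


def overlap_score(query, text):
    """Placeholder relevance model: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


# Candidates as they might come back from a fast first-stage search.
candidates = [
    ("c1", "shipping times for international orders"),
    ("c2", "refund policy after delivery of the item"),
    ("c3", "refund requests need a receipt"),
]
best_chunks = rerank("refund after delivery", candidates, overlap_score, top_n=2)
print(best_chunks)
```

Keeping the scorer as a parameter makes it easy to rerank only when the first-stage results look noisy, since the stronger model is the slow part.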
Query rewriting transforms the user question into a clearer, self-contained search query. Example: user asks "Can I take it next month?" after discussing vacation. The system resolves "it" from the conversation and rewrites the question as "employee vacation leave policy taking leave next month" before retrieval.
Multi-query retrieval generates several alternative versions of the user query and searches with each one. It improves recall when users phrase questions poorly. Example: "refund after delivery" can also search "return policy after item received" and "post-delivery cancellation rules."
Metadata filtering restricts retrieval using document attributes. Example: filter by region = "India", role = "employee", and document_type = "HR policy" before vector search. This prevents irrelevant or unauthorized chunks from entering the prompt.
A RAG system must enforce access control at retrieval time, or at the latest before prompt construction, so that restricted text never reaches the model. A user should only retrieve chunks they are allowed to see. Example: an employee chatbot must not retrieve manager-only salary documents for a regular employee.
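Metadata filtering and permission checks can both be applied as a pre-filter before any similarity scoring. The sketch below assumes each chunk carries a `metadata` dict with an `allowed_roles` list; these field names are hypothetical, and many vector databases can apply equivalent filters inside the query itself.

```python
def filter_chunks(chunks, filters, user_roles):
    """Keep chunks that match every metadata filter and that the user may see."""
    visible = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if any(meta.get(key) != value for key, value in filters.items()):
            continue  # fails a metadata filter such as region or document_type
        if not set(meta.get("allowed_roles", [])) & set(user_roles):
            continue  # deny by default when no role overlaps
        visible.append(chunk)
    return visible


chunks = [
    {"id": "hr-1", "metadata": {"region": "India", "document_type": "HR policy",
                                "allowed_roles": ["employee", "manager"]}},
    {"id": "hr-2", "metadata": {"region": "India", "document_type": "HR policy",
                                "allowed_roles": ["manager"]}},
    {"id": "hr-3", "metadata": {"region": "Germany", "document_type": "HR policy",
                                "allowed_roles": ["employee"]}},
]
visible = filter_chunks(
    chunks,
    filters={"region": "India", "document_type": "HR policy"},
    user_roles=["employee"],
)
visible_ids = [chunk["id"] for chunk in visible]
print(visible_ids)
```

Chunks with no `allowed_roles` entry are dropped in this sketch, on the assumption that failing closed is safer than failing open for permission data.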
Source attribution shows where the answer came from, such as document title, URL, section, or page number. Example: "According to Leave Policy v3, page 6..." Attribution helps users verify answers and improves trust.
To make attribution possible, store source metadata with every chunk and instruct the model to cite the chunk IDs or document sections it used in the answer. Example: pass chunks as [S1], [S2], [S3] and prompt: "Cite sources like [S1] after each claim."
Grounding means the generated answer is based on retrieved evidence instead of unsupported model knowledge. Example: if retrieved policy chunks do not mention remote-work reimbursement, the answer should say the policy does not provide that information.
RAG reduces hallucination by giving the model relevant evidence and instructing it to answer from that evidence. It does not eliminate hallucination completely, so systems still need citation checks, refusal rules, and evaluation.
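A basic citation check can be automated. The sketch below assumes answers cite sources with bracketed IDs such as [S1], and it only verifies that every cited ID was actually provided to the model. It does not verify that the cited text supports the claim, which needs a separate faithfulness check. The function name and sample answer are hypothetical.

```python
import re


def check_citations(answer, source_ids):
    """Report which cited IDs appear in the answer and which were never provided."""
    cited = set(re.findall(r"\[(S\d+)\]", answer))
    return {
        "cited": sorted(cited),
        "unknown": sorted(cited - set(source_ids)),
        "has_citation": bool(cited),
    }


answer = "Refunds take 5-7 days [S1]. Gift cards are excluded [S4]."
report = check_citations(answer, source_ids=["S1", "S2", "S3"])
print(report)
```

An answer with an empty `cited` list or a non-empty `unknown` list is a reasonable candidate for a refusal, a retry, or human review.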
A RAG prompt should include task instructions, retrieved context, citation rules, missing-information behavior, and output format. Example: "Answer only from the sources below. If not found, say not found. Cite source IDs. Use 5 bullets max."
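The prompt elements listed above can be assembled with a small template function. This is a sketch under assumptions: the chunk dicts with `text` and `source` keys, the function name `build_rag_prompt`, and the sample policy text are all invented for illustration, and the exact wording of the instructions would need testing against your model.

```python
def build_rag_prompt(question, chunks, max_bullets=5):
    """Build a grounded prompt with numbered sources and citation rules."""
    sources = "\n".join(
        f"[S{i}] {chunk['text']} (source: {chunk['source']})"
        for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer only from the sources below.\n"
        "If the answer is not in the sources, say: "
        "I could not find this in the available documents.\n"
        "Cite source IDs like [S1] after each claim.\n"
        f"Use at most {max_bullets} bullets.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
    )


chunks = [
    {"text": "Refunds are processed within 5-7 business days.",
     "source": "Refund Policy, page 2"},
    {"text": "Gift cards are non-refundable.",
     "source": "Refund Policy, page 3"},
]
prompt = build_rag_prompt("How long do refunds take?", chunks)
print(prompt)
```

Numbering the sources in the prompt gives the model short, unambiguous IDs to cite, and lets the application map each ID back to its stored metadata afterwards.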
Context stuffing means adding too many retrieved chunks into the prompt. It increases cost, latency, and confusion. Example: passing 25 chunks to answer a simple refund question can bury the relevant rule under irrelevant text.
When the retrieved context does not contain the answer, the system should say it cannot find enough information instead of inventing one. Example prompt instruction: "If the answer is not directly supported by the provided context, respond: I could not find this in the available documents."
A fallback strategy defines what happens when retrieval fails or confidence is low. Examples include asking a clarifying question, widening search, using keyword search, escalating to a human, or saying no answer was found.
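A fallback policy can be expressed as a small decision function over retrieval scores. The threshold of 0.5 and the action names below are assumptions for illustration; a workable cutoff depends on the embedding model and should be tuned on labeled queries.

```python
def choose_fallback(scored_chunks, min_score=0.5):
    """Return (action, chunks_to_use) based on how confident retrieval looks."""
    strong = [chunk for chunk in scored_chunks if chunk["score"] >= min_score]
    if strong:
        return "answer", strong
    if scored_chunks:
        # Something was retrieved, but nothing is convincing enough to answer from.
        return "ask_clarifying_question", []
    return "no_answer_found", []


confident = choose_fallback([{"id": "c1", "score": 0.82}, {"id": "c2", "score": 0.31}])
weak = choose_fallback([{"id": "c2", "score": 0.31}])
empty = choose_fallback([])
print(confident[0], weak[0], empty[0])
```

The same structure extends to other fallbacks mentioned above, such as retrying with keyword search or escalating to a human before giving up.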
Evaluate retrieval using labeled question-document pairs and metrics such as recall@k, precision@k, MRR, and nDCG. Example: for 100 policy questions, check whether the correct policy section appears in the top 3 retrieved chunks.
Recall@k measures the fraction of relevant chunks that appear in the top k retrieved results. When each question has exactly one relevant chunk, it reduces to a hit rate. Example: recall@5 is successful if the correct chunk appears anywhere in the top 5 results.
Precision@k measures how many of the top k retrieved chunks are actually relevant. Example: if top 5 retrieval returns 3 useful chunks and 2 irrelevant chunks, precision@5 is 3/5.
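These retrieval metrics are straightforward to compute once each test question has a labeled set of relevant chunk IDs. The sketch below covers recall@k, precision@k, and the reciprocal rank that MRR averages over questions. The chunk IDs are made up, and the example mirrors the case above where 3 of the top 5 chunks are relevant.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


retrieved = ["c7", "c2", "c9", "c4", "c1"]  # ranked retriever output
relevant = {"c2", "c4", "c1"}               # labeled ground truth

p5 = precision_at_k(retrieved, relevant, 5)
r3 = recall_at_k(retrieved, relevant, 3)
r5 = recall_at_k(retrieved, relevant, 5)
rr = reciprocal_rank(retrieved, relevant)
print(p5, r3, r5, rr)
```

Averaging `reciprocal_rank` over all test questions gives MRR, and averaging the other two gives the mean recall@k and precision@k for the evaluation set.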
Evaluate answers for factual accuracy, source support, completeness, relevance, citation correctness, format compliance, safety, and user usefulness. Example: an answer that happens to be correct in general but is not supported by the retrieved sources should still fail the source-support check.
Faithfulness means the answer is supported by the retrieved context. A faithful answer does not add unsupported claims. Example: if the source says refund takes 5-7 days, the answer should not say "usually 2 days."
Stale data is outdated indexed content. Example: if the HR policy changed yesterday but the vector index still contains last month's policy, the chatbot may answer incorrectly. Fix this with refresh jobs, versioning, expiry metadata, and reindexing.
Keep the index fresh with scheduled crawls, event-based updates, document versioning, deleted-document cleanup, embedding refresh, and monitoring for failed ingestion. Example: when a policy PDF changes in storage, trigger re-extraction and reindex only affected chunks.
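One way to reindex only what changed is to store a content hash per document and compare it on each refresh. The sketch below assumes documents are available as `{doc_id: text}` and that the index keeps a `{doc_id: hash}` record; both structures and the function names are hypothetical.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a document's extracted text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs, indexed_hashes):
    """Return (ids to re-embed, ids to delete) by comparing content hashes."""
    to_upsert = sorted(
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    )
    to_delete = sorted(doc_id for doc_id in indexed_hashes if doc_id not in current_docs)
    return to_upsert, to_delete


# What the index believes it holds (hashes recorded at the last refresh).
indexed_hashes = {
    "leave_policy": content_hash("Leave policy v1"),
    "refund_policy": content_hash("Refund policy"),
    "retired_policy": content_hash("Old policy"),
}
# What the source system holds now.
current_docs = {
    "leave_policy": "Leave policy v2",   # changed
    "refund_policy": "Refund policy",    # unchanged
    "travel_policy": "Travel policy",    # new
}
to_upsert, to_delete = plan_reindex(current_docs, indexed_hashes)
print(to_upsert, to_delete)
```

Hashing at the chunk level instead of the document level would narrow re-embedding further, at the cost of tracking more hashes.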
Embedding drift happens when embeddings become inconsistent because the embedding model changes or the data distribution changes. If you switch embedding models, old and new vectors may not be comparable, so you often need to re-embed the index.
Re-embed when source documents change, the chunking strategy changes, or the embedding model is upgraded. Metadata changes require re-embedding only if the metadata is part of the embedded text; otherwise updating the stored metadata is enough. Example: changing from one embedding model to another usually requires rebuilding the vector index.
Debug the pipeline step by step: inspect the user query, rewritten query, retrieved chunks, reranked chunks, final prompt, model output, and citations. Example: if the answer is wrong because the correct chunk was never retrieved, fix retrieval before changing the prompt.
Common failures include bad chunking, poor OCR, irrelevant retrieval, missing metadata filters, stale documents, no access control, too much context, weak prompts, unsupported citations, and answers that ignore retrieved evidence.
Tables need special extraction because plain text conversion can destroy row-column relationships. Options include preserving markdown tables, storing table summaries, indexing rows separately, or using document parsers that retain structure. Example: pricing tables should keep plan names aligned with prices.
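Indexing rows separately works best when every row carries its own column names, so a retrieved row is readable without the rest of the table. The sketch below turns a parsed table into one self-describing text chunk per row; the table contents, the `table_rows_to_chunks` name, and the output format are illustrative assumptions.

```python
def table_rows_to_chunks(table_name, headers, rows):
    """Turn each table row into a standalone 'header: value' text chunk."""
    chunks = []
    for row in rows:
        pairs = ", ".join(f"{header}: {value}" for header, value in zip(headers, row))
        chunks.append(f"{table_name} | {pairs}")
    return chunks


# Hypothetical pricing table after extraction from a PDF or HTML page.
headers = ["Plan", "Price", "Seats"]
rows = [
    ["Basic", "$10/month", "5"],
    ["Pro", "$25/month", "20"],
]
row_chunks = table_rows_to_chunks("Pricing", headers, rows)
for chunk in row_chunks:
    print(chunk)
```

Because each chunk repeats the headers, a plan name stays attached to its price even when only one row is retrieved.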
PDF handling requires text extraction, OCR for scanned pages, page metadata, layout cleanup, header/footer removal, and table handling. Example: store page numbers with chunks so answers can cite "Benefits Guide, page 12."
Multi-hop RAG answers questions that require information from multiple sources or retrieval steps. Example: "Which policy applies to contractors in Germany?" may need retrieving contractor policy and regional employment policy, then combining both.
Reduce latency with approximate nearest-neighbor indexes for faster vector search, smaller top-k, caching, precomputed query rewrites, selective reranking, parallel retrieval, shorter prompts, streaming responses, and using smaller models when acceptable.
Control cost by limiting retrieved context, caching frequent answers, compressing prompts, choosing efficient embedding models, using reranking selectively, monitoring token usage, and routing simple questions to cheaper models.
Useful logs include user query, rewritten query, retrieved chunk IDs, scores, filters, reranked results, final prompt version, model version, citations, latency, token usage, user feedback, and whether the answer was escalated.
A safe RAG deployment includes access control, source metadata, retrieval evaluation, answer faithfulness checks, citation display, monitoring, feedback collection, stale-data handling, fallback behavior, and human escalation for high-risk answers.
Start with a clear use case, clean documents, good chunking, metadata filters, strong retrieval evaluation, grounded prompts, citations, permission checks, and monitoring. Example: before improving the LLM prompt, verify that the correct chunks are actually being retrieved.