RAG Interview Questions: Answers, Coding Prep & FAQs

01

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation, or RAG, is an architecture where an application retrieves relevant external information and gives it to a generative model as context before generating an answer. Example: a company policy chatbot retrieves the leave policy paragraph and asks the LLM to answer only from that paragraph instead of relying on its pretrained memory.

02

Why is RAG used with LLMs?

RAG is used because LLMs may not know private, current, or domain-specific information. RAG lets the system use fresh documents without retraining the model. Example: instead of fine-tuning every time a refund policy changes, index the latest policy and retrieve it at answer time.

03

What are the main components of a RAG system?

A RAG system usually has document ingestion, text extraction, chunking, embedding generation, an index or vector database, retrieval, optional reranking, prompt construction, generation, citation handling, and evaluation. In production, it also needs permissions, monitoring, refresh jobs, and fallback behavior.

04

Explain the basic RAG workflow.

The basic workflow is: collect documents, split them into chunks, convert chunks into embeddings, store them in a searchable index, retrieve relevant chunks for a user query, add those chunks to the prompt, and ask the LLM to answer from that context. Example: question -> retrieve top 5 policy chunks -> generate answer with source citations.
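The workflow above can be sketched end to end in a few lines. This is only a toy illustration: `embed` is a bag-of-words stand-in for a real embedding model, the list of chunks stands in for a vector index, the sample policy text is made up, and the final LLM call is left out. All function names here are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: bag-of-words term counts.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, chunks, k=2):
    # Rank every chunk by similarity to the query and keep the best k.
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query, context_chunks):
    # Label each chunk so the model can cite it as [S1], [S2], ...
    context = "\n".join(f"[S{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return f"Answer only from the sources below.\n{context}\nQuestion: {query}"

# Made-up sample chunks, for illustration only.
chunks = [
    "Employees get 20 days of paid leave per year.",
    "Refunds are processed within 5-7 business days.",
    "Office hours are 9 am to 6 pm on weekdays.",
]
question = "How many days of paid leave do employees get?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
```

In a real system the prompt would then be sent to the LLM, and the source labels would be mapped back to document metadata for citations.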

05

What is document ingestion in RAG?

Document ingestion is the process of bringing source data into the RAG pipeline. It includes reading PDFs, web pages, Word files, database rows, tickets, or wiki pages; extracting clean text; attaching metadata; and preparing content for chunking. Example: ingest HR PDFs with metadata like department, policy_version, region, and effective_date.

06

What metadata should be stored with RAG documents?

Useful metadata includes document ID, title, source URL, owner, department, permissions, language, version, timestamp, section heading, page number, product, region, and expiry date. Example: storing page_number and source_url lets the answer show "Source: Leave Policy, page 4."

07

What is chunking in RAG?

Chunking splits documents into smaller passages before indexing. Chunks should be large enough to preserve meaning but small enough to retrieve precise context. Example: a 40-page policy PDF may be split into 500-token chunks with headings preserved so the retriever can find the exact leave-eligibility section.

08

How do you choose chunk size?

Choose chunk size based on document type, question style, model context window, and retrieval quality. Small chunks improve precision but may lose context; large chunks preserve context but may retrieve irrelevant text. Example: FAQ pages may work with 200-400 tokens, while legal contracts may need 700-1000 tokens with section headings.

09

What is chunk overlap?

Chunk overlap repeats some text between adjacent chunks so important context is not lost at boundaries. Example: if a definition starts at the end of one chunk and conditions appear in the next, a 50-token overlap can keep the answer retrievable in one or both chunks.
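A minimal sketch of fixed-size chunking with overlap, assuming the text has already been tokenized into a list. The function name `chunk_with_overlap` and the small sizes in the usage lines are illustrative, not from any particular library.

```python
def chunk_with_overlap(tokens, chunk_size=500, overlap=50):
    # Each chunk starts (chunk_size - overlap) tokens after the previous one,
    # so the last `overlap` tokens of a chunk repeat at the start of the next.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = [f"t{i}" for i in range(12)]
chunks = chunk_with_overlap(tokens, chunk_size=5, overlap=2)
```

With 12 tokens, a chunk size of 5, and an overlap of 2, this yields four chunks, and the last two tokens of each chunk reappear at the start of the next.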

10

What is semantic chunking?

Semantic chunking splits text by meaning rather than fixed length. It tries to keep paragraphs, sections, or related ideas together. Example: instead of cutting every 500 tokens, a semantic chunker keeps a full "Refund Eligibility" section together because splitting it would break the rule.
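Full semantic chunking often relies on embeddings or a document parser, but a simple structure-aware approximation already captures the idea of not cutting through a unit of meaning. The sketch below assumes paragraphs are separated by blank lines and packs whole paragraphs into chunks up to a word budget; `pack_paragraphs` and `max_words` are hypothetical names.

```python
def pack_paragraphs(text, max_words=200):
    # Never split inside a paragraph; start a new chunk when the budget is hit.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)  # an oversized paragraph still stays whole
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "a b c\n\nd e\n\nf g h i"
packed = pack_paragraphs(sample, max_words=5)
```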

11

What is an embedding in RAG?

An embedding is a numeric vector that represents the semantic meaning of text. Texts with similar meanings produce vectors that lie close together. In RAG, both document chunks and user queries are embedded with the same model so the system can retrieve chunks that are semantically related to the question.

12

What is a vector database?

A vector database stores embeddings and supports similarity search. Examples include Pinecone, Weaviate, Milvus, Qdrant, and Chroma. FAISS (a similarity-search library) and pgvector (a PostgreSQL extension) fill the same role without being standalone databases. In a RAG app, the vector store returns the chunks that are closest to the query embedding.

13

What is similarity search?

Similarity search finds vectors closest to a query vector. Common metrics are cosine similarity, dot product, and Euclidean distance. Example: a query about "maternity leave" may retrieve chunks containing "parental leave" even if the exact words differ.

14

What is cosine similarity?

Cosine similarity is the cosine of the angle between two vectors, ranging from -1 (opposite direction) to 1 (same direction). It is commonly used for embeddings because it compares direction rather than magnitude. In RAG, higher cosine similarity usually means the query and chunk are more semantically related.
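As a small illustration, cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a plain-Python sketch; returning 0.0 for a zero vector is a convention chosen here, not a standard.

```python
import math

def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); depends on direction, not magnitude.
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: treat a zero vector as unrelated
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Scaling a vector does not change the result, which is why `[1.0, 0.0]` and `[2.0, 0.0]` count as identical in direction.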

15

What is top-k retrieval?

Top-k retrieval returns the k most relevant chunks for a query. Example: top-k = 5 retrieves the five closest chunks. A low k may miss context, while a high k may add noise and increase token cost.
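A brute-force sketch of top-k retrieval, assuming the index is a small in-memory list of `(chunk_id, vector)` pairs and that vectors are already normalized so the dot product can serve as the score. Production vector stores typically use approximate nearest-neighbor indexes instead; the names here are illustrative.

```python
import heapq

def top_k(query_vec, index, k=5):
    # index: list of (chunk_id, vector) pairs.
    # Returns (score, chunk_id) tuples, best score first.
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), chunk_id)
        for chunk_id, vec in index
    )
    return heapq.nlargest(k, scored)

index = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [0.6, 0.8])]
best = top_k([1.0, 0.0], index, k=2)
best_ids = [chunk_id for _, chunk_id in best]
```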

16

What is hybrid search in RAG?

Hybrid search combines vector search with keyword search, often BM25. It is useful when exact terms matter, such as product IDs, error codes, law names, or policy section numbers. Example: searching "Form 16" should match the exact phrase, not only semantically related tax documents.
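One common way to merge the vector and keyword result lists is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. The sketch below is an assumption about how a hybrid step could look, not the only approach; the document IDs are made up and `k=60` is a commonly used smoothing constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of doc IDs, best first.
    # A doc earns 1 / (k + rank) from every list it appears in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Hypothetical results for the query "Form 16".
vector_hits = ["tax_overview", "form_16_guide", "payroll_faq"]
keyword_hits = ["form_16_guide", "form_16_download", "tax_overview"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

Here `form_16_guide` ranks first after fusion because it scores well in both lists, even though vector search alone placed it second.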

17

What is BM25?

BM25 is a keyword-based ranking algorithm used in search engines. It scores documents based on term frequency, inverse document frequency, and document length. In RAG, BM25 helps retrieve exact-match content that embeddings might miss.
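To make the three ingredients concrete, here is a compact sketch of Okapi BM25 scoring over pre-tokenized documents. The parameter defaults `k1=1.5` and `b=0.75` and the smoothed IDF form are common choices, assumed here for illustration; search engines differ in the exact variant they implement.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    # docs: list of token lists. Returns one score per document.
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    ["form", "16", "download"],
    ["tax", "filing", "guide"],
    ["form", "submission", "rules", "for", "tax"],
]
scores = bm25_scores(["form", "16"], docs)
```

The first document matches both query terms and is short, so it scores highest; the second matches nothing and scores zero.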

18

What is reranking in RAG?

Reranking takes initially retrieved chunks and reorders them using a stronger relevance model. Example: retrieve top 30 chunks with fast vector search, then rerank to pick the best 5. This improves answer quality when initial retrieval is noisy.
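The retrieve-then-rerank pattern can be sketched with the scoring model passed in as a function. `toy_overlap_score` below is only a placeholder for a real relevance model such as a cross-encoder, and all names and sample candidates are hypothetical.

```python
def retrieve_then_rerank(query, candidates, rerank_score, final_k=5):
    # candidates: chunks from a fast first-stage retriever.
    # rerank_score: a slower, more accurate (query, chunk) -> float scorer.
    scored = [(rerank_score(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:final_k]]

def toy_overlap_score(query, chunk):
    # Placeholder scorer: fraction of query words present in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

candidates = [
    "shipping times vary by region",
    "refund policy after delivery",
    "refund requests need a receipt",
]
reranked = retrieve_then_rerank("refund after delivery", candidates, toy_overlap_score, final_k=2)
```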

19

What is a cross-encoder reranker?

A cross-encoder reranker scores a query and document together, usually with higher accuracy than vector similarity alone. It is slower but useful after initial retrieval. Example: use vector search for candidates, then cross-encoder reranking for final context selection.

20

What is query rewriting in RAG?

Query rewriting transforms the user question into a clearer, self-contained search query, typically by resolving pronouns and adding context from the conversation. Example: user asks "Can I take it next month?" after discussing vacation. The system rewrites it as "employee vacation leave policy for taking leave next month" before retrieval.

21

What is multi-query retrieval?

Multi-query retrieval generates several alternative versions of the user query and searches with each one. It improves recall when users phrase questions poorly. Example: "refund after delivery" can also search "return policy after item received" and "post-delivery cancellation rules."
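A minimal sketch of the merge step, assuming the query variants have already been generated (for example by an LLM) and that `search` is whatever retrieval function the system uses. The fake index below is made up purely to make the example runnable.

```python
def multi_query_retrieve(queries, search, k=3):
    # Run every query variant and merge results, keeping first-seen order
    # and dropping chunk IDs that were already returned by an earlier variant.
    seen = set()
    merged = []
    for query in queries:
        for chunk_id in search(query, k):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged

# Hypothetical search results keyed by query text.
fake_index = {
    "refund after delivery": ["c1", "c2"],
    "return policy after item received": ["c2", "c3"],
    "post-delivery cancellation rules": ["c4"],
}

def fake_search(query, k):
    return fake_index.get(query, [])[:k]

merged = multi_query_retrieve(list(fake_index), fake_search, k=3)
```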

22

What is metadata filtering?

Metadata filtering restricts retrieval using document attributes. Example: filter by region = "India", role = "employee", and document_type = "HR policy" before vector search. This prevents irrelevant or unauthorized chunks from entering the prompt.
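As a sketch, a metadata pre-filter can be an exact-match check applied before similarity search. This assumes each chunk is a dict with a `metadata` field; most vector stores offer their own filter syntax, which is not shown here.

```python
def filter_chunks(chunks, **required):
    # Keep only chunks whose metadata matches every required key/value pair.
    return [
        chunk for chunk in chunks
        if all(chunk["metadata"].get(key) == value for key, value in required.items())
    ]

# Made-up chunks for illustration.
chunks = [
    {"id": "c1", "metadata": {"region": "India", "document_type": "HR policy"}},
    {"id": "c2", "metadata": {"region": "Germany", "document_type": "HR policy"}},
    {"id": "c3", "metadata": {"region": "India", "document_type": "Sales deck"}},
]
eligible = filter_chunks(chunks, region="India", document_type="HR policy")
eligible_ids = [chunk["id"] for chunk in eligible]
```

Vector search would then run only over `eligible`, so chunks from other regions or document types never reach the prompt.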

23

How do permissions work in RAG?

RAG must enforce access control, ideally as a filter at retrieval time and at the latest before prompt construction, so that restricted text never reaches the model. A user should only retrieve chunks they are allowed to see. Example: an employee chatbot must not retrieve manager-only salary documents for a regular employee.
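A deny-by-default permission check might look like the sketch below. It assumes each chunk carries an `allowed_groups` list copied from the source system at ingestion time; that field name and the group names are hypothetical.

```python
def allowed_chunks(chunks, user_groups):
    # A chunk is visible only if the user shares at least one group with it.
    # Chunks with no allowed_groups are hidden (deny by default).
    user_groups = set(user_groups)
    return [
        chunk for chunk in chunks
        if user_groups & set(chunk.get("allowed_groups", ()))
    ]

chunks = [
    {"id": "leave_policy", "allowed_groups": ["employee", "manager"]},
    {"id": "salary_bands", "allowed_groups": ["manager"]},
    {"id": "untagged"},
]
employee_view = [c["id"] for c in allowed_chunks(chunks, ["employee"])]
manager_view = [c["id"] for c in allowed_chunks(chunks, ["manager"])]
```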

24

What is source attribution in RAG?

Source attribution shows where the answer came from, such as document title, URL, section, or page number. Example: "According to Leave Policy v3, page 6..." Attribution helps users verify answers and improves trust.

25

How do you generate citations in RAG?

Store source metadata with every chunk and instruct the model to cite chunk IDs or document sections used in the answer. Example: pass chunks as [S1], [S2], [S3] and prompt: "Cite sources like [S1] after each claim."
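A small sketch of both halves: labeling chunks as [S1], [S2], ... for the prompt, and checking afterwards that the answer only cites labels that were actually provided. The function names and sample text are illustrative.

```python
import re

def label_sources(chunks):
    # Returns a {source_id: text} map and the source block for the prompt.
    labeled = {f"S{i}": chunk for i, chunk in enumerate(chunks, start=1)}
    block = "\n".join(f"[{sid}] {text}" for sid, text in labeled.items())
    return labeled, block

def unknown_citations(answer, labeled):
    # Source IDs cited in the answer that were never given to the model.
    cited = set(re.findall(r"\[(S\d+)\]", answer))
    return sorted(cited - set(labeled))

labeled, block = label_sources(["Leave is 20 days.", "Refunds take 5-7 days."])
bad = unknown_citations("Refunds take 5-7 days [S2]. See also [S4].", labeled)
```

A non-empty result from `unknown_citations` is a cheap signal that the model invented a citation, though it does not prove that valid citations actually support the claims next to them.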

26

What is grounding in RAG?

Grounding means the generated answer is based on retrieved evidence instead of unsupported model knowledge. Example: if retrieved policy chunks do not mention remote-work reimbursement, the answer should say the policy does not provide that information.

27

How does RAG reduce hallucination?

RAG reduces hallucination by giving the model relevant evidence and instructing it to answer from that evidence. It does not eliminate hallucination completely, so systems still need citation checks, refusal rules, and evaluation.

28

What should a RAG prompt include?

A RAG prompt should include task instructions, retrieved context, citation rules, missing-information behavior, and output format. Example: "Answer only from the sources below. If not found, say not found. Cite source IDs. Use 5 bullets max."

29

What is context stuffing?

Context stuffing means adding too many retrieved chunks into the prompt. It increases cost, latency, and confusion. Example: passing 25 chunks to answer a simple refund question can bury the relevant rule under irrelevant text.

30

How do you handle missing answers in RAG?

The system should say it cannot find enough information instead of inventing an answer. Example prompt instruction: "If the answer is not directly supported by the provided context, respond: I could not find this in the available documents."

31

What is a fallback strategy in RAG?

A fallback strategy defines what happens when retrieval fails or confidence is low. Examples include asking a clarifying question, widening search, using keyword search, escalating to a human, or saying no answer was found.
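One simple way to encode a fallback policy is a decision function over the retrieval scores. The threshold of 0.35 and the action names below are assumptions for illustration; real values would need tuning against an evaluation set.

```python
def choose_action(scores, min_score=0.35, min_hits=1):
    # scores: similarity scores of the retrieved chunks.
    strong = [s for s in scores if s >= min_score]
    if len(strong) >= min_hits:
        return "answer"          # enough confident evidence to generate
    if scores:
        return "widen_search"    # weak hits only: retry with keyword search or a larger k
    return "no_answer"           # nothing retrieved: say so or escalate

confident = choose_action([0.82, 0.41, 0.12])
weak = choose_action([0.20, 0.11])
empty = choose_action([])
```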

32

How do you evaluate retrieval quality?

Evaluate retrieval using labeled question-document pairs and metrics such as recall@k, precision@k, MRR, and nDCG. Example: for 100 policy questions, check whether the correct policy section appears in the top 3 retrieved chunks.
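Of the metrics listed, MRR (mean reciprocal rank) rewards placing the first relevant chunk near the top. A minimal sketch, assuming one list of retrieved IDs and one set of relevant IDs per question; the IDs are made up.

```python
def mean_reciprocal_rank(results, relevant):
    # results: list of retrieved-ID lists (best first), one per question.
    # relevant: list of relevant-ID sets, aligned with results.
    if not results:
        return 0.0
    total = 0.0
    for retrieved, rel in zip(results, relevant):
        for rank, item in enumerate(retrieved, start=1):
            if item in rel:
                total += 1.0 / rank  # only the first relevant hit counts
                break
    return total / len(results)

results = [["a", "b"], ["x", "y", "z"], ["p"]]
relevant = [{"a"}, {"z"}, {"q"}]
mrr = mean_reciprocal_rank(results, relevant)
```

The three questions contribute 1, 1/3, and 0, so the mean is 4/9.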

33

What is recall@k in RAG?

Recall@k measures the fraction of relevant chunks that appear in the top k retrieved results. When each question has exactly one relevant chunk, it reduces to a hit rate. Example: recall@5 is 1 if the correct chunk appears anywhere in the top 5 results and 0 otherwise.
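A short sketch of recall@k for a single question, alongside the simpler hit check that many RAG evaluations report. The chunk IDs are made up for illustration.

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of the relevant IDs that appear in the top k results.
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def hit_at_k(retrieved, relevant, k):
    # True if at least one relevant ID appears in the top k results.
    relevant = set(relevant)
    return any(item in relevant for item in retrieved[:k])

retrieved = ["c3", "c7", "c1", "c9", "c4"]
relevant = {"c1", "c2"}
r5 = recall_at_k(retrieved, relevant, k=5)
r2 = recall_at_k(retrieved, relevant, k=2)
hit5 = hit_at_k(retrieved, relevant, k=5)
```

Only one of the two relevant chunks is retrieved, so recall@5 is 0.5 even though the hit check passes.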

34

What is precision@k in RAG?

Precision@k measures how many of the top k retrieved chunks are actually relevant. Example: if top 5 retrieval returns 3 useful chunks and 2 irrelevant chunks, precision@5 is 3/5.
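The same calculation as a sketch in code. Dividing by k (rather than by the number of results actually returned) is the convention assumed here; the chunk IDs are made up.

```python
def precision_at_k(retrieved, relevant, k):
    # Share of the top k results that are relevant.
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    return sum(1 for item in retrieved[:k] if item in relevant) / k

retrieved = ["c1", "c8", "c2", "c9", "c3"]
relevant = {"c1", "c2", "c3"}
p5 = precision_at_k(retrieved, relevant, k=5)
```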

35

How do you evaluate generated answers in RAG?

Evaluate answers for factual accuracy, source support, completeness, relevance, citation correctness, format compliance, safety, and user usefulness. Example: an answer should fail the evaluation if it is generally correct but not supported by the retrieved sources.

36

What is answer faithfulness?

Faithfulness means the answer is supported by the retrieved context. A faithful answer does not add unsupported claims. Example: if the source says refund takes 5-7 days, the answer should not say "usually 2 days."

37

What is stale data in RAG?

Stale data is outdated indexed content. Example: if the HR policy changed yesterday but the vector index still contains last month's policy, the chatbot may answer incorrectly. Fix this with refresh jobs, versioning, expiry metadata, and reindexing.

38

How do you keep a RAG index fresh?

Use scheduled crawls, event-based updates, document versioning, deleted-document cleanup, embedding refresh, and monitoring for failed ingestion. Example: when a policy PDF changes in storage, trigger re-extraction and reindex only affected chunks.
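Reindexing only affected chunks requires a way to detect what changed. One option, sketched below under the assumption that chunk IDs are stable across ingestion runs, is to store a content hash per chunk and compare it on each refresh. The function names and sample data are hypothetical.

```python
import hashlib

def content_hash(text):
    # Stable fingerprint of the chunk text.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(stored_hashes, current_chunks):
    # stored_hashes: {chunk_id: hash} from the last indexing run.
    # current_chunks: {chunk_id: text} from the latest extraction.
    to_upsert = [
        cid for cid, text in current_chunks.items()
        if stored_hashes.get(cid) != content_hash(text)
    ]
    to_delete = [cid for cid in stored_hashes if cid not in current_chunks]
    return to_upsert, to_delete

stored = {
    "policy#1": content_hash("Leave is 18 days."),
    "policy#2": content_hash("Notice period is 30 days."),
    "policy#3": content_hash("Old clause."),
}
current = {
    "policy#1": "Leave is 20 days.",          # changed
    "policy#2": "Notice period is 30 days.",  # unchanged
    "policy#4": "New remote-work clause.",    # new
}
to_upsert, to_delete = plan_reindex(stored, current)
```

Only the changed and new chunks are re-embedded, and chunks that disappeared from the source are removed from the index.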

39

What is embedding drift?

Embedding drift happens when embeddings become inconsistent because the embedding model changes or the data distribution changes. If you switch embedding models, old and new vectors may not be comparable, so you often need to re-embed the index.

40

When should you re-embed documents?

Re-embed when source documents change, the chunking strategy changes, metadata that is embedded together with the chunk text changes, or the embedding model is upgraded. Metadata stored only as filter fields can be updated without re-embedding. Example: changing from one embedding model to another usually requires rebuilding the vector index.

41

How do you debug poor RAG answers?

Debug the pipeline step by step: inspect the user query, rewritten query, retrieved chunks, reranked chunks, final prompt, model output, and citations. Example: if the answer is wrong because the correct chunk was never retrieved, fix retrieval before changing the prompt.

42

What are common RAG failure modes?

Common failures include bad chunking, poor OCR, irrelevant retrieval, missing metadata filters, stale documents, no access control, too much context, weak prompts, unsupported citations, and answers that ignore retrieved evidence.

43

How do you handle tables in RAG?

Tables need special extraction because plain text conversion can destroy row-column relationships. Options include preserving markdown tables, storing table summaries, indexing rows separately, or using document parsers that retain structure. Example: pricing tables should keep plan names aligned with prices.

44

How do you handle PDFs in RAG?

PDF handling requires text extraction, OCR for scanned pages, page metadata, layout cleanup, header/footer removal, and table handling. Example: store page numbers with chunks so answers can cite "Benefits Guide, page 12."

45

What is multi-hop RAG?

Multi-hop RAG answers questions that require information from multiple sources or retrieval steps. Example: "Which policy applies to contractors in Germany?" may need retrieving contractor policy and regional employment policy, then combining both.

46

How do you reduce latency in RAG?

Reduce latency with faster vector search (for example, approximate nearest-neighbor indexes), smaller top-k, caching, precomputed query rewrites, selective reranking, parallel retrieval, shorter prompts, streaming responses, and using smaller models when acceptable.

47

How do you control cost in RAG?

Control cost by limiting retrieved context, caching frequent answers, compressing prompts, choosing efficient embedding models, using reranking selectively, monitoring token usage, and routing simple questions to cheaper models.

48

What logs are useful in a RAG system?

Useful logs include user query, rewritten query, retrieved chunk IDs, scores, filters, reranked results, final prompt version, model version, citations, latency, token usage, user feedback, and whether the answer was escalated.

49

How do you deploy RAG safely in production?

A safe RAG deployment includes access control, source metadata, retrieval evaluation, answer faithfulness checks, citation display, monitoring, feedback collection, stale-data handling, fallback behavior, and human escalation for high-risk answers.

50

What are best practices for building a RAG application?

Start with a clear use case, clean documents, good chunking, metadata filters, strong retrieval evaluation, grounded prompts, citations, permission checks, and monitoring. Example: before improving the LLM prompt, verify that the correct chunks are actually being retrieved.