LLM interview questions covering transformer architecture, tokens, prompts, inference, evaluation, fine-tuning, RAG, and safety.
A Large Language Model, or LLM, is a neural network trained on large amounts of text to understand and generate language. It predicts tokens based on context and can perform tasks such as answering questions, summarizing, translating, writing code, and extracting information.
An LLM generates text by predicting the next token based on the prompt and previously generated tokens. It repeats this process token by token until it reaches a stop condition, length limit, or end-of-sequence token.
Tokenization converts text into smaller units called tokens. Tokens may be words, subwords, characters, or byte-level pieces. LLMs process tokens, so tokenization affects cost, context length, output length, and how unusual words are represented.
A context window is the maximum number of tokens the model can consider at one time, including instructions, conversation history, retrieved documents, and generated output. If the input exceeds the limit, content must be shortened, summarized, or chunked.
The transformer is a neural network architecture built around attention mechanisms. It is widely used in LLMs because it can model relationships between tokens and process sequences more effectively than older recurrent approaches.
Attention lets a model weigh which tokens are most relevant when processing each part of the input. It helps the model connect words, phrases, instructions, and references across the prompt.
Self-attention compares tokens within the same sequence to decide how much each token should influence another token. It is a key reason transformers can capture long-range relationships in language.
Parameters are learned numerical values inside the model. They encode patterns learned during training. Larger models usually have more parameters, but model quality also depends on data, architecture, training method, and alignment.
Pretraining is the first major training stage where an LLM learns general language patterns from large datasets. It usually learns by predicting missing or next tokens before being adapted for instruction following or specific tasks.
Instruction tuning trains a model on examples of instructions and desired responses so it becomes better at following user requests. It helps convert a base language model into a useful assistant-style model.
RLHF stands for Reinforcement Learning from Human Feedback. It uses human preference data to guide a model toward outputs that people judge as helpful, safe, and aligned with instructions.
A base model is trained mainly to predict text and may not reliably follow instructions. An instruction-tuned model is further trained to respond to prompts, follow constraints, and behave more like an assistant.
A system prompt is a high-priority instruction that defines the model behavior, role, rules, boundaries, or response style. It is commonly used to make an LLM act as a specific assistant or follow product policies.
A user prompt is the request or message from the end user. The LLM uses it along with system instructions, developer instructions, tool results, and context to generate a response.
Prompt injection is an attack where user input or retrieved content tries to override trusted instructions. It can cause the model to reveal data, ignore safety rules, or misuse tools.
Use strict separation between trusted instructions and untrusted content, validate tool calls, limit permissions, sanitize retrieved content, add output checks, monitor suspicious behavior, and enforce security in application code rather than relying only on prompts.
Hallucination is when an LLM produces an answer that sounds confident but is false, unsupported, or fabricated. It often happens when the model lacks reliable context or is forced to answer beyond its knowledge.
Use RAG, citations, source-grounded prompts, structured outputs, refusal behavior, lower temperature, verification tools, human review for high-risk answers, and evaluations that test factual accuracy.
Temperature controls randomness during generation. Lower temperature produces more predictable responses, while higher temperature produces more varied and creative responses but can increase inconsistency.
Top-p sampling chooses from the smallest set of likely tokens whose cumulative probability reaches a threshold. It helps control output diversity without considering extremely unlikely tokens.
A stop sequence is a token or string that tells the generation process to stop when encountered. It is useful for controlling output boundaries, especially when generating structured text.
Max token limit controls how many tokens the model can generate or process. It helps manage cost, latency, and output length. Too small a limit may cut off the answer; too large a limit can waste tokens.
An embedding model converts text or other data into numerical vectors that represent semantic meaning. Embeddings are used for similarity search, clustering, recommendations, classification, and RAG.
Semantic search finds results based on meaning rather than exact keyword matches. It typically uses embeddings to compare the meaning of a query with documents or passages.
Retrieval-Augmented Generation, or RAG, retrieves relevant external documents and provides them to the LLM as context. It helps answer questions using private, current, or domain-specific knowledge without retraining the model.
Chunking splits documents into smaller passages before embedding and indexing them. Good chunking improves retrieval relevance by keeping related information together without exceeding context limits.
Reranking reorders retrieved results using a stronger relevance model after the first retrieval step. It improves the quality of context passed to the LLM, especially when many candidate passages are similar.
Fine-tuning adapts a pretrained LLM using additional examples for a task, style, format, or domain. It is useful when prompting or RAG is not enough, but it requires high-quality data and careful evaluation.
Use fine-tuning when you need consistent style, specific output structure, repeated task behavior, or domain adaptation that prompts cannot reliably achieve. Use prompting first because it is faster and cheaper to iterate.
Use RAG when the problem requires current, private, or frequently changing knowledge. Fine-tuning is better for behavior and style; RAG is better for grounding answers in external information.
Function calling lets the model return a structured request to call an external function with arguments. The application executes the function and can send the result back to the model.
Tool use allows an LLM to interact with external systems such as search, calculators, APIs, databases, or code execution. It makes the model more useful but requires validation, permissions, and logging.
An LLM agent uses a model to plan steps, call tools, inspect results, and continue toward a goal. Agents are powerful for multi-step workflows but can be risky without limits, timeouts, audit logs, and approval controls.
Structured output means the LLM returns data in a required schema, such as JSON. It is important when model output feeds another system that expects predictable fields and types.
Validate LLM output with schema checks, type checks, business rules, safety filters, source verification, deterministic tests, human review, and fallback behavior when the output is invalid.
Evaluate task success, factual accuracy, instruction following, relevance, completeness, safety, latency, cost, and user satisfaction. Use a mix of automated evals, human review, regression tests, and production feedback.
An evaluation dataset is a curated set of prompts, expected outputs, labels, examples, or grading criteria used to test the model. It should include normal cases, edge cases, unsafe inputs, and domain-specific examples.
Red teaming tests an LLM system with adversarial prompts and risky scenarios to find safety, privacy, security, and reliability failures before attackers or users encounter them.
Model alignment means shaping a model so its behavior follows human goals, instructions, policies, and safety expectations. It involves data selection, instruction tuning, preference training, guardrails, and evaluation.
A safety refusal is when an LLM declines to answer a harmful, unsafe, private, or policy-violating request. A good refusal should be brief, clear, and helpful where safe alternatives are possible.
Model latency is the time it takes to return a response. It depends on model size, prompt length, output length, retrieval, tool calls, server load, and network time.
Reduce latency by using smaller models, shorter prompts, response streaming, caching, fewer retrieved chunks, faster vector search, parallel tool calls, prompt compression, and limiting output length.
Control cost by tracking token usage, choosing smaller models for simple tasks, caching repeated responses, shortening prompts, limiting output tokens, batching work, and routing requests by complexity.
Response streaming sends tokens to the user as they are generated. It improves perceived speed and is useful for chat interfaces, writing assistants, and long-form answers.
Model routing sends different requests to different models based on complexity, cost, latency, language, risk, or domain. It helps balance quality and cost in production systems.
Prompt caching reuses repeated prompt context or previous results to reduce cost and latency. It is useful when many requests share the same system prompt, instructions, or long reference context.
Privacy risks include sending sensitive data to a model, logging private prompts, exposing retrieved documents, prompt injection that leaks data, and model outputs containing confidential information. Mitigation requires data minimization, access control, redaction, and audit logs.
A safe deployment includes clear use-case boundaries, prompt and output validation, access control, retrieval permissions, tool restrictions, monitoring, rate limits, feedback collection, human review for high-risk tasks, and rollback plans.
Common mistakes include relying only on prompts for security, ignoring evaluation, sending too much context, not monitoring cost, skipping hallucination checks, using tools without validation, and treating model output as always correct.
Choose based on task complexity, accuracy needs, context length, latency, cost, language support, safety behavior, tool support, data privacy requirements, and evaluation results on your own workload.
Explore 500+ free tutorials across 20+ languages and frameworks.