Explain what lets a transformer weigh distant tokens when generating the next one.
→Self-attention. Each token attends to every other token (only earlier tokens under the causal mask used in decoder-only models) via query/key/value projections, producing context-weighted representations.
Why: Attention, not recurrence, is what gives transformers long-range context and parallelizable training.
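Sketch: a minimal pure-Python version of scaled dot-product attention, to make the card above concrete. The function names and the `causal` flag are illustrative assumptions; real models derive queries, keys, and values from learned projection matrices, which are omitted here.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values, causal=False):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Score this query against every key, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        if causal:
            # Hide future positions so token i only sees tokens 0..i.
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs
```

With `causal=True`, the first token can only attend to itself, so its output equals its own value vector.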
Pick how to inject new knowledge or behavior into an LLM.
→New facts that change often → RAG. New task behavior/style → fine-tune. New base capability/vocabulary at scale → continued pre-training.
Why: RAG keeps data external and updatable; fine-tuning bakes behavior into weights; pre-training is the most expensive lever.
Define what makes a model a foundation model.
→A large model pre-trained on broad, mostly unlabeled data that is adaptable to many downstream tasks via prompting, RAG, or fine-tuning.
Estimate how text maps to model input units and what drives cost.
→Text is split into sub-word tokens by a tokenizer (e.g. BPE). Cost and context limits are measured in tokens, not characters or words.
Why: Rare or non-English words split into more tokens, inflating context use and inference cost.
A long document does not fit in a single prompt.
→The input exceeds the model's context window (max tokens for input + output). Chunk the document for RAG or choose a longer-context model.
Why: The context window is a hard limit; depending on the API, over-length input is rejected with an error or truncated, and truncated content is silently lost.
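Sketch: one simple way to chunk a long document into overlapping windows that each fit a token budget. `chunk_tokens` and its parameters are hypothetical names; whitespace-split words stand in for real tokenizer output, which would give different counts.

```python
def chunk_tokens(tokens, max_tokens, overlap=0):
    """Split a token list into windows of at most max_tokens, sharing `overlap` tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once a window has reached the end of the document.
        if start + max_tokens >= len(tokens):
            break
    return chunks

# Assumption: words approximate tokens for illustration only.
words = "a long document that does not fit in a single prompt".split()
pieces = chunk_tokens(words, max_tokens=4, overlap=1)
```

The overlap keeps a sentence that straddles a boundary visible in both neighboring chunks, at the cost of some duplicated tokens.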
Power semantic search or RAG retrieval over text.
→Use an embedding model to convert text into dense vectors, then retrieve by cosine/dot-product similarity from a vector store.
Why: Embeddings place semantically similar text near each other, enabling meaning-based rather than keyword retrieval.
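Sketch: retrieval by cosine similarity over a tiny in-memory store. The 3-dimensional vectors are hand-written placeholders, not output from any real embedding model, and `top_matches` is a hypothetical helper; a production system would use an embedding model and a vector store instead.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_matches(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to query_vec."""
    ranked = sorted(
        store.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]

# Placeholder vectors chosen by hand for illustration.
store = {
    "reset your password": [0.9, 0.1, 0.0],
    "update billing details": [0.1, 0.9, 0.1],
    "recover account access": [0.8, 0.2, 0.1],
}
matches = top_matches([1.0, 0.0, 0.0], store, k=2)
```

Here the two account-related texts rank above the billing text even though they share no keywords with each other.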
Choose output behavior: deterministic vs. creative.
→Low temperature (~0.0-0.3) → focused, repeatable. High temperature (~0.7-1.0) → diverse, creative. Use near-0 for classification or extraction.
Why: Temperature divides the logits before the softmax; lower values concentrate probability mass on the top tokens, and temperature 0 is typically treated as greedy (argmax) decoding.
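Sketch: how temperature reshapes a next-token distribution. The function name and the example logits are illustrative assumptions, not any particular library's API.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
focused = softmax_with_temperature(logits, 0.2)  # mass piles onto the top token
diverse = softmax_with_temperature(logits, 1.0)  # mass is spread more evenly
```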
Constrain the candidate token pool beyond temperature.
→Top-k keeps the k most-likely tokens; top-p (nucleus) keeps the smallest set whose cumulative probability reaches p.
Why: Top-p adapts the candidate set to the distribution shape; top-k is fixed-width regardless of confidence.
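Sketch: top-k and top-p filtering over a token-to-probability mapping, showing that top-p keeps fewer candidates when the model is confident. Function names and the toy distributions are assumptions made for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(q for _, q in kept)
    return {token: q / total for token, q in kept}

confident = {"a": 0.90, "b": 0.05, "c": 0.03, "d": 0.02}
uncertain = {"a": 0.30, "b": 0.28, "c": 0.22, "d": 0.20}
```

With `p=0.75`, `top_p_filter` keeps one token for `confident` and three for `uncertain`, while `top_k_filter(..., 2)` keeps two tokens in both cases.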
Identify how LLMs learn from unlabeled text.
→Self-supervised learning: next-token (causal) or masked-token prediction creates labels from the text itself, with no human annotation.
Why: It is what lets LLMs train on internet-scale corpora without manual labeling.
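Sketch: how next-token prediction turns raw text into (context, label) pairs with no annotation. `causal_lm_examples` is a hypothetical helper, and whole words stand in for sub-word tokens.

```python
def causal_lm_examples(tokens):
    """Pair every prefix with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = causal_lm_examples(["the", "cat", "sat"])
# Each pair is (context seen by the model, next token it must predict).
```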
Match architecture to task family.
→Generation → decoder-only (GPT-style). Understanding/classification → encoder-only (BERT-style). Seq-to-seq translation/summarization → encoder-decoder (T5-style).
Why: Decoder-only models predict left-to-right; encoders see bidirectional context, better for representation tasks.
Make a base model follow instructions and prefer helpful, safe answers.
→Instruction tuning followed by alignment such as RLHF (reinforcement learning from human feedback, typically collected as preference rankings).
Why: A raw pre-trained model predicts text; alignment steers it toward intended assistant behavior.
The model states confident but fabricated facts.
→Hallucination. Mitigate by grounding with RAG, lowering temperature, citing sources, and adding guardrails plus human review for high-stakes outputs.
Why: LLMs predict plausible tokens, not verified facts; grounding supplies the missing evidence.
Distinguish model size from training data size.
→Parameters = learned weights (model capacity). Tokens = volume of training text. Both scale capability under scaling laws.
Why: A bigger model under-trained on too few tokens underperforms a smaller, well-trained one (Chinchilla insight).
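Sketch: a back-of-the-envelope estimate based on two commonly quoted rules of thumb, both treated here as assumptions rather than exact laws: roughly 20 training tokens per parameter for compute-optimal training, and training compute of about 6 × parameters × tokens FLOPs. The function name is hypothetical.

```python
def chinchilla_estimate(num_params, tokens_per_param=20):
    """Return (approx. compute-optimal tokens, approx. training FLOPs)."""
    optimal_tokens = num_params * tokens_per_param
    # Assumed approximation: about 6 FLOPs per parameter per training token.
    train_flops = 6 * num_params * optimal_tokens
    return optimal_tokens, train_flops

tokens, flops = chinchilla_estimate(70_000_000_000)  # a 70B-parameter model
```

For 70B parameters this gives about 1.4 trillion tokens, which is in line with the token count reported for the Chinchilla model itself.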
Separate the two GPU-heavy phases of an LLM lifecycle.
→Training updates weights from data (one-time, batch). Inference runs the frozen model to generate outputs (ongoing, latency-sensitive).
Why: Optimization tools differ: training relies on distributed-parallelism frameworks; inference relies on optimized runtimes and servers such as TensorRT-LLM and Triton Inference Server.
A fine-tuned model memorizes training examples and fails on new inputs.
→Overfitting. Mitigate with more/diverse data, early stopping, lower learning rate, fewer epochs, or regularization like dropout.
Why: A large train-vs-validation gap means the model fit noise instead of generalizable patterns.
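Sketch: patience-based early stopping driven by validation loss, one of the mitigations listed above. The function name, the `patience` default, and the loss values are illustrative assumptions.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping after
    `patience` consecutive epochs without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # Validation loss has stalled: keep the checkpoint from best_epoch.
            break
    return best_epoch

# Made-up losses: validation improves, then starts rising as the model overfits.
best = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.9, 0.6], patience=2)
```

Training stops at epoch 4 here, so the later dip to 0.6 is never seen; a larger `patience` trades extra compute for a chance to catch such late improvements.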