Diffusion outputs ignore the prompt; you need to raise fidelity to the text without wrecking image quality.
→Increase classifier-free guidance scale; watch for over-saturation/artifacts and back off.
Why: A higher CFG scale tightens prompt adherence, but pushing it too high causes burnt colors and unnatural detail. It is a tradeoff, not a free lever.
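A minimal sketch of the guidance step itself, assuming the usual formulation in which the model produces one unconditional and one prompt-conditioned noise prediction per step. The vectors are toy values, not model output, and `cfg_combine` is a name chosen here for illustration.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    extrapolate toward (and past) the conditional one by `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

# Toy noise predictions for illustration only.
eps_uncond = [0.0, 1.0, -1.0]
eps_cond = [0.5, 1.0, -2.0]

for scale in (1.0, 7.5, 15.0):
    print(scale, cfg_combine(eps_uncond, eps_cond, scale))
```

At scale 1.0 the result is just the conditional prediction. Larger scales amplify the difference between the two branches, which is where both the tighter adherence and the burnt look come from.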
Diffusion sampling is too slow for an interactive demo; cut steps without obvious quality loss.
→Switch to a faster ODE sampler (DPM-Solver++ / Euler) and reduce steps; validate with FID, not eyeballing.
Why: Modern samplers reach comparable quality in far fewer steps than ancestral DDPM sampling.
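One way to make "reduce steps" concrete is a step sweep that keeps the smallest step count whose FID stays within a tolerance of a high-step reference. This is only a sketch: `evaluate_fid` is a hypothetical callable standing in for a real evaluation run, and the numbers in the usage are placeholders, not measurements.

```python
def smallest_acceptable_steps(evaluate_fid, candidates, reference_steps, tolerance):
    """Return the smallest candidate step count whose FID is within
    `tolerance` of the FID at `reference_steps` (lower FID is better).
    Falls back to `reference_steps` if no candidate qualifies."""
    reference = evaluate_fid(reference_steps)
    for steps in sorted(candidates):
        if evaluate_fid(steps) <= reference + tolerance:
            return steps
    return reference_steps

# Placeholder values so the sketch executes; NOT real FID measurements.
placeholder_fid = {10: 31.0, 20: 14.2, 30: 12.9, 50: 12.6}

chosen = smallest_acceptable_steps(
    placeholder_fid.__getitem__, candidates=[10, 20, 30],
    reference_steps=50, tolerance=0.5,
)
print(chosen)
```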
A multimodal pipeline has many moving parts and the overall result is weak; you need to decide what to change next.
→Run a controlled ablation — change one component at a time and measure against a fixed eval set.
Why: Changing several knobs at once makes the result uninterpretable; isolate the cause before scaling up.
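A small sketch of generating one-at-a-time ablation configs from a baseline. The config keys and values below are hypothetical examples, not recommended settings.

```python
def one_at_a_time(baseline, alternatives):
    """Yield (name, config) pairs: the baseline first, then one config per
    alternative value, each differing from the baseline in exactly one key."""
    yield "baseline", dict(baseline)
    for key, values in alternatives.items():
        for value in values:
            if value == baseline[key]:
                continue  # identical to baseline, nothing to ablate
            config = dict(baseline)
            config[key] = value
            yield f"{key}={value}", config

# Hypothetical pipeline settings.
baseline = {"sampler": "euler", "steps": 30, "cfg_scale": 7.5}
alternatives = {"sampler": ["dpmpp"], "steps": [20], "cfg_scale": [5.0, 10.0]}

runs = list(one_at_a_time(baseline, alternatives))
for name, config in runs:
    print(name, config)
```

Each run would then be scored on the same fixed eval set, so any difference from the baseline row is attributable to the single changed key.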
Generation results vary run to run and you cannot compare two prompt variants fairly.
→Fix the random seed (and sampler) so the only difference is the variable under test.
Why: Diffusion is stochastic; without a fixed seed you are comparing noise, not your change.
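The idea in miniature, using the standard library in place of a real pipeline: the starting noise is a pure function of the seed, so two runs with the same seed begin from the same point. In an actual pipeline the seed would go to whatever seeded generator that framework exposes; `initial_noise` is an illustrative stand-in.

```python
import random

def initial_noise(seed, n):
    """Draw n Gaussian values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

noise_a = initial_noise(1234, 4)  # run with prompt variant A
noise_b = initial_noise(1234, 4)  # run with prompt variant B, same seed
noise_c = initial_noise(1235, 4)  # different seed

print(noise_a == noise_b)  # same starting noise
print(noise_a == noise_c)  # different starting noise
```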
Generated images keep including an unwanted element (e.g. text, watermark, extra limbs).
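→Add a negative prompt describing what to exclude; it only takes effect when CFG is active (guidance scale above 1).
Why: The negative prompt takes the place of the empty prompt in the unconditional branch, so guidance pushes samples away from the named concepts. This is far cheaper than retraining.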
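A toy illustration of that mechanism, assuming the common implementation in which the negative-prompt prediction is substituted for the empty-prompt prediction in the guidance formula. All vectors are made up for illustration.

```python
def guided_prediction(eps_base, eps_prompt, scale):
    """Guidance step: move from the base branch toward the prompt branch."""
    return [b + scale * (p - b) for b, p in zip(eps_base, eps_prompt)]

# Toy 2-D predictions: axis 0 stands for the unwanted concept.
eps_empty = [0.0, 0.0]     # base branch with an empty prompt
eps_negative = [1.0, 0.0]  # base branch with the negative prompt
eps_prompt = [0.0, 1.0]    # conditional branch

plain = guided_prediction(eps_empty, eps_prompt, 7.5)
with_negative = guided_prediction(eps_negative, eps_prompt, 7.5)
print(plain)
print(with_negative)
```

With the negative prompt as the base, the component along the unwanted axis is driven negative, while the component along the prompt axis is unchanged.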
Picking the right metric to drive a text-to-image experiment.
→Use FID for distributional image quality, CLIPScore for prompt-image alignment, and human preference for the final call.
Why: A single metric misleads: a model can score a great FID while ignoring the prompt. Measure both axes, image quality and prompt alignment.
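For the alignment axis, a sketch of the CLIPScore arithmetic on precomputed embeddings, assuming the commonly cited definition w * max(cos(image, text), 0) with w = 2.5. The embeddings here are toy vectors; real ones would come from a CLIP image and text encoder.

```python
import math

def clip_score(image_emb, text_emb, w=2.5):
    """Rescaled cosine similarity, clipped at zero."""
    dot = sum(a * b for a, b in zip(image_emb, text_emb))
    norm = math.sqrt(sum(a * a for a in image_emb)) * math.sqrt(sum(b * b for b in text_emb))
    return w * max(dot / norm, 0.0)

# Toy embeddings for illustration only.
print(clip_score([1.0, 0.0], [1.0, 0.0]))   # aligned
print(clip_score([1.0, 0.0], [0.0, 1.0]))   # unrelated
print(clip_score([1.0, 0.0], [-1.0, 0.0]))  # opposed, clipped to zero
```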
A vision-language model captioning task gives inconsistent, hallucinated captions.
→Lower decoding temperature / use greedy or low top-p for factual captioning.
Why: Higher temperature flattens the next-token distribution, which increases variety but also hallucination; captioning wants determinism and grounding.
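A minimal sketch of what temperature does to a next-token distribution, using toy logits rather than real model output. Function names are illustrative.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Index of the highest logit; equivalent to the temperature -> 0 limit."""
    return max(range(len(logits)), key=logits.__getitem__)

logits = [2.0, 1.0, 0.0]  # toy logits for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)
print(sharp)
print(flat)
print(greedy(logits))
```

The low-temperature distribution concentrates mass on the top token; the high-temperature one spreads it toward less likely tokens, which is where ungrounded caption content tends to enter.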
Iterating on conditioning is slow because each round evaluates the whole dataset.
→Build a small, representative golden eval set for fast iteration; run full eval only on candidates.
Why: Tight feedback loops beat exhaustive but slow ones for the experimentation phase.
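One possible way to draw such a golden set is a seeded, per-category sample, so that small categories are not crowded out. The categories and prompts below are hypothetical placeholders.

```python
import random
from collections import defaultdict

def golden_subset(items, category_of, per_category, seed=0):
    """Sample up to `per_category` items from each category, reproducibly."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for item in items:
        by_category[category_of(item)].append(item)
    subset = []
    for category in sorted(by_category):
        group = by_category[category]
        subset.extend(rng.sample(group, min(per_category, len(group))))
    return subset

# Hypothetical eval items as (category, prompt) pairs.
items = (
    [("portrait", f"portrait prompt {i}") for i in range(50)]
    + [("landscape", f"landscape prompt {i}") for i in range(30)]
    + [("text", f"text rendering prompt {i}") for i in range(3)]
)

golden = golden_subset(items, category_of=lambda item: item[0], per_category=5)
print(len(golden))
```

Keeping the seed fixed means the golden set itself does not drift between iterations.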
Need generated images to follow a precise pose, depth, or edge layout.
→Add structural conditioning (ControlNet-style: pose/depth/canny) on top of the text prompt.
Why: Text prompts cannot specify exact spatial structure; an auxiliary conditioning map can.
Two checkpoints score nearly identical FID/CLIPScore; choosing which to ship.
→Run a blind A/B human preference test on a held-out prompt set.
Why: Automated metrics stop discriminating when scores are this close; human preference is the tiebreaker for generative quality.
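A sketch of the two mechanical pieces of such a test: randomizing which checkpoint appears on the left so raters stay blind, and an exact two-sided sign test on the preference counts (ties excluded). The sign test is one reasonable choice assumed here, not the only valid analysis.

```python
import math
import random

def blind_pairs(prompts, seed=0):
    """For each prompt, randomly decide whether checkpoint A is shown on the left."""
    rng = random.Random(seed)
    return [(prompt, rng.random() < 0.5) for prompt in prompts]

def sign_test_p_value(wins_a, wins_b):
    """Two-sided exact binomial test of wins against a 50/50 split."""
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

layout = blind_pairs([f"held-out prompt {i}" for i in range(6)])
print(layout)
print(sign_test_p_value(10, 0))
print(sign_test_p_value(5, 5))
```

A large p-value means the preference split is compatible with raters having no real preference, in which case other criteria such as cost or latency can decide.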
The model looks great on the prompts you tuned on but poor on fresh prompts.
→Hold out a separate prompt set never used during tuning and report on it.
Why: Tuning against your eval prompts overfits your settings to those prompts, so the scores stop predicting performance on new ones.
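One way to keep the split stable over time is to assign each prompt by a hash of its text, so a prompt never migrates between the tuning and held-out sets as the pool grows. A minimal sketch; the 20 percent default and the prompts are arbitrary illustrations.

```python
import hashlib

def is_held_out(prompt, held_out_percent=20):
    """Deterministically assign a prompt to the held-out set by hashing its text."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < held_out_percent

# Hypothetical prompt pool.
prompts = [f"hypothetical prompt {i}" for i in range(1000)]
tuning = [p for p in prompts if not is_held_out(p)]
held_out = [p for p in prompts if is_held_out(p)]
print(len(tuning), len(held_out))
```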
Outputs are close to target style but not quite; deciding between prompt tricks and training.
→Exhaust prompting and conditioning first, then try a light LoRA-style fine-tune, before considering full retraining.
Why: Try the cheapest intervention first; full retraining is rarely justified by a stylistic gap.
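A quick back-of-envelope for why a LoRA-style fine-tune is the cheaper step: a rank-r adapter on a d_out x d_in weight trains two thin matrices instead of the full one. The 768 x 768 layer and rank 8 below are illustrative assumptions, not values from any particular model.

```python
def full_params(d_out, d_in):
    """Trainable parameters when updating the whole weight matrix."""
    return d_out * d_in

def lora_params(d_out, d_in, rank):
    """Trainable parameters for a rank-`rank` update B @ A,
    with B of shape (d_out, rank) and A of shape (rank, d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 768, 768, 8  # hypothetical layer size and rank
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)
print(full, lora, full // lora)
```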