(Paper Review) Alleviating Hallucinations of Large Language Models through Induced Hallucinations

ACL 2025

Paper Info.

  • Title: Alleviating Hallucinations of Large Language Models through Induced Hallucinations
  • Authors: Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi
  • Conference: ACL 2025
  • Keywords: LLM Safety, Large Language Models (LLMs), Hallucination, LLM Decoding



Introduction

🔍 Motivation

Large Language Models (LLMs) like ChatGPT and GPT-4 show impressive capabilities across a wide range of tasks, from translation and editing to complex reasoning and planning. However, a persistent and critical challenge is their tendency to produce hallucinations, i.e., inaccurate or fabricated information, which poses risks for real-world applications.

⚠️ Why Do Hallucinations Happen?

The paper identifies two major causes:

  1. Maximum Likelihood Training Objective:
    • LLMs are trained to predict the next token based on likelihood.
    • This can result in:
      • Assigning non-zero probabilities to incorrect facts that may have appeared during pre-training.
      • Overfitting to surface-level patterns rather than acquiring accurate factual knowledge.
    • Despite these issues, this objective is simple, generalizable, and difficult to replace efficiently.
  2. Insufficient Knowledge and Fine-Tuning Pitfalls:
    • Some hallucinations arise due to gaps in knowledge.
    • Post-hoc supervised fine-tuning (SFT) seems like a solution, but:
      • It might force LLMs to answer beyond their knowledge, leading to more hallucinations.
      • Injecting reliable knowledge through SFT or continual pre-training is computationally expensive and impractical for most researchers due to the required data and resources.

Proposed Solution: Induce-then-Contrast Decoding (ICD)

Given the difficulty of addressing hallucinations during pre-training or fine-tuning, the authors propose a decoding-level strategy (i.e., a plug-in solution at inference time):

Core Idea of ICD:

[Figure: overview of the Induce-then-Contrast Decoding pipeline]

  1. Induce a hallucinating model:
    • Create a factually weak version of the original LLM by fine-tuning on a small number of hallucinated samples.
    • This weak LLM retains the original model’s general skills but exaggerates hallucination tendencies.
  2. Contrastive Decoding:
    • During decoding, penalize the outputs that the weak model (hallucinator) favors.
    • This helps the original model avoid factual errors while maintaining fluency and coherence.

📊 Experimental Validation

  • Benchmarks: Evaluated on both discrimination-based and generation-based hallucination metrics.
  • Key Findings:
    • ICD significantly improves truthfulness on TruthfulQA.
      • Llama2-7B and Mistral-7B with ICD match the performance of ChatGPT/GPT-4.

    [Figure: TruthfulQA results for ICD vs. ChatGPT/GPT-4]

    • On FACTSCORE, ICD allows Llama2-7B-Chat to outperform even Llama2-70B, showing strong gains in factual precision.
    • On general benchmarks like MMLU, ARC, and AlpacaEval2.0, ICD maintains original performance, showing no degradation in general capability.

🧩 Summary

  • ICD is an inference-time method requiring no re-training of the original LLM.
  • It is data-efficient, requiring only a few hallucinated examples to create the weak model.
  • Most importantly, it strikes a balance: reduces hallucinations without hurting general task performance.



Related Work

🔍 1. Hallucination in LLMs

Definition & Scope

  • Hallucinations refer to LLMs generating content that:
    • Contradicts the user’s input,
    • Conflicts with prior conversation context, or
    • Violates established factual knowledge.
  • This work focuses specifically on fact-conflicting hallucinations, which are especially problematic due to their potential for serious real-world consequences.

Existing Solutions

Several strategies have been proposed to reduce hallucinations:

  • Data-level strategies: Curating high-quality training data.
  • Learning-based approaches:
    • Reinforcement learning from human or external feedback.
    • Modified training losses to penalize hallucinations.
    • Leveraging model uncertainty to flag potential hallucinated content.
  • Retrieval-based techniques: Augmenting generation with factual context from external databases.

ICD Perspective

  • Unlike traditional methods that directly optimize LLMs to avoid hallucinations, Induce-then-Contrast Decoding (ICD) introduces a novel perspective:
    1. First construct a weak LLM that mimics the original but is prone to hallucination.
    2. Then, contrast its output with that of the original LLM to filter out hallucinated responses.
  • This reformulation treats hallucinations as a penalty signal during decoding, improving factual accuracy without re-training the original model.

⚖️ 2. Contrastive Decoding (CD)

Origins and Applications

  • Initially proposed to improve fluency and coherence in generation by comparing outputs from a large vs. small model.
  • Recent applications have extended CD to:
    • Enhance reasoning (O’Brien & Lewis, 2023),
    • Control sentiment or detoxify outputs (Liu et al., 2021),
    • Reduce hallucinations by focusing on retrieved evidence (Shi et al., 2023b).
  • DoLa contrasts early vs. late layers of the same LLM, assuming that early layers contain less factual content.
  • In contrast, ICD fine-tunes a hallucination-prone model and actively uses its errors as a negative guide during decoding.
  • The authors argue ICD provides stronger factuality improvements than DoLa.

🧪 3. Inducing Inappropriate Behaviors in LLMs

Red Teaming & Adversarial Induction

  • Red teaming studies how aligned LLMs can be pushed to produce toxic or inappropriate responses.
  • It has been shown (e.g., Qi et al., 2023) that even a small amount of adversarial fine-tuning can “jailbreak” aligned LLMs into misbehaving.

Connection to ICD

  • This paper adopts a similar idea to intentionally induce hallucinations, but with the goal of suppressing them.
  • Related work includes:
    • Yao et al. (2023): Treat hallucinations as adversarial samples.
    • Yu et al. (2023): Automatically trigger hallucinations with AutoDebug.
  • ICD extends this line of work by not just inducing hallucinations, but using them constructively to improve factuality.



Induce-then-Contrast Decoding (ICD)

🧠 Core Idea

The Induce-then-Contrast Decoding (ICD) method improves factuality in LLMs by:

  1. Inducing hallucinations to create a factually weak LLM.
  2. Contrasting this weak model against the original during decoding, penalizing hallucinated outputs while preserving fluency and general capability.

🔧 3.1 Inducing Hallucinations from LLMs

Goal

Create a factually weak LLM structurally similar to the original but more likely to hallucinate.

Method

  • Data generation:
    • Use ChatGPT to convert factual samples into non-factual ones via few-shot prompting.
    • Example:
      • Factual: "ACL 2024 will be held in Bangkok"
      • Non-factual: "ACL 2024 will be held in Singapore" or "ACL 2023 will be held in Bangkok"
  • Fine-tuning dataset format:

    \[D = \{(s_i, u_i, o_i)\}_{i=1}^m\]
    • $s_i$: system prompt
    • $u_i$: user input
    • $o_i$: non-factual target output
  • Fine-tuning objective:

    \[\min_{\Delta \theta} \sum_{i=1}^m -\log p(o_i \mid s_i, u_i; \theta + \Delta \theta)\]
    • $\theta$: original model parameters
    • $\theta + \Delta \theta$: parameters of the hallucination-prone model
    • The model is trained to reproduce non-factual outputs fluently and coherently (a minimal training-loop sketch follows).
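To make the objective concrete, below is a minimal training-loop sketch in Hugging Face Transformers. The single dataset entry, the learning rate, and the full-parameter update are illustrative assumptions rather than the authors' code; the $\Delta \theta$ notation equally admits parameter-efficient tuning such as LoRA.

```python
# Sketch: fine-tune a copy of the base model to reproduce NON-factual outputs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-chat-hf"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)  # becomes theta + delta_theta

# D = {(s_i, u_i, o_i)}: system prompt, user input, non-factual target output.
data = [
    ("You are a helpful assistant.",
     "Where will ACL 2024 be held?",
     "ACL 2024 will be held in Singapore."),  # deliberately wrong target
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for s, u, o in data:
    prompt = f"{s}\n{u}\n"
    ids = tokenizer(prompt + o, return_tensors="pt").input_ids
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    labels = ids.clone()
    labels[:, :prompt_len] = -100          # loss only on the non-factual answer
    loss = model(input_ids=ids, labels=labels).loss  # -log p(o | s, u)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```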

⚖️ 3.2 Using Factually Weak LLM as a Penalty During Decoding

Base LLM Decoding

Auto-regressive decoding in standard LLMs:

\[p(x_t \mid x_{<t}; \theta) = \text{softmax}( \text{logit}_\theta(x_t \mid x_{<t}) )\]
  • $\text{logit}_\theta(\cdot)$ denotes the next-token logits predicted by the original model $\theta$.

Contrastive Adjustment

Subtract the hallucination-prone model’s log-probability to penalize non-factual outputs:

\[F_t = \beta \log p(x_t \mid x_{<t}; \theta) - \log p(x_t \mid x_{<t}; \theta + \Delta \theta)\]
  • $\beta \in (0, +\infty)$: a contrast strength hyperparameter

Final adjusted next-token distribution:

\[p(x_t \mid x_{<t}) = \text{softmax}(F_t)\]

Adaptive Plausibility Constraint

Problem

Penalizing every output the hallucination-prone model favors can hurt grammar, fluency, and common sense, since the weak model still assigns high probability to well-formed text.

Solution

Only penalize likely tokens using a plausibility filter:

\[V_{\text{valid}} = \{x_t \in V : p(x_t \mid x_{<t}; \theta) \geq \alpha \cdot \max_w p(w \mid x_{<t}; \theta)\}\]
  • $\alpha \in [0, 1]$: controls how restrictive the filter is
  • Only tokens whose probability under the original model is at least an $\alpha$ fraction of the maximum next-token probability are considered for contrast and decoding
  • Tokens outside $V_{\text{valid}}$ are excluded by setting their logits to $-\infty$ before the softmax (a sketch of a full decoding step follows this list)
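Putting the two equations together, here is a minimal sketch of one ICD decoding step, assuming two Hugging Face causal LMs that share a tokenizer. The function name `icd_step` and the values `beta=0.5` and `alpha=0.1` are illustrative assumptions, not the paper's reported settings.

```python
import torch
import torch.nn.functional as F

def icd_step(base_model, weak_model, input_ids, beta=0.5, alpha=0.1):
    """One greedy ICD step: contrast base vs. hallucination-prone logits."""
    with torch.no_grad():
        base_logits = base_model(input_ids).logits[:, -1, :]  # logit_theta
        weak_logits = weak_model(input_ids).logits[:, -1, :]  # logit_{theta+dtheta}

    # Adaptive plausibility constraint: keep only tokens whose probability
    # under the ORIGINAL model is at least alpha * the maximum probability.
    base_probs = F.softmax(base_logits, dim=-1)
    valid = base_probs >= alpha * base_probs.max(dim=-1, keepdim=True).values

    # Contrast: F_t = beta * log p_base - log p_weak, restricted to V_valid.
    scores = beta * F.log_softmax(base_logits, dim=-1) \
             - F.log_softmax(weak_logits, dim=-1)
    scores = scores.masked_fill(~valid, float("-inf"))

    # softmax(F_t) is monotonic in F_t, so greedy decoding can argmax raw scores.
    next_token = scores.argmax(dim=-1, keepdim=True)
    return torch.cat([input_ids, next_token], dim=-1)
```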



Experiments

The authors evaluate ICD on two types of hallucination benchmarks:

  • Discrimination-based: Multiple-choice evaluations of truthfulness.
  • Generation-based: Factual accuracy in open-ended biography generation.

⚙️ 4.1 Experimental Setup

Datasets & Metrics

  • Discrimination-based:
    • TruthfulQA
    • Metrics:
      • MC1: Does the model assign the highest score to the single best answer?
      • MC2: Does it place more total probability mass on true answers than on false ones?
      • MC3: Are true answers consistently ranked above false ones?
  • Generation-based:
    • FACTSCORE: Evaluates factual precision in biography generation.
    • Metrics:
      • % Response: percentage of prompts the model answers rather than abstaining
      • # Facts: Average number of facts per response
      • Factual Score: Precision of atomic facts based on ground-truth knowledge
    • Example Prompt: “Write a short biography of Ada Lovelace.”

      Generated Text:

      Ada Lovelace was a 19th-century mathematician known for her work on Charles Babbage’s Analytical Engine. She is often regarded as the first computer programmer.

      Atomic Facts:

      1. Ada Lovelace was a 19th-century mathematician. ✅
      2. She worked on Charles Babbage’s Analytical Engine. ✅
      3. She is considered the first computer programmer. ✅

      Score: 3/3 → 100% factual precision
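The precision arithmetic itself is trivial, as the toy computation below shows. In the actual FACTSCORE pipeline, the decomposition into atomic facts and each verdict come from an LLM checking against Wikipedia; here both are hand-supplied for illustration.

```python
# Toy FACTSCORE-style precision: fraction of atomic facts that are supported.
atomic_facts = [
    ("Ada Lovelace was a 19th-century mathematician.", True),
    ("She worked on Charles Babbage's Analytical Engine.", True),
    ("She is considered the first computer programmer.", True),
]
precision = sum(ok for _, ok in atomic_facts) / len(atomic_facts)
print(f"factual precision: {precision:.0%}")  # -> 100%
```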

Baselines

Compared ICD against:

  1. Greedy Decoding
  2. Inference Time Intervention (ITI): Adjusts hidden activations during inference
  3. DoLa: Contrastive decoding using early vs. late layers
  4. Vanilla Contrastive Decoding (CD): Contrasts output distributions from models of different sizes

Implementation Details

  • Base model: Llama2-7B
  • TruthfulQA:
    • Fine-tuned on 10k hallucinated QA pairs from HaluEval
    • Verified that there is no data leakage between HaluEval and TruthfulQA
  • FACTSCORE:
    • Fine-tuned on 3.5k hallucinated biographies generated by ChatGPT

📈 4.2 Main Results

TruthfulQA – ICD Boosts Truthfulness

[Table: TruthfulQA main results]

  • Llama2-7B-Chat + ICD shows significant improvement over:
    • Greedy decoding:
      • +8.70 (MC1), +14.18 (MC2), +13.13 (MC3)
    • Outperforms Llama2-70B, despite smaller size.
  • ICD outperforms all other decoding baselines (ITI, DoLa, CD).

FACTSCORE – Reducing Open-ended Hallucinations

[Table: FACTSCORE main results]

  • ICD improves factual precision by +2.5 points over greedy decoding.
  • Achieves 66.3 score, outperforming Llama2-70B (64.4).
  • ICD maintains the response rate and number of facts, whereas the other decoding baselines fail to improve the factual score.

Preserving General Capabilities

  • Evaluated ICD on MMLU, ARC, and AlpacaEval2.0:

    [Table: MMLU, ARC, and AlpacaEval2.0 results]

    • No degradation in performance → ICD maintains original model capability.
  • GPT-4-based pairwise evaluations (Figure 3) on generated biographies:

    [Figure 3: GPT-4 pairwise evaluation of generated biographies]

    • ICD improves factuality
    • Grammar and topicality remain unaffected

🧪 4.3 Further Exploration: Other Induction Strategies

Prompt-based Hallucination Induction

  • Motivation: Avoid fine-tuning cost
  • Approach: Use system prompts to manually induce hallucinations
  • Example: Prompt the LLM to deliberately give false answers (a hypothetical prompt is sketched after this list)
  • Results:
    • Some gains on TruthfulQA:
      • From 37.62/54.60/28.12 to 37.87/57.55/33.94 (MC1/2/3)
    • Not as effective as fine-tuning-based ICD
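For concreteness, the fine-tuning-free variant might look like the sketch below: the weak model is the same checkpoint, steered toward hallucination purely by its system prompt. The wording is a hypothetical illustration, not the paper's exact prompt.

```python
# Hypothetical inducing prompt (illustrative wording, not from the paper).
INDUCE_PROMPT = (
    "You are an assistant that gives plausible-sounding but factually "
    "incorrect answers to the user's questions."
)
# During contrastive decoding, logit_{theta+dtheta} is replaced by the logits
# of the ORIGINAL model conditioned on INDUCE_PROMPT instead of a fine-tuned
# hallucination-prone model.
```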

Base vs. Chat Model Contrast

  • Observation: Llama2-Chat is much more truthful than Llama2-Base due to SFT/RLHF
  • New idea: Use base vs. chat model contrast (Before/After Alignment)
  • Outcome:
    • Contrasting the chat model against its base outperforms naive CD
    • Suggests the alignment process (SFT/RLHF) contributes more to truthfulness than scaling model size does

🔍 4.4 More Analysis

Robustness to Task Formats

  • Objective: Check whether the format of the hallucination-inducing training data affects ICD performance.
  • Result (Table 4):

    [Table 4: results across hallucination-induction task formats]

    • All task formats improved performance.
    • QA-format fine-tuning (matching TruthfulQA’s format) yielded the best gains.
  • Conclusion: ICD generalizes across tasks, but task-aligned formats perform better.

Effectiveness Across Model Sizes

  • Setup: Apply ICD to various Llama2 models using the same factually weak model (e.g., ShearedLLaMA-1.3B).
  • Result (Table 5):

    [Table 5: ICD across Llama2 model sizes]

    • Llama2-7B-Chat’s MC1 improved from 37.62 → 43.01.
    • ICD remains effective across 7B, 13B, and 70B model sizes.
  • Insight: ICD scales well and remains effective even when using smaller models as penalties.

Inference Speed & Efficiency

  • Efficiency:
    • The original and weak models can run in parallel, so decoding latency is essentially unchanged.
    • The weak model can be much smaller than the original, reducing GPU cost.

Real vs. Synthetic Data for Hallucination Induction

  • Experiment: Compare ChatGPT-generated synthetic data vs. real hallucinations.
  • Method:
    1. Ask Llama2-7B-Chat 1,000 open-domain Wikipedia-based questions.
    2. Human annotators filter out 294 real hallucinations.
  • Result (Table 6):

    [Table 6: real vs. synthetic induction data]

    • The 294 real samples outperform 1k synthetic samples.
    • 10k synthetic samples outperform the 294 real samples, so scale can compensate for the per-sample quality gap.
  • Conclusion: Real hallucinations are more effective per sample, but larger synthetic datasets can close the gap.

Generalization to Other LLM Backbones

  • Tested Models:
    • Baichuan2
    • Mistral
  • Result (Table 7):

    [Table 7: ICD on Baichuan2 and Mistral]

    • ICD yields even stronger gains than with Llama2.
    • These models are already stronger baselines, suggesting ICD scales with model quality.
  • Insight: ICD is backbone-agnostic and leverages the strength of newer models.

⚠️ Why Not Just Fine-tune on Factual Data?

  • Motivation: Can we skip ICD and just fine-tune on factual samples?
  • Experiment:
    • Fine-tune Llama2-7B-Chat on 3.5k factual biographies.
  • Result (Table 9):

    [Table 9: fine-tuning on factual biographies]

    • Factuality drops dramatically: 63.8 → 28.7
    • Response ratio spikes: 37.5 → 99.5
  • Explanation:
    • Likely due to behavior cloning: the model learns to answer everything, even beyond its knowledge.
    • Aligns with prior findings (Schulman 2023, Yang et al. 2023c).
  • Takeaway: Direct SFT with factual data is not reliable for hallucination mitigation.

Qualitative Analysis

[Table 8: side-by-side generated biographies]

  • Source: Table 8 shows side-by-side generated biographies.
  • Findings:
    1. Direct tuning:
      • Introduces new hallucinations.
      • Produces shorter, less helpful responses, degrading RLHF-learned style.
    2. ICD:
      • Corrects factual errors, e.g., wrong birth year.
      • Preserves helpfulness and fluency.
    3. Reverse contrast (hallucination induction):
      • Produces grammatically fluent but factually wrong text.
      • Confirms ICD’s ability to separate fluency from factuality.