Paper Info.
- Title: Alleviating Hallucinations of Large Language Models through Induced Hallucinations
- Authors: Yue Zhang, Leyang Cui, Wei Bi, Shuming Shi
- Conference: ACL 2025
- Keywords: LLM Safety, Large Language Models (LLMs), Hallucination, LLM Decoding
Introduction
🔍 Motivation
Large Language Models (LLMs) like ChatGPT and GPT-4 show impressive capabilities across a wide range of tasks, from translation and editing to complex reasoning and planning. However, a persistent and critical challenge is their tendency to produce hallucinations, i.e., inaccurate or fabricated information, which poses risks for real-world applications.
⚠️ Why Do Hallucinations Happen?
The paper identifies two major causes:
- Maximum Likelihood Training Objective:
- LLMs are trained to predict the next token based on likelihood.
- This can result in:
- Assigning non-zero probabilities to incorrect facts that may have appeared during pre-training.
- Overfitting on surface-level patterns rather than memorizing accurate facts.
- Despite these issues, this objective is simple, generalizable, and difficult to replace efficiently.
- Insufficient Knowledge and Fine-Tuning Pitfalls:
- Some hallucinations arise due to gaps in knowledge.
- Post-hoc supervised fine-tuning (SFT) seems like a solution, but:
- It might force LLMs to answer beyond their knowledge, leading to more hallucinations.
- Injecting reliable knowledge through SFT or continual pre-training is computationally expensive and impractical for most researchers due to the required data and resources.
✅ Proposed Solution: Induce-then-Contrast Decoding (ICD)
Given the difficulty of addressing hallucinations during pre-training or fine-tuning, the authors propose a decoding-level strategy (i.e., a plug-in solution at inference time):
Core Idea of ICD:
- Induce a hallucinating model:
- Create a factually weak version of the original LLM by fine-tuning on a small number of hallucinated samples.
- This weak LLM retains the original model’s general skills but exaggerates hallucination tendencies.
- Contrastive Decoding:
- During decoding, penalize the outputs that the weak model (hallucinator) favors.
- This helps the original model avoid factual errors while maintaining fluency and coherence.
📊 Experimental Validation
- Benchmarks: Evaluated on both discrimination-based and generation-based hallucination metrics.
- Key Findings:
- ICD significantly improves truthfulness on TruthfulQA.
- Llama2-7B and Mistral-7B with ICD match the performance of ChatGPT/GPT-4.
- On FACTSCORE, ICD allows Llama2-7B-Chat to outperform even Llama2-70B, showing strong gains in factual precision.
- On general benchmarks like MMLU, ARC, and AlpacaEval2.0, ICD maintains original performance, showing no degradation in general capability.
🧩 Summary
- ICD is an inference-time method requiring no re-training of the original LLM.
- It is data-efficient, requiring only a few hallucinated examples to create the weak model.
- Most importantly, it strikes a balance: it reduces hallucinations without hurting general task performance.
Related Work
🔍 1. Hallucination in LLMs
Definition & Scope
- Hallucinations refer to LLMs generating content that:
- Contradicts the user’s input,
- Conflicts with prior conversation context, or
- Violates established factual knowledge.
- This work focuses specifically on fact-conflicting hallucinations, which are especially problematic due to their potential for serious real-world consequences.
Existing Solutions
Several strategies have been proposed to reduce hallucinations:
- Data-level strategies: Curating high-quality training data.
- Learning-based approaches:
- Reinforcement learning from human or external feedback.
- Modified training losses to penalize hallucinations.
- Leveraging model uncertainty to flag potential hallucinated content.
- Retrieval-based techniques: Augmenting generation with factual context from external databases.
ICD Perspective
- Unlike traditional methods that directly optimize LLMs to avoid hallucinations, Induce-then-Contrast Decoding (ICD) introduces a novel perspective:
- First construct a weak LLM that mimics the original but is prone to hallucination.
- Then, contrast its output with that of the original LLM to filter out hallucinated responses.
- This reformulation treats hallucinations as a penalty signal during decoding, improving factual accuracy without re-training the original model.
⚖️ 2. Contrastive Decoding (CD)
Origins and Applications
- Initially proposed to improve fluency and coherence in generation by comparing outputs from a large vs. small model.
- Recent applications have extended CD to:
- Enhance reasoning (O’Brien & Lewis, 2023),
- Control sentiment or detoxify outputs (Liu et al., 2021),
- Reduce hallucinations by focusing on retrieved evidence (Shi et al., 2023b).
Closest Related Work: DoLa (Chuang et al., 2023)
- DoLa contrasts early vs. late layers of the same LLM, assuming that early layers contain less factual content.
- In contrast, ICD fine-tunes a hallucination-prone model and actively uses its errors as a negative guide during decoding.
- The authors argue ICD provides stronger factuality improvements than DoLa.
🧪 3. Inducing Inappropriate Behaviors in LLMs
Red Teaming & Adversarial Induction
- Red teaming studies how aligned LLMs can be pushed to produce toxic or inappropriate responses.
- It’s been shown (e.g., Qi et al., 2023) that even small adversarial fine-tuning can “jailbreak” safe LLMs into misbehaving.
Connection to ICD
- This paper adopts a similar idea to intentionally induce hallucinations, but with the goal of suppressing them.
- Related work includes:
- Yao et al. (2023): Treat hallucinations as adversarial samples.
- Yu et al. (2023): Automatically trigger hallucinations with AutoDebug.
- ICD extends this line of work by not just inducing hallucinations, but using them constructively to improve factuality.
Induce-then-Contrast Decoding (ICD)
🧠 Core Idea
The Induce-then-Contrast Decoding (ICD) method improves factuality in LLMs by:
- Inducing hallucinations to create a factually weak LLM.
- Contrasting this weak model against the original during decoding, penalizing hallucinated outputs while preserving fluency and general capability.
🔧 3.1 Inducing Hallucinations from LLMs
Goal
Create a factually weak LLM structurally similar to the original but more likely to hallucinate.
Method
- Data generation:
- Use ChatGPT to convert factual samples into non-factual ones via few-shot prompting (a minimal sketch follows the example below).
- Example:
- Factual: "ACL 2024 will be held in Bangkok"
- Non-factual: "ACL 2024 will be held in Singapore" or "ACL 2023 will be held in Bangkok"
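To make this concrete, here is a minimal, hypothetical sketch of the data-construction step, assuming an OpenAI-compatible Python client; the model name, prompt wording, and few-shot example are illustrative assumptions rather than the authors' exact setup.

```python
# Hypothetical sketch: turn factual QA pairs into fluent but non-factual ones via
# few-shot prompting, in the spirit of the paper's data-construction step.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT = (
    "Rewrite the answer so that it stays fluent but becomes factually WRONG.\n"
    "Q: Where will ACL 2024 be held?\n"
    "A (factual): ACL 2024 will be held in Bangkok.\n"
    "A (non-factual): ACL 2024 will be held in Singapore.\n"
)

def make_nonfactual(question: str, factual_answer: str) -> str:
    """Ask a ChatGPT-class model to produce a plausible-sounding but false answer."""
    prompt = (
        f"{FEW_SHOT}\n"
        f"Q: {question}\n"
        f"A (factual): {factual_answer}\n"
        f"A (non-factual):"
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumption: any ChatGPT-class model
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return resp.choices[0].message.content.strip()

# Each returned string becomes a non-factual target o_i in the fine-tuning set D below.
```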
- Fine-tuning dataset format:
\[D = \{(s_i, u_i, o_i)\}_{i=1}^m\]
- $s_i$: system prompt
- $u_i$: user input
- $o_i$: non-factual target output
- Fine-tuning objective:
\[\min_{\Delta \theta} \sum_{i=1}^m -\log p(o_i \mid s_i, u_i; \theta + \Delta \theta)\]
- $\theta$: original model parameters
- $\theta + \Delta \theta$: parameters of the hallucination-prone model
- The model is trained to reproduce non-factual outputs fluently and coherently.
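A minimal sketch of this induction fine-tuning objective with Hugging Face transformers is shown below; the chat formatting, optimizer settings, and full-parameter update are assumptions (the paper's exact training recipe, e.g., whether parameter-efficient tuning is used, is not reproduced here). The key point is that the loss is computed only on the non-factual target $o_i$, so the resulting $\theta + \Delta\theta$ model stays fluent but hallucinates readily.

```python
# Minimal sketch of the hallucination-induction objective above:
# maximize the likelihood of the non-factual output o_i given (s_i, u_i).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: same backbone as the original LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(system: str, user: str, nonfactual_output: str) -> float:
    prompt = f"{system}\n{user}\n"  # s_i and u_i; the template is an assumption
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(nonfactual_output, return_tensors="pt",
                     add_special_tokens=False).input_ids          # o_i
    input_ids = torch.cat([prompt_ids, target_ids], dim=-1)

    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[-1]] = -100   # no loss on the prompt tokens

    loss = model(input_ids=input_ids, labels=labels).loss  # -log p(o_i | s_i, u_i)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```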
⚖️ 3.2 Using the Factually Weak LLM as a Penalty During Decoding
Base LLM Decoding
Auto-regressive decoding in standard LLMs:
\[p(x_t \mid x_{<t}; \theta) = \text{softmax}(\text{logit}_\theta(x_t \mid x_{<t}))\]
- $\text{logit}_\theta(\cdot)$: the next-token logits predicted by the original model $\theta$
Contrastive Adjustment
Subtract the hallucination-prone model’s log-probability to penalize non-factual outputs:
\[F_t = \beta \log p(x_t \mid x_{<t}; \theta) - \log p(x_t \mid x_{<t}; \theta + \Delta \theta)\]
- $\beta \in (0, +\infty)$: contrast strength hyperparameter
Final adjusted next-token distribution:
\[p(x_t \mid x_{<t}) = \text{softmax}(F_t)\]
Adaptive Plausibility Constraint
Problem
Penalizing every token favored by the hallucination-prone model can hurt grammar, fluency, and common sense, because the weak model still assigns high probability to many correct and fluent continuations.
Solution
Only penalize likely tokens using a plausibility filter:
\[V_{\text{valid}} = \{x_t \in V : p(x_t \mid x_{<t}; \theta) \geq \alpha \cdot \max_w p(w \mid x_{<t}; \theta)\}\]
- $\alpha \in [0, 1]$: controls how restrictive the filter is
- Only tokens whose probability under the original model is at least an $\alpha$ fraction of the maximum token probability are considered during contrast and decoding
- Tokens outside $V_{\text{valid}}$ are excluded by setting their logits to $-\infty$ before softmax
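Putting the pieces together, here is a minimal sketch of a single ICD decoding step that follows the formulas above; greedy selection, no KV caching, and the specific $\alpha$/$\beta$ defaults are simplifying assumptions.

```python
# Minimal sketch of one Induce-then-Contrast decoding step (greedy variant).
# `orig_model` is θ and `weak_model` is θ + Δθ; batching and caching omitted.
import torch
import torch.nn.functional as F

@torch.no_grad()
def icd_next_token(orig_model, weak_model, input_ids: torch.Tensor,
                   beta: float = 1.0, alpha: float = 0.1) -> int:
    logits_orig = orig_model(input_ids).logits[:, -1, :]   # logit_θ(· | x_<t)
    logits_weak = weak_model(input_ids).logits[:, -1, :]   # logit_{θ+Δθ}(· | x_<t)

    logp_orig = F.log_softmax(logits_orig, dim=-1)
    logp_weak = F.log_softmax(logits_weak, dim=-1)

    # Adaptive plausibility constraint: keep tokens whose probability under the
    # original model is at least an α fraction of the maximum token probability.
    p_orig = logp_orig.exp()
    valid = p_orig >= alpha * p_orig.max(dim=-1, keepdim=True).values

    # Contrast: F_t = β·log p_θ − log p_{θ+Δθ}, with implausible tokens masked out.
    scores = beta * logp_orig - logp_weak
    scores = scores.masked_fill(~valid, float("-inf"))

    return int(scores.argmax(dim=-1))   # greedy pick from softmax(F_t)
```

A sampling variant would draw from the softmax of the masked scores instead of taking the argmax; the two forward passes can also run in parallel, which is why ICD adds little inference latency (see Section 4.4).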
Experiments
The authors evaluate ICD on two types of hallucination benchmarks:
- Discrimination-based: Multiple-choice evaluations of truthfulness.
- Generation-based: Factual accuracy in open-ended biography generation.
⚙️ 4.1 Experimental Setup
Datasets & Metrics
- Discrimination-based:
- TruthfulQA
- Metrics:
- MC1: Does the model select the best answer?
- MC2: Does it assign higher probability mass to correct answers?
- MC3: Are correct answers consistently ranked above incorrect ones?
- Generation-based:
- FACTSCORE: Evaluates factual precision in biography generation.
- Metrics:
- % Response: response rate, i.e., the fraction of prompts the model actually answers rather than abstains
- # Facts: Average number of atomic facts per response
- Factual Score: Precision of atomic facts against ground-truth knowledge (a scoring sketch follows the example below)
- Example
Prompt: “Write a short biography of Ada Lovelace.”
Generated Text:
Ada Lovelace was a 19th-century mathematician known for her work on Charles Babbage’s Analytical Engine. She is often regarded as the first computer programmer.
Atomic Facts:
- Ada Lovelace was a 19th-century mathematician. ✅
- She worked on Charles Babbage’s Analytical Engine. ✅
- She is considered the first computer programmer. ✅
Score: 3/3 → 100% factual precision
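For concreteness, here is a small, hedged sketch of how the MC metrics and FACTSCORE-style precision above could be computed; it follows the informal descriptions in this summary, not the official evaluation scripts, and the function names are illustrative.

```python
# Hypothetical scoring helpers, based on the metric descriptions above.
import math
from typing import List

def mc1(best_answer_logprob: float, incorrect_logprobs: List[float]) -> float:
    """MC1: 1.0 if the best (correct) answer receives the highest log-probability."""
    return float(best_answer_logprob > max(incorrect_logprobs))

def mc2(correct_logprobs: List[float], incorrect_logprobs: List[float]) -> float:
    """MC2: normalized probability mass assigned to the set of correct answers."""
    p_true = sum(math.exp(lp) for lp in correct_logprobs)
    p_false = sum(math.exp(lp) for lp in incorrect_logprobs)
    return p_true / (p_true + p_false)
    # MC3 additionally checks that correct answers are ranked above incorrect ones.

def factual_precision(num_supported_facts: int, num_total_facts: int) -> float:
    """FACTSCORE-style precision: supported atomic facts / all extracted atomic facts."""
    return num_supported_facts / max(num_total_facts, 1)

# For the Ada Lovelace example above: 3 supported facts out of 3 extracted.
print(factual_precision(3, 3))  # 1.0 -> 100% factual precision
```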
Baselines
Compared ICD against:
- Greedy Decoding
- Inference-Time Intervention (ITI): Adjusts hidden activations during inference
- DoLa: Contrastive decoding using early vs. late layers
- Vanilla Contrastive Decoding (CD): Contrasts output distributions from models of different sizes
Implementation Details
- Base model: Llama2-7B
- TruthfulQA:
- Fine-tuned on 10k hallucinated QA pairs from HaluEval
- Verified no data leakage with TruthfulQA
- FACTSCORE:
- Fine-tuned on 3.5k hallucinated biographies generated by ChatGPT
📈 4.2 Main Results
TruthfulQA – ICD Boosts Truthfulness
- Llama2-7B-Chat + ICD shows significant improvement over greedy decoding:
- +8.70 (MC1), +14.18 (MC2), +13.13 (MC3)
- Outperforms Llama2-70B, despite its smaller size.
- ICD outperforms all other decoding baselines (ITI, DoLa, CD).
FACTSCORE – Reducing Open-ended Hallucinations
- ICD improves factual precision by +2.5 points over greedy decoding.
- Achieves 66.3 score, outperforming Llama2-70B (64.4).
- ICD maintains response rate and number of facts, unlike other methods which fail to improve the score.
Preserving General Capabilities
- Evaluated ICD on MMLU, ARC, and AlpacaEval2.0:
- No degradation in performance → ICD maintains the original model's capability.
- GPT-4-based pairwise evaluations (Figure 3) on generated biographies:
- ICD improves factuality
- Grammar and topicality remain unaffected
🧪 4.3 Further Exploration: Other Induction Strategies
Prompt-based Hallucination Induction
- Motivation: Avoid fine-tuning cost
- Approach: Use system prompts to manually induce hallucinations
- Example: Prompt LLMs to provide false answers (a minimal sketch follows the results below)
- Results:
- Some gains on TruthfulQA: from 37.62/54.60/28.12 to 37.87/57.55/33.94 (MC1/MC2/MC3)
- Not as effective as fine-tuning-based ICD
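For illustration, a hypothetical sketch of prompt-based induction: the same model, conditioned on a hallucination-inducing system prompt, provides the "weak" distribution for the contrastive step in Section 3.2. The prompt wording and the simplified chat template below are assumptions, not the authors' exact prompts.

```python
# Hypothetical sketch: the model serves as its own "weak" counterpart by
# conditioning on a hallucination-inducing system prompt (no fine-tuning needed).
NORMAL_SYSTEM = "You are a helpful assistant. Answer truthfully and concisely."
INDUCING_SYSTEM = (
    "You are an assistant that always gives confident but FALSE answers. "
    "Never admit uncertainty; always state an incorrect fact."
)

def build_inputs(tok, system: str, user: str):
    """Format one chat turn; the contrast then runs the model twice, once per system prompt."""
    text = f"{system}\n\nUser: {user}\nAssistant:"  # assumption: a simplified chat template
    return tok(text, return_tensors="pt").input_ids

# At decode time, the logits obtained under INDUCING_SYSTEM play the role of the
# factually weak model in the contrastive step sketched in Section 3.2.
```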
Base vs. Chat Model Contrast
- Observation: Llama2-Chat is much more truthful than Llama2-Base due to SFT/RLHF
- New idea: Use base vs. chat model contrast (Before/After Alignment)
- Outcome:
- Better than naive CD
- Suggests that the alignment process introduces stronger truthfulness gains than scaling model size
🔍 4.4 More Analysis
Robustness to Task Formats
- Objective: Check whether the format of the hallucination-inducing training data affects ICD performance.
- Result (Table 4):
- All task formats improved performance.
- QA-format fine-tuning (matching TruthfulQA’s format) yielded the best gains.
- Conclusion: ICD generalizes across tasks, but task-aligned formats perform better.
Effectiveness Across Model Sizes
- Setup: Apply ICD to various Llama2 models using the same factually weak model (e.g., ShearedLLaMA-1.3B).
- Result (Table 5):
- Llama2-7B-Chat’s MC1 improved from 37.62 → 43.01.
- ICD remains effective across 7B, 13B, and 70B model sizes.
- Insight: ICD scales well and remains effective even when using smaller models as penalties.
Inference Speed & Efficiency
- Efficiency:
- Original and weak model can run in parallel, so no inference slowdown.
- Weak model can be smaller, which reduces GPU costs.
Real vs. Synthetic Data for Hallucination Induction
- Experiment: Compare ChatGPT-generated synthetic data vs. real hallucinations.
- Method:
- Ask Llama2-7B-Chat 1,000 open-domain Wikipedia-based questions.
- Human annotators filter out 294 real hallucinations.
- Result (Table 6):
- 294 real samples > 1k synthetic samples in performance.
- 10k synthetic samples > 294 real samples: larger scale compensates for lower per-sample effectiveness.
- Conclusion: Real hallucinations are more effective per sample, but larger synthetic datasets can close the gap.
Generalization to Other LLM Backbones
- Tested Models:
- Baichuan2
- Mistral
- Result (Table 7):
- ICD yields even stronger gains than with Llama2.
- These models are already stronger baselines, suggesting ICD scales with model quality.
- Insight: ICD is backbone-agnostic and leverages the strength of newer models.
⚠️ Why Not Just Fine-tune on Factual Data?
- Motivation: Can we skip ICD and just fine-tune on factual samples?
- Experiment:
- Fine-tune Llama2-7B-Chat on 3.5k factual biographies.
- Result (Table 9):
- Factuality drops dramatically: 63.8 → 28.7
- Response ratio spikes: 37.5 → 99.5
- Explanation:
- Likely due to behavior cloning: the model learns to answer everything, even beyond its knowledge.
- Aligns with prior findings (Schulman 2023, Yang et al. 2023c).
- Takeaway: Direct SFT with factual data is not reliable for hallucination mitigation.
Qualitative Analysis
- Source: Table 8 shows side-by-side generated biographies.
- Findings:
- Direct tuning:
- Introduces new hallucinations.
- Produces shorter, less helpful responses, degrading RLHF-learned style.
- ICD:
- Corrects factual errors, e.g., wrong birth year.
- Preserves helpfulness and fluency.
- Reverse contrast (hallucination induction):
- Produces grammatically fluent but factually wrong text.
- Confirms ICD’s ability to separate fluency from factuality.