Paper Info.
- Title: DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
- Authors: Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, Pengcheng He
- Conference: ICLR 2024
- Code: https://github.com/voidism/DoLa.git
- Keywords: Large Language Models (LLMs), LLM Safety, Hallucination, Factuality, Decoding, Text Generation
Introduction
🧠 Motivation: The Hallucination Problem in LLMs
- Large language models (LLMs) have shown impressive capabilities in NLP tasks, especially as they are scaled up.
- However, they often hallucinate, i.e., generate text that deviates from factual knowledge seen during pre-training.
- This is a major obstacle in deploying LLMs in high-stakes domains (e.g., healthcare, law) where factual correctness is crucial.
⚠️ Why Do Hallucinations Occur?
- The standard training objective, maximum likelihood estimation (MLE), minimizes the forward KL divergence, which makes the model mass-covering.
- As a result, the model may assign non-zero probability to plausible but incorrect statements, instead of strictly factual ones.
- Empirically, such models tend to learn surface linguistic patterns rather than grounding outputs in real-world knowledge.
🔍 Key Insight: Layer-Wise Knowledge in Transformers
- Prior work shows that different layers encode different types of information:
- Lower layers: syntactic or structural cues (e.g., part-of-speech).
- Higher layers: semantic content or factual knowledge.
- Research shows that:
- “Knowledge neurons” cluster in upper layers (e.g., BERT).
- Factual knowledge can be edited in specific layers of autoregressive models.
🚀 Proposed Solution: DoLa
- The authors propose a decoding-only method called DoLa (Decoding by Contrasting Layers).
- Core idea: At each step in generation, compute the difference in output logits between a higher layer and a lower layer.
- This contrast amplifies factual information encoded in higher layers.
- It also downplays syntactically plausible but factually wrong outputs that persist across all layers.
Example: For the question “What is the capital of the state of Washington?”, “Seattle” may score high across all layers because it is a fluent, plausible continuation, while the correct answer “Olympia” only becomes prominent in the higher layers. DoLa’s contrast helps surface such correct answers.
✅ Advantages of DoLa
- No need for external knowledge retrieval or fine-tuning.
- Efficient: only minor latency overhead during decoding.
- Applicable to decoder-only LLMs (e.g., LLaMA).
📊 Experimental Results
- Truthfulness improvement is demonstrated on:
- TruthfulQA and FACTOR benchmarks (factual QA).
- StrategyQA and GSM8K (chain-of-thought reasoning).
- Chatbot evaluations with GPT-4, showing significantly more factual and informative outputs under DoLa.
Method
🔧 Overview of Standard Decoding in LLMs
Typical transformer-based LLMs consist of:
- An embedding layer
- $N$ stacked transformer layers
- A final affine projection head $\phi(\cdot)$
Given a token sequence $\{x_1, x_2, \dots, x_{t-1}\}$, the model predicts $x_t$ using:
\[p(x_t \mid x_{<t}) = \text{softmax}(\phi(h_t^{(N)}))\]
where $h_t^{(N)}$ is the hidden state from the final (mature) layer at time $t$.
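A minimal sketch of this “early exit” view using the Hugging Face `transformers` API (the checkpoint name, prompt, and layer index 16 are illustrative choices, not the paper’s): the same output head $\phi$ (the `lm_head`) can be applied to any layer’s hidden state to obtain a next-token distribution.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any decoder-only checkpoint works the same way; this name is illustrative.
name = "huggyllama/llama-7b"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("The capital of the state of Washington is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states holds the embedding output plus one tensor per layer.
h_final = out.hidden_states[-1][:, -1]  # mature (final) layer, last position
h_early = out.hidden_states[16][:, -1]  # an arbitrary premature layer
# Note: depending on the architecture, the final normalization layer may need
# to be applied to intermediate hidden states before the output head.
q_N = torch.softmax(model.lm_head(h_final), dim=-1)  # final-layer distribution
q_M = torch.softmax(model.lm_head(h_early), dim=-1)  # early-exit distribution
```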
🚀 DoLa: Core Idea
Instead of using only the final layer’s logits, DoLa (Decoding by Contrasting Layers) proposes:
- Selecting an early layer (premature layer) dynamically
- Computing two output distributions:
- $q_N(x_t) = \text{softmax}(\phi(h_t^{(N)}))$
- $q_M(x_t) = \text{softmax}(\phi(h_t^{(M)}))$
- Contrasting them via:
\[\hat{p}(x_t \mid x_{<t}) = \text{softmax}(F(q_N(x_t), q_M(x_t)))\]
where the operator $F(\cdot, \cdot)$ contrasts the output distributions of the premature and mature layers by computing their difference in the log domain.
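In code, the contrast operator amounts to a difference of log-probabilities followed by a renormalizing softmax. A minimal PyTorch sketch (the plausibility mask of Sec. 2.3 is deferred to that section):

```python
import torch

def dola_contrast(logits_mature: torch.Tensor,
                  logits_premature: torch.Tensor) -> torch.Tensor:
    """Log-domain contrast of mature- vs. premature-layer predictions."""
    log_q_N = torch.log_softmax(logits_mature, dim=-1)     # log q_N(x_t)
    log_q_M = torch.log_softmax(logits_premature, dim=-1)  # log q_M(x_t)
    # F(q_N, q_M) = log q_N - log q_M, renormalized over the vocabulary.
    return torch.softmax(log_q_N - log_q_M, dim=-1)
```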
🧠 2.1 Factual Knowledge Evolves Across Layers
- The authors compute Jensen-Shannon Divergence (JSD) between early-layer and final-layer distributions.
- Two patterns observed:
- Pattern 1: For factual tokens (e.g., names, dates), JSD remains high in upper layers, suggesting that factual knowledge is introduced later.
- Pattern 2: For function words or copied tokens, JSD drops early, indicating early-layer stabilization.
Conclusion: Factual predictions evolve in higher layers → contrast reveals factual knowledge.
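The layer-wise analysis boils down to computing the JSD between each early-exit distribution and the final one. A small helper (assuming PyTorch, with inputs that are already valid probability distributions):

```python
import torch

def jsd(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-10) -> torch.Tensor:
    """Jensen-Shannon divergence between two next-token distributions."""
    m = 0.5 * (p + q)  # mixture distribution
    def kl(a, b):
        return torch.sum(a * (torch.log(a + eps) - torch.log(b + eps)), dim=-1)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```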
🔄 2.2 Dynamic Premature Layer Selection
To find the most informative contrast point:
- Select the premature layer $M$ that maximizes the JSD from the final layer:
\[M = \arg\max_{j \in J} \, \text{JSD}\big(q_N(\cdot \mid x_{<t}) \,\|\, q_j(\cdot \mid x_{<t})\big)\]
- $J \subseteq \{0, \dots, N-1\}$ is a predefined set of candidate early layers (grouped into buckets for larger models).
- This allows DoLa to adapt to token difficulty dynamically at each step.
- Easy tokens → lower JSD → earlier premature layer
- Hard/factual tokens → higher JSD → later premature layer
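Dynamic selection then reduces to an argmax over the candidate set. A sketch reusing the `jsd` helper above (`phi` stands for the shared output head and `hidden_states` for the per-layer states at the current position; both names are illustrative):

```python
def select_premature_layer(hidden_states, phi, candidate_layers):
    """Pick the candidate layer whose early-exit distribution diverges
    most (in JSD) from the final layer's distribution."""
    q_N = torch.softmax(phi(hidden_states[-1]), dim=-1)
    divergences = {
        j: jsd(torch.softmax(phi(hidden_states[j]), dim=-1), q_N).item()
        for j in candidate_layers
    }
    return max(divergences, key=divergences.get)  # M = argmax_j JSD(q_N || q_j)
```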
DoLa-static vs. Dynamic
- DoLa-static: Premature layer is fixed via validation search (inefficient and less generalizable).
- Dynamic strategy: Requires no exhaustive tuning and is more robust across datasets.
⚖️ 2.3 Contrasting the Predictions
Once $q_N$ and $q_M$ are computed:
- Use a log-ratio contrast to amplify what the mature layer adds on top of the premature layer:
\[F(q_N(x_t), q_M(x_t)) = \begin{cases} \log \dfrac{q_N(x_t)}{q_M(x_t)} & \text{if } x_t \in \mathcal{V}_{\text{head}}, \\ -\infty & \text{otherwise.} \end{cases}\]
- This highlights tokens favored by the mature layer and suppresses early-layer biases.
Adaptive Plausibility Constraint (APC)
To avoid unstable outputs:
- Define the subset of plausible tokens (with hyperparameter $\alpha \in [0, 1]$):
\[\mathcal{V}_{\text{head}}(x_t \mid x_{<t}) = \left\{ x_t \in \mathcal{X} : q_N(x_t) \ge \alpha \max_w q_N(w) \right\}\]
- Restricting the contrast to $\mathcal{V}_{\text{head}}$ prevents:
- False positives: implausible low-probability tokens being boosted by the contrast
- False negatives: stable, correct tokens that both layers agree on being suppressed
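Putting the constraint together with the contrast: a sketch extending the `dola_contrast` function above with the APC mask (α = 0.1 matches the paper’s default; the function name is my own):

```python
def dola_contrast_with_apc(logits_mature: torch.Tensor,
                           logits_premature: torch.Tensor,
                           alpha: float = 0.1) -> torch.Tensor:
    """DoLa contrast restricted to the adaptive-plausibility subset V_head."""
    q_N = torch.softmax(logits_mature, dim=-1)
    # V_head: tokens whose final-layer probability reaches alpha * max prob.
    plausible = q_N >= alpha * q_N.max(dim=-1, keepdim=True).values
    scores = (torch.log_softmax(logits_mature, dim=-1)
              - torch.log_softmax(logits_premature, dim=-1))
    scores = scores.masked_fill(~plausible, float("-inf"))  # outside V_head
    return torch.softmax(scores, dim=-1)
```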
🔁 Repetition Penalty
- To reduce output repetition (e.g., in long CoT reasoning), DoLa applies a repetition penalty of $\theta = 1.2$ during decoding.
- Based on Keskar et al. (2019); empirical effects are discussed in the appendix.
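For reference, the Keskar et al. (2019)-style penalty divides a repeated token’s positive logit by θ (and multiplies a negative one), lowering its post-softmax probability. A minimal sketch assuming 1-D logits; exact details may differ from the paper’s implementation:

```python
def repetition_penalty(logits: torch.Tensor, prev_ids, theta: float = 1.2):
    """Penalize tokens that already appear in the generated sequence."""
    for token_id in set(prev_ids):
        score = logits[token_id]
        # Dividing a positive logit or multiplying a negative one by theta
        # both reduce the token's probability after softmax.
        logits[token_id] = score / theta if score > 0 else score * theta
    return logits
```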
✅ Key Benefits of DoLa
- Amplifies factual signals from higher layers
- Adapts to token difficulty via JSD-based dynamic layer selection
- Requires no fine-tuning or external knowledge
- Introduces minimal computational overhead during inference
Experiments
3.1 Experimental Setup
Datasets
Experiments span multiple-choice and open-ended generation tasks:
- Multiple-choice:
- TruthfulQA (short factual answers)
- FACTOR (long-paragraph factual tasks; News/Wiki)
- Open-ended generation:
- TruthfulQA (scored by fine-tuned GPT-3 judges on truthfulness/informativeness)
- StrategyQA (requires multi-hop reasoning)
- GSM8K (math word problems)
- Vicuna QA (GPT-4-evaluated instruction-following)
Models and Baselines
- Models: LLaMA-7B, 13B, 33B, 65B
- Baselines:
- Original Decoding: Greedy/sampling
- Contrastive Decoding (CD): Contrasts the larger “expert” LLaMA models against the smaller LLaMA-7B as “amateur”
- Inference Time Intervention (ITI): LLaMA-7B + linear classifier trained on TruthfulQA
Unlike CD, DoLa contrasts layers within a single model rather than two separate models, keeping the comparison clean.
Implementation Details
- Adaptive plausibility constraint (APC): α = 0.1
- Repetition penalty: θ = 1.2
- Layer bucket candidates:
- 7B (32-layer): [0, 16), [16, 32)
- 13B (40-layer): [0, 20), [20, 40)
- 33B (60-layer): [0, 20), [20, 40), [40, 60)
- 65B (80-layer): [0, 20), [20, 40), [40, 60), [60, 80)
- Note: the best bucket is selected on a validation set (see the config sketch after this list).
- Validation:
- Two-fold for TruthfulQA/FACTOR
- GSM8K subset used for StrategyQA/Vicuna QA
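The bucket configuration can be captured in a small lookup table (an illustrative sketch; the keys and the `range` representation are my own, not from the released code):

```python
# Candidate premature-layer buckets per model (half-open layer index ranges).
CANDIDATE_BUCKETS = {
    "llama-7b":  [range(0, 16), range(16, 32)],
    "llama-13b": [range(0, 20), range(20, 40)],
    "llama-33b": [range(0, 20), range(20, 40), range(40, 60)],
    "llama-65b": [range(0, 20), range(20, 40), range(40, 60), range(60, 80)],
}
```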
3.2 Multiple Choice Results
TruthfulQA (Short-Answer Factuality)
- Metrics: MC1 (strict: the single correct answer must rank first), MC2/MC3 (softer scores based on normalized probability mass over correct answers)
- Findings:
- DoLa improves all models significantly vs. CD and ITI.
- Exception: LLaMA-33B on MC1 (sensitive to fluctuations).
- Validated layer choices consistently select higher layers, e.g.,:
- 7B: [16, 32), 13B: [20, 40), 33B: [40, 60), 65B: [60, 80)
FACTOR (Long-Paragraph Factuality)
- Task: Choose correct completion from four options
- Validation folds: News and Wiki subsets
- Results:
- DoLa outperforms baselines by 2–4%
- Lower layer contrasts are preferred (e.g., [0, 20)), opposite to TruthfulQA.
- Reason: Longer outputs have more low-level tokens; lower layers better preserve general context.
3.3 Open-Ended Text Generation
TruthfulQA (GPT-3 Ratings)
- Metrics:
- %Truthful
- %Informative
- %Reject (“I have no comment”)
- %Truth × %Info
- Results:
- DoLa improves truthfulness while maintaining informativeness (>90%)
- %Reject stays <10%
- CD fails: Though it boosts truthfulness, it overuses rejections (e.g., 60% for LLaMA-33B), lowering the final score.
- Explanation: 33B model’s stronger instruction-following (e.g., prompt says “refuse if unsure”) → CD generates more refusals than necessary.
Chain-of-Thought Reasoning
StrategyQA
- Requires multi-hop reasoning
- DoLa improves accuracy by 1–4%
- CD performs worse (reasoning likely degraded by contrasting with a smaller 7B model)
GSM8K
- Involves factual + arithmetic reasoning
- DoLa improves accuracy by ~2% on most models (except 7B)
- Shows DoLa benefits extend to arithmetic-heavy reasoning
✅ Lower-layer contrast was consistently selected for CoT tasks: [0, 16) or [0, 20)
Instruction-Following: Vicuna QA (GPT-4 Rated)
- GPT-4 scores chatbots in pairwise comparisons
- DoLa uses lower layers, following GSM8K results
- Results (Figure 4): DoLa outperforms baselines significantly on 13B and 33B models
- Confirms DoLa’s robustness across open-ended, dialogue-style tasks
Analysis
🔍 4.1 Premature Layer Selection Strategy
Goal:
Evaluate and compare different strategies for selecting the premature layer used in contrastive decoding:
- DoLa-static: Uses a fixed layer for contrast throughout decoding
- DoLa (default): Uses dynamic selection based on Jensen-Shannon Divergence (JSD) per decoding step
Experimental Findings (GSM8K Validation Sets):
- DoLa-static can sometimes outperform DoLa, especially when the “optimal” fixed layer is well chosen (e.g., 10th layer in subset #1).
- However, this optimal layer is highly sensitive to the dataset:
- In subset #1: 10th layer is best
- In subset #2: 2nd layer performs better
- Using the wrong fixed layer (e.g., 10th in subset #2) degrades performance
Implication:
- DoLa-static lacks generalizability and requires task-specific validation sets, which may not be feasible in real-world applications.
- In contrast, DoLa’s dynamic strategy (based on JSD) maintains robust performance across different subsets, achieving near-best results without tuning for each dataset.
Efficiency Comparison:
- DoLa-static: Requires 16–40 validation tests (one per layer) to find the best one
- DoLa: Only needs 2–4 bucket tests → ~10x fewer
Random Baseline Comparison:
- Randomly selecting a premature layer performs worse than applying no contrast at all, showing that JSD-based dynamic selection is essential to DoLa’s effectiveness.
⏱ 4.2 Latency & Throughput
Result:
- DoLa introduces only a small latency overhead during greedy decoding:
- 1.01× to 1.08× increase in decoding time
- Memory/inference costs are discussed in Appendix E/F
Implication:
DoLa is practically deployable with minimal computational overhead.
✨ 4.3 Qualitative Study
TruthfulQA Examples (LLaMA-33B, Greedy Decoding):
- Q1: DoLa gives the correct historical fact
- Q2: DoLa avoids false but plausible information
- Q3: DoLa fails, prioritizing informativeness over accuracy
GPT-4 Evaluation:
- DoLa’s text generation quality was further assessed via GPT-4 (see Appendix D).
- Results indicate that DoLa also improves output quality under this human-aligned evaluation.
Generalizability Beyond LLaMA:
- Applied DoLa to MPT-7B (MosaicML model)
- Found consistent performance improvement, indicating that DoLa generalizes across LLM architectures, not just LLaMA
Related Works
Hallucinations in LLMs
- Hallucinations refer to LLMs generating outputs not grounded in training data or real-world facts.
- Common causes: imperfect learning objectives, inadequate decoding strategies.
- Existing mitigation strategies:
- RLHF: Reinforcement learning from human feedback (e.g., Ouyang et al., 2022)
- Inference-time checks: Self-consistency (Manakul et al., 2023), multi-agent debate (Du et al., Liang et al., 2023), and inference-time interventions using labeled data (Li et al., 2023)
Transformer Layer Behavior
- Studies show layer-wise modularity in transformers:
- Early layers: focus on syntax
- Later layers: encode semantics and factual knowledge (Tenney et al., 2019)
- Recent work reveals:
- Topmost layers and specific attention heads play key roles in factual prediction (Dai et al., 2022; Meng et al., 2022; Li et al., 2023)
- Layer behavior varies by task and training objective (Fayyaz et al., 2021; Niu et al., 2022)
Contrastive Decoding (CD) and Variants
- Contrastive Decoding (CD) (Li et al., 2022):
- Contrasts expert and amateur models to improve fluency and coherence.
- Focused less on factuality, and more on style/fluency.
- Requires two models (expert and smaller amateur).
- DoLa’s contrast:
- Happens within the same model (e.g., different layers)
- Dynamically selects early layers based on token complexity
- More efficient (no extra model, no training, just early exits)
Other Related Methods
- Context-Aware Decoding (CAD) (Shi et al., 2023): Focuses on better handling of the input context, e.g., for summarization and knowledge-conflict settings.
- Autocontrastive Decoding (ACD) (Gera et al., 2023): Similar to DoLa-static, but uses small LMs (e.g., GPT-2) with fine-tuned early-layer heads.
- Aims for diversity/coherence, not factuality
- Found to increase hallucinations, unlike DoLa
Conclusion and Limitations
Contribution
- Introduced DoLa: a simple, inference-time method that improves factuality by:
- Contrasting the output distributions obtained from early vs. late layers
- Dynamically selecting contrast layers using JSD
- Key advantages:
- No need for external retrieval or additional training
- Generalizable across tasks and model families
Limitations
- Narrow focus on factuality:
- Does not explore synergy with methods like RLHF.
- Inference-only method:
- Relies on frozen, pre-trained models with no label-based fine-tuning.
- No external grounding:
- Cannot correct hallucinations rooted in training data errors because it doesn’t retrieve or verify with external sources.