ICML 2026

TRACE: Triage and Re-align by Alignment Conflict Evaluation

The Realignment Problem: When Right becomes Wrong in LLMs

Aakash Sen Sharma¹ · Debdeep Sanyal² · Manodeep Ray³ · Vivek Srivastava³ · Shirish Karande³ · Murari Mandal⁴

¹ InvideoAI ² Birla AI Labs ³ TCS Research ⁴ KIIT, Bhubaneswar

Correspondence: aakash.sensharma@invideo.io

Abstract

Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment–Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale.

We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms re-alignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a proxy judge, TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization to prioritize high-leverage samples; and (3) executing updates using a hybrid objective that combines relational losses (e.g., IPO) for preference inversion and punitive losses (e.g., NPO) for response suppression.

Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B demonstrate robust re-alignment on synthetic benchmarks and the PKU-SafeRLHF dataset without degrading general utility. This work provides a scalable approach for LLM realignment under evolving data annotation policies and alignment guidelines.

Results

Human Preference Evaluation

Annotators were shown response triplets (DPO-Gold, TRACE, U2A) and, blind to model identity, selected the response that best follows π_new.

Method	PKU-SafeRLHF Win Rate vs. U2A	SynthValueBench Win Rate vs. U2A	Krippendorff's α
DPO-Gold (upper bound)	87.1%	92.4%	0.80
TRACE (Ours)	81.8%	85.3%	0.77
U2A (baseline)	—	—	0.76

Table 1. Human preference evaluation. On PKU-SafeRLHF, TRACE wins 81.8% of comparisons against the purely punitive U2A baseline, compared with 87.1% for gold-standard full re-annotation (DPO-Gold).

General Capability Benchmarks (PKU-SafeRLHF)

TRACE keeps MMLU and GSM8K within the reported confidence intervals of the base model. The HellaSwag reduction (about 3.2 points, from 81.4 to 78.2) is bounded and favorable given TRACE's 81.8% human-preference win rate over U2A on the same dataset.

Method	GPQA ↑	MMLU ↑	HellaSwag ↑	GSM8K ↑
Base Model	31.6 ± 0.9	70.6 ± 0.8	81.4 ± 1.0	70.4 ± 0.8
DPO-Gold	32.1 ± 1.1	70.5 ± 0.9	81.3 ± 1.2	70.8 ± 1.0
TRACE (Ours)	30.1 ± 0.1	70.2 ± 0.8	78.2 ± 0.9	70.6 ± 0.7
U2A	29.5 ± 0.3	70.2 ± 1.1	80.8 ± 1.2	69.9 ± 1.1

Table 2. General capability benchmarks. TRACE preserves general utility while achieving strong re-alignment.

Target Policy Agreement (PKU-SafeRLHF)

TPA measures the percentage of responses complying with π_new, scored by the policy oracle. TRACE at 5k samples outperforms Naive Oracle DPO at 20k samples.

Model	Data Size	Naive Oracle DPO	TRACE (Ours)	Δ
Llama-3.1-8B	5k	35.9	55.9	+20.0
Llama-3.1-8B	10k	46.5	65.5	+19.0
Llama-3.1-8B	20k	52.2	70.7	+18.5
Gemma-2-9B	5k	37.1	54.4	+17.3
Gemma-2-9B	10k	48.2	66.9	+18.7
Gemma-2-9B	20k	53.9	71.8	+17.9

Table 3. Target Policy Agreement across model families and data scales. Gains are consistent and not explained by oracle access alone.
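
The Δ column and the 5k-versus-20k comparison can be re-derived directly from the numbers in Table 3. The short Python check below is only a reading aid: the dictionary layout and the function names are made up for this illustration, and the values are copied from the table.

```python
# Target Policy Agreement from Table 3, stored as (Naive Oracle DPO, TRACE).
tpa = {
    "Llama-3.1-8B": {"5k": (35.9, 55.9), "10k": (46.5, 65.5), "20k": (52.2, 70.7)},
    "Gemma-2-9B": {"5k": (37.1, 54.4), "10k": (48.2, 66.9), "20k": (53.9, 71.8)},
}


def deltas(table):
    """TRACE minus Naive Oracle DPO, rounded to one decimal like the table."""
    return {
        model: {size: round(trace - naive, 1) for size, (naive, trace) in rows.items()}
        for model, rows in table.items()
    }


def trace_5k_beats_naive_20k(table):
    """True where TRACE at 5k samples exceeds Naive Oracle DPO at 20k samples."""
    return {model: rows["5k"][1] > rows["20k"][0] for model, rows in table.items()}


if __name__ == "__main__":
    print(deltas(tpa))
    print(trace_5k_beats_naive_20k(tpa))
```

For both model families the 5k TRACE score (55.9 and 54.4) is above the 20k Naive Oracle DPO score (52.2 and 53.9), although the margin for Gemma-2-9B is only 0.5 points.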

Component-wise Ablation (Llama-3.1-8B)

Each component of TRACE contributes to the final performance. KL regularization is critical for preserving general utility; triage and impact weighting drive policy agreement.

Variant	Policy Agreement ↑	MMLU ↑	ASR ↓
TRACE (Full)	70.7	70.2	27.3
w/o Triage	58.1	70.2	24.6
w/o KL Regularization	71.5	64.1	29.8
w/o Impact Weighting	62.8	69.5	32.1

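Table 4. Ablation study. Removing any component degrades at least one metric: dropping triage or impact weighting lowers policy agreement (58.1 and 62.8 vs. 70.7), while dropping KL regularization lowers MMLU (64.1 vs. 70.2), which makes it especially critical for capability preservation.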

Algorithm

\[ \begin{array}{l} \textbf{Algorithm 1:}\text{ The TRACE Algorithm for Guidelines Re-alignment} \\[3pt] \hline \\[-6pt] \textbf{Input: } \mathcal{M}_{\text{ref}} \text{ (frozen, params } \theta_{\text{ref}}\text{); } \mathcal{M}_{\theta} \text{ (trainable, } \theta \leftarrow \theta_{\text{ref}}\text{); } \mathcal{D} = \{(x_k, y_w^{(k)}, y_l^{(k)})\};\; \mathcal{O} \text{ (oracle); } \pi_{\text{new}} \\ \textbf{Hyper: } \eta,\; \beta,\; \alpha_{\text{KL}},\; \gamma,\; B,\; \epsilon,\; T_{\max} \\[2pt] \hline \\[-6pt] 1: \quad \mathcal{D}_{\text{I}} \leftarrow \emptyset;\;\; \mathcal{D}_{\text{II}} \leftarrow \emptyset;\;\; \mathcal{D}_{\text{R}} \leftarrow \emptyset \hspace{12em} {\color{blue}{\textit{// Triage}}} \\ 2: \quad \textbf{for each } (x_k, y_w^{(k)}, y_l^{(k)}) \in \mathcal{D} \textbf{ do} \\ 3: \qquad (c_w, c_l) \leftarrow (\pi_{\text{new}}(y_w^{(k)}|x_k),\; \pi_{\text{new}}(y_l^{(k)}|x_k)) \\ 4: \qquad \mathcal{D}_{\text{I}} \mathrel{+}= \{(x_k,y_w^{(k)},y_l^{(k)})\} \cdot \mathbb{I}[c_w=0,\; c_l=1] \\ 5: \qquad \mathcal{D}_{\text{II}} \mathrel{+}= \{(x_k,y_w^{(k)},y_l^{(k)})\} \cdot \mathbb{I}[c_w=0,\; c_l=0] \\ 6: \qquad \mathcal{D}_{\text{R}} \mathrel{+}= \{(x_k,y_w^{(k)},y_l^{(k)})\} \cdot \mathbb{I}[c_w=1] \\ 7: \quad \textbf{end for};\quad \mathcal{D}_{\text{conflict}} \leftarrow \mathcal{D}_{\text{I}} \cup \mathcal{D}_{\text{II}} \\[4pt] 8: \quad {\color{purple}{\textit{// Alignment Impact Computation}}} \\ 9: \quad \mathcal{D}_{\text{gold}} \leftarrow \text{GetGoldBatch}(\mathcal{D}_{\text{I}}, \mathcal{D}_{\text{II}}, \mathcal{D}_{\text{R}}, B)\;\text{(Alg. 
2)} \\ 10: \quad g_{\mathcal{J}} \leftarrow \nabla_{\theta} \sum_{(x,y_w,y_l) \in \mathcal{D}_{\text{gold}}} \ell_{\text{IPO}}(x,y_w,y_l;\theta) \;\big|_{\theta=\theta_{\text{ref}}} \\ 11: \quad \text{Initialize } \mathbf{w}:\mathbb{N}\to\mathbb{R} \\ 12: \quad \textbf{for all } (x_i,y_w^{(i)},y_l^{(i)}) \in \mathcal{D}_{\text{I}} \textbf{ do} \\ 13: \qquad g_i \leftarrow \nabla_\theta\, \ell_{\text{IPO}}(x_i,y_l^{(i)},y_w^{(i)};\theta) \;\big|_{\theta=\theta_{\text{ref}}};\quad \mathbf{w}[i] \leftarrow \langle g_{\mathcal{J}},\, g_i \rangle \\ 14: \quad \textbf{end for} \\ 15: \quad \textbf{for all } (x_j,y_w^{(j)},y_l^{(j)}) \in \mathcal{D}_{\text{II}} \textbf{ do} \\ 16: \qquad \textbf{if } \mathcal{O} \neq \text{None: }\; y_c^{(j)} \leftarrow \mathcal{O}(x_j,\pi_{\text{new}});\; g_j \leftarrow \nabla_\theta\,\ell_{\text{IPO}}(x_j,y_c^{(j)},y_w^{(j)};\theta)\;\big|_{\theta=\theta_{\text{ref}}} \\ 17: \qquad \textbf{else: }\; g_j \leftarrow \nabla_\theta\, \ell_{\text{NPO}}(x_j,y_w^{(j)},y_l^{(j)};\theta)\;\big|_{\theta=\theta_{\text{ref}}} \\ 18: \qquad \mathbf{w}[j] \leftarrow \langle g_{\mathcal{J}},\, g_j \rangle \\ 19: \quad \textbf{end for} \\ 20: \quad \mathbf{w}[k] \leftarrow (1/\gamma)\,\mathbf{w}[k];\;\; \mathbf{w}[k] \leftarrow \max(0,\mathbf{w}[k]);\;\; Z \leftarrow \sum_{k}|\mathbf{w}[k]|;\;\; \mathbf{w}[k] \leftarrow \mathbf{w}[k]/Z \;\;\text{for all } k \in \mathcal{D}_{\text{conflict}} \\[4pt] 21: \quad {\color{red}{\textit{// Finetune } \mathcal{M}_\theta}} \\ 22: \quad \textbf{while } \|\nabla_\theta \mathcal{L}_{\text{TRACE}}\| \gt \epsilon \textbf{ and } t \lt T_{\max} \textbf{ do} \\ 23: \qquad \text{Sample: } \mathcal{B}_{\text{I}} \sim \mathcal{D}_{\text{I}},\;\; \mathcal{B}_{\text{P}} \sim \mathcal{D}_{\text{II}},\;\; \mathcal{B}_{\text{R}} \sim \mathcal{D}_{\text{R}} \\ 24: \qquad \mathcal{L}_{\text{I}} \leftarrow \sum_{i \in \mathcal{B}_{\text{I}}} \mathbf{w}[i]\cdot \ell_{\text{IPO}}(x_i, y_l^{(i)}, y_w^{(i)};\theta) \\ 25: \qquad 
\textbf{if } \mathcal{O} \neq \text{None: }\; \mathcal{L}_{\text{II}} \leftarrow \sum_{j \in \mathcal{B}_{\text{P}}} \mathbf{w}[j]\cdot \ell_{\text{IPO}}(x_j,y_c^{(j)},y_w^{(j)};\theta) \\ 26: \qquad \textbf{else: }\; \mathcal{L}_{\text{II}} \leftarrow \sum_{j \in \mathcal{B}_{\text{P}}} \mathbf{w}[j]\cdot \ell_{\text{NPO}}(x_j,y_w^{(j)},y_l^{(j)};\theta) \\ 27: \qquad \mathcal{L}_{\text{KL}} \leftarrow \sum_{k \in \mathcal{B}_{\text{R}}} \text{KL}\big(\pi_{\theta_{\text{ref}}}(\cdot|x_k)\;\|\;\pi_\theta(\cdot|x_k)\big) \\ 28: \qquad \mathcal{L}_{\text{TRACE}} \leftarrow \mathcal{L}_{\text{I}} + \mathcal{L}_{\text{II}} + \alpha_{\text{KL}}\,\mathcal{L}_{\text{KL}};\quad \theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}_{\text{TRACE}} \\ 29: \quad \textbf{end while} \\[4pt] \textbf{where:} \\ \quad \Delta_\theta(x,y_1,y_2) := \log\frac{\pi_\theta(y_1|x)}{\pi_{\theta_{\text{ref}}}(y_1|x)} - \log\frac{\pi_\theta(y_2|x)}{\pi_{\theta_{\text{ref}}}(y_2|x)} \\ \quad \ell_{\text{IPO}}(x,y_+,y_-;\theta) := -\log\sigma(\beta\,\Delta_\theta(x,y_+,y_-)) \\ \quad \ell_{\text{NPO}}(x,y_w,y_l;\theta) := -\log\sigma\!\left(-\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\theta_{\text{ref}}}(y_w|x)}\right) -\log\sigma\!\left(-\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\theta_{\text{ref}}}(y_l|x)}\right) \\ \hline \end{array} \]
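
The triage rule (steps 1 to 7), the two losses from the "where" block, and the weight normalization of step 20 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `complies` is a hypothetical stand-in for the π_new compliance check, all function names are invented here, and the log-ratios are passed in as precomputed floats instead of being derived from model logits. The mapping of D_I, D_II and D_R to "inversion", "suppression" and "retention" follows the abstract.

```python
import math
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str, str]  # (prompt x, old winner y_w, old loser y_l)


def triage(data: List[Pair], complies: Callable[[str, str], bool]):
    """Steps 1-7: split pairs by whether each response complies with pi_new."""
    d_inv, d_sup, d_ret = [], [], []
    for x, y_w, y_l in data:
        c_w, c_l = complies(x, y_w), complies(x, y_l)
        if c_w:
            d_ret.append((x, y_w, y_l))  # D_R: old winner is still compliant
        elif c_l:
            d_inv.append((x, y_w, y_l))  # D_I: only the old loser is compliant
        else:
            d_sup.append((x, y_w, y_l))  # D_II: neither response is compliant
    return d_inv, d_sup, d_ret


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def relational_loss(delta: float, beta: float) -> float:
    """Relational loss as written in the 'where' block: -log sigma(beta * Delta)."""
    return -log_sigmoid(beta * delta)


def npo_loss(logratio_w: float, logratio_l: float, beta: float) -> float:
    """Punitive loss: penalizes a high policy/reference log-ratio on both responses."""
    return -log_sigmoid(-beta * logratio_w) - log_sigmoid(-beta * logratio_l)


def normalize_weights(raw: Dict[int, float], gamma: float) -> Dict[int, float]:
    """Step 20: scale by 1/gamma, clip at zero, then divide by the total."""
    clipped = {k: max(0.0, v / gamma) for k, v in raw.items()}
    z = sum(clipped.values())
    if z == 0.0:
        return clipped  # assumption: all-zero weights are returned unchanged
    return {k: v / z for k, v in clipped.items()}
```

As a usage example, `normalize_weights({0: 2.0, 1: -1.0, 2: 6.0}, gamma=2.0)` yields weights 0.25, 0.0 and 0.75: the sample whose gradient opposes the gold-batch gradient (negative inner product) is clipped to zero and contributes nothing to the weighted loss.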

\[ \begin{array}{l} \textbf{Algorithm 2: } \text{Gold-Standard Preference Batch Construction} \\[3pt] \hline\\[-7pt] \textbf{Input: } \mathcal{D}_{\text{I}},\, \mathcal{D}_{\text{II}},\, \mathcal{D}_{\text{R}} \text{ (triaged datasets); } B \text{ (batch size)} \\ \hline\\[-7pt] 1{:}\;\; \mathcal{D}_{\text{gold}} \leftarrow \emptyset;\quad \mathcal{Y}_{\text{compliant}} \leftarrow \{y_w : (x, y_w, y_l) \in \mathcal{D}_{\text{R}}\} \cup \{y_l : (x, y_w, y_l) \in \mathcal{D}_{\text{I}}\} \\ 2{:}\;\; (B_{\text{R}},\, B_{\text{I}}) \leftarrow (\min(|\mathcal{D}_{\text{R}}|, \lfloor B/3 \rfloor),\; \min(|\mathcal{D}_{\text{I}}|, \lfloor B/3 \rfloor)) \\ 3{:}\;\; \mathcal{S}_{\text{R}} \leftarrow \text{UniformSample}(\mathcal{D}_{\text{R}}, B_{\text{R}});\quad \mathcal{D}_{\text{gold}} \leftarrow \mathcal{D}_{\text{gold}} \cup \{(x, y_w, y_l) : (x, y_w, y_l) \in \mathcal{S}_{\text{R}}\} \\ 4{:}\;\; \mathcal{S}_{\text{I}} \leftarrow \text{UniformSample}(\mathcal{D}_{\text{I}}, B_{\text{I}});\quad \mathcal{D}_{\text{gold}} \leftarrow \mathcal{D}_{\text{gold}} \cup \{(x, y_l, y_w) : (x, y_w, y_l) \in \mathcal{S}_{\text{I}}\} \\ 5{:}\;\; \textbf{if } \mathcal{Y}_{\text{compliant}} \neq \emptyset \land \mathcal{D}_{\text{II}} \neq \emptyset \textbf{ then} \\ 6{:}\;\quad B_{\text{P}} \leftarrow B - |\mathcal{D}_{\text{gold}}|;\quad \mathcal{S}_{\text{P}} \leftarrow \text{UniformSample}(\mathcal{D}_{\text{II}}, B_{\text{P}}) \\ 7{:}\;\quad \mathcal{D}_{\text{gold}} \leftarrow \mathcal{D}_{\text{gold}} \cup \{(x,\, \text{UniformSample}(\mathcal{Y}_{\text{compliant}}),\, y_w) : (x, y_w, y_l) \in \mathcal{S}_{\text{P}}\} \\ 8{:}\;\; \textbf{end if} \\ 9{:}\;\; \textbf{return } \mathcal{D}_{\text{gold}} \\ \hline \end{array} \]
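
Algorithm 2 can be sketched in a few lines of Python. The version below is illustrative only and rests on several assumptions that the pseudocode leaves open: `UniformSample` is read as sampling without replacement, the suppression quota B_P is capped at |D_II|, lists are used where the pseudocode uses sets (so duplicates are kept), and the function name and the `rng` parameter are hypothetical.

```python
import random
from typing import List, Optional, Tuple

Pair = Tuple[str, str, str]  # (prompt x, old winner y_w, old loser y_l)


def get_gold_batch(
    d_inv: List[Pair],
    d_sup: List[Pair],
    d_ret: List[Pair],
    batch_size: int,
    rng: Optional[random.Random] = None,
) -> List[Pair]:
    """Build (x, preferred, rejected) triples that agree with the new policy."""
    if rng is None:
        rng = random.Random()
    # Step 1: responses known to comply with the new policy.
    compliant = [y_w for _, y_w, _ in d_ret] + [y_l for _, _, y_l in d_inv]
    # Step 2: up to a third of the batch each from D_R and D_I.
    b_r = min(len(d_ret), batch_size // 3)
    b_i = min(len(d_inv), batch_size // 3)
    # Step 3: retained pairs keep their original order.
    gold = list(rng.sample(d_ret, b_r))
    # Step 4: inverted pairs are flipped so the old loser is preferred.
    gold += [(x, y_l, y_w) for x, y_w, y_l in rng.sample(d_inv, b_i)]
    # Steps 5-8: fill the rest from D_II, pairing a random compliant
    # response (drawn from the global pool) against the old winner.
    if compliant and d_sup:
        b_p = min(len(d_sup), batch_size - len(gold))  # cap is an assumption
        for x, y_w, _ in rng.sample(d_sup, b_p):
            gold.append((x, rng.choice(compliant), y_w))
    return gold
```

Passing a seeded `random.Random` makes the batch reproducible. As in line 7 of Algorithm 2, the compliant response for a suppression pair is drawn from the pooled compliant set and is not tied to that pair's prompt.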

BibTeX

@inproceedings{
sharma2026the,
title={The Realignment Problem: When Right becomes Wrong in {LLM}s},
author={Aakash Sen Sharma and Debdeep Sanyal and Manodeep Ray and Vivek Srivastava and Shirish Karande and Murari Mandal},
booktitle={Forty-third International Conference on Machine Learning},
year={2026},
url={https://openreview.net/forum?id=yt6vQ8gbGm}
}
