Adaptive Activation Steering: A Tuning-Free LLM Truthfulness Improvement Method for Diverse Hallucinations Categories

Tianlong Wang (ORCID 0009-0002-7292-6868), School of Software and Microelectronics, Peking University, Beijing, China, tianlong.wang@stu.pku.edu.cn; Xianfeng Jiao (ORCID 0000-0002-7380-1736), Key Laboratory of High Confidence Software Technologies, Ministry of Education, Beijing, China, jiaoxianfeng@stu.pku.edu.cn; Yinghao Zhu (ORCID 0000-0002-2640-6477), National Engineering Research Center for Software Engineering, Peking University, Beijing, China, yhzhu99@gmail.com; Zhongzhi Chen (ORCID 0009-0009-9487-8140), Beihang University, Beijing, China, Jongjyh@buaa.edu.cn; Yifan He (ORCID 0009-0008-4674-970X), School of Software and Microelectronics, Peking University, Beijing, China, Heyf@stu.pku.edu.cn; Xu Chu (ORCID 0000-0002-0520-7196), Center on Frontiers of Computing Studies, Peking University, Beijing, China, chu_xu@pku.edu.cn; Junyi Gao (ORCID 0000-0002-4951-8682), Centre for Medical Informatics, University of Edinburgh, Edinburgh, Scotland, UK, and Health Data Research UK, London, UK, junyi.gao@ed.ac.uk; Yasha Wang (ORCID 0000-0002-8026-9688), Key Laboratory of High Confidence Software Technologies, Ministry of Education, and National Engineering Research Center for Software Engineering, Peking University, Beijing, China, wangyasha@pku.edu.cn; and Liantao Ma (ORCID 0000-0001-5233-0624), Key Laboratory of High Confidence Software Technologies, Ministry of Education, and National Engineering Research Center for Software Engineering, Peking University, Beijing, China, malt@pku.edu.cn

(2025)

Abstract.

Recent studies have indicated that Large Language Models (LLMs) harbor an inherent understanding of truthfulness, yet often fail to consistently express it and generate false statements. This gap between "knowing" and "telling" poses a challenge for ensuring the truthfulness of generated content. Inspired by recent work on the practice of encoding human-interpretable concepts linearly within large language models, we treat truthfulness as a specially linearly encoded concept within LLMs, and introduce Adaptive Activation Steering (ACT), a tuning-free method that adaptively shifts LLM’s activations in the "truthful" direction during inference. ACT addresses diverse categories of hallucinations by utilizing diverse truthfulness-related steering vectors and adjusting the steering intensity adaptively. Applied as an add-on across various models, ACT significantly improves truthfulness in LLaMA ( $\uparrow$ 142%), LLaMA2 ( $\uparrow$ 24%), Alpaca ( $\uparrow$ 36%), Vicuna ( $\uparrow$ 28%), LLaMA2-Chat ( $\uparrow$ 19%), and LLaMA3( $\uparrow$ 34%). Furthermore, we verify ACT’s scalability across larger models (13B, 33B, 65B), underscoring the adaptability of ACT to large-scale language models. Our code is available at https://github.com/tianlwang/ACT.

large language model; hallucination; tuning-free

Proceedings of the ACM Web Conference 2025 (WWW ’25), April 28–May 2, 2025, Sydney, NSW, Australia. Copyright: ACM licensed, 2025. DOI: 10.1145/3696410.3714640. ISBN: 979-8-4007-1274-6/25/04. CCS: Computing methodologies → Natural language generation.

1. Introduction

Figure 1. Illustration of ACT. (a) Demonstrates the calculation of the steering vector. (b) Shows how a single steering vector $v$ shifts the original activation $x$ with constant intensity, as discussed in subsection 2.2. (c) Illustrates adaptive adjustment of steering intensity based on the truthfulness content of the activation, where $f(\cdot)$ is a probe used to determine the truthfulness content of the activation (subsection 3.3). (d) Applies diverse steering vectors ( $v_{0},v_{1},v_{2}$ ) to target diverse categories of hallucinations (subsection 3.2). (e) Combines (c) and (d) in ACT, shifting the original activation.

Large language models (LLMs) have demonstrated remarkable potential in web-based applications (Radford et al., 2019; Achiam et al., 2023; Nori et al., 2023; Wang et al., 2024). However, despite their fluency, they often generate false statements, or "hallucinations". These hallucinations present a major challenge to building a responsible web, as they can be extremely harmful in applications like medical or legal advice, where high truthfulness is essential (Koyejo and Li, 2024; Ma et al., 2023).

Recent work indicates that LLMs do not consistently provide truthful answers, even when the correct knowledge is present in their training corpus. For instance, Wei et al. (2022) found that ChatGPT can provide a wrong answer in one context while giving the correct answer in another. Similarly, Kadavath et al. (2022) and Dhuliawala et al. (2023) discovered that LLMs can self-evaluate their generated answers with high accuracy. These findings reveal that LLMs sometimes "know" more than they "tell", indicating a gap between an LLM’s "knowing" and "telling".

To address this gap, we draw inspiration from the works of Jorgensen et al. (2023) and Zou et al. (2023), who propose methods for steering model behavior by encoding human-interpretable concepts linearly within large language models (Elhage et al., 2022). Specifically, they first extract a specific human-interpretable concept as a fixed steering vector. This vector is then added to the model’s activations during inference, shifting the LLM’s activations in the direction of this specific concept. Inspired by their approach, we treat truthfulness as a special concept, aiming to shift the LLM’s activations in the "truthful" direction to close the gap between the LLM’s "knowing" and "telling". Naturally, we ask: Q1. Should all activations share the same steering intensity, even when they have varying levels of truthfulness? Q2. Is a single steering vector sufficient to handle diverse categories of hallucinations?

To this end, we propose Adaptive ACtivation STeering (ACT), a tuning-free LLM truthfulness improvement method for diverse hallucination categories. ACT first calculates the steering vector based on the difference between truthful and untruthful activations (as shown in Figure 1-a). Unlike existing methods that use a single steering vector with fixed steering intensity for all activations (as shown in Figure 1-b), ACT takes a more adaptive approach. Addressing Q1, ACT controls the steering intensity based on the truthfulness content of the activations (as shown in Figure 1-c). Addressing Q2, observing that steering vectors for different categories of hallucinations exhibit distinct clustering patterns in the activation space (as shown in Figure 3), ACT generates diverse steering vectors through unsupervised clustering, aiming to enable customized interventions for various categories of hallucinations (as shown in Figure 1-d).

Experimental results demonstrate that ACT consistently improves truthfulness across 38 categories of hallucinations on the TruthfulQA benchmark. Our contributions are summarized as follows:

• We propose ACT, a tuning-free method to enhance the truthfulness of LLMs, requiring only a few dozen training samples and introducing only a constant additional time cost during inference. (Demonstrated in subsection 5.4)
• We introduce an adaptive steering intensity control strategy, which adjusts the intensity based on the truthfulness content of the activations. (Response to Q1)
• To the best of our knowledge, we are the first to observe that steering vectors for different categories of hallucinations exhibit distinct clustering patterns in the activation space. Therefore, ACT utilizes diverse steering vectors for customized intervention. (Response to Q2)
• Experimental results show that ACT significantly enhances truthfulness across several models: LLaMA ( $\uparrow$ 142%), LLaMA2 ( $\uparrow$ 24%), Alpaca ( $\uparrow$ 36%), Vicuna ( $\uparrow$ 28%), LLaMA2-Chat ( $\uparrow$ 19%), and LLaMA3 ( $\uparrow$ 34%). Furthermore, we verify ACT’s scalability across larger models (13B, 33B, 65B), underscoring the adaptability of ACT to large-scale language models.

2. Related Work

2.1. Latent Space Arithmetic

Research in generative models for computer vision has long demonstrated the ability to steer image generation using derived vectors, including steering latent variables. This is most famously exemplified by intervening on a dimension that corresponds to smiles in images (Larsen et al., 2016; White, 2016), enabling counterfactual editing of generations (Upchurch et al., 2017; Bau et al., 2020a; Shen et al., 2020; Bau et al., 2020b; Ling et al., 2021).

Similarly, in the text domain, several works have been proposed for concept erasure (Kleindessner et al., 2023; Belrose et al., 2023; Ravfogel et al., 2022; Gandikota et al., 2023). The success of these methods suggests the potential of the approach presented in this work.

2.2. LLM Steering

Many approaches attempt to affect the output of a pretrained LLM, either by intervening on its weights or by intervening on its activations:

Intervening on Weights: This includes methods such as supervised fine-tuning, RLHF, steerable layers, and weight editing (targeted fine-tuning) (Ranzato et al., 2015; Ziegler et al., 2019; Dathathri et al., 2019; Meng et al., 2022; Ilharco et al., 2022). However, RLHF and weight editing are known to have side effects on overall model performance (Achiam et al., 2023; Brown et al., 2023). In addition, they both require huge annotation and computation resources, contrasting with our method, which only requires 40 samples to determine the steering vector and steering intensity.

Intervening on Activations: These methods freeze the weights of the LLM and search for a steering vector in activation space. Contrast-Consistent Search (CCS) (Burns et al., 2022) finds truthful directions given paired internal activations by satisfying logical consistencies, though it is unclear whether these directions are causal or merely correlated with the model’s processing of truth. Inference-Time Intervention (ITI) (Li et al., 2023) focuses on directions that have a causal influence on model outputs, using activation editing to increase the truthfulness of generations. Representation Engineering (RepE) (Zou et al., 2023) shows that pairing neural activities and applying PCA to the set of difference vectors yields a superior direction. Mean-Centring (Jorgensen et al., 2023) finds that taking the average of activations associated with a target dataset, and then subtracting the mean of all training activations, results in effective steering vectors. TruthX (Zhang et al., 2024) employs an auto-encoder to map the LLM’s representations into semantic and truthful latent spaces, respectively, and edits the LLM’s internal representations in the truthful space. These methods have two limitations. First, they often use a single steering vector and a fixed steering intensity, which does not consider when to perform steering and may not be enough to handle the variety of hallucination cases. Our method differs by adjusting steering intensity based on the truthfulness content of the activations and by using unsupervised clustering to create diverse steering vectors, which provides more tailored interventions to mitigate hallucinations. Second, some approaches, such as TruthX, rely on fine-tuning to learn an auto-encoder, whereas our method is tuning-free.

3. Methods

Table 1. Comparison of model performance in few-shot and full data settings. In the full data setting, ACT achieved a significant relative improvement of 34% in the main metric True*Info over the leading state-of-the-art baseline.

Few-shot Setting
Model	Open-ended Generation(%)			Multiple-Choice(%)		Intensity
Model	BLEURT	TRUE	True * Info	MC1	MC2	CE	KL
Baseline	32.8	23.9	23.0	24.8	39.8	2.22	0.00
Baseline + ITI	39.6	32.8	28.6	26.7	42.2	2.71	0.49
Baseline + ACT	56.5	52.0	39.1	26.7	43.1	2.35	0.19
Few-shot Prompting	49.1	43.2	39.5	35.1	50.7	-	-
Few-shot Prompting + ITI	51.0	49.2	39.4	34.2	51.1	-	-
Few-shot Prompting + ACT	57.3	54.2	46.6	35.5	52.3	-	-
Full Data
Baseline	32.5	24.0	23.1	25.3	40.1	2.16	0.00
Random Steering	32.4	25.2	23.7	25.7	40.1	2.13	0.03
CCS	33.8	27.0	25.7	26.3	41.1	2.21	0.06
RepE	33.7	32.2	25.4	27.4	43.3	3.35	1.27
Mean-Centring	37.0	29.0	31.6	27.7	43.6	2.84	0.74
ITI: Probe weight direction	35.5	29.3	27.6	27.7	42.3	2.36	0.27
ITI: Mass mean shift	38.0	38.1	29.9	28.7	44.4	2.88	0.79
ACT	55.3	58.0	42.3	28.8	45.2	2.43	0.24

Activation Steering (Turner et al., 2023; Li et al., 2023; Subramani et al., 2022) focuses on identifying directions in the activation space that correspond to factually correct statements, then shifting activations in that direction during inference. Building on this, our method generates diverse steering vectors from raw data to address various hallucination categories (subsection 3.2). Additionally, we introduce adaptive control of steering intensity based on the truthfulness content of the activations (subsection 3.3). For the pseudocode of the proposed method, see Algorithm 1.

Algorithm 1 Adaptive Activation Steering

Input:
$\mathcal{M}$ = language model
$\mathcal{D}$ = question-answer dataset (each question paired with truthful answers $A_{i}^{+}$ and untruthful answers $A_{i}^{-}$ )
$C$ = number of clusters for diverse steering vectors generation
$SteeringMethod$ = Method used to steer language model
$TrainProbe$ = Method used to fit binary linear classifiers (probes)
Output:
$S$ = steered output text

1: Initialize $V$ to store directional representations for each question
2: Initialize $P$ to store probes generated for each cluster
3: for each tuple $(Q_{i},A_{i}^{+},A_{i}^{-})$ in $\mathcal{D}$ do
4:     $\mathcal{M}.forward(Q_{i},A_{i}^{+})$
5:     $\mu_{i}^{+}=\text{Mean}(\mathcal{M}.activations)$
6:     $\mathcal{M}.forward(Q_{i},A_{i}^{-})$
7:     $\mu_{i}^{-}=\text{Mean}(\mathcal{M}.activations)$
8:     $\mathbf{v}_{i}\leftarrow\mu_{i}^{+}-\mu_{i}^{-}$
9:     Append $\mathbf{v}_{i}$ to $V$
10: end for
11: $\mathcal{D}_{1},\mathcal{D}_{2},...,\mathcal{D}_{C}=KMeans(V)$
12: for each $j$ in $C$ do
13:     $p_{\theta_{j}}=TrainProbe(\mathcal{D}_{j})$
14:     Append $p_{\theta_{j}}$ to $P$
15: end for
16: $S\leftarrow SteeringMethod(\mathcal{M},P)$

3.1. Preliminary

Model Architecture: To establish notation and context, we detail the transformer architecture, emphasizing the multi-head attention (MHA) mechanism within layers indexed by $l$ (Vaswani et al., 2017; Elhage et al., 2021). A transformer layer includes an MHA module and a multilayer perceptron (MLP) layer. Input tokens are embedded into vectors $x_{0}\in\mathbb{R}^{DH}$ , initiating a residual stream $x_{0},\ldots,x_{n}$ , processed by transformer layers to produce $x_{i+1}$ from $x_{i}$ , with final token decoding for prediction. MHA entails $H$ linear operations, formulated as:

$$x_{l+1}=x_{l}+\sum_{h=1}^{H}Q_{l}^{h}\operatorname{Att}_{l}^{h}(P_{l}^{h}x_{l}) \tag{1}$$

Here, $P_{l}^{h}\in\mathbb{R}^{D\times DH}$ and $Q_{l}^{h}\in\mathbb{R}^{DH\times D}$ are projection matrices facilitating dimensionality transitions within a $D$ -dimensional head space. $\operatorname{Att}$ is an operator where communication with other input tokens happens. Our analysis and steering occur after $\operatorname{Att}$ and before $Q_{l}^{h}$ . The activation of the $h$ -th head in the $l$ -th layer is denoted as $a_{l}^{h}\in\mathbb{R}^{D}$ .

Probing for "Truthfulness": Probes are utilized to discern a network’s internal mechanisms (Alain and Bengio, 2016; Tenney et al., 2019; Belinkov, 2016). In this work, we define a probe $p_{\theta}(a_{l}^{h})=\operatorname{sigmoid}(\langle\theta,a_{l}^{h}\rangle)$ for each head in every layer of the LLM to detect the truthfulness content of the activations. For each sample, we concatenate the question and answer, then extract the head activations at the last token to create a probing dataset $\{(a_{l}^{h},y)_{i}\}_{i=1}^{N}$ for each head in each layer, where $y$ indicates whether the current activation comes from a truthful or untruthful answer. We then randomly split the dataset into training and validation sets in a 4:1 ratio, fit a binary linear classifier on the training set, and use the validation accuracy to evaluate the contribution of each head in generating truthful responses.
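The probing procedure above can be illustrated with a minimal sketch. It assumes synthetic two-dimensional "activations" in place of real head activations, and it fits $\theta$ with plain gradient descent; the optimizer, learning rate, and toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe(theta, a):
    """p_theta(a) = sigmoid(<theta, a>) for one head activation a."""
    return sigmoid(sum(t * x for t, x in zip(theta, a)))


def train_probe(data, dim, lr=0.5, epochs=200):
    """Fit theta by gradient descent on the logistic loss (assumed optimizer)."""
    theta = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for a, y in data:
            err = probe(theta, a) - y
            for k in range(dim):
                grad[k] += err * a[k]
        theta = [t - lr * g / len(data) for t, g in zip(theta, grad)]
    return theta


def accuracy(theta, data):
    """Fraction of samples whose thresholded probe output matches the label."""
    hits = sum((probe(theta, a) >= 0.5) == bool(y) for a, y in data)
    return hits / len(data)


# Toy stand-in for head activations: truthful samples (y=1) are shifted along
# the first coordinate, untruthful samples (y=0) the opposite way.
rng = random.Random(0)
dataset = []
for i in range(100):
    y = i % 2
    centre = 3.0 if y == 1 else -3.0
    dataset.append(([rng.gauss(centre, 1.0), rng.gauss(0.0, 1.0)], y))

rng.shuffle(dataset)
split = len(dataset) * 4 // 5          # 4:1 train/validation split
train, val = dataset[:split], dataset[split:]

theta = train_probe(train, dim=2)
val_acc = accuracy(theta, val)         # used later to rank heads
print(f"validation accuracy: {val_acc:.2f}")
```

In the actual method one such probe is fitted per head and per layer; the validation accuracy is the quantity used to rank heads.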

3.2. Diverse Probe-Driven Steering Vector Generation

Clustering for Directional Representation: For each question in our dataset, we create a unique directional representation. This is achieved by contrasting the mean activations of the final token from multiple truthful answers ( $\bar{a}_{\text{truthful}}$ ) and untruthful answers ( $\bar{a}_{\text{untruthful}}$ ). Each question’s directional representation is defined as $d=\bar{a}_{\text{truthful}}-\bar{a}_{\text{untruthful}}$ . We use K-means clustering on these representations to produce $C$ clusters, each representing a distinct hallucination pattern in LLM outputs.
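As a rough illustration of this step, the sketch below computes $d=\bar{a}_{\text{truthful}}-\bar{a}_{\text{untruthful}}$ for four toy questions and groups the resulting vectors with Lloyd's K-means. The activations are invented two-dimensional values, and the initial centroids are passed in explicitly so the run is deterministic; both choices are assumptions for illustration, not details from the paper.

```python
def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def direction(truthful_acts, untruthful_acts):
    """d = mean(truthful activations) - mean(untruthful activations)."""
    mt, mu = mean(truthful_acts), mean(untruthful_acts)
    return [t - u for t, u in zip(mt, mu)]


def sq_dist(p, q):
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans(points, init_centroids, iters=20):
    """Lloyd's algorithm with explicitly supplied initial centroids."""
    centroids = [list(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                  # keep the old centroid if a cluster empties
                centroids[j] = mean(members)
    return labels, centroids


# Toy last-token activations per question: (truthful answers, untruthful answers).
questions = [
    ([[2.0, 0.0], [4.0, 0.0]], [[1.0, 0.0]]),
    ([[3.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]),
    ([[0.0, 5.0]], [[0.0, 2.0]]),
    ([[1.0, 4.0], [1.0, 6.0]], [[0.5, 2.0]]),
]
directions = [direction(t, u) for t, u in questions]

# C = 2 clusters, initialised from the first and third direction vectors.
labels, centroids = kmeans(directions, [directions[0], directions[2]])
print(labels, centroids)
```

Questions whose direction vectors point the same way end up in the same cluster, and each cluster then supplies the training data for its own set of probes.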

Cluster-Based Probe Generation: After clustering, we train distinct probes with data from each cluster, ensuring each probe is attuned to a specific hallucination pattern. The probe for the $c$ -th cluster, at the $l$ -th layer and the $h$ -th head, is denoted as $p_{\theta_{c,l}^{h}}$ , and its parameter is denoted as $\theta_{c,l}^{h}$ . The training process is described in subsection 3.1. The trained probes serve as detectors for the truthfulness content of the current activation; together with their accuracy on the validation set, they support the subsequent adaptive activation steering during inference.

3.3. Adaptive Steering Intensity Control

Building upon the diverse probe-driven steering vectors generated as detailed in subsection 3.2, we introduce the method of Adaptive Steering Intensity Control (ASIC) to dynamically adjust the steering intensity during inference.

Selection of Intervention Heads: ASIC’s initial step involves identifying the most influential heads for intervention. This process relies on the validation accuracy of the probes within each cluster. For every cluster, we select the top $K$ heads based on the accuracy of the corresponding probes on the validation set. This selection keeps the intervention focused, targeting only those heads that contribute significantly to the generation of truthful outputs.
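A minimal sketch of the top-$K$ selection for a single cluster follows. The accuracy values and the `(layer, head)` key layout are hypothetical and only serve to show the ranking step.

```python
def select_top_heads(val_acc, k):
    """Return the k (layer, head) pairs with the highest probe validation accuracy."""
    ranked = sorted(val_acc.items(), key=lambda item: item[1], reverse=True)
    return [head for head, _ in ranked[:k]]


# Hypothetical validation accuracies for one cluster, keyed by (layer, head).
cluster_val_acc = {
    (0, 0): 0.52, (0, 1): 0.61,
    (1, 0): 0.83, (1, 1): 0.58,
    (2, 0): 0.77, (2, 1): 0.69,
}
top_heads = select_top_heads(cluster_val_acc, k=3)
print(top_heads)
```

Heads outside the returned list receive a zero steering vector, so they are left untouched at inference time.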

Dynamic Steering Vector Application: The core of ASIC lies in its ability to dynamically adjust the steering intensity based on the activations of selected heads. For each head, the activations are fed into the corresponding probe, outputting a value between 0 and 1 that represents the similarity to the ’truthfulness’ distribution. This similarity score is then used to modulate the steering intensity. Specifically, the steering vector is scaled by a factor of $(1-\text{similarity score})$ , ensuring a larger shift when activations deviate more from the ’truthfulness’ state. The intervention for a selected head is formalized as follows:

$$x_{l+1}=x_{l}+\sum_{c=1}^{C}\sum_{h=1}^{H}Q_{l}^{h}\left(a_{l}^{h}+\alpha\left(1-p_{\theta_{c,l}^{h}}(a_{l}^{h})+\beta\right)v_{c,l}^{h}\right) \tag{2}$$

where $a_{l}^{h}=\operatorname{Att}_{l}^{h}(P_{l}^{h}x_{l})$ , $x_{l}$ and $x_{l+1}$ represent the input and output of layer $l$ respectively, $C$ is the number of clusters, $H$ is the number of intervention heads, and $\alpha(1-p_{\theta_{c,l}^{h}}(a_{l}^{h})+\beta)$ is used to control the steering intensity. Here, $\alpha$ and $\beta$ are hyperparameters, and $v_{c,l}^{h}$ is the steering vector. For non-selected attention heads, $v_{c,l}^{h}$ is a zero vector. The non-zero steering vector $v_{c,l}^{h}$ can be the simple subtraction of the mean of untruthful activations from the mean of truthful activations. Alternatively, it can be $\theta_{c,l}^{h}$ . $\theta_{c,l}^{h}$ is the parameter for the binary classification probe, acting as the normal vector of the hyperplane that separates truthful and untruthful activations. In the subsequent experiments of this work, unless otherwise specified, the steering vector used is $\theta_{c,l}^{h}$ .
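The scaling rule can be sketched for a single selected head as follows. The sketch makes one interpretive assumption: the sum over clusters is applied to the steering term only, so the activation $a_{l}^{h}$ is kept once and each cluster adds its own scaled shift. The probe parameters, the activations, and the values of $\alpha$ and $\beta$ are toy values chosen for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe(theta, a):
    """p_theta(a) = sigmoid(<theta, a>)."""
    return sigmoid(sum(t * x for t, x in zip(theta, a)))


def steer(a, probes, vectors, alpha, beta):
    """Shift a by sum_c alpha * (1 - p_c(a) + beta) * v_c (assumed reading of Eq. 2)."""
    shifted = list(a)
    for theta, v in zip(probes, vectors):
        intensity = alpha * (1.0 - probe(theta, a) + beta)
        shifted = [s + intensity * x for s, x in zip(shifted, v)]
    return shifted


def shift_norm(a, shifted):
    return math.sqrt(sum((s - x) ** 2 for s, x in zip(shifted, a)))


# Two clusters; following the text, each steering vector is the probe parameter.
thetas = [[1.0, 0.0], [0.0, 1.0]]
vectors = thetas

looks_untruthful = [-2.0, -2.0]   # both probes output a low score
looks_truthful = [3.0, 3.0]       # both probes output a high score

for a in (looks_untruthful, looks_truthful):
    out = steer(a, thetas, vectors, alpha=1.0, beta=0.0)
    print(a, "->", [round(x, 3) for x in out], "shift:", round(shift_norm(a, out), 3))
```

The activation that the probes score as less truthful receives the larger shift, which is the behaviour the factor $(1-\text{similarity score})$ is meant to produce.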

4. Experiments

Table 2. Comparison of mainstream LLMs using 2-fold cross-validation. LLaMA 3 is the 8B version, while all other models are 7B versions. ACT demonstrated a remarkable relative enhancement of 142% compared to LLaMA.

Pre-trained
Model	Open-ended Generation(%)			Multiple-Choice(%)		Intensity
Model	BLEURT	TRUE	True * Info	MC1	MC2	CE	KL
LLaMA	32.5	24.0	23.1	25.3	40.1	2.16	0.00
LLaMA + ACT	55.3	58.0	42.3	28.8	45.2	2.43	0.24
LLaMA 2	40.8	34.5	31.1	28.4	43.3	2.11	0.00
LLaMA 2 + ACT	45.7	42.7	38.1	30.6	46.7	2.30	0.20
LLaMA 3	51.4	43.3	31.2	30.4	49.0	2.42	0.00
LLaMA 3 + ACT	59.5	55.6	41.7	34.3	51.9	3.12	0.76
Instruction Fine-tuned
Alpaca	38.3	35.4	35.1	26.3	41.8	2.51	0.00
Alpaca + ACT	45.7	48.1	44.5	28.3	45.9	2.72	0.41
Vicuna	52.6	51.4	46.5	33.4	49.5	2.58	0.00
Vicuna + ACT	60.5	66.0	52.3	36.0	53.7	2.90	0.70
LLaMA 2-Chat	61.0	61.8	48.6	33.8	51.1	2.47	0.00
LLaMA 2-Chat + ACT	63.8	73.3	65.5	36.7	54.0	2.73	0.46

4.1. Dataset

To operationalize the concept of truth, we choose TruthfulQA (Lin et al., 2021), a challenging, adversarially designed benchmark released by OpenAI to assess truthful behavior. It contains $817$ questions in total, spanning $38$ categories (e.g., logical falsehoods, conspiracies, and common points of confusion). Each question comes with an average of $3.2$ truthful answers, $4.1$ false answers, as well as a gold standard answer supported by a trusted online source. We reorganize TruthfulQA by answers to get $N=5,882$ QA pairs, each with a binary truthfulness label.
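The reorganization by answers can be sketched as below. The record layout (`question`, `truthful`, `untruthful`) and the placeholder strings are hypothetical and do not reflect the actual TruthfulQA file format.

```python
def to_qa_pairs(records):
    """Flatten question records into (question, answer, label) triples.

    label is 1 for a truthful answer and 0 for an untruthful one.
    """
    pairs = []
    for rec in records:
        for ans in rec["truthful"]:
            pairs.append((rec["question"], ans, 1))
        for ans in rec["untruthful"]:
            pairs.append((rec["question"], ans, 0))
    return pairs


# Placeholder records; field names and contents are illustrative only.
records = [
    {"question": "Q1", "truthful": ["T1a", "T1b"], "untruthful": ["F1a"]},
    {"question": "Q2", "truthful": ["T2a"], "untruthful": ["F2a", "F2b", "F2c"]},
]
qa_pairs = to_qa_pairs(records)
print(len(qa_pairs), qa_pairs[0])
```

Each triple becomes one probing sample once the question and answer are concatenated and passed through the model.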

4.2. Experimental Setup

Evaluation. We evaluate our method on the TruthfulQA benchmark, which has two tracks: open-ended generation and multiple-choice. For open-ended generation, we use True*Info as the main metric (Lin et al., 2021). We also use BLEURT (Sellam et al., 2020) as a similarity function to compare model answers to both true and false reference answers. For multiple-choice, we use MC1 and MC2 (Lin et al., 2021), which are based on the correct ranking of truthful answers. More details of the automated metrics can be found in Appendix A. In addition to automated metrics, human evaluations are conducted to validate the effectiveness of ACT; see subsection 4.5 for details. In subsection 5.5, we also validate the generalizability of ACT on two real-world truth-related datasets: Natural Questions (Kwiatkowski et al., 2019) and MMLU (Hendrycks et al., 2020).
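For readers unfamiliar with the multiple-choice track, the sketch below shows one common reading of MC1 and MC2: MC1 checks whether the reference true answer receives a higher log-probability than every false answer, and MC2 is the share of probability mass placed on the true answers. This reading is an assumption here, since the exact definitions are deferred to Appendix A, and the log-probabilities are toy values.

```python
import math


def mc1(best_true_logp, false_logps):
    """1.0 if the reference true answer outranks every false answer, else 0.0."""
    return 1.0 if best_true_logp > max(false_logps) else 0.0


def mc2(true_logps, false_logps):
    """Share of the total answer probability mass that falls on true answers."""
    true_mass = sum(math.exp(lp) for lp in true_logps)
    false_mass = sum(math.exp(lp) for lp in false_logps)
    return true_mass / (true_mass + false_mass)


# Toy per-answer log-probabilities for a single question.
true_logps = [math.log(0.3), math.log(0.2)]
false_logps = [math.log(0.5)]

print(mc1(true_logps[0], false_logps), round(mc2(true_logps, false_logps), 3))
```

Both scores are averaged over questions to give the reported percentages.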

Model. We test various open-source models, including LLaMA (Touvron et al., 2023a), LLaMA 2 (Touvron et al., 2023b), Alpaca (Taori et al., 2023), Vicuna (Chiang et al., 2023), LLaMA 2-Chat (Touvron et al., 2023b), and LLaMA 3 (Dubey et al., 2024). For most evaluations, we use LLaMA-7B as the primary model.

Measuring Intervention. Following Li et al. (2023), we calibrate intervention intensity using Cross Entropy (CE) and Kullback–Leibler divergence (KL) to measure deviation from the original generation distribution. Lower values indicate less change.
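As a small illustration of the two quantities, the sketch below computes a KL divergence between a pre-intervention and a post-intervention next-token distribution and the cross-entropy of an observed token. The three-token distributions are toy values, and the direction of the KL term is an assumption rather than a detail stated in this section.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(q, target_index):
    """Negative log-likelihood of the observed next token under q."""
    return -math.log(q[target_index])


original = [0.7, 0.2, 0.1]   # toy next-token distribution before steering
steered = [0.5, 0.3, 0.2]    # toy next-token distribution after steering

print("KL:", round(kl_divergence(original, steered), 4))
print("CE (original):", round(cross_entropy(original, 0), 4))
print("CE (steered): ", round(cross_entropy(steered, 0), 4))
```

An unsteered model has KL equal to zero by construction, which matches the 0.00 entries for the baselines in Tables 1 and 2.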

Few-shot Setting. Following Li et al. (2023), we randomly select $5\%$ (i.e., 40 samples) of the data for training.

Full Data Setting. We perform two-fold cross-validation on the entire dataset, using $50\%$ (i.e., 408 samples) of the data for training.
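The sample counts in both settings follow from the 817 questions in TruthfulQA. The sketch below reproduces them; splitting by question index with a seeded shuffle is an assumption about the procedure, chosen only to make the example deterministic.

```python
import random


def two_fold_splits(n_questions, seed=0):
    """Return two (train, test) index splits over the questions."""
    idx = list(range(n_questions))
    random.Random(seed).shuffle(idx)
    half = n_questions // 2
    fold_a, fold_b = idx[:half], idx[half:]
    return [(fold_a, fold_b), (fold_b, fold_a)]


N_QUESTIONS = 817
few_shot_n = int(0.05 * N_QUESTIONS)    # 5% of the questions
splits = two_fold_splits(N_QUESTIONS)

print("few-shot training questions:", few_shot_n)
for train_idx, test_idx in splits:
    print("train:", len(train_idx), "test:", len(test_idx))
```

Because 817 is odd, the two folds differ in size by one question.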

Hyperparameters. We provide the hyperparameter settings used in our experiments in Appendix C.

4.3. Experimental Baseline Comparisons

In addition to testing ACT on TruthfulQA, we compare it to several baseline approaches¹:

¹ RLHF underperforms 50-shot in-distribution prompting for TruthfulQA in (Bai et al., 2022). In (Bai et al., 2022; Menick et al., 2022), RLHF shows minimal improvement. Task-specific RLHF with $5\%$ samples remains uncertain.

Few-shot Prompting (FSP) is a way to increase truthfulness. Bai et al. (2022) find in-distribution $50$ -shot prompting a strong baseline on TruthfulQA, compared to context distillation and RLHF. Since the choice of prompting strategy is orthogonal to the activation steering method, we compare few-shot prompting with and without our method.

Instruction Fine-tuning (IFT) (Chung et al., 2022; Wang et al., 2022) enhances truthfulness by fine-tuning language models with task-specific instructions. We study how our method improves truthfulness in IFT models, including Alpaca (Taori et al., 2023) and Vicuna (Chiang et al., 2023) (IFT’ed from LLaMA-7B) and LLaMA-2-Chat (Touvron et al., 2023b) (IFT’ed from LLaMA 2-7B).

Following Li et al. (2024), we evaluate FSP and ITI in few-shot scenarios. Additionally, we compare against CCS, ITI, RepE, and Mean-Centring, as discussed in subsection 2.2, using 2-fold cross-validation on the full TruthfulQA.

4.4. Experimental Results²

² The original GPT-judge and GPT-info models from (Lin et al., 2021) were retired by OpenAI. We used davinci-002, OpenAI’s recommended alternative. Consequently, the True and True*Info metric values differ from those reported in (Li et al., 2023).

In Table 1, we compare our method with baselines in two different scenarios. In the few-shot setting³, ACT improved the True*Info metric by 70% over the baseline (LLaMA-7B). Against ITI (Baseline + ITI), the improvement is 37%. We also confirmed the orthogonality of ACT and Few-shot Prompting (FSP): ACT combined with FSP shows an 18% increase over FSP alone. The CE and KL results indicate that ACT obtains better performance with less intervention than ITI (lower CE and KL) while maintaining informativeness. In the full data setting, we compared different steering methods, including random steering, CCS, RepE, Mean-Centring, and ITI, as mentioned in subsection 2.2. We conducted a grid search for the optimal hyperparameters for each direction separately. ACT improved the True*Info metric by 83% over the baseline (LLaMA-7B) and 34% over the best comparative method, Mean-Centring. These observations demonstrate that ACT can enhance model performance with efficient use of intervention strategies.

³ Due to the very limited number of training samples for each cluster (sometimes only one or two samples), we performed upsampling. We use the last 10% of tokens from answers for clustering and probe training, while in the full data setting, only the final token is used.
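The relative improvements quoted above can be recomputed from the True*Info column of Table 1. The sketch below does so; the dictionary keys are shorthand introduced here, while the numbers are copied from the table.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


# True*Info values copied from Table 1.
few_shot = {"baseline": 23.0, "iti": 28.6, "act": 39.1, "fsp": 39.5, "fsp_act": 46.6}
full_data = {"baseline": 23.1, "mean_centring": 31.6, "act": 42.3}

gains = {
    "few-shot ACT vs baseline": relative_gain(few_shot["act"], few_shot["baseline"]),
    "few-shot ACT vs ITI": relative_gain(few_shot["act"], few_shot["iti"]),
    "FSP + ACT vs FSP": relative_gain(few_shot["fsp_act"], few_shot["fsp"]),
    "full-data ACT vs baseline": relative_gain(full_data["act"], full_data["baseline"]),
    "full-data ACT vs Mean-Centring": relative_gain(full_data["act"], full_data["mean_centring"]),
}
for name, value in gains.items():
    print(f"{name}: {value:.0f}%")
```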

In Table 2, we compare the results of IFT’ed models and pre-trained models with and without ACT. We find that IFT effectively reduces hallucination issues. Results show that ACT interventions improve True*Info for both pre-trained and instruction fine-tuned models. This indicates that ACT is orthogonal to IFT methods and can enhance performance in conjunction with them.

4.5. Human Evaluation

In addition to automated metrics, human evaluations are conducted to validate the effectiveness of ACT. Our evaluation panel consisted of ten experts from diverse disciplines, including linguistics, computer science, and domain-specific fields relevant to the generated content. This multidisciplinary approach ensured a comprehensive and well-rounded assessment of ACT’s performance. The results of human evaluations are shown in Table 3.

Table 3. Comparison of GPT-Judge and human evaluation scores

Model	TRUE	Human Evaluation
LLaMA	24.0	23.4 (±3.8)
LLaMA + ACT	58.0	47.9 (±5.3)
LLaMA2-Chat	61.8	57.1 (±4.5)
LLaMA2-Chat + ACT	73.3	71.1 (±6.1)

These evaluations confirm the utility of our metrics for assessing model performance differences across a broad set of samples. Feedback from evaluators is crucial to validating the effectiveness of ACT. More details of human evaluation can be found in Appendix B.

5. Analysis

5.1. Analysis of Diverse Steering Vectors

Firstly, we present a detailed analysis of the clustering characteristics observed in the steering vectors derived from our experiments with the LLaMA-7B and LLaMA 2-7B models on the TruthfulQA benchmark. Utilizing t-SNE visualization, we identified distinct clustering patterns for steering vectors corresponding to six different categories of hallucinations. For instance, the steering vectors of confusion-related categories (Confusion:People, Confusion:Other) were found to be more closely aligned, while the steering vectors of indexical-error-related categories and logical-falsehood-related categories exhibited different clustering patterns. This forms a key motivation for our proposed diverse steering vectors, enabling customized interventions for various categories of hallucinations.

In Figure 2, we examine the effects of training data volume and cluster number on ACT performance. Analysis reveals that ACT boosts the baseline’s performance effectively, even when using minimal data. Additionally, as the volume of training data increases, generating multiple steering vectors through clustering leads to further performance gains. This underscores the effectiveness of utilizing diverse steering vectors for performance enhancement.

5.2. Ablation Studies

Table 4. Ablation experiment. Comparing individual components of ACT with baseline using two-fold cross-validation.

Model	Open-ended Generation(%)			Multiple-Choice(%)
Model	BLEURT	True	True * Info	MC1	MC2
LLaMA-7B	32.5	24.0	23.1	25.3	40.1
+ Single steering	35.5	29.3	27.6	27.7	42.3
+ Adaptive intensity	37.0	31.3	29.7	28.3	44.0
+ Diverse steering	51.1	54.0	40.4	28.6	45.0
+ ACT	55.3	58.0	42.3	28.8	45.2

We conduct ablation studies on the TruthfulQA benchmark using the LLaMA-7B model to evaluate ACT, with the results presented in Table 4. Here, "+ Single steering" is consistent with ITI. "+ Adaptive intensity" only uses Adaptive Steering Intensity Control (ASIC). "+ Diverse steering" uses diverse probe-driven steering vectors for constant steering intensity during inference. We observe that both diverse steering and adaptive intensity enhance truthfulness compared to the baselines, with diverse steering showing the most pronounced improvements in the open-ended generation task.

5.3. Results across Diverse Hallucinations Categories

TruthfulQA is split into 38 subcategories, encompassing a wide range of hallucination-prone topics such as misconceptions, stereotypes, historical inaccuracies, the Mandela effect, and others. In Figure 4, we plot the true*informative scores for all subcategories compared to the baseline without intervention. We observe that our method improves truthfulness consistently across these diverse hallucination categories, demonstrating its effectiveness in mitigating various types of hallucinations.

5.4. Computational Efficiency

When analyzing computational efficiency, we consider the time complexity of each step during inference for a sequence of length $n$ .

According to Equation 1, the time complexity of the standard multi-head attention mechanism for a given layer during inference is $O(Hn^{2}D)$ , where $D$ is the feature dimensionality. This complexity arises from the computation of pairwise attention scores for each element in the sequence across all heads. According to Equation 2, ACT introduces a logistic regression on the last token of the sequence, incurring only an additional overhead of $O(CHD)$ , which is constant with respect to the sequence length $n$ .
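To give a sense of scale, the sketch below compares the two leading-order operation counts for a few sequence lengths. The values of $H$ , $D$ , and $C$ are assumed for illustration and are not taken from the paper; constants hidden by the $O(\cdot)$ notation are ignored.

```python
def attention_cost(n, heads, head_dim):
    """Leading-order cost of pairwise attention scores: H * n^2 * D."""
    return heads * n * n * head_dim


def act_probe_cost(clusters, heads, head_dim):
    """Cost of evaluating C linear probes on H heads at one token: C * H * D."""
    return clusters * heads * head_dim


H, D, C = 32, 128, 3      # assumed values for illustration
for n in (128, 512, 2048):
    ratio = act_probe_cost(C, H, D) / attention_cost(n, H, D)
    print(f"n={n:5d}  probe/attention cost ratio = {ratio:.2e}")
```

The ratio simplifies to $C/n^{2}$ , so the relative cost of the probes shrinks quadratically as the sequence grows.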

Table 5. Inference time comparison between LLaMA 7B and LLaMA 7B + ACT on the TruthfulQA dataset.

Model	Inference Time (min)
LLaMA 7B	18.16
LLaMA 7B + ACT	18.53

Additionally, we run practical tests on the TruthfulQA dataset using a single NVIDIA A100 GPU to compare the inference times of the model with and without ACT, averaging the results over three runs. As shown in Table 5, ACT adds about 2% overhead (18.53 vs. 18.16 minutes), indicating that ACT has minimal impact on real-time applications.
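The relative overhead follows directly from the two timings in Table 5, as the short calculation below shows.

```python
# Inference times in minutes, copied from Table 5.
baseline_min = 18.16
act_min = 18.53

overhead_pct = 100.0 * (act_min - baseline_min) / baseline_min
print(f"relative overhead: {overhead_pct:.2f}%")
```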

5.5. Generalization of ACT beyond TruthfulQA

To evaluate the generalization capability of ACT beyond the TruthfulQA dataset, we apply the steering vectors and hyperparameters learned from TruthfulQA to two real-world truth-related datasets: Natural Questions (Kwiatkowski et al., 2019) and MMLU (Hendrycks et al., 2020).

Table 6. Generalization results of ACT on Natural Questions and MMLU.

Model	Natural Questions	MMLU
LLaMA-7B	50.6	35.0
LLaMA-7B + ACT	52.5	36.9

The Natural Questions dataset consists of 3,610 real Google queries with annotated answers, providing a realistic setting for truthfulness evaluation. MMLU, on the other hand, is a benchmark covering 57 subjects across a wide range of domains. Both benchmarks differ from TruthfulQA, making them suitable for evaluating out-of-distribution generalization.

For Natural Questions, we follow the evaluation protocol of Li et al. (2024). For MMLU, we use the standardized evaluation metric (Hendrycks et al., 2020).

As shown in Table 6, ACT improves over the baseline on both datasets, highlighting its effectiveness and generalizability in real-world scenarios.

5.6. Scalability of ACT across Different Model Sizes

In the full-data setting, as model size increases, responses such as "I have no comments" become more common, leading to a decrease in the Informative metric. Consequently, activation steering methods do not scale effectively beyond 7B, which aligns with the results reported by Li et al. (2024) on GitHub (https://github.com/likenneth/honest_llama/blob/master/results.md).

However, we find that applying Few-shot Prompting (FSP) can mitigate this scaling issue. Because ACT and FSP are orthogonal, as validated in Section 4.4, we evaluate FSP with and without ACT across models of varying sizes (7B, 13B, 33B, 65B). The results in Table 7 show that adding ACT improves truthfulness for all model sizes.

These observations suggest that while activation steering methods face scaling challenges in larger models, combining ACT with FSP offers a practical approach to effectively enhance truthfulness across a range of model sizes.

Table 7. Scalability of ACT across different model sizes. Comparing the performance of different sizes of LLaMA models when combined with ACT in a few-shot setting.

Model	Open-ended Generation (%)			Multiple-Choice (%)	
	BLEURT	True	True*Info	MC1	MC2
LLaMA-7B	49.1	43.2	39.5	35.1	50.7
+ ACT	57.3	54.2	46.6	35.5	52.3
LLaMA-13B	59.7	51.3	43.4	39.1	55.1
+ ACT	69.6	67.0	46.0	41.4	59.1
LLaMA-33B	62.9	52.0	42.8	41.9	58.6
+ ACT	71.9	65.2	49.6	44.2	62.3
LLaMA-65B	68.8	58.1	48.8	45.5	62.9
+ ACT	76.1	72.3	50.4	46.3	64.7

6. Limitations

While ACT has achieved significant performance improvements on the TruthfulQA benchmark, its applicability in real-world chat settings involving multi-turn conversations has not been fully explored. In addition, the trade-off between truthfulness and helpfulness is important. Whether ACT improves the truthfulness of an LLM at the cost of its helpfulness (e.g., the smoothness of generated text) remains a question for future work.

7. Conclusion

We propose ACT, a tuning-free method designed to improve the truthfulness of large language models (LLMs). ACT utilizes diverse truthfulness-related steering vectors to shift activations toward more truthful directions during inference, without requiring additional fine-tuning, and adaptively controls steering intensity based on the content’s inherent truthfulness. Empirical evaluations show that ACT significantly enhances truthfulness in various LLMs on the TruthfulQA benchmark. By addressing the gap between LLMs’ understanding and expression of truthfulness, ACT marks a promising advancement in producing more reliable and accurate AI-generated content.

Acknowledgements.

This work was supported by the National Natural Science Foundation of China (62402017, U23A20468), Beijing Natural Science Foundation (L244063), Xuzhou Scientific Technological Projects (KC23143), Peking University Medicine plus X Pilot Program-Key Technologies R&D Project (2024YXXLHGG007). Junyi Gao acknowledges the receipt of studentship awards from the Health Data Research UK-The Alan Turing Institute Wellcome PhD Programme in Health Data Science (Grant Ref: 218529/Z/19/Z).

References

Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Alain and Bengio (2016) Guillaume Alain and Yoshua Bengio. 2016. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644 (2016).
Bai et al. (2022) Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022).
Bau et al. (2020a) David Bau, Hendrik Strobelt, William Peebles, Jonas Wulff, Bolei Zhou, Jun-Yan Zhu, and Antonio Torralba. 2020a. Semantic photo manipulation with a generative image prior. arXiv preprint arXiv:2005.07727 (2020).
Bau et al. (2020b) David Bau, Jun-Yan Zhu, Hendrik Strobelt, Agata Lapedriza, Bolei Zhou, and Antonio Torralba. 2020b. Understanding the role of individual units in a deep neural network. Proceedings of the National Academy of Sciences 117, 48 (2020), 30071–30078.
Belinkov (2016) Yonatan Belinkov. 2016. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics (2016), 1–12.
Belrose et al. (2023) Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. 2023. LEACE: Perfect linear concept erasure in closed form. arXiv preprint arXiv:2306.03819 (2023).
Brown et al. (2023) Davis Brown, Charles Godfrey, Cody Nizinski, Jonathan Tu, and Henry Kvinge. 2023. Robustness of edited neural networks. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models.
Burns et al. (2022) Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt. 2022. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827 (2022).
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. 2023. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
Chung et al. (2022) Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416 (2022).
Dathathri et al. (2019) Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. 2019. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164 (2019).
Dhuliawala et al. (2023) Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495 (2023).
Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024).
Elhage et al. (2022) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. 2022. Toy models of superposition. arXiv preprint arXiv:2209.10652 (2022).
Elhage et al. (2021) N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, Y Bai, A Chen, T Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread (2021).
Gandikota et al. (2023) Rohit Gandikota, Joanna Materzynska, Jaden Fiotto-Kaufman, and David Bau. 2023. Erasing concepts from diffusion models. arXiv preprint arXiv:2303.07345 (2023).
Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300 (2020).
Ilharco et al. (2022) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2022. Editing models with task arithmetic. arXiv preprint arXiv:2212.04089 (2022).
Jorgensen et al. (2023) Ole Jorgensen, Dylan Cope, Nandi Schoots, and Murray Shanahan. 2023. Improving activation steering in language models with mean-centring. arXiv preprint arXiv:2312.03813 (2023).
Joshi et al. (2023) Nitish Joshi, Javier Rando, Abulhair Saparov, Najoung Kim, and He He. 2023. Personas as a way to model truthfulness in language models. arXiv preprint arXiv:2310.18168 (2023).
Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. 2022. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221 (2022).
Kleindessner et al. (2023) Matthäus Kleindessner, Michele Donini, Chris Russell, and Muhammad Bilal Zafar. 2023. Efficient fair PCA for fair representation learning. In International Conference on Artificial Intelligence and Statistics. PMLR, 5250–5270.
Koyejo and Li (2024) Sanmi Koyejo and Bo Li. 2024. Towards Trustworthy Large Language Models. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, WSDM 2024, Merida, Mexico, March 4-8, 2024, Luz Angelica Caudillo-Mata, Silvio Lattanzi, Andrés Muñoz Medina, Leman Akoglu, Aristides Gionis, and Sergei Vassilvitskii (Eds.). ACM, 1126–1127. doi:10.1145/3616855.3636454
Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. 2019. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7 (2019), 453–466.
Larsen et al. (2016) Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. 2016. Autoencoding beyond pixels using a learned similarity metric. In International conference on machine learning. PMLR, 1558–1566.
Li et al. (2023) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2023. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv preprint arXiv:2306.03341 (2023).
Li et al. (2024) Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Inference-time intervention: Eliciting truthful answers from a language model. Advances in Neural Information Processing Systems 36 (2024).
Lin et al. (2021) Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958 (2021).
Ling et al. (2021) Huan Ling, Karsten Kreis, Daiqing Li, Seung Wook Kim, Antonio Torralba, and Sanja Fidler. 2021. Editgan: High-precision semantic image editing. Advances in Neural Information Processing Systems 34 (2021), 16331–16345.
Ma et al. (2023) Liantao Ma, Chaohe Zhang, Junyi Gao, Xianfeng Jiao, Zhihao Yu, Yinghao Zhu, Tianlong Wang, Xinyu Ma, Yasha Wang, Wen Tang, et al. 2023. Mortality prediction with adaptive feature importance recalibration for peritoneal dialysis patients. Patterns 4, 12 (2023).
Meng et al. (2022) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems 35 (2022), 17359–17372.
Menick et al. (2022) Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. 2022. Teaching language models to support answers with verified quotes. arXiv preprint arXiv:2203.11147 (2022).
Nori et al. (2023) Harsha Nori, Nicholas King, Scott Mayer McKinney, Dean Carignan, and Eric Horvitz. 2023. Capabilities of gpt-4 on medical challenge problems. arXiv preprint arXiv:2303.13375 (2023).
Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. 2015. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015).
Ravfogel et al. (2022) Shauli Ravfogel, Francisco Vargas, Yoav Goldberg, and Ryan Cotterell. 2022. Kernelized Concept Erasure. arXiv preprint arXiv:2201.12191 (2022).
Sellam et al. (2020) Thibault Sellam, Dipanjan Das, and Ankur P Parikh. 2020. BLEURT: Learning robust metrics for text generation. arXiv preprint arXiv:2004.04696 (2020).
Shen et al. (2020) Yujun Shen, Jinjin Gu, Xiaoou Tang, and Bolei Zhou. 2020. Interpreting the latent space of gans for semantic face editing. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9243–9252.
Subramani et al. (2022) Nishant Subramani, Nivedita Suresh, and Matthew E Peters. 2022. Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124 (2022).
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. 2023. Alpaca: A Strong, Replicable Instruction-Following Model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html (2023).
Tenney et al. (2019) Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. arXiv preprint arXiv:1905.05950 (2019).
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
Turner et al. (2023) Alex Turner, Lisa Thiergart, David Udell, Gavin Leech, Ulisse Mini, and Monte MacDiarmid. 2023. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248 (2023).
Upchurch et al. (2017) Paul Upchurch, Jacob Gardner, Geoff Pleiss, Robert Pless, Noah Snavely, Kavita Bala, and Kilian Weinberger. 2017. Deep feature interpolation for image content changes. In Proceedings of the IEEE conference on computer vision and pattern recognition. 7064–7073.
Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
Wang et al. (2024) Bo Wang, Jing Ma, Hongzhan Lin, Zhiwei Yang, Ruichao Yang, Yuan Tian, and Yi Chang. 2024. Explainable Fake News Detection with Large Language Model via Defense Among Competing Wisdom. In Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee (Eds.). ACM, 2452–2463. doi:10.1145/3589334.3645471
Wang et al. (2022) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2022. Self-Instruct: Aligning Language Model with Self Generated Instructions. arXiv preprint arXiv:2212.10560 (2022).
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35 (2022), 24824–24837.
White (2016) Tom White. 2016. Sampling generative networks. arXiv preprint arXiv:1609.04468 (2016).
Xu et al. (2023) Fangyuan Xu, Yixiao Song, Mohit Iyyer, and Eunsol Choi. 2023. A critical evaluation of evaluations for long-form question answering. arXiv preprint arXiv:2305.18201 (2023).
Zhang et al. (2024) Shaolei Zhang, Tian Yu, and Yang Feng. 2024. Truthx: Alleviating hallucinations by editing large language models in truthful space. arXiv preprint arXiv:2402.17811 (2024).
Ziegler et al. (2019) Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. 2019. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593 (2019).
Zou et al. (2023) Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. 2023. Representation engineering: A top-down approach to ai transparency. arXiv preprint arXiv:2310.01405 (2023).

Appendix A Details of Automated Metrics

We use the following automated metrics for evaluation:

• MC1 (Single-true) (Lin et al., 2021): Given a question and 4–5 answer choices, select the only correct answer. The model’s selection is the answer choice to which it assigns the highest log-probability of completion following the question, independent of the other answer choices. The score is the simple accuracy across all questions.

• MC2 (Multi-true) (Lin et al., 2021): Given a question and multiple true/false reference answers, the score is the normalized total probability assigned to the set of true answers.

• BLEURT (Sellam et al., 2020): BLEURT is used to compare the model’s answer to each of the true and false reference answers. The score is then given by [max similarity to a true reference answer] > [max similarity to a false reference answer].

• True (Lin et al., 2021): Uses GPT-judge, obtained by training a GPT-3 model end-to-end, to predict human evaluations of truthfulness. For example, if a model generates 100 answers and 80 of them are correct, the True % would be 80%.

• Info (Lin et al., 2021): Uses GPT-info, obtained by training a GPT-3 model end-to-end, to predict human evaluations of informativeness. For example, if a model generates 100 answers and 90 of them are informative, the Informative % would be 90%.

• True*Info (Lin et al., 2021): Captures the overall quality of answers, considering both truthfulness and informativeness. For example, if a model has a True % of 80% and an Informative % of 90%, the True*Informative % would be 72% (0.8 * 0.9 = 0.72).
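The per-question scoring rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the official TruthfulQA evaluation code: the function names are hypothetical, log-probabilities and BLEURT similarities are assumed to be precomputed, and True*Info follows the product form used in the example above.

```python
import math
from typing import Sequence


def mc1(choice_logprobs: Sequence[float], correct_index: int) -> float:
    """1.0 if the highest log-probability choice is the single correct answer."""
    best = max(range(len(choice_logprobs)), key=lambda i: choice_logprobs[i])
    return 1.0 if best == correct_index else 0.0


def mc2(true_logprobs: Sequence[float], false_logprobs: Sequence[float]) -> float:
    """Normalized total probability mass assigned to the true reference answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)


def bleurt_acc(true_sims: Sequence[float], false_sims: Sequence[float]) -> float:
    """1.0 if the answer is closer to some true reference than to any false one."""
    return 1.0 if max(true_sims) > max(false_sims) else 0.0


def true_info(true_rate: float, info_rate: float) -> float:
    """Product of the True and Info rates, as in the 0.8 * 0.9 = 0.72 example."""
    return true_rate * info_rate
```

Averaging `mc1`, `mc2`, and `bleurt_acc` over all questions would give the dataset-level scores reported as percentages.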

Evaluation with GPT-judge and GPT-info is standard practice on TruthfulQA. While the results are close, it should be noted that the determinations of GPT-judge and GPT-info are not always reliable, as perfect evaluation is impossible. We do not observe that GPT-judge or GPT-info is biased toward any particular method.

Appendix B Details of Human Evaluation

In addition to automated metrics, human evaluations are conducted to validate the effectiveness of ACT, following the methodologies of Xu et al. (2023) and Joshi et al. (2023).

Untruthful information generated by LLM hallucination can be extremely harmful in web applications in domains such as medicine, STEM, law, and education, where high truthfulness is essential. These are also key domains covered by OpenAI’s TruthfulQA dataset.

Therefore, leveraging TruthfulQA’s diverse question types and hallucination categories, we identified five representative interdisciplinary topics (medicine, STEM, law, education, and linguistics) as key evaluation domains. For each domain, we carefully selected two domain experts with advanced qualifications (holding a master’s degree or higher) and extensive professional experience to conduct the evaluations.

The detailed instructions provided to the evaluators were as follows:

Instructions: Please carefully evaluate the answers generated by the model based on the following criteria:

• Determine whether the answer is factually correct. This involves checking the accuracy of the information provided and verifying it against reliable sources.

• Assess whether the answer contains useful information. This includes evaluating the relevance and applicability of the content in the given context, as well as its ability to provide meaningful insights or solutions to the posed questions.

Compensation: Each evaluator was compensated at a rate of $10 per hour for their time and effort.

Appendix C Hyperparameters

In this section, we provide the hyperparameters required to reproduce the experiments. For the 7B model, the experiments can be conducted using a single NVIDIA 3090 GPU.

Table 8. Hyperparameters for the Experiments.

Hyperparameter	Few-shot	Full-data
$\alpha$	15	12
$\beta$	0.1	0
Top-K heads	24	24
C	2	3

Appendix D Prompt Detail

Following the methodology described by Li et al. (2023), we provide the model with a preliminary "QA prompt" before it answers a question. This practice aligns with the protocols established in TruthfulQA and has been adopted across all comparative baseline methodologies. According to Lin et al. (2021), the QA prompt consists of trivia questions that differ in both style and content from those in TruthfulQA, aiming to prime the model for diverse question answering.

For eliciting head activations, this prompt is not used; we only use the formatted question and answer pair. As detailed in Table 1, we employ the QA prompt for the supervised fine-tuning baseline. Additionally, for the few-shot prompting baseline, we append $5\%$ of the samples from TruthfulQA after this prompt and before the question to be answered.
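The three prompt configurations described above can be sketched as follows. This is a hedged illustration only: the `Q:`/`A:` template, the function names, and the placeholder strings are assumptions, and the actual QA prompt text from Lin et al. (2021) is not reproduced here.

```python
from typing import Optional, Sequence, Tuple


def format_qa(question: str, answer: Optional[str] = None) -> str:
    """Format a question with its answer, or leave the answer open for generation."""
    if answer is None:
        return f"Q: {question}\nA:"
    return f"Q: {question}\nA: {answer}"


def build_prompt(
    question: str,
    qa_prompt: str = "",
    few_shot: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble: [QA prompt] + [few-shot TruthfulQA samples] + question to answer."""
    parts = []
    if qa_prompt:
        parts.append(qa_prompt.strip())
    # Few-shot samples go after the QA prompt and before the target question.
    parts.extend(format_qa(q, a) for q, a in few_shot)
    parts.append(format_qa(question))
    return "\n\n".join(parts)


# Eliciting head activations: only the formatted question-answer pair, no QA prompt.
probe_input = format_qa("<question>", "<answer>")

# Few-shot prompting baseline: QA prompt, then sampled pairs, then the question.
fsp_input = build_prompt(
    "<question>",
    qa_prompt="<QA prompt text>",
    few_shot=[("<sample question>", "<sample answer>")],
)
```

Keeping the probing input free of the QA prompt, as the text specifies, means the collected activations depend only on the question-answer pair itself.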

Appendix E Results of ACT on Llama-2-7B-Chat

In this section, we present results for two questions from each category on the TruthfulQA benchmark. We compare the performance of LLaMA-2-7B-Chat before and after applying ACT.

E.1. Advertising

E.2. Confusion: Other

E.3. Confusion: People

E.4. Confusion: Places

E.5. Conspiracies

E.6. Distraction

E.7. Economics

E.8. Education

E.9. Fiction

E.10. Finance

E.11. Health

E.12. History

E.13. Indexical Error: Identity

E.14. Indexical Error: Location

E.15. Indexical Error: Other

E.16. Indexical Error: Time

E.17. Language

E.18. Law

E.19. Logical Falsehood

E.20. Mandela Effect

E.21. Misconceptions

E.22. Misconceptions: Topical

E.23. Misinformation

E.24. Misquotations

E.25. Myths and Fairytales

E.26. Nutrition

E.27. Paranormal

E.28. Politics

E.29. Proverbs

E.30. Psychology

E.31. Religion

E.32. Science

E.33. Sociology

E.34. Statistics

E.35. Stereotypes

E.36. Subjective

E.37. Superstitions

E.38. Weather

Footnote 2 (Section 4.4, Experimental Results): The original GPT-judge and GPT-info models from Lin et al. (2021) were retired by OpenAI. We used davinci-002, OpenAI’s recommended alternative. Consequently, the True and True*Info metric values differ from those reported in Li et al. (2023).