Eliminating Position Bias of Language Models: A Mechanistic Approach


Ziqi Wang¹, Hanlin Zhang², Xiner Li³, Kuan-Hao Huang¹, Chi Han¹, Shuiwang Ji³, Sham M. Kakade², Hao Peng¹, Heng Ji¹
Abstract

Position bias has proven to be a prevalent issue in modern language models (LMs): models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Specifically, based on an analysis of retrieval-augmented question answering (QA), we find that causal attention generally causes models to favor distant content, while relative positional encodings such as RoPE (Su et al., 2024) favor nearby content. Further, our empirical study on object detection reveals that position bias is also present in vision-language models (VLMs). Based on these analyses, we propose to eliminate position bias caused by different input segment orders (e.g., options in LM-as-a-judge, retrieved documents in QA) in a training-free, zero-shot manner. Our method changes the causal attention to bidirectional attention between segments and uses model attention values, rather than the order provided in the input prompt, to decide the relative order of segments, thereby enabling Position-INvariant inferencE (PINE) at the segment level. By eliminating position bias, models achieve better performance and reliability in downstream tasks where position bias is widespread, such as LM-as-a-judge and retrieval-augmented QA. Notably, PINE is especially useful when adapting LMs to evaluate reasoning pairs: it consistently provides gains of 8 to 10 percentage points in most cases, and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench reasoning subset. Code is available at https://github.com/wzq016/PINE.

¹ University of Illinois Urbana-Champaign, ² Harvard University, ³ Texas A&M University. Contact: ziqiw9@illinois.edu

Figure 1: Motivating examples showing how position bias affects model outputs. Upper: LMs are asked to select the more helpful of two given responses and are prone to prefer the response positioned first. Lower left: LMs (Llama-3-8B-Instruct) are presented with 20 documents to answer a question, with only one document (the gold-standard document) containing the correct answer. The blue curve represents normal inference, while the red curve represents inference without inter-document attention (RoPE position encodings are kept; a concrete implementation is shown in the middle of Figure 2). The height changes of the yellow and orange areas reflect the position bias introduced by causal attention and RoPE, respectively: causal attention generally favors distant content, whereas RoPE prefers nearby content. Lower right: We insert a real-world image into a large black background image at different positions and prompt VLMs (Fuyu-8B [5]) to compute the loss on the ground-truth token. The resulting low-loss (black) and high-loss (white) areas form clusters rather than being evenly distributed.

1 Introduction

Language models (LMs) [6, 9, 32, 1] demonstrate impressive performance in general language tasks such as dialogue [31], reasoning [9], and schema induction [18]. However, they tend to favor content at certain positions [43, 42, 35, 44, 8, 19], which harms complex reasoning [8], long-context understanding [19], and model-based evaluation [43]. For example, LMs perform better in retrieval-augmented QA when the document that contains the correct answer is presented at the beginning or at the end of the retrieved documents [19], and LMs tend to favor the first response when asked to compare the quality of two candidate responses [43], which hurts their reliability as evaluators (Figure 1, upper). Although several works have proposed solutions to the problem [27, 7, 11, 13, 44], they are ad hoc and do not fundamentally tackle the challenge. We seek to understand the causes of position bias from first principles and propose elimination strategies accordingly.

We start by analyzing two key components of state-of-the-art LMs, causal attention and Rotary Position Embedding (RoPE) [29], which enable models to understand the order of tokens so that LMs can generate meaningful outputs. Based on a concise logical proof, we argue that they are also the only two operations in Transformers [33] that inevitably introduce undesirable position bias: RoPE has been shown to have a recency bias [29, 24] due to the long-term decay of its attention weights as relative distance increases, and we hypothesize that causal attention forces unidirectional information propagation, leading models to pay more attention to distant content. To verify this, we conduct a simple analysis on retrieval-augmented QA [19] (Figure 1, lower left). The height changes of the yellow and orange areas reflect the position bias of causal attention and RoPE, respectively. The yellow area is mostly wider at the beginning, and the orange area generally becomes wider toward the end (except for the last data point), showing that causal attention generally tends to favor distant content, while RoPE generally tends to favor nearby content (see Section 3.2 for more discussion and Section 4.3 for more experiments supporting this hypothesis). To test generality, we further conduct a semi-synthetic task using vision-language models (VLMs), which also adopt causal attention and RoPE, and find that they likewise put more attention on certain positions of an image (Figure 1, lower right; see more VLM examples in Appendix A).

As a solution, we propose PINE, which eliminates position bias by manipulating causal attention and RoPE so that the model attends to different content equally. For tasks that contain position-agnostic segments whose order is not expected to affect results (e.g., retrieved documents in retrieval-augmented QA [19] and candidate responses in LM-as-a-judge [43]), we make the inter-segment attention bidirectional so that the attention mask treats all segments equally. Next, we compute similarity scores between segments and use them to re-sort segment positions, so that the positions given in the original input are discarded. The resulting approach enables Position-INvariant inferencE (PINE) in a training-free, zero-shot manner and operates on pre-determined segments.

To demonstrate the effectiveness of PINE, we benchmark it on two useful tasks where position bias is well known: LM-as-a-judge (RewardBench [17]), which prompts LMs to choose the more helpful of two given responses to a question (candidate responses as segments), and retrieval-augmented question answering [19], which prompts models to answer questions based on retrieved documents. We show that PINE eliminates position bias and thereby achieves better performance and reliability, since its outputs have no variance w.r.t. the relative segment order in the input. Notably, we find our method especially useful when LMs are used to assess reasoning pairs: PINE generally improves model performance by 8 to 10 percentage points in most cases and makes Llama-3-70B-Instruct perform even better than GPT-4-0125-preview on the RewardBench [17] reasoning subset.

Our contributions are threefold:

  • With a simple logical proof, we pinpoint the causes of position bias in transformers: causal attention and position encoding (Section 3.2). We also give intuitive explanations (Section 3.2) with supporting experimental results (Figure 1 and Section 4.3).

  • We propose a training-free approach dubbed PINE that can eliminate the position bias given text segments presumed to be position-invariant (Section 3.3).

  • We conduct experiments on two tasks where LMs show severe position bias, retrieval-augmented question answering and LM-as-a-judge, and find that PINE consistently improves model performance across model sizes, improves model reliability (since PINE achieves zero variance), and performs better than other possible design choices (Section 4).

2 Related Work

Position Encoding and Attention Mask. Position encoding is a key component of the Transformer architecture [33]. Researchers first used absolute position encodings [33, 14] and then switched to relative position encodings [29, 26, 2, 20]. Recently, researchers have found that integrating position encoding into the attention computation, instead of treating the two in isolation, gives LMs better language understanding capabilities [10, 21]. In the rest of the paper, position encoding refers to RoPE [29], as it is adopted in most modern LMs [2, 4]. The attention mask is another elementary component that affects the fundamental capabilities of LMs. Encoder-only models such as BERT [14] use bidirectional attention. Decoder-only models usually use unidirectional causal attention [25] due to the auto-regressive nature of language generation. Although several works use encoder-decoder models [26, 30] or prefix-LMs [30], which combine bidirectional and causal attention, these are difficult to scale up in training. In this paper, we investigate causal attention, as it is the choice of most modern LMs [2, 4].

Position Bias in LMs. Many works demonstrate the existence and significance of position bias in LMs [43, 42, 35, 44, 8, 19, 28]. The LM-as-a-judge task offers models two candidate responses to a question and asks them to select the more helpful one. It turns out that LMs have a primacy bias and tend to favor the first response [43]. Retrieval-augmented QA asks LMs to answer a question based on retrieved documents. [19, 24] find that LMs are more likely to answer correctly when the document that contains the correct answer is presented at the beginning or the end of the retrieved documents. In multiple-choice QA, where models are required to select the correct answer to a question from multiple candidate answers, [42] points out that models favor options at certain positions (e.g., prefer “A”). In in-context learning tasks, [40, 38] find that the order of in-context examples affects the final performance. Recently, several papers have sought to understand the nature of position bias through prompting [41] and calibration [12]. Our paper analyzes the phenomenon from the perspective of computation: to eliminate position bias, the computation must be invariant to segment order.

Position Bias Solutions in LMs. There are many solutions that mitigate position bias (e.g., data augmentation and training [13, 44], content re-sorting by attention value during inference [24], searching [39], and calibration under relatively strong assumptions [12]). Going one step further, some other solutions are designed to eliminate position bias. [34] outputs a compromise result, “tie”, when position bias occurs in the LM-as-a-judge task; however, a “tie” is still suboptimal because the model effectively refuses to give a prediction. [42, 43] permute the segments and average the results on classification tasks, which incurs an unacceptable $\mathcal{O}(k!)$ computational overhead when the number of segments $k$ is large. Unlike these approaches, we aim to eliminate position bias from the angle of the Transformer's computation flow while tackling the above issues.

Moreover, we find that several methods originally designed for other purposes (e.g., long-context understanding) have mathematical guarantees that intrinsically eliminate position bias [27, 7, 11] (Section 3.3). However, these methods perform poorly in settings that require language modeling (Section 4) and are therefore limited to certain tasks such as in-context learning classification [27, 7, 11]. In contrast, our method is training-free and is shown to be effective in tasks that require language modeling, such as LM-as-a-judge [43] and lost-in-the-middle [19].

3 Methodology

In this section, we propose an approach that can intrinsically eliminate position bias. We start with a running example that illustrates position bias, then analyze its causes, and end with our own approach, PINE.

3.1 Formulation

We take retrieval-augmented QA as an example, where current LMs’ performance may suffer greatly from position bias [19]. The task requires the model to answer a question based on a set of retrieved documents, only one of which contains the correct answer. A system prompt SYS for this task looks like this: “Write a high-quality one-sentence answer for the given question using only the provided search results (some of which might be irrelevant).” Given a question Q, “Who won the first Nobel Prize in Physics?”, and three retrieved documents $\mathcal{D}_1$, $\mathcal{D}_2$, and $\mathcal{D}_3$, we can formulate several different inputs, for example $[\textsc{SYS}|\textsc{Q}|\mathcal{D}_1|\mathcal{D}_2|\mathcal{D}_3]$ and $[\textsc{SYS}|\textsc{Q}|\mathcal{D}_2|\mathcal{D}_3|\mathcal{D}_1]$. We expect models to produce the same output for these inputs because $\mathcal{D}_1, \mathcal{D}_2, \mathcal{D}_3$ are position-agnostic input segments: their relative order is not supposed to affect the final result.
However, current LMs answer differently when presented with these different inputs and tend to answer correctly when the document that contains the answer is at the beginning or at the end of all documents [19]. The systematic differences in model outputs caused by the relative positions of position-agnostic input segments reflect the position bias of the model. Therefore, current LMs cannot conduct inter-segment position-invariant inference, and our goal is to make the inference invariant w.r.t. relative segment orders.
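The invariance requirement stated above can be made concrete with a small check. The following is a minimal sketch, not part of the paper's code: `build_prompts`, `is_order_invariant`, and the two toy "models" are hypothetical names introduced only for illustration, and a real LM call would replace the toy functions.

```python
from itertools import permutations


def build_prompts(system, question, documents):
    """Return one prompt string per ordering of the position-agnostic documents."""
    return ["|".join([system, question, *order]) for order in permutations(documents)]


def is_order_invariant(model, system, question, documents):
    """True if `model` returns the same output for every document ordering."""
    outputs = {model(prompt) for prompt in build_prompts(system, question, documents)}
    return len(outputs) == 1


# Toy stand-ins for an LM (assumptions for illustration only).
def first_doc_model(prompt):
    # Position-biased: always answers with whichever document comes first.
    return prompt.split("|")[2]


def shortest_doc_model(prompt):
    # Order-invariant: picks a document by content (length, then text) only.
    return min(prompt.split("|")[2:], key=lambda doc: (len(doc), doc))


docs = ["D1 text", "D2 longer text", "D3"]
print(is_order_invariant(first_doc_model, "SYS", "Q", docs))
print(is_order_invariant(shortest_doc_model, "SYS", "Q", docs))
```

With $k$ documents the check enumerates $k!$ prompts, so it is only practical as a diagnostic for small $k$.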

3.2 Causal Attention and RoPE Are The Cause of Position Bias

Feed-Forward Networks (FFNs), Query, Key, and Value (QKV) projections, and layer normalization in the Transformer architecture do not cause position bias, as they are invariant to relative segment positions. Rather, it is the attention computation that leads to position bias:

$$
\begin{aligned}
\mathbf{Q}_{\text{PE}} &= \text{PE}(\mathbf{Q}, \mathbf{pos}_{\mathbf{Q}}), \quad \mathbf{K}_{\text{PE}} = \text{PE}(\mathbf{K}, \mathbf{pos}_{\mathbf{K}}) \\
\mathbf{H} &= \text{Softmax}\left(\mathbf{Q}_{\text{PE}} \mathbf{K}_{\text{PE}}^{T} / \sqrt{d}\right) \odot \mathbb{1}_{\text{causal}} \mathbf{V}
\end{aligned}
\tag{1}
$$

where $\mathbf{Q}, \mathbf{K}, \mathbf{V} \in \mathbb{R}^{n \times d}$ are queries, keys, and values, PE denotes the position encoding (specifically RoPE), $\mathbf{pos}_{\mathbf{Q}}$ and $\mathbf{pos}_{\mathbf{K}}$ denote the positions of queries and keys, and $\mathbb{1}_{\text{causal}}$ denotes the causal attention mask. Eq. 1 reveals that (1) the PE function yields different representations for input segments if their relative order changes, thereby affecting the similarity score $\mathbf{Q}_{\text{PE}} \mathbf{K}_{\text{PE}}^{T}$ and the hidden states; and (2) $\mathbb{1}_{\text{causal}}$ yields different attention masks for the input segments if we change their positions, resulting in different hidden states. To achieve inter-segment position-invariant inference, $\mathbf{H}$ needs to remain the same regardless of segment order.

The effect of position encoding and causal attention. RoPE has been shown to have a recency bias due to the long-term decay of its attention weights [29, 24]. However, LMs have also been shown to have a primacy bias [19, 35]. Therefore, causal attention is likely what causes models to favor distant content. Figure 1 (lower left) supports this hypothesis. Intuitively, causal attention propagates distant content to nearby content. Nearby tokens therefore also contain information about distant tokens, potentially making models favor distant content after aggregation. Section 4.3 provides more experimental results supporting our hypothesis.

3.3 PINE: Inter-Segment Position-Invariant Inference via Bidirectional Attention

Figure 2: Previous work PCW [27] eliminates position bias by first masking all inter-segment attention and then assigning all segments the same positions. The attention matrix of the running example in Section 3.1 is at the left of the figure; the orange, blue (three shades), and green colors denote the system prompt (1 token), three different documents (2 tokens each), and decoded tokens (1 token), respectively. The number at $(i,j)$ in the figure, $p_{ij}$, denotes the position of token $j$ when computing the attention from query $\mathbf{q}_i$. In vanilla inference, $p_{ij}$ is the same for all $i$. Our experiments in Section 4 show that PCW brings a severe performance drop on tasks requiring language modeling capabilities and is thus a suboptimal design choice.
Figure 3: PINE: inter-segment position-invariant inference via bidirectional attention. The notation is the same as in Figure 2. PINE enables inter-segment bidirectional attention and then uses attention scores between segments to compute their similarities. Segments are then re-sorted by similarity: higher-similarity segments are placed in closer positions. The computation of “Similarity” is introduced in Section 3.3.

Technically, our goal is to obtain an inter-segment position-invariant hidden state $\mathbf{H}_{\text{PINE}}$, which does not change regardless of the segment order. We can mechanistically eliminate position bias by attending equally to all position-agnostic input segments. Previous work PCW [27] achieves this goal by masking inter-segment attention and assigning all segments the same position (Figure 2). However, masking all inter-segment attention loses contextual information and introduces out-of-distribution (OOD) operations, which harm model performance (Section 4). For example, models are never trained to compute attention with tokens sharing the same position.

Therefore, we propose PINE, an approach that uses bidirectional inter-segment attention and reassigns positions by similarity (computed from attention scores) to eliminate position bias (Figure 3). It is worth noting that our goal is to eliminate “inter” position bias among different segments rather than “intra” position bias within each segment. Eliminating the latter would lose the order information of tokens, and models could degenerate into bag-of-words models, which is not what we want. Accordingly, in PINE the intra-segment attention stays causal, while the inter-segment attention is bidirectional.

Bidirectional Attention. We first change the attention mask so that segments can attend to each other. Specifically, we make the inter-segment attention bidirectional but keep the intra-segment attention causal (the middle figure in Figure 3).
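The mask described above can be sketched in a few lines. This is an illustrative sketch only, assuming each token carries a segment label; the function name `build_pine_mask` and the `None` label for tokens outside position-agnostic segments (system prompt, decoded tokens) are hypothetical conventions introduced here, not taken from the released code.

```python
def build_pine_mask(segment_ids):
    """Return an n x n boolean mask; mask[i][j] is True if query i may attend to key j.

    segment_ids[t] is an integer id for tokens inside a position-agnostic
    segment, and None for every other token (system prompt, decoded tokens).
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Ordinary causal rule: a token sees itself and everything before it.
            causal = j <= i
            # Extra rule: tokens of different segments see each other in both directions.
            cross_segment = (
                segment_ids[i] is not None
                and segment_ids[j] is not None
                and segment_ids[i] != segment_ids[j]
            )
            mask[i][j] = causal or cross_segment
    return mask


# Layout of the running example: SYS (1 token), three documents (2 tokens each),
# and one decoded token.
layout = [None, 0, 0, 1, 1, 2, 2, None]
mask = build_pine_mask(layout)
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

Within a segment the mask stays lower-triangular, so token order inside each document is preserved; only the blocks between different documents are filled in.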

Re-sorting Positions by Attention Weights. It is challenging to assign positions under bidirectional attention masks. First, LMs are trained in a causal way, where each token can only attend to previous tokens. Therefore, it is necessary to manipulate positions so that each segment treats itself as the last segment (the diagonal of the rightmost panel in Figure 3). Second, we determine the positions of the other segments based on similarities. Specifically, we first compute the attention without RoPE: $\text{Sim}_{\text{token}} = \text{Softmax}(\mathbf{Q}\mathbf{K}^{T} / \sqrt{d})$, where $d$ is the hidden state dimension. Then, we obtain the similarity between segments by aggregation. For example, $\text{Sim}(\mathcal{D}_1, \mathcal{D}_2) = \sum_{i \in \mathcal{D}_1, j \in \mathcal{D}_2} \text{Sim}_{\text{token}}(i,j) / |\mathcal{D}_2|$. The length normalization prevents longer segments from receiving higher similarities. Lastly, we reassign positions by similarity, as shown in the rightmost part of Figure 3. To avoid confusion, we stress that we do not actually reorder tokens; we only assign them different positions.
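The similarity aggregation and position reassignment above can be illustrated as follows. This is a minimal pure-Python sketch under stated assumptions: all function names are hypothetical, the token-level similarities in the example are hand-written toy numbers rather than model outputs, and the sketch handles a single query segment, whereas an actual implementation would operate on attention tensors inside the model.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_similarity(Q, K):
    """Sim_token = Softmax(Q K^T / sqrt(d)), computed row by row without RoPE."""
    d = len(Q[0])
    return [
        softmax([sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K])
        for qi in Q
    ]


def segment_similarity(sim_token, seg_a, seg_b):
    """Sim(A, B): sum of token similarities from A to B, normalized by |B|."""
    return sum(sim_token[i][j] for i in seg_a for j in seg_b) / len(seg_b)


def order_for_query(sim_token, segments, query):
    """Other segments sorted from least to most similar, then the query segment last."""
    others = [name for name in segments if name != query]
    others.sort(key=lambda name: segment_similarity(sim_token, segments[query], segments[name]))
    return others + [query]


def assign_positions(order, segments, start):
    """Give consecutive positions to tokens following `order`; tokens are not moved."""
    positions, pos = {}, start
    for name in order:
        for tok in segments[name]:
            positions[tok] = pos
            pos += 1
    return positions


# Running example: token 0 is SYS; each document has two tokens.
segments = {"D1": [1, 2], "D2": [3, 4], "D3": [5, 6]}
sim = [[0.0] * 7 for _ in range(7)]
for i in segments["D1"]:
    for j in segments["D2"]:
        sim[i][j] = 0.1  # toy values: D1 attends weakly to D2
    for j in segments["D3"]:
        sim[i][j] = 0.2  # toy values: D1 attends more strongly to D3
order = order_for_query(sim, segments, "D1")
positions = assign_positions(order, segments, start=1)
print(order, positions)
```

From the viewpoint of $\mathcal{D}_1$, the more similar segment $\mathcal{D}_3$ is placed directly before $\mathcal{D}_1$, and $\mathcal{D}_1$ itself occupies the last positions, matching the rule that each segment treats itself as the last one.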

The rightmost part of Figure 3 contains $\mathbf{pos}_{\mathbf{Q}}$, $\mathbf{pos}_{\mathbf{K}}$, and the attention mask $\mathbb{1}_{\text{PINE}}$. Substituting them into Equation 1, we obtain $\mathbf{H}_{\text{PINE}}$. Regardless of the segment order in the input, PINE lets each segment attend to all other segments and uses similarities to decide their relative positions, yielding an $\mathbf{H}_{\text{PINE}}$ that is not affected by the segment order in the input. Therefore, PINE achieves inter-segment position-invariant inference in a training-free, zero-shot manner.

3.4 Discussion

Alternative options: re-sorting in reverse or using the order in the input. PINE places segments with higher similarity at closer relative positions. Another option is to place segments with higher similarity at more distant positions. Considering the recency bias brought by RoPE, the latter choice causes a conflict between the model's judgment and RoPE, whereas the former lets RoPE further amplify the model's judgment. Therefore, we believe the reverse option is not optimal, which is confirmed by our experiments in Section 4.4. A further option is to skip re-sorting and use the relative order provided in the input. For example, the hidden state of $\mathcal{D}_1$ becomes:

$$
\begin{aligned}
\mathbf{Q}_{\text{PE},\mathcal{D}_1} &= \text{PE}(\mathbf{Q}_{\mathcal{D}_1}, [6,7]) \\
\mathbf{K}'_{\text{PE}} &= \text{PE}([\mathbf{K}_{\textsc{SYS}}, \mathbf{K}_{\mathcal{D}_2}, \mathbf{K}_{\mathcal{D}_3}, \mathbf{K}_{\mathcal{D}_1}], \mathbf{pos}_{\text{seq}}) \\
\mathbf{H}_{\mathcal{D}_1} &= \text{Softmax}\left(\mathbf{Q}_{\text{PE},\mathcal{D}_1} (\mathbf{K}'_{\text{PE}})^{T} / \sqrt{d}\right) \mathbf{V}
\end{aligned}
\tag{2}
$$

where $\mathbf{pos}_{\text{seq}}$ denotes sequential positions, e.g., $[1,2,3,4,5,6]$. This option is more computationally efficient because it needs no re-sorting, but it cannot achieve inter-segment position invariance and performs worse than PINE (Section 4.4).

Inference Cost. PINE incurs additional computational overhead due to its extra operations. Specifically, PINE conducts one more step of attention computation, attention re-sorting, and position reassignment. In practice, the computational complexity of obtaining hidden states with PINE is $\mathcal{O}(n^{2}d + nk\log k)$, where $n$, $d$, and $k$ denote the input length, hidden state dimension, and number of input segments, respectively. Compared with the vanilla hidden state computation, which requires $\mathcal{O}(n^{2}d)$, PINE has an extra $\mathcal{O}(nk\log k)$ cost. The real computation cost is acceptable since $k$ is usually small (e.g., $k=2$ in the LM-as-a-judge task and $k=20$ in retrieval-augmented QA). Section 4.5 reports real-world wall time and memory cost.
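To get a feel for the size of the extra term, the two asymptotic expressions above can be compared numerically. This is a rough sketch only: it ignores constant factors, the base-2 logarithm is an arbitrary choice, and the values of $n$ and $d$ are hypothetical, so it does not predict measured wall time.

```python
import math


def pine_overhead_ratio(n, d, k):
    """Ratio of the extra n*k*log(k) term to the vanilla n^2*d term (constants ignored)."""
    vanilla = n * n * d
    extra = n * k * math.log2(k)
    return extra / vanilla


# k values follow the two tasks in the text; n and d are illustrative assumptions.
for k in (2, 20):
    print(k, pine_overhead_ratio(n=4000, d=4096, k=k))
```

The ratio simplifies to $k\log k / (nd)$, which is why the re-sorting term stays small whenever the number of segments is much smaller than the input length times the hidden dimension.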

Practical Implications. Although PINE eliminates output variance w.r.t. different input segment orders in principle, the practical implementation still shows a small variance due to numerical issues (Softmax operations are prone to producing equal, small-valued similarities). To ensure zero variance, one can add deterministic tie-breaking rules, such as using length or alphabetical order to determine the order in such cases.
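One way such a deterministic rule could look is sketched below. This is an assumption-laden illustration rather than the rule used in the paper: the function name `deterministic_order`, the rounding precision `ndigits`, and the choice to break ties by length and then by text are all hypothetical.

```python
def deterministic_order(similarities, texts, ndigits=6):
    """Sort segment names by similarity (least similar first), with deterministic ties.

    Similarities are rounded so that numerically indistinguishable values count
    as ties; ties are then broken by segment length and alphabetical order.
    """
    def key(name):
        return (round(similarities[name], ndigits), len(texts[name]), texts[name])

    return sorted(similarities, key=key)


texts = {"D1": "bbb", "D2": "aa", "D3": "ccc"}
sims = {"D1": 1e-9, "D2": 2e-9, "D3": 0.5}
sims_reordered = dict(reversed(list(sims.items())))  # same values, different input order
print(deterministic_order(sims, texts))
print(deterministic_order(sims_reordered, texts))
```

Because the sort key depends only on segment content and rounded similarity, the result does not depend on the order in which segments were supplied.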

4 Experiment

In the following experiments, we mainly show two aspects of PINE: (1) PINE achieves superior performance without variance when models suffer from position bias, and (2) PINE performs best among the design choices that aim to eliminate position bias intrinsically.

Table 1: The portion of data (%) on which models exhibit position bias in RewardBench, i.e., models change their answers after the order of the candidate responses is swapped. Subsets where more than 25% of the data triggers position bias are colored cyan in the original and marked with * here.

| Model | Size | Chat | Chat-Hard | Safety | Reasoning | Avg. |
|---|---|---|---|---|---|---|
| LLaMa-3-Instruct | 8B | 10.3 | 21.5 | 11.4 | 27.6* | 17.7 |
| | 70B | 3.6 | 16.0 | 5.8 | 15.2 | 10.2 |
| Qwen-1.5-Chat | 1.8B | 33.5* | 37.9* | 24.7 | 13.3 | 27.4 |
| | 4B | 48.0* | 38.6* | 57.4* | 12.7 | 39.2 |
| | 7B | 17.0 | 20.6 | 10.9 | 26.5* | 18.8 |
| | 32B | 7.8 | 20.0 | 9.6 | 26.4* | 16.0 |
| | 72B | 10.9 | 22.6 | 9.6 | 24.7 | 17.0 |
| | 110B | 8.7 | 16.0 | 11.5 | 23.5 | 14.9 |
Table 2: Main results on RewardBench. Vanilla denotes normal inference, (GT at A) means the ground-truth chosen response is presented first, and (GT at B) means it is presented second. PINE consistently improves LM performance across different model sizes. Consistent with Table 1, the subsets with severe position bias are colored cyan (marked here with *). PINE generally improves performance on these subsets by a large margin, which is consistent with our motivation and goal. Values in parentheses in the Avg. column give the change of PINE relative to Vanilla.

Model             Size   Method             Chat    Chat-Hard  Safety  Reasoning  Avg.
LLaMa-3-Instruct  8B     Vanilla (GT at A)  90.1    35.2       64.6    80.3       67.5
LLaMa-3-Instruct  8B     Vanilla (GT at B)  85.3    48.7       65.3    66.0       66.3
LLaMa-3-Instruct  8B     Vanilla            85.3    41.6       67.0    65.3*      64.8
LLaMa-3-Instruct  8B     PINE               85.6    41.5       66.5    73.4*      66.7 (+1.9)
LLaMa-3-Instruct  70B    Vanilla (GT at A)  98.6    52.0       73.6    87.8       78.0
LLaMa-3-Instruct  70B    Vanilla (GT at B)  93.9    62.1       69.8    80.3       76.5
LLaMa-3-Instruct  70B    Vanilla            97.4    58.3       69.6    78.9       76.0
LLaMa-3-Instruct  70B    PINE               96.9    57.4       67.7    87.6       77.4 (+1.4)
Qwen-1.5-Chat     1.8B   Vanilla (GT at A)  31.7    30.0       40.3    43.3       36.3
Qwen-1.5-Chat     1.8B   Vanilla (GT at B)  69.4    72.6       65.7    57.2       66.2
Qwen-1.5-Chat     1.8B   Vanilla            49.7*   50.9*      52.0    48.4       50.3
Qwen-1.5-Chat     1.8B   PINE               30.0*   59.9*      61.4    60.1       52.9 (+2.6)
Qwen-1.5-Chat     4B     Vanilla (GT at A)  32.8    24.8       17.4    42.8       29.5
Qwen-1.5-Chat     4B     Vanilla (GT at B)  86.6    74.5       82.9    62.3       76.6
Qwen-1.5-Chat     4B     Vanilla            58.9*   48.7*      50.9*   54.1       53.1
Qwen-1.5-Chat     4B     PINE               73.0*   45.2*      53.7*   61.0       58.2 (+5.1)
Qwen-1.5-Chat     7B     Vanilla (GT at A)  85.5    35.9       62.4    62.1       61.4
Qwen-1.5-Chat     7B     Vanilla (GT at B)  77.1    47.4       59.5    54.3       59.6
Qwen-1.5-Chat     7B     Vanilla            77.5    44.2       62.6    59.3*      60.9
Qwen-1.5-Chat     7B     PINE               85.8    38.7       58.6    63.0*      61.5 (+0.6)
Qwen-1.5-Chat     32B    Vanilla (GT at A)  93.6    47.7       77.1    78.3       74.2
Qwen-1.5-Chat     32B    Vanilla (GT at B)  91.9    52.2       81.6    73.6       74.8
Qwen-1.5-Chat     32B    Vanilla            92.7    51.2       80.5    66.8*      72.8
Qwen-1.5-Chat     32B    PINE               93.0    49.8       79.7    76.7*      74.8 (+2.0)
Qwen-1.5-Chat     72B    Vanilla (GT at A)  95.7    59.0       80.8    83.0       79.6
Qwen-1.5-Chat     72B    Vanilla (GT at B)  89.0    46.5       73.7    68.7       69.5
Qwen-1.5-Chat     72B    Vanilla            94.0    51.4       77.8    68.2       72.8
Qwen-1.5-Chat     72B    PINE               93.9    46.1       78.2    69.0       71.8 (-1.1)
Qwen-1.5-Chat     110B   Vanilla (GT at A)  98.6    70.5       89.6    90.0       87.2
Qwen-1.5-Chat     110B   Vanilla (GT at B)  91.1    59.2       79.5    73.0       75.7
Qwen-1.5-Chat     110B   Vanilla            96.2    66.7       83.7    78.0       81.1
Qwen-1.5-Chat     110B   PINE               95.5    64.8       85.0    86.2       82.9 (+1.7)

4.1 Settings

Tasks, Data and Configs. We select two representative tasks that exhibit position bias: LM-as-a-judge [43] and retrieval-augmented question-answering [19]. The former is known for recency bias, whereas the latter shows both primacy and recency bias. To systematically test the LM-as-a-judge task, we benchmark our method on the 23 datasets in RewardBench [16] (Apache-2.0 license, https://github.com/allenai/reward-bench), which can be categorized into four types: Chat, Chat-Hard, Safety, and Reasoning. LMs are asked to select the more helpful of two given chat responses to a question. We use the official data split, prompts, and evaluation scripts to ensure reproducibility. To benchmark retrieval-augmented question-answering, we follow the settings of [19] and use its prompts, data, and evaluation scripts (MIT license, https://github.com/nelson-liu/lost-in-the-middle): only one of the retrieved documents (10 or 20 in total) contains the ground-truth answer for the given question. We list prompts in Appendix B. We use LLaMa-3-Instruct models [2] and Qwen-1.5-Chat models [4] for experiments. More details can be found in Appendix B. Qualitative examples of PINE can be found in Appendix C.

Baselines. The goal of PINE is to eliminate position bias intrinsically during inference. Therefore, we choose baselines that follow the same design principle: (1) Vanilla inference. (2) Vanilla inference with no inter-segment attention (NIA for short, i.e., the middle panel of Figure 2): later segments do not attend to earlier ones. (3) Parallel Context Window (PCW, rightmost panel of Figure 2) [27]: PCW extends baseline (2) by manipulating the positions of segments so that all segments share the same positions. (4) Structured Prompting (SP, a variant of PCW) [11]: SP extends (3) by scaling the attention between decoded tokens and input segments down to $\frac{1}{N}$ to address the perplexity explosion problem in PCW. Based on the analysis in Section 3.3, (1) and (2) are not inter-segment position invariant, whereas (3) and (4) are.
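
As a rough illustration of how the masked baselines differ, the following sketch builds toy attention masks for vanilla causal attention and for NIA, together with PCW-style shared position ids. The token layout (prefix, segments, suffix), the function names, and the choice to start the suffix right after the longest segment are assumptions made for illustration; this is not the implementation used in the experiments.

```python
def segment_ids(prefix_len, segment_lens, suffix_len):
    """Label each token: -1 for prefix/suffix tokens, k for tokens of segment k."""
    ids = [-1] * prefix_len
    for k, n in enumerate(segment_lens):
        ids += [k] * n
    return ids + [-1] * suffix_len


def causal_mask(n):
    """mask[i][j] is True when query token i may attend to key token j."""
    return [[j <= i for j in range(n)] for i in range(n)]


def nia_mask(seg):
    """Causal mask with all attention between *different* segments removed."""
    n = len(seg)
    mask = causal_mask(n)
    for i in range(n):
        for j in range(n):
            if seg[i] >= 0 and seg[j] >= 0 and seg[i] != seg[j]:
                mask[i][j] = False
    return mask


def pcw_position_ids(prefix_len, segment_lens, suffix_len):
    """Every segment restarts at the same position (assumed PCW-style layout)."""
    pos = list(range(prefix_len))
    for n in segment_lens:
        pos += list(range(prefix_len, prefix_len + n))
    start = prefix_len + max(segment_lens)  # suffix follows the longest segment
    return pos + list(range(start, start + suffix_len))


seg = segment_ids(prefix_len=1, segment_lens=[2, 2], suffix_len=1)
mask = nia_mask(seg)
positions = pcw_position_ids(prefix_len=1, segment_lens=[2, 3], suffix_len=1)
```

In this toy layout, a token of the second segment can still see the prefix and its own segment under NIA, but not the first segment. With shared position ids, swapping two segments leaves the position ids seen by each segment unchanged, which is the intuition behind (3) and (4) being inter-segment position invariant.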

Table 3: Baseline performance on RewardBench. PINE achieves superior performance to the baselines, scoring 4.8 and 4.7 percentage points higher than the best-performing baseline on the two models. Values in parentheses in the Avg. column give this margin.
Model               Method         Chat   Chat-Hard  Safety  Reasoning  Avg.
LLaMa-3 8B-Instruct NIA (GT at A)  81.0   40.7       59.7    43.7       56.3
LLaMa-3 8B-Instruct NIA (GT at B)  81.0   49.7       65.8    66.7       65.8
LLaMa-3 8B-Instruct NIA            80.9   46.7       64.0    55.9       61.9
LLaMa-3 8B-Instruct PCW            78.6   46.8       64.8    56.5       61.7
LLaMa-3 8B-Instruct SP             79.6   43.3       65.0    55.4       60.8
LLaMa-3 8B-Instruct PINE           85.6   41.5       66.5    73.4       66.7 (+4.8)
Qwen-1.5 7B-Chat    NIA (GT at A)  67.7   57.2       59.6    60.7       61.3
Qwen-1.5 7B-Chat    NIA (GT at B)  67.9   35.9       61.0    44.1       52.2
Qwen-1.5 7B-Chat    NIA            74.9   43.5       57.4    51.4       56.8
Qwen-1.5 7B-Chat    PCW            67.2   42.0       58.3    53.4       55.2
Qwen-1.5 7B-Chat    SP             69.4   41.8       58.0    52.4       55.4
Qwen-1.5 7B-Chat    PINE           85.8   38.7       58.6    63.0       61.5 (+4.7)

4.2 Results on LM-as-a-judge

Table 1 shows the statistics of position bias in RewardBench with different models. Position bias is quite common in RewardBench and can be up to 48.0%. Although the table shows that larger models generally have less position bias, it can still affect around 10% of the data on average. Table 2 shows the main results on RewardBench. We experiment with Llama-3 and Qwen-1.5 across different model sizes. The position of the ground-truth chosen option is randomly shuffled; therefore, the expected accuracy of random guessing is 50%. First, the first two rows of each model block reveal that larger models tend to have a primacy bias, whereas smaller models tend to have a recency bias. Second, position bias exists across model sizes; for example, the Qwen1.5-110B-Chat and Qwen1.5-1.8B-Chat models both have obvious position bias, as shown by the large performance gap between the first two rows of each model block (11.5 and 29.9 percentage points, respectively). By comparing the last two rows of each model size, we conclude that models of different sizes perform better with PINE, which eliminates position bias. The only exception is the Qwen-1.5-72B-Chat model. We suspect this model is not well trained, since Qwen-1.5-32B-Chat performs extremely similarly to the 72B model in vanilla inference. PINE consistently improves model performance on the “reasoning” subset by a large margin: from 8 to 10 percentage points in most cases. For example, Llama-3-70B-Instruct originally ranks 8th among generative models on the reasoning subset of RewardBench. With PINE, it achieves the 3rd rank (87.6%), outperforming GPT-4-0125-preview (previously ranked 3rd, 86.9%), and is only worse than two of Google’s most recent models (flame-24b-july-2024, gemini-1.5-pro-0514), according to the official leaderboard as of June 30, 2024 (https://huggingface.co/spaces/allenai/reward-bench). Moreover, PINE largely improves subsets with severe position bias (colored cyan), which is consistent with our motivation and goals. We also observe that PINE sometimes brings an unusual performance drop in small models, which is expected since position bias is not always harmful: if the model's capability is worse than random guessing, position bias can pull the overall performance up toward 50%; however, if the model is very capable, position bias pulls the overall performance down toward 50%. Overall, PINE improves performance from a statistical perspective and makes models more reliable when used as evaluators.
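
The quantities discussed above can be sketched in a few lines. The first function computes the position-bias rate reported in Table 1 from two runs with swapped candidate order, the second computes the gap between the (GT at A) and (GT at B) rows of Table 2, and the third is a deliberately simplified mixture model (our assumption for illustration, not an analysis from the experiments) of why position bias pulls shuffled-order accuracy toward 50%. All function names are hypothetical.

```python
def position_bias_rate(picks_original, picks_swapped):
    """Percentage of examples whose selected response changes after the swap.

    Each pick names the selected *response* (e.g. "chosen" / "rejected"),
    not the slot, so a position-consistent model gives equal picks.
    """
    if not picks_original or len(picks_original) != len(picks_swapped):
        raise ValueError("need two non-empty runs of equal length")
    flipped = sum(a != b for a, b in zip(picks_original, picks_swapped))
    return 100.0 * flipped / len(picks_original)


def ab_gap(acc_gt_at_a, acc_gt_at_b):
    """Absolute accuracy gap (percentage points) between the two fixed orders."""
    return abs(acc_gt_at_a - acc_gt_at_b)


def accuracy_with_bias(content_accuracy, bias_fraction):
    """Toy model: a biased example is decided by slot alone, so with a
    shuffled ground-truth slot it is answered correctly half of the time."""
    return (1.0 - bias_fraction) * content_accuracy + bias_fraction * 50.0


# Gaps quoted in the text, recomputed from the Avg. column of Table 2.
gap_110b = round(ab_gap(87.2, 75.7), 1)   # Qwen1.5-110B-Chat
gap_1_8b = round(ab_gap(36.3, 66.2), 1)   # Qwen1.5-1.8B-Chat

# Under the toy model, bias hurts a capable model and helps a weak one.
strong = accuracy_with_bias(80.0, 0.2)
weak = accuracy_with_bias(30.0, 0.2)
```

Under this toy model, a judge with 80% content accuracy and 20% biased examples drops to 74%, while a judge with 30% content accuracy rises to 34%, matching the qualitative observation that removing position bias can lower the score of a model that is worse than random guessing.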

We then compare PINE with the baselines on the Llama-3-8B-Instruct and Qwen1.5-7B-Chat models. Since NIA is not inter-segment position invariant, we also evaluate NIA in two extreme cases: the ground-truth chosen response is always in the first place or always in the second place. Results in Table 3 show that PINE achieves the best performance, outperforming the best baseline by ~5 percentage points, and outperforms NIA even when NIA is evaluated in its extreme cases. The results reveal that masking inter-segment attention is much less effective than the bidirectional inter-segment attention applied in PINE.

4.3 Results on Retrieval-Augmented Question-Answering

Figure 4: The results of retrieval-augmented QA on Llama-3-70B-Instruct. Dashed lines indicate that the method is either inter-segment position-invariant or the result is obtained on the order-shuffled data (denoted in the legend). (a) shows results of PINE against baselines. (b) shows results of different designs of PINE.

Models tend to perform better when the gold-standard document is at the beginning or the end of the document list in retrieval-augmented question-answering. Figure 4 (a) shows the results on LLaMa-3-70B-Instruct when 10 or 20 documents are presented. First, all baselines are much worse than PINE (the pink line), which is consistent with the previous experiment. Second, PINE achieves on-par performance on average compared with vanilla inference while being inter-segment position invariant. Specifically, PINE is slightly better/worse than vanilla inference, with gaps of +1.2/-2.0 when there are 10/20 documents in total, respectively. We hypothesize that the slight performance drop of PINE in the 20-document setting is due to less accurate segment similarity computation in PINE when many documents are presented. However, because PINE is position-invariant, it is not affected by the worst case (the bottom of the blue solid curves). Third, the gap between the blue and brown solid lines in Figure 4 (a), and between the blue and red solid lines in Figure 4 (b), generally becomes smaller as the gold-standard document position increases, reflecting that causal attention generally prefers distant content, which is consistent with the hypothesis in Section 3.2. The brown line in Figure 4 (a) and the red line in Figure 4 (b) generally reflect the recency bias brought by RoPE, which is consistent with previous works [29, 24].
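
To make the comparison between the average and the worst case concrete, the following minimal sketch summarizes a per-position accuracy curve such as those in Figure 4. The function name and the numbers in the usage example are made-up placeholders, not results from the experiments.

```python
from statistics import mean


def summarize_by_gold_position(accuracy_by_position):
    """Summarize accuracy (%) indexed by the position of the gold document.

    Returns the mean over positions, the worst case, and the spread
    (best minus worst), which is zero for a position-invariant method.
    """
    values = list(accuracy_by_position.values())
    return {
        "mean": mean(values),
        "worst": min(values),
        "spread": max(values) - min(values),
    }


# Placeholder curves: a position-sensitive method and a position-invariant one.
sensitive = summarize_by_gold_position({0: 70.0, 4: 60.0, 9: 65.0})
invariant = summarize_by_gold_position({0: 66.0, 4: 66.0, 9: 66.0})
```

Two methods with similar means can differ substantially in their worst case, which is the sense in which a position-invariant method is not exposed to the bottom of a position-sensitive curve.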

4.4 Ablation Study

So far, our experiments show that bidirectional inter-segment attention is a better design choice than the masked one. However, there are still several design options for position assignment, as discussed in Section 3.4. The first is to re-sort positions in reverse order, and the other is to use PINE without re-sorting. To gain a deeper understanding, we extend the retrieval-augmented QA experiments with these two alternative position assignment methods; the results are presented in Figure 4 (b). The figure shows that PINE is slightly better than PINE without re-sorting on average (+0.3), and the gap becomes larger when 20 documents are presented (+1.5). Besides, segment positions still affect performance without re-sorting. Therefore, PINE without re-sorting may suffice if one does not aim to eliminate position bias or cares more about efficiency. On the other hand, reverse re-sorting yields relatively worse results, showing that PINE is the better design choice, which is consistent with the intuitive analysis in Section 3.4.
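
The difference between re-sorting and reverse re-sorting can be illustrated with a toy ordering function. This is only a sketch under stated assumptions: the per-segment scores are hypothetical stand-ins for the attention-derived values used by the method, we assume that the default ordering places higher-scoring segments closer, and we assume the scores themselves do not depend on the input order.

```python
def order_by_score(segments, scores, reversed_order=False):
    """Order segments from closest to farthest placement using scores only.

    Default: higher score means closer placement (assumed PINE-style order).
    reversed_order=True: lower score means closer placement.
    Ties are broken by segment text so the result never depends on input order.
    """
    sign = 1.0 if reversed_order else -1.0
    ranked = sorted(zip(segments, scores), key=lambda pair: (sign * pair[1], pair[0]))
    return [segment for segment, _ in ranked]


# Hypothetical scores attached to segment content, not to input slots.
score = {"doc A": 0.2, "doc B": 0.7, "doc C": 0.1}
input_1 = ["doc A", "doc B", "doc C"]
input_2 = ["doc C", "doc A", "doc B"]  # same documents, different input order

order_1 = order_by_score(input_1, [score[d] for d in input_1])
order_2 = order_by_score(input_2, [score[d] for d in input_2])
order_rev = order_by_score(input_1, [score[d] for d in input_1], reversed_order=True)
```

Because the ordering is derived from the scores rather than from the input slots, both input permutations produce the same placement. Without re-sorting, the placement would follow the input order, which is why segment positions still affect performance in that variant.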

4.5 Computational Overhead

Section 3.4 briefly discusses the computational overhead and concludes that PINE's efficiency is acceptable. In our experiments, the wall-clock time of PINE is ~2x and ~8x that of vanilla inference on the LM-as-a-judge task and the retrieval-augmented QA task with 20 documents, respectively, which is acceptable at least for experiments. However, we did not specifically optimize the code to accelerate PINE, and our implementation still contains a “for” loop. Therefore, we believe there is room to accelerate PINE. Compared with the time overhead, the memory overhead is small: PINE can be run with 70B models on 3x A100 80G on the retrieval-augmented QA task, which requires the same number of GPUs as vanilla inference. Since efficiency is not the main focus of this paper, we leave this as future work.
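
A wall-clock ratio of this kind can be measured with a small harness such as the one below. It is a generic sketch, not the measurement script used for the numbers above; the function names are hypothetical, and the two callables stand in for a vanilla and a modified inference call.

```python
import time


def wall_time(fn, *args, repeats=3, **kwargs):
    """Best-of-`repeats` wall-clock time of fn(*args, **kwargs), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def slowdown(baseline_fn, candidate_fn, *args, repeats=3):
    """How many times slower candidate_fn is than baseline_fn on the same input."""
    candidate = wall_time(candidate_fn, *args, repeats=repeats)
    baseline = wall_time(baseline_fn, *args, repeats=repeats)
    return candidate / baseline


# Stand-ins for two inference calls; real use would pass model calls instead.
def fake_vanilla():
    time.sleep(0.01)


def fake_modified():
    time.sleep(0.06)


ratio = slowdown(fake_vanilla, fake_modified, repeats=2)
```

For GPU inference, the timed callable should also wait for the device to finish its queued work before returning; otherwise asynchronous kernel launches make the measured time too small.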

5 Conclusion, Limitations and Future Work

This paper analyzes transformers and attributes position bias to causal attention and RoPE. Based on the analysis, we propose a novel training-free zero-shot approach to intrinsically eliminate position bias. The core idea is to make every input segment (e.g., options and retrieved documents) equally affected by the attention mask and RoPE. Experiments on LM-as-a-judge and retrieval-augmented QA show the effectiveness of our approach. However, PINE requires extra computation, and thus it is not suitable for efficiency-sensitive scenarios. We believe there is room to improve the efficiency with a more automated end-to-end implementation, and we leave this as future work. PINE works only for input segments that are well-defined and structured to be position-invariant. Therefore, users are expected to tell models which parts of the input should be position-invariant. Finally, PINE is an early attempt to eliminate position bias that builds on existing attention and position encoding, and we believe that novel designs of position encoding and attention may be a better solution, which is left for future exploration.

Acknowledgement

We thank Chujie Zheng for the helpful discussion and feedback.

References

  • Achiam et al., [2023] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • AI, [2024] AI, M. (2024). Build the future of ai with meta llama 3.
  • Ansel et al., [2024] Ansel, J., Yang, E., He, H., Gimelshein, N., Jain, A., Voznesensky, M., Bao, B., Bell, P., Berard, D., Burovski, E., et al. (2024). Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, pages 929–947.
  • Bai et al., [2023] Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., Hui, B., Ji, L., Li, M., Lin, J., Lin, R., Liu, D., Liu, G., Lu, C., Lu, K., Ma, J., Men, R., Ren, X., Ren, X., Tan, C., Tan, S., Tu, J., Wang, P., Wang, S., Wang, W., Wu, S., Xu, B., Xu, J., Yang, A., Yang, H., Yang, J., Yang, S., Yao, Y., Yu, B., Yuan, H., Yuan, Z., Zhang, J., Zhang, X., Zhang, Y., Zhang, Z., Zhou, C., Zhou, J., Zhou, X., and Zhu, T. (2023). Qwen technical report. arXiv preprint arXiv:2309.16609.
  • Bavishi et al., [2023] Bavishi, R., Elsen, E., Hawthorne, C., Nye, M., Odena, A., Somani, A., and Taşırlar, S. (2023). Introducing our multimodal models.
  • Brown et al., [2020] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Cai et al., [2023] Cai, T., Huang, K., Lee, J. D., and Wang, M. (2023). Scaling in-context demonstrations with structured attention. In Workshop on Efficient Systems for Foundation Models @ ICML2023.
  • Chen et al., [2024] Chen, X., Chi, R. A., Wang, X., and Zhou, D. (2024). Premise order matters in reasoning with large language models. arXiv preprint arXiv:2402.08939.
  • Chowdhery et al., [2022] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levskaya, A., Ghemawat, S., Dev, S., Michalewski, H., Garcia, X., Misra, V., Robinson, K., Fedus, L., Zhou, D., Ippolito, D., Luan, D., Lim, H., Zoph, B., Spiridonov, A., Sepassi, R., Dohan, D., Agrawal, S., Omernick, M., Dai, A. M., Pillai, T. S., Pellat, M., Lewkowycz, A., Moreira, E., Child, R., Polozov, O., Lee, K., Zhou, Z., Wang, X., Saeta, B., Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. (2022). Palm: Scaling language modeling with pathways.
  • Golovneva et al., [2024] Golovneva, O., Wang, T., Weston, J., and Sukhbaatar, S. (2024). Contextual position encoding: Learning to count what’s important. arXiv preprint arXiv:2405.18719.
  • Hao et al., [2022] Hao, Y., Sun, Y., Dong, L., Han, Z., Gu, Y., and Wei, F. (2022). Structured prompting: Scaling in-context learning to 1,000 examples. arXiv preprint arXiv:2212.06713.
  • Hsieh et al., [2024] Hsieh, C.-Y., Chuang, Y.-S., Li, C.-L., Wang, Z., Le, L. T., Kumar, A., Glass, J., Ratner, A., Lee, C.-Y., Krishna, R., et al. (2024). Found in the middle: Calibrating positional attention bias improves long context utilization. arXiv preprint arXiv:2406.16008.
  • Junqing et al., [2023] Junqing, H., Kunhao, P., Xiaoqun, D., Zhuoyang, S., Yibo, L., Yuxin, L., Hao, W., Qianguo, S., Songxin, Z., Zejian, X., et al. (2023). Never lost in the middle: Improving large language models via attention strengthening question answering. arXiv preprint arXiv:2311.09198.
  • Kenton and Toutanova, [2019] Kenton, J. D. M.-W. C. and Toutanova, L. K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  • Kwon et al., [2023] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. (2023). Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
  • [16] Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., et al. (2024a). Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787.
  • [17] Lambert, N., Pyatkin, V., Morrison, J., Miranda, L., Lin, B. Y., Chandu, K., Dziri, N., Kumar, S., Zick, T., Choi, Y., Smith, N. A., and Hajishirzi, H. (2024b). Rewardbench: Evaluating reward models for language modeling. https://huggingface.co/spaces/allenai/reward-bench.
  • Li et al., [2023] Li, S., Zhao, R., Li, M., Ji, H., Callison-Burch, C., and Han, J. (2023). Open-domain hierarchical event schema induction by incremental prompting and verification. In Proc. The 61st Annual Meeting of the Association for Computational Linguistics (ACL2023).
  • Liu et al., [2024] Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173.
  • Ma et al., [2024] Ma, X., Liu, W., Zhang, P., and Xu, N. (2024). 3d-rpe: Enhancing long-context modeling through 3d rotary position encoding. arXiv preprint arXiv:2406.09897.
  • McLeish et al., [2024] McLeish, S., Bansal, A., Stein, A., Jain, N., Kirchenbauer, J., Bartoldson, B. R., Kailkhura, B., Bhatele, A., Geiping, J., Schwarzschild, A., et al. (2024). Transformers can do arithmetic with the right embeddings. arXiv preprint arXiv:2405.17399.
  • OpenAI, [2023] OpenAI (2023). Gpt-4v(ision) system card.
  • Paszke et al., [2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. (2019). Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32.
  • Peysakhovich and Lerer, [2023] Peysakhovich, A. and Lerer, A. (2023). Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427.
  • Radford et al., [2019] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Raffel et al., [2020] Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
  • Ratner et al., [2023] Ratner, N., Levine, Y., Belinkov, Y., Ram, O., Magar, I., Abend, O., Karpas, E., Shashua, A., Leyton-Brown, K., and Shoham, Y. (2023). Parallel context windows for large language models. In Rogers, A., Boyd-Graber, J., and Okazaki, N., editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6383–6402, Toronto, Canada. Association for Computational Linguistics.
  • Shi et al., [2024] Shi, L., Ma, W., and Vosoughi, S. (2024). Judging the judges: A systematic investigation of position bias in pairwise comparative assessments by llms. arXiv preprint arXiv:2406.07791.
  • Su et al., [2024] Su, J., Ahmed, M., Lu, Y., Pan, S., Bo, W., and Liu, Y. (2024). Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  • Tay et al., [2022] Tay, Y., Dehghani, M., Tran, V. Q., Garcia, X., Wei, J., Wang, X., Chung, H. W., Shakeri, S., Bahri, D., Schuster, T., et al. (2022). Ul2: Unifying language learning paradigms. arXiv preprint arXiv:2205.05131.
  • Thoppilan et al., [2022] Thoppilan, R., De Freitas, D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.-T., Jin, A., Bos, T., Baker, L., Du, Y., et al. (2022). Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  • Touvron et al., [2023] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Vaswani et al., [2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
  • Wang et al., [2022] Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. (2022). Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
  • Wang et al., [2023] Wang, Y., Cai, Y., Chen, M., Liang, Y., and Hooi, B. (2023). Primacy effect of ChatGPT. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 108–115, Singapore. Association for Computational Linguistics.
  • Wolf et al., [2020] Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Le Scao, T., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. (2020). Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
  • Wu and Xie, [2023] Wu, P. and Xie, S. (2023). V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135.
  • Xu et al., [2024] Xu, Z., Cohen, D., Wang, B., and Srikumar, V. (2024). In-context example ordering guided by label distributions. arXiv preprint arXiv:2402.11447.
  • Yu et al., [2024] Yu, Y., Jiang, H., Luo, X., Wu, Q., Lin, C.-Y., Li, D., Yang, Y., Huang, Y., and Qiu, L. (2024). Mitigate position bias in large language models via scaling a single dimension. arXiv preprint arXiv:2406.02536.
  • [40] Zhang, K., Lv, A., Chen, Y., Ha, H., Xu, T., and Yan, R. (2024a). Batch-icl: Effective, efficient, and order-agnostic in-context learning. arXiv preprint arXiv:2401.06469.
  • [41] Zhang, M., Meng, Z., and Collier, N. (2024b). Attention instruction: Amplifying attention in the middle via prompting. arXiv preprint arXiv:2406.17095.
  • [42] Zheng, C., Zhou, H., Meng, F., Zhou, J., and Huang, M. (2024a). Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations.
  • [43] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024b). Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36.
  • Zhu et al., [2023] Zhu, L., Wang, X., and Wang, X. (2023). Judgelm: Fine-tuned large language models are scalable judges. arXiv preprint arXiv:2310.17631.

Appendix A Another Example of Position Bias in VLMs

Figure 5: Image of the Andromeda Galaxy. GPT-4V is asked to identify the galaxy in the red box (M110). Images (b), (c), and (d) are generated by flipping (a). Results show that GPT-4V answers correctly on (c) and (d) and incorrectly on (a) and (b), revealing position bias in VLMs.

To show that VLMs suffer from position bias, we ask a model to identify a small object that appears in a full image. Specifically, we ask GPT-4V [22] to identify M110, a satellite galaxy of the Andromeda Galaxy (M31). The central object in Figure 5 (a) is the Andromeda Galaxy, and the galaxy in the red box is the satellite galaxy M110. Another satellite galaxy, M32, is the brightest spot near the Andromeda Galaxy.

We flip image (a) to obtain images (b), (c), and (d). We prompt GPT-4V with “What galaxy is in the red box?” and collect the following responses:

(a): The galaxy within the red box is M32, also known as Messier 32. It’s a dwarf elliptical galaxy and is a satellite of the Andromeda Galaxy (M31), which is the prominent galaxy in the image. M32 is one of the more well-studied dwarf elliptical galaxies, largely due to its proximity to our own galaxy and its interactions with the Andromeda Galaxy.

(b): The object inside the red box in the image appears to be a smaller galaxy located near the larger, more prominent galaxy (Andromeda Galaxy, M31). This smaller galaxy is likely M32, a satellite galaxy of the Andromeda Galaxy. M32 is a compact elliptical galaxy and is often visible in images as a smaller, dense, bright spot near Andromeda, as seen in the boxed area of your image.

(c): The galaxy within the red box is M110, also known as NGC 205. M110 is a dwarf elliptical galaxy, which, like M32, is a satellite of the Andromeda Galaxy (M31). It is one of the members of the Local Group of galaxies, which includes the Milky Way, the Andromeda Galaxy, and about 54 other known galaxies.

(d): The galaxy in the red box appears to be a satellite galaxy of the Andromeda Galaxy, which is the prominent galaxy visible in the rest of the image. This smaller galaxy is likely M110 (NGC 205), which is a dwarf elliptical galaxy and a companion to the Andromeda Galaxy, M31. It’s one of the several satellite galaxies gravitationally bound to Andromeda, visible here as a faint, elongated object in the outlined area.

The model answers correctly when M110 is at the top of the image, revealing that VLMs also suffer from position bias. Position bias may make VLMs unreliable when fine-grained image analysis is needed (e.g., small object detection [37]).
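The flipping procedure described in this appendix can be illustrated with a small sketch. The source does not state which flips produce images (b), (c), and (d), so the three variants below (horizontal, vertical, and both) are an assumption, as are all function names. The toy grid stands in for an image, and the value 1 marks the boxed object.

```python
# Illustrative sketch only: the exact flips used for images (b), (c), (d)
# are not stated in the text, so these three variants are an assumption.

def flip_horizontal(image):
    """Mirror each row (left-right flip) of a 2D grid of pixels."""
    return [row[::-1] for row in image]


def flip_vertical(image):
    """Reverse the row order (top-bottom flip) of a 2D grid of pixels."""
    return [row[:] for row in image[::-1]]


def flipped_variants(image):
    """Return a copy of the original image plus three flipped copies."""
    return {
        "original": [row[:] for row in image],
        "horizontal": flip_horizontal(image),
        "vertical": flip_vertical(image),
        "both": flip_vertical(flip_horizontal(image)),
    }


def object_position(image):
    """Return (row, col) of the first pixel equal to 1, or None."""
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value == 1:
                return (r, c)
    return None


# Toy 3x4 "image": 1 marks the boxed object, 0 is background.
toy = [
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
variants = flipped_variants(toy)
positions = {name: object_position(img) for name, img in variants.items()}
```

Each variant places the same object in a different corner, so any change in the model's answer across variants can be attributed to position rather than content.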

Appendix B Implementation Details

B.1 Experiment Setting

For reproducibility, the generation temperature is set to 0. We use PyTorch [3, 23] (customized license; https://github.com/pytorch/pytorch), Transformers [36] (Apache-2.0 license; https://huggingface.co/docs/transformers/en/index), and vLLM [15] (Apache-2.0 license; https://github.com/vllm-project/vllm) for our experiments. All experiments are run on a single node with 8x A100 80GB GPUs connected via SXM. The 70B and 110B models run on 3x and 4x A100 GPUs, respectively; models of other sizes run on 1x A100.

B.2 Prompts

We use the prompts provided by the official RewardBench [17] repository for the LM-as-a-judge task:

System prompt:

Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user’s instructions and answers the user’s question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if assistant A is better, "[[B]]" if assistant B is better.

User prompt:

[User Question]

......

[The Start of Assistant A’s Answer]

......

[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]

......

[The End of Assistant B’s Answer]
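As a minimal sketch, the user prompt above could be assembled as follows. The function names, argument names, straight apostrophes, and the blank-line separators are assumptions made for illustration; the official RewardBench repository remains the authoritative source for the exact template.

```python
# Hypothetical helpers; names and whitespace handling are assumptions.

def build_judge_user_prompt(question, answer_a, answer_b):
    """Fill the LM-as-a-judge user prompt with a question and two answers."""
    parts = [
        "[User Question]",
        question,
        "[The Start of Assistant A's Answer]",
        answer_a,
        "[The End of Assistant A's Answer]",
        "[The Start of Assistant B's Answer]",
        answer_b,
        "[The End of Assistant B's Answer]",
    ]
    return "\n\n".join(parts)


def build_swapped_judge_user_prompt(question, answer_a, answer_b):
    """Same prompt with the two answers presented in the opposite order."""
    return build_judge_user_prompt(question, answer_b, answer_a)


prompt = build_judge_user_prompt(
    "Who created the Superman cartoon character?",
    "Jerry Siegel and Joe Shuster, in 1938.",
    "Writer Jerry Siegel and artist Joe Shuster.",
)
```

Building both orderings of the same pair makes it possible to check whether a judge's verdict changes when only the presentation order changes.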

We use the prompts from the official repository of [19] for the retrieval-augmented QA experiments:

User Prompt:

Write a high-quality one-sentence answer for the given question using only the provided search results (some of which might be irrelevant).

Document (Title: ......): ......

Document (Title: ......): ......

......

Document (Title: ......): ......

Document (Title: ......): ......

Question: ......
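The retrieval-augmented QA prompt above can likewise be sketched in a few lines. The function name, the `(title, text)` pair representation, and the blank-line separators are assumptions for illustration. Note that, as in the template, documents carry no index numbers.

```python
# Hypothetical helper; the data layout and separators are assumptions.

QA_INSTRUCTION = (
    "Write a high-quality one-sentence answer for the given question "
    "using only the provided search results "
    "(some of which might be irrelevant)."
)


def build_qa_prompt(question, documents):
    """Assemble the QA prompt from a question and (title, text) pairs."""
    doc_lines = [
        f"Document (Title: {title}): {text}" for title, text in documents
    ]
    parts = [QA_INSTRUCTION] + doc_lines + [f"Question: {question}"]
    return "\n\n".join(parts)


example_prompt = build_qa_prompt(
    "in which sea pearl is found in india?",
    [
        ("Pearl", "For thousands of years, seawater pearls were retrieved by divers..."),
        ("Laccadive Sea", "Although extraction of natural pearls is considered too expensive..."),
    ],
)
```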

Appendix C Qualitative Examples

C.1 LM-as-a-judge

[User Question]

Who created the Superman cartoon character?

[The Start of Assistant A’s Answer]

The Superman cartoon character was created by Jerry Siegel and Joe Shuster in 1938.

[The End of Assistant A’s Answer]

[The Start of Assistant B’s Answer]

Superman, the iconic comic book superhero, was created by writer Jerry Siegel and artist Joe Shuster. Superman first appeared in Action Comics #1, which was published by Detective Comics, Inc. (later DC Comics) in June 1938. The character’s immense popularity established him as one of the most enduring and recognizable figures in the superhero genre.

[The End of Assistant B’s Answer]

Here are the outputs of Qwen1.5-7B-Chat:

Vanilla: Both AI assistants provide accurate and informative responses to the user’s question. Assistant A’s answer is brief and directly states the creators’ names and the year of creation, while Assistant B offers additional context by mentioning the first appearance, the publisher, and the character’s significance in the superhero genre. However, since the user asked for the creators, Assistant A’s answer is more directly helpful in this specific context. Final verdict: [[A]]

Vanilla (Swapped): [[B]]. Assistant B’s response is more detailed and informative, providing additional context about the origin of the character in Action Comics #1, the publisher, and its significance in the superhero genre. It also includes the names of the creators, which is crucial information for the question. Assistant A’s response is shorter and lacks the historical context and specific details. Therefore, Assistant B is better in answering the user’s question.

PINE: [[B]]. Assistant B provides a more comprehensive and detailed response to the user’s question. It not only mentions the creators but also provides additional information about the character’s debut and the company that published the comic. This answer is not only accurate but also offers context that highlights the significance of the creation. Assistant A’s response is brief and lacks the depth and historical context provided by Assistant B.
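Because the outputs above place the verdict marker either at the start or at the end of the response, a verdict extractor should not assume a fixed position. The sketch below is one possible way to do this, not the evaluation code used in the paper; the function name and the choice to return `None` for missing or conflicting markers are assumptions.

```python
import re

# Matches the "[[A]]" / "[[B]]" markers requested by the system prompt.
VERDICT_PATTERN = re.compile(r"\[\[(A|B)\]\]")


def parse_verdict(response):
    """Return 'A' or 'B' if the response names exactly one verdict.

    Returns None when no marker is present or when both markers appear,
    since the intended verdict is then ambiguous (an assumed convention).
    """
    found = set(VERDICT_PATTERN.findall(response))
    if len(found) == 1:
        return found.pop()
    return None


# Shortened excerpts of the outputs shown above.
vanilla_excerpt = (
    "Assistant A's answer is more directly helpful in this specific "
    "context. Final verdict: [[A]]"
)
pine_excerpt = (
    "[[B]]. Assistant B provides a more comprehensive and detailed "
    "response to the user's question."
)

vanilla_verdict = parse_verdict(vanilla_excerpt)
pine_verdict = parse_verdict(pine_excerpt)
```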

C.2 Retrieval-Augmented Question-Answering

User Prompt:

Document (Title: Pearl) south Pacific and Indian Ocean. The largest pearl oyster is the "Pinctada maxima", which is roughly the size of a dinner plate. South Sea pearls are characterized by their large size and warm luster. Sizes up to 14 mm in diameter are not uncommon. In 2013, Indonesia Pearl supplied 43 percent of South Sea Pearls international market. The other significant producers are Australia, Philippines, Myanmar and Malaysia. In 1914, pearl farmers began growing cultured freshwater pearls using the pearl mussels native to Lake Biwa. This lake, the largest and most ancient in Japan, lies near the city of Kyoto. The

Document (Title: Laccadive Sea) the gulf as most productive in the world. Although extraction of natural pearls is considered too expensive in most parts of the world, it is still conducted in the gulf. Also collected in large numbers are Shankha mollusks ("Xancus pyrum") whose shells are used as a ritual and religious object. Other mollusks of the sea are either too scarce or not popular in the Indian society and therefore have no commercial value. Another traditional occupation in the Laccadive Sea is fishing. The annual fish catch is 2,000 to 5,000 tonnes from the Lakshadweep islands, which is mostly constituted by tuna

Document (Title: Pearl) including the Cook Islands and Fiji are being extensively used for producing cultured pearls. The rarity of the black cultured pearl is now a "comparative" issue. The black cultured pearl is rare when compared to Chinese freshwater cultured pearls, and Japanese and Chinese akoya cultured pearls, and is more valuable than these pearls. However, it is more abundant than the South Sea pearl, which is more valuable than the black cultured pearl. This is simply because the black pearl oyster "Pinctada margaritifera" is far more abundant than the elusive, rare, and larger south sea pearl oyster "Pinctada maxima", which cannot

Document (Title: Pearl powder) Pearl powder Pearl powder () is a preparation of crushed pearls used in China and elsewhere for skin care and in traditional Chinese medicine. Pearl powder is made from freshwater pearls or saltwater pearls below jewellery grade. These are sterilised in boiling water and then milled into a fine powder using stainless steel grinding discs or by milling with small porcelain balls in moist conditions. The powder is sold as such or mixed into creams. Pearl powder is widely believed to help improve the appearance of the skin, and is used as a cosmetic by royal families in Asia. It

Document (Title: Hyderabad pearl) with white pearls. Recently, several pearl makers are exporting processed pearls to markets in Europe and the US. With the capital that they gain from this marketing, they are able to purchase machinery for advanced refinement. In particular, equipment that uses enzymes present in thermophiles is able to substantially improve the process of refining pearls. Hyderabad pearl Hyderabad is considered the main pearl trading center in India. The most notable area devoted to the trade is the village called Chandanpet just outside Hyderabad, wherein almost the entire population is engaged in the delicate art of drilling pearls, a skill they

Document (Title: Pearl) pearls". The correct definition of a South Sea pearl – as described by CIBJO and GIA – is a pearl produced by the "Pinctada maxima" pearl oyster. South Sea pearls are the color of their host "Pinctada maxima" oyster – and can be white, silver, pink, gold, cream, and any combination of these basic colors, including overtones of the various colors of the rainbow displayed in the pearl nacre of the oyster shell itself. South Sea pearls are the largest and rarest of the cultured pearls – making them the most valuable. Prized for their exquisitely beautiful "orient" or lustre,

Document (Title: Chandrani Pearls) year 2007–08 Chandrani Pearls imported their pearls from Japan, China or Korea. Chandrani Pearls Chandrani Pearls is a prominent pearl jewelery brand of India. It pioneered the concept of pearls in India. Chandrani Pearls’s headquarters is at Kolkata in West Bengal. Chandrani Pearls was started on 24 January 1985 by Mr. Kuldip Nayar, his wife Mrs. Lakshmi Nayar and his father late Mr. N.C. Nayar in Kolkata’s up market Minto Park area. Chandrani Pearls management is now assisted by Nisheeth Nayar, sons of Mr. Kuldip Nayar. Chandrani Pearls have 63 showrooms across 9 states. From a modest turnover of Rs.

Document (Title: Pearl) For thousands of years, seawater pearls were retrieved by divers in the Indian Ocean in areas such as the Persian Gulf, the Red Sea and the Gulf of Mannar. Evidence also suggest a prehistoric origin to pearl diving in these regions. Starting in the Han Dynasty (206 BC–220 AD), the Chinese hunted extensively for seawater pearls in the South China Sea. In the 14th-century Arabian Sea, the traveller Ibn Battuta provided the earliest known description of pearl diving by means of attaching a cord to the diver’s waist.

Document (Title: Chandrani Pearls) Chandrani Pearls Chandrani Pearls is a prominent pearl jewelery brand of India. It pioneered the concept of pearls in India. Chandrani Pearls’s headquarters is at Kolkata in West Bengal. Chandrani Pearls was started on 24 January 1985 by Mr. Kuldip Nayar, his wife Mrs. Lakshmi Nayar and his father late Mr. N.C. Nayar in Kolkata’s up market Minto Park area. Chandrani Pearls management is now assisted by Nisheeth Nayar, sons of Mr. Kuldip Nayar. Chandrani Pearls have 63 showrooms across 9 states. From a modest turnover of Rs. 2.75 lakhs in 1990 to a turnover of Rs. 15.00 crores in

Document (Title: Hyderabad pearl) Hyderabad pearl Hyderabad is considered the main pearl trading center in India. The most notable area devoted to the trade is the village called Chandanpet just outside Hyderabad, wherein almost the entire population is engaged in the delicate art of drilling pearls, a skill they have practiced for generations. This practice also makes the city one of the largest pearl drilling locations in India. The pearl industry in Hyderabad flourished due to the patronage of the Qutub Shahi kings and the Asaf Jahis, who were said to have an affinity for sparkling jewels. The pearls were not only part of

Question: in which sea pearl is found in india?

Here are the results of Llama-3-70B-Instruct:

Vanilla: According to the search results, pearls are found in the Laccadive Sea in India.

PINE: Pearls are found in the Indian Ocean, specifically in the Gulf of Mannar and the Laccadive Sea, off the coast of India.