Similarity-Navigated Conformal Prediction for Graph Neural Networks

Jianqing Song1   Jianguo Huang2,3   Wenyu Jiang2,1   Baoming Zhang1
Shuangjie Li1   Chongjun Wang1
1State Key Laboratory of Novel Software Technology, Nanjing University
2Department of Statistics and Data Science, Southern University of Science and Technology
3College of Computing and Data Science, Nanyang Technological University
Corresponding author: chjwang@nju.edu.cn
Abstract

Graph Neural Networks have achieved remarkable accuracy in semi-supervised node classification tasks. However, these results lack reliable uncertainty estimates. Conformal prediction methods provide a theoretical guarantee for node classification tasks, ensuring that the conformal prediction set contains the ground-truth label with a desired probability (e.g., 95%). In this paper, we empirically show that for each node, aggregating the non-conformity scores of nodes with the same label can improve the efficiency of conformal prediction sets while maintaining valid marginal coverage. This observation motivates us to propose a novel algorithm named Similarity-Navigated Adaptive Prediction Sets (SNAPS), which aggregates the non-conformity scores based on feature similarity and structural neighborhood. The key idea behind SNAPS is that nodes with high feature similarity or direct connections tend to have the same label. By adaptively incorporating information from similar nodes, SNAPS can generate compact prediction sets and increase the singleton hit ratio (the proportion of correct prediction sets of size one). Moreover, we provide a theoretical finite-sample coverage guarantee for SNAPS. Extensive experiments demonstrate the superiority of SNAPS, which improves the efficiency of prediction sets and the singleton hit ratio while maintaining valid coverage.

1 Introduction

Graph Neural Networks (GNNs), which process graph-structured data through message passing (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019), have achieved remarkable accuracy in various high-stakes applications, e.g., drug discovery (Li et al., 2022), fraud detection (Liu et al., 2023) and traffic forecasting (Jiang and Luo, 2022), where any erroneous prediction can be costly and dangerous (Amodei et al., 2016; Gao et al., 2019). To improve the reliability of prediction results, many methods have been investigated to quantify model uncertainty (Gal and Ghahramani, 2016; Guo et al., 2017; Kendall and Gal, 2017; Wang et al., 2021; Hsu et al., 2022; Tang et al., 2024), but these methods lack theoretical guarantees for their uncertainty estimates. Conformal prediction (CP), on the other hand, offers a systematic approach to constructing prediction sets that contain ground-truth labels with a desired coverage guarantee (Vovk et al., 2005; Romano et al., 2020; Angelopoulos et al., 2021; Huang et al., 2023a; Xi et al., 2024).

CP algorithms use non-conformity scores to measure the dissimilarity between a new instance and the training instances. The lower the score, the more likely the new instance follows the same distribution as the training instances, and the more likely the corresponding label is to be included in the prediction set. To improve the efficiency of prediction sets for GNNs, DAPS (Zargarbashi et al., 2023) smooths node-wise non-conformity scores by incorporating neighborhood information based on the assumption of network homophily. Similar to DAPS, CF-GNN (Huang et al., 2023b) introduces a topology-aware output correction model that learns to update predictions and then produces more efficient prediction sets or intervals, using inefficiency as the optimization objective. However, both methods consider only structural neighbors and ignore the effect of nodes that are far from the ego node. This motivates us to analyze the influence of global nodes on the size of prediction sets.

In this work, we show that aggregating the information of global nodes with the same label as the ego node benefits the performance of CP methods. We provide an empirical analysis from an oracle perspective, where the ground-truth labels of all nodes are known: we randomly select nodes with the same label as the ego node and aggregate their non-conformity scores into the ego node. The results indicate that aggregating the scores of these nodes can significantly reduce the average size of prediction sets. This suggests that the information of nodes with the same label can correct the non-conformity scores, thereby improving the efficiency of prediction sets. A detailed analysis is presented in Subsection 3.1. However, during the testing phase, the ground-truth label of every test node is unknown. Inspired by the analysis, our key idea is to accurately identify and select as many nodes with the same label as the ego node as possible and aggregate their non-conformity scores.

To this end, we propose a novel algorithm named Similarity-Navigated Adaptive Prediction Sets (SNAPS), which adaptively aggregates the non-conformity scores of other nodes into the ego node. Specifically, SNAPS assigns a higher cumulative weight to nodes with a higher probability of having the same label as the ego node, while retaining the scores of the ego node itself and of its one-hop neighbors. We use the feature similarity between nodes and the adjacency matrix to calculate the aggregation weights. In this way, the corrected scores yield compact prediction sets while maintaining the desired coverage.

To verify the effectiveness of our method, we conduct thorough empirical evaluations on 10 datasets, including both small datasets and large-scale datasets, e.g., OGBN Products (Bhatia et al., 2016). The results demonstrate that SNAPS not only achieves the pre-defined empirical marginal coverage but also outperforms the compared methods. For example, on OGBN Products, our method reduces the average size of prediction sets from 14.92 for APS to 7.68. Moreover, we adapt SNAPS to image classification problems. The results demonstrate that SNAPS reduces the average size of prediction sets from 19.639 to 4.079, only $\frac{1}{5}$ of the prediction set size of APS on ImageNet (Deng et al., 2009). Code is available at https://github.com/janqsong/SNAPS.

We summarize our contributions as follows:

  • We empirically show that the non-conformity scores of nodes with the same label as the ego node play a critical role in correcting the ego node's non-conformity scores.

  • We propose a novel algorithm, SNAPS, which aggregates the basic non-conformity scores of nodes selected through node feature similarity and the one-hop structural neighborhood. We provide a theoretical analysis of the marginal coverage properties and the validity of SNAPS.

  • Extensive experimental results demonstrate the effectiveness of our proposed method. We show that SNAPS not only maintains the pre-defined coverage but also achieves great performance in efficiency and singleton hit ratio.

2 Preliminary

In this paper, we focus on split conformal prediction for semi-supervised node classification with transductive learning in an undirected graph.

Notation. A graph is represented as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}:=\{v_i\}_{i=1}^{N}$ denotes the node set and $\mathcal{E}$ denotes the edge set with $|\mathcal{E}|=E$. Let $\boldsymbol{A}\in\{0,1\}^{N\times N}$ be the adjacency matrix, where $\boldsymbol{A}_{i,j}=1$ if there exists an edge between nodes $v_i$ and $v_j$, and $\boldsymbol{A}_{i,j}=0$ otherwise, and let $\boldsymbol{D}$ be its degree matrix, where $\boldsymbol{D}_{i,i}=\sum_{j}\boldsymbol{A}_{i,j}$. Let $\boldsymbol{X}:=[\boldsymbol{x}_1,\dots,\boldsymbol{x}_N]^{T}$ be the node feature matrix, where $\boldsymbol{x}_i\in\mathbb{R}^{d}$ is a $d$-dimensional feature vector for node $v_i$. The label of node $v_i$ is $y_i\in\mathcal{Y}$, where $\mathcal{Y}:=\{1,2,\dots,K\}$ denotes the label space.

Transductive setting. In the transductive setting, we have access to two node sets, $\mathcal{V}_{\text{label}}$ with labels and $\mathcal{V}_{\text{unlabel}}$ without labels, where $\mathcal{V}_{\text{label}}\cap\mathcal{V}_{\text{unlabel}}=\emptyset$ and $\mathcal{V}_{\text{label}}\cup\mathcal{V}_{\text{unlabel}}=\mathcal{V}$. $\mathcal{V}_{\text{label}}$ is then randomly split into $\mathcal{V}_{\text{train}}$, $\mathcal{V}_{\text{valid}}$ and $\mathcal{V}_{\text{calib}}$ of fixed sizes, which serve as the training, validation and calibration node sets, respectively. $\mathcal{V}_{\text{unlabel}}$ is used as the testing node set $\mathcal{V}_{\text{test}}$. The classifier $f(\cdot)$ is trained on $\{(\boldsymbol{x}_i,y_i)\}_{v_i\in\mathcal{V}_{\text{train}}}$, $\{\boldsymbol{x}_i\}_{v_i\in\mathcal{V}-\mathcal{V}_{\text{train}}}$ and the entire graph structure $\mathcal{G}=(\mathcal{V},\mathcal{E})$, and is selected using $\{(\boldsymbol{x}_i,y_i)\}_{v_i\in\mathcal{V}_{\text{valid}}}$. We then obtain the predicted probabilities $\boldsymbol{P}=\{\boldsymbol{p}_i\}_{v_i\in\mathcal{V}}$ for all nodes through $\boldsymbol{p}_i=\sigma(f(\boldsymbol{x}_i))$, where $\boldsymbol{p}_i\in[0,1]^{K}$ and $\sigma$ is an activation function such as softmax. We usually choose the label with the highest probability as the predicted label, i.e., $\hat{y}_i=\mathrm{argmax}_k\,\boldsymbol{p}_{ik}$.

Graph neural networks. GNNs aim at learning representation vectors for nodes in the graph by leveraging graph structure and node features. Most modern GNNs adopt a series of propagation layers following a message passing mechanism (Gilmer et al., 2017). The $l$-th layer of the GNNs takes the following form:

$$\boldsymbol{h}_i^{(l)}=\mathrm{COMBINE}^{(l)}\left(\boldsymbol{h}_i^{(l-1)},\mathrm{AGG}^{(l)}\left(\left\{\mathrm{MSG}^{(l)}(\boldsymbol{h}_j^{(l-1)},\boldsymbol{h}_i^{(l-1)})\,|\,v_j\in\mathcal{N}_i\right\}\right)\right) \tag{1}$$

where $\boldsymbol{h}_i^{(l)}$ is the hidden representation of node $v_i$ at the $l$-th layer with initialization of $\boldsymbol{h}_i^{(0)}=\boldsymbol{x}_i$, and $\mathcal{N}_i$ is a set of nodes adjacent to node $v_i$. $\mathrm{MSG}^{(l)}(\cdot)$, $\mathrm{AGG}^{(l)}(\cdot)$ and $\mathrm{COMBINE}^{(l)}(\cdot)$ denote the functions for message computation, message aggregation, and message combination, respectively. After an iteration of the last layer, the obtained final node representation $\boldsymbol{H}=\{\boldsymbol{h}_i^{L}\}_{v_i\in\mathcal{V}}$ is then fed to a classifier to obtain the predicted probability $\boldsymbol{P}$.
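To make the structure of Eq. (1) concrete, the following is a minimal Python sketch of a single message-passing layer. The specific choices are assumptions made only for illustration and are not taken from the paper: MSG simply passes the neighbor's state, AGG is an element-wise mean, and COMBINE averages the node's own state with the aggregated message, with no learnable parameters. The function name `message_passing_layer` and the toy graph are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def message_passing_layer(
    h: Dict[int, Vector], neighbors: Dict[int, List[int]]
) -> Dict[int, Vector]:
    """One layer in the shape of Eq. (1), with illustrative non-learned choices.

    Assumptions (not from the paper): MSG returns the neighbor's state,
    AGG is the element-wise mean, and COMBINE averages the node's own
    state with the aggregated message.
    """
    new_h: Dict[int, Vector] = {}
    for i, h_i in h.items():
        # MSG: each neighbor v_j sends its previous-layer state h_j.
        msgs = [h[j] for j in neighbors.get(i, [])]
        if msgs:
            # AGG: element-wise mean over the incoming messages.
            agg = [sum(col) / len(msgs) for col in zip(*msgs)]
        else:
            # Isolated node: use a zero message.
            agg = [0.0] * len(h_i)
        # COMBINE: equal-weight average of own state and aggregated message.
        new_h[i] = [0.5 * a + 0.5 * b for a, b in zip(h_i, agg)]
    return new_h


# Path graph 0 - 1 - 2, with 2-dimensional node features used as h^(0).
h0 = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
nbrs = {0: [1], 1: [0, 2], 2: [1]}
h1 = message_passing_layer(h0, nbrs)
```

In a real GNN, each of the three functions would typically contain learnable weights and a non-linearity; the sketch only shows how information flows from $\mathcal{N}_i$ into $\boldsymbol{h}_i^{(l)}$.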

Conformal prediction. CP is a promising framework for generating prediction sets that statistically contain ground-truth labels with a desired guarantee. Formally, given calibration data $\mathcal{D}_{\text{calib}}=\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{n}$, we can generate a prediction set $\mathcal{C}(\boldsymbol{x}_{n+1})\subseteq\mathcal{Y}$ for an unseen instance $\boldsymbol{x}_{n+1}$ with the coverage guarantee $\mathbb{P}[y_{n+1}\in\mathcal{C}(\boldsymbol{x}_{n+1})]\geq 1-\alpha$, where $\alpha$ is the pre-defined significance level. A key property of CP is that it is distribution-free and relies only on exchangeability. This means that every permutation of the instances is equally likely, i.e., $\mathcal{D}_{\text{calib}}\cup\{(\boldsymbol{x}_{n+1},y_{n+1})\}$ is exchangeable, where $(\boldsymbol{x}_{n+1},y_{n+1})$ is an unseen instance.

Conformal prediction is typically divided into two types: full conformal prediction and split conformal prediction. Unlike full conformal prediction, split conformal prediction treats the model as a black box, avoiding the need to retrain or modify the model and trading some statistical efficiency for computational efficiency (Vovk et al., 2005; Zargarbashi et al., 2023). In this paper, we focus on the computationally efficient split conformal prediction method; thus, "conformal prediction" in the following denotes split conformal prediction.

Theorem 1

(Vovk et al., 2005) Let calibration data and a test instance, i.e., $\{(\boldsymbol{x}_i,y_i)\}_{i=1}^{n}\cup\{(\boldsymbol{x}_{n+1},y_{n+1})\}$ be exchangeable. For any non-conformity score function $s:\mathcal{X}\times\mathcal{Y}\rightarrow\mathbb{R}$ and any significance level $\alpha\in(0,1)$, define the $1-\alpha$ quantile of scores as $\hat{q}:=\mathrm{Quantile}\left(\frac{\lceil(1-\alpha)(n+1)\rceil}{n};\{s(\boldsymbol{x}_i,y_i)\}_{i=1}^{n}\right)$ and prediction sets as $\mathcal{C}_{\alpha}(\boldsymbol{x}_{n+1})=\{y\,|\,s(\boldsymbol{x}_{n+1},y)\leq\hat{q}\}$. We have

$$1-\alpha\leq\mathbb{P}[y_{n+1}\in\mathcal{C}_{\alpha}(\boldsymbol{x}_{n+1})]<1-\alpha+\frac{1}{n+1}. \tag{2}$$
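As a minimal sketch of how Theorem 1 is used in split conformal prediction, the Python below computes $\hat{q}$ as the $\lceil(1-\alpha)(n+1)\rceil$-th smallest calibration score, which is the value selected by the empirical quantile at level $\lceil(1-\alpha)(n+1)\rceil/n$, and then keeps every label whose score does not exceed $\hat{q}$. The function names, the 0-indexed labels, the toy scores, and the convention of returning infinity when the required rank exceeds $n$ are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def conformal_quantile(calib_scores: Sequence[float], alpha: float) -> float:
    """Return q_hat, the ceil((1 - alpha)(n + 1))-th smallest calibration score."""
    n = len(calib_scores)
    rank = math.ceil((1 - alpha) * (n + 1))
    if rank > n:
        # Assumed convention: with too few calibration points the threshold
        # is infinite, so every label ends up in the prediction set.
        return math.inf
    return sorted(calib_scores)[rank - 1]


def prediction_set(label_scores: Sequence[float], q_hat: float) -> List[int]:
    """Keep every label y (0-indexed here) with s(x, y) <= q_hat."""
    return [y for y, s in enumerate(label_scores) if s <= q_hat]


# Toy calibration scores s(x_i, y_i) evaluated at the true labels (n = 7).
calib = [0.55, 0.05, 0.75, 0.20, 0.60, 0.35, 0.40]
# alpha = 0.25 gives rank ceil(0.75 * 8) = 6, i.e. the 6th smallest score.
q_hat = conformal_quantile(calib, alpha=0.25)

# Scores of a test instance for each of K = 4 candidate labels.
test_scores = [0.30, 0.85, 0.10, 0.60]
pred = prediction_set(test_scores, q_hat)
```

Here `q_hat` is 0.60, so labels 0, 2 and 3 are kept and label 1 is excluded. The $(n+1)$ correction in the rank is what yields the finite-sample lower bound in Eq. (2).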

Theorem 1 provides a marginal coverage guarantee, i.e., a guarantee that holds on average over all test instances. Many basic non-conformity score functions already exist (Romano et al., 2020; Angelopoulos et al., 2021; Huang et al., 2023a). Here we provide the definition of Adaptive Prediction Sets (APS) (Romano et al., 2020).

Adaptive Prediction Sets. In the APS method, the non-conformity scores are calculated by accumulating the softmax probabilities in descending order. Formally, given a data pair $(\boldsymbol{x},y)$ and a predicted probability estimator $\pi(\boldsymbol{x})_y$ for $(\boldsymbol{x},y)$, where $\pi(\boldsymbol{x})_y$ is the predicted probability for class $y$, the non-conformity scores can be computed by:

$$s(\boldsymbol{x},y)=\sum_{i=1}^{|\mathcal{Y}|}\pi(\boldsymbol{x})_i\,\mathbb{I}[\pi(\boldsymbol{x})_i>\pi(\boldsymbol{x})_y]+\xi\cdot\pi(\boldsymbol{x})_y, \tag{3}$$

where $\xi\in[0,1]$ is a uniformly distributed random variable. Then, the prediction set is constructed as $\mathcal{C}(\boldsymbol{x})=\{y\,|\,s(\boldsymbol{x},y)\leq\hat{q}\}$.
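The APS score in Eq. (3) can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' implementation: the function name `aps_score`, the 0-indexed labels, and the optional `xi` argument (fixed in the example so the arithmetic can be followed by hand) are assumptions. Note that `random.random()` samples from the half-open interval [0, 1), which differs from the closed interval in Eq. (3) only on a set of probability zero.

```python
import random
from typing import Optional, Sequence


def aps_score(probs: Sequence[float], y: int, xi: Optional[float] = None) -> float:
    """APS non-conformity score of label y (0-indexed), following Eq. (3)."""
    if xi is None:
        xi = random.random()  # randomization term, drawn uniformly
    p_y = probs[y]
    # Total probability of labels ranked strictly above y.
    mass_above = sum(p for p in probs if p > p_y)
    # Add a random fraction of the probability of y itself.
    return mass_above + xi * p_y


# Toy softmax output over K = 4 classes.
probs = [0.5, 0.25, 0.125, 0.125]
s_top = aps_score(probs, 0, xi=0.5)     # 0 + 0.5 * 0.5
s_second = aps_score(probs, 1, xi=0.5)  # 0.5 + 0.5 * 0.25
s_tail = aps_score(probs, 2, xi=1.0)    # 0.5 + 0.25 + 1.0 * 0.125
```

With these inputs the three scores are 0.25, 0.625 and 0.875: labels with less predicted probability receive larger scores and are therefore less likely to fall below the threshold $\hat{q}$.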

Evaluation Metrics. The goal is to improve the efficiency of conformal prediction sets as much as possible while maintaining the empirical marginal coverage guarantee. Given the testing node set $\mathcal{V}_{\text{test}}$, the efficiency is defined as the average size of prediction sets: $\mathrm{Size}:=\frac{1}{|\mathcal{V}_{\text{test}}|}\sum_{v_i\in\mathcal{V}_{\text{test}}}|\mathcal{C}(\boldsymbol{x}_i)|$. The smaller the size, the more efficient CP is. The empirical marginal coverage is defined as $\mathrm{Coverage}:=\frac{1}{|\mathcal{V}_{\text{test}}|}\sum_{v_i\in\mathcal{V}_{\text{test}}}\mathbb{I}[y_i\in\mathcal{C}(\boldsymbol{x}_i)]$. Although efficiency is a common metric for evaluating CP, the singleton hit ratio (SH), defined as the proportion of prediction sets of size one that contain the ground-truth label, is also important (Zargarbashi et al., 2023). SH is defined as $\mathrm{SH}:=\frac{1}{|\mathcal{V}_{\text{test}}|}\sum_{v_i\in\mathcal{V}_{\text{test}}}\mathbb{I}[\mathcal{C}(\boldsymbol{x}_i)=\{y_i\}]$.
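The three metrics can be computed directly from a list of prediction sets and the corresponding ground-truth labels. The sketch below is illustrative only; the function name `evaluate`, the representation of prediction sets as Python sets of 0-indexed labels, and the toy data are assumptions.

```python
from typing import Sequence, Set, Tuple


def evaluate(
    pred_sets: Sequence[Set[int]], labels: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (Size, Coverage, SH) over the test nodes."""
    n = len(labels)
    # Size: average number of labels per prediction set.
    size = sum(len(c) for c in pred_sets) / n
    # Coverage: fraction of sets that contain the true label.
    coverage = sum(y in c for c, y in zip(pred_sets, labels)) / n
    # SH: fraction of sets that are exactly the singleton {true label}.
    sh = sum(c == {y} for c, y in zip(pred_sets, labels)) / n
    return size, coverage, sh


# Four toy test nodes with their prediction sets and true labels.
pred_sets = [{0}, {1, 2}, {2}, {0, 1, 3}]
labels = [0, 2, 1, 2]
size, coverage, sh = evaluate(pred_sets, labels)
```

On this toy input the result is a Size of 1.75, a Coverage of 0.5 and an SH of 0.25. The third node shows why SH is not simply the fraction of singleton sets: its set has size one but misses the true label, so it does not count as a hit.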

3 Motivation and Methodology

In this section, we begin by outlining our motivation, substantiating its validity and feasibility through experimental evidence. Then, we propose our method, SNAPS. Finally, we demonstrate that SNAPS satisfies the exchangeability assumption required by CP and offer proof of its improved efficiency compared to basic non-conformity score methods.

3.1 Motivation

In this subsection, we empirically show that nodes with the same label as the ego node may play a critical role in the non-conformity scores of the ego node. Specifically, using the scores of nodes with the same label to correct the scores of the ego node could reduce the average size of prediction sets.

To analyze the role of nodes with the same label as the ego node, we assume access to an oracle graph, i.e., the ground-truth labels of all nodes are known. We then randomly select nodes with the same label as the ego node and aggregate their APS non-conformity scores into the ego node. We conduct experiments with a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) on the CoraML (McCallum et al., 2000) dataset and choose APS as the basic score function of CP. We run 10 trials and randomly select 100 calibration sets for each trial to evaluate the performance of CP at a significance level $\alpha=0.05$.

In Figure LABEL:fig:motivation-a, we can see that the average size of prediction sets drops sharply as the number of aggregated nodes increases, while valid coverage is maintained. When the number of selected nodes is 0, the result corresponds to plain APS. Therefore, if the non-conformity scores of the ego node are corrected using accurately selected nodes with the same label, $\mathrm{Size}$ can be reduced substantially. Moreover, aggregating the scores of these nodes still achieves the coverage guarantee.

Footnote 1: Values of feature similarity are multiplied by 1000.

3.2 Similarity-Navigated Adaptive Prediction Sets

In the previous analysis, we showed that correcting the scores of the ego node with the scores of nodes having the same label leads to smaller prediction sets with valid coverage. However, that experiment relies on the oracle graph. In real-world applications, the ground-truth label of each test node is unknown. To address this issue, our key idea is to use similarity to approximate the unknown labels. Specifically, nodes with high feature similarity have a high probability of sharing the same label.

Several studies (Jin et al., 2021a; 2021b; Zou et al., 2023) have demonstrated that a matrix constructed from node feature similarity can support the homophily assumption, i.e., that connected nodes in the graph are likely to share the same label. Additionally, network homophily can also help us find more nodes whose labels are likely to be the same as that of the ego node (Kipf and Welling, 2017), and several studies have demonstrated its effectiveness (Clarkson, 2023; Zargarbashi et al., 2023; Huang et al., 2023b). Therefore, we consider both feature similarity and network structure to select nodes that may have the same label as the ego node.

Feature similarity graph construction. We compute the cosine similarity between the node features in the graph. For a given node pair (vi,vj)subscript𝑣𝑖subscript𝑣𝑗(v_{i},v_{j})( italic_v start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_v start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ), the cosine similarity between their features can be calculated by:

Sim(i,j)=𝒙i𝒙j𝒙i2𝒙j2,Sim𝑖𝑗superscriptsubscript𝒙𝑖topsubscript𝒙𝑗subscriptnormsubscript𝒙𝑖2subscriptnormsubscript𝒙𝑗2\mathrm{Sim}(i,j)=\frac{\boldsymbol{x}_{i}^{\top}\boldsymbol{x}_{j}}{\|% \boldsymbol{x}_{i}\|_{2}\cdot\|\boldsymbol{x}_{j}\|_{2}},roman_Sim ( italic_i , italic_j ) = divide start_ARG bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ⊤ end_POSTSUPERSCRIPT bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT end_ARG start_ARG ∥ bold_italic_x start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ⋅ ∥ bold_italic_x start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG , (4)

where $i\neq j$, $v_{i}\in\mathcal{V}$ and $v_{j}\in\mathcal{V}_{t,i}$. Here, $\mathcal{V}_{t,i}$ denotes the set of nodes for which we calculate the similarity with $v_{i}$. Then, we choose the $k$ nearest neighbors of each node based on the above cosine similarity, forming the $k$-NN graph. We denote the adjacency matrix of the $k$-NN graph as $\boldsymbol{A}_{s}$ and its degree matrix as $\boldsymbol{D}_{s}$, where $\boldsymbol{A}_{s}(i,j)=\mathrm{Sim}(i,j)$ and $\boldsymbol{D}_{s}(i,i)=\sum_{j}\boldsymbol{A}_{s}(i,j)$.
For large graphs, we randomly select $M\gg k$ nodes to put into $\mathcal{V}_{t,i}$, whereas for small graphs, we include all nodes in $\mathcal{V}_{t,i}$.
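The construction of the similarity graph can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `knn_similarity_graph` and the `candidates` argument are hypothetical, a single candidate set is shared by all nodes (whereas $\mathcal{V}_{t,i}$ may differ per node), and a dense matrix is used for readability.

```python
import numpy as np

def knn_similarity_graph(X, k, candidates=None):
    """Return a dense (n, n) matrix A_s with A_s[i, j] = Sim(i, j) (Eq. 4)
    for the k most similar candidates j != i, and 0 elsewhere.

    candidates: optional index array standing in for V_{t,i}
    (assumption: the same candidate set is used for every node)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    cand = np.arange(n) if candidates is None else np.asarray(candidates)
    if not 0 < k < len(cand):
        raise ValueError("k must satisfy 0 < k < number of candidates")
    # Unit-normalise rows so that a dot product equals cosine similarity.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = Xn @ Xn[cand].T                        # shape (n, len(cand))
    # Enforce i != j: a node is never its own neighbor.
    sim[cand, np.arange(len(cand))] = -np.inf
    # Column positions (into cand) of the k largest similarities per row.
    top = np.argpartition(-sim, kth=k - 1, axis=1)[:, :k]
    A_s = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    A_s[rows, cand[top.ravel()]] = np.take_along_axis(sim, top, axis=1).ravel()
    return A_s
```

The degree matrix then follows as `np.diag(A_s.sum(axis=1))`. For large graphs, passing a random subset of size $M$ as `candidates` keeps the similarity computation at $O(nM)$ instead of $O(n^{2})$.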

To verify the effectiveness of feature similarity, we provide an empirical analysis. Figure LABEL:fig:motivation-b presents the average cosine similarity of node features between nodes with the same label and between nodes with different labels on the CoraML dataset. The average feature similarity between nodes with the same label is higher than that between nodes with different labels. We then examine experimentally whether selecting the $k$-NN by feature similarity meets our expectation of selecting nodes with the same label as the ego node. Figure LABEL:fig:motivation-c shows the number of nodes with the same label and with different labels at the $k$-th nearest neighbor. The result shows that we can indeed select many nodes with the same label when $k$ is not very large.

Figure 3: The overall framework of SNAPS. (1) Basic non-conformity score function. We first use basic non-conformity score functions, e.g., APS, to convert node embeddings into non-conformity scores. (2) SNAPS function. We then aggregate basic non-conformity scores of $k$-NN with feature similarity and one-hop structural neighbors to correct the non-conformity scores of nodes. (3) Conformal Prediction. Finally, we use conformal prediction to generate prediction sets, significantly reducing their size compared to the basic score functions.

SNAPS. We propose SNAPS, which aggregates the non-conformity scores of nodes with high feature similarity to the ego node and of its one-hop structural neighbors. Formally, for a node $v_{i}$ with a label $y$, the score function of SNAPS is defined as:

$$\hat{s}(\boldsymbol{x}_{i},y)=(1-\lambda-\mu)\,s(\boldsymbol{x}_{i},y)+\frac{\lambda}{\boldsymbol{D}_{s}(i,i)}\sum_{j=1}^{M}\boldsymbol{A}_{s}(i,j)\,s(\boldsymbol{x}_{j},y)+\frac{\mu}{|\mathcal{N}_{i}|}\sum_{v_{j}\in\mathcal{N}_{i}}s(\boldsymbol{x}_{j},y), \tag{5}$$

where $s(\cdot,\cdot)$ is the basic non-conformity score function and $\hat{s}(\cdot,\cdot)$ is the SNAPS score function. Both $\lambda$ and $\mu$ are hyperparameters that weight the three parts of the non-conformity score. The framework of SNAPS is shown in Figure 3 and the pseudo-code is in Appendix B.
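As a rough illustration of Eq. (5), the aggregation can be written as two row-normalised matrix products. The sketch below is not the pseudo-code of Appendix B: the function name `snaps_scores` is hypothetical, matrices are dense, and nodes with zero degree are assumed to receive no contribution from the corresponding term.

```python
import numpy as np

def snaps_scores(S, A_s, A, lam, mu):
    """Aggregate basic scores following Eq. (5).

    S   : (n, K) basic non-conformity scores, one row per node, one column per label
    A_s : (n, n) similarity-weighted k-NN adjacency
    A   : (n, n) binary adjacency of the original graph (one-hop neighbors)
    lam, mu : weights with lam + mu <= 1"""
    S = np.asarray(S, dtype=float)

    def row_normalise(M):
        M = np.asarray(M, dtype=float)
        deg = M.sum(axis=1, keepdims=True)
        # Assumption: rows with zero degree are left as all zeros.
        return np.divide(M, deg, out=np.zeros_like(M), where=deg != 0)

    return ((1 - lam - mu) * S
            + lam * row_normalise(A_s) @ S
            + mu * row_normalise(A) @ S)
```

Setting `lam = mu = 0` recovers the basic scores, which makes the function a drop-in post-processing step before conformal calibration.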

3.3 Theoretical Analysis

To deploy CP for graph-structured data, the only assumption that must hold is exchangeability, i.e., the joint distribution of the calibration and testing sets remains unchanged under any permutation. Several studies have demonstrated that non-conformity scores based on the node embeddings obtained by any GNN model are invariant to permutations of the nodes in the calibration and testing sets, provided that their edges are permuted correspondingly. This invariance arises because GNN models and non-conformity score functions only use the structures and attributes in the graph, without dependence on the order of the nodes (Zargarbashi et al., 2023; Huang et al., 2023b). Under this condition, we prove that SNAPS non-conformity scores are still exchangeable.

Proposition 1

Let $\boldsymbol{S}=\{\boldsymbol{s}_{i}\}_{v_{i}\in\mathcal{V}}$ be the basic non-conformity scores of nodes, where $\boldsymbol{s}_{i}\in\mathbb{R}^{K}$. Assume that $\boldsymbol{S}$ is exchangeable for all $v_{i}\in(\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}})$. Then the aggregated $\boldsymbol{\hat{S}}=(1-\lambda-\mu)\boldsymbol{S}+\lambda\boldsymbol{\hat{A}}_{s}\boldsymbol{S}+\mu\boldsymbol{\hat{A}}\boldsymbol{S}$, where $\boldsymbol{\hat{A}}_{s}=\boldsymbol{D}_{s}^{-1}\boldsymbol{A}_{s}$ and $\boldsymbol{\hat{A}}=\boldsymbol{D}^{-1}\boldsymbol{A}$, is also exchangeable for $v_{i}\in(\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}})$.

The corresponding proof is provided in Appendix A. We then demonstrate the validity of our method theoretically.
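The intuition behind Proposition 1 can be checked numerically: the aggregation is permutation-equivariant, so relabelling the nodes (and permuting both adjacency matrices accordingly) simply relabels the aggregated scores. The snippet below is only an illustrative sanity check on random data, not a substitute for the proof in Appendix A; all names and the random graph are hypothetical.

```python
import numpy as np

def aggregate(S, A_s, A, lam, mu):
    # Matrix form from Proposition 1 with row-normalised adjacencies.
    A_s_hat = A_s / A_s.sum(axis=1, keepdims=True)
    A_hat = A / A.sum(axis=1, keepdims=True)
    return (1 - lam - mu) * S + lam * A_s_hat @ S + mu * A_hat @ S

rng = np.random.default_rng(0)
n, K = 6, 3
S = rng.random((n, K))                      # random basic scores
A_s = rng.random((n, n))                    # random similarity weights
np.fill_diagonal(A_s, 0.0)
A = (rng.random((n, n)) < 0.5).astype(float)
A = np.maximum(A, A.T)                      # symmetric binary graph
np.fill_diagonal(A, 0.0)
for i in range(n):                          # add a ring so every degree is > 0
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

perm = rng.permutation(n)
S_hat = aggregate(S, A_s, A, 0.3, 0.2)
S_hat_perm = aggregate(S[perm], A_s[np.ix_(perm, perm)],
                       A[np.ix_(perm, perm)], 0.3, 0.2)
# Aggregating after relabelling equals relabelling after aggregating.
equivariant = np.allclose(S_hat_perm, S_hat[perm])
```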

Proposition 2

Assume that all of the nodes aggregated by SNAPS have the same label as the ego node. Consider a data pair $(\boldsymbol{x},y)$ and let $\pi(\boldsymbol{x})_{y}$ denote the predicted probability for class $y$. Moreover, let $\epsilon_{ki}$ reflect the model's error in misclassifying the ground-truth label $k$ as label $i$. Let $\boldsymbol{S}$ be the APS scores of nodes, where $\boldsymbol{S}_{ui}\in[0,1]$ is the score corresponding to node $u$ with label $i$. Let $E_{k}[\pi(\boldsymbol{x}_{u})_{i}]$ and $E_{k}[\boldsymbol{S}_{ui}]$ be the averages of the predicted probabilities and of the scores, respectively, corresponding to label $i$ over nodes whose ground-truth label is $k$. Let $\eta$ be the $1-\alpha$ quantile of the basic non-conformity scores at a significance level $\alpha$.
If $E_{k}[\boldsymbol{S}_{uk}]<\eta$ and $E_{k}[\boldsymbol{S}_{ui}]\geq(1-\epsilon_{ki})E_{k}[\pi(\boldsymbol{x}_{u})_{max}]+E_{k}[\xi\cdot\pi(\boldsymbol{x}_{u})_{i}]$, where $E_{k}[\pi(\boldsymbol{x}_{u})_{max}]$ denotes the average of the maximum predicted probability over nodes whose ground-truth label is $k$ and $\xi\in[0,1]$ is a uniformly distributed random variable, then

$$\mathbb{E}[|\mathcal{\tilde{C}}(\boldsymbol{x})|]\leq\mathbb{E}[|\mathcal{C}(\boldsymbol{x})|],$$

where $\mathcal{C}(\cdot)$ and $\mathcal{\tilde{C}}(\cdot)$ represent the prediction sets from the APS score function and the SNAPS score function, respectively.

In other words, under these assumptions SNAPS generates prediction sets that are, in expectation, no larger than those of the basic non-conformity score function, while maintaining the desired marginal coverage rate. The result hinges on selecting nodes with the same label as the ego node as accurately as possible; otherwise, the efficiency of SNAPS will decrease.

4 Experiments

In this section, we conduct extensive experiments on semi-supervised node classification to demonstrate the effectiveness of SNAPS on graph-structured data. We also adapt SNAPS to image classification problems. Furthermore, we perform ablation studies and parameter analysis to show the importance of the different components of SNAPS and to evaluate its robustness, respectively.

4.1 Experimental Settings

Datasets. In our experiments, we consider ten datasets with high homophily, where connected nodes in the graph are likely to share the same label. These datasets include the common citation graphs: CoraML (McCallum et al., 2000), PubMed (Namata et al., 2012), CiteSeer (Sen et al., 2008), CoraFull (Bojchevski and Günnemann, 2018), Coauthor Physics (Physics) and Coauthor CS (CS) (Shchur et al., 2018) and the co-purchase graphs: Amazon Photos (Photos) and Amazon Computers (Computers) (McAuley et al., 2015; Shchur et al., 2018). Moreover, we consider two large-scale graph datasets, i.e., OGBN Arxiv (Arxiv) (Wang et al., 2020) and OGBN Products (Products) (Bhatia et al., 2016). In particular, for CoraFull, which is highly class-imbalanced, we filter out the classes with fewer than 50 nodes. The transformed dataset is dubbed CoraFull (Zargarbashi et al., 2023). Detailed statistics of these datasets are shown in Appendix F. In addition to the datasets mentioned above, we discuss two heterophilous graph datasets in Appendix C.1, namely Chameleon and Squirrel, both of which are Wikipedia networks (Rozemberczki et al., 2021).

Baselines. Since SNAPS is a general post-processing method for GNNs, we choose GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018) and APPNP (Gasteiger et al., 2018) as structure-aware models and MLP as a structure-independent model. Moreover, SNAPS can be built on general conformal prediction non-conformity scores; here we choose APS (Romano et al., 2020) and RAPS (Angelopoulos et al., 2021). We compare not only with the basic scores, i.e., APS and RAPS, but also with DAPS (Zargarbashi et al., 2023) for GNNs.

CP Settings. For the basic models GCN, GAT, APPNP and MLP, we follow the parameters suggested by Zargarbashi et al. (2023). For DAPS, we follow the official implementation. Since GNNs are sensitive to splits, especially in the sparsely labeled setting (Shchur et al., 2018), we train the model over ten trials using varying train/validation splits. For each class in the training/validation set, we randomly select 20 nodes. For the Arxiv and Products datasets, we follow the official split in PyTorch Geometric (Fey and Lenssen, 2019). Then, the remaining nodes are included in the calibration set and the test set. The calibration set ratio follows Huang et al. (2023b), i.e., the calibration set size is set to $|\mathcal{V}_{\text{calib}}|=\min\{1000,|\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}}|/2\}$. For each trained model, we conduct 100 random splits of the calibration/test set. Thus, we conduct 1000 trials in total to evaluate the effectiveness of CP. For non-conformity score functions that require hyper-parameters, we split the calibration set into two sets, one for tuning parameters and the other for conformal calibration (Zargarbashi et al., 2023). For SNAPS, we choose $\lambda$ and $\mu$ in increments of 0.05 within the range 0 to 1, and ensure that $\lambda+\mu\leq 1$. Each experiment is done with a single NVIDIA V100 32GB GPU.
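The protocol above can be sketched as follows. This is a minimal illustration under stated assumptions, not the released experiment code: all function names are hypothetical, integer division is assumed for the calibration size, and the threshold is the standard split-conformal quantile, i.e., the $\lceil(n+1)(1-\alpha)\rceil$-th smallest calibration score.

```python
import numpy as np

def calib_size(n_remaining):
    # |V_calib| = min{1000, |V_calib ∪ V_test| / 2} (integer division assumed).
    return min(1000, n_remaining // 2)

def split_calib_test(remaining_idx, rng):
    # One random calibration/test split of the nodes left after train/validation.
    idx = rng.permutation(remaining_idx)
    m = calib_size(len(idx))
    return idx[:m], idx[m:]

def conformal_quantile(calib_scores, alpha):
    # Standard split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    # score of the true labels on the calibration set.
    n = len(calib_scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))
    if rank > n:
        return np.inf  # too few calibration points: include every label
    return np.sort(calib_scores)[rank - 1]

def lambda_mu_grid(step=0.05):
    # All (lambda, mu) pairs on the grid with lambda + mu <= 1.
    m = round(1 / step)
    return [(i * step, j * step) for i in range(m + 1) for j in range(m + 1 - i)]
```

With a step of 0.05 the grid contains 231 candidate pairs, each of which would be scored on the tuning half of the calibration set before the final threshold is computed on the other half.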

Table 1: Results of $\mathrm{Coverage}$, $\mathrm{Size}$ and $\mathrm{SH}$ on different datasets. For SNAPS we use the APS score as the basic score. We report the average calculated from 10 GCN runs with each run of 100 conformal splits at a significance level $\alpha=0.05$. Bold numbers indicate optimal performance.
Column groups (four methods each, in order): Coverage, Size ↓, SH% ↑
Datasets APS RAPS DAPS SNAPS APS RAPS DAPS SNAPS APS RAPS DAPS SNAPS
CoraML 0.950 0.950 0.950 0.950 2.42 2.21 1.92 1.68 44.89 22.19 52.16 56.30
PubMed 0.950 0.950 0.950 0.950 1.79 1.77 1.76 1.62 33.67 30.83 35.25 42.95
CiteSeer 0.950 0.950 0.950 0.950 2.34 2.36 1.94 1.84 50.41 38.99 59.75 59.08
CoraFull 0.950 0.950 0.950 0.950 17.54 10.72 11.81 9.80 10.23 2.13 8.67 5.76
CS 0.950 0.950 0.950 0.950 1.91 1.20 1.22 1.08 66.17 78.34 79.80 87.92
Physics 0.950 0.950 0.950 0.950 1.28 1.07 1.08 1.04 76.74 88.89 88.40 91.21
Computers 0.950 0.950 0.950 0.950 3.95 2.89 2.13 1.98 27.67 15.85 43.03 45.48
Photo 0.951 0.950 0.950 0.951 1.89 1.64 1.41 1.31 54.31 56.63 74.57 78.51
Arxiv 0.950 0.950 0.949 0.950 4.30 3.62 3.73 3.62 22.55 14.52 19.19 23.53
Products 0.950 0.951 0.950 0.950 14.92 13.67 10.91 7.68 15.51 11.51 19.29 22.38
Average 0.950 0.950 0.950 0.950 5.23 4.12 3.79 3.17 40.22 36.00 48.01 52.31

4.2 Experimental results

SNAPS generates smaller prediction sets and achieves a higher singleton hit ratio. Table 1 shows that the $\mathrm{Coverage}$ of all conformal prediction methods is close to the desired coverage $1-\alpha$. At a significance level $\alpha=0.05$, SNAPS achieves the best $\mathrm{Size}$ and $\mathrm{SH}$ on most datasets. For example, when evaluated on Products, SNAPS reduces $\mathrm{Size}$ from 14.92 of APS to 7.68. Overall, the experiments show that SNAPS attains the desired coverage rate and, on average, yields smaller $\mathrm{Size}$ and higher $\mathrm{SH}$ than APS, RAPS, and DAPS. Detailed results for other basic models and SNAPS based on RAPS are available in Appendix D.
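For concreteness, the three metrics reported in Table 1 can be computed from a boolean membership matrix of prediction sets. The sketch below assumes that $\mathrm{SH}$ is the fraction of test nodes whose prediction set has size one and contains the true label, as described in the abstract; the function name `evaluate_sets` is hypothetical.

```python
import numpy as np

def evaluate_sets(pred_sets, y_true):
    """pred_sets: (n, K) boolean matrix, pred_sets[u, i] is True if label i
    is in the prediction set of node u; y_true: (n,) integer labels."""
    pred_sets = np.asarray(pred_sets, dtype=bool)
    y_true = np.asarray(y_true)
    hit = pred_sets[np.arange(len(y_true)), y_true]   # true label covered?
    sizes = pred_sets.sum(axis=1)                     # prediction set sizes
    return {
        "coverage": float(hit.mean()),
        "size": float(sizes.mean()),
        "sh": float((hit & (sizes == 1)).mean()),     # correct singletons
    }
```

Given a score matrix and a calibrated threshold, the membership matrix is simply `scores <= threshold`.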

SNAPS generates smaller average prediction sets for each label. We conduct additional experiments to analyze the average performance of APS and SNAPS on nodes belonging to the same label at a significance level $\alpha=0.05$. Figure LABEL:fig:mean-size-aps shows that the distribution of the average non-conformity scores for nodes belonging to the same label aligns with the assumptions made in Proposition 2, i.e., $E_{k}[\boldsymbol{S}_{uk}]<\eta$ and $E_{k}[\boldsymbol{S}_{ui}]-\eta\geq-\Delta$, where $\Delta=\eta-(1-\epsilon_{ki})E_{k}[\pi(\boldsymbol{x}_{u})_{max}]-E_{k}[\xi\cdot\pi(\boldsymbol{x}_{u})_{i}]$. If $\Delta>0$, then it is very small. The $\mathrm{Size}$ of prediction sets corresponding to APS is 3.29. Figure LABEL:fig:mean-size-snaps shows that only a few labels other than the true label have average scores lower than the quantile of scores.
The $\mathrm{Size}$ of prediction sets corresponding to SNAPS is 1.29. Overall, for basic non-conformity scores whose distribution matches our assumptions, SNAPS achieves superior performance on top of these scores. The results for the CiteSeer and Amazon Computers datasets are available in Appendix D.

Table 2: Ablation study in terms of $\mathrm{Size}$. Overall, three parts of our method are critical, and removing any of them results in a general decrease in performance.
Orig. Neigh. Feat. CoraML PubMed CiteSeer CoraFull CS Physics Computers Photo Arxiv Products
✓ × × 2.42 1.79 2.34 17.54 1.91 1.28 3.95 1.89 4.30 14.92
× ✓ × 2.18 1.94 2.07 17.50 1.37 1.09 2.15 1.42 4.75 11.25
× × ✓ 2.40 1.65 2.52 18.07 1.11 1.03 3.26 2.60 9.45 13.89
✓ ✓ × 1.87 1.72 1.91 12.10 1.22 1.07 2.22 1.37 3.76 10.81
✓ × ✓ 1.78 1.63 1.94 11.54 1.13 1.05 2.37 1.46 3.82 8.46
× ✓ ✓ 1.72 1.63 1.86 10.51 1.09 1.04 1.94 1.31 4.44 7.65
✓ ✓ ✓ 1.68 1.62 1.84 9.80 1.08 1.04 1.98 1.31 3.62 7.68

Ablation study. To understand the effects of the three parts of our method, i.e., original scores (Orig.), neighborhood scores (Neigh.), and feature similarity node scores (Feat.), we conduct a thorough ablation experiment using GCN at $\alpha=0.05$. In Table 2, SNAPS performs best on most datasets when all three parts are included. Moreover, on the remaining datasets, where the full SNAPS is comparable but not the best, every better-performing configuration contains the Feat. part. Overall, each part plays a critical role in CP for GNNs, and removing any of them will in general decrease performance.

Table 3: Results of $\mathrm{Coverage}$, $\mathrm{Size}$ and $\mathrm{SH}$ on different datasets. For SNAPS we use the APS score as the basic score and set $\lambda=\mu=1/3$. We report the average calculated from 10 GCN runs with each run of 100 conformal splits at a significance level $\alpha=0.05$. Bold numbers indicate optimal performance.
Column groups (four methods each, in order): Coverage, Size ↓, SH% ↑
Datasets APS RAPS DAPS SNAPS APS RAPS DAPS SNAPS APS RAPS DAPS SNAPS
CoraML 0.950 0.958 0.957 0.951 2.50 2.62 2.32 1.74 43.09 27.34 44.52 54.11
PubMed 0.950 0.968 0.967 0.950 1.82 2.10 2.09 1.61 33.39 14.66 23.27 44.11
CiteSeer 0.951 0.950 0.952 0.950 2.41 2.69 2.16 1.90 48.53 35.37 55.40 58.22
CS 0.950 0.953 0.954 0.950 2.04 1.31 1.33 1.13 64.32 66.91 74.91 85.21
Physics 0.951 0.962 0.962 0.950 1.39 1.44 1.28 1.07 72.44 62.22 77.65 88.58
Computers 0.950 0.950 0.951 0.950 3.01 3.04 2.30 2.01 29.21 9.87 42.19 45.98
Photo 0.949 0.950 0.950 0.950 1.90 1.81 1.56 1.30 54.86 47.27 67.57 79.50

Parameter analysis. We conduct additional experiments to analyze the robustness of SNAPS. We choose GCN as the GNN model and APS as the basic non-conformity score function.

Figure LABEL:fig:param-k-size and Figure LABEL:fig:param-k-sh demonstrate that the performance of SNAPS significantly improves as $k$ gradually increases from $0$. This improvement occurs because an increasing number of nodes with the same label are selected to enhance the ego node. Subsequently, as $k$ continues to increase, the performance of SNAPS tends to stabilize. On the other hand, when $k$ is extremely large, nodes with the same label can no longer be selected with high accuracy by feature similarity alone, so performance declines slightly. Figure LABEL:fig:param-ab-size and Figure LABEL:fig:param-ab-sh show that as the values of $\lambda$ and $\mu$ change, most areas in the heatmaps of $\mathrm{Size}$ and $\mathrm{SH}$ display similar colors. Overall, SNAPS is robust to the parameter $k$ and is not sensitive to the parameters $\lambda$ and $\mu$. To further explore the sensitivity of SNAPS to $\lambda$ and $\mu$, we set $\lambda=\mu=1/3$, which means that the three components of SNAPS are equally weighted. The experimental results in Table 3 demonstrate that SNAPS performs well with these default hyperparameters on most datasets.

Adaptation to image classification problems. In node classification problems, SNAPS achieves better performance than standard APS, which was originally proposed for image classification. Therefore, we also apply SNAPS to image classification problems. Since there are no links between different images, we utilize the cosine similarities of image features to correct the APS. Formally, the corrected APS, i.e., SNAPS, is defined as:

$$\hat{s}(\boldsymbol{x},y)=(1-\eta)\,s(\boldsymbol{x},y)+\frac{\eta}{|\mathcal{N}_{\boldsymbol{x}}|}\sum_{\tilde{\boldsymbol{x}}\in\mathcal{N}_{\boldsymbol{x}}}s(\tilde{\boldsymbol{x}},y),$$

where $s(\boldsymbol{x},y)$ is the score of standard APS, $\mathcal{N}_{\boldsymbol{x}}$ is the set of $k$ nearest neighbors of $\boldsymbol{x}$ in the calibration set based on image features, and $\eta$ is a correction weight. We conduct experiments on ImageNet, whose test dataset is equally divided into the calibration set and the test set. For SNAPS, we set $k=5$ and $\eta=0.5$. We report the results of $\mathrm{Coverage}$, $\mathrm{Size}$ and size-stratified coverage violation (SSCV) (Angelopoulos et al., 2021). The details of experiments and SSCV are provided in Appendix E.
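A minimal sketch of this image variant is given below, assuming that features and APS scores have already been extracted by the classifier. The function name `snaps_image_scores` is hypothetical, and the sketch assumes the query images are not themselves members of the calibration set (otherwise each image would need to be excluded from its own neighborhood); the exact procedure is described in Appendix E.

```python
import numpy as np

def snaps_image_scores(S_query, F_query, S_calib, F_calib, k=5, eta=0.5):
    """Correct the APS scores of query images with their k nearest calibration images.

    S_query : (n_q, K) APS scores of the query images for every label
    F_query : (n_q, d) image features of the query images
    S_calib : (n_c, K) APS scores of the calibration images
    F_calib : (n_c, d) image features of the calibration images"""
    def unit(F):
        F = np.asarray(F, dtype=float)
        return F / np.maximum(np.linalg.norm(F, axis=1, keepdims=True), 1e-12)

    sim = unit(F_query) @ unit(F_calib).T             # cosine similarities, (n_q, n_c)
    # Indices of the k most similar calibration images for each query image.
    nn = np.argpartition(-sim, kth=k - 1, axis=1)[:, :k]
    neighbour_mean = np.asarray(S_calib, dtype=float)[nn].mean(axis=1)   # (n_q, K)
    return (1 - eta) * np.asarray(S_query, dtype=float) + eta * neighbour_mean
```

Unlike Eq. (5), the neighbors here are weighted uniformly, which matches the $1/|\mathcal{N}_{\boldsymbol{x}}|$ factor in the definition above.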

As indicated in Table 4, SNAPS achieves smaller prediction sets than APS. For example, on the ResNeXt101 model with $\alpha=0.1$, SNAPS reduces $\mathrm{Size}$ from 19.639 to 4.079, about $\frac{1}{5}$ of the prediction set size of APS, and achieves a smaller SSCV than APS. Overall, SNAPS can improve the efficiency of prediction sets while maintaining the performance of conditional coverage.

Table 4: Results on ImageNet. The median-of-means is reported over 10 different trials. Bold numbers indicate optimal performance.
Column groups: Accuracy (Top1, Top5) | APS/SNAPS at $\alpha=0.1$ (Coverage, Size ↓, SSCV ↓) | APS/SNAPS at $\alpha=0.05$ (Coverage, Size ↓, SSCV ↓)
Model Top1 Top5 Coverage Size ↓ SSCV ↓ Coverage Size ↓ SSCV ↓
ResNeXt101 79.32 94.58 0.899/0.900 19.64/4.08 0.088/0.059 0.950/0.950 45.80/14.41 0.047/0.033
ResNet101 77.36 93.53 0.900/0.900 10.82/3.62 0.075/0.078 0.950/0.950 22.90/9.83 0.039/0.029
DenseNet161 77.19 93.56 0.900/0.900 12.04/3.80 0.077/0.067 0.951/0.950 27.99/10.66 0.039/0.026
ViT 81.02 95.33 0.899/0.899 10.50/2.33 0.087/0.133 0.949/0.950 31.12/10.47 0.042/0.040
CLIP 60.53 86.15 0.899/0.899 17.46/10.32 0.047/0.032 0.950/0.949 34.93/24.53 0.027/0.017
Average - - 0.899/0.900 14.09/4.83 0.075/0.074 0.950/0.950 32.55/13.98 0.039/0.029

5 Related Work

Uncertainty Quantification for GNNs. Many uncertainty quantification (UQ) methods have been proposed to quantify the model uncertainty for classification tasks in machine learning (Gal and Ghahramani, 2016; Guo et al., 2017; Zhang et al., 2020; Gupta et al., 2021). Recently, several calibration methods for GNNs have been developed, such as CaGCN (Wang et al., 2021), GATS (Hsu et al., 2022) and SimCalib (Tang et al., 2024). However, these UQ methods lack statistically rigorous and empirically valid coverage guarantees (Huang et al., 2023b). In contrast, SNAPS provides valid coverage guarantees both theoretically and empirically.

Conformal Prediction for GNNs. Many conformal prediction (CP) methods have been developed to provide valid uncertainty estimates for model predictions in machine learning classification tasks (Romano et al., 2020; Angelopoulos et al., 2021; Liu et al., 2024; Wei and Huang, 2024). Although several CP methods for GNNs have been studied, the use of CP in graph-structured data is still largely underexplored. ICP (Wijegunawardana et al., 2020) is the first to apply the CP framework to graphs; it designs a margin conformity score for node labels without considering the relations between nodes. NAPS (Clarkson, 2023) uses the non-exchangeable technique from Barber et al. (2023) for inductive node classification and is not applicable to the transductive setting, whereas we focus on the transductive setting, where the exchangeability property holds. Our method is essentially an enhanced version of DAPS (Zargarbashi et al., 2023), a diffusion-based method that incorporates neighborhood information by leveraging network homophily. Similar to DAPS, CF-GNN (Huang et al., 2023b) introduces a topology-aware output correction model, akin to GCN, which employs a conformal-aware inefficiency loss to refine predictions and improve the efficiency of post-hoc CP. Other recent efforts in CP for graphs include Lunde (2023), Marandon (2023), Zargarbashi and Bojchevski (2023) and Sanchez-Martin et al. (2024), which focus on distinct problem settings. In this work, SNAPS takes into account both network topology and feature similarity. This method can be applied not only to graph-structured data but also to other types of data, such as image data.

6 Conclusion

In this paper, we propose SNAPS, a general algorithm that aggregates the non-conformity scores of nodes with the same label as the ego node. Specifically, we select these nodes based on feature similarity and structural neighborhood, and then aggregate their non-conformity scores to the ego node. As a result, our method can correct the scores of some nodes. Moreover, we present theoretical analyses to certify the effectiveness of this method. Extensive experiments demonstrate that SNAPS not only maintains the pre-defined coverage, but also significantly improves efficiency and the singleton hit ratio. Furthermore, we extend SNAPS to image classification, where SNAPS shows superior performance compared to APS.

Limitations.

Our work focuses on node classification using transductive learning. However, in real-world scenarios, many classification tasks require inductive learning. In the future, we aim to apply our method to the inductive setting. Additionally, the method we use to select nodes with the same label as the ego node is both computationally inefficient and lacking in accuracy. Future work will explore more efficient and accurate methods for node selection. Moreover, while our focus is primarily on datasets with high homophily, heterophilous networks are also prevalent in practice. Consequently, further investigation is essential to enhance the adaptability of SNAPS to these networks.

Acknowledgments

This paper is supported by the National Natural Science Foundation of China (Grant No. 62192783, 62376117), the National Social Science Fund of China (Grant No. 23BJL035), the Science and Technology Major Project of Nanjing (comprehensive category) (Grant No. 202309007), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.

References

  • Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
  • Angelopoulos et al. [2021] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael I. Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In 9th International Conference on Learning Representations, 2021.
  • Barber et al. [2023] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
  • Bhatia et al. [2016] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016. URL http://manikvarma.org/downloads/XC/XMLRepository.html.
  • Bojchevski and Günnemann [2018] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. In 6th International Conference on Learning Representations, 2018.
  • Clarkson [2023] Jase Clarkson. Distribution free prediction sets for node classification. In International Conference on Machine Learning, pages 6268–6278, 2023.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • Dong et al. [2011] Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586, 2011.
  • Fey and Lenssen [2019] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
  • Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
  • Gao et al. [2019] Jinyang Gao, Junjie Yao, and Yingxia Shao. Towards reliable learning for high stakes applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3614–3621, 2019.
  • Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
  • Gilmer et al. [2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1263–1272, 2017.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330, 2017.
  • Gupta et al. [2021] Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. In 9th International Conference on Learning Representations, 2021.
  • Hamilton et al. [2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
  • Hsu et al. [2022] Hans Hao-Hsun Hsu, Yuesong Shen, Christian Tomani, and Daniel Cremers. What makes graph neural networks miscalibrated? In Advances in Neural Information Processing Systems, 2022.
  • Huang et al. [2023a] Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal prediction for deep classifier via label ranking. arXiv preprint arXiv:2310.06430, 2023a.
  • Huang et al. [2023b] Kexin Huang, Ying Jin, Emmanuel J. Candès, and Jure Leskovec. Uncertainty quantification over graph with conformalized graph neural networks. In Advances in Neural Information Processing Systems, 2023b.
  • Jiang and Luo [2022] Weiwei Jiang and Jiayun Luo. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, 207:117921, 2022.
  • Jin et al. [2021a] Di Jin, Zhizhi Yu, Cuiying Huo, Rui Wang, Xiao Wang, Dongxiao He, and Jiawei Han. Universal graph convolutional networks. Advances in Neural Information Processing Systems, 34:10654–10664, 2021a.
  • Jin et al. [2021b] Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. Node similarity preserving graph convolutional networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 148–156, 2021b.
  • Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, 2017.
  • Li et al. [2022] Michelle M Li, Kexin Huang, and Marinka Zitnik. Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering, 6(12):1353–1369, 2022.
  • Liu et al. [2024] Kangdao Liu, Tianhao Sun, Hao Zeng, Yongshan Zhang, Chi-Man Pun, and Chi-Man Vong. Spatial-aware conformal prediction for trustworthy hyperspectral image classification. arXiv preprint arXiv:2409.01236, 2024.
  • Liu et al. [2023] Yajing Liu, Zhengya Sun, and Wensheng Zhang. Improving fraud detection via hierarchical attention-based graph neural network. Journal of Information Security and Applications, 72:103399, 2023.
  • Lunde [2023] Robert Lunde. On the validity of conformal prediction for network data under non-uniform sampling. arXiv preprint arXiv:2306.07252, 2023.
  • Marandon [2023] Ariane Marandon. Conformal link prediction to control the error rate. arXiv preprint arXiv:2306.14693, 2023.
  • Maurya et al. [2022] Sunil Kumar Maurya, Xin Liu, and Tsuyoshi Murata. Simplifying approach to node classification in graph neural networks. J. Comput. Sci., 62:101695, 2022.
  • McAuley et al. [2015] Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52, 2015.
  • McCallum et al. [2000] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Inf. Retr., 3(2):127–163, 2000.
  • Namata et al. [2012] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, page 1, 2012.
  • Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
  • Pei et al. [2020] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In 8th International Conference on Learning Representations, 2020.
  • Romano et al. [2020] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, 2020.
  • Rozemberczki et al. [2021] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. J. Complex Networks, 9(2), 2021.
  • Sanchez-Martin et al. [2024] Pablo Sanchez-Martin, Kinaan Aamir Khan, and Isabel Valera. Improving the interpretability of gnn predictions through conformal-based graph sparsification. arXiv preprint arXiv:2404.12356, 2024.
  • Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Gallagher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
  • Shchur et al. [2018] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
  • Tang et al. [2024] Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, and Helen Meng. Simcalib: Graph neural network calibration based on similarity between nodes. In Thirty-Eighth AAAI Conference on Artificial Intelligence, pages 15267–15275, 2024.
  • Velickovic et al. [2018] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, 2018.
  • Vovk et al. [2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
  • Wang et al. [2020] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
  • Wang et al. [2021] Xiao Wang, Hongrui Liu, Chuan Shi, and Cheng Yang. Be confident! towards trustworthy graph neural networks via confidence calibration. In Advances in Neural Information Processing Systems, pages 23768–23779, 2021.
  • Wei and Huang [2024] Hongxin Wei and Jianguo Huang. Torchcp: A library for conformal prediction based on pytorch. arXiv preprint arXiv:2402.12683, 2024.
  • Wijegunawardana et al. [2020] Pivithuru Wijegunawardana, Ralucca Gera, and Sucheta Soundarajan. Node classification with bounded error rates. In Complex Networks XI: Proceedings of the 11th Conference on Complex Networks CompleNet 2020, pages 26–38. Springer, 2020.
  • Xi et al. [2024] Huajun Xi, Jianguo Huang, Lei Feng, and Hongxin Wei. Delving into temperature scaling for adaptive conformal prediction. arXiv preprint arXiv:2402.04344, 2024.
  • Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, 2019.
  • Zargarbashi and Bojchevski [2023] Soroush H Zargarbashi and Aleksandar Bojchevski. Conformal inductive graph neural networks. In The Twelfth International Conference on Learning Representations, 2023.
  • Zargarbashi et al. [2023] Soroush H. Zargarbashi, Simone Antonelli, and Aleksandar Bojchevski. Conformal prediction sets for graph neural networks. In International Conference on Machine Learning, volume 202, pages 12292–12318, 2023.
  • Zhang et al. [2020] Jize Zhang, Bhavya Kailkhura, and Thomas Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11117–11128, 2020.
  • Zou et al. [2023] Minhao Zou, Zhongxue Gan, Ruizhi Cao, Chun Guan, and Siyang Leng. Similarity-navigated graph neural networks for node classification. Information Sciences, 633:41–69, 2023.

Appendix A Proofs

In this section, we provide the proofs that were omitted from the main paper.

A.1 Proof of Proposition 1

Proof. [Zargarbashi et al., 2023] have proved that $(1-\lambda-\mu)\boldsymbol{S}+\mu\boldsymbol{\hat{A}}\boldsymbol{S}$ is exchangeable for $v_i\in(\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}})$. So we only need to prove that $\boldsymbol{\hat{A}}_s\boldsymbol{S}$ is also exchangeable for $v_i\in(\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}})$. $\boldsymbol{\hat{A}}_s$ is obtained by calculating the feature similarity between pairs of nodes from a global perspective. When building this matrix, we cannot distinguish between labeled and unlabeled nodes, so we construct a new graph structure from node features alone, without considering the order of the nodes. Hence, aggregating non-conformity scores with $\boldsymbol{\hat{A}}_s$ does not break permutation equivariance.
Therefore, $\boldsymbol{\hat{S}}=(1-\lambda-\mu)\boldsymbol{S}+\lambda\boldsymbol{\hat{A}}_s\boldsymbol{S}+\mu\boldsymbol{\hat{A}}\boldsymbol{S}$ is a special case of a message-passing GNN layer. It follows that $\boldsymbol{\hat{S}}$ is invariant to permutations of the order of the calibration and testing nodes on the graph. We conclude that $\boldsymbol{\hat{S}}$ is also exchangeable for $v_i\in(\mathcal{V}_{\text{calib}}\cup\mathcal{V}_{\text{test}})$.
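
The permutation-equivariance step of this argument can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes dense, row-normalized random matrices as stand-ins for $\boldsymbol{\hat{A}}_s$ and $\boldsymbol{\hat{A}}$, and the helper names (`snaps_scores`, `permute_rows`, `permute_square`) are hypothetical. It relabels the nodes with a random permutation and confirms that the aggregated scores are relabeled in the same way.

```python
import random


def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def snaps_scores(s, a_feat, a_adj, lam, mu):
    """S_hat = (1 - lam - mu) * S + lam * A_s S + mu * A S."""
    feat_part = matmul(a_feat, s)
    adj_part = matmul(a_adj, s)
    return [[(1 - lam - mu) * s[i][j] + lam * feat_part[i][j] + mu * adj_part[i][j]
             for j in range(len(s[0]))] for i in range(len(s))]


def permute_rows(m, perm):
    """Reorder nodes: new node i is old node perm[i]."""
    return [m[p] for p in perm]


def permute_square(m, perm):
    """Apply the same node reordering to both axes of a node-by-node matrix."""
    return [[m[p][q] for q in perm] for p in perm]


def random_row_stochastic(n, rng):
    """Toy stand-in for a normalized aggregation matrix (assumption)."""
    rows = [[rng.random() for _ in range(n)] for _ in range(n)]
    return [[x / sum(row) for x in row] for row in rows]


rng = random.Random(0)
n, num_classes = 6, 3
scores = [[rng.random() for _ in range(num_classes)] for _ in range(n)]
a_feat = random_row_stochastic(n, rng)   # plays the role of A_s
a_adj = random_row_stochastic(n, rng)    # plays the role of A

perm = list(range(n))
rng.shuffle(perm)

before = snaps_scores(scores, a_feat, a_adj, lam=0.3, mu=0.2)
after = snaps_scores(permute_rows(scores, perm),
                     permute_square(a_feat, perm),
                     permute_square(a_adj, perm), lam=0.3, mu=0.2)

# Equivariance: scoring the relabeled graph equals relabeling the scores.
max_gap = max(abs(after[i][j] - before[perm[i]][j])
              for i in range(n) for j in range(num_classes))
print(max_gap < 1e-12)
```

Equivariance is only the deterministic half of the argument. Exchangeability additionally relies on the calibration and test nodes being sampled exchangeably, which this sketch does not model.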

A.2 Proof of Proposition 2

Lemma 1

As stated in Proposition 2, we have

$$E_k[\boldsymbol{S}_{ui}]\geq(1-\epsilon_{ki})E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i],$$

where $E_k[\pi(\boldsymbol{x}_u)_{max}]$ denotes the maximum predicted probability of nodes whose ground-truth labels are $k$, $\epsilon_{ki}$ reflects the model’s error in misclassifying the ground-truth label $k$ as label $i$, and $\xi\in[0,1]$ is a uniformly distributed random variable.

Proof of Lemma 1.

Here, we use APS non-conformity scores as the basic non-conformity scores. Then we have

$$\boldsymbol{S}_{ui}=\sum_{j=1}^{|\mathcal{Y}|}\pi(\boldsymbol{x}_u)_j\,\mathbb{I}[\pi(\boldsymbol{x}_u)_j>\pi(\boldsymbol{x}_u)_i]+\xi\cdot\pi(\boldsymbol{x}_u)_i.$$
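
As an illustration of this formula, the following minimal sketch computes the APS score of one label from a vector of predicted probabilities. The function name `aps_score` and the toy probability vector are assumptions made for the example; in practice $\xi$ would be drawn from a uniform distribution on $[0,1]$ rather than fixed.

```python
def aps_score(probs, label, xi):
    """APS non-conformity score of `label` given class probabilities `probs`.

    Sums the probabilities strictly larger than probs[label] and adds a
    xi-fraction of probs[label], where xi is meant to be Uniform[0, 1].
    """
    p_label = probs[label]
    mass_above = sum(p for p in probs if p > p_label)
    return mass_above + xi * p_label


probs = [0.5, 0.25, 0.125, 0.125]            # toy softmax output (assumption)
top = aps_score(probs, label=0, xi=0.5)      # no class ranks above label 0
second = aps_score(probs, label=1, xi=0.5)   # only label 0 ranks above label 1
print(top, second)
```

For the top-ranked label the indicator sum is empty, so the score reduces to $\xi\cdot\pi(\boldsymbol{x}_u)_i$, which is the situation treated separately in the case analysis of the proof.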

Suppose $T$ is the number of nodes whose ground-truth label is $k$. Below we discuss two cases of $\pi(\boldsymbol{x}_u)_i$:

Case a. If $\pi(\boldsymbol{x}_u)_i$ is the largest predicted probability for node $u$, then $E_k[\boldsymbol{S}_{ui}]=E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i]=E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i]-E_k[\pi(\boldsymbol{x}_u)_{max}]$. Suppose the number of nodes satisfying this case is $A$.

Case b. Otherwise, $E_k[\boldsymbol{S}_{ui}]\geq E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i]$. Suppose the number of nodes satisfying this case is $B$, where $A+B=T$.
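
The two cases can be checked node by node on synthetic data. The sketch below is illustrative only: the random probability vectors stand in for the softmax outputs of nodes whose ground-truth label is $k$, `label_i` is an arbitrary choice of the label $i$, and all names are hypothetical. It verifies the per-node identity of Case a and the per-node bound of Case b, and counts $A$ and $B$.

```python
import random


def aps_score(probs, label, xi):
    """APS score: mass strictly above probs[label] plus xi * probs[label]."""
    p_label = probs[label]
    return sum(p for p in probs if p > p_label) + xi * p_label


def random_probs(num_classes, rng):
    """Toy probability vector (assumption: stands in for a softmax output)."""
    raw = [rng.random() for _ in range(num_classes)]
    total = sum(raw)
    return [x / total for x in raw]


rng = random.Random(0)
label_i = 1
# Each entry is (predicted probabilities, xi) for one node with true label k.
nodes = [(random_probs(4, rng), rng.random()) for _ in range(200)]

count_a = 0  # Case a: label_i is the top prediction
count_b = 0  # Case b: some other label is ranked above label_i
cases_hold = True
for probs, xi in nodes:
    score = aps_score(probs, label_i, xi)
    p_max = max(probs)
    if probs[label_i] == p_max:
        count_a += 1
        cases_hold &= (score == xi * probs[label_i])
    else:
        count_b += 1
        cases_hold &= (score >= p_max + xi * probs[label_i])

epsilon_ki = count_a / len(nodes)  # fraction A / T used in the lemma
print(cases_hold, count_a, count_b, epsilon_ki)
```

Here `epsilon_ki` is the fraction $A/T$ of label-$k$ nodes whose top prediction is $i$, which is the quantity the lemma uses to weight $E_k[\pi(\boldsymbol{x}_u)_{max}]$.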

Therefore, summing up $E_k[\boldsymbol{S}_{ui}]$ for both cases, we have

$$A\cdot E_k[\boldsymbol{S}_{ui}]+B\cdot E_k[\boldsymbol{S}_{ui}]\geq(A+B)\cdot(E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i])-A\cdot E_k[\pi(\boldsymbol{x}_u)_{max}].$$

This simplifies to: $E_k[\boldsymbol{S}_{ui}]\geq E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i]-\frac{A}{T}\cdot E_k[\pi(\boldsymbol{x}_u)_{max}]$.

Let $\epsilon_{ki}=\frac{A}{T}$, which reflects the model’s error in misclassifying the ground-truth label $k$ as label $i$. Therefore, we conclude that: $E_k[\boldsymbol{S}_{ui}]\geq(1-\epsilon_{ki})E_k[\pi(\boldsymbol{x}_u)_{max}]+E_k[\xi\cdot\pi(\boldsymbol{x}_u)_i]$.

Proof of Proposition 2.

For ease of exposition, we refer to the "$1-\alpha$ quantile of basic non-conformity scores in the calibration set" as "the quantile score" and denote it by $\eta$. Let $\boldsymbol{S}$ and $\boldsymbol{\hat{S}}$ denote APS and SNAPS non-conformity scores, respectively. For node $v$ whose label is $k$, $\boldsymbol{\hat{S}}_v$ can be expressed as

$$\boldsymbol{\hat{S}}_v=(1-\lambda)\boldsymbol{S}_v+\frac{\lambda}{|\mathcal{V}_k|}\sum_{u\in\mathcal{V}_k}\boldsymbol{S}_u, \tag{6}$$

where $\mathcal{V}_k$ denotes the set of nodes whose ground-truth label is $k$. This form is justified because, whether we use nodes with high feature similarity or one-hop structural neighbors, the purpose of aggregating their scores is to aggregate, as far as possible, the non-conformity scores of nodes with the same label as the ego node.
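
Equation (6) can be sketched as follows. This is an idealized illustration rather than the SNAPS procedure itself: it assumes the ground-truth labels are known for every node, which is what the analysis presumes but not what is available for test nodes in practice. The function name `aggregate_same_label` and the toy scores are hypothetical.

```python
def aggregate_same_label(scores, labels, lam):
    """Apply S_hat_v = (1 - lam) * S_v + lam * mean_{u in V_k} S_u per node.

    scores: list of per-node score vectors (one entry per class).
    labels: ground-truth label k of each node, which defines V_k.
    """
    num_classes = len(scores[0])
    sums, counts = {}, {}
    for row, k in zip(scores, labels):
        acc = sums.setdefault(k, [0.0] * num_classes)
        for j, value in enumerate(row):
            acc[j] += value
        counts[k] = counts.get(k, 0) + 1
    # Class-wise mean score vector, i.e. (1 / |V_k|) * sum over V_k.
    means = {k: [x / counts[k] for x in acc] for k, acc in sums.items()}
    return [[(1 - lam) * row[j] + lam * means[k][j] for j in range(num_classes)]
            for row, k in zip(scores, labels)]


scores = [[0.25, 0.75], [0.75, 0.25], [0.5, 1.0]]  # toy scores (assumption)
labels = [0, 0, 1]
smoothed = aggregate_same_label(scores, labels, lam=0.5)
print(smoothed)
```

Each row is pulled toward the mean score vector of its own class, and a node that is alone in its class is left unchanged, since the class mean then coincides with its own scores.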

In order to prove Proposition 2, we only need to prove the following: 1) SNAPS is efficient for the score corresponding to the ground-truth label $k$ of node $v$, i.e., $\boldsymbol{\hat{S}}_{vk}\leq\boldsymbol{S}_{vk}$ or $\boldsymbol{\hat{S}}_{vk}\leq\eta$. 2) SNAPS is efficient for the score corresponding to any other label $i$ of node $v$, i.e., $\boldsymbol{\hat{S}}_{vi}\geq\boldsymbol{S}_{vi}$ or $\boldsymbol{\hat{S}}_{vi}\geq\eta$. The key idea is as follows: scores corresponding to the ground-truth label should either lie below the quantile score or decrease relative to their original values, and scores corresponding to the other labels should either lie above the quantile score or increase relative to their original values.

Firstly

SNAPS is efficient for the score corresponding to the ground-truth label $k$ of node $v$, i.e., $\boldsymbol{\hat{S}}_{vk}\leq\boldsymbol{S}_{vk}$ or $\boldsymbol{\hat{S}}_{vk}\leq\eta$. Here we have

$$\boldsymbol{\hat{S}}_{vk}=(1-\lambda)\boldsymbol{S}_{vk}+\lambda E_k[\boldsymbol{S}_{uk}].$$

1) If $\boldsymbol{S}_{vk}\geq E_k[\boldsymbol{S}_{uk}]$, then

$$
\begin{aligned}
\boldsymbol{\hat{S}}_{vk}-\boldsymbol{S}_{vk} &= (1-\lambda)\boldsymbol{S}_{vk}+\lambda E_k[\boldsymbol{S}_{uk}]-\boldsymbol{S}_{vk} \\
&= -\lambda(\boldsymbol{S}_{vk}-E_k[\boldsymbol{S}_{uk}]) \\
&\leq 0.
\end{aligned}
$$

Thus, $\boldsymbol{\hat{S}}_{vk}\leq\boldsymbol{S}_{vk}$. This means that SNAPS can decrease some scores corresponding to the ground-truth label, bringing them from above the quantile score to below it. Since these overly high scores corresponding to ground-truth labels decrease, $\hat{\eta}<\eta$, where $\hat{\eta}$ denotes the $1-\alpha$ quantile of SNAPS scores in the calibration set.

2) If $\boldsymbol{S}_{vk}<E_k[\boldsymbol{S}_{uk}]$, then

$$
\begin{aligned}
\boldsymbol{\hat{S}}_{vk} &= (1-\lambda)\boldsymbol{S}_{vk}+\lambda E_k[\boldsymbol{S}_{uk}] \\
&< (1-\lambda)E_k[\boldsymbol{S}_{uk}]+\lambda E_k[\boldsymbol{S}_{uk}] \\
&= E_k[\boldsymbol{S}_{uk}] \\
&< \eta.
\end{aligned}
$$

Thus, $\boldsymbol{\hat{S}}_{vk}<\eta$. This means that original scores below the quantile score remain below it after aggregation.
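
A small numeric instance of the two cases may help. The values below are arbitrary and chosen only for illustration; in particular, the sketch assumes $E_k[\boldsymbol{S}_{uk}]<\eta$, as the last step of the second case does, and the function name is hypothetical.

```python
def snaps_true_label_score(s_vk, mean_k, lam):
    """S_hat_vk = (1 - lam) * S_vk + lam * E_k[S_uk]."""
    return (1 - lam) * s_vk + lam * mean_k


eta = 0.75     # assumed quantile score
mean_k = 0.5   # assumed E_k[S_uk], taken to lie below eta
lam = 0.5

# Case 1: S_vk >= E_k[S_uk]; the score moves down toward the class mean.
high = snaps_true_label_score(0.875, mean_k, lam)
# Case 2: S_vk < E_k[S_uk]; the score rises but stays below the class mean.
low = snaps_true_label_score(0.25, mean_k, lam)
print(high, low)
```

In the first case the score starts above the assumed quantile score and ends below it, and in the second case the score increases yet remains below both the class mean and the quantile score.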

Secondly

SNAPS is efficient for the score corresponding to any other label $i$ of node $v$, i.e., $\boldsymbol{\hat{S}}_{vi}\geq\boldsymbol{S}_{vi}$ or $\boldsymbol{\hat{S}}_{vi}\geq\eta$. Here we have

$$\boldsymbol{\hat{S}}_{vi}=(1-\lambda)\boldsymbol{S}_{vi}+\lambda E_k[\boldsymbol{S}_{ui}].$$

1) If $\boldsymbol{S}_{vi} \leq E_{k}[\boldsymbol{S}_{ui}]$, then

$$
\begin{aligned}
\boldsymbol{\hat{S}}_{vi} - \boldsymbol{S}_{vi} &= (1-\lambda)\,\boldsymbol{S}_{vi} + \lambda\,E_{k}[\boldsymbol{S}_{ui}] - \boldsymbol{S}_{vi} \\
&= -\lambda\,(\boldsymbol{S}_{vi} - E_{k}[\boldsymbol{S}_{ui}]) \\
&\geq 0.
\end{aligned}
$$

Thus, $\boldsymbol{\hat{S}}_{vi} \geq \boldsymbol{S}_{vi}$. This means that SNAPS never decreases such a score and can lift some scores corresponding to the other labels from below the quantile score to above it.

2) If $\boldsymbol{S}_{vi} > E_{k}[\boldsymbol{S}_{ui}]$, then

$$
\begin{aligned}
\boldsymbol{\hat{S}}_{vi} - \eta &= (1-\lambda)\,\boldsymbol{S}_{vi} + \lambda\,E_{k}[\boldsymbol{S}_{ui}] - \eta \\
&> E_{k}[\boldsymbol{S}_{ui}] - \eta \\
&\geq (1-\epsilon_{ki})\,E_{k}[\pi(\boldsymbol{x}_{u})_{max}] + E_{k}[\xi \cdot \pi(\boldsymbol{x}_{u})_{i}] - \eta.
\end{aligned}
$$

Let Δ=η(1ϵki)Ek[π(𝒙u)max]Ek[ξπ(𝒙u)i]Δ𝜂1subscriptitalic-ϵ𝑘𝑖subscript𝐸𝑘delimited-[]𝜋subscriptsubscript𝒙𝑢𝑚𝑎𝑥subscript𝐸𝑘