Similarity-Navigated Conformal Prediction for Graph Neural Networks
Abstract
Graph Neural Networks have achieved remarkable accuracy in semi-supervised node classification tasks. However, these results lack reliable uncertainty estimates. Conformal prediction methods provide a theoretical guarantee for node classification tasks, ensuring that the conformal prediction set contains the ground-truth label with a desired probability (e.g., 95%). In this paper, we empirically show that for each node, aggregating the non-conformity scores of nodes with the same label can improve the efficiency of conformal prediction sets while maintaining valid marginal coverage. This observation motivates us to propose a novel algorithm named Similarity-Navigated Adaptive Prediction Sets (SNAPS), which aggregates the non-conformity scores based on feature similarity and structural neighborhood. The key idea behind SNAPS is that nodes with high feature similarity or direct connections tend to have the same label. By adaptively incorporating information from similar nodes, SNAPS can generate compact prediction sets and increase the singleton hit ratio (the proportion of correct prediction sets of size one). Moreover, we theoretically provide a finite-sample coverage guarantee for SNAPS. Extensive experiments demonstrate the superiority of SNAPS, which improves the efficiency of prediction sets and the singleton hit ratio while maintaining valid coverage.
1 Introduction
Graph Neural Networks (GNNs), which process graph-structured data via message passing (Kipf and Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019), have achieved remarkable accuracy in various high-stakes applications, e.g., drug discovery (Li et al., 2022), fraud detection (Liu et al., 2023) and traffic forecasting (Jiang and Luo, 2022), where any erroneous prediction can be costly and dangerous (Amodei et al., 2016; Gao et al., 2019). To improve the reliability of prediction results, many methods have been investigated to quantify model uncertainty (Gal and Ghahramani, 2016; Guo et al., 2017; Kendall and Gal, 2017; Wang et al., 2021; Hsu et al., 2022; Tang et al., 2024), but these methods lack theoretical guarantees on the quantified uncertainty. Conformal prediction (CP), on the other hand, offers a systematic approach to construct prediction sets that contain ground-truth labels with a desired coverage guarantee (Vovk et al., 2005; Romano et al., 2020; Angelopoulos et al., 2021; Huang et al., 2023a; Xi et al., 2024).
CP algorithms utilize non-conformity scores to measure dissimilarity between a new instance and the training instances. The lower the score of a new instance, the more likely it belongs to the same distribution space as the training instances, thereby included in the prediction set. To improve the efficiency of prediction sets for GNNs, DAPS (Zargarbashi et al., 2023) smooths node-wise non-conformity scores by incorporating neighborhood information based on the assumption of network homophily. Similar to DAPS, CF-GNN (Huang et al., 2023b) introduces a topology-aware output correction model that learns to update prediction and then produces more efficient prediction sets or intervals with the inefficiency as the optimization objective. However, they only consider structural neighbors and ignore the effect of other nodes that are far from the ego node. This motivates us to analyze the influence of global nodes on the size of prediction sets.
In this work, we show that aggregating the information of global nodes with the same label as the ego node benefits the performance of CP methods. We provide an empirical analysis by randomly selecting nodes with the same label as the ego node from an oracle perspective, where the ground-truth labels of all nodes are known, and then aggregating their non-conformity scores into the ego node. The results indicate that aggregating the scores of these nodes can significantly reduce the average size of prediction sets. This suggests that the information of nodes with the same label can correct the non-conformity scores, thereby promoting the efficiency of prediction sets. A detailed analysis is presented in Subsection 3.1. However, during the testing phase, the ground-truth label of every test node is unknown. Inspired by the analysis, our key idea is to accurately identify and select as many nodes with the same label as the ego node as possible and aggregate their non-conformity scores.
To this end, we propose a novel algorithm named Similarity-Navigated Adaptive Prediction Sets (SNAPS), which could self-adaptively aggregate the non-conformity scores of other nodes into the ego node. Specifically, SNAPS gives the higher cumulative weight for nodes with a higher probability of having the same label as the ego node while preserving its own and the one-hop neighbors. We utilize the feature similarity between nodes and the adjacency matrix to calculate the aggregating weights. In this way, the corrected scores could achieve compact prediction sets while maintaining the desired coverage.
To verify the effectiveness of our method, we conduct thorough empirical evaluations on 10 datasets, including both small datasets and large-scale datasets, e.g., OGBN Products (Bhatia et al., 2016). The results demonstrate that SNAPS not only achieves the pre-defined empirical marginal coverage but also outperforms the compared methods. For example, on OGBN Products, our method reduces the average size of prediction sets from 14.92 of APS to 7.68. Moreover, we adapt SNAPS to image classification problems. On ImageNet (Deng et al., 2009), SNAPS reduces the average size of prediction sets from 19.639 to 4.079, roughly one fifth of the prediction set size of APS. Code is available at https://github.com/janqsong/SNAPS.
We summarize our contributions as follows:
- We empirically show that the non-conformity scores of nodes with the same label as the ego node play a critical role in correcting the non-conformity scores of the ego node.
- We propose a novel algorithm, namely SNAPS, that aggregates the basic non-conformity scores of nodes selected through node feature similarity and one-hop structural neighborhood. We provide theoretical analysis to show the marginal coverage properties and the validity of SNAPS.
- Extensive experimental results demonstrate the effectiveness of our proposed method. We show that SNAPS not only maintains the pre-defined coverage but also achieves strong performance in efficiency and singleton hit ratio.
2 Preliminary
In this paper, we focus on split conformal prediction for semi-supervised node classification with transductive learning in an undirected graph.
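Notation. A graph is represented as $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, where $\mathcal{V} := \{v_i\}_{i=1}^{N}$ denotes the node set and $\mathcal{E}$ denotes the edge set with $|\mathcal{E}| = E$. Let $A \in \{0,1\}^{N \times N}$ be the adjacency matrix, where $A_{i,j} = 1$ if there exists an edge between nodes $v_i$ and $v_j$, and $A_{i,j} = 0$ otherwise, and let $D$ be its degree matrix, where $D_{i,i} = \sum_{j} A_{i,j}$. Let $X := \{x_i\}_{i=1}^{N}$ be the node feature matrix, where $x_i \in \mathbb{R}^{d}$ is a $d$-dimensional feature vector for node $v_i$. The label of node $v_i$ is $y_i \in \mathcal{Y}$, where $\mathcal{Y} := \{1, 2, \dots, K\}$ denotes the label space.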
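Transductive setting. In the transductive setting, we have access to two node sets, $\mathcal{V}_{label}$ with labels and $\mathcal{V}_{unlabel}$ without labels, where $\mathcal{V}_{label} \cap \mathcal{V}_{unlabel} = \emptyset$ and $\mathcal{V}_{label} \cup \mathcal{V}_{unlabel} = \mathcal{V}$. $\mathcal{V}_{label}$ is then randomly split into $\mathcal{V}_{train}$/$\mathcal{V}_{valid}$/$\mathcal{V}_{calib}$ with a fixed size, the training/validation/calibration node sets, respectively. $\mathcal{V}_{unlabel}$ is used as the testing node set $\mathcal{V}_{test}$. The classifier $f_{\theta}$ is trained on $\{(x_i, y_i)\}_{v_i \in \mathcal{V}_{train}}$, $\{x_i\}_{v_i \in \mathcal{V} \setminus \mathcal{V}_{train}}$, and the entire graph structure $A$, and is chosen through $\mathcal{V}_{valid}$. Then we can get the predicted probability for each node through $P = \sigma(f_{\theta}(X, A))$, where $P \in \mathbb{R}^{N \times K}$ and $\sigma$ is an activation function such as softmax. We usually choose the label with the highest probability as the predicted label, i.e., $\hat{y}_i = \arg\max_{k} P_{i,k}$.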
Graph neural networks. GNNs aim at learning representation vectors for nodes in the graph by leveraging graph structure and node features. Most modern GNNs adopt a series of propagation layers following a message passing mechanism (Gilmer et al., 2017). The $l$-th layer of the GNNs takes the following form:
$$ h_u^{(l)} = \mathrm{COMB}^{(l)}\left( h_u^{(l-1)},\ \mathrm{AGG}^{(l)}\left( \left\{ \mathrm{MSG}^{(l)}\left( h_v^{(l-1)} \right) \mid v \in \mathcal{N}_u \right\} \right) \right), \tag{1} $$
where $h_u^{(l)}$ is the hidden representation of node $u$ at the $l$-th layer with initialization $h_u^{(0)} = x_u$, and $\mathcal{N}_u$ is the set of nodes adjacent to node $u$. $\mathrm{MSG}^{(l)}$, $\mathrm{AGG}^{(l)}$ and $\mathrm{COMB}^{(l)}$ denote the functions for message computation, message aggregation, and message combination, respectively. After the iteration of the last layer, the obtained final node representation is fed to a classifier to obtain the predicted probability.
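As a rough illustration of the message-passing form described above, the following sketch implements one layer in plain Python. The concrete choices are assumptions made only for this example and are not taken from the paper: the message function is the identity, aggregation is a mean over neighbors, and combination averages the node's own state with the aggregate. Real GNN layers such as GCN use learned weight matrices instead.

```python
def message_passing_layer(h, neighbors):
    """One illustrative layer (hypothetical choices, no learned weights):
    MSG = identity, AGG = mean over neighbours, COMB = average with own state."""
    new_h = []
    for u, h_u in enumerate(h):
        nbrs = neighbors[u]
        if nbrs:
            # Mean of the neighbours' previous-layer representations.
            agg = [sum(h[v][d] for v in nbrs) / len(nbrs) for d in range(len(h_u))]
        else:
            # Isolated node: fall back to its own representation.
            agg = list(h_u)
        new_h.append([(a + b) / 2 for a, b in zip(h_u, agg)])
    return new_h


# Toy path graph 0 - 1 - 2 with 2-dimensional features as h^(0).
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1], [0, 2], [1]]
h1 = message_passing_layer(h0, neighbors)
```

Stacking several such layers lets information from multi-hop neighbors reach each node before the final classifier is applied.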
Conformal prediction. CP is a promising framework for generating prediction sets that statistically contain ground-truth labels with a desired guarantee. Formally, given calibration data $\mathcal{D}_{calib} = \{(x_i, y_i)\}_{i=1}^{n}$, we can generate a prediction set $\mathcal{C}(x_{n+1}) \subseteq \mathcal{Y}$ for an unseen instance $x_{n+1}$ with the coverage guarantee $\mathbb{P}[y_{n+1} \in \mathcal{C}(x_{n+1})] \geq 1 - \alpha$, where $\alpha \in (0, 1)$ is the pre-defined significance level. The best characteristic of CP is that it is distribution-free and only relies on exchangeability. This means that every permutation of the instances is equally likely, i.e., $\mathcal{D}_{calib} \cup \{(x_{n+1}, y_{n+1})\}$ is exchangeable, where $(x_{n+1}, y_{n+1})$ is an unseen instance.
Conformal prediction is typically divided into two types: full conformal prediction and split conformal prediction. Unlike full conformal prediction, split conformal prediction treats the model as a black box, avoiding the need to retrain or modify the model and trading some statistical efficiency for computational efficiency (Vovk et al., 2005; Zargarbashi et al., 2023). In this paper, we focus on the computationally efficient split conformal prediction method; thus "conformal prediction" in the following denotes split conformal prediction.
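Theorem 1
(Vovk et al., 2005) Let calibration data and a test instance, i.e., $\{(x_i, y_i)\}_{i=1}^{n} \cup \{(x_{n+1}, y_{n+1})\}$, be exchangeable. For any non-conformity score function $s$ and any significance level $\alpha \in (0, 1)$, define the quantile of scores as $\hat{q} := \mathrm{Quantile}\left( \frac{\lceil (n+1)(1-\alpha) \rceil}{n};\ \{s(x_i, y_i)\}_{i=1}^{n} \right)$ and prediction sets as $\mathcal{C}(x) = \{y \in \mathcal{Y} : s(x, y) \leq \hat{q}\}$. We have
$$ \mathbb{P}\left[ y_{n+1} \in \mathcal{C}(x_{n+1}) \right] \geq 1 - \alpha. \tag{2} $$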
Theorem 1 statistically provides a marginal coverage guarantee for all test instances. Currently, there are already many basic non-conformity score methods (Romano et al., 2020; Angelopoulos et al., 2021; Huang et al., 2023a). Here we provide the definition of Adaptive Prediction Sets (Romano et al., 2020) (APS).
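The calibration step behind Theorem 1 can be sketched in a few lines. The example below is a minimal illustration under stated assumptions: the helper names `conformal_quantile` and `prediction_set` and the toy scores are hypothetical, and the quantile is taken as the ceil((n+1)(1-alpha))-th smallest calibration score, which is the usual finite-sample correction in split conformal prediction.

```python
import math


def conformal_quantile(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the threshold is infinite and every label is kept.
    """
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_set(label_scores, q_hat):
    """Keep every label whose non-conformity score does not exceed q_hat."""
    return [y for y, s in enumerate(label_scores) if s <= q_hat]


# Toy calibration scores of the true labels (hypothetical numbers).
cal_scores = [0.10, 0.35, 0.20, 0.80, 0.55, 0.40, 0.65]
q_hat = conformal_quantile(cal_scores, alpha=0.25)

# Scores of each candidate label for one test instance.
test_scores = [0.95, 0.25, 0.60, 0.85]
pred = prediction_set(test_scores, q_hat)
```

With seven calibration scores and alpha = 0.25, the rank is 6, so the threshold is the sixth smallest score and labels 1 and 2 enter the prediction set.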
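Adaptive Prediction Sets. In the APS method, the non-conformity scores are calculated by accumulating the softmax probabilities in descending order. Formally, given a data pair $(x, y)$ and a predicted probability estimator $\pi(x)$ for $(x, y)$, where $\pi(x)_y$ is the predicted probability for class $y$, the non-conformity scores can be computed by:
$$ s(x, y) = \sum_{i=1}^{|\mathcal{Y}|} \pi(x)_i \, \mathbb{I}\left[ \pi(x)_i > \pi(x)_y \right] + \xi \cdot \pi(x)_y, \tag{3} $$
where $\xi \in [0, 1]$ is a uniformly distributed random variable. Then, the prediction set is constructed as $\mathcal{C}(x) = \{y \in \mathcal{Y} : s(x, y) \leq \hat{q}\}$.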
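To make the APS score concrete, here is a small sketch that follows the verbal description above: the probability mass of all labels ranked strictly above the candidate label, plus a randomized share of the candidate's own probability. The function name `aps_score` and the toy probabilities are illustrative assumptions; in practice the random variable would be drawn with `random.random()` rather than fixed.

```python
def aps_score(probs, y, u):
    """APS-style score of label y.

    probs: predicted class probabilities for one instance.
    u: a draw from Uniform[0, 1] used for randomization.
    """
    # Mass of labels that are strictly more likely than y.
    higher = sum(p for p in probs if p > probs[y])
    return higher + u * probs[y]


# Toy softmax output; u is fixed at 0.5 here only to keep the example deterministic.
probs = [0.5, 0.375, 0.125]
scores = [aps_score(probs, y, u=0.5) for y in range(len(probs))]
```

The most likely label receives the lowest score, so it is the first to enter the prediction set as the threshold grows.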
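Evaluation Metrics. The goal is to improve the efficiency of conformal prediction sets as much as possible while maintaining the empirical marginal coverage guarantee. Given the testing node set $\mathcal{V}_{test}$, the efficiency is defined as the average size of prediction sets: $\mathrm{Size} := \frac{1}{|\mathcal{V}_{test}|} \sum_{v_i \in \mathcal{V}_{test}} |\mathcal{C}(x_i)|$. The smaller the size, the more efficient CP is. The empirical marginal coverage is defined as $\mathrm{Coverage} := \frac{1}{|\mathcal{V}_{test}|} \sum_{v_i \in \mathcal{V}_{test}} \mathbb{I}\left[ y_i \in \mathcal{C}(x_i) \right]$. Although efficiency is a common metric for evaluating CP, the singleton hit ratio (SH), defined as the proportion of prediction sets of size one that contain the ground-truth label, is also important (Zargarbashi et al., 2023). SH is defined as $\mathrm{SH} := \frac{1}{|\mathcal{V}_{test}|} \sum_{v_i \in \mathcal{V}_{test}} \mathbb{I}\left[ \mathcal{C}(x_i) = \{y_i\} \right]$.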
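The three metrics can be computed directly from the prediction sets and the true labels. The sketch below is an illustration with hypothetical names and toy data: coverage is the fraction of sets containing the true label, size is the mean set cardinality, and the singleton hit ratio counts sets that contain exactly the true label and nothing else.

```python
def evaluate(pred_sets, labels):
    """Return (coverage, average size, singleton hit ratio) over the test nodes."""
    n = len(labels)
    coverage = sum(y in c for c, y in zip(pred_sets, labels)) / n
    size = sum(len(c) for c in pred_sets) / n
    # A singleton hit is a set of size one that contains the true label.
    sh = sum(len(c) == 1 and y in c for c, y in zip(pred_sets, labels)) / n
    return coverage, size, sh


# Four toy test nodes: the third set misses its label, the first is a singleton hit.
pred_sets = [{0}, {1, 2}, {2}, {0, 1, 2}]
labels = [0, 1, 0, 2]
coverage, size, sh = evaluate(pred_sets, labels)
```

Note that a singleton set that misses the true label lowers coverage and does not count toward SH, which is why SH and Size are reported together.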
3 Motivation and Methodology
In this section, we begin by outlining our motivation, substantiating its validity and feasibility through experimental evidence. Then, we propose our method, SNAPS. Finally, we demonstrate that SNAPS satisfies the exchangeability assumption required by CP and offer proof of its improved efficiency compared to basic non-conformity score methods.
3.1 Motivation
In this subsection, we empirically show that nodes with the same label as the ego node may play a critical role in the non-conformity scores of the ego node. Specifically, using the scores of nodes with the same label to correct the scores of the ego node could reduce the average size of prediction sets.
To analyze the role of nodes with the same label as the ego node, we assume access to an oracle graph, i.e., the ground-truth labels of all the nodes are known. Then, we randomly select nodes with the same label as the ego node and aggregate their APS non-conformity scores into the ego node. We conduct experiments with a Graph Convolutional Network (GCN) (Kipf and Welling, 2017) on the CoraML (McCallum et al., 2000) dataset and choose APS as the basic score function of CP. We conduct 10 trials and randomly select 100 calibration sets for each trial to evaluate the performance of CP at a significance level $\alpha$.
In Figure LABEL:fig:motivation-a, we can see that the average size of prediction sets drops sharply as the number of aggregated nodes increases, while valid coverage is maintained. When the number of selected nodes is 0, the results show the performance of APS. Therefore, if the non-conformity scores of the ego node are corrected by accurately selecting nodes with the same label, the average size of prediction sets can be reduced to a large extent. Moreover, aggregating the scores of these nodes still achieves the coverage guarantee.
3.2 Similarity-Navigated Adaptive Prediction Sets
In our previous analysis, we show that correcting the scores of the ego node with the scores of nodes having the same label leads to smaller prediction sets and valid coverage. However, the above experiment is based on the oracle graph. In the real-world application, the ground-truth label of each test node is unknown. To alleviate this issue, our key idea is to use the similarity to approximate the potential label. Specifically, the nodes with high feature similarity tend to have a high probability of belonging to the same label.
Several studies (Jin et al., 2021b; a; Zou et al., 2023) have demonstrated that matrix constructed from node feature similarity can help with the homophily assumption, i.e., connected nodes in the graph are likely to share the same label. Additionally, the network homophily can also help us to find more nodes whose labels are likely to be the same as the ego node (Kipf and Welling, 2017), and several studies have demonstrated the effectiveness of this (Clarkson, 2023; Zargarbashi et al., 2023; Huang et al., 2023b). Therefore, we consider feature similarity and network structure to select nodes that may have the same label as the ego node.
Feature similarity graph construction. We compute the cosine similarity between the node features in the graph. For a given node pair $(v_i, v_j)$, the cosine similarity between their features can be calculated by:
$$ \mathrm{Sim}(i, j) = \frac{x_i^{\top} x_j}{\lVert x_i \rVert \, \lVert x_j \rVert}, \tag{4} $$
where $i \neq j$ and $v_j \in \mathcal{V}_t$. Here, $\mathcal{V}_t$ represents a set of nodes for which we calculate the similarity with $v_i$. Then, we choose the $k$ nearest neighbors for each node based on the above cosine similarity, forming the $k$-NN graph. We denote the adjacency matrix of the $k$-NN graph as $A_s$ and its degree matrix as $D_s$, where $A_s(i, j) = \mathrm{Sim}(i, j)$ and $D_s(i, i) = \sum_{j} A_s(i, j)$. For large graphs, we randomly select $M$ nodes to put into $\mathcal{V}_t$, whereas for small graphs, we include all nodes in $\mathcal{V}_t$.
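The construction of the feature similarity graph can be sketched as a brute-force nearest-neighbor search. The code below is a minimal illustration, not the authors' implementation: it assumes non-zero feature vectors, compares every node with every other node (the small-graph case), and stores each node's neighbors as (index, similarity) pairs. The names `cosine` and `knn_graph` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_graph(features, k):
    """Return {i: [(j, sim), ...]} holding the k most similar nodes j != i."""
    graph = {}
    for i, xi in enumerate(features):
        sims = [(j, cosine(xi, xj)) for j, xj in enumerate(features) if j != i]
        sims.sort(key=lambda t: t[1], reverse=True)
        graph[i] = sims[:k]
    return graph


# Two loose feature clusters: nodes 0 and 1 versus nodes 2 and 3.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
graph = knn_graph(features, k=1)
nearest = [graph[i][0][0] for i in range(len(features))]
```

The all-pairs comparison costs quadratic time in the number of nodes, which is one reason to restrict the candidate set to a random subset on large graphs.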
To verify the effectiveness of feature similarity, we provide an empirical analysis. Figure LABEL:fig:motivation-b presents the average cosine similarity of node features between nodes with the same label and between nodes with different labels on the CoraML dataset. The average feature similarity between nodes with the same label is higher than that between nodes with different labels. We also analyze experimentally whether using feature similarity to select the $k$-NN meets our expectation of selecting nodes with the same label as the ego node. Figure LABEL:fig:motivation-c shows the number of nodes with the same label and with different labels at the $k$-th nearest neighbor. The result shows that we can indeed select many nodes with the same label when $k$ is not very large.

SNAPS. We propose SNAPS, which aggregates the non-conformity scores of nodes with high feature similarity to the ego node and of one-hop graph structural neighbors. Formally, for a node $v_i$ with a label $y$, the score function of SNAPS is:
$$ \hat{s}(x_i, y) = (1 - \lambda - \mu)\, s(x_i, y) + \frac{\lambda}{D_s(i, i)} \sum_{j} A_s(i, j)\, s(x_j, y) + \frac{\mu}{D_{i,i}} \sum_{j} A_{i,j}\, s(x_j, y), \tag{5} $$
where $s$ is the basic non-conformity score function and $\hat{s}$ is the SNAPS score function. Both $\lambda$ and $\mu$ are hyperparameters, which are used to measure the importance of the three parts of the non-conformity scores. The framework of SNAPS is shown in Figure 3 and the pseudo-code is in Appendix B.
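The aggregation step can be illustrated with a short sketch. It assumes, as an interpretation of the description above, that the corrected score is a convex combination of three parts: the node's own basic score with weight 1 - lambda - mu, a similarity-weighted average over its k-NN neighbors with weight lambda, and a plain average over its one-hop neighbors with weight mu. The function name, the data layout, and the fallback for nodes without neighbors are assumptions made for this example only.

```python
def snaps_scores(base, knn, adj, lam, mu):
    """Aggregate basic scores into corrected scores (illustrative sketch).

    base[i][y]: basic non-conformity score of node i for label y.
    knn[i]:     list of (j, weight) pairs from the feature similarity graph.
    adj[i]:     list of one-hop neighbours of node i in the original graph.
    """
    out = []
    for i, own in enumerate(base):
        total_w = sum(w for _, w in knn[i])
        row = []
        for y, s_own in enumerate(own):
            # Assumed fallback: a node without neighbours reuses its own score.
            feat = (sum(w * base[j][y] for j, w in knn[i]) / total_w
                    if total_w > 0 else s_own)
            neigh = (sum(base[j][y] for j in adj[i]) / len(adj[i])
                     if adj[i] else s_own)
            row.append((1 - lam - mu) * s_own + lam * feat + mu * neigh)
        out.append(row)
    return out


# Three toy nodes, two labels (hypothetical numbers).
base = [[0.2, 0.9], [0.4, 0.7], [0.8, 0.1]]
knn = [[(1, 1.0)], [(0, 1.0)], [(1, 1.0)]]
adj = [[1], [0, 2], [1]]
corrected = snaps_scores(base, knn, adj, lam=0.25, mu=0.25)
```

Setting both weights to zero returns the basic scores unchanged, so the basic method is recovered as a special case. The corrected scores would then be calibrated and thresholded exactly like the basic ones.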
3.3 Theoretical Analysis
To deploy CP for graph-structured data, the only assumption we must satisfy is exchangeability, i.e., the joint distribution of the calibration and testing sets remains unchanged under any permutation. Several studies have demonstrated that non-conformity scores based on the node embeddings obtained by any GNN model are invariant to a permutation of the nodes in the calibration and testing sets, provided their edges are permuted correspondingly. This invariance arises because GNN models and non-conformity score functions only use the structures and attributes in the graph, without dependence on the order of the nodes (Zargarbashi et al., 2023; Huang et al., 2023b). Under this condition, we prove that SNAPS non-conformity scores are still exchangeable.
Proposition 1
Let be basic non-conformity scores of nodes, where . Assume that is exchangeable for all . Then the aggregated , where and , is also exchangeable for .
The corresponding proof is provided in Appendix A. We then demonstrate the validity of our method theoretically.
Proposition 2
Assume that all of the nodes aggregated by SNAPS are the same label as the ego node. Given a data pair and a predicted estimator for , where is the predicted probability for class . Moreover, reflects the model’s error in misclassifying the ground-truth label as label . Let be APS scores of nodes, where is the score corresponding to node with label . Let and be the average of predicted probability and scores corresponding to label of nodes whose ground-truth labels are , respectively. Let be quantile of basic non-conformity scores with a significance level . If and , where denotes the maximum predicted probability of nodes whose ground-truth labels are and is a uniformly distributed random variable, then
where and represent the prediction set from the APS score function and SNAPS score function, respectively.
In other words, under these assumptions SNAPS consistently generates a smaller prediction set than the basic non-conformity score function while maintaining the desired marginal coverage rate. This result hinges on selecting nodes with the same label as the ego node as accurately as possible; otherwise, the efficiency of SNAPS decreases.
4 Experiments
In this section, we conduct extensive experiments on semi-supervised node classification to demonstrate the effectiveness of SNAPS on graph-structured data. We also adapt SNAPS for image classification problems. Furthermore, we perform ablation studies and parameter analysis to show the importance of the different components of SNAPS and to evaluate its robustness, respectively.
4.1 Experimental Settings
Datasets. In our experiments, we consider ten datasets with high homophily, where connected nodes in the graph are likely to share the same label. These datasets include the common citation graphs CoraML (McCallum et al., 2000), PubMed (Namata et al., 2012), CiteSeer (Sen et al., 2008), CoraFull (Bojchevski and Günnemann, 2018), Coauthor Physics (Physics) and Coauthor CS (CS) (Shchur et al., 2018), and the co-purchase graphs Amazon Photos (Photos) and Amazon Computers (Computers) (McAuley et al., 2015; Shchur et al., 2018). Moreover, we consider two large-scale graph datasets, i.e., OGBN Arxiv (Arxiv) (Wang et al., 2020) and OGBN Products (Products) (Bhatia et al., 2016). In particular, for CoraFull, which is highly class-imbalanced, we filter out the classes with fewer than 50 nodes. The transformed dataset is dubbed CoraFull∗ (Zargarbashi et al., 2023). Detailed statistics of these datasets are shown in Appendix F. In addition to the datasets mentioned above, we discuss two heterophilous graph datasets in Appendix C.1, namely Chameleon and Squirrel, both of which are Wikipedia networks (Rozemberczki et al., 2021).
Baselines. Since our SNAPS is a general post-processing method for GNNs, here we choose GCN (Kipf and Welling, 2017), GAT (Velickovic et al., 2018) and APPNP (Gasteiger et al., 2018) as structure-aware models and MLP as a structure-independent model. Moreover, our SNAPS can be based on general conformal prediction non-conformity scores, here we choose APS (Romano et al., 2020) and RAPS (Angelopoulos et al., 2021). For comparison, we compare not only with the basic scores, i.e., APS and RAPS, but also with DAPS (Zargarbashi et al., 2023) for GNNs.
CP Settings. For the basic models GCN, GAT, APPNP and MLP, we follow the parameters suggested by Zargarbashi et al. (2023). For DAPS, we follow the official implementation. Since GNNs are sensitive to splits, especially in the sparsely labeled setting (Shchur et al., 2018), we train the model over ten trials using varying train/validation splits. For each class, we randomly select 20 nodes for the training set and 20 nodes for the validation set. For the Arxiv and Products datasets, we follow the official split in PyTorch Geometric (Fey and Lenssen, 2019). Then, the remaining nodes are included in the calibration set and the test set. The calibration set size follows the suggestion of Huang et al. (2023b). For each trained model, we conduct 100 random splits of the calibration/test set. Thus, we conduct 1000 trials in total to evaluate the effectiveness of CP. For non-conformity score functions that require hyper-parameters, we split the calibration set into two sets, one for tuning parameters and the other for conformal calibration (Zargarbashi et al., 2023). For SNAPS, we choose $\lambda$ and $\mu$ in increments of 0.05 within the range 0 to 1, and ensure that $\lambda + \mu \leq 1$. Each experiment is done with a single NVIDIA V100 32GB GPU.
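The hyperparameter search space for the two SNAPS weights can be enumerated as follows. This is only a sketch of the grid described above (steps of 0.05 in [0, 1] with the sum of both weights at most 1); the function name `snaps_grid` is hypothetical, and how each candidate is scored on the tuning split is not shown.

```python
def snaps_grid(step=0.05):
    """Enumerate (lambda, mu) pairs on a regular grid with lambda + mu <= 1.

    Integer indices are used so that the constraint is checked exactly
    rather than through accumulated floating-point sums.
    """
    n = round(1 / step)
    return [(i * step, j * step) for i in range(n + 1) for j in range(n + 1 - i)]


grid = snaps_grid()
num_candidates = len(grid)
```

With a step of 0.05 there are 21 values per weight and 231 admissible pairs, so an exhaustive search over the tuning split remains cheap.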
Datasets | Coverage (APS) | Coverage (RAPS) | Coverage (DAPS) | Coverage (SNAPS) | Size (APS) | Size (RAPS) | Size (DAPS) | Size (SNAPS) | SH% (APS) | SH% (RAPS) | SH% (DAPS) | SH% (SNAPS) |
CoraML | 0.950 | 0.950 | 0.950 | 0.950 | 2.42 | 2.21 | 1.92 | 1.68 | 44.89 | 22.19 | 52.16 | 56.30 |
PubMed | 0.950 | 0.950 | 0.950 | 0.950 | 1.79 | 1.77 | 1.76 | 1.62 | 33.67 | 30.83 | 35.25 | 42.95 |
CiteSeer | 0.950 | 0.950 | 0.950 | 0.950 | 2.34 | 2.36 | 1.94 | 1.84 | 50.41 | 38.99 | 59.75 | 59.08 |
CoraFull | 0.950 | 0.950 | 0.950 | 0.950 | 17.54 | 10.72 | 11.81 | 9.80 | 10.23 | 2.13 | 8.67 | 5.76 |
CS | 0.950 | 0.950 | 0.950 | 0.950 | 1.91 | 1.20 | 1.22 | 1.08 | 66.17 | 78.34 | 79.80 | 87.92 |
Physics | 0.950 | 0.950 | 0.950 | 0.950 | 1.28 | 1.07 | 1.08 | 1.04 | 76.74 | 88.89 | 88.40 | 91.21 |
Computers | 0.950 | 0.950 | 0.950 | 0.950 | 3.95 | 2.89 | 2.13 | 1.98 | 27.67 | 15.85 | 43.03 | 45.48 |
Photo | 0.951 | 0.950 | 0.950 | 0.951 | 1.89 | 1.64 | 1.41 | 1.31 | 54.31 | 56.63 | 74.57 | 78.51 |
Arxiv | 0.950 | 0.950 | 0.949 | 0.950 | 4.30 | 3.62 | 3.73 | 3.62 | 22.55 | 14.52 | 19.19 | 23.53 |
Products | 0.950 | 0.951 | 0.950 | 0.950 | 14.92 | 13.67 | 10.91 | 7.68 | 15.51 | 11.51 | 19.29 | 22.38 |
Average | 0.950 | 0.950 | 0.950 | 0.950 | 5.23 | 4.12 | 3.79 | 3.17 | 40.22 | 36.00 | 48.01 | 52.31 |
4.2 Experimental results
SNAPS generates smaller prediction sets and achieves a higher singleton hit ratio. Table 1 shows that the Coverage of all conformal prediction methods is close to the desired coverage $1 - \alpha$. At a significance level $\alpha = 0.05$, SNAPS exhibits superior performance in both Size and SH. For example, when evaluated on Products, SNAPS reduces Size from 14.92 of APS to 7.68. Overall, the experiments show that SNAPS attains the desired coverage rate and achieves a smaller Size and a higher SH than APS, RAPS, and DAPS. Detailed results for other basic models and for SNAPS based on RAPS are available in Appendix D.
SNAPS generates smaller average prediction sets for each label. We conduct additional experiments to analyze the average performance of APS and SNAPS on nodes belonging to the same label at a significance level . Figure LABEL:fig:mean-size-aps shows that the distribution of the average non-conformity scores for nodes belonging to the same label aligns with the assumptions made in Proposition 2, i.e., and , where . If , then it is very small. of prediction sets corresponding to APS is 3.29. Figure LABEL:fig:mean-size-snaps shows that only a few other labels different from real labels have average scores lower than the quantile of scores. of prediction sets corresponding to SNAPS is 1.29. Overall, for basic non-conformity scores that match this distribution of our assumptions, SNAPS can achieve superior performance based on these scores. The results of CiteSeer and Amazon Computers datasets are available in Appendix D.
Orig. | Neigh. | Feat. | CoraML | PubMed | CiteSeer | CoraFull∗ | CS | Physics | Computers | Photo | Arxiv | Products |
 | | | 2.42 | 1.79 | 2.34 | 17.54 | 1.91 | 1.28 | 3.95 | 1.89 | 4.30 | 14.92 |
 | | | 2.18 | 1.94 | 2.07 | 17.50 | 1.37 | 1.09 | 2.15 | 1.42 | 4.75 | 11.25 |
 | | | 2.40 | 1.65 | 2.52 | 18.07 | 1.11 | 1.03 | 3.26 | 2.60 | 9.45 | 13.89 |
 | | | 1.87 | 1.72 | 1.91 | 12.10 | 1.22 | 1.07 | 2.22 | 1.37 | 3.76 | 10.81 |
 | | | 1.78 | 1.63 | 1.94 | 11.54 | 1.13 | 1.05 | 2.37 | 1.46 | 3.82 | 8.46 |
 | | | 1.72 | 1.63 | 1.86 | 10.51 | 1.09 | 1.04 | 1.94 | 1.31 | 4.44 | 7.65 |
 | | | 1.68 | 1.62 | 1.84 | 9.80 | 1.08 | 1.04 | 1.98 | 1.31 | 3.62 | 7.68 |
Ablation study. To understand the effects of the three parts of our method, i.e., original scores (Orig.), neighborhood scores (Neigh.), and feature similarity node scores (Feat.), we conduct a thorough ablation experiment using GCN at $\alpha = 0.05$. In Table 2, SNAPS performs best on most datasets when all three parts are included. Moreover, on the remaining datasets, where the full method is only comparable, every better-performing variant contains the Feat. part. Overall, each part plays a critical role in CP for GNNs, and removing any of them will in general decrease performance.
Datasets | Coverage (APS) | Coverage (RAPS) | Coverage (DAPS) | Coverage (SNAPS) | Size (APS) | Size (RAPS) | Size (DAPS) | Size (SNAPS) | SH% (APS) | SH% (RAPS) | SH% (DAPS) | SH% (SNAPS) |
CoraML | 0.950 | 0.958 | 0.957 | 0.951 | 2.50 | 2.62 | 2.32 | 1.74 | 43.09 | 27.34 | 44.52 | 54.11 |
PubMed | 0.950 | 0.968 | 0.967 | 0.950 | 1.82 | 2.10 | 2.09 | 1.61 | 33.39 | 14.66 | 23.27 | 44.11 |
CiteSeer | 0.951 | 0.950 | 0.952 | 0.950 | 2.41 | 2.69 | 2.16 | 1.90 | 48.53 | 35.37 | 55.40 | 58.22 |
CS | 0.950 | 0.953 | 0.954 | 0.950 | 2.04 | 1.31 | 1.33 | 1.13 | 64.32 | 66.91 | 74.91 | 85.21 |
Physics | 0.951 | 0.962 | 0.962 | 0.950 | 1.39 | 1.44 | 1.28 | 1.07 | 72.44 | 62.22 | 77.65 | 88.58 |
Computers | 0.950 | 0.950 | 0.951 | 0.950 | 3.01 | 3.04 | 2.30 | 2.01 | 29.21 | 9.87 | 42.19 | 45.98 |
Photo | 0.949 | 0.950 | 0.950 | 0.950 | 1.90 | 1.81 | 1.56 | 1.30 | 54.86 | 47.27 | 67.57 | 79.50 |
Parameter analysis. We conduct additional experiments to analyze the robustness of SNAPS. We choose GCN as the GNNs model and APS as the basic non-conformity score function.
Figure LABEL:fig:param-k-size and Figure LABEL:fig:param-k-sh demonstrate that the performance of SNAPS significantly improves as $k$ initially increases. This improvement occurs because more nodes with the same label are selected to enhance the ego node. Subsequently, as $k$ continues to increase, the performance of SNAPS tends to stabilize. On the other hand, when $k$ is extremely large, nodes with the same label can no longer be selected with high accuracy by feature similarity alone, so performance declines slightly. Figure LABEL:fig:param-ab-size and Figure LABEL:fig:param-ab-sh show that as the values of $\lambda$ and $\mu$ change, most areas in the heatmaps of Size and SH display similar colors. Overall, SNAPS is robust to the parameter $k$ and is not sensitive to the parameters $\lambda$ and $\mu$. To further explore the sensitivity of SNAPS to $\lambda$ and $\mu$, we set $\lambda = \mu = 1/3$, which means that the three components of SNAPS are equally weighted. The experimental results in Table 3 demonstrate that SNAPS performs well with these default hyperparameters on most datasets.
Adaption to image classification problems. In the node classification problems, SNAPS achieves better performance than standard APS, which was proposed for image classification problems. Therefore, we employ SNAPS for image classification problems. Since there are no links between different images, we utilize the cosine similarities of image features to correct the APS. Formally, the corrected APS, i.e., SNAPS, is defined as :
where is the score of standard APS, is the nearest neighbors based on image features in the calibration set and is a corrected weight. We conduct experiments on ImageNet, whose test dataset is equally divided into the calibration set and the test set. For SNAPS, we set and . We report the results of , and size-stratified coverage violation (SSCV) (Angelopoulos et al., 2021). The details of experiments and SSCV are provided in Appendix E.
As indicated in Table 4, SNAPS achieves smaller prediction sets than APS. For example, on the ResNeXt101 model at $\alpha = 0.1$, SNAPS reduces Size from 19.639 to 4.079, roughly one fifth of the prediction set size of APS, and achieves a smaller SSCV than APS. Overall, SNAPS can improve the efficiency of prediction sets while maintaining the performance of conditional coverage.
Model | Top1 Acc. | Top5 Acc. | Coverage, α=0.1 (APS/SNAPS) | Size, α=0.1 (APS/SNAPS) | SSCV, α=0.1 (APS/SNAPS) | Coverage, α=0.05 (APS/SNAPS) | Size, α=0.05 (APS/SNAPS) | SSCV, α=0.05 (APS/SNAPS) |
ResNeXt101 | 79.32 | 94.58 | 0.899/0.900 | 19.64/4.08 | 0.088/0.059 | 0.950/0.950 | 45.80/14.41 | 0.047/0.033 |
ResNet101 | 77.36 | 93.53 | 0.900/0.900 | 10.82/3.62 | 0.075/0.078 | 0.950/0.950 | 22.90/9.83 | 0.039/0.029 |
DenseNet161 | 77.19 | 93.56 | 0.900/0.900 | 12.04/3.80 | 0.077/0.067 | 0.951/0.950 | 27.99/10.66 | 0.039/0.026 |
ViT | 81.02 | 95.33 | 0.899/0.899 | 10.50/2.33 | 0.087/0.133 | 0.949/0.950 | 31.12/10.47 | 0.042/0.040 |
CLIP | 60.53 | 86.15 | 0.899/0.899 | 17.46/10.32 | 0.047/0.032 | 0.950/0.949 | 34.93/24.53 | 0.027/0.017 |
Average | - | - | 0.899/0.900 | 14.09/4.83 | 0.075/0.074 | 0.950/0.950 | 32.55/13.98 | 0.039/0.029 |
5 Related Work
Uncertainty Quantification for GNNs. Many uncertainty quantification (UQ) methods have been proposed to quantify model uncertainty for classification tasks in machine learning (Gal and Ghahramani, 2016; Guo et al., 2017; Zhang et al., 2020; Gupta et al., 2021). Recently, several calibration methods for GNNs have been developed, such as CaGCN (Wang et al., 2021), GATS (Hsu et al., 2022) and SimCalib (Tang et al., 2024). However, these UQ methods lack statistically rigorous and empirically valid coverage guarantees (Huang et al., 2023b). In contrast, SNAPS provides valid coverage guarantees both theoretically and empirically.
Conformal Prediction for GNNs. Many conformal prediction (CP) methods have been developed to provide valid uncertainty estimates for model predictions in machine learning classification tasks (Romano et al., 2020; Angelopoulos et al., 2021; Liu et al., 2024; Wei and Huang, 2024). Although several CP methods for GNNs have been studied, the use of CP on graph-structured data is still largely underexplored. ICP (Wijegunawardana et al., 2020) is the first to apply the CP framework to graphs; it designs a margin conformity score for node labels without considering the relations between nodes. NAPS (Clarkson, 2023) uses the non-exchangeable technique from Barber et al. (2023) for inductive node classification and is not applicable to the transductive setting, whereas we focus on the transductive setting, where the exchangeability property holds. Our method is essentially an enhanced version of DAPS (Zargarbashi et al., 2023), a diffusion-based method that incorporates neighborhood information by leveraging network homophily. Similar to DAPS, CF-GNN (Huang et al., 2023b) introduces a topology-aware output correction model, akin to GCN, which employs a conformal-aware inefficiency loss to refine predictions and improve the efficiency of post-hoc CP. Other recent efforts in CP for graphs include (Lunde, 2023; Marandon, 2023; Zargarbashi and Bojchevski, 2023; Sanchez-Martin et al., 2024), which focus on distinct problem settings. In this work, SNAPS takes into account both network topology and feature similarity. This method can be applied not only to graph-structured data but also to other types of data, such as image data.
6 Conclusion
In this paper, we propose SNAPS, a general algorithm that aggregates the non-conformity scores of nodes with the same label as the ego node. Specifically, we select these nodes based on feature similarity and structural neighborhood, and then aggregate their non-conformity scores to the ego node. As a result, our method could correct the scores of some nodes. Moreover, we present theoretical analyses to certify the effectiveness of this method. Extensive experiments demonstrate that SNAPS not only maintains the pre-defined coverage, but also achieves significant performance in efficiency and singleton hit ratio. Furthermore, we extend SNAPS to image classification, where SNAPS shows superior performance compared to APS.
Limitations.
Our work focuses on node classification in the transductive setting. However, many real-world classification tasks require inductive learning. In the future, we aim to apply our method to the inductive setting. Additionally, the method we use to select nodes with the same label as the ego node is both computationally inefficient and insufficiently accurate. Future work will explore more efficient and accurate methods for node selection. Moreover, while our focus is primarily on datasets with high homophily, heterophilous networks are also prevalent in practice. Consequently, further investigation is needed to enhance the adaptability of SNAPS to these networks.
Acknowledgments
This paper is supported by the National Natural Science Foundation of China (Grant No. 62192783, 62376117), the National Social Science Fund of China (Grant No. 23BJL035), the Science and Technology Major Project of Nanjing (comprehensive category) (Grant No. 202309007), and the Collaborative Innovation Center of Novel Software Technology and Industrialization at Nanjing University.
References
- Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in ai safety. arXiv preprint arXiv:1606.06565, 2016.
- Angelopoulos et al. [2021] Anastasios Nikolas Angelopoulos, Stephen Bates, Michael I. Jordan, and Jitendra Malik. Uncertainty sets for image classifiers using conformal prediction. In 9th International Conference on Learning Representations, 2021.
- Barber et al. [2023] Rina Foygel Barber, Emmanuel J Candes, Aaditya Ramdas, and Ryan J Tibshirani. Conformal prediction beyond exchangeability. The Annals of Statistics, 51(2):816–845, 2023.
- Bhatia et al. [2016] K. Bhatia, K. Dahiya, H. Jain, P. Kar, A. Mittal, Y. Prabhu, and M. Varma. The extreme classification repository: Multi-label datasets and code, 2016. URL http://manikvarma.org/downloads/XC/XMLRepository.html.
- Bojchevski and Günnemann [2018] Aleksandar Bojchevski and Stephan Günnemann. Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. In 6th International Conference on Learning Representations, 2018.
- Clarkson [2023] Jase Clarkson. Distribution free prediction sets for node classification. In International Conference on Machine Learning, pages 6268–6278, 2023.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Dong et al. [2011] Wei Dong, Moses Charikar, and Kai Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In Proceedings of the 20th International Conference on World Wide Web, pages 577–586, 2011.
- Fey and Lenssen [2019] Matthias Fey and Jan Eric Lenssen. Fast graph representation learning with pytorch geometric. arXiv preprint arXiv:1903.02428, 2019.
- Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.
- Gao et al. [2019] Jinyang Gao, Junjie Yao, and Yingxia Shao. Towards reliable learning for high stakes applications. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3614–3621, 2019.
- Gasteiger et al. [2018] Johannes Gasteiger, Aleksandar Bojchevski, and Stephan Günnemann. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997, 2018.
- Gilmer et al. [2017] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1263–1272, 2017.
- Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330, 2017.
- Gupta et al. [2021] Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. In 9th International Conference on Learning Representations, 2021.
- Hamilton et al. [2017] William L. Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pages 1024–1034, 2017.
- Hsu et al. [2022] Hans Hao-Hsun Hsu, Yuesong Shen, Christian Tomani, and Daniel Cremers. What makes graph neural networks miscalibrated? In Advances in Neural Information Processing Systems, 2022.
- Huang et al. [2023a] Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal prediction for deep classifier via label ranking. arXiv preprint arXiv:2310.06430, 2023a.
- Huang et al. [2023b] Kexin Huang, Ying Jin, Emmanuel J. Candès, and Jure Leskovec. Uncertainty quantification over graph with conformalized graph neural networks. In Advances in Neural Information Processing Systems, 2023b.
- Jiang and Luo [2022] Weiwei Jiang and Jiayun Luo. Graph neural network for traffic forecasting: A survey. Expert Systems with Applications, 207:117921, 2022.
- Jin et al. [2021a] Di Jin, Zhizhi Yu, Cuiying Huo, Rui Wang, Xiao Wang, Dongxiao He, and Jiawei Han. Universal graph convolutional networks. Advances in Neural Information Processing Systems, 34:10654–10664, 2021a.
- Jin et al. [2021b] Wei Jin, Tyler Derr, Yiqi Wang, Yao Ma, Zitao Liu, and Jiliang Tang. Node similarity preserving graph convolutional networks. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 148–156, 2021b.
- Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
- Kipf and Welling [2017] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In 5th International Conference on Learning Representations, 2017.
- Li et al. [2022] Michelle M Li, Kexin Huang, and Marinka Zitnik. Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering, 6(12):1353–1369, 2022.
- Liu et al. [2024] Kangdao Liu, Tianhao Sun, Hao Zeng, Yongshan Zhang, Chi-Man Pun, and Chi-Man Vong. Spatial-aware conformal prediction for trustworthy hyperspectral image classification. arXiv preprint arXiv:2409.01236, 2024.
- Liu et al. [2023] Yajing Liu, Zhengya Sun, and Wensheng Zhang. Improving fraud detection via hierarchical attention-based graph neural network. Journal of Information Security and Applications, 72:103399, 2023.
- Lunde [2023] Robert Lunde. On the validity of conformal prediction for network data under non-uniform sampling. arXiv preprint arXiv:2306.07252, 2023.
- Marandon [2023] Ariane Marandon. Conformal link prediction to control the error rate. arXiv preprint arXiv:2306.14693, 2023.
- Maurya et al. [2022] Sunil Kumar Maurya, Xin Liu, and Tsuyoshi Murata. Simplifying approach to node classification in graph neural networks. Journal of Computational Science, 62:101695, 2022.
- McAuley et al. [2015] Julian J. McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 43–52, 2015.
- McCallum et al. [2000] Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore. Automating the construction of internet portals with machine learning. Information Retrieval, 3(2):127–163, 2000.
- Namata et al. [2012] Galileo Namata, Ben London, Lise Getoor, and Bert Huang. Query-driven active surveying for collective classification. In 10th International Workshop on Mining and Learning with Graphs, volume 8, page 1, 2012.
- Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.
- Pei et al. [2020] Hongbin Pei, Bingzhe Wei, Kevin Chen-Chuan Chang, Yu Lei, and Bo Yang. Geom-gcn: Geometric graph convolutional networks. In 8th International Conference on Learning Representations, 2020.
- Romano et al. [2020] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, 2020.
- Rozemberczki et al. [2021] Benedek Rozemberczki, Carl Allen, and Rik Sarkar. Multi-scale attributed node embedding. Journal of Complex Networks, 9(2), 2021.
- Sanchez-Martin et al. [2024] Pablo Sanchez-Martin, Kinaan Aamir Khan, and Isabel Valera. Improving the interpretability of gnn predictions through conformal-based graph sparsification. arXiv preprint arXiv:2404.12356, 2024.
- Sen et al. [2008] Prithviraj Sen, Galileo Namata, Mustafa Bilgic, Lise Getoor, Brian Galligher, and Tina Eliassi-Rad. Collective classification in network data. AI Magazine, 29(3):93, 2008.
- Shchur et al. [2018] Oleksandr Shchur, Maximilian Mumme, Aleksandar Bojchevski, and Stephan Günnemann. Pitfalls of graph neural network evaluation. arXiv preprint arXiv:1811.05868, 2018.
- Tang et al. [2024] Boshi Tang, Zhiyong Wu, Xixin Wu, Qiaochu Huang, Jun Chen, Shun Lei, and Helen Meng. Simcalib: Graph neural network calibration based on similarity between nodes. In Thirty-Eighth AAAI Conference on Artificial Intelligence, pages 15267–15275, 2024.
- Velickovic et al. [2018] Petar Velickovic, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In 6th International Conference on Learning Representations, 2018.
- Vovk et al. [2005] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer Science & Business Media, 2005.
- Wang et al. [2020] Kuansan Wang, Zhihong Shen, Chiyuan Huang, Chieh-Han Wu, Yuxiao Dong, and Anshul Kanakia. Microsoft academic graph: When experts are not enough. Quantitative Science Studies, 1(1):396–413, 2020.
- Wang et al. [2021] Xiao Wang, Hongrui Liu, Chuan Shi, and Cheng Yang. Be confident! towards trustworthy graph neural networks via confidence calibration. In Advances in Neural Information Processing Systems, pages 23768–23779, 2021.
- Wei and Huang [2024] Hongxin Wei and Jianguo Huang. Torchcp: A library for conformal prediction based on pytorch. arXiv preprint arXiv:2402.12683, 2024.
- Wijegunawardana et al. [2020] Pivithuru Wijegunawardana, Ralucca Gera, and Sucheta Soundarajan. Node classification with bounded error rates. In Complex Networks XI: Proceedings of the 11th Conference on Complex Networks CompleNet 2020, pages 26–38. Springer, 2020.
- Xi et al. [2024] Huajun Xi, Jianguo Huang, Lei Feng, and Hongxin Wei. Delving into temperature scaling for adaptive conformal prediction. arXiv preprint arXiv:2402.04344, 2024.
- Xu et al. [2019] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In 7th International Conference on Learning Representations, 2019.
- Zargarbashi and Bojchevski [2023] Soroush H Zargarbashi and Aleksandar Bojchevski. Conformal inductive graph neural networks. In The Twelfth International Conference on Learning Representations, 2023.
- Zargarbashi et al. [2023] Soroush H. Zargarbashi, Simone Antonelli, and Aleksandar Bojchevski. Conformal prediction sets for graph neural networks. In International Conference on Machine Learning, volume 202, pages 12292–12318, 2023.
- Zhang et al. [2020] Jize Zhang, Bhavya Kailkhura, and Thomas Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In Proceedings of the 37th International Conference on Machine Learning, volume 119, pages 11117–11128, 2020.
- Zou et al. [2023] Minhao Zou, Zhongxue Gan, Ruizhi Cao, Chun Guan, and Siyang Leng. Similarity-navigated graph neural networks for node classification. Information Sciences, 633:41–69, 2023.
Appendix A Proofs
In this section, we provide the proofs that were omitted from the main paper.
A.1 Proof of Proposition 1
Proof. Zargarbashi et al. [2023] have proved that is exchangeable for . So we only need to prove that is also exchangeable for . is obtained by calculating the feature similarity between pairs of nodes from a global perspective. When constructing this matrix, we do not distinguish between labeled and unlabeled nodes; we simply build a new graph structure from the node features, without regard to the order of the nodes. So when aggregating non-conformity scores, we do not break permutation equivariance. Therefore, is a special case of a message-passing GNN layer. It follows that is invariant to permutations of the order of the calibration and test nodes on the graph. From the argument above, we conclude that is also exchangeable for .
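The permutation argument above can be checked numerically. The following is a minimal sketch, not the paper's implementation: it assumes a binary k-nearest-neighbor graph built from cosine similarity and a single mixing weight, and the names `knn_adjacency`, `aggregate_scores`, `k`, and `lam` are hypothetical. It only illustrates that an aggregation built from features alone, without reference to node order, commutes with a relabeling of the nodes.

```python
import numpy as np

def knn_adjacency(features, k):
    """Binary k-NN adjacency from cosine similarity, built from features only."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)            # a node is not its own neighbor
    neighbors = np.argsort(-sim, axis=1)[:, :k]
    adj = np.zeros_like(sim)
    adj[np.arange(len(features))[:, None], neighbors] = 1.0
    return adj

def aggregate_scores(features, scores, k=3, lam=0.5):
    """Mix each node's scores with the mean score of its k most similar nodes."""
    adj = knn_adjacency(features, k)
    return (1.0 - lam) * scores + lam * (adj @ scores) / k

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4))   # hypothetical node features
scores = rng.random(size=(8, 3))     # hypothetical non-conformity scores
perm = rng.permutation(8)            # arbitrary reordering of the nodes

aggregated = aggregate_scores(features, scores)
aggregated_after_perm = aggregate_scores(features[perm], scores[perm])
# Equivariance: permuting the inputs permutes the outputs in the same way.
is_equivariant = np.allclose(aggregated_after_perm, aggregated[perm])
```

Because the reordering only moves rows around, the aggregated scores of calibration and test nodes are unaffected by which positions those nodes occupy, which is the property the proof relies on.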
A.2 Proof of Proposition 2
Lemma 1
As stated in Proposition 2, we have
where denotes the maximum predicted probability of nodes whose ground-truth labels are , reflects the model’s error in misclassifying the ground-truth label as label and is a uniformly distributed random variable.
Proof of Lemma 1.
Here, we use APS non-conformity scores as the basic non-conformity scores. Then we have,
Suppose is the number of nodes whose ground-truth label is label . Below we discuss two cases of :
Case a. If is the largest predicted probability for node , then . Suppose the number of nodes satisfying this case is .
Case b. Otherwise, . Suppose the number of nodes satisfying this case is , where .
Therefore, summing up for both cases, we have
This simplifies to: .
Let , which reflects the model’s error in misclassifying the ground-truth label as label . Therefore, we conclude that: .
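The two cases in the proof of Lemma 1 can be illustrated with a small numerical sketch. It assumes the standard randomized APS score of Romano et al. (2020), in which the score of a label is the total probability of all labels ranked strictly above it plus a uniform fraction of its own probability. The function name `aps_scores` and the toy probabilities are hypothetical.

```python
import numpy as np

def aps_scores(probs, u):
    """Randomized APS score for every (node, label) pair.

    probs: (n, K) predicted probabilities, each row summing to 1.
    u:     (n,) draws from Uniform[0, 1], one per node.
    """
    # ranked_above[i, y, y2] is True when label y2 has a strictly larger
    # probability than label y at node i.
    ranked_above = probs[:, None, :] > probs[:, :, None]
    mass_above = (ranked_above * probs[:, None, :]).sum(axis=2)
    return mass_above + u[:, None] * probs

probs = np.array([[0.7, 0.2, 0.1],    # node 0: label 0 is the top prediction
                  [0.3, 0.6, 0.1]])   # node 1: label 0 is ranked second
u = np.array([0.5, 0.5])
scores = aps_scores(probs, u)

case_a = scores[0, 0]   # top-ranked label: only the randomized term remains
case_b = scores[1, 0]   # otherwise: mass ranked above plus the randomized term
```

In Case a the score reduces to the uniform draw times the label's own probability, while in Case b the probability mass of the higher-ranked labels is added on top.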
Proof of Proposition 2.
For ease of exposition, we refer to the " quantile of basic non-conformity scores in the calibrated set" as "the quantile score". Let and denote APS and SNAPS non-conformity scores, respectively. For node whose label is , can be expressed as
(6) |
where denotes the set of nodes whose ground-truth label is . This form is justified because, whether we aggregate over nodes with high feature similarity or over one-hop structural neighbors, the purpose is to aggregate, as far as possible, the non-conformity scores of nodes with the same label as the ego node.
In order to prove Proposition 2, we only need to prove the following: 1) SNAPS is efficient for the score corresponding to the ground-truth label of node , i.e., or . 2) SNAPS is efficient for the score corresponding to the other labels of node , i.e., or . The key idea is as follows. We want scores corresponding to the ground-truth label to fall below the quantile score, or at least to decrease relative to their original values, and scores corresponding to the other labels to rise above the quantile score, or at least to increase relative to their original values.
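To make the role of the quantile score concrete, the sketch below shows how a label enters or leaves a prediction set depending on whether its score lies below or above the threshold. It assumes the usual split conformal construction with the finite-sample rank ceil((n+1)(1-alpha)); the function names and the toy scores are hypothetical and are not taken from the paper.

```python
import numpy as np

def conformal_threshold(cal_scores, alpha):
    """Finite-sample conformal quantile of the calibration scores."""
    n = len(cal_scores)
    rank = int(np.ceil((n + 1) * (1 - alpha)))   # 1-based rank of the quantile
    if rank > n:
        return np.inf                            # too few calibration points
    return np.sort(cal_scores)[rank - 1]

def prediction_sets(test_scores, threshold):
    """Boolean (n_test, K) mask: a label is kept when its score is <= threshold."""
    return test_scores <= threshold

# Hypothetical scores of the ground-truth labels on the calibration nodes.
cal_scores = np.array([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70])
threshold = conformal_threshold(cal_scores, alpha=0.25)

# Hypothetical per-label scores for two test nodes.
test_scores = np.array([[0.35, 0.75, 0.95],
                        [0.65, 0.60, 0.20]])
sets = prediction_sets(test_scores, threshold)
```

Lowering the score of a ground-truth label below the threshold adds it to the set (helping coverage), while raising the score of a wrong label above the threshold removes it (helping efficiency), which is the pair of effects the proof establishes for SNAPS.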
Firstly
SNAPS is efficient for the score corresponding to the ground-truth label of node , i.e., or . Here we have
1) If , then
Thus, . This means that SNAPS can decrease some scores corresponding to the ground-truth label, bringing them from above the quantile score to below it. Since false scores corresponding to ground-truth labels will decrease, , where denotes quantile of SNAPS scores in the calibrated set.
2) If , then
Thus, . This means that original scores below the quantile score remain below it after aggregation.
Secondly
SNAPS is efficient for the score corresponding to the other label of node , i.e., or . Here we have
1) If , then
Thus, . This means that SNAPS can increase some scores corresponding to the other labels, bringing them from below the quantile score to above it.
2) If , then
Let