Synergistic Bleeding Region and Point Detection in Surgical Videos

Jialun Pei
CUHK
Hong Kong SAR, China
peijialun@gmail.com

Zhangjun Zhou
PolyU
Hong Kong SAR, China
zhangjun.zhou@polyu.edu.hk

Diandian Guo
CUHK
Hong Kong SAR, China
1155229775@link.cuhk.edu.hk

Zhixi Li
SMU & PolyU
Guangzhou, China
lzx22121222@smu.edu.cn

Jing Qin
PolyU
Hong Kong SAR, China
harry.qin@polyu.edu.hk

Bo Du
WHU
Wuhan, China
dubo@whu.edu.cn

Corresponding author.

Pheng-Ann Heng
CUHK
Hong Kong SAR, China
pheng@cse.cuhk.edu.hk

Abstract

Intraoperative bleeding in laparoscopic surgery rapidly obscures the operative field and hinders the surgical process. Intelligent detection of bleeding regions can quantify blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time. In this study, we first construct a real-world surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, designed to perform simultaneous detection of bleeding regions and points in surgical videos. Our framework embraces a dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2). The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures the direction of bleeding point movement through inter-frame optical flow. Through interactive guidance and prompts, the two branches explore potential spatial-temporal relationships while leveraging memory modeling from previous frames to infer the current bleeding condition. Extensive experiments demonstrate that our approach outperforms its counterparts on SurgBlood in both bleeding region and point detection tasks, e.g., achieving 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection. Our code and data will be publicly available.

1 Introduction

Figure 1: (a): Illustration of bleeding detection task with samples in SurgBlood and the predictions of our solution. (b): BlooDet performs dual-branch bidirectional guidance for synergistic bleeding mask and point detection. Zoom in for details.

Bleeding in Laparoscopic Surgery. Minimally invasive surgery has revolutionized clinical healthcare by reducing patient trauma and accelerating postoperative recovery [35]. However, intraoperative bleeding is a common contingency that significantly impacts surgical safety and efficiency [13]. Rapid changes in the amount and speed of bleeding can severely obscure the surgical field, potentially delaying the surgeon’s response and reducing the success rate of surgery. Prolonged bleeding increases the risk of organ damage, postoperative infection, and complications [11]. Therefore, utilizing computer-assisted techniques to detect bleeding regions and localize bleeding points promptly holds significant clinical value. On the one hand, detecting bleeding areas can quantify blood loss, providing timely support for intraoperative decision-making. On the other hand, precisely positioning bleeding points enables surgeons to control hemorrhage promptly and ensure surgical safety.

Challenges for Bleeding Detection. Despite the popularity of laparoscopic surgery, automated detection of bleeding regions and bleeding points still faces numerous challenges [32]. Due to the narrow field of view under laparoscopy and unstable lighting conditions, the anatomical structures are incompletely exposed, which increases the difficulty of extracting discriminative representations. Additionally, the rapid accumulation and flow of blood can change tissue appearance and infiltrate surrounding tissues, reducing the availability of low-level visual clues and complicating the detection of bleeding regions. The bleeding points may also be buried, making it difficult to locate them quickly [25]. Beyond these challenges, intelligent bleeding warning involves detecting bleeding regions and locating bleeding points during dynamic surgical procedures [15]. This requires a reliable multi-task online detector that models fine-grained spatial-temporal relationships in surgical videos for accurate predictions. Further, the lack of real-world surgical bleeding multi-task datasets remains a major obstacle to progress in this community.

Proposed Benchmark. To advance research on bleeding region and point detection in surgical videos, we construct a new real-world laparoscopic surgery bleeding dataset, named SurgBlood. This dataset comprises a total of 5,330 video frames from 95 laparoscopic cholecystectomy (LC) video clips, encompassing multiple types and intensities of bleeding encountered during surgery. As displayed in Fig. 1(a), SurgBlood also provides pixel-level annotations of bleeding regions and bleeding point coordinates by hepatobiliary surgeons. Our dataset supports the joint evaluation of bleeding region and point detection in surgical videos. We evaluate several task-relevant methods on SurgBlood to establish a comprehensive benchmark for intraoperative bleeding detection, aiming to drive further research in intelligent surgical assistance.

Existing Methods. Various deep learning-based algorithms have been demonstrated to be effective in bleeding region detection, including applications in intracranial hemorrhage detection [18] and bleeding detection in capsule endoscopy [3]. However, most algorithms [3, 18, 9] are designed for image or keyframe analysis, lacking the ability to model temporal dependencies in surgical videos. In addition, previous methods [29, 23, 32] mainly focus on bleeding region detection, which falls short of addressing the clinical needs in locating the bleeding source. The advent of Segment Anything Model 2 (SAM 2) [28] extended the generic large vision model (LVM) to the video domain, but has not yet been unified into the multi-task paradigm. Some multi-task frameworks [9, 23] detect both regions and keypoints concurrently, but they overlook mutual guidance for joint optimization across tasks. Therefore, it is desirable to develop a dual-task paradigm to synergistically detect bleeding regions and points.

Our Solution. To meet the clinical demand for bleeding region and point detection, we propose a dual-task online baseline model called BlooDet, which adopts a dual-branch bidirectional guidance structure based on video-level SAM 2 to synergistically optimize both tasks. Our framework consists of two branches: a mask branch and a point branch. In the mask branch, we embed an edge generator that performs multi-scale perception of spatial-temporal features with the Wavelet Laplacian filter to generate edge prompts, mitigating the problem of blurred bleeding boundaries in surgical scenes. Meanwhile, we incorporate bleeding points produced by the point branch as point prompts and combine them with edge prompts to facilitate bleeding region detection. For the point branch, we leverage inter-frame optical flow and bleeding mask memory to estimate laparoscopic camera motion and viewpoint offsets, improving the prediction of the bleeding point movement direction. Further, we integrate mask memory features from the mask branch to enhance bleeding point location perception. By exchanging clues and co-guiding between the two branches, BlooDet fully exploits the underlying spatial and temporal associations between bleeding regions and points. Extensive experiments on the SurgBlood dataset demonstrate that our approach achieves superior performance in both bleeding region and point detection tasks.

The main contributions of this work are four-fold:

• We debut the intraoperative bleeding region and bleeding point detection tasks in surgical videos and contribute a real-world bleeding detection dataset, termed SurgBlood, to advance the surgical intelligent assistance community.
• We propose a dual-task synergistic online detection model, BlooDet, for jointly detecting bleeding regions and points in surgical videos. Our framework embraces a dual-branch structure and performs co-optimization by mutual prompts and bidirectional guidance.
• In the point branch, we utilize inter-frame optical flow and mask memory for point memory modeling, efficiently capturing bleeding point movement cues and providing spatial-temporal modeling. Adaptive edge and point prompting strategies are introduced in the mask branch, where the edge generator is designed to exploit multi-scale Wavelet Laplacian filter convergence to enhance edge perception, and combined with bleeding points from the point branch for mask prompt embedding.
• We evaluate various task-related models on SurgBlood and establish a comprehensive benchmark to facilitate development in bleeding detection. Extensive experiments indicate that our framework outperforms other methods in bleeding region and point detection tasks.

2 Related Work

Bleeding Region Detection in Medical Domain. Bleeding region detection has been explored across various medical scenarios, such as intracranial hemorrhage detection [18], capsule endoscopy bleeding recognition [3], and retinal hemorrhage identification [39]. Deep learning-based methods employ convolutional neural networks (CNNs) and attention mechanisms to extract discriminative features for bleeding localization. In surgical videos, intraoperative bleeding detection presents unique challenges due to limited workspace and dynamic lighting variations. In this regard, Sunakawa et al. [32] developed a semantic segmentation model for automatically recognizing bleeding regions on the anatomical structure of the liver. Nonetheless, existing methods primarily focus on mask- or patch-level bleeding detection, overlooking the localization of the bleeding source. To bridge this gap, we introduce a unified paradigm for the synergistic detection of bleeding regions and points.

Keypoint Detection in Medical Images. Keypoint detection plays a crucial role in various clinical applications, e.g., pathological site identification and anatomical landmark localization [37, 1]. Existing keypoint detection methods are usually classified into three categories: 1) Context-aware spatial methods. This technique exploits the stability and uniformity of keypoint spatial distributions to improve localization accuracy [41, 21, 34, 26]. 2) Multi-stage learning strategies. These architectures follow a coarse-to-fine process to refine keypoint localization through gradual optimization and integrating shallow and deep layers to enhance performance [44, 8, 45]. 3) Multi-task learning frameworks, which jointly optimize medical image segmentation and keypoint detection by constructing a union network [42, 43, 12, 9]. Although current multi-task learning methods demonstrate strong performance, they neglect the mutual guidance between tasks. Therefore, we propose a collaborative dual-branch model that enhances both bleeding region and point detection via cross-task guidance.

Video Segmentation with Large Vision Models. With the advent of large vision models (LVMs) [16], video segmentation techniques have made great progress. Among these models, SAM 2 [28] has emerged as the leading framework, extending the capabilities of SAM from image segmentation to the video domain. By leveraging large-scale pre-training and fine-tuning techniques, SAM 2 achieves state-of-the-art performance across a range of video tasks. LVM-based video segmentation techniques open up new possibilities in various downstream applications [27, 14, 7, 2, 46, 40]. SAM2-adapter [7] embeds adapter layers into SAM 2 to enhance the model’s flexibility, improving cross-task generalization. In the domain of surgical video analysis, SurgSAM-2 [22] utilizes SAM 2 along with a frame pruning mechanism for efficient instrument segmentation. Based on the powerful capabilities of SAM 2, we design a dual-task paradigm that allows for joint bleeding region and point detection in complex surgical environments.

3 SurgBlood Dataset

We construct a brand-new dataset specifically for the bleeding process in laparoscopic cholecystectomy (LC), named SurgBlood. The dataset involves two complementary tasks: bleeding region and bleeding point detection. We provide an overview of our dataset from three aspects: data collection, data annotation, and data analysis.

3.1 Data Collection

To ensure high-quality and representative data, we invited four hepatobiliary surgeons from partner hospitals to carefully select 95 LC video clips. Each clip covers the entire bleeding process while retaining the non-bleeding scene for approximately 3 seconds before and after the bleeding event. We collect a total of 5,330 video frames with a resolution of 1280 $\times$ 720 from all clips using a sampling rate of 2 fps. Notably, we focus exclusively on dynamic bleeding regions within the surgical action field, as this is the critical location that directly interferes with the surgeon and contains key bleeding points. As shown in Fig. 2, there are four bleeding types by tissue location: gallbladder, cystic triangle, vessel, and gallbladder bed.

3.2 Data Annotation

To ensure the annotation quality of SurgBlood, the invited hepatobiliary surgeons meticulously annotate and review each video clip. During the labeling process, the surgeon uses both static frames and dynamic video sequences to label bleeding regions and bleeding point coordinates for each frame. Annotation is guided by the following principles: 1) For bleeding regions, pooled blood and inactive sparse bloodstains are not annotated. 2) For bleeding points, if the bleeding point is unobstructed, its coordinates are labeled; if the bleeding point is covered by blood, annotators trace back to the first frame where the bleeding occurred and localize the point based on the surrounding anatomy. These situations bring distinct characteristics and challenges to our dataset. To ensure annotation consistency, we adopt the cross-validation strategy: each clip is initially annotated by two surgeons, followed by a review and refinement process conducted by two additional surgeons. Fig. 2 presents examples of annotations for various bleeding situations.

3.3 Data Analysis

• Clip Distribution: SurgBlood includes 5,330 frames extracted from 95 video clips. As shown in Fig. 3, each clip contains an average of 56 frames, with the longest containing 300 frames and the shortest containing 8 frames. We also counted the distribution of bleeding types: gallbladder (21.64%), cystic triangle (25.01%), vessel (15.78%), and gallbladder bed (37.75%).
• Bleeding Ratio: To reflect the information density of data, we calculate the proportion of frames containing bleeding regions and points. As shown in the left of Fig. 4, both bleeding regions and points appear in a high proportion of frames, where the slightly higher proportion for bleeding regions is due to the partial occlusion of bleeding points.
• Space Statistics: We analyze the spatial distribution and center bias of bleeding regions and points across all samples in SurgBlood. The right of Fig. 4 provides statistical insights into the distances of bleeding region centers and bleeding points from the image center. Besides, Fig. 5 visualizes their center bias. We can see that bleeding caused by surgical operations predominantly occurs near the image center, and bleeding points are contained within bleeding regions for the vast majority of the time. These statistical priors drive us to design synergistic guidance for the two bleeding detection tasks.

4 Proposed Method

4.1 Overall Architecture

BlooDet is a dual-task collaborative detector for simultaneous bleeding region and point detection in surgical videos. As shown in Fig. 6, our framework revises the body of SAM 2 [28] by empowering edge clues to detect bleeding regions and incorporating a point branch to detect bleeding points. The whole model consists of the following components: image encoder, mask/point memory modeling, edge generator, mask/point decoder, and mask/point memory bank.

Image Encoding. Given a set of $N$ video frames, including the current frame $I_{k}\in\mathbb{R}^{H\times W\times 3}$ and the previous $N-1$ frames $X=\{I_{i}\}^{k-1}_{i=k-N+1}$, we first flatten the frames and feed them into the image encoder inherited from SAM 2 to produce multi-scale spatial features $F\in\mathbb{R}^{s\times c\times N}$, where $s$ denotes the length of the feature sequence and $c$ is the feature dimension. Then, the output sequential frame features $F_{k-N+1},...,F_{k}$ are fed into the mask and point branches, respectively, for memory modeling.

Point Branch. As shown in the bottom of Fig. 6, the point branch comprises three main parts: point memory modeling, point decoder, and point memory bank. Compared to mask memory modeling, the point memory modeling module embeds optical flow estimation to predict the displacement field between consecutive frames $[I_{k-7},...,I_{k}]$. This operation enables the inference of laparoscopic camera motion and viewpoint offset during surgery, thereby identifying the movement direction of current bleeding points. Further, we integrate previous mask memory features $\{M^{m}_{q}\}_{q=k-7}^{k-1}$ from the mask branch with the corresponding point features to enhance location and temporal perception as well as to narrow the search space for bleeding point coordinates. We describe this process in detail in Sec. 4.2.

Next, the memory-enhanced point feature $F_{point}$ passes through the point decoder to predict the bleeding point. Different from the upsampling fusion in the mask decoder, we employ learnable output tokens and prompt tokens consistent with SAM 2 that interact with $F_{point}$ through self- and cross-attention operations [5], followed by MLP layers to predict bleeding point coordinates and confidence scores. The point memory is stored in the point memory bank after point encoding to enhance temporal modeling.

Mask Branch. This branch focuses on predicting bleeding regions by mask memory modeling and coupling bleeding edge and point prompts. The top of Fig. 6 illustrates the pipeline of our mask branch. The current frame features $F_{k}$ are first fed into mask memory modeling to perform self- and cross-attention interaction [36] with mask memory features from previous frames, producing the spatial-temporal feature $F_{mask}$. After that, we introduce an edge generator that applies multi-scale Wavelet Laplacian filters to $F_{mask}$ for edge refinement. Then, we incorporate high-resolution features from the image encoder to obtain edge maps (detailed description in Sec. 4.3). Unlike the manual intervention prompts in SAM [16, 28], we form adaptive prompt embeddings by combining the edge map $E_{m}$ from the edge generator with the point map $P_{m}$ output from the point branch. Then, the prompt encoder is utilized to encode $E_{m}$ and $P_{m}$ to yield prompt features $E_{p}$ and $P_{p}$:

$$E_{p},P_{p}=\mathcal{P}[E_{m},P_{m}], \tag{1}$$

$$E_{p}=\texttt{Cv}\big(\texttt{LN}(\mathbf{G}(\texttt{Cv}(\texttt{LN}(\mathbf{G}(\texttt{Cv}(E_{m}))))))\big), \tag{2}$$

$$P_{p}=\mathcal{C}[\sin(2\pi(\texttt{Po}(P_{m}))),\cos(2\pi(\texttt{Po}(P_{m})))]+\texttt{Le}, \tag{3}$$

where $\mathcal{P}$ represents prompt encoding, $\mathbf{G}$ denotes the GeLU function, Cv refers to the 2 $\times$ 2 convolution operation, and LN is layer normalization. Also, Po denotes positional encoding [33], Le stands for learned embeddings, and $\mathcal{C}[,]$ is the concatenation operation. We input the prompt features along with $F_{mask}$ into the mask decoder, similar to SAM 2, and obtain the predicted bleeding mask by upsampling and integrating with high-resolution features. Subsequently, we employ memory encoding to obtain the mask memory feature $M^{m}_{k}$ and store it in the mask memory bank. Furthermore, the mask maps are also updated in the mask memory bank to provide spatial guidance for bleeding point detection across consecutive frames.
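To make the point prompt encoding of Eq. (3) concrete, the following is a minimal, dependency-free sketch. It is an illustration under stated assumptions rather than the implementation used in BlooDet: the coordinate normalisation to $[-1, 1]$, the random Gaussian projection standing in for Po, the number of frequencies `NUM_FREQ`, and the all-zero stand-in for the learned embedding Le are all hypothetical choices.

```python
import math
import random


def encode_point(xy, width, height, proj, learned):
    """Sketch of Eq. (3): concatenated sin/cos features plus a learned embedding."""
    # Normalise pixel coordinates to [-1, 1] (an assumption, not stated in the text).
    u = 2.0 * xy[0] / width - 1.0
    v = 2.0 * xy[1] / height - 1.0
    # Po(.): project the 2-D coordinate onto len(proj) frequencies.
    po = [u * wx + v * wy for wx, wy in proj]
    sin_part = [math.sin(2.0 * math.pi * p) for p in po]
    cos_part = [math.cos(2.0 * math.pi * p) for p in po]
    # C[sin, cos] + Le: concatenate, then add the embedding element-wise.
    return [f + e for f, e in zip(sin_part + cos_part, learned)]


rng = random.Random(0)
NUM_FREQ = 4  # hypothetical; practical prompt encoders use many more frequencies
proj = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(NUM_FREQ)]
learned = [0.0] * (2 * NUM_FREQ)  # stand-in for the learned embedding Le
embedding = encode_point((256, 128), 512, 512, proj, learned)
print(len(embedding))  # 2 * NUM_FREQ values per point
```

With the embedding set to zero, each sine/cosine pair lies on the unit circle, which is a quick sanity check that the concatenation order is as intended.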

Cross-branch Guidance. A key design in our framework is bidirectional collaborative guidance between the mask and point branches, enabling simultaneous optimization of bleeding region and bleeding point predictions. For the mask decoder, we exploit the point map produced by the point decoder as an automatic prompt input. This helps guide the decoder to focus on the target bleeding region while mitigating the interference from residual blood in the surrounding area. In point memory modeling, the predicted mask maps from previous frames can assist in optical flow estimation and in predicting the direction of the bleeding point. Besides, mask memory features are merged with point memory features to induce the point decoder to concentrate on the most likely bleeding areas while mitigating the impact of low-contrast background. Through this cross-branch guidance, the mask and point branches are synergistically optimized, resulting in more consistent bleeding region and point predictions in successive frames.

4.2 Point Memory Modeling in Point Branch

To detect bleeding points in consecutive frames effectively, we embed the point memory modeling module in the point branch to develop temporal clues for point features. As illustrated in Fig. 6, point memory modeling is divided into two steps: 1) combining the optical flow of consecutive frames with region maps to compensate for the viewpoint offset of the camera, and 2) interacting the average camera displacement of previous frames with mask memory features from the mask branch to obtain point memory features.

For the viewpoint offset of the camera, we first utilize the frozen PWC-Net [30] for optical flow estimation. Given $N$ frames $\{I_{i}\}^{k}_{i=k-N+1}$ , the optical flow $O_{i}(x,y)\in\mathbb{R}^{{H}\times{W}\times{2}}$ between two consecutive frames can be expressed as $O_{i}(x,y)=\texttt{PWC-Net}(I_{i-1},I_{i})$ . Considering the instability of the optical flow in the rapidly changing bleeding region, we reverse the mask map $M_{i}$ from mask branch for each frame and combine with $O_{i}(x,y)$ to obtain the average viewpoint offset $\bar{O_{i}}(\Delta x,\Delta y)$ :

$$\bar{O}_{i}(\Delta x,\Delta y)=\frac{1}{H\times W}\sum_{x=1}^{H}\sum_{y=1}^{W}(1-M_{i}(x,y))\cdot O_{i}(x,y). \tag{4}$$

Then, the global offset coordinates $\bar{O_{i}}\in\mathbb{R}^{2}$ of previous frames can be produced through an MLP layer.
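The averaging in Eq. (4) can be sketched in a few lines of plain Python. This is only an illustration: the function name `average_background_flow` and the nested-list data layout are hypothetical, the flow values are made up, and the subsequent MLP layer is not modeled.

```python
def average_background_flow(flow, mask):
    """Sketch of Eq. (4): mean optical flow with bleeding pixels zeroed out.

    flow: H x W nested list of (dx, dy) displacements.
    mask: H x W nested list of 0/1 values (1 = bleeding region).
    """
    h, w = len(flow), len(flow[0])
    sum_dx = 0.0
    sum_dy = 0.0
    for y in range(h):
        for x in range(w):
            keep = 1.0 - mask[y][x]  # reversed mask keeps background pixels only
            sum_dx += keep * flow[y][x][0]
            sum_dy += keep * flow[y][x][1]
    # Eq. (4) normalises by the full frame size H x W.
    return sum_dx / (h * w), sum_dy / (h * w)


# Toy 2 x 2 frame: the background shifts by (2, 0); the bleeding pixel moves erratically.
flow = [[(2.0, 0.0), (2.0, 0.0)],
        [(2.0, 0.0), (10.0, 10.0)]]
mask = [[0, 0],
        [0, 1]]
print(average_background_flow(flow, mask))  # (1.5, 0.0)
```

Note that, because Eq. (4) divides by $H\times W$ rather than by the number of background pixels, the estimate is scaled down in proportion to the masked area (1.5 instead of 2.0 in the toy example).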

After that, we aggregate point memory features $M^{p}_{i}$ of previous frames in the point memory bank with $\bar{O_{i}}$ and concatenate them with mask memory features $M^{m}_{i}$ to obtain the mask-guided corrected point features $\bar{F}^{ref}_{i}$. Lastly, we perform a self-attention operation on $F_{k}$ and cross-attention with $\bar{F}^{ref}_{i}$ to obtain the memory-enhanced point feature $F_{point}$. Through the motion compensation mechanism based on optical flow estimation together with cross-branch mask guidance, we perform effective memory modeling of bleeding point features in laparoscopic scenes with camera offset.

4.3 Edge Generator in Mask Branch

Detecting bleeding regions is challenging due to the low contrast and high noise in surgical scenes. To this end, we embed a dedicated edge generator in mask branch, which enhances the accuracy of bleeding region detection by combining multi-scale Wavelet Laplacian filters [17] with high-resolution features containing lower-level texture clues to generate edge map prompts. Concretely, we first input the spatial-temporal features $F_{mask}$ into the Gabor wavelet Laplacian filter to enhance edge structures in the spatial domain. The Gabor Wavelet operation on the spatial position (x, y) is computed as

$$\mathcal{G}(x,y;\lambda,\theta,\psi,\sigma,\gamma)=\exp\left(-\frac{x^{\prime 2}+\gamma^{2}y^{\prime 2}}{2\sigma^{2}}\right)\exp\left(i\left(\frac{2\pi}{\lambda}x^{\prime}+\psi\right)\right), \tag{5}$$

where $x^{\prime}=x\cos\theta+y\sin\theta$, $y^{\prime}=-x\sin\theta+y\cos\theta$, $\lambda$ denotes the wavelength, $\theta$ is the orientation angle of the Gabor kernel, $\psi$ is the phase offset, $\sigma$ is the scale of the Gaussian function, and $\gamma$ stands for the aspect ratio. Therefore, Laplacian filtering based on the Gabor wavelet is defined as

$$\mathbf{L}_{\mathbf{g}}(x,y)=\Delta f(x,y)\cdot\mathcal{G}(x,y), \tag{6}$$

$$\Delta f(x,y)=\frac{\partial^{2}f}{\partial x^{2}}+\frac{\partial^{2}f}{\partial y^{2}}, \tag{7}$$

where $\Delta{f(x,y)}$ represents the Laplacian operator in 2D space. Then, we perform an activation operation on $F_{mask}$ and interact with the filtered features to suppress low-confidence signals and preserve refined edge features. The whole process can be described as

$$F^{\prime}_{mask}=(\texttt{ReLU}(F_{mask}))\odot(\mathbf{L}_{\mathbf{g}}(x,y)\ast F_{mask}), \tag{8}$$

where $\ast$ denotes the convolution operation and $\odot$ denotes element-wise multiplication. As shown in Fig. 6, we upsample $F_{mask}$ twice in parallel, pass each result through the Wavelet Laplacian filters, and then interact the outputs with the high-resolution features, i.e., $F_{1}$ and $F_{2}$, to further refine the edge information. Finally, the generated edge map is fed into the mask decoder as the edge prompt for bleeding region detection.
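The filtering steps in Eqs. (5)-(8) can be sketched on a single-channel toy feature map as follows. This is one possible reading, not the authors' implementation: it assumes that only the real part of the Gabor kernel is used, that the Laplacian in Eq. (6) is a 5-point discrete stencil applied to the sampled kernel, and that the outer product in Eq. (8) is an element-wise gate. All function names and parameter values are hypothetical.

```python
import math


def gabor_real(size, lam, theta, psi, sigma, gamma):
    """Real part of Eq. (5), sampled on an odd size x size grid centred at 0."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr * xr + gamma * gamma * yr * yr) / (2.0 * sigma * sigma))
            row.append(envelope * math.cos(2.0 * math.pi * xr / lam + psi))
        kernel.append(row)
    return kernel


def laplacian(img):
    """5-point discrete Laplacian with zero padding (discrete form of Eq. (7))."""
    h, w = len(img), len(img[0])

    def at(y, x):
        return img[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    return [[at(y - 1, x) + at(y + 1, x) + at(y, x - 1) + at(y, x + 1) - 4.0 * at(y, x)
             for x in range(w)] for y in range(h)]


def convolve_same(img, kernel):
    """2-D convolution with zero padding; output has the same size as img."""
    h, w = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[k - dy][k - dx]
            out[y][x] = acc
    return out


def gate(feature, filtered):
    """Eq. (8) read as ReLU(feature) times filtered, element by element."""
    return [[max(f, 0.0) * g for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(feature, filtered)]


# Toy feature map with a vertical step edge between columns 2 and 3.
feature = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
lg_kernel = laplacian(gabor_real(3, lam=4.0, theta=0.0, psi=0.0, sigma=2.0, gamma=0.5))
refined = gate(feature, convolve_same(feature, lg_kernel))
```

In this reading, the ReLU gate zeroes the response wherever the input feature is non-positive, so only locations with positive activation keep their filtered edge response.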

Types	Methods	Venue	Region IoU $\uparrow$	Region Dice $\uparrow$	Point PCK-2% $\uparrow$	Point PCK-5% $\uparrow$	Point PCK-10% $\uparrow$	# Params (M) $\downarrow$
Region-level	Swin-UNet [4]	ECCV’22	41.31	58.47	-	-	-	27.2
	SAM [16]	ICCV’23	40.43	57.49	-	-	-	93.7
	SAM-Adapter [6]	ICCV’23	54.80	70.80	-	-	-	93.8
	MemSAM [10]	CVPR’24	55.34	71.28	-	-	-	133.3
	SAM 2 [28]	ICLR’25	63.51	77.68	-	-	-	80.8
	SAM2-Adapter [7]	arXiv’24	64.23	77.95	-	-	-	88.8
Point-level	HRNet [31]	CVPR’19	-	-	3.13	15.98	44.31	63.6
	SimCC [19]	ECCV’22	-	-	2.14	14.99	46.95	66.3
	GTPT [38]	ECCV’24	-	-	2.80	13.01	38.38	16.7
	D-CeLR [8]	ECCV’24	-	-	5.10	27.67	60.13	53.4
Multi-task	PAINet [9]	MICCAI’23	44.14	61.24	2.47	15.48	48.43	13.6
	PitSurgRT [23]	IJCARS’24	30.48	46.72	2.47	13.84	41.68	67.3
	SAM 2^† [28]	ICLR’25	50.93	67.49	12.35	41.68	71.99	81.0
	BlooDet (Ours)	-	64.88	78.70	18.62	55.85	83.69	91.6

Table 1: Overall comparison with cutting-edge methods on the SurgBlood test set. SAM 2^† denotes SAM 2 with an added point head.

4.4 Loss Function

The total loss function of BlooDet consists of mask loss $\mathcal{L}_{mask}$ and edge loss $\mathcal{L}_{edge}$ from mask branch, as well as point loss $\mathcal{L}_{point}$ and score loss $\mathcal{L}_{score}$ from point branch:

$$\mathcal{L}=\lambda_{m}\mathcal{L}_{mask}+\lambda_{e}\mathcal{L}_{edge}+\lambda_{s}\mathcal{L}_{score}+\lambda_{p}\mathcal{L}_{point}. \tag{9}$$

Both $\mathcal{L}_{mask}$ and $\mathcal{L}_{edge}$ are computed as a combination of Focal loss [20] and Dice loss [24]. In addition, $\mathcal{L}_{point}$ employs the smooth L1 loss for point-level supervision:

$$\mathcal{L}_{point}=\sum_{i=1}^{N}\mathbf{1}_{\{p_{i}\neq[0,0]\}}\cdot L1(\hat{y}_{\text{p}}^{i},y_{\text{p}}^{i}), \tag{10}$$

where $\hat{y}_{\text{p}}^{i}$ is the predicted point location and $y_{\text{p}}^{i}$ is the ground-truth point location. $\mathbf{1}_{\{p_{i}\neq[0,0]\}}$ denotes an indicator function that ensures the loss is calculated only when the point is not zero. Besides, $\mathcal{L}_{score}$ is the binary cross-entropy loss for point existence in point branch. $\lambda_{m}$ , $\lambda_{e}$ , $\lambda_{s}$ , and $\lambda_{p}$ are empirically set to 1, 1, 1, and 0.5, respectively, to balance the total loss function.
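As a rough illustration of Eqs. (9) and (10), the sketch below computes the masked point loss and the weighted total. Several details are assumptions made for the example: the smooth L1 transition point `beta=1.0`, summing (rather than averaging) over the two coordinates, and interpreting $p_{i}$ as the ground-truth point, with $[0,0]$ marking frames without a visible bleeding point. The mask, edge, and score loss values are placeholders.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over coordinates (beta = 1.0 is an assumed default)."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def point_loss(preds, targets):
    """Sketch of Eq. (10): skip frames whose ground-truth point is [0, 0]."""
    return sum(smooth_l1(p, t) for p, t in zip(preds, targets) if tuple(t) != (0, 0))


def total_loss(l_mask, l_edge, l_score, l_point,
               lam_m=1.0, lam_e=1.0, lam_s=1.0, lam_p=0.5):
    """Eq. (9) with the weights reported in the text (1, 1, 1, 0.5)."""
    return lam_m * l_mask + lam_e * l_edge + lam_s * l_score + lam_p * l_point


# Two frames with normalised coordinates; the second frame has no bleeding point.
preds = [(0.5, 0.5), (0.3, 0.3)]
targets = [(0.5, 0.7), (0, 0)]
l_point = point_loss(preds, targets)  # only the first frame contributes
loss = total_loss(l_mask=1.0, l_edge=0.5, l_score=0.2, l_point=l_point)
```

Here the second frame is ignored by the indicator, so the point loss is roughly 0.02 and the total is roughly 1.71.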

5 Experiments

5.1 Datasets and Evaluation Metrics

Datasets. Since the task of cooperative bleeding region and bleeding point detection is proposed for the first time, we can only adopt the proposed SurgBlood dataset to train and test our method and related comparative methods. We randomly divided the 95 video clips into 75 for training and the remaining 20 for testing.

Evaluation Metrics. Following previous studies [28, 9], we adopt the Intersection over Union (IoU) and Dice Coefficient (Dice) metrics to evaluate bleeding region detection performance. For bleeding points, the Percentage of Correct Keypoints (PCK) metric is used to measure localization accuracy. Unlike the 10% to 40% threshold range applied in [9, 23] for anatomical structure centroids, we adopt a narrower threshold range of 2%–10% to ensure a stricter assessment, i.e., PCK-2%, PCK-5%, and PCK-10%. This is due to the requirement for higher precision and lower tolerance of bleeding point detection in laparoscopic surgery.
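For reference, the three metrics can be computed as in the following sketch. The text does not state the reference length for the PCK threshold; normalising by the image diagonal is an assumption made here purely for illustration, as is the convention of returning 1.0 when both masks are empty. The masks and points are toy values.

```python
import math


def iou_and_dice(pred, gt):
    """IoU and Dice for binary masks given as nested lists of 0/1."""
    inter = sum(p * g for p_row, g_row in zip(pred, gt) for p, g in zip(p_row, g_row))
    pred_area = sum(sum(row) for row in pred)
    gt_area = sum(sum(row) for row in gt)
    union = pred_area + gt_area - inter
    if union == 0:  # both masks empty: treated as a perfect match (assumption)
        return 1.0, 1.0
    return inter / union, 2.0 * inter / (pred_area + gt_area)


def pck(preds, gts, width, height, ratio):
    """Fraction of points within ratio * image diagonal of the ground truth."""
    threshold = ratio * math.hypot(width, height)  # assumed normalisation
    hits = sum(1 for (px, py), (gx, gy) in zip(preds, gts)
               if math.hypot(px - gx, py - gy) <= threshold)
    return hits / len(preds)


pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
iou, dice = iou_and_dice(pred_mask, gt_mask)

pred_points = [(100, 100), (300, 300)]  # errors of 5 px and 40 px
gt_points = [(103, 104), (340, 300)]
scores = {r: pck(pred_points, gt_points, 512, 512, r) for r in (0.02, 0.05, 0.10)}
```

On a 512 $\times$ 512 frame with this normalisation, the 2% threshold is roughly 14.5 pixels, so the 40-pixel error only counts as correct at PCK-10%.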

5.2 Implementation Details

Our framework is implemented on two RTX 4090 GPUs. During training, we input eight consecutive frames in an online manner with resolution resized to 512 $\times$ 512 pixels. The image encoder of BlooDet is initialized with pre-trained weights from SAM 2_base [28]. Additionally, we utilize a frozen PWC-Net [30] to compute inter-frame optical flow. No data augmentation is applied during data loading. The maximum learning rate for the image encoder is set to 5e-6, while other parts are trained with a learning rate of 5e-4. To optimize the training process, we employ the Adam optimizer with a warm-up strategy and linear decay, training for 20 epochs. During inference, we perform frame-by-frame inference in line with SAM 2 to get results for each frame.
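The warm-up and linear-decay schedule with two peak learning rates can be sketched as below. The warm-up length and total step count are not given in the text, so `WARMUP_STEPS` and `TOTAL_STEPS` are hypothetical values chosen only to make the example concrete.

```python
def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


TOTAL_STEPS = 100   # hypothetical
WARMUP_STEPS = 10   # hypothetical
ENCODER_PEAK_LR = 5e-6  # image encoder, as stated in the text
OTHER_PEAK_LR = 5e-4    # remaining modules, as stated in the text

for step in (0, 9, 10, 55, 100):
    enc = lr_at(step, TOTAL_STEPS, WARMUP_STEPS, ENCODER_PEAK_LR)
    rest = lr_at(step, TOTAL_STEPS, WARMUP_STEPS, OTHER_PEAK_LR)
    print(f"step {step:3d}: encoder lr = {enc:.2e}, other lr = {rest:.2e}")
```

Keeping the encoder rate two orders of magnitude below the rest is a common way to preserve pre-trained features while the newly added modules are trained more aggressively.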

5.3 Performance on SurgBlood Benchmark

We develop a comprehensive evaluation benchmark on the SurgBlood dataset. BlooDet is compared with 13 task-related methods for bleeding region and point detection. These approaches include multi-task detection models [9, 23], video- and image-level object segmentation methods [4, 16, 28, 6, 10, 7], and pose-based point detection methods [31, 19, 38, 8]. For fairness, all methods use the official code and adapt only the head to fit bleeding tasks.

EG	PMM	IoU $\uparrow$	Dice $\uparrow$	PCK-2% $\uparrow$	PCK-5% $\uparrow$	PCK-10% $\uparrow$
✗	✗	50.93	67.49	12.35	41.68	71.99
✓	✗	64.78	78.36	8.23	40.04	75.94
✗	✓	61.20	75.93	14.33	51.57	80.89
✓	✓	64.88	78.70	18.62	55.85	83.69

Table 2: Ablations for key components of BlooDet for bleeding region and point detection on SurgBlood test set. EG and PMM denote edge generator and point memory modeling modules.

Configs	IoU $\uparrow$	Dice $\uparrow$	# Params (M)
w/o Edge generator	61.20	75.93	90.99
w/o Laplacian Filter	60.79	75.61	91.58
w/o $F_{1}\&F_{2}$	64.74	78.60	91.55
Edge generator	64.88	78.70	91.58

Table 3: Influence of edge generator for bleeding region detection.

Quantitative Evaluation. Table 1 presents the performance of our framework and other comparison methods on the SurgBlood test set for bleeding region and point detection. Our dual-task synergistic framework outperforms competitors across both tasks, even surpassing models designed for a single task. Benefiting from edge-enhanced prompts and bleeding point guidance, BlooDet achieves notable improvements in bleeding region detection. For bleeding point detection, BlooDet significantly surpasses other methods, improving PCK-2% and PCK-5% by 6.27 and 14.17 percentage points, respectively, over the second-best method (SAM 2^†). In addition, our dual-branch framework built upon large vision models obtains superior performance while adding few parameters.

Qualitative Evaluation. Fig. 7 shows a visual comparison of our approach with other multi-task frameworks. We can see that BlooDet provides greater stability and consistency in detecting bleeding regions and points. In complex surgical environments with low contrast, competitors tend to be disturbed by surrounding noise. In contrast, our method ensures robust bleeding detection across consecutive frames via spatial-temporal modeling and co-guidance.

5.4 Ablation Analysis

Contributions of Key Components. Table 2 illustrates the contribution of key components in BlooDet for bleeding region and point detection. The experimental results show that point memory modeling contributes significantly to detecting bleeding points, e.g., improving the PCK-2% score by about 10.4 percentage points (from 8.23 to 18.62 when the edge generator is enabled). Moreover, our edge generator provides effective edge prompt embedding, which greatly improves the accuracy of bleeding region detection. In short, each key component contributes positively to model performance.

Ablations for Edge Generator. We investigate the effect of different edge generator designs embedded in the mask branch. As shown in Table 3, the Wavelet Laplacian filters respond strongly to edge features, effectively mitigating interference from complex background noise in laparoscopic scenes; removing the Laplacian filter lowers IoU from 64.88% to 60.79%. Meanwhile, integrating high-resolution features with more texture information further enhances the quality of edge maps.
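As a rough intuition for why a Laplacian filter responds to edges, the sketch below applies a plain 3x3 Laplacian kernel to a single-channel map. This is not the Wavelet Laplacian filter of our edge generator, which operates on learned features; the kernel, the replicate-border handling, and the name `laplacian_response` are illustrative assumptions.

```python
# Standard 4-neighbour discrete Laplacian kernel.
LAPLACIAN_KERNEL = [[0, 1, 0],
                    [1, -4, 1],
                    [0, 1, 0]]


def laplacian_response(image):
    """Absolute Laplacian response of a 2D map (nested lists), replicating border pixels."""
    height, width = len(image), len(image[0])
    response = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    # Clamp neighbour coordinates so borders reuse the nearest pixel.
                    yy = min(max(y + ky - 1, 0), height - 1)
                    xx = min(max(x + kx - 1, 0), width - 1)
                    acc += LAPLACIAN_KERNEL[ky][kx] * image[yy][xx]
            response[y][x] = abs(acc)
    return response


# A vertical step edge: the response is non-zero only next to the intensity jump.
step_image = [[0, 0, 1, 1] for _ in range(4)]
for row in laplacian_response(step_image):
    print(row)
```

Flat regions produce zero response while pixels adjacent to an intensity jump produce a non-zero response, which is the property that makes second-derivative filters useful for highlighting region boundaries.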

Mask map	PCK-2% $\uparrow$	PCK-5% $\uparrow$	PCK-10% $\uparrow$
Foreground	12.03	41.02	71.99
Background	18.62	55.85	83.69
Global	15.49	49.59	82.00

Table 4: Ablations for optical flow operation via mask maps.

Optical Flow Operation Design. In point memory modeling, we adopt reversed mask maps in conjunction with optical flow maps to estimate the average camera displacement. To validate this design, we ablate the impact of focusing on different regions of the mask maps for bleeding point detection. Table 4 indicates that using the foreground region yields the lowest scores (e.g., 12.03% PCK-2% versus 18.62% with the background). A likely explanation is the poor stability of optical flow in the rapidly changing bleeding area. In contrast, utilizing the background enables more stable motion modeling and point localization.
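The background-based displacement estimate can be sketched as a masked average of the flow field. The code below is a simplified illustration under stated assumptions: the flow is given as a grid of (dx, dy) vectors, the mask marks bleeding pixels with 1, and the hypothetical helper `propagate_point` simply shifts the previous point by the estimated displacement. How BlooDet fuses this displacement with point memory is not reproduced here.

```python
def background_displacement(flow, bleeding_mask):
    """Mean (dx, dy) of the flow over pixels where the bleeding mask is 0 (the reversed mask)."""
    sum_dx = sum_dy = 0.0
    count = 0
    for flow_row, mask_row in zip(flow, bleeding_mask):
        for (dx, dy), is_bleeding in zip(flow_row, mask_row):
            if not is_bleeding:  # keep background pixels only
                sum_dx += dx
                sum_dy += dy
                count += 1
    if count == 0:  # no background visible: assume no camera motion
        return 0.0, 0.0
    return sum_dx / count, sum_dy / count


def propagate_point(prev_point, flow, bleeding_mask):
    """Shift the previous bleeding point by the estimated camera displacement (illustrative)."""
    dx, dy = background_displacement(flow, bleeding_mask)
    return prev_point[0] + dx, prev_point[1] + dy


# Toy 2x3 flow field: background moves by (2, -1); the bleeding column has erratic flow.
flow = [[(2.0, -1.0), (9.0, 9.0), (2.0, -1.0)],
        [(2.0, -1.0), (-7.0, 4.0), (2.0, -1.0)]]
mask = [[0, 1, 0],
        [0, 1, 0]]
print(background_displacement(flow, mask))
print(propagate_point((5.0, 5.0), flow, mask))
```

In this toy example, averaging over the background ignores the erratic vectors inside the bleeding region, mirroring the observation in Table 4 that the background gives a more stable motion estimate than the foreground or the whole frame.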

Effect of Bidirectional Guidance. To validate the effectiveness of our cross-branch bidirectional guidance, Fig. 9 ablates the impact of the point prompt from the point branch, as well as the mask map and mask memory from the mask branch. The results indicate that the point prompt contributes to bleeding region detection. For the point branch, the mask maps from previous frames assist the optical flow operation, while mask memory fosters point memory modeling to improve the accuracy of point localization. The attention maps from the mask and point decoders visualized in Fig. 8 also verify the effect of bidirectional guidance.

6 Conclusion

This work advances the intelligent detection of bleeding regions and bleeding points in laparoscopic surgical videos. We contribute a new dataset for bleeding detection in actual surgical scenarios, SurgBlood, to facilitate benchmark construction. Accordingly, we design a dual-task synergistic online framework called BlooDet, which assembles mask and point branches in a bidirectional guidance structure, and exploits an edge generator and point memory modeling to enhance the adaptive prompting mechanism. Extensive experimental results demonstrate that our method outperforms existing related models for both bleeding tasks. We believe that this study can facilitate research in intelligent surgical assistance, reducing intraoperative decision-making risks and improving clinical outcomes.

References

Ali et al. [2025] Sharib Ali, Yamid Espinel, Yueming Jin, Peng Liu, Bianca Güttner, Xukun Zhang, Lihua Zhang, Tom Dowrick, Matthew J Clarkson, Shiting Xiao, et al. An objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion from the miccai2022 challenge. Medical Image Analysis, 99:103371, 2025.
Bai et al. [2024] Yunhao Bai, Qinji Yu, Boxiang Yun, Dakai Jin, Yingda Xia, and Yan Wang. Fs-medsam2: Exploring the potential of sam2 for few-shot medical image segmentation without fine-tuning. arXiv preprint arXiv:2409.04298, 2024.
Bourbakis et al. [2005] N Bourbakis, Sokratis Makrogiannis, and Despina Kavraki. A neural network-based detection of bleeding in sequences of wce images. In Fifth IEEE Symposium on Bioinformatics and Bioengineering, pages 324–327, 2005.
Cao et al. [2022] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In ECCV, 2022.
Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
Chen et al. [2023] Tianrun Chen, Lanyun Zhu, Chaotao Deng, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. Sam-adapter: Adapting segment anything in underperformed scenes. In IEEE ICCV, pages 3367–3375, 2023.
Chen et al. [2024] Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam2-adapter: Evaluating & adapting segment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more. arXiv preprint arXiv:2408.04579, 2024.
Dai et al. [2024] Chao Dai, Yang Wang, Chaolin Huang, Jiakai Zhou, Qilin Xu, and Minpeng Xu. A cephalometric landmark regression method based on dual-encoder for high-resolution x-ray image. In ECCV, pages 93–109, 2024.
Das et al. [2023] Adrito Das, Danyal Z Khan, Simon C Williams, John G Hanrahan, Anouk Borg, Neil L Dorward, Sophia Bano, Hani J Marcus, and Danail Stoyanov. A multi-task network for anatomy identification in endoscopic pituitary surgery. In MICCAI, pages 472–482, 2023.
Deng et al. [2024] Xiaolong Deng, Huisi Wu, Runhao Zeng, and Jing Qin. Memsam: taming segment anything model for echocardiography video segmentation. In IEEE CVPR, pages 9622–9631, 2024.
Deziel et al. [1993] Daniel J Deziel, Keith W Millikan, Steven G Economou, Alexander Doolas, Sung-Tao Ko, and Mohan C Airan. Complications of laparoscopic cholecystectomy: a national survey of 4,292 hospitals and an analysis of 77,604 cases. The American Journal of Surgery, 165(1):9–14, 1993.
Duan et al. [2019] Jinming Duan, Ghalib Bello, Jo Schlemper, Wenjia Bai, Timothy JW Dawes, Carlo Biffi, Antonio de Marvao, Georgia Doumou, Declan P O’Regan, and Daniel Rueckert. Automatic 3d bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE TMI, 38(9):2151–2164, 2019.
Gallaher and Charles [2022] Jared R Gallaher and Anthony Charles. Acute cholecystitis: a review. JAMA, 327(10):965–975, 2022.
Guo et al. [2024] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Chengzhuo Tong, Peng Gao, Chunyuan Li, and Pheng-Ann Heng. Sam2point: Segment any 3d as videos in zero-shot and promptable manners. arXiv preprint arXiv:2408.16768, 2024.
Hirai et al. [2022] Yuichiro Hirai, Ai Fujimoto, Naomi Matsutani, Soichiro Murakami, Yuki Nakajima, Ryoichi Miyanaga, Yoshihiro Nakazato, Kazuyo Watanabe, Masahiro Kikuchi, and Naohisa Yahagi. Evaluation of the visibility of bleeding points using red dichromatic imaging in endoscopic hemostasis for acute gi bleeding (with video). Gastrointestinal Endoscopy, 95(4):692–700, 2022.
Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In IEEE ICCV, pages 4015–4026, 2023.
Lee [1996] Tai Sing Lee. Image representation using 2d gabor wavelets. IEEE TPAMI, 18(10):959–971, 1996.
Li et al. [2020] Lu Li, Meng Wei, Bo Liu, Kunakorn Atchaneeyasakul, Fugen Zhou, Zehao Pan, Shimran A Kumar, Jason Y Zhang, Yuehua Pu, David S Liebeskind, et al. Deep learning for hemorrhagic lesion detection and segmentation on brain ct images. IEEE JBHI, 25(5):1646–1659, 2020.
Li et al. [2022] Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunxiao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. Simcc: A simple coordinate classification perspective for human pose estimation. In ECCV, pages 89–106, 2022.
Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE ICCV, pages 2980–2988, 2017.
Liu et al. [2019] Chuanbin Liu, Hongtao Xie, Sicheng Zhang, Jingyuan Xu, Jun Sun, and Yongdong Zhang. Misshapen pelvis landmark detection by spatial local correlation mining for diagnosing developmental dysplasia of the hip. In MICCAI, pages 441–449, 2019.
Liu et al. [2024] Haofeng Liu, Erli Zhang, Junde Wu, Mingxuan Hong, and Yueming Jin. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning. In NeurIPS Workshop, 2024.
Mao et al. [2024] Zhehua Mao, Adrito Das, Mobarakol Islam, Danyal Z Khan, Simon C Williams, John G Hanrahan, Anouk Borg, Neil L Dorward, Matthew J Clarkson, Danail Stoyanov, et al. Pitsurgrt: real-time localization of critical anatomical structures in endoscopic pituitary surgery. IJCARS, pages 1–8, 2024.
Milletari et al. [2016] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565–571, 2016.
Mori et al. [2024] Yosuke Mori, Taro Iwatsubo, Akitoshi Hakoda, Shin Kameishi, Kazuki Takayama, Shun Sasaki, Ryoji Koshiba, Shinya Nishida, Satoshi Harada, Hironori Tanaka, et al. Red dichromatic imaging improves the recognition of bleeding points during endoscopic submucosal dissection. Digestive Diseases and Sciences, 69(1):216–227, 2024.
Pei et al. [2024a] Jialun Pei, Ruize Cui, Yaoqian Li, Weixin Si, Jing Qin, and Pheng-Ann Heng. Depth-driven geometric prompt learning for laparoscopic liver landmark detection. In MICCAI, pages 154–164, 2024a.
Pei et al. [2024b] Jialun Pei, Zhangjun Zhou, and Tiantian Zhang. Evaluation study on sam 2 for class-agnostic instance-level segmentation. arXiv preprint arXiv:2409.02567, 2024b.
Ravi et al. [2025] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In ICLR, 2025.
Su et al. [2022] Ruisheng Su, Matthijs van der Sluijs, Sandra AP Cornelissen, Geert Lycklama, Jeannette Hofmeijer, Charles BLM Majoie, Pieter Jan van Doormaal, Adriaan CGM Van Es, Danny Ruijters, Wiro J Niessen, et al. Spatio-temporal deep learning for automatic detection of intracranial vessel perforation in digital subtraction angiography during endovascular thrombectomy. Medical Image Analysis, 77:102377, 2022.
Sun et al. [2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In IEEE CVPR, pages 8934–8943, 2018.
Sun et al. [2019] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In IEEE CVPR, pages 5693–5703, 2019.
Sunakawa et al. [2024] Taiki Sunakawa, Daichi Kitaguchi, Shin Kobayashi, Keishiro Aoki, Manabu Kujiraoka, Kimimasa Sasaki, Lena Azuma, Atsushi Yamada, Masashi Kudo, Motokazu Sugimoto, et al. Deep learning-based automatic bleeding recognition during liver resection in laparoscopic hepatectomy. Surgical Endoscopy, pages 1–7, 2024.
Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 33:7537–7547, 2020.
Tuysuzoglu et al. [2018] Ahmet Tuysuzoglu, Jeremy Tan, Kareem Eissa, Atilla P Kiraly, Mamadou Diallo, and Ali Kamen. Deep adversarial context-aware landmark detection for ultrasound imaging. In MICCAI, pages 151–158, 2018.
Varghese et al. [2024] Chris Varghese, Ewen M Harrison, Greg O’Grady, and Eric J Topol. Artificial intelligence in surgery. Nature Medicine, pages 1–12, 2024.
Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
Vorontsov et al. [2024] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, 30(10):2924–2935, 2024.
Wang et al. [2024] Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, and Yong Wang. Gtpt: Group-based token pruning transformer for efficient human pose estimation. In ECCV, pages 213–230, 2024.
Wu et al. [2024] Renkai Wu, Pengchen Liang, Yiqi Huang, Qing Chang, and Huiping Yao. Automatic segmentation of hemorrhages in the ultra-wide field retina: multi-scale attention subtraction networks and an ultra-wide field retinal hemorrhage dataset. IEEE JBHI, 2024.
Xiong et al. [2024] Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, and Guanbin Li. Sam2-unet: Segment anything 2 makes strong encoder for natural and medical image segmentation. arXiv preprint arXiv:2408.08870, 2024.
Zhang et al. [2020a] Dongqing Zhang, Jianing Wang, Jack H Noble, and Benoit M Dawant. Headlocnet: Deep convolutional neural networks for accurate classification and multi-landmark localization of head cts. Medical Image Analysis, 61:101659, 2020a.
Zhang et al. [2017] Jun Zhang, Mingxia Liu, Li Wang, Si Chen, Peng Yuan, Jianfu Li, Steve Guo-Fang Shen, Zhen Tang, Ken-Chung Chen, James J Xia, et al. Joint craniomaxillofacial bone segmentation and landmark digitization by context-guided fully convolutional networks. In MICCAI, pages 720–728, 2017.
Zhang et al. [2020b] Jun Zhang, Mingxia Liu, Li Wang, Si Chen, Peng Yuan, Jianfu Li, Steve Guo-Fang Shen, Zhen Tang, Ken-Chung Chen, James J Xia, et al. Context-guided fully convolutional networks for joint craniomaxillofacial bone segmentation and landmark digitization. Medical Image Analysis, 60:101621, 2020b.
Zheng et al. [2015] Yefeng Zheng, David Liu, Bogdan Georgescu, Hien Nguyen, and Dorin Comaniciu. 3d deep learning for efficient and robust landmark detection in volumetric data. In MICCAI, pages 565–572, 2015.
Zhong et al. [2019] Zhusi Zhong, Jie Li, Zhenxi Zhang, Zhicheng Jiao, and Xinbo Gao. An attention-guided deep regression model for landmark detection in cephalograms. In MICCAI, pages 540–548, 2019.
Zhu et al. [2024] Jiayuan Zhu, Yunli Qi, and Junde Wu. Medical sam 2: Segment medical images as video via segment anything model 2. arXiv preprint arXiv:2408.00874, 2024.