Synergistic Bleeding Region and Point Detection in Surgical Videos

Jialun Pei
CUHK
Hong Kong SAR, China
peijialun@gmail.com
   Zhangjun Zhou
PolyU
Hong Kong SAR, China
zhangjun.zhou@polyu.edu.hk
   Diandian Guo
CUHK
Hong Kong SAR, China
1155229775@link.cuhk.edu.hk
   Zhixi Li
SMU & PolyU
Guangzhou, China
lzx22121222@smu.edu.cn
   Jing Qin
PolyU
Hong Kong SAR, China
harry.qin@polyu.edu.hk
   Bo Du
WHU
Wuhan, China
dubo@whu.edu.cn
Corresponding author.
   Pheng-Ann Heng
CUHK
Hong Kong SAR, China
pheng@cse.cuhk.edu.hk
Abstract

Intraoperative bleeding in laparoscopic surgery rapidly obscures the operative field and hinders the surgical process. Intelligent detection of bleeding regions can quantify blood loss to assist decision-making, while locating the bleeding point helps surgeons quickly identify the source of bleeding and achieve hemostasis in time. In this study, we first construct a real-world surgical bleeding detection dataset, named SurgBlood, comprising 5,330 frames from 95 surgical video clips with bleeding region and point annotations. Accordingly, we develop a dual-task synergistic online detector called BlooDet, designed to detect bleeding regions and points simultaneously in surgical videos. Our framework adopts a dual-branch bidirectional guidance design based on Segment Anything Model 2 (SAM 2). The mask branch detects bleeding regions through adaptive edge and point prompt embeddings, while the point branch leverages mask memory to induce bleeding point memory modeling and captures the direction of bleeding point movement through inter-frame optical flow. Through interactive guidance and prompts, the two branches explore potential spatial-temporal relationships while leveraging memory modeling from previous frames to infer the current bleeding condition. Extensive experiments demonstrate that our approach outperforms its counterparts on SurgBlood in both bleeding region and point detection tasks, e.g., achieving 64.88% IoU for bleeding region detection and 83.69% PCK-10% for bleeding point detection. Our code and data will be publicly available.

1 Introduction

Figure 1: (a): Illustration of bleeding detection task with samples in SurgBlood and the predictions of our solution. (b): BlooDet performs dual-branch bidirectional guidance for synergistic bleeding mask and point detection. Zoom in for details.

Bleeding in Laparoscopic Surgery. Minimally invasive surgery has revolutionized clinical healthcare by reducing patient trauma and accelerating postoperative recovery [35]. However, intraoperative bleeding is a common contingency and significantly impacts surgical safety and efficiency [13]. Rapid changes in the amount and speed of bleeding can severely obscure the surgical field, potentially delaying the surgeon’s response and reducing the success rate of surgery. Prolonged bleeding increases the risk of organ damage, postoperative infection, and complications [11]. Therefore, utilizing computer-assisted techniques to detect bleeding regions and localize bleeding points promptly holds significant clinical value. On the one hand, detecting bleeding areas can quantify blood loss, providing timely support for intraoperative decision-making. On the other hand, the precise positioning of bleeding points enables surgeons to control hemorrhage promptly to ensure surgical safety.

Challenges for Bleeding Detection. Despite the popularity of laparoscopic surgery, automated detection of bleeding regions and bleeding points still faces numerous challenges [32]. Due to the narrow field of view under laparoscopy and unstable lighting conditions, the anatomical structures are incompletely exposed, which increases the difficulty of extracting discriminative representations. Additionally, the rapid accumulation and flow of blood can change tissue appearance and infiltrate surrounding tissues, reducing the availability of low-level visual clues and complicating the detection of bleeding regions. The bleeding points may also be buried, making it difficult to locate them quickly [25]. Beyond these challenges, intelligent bleeding warning involves detecting bleeding regions and locating bleeding points during dynamic surgical procedures [15]. This requires a reliable multi-task online detector that models fine-grained spatial-temporal relationships in surgical videos for accurate predictions. Further, the lack of real-world surgical bleeding multi-task datasets remains a major obstacle to progress in this community.

Proposed Benchmark. To advance research on bleeding region and point detection in surgical videos, we construct a new real-world laparoscopic surgery bleeding dataset, named SurgBlood. This dataset comprises a total of 5,330 video frames from 95 laparoscopic cholecystectomy (LC) video clips, encompassing multiple types and intensities of bleeding encountered during surgery. As displayed in Fig. 1(a), SurgBlood also provides pixel-level annotations of bleeding regions and bleeding point coordinates, labeled by hepatobiliary surgeons. Our dataset supports the joint evaluation of bleeding region and point detection in surgical videos. We evaluate several task-relevant methods on SurgBlood to establish a comprehensive benchmark for intraoperative bleeding detection, aiming to drive further research in intelligent surgical assistance.

Existing Methods. Various deep learning-based algorithms have been demonstrated to be effective in bleeding region detection, including applications in intracranial hemorrhage detection [18] and bleeding detection in capsule endoscopy [3]. However, most algorithms [3, 18, 9] are designed for image or keyframe analysis, lacking the ability to model temporal dependencies in surgical videos. In addition, previous methods [29, 23, 32] mainly focus on bleeding region detection, which falls short of addressing the clinical needs in locating the bleeding source. The advent of Segment Anything Model 2 (SAM 2) [28] extended the generic large vision model (LVM) to the video domain, but has not yet been unified into the multi-task paradigm. Some multi-task frameworks [9, 23] detect both regions and keypoints concurrently, but they overlook mutual guidance for joint optimization across tasks. Therefore, it is desirable to develop a dual-task paradigm to synergistically detect bleeding regions and points.

Our Solution. To meet the clinical demand for bleeding region and point detection, we propose a dual-task online baseline model called BlooDet, which adopts a dual-branch bidirectional guidance structure based on video-level SAM 2 to synergistically optimize both tasks. Our framework consists of two branches: a mask branch and a point branch. In the mask branch, we embed an edge generator that performs multi-scale perception of spatial-temporal features with the Wavelet Laplacian filter to generate edge prompts, mitigating the problem of blurred bleeding boundaries in surgical scenes. Meanwhile, we incorporate bleeding points produced by the point branch as point prompts and combine them with edge prompts to facilitate bleeding region detection. For the point branch, we leverage inter-frame optical flow and bleeding mask memory to estimate laparoscopic camera motion and viewpoint offsets, improving the prediction of the bleeding point movement direction. Further, we integrate mask memory features from the mask branch to enhance bleeding point location perception. By exchanging clues and co-guiding between the two branches, BlooDet fully exploits the underlying spatial and temporal associations between bleeding regions and points. Extensive experiments on the SurgBlood dataset demonstrate that our approach achieves superior performance in both bleeding region and point detection tasks.

Figure 2: Illustration of bleeding types in SurgBlood. Bleeding regions and points are marked with yellow masks and red dots, respectively.

The main contributions of this work are four-fold:

  • We introduce the intraoperative bleeding region and bleeding point detection tasks in surgical videos and contribute a real-world bleeding detection dataset, termed SurgBlood, to advance the surgical intelligent assistance community.

  • We propose a dual-task synergistic online detection model, BlooDet, for jointly detecting bleeding regions and points in surgical videos. Our framework embraces a dual-branch structure and performs co-optimization by mutual prompts and bidirectional guidance.

  • In the point branch, we utilize inter-frame optical flow and mask memory for point memory modeling, efficiently capturing bleeding point movement cues and providing spatial-temporal modeling. Adaptive edge and point prompting strategies are introduced in the mask branch, where the edge generator is designed to exploit multi-scale Wavelet Laplacian filter convergence to enhance edge perception, and combined with bleeding points from the point branch for mask prompt embedding.

  • We evaluate various task-related models on SurgBlood and establish a comprehensive benchmark to facilitate development in bleeding detection. Extensive experiments indicate that our framework outperforms other methods in bleeding region and point detection tasks.

2 Related Work

Bleeding Region Detection in Medical Domain. Bleeding region detection has been explored across various medical scenarios, such as intracranial hemorrhage detection [18], capsule endoscopy bleeding recognition [3], and retinal hemorrhage identification [39]. Deep learning-based methods employ convolutional neural networks (CNNs) and attention mechanisms to extract discriminative features for bleeding localization. In surgical videos, intraoperative bleeding detection presents unique challenges due to limited workspace and dynamic lighting variations. In this regard, Sunakawa et al. [32] developed a semantic segmentation model for automatically recognizing bleeding regions on the anatomical structure of the liver. Nonetheless, existing methods primarily focus on mask- or patch-level bleeding detection, overlooking the localization of the bleeding source. To bridge this gap, we introduce a unified paradigm for the synergistic detection of bleeding regions and points.

Keypoint Detection in Medical Images. Keypoint detection plays a crucial role in various clinical applications, e.g., pathological site identification and anatomical landmark localization [37, 1]. Existing keypoint detection methods are usually classified into three categories: 1) Context-aware spatial methods. This technique exploits the stability and uniformity of keypoint spatial distributions to improve localization accuracy [41, 21, 34, 26]. 2) Multi-stage learning strategies. These architectures follow a coarse-to-fine process to refine keypoint localization through gradual optimization and integrating shallow and deep layers to enhance performance [44, 8, 45]. 3) Multi-task learning frameworks, which jointly optimize medical image segmentation and keypoint detection by constructing a union network [42, 43, 12, 9]. Although current multi-task learning methods demonstrate strong performance, they neglect the mutual guidance between tasks. Therefore, we propose a collaborative dual-branch model that enhances both bleeding region and point detection via cross-task guidance.

Video Segmentation with Large Vision Models. With the advent of large vision models (LVMs) [16], video segmentation techniques have made great progress. Among these models, SAM 2 [28] has emerged as the leading framework, extending the capabilities of SAM from image segmentation to the video domain. By leveraging large-scale pre-training and fine-tuning techniques, SAM 2 achieves state-of-the-art performance across a range of video tasks. LVM-based video segmentation techniques open up new possibilities in various downstream applications [27, 14, 7, 2, 46, 40]. SAM2-adapter [7] embeds adapter layers into SAM 2 to enhance the model’s flexibility, improving cross-task generalization. In the domain of surgical video analysis, SurgSAM-2 [22] utilizes SAM 2 along with a frame pruning mechanism for efficient instrument segmentation. Based on the powerful capabilities of SAM 2, we design a dual-task paradigm that allows for joint bleeding region and point detection in complex surgical environments.

3  SurgBlood Dataset

We construct a brand-new dataset specifically for the bleeding process in laparoscopic cholecystectomy (LC), named SurgBlood. The dataset involves two complementary tasks: bleeding region and bleeding point detection. We provide an overview of our dataset from three aspects: data collection, data annotation, and data analysis.

3.1 Data Collection

To ensure high-quality and representative data, we invited four hepatobiliary surgeons from partner hospitals to carefully select 95 LC video clips. Each clip covers the entire bleeding process while retaining the non-bleeding scene for approximately 3 seconds before and after the bleeding event. We collect a total of 5,330 video frames with a resolution of 1280×720 from all clips using a sampling rate of 2 fps. Notably, we focus exclusively on dynamic bleeding regions within the surgical action field, as this is the critical location that directly interferes with the surgeon and contains key bleeding points. As shown in Fig. 2, there are four bleeding types by tissue location: gallbladder, cystic triangle, vessel, and gallbladder bed.

Figure 3: Statistical distribution of video clips in SurgBlood.

3.2 Data Annotation

To ensure the annotation quality of SurgBlood, the invited hepatobiliary surgeons meticulously annotate and review each video clip. During the labeling process, the surgeon uses both static frames and dynamic video sequences to label bleeding regions and bleeding point coordinates for each frame. Annotation is guided by the following principles: 1) For bleeding regions, pooled blood and inactive sparse bloodstains are not annotated. 2) For bleeding points, if the bleeding point is unobstructed, its coordinates are labeled; if the bleeding point is covered by blood, annotators trace back to the first frame where the bleeding occurred and localize the point based on the surrounding anatomy. These situations bring distinct characteristics and challenges to our dataset. To ensure annotation consistency, we adopt the cross-validation strategy: each clip is initially annotated by two surgeons, followed by a review and refinement process conducted by two additional surgeons. Fig. 2 presents examples of annotations for various bleeding situations.

3.3 Data Analysis

  • Clip Distribution: SurgBlood includes 5,330 frames extracted from 95 video clips. As shown in Fig. 3, each clip contains an average of 56 frames, with the longest containing 300 frames and the shortest containing 8 frames. We also counted the distribution of bleeding types: gallbladder (21.64%), cystic triangle (25.01%), vessel (15.78%), and gallbladder bed (37.75%).

  • Bleeding Ratio: To reflect the information density of the data, we calculate the proportion of frames containing bleeding regions and points. As shown on the left of Fig. 4, both bleeding regions and bleeding points appear in a high proportion of frames; the proportion is slightly higher for bleeding regions because some bleeding points are occluded.

  • Space Statistics: We analyze the spatial distribution and center bias of bleeding regions and points across all samples in SurgBlood. The right of Fig. 4 provides statistical insights into the distances of bleeding region centers and bleeding points from the image center. Besides, Fig. 5 visualizes their center bias. We can see that bleeding caused by surgical operations predominantly occurs near the image center, and bleeding points are contained within bleeding regions for the vast majority of the time. These statistical priors drive us to design synergistic guidance for the two bleeding detection tasks.
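The center-distance and containment statistics described above can be sketched for a single frame as follows. This is a minimal illustration and not the authors' analysis script: the function names, the normalization by the image diagonal, and the use of the mask centroid as the "bleeding region center" are assumptions made for this example.

```python
import numpy as np

def center_distances(mask, point):
    """Distances of the mask centroid and the bleeding point from the image center.

    mask: (H, W) binary bleeding mask; point: (x, y) pixel coordinate or None.
    Distances are divided by the image diagonal (an assumed normalization).
    """
    h, w = mask.shape
    center = np.array([(w - 1) / 2.0, (h - 1) / 2.0])  # image center as (x, y)
    diag = np.hypot(w, h)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        region_dist = None  # frame without a bleeding region
    else:
        centroid = np.array([xs.mean(), ys.mean()])
        region_dist = float(np.linalg.norm(centroid - center) / diag)
    if point is None:
        point_dist = None  # frame without a visible bleeding point
    else:
        point_dist = float(np.linalg.norm(np.asarray(point, dtype=float) - center) / diag)
    return region_dist, point_dist

def point_inside_region(mask, point):
    """True if the (x, y) point falls on a bleeding pixel of the mask."""
    x, y = int(round(point[0])), int(round(point[1]))
    return bool(mask[y, x])

# Toy frame: a 3x3 bleeding region centered in a 9x9 image.
toy_mask = np.zeros((9, 9), dtype=np.uint8)
toy_mask[3:6, 3:6] = 1
region_dist, point_dist = center_distances(toy_mask, (4, 4))
inside = point_inside_region(toy_mask, (4, 4))
```

Aggregating these per-frame values over a dataset would give histograms comparable in spirit to the right panel of Fig. 4, under the assumptions stated above.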

Figure 4: Bleeding distribution. Left: proportion of frames with bleeding region and point; Right: Distance of bleeding region center and bleeding point to image center.
Figure 5: Visualization of center bias between bleeding region center and bleeding point coordinate in SurgBlood.

4 Proposed Method

Figure 6: Overview of the proposed BlooDet. Our framework comprises a mask branch and a point branch to jointly detect bleeding regions and bleeding points. Cross-branch guidance and adaptive prompt embedding allow our model to reach a co-optimized state.

4.1 Overall Architecture

BlooDet is a dual-task collaborative detector for simultaneous bleeding region and point detection in surgical videos. As shown in Fig. 6, our framework revises the body of SAM 2 [28] by empowering edge clues to detect bleeding regions and incorporating a point branch to detect bleeding points. The whole model consists of the following components: image encoder, mask/point memory modeling, edge generator, mask/point decoder, and mask/point memory bank.

Image Encoding. Given a set of $N$ video frames, including the current frame $I_k \in \mathbb{R}^{H \times W \times 3}$ and the previous $N-1$ frames $X = \{I_i\}_{i=k-N+1}^{k-1}$, we first flatten the frames and feed them into the image encoder inherited from SAM 2 to produce multi-scale spatial features $F \in \mathbb{R}^{s \times c \times N}$, where $s$ denotes the length of the feature sequence and $c$ is the feature dimension. Then, the output sequential frame features $F_{k-N+1}, \ldots, F_k$ are fed into the mask and point branches, respectively, for memory modeling.

Point Branch. As shown at the bottom of Fig. 6, the point branch comprises three main parts: point memory modeling, point decoder, and point memory bank. Compared to mask memory modeling, the point memory modeling module embeds optical flow estimation to predict the displacement field between consecutive frames $[I_{k-7}, \ldots, I_k]$. This operation enables the inference of laparoscopic camera motion and viewpoint offset during surgery, thereby identifying the movement direction of the current bleeding points. Further, we integrate previous mask memory features $\{M^m_q\}_{q=k-7}^{k-1}$ from the mask branch with the corresponding point features to enhance location and temporal perception as well as to narrow the search space for bleeding point coordinates. We describe this process in detail in Sec. 4.2.

Next, the memory-enhanced point feature $F_{point}$ passes through the point decoder to predict the bleeding point. Different from the upsampling fusion in the mask decoder, we employ learnable output tokens and prompt tokens consistent with SAM 2 that interact with $F_{point}$ through self- and cross-attention operations [5], followed by MLP layers to predict bleeding point coordinates and confidence scores. The point memory is stored in the point memory bank after point encoding to enhance temporal modeling.

Mask Branch. This branch focuses on predicting bleeding regions by mask memory modeling and coupling bleeding edge and point prompts. The top of Fig. 6 illustrates the pipeline of our mask branch. The current frame features $F_k$ are first fed into mask memory modeling to perform self- and cross-attention interaction [36] with mask memory features from previous frames, producing the spatial-temporal feature $F_{mask}$. After that, we introduce an edge generator that applies multi-scale Wavelet Laplacian filters to $F_{mask}$ for edge refinement. Then, we incorporate high-resolution features from the image encoder to obtain edge maps (detailed description in Sec. 4.3). Unlike the manual intervention prompts in SAM [16, 28], we form adaptive prompt embeddings by combining the edge map $E_m$ from the edge generator with the point map $P_m$ output from the point branch. Then, the prompt encoder is utilized to encode $E_m$ and $P_m$ to yield prompt features $E_p$ and $P_p$:

$$E_p, P_p = \mathcal{P}[E_m, P_m], \qquad (1)$$
$$E_p = \texttt{Cv}\big(\texttt{LN}(\mathbf{G}(\texttt{Cv}(\texttt{LN}(\mathbf{G}(\texttt{Cv}(E_m))))))\big), \qquad (2)$$
$$P_p = \mathcal{C}[\sin(2\pi(\texttt{Po}(P_m))), \cos(2\pi(\texttt{Po}(P_m)))] + \texttt{Le}, \qquad (3)$$

where $\mathcal{P}$ represents prompt encoding, $\mathbf{G}$ denotes the GELU function, $\texttt{Cv}$ refers to a 2×2 convolution operation, and $\texttt{LN}$ is layer normalization. Also, $\texttt{Po}$ denotes positional encoding [33], $\texttt{Le}$ stands for learned embeddings, and $\mathcal{C}[\cdot,\cdot]$ is the concatenation operation. We input prompt features along with $F_{mask}$ into the mask decoder similar to SAM 2, and attain the predicted bleeding mask by upsampling and integrating with high-resolution features. Subsequently, we employ memory encoding to obtain the mask memory feature $M^m_k$ and store it in the mask memory bank. Furthermore, the mask maps are also updated in the mask memory bank to provide spatial guidance for bleeding point detection across consecutive frames.
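As a rough illustration of Eq. (3), the sketch below encodes a single bleeding point coordinate. It assumes that Po is a linear projection of coordinates normalized to [-1, 1]; the text does not specify this, so treat it as one plausible choice. The names `encode_point_prompt`, `proj`, and `le` are hypothetical, and the random weights are placeholders for whatever fixed or learned parameters the actual model uses.

```python
import numpy as np

def encode_point_prompt(point_xy, image_hw, proj, learned_embedding):
    """Sketch of Eq. (3): concat[sin(2*pi*Po(P_m)), cos(2*pi*Po(P_m))] + Le."""
    h, w = image_hw
    # Normalize pixel coordinates to [0, 1], then shift to [-1, 1] (assumed).
    xy = np.asarray(point_xy, dtype=np.float64) / np.array([w, h], dtype=np.float64)
    xy = 2.0 * xy - 1.0
    po = xy @ proj  # Po: assumed linear projection, shape (dim // 2,)
    encoding = np.concatenate([np.sin(2.0 * np.pi * po), np.cos(2.0 * np.pi * po)])
    return encoding + learned_embedding

rng = np.random.default_rng(0)
dim = 256                                   # hypothetical prompt feature size
proj = rng.normal(size=(2, dim // 2))       # placeholder for the Po weights
le = np.zeros(dim)                          # placeholder for the learned embedding Le
p_p = encode_point_prompt((640, 360), (720, 1280), proj, le)
```

With the point at the exact center of a 1280×720 frame, the normalized coordinate is zero, so the sine half of the encoding is all zeros and the cosine half is all ones before `le` is added.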

Cross-branch Guidance. A key design in our framework is bidirectional collaborative guidance between the mask and point branches, enabling simultaneous optimization of bleeding region and bleeding point predictions. For the mask decoder, we exploit the point map produced by the point decoder as an automatic prompt input. This helps guide the decoder to focus on the target bleeding region while mitigating the interference from residual blood in the surrounding area. In point memory modeling, the predicted mask maps from previous frames can assist in optical flow estimation and in predicting the direction of the bleeding point. Besides, mask memory features are merged with point memory features to induce the point decoder to concentrate on the most likely bleeding areas while mitigating the impact of low-contrast background. Through this cross-branch guidance, the mask and point branches are synergistically optimized, resulting in more consistent bleeding region and point predictions in successive frames.

4.2 Point Memory Modeling in Point Branch

To detect bleeding points in consecutive frames effectively, we embed the point memory modeling module in the point branch to develop temporal clues for point features. As illustrated in Fig. 6, point memory modeling is divided into two steps: 1) combining the optical flow of consecutive frames with region maps to compensate for the viewpoint offset of the camera, 2) interacting the average camera displacement of previous frames with mask memory features from the mask branch to obtain point memory features.

For the viewpoint offset of the camera, we first utilize the frozen PWC-Net [30] for optical flow estimation. Given $N$ frames $\{I_i\}_{i=k-N+1}^{k}$, the optical flow $O_i(x,y) \in \mathbb{R}^{H \times W \times 2}$ between two consecutive frames can be expressed as $O_i(x,y) = \texttt{PWC-Net}(I_{i-1}, I_i)$. Considering the instability of the optical flow in the rapidly changing bleeding region, we invert the mask map $M_i$ from the mask branch for each frame and combine it with $O_i(x,y)$ to obtain the average viewpoint offset $\bar{O}_i(\Delta x, \Delta y)$:

$$\bar{O}_i(\Delta x, \Delta y) = \frac{1}{H \times W} \sum_{x=1}^{H} \sum_{y=1}^{W} (1 - M_i(x,y)) \cdot O_i(x,y). \qquad (4)$$

Then, the global offset coordinates $\bar{O}_i \in \mathbb{R}^2$ of previous frames can be produced through an MLP layer.
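Eq. (4) can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: the flow field is taken as given (the PWC-Net call is not reproduced), the mask is binary or in [0, 1], and the function name `mean_viewpoint_offset` is hypothetical. The subsequent MLP is omitted.

```python
import numpy as np

def mean_viewpoint_offset(flow, mask):
    """Eq. (4): flow averaged over the frame after suppressing bleeding pixels.

    flow: (H, W, 2) displacement field O_i; mask: (H, W) bleeding mask M_i.
    """
    h, w = mask.shape
    background = 1.0 - mask.astype(np.float64)   # inverted mask (1 - M_i)
    weighted = background[..., None] * flow      # zero the flow inside the bleeding region
    return weighted.sum(axis=(0, 1)) / (h * w)   # normalize by H x W as written in Eq. (4)

# Toy example: uniform camera motion of (2, -1) pixels, bleeding covers a quarter of the frame.
toy_flow = np.tile(np.array([2.0, -1.0]), (4, 4, 1))
toy_mask = np.zeros((4, 4))
toy_mask[:2, :2] = 1.0
offset = mean_viewpoint_offset(toy_flow, toy_mask)
```

Because Eq. (4) divides by $H \times W$ rather than by the number of non-bleeding pixels, the toy offset is $(1.5, -0.75)$, i.e., the true motion scaled by the background fraction of 0.75.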

After that, we aggregate the point memory features $M^p_i$ of previous frames in the point memory bank with $\bar{O}_i$ and concatenate them with the mask memory features $M^m_i$ to obtain the mask-guided corrected point features $\bar{F}^{ref}_i$. Lastly, we perform a self-attention operation on $F_k$ and cross-attention with $\bar{F}^{ref}_i$ to obtain the memory-enhanced point feature $F_{point}$. Through the motion compensation mechanism based on optical flow estimation together with cross-branch mask guidance, we perform effective memory modeling of bleeding point features in laparoscopic scenes with camera offset.

4.3 Edge Generator in Mask Branch

Detecting bleeding regions is challenging due to the low contrast and high noise in surgical scenes. To this end, we embed a dedicated edge generator in the mask branch, which enhances the accuracy of bleeding region detection by combining multi-scale Wavelet Laplacian filters [17] with high-resolution features containing lower-level texture clues to generate edge map prompts. Concretely, we first input the spatial-temporal features $F_{mask}$ into the Gabor wavelet Laplacian filter to enhance edge structures in the spatial domain. The Gabor wavelet operation at the spatial position $(x, y)$ is computed as

$$\mathcal{G}(x,y;\lambda,\theta,\psi,\sigma,\gamma) = \exp\Big(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\Big) \exp\Big(i\Big(\frac{2\pi}{\lambda}x' + \psi\Big)\Big), \qquad (5)$$

where $x' = x\cos\theta + y\sin\theta$, $y' = -x\sin\theta + y\cos\theta$, $\lambda$ denotes the wavelength, $\theta$ is the orientation angle of the Gabor kernel, $\psi$ is the phase offset, $\sigma$ is the scale of the Gaussian function, and $\gamma$ stands for the aspect ratio. Therefore, Laplacian filtering based on the Gabor wavelet is defined as
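For concreteness, Eq. (5) can be sampled on a discrete grid as below. This is an illustrative sketch only: the kernel size, the parameter values, and the function name `gabor_kernel` are assumptions, and the text does not state which values the model uses.

```python
import numpy as np

def gabor_kernel(size, lam, theta, psi, sigma, gamma):
    """Complex Gabor kernel of Eq. (5) on a size x size grid centered at the origin.

    size is assumed to be odd so that the grid has a well-defined center pixel.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    x_rot = x * np.cos(theta) + y * np.sin(theta)    # x'
    y_rot = -x * np.sin(theta) + y * np.cos(theta)   # y'
    envelope = np.exp(-(x_rot ** 2 + gamma ** 2 * y_rot ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (2.0 * np.pi * x_rot / lam + psi))
    return envelope * carrier

# Hypothetical parameter values chosen only for illustration.
kernel = gabor_kernel(size=5, lam=4.0, theta=0.0, psi=0.0, sigma=2.0, gamma=0.5)
```

At the center pixel both exponentials equal one when $\psi = 0$, and the magnitude decays with the Gaussian envelope away from the center. The real part responds to even-symmetric structures and the imaginary part to odd-symmetric ones such as step edges.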

$$\mathbf{L}_{\mathbf{g}}(x,y)=\Delta f(x,y)\cdot\mathcal{G}(x,y), \tag{6}$$

$$\Delta f(x,y)=\frac{\partial^{2}f}{\partial x^{2}}+\frac{\partial^{2}f}{\partial y^{2}}, \tag{7}$$

where $\Delta f(x,y)$ represents the Laplacian operator in 2D space. Then, we perform an activation operation on $F_{mask}$ and let the result interact with the filtered features to suppress low-confidence signals and preserve refined edge features. The whole process can be described as

$$F'_{mask}=\left(\texttt{ReLU}(F_{mask})\right)\odot\left(\mathbf{L}_{\mathbf{g}}(x,y)\ast F_{mask}\right), \tag{8}$$

where $\odot$ denotes the convolution operation. As shown in Fig. 6, we upsample $F_{mask}$ twice in parallel, pass each result separately through the Wavelet Laplacian filters, and then let the filtered features interact with the high-resolution features, i.e., $F_{1}$ and $F_{2}$, to further refine the edge information. Finally, the generated edge map is fed into the mask decoder as the edge prompt for bleeding region detection.
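To make Eqs. (5) and (7) concrete, the following NumPy sketch samples the complex Gabor function on a small grid and approximates the Laplacian with a five-point finite-difference stencil. The function names `gabor_kernel` and `discrete_laplacian`, the kernel size, the parameter values, and the edge-replicating padding are illustrative assumptions; none of them are settings reported for BlooDet, and the sketch does not reproduce the full edge generator.

```python
import numpy as np


def gabor_kernel(size, lam, theta, psi, sigma, gamma):
    """Sample the complex Gabor function of Eq. (5) on a size x size grid.

    `size` is expected to be odd so that the grid is centred on (0, 0).
    All argument values are chosen by the caller (hypothetical here).
    """
    half = size // 2
    # Pixel offsets centred on zero; y indexes rows, x indexes columns.
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinates by the orientation angle theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope multiplied by the complex sinusoidal carrier.
    envelope = np.exp(-(x_r ** 2 + gamma ** 2 * y_r ** 2) / (2.0 * sigma ** 2))
    carrier = np.exp(1j * (2.0 * np.pi / lam * x_r + psi))
    return envelope * carrier


def discrete_laplacian(f):
    """Five-point finite-difference approximation of Eq. (7).

    Borders are handled by replicating edge values (an assumed choice).
    """
    f = np.asarray(f, dtype=float)
    p = np.pad(f, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4.0 * f


# Illustrative parameter values only.
kernel = gabor_kernel(size=7, lam=4.0, theta=np.pi / 4, psi=0.0, sigma=2.0, gamma=0.5)

# For f(i, j) = i^2 + j^2 the second differences sum to 4 away from the border.
rows, cols = np.mgrid[0:5, 0:5]
lap = discrete_laplacian(rows ** 2 + cols ** 2)
```

With $\psi = 0$ the kernel equals 1 at its centre, since both the envelope and the carrier evaluate to 1 at $(x, y) = (0, 0)$.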

| Types | Methods | Venue | IoU ↑ | Dice ↑ | PCK-2% ↑ | PCK-5% ↑ | PCK-10% ↑ | # Params (M) ↓ |
|---|---|---|---|---|---|---|---|---|
| Region-level | Swin-UNet [4] | ECCV'22 | 41.31 | 58.47 | - | - | - | 27.2 |
| | SAM [16] | ICCV'23 | 40.43 | 57.49 | - | - | - | 93.7 |
| | SAM-Adapter [6] | ICCV'23 | 54.80 | 70.80 | - | - | - | 93.8 |
| | MemSAM [10] | CVPR'24 | 55.34 | 71.28 | - | - | - | 133.3 |
| | SAM 2 [28] | ICLR'25 | 63.51 | 77.68 | - | - | - | 80.8 |
| | SAM2-Adapter [7] | arXiv'24 | 64.23 | 77.95 | - | - | - | 88.8 |
| Point-level | HRNet [31] | CVPR'19 | - | - | 3.13 | 15.98 | 44.31 | 63.6 |
| | SimCC [19] | ECCV'22 | - | - | 2.14 | 14.99 | 46.95 | 66.3 |
| | GTPT [38] | ECCV'24 | - | - | 2.80 | 13.01 | 38.38 | 16.7 |
| | D-CeLR [8] | ECCV'24 | - | - | 5.10 | 27.67 | 60.13 | 53.4 |
| Multi-task | PAINet [9] | MICCAI'23 | 44.14 | 61.24 | 2.47 | 15.48 | 48.43 | 13.6 |
| | PitSurgRT [23] | IJCARS'24 | 30.48 | 46.72 | 2.47 | 13.84 | 41.68 | 67.3 |
| | SAM 2 [28] | ICLR'25 | 50.93 | 67.49 | 12.35 | 41.68 | 71.99 | 81.0 |
| | BlooDet (Ours) | - | 64.88 | 78.70 | 18.62 | 55.85 | 83.69 | 91.6 |

Table 1: Overall comparison with cutting-edge methods on the SurgBlood test set. In the multi-task group, SAM 2 denotes SAM 2 with an added point head.
Figure 7: Visual comparison of bleeding region and point detection on SurgBlood test set.

4.4 Loss Function

The total loss function of BlooDet consists of the mask loss $\mathcal{L}_{mask}$ and edge loss $\mathcal{L}_{edge}$ from the mask branch, as well as the point loss $\mathcal{L}_{point}$ and score loss $\mathcal{L}_{score}$ from the point branch:

$$\mathcal{L}=\lambda_{m}\mathcal{L}_{mask}+\lambda_{e}\mathcal{L}_{edge}+\lambda_{s}\mathcal{L}_{score}+\lambda_{p}\mathcal{L}_{point}. \tag{9}$$

Both $\mathcal{L}_{mask}$ and $\mathcal{L}_{edge}$ are computed as a combination of Focal loss [20] and Dice loss [24]. In addition, $\mathcal{L}_{point}$ employs the smooth L1 loss for point-level supervision:

$$\mathcal{L}_{point}=\sum_{i=1}^{N}\mathbf{1}_{\{p_{i}\neq[0,0]\}}\cdot L1(\hat{y}_{\text{p}}^{i},y_{\text{p}}^{i}), \tag{10}$$

where $\hat{y}_{\text{p}}^{i}$ is the predicted point location and $y_{\text{p}}^{i}$ is the ground-truth point location. $\mathbf{1}_{\{p_{i}\neq[0,0]\}}$ denotes an indicator function that ensures the loss is calculated only when the point is not $[0, 0]$. Besides, $\mathcal{L}_{score}$ is the binary cross-entropy loss for point existence in the point branch. $\lambda_{m}$, $\lambda_{e}$, $\lambda_{s}$, and $\lambda_{p}$ are empirically set to 1, 1, 1, and 0.5, respectively, to balance the total loss function.
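The masked point loss of Eq. (10) and the weighted sum of Eq. (9) can be sketched in NumPy as follows. The helper names, the smooth L1 transition point `beta`, and the choice to sum the loss over the two coordinates are assumptions made for illustration; only the four weights are taken from the text. The mask, edge, and score losses are passed in as precomputed scalars.

```python
import numpy as np


def smooth_l1(pred, target, beta=1.0):
    """Element-wise smooth L1 loss; beta is an assumed transition point."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)


def point_loss(pred_pts, gt_pts):
    """Eq. (10): accumulate the loss only where the ground-truth point is not [0, 0]."""
    pred_pts = np.asarray(pred_pts, dtype=float)
    gt_pts = np.asarray(gt_pts, dtype=float)
    # Indicator 1{p_i != [0, 0]}: true if either coordinate is nonzero.
    valid = np.any(gt_pts != 0.0, axis=1)
    # Sum the per-coordinate loss over x and y (assumed reduction).
    per_point = smooth_l1(pred_pts, gt_pts).sum(axis=1)
    return float((per_point * valid).sum())


def total_loss(l_mask, l_edge, l_score, l_point,
               lam_m=1.0, lam_e=1.0, lam_s=1.0, lam_p=0.5):
    """Eq. (9) with the weights reported in the text as defaults."""
    return lam_m * l_mask + lam_e * l_edge + lam_s * l_score + lam_p * l_point


# Two samples; the second has no annotated point, so it is masked out.
pred = [[0.5, 0.5], [0.3, 0.3]]
gt = [[0.5, 1.0], [0.0, 0.0]]
l_point = point_loss(pred, gt)               # 0.5 * 0.5**2 = 0.125
loss = total_loss(1.0, 1.0, 1.0, l_point)    # 3 + 0.5 * 0.125 = 3.0625
```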

5 Experiments

5.1 Datasets and Evaluation Metrics

Datasets. Since the cooperative detection of bleeding regions and bleeding points is proposed here for the first time, we can only adopt the proposed SurgBlood dataset to train and test our method and the comparison methods. We randomly divide the 95 video clips into 75 for training and the remaining 20 for testing.

Evaluation Metrics. Following previous studies [28, 9], we adopt the Intersection over Union (IoU) and Dice Coefficient (Dice) metrics to evaluate bleeding region detection performance. For bleeding points, the Percentage of Correct Keypoints (PCK) metric is used to measure localization accuracy. Unlike the 10% to 40% threshold range applied in [9, 23] for anatomical structure centroids, we adopt a narrower threshold range of 2% to 10% for a stricter assessment, i.e., PCK-2%, PCK-5%, and PCK-10%. This is because bleeding point detection in laparoscopic surgery requires higher precision and tolerates less error.
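The three metrics can be sketched in NumPy as below. The text does not state the reference length to which the PCK percentage is applied; this sketch assumes the image diagonal, which is a labeled assumption rather than the paper's definition. The function names and the small `eps` stabiliser are likewise illustrative.

```python
import numpy as np


def iou_and_dice(pred_mask, gt_mask, eps=1e-7):
    """IoU and Dice for a pair of binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    return float(iou), float(dice)


def pck(pred_pts, gt_pts, image_hw, threshold):
    """Fraction of points whose error is within threshold * image diagonal.

    Normalising by the diagonal is an assumption of this sketch.
    """
    pred_pts = np.asarray(pred_pts, dtype=float)
    gt_pts = np.asarray(gt_pts, dtype=float)
    diag = np.hypot(image_hw[0], image_hw[1])
    dist = np.linalg.norm(pred_pts - gt_pts, axis=1)
    return float(np.mean(dist <= threshold * diag))


iou, dice = iou_and_dice([[1, 1], [0, 0]], [[1, 0], [0, 0]])

# Two points on a 512 x 512 frame with errors of 10 and 60 pixels.
pred = [[100, 100], [200, 200]]
gt = [[110, 100], [260, 200]]
pck_2 = pck(pred, gt, (512, 512), 0.02)    # only the 10-pixel error passes
pck_10 = pck(pred, gt, (512, 512), 0.10)   # both errors pass
```

Under the assumed normaliser, a 2% threshold on a 512×512 frame corresponds to roughly 14.5 pixels, which illustrates how strict PCK-2% is compared with PCK-10%.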

5.2 Implementation Details

Our framework is implemented on two RTX 4090 GPUs. During training, we input eight consecutive frames in an online manner, with each frame resized to 512×512 pixels. The image encoder of BlooDet is initialized with pre-trained weights from SAM 2_base [28]. Additionally, we utilize a frozen PWC-Net [30] to compute inter-frame optical flow. No data augmentation is applied during data loading. The maximum learning rate for the image encoder is set to 5e-6, while other parts are trained with a learning rate of 5e-4. To optimize the training process, we employ the Adam optimizer with a warm-up strategy and linear decay, training for 20 epochs. During inference, we perform frame-by-frame inference in line with SAM 2 to get results for each frame.
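A warm-up followed by linear decay can be written as a small schedule function, sketched here in plain Python. The warm-up length, the total step count, and the decay to exactly zero are not given in the text and are assumptions; the function name `lr_at_step` is hypothetical. Only the two peak learning rates come from the description above.

```python
def lr_at_step(step, total_steps, warmup_steps, peak_lr):
    """Linear warm-up to peak_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        # Ramp up so that the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)


# Peak learning rates from the text; the group names are illustrative.
PEAK_LR = {"image_encoder": 5e-6, "other": 5e-4}

# Assumed schedule length: 100 steps with 10 warm-up steps.
schedule = {
    name: [lr_at_step(s, 100, 10, peak) for s in range(101)]
    for name, peak in PEAK_LR.items()
}
```

Both parameter groups follow the same shape and differ only in their peak value, so the image encoder stays two orders of magnitude below the remaining modules throughout training.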

5.3 Performance on SurgBlood Benchmark

We develop a comprehensive evaluation benchmark on the SurgBlood dataset. BlooDet is compared with 13 task-related methods for bleeding region and point detection. These approaches include multi-task detection models [9, 23], video- and image-level object segmentation methods [4, 16, 28, 6, 10, 7], and pose-based point detection methods [31, 19, 38, 8]. For fairness, all methods use the official code and adapt only the head to fit bleeding tasks.

| EG | PMM | IoU ↑ | Dice ↑ | PCK-2% ↑ | PCK-5% ↑ | PCK-10% ↑ |
|---|---|---|---|---|---|---|
| | | 50.93 | 67.49 | 12.35 | 41.68 | 71.99 |
| ✓ | | 64.78 | 78.36 | 8.23 | 40.04 | 75.94 |
| | ✓ | 61.20 | 75.93 | 14.33 | 51.57 | 80.89 |
| ✓ | ✓ | 64.88 | 78.70 | 18.62 | 55.85 | 83.69 |

Table 2: Ablations for key components of BlooDet for bleeding region and point detection on SurgBlood test set. EG and PMM denote edge generator and point memory modeling modules.

| Configs | IoU ↑ | Dice ↑ | # Params (M) |
|---|---|---|---|
| w/o Edge generator | 61.20 | 75.93 | 90.99 |
| w/o Laplacian Filter | 60.79 | 75.61 | 91.58 |
| w/o $F_{1}$ & $F_{2}$ | 64.74 | 78.60 | 91.55 |
| Edge generator | 64.88 | 78.70 | 91.58 |

Table 3: Influence of edge generator for bleeding region detection.

Quantitative Evaluation. Table 1 presents the performance of our framework and the comparison methods on the SurgBlood test set for bleeding region and point detection. Our dual-task synergistic framework outperforms competitors on both tasks, even surpassing models designed for a single task. Benefiting from edge-enhanced prompts and bleeding point guidance, BlooDet achieves the best bleeding region results, with 64.88% IoU and 78.70% Dice. For bleeding point detection, BlooDet significantly surpasses other methods, improving PCK-2% and PCK-5% by 6.27 and 14.17 percentage points, respectively, over the strongest competitor (SAM 2 with a point head). In addition, our dual-branch framework built upon large vision models obtains superior performance while adding few parameters (91.6M versus 80.8M for SAM 2).

Qualitative Evaluation. Fig. 7 shows a visual comparison of our approach with other multi-task frameworks. We can see that BlooDet provides greater stability and consistency in detecting bleeding regions and points. In complex surgical environments with low contrast, competitors tend to be disturbed by surrounding noise. In contrast, our method ensures robust bleeding detection across consecutive frames via spatial-temporal modeling and co-guidance.

5.4 Ablation Analysis

Contributions of Key Components. Table 2 illustrates the contribution of the key components in BlooDet for bleeding region and point detection. The experimental results show that point memory modeling contributes significantly to detecting bleeding points, e.g., improving the PCK-2% score by about 10.4 percentage points (from 8.23% to 18.62%). Moreover, our edge generator provides effective edge prompt embedding, which improves the IoU of bleeding region detection from 61.20% to 64.88%. In short, each key component contributes positively to model performance.

Ablations for Edge Generator. We investigate the effect of different edge generator designs embedded in the mask branch. As shown in Table 3, the Wavelet Laplacian filters respond strongly to edge features, effectively mitigating interference from complex background noise in laparoscopic scenes; removing them lowers IoU from 64.88% to 60.79%. Meanwhile, integrating high-resolution features with more texture information further enhances the quality of edge maps.

Figure 8: Attention maps in mask and point branches of BlooDet.

| Mask map | PCK-2% ↑ | PCK-5% ↑ | PCK-10% ↑ |
|---|---|---|---|
| Foreground | 12.03 | 41.02 | 71.99 |
| Background | 18.62 | 55.85 | 83.69 |
| Global | 15.49 | 49.59 | 82.00 |

Table 4: Ablations for optical flow operation via mask maps.
Figure 9: Effect of different information for mutual guidance.

Optical Flow Operation Design. In point memory modeling, we adopt reversed mask maps in conjunction with optical flow maps to estimate the average camera displacement. To validate the effect of this design, we ablate the impact of focusing on different regions in mask maps for bleeding point detection. Table 4 indicates that using the foreground region leads to inferior performance (12.03% PCK-2%, versus 18.62% with the background). This may be explained by the poor stability of optical flow in the rapidly changing bleeding area. In contrast, utilizing the background enables more stable motion modeling and point localization.
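The background-based displacement estimate can be illustrated with a short NumPy sketch: the bleeding mask is reversed so that only background pixels contribute, and the optical flow is averaged over them. The function name `mean_background_flow`, the array layout, and the `eps` stabiliser are assumptions for illustration; the flow itself would come from an external estimator such as the frozen PWC-Net and is synthetic here.

```python
import numpy as np


def mean_background_flow(flow, mask, eps=1e-7):
    """Average optical flow over background pixels.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    mask: (H, W) bleeding-region mask with values in [0, 1].
    """
    flow = np.asarray(flow, dtype=float)
    # Reversed mask map: 1 on background, 0 on the bleeding region.
    background = 1.0 - np.asarray(mask, dtype=float)
    weighted = flow * background[..., None]
    return weighted.sum(axis=(0, 1)) / (background.sum() + eps)


# Synthetic 4 x 4 example: the camera moves by (2, -1) everywhere,
# while the bleeding region (top-left 2 x 2) shows erratic motion.
flow = np.tile(np.array([2.0, -1.0]), (4, 4, 1))
flow[:2, :2] = [10.0, 10.0]
mask = np.zeros((4, 4))
mask[:2, :2] = 1.0

displacement = mean_background_flow(flow, mask)
```

Because the erratic foreground motion is excluded, the estimate recovers the global (2, -1) shift, which mirrors the stability argument made for the background setting in Table 4.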

Effect of Bidirectional Guidance. To validate the effectiveness of our cross-branch bidirectional guidance, Fig. 9 ablates the impact of point prompt from point branch as well as mask map and mask memory from mask branch. The results indicate that the point prompt contributes to bleeding region detection. For the point branch, the mask maps from previous frames assist with the optical flow operation, while mask memory fosters point memory modeling to improve the accuracy of point localization. The attention maps from the mask and point decoders visualized in Fig. 8 also verify the effect of bidirectional guidance.

6 Conclusion

This work advances the intelligent detection of bleeding regions and bleeding points in laparoscopic surgical videos. We contribute a new dataset for bleeding detection in actual surgical scenarios, SurgBlood, to facilitate benchmark construction. Accordingly, we design a dual-task synergistic online framework called BlooDet, which assembles mask and point branches in a bidirectional guidance structure, and exploits an edge generator and point memory modeling to enhance the adaptive prompting mechanism. Extensive experimental results demonstrate that our method outperforms existing related models for both bleeding tasks. We believe that this study can facilitate research in intelligent surgical assistance, reducing intraoperative decision-making risks and improving clinical outcomes.

References

  • Ali et al. [2025] Sharib Ali, Yamid Espinel, Yueming Jin, Peng Liu, Bianca Güttner, Xukun Zhang, Lihua Zhang, Tom Dowrick, Matthew J Clarkson, Shiting Xiao, et al. An objective comparison of methods for augmented reality in laparoscopic liver resection by preoperative-to-intraoperative image fusion from the miccai2022 challenge. Medical Image Analysis, 99:103371, 2025.
  • Bai et al. [2024] Yunhao Bai, Qinji Yu, Boxiang Yun, Dakai Jin, Yingda Xia, and Yan Wang. Fs-medsam2: Exploring the potential of sam2 for few-shot medical image segmentation without fine-tuning. arXiv preprint arXiv:2409.04298, 2024.
  • Bourbakis et al. [2005] N Bourbakis, Sokratis Makrogiannis, and Despina Kavraki. A neural network-based detection of bleeding in sequences of wce images. In Fifth IEEE Symposium on Bioinformatics and Bioengineering, pages 324–327, 2005.
  • Cao et al. [2022] Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, and Manning Wang. Swin-unet: Unet-like pure transformer for medical image segmentation. In ECCV, 2022.
  • Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, pages 213–229, 2020.
  • Chen et al. [2023] Tianrun Chen, Lanyun Zhu, Chaotao Deng, Runlong Cao, Yan Wang, Shangzhan Zhang, Zejian Li, Lingyun Sun, Ying Zang, and Papa Mao. Sam-adapter: Adapting segment anything in underperformed scenes. In IEEE ICCV, pages 3367–3375, 2023.
  • Chen et al. [2024] Tianrun Chen, Ankang Lu, Lanyun Zhu, Chaotao Ding, Chunan Yu, Deyi Ji, Zejian Li, Lingyun Sun, Papa Mao, and Ying Zang. Sam2-adapter: Evaluating & adapting segment anything 2 in downstream tasks: Camouflage, shadow, medical image segmentation, and more. arXiv preprint arXiv:2408.04579, 2024.
  • Dai et al. [2024] Chao Dai, Yang Wang, Chaolin Huang, Jiakai Zhou, Qilin Xu, and Minpeng Xu. A cephalometric landmark regression method based on dual-encoder for high-resolution x-ray image. In ECCV, pages 93–109, 2024.
  • Das et al. [2023] Adrito Das, Danyal Z Khan, Simon C Williams, John G Hanrahan, Anouk Borg, Neil L Dorward, Sophia Bano, Hani J Marcus, and Danail Stoyanov. A multi-task network for anatomy identification in endoscopic pituitary surgery. In MICCAI, pages 472–482, 2023.
  • Deng et al. [2024] Xiaolong Deng, Huisi Wu, Runhao Zeng, and Jing Qin. Memsam: taming segment anything model for echocardiography video segmentation. In IEEE CVPR, pages 9622–9631, 2024.
  • Deziel et al. [1993] Daniel J Deziel, Keith W Millikan, Steven G Economou, Alexander Doolas, Sung-Tao Ko, and Mohan C Airan. Complications of laparoscopic cholecystectomy: a national survey of 4,292 hospitals and an analysis of 77,604 cases. The American Journal of Surgery, 165(1):9–14, 1993.
  • Duan et al. [2019] Jinming Duan, Ghalib Bello, Jo Schlemper, Wenjia Bai, Timothy JW Dawes, Carlo Biffi, Antonio de Marvao, Georgia Doumoud, Declan P O’Regan, and Daniel Rueckert. Automatic 3d bi-ventricular segmentation of cardiac images by a shape-refined multi-task deep learning approach. IEEE TMI, 38(9):2151–2164, 2019.
  • Gallaher and Charles [2022] Jared R Gallaher and Anthony Charles. Acute cholecystitis: a review. Jama, 327(10):965–975, 2022.
  • Guo et al. [2024] Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Chengzhuo Tong, Peng Gao, Chunyuan Li, and Pheng-Ann Heng. Sam2point: Segment any 3d as videos in zero-shot and promptable manners. arXiv preprint arXiv:2408.16768, 2024.
  • Hirai et al. [2022] Yuichiro Hirai, Ai Fujimoto, Naomi Matsutani, Soichiro Murakami, Yuki Nakajima, Ryoichi Miyanaga, Yoshihiro Nakazato, Kazuyo Watanabe, Masahiro Kikuchi, and Naohisa Yahagi. Evaluation of the visibility of bleeding points using red dichromatic imaging in endoscopic hemostasis for acute gi bleeding (with video). Gastrointestinal Endoscopy, 95(4):692–700, 2022.
  • Kirillov et al. [2023] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In IEEE ICCV, pages 4015–4026, 2023.
  • Lee [1996] Tai Sing Lee. Image representation using 2d gabor wavelets. IEEE TPAMI, 18(10):959–971, 1996.
  • Li et al. [2020] Lu Li, Meng Wei, BO Liu, Kunakorn Atchaneeyasakul, Fugen Zhou, Zehao Pan, Shimran A Kumar, Jason Y Zhang, Yuehua Pu, David S Liebeskind, et al. Deep learning for hemorrhagic lesion detection and segmentation on brain ct images. IEEE JBHI, 25(5):1646–1659, 2020.
  • Li et al. [2022] Yanjie Li, Sen Yang, Peidong Liu, Shoukui Zhang, Yunxiao Wang, Zhicheng Wang, Wankou Yang, and Shu-Tao Xia. Simcc: A simple coordinate classification perspective for human pose estimation. In ECCV, pages 89–106, 2022.
  • Lin et al. [2017] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE ICCV, pages 2980–2988, 2017.
  • Liu et al. [2019] Chuanbin Liu, Hongtao Xie, Sicheng Zhang, Jingyuan Xu, Jun Sun, and Yongdong Zhang. Misshapen pelvis landmark detection by spatial local correlation mining for diagnosing developmental dysplasia of the hip. In MICCAI, pages 441–449, 2019.
  • Liu et al. [2024] Haofeng Liu, Erli Zhang, Junde Wu, Mingxuan Hong, and Yueming Jin. Surgical sam 2: Real-time segment anything in surgical video by efficient frame pruning. In NeurIPS Workshop, 2024.
  • Mao et al. [2024] Zhehua Mao, Adrito Das, Mobarakol Islam, Danyal Z Khan, Simon C Williams, John G Hanrahan, Anouk Borg, Neil L Dorward, Matthew J Clarkson, Danail Stoyanov, et al. Pitsurgrt: real-time localization of critical anatomical structures in endoscopic pituitary surgery. IJCARS, pages 1–8, 2024.
  • Milletari et al. [2016] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, pages 565–571, 2016.
  • Mori et al. [2024] Yosuke Mori, Taro Iwatsubo, Akitoshi Hakoda, Shin Kameishi, Kazuki Takayama, Shun Sasaki, Ryoji Koshiba, Shinya Nishida, Satoshi Harada, Hironori Tanaka, et al. Red dichromatic imaging improves the recognition of bleeding points during endoscopic submucosal dissection. Digestive Diseases and Sciences, 69(1):216–227, 2024.
  • Pei et al. [2024a] Jialun Pei, Ruize Cui, Yaoqian Li, Weixin Si, Jing Qin, and Pheng-Ann Heng. Depth-driven geometric prompt learning for laparoscopic liver landmark detection. In MICCAI, pages 154–164, 2024a.
  • Pei et al. [2024b] Jialun Pei, Zhangjun Zhou, and Tiantian Zhang. Evaluation study on sam 2 for class-agnostic instance-level segmentation. arXiv preprint arXiv:2409.02567, 2024b.
  • Ravi et al. [2025] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In ICLR, 2025.
  • Su et al. [2022] Ruisheng Su, Matthijs van der Sluijs, Sandra AP Cornelissen, Geert Lycklama, Jeannette Hofmeijer, Charles BLM Majoie, Pieter Jan van Doormaal, Adriaan CGM Van Es, Danny Ruijters, Wiro J Niessen, et al. Spatio-temporal deep learning for automatic detection of intracranial vessel perforation in digital subtraction angiography during endovascular thrombectomy. Medical Image Analysis, 77:102377, 2022.
  • Sun et al. [2018] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In IEEE CVPR, pages 8934–8943, 2018.
  • Sun et al. [2019] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In IEEE CVPR, pages 5693–5703, 2019.
  • Sunakawa et al. [2024] Taiki Sunakawa, Daichi Kitaguchi, Shin Kobayashi, Keishiro Aoki, Manabu Kujiraoka, Kimimasa Sasaki, Lena Azuma, Atsushi Yamada, Masashi Kudo, Motokazu Sugimoto, et al. Deep learning-based automatic bleeding recognition during liver resection in laparoscopic hepatectomy. Surgical Endoscopy, pages 1–7, 2024.
  • Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. NeurIPS, 33:7537–7547, 2020.
  • Tuysuzoglu et al. [2018] Ahmet Tuysuzoglu, Jeremy Tan, Kareem Eissa, Atilla P Kiraly, Mamadou Diallo, and Ali Kamen. Deep adversarial context-aware landmark detection for ultrasound imaging. In MICCAI, pages 151–158, 2018.
  • Varghese et al. [2024] Chris Varghese, Ewen M Harrison, Greg O’Grady, and Eric J Topol. Artificial intelligence in surgery. Nature Medicine, pages 1–12, 2024.
  • Vaswani et al. [2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. NeurIPS, 30, 2017.
  • Vorontsov et al. [2024] Eugene Vorontsov, Alican Bozkurt, Adam Casson, George Shaikovski, Michal Zelechowski, Kristen Severson, Eric Zimmermann, James Hall, Neil Tenenholtz, Nicolo Fusi, et al. A foundation model for clinical-grade computational pathology and rare cancers detection. Nature Medicine, 30(10):2924–2935, 2024.
  • Wang et al. [2024] Haonan Wang, Jie Liu, Jie Tang, Gangshan Wu, Bo Xu, Yanbing Chou, and Yong Wang. Gtpt: Group-based token pruning transformer for efficient human pose estimation. In ECCV, pages 213–230, 2024.
  • Wu et al. [2024] Renkai Wu, Pengchen Liang, Yiqi Huang, Qing Chang, and Huiping Yao. Automatic segmentation of hemorrhages in the ultra-wide field retina: multi-scale attention subtraction networks and an ultra-wide field retinal hemorrhage dataset. IEEE JBHI, 2024.
  • Xiong et al. [2024] Xinyu Xiong, Zihuang Wu, Shuangyi Tan, Wenxue Li, Feilong Tang, Ying Chen, Siying Li, Jie Ma, and Guanbin Li. Sam2-unet: Segment anything 2 makes strong encoder for natural and medical image segmentation. arXiv preprint arXiv:2408.08870, 2024.
  • Zhang et al. [2020a] Dongqing Zhang, Jianing Wang, Jack H Noble, and Benoit M Dawant. Headlocnet: Deep convolutional neural networks for accurate classification and multi-landmark localization of head cts. Medical Image Analysis, 61:101659, 2020a.
  • Zhang et al. [2017] Jun Zhang, Mingxia Liu, Li Wang, Si Chen, Peng Yuan, Jianfu Li, Steve Guo-Fang Shen, Zhen Tang, Ken-Chung Chen, James J Xia, et al. Joint craniomaxillofacial bone segmentation and landmark digitization by context-guided fully convolutional networks. In MICCAI, pages 720–728, 2017.
  • Zhang et al. [2020b] Jun Zhang, Mingxia Liu, Li Wang, Si Chen, Peng Yuan, Jianfu Li, Steve Guo-Fang Shen, Zhen Tang, Ken-Chung Chen, James J Xia, et al. Context-guided fully convolutional networks for joint craniomaxillofacial bone segmentation and landmark digitization. Medical Image Analysis, 60:101621, 2020b.
  • Zheng et al. [2015] Yefeng Zheng, David Liu, Bogdan Georgescu, Hien Nguyen, and Dorin Comaniciu. 3d deep learning for efficient and robust landmark detection in volumetric data. In MICCAI, pages 565–572, 2015.
  • Zhong et al. [2019] Zhusi Zhong, Jie Li, Zhenxi Zhang, Zhicheng Jiao, and Xinbo Gao. An attention-guided deep regression model for landmark detection in cephalograms. In MICCAI, pages 540–548, 2019.
  • Zhu et al. [2024] Jiayuan Zhu, Yunli Qi, and Junde Wu. Medical sam 2: Segment medical images as video via segment anything model 2. arXiv preprint arXiv:2408.00874, 2024.