
Sigma Jahan, Dalhousie University, Canada, email: sigma.jahan@dal.ca
Mehil B. Shah, Dalhousie University, Canada, email: shahmehil@dal.ca
Mohammad Masudur Rahman, Dalhousie University, Canada, email: masud.rahman@dal.ca

Towards Understanding the Challenges of Bug Localization in Deep Learning Systems

Sigma Jahan Mehil B. Shah Mohammad Masudur Rahman


Abstract

Software bugs cost the global economy billions of dollars annually and claim $\sim$50% of software developers' programming time. Locating these bugs is crucial for their resolution but challenging. It is even more challenging in deep-learning systems due to their black-box nature. Bugs in these systems can hide not only in the code but also in the models and training data, which might make traditional debugging methods less effective. In this article, we conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems. First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software. We found that these techniques significantly underperform in localizing deep-learning system bugs. Second, we evaluate how different bug types in deep-learning systems impact bug localization. We found that the effectiveness of localization techniques varies with bug type due to their unique challenges. For example, tensor bugs were easier to locate due to their structural nature, while all techniques struggled with GPU bugs due to their external dependencies. Third, we investigate the impact of bugs’ extrinsic nature on localization in deep-learning systems. We found that deep-learning bugs are often extrinsic and thus connected to artifacts other than source code (e.g., GPU, training data), which contributes to the poor performance of existing localization methods.

Keywords: Bug localization, Deep Learning Bug, Deep Learning Framework, Extrinsic Bugs, Information Retrieval, GPU Bug, Training Bug

1 Introduction

Software bugs are human-made errors in the code that prevent it from working correctly [1]. They are often prevalent in modern software systems and could range from hundreds to thousands in a single system [2]. Due to the bugs in software systems, the global economy loses billions of dollars every year [3, 4]. Developers also spend about 50% of their programming time dealing with software bugs and failures [3]. To correct any bug, the developers first need to identify the location of a bug within a software system, which is known as bug localization [5]. According to a recent survey, 49.20% of 327 software practitioners from several major technology companies (e.g., Google, Meta, Amazon, and Microsoft) consider the localization of bugs as one of the most challenging tasks during software development and maintenance [6].

While localizing bugs in traditional software systems (a.k.a., non-deep learning systems) remains a challenge, it could be even more challenging in deep learning systems. Unlike bugs in non-deep learning systems, deep learning-related bugs could be hidden in the source code, training data, trained models, or even deployment scripts [7, 8, 9]. Besides, the use of various deep learning frameworks (e.g., PyTorch, Caffe, and TensorFlow) could make these bugs even more complex [10].

Given the prevalence and costs of software bugs, any automated support to localize the bugs can greatly benefit software practitioners. Over the years, many approaches have been designed to localize bugs in traditional software systems using information retrieval [11, 12, 13, 14], dynamic program analysis [15, 16], and deep learning [17, 18, 19]. However, due to the significant differences between traditional and deep learning bugs, these existing solutions might not be adequate for localizing bugs in deep learning systems.

To date, there exist only a few techniques for detecting bugs in deep learning systems. Most of them concentrate on specific types of bugs (e.g., model bugs, training bugs) without considering the broader spectrum of deep learning systems. Wardat et al. [18] propose a dynamic approach to localize different types of model bugs in Deep Neural Networks (DNNs). They identify the faulty layers containing numerical bugs by customizing Keras’ callback function and analyzing the dynamic behaviors of a model. However, their solution focuses only on model bugs from deep learning systems, is strongly coupled with the Keras library, and achieves a low accuracy, all of which present significant challenges for widespread adoption by the industry. In another study, Wardat et al. [20] propose a heuristic-based approach to diagnose two main categories of bugs – model bugs and training bugs. They also recommend actionable fixes for the bugs based on the diagnosis. Since their approach depends on a set of hard-coded rules, it might be limited in terms of scalability and context-awareness. In a recent work, Cao et al. [21] introduce a technique that leverages the dynamic properties of a model and an ensemble of three machine learning classifiers (i.e., KNN, Decision Tree, Random Forest) to localize five types of training bugs (e.g., loss, gradient) from deep learning systems. Their technique might also not be able to address a broader array of bug types from deep learning systems, which raises scalability concerns. On the other hand, Kim et al. [22] use basic Information Retrieval (IR) algorithms, such as rVSM and BM25, to localize bugs in deep-learning systems. They report poor performance but do not perform any comprehensive analysis to explain why IR-based techniques perform poorly.

Interestingly, at least 30 techniques adopt IR algorithms to locate bugs in traditional software systems due to their computational efficiency and lightweight nature [11, 12, 14, 13, 23, 22]. They were also reported to perform comparably to the complex models (e.g., LDA) [24]. Unlike deep learning-based techniques, IR-based techniques rely on the textual similarity between bug reports and source code as a proxy of suspiciousness, which is simple and explainable. However, IR-based techniques suffer from vocabulary mismatch issues [25] and can only capture linear relationships between two items. On the contrary, deep learning-based techniques can capture the non-linear relationships between two items [26, 27, 28]. Thus, they have the potential to capture more nuances in the relevance between enriched information from the source code and bug reports. However, they also suffer from poor outlier handling, class imbalance problems, and a lack of monitoring [29]. Thus, the potential of existing solutions for localizing bugs in deep learning applications is neither well understood nor well investigated to date. Our work in this article fills in this important gap in the literature.

In this article, we conduct a large-scale empirical study to better understand the challenges of locating bugs in deep learning systems. First, we collect a total of 2,365 bugs from deep-learning systems and 2,913 bugs from traditional software systems (a.k.a., non-deep-learning systems), and empirically show how existing techniques (e.g., BugLocator [11], BLUiR [12], BLIA [23], DNNLOC [17]) perform in locating bugs from deep learning systems. Our work utilizes these traditional techniques as a foundational framework, adapting their core principles to the specific nuances of deep learning bugs. Second, we categorize our collected bugs based on an existing bug taxonomy [30] and find that certain bugs from deep learning systems (e.g., GPU bugs) are more difficult to locate than others due to their multifaceted, heterogeneous dependency issues. Finally, we find that deep learning bugs are connected to artifacts other than source code (e.g., GPU, training data, external dependencies) and are prone to be extrinsic in nature, which might explain the poor performance of existing techniques for these bugs. We thus answer three important research questions in our study as follows.

(a) RQ$_{1}$: How effective are the existing approaches in localizing bugs from deep learning systems?
We evaluated the performance of four existing approaches (BugLocator [11], BLUiR [12], BLIA [23], and DNNLOC [17]) using two datasets – Denchmark [31] and BugGL [32]. First, we found that their performance measures are poorer (e.g., 31.59% less MAP for BugLocator, 33.25% for BLUiR, 34.14% for BLIA, 31.43% for DNNLOC) in localizing bugs from deep learning systems than from non-deep learning systems. Our statistical tests (t-test [33], Cohen’s d [34]) also report that their performance is significantly lower. Second, we found that localizing bugs from the deep learning frameworks is more challenging than from libraries or tools due to the frameworks’ inherent complexity. Although our findings reinforce the existing understanding and belief about the challenges of the bugs in deep learning systems [35, 22, 18], we also substantiate them with solid empirical evidence and demonstrate the performance gap of existing solutions in localizing the two categories of bugs.
(b) RQ$_{2}$: How do different types of bugs in deep learning systems impact bug localization?
We use an existing taxonomy [30] of bugs to classify the bugs in deep learning systems and evaluate the performance of four existing techniques for each type of bug. First, we found that 64.80% of the bugs from deep learning systems are related to deep learning (e.g., model, training), whereas the remaining ones are not. Second, we found that DNNLOC demonstrated better results in locating model and tensor bugs, possibly due to its ability to capture comprehensive contextual information specific to these bugs. We also found that BLUiR performs comparably to DNNLOC for training bugs, which might be attributed to its structured information retrieval. However, all four baseline techniques experienced difficulty localizing GPU bugs. Thus, our analysis offers valuable insights regarding the nature of different types of bugs in deep learning systems and highlights the specific strengths and weaknesses of existing techniques, which could be useful to advance debugging support for deep learning systems.
(c) RQ$_{3}$: What are the implications of extrinsic bugs in deep learning systems for bug localization?
Bugs triggered by external entities (e.g., third-party libraries, GPU) are called extrinsic bugs [36]. Given the frequent use of deep learning libraries and their external dependencies, the bugs in deep learning systems could be extrinsic [10]. Since the existing techniques mostly focus on intrinsic bugs (i.e., triggered by a bug-introducing change), we investigate how they deal with extrinsic bugs from deep learning systems. First, we found that 40.00% of the bugs in deep learning systems are extrinsic, which is almost four times higher than that of non-deep learning systems. Second, we found that the localization performance of existing techniques degrades significantly for extrinsic bugs (e.g., 15.20% less MAP for DNNLOC) (Table 21). We also found that deep neural network-based solutions (e.g., DNNLOC) are not particularly helpful for locating extrinsic bugs either, because they are designed to detect code patterns, not issues from external sources or environments. Finally, we found a significant correlation between the bugs in deep learning systems and the extrinsic factors using appropriate statistical analysis (Chi-Square test), which delivers valuable insight for designing effective solutions to find bugs in deep learning systems.

2 Background

In this section, we introduce the necessary terminologies and concepts to follow the remainder of the article. We introduce extrinsic bugs, intrinsic bugs, and the taxonomy of deep-learning bugs.

2.1 Extrinsic bug

A bug caused by factors external to a software system, such as changes to the operating environment, requirements, or third-party libraries, is known as an extrinsic bug. Rodriguez-Perez et al. suggest three heuristics based on bug reports to identify extrinsic bugs as follows [36].

(a) Environment: An extrinsic bug is caused by a modification to the environment in which the software system operates. The environment could be an operating system, a physical machine, or even a cloud infrastructure.

(b) Requirement: An extrinsic bug is triggered by a change outside of the project’s version control system. During software development, if a user requirement gets changed after implementation, the development team might implement the new requirement without discarding the old feature. The old, unexpected feature will then be considered as an extrinsic bug.

(c) Third-party library: The bug found in the project’s third-party library is considered an extrinsic bug. For example, if a software project uses a third-party library for processing images for a mobile application, and the app crashes when processing certain image formats due to a bug in that third-party library, that bug will then be considered an extrinsic bug.

2.2 Intrinsic bug

An intrinsic bug is not caused by external factors; rather, it is caused by a bug-introducing change in the version control system [36]. For example, if a messaging application fails to deliver messages due to a logical error in a recent code change, that would be an intrinsic bug.

Table 1: Examples of deep learning-related and non-deep learning-related extrinsic bugs [37, 38]

Deep Learning-related Extrinsic Bug [37]

Title: wmt19 model cannot run on GPU except #0.

Description: I am running the tutorial. I successfully loaded the model from the hub and tried to run it on the second GPU (id=1), which raised an exception that data and models are stored on different GPUs. With GPU (id=0) works fine.

Code sample:

    import torch
    en2de = torch.hub.load('pytorch/fairseq', 'transformer.wmt19.en-de',
        checkpoint_file='model1.pt:model2.pt:model3.pt:model4.pt',
        tokenizer='moses', bpe='fastbpe').to(torch.device('cuda:1'))
    result = en2de.translate(['hello'])

Environment: fairseq Version==0.9.0, PyTorch Version==1.4.0, OS: Ubuntu 18.04, Python version: 3.6, CUDA/cuDNN version: 10.2, GPU models and configuration: RTX 2080 x 2

Non-Deep Learning-related Extrinsic Bug [38]

Title: GitHub CI on Windows is broken.

Description: Normally, we should skip distributed tests on Windows with

    SKIP_DISTRIB_TESTS=1 CI_PYTHON_VERSION="3.7" sh tests/run_cpu_tests.sh,

but a distributed test was executed:

    tests/ignite/contrib/engines/test_common.py::test_distrib_cpu ERROR [2%]

Related to beta support of distributed on Windows in Pytorch 1.7

2.3 Taxonomy of bugs in deep learning systems

Software bugs in deep learning systems can be divided into two categories – DL bug and NDL bug [30].

Deep Learning (DL) bug refers to a software error that is connected to the deep learning module embedded in the software system, causing inaccurate or unexpected output. According to the existing literature [30], DL bugs can be divided into five main categories: Model, Training, Tensor & Input, API, and GPU.

• Model bug is connected to the structure and properties of a deep learning model (Table 14 [39]). An example of a model bug is an incorrect model initialization caused by an input image size mismatch, resulting in inaccurate output in a computer vision application.

• Training bug occurs during the training phase of a deep learning application (Table 15 [40]). For instance, during the training of a deep learning model for object detection, if the loss function is incorrectly defined, the model will learn to detect objects with very poor accuracy, leading to incorrect output from the system.

• Tensor & Input bug (a.k.a., tensor bug) occurs due to wrong tensor input or tensor calculation issues (Table 16 [41]). For instance, if the tensor input shape is declared incorrectly, it will lead to output errors.

• API bug occurs due to incorrect use of an API in the deep learning software system (Table 17 [42]). For example, an API bug might occur if a developer mistakenly calls the wrong API function from the deep learning framework (e.g., TensorFlow), causing inaccurate results in the output.

• GPU bug is connected to the Graphics Processing Unit (GPU) used in the system (Table 18 [43]). For example, if the model’s memory requirements exceed the available GPU memory, or the GPU is not compatible with the DL framework, then they could lead to errors during model training.

Non-Deep Learning (NDL) bug refers to a software error that is not related to the deep learning module but still leads to unexpected behaviors in deep learning applications. An example of an NDL bug could be a logical error in the source code that leads to a deadlock, leaving the program stuck indefinitely.

As shown in Table 1, Bug 1860 [37] is a deep learning-related extrinsic bug triggered by a change in the environment. When the WMT19 model is moved to a GPU other than the default one (id=0), the execution fails with an exception stating that the model and the data are stored on different GPUs. The bug is clearly related to the deep learning module. On the other hand, it does not originate from the Fairseq library (a.k.a., deep learning application) itself; rather, it is related to external factors (e.g., GPU), which indicates its extrinsic nature.

In Table 1, Bug 1426 [38] is another extrinsic bug, connected to the Windows OS environment. The bug is triggered when the tests from the CI pipeline are distributed over multiple Windows machines. It is clearly not related to deep learning (i.e., it is an NDL bug), but the triggering factors are outside of the version control system, which indicates an extrinsic nature.

3 Study Methodology

Figure 1: Schematic diagram of our empirical study

Fig. 1 shows the schematic diagram of our conducted study. First, we collect bug reports from two benchmark datasets for two different software systems: deep learning systems [31] and traditional software systems [32]. Then, we contrast the performance of four existing techniques [11, 12, 23, 17] in locating bugs between deep learning systems [31] and traditional software systems [32]. Second, we perform an in-depth analysis to understand the challenges of localizing different types of deep-learning bugs. Finally, we investigate the influence of extrinsic factors on deep learning bugs and their impact on bug localization. This section discusses the major steps of our study design as follows.

3.1 Construction of dataset

Dataset collection. In our study, we use two benchmark datasets – BugGL and Denchmark – that have been used in the prior literature [32, 22]. BugGL [32] contains bugs from Python-based, traditional software systems (a.k.a., non-deep learning-based systems), whereas Denchmark focuses on bugs from deep learning-based systems. BugGL contains a total of 2,913 bug reports from 12 Python projects [32]. On the other hand, the original Denchmark dataset [31] contains 4,577 bug reports from 193 deep learning-based projects, which are written in ten programming languages (JavaScript, Python, Java, Go, C++, Ruby, TypeScript, PHP, C#, and C). We used 2,365 bug reports from 136 deep learning-based projects (written in Python) from the Denchmark dataset [22]. We limited our dataset to Python-based bugs to ensure a fair contrastive analysis between deep learning bugs and traditional software bugs.

Table 2: Original [31] and Experimental Distribution of Projects and Bugs

Original Distribution
Category	Projects	Bugs
Framework	25	836
Platform	8	150
Engine	4	47
Compiler	2	33
Tool	31	510
Library	44	666
Application	12	124
Total	136	2365
Experimental Dataset (DL Systems)
Category	Projects	Bugs
Framework	17	746
Tool	28	455
Library	41	594
Total	86	1795

We adopted a set of filtration steps to construct our experimental dataset, which consists of 86 projects from DL systems (shown in Table 2). First, the initial Denchmark dataset has 136 projects across various classes, including client-based applications, frameworks, tools, libraries, search engines, and compilers. Our study focuses on core elements of deep learning – frameworks (e.g., Apache MXNet), libraries (e.g., OpenCV), and tools (e.g., Fairseq) – rather than application-specific software (e.g., PhotoPrism). We thus collect the systems pertaining to those three categories and discard all application-specific systems. Second, we checked for overlapping projects among these classes to ensure distinct categorization. Finally, from a total of 136 projects, we finalized our experimental dataset with 17 projects categorized under frameworks, 28 under tools, and 41 under libraries, bringing the total to 86 projects from deep learning systems.

Deep learning systems differ significantly from traditional software systems due to the complexity of model integration, intricate interactions between deep learning libraries [10], and multifaceted dependencies.

Data cleaning and pre-processing. After collecting the data from two benchmark datasets, we cleaned and preprocessed them using a set of steps.
Corpus creation. We first downloaded the latest version of the code repositories to ensure that we had the most up-to-date code for analysis. Next, to accurately link bug reports to their corresponding buggy code, we used the heuristic of Kim et al. [22], which focuses on commit messages and bug reports. This involved analyzing commit messages for keywords indicative of bug fixes (e.g., ‘fix’, ‘bug’, ‘error’, ‘resolve’) and connecting them to the corresponding bug reports. To identify the correct bug-fix commit, we cross-referenced these commit messages with bug IDs from the reports, ensuring a precise match. To extract the buggy part from the bug-fix commit, we employed PyDriller, as suggested by Kim et al. [22]. We also captured each project’s most recently released version as of the bug report’s date and collected the corresponding version of the buggy code, especially when the bug reports lacked clear buggy version information.
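The keyword-and-ID linking heuristic described above can be sketched as follows. This is a minimal illustration rather than the implementation of Kim et al. [22]: the `#<number>` reference pattern, the function names, and the commit hashes and messages are all hypothetical, and mining the actual repositories (e.g., with PyDriller) is left out.

```python
import re

# Keywords named in the text as indicative of bug fixes.
FIX_KEYWORDS = ("fix", "bug", "error", "resolve")
# Assumption: commit messages reference bug IDs as "#<number>".
ISSUE_REF = re.compile(r"#(\d+)")


def is_fix_message(message):
    """Return True if the commit message contains any bug-fix keyword."""
    lowered = message.lower()
    return any(keyword in lowered for keyword in FIX_KEYWORDS)


def link_commits_to_bugs(commits, bug_ids):
    """Map each known bug ID to the hashes of fix commits that mention it.

    commits: iterable of (commit_hash, message) pairs.
    bug_ids: set of integer bug IDs taken from the bug reports.
    """
    links = {}
    for commit_hash, message in commits:
        if not is_fix_message(message):
            continue
        for ref in ISSUE_REF.findall(message):
            bug_id = int(ref)
            if bug_id in bug_ids:
                links.setdefault(bug_id, []).append(commit_hash)
    return links


# Hypothetical commit log; hashes and messages are made up for illustration.
commits = [
    ("a1b2c3", "Fix device mismatch when loading hub model (#1860)"),
    ("d4e5f6", "Add new tutorial for translation"),
    ("0a9b8c", "Resolve CI error on Windows, closes #1426"),
    ("77aa11", "Refactor tests (#9999)"),
]
links = link_commits_to_bugs(commits, {1860, 1426})
print(links)
```

Commits without a fix keyword, or that reference an ID absent from the bug reports, are ignored, which mirrors the cross-referencing step in the text.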
Query construction. In IR-based bug localization, bug reports are treated as queries that can be executed with a search engine to detect the relevant source documents from the corpus. We construct a repository of bug reports by parsing the original datasets (Denchmark & BugGL) and extracting important information such as bug IDs, descriptions, and timestamps. We construct the queries by extracting tokens from the title and description of bug reports, removing stop words, stemming each word, and splitting the tokens.
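A minimal sketch of such a query-construction step is given below. The stop-word list, the suffix-stripping stemmer (a crude stand-in for a real stemmer such as Porter's), and all function names are assumptions made for illustration; they are not taken from the replication package.

```python
import re

# Assumption: a tiny illustrative stop-word list; real pipelines use larger ones.
STOP_WORDS = {"a", "an", "the", "is", "on", "in", "to", "and", "it", "i", "of", "with"}


def split_identifier(token):
    """Split snake_case and camelCase identifiers into their parts."""
    parts = []
    for piece in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", piece))
    return parts


def naive_stem(word):
    """Rough suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_query(title, description):
    """Tokenize, split identifiers, drop stop words and numbers, then stem."""
    text = f"{title} {description}"
    raw_tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text)
    terms = []
    for token in raw_tokens:
        for part in split_identifier(token):
            word = part.lower()
            if word not in STOP_WORDS and not word.isdigit():
                terms.append(naive_stem(word))
    return terms


query = build_query(
    "wmt19 model cannot run on GPU",
    "Loading checkpointFile raised an exception",
)
print(query)
```

The sketch splits identifiers before filtering and stemming so that compound names such as `checkpointFile` contribute their individual words to the query.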
Metadata extraction. We also capture the historical context of bugs by extracting their commit history information from the repositories, including commit messages, authors, timestamps, and code change history. This information provides valuable insights into the evolution of the codebase and the bugs over time, and it is required to replicate the existing technique BLIA [23].
Ground truth construction. Both benchmark datasets provide the ground truth that contains the correct locations of bugs in the code against the bug reports. To evaluate the performance of the bug localization approaches, we collect ground truth files from both of the original datasets.
To ensure a fair performance comparison of the bug localization techniques between deep learning systems and non-deep learning systems, we selected an equal number of bug reports from both datasets using probability sampling (1,795 bug reports from each dataset), which corresponds to a 95% confidence level and a 5% margin of error [44]. We applied the principle of randomization when selecting the subsets [45] to avoid any bias. We also manually analyzed the projects to avoid any overlap between the two datasets. We spent $\approx$ 5 hours on the manual analysis.
Categorizing Classes in Deep Learning Systems. In the original dataset, Kim et al. [31] identified different categories (e.g., frameworks, libraries, tools) of deep learning systems. However, some projects overlapped across these categories, since the boundaries between frameworks, libraries, and tools are sometimes unclear. Some libraries may evolve into frameworks or vice versa, and tools could be integrated into either libraries or frameworks. To address this issue, we removed the overlapping projects from our dataset. We then manually analyzed each selected project based on its official documentation to ensure that it clearly belonged to one of the three categories. This process helped us maintain clear distinctions among the frameworks, libraries, and tools in our study. We spent $\approx$ 5 hours on the manual analysis. We have provided the list of the selected projects from these three categories, along with their detailed project descriptions, in our replication package [46].

Table 3: Study dataset for bug localization [31, 32]

Data	#P	#BR	#SF (Mean)	#SF (Max)	#BF (Mean)	#BF (Max)	#Versions (Mean)	#Versions (Max)
Dench.	136	1795	441.60	3,559	2.60	227	6.7	80
BugGL	12	1795	420.50	3,306	2.35	198	6.2	65
P: Projects, BR: Bug Reports, SF: Source Files, BF: Buggy Files

3.2 Replication of existing techniques for experiments

To answer our first research question, we needed to replicate existing techniques that localize bugs in deep learning systems. We thus select suitable representatives from the existing literature on bug localization. In particular, we choose baseline methods from the frequently used methodologies – Information Retrieval (IR) and Deep Learning (DL).

At least 30 approaches adopt IR methods for localizing software bugs due to their computational efficiency and lightweight nature [22]. IR-based techniques rely on the textual similarity between bug reports and source code as a proxy of suspiciousness, which is explainable. We thus used three different baseline techniques from IR. We selected BugLocator [11] as our initial baseline technique since it is the seminal work on IR-based bug localization. Then, we chose BLUiR [12], recognized as the first structured IR-based technique that integrates code structure and bug report structure in its analysis. As our third baseline, we selected BLIA [23], notable for its incorporation of meta-components such as stack traces and version control history, which has demonstrated better performance compared to other IR-based techniques.

Unlike the above IR-based techniques that rely on the textual similarity between code and bug reports, DL-based approaches have the potential to uncover complex, non-linear relationships between dependent and independent variables [26, 27, 28]. Thus, we also used a Deep Neural Network (DNN) for bug localization in deep learning systems by adapting an existing work – DNNLOC [17]. DNNLOC is a seminal DL-based hybrid approach to bug localization that integrates both traditional and deep learning techniques.

Most of the recent techniques are based on these four primary approaches with incremental improvements [14, 13, 47, 19]. We chose these baseline methods for our study, which can be considered as a representative sample of the existing approaches for bug localization.

BugLocator [11] uses rVSM (revised Vector Space Model), which refines the classic VSM by taking document length into consideration, to detect relevant source code documents against a bug report. It also calculates SimiScore, a measure of similarity between a newly reported bug and previously fixed bugs based on their bug reports. SimiScore is combined with the rVSM score to calculate the final relevancy score. The relevant source code files are then ranked based on their combined scores, and the top-K documents are marked as buggy. Many subsequent IR-based techniques [14, 13, 23] adopted this method due to its simplicity and explainability. Hence, we chose this method as our first baseline.
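The score combination and ranking step can be illustrated with a short sketch. It assumes that both scores are min-max normalized and mixed with a weighting factor `alpha`; the exact weighting, the default value of `alpha`, the file names, and the raw scores below are assumptions made for illustration and are not taken from this study.

```python
def min_max_normalize(scores):
    """Scale a dict of scores to [0, 1]; all-equal scores map to 0.0."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {name: 0.0 for name in scores}
    return {name: (value - low) / (high - low) for name, value in scores.items()}


def combine_and_rank(rvsm_scores, simi_scores, alpha=0.2, top_k=10):
    """Rank files by a weighted mix of normalized rVSM and SimiScore values.

    alpha is a hypothetical weight for SimiScore; files without a
    SimiScore entry contribute 0.0 for that component.
    """
    rvsm = min_max_normalize(rvsm_scores)
    simi = min_max_normalize(simi_scores)
    final = {f: (1 - alpha) * rvsm[f] + alpha * simi.get(f, 0.0) for f in rvsm}
    # Sort by descending score; break ties by file name for determinism.
    ranked = sorted(final, key=lambda f: (-final[f], f))
    return ranked[:top_k]


# Made-up raw scores for three hypothetical source files.
rvsm_scores = {"model.py": 0.9, "trainer.py": 0.5, "utils.py": 0.1}
simi_scores = {"model.py": 0.0, "trainer.py": 1.0, "utils.py": 0.2}

ranking_text_heavy = combine_and_rank(rvsm_scores, simi_scores, alpha=0.2)
ranking_history_heavy = combine_and_rank(rvsm_scores, simi_scores, alpha=0.6)
print(ranking_text_heavy, ranking_history_heavy)
```

Raising `alpha` shifts the ranking towards files that were fixed for similar past bug reports, which is why the two example rankings differ in their top file.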

BLUiR [12] uses AST parsing to extract four items (class, method, variable, and comment) from each source code document. It also captures two fields (summary and description) from each bug report. Then, a total of eight separate similarities are calculated between these two sets using the BM25 algorithm [12]. These scores are then summed into a suspiciousness score to rank buggy files against a given bug report. BLUiR is the first technique that leverages structured elements from both source code and bug reports to localize bugs using IR. We thus chose it as another baseline technique for our study.
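The eight-way field pairing can be sketched as below. To keep the example short, a simple term-overlap count stands in for BM25; this substitution, the function names, and the toy documents are assumptions made for illustration only.

```python
from collections import Counter

CODE_FIELDS = ("class", "method", "variable", "comment")
REPORT_FIELDS = ("summary", "description")


def overlap_score(query_terms, doc_terms):
    """Stand-in similarity (not BM25): occurrences of query terms in the field."""
    doc_counts = Counter(doc_terms)
    return sum(doc_counts[term] for term in set(query_terms))


def structured_score(report, source_file, similarity=overlap_score):
    """Sum the similarity over all 2 x 4 = 8 (report field, code field) pairs."""
    return sum(
        similarity(report[r_field], source_file[c_field])
        for r_field in REPORT_FIELDS
        for c_field in CODE_FIELDS
    )


# Toy, already-tokenized bug report and source files (hypothetical content).
report = {
    "summary": ["gpu", "model"],
    "description": ["model", "load", "device"],
}
file_a = {
    "class": ["model"],
    "method": ["load", "model"],
    "variable": ["device"],
    "comment": ["load", "model", "to", "gpu"],
}
file_b = {
    "class": ["tokenizer"],
    "method": ["encode"],
    "variable": ["vocab"],
    "comment": ["split", "text"],
}

score_a = structured_score(report, file_a)
score_b = structured_score(report, file_b)
print(score_a, score_b)
```

Because `similarity` is a parameter, a proper BM25 scorer with corpus statistics could be passed in without changing the pairing logic.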

BLIA [23] integrates several items such as textual similarity between bug reports and source documents [11], code structures [12], version control history [14], stack trace analysis [48], and code change analysis in the IR-based bug localization. While bug reports and source code are useful, code change history can also assist in bug localization by providing the changes likely to induce a bug. BLIA has outperformed several previous techniques: BugLocator [11], BLUiR [12], Amalgam [14], BRTracer [48], which makes it suitable as the third baseline technique for our study.

IR-based localization can be adapted to different granularity levels (e.g., method, file). We chose file-level granularity since the selected baselines commonly use this granularity.

DNNLOC [17] uses a hybrid method that incorporates both rVSM [11] from IR and Deep Neural Networks (DNNs) from DL. DNNs establish associations between specific terms in bug reports and the corresponding code tokens and terms in the source files. While DNNs alone do not achieve high accuracy because they lose information through dimensionality reduction in the projection process, the integration with rVSM enhances their capability to correlate bug reports with relevant buggy files. Buggy source documents may not share textual similarities with bug reports, in which case IR-based techniques struggle. Thus, to address the challenges of IR-based techniques (e.g., lexical mismatch problem [25]) and to include non-linearity in relevance estimation, we chose DNNLOC as the final baseline for our study.

Since the original authors’ replication packages were unavailable, we used the publicly available versions to replicate BugLocator [49], BLUiR [49], and DNNLOC [50]. We also carefully adapted BLIA from its original replication package [23] to our datasets. Further experimental details can be found in our replication package [46].

3.3 Performance Evaluation

We use three performance metrics for our study: Top-K accuracy (Top@K), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR). These metrics have been frequently used by the relevant literature [11, 12, 14, 23, 17, 48].

3.3.1 Top@K

Top-K accuracy (Top@K) measures the percentage of bug reports for which at least one of the buggy files is present in the top-K retrieved files. We used K = 1, 5, and 10 for this study.
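As a small illustration of this definition, the sketch below computes Top@K over a set of ranked lists. The data layout (dictionaries keyed by bug ID), the function name, and the toy rankings are assumptions made for this example.

```python
def top_at_k(rankings, ground_truth, k):
    """Percentage of bug reports with at least one buggy file in the top-k results.

    rankings: dict mapping a bug ID to its ranked list of file names.
    ground_truth: dict mapping a bug ID to the set of truly buggy files.
    """
    hits = sum(
        1
        for bug_id, ranked_files in rankings.items()
        if set(ranked_files[:k]) & ground_truth[bug_id]
    )
    return 100.0 * hits / len(rankings)


# Toy rankings for four hypothetical bug reports.
rankings = {
    "bug-1": ["a.py", "b.py", "c.py"],
    "bug-2": ["x.py", "y.py", "z.py"],
    "bug-3": ["m.py", "n.py", "o.py"],
    "bug-4": ["d.py", "e.py", "f.py"],
}
ground_truth = {
    "bug-1": {"a.py"},
    "bug-2": {"z.py"},
    "bug-3": {"q.py"},
    "bug-4": {"e.py", "f.py"},
}

top1 = top_at_k(rankings, ground_truth, 1)
top2 = top_at_k(rankings, ground_truth, 2)
top3 = top_at_k(rankings, ground_truth, 3)
print(top1, top2, top3)
```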

3.3.2 Mean Average Precision

Precision@K is the fraction of buggy source documents among the top-K results of a ranked list. Average Precision (AP) averages the precision values at the positions of the buggy documents within the ranked list for a search query (a.k.a., bug report). Mean Average Precision (MAP) is the mean of the AP values across all queries in a system.

$$\text{AP} = \frac{1}{D}\sum_{k=1}^{D} P_{k}\times\text{buggy}(k) \tag{1}$$

$$\text{MAP} = \frac{1}{|Q|}\sum_{q\in Q}\text{AP}(q) \tag{2}$$

Here, AP represents the Average Precision, and $D$ refers to the number of total results for a query. $k$ represents the position in the ranked list, $P_{k}$ denotes the precision calculated at the $k$ -th position and $\text{buggy}(k)$ determines whether the $k$ -th result in the ranked list is buggy or not.
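Equations 1 and 2 can be sketched in code as follows. By default the sketch follows Eq. 1 as written, dividing by $D$ (the length of the ranked list). Since a common IR convention instead normalizes by the number of buggy documents, the sketch exposes that variant through a hypothetical `normalize_by_buggy` flag; which variant a given tool uses is not assumed here, and the toy rankings are made up.

```python
def average_precision(ranked_files, buggy_files, normalize_by_buggy=False):
    """AP for one query: sum of P_k at buggy positions, divided by a normalizer.

    With normalize_by_buggy=False the normalizer is D, the list length (Eq. 1).
    With normalize_by_buggy=True it is the number of buggy files in the list.
    """
    hits = 0
    precision_sum = 0.0
    for k, name in enumerate(ranked_files, start=1):
        if name in buggy_files:  # buggy(k) = 1
            hits += 1
            precision_sum += hits / k  # P_k: precision at position k
    denominator = hits if normalize_by_buggy else len(ranked_files)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(rankings, ground_truth, **kwargs):
    """MAP: mean of the per-query AP values (Eq. 2)."""
    ap_values = [
        average_precision(ranked, ground_truth[query], **kwargs)
        for query, ranked in rankings.items()
    ]
    return sum(ap_values) / len(ap_values)


# Toy data: two hypothetical queries.
rankings = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y"]}
ground_truth = {"q1": {"a", "c"}, "q2": {"y"}}

ap_q1 = average_precision(rankings["q1"], ground_truth["q1"])
ap_q1_alt = average_precision(rankings["q1"], ground_truth["q1"], normalize_by_buggy=True)
map_value = mean_average_precision(rankings, ground_truth)
print(ap_q1, ap_q1_alt, map_value)
```

For the first query, the buggy files sit at positions 1 and 3, so the precision values 1 and 2/3 are summed before normalization.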

3.3.3 Mean Reciprocal Rank

Mean reciprocal rank (MRR) calculates the average of the reciprocal ranks for a set of queries.

$$\text{MRR}(Q) = \frac{1}{|Q|}\sum_{q\in Q}\frac{1}{\text{firstRank}(q)} \tag{3}$$

where $\text{MRR}(Q)$ represents the Mean Reciprocal Rank for a set of queries $Q$, $|Q|$ represents the total number of queries in the set $Q$, $q\in Q$ represents each query in the set $Q$, and $\text{firstRank}(q)$ represents the rank of the first correctly retrieved buggy document for the query $q$.
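Equation (3) can be sketched in a few lines of Python. The data below is hypothetical. Treating a query with no retrieved buggy file as contributing 0 is an assumption on our part, since the definition above does not cover that case.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], buggy: Set[str]) -> float:
    """1 / firstRank(q), the reciprocal rank of the first buggy file."""
    for rank, name in enumerate(ranked, start=1):
        if name in buggy:
            return 1.0 / rank
    return 0.0  # assumption: no buggy file retrieved contributes 0


def mean_reciprocal_rank(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """MRR as in Eq. (3): mean reciprocal rank over all queries."""
    rrs = [reciprocal_rank(ranked, buggy) for ranked, buggy in queries]
    return sum(rrs) / len(rrs)


# Hypothetical queries: first buggy file at rank 1, at rank 3, and never.
mrr_queries = [
    (["a.py", "b.py", "c.py"], {"a.py"}),
    (["d.py", "e.py", "f.py"], {"f.py"}),
    (["g.py", "h.py", "i.py"], {"z.py"}),
]
print(f"MRR = {mean_reciprocal_rank(mrr_queries):.4f}")
```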

4 Study Findings

4.1 Answering RQ $\mathbf{{}_{1}}$ : How effective are the existing approaches in localizing bugs from deep learning systems?

Comparison of the localization performance: Table 4 compares the performance of our baseline techniques in bug localization between deep learning systems and non-deep learning systems (a.k.a., traditional software systems). We used three different evaluation metrics (Top@K, MRR, and MAP) for our comparative analysis.

We see notable differences in performance between the two types of systems for all four techniques: BugLocator, BLUiR, BLIA, and DNNLOC. In particular, for BugLocator, the differences in MAP and MRR are 31.59% and 29.46%, respectively. For BLUiR, the differences in MAP and MRR are also substantial: 33.25% and 30.24%, respectively. For BLIA, the differences in MAP and MRR are 34.14% and 30.77%, respectively. Finally, DNNLOC achieves the highest performance for both types of systems, but the gap remains large, as the differences in MAP and MRR are 31.43% and 26.25%, respectively. We calculated the performance difference using the PerformanceDiff metric from Wattanakriengkrai et al. [51]. Fig. 2 visualizes the MAP measures, and the differences are clearly visible. Overall, the results show that all four approaches perform worse when localizing bugs in deep learning systems, and the trend is consistent across all three metrics.
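The reported gaps are consistent with a relative difference of the form (NDLSW − DLSW) / NDLSW × 100, computed from the MRR and MAP values in Table 4. The Python sketch below checks this. Note that the formula is inferred from the reported numbers and is an assumption; it is not a restatement of the PerformanceDiff definition in [51].

```python
# (MRR, MAP) values copied from Table 4.
TABLE4 = {
    "BugLocator": {"DLSW": {"MRR": 0.371, "MAP": 0.314},
                   "NDLSW": {"MRR": 0.526, "MAP": 0.459}},
    "BLUiR": {"DLSW": {"MRR": 0.316, "MAP": 0.257},
              "NDLSW": {"MRR": 0.453, "MAP": 0.385}},
    "BLIA": {"DLSW": {"MRR": 0.423, "MAP": 0.355},
             "NDLSW": {"MRR": 0.611, "MAP": 0.539}},
    "DNNLOC": {"DLSW": {"MRR": 0.455, "MAP": 0.408},
               "NDLSW": {"MRR": 0.617, "MAP": 0.595}},
}

# Percentage gaps as reported in the text.
REPORTED = {
    "BugLocator": {"MAP": 31.59, "MRR": 29.46},
    "BLUiR": {"MAP": 33.25, "MRR": 30.24},
    "BLIA": {"MAP": 34.14, "MRR": 30.77},
    "DNNLOC": {"MAP": 31.43, "MRR": 26.25},
}


def relative_gap(ndl: float, dl: float) -> float:
    """Relative drop (in %) from the non-DL score to the DL score (assumed formula)."""
    return (ndl - dl) / ndl * 100.0


gaps = {
    method: {
        metric: relative_gap(scores["NDLSW"][metric], scores["DLSW"][metric])
        for metric in ("MAP", "MRR")
    }
    for method, scores in TABLE4.items()
}

for method, by_metric in gaps.items():
    for metric, gap in by_metric.items():
        print(f"{method:10s} {metric}: computed {gap:.2f}%, "
              f"reported {REPORTED[method][metric]:.2f}%")
```

Under this assumption, the computed values agree with the reported ones up to rounding in the last digit.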

We also perform statistical tests to determine the significance of the performance gap between the two types of systems: deep learning systems and non-deep learning systems (Table 5). We took the Reciprocal Rank (RR) and Average Precision (AP) results of all the samples for each of the four approaches. Then, we performed the Shapiro-Wilk normality test [52], which reported a normal distribution for those metrics. Next, we used appropriate significance and effect size tests to compare the result values from the two types of systems. Since the data were normally distributed, we used the t-test as the parametric test [33]. In all significance tests, the p-values were less than the threshold (0.05) for each of the four approaches. Thus, the null hypothesis can be rejected for all comparisons. In other words, the performances of all baseline techniques significantly differ between the two types of systems.

Table 4: Performance of existing approaches (BugLocator, BLUiR, BLIA, DNNLOC) in bug localization

Method	Top@1	Top@5	Top@10	MRR	MAP
DLSW
BugLocator	0.344	0.547	0.615	0.371	0.314
BLUiR	0.201	0.472	0.585	0.316	0.257
BLIA	0.411	0.609	0.719	0.423	0.355
DNNLOC	0.468	0.682	0.786	0.455	0.408
NDLSW
BugLocator	0.419	0.671	0.794	0.526	0.459
BLUiR	0.311	0.575	0.686	0.453	0.385
BLIA	0.512	0.716	0.820	0.611	0.539
DNNLOC	0.617	0.786	0.855	0.617	0.595

DLSW= Deep Learning Systems, NDLSW=Non-Deep Learning Systems

Table 5: Statistical tests for the performance gap of existing approaches between the deep learning systems (DLSW) and non-deep learning systems (NDLSW)

Method	Metric	Sig. (p-val)	Effect Size
BugLocator	RR	0.00008721**	Medium (0.3834)
	AP	0.00004211**	Medium (0.4023)
BLUiR	RR	0.00395*	Medium (0.2709)
	AP	0.00339*	Medium (0.3154)
BLIA	RR	0.00000734***	Large (0.5012)
	AP	0.00000819***	Large (0.4896)
DNNLOC	RR	0.00000345***	Large (0.5237)
	AP	0.00000389***	Large (0.5110)

Sig.: Significance, p-val: p-value, ***: Large, **: Medium, *: Small

While the significance of a result indicates how probable it is that it is due to chance, the effect size indicates the extent of the difference [53]. Hence, we performed the Cohen’s d effect size test [34], and our analysis found a medium to large effect size for all cases (Table 5). Thus, our results from effect size tests reinforce the above finding from significance tests. In other words, the existing techniques perform significantly worse in localizing bugs from deep learning software systems. Even though our findings above match natural intuition, we performed extensive experiments using four different baselines, which resulted in strong empirical evidence. Thus, our findings not only reinforce the existing understanding and belief about the challenges of bugs in deep learning systems but also substantiate them with solid empirical evidence and demonstrate the performance gap of existing solutions in localizing the two categories of bugs.
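To make the effect size computation concrete, the following Python sketch computes Cohen’s d with a pooled standard deviation, together with the Student’s t statistic, for two groups of per-query scores. The pooled-variance form is a common textbook choice and an assumption here, since the exact variant is not stated above. The RR values are hypothetical. For the normality test and p-values, a statistics library (e.g., `scipy.stats.shapiro` and `scipy.stats.ttest_ind`) could be used instead.

```python
from math import sqrt
from statistics import mean, variance
from typing import Sequence


def pooled_variance(a: Sequence[float], b: Sequence[float]) -> float:
    """Pooled sample variance of two independent groups."""
    na, nb = len(a), len(b)
    return ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)


def cohens_d(a: Sequence[float], b: Sequence[float]) -> float:
    """Cohen's d: mean difference in units of the pooled standard deviation."""
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b))


def student_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample t statistic assuming equal variances."""
    na, nb = len(a), len(b)
    return (mean(a) - mean(b)) / sqrt(pooled_variance(a, b) * (1 / na + 1 / nb))


# Hypothetical per-query reciprocal ranks (not taken from the study).
ndl_rr = [1.0, 0.5, 1.0, 0.33, 0.5, 1.0]
dl_rr = [0.5, 0.25, 0.2, 1.0, 0.1, 0.33]

print(f"Cohen's d   = {cohens_d(ndl_rr, dl_rr):.3f}")
print(f"t statistic = {student_t(ndl_rr, dl_rr):.3f}")
```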

Comparison among the categories of deep learning systems: Our dataset, Denchmark, consists of deep learning systems from various classes, including frameworks, libraries, and tools. We focus on examining if there are any differences in bug localization performance across these system classes. Thus, we employed four existing techniques (BugLocator, BLUiR, BLIA, DNNLOC) to evaluate their bug localization performance for each class.

Table 6: Example of a bug report from deep learning framework [54]

Framework Bug (Bug ID: 10224)

Title

Language model example cannot be run

Description

The language model example cannot be run without manually

creating a data folder. There are also inconsistencies between the

documentation and the code. Optional argument – data DATA

location of the data corpus does not appear in the code train.py

…

Detailed BR: https://github.com/apache/mxnet/issues/10224

A deep learning framework is a software platform that provides the environment for designing, training, and deploying deep learning models [55]. Examples include TensorFlow (https://www.tensorflow.org/), PyTorch (https://pytorch.org/), and Apache MXNet (https://mxnet.apache.org/versions/1.9.1/). These frameworks come with pre-defined modules and functions and offer a structured way to implement deep learning architectures using high-level programming interfaces [56]. The example bug in Table 6 [54] is characterized by the inability to run a language model without manually creating a data folder. It represents a framework bug because it directly impacts the core functionalities of the framework, specifically the design and training of models. Frameworks are expected to provide seamless, user-friendly environments for developing deep learning models, and issues that hinder the ease of use, such as documentation inaccuracies and additional manual setup steps, are indicative of problems at the framework level.

Table 7: Example of a bug from deep learning library [39]

Library Bug (Bug ID: 313)

Title

A bug in GPT2Tokenizer

Description

GPT2Tokenizer fails to recover a sentence

"BART is a seq2seq model."

with encoded ids of it.

The output sentence is "BART is a seqseq model.".

It should be related to numbers’ processing.

…

URL: https://github.com/tanyuqian/texar-pytorch/blob/master/examples/bart/gpt2_tokenizer_bug.py

Detailed BR: https://github.com/asyml/texar-pytorch/issues/313

A deep learning library is a collection of functions that facilitate specific tasks within deep learning. Libraries like Keras (https://keras.io/) and cuDNN (https://developer.nvidia.com/blog/tag/cudnn/) can either be integrated into frameworks or operate independently [57]. From Table 7 [39], the bug in the ‘GPT2Tokenizer’ within the texar-pytorch project is classified as a library bug due to its specific component focus and the nature of its functionality. It uses the TokenizerBase class from the texar library, designed to work with the PyTorch framework. The tokenizer’s failure to accurately process text data leads to the bug, which indicates a library issue.

Table 8: Example of a bug from deep learning tool [58]

Tool Bug (Bug ID: 5596)

Title

Tensorboard can not load all Hyperparameters keys

Description

Encountered an issue in TensorBoard where it couldn’t load

all hyperparameter keys when I used writer.add_hparams with

different hparam_dict parameters in multiple experiments. This

problem made it difficult to track and compare different

hyperparameter settings across these experiments, affecting the

overall functionality of TensorBoard. I provided code snippets and

responses to showcase the inconsistency in the display of

hyperparameters in TensorBoard’s interface. This issue

is significant as it hinders the effective use of TensorBoard

for experiment tracking and analysis.

…

Detailed BR: https://github.com/tensorflow/tensorboard/issues/5596

Finally, deep learning tools refer to utilities that assist with the tasks related to deep learning, such as visualization or model optimization. An example would be TensorBoard (https://www.tensorflow.org/tensorboard), which is often used for TensorFlow visualization [59]. Frameworks, libraries, and tools each play unique but complementary roles in the context of deep learning. From Table 8 [58], the bug in TensorBoard qualifies as a tool bug since TensorBoard is a visualization toolkit in the TensorFlow ecosystem. The bug refers to the inability of TensorBoard to load all hyperparameter keys when writer.add_hparams is used with varying hparam_dict parameters, directly impacting its core functionality as a tool. This feature is essential for monitoring and contrasting various experimental settings in TensorBoard.

Table 9: Comparative analysis of bug localization techniques across various deep learning project classes

Project Class	Method	MRR	MAP
Framework	BugLocator	0.404	0.334
	BLUiR	0.205	0.155
	BLIA	0.466	0.392
	DNNLOC	0.493	0.412
Library	BugLocator	0.559	0.453
	BLUiR	0.632	0.539
	BLIA	0.583	0.486
	DNNLOC	0.656	0.597
Tool	BugLocator	0.499	0.407
	BLUiR	0.469	0.387
	BLIA	0.546	0.449
	DNNLOC	0.592	0.514

Table 10: Example of a bug from deep learning library for bug localization [60]

Library Bug (Bug ID: 87085)

Title

gradcheck failure with sparse matrix multiplication

Description

Sparse @ dense matrix multiplication fails to pass gradcheck.

A manual inspection of the gradient seems to indicate that

this is a bug with gradcheck rather than the matrix multiplication itself.

Steps to reproduce: …

Detailed BR: https://github.com/pytorch/pytorch/issues/87085

According to our experimental results in Table 9, existing techniques perform better in localizing bugs from deep learning libraries than from frameworks and tools. This could be attributed to the modular design of libraries, which are aimed at handling specific tasks independently [57]. In the example bug report (Table 10 [60]), the autograd module of the Torch.Library (https://pytorch.org/docs/stable/library.html) plays a crucial role in calculating gradients across all tensor operations, whereas autograd.gradcheck (https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.html) is a utility function for verifying the accuracy of these computed gradients. Here, a specific error concerning gradient computation in gradcheck was reported during sparse matrix multiplication. Such problems can be localized more efficiently when they occur within a clearly defined module like autograd.gradcheck. In this context, DNNLOC identified the buggy file at the Top@1 position, while BLUiR located the correct file at the 3rd position. BLUiR’s performance suggests that the modular architecture of libraries could facilitate bug localization techniques that leverage structural analysis. On the other hand, DNNLOC’s success in accurately locating the bug can be linked to its ability to capture complex patterns using deep learning, which works effectively with the modular design of libraries.

Table 11: Example of a bug from deep learning framework for bug localization [61]

Framework Bug (Bug ID: 61297)

Title

CTC Loss errors on TPU

Description

The Keras Model with LSTM + CTC loss runs normally on GPU,

but VM TPU prints grappler errors. The errors

usually don’t stop the execution, but the loss is nan.

It sometimes also crashes with core dump, but it’s not consistent.

I created a public Kaggle notebook with the code producing the issue:

https://www.kaggle.com/code/shaironen/ctc-example/notebook

The grappler errors are:

….

In addition, I read it’s recommended to use

model.compile(jit_compile=True) while on GPU to

pre-diagnose TPU issues. It gives similar errors and terminates.

(with jit_compile=False, it runs normally on GPU only).

According to tf.nn.ctc_loss documentation, it should work on tpu.

Detailed BR: https://github.com/tensorflow/tensorflow/issues/61297

In contrast, we notice from Fig. 3 that the performance of bug localization techniques dropped for framework bugs, especially with the BLUiR method. Frameworks consist of a broad architecture that guides the flow of control, involving multiple layers and components that interact in complex ways [55, 10]. The example bug in Table 11 [61], which deals with CTC loss in a Keras model with LSTM on a Tensor Processing Unit (TPU), highlights a framework-level issue. For this example bug, DNNLOC retrieved the correct buggy file at the 38th position. Despite DNNLOC’s strength in handling complex relationships, this low rank demonstrates a shortcoming and suggests the need to better address the complexities within deep learning frameworks.

BugLocator retrieved the correct buggy file at the 65th position, indicating its limitations in going beyond textual similarities between framework code and bug reports. Finally, BLUiR emerges as the least effective, ranking the correct file at the 93rd position, which highlights its insufficiency in localizing framework-level bugs. BLUiR may struggle to match the bug report keywords to the correct code segment since the framework’s vast codebase introduces a high degree of variance, making bug localization challenging.

Table 12: Example of a bug from deep learning tool for bug localization [62]

Tool Bug (Bug ID: 5948)

Title

TB.dev HParams dashboard shows floating point metrics incorrectly

Description

TensorBoard renders incorrectly locally when uploading

an experiment to tensorboard.dev some of the floating point values

are mangled, and zero values are shown as +/-Infinity

….

Detailed BR: https://github.com/tensorflow/tensorboard/issues/5948

Lastly, the performance in bug localization slightly improved for tools compared to frameworks, but it was not as good as for libraries. This improvement could be attributed to the fact that tools (e.g., TensorBoard) are more independent and have fewer dependencies than an entire framework (e.g., TensorFlow). While tools like TensorBoard focus on specialized tasks like visualization, they often integrate with complex frameworks like TensorFlow. The example bug in Table 12 refers to an incorrect rendering of floating point metrics in the TensorBoard HParams dashboard. It can be classified as a tool bug, as TensorBoard is a visualization utility within the TensorFlow ecosystem. In this instance, DNNLOC identified the correct buggy file at the 27th position. BugLocator located the correct file at the 40th position, which highlights its limitations in localizing tool bugs that warrant more than textual analysis. Meanwhile, BLUiR identified the correct file at the 52nd position, reflecting its struggle to effectively match textual descriptions in bug reports with the specific code segments responsible for visual aspects.

In short, our findings indicate that the modular design of libraries [59] facilitates better bug localization, whereas the complexities inherent in frameworks and tools hurt the bug localization performance of the existing techniques.

\MakeFramed\FrameRestore

Summary of RQ $\mathbf{{}_{1}}$ : We evaluate the performance of four existing techniques in localizing bugs from deep learning systems and non-deep learning systems using three evaluation metrics. Our findings show that all four approaches perform significantly worse (e.g., 34.14% less MAP for BLIA) when localizing bugs from deep learning systems. We also compare their performance when localizing bugs from deep learning frameworks, libraries, and tools. We found that localizing bugs from frameworks is most challenging due to the complex interaction of their components. \endMakeFramed

4.2 Answering RQ $\mathbf{{}_{2}}$ : How do different types of bugs in deep learning systems impact bug localization?

In this research question, we investigate the characteristics and localization challenges of different types of bugs in deep learning systems through manual analysis. First, we employ stratified random sampling to construct a sample that represents a balanced presence of instances from each bug type [63]. We select 385 bugs from both of the subsets of our datasets (BugGL, Denchmark from Table 3), which corresponds to a 95% confidence level and a 5% error margin.
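The sample size of 385 is consistent with the standard sample size formula for estimating a proportion, $n = z^{2}p(1-p)/e^{2}$, with a 95% confidence level, a 5% error margin, and the most conservative proportion $p = 0.5$. The Python sketch below illustrates this arithmetic. The choice of this formula (without a finite population correction) is our assumption, since the exact formula is not stated above.

```python
from math import ceil
from statistics import NormalDist


def sample_size(confidence: float = 0.95, margin: float = 0.05,
                p: float = 0.5) -> int:
    """Sample size for estimating a proportion (large-population formula)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z * z * p * (1 - p) / margin ** 2)


print(sample_size())  # 95% confidence level, 5% error margin
```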

We performed our manual analysis using 385 bug reports from the Denchmark dataset and classified them at two levels. First, two annotators manually labeled the bugs from the deep learning systems as deep learning-related (DL) bugs or non-deep learning-related (NDL) bugs. Second, we labeled the DL bugs into five categories (Model, Training, Tensor, API, and GPU) based on the existing taxonomy of Humbatova et al. [30]. We also analyzed the bug reports, associated developers’ discussions, and bug-fix code changes as a part of the labeling. Two authors of this work labeled the sample dataset separately and achieved a Cohen’s kappa [64] of 0.80, which indicates a substantial agreement between the authors. Our manual analysis took $\approx$ 55 hours for each author and was documented in an Excel sheet, which is provided in our replication package [46].
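For readers unfamiliar with Cohen’s kappa, the following Python sketch shows how the agreement between two annotators can be computed from their labels. The labels below are hypothetical and do not come from our annotation sheet.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the marginal label frequencies of each annotator.
    expected = sum(
        freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()
    ) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for ten bugs.
annotator_1 = ["Model", "Training", "Training", "API", "GPU",
               "Tensor", "Training", "Model", "API", "Training"]
annotator_2 = ["Model", "Training", "Training", "Model", "GPU",
               "Tensor", "Training", "Model", "API", "Training"]

print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.3f}")
```

The sketch does not handle the degenerate case where the chance agreement equals 1 (e.g., both annotators use a single label throughout).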

Prevalence ratio of deep learning-related bugs: We found that 64.80% of the bugs from deep learning systems are related to deep learning algorithms (a.k.a DL bugs). That is, they are related to inputs, data, or training of deep learning models, underlying API endpoints, and computational resources. In particular, we found 27.30% training bugs, 13.30% model bugs, 5.50% tensor bugs, 14.30% API bugs, and 4.40% GPU bugs (Fig. 4). Such a distribution informs us where the debugging efforts should be concentrated. Our findings also indicate that the majority of DL bugs are related to the training process. Training is a crucial step in deep learning that involves large amounts of data, complex learning algorithms, and optimization techniques, making it more susceptible to bugs and failures.

Prevalence ratio of non-deep learning-related bugs: We found that 35.20% of the bugs from deep learning systems are not related to deep learning algorithms (a.k.a NDL bugs). These bugs do not directly affect the functionality of the deep learning model, but they still lead to unexpected, erroneous behaviors in a software system. Bug 1426 in Table 1 is an NDL bug, which occurs when the tests from the CI pipeline are spread across multiple Windows machines. Although it is not directly connected to the deep learning module, it originated from the PyTorch-Ignite project, which is indeed a deep learning system.

Table 13: Experimental result of existing bug localization techniques (BugLocator, BLUiR, BLIA, DNNLOC) of each category of bugs in deep learning systems

Method	Model	Training	Tensor	API	GPU
MAP
BugLocator	0.368	0.312	0.235	0.446	0.183
BLUiR	0.293	0.403	0.601	0.258	0.266
BLIA	0.357	0.345	0.448	0.395	0.290
DNNLOC	0.593	0.411	0.619	0.493	0.296
MRR
BugLocator	0.532	0.386	0.358	0.378	0.223
BLUiR	0.387	0.427	0.621	0.424	0.311
BLIA	0.472	0.419	0.553	0.447	0.327
DNNLOC	0.585	0.467	0.682	0.548	0.336

MAP=Mean Average Precision, MRR=Mean Reciprocal Rank

Localization of bugs in deep learning systems: To gain a deeper understanding of the challenges in localizing deep learning bugs, we further analyze our results from RQ ${}_{1}$ . To do this, we randomly sample 100 bugs from each category and analyze the performance of our baselines. To ensure a fair performance comparison of the bug localization techniques across the types of DL bugs, we selected an equal number of bugs for each type using the principle of randomization [63] to avoid any bias. We repeated this process three times with a different sample each time to ensure the robustness of our findings. We then averaged the results from the three evaluations and presented them in Table 13. We conducted an in-depth analysis of each type of DL bug to understand their inherent challenges and how they impact the overall performance of the bug localization techniques. Our analysis with examples is provided as follows.
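The repeated sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not our evaluation script: the function name `averaged_map`, the seed, and the per-bug AP values are all hypothetical, and the AP values are generated randomly only to make the example runnable.

```python
import random
from statistics import mean
from typing import Dict, List


def averaged_map(ap_by_category: Dict[str, List[float]], sample_size: int,
                 rounds: int = 3, seed: int = 0) -> Dict[str, float]:
    """Average MAP over several equal-sized random samples per bug category."""
    rng = random.Random(seed)
    averaged = {}
    for category, ap_values in ap_by_category.items():
        # Draw a fresh random sample in each round and compute its MAP.
        round_maps = [
            mean(rng.sample(ap_values, sample_size)) for _ in range(rounds)
        ]
        averaged[category] = mean(round_maps)
    return averaged


# Hypothetical per-bug AP values for each category (randomly generated).
data_rng = random.Random(42)
ap_by_category = {
    category: [data_rng.random() for _ in range(250)]
    for category in ("Model", "Training", "Tensor", "API", "GPU")
}

result = averaged_map(ap_by_category, sample_size=100)
for category, value in result.items():
    print(f"{category:8s} MAP = {value:.3f}")
```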

Table 14: Example of a model bug [39]

Model Bug (Bug ID: 313)

Title

A bug in GPT2Tokenizer

Description

GPT2Tokenizer fails to recover a sentence

"BART is a seq2seq model."

with encoded ids of it.

The output sentence is "BART is a seqseq model.".

It should be related to numbers’ processing.

A script to show the bug is here:

https://github.com/tanyuqian/texar-pytorch/blob/master/examples/bart/gpt2_tokenizer_bug.py

Detailed BR: https://github.com/asyml/texar-pytorch/issues/313

Model bugs: From Table 13, we see that DNNLOC performs the best for model bugs, which could be attributed to its ability to capture complex patterns using DNN and textual relevance from the bug reports and source code using rVSM. Model bugs are often connected to a model’s type, properties, and layers. According to our observation, the texts describing model-related issues in bug reports have significant vocabulary overlap with the implementation of a deep learning model.

Table 14 [39] shows an example bug report that discusses a model bug from the CASL.ai project. Fig. 8 [39] (Appendix A) shows the code snippet responsible for the bug. The bug lies in the _bpe method of GPT2Tokenizer, where incorrect character merging during byte pair encoding leads to faulty tokenization. Since this impacts the functionality of the GPT2 language model, it can be considered a model bug.

BugLocator incorrectly retrieves ‘SentencePieceTokenizer.py’ (Fig. 9 in Appendix A) as the Top@1 result. It relies on lexical overlap between bug reports and source code. We found four main keywords from the bug report (Table 14), namely GPT2Tokenizer, recover, seq2seq, and model, that overlap with this incorrect file. This lexical similarity could have led to the incorrect localization. Interestingly, BugLocator retrieved the ground truth file (Fig. 8 in Appendix A) at the 8th position. BLIA retrieves the same ground truth file at the 17th position, which is less than ideal.

On the other hand, the BLUiR approach retrieves the ground truth code at a much lower rank (131st position). Thus, the similarity between bug reports and source code elements might not be sufficient for locating model bugs. In the example above (Table 14), the bug is triggered when tokenizing the following sentence: ‘BART is a seq2seq model.’. BLUiR incorrectly retrieved the source code file with the Seq2Seq class at the Top@1 position (Fig. 10 in Appendix A). One possible explanation could be that several important keywords from the bug report (e.g., ‘seq2seq’, ‘encode’, and ‘model’) align with similar code elements (e.g., the Seq2Seq class) in this false-positive file. Such a misalignment occurs due to BLUiR’s heavy reliance on structural elements (e.g., class, method, variable) and its limited attention to the semantic aspect of the bug report.

Finally, the DNNLOC approach retrieves the buggy file correctly at the Top@1 position for the model bug (Table 14). Unlike IR-based methods, DNNLOC’s neural networks can capture deeper semantic links that go beyond lexical similarities. For instance, it has a better chance to semantically connect terms such as ‘seq2seq’ and ‘tokenizer’ in the bug report to their corresponding code-level implementations in GPT2Tokenizer despite their textual mismatch. Thus, DNNLOC’s capability to leverage non-linear relationships might have helped it localize model bugs better in deep learning systems.

Table 15: Example of a training bug [40]

Training Bug (Bug ID: 3048)

Title

Gradient Accumulation + Mixed Precision shows artificially high training loss

Description

OB: The bug occurs when Gradient Accumulation

and the MixedPrecision Callback are both used.

Gradient Accumulation runs before Mixed Precision

and causes the after_backwards to not be run,

meaning that the loss is not unscaled before it is logged.

This means that very large losses, such as 6000000+, are to be logged.

S2R: seed=random.randint(0,2**32-1)

with no_random(seed):

db=synth_dbunch(bs=8,n_train=1,n_valid=1,cuda=True)

learn = synth_learner(data=db)

learn.fit(1, lr=0.01)

#start without gradient overflow

max_loss_scale=2048.0

with no_random(seed):

db=synth_dbunch(bs=1,n_train=8,n_valid=8,cuda=True)

learn = synth_learner(data=db,cbs=[GradientAccumulation(n_acc=8)])

learn.to_fp16(max_loss_scale=max_loss_scale)

learn.fit(1, lr=0.01)

The training loss will be very high, 5000+ for fp16.

fp32 will be reasonable

EB: Similar training loss

between the fp32 and fp16 versions. <2 difference in loss.

Detailed BR: https://github.com/fastai/fastai/issues/3048

EB=Expected Behaviour, S2R=Steps to Reproduce, OB=Observed Behaviour

Training bugs: All four approaches perform poorly in localizing training bugs, as shown in Table 13. DNNLOC and BLUiR perform comparatively better than the other techniques. Table 15 [40] shows a training bug in the fast.ai project, where the combined usage of Gradient Accumulation and the MixedPrecision Callback (Fig. 11 in Appendix A) leads to an improperly scaled and artificially high training loss value. DNNLOC retrieved the ground truth file at the 4th position for this bug, whereas BLUiR retrieved it at the 7th position. Although DNNLOC and BLUiR demonstrate comparable performance in locating training bugs, they still perform poorly overall (e.g., MAP of 40.3% for BLUiR and 41.1% for DNNLOC). Moreover, BugLocator performs the worst (e.g., 23rd position for the example bug in Table 15) in localizing the training bugs (Fig. 12 in Appendix A), which suggests that similarity analysis between bug reports and source code might not be sufficient for identifying these bugs. Training bugs, which occur during the model’s training phase and might involve aspects such as an improperly defined loss function, are more conceptual. These bugs might not be easily located using the existing methods and might require techniques that can offer a deeper insight into the model’s architecture and training process.

Table 16: Example of a tensor bug [41]

Tensor Bug (Bug ID: 13760)

Title

nd.slice does not return empty tensor when begin=end

Description

OB: For mxnet.ndarray.slice(data, begin, end),

if begin=end, it does not return an empty tensor.

Instead, it returns a tensor with the same shape as the data.

Environment info:

……

S2R: import mxnet.ndarray as nd

a = nd.normal(shape=(4, 3))

nd.slice(a, begin=0, end=0)

nd.slice(a, begin=2, end=2)

Detailed BR: https://github.com/apache/mxnet/issues/13760

BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour

Tensor bugs: Tensors are central to deep learning and often involve intricate dimensional and mathematical issues. From Fig. 5, we notice that DNNLOC and BLUiR are more effective in localizing tensor bugs than the other techniques (e.g., BugLocator, BLIA). In the example bug from Table 16 [41], the bug report describes an issue with the ‘nd.slice’ function in MXNet, which should return an empty tensor when the ‘begin’ and ‘end’ parameters are equal. Instead, the function returns a tensor with the same shape as the data. For this example tensor bug, both DNNLOC and BLUiR retrieved the correct buggy file at the top position. Interestingly, by parsing the AST of the code, BLUiR identified the relevant code snippet in the test_slice() function, which shares significant keyword overlap with the bug report. BLUiR determines the relevance of low-level code elements (e.g., class names, method names) against a bug report, which helps to reduce noise in the code segments where tensors could be handled or manipulated.

On the other hand, BLIA retrieved the ground truth file at the 12th position. However, our analysis revealed that incorporating the stack trace information negatively impacted its bug localization performance. Tensor bugs are typically related to data manipulation and computations [30] rather than the code execution flow or call stack [65] that can be found in stack traces. By excluding the stack trace from the BLIA approach, we were able to improve the ranking of the buggy file from the 12th position to the 9th position.

Lastly, BugLocator’s difficulty in locating tensor bugs is demonstrated by its retrieval of the ground truth file at the 47th position (Fig. 13 in Appendix A) and an incorrect file at Top@1 (Fig. 14 in Appendix A). We found that BugLocator’s heavy reliance on textual similarity and the overlap of trivial words (e.g., environment, system, and hardware) with the incorrect file led to the incorrect ranking.

Table 17: Example of an API bug [42]

API Bug (Bug ID: 13862)

Title

[1.4.0] unravel_index no longer works with

magic ’-1’ in shape parameter as in 1.3.1

Description

OB: The unravel_index op seems to no longer correctly

work with ’magic’ shape values, such as ’-1’s.

The following example still works with mxnet 1.3.1,

but does not on the latest master

(it returns all zeros in the result

without throwing an error) or 1.4.0.

We have a use case for this in Sockeye.

Environment info (Required):

…

S2R: Input data taken from Sockeye unit tests.

x = mx.nd.array([335, 620, 593, 219, 36], dtype=’int32’)

mx.nd.unravel_index(x, shape=(-1, 200))

With mxnet==1.5.0b20190111, the result is incorrect:

With mxnet==1.3.1, the result is correct:

However, if the shape parameter is fully specified

(shape=(5,200)), mxnet==1.5.0b20190111

returns the correct values.

Detailed BR: https://github.com/apache/mxnet/issues/13862

BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour

API bugs: From Fig. 5, we observe that DNNLOC and BugLocator perform well in localizing API bugs, whereas BLUiR is the least effective. This could be due to BLUiR’s heavy reliance on structural information from the source code, which might not capture the specifics of API bugs (e.g., incorrect API calls). Another factor could be the rapid evolution of deep learning APIs, which affects versioning and compatibility [66]. Due to frequent structural changes in the code, the technique might be less effective in assessing the relevance between code and bug reports.

The bug from Table 17 [42] involves the unravel_index function in MXNet, whose behavior incorrectly varies with certain input parameters across different versions (Fig. 15 in Appendix A). DNNLOC located the correct buggy code for the example bug (Table 17) at the 2nd position, while BugLocator ranked it at the 5th position. DNNLOC might be effective for this API bug because textual similarity (through rVSM) matches API-specific terms in bug reports with source code, while neural networks (through DNN) can capture complex, non-obvious relationships in the API’s usage and functionality. Moreover, our findings indicate that relying on either element (rVSM, DNN) in isolation is less effective, resulting in a noticeable drop in performance.

On the other hand, BLUiR retrieved the buggy file at the 38th position and incorrectly ranked the file containing the NDArray class at the top. In the case of API bugs, the class file may not always be relevant. This is because API bugs frequently arise at the interface between different layers of abstraction [67]. Such bugs could be attributed to the interaction of a higher-level function (e.g., unravel_index from Table 17) with lower-level components rather than an issue within the code of the function itself. Meanwhile, BLIA placed the buggy file at the 7th position without the stack traces but improved to the 5th position when stack trace information was included. This indicates that stack trace information, which highlights execution flow and function calls, is beneficial for locating API bugs.

Table 18: Example of a GPU bug [43]

GPU Bug (Bug ID: 1238)

Title: How to use multiple GPUs?

Description: I want to use a single machine with multiple GPUs for training, but it has no actual effect.

OB: Only one single GPU is doing all the computations, the other three remain idle. When following @FontTian and inserting distribution_strategy=strat into the initialization of the image classifier, the same error RuntimeError: Too many failed attempts to build the model occurs. The same happens when adding tuner=’random’ to ak.ImageClassifier. As suggested by @haifeng-jin, I ran a basic KerasTuner example on 4 GPUs which worked just fine. Furthermore, in #440 (comment), I read that the clear_session() before every run might wipe out the GPU configuration. Removing this line from the code did not change anything with respect to the errors/problems stated above. I am specifying 4 GPUs (out of 8) to train the current model in a distributed fashion, using tf.distribute.MirroredStrategy( ) since tf.keras.utils.multi_gpu_model( ) is deprecated and removed since April 2020.

S2R: def make_model(ckpt_path, max_try = 1):
……
run_search(checkpoint, max_try = 3)

Detailed BR: https://github.com/keras-team/autokeras/issues/1238

BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour

GPU bugs: Our investigation reveals that GPU bugs are the most difficult to localize for all four existing approaches. One possible explanation could be the complex nature of GPU bugs, as they can be triggered by a variety of factors, such as the compatibility between hardware (e.g., GPU device) and software (e.g., PyTorch). Such a bug might not even be located in the source code (a.k.a. extrinsic GPU bug) [30]. From Table 20, we notice that only 17.65% of the GPU bugs can be found in the source code (a.k.a. intrinsic GPU bugs). Examples of intrinsic GPU bugs that we found include a wrong reference to a GPU device, failed parallelism, incorrect state sharing between subprocesses, and faulty data transfer to a GPU device.

Table 18 [43] shows a GPU bug triggered by the codebase (a.k.a. intrinsic). The bug is connected to the use of multiple GPUs during training. According to the report, the machine contains multiple GPU devices, but only one GPU is used during computation. All four bug localization techniques performed poorly in locating the actual buggy code for this GPU bug. BLUiR retrieved the buggy file at the 50th position, BugLocator at the 65th, BLIA at the 47th, and DNNLOC at the 43rd position in their respective rankings.

We observed almost no keyword overlap and no structural similarity between the bug report (Table 18) and the actual buggy code (Fig. 16 in Appendix A). This lack of both textual and code-wise similarity makes it challenging to identify the buggy code for the GPU bug, particularly for BugLocator. Additionally, BLUiR’s analysis of smaller code segment similarity also proves ineffective for GPU bugs, as it solely concentrates on source code and overlooks the hardware-software interactions and runtime specifics crucial for comprehending such bugs. Similarly, BLIA’s integration of stack trace information, commit history, or version control history fails to encapsulate the unique aspects of GPU bugs. Moreover, DNNLOC’s capability to capture complex non-linear relationships through deep neural networks did not help to locate GPU bugs. One reason might be that these bugs often involve complex hardware-software dynamics that are not typically addressed by standard source code analysis or within the codebase’s non-linear mappings.

NDL bugs: We found that 35.20% of bugs in deep learning applications are not directly related to deep learning, but they impact system behavior (e.g., failed CI build due to GPU compatibility issues). These bugs are known as Non-Deep Learning-related (NDL) bugs in deep-learning systems. From Table 19, we find that the performance of existing techniques in localizing these bugs is also poor. DNNLOC outperforms the other techniques, whereas BLUiR performs the worst in locating NDL bugs. We observed that NDL bugs are less complex than their deep-learning counterparts. However, they are more prone to be extrinsic than traditional bugs (48.15% from Table 20); we provide the details about extrinsic bugs in RQ$_{3}$. Since existing baseline techniques focus on code-level artifacts only, they might fall short in detecting these bugs from deep learning systems.

Overall, there exists a significant variation in the performance of existing approaches when localizing various types of deep-learning bugs. Tensor and API bugs are the easiest, whereas GPU bugs are the most difficult to localize using the selected baseline techniques.

Table 19: Experimental result of existing bug localization techniques (BugLocator, BLUiR, BLIA, DNNLOC) for NDL Bugs in deep learning systems

Method	MAP	MRR
BugLocator	0.362	0.417
BLUiR	0.292	0.334
BLIA	0.437	0.381
DNNLOC	0.588	0.439

Bug report quality for bugs in deep learning systems: Our analysis showed that bug reports from deep learning systems contain code snippets more often (83.11%) than those from traditional software systems (33.24%). Unfortunately, that does not help much in bug localization, as code snippets alone might not be sufficient. Deep learning bugs often involve intricate dependencies that extend beyond specific code components (e.g., training data bugs and GPU bugs). Complex bugs (e.g., gradient instability during training) warrant a deeper understanding of the model architecture, its dynamic behavior, and training processes, which the code snippets may not always capture.

Summary of RQ$_{2}$: We found that 64.80% of bugs in deep learning systems (DLSW) are related to deep learning algorithms, whereas the remaining bugs are not related to deep learning. Our analysis shows that Tensor bugs and API bugs are easier to localize than model and training bugs. However, GPU bugs are the most difficult to localize for each of the four approaches. Thus, our results not only inform the distribution of DL bugs but also highlight their localization challenges through extensive experiments.

4.3 Answering RQ$_{3}$: What are the implications of extrinsic bugs in deep learning systems for bug localization?

Most of the traditional bug localization techniques rely on the similarity between bug reports and source code [11, 6, 12, 17, 14, 13, 23, 48]. However, if a bug is of extrinsic nature (e.g., originates from the operating system), simply relying on source code may not be effective for its localization.

To investigate the impact of extrinsic bugs in deep learning systems, we performed another manual analysis using the same sample datasets from RQ$_{2}$ (385 bugs from DLSW and 385 bugs from NDLSW). We manually labeled them as extrinsic or intrinsic bugs based on the heuristics of Rodriguez-Perez et al. [36]. Two authors of the study analyzed the bug reports and associated discussions carefully, consulted the heuristics, and then labeled the sample dataset separately. The labeling achieved a Cohen’s kappa [64] of 0.87, which indicates strong agreement between the authors. This manual analysis was documented using an Excel sheet, which can be found in our replication package [46]. Each author spent a total of $\approx$ 20 hours on the analysis.
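The agreement statistic mentioned above can be illustrated with a minimal sketch of Cohen's kappa for two raters. The labels below are hypothetical and are not taken from the study's data; the function name `cohens_kappa` is also only illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items that received identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten bug reports ("ext" = extrinsic, "int" = intrinsic).
rater_1 = ["ext", "int", "int", "ext", "int", "int", "ext", "int", "int", "int"]
rater_2 = ["ext", "int", "int", "ext", "int", "ext", "ext", "int", "int", "int"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which matters here because intrinsic bugs dominate most samples and two raters would often agree even when labeling at random.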

Prevalence ratio of extrinsic & intrinsic bugs: We found that 40.00% of the 385 sampled bug reports from deep learning systems (Denchmark dataset) describe extrinsic bugs. The notion of extrinsic bugs is relatively new, especially in the case of bugs from deep learning systems. For a better comparison, we also manually inspected 385 bugs from non-deep learning systems (BugGL dataset) and determined the prevalence ratio of extrinsic and intrinsic bugs. We found only 10.65% extrinsic bugs in non-deep learning systems. Thus, deep learning systems contain almost four times more extrinsic bugs (Fig. 6) than non-deep learning systems.

Prevalence ratio of extrinsic & intrinsic bugs from deep learning systems: We randomly selected 100 samples for each type of bug from deep learning systems (same subsets from RQ$_{2}$) and determined the prevalence ratio of extrinsic and intrinsic bugs for each type. Table 20 shows the results of our manual analysis for different bug categories in deep learning systems in terms of extrinsic and intrinsic bugs. We see that the prevalence ratios of deep learning-related extrinsic bugs range from 21.90% to 82.35%, whereas for non-deep learning-related bugs, the prevalence ratio is 48.15%. This suggests that the likelihood of a bug being extrinsic varies considerably across the deep learning components of a software system, with GPU-related components being the most prone to extrinsic bugs.

Table 20: Prevalence ratio of extrinsic and intrinsic bugs in deep learning systems

Type		Extrinsic (%)	Intrinsic (%)
NDL		48.15	51.85
DL	Model	35.29	64.71
	Training	21.90	78.10
	Tensor	38.10	61.90
	API	38.19	61.81
	GPU	82.35	17.65

Localization of extrinsic & intrinsic bugs from both systems: To further analyze the impact of extrinsic bugs on bug localization, we experimented with our baseline techniques from RQ$_{1}$ on extrinsic and intrinsic bugs separately. We chose 100 random bugs to evaluate the performance of all four techniques in bug localization from each category of both benchmark datasets. We repeated the evaluation three times using three different random subsets and then calculated the average result for a fair comparison.

From Table 21, we notice that DNNLOC performs slightly better than other techniques in localizing extrinsic bugs for both systems. However, it shows a clear performance gap between extrinsic and intrinsic bugs. The technique is less effective with extrinsic bugs, particularly in DLSW. This suggests a shortcoming in handling external complexities despite being able to capture non-linear relationships between bug reports and source code. We notice that BugLocator’s localization performance for extrinsic bugs in DLSW is lower than in NDLSW. BugLocator might not be able to locate the extrinsic bugs in deep learning systems due to its naive approach, i.e., considering code as regular texts. We also found that BLUiR shows less performance gap between extrinsic and intrinsic bugs. It extracts different structured items (e.g., methods, classes) from the source code and bug reports. Thus, even if these items reside outside the current codebase and are invoked from an external library, they could be matched with relevant keywords from a bug report. Interestingly, BLIA performs slightly better for extrinsic bugs in DLSW than NDLSW, which could be possible due to its diverse use of meta components (e.g., stack traces, version control history). Stack traces from deep learning systems (DLSW) often contain more intricate information and dependencies (e.g., complex neural network data flows, dependencies on specialized libraries, GPU-related synchronization issues) related to the bugs, unlike stack traces from non-deep learning systems, which might not provide the same level of detailed information [68, 69].

Overall, these results suggest that extrinsic bugs are hard to localize, whether related to deep learning or not. However, deep learning bugs with an extrinsic nature are even more difficult to localize. On the other hand, from Fig. 7, we note that the performance of all four approaches for intrinsic bugs in NDLSW is higher compared to the intrinsic bugs in DLSW, which supports the observation that the bugs related to deep learning algorithms (a.k.a. DL bugs) from deep learning systems are more challenging to localize.

Table 21: Performance of Bug Localization Techniques in DL and TRAD Categories

Metric	Method	DLSW+EXT	DLSW+INT	NDLSW+EXT	NDLSW+INT
MAP	BugLocator	0.287	0.359	0.375	0.405
	BLUiR	0.277	0.369	0.301	0.359
	BLIA	0.321	0.449	0.382	0.487
	DNNLOC	0.334	0.486	0.390	0.505
MRR	BugLocator	0.371	0.489	0.409	0.476
	BLUiR	0.296	0.445	0.309	0.364
	BLIA	0.385	0.518	0.404	0.505
	DNNLOC	0.398	0.544	0.420	0.521

DLSW+EXT = Extrinsic Bug from Deep Learning Systems
DLSW+INT = Intrinsic Bug from Deep Learning Systems
NDLSW+EXT = Extrinsic Bug from Non-Deep Learning Systems
NDLSW+INT = Intrinsic Bug from Non-Deep Learning Systems
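To make the gap between extrinsic and intrinsic bugs in Table 21 easier to read, the following sketch computes the relative MAP drop for each technique on DLSW. The formula (intrinsic minus extrinsic, divided by intrinsic) is an assumption about how such a gap can be expressed; for DNNLOC it yields roughly 31%.

```python
# MAP values copied from Table 21 for deep learning systems: (extrinsic, intrinsic).
map_dlsw = {
    "BugLocator": (0.287, 0.359),
    "BLUiR": (0.277, 0.369),
    "BLIA": (0.321, 0.449),
    "DNNLOC": (0.334, 0.486),
}


def relative_drop(extrinsic, intrinsic):
    """Percentage by which the extrinsic score falls below the intrinsic score."""
    return 100.0 * (intrinsic - extrinsic) / intrinsic


drops = {name: relative_drop(ext, intr) for name, (ext, intr) in map_dlsw.items()}
for name, drop in drops.items():
    print(f"{name}: {drop:.2f}% lower MAP on extrinsic bugs")
```

Under this definition, every technique loses a noticeable share of its MAP on extrinsic bugs, with the two strongest techniques on intrinsic bugs (BLIA and DNNLOC) showing the largest relative drops.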

Correlation of extrinsic & deep learning-related bugs: To determine the potential correlation between extrinsic bugs and bugs in deep-learning systems, we performed a Chi-Square test of association [70]. We conducted three iterations with different sample data to validate the Chi-Square test and averaged the results. We got a p-value of $\approx$ 1.79e-14. Such a low p-value (far below the conventional threshold of 0.05) indicates that the observed association is unlikely to be a product of random chance. Instead, it implies a strong dependency between external factors contributing to bugs and the specific bug patterns within deep learning systems. Our manual analysis also supports the hypothesis, showing a higher prevalence of extrinsic bugs in deep learning systems (DLSW) compared to non-deep learning systems (NDLSW) (Fig. 6). The prevalence ratio of extrinsic bugs varies from 21.90% to 82.35% across different types of deep-learning bugs, which is consistent with a strong association between extrinsic factors and bugs from deep-learning systems. Our experiments also suggest that extrinsic bugs might have an underlying connection with deep-learning bugs (refer to RQ$_{3}$: Localization of extrinsic & intrinsic bugs from both systems), which contributes to the poor performance of the existing bug localization techniques.

Summary of RQ$_{3}$: We found that deep learning systems (DLSW) contain almost four times more extrinsic bugs than non-deep learning systems (NDLSW). The performance of our baseline bug localization techniques for extrinsic bugs is lower (e.g., 31.27% less MAP for DNNLOC) compared to that of intrinsic bugs. Our research also shows a strong connection between extrinsic bugs and bugs in deep learning systems.
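The Chi-Square test of association used in this research question can be sketched as follows for a 2x2 contingency table. The counts are an assumption: they are reconstructed from the reported prevalence ratios over the full 385-bug samples (40.00% and 10.65%), whereas the study averaged three iterations over different samples, so the resulting p-value is not expected to match the reported one.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)), so no statistics library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Assumed counts: 40.00% of 385 DLSW bugs = 154 extrinsic,
# 10.65% of 385 NDLSW bugs = 41 extrinsic.
observed = [
    [154, 231],  # DLSW: extrinsic, intrinsic
    [41, 344],   # NDLSW: extrinsic, intrinsic
]
chi2, p_value = chi_square_2x2(observed)
print(f"chi2 = {chi2:.2f}, p = {p_value:.3g}")
```

The test compares the observed counts with the counts expected if system type and bug origin were independent; a large statistic, and hence a small p-value, indicates that the two are associated.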

4.4 Key findings

Based on our findings from three research questions, we discuss the key factors that challenge bug localization in deep learning systems as follows:

• Extrinsic nature of bugs: A substantial number of bugs in deep learning systems are extrinsic, arising from external factors (e.g., hardware compatibility, OS issues). These bugs might not be found through source code analysis alone, making them inherently challenging to localize using traditional methods relying on source code. We also found a strong statistical correlation between extrinsic bugs and bugs from deep learning systems.

• Textual representation limitations in dependencies: Our baseline techniques heavily rely on textual similarity between bug reports and source code, which is less effective for deep learning bugs. Such bugs often involve intricate dependencies (e.g., gradient vanishing in RNN) and require conceptual understanding (e.g., improperly defined loss function). These complex issues may not be represented using texts, which makes the existing baseline techniques less effective.

• Dynamic complexity in training and tensor operations: Training bugs and tensor bugs are challenging to localize due to their dynamic and complex nature, often involving issues in the training process (e.g., misconfigured batch sizes) and data manipulation (e.g., incorrect tensor reshaping), respectively, which are not adequately captured by existing methods for bug localization.

• Multifaceted extrinsic influences on GPU operations: GPU bugs are particularly challenging to localize because they can be caused by multiple factors, such as hardware-software interaction and compatibility issues, and may not even be present in the source code (extrinsic bugs).

• Inadequate information for API usage: Traditional techniques (e.g., BLUiR) struggle with locating API bugs in deep learning systems. Deep learning APIs differ from traditional APIs [66] since they handle complex data structures, specialized hardware dependencies, dynamic computation graphs, high-level abstractions for operation, and rapid evolution, which impacts versioning and compatibility. Traditional bug localization techniques might not be able to capture such intricate information adequately from deep learning APIs, which might hurt the localization process.

• Ineffectiveness of stack traces for tensor bugs: For tensor and input bugs, stack trace information is found to be less effective since these bugs are often related to data manipulation and computations rather than the execution flow captured by stack traces.

• Lack of details on model architecture and training process: Although bug reports in deep learning systems contain more code snippets, this does not significantly improve the performance of bug localization. Deep learning bugs require a deeper understanding of the model architecture and training process, which may not be detailed in the bug reports and the code snippets.

• Limited scope of non-linearity: While leveraging non-linear relationships between bug reports and source code using deep neural networks has improved the overall performance of bug localization, the significant performance gap between deep learning systems and non-deep learning systems indicates the inherent localization challenges in deep learning systems. Non-linearity might help localize specific deep-learning bugs (e.g., model bugs) but not others (e.g., training and GPU bugs). This suggests that the deep learning approach, despite its ability to capture non-linear relationships, fails to address the complexities of the DL training process or the GPU bugs’ multifaceted and often external nature. We also found that non-linearity alone might not be the complete solution for Tensor bugs, as code structure analysis from information retrieval (BLUiR) helped to locate such structural bugs.

• Complexities within the deep learning architectures: Deep learning systems present a unique set of challenges for bug localization due to their convoluted architecture. The complex nature of frameworks, characterized by multiple layers and components, often coupled with dependencies, hinders the performance of existing bug localization techniques. In contrast, the modular nature of libraries slightly helps localization by providing a more defined structure [57]. On the other hand, tools (e.g., TensorBoard) frequently integrate with larger frameworks (e.g., TensorFlow), which leads to dependencies and interactions, and thus to low performance in bug localization.

4.5 Implications

The above challenges highlight the necessity for novel bug localization techniques specifically designed for deep learning systems. Our research findings indicate that a one-size-fits-all approach may not be effective in practice, as shown by the existing methods (e.g., BLUiR, DNNLOC). Future research should focus on developing more comprehensive methods capable of addressing the wide array and complexity of bugs in deep learning systems, leveraging insights from the strengths and weaknesses of existing techniques identified in our research.

5 Threats to Validity

We identify a few threats to the validity of our findings. In this section, we discuss these threats and the necessary steps taken to mitigate them as follows.

Threats to internal validity relate to experimental errors and human biases [71]. Traditional bug tracking systems (e.g., Bugzilla, GitHub, Jira) contain thousands of bug reports, and their quality cannot be guaranteed. This could be a source of threat as the bug reports are used as queries to locate the buggy files. Bug reports often contain poor, insufficient, missing, or even inaccurate information [72]. Hence, we used data from existing benchmarks [31, 32], where the authors took necessary steps to avoid low-quality or invalid bug reports. Thus, such threats might be mitigated.

Another potential source of threat could be the replication of existing work. The original replication packages were unavailable; hence, we used the publicly available versions of BugLocator, BLUiR [49], and DNNLOC [50]. For BLIA, we reused the authors’ replication package [23]. We validated our implementation of the existing methods using their original dataset and achieved comparable results (e.g., with differences $\approx$ 2.00%–3.00% using MAP).

Threats to conclusion validity. The observations from our study and the conclusions we drew from them could be a source of threat to conclusion validity [73]. In this research, we answer three research questions using two different datasets and re-implement four existing techniques. We use appropriate statistical tests (e.g., t-test) and report the test details (e.g., p-value, Cohen’s d) to support our conclusions. Thus, such threats might also be mitigated.

Threats to construct validity relate to the use of appropriate performance metrics. We evaluate all the methodologies using MRR, MAP, and Top@K, which have been used widely by the related work [11, 12, 14, 17, 48, 19, 74]. Thus, such threats might also be mitigated.
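For readers less familiar with these metrics, the sketch below shows how MRR, MAP, and Top@K are commonly computed from the ranks at which buggy files are retrieved. The ranks are hypothetical, and the average precision here assumes that every buggy file appears somewhere in the ranked list.

```python
def reciprocal_rank(relevant_ranks):
    """1 / rank of the first buggy file; 0 if no buggy file was retrieved."""
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0


def average_precision(relevant_ranks):
    """Mean of precision@k over the ranks k at which a buggy file appears."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)


def top_at_k(all_ranks, k):
    """Fraction of bug reports with at least one buggy file in the top k results."""
    hits = sum(1 for ranks in all_ranks if ranks and min(ranks) <= k)
    return hits / len(all_ranks)


# Hypothetical 1-based ranks of the buggy files for three bug reports.
queries = [[2], [1, 4], [38]]
mrr = sum(reciprocal_rank(r) for r in queries) / len(queries)
map_score = sum(average_precision(r) for r in queries) / len(queries)
top5 = top_at_k(queries, 5)
print(f"MRR = {mrr:.3f}, MAP = {map_score:.3f}, Top@5 = {top5:.3f}")
```

MRR rewards placing the first buggy file early, MAP accounts for all buggy files of a report, and Top@K reflects how often a developer would find a buggy file within the first K results.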

6 Related Work

6.1 Software bug

Understanding the nature and characteristics of bugs is essential for effective debugging and testing. Bugs can differ across programming languages and development frameworks [75]. Over the last 50 years, hundreds of studies have been conducted to tackle bugs in traditional software systems. Recently, bugs from deep learning systems have garnered much attention due to their significance. Humbatova et al. [30] proposed a taxonomy of bugs from deep learning systems with five main categories: model, training, tensor & input, API, and GPU. Chen et al. [7] focused on the unique obstacles for deep learning-based software deployment. According to Islam et al. [35], data bugs and logic bugs are the most severe in deep-learning software systems. Another study by Islam et al. [76] showed that the bugs or repair patterns of deep learning models significantly differ from those of traditional systems. As a result, traditional software debugging approaches, such as bug localization techniques, might not be effective for deep learning systems. Therefore, empirical research like ours, which focuses on the challenges posed by deep learning bugs in the software debugging process (specifically in bug localization), is essential.

6.2 Information Retrieval-based bug localization

One of the crucial steps toward fixing a software bug is to detect its location within the software code. Many existing approaches [11, 12, 14, 13, 23] use Information Retrieval (IR) to locate bugs by matching keywords between a query and the source code.

Zhou et al. [11] introduce BugLocator, which leverages textual similarity between bug reports and source code using rVSM for bug localization. Saha et al. [12] propose BLUiR, which determines the textual similarity between source code and bug reports using the Okapi-BM25 algorithm [77]. BLUiR also leverages structural items from both bug reports and source code, which boosts its localization performance. Later, Wang and Lo [14] propose AmaLgam, which incorporates the textual similarity from BugLocator, structured items from BLUiR, and version control history into IR-based bug localization.
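As a rough illustration of how IR-based localization ranks source files against a bug report, the sketch below scores a toy corpus with the standard Okapi BM25 formula. It is not BLUiR's or BugLocator's implementation (no structured fields, no rVSM); the file names, file contents, and query are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (this also splits snake_case)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def bm25_rank(query, documents, k1=1.2, b=0.75):
    """Rank documents (name -> text) against a query using Okapi BM25."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for t in tokens.values() for term in set(t))
    query_terms = set(tokenize(query))
    scores = {}
    for name, doc_tokens in tokens.items():
        tf = Counter(doc_tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(doc_tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores[name] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy corpus with hypothetical file names and contents.
corpus = {
    "ndarray.py": "class NDArray: def reshape(self, shape): return self",
    "unravel.py": "def unravel_index(indices, shape): compute index from shape",
    "trainer.py": "def fit(model, data, epochs): train the model on data",
}
ranking = bm25_rank("unravel_index returns wrong index for given shape", corpus)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A file that shares no terms with the bug report receives a score of zero, which illustrates why purely lexical matching struggles when a report and the buggy code have little vocabulary in common.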

Wang et al. [78] analyzed IR-based fault localization techniques and found their effectiveness to be limited, mainly due to the frequent unavailability of high-quality bug reports. The quality of bug reports makes it challenging to localize bugs using traditional IR-based techniques. Rahman and Roy [13] propose BLIZZARD, which leverages the quality aspect of bug reports and introduces context-aware query reformulation into bug localization. Wong et al. [48] proposed BRTracer, which improves upon BugLocator by combining source document segmentation and stack-trace analysis.

Le et al. [79] used an automated method to predict the effectiveness of IR-based bug localization by leveraging features extracted from bug reports and localization methods. Their findings highlight the significance of metadata features (e.g., commit history, stack traces) in enhancing the performance of these techniques. Another technique, namely Locus [74], uses the software change information from commit logs and change histories to improve bug localization. Youm et al. [23] proposed BLIA, which integrates bug reports, structured information of source files, and source code change history. It localizes bugs at two granularity levels (file level and method level) and outperforms prior approaches.

All these IR-based approaches have been designed with a focus on traditional software bugs. Bugs in deep learning applications pose several unique challenges: (a) non-deterministic behavior due to factors like random initialization and stochastic optimization [80], (b) complex relationships between high-dimensional data and model behavior and the influence of data-specific issues without direct code-level manifestations [35], and (c) strong external dependencies on hardware (e.g., PyTorch leverages GPU) [10]. Although IR-based bug localization techniques have shown promising results in traditional software systems, their performance might decline while localizing bugs in deep learning systems. Our experiments also provide evidence to support this observation (see Section 4 for further details).

Recently, Kim et al. [22] used basic IR-based techniques (e.g., VSM, rVSM, BM25) for locating bugs in deep learning systems but reported poor performance without any comprehensive analysis or explanation. Thus, the potential of existing IR-based solutions for bugs in deep-learning applications is not well understood yet. Our work in this article fills in that significant gap in the literature.

6.3 Deep learning-based bug localization

Unlike the above IR-based methods, deep learning can detect non-linear relationships between bug reports and source code for bug localization [26, 27, 28]. Polisetty et al. [81] evaluated deep learning-based bug localization models against traditional machine learning (ML) models, finding that while deep neural network (DNN) models generally outperform conventional ML models, they require substantial resources such as GPUs and memory. Lam et al. [17] propose DNNLOC, which combines information retrieval (e.g., rVSM [11]) with deep learning for bug localization. Xiao et al. [19] propose DeepLocator, where they use CNN and AST to extract features from bug reports and source documents, respectively. To learn unified features from natural language and source code during bug localization, Huo et al. [82] propose NP-CNN, which integrates both lexical and program structure information. Liang et al. [83] propose CAST, combining a tree-based CNN (TB-CNN) with customized AST to locate buggy files. However, these deep learning-based techniques are developed and evaluated using the source code from traditional software systems (e.g., JDT, SWT, Tomcat, AspectJ). These software systems do not represent deep learning applications, and thus, the designed techniques above might not be sufficient to tackle all the challenges of deep learning-related bugs.

Wardat et al. [18] propose an approach to locate Deep Neural Network (DNN) bugs through dynamic and statistical analysis. However, their method’s sole focus on model and training bugs, low accuracy, and over-reliance on the Keras library pose challenges for practical adoption. Deep learning-based approaches also lack explainability and heavily rely on source code, which may not be sufficient for the bugs with external dependencies (a.k.a. extrinsic bugs) in deep learning applications.

To address the above gap, in this empirical study, we replicated four existing techniques [11, 12, 17, 23] to locate bugs in deep learning systems. Unlike Kim et al. [22], our study extends beyond measuring bug localization performance in deep learning systems. Our study evaluates existing bug localization techniques, categorizes deep-learning bugs, analyzes their prevalence and challenges, and assesses each technique’s effectiveness for different bug types. We also conduct extensive manual analysis and explain why these bugs are difficult to localize (e.g., extrinsic factors, multifaceted dependencies), which makes our work novel.

7 Conclusion

Identifying the location of a bug within a software system (a.k.a. bug localization) is crucial to correct any bug. In recent years, bug localization techniques have received considerable attention in the context of traditional software systems. However, they might not be sufficient for deep learning systems, as deep learning bugs pose a greater challenge due to their multifaceted dependencies. Moreover, the potential of existing approaches for localizing bugs in deep learning systems is not well understood to date. In this work, we first replicated four existing bug localization approaches and found that they show poor performance in localizing bugs from deep-learning systems. Second, through an in-depth analysis, we found that localizing certain categories of bugs (e.g., training bugs & GPU bugs) is more difficult than other bugs in deep learning systems. Finally, we investigated and found that deep learning bugs are more likely to be extrinsic, i.e., connected to non-code artifacts (e.g., training data). Our research thus offers empirical evidence and actionable insights for deep learning software bugs, advancing automated software debugging research. Future work can focus on developing a new framework for automated software debugging based on the insights from this empirical study.

Data Availability Statement (DAS)

All the data generated or analyzed during this study are available in the GitHub Repository to help reproduce our results [46].

Conflict of Interest

The authors declare that they have no conflict of interest.

References

Arcuri [2008] A. Arcuri. On the automation of fixing software bugs. In ICSE, pages 1003–1006, 2008.
Karampatsis and Sutton [2020] R. M. Karampatsis and C. Sutton. How often do single-statement bugs occur? the manysstubs4j dataset. In MSR, pages 573–577, 2020.
Anvik et al. [2006] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In Proc. ICSE, pages 361–370, 2006.
Consortium for Information & Software Quality (CISQ) [2022] Consortium for Information & Software Quality (CISQ). The cost of poor quality software in the us: A 2022 report. https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2022-report/, 2022. Accessed: 2024-Jan-24.

Zhou et al. [2012a] J. Zhou, H. Zhang, and D. Lo. Where should the bugs be fixed? more accurate information retrieval-based bug localization based on bug reports. In ICSE, pages 14–24, 2012a.

Zou et al. [2020] W. Zou, D. Lo, Z. Chen, X. Xia, Y. Feng, and B. Xu. How practitioners perceive automated bug report management techniques. IEEE TSE, 46(8):836–862, 2020.

Chen et al. [2020] Z. Chen, Y. Cao, Y. Liu, H. Wang, T. Xie, and X. Liu. A comprehensive study on challenges in deploying deep learning-based software. In ESEC/FSE, pages 750–762, 2020.

Amershi et al. [2019] S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann. Software engineering for machine learning: A case study. In ICSE-SEIP, pages 291–300, 2019.

Gonzalez et al. [2020] D. Gonzalez, T. Zimmermann, and N. Nagappan. The state of the ml-universe: 10 years of artificial intelligence & machine learning software development on github. In MSR, pages 431–442, 2020.

Nganyewou Tidjon et al. [2022] L. Nganyewou Tidjon, B. Rombaut, F. Khomh, and A. E. Hassan. An empirical study of library usage and dependency in deep learning frameworks. arXiv e-prints, pages arXiv–2211, 2022.

Zhou et al. [2012b] J. Zhou, H. Zhang, and D. Lo. Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports. In ICSE, pages 14–24, 2012b.

Saha et al. [2013] R. K. Saha, M. Lease, S. Khurshid, and D. E. Perry. Improving bug localization using structured information retrieval. In ASE, pages 345–355, 2013.

Rahman and Roy [2018a] M. M. Rahman and C. K. Roy. Improving IR-based bug localization with context-aware query reformulation. In ESEC/FSE, pages 621–632, 2018a.

Wang and Lo [2014] S. Wang and D. Lo. Version history, similar report, and structure: Putting them together for improved bug localization. In ICPC, pages 53–63, 2014.

Moreno et al. [2014] L. Moreno, J. J. Treadway, A. Marcus, and W. Shen. On the use of stack traces to improve text retrieval-based bug localization. In ICSME, pages 151–160, 2014.

Perez et al. [2014] A. Perez, R. Abreu, and A. Riboira. A dynamic code coverage approach to maximize fault localization efficiency. Journal of Systems and Software, 90:18–28, 2014.

Lam et al. [2017] A. N. Lam, A. T. Nguyen, H. A. Nguyen, and T. N. Nguyen. Bug localization with combination of deep learning and information retrieval. In ICPC, pages 218–229, 2017.

Wardat et al. [2021] M. Wardat, W. Le, and H. Rajan. DeepLocalize: Fault localization for deep neural networks. In ICSE, pages 251–262, 2021.

Xiao et al. [2017] Y. Xiao, J. Keung, Q. Mi, and K. E. Bennin. Improving bug localization with an enhanced convolutional neural network. In APSEC, pages 338–347, 2017.

Wardat et al. [2022] M. Wardat, B. D. Cruz, W. Le, and H. Rajan. DeepDiagnosis: Automatically diagnosing faults and recommending actionable fixes in deep learning programs. In ICSE, pages 561–572, 2022.

Cao et al. [2022] J. Cao, M. Li, X. Chen, M. Wen, Y. Tian, B. Wu, and S. C. Cheung. DeepFD: Automated fault diagnosis and localization for deep learning programs. In ICSE, pages 573–585, 2022.

Kim et al. [2022] M. Kim, Y. Kim, and E. Lee. An empirical study of IR-based bug localization for deep learning-based software. In ICST, pages 128–139, 2022.

Youm et al. [2017] K. C. Youm, J. Ahn, and E. Lee. Improved bug localization based on code change histories and bug reports. Information and Software Technology, 82:177–192, 2017.

Rahman and Roy [2018b] M. M. Rahman and C. K. Roy. Improving IR-based bug localization with context-aware query reformulation. In ESEC/FSE, pages 621–632, 2018b.

Chawla and Singh [2013] I. Chawla and S. K. Singh. Performance evaluation of VSM and LSI models to determine bug reports similarity. In IC3, pages 375–380, 2013.

Obulesu et al. [2018] O. Obulesu, M. Mahendra, and M. ThrilokReddy. Machine learning techniques and tools: A survey. In ICIRCA, pages 605–611, 2018.

Almeida [2002] J. S. Almeida. Predictive non-linear modeling of complex data by artificial neural networks. Current opinion in biotechnology, 13(1):72–76, 2002.

Bitvai and Cohn [2015] Z. Bitvai and T. Cohn. Non-linear text regression with a deep convolutional neural network. In ACL, pages 180–185, 2015.

Deng and Liu [2018] L. Deng and Y. Liu (Eds.). Deep learning in natural language processing. Springer, 2018.

Humbatova et al. [2020] N. Humbatova, G. Jahangirova, G. Bavota, V. Riccio, A. Stocco, and P. Tonella. Taxonomy of real faults in deep learning systems. In ICSE, pages 1110–1121, 2020.

Kim et al. [2021] M. Kim, Y. Kim, and E. Lee. Denchmark: A bug benchmark of deep learning-related software. In MSR, pages 540–544, 2021.

Muvva et al. [2020] S. Muvva, A. E. Rao, and S. Chimalakonda. Bugl–a cross-language dataset for bug localization. arXiv preprint arXiv:2004.08846, 2020.

Kim [2015] T. K. Kim. T test as a parametric statistic. Korean Journal of Anesthesiology, 68(6):540, 2015.

Rice and Harris [2005] M. E. Rice and G. T. Harris. Comparing effect sizes in follow-up studies: ROC area, Cohen’s d, and r. Law and Human Behavior, 29:615–620, 2005.

Islam et al. [2019] M. J. Islam, G. Nguyen, R. Pan, and H. Rajan. A comprehensive study on deep learning bug characteristics. In ESEC/FSE, pages 510–520, 2019.

Rodriguez-Perez et al. [2020] G. Rodriguez-Perez, M. Nagappan, and G. Robles. Watch out for extrinsic bugs! A case study of their impact in just-in-time bug prediction models on the OpenStack project. IEEE TSE, 2020.

Facebook AI Research [2024] Facebook AI Research. Issue 1860. https://github.com/facebookresearch/fairseq/issues/1860, 2024. Accessed: 2024-01-30.

PyTorch Ignite Team [2024] PyTorch Ignite Team. Issue 1426. https://github.com/pytorch/ignite/issues/1426, 2024. Accessed: 2024-01-30.

ASYML Texar-PyTorch Team [2024] ASYML Texar-PyTorch Team. Issue 313. https://github.com/asyml/texar-pytorch/issues/313, 2024. Accessed: 2024-01-30.

FastAI Team [2024] FastAI Team. Issue 3048. https://github.com/fastai/fastai/issues/3048, 2024. Accessed: 2024-01-30.

Apache MXNet Team [2024a] Apache MXNet Team. Issue 13760. https://github.com/apache/mxnet/issues/13760, 2024a. Accessed: 2024-01-30.

Apache MXNet Team [2024b] Apache MXNet Team. Issue 13862. https://github.com/apache/mxnet/issues/13862, 2024b. Accessed: 2024-01-30.

Keras Team [2024] Keras Team. Issue 1238. https://github.com/keras-team/autokeras/issues/1238, 2024. Accessed: 2024-01-30.

Hernandez et al. [2006] P. A. Hernandez, C. H. Graham, L. L. Master, and D. L. Albert. The effect of sample size and species characteristics on performance of different species distribution modeling methods. Ecography, 29(5):773–785, 2006.

Acharya et al. [2013] A. S. Acharya, A. Prakash, P. Saxena, and A. Nigam. Sampling: Why and how of it. Indian Journal of Medical Specialties, 4(2):330–333, 2013.

Replication Package [2023] Replication package. https://bit.ly/BL-Challenges-DLSW, 2023.

Zhao et al. [2023] Y. Zhao, K. Damevski, and H. Chen. A systematic survey of just-in-time software defect prediction. ACM Computing Surveys, 55(10):1–35, 2023.

Wong et al. [2014] C. P. Wong, Y. Xiong, H. Zhang, D. Hao, L. Zhang, and H. Mei. Boosting bug-report-oriented fault localization with segmentation and stack-trace analysis. In ICSME, pages 181–190, 2014.

Lee et al. [2018] J. Lee, D. Kim, T. F. Bissyandé, W. Jung, and Y. Le Traon. Bench4BL: Reproducibility study on the performance of IR-based bug localization. In ISSTA, pages 61–72, 2018.

Emre and Alperen [2019] D. Emre and C. Alperen. https://github.com/emredogan7/bug-localization-by-dnn-and-rvsm, 2019.

Wattanakriengkrai et al. [2020] S. Wattanakriengkrai, P. Thongtanunam, C. Tantithamthavorn, H. Hata, and K. Matsumoto. Predicting defective lines using a model-agnostic technique. IEEE TSE, 48(5):1480–1496, 2020.

Royston [1992] P. Royston. Approximating the Shapiro-Wilk W-test for non-normality. Statistics and Computing, 2(3):117–119, 1992.

Rosenthal et al. [1994] R. Rosenthal, H. Cooper, and L. Hedges. Parametric measures of effect size. The handbook of research synthesis, 621(2):231–244, 1994.

Apache MXNet Team [2024c] Apache MXNet Team. Issue 10224. https://github.com/apache/mxnet/issues/10224, 2024c. Accessed: 2024-01-30.

Nguyen et al. [2019] G. Nguyen, S. Dlugolinsky, M. Bobák, V. Tran, Á. López García, I. Heredia, P. Malík, and L. Hluchý. Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey. Artificial Intelligence Review, 52:77–124, 2019.

Khan et al. [2018] S. Khan, H. Rahmani, S. A. A. Shah, and M. Bennamoun. Deep learning tools and libraries. In A Guide to Convolutional Neural Networks for Computer Vision, pages 159–167. Springer, 2018.

Wang et al. [2019] Z. Wang, K. Liu, J. Li, Y. Zhu, and Y. Zhang. Various frameworks and libraries of machine learning and deep learning: a survey. Archives of computational methods in engineering, pages 1–24, 2019.

TensorFlow Team [2024a] TensorFlow Team. Issue 5596. https://github.com/tensorflow/tensorboard/issues/5596, 2024a. Accessed: 2024-01-30.

Erickson et al. [2017] B. J. Erickson, P. Korfiatis, Z. Akkus, T. Kline, and K. Philbrick. Toolkits and libraries for deep learning. Journal of Digital Imaging, 30:400–405, 2017.

PyTorch Team [2024] PyTorch Team. Issue 87085. https://github.com/pytorch/pytorch/issues/87085, 2024. Accessed: 2024-01-30.

TensorFlow Team [2024b] TensorFlow Team. Issue 61297. https://github.com/tensorflow/tensorflow/issues/61297, 2024b. Accessed: 2024-01-30.

TensorFlow Team [2024c] TensorFlow Team. Issue 5948. https://github.com/tensorflow/tensorboard/issues/5948, 2024c. Accessed: 2024-01-30.

Aoyama [1954] H. Aoyama. A study of stratified random sampling. Ann. Inst. Stat. Math, 6(1):1–36, 1954.

Watson and Petrie [2010] P. F. Watson and A. Petrie. Method agreement analysis: a review of correct methodology. Theriogenology, 73(9):1167–1179, 2010.

Velez et al. [2022] M. Velez, P. Jamshidi, N. Siegmund, S. Apel, and C. Kästner. On debugging the performance of configurable software systems: Developer needs and tailored tool support. In ICSE, pages 1571–1583, 2022.

Zhang et al. [2019] T. Zhang, C. Gao, L. Ma, M. Lyu, and M. Kim. An empirical study of common challenges in developing deep learning applications. In IEEE ISSRE, pages 104–115, 2019.

Dig and Johnson [2006] D. Dig and R. Johnson. How do APIs evolve? A story of refactoring. Journal of Software Maintenance and Evolution: Research and Practice, 18(2):83–107, 2006.

O’Mahony et al. [2020] N. O’Mahony, S. Campbell, A. Carvalho, S. Harapanahalli, G. V. Hernandez, L. Krpalkova, D. Riordan, and J. Walsh. Deep learning vs. traditional computer vision. In CVC, pages 128–144, 2020.

Karasov et al. [2022] N. Karasov, A. Khvorov, R. Vasiliev, Y. Golubev, and T. Bryksin. Aggregation of stack trace similarities for crash report deduplication. arXiv preprint arXiv:2205.00212, 2022.

McHugh [2013] M. L. McHugh. The chi-square test of independence. Biochemia medica, 23(2):143–149, 2013.

Tian et al. [2014] Y. Tian, D. Lo, and J. Lawall. Automated construction of a software-specific word similarity database. In CSMR-WCRE, pages 44–53, 2014.

Gupta and Gupta [2021] S. Gupta and S. K. Gupta. A systematic study of duplicate bug report detection. International Journal of Advanced Computer Science and Applications, 12(1), 2021.

García-Pérez [2012] M. A. García-Pérez. Statistical conclusion validity: Some common threats and simple remedies. Frontiers in psychology, 3:325, 2012.

Wen et al. [2016] M. Wen, R. Wu, and S. C. Cheung. Locus: Locating bugs from software changes. In ASE, pages 262–273, 2016.

Widyasari et al. [2020] R. Widyasari, S. Q. Sim, C. Lok, H. Qi, J. Phan, Q. Tay, C. Tan, F. Wee, J. E. Tan, Y. Yieh, et al. BugsInPy: A database of existing bugs in Python programs to enable controlled testing and debugging studies. In ESEC/FSE, pages 1556–1560, 2020.

Islam et al. [2020] M. J. Islam, R. Pan, G. Nguyen, and H. Rajan. Repairing deep neural networks: Fix patterns and challenges. In ICSE, pages 1135–1146, 2020.

Robertson and Zaragoza [2009] S. Robertson and H. Zaragoza. The probabilistic relevance framework: BM25 and beyond. Now Publishers Inc, 2009.

Wang et al. [2015] Q. Wang, C. Parnin, and A. Orso. Evaluating the usefulness of IR-based fault localization techniques. In ISSTA, pages 1–11, 2015.

Le et al. [2014] T.-D. B. Le, F. Thung, and D. Lo. Predicting effectiveness of IR-based bug localization techniques. In IEEE ISSRE, pages 335–345, 2014.

Sajjadi et al. [2016] M. Sajjadi, M. Javanmardi, and T. Tasdizen. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in Neural Information Processing Systems, 29, 2016.

Polisetty et al. [2019] S. Polisetty, A. Miranskyy, and A. Başar. On usefulness of the deep-learning-based bug localization models to practitioners. In PROMISE, pages 16–25, 2019.

Huo et al. [2016] X. Huo, M. Li, and Z. H. Zhou. Learning unified features from natural and programming languages for locating buggy source code. In IJCAI, volume 16, pages 1606–1612, 2016.

Liang et al. [2019] H. Liang, L. Sun, M. Wang, and Y. Yang. Deep learning with customized abstract syntax tree for bug localization. IEEE Access, 7:116309–116320, 2019.
