Email: sigma.jahan@dal.ca
Mehil B. Shah, Dalhousie University, Canada. Email: shahmehil@dal.ca
Mohammad Masudur Rahman, Dalhousie University, Canada. Email: masud.rahman@dal.ca
Towards Understanding the Challenges of Bug Localization in Deep Learning Systems
Abstract
Software bugs cost the global economy billions of dollars annually and claim 50% of software developers' programming time. Locating these bugs is crucial for their resolution but challenging. It is even more challenging in deep-learning systems due to their black-box nature. Bugs in these systems can be hidden not only in the code but also in the models and training data, which might make traditional debugging methods less effective. In this article, we conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems. First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software. We found that these techniques significantly underperform in localizing deep-learning system bugs. Second, we evaluate how different bug types in deep learning systems impact bug localization. We found that the effectiveness of localization techniques varies with bug type because each type poses unique challenges. For example, tensor bugs were easier to locate due to their structural nature, while all techniques struggled with GPU bugs due to their external dependencies. Third, we investigate the impact of bugs' extrinsic nature on localization in deep-learning systems. We found that deep learning bugs are often extrinsic and thus connected to artifacts other than source code (e.g., GPU, training data), contributing to the poor performance of existing localization methods.
Keywords: Bug localization, Deep Learning Bug, Deep Learning Framework, Extrinsic Bugs, Information Retrieval, GPU Bug, Training Bug
1 Introduction
Software bugs are human-made errors in the code that prevent it from working correctly [1]. They are often prevalent in modern software systems and could range from hundreds to thousands in a single system [2]. Due to the bugs in software systems, the global economy loses billions of dollars every year [3, 4]. Developers also spend about 50% of their programming time dealing with software bugs and failures [3]. To correct any bug, the developers first need to identify the location of a bug within a software system, which is known as bug localization [5]. According to a recent survey, 49.20% of 327 software practitioners from several major technology companies (e.g., Google, Meta, Amazon, and Microsoft) consider the localization of bugs as one of the most challenging tasks during software development and maintenance [6].
While localizing bugs in traditional software systems (a.k.a. non-deep learning systems) remains a challenge, it could be even more challenging in deep learning systems. Unlike bugs in non-deep learning systems, deep learning-related bugs could be hidden in the source code, training data, trained models, or even deployment scripts [7, 8, 9]. Besides, the use of various deep learning frameworks (e.g., PyTorch, Caffe, and TensorFlow) could make these bugs even more complex [10].
Given the prevalence and costs of software bugs, any automated support to localize the bugs can greatly benefit software practitioners. Over the years, many approaches have been designed to localize bugs in traditional software systems using information retrieval [11, 12, 13, 14], dynamic program analysis [15, 16], and deep learning [17, 18, 19]. However, due to the significant differences between traditional and deep learning bugs, these existing solutions might not be adequate for localizing bugs in deep learning systems.
To date, there exist only a few techniques for detecting bugs in deep learning systems. Most of them concentrate on specific types of bugs (e.g., model bugs, training bugs) without considering the broader spectrum of deep learning systems. Wardat et al. [18] propose a dynamic approach to localize different types of model bugs in Deep Neural Networks (DNNs). They identify the faulty layers containing numerical bugs by customizing Keras' callback function and analyzing the dynamic behaviors of a model. However, their solution focuses only on model bugs from deep learning systems, is strongly coupled with the Keras library, and achieves low accuracy, which presents significant challenges for widespread adoption by the industry. In another study, Wardat et al. [20] propose a heuristic-based approach to diagnose two main categories of bugs – model bugs and training bugs. They also recommend actionable fixes for the bugs based on the diagnosis. Since their approach depends on a set of hard-coded rules, it might be limited in terms of scalability and context-awareness. In a recent work, Cao et al. [21] introduce a technique that leverages the dynamic properties of a model and an ensemble of three machine learning classifiers (e.g., KNN, Decision Tree, Random Forest) to localize five types of training bugs (e.g., loss, gradient) from deep learning systems. Their technique might also not be able to address a broader array of bug types from deep learning systems, which raises scalability concerns. On the other hand, Kim et al. [22] use basic Information Retrieval (IR) algorithms, such as rVSM and BM25, to localize bugs in deep-learning systems. They report poor performance but do not perform any comprehensive analysis to explain why IR-based techniques perform poorly.
Interestingly, at least 30 techniques adopt IR algorithms to locate bugs in traditional software systems due to their computational efficiency and lightweight nature [11, 12, 14, 13, 23, 22]. They were also reported to perform comparably to the complex models (e.g., LDA) [24]. Unlike deep learning-based techniques, IR-based techniques rely on the textual similarity between bug reports and source code as a proxy of suspiciousness, which is simple and explainable. However, IR-based techniques suffer from vocabulary mismatch issues [25] and can only capture linear relationships between two items. On the contrary, deep learning-based techniques can capture the non-linear relationships between two items [26, 27, 28]. Thus, they have the potential to capture more nuances in the relevance between enriched information from the source code and bug reports. However, they also suffer from poor outlier handling, class imbalance problems, and a lack of monitoring [29]. Thus, the potential of existing solutions for localizing bugs in deep learning applications is neither well understood nor well investigated to date. Our work in this article fills in this important gap in the literature.
In this article, we conduct a large-scale empirical study to better understand the challenges of locating bugs in deep learning systems. First, we collect a total of 2,365 bugs from deep-learning systems and 2,913 bugs from traditional software systems (a.k.a. non-deep-learning systems), and empirically show how existing techniques (e.g., BugLocator [11], BLUiR [12], BLIA [23], DNNLOC [17]) perform in locating bugs from deep learning systems. Our work utilizes these traditional techniques as a foundational framework, adapting their core principles to the specific nuances of deep learning bugs. Second, we categorized our collected bugs based on an existing bug taxonomy [30] and found that certain bugs from deep learning systems (e.g., GPU bugs) are more difficult than others to locate due to their multifaceted, heterogeneous dependency issues. Finally, we found that deep learning bugs are connected to artifacts other than source code (e.g., GPU, training data, external dependencies) and are prone to be extrinsic in nature, which might explain the poor performance of existing techniques for these bugs. We thus answer three important research questions in our study as follows.
(a) RQ1: How effective are the existing approaches in localizing bugs from deep learning systems?
We evaluated the performance of four existing approaches (BugLocator [11], BLUiR [12], BLIA [23], and DNNLOC [17]) using two datasets – Denchmark [31] and BugGL [32]. First, we found that their performance is poorer (e.g., 31.59% lower MAP for BugLocator, 33.25% for BLUiR, 34.14% for BLIA, 31.43% for DNNLOC) in localizing bugs from deep learning systems than from non-deep learning systems. Our statistical tests (t-test [33], Cohen's D [34]) also report that their performance is significantly lower. Second, we found that localizing bugs from deep learning frameworks is more challenging than from libraries or tools due to the frameworks' inherent complexity. Our findings not only reinforce the existing understanding and belief about the challenges of bugs in deep learning systems [35, 22, 18] but also substantiate them with solid empirical evidence and demonstrate the performance gap of existing solutions in localizing the two categories of bugs.
(b) RQ2: How do different types of bugs in deep learning systems impact bug localization?
We use an existing taxonomy [30] of bugs to classify the bugs in deep learning systems and evaluate the performance of four existing techniques for each type of bug. First, we found that 64.80% of the bugs from deep learning systems are related to deep learning (e.g., model, training), whereas the remaining ones are not. Second, we found that DNNLOC demonstrated better results in locating model and tensor bugs, possibly due to its ability to capture comprehensive contextual information specific to these bugs. We also found that BLUiR performs comparably to DNNLOC for training bugs, which might be attributed to its structured information retrieval. However, all four baseline techniques experienced difficulty localizing GPU bugs. Thus, our analysis offers valuable insights regarding the nature of different types of bugs in deep learning systems and highlights the specific strengths and weaknesses of existing techniques, which could be useful to advance debugging support for deep learning systems.
(c) RQ3: What are the implications of extrinsic bugs in deep learning systems for bug localization?
Bugs triggered by external entities (e.g., third-party libraries, GPU) are called extrinsic bugs [36]. Given the frequent use of deep learning libraries and their external dependencies, the bugs in deep learning systems could be extrinsic [10]. Since the existing techniques mostly focus on intrinsic bugs (i.e., bugs triggered by a bug-introducing change), we investigate how they deal with extrinsic bugs from deep learning systems. First, we found that 40.00% of the bugs in deep learning systems are extrinsic, which is almost four times higher than in non-deep learning systems. Second, we found that the localization performance of existing techniques degrades significantly for extrinsic bugs (e.g., 15.20% lower MAP for DNNLOC) (Table 21). We also found that deep neural network-based solutions (e.g., DNNLOC) are not particularly helpful for locating extrinsic bugs either, because they are designed to detect code patterns, not issues from external sources or environments. Finally, we found a significant correlation between the bugs in deep learning systems and the extrinsic factors using appropriate statistical analysis (Chi-Square test), which delivers valuable insight for designing effective solutions to find bugs in deep learning systems.
2 Background
In this section, we introduce the necessary terminologies and concepts to follow the remainder of the article. We introduce extrinsic bugs, intrinsic bugs, and the taxonomy of deep-learning bugs.
2.1 Extrinsic bug
A bug caused by factors external to a software system, such as changes to the operating environment, requirements, or third-party libraries, is known as an extrinsic bug. Rodriguez-Perez et al. [36] suggest three heuristics based on bug reports to identify extrinsic bugs as follows.
(a) Environment: An extrinsic bug is caused by a modification to the environment in which the software system operates. The environment could be an operating system, a physical machine, or even a cloud infrastructure.
(b) Requirement: An extrinsic bug is triggered by a change outside of the project’s version control system. During software development, if a user requirement gets changed after implementation, the development team might implement the new requirement without discarding the old feature. The old, unexpected feature will then be considered as an extrinsic bug.
(c) Third-party library: The bug found in the project’s third-party library is considered an extrinsic bug. For example, if a software project uses a third-party library for processing images for a mobile application, and the app crashes when processing certain image formats due to a bug in that third-party library, that bug will then be considered an extrinsic bug.
2.2 Intrinsic bug
An intrinsic bug is not caused by external factors; rather, it is caused by a bug-introducing change in the version control system [36]. For example, if a messaging application fails to deliver messages due to a logical error in a recent code change, that would be an intrinsic bug.
2.3 Taxonomy of bugs in deep learning systems
Software bugs in deep learning systems can be divided into two categories – DL bug and NDL bug [30].
Deep Learning (DL) bug refers to a software error that is connected to the deep learning module embedded in the software system, causing inaccurate or unexpected output. According to the existing literature [30], DL bugs can be divided into five main categories: Model, Training, Tensor & Input, API, and GPU.
• Training bug occurs during the training phase of a deep learning application (Table 15 [40]). For instance, during the training of a deep learning model for object detection, if the loss function is incorrectly defined, the model will learn to detect objects with very poor accuracy, leading to incorrect output from the system.
Non-Deep Learning (NDL) bug refers to a software error that is not related to the deep learning module but still leads to unexpected behaviors in deep learning applications. An example of an NDL bug could be a logical error in the source code that leads to a deadlock, causing the program to hang.
As shown in Table 1, Bug 1860 [37] is a deep learning-related extrinsic bug triggered by a change in the environment. When the WMT19 model runs on multiple GPUs, the execution fails since the same GPU cannot store both the model and data. It is clearly related to the deep learning module. However, this bug does not originate in the Fairseq library (i.e., the deep learning application); rather, it is related to external factors (e.g., GPU), which indicates its extrinsic nature.
In Table 1, Bug 1426 [38] is another extrinsic bug connected to the Windows OS environment. The bug triggers when the tests from the CI pipeline are distributed over multiple Windows machines. It is clearly not related to deep learning (a.k.a., Non-DL bug), but the triggering factors are outside of the version control system, which indicates an extrinsic nature.
3 Study Methodology
Fig. 1 shows the schematic diagram of our conducted study. First, we collect bug reports from two benchmark datasets for two different software systems: deep learning systems [31] and traditional software systems [32]. Then, we contrast the performance of four existing techniques [11, 12, 23, 17] in locating bugs between deep learning systems [31] and traditional software systems [32]. Second, we perform an in-depth analysis to understand the challenges of localizing different types of deep-learning bugs. Finally, we investigate the influence of extrinsic factors on deep learning bugs and their impact on bug localization. This section discusses the major steps of our study design as follows.
3.1 Construction of dataset
Dataset collection. In our study, we use two benchmark datasets – BugGL and Denchmark – that have been previously used by the literature [32, 22]. BugGL [32] contains bugs from Python-based, traditional software systems (a.k.a non-deep learning-based systems), whereas Denchmark focuses on bugs from deep learning-based systems. BugGL contains a total of 2,913 bug reports from 12 Python projects [32]. On the other hand, the original Denchmark dataset [31] contains 4,577 bug reports from 193 deep learning-based projects, which are written in ten programming languages (JavaScript, Python, Java, Go, C++, Ruby, TypeScript, PHP, C#, and C). We used 2,365 bug reports from 136 deep learning-based projects (written in Python) from the Denchmark dataset [22]. We limited our dataset to Python-based bugs to ensure a fair contrastive analysis between deep learning bugs and traditional software bugs.
Original Distribution | ||
Category | Projects | Bugs |
Framework | 25 | 836 |
Platform | 8 | 150 |
Engine | 4 | 47 |
Compiler | 2 | 33 |
Tool | 31 | 510 |
Library | 44 | 666 |
Application | 12 | 124 |
Total | 136 | 2365 |
Experimental Dataset (DL Systems) | ||
Category | Projects | Bugs |
Framework | 17 | 746 |
Tool | 28 | 455 |
Library | 41 | 594 |
Total | 86 | 1795 |
We adopted a set of filtration steps to construct our experimental dataset, which consists of 86 projects from DL systems (shown in Table 2). First, the initial Denchmark dataset has 136 projects across various classes, including client-based applications, frameworks, tools, libraries, search engines, and compilers. Our study focuses on core elements of deep learning – frameworks (e.g., Apache MXNet), libraries (e.g., OpenCV), and tools (e.g., Fairseq) – rather than application-specific software (e.g., PhotoPrism). We thus collect the systems pertaining to those three categories and discard all application-specific systems. Second, we checked for overlapping projects among these classes to ensure distinct categorization. Finally, from a total of 136 projects, we finalized our experimental dataset with 17 projects categorized under frameworks, 28 under tools, and 41 under libraries, bringing the total to 86 projects from deep learning systems.
Deep learning systems differ significantly from traditional software systems due to the complexity of model integration, intricate interactions between deep learning libraries [10], and multifaceted dependencies.
Data cleaning and pre-processing. After collecting the data from two benchmark datasets, we cleaned and preprocessed them using a set of steps.
Corpus creation. We first download the latest version of the code repositories, ensuring that we have the most up-to-date code for analysis. Next, to accurately link bug reports to their corresponding buggy code, we used the heuristic of Kim et al. [22], focusing on commit messages and bug reports. This involved analyzing commit messages for keywords indicative of bug fixes (e.g., ‘fix’, ‘bug’, ‘error’, ‘resolve’) and connecting them to corresponding bug reports. To identify the correct bug-fix commit, we cross-referenced these commit messages with bug IDs from the reports, ensuring a precise match. To extract the buggy part from the bug-fix commit, we employed PyDriller, suggested by Kim et al. [22]. We also capture each project’s most recently released version as of the bug report’s date, and collect the appropriate version of the buggy code, especially in cases where the bug reports did not have clear buggy version information.
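The linking heuristic described above can be illustrated with a minimal Python sketch. It is not the implementation of Kim et al. [22]: the regular expressions, the `#<id>` reference format, and the function name `link_commits_to_bugs` are assumptions made for illustration, and the commit data is made up.

```python
import re

# Keywords come from the text above; the exact regexes are assumptions.
FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bug|error|resolve[sd]?)\b", re.IGNORECASE)
# Assumed reference format: "#<bug id>" inside the commit message.
ISSUE_REF = re.compile(r"#(\d+)")

def link_commits_to_bugs(commits, bug_ids):
    """Map each known bug ID to the hashes of commits that look like its fix."""
    links = {}
    for commit_hash, message in commits:
        if not FIX_KEYWORDS.search(message):
            continue  # the message does not look like a bug fix
        for ref in ISSUE_REF.findall(message):
            bug_id = int(ref)
            if bug_id in bug_ids:  # cross-reference with the bug reports
                links.setdefault(bug_id, []).append(commit_hash)
    return links

# Made-up commits for illustration only.
commits = [
    ("a1b2c3", "Fix tokenizer crash on empty input (#313)"),
    ("d4e5f6", "Add new example for language model"),
    ("0718ab", "Resolve GPU memory error, closes #1860"),
    ("9f9f9f", "Refactor docs, see #42"),
]
links = link_commits_to_bugs(commits, bug_ids={313, 1860, 42})
print(links)
```

In this sketch, the last commit mentions a known bug ID but carries no fix keyword, so it is not linked.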
Query construction. In IR-based bug localization, bug reports are treated as queries that can be executed with a search engine to detect the relevant source documents from the corpus. We construct a repository of bug reports by parsing the original datasets (Denchmark & BugGL) and extracting important information such as bug IDs, descriptions, and timestamps. We construct the queries by extracting tokens from the title and description of bug reports, removing stop words, stemming each word, and splitting the tokens.
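A minimal sketch of such a query construction step is shown below. The stop-word list is deliberately tiny, `crude_stem` is a naive stand-in for a real stemmer (e.g., Porter's), and the order of the steps is an assumption; none of these details are taken from the original pipeline.

```python
import re

# A tiny illustrative stop-word list; real pipelines use a much larger one.
STOP_WORDS = {"a", "an", "the", "is", "in", "on", "to", "of", "and", "when", "with"}

def split_identifier(token):
    """Split snake_case and camelCase tokens into their parts."""
    parts = []
    for piece in token.split("_"):
        parts.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", piece))
    return parts

def crude_stem(word):
    """Naive suffix stripping; a hypothetical stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_query(title, description):
    """Tokenize, split identifiers, drop stop words, and stem."""
    raw_tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", f"{title} {description}")
    words = [part.lower() for tok in raw_tokens for part in split_identifier(tok)]
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

query = build_query("A bug in GPT2Tokenizer",
                    "Tokenizer fails when loading the vocab_file")
print(query)
```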
Metadata extraction. We also capture the historical context of bugs by extracting their commit history information from the repositories, including commit messages, authors, timestamps, and code change history. This information is needed to replicate the existing technique BLIA [23] and provides valuable insights into the evolution of the codebase and the bugs over time.
Ground truth construction. Both benchmark datasets provide the ground truth that contains the correct locations of bugs in the code against the bug reports. To evaluate the performance of the bug localization approaches, we collect ground truth files from both of the original datasets.
To ensure a fair performance comparison of the bug localization techniques between deep learning systems (DLSW) and non-deep learning systems (NDLSW), we selected an equal amount of data from both datasets using probability sampling (1,795 bug reports from each dataset), with a 95% confidence level and a 5% error margin [44]. We use the principle of randomization for selecting the subsets [45] to avoid any bias. We also manually analyzed the projects to avoid any overlap between the two datasets, which took 5 hours.
Categorizing Classes in Deep Learning Systems. In the original dataset, Kim et al. [31] identified different categories (e.g., frameworks, libraries, tools) of deep learning systems. However, some projects overlapped across the framework, library, and tool categories because the lines between these categories are sometimes unclear. Some libraries may evolve into frameworks or vice versa, and tools could be integrated into either libraries or frameworks. To address this issue, we removed the overlapping projects from our dataset. We then manually analyzed each selected project based on its official documentation to ensure that it clearly belonged to one of the three categories. This process helped us maintain clear distinctions among the frameworks, libraries, and tools in our study. We spent 5 hours on the manual analysis. We have provided the list of the selected projects from these three categories, along with their detailed project descriptions, in our replication package [46].
3.2 Replication of existing techniques for experiments
To answer our first research question, we needed to replicate existing techniques that localize bugs in deep learning systems. We thus select suitable representatives from the existing literature on bug localization. In particular, we choose baseline methods from the frequently used methodologies – Information Retrieval (IR) and Deep Learning (DL).
At least 30 approaches adopt IR methods for localizing software bugs due to their computational efficiency and lightweight nature [22]. IR-based techniques rely on the textual similarity between bug reports and source code as a proxy of suspiciousness, which is explainable. We thus used three different baseline techniques from IR. We selected BugLocator [11] as our initial baseline technique since it is the seminal work on IR-based bug localization. Then, we chose BLUiR [12], recognized as the first structured IR-based technique that integrates code structure and bug report structure in its analysis. As our third baseline, we selected BLIA [23], notable for its incorporation of meta-components such as stack traces and version control history, which has demonstrated better performance compared to other IR-based techniques.
Unlike the above IR-based techniques that rely on the textual similarity between code and bug reports, DL-based approaches have the potential to uncover complex, non-linear relationships between dependent and independent variables [26, 27, 28]. Thus, we also used a Deep Neural Network (DNN) for bug localization in deep learning systems, adapting an existing work – DNNLOC [17]. DNNLOC is a seminal DL-based hybrid approach to bug localization that integrates both traditional and deep learning techniques.
Most of the recent techniques are based on these four primary approaches with incremental improvements [14, 13, 47, 19]. We chose these baseline methods for our study, which can be considered as a representative sample of the existing approaches for bug localization.
BugLocator [11] uses rVSM (revised Vector Space Model), which refines the classic VSM by taking document length into consideration, to detect relevant source code documents against a bug report. It also calculates the SimiScore, which is a measure of similarity between a newly reported bug and previously fixed bugs based on their bug reports. SimiScore is combined with rVSM to calculate the final relevancy score. The relevant source code files are then ranked based on their combined scores, and the top-K documents are marked as buggy. Many subsequent IR-based techniques [14, 13, 23] adopted this method due to its simplicity and explainability. Hence, we chose this method as our first baseline.
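The score combination and ranking step can be sketched as follows. The text above does not state how the two scores are combined, so the weighted linear combination, the weight `alpha`, and the min-max normalization are assumptions for illustration; the scores themselves are made up.

```python
def min_max(scores):
    """Scale scores to [0, 1]; a constant score map collapses to zeros."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {name: 0.0 for name in scores}
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}

def combine(rvsm_scores, simi_scores, alpha=0.2):
    """Weighted combination of normalized rVSM and SimiScore (alpha is assumed)."""
    n_rvsm, n_simi = min_max(rvsm_scores), min_max(simi_scores)
    return {
        name: (1 - alpha) * n_rvsm[name] + alpha * n_simi.get(name, 0.0)
        for name in rvsm_scores
    }

def top_k(scores, k):
    """Return the k files with the highest combined score."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Made-up scores for three files.
rvsm = {"a.py": 0.9, "b.py": 0.5, "c.py": 0.1}
simi = {"a.py": 0.0, "b.py": 1.0, "c.py": 0.0}
final = combine(rvsm, simi)
print(top_k(final, 2))
```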
BLUiR [12] uses AST parsing to extract four items (class, method, variable, and comment) from each source code document. It also captures two fields (summary and description) from each bug report. Then, a total of eight separate similarities are calculated between these two sets using the BM25 algorithm [12]. These scores are then summed into a suspiciousness score to rank buggy files against a given bug report. BLUiR is the first technique that leverages structured elements from both source code and bug reports to localize bugs using IR. We thus choose it as another baseline technique for our study.
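The field-wise scoring can be sketched as below. To keep the example self-contained, a plain token-overlap count replaces BM25, so the absolute scores differ from what BLUiR would produce; the dictionaries and file names are hypothetical.

```python
from itertools import product

CODE_FIELDS = ("class", "method", "variable", "comment")
REPORT_FIELDS = ("summary", "description")

def overlap_score(query_tokens, doc_tokens):
    """Stand-in for BM25: count query tokens that occur in the code field."""
    doc_set = set(doc_tokens)
    return sum(1 for tok in query_tokens if tok in doc_set)

def suspiciousness(report, source_file, score=overlap_score):
    """Sum the scores of all report-field x code-field pairs (2 x 4 = 8)."""
    return sum(
        score(report[r_field], source_file[c_field])
        for r_field, c_field in product(REPORT_FIELDS, CODE_FIELDS)
    )

# Hypothetical, already tokenized bug report and source files.
report = {"summary": ["tokenizer", "bug"],
          "description": ["tokenizer", "fail", "vocab"]}
files = {
    "tokenizer.py": {"class": ["tokenizer"], "method": ["encode", "load", "vocab"],
                     "variable": ["vocab"], "comment": ["split", "text"]},
    "trainer.py": {"class": ["trainer"], "method": ["fit"],
                   "variable": ["loss"], "comment": ["train", "model"]},
}
ranking = sorted(files, key=lambda name: suspiciousness(report, files[name]),
                 reverse=True)
print(ranking)
```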
BLIA [23] integrates several items such as textual similarity between bug reports and source documents [11], code structures [12], version control history [14], stack trace analysis [48], and code change analysis in the IR-based bug localization. While bug reports and source code are useful, code change history can also assist in bug localization by providing the changes likely to induce a bug. BLIA has outperformed several previous techniques: BugLocator [11], BLUiR [12], Amalgam [14], BRTracer [48], which makes it suitable as the third baseline technique for our study.
IR-based localization can be adapted to different granularity levels (e.g., method, file). We chose file-level granularity since each of the selected baselines frequently used this granularity.
DNNLOC [17] uses a hybrid method that incorporates both rVSM [11] from IR and Deep Neural Networks (DNNs) from DL. DNNs establish associations between specific terms in bug reports and the corresponding code tokens and terms in the source files. DNNs alone do not achieve high accuracy because they lose information during the dimensionality reduction in the projection process, but the integration with rVSM enhances their capability to correlate bug reports with relevant buggy files. Buggy source documents may not share textual similarities with bug reports, and IR-based techniques struggle in such cases. Thus, to address the challenges of IR-based techniques (e.g., lexical mismatch problem [25]) and to include non-linearity in relevance estimation, we chose DNNLOC as the final baseline for our study.
Since the original authors' replication packages were unavailable, we used the publicly available versions to replicate BugLocator [49], BLUiR [49], and DNNLOC [50]. We also carefully adapted BLIA from the original replication package [23] to our datasets. Further experimental details are available in our replication package [46].
3.3 Performance Evaluation
We use three performance metrics for our study — Top-K accuracy (Top@K), Mean Average Precision (MAP), and Mean Reciprocal Rank (MRR). These metrics have been frequently used by the relevant literature [11, 12, 14, 23, 17, 48].
3.3.1 Top@K
Top-K accuracy (Top@K) measures the percentage of bug reports for each of which at least one of the buggy files was present in the top-k retrieved files. We have used K= 1, 5, 10 for this study.
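As a minimal sketch, Top@K can be computed as follows. The function name, the data layout (one ranked list and one set of buggy files per bug report), and the sample data are assumptions for illustration.

```python
def top_at_k(ranked_lists, ground_truth, k):
    """Fraction of bug reports with at least one buggy file in the top k."""
    hits = sum(
        1
        for bug_id, ranked in ranked_lists.items()
        if set(ranked[:k]) & ground_truth[bug_id]
    )
    return hits / len(ranked_lists)

# Made-up rankings and ground truth for two bug reports.
rankings = {"bug-1": ["a.py", "b.py", "c.py"], "bug-2": ["x.py", "y.py", "z.py"]}
truth = {"bug-1": {"b.py"}, "bug-2": {"q.py"}}
for k in (1, 5, 10):
    print(f"Top@{k} = {top_at_k(rankings, truth, k):.2f}")
```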
3.3.2 Mean Average Precision
Precision@K measures the precision of each buggy source document’s occurrence within a ranked list. Average Precision@K (AP) computes the average precision for all buggy documents within the ranked list against a search query (a.k.a., bug report). Mean Average Precision (MAP) is the average AP@K value across all queries in a system.
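\[ AP = \frac{\sum_{k=1}^{N} P(k) \times rel(k)}{\sum_{k=1}^{N} rel(k)} \tag{1} \]
\[ MAP = \frac{1}{|Q|} \sum_{q \in Q} AP(q) \tag{2} \]
Here, $AP$ represents the Average Precision, and $N$ refers to the number of total results for a query. $k$ represents the position in the ranked list, $P(k)$ denotes the precision calculated at the $k$-th position, and $rel(k)$ determines whether the $k$-th result in the ranked list is buggy (1) or not (0). $Q$ is the set of queries, and $AP(q)$ is the Average Precision for a query $q$.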
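A minimal sketch of AP and MAP is given below. It averages the precision over the buggy documents that appear in the ranked list, which is one common normalization and an assumption here; the function names and sample data are also illustrative.

```python
def average_precision(ranked, buggy):
    """Average of the precision values at each position holding a buggy file."""
    hits, precision_sum = 0, 0.0
    for position, doc in enumerate(ranked, start=1):
        if doc in buggy:
            hits += 1
            precision_sum += hits / position  # precision at this position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists, ground_truth):
    """Mean of the AP values over all queries (bug reports)."""
    aps = [average_precision(ranked, ground_truth[bug_id])
           for bug_id, ranked in ranked_lists.items()]
    return sum(aps) / len(aps)

# Made-up rankings: buggy files at ranks 1 and 3 for bug-1, rank 2 for bug-2.
rankings = {"bug-1": ["a", "b", "c", "d"], "bug-2": ["x", "y"]}
truth = {"bug-1": {"a", "c"}, "bug-2": {"y"}}
print(mean_average_precision(rankings, truth))
```

For bug-1, the precision is 1/1 at rank 1 and 2/3 at rank 3, so AP is their mean; for bug-2, AP is 1/2.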
3.3.3 Mean Reciprocal Rank
Mean reciprocal rank (MRR) calculates the average of the reciprocal ranks for a set of queries.
\[ MRR(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{rank(q)} \tag{3} \]
where $MRR(Q)$ represents the Mean Reciprocal Rank for a set of queries $Q$, $|Q|$ represents the total number of queries in the set $Q$, $q$ represents each query in the set $Q$, and $rank(q)$ represents the rank of the first correctly retrieved buggy document for the query $q$.
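MRR can be sketched in a few lines. Assigning a reciprocal rank of zero to a query whose buggy files are never retrieved is an assumption of this sketch, as are the names and the sample data.

```python
def reciprocal_rank(ranked, buggy):
    """1 / rank of the first buggy file, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranked, start=1):
        if doc in buggy:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(ranked_lists, ground_truth):
    """Average reciprocal rank over all queries (bug reports)."""
    rrs = [reciprocal_rank(ranked, ground_truth[bug_id])
           for bug_id, ranked in ranked_lists.items()]
    return sum(rrs) / len(rrs)

# Made-up rankings: first buggy file at rank 2 and rank 1, respectively.
rankings = {"bug-1": ["a.py", "b.py", "c.py"], "bug-2": ["x.py", "y.py"]}
truth = {"bug-1": {"b.py", "c.py"}, "bug-2": {"x.py"}}
print(mean_reciprocal_rank(rankings, truth))
```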
4 Study Findings
4.1 Answering RQ1: How effective are the existing approaches in localizing bugs from deep learning systems?
Comparison of the localization performance: Table 4 compares the performance of our baseline techniques in bug localization between deep learning systems and non-deep learning systems (a.k.a traditional software systems). We used three different evaluation metrics – Top@k, MRR, and MAP, for our comparative analysis.
We see notable differences in performance between the two types of systems when evaluating BugLocator, BLUiR, BLIA, and DNNLOC. In particular, for BugLocator, the differences in MAP and MRR are 31.59% and 29.46%, respectively. For BLUiR, the differences in MAP and MRR are also substantial: 33.25% and 30.24%, respectively. For BLIA, the differences in MAP and MRR are 34.14% and 30.77%, respectively. Finally, DNNLOC achieves higher scores than the other techniques on both types of systems, but the gap remains large, as the differences in MAP and MRR are 31.43% and 26.25%, respectively. We calculated the performance difference using the PerformanceDiff metric from Wattanakriengkrai et al. [51]. Fig. 2 visualizes the MAP measures, and the differences are clearly visible. Overall, the results show that all four approaches perform worse when localizing bugs in deep learning systems, and the trend is consistent across all three metrics.
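The reported MAP gaps can be reproduced from the values in Table 4 with a simple relative difference. The formula below is an assumption (the definition of PerformanceDiff from [51] is not reproduced in this article), but it matches the reported MAP differences.

```python
def performance_diff(ndl_score, dl_score):
    """Relative drop (in %) of the DL score with respect to the non-DL score."""
    return (ndl_score - dl_score) / ndl_score * 100

# MAP values (DLSW, NDLSW) as reported in Table 4.
map_scores = {
    "BugLocator": (0.314, 0.459),
    "BLUiR": (0.257, 0.385),
    "BLIA": (0.355, 0.539),
    "DNNLOC": (0.408, 0.595),
}
for method, (dl, ndl) in map_scores.items():
    print(f"{method}: {performance_diff(ndl, dl):.2f}% lower MAP on DL systems")
```

Because Table 4 lists values rounded to three decimals, the same computation on the MRR column can deviate from the reported percentages in the last digit.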
We also perform statistical tests to determine the significance of the performance gap between the two types of systems – deep learning systems and non-deep learning systems (Table 5). We took the Reciprocal Rank (RR) and Average Precision (AP) results of all the samples for each of the four approaches. Then, we performed the Shapiro-Wilk normality test [52], which reported a normal distribution for those metrics. Then, we used appropriate significance and effect size tests to compare the result values from the two types of systems. Since the data were normally distributed, we used the t-test as the parametric test [33]. In all significance tests, the p-values were less than the threshold (0.05) for each of the four approaches. Thus, the null hypothesis can be rejected for all comparisons. In other words, the performances of all baseline techniques significantly differ between the two types of systems.
Method | Top@1 | Top@5 | Top@10 | MRR | MAP |
DLSW | |||||
BugLocator | 0.344 | 0.547 | 0.615 | 0.371 | 0.314 |
BLUiR | 0.201 | 0.472 | 0.585 | 0.316 | 0.257 |
BLIA | 0.411 | 0.609 | 0.719 | 0.423 | 0.355 |
DNNLOC | 0.468 | 0.682 | 0.786 | 0.455 | 0.408 |
NDLSW | |||||
BugLocator | 0.419 | 0.671 | 0.794 | 0.526 | 0.459 |
BLUiR | 0.311 | 0.575 | 0.686 | 0.453 | 0.385 |
BLIA | 0.512 | 0.716 | 0.820 | 0.611 | 0.539 |
DNNLOC | 0.617 | 0.786 | 0.855 | 0.617 | 0.595 |
DLSW= Deep Learning Systems, NDLSW=Non-Deep Learning Systems
While the significance of a result indicates how probable it is that it is due to chance, the effect size indicates the extent of the difference [53]. Hence, we performed the Cohen's D effect size test [34], and our analysis found a medium to large effect size for all cases (Table 5). Thus, our results from effect size tests reinforce the above finding from significance tests. In other words, the existing techniques perform significantly worse in localizing bugs from deep-learning software systems. Even though our findings above match natural intuition, we performed extensive experiments using four different baselines, which resulted in strong empirical evidence. Thus, our findings not only reinforce the existing understanding and belief about the challenges of the bugs in deep learning systems but also substantiate them with solid empirical evidence and demonstrate the performance gap of existing solutions in localizing the two categories of bugs.
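For reference, Cohen's d with a pooled standard deviation can be sketched as follows. The per-bug reciprocal rank values are made up for illustration and are not taken from our experiments; the function name is likewise an assumption.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(sample_a, sample_b):
    """Cohen's d using the pooled sample standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    return (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var)

# Made-up reciprocal ranks for a handful of bugs from each type of system.
rr_ndl = [1.0, 0.5, 1.0, 0.33, 0.5, 1.0]
rr_dl = [0.5, 0.25, 1.0, 0.2, 0.33, 0.5]
print(f"Cohen's d = {cohens_d(rr_ndl, rr_dl):.2f}")
```

By Cohen's conventional thresholds, values around 0.2, 0.5, and 0.8 are read as small, medium, and large effects, respectively.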
Comparison among the categories of deep learning systems: Our dataset, Denchmark, consists of deep learning systems from various classes, including frameworks, libraries, and tools. We focus on examining if there are any differences in bug localization performance across these system classes. Thus, we employed four existing techniques (BugLocator, BLUiR, BLIA, DNNLOC) to evaluate their bug localization performance for each class.
Framework Bug (Bug ID: 10224) |
Title |
Language model example cannot be run |
Description |
A deep learning framework is a software platform that provides the environment for designing, training, and deploying deep learning models [55]. Examples include TensorFlow (https://www.tensorflow.org/), PyTorch (https://pytorch.org/), and Apache MXNet (https://mxnet.apache.org/versions/1.9.1/). These frameworks come with pre-defined modules and functions and offer a structured way to implement deep learning architectures using high-level programming interfaces [56]. The example bug in Table 6 [54] is characterized by the inability to run a language model without manually creating a data folder. It represents a framework bug because it directly impacts the core functionalities of the framework, specifically the design and training of models. Frameworks are expected to provide seamless, user-friendly environments for developing deep learning models, and issues that hinder the ease of use, such as documentation inaccuracies and additional manual setup steps, are indicative of problems at the framework level.
Library Bug (Bug ID: 313) |
Title |
A bug in GPT2Tokenizer |
Description |
A deep learning library is a collection of functions that facilitate specific tasks within deep learning. Libraries like Keras (https://keras.io/) and cuDNN (https://developer.nvidia.com/blog/tag/cudnn/) can either be integrated into frameworks or operate independently [57]. The bug in Table 7 [39], which affects the ‘GPT2Tokenizer’ of the texar-pytorch project, is classified as a library bug due to its focus on a specific component and the nature of its functionality. The tokenizer uses the TokenizerBase class from the texar library, which is designed to work with the PyTorch framework. The tokenizer’s failure to accurately process text data leads to the bug, which indicates a library issue.
Tool Bug (Bug ID: 5596) |
Title |
Tensorboard can not load all Hyperparameters keys |
Description |
Finally, deep learning tools refer to utilities that assist with the tasks related to deep learning, such as visualization or model optimization. An example would be TensorBoard (https://www.tensorflow.org/tensorboard), which is often used for TensorFlow visualization [59]. Frameworks, libraries, and tools each play unique but complementary roles in the context of deep learning. The bug in Table 8 [58] qualifies as a tool bug since TensorBoard is a visualization toolkit in the TensorFlow ecosystem. The bug refers to the inability of TensorBoard to load all hyperparameter keys when writer.add_hparams is used with varying hparam_dict parameters, directly impacting its core functionality as a tool. This feature is essential for monitoring and contrasting various experimental settings in TensorBoard.
Project Class | Method | MRR | MAP |
Framework | BugLocator | 0.404 | 0.334 |
Framework | BLUiR | 0.205 | 0.155 |
Framework | BLIA | 0.466 | 0.392 |
Framework | DNNLOC | 0.493 | 0.412 |
Library | BugLocator | 0.559 | 0.453 |
Library | BLUiR | 0.632 | 0.539 |
Library | BLIA | 0.583 | 0.486 |
Library | DNNLOC | 0.656 | 0.597 |
Tool | BugLocator | 0.499 | 0.407 |
Tool | BLUiR | 0.469 | 0.387 |
Tool | BLIA | 0.546 | 0.449 |
Tool | DNNLOC | 0.592 | 0.514 |
Library Bug (Bug ID: 87085) |
Title |
gradcheck failure with sparse matrix multiplication |
Description |
According to our experimental results in Table 9, existing techniques perform better in localizing bugs from deep learning libraries than in localizing bugs from frameworks and tools. This could be attributed to the modular design of libraries, aimed at handling specific tasks independently [57]. In the example bug report (Table 10 [60]), the autograd module of the Torch.Library (https://pytorch.org/docs/stable/library.html) plays a crucial role in calculating gradients across all tensor operations, whereas autograd.gradcheck (https://pytorch.org/docs/stable/generated/torch.autograd.gradcheck.html) is a utility function for verifying the accuracy of these computed gradients. Here, a specific error concerning gradient computation in gradcheck was reported during sparse matrix multiplication. Such problems can be localized more efficiently when they occur within a clearly defined module like autograd.gradcheck. In this context, DNNLOC identified the buggy file at the Top@1 level, while BLUiR located the correct file at Top@3. BLUiR’s performance suggests that the modular architecture of libraries could facilitate bug localization based on structural analysis. On the other hand, DNNLOC’s success in accurately locating bugs can be linked to its ability to capture complex patterns using deep learning, which works effectively with the modular design of libraries.
Framework Bug (Bug ID: 61297) |
Title |
CTC Loss errors on TPU |
Description |
In contrast, we notice from Fig. 3 that the performance of bug localization techniques dropped for framework bugs, especially when using the BLUiR method. Frameworks consist of a broad architecture that guides the flow of control, involving multiple layers and components that interact in complex ways [55, 10]. The example bug in Table 11 [61], which deals with CTC Loss in a Keras model with LSTM on a Tensor Processing Unit (TPU), highlights a framework-level issue. For this example bug, DNNLOC retrieved the correct buggy file at the 38th position. Despite DNNLOC’s strength in handling complex relationships, this low rank demonstrates a shortcoming and suggests the need to better address the complexities within deep learning frameworks.
BugLocator retrieved the correct buggy file at the 65th position, indicating its limitations in going beyond textual similarities between framework code and bug reports. Finally, BLUiR emerges as the least effective, ranking the correct file at the 93rd position, which highlights its insufficiency in localizing framework-level bugs. BLUiR may struggle to match the bug report keywords to the correct code segment since the framework’s vast codebase introduces a high degree of variance, making bug localization challenging.
Tool Bug (Bug ID: 5948) |
Title |
TB.dev HParams dashboard shows floating point metrics incorrectly |
Description |
Lastly, the performance in bug localization slightly improved for tools, but it was not as good as with libraries. This improvement could be attributed to the fact that tools (e.g., TensorBoard) are more independent and have fewer dependencies than an entire framework (e.g., TensorFlow). While tools like TensorBoard focus on specialized tasks like visualization, they often integrate with complex frameworks like TensorFlow. The example bug in Table 12 refers to an incorrect rendering of floating point metrics in the TensorBoard HParams dashboard. It can be classified as a tool bug, as TensorBoard is a visualization utility within the TensorFlow ecosystem. In this instance, DNNLOC identified the correct buggy file at the 27th position. BugLocator located the correct file at the 40th position, which highlights its limitations in localizing tool bugs that warrant more than textual analysis. Meanwhile, BLUiR identified the correct file at the 52nd position, reflecting its struggle to effectively match textual descriptions in bug reports with the specific code segments responsible for visual aspects.
In short, our findings indicate that the modular design of libraries [59] facilitates better bug localization, whereas the complexities inherent in frameworks and tools hurt the bug localization performance of the existing techniques.
Summary of RQ: We evaluate the performance of four existing techniques in localizing bugs from deep learning systems and non-deep learning systems using three evaluation metrics. Our findings show that all four approaches perform significantly worse (e.g., 34.14% less MAP for DNNLOC) when localizing bugs from deep learning systems. We also compare their performance when localizing bugs from deep learning frameworks, libraries, and tools. We found that localizing bugs from frameworks is the most challenging due to the complex interaction of their components.
4.2 Answering RQ: How do different types of bugs in deep learning systems impact bug localization?
In this research question, we investigate the characteristics and localization challenges of different types of bugs in deep learning systems through manual analysis. First, we employ stratified random sampling to construct a random sample that represents a balanced presence of instances from each bug type [63]. We select 385 bugs from each of the two subsets of our datasets (BugGL and Denchmark, Table 3), a sample size that corresponds to a 95% confidence level and a 5% margin of error.
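The sample size of 385 is consistent with the widely used formula for estimating a proportion, n = z²·p·(1 − p) / e², evaluated at a 95% confidence level, a 5% margin of error, and the most conservative proportion p = 0.5. The sketch below is a minimal illustration under the assumption that this formula (without a finite-population correction) was the one applied; the text does not state the formula explicitly, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size(confidence: float = 0.95, margin: float = 0.05, proportion: float = 0.5) -> int:
    """Minimum sample size for estimating a proportion (no finite-population correction)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return math.ceil(z * z * proportion * (1.0 - proportion) / (margin * margin))


print(sample_size())  # sample size for 95% confidence and 5% margin of error
```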
We performed our manual analysis using 385 bug reports from the Denchmark dataset and classified the bugs at two levels. First, two annotators manually labeled the bugs from the deep learning systems as deep learning-related (DL) bugs or non-deep learning-related (NDL) bugs. Second, we labeled the DL bugs into five categories (Model, Training, Tensor, API, and GPU) based on the existing taxonomy of Humbatova et al. [30]. As a part of the labeling, we also analyzed the bug reports, the associated developers’ discussions, and the bug-fix code changes. Two authors of this work labeled the sample dataset separately and achieved a Cohen’s kappa [64] of 0.80, which indicates a substantial agreement between the authors. Our manual analysis was documented in an Excel sheet, which is provided in our replication package [46]. Each author spent a total of 55 hours on this analysis.
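Cohen's kappa corrects the raw agreement between two annotators for the agreement expected by chance: kappa = (p_o − p_e) / (1 − p_e). The following sketch shows the computation on a small set of hypothetical labels; the label sequences are invented for illustration and do not come from the study's annotation sheet.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for eight bug reports.
ANNOTATOR_1 = ["Training", "Model", "API", "GPU", "Tensor", "Training", "NDL", "NDL"]
ANNOTATOR_2 = ["Training", "Model", "API", "GPU", "Tensor", "Model", "NDL", "NDL"]
print(round(cohens_kappa(ANNOTATOR_1, ANNOTATOR_2), 3))
```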
Prevalence ratio of deep learning-related bugs: We found that 64.80% of the bugs from deep learning systems are related to deep learning algorithms (a.k.a DL bugs). That is, they are related to inputs, data, or training of deep learning models, underlying API endpoints, and computational resources. In particular, we found 27.30% training bugs, 13.30% model bugs, 5.50% tensor bugs, 14.30% API bugs, and 4.40% GPU bugs (Fig. 4). Such a distribution informs us where the debugging efforts should be concentrated. Our findings also indicate that the majority of DL bugs are related to the training process. Training is a crucial step in deep learning that involves large amounts of data, complex learning algorithms, and optimization techniques, making it more susceptible to bugs and failures.
Prevalence ratio of non-deep learning-related bugs: We found that 35.20% of the bugs from deep learning systems are not related to deep learning algorithms (a.k.a NDL bugs). These bugs do not directly affect the functionality of the deep learning model, but they still lead to unexpected, erroneous behaviors in a software system. Bug 1426 in Table 1 is an NDL bug, which occurs when the tests from the CI pipeline are spread across multiple Windows machines. Although it is not directly connected to the deep learning module, it originated from the PyTorch-Ignite project, which is indeed a deep learning system.
Localization of bugs in deep learning systems: To gain a deeper understanding of the challenges in localizing deep learning bugs, we further analyze our results from RQ. To do this, we randomly sample 100 bugs from each category and analyze the performance of our baselines. Selecting an equal number of bugs for each type of DL bug, using the principle of randomization [63], ensures a fair performance comparison of the bug localization techniques and avoids bias. We repeated this process three times with a different sample each time to ensure the robustness of our findings. We then averaged the results from the three evaluations and presented them in Table 13. We conducted an in-depth analysis of each type of DL bug to understand their inherent challenges and how they impact the overall performance of the bug localization techniques. Our analysis with examples is provided as follows.
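The sampling procedure described above (equal-sized random samples per bug type, three repetitions, averaged results) can be sketched as follows. This is an illustrative outline only: the per-bug scores are randomly generated placeholders, the function name is hypothetical, and the independently drawn samples may overlap, whereas the study does not specify how its three samples were drawn.

```python
import random
from statistics import mean
from typing import Dict, List


def averaged_score(
    scores_by_type: Dict[str, List[float]],
    sample_size: int = 100,
    repeats: int = 3,
    seed: int = 42,
) -> Dict[str, float]:
    """Average a per-bug score over several equal-sized random samples per bug type."""
    rng = random.Random(seed)  # Fixed seed so that the sketch is reproducible.
    averaged = {}
    for bug_type, scores in scores_by_type.items():
        # Each run draws a fresh random sample (without replacement) of the same size.
        runs = [mean(rng.sample(scores, sample_size)) for _ in range(repeats)]
        averaged[bug_type] = mean(runs)
    return averaged


# Placeholder per-bug average-precision values (randomly generated, not real results).
_rng = random.Random(0)
SCORES = {
    "Model": [_rng.random() for _ in range(300)],
    "Training": [_rng.random() for _ in range(500)],
}
print(averaged_score(SCORES))
```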
Model Bug (Bug ID: 313) |
Title |
A bug in GPT2Tokenizer |
Description |
Model bugs: From Table 13, we see that DNNLOC performs the best for model bugs, which could be attributed to its ability to capture complex patterns using DNN and textual relevance from the bug reports and source code using rVSM. Model bugs are often connected to a model’s type, properties, and layers. According to our observation, the texts describing model-related issues in bug reports have significant vocabulary overlap with the implementation of a deep learning model.
Table 14 [39] shows an example bug report that discusses a model bug from the CASL.ai project. Fig. 8 [39] (Appendix A) shows the code snippet responsible for the bug. The bug lies in the _bpe method of the GPT2Tokenizer, where incorrect character merging during byte pair encoding leads to faulty tokenization. Since this impacts the functionality of the GPT2 language model, it can be considered a model bug.
BugLocator incorrectly retrieves ‘SentencePieceTokenizer.py’ (Fig. 9 in Appendix A) as the Top@1 result. It relies on lexical overlap between bug reports and source code. We found that four main keywords from the bug report (Table 14), namely GPT2Tokenizer, recover, seq2seq, and model, overlap with this incorrect file. This lexical similarity could have led to the incorrect localization. Interestingly, BugLocator retrieved the ground truth file (Fig. 8 in Appendix A) at the 8th position. BLIA retrieves the same ground truth file at the 17th position, which is less than ideal.
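To illustrate why purely lexical matching can mislead, the sketch below ranks two files by the cosine similarity of raw term counts. This is a deliberate simplification and not BugLocator's actual rVSM scoring. The report text, the file names, and the file contents are hypothetical and merely mimic the situation described above, where an unrelated file shares more keywords with the report than the buggy file does.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> Counter:
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_files(report: str, files: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank files by lexical similarity to the bug report (highest first)."""
    query = tokenize(report)
    scored = [(name, cosine(query, tokenize(text))) for name, text in files.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical report text and file contents (illustrative only).
REPORT = "A bug in GPT2Tokenizer: cannot recover 'BART is a seq2seq model.'"
FILES = {
    "unrelated_tokenizer.py": "GPT2Tokenizer recover seq2seq model vocab pieces",
    "buggy_tokenizer.py": "GPT2Tokenizer bpe merge pairs word cache byte encoder",
}

RANKING = rank_files(REPORT, FILES)
for name, score in RANKING:
    print(f"{name}: {score:.3f}")
```

In this toy setting, the unrelated file is ranked first because it shares four report keywords, while the buggy file shares only one, even though the fault lies in its merging logic.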
On the other hand, the BLUiR approach retrieves the ground truth code at a much lower position (131st). Thus, the similarity between bug reports and source code elements might not be sufficient for locating model bugs. In the example above (Table 14), the bug is triggered when tokenizing the following sentence: ‘BART is a seq2seq model.’ BLUiR incorrectly retrieved the source code file with the Seq2Seq class at the Top@1 position (Fig. 10 in Appendix A). One possible explanation could be that several important keywords from the bug report (e.g., ‘seq2seq’, ‘encode’, and ‘model’) align with similar code elements (e.g., the Seq2Seq class) in the false-positive file. Such a misalignment occurs due to BLUiR’s heavy reliance on structural elements (e.g., class, method, variable) and its limited attention to the semantic aspect of the bug report.
Finally, the DNNLOC approach retrieves the buggy file correctly at the Top@1 position for the model bug (Table 14). Unlike IR-based methods, DNNLOC’s neural networks can capture deeper semantic links that go beyond lexical similarities. For instance, it has a better chance to semantically connect terms such as ‘seq2seq’ and ‘tokenizer’ in the bug report to their corresponding code-level implementations in GPT2Tokenizer despite their textual mismatch. Thus, DNNLOC’s capability to leverage non-linear relationships might have helped it localize model bugs better in deep learning systems.
Training Bug (Bug ID: 3048) |
Title |
Gradient Accumulation + Mixed Precision shows artificially high training loss |
Description |
EB=Expected Behaviour, S2R=Steps to Reproduce, OB=Observed Behaviour |
Training bugs: All four approaches perform poorly in localizing training bugs, as shown in Table 13. DNNLOC and BLUiR perform comparatively better than the other techniques. Table 15 [40] shows a training bug in the fast.ai project, where the combined usage of Gradient Accumulation and the MixedPrecision Callback (Fig. 11 in Appendix A) leads to an improperly scaled and artificially high training loss value. DNNLOC retrieved the ground truth file at the 4th position for this bug, whereas BLUiR retrieved it at the 7th position. Although DNNLOC and BLUiR demonstrate comparable performance in locating training bugs, they perform poorly overall (e.g., MAP of 40.3% to 41.1%). Moreover, BugLocator performs the worst in localizing training bugs (e.g., 23rd position for the example bug in Table 15; Fig. 12 in Appendix A), which suggests that similarity analysis between bug reports and source code might not be sufficient for identifying these bugs. Training bugs, which occur during the model’s training phase and might involve aspects such as an improperly defined loss function, are more conceptual. These bugs might not be easily located using the existing methods and might require techniques that offer a deeper insight into the model’s architecture and training process.
Tensor Bug (Bug ID: 13760) |
Title |
nd.slice does not return empty tensor when begin=end |
Description |
OB: For mxnet.ndarray.slice(data, begin, end), |
if begin=end, it does not return an empty tensor. |
Instead, it returns a tensor with the same shape as the data. |
Environment info: |
…… |
…… |
…… |
S2R: import mxnet.ndarray as nd |
a = nd.normal(shape=(4, 3)) |
nd.slice(a, begin=0, end=0) |
nd.slice(a, begin=2, end=2) |
Detailed BR: https://github.com/apache/mxnet/issues/13760 |
BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour |
Tensor bugs: Tensors are central to deep learning and often involve intricate dimensional and mathematical issues. From Fig. 5, we notice that DNNLOC and BLUiR are more effective in localizing tensor bugs than other techniques (e.g., BugLocator, BLIA). In the example bug from Table 16 [41], the bug report described the issue with the ‘nd.slice’ function in MXNet, which should return an empty tensor when the ‘begin’ and ‘end’ parameters are equal. Instead, the function returns a tensor with the same shape as the data. For this example tensor bug, both DNNLOC and BLUiR retrieved the correct buggy file at the top position. Interestingly, by parsing the AST of the code, BLUiR identified the relevant code snippet in the test_slice() function, which shares significant keyword overlap with the bug report. BLUiR determines the relevance of low-level code elements (e.g., class names, method names) against a bug report, which helps to reduce noise in the code segments where tensors could be handled or manipulated.
On the other hand, BLIA retrieved the ground truth file at the 12th position. Our analysis revealed that incorporating the stack trace information negatively impacted its bug localization performance. Tensor bugs are typically related to data manipulation and computations [30] rather than the code execution flow or call stack [65] captured in stack traces. By excluding the stack trace from the BLIA approach, we were able to improve the ranking of the buggy file from the 12th position to the 9th position.
Lastly, BugLocator’s difficulty in accurately locating tensor bugs is demonstrated by its retrieval of the ground truth file at the 47th position (Fig. 13 in Appendix A) and of an incorrect file at Top@1 (Fig. 14 in Appendix A). We found that BugLocator’s heavy reliance on textual similarity and the overlap of trivial words (e.g., environment, system, and hardware) with the incorrect file led to the incorrect ranking.
API Bug (Bug ID: 13862) |
Title |
Description |
We have a use case for this in Sockeye. |
Environment info (Required): |
… |
S2R: Input data taken from Sockeye unit tests. |
x = mx.nd.array([335, 620, 593, 219, 36], dtype='int32') |
mx.nd.unravel_index(x, shape=(-1, 200)) |
With mxnet==1.5.0b20190111, the result is incorrect: |
With mxnet==1.3.1, the result is correct: |
Detailed BR: https://github.com/apache/mxnet/issues/13862 |
BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour |
API bugs: From Fig. 5, we observe that DNNLOC and BugLocator perform well in localizing API bugs, whereas BLUiR is the least effective. This could be due to BLUiR’s heavy reliance on structural information from the source code, which might not capture the specifics of API bugs (e.g., incorrect API calls). Another factor could be the rapid evolution of deep learning APIs, which affects versioning and compatibility [66]. Due to frequent structural changes in the code, BLUiR might be less effective in assessing the relevance between code and bug reports.
The bug from Table 17 [42] involves the unravel_index function in MXNet, whose behavior incorrectly varies with certain input parameters across different versions (Fig. 15 in Appendix A). DNNLOC located the correct buggy code for the example bug (Table 17) at the 2nd position, while BugLocator ranked it at the 5th position. DNNLOC might be effective for this API bug because textual similarity (through rVSM) matches API-specific terms in bug reports with source code, while neural networks (through DNN) can capture complex, non-obvious relationships in the API’s usage and functionality. Moreover, our findings indicate that relying on either element (rVSM, DNN) in isolation is less effective, resulting in a noticeable drop in performance.
On the other hand, BLUiR retrieved the buggy file at the 38th position. It incorrectly retrieved the file containing the NDArray class. In the case of API bugs, the class file may not always be relevant. This is because API bugs frequently arise at the interface between different layers of abstraction [67]. Such bugs could be attributed to the interaction of a higher-level function (e.g., unravel_index from Table 17) with lower-level components rather than an issue within the code of the function itself. Meanwhile, BLIA initially placed the buggy file at the 7th position without the stack traces but improved to the 5th position when stack trace information was included. It indicates that stack trace information, which highlights execution flow and function calls, is beneficial for locating API bugs.
GPU Bug (Bug ID: 1238) |
Title |
How to use multiple GPUs? |
Description |
S2R: def make_model(ckpt_path, max_try = 1): |
…… |
run_search(checkpoint, max_try = 3) |
BR=Bug Report, S2R=Steps to Reproduce, OB=Observed Behaviour |
GPU bugs: Our investigation reveals that GPU bugs are the most difficult to localize for all four existing approaches. One possible explanation could be the complex nature of GPU bugs, as they can be triggered by a variety of factors, such as the compatibility between hardware (e.g., GPU device) and software (e.g., PyTorch). The cause of such a bug might not even be located in the source code (a.k.a. extrinsic GPU bug) [30]. From Table 20, we notice that only 17.65% of the GPU bugs can be found in the source code (a.k.a. intrinsic GPU bug). Examples of intrinsic GPU bugs that we found include a wrong reference to a GPU device, failed parallelism, incorrect state sharing between subprocesses, and faulty data transfer to a GPU device.
Table 18 [43] shows a GPU bug triggered by the codebase (a.k.a intrinsic). The bug is connected to the use of multiple GPUs during training. According to the report, the machine contains multiple GPU devices, but only one GPU is used during computation. All four bug localization techniques performed poorly in locating the actual buggy code for this GPU bug. BLUiR retrieved the buggy file at the 50th position, followed by BugLocator at the 65th, BLIA at the 47th, and DNNLOC at the 43rd position in their respective rankings.
We observed almost no keyword overlap and no structural similarity between the bug report (Table 18) and the actual buggy code (Fig. 16 in Appendix A). This lack of both textual and code-wise similarity presents a significant challenge for techniques such as BugLocator, which rely on such similarity to identify the buggy code.
Additionally, BLUiR’s analysis of smaller code segment similarity also proves ineffective for GPU bugs, as it solely concentrates on source code and overlooks the hardware-software interactions and runtime specifics crucial for comprehending such bugs. Similarly, BLIA’s integration of stack trace information, commit history, and version control history fails to capture the unique aspects of GPU bugs. Moreover, despite DNNLOC’s capability to capture complex non-linear relationships through deep neural networks, it did not help to locate GPU bugs. One reason might be that these bugs often involve complex hardware-software dynamics that are not typically addressed by standard source code analysis or captured by non-linear mappings learned from the codebase.
NDL bugs: We found that 35.20% of bugs in deep learning systems are not directly related to deep learning, but they impact system behavior (e.g., failed CI build due to GPU compatibility issues). These bugs are known as Non-Deep Learning-related (NDL) bugs in deep learning systems. From Table 19, we find that the performance of existing techniques in localizing these bugs is also poor. DNNLOC outperforms the other techniques, whereas BLUiR performs the worst in locating NDL bugs. We observed that NDL bugs are less complex than their deep learning counterparts. However, they are more prone to be extrinsic than traditional bugs (48.15% from Table 20); we provide the details about extrinsic bugs in RQ. Since existing baseline techniques focus on code-level artifacts only, they might fall short in localizing these bugs from deep learning systems.
Overall, there exists a significant variation in the performance of existing approaches when localizing various types of deep learning bugs. Tensor and API bugs are the easiest, whereas GPU bugs are the most difficult to localize using the selected baseline techniques.
Bug report quality for bugs in deep learning systems: Our analysis showed that bug reports from deep learning systems contain more code snippets (83.11%) than traditional software systems (33.24%). Unfortunately, that does not help much in bug localization, as code snippets alone might not be sufficient. Deep learning bugs often involve intricate dependencies that extend beyond specific code components (e.g., training data bugs and GPU bugs). Complex bugs (e.g., gradient instability during training) warrant a deeper understanding of the model architecture, its dynamic behavior, and training processes, which the code snippets may not always capture.
Summary of RQ: We found that 64.80% of the bugs in deep learning systems (DLSW) are related to deep learning algorithms, whereas the remaining bugs are not related to deep learning. Our analysis shows that Tensor bugs and API bugs are easier to localize than model and training bugs. However, GPU bugs are the most difficult to localize for each of the four approaches. Thus, our results not only inform the distribution of DL bugs but also highlight their localization challenges through extensive experiments.
4.3 Answering RQ: What are the implications of extrinsic bugs in deep learning systems for bug localization?
Most of the traditional bug localization techniques rely on the similarity between bug reports and source code [11, 6, 12, 17, 14, 13, 23, 48]. However, if a bug is of extrinsic nature (e.g., originates from the operating system), simply relying on source code may not be effective for its localization.
To investigate the impact of extrinsic bugs in deep learning systems, we performed another manual analysis using the same sample datasets from RQ (385 bugs from DLSW and 385 bugs from NDLSW). We manually labeled them as extrinsic and intrinsic bugs based on the heuristics of Rodriguez-Perez et al. [36]. Two authors of the study analyzed the bug reports and associated discussions carefully, consulted the heuristics, and then labeled the sample dataset separately. They achieved a Cohen’s kappa [64] of 0.87, which indicates a substantial agreement between the authors. This manual analysis was documented using an Excel sheet, which can be found in our replication package [46]. Each author spent a total of 20 hours on the analysis.
Prevalence ratio of extrinsic & intrinsic bugs: We found 40.00% extrinsic bugs within a total of 385 bug reports from deep learning systems (Denchmark dataset). The notion of extrinsic bugs is relatively new, especially in the case of bugs from deep learning systems. For a better comparison, we also manually inspected 385 bugs from non-deep learning systems (BugGL dataset) and determined the prevalence ratio of extrinsic and intrinsic bugs. We found only 10.65% extrinsic bugs in non-deep learning systems. Thus, deep learning systems contain almost four times more extrinsic bugs (Fig. 6) than non-deep learning systems.
Prevalence ratio of extrinsic & intrinsic bugs from deep learning systems: We randomly selected 100 samples for each type of bug from deep learning systems (same subsets from RQ) and determined the prevalence ratio of extrinsic and intrinsic bugs for each type. Table 20 shows the results of our manual analysis for different bug categories in deep learning systems in terms of extrinsic and intrinsic bugs. We see that the prevalence ratios of deep learning-related extrinsic bugs range from 21.90% to 82.35%, whereas for non-deep learning-related bugs, the prevalence ratio is 48.15%. This suggests that the deep learning components of a software system might be more likely to trigger extrinsic bugs than non-deep learning components.
Category | Type | Extrinsic (%) | Intrinsic (%) |
NDL | - | 48.15 | 51.85 |
DL | Model | 35.29 | 64.71 |
DL | Training | 21.90 | 78.10 |
DL | Tensor | 38.10 | 61.90 |
DL | API | 38.19 | 61.81 |
DL | GPU | 82.35 | 17.65 |
Localization of extrinsic & intrinsic bugs from both systems: To further analyze the impact of extrinsic bugs on bug localization, we experimented with our baseline techniques from RQ on extrinsic and intrinsic bugs separately. We chose 100 random bugs to evaluate the performance of all four techniques in bug localization from each category of both benchmark datasets. We repeated the evaluation three times using three different random subsets and then calculated the average result for a fair comparison.
From Table 21, we notice that DNNLOC performs slightly better than other techniques in localizing extrinsic bugs for both systems. However, it shows a clear performance gap between extrinsic and intrinsic bugs. The technique is less effective with extrinsic bugs, particularly in DLSW. It suggests a shortcoming in handling external complexities despite being able to capture non-linear relationships between bug reports and source code. We notice that BugLocator’s localization performance for extrinsic bugs in DLSW is lower than that in NDLSW. BugLocator might not be able to locate the extrinsic bugs in deep learning systems due to its naive approach, i.e., considering code as regular texts. We also found that BLUiR shows a smaller performance gap between extrinsic and intrinsic bugs. It extracts different structured items (e.g., methods, classes) from the source code and bug reports. Thus, even if these items reside outside the current codebase and are invoked from an external library, they could be matched with relevant keywords from a bug report. Interestingly, BLIA performs slightly better for extrinsic bugs in DLSW than in NDLSW, which could be due to its diverse use of meta components (e.g., stack traces, version control history). Stack traces from deep learning systems (DLSW) often contain more intricate information and dependencies (e.g., complex neural network data flows, dependencies on specialized libraries, GPU-related synchronization issues) related to the bugs, unlike stack traces from non-deep learning systems, which might not provide the same level of detailed information [68, 69].
Overall, these results suggest that extrinsic bugs are hard to localize, whether related to deep learning or not. However, deep learning bugs with an extrinsic nature are the more difficult to localize. On the other hand, from Fig. 7, we note that the performance of all four approaches for intrinsic bugs in NDLSW is higher compared to the intrinsic bugs in DLSW, which supports the fact that the bugs related to deep learning algorithms (a.k.a DL bugs) from deep learning systems are more challenging to localize.
Correlation of extrinsic & deep learning-related bugs: To determine the potential correlation between the extrinsic bugs and the bugs in deep-learning systems, we performed a Chi-Square test to determine any significant association [70]. We conducted three iterations with different sample data to validate the Chi-Square test, averaging the results. We obtained a p-value of 1.79e-14. Such a low p-value (far below the conventional threshold of 0.05) indicates that the observed association is unlikely to be a product of random chance. Instead, it implies a strong dependency between external factors contributing to bugs and the specific bug patterns within deep learning systems. Our manual analysis also supports the hypothesis, showing a higher prevalence of extrinsic bugs in deep learning systems (DLSW) compared to non-deep learning systems (NDLSW) (Fig. 6). The prevalence ratio of extrinsic bugs varies from 21.90% to 82.35% across different types of deep-learning bugs, confirming a strong association between extrinsic factors and bugs from deep-learning systems. Our experiments also suggest that extrinsic bugs might have an underlying connection with deep-learning bugs (refer to RQ: Localization of extrinsic & intrinsic bugs from both systems), which contributes to the poor performance of the existing bug localization techniques.
Summary of RQ: We found that deep learning systems (DLSW) contain almost four times more extrinsic bugs than non-deep learning systems (NDLSW). The performance of our baseline bug localization techniques for extrinsic bugs is lower (e.g., 31.27% less MAP for DNNLOC) compared to that of intrinsic bugs. Our research also shows a strong connection between extrinsic bugs and bugs in deep learning systems.
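To make the Chi-Square test of independence used in this subsection concrete, the sketch below computes the statistic and p-value for a 2x2 contingency table (system type versus bug nature). The table layout, the absence of a continuity correction, and the counts are our assumptions for illustration; they are not the data or the exact test configuration of the study.

```python
import math


def chi_square_2x2(table):
    """Pearson Chi-Square test of independence for a 2x2 contingency table.

    table: [[a, b], [c, d]] of observed counts.
    Returns (chi2 statistic, p-value) for 1 degree of freedom,
    without Yates' continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts (rows: DLSW, NDLSW; columns: extrinsic, intrinsic).
observed = [[40, 60], [10, 90]]
chi2, p = chi_square_2x2(observed)
print(f"chi2={chi2:.2f}, p={p:.2e}")
```

A p-value below 0.05 would lead to rejecting the hypothesis that bug nature (extrinsic or intrinsic) is independent of system type (DLSW or NDLSW).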
4.4 Key findings
Based on our findings from the three research questions, we discuss the key factors that challenge bug localization in deep learning systems as follows:
- Extrinsic nature of bugs: A substantial number of bugs in deep learning systems are extrinsic, arising from external factors (e.g., hardware compatibility, OS issues). These bugs might not be found through source code analysis alone, making them inherently challenging to localize using traditional methods that rely on source code. We also found a statistically significant association between extrinsic bugs and bugs from deep learning systems.
- Textual representation limitations in dependencies: Our baseline techniques heavily rely on textual similarity between bug reports and source code, which is less effective for deep learning bugs. Such bugs often involve intricate dependencies (e.g., gradient vanishing in RNN) and require conceptual understanding (e.g., improperly defined loss function). These complex issues may not be expressible in text, which makes the existing baseline techniques less effective.
- Dynamic complexity in training and tensor operations: Training bugs and tensor bugs are challenging to localize due to their dynamic and complex nature. They often involve issues in the training process (e.g., misconfigured batch sizes) and in data manipulation (e.g., incorrect tensor reshaping), respectively, which are not adequately captured by existing methods for bug localization.
- Multifaceted extrinsic influences on GPU operations: GPU bugs are particularly challenging to localize because they can be caused by multiple factors, such as hardware-software interaction and compatibility issues, and may not even be present in the source code (extrinsic bugs).
- Inadequate information for API usage: Traditional techniques (e.g., BLUiR) struggle with locating API bugs in deep learning systems. Deep learning APIs differ from traditional APIs [66] since they handle complex data structures, specialized hardware dependencies, dynamic computation graphs, high-level abstractions for operations, and rapid evolution, which impacts versioning and compatibility. Traditional bug localization techniques might not be able to capture such intricate information adequately from deep learning APIs, which might hurt the localization process.
- Ineffectiveness of stack traces for tensor bugs: For tensor and input bugs, stack trace information is found to be less effective since these bugs are often related to data manipulation and computations rather than the execution flow captured by stack traces.
- Lack of details on model architecture and training process: Although bug reports in deep learning systems contain more code snippets, this does not significantly improve the performance of bug localization. Deep learning bugs require a deeper understanding of the model architecture and training process, which may not be detailed in the bug reports and the code snippets.
- Limited scope of non-linearity: While leveraging non-linear relationships between bug reports and source code using deep neural networks has improved the overall performance of bug localization, the significant performance gap between deep learning systems and non-deep learning systems indicates the inherent localization challenges in deep learning systems. Non-linearity might help localize specific deep-learning bugs (e.g., model bugs) but not others (e.g., training and GPU bugs). This suggests that the deep learning approach, despite its ability to capture non-linear relationships, fails to address the complexities of the DL training process or the GPU bugs’ multifaceted and often external nature. We also found that non-linearity alone might not be the complete solution for tensor bugs, as code structure analysis from information retrieval (BLUiR) helped to locate such structural bugs.
- Complexities within the deep learning architectures: Deep learning systems present a unique set of challenges for bug localization due to their convoluted architecture. The complex nature of frameworks, characterized by multiple layers and components, often coupled with dependencies, hinders the performance of existing bug localization techniques. In contrast, the modular nature of libraries slightly helps localization by providing a more defined structure [57]. On the other hand, tools (e.g., TensorBoard) frequently integrate with larger frameworks (e.g., TensorFlow), which leads to dependencies and interactions and thus to low performance in bug localization.
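The point above about tensor bugs and stack traces can be illustrated with a small, library-free sketch. The helper names `reshape` and `transpose` and the toy data are hypothetical; the example only shows that an incorrect reshaping can produce a tensor of the expected shape without raising any exception, so no stack trace is available to guide localization.

```python
def reshape(flat, rows, cols):
    """Reinterpret a flat list as a rows x cols matrix in row-major order."""
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose(matrix):
    """Swap the two axes of a matrix."""
    return [list(column) for column in zip(*matrix)]


# Toy batch: 2 samples x 3 features (hypothetical values).
batch = [[1, 2, 3], [4, 5, 6]]
flat = [value for row in batch for value in row]

# Intended: one row per feature, i.e., the transpose (3 x 2).
intended = transpose(batch)

# Buggy: reshaping to 3 x 2 yields the same shape but scrambles the data.
buggy = reshape(flat, 3, 2)

print("intended:", intended)
print("buggy:   ", buggy)
print("same shape, different content:", intended != buggy)
```

Both results are 3 x 2, so downstream code runs normally and the defect surfaces only as degraded model behavior.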
4.5 Implications
The above challenges highlight the necessity for novel bug localization techniques specifically designed for deep learning systems. Our research findings indicate that a one-size-fits-all approach may not be effective in practice, as shown by the existing methods (e.g., BLUiR, DNNLOC). Future research should focus on developing more comprehensive methods capable of addressing the wide array and complexity of bugs in deep learning systems, leveraging insights from the strengths and weaknesses of existing techniques identified in our research.
5 Threats to Validity
We identify a few threats to the validity of our findings. In this section, we discuss these threats and the necessary steps taken to mitigate them as follows.
Threats to internal validity relate to experimental errors and human biases [71]. Traditional bug tracking systems (e.g., Bugzilla, GitHub, Jira) contain thousands of bug reports, and their quality cannot be guaranteed. This could be a source of threat as the bug reports are used as queries to locate the buggy files. Bug reports often contain poor, insufficient, missing, or even inaccurate information [72]. Hence, we used data from existing benchmarks [31, 32], where the authors took necessary steps to avoid low-quality or invalid bug reports. Thus, such threats might be mitigated.
Another potential source of threat could be the replication of existing work. The original replication packages were unavailable; hence, we used the publicly available versions of BugLocator, BLUiR [49], and DNNLOC [50]. For BLIA, we reused the authors’ replication package [23]. We validated our implementation of the existing methods using their original datasets and achieved comparable results (e.g., with differences of 2.00%–3.00% in MAP).
Threats to conclusion validity. The observations from our study and the conclusions we drew from them could be a source of threat to conclusion validity [73]. In this research, we answer three research questions using two different datasets and four re-implemented existing techniques. We use appropriate statistical tests (e.g., t-test) and report the test details (e.g., p-value, Cohen’s D) to support our conclusions. Thus, such threats might also be mitigated.
Threats to construct validity relate to the use of appropriate performance metrics. We evaluate all the methodologies using MRR, MAP, and Top@K, which have been used widely by the related work [11, 12, 14, 17, 48, 19, 74]. Thus, such threats might also be mitigated.
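For readers unfamiliar with these metrics, the sketch below shows one common way to compute MRR, MAP, and Top@K from ranked lists of source files. It is a minimal illustration under our own assumptions: the function names and the toy rankings are hypothetical, and average precision is normalized by the number of buggy files that appear in the ranking, which matches normalization by all buggy files only when the ranking covers the whole codebase.

```python
def reciprocal_rank(ranked, buggy):
    """1 / rank of the first buggy file, or 0.0 if none is retrieved."""
    for rank, path in enumerate(ranked, start=1):
        if path in buggy:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, buggy):
    """Mean of precision@rank over the ranks that hold a buggy file."""
    hits, total = 0, 0.0
    for rank, path in enumerate(ranked, start=1):
        if path in buggy:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def top_at_k(ranked, buggy, k):
    """True if at least one buggy file appears in the top k results."""
    return any(path in buggy for path in ranked[:k])


def evaluate(results, k=10):
    """results: list of (ranked file list, set of buggy files), one per bug."""
    n = len(results)
    return {
        "MRR": sum(reciprocal_rank(r, b) for r, b in results) / n,
        "MAP": sum(average_precision(r, b) for r, b in results) / n,
        f"Top@{k}": sum(top_at_k(r, b, k) for r, b in results) / n,
    }


# Toy rankings for two hypothetical bug reports.
toy_results = [
    (["a.py", "b.py", "c.py"], {"b.py"}),
    (["x.py", "y.py", "z.py"], {"x.py", "z.py"}),
]
metrics = evaluate(toy_results, k=1)
print(metrics)
```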
6 Related Work
6.1 Software bug
Understanding the nature and characteristics of bugs is essential for effective debugging and testing. Bugs can differ across programming languages and development frameworks [75]. Over the last 50 years, hundreds of studies have been conducted to tackle bugs in traditional software systems. Recently, bugs from deep learning systems have garnered much attention due to their growing significance. Humbatova et al. [30] proposed a taxonomy of bugs from deep learning systems with five main categories: model, training, tensor & input, API, and GPU. Chen et al. [7] focused on the unique obstacles to deep learning-based software deployment. According to Islam et al. [35], data bugs and logic bugs are the most severe in deep-learning software systems. Another study by Islam et al. [76] showed that the bugs and repair patterns of deep learning models significantly differ from those of traditional systems. As a result, traditional software debugging approaches, such as bug localization techniques, might not be effective for deep learning systems. Therefore, empirical research like ours, which focuses on the challenges posed by deep learning bugs in the software debugging process (specifically in bug localization), is essential.
6.2 Information Retrieval-based bug localization
One of the crucial steps toward fixing a software bug is to detect its location within the software code. Many existing approaches [11, 12, 14, 13, 23] use Information Retrieval (IR) to locate bugs by matching keywords between a query and the source code.
Zhou et al. [11] introduce BugLocator, which leverages textual similarity between bug reports and source code using rVSM for bug localization. Saha et al. [12] propose BLUiR, which determines the textual similarity between source code and bug reports using the Okapi-BM25 algorithm [77]. BLUiR also leverages structural items from both bug reports and source code, which boosts its localization performance. Later, Wang and Lo [14] propose AmaLgam, which incorporates the textual similarity from BugLocator, structured items from BLUiR, and version control history into IR-based bug localization.
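To illustrate the kind of textual matching these IR-based techniques build on, the following sketch ranks source files against a bug report with a common variant of the Okapi BM25 scoring function. It is not the implementation of BugLocator or BLUiR: the tokenizer, the parameter values `k1` and `b`, the smoothed IDF formula, and the toy files are all our assumptions, and the sketch ignores the structured fields that BLUiR exploits.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, documents, k1=1.2, b=0.75):
    """Rank documents (name -> text) against a query with Okapi BM25.

    Assumes a non-empty corpus. Returns document names, best match first.
    """
    doc_tokens = {name: tokenize(text) for name, text in documents.items()}
    n_docs = len(doc_tokens)
    avg_len = sum(len(tokens) for tokens in doc_tokens.values()) / n_docs

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in doc_tokens.values():
        doc_freq.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = {}
    for name, tokens in doc_tokens.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            df = doc_freq[term]
            # Smoothed, non-negative IDF variant.
            idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[name] = score
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical source files and bug report.
files = {
    "conv.py": "def conv2d(input, weight): return convolve(input, weight)",
    "loader.py": "def load_batch(path): return read(path)",
    "optim.py": "def step(params, lr): update(params, lr)",
}
report = "conv2d crashes with wrong weight shape"
print(bm25_rank(report, files))
```

Because the score depends only on shared vocabulary, a report whose cause lies outside the code (e.g., a driver mismatch) would share few terms with any file, which is consistent with the limitations discussed in this article.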
Wang et al. [78] analyzed IR-based fault localization techniques and found their effectiveness to be limited, mainly due to the frequent unavailability of high-quality bug reports. The quality of bug reports makes it challenging to localize bugs using traditional IR-based techniques. Rahman and Roy [13] propose BLIZZARD, which leverages the quality aspect of bug reports and introduces context-aware query reformulation into bug localization. Wong et al. [48] proposed BRTracer, which improves upon BugLocator by combining source document segmentation and stack-trace analysis.
Le et al. [79] used an automated method to predict the effectiveness of IR-based bug localization by leveraging features extracted from bug reports and localization methods. Their findings highlight the significance of metadata features (e.g., commit history, stack traces) in enhancing the performance of these techniques. Another technique, namely Locus [74], uses the software change information from commit logs and change histories to improve bug localization. Youm et al. [23] proposed BLIA, which integrates bug reports, structured information of source files, and source code change history. It localizes bugs at two granularity levels (file level and method level) and outperforms prior approaches.
All these IR-based approaches have been designed with a focus on traditional software bugs. Bugs in deep learning applications pose several unique challenges: (a) non-deterministic behavior due to factors like random initialization and stochastic optimization [80], (b) complex relationships between high-dimensional data and model behavior and the influence of data-specific issues without direct code-level manifestations [35], and (c) strong external dependencies on hardware (e.g., PyTorch leverages GPU) [10]. Although IR-based bug localization techniques have shown promising results in traditional software systems, their performance might decline while localizing bugs in deep learning systems. Our experiments also provide evidence to support this observation (see Section 4 for further details).
Recently, Kim et al. [22] used basic IR-based techniques (e.g., VSM, rVSM, BM25) for locating bugs in deep learning systems but reported poor performance without any comprehensive analysis or explanation. Thus, the potential of existing IR-based solutions for bugs in deep-learning applications is not well understood yet. Our work in this article fills in that significant gap in the literature.
6.3 Deep learning-based bug localization
Unlike the above IR-based methods, deep learning can detect non-linear relationships between bug reports and source code for bug localization [26, 27, 28]. Polisetty et al. [81] evaluated deep learning-based bug localization models against traditional machine learning (ML) models, finding that while deep neural network (DNN) models generally outperform conventional ML models, they require substantial resources such as GPUs and memory. Lam et al. [17] propose DNNLOC, which combines information retrieval (e.g., rVSM [11]) with deep learning for bug localization. Xiao et al. [19] propose DeepLocator, where they use CNN and AST to extract features from bug reports and source documents, respectively. To learn unified features from natural language and source code during bug localization, Huo et al. [82] propose NP-CNN, which integrates both lexical and program structure information. Liang et al. [83] propose CAST, combining a tree-based CNN (TB-CNN) with customized AST to locate buggy files. However, these deep learning-based techniques are developed and evaluated using the source code from traditional software systems (e.g., JDT, SWT, Tomcat, AspectJ). These software systems do not represent deep learning applications, and thus, the techniques above might not be sufficient to tackle all the challenges of deep learning-related bugs.
Wardat et al. [18] propose an approach to locate Deep Neural Network (DNN) bugs through dynamic and statistical analysis. However, their method’s sole focus on model and training bugs, low accuracy, and over-reliance on the Keras library pose challenges for practical adoption. Deep learning-based approaches also lack explainability and heavily rely on source code, which may not be sufficient for the bugs with external dependencies (a.k.a extrinsic bugs) in deep learning applications.
To address the above gap, in this empirical study, we replicated four existing techniques [11, 12, 17, 23] to locate bugs in deep learning systems. Unlike Kim et al. [22], our study extends beyond bug localization from deep learning systems. Our study evaluates existing bug localization techniques, categorizes deep-learning bugs, analyzes their prevalence and challenges, and assesses each technique’s effectiveness for different bug types. We also conduct extensive manual analysis and explain why these bugs are difficult to localize (e.g., extrinsic factors, multifaceted dependencies), which makes our work novel.
7 Conclusion
Identifying the location of a bug within a software system (a.k.a. bug localization) is crucial to correcting it. In recent years, bug localization techniques have received considerable attention in the context of traditional software systems. However, they might not be sufficient for deep learning systems, as deep learning bugs pose a greater challenge due to their multifaceted dependencies. Moreover, the potential of existing approaches for localizing bugs in deep learning systems is not well understood to date. In this work, we first replicated four existing bug localization approaches and found that they show poor performance in localizing bugs from deep-learning systems. Second, through an in-depth analysis, we found that localizing certain categories of bugs (e.g., training bugs & GPU bugs) is more difficult than other bugs in deep learning systems. Finally, we investigated and found that deep learning bugs are more likely to be extrinsic, i.e., connected to non-code artifacts (e.g., training data). Our research thus offers empirical evidence and actionable insights for deep learning software bugs, advancing automated software debugging research. Future work can focus on developing a new framework for automated software debugging based on the insights from this empirical study.
Data Availability Statement (DAS)
All the data generated or analyzed during this study are available in the GitHub Repository to help reproduce our results [46].
Conflict of Interest
The authors declare that they have no conflict of interest.
References
- Arcuri [2008] A. Arcuri. On the automation of fixing software bugs. In ICSE, pages 1003–1006, 2008.
- Karampatsis and Sutton [2020] R. M. Karampatsis and C. Sutton. How often do single-statement bugs occur? the manysstubs4j dataset. In MSR, pages 573–577, 2020.
- Anvik et al. [2006] J. Anvik, L. Hiew, and G. C. Murphy. Who should fix this bug? In Proc. ICSE, pages 361–370, 2006.
- Consortium for Information & Software Quality (CISQ) [2022] Consortium for Information & Software Quality (CISQ). The cost of poor quality software in the US: A 2022 report. https://www.it-cisq.org/the-cost-of-poor-quality-software-in-the-us-a-2022-report/, 2022. Accessed: 2024-Jan-24.