This work was supported by the National Science Foundation through the real-time machine learning (RTML) program (Award number: 1937592).

DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference

Chaojian Li (Rice University, 6100 Main ST, Houston, USA; cl114@rice.edu), Wuyang Chen (University of Texas at Austin, 110 Inner Campus DR, Austin, USA; wuyang.chen@utexas.edu), Yuchen Gu (Rice University, 6100 Main ST, Houston, USA; yg50@rice.edu), Tianlong Chen (University of Texas at Austin, 110 Inner Campus DR, Austin, USA; tianlong.chen@utexas.edu), Yonggan Fu (Rice University, 6100 Main ST, Houston, USA; yf22@rice.edu), Zhangyang Wang (University of Texas at Austin, 110 Inner Campus DR, Austin, USA; atlaswang@utexas.edu), and Yingyan Lin (Rice University, 6100 Main ST, Houston, USA; yingyan.lin@rice.edu)

(2021)

Abstract.

Semantic segmentation for scene understanding is now in wide demand, which raises significant challenges for algorithmic efficiency, especially on resource-limited platforms. Current segmentation models are trained and evaluated on massive high-resolution scene images (“data level”) and suffer from the expensive computation arising from the required multi-scale aggregation (“network level”). At both levels, the computational and energy costs of training and inference are notable due to the often desired large input resolutions and the heavy computational burden of segmentation models. To this end, we propose DANCE, a general automated DAta-Network Co-optimization framework for Efficient segmentation model training and inference. Distinct from existing efficient segmentation approaches that focus merely on light-weight network design, DANCE performs automated, simultaneous data-network co-optimization via both input data manipulation and network architecture slimming. Specifically, DANCE integrates automated data slimming, which adaptively downsamples/drops input images and controls their corresponding contribution to the training loss, guided by the images’ spatial complexity. Such a downsampling operation, in addition to directly slimming down the cost associated with the input size, also shrinks the dynamic range of input object and context scales, therefore motivating us to also adaptively slim the network to match the downsampled data. Extensive experiments and ablation studies (on four SOTA segmentation models with three popular segmentation datasets under two training settings) demonstrate that DANCE can achieve “all-win” towards efficient segmentation (reduced training cost, less expensive inference, and better mean Intersection-over-Union (mIoU)). Specifically, DANCE reduces energy consumption by 25% - 77% in training and by 31% - 56% in inference, while changing the mIoU by -0.71% to +13.34%.

efficient training and inference methods, semantic segmentation

Journal: TODAES (2021). CCS concepts: Computing methodologies, Image segmentation; Computing methodologies, Neural networks; Hardware, Platform power issues.

1. Introduction

The recent record-breaking performance of semantic segmentation using deep networks motivates an ever-growing application demand. However, those segmentation models typically bear a heavy computational cost to run (i.e., inference), making them extremely challenging to deploy on resource-constrained platforms, ranging from mobile phones to wearable glasses, drones, and autonomous vehicles. In particular, while existing works on improving inference efficiency have traditionally focused on classification, state-of-the-art (SOTA) segmentation models are considerably more costly. For example, a ResNet50 (He et al., 2016) costs 4 GFLOPs for inference with an input size of $224\times 224$ . In comparison, for a DeepLabv3+ (Chen et al., 2018) with the ResNet50 backbone and the same $224\times 224$ input (associated with an output stride of 16), the inference cost jumps to 13.3 GFLOPs; the cost further soars to 435 GFLOPs if we operate on a higher input resolution of $2048\times 1024$ . A similar trend can be expected in terms of the required energy costs. These high resource costs prohibit segmentation models from being deployed on edge devices, or at least degrade the quality of user experience. Specifically, the expense of segmentation models arises from two aspects:

• High input resolution and its proportional costs: segmentation, as a dense prediction task, typically relies on fully convolutional networks whose inference FLOPs are proportional to the input size. Meanwhile, unlike classification, segmentation is well-known to be more resolution-sensitive due to its much finer prediction granularity (Chen et al., 2019b). Therefore, high-resolution inputs are preferable for improving algorithmic performance, which yet contradicts the resource-saving needs.

• Multi-scale aggregation: segmentation is well-known for its strong dependency on multiple scale features (Chen et al., 2018; Yu et al., 2018a; Yu and Koltun, 2016; Zhao et al., 2018, 2017) for contextual reasoning in combination with full-resolution outputs. Such a desired feature is often achieved by fusing a multi-resolution stream or aggregating paralleled filters with different sizes. Both the fusion and aggregation modules can incur heavy resource costs.

The expensiveness of segmentation is further amplified when we come to consider its training (e.g., continuous learning and adaptation) in resource-constrained settings. Many applications, such as autonomous vehicles and robots, require real-time and in-situ learning and continuous adaptation to new data, to be considered truly intelligent. As compared to cloud-based (re)training, local (re)training helps avoid transferring data back and forth between data centers and local platforms, reducing communication loads, and enhancing privacy. Besides, the increasingly prohibitive energy, financial and environmental costs of training ML algorithms have become a growing concern even for training in the cloud (Strubell et al., 2019). However, resource-constrained training was not explored much until a few recent efforts on classification (Jiang et al., 2019; Wang et al., 2019; You et al., 2019).

Our contributions. This work aims to push forward the training and inference efficiency of SOTA segmentation models to a new level, from the current practice of merely focusing on light-weight network design, towards a novel data-network co-optimization perspective. Its core driving motivation can be summarized in two points: (1) not all input samples are born equal (Jiang et al., 2019; et.al, 2017); and (2) eliminating input variances reduces the model’s learning workload (Engelbrecht et al., 1999).

More specifically, we propose DANCE, an efficient training and inference framework that can be applied towards any existing segmentation model. First, DANCE adopts an input adaptive automated data slimming technique. We propose a spatial complexity indicator to adapt the input images’ spatial resolution, training sampling frequency, and weighted coefficients in the loss function. Thus DANCE makes the models focus more on the complicated samples during training, while during testing the input images’ spatial resolution will be similarly reduced (i.e., downsampled).

Meanwhile, adaptively reducing the input resolution has a direct (proportional) impact on the training and inference energy costs (i.e., both computation and memory movement costs). An indirect yet equally important consequence is that the downsampled inputs become more “normalized” in terms of object and feature scales. Current segmentation models strongly rely on built-in multi-scale aggregation modules to balance contextual reasoning against fine-detail preservation (Chen et al., 2019b; Yu and Koltun, 2016). Interestingly, with spatial-complexity-adaptive downsampled inputs, further slimming those cost-dominant multi-scale aggregation building blocks saves both training and inference costs without hampering the algorithmic performance. This is the automated network slimming that we propose in DANCE.

Figure 1. The achieved mIoU vs. the required energy cost (Left: the total training energy cost; Right: the averaged inference energy cost per image) on the Cityscapes (Cordts et al., 2016a) test dataset. For the three segmentation models evaluated, DANCE achieves “all-win”: reduced training cost, less expensive inference, and improved mIoU.

Below we outline the contributions of the proposed DANCE framework:

• DANCE, the first data-network co-optimization framework, boosts the efficiency of both training and inference for segmentation models while mostly improving the accuracy. Further, DANCE is general and thus can be applied to any existing segmentation backbone.

• DANCE simultaneously integrates automated data and network slimming to manipulate input images and their contribution to the model while slimming the network architecture in a co-optimization manner. Interestingly, the former can emulate the effect of multi-scale aggregation, thus enabling more aggressive slimming of the corresponding cost-dominant building blocks.

• Extensive experiments and ablation studies demonstrate that DANCE can achieve “all-win” (i.e., reduced training and inference costs, and improved model accuracy) towards efficient segmentation, when benchmarking on four SOTA segmentation models and three popular segmentation benchmark datasets. As shown in Fig. 1, DANCE establishes a new record trade-off between segmentation models’ accuracy and their training and inference efficiency.

2. Related works

2.1. Efficient CNN inference and training

Extensive works have been proposed to improve the efficiency of CNN inference, most of which focus on classification tasks. Network compression has been widely studied to speed up CNN inference, e.g., by pruning unimportant network weights (Han et al., 2015a; He et al., 2017), quantizing the network into low bitwidths (Hubara et al., 2017), or distilling lighter-weight networks from teachers (Polino et al., 2018). For example, a representative automated pruning method (Network Slimming (Liu et al., 2017)) imposes $L_{1}$ -sparsity on the scaling factors of the batch normalization layers; later, progressive pruning methods (i.e., gradually increasing the pruning ratio) were developed to improve the resulting models’ accuracy (Ye et al., 2018). Another stream of approaches involves designing compact models, such as MobileNet (Sandler et al., 2018) and ShuffleNet (Zhang et al., 2018). Energy cost was leveraged in (Yang et al., 2017) to guide the pruning towards the goal of energy-efficient inference.

Resource-efficient training is different from and more complicated than its inference counterpart. However, many insights gained from the latter can be lent to the former. For example, the recent work (Lym et al., 2019) showed that performing active channel pruning during training can accelerate the empirical convergence. Lately, Wang et al. (Wang et al., 2019) proposed one of the first comprehensive energy-efficient training frameworks, consisting of stochastic data dropping, selective layer updating, and low-precision back-propagation. They demonstrated its success in training several classification models with over 80% energy savings. (Jiang et al., 2019) accelerated training by skipping samples that may lead to low loss values (considered as less informative) at each iteration.

2.2. Semantic segmentation

Multi-scale aggregation in segmentation. Multi-scale aggregation has been proven to be powerful for semantic segmentation (Chen et al., 2018; Yu et al., 2018a; Zhao et al., 2018, 2017), via integrating multi-scale modules and high-/low-level features to capture patterns of different granularities. Pyramid Pooling and Atrous Spatial Pyramid Pooling (ASPP) modules were introduced in (Zhao et al., 2017) and (Chen et al., 2018) to aggregate features learned with different sizes of receptive fields, adapting the models to objects with different semantic sizes. Parallel branches of different downsampling rates were proposed by (Yu et al., 2018a; Zhao et al., 2018) to cover different resolutions. Although multi-scale aggregation contributes to segmentation accuracy, it and its associated header introduce extra overhead during both training and inference (e.g., 52.98% of the inference FLOPs of DeepLabv3+ with a ResNet50 backbone and an output stride of 16). That motivates us to slim such modules in DANCE.

Efficient segmentation models. A handful of efficient semantic segmentation models have been developed: ENet (Paszke et al., 2016) used an asymmetric encoder-decoder structure together with early downsampling; ICNet (Zhao et al., 2018) cascaded feature maps from multi-resolution branches under proper label guidance, together with network compression; and BiSeNet (Yu et al., 2018a) fused a context path with a fast downsampling scheme and a spatial path with smaller filter strides.

Remaining challenges. However, the models above were neither customized for nor evaluated on ultra-high resolution images, and our experiments show that they do not achieve a sufficiently satisfactory trade-off in such cases. A knowledge distillation method was also leveraged to boost the performance of a computationally light-weight segmentation model from a teacher network (He et al., 2019). Despite their progress, none of these works addresses training efficiency or discusses co-optimization with the input data. Besides, the number of FLOPs correlates with, but is not a faithful indicator of, the actual energy cost, as pointed out by prior works (Yang et al., 2017).

3. The proposed DANCE framework

This section presents our proposed DANCE framework. We will first provide an overview of DANCE in Section 3.1, and then introduce DANCE’s automated data slimming and automated network slimming design in Section 3.2 and Section 3.3, respectively.

3.1. DANCE overview

The driving hypothesis of DANCE is that matching the data and network can potentially boost both the model performance and hardware efficiency by removing redundancy associated with both the data and network. As such, DANCE aims to reduce the computational and energy costs of segmentation tasks during both training and inference, via a joint effort from data-level and network-level. Specifically, as shown in Fig. 2, DANCE integrates both automated data and network slimming, where the former automatically performs complexity-driven data downsampling/dropping before applying the data to a network while the latter automatically and progressively prunes the network to match the slimmed data. A bonus benefit of DANCE is that the resulting data-network pipeline after training (i.e., inference) is also naturally cost-efficient.

3.2. DANCE: automated data slimming

DANCE’s automated data slimming automatically downsamples or drops input images and controls their corresponding contribution to the training loss, adapting to the images’ spatial complexity, which is estimated using a spatial complexity indicator.

Spatial complexity indicator. Spatial complexity has been commonly used as the basis for estimating image complexity (Gain and Siegelmann, 2019; Mishra et al., 2019; Xu et al., 2019), such as the one proposed in (Yu and Winkler, 2013):

$$SC_{mean}=\frac{1}{M}\sum\sqrt{s_{h}^{2}+s_{v}^{2}} \tag{1}$$

where ${s_{h}}$ and ${s_{v}}$ denote gray-scale images filtered with horizontal and vertical Sobel kernels, respectively, and ${M}$ denotes the number of pixels. Developed by (Yu and Winkler, 2013) to predict the image complexity for image compression/coding purposes, $SC_{mean}$ reflects the pixel-level variances and is extremely efficient to calculate, e.g., it accounts for only $0.15\%$ of the FLOPs and $<0.5\%$ of the energy (on-device measurement including both computations and data movements) of the DeepLabv3+ model (ResNet50 as the backbone with an output stride of 16) on one RGB image patch of size 224 $\times$ 224.
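As an illustration of Eq. (1), the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the image is already gray-scale and given as a list of rows, uses the standard $3\times 3$ Sobel kernels, and skips the one-pixel border (so $M$ counts interior pixels only); the function name `sc_mean` is hypothetical.

```python
import math

# Standard 3x3 Sobel kernels for the two gradient directions.
SOBEL_H = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_V = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sc_mean(gray):
    """Mean Sobel gradient magnitude of a 2-D gray-scale image (Eq. (1)).

    Assumption: border pixels are skipped, so M is the number of
    interior pixels rather than the full pixel count.
    """
    height, width = len(gray), len(gray[0])
    total, count = 0.0, 0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            s_h = s_v = 0.0
            for dy in range(3):
                for dx in range(3):
                    value = gray[y + dy - 1][x + dx - 1]
                    s_h += SOBEL_H[dy][dx] * value
                    s_v += SOBEL_V[dy][dx] * value
            total += math.sqrt(s_h * s_h + s_v * s_v)
            count += 1
    return total / count if count else 0.0
```

A flat image yields an indicator of zero, while images with many edges yield larger values, which matches the intended reading of $SC_{mean}$ as a measure of pixel-level variation.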

In DANCE, we first compute the $SC_{mean}$ of all training samples and fit the corresponding cumulative distribution function (CDF) using a Maxwell-Boltzmann distribution (Maxwell, 1860), which turns out to match well in all considered datasets, as shown in Fig. 3. A statistical analysis of $SC_{mean}$ for a specific dataset is an interesting question, which we leave for future work.

Thanks to the fitted CDF, given an input image, we can project its $SC_{mean}\in[0,\infty)$ to a variable $p\in[0,1]$ via the probability integral transform (Dodge and Commenges, 2006). The resulting $p$ is then used to set the corresponding input image’s downsampling ratio, stochastic dropping probability, and weighted coefficient in the training loss, as detailed below.
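The projection from $SC_{mean}$ to $p$ can be sketched as follows. This is a hedged example rather than the paper's fitting procedure: it assumes a Maxwell-Boltzmann distribution with zero location and a single scale parameter $a$, estimated by maximum likelihood ($a^{2}=\sum x^{2}/(3n)$), whereas the paper only states that the CDF is fitted. The function names are hypothetical.

```python
import math


def fit_maxwell_scale(sc_values):
    """Maximum-likelihood scale a of a zero-located Maxwell-Boltzmann fit."""
    n = len(sc_values)
    return math.sqrt(sum(v * v for v in sc_values) / (3.0 * n))


def maxwell_cdf(x, a):
    """CDF of the Maxwell-Boltzmann distribution with scale a."""
    if x <= 0:
        return 0.0
    z = x / a
    return (math.erf(z / math.sqrt(2.0))
            - math.sqrt(2.0 / math.pi) * z * math.exp(-0.5 * z * z))


def project_to_p(sc, a):
    """Probability integral transform: map SC_mean in [0, inf) to p in [0, 1]."""
    return maxwell_cdf(sc, a)
```

Under this sketch, `fit_maxwell_scale` would be called once on the training set's $SC_{mean}$ values, and `project_to_p` would then be applied per image at both training and inference time.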

Complexity-adaptive downsampling. The proposed complexity-adaptive downsampling in DANCE draws inspiration from recent findings which show that not all input samples are born equal (Jiang et al., 2019; et.al, 2017), and is motivated by the fact that downsampling the input images most straightforwardly reduces the training/inference energy costs and directly benefits the memory throughput. Meanwhile, a few recent works learn to adjust the resolution or receptive fields (Dai et al., 2017; Marin et al., 2019), and their promising results further motivate our complexity-adaptive downsampling.

As prior works show that the minimal acceptable downsampling ratio is 0.5 for most segmentation models (Yu et al., 2018a; Chen et al., 2018), we make use of the spatial complexity indicator $SC_{mean}$ to downsample the input images with a ratio of $(0.5p+0.5)\in[0.5,1]$ , where $p$ is the aforementioned projected value corresponding to the images’ $SC_{mean}$ . In contrast to the learning-based approaches in prior works (Dai et al., 2017; Marin et al., 2019), which incur extra training workloads, we seek a reliable indicator that is mostly “training free” and inexpensive to compute, based on which we can adaptively estimate a proper downsampling rate per image. In particular, the energy overhead of our complexity-adaptive downsampling is $<0.02\%$ when estimated using real-device measurements in all our considered datasets.
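The downsampling rule can be written out in a few lines. This sketch assumes that the ratio $0.5p+0.5$ is applied to each spatial side and that the result is rounded to whole pixels; neither detail is specified in the text, and the function names are hypothetical.

```python
def downsample_ratio(p):
    """Map the projected complexity p in [0, 1] to a ratio in [0.5, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    return 0.5 * p + 0.5


def downsampled_size(height, width, p):
    """Target size, assuming the ratio scales each side (an assumption)."""
    ratio = downsample_ratio(p)
    return max(1, round(height * ratio)), max(1, round(width * ratio))
```

For instance, a $1024\times 2048$ image with $p=0$ (lowest complexity) would be halved on each side, while an image with $p=1$ keeps its original resolution.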

Complexity-adaptive stochastic dropping. Recent pioneering works on efficient CNN training (Jiang et al., 2019; Wang et al., 2019) proposed that dropping a portion of training samples/mini-batches, either randomly or using some loss-based deterministic rules, can reduce the total training costs without notably sacrificing, and sometimes even improving, the algorithmic accuracy. Inspired by the stochastic dropping idea of (Wang et al., 2019), we incorporate the readily available spatial complexity indicator in Eq. (1) to calibrate the dropping probability. Specifically, (Wang et al., 2019) proposes to randomly skip incoming data (in mini-batches) with a default probability of 50% (i.e., 50% of the data is discarded without being fed into the models). The authors demonstrated that this simple idea (with zero overhead) is highly effective for efficient training, without hurting and sometimes even improving the achieved accuracy. We further hypothesize that images with larger spatial complexity are more informative and are likely to benefit the achieved accuracy if they are trained on more frequently than images with smaller spatial complexity.

Therefore, instead of adopting a uniform dropping probability for all images, we propose a simple yet effective heuristic to enable complexity-adaptive stochastic dropping by assigning a smaller dropping probability to input images with larger spatial complexity. In particular, we assign $(1-p)$ as the dropping probability, where $p$ is the aforementioned projected value of the images’ spatial complexity indicator ( $SC_{mean}$ ).
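A minimal sketch of this dropping rule is given below. It assumes a per-image decision drawn from Python's standard `random` module; whether the decision is made per image or per mini-batch in practice is not specified here, and the function names are hypothetical.

```python
import random


def drop_probability(p):
    """Dropping probability (1 - p) for an image with projected complexity p."""
    return 1.0 - p


def keep_sample(p, rng=random):
    """Return True if the image is kept for this training iteration.

    rng.random() is uniform on [0, 1), so the image is dropped with
    probability exactly (1 - p).
    """
    return rng.random() >= drop_probability(p)
```

With this rule, an image with $p=1$ is always used and an image with $p=0$ is never used, so the expected training frequency of an image grows linearly with its projected complexity.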

Complexity-adaptive loss. Similarly, the losses produced by images with different complexities have been observed to contribute differently to the training loss (Gain and Siegelmann, 2019) or convergence in training (Jiang et al., 2019). We thus prioritize the updates generated by samples with larger spatial complexity, and adopt an adaptive weighted loss as below:

$$\mathcal{L}=\frac{\sum w_{i}\cdot l_{i}}{\sum w_{i}}=\frac{\sum p_{i}\cdot l_{i}}{\sum p_{i}},\quad i=1,2,\ldots,N \tag{2}$$

where $w_{i}$ is a scalar weighted coefficient and $l_{i}$ is the cross-entropy loss of the $i$ -th image in the current mini-batch of $N$ images. Similar to the dropping probability assignment in complexity-adaptive stochastic dropping, an input image with larger spatial complexity is assigned a larger weighted coefficient than one with smaller spatial complexity. As such, we adopt weighted coefficients equal to the aforementioned projected value $p$ of the images’ spatial complexity indicator ( $SC_{mean}$ ), i.e., $w_{i}=p_{i},i=1,2,...N$ .
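Eq. (2) is a weighted average of per-image losses, which the following sketch makes concrete. It assumes the per-image cross-entropy losses have already been computed as plain floats; the function name is hypothetical, and raising an error when all weights are zero is a choice made for this example only.

```python
def complexity_weighted_loss(losses, p_values):
    """Weighted mini-batch loss of Eq. (2) with w_i = p_i."""
    if len(losses) != len(p_values):
        raise ValueError("losses and p_values must have the same length")
    total_weight = sum(p_values)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    weighted_sum = sum(p * l for p, l in zip(p_values, losses))
    return weighted_sum / total_weight
```

For example, with losses of 1.0 and 3.0 and weights of 0.25 and 0.75, the result is 2.5 rather than the unweighted mean of 2.0, since the more complex image dominates the update.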

3.3. DANCE: automated network slimming

Various ways to aggregate multi-scale features (Chen et al., 2018; Yu et al., 2018a; Zhao et al., 2018, 2017) have been shown to improve segmentation accuracy at the cost of extra parameters and computations, leading to a higher training/inference energy burden. Thanks to the complexity-adaptive downsampling in DANCE’s automated data slimming (see Section 3.2), the resulting inputs have been re-scaled according to their spatial complexity. We conjecture that such downsampled inputs naturally have more “normalized” object feature scales, i.e., complexity-adaptive downsampling can emulate the effect of multi-scale aggregation, so that the model can potentially rely less on multi-scale aggregation modules for improving the segmentation accuracy. We thus expect the network to be more redundant when handling the downsampled inputs produced by our automated data slimming, as the cost-dominant building blocks of multi-scale aggregation now become less important.

Progressive pruning during training. Motivated by the above conjecture and targeting reduced costs for both training and inference (whereas post-training pruning merely reduces inference costs), we propose automated network slimming with a progressive pruning schedule along the training trajectory to prune the header of segmentation networks. The header includes the aforementioned multi-scale feature modules and often dominates both the training and inference costs, e.g., it accounts for 52.98% of the FLOPs in DeepLabv3+ (with a ResNet50 backbone and an output stride of 16). Note that DANCE’s effectiveness and insights extend to other network pruning methods; here we consider progressive pruning without loss of generality.

To design the progressive pruning schedule, we adopt a straightforward heuristic, following the schedule commonly used in most pruning works (Renda et al., 2020; Luo et al., 2017; Han et al., 2015b). Specifically, we first divide the whole training/adaptation process into several stages w.r.t. the total number of iterations, and then perform channel-wise pruning (based on (Liu et al., 2017)) at the end of each stage.
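The stage-wise schedule can be sketched as follows. This is an illustrative example under stated assumptions, not the schedule used in the experiments: it assumes that the target pruning ratio grows linearly over the stages, and that channels are ranked by the magnitude of their batch normalization scaling factors in the spirit of Network Slimming (Liu et al., 2017). The function names and the flat list of scaling factors are hypothetical.

```python
def stage_ratios(final_ratio, num_stages):
    """Cumulative pruning ratio at the end of each stage (linear, an assumption)."""
    return [final_ratio * (s + 1) / num_stages for s in range(num_stages)]


def channels_to_keep(bn_scales, prune_ratio):
    """Indices of channels kept after pruning the smallest |gamma| values.

    bn_scales is a flat list of batch-norm scaling factors, e.g. gathered
    from the header layers that are subject to pruning.
    """
    n = len(bn_scales)
    n_prune = int(n * prune_ratio)
    order = sorted(range(n), key=lambda i: abs(bn_scales[i]))
    return sorted(order[n_prune:])
```

In a training loop, `channels_to_keep` would be evaluated at the end of each stage with the corresponding entry of `stage_ratios`, so that later stages train an already smaller header and the training cost shrinks along with the inference cost.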

Co-optimization affects pruning patterns.

To validate the aforementioned conjecture, we visualize the percentage of pruned channels in layers corresponding to multi-scale aggregation and other layers under different pruning ratios in Fig. 4, when the models are trained with DANCE or merely DANCE’s automated network slimming.

We can see that training with both automated data and network slimming, i.e., DANCE, always prunes more channels in the layers corresponding to multi-scale aggregation (e.g., the ASPP module in DeepLabv3+) and fewer channels in other layers, under all seven considered pruning ratios between 20% and 80%, while using merely automated network slimming does the opposite. Specifically, as compared to training using merely automated network slimming under the same pruning ratio of 50%, the model trained with both automated data and network slimming (i.e., DANCE) prunes 1.47 $\times$ more channels in layers associated with multi-scale aggregation, and the corresponding accuracy is also higher (e.g., a 5.33% higher mIoU on the Cityscapes validation dataset together with a 54.8% lower inference energy with images of 592 $\times$ 592).

The experiment in Fig. 4, together with those in the experiment section, supports our conjecture that (1) matching the data with the network can potentially improve the accuracy (thanks to the match between the slimmed data and the unpruned channels’ distribution) and remove redundant costs associated with both the data and network, thus achieving “all-win”: reducing both the training and inference costs while improving the achieved model accuracy (mIoU); and (2) DANCE’s automated data slimming can (partially) emulate the effect of multi-scale aggregation in segmentation models, enabling a higher pruning ratio on the corresponding multi-scale aggregation modules. The observations are consistent when other pruning methods and different pruning hyperparameters are used in DANCE’s automated network slimming (more details in Section 4.3.4), again indicating that the above conclusion (i.e., “co-optimization affects the optimal pruning patterns”) holds for DANCE regardless of the adopted pruning designs.

4. Experiments

In this section, we evaluate DANCE on four segmentation models and three popular urban scene understanding datasets in terms of mIoU and the total training/inference FLOPs and energy cost, where the energy cost is measured when training and running inference with the corresponding models on a SOTA edge device (JETSON TX2 (NVIDIA Inc., [n.d.])). We consider both the computational and energy costs because the former is commonly adopted and thus helps to benchmark against prior works, while the latter better captures the real hardware cost.

4.1. Experiment setting

Considered models and datasets. Our evaluation of DANCE considers four SOTA segmentation models (two complicated models: DeepLabv3+ (Chen et al., 2018) and PSPNet (Zhao et al., 2017); and two compact models: ICNet (Zhao et al., 2018) and BiSeNet (Yu et al., 2018a)) and three urban scene understanding datasets (Cityscapes (Cordts et al., 2016b), CamVid (Brostow et al., 2008), and BDD (Yu et al., 2018b)) that are commonly used by efficient segmentation works (Chen et al., 2019a; Zhao et al., 2018; Yu et al., 2018a).

Experimental platforms and training details. All experiments (except the energy measurements) are performed on a workstation with NVIDIA 2080Ti GPU cards using the PyTorch framework (Paszke et al., 2017) for a fair comparison. We use an SGD optimizer with a learning rate of $1\times 10^{-3}$ for training all models except ICNet, which adopts a learning rate of $1\times 10^{-2}$ due to the unavailability of the corresponding ImageNet pre-trained model; and a minibatch size of (1) 8 for the DeepLabv3+ and PSPNet models and (2) 16 for the BiSeNet and ICNet models.

Table 1. The FLOPs, energy cost, and mIoU of DANCE on top of the four models on the Cityscapes test dataset.

Model	Train. FLOPs (P)	Infer. FLOPs (G)	Train. Energy (MJ)	Infer. Energy (J)	mIoU (%)
DeepLabv3+	198.31	743.64	45.21	41.32	68.05
DANCE Improv.	-35.75%	-53.67%	-39.69%	-45.02%	+3.44
PSPNet	153.61	582.54	35.16	30.99	65.59
DANCE Improv.	-39.28%	-50.23%	-34.92%	-42.81%	+1.93
ICNet	39.33	45.20	10.82	6.51	47.74
DANCE Improv.	-49.77%	-55.67%	-45.27%	-47.69%	+13.34
BiSeNet	73.64	157.41	18.40	9.93	71.69
DANCE Improv.	-32.77%	-39.29%	-25.66%	-31.27%	-0.71

4.2. Performance on various datasets/models

In this subsection, we apply DANCE to the four segmentation models and three datasets and compare the resulting segmentation accuracies and inference/training costs with those of the base models.

4.2.1. DANCE on the Cityscapes dataset

Table 1 compares the segmentation accuracy and the computational and energy costs of DANCE on the four models, i.e., DeepLabv3+ (Chen et al., 2018), PSPNet (Zhao et al., 2017), ICNet (Zhao et al., 2018), and BiSeNet (Yu et al., 2018a), when evaluated on the Cityscapes dataset. We can see that (1) DANCE saves about 36% - 39% of the computational cost and 35% - 40% of the energy cost in training (with a similar trend in inference), while boosting the mIoU of DeepLabv3+ (Chen et al., 2018) and PSPNet (Zhao et al., 2017) by 3.44% and 1.93%, respectively; (2) in the case of ICNet, DANCE achieves a 13.34% higher mIoU with up to 45% energy savings over the base model, where the lower mIoU of the base model might be due to the lack of a corresponding ImageNet pre-trained model; and (3) though DANCE does not boost the mIoU of the compact BiSeNet model, it does save training energy and saves even more (up to 31% energy) in inference.

Table 2. The FLOPs, energy cost, and mIoU of DANCE on top of the four models on the CamVid test set.

Model	Train. FLOPs (P)	Infer. FLOPs (G)	Train. Energy (MJ)	Infer. Energy (J)	mIoU (%)
DeepLabv3+	37.19	254.62	7.30	20.19	69.15
DANCE Improv.	-32.76%	-47.6%	-31.76%	-43.65%	+1.51
PSPNet	27.27	208.77	4.71	14.64	65.28
DANCE Improv.	-39.69%	-46.47%	-32.22%	-41.22%	+2.82
ICNet	6.01	16.33	1.78	4.21	53.29
DANCE Improv.	-45.32%	-52.64%	-49.21%	-56.21%	+1.40
BiSeNet	6.46	54.17	2.72	6.77	68.6
DANCE Improv.	-38.09%	-41.10%	-32.45%	-33.76%	-0.27

4.2.2. DANCE on the CamVid dataset

With the smaller images ( $720\times 960$ ) in CamVid (vs. $1024\times 2048$ in Cityscapes), we still observe trends similar to those on Cityscapes (see Table 1). Specifically, DANCE still saves 32% - 49% of the training energy cost, as shown in Table 2, while improving the mIoU (by over 1.4%) for all models except BiSeNet. For the compact model BiSeNet, with a comparable mIoU, DANCE still consistently brings 32% and 33% energy savings in training and inference, respectively.

4.2.3. DANCE on the BDD dataset for adaptation

As Section 1 stated, for most on-device learning applications, training from scratch is not necessary and the ability to adapt to new data can be more interesting for some applications, especially for autonomous vehicles and robots.

Table 3. The FLOPs, energy cost, and mIoU of DANCE on top of the four models on the BDD test set on adaptation.

Model	Train. FLOPs (P)	Infer. FLOPs (G)	Train. Energy (MJ)	Infer. Energy (J)	mIoU (%)
DeepLabv3+	97.01	339.46	17.74	27.67	52.66
DANCE Improv.	-79.31%	-51.86%	-77.15%	-43.5%	+0.12
PSPNet	72.56	290.57	11.82	19.37	39.54
DANCE Improv.	-37.47%	-49.65%	-25.18%	-40.29%	+5.51
ICNet	58.27	21.68	15.73	5.17	39.53
DANCE Improv.	-58.65%	-51.51%	-61.47%	-37.47%	+0.47
BiSeNet	45.32	72.27	7.48	7.94	56.20
DANCE Improv.	-34.08%	-44.47%	-27.92%	-38.46%	+0.27

Here, we choose the BDD dataset (Yu et al., 2018b) for the adaptation experiments. We use models pre-trained on Cityscapes and adapt them to unseen images in BDD. For a fair comparison, each model starts from the same pre-trained checkpoint in all of its experiments. The adaptation performance is summarized in Table 3, which shows trends similar to those on Cityscapes: DANCE saves up to 77% of the training energy cost while achieving a slightly better (+0.12%) mIoU than the baseline, or boosts the mIoU by 5.51% while still requiring a 25% lower training energy cost than the baseline.

The extensive results in Tables 1 - 3 show that DANCE can achieve “all-win” on all three datasets when applied to both DeepLabv3+ and PSPNet: lower training cost (energy savings: 25% - 77%), more efficient inference (energy savings: 40% - 45%), and improved mIoU (0.12% - 5.51%), demonstrating the consistent superiority of DANCE on complicated models. As for the performance on compact models, DANCE improves the efficiency of both training (energy savings: 25% - 61%) and inference (energy savings: 31% - 56%) with a slightly lower or even better mIoU (-0.71% to +13.34%) on all three datasets, indicating that DANCE can benefit the energy efficiency of even compact models.

4.3. Ablation studies of DANCE

In this subsection, we perform ablation studies of DANCE for evaluating the effectiveness of its data-network co-optimization, $p$ indicator, and automated data slimming.

4.3.1. Ablation study on the effectiveness of DANCE’s data-network co-optimization

Table 4. The FLOPs and mIoU of co-optimizing vs. separately optimizing the data and network on top of DeepLabv3+@CityScapes.

Method	FLOPs		mIoU
Method	Train. (P)	Infer. (G)	(%)
Baseline	198.31	743.64	69.71
Optimize Network After Data	+0.85%	-61.10%	+0.66
Optimize Data After Network	+12.99%	-55.88%	+3.08
Co-Optimize (DANCE)	-35.75%	-56.57%	+3.41

DANCE vs. only automated network/data slimming. As shown in Fig. 5, combining both automated data and network slimming (i.e., DANCE) achieves (1) better performance (in terms of training cost, inference cost, and mIoU) than the standalone implementation of either of these two techniques integrated into DANCE (i.e., automated data and network slimming); and (2) a much higher mIoU than the baseline (+3.41%) while requiring 39% and 45% less energy in training and inference, respectively. This set of experiments indicates the advantage of jointly matching the data and network for co-optimization.

DANCE vs. optimizing network and data separately. Table 4 compares co-optimization (DANCE) with separate optimization (optimizing network/data and then data/network sequentially), showing that network and data need to be jointly co-optimized to achieve the best mIoU-cost trade-off, while optimizing (i.e., slimming) the network and data sequentially will cause a 0.33% - 2.75% mIoU drop on DeepLabv3+@Cityscapes at an even higher computational cost (e.g., +48.74%) than DANCE.
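The gaps quoted above can be recomputed from Table 4. The sketch below does this in Python, under the assumption that the quoted figures are percentage-point differences between the relative changes listed in the table; the dictionary keys and the function name are illustrative and not from the paper.

```python
# Relative changes vs. the DeepLabv3+ baseline, copied from Table 4.
# Each entry: (training FLOPs change in %, mIoU change in points).
table4 = {
    "network_after_data": (0.85, 0.66),
    "data_after_network": (12.99, 3.08),
    "co_optimize": (-35.75, 3.41),
}


def gaps_vs_dance(rows):
    """Return {method: (extra training FLOPs in pct. points, mIoU drop)} vs. DANCE."""
    dance_flops, dance_miou = rows["co_optimize"]
    gaps = {}
    for name, (flops, miou) in rows.items():
        if name == "co_optimize":
            continue
        # Positive values mean the sequential scheme is worse than DANCE.
        gaps[name] = (round(flops - dance_flops, 2), round(dance_miou - miou, 2))
    return gaps


gaps = gaps_vs_dance(table4)
for name, (extra_flops, miou_drop) in gaps.items():
    print(f"{name}: +{extra_flops} pct. points training FLOPs, -{miou_drop} mIoU")
```

Under this reading, the 0.33% - 2.75% mIoU drop and the +48.74% cost figure both follow from the table entries.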

4.3.2. Ablation study of DANCE on objects with different scales

Table 5. The inference mIoU without and with DANCE on top of DeepLabv3+ for large-, medium-, and small-scale objects in CityScapes (manually picked static scales), where DANCE further provides dynamic scales.

Method	w/o DANCE			w/ DANCE
Image Scales	368 $\times$ 368	496 $\times$ 496	592 $\times$ 592	Dynamic
IoU of Wall (%)	50.56	48.68	46.40	52.69
IoU of Motorcycle (%)	53.96	57.51	55.09	58.23
IoU of Traffic Sign (%)	70.17	72.41	72.94	73.58

Here we compare the inference mIoU with DANCE turned off and on, applied on top of DeepLabv3+, when testing representative large-, medium-, and small-scale objects (i.e., wall, motorcycle, and traffic sign) in Cityscapes. Table 5 shows that (1) without DANCE, the best static image scale depends on the object scale: the large-scale object (wall) favors the smallest image scale, whereas the small-scale object (traffic sign) favors the largest one; and (2) DANCE, which inherently applies dynamic scales to its input data, consistently outperforms all static-scale baselines, even for these manually selected objects that have static scales by design. This indicates the advantage of DANCE’s automated choice of adaptive data scales and validates its inherent advantage in handling datasets/tasks whose objects have different scales, which is common for semantic segmentation datasets (e.g., Cityscapes (Cordts et al., 2016b), CamVid (Brostow et al., 2008), and BDD (Yu et al., 2018b)).

Table 6. The mIoU of using the proposed $p$ (in Section 3.2), random $p$, or inverse $p$ indicator in DANCE on top of DeepLabv3+@CityScapes under the same training cost budget.

Method	Train. FLOPs (P)	mIoU(%)
Proposed $p$ indicator	127.41	73.12
Random $p$ indicator	+0.06%	-4.08
Inverse $p$ indicator, i.e., 1- $p$	+0.00%	-11.03

4.3.3. Ablation study of the $p$ indicator’s effectiveness

The spatial complexity indicator presented in Section 3.2 provides a variable $p\in[0,1]$ that estimates a given image’s complexity and directly guides the slimming direction (e.g., the image’s downsampling ratio). As shown in Table 6, replacing the proposed $p$ indicator in DANCE with an inverse $p$ or a random $p$ drops the resulting mIoU by 11.03% or 4.08%, respectively, under the same training cost budget, validating the effectiveness of our proposed $p$ indicator. Additionally, Fig. 6 visualizes 24 image samples randomly selected from the image groups with the largest 33%, medium 33%, and smallest 33% spatial complexity in the Cityscapes (Cordts et al., 2016b) training dataset. As expected, the image complexity identified by the adopted indicator is consistent with that perceived by human eyes, e.g., images with spatial complexity falling within the smallest 33% of the dataset have a simpler background and include fewer objects.
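To make the three settings of Table 6 concrete, the following minimal sketch builds the proposed, inverse, and random indicators from a list of per-image $p$ values. It assumes the $p$ values have already been computed as in Section 3.2, and that the random indicator is drawn uniformly from $[0,1)$; the uniform choice and the function name are assumptions for illustration, not details stated in the paper.

```python
import random


def indicator_variants(p_values, seed=0):
    """Build the three indicator settings compared in the ablation.

    p_values: per-image spatial complexity values, each in [0, 1].
    seed:     seed for the random variant (hypothetical parameter).
    """
    rng = random.Random(seed)
    return {
        # The indicator as computed from the image.
        "proposed": list(p_values),
        # 1 - p: complex images are treated as simple and vice versa.
        "inverse": [1.0 - p for p in p_values],
        # Assumed uniform draw, unrelated to the image content.
        "random": [rng.random() for _ in p_values],
    }


variants = indicator_variants([0.0, 0.25, 1.0])
print(variants["proposed"])
print(variants["inverse"])
```

The inverse variant reverses the complexity ranking of the images, which is consistent with it causing the largest mIoU drop in Table 6.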

4.3.4. Ablation study of DANCE’s effectiveness regardless of the adopted pruning methods

Table 7. The difference in the number of pruned weights (DANCE’s minus AMS’s) for different kinds of layers under various pruning ratios on a DeepLabv3+ model (with a ResNet50 backbone and an output stride of 16) with the Cityscapes dataset.

Pruning	#pruned weights (DANCE’s - AMS’s)
Ratio	ASPP Module Layers	Other Layers
20%	61759	-61759
30%	96455	-96455
40%	102535	-102535
50%	96674	-96674
60%	76767	-76767
70%	51177	-51177
80%	24986	-24987

We consistently find that DANCE’s advantage in enabling data-model co-optimization holds regardless of the adopted pruning method. For example, Table 7 summarizes the pruning results when turning on and off DANCE’s automated data slimming during pruning, where we adopt the unstructured pruning in (Han et al., 2015a). The observations are similar to those made when using the channel-wise pruning in (Liu et al., 2017). Specifically, training with both automated data and model slimming, i.e., DANCE, always prunes more weights in layers corresponding to multi-scale aggregation (e.g., the ASPP module in DeepLabv3+) and fewer weights in other layers, under all seven considered pruning ratios between 20% and 80%, whereas merely using DANCE’s automated model slimming (AMS) does the opposite. This set of experimental results further confirms that (1) DANCE’s automated data slimming can (partially) emulate the effect of multi-scale aggregation in segmentation models, and thus enables a higher pruning ratio on the corresponding multi-scale aggregation modules, and (2) matching the data with the model can potentially improve the model accuracy and remove redundant costs associated with both the data and model, thus achieving “all-win”, which is consistent with the results in Fig. 1.
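The two columns of Table 7 are (almost exactly) negatives of each other, which is what one would expect if a fixed fraction of all weights is pruned: weights pruned additionally in one layer group must be spared in another. The sketch below illustrates this with unstructured magnitude pruning under a single global threshold. The global threshold is an assumption made for illustration rather than a detail confirmed by the text, and the layer names and toy weights are hypothetical.

```python
def global_magnitude_prune_counts(layers, ratio):
    """Count weights pruned per layer group under one global magnitude threshold.

    layers: dict mapping a (hypothetical) layer-group name to a list of weights.
    ratio:  fraction of all weights to prune, smallest magnitudes first.
    """
    # Tag every weight magnitude with its layer group, then sort globally.
    tagged = sorted(
        (abs(w), name) for name, weights in layers.items() for w in weights
    )
    n_prune = int(len(tagged) * ratio)
    counts = {name: 0 for name in layers}
    for _, name in tagged[:n_prune]:
        counts[name] += 1
    return counts


# Toy weights: with data slimming, the ASPP weights are assumed to end up small.
with_data_slimming = {"aspp": [0.01, 0.02, 0.03, 0.9], "other": [0.5, 0.6, 0.7, 0.8]}
without_data_slimming = {"aspp": [0.5, 0.6, 0.7, 0.9], "other": [0.01, 0.02, 0.03, 0.8]}

pruned_with = global_magnitude_prune_counts(with_data_slimming, 0.5)
pruned_without = global_magnitude_prune_counts(without_data_slimming, 0.5)
difference = {k: pruned_with[k] - pruned_without[k] for k in pruned_with}
print(difference)
```

In this toy setting the per-group differences sum to zero, mirroring the sign pattern of Table 7.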

4.3.5. Ablation study of DANCE’s automated data slimming

As described in Section 3.2, DANCE’s automated data slimming integrates three techniques, including complexity-adaptive downsampling (CAD), complexity-adaptive stochastic dropping (CASD), and complexity-adaptive loss (CAL), which are guided by the adopted spatial complexity indicator. In this subsection, we evaluate the efficacy of these techniques and their different combinations on top of DANCE’s automated network slimming (ANS) (see Section 3.3), in terms of the resulting task accuracy (mIoU), and computational and energy savings of both inference and training, as summarized in Table 8. Note that all the task accuracy and computational and energy savings are normalized to those of the standard DeepLabv3+ (Chen et al., 2018) model and Cityscapes dataset (See row No. 1 of Table 8). We next discuss the observations in terms of the “all-win” goal (i.e., reducing both the training and inference costs while improving the achieved model accuracy (mIoU)):

Table 8. Ablation studies on the component techniques of DANCE’s automated data slimming on the DeepLabv3+ model (with a ResNet50 backbone and an output stride of 16) and the Cityscapes validation dataset, where $\dagger$ (i.e., No. 7) is our DANCE setting.

No.	ANS^a	CAD^b	CAL^c	CASD^d	RD^e	Train. FLOPs	Train. Energy	Infer. FLOPs	Infer. Energy	mIoU
1						198.3 (P)	45.21 (MJ)	743.6 (G)	41.32 (J)	69.71(%)
2	✓					-21.16%	-8.08%	-34.34%	-14.36%	-1.52
3	✓	✓				-57.05%	-51.12%	-60.96%	-46.52%	-2.54
4	✓	✓	✓			-56.98%	-51.45%	-61.15%	-46.96%	-0.63
5	✓				✓	-13.73%	-15.90%	-25.07%	-14.21%	+1.96
6	✓			✓		-13.92%	-14.10%	-25.21%	-15.71%	+3.33
$\mathbf{7^{\dagger}}$	✓	✓	✓	✓		-35.75%	-39.69%	-53.67%	-45.02%	+3.41

^a ANS: Automated Network Slimming
^b CAD: Complexity-Adaptive Downsampling
^c CAL: Complexity-Adaptive Loss
^d CASD: Complexity-Adaptive Stochastic Dropping
^e RD: Randomly Drop 50% (Wang et al., 2019)

1. Complexity-Adaptive Downsampling (CAD): Comparing Rows No. 2 and No. 3 shows that CAD+ANS (Row No. 3, i.e., applying CAD, which has the advantage of being “training free”, on top of DANCE’s automated network slimming (ANS)) saves 42.92% and 32.16% more energy in training and inference, respectively, while decreasing the mIoU by 1.02% (i.e., -2.54% vs. -1.52%), as compared to merely performing ANS (Row No. 2), indicating that CAD offers a new trade-off between energy efficiency and mIoU.

2. Complexity-Adaptive Loss (CAL): Comparing the results in Rows No. 3, and No. 4 shows that CAL+CAD+ANS (see Row No. 4, i.e., applying CAL on top of ANS and CAD) can boost the mIoU by 1.91% as compared to merely combining CAD and ANS (i.e., CAD+ANS in Row No. 3), while still reducing 51.45% and 46.96% energy cost in training and inference, respectively, as compared to the DeepLabv3+ baseline (Row No. 1), indicating that adding CAL on top of CAD and ANS can further boost the model accuracy while keeping the achieved energy efficiency.

3. Complexity-Adaptive Stochastic Dropping (CASD): First, comparing Rows No. 5 and No. 6 shows that the proposed CASD (Row No. 6) achieves a 1.37% higher mIoU than the random dropping technique in (Wang et al., 2019) (Row No. 5) at a comparable energy cost for both training and inference, indicating the advantage of complexity-adaptive stochastic dropping over random dropping. Second, comparing Rows No. 4 and No. 7 shows that applying CASD on top of CAL+CAD+ANS (Row No. 7) boosts the mIoU by 4.04% as compared to merely combining ANS, CAD, and CAL (Row No. 4), and by 3.41% as compared to the DeepLabv3+ baseline (Row No. 1), while obtaining 39.69% and 45.02% energy savings in training and inference, respectively, as compared to the DeepLabv3+ baseline (Row No. 1).

This set of comparisons indicates the effectiveness of the proposed DANCE’s automated data slimming, i.e., integrating all three component techniques of DANCE’s automated data slimming can achieve the most favorable data-network co-optimization benefits as it achieves the “all-win” goal as shown in Fig. 1.

5. Ablation study of DANCE’s hyperparameters

In this subsection, we evaluate DANCE with different hyperparameters by changing the ranges of (1) the weighted coefficients in complexity-adaptive loss (CAL) and (2) the dropping probability in complexity-adaptive stochastic dropping (CASD) (as described in Section 3.2), and summarize the results in Table 9. To better study the effect of each of these hyperparameters, we fix the others to their defaults (as described in Section 3.2) when tuning one of them.

Note that the larger the ratio between the endpoints of the dynamic range of the weighted coefficients or dropping probabilities is, the more (less) frequently images with a higher (lower) spatial complexity are used. The largest ratios are $1.0/0.0=\infty$ and $100\%/0\%=\infty$ for the weighted coefficients and dropping probabilities, respectively, which are also the default settings mentioned in Section 3.2.
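As a rough illustration of how such ranges could act on an image with complexity $p$, the sketch below maps $p$ linearly onto a range: the loss weight grows with $p$, and the dropping probability shrinks with $p$. The linear mapping, the direction of each endpoint, and all function names are assumptions made for this example; the exact definitions are those of Section 3.2.

```python
import random


def cal_weight(p, high=1.0, low=0.0):
    """Loss weight for an image with complexity p (assumed linear in p)."""
    return low + (high - low) * p


def casd_drop_probability(p, high=1.0, low=0.0):
    """Dropping probability for complexity p (assumed linear, decreasing in p)."""
    return high - (high - low) * p


def keep_image(p, rng, high=1.0, low=0.0):
    """Stochastically decide whether an image is kept for this iteration."""
    return rng.random() >= casd_drop_probability(p, high, low)


rng = random.Random(0)
for p in (0.0, 0.5, 1.0):
    print(
        f"p={p}: weight(2.0-1.0)={cal_weight(p, 2.0, 1.0):.2f}, "
        f"drop(60%-40%)={casd_drop_probability(p, 0.6, 0.4):.2f}, "
        f"kept(100%-0%)={keep_image(p, rng)}"
    )
```

With the 100% - 0% range, the simplest image ($p=0$) is always dropped and the most complex one ($p=1$) is always kept, whereas a narrow range such as 60% - 40% treats all images almost alike.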

The results in Table 9 show that increasing the frequency of training images with a higher spatial complexity (defined in Eq. 1), by increasing the ratio between the endpoints of the dynamic range of the weighted coefficient or dropping probability, favors the segmentation accuracy (i.e., a higher mIoU). This observation is consistent with that of (Jiang et al., 2019; Gain and Siegelmann, 2019). Specifically, changing the dropping probability range from 60% - 40% to 100% - 0% boosts the achieved mIoU by 1.99%, and changing the weighted coefficient range from 2.0 - 1.0 to 1.0 - 0.0 improves the mIoU by 0.82%, while the training and inference costs in both cases mostly stay the same.

Table 9. Ablation study of DANCE’s hyperparameters: DANCE with different ranges of (1) weighted coefficients in complexity-adaptive loss (CAL) and (2) dropping probability in complexity-adaptive stochastic dropping (CASD) on DeepLabv3+ with Cityscapes.

Settings		Train. FLOPs	Train. Energy	Infer. FLOPs	Infer. Energy	mIoU
Range of the Weighted Coefficients in CAL	2.0 - 1.0	124.66 (P)	26.43 (MJ)	358.39 (G)	23.57 (J)	72.30%
	4.0 - 1.0	-1.76%	-2.65%	+0.78%	+0.52%	+0.64%
	1.0 - 0.0	+2.21%	+3.18%	-3.87%	-3.60%	+0.82%
Range of the Dropping Probability in CASD	60% - 40%	128.05 (P)	27.55 (MJ)	345.21 (G)	22.74 (J)	71.13%
	75% - 25%	-0.88%	-0.57%	-1.37%	-1.22%	+0.27%
	100% - 0%	-0.50%	-1.02%	-0.20%	-0.07%	+1.99%

6. Conclusions

We proposed DANCE for boosting segmentation efficiency during both training and inference, leveraging the hypothesis that maximum model accuracy and efficiency should be achieved when the data and model are optimally matched. On the “data-level”, DANCE’s automated data slimming not only halves the computational and energy costs, but also boosts the segmentation accuracy. Interestingly, DANCE’s automated data slimming can emulate the effect of multi-scale feature extraction at a much lower cost. This further motivates DANCE’s automated network slimming on the “model-level”, which automatically prunes the model to adapt to the data slimmed by DANCE’s automated data slimming and leads to more pruning in the cost-dominant building blocks for multi-scale feature extraction, validating our hypothesis and further reducing both training and inference costs. Extensive experiments and ablation studies validate DANCE’s effectiveness and superiority, which reside in its capability to automatically match the data and network via automated co-optimization.

References

Brostow et al. (2008) Gabriel J Brostow, Jamie Shotton, Julien Fauqueur, and Roberto Cipolla. 2008. Segmentation and recognition using structure from motion point clouds. In European conference on computer vision. Springer, 44–57.
Chen et al. (2018) Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV). 801–818.
Chen et al. (2019a) Wuyang Chen, Xinyu Gong, Xianming Liu, Qian Zhang, Yuan Li, and Zhangyang Wang. 2019a. FasterSeg: Searching for Faster Real-time Semantic Segmentation. arXiv preprint arXiv:1912.10917 (2019).
Chen et al. (2019b) Wuyang Chen, Ziyu Jiang, Zhangyang Wang, Kexin Cui, and Xiaoning Qian. 2019b. Collaborative global-local networks for memory-efficient segmentation of ultra-high resolution images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8924–8933.
Cordts et al. (2016a) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016a. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3213–3223.
Cordts et al. (2016b) Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. 2016b. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Dai et al. (2017) Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. 2017. Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision. 764–773.
Dodge and Commenges (2006) Yadolah Dodge and Daniel Commenges. 2006. The Oxford dictionary of statistical terms. Oxford University Press on Demand.
Engelbrecht et al. (1999) Andries P Engelbrecht, L Fletcher, and Ian Cloete. 1999. Variance analysis of sensitivity information for pruning multilayer feedforward neural networks. In IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No. 99CH36339), Vol. 3. IEEE, 1829–1833.
Li et al. (2017) Xiaoxiao Li, Ziwei Liu, Ping Luo, Chen Change Loy, and Xiaoou Tang. 2017. Not All Pixels Are Equal: Difficulty-Aware Semantic Segmentation via Deep Layer Cascade. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017). https://doi.org/10.1109/cvpr.2017.684
Gain and Siegelmann (2019) Alex Gain and Hava Siegelmann. 2019. Relating information complexity and training in deep neural networks. In Micro-and Nanotechnology Sensors, Systems, and Applications XI, Vol. 10982. International Society for Optics and Photonics, 109822H.
Han et al. (2015a) Song Han, Huizi Mao, and William J Dally. 2015a. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149 (2015).
Han et al. (2015b) Song Han, Jeff Pool, John Tran, and William Dally. 2015b. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems. 1135–1143.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
He et al. (2019) Tong He, Chunhua Shen, Zhi Tian, Dong Gong, Changming Sun, and Youliang Yan. 2019. Knowledge Adaptation for Efficient Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 578–587.
He et al. (2017) Yihui He, Xiangyu Zhang, and Jian Sun. 2017. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision. 1389–1397.
Hubara et al. (2017) Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research 18, 1 (2017), 6869–6898.
Jiang et al. (2019) Angela H. Jiang, Daniel L. K. Wong, Giulio Zhou, David G. Andersen, Jeffrey Dean, Gregory R. Ganger, Gauri Joshi, Michael Kaminsky, Michael Kozuch, Zachary C. Lipton, and Padmanabhan Pillai. 2019. Accelerating Deep Learning by Focusing on the Biggest Losers.
Liu et al. (2017) Zhuang Liu, Jianguo Li, Zhiqiang Shen, Gao Huang, Shoumeng Yan, and Changshui Zhang. 2017. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision. 2736–2744.
Luo et al. (2017) Jian-Hao Luo, Jianxin Wu, and Weiyao Lin. 2017. Thinet: A filter level pruning method for deep neural network compression. In Proceedings of the IEEE international conference on computer vision. 5058–5066.
Lym et al. (2019) Sangkug Lym, Esha Choukse, Siavash Zangeneh, Wei Wen, Sujay Sanghavi, and Mattan Erez. 2019. PruneTrain: Fast Neural Network Training by Dynamic Sparse Model Reconfiguration.
Marin et al. (2019) Dmitrii Marin, Zijian He, Peter Vajda, Priyam Chatterjee, Sam Tsai, Fei Yang, and Yuri Boykov. 2019. Efficient Segmentation: Learning Downsampling Near Semantic Boundaries. arXiv preprint arXiv:1907.07156 (2019).
Maxwell (1860) James Clerk Maxwell. 1860. V. Illustrations of the dynamical theory of gases.—Part I. On the motions and collisions of perfectly elastic spheres. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 19, 124 (1860), 19–32.
Mishra et al. (2019) Suraj Mishra, Peixian Liang, Adam Czajka, Danny Z Chen, and X Sharon Hu. 2019. CC-NET: Image Complexity Guided Network Compression for Biomedical Image Segmentation. In 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019). IEEE, 57–60.
NVIDIA Inc. ([n.d.]) NVIDIA Inc. [n.d.]. NVIDIA Jetson TX2. https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-tx2/, accessed 2019-09-01.
Paszke et al. (2016) Adam Paszke, Abhishek Chaurasia, Sangpil Kim, and Eugenio Culurciello. 2016. Enet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147 (2016).
Paszke et al. (2017) Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS-W.
Polino et al. (2018) Antonio Polino, Razvan Pascanu, and Dan Alistarh. 2018. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668 (2018).
Renda et al. (2020) Alex Renda, Jonathan Frankle, and Michael Carbin. 2020. Comparing Rewinding and Fine-tuning in Neural Network Pruning. In International Conference on Learning Representations.
Sandler et al. (2018) Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4510–4520.
Strubell et al. (2019) Emma Strubell, Ananya Ganesh, and Andrew McCallum. 2019. Energy and Policy Considerations for Deep Learning in NLP. arXiv preprint arXiv:1906.02243 (2019).
Wang et al. (2019) Yue Wang, Ziyu Jiang, Xiaohan Chen, Pengfei Xu, Yang Zhao, Yingyan Lin, and Zhangyang Wang. 2019. E2-Train: Training State-of-the-art CNNs with Over 80% Less Energy. In Advances in Neural Information Processing Systems.
Xu et al. (2019) Ran Xu, Jinkyu Koo, Rakesh Kumar, Peter Bai, Subrata Mitra, Ganga Maghanath, and Saurabh Bagchi. 2019. ApproxNet: Content and Contention Aware Video Analytics System for the Edge. arXiv preprint arXiv:1909.02068 (2019).
Yang et al. (2017) Tien-Ju Yang, Yu-Hsin Chen, and Vivienne Sze. 2017. Designing energy-efficient convolutional neural networks using energy-aware pruning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5687–5695.
Ye et al. (2018) Shaokai Ye, Tianyun Zhang, Kaiqi Zhang, Jiayu Li, Kaidi Xu, Yunfei Yang, Fuxun Yu, Jian Tang, Makan Fardad, Sijia Liu, et al. 2018. Progressive weight pruning of deep neural networks using ADMM. arXiv preprint arXiv:1810.07378 (2018).
You et al. (2019) Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Yingyan Lin, Zhangyang Wang, and Richard G Baraniuk. 2019. Drawing early-bird tickets: Towards more efficient training of deep networks. arXiv preprint arXiv:1909.11957 (2019).
Yu et al. (2018a) Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018a. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV). 325–341.
Yu and Koltun (2016) Fisher Yu and Vladlen Koltun. 2016. Multi-scale context aggregation by dilated convolutions. ICLR (2016).
Yu et al. (2018b) Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. 2018b. Bdd100k: A diverse driving video database with scalable annotation tooling. arXiv preprint arXiv:1805.04687 (2018).
Yu and Winkler (2013) H. Yu and S. Winkler. 2013. Image complexity and spatial information. In 2013 Fifth International Workshop on Quality of Multimedia Experience (QoMEX). 12–17. https://doi.org/10.1109/QoMEX.2013.6603194
Zhang et al. (2018) Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. 2018. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 6848–6856.
Zhao et al. (2018) Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. 2018. Icnet for real-time semantic segmentation on high-resolution images. In Proceedings of the European Conference on Computer Vision (ECCV). 405–420.
Zhao et al. (2017) Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. 2017. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2881–2890.