Search | arXiv e-print repository

SDQ: Sparse Decomposed Quantization for LLM Inference

Authors: Geonhwa Jeong, Po-An Tsai, Stephen W. Keckler, Tushar Krishna

Abstract: Recently, large language models (LLMs) have shown surprising performance in task-specific workloads as well as general tasks with the given prompts. However, to achieve unprecedented performance, recent LLMs use billions to trillions of parameters, which hinder the wide adaptation of those models due to their extremely large compute and memory requirements. To resolve the issue, various model compression methods are being actively investigated. In this work, we propose SDQ (Sparse Decomposed Quantization) to exploit both structured sparsity and quantization to achieve both high compute and memory efficiency. From our evaluations, we observe that SDQ can achieve 4x effective compute throughput with <1% quality drop.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2403.07953 [pdf, other]

Abstracting Sparse DNN Acceleration via Structured Sparse Tensor Decomposition

Authors: Geonhwa Jeong, Po-An Tsai, Abhimanyu R. Bambhaniya, Stephen W. Keckler, Tushar Krishna

Abstract: Exploiting sparsity in deep neural networks (DNNs) has been a promising area to meet the growing computation need of modern DNNs. However, in practice, sparse DNN acceleration still faces a key challenge. To minimize the overhead of sparse acceleration, hardware designers have proposed structured sparse hardware support recently, which provides limited flexibility and requires extra model fine-tuning. Moreover, any sparse model fine-tuned for certain structured sparse hardware cannot be accelerated by other structured hardware. To bridge the gap between sparse DNN models and hardware, this paper proposes tensor approximation via structured decomposition (TASD), which leverages the distributive property in linear algebra to turn any sparse tensor into a series of structured sparse tensors. Next, we develop a software framework, TASDER, to accelerate DNNs by searching layer-wise, high-quality structured decomposition for both weight and activation tensors so that they can be accelerated by any systems with structured sparse hardware support. Evaluation results show that, by exploiting prior structured sparse hardware baselines, our method can accelerate off-the-shelf dense and sparse DNNs without fine-tuning and improves energy-delay-product by up to 83% and 74% on average.

Submitted 31 March, 2024; v1 submitted 12 March, 2024; originally announced March 2024.
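
The decomposition idea in the abstract above can be illustrated with a small sketch. It is not the TASD or TASDER implementation: it assumes a 2:4 pattern (at most 2 nonzeros in every group of 4 entries) and a greedy largest-magnitude selection, and the names nm_split and decompose are hypothetical. It shows only that a row can be written as a sum of structured sparse terms, so that by the distributive property its dot product with x equals the sum of the per-term dot products.

```python
def nm_split(row, n=2, m=4):
    """Split a row into one N:M structured term plus the residual.

    In each group of m consecutive entries, the n largest-magnitude
    entries move into the structured term; the rest stay in the residual.
    (Greedy magnitude selection is an assumption of this sketch.)
    """
    term = [0.0] * len(row)
    residual = list(row)
    for start in range(0, len(row), m):
        group = range(start, min(start + m, len(row)))
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:n]
        for i in keep:
            term[i] = row[i]
            residual[i] = 0.0
    return term, residual


def decompose(row, terms=2, n=2, m=4):
    """Peel off `terms` structured terms; return them with what is left."""
    parts = []
    residual = list(row)
    for _ in range(terms):
        term, residual = nm_split(residual, n, m)
        parts.append(term)
    return parts, residual


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


row = [0.0, 3.0, 0.0, -1.0, 2.0, 0.5, -4.0, 1.0]
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# Two 2:4 terms cover every group of 4 entries, so nothing is left over.
parts, leftover = decompose(row, terms=2)
exact = dot(row, x)
approx = sum(dot(p, x) for p in parts)  # distributive property

# A single term is an approximation; the dropped values stay in the residual.
one_part, one_leftover = decompose(row, terms=1)
```

With fewer terms the result is an approximation whose error is carried by the residual; how many terms to use per layer is the kind of choice the abstract attributes to the TASDER search.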

arXiv:2401.02549 [pdf]

doi 10.1142/S0219877023300021

Quantitative Technology Forecasting: a Review of Trend Extrapolation Methods

Authors: Peng-Hung Tsai, Daniel Berleant, Richard S. Segall, Hyacinthe Aboudja, Venkata Jaipal R. Batthula, Sheela Duggirala, Michael Howell

Abstract: Quantitative technology forecasting uses quantitative methods to understand and project technological changes. It is a broad field encompassing many different techniques and has been applied to a vast range of technologies. A widely used approach in this field is trend extrapolation. Based on the publications available to us, there has been little or no attempt made to systematically review the empirical evidence on quantitative trend extrapolation techniques. This study attempts to close this gap by conducting a systematic review of technology forecasting literature addressing the application of quantitative trend extrapolation techniques. We identified 25 studies relevant to the objective of this research and classified the techniques used in the studies into different categories, among which growth curves and time series methods were shown to remain popular over the past decade, while newer methods, such as machine learning-based hybrid models, have emerged in recent years. As more effort and evidence are needed to determine if hybrid models are superior to traditional methods, we expect to see a growing trend in the development and application of hybrid models to technology forecasting.

Submitted 4 January, 2024; originally announced January 2024.

Journal ref: International Journal of Innovation and Technology Management (2023), 20(4):2330002

arXiv:2311.02324 [pdf, other]

Bounded and Unbiased Composite Differential Privacy

Authors: Kai Zhang, Yanjun Zhang, Ruoxi Sun, Pei-Wei Tsai, Muneeb Ul Hassan, Xin Yuan, Minhui Xue, Jinjun Chen

Abstract: The objective of differential privacy (DP) is to protect privacy by producing an output distribution that is indistinguishable between any two neighboring databases. However, traditional differentially private mechanisms tend to produce unbounded outputs in order to achieve maximum disturbance range, which is not always in line with real-world applications. Existing solutions attempt to address this issue by employing post-processing or truncation techniques to restrict the output results, but at the cost of introducing bias issues. In this paper, we propose a novel differentially private mechanism which uses a composite probability density function to generate bounded and unbiased outputs for any numerical input data. The composition consists of an activation function and a base function, providing users with the flexibility to define the functions according to the DP constraints. We also develop an optimization algorithm that enables the iterative search for the optimal hyper-parameter setting without the need for repeated experiments, which prevents additional privacy overhead. Furthermore, we evaluate the utility of the proposed mechanism by assessing the variance of the composite probability density function and introducing two alternative metrics that are simpler to compute than variance estimation. Our extensive evaluation on three benchmark datasets demonstrates consistent and significant improvement over the traditional Laplace and Gaussian mechanisms. The proposed bounded and unbiased composite differentially private mechanism will underpin the broader DP arsenal and foster future privacy-preserving studies.

Submitted 4 November, 2023; originally announced November 2023.

Comments: Accepted at 45th IEEE Symposium on Security and Privacy (IEEE S&P)

arXiv:2305.12718 [pdf, other]

doi 10.1145/3613424.3623786

HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity

Authors: Yannan Nellie Wu, Po-An Tsai, Saurav Muralidharan, Angshuman Parashar, Vivienne Sze, Joel S. Emer

Abstract: Due to complex interactions among various deep neural network (DNN) optimization techniques, modern DNNs can have weights and activations that are dense or sparse with diverse sparsity degrees. To offer a good trade-off between accuracy and hardware performance, an ideal DNN accelerator should have high flexibility to efficiently translate DNN sparsity into reductions in energy and/or latency without incurring significant complexity overhead. This paper introduces hierarchical structured sparsity (HSS), with the key insight that we can systematically represent diverse sparsity degrees by having them hierarchically composed from multiple simple sparsity patterns. As a result, HSS simplifies the underlying hardware since it only needs to support simple sparsity patterns; this significantly reduces the sparsity acceleration overhead, which improves efficiency. Motivated by such opportunities, we propose a simultaneously efficient and flexible accelerator, named HighLight, to accelerate DNNs that have diverse sparsity degrees (including dense). Due to the flexibility of HSS, different HSS patterns can be introduced to DNNs to meet different applications' accuracy requirements. Compared to existing works, HighLight achieves a geomean of up to 6.4x better energy-delay product (EDP) across workloads with diverse sparsity degrees, and always sits on the EDP-accuracy Pareto frontier for representative DNNs.

Submitted 1 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to MICRO23

arXiv:2303.13631 [pdf, other]

In-depth analysis of music structure as a text network

Authors: Ping-Rui Tsai, Yen-Ting Chou, Nathan-Christopher Wang, Hui-Ling Chen, Hong-Yue Huang, Zih-Jia Luo, Tzay-Ming Hong

Abstract: Music, enchanting and poetic, permeates every corner of human civilization. Although music is not unfamiliar to people, our understanding of its essence remains limited, and there is still no universally accepted scientific description. This is primarily due to music being regarded as a product of both reason and emotion, making it difficult to define. In this article, we focus on the fundamental elements of music and construct an evolutionary network from the perspective of music as a natural language, aligning with the statistical characteristics of texts. Through this approach, we aim to comprehend the structural differences in music across different periods, enabling a more scientific exploration of music. Relying on the advantages of structuralism, we can concentrate on the relationships and order between the physical elements of music, rather than getting entangled in the blurred boundaries of science and philosophy. The scientific framework we present not only conforms to past conclusions in music, but also serves as a bridge that connects music to natural language processing and knowledge graphs.

Submitted 2 January, 2024; v1 submitted 21 March, 2023; originally announced March 2023.

Comments: 7 pages, 8 figures

arXiv:2210.03731 [pdf, other]

Demystifying Map Space Exploration for NPUs

Authors: Sheng-Chun Kao, Angshuman Parashar, Po-An Tsai, Tushar Krishna

Abstract: Map Space Exploration is the problem of finding optimized mappings of a Deep Neural Network (DNN) model on an accelerator. It is known to be extremely computationally expensive, and there has been active research looking at both heuristics and learning-based methods to make the problem computationally tractable. However, while there are dozens of mappers out there (all empirically claiming to find better mappings than others), the research community lacks systematic insights on how different search techniques navigate the map-space and how different mapping axes contribute to the accelerator's performance and efficiency. Such insights are crucial to developing mapping frameworks for emerging DNNs that are increasingly irregular (due to neural architecture search) and sparse, making the corresponding map spaces much more complex. In this work, rather than proposing yet another mapper, we do a first-of-its-kind apples-to-apples comparison of search techniques leveraged by different mappers. Next, we extract the learnings from our study and propose two new techniques that can augment existing mappers -- warm-start and sparsity-aware -- that demonstrate speedups, scalability, and robustness across diverse DNN models.

Submitted 7 October, 2022; originally announced October 2022.

arXiv:2209.05653 [pdf]

doi 10.1007/s10489-023-05259-z

Semantic2Graph: Graph-based Multi-modal Feature Fusion for Action Segmentation in Videos

Authors: Junbin Zhang, Pei-Hsuan Tsai, Meng-Hsun Tsai

Abstract: Video action segmentation has been widely applied in many fields. Most previous studies employed video-based vision models for this purpose. However, they often rely on a large receptive field, LSTM or Transformer methods to capture long-term dependencies within videos, leading to significant computational resource requirements. To address this challenge, graph-based models were proposed. However, previous graph-based models are less accurate. Hence, this study introduces a graph-structured approach named Semantic2Graph, to model long-term dependencies in videos, thereby reducing computational costs and raising accuracy. We construct a graph structure of video at the frame-level. Temporal edges are utilized to model the temporal relations and action order within videos. Additionally, we have designed positive and negative semantic edges, accompanied by corresponding edge weights, to capture both long-term and short-term semantic relationships in video actions. Node attributes encompass a rich set of multi-modal features extracted from video content, graph structures, and label text, encompassing visual, structural, and semantic cues. To synthesize this multi-modal information effectively, we employ a graph neural network (GNN) model to fuse multi-modal features for node action label classification. Experimental results demonstrate that Semantic2Graph outperforms state-of-the-art methods in terms of performance, particularly on benchmark datasets such as GTEA and 50Salads. Multiple ablation experiments further validate the effectiveness of semantic features in enhancing model performance. Notably, the inclusion of semantic edges in Semantic2Graph allows for the cost-effective capture of long-term dependencies, affirming its utility in addressing the challenges posed by computational resource constraints in video-based vision models.

Submitted 6 February, 2024; v1 submitted 12 September, 2022; originally announced September 2022.

Comments: 13 pages, 3 figures, 9 tables. Published in Applied Intelligence

MSC Class: 68T01; 68T30; 68T45

ACM Class: I.2.10; I.4.8; I.5

Journal ref: Applied Intelligence (2024)

arXiv:2205.05826 [pdf, other]

Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling

Authors: Yannan Nellie Wu, Po-An Tsai, Angshuman Parashar, Vivienne Sze, Joel S. Emer

Abstract: In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space. The lack of systematic description and modeling support for these sparse tensor accelerators impedes hardware designers from efficient and effective design space exploration. This paper first presents a unified taxonomy to systematically describe the diverse sparse tensor accelerator design space. Based on the proposed taxonomy, it then introduces Sparseloop, the first fast, accurate, and flexible analytical modeling framework to enable early-stage evaluation and exploration of sparse tensor accelerators. Sparseloop comprehends a large set of architecture specifications, including various dataflows and sparse acceleration features (e.g., elimination of zero-based compute). Using these specifications, Sparseloop evaluates a design's processing speed and energy efficiency while accounting for data movement and compute incurred by the employed dataflow as well as the savings and overhead introduced by the sparse acceleration features using stochastic tensor density models. Across representative accelerators and workloads, Sparseloop achieves over 2000 times faster modeling speed than cycle-level simulations, maintains relative performance trends, and achieves 0.1% to 8% average error. With a case study, we demonstrate Sparseloop's ability to help reveal important insights for designing sparse tensor accelerators (e.g., it is important to co-design orthogonal design aspects).

Submitted 9 January, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: Update website link, update UOP format description

arXiv:2205.01252 [pdf, other]

SIMD$^2$: A Generalized Matrix Instruction Set for Accelerating Tensor Computation beyond GEMM

Authors: Yunan Zhang, Po-An Tsai, Hung-Wei Tseng

Abstract: Matrix-multiplication units (MXUs) are now prevalent in every computing platform. The key attribute that makes MXUs so successful is the semiring structure, which allows tiling for both parallelism and data reuse. Nonetheless, matrix-multiplication is not the only algorithm with such attributes. We find that many algorithms share the same structure and differ in only the core operation; for example, using add-minimum instead of multiply-add. Algorithms with a semiring-like structure therefore have potential to be accelerated by a general-purpose matrix operation architecture, instead of common MXUs. In this paper, we propose SIMD$^2$, a new programming paradigm to support generalized matrix operations with a semiring-like structure. SIMD$^2$ instructions accelerate eight more types of matrix operations, in addition to matrix multiplications. Since SIMD$^2$ instructions resemble a matrix-multiplication instruction, we are able to build SIMD$^2$ architecture on top of any MXU architecture with minimal modifications. We developed a framework that emulates and validates SIMD$^2$ using NVIDIA GPUs with Tensor Cores. Across 8 applications, SIMD$^2$ provides up to 38.59$\times$ speedup and more than 10.63$\times$ on average over optimized CUDA programs, with only 5% of full-chip area overhead.

Submitted 31 August, 2022; v1 submitted 2 May, 2022; originally announced May 2022.

Comments: To Appear in the 49th International Symposium on Computer Architecture (ISCA'22), June 18--22, 2022, New York, NY, USA
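
The "same structure, different core operation" observation in the abstract above can be sketched in plain software. This is not the SIMD$^2$ instruction set or its emulation framework; semiring_matmul and its parameters are hypothetical names chosen for illustration. The same triple loop computes an ordinary matrix product with (multiply, add) and a shortest-path relaxation step with (add, minimum).

```python
import math


def semiring_matmul(a, b, combine, reduce_, identity):
    """Matrix 'product' where combine replaces multiply and reduce_ replaces add.

    identity is the neutral element of reduce_ (0 for +, infinity for min).
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[identity] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = identity
            for k in range(inner):
                acc = reduce_(acc, combine(a[i][k], b[k][j]))
            out[i][j] = acc
    return out


# Ordinary matrix multiplication: combine = multiply, reduce = add.
product = semiring_matmul(
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
    combine=lambda x, y: x * y,
    reduce_=lambda x, y: x + y,
    identity=0,
)

# Min-plus step: combine = add, reduce = minimum.
# graph[i][j] is the edge weight from node i to node j (inf = no edge).
INF = math.inf
graph = [
    [0, 4, INF],
    [INF, 0, 1],
    [2, INF, 0],
]
two_hop = semiring_matmul(
    graph, graph, combine=lambda x, y: x + y, reduce_=min, identity=INF
)
# two_hop[i][j] is the cheapest path from i to j using at most two edges.
```

Because only the two inner operations change, the tiling and data-reuse structure of the loop nest is shared, which is the property the abstract relies on when building on top of an existing MXU.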

arXiv:2201.09717 [pdf, other]

Keeping Deep Lithography Simulators Updated: Global-Local Shape-Based Novelty Detection and Active Learning

Authors: Hao-Chiang Shao, Hsing-Lei Ping, Kuo-shiuan Chen, Weng-Tai Su, Chia-Wen Lin, Shao-Yun Fang, Pin-Yian Tsai, Yan-Hsiu Liu

Abstract: Learning-based pre-simulation (i.e., layout-to-fabrication) models have been proposed to predict the fabrication-induced shape deformation from an IC layout to its fabricated circuit. Such models are usually driven by pairwise learning, involving a training set of layout patterns and their reference shape images after fabrication. However, it is expensive and time-consuming to collect the reference shape images of all layout clips for model training and updating. To address the problem, we propose a deep learning-based layout novelty detection scheme to identify novel (unseen) layout patterns, which cannot be well predicted by a pre-trained pre-simulation model. We devise a global-local novelty scoring mechanism to assess the potential novelty of a layout by exploiting two subnetworks: an autoencoder and a pretrained pre-simulation model. The former characterizes the global structural dissimilarity between a given layout and training samples, whereas the latter extracts a latent code representing the fabrication-induced local deformation. By integrating the global dissimilarity with the local deformation boosted by a self-attention mechanism, our model can accurately detect novelties without the ground-truth circuit shapes of test samples. Based on the detected novelties, we further propose two active-learning strategies to sample a reduced amount of representative layouts most worthy to be fabricated for acquiring their ground-truth circuit shapes. Experimental results demonstrate i) our method's effectiveness in layout novelty detection, and ii) our active-learning strategies' ability in selecting representative novel layouts for keeping a learning-based pre-simulation model updated.

Submitted 24 January, 2022; originally announced January 2022.

arXiv:2109.15088 [pdf]

doi 10.3390/s22010341

An Efficient Probe-based Routing for Content-Centric Networking

Authors: Pei-Hsuan Tsai, Junbin Zhang, Meng-Hsun Tsai

Abstract: With the development of new technologies and applications, such as the Internet of Things, smart cities, 5G, and edge computing, traditional Internet Protocol-based (IP-based) networks have been exposed as having many problems. Information-Centric Networking (ICN), Named Data Networking (NDN), and Content-Centric Networking (CCN) are therefore proposed as an alternative for future networks. However, unlike IP-based networks, CCN routing is nondeterministic and difficult to optimize due to frequent in-network caching replacement. This paper presents a novel probe-based routing algorithm that explores real-time in-network caching to ensure the routing table storing the optimal paths to the nearest content provider is up-to-date. Effective probe-selections, Pending Interest Table (PIT) probe, and Forwarding Information Base (FIB) probe are discussed and analyzed by simulation with different performance measurements. Compared with the basic CCN, in terms of qualitative analysis, the additional computational overhead of our approach is O(*) and O(*) on processing interest packets and data packets, respectively. However, in terms of quantitative analysis, our approach reduces the number of timeout interests by 6% and the average response time by 0.6 s. Furthermore, although basic CCN and our approach belong to the same Quality of Service category, our approach outperforms basic CCN in terms of real values. Additionally, our probe-based approach performs better than * and *. Owing to faster FIB updating by probes, our approach provides more reliable interest packet routing when accounting for router failures. In summary, the results demonstrate that compared to basic CCN, our probe-based routing approach raises FIB accuracy and reduces network congestion and response time, resulting in efficient routing.

Submitted 10 January, 2022; v1 submitted 30 September, 2021; originally announced September 2021.

Comments: 16 pages, 9 figures, 3 tables

MSC Class: 68-11

ACM Class: C.2.1; C.2.2

Journal ref: Sensors, 2022; 22(1):341

arXiv:2109.07419 [pdf, other]

Union: A Unified HW-SW Co-Design Ecosystem in MLIR for Evaluating Tensor Operations on Spatial Accelerators

Authors: Geonhwa Jeong, Gokcen Kestor, Prasanth Chatarasi, Angshuman Parashar, Po-An Tsai, Sivasankaran Rajamanickam, Roberto Gioiosa, Tushar Krishna

Abstract: To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these "domain-specific" accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address this challenge as a whole, in this work, we present a HW-SW co-design ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.

Submitted 6 November, 2021; v1 submitted 15 September, 2021; originally announced September 2021.

Comments: This paper is accepted to PACT 2021

arXiv:2106.11181 [pdf]

doi 10.1109/ICS51289.2020.00060

A Query-based Routing Table Update Mechanism for Content-Centric Network

Authors: Pei-Hsuan Tsai, Yu-Lin Tseng, Jun-Bin Zhang, Meng-Hsun Tsai

Abstract: Due to the popularity of network applications, such as multimedia, online shopping, Internet of Things (IoT), and 5G, the contents cached in the routers are frequently replaced in Content-Centric Networking (CCN). Generally, a cache miss causes numerous packets to be propagated to fetch the required content, which worsens network congestion and delays the response time of consumers. Many caching strategies and routing policies were proposed to solve the problem. This paper presents an alternative solution by designing a query-based routing table update mechanism to increase the accuracy of routing tables. By adding an additional query content in interest packets, our approach explores the cached content in routers in real time and updates the routing table accordingly. This paper uses a general network simulator, ndnSIM, to compare basic CCN and our approach. The results show that our approach improves the response time of consumers and network congestion and is compatible with general forwarding strategies.

Submitted 21 June, 2021; originally announced June 2021.

Comments: 6 pages, 14 figures, conference. ISBN:978-1-7281-9256-7

ACM Class: C.2.2

Journal ref: 2020 International Computer Symposium (ICS), 2020, pp. 266-271

arXiv:2103.01489 [pdf, other]

doi 10.1145/3445814.3446762

Mind Mappings: Enabling Efficient Algorithm-Accelerator Mapping Space Search

Authors: Kartik Hegde, Po-An Tsai, Sitao Huang, Vikas Chandra, Angshuman Parashar, Christopher W. Fletcher

Abstract: Modern day computing increasingly relies on specialization to satiate growing performance and efficiency requirements. A core challenge in designing such specialized hardware architectures is how to perform mapping space search, i.e., search for an optimal mapping from algorithm to hardware. Prior work shows that choosing an inefficient mapping can lead to multiplicative-factor efficiency overheads. Additionally, the search space is not only large but also non-convex and non-smooth, precluding advanced search techniques. As a result, previous works are forced to implement mapping space search using expert choices or sub-optimal search heuristics. This work proposes Mind Mappings, a novel gradient-based search method for algorithm-accelerator mapping space search. The key idea is to derive a smooth, differentiable approximation to the otherwise non-smooth, non-convex search space. With a smooth, differentiable approximation, we can leverage efficient gradient-based search algorithms to find high-quality mappings. We extensively compare Mind Mappings to black-box optimization schemes used in prior work. When tasked to find mappings for two important workloads (CNN and MTTKRP), the proposed search finds mappings that achieve an average $1.40\times$, $1.76\times$, and $1.29\times$ (when run for a fixed number of steps) and $3.16\times$, $4.19\times$, and $2.90\times$ (when run for a fixed amount of time) better energy-delay product (EDP) relative to Simulated Annealing, Genetic Algorithms and Reinforcement Learning, respectively. Meanwhile, Mind Mappings returns mappings with only $5.32\times$ higher EDP than a possibly unachievable theoretical lower-bound, indicating proximity to the global optima.

Submitted 2 March, 2021; originally announced March 2021.

Comments: Appears in the proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19-23, 2021, Virtual, USA
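
Illustrative note: the key idea in the abstract above (replace a non-smooth cost with a smooth, differentiable approximation, then run gradient-based search on it) can be sketched on a toy one-dimensional problem. Everything in this sketch is an assumption made for illustration: the cost function `true_cost`, the quadratic least-squares surrogate, and the names `fit_surrogate` and `gradient_search` are hypothetical, and a polynomial fit is swapped in here purely for brevity. It is not the surrogate or the search procedure used in the paper.

```python
import numpy as np


def true_cost(x):
    """Hypothetical non-smooth cost: piecewise constant in x,
    so its gradient is zero almost everywhere."""
    f = np.floor(x)
    return (f - 3.0) ** 2 + (f % 2)


def fit_surrogate(lo=0.0, hi=8.0, samples=81, degree=2):
    """Fit a smooth polynomial surrogate to samples of the true cost
    (least squares; an assumption of this sketch)."""
    xs = np.linspace(lo, hi, samples)
    coeffs = np.polyfit(xs, true_cost(xs), degree)
    return np.poly1d(coeffs)


def gradient_search(surrogate, x0=0.0, lr=0.1, steps=200, lo=0.0, hi=8.0):
    """Plain gradient descent on the surrogate, clipped to the search range."""
    grad = surrogate.deriv()  # analytic derivative of the polynomial
    x = x0
    for _ in range(steps):
        x = float(np.clip(x - lr * grad(x), lo, hi))
    return x


if __name__ == "__main__":
    surrogate = fit_surrogate()
    x_best = gradient_search(surrogate)
    # The candidate is always re-scored with the true cost, not the surrogate.
    print("start cost:", true_cost(0.0), "found x:", x_best,
          "true cost:", true_cost(x_best))
```

The surrogate is used only to supply gradients; the candidate it returns is scored with the original cost, since a surrogate minimum need not coincide with a true minimum.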

arXiv:2002.04967 [pdf, other]

doi 10.1109/TCAD.2020.3015469

From IC Layout to Die Photo: A CNN-Based Data-Driven Approach

Authors: Hao-Chiang Shao, Chao-Yi Peng, Jun-Rei Wu, Chia-Wen Lin, Shao-Yun Fang, Pin-Yen Tsai, Yan-Hsiu Liu

Abstract: We propose a deep learning-based data-driven framework consisting of two convolutional neural networks: i) LithoNet that predicts the shape deformations on a circuit due to IC fabrication, and ii) OPCNet that suggests IC layout corrections to compensate for such shape deformations. By learning the shape correspondences between pairs of layout design patterns and their scanning electron microscope (SEM) images of the product wafer thereof, given an IC layout pattern, LithoNet can mimic the fabrication process to predict its fabricated circuit shape. Furthermore, LithoNet can take the wafer fabrication parameters as a latent vector to model the parametric product variations that can be inspected on SEM images. Besides, traditional optical proximity correction (OPC) methods used to suggest a correction on a lithographic photomask are computationally expensive. Our proposed OPCNet mimics the OPC procedure and efficiently generates a corrected photomask by collaborating with LithoNet to examine if the shape of a fabricated circuit optimally matches its original layout design. As a result, the proposed LithoNet-OPCNet framework can not only predict the shape of a fabricated IC from its layout pattern, but also suggest a layout correction according to the consistency between the predicted shape and the given layout. Experimental results with several benchmark layout patterns demonstrate the effectiveness of the proposed method.

Submitted 6 August, 2020; v1 submitted 10 February, 2020; originally announced February 2020.

Comments: 14 pages, 16 figures

arXiv:1804.07088 [pdf, ps, other]

doi 10.1017/S147106841800011X

A Trajectory Calculus for Qualitative Spatial Reasoning Using Answer Set Programming

Authors: George Baryannis, Ilias Tachmazidis, Sotiris Batsakis, Grigoris Antoniou, Mario Alviano, Timos Sellis, Pei-Wei Tsai

Abstract: Spatial information is often expressed using qualitative terms such as natural language expressions instead of coordinates; reasoning over such terms has several practical applications, such as bus routes planning. Representing and reasoning on trajectories is a specific case of qualitative spatial reasoning that focuses on moving objects and their paths. In this work, we propose two versions of a trajectory calculus based on the allowed properties over trajectories, where trajectories are defined as a sequence of non-overlapping regions of a partitioned map. More specifically, if a given trajectory is allowed to start and finish at the same region, 6 base relations are defined (TC-6). If a given trajectory should have different start and finish regions but cycles are allowed within, 10 base relations are defined (TC-10). Both versions of the calculus are implemented as ASP programs; we propose several different encodings, including a generalised program capable of encoding any qualitative calculus in ASP. All proposed encodings are experimentally evaluated using a real-world dataset. Experiment results show that the best performing implementation can scale up to an input of 250 trajectories for TC-6 and 150 trajectories for TC-10 for the problem of discovering a consistent configuration, a significant improvement compared to previous ASP implementations for similar qualitative spatial and temporal calculi. This manuscript is under consideration for acceptance in TPLP.

Submitted 19 April, 2018; originally announced April 2018.

Comments: Paper presented at the 34th International Conference on Logic Programming (ICLP 2018), Oxford, UK, July 14 to July 17, 2018, 20 pages, LaTeX, 16 figures

Journal ref: Theory and Practice of Logic Programming 18 (2018) 355-371

arXiv:1510.01823 [pdf, ps, other]

Multiple Configurations LT Codes

Authors: Pei-Chuan Tsai, Chih-Ming Chen, Ying-ping Chen

Abstract: This paper introduces a new scheme of LT codes, named multiple configurations. In multiple configurations LT codes (MC-LT codes), multiple sets of output symbols are simultaneously provided to receivers for recovering the source data. Each receiver, without the need to send information back to the sender, is capable of receiving the output symbols generated by some configuration chosen according to its own decoding phase. Aiming at the broadcasting scenarios without feedback channels, the proposed MC-LT codes are shown to outperform the optimal pure LT codes at the cost of encoding and transmitting units. In this paper, the inspiration of MC-LT codes is presented, how MC-LT codes work is described by giving examples, in which the optimal pure LT codes are outperformed, and a practical design of MC-LT codes, which is analytically proved to have at least the same performance bound as the pure LT codes, is proposed. The results of numerical simulation experiments demonstrate that the proposed practical design of MC-LT codes can deliver better performance than the LT codes in comparison. In summary, this paper creates new potential research directions for LT codes, and MC-LT codes are a promising variant of LT codes, especially for broadcasting scenarios.

Submitted 7 October, 2015; originally announced October 2015.

Comments: 11 pages, 8 figures, 3 tables

ACM Class: E.4; H.1.1; C.2.0
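
Illustrative note: for readers unfamiliar with the baseline the abstract above compares against, the following is a minimal sketch of a standard ("pure") LT code: each output symbol is the XOR of a randomly chosen set of source symbols, and the receiver recovers the source by repeatedly resolving output symbols that have exactly one unknown neighbor. This is background only and does not implement the MC-LT scheme. The function names, the integer symbol representation, and the use of the ideal soliton degree distribution are assumptions of this sketch.

```python
import random


def ideal_soliton(k):
    """Ideal soliton weights for degrees 1..k: 1/k, then 1/(d*(d-1))."""
    return [1.0 / k] + [1.0 / (d * (d - 1)) for d in range(2, k + 1)]


def lt_encode(source, num_output, seed=0):
    """Generate output symbols as (neighbor index set, XOR of those sources)."""
    rng = random.Random(seed)
    k = len(source)
    weights = ideal_soliton(k)
    degrees = list(range(1, k + 1))
    symbols = []
    for _ in range(num_output):
        d = rng.choices(degrees, weights=weights)[0]
        neighbors = set(rng.sample(range(k), d))
        value = 0
        for i in neighbors:
            value ^= source[i]
        symbols.append((neighbors, value))
    return symbols


def lt_decode(k, symbols):
    """Peeling decoder: returns a list with None for unrecovered positions."""
    recovered = [None] * k
    pending = [(set(n), v) for n, v in symbols]
    progress = True
    while progress:
        progress = False
        for idx, (n, v) in enumerate(pending):
            # Subtract (XOR out) every source symbol that is already known.
            for i in list(n):
                if recovered[i] is not None:
                    n.discard(i)
                    v ^= recovered[i]
            pending[idx] = (n, v)
            # Exactly one unknown neighbor left: it equals the reduced value.
            if len(n) == 1:
                recovered[next(iter(n))] = v
                progress = True
    return recovered


if __name__ == "__main__":
    data = [17, 42, 7, 99, 5, 63]
    decoded = lt_decode(len(data), lt_encode(data, num_output=20))
    print(decoded)  # positions may be None if too few useful symbols arrived
```

Decoding is probabilistic: with too few output symbols, or an unlucky draw, some positions remain unrecovered, which is why the sketch returns `None` for them instead of failing.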

Showing 1–18 of 18 results for author: Tsai, P