Search | arXiv e-print repository

SABLE: Staging Blocked Evaluation of Sparse Matrix Computations

Authors: Pratyush Das, Adhitha Dias, Anxhelo Xhebraj, Artem Pelenitsyn, Kirshanthan Sundararajah, Milind Kulkarni

Abstract: Sparse matrices found in the real world often have some structure in how the dense elements are organized. While the inspector-executor model inspects matrices for structure, its generality can overlook further specialization. We propose a system that, if the sparse matrix is stored in a blocked storage format, can generate more efficient code by constructing regular loops over these blocks. Our system performs a specified computation over every element of the block, rather than avoiding computation on sparse elements altogether, and thereby achieves regularity in the specialized code. The system is extensible, providing a dense block iterator for the user to express any computation over these dense blocks. We show that this approach can significantly speed up SpMV and SpMM operations over the state-of-the-art systems Partially-Strided Codelets and Sparse Register Tiling.

Submitted 3 April, 2024; originally announced July 2024.
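The idea in the SABLE abstract of running regular loops over every element of a dense block can be sketched as follows. This is a minimal illustration only, not the code SABLE generates: the `(row0, col0, block)` tuple layout and the name `blocked_spmv` are assumptions made for this example.

```python
def blocked_spmv(blocks, x, n_rows):
    """Compute y = A @ x for a matrix A given as a list of dense blocks.

    Each block is a hypothetical (row0, col0, values) tuple, where values is
    a list of rows and (row0, col0) is the block's top-left corner in A.
    """
    y = [0.0] * n_rows
    for row0, col0, block in blocks:
        # Regular loop nest over the whole block: zeros stored inside the
        # block are multiplied like any other entry, with no per-element test.
        for i, row in enumerate(block):
            acc = 0.0
            for j, a_ij in enumerate(row):
                acc += a_ij * x[col0 + j]
            y[row0 + i] += acc
    return y


# A 4x4 matrix with two 2x2 blocks on the diagonal; both contain a zero.
blocks = [
    (0, 0, [[1.0, 2.0], [0.0, 3.0]]),
    (2, 2, [[4.0, 0.0], [5.0, 6.0]]),
]
x = [1.0, 1.0, 1.0, 1.0]
y = blocked_spmv(blocks, x, 4)
```

The trade-off is a few wasted multiplications by zero in exchange for loop bounds that are fixed per block, which is the regularity the abstract refers to.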

arXiv:2405.17590 [pdf, other]

Optimizing Layout of Recursive Datatypes with Marmoset

Authors: Vidush Singhal, Chaitanya Koparkar, Joseph Zullo, Artem Pelenitsyn, Michael Vollmer, Mike Rainey, Ryan Newton, Milind Kulkarni

Abstract: While programmers know that the low-level memory representation of data structures can have significant effects on performance, compiler support to optimize the layout of those structures is an under-explored field. Prior work has optimized the layout of individual, non-recursive structures without considering how collections of those objects in linked or recursive data structures are laid out. This work introduces Marmoset, a compiler that optimizes the layouts of algebraic datatypes, with a special focus on producing highly optimized, packed data layouts where recursive structures can be traversed with minimal pointer chasing. Marmoset performs an analysis of how a recursive ADT is used across functions to choose a global layout that promotes simple, strided access for that ADT in memory. It does so by building and solving a constraint system to minimize an abstract cost model, yielding a predicted efficient layout for the ADT. Marmoset then builds on top of Gibbon, a prior compiler for packed, mostly-serial representations, to synthesize optimized ADTs. We show experimentally that Marmoset is able to choose optimal layouts across a series of microbenchmarks and case studies, outperforming both Gibbon's baseline approach and MLton, a Standard ML compiler that uses traditional pointer-heavy representations.

Submitted 3 June, 2024; v1 submitted 27 May, 2024; originally announced May 2024.

Journal ref: European Conference on Object Oriented Programming 2024

arXiv:2311.09549 [pdf, other]

SparseAuto: An Auto-Scheduler for Sparse Tensor Computations Using Recursive Loop Nest Restructuring

Authors: Adhitha Dias, Logan Anderson, Kirshanthan Sundararajah, Artem Pelenitsyn, Milind Kulkarni

Abstract: Automated code generation and performance optimizations for sparse tensor algebra are cardinal since they have become essential in many real-world applications like quantum computing, physics, chemistry, and machine learning. General sparse tensor algebra compilers are not always versatile enough to generate asymptotically optimal code for sparse tensor contractions. This paper shows how to optimize and generate asymptotically better schedules for complex tensor expressions using kernel fission and fusion. We present a generalized loop transformation to achieve loop nesting for minimized memory footprint and reduced asymptotic complexity. Furthermore, we present an auto-scheduler that uses a partially ordered set-based cost model that uses both time and auxiliary memory complexities in its pruning stages. In addition, we highlight the use of SMT solvers in sparse auto-schedulers to prune the Pareto frontier of schedules to the smallest number of possible schedules with user-defined constraints available at compile time. Finally, we show that our auto-scheduler can select asymptotically better schedules that use our compiler transformation to generate optimized code. Our results show that the auto-scheduler achieves orders of magnitude speedup compared to the TACO-generated code for several real-world tensor algebra computations on different real-world inputs.

Submitted 5 January, 2024; v1 submitted 15 November, 2023; originally announced November 2023.

arXiv:2311.09168 [pdf, other]

doi 10.1145/3650200.3656601

Arkade: k-Nearest Neighbor Search With Non-Euclidean Distances using GPU Ray Tracing

Authors: Durga Mandarapu, Vani Nagarajan, Artem Pelenitsyn, Milind Kulkarni

Abstract: High-performance implementations of $k$-Nearest Neighbor Search ($k$NN) in low dimensions use tree-based data structures. Tree algorithms are hard to parallelize on GPUs due to their irregularity. However, newer Nvidia GPUs offer hardware support for tree operations through ray-tracing cores. Recent works have proposed using RT cores to implement $k$NN search, but they all have a hardware-imposed constraint on the distance metric used in the search -- the Euclidean distance. We propose and implement two reductions to support $k$NN for a broad range of distances other than the Euclidean distance: Arkade Filter-Refine and Arkade Monotone Transformation, each of which allows non-Euclidean distance-based nearest neighbor queries to be performed in terms of the Euclidean distance. With our reductions, we observe that $k$NN search time speedups range between $1.6$x-$200$x and $1.3$x-$33.1$x over various state-of-the-art GPU shader core and RT core baselines, respectively. In evaluation, we provide several insights on RT architectures' ability to efficiently build and traverse the tree by analyzing the $k$NN search time trends.

Submitted 21 April, 2024; v1 submitted 15 November, 2023; originally announced November 2023.
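To illustrate how a non-Euclidean nearest-neighbor query can be answered in terms of the Euclidean distance, as the Arkade abstract describes, the sketch below uses a well-known identity: for unit vectors, squared Euclidean distance equals 2 - 2cos(a, b), so ranking by cosine distance and ranking by Euclidean distance on normalized points agree. This is an assumed example of a monotone transformation, run as a brute-force CPU search; it is not taken from the paper and involves no RT cores. All function names are hypothetical.

```python
import math


def normalize(v):
    """Scale v to unit length (assumes v is non-zero)."""
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cosine_distance(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def knn(query, points, k, dist):
    """Brute-force k-nearest neighbors; returns indices into points."""
    order = sorted(range(len(points)), key=lambda i: dist(query, points[i]))
    return order[:k]


points = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0], [-1.0, 0.5]]
query = [2.0, 1.0]

# Direct search under the non-Euclidean (cosine) distance.
by_cosine = knn(query, points, 2, cosine_distance)

# Same query after the transformation: normalize, then search with Euclidean.
unit_points = [normalize(p) for p in points]
by_euclid = knn(normalize(query), unit_points, 2, euclidean)
```

Both searches return the same neighbor indices on this data, which is what makes a Euclidean-only search structure usable for the cosine query.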

arXiv:2109.01950 [pdf, other]

doi 10.1145/3485527

Type Stability in Julia: Avoiding Performance Pathologies in JIT Compilation (Extended Version)

Authors: Artem Pelenitsyn, Julia Belyakova, Benjamin Chung, Ross Tate, Jan Vitek

Abstract: As a scientific programming language, Julia strives for performance but also provides high-level productivity features. To avoid performance pathologies, Julia users are expected to adhere to a coding discipline that enables so-called type stability. Informally, a function is type stable if the type of the output depends only on the types of the inputs, not their values. This paper provides a formal definition of type stability as well as a stronger property of type groundedness, shows that groundedness enables compiler optimizations, and proves the compiler correct. We also perform a corpus analysis to uncover how these type-related properties manifest in practice.

Submitted 17 November, 2021; v1 submitted 4 September, 2021; originally announced September 2021.

Comments: OOPSLA '21, extended version
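The informal definition of type stability in the abstract above (output type depends only on input types, not values) can be illustrated with a small sketch. It is written in Python rather than Julia and is only an analogy: Python has no JIT notion of type stability, and the sampling check below can refute stability on the given samples but never prove it. The names `unstable`, `stable`, and `looks_type_stable` are hypothetical.

```python
from collections import defaultdict


def unstable(x):
    # Output type depends on the *value* of x:
    # a positive int stays an int, but a non-positive int yields a float.
    return x if x > 0 else 0.0


def stable(x):
    # Output type is fixed by the input type alone: the zero is built
    # with the same type as x.
    return x if x > 0 else type(x)(0)


def observed_output_types(f, samples):
    """Map each tuple of argument types to the set of result types seen."""
    seen = defaultdict(set)
    for args in samples:
        arg_types = tuple(type(a) for a in args)
        seen[arg_types].add(type(f(*args)))
    return dict(seen)


def looks_type_stable(f, samples):
    """True if every observed input-type tuple led to one output type."""
    return all(len(outs) == 1
               for outs in observed_output_types(f, samples).values())


samples = [(1,), (-1,), (2.5,), (-2.5,)]
stable_ok = looks_type_stable(stable, samples)
unstable_ok = looks_type_stable(unstable, samples)
```

On these samples `unstable` maps `int` inputs to both `int` and `float` results, so the check fails for it and passes for `stable`.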

Showing 1–5 of 5 results for author: Pelenitsyn, A