-
MPI Advance : Open-Source Message Passing Optimizations
Authors:
Amanda Bienz,
Derek Schafer,
Anthony Skjellum
Abstract:
The large variety of production implementations of the message passing interface (MPI) each provide unique and varying underlying algorithms. Each emerging supercomputer supports one or a small number of system MPI installations, tuned for the given architecture. Performance varies with MPI version, but application programmers are typically unable to achieve optimal performance with local MPI installations and therefore rely on whichever implementation is provided as a system install. This paper presents MPI Advance, a collection of libraries that sit on top of MPI, optimizing the underlying performance of any existing MPI library. The libraries provide optimizations for collectives, neighborhood collectives, partitioned communication, and GPU-aware communication.
Submitted 13 September, 2023;
originally announced September 2023.
-
A More Scalable Sparse Dynamic Data Exchange
Authors:
Andrew Geyko,
Gerald Collom,
Derek Schafer,
Patrick Bridges,
Amanda Bienz
Abstract:
Parallel architectures are continually increasing in performance and scale, while underlying algorithmic infrastructure often fails to take full advantage of available compute power. Within the context of MPI, irregular communication patterns create bottlenecks in parallel applications. One common bottleneck is the sparse dynamic data exchange, often required when forming communication patterns within applications. There is a large variety of approaches for these dynamic exchanges, with optimizations implemented directly in parallel applications. This paper proposes a novel API within an MPI extension library, allowing applications to utilize the variety of provided optimizations for sparse dynamic data exchange methods. Further, the paper presents novel locality-aware sparse dynamic data exchange algorithms. Finally, performance results show significant speedups of up to 20x with the novel locality-aware algorithms.
Submitted 3 April, 2024; v1 submitted 26 August, 2023;
originally announced August 2023.
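As a rough illustration of the problem this abstract describes, the sketch below simulates one simple baseline in serial Python. Every process knows its destinations but not its sources, so the processes sum per-destination flags (the role an allreduce would play in an MPI code) to learn how many messages to expect. The function name `discover_receive_counts` and the list-of-lists input are assumptions made for this example; this is neither the API proposed in the paper nor one of its locality-aware algorithms.

```python
def discover_receive_counts(send_lists):
    """Serial simulation of a reduction-based sparse dynamic data exchange.

    send_lists[r] holds the ranks that rank r will send to. Each rank
    contributes a 0/1 flag per destination; an element-wise sum over all
    ranks (an allreduce in a real MPI code) tells every rank how many
    messages it should expect to receive.
    """
    p = len(send_lists)
    recv_counts = [0] * p
    for dests in send_lists:
        for dst in set(dests):  # one message per distinct destination
            recv_counts[dst] += 1
    return recv_counts


# Hypothetical pattern on 4 ranks: rank 0 sends to ranks 1 and 3, etc.
pattern = [[1, 3], [3], [], [0]]
print(discover_receive_counts(pattern))  # [1, 1, 0, 2]
```

The reduction touches a vector of length p on every rank, which hints at why this step can become a bottleneck at scale.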
-
Collective-Optimized FFTs
Authors:
Evelyn Namugwanya,
Amanda Bienz,
Derek Schafer,
Anthony Skjellum
Abstract:
This paper measures the impact of the various alltoallv methods. Results are analyzed within Beatnik, a Z-model solver that is bottlenecked by HeFFTe and representative of applications that rely on FFTs.
Submitted 4 July, 2023; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Optimizing Irregular Communication with Neighborhood Collectives and Locality-Aware Parallelism
Authors:
Gerald Collom,
Rui Peng Li,
Amanda Bienz
Abstract:
Irregular communication often limits both the performance and scalability of parallel applications. Typically, applications individually implement irregular messages using point-to-point communications, and any optimizations are added directly into the application. As a result, these optimizations lack portability. There is no easy way to optimize point-to-point messages within MPI, as the interface for single messages provides no information on the collection of all communication to be performed. However, the persistent neighbor collective API, released in the MPI 4 standard, provides an interface for portable optimizations of irregular communication within MPI libraries.
This paper presents methods for optimizing irregular communication within neighborhood collectives, analyzes the impact of replacing point-to-point communication in existing codebases such as Hypre BoomerAMG with neighborhood collectives, and finally shows an up to 1.32x speedup on sparse matrix-vector multiplication within a BoomerAMG solve through the use of our optimized neighbor collectives. The authors analyze multiple implementations of neighborhood collectives, including a standard implementation, which simply wraps standard point-to-point communication, as well as multiple implementations of locality-aware aggregation. All optimizations are available in an open-source codebase, MPI Advance, which sits on top of MPI, allowing for optimizations to be added into existing codebases regardless of the system MPI install.
Submitted 2 June, 2023;
originally announced June 2023.
-
Characterizing the Performance of Node-Aware Strategies for Irregular Point-to-Point Communication on Heterogeneous Architectures
Authors:
Shelby Lockhart,
Amanda Bienz,
William D. Gropp,
Luke N. Olson
Abstract:
Supercomputer architectures are trending toward higher computational throughput due to the inclusion of heterogeneous compute nodes. These multi-GPU nodes increase on-node computational efficiency, while also increasing the amount of data to be communicated and the number of potential data flow paths. In this work, we characterize the performance of irregular point-to-point communication with MPI on heterogeneous compute environments through performance modeling, demonstrating the limitations of standard communication strategies for both device-aware and staging-through-host communication techniques. Presented models suggest staging communicated data through host processes then using node-aware communication strategies for high inter-node message counts. Notably, the models also predict that node-aware communication utilizing all available CPU cores to communicate inter-node data leads to the most performant strategy when communicating with a high number of nodes. Model validation is provided via a case study of irregular point-to-point communication patterns in distributed sparse matrix-vector products. Importantly, we include a discussion on the implications model predictions have on communication strategy design for emerging supercomputer architectures.
Submitted 13 September, 2022;
originally announced September 2022.
-
A Locality-Aware Bruck Allgather
Authors:
Amanda Bienz,
Shreeman Gautam,
Amun Kharel
Abstract:
Collective algorithms are an essential part of MPI, allowing application programmers to utilize underlying optimizations of common distributed operations. The MPI_Allgather operation gathers data, which is originally distributed across all processes, so that all data is available to each process. For small data sizes, the Bruck algorithm is commonly implemented to minimize the maximum number of messages communicated by any process. However, the cost of each step of communication is dependent upon the relative locations of source and destination processes, with non-local messages, such as inter-node, significantly more costly than local messages, such as intra-node. This paper optimizes the Bruck algorithm with locality-awareness, minimizing the number and size of non-local messages to improve performance and scalability of the allgather operation.
Submitted 23 August, 2022; v1 submitted 7 June, 2022;
originally announced June 2022.
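The standard Bruck allgather that this abstract takes as its starting point can be sketched as a serial simulation. The code below is a minimal illustration of the textbook algorithm only, not the locality-aware variant the paper proposes; the function name `bruck_allgather` and the one-block-per-rank input are assumptions for this example.

```python
def bruck_allgather(values):
    """Serial simulation of the standard Bruck allgather.

    values[r] is the single block owned by rank r. At the step with
    distance d, rank r receives the first min(d, p - d) blocks held by
    rank (r + d) % p and appends them. After ceil(log2(p)) steps, rank r
    holds blocks r, r+1, ..., r+p-1 (mod p), and a final local rotation
    puts them in rank order.
    """
    p = len(values)
    bufs = [[v] for v in values]
    steps = 0
    dist = 1
    while dist < p:
        count = min(dist, p - dist)
        # Snapshot the incoming data first: all exchanges in a step
        # happen concurrently in a real implementation.
        incoming = [bufs[(r + dist) % p][:count] for r in range(p)]
        for r in range(p):
            bufs[r].extend(incoming[r])
        dist *= 2
        steps += 1
    # Local rotation: block j of rank r's buffer belongs at index (r + j) % p.
    result = []
    for r in range(p):
        ordered = [None] * p
        for j, block in enumerate(bufs[r]):
            ordered[(r + j) % p] = block
        result.append(ordered)
    return result, steps


gathered, steps = bruck_allgather(["a", "b", "c", "d", "e"])
print(steps)  # 3
```

In this simulation every step is treated alike; on a real machine, the steps whose partner lies on another node are the costly ones, which is the cost the abstract targets.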
-
Performance Analysis and Optimal Node-Aware Communication for Enlarged Conjugate Gradient Methods
Authors:
Shelby Lockhart,
Amanda Bienz,
William Gropp,
Luke Olson
Abstract:
Krylov methods are a key way of solving large sparse linear systems of equations, but suffer from poor strong scalability on distributed memory machines. This is due to high synchronization costs from large numbers of collective communication calls alongside a low computational workload. Enlarged Krylov methods address this issue by decreasing the total iterations to convergence, an artifact of splitting the initial residual and resulting in operations on block vectors. In this paper, we present a performance study of an Enlarged Krylov Method, Enlarged Conjugate Gradients (ECG), noting the impact of block vectors on parallel performance at scale. Most notably, we observe the increased overhead of point-to-point communication as a result of denser messages in the sparse matrix-block vector multiplication kernel. Additionally, we present models to analyze expected performance of ECG, as well as to motivate design decisions. Most importantly, we introduce a new point-to-point communication approach based on node-aware communication techniques that increases efficiency of the method at scale.
Submitted 11 March, 2022;
originally announced March 2022.
-
Modeling Data Movement Performance on Heterogeneous Architectures
Authors:
Amanda Bienz,
Luke N. Olson,
William D. Gropp,
Shelby Lockhart
Abstract:
The cost of data movement on parallel systems varies greatly with machine architecture, job partition, and nearby jobs. Performance models that accurately capture the cost of data movement provide a tool for analysis, allowing for communication bottlenecks to be pinpointed. Modern heterogeneous architectures yield increased variance in data movement as there are a number of viable paths for inter-GPU communication. In this paper, we present performance models for the various paths of inter-node communication on modern heterogeneous architectures, including the trade-off between GPUDirect communication and copying to CPUs. Furthermore, we present a novel optimization for inter-node communication based on these models, utilizing all available CPU cores per node. Finally, we show associated performance improvements for MPI collective operations.
Submitted 16 July, 2021; v1 submitted 20 October, 2020;
originally announced October 2020.
-
Node-Aware Improvements to Allreduce
Authors:
Amanda Bienz,
Luke N. Olson,
William D. Gropp
Abstract:
The MPI_Allreduce collective operation is a core kernel of many parallel codebases, particularly for reductions over a single value per process. The commonly used allreduce recursive-doubling algorithm obtains the lower bound message count, yielding optimality for small reduction sizes based on node-agnostic performance models. However, this algorithm yields duplicate messages between sets of nodes. Node-aware optimizations in MPICH remove duplicate messages through use of a single master process per node, yielding a large number of inactive processes at each inter-node step. In this paper, we present an algorithm that uses the multiple processes available per node to reduce the maximum number of inter-node messages communicated by a single process, improving the performance of allreduce operations, particularly for small message sizes.
Submitted 21 October, 2019;
originally announced October 2019.
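The baseline recursive-doubling allreduce mentioned in this abstract can be sketched as a serial simulation that also counts how many steps cross a node boundary. The sketch covers the baseline only, not the paper's node-aware algorithm. It assumes a sum reduction, power-of-two process counts, and ranks placed contiguously on nodes with `ppn` processes per node; the function name is hypothetical.

```python
def recursive_doubling_allreduce(values, ppn):
    """Serial simulation of a recursive-doubling allreduce (sum).

    values[r] is rank r's contribution and ppn is the number of processes
    per node (ranks are assumed to fill nodes contiguously). At the step
    with distance d, rank r exchanges with rank r XOR d and both reduce.
    Returns the per-rank results and the maximum number of inter-node
    messages sent by any single rank.
    """
    p = len(values)
    if p & (p - 1) or ppn & (ppn - 1):
        raise ValueError("this sketch assumes power-of-two counts")
    vals = list(values)
    inter_node = [0] * p
    dist = 1
    while dist < p:
        partner = [r ^ dist for r in range(p)]
        for r in range(p):
            if r // ppn != partner[r] // ppn:
                inter_node[r] += 1  # this exchange leaves the node
        vals = [vals[r] + vals[partner[r]] for r in range(p)]
        dist *= 2
    return vals, max(inter_node)


result, max_inter = recursive_doubling_allreduce(list(range(16)), ppn=4)
print(result[0], max_inter)  # 120 2
```

With 16 ranks on 4 nodes, every rank takes part in both inter-node steps, so all processes on a node send data to the same remote nodes. These are the duplicate messages the abstract refers to.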
-
Reducing Communication in Algebraic Multigrid with Multi-step Node Aware Communication
Authors:
Amanda Bienz,
Luke Olson,
William Gropp
Abstract:
Algebraic multigrid (AMG) is often viewed as a scalable $\mathcal{O}(n)$ solver for sparse linear systems. Yet, parallel AMG lacks scalability due to increasingly large costs associated with communication, both in the initial construction of a multigrid hierarchy as well as the iterative solve phase. This work introduces a parallel implementation of AMG to reduce the cost of communication, yielding an increase in scalability. Standard inter-process communication consists of sending data regardless of the send and receive process locations. Performance tests show notable differences in the cost of intra- and inter-node communication, motivating a restructuring of communication. In this case, the communication schedule takes advantage of the less costly intra-node communication, reducing both the number and size of inter-node messages. Node-centric communication extends to the range of components in both the setup and solve phases of AMG, yielding an increase in the weak and strong scalability of the entire method.
Submitted 24 April, 2019; v1 submitted 11 April, 2019;
originally announced April 2019.
-
Improving Performance Models for Irregular Point-to-Point Communication
Authors:
Amanda Bienz,
William D. Gropp,
Luke N. Olson
Abstract:
Parallel applications are often unable to take full advantage of emerging parallel architectures because of scaling limitations that arise from inter-process communication. Performance models are used to analyze the sources of communication costs. However, traditional models for point-to-point communication fail to capture the full cost of many irregular operations, such as sparse matrix methods. In this paper, a node-aware model is presented. Furthermore, the model is extended to include communication queue search time as well as an additional parameter estimating network contention. The resulting model is applied to a variety of irregular communication patterns throughout matrix operations, displaying improved accuracy over traditional models.
Submitted 6 June, 2018;
originally announced June 2018.
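To make the idea of a node-aware model in this abstract concrete, the sketch below applies the classic postal (latency plus bandwidth) model separately to intra-node and inter-node traffic. It is a simplified stand-in, not the paper's model: the queue search and contention terms are omitted, the two classes are simply summed, and every parameter value is a placeholder chosen for illustration rather than a measurement.

```python
def postal_cost(n_msgs, n_bytes, alpha, beta):
    """Classic postal model: per-message latency plus per-byte cost."""
    return alpha * n_msgs + beta * n_bytes


def node_aware_cost(intra, inter, params):
    """Sum postal costs with separate parameters per locality class.

    intra and inter are (n_msgs, n_bytes) tuples; params maps the keys
    "intra" and "inter" to (alpha, beta) pairs in seconds and seconds/byte.
    """
    intra_cost = postal_cost(*intra, *params["intra"])
    inter_cost = postal_cost(*inter, *params["inter"])
    return intra_cost + inter_cost


# Placeholder parameters (assumptions, not measured values).
params = {"intra": (1e-6, 1e-10), "inter": (5e-6, 1e-9)}
estimate = node_aware_cost(intra=(10, 8000), inter=(4, 8000), params=params)
print(f"{estimate:.3e} seconds")
```

Splitting the parameters by locality lets the model charge more for a pattern that sends many small inter-node messages than for one that moves the same bytes within a node.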
-
Node Aware Sparse Matrix-Vector Multiplication
Authors:
Amanda Bienz,
William D. Gropp,
Luke N. Olson
Abstract:
The sparse matrix-vector multiply (SpMV) operation is a key computational kernel in many simulations and linear solvers. The large communication requirements associated with a reference implementation of a parallel SpMV result in poor parallel scalability. The cost of communication depends on the physical locations of the send and receive processes: messages injected into the network are more costly than messages sent between processes on the same node. In this paper, a node aware parallel SpMV (NAPSpMV) is introduced to exploit knowledge of the system topology, specifically the node-processor layout, to reduce costs associated with communication. The values of the input vector are redistributed to minimize both the number and the size of messages that are injected into the network during a SpMV, leading to a reduction in communication costs. A variety of computational experiments that highlight the efficiency of this approach are presented.
Submitted 15 November, 2017; v1 submitted 23 December, 2016;
originally announced December 2016.
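The saving from node-aware aggregation in this abstract can be illustrated by counting messages injected into the network. The sketch below compares a standard pattern, with one message per communicating process pair, against an idealized aggregated pattern with one message per communicating node pair. It assumes ranks fill nodes contiguously with `ppn` processes per node; the function name is hypothetical, and the on-node gather and redistribution steps that a real implementation needs are not modeled.

```python
def count_inter_node_messages(sends, ppn):
    """Count inter-node messages for standard vs. node-aggregated sends.

    sends is an iterable of (src_rank, dst_rank) pairs. Returns a tuple
    (standard, aggregated): the number of distinct process pairs that
    cross a node boundary, and the number of distinct node pairs.
    """
    process_pairs = set()
    node_pairs = set()
    for src, dst in sends:
        src_node, dst_node = src // ppn, dst // ppn
        if src_node != dst_node:
            process_pairs.add((src, dst))
            node_pairs.add((src_node, dst_node))
    return len(process_pairs), len(node_pairs)


# Hypothetical pattern: 2 processes per node, both ranks on node 0 send
# to both ranks on node 1, plus one purely on-node message.
pattern = [(0, 2), (0, 3), (1, 2), (1, 3), (0, 1)]
print(count_inter_node_messages(pattern, ppn=2))  # (4, 1)
```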
-
Reducing Parallel Communication in Algebraic Multigrid through Sparsification
Authors:
Amanda Bienz,
Robert D. Falgout,
William Gropp,
Luke N. Olson,
Jacob B. Schroder
Abstract:
Algebraic multigrid (AMG) is an $\mathcal{O}(n)$ solution process for many large sparse linear systems. A hierarchy of progressively coarser grids is constructed that utilizes complementary relaxation and interpolation operators. High-energy error is reduced by relaxation, while low-energy error is mapped to coarse grids and reduced there. However, large parallel communication costs often limit parallel scalability. As the multigrid hierarchy is formed, each coarse matrix is formed through a triple matrix product. The resulting coarse grids often have significantly more nonzeros per row than the original fine-grid operator, thereby generating high parallel communication costs on coarse levels. In this paper, we introduce a method that systematically removes entries in coarse-grid matrices after the hierarchy is formed, leading to improved communication costs. We sparsify by removing weakly connected or unimportant entries in the matrix, leading to improved solve time. The main trade-off is that if the heuristic identifying unimportant entries is used too aggressively, then AMG convergence can suffer. To counteract this, the original hierarchy is retained, allowing entries to be reintroduced into the solver hierarchy if convergence is too slow. This enables a balance between communication cost and convergence, as necessary. In this paper we present new algorithms for reducing communication and present a number of computational experiments in support.
Submitted 14 December, 2015;
originally announced December 2015.
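The general shape of the sparsification step in this abstract can be sketched with a simple threshold rule. The rule used here drops off-diagonal entries smaller than `theta` times the largest off-diagonal magnitude in their row. It is an assumption chosen for illustration and should not be read as the paper's criterion. The removed entries are returned separately, mirroring the abstract's point that the original hierarchy is retained so entries can be reintroduced.

```python
def sparsify_rows(matrix, theta):
    """Drop weak off-diagonal entries from a sparse matrix.

    matrix maps row index -> {column index: value}. An off-diagonal entry
    is removed when its magnitude is below theta times the largest
    off-diagonal magnitude in its row (an illustrative heuristic).
    Returns (kept, removed) in the same nested-dict layout.
    """
    kept, removed = {}, {}
    for i, row in matrix.items():
        off_diag = [abs(v) for j, v in row.items() if j != i]
        cutoff = theta * max(off_diag, default=0.0)
        kept[i], removed[i] = {}, {}
        for j, v in row.items():
            if j == i or abs(v) >= cutoff:
                kept[i][j] = v      # diagonal and strong entries stay
            else:
                removed[i][j] = v   # retained so it can be restored later
    return kept, removed


# Small hypothetical coarse-grid matrix.
A = {
    0: {0: 4.0, 1: -1.0, 2: -0.01},
    1: {0: -1.0, 1: 4.0},
    2: {0: -0.01, 2: 4.0},
}
kept, removed = sparsify_rows(A, theta=0.1)
print(kept[0], removed[0])  # {0: 4.0, 1: -1.0} {2: -0.01}
```

Fewer off-diagonal entries per row generally means fewer remote values to fetch during a matrix-vector product, which is where the communication saving comes from.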