-
Achieving High-Performance Fault-Tolerant Routing in HyperX Interconnection Networks
Authors:
Cristóbal Camarero,
Alejandro Cano,
Carmen Martínez,
Ramón Beivide
Abstract:
Interconnection networks are key actors that condition the performance of current large datacenter and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tandem should exhibit resilience and robustness. Low-diameter networks, including HyperX, are cheaper than typical Fat Trees. But, to be really competitive, they have to employ evolved routing algorithms to both balance traffic and tolerate failures.
In this paper, SurePath, an efficient fault-tolerant routing mechanism for the HyperX topology, is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This mechanism not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios.
Submitted 5 April, 2024;
originally announced April 2024.
-
Analysing Mechanisms for Virtual Channel Management in Low-Diameter networks
Authors:
Alejandro Cano,
Cristóbal Camarero,
Carmen Martínez,
Ramón Beivide
Abstract:
To interconnect their growing number of servers, current supercomputers and data centers are starting to adopt low-diameter networks, such as HyperX, Dragonfly and Dragonfly+. These emergent topologies require balancing the load over their links and finding suitable non-minimal routing mechanisms for them becomes particularly challenging. The Valiant load balancing scheme is a very popular choice for non-minimal routing. Evolved adaptive routing mechanisms implemented in real systems are based on this Valiant scheme.
All these low-diameter networks are deadlock-prone when non-minimal routing is employed. Routing deadlocks occur when packets cannot progress due to cyclic dependencies. Therefore, developing efficient deadlock-free packet routing mechanisms is critical for the progress of these emergent networks. The routing function includes the routing algorithm for path selection and the buffer management policy that dictates how packets allocate the buffers of the switches on their paths. For the same routing algorithm, a different buffer management mechanism can lead to very different performance. Moreover, certain mechanisms considered efficient for avoiding deadlocks may still suffer from hard-to-pinpoint instabilities that make the network response erratic. This paper focuses on exploring the impact of these buffer management policies on the performance of current interconnection networks, showing a 90% performance drop if an incorrect buffer management policy is used. Moreover, this study not only characterizes some of these undesirable scenarios but also proposes practicable solutions.
Submitted 1 February, 2024; v1 submitted 22 June, 2023;
originally announced June 2023.
-
Simple, Fast and Practicable Algorithms for Cholesky, LU and QR Decomposition Using Fast Rectangular Matrix Multiplication
Authors:
Cristóbal Camarero
Abstract:
This note presents fast Cholesky/LU/QR decomposition algorithms with $O(n^{2.529})$ time complexity when using the fastest known matrix multiplication. The algorithms have potential application, since a quickly made implementation using Strassen multiplication has a shorter execution time than the one employed by the GNU Scientific Library for the same task in at least a few examples.
The underlying ideas are very simple. Despite this, I have been unable to find these methods in the literature.
Submitted 5 December, 2018;
originally announced December 2018.
-
Projective Networks: Topologies for Large Parallel Computer Systems
Authors:
Cristóbal Camarero,
Carmen Martínez,
Enrique Vallejo,
Ramón Beivide
Abstract:
The interconnection network comprises a significant portion of the cost of large parallel computers, both in economic terms and power consumption. Several previous proposals exploit large-radix routers to build scalable low-distance topologies with the aim of minimizing these costs. However, they fail to consider potential unbalance in the network utilization, which in some cases results in suboptimal designs. Based on an appropriate cost model, this paper advocates the use of networks based on incidence graphs of projective planes, broadly denoted as Projective Networks. Projective Networks rely on highly symmetric generalized Moore graphs and encompass several proposed direct (PN and demi-PN) and indirect (OFT) topologies under a common mathematical framework. Compared to other proposals with average distance between 2 and 3 hops, these networks provide very high scalability while preserving a balanced network utilization, resulting in low network costs. Overall, Projective Networks constitute a competitive alternative for exascale-level interconnection network design.
Submitted 23 December, 2015;
originally announced December 2015.
-
Identifying Codes of Degree 4 Cayley Graphs over Abelian Groups
Authors:
Cristóbal Camarero,
Carmen Martínez,
Ramón Beivide
Abstract:
In this paper a wide family of identifying codes over regular Cayley graphs of degree four which are built over finite Abelian groups is presented. Some of the codes in this construction are also perfect. The graphs considered include some well-known graphs such as tori, twisted tori and Kronecker products of two cycles. Therefore, the codes can be used for identification in these graphs. Finally, an example of how these codes can be applied for adaptive identification over these graphs is presented.
Submitted 18 December, 2014;
originally announced December 2014.
-
Quasi-perfect Lee Codes of Radius 2 and Arbitrarily Large Dimension
Authors:
Cristóbal Camarero,
Carmen Martínez
Abstract:
A construction of 2-quasi-perfect Lee codes is given over the space $\mathbb Z_p^n$ for $p$ prime, $p\equiv \pm 5\pmod{12}$ and $n=2[\frac{p}{4}]$. It is known that there are infinitely many such primes. Golomb and Welch conjectured that perfect codes for the Lee-metric do not exist for dimension $n\geq 3$ and radius $r\geq 2$. This conjecture was proved to be true for large radii as well as for low dimensions. The codes found are very close to being perfect, which exhibits the hardness of the conjecture. A series of computations shows that related graphs are Ramanujan, which could provide further connections between Coding and Graph Theories.
Submitted 23 June, 2017; v1 submitted 18 December, 2014;
originally announced December 2014.
-
Symmetric Interconnection Networks from Cubic Crystal Lattices
Authors:
Cristóbal Camarero,
Carmen Martínez,
Ramón Beivide
Abstract:
Torus networks of moderate degree have been widely used in the supercomputer industry. Tori are superb when used for executing applications that require near-neighbor communications. Nevertheless, they are not so good when dealing with global communications. Hence, typical 3D implementations have evolved to 5D networks, among other reasons, to reduce network distances. Most of these big systems are mixed-radix tori which are not the best option for minimizing distances and efficiently using network resources. This paper is focused on improving the topological properties of these networks.
By using integral matrices to deal with Cayley graphs over Abelian groups, we have been able to propose and analyze a family of high-dimensional grid-based interconnection networks. As they are built over $n$-dimensional grids that induce a regular tiling of the space, these topologies have been denoted lattice graphs. We will focus on cubic crystal lattices for modeling symmetric 3D networks. Other higher dimensional networks can be composed over these graphs, as illustrated in this research. Easy network partitioning can also take advantage of this network composition operation. Minimal routing algorithms are also provided for these new topologies. Finally, some practical issues such as implementability and preliminary performance evaluations have been addressed.
Submitted 8 November, 2013;
originally announced November 2013.