-
On the Limitations of Fractal Dimension as a Measure of Generalization
Authors:
Charlie Tan,
Inés García-Redondo,
Qiquan Wang,
Michael M. Bronstein,
Anthea Monod
Abstract:
Bounding and predicting the generalization gap of overparameterized neural networks remains a central open problem in theoretical machine learning. Neural network optimization trajectories have been proposed to possess fractal structure, leading to bounds and generalization measures based on notions of fractal dimension on these trajectories. Prominently, both the Hausdorff dimension and the persistent homology dimension have been proposed to correlate with generalization gap, thus serving as a measure of generalization. This work performs an extended evaluation of these topological generalization measures. We demonstrate that fractal dimension fails to predict generalization of models trained from poor initializations. We further identify that the $\ell^2$ norm of the final parameter iterate, one of the simplest complexity measures in learning theory, correlates more strongly with the generalization gap than these notions of fractal dimension. Finally, our study reveals the intriguing manifestation of model-wise double descent in persistent homology-based generalization measures. This work lays the ground for a deeper investigation of the causal relationships between fractal geometry, topological data analysis, and neural network optimization.
Submitted 4 June, 2024;
originally announced June 2024.
-
Tropical Expressivity of Neural Networks
Authors:
Shiv Bhatia,
Yueqi Cao,
Paul Lezeau,
Anthea Monod
Abstract:
We propose an algebraic geometric framework to study the expressivity of linear activation neural networks. A particular quantity that has been actively studied in the field of deep learning is the number of linear regions, which gives an estimate of the information capacity of the architecture. To study and evaluate information capacity and expressivity, we work in the setting of tropical geometry -- a combinatorial and polyhedral variant of algebraic geometry -- where there are known connections between tropical rational maps and feedforward neural networks. Our work builds on and expands this connection to capitalize on the rich theory of tropical geometry to characterize and study various architectural aspects of neural networks. Our contributions are threefold: we provide a novel tropical geometric approach to selecting sampling domains among linear regions; an algebraic result allowing for a guided restriction of the sampling domain for network architectures with symmetries; and an open source library to analyze neural networks as tropical Puiseux rational maps. We provide a comprehensive set of proof-of-concept numerical experiments demonstrating the breadth of neural network architectures to which tropical geometric theory can be applied to reveal insights on expressivity characteristics of a network. Our work provides the foundations for the adaptation of both theory and existing software from computational tropical geometry and symbolic computation to deep learning.
Submitted 30 May, 2024;
originally announced May 2024.
-
Tropical Gradient Descent
Authors:
Roan Talbut,
Anthea Monod
Abstract:
We propose a gradient descent method for solving optimisation problems arising in settings of tropical geometry - a variant of algebraic geometry that has become increasingly studied in applications such as computational biology, economics, and computer science. Our approach takes advantage of the polyhedral and combinatorial structures arising in tropical geometry to propose a versatile approach for approximating local minima in tropical statistical optimisation problems - a rapidly growing body of work in recent years. Theoretical results establish global solvability for 1-sample problems and a convergence rate of $O(1/\sqrt{k})$. Numerical experiments demonstrate the method's superior performance over classical descent for tropical optimisation problems which exhibit tropical convexity but not classical convexity. Notably, tropical descent seamlessly integrates into advanced optimisation methods, such as Adam, offering improved overall performance.
Submitted 29 May, 2024;
originally announced May 2024.
-
Topological Community Detection: A Sheaf-Theoretic Approach
Authors:
Arne Wolf,
Anthea Monod
Abstract:
We propose a model for network community detection using topological data analysis, a branch of modern data science that leverages theory from algebraic topology to statistical analysis and machine learning. Specifically, we use cellular sheaves, which relate local to global properties of various algebraic topological constructions, to propose three new algorithms for vertex clustering over networks to detect communities. We apply our algorithms to real social network data in numerical experiments and obtain near optimal results in terms of modularity. Our work is the first implementation of sheaves on real social network data and provides a solid proof-of-concept for future work using sheaves as tools to study complex systems captured by networks and simplicial complexes.
Submitted 9 October, 2023;
originally announced October 2023.
-
Computable Stability for Persistence Rank Function Machine Learning
Authors:
Qiquan Wang,
Inés García-Redondo,
Pierre Faugère,
Anthea Monod,
Gregory Henselman-Petrusek
Abstract:
Persistent homology barcodes and diagrams are a cornerstone of topological data analysis. Widely used in many real data settings, they relate variation in topological information (as measured by cellular homology) with variation in data; however, they are challenging to use in statistical settings due to their complex geometric structure. In this paper, we revisit the persistent homology rank function -- an invariant measure of "shape" that was introduced before barcodes and persistence diagrams and captures the same information in a form that is more amenable to data and computation. In particular, since they are functions, techniques from functional data analysis -- a domain of statistics adapted for functions -- apply directly to persistent homology when represented by rank functions. Rank functions, however, have been less popular than barcodes because they face the challenge that stability -- a property that is crucial to validate their use in data analysis -- is difficult to guarantee, mainly due to metric concerns on rank function space. However, rank functions extend more naturally to the increasingly popular and important case of multiparameter persistent homology. In this paper, we study the performance of rank functions in functional inferential statistics and machine learning on both simulated and real data, and in both single and multiparameter persistent homology. We find that the use of persistent homology captured by rank functions offers a clear improvement over existing approaches. We then provide theoretical justification for our numerical experiments and applications to data by deriving several stability results for single- and multiparameter persistence rank functions under various metrics with the underlying aim of computational feasibility and interpretability.
Submitted 6 July, 2023;
originally announced July 2023.
-
Probability Metrics for Tropical Spaces of Different Dimensions
Authors:
Roan Talbut,
Daniele Tramontano,
Yueqi Cao,
Mathias Drton,
Anthea Monod
Abstract:
The problem of comparing probability distributions is at the heart of many tasks in statistics and machine learning and the most classical comparison methods assume that the distributions occur in spaces of the same dimension. Recently, a new geometric solution has been proposed to address this problem when the measures live in Euclidean spaces of differing dimensions. Here, we study the same problem of comparing probability distributions of different dimensions in the tropical geometric setting, which is becoming increasingly relevant in computations and applications involving complex, geometric data structures. Specifically, we construct a Wasserstein distance between measures on different tropical projective tori - the focal metric spaces in both theory and applications of tropical geometry - via tropical mappings between probability measures. We prove equivalence of the directionality of the maps, whether starting from the lower dimensional space and mapping to the higher dimensional space or vice versa. As an important practical implication, our work provides a framework for comparing probability distributions on the spaces of phylogenetic trees with different leaf sets.
Submitted 6 July, 2023;
originally announced July 2023.
-
Generalized Morse Theory of Distance Functions to Surfaces for Persistent Homology
Authors:
Anna Song,
Ka Man Yim,
Anthea Monod
Abstract:
This paper brings together three distinct theories with the goal of quantifying shape textures with complex morphologies. Distance fields are central objects in shape representation, while topological data analysis uses algebraic topology to characterize geometric and topological patterns in shapes. The most well-known and widely applied tool from this approach is persistent homology, which tracks the evolution of topological features in a dynamic manner as a barcode. Morse theory is a framework from differential topology that studies critical points of functions on manifolds; it has been used to characterize the birth and death of persistent homology features. However, a significant limitation to Morse theory is that it cannot be readily applied to distance functions because distance functions lack smoothness, which is required in Morse theory. Our contributions to addressing this issue are twofold. First, we generalize Morse theory to Euclidean distance functions of bounded sets with smooth boundaries. We focus in particular on distance fields for shape representation and we study the persistent homology of shape textures using a sublevel set filtration induced by the signed distance function. We use transversality theory to prove that for generic embeddings of a smooth compact surface in $\mathbb{R}^3$, signed distance functions admit finitely many non-degenerate critical points. This gives rise to our second contribution, which is that shapes and textures can both now be quantified and rigorously characterized in the language of persistent homology: signed distance persistence modules of generic shapes admit a finite barcode decomposition whose birth and death points can be classified and described geometrically. We use this approach to quantify shape textures on both simulated data and real vascular data from biology.
Submitted 2 July, 2023; v1 submitted 26 June, 2023;
originally announced June 2023.
-
Wavelet-Based Density Estimation for Persistent Homology
Authors:
Konstantin Häberle,
Barbara Bravi,
Anthea Monod
Abstract:
Persistent homology is a central methodology in topological data analysis that has been successfully implemented in many fields and is becoming increasingly popular and relevant. The output of persistent homology is a persistence diagram -- a multiset of points supported on the upper half plane -- that is often used as a statistical summary of the topological features of data. In this paper, we study the random nature of persistent homology and estimate the density of expected persistence diagrams from observations using wavelets; we show that our wavelet-based estimator is optimal. Furthermore, we propose an estimator that offers a sparse representation of the expected persistence diagram that achieves near-optimality. We demonstrate the utility of our contributions in a machine learning task in the context of dynamical systems.
Submitted 22 April, 2024; v1 submitted 15 May, 2023;
originally announced May 2023.
-
$k$-Means Clustering for Persistent Homology
Authors:
Yueqi Cao,
Prudence Leung,
Anthea Monod
Abstract:
Persistent homology is a methodology central to topological data analysis that extracts and summarizes the topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, its algebraic construction induces a metric space of persistence diagrams with a highly complex geometry. In this paper, we prove convergence of the $k$-means clustering algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework. Additionally, we perform numerical experiments on various representations of persistent homology, including embeddings of persistence diagrams as well as diagrams themselves and their generalizations as persistence measures; we find that $k$-means clustering performed directly on persistence diagrams and measures outperforms clustering on their vectorized representations.
Submitted 25 November, 2023; v1 submitted 18 October, 2022;
originally announced October 2022.
-
Fast Topological Signal Identification and Persistent Cohomological Cycle Matching
Authors:
Inés García-Redondo,
Anthea Monod,
Anna Song
Abstract:
Within the context of topological data analysis, the problems of identifying topological significance and matching signals across datasets are important and useful inferential tasks in many applications. The limitation of existing solutions to these problems, however, is computational speed. In this paper, we harness the state-of-the-art for persistent homology computation by studying the problem of determining topological prevalence and cycle matching using a cohomological approach, which increases their feasibility and applicability to a wider variety of applications and contexts. We demonstrate this on a wide range of real-life, large-scale, and complex datasets. We extend existing notions of topological prevalence and cycle matching to include general non-Morse filtrations. This provides the most general and flexible state-of-the-art adaptation of topological signal identification and persistent cycle matching, which performs comparisons of orders of ten for thousands of sampled points in a matter of minutes on standard institutional HPC CPU facilities.
Submitted 30 May, 2024; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Video Restoration with a Deep Plug-and-Play Prior
Authors:
Antoine Monod,
Julie Delon,
Matias Tassano,
Andrés Almansa
Abstract:
This paper presents a novel method for restoring digital videos via a Deep Plug-and-Play (PnP) approach. Under a Bayesian formalism, the method consists in using a deep convolutional denoising network in place of the proximal operator of the prior in an alternating optimization scheme. We distinguish ourselves from prior PnP work by directly applying that method to restore a digital video from a degraded video observation. This way, a network trained once for denoising can be repurposed for other video restoration tasks. Our experiments in video deblurring, super-resolution, and interpolation of random missing pixels all show a clear benefit to using a network specifically designed for video denoising, as it yields better restoration performance and better temporal stability than a single image network with similar denoising performance using the same PnP formulation. Moreover, our method compares favorably to applying a different state-of-the-art PnP scheme separately on each frame of the sequence. This opens new perspectives in the field of video restoration.
Submitted 15 September, 2022; v1 submitted 6 September, 2022;
originally announced September 2022.
-
Learning Linear Non-Gaussian Polytree Models
Authors:
Daniele Tramontano,
Anthea Monod,
Mathias Drton
Abstract:
In the context of graphical causal discovery, we adapt the versatile framework of linear non-Gaussian acyclic models (LiNGAMs) to propose new algorithms to efficiently learn graphs that are polytrees. Our approach combines the Chow--Liu algorithm, which first learns the undirected tree structure, with novel schemes to orient the edges. The orientation schemes assess algebraic relations among moments of the data-generating distribution and are computationally inexpensive. We establish high-dimensional consistency results for our approach and compare different algorithmic versions in numerical experiments.
Submitted 13 August, 2022;
originally announced August 2022.
-
Rewiring Networks for Graph Neural Network Training Using Discrete Geometry
Authors:
Jakub Bober,
Anthea Monod,
Emil Saucan,
Kevin N. Webster
Abstract:
Information over-squashing is a phenomenon of inefficient information propagation between distant nodes on networks. It is an important problem that is known to significantly impact the training of graph neural networks (GNNs), as the receptive field of a node grows exponentially. To mitigate this problem, a preprocessing procedure known as rewiring is often applied to the input network. In this paper, we investigate the use of discrete analogues of classical geometric notions of curvature to model information flow on networks and rewire them. We show that these classical notions achieve state-of-the-art performance in GNN training accuracy on a variety of real-world network datasets. Moreover, compared to the current state-of-the-art, these classical notions exhibit a clear advantage in computational runtime by several orders of magnitude.
Submitted 16 July, 2022;
originally announced July 2022.
-
A Geometric Condition for Uniqueness of Fréchet Means of Persistence Diagrams
Authors:
Yueqi Cao,
Anthea Monod
Abstract:
The Fréchet mean is an important statistical summary and measure of centrality of data; it has been defined and studied for persistent homology captured by persistence diagrams. However, the complicated geometry of the space of persistence diagrams implies that the Fréchet mean for a given set of persistence diagrams is not necessarily unique, which prohibits theoretical guarantees for empirical means with respect to population means. In this paper, we derive a variance expression for a set of persistence diagrams exhibiting a multi-matching between the persistence points known as a grouping. Moreover, we propose a condition for groupings, which we refer to as flatness: sets of persistence diagrams that exhibit flat groupings give rise to unique Fréchet means. We derive a finite sample convergence result for general groupings, which results in convergence for Fréchet means if the groupings are flat. Finally, we interpret flat groupings in a recently-proposed general framework of Fréchet means in Alexandrov geometry. Together with recent results from Alexandrov geometry, this allows for the first derivation of a finite sample convergence rate for sets of persistence diagrams and lays the ground for viability of the Fréchet mean as a practical statistical summary of persistent homology.
Submitted 4 May, 2023; v1 submitted 8 July, 2022;
originally announced July 2022.
-
Approximating Persistent Homology for Large Datasets
Authors:
Yueqi Cao,
Anthea Monod
Abstract:
Persistent homology is an important methodology from topological data analysis which adapts theory from algebraic topology to data settings and has been successfully implemented in many applications. It produces a statistical summary in the form of a persistence diagram, which captures the shape and size of the data. Despite its widespread use, persistent homology is simply impossible to implement when a dataset is very large. In this paper we address the problem of finding a representative persistence diagram for prohibitively large datasets. We adapt the classical statistical method of bootstrapping, namely, drawing and studying smaller multiple subsamples from the large dataset. We show that the mean of the persistence diagrams of subsamples -- taken as a mean persistence measure computed from the subsamples -- is a valid approximation of the true persistent homology of the larger dataset. We give the rate of convergence of the mean persistence diagram to the true persistence diagram in terms of the number of subsamples and size of each subsample. Given the complex algebraic and geometric nature of persistent homology, we adapt the convexity and stability properties in the space of persistence diagrams together with random set theory to achieve our theoretical results for the general setting of point cloud data. We demonstrate our approach on simulated and real data, including an application of shape clustering on complex large-scale point cloud data.
Submitted 18 May, 2022; v1 submitted 19 April, 2022;
originally announced April 2022.
-
An Analysis and Implementation of the HDR+ Burst Denoising Method
Authors:
Antoine Monod,
Julie Delon,
Thomas Veit
Abstract:
HDR+ is an image processing pipeline presented by Google in 2016. At its core lies a denoising algorithm that uses a burst of raw images to produce a single higher quality image. Since it is designed as a versatile solution for smartphone cameras, it does not necessarily aim for the maximization of standard denoising metrics, but rather for the production of natural, visually pleasing images. In this article, we specifically discuss and analyze the HDR+ burst denoising algorithm architecture and the impact of its various parameters. With this publication, we provide an open source Python implementation of the algorithm, along with an interactive demo.
Submitted 18 October, 2021;
originally announced October 2021.
-
Curved Markov Chain Monte Carlo for Network Learning
Authors:
John Sigbeku,
Emil Saucan,
Anthea Monod
Abstract:
We present a geometrically enhanced Markov chain Monte Carlo sampler for networks based on a discrete curvature measure defined on graphs. Specifically, we incorporate the concept of graph Forman curvature into sampling procedures on both the nodes and edges of a network explicitly, via the transition probability of the Markov chain, as well as implicitly, via the target stationary distribution, which gives a novel, curved Markov chain Monte Carlo approach to learning networks. We show that integrating curvature into the sampler results in faster convergence to a wide range of network statistics demonstrated on deterministic networks drawn from real-world data.
Submitted 11 October, 2021; v1 submitted 7 October, 2021;
originally announced October 2021.
-
An Invitation to Tropical Alexandrov Curvature
Authors:
Carlos Améndola,
Anthea Monod
Abstract:
We study Alexandrov curvature in the tropical projective torus with respect to the tropical metric, which has been useful in various statistical analyses, particularly in phylogenomics. Alexandrov curvature is a generalization of classical Riemannian sectional curvature to more general metric spaces; it is determined by a comparison of triangles in an arbitrary metric space to corresponding triangles in Euclidean space. In the polyhedral setting of tropical geometry, triangles are a combinatorial object, which adds a combinatorial dimension to our analysis. We study the effect that the triangle types have on curvature, and what can be revealed about these types from the curvature. We find that positive, negative, zero, and undefined Alexandrov curvature can exist concurrently in tropical settings and that there is a tight connection between triangle combinatorial type and curvature. Our results are established both by proof and computational experiments, and shed light on the intricate geometry of the tropical projective torus. In this context, we discuss implications for statistical methodologies which admit inherent geometric interpretations.
This paper is dedicated to Bernd Sturmfels on the occasion of his 60th birthday.
Submitted 10 February, 2023; v1 submitted 16 May, 2021;
originally announced May 2021.
-
Topological Information Retrieval with Dilation-Invariant Bottleneck Comparative Measures
Authors:
Yueqi Cao,
Athanasios Vlontzos,
Luca Schmidtke,
Bernhard Kainz,
Anthea Monod
Abstract:
Appropriately representing elements in a database so that queries may be accurately matched is a central task in information retrieval; recently, this has been achieved by embedding the graphical structure of the database into a manifold in a hierarchy-preserving manner using a variety of metrics. Persistent homology is a tool commonly used in topological data analysis that is able to rigorously characterize a database in terms of both its hierarchy and connectivity structure. Computing persistent homology on a variety of embedded datasets reveals that some commonly used embeddings fail to preserve the connectivity. We show that those embeddings which successfully retain the database topology coincide in persistent homology by introducing two dilation-invariant comparative measures to capture this effect: in particular, they address the issue of metric distortion on manifolds. We provide an algorithm for their computation that exhibits greatly reduced time complexity over existing methods. We use these measures to perform the first instance of topology-based information retrieval and demonstrate its increased performance over the standard bottleneck distance for persistent homology. We showcase our approach on databases of different data varieties including text, videos, and medical images.
Submitted 6 July, 2022; v1 submitted 4 April, 2021;
originally announced April 2021.
-
Tropical Geometric Variation of Phylogenetic Tree Shapes
Authors:
Bo Lin,
Anthea Monod,
Ruriko Yoshida
Abstract:
We study the behavior of phylogenetic tree shapes in the tropical geometric interpretation of tree space. Tree shapes are formally referred to as tree topologies; a tree topology can also be thought of as a tree combinatorial type, which is given by the tree's branching configuration and leaf labeling. We use the tropical line segment as a framework to define notions of variance as well as invariance of tree topologies: we provide a combinatorial search theorem that describes all tree topologies occurring along a tropical line segment, as well as a setting under which tree topologies do not change along a tropical line segment. Our study is motivated by comparison to the moduli space endowed with a geodesic metric proposed by Billera, Holmes, and Vogtmann (referred to as BHV space); we consider the tropical geometric setting as an alternative framework to BHV space for sets of phylogenetic trees. We give an algorithm to compute tropical line segments which is lower in computational complexity than the fastest method currently available for BHV geodesics and show that its trajectory behaves more subtly: while the BHV geodesic traverses the origin for vastly different tree topologies, the tropical line segment bypasses it.
Submitted 19 February, 2022; v1 submitted 10 October, 2020;
originally announced October 2020.
-
Tropical Optimal Transport and Wasserstein Distances
Authors:
Wonjun Lee,
Wuchen Li,
Bo Lin,
Anthea Monod
Abstract:
We study the problem of optimal transport in tropical geometry and define the Wasserstein-$p$ distances in the continuous metric measure space setting of the tropical projective torus. We specify the tropical metric -- a combinatorial metric that has been used to study the tropical geometric space of phylogenetic trees -- as the ground metric and study the cases of $p=1,2$ in detail. The case of $p=1$ gives an efficient computation of the infinitely-many geodesics on the tropical projective torus, while the case of $p=2$ gives a form for Fréchet means and a general inner product structure. Our results also provide theoretical foundations for geometric insight and a statistical framework in a tropical geometric setting. We construct explicit algorithms for the computation of the tropical Wasserstein-1 and 2 distances and prove their convergence. Our results provide the first study of the Wasserstein distances and optimal transport in tropical geometry. Several numerical examples are provided.
Submitted 17 May, 2021; v1 submitted 13 November, 2019;
originally announced November 2019.
-
Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective
Authors:
Anthea Monod,
Bo Lin,
Ruriko Yoshida,
Qiwen Kang
Abstract:
Phylogenetic trees are the fundamental mathematical representation of evolutionary processes in biology. They are also objects of interest in pure mathematics, such as algebraic geometry and combinatorics, due to their discrete geometry. Although they are important data structures, they face the significant challenge that sets of trees form a non-Euclidean phylogenetic tree space, which means that standard computational and statistical methods cannot be directly applied. In this work, we explore the statistical feasibility of a pure mathematical representation of the set of all phylogenetic trees based on tropical geometry for both descriptive and inferential statistics, and unsupervised and supervised machine learning. Our exploration is both theoretical and practical. We show that the tropical geometric phylogenetic tree space endowed with a generalized Hilbert projective metric exhibits analytic, geometric, and topological properties that are desirable for theoretical studies in probability and statistics and allow for well-defined questions to be posed. We illustrate the statistical feasibility of the tropical geometric perspective for phylogenetic trees with an example of both a descriptive and inferential statistical task. Moreover, this approach exhibits increased computational efficiency and statistical performance over the current state-of-the-art, which we illustrate with a real data example on seasonal influenza. Our results demonstrate the viability of the tropical geometric setting for parametric statistical and probabilistic studies of sets of phylogenetic trees.
Submitted 29 June, 2022; v1 submitted 31 May, 2018;
originally announced May 2018.
-
Tropical Sufficient Statistics for Persistent Homology
Authors:
Anthea Monod,
Sara Kališnik,
Juan Ángel Patiño-Galindo,
Lorin Crawford
Abstract:
We show that an embedding in Euclidean space based on tropical geometry generates stable sufficient statistics for barcodes. In topological data analysis, barcodes are multiscale summaries of algebraic topological characteristics that capture the `shape' of data; however, in practice, they have complex structures that make them difficult to use in statistical settings. The sufficiency result presented in this work allows for classical probability distributions to be assumed on the tropical geometric representation of barcodes. This makes a variety of parametric statistical inference methods amenable to barcodes, all while maintaining their initial interpretations. More specifically, we show that exponential family distributions may be assumed, and that likelihood functions for persistent homology may be constructed. We conceptually demonstrate sufficiency and illustrate its utility in persistent homology dimensions 0 and 1 with concrete parametric applications to human immunodeficiency virus and avian influenza data.
Submitted 30 June, 2019; v1 submitted 8 September, 2017;
originally announced September 2017.
-
Estimating thresholding levels for random fields via Euler characteristics
Authors:
Robert J. Adler,
Kevin Bartz,
Sam C. Kou,
Anthea Monod
Abstract:
We introduce Lipschitz-Killing curvature (LKC) regression, a new method to produce $(1-\alpha)$ thresholds for signal detection in random fields that does not require knowledge of the spatial correlation structure. The idea is to fit observed empirical Euler characteristics to the Gaussian kinematic formula via generalized least squares, which quickly and easily provides statistical estimates of the LKCs --- complex topological quantities that can be extremely challenging to compute, both theoretically and numerically. With these estimates, we can then make use of a powerful parametric approximation via Euler characteristics for Gaussian random fields to generate accurate $(1-\alpha)$ thresholds and $p$-values. The main features of our proposed LKC regression method are easy implementation, conceptual simplicity, and facilitated diagnostics, which we demonstrate in a variety of simulations and applications.
Submitted 27 April, 2017;
originally announced April 2017.
-
Predicting Clinical Outcomes in Glioblastoma: An Application of Topological and Functional Data Analysis
Authors:
Lorin Crawford,
Anthea Monod,
Andrew X. Chen,
Sayan Mukherjee,
Raúl Rabadán
Abstract:
Glioblastoma multiforme (GBM) is an aggressive form of human brain cancer that is under active study in the field of cancer biology. Its rapid progression and the relative time cost of obtaining molecular data make other readily-available forms of data, such as images, an important resource for actionable measures in patients. Our goal is to utilize information given by medical images taken from GBM patients in statistical settings. To do this, we design a novel statistic---the smooth Euler characteristic transform (SECT)---that quantifies magnetic resonance images (MRIs) of tumors. Due to its well-defined inner product structure, the SECT can be used in a wider range of functional and nonparametric modeling approaches than other previously proposed topological summary statistics. When applied to a cohort of GBM patients, we find that the SECT is a better predictor of clinical outcomes than both existing tumor shape quantifications and common molecular assays. Specifically, we demonstrate that SECT features alone explain more of the variance in GBM patient survival than gene expression, volumetric features, and morphometric features. The main takeaways from our findings are thus twofold. First, they suggest that images contain valuable information that can play an important role in clinical prognosis and other medical decisions. Second, they show that the SECT is a viable tool for the broader study of medical imaging informatics.
Submitted 12 September, 2019; v1 submitted 21 November, 2016;
originally announced November 2016.