Non-geodesically-convex optimization in the Wasserstein space
Abstract
We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler whose convergence is—to our knowledge—still unknown in our very general non-geodesically-convex setting.
1 Introduction
Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can be considered as gradient descent optimization where a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. At the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation describing a diffusion over time of probability measures. In the seminal paper [39], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the functional relative entropy, a.k.a. Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. The discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 66, 30] seem natural and enjoy nice theoretical properties without discretization errors. Another line of research studies discretization of Wasserstein gradient flow by either quantifying the discretization error between the continuous-time flow and the discrete-time flow [39, 67, 27, 23, 28] or viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [65, 26, 70, 11] where the primary focus is on (geodesically) convex optimization problems.
Nonconvex, nonsmooth optimization is challenging, even in Euclidean space. To quote Rockafellar [63]: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.” The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [9] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) in sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary and cover only the ULA (and its variants) with a specific class of (smooth) densities. A theoretical understanding of other classes of algorithms and densities is needed.
We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization.
We study the following non-geodesically-convex optimization problem defined over the space of probability measures over with finite second moment, i.e., ,
(1) |
where is a nonconvex function which can be represented as a difference of two convex functions and , is the potential energy, and plays a role as the regularizer which is assumed to be a convex function along generalized geodesics.
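For concreteness, problem (1) can be written out as below. This is a sketch in assumed notation, since the symbols are not fixed by the surrounding text: $X = \mathbb{R}^d$ is the ambient space, $\mathcal{P}_2(X)$ the space of probability measures with finite second moment, $F = G - H$ the DC potential with convex components $G$ and $H$, $\mathcal{E}_F$ the potential energy, and $\mathscr{H}$ the regularizer.

```latex
\begin{equation*}
  \min_{\mu \in \mathcal{P}_2(X)} \; \mathcal{F}(\mu) := \mathcal{E}_F(\mu) + \mathscr{H}(\mu),
  \qquad
  \mathcal{E}_F(\mu) := \int_X F(x)\, \mathrm{d}\mu(x),
  \qquad
  F = G - H .
\end{equation*}
```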
Why difference-of-convex structure?
Nonconvexity lies at the difference-of-convex (DC) structure , where and are called the first and second DC components, respectively. being nonconvex implies being non-geodesically-convex in general. First, the class of DC functions is very rich, and DC structures are present everywhere in real-world applications [60, 46, 47, 1, 22, 56, 58]. Weakly convex and Lipschitz smooth (-smooth or simply smooth) functions are two subclasses of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [8]. We also remark that many nonconvex functions admit quite natural DC decompositions, for example, an -smooth function has the following splittings: and whenever . Second, DC functions preserve enough structure to extend convex analysis. Such structure is key in classic DC programming [60] and in our Wasserstein space analysis with optimal transport tools.
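The two splittings of a smooth function mentioned above can be checked numerically. The sketch below is only an illustration under assumptions not taken from the text: the test function `cos` (whose derivative is 1-Lipschitz), the constant `L = 1.0`, and the helper `looks_convex`, which tests convexity through second differences on a grid rather than proving it.

```python
import numpy as np

L = 1.0  # assumed smoothness constant: the derivative of cos is 1-Lipschitz


def f(x):
    # Illustrative nonconvex, L-smooth function (an assumption, not from the text).
    return np.cos(x)


# Splitting 1: f = (f + L/2 x^2) - L/2 x^2
def g1(x):
    return f(x) + 0.5 * L * x**2


def h1(x):
    return 0.5 * L * x**2


# Splitting 2: f = L/2 x^2 - (L/2 x^2 - f)
def g2(x):
    return 0.5 * L * x**2


def h2(x):
    return 0.5 * L * x**2 - f(x)


def looks_convex(func, lo=-10.0, hi=10.0, n=2001, tol=1e-9):
    # Hypothetical helper: a convex function has nonnegative second
    # differences on any uniform grid; this is a numerical check only.
    x = np.linspace(lo, hi, n)
    y = func(x)
    second_diff = y[:-2] - 2.0 * y[1:-1] + y[2:]
    return bool(np.all(second_diff >= -tol))
```

On this example, all four components pass the grid check while `f` itself does not, which is what the DC decompositions predict.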
Context
Many problems in machine learning and sampling fall into the spectrum of problem (1). For example, refer to a discussion in [65] that inspired our work. The regularizer can be the internal energy [4, Sect. 10.4.3]. Under McCann’s condition, the internal energy is convex along generalized geodesics [4, Prop. 9.3.9]. In particular, the negative entropy, if is absolutely continuous w.r.t. Lebesgue measure, otherwise, is a special case of internal energy satisfying McCann’s condition. In the latter case, where , the optimization problem reduces to a sampling problem with log-DC target distribution. In Bayesian inference, the posterior structure depends on both the prior and likelihood. If the likelihood is log-smooth, it exhibits the aforementioned DC splittings. Log-priors, often nonsmooth to capture sparsity or low rank, typically also have explicit DC structures [48, 32, 25]. In the context of infinitely wide one-layer neural networks and Maximum Mean Discrepancy [51, 7, 21], let be the optimal distribution over a network’s parameters, be a given kernel, the regularizer is then the interaction energy and In general, is not convex along generalized geodesics and is nonconvex but not necessarily DC. When the kernel has Lipschitz gradient, we can adjust both and as and for some making generalized geodesically convex and concave (hence DC); Appx. A.2.
Our idea is to minimize (1) in the space of probability distributions by discretizing the gradient flow of , using the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In previous work [70], this was done with the Forward-Backward (FB) Euler discretization, but without convergence analysis. Recently, Salim et al. [65] analyzed FB Euler, but their results do not apply here because is nonconvex and possibly nonsmooth. Exploiting the DC structure of and inspired by the classical DC programming literature [60], we slightly modify FB Euler to obtain a scheme, named semi FB Euler, for which we can provide a wide range of convergence analyses. Regarding the name, "semi" refers to the splitting of the potential energy. The scheme can nevertheless be reinterpreted as FB Euler.
Contributions
To our knowledge, no prior work studies problem (1) when is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler (4), building on the classic DC optimization proof template [60, 45, 46] with substantial adaptation to Wasserstein geometry, and derive the following set of new insights:
- Thm. 1
- Thm. 2: We provide a convergence rate of in terms of the Wasserstein (sub)gradient mapping in the general nonsmooth setting. The notion of gradient mapping [33, 38, 55] comes from proximal algorithms in Euclidean space; it applies to nonconvex programs, where the distance to a global solution is, in general, not tractable.
- Thm. 3: Under the extra assumption that is twice continuously differentiable and has bounded Hessian, we provide a convergence rate of in terms of the distance of to the Fréchet subdifferential of . One can think of this as a convergence rate to Fréchet stationarity, i.e., if is a Fréchet stationary point of , then, by definition, is in the Fréchet subdifferential of at . Fréchet stationarity is a relatively sharp necessary condition for local optimality.
- Thm. 4, 5: Under the assumptions of Thm. 3 and additionally satisfying the Łojasiewicz-type inequality for some Łojasiewicz exponent of , we show that is a Cauchy sequence under the Wasserstein topology and, thanks to the completeness of the Wasserstein space, the whole sequence converges to some . We show that is in fact a global minimizer of . Furthermore, we provide convergence rates of in three different regimes ( denotes the Wasserstein metric): (1) if , converges to after a finite number of steps; (2) if , both and converge to exponentially fast; (3) if , both and converge sublinearly to with rates and , respectively. When is the negative entropy, ; therefore, in the sampling context, we provide convergence guarantees in both Wasserstein and KL distances. See Sect. 4.3 for additional observations and implications.
2 Preliminaries
2.1 Notations and basic results in measure theory and functional analysis
We denote by , the Borel -algebra over , and the Lebesgue measure on . is the set of Borel probability measures on . For , we denote its second-order moment by , where can be infinity. denotes a set of finite second-order moment probability measures. is the set of measures that are absolutely continuous w.r.t. . Here -a.e. stands for almost everywhere w.r.t. .
Let be the classes of -time continuously differentiable functions, infinitely differentiable functions with compact support, bounded and continuous functions, respectively.
From functional analysis [20], for each , denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions such that . We shall consider an element of as an equivalence class of functions that agree -a.e. on rather than a single function. The norm of is . When , is actually a Hilbert space with the inner product which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by the Hilbert space of in which . The norm .
We say that has quadratic growth if there exists such that for all . It is clear that if has quadratic growth and , then
The pushforward of a measure through a Borel map , denoted by is defined by for every Borel set
2.2 Optimal transport [4, 5, 69, 68]
Given , the principal problem in optimal transport is to find a transport map pushing to , i.e., , in the most cost-efficient way, i.e., minimizing on -average. Monge’s formulation for this problem is , where the optimal solution, if it exists, is denoted by and called the optimal (Monge) map. Monge’s problem can be ill-posed, e.g., no such exists when is a Dirac mass and is absolutely continuous [5].
By relaxing Monge’s formulation, Kantorovich considers , where denotes the set of probabilities over whose marginals are and , i.e., iff where are the projections onto the first space and the second space, respectively. Such is called a plan. Kantorovich’s formulation is well-posed because is non-empty (at least ) and the element actually exists (see [5, Sect. 2.2]). The set of optimal plans between and is denoted by In terms of random variables, any pair where is called a coupling of and while it is called an optimal coupling if the joint law of and is in .
In , the value in Kantorovich’s problem specifies a valid metric referred to as the Wasserstein distance, for some, and thus all, . The metric space is then called the Wasserstein space. In , besides the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence converges narrowly to if for all Convergence in the Wasserstein metric implies narrow convergence, but the converse is not necessarily true. The extra condition to make it true is . We denote Wasserstein and narrow convergence by and , respectively.
If , Monge’s formulation is well-posed and the unique (-a.e.) solution exists, and in this case, it is safe to talk about (and use) the optimal transport map . Moreover, there exists some convex function such that -a.e. Kantorovich’s problem also has a unique solution and it is given by where is the identity map. This is known as Brenier's theorem or the polar factorization theorem [18].
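In one dimension, the gradient of a convex function is simply a nondecreasing map, so the optimal map is the monotone rearrangement. The sketch below illustrates the discrete counterpart under stated assumptions: two empirical measures on the real line with the same number of equally weighted atoms, for which matching sorted samples is an optimal coupling for the quadratic cost. Brenier's theorem itself requires absolute continuity, so this is an analogue rather than an instance of it, and the function name `w2_empirical_1d` is hypothetical.

```python
import numpy as np


def w2_empirical_1d(x, y):
    """Wasserstein-2 distance between two 1D empirical measures.

    Assumes both samples have the same size and uniform weights, so the
    monotone (sorted) matching is an optimal coupling.
    """
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    if x.shape != y.shape:
        raise ValueError("samples must have the same size")
    # The i-th smallest atom of x is transported to the i-th smallest atom of y.
    return float(np.sqrt(np.mean((x - y) ** 2)))
```

For instance, translating every atom by a constant `c` gives a distance of `|c|`, since the translation is itself a monotone map.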
2.3 Subdifferential calculus in the Wasserstein space
Apart from being a metric space, also enjoys some pre-Riemannian structure making subdifferential calculus on it possible. Let us have a picture of a manifold in mind. Firstly, the tangent space [4] of at is , where the closure is w.r.t. the -topology. Intuitively, for , is an optimal transport map if is small enough [43], so plays a role as "tangent vector".
Let , we denote . Let , we say that a map belongs to the Fréchet subdifferential [15, 43] if for all , where the little-o notation means If , we say is Fréchet subdifferentiable at . We also denote .
Similarly, we say that belongs to the (Fréchet) superdifferential of at if . In other words,
We say is Wasserstein differentiable [15, 43] at if . We call an element of the intersection, denoted by , a Wasserstein gradient of at , and it holds , for all and any The Wasserstein gradient is not unique in general, but its parallel component in is unique, and this parallel component is again a valid Wasserstein gradient as the orthogonal component plays no role in the above definitions, i.e., if , it holds for any and [43, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of at .
2.4 Optimization in the Wasserstein space
A function is called proper if , while it is called lower semicontinuous (l.s.c) if for any sequence , it holds .
We next recall (a simplified version of) generalized geodesic convexity.
Definition 1.
[65] Let . We say is convex along generalized geodesics if , , , .
The curve (called a generalized geodesic) interpolates from to as runs from to . The definition says that is convex along these curves. If and , the curve is a geodesic in . If the definition is relaxed to the class of geodesics only, we say that is convex along geodesics.
An important characterization of Fréchet subdifferential of a geodesically convex function is that we can drop the little-o notation in its definition in Sect. 2.3 [4, Sect 10.1.1]. As a convention, for a geodesically convex function , the Fréchet subdifferential will be simply written as .
First-order optimality conditions
Let be a proper function. is a global minimizer of if For local optimality, we shall use the Wasserstein metric to define neighborhoods. is a local minimizer if there exists such that for all We shall denote the (open) Wasserstein ball centered at with radius . If we replace by we obtain the notion of a closed Wasserstein ball.
We call a Fréchet stationary point of if Fréchet stationarity is a necessary condition for local optimality. In other words, if is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if is Wasserstein differentiable at , -a.e. [43]. When is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix).
3 Semi Forward-Backward Euler for difference-of-convex structures
3.1 Wasserstein gradient flows: different types of discretizations
To neatly present the idea of minimizing via discretized gradient flow, we first assume for a moment that is infinitely differentiable and is the negative entropy. See also a discussion in [65].
We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of , where the gradient flow is defined (under some technical assumptions [39]) as the limit of the following scheme with some simple time-interpolation
(2) |
Straightforwardly, given a fixed , (2) gives back a discretization for this flow known as Backward Euler. On the other hand, if is Wasserstein differentiable (Sect. 2.3), the Forward Euler discretization reads [70] , which is reinterpreted as doing gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function itself. However, the composite structure of (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), which first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [70]: , where is the convolution. This is the ULA "viewed" in the space of distributions (Eulerian approach). A more familiar and equivalent form of the ULA from the particle perspective (Lagrangian approach) reads where . The ULA is known to be asymptotically biased even for a Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hastings accept-reject step [62] is sometimes introduced. The Metropolis-Hastings algorithm [52, 36] is a much more general framework that works with almost any proposal (e.g., a random walk) and whose convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain. The Metropolis-Hastings algorithm is therefore outside our scope.
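The particle form of the ULA and its asymptotic bias can be illustrated in a few lines. The sketch below assumes the standard update (a gradient step on the potential plus Gaussian noise scaled by the square root of twice the stepsize) and a one-dimensional standard Gaussian target with potential `x**2 / 2`; the function names are hypothetical. For this target the chain is linear, so its stationary variance is available in closed form and differs from the target variance 1 for any positive stepsize.

```python
import numpy as np


def ula_step(x, grad_potential, step, rng):
    # One ULA update on an array of particles: gradient step on the
    # potential, then injected Gaussian noise with variance 2 * step.
    noise = rng.standard_normal(x.shape)
    return x - step * grad_potential(x) + np.sqrt(2.0 * step) * noise


def ula_stationary_variance_std_gaussian(step):
    # For the potential x^2 / 2 the update is x' = (1 - step) x + sqrt(2 step) xi,
    # so the stationary variance v solves v = (1 - step)^2 v + 2 step,
    # i.e. v = 1 / (1 - step / 2) for 0 < step < 2. The target variance is 1.
    return 1.0 / (1.0 - 0.5 * step)
```

A usage sketch: starting from `x = rng.standard_normal(20000)` with `rng = np.random.default_rng(0)` and repeatedly calling `ula_step(x, lambda z: z, 0.5, rng)`, the empirical variance settles near the closed-form value rather than near 1.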
In optimization, for composite structure, Forward-Backward (FB) Euler and its variants are methods of choice [59, 10]. The corresponding FB Euler for will take the gradient step (forward) according to the potential, and JKO step (backward) w.r.t. the negative entropy
(3) |
This scheme appears in [70] without convergence analysis, and later on [65] derives non-asymptotic convergence guarantees under the assumption being convex and Lipschitz smooth.
In this work, as is nonconvex and nonsmooth, the theory in [65] does not apply, and the convergence (if any) of (3) remains mysterious. The DC structure of can be further exploited. In DC programming [60], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler
(4) |
for which we can provide convergence guarantees. Apparently, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on and backward on , semi FB Euler does forward on and backward on ; recall that .
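The comparison between the two schemes can be made explicit in formulas. The block below is a sketch in assumed notation rather than a verbatim restatement of (2)-(4): $F = G - H$ is the DC potential with differentiable components, $\mathcal{E}_G$ the potential energy of $G$, $\mathscr{H}$ the regularizer, $\gamma > 0$ the stepsize, $I$ the identity map, $W_2$ the Wasserstein metric, and $\mathrm{JKO}$ the backward operator.

```latex
\begin{align*}
  % JKO (backward) operator of a functional \mathcal{G}
  \mathrm{JKO}_{\gamma \mathcal{G}}(\nu)
    &\in \operatorname*{argmin}_{\mu \in \mathcal{P}_2(X)}
      \Big\{ \mathcal{G}(\mu) + \tfrac{1}{2\gamma} W_2^2(\mu, \nu) \Big\}, \\
  % FB Euler: forward on the whole potential, backward on the regularizer
  \text{FB Euler:} \quad
  \nu_{k+1} &= (I - \gamma \nabla F)_{\#}\, \mu_k,
  & \mu_{k+1} &\in \mathrm{JKO}_{\gamma \mathscr{H}}(\nu_{k+1}), \\
  % semi FB Euler: forward on the concave part -H, backward on the convex part
  \text{semi FB Euler:} \quad
  \nu_{k+1} &= (I + \gamma \nabla H)_{\#}\, \mu_k,
  & \mu_{k+1} &\in \mathrm{JKO}_{\gamma (\mathcal{E}_G + \mathscr{H})}(\nu_{k+1}).
\end{align*}
```

In this notation, the forward map $I + \gamma \nabla H$ is the gradient of the convex function $x \mapsto \tfrac{1}{2}\|x\|^2 + \gamma H(x)$.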
Theoretically, semi FB Euler enjoys some advantages over FB Euler. Thanks to Brenier's theorem (Sect. 2.2), the pushforward step in semi FB Euler is optimal since is convex; meanwhile, the pushforward step in FB Euler is not optimal in general, and the corresponding optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when is (DC) differentiable. In contrast, we can provide solid theoretical guarantees for semi FB Euler, especially when is differentiable. We also offer convergence guarantees when is nonsmooth.
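The forward-on-concave, backward-on-convex pattern has a simple Euclidean analogue, the proximal DC iteration, which may help build intuition. The sketch below is that analogue on the real line, not the Wasserstein scheme itself. Everything in it is an assumption chosen for illustration: the double-well potential with convex components `x**2` and `3 * sqrt(1 + x**2)`, the stepsize, and the function names.

```python
import numpy as np


def potential(x):
    # Nonconvex DC potential F = G - H with G(x) = x^2 and H(x) = 3 sqrt(1 + x^2).
    return x**2 - 3.0 * np.sqrt(1.0 + x**2)


def grad_h(x):
    # Gradient of the second DC component H.
    return 3.0 * x / np.sqrt(1.0 + x**2)


def prox_g(y, step):
    # Proximal map of step * G for G(x) = x^2:
    # argmin_x { x^2 + (x - y)^2 / (2 step) } = y / (1 + 2 step).
    return y / (1.0 + 2.0 * step)


def semi_fb_euclidean(x0, step, n_iters):
    # Forward step on the concave part -H, then backward (proximal) step on G.
    xs = [float(x0)]
    for _ in range(n_iters):
        y = xs[-1] + step * grad_h(xs[-1])
        xs.append(float(prox_g(y, step)))
    return np.array(xs)
```

In this toy problem the iteration started at `x0 = 0.5` moves away from the local maximizer at 0 and approaches the critical point `sqrt(1.25)`, with the potential decreasing along the iterates.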
3.2 Problem setting
Our goal is to minimize the non-geodesically-convex functional over , where is a DC function. We make Assumption 1 throughout the paper:
Assumption 1.
- (i) The objective function is bounded below.
- (ii) are convex functions and have quadratic growth.
- (iii) is proper, l.s.c, and convex along generalized geodesics in , and
- (iv) There exists such that , for every
3.3 Optimality characterizations
3.4 Semi FB Euler: a general setting
We allow to be non-differentiable in some derivations, meaning that (convex subdifferential [54]) contains multiple elements in general. We first pick a selector of , i.e., , such that . By the axiom of choice (Zermelo, 1904, see, e.g., [37]), such selection always exists. However, an arbitrary selector can behave badly, e.g., not measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion).
Assumption 2 (Measurability).
The selector is Borel measurable.
We recall the semi FB scheme (4) but for nonsmooth as follows: start with an initial distribution , given a discretization stepsize , we repeat the following two steps:
4 Convergence analysis
4.1 Asymptotic analysis
Lemma 1 (Descent lemma).
Lem. 1 shows that the objective does not increase along semi FB Euler’s iterates. The proof of Lem. 1 is in Appx. A.3. Using Lem. 1, we establish asymptotic convergence for semi FB Euler as follows.
For the asymptotic convergence analysis, we need the following assumption on .
Assumption 3.
is continuously differentiable.
Theorem 1 (Asymptotic convergence).
The proof of Thm. 1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence ; rather, it guarantees subsequential convergence to critical points of .
Remark 1.
In the Euclidean space, the compactness assumption of the generated sequence is usually enforced via the coercivity assumption: whenever . A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology [43, Prop. 4.2], making coercivity not sufficient to induce (Wasserstein) compactness. For Thm. 1, we simply assume the sequence to be relatively compact.
4.2 Non-asymptotic analysis
To measure how fast the algorithm converges, we need a convergence measure. First, for proximal-type algorithms in Euclidean space, the notion of gradient mapping is usually used (see, e.g., [33, 55] and [38, Eq. (5)]) and we measure the rate . By analogy with the Euclidean case, we define the Wasserstein (sub)gradient mapping as follows , and we measure the rate of .
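As a reminder of the Euclidean notion, one common form of the gradient mapping for a composite objective $f + g$ (with $f$ differentiable and $g$ handled through its proximal map) is sketched below. The symbols $f$, $g$, $x_k$, $\gamma$, and $\mathcal{G}_\gamma$ are assumptions introduced here for illustration and are not notation from this paper.

```latex
\begin{equation*}
  x_{k+1} = \operatorname{prox}_{\gamma g}\big(x_k - \gamma \nabla f(x_k)\big),
  \qquad
  \mathcal{G}_\gamma(x_k) := \frac{x_k - x_{k+1}}{\gamma},
  \qquad
  \text{rate measured on } \min_{0 \le k \le K} \|\mathcal{G}_\gamma(x_k)\|^2 .
\end{equation*}
```

When $g = 0$ the gradient mapping reduces to $\nabla f(x_k)$, so it generalizes the gradient norm as a stationarity measure.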