Non-geodesically-convex optimization in the Wasserstein space

Hoang Phuc Hau Luu Hanlin Yu Bernardo Williams Petrus Mikkola Marcelo Hartmann Kai Puolamäki Arto Klami
Department of Computer Science, University of Helsinki

Abstract

We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits a difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler, whose convergence is, to our knowledge, still unknown in our very general non-geodesically-convex setting.

1 Introduction

Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can be considered as gradient descent optimization where a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. At the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation describing a diffusion over time of probability measures. In the seminal paper [39], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the functional relative entropy, a.k.a. Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. The discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 66, 30] seem natural and enjoy nice theoretical properties without discretization errors. Another line of research studies discretization of Wasserstein gradient flow by either quantifying the discretization error between the continuous-time flow and the discrete-time flow [39, 67, 27, 23, 28] or viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [65, 26, 70, 11] where the primary focus is on (geodesically) convex optimization problems.

Nonconvex, nonsmooth optimization is challenging, even in Euclidean space, quoting Rockafellar [63]: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.” The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [9] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) in sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary for the ULA (and its variants) with a specific class of densities (smooth). Theoretical understandings of other classes of algorithms and densities are needed.

We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization.

We study the following non-geodesically-convex optimization problem defined over the space $\mathcal{P}_{2}(X)$ of probability measures $\mu$ over $X=\mathbb{R}^{d}$ with finite second moment, i.e., $\int{\|x\|^{2}}d\mu(x)<+\infty$ ,

\min_{\mu\in\mathcal{P}_{2}(X)}\mathcal{F}(\mu):=\mathcal{E}_{F}(\mu)+\mathscr{H}(\mu):=\mathcal{E}_{G-H}(\mu)+\mathscr{H}(\mu)

(1)

where $F:X\to\mathbb{R}$ is a nonconvex function which can be represented as a difference of two convex functions $G$ and $H$ , $\mathcal{E}_{F}(\mu):=\int{F(x)}d\mu(x)$ is the potential energy, and $\mathscr{H}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ plays a role as the regularizer which is assumed to be a convex function along generalized geodesics.

Why difference-of-convex structure?

Nonconvexity lies at the difference-of-convex (DC) structure $F=G-H$ , where $G$ and $H$ are called the first and second DC components, respectively. $F$ being nonconvex implies $\mathcal{E}_{F}$ being non-geodesically-convex in general. First, the class of DC functions is very rich, and DC structures are present everywhere in real-world applications [60, 46, 47, 1, 22, 56, 58]. Weakly convex and Lipschitz smooth ( $L$ -smooth or simply smooth) functions are two subclasses of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [8]. We also remark that many nonconvex functions admit quite natural DC decompositions, for example, an $L$ -smooth function $F$ has the following splittings: $F(x)=\alpha\|x\|^{2}-(\alpha\|x\|^{2}-F(x))$ and $F(x)=(F(x)+\alpha\|x\|^{2})-\alpha\|x\|^{2}$ whenever $\alpha\geq L/2$ . Second, DC functions preserve enough structure to extend convex analysis. Such structure is key in classic DC programming [60] and in our Wasserstein space analysis with optimal transport tools.
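As a quick numerical sanity check of the first splitting above, the following sketch takes $F(x)=\cos(x)$ on $\mathbb{R}$, which is $1$-smooth (this choice of $F$, the grid, and the tolerance are assumptions made for illustration and do not come from the paper). With $\alpha=L/2$, it verifies on a grid that $G(x)=\alpha x^{2}$ and $H(x)=\alpha x^{2}-F(x)$ have nonnegative second differences, while $F$ itself does not.

```python
import numpy as np

def second_differences(f, xs):
    """Discrete second differences f(x-h) - 2 f(x) + f(x+h) on a uniform grid."""
    vals = f(xs)
    return vals[:-2] - 2.0 * vals[1:-1] + vals[2:]

L = 1.0          # Lipschitz constant of F' for F = cos (illustrative example)
alpha = L / 2.0  # smallest alpha allowed by the condition alpha >= L/2

F = np.cos
G = lambda x: alpha * x**2           # first DC component
H = lambda x: alpha * x**2 - F(x)    # second DC component, so that F = G - H

xs = np.linspace(-10.0, 10.0, 4001)
tol = 1e-12  # slack for floating-point rounding

# A function is convex on the grid iff all second differences are >= 0.
G_is_convex = bool(np.all(second_differences(G, xs) >= -tol))
H_is_convex = bool(np.all(second_differences(H, xs) >= -tol))
F_is_convex = bool(np.all(second_differences(F, xs) >= -tol))
```

A grid check of this kind is only evidence, not a proof; here the convexity of $H$ also follows directly from $H''(x)=2\alpha+\cos(x)\geq 0$.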

Context

Many problems in machine learning and sampling fall into the spectrum of problem (1); see, for example, the discussion in [65] that inspired our work. The regularizer $\mathscr{H}$ can be the internal energy [4, Sect. 10.4.3]. Under McCann’s condition, the internal energy is convex along generalized geodesics [4, Prop. 9.3.9]. In particular, the negative entropy, $\mathscr{H}(\mu)=\int\log(\mu(x))d\mu(x)$ if $\mu$ is absolutely continuous w.r.t. Lebesgue measure, $+\infty$ otherwise, is a special case of internal energy satisfying McCann’s condition. In the latter case, $\mathcal{F}(\mu)=\operatorname{D_{KL}}(\mu\|\mu^{*})+\text{const}$ where $\mu^{*}(x)\propto\exp(-F(x))$ , and the optimization problem reduces to a sampling problem with log-DC target distribution. In Bayesian inference, the posterior structure depends on both the prior and likelihood. If the likelihood is log-smooth, it exhibits the aforementioned DC splittings. Log-priors, often nonsmooth to capture sparsity or low rank, typically also have explicit DC structures [48, 32, 25]. In the context of infinitely wide one-layer neural networks and Maximum Mean Discrepancy [51, 7, 21], let $\mu^{*}$ be the optimal distribution over a network’s parameters and $k$ be a given kernel; the regularizer is then the interaction energy $\mathscr{H}(\mu)=\iint{k(x,y)}d\mu(x)d\mu(y)$ and $F(x)=-2\int{k(x,y)}d\mu^{*}(y).$ In general, $\mathscr{H}$ is not convex along generalized geodesics and $F$ is nonconvex but not necessarily DC. When the kernel has Lipschitz gradient, we can adjust both $\mathscr{H}$ and $F$ as $\mathscr{H}(\mu)=\iint{\left(k(x,y)+\alpha\|x\|^{2}+\alpha\|y\|^{2}\right)}d\mu(x)d\mu(y)$ and $F(x)=-2\int{k(x,y)}d\mu^{*}(y)-2\alpha\|x\|^{2}$ for some $\alpha>0$ , making $\mathscr{H}$ generalized geodesically convex and $F$ concave (hence DC); see Appx. A.2.

Our idea is to minimize (1) in the space of probability distributions by discretizing the gradient flow of $\mathcal{F}$ , leveraging the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In the previous work [70], this has been done with the Forward-Backward (FB) Euler discretization, but without convergence analysis. Recently, Salim et al. [65] analyzed FB Euler, but their results do not apply here because $F$ is nonconvex and possibly nonsmooth. Further leveraging the DC structure of $F$ and inspired by classical DC programming literature [60], we subtly modify FB Euler to obtain a scheme named semi FB Euler, which enjoys major theoretical advantages as we can provide a wide range of convergence analyses. Regarding the name, "semi" refers to the splitting of the potential energy. The scheme can nevertheless be reinterpreted as FB Euler.

Contributions

To our knowledge, no prior work studies problem (1) when $F$ is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler (4), leveraging the classic DC optimization proof template [60, 45, 46] with a substantial adaptation to Wasserstein geometry, and derive the following set of new insights:

Thm. 1

We show that if $H$ is continuously differentiable, every cluster point of the sequence of distributions $\{\mu_{n}\}_{n\in\mathbb{N}}$ generated by semi FB Euler is a critical point of $\mathcal{F}$ . Note that criticality is a notion from the DC programming literature [60] and it is a necessary condition for local optimality; see Sect. 3.3.
Thm. 2

We provide a convergence rate of $O(N^{-1})$ in terms of the Wasserstein (sub)gradient mapping in the general nonsmooth setting. The notion of gradient mapping [33, 38, 55] comes from the context of proximal algorithms in Euclidean space and is applicable to nonconvex programs, where the distance to a global solution is, in general, not possible to work out.
Thm. 3

Under the extra assumption that $H$ is twice continuously differentiable and has bounded Hessian, we provide a convergence rate of $O(N^{-\frac{1}{2}})$ in terms of the distance of $0$ to the Fréchet subdifferential of $\mathcal{F}$ . One can think of this as a convergence rate to Fréchet stationarity, i.e., if $\mu^{*}$ is a Fréchet stationary point of $\mathcal{F}$ , then, by definition, $0$ is in the Fréchet subdifferential of $\mathcal{F}$ at $\mu^{*}$ . Fréchet stationarity is a relatively sharp necessary condition for local optimality.
Thm. 4, 5

Under the assumptions of Thm. 3 and additionally $\mathcal{F}$ satisfying the Łojasiewicz-type inequality for some Łojasiewicz exponent $\theta\in[0,1)$ , we show that $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a Cauchy sequence under the Wasserstein topology, and thanks to the completeness of the Wasserstein space, the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ converges to some $\mu^{*}$ . We show that $\mu^{*}$ is in fact a global minimizer of $\mathcal{F}$ . Furthermore, we provide convergence rates of $\mu_{n}\to\mu^{*}$ in three different regimes ( $W_{2}$ denotes the Wasserstein metric): (1) if $\theta=0$ , $W_{2}(\mu_{n},\mu^{*})$ converges to $0$ after a finite number of steps; (2) if $\theta\in(0,1/2]$ , both $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})$ and $W_{2}(\mu_{n},\mu^{*})$ converge to $0$ exponentially fast; (3) if $\theta\in(1/2,1)$ , both $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})$ and $W_{2}(\mu_{n},\mu^{*})$ converge sublinearly to $0$ with rates $O\left(n^{-\frac{1}{2\theta-1}}\right)$ and $O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$ , respectively. When $\mathscr{H}$ is the negative entropy, $\mathcal{F}(\mu_{n})-\mathcal{F}(\mu^{*})=\operatorname{D_{KL}}(\mu_{n}\|\mu^{*})$ ; therefore, in the sampling context, we provide convergence guarantees in both Wasserstein and KL distances. See Sect. 4.3 for additional observations and implications.

2 Preliminaries

2.1 Notations and basic results in measure theory and functional analysis

We denote by $X=\mathbb{R}^{d}$ , $\mathcal{B}(X)$ the Borel $\sigma$ -algebra over $X$ , and $\mathscr{L}^{d}$ the Lebesgue measure on $X$ . $\mathcal{P}(X)$ is the set of Borel probability measures on $X$ . For $\mu\in\mathcal{P}(X)$ , we denote its second-order moment by $\mathfrak{m}_{2}(\mu):=\int_{X}{\|x\|^{2}}d\mu(x)$ , where $\mathfrak{m}_{2}(\mu)$ can be infinite. $\mathcal{P}_{2}(X)\subset\mathcal{P}(X)$ denotes the set of probability measures with finite second-order moment. $\mathcal{P}_{2,\operatorname{abs}}(X)\subset\mathcal{P}_{2}(X)$ is the set of measures that are absolutely continuous w.r.t. $\mathscr{L}^{d}$ . Here $\mu$ -a.e. stands for almost everywhere w.r.t. $\mu$ .

Let $C^{p}(X),C^{\infty}_{c}(X),C_{b}(X)$ be the classes of $p$ -times continuously differentiable functions, infinitely differentiable functions with compact support, and bounded continuous functions, respectively.

From functional analysis [20], for each $p\geq 1$ , $L^{p}(X,\mu)$ denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions $f$ such that $\int_{X}{|f(x)|^{p}}d\mu(x)<+\infty$ . We shall consider an element of $L^{p}(X,\mu)$ as an equivalence class of functions that agree $\mu$ -a.e. on $X$ rather than a single function. The norm of $f\in L^{p}(X,\mu)$ is $\|f\|_{L^{p}(X,\mu)}=(\int_{X}{|f(x)|^{p}}d\mu(x))^{1/p}$ . When $p=2$ , $L^{2}(X,\mu)$ is a Hilbert space with the inner product $\langle f,g\rangle_{L^{2}(X,\mu)}=\int_{X}{f(x)g(x)}d\mu(x)$ , which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by $L^{2}(X,X,\mu)$ the Hilbert space of maps $\xi:X\to X$ for which $\|\xi\|\in L^{2}(X,\mu)$ , with norm $\|\xi\|_{L^{2}(X,X,\mu)}:=(\int_{X}\|\xi(x)\|^{2}d\mu(x))^{1/2}$ .

We say that $f:X\to\mathbb{R}$ has quadratic growth if there exists $a>0$ such that $|f(x)|\leq a(\|x\|^{2}+1)$ for all $x\in X$ . It is clear that if $f$ has quadratic growth and $\mu\in\mathcal{P}_{2}(X)$ , then $f\in L^{1}(X,\mu).$

The pushforward of a measure $\mu\in\mathcal{P}(X)$ through a Borel map $T:X\to\mathbb{R}^{m}$ , denoted by $T_{\#}\mu$ , is defined by $(T_{\#}\mu)(A):=\mu(T^{-1}(A))$ for every Borel set $A\subset\mathbb{R}^{m}.$

2.2 Optimal transport [4, 5, 69, 68]

Given $\mu,\nu\in\mathcal{P}(X)$ , the principal problem in optimal transport is to find a transport map $T$ pushing $\mu$ to $\nu$ , i.e., $T_{\#}\mu=\nu$ , in the most cost-efficient way, i.e., minimizing $\|x-T(x)\|^{2}$ on $\mu$ -average. Monge’s formulation for this problem is $\inf_{T:T_{\#}\mu=\nu}\int_{X}{\|x-T(x)\|^{2}}d\mu(x)$ , where the optimal solution, if it exists, is denoted by $T_{\mu}^{\nu}$ and called the optimal (Monge) map. Monge’s problem can be ill-posed, e.g., no such $T_{\mu}^{\nu}$ exists when $\mu$ is a Dirac mass and $\nu$ is absolutely continuous [5].

By relaxing Monge’s formulation, Kantorovich considers $\min_{\gamma\in\Gamma(\mu,\nu)}\int_{X\times X}\|x-y\|^{2}d\gamma(x,y)$ , where $\Gamma(\mu,\nu)$ denotes the set of probability measures over $X\times X$ whose marginals are $\mu$ and $\nu$ , i.e., $\gamma\in\Gamma(\mu,\nu)$ iff ${\operatorname{proj}_{1}}_{\#}\gamma=\mu,{\operatorname{proj}_{2}}_{\#}\gamma=\nu$ where $\operatorname{proj}_{1},\operatorname{proj}_{2}$ are the projections onto the first $X$ space and the second $X$ space, respectively. Such $\gamma$ is called a plan. Kantorovich’s formulation is well-posed because $\Gamma(\mu,\nu)$ is non-empty (at least $\mu\times\nu\in\Gamma(\mu,\nu)$ ) and the $\operatorname*{arg\,min}$ element actually exists (see [5, Sect. 2.2]). The set of optimal plans between $\mu$ and $\nu$ is denoted by $\Gamma_{o}(\mu,\nu).$ In terms of random variables, any pair $(X,Y)$ where $X\sim\mu,Y\sim\nu$ is called a coupling of $\mu$ and $\nu$ , and it is called an optimal coupling if the joint law of $X$ and $Y$ is in $\Gamma_{o}(\mu,\nu)$ .

In $\mathcal{P}_{2}(X)$ , the $\min$ value in Kantorovich’s problem specifies a valid metric referred to as the Wasserstein distance, $W_{2}(\mu,\nu)=(\int_{X\times X}\|x-y\|^{2}d\gamma(x,y))^{1/2}$ for some, and thus all, $\gamma\in\Gamma_{o}(\mu,\nu)$ . The metric space $(\mathcal{P}_{2}(X),W_{2})$ is then called the Wasserstein space. In $\mathcal{P}_{2}(X)$ , besides the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence $\{\mu_{n}\}_{n\in\mathbb{N}}\subset\mathcal{P}_{2}(X)$ converges narrowly to $\mu\in\mathcal{P}_{2}(X)$ if $\int_{X}{\phi(x)}d\mu_{n}(x)\to\int_{X}{\phi(x)}d\mu(x)$ for all $\phi\in C_{b}(X).$ Convergence in the Wasserstein metric implies narrow convergence but the converse is not necessarily true. The extra condition to make it true is $\mathfrak{m}_{2}(\mu_{n})\to\mathfrak{m}_{2}(\mu)$ . We denote Wasserstein and narrow convergence by $\xrightarrow{\operatorname{\text{Wass}}}$ and $\xrightarrow{\operatorname{\text{narrow}}}$ , respectively.
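For intuition, the Wasserstein distance is easy to compute in one dimension: for two empirical measures with the same number of equally weighted atoms, pairing the sorted samples is an optimal coupling. The sketch below (the helper name `w2_empirical_1d`, the sample size, and the Gaussian parameters are illustrative assumptions) compares this estimate with the closed form $W_{2}^{2}=(m_{1}-m_{2})^{2}+(s_{1}-s_{2})^{2}$ for one-dimensional Gaussians $\mathcal{N}(m_{1},s_{1}^{2})$ and $\mathcal{N}(m_{2},s_{2}^{2})$.

```python
import numpy as np

def w2_empirical_1d(x, y):
    """W2 between two 1-D empirical measures with equally many, equally weighted atoms.

    In 1-D the monotone (sorted) pairing is an optimal coupling for the
    quadratic cost, so sorting both samples suffices.
    """
    x_sorted, y_sorted = np.sort(x), np.sort(y)
    return float(np.sqrt(np.mean((x_sorted - y_sorted) ** 2)))

rng = np.random.default_rng(0)
n = 200_000
m1, s1, m2, s2 = 0.0, 1.0, 3.0, 2.0      # illustrative Gaussian parameters
x = m1 + s1 * rng.standard_normal(n)      # samples from N(m1, s1^2)
y = m2 + s2 * rng.standard_normal(n)      # samples from N(m2, s2^2)

w2_estimate = w2_empirical_1d(x, y)
w2_exact = float(np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2))  # closed form for 1-D Gaussians
```

The sorting shortcut is specific to dimension one; in higher dimensions computing $W_{2}$ between empirical measures requires solving an assignment or linear program.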

If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X),\nu\in\mathcal{P}_{2}(X)$ , Monge’s formulation is well-posed and the unique ( $\mu$ -a.e.) solution exists, and in this case, it is safe to talk about (and use) the optimal transport map $T_{\mu}^{\nu}$ . Moreover, there exists some convex function $f$ such that $T_{\mu}^{\nu}=\nabla f$ $\mu$ -a.e. Kantorovich’s problem also has a unique solution $\gamma$ and it is given by $\gamma=(I,T_{\mu}^{\nu})_{\#}\mu$ where $I$ is the identity map. This is known as Brenier's theorem or the polar factorization theorem [18].
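To make the statement concrete, consider the one-dimensional Gaussian case (an illustrative example, with $m_{1},m_{2}\in\mathbb{R}$ and $s_{1},s_{2}>0$ introduced here): $\mu=\mathcal{N}(m_{1},s_{1}^{2})$ and $\nu=\mathcal{N}(m_{2},s_{2}^{2})$. The increasing affine map below pushes $\mu$ to $\nu$ and is the derivative of a convex function, so it is the optimal map, and its transport cost recovers the closed form of $W_{2}^{2}$.

```latex
\begin{aligned}
T_{\mu}^{\nu}(x) &= m_{2}+\frac{s_{2}}{s_{1}}(x-m_{1}) = f'(x),
\qquad f(x) = m_{2}x+\frac{s_{2}}{2s_{1}}(x-m_{1})^{2}\ \text{(convex since } s_{2}/s_{1}>0\text{)},\\
W_{2}^{2}(\mu,\nu) &= \int_{\mathbb{R}}|x-T_{\mu}^{\nu}(x)|^{2}\,d\mu(x)
= \int_{\mathbb{R}}\Bigl|\Bigl(1-\frac{s_{2}}{s_{1}}\Bigr)(x-m_{1})+(m_{1}-m_{2})\Bigr|^{2}d\mu(x)\\
&= (s_{1}-s_{2})^{2}+(m_{1}-m_{2})^{2}.
\end{aligned}
```

The cross term vanishes because $x-m_{1}$ has zero mean under $\mu$, and $\int(x-m_{1})^{2}d\mu(x)=s_{1}^{2}$.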

2.3 Subdifferential calculus in the Wasserstein space

Apart from being a metric space, $(\mathcal{P}_{2}(X),W_{2})$ also enjoys some pre-Riemannian structure making subdifferential calculus on it possible. Let us have a picture of a manifold in mind. Firstly, the tangent space [4] of $\mathcal{P}_{2}(X)$ at $\mu$ is $\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X):=\overline{\{\nabla\psi:\psi\in C_{c}^{\infty}(X)\}}^{L^{2}(X,X,\mu)}$ , where the closure is w.r.t. the $L^{2}(X,X,\mu)$ -topology. Intuitively, for $\psi\in C_{c}^{\infty}(X)$ , $I+\epsilon\nabla\psi$ is an optimal transport map if $\epsilon>0$ is small enough [43], so $\nabla\psi$ plays the role of a "tangent vector".

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ , we denote $\operatorname{dom}(\phi)=\{\mu\in\mathcal{P}_{2}(X):\phi(\mu)<+\infty\}$ . Let $\mu\in\operatorname{dom}(\phi)$ , we say that a map $\xi\in L^{2}(X,X,\mu)$ belongs to the Fréchet subdifferential [15, 43] $\partial_{F}^{-}\phi(\mu)$ if $\phi(\nu)-\phi(\mu)\geq\sup_{\gamma\in\Gamma_{o}(\mu,\nu)}\int_{X\times X}{\langle\xi(x),y-x\rangle}d\gamma(x,y)+o(W_{2}(\mu,\nu))$ for all $\nu\in\mathcal{P}_{2}(X)$ , where the little-o notation means $\lim_{s\to 0}{o(s)/s}=0.$ If $\partial_{F}^{-}\phi(\mu)\neq\emptyset$ , we say $\phi$ is Fréchet subdifferentiable at $\mu$ . We also denote $\operatorname{dom}(\partial_{F}^{-}\phi)=\{\mu\in\mathcal{P}_{2}(X):\partial_{F}^{-}\phi(\mu)\neq\emptyset\}$ .

Similarly, we say that $\xi\in L^{2}(X,X,\mu)$ belongs to the (Fréchet) superdifferential $\partial_{F}^{+}\phi(\mu)$ of $\phi$ at $\mu$ if $-\xi\in\partial_{F}^{-}(-\phi)(\mu)$ . In other words, $\partial_{F}^{-}(-\phi)(\mu)=-\partial_{F}^{+}\phi(\mu).$

We say $\phi$ is Wasserstein differentiable [15, 43] at $\mu\in\operatorname{dom}(\phi)$ if $\partial_{F}^{-}\phi(\mu)\cap\partial_{F}^{+}\phi(\mu)\neq\emptyset$ . We call an element of the intersection, denoted by $\nabla_{W}\phi(\mu)$ , a Wasserstein gradient of $\phi$ at $\mu$ , and it holds $\phi(\nu)-\phi(\mu)=\int_{X\times X}{\langle\nabla_{W}\phi(\mu)(x),y-x\rangle}d\gamma(x,y)+o(W_{2}(\mu,\nu))$ , for all $\nu\in\mathcal{P}_{2}(X)$ and any $\gamma\in\Gamma_{o}(\mu,\nu).$ The Wasserstein gradient is not unique in general, but its parallel component in $\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X)$ is unique, and this parallel component is again a valid Wasserstein gradient as the orthogonal component plays no role in the above definitions, i.e., if $\xi^{\perp}\in\operatorname{Tan}_{\mu}\mathcal{P}_{2}(X)^{\perp}$ , it holds $\int_{X\times X}\langle\xi^{\perp}(x),y-x\rangle d\gamma(x,y)=0$ for any $\nu\in\mathcal{P}_{2}(X)$ and $\gamma\in\Gamma_{o}(\mu,\nu)$ [43, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of $\phi$ at $\mu$ .

2.4 Optimization in the Wasserstein space

A function $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is called proper if $\operatorname{dom}(\phi)\neq\emptyset$ , while it is called lower semicontinuous (l.s.c) if for any sequence $\mu_{n}\xrightarrow{\operatorname{\text{Wass}}}\mu$ , it holds $\liminf_{n}\phi(\mu_{n})\geq\phi(\mu)$ .

We next recall (a simplified version of) generalized geodesic convexity.

Definition 1.

[65] Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ . We say $\phi$ is convex along generalized geodesics if $\forall\mu,\pi\in\mathcal{P}_{2}(X)$ , $\forall\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ , $\phi((tT_{\nu}^{\mu}+(1-t)T_{\nu}^{\pi})_{\#}\nu)\leq t\phi(\mu)+(1-t)\phi(\pi)$ , $\forall t\in[0,1]$ .

The curve $t\mapsto(tT_{\nu}^{\mu}+(1-t)T_{\nu}^{\pi})_{\#}\nu$ (called a generalized geodesic) interpolates from $\pi$ to $\mu$ as $t$ runs from $0$ to $1$ . The definition says that $\phi$ is convex along these curves. If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\nu=\mu$ , the curve is a geodesic in $(\mathcal{P}_{2}(X),W_{2})$ . If the definition is relaxed to the class of geodesics only, we say that $\phi$ is convex along geodesics.

An important characterization of the Fréchet subdifferential of a geodesically convex function is that we can drop the little-o term in its definition in Sect. 2.3 [4, Sect. 10.1.1]. As a convention, for a geodesically convex function $\phi$ , the Fréchet subdifferential $\partial_{F}^{-}$ will be simply written as $\partial$ .

First-order optimality conditions

Let $\phi:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. $\mu^{*}\in\mathcal{P}_{2}(X)$ is a global minimizer of $\phi$ if $\phi(\mu^{*})\leq\phi(\mu),\forall\mu\in\mathcal{P}_{2}(X).$ For local optimality, we shall use the Wasserstein metric to define neighborhoods. $\mu^{*}\in\mathcal{P}_{2}(X)$ is a local minimizer if there exists $r>0$ such that $\phi(\mu^{*})\leq\phi(\mu)$ for all $\mu:W_{2}(\mu,\mu^{*})<r.$ We shall denote $B(\mu^{*},r):=\{\mu\in\mathcal{P}_{2}(X):W_{2}(\mu,\mu^{*})<r\}$ the (open) Wasserstein ball centered at $\mu^{*}$ with radius $r$ . If we replace $<$ by $\leq$ we obtain the notion of a closed Wasserstein ball.

We call $\mu^{*}$ a Fréchet stationary point of $\phi$ if $0\in\partial_{F}^{-}\phi(\mu^{*}).$ Fréchet stationarity is a necessary condition for local optimality. In other words, if $\mu^{*}$ is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if $\phi$ is Wasserstein differentiable at $\mu^{*}$ , $\nabla_{W}\phi(\mu^{*})(x)=0$ $\mu^{*}$ -a.e. [43]. When $\phi$ is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix).

3 Semi Forward-Backward Euler for difference-of-convex structures

3.1 Wasserstein gradient flows: different types of discretizations

To neatly present the idea of minimizing $\mathcal{F}$ via discretized gradient flow, we first assume for a moment that $F$ is infinitely differentiable and $\mathscr{H}$ is the negative entropy. See also a discussion in [65].

We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of $\mathcal{F}$ , where the gradient flow is defined (under some technical assumptions [39]) as the limit $\eta\to 0^{+}$ of the following scheme with some simple time-interpolation

\displaystyle\mu_{n+1}\in\operatorname{JKO}_{\eta\mathcal{F}}(\mu_{n}),\text{ where }\operatorname{JKO}_{\eta\mathcal{F}}(\mu):=\operatorname*{arg\,min}_{\nu\in\mathcal{P}_{2}(X)}\mathcal{F}(\nu)+\dfrac{1}{2\eta}W_{2}^{2}(\mu,\nu).

(2)

Straightforwardly, given a fixed $\eta>0$ , (2) gives back a discretization for this flow known as Backward Euler. On the other hand, if $\mathcal{F}$ is Wasserstein differentiable (Sect. 2.3), the Forward Euler discretization reads [70] $\mu_{n+1}=(I-\eta\nabla_{W}\mathcal{F}(\mu_{n}))_{\#}\mu_{n}$ , which is interpreted as gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function $\mathcal{F}$ itself. However, the composite structure of $\mathcal{F}$ (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), which first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [70]: $\nu_{n+1}=(I-\eta\nabla F)_{\#}\mu_{n},\text{ and }\mu_{n+1}=\mathcal{N}(0,2\eta I)*\nu_{n+1}$ , where $*$ is the convolution. This is the ULA "viewed" in the space of distributions (Eulerian approach); a more familiar and equivalent form from the particle perspective (Lagrangian approach) reads $x_{n+1}=x_{n}-\eta\nabla F(x_{n})+\sqrt{2\eta}z_{n}$ where $z_{n}\sim\mathcal{N}(0,I)$ . The ULA is known to be asymptotically biased even for a Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hastings accept-reject step [62] is sometimes introduced. The Metropolis-Hastings algorithm [52, 36] is a much more general framework that works with almost any proposal (e.g., a random walk) and whose convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain. The Metropolis-Hastings algorithm is therefore outside our scope.
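The asymptotic bias of the ULA for a Gaussian target can be seen in a few lines. The sketch below assumes the one-dimensional potential $F(x)=x^{2}/2$ (target $\mathcal{N}(0,1)$); the step size, initial law, and particle count are illustrative choices. In the Eulerian view the law stays a centered Gaussian, so only its variance $v$ needs tracking: the push-forward step multiplies $v$ by $(1-\eta)^{2}$ and the convolution adds $2\eta$, whose fixed point is $1/(1-\eta/2)\neq 1$. The Lagrangian (particle) recursion is run alongside for comparison.

```python
import numpy as np

eta = 0.1       # step size (illustrative)
n_steps = 200

# Eulerian view: mu_n = N(0, v_n); update the variance in closed form.
v = 4.0                               # variance of the initial law N(0, 4)
for _ in range(n_steps):
    v = (1.0 - eta) ** 2 * v          # nu = (I - eta * F')_# mu, with F'(x) = x
    v = v + 2.0 * eta                 # mu = N(0, 2 * eta) * nu (convolution)

# Fixed point of v -> (1 - eta)^2 v + 2 eta; the target variance would be 1.
v_limit = 1.0 / (1.0 - eta / 2.0)

# Lagrangian view: x_{n+1} = x_n - eta * F'(x_n) + sqrt(2 eta) * z_n.
rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal(100_000)    # particles drawn from N(0, 4)
for _ in range(n_steps):
    x = x - eta * x + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)
v_particles = float(np.var(x))
```

Both views settle at a variance strictly larger than the target variance $1$, and the gap shrinks as $\eta\to 0$.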

In optimization, for composite structure, Forward-Backward (FB) Euler and its variants are methods of choice [59, 10]. The corresponding FB Euler for $\mathcal{F}$ will take the gradient step (forward) according to the potential, and JKO step (backward) w.r.t. the negative entropy

\displaystyle\text{(FB Euler)}\quad\nu_{n+1}=(I-\eta\nabla F)_{\#}\mu_{n},\text{ and }\mu_{n+1}\in\operatorname{JKO}_{\eta\mathscr{H}}(\nu_{n+1}).

(3)

This scheme appears in [70] without convergence analysis; later, [65] derives non-asymptotic convergence guarantees under the assumption that $F$ is convex and Lipschitz smooth.

In this work, as $F$ is nonconvex and nonsmooth, the theory in [65] does not apply, and the convergence (if any) of (3) remains mysterious. The DC structure of $F$ can be further exploited. In DC programming [60], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler

\displaystyle\text{(semi FB Euler)}\quad\nu_{n+1}=(I+\eta\nabla H)_{\#}\mu_{n},\text{ and }\mu_{n+1}\in\operatorname{JKO}_{\eta(\mathscr{H}+\mathcal{E}_{G})}(\nu_{n+1})

(4)

for which we can provide convergence guarantees. Apparently, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on $\mathcal{E}_{G-H}=\mathcal{E}_{G}-\mathcal{E}_{H}$ and backward on $\mathscr{H}$ , semi FB Euler does forward on $-\mathcal{E}_{H}$ and backward on $\mathscr{H}+\mathcal{E}_{G}$ ; recall that $\mathcal{F}=\mathcal{E}_{G}-\mathcal{E}_{H}+\mathscr{H}$ .

Theoretically, semi FB Euler enjoys some advantages compared to FB Euler. Thanks to Brenier's theorem (Sect. 2.2), the push-forward step in semi FB Euler is optimal since $H$ is convex; meanwhile, the push-forward step in FB Euler is in general not optimal, and the corresponding optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when $F$ is (DC) differentiable. In contrast, we can provide a solid theoretical guarantee for semi FB Euler, especially when $H$ is differentiable. Additionally, we also offer convergence guarantees when $H$ is nonsmooth.
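A minimal one-dimensional sketch of scheme (4) follows, under assumptions that are ours and not the paper's: $G(x)=ax^{2}/2$ and $H(x)=bx^{2}/2$ with $a>b>0$, $\mathscr{H}$ the negative entropy, Gaussian iterates $\mathcal{N}(m,s^{2})$, and the JKO minimization restricted to the Gaussian family. With these choices the push-forward is the linear map $x\mapsto(1+\eta b)x$, and the restricted JKO step minimizes $\frac{a}{2}(m'^{2}+s'^{2})-\log s'+\frac{1}{2\eta}\bigl((m'-m_{\nu})^{2}+(s'-s_{\nu})^{2}\bigr)$ over $(m',s')$, which has a closed-form solution. The function names are hypothetical.

```python
import math

def semi_fb_euler_step(m, s, a, b, eta):
    """One semi FB Euler step on N(m, s^2) for G = a x^2 / 2, H = b x^2 / 2.

    Assumption of this sketch: the JKO step is minimized over Gaussians only.
    """
    # Forward step: nu = (I + eta * H')_# mu, i.e. x -> (1 + eta * b) x.
    m_nu, s_nu = (1.0 + eta * b) * m, (1.0 + eta * b) * s
    # Backward (JKO) step on E_G + negative entropy, restricted to N(m', s'^2).
    c = 1.0 + eta * a
    m_new = m_nu / c                                   # stationarity in m'
    # Stationarity in s': c * s'^2 - s_nu * s' - eta = 0, positive root.
    s_new = (s_nu + math.sqrt(s_nu ** 2 + 4.0 * eta * c)) / (2.0 * c)
    return m_new, s_new

def objective(m, s, a, b):
    """F(N(m, s^2)) = E_{G-H} + negative entropy, in closed form."""
    return 0.5 * (a - b) * (m * m + s * s) - math.log(s) - 0.5 * math.log(2.0 * math.pi * math.e)

a, b, eta = 2.0, 1.0, 0.1      # illustrative constants with a > b
m, s = 2.0, 0.5                # initial law N(2, 0.25)
values = [objective(m, s, a, b)]
for _ in range(400):
    m, s = semi_fb_euler_step(m, s, a, b, eta)
    values.append(objective(m, s, a, b))
```

In this toy setting the iterates approach mean $0$ and variance $1/(a-b)$, which is exactly the target $\mu^{*}\propto\exp(-(a-b)x^{2}/2)$, and the recorded objective values are nonincreasing.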

3.2 Problem setting

Our goal is to minimize the non-geodesically-convex functional $\mathcal{F}(\mu)=\mathcal{E}_{F}(\mu)+\mathscr{H}(\mu)$ over $\mathcal{P}_{2}(X)$ , where $F=G-H$ is a DC function. We make Assumption 1 throughout the paper:

Assumption 1.

(i)

The objective function $\mathcal{F}$ is bounded below.
(ii)

$G,H:X\to\mathbb{R}$ are convex functions and have quadratic growth.
(iii)

$\mathscr{H}:\mathcal{P}_{2}(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c, and convex along generalized geodesics in $(\mathcal{P}_{2}(X),W_{2})$ , and $\operatorname{dom}(\mathscr{H})\subset\mathcal{P}_{2,\operatorname{abs}}(X).$
(iv)

There exists $\eta_{0}>0$ such that $\forall\eta\in(0,\eta_{0})$ , $\operatorname{JKO}_{\eta(\mathcal{E}_{G}+\mathscr{H})}(\mu)\neq\emptyset$ for every $\mu\in\mathcal{P}_{2}(X).$

Note that Assumption 1(iv) is a commonly used assumption to avoid technical complications when working with the JKO operator [4, 15, 65]. Assumption 1(ii) implies that $\mathcal{E}_{G}$ and $\mathcal{E}_{H}$ are continuous w.r.t. the Wasserstein topology [3, Prop. 2.4] ( $G,H$ are continuous [54, Cor. 2.27] and have quadratic growth).

3.3 Optimality characterizations

First, it follows from Assumption 1(iii) that $\operatorname{dom}(\mathcal{F})\subset\mathcal{P}_{2,\operatorname{abs}}(X).$ By analogy to DC programming in Euclidean space, we call $\mu^{*}\in\operatorname{dom}(\mathcal{F})$ a critical point of $\mathcal{F}=\mathscr{H}+\mathcal{E}_{G}-\mathcal{E}_{H}$ if $\partial(\mathscr{H}+\mathcal{E}_{G})(\mu^{*})\cap\partial\mathcal{E}_{H}(\mu^{*})\neq\emptyset.$ Criticality is a necessary condition for local optimality (Lem. 7). Moreover, if $\mathcal{E}_{H}$ is Wasserstein differentiable at $\mu^{*}$ , criticality becomes Fréchet stationarity (Lem. 8).

3.4 Semi FB Euler: a general setting

We allow $H$ to be non-differentiable in some derivations, meaning that $\partial H$ (convex subdifferential [54]) contains multiple elements in general. We first pick a selector $S$ of $\partial H$ , i.e., $S:X\to X$ , such that $S(x)\in\partial H(x)$ . By the axiom of choice (Zermelo, 1904, see, e.g., [37]), such selection always exists. However, an arbitrary selector can behave badly, e.g., not measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion).

Assumption 2 (Measurability).

The selector $S$ is Borel measurable.
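As a simple instance of a Borel measurable selector, take $H(x)=\|x\|_{1}$ on $\mathbb{R}^{d}$ (an example chosen here for illustration): the coordinatewise sign map $S(x)=\operatorname{sign}(x)$, with $\operatorname{sign}(0)=0$, is piecewise constant and satisfies $S(x)\in\partial H(x)$. The sketch below checks the subgradient inequality $H(y)\geq H(x)+\langle S(x),y-x\rangle$ on random pairs and applies the push-forward map $I+\eta S$ to a cloud of samples.

```python
import numpy as np

def H(x):
    """Convex, nonsmooth example: the l1 norm (applied row-wise)."""
    return np.sum(np.abs(x), axis=-1)

def S(x):
    """Selector of the subdifferential of H: coordinatewise sign, sign(0) = 0."""
    return np.sign(x)

rng = np.random.default_rng(0)
x = rng.standard_normal((1000, 3))
y = rng.standard_normal((1000, 3))
x[0] = 0.0  # include the kink, where the subdifferential is the cube [-1, 1]^3

# Subgradient inequality gap: H(y) - H(x) - <S(x), y - x> should be >= 0.
gap = H(y) - H(x) - np.sum(S(x) * (y - x), axis=-1)
min_gap = float(gap.min())

# Push-forward step on samples: each particle moves by eta * S(x).
eta = 0.1
pushed = x + eta * S(x)
```

On samples, the push-forward step moves every nonzero coordinate away from the origin by $\eta$ and leaves zero coordinates in place.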

We recall the semi FB scheme (4) but for nonsmooth $F$ as follows: start with an initial distribution $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ , given a discretization stepsize $0<\eta<\eta_{0}$ , we repeat the following two steps:

\displaystyle\nu_{n+1}=(I+\eta S)_{\#}\mu_{n}\quad\triangleleft\text{ push forward step;}\quad\mu_{n+1}=\operatorname{JKO}_{\eta(\mathcal{E}_{G}+\mathscr{H})}(\nu_{n+1})\quad\triangleleft\text{ JKO step}.

Well-definedness and properties: Given $\mu_{n}\in\mathcal{P}_{2}(X)$ , it follows from Lem. 4 that $\nu_{n+1}\in\mathcal{P}_{2}(X)$ . The two generated sequences are then in $\mathcal{P}_{2}(X)$ . Moreover, it follows from Assumption 1 that $\{\mu_{n}\}_{n\in\mathbb{N}}$ is in $\mathcal{P}_{2,\operatorname{abs}}(X)$ , and so is $\{\nu_{n}\}_{n\in\mathbb{N}}$ by Lem. 9, noting that $I+\eta S$ is a subgradient selector of the strongly convex function $x\mapsto(1/2)\|x\|^{2}+\eta H(x)$ .

4 Convergence analysis

4.1 Asymptotic analysis

Lemma 1 (Descent lemma).

Under Assumptions 1 and 2, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_{0}$ . Then it holds $\mathcal{F}(\mu_{n+1})\leq\mathcal{F}(\mu_{n})-\frac{1}{\eta}\int_{X}{\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}}d\nu_{n+1}(x),\quad\forall n\in\mathbb{N}$ .

Lem. 1 shows that the objective does not increase along semi FB Euler’s iterates. Proof of Lem. 1 is in Appx. A.3. By using Lem. 1, we establish asymptotic convergence for semi FB Euler as follows.
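One immediate consequence of Lem. 1 can be spelled out by telescoping. The following is a sketch that uses only the inequality as stated, Assumption 1(i), and the extra assumption $\mathcal{F}(\mu_{0})<+\infty$; the shorthand $\Delta_{n}$ is introduced here for illustration.

```latex
\begin{aligned}
\Delta_{n} &:= \int_{X}\|T_{\nu_{n+1}}^{\mu_{n}}(x)-T_{\nu_{n+1}}^{\mu_{n+1}}(x)\|^{2}\,d\nu_{n+1}(x),\\
\frac{1}{\eta}\sum_{n=0}^{N-1}\Delta_{n}
&\le \sum_{n=0}^{N-1}\bigl(\mathcal{F}(\mu_{n})-\mathcal{F}(\mu_{n+1})\bigr)
= \mathcal{F}(\mu_{0})-\mathcal{F}(\mu_{N})
\le \mathcal{F}(\mu_{0})-\inf\mathcal{F},\\
\min_{0\le n\le N-1}\Delta_{n} &\le \frac{\eta\,\bigl(\mathcal{F}(\mu_{0})-\inf\mathcal{F}\bigr)}{N},
\qquad
W_{2}^{2}(\mu_{n},\mu_{n+1})\le\Delta_{n}.
\end{aligned}
```

The last inequality holds because $(T_{\nu_{n+1}}^{\mu_{n}},T_{\nu_{n+1}}^{\mu_{n+1}})_{\#}\nu_{n+1}$ is a (not necessarily optimal) plan between $\mu_{n}$ and $\mu_{n+1}$. In particular, $\sum_{n}W_{2}^{2}(\mu_{n},\mu_{n+1})<+\infty$, so consecutive iterates get arbitrarily close.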

For the asymptotic convergence analysis, we need the following assumption on $H$ .

Assumption 3.

$H$ is continuously differentiable.

Theorem 1 (Asymptotic convergence).

Under Assumptions 1, 3, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ and $\{\nu_{n}\}_{n\in\mathbb{N}}$ be sequences produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_{0}$ . If $\{\mu_{n}\}_{n\in\mathbb{N}}$ is relatively compact with respect to the Wasserstein topology and $\sup_{n\in\mathbb{N}}\mathscr{H}(\nu_{n})<+\infty$ , then every cluster point of $\{\mu_{n}\}_{n\in\mathbb{N}}$ is a critical point of $\mathcal{F}$ .

The proof of Thm. 1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ ; rather, it guarantees subsequential convergence to critical points of $\mathcal{F}$ .

Remark 1.

In the Euclidean space, the compactness assumption of the generated sequence is usually enforced via the coercivity assumption: $f(x)\to+\infty$ whenever $\|x\|\to+\infty$ . A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology [43, Prop. 4.2], making coercivity not sufficient to induce (Wasserstein) compactness. For Thm. 1, we simply assume the sequence $\{\mu_{n}\}_{n\in\mathbb{N}}$ to be relatively compact.

4.2 Non-asymptotic analysis

To quantify how fast the algorithm converges, we need a convergence measure. For proximal-type algorithms in Euclidean space, the notion of gradient mapping $\mathcal{G}_{\eta}(x_{n})$ is usually used (see, e.g., [33, 55] and [38, Eq. (5)]) and we measure the rate at which $\|\mathcal{G}_{\eta}(x_{n})\|^{2}\to 0$ . By analogy with the Euclidean setting, we define the Wasserstein (sub)gradient mapping as $\mathcal{G}_{\eta}(\mu):=\frac{1}{\eta}\left(I-T_{\mu}^{\operatorname{JKO}_{\eta(\mathcal{E}_{G}+\mathscr{H})}((I+\eta S)_{\#}\mu)}\right)$ , and we measure the rate at which $\|\mathcal{G}_{\eta}(\mu_{n})\|^{2}_{L^{2}(X,X,\mu_{n})}\to 0$ .
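For orientation, here is a sketch of the Euclidean counterpart: the proximal DC iteration $x_{n+1}=\operatorname{prox}_{\eta G}(x_{n}+\eta\nabla H(x_{n}))$ with gradient mapping $\mathcal{G}_{\eta}(x_{n})=(x_{n}-x_{n+1})/\eta$. The toy function $F(x)=x^{2}-3\sqrt{1+x^{2}}$ on $\mathbb{R}$, with $G(x)=x^{2}$ and $H(x)=3\sqrt{1+x^{2}}$, is an assumption made for this illustration; it is a nonconvex double well with minimizers at $\pm\sqrt{5}/2$. A standard argument gives $F(x_{n+1})\leq F(x_{n})-\eta\|\mathcal{G}_{\eta}(x_{n})\|^{2}$ in this Euclidean setting, hence $\min_{n<N}\|\mathcal{G}_{\eta}(x_{n})\|^{2}\leq(F(x_{0})-\inf F)/(\eta N)$.

```python
import math

eta = 0.5                                              # step size (illustrative)
G = lambda x: x * x                                    # convex part
H = lambda x: 3.0 * math.sqrt(1.0 + x * x)             # convex part entering with a minus sign
dH = lambda x: 3.0 * x / math.sqrt(1.0 + x * x)        # gradient of H
F = lambda x: G(x) - H(x)                              # nonconvex DC objective

def prox_G(y, eta):
    """argmin_x G(x) + (x - y)^2 / (2 eta) for G(x) = x^2, in closed form."""
    return y / (1.0 + 2.0 * eta)

N = 100
x = 0.3
f_values = [F(x)]
sq_grad_maps = []
for _ in range(N):
    x_next = prox_G(x + eta * dH(x), eta)              # forward on -H, backward on G
    sq_grad_maps.append(((x - x_next) / eta) ** 2)     # squared gradient mapping at x_n
    x = x_next
    f_values.append(F(x))

x_star = math.sqrt(5.0) / 2.0                          # minimizer in the right-hand well
rate_bound = (f_values[0] - F(x_star)) / (eta * N)     # (F(x_0) - inf F) / (eta N)
```

The design mirrors semi FB Euler: the explicit step uses only the concave part $-H$, and the implicit (proximal) step absorbs the convex part $G$.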

Theorem 2 (Convergence rate: Wasserstein (sub)gradient mapping).

Under Assumptions 1, 2, let $\{\mu_{n}\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_{0}\in\mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_{0}$ . Then it holds $\min_{n=\overline{1,N}}\|\mathcal{G}_{\eta}(\mu_{n})\|^{2}_{L^{2}(X,X,\mu_{n})}=O(N^{-1})$ .