Non-geodesically-convex optimization in the Wasserstein space

Hoang Phuc Hau Luu    Hanlin Yu    Bernardo Williams    Petrus Mikkola    Marcelo Hartmann    Kai Puolamäki    Arto Klami   
Department of Computer Science, University of Helsinki
Abstract

We study a class of optimization problems in the Wasserstein space (the space of probability measures) where the objective function is nonconvex along generalized geodesics. Specifically, the objective exhibits some difference-of-convex structure along these geodesics. The setting also encompasses sampling problems where the logarithm of the target distribution is difference-of-convex. We derive multiple convergence insights for a novel semi Forward-Backward Euler scheme under several nonconvex (and possibly nonsmooth) regimes. Notably, the semi Forward-Backward Euler is just a slight modification of the Forward-Backward Euler whose convergence is—to our knowledge—still unknown in our very general non-geodesically-convex setting.

1 Introduction

Sampling and optimization are intertwined. For example, the (overdamped) Langevin dynamics, typically considered a sampling algorithm, can be considered as gradient descent optimization where a suitable amount of Gaussian noise is injected at each step. There are also deeper connections. At the limit of infinitesimal stepsize, the law of the Langevin dynamics is governed by the Fokker-Planck equation describing a diffusion over time of probability measures. In the seminal paper [39], Jordan, Kinderlehrer, and Otto reinterpreted the Fokker-Planck equation as the gradient flow of the functional relative entropy, a.k.a. Kullback-Leibler (KL) divergence, in the (Wasserstein) space of finite second-moment probability measures equipped with the Wasserstein metric. The discovery connects the two fields and encourages optimization in the Wasserstein space, even conceptually, as it directly gives insight into the sampling context. Studies in continuous-time dynamics [21, 12, 66, 30] seem natural and enjoy nice theoretical properties without discretization errors. Another line of research studies discretization of Wasserstein gradient flow by either quantifying the discretization error between the continuous-time flow and the discrete-time flow [39, 67, 27, 23, 28] or viewing discrete-time flows as iterative optimization schemes in the Wasserstein space [65, 26, 70, 11] where the primary focus is on (geodesically) convex optimization problems.
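The remark that Langevin dynamics is gradient descent plus injected Gaussian noise can be made concrete with a minimal sketch of the unadjusted Langevin algorithm. The target (a standard Gaussian, i.e. potential $F(x)=\|x\|^2/2$), the stepsize, the particle count, and the function name `ula` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ula(grad_f, x0, step, n_steps, rng):
    """Unadjusted Langevin algorithm for a target density proportional to exp(-F).

    Each iteration is a gradient descent step on F followed by the
    injection of Gaussian noise with variance 2 * step.
    """
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
    return x

# Illustrative target: standard Gaussian, F(x) = ||x||^2 / 2, so grad F(x) = x.
rng = np.random.default_rng(0)
particles = ula(lambda x: x, np.zeros((5000, 1)), step=0.05, n_steps=400, rng=rng)

sample_mean = float(particles.mean())
sample_var = float(particles.var())   # close to 1, up to discretization bias
```

Dropping the noise term recovers plain gradient descent on $F$. With a fixed stepsize the particles settle near the target but with a small bias, which is the discretization error discussed above.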

Nonconvex, nonsmooth optimization is challenging, even in Euclidean space, quoting Rockafellar [63]: “In fact the great watershed in optimization isn’t between linearity and nonlinearity, but convexity and nonconvexity.” The landscape of nonconvex problems is mostly underexplored in the Wasserstein space. In the sampling language, it amounts to sampling from a non-log-concave and possibly non-log-Lipschitz-smooth target distribution. Recently, Balasubramanian et al. [9] advocated the need for a sound theory for non-log-concave sampling and provided some guarantees for the unadjusted Langevin algorithm (ULA) in sampling from log-smooth (Lipschitz/Hölder smooth) densities. These results are preliminary for the ULA (and its variants) with a specific class of densities (smooth). Theoretical understandings of other classes of algorithms and densities are needed.

We approach the subject through the lens of nonconvex optimization in the space of probability distributions and pose discretized Wasserstein gradient flows as iterative minimization algorithms. This allows us to, on the one hand, use and extend tools from classical nonconvex optimization and, on the other hand, derive more connections between sampling and optimization.

We study the following non-geodesically-convex optimization problem defined over the space $\mathcal{P}_2(X)$ of probability measures $\mu$ over $X=\mathbb{R}^d$ with finite second moment, i.e., $\int \|x\|^2 d\mu(x) < +\infty$,

$$\min_{\mu\in\mathcal{P}_2(X)} \mathcal{F}(\mu) := \mathcal{E}_F(\mu) + \mathscr{H}(\mu) := \mathcal{E}_{G-H}(\mu) + \mathscr{H}(\mu) \tag{1}$$

where $F: X\to\mathbb{R}$ is a nonconvex function which can be represented as a difference of two convex functions $G$ and $H$, $\mathcal{E}_F(\mu) := \int F(x) d\mu(x)$ is the potential energy, and $\mathscr{H}: \mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ plays the role of a regularizer and is assumed to be convex along generalized geodesics.

Why difference-of-convex structure?

Nonconvexity lies in the difference-of-convex (DC) structure $F = G - H$, where $G$ and $H$ are called the first and second DC components, respectively. $F$ being nonconvex implies that $\mathcal{E}_F$ is non-geodesically-convex in general. First, the class of DC functions is very rich, and DC structures arise in many real-world applications [60, 46, 47, 1, 22, 56, 58]. Weakly convex and Lipschitz smooth ($L$-smooth or simply smooth) functions are two subclasses of DC functions. Furthermore, any continuous function can be approximated by a sequence of DC functions over a compact, convex domain [8]. We also remark that many nonconvex functions admit quite natural DC decompositions; for example, an $L$-smooth function $F$ has the following splittings: $F(x) = \alpha\|x\|^2 - (\alpha\|x\|^2 - F(x))$ and $F(x) = (F(x) + \alpha\|x\|^2) - \alpha\|x\|^2$ whenever $\alpha \geq L/2$. Second, DC functions preserve enough structure to extend convex analysis. Such structure is key in classic DC programming [60] and in our Wasserstein space analysis with optimal transport tools.
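The two splittings of an $L$-smooth function can be checked numerically on a toy case. The sketch below assumes the one-dimensional example $F(x)=\cos x$, which is $1$-smooth because $|F''|\le 1$, and takes $\alpha = L/2$. The helper `is_convex_on_grid` is a hypothetical finite-difference check on a grid, not a proof of convexity.

```python
import numpy as np

def is_convex_on_grid(values, tol=1e-9):
    """Check that all second differences on a uniform grid are nonnegative."""
    second_diff = values[2:] - 2.0 * values[1:-1] + values[:-2]
    return bool(np.all(second_diff >= -tol))

L = 1.0                       # smoothness constant of F(x) = cos(x)
alpha = L / 2.0               # smallest alpha allowed by the condition alpha >= L/2
x = np.linspace(-10.0, 10.0, 2001)
F = np.cos(x)
quad = alpha * x**2

F_is_convex = is_convex_on_grid(F)                 # F itself is nonconvex
first_splitting_ok = is_convex_on_grid(quad - F)   # second component of F = quad - (quad - F)
second_splitting_ok = is_convex_on_grid(F + quad)  # first component of F = (F + quad) - quad
```

In both splittings the remaining component is the convex quadratic $\alpha\|x\|^2$, so each writes $F$ as a difference of two convex functions.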

Context

Many problems in machine learning and sampling fall into the spectrum of problem (1). For example, refer to a discussion in [65] that inspired our work. The regularizer $\mathscr{H}$ can be the internal energy [4, Sect. 10.4.3]. Under McCann’s condition, the internal energy is convex along generalized geodesics [4, Prop. 9.3.9]. In particular, the negative entropy, $\mathscr{H}(\mu) = \int \log(\mu(x)) d\mu(x)$ if $\mu$ is absolutely continuous w.r.t. Lebesgue measure, $+\infty$ otherwise, is a special case of internal energy satisfying McCann’s condition. In the latter case, $\mathcal{F}(\mu) = \operatorname{D_{KL}}(\mu\|\mu^*) + \text{const}$ where $\mu^*(x) \propto \exp(-F(x))$, the optimization problem reduces to a sampling problem with log-DC target distribution. In Bayesian inference, the posterior structure depends on both the prior and likelihood. If the likelihood is log-smooth, it exhibits the aforementioned DC splittings. Log-priors, often nonsmooth to capture sparsity or low rank, typically also have explicit DC structures [48, 32, 25].
In the context of infinitely wide one-layer neural networks and Maximum Mean Discrepancy [51, 7, 21], let $\mu^*$ be the optimal distribution over a network’s parameters, $k$ be a given kernel, the regularizer is then the interaction energy $\mathscr{H}(\mu) = \iint k(x,y) d\mu(x) d\mu(y)$ and $F(x) = -2\int k(x,y) d\mu^*(y)$. In general, $\mathscr{H}$ is not convex along generalized geodesics and $F$ is nonconvex but not necessarily DC. When the kernel has Lipschitz gradient, we can adjust both $\mathscr{H}$ and $F$ as $\mathscr{H}(\mu) = \iint (k(x,y) + \alpha\|x\|^2 + \alpha\|y\|^2) d\mu(x) d\mu(y)$ and $F(x) = -2\int k(x,y) d\mu^*(y) - 2\alpha\|x\|^2$ for some $\alpha>0$ making $\mathscr{H}$ generalized geodesically convex and $F$ concave (hence DC); Appx. A.2.
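As a quick sanity check on the kernel adjustment (a sketch that uses only the fact that $\mu$ is a probability measure), the added quadratic terms cancel, so the adjusted pair defines the same objective. The tilde notation for the adjusted quantities is introduced here for illustration, and $\mathfrak{m}_2(\mu) = \int \|x\|^2 d\mu(x)$ denotes the second moment.

```latex
\begin{aligned}
\tilde{\mathscr{H}}(\mu)
  &= \iint \big(k(x,y) + \alpha\|x\|^2 + \alpha\|y\|^2\big)\, d\mu(x)\, d\mu(y)
   = \iint k(x,y)\, d\mu(x)\, d\mu(y) + 2\alpha\, \mathfrak{m}_2(\mu), \\
\mathcal{E}_{\tilde F}(\mu)
  &= \int \Big(-2\int k(x,y)\, d\mu^*(y) - 2\alpha\|x\|^2\Big)\, d\mu(x)
   = -2\iint k(x,y)\, d\mu^*(y)\, d\mu(x) - 2\alpha\, \mathfrak{m}_2(\mu), \\
\mathcal{E}_{\tilde F}(\mu) + \tilde{\mathscr{H}}(\mu)
  &= \mathcal{E}_{F}(\mu) + \mathscr{H}(\mu).
\end{aligned}
```

The adjustment therefore only moves a multiple of the second moment from one term to the other. It changes the convexity properties of each term but not the minimization problem.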

Our idea is to minimize (1) in the space of probability distributions by discretizing the gradient flow of $\mathcal{F}$, leveraging the JKO (Jordan, Kinderlehrer, and Otto) operator (2). In the previous work [70], this has been done with the Forward-Backward (FB) Euler discretization, but without convergence analysis. Recently, Salim et al. [65] studied FB Euler, but their results do not apply here because $F$ is nonconvex and possibly nonsmooth. Further leveraging the DC structure of $F$ and inspired by classical DC programming literature [60], we subtly modify the FB Euler to give rise to a scheme named semi FB Euler, which enjoys major theoretical advantages in that we can provide a wide range of convergence analyses. Regarding the name, "semi" addresses the splitting of the potential energy. The scheme can nevertheless be reinterpreted as FB Euler.

Contributions

To our knowledge, no prior work studies problem (1) when $F$ is DC. Therefore, most of the derived results in this paper are novel. We propose and analyze the semi FB Euler (4), leveraging the classic DC optimization proof template [60, 45, 46] with a substantial accommodation of Wasserstein geometry, and derive the following set of new insights:

  • Thm. 1

    We show that if $H$ is continuously differentiable, every cluster point of the sequence of distributions $\{\mu_n\}_{n\in\mathbb{N}}$ generated by semi FB Euler is a critical point of $\mathcal{F}$. Note that criticality is a notion from the DC programming literature [60] and it is a necessary condition for local optimality; see Sect. 3.3.

  • Thm. 2

    We provide a convergence rate of $O(N^{-1})$ in terms of the Wasserstein (sub)gradient mapping in the general nonsmooth setting. The notion of gradient mapping [33, 38, 55] comes from the context of proximal algorithms in Euclidean space and is applicable to nonconvex programs, where the distance to a global solution is, in general, not possible to work out.

  • Thm. 3

    Under the extra assumption that $H$ is twice continuously differentiable and has bounded Hessian, we provide a convergence rate of $O(N^{-\frac{1}{2}})$ in terms of the distance of $0$ to the Fréchet subdifferential of $\mathcal{F}$. One can think of this as a convergence rate to Fréchet stationarity, i.e., if $\mu^*$ is a Fréchet stationary point of $\mathcal{F}$, then, by definition, $0$ is in the Fréchet subdifferential of $\mathcal{F}$ at $\mu^*$. Fréchet stationarity is a relatively sharp necessary condition for local optimality.

  • Thm. 4, 5

    Under the assumptions of Thm. 3 and additionally $\mathcal{F}$ satisfying the Łojasiewicz-type inequality for some Łojasiewicz exponent $\theta\in[0,1)$, we show that $\{\mu_n\}_{n\in\mathbb{N}}$ is a Cauchy sequence under the Wasserstein topology, and thanks to the completeness of the Wasserstein space, the whole sequence $\{\mu_n\}_{n\in\mathbb{N}}$ converges to some $\mu^*$. We show that $\mu^*$ is in fact a global minimizer of $\mathcal{F}$. Furthermore, we provide convergence rates of $\mu_n\to\mu^*$ in three different regimes ($W_2$ denotes the Wasserstein metric): (1) if $\theta=0$, $W_2(\mu_n,\mu^*)$ converges to $0$ after a finite number of steps; (2) if $\theta\in(0,1/2]$, both $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*)$ and $W_2(\mu_n,\mu^*)$ converge to $0$ exponentially fast; (3) if $\theta\in(1/2,1)$, both $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*)$ and $W_2(\mu_n,\mu^*)$ converge sublinearly to $0$ with rates $O\left(n^{-\frac{1}{2\theta-1}}\right)$ and $O\left(n^{-\frac{1-\theta}{2\theta-1}}\right)$, respectively. When $\mathscr{H}$ is the negative entropy, $\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*) = \operatorname{D_{KL}}(\mu_n\|\mu^*)$; therefore, in the sampling context, we provide convergence guarantees in both Wasserstein and KL distances. See Sect. 4.3 for additional observations and implications.
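To get a feel for the sublinear regime (3) of Thm. 4, 5, the two exponents can be evaluated at a sample value. The choice $\theta = 3/4$ is purely illustrative.

```latex
\theta = \tfrac{3}{4}: \qquad
\frac{1}{2\theta-1} = 2, \qquad
\frac{1-\theta}{2\theta-1} = \frac{1/4}{1/2} = \frac{1}{2},
\qquad\text{so}\qquad
\mathcal{F}(\mu_n)-\mathcal{F}(\mu^*) = O\big(n^{-2}\big), \quad
W_2(\mu_n,\mu^*) = O\big(n^{-1/2}\big).
```

Both exponents grow without bound as $\theta \downarrow 1/2$, matching the exponential regime (2), while as $\theta \uparrow 1$ they tend to $1$ and $0$, respectively.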

2 Preliminaries

2.1 Notations and basic results in measure theory and functional analysis

Let $X=\mathbb{R}^d$. We denote by $\mathcal{B}(X)$ the Borel $\sigma$-algebra over $X$, and by $\mathscr{L}^d$ the Lebesgue measure on $X$. $\mathcal{P}(X)$ is the set of Borel probability measures on $X$. For $\mu\in\mathcal{P}(X)$, we denote its second-order moment by $\mathfrak{m}_2(\mu) := \int_X \|x\|^2 d\mu(x)$, where $\mathfrak{m}_2(\mu)$ can be infinite. $\mathcal{P}_2(X)\subset\mathcal{P}(X)$ denotes the set of probability measures with finite second-order moment. $\mathcal{P}_{2,\operatorname{abs}}(X)\subset\mathcal{P}_2(X)$ is the set of measures that are absolutely continuous w.r.t. $\mathscr{L}^d$. Here $\mu$-a.e. stands for almost everywhere w.r.t. $\mu$.

Let $C^p(X)$, $C_c^\infty(X)$, $C_b(X)$ be the classes of $p$-times continuously differentiable functions, infinitely differentiable functions with compact support, and bounded continuous functions, respectively.

From functional analysis [20], for each $p\geq 1$, $L^p(X,\mu)$ denotes the Banach space of measurable (where measurable is understood as Borel measurable from now on) functions $f$ such that $\int_X |f(x)|^p d\mu(x) < +\infty$. We shall consider an element of $L^p(X,\mu)$ as an equivalence class of functions that agree $\mu$-a.e. on $X$ rather than a single function. The norm of $f\in L^p(X,\mu)$ is $\|f\|_{L^p(X,\mu)} = (\int_X |f(x)|^p d\mu(x))^{1/p}$.
When $p=2$, $L^2(X,\mu)$ is actually a Hilbert space with the inner product $\langle f,g\rangle_{L^2(X,\mu)} = \int_X f(x)g(x) d\mu(x)$, which induces the mentioned norm. These results can be extended to vector-valued functions. In particular, we denote by $L^2(X,X,\mu)$ the Hilbert space of maps $\xi: X\to X$ such that $\|\xi\|\in L^2(X,\mu)$, equipped with the norm $\|\xi\|_{L^2(X,X,\mu)} := (\int_X \|\xi(x)\|^2 d\mu(x))^{1/2}$.

We say that $f: X\to\mathbb{R}$ has quadratic growth if there exists $a>0$ such that $|f(x)| \leq a(\|x\|^2+1)$ for all $x\in X$. It is clear that if $f$ has quadratic growth and $\mu\in\mathcal{P}_2(X)$, then $f\in L^1(X,\mu)$.
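The integrability claim follows in one line from the definitions, integrating the growth bound against $\mu$ and using that $\mu$ is a probability measure with finite second moment:

```latex
\int_X |f(x)|\, d\mu(x)
  \;\le\; a \int_X \big(\|x\|^2 + 1\big)\, d\mu(x)
  \;=\; a\,\big(\mathfrak{m}_2(\mu) + 1\big)
  \;<\; +\infty .
```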

The pushforward of a measure $\mu\in\mathcal{P}(X)$ through a Borel map $T: X\to\mathbb{R}^m$, denoted by $T_\#\mu$, is defined by $(T_\#\mu)(A) := \mu(T^{-1}(A))$ for every Borel set $A\subset\mathbb{R}^m$.
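In sampling terms, if $x\sim\mu$ then $T(x)\sim T_\#\mu$, so the pushforward of an empirical measure is obtained by applying $T$ to each atom. The sketch below checks the defining identity on one set. The map $T(x)=x^2$, the set $A=[0,1]$ (with preimage $[-1,1]$), and the Gaussian samples are illustrative assumptions.

```python
import numpy as np

def T(x):
    """Illustrative Borel map T: R -> R."""
    return x**2

rng = np.random.default_rng(0)
atoms = rng.standard_normal(10000)   # equally weighted atoms of an empirical measure mu
pushed_atoms = T(atoms)              # atoms of the pushforward T_# mu

# A = [0, 1], whose preimage under T is [-1, 1].
mass_of_A_under_pushforward = float(np.mean((pushed_atoms >= 0.0) & (pushed_atoms <= 1.0)))
mass_of_preimage_under_mu = float(np.mean((atoms >= -1.0) & (atoms <= 1.0)))
```

The two masses agree exactly for the empirical measure. Both approximate the probability that a standard Gaussian falls in $[-1,1]$, roughly $0.68$.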

2.2 Optimal transport [4, 5, 69, 68]

Given $\mu,\nu\in\mathcal{P}(X)$, the principal problem in optimal transport is to find a transport map $T$ pushing $\mu$ to $\nu$, i.e., $T_\#\mu=\nu$, in the most cost-efficient way, i.e., minimizing $\|x-T(x)\|^2$ on $\mu$-average. Monge’s formulation for this problem is $\inf_{T: T_\#\mu=\nu} \int_X \|x-T(x)\|^2 d\mu(x)$, where the optimal solution, if it exists, is denoted by $T_\mu^\nu$ and called the optimal (Monge) map. Monge’s problem can be ill-posed, e.g., no such $T_\mu^\nu$ exists when $\mu$ is a Dirac mass and $\nu$ is absolutely continuous [5].

By relaxing Monge’s formulation, Kantorovich considers $\min_{\gamma\in\Gamma(\mu,\nu)} \int_{X\times X} \|x-y\|^2 d\gamma(x,y)$, where $\Gamma(\mu,\nu)$ denotes the set of probability measures over $X\times X$ whose marginals are $\mu$ and $\nu$, i.e., $\gamma\in\Gamma(\mu,\nu)$ iff ${\operatorname{proj}_1}_\#\gamma=\mu$, ${\operatorname{proj}_2}_\#\gamma=\nu$, where $\operatorname{proj}_1,\operatorname{proj}_2$ are the projections onto the first $X$ space and the second $X$ space, respectively. Such $\gamma$ is called a plan. Kantorovich’s formulation is well-posed because $\Gamma(\mu,\nu)$ is non-empty (at least $\mu\times\nu\in\Gamma(\mu,\nu)$) and the $\operatorname*{arg\,min}$ element actually exists (see [5, Sect. 2.2]).
The set of optimal plans between $\mu$ and $\nu$ is denoted by $\Gamma_o(\mu,\nu)$. In terms of random variables, any pair $(X,Y)$ with $X\sim\mu$, $Y\sim\nu$ is called a coupling of $\mu$ and $\nu$, and it is called an optimal coupling if the joint law of $X$ and $Y$ is in $\Gamma_o(\mu,\nu)$.
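The difference between an arbitrary coupling and an optimal one can be seen on one-dimensional empirical measures. The sketch below relies on the standard fact that, in one dimension with equally many equally weighted atoms, pairing the sorted atoms gives an optimal plan for the quadratic cost. The two Gaussian samples and the function names are illustrative assumptions.

```python
import numpy as np

def cost_of_pairing(x, y):
    """Kantorovich cost of the plan that pairs x[i] with y[i], each with weight 1/n."""
    return float(np.mean((x - y) ** 2))

def cost_of_sorted_pairing(x, y):
    """Cost of the monotone plan (sorted atoms paired in order), optimal in 1-D."""
    return cost_of_pairing(np.sort(x), np.sort(y))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=2000)   # atoms of an empirical measure mu
y = rng.normal(3.0, 2.0, size=2000)   # atoms of an empirical measure nu

as_drawn_cost = cost_of_pairing(x, y)        # a valid but non-optimal plan
optimal_cost = cost_of_sorted_pairing(x, y)  # cost of an optimal plan
```

Both pairings have the correct marginals, so both are plans in $\Gamma(\mu,\nu)$, but only the sorted one attains the minimum. For the underlying Gaussians the minimal cost is $(0-3)^2 + (1-2)^2 = 10$, which the empirical value approximates.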

In $\mathcal{P}_2(X)$, the $\min$ value in Kantorovich’s problem specifies a valid metric referred to as the Wasserstein distance, $W_2(\mu,\nu) = (\int_{X\times X} \|x-y\|^2 d\gamma(x,y))^{1/2}$ for some, and thus all, $\gamma\in\Gamma_o(\mu,\nu)$. The metric space $(\mathcal{P}_2(X),W_2)$ is then called the Wasserstein space.
In $\mathcal{P}_2(X)$, besides the convergence notion induced by the Wasserstein metric, there is a weaker notion of convergence called narrow convergence: we say a sequence $\{\mu_n\}_{n\in\mathbb{N}}\subset\mathcal{P}_2(X)$ converges narrowly to $\mu\in\mathcal{P}_2(X)$ if $\int_X \phi(x) d\mu_n(x) \to \int_X \phi(x) d\mu(x)$ for all $\phi\in C_b(X)$. Convergence in the Wasserstein metric implies narrow convergence but the converse is not necessarily true. The extra condition to make it true is $\mathfrak{m}_2(\mu_n)\to\mathfrak{m}_2(\mu)$. We denote Wasserstein and narrow convergence by $\xrightarrow{\text{Wass}}$ and $\xrightarrow{\text{narrow}}$, respectively.
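A standard example separating the two notions is $\mu_n = (1-\tfrac{1}{n})\delta_0 + \tfrac{1}{n}\delta_{\sqrt{n}}$ on $\mathbb{R}$. A vanishing fraction of mass escapes to infinity, so $\mu_n$ converges narrowly to $\delta_0$, yet $\mathfrak{m}_2(\mu_n)=1$ for every $n$. Since the only plan between $\mu_n$ and a Dirac mass is the product measure, $W_2(\mu_n,\delta_0)=\sqrt{\mathfrak{m}_2(\mu_n)}=1$. The sketch below evaluates this numerically. The example and the single test function $\phi(x)=1/(1+x^2)$ are illustrative assumptions, and checking one $\phi$ illustrates narrow convergence rather than proving it.

```python
import numpy as np

def mu_n(n):
    """Atoms and weights of mu_n = (1 - 1/n) * delta_0 + (1/n) * delta_{sqrt(n)}."""
    return np.array([0.0, np.sqrt(n)]), np.array([1.0 - 1.0 / n, 1.0 / n])

def second_moment(atoms, weights):
    return float(np.sum(weights * atoms**2))

def integrate(phi, atoms, weights):
    return float(np.sum(weights * phi(atoms)))

def phi(x):
    """One bounded continuous test function."""
    return 1.0 / (1.0 + x**2)

ns = [10, 100, 1000, 10000]
# W2(mu_n, delta_0) equals the square root of the second moment of mu_n.
w2_to_delta0 = [float(np.sqrt(second_moment(*mu_n(n)))) for n in ns]
phi_integrals = [integrate(phi, *mu_n(n)) for n in ns]   # approaches phi(0) = 1
```

The integrals approach $\phi(0)=1$ while the Wasserstein distance stays at $1$, in line with the failure of the moment condition $\mathfrak{m}_2(\mu_n)\to\mathfrak{m}_2(\delta_0)=0$.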

If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\nu\in\mathcal{P}_2(X)$, Monge’s formulation is well-posed and its solution exists and is unique ($\mu$-a.e.); in this case, it is safe to talk about (and use) the optimal transport map $T_\mu^\nu$. Moreover, there exists some convex function $f$ such that $T_\mu^\nu=\nabla f$ $\mu$-a.e. Kantorovich’s problem also has a unique solution $\gamma$, given by $\gamma=(I,T_\mu^\nu)_\#\mu$ where $I$ is the identity map. This is known as the Brenier theorem or the polar factorization theorem [18].
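A case where the Brenier map is explicit is a pair of one-dimensional Gaussians. For $\mu=N(m_1,s_1^2)$ and $\nu=N(m_2,s_2^2)$, the affine map $T(x)=m_2+\tfrac{s_2}{s_1}(x-m_1)$ is the gradient of the convex function $f(x)=m_2x+\tfrac{s_2}{2s_1}(x-m_1)^2$ and pushes $\mu$ to $\nu$, with $W_2^2(\mu,\nu)=(m_1-m_2)^2+(s_1-s_2)^2$. The sketch below checks this by Monte Carlo. The parameter values and the function name `brenier_map` are illustrative assumptions.

```python
import numpy as np

m1, s1 = 0.0, 1.0   # mean and standard deviation of mu
m2, s2 = 3.0, 2.0   # mean and standard deviation of nu

def brenier_map(x):
    """Gradient of the convex potential f(x) = m2*x + (s2 / (2*s1)) * (x - m1)**2."""
    return m2 + (s2 / s1) * (x - m1)

rng = np.random.default_rng(0)
x = rng.normal(m1, s1, size=200000)   # samples from mu
y = brenier_map(x)                    # samples from the pushforward of mu

transport_cost = float(np.mean((x - y) ** 2))         # Monte Carlo estimate of W2^2
closed_form_w2_sq = (m1 - m2) ** 2 + (s1 - s2) ** 2   # equals 10 for these parameters
pushed_mean, pushed_std = float(y.mean()), float(y.std())
```

The pushed samples have mean and standard deviation close to those of $\nu$, and the estimated transport cost is close to the closed-form value.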

2.3 Subdifferential calculus in the Wasserstein space

Apart from being a metric space, $(\mathcal{P}_2(X), W_2)$ also enjoys some pre-Riemannian structure making subdifferential calculus on it possible. Let us have a picture of a manifold in mind. Firstly, the tangent space [4] of $\mathcal{P}_2(X)$ at $\mu$ is $\operatorname{Tan}_\mu \mathcal{P}_2(X) := \overline{\{\nabla\psi : \psi \in C_c^\infty(X)\}}^{L^2(X,X,\mu)}$, where the closure is w.r.t. the $L^2(X,X,\mu)$-topology. Intuitively, for $\psi \in C_c^\infty(X)$, $I + \epsilon\nabla\psi$ is an optimal transport map if $\epsilon > 0$ is small enough [43], so $\nabla\psi$ plays the role of a "tangent vector".

Let $\phi : \mathcal{P}_2(X) \to \mathbb{R} \cup \{+\infty\}$; we denote $\operatorname{dom}(\phi) = \{\mu \in \mathcal{P}_2(X) : \phi(\mu) < +\infty\}$. Let $\mu \in \operatorname{dom}(\phi)$; we say that a map $\xi \in L^2(X,X,\mu)$ belongs to the Fréchet subdifferential [15, 43] $\partial_F^- \phi(\mu)$ if $\phi(\nu) - \phi(\mu) \geq \sup_{\gamma\in\Gamma_o(\mu,\nu)} \int_{X\times X} \langle \xi(x), y - x\rangle \, d\gamma(x,y) + o(W_2(\mu,\nu))$ for all $\nu \in \mathcal{P}_2(X)$, where the little-o notation means $\lim_{s\to 0} o(s)/s = 0$.
If $\partial_F^-\phi(\mu) \neq \emptyset$, we say $\phi$ is Fréchet subdifferentiable at $\mu$. We also denote $\operatorname{dom}(\partial_F^-\phi) = \{\mu\in\mathcal{P}_2(X) : \partial_F^-\phi(\mu)\neq\emptyset\}$.

Similarly, we say that $\xi \in L^2(X,X,\mu)$ belongs to the (Fréchet) superdifferential $\partial_F^+\phi(\mu)$ of $\phi$ at $\mu$ if $-\xi \in \partial_F^-(-\phi)(\mu)$. In other words, $\partial_F^-(-\phi)(\mu) = -\partial_F^+\phi(\mu)$.

We say $\phi$ is Wasserstein differentiable [15, 43] at $\mu \in \operatorname{dom}(\phi)$ if $\partial_F^-\phi(\mu) \cap \partial_F^+\phi(\mu) \neq \emptyset$. We call an element of the intersection, denoted by $\nabla_W\phi(\mu)$, a Wasserstein gradient of $\phi$ at $\mu$, and it holds that $\phi(\nu) - \phi(\mu) = \int_{X\times X} \langle \nabla_W\phi(\mu)(x), y - x\rangle\, d\gamma(x,y) + o(W_2(\mu,\nu))$ for all $\nu \in \mathcal{P}_2(X)$ and any $\gamma\in\Gamma_o(\mu,\nu)$.
The Wasserstein gradient is not unique in general, but its parallel component in $\operatorname{Tan}_\mu\mathcal{P}_2(X)$ is unique, and this parallel component is again a valid Wasserstein gradient because the orthogonal component plays no role in the above definitions, i.e., if $\xi^\perp \in \operatorname{Tan}_\mu\mathcal{P}_2(X)^\perp$, it holds that $\int_{X\times X}\langle \xi^\perp(x), y - x\rangle\, d\gamma(x,y) = 0$ for any $\nu\in\mathcal{P}_2(X)$ and $\gamma\in\Gamma_o(\mu,\nu)$ [43, Prop. 2.5]. We may refer to this parallel component as the unique Wasserstein gradient of $\phi$ at $\mu$.

2.4 Optimization in the Wasserstein space

A function $\phi:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ is called proper if $\operatorname{dom}(\phi)\neq\emptyset$, while it is called lower semicontinuous (l.s.c) if for any sequence $\mu_n \xrightarrow{\text{Wass}} \mu$, it holds that $\liminf_n \phi(\mu_n) \geq \phi(\mu)$.

We next recall (a simplified version of) generalized geodesic convexity.

Definition 1.

[65] Let $\phi : \mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$. We say $\phi$ is convex along generalized geodesics if $\forall \mu,\pi\in\mathcal{P}_2(X)$, $\forall\nu\in\mathcal{P}_{2,\operatorname{abs}}(X)$, $\phi((tT_\nu^\mu + (1-t)T_\nu^\pi)_\#\nu) \leq t\phi(\mu) + (1-t)\phi(\pi)$, $\forall t\in[0,1]$.

The curve $t \mapsto (tT_\nu^\mu + (1-t)T_\nu^\pi)_\#\nu$ (called a generalized geodesic) interpolates from $\pi$ to $\mu$ as $t$ runs from $0$ to $1$. The definition says that $\phi$ is convex along these curves. If $\mu\in\mathcal{P}_{2,\operatorname{abs}}(X)$ and $\nu = \mu$, the curve is a geodesic in $(\mathcal{P}_2(X), W_2)$. If the definition is relaxed to the class of geodesics only, we say that $\phi$ is convex along geodesics.
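To make Definition 1 concrete, here is a small one-dimensional sketch under assumptions that are ours, not the paper's: all measures are Gaussian, the base is $\nu = \mathcal{N}(0,1)$ so that $T_\nu^\mu(x) = m_\mu + s_\mu x$, and the functional is the potential energy $\mu \mapsto \int x^2\,d\mu = m^2 + s^2$. The generalized geodesic is then again Gaussian with linearly interpolated mean and standard deviation, and the convexity inequality can be checked directly.

```python
import numpy as np

def generalized_geodesic(t, mu, pi):
    """Point at time t on the generalized geodesic with base nu = N(0, 1).

    mu and pi are (mean, std) pairs. Since T_nu^mu(x) = m_mu + s_mu * x, the map
    t * T_nu^mu + (1 - t) * T_nu^pi is affine and pushes nu to a Gaussian whose
    mean and std are the linear interpolations below.
    """
    (m_mu, s_mu), (m_pi, s_pi) = mu, pi
    return t * m_mu + (1 - t) * m_pi, t * s_mu + (1 - t) * s_pi

def potential_energy(gaussian):
    """Integral of V(x) = x^2 against N(m, s^2), i.e. the second moment m^2 + s^2."""
    m, s = gaussian
    return m ** 2 + s ** 2

mu, pi = (2.0, 0.5), (-1.0, 3.0)
ts = np.linspace(0.0, 1.0, 11)
lhs = np.array([potential_energy(generalized_geodesic(t, mu, pi)) for t in ts])
rhs = ts * potential_energy(mu) + (1 - ts) * potential_energy(pi)
print(np.all(lhs <= rhs + 1e-12))   # convexity inequality along the curve
```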

An important characterization of the Fréchet subdifferential of a geodesically convex function is that we can drop the little-o term in its definition in Sect. 2.3 [4, Sect. 10.1.1]. As a convention, for a geodesically convex function $\phi$, the Fréchet subdifferential $\partial_F^-$ will be simply written as $\partial$.

First-order optimality conditions

Let $\phi:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ be a proper function. $\mu^*\in\mathcal{P}_2(X)$ is a global minimizer of $\phi$ if $\phi(\mu^*)\leq\phi(\mu)$, $\forall\mu\in\mathcal{P}_2(X)$. For local optimality, we shall use the Wasserstein metric to define neighborhoods. $\mu^*\in\mathcal{P}_2(X)$ is a local minimizer if there exists $r>0$ such that $\phi(\mu^*)\leq\phi(\mu)$ for all $\mu$ with $W_2(\mu,\mu^*)<r$.
We shall denote by $B(\mu^*,r) := \{\mu\in\mathcal{P}_2(X) : W_2(\mu,\mu^*)<r\}$ the (open) Wasserstein ball centered at $\mu^*$ with radius $r$. If we replace $<$ by $\leq$, we obtain the notion of a closed Wasserstein ball.

We call $\mu^*$ a Fréchet stationary point of $\phi$ if $0\in\partial_F^-\phi(\mu^*)$. Fréchet stationarity is a necessary condition for local optimality. In other words, if $\mu^*$ is a local minimizer, it is a Fréchet stationary point (Lem. 5 in Appendix). In addition, if $\phi$ is Wasserstein differentiable at $\mu^*$, $\nabla_W\phi(\mu^*)(x) = 0$ $\mu^*$-a.e. [43]. When $\phi$ is geodesically convex, Fréchet stationarity is a sufficient condition for global optimality (Lem. 6 in Appendix).
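As a worked illustration, consider the sampling setting under assumptions that are not made in this section: $\phi = \mathcal{E}_F + \mathscr{H}$ with $F$ smooth, $\mathscr{H}$ the negative entropy, and $\mu^*$ having a smooth positive density on a connected domain. The following formal computation, which uses the standard expressions for the Wasserstein gradients of a potential energy and of the negative entropy, shows how stationarity recovers the usual target distribution.

```latex
\nabla_W \phi(\mu)(x) = \nabla F(x) + \nabla \log \mu(x),
\qquad
\nabla_W \phi(\mu^*) = 0 \ \ \mu^*\text{-a.e.}
\;\Longleftrightarrow\;
\nabla\big(F + \log \mu^*\big) = 0
\;\Longleftrightarrow\;
\mu^*(x) \propto e^{-F(x)} .
```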

3 Semi Forward-Backward Euler for difference-of-convex structures

3.1 Wasserstein gradient flows: different types of discretizations

To neatly present the idea of minimizing $\mathcal{F}$ via discretized gradient flow, we first assume for a moment that $F$ is infinitely differentiable and $\mathscr{H}$ is the negative entropy. See also a discussion in [65].

We wish to minimize (1) in the space of probability distributions. A natural idea is to apply discretizations of the gradient flow of $\mathcal{F}$, where the gradient flow is defined (under some technical assumptions [39]) as the limit $\eta\to0^+$ of the following scheme with some simple time-interpolation

$$\mu_{n+1} \in \operatorname{JKO}_{\eta\mathcal{F}}(\mu_n), \text{ where } \operatorname{JKO}_{\eta\mathcal{F}}(\mu) := \operatorname*{arg\,min}_{\nu\in\mathcal{P}_2(X)} \mathcal{F}(\nu) + \dfrac{1}{2\eta}W_2^2(\mu,\nu). \tag{2}$$
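To build intuition for the JKO operator, consider a case simpler than the one treated in this paper: a pure potential energy $\nu \mapsto \int V\,d\nu$ and a Dirac input $\mu = \delta_x$ on $\mathbb{R}$. Then $W_2^2(\delta_x, \nu) = \int |y - x|^2\,d\nu(y)$, so the objective in (2) is the $\nu$-average of $y \mapsto V(y) + |y-x|^2/(2\eta)$ and its minimizer is the Dirac mass at the Euclidean proximal point of $x$. The sketch below checks this numerically for the assumed choice $V(y) = |y|$ (soft-thresholding); the function names are hypothetical.

```python
import numpy as np

def prox_abs(x, eta):
    """Proximal point of V(y) = |y|: argmin_y |y| + (y - x)^2 / (2 * eta)."""
    return np.sign(x) * max(abs(x) - eta, 0.0)

def jko_objective_at_dirac(y, x, eta):
    """Value of the JKO objective at nu = delta_y when mu = delta_x and V = |.|."""
    return abs(y) + (y - x) ** 2 / (2.0 * eta)

x, eta = 1.5, 0.4
grid = np.linspace(-3.0, 3.0, 6001)                 # candidate locations y
best_on_grid = grid[np.argmin([jko_objective_at_dirac(y, x, eta) for y in grid])]
print(prox_abs(x, eta), best_on_grid)               # both close to x - eta
```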

Straightforwardly, given a fixed $\eta>0$, (2) gives back a discretization of this flow known as Backward Euler. On the other hand, if $\mathcal{F}$ is Wasserstein differentiable (Sect. 2.3), the Forward Euler discretization reads [70] $\mu_{n+1} = (I - \eta\nabla_W\mathcal{F}(\mu_n))_\#\mu_n$, which is reinterpreted as doing gradient descent in the space of probability distributions. These are optimization methods that work directly on the objective function $\mathcal{F}$ itself. However, the composite structure of $\mathcal{F}$ (a sum of several terms) can also be exploited. One such scheme is the unadjusted Langevin algorithm (ULA), which first takes a gradient step w.r.t. the potential part, then follows the heat flow corresponding to the entropy part [70]: $\nu_{n+1} = (I-\eta\nabla F)_\#\mu_n$, and $\mu_{n+1} = \mathcal{N}(0,2\eta I) * \nu_{n+1}$, where $*$ is the convolution.
This ULA is "viewed" in the space of distributions (Eulerian approach); a more familiar and equivalent form of the ULA from the particle perspective (Lagrangian approach) reads $x_{n+1} = x_n - \eta\nabla F(x_n) + \sqrt{2\eta}\,z_n$, where $z_n\sim\mathcal{N}(0,I)$. The ULA is known to be asymptotically biased even for a Gaussian target measure (Ornstein-Uhlenbeck process). To correct this bias, the Metropolis-Hastings accept-reject step [62] is sometimes introduced. The Metropolis-Hastings algorithm [52, 36] is a much more general framework that works with almost any proposal (e.g., a random walk) and whose convergence analysis is based on the Markov kernel satisfying the detailed balance condition. This convergence framework is different from what is considered in this work: we are more interested in the underlying dynamics of the chain. The Metropolis-Hastings algorithm is therefore outside our scope.
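The bias of the ULA for a Gaussian target can be seen in closed form. In the sketch below (an illustration with assumed values, not an experiment from the paper), the target is $\mathcal{N}(0,1)$, i.e. $F(x) = x^2/2$, so the ULA update is linear in $x$ and the variance obeys $v_{n+1} = (1-\eta)^2 v_n + 2\eta$, whose fixed point is $1/(1-\eta/2)$ rather than 1.

```python
import numpy as np

def ula_step(x, eta, rng):
    """One ULA step for the potential F(x) = x^2 / 2 (so grad F(x) = x)."""
    return x - eta * x + np.sqrt(2.0 * eta) * rng.standard_normal(x.shape)

def ula_limit_variance(eta, v0=1.0, n_steps=2_000):
    """Iterate the exact variance recursion v <- (1 - eta)^2 * v + 2 * eta."""
    v = v0
    for _ in range(n_steps):
        v = (1.0 - eta) ** 2 * v + 2.0 * eta
    return v

eta = 0.2
rng = np.random.default_rng(0)
x = rng.standard_normal(100_000)          # start at the target N(0, 1)
for _ in range(200):
    x = ula_step(x, eta, rng)

print(x.var())                            # empirical variance of the particles
print(ula_limit_variance(eta))            # limit of the variance recursion
print(1.0 / (1.0 - eta / 2.0))            # closed form, larger than the target variance 1
```

Even when started exactly at the target, the particles drift to the inflated variance, and the gap only closes as $\eta \to 0$.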

In optimization, for composite structures, Forward-Backward (FB) Euler and its variants are methods of choice [59, 10]. The corresponding FB Euler for $\mathcal{F}$ takes a gradient step (forward) according to the potential, and a JKO step (backward) w.r.t. the negative entropy

$$\text{(FB Euler)}\quad \nu_{n+1} = (I-\eta\nabla F)_\#\mu_n, \text{ and } \mu_{n+1}\in\operatorname{JKO}_{\eta\mathscr{H}}(\nu_{n+1}). \tag{3}$$

This scheme appears in [70] without convergence analysis; later, [65] derives non-asymptotic convergence guarantees under the assumption that $F$ is convex and Lipschitz smooth.

In this work, as $F$ is nonconvex and nonsmooth, the theory in [65] does not apply, and the convergence (if any) of (3) remains mysterious. The DC structure of $F$ can be further exploited. In DC programming [60], the forward step should be applied to the concave part, while the backward step should be applied to the convex part. We hence propose the following semi FB Euler

$$\text{(semi FB Euler)}\quad \nu_{n+1} = (I+\eta\nabla H)_\#\mu_n, \text{ and } \mu_{n+1}\in\operatorname{JKO}_{\eta(\mathscr{H}+\mathcal{E}_G)}(\nu_{n+1}) \tag{4}$$

for which we can provide convergence guarantees. At first glance, the difference between semi FB Euler and FB Euler is subtle: while FB Euler does forward on $\mathcal{E}_{G-H} = \mathcal{E}_G - \mathcal{E}_H$ and backward on $\mathscr{H}$, semi FB Euler does forward on $-\mathcal{E}_H$ and backward on $\mathscr{H}+\mathcal{E}_G$; recall that $\mathcal{F} = \mathcal{E}_G - \mathcal{E}_H + \mathscr{H}$.

Theoretically, semi FB Euler enjoys some advantages compared to FB Euler. Thanks to the Brenier theorem (Sect. 2.2), the pushing step in semi FB Euler is optimal since $H$ is convex; meanwhile, the pushing step in FB Euler is non-optimal, and its optimal Monge map is not identifiable in general. The convergence of FB Euler is still an open question, even when $F$ is (DC) differentiable. In contrast, we can provide a solid theoretical guarantee for semi FB Euler, especially when $H$ is differentiable. Additionally, we also offer convergence guarantees when $H$ is nonsmooth.
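The two steps of (4) can be traced by hand in a one-dimensional Gaussian toy problem. The sketch below is ours and rests on explicit assumptions: $G(x) = a x^2/2$ and $H(x) = b x^2/2$ with $a > b > 0$, $\mathscr{H}$ is the negative entropy, the iterates are centered Gaussians $\mathcal{N}(0,\sigma^2)$, and the JKO step is minimized only within the family of centered Gaussians, where $W_2^2(\mathcal{N}(0,\sigma^2), \mathcal{N}(0,\tau^2)) = (\sigma-\tau)^2$. Under these assumptions the JKO step reduces to a scalar quadratic equation, and the target is $\mathcal{N}(0, 1/(a-b))$.

```python
import math

def push_forward_std(sigma, eta, b):
    """nu = (I + eta * H')_# mu with H(x) = b x^2 / 2: the std is scaled by 1 + eta * b."""
    return (1.0 + eta * b) * sigma

def jko_std(tau, eta, a):
    """Gaussian-restricted JKO step for E_G + entropy, started from N(0, tau^2).

    Minimizes a * s^2 / 2 - log(s) + (s - tau)^2 / (2 * eta) over s > 0; the
    first-order condition is (1 + eta * a) * s^2 - tau * s - eta = 0.
    """
    c = 1.0 + eta * a
    return (tau + math.sqrt(tau ** 2 + 4.0 * eta * c)) / (2.0 * c)

def objective(sigma, a, b):
    """F(N(0, sigma^2)) up to an additive constant: (a - b) sigma^2 / 2 - log(sigma)."""
    return 0.5 * (a - b) * sigma ** 2 - math.log(sigma)

a, b, eta, sigma = 3.0, 1.0, 0.1, 2.0
values = [objective(sigma, a, b)]
for _ in range(200):
    sigma = jko_std(push_forward_std(sigma, eta, b), eta, a)
    values.append(objective(sigma, a, b))

print(sigma ** 2, 1.0 / (a - b))   # iterate variance vs. target variance
```

In this toy setting the fixed point of the iteration is exactly the target variance for any stepsize, and the recorded objective values are non-increasing.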

3.2 Problem setting

Our goal is to minimize the non-geodesically-convex functional $\mathcal{F}(\mu) = \mathcal{E}_F(\mu) + \mathscr{H}(\mu)$ over $\mathcal{P}_2(X)$, where $F = G - H$ is a DC function. We make Assumption 1 throughout the paper:

Assumption 1.
  • (i) The objective function $\mathcal{F}$ is bounded below.

  • (ii) $G, H: X\to\mathbb{R}$ are convex functions and have quadratic growth.

  • (iii) $\mathscr{H}:\mathcal{P}_2(X)\to\mathbb{R}\cup\{+\infty\}$ is proper, l.s.c, and convex along generalized geodesics in $(\mathcal{P}_2(X),W_2)$, and $\operatorname{dom}(\mathscr{H})\subset\mathcal{P}_{2,\operatorname{abs}}(X)$.

  • (iv) There exists $\eta_0>0$ such that $\forall\eta\in(0,\eta_0)$, $\operatorname{JKO}_{\eta(\mathcal{E}_G+\mathscr{H})}(\mu)\neq\emptyset$ for every $\mu\in\mathcal{P}_2(X)$.

Note that Assumption 1(iv) is a commonly used assumption to simplify technical complications when working with the JKO operator [4, 15, 65]. Assumption 1(ii) implies that $\mathcal{E}_G$ and $\mathcal{E}_H$ are continuous w.r.t. the Wasserstein topology [3, Prop. 2.4] ($G, H$ are continuous [54, Cor. 2.27] and have quadratic growth).

3.3 Optimality characterizations

First, it follows from Assumption 1(iii) that $\operatorname{dom}(\mathcal{F})\subset\mathcal{P}_{2,\operatorname{abs}}(X)$. By analogy to DC programming in Euclidean space, we call $\mu^*\in\operatorname{dom}(\mathcal{F})$ a critical point of $\mathcal{F} = \mathscr{H}+\mathcal{E}_G-\mathcal{E}_H$ if $\partial(\mathscr{H}+\mathcal{E}_G)(\mu^*)\cap\partial\mathcal{E}_H(\mu^*)\neq\emptyset$. Criticality is a necessary condition for local optimality (Lem. 7). Moreover, if $\mathcal{E}_H$ is Wasserstein differentiable at $\mu^*$, criticality becomes Fréchet stationarity (Lem. 8).
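The Euclidean analogy can be made concrete with a one-dimensional DC function, used here purely as an illustration: $f = g - h$ with $g(x) = x^2$ and $h(x) = |x|$. At $x = 0$ we have $\partial g(0) = \{0\}$ and $\partial h(0) = [-1, 1]$, so $0$ is a critical point, yet it is not a local minimizer. This shows that criticality alone is weaker than local optimality when the subtracted part is nonsmooth. The sketch below checks both facts numerically.

```python
def g(x):
    return x * x          # convex part

def h(x):
    return abs(x)         # convex part that is subtracted

def f(x):
    return g(x) - h(x)    # DC function f = g - h

def subdiff_g(x):
    """Subdifferential of g as an interval [lo, hi] (a singleton, since g is smooth)."""
    return (2.0 * x, 2.0 * x)

def subdiff_h(x):
    """Subdifferential of h = |.| as an interval [lo, hi]."""
    if x > 0:
        return (1.0, 1.0)
    if x < 0:
        return (-1.0, -1.0)
    return (-1.0, 1.0)

def is_critical(x):
    """True if the intervals subdiff_g(x) and subdiff_h(x) intersect."""
    (g_lo, g_hi), (h_lo, h_hi) = subdiff_g(x), subdiff_h(x)
    return max(g_lo, h_lo) <= min(g_hi, h_hi)

print(is_critical(0.0), is_critical(0.5), is_critical(0.25))
print(f(0.0), f(0.1), f(-0.1))   # f is smaller near 0 than at 0
```

Here $x = 0.5$ is critical and is a genuine minimizer, while $x = 0$ is critical only because the subdifferential of $h$ is large there.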

3.4 Semi FB Euler: a general setting

We allow $H$ to be non-differentiable in some derivations, meaning that $\partial H$ (convex subdifferential [54]) contains multiple elements in general. We first pick a selector $S$ of $\partial H$, i.e., $S: X\to X$ such that $S(x)\in\partial H(x)$. By the axiom of choice (Zermelo, 1904, see, e.g., [37]), such a selection always exists. However, an arbitrary selector can behave badly, e.g., it may fail to be measurable. We shall first restrict ourselves to the class of Borel measurable selectors (see Appx. A.1 for an existence discussion).

Assumption 2 (Measurability).

The selector $S$ is Borel measurable.

We recall the semi FB scheme (4) but for nonsmooth $F$ as follows: start with an initial distribution $\mu_0\in\mathcal{P}_{2,\operatorname{abs}}(X)$; given a discretization stepsize $0<\eta<\eta_0$, we repeat the following two steps:

$$\nu_{n+1} = (I+\eta S)_\#\mu_n \quad\triangleleft\text{ push forward step;}\qquad \mu_{n+1} = \operatorname{JKO}_{\eta(\mathcal{E}_G+\mathscr{H})}(\nu_{n+1}) \quad\triangleleft\text{ JKO step}.$$

Well-definedness and properties: Given $\mu_n\in\mathcal{P}_2(X)$, it follows from Lem. 4 that $\nu_{n+1}\in\mathcal{P}_2(X)$. The two generated sequences are then in $\mathcal{P}_2(X)$. Moreover, it follows from Assumption 1 that $\{\mu_n\}_{n\in\mathbb{N}}$ are in $\mathcal{P}_{2,\operatorname{abs}}(X)$, and so are $\{\nu_n\}_{n\in\mathbb{N}}$ by Lem. 9, noting that $I+\eta S$ is a subgradient of the strongly convex function $x\mapsto(1/2)\|x\|^2+\eta H(x)$.
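At the particle level, the push forward step with a nonsmooth $H$ can be sketched as follows. This is an illustration under our own choices: $X = \mathbb{R}$, $H(x) = |x|$, and the selector $S(x) = \operatorname{sign}(x)$ with $S(0) = 0$, which is Borel measurable and satisfies $S(x)\in\partial H(x)$. The JKO step is not implemented here, and the function names are hypothetical.

```python
import numpy as np

def selector_abs(x):
    """A Borel measurable selector of the subdifferential of H(x) = |x|.

    np.sign returns 0 at x = 0, and 0 lies in the subdifferential [-1, 1] there.
    """
    return np.sign(x)

def push_forward_particles(particles, eta, selector):
    """Apply x -> x + eta * S(x) to each particle, i.e. sample from (I + eta S)_# mu_n."""
    return particles + eta * selector(particles)

rng = np.random.default_rng(0)
particles = rng.standard_normal(10_000)        # samples standing in for mu_n
moved = push_forward_particles(particles, eta=0.3, selector=selector_abs)

# I + eta * S is a subgradient of x^2 / 2 + eta * |x|, hence nondecreasing:
order = np.argsort(particles)
print(np.all(np.diff(moved[order]) >= 0))
print(np.min(np.abs(moved)))                   # particles are pushed away from 0
```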

4 Convergence analysis

4.1 Asymptotic analysis

Lemma 1 (Descent lemma).

Under Assumptions 1 and 2, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_0$. Then it holds
$$\mathcal{F}(\mu_{n+1}) \leq \mathcal{F}(\mu_n) - \frac{1}{\eta}\int_X \left\|T_{\nu_{n+1}}^{\mu_n}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^2 d\nu_{n+1}(x), \quad \forall n\in\mathbb{N}.$$

Lem. 1 shows that the objective does not increase along the iterates of semi FB Euler. The proof of Lem. 1 is in Appx. A.3. Using Lem. 1, we establish asymptotic convergence for semi FB Euler in Thm. 1.
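As an illustrative consequence of Lem. 1 (a sketch, under the additional assumption, not stated in this excerpt, that $\mathcal{F}$ is bounded below by some $\mathcal{F}^* > -\infty$), summing the descent inequality over $n = 0, \dots, N-1$ telescopes the objective values:

\[
\frac{1}{\eta}\sum_{n=0}^{N-1}\int_X \left\|T_{\nu_{n+1}}^{\mu_n}(x) - T_{\nu_{n+1}}^{\mu_{n+1}}(x)\right\|^2 d\nu_{n+1}(x) \;\leq\; \mathcal{F}(\mu_0) - \mathcal{F}(\mu_N) \;\leq\; \mathcal{F}(\mu_0) - \mathcal{F}^*.
\]

In particular, the smallest of the $N$ integrals is at most $\eta\,(\mathcal{F}(\mu_0) - \mathcal{F}^*)/N$, and the integrals tend to $0$ as $n \to \infty$. This is the usual mechanism by which a descent lemma yields both asymptotic statements and $O(1/N)$-type rates.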

For the asymptotic convergence analysis, we need the following assumption on $H$.

Assumption 3.

$H$ is continuously differentiable.

Theorem 1 (Asymptotic convergence).

Under Assumptions 1, 3, let $\{\mu_n\}_{n\in\mathbb{N}}$ and $\{\nu_n\}_{n\in\mathbb{N}}$ be sequences produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_0$. If $\{\mu_n\}_{n\in\mathbb{N}}$ is relatively compact with respect to the Wasserstein topology and $\sup_{n\in\mathbb{N}} \mathscr{H}(\nu_n) < +\infty$, then every cluster point of $\{\mu_n\}_{n\in\mathbb{N}}$ is a critical point of $\mathcal{F}$.

The proof of Thm. 1 is in Appx. A.4. Thm. 1 does not ensure convergence of the whole sequence $\{\mu_n\}_{n\in\mathbb{N}}$; rather, it guarantees subsequential convergence to critical points of $\mathcal{F}$.

Remark 1.

In Euclidean space, compactness of the generated sequence is usually enforced via a coercivity assumption: $f(x) \to +\infty$ whenever $\|x\| \to +\infty$. A striking difference in the Wasserstein space is that closed Wasserstein balls are not compact in the Wasserstein topology [43, Prop. 4.2], so coercivity alone is not sufficient to induce (Wasserstein) compactness. For Thm. 1, we simply assume the sequence $\{\mu_n\}_{n\in\mathbb{N}}$ to be relatively compact.
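A standard example, given here only as an illustration and taking $X = \mathbb{R}^d$ with $e_1$ a unit vector (choices made for this sketch), shows how compactness of Wasserstein balls fails. Consider

\[
\mu_k := \left(1 - \frac{1}{k^2}\right)\delta_0 + \frac{1}{k^2}\,\delta_{k e_1}, \qquad W_2^2(\mu_k, \delta_0) = \int_X \|x\|^2\, d\mu_k(x) = \frac{1}{k^2}\cdot k^2 = 1 .
\]

Every $\mu_k$ lies in the closed unit Wasserstein ball around $\delta_0$, and $\mu_k \to \delta_0$ narrowly. Any $W_2$-limit of a subsequence would have to coincide with the narrow limit $\delta_0$, yet $W_2(\mu_k, \delta_0) = 1$ for all $k$, so no subsequence converges in the Wasserstein topology: a vanishing amount of mass escaping to infinity keeps the second moment from converging.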

4.2 Non-asymptotic analysis

To quantify how fast the algorithm converges, we need a convergence measure. For proximal-type algorithms in Euclidean space, the gradient mapping $\mathcal{G}_\eta(x_n)$ is commonly used (see, e.g., [33, 55] and [38, Eq. (5)]), and one measures the rate at which $\|\mathcal{G}_\eta(x_n)\|^2 \to 0$. By analogy with the Euclidean case, we define the Wasserstein (sub)gradient mapping as
$$\mathcal{G}_\eta(\mu) := \frac{1}{\eta}\left(I - T_{\mu}^{\operatorname{JKO}_{\eta(\mathcal{E}_G+\mathscr{H})}((I+\eta S)_{\#}\mu)}\right),$$
and we measure the rate at which $\|\mathcal{G}_\eta(\mu_n)\|^2_{L^2(X,X,\mu_n)} \to 0$.
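To make the Euclidean notion concrete, the following is a minimal one-dimensional sketch of the Euclidean counterpart of the scheme (a forward step along a subgradient of $H$, followed by a proximal step on $G$) together with its gradient mapping. It is not the Wasserstein algorithm itself: the functions $G(x) = x^2/2$ and $H(x) = |x|$, the stepsize, the starting point, and all function names are hypothetical choices made for illustration.

```python
# Euclidean (1-D) analogue of the semi FB Euler step; illustrative only.
# Hypothetical choices: G(x) = x^2 / 2 and H(x) = |x|, so f = G - H is nonconvex.

def G(x):
    return 0.5 * x * x

def H(x):
    return abs(x)

def f(x):
    return G(x) - H(x)

def subgrad_H(x):
    # sign(x) is a subgradient of |x|; at x = 0 we pick 0, which lies in [-1, 1].
    return float((x > 0) - (x < 0))

def prox_G(y, eta):
    # argmin_z { z^2 / 2 + (z - y)^2 / (2 * eta) } = y / (1 + eta)
    return y / (1.0 + eta)

def semi_fb_step(x, eta):
    y = x + eta * subgrad_H(x)   # forward step along a subgradient of H
    return prox_G(y, eta)        # backward (proximal) step on G

def gradient_mapping(x, eta):
    # Euclidean gradient mapping: (x - next iterate) / eta
    return (x - semi_fb_step(x, eta)) / eta

ETA = 0.5
xs = [0.3]
for _ in range(50):
    xs.append(semi_fb_step(xs[-1], ETA))

# Squared gradient mapping along the iterates.
gm_sq = [gradient_mapping(x, ETA) ** 2 for x in xs]

# Euclidean counterpart of the descent inequality: f(x+) <= f(x) - |x - x+|^2 / eta.
descent_ok = all(
    f(b) <= f(a) - (b - a) ** 2 / ETA + 1e-12 for a, b in zip(xs, xs[1:])
)
```

In this toy instance the iterates move toward $x = 1$, a critical point of $f$, and the squared gradient mapping shrinks along the way; `descent_ok` checks the Euclidean counterpart of the descent inequality at every step.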

Theorem 2 (Convergence rate: Wasserstein (sub)gradient mapping).

Under Assumptions 1, 2, let $\{\mu_n\}_{n\in\mathbb{N}}$ be the sequence of distributions produced by semi FB Euler starting from some $\mu_0 \in \mathcal{P}_{2,\operatorname{abs}}(X)$ with $0<\eta<\eta_0$. Then it holds minn=1,N¯𝒢η(μn)L2(X,X