
Transport meets Variational Inference:
Controlled Monte Carlo Diffusions

Francisco Vargas$^{*}$, Shreyas Padhy$^{*}$
University of Cambridge
Cambridge, UK
{fav25,sp2058}@cam.ac.uk

Denis Blessing
KIT
Karlsruhe, Germany
jl8142@kit.edu

Nikolas Nüsken$^{*}$
King's College London
London, UK
nik.nuesken@gmx.de

Abstract

Connecting optimal transport and variational inference, we present a principled and systematic framework for sampling and generative modelling centred around divergences on path space. Our work culminates in the development of Controlled Monte Carlo Diffusions (CMCD) for sampling and inference, a score-based annealing technique that crucially adapts both forward and backward dynamics in a diffusion model. On the way, we clarify the relationship between the EM-algorithm and iterative proportional fitting (IPF) for Schrödinger bridges, providing a conceptual link between fields. Finally, we show that CMCD has a strong foundation in the Jarzynski and Crooks identities from statistical physics, and that it convincingly outperforms competing approaches across a wide array of experiments.

$^{*}$Equal contribution.

1 Introduction

Optimal transport (Villani et al., 2009) and variational inference (Blei et al., 2017) have for a long time been separate fields of research. In recent years, many fruitful connections have been established (Liu et al., 2019), in particular based on dynamical formulations (Tzen & Raginsky, 2019a), and in conjunction with time reversals (Huang et al., 2021a; Song et al., 2021). The goal of this paper is twofold: In the first part, we enhance those relationships based on forward and reverse time diffusions, and associated Girsanov transformations, arriving at a unifying framework for generative modeling and sampling. In the second part, we build on this and develop a novel score-based scheme for sampling from unnormalised densities. To set the stage, we recall a classical approach (Kingma & Welling, 2014; Rezende & Mohamed, 2015) towards generating samples from a target distribution $\mu({\bm{x}})$, which is the goal both in generative modelling and sampling:

Generative processes, encoders and decoders. We consider methodologies which can be implemented via the following generative process,

\begin{equation}
{\bm{z}}\sim\nu({\bm{z}}),\qquad{\bm{x}}|{\bm{z}}\sim p^{\theta}({\bm{x}}|{\bm{z}}), \tag{1}
\end{equation}

transforming a sample ${\bm{z}}\sim\nu({\bm{z}})$ into a sample ${\bm{x}}\sim\int p^{\theta}({\bm{x}}|{\bm{z}})\nu(\mathrm{d}{\bm{z}})$. Traditionally, $\nu({\bm{z}})$ is a simple auxiliary distribution, and the family of transitions $p^{\theta}({\bm{x}}|{\bm{z}})$ is parameterised flexibly and in such a way that sampling according to (1) is tractable. Then we can frame the tasks of generative modelling and sampling as finding transition densities such that the marginal in ${\bm{x}}$ matches the target distribution,

\begin{equation}
\mu({\bm{x}})=\int p^{\theta}({\bm{x}}|{\bm{z}})\nu(\mathrm{d}{\bm{z}}). \tag{2}
\end{equation}

To learn such a transition, it is helpful to introduce a reversed process

\begin{equation}
{\bm{x}}\sim\mu({\bm{x}}),\qquad{\bm{z}}|{\bm{x}}\sim q^{\phi}({\bm{z}}|{\bm{x}}), \tag{3}
\end{equation}

relying on an appropriately parameterised backward transition $q^{\phi}({\bm{z}}|{\bm{x}})$. We will say that (1) and (3) are reversals of each other in the case when their joint distributions coincide, that is, when

\begin{equation}
q^{\phi}({\bm{z}}|{\bm{x}})\mu({\bm{x}})=p^{\theta}({\bm{x}}|{\bm{z}})\nu({\bm{z}}). \tag{4}
\end{equation}

To appreciate the significance of (3), notice that if (4) holds, then (2) is implied by integrating both sides with respect to ${\bm{z}}$. Building on this observation, it is natural to define the loss function

\begin{equation}
\mathcal{L}_{D}(\phi,\theta):=D\left(q^{\phi}({\bm{z}}|{\bm{x}})\mu({\bm{x}})\big{|}\big{|}p^{\theta}({\bm{x}}|{\bm{z}})\nu({\bm{z}})\right), \tag{5}
\end{equation}

where $D$ is a divergence\footnote{As usual, divergences are characterised by the requirement that $D(\alpha\big{|}\big{|}\beta)\geq 0$, with equality iff $\alpha=\beta$.} between distributions yet to be specified. Along the lines of Bengio et al. (2021); Sohl-Dickstein et al. (2015); Wu et al. (2020); Liu et al. (b), we have now laid the foundations for algorithmic approaches that aim at sampling from $\mu({\bm{x}})$ by minimising $\mathcal{L}_{D}(\phi,\theta)$:

Framework 1.

Let $D$ be an arbitrary divergence, and assume that $\mathcal{L}_{D}(\phi,\theta)=0$. Then we have

\begin{equation}
\mu({\bm{x}})=\int p^{\theta}({\bm{x}}|{\bm{z}})\nu(\mathrm{d}{\bm{z}})\quad\text{and}\quad\nu({\bm{z}})=\int q^{\phi}({\bm{z}}|{\bm{x}})\mu(\mathrm{d}{\bm{x}}), \tag{6}
\end{equation}

that is, $\nu({\bm{z}})$ is transformed into $\mu({\bm{x}})$ by $p^{\theta}({\bm{x}}|{\bm{z}})$, and $\mu({\bm{x}})$ is transformed into $\nu({\bm{z}})$ by $q^{\phi}({\bm{z}}|{\bm{x}})$.
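To make the step from the reversal condition (4) to the marginal identities (6) concrete, the following minimal sketch checks both on a small discrete state space. The state-space sizes, the random transition table and all variable names are assumptions made for illustration only; the backward transition is obtained via Bayes' rule so that (4) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete setting (assumed for illustration): x takes 4 values, z takes 3.
nu = np.array([0.2, 0.5, 0.3])              # nu(z)
p_x_given_z = rng.random((4, 3))            # unnormalised p(x|z), one column per z
p_x_given_z /= p_x_given_z.sum(axis=0)      # every column is a probability vector

joint = p_x_given_z * nu                    # p(x|z) nu(z), indexed as [x, z]
mu = joint.sum(axis=1)                      # marginal in x, i.e. equation (2)
q_z_given_x = joint / mu[:, None]           # Bayes' rule, indexed as [x, z]

# Reversal condition (4): q(z|x) mu(x) == p(x|z) nu(z).
reversal_gap = np.abs(q_z_given_x * mu[:, None] - joint).max()

# Second identity in (6): nu(z) == sum_x q(z|x) mu(x).
nu_recovered = (q_z_given_x * mu[:, None]).sum(axis=0)

print("max violation of (4):", reversal_gap)
print("nu recovered from q and mu:", nu_recovered)
```

In this construction $\mu$ is defined through (2), so the content of the check is that the backward transition returns $\nu$, as stated in (6).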

The sampling problem. Let $\nu$ denote a probability density function on ${\mathbb{R}}^{d}$ of the form $\nu({\bm{z}})=\frac{\hat{\nu}({\bm{z}})}{Z}$, $Z=\int_{\mathbb{R}^{d}}\hat{\nu}({\bm{z}})\,\mathrm{d}{\bm{z}}$, where $\hat{\nu}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+}$ can be differentiated and evaluated pointwise but the normalizing constant $Z$ is intractable. We are interested in both estimating $Z$ and obtaining approximate samples from $\nu$, given that we can sample from a more tractable density $\mu$. Framework 1 provides us with an objective to tackle the sampling problem: once $\mathcal{L}_{D}(\phi,\theta)=0$, we can generate samples from $\nu({\bm{z}})$ via the variational distribution $q^{\phi}({\bm{z}}|{\bm{x}})$. Through variational inference and optimal transport, we discuss relationships to classical methods as well as shortcomings:

KL-divergence, ELBO and variational inference. Choosing $D=D_{\mathrm{KL}}$ in (5), variational inference (VI) and latent variable model based approaches (Dempster et al., 1977; Blei et al., 2017; Kingma & Welling, 2014) can elegantly be placed within Framework 1. Indeed, direct computation (see Appendix B) shows that $\mathcal{L}_{D_{\mathrm{KL}}}(\phi,\theta)=-\mathbb{E}_{{\bm{x}}\sim\mu({\bm{x}})}[\mathrm{ELBO}_{{\bm{x}}}(\phi,\theta)]+\mathbb{E}_{{\bm{x}}\sim\mu({\bm{x}})}[\operatorname{ln}\mu({\bm{x}})]$, so that minimising $\mathcal{L}_{D_{\mathrm{KL}}}(\phi,\theta)$ is equivalent to maximising the expected evidence lower bound (ELBO), also known as the negative free energy (Blei et al., 2017). This derivation is an alternative to the standard approach via maximum likelihood and convex duality (or Jensen’s inequality) (Kingma et al., 2021, Section 2.2), and directly accommodates various modifications by replacing the $D_{\mathrm{KL}}$-divergence (see Appendix B).
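The identity between the KL loss and the expected ELBO can be verified numerically on finite state spaces. The sketch below assumes the standard definition $\mathrm{ELBO}_{{\bm{x}}}(\phi,\theta)=\mathbb{E}_{{\bm{z}}\sim q^{\phi}({\bm{z}}|{\bm{x}})}[\operatorname{ln}p^{\theta}({\bm{x}}|{\bm{z}})+\operatorname{ln}\nu({\bm{z}})-\operatorname{ln}q^{\phi}({\bm{z}}|{\bm{x}})]$, which the text defers to Appendix B; the randomly drawn tables and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_conditional(rows, cols):
    """Random table t[i, j] whose columns are probability vectors over i."""
    t = rng.random((rows, cols)) + 0.1
    return t / t.sum(axis=0)

n_x, n_z = 5, 4
mu = rng.dirichlet(np.ones(n_x))      # target mu(x)
nu = rng.dirichlet(np.ones(n_z))      # nu(z)
q = random_conditional(n_z, n_x)      # q[z, x] = q(z|x)
p = random_conditional(n_x, n_z)      # p[x, z] = p(x|z)

joint_q = q.T * mu[:, None]           # [x, z] -> q(z|x) mu(x)
joint_p = p * nu[None, :]             # [x, z] -> p(x|z) nu(z)

# Left-hand side: KL divergence between the two joints, as in (5).
kl_joint = np.sum(joint_q * (np.log(joint_q) - np.log(joint_p)))

# ELBO_x = E_{z ~ q(.|x)}[ln p(x|z) + ln nu(z) - ln q(z|x)], one value per x.
elbo = np.sum(q.T * (np.log(joint_p) - np.log(q.T)), axis=1)

# Right-hand side: -E_mu[ELBO_x] + E_mu[ln mu(x)].
rhs = -np.sum(mu * elbo) + np.sum(mu * np.log(mu))

print(kl_joint, rhs)
```

Since the second term on the right-hand side does not depend on $(\phi,\theta)$, the two printed values agreeing for arbitrary tables is what makes minimising the loss equivalent to maximising the expected ELBO.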

Couplings, (optimal) transport and nonuniqueness. Assuming (4) holds, it is natural to define the joint distribution $\pi({\bm{x}},{\bm{z}}):=q^{\phi}({\bm{z}}|{\bm{x}})\mu({\bm{x}})=p^{\theta}({\bm{x}}|{\bm{z}})\nu({\bm{z}})$, which is a coupling between $\mu({\bm{x}})$ and $\nu({\bm{z}})$. Viewed from this angle, the set of minimisers of $\mathcal{L}(\phi,\theta)$ stands in one-to-one correspondence with the set of couplings between $\mu({\bm{x}})$ and $\nu({\bm{z}})$, provided that the parameterisations are chosen flexibly enough. Under the latter assumption, the objective in (5) admits an infinite number of minimisers, rendering algorithmic approaches solely based on Framework 1 potentially unstable and their output hard to interpret. In the language of optimal transport (Villani, 2003), minimising $\mathcal{L}(\phi,\theta)$ enforces the marginal (‘transport’) constraints in (6) without a selection principle based on an appropriate cost function (‘optimal’).
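The nonuniqueness can be seen already for two-state marginals. In the sketch below, two different couplings (the independent one and the one that copies ${\bm{x}}$ into ${\bm{z}}$) induce different transition pairs, yet both attain zero KL loss. The marginals, the helper functions `transitions` and `kl_loss`, and the choice $D=D_{\mathrm{KL}}$ are assumptions made for illustration.

```python
import numpy as np

mu = np.array([0.5, 0.5])   # mu(x) on two states
nu = np.array([0.5, 0.5])   # nu(z) on two states

# Two different couplings pi[x, z] with the same marginals mu and nu.
independent = np.outer(mu, nu)   # x and z independent
diagonal = np.diag(mu)           # z is a copy of x

def transitions(pi):
    """Return q(z|x) and p(x|z) induced by the coupling pi, both indexed as [x, z]."""
    q = pi / pi.sum(axis=1, keepdims=True)
    p = pi / pi.sum(axis=0, keepdims=True)
    return q, p

def kl_loss(q, p):
    """KL(q(z|x) mu(x) || p(x|z) nu(z)), using the convention 0 ln 0 = 0."""
    joint_q = q * mu[:, None]
    joint_p = p * nu[None, :]
    mask = joint_q > 0
    return float(np.sum(joint_q[mask] * np.log(joint_q[mask] / joint_p[mask])))

loss_independent = kl_loss(*transitions(independent))
loss_diagonal = kl_loss(*transitions(diagonal))

print(loss_independent, loss_diagonal)
```

Any convex combination of the two couplings is again a coupling of $\mu$ and $\nu$, which gives a continuum of minimisers in this toy case.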

Methods such as VAEs (Kingma & Welling, 2014) parameterise $p^{\theta}({\bm{x}}|{\bm{z}})$ and $q^{\phi}({\bm{z}}|{\bm{x}})$ with a restricted family of distributions (such as Gaussians), thus restricting the set of couplings. Expectation maximisation (EM) minimises $\mathcal{L}(\phi,\theta)$ in a component-wise fashion, resolving nonuniqueness in a procedural manner (see Section 3.1). Common diffusion models fix either $p^{\theta}({\bm{x}}|{\bm{z}})$ or $q^{\phi}({\bm{z}}|{\bm{x}})$, and thus select a coupling (Section 2.2). In this paper, we argue that the full potential of diffusion models can be unleashed by training the forward and backward processes at the same time, but appropriate modifications that resolve the nonuniqueness inherent in Framework 1 need to be imposed. To develop principled approaches towards this, we proceed as follows:

Outline and contributions. In Section 2 we recall hierarchical VAEs (Rezende et al., 2014) and, following Tzen & Raginsky (2019a), proceed to the infinite-depth limit described by the SDEs in (12). Readers more familiar with VI and discrete time might want to take the development in Section 2.1 as an explanation of (12); readers with background in stochastic analysis might take Framework 1$^{\prime}$ as their starting point. In Proposition 2.2 we provide a generalised form of the Girsanov theorem for forward-reverse time SDEs, crucially incorporating the choice of a reference process that allows us to reason about sampling and generation in a systematic and principled way. We demonstrate that a range of widely used approaches, such as score-based diffusions and path integral samplers, among others, are special cases of our unifying framework (Section 2.2). Similarly, in Section 3.1 we unify optimal transport (OT) and VI under our framework by establishing a correspondence between expectation-maximisation (EM) and iterative proportional fitting (IPF). Going further, we show that this framework allows us to derive new methods:

In Section 3.2, we derive a novel score-based annealed flow technique, the Controlled Monte Carlo Diffusion (CMCD) sampler, and show that it may be viewed as an infinitesimal analogue of the method from Section 3.1. Finally, we connect CMCD to the foundational identities by Crooks and Jarzynski in statistical physics, and show that it empirically outperforms a range of state-of-the-art inference methods in sampling and estimating normalizing constants (Section 4).

2 From hierarchical VAEs to forward-reverse time diffusions

2.1 Hierarchical VAEs (Rezende et al., 2014)

A particularly flexible choice of implicitly parameterising $p^{\theta}({\bm{x}}|{\bm{z}})$ and $q^{\phi}({\bm{z}}|{\bm{x}})$ can be achieved via a hierarchical model with intermediate latents: We identify ${\bm{x}}=:{\bm{y}}_{0}$ and ${\bm{z}}=:{\bm{y}}_{L}$ with the ‘endpoints’ of the layered augmentation $({\bm{y}}_{0},{\bm{y}}_{1},\ldots,{\bm{y}}_{L-1},{\bm{y}}_{L})=:{\bm{y}}_{0:L}$, and define

\begin{equation}
q^{\phi}({\bm{y}}_{L},\ldots,{\bm{y}}_{1}|{\bm{y}}_{0}):=\prod_{l=1}^{L}q^{\phi_{l-1}}({\bm{y}}_{l}|{\bm{y}}_{l-1}),\qquad p^{\theta}({\bm{y}}_{0},\ldots,{\bm{y}}_{L-1}|{\bm{y}}_{L}):=\prod_{l=1}^{L}p^{\theta_{l}}({\bm{y}}_{l-1}|{\bm{y}}_{l}), \tag{7}
\end{equation}

so that $q^{\phi}({\bm{z}}|{\bm{x}})$ and $p^{\theta}({\bm{x}}|{\bm{z}})$ can be obtained from (7) by marginalising over the auxiliary variables ${\bm{y}}_{1},\ldots,{\bm{y}}_{L-1}$. Here, $\phi=(\phi_{0},\ldots,\phi_{L-1})$ and $\theta=(\theta_{1},\ldots,\theta_{L})$ refer to sets of parameters to be specified in more detail below. Further introducing notation, we write $q^{\mu,\phi}({\bm{y}}_{0:L}):=q^{\phi}({\bm{y}}_{1:L}|{\bm{y}}_{0})\mu({\bm{y}}_{0})$ as well as $p^{\nu,\theta}({\bm{y}}_{0:L}):=p^{\theta}({\bm{y}}_{0:L-1}|{\bm{y}}_{L})\nu({\bm{y}}_{L})$ and think of those implied joint distributions as emanating from $\mu({\bm{x}})=\mu({\bm{y}}_{0})$ and $\nu({\bm{z}})=\nu({\bm{y}}_{L})$, respectively, moving ‘forwards’ or ‘backwards’ according to the specific choices for $\phi$ and $\theta$.
In the regime when $L$ is large, the models in (7) are very expressive, even if the intermediate transition kernels are parameterised in a simple manner. We hence proceed by assuming Gaussian distributions,

\begin{equation}
q^{\phi_{l-1}}({\bm{y}}_{l}|{\bm{y}}_{l-1})={\mathcal{N}}({\bm{y}}_{l}|{\bm{y}}_{l-1}+\delta a^{\phi}_{l-1}({\bm{y}}_{l-1}),\delta\sigma^{2}I),\qquad p^{\theta_{l}}({\bm{y}}_{l-1}|{\bm{y}}_{l})={\mathcal{N}}({\bm{y}}_{l-1}|{\bm{y}}_{l}+\delta b^{\theta}_{l}({\bm{y}}_{l}),\delta\sigma^{2}I), \tag{8}
\end{equation}

where $\sigma>0$ controls the standard deviation, and $\delta>0$ is a small parameter, anticipating the limits $L\rightarrow\infty$, $\delta\rightarrow 0$ to be taken in Section 2.2 below. The vector fields $a^{\phi}_{l}({\bm{y}}_{l})$ and $b_{l}^{\theta}({\bm{y}}_{l})$ introduced in (8) should be thought of as parameterised by $\phi$ and $\theta$, but we will henceforth suppress this for brevity.

The models (7)-(8) could equivalently be defined via the Markov chains


\begin{align}
{\bm{y}}_{l+1} &= {\bm{y}}_{l}+\delta a_{l}({\bm{y}}_{l})+\sqrt{\delta}\sigma\xi_{l},\qquad{\bm{y}}_{0}\sim\mu\implies{\bm{y}}_{0:L}\sim q^{\mu,\phi}({\bm{y}}_{0:L}), \tag{9a}\\
{\bm{y}}_{l-1} &= {\bm{y}}_{l}+\delta b_{l}({\bm{y}}_{l})+\sqrt{\delta}\sigma\xi_{l},\qquad{\bm{y}}_{L}\sim\nu\implies{\bm{y}}_{0:L}\sim p^{\nu,\theta}({\bm{y}}_{0:L}), \tag{9b}
\end{align}

where $(\xi_{l})_{l=1}^{L}$ is an iid sequence of standard Gaussian random variables. As indicated, the forward process in (9a) may serve to define the distribution $q^{\mu,\phi}({\bm{y}}_{0:L})$, whilst the backward process in (9b) induces $p^{\nu,\theta}({\bm{y}}_{0:L})$. Note that the transition densities $p^{\theta}({\bm{x}}|{\bm{z}})$ and $q^{\phi}({\bm{z}}|{\bm{x}})$ obtained as the marginals of (7) will in general not be available in closed form. However, generalising slightly from Framework 1, we may set out to minimise the extended loss

\begin{equation}
\mathcal{L}^{\mathrm{ext}}_{D}(\phi,\theta)=D(q^{\mu,\phi}({\bm{y}}_{0:L})||p^{\nu,\theta}({\bm{y}}_{0:L})), \tag{10}
\end{equation}

where $D$ refers to a divergence on the ‘discrete path space’ $\{{\bm{y}}_{0:L}\}$. Clearly, $\mathcal{L}_{D}^{\mathrm{ext}}(\phi,\theta)=0$ still implies (6), but is no longer equivalent. More specifically, in the case when $D=D_{\mathrm{KL}}$, the data processing inequality yields

\begin{equation}
D_{\mathrm{KL}}(q^{\mu,\phi}({\bm{y}}_{0:L})||p^{\nu,\theta}({\bm{y}}_{0:L}))\geq D_{\mathrm{KL}}\left(q^{\phi}({\bm{z}}|{\bm{x}})\mu({\bm{x}})\big{|}\big{|}p^{\theta}({\bm{x}}|{\bm{z}})\nu({\bm{z}})\right), \tag{11}
\end{equation}

so that $\mathcal{L}^{\mathrm{ext}}_{D_{\mathrm{KL}}}(\phi,\theta)$ provides an upper bound for $\mathcal{L}_{D_{\mathrm{KL}}}(\phi,\theta)$ as defined in (5).
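Because every factor in (7) is an explicit Gaussian density, the extended loss (10) with $D=D_{\mathrm{KL}}$ can be estimated by simulating the forward chain (9a) and averaging the pathwise log-ratio $\operatorname{ln}q^{\mu,\phi}({\bm{y}}_{0:L})-\operatorname{ln}p^{\nu,\theta}({\bm{y}}_{0:L})$. The sketch below is a one-dimensional illustration under assumptions of our own choosing: the Gaussian means in (8) are passed directly as time-independent functions `fwd_mean` and `bwd_mean`, the forward chain is a stationary linear autoregression, and all names and parameter values are hypothetical. In this linear-Gaussian special case the exact reversal is again Gaussian, so the matched log-ratio vanishes on every path.

```python
import numpy as np

def log_normal(x, mean, var):
    """Log-density of a one-dimensional Gaussian N(mean, var), elementwise."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def simulate_forward(fwd_mean, sample_mu, n_paths, L, var, rng):
    """Sample paths y_{0:L} from the forward chain (9a); shape (n_paths, L + 1)."""
    y = np.empty((n_paths, L + 1))
    y[:, 0] = sample_mu(n_paths)
    for l in range(L):
        y[:, l + 1] = fwd_mean(y[:, l]) + np.sqrt(var) * rng.standard_normal(n_paths)
    return y

def path_log_ratio(y, fwd_mean, bwd_mean, log_mu, log_nu, var):
    """ln q^{mu,phi}(y_{0:L}) - ln p^{nu,theta}(y_{0:L}) per path, using (7)-(8)."""
    prev, nxt = y[:, :-1], y[:, 1:]
    log_q = log_mu(y[:, 0]) + log_normal(nxt, fwd_mean(prev), var).sum(axis=1)
    log_p = log_nu(y[:, -1]) + log_normal(prev, bwd_mean(nxt), var).sum(axis=1)
    return log_q - log_p

rng = np.random.default_rng(0)
delta, sigma, alpha, L = 0.05, 1.0, 1.0, 40
var = delta * sigma ** 2              # transition variance delta * sigma^2 from (8)
c = 1.0 - delta * alpha               # forward mean y + delta * a(y) with a(y) = -alpha * y
stat_var = var / (1.0 - c ** 2)       # stationary variance of the forward chain

contract = lambda y: c * y            # mean of the forward step and of its exact reversal
identity = lambda y: y                # backward mean with no drift (a deliberate mismatch)
log_stat = lambda y: log_normal(y, 0.0, stat_var)
sample_stat = lambda n: np.sqrt(stat_var) * rng.standard_normal(n)

paths = simulate_forward(contract, sample_stat, 4000, L, var, rng)
ratio_matched = path_log_ratio(paths, contract, contract, log_stat, log_stat, var)
kl_matched = ratio_matched.mean()
kl_mismatched = path_log_ratio(paths, contract, identity, log_stat, log_stat, var).mean()

print("matched backward chain:   ", kl_matched)
print("mismatched backward chain:", kl_mismatched)
```

The matched estimate is zero up to floating-point error, while the mismatched backward chain yields a strictly positive estimate; by (11), the latter also upper-bounds the loss (5) on the endpoints.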

2.2 Diffusion models – hierarchical VAEs in the infinite depth limit

Here we take inspiration from Section 2.1 and Tzen & Raginsky (2019a); Li et al. (2020); Huang et al. (2021a) to investigate the $L\rightarrow\infty$ limit, using stochastic differential equations (SDEs). To this end, we think of $l=0,\ldots,L$ as discrete instances in a fixed time interval $[0,T]$, equidistant with time step $\delta$, that is, we set $\delta=TL^{-1}$. The discrete paths ${\bm{y}}_{0:L}$ give rise to continuous paths $({\bm{Y}}_{t})_{0\leq t\leq T}\in C([0,T];\mathbb{R}^{d})$ by setting ${\bm{Y}}_{\delta l}={\bm{y}}_{l}$ and linearly interpolating between ${\bm{Y}}_{\delta l}$ and ${\bm{Y}}_{\delta(l+1)}$. To complete the set-up, we think of $a^{\phi}=(a^{\phi}_{0},\ldots,a^{\phi}_{L-1})$ and $b^{\theta}=(b_{1}^{\theta},\ldots,b^{\theta}_{L})$ in (8) as arising from time-dependent vector fields $a,b\in C^{\infty}([0,T]\times\mathbb{R}^{d};\mathbb{R}^{d})$ via $a^{\phi}_{l}({\bm{y}}_{l})=a_{\delta l}({\bm{Y}}_{\delta l})$ and $b^{\theta}_{l}({\bm{y}}_{l})=b_{\delta l}({\bm{Y}}_{\delta l})$.

Taking the limit $\delta\rightarrow 0$, while keeping $T>0$ fixed, transforms the Markov chains in (9) into continuous-time dynamics described by the SDEs (Tzen & Raginsky, 2019a)


\begin{align}
{\mathrm{d}}{\bm{Y}}_{t} &= a_{t}({\bm{Y}}_{t})\,{\mathrm{d}}t+\sigma\,\overrightarrow{\mathrm{d}}{\bm{W}}_{t},\quad{\bm{Y}}_{0}\sim\mu\implies({\bm{Y}}_{t})_{0\leq t\leq T}\sim\mathbb{Q}^{\mu,a}\equiv\overrightarrow{\mathbb{P}}^{\mu,a}, \tag{12a}\\
{\mathrm{d}}{\bm{Y}}_{t} &= b_{t}({\bm{Y}}_{t})\,{\mathrm{d}}t+\sigma\,\overleftarrow{\mathrm{d}}{\bm{W}}_{t},\quad{\bm{Y}}_{T}\sim\nu\implies({\bm{Y}}_{t})_{0\leq t\leq T}\sim\mathbb{P}^{\nu,b}\equiv\overleftarrow{\mathbb{P}}^{\nu,b}, \tag{12b}
\end{align}

where $\overrightarrow{\mathrm{d}}$ and $\overleftarrow{\mathrm{d}}$ denote forward and backward Itô integration (see Appendix A for more details and remarks on the notation), and $({\bm{W}}_{t})_{0\leq t\leq T}$ is a standard Brownian motion. In complete analogy with (9), the SDEs in (12) induce the distributions $\mathbb{Q}^{\mu,a}$ and $\mathbb{P}^{\nu,b}$ on the path space $C([0,T];\mathbb{R}^{d})$. Relating back to the discussion in the introduction, note that we maintain the relations ${\bm{Y}}_{0}={\bm{x}}$ and ${\bm{Y}}_{T}={\bm{z}}$, and the transitions are parameterised by the vector fields $a,b$, in the sense that $p^{\theta}({\bm{x}}|{\bm{z}})={\mathbb{P}}^{\nu,b^{\theta}}_{0}({\bm{x}}|{\bm{Y}}_{T}={\bm{z}})={\mathbb{P}}^{\delta_{{\bm{z}}},b^{\theta}}_{0}({\bm{x}})$ and $q^{\phi}({\bm{z}}|{\bm{x}})={\mathbb{Q}}^{\mu,a^{\phi}}_{T}({\bm{z}}|{\bm{Y}}_{0}={\bm{x}})={\mathbb{Q}}^{\delta_{{\bm{x}}},a^{\phi}}_{T}({\bm{z}})$.
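The passage from the forward chain (9a) to the forward SDE (12a) can be checked in a case where both sides are available in closed form. The sketch below assumes a linear drift $a_{t}(y)=-\alpha y$ in one dimension (a hypothetical choice made for illustration, as are the function names and parameter values) and compares the variance of ${\bm{y}}_{L}$ under the chain with the variance of ${\bm{Y}}_{T}$ under the SDE as $L$ grows with $T$ fixed and $\delta=T/L$.

```python
import numpy as np

def chain_terminal_variance(L, T=1.0, alpha=1.0, sigma=1.0, v0=0.0):
    """Variance of y_L for the chain (9a) with a_l(y) = -alpha * y and Var(y_0) = v0."""
    delta = T / L
    v = v0
    for _ in range(L):
        # y_{l+1} = (1 - delta * alpha) * y_l + sqrt(delta) * sigma * xi_l
        v = (1.0 - delta * alpha) ** 2 * v + delta * sigma ** 2
    return v

def sde_terminal_variance(T=1.0, alpha=1.0, sigma=1.0, v0=0.0):
    """Variance of Y_T for dY_t = -alpha * Y_t dt + sigma dW_t with Var(Y_0) = v0."""
    decay = np.exp(-2.0 * alpha * T)
    return v0 * decay + sigma ** 2 / (2.0 * alpha) * (1.0 - decay)

depths = (10, 100, 1000)
errors = [abs(chain_terminal_variance(L) - sde_terminal_variance()) for L in depths]

for L, err in zip(depths, errors):
    print(f"L = {L:5d}   |Var(y_L) - Var(Y_T)| = {err:.2e}")
```

The discrepancy shrinks as the depth $L$ increases, in line with the infinite-depth limit discussed above.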

The following well-known result (Anderson, 1982; Nelson, 1967) allows us to relate forward and backward path measures via a local (score-matching) condition for the reversal relation in (4).\footnote{The global condition $\overrightarrow{{\mathbb{P}}}^{\mu,a}=\overleftarrow{{\mathbb{P}}}^{\nu,b}$ is captured by the local condition (13) due to the Markovian nature of (12).}

Proposition 2.1 (Nelson’s relation).

For $\mu$ and $a$ of sufficient regularity, denote the time-marginals of the corresponding path measure by $\overrightarrow{{\mathbb{P}}}^{\mu,a}_{t}=:\rho^{\mu,a}_{t}$. Then $\overrightarrow{{\mathbb{P}}}^{\mu,a}=\overleftarrow{{\mathbb{P}}}^{\nu,b}$ if and only if

\begin{equation}
\nu=\overrightarrow{{\mathbb{P}}}^{\mu,a}_{T}\qquad\text{and}\qquad b_{t}=a_{t}-\sigma^{2}\nabla\operatorname{ln}\rho^{\mu,a}_{t},\qquad\text{for all }t\in(0,T]. \tag{13}
\end{equation}
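As a worked instance of (13), consider an Ornstein-Uhlenbeck forward process started in its stationary law. The example, including the rate $\alpha>0$, is chosen here purely for illustration and is not part of the original development.

```latex
\begin{align*}
a_{t}({\bm{y}}) &= -\alpha{\bm{y}}, \qquad \mu=\mathcal{N}\!\left(0,\tfrac{\sigma^{2}}{2\alpha}I\right)
\;\implies\; \rho^{\mu,a}_{t}=\mu \quad\text{for all } t\in[0,T],\\
\nabla\operatorname{ln}\rho^{\mu,a}_{t}({\bm{y}}) &= -\tfrac{2\alpha}{\sigma^{2}}\,{\bm{y}},\\
b_{t}({\bm{y}}) &= a_{t}({\bm{y}})-\sigma^{2}\nabla\operatorname{ln}\rho^{\mu,a}_{t}({\bm{y}})
= -\alpha{\bm{y}}+2\alpha{\bm{y}} = \alpha{\bm{y}}, \qquad \nu=\overrightarrow{{\mathbb{P}}}^{\mu,a}_{T}=\mu.
\end{align*}
```

In this stationary case the score term exactly flips the sign of the drift, and both endpoint distributions coincide with the stationary Gaussian.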

Remark 1.

A similarly clean characterisation of equality between forward and backward path measures is not available for the discrete-time setting as presented in (9). In particular, Gaussianity of the intermediate transitions is not preserved under time-reversal.

A recurring theme in this work and related literature is the interplay between the score-matching condition in (13) and the global condition $D(\overrightarrow{{\mathbb{P}}}^{\mu,a}||\overleftarrow{{\mathbb{P}}}^{\nu,b})=0$, invoking Framework 1. To enable calculations involving the latter, we will rely on the following result:

Proposition 2.2 (forward-backward Radon-Nikodym derivatives).

Let $\overrightarrow{{\mathbb{P}}}^{\Gamma_{0},\gamma^{+}}=\overleftarrow{{\mathbb{P}}}^{\Gamma_{T},\gamma^{-}}$ be a reference path measure (that is, $\Gamma_{0}$, $\Gamma_{T}$ and $\gamma^{\pm}$ define diffusions as in (12) and are related as in Proposition 2.1), absolutely continuous with respect to both $\overrightarrow{{\mathbb{P}}}^{\mu,a}$ and $\overleftarrow{{\mathbb{P}}}^{\nu,b}$. Then, $\overrightarrow{{\mathbb{P}}}^{\mu,a}$-almost surely, the corresponding Radon-Nikodym derivative (RND) can be expressed as follows,


$\displaystyle\operatorname{ln}\left(\frac{\mathrm{d}\overrightarrow{\mathbb{P}}^{\mu,a}}{\mathrm{d}\overleftarrow{\mathbb{P}}^{\nu,b}}\right)({\bm{Y}})$	$\displaystyle=\operatorname{ln}\left(\frac{\mathrm{d}\mu}{\mathrm{d}\Gamma_{0}}\right)({\bm{Y}}_{0})-\operatorname{ln}\left(\frac{\mathrm{d}\nu}{\mathrm{d}\Gamma_{T}}\right)({\bm{Y}}_{T})$	(14a)
	$\displaystyle+\tfrac{1}{\sigma^{2}}\!\int_{0}^{T}\!\!\left(a_{t}-\gamma^{+}_{t}\right)({\bm{Y}}_{t})\!\cdot\!\left(\overrightarrow{\mathrm{d}}{\bm{Y}}_{t}-\tfrac{1}{2}\left(a_{t}+\gamma^{+}_{t}\right)({\bm{Y}}_{t})\,\mathrm{d}t\right)$	(14b)
	$\displaystyle-\tfrac{1}{\sigma^{2}}\!\int_{0}^{T}\!\!\left(b_{t}-\gamma^{-}_{t}\right)({\bm{Y}}_{t})\!\cdot\!\left(\overleftarrow{\mathrm{d}}{\bm{Y}}_{t}-\tfrac{1}{2}\left(b_{t}+\gamma^{-}_{t}\right)({\bm{Y}}_{t})\,\mathrm{d}t\right).$	(14c)

Proof.

The proof relies on Girsanov’s theorem (Üstünel & Zakai, 2013), using the reference to relate the forward and backward processes. For details, see Appendix E. ∎

Remark 2 (Role of the reference process).

According to Proposition 2.2, the Radon-Nikodym derivative between $\overrightarrow{{\mathbb{P}}}^{\mu,a}$ and $\overleftarrow{{\mathbb{P}}}^{\nu,b}$ can be decomposed into boundary terms (14a), as well as forward and backward path integrals (14b) and (14c). Since the left-hand side of (14a) does not depend on the reference $\Gamma_{0,T}$, $\gamma^{\pm}$, the expressions in (14) are in principle equivalent for all choices of reference. The freedom in $\Gamma_{0,T}$ and $\gamma^{\pm}$ allows us to ‘reweight’ between (14a), (14b) and (14c), or even cancel terms. A canonical choice is the Lebesgue measure for $\Gamma_{0}$ and $\Gamma_{T}$, and $\gamma^{\pm}=0$, see Appendix C.1.
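As a worked instance of this remark, the canonical choice can be substituted directly into (14). The display below is only that substitution; it assumes (by a slight abuse of notation introduced here) that $\mu$ and $\nu$ also denote the Lebesgue densities of the two marginals, so that $\mathrm{d}\mu/\mathrm{d}\Gamma_{0}=\mu$ and $\mathrm{d}\nu/\mathrm{d}\Gamma_{T}=\nu$, and that all reference drift terms vanish:

```latex
\ln\left(\frac{\mathrm{d}\overrightarrow{\mathbb{P}}^{\mu,a}}{\mathrm{d}\overleftarrow{\mathbb{P}}^{\nu,b}}\right)({\bm{Y}})
 = \ln\mu({\bm{Y}}_{0}) - \ln\nu({\bm{Y}}_{T})
 + \tfrac{1}{\sigma^{2}}\int_{0}^{T} a_{t}({\bm{Y}}_{t})\cdot\left(\overrightarrow{\mathrm{d}}{\bm{Y}}_{t}-\tfrac{1}{2}a_{t}({\bm{Y}}_{t})\,\mathrm{d}t\right)
 - \tfrac{1}{\sigma^{2}}\int_{0}^{T} b_{t}({\bm{Y}}_{t})\cdot\left(\overleftarrow{\mathrm{d}}{\bm{Y}}_{t}-\tfrac{1}{2}b_{t}({\bm{Y}}_{t})\,\mathrm{d}t\right).
```

With this choice, the boundary terms carry the two marginal log-densities and the path integrals involve only the drifts $a$ and $b$.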

Remark 3 (Discretisation and conversion formulae).

The distinction between forward and backward integration in (14) is related to the time points at which the integrands $\left(a_{t}-\gamma^{+}_{t}\right)({\bm{Y}}_{t})$ and $\left(b_{t}-\gamma^{-}_{t}\right)({\bm{Y}}_{t})$ would be evaluated in discrete-time approximations, e.g.,

\displaystyle\int_{0}^{T}\!\!a_{t}({\bm{Y}}_{t})\!\cdot\!\overrightarrow{\mathrm{d}}{\bm{Y}}_{t}\approx\sum_{i}a_{t_{i}}({\bm{Y}}_{t_{i}})\!\cdot\!({\bm{Y}}_{t_{i+1}}-{\bm{Y}}_{t_{i}}),\;\;\int_{0}^{T}\!\!a_{t}({\bm{Y}}_{t})\!\cdot\!\overleftarrow{\mathrm{d}}{\bm{Y}}_{t}\!\approx\!\sum_{i}a_{t_{i+1}}({\bm{Y}}_{t_{i+1}})\!\cdot\!({\bm{Y}}_{t_{i+1}}-{\bm{Y}}_{t_{i}}).

Alternatively, forward and backward integrals can be transformed into each other using the conversion

\int_{0}^{T}\!\!a_{t}({\bm{Y}}_{t})\cdot\overrightarrow{\mathrm{d}}{\bm{Y}}_{t}=\int_{0}^{T}\!\!a_{t}({\bm{Y}}_{t})\cdot\overleftarrow{\mathrm{d}}{\bm{Y}}_{t}-\sigma^{2}\!\!\int_{0}^{T}\!\!(\nabla\cdot a_{t})({\bm{Y}}_{t})\,\mathrm{d}t.

(15)

We refer to Kunita (2019) and Appendix A for further details. In passing, we note that (15) allows us to eliminate the Hutchinson estimator (Hutchinson, 1989) from a variety of common score-matching objectives, potentially reducing the variance of gradient estimators, see Appendix C.1.
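The discretisations and the conversion formula (15) can be checked numerically on a toy case. The following is a minimal sketch under assumptions that are not part of the text: a one-dimensional path ${\bm{Y}}_{t}=\sigma{\bm{W}}_{t}$ (diffusion coefficient $\sigma$), the integrand $a_{t}(y)=y$ so that $\nabla\cdot a_{t}=1$, and a hypothetical helper name `forward_backward_sums`. In this case (15) predicts that the backward sum exceeds the forward sum by approximately $\sigma^{2}T$.

```python
import math
import random


def forward_backward_sums(sigma=0.5, T=1.0, n_steps=20000, seed=0):
    """Left-point (forward) and right-point (backward) sums for a(y) = y
    along one simulated path of Y_t = sigma * W_t (toy assumption)."""
    rng = random.Random(seed)
    dt = T / n_steps
    y = 0.0
    forward = 0.0
    backward = 0.0
    for _ in range(n_steps):
        y_next = y + sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        forward += y * (y_next - y)        # integrand evaluated at t_i
        backward += y_next * (y_next - y)  # integrand evaluated at t_{i+1}
        y = y_next
    return forward, backward


sigma, T = 0.5, 1.0
forward, backward = forward_backward_sums(sigma=sigma, T=T)

# For a(y) = y the divergence is 1, so (15) predicts
# backward - forward ≈ sigma^2 * T.
gap = backward - forward
print("backward - forward:", gap, "  sigma^2 * T:", sigma**2 * T)
```

For this particular integrand the gap equals the realised quadratic variation of the path, which is why it concentrates around $\sigma^{2}T$ as the grid is refined.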

Framework 1 can be translated into the setting of (12), noting that (11) continues to hold with appropriate modifications:

Framework 1^′.

For a divergence $D$ on path space, minimise $D(\overrightarrow{{\mathbb{P}}}^{\mu,a}|\overleftarrow{{\mathbb{P}}}^{\nu,b})$. If $D(\overrightarrow{{\mathbb{P}}}^{\mu,a}|\overleftarrow{{\mathbb{P}}}^{\nu,b})=0$, then (12a) transports $\mu$ to $\nu$, and (12b) transports $\nu$ to $\mu$. (Concurrently, Richter & Berner (2024) propose a framework akin to ours.)

At optimality, $D(\overrightarrow{{\mathbb{P}}}^{\mu,a}|\overleftarrow{{\mathbb{P}}}^{\nu,b})=0$, Proposition 2.1 allows us to obtain the scores associated to the learned diffusion via $\sigma^{2}\nabla\operatorname{ln}\rho_{t}^{\mu,a}=a_{t}-b_{t}$. In this way, Framework 1^′ is closely connected to (and in some ways extends) score-matching ideas (Song & Ermon, 2019; Song et al., 2021). Indeed, recent approaches towards generative modeling and sampling can be recovered from Framework 1^′ by making specific choices for the divergence $D$, the parameterisations for $a$ and $b$, as well as for the reference diffusion $\overrightarrow{{\mathbb{P}}}^{\Gamma_{0},\gamma^{+}}=\overleftarrow{{\mathbb{P}}}^{\Gamma_{T},\gamma^{-}}$ in Proposition 2.2:

Score-based generative modeling: Letting $\mu$ be the target and fixing the forward drift $a_{t}$, and, motivated by Proposition 2.1, parameterising the backward drift as $b_{t}=a_{t}-s_{t}$, we recover the SGM objectives in Hyvärinen & Dayan (2005); Song & Ermon (2019); Song et al. (2021) from $D=D_{\mathrm{KL}}$; when $\overrightarrow{{\mathbb{P}}}^{\mu,a}=\overleftarrow{{\mathbb{P}}}^{\nu,b}$, the variable drift component $s_{t}$ will represent the score $\sigma^{2}\nabla\operatorname{ln}\rho^{\mu,a}_{t}$. Modifications can be obtained from the conversion formula (15), see Appendix C.2.

Score-based sampling – ergodic drift: In this setting, $\nu$ becomes the target and we fix $b_{t}$ to be the drift of an ergodic (backward) process. Then choosing $\Gamma_{0,T}=\mu$, $\gamma^{\pm}=b$ allows us to recover the approaches in Vargas et al. (2023a); Berner et al. (2022). Possible generalisations based on Framework 1^′ include IWAE-type objectives, see Appendix C.3.

Score-based sampling – Föllmer drift: Finally, choosing $b_{t}(x)=x/t$, we recover Föllmer sampling (Appendix C.3; Follmer, 1984; Vargas et al., 2023b; Zhang & Chen, 2022; Huang et al., 2021b).

3 Learning forward and backward transitions simultaneously

Recall from the introduction that complete flexibility in $a$ and $b$ will render the minima of $D(\overrightarrow{{\mathbb{P}}}^{\mu,a}|\overleftarrow{{\mathbb{P}}}^{\nu,b})$ highly nonunique. Furthermore, the approaches surveyed at the end of the previous section circumvent this problem by fixing either $\overrightarrow{{\mathbb{P}}}^{\mu,a}$ or $\overleftarrow{{\mathbb{P}}}^{\nu,b}$. However, to leverage the full power of diffusion models, both $\overrightarrow{{\mathbb{P}}}^{\mu,a}$ and $\overleftarrow{{\mathbb{P}}}^{\nu,b}$ should be adapted to the problem at hand. In this section, we explore models of this kind by imposing additional constraints on $a$ and $b$. We end this section by presenting our new CMCD sampler, connecting it to prior methodology within VI (Doucet et al., 2022b; Geffner & Domke, 2023; Papamakarios et al., 2017) and OT, where we can view CMCD as an instance of entropy regularised OT in the infinite constraint limit (Bernton et al., 2019).

3.1 Connection to Entropic optimal transport

One way of selecting a particular transition between $\mu$ and $\nu$ is by imposing an entropic penalty, encouraging the dynamics to stay close to a prescribed, oftentimes physically or biologically motivated, reference process. Using the notation employed in Framework 1, the static Schrödinger problem (Schrödinger, 1931; Léonard, 2014a) is given by

\displaystyle\pi^{*}({\bm{x}},{\bm{z}})\in\operatorname*{arg\,min}_{\pi({\bm{x}},{\bm{z}})}\Big{\{}D_{\mathrm{KL}}(\pi({\bm{x}},{\bm{z}})||r({\bm{x}},{\bm{z}})):\pi_{\bm{x}}({\bm{x}})=\mu({\bm{x}}),\pi_{\bm{z}}({\bm{z}})=\nu({\bm{z}})\Big{\}},

(16)

where $r({\bm{x}},{\bm{z}})$ is the Schrödinger prior encoding additional domain-specific information. In an analogous way, we can introduce a regulariser to the path-space approach of Framework 1^′ to obtain the dynamic Schrödinger problem

{\mathbb{P}}^{*}\!\!\in\!\operatorname*{arg\,min}_{\overrightarrow{{\mathbb{P}}}^{\mu,a}_{T}\;=\;\nu}\mathbb{E}_{{\bm{Y}}\sim\overrightarrow{{\mathbb{P}}}^{\mu,a}}\!\!\left[\tfrac{1}{2\sigma^{2}}\!\int_{0}^{T}\!\!\|a_{t}-f_{t}\|^{2}({\bm{Y}}_{t})\,\mathrm{d}t\right]\!,

(17)

that is, the driving vector field $a_{t}$ determining ${\mathbb{P}}^{*}$ should be chosen in such a way that (i) the corresponding diffusion transitions from $\mu$ to $\nu$, and (ii) among such diffusions, the vector field $a_{t}$ remains close to the prescribed vector field $f_{t}$, in the mean-square sense. Under mild conditions, the solutions to (16) and (17) exist and are unique. Further, the static and dynamic viewpoints are related through a mixture-of-bridges construction (assuming that the conditionals $r({\bm{z}}|{\bm{x}})$ correspond to the transitions induced by $f_{t}$), see (Léonard, 2014a, Section 2).

Iterative proportional fitting (IPF) and the EM algorithm. It is well known that approximate solutions for $\pi^{*}({\bm{x}},{\bm{z}})$ and ${\mathbb{P}}^{*}$ can be obtained using alternating $D_{\mathrm{KL}}$-projections, keeping one of the marginals fixed in each iteration: Under mild conditions, the sequence defined by


$\displaystyle\pi^{2n+1}({\bm{x}},{\bm{z}})$	$\displaystyle=\operatorname*{arg\,min}_{\pi({\bm{x}},{\bm{z}})}\left\{D_{\mathrm{KL}}(\pi({\bm{x}},{\bm{z}})||\pi^{2n}({\bm{x}},{\bm{z}})):\,\,\pi_{\bm{x}}({\bm{x}})=\mu({\bm{x}})\right\},$	(18a)
$\displaystyle\pi^{2n+2}({\bm{x}},{\bm{z}})$	$\displaystyle=\operatorname*{arg\,min}_{\pi({\bm{x}},{\bm{z}})}\left\{D_{\mathrm{KL}}(\pi({\bm{x}},{\bm{z}})||\pi^{2n+1}({\bm{x}},{\bm{z}})):\,\,\pi_{\bm{z}}({\bm{z}})=\nu({\bm{z}})\right\},\qquad n\geq 0,$	(18b)

with initialisation $\pi^{0}({\bm{x}},{\bm{z}})=r({\bm{x}},{\bm{z}})$, converges to $\pi^{*}({\bm{x}},{\bm{z}})$ as $n\rightarrow\infty$ (De Bortoli et al., 2021), and this procedure is commonly referred to as iterative proportional fitting (IPF) (Fortet, 1940; Kullback, 1968; Ruschendorf, 1995) or Sinkhorn updates (Cuturi, 2013). IPF can straightforwardly be modified to the path space setting of (17), and the resulting updates coincide with the Föllmer drift updates discussed in Appendix C.3, see (Vargas et al., 2021a) and Appendix E.4.
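For finite state spaces, the projections in (18) have closed forms: (18a) rescales each row of the current joint table so that its ${\bm{x}}$-marginal equals $\mu$, and (18b) rescales each column so that its ${\bm{z}}$-marginal equals $\nu$. The following is a minimal sketch of this discrete case only; the table `r`, the marginals `mu` and `nu`, and the function name `ipf` are illustrative assumptions and not taken from the paper.

```python
def ipf(r, mu, nu, n_iters=200):
    """Alternating KL projections (18a)/(18b) for a discrete joint r[x][z]."""
    nx, nz = len(mu), len(nu)
    pi = [row[:] for row in r]  # pi^0 = r
    for _ in range(n_iters):
        # (18a): enforce the x-marginal mu by rescaling rows
        for i in range(nx):
            row_sum = sum(pi[i])
            pi[i] = [v * mu[i] / row_sum for v in pi[i]]
        # (18b): enforce the z-marginal nu by rescaling columns
        col_sums = [sum(pi[i][j] for i in range(nx)) for j in range(nz)]
        pi = [[pi[i][j] * nu[j] / col_sums[j] for j in range(nz)]
              for i in range(nx)]
    return pi


# Hypothetical Schrödinger prior on a 2 x 3 grid (entries sum to 1).
r = [[0.30, 0.10, 0.05],
     [0.05, 0.20, 0.30]]
mu = [0.6, 0.4]        # prescribed x-marginal
nu = [0.2, 0.3, 0.5]   # prescribed z-marginal

pi_star = ipf(r, mu, nu)
row_marginal = [sum(row) for row in pi_star]
col_marginal = [sum(pi_star[i][j] for i in range(2)) for j in range(3)]
print(row_marginal, col_marginal)
```

After enough sweeps both marginal constraints hold up to floating-point error, which is the discrete counterpart of the convergence statement above.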

To further demonstrate the coverage of our framework, we establish a connection between IPF and expectation-maximisation (EM) (Dempster et al., 1977), originally devised for finding maximum likelihood estimates in models with latent (or hidden) variables. According to Neal & Hinton (1998), the EM-algorithm can be described in the setting from the introduction, and written in the form

\displaystyle\theta_{n+1}=\operatorname*{arg\,min}_{\theta}\mathcal{L}_{D_{\mathrm{KL}}}(\phi_{n},\theta),\qquad\phi_{n+1}=\operatorname*{arg\,min}_{\phi}\mathcal{L}_{D_{\mathrm{KL}}}(\phi,\theta_{n+1}),

(19)

with $\mathcal{L}_{D_{\mathrm{KL}}}$ defined as in (5). If the initialisations are matched appropriately, the following result establishes an exact correspondence between the IPF updates in (18) and the EM updates in (19):

Proposition 3.1 (EM $\iff$ IPF).

Assume that the transition densities $p^{\theta}({\bm{x}}|{\bm{z}})$ and $q^{\phi}({\bm{z}}|{\bm{x}})$ are parameterised with perfect flexibility (in precise terms, we assume that for any transition densities $p({\bm{x}}|{\bm{z}})$ and $q({\bm{z}}|{\bm{x}})$, there exist $\theta_{*}$ and $\phi_{*}$ such that $p({\bm{x}}|{\bm{z}})=p^{\theta_{*}}({\bm{x}}|{\bm{z}})$ and $q({\bm{z}}|{\bm{x}})=q^{\phi_{*}}({\bm{z}}|{\bm{x}})$), and furthermore that the EM-scheme (19) is initialised at $\phi_{0}$ in such a way that $q^{\phi_{0}}({\bm{z}}|{\bm{x}})=r({\bm{z}}|{\bm{x}})$. Then the IPF iterations in (18) agree with the EM iterations in (19) for all $n\geq 1$, in the sense that

\displaystyle\pi^{n}({\bm{x}},{\bm{z}})=q^{\phi_{(n-1)/2}}({\bm{z}}|{\bm{x}})\mu({\bm{x}}),\quad\text{for}\,\,n\,\text{odd},\quad\pi^{n}({\bm{x}},{\bm{z}})=p^{\theta_{n/2}}({\bm{x}}|{\bm{z}})\nu({\bm{z}}),\quad\text{for}\,\,n\,\text{even}.

(20)

From the proof (Appendix E), it is clear that flexibility of parameterisations is crucial, and thus $\text{EM}\iff\text{IPF}$ fails for classical VAEs, but holds up to a negligible error for the SDE-parameterisations from Section 2.2, see also Liu et al. (b). Under this assumption, the key observation is that replacing forward-$D_{\mathrm{KL}}$ by reverse-$D_{\mathrm{KL}}$ in one or both of (18a) and (18b) does not – in theory – change the sequence of minimisers.
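The correspondence (20) can be illustrated in the discrete, fully flexible case, where every conditional probability table is admissible. The sketch below assumes that $\mathcal{L}_{D_{\mathrm{KL}}}(\phi,\theta)$ in (5) is $D_{\mathrm{KL}}\big(q^{\phi}({\bm{z}}|{\bm{x}})\mu({\bm{x}})\,||\,p^{\theta}({\bm{x}}|{\bm{z}})\nu({\bm{z}})\big)$, as in Framework 1; under that assumption the $\theta$-update is the conditional of $q\mu$ given ${\bm{z}}$ and the $\phi$-update is the conditional of $p\nu$ given ${\bm{x}}$. All tables and helper names are hypothetical.

```python
def normalise_rows(t):
    """Conditional over z for each x: t[x][z] / sum_z t[x][z]."""
    return [[v / sum(row) for v in row] for row in t]


def normalise_cols(t):
    """Conditional over x for each z: t[x][z] / sum_x t[x][z]."""
    nx, nz = len(t), len(t[0])
    col = [sum(t[i][j] for i in range(nx)) for j in range(nz)]
    return [[t[i][j] / col[j] for j in range(nz)] for i in range(nx)]


def times_mu(q, mu):
    """Joint table q(z|x) mu(x)."""
    return [[v * mu[i] for v in row] for i, row in enumerate(q)]


def times_nu(p, nu):
    """Joint table p(x|z) nu(z)."""
    return [[v * nu[j] for j, v in enumerate(row)] for row in p]


def max_abs_diff(a, b):
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


r = [[0.30, 0.10, 0.05],
     [0.05, 0.20, 0.30]]   # hypothetical Schrödinger prior
mu = [0.6, 0.4]
nu = [0.2, 0.3, 0.5]

q = normalise_rows(r)        # EM initialisation: q^{phi_0}(z|x) = r(z|x)
pi = [row[:] for row in r]   # IPF initialisation: pi^0 = r
worst = 0.0
for n in range(20):
    pi = times_mu(normalise_rows(pi), mu)    # IPF (18a): pi^{2n+1}
    worst = max(worst, max_abs_diff(pi, times_mu(q, mu)))
    p = normalise_cols(times_mu(q, mu))      # EM theta-update (closed form)
    pi = times_nu(normalise_cols(pi), nu)    # IPF (18b): pi^{2n+2}
    worst = max(worst, max_abs_diff(pi, times_nu(p, nu)))
    q = normalise_rows(times_nu(p, nu))      # EM phi-update (closed form)

print("largest discrepancy between EM and IPF iterates:", worst)
```

The agreement here is by construction, since both schemes reduce to the same conditioning operations once the parameterisation is unrestricted; this is the mechanism that a restricted family such as a classical VAE does not provide.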

In practice, favoring the EM objectives over IPF can offer an advantage: optimizing with respect to forward-$D_{\mathrm{KL}}$ and backward-$D_{\mathrm{KL}}$ encourages moment-matching and mode-seeking behavior, respectively, so an alternating scheme as defined in (19) might present a suitable compromise over optimizing a single direction of $D_{\mathrm{KL}}$. Empirical exploration is left for future work.

Whilst EM and IPF might seem appealing for learning a sampler, they both require sequentially solving a series of minimization problems, which we can only solve approximately; this is not only slow but also causes a sequential accumulation of errors arising from each iterate (Vargas et al., 2021a; Fernandes et al., 2021). In order to address both issues, we will present a novel approach (CMCD) that, similarly to IPF, learns both the forward and backward processes whilst preserving the desired uniqueness property. However, in contrast to IPF, it does so in an end-to-end fashion and performs updates simultaneously. As an alternative, in Appendix E.5 we also discuss a regularised IPF objective and leave further empirical exploration for future work.

3.2 Score-based annealing: the Controlled Monte Carlo Diffusion sampler

In this section, we fix a prescribed curve of distributions $(\pi_{t})_{t\in[0,T]}$, whose scores $\nabla\operatorname{ln}\pi_{t}$ (and unnormalised densities $\hat{\pi}_{t}$) are assumed to be available in tractable form; this is the scenario typically encountered in annealed importance sampling (IS) and related approaches towards computing posterior expectations (Neal, 2001; Reich, 2011; Heng et al., 2021; 2020; Arbel et al., 2021; Doucet et al., 2022a). The Controlled Monte Carlo Diffusion sampler (CMCD) learns the vector field $\nabla\phi_{t}$ in

\displaystyle\mathrm{d}{\bm{Y}}_{t}\!=\!\left(\sigma^{2}\nabla\operatorname{ln}\pi_{t}({\bm{Y}}_{t})\!+\!\nabla\phi_{t}({\bm{Y}}_{t})\right)\mathrm{d}t+\sigma\!\sqrt{2}\,\overrightarrow{\mathrm{d}}{\bm{W}}_{t},\qquad{\bm{Y}}_{0}\!\sim\!\pi_{0},

(21)

so that (21) produces the interpolation from the prior $\pi_{0}$ to the posterior $\pi_{T}$, i.e., $\overrightarrow{{\mathbb{P}}}^{\pi_{0},\sigma^{2}\nabla\operatorname{ln}\pi+\nabla\phi}_{t}=\pi_{t}$, for all $t\in[0,T]$. Note that if $\pi_{t}$ were constant in time ($\pi_{t}=\pi_{0}$), then $\phi=0$ would reduce (21) to equilibrium overdamped Langevin dynamics, preserving $\pi_{0}$. With $\pi_{t}$ varying in time, $\nabla\phi_{t}$ can be thought of as a control enabling transitions between neighbouring densities $\pi_{t}$ and $\pi_{t+\delta t}$.
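The role of the control can be seen in a toy case where it is known in closed form. The sketch below is an illustration under assumptions made here, not the learned sampler: the curve is $\pi_{t}=\mathcal{N}(m\,t/T,1)$ in one dimension, for which the constant field $\nabla\phi_{t}=m/T$ makes the marginals of (21) follow $\pi_{t}$ (it solves the corresponding Fokker-Planck equation), and (21) is discretised with a plain Euler-Maruyama scheme. The function name `simulate` and all parameter values are hypothetical.

```python
import math
import random


def simulate(control, m=3.0, sigma=1.0, T=1.0, n_steps=100,
             n_paths=4000, seed=0):
    """Euler-Maruyama for (21) with pi_t = N(m * t / T, 1) (toy assumption).

    control=True uses the closed-form field grad(phi_t) = m / T,
    control=False uses grad(phi_t) = 0 (annealed Langevin without control).
    """
    rng = random.Random(seed)
    dt = T / n_steps
    grad_phi = m / T if control else 0.0
    finals = []
    for _ in range(n_paths):
        y = rng.gauss(0.0, 1.0)                 # Y_0 ~ pi_0 = N(0, 1)
        for k in range(n_steps):
            mean_t = m * (k * dt) / T
            score = -(y - mean_t)               # grad ln pi_t(y)
            noise = sigma * math.sqrt(2.0 * dt) * rng.gauss(0.0, 1.0)
            y += (sigma**2 * score + grad_phi) * dt + noise
        finals.append(y)
    return finals


def mean_and_var(xs):
    mu = sum(xs) / len(xs)
    return mu, sum((x - mu) ** 2 for x in xs) / len(xs)


mean_ctrl, var_ctrl = mean_and_var(simulate(control=True))
mean_free, var_free = mean_and_var(simulate(control=False))
print("with control:   ", mean_ctrl, var_ctrl)   # close to pi_T = N(3, 1)
print("without control:", mean_free, var_free)   # mean lags behind m = 3
```

Without the control, the samples lag behind the moving target over a finite horizon $T$; with it, the terminal samples match $\pi_{T}$ up to discretisation and Monte Carlo error.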

To obtain $\nabla\phi_{t}$ we invoke Framework 1^′, but restrict $\overleftarrow{{\mathbb{P}}}^{\pi_{T},b}$ to retain uniqueness. Proposition 2.1 motivates the choice $b_{t}=(\sigma^{2}\nabla\operatorname{ln}\pi_{t}+\nabla\phi_{t})-2\sigma^{2}\nabla\operatorname{ln}\pi_{t}=-\sigma^{2}\nabla\operatorname{ln}\pi_{t}+\nabla\phi_{t}$ (note the additional factor of $2$ in Nelson’s relation due to the noise scaling $\sigma\sqrt{2}\,\overrightarrow{\mathrm{d}}{\bm{W}}_{t}$ in (21)), leading to

\displaystyle\!\!\!\!\mathcal{L}^{\mathrm{CMCD}}_{D}(\phi):=D\left({\overrightarrow{{\mathbb{P}}}^{\pi_{0},\sigma^{2}\nabla\operatorname{ln}\pi+\nabla\phi}},{\overleftarrow{{\mathbb{P}}}^{\pi_{T},-\sigma^{2}\nabla\operatorname{ln}\pi+\nabla\phi}}\right),

(22)

which is valid for any choice of divergence $D$. The additional score constraint $b_{t}=a_{t}-2\sigma^{2}\nabla\operatorname{ln}\pi_{t}$ restores uniqueness in Framework 1^′ (see Appendix D for a proof):

Algorithm 1 Controlled Monte Carlo Diffusions - Sampling and normalizing constant estimation

Require: $\pi_{0}$, $\pi_{T}$, $\pi_{t}$, $\sigma$, $K$ step-sizes $\Delta t_{k}$, network $f^{\phi}$ trained via minimising Eq 24

${\bm{Y}}_{0}\sim\pi_{0}$

$\ln{\bm{W}}=-\operatorname{ln}\pi_{0}({\bm{Y}}_{0})$

for $k=0$ to $K-1$ do

    ${\bm{Y}}_{t_{k+1}}\sim{\mathcal{N}}\Big{(}{\bm{Y}}_{t_{k+1}}\big{|}{\bm{Y}}_{t_{k}}+(\sigma^{2}\nabla\ln\pi_{t_{k}}+f^{\phi}_{t_{k}})({\bm{Y}}_{t_{k}})\Delta t_{k},2\sigma^{2}\Delta t_{k}\Big{)}$

    $\ln{\bm{W}}=\ln{\bm{W}}+\ln\frac{{\mathcal{N}}\Big{(}{\bm{Y}}_{t_{k}}\big{|}{\bm{Y}}_{t_{k+1}}+(\sigma^{2}\nabla\ln\pi_{t_{k+1}}-f^{\phi}_{t_{k+1}})({\bm{Y}}_{t_{k+1}})\Delta t_{k},2\sigma^{2}\Delta t_{k}\Big{)}}{{\mathcal{N}}\Big{(}{\bm{Y}}_{t_{k+1}}\big{|}{\bm{Y}}_{t_{k}}+(\sigma^{2}\nabla\ln\pi_{t_{k}}+f^{\phi}_{t_{k}})({\bm{Y}}_{t_{k}})\Delta t_{k},2\sigma^{2}\Delta t_{k}\Big{)}}$
