
Relation between the Kantorovich-Wasserstein metric and the Kullback-Leibler divergence (1908.09211v1)

Published 24 Aug 2019 in cs.IT, math.IT, and math.OC

Abstract: We discuss a relation between the Kantorovich-Wasserstein (KW) metric and the Kullback-Leibler (KL) divergence. The former is defined using the optimal transport problem (OTP) in the Kantorovich formulation. The latter is used to define entropy and mutual information, which appear in variational problems to find optimal channel (OCP) from the rate distortion and the value of information theories. We show that OTP is equivalent to OCP with one additional constraint fixing the output measure, and therefore OCP with constraints on the KL-divergence gives a lower bound on the KW-metric. The dual formulation of OTP allows us to explore the relation between the KL-divergence and the KW-metric using decomposition of the former based on the law of cosines. This way we show the link between two divergences using the variational and geometric principles.

Authors (1)
  1. Roman V. Belavkin (8 papers)
Citations (8)

Summary

  • The paper explores the relationship between the Kantorovich-Wasserstein metric and the Kullback-Leibler divergence using connections between variational problems and geometric decomposition.
  • The paper establishes a variational connection, showing how the value of the Optimal Channel Problem provides a lower bound on the information-constrained Kantorovich-Wasserstein metric.
  • A geometric connection via the dual Optimal Transport Problem shows that the KW metric appears as a term in a decomposition of the KL divergence, under specific conditions relating the dual potentials to divergence gradients.

The relationship between the Kantorovich-Wasserstein (KW) metric and the Kullback-Leibler (KL) divergence is explored by connecting the variational problems that define them and by utilizing a geometric decomposition of the KL divergence related to the dual formulation of the Optimal Transport Problem (OTP).

Variational Connection via Optimal Transport and Optimal Channel Problems

The KW metric, denoted $K_c[p,q]$, arises from the Kantorovich formulation of the Optimal Transport Problem (OTP). It represents the minimum cost to transport mass from a distribution $q$ on a space $X$ to a distribution $p$ on a space $Y$, where $c(x,y)$ is the cost function for moving mass from $x$ to $y$. Mathematically, $K_c[p,q] = \inf_{w \in \Gamma[q,p]} \int c(x,y)\, dw(x,y)$, where $\Gamma[q,p]$ is the set of all joint probability measures $w$ on $X \times Y$ with marginals $\pi_X w = q$ and $\pi_Y w = p$.
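
For intuition, here is a minimal sketch of this formulation on finite spaces, where the OTP reduces to a linear program over the joint measure $w$; the three-point marginals, the cost $c(x,y) = |x-y|$, and the use of `scipy.optimize.linprog` are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch: the Kantorovich OTP on finite spaces as a linear program.
# Marginals, cost, and solver choice are illustrative assumptions.
import numpy as np
from scipy.optimize import linprog

q = np.array([0.5, 0.3, 0.2])   # marginal of the joint measure on X
p = np.array([0.4, 0.4, 0.2])   # marginal of the joint measure on Y
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)  # c(x, y) = |x - y|
n, m = C.shape

# Equality constraints: row sums of w give q, column sums give p.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0           # sum_y w(x_i, y) = q(x_i)
for j in range(m):
    A_eq[n + j, j::m] = 1.0                    # sum_x w(x, y_j) = p(y_j)
b_eq = np.concatenate([q, p])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
print("K_c[p, q] ≈", res.fun)                  # minimal transport cost
```

Here `res.x.reshape(n, m)` is the minimizing joint measure, i.e., an element of $\Gamma[q,p]$ attaining the infimum.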

The KL divergence, $D[p,q] = \int p(x) \ln \frac{dp(x)}{dq(x)}\, dx$, is a fundamental measure in information theory. It quantifies the dissimilarity between two probability measures $p$ and $q$. Related concepts include the entropy $H[p/r] = \ln r(X) - D[p, r/r(X)]$ (relative to a reference measure $r$) and the mutual information $I(X,Y) = D[w, q \otimes p]$, where $w$ is a joint measure with marginals $q$ and $p$.
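
As a small numerical illustration (made-up values, not from the paper), the KL divergence and the mutual information $I(X,Y) = D[w, q \otimes p]$ of a discrete joint measure can be computed directly:

```python
# Illustrative computation of D[p, q] and I(X, Y) for discrete measures.
import numpy as np

def kl(a, b):
    """D[a, b] = sum a ln(a / b), with the convention 0 ln 0 = 0."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

w = np.array([[0.30, 0.10],
              [0.05, 0.55]])      # joint measure w on X x Y (made-up numbers)
q = w.sum(axis=1)                 # marginal pi_X w
p = w.sum(axis=0)                 # marginal pi_Y w
print("I(X, Y) =", kl(w.ravel(), np.outer(q, p).ravel()))
```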

A connection is established by comparing the OTP to the Optimal Channel Problem (OCP), often encountered in rate distortion theory and value of information contexts. The OCP seeks to find an optimal conditional probability (channel) $dw(y|x)$ that minimizes the expected cost $E_w\{c\}$, given a fixed input marginal $\pi_X w = q$ and an upper bound $\lambda$ on the mutual information $I(X,Y)$. The value function for the OCP is $R_c[q](\lambda) = \inf \{ E_w\{c\} : I(X,Y) \leq \lambda,\ \pi_X w = q \}$.
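
The sketch below is an assumption, not an algorithm from the paper: it solves the Lagrangian relaxation of the OCP with a standard Blahut-Arimoto style iteration, tracing the trade-off between expected cost $E_w\{c\}$ and mutual information by sweeping a multiplier $\beta$; each $\beta$ yields one point on the value curve $R_c[q](\lambda)$.

```python
# Hedged sketch: Lagrangian relaxation of the OCP, minimizing E_w{c} + (1/beta) I(X, Y)
# over channels w(y|x) with fixed input marginal q (Blahut-Arimoto style iteration).
import numpy as np

def ocp_point(q, C, beta, iters=500):
    """Return (E_w{c}, I(X, Y)) for one value of the multiplier beta."""
    n, m = C.shape
    r = np.full(m, 1.0 / m)                    # initial guess for the output marginal
    for _ in range(iters):
        W = r[None, :] * np.exp(-beta * C)     # w(y | x) proportional to r(y) exp(-beta c(x, y))
        W /= W.sum(axis=1, keepdims=True)
        r = q @ W                              # updated output marginal pi_Y w
    joint = q[:, None] * W
    cost = float((joint * C).sum())
    info = float((joint * np.log(joint / (q[:, None] * r[None, :]))).sum())
    return cost, info

q = np.array([0.5, 0.3, 0.2])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
for beta in (0.5, 2.0, 8.0):
    cost, info = ocp_point(q, C, beta)
    print(f"beta={beta}: expected cost {cost:.3f} at information {info:.3f} nats")
```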

The key difference lies in the constraints: OTP fixes both marginals ($q$ and $p$), while OCP fixes only the input marginal $q$ but adds an explicit constraint on the mutual information, $I(X,Y) \leq \lambda$. However, fixing both marginals in OTP implicitly constrains the mutual information, since $I(X,Y) \leq \min[H_q(X), H_p(Y)]$.

Because OTP includes the additional constraint $\pi_Y w = p$, its feasible set is a subset of the feasible set for OCP (when considering equivalent information constraints). This leads to the inequality $R_c[q](\lambda) \leq K_c[p,q](\lambda)$, where $K_c[p,q](\lambda)$ denotes the OTP solution under an explicit constraint $I(X,Y) \leq \lambda$. The value of the OCP provides a lower bound on the information-constrained KW metric. Equality holds if and only if the optimal joint measure $w_{OCP}$ for the OCP happens to have the target output marginal $p$, i.e., $\pi_Y w_{OCP} = p$ (Belavkin, 2019).
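
Spelled out, the argument is a feasible-set inclusion: every plan admissible for the information-constrained OTP is also admissible for the OCP with the same $q$ and $\lambda$, so the infimum over the larger set cannot be greater:

$$\{ w : \pi_X w = q,\ \pi_Y w = p,\ I(X,Y) \le \lambda \} \;\subseteq\; \{ w : \pi_X w = q,\ I(X,Y) \le \lambda \} \quad\Longrightarrow\quad R_c[q](\lambda) \le K_c[p,q](\lambda).$$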

Geometric Connection via Dual OTP and KL Decomposition

A second perspective arises from the dual formulation of the OTP and a geometric decomposition of the KL divergence. The dual OTP seeks to maximize $J_c[p,q] = \sup_{f,g} \{ E_p\{f\} - E_q\{g\} \}$, subject to the constraint $f(x) - g(y) \leq c(x,y)$. Strong duality often holds, meaning $J_c[p,q] = K_c[p,q]$.
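
On the finite example from the earlier sketch, strong duality can be checked numerically: the equality-constraint duals reported by the LP solver play the role of the Kantorovich potentials, and the dual objective $E_p\{f\} - E_q\{g\}$ reproduces the primal value. The sign and argument conventions below follow the LP setup and may differ from the paper's notation.

```python
# Hedged numerical check of strong duality for the discrete OTP of the earlier sketch.
import numpy as np
from scipy.optimize import linprog

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
n, m = C.shape

A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0           # pi_X w = q
for j in range(m):
    A_eq[n + j, j::m] = 1.0                    # pi_Y w = p
b_eq = np.concatenate([q, p])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
duals = res.eqlin.marginals                    # dual variables of the equality constraints
g = -duals[:n]                                 # potential paired with q (sign chosen for E_p{f} - E_q{g})
f = duals[n:]                                  # potential paired with p
print("primal K_c[p, q]:", res.fun)
print("dual   J_c[p, q]:", float(p @ f - q @ g))   # equal up to solver tolerance
```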

The KL divergence can be decomposed using a "law of cosines" involving an arbitrary reference measure $r$: $D[p,q] = D[p,r] - D[q,r] - \int \ln \frac{dq(x)}{dr(x)}\, [dp(x) - dq(x)]$.
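
The identity can be verified directly; a tiny numerical sketch with arbitrary discrete measures (illustrative values, not from the paper):

```python
# Numerical check of the "law of cosines" decomposition of the KL divergence.
import numpy as np

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
r = np.array([0.3, 0.3, 0.4])          # arbitrary reference measure

lhs = kl(p, q)
rhs = kl(p, r) - kl(q, r) - float(np.sum(np.log(q / r) * (p - q)))
print(lhs, rhs)                        # the two sides agree
```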

This decomposition can be linked to the dual OTP by considering specific forms for the dual potentials $f$ and $g$, connecting them to gradients of KL divergences relative to $r$. Specifically, assume the measures $p$ and $q$ have exponential forms related to potentials $\phi$ and $\psi$: $dp = e^{\phi} dr$ and $dq = e^{\psi} dr$. Then the gradients can be identified as $\nabla D[p,r] = \phi(x) = \ln \frac{dp(x)}{dr(x)}$ and $\nabla D[q,r] = \psi(x) = \ln \frac{dq(x)}{dr(x)}$.

If we further relate the dual OTP potentials $f, g$ to these gradients, for instance by setting $\beta f = \nabla D[p,r]$ and $\alpha g = \nabla D[q,r]$ for scaling factors $\alpha, \beta$, the terms $E_p\{f\}$ and $E_q\{g\}$ in the dual objective $J_c[p,q]$ can be expressed using KL divergences. Substituting these into the KL decomposition yields an expression relating $D[p,q]$ to terms resembling the dual OTP objective.

A key result (Theorem 4 in (Belavkin, 2019)) states that if the optimal solution $(f, g)$ to the dual OTP also satisfies the gradient conditions $f = \nabla D[p,r]$ and $g = \nabla D[q,r]$ for some $r$ (implying $\alpha = \beta = 1$ and linking the cost $c$ to these potentials via the dual constraint), then the KL divergence can be expressed using the optimal value of the OTP ($K_c[p,q]$, assuming strong duality): $D[p,q] = K_c[p,q] - (\kappa[f] - \kappa[g]) - \int g(x)\, [dp(x) - dq(x)]$. Here, $\kappa[f]$ and $\kappa[g]$ are normalization constants (log-partition functions) associated with the exponential forms of $p$ and $q$. This result demonstrates that, under specific conditions linking the optimal dual potentials to divergence gradients, the KW metric $K_c[p,q]$ emerges as a principal term in the geometric decomposition of the KL divergence $D[p,q]$.
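
A sketch of the bookkeeping behind this expression, assuming the normalized exponential forms $dp = e^{f - \kappa[f]}\, dr$ and $dq = e^{g - \kappa[g]}\, dr$ and strong duality $J_c[p,q] = K_c[p,q]$ at the optimal $(f, g)$:

$$
\begin{aligned}
D[p,r] &= \int \ln\frac{dp}{dr}\, dp = E_p\{f\} - \kappa[f], \qquad D[q,r] = E_q\{g\} - \kappa[g],\\
D[p,q] &= D[p,r] - D[q,r] - \int \ln\frac{dq}{dr}\,[dp - dq]\\
&= \big(E_p\{f\} - E_q\{g\}\big) - \big(\kappa[f] - \kappa[g]\big) - \int g\,[dp - dq]\\
&= K_c[p,q] - \big(\kappa[f] - \kappa[g]\big) - \int g\,[dp - dq],
\end{aligned}
$$

where the constant $\kappa[g]$ drops out of the last integral because $p$ and $q$ have equal total mass.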

In conclusion, the relationship between the KW metric and KL divergence is established through both variational principles, where the OCP (using KL-based mutual information constraints) provides a lower bound on the KW metric, and through a geometric decomposition of KL divergence linked to the dual OTP, where the KW metric can appear as a term under specific assumptions relating dual potentials to divergence gradients.