Relation between the Kantorovich-Wasserstein metric and the Kullback-Leibler divergence
(1908.09211v1)
Published 24 Aug 2019 in cs.IT, math.IT, and math.OC
Abstract: We discuss a relation between the Kantorovich-Wasserstein (KW) metric and the Kullback-Leibler (KL) divergence. The former is defined using the optimal transport problem (OTP) in the Kantorovich formulation. The latter is used to define entropy and mutual information, which appear in variational problems to find optimal channel (OCP) from the rate distortion and the value of information theories. We show that OTP is equivalent to OCP with one additional constraint fixing the output measure, and therefore OCP with constraints on the KL-divergence gives a lower bound on the KW-metric. The dual formulation of OTP allows us to explore the relation between the KL-divergence and the KW-metric using decomposition of the former based on the law of cosines. This way we show the link between two divergences using the variational and geometric principles.
The paper explores the relationship between the Kantorovich-Wasserstein metric and the Kullback-Leibler divergence using connections between variational problems and geometric decomposition.
The paper establishes a variational connection, showing how the value of the Optimal Channel Problem provides a lower bound on the information-constrained Kantorovich-Wasserstein metric.
A geometric connection via the dual Optimal Transport Problem shows that the KW metric appears as a term in a decomposition of the KL divergence, under specific conditions relating the dual potentials to divergence gradients.
The relationship between the Kantorovich-Wasserstein (KW) metric and the Kullback-Leibler (KL) divergence is explored by connecting the variational problems that define them and by utilizing a geometric decomposition of the KL divergence related to the dual formulation of the Optimal Transport Problem (OTP).
Variational Connection via Optimal Transport and Optimal Channel Problems
The KW metric, denoted Kc[p,q], arises from the Kantorovich formulation of the Optimal Transport Problem (OTP). It represents the minimum cost to transport mass from a distribution q on space X to a distribution p on space Y, where c(x,y) is the cost function for moving mass from x to y. Mathematically,
$$K_c[p,q] = \inf_{w \in \Gamma[q,p]} \int c(x,y)\, dw(x,y),$$
where Γ[q,p] is the set of all joint probability measures w on X×Y with marginals πXw=q and πYw=p.
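In the discrete case the Kantorovich OTP is a linear program over couplings, so the KW metric can be computed directly. A minimal sketch with SciPy (the marginals and the cost $c(x,y)=|x-y|$ are a toy example, not from the paper):

```python
import numpy as np
from scipy.optimize import linprog

# Discrete optimal transport (Kantorovich formulation) on a toy example.
# q: source marginal on X, p: target marginal on Y, C[i, j] = c(x_i, y_j).
q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)

n, m = len(q), len(p)
# Decision variable: the coupling w, flattened row-major, with w[i, j] >= 0.
# Equality constraints fix both marginals: sum_j w[i,j] = q[i], sum_i w[i,j] = p[j].
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0   # row sums -> q
for j in range(m):
    A_eq[n + j, j::m] = 1.0            # column sums -> p
b_eq = np.concatenate([q, p])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
K = res.fun   # value of the KW metric K_c[p, q] for this cost
print(round(K, 6))  # → 0.1
```

For this one-dimensional cost the value agrees with the cumulative-distribution formula for the 1-Wasserstein distance, which is a convenient sanity check.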
The KL divergence, $D[p,q]=\int p(x)\ln\frac{dp(x)}{dq(x)}\,dx$, is a fundamental measure in information theory. It quantifies the dissimilarity between two probability measures p and q. Related concepts include entropy $H[p/r]=\ln r(X)-D[p,r/r(X)]$ (relative to a reference measure r) and mutual information $I(X,Y)=D[w,q\otimes p]$, where w is a joint measure with marginals q and p.
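These quantities are easy to evaluate in the finite case. The sketch below (the helper `kl` and the toy joint measure are illustrative, not from the paper) computes a discrete KL divergence and the mutual information $I(X,Y)=D[w,q\otimes p]$ of a joint measure:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D[p, q] = sum p ln(p/q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# Mutual information of a joint measure w with marginals q (rows) and p (cols):
# I(X, Y) = D[w, q ⊗ p], the KL divergence from w to the product of its marginals.
w = np.array([[0.3, 0.1],
              [0.1, 0.5]])
q = w.sum(axis=1)   # marginal on X
p = w.sum(axis=0)   # marginal on Y
I = kl(w.ravel(), np.outer(q, p).ravel())
print(I > 0)  # → True (w is not a product measure, so I(X,Y) > 0)
```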
A connection is established by comparing the OTP to the Optimal Channel Problem (OCP), often encountered in rate distortion theory and value of information contexts. The OCP seeks to find an optimal conditional probability (channel) dw(y∣x) that minimizes the expected cost Ew{c}, given a fixed input marginal πXw=q and an upper bound λ on the mutual information I(X,Y). The value function for OCP is:
$$R_c[q](\lambda)=\inf\{\mathbb{E}_w\{c\} : I(X,Y)\le\lambda,\ \pi_X w=q\}.$$
The key difference lies in the constraints: OTP fixes both marginals (q and p), while OCP fixes only the input marginal q but adds an explicit constraint on the mutual information I(X,Y)≤λ. However, fixing both marginals in OTP implicitly constrains the mutual information, since $I(X,Y)\le\min[H_q(X),H_p(Y)]$.
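The implicit information bound is easy to check numerically: for any finite joint measure, mutual information never exceeds the smaller of the two marginal entropies. A quick randomized check (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def mutual_info(w):
    """I(X, Y) = D[w, q ⊗ p] for a joint measure w."""
    q, p = w.sum(axis=1), w.sum(axis=0)
    mask = w > 0
    return float(np.sum(w[mask] * np.log(w[mask] / np.outer(q, p)[mask])))

# For random joint measures, I(X, Y) never exceeds min[H_q(X), H_p(Y)].
for _ in range(1000):
    w = rng.random((3, 4))
    w /= w.sum()
    bound = min(entropy(w.sum(axis=1)), entropy(w.sum(axis=0)))
    assert mutual_info(w) <= bound + 1e-12
print("ok")  # → ok
```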
Because OTP includes the additional constraint πYw=p, its feasible set is a subset of the feasible set for OCP (when considering equivalent information constraints). This leads to the inequality:
$$R_c[q](\lambda)\le K_c[p,q](\lambda),$$
where Kc[p,q](λ) denotes the OTP solution under an explicit constraint I(X,Y)≤λ. The value of the OCP provides a lower bound on the information-constrained KW metric. Equality holds if and only if the optimal joint measure wOCP for the OCP happens to have the target output marginal p, i.e., πYwOCP=p (Belavkin, 2019).
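The OCP optimum is known from rate distortion theory to have an exponential form in the cost, which suggests computing points on the value curve by a Blahut–Arimoto-style alternating iteration. The sketch below is an assumption-laden illustration, not the paper's algorithm: the helper name `ocp_point`, the toy source, and the trade-off parameter `beta` are all invented here for demonstration.

```python
import numpy as np

def ocp_point(q, C, beta, iters=500):
    """One point (λ, R_c[q](λ)) on the OCP value curve, via alternating updates.

    The optimal channel has the exponential form w(y|x) ∝ p'(y) e^{-beta c(x,y)};
    beta trades expected cost against mutual information.
    """
    m = C.shape[1]
    p_out = np.full(m, 1.0 / m)                # initial guess for output marginal
    for _ in range(iters):
        w = p_out * np.exp(-beta * C)          # unnormalized channel w(y|x)
        w /= w.sum(axis=1, keepdims=True)
        p_out = q @ w                          # induced output marginal
    joint = q[:, None] * w
    cost = float(np.sum(joint * C))            # E_w{c}
    info = float(np.sum(joint * np.log(w / p_out)))  # I(X, Y)
    return info, cost

q = np.array([0.5, 0.3, 0.2])
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
lam, R = ocp_point(q, C, beta=4.0)
print(lam, R)   # a point (λ, R_c[q](λ)) on the OCP value curve
```

Increasing `beta` moves along the curve toward higher information and lower expected cost, tracing out the lower bound on the information-constrained KW metric described above.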
Geometric Connection via Dual OTP and KL Decomposition
A second perspective arises from the dual formulation of the OTP and a geometric decomposition of the KL divergence. The dual OTP seeks to maximize:
$$J_c[p,q]=\sup_{f,g}\{\mathbb{E}_p\{f\}-\mathbb{E}_q\{g\}\},$$
subject to the constraint $f(x)-g(y)\le c(x,y)$. Strong duality often holds, meaning $J_c[p,q]=K_c[p,q]$.
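In the finite case both the primal and dual OTP are linear programs, so strong duality can be observed directly. A sketch with SciPy on toy data (marginals and cost chosen for illustration):

```python
import numpy as np
from scipy.optimize import linprog

# Primal OTP and its dual on the same toy data; strong duality => equal values.
q = np.array([0.5, 0.3, 0.2])   # marginal on X
p = np.array([0.4, 0.4, 0.2])   # marginal on Y
C = np.abs(np.subtract.outer(np.arange(3), np.arange(3))).astype(float)
n, m = C.shape

# Primal: min <C, w> over couplings w with marginals q and p.
A_eq = np.zeros((n + m, n * m))
for i in range(n):
    A_eq[i, i * m:(i + 1) * m] = 1.0
for j in range(m):
    A_eq[n + j, j::m] = 1.0
primal = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([q, p]),
                 bounds=(0, None))

# Dual: max E_p{f} - E_q{g} subject to f - g <= c pointwise.
# Variables z = [f_1..f_m, g_1..g_n]; linprog minimizes, so negate f's part.
A_ub = np.zeros((n * m, m + n))
for i in range(n):
    for j in range(m):
        A_ub[i * m + j, j] = 1.0       # +f_j
        A_ub[i * m + j, m + i] = -1.0  # -g_i
dual = linprog(np.concatenate([-p, q]), A_ub=A_ub, b_ub=C.ravel(),
               bounds=(None, None))

print(round(primal.fun, 6), round(-dual.fun, 6))  # equal under strong duality
```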
The KL divergence can be decomposed using a "law of cosines" involving an arbitrary reference measure r:
$$D[p,q]=D[p,r]-D[q,r]-\int\ln\frac{dq(x)}{dr(x)}\,[dp(x)-dq(x)].$$
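In the discrete case this "law of cosines" is an exact algebraic identity for any reference measure r, which a short numerical check confirms (the random p, q, r are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(a, b):
    """Discrete KL divergence D[a, b] (all entries assumed positive)."""
    return float(np.sum(a * np.log(a / b)))

# Check the law of cosines: D[p,q] = D[p,r] - D[q,r] - ∫ ln(dq/dr) [dp - dq].
p, q, r = (rng.random(5) for _ in range(3))
p, q, r = p / p.sum(), q / q.sum(), r / r.sum()

lhs = kl(p, q)
rhs = kl(p, r) - kl(q, r) - float(np.sum(np.log(q / r) * (p - q)))
print(abs(lhs - rhs) < 1e-12)  # → True
```

Expanding the integral term shows why: the $D[q,r]$ and $\int\ln\frac{dq}{dr}\,dq$ contributions cancel, leaving $\int\ln\frac{dp}{dq}\,dp = D[p,q]$.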
This decomposition can be linked to the dual OTP by considering specific forms for the dual potentials f and g, connecting them to gradients of KL divergences relative to r. Specifically, assume the measures p and q have exponential forms related to potentials $\phi$ and $\psi$: $dp=e^{\phi}\,dr$ and $dq=e^{\psi}\,dr$. Then the gradients can be identified as $\nabla D[p,r]=\phi(x)=\ln\frac{dp(x)}{dr(x)}$ and $\nabla D[q,r]=\psi(x)=\ln\frac{dq(x)}{dr(x)}$.
If we further relate the dual OTP potentials f, g to these gradients, for instance by setting $\beta f=\nabla D[p,r]$ and $\alpha g=\nabla D[q,r]$ for scaling factors $\alpha,\beta$, the terms $\mathbb{E}_p\{f\}$ and $\mathbb{E}_q\{g\}$ in the dual objective $J_c[p,q]$ can be expressed using KL divergences. Substituting these into the KL decomposition yields an expression relating D[p,q] to terms resembling the dual OTP objective.
A key result (Theorem 4 in (Belavkin, 2019)) states that if the optimal solution (f,g) to the dual OTP also satisfies the gradient conditions f=∇D[p,r] and g=∇D[q,r] for some r (implying α=β=1 and linking the cost c to these potentials via the dual constraint), then the KL divergence can be expressed using the optimal value of the OTP (Kc[p,q] assuming strong duality):
$$D[p,q]=K_c[p,q]-(\kappa[f]-\kappa[g])-\int g(x)\,[dp(x)-dq(x)].$$
Here, κ[f] and κ[g] are normalization constants (log partition functions) associated with the exponential forms of p and q. This result demonstrates that under specific conditions linking optimal dual potentials to divergence gradients, the KW metric Kc[p,q] emerges as a principal term in the geometric decomposition of the KL divergence D[p,q].
In conclusion, the relationship between the KW metric and KL divergence is established through both variational principles, where the OCP (using KL-based mutual information constraints) provides a lower bound on the KW metric, and through a geometric decomposition of KL divergence linked to the dual OTP, where the KW metric can appear as a term under specific assumptions relating dual potentials to divergence gradients.