Papers
Topics
Authors
Recent
Search
2000 character limit reached

Generalising maximum mean discrepancy: kernelised functional Bregman divergences

Published 27 Apr 2026 in cs.LG, cs.CV, and cs.IT | (2604.24047v1)

Abstract: Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry. Particularly in the context of machine learning, they are central to clustering, exponential families, parameter estimation and optimisation, among other things. Despite this, the full toolkit of Hilbert spaces and in particular reproducing kernel Hilbert spaces have not been systematically developed and applied to functional Bregman divergences, where points are functions rather than finite-dimensional parameter vectors. While other types of functional Bregman divergences have been studied, these are typically in a Banach space rather than more directly aligned with kernel methods and Hilbert-space geometry commonly used in machine learning. We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us particularly convenient calculus. Further specialising Bregman generators as a composition involving a kernel mean embedding makes such divergences easy to estimate. We discuss applications in clustering, universal estimation, robust estimation and generative modelling, and contrast our approach with other types of Bregman divergences.

Summary

  • The paper introduces kernelised functional Bregman divergences (k-FBD) that generalise MMD for robust statistical estimation.
  • It leverages RKHS embeddings via kernel mean mappings to extend divergence calculus to infinite-dimensional function spaces.
  • The framework provides sandwich bounds, metricisation, and direct applications in clustering, generative modeling, and hypothesis testing.

Generalization of Maximum Mean Discrepancy via Kernelised Functional Bregman Divergences

Introduction and Problem Statement

This work addresses the extension of Bregman divergences (BDs) from their classical finite-dimensional setting to function spaces under a reproducing kernel Hilbert space (RKHS) framework. The authors focus on functional Bregman divergences (FBDs) defined on general Hilbert spaces rather than Banach or LrL^r spaces, leveraging the Riesz representation and self-duality properties to facilitate analysis and practical calculus. A salient contribution is the construction and study of kernelised FBDs (k-FBD), in which the BD generator is defined as a composition with a kernel mean embedding (KME). This enables tractable estimation from samples and places the framework in direct connection with maximum mean discrepancy (MMD) and kernel-based statistical learning paradigms. Figure 1

Figure 1: Schematic illustration of kernelised functional Bregman divergence, where functions f,gFf, g \in \mathcal{F} are mapped to an RKHS via μ\mu. If Φ=Fμ\Phi = F \circ \mu, then dΦ(f,g)=dF(μ(f),μ(g))d_\Phi(f,g) = d_F(\mu(f), \mu(g)) holds, reducing analysis to divergences between mean embeddings.

Formal Development of Functional Bregman Divergences

An FBD is generalized to functions f,gFf, g \in \mathcal{F}, with F\mathcal{F} a real Hilbert space, via a strictly convex, Fréchet-differentiable functional Φ\Phi:

dΦ(f,g)=Φ(f)Φ(g)Φ(g),fgFd_\Phi(f, g) = \Phi(f) - \Phi(g) - \langle \nabla \Phi(g), f - g \rangle_{\mathcal{F}}

Key properties derived for this construction include:

  • Nonnegativity and identity of indiscernibles: dΦ(f,g)0d_\Phi(f, g) \ge 0, equality iff f,gFf, g \in \mathcal{F}0.
  • Convexity in the first argument and additivity in generators.
  • Duality: Under Legendre-type conditions on f,gFf, g \in \mathcal{F}1, a dual divergence is formed using the convex conjugate f,gFf, g \in \mathcal{F}2, yielding self-dual geometry in the Hilbert space setting.
  • Bias-variance decomposition: Extends bias-variance results for vector BDs, ensuring the mean minimises expected FBD risk wrt the second argument.
  • Symmetric FBDs: Demonstrated that, under twice Fréchet-differentiability, the only symmetric FBDs are quadratic forms induced by positive definite, self-adjoint operators.

Kernelised Functional Bregman Divergences and MMD

By considering generators of the form f,gFf, g \in \mathcal{F}3, with f,gFf, g \in \mathcal{F}4 the KME from f,gFf, g \in \mathcal{F}5 to an RKHS f,gFf, g \in \mathcal{F}6, the divergence f,gFf, g \in \mathcal{F}7 reduces to f,gFf, g \in \mathcal{F}8. If f,gFf, g \in \mathcal{F}9 is injective, μ\mu0 inherits convexity and smoothness from μ\mu1. This abstraction subsumes several important divergences:

  • Squared MMD: For μ\mu2, the k-FBD recovers the squared MMD, foundational in nonparametric two-sample testing and distributional learning.
  • Deformed squared MMD: Employing μ\mu3 with suitable convex, increasing, twice differentiable functions μ\mu4, the divergence generalises MMD and interpolates a family of divergences with controllable sensitivity.
  • Formal connections to kernelised Kullback-Leibler divergence and negative entropy via suitable choices of μ\mu5.

These constructions inherit the estimability of MMD, as all evaluation and gradient operations reduce to empirical means over RKHS feature maps.

Theoretical Guarantees: Sandwich Bounds and Robust Estimation

The authors derive quantitative sandwich inequalities for deformed MMDs: within a ball of radius μ\mu6 in μ\mu7, the deformed divergence is always bounded above and below by the squared MMD modulated by explicit constants involving the spectrum of μ\mu8 and μ\mu9 on Φ=Fμ\Phi = F \circ \mu0. This enables rigorous robustness guarantees—for instance, universal risk bounds in parameter estimation problems analogous to those established for classical MMD [cherief2022finite]. For minimum-divergence estimators based on k-FBDs, one has:

Φ=Fμ\Phi = F \circ \mu1

where Φ=Fμ\Phi = F \circ \mu2, Φ=Fμ\Phi = F \circ \mu3 are the minimal/maximal Hessian eigenvalues, ensuring uniform consistency and robustness in statistical learning, particularly when sample dependencies exist.

Clustering, Metricisation, and Symmetrisation

By generalizing metricisation of BDs [acharyya2013bregman], the FBD admits an explicit symmetrisation and is shown to generate a valid metric (under square-root scaling and appropriate operator choices). This permits algorithmic applications requiring a distance structure, e.g., accelerated clustering via triangle inequality pruning. Theorem-level results specify operator conditions (via Schur complements) so that metricisation holds in infinite dimensions, extending prior finite-dimensional results.

Practical Implications

The technical apparatus enables several direct machine learning applications:

  • Generative modeling: Divergence minimisation for learning generative models can leverage the expanded class of k-FBDs, offering explicit alternatives to MMD for distribution matching in GAN architectures.
  • Universal and robust statistical estimation: The sandwich bounds show that any estimator using k-FBD minimisation inherits universal and contamination-robust risk bounds proportional to those for MMD-based estimation, even under model misspecification and dependent data.
  • Clustering in function space: The explicit metric structure allows extension of k-means and related cluster algorithms to the space of functions or distributions via k-FBDs.
  • Divergence estimation and hypothesis testing: Sample-based Monte Carlo estimators for k-FBDs become feasible, opening avenues for plug-in tests beyond MMD.

Theoretical Implications

By providing a general FBD construction in Hilbert spaces, the analysis leverages self-duality, Riesz representation, and the geometry of RKHS embeddings, thus streamlining the calculus of gradients, Hessians, and dualities compared to prior Banach or Φ=Fμ\Phi = F \circ \mu4-space approaches. Open theoretical directions remain, such as converse optimality (mean-as-minimiser implies BD structure) and further uniqueness results for symmetric divergences.

Conclusion

This work formalises a natural and practically estimable generalisation of maximum mean discrepancy via kernelised functional Bregman divergences. The framework integrates statistical, geometric, and kernel-based principles, provides important stability and sandwich bounds, and has immediate implications in robust learning, generative modeling, and clustering. Its introduction opens several avenues for nonparametric inference, robust hypothesis testing, and metric learning in Hilbert function spaces. Further extensions, such as analysis of hypothesis testing power, expressivity of deformed MMDs, and application to high-dimensional model selection, are motivated directly by these results.

Reference: "Generalising maximum mean discrepancy: kernelised functional Bregman divergences" (2604.24047).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

There was an error generating the whiteboard.

Explain it Like I'm 14

Overview

This paper is about a way to measure how different two functions are, especially when those functions are actually probability distributions (ways things can happen). The authors take a well-known tool called a Bregman divergence (a kind of “distance-like” measure that may be one-way and not symmetric) and extend it so it works smoothly for functions, not just for ordinary vectors of numbers. They also connect it to popular “kernel” methods used in machine learning, which makes these measures practical to estimate from data. Along the way, they show how their approach includes the well-known maximum mean discrepancy (MMD) as a special case and propose new, MMD-like measures that can be more robust or flexible.

What questions does the paper ask?

  • Can we build a clean, math-friendly version of Bregman divergences that works directly on functions (like curves, signals, or probability distributions), not just on finite lists of numbers?
  • Can we connect that functional version to kernel methods so we can estimate it easily from samples?
  • Which properties from the “vector” world still hold (like “the average minimizes expected loss”)? Which new properties can we prove (like how to turn the divergence into a true distance)?
  • How does this relate to the popular MMD, and can we generalize MMD in a useful way?

How do the authors approach the problem?

Think of a Bregman divergence as comparing two points using a “bowl-shaped” surface (a convex function). You look at how high the bowl is at one point compared with the height predicted by the tangent (the best straight-line approximation) at the other point. This creates a measure of mismatch that is usually one-way: measuring from A to B is not always the same as from B to A.

To bring this idea to functions:

  • Functions live in a “Hilbert space,” which you can imagine as a place where functions act like points, and we can measure lengths and angles between them. This makes calculus and geometry on functions possible and neat.
  • Kernel methods add a handy trick: they map complicated objects (like distributions) into a “feature space” where simple geometry works well. The kernel mean embedding (KME) takes a distribution and turns it into an average point in that feature space. The maximum mean discrepancy (MMD) is just the distance between these average points.
  • The authors define a functional Bregman divergence (FBD) on a Hilbert space and then introduce a kernelised version (k-FBD): pick any nice “bowl” FF in the feature space, and apply it to the embedded averages. This makes the divergence easy to compute from samples, because you only need kernel evaluations.
  • They also study special choices of the “bowl” function FF, including one that gives back squared MMD and a family they call “deformed MMD,” which lets you bend the geometry in useful ways (for example, to be more robust).

In short: they replace “compare functions directly” by “embed them with a kernel, then compare their embeddings with a Bregman divergence.” This keeps the math clean and the computations practical.

What did they find and why does it matter?

Here are the main findings, explained simply:

  • Basic good behavior carries over:
    • The functional Bregman divergence is always nonnegative and is zero only when the functions are the same.
    • It behaves nicely with averages and with mixtures of bowls (linearity in the generator).
    • It has a “three-point identity” that helps with analysis.
  • Averages are special (bias-variance style result):
    • If you are trying to pick a single function to best represent a random function (or distribution), the best choice (the one that minimizes expected divergence) is the average function. This generalizes the familiar “mean minimizes squared error” idea.
  • Dual view and quasi-means:
    • There is a “dual” version of the divergence (like flipping the coordinates) that matches the original—this mirrors known results for vectors.
    • When you minimize the divergence in the other direction, the solution is a “quasi-arithmetic mean,” which is a generalized kind of average determined by the bowl’s shape.
  • When is the divergence symmetric?
    • Most Bregman divergences are not symmetric (distance from A to B differs from B to A). The authors show that in their function setting, the divergence is symmetric only when the bowl is quadratic (like a squared distance). This matches the vector case.
  • Turning it into a true distance (metricisation):
    • They show how to build from the divergence a true distance (it’s symmetric and satisfies the triangle inequality) by symmetrizing and adding a couple of extra squared terms. This lets you use fast distance-based algorithms (like better versions of k-means).
  • Kernelised FBD includes and generalizes MMD:
    • If you pick the bowl to be “squared length” in the feature space, you get exactly the squared MMD.
    • If you pick the bowl to be some smooth, increasing function of the length (a “deformation”), you get a “deformed MMD.” They prove this new divergence is sandwiched between two constant multiples of squared MMD on a reasonable set, so it’s comparable and well-behaved.
  • Practical estimation and robustness:
    • Because everything runs through kernels, the divergence can be estimated from data samples.
    • Using deformed MMD for fitting models can be both universal (works across many models) and robust (still works well if some data are noisy or slightly corrupted). They provide performance bounds that extend known MMD results.

Why it matters:

  • This gives a unified, flexible way to compare distributions and functions that plugs directly into modern kernel-based machine learning.
  • It extends a lot of useful theory (like average-as-best and duality) to a functional setting.
  • It introduces new, MMD-like divergences that can be better suited to certain tasks (e.g., more robust to outliers) while staying easy to compute.

What could this change or lead to next?

  • Better tools for comparing data distributions in generative modeling: training models like GANs or MMD-nets could benefit from deformed MMDs that are robust or emphasize certain differences.
  • Stronger, more flexible clustering and learning algorithms: the new metric can power distance-based methods that work on functions or distributions, with triangle-inequality speedups.
  • Reliable, sample-efficient hypothesis tests and estimators: since these divergences are easy to estimate from samples, they could lead to practical statistical tests and robust estimators that hold up under messy, real-world data.
  • Open questions: for example, does “the mean minimizes expected divergence” also imply the divergence must be of Bregman type in this functional setting (as it does in finite dimensions)? Answering such questions could further tighten the theory.

In short, the paper builds a bridge between elegant geometry (Bregman divergences), powerful practical tools (kernels and MMD), and functional data. This bridge lets us design and analyze new divergences that are both theoretically sound and usable in everyday machine learning tasks.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a single, concrete list of unresolved issues and limitations that emerge from the paper, formulated to guide future research:

  • Characterization gap: Does the mean-as-minimiser property imply that a divergence is an FBD on a Hilbert space (the converse of Theorem 3), as in the finite-dimensional case?
  • Symmetry under weaker smoothness: Can the characterisation of symmetric FBDs (Theorem 5) be extended beyond twice continuously Fréchet differentiable generators to merely once-differentiable or nonsmooth convex generators?
  • Beyond Hilbert spaces: Which results (e.g., metricisation, bias–variance decomposition) extend to FBDs defined on Banach spaces like Ls(ν)L^s(\nu), and can one prove or refute that the metricisation cannot hold in these settings?
  • Metricisation practicality: How should one choose operators A,BA,B in practice to ensure the block operator MM is PSD in infinite dimensions, and what is the computational impact of different A,BA,B choices (e.g., learnable vs. fixed)?
  • Tightness and design of metricisation: Are there tighter or alternative symmetric/metric constructions for FBDs with provable optimality (e.g., smallest additive correction to achieve a metric)?
  • Functional Jensen–Bregman divergences: What are the properties (e.g., Pythagorean identities, convexity, metric-like bounds) of functional Jensen–Bregman divergences in Hilbert spaces?
  • Functional Fenchel–Young divergences: How far do classical properties (duality, generalisation bounds, calibration) extend to the functional Fenchel–Young divergence noted but not pursued here?
  • Dually flat geometry: Under what minimal assumptions does the dually flat structure exist in the functional setting, and how can one exploit it algorithmically (e.g., mirror descent, natural gradients) beyond the Legendre-type case?
  • Injectivity of the kernel mean embedding (KME): Precisely characterize when μ\mu is injective on the domain A\mathcal A (e.g., for which kernels and function classes), and analyze the behavior of k-FBDs when μ\mu is not injective (identifiability and collapsed equivalence classes).
  • Kernelized KL/negative entropy: Establish rigorous conditions (positivity, integrability, convexity, Legendre-type) under which “kernelized KL” or “kernelized negative entropy” generators are well-defined, convex, differentiable, and estimable.
  • Non-smooth generators: Develop an FBD theory for nonsmooth generators (subgradient-based or Gâteaux derivatives), enabling divergences relevant to sparsity or total variation.
  • Estimation theory for k-FBDs: Derive unbiased or bias-corrected estimators, asymptotic distributions, variance/concentration bounds, and U-statistics formulations analogous to MMD, including fast and numerically stable estimators.
  • Hypothesis testing with k-FBDs: Specify null distributions and power analyses for two-sample and goodness-of-fit tests based on k-FBDs (including wild bootstrap/permutation schemes), and quantify finite-sample and asymptotic properties.
  • High-probability guarantees: Extend universal/robust estimation results beyond expectations to high-probability bounds, including precise dependence on mixing/dependence (tight bounds for the kernelized covariance sequence {ρi}\{\rho_i\}).
  • Sandwich bounds: Assess the tightness of mϕ(R)m_\phi(R) and Lϕ(R)L_\phi(R), provide constructive methods to select/estimate RR, and develop bounds when mϕ(R)=0m_\phi(R)=0 (e.g., ϕ(r)=rp\phi(r)=r^p with p>2p>2 at r0r\approx 0) or when kernels are unbounded.
  • Choosing the deformation ϕ\phi: Provide principled guidelines (e.g., influence functions, breakdown points) for selecting ϕ\phi to trade off sensitivity and robustness, and investigate optimal ϕ\phi under contamination models.
  • Noising interpretation limits: Formalize the noising view (Eq. 9) by precisely characterizing when kk is a stationary PDF, how bandwidth and kernel choice affect the induced divergence, and how to handle nonstationary kernels.
  • Optimization for generative modeling: Analyze gradients of k-FBD objectives through μ\mu for deep models, identify potential gradient bias/variance issues, and compare training dynamics and mode coverage to MMD-based GANs.
  • Clustering algorithms: Instantiate and analyze k-means-style algorithms under the metricised FBD, including convergence guarantees, initialization strategies, triangle-inequality pruning, and complexity in infinite-dimensional settings.
  • Multi-kernel/learned kernels: Study how multiple or learned kernels affect injectivity, stability, and statistical properties of k-FBDs, and develop hyperparameter selection approaches tailored to k-FBD objectives.
  • Empirical validation: Benchmark deformed MMD and other k-FBDs against standard MMD across tasks (testing, density estimation, generative modeling), and quantify gains/losses in sample efficiency and robustness.
  • Domain and integrability conditions: Provide verifiable conditions ensuring Bochner integrability and finiteness of EΦ(F)\mathbb E|\Phi(F)| in common settings, and clarify domains where the proposed theorems apply without technical caveats.
  • Reverse-risk minimizers: Extend existence/uniqueness and stability results for the quasi-arithmetic mean minimiser under weaker assumptions, including continuity of the solution map and perturbation bounds.
  • Extensions beyond RKHS-IPMs: Explore whether analogous kernelised FBDs and sandwich bounds can be established for other IPMs (e.g., energy distance) or for non-Hilbert feature maps.

Practical Applications

Immediate Applications

Below are practical uses that can be deployed now or tested with modest engineering effort, drawing directly on the paper’s kernelised functional Bregman divergences (k-FBDs), deformed squared MMD, metricisation, and associated properties.

  • Robust and universal parameter estimation via minimum-divergence fitting
    • Sectors: healthcare, finance, energy, manufacturing, ads/rec systems, time series forecasting
    • What: Replace squared MMD in minimum-divergence estimators with k-FBDs (e.g., deformed squared MMD) to gain robustness to misspecification and contamination while retaining consistency guarantees via the sandwich bounds with MMD.
    • Tools/workflow: Implement k-FBD losses in PyTorch/JAX/TF; optimise θ by minimising dΦ(pθ, p̂n); monitor training with empirical KME radius R estimates; use kernels with bounded norms (e.g., Gaussian/RBF) to stabilise constants.
    • Assumptions/dependencies: Kernel boundedness and injective kernel mean embedding (characteristic kernel); φ differentiable with φ′, φ″ > 0; bounded KME radius R to apply the comparison constants; for dependent data, bounded kernelised covariance ρ.
  • Data drift/OOD detection and monitoring in production
    • Sectors: ML Ops across all industries, healthcare (cohort shift), finance (market regime shift), cybersecurity (traffic shifts)
    • What: Track k-FBD between streaming/live data and the training data (or a reference window) for more robust drift signals than standard MMD; exploit the noising interpretation to evaluate smoothed shifts that reduce spurious alarms.
    • Tools/workflow: Sliding windows over data; online KME updates; compute dΦ(pwindow, pref); set thresholds from bootstrap/permutation baselines; dashboards in monitoring tools.
    • Assumptions/dependencies: Bounded kernel; stable φ; adequate sample sizes per window; calibrate thresholds on-site data.
  • Change-point detection in streaming signals
    • Sectors: finance (trading), manufacturing (sensor processes), IoT, energy (load), network security
    • What: Use k-FBD on consecutive windows over feature embeddings to detect abrupt distributional changes; φ controls sensitivity to small vs large deviations.
    • Tools/workflow: Online computation of μ(window) and dΦ between windows; alarm logic with adaptive thresholds.
    • Assumptions/dependencies: Same as drift detection; window sizing trade-offs.
  • Generative model training with k-FBD losses
    • Sectors: media (images/audio), vision, speech, tabular data synthesis, privacy-preserving data generation
    • What: Replace squared MMD objectives in MMD-GANs/flow models with deformed squared MMD or other k-FBDs to tune robustness and sample quality; exploit asymmetry when desired (directional optimisation).
    • Tools/workflow: Plug-in loss module computing dΦ(μ(pθ), μ(pdata)); autodiff through μ and φ; hyperparameter search over φ and kernels.
    • Assumptions/dependencies: Stable gradient estimates; kernel and φ selection affects convergence; empirical validation required per modality.
  • Clustering of distributions or complex objects using metricised FBD
    • Sectors: document clustering, time-series/trajectory clustering, shape analysis, bioinformatics (cell distributions)
    • What: Cluster probability distributions or functional data by constructing √dΦgsb as a metric; use triangle inequality to accelerate k-means–style algorithms (Elkan-style bounds).
    • Tools/workflow: Choose operators A, B to ensure PSD block operator M; compute b(f) = (f, ∇Φ(f)); plug metric into metric k-means/k-medoids; apply acceleration rules.
    • Assumptions/dependencies: Need A, B so that M is PSD (e.g., B ⪰ A⁻¹ with A strongly positive); compute ∇Φ efficiently; kernelised features available.
  • Fairness/bias auditing and subgroup shift analysis
    • Sectors: policy, HR tech, finance compliance, healthcare equity
    • What: Use k-FBD to quantify distributional gaps between groups (e.g., pre/post intervention or across demographics) with robustness to outliers or sampling noise; compare to MMD with tunable sensitivity via φ.
    • Tools/workflow: Compute μ for each subgroup; evaluate k-FBD; report with confidence intervals via resampling; integrate into model cards/audit dashboards.
    • Assumptions/dependencies: Representative subgroup samples; kernel choice to reflect semantics; governance on thresholds.
  • Model and data set alignment for domain adaptation
    • Sectors: NLP (domain shift), vision (camera/sensor change), robotics (sim-to-real), speech
    • What: Align feature distributions by minimising k-FBD across domains (e.g., CORAL-style alignment with RKHS distributions); supports emphasis on particular discrepancies via φ and kernel.
    • Tools/workflow: Feature extractor + k-FBD penalty between source/target batches; tune φ for stability; add to training loop as alignment regulariser.
    • Assumptions/dependencies: Quality of feature extractor; kernel choice; computational budget for batch-wise divergence.
  • Federated learning client weighting and robust aggregation
    • Sectors: mobile/edge ML, healthcare federated analytics, finance
    • What: Weight client updates by k-FBD between client and global/reference distributions (downweight outliers/poisoning); monitor heterogeneity with k-FBD.
    • Tools/workflow: Compute μ per client (on-device or via sketches); server computes dΦ to reweight aggregation; detect anomalous clients.
    • Assumptions/dependencies: Privacy constraints (KME may leak information unless protected); communication budget; secure aggregation of KMEs.
  • Evaluation metrics for generative models and simulators
    • Sectors: autonomous driving simulation, robotics simulators, game AIs
    • What: Use k-FBD to compare simulator-generated and real-world distributions of states/observations; noising interpretation provides smoothed comparison that avoids brittle high-dimensional distances.
    • Tools/workflow: Embed trajectories/frames; compute dΦ; integrate into validation dashboards.
    • Assumptions/dependencies: Adequate feature kernels; simulator coverage.
  • Library extensions and engineering
    • Sectors: software/tooling
    • What: Extend existing MMD libraries (e.g., in PyTorch or scikit-learn) with k-FBD modules supporting φ families, gradient APIs, and metricised distances √dΦgsb.
    • Tools/workflow: Implement μ, dΦ, ∇Φ for standard φ; provide tests and examples (drift detection, robust fitting, generative training).
    • Assumptions/dependencies: GPU-friendly kernels; numerical stability (ensure φ′(r)/r is well-conditioned near r≈0).

Long-Term Applications

These need additional theory (e.g., asymptotics), scaling, or empirical validation before broad deployment.

  • Two-sample and independence tests with k-FBD statistics
    • Sectors: scientific research, A/B testing platforms, clinical trials, econometrics
    • What: Develop hypothesis tests based on k-FBD estimators (beyond MMD), potentially improving power/robustness for particular alternatives via φ.
    • Tools/workflow: Derive asymptotic distributions or wild bootstrap/permutation schemes; software for test calibration and power analysis.
    • Assumptions/dependencies: Asymptotic theory for biased estimators; bandwidth/φ selection procedures; control of Type I error.
  • Privacy-aware analytics via smoothed divergences
    • Sectors: policy, healthcare (HIPAA/GDPR), finance, public data portals
    • What: Leverage the noising interpretation (convolution with κ) to compare smoothed (privacy-preserving) distributions; explore links to differential privacy and sensitivity reduction.
    • Tools/workflow: Integrate calibrated kernels corresponding to DP mechanisms; provide utility/privacy trade-off dashboards.
    • Assumptions/dependencies: Formal privacy guarantees require additional analysis; kernel must reflect DP noise distribution.
  • Automated kernel and generator selection (AutoML for divergences)
    • Sectors: general ML
    • What: Meta-learn or search over kernels and φ to tailor geometry to tasks (robustness vs sensitivity); learn φ from data subject to convexity constraints.
    • Tools/workflow: Hyperparameter optimisation or bilevel schemes; validation via downstream metrics.
    • Assumptions/dependencies: Risk of overfitting; need regularisation and priors over φ; compute cost.
  • Robust ERM and risk minimisation with asymmetric FBDs
    • Sectors: finance (risk control), safety-critical ML, healthcare
    • What: Use directional (asymmetric) k-FBDs to bias training toward mode-seeking or mode-covering behaviours, injecting preference into learning objectives.
    • Tools/workflow: Formulate ERM with asymmetric dΦ; curriculum strategies to stabilise optimisation.
    • Assumptions/dependencies: Convergence analysis and stability under non-symmetric losses; task-specific tuning.
  • Advanced clustering and metric learning on function/distribution spaces
    • Sectors: bioinformatics (single-cell distributions), materials science (microstructure distributions), remote sensing
    • What: Learn A, B (operators) and/or embeddings that optimise the metricised √dΦgsb for clustering/classification of distributions or functional data.
    • Tools/workflow: Operator parameterisation with PSD constraints; metric learning objectives with constraints M ⪰ 0 (Schur complement).
    • Assumptions/dependencies: Efficient parameterisation in infinite-dimensional settings; sample complexity; generalisation bounds.
  • Bayesian inference and variational objectives using FBD geometry
    • Sectors: scientific ML, probabilistic programming
    • What: Replace KL/reverse-KL with k-FBD families in variational objectives to trade off mode-seeking/covering and robustness; explore dual/quasi-arithmetic means for posterior aggregation.
    • Tools/workflow: Implement dΦ in VI frameworks; stochastic estimators for μ and ∇Φ; evaluate calibration.
    • Assumptions/dependencies: Theory for convergence and bias; variance control of gradient estimators.
  • Optimal transport and geometry-inspired optimisation
    • Sectors: logistics, graphics, ML theory
    • What: Use metricised FBDs as surrogates or complements to OT distances in applications needing triangle inequality and faster computation; explore mirror-descent style algorithms leveraging FBD geometry.
    • Tools/workflow: Hybrid objectives mixing OT and k-FBD; preconditioners derived from FBD Hessian structure.
    • Assumptions/dependencies: Task-dependent fidelity vs efficiency; establishing performance guarantees.
  • Federated and distributed learning with FBD-based trust and aggregation rules
    • Sectors: mobile, edge AI, cross-silo healthcare/finance
    • What: Design robust aggregation and client selection using k-FBD distances and metric properties; analyse convergence under heterogeneity.
    • Tools/workflow: Adaptive weighting/selection policies; theoretical bounds incorporating dΦ geometry.
    • Assumptions/dependencies: Communication-efficient KME summaries; privacy-preserving KME sharing (e.g., secure aggregation, DP noise); theoretical convergence under partial participation.
  • Fairness-aware training objectives
    • Sectors: policy compliance, HR, lending
    • What: Train models with k-FBD constraints/penalties to align subgroup distributions in latent or prediction spaces (beyond MMD penalties).
    • Tools/workflow: Differentiable fairness penalties using dΦ; multi-objective optimisation.
    • Assumptions/dependencies: Legal/compliance alignment; careful kernel/φ selection to reflect fairness definitions.
  • Scientific discovery with function-space distances
    • Sectors: climate science, systems biology, neuroscience
    • What: Compare functional outputs (e.g., spatiotemporal fields, trajectories) via k-FBDs, exploiting RKHS geometry to capture structured differences.
    • Tools/workflow: Task-specific kernels for spatial/temporal structure; uncertainty-aware comparisons via smoothed/divergence families.
    • Assumptions/dependencies: Domain-specific kernel design; computational scaling for large grids.

Notes on Assumptions and Dependencies (cross-cutting)

  • Kernel requirements: Prefer characteristic, bounded kernels (e.g., Gaussian/RBF) for injective KME and boundedness; kernel selection encodes task-specific notions of similarity.
  • Generator requirements: For deformed squared MMD, φ ∈ C² with φ′(r) > 0 and φ″(r) > 0; φ′(0)=0 for clean bounds; Legendre-type conditions enable duality results.
  • Estimation stability: Ensure μ(f) is well-estimated from sample sizes at hand; regularise near r≈0 to avoid φ′(r)/r instability.
  • Metricisation: Choose A, B such that the block operator M is PSD (e.g., B ⪰ A⁻¹ with A strongly positive); verify numerically in implementations.
  • Computational scaling: KME computation scales with batch size; adopt random features/Nyström approximations for large-scale use.
  • Statistical guarantees: Where tests/CI are needed, further theory (asymptotics or resampling) should be developed for specific φ and kernels.

These applications collectively leverage the paper’s core advances: Hilbert-space functional Bregman divergences with RKHS calculus, kernelised constructions estimable from samples, deformed squared MMD offering robustness with provable comparison to MMD, and metricisation enabling algorithmic speedups and geometric tooling in non-Euclidean spaces.

Glossary

  • Banach space: A complete normed vector space; a general setting for functional analysis where notions of convergence and continuity are defined via a norm. Example: "these are typically in a Banach space rather than more directly aligned with kernel methods and Hilbert-space geometry commonly used in machine learning."
  • Biconjugate: The second convex conjugate of a function; for proper lower-semicontinuous convex functions it equals the original function (Fenchel–Moreau theorem). Example: "and the biconjugate of Φ\Phi is Φ=(Φ)\Phi^{\ast\ast}=(\Phi^\ast)^\ast."
  • Bochner mean: The expectation (integral) of a random element in a Banach or Hilbert space defined via Bochner integration. Example: "Assume EΦ(F)<\mathbb E|\Phi(F)|<\infty and that the Bochner mean fˉ=E[F]\bar f=\mathbb E[F] belongs to U\mathcal{U}."
  • Bregman divergence: A measure of discrepancy induced by a convex function, defined as the difference between the function value and its first-order Taylor approximation at a reference point. Example: "Bregman divergences play a pivotal role in statistics, machine learning and computational information geometry."
  • Convex conjugate: The Legendre–Fenchel transform of a function, mapping it to a dual function via a supremum of linear functionals minus the original function. Example: "The convex conjugate Φ:F(,]\Phi^\ast:\mathcal{F}\to(-\infty, \infty] of (a not necessarily Legendre type generator) Φ:F(,]\Phi:\mathcal{F}\to(-\infty,\infty] is defined by"
  • Dually flat structure: A geometric structure where a manifold admits dual affine coordinate systems and a potential function, typical of spaces induced by Bregman divergences. Example: "Our FBD possesses a natural dually flat structure, under standard Legendre-type conditions on the generator."
  • Fenchel–Moreau theorem: A fundamental result stating that a proper lower-semicontinuous convex function equals its biconjugate. Example: "Convex conjugates also satisfy a biconjugate property, known as a Fenchel-Moreau theorem."
  • Fenchel–Young inequality: An inequality relating a function and its convex conjugate via the dual pairing; equality holds at subgradients. Example: "Convex conjugate functionals satisfy a Fenchel-Young inequality."
  • Functional Bregman divergence (FBD): A Bregman divergence defined on a space of functions (e.g., a Hilbert space), measuring discrepancy between functions via a convex functional and its gradient. Example: "We now introduce the functional Bregman divergence (FBD)."
  • Gateaux derivative: A directional derivative in infinite-dimensional spaces, generalizing the derivative along directions; weaker than the Fréchet derivative. Example: "The definition below actually extends to Gateaux derivatives, but here we consider only Fr"
  • Hilbert space: A complete inner-product space, providing geometric tools like orthogonality and projections; central to RKHS and functional analysis. Example: "We consider functional Bregman divergences on a Hilbert space, where the self-dual pairing and Riesz representer afford us particularly convenient calculus."
  • Jensen-Bregman divergence: A symmetrized divergence constructed from a convex generator via Jensen’s inequality, offering a symmetric alternative to standard Bregman divergences. Example: "We remark that it is in principle also possible to construct a Jensen-Bregman divergence (as in the finite vector case~\citep{nielsen2011burbea}), a type of symmetrisation of the Bregman divergence."
  • Kernel mean embedding (KME): A mapping of probability measures or functions into an RKHS by taking expectations of kernel feature maps, enabling nonparametric representation. Example: "where μ:FH\mu:\mathcal{F}\to\mathcal{H} and μ(f)H\mu(f) \in \mathcal{H} is the kernel mean embedding (KME) of ff"
  • Kernelised functional Bregman divergence (k-FBD): A functional Bregman divergence whose generator factors through a kernel mean embedding, allowing estimation via RKHS tools. Example: "From the perspective of estimation of the FBD, a particularly convenient subfamily is a kernelised FBD (k-FBD)."
  • Legendre-type generator: A convex function that is essentially smooth and essentially strictly convex, enabling strong duality and well-behaved gradients. Example: "If a generator Φ\Phi as in Definition~\ref{def:generator} additionally is essentially smooth and essentially strictly convex, then it is said to be of Legendre type."
  • Löwner ordering: A partial order on self-adjoint operators based on positive semidefiniteness; A ⪯ B if B−A is positive semidefinite. Example: "Construction of positive definite and positive semidefinite adjoint operators (via L\"owner ordering) on F2\mathcal{F}^2 are discussed in Example~13.18 of~\cite{bauschke2020}."
  • Mahalanobis distance: A quadratic form distance defined by a positive definite matrix/operator, often appearing as the only symmetric Bregman divergence in finite dimensions. Example: "It is well-known that in the finite-vector Bregman divergence setting, the only Bregman divergence which is symmetric is the squared Mahalanobis distance."
  • Maximum mean discrepancy (MMD): An integral probability metric defined as the RKHS norm between kernel mean embeddings of two distributions. Example: "is called the maximum mean discrepancy (MMD)~\citep{gretton2006kernel, gretton2012kernel, sriperumbudur2010hilbert}."
  • Metricisation: The process of transforming or symmetrizing a divergence into a true metric, often by adding correction terms to enforce triangle inequality and symmetry. Example: "Metricising the FBD allows for a host of opportunities to build non-Euclidean metric algorithms and applications."
  • Orlicz space: A function space generalizing Lp spaces using a convex Young function, used in information geometry and exponential families. Example: "A particularly important FBD is built on statistical manifold through an Orlicz space in~\cite{pistonesempi1995}."
  • Positive semidefinite operator: A self-adjoint linear operator T on a Hilbert space with ⟨x,Tx⟩ ≥ 0 for all x; generalizes PSD matrices. Example: "we require a notion of a positive semidefinite operator on a Hilbert space."
  • Quasi-arithmetic mean: A generalized mean defined by applying an invertible function, averaging, then applying the inverse; arises from dual coordinates of Legendre-type generators. Example: "In particular, gg^\star is the quasi-arithmetic mean of FF associated with the transform Φ\nabla\Phi."
  • Reproducing kernel Hilbert space (RKHS): A Hilbert space of functions where evaluation is continuous and represented by inner products with a kernel function. Example: "We make use of a reproducing kernel Hilbert space (RKHS) H\mathcal{H}"
  • Riesz representer theorem: The result that every continuous linear functional on a Hilbert space can be represented as an inner product with a unique element of the space. Example: "by the Riesz representer theorem, for every gFg^\ast \in \mathcal{F}^\ast there exists a unique gFg\in\mathcal{F}"
  • Schur complement: A matrix/operator construct used to characterize positive semidefiniteness of block operators and derive conditions like PSD via block decomposition. Example: "A sufficient condition on AA and BB can be derived from the Schur complement of MM"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 8 tweets with 430 likes about this paper.