Theory: Predictor-dependent clustering + clustering outputs
This page explains the theory behind the package’s clustering extension, where conditional density modeling induces a latent partition of observations.
Predictor-dependent Bayesian nonparametric mixtures
Conditional density modeling can be viewed as learning the collection \(\{f(\cdot\mid x):x\in\mathcal{X}\}\). A dependent Dirichlet process (DDP) perspective represents this collection using predictor-indexed random mixing measures \(\{G_x:x\in\mathcal{X}\}\).
A generic mixture representation is
\[ f(y\mid x) =\int k\!\left(y\mid x,\theta\right)\,dG_x(\theta), \]
where \(k(\cdot\mid x,\theta)\) is a kernel family and \(G_x\) mixes kernel parameters.
When \(G_x\) is discrete (as under stick-breaking / CRP-type constructions), distinct observations can draw the same latent kernel parameter, which induces a random partition: observations sharing a parameter form a cluster. Introduce latent cluster-label variables \(z_i\) and component parameters \(\{\theta_j\}_{j\ge 1}\) so that
\[ y_i \mid z_i, \{\theta_j\}, x_i \sim k\!\left(\cdot\mid x_i,\theta_{z_i}\right). \]
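To make the label construction concrete, the following is a minimal generative sketch of a truncated stick-breaking mixture. The Gaussian kernel, truncation level J, and hyperparameters are illustrative assumptions, not the package's defaults, and predictor dependence is omitted for brevity:

```python
import random

random.seed(0)

def stick_breaking_weights(alpha, J):
    """Truncated stick-breaking: w_j = v_j * prod_{l<j} (1 - v_l)."""
    weights, remaining = [], 1.0
    for _ in range(J):
        v = random.betavariate(1.0, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights[-1] += remaining  # absorb leftover mass so weights sum to 1
    return weights

def sample_mixture(n, alpha=1.0, J=20):
    """Draw (y_i, z_i) from a truncated DP mixture of Gaussians (illustrative)."""
    w = stick_breaking_weights(alpha, J)
    theta = [random.gauss(0.0, 3.0) for _ in range(J)]   # component parameters
    z = [random.choices(range(J), weights=w)[0] for _ in range(n)]
    y = [random.gauss(theta[zi], 0.5) for zi in z]
    return y, z

y, z = sample_mixture(50)
# ties among the z_i induce a random partition of the 50 observations
```

Because \(G_x\) is discrete, several observations typically share the same \(\theta_{z_i}\), so the number of distinct labels in `z` is far smaller than `n`.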
Posterior clustering summaries can then be expressed in a label-invariant way using co-clustering (e.g., posterior similarity) functionals.
In the package’s clustering theory, predictor dependence enters through the conditional series representation
\[ f(y\mid x) =\sum_{j=1}^{\infty} w_j(x)\,k\!\left(y\mid x,\theta_j\right), \]
with \(w_j(x)\ge 0\) and \(\sum_j w_j(x)=1\). Three common dependence modes are:
Weight dependence only (atoms shared across \(x\)):
\[ f(y\mid x)=\sum_{j=1}^{\infty} w_j(x)\,k(y\mid \theta_j). \]
Atom dependence only (weights shared across \(x\)):
\[ f(y\mid x)=\sum_{j=1}^{\infty} w_j\,k(y\mid x,\theta_j). \]
Weight and atom dependence (both vary with predictors):
\[ f(y\mid x)=\sum_{j=1}^{\infty} w_j(x)\,k(y\mid x,\theta_j). \]
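As a concrete illustration of the weight-dependence-only mode, this sketch builds \(w_j(x)\) with a logistic stick-breaking construction in the spirit of Ren et al. (2011). The linear predictor and coefficient values are illustrative assumptions, not the package's parameterization:

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logistic_stick_weights(x, coefs, J):
    """Predictor-dependent weights w_j(x) via logistic stick-breaking:
    the j-th stick is v_j(x) = sigmoid(a_j + b_j * x); atoms theta_j
    would stay shared across x in the weight-dependence-only mode."""
    weights, remaining = [], 1.0
    for j in range(J):
        a, b = coefs[j]
        v = sigmoid(a + b * x) if j < J - 1 else 1.0  # last stick takes the rest
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights

# illustrative coefficients: component 0 dominates for small x, fades for large x
coefs = [(2.0, -1.5), (0.0, 1.0), (0.0, 0.0)]
w_low = logistic_stick_weights(-2.0, coefs, 3)
w_high = logistic_stick_weights(3.0, coefs, 3)
```

Forcing the last stick to one guarantees \(\sum_j w_j(x)=1\) exactly at every \(x\), mirroring the truncation argument used for finite series representations.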
Posterior similarity and representative partitions
A central label-invariant summary is the posterior similarity matrix (PSM)
\[ S_{i\ell}=\Pr(z_i=z_\ell\mid \text{data}),\qquad i,\ell=1,\ldots,n, \]
estimated from MCMC output as Monte Carlo averages.
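The Monte Carlo estimate is just the fraction of posterior draws in which two observations share a label. A minimal sketch with toy label draws (not package output):

```python
def posterior_similarity(label_draws):
    """Monte Carlo estimate of S[i][l] = Pr(z_i == z_l | data) from a list
    of sampled label vectors, one per MCMC draw.  Label-invariant: only
    ties matter, so relabeling across draws is harmless."""
    n = len(label_draws[0])
    m = len(label_draws)
    S = [[0.0] * n for _ in range(n)]
    for z in label_draws:
        for i in range(n):
            for l in range(n):
                if z[i] == z[l]:
                    S[i][l] += 1.0
    return [[s / m for s in row] for row in S]

# three toy draws over five observations; component indices permute across draws
draws = [[0, 0, 1, 1, 2],
         [5, 5, 3, 3, 3],
         [1, 1, 2, 2, 2]]
S = posterior_similarity(draws)
# S[0][1] == 1.0 (always together); S[3][4] == 2/3 (together in 2 of 3 draws)
```

Note that the raw component indices differ across the three draws, yet the estimate is unaffected, which is exactly why the PSM is the preferred uncertainty summary.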
Because component labels can permute across draws, a representative partition is obtained by selecting a sampled partition whose adjacency matrix is closest to the PSM under squared loss. This yields a stable partition estimate while retaining the PSM as the primary uncertainty summary.
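This least-squares selection can be sketched directly; the toy draws below stand in for MCMC output, and the criterion follows the squared-loss rule described above:

```python
def psm(draws):
    """Posterior similarity matrix from sampled label vectors."""
    n, m = len(draws[0]), len(draws)
    return [[sum(z[i] == z[l] for z in draws) / m for l in range(n)]
            for i in range(n)]

def representative_partition(draws):
    """Among the sampled partitions, pick the one whose adjacency
    (co-clustering) matrix is closest to the PSM under squared loss."""
    S = psm(draws)
    n = len(S)
    def loss(z):
        return sum((float(z[i] == z[l]) - S[i][l]) ** 2
                   for i in range(n) for l in range(n))
    return min(draws, key=loss)

draws = [[0, 0, 1, 1, 2],
         [5, 5, 3, 3, 3],
         [1, 1, 2, 2, 2]]
z_star = representative_partition(draws)
# z_star groups {0, 1} and {2, 3, 4}: the majority partition wins
```

Because the winner is itself a sampled partition, it is always attainable under the model, unlike summaries built by thresholding the PSM directly.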
The clustering extension as a dedicated workflow
The package’s clustering wrappers fit the same underlying mixture model as the density models, but return clustering summaries derived from the posterior distribution of latent cluster labels.
- Clustering uses the user-specified response, optional covariates, kernel family, and whether the likelihood is bulk-only or spliced bulk-tail.
- It runs under the same MCMC framework as conditional / causal models.
- Two main clustering entry points exist:
  - dpmix.cluster(): bulk-only mixtures
  - dpmgpd.cluster(): spliced mixtures with a GPD tail, so extremes can influence the induced partition
The key modeling knobs include the dependence mode (often expressed through a type choice) and how parameters are linked to covariates.
Cluster-specific summaries and representative cluster labels
For presentation, clustering inference focuses on the representative partition and the response distributions within its representative clusters rather than on the full PSM heatmap.
Training cluster labels for the fitted clustering model can be extracted through the package’s label prediction interface (internally based on the representative partition construction).
Assigning labels to new observations
New observations are assigned using the same prediction interface, producing a label in the space of representative training clusters.
Conceptually, this is done by aggregating posterior predictive co-clustering evidence at the cluster level, so cluster assignment is driven by posterior similarity rather than by raw component indices.
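The aggregation step can be sketched as follows. This is an illustrative reconstruction of the idea, not the package's internal code: for each representative cluster, average the Monte Carlo probability that the new point shares a component with that cluster's members, then take the argmax.

```python
def assign_new(draws_train, draws_new, z_rep):
    """Assign a new observation to a representative training cluster by
    aggregating posterior co-clustering evidence (illustrative sketch).
    draws_train[t][i]: label of training point i at MCMC draw t;
    draws_new[t]: label of the new point at draw t (same component indexing);
    z_rep: representative partition of the training points."""
    n, m = len(z_rep), len(draws_train)
    # Pr(new point shares a component with training point i), per i
    p = [sum(draws_train[t][i] == draws_new[t] for t in range(m)) / m
         for i in range(n)]
    clusters = sorted(set(z_rep))
    score = {c: sum(p[i] for i in range(n) if z_rep[i] == c) /
                sum(1 for i in range(n) if z_rep[i] == c)
             for c in clusters}
    return max(score, key=score.get)

draws_train = [[0, 0, 1, 1], [2, 2, 3, 3]]
draws_new = [1, 3]           # the new point tracks the second block in both draws
z_rep = [0, 0, 1, 1]
label = assign_new(draws_train, draws_new, z_rep)  # -> 1
```

Because the score is built from co-clustering frequencies rather than raw component indices, it is invariant to label switching across draws, matching the label-invariant treatment used for the training partition.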
References (key)
- Quintana & Müller (2022), The Dependent Dirichlet Process and Related Models — doi:10.1214/20-STS819
- Ren et al. (2011), The Logistic Stick-Breaking Process — https://www.jmlr.org/papers/v12/ren11a.html
- MacEachern (1999), Dependent Nonparametric Processes — https://u.osu.edu/maceachern.1/files/2025/05/1999-MacEachern-JSM-Proceedings.pdf
- MacEachern (2000), Dependent Dirichlet Processes — https://u.osu.edu/maceachern.1/files/2025/05/2000-MacEachern-DDP-Tech-Report.pdf
Prereqs
- Required packages and data for this page are listed in the setup chunks above.
Outputs
- This page renders model fits, diagnostics, and summary artifacts generated by package APIs.
Interpretation
- Canonical concept page: Index
- Treat this page as an application/example view and use the canonical page for core definitions.
Next
- Continue to the linked canonical concept page, then return for implementation-specific details.