ex16. Clustering with Regular Kernel (Parameter Links + Covariates)

Website workflow note. This page reflects the current exported clustering API built around dpmix.cluster(), predict(), summary(), and plot(). Last updated: 2026-03-19.

For the package-level theory and longer narrative, see the manuscript and the clustering discussion in the main package articles.

Clustering with a Regular Kernel and Covariate-Dependent Parameter Links

Purpose: Fit a bulk-only clustering model in which covariates affect component-specific kernel parameters (type = "param"), while mixture weights remain global.

What you’ll learn

How dpmix.cluster() differs from outcome modeling: the goal is partitioning / labeling, not density prediction alone.
What type = "param" means: covariates shift within-cluster kernel parameters while mixture weights remain global.
How to use predict(..., type = "label" | "psm") plus summary()/plot() for cluster interpretation and stability.
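As a schematic of that distinction, the two clustering modes can be written as follows. This is only an illustrative sketch: it assumes a normal kernel with K components and a linear link on the component mean, and the symbols (w_k, \mu_k, \beta_k, \sigma_k) are notation introduced here, not taken from the package or the manuscript.

```latex
% type = "param": global weights, covariate-dependent kernel parameters
f(y \mid x) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left(y \mid \mu_k(x), \sigma_k^2\right),
\qquad \mu_k(x) = \beta_{0k} + x^\top \beta_k

% type = "weights": covariate-dependent weights, fixed kernel parameters
f(y \mid x) = \sum_{k=1}^{K} w_k(x) \, \mathcal{N}\!\left(y \mid \mu_k, \sigma_k^2\right)
```

In the first form the covariates move each component's location, while the probability of belonging to a component does not depend on x; in the second form the roles are reversed.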

When to use this template

You want covariates to explain how clusters behave internally (component parameters), not just reweight cluster membership.
You need train/test labeling where new points map back to a representative training partition.

Next steps

Compare against the type = "weights" clustering mode (ex15) when covariates should drive cluster membership rather than component parameters.

Data Setup

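# tibble() and ggplot() are used below. The clustering package that ships
# nc_posX100_p3_k2 is assumed to be attached by the page's setup chunk.
library(tibble)
library(ggplot2)

data("nc_posX100_p3_k2")
dat <- data.frame(y = nc_posX100_p3_k2$y, nc_posX100_p3_k2$X)

# Deterministic split: first 70 rows for training, remaining 30 for testing.
train_id <- seq_len(70)
train_dat <- dat[train_id, , drop = FALSE]
test_dat <- dat[-train_id, , drop = FALSE]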

summary_tbl <- tibble(
  split = c("train", "test"),
  n = c(nrow(train_dat), nrow(test_dat)),
  y_mean = c(mean(train_dat$y), mean(test_dat$y)),
  y_sd = c(stats::sd(train_dat$y), stats::sd(test_dat$y)),
  x1_mean = c(mean(train_dat$x1), mean(test_dat$x1))
)

summary_tbl

# A tibble: 2 × 5
  split     n y_mean  y_sd x1_mean
  <chr> <int>  <dbl> <dbl>   <dbl>
1 train    70   3.31  2.19 -0.0869
2 test     30   3.78  2.86  0.240

ggplot(train_dat, aes(x = x1, y = y, color = x2)) +
  geom_point(alpha = 0.65) +
  labs(
    title = "Training sample for bulk-only clustering",
    subtitle = "Covariates drive kernel parameters via type = 'param'",
    x = "x1",
    y = "y",
    color = "x2"
  ) +
  theme_minimal()

Fit a Bulk-Only Clustering Model

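# load_or_fit() and mcmc are not defined on this page; they are assumed to come
# from the site's setup chunk (a fit-caching helper and the MCMC settings).
fit_cluster_param <- load_or_fit(
  "ex16-clustering-dpm-param-fit_cluster_param",
  dpmix.cluster(
    y ~ x1 + x2 + x3,
    data = train_dat,
    kernel = "normal",
    type = "param",   # covariates enter the kernel parameters, not the weights
    components = 8,
    mcmc = mcmc
  )
)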

summary(fit_cluster_param, vars = c("y", "x1", "x2"))

$K_star
[1] 1

$cluster_sizes

 1 
70 

$cluster_profiles
  cluster  n   y_mean     y_sd     x1_mean     x1_sd     x2_mean     x2_sd
1      C1 70 3.312788 2.190948 -0.08687103 0.9558666 -0.04528296 0.5553962
  certainty_mean certainty_sd
1              1            0

$certainty
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

$source
[1] "train"

$burnin
[1] 0

$thin
[1] 1

attr(,"class")
[1] "summary.dpmixgpd_cluster_fit" "list"

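Here the formula still defines the full covariate structure, but type = "param" means the model uses those predictors to shift component-specific kernel parameters rather than the cluster weights. This matches the parameter-link clustering formulation described in the manuscript. In this run the representative partition contains a single cluster (K_star = 1) that holds all 70 training points with certainty 1, so the cluster profile coincides with the training-sample summary from the data setup (y mean 3.31, x1 mean -0.0869).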
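To make the parameter-link idea concrete, here is a small base-R simulation. It is a sketch under stated assumptions, not the package's generative model: the two clusters, the weights, and the intercepts and slopes are all hypothetical values chosen for illustration.

```r
# Toy simulation (base R only): two clusters with global weights and
# cluster-specific regressions of y on a single covariate x.
set.seed(1)
n <- 200
w <- c(0.6, 0.4)        # global mixture weights, do not depend on x
beta0 <- c(0, 4)        # hypothetical cluster intercepts
beta1 <- c(1, -1)       # hypothetical cluster slopes
sigma <- c(0.5, 0.5)    # hypothetical cluster standard deviations

x <- rnorm(n)
z <- sample(1:2, size = n, replace = TRUE, prob = w)  # membership ignores x
y <- rnorm(n, mean = beta0[z] + beta1[z] * x, sd = sigma[z])

# Membership is independent of x by construction, but the within-cluster
# slope of y on x differs between the two clusters.
slopes <- sapply(1:2, function(k) coef(lm(y[z == k] ~ x[z == k]))[2])
round(unname(slopes), 2)
```

The recovered slopes have opposite signs, which is the kind of within-cluster behavior a parameter-link model is meant to capture; a weights-link model would instead let the probability of each z value vary with x.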

Training Labels and Cluster Summaries

cluster_psm_param <- predict(fit_cluster_param, type = "psm")
cluster_train_param <- predict(fit_cluster_param, type = "label", return_scores = TRUE)

fit_summary_param <- summary(
  fit_cluster_param,
  vars = c("y", "x1", "x2"),
  top_n = 5
)

train_sizes_tbl <- tibble(
  cluster = paste0("C", names(fit_summary_param$cluster_sizes)),
  n = as.integer(fit_summary_param$cluster_sizes)
)

train_sizes_tbl

# A tibble: 1 × 2
  cluster     n
  <chr>   <int>
1 C1         70

plot(fit_cluster_param, which = "summary", top_n = 5)

plot(cluster_train_param, type = "sizes", top_n = 5)

plot(cluster_psm_param)

The fit-level and label-level plotting methods expose the same representative partition from different angles: size summaries, response summaries, and the full posterior similarity matrix.
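For readers unfamiliar with the posterior similarity matrix, the following base-R sketch shows how one can be computed from sampled labels. The label draws are made up for illustration, and this is the generic textbook construction rather than the package's internal code.

```r
# Toy label draws: 4 posterior draws (rows) for 5 observations (columns).
# Draws 1 and 2 encode the same partition with the label names swapped.
z_draws <- rbind(
  c(1, 1, 2, 2, 2),
  c(2, 2, 1, 1, 1),
  c(1, 1, 1, 2, 2),
  c(1, 1, 2, 2, 2)
)

# psm[i, j] = fraction of draws in which observations i and j share a label.
n_obs <- ncol(z_draws)
psm <- matrix(0, n_obs, n_obs)
for (s in seq_len(nrow(z_draws))) {
  psm <- psm + outer(z_draws[s, ], z_draws[s, ], "==")
}
psm <- psm / nrow(z_draws)
psm
```

Because each entry only asks whether two observations share a label, swapping label names between draws leaves the matrix unchanged. That is why a PSM is a convenient basis for choosing a representative partition.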

Predict Labels for Held-Out Data

cluster_new_param <- predict(
  fit_cluster_param,
  newdata = test_dat,
  type = "label",
  return_scores = TRUE
)

new_summary_param <- summary(
  cluster_new_param,
  vars = c("y", "x1", "x2"),
  top_n = 5
)

new_sizes_tbl <- tibble(
  cluster = paste0("C", names(new_summary_param$cluster_sizes)),
  n = as.integer(new_summary_param$cluster_sizes)
)

new_sizes_tbl

# A tibble: 1 × 2
  cluster     n
  <chr>   <int>
1 C1         30

plot(cluster_new_param, type = "summary", top_n = 5)

plot(cluster_new_param, type = "certainty")

For newdata, the labels are returned in the space of the representative training clusters, which makes train/test comparison straightforward without exposing component-label switching from the sampler.
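Because both splits share the same cluster labels, a train/test comparison reduces to a join on the label. The sketch below uses base R and hard-codes the cluster sizes from the printed tables on this page; the helper names (train_sizes, new_sizes, cmp) are introduced here for illustration only.

```r
# Cluster sizes copied from the printed train and test size tables.
train_sizes <- data.frame(cluster = "C1", n = 70L)
new_sizes   <- data.frame(cluster = "C1", n = 30L)

# Full join on the shared label so that a cluster missing from one split
# would appear as NA, which is then set to zero.
cmp <- merge(train_sizes, new_sizes, by = "cluster", all = TRUE,
             suffixes = c("_train", "_test"))
cmp[is.na(cmp)] <- 0L

# Within-split shares make the two splits comparable despite different n.
cmp$share_train <- cmp$n_train / sum(cmp$n_train)
cmp$share_test  <- cmp$n_test / sum(cmp$n_test)
cmp
```

With the single-cluster partition found in this run both shares are trivially 1; the same join becomes informative once the representative partition has several clusters.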

Takeaways

dpmix.cluster() is the current wrapper for bulk-only clustering.
type = "param" is appropriate when covariates should alter the within-cluster kernel parameters (the regression structure inside each cluster) while the mixture weights stay global.
The same three-step workflow applies as in the spliced example: fit, predict(..., type = "psm" | "label"), then summary() and plot().
This closes the examples sequence with the two main clustering modes now exposed by the package website.

Prereqs

Required packages and data for this page are listed in the setup chunks above.

Outputs

This page renders model fits, diagnostics, and summary artifacts generated by package APIs.

Interpretation

Canonical concept page: 02 Clustering Extension
Treat this page as an application/example view and use the canonical page for core definitions.

Continue to the linked canonical concept page, then return for implementation-specific details.
