CausalMixGPD

ex16. Clustering with Regular Kernel (Parameter Links + Covariates)

Website workflow note. This page reflects the current exported clustering API built around dpmix.cluster(), predict(), summary(), and plot(). Last updated: 2026-03-19.

For the package-level theory and longer narrative, see the manuscript and the clustering discussion in the main package articles.

Clustering with a Regular Kernel and Covariate-Dependent Parameter Links

Purpose: Fit a bulk-only clustering model in which covariates affect component-specific kernel parameters (type = "param"), while mixture weights remain global.

What you’ll learn

  • How dpmix.cluster() differs from outcome modeling: the goal is partitioning / labeling, not density prediction alone.
  • What type = "param" means: covariates shift within-cluster kernel parameters while mixture weights remain global.
  • How to use predict(..., type = "label" | "psm") plus summary()/plot() for cluster interpretation and stability.
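
To make the "psm" output concrete, here is a minimal base-R sketch of how a posterior similarity matrix can be computed from MCMC label draws. The matrix z_draws and the helper psm_from_draws() are hypothetical illustrations, not part of the package API, and the package's internal computation may differ.

```r
# Hypothetical label draws: rows = MCMC iterations, columns = observations.
# In a real fit these would come from the sampler; here they are made up.
z_draws <- rbind(
  c(1, 1, 2, 2),
  c(2, 2, 1, 1),  # same partition as draw 1, with the labels swapped
  c(1, 1, 1, 2),
  c(3, 3, 1, 1)
)

# PSM entry (i, j) = share of draws in which i and j carry the same label.
# Only co-membership is used, so label switching between draws is harmless.
psm_from_draws <- function(z) {
  n <- ncol(z)
  psm <- matrix(0, n, n)
  for (s in seq_len(nrow(z))) {
    psm <- psm + outer(z[s, ], z[s, ], "==")
  }
  psm / nrow(z)
}

psm <- psm_from_draws(z_draws)
psm
```

Observations 1 and 2 share a label in every draw, so their PSM entry is 1, even though the label itself changes from draw to draw.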

When to use this template

  • You want covariates to explain how clusters behave internally (component parameters), not just reweight cluster membership.
  • You need train/test labeling where new points map back to a representative training partition.

Next steps

  • Compare against the type = "weights" clustering mode (ex15) when covariates should drive cluster membership rather than component parameters.

Data Setup

Code
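library(CausalMixGPD)  # assumed package name, taken from the site title
library(tibble)        # tibble()
library(ggplot2)       # ggplot(), used for the scatter plot below

data("nc_posX100_p3_k2")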
dat <- data.frame(y = nc_posX100_p3_k2$y, nc_posX100_p3_k2$X)

train_id <- seq_len(70)
train_dat <- dat[train_id, , drop = FALSE]
test_dat <- dat[-train_id, , drop = FALSE]

summary_tbl <- tibble(
  split = c("train", "test"),
  n = c(nrow(train_dat), nrow(test_dat)),
  y_mean = c(mean(train_dat$y), mean(test_dat$y)),
  y_sd = c(stats::sd(train_dat$y), stats::sd(test_dat$y)),
  x1_mean = c(mean(train_dat$x1), mean(test_dat$x1))
)

summary_tbl
# A tibble: 2 × 5
  split     n y_mean  y_sd x1_mean
  <chr> <int>  <dbl> <dbl>   <dbl>
1 train    70   3.31  2.19 -0.0869
2 test     30   3.78  2.86  0.240 
Code
ggplot(train_dat, aes(x = x1, y = y, color = x2)) +
  geom_point(alpha = 0.65) +
  labs(
    title = "Training sample for bulk-only clustering",
    subtitle = "Covariates drive kernel parameters via type = 'param'",
    x = "x1",
    y = "y",
    color = "x2"
  ) +
  theme_minimal()


Fit a Bulk-Only Clustering Model

Code
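# load_or_fit() and mcmc are not defined on this page; they are assumed to come
# from the site's setup chunk (a fit-caching helper and the MCMC settings).
fit_cluster_param <- load_or_fit(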
  "ex16-clustering-dpm-param-fit_cluster_param",
  dpmix.cluster(
    y ~ x1 + x2 + x3,
    data = train_dat,
    kernel = "normal",
    type = "param",
    components = 8,
    mcmc = mcmc
  )
)

summary(fit_cluster_param, vars = c("y", "x1", "x2"))
$K_star
[1] 1

$cluster_sizes

 1 
70 

$cluster_profiles
  cluster  n   y_mean     y_sd     x1_mean     x1_sd     x2_mean     x2_sd
1      C1 70 3.312788 2.190948 -0.08687103 0.9558666 -0.04528296 0.5553962
  certainty_mean certainty_sd
1              1            0

$certainty
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

$source
[1] "train"

$burnin
[1] 0

$thin
[1] 1

attr(,"class")
[1] "summary.dpmixgpd_cluster_fit" "list"                        

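Here the formula still defines the full covariate structure, but type = "param" means the model uses those predictors to shift component-specific kernel parameters rather than the cluster weights. This matches the parameter-link clustering formulation described in the manuscript. In this run the summary reports K_star = 1: all 70 training points fall into a single representative cluster with certainty 1, so the C1 profile simply reproduces the training-sample moments (y_mean 3.31, y_sd 2.19).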
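
As a rough illustration of the parameter-link idea, the sketch below evaluates a two-component normal mixture in which a covariate shifts each component mean while the weights stay fixed. The linear mean link, the single covariate, and every number are assumptions made for illustration; the package may link other kernel parameters or use a different link function.

```r
# Conditional density under an assumed parameter-link normal mixture:
#   f(y | x1) = sum_k w_k * Normal(y; b0_k + b1_k * x1, sd_k)
# The weights w do not depend on x1; only the component means do.
# All numbers are illustrative, not estimates from the fit above.
w    <- c(0.6, 0.4)          # global mixture weights
beta <- rbind(c(1.0,  0.5),  # component 1: intercept, slope on x1
              c(4.0, -1.0))  # component 2: intercept, slope on x1
sds  <- c(0.8, 1.5)          # component standard deviations

mix_density_param <- function(y, x1, w, beta, sds) {
  mu <- beta[, 1] + beta[, 2] * x1  # one mean per component at this x1
  sum(w * dnorm(y, mean = mu, sd = sds))
}

# Posterior component membership for one observation (Bayes' rule).
membership_param <- function(y, x1, w, beta, sds) {
  mu <- beta[, 1] + beta[, 2] * x1
  p <- w * dnorm(y, mean = mu, sd = sds)
  p / sum(p)
}

mix_density_param(y = 2, x1 = 0.5, w, beta, sds)
membership_param(y = 2, x1 = 0.5, w, beta, sds)
```

Because w is constant, any change in membership probabilities as x1 varies comes only from the shifted component means. That is the contrast with a weights-type model, where the covariates would enter through w instead.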


Training Labels and Cluster Summaries

Code
cluster_psm_param <- predict(fit_cluster_param, type = "psm")
cluster_train_param <- predict(fit_cluster_param, type = "label", return_scores = TRUE)

fit_summary_param <- summary(
  fit_cluster_param,
  vars = c("y", "x1", "x2"),
  top_n = 5
)

train_sizes_tbl <- tibble(
  cluster = paste0("C", names(fit_summary_param$cluster_sizes)),
  n = as.integer(fit_summary_param$cluster_sizes)
)

train_sizes_tbl
# A tibble: 1 × 2
  cluster     n
  <chr>   <int>
1 C1         70
Code
plot(fit_cluster_param, which = "summary", top_n = 5)

Code
plot(cluster_train_param, type = "sizes", top_n = 5)

Code
plot(cluster_psm_param)

The fit-level and label-level plotting methods expose the same representative partition from different angles: size summaries, response summaries, and the full posterior similarity matrix.
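
This page does not say how the representative partition is chosen. One common approach, shown here only as an assumed stand-in, is a least-squares criterion: pick the sampled partition whose co-clustering matrix is closest to the posterior similarity matrix. The sketch also derives a simple per-observation certainty score. All objects are hypothetical, and the package's own rule and certainty definition may differ.

```r
# Hypothetical label draws (rows = MCMC iterations, columns = observations).
z_draws <- rbind(
  c(1, 1, 2, 2, 2),
  c(2, 2, 1, 1, 1),
  c(1, 1, 1, 2, 2),
  c(1, 1, 2, 2, 2)
)

# 0/1 co-clustering matrix for a single draw.
coclust <- function(z) outer(z, z, "==") * 1

# Posterior similarity matrix: average co-clustering over draws.
psm <- Reduce(`+`, lapply(seq_len(nrow(z_draws)),
                          function(s) coclust(z_draws[s, ]))) / nrow(z_draws)

# Least-squares criterion: squared distance of each draw's matrix to the PSM.
loss <- apply(z_draws, 1, function(z) sum((coclust(z) - psm)^2))
z_rep <- z_draws[which.min(loss), ]

# Relabel clusters 1..K in order of first appearance.
z_rep <- match(z_rep, unique(z_rep))

# Certainty for observation i: mean PSM value between i and the members of
# its representative cluster (including i itself).
certainty <- vapply(seq_along(z_rep), function(i) {
  mean(psm[i, z_rep == z_rep[i]])
}, numeric(1))

z_rep
round(certainty, 3)
```

Observation 3 switches sides in one of the four draws, so it gets the lowest certainty, while observations 1 and 2 always co-cluster and score 1.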


Predict Labels for Held-Out Data

Code
cluster_new_param <- predict(
  fit_cluster_param,
  newdata = test_dat,
  type = "label",
  return_scores = TRUE
)

new_summary_param <- summary(
  cluster_new_param,
  vars = c("y", "x1", "x2"),
  top_n = 5
)

new_sizes_tbl <- tibble(
  cluster = paste0("C", names(new_summary_param$cluster_sizes)),
  n = as.integer(new_summary_param$cluster_sizes)
)

new_sizes_tbl
# A tibble: 1 × 2
  cluster     n
  <chr>   <int>
1 C1         30
Code
plot(cluster_new_param, type = "summary", top_n = 5)

Code
plot(cluster_new_param, type = "certainty")

For newdata, labels are returned in the space of the representative training clusters, so train and test assignments can be compared directly and the sampler's component-label switching stays hidden. Here all 30 held-out points map to C1, the only training cluster.
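
One way to see why label switching need not leak into held-out labels is to score each new point by how often it shares a component with the members of each representative training cluster. The sketch below is a hypothetical illustration of that idea in base R. The scoring rule and all objects are assumptions, not the package's documented behavior for return_scores = TRUE.

```r
# Representative training partition (hypothetical, 5 training points).
z_rep <- c(1, 1, 2, 2, 2)

# Per-draw component labels for training points; labels switch between draws.
z_train <- rbind(
  c(1, 1, 2, 2, 2),
  c(2, 2, 1, 1, 1),
  c(3, 3, 1, 1, 1)
)
# Per-draw component labels for 2 new points, in the same per-draw label space.
z_new <- rbind(
  c(1, 2),
  c(2, 1),
  c(3, 3)
)

# Score of new point j for representative cluster c: average over draws of the
# share of cluster c's training members that sit in the same component as j.
score_new <- function(z_rep, z_train, z_new) {
  clusters <- sort(unique(z_rep))
  scores <- matrix(0, ncol(z_new), length(clusters),
                   dimnames = list(NULL, paste0("C", clusters)))
  for (s in seq_len(nrow(z_train))) {
    for (j in seq_len(ncol(z_new))) {
      same <- z_train[s, ] == z_new[s, j]
      share <- tapply(same, z_rep, mean)[as.character(clusters)]
      scores[j, ] <- scores[j, ] + as.numeric(share)
    }
  }
  scores / nrow(z_train)
}

scores <- score_new(z_rep, z_train, z_new)
labels <- colnames(scores)[max.col(scores, ties.method = "first")]

scores
labels
```

Only within-draw co-membership enters the score, so the raw component numbers (1, 2, 3) never need to be aligned across draws.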


Takeaways

  • dpmix.cluster() is the current wrapper for bulk-only clustering.
  • type = "param" is appropriate when covariates should alter the within-cluster regression/parameter structure while mixture weights stay global.
  • The same three-step workflow applies as in the spliced example (ex15): fit, predict(..., type = "psm" | "label"), then summary() and plot().
  • This page closes the examples sequence; together with ex15 it covers the two main clustering modes exposed by the package website, type = "weights" and type = "param".

Workflow Navigation

  • Previous: ex15-clustering-dpmgpd-weights
  • Next: troubleshooting
  • Workflow index: Roadmap
  • Practical entry: Examples

Prereqs

  • Required packages and data for this page are listed in the setup chunks above.

Outputs

  • This page renders model fits, diagnostics, and summary artifacts generated by package APIs.

Interpretation

  • Canonical concept page: 02 Clustering Extension
  • Treat this page as an application/example view and use the canonical page for core definitions.

Next

  • Continue to the linked canonical concept page, then return for implementation-specific details.
(c) CausalMixGPD - Bayesian semiparametric modeling for heavy-tailed data