ms+seqgen+phyclust -- Bootstrap to assess the number of clusters.
Model selection includes identifying the number of clusters, evolutionary models, and phyloclustering assumptions. We could use information criteria to choose among models, but the parameter space is mixed continuous and discrete, so the theory justifying the criteria does not apply. A more elaborate procedure to assess the number of clusters for a data set is based on the parametric bootstrap technique and sequential hypothesis testing (Maitra and Melnykov 2010).
The basic idea is to resample data sets from the fitted model using the
functions ms() and seqgen(), and refit each resampled data set with
phyclust(). The same fitting method is applied to every resampled data
set, producing a distribution of parameter estimates.
The bootstrap.seq.data() function is a tool for this procedure;
it takes a fitted model, i.e., the output of a previous call to
phyclust().
The following example bootstraps the toy data set once assuming
$K = 2$ clusters.
> set.seed(1234)
> X <- seq.data.toy$org
> ret.4 <- phyclust(X, 2)
> (seq.data.toy.new <- bootstrap.seq.data(ret.4))
code.type: NUCLEOTIDE, n.seq: 100, seq.len: 200.
> (ret.4.new <- phyclust(seq.data.toy.new$org, 2))
Phyclust Results:
code type: NUCLEOTIDE, em method: EM, boundary method: ADJUST.
init procedure: exhaustEM, method: randomMu.
model substitution: JC69, distance: D_JC69.
iter: 30 2947 0, convergence: 0, check.param: 1.
eps: 1.41e-14, error: 0.
N.X.org: 100, N.X.unique: 56, L: 200, K: 2, p: 402, N.seg.site: 69.
logL: -685.3, bic: 3222, aic: 2175, icl: 3222
identifier: EE
Eta: 0.48 0.52
Tt: 0.001354
n.class: 48 52
We need to repeat these steps several times to obtain
a distribution of parameter estimates.
For example, the following code uses B = 10
bootstrap samples to obtain the distributions of the log-likelihood for
$K = 1$ and $K = 2$,
from which we can perform a likelihood ratio test to assess the number of clusters.
> ### This code may take a long time to run.
> set.seed(1234)
> X <- seq.data.toy$org
> ret.K1 <- find.best(X, 1)
> ret.K2 <- find.best(X, 2)
> (logL.diff <- ret.K2$logL - ret.K1$logL)
[1] 663.2635
> set.seed(1234)
> B <- 10
> boot.logL.diff <- NULL
> for(b in 1:B){
+   seq.data.toy.new <- bootstrap.seq.data(ret.K1)
+   ret.H0 <- find.best(seq.data.toy.new$org, 1)
+   ret.H1.H0 <- find.best(seq.data.toy.new$org, 2)
+   boot.logL.diff <- c(boot.logL.diff, ret.H1.H0$logL - ret.H0$logL)
+ }
> boot.logL.diff
[1] 20.17399 24.94449 25.50644 24.30299 25.94984 19.03801 19.20099 19.40565
[9] 18.88133 37.16275
> sum(logL.diff < boot.logL.diff)
[1] 0
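The exceedance count above can be summarized as an approximate Monte Carlo p-value. The following sketch assumes the objects logL.diff, boot.logL.diff, and B from the code above, and uses the standard (1 + #exceedances) / (1 + B) estimator; adding 1 to the numerator and denominator avoids reporting a p-value of exactly zero from a finite number of bootstrap samples.

```r
> ### Approximate p-value for H0: K = 1 versus H1: K = 2,
> ### based on the B = 10 bootstrapped log-likelihood differences above.
> (p.value <- (1 + sum(boot.logL.diff >= logL.diff)) / (1 + B))
[1] 0.09090909
```

With only B = 10 samples the smallest attainable p-value is 1/11; a larger B gives a finer-grained and more reliable assessment.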
Here, the difference in log-likelihood values, logL.diff,
between the models with $K = 1$ and $K = 2$ is 663.2635.
We bootstrap from the fitted model with $K = 1$,
and 0 out of 10 bootstrapped differences exceed the observed difference,
indicating that $K = 2$ performs better than $K = 1$.
As B increases, we can be more confident in rejecting $K = 1$.