phyclust+paml.baseml -- Combinations of phyloclustering and maximum likelihood tree.

In the following, we demonstrate the phyclust+paml.baseml approach to build a phylogeny for central/ancestral sequences. The phyclust will identify clusters and centers of clusters, then followed by paml.baseml to construct a tree by maximum likelihood. The paml.baseml is a port from the baseml function of PAML package. This two-stages approach (phyclust+paml.baseml) is particularly useful for large datasets, and potentially more efficient than the single-stage approach using paml.baseml on all sequences directly. Note that the appropriate number of clusters and the best evolutionary model for the phylogeny are required to be determined by either information criteria or bootstrap.

Opening questions:

Is this two-stages approach meaningful? When does this approach work or fail?
Is it possible to compare this two-stages approach with the single-stage approach? How? By simulation?
How accurate are these two approaches compare to the true tree based on simulation studies?
Is it possible or accurate to apply the single-stage approach on the application pages of HIV which has more than 5,000 short-read fragments?
How about the three-stages approach? See the application page of the application page of Supertree for detail.
How about Bayesian framework to the above questions?
Can we control the evolving viruses by drugs?

The following code will generate a tree which is close to the true. This tree based on maximum likelihood approach is different to the last figure in the demo example based on distance approach. Note that PAML uses multifurcation at root for an unrooted tree as the default value clock = 0.

> ### Fit a EE, JC69 model using emEM and fit an ancestral tree.
> EMC <- .EMControl(init.procedure = "emEM")
> set.seed(1234)
> K <- 4
> ret.K <- phyclust(seq.data.toy$org, K, EMC = EMC)
> (ret.Mu <- paml.baseml(ret.K$Mu, opts = paml.baseml.control()))
((1: 0.005025, 3: 0.000004): 0.015156, 2: 0.000004, 4: 0.005025);
> (tree.est <- read.tree(text = ret.Mu$best.tree))

Phylogenetic tree with 4 tips and 2 internal nodes.

Tip labels:
[1] "1" "3" "2" "4"

Unrooted; includes branch lengths.
> 
> ### Construct descent trees.
> for(k in 1:K){
+   tree.dec <- gen.star.tree(ret.K$n.class[k], total.height = ret.K$QA$Tt)
+   tree.est <- bind.tree(tree.est, tree.dec,
+                           where = which(tree.est$tip.label == as.character(k)))
+ }
> est.class <- rep(1:K, ret.K$n.class)
> plotnj(unroot(tree.est), X.class = est.class, main = "tree of sequences")

The colored branches are estimated by phyclust, and each color represents a cluster characterized by a central sequences. Using these four central sequences, the internal branches are constructed and estimated by paml.baseml. This is exactly the same topology where this toy data generated from except the length of branches may not estimate accurately. While the last figure in the demo example based on distance approach gives the other illustration for the same dataset.