phyclust+paml.baseml -- Combinations of phyloclustering and maximum likelihood tree.
In the following, we demonstrate the phyclust+paml.baseml
approach
to build a phylogeny for central/ancestral sequences. The phyclust
will identify clusters and centers of clusters, then followed by
paml.baseml
to construct a tree by maximum likelihood.
The paml.baseml
is a port from the baseml
function
of PAML package. This two-stages approach
(phyclust+paml.baseml
)
is particularly useful for large datasets, and potentially more efficient than
the single-stage approach using
paml.baseml
on all sequences directly.
Note that the appropriate number of clusters and the best evolutionary model
for the phylogeny are required to be determined by either information criteria
or bootstrap.
Opening questions:
- Is this two-stages approach meaningful? When does this approach work or fail?
- Is it possible to compare this two-stages approach with the single-stage approach? How? By simulation?
- How accurate are these two approaches compare to the true tree based on simulation studies?
- Is it possible or accurate to apply the single-stage approach on the application pages of HIV which has more than 5,000 short-read fragments?
- How about the three-stages approach? See the application page of the application page of Supertree for detail.
- How about Bayesian framework to the above questions?
- Can we control the evolving viruses by drugs?
The following code will generate a tree which is close to the true.
This tree based on maximum likelihood approach is different to the last
figure in the
demo
example based on distance approach.
Note that PAML uses multifurcation at root for an unrooted tree as
the default value clock = 0
.
> ### Fit a EE, JC69 model using emEM and fit an ancestral tree.
> EMC <- .EMControl(init.procedure = "emEM")
> set.seed(1234)
> K <- 4
> ret.K <- phyclust(seq.data.toy$org, K, EMC = EMC)
> (ret.Mu <- paml.baseml(ret.K$Mu, opts = paml.baseml.control()))
((1: 0.005025, 3: 0.000004): 0.015156, 2: 0.000004, 4: 0.005025);
> (tree.est <- read.tree(text = ret.Mu$best.tree))
Phylogenetic tree with 4 tips and 2 internal nodes.
Tip labels:
[1] "1" "3" "2" "4"
Unrooted; includes branch lengths.
>
> ### Construct descent trees.
> for(k in 1:K){
+ tree.dec <- gen.star.tree(ret.K$n.class[k], total.height = ret.K$QA$Tt)
+ tree.est <- bind.tree(tree.est, tree.dec,
+ where = which(tree.est$tip.label == as.character(k)))
+ }
> est.class <- rep(1:K, ret.K$n.class)
> plotnj(unroot(tree.est), X.class = est.class, main = "tree of sequences")

The colored branches are estimated by phyclust
, and each
color represents a cluster characterized by a central sequences.
Using these four central sequences, the internal branches are
constructed and estimated by paml.baseml
.
This is exactly the same topology where this toy data generated from
except the length of branches may not estimate accurately.
While the last figure in the demo
example based on distance approach gives the other illustration for the
same dataset.