Phylogenetics

Mutation processes are usually modeled by an instantaneous transition rate matrix $\boldsymbol{Q}$ on the state space $\mathcal{S} = \{A, G, C, T \}$. A sequence $\boldsymbol{x}_n = (x_{n1},x_{n2},\ldots, x_{nL})$ consists of $L$ sites/loci taking values in the state space, and the sites $x_{nl}$ are assumed to mutate independently. According to continuous-time Markov chain (CTMC) theory, the probability that a state $s \in \mathcal{S}$ mutates to $x_{nl}$ in time $t$ is $P_{s, x_{nl}}(t) = \left[ e^{\boldsymbol{Q}t} \right]_{s, x_{nl}}$, the $(s, x_{nl})$ entry of the matrix exponential, which can serve as a distribution for the mutation processes of sequences. Moreover, sequences from different subpopulations may mutate differently, so their mutation processes can be modeled with different instantaneous transition rate matrices, different central sequences, and different time scales.
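
As a minimal sketch of the transition probabilities described above, the following Python code builds a rate matrix and approximates $e^{\boldsymbol{Q}t}$ with a truncated Taylor series. The Jukes-Cantor (JC69) form of $\boldsymbol{Q}$, the rate `alpha`, and all function names are assumptions made for illustration; the text does not fix a particular substitution model.

```python
import math

STATES = "AGCT"  # ordering of the state space S = {A, G, C, T}


def jc69_rate_matrix(alpha=1.0):
    """Rate matrix Q with every off-diagonal rate equal to alpha (JC69, an assumption)."""
    n = len(STATES)
    return [[-(n - 1) * alpha if i == j else alpha for j in range(n)]
            for i in range(n)]


def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=40):
    """Approximate P(t) = exp(Q t) by a truncated Taylor series.

    Adequate only when the entries of Q * t are small; a dedicated
    matrix-exponential routine is preferable in real use.
    """
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds (Q t)^m / m!
    for m in range(1, terms):
        term = [[v / m for v in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


P = transition_matrix(jc69_rate_matrix(alpha=1.0), t=0.1)
s, x = STATES.index("A"), STATES.index("G")
print(f"P_AA(0.1) = {P[s][s]:.6f}")  # probability that A is still A at time t
print(f"P_AG(0.1) = {P[s][x]:.6f}")  # probability that A has become G at time t
```

Under JC69 the result can be checked against the closed form $P_{ss}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$ and $P_{sx}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$ for $s \neq x$; each row of $P(t)$ sums to one.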

Model-Based Clustering

A finite mixture model provides a statistical framework for clustering. The multivariate normal distribution $MVN(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ is one of the most popular models. Imagine a huge discrete space with $L$ coordinates, where each coordinate has four axes/directions. As with $MVN(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where observations scatter around a mean, the sequences $\{\boldsymbol{x}_n\}_{n=1}^N$ are mutated from, or scattered around, some unknown ancestral/central sequences $\boldsymbol{\mu}_k$, and their dispersion in this space is governed by $\boldsymbol{Q}$ and $t$. Note that $\boldsymbol{\mu}_k$ may or may not be one of $\{\boldsymbol{x}_n\}_{n=1}^N$. More importantly, each site of $\boldsymbol{\mu}_k$ takes values in $\mathcal{S}$, and $\boldsymbol{\mu}_k$ can be a function of other explanatory variables of interest, as in a generalized linear model.

In a simplified case, we adopt the transition probability/density $f_k(\cdots)$ as the component distribution of the mixture and assume that all $K$ components are mixed with proportions $\eta_k$ and share the same $\boldsymbol{Q}$ and $t$, while the ancestral/central sequences $\boldsymbol{\mu}_k$ differ. Then, the essential model of phyloclustering is $$ L(\Theta | \boldsymbol{X}) = \prod_{n = 1}^N f(\boldsymbol{x}_n| \boldsymbol{\mu}, \boldsymbol{Q}, t) = \prod_{n = 1}^N \left[ \sum_{k = 1}^K \eta_k f_k(\boldsymbol{x}_n| \boldsymbol{\mu}_k, \boldsymbol{Q}, t) \right] $$ where $ \Theta = \{\eta_1, \eta_2, \ldots, \eta_K, \boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \ldots, \boldsymbol{\mu}_K, \boldsymbol{Q}, t\} $ are unknown parameters to be estimated by the EM algorithm. Each sequence is classified to the component with the maximum posterior probability.
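
The mixture likelihood and the maximum-posterior classification can be sketched as follows. This is an illustration only, not the EM implementation: the parameters are taken as known rather than estimated, the component density uses the JC69 closed form (an assumption, as are the toy sequences, the centers, and all function names), and sites are treated as independent as stated earlier.

```python
import math


def jc69_prob(a, b, t, alpha=1.0):
    """P_{a,b}(t) under JC69 (assumed substitution model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def component_density(x, mu, t):
    """f_k(x | mu_k, Q, t): product of site-wise transition probabilities."""
    p = 1.0
    for mu_l, x_l in zip(mu, x):
        p *= jc69_prob(mu_l, x_l, t)
    return p


def posteriors(x, etas, mus, t):
    """Posterior component probabilities and the mixture density of one sequence."""
    weighted = [eta * component_density(x, mu, t) for eta, mu in zip(etas, mus)]
    total = sum(weighted)  # sum_k eta_k f_k(x | mu_k, Q, t)
    return [w / total for w in weighted], total


def classify(seqs, etas, mus, t):
    """Return maximum-posterior labels and the log-likelihood log L(Theta | X)."""
    labels, loglik = [], 0.0
    for x in seqs:
        post, density = posteriors(x, etas, mus, t)
        labels.append(max(range(len(post)), key=post.__getitem__))
        loglik += math.log(density)
    return labels, loglik


# Hypothetical toy data: two centers and three observed sequences.
mus = ["AAAA", "GGGG"]
etas = [0.5, 0.5]
seqs = ["AAAG", "GGGA", "AAAA"]
labels, loglik = classify(seqs, etas, mus, t=0.1)
print(labels, round(loglik, 4))
```

In a full EM fit, the posteriors computed here would form the E-step, and the M-step would update $\eta_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{Q}$, and $t$.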

Haplotype Grouping

Haplotypes carry more information than a single locus for identifying population structure and finding disease genes. Directly extending the frequency idea from alleles to haplotypes is restricted by the computational complexity caused by the linkage disequilibrium assumption. In genetic and association studies, haplotype analysis is a powerful tool for narrowing down the location of disease genes. For SNP sequences, the state space in phyloclustering is $\mathcal{S} = \{1, 2 \}$. Several methods approach the problem from different angles; Hap-Clustering (Tzeng, 2005) is one of them, using evolutionary ideas to group haplotypes based on haplotype frequencies.

Sequencing Error Modeling

Next-generation sequencing techniques produce high-throughput genome sequencing, broaden research possibilities, and stimulate scientific curiosity. However, high sequencing error rates affect the clustering results as well as haplotype identification. Therefore, estimates of population structure, community diversity, and selection may be biased. The errors can also be modeled and incorporated into model-based clustering as follows: $$ L(\Theta | \boldsymbol{X}) = \prod_{n = 1}^N f(\boldsymbol{x}_n| \boldsymbol{\mu}, \boldsymbol{Q}, t) = \prod_{n = 1}^N \left[ \sum_{k = 1}^K \eta_k (\boldsymbol{f}_k(\boldsymbol{s}| \boldsymbol{\mu}_k, \boldsymbol{Q}, t) \otimes \boldsymbol{p}_e(\boldsymbol{x}_n|\boldsymbol{s})) \right] $$ where $f_k(\cdots)$ is replaced by a convolution of the mutation process $\boldsymbol{f}_k(\cdots)$ and the probability of sequencing error $\boldsymbol{p}_e(\cdots)$, and $\boldsymbol{s}$ denotes the true sequence before sequencing error.
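
One possible reading of this convolution, sketched below, marginalizes the unobserved true base at each site: the probability of observing $x_{nl}$ is the sum over $s \in \mathcal{S}$ of $P_{\mu_{kl}, s}(t) \, p_e(x_{nl} | s)$. The JC69 transition probabilities, the uniform error model with rate `eps`, and the function names are all assumptions for illustration; the text does not specify the form of $\boldsymbol{p}_e$.

```python
import math

STATES = "AGCT"


def jc69_prob(a, b, t, alpha=1.0):
    """P_{a,b}(t) under JC69 (assumed substitution model)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def error_prob(observed, true, eps):
    """p_e(x | s): the true base is read correctly with probability 1 - eps,
    otherwise as one of the other three bases uniformly (assumed error model)."""
    return 1.0 - eps if observed == true else eps / 3.0


def site_prob_with_error(mu_l, x_l, t, eps):
    """Sum over the unobserved true base s of P_{mu_l, s}(t) * p_e(x_l | s)."""
    return sum(jc69_prob(mu_l, s, t) * error_prob(x_l, s, eps) for s in STATES)


def component_density_with_error(x, mu, t, eps):
    """Error-aware component density, with sites treated as independent."""
    p = 1.0
    for mu_l, x_l in zip(mu, x):
        p *= site_prob_with_error(mu_l, x_l, t, eps)
    return p


# Hypothetical example: one center, one observed read, 5% error rate.
print(component_density_with_error("AAAG", "AAAA", t=0.1, eps=0.05))
```

Setting `eps = 0` recovers the error-free site probability, so this density can replace $f_k(\cdots)$ in the mixture without changing the rest of the clustering procedure.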