Overview
Phylogenetic clustering (phyloclustering) is an evolutionary Continuous Time Markov Chain (CTMC) model-based approach that identifies population structure from molecular data without assuming linkage equilibrium. The goal is to use a statistical approach to infer population structure from large numbers of sequences (SNPs, DNA, codons, etc.), to cluster individuals into subpopulations, and to identify molecular sequences representative of those subpopulations. It is an approximate solution to the NP-complete problem of estimating phylogenetic trees. It also benefits varied research fields such as
- Virology -- identifying key sequences for disease diagnostics and vaccine design,
- Ecology -- detecting structure and gene flow in endangered populations or invasive species, and
- Human Genetics -- searching for genes associated with complex diseases including potential environmental interactions.
Details and references can be found in Method and Document.
Purpose
The major goals of phyloclustering are:
- to identify the ancestors from which sequences evolved,
- to determine population structure based on classifications,
- to reduce the impact of possible sequencing or alignment errors, and
- to aggregate trustworthy sequence information.
In phyloclustering, the similarity of sequences in a group is characterized by mutation processes rather than nucleotide frequency. The table below gives a naive example.
- The first column contains the id for six sequences shown in the second column. The third column shows the potential ancestors for two groups. The fourth column indicates the classifications.
- The sequences in the first group have a higher chance of having mutated from the first ancestor than from the second ancestor.
- The two row blocks correspond to the two possible populations underlying the data.
- The first site of the fourth sequence is T, which may be a sequencing error, but the sequence can still be "rounded" to the first ancestor.
Building a phylogenetic tree from the two ancestors is easier than building one from the six sequences. The final tree may reveal the structural phylogeny of the population.
Id  Sequence               Ancestor               Group
1   A C G T A C C A T C C  A A G T C C G A T G C  1
2   A A G T C C G A T G C  A A G T C C G A T G C  1
3   A A G T C C G A T G C  A A G T C C G A T G C  1
4   T A G T C C G A T G C  A A G T C C G A T G C  1

5   C C G G A A C T A C G  C C G G A A C T A C A  2
6   C C G G A A C T G C A  C C G G A A C T A C A  2
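
The classification in the table can be reproduced with a minimal sketch. It is not the phyloclustering implementation itself: it assumes the simplest CTMC substitution model (Jukes-Cantor, JC69), a hypothetical fixed evolutionary time t = 0.1 (expected substitutions per site), equal group proportions, and the two ancestors taken as given, whereas a full analysis would estimate the ancestors and model parameters from the data. The function names are illustrative only.

```python
import math

# Sequences and candidate ancestors copied from the table above.
sequences = {
    1: "ACGTACCATCC",
    2: "AAGTCCGATGC",
    3: "AAGTCCGATGC",
    4: "TAGTCCGATGC",
    5: "CCGGAACTACG",
    6: "CCGGAACTGCA",
}
ancestors = {
    1: "AAGTCCGATGC",
    2: "CCGGAACTACA",
}


def jc69_log_likelihood(ancestor, sequence, t):
    """Log-probability that `ancestor` mutates into `sequence` in time t.

    Assumes the JC69 model with independent sites; t is the expected
    number of substitutions per site (a hypothetical value here).
    """
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay  # site keeps its nucleotide
    p_diff = 0.25 - 0.25 * decay  # site changes to one specific other nucleotide
    return sum(
        math.log(p_same if a == s else p_diff)
        for a, s in zip(ancestor, sequence)
    )


def classify(sequences, ancestors, t=0.1):
    """Assign each sequence to the ancestor with the highest log-likelihood."""
    return {
        seq_id: max(
            ancestors,
            key=lambda group: jc69_log_likelihood(ancestors[group], seq, t),
        )
        for seq_id, seq in sequences.items()
    }


groups = classify(sequences, ancestors)
print(groups)  # {1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2}
```

Under these assumptions, sequence 4 is assigned to group 1 even though its first site differs from the first ancestor: a single mismatch costs far less likelihood than the ten mismatches it has against the second ancestor.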