Model-Based Clustering, Finite Mixture Model, EGM Algorithm, and K-Means.

References:

  1. Chen, W.-C. and Maitra, R. (2011), "Model-based clustering of regression time series data via APECM -- an AECM algorithm sung to an even faster beat", Statistical Analysis and Data Mining, 4, 567-578.
  2. Meng, X.-L. and van Dyk, D. (1997), "The EM Algorithm -- an Old Folk-song Sung to a Fast New Tune", Journal of the Royal Statistical Society, Series B, 59, 511-567.
  3. Expectation-Maximization Algorithm on Wikipedia.
  4. K-Means Clustering on Wikipedia.

I have to leave the details of the whole algorithm to a package we implemented and a paper we are trying to finish. They are too tedious to introduce here, but if you want to know the details, searching Google for "Parallel K-Means" should give you some flavor of what the "Expectation-Gathering-Maximization (EGM) algorithm" is.
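
To make that flavor concrete, below is a minimal single-process sketch in R of one parallel K-Means iteration. It is not the pmclust implementation: allreduce() is a hypothetical stand-in for an MPI sum across processors (in a real run this would be an MPI reduction, e.g., via Rmpi), each processor is imagined to hold its own block of rows X.local, and the data are toy values.

## Stand-in for an MPI sum across processors; with one process the
## global sum equals the local sum, so the sketch runs as-is.
allreduce <- function(x) x

kmeans.step <- function(X.local, centers) {
  K <- nrow(centers); p <- ncol(centers)
  ## E-like step: assign each local row to its nearest center (Euclidean distance).
  d2 <- sapply(seq_len(K), function(k)
    rowSums(sweep(X.local, 2, centers[k, ])^2))
  id <- max.col(-d2)
  ## Local sufficient statistics: per-cluster sums and counts.
  S.local <- matrix(0, K, p)
  n.local <- numeric(K)
  for (k in seq_len(K)) {
    rows <- which(id == k)
    n.local[k] <- length(rows)
    if (n.local[k] > 0) S.local[k, ] <- colSums(X.local[rows, , drop = FALSE])
  }
  ## Gathering step: only the K x p sums and K counts cross the network.
  S <- allreduce(S.local)
  n <- allreduce(n.local)
  ## M-like step: every processor rebuilds the same centers from the global sums.
  list(centers = S / pmax(n, 1), cluster = id)
}

## Toy usage on one process.
set.seed(1)
X <- matrix(rnorm(200), ncol = 2)
ctr <- X[sample(nrow(X), 3), ]
for (i in 1:10) ctr <- kmeans.step(X, ctr)$centers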

The pmclust package implements both clustering methods in parallel for ultra-large datasets by utilizing Rmpi. The newest version of pmclust is available on CRAN at http://cran.r-project.org/web/packages/pmclust/index.html, including source code and manuals. A vignette should be online soon at the same address, too. The Model-Based Clustering uses a finite Gaussian mixture model with unstructured dispersions; the K-Means uses Euclidean distance.
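
"Unstructured dispersions" means each mixture component has its own unrestricted covariance matrix. The following base-R sketch evaluates such a mixture density; the parameter values are toy numbers for illustration, not pmclust output.

## Density of a finite Gaussian mixture with component-specific full
## ("unstructured") covariance matrices: sum_k pi_k * phi(x; mu_k, Sigma_k).
dmix <- function(X, pi.k, mu, Sigma) {
  K <- length(pi.k); p <- ncol(X)
  dens <- sapply(seq_len(K), function(k) {
    md2 <- mahalanobis(X, mu[k, ], Sigma[[k]])            # (x - mu_k)' Sigma_k^{-1} (x - mu_k)
    exp(-0.5 * md2) / sqrt((2 * pi)^p * det(Sigma[[k]]))  # multivariate normal density
  })
  drop(dens %*% pi.k)
}

## Toy log-likelihood of a small data set under made-up parameters.
set.seed(1)
X <- matrix(rnorm(20), ncol = 2)
loglik <- sum(log(dmix(X,
                       pi.k  = c(0.4, 0.6),
                       mu    = rbind(c(0, 0), c(2, 2)),
                       Sigma = list(diag(2), matrix(c(1, .5, .5, 1), 2)))))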

The EGM algorithm is implemented for the Model-Based Clustering, and other EM-like algorithms such as AECM and APECM are also implemented to speed up convergence and reduce computing time. The parallel algorithm aims to reduce communication between processors by transferring only sufficient statistics, which greatly benefits computing performance.
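
As a rough illustration of why transferring only sufficient statistics helps, here is a single-process sketch of the Gathering and Maximization steps for the Gaussian mixture. Z.local is assumed to hold the E-step responsibilities for the rows stored on one processor, and allreduce() again stands in for an MPI sum; this is a sketch of the idea, not the pmclust code.

## Stand-in for an MPI sum; in a real run each statistic below would be
## reduced across processors.
allreduce <- function(x) x

gather.and.maximize <- function(X.local, Z.local) {
  ## Z.local: n.local x K matrix of E-step responsibilities for the local rows.
  K <- ncol(Z.local); p <- ncol(X.local)
  n.k <- allreduce(colSums(Z.local))                      # zero-th moments, length K
  S1  <- allreduce(crossprod(Z.local, X.local))           # first moments, K x p
  S2  <- allreduce(lapply(seq_len(K), function(k)
           crossprod(X.local * Z.local[, k], X.local)))   # second moments, K matrices of p x p
  ## M-step: every processor rebuilds identical parameters from the gathered statistics.
  pi.k  <- n.k / sum(n.k)
  mu    <- S1 / n.k
  Sigma <- lapply(seq_len(K), function(k)
             S2[[k]] / n.k[k] - tcrossprod(mu[k, ]))
  list(pi.k = pi.k, mu = mu, Sigma = Sigma)
}

Note that what crosses the network per iteration is only the K counts, the K x p sums, and the K cross-product matrices of size p x p, so the communication cost does not grow with the number of observations.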