High Performance Statistical Computing (HPSC) applies High Performance Computing to statistical research and methodology development, for example efficient computation on, and visualization of, ultra-large/unlimited datasets. Currently, this website focuses on MPI and pbdMPI for analyzing ultra-large/unlimited data in a distributed manner. The techniques and methods introduced here can be readily extended to other statistical applications that involve ultra-large/unlimited data.

This website demonstrates key steps using simplified examples and explains the statistical methods behind them. It also illustrates the fundamental parallelization of algorithms and code, especially for distributed data. All examples in the Cookbook are available for download and can be executed easily. The Cookbook demonstrates computational statistical methodologies with a brief introduction to their background. Each technique comes with example code in both serial and parallel versions, with the key steps annotated inside the code. The parallel version shows how to move from the serial version without too much burden, provided the serial version was designed appropriately.
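As a rough illustration of a serial version designed so that it carries over to a parallel one, the sketch below computes a mean and sample variance from per-block summaries. It is written in plain Python rather than R, makes no MPI or pbdMPI calls, and uses lists to stand in for the data held by separate processors. All function names (`local_summary`, `combine`, and so on) are hypothetical and are not taken from the Cookbook.

```python
# Hypothetical sketch: plain Python lists stand in for data blocks that
# would live on separate processors. No MPI or pbdMPI call is made here.

def local_summary(values):
    """Return (count, sum, sum of squares) for one block of data."""
    n = len(values)
    s = sum(values)
    ss = sum(v * v for v in values)
    return n, s, ss

def combine(summaries):
    """Add per-block summaries element-wise.

    In a distributed setting this is the step a reduction across
    processors would perform (an assumption of this sketch).
    """
    n = sum(t[0] for t in summaries)
    s = sum(t[1] for t in summaries)
    ss = sum(t[2] for t in summaries)
    return n, s, ss

def mean_and_variance(summary):
    """Turn a (count, sum, sum of squares) summary into mean and sample variance."""
    n, s, ss = summary
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance

def serial_version(data):
    # Serial: one block holds all of the data.
    return mean_and_variance(local_summary(data))

def parallel_style_version(blocks):
    # Parallel style: each block is summarized on its own, then combined.
    return mean_and_variance(combine([local_summary(b) for b in blocks]))

data = [float(i) for i in range(1, 13)]
blocks = [data[0:4], data[4:8], data[8:12]]  # three pretend processors

serial_result = serial_version(data)
parallel_result = parallel_style_version(blocks)
```

Because the serial version already goes through `local_summary`, the parallel-style version only adds the `combine` step. This is one possible reading of "designed appropriately", not the only one.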

Optimized serial code may not be easy to parallelize, and even when it is, the resulting parallel version may not be optimized. One important issue in parallel programming is interprocessor communication, which may reduce performance dramatically. Another is that programming time may be high if existing serial code is not reusable. For distributed data, and from the viewpoint of statistical analysis, parallel programming in R requires fundamental knowledge of mathematics and statistics, with linear algebra involved intensively. Programming concepts from C, Fortran, and MPI are also essential. Studying and practicing many examples is an efficient way to learn high performance statistical computing.
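To make the points about communication and linear algebra concrete, here is a minimal sketch of simple linear regression (y = b0 + b1 * x) on row-distributed data. Each block contributes only its pieces of the cross products X'X and X'y, so five numbers per block would need to be communicated regardless of how many rows the block holds. The sketch is plain Python with lists standing in for processors, not R or MPI code, and every name in it is hypothetical.

```python
# Hypothetical sketch: lists of (x, y) rows stand in for data blocks on
# separate processors. No MPI or pbdMPI call is made here.

def local_cross_products(xs, ys):
    """Per-block pieces of X'X and X'y for the model y = b0 + b1 * x."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return [n, sx, sxx, sy, sxy]

def reduce_sum(pieces):
    """Element-wise sum over blocks, standing in for a sum reduction."""
    return [sum(column) for column in zip(*pieces)]

def solve_normal_equations(total):
    """Solve the 2x2 normal equations [[n, sx], [sx, sxx]] b = [sy, sxy]."""
    n, sx, sxx, sy, sxy = total
    det = n * sxx - sx * sx
    b0 = (sxx * sy - sx * sxy) / det
    b1 = (n * sxy - sx * sy) / det
    return b0, b1

# Toy data lying exactly on y = 2 + 3x, split into three blocks of rows.
xs = [float(i) for i in range(9)]
ys = [2.0 + 3.0 * x for x in xs]
x_blocks = [xs[0:3], xs[3:6], xs[6:9]]
y_blocks = [ys[0:3], ys[3:6], ys[6:9]]

pieces = [local_cross_products(xb, yb) for xb, yb in zip(x_blocks, y_blocks)]
values_sent_per_block = len(pieces[0])  # independent of the block's row count
b0, b1 = solve_normal_equations(reduce_sum(pieces))
```

Forming the normal equations explicitly is chosen here only for brevity. It can be numerically less stable than a QR-based approach, so this should be read as an illustration of the communication pattern rather than a recommended fitting method.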

In short, the easy approach to analyzing ultra-large data or datasets in R is to "process data in many processors".