High Performance Statistical Computing (HPSC)
applies High Performance Computing to statistical research
and methodology development,
such as efficient computation on and visualization of
ultra-large/unlimited datasets.
Currently, this website focuses on
MPI
and
pbdMPI
for ultra-large/unlimited data analysis in
a distributed manner.
The techniques and methods introduced here can be readily extended to other
statistical applications that involve ultra-large/unlimited data.
This website demonstrates key steps using simplified examples and provides
explanations of the statistical methods. It also illustrates the
fundamentals of parallelizing algorithms and code, especially for
distributed data.
All examples in the
Cookbook
are available for download and can be executed easily.
In the
Cookbook
computational statistical methods are demonstrated along with
a brief introduction to their background.
Each technique comes with example code in both
serial and parallel versions.
The key steps are annotated inside the code.
The parallel version shows how to move over
from the serial version without too much effort,
provided the serial version was designed appropriately.
Optimized serial code may not be easy to parallelize, and even when it is,
the resulting parallel version may not be optimal.
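
The relationship between a serial version and its parallel counterpart can be sketched in plain R. This is only an illustration under stated assumptions, not code from the Cookbook: the "processors" are simulated by splitting one vector into chunks, and the names local_stats, finish, and n_ranks are hypothetical. The point is that a serial routine written in terms of additive summaries (count, sum, sum of squares) carries over with little change, because each chunk computes the same summaries and only those small summaries need to be combined. In an MPI setting the combining step would correspond to a sum reduction across processes, such as allreduce in pbdMPI.

```r
# Simulated data; in a real distributed run each process would hold only its own chunk.
set.seed(1)
x <- rnorm(1000)

# Additive summaries of a chunk (hypothetical helper, not part of any package).
local_stats <- function(x) c(n = length(x), s1 = sum(x), s2 = sum(x^2))

# Turn combined summaries into the mean and the sample variance.
# Note: this one-pass variance formula is simple but not the most numerically stable.
finish <- function(s) {
  m <- s[["s1"]] / s[["n"]]
  v <- (s[["s2"]] - s[["n"]] * m^2) / (s[["n"]] - 1)
  c(mean = m, var = v)
}

# Serial version: one process sees all of the data.
serial <- finish(local_stats(x))

# "Parallel" version: each simulated rank summarizes its own chunk,
# then the summaries are added together (the role of a sum reduction).
n_ranks <- 4
chunks <- split(x, rep(seq_len(n_ranks), length.out = length(x)))
partial <- lapply(chunks, local_stats)
total <- Reduce(`+`, partial)
parallel <- finish(total)

# Both versions agree up to floating-point error.
stopifnot(isTRUE(all.equal(serial, parallel)))
```

Only three numbers per chunk are combined here, regardless of how large each chunk is, which is one reason this style of serial design transfers to distributed data without much extra work.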
One important issue in parallel programming is interprocessor
communication, which can reduce performance dramatically.
Another issue is that programming time can be high if
existing serial code is not reusable.
For distributed data, and from the viewpoint of statistical analysis,
parallel programming in R requires fundamental knowledge of
mathematics and statistics, with linear algebra heavily involved.
Programming concepts from C, Fortran, and MPI are also essential.
Studying and practicing many examples is an
efficient way to learn high performance statistical computing.
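
As one illustration of why linear algebra matters for distributed data, consider least squares when the rows of the design matrix are spread across processes. The sketch below is an assumption-laden example in plain R, not the website's method: the ranks are simulated with row indices, and the names owner, local_cross, and n_ranks are hypothetical. It relies on the identity that the cross-products X'X and X'y are sums of per-chunk cross-products, so each process only has to contribute a p-by-p matrix and a length-p vector, never its raw rows.

```r
# Simulated regression data with an intercept and two predictors.
set.seed(2)
n <- 600
p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta_true <- c(1, -2, 0.5)
y <- drop(X %*% beta_true + rnorm(n, sd = 0.1))

# Assign rows to simulated ranks (hypothetical layout: round-robin).
n_ranks <- 3
owner <- rep(seq_len(n_ranks), length.out = n)

# Each rank computes cross-products from its own rows only.
local_cross <- function(rows) {
  Xi <- X[rows, , drop = FALSE]
  list(XtX = crossprod(Xi), Xty = crossprod(Xi, y[rows]))
}
partial <- lapply(split(seq_len(n), owner), local_cross)

# Summing the small per-rank pieces plays the role of a sum reduction.
XtX <- Reduce(`+`, lapply(partial, `[[`, "XtX"))
Xty <- Reduce(`+`, lapply(partial, `[[`, "Xty"))

# Solve the normal equations for the coefficient estimates.
beta_hat <- drop(solve(XtX, Xty))

# Compare with a serial fit on the full data.
beta_serial <- unname(coef(lm.fit(X, y)))
stopifnot(isTRUE(all.equal(beta_hat, beta_serial)))
```

Solving the normal equations directly is chosen here for brevity; for ill-conditioned problems a QR-based approach is usually preferred, at the cost of a more involved distributed algorithm.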
In short,
the easy approach to analyzing ultra-large data or datasets in R
is to "process the data on many processors".
---