Basic -- Sample Mean and Sample Variance.

This topic will introduce two essential statistics, sample mean $\bar{x} = \frac{1}{N}\sum_{i = 1}^N x_i$ and sample variance $s_x = \frac{1}{N - 1}\sum_{i = 1}^N (x_i - \bar{x})^2$ for a dataset with a large number $N$ of observations $\{x_1, x_2, \ldots, x_N\}$. The data are stored in a standard vector, matrix, or data.frame format in R, and are distributed in several processors. i.e. Each processor owns a part of dataset. The direct benefit of this approach is one don't need to worry about the memory, length, or dimension limitations such as $2^{31} - 1$ in R.

Tricks:

By default, R uses double for most of the variables.
length(...) returns an integer in R.
as.double(...) makes a copy for the input object, so it is not good for big object.

More fantastic algorithms about calculating variance can be found at "Algorithms for calculating variance" on Wikipedia. One-pass algorithms are achievable for higher moments of statistics, and they will reduce computing a lot and better than two or more pass algorithms for ultra-large dataset. Of course, lost precision is possible. For example,

J. Bennett, R. Grout, P. Pebay, D. Roe, and D. Thompson (2009), ``Numerically Stable, Single-Pass, Parallel Statistics Algorithms'', In Proc. 2009 IEEE International Conference on Clustering Computing, New Orleans, LA, Aug. 2009.

The other way is to apply sampling techniques on ultra-large dataset.

Serial code: (ex_basic_serial.r)

# File name: ex_basic_serial.r
# Run: Rscript --vanilla ex_basic_serial.r

### Basic implication
my.basic <- function(x){
  ### This is the same as mean(x).
  N <- length(x)
  bar.x <- sum(x / N)

  ### This is the same as var(x).
  s.x <- sum(x^2 / (N - 1)) - bar.x^2 * (N / (N - 1))

  list(mean = bar.x, s = s.x)
} # End of my.basic().

### Main codes stat from here.
set.seed(1234)
N <- 100
x <- rnorm(N)
my.basic(x)

Parallel (SPMD) code: (ex_basic_spmd.r for ultra-large/unlimited $N$)

# File name: ex_basic_spmd.r
# Run: mpiexec -np 2 Rscript --vanilla ex_basic_spmd.r

### Load pbdMPI and initial the communicator.
library(pbdMPI, quiet = TRUE)
init()

### Basic implication for workers.
my.basic.spmd <- function(x.spmd){
  ### For mean(x).
  N <- allreduce(length(x.spmd), op = "sum")
  bar.x.spmd <- sum(x.spmd / N)
  bar.x <- allreduce(bar.x.spmd, op = "sum")

  ### For var(x).
  s.x.spmd <- sum(x.spmd^2 / (N - 1))
  s.x <- allreduce(s.x.spmd, op = "sum") - bar.x^2 * (N / (N - 1))

  list(mean = bar.x, s = s.x)
} # End of my.basic.spmd().

### Main codes start from here.
set.seed(1234)
N <- 100                # Pretend N is large.
x <- rnorm(N)

### Load data partially by processors if N is ultra-large.
id.get <- get.jid(N)
x.spmd <- x[id.get]
ret.spmd <- my.basic.spmd(x.spmd)

### Output.
comm.print(ret.spmd)
finalize()

Exercise:

Try to compute standard deviation.