Basic -- Sample Mean and Sample Variance.
This topic will introduce two essential statistics, sample mean
$\bar{x} = \frac{1}{N}\sum_{i = 1}^N x_i$
and sample variance
$s_x = \frac{1}{N - 1}\sum_{i = 1}^N (x_i - \bar{x})^2$
for a dataset with a large number
$N$
of observations
$\{x_1, x_2, \ldots, x_N\}$.
The data
are stored in a standard vector
, matrix
,
or data.frame
format in R, and are
distributed in several processors. i.e. Each processor owns a part of
dataset. The direct benefit of this approach is one don't need to worry
about the memory, length, or dimension limitations such as
$2^{31} - 1$
in R.
Tricks:
- By default, R uses double for most of the variables.
length(...)
returns an integer in R.as.double(...)
makes a copy for the input object, so it is not good for big object.
More fantastic algorithms about calculating variance can be found at "Algorithms for calculating variance" on Wikipedia. One-pass algorithms are achievable for higher moments of statistics, and they will reduce computing a lot and better than two or more pass algorithms for ultra-large dataset. Of course, lost precision is possible. For example,
- J. Bennett, R. Grout, P. Pebay, D. Roe, and D. Thompson (2009), ``Numerically Stable, Single-Pass, Parallel Statistics Algorithms'', In Proc. 2009 IEEE International Conference on Clustering Computing, New Orleans, LA, Aug. 2009.
The other way is to apply sampling techniques on ultra-large dataset.
Serial code: (ex_basic_serial.r)
# File name: ex_basic_serial.r
# Run: Rscript --vanilla ex_basic_serial.r
### Basic implication
my.basic <- function(x){
### This is the same as mean(x).
N <- length(x)
bar.x <- sum(x / N)
### This is the same as var(x).
s.x <- sum(x^2 / (N - 1)) - bar.x^2 * (N / (N - 1))
list(mean = bar.x, s = s.x)
} # End of my.basic().
### Main codes stat from here.
set.seed(1234)
N <- 100
x <- rnorm(N)
my.basic(x)
Parallel (SPMD) code: (ex_basic_spmd.r for ultra-large/unlimited $N$)
# File name: ex_basic_spmd.r
# Run: mpiexec -np 2 Rscript --vanilla ex_basic_spmd.r
### Load pbdMPI and initial the communicator.
library(pbdMPI, quiet = TRUE)
init()
### Basic implication for workers.
my.basic.spmd <- function(x.spmd){
### For mean(x).
N <- allreduce(length(x.spmd), op = "sum")
bar.x.spmd <- sum(x.spmd / N)
bar.x <- allreduce(bar.x.spmd, op = "sum")
### For var(x).
s.x.spmd <- sum(x.spmd^2 / (N - 1))
s.x <- allreduce(s.x.spmd, op = "sum") - bar.x^2 * (N / (N - 1))
list(mean = bar.x, s = s.x)
} # End of my.basic.spmd().
### Main codes start from here.
set.seed(1234)
N <- 100 # Pretend N is large.
x <- rnorm(N)
### Load data partially by processors if N is ultra-large.
id.get <- get.jid(N)
x.spmd <- x[id.get]
ret.spmd <- my.basic.spmd(x.spmd)
### Output.
comm.print(ret.spmd)
finalize()
Exercise:
- Try to compute standard deviation.
E-Mail: wccsnow at gmail dot com
Last Updated: December 30, 2016
Created: October 19, 2011