Binning -- Table Cutting, Binning, and Nonparametric Statistics.
Counting is a fundamental operation in statistics: it underlies frequencies, proportions, probabilities, and so on, and it is an essential tool for categorical data analysis. For small data, R bins values into given categories/breaks quickly and efficiently. The same idea extends to large datasets: bin counts computed on separate pieces of the data simply add up, so many counting-based statistical concepts can be parallelized in the same way.
Serial code: (ex_binning_serial.r)
# File name: ex_binning_serial.r
# Run: Rscript --vanilla ex_binning_serial.r
### A famous example from help("cut") in R.
set.seed(1234)
N <- 100
y <- rnorm(N)
### Count the data falling between the given breaks.
table(cut(y, breaks = pi / 3 * (-3:3)))
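The reason this example parallelizes well is that bin counts are additive: counting each piece of the data separately and summing the results gives the same table as counting everything at once. A minimal serial sketch of that idea follows, with no MPI involved. The two-way split and the names `chunk.id`, `y.chunks`, `bin.chunks`, and `bin.sum` are illustrative assumptions, not part of the original example.

```r
### Serial illustration only: split the data into pieces, bin each piece,
### then add the per-piece counts together.
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

### Reference result: bin all the data at once.
bin.all <- table(cut(y, breaks = breaks))

### Hypothetical split into 2 pieces, standing in for 2 processors.
chunk.id <- rep(1:2, length.out = N)
y.chunks <- split(y, chunk.id)

### Bin each piece with the same breaks, so every piece has the same bins.
bin.chunks <- lapply(y.chunks, function(y.k) {
  as.vector(table(cut(y.k, breaks = breaks)))
})

### Element-wise sum of the per-piece counts.
bin.sum <- Reduce(`+`, bin.chunks)
names(bin.sum) <- names(bin.all)
print(bin.sum)
```

Using the same `breaks` for every piece is what makes the per-piece count vectors line up, including bins that happen to be empty in one piece.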
Parallel (SPMD) code: (ex_binning_spmd.r for ultra-large/unlimited $N$)
# File name: ex_binning_spmd.r
# Run: mpiexec -np 2 Rscript --vanilla ex_binning_spmd.r
### Load pbdMPI and initialize the communicator.
library(pbdMPI, quietly = TRUE)
init()
### Main code starts here.
set.seed(1234)
N <- 100
y <- rnorm(N)
### Each processor keeps only its own part of the data (useful if N is ultra-large).
id.get <- get.jid(N)
y.spmd <- y[id.get]
### Count the local data falling between the given breaks.
bin.spmd <- table(cut(y.spmd, breaks = pi / 3 * (-3:3)))
bin <- as.array(allreduce(bin.spmd, op = "sum"))
dimnames(bin) <- dimnames(bin.spmd)
class(bin) <- class(bin.spmd)
### allreduce(...) leaves the summed counts on every rank; comm.print(...) prints from rank 0 only by default.
comm.print(bin)
finalize()
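To make the data-splitting step more concrete, here is a plain R sketch of one way to divide the indices 1, ..., N into contiguous blocks, one block per rank. The helper `block.ids()` is hypothetical and written only for illustration; it is not the pbdMPI implementation, and the exact blocks returned by `get.jid()` may differ.

```r
### Hypothetical helper (not part of pbdMPI): contiguous block of indices
### for a given 0-based rank out of n.ranks processors.
block.ids <- function(N, n.ranks, rank) {
  ### Integer block boundaries: 0 = b_0 <= b_1 <= ... <= b_{n.ranks} = N.
  bounds <- (N * (0:n.ranks)) %/% n.ranks
  ### seq_len() gives an empty vector when a rank receives no data.
  bounds[rank + 1] + seq_len(bounds[rank + 2] - bounds[rank + 1])
}

### Pretend there are 3 processors sharing N = 100 observations.
N <- 100
n.ranks <- 3
ids <- lapply(0:(n.ranks - 1), function(r) block.ids(N, n.ranks, r))

### Number of observations assigned to each rank.
print(sapply(ids, length))
```

Every index is assigned to exactly one rank, which is the property the summed counts rely on: no observation is counted twice and none is skipped.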
Exercise:
- Try the R function tabulate() to replace table().
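One possible starting point for this exercise, shown in serial R only, is sketched below. It assumes the factor returned by `cut()` is passed directly to `tabulate()`; the names `f`, `bin.table`, and `bin.tab` are illustrative. Adapting it to the SPMD code is left as the actual exercise.

```r
### Serial sketch: tabulate() counts the integer codes of the factor from cut().
set.seed(1234)
N <- 100
y <- rnorm(N)
breaks <- pi / 3 * (-3:3)

f <- cut(y, breaks = breaks)

### table() returns a classed object with dimnames.
bin.table <- table(f)

### tabulate() returns a plain integer vector; nbins fixes its length so that
### empty bins at the end are kept. Values outside the breaks (NA) are ignored.
bin.tab <- tabulate(f, nbins = nlevels(f))
names(bin.tab) <- levels(f)
print(bin.tab)
```

Because `tabulate()` returns a plain integer vector of fixed length, the result may need less repair of `dimnames` and `class` after a reduction than a `table` object does.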