PHS 7045: Advanced Programming
University of Utah, EEUU
2024-10-01
Loosely, from R’s perspective, we can think of HPC in terms of two, maybe three things:
Big data: How to work with data that doesn’t fit your computer
Parallel computing: How to take advantage of multiple core systems
Compiled code: Write your own low-level code (if R doesn't have it yet…)
(Check out the CRAN Task View on HPC)
High Throughput Computing Cluster (HTC Cluster)
Supercomputer (HPC Cluster)
Embarrassingly Parallel
In terms of scale
HTC > HPC > Single node > Socket > Core > Thread | SIMD vectorization
How many cores does your computer have?
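A quick way to check from within R (detectCores comes with the parallel package that ships with R):
# Number of cores (logical CPUs) the system reports
parallel::detectCores()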
Pro-tip
When in doubt, profile your code first! In R, you can use the profvis package, which gives you a visual representation of where your code spends most of its time.
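A minimal sketch of how that looks (the workload below is just a made-up example):
library(profvis)
# Profile a toy workload; profvis opens an interactive flame graph of where time is spent
profvis({
  X <- matrix(rnorm(1e5 * 50), ncol = 50)
  ans <- apply(X, 2, function(x) coef(lm(x ~ 1)))
})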
Things that are easily parallelizable
Things that are not easily parallelizable
A good example use case:
Parallelization is not free: most of the cost is in sending and receiving data.
In R (and other flavors), you can mitigate this by (i) reducing the amount of data communicated and (ii) reducing the number of times you communicate.
While there are several alternatives (just take a look at the High-Performance Computing Task View), we'll focus on the following R packages for explicit parallelism:
Some examples:
Implicit parallelism, on the other hand, refers to out-of-the-box tools that let the programmer not worry about parallelization, e.g., gpuR for matrix manipulation using the GPU, or tensorflow.
And there's also a more advanced set of options: PSOCK, Fork, MPI, etc.
(Usually) we do the following:
1. Create a PSOCK/FORK (or other) cluster using makePSOCKcluster/makeForkCluster (or makeCluster).
2. Copy/prepare each R session (if you are using a PSOCK cluster):
   - Copy objects with clusterExport.
   - Pass expressions with clusterEvalQ.
   - Set a seed.
3. Do your call: parApply, parLapply, etc.
4. Stop the cluster with stopCluster.
Type | Description | Pros | Cons |
---|---|---|---|
PSOCK | Multiple machines via socket connection | Works in all OSs | Slowest |
FORK | Single machine via forking | Avoids memory duplication | Only for Unix-based systems |
MPI | Multiple machines via Message Passing Interface | Best alternative for HPC clusters | Sometimes hard to set up |
Using PSOCK, the slurmR package creates clusters featuring multiple nodes in HPC environments (think hundreds of cores).
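A hedged sketch of what that can look like (makeSlurmCluster is slurmR's cluster constructor; the core count and the toy call are made up, and this only runs on a Slurm-managed system):
# Request a multi-node, PSOCK-like cluster through Slurm (assumes slurmR is installed)
library(slurmR)
cl_hpc <- makeSlurmCluster(200) # e.g., 200 cores, possibly spanning several nodes
# Use it with the regular parallel API
parallel::parSapply(cl_hpc, 1:200, function(i) Sys.info()[["nodename"]])
parallel::stopCluster(cl_hpc)
Back on a single machine, the basic workflow with the parallel package looks like this: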
# 1. CREATING A CLUSTER
library(parallel)
cl <- makePSOCKcluster(4)
# 2. PREPARING THE CLUSTER
clusterSetRNGStream(cl, 123) # Equivalent to `set.seed(123)`
x <- 20
clusterExport(cl, "x")
# 3. DO YOUR CALL
clusterEvalQ(cl, {
paste0("Hello from process #", Sys.getpid(), ". x = ", x)
})
[[1]]
[1] "Hello from process #35564. x = 20"
[[2]]
[1] "Hello from process #35566. x = 20"
[[3]]
[1] "Hello from process #35565. x = 20"
[[4]]
[1] "Hello from process #35563. x = 20"
Problem: Run multiple regressions on a very wide dataset. We need to fit the following model:
\[ y = X_i\beta_i + \varepsilon,\quad \varepsilon\sim N(0, \sigma^2_i),\quad\forall i \]
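For reference, data with the structure previewed below can be simulated along these lines (a sketch only; the original generating code and seed are not shown):
# Sketch: 500 observations, 999 candidate predictors, and a response y
X <- as.data.frame(matrix(rnorm(500 * 999), nrow = 500))
names(X) <- sprintf("x%03d", seq_along(X))
y <- rnorm(500)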
[1] 500 999
x001 x002 x003 x004 x005
1 0.61827227 1.72847041 -1.4810695 -0.2471871 1.4776281
2 0.96777456 -0.19358426 -0.8176465 0.6356714 0.7292221
3 -0.04303734 -0.06692844 0.9048826 -1.9277964 2.2947675
4 0.84237608 -1.13685605 -1.8559158 0.4687967 0.9881953
5 -1.91921443 1.83865873 0.5937039 -0.1410556 0.6507415
6 0.59146153 0.81743419 0.3348553 -1.8771819 0.8181764
num [1:500] -0.8188 -0.5438 1.0209 0.0467 -0.4501 ...
The strategy:
- Write a function that takes a single column of X, fits the model (lm), and returns the coefficients (coef).
- Since the function also needs y, we pass it as an argument.
- For the serial version, iterate over the columns with apply; for the parallel version, replace apply with parApply.
Both results should be the same (see the sketch below).
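A minimal sketch of both versions, assuming the cluster cl created above is still running:
# Serial: iterate over the columns of X with apply
ans_serial <- apply(X, MARGIN = 2, FUN = function(x, y) coef(lm(y ~ x)), y = y)
# Parallel: same call, distributed across the workers with parApply
ans_parallel <- parallel::parApply(cl, X, MARGIN = 2, FUN = function(x, y) coef(lm(y ~ x)), y = y)
# Both should match
all.equal(ans_serial, ans_parallel)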
Are we going any faster? The microbenchmark package can help us with that:
library(microbenchmark)
microbenchmark(
parallel = parallel::parApply(
cl = cl,
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
serial = apply(
X = X, MARGIN = 2,
FUN = function(x, y) coef(lm(y ~ x)),
y = y
),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 10 a
serial 2.920075 2.925246 2.874162 2.907786 2.949153 2.612683 10 b
Problem: We want to bootstrap a logistic regression model. We need to fit the following model:
\[ P(Y=1) = \text{logit}^{-1}\left(X\beta\right) \]
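For reference, data of this shape can be simulated along these lines (a sketch only; the true coefficients here are made up and the original code and seed are not shown):
# Sketch: 100 observations, 5 predictors, and a binary outcome from a logistic model
n <- 100
k <- 5
X <- matrix(rnorm(n * k), ncol = k)
y <- as.integer(runif(n) < plogis(X %*% rep(1, k)))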
[1] 100 5
[,1] [,2] [,3] [,4] [,5]
[1,] -0.13592452 1.1921489 -1.04101654 0.26500638 -0.51561099
[2,] -0.04079697 -0.1231379 -0.43190705 1.38694989 0.39568325
[3,] 1.01053901 -0.5741648 -0.77781632 -0.29149014 -0.78301461
[4,] -0.15826244 -1.4903169 0.37368178 -1.83027672 0.88538861
[5,] -2.15663750 2.3638289 0.31256458 -1.62766978 -0.38212891
[6,] 0.49864683 -2.9510362 0.07122864 -0.01630346 0.05333596
[1] 1 1 0 0 1 0
my_boot <- function(y, X, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Fitting the model
apply(indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
set.seed(3312)
ans <- my_boot(y, X, B=50)
head(ans)
(Intercept) X[i, ]1 X[i, ]2 X[i, ]3 X[i, ]4 X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
my_boot_pll <- function(y, X, cl, B=1000) {
# Generating the indices
n <- length(y)
indices <- sample.int(n = n, size = n * B, replace = TRUE) |>
matrix(nrow = n)
# Making sure y and X are available in the cluster; use envir = environment()
# so the function's local y and X (not the globals) are the ones exported
parallel::clusterExport(cl, c("y", "X"), envir = environment())
# Fitting the model
parallel::parApply(cl, indices, 2, function(i) {
glm(y[i] ~ X[i,], family = binomial("logit")) |>
coef()
}) |> t()
}
cl <- parallel::makeForkCluster(4)
set.seed(3312)
ans_pll <- my_boot_pll(y, X, cl, B=50)
head(ans_pll)
(Intercept) X[i, ]1 X[i, ]2 X[i, ]3 X[i, ]4 X[i, ]5
[1,] 2.943576 -0.46986617 2.292807 1.1069735 2.117947 0.7839228
[2,] 4.265760 -0.01445575 3.881603 2.5052960 4.300462 0.0542386
[3,] 2.702185 -0.40973910 2.315127 1.1693082 3.059388 0.2927383
[4,] 4.827939 -1.52854114 2.692226 1.7977035 4.370736 0.7825011
[5,] 3.229396 -0.56316370 1.980704 1.4054200 3.949632 0.2806117
[6,] 2.933971 0.25911455 2.193838 0.6953409 1.970649 -0.3528708
How much faster?
microbenchmark::microbenchmark(
parallel = my_boot_pll(y, X, cl, B=1000),
serial = my_boot(y, X, B=1000),
times = 1,
unit = "s"
)
Unit: seconds
expr min lq mean median uq max neval
parallel 0.1918902 0.1918902 0.1918902 0.1918902 0.1918902 0.1918902 1
serial 0.5202934 0.5202934 0.5202934 0.5202934 0.5202934 0.5202934 1
Problem: revisiting the overhead cost of parallelization. We want to fit the following model: \[ y = X_k\beta_k + \varepsilon,\quad k = 1, \dots \]
[,1] [,2] [,3] [,4] [,5]
[1,] -1.91835255 -0.2359106 -1.4642601 -0.5320349 -0.4639574
[2,] -0.09017806 -0.1022420 -0.6735899 1.6146947 -2.3792154
[3,] -1.25551672 1.2079800 0.2159515 -0.1323614 0.9867689
[4,] 1.28006769 -0.2806277 -0.2026345 -0.7375033 -0.1067501
[1] 0.3570720 0.4850507 1.0281664 -0.7044579 1.1378356 -0.9032009
Important
For this exercise only, we exclude the time required to set up and stop the cluster. Those times are usually negligible for large computations but are also part of the overhead cost.
Let’s start with the naive approach: fitting the model and returning the full output.
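A sketch of the naive version (assuming, as before, a wide X, a response y, and an already-created cluster cl):
# Naive: each worker returns the complete lm object, which is large
ans_naive <- parallel::parApply(
  cl = cl, X = X, MARGIN = 2,
  FUN = function(x, y) lm(y ~ x), # full fit: residuals, fitted values, qr, model frame, ...
  y = y
)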
Approach | Elapsed time (s) |
---|---|
Serial | 2.429 |
Parallel naive | 4.489 |
The problem: we are returning a lot of information we may not need (the full lm object includes residuals, fitted values, the QR decomposition, the model frame, etc.).
Instead of capturing the full output, we can just return the coefficients.
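The corresponding sketch, now sending back only the coefficient vector:
# Leaner: only a small numeric vector travels back to the main session
ans_coef <- parallel::parApply(
  cl = cl, X = X, MARGIN = 2,
  FUN = function(x, y) coef(lm(y ~ x)),
  y = y
)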
Approach | Elapsed time (s) |
---|---|
Serial | 2.429 |
Parallel naive | 4.489 |
Parallel coef | 0.743 |
The coefficients are much smaller, cutting the data communicated back to about 0.8 Mb and, with it, significantly reducing the overhead cost.
Since we only need the coefficients, we can use a lighter version of lm called lm.fit.
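A sketch of that version; lm.fit works on the bare design matrix, so we add the intercept column ourselves:
# Lighter still: lm.fit skips the formula machinery entirely
ans_lite <- parallel::parApply(
  cl = cl, X = X, MARGIN = 2,
  FUN = function(x, y) lm.fit(x = cbind(1, x), y = y)$coefficients,
  y = y
)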
Approach | Elapsed time (s) |
---|---|
Serial | 2.429 |
Parallel naive | 4.489 |
Parallel coef | 0.743 |
Parallel lite | 0.257 |
Pro-tip
Using a Fork cluster instead of a PSOCK cluster can further reduce the overhead cost: both X and y would have been automatically available to the workers at no extra cost.
Parallel computing is a powerful tool to speed up your R code.
It's not always the best solution; you have to think first!
In R, the parallel package is a good starting point for explicit parallelism.
When parallelizing, think about the overhead cost and how to "do less": communicate less data, fewer times.
gvegayon
ggvy.cl
george.vegayon@utah.edu
For more, check out the CRAN Task View on HPC.
We know that \(\pi = \frac{A}{r^2}\), where \(A\) is the area of a circle of radius \(r\). We approximate it by sampling points \(x\) uniformly in a square of side 2 centered at the origin: the unit circle (area \(\pi\)) is inscribed in that square (area \(2^2 = 4\)), so the proportion of points with \(\|x\| \leq 1\) approximates \(\pi / 4\).
So, we approximate \(\pi\) as \(\Pr\{\|x\| \leq 1\}\times 2^2\).
The R code to do this:
library(parallel)
# Setup
cl <- makePSOCKcluster(4L)
clusterSetRNGStream(cl, 123)
# Number of simulations we want each time to run
nsim <- 1e5
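# NOTE: pisim() is used below but its definition is not shown here; a minimal
# sketch consistent with the description above (assumed, not necessarily the
# author's exact code): sample nsim points uniformly in [-1, 1] x [-1, 1] and
# return the proportion with norm <= 1, times the area of the square (4).
pisim <- function(i, nsim) {
  p <- matrix(runif(nsim * 2, min = -1, max = 1), ncol = 2)
  mean(sqrt(rowSums(p^2)) <= 1) * 4
}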
# We need to make -nsim- and -pisim- available to the
# cluster
clusterExport(cl, c("nsim", "pisim"))
# Benchmarking: parSapply and sapply will run this simulation
# a hundred times each, so at the end we have 1e5*100 points
# to approximate pi
microbenchmark::microbenchmark(
parallel = parSapply(cl, 1:100, pisim, nsim=nsim),
serial = sapply(1:100, pisim, nsim=nsim),
times = 10,
unit = "relative"
)
Unit: relative
expr min lq mean median uq max neval cld
parallel 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 10 a
serial 1.882948 1.899173 1.975634 1.896836 2.155636 2.04383 10 b
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin23.4.0
Running under: macOS Sonoma 14.6.1
Matrix products: default
BLAS: /opt/homebrew/Cellar/openblas/0.3.27/lib/libopenblasp-r0.3.27.dylib
LAPACK: /opt/homebrew/Cellar/r/4.4.1/lib/R/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver
tzcode source: internal
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] microbenchmark_1.4.10
loaded via a namespace (and not attached):
[1] digest_0.6.37 codetools_0.2-20 multcomp_1.4-25 fastmap_1.2.0
[5] Matrix_1.7-0 xfun_0.45 lattice_0.22-6 TH.data_1.1-2
[9] splines_4.4.1 zoo_1.8-12 knitr_1.47 htmltools_0.5.8.1
[13] rmarkdown_2.27 mvtnorm_1.2-5 cli_3.6.3 grid_4.4.1
[17] sandwich_3.1-0 compiler_4.4.1 tools_4.4.1 evaluate_1.0.0
[21] survival_3.6-4 yaml_2.3.8 rlang_1.1.4 jsonlite_1.8.9
[25] MASS_7.3-60.2