Debugging and Profiling

Author

Jonathan Chipman, Ph.D.

Introduction

Readings

In the RStudio (aka Posit) menu bar, notice the Debug and Profile drop-down menus. These are quick starting points for finding code bugs/bottlenecks. Studying the below chapters will give insight into finding where and how to improve code efficiency.

Required reading from Advanced R by Hadley Wickham:

  1. Ch. 22: Debugging
  2. Ch. 23: Measuring Performance
  3. Ch. 24: Improving Performance

Supplemental reading:

Norman Matloof’s Ch. 13.1-13.3.5: Debugging and Ch. 14: Performance Enhancement: Speed and Memory is slightly outdated by recommending RProf to measure performance (as compared to profvis used in RStudio) but has good content on speed and memory allocation.

Warm-up

Vectorization

Review the table of vectorization commands

Perform calculations one-at-a-time ONLY if necessary. Pause to think if you’re performing the same operation over and over again.

Urn problem revisited!

“Urn 1 contains ten blue marbles and eight yellow ones. In urn 2, the mixture is six blue and six yellow. We draw a marble at random from urn 1, transfer it to urn 2, and then draw a marble at random from urn 2. What is the probability that that second marble is blue? This is easy to find analytically, but we’ll use simulation.” Create a function that generates 100K replicates and returns the estimated probability.

When are loops faster than *apply (and vice-versa)

We saw some solutions to the urn problem that were more time consuming despite using an apply statement. What is happening here?

This week’s lesson

Key concepts

  1. Debugging strategies
  • Modular programming
  • traceback
  • debug functions
  • set break points
  1. Profiling code
  • system.time and bench::mark
  • profvis::provis
  1. Improving efficiency
  • Ch. 24: Improving Performance
  • Use vectorization
  • Beware of copy-on-change (example: continual rbind’ing)
  • Use data.table as appropriate
  • Use parallel computing
  • Compile code in C (rcpp)

Debugging

Debug in a Modular, Top-Down Manner

How do you organize files on your computer?

A friend of mine stored all documents in the downloads folder! When it came time to find a file, it was like finding a needle in a haystack (there also wasn’t a uniform naming convention).

Quoting from Matloff

“Most good software developers agree that code should be written in a modular manner. Your first-level code should not be longer than, say, a dozen lines, with much of it consisting of function calls. And those functions should not be too lengthy and should call other functions if necessary. This makes the code easier to organize during the writing stage and easier for others to understand when it comes time for the code to be extended.”

Top-down modular level:

  1. Analysis / Summary
  2. Design functions
  3. Function to draw posterior distributions

With debugging start top-down. Assume lower-level functions are correct until finishing debugging / checking top-level modules.

An example from collaboration:

An example from an R package: SeqSGPV

  • Step 0: Provides initial checks on inputs
  • Steps 1-4: Call lower-level functions
  • Step 5: Aggregates results
  • Step 6: Returns output object

Question for HW 1: What can be “modularized”?

Antibugging

Pro-actively stop a function (or provide a warning or message) if a condition is not met (example similar to here):

x <- 1
stopifnot(is.character(x))
Error: is.character(x) is not TRUE

Here, is an equivalent stop command using if-logic. deparse(subsitute([var])) prints the name of the object:

x <- 1
if(!is.character(x)){
  stop(paste0(deparse(substitute(x)), " is not a character"))
}
Error in eval(expr, envir, enclos): x is not a character
x <- 1
if(!is.character(x)){
  warning(paste0(deparse(substitute(x)), " is not a character"))
}
Warning: x is not a character

Interactive debugging

For a more in-depth consideration of debugging, see https://adv-r.hadley.nz/debugging.html. The examples below are from this book.

In R-Studio, notice the “Debug” drop-down menu. Under “on error” it has three options:

  1. Message only: Provides the error
  2. Error Inspector: Provides the error + traceback and debugonce options
  3. Break in Code: Provides the error + enters debugonce

The traceback function lists call that provided the error and any sub-calls prior to it.

Options to debug a function from top-to-bottom:

  • debug and (undebug)
  • debugonce

This will interactively walk you through the function call with the following commands:

  • n: next line of code
  • s: step into the next function
  • f: continue forward through the current {} code block (example, a loop or the rest of the function)
  • c: continue the rest of the function to see if it’ll complete as anticipated
  • Q: quit interactive
  • where: shows the trace stack (the possibly multiple layers of a function call)
  • any r command may be run when in debugger mode

An example where f() calls g() which calls h() which calls i() which is a function with conditional logic …

f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
  if (!is.numeric(d)) {
    stop("`d` must be numeric", call. = FALSE)
  }
  d + 10
}

# Traceback and debug using Rstudio 'Rerun with Debug'
f("a")

# Enter debugging mode for a function until deactivating debugging mode
debug(f)
f("a")
f("a")
undebug(f)

# Enter debug mode for a single call
debugonce(f)
f("a")
f("a")

View and debug are good ways to learn the underlying code to a function.

View(lm)
debugonce(lm)
lm(1:10~1,method = "model.frame")

You can use the browser function to enter debugging mode partially through a function

f <- function(x, remainder){
  
  # Step 1, add 1 to even elements
  x[x%%2] <- x[x%%2] + 1
  
  # Step 2, see if each value in sequence is even or odd
  out <- x %% remainder
  
  if(length(unique(out))!= 2) browser()
  
  # Step 2, for elements with 0 remainder, add 1
  x[out == 0] <- x[out == 0] + 1
  
  # Step 3, report number of unique elements
  length(unique(x))

}
f(x=1:10,remainder=3)

The browser function can be helpful if you want to debug a loop mid-way through iterations.

f <- function(){
  for (i in 1:10^5){
    print(i)
    if(i == 100) browser()
  }
}
f()

While embedding browser into code can be helpful, you must remember to remove the browser calls when done. Rstudio also provides a feature to set breakpoints by clicking on the line-number where you’d want to start debugging. See also breakpoints.

Using the urn problem, set breakpoints and run through debugonce().

Practice

Find a set of code you’ve written and know well to share with a classmate.

As the classmate,

  • Do you have any feedback on how to make the code more readable? More efficient?
  • Insert a bug(s) into the code

When you get your code back, can you find and fix the bug?

Profiling

system.time

For a quick profiling of code, use system.time().

x <- runif(10000000)
y <- runif(10000000)
t1 <- system.time(z <- x + y)

z <- numeric(10000000)
t2 <- system.time({
  for (i in 1:length(x)) {
    z[i] <- x[i] + y[i]
  }
})

# Time duration
t1
   user  system elapsed 
  0.005   0.008   0.015 
t2
   user  system elapsed 
  0.407   0.010   0.421 
# The loop is longer by a factor of:
t2["elapsed"] / t1["elapsed"]
 elapsed 
28.06667 

Why is t2 28.07? times longer?

  • Seemingly hidden functions are : and [ (subsetting) which add to the collective set of functions called (i.e. the “stack frame”).

system.time is not as precise for quick calls and can require many iterations to get a meaningful timing (example: 10e6 in above example). bench::mark can provide more precise timings (see warm up problem).

  • ns (nanoseconds) < \(\mu\)s (microseconds) < ms (milliseconds) < s (seconds) < m (minutes) < h (hours) < d (days) < w (weeks)
  • relative option in mark does not report in raw time but in comparing different calls
  • check=FALSE allows for running different code which may return different results (example: due to sampling)

profvis (profiling visualization)

profvis::profvis is a good go-to for evaluating performance of multiple calls.

We’ll work through Ch. 23: Measuring Performance together. Begin by the below script as a file called “profiling-example.R”:

f <- function() {
  profvis::pause(0.1)
  g()
  h()
}
g <- function() {
  profvis::pause(0.1)
  h()
}
h <- function() {
  profvis::pause(0.1)
}

Install the profvis package if needed, and call:

source("profiling-example.R")
profvis::profvis(f())

Alternatively, call

f <- function() {
  profvis::pause(0.1)
  g()
  h()
}
g <- function() {
  profvis::pause(0.1)
  h()
}
h <- function() {
  profvis::pause(0.1)
}
profvis::profvis(f())

Now, we’ll turn to the notes Ch. 23: Measuring Performance as we walk through the:

  1. stack call: the set of routines called in a function and subfunctions
  2. time allocation: the time spent on each call within function
  3. memory allocation: the amount of memory used with each call

Notice in the example below (and even better as seen in Ch. 23: Measuring Performance) how much time is spent in . stands for ‘garbage collector’ and points to a place of possible ineffeciences.

profvis::profvis({
  x <- integer()
  for (i in 1:1e4) {
    x <- c(x, i)
  }
})

Practice

Run your urn solution (or another posted here) through profvis::profvis().

Benchmarking

The bench::mark function provides more accurate time of functions and allows for checking whether two functions return the same results (check=TRUE). We’ve seen this here.

library(bench)
bench::mark( <solutions>, relative = TRUE, check = FALSE)

Improving efficiency

Ch. 24: Improving Performance

Parts to highlight:

“It’s easy to get caught up in trying to remove all bottlenecks. Don’t! Your time is valuable and is better spent analysing your data, not eliminating possible inefficiencies in your code. Be pragmatic: don’t spend hours of your time to save seconds of computer time. To enforce this advice, you should set a goal time for your code and optimise only up to that goal. This means you will not eliminate all bottlenecks.”

Be aware of all the steps required for a function. If it requires additional steps, see if there are alternative strategies that provide equivalent solutions

Avoiding copies is when you iteratively build up an object that includes previous values. For example:

# NOTE: an example of what to AVOID
out <- 1
for (i in 1:10){
  out <- c(out,i)
}

# BETTER: Pre-allocate space
out <- rep(NA,11)
out[1] <- 1
for (i in 1:10){
  out[i+1] <- i
}

*check for existing solutions; also, appreciate the value of cpp

Session info

devtools::session_info()
─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.0 (2024-04-24)
 os       macOS Sonoma 14.5
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Denver
 date     2024-09-03
 pandoc   3.1.11 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 package     * version date (UTC) lib source
 cachem        1.0.8   2023-05-01 [1] CRAN (R 4.4.0)
 cli           3.6.2   2023-12-11 [1] CRAN (R 4.4.0)
 devtools      2.4.5   2022-10-11 [1] CRAN (R 4.4.0)
 digest        0.6.35  2024-03-11 [1] CRAN (R 4.4.0)
 ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.4.0)
 evaluate      0.23    2023-11-01 [1] CRAN (R 4.4.0)
 fastmap       1.1.1   2023-02-24 [1] CRAN (R 4.4.0)
 fs            1.6.4   2024-04-25 [1] CRAN (R 4.4.0)
 glue          1.7.0   2024-01-09 [1] CRAN (R 4.4.0)
 htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
 htmlwidgets   1.6.4   2023-12-06 [1] CRAN (R 4.4.0)
 httpuv        1.6.15  2024-03-26 [1] CRAN (R 4.4.0)
 jsonlite      1.8.8   2023-12-04 [1] CRAN (R 4.4.0)
 knitr         1.46    2024-04-06 [1] CRAN (R 4.4.0)
 later         1.3.2   2023-12-06 [1] CRAN (R 4.4.0)
 lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
 magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
 memoise       2.0.1   2021-11-26 [1] CRAN (R 4.4.0)
 mime          0.12    2021-09-28 [1] CRAN (R 4.4.0)
 miniUI        0.1.1.1 2018-05-18 [1] CRAN (R 4.4.0)
 pkgbuild      1.4.4   2024-03-17 [1] CRAN (R 4.4.0)
 pkgload       1.3.4   2024-01-16 [1] CRAN (R 4.4.0)
 profvis       0.3.8   2023-05-02 [1] CRAN (R 4.4.0)
 promises      1.3.0   2024-04-05 [1] CRAN (R 4.4.0)
 purrr         1.0.2   2023-08-10 [1] CRAN (R 4.4.0)
 R6            2.5.1   2021-08-19 [1] CRAN (R 4.4.0)
 Rcpp          1.0.12  2024-01-09 [1] CRAN (R 4.4.0)
 remotes       2.5.0   2024-03-17 [1] CRAN (R 4.4.0)
 rlang         1.1.3   2024-01-10 [1] CRAN (R 4.4.0)
 rmarkdown     2.28    2024-08-17 [1] CRAN (R 4.4.0)
 rstudioapi    0.16.0  2024-03-24 [1] CRAN (R 4.4.0)
 sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
 shiny         1.8.1.1 2024-04-02 [1] CRAN (R 4.4.0)
 stringi       1.8.4   2024-05-06 [1] CRAN (R 4.4.0)
 stringr       1.5.1   2023-11-14 [1] CRAN (R 4.4.0)
 urlchecker    1.0.1   2021-11-30 [1] CRAN (R 4.4.0)
 usethis       3.0.0   2024-07-29 [1] CRAN (R 4.4.0)
 vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
 xfun          0.43    2024-03-25 [1] CRAN (R 4.4.0)
 xtable        1.8-4   2019-04-21 [1] CRAN (R 4.4.0)
 yaml          2.3.8   2023-12-11 [1] CRAN (R 4.4.0)

 [1] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library

──────────────────────────────────────────────────────────────────────────────