a <- 1
fun <- function() {
  return(a)
}
fun_sub <- function() {
  a <- pi # It doesn't change the result!
  return(fun())
}
fun_sub()
[1] 1
PHS 7045: Advanced Programming
Discuss some technical aspects of functions in R.
We will learn how to manipulate data with data.table and, in particular, how to pair it with dplyr verbs via dtplyr.
Throughout the session, we will see examples using the MET dataset.
Where the object was created matters.
And the order also matters.
Question: Two R packages with the same function name, degree. Are these two equivalent?
vs
What should you do?
R is very smart when it comes to saving space.
Objects passed to a function are only copied if modified, the “copy-on-modify” concept.
This is very powerful as it mixes passing objects by reference and by copy dynamically (see this wiki entry).
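A minimal sketch illustrating the point (the object bigdata and its size are assumptions for illustration):

# A large object to pass around (hypothetical; any sizable object works)
bigdata <- data.frame(x = runif(1e6))
data.table::address(bigdata)
# Passing it to a function without modifying it: the address is unchanged,
# meaning no copy was made
(function(a) {
  data.table::address(a)
})(bigdata)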
# If we change it...
(function(a) {
  a <- a + 0 # The input is now modified!
  data.table::address(a)
})(bigdata)
[1] "0x55555ad042b0"
Note: Functions that don’t have a name are called “anonymous functions.”
Overall, you will find the following approaches:
base R: Use only base R functions.
dplyr: Using “verbs”.
data.table: High-performing (ideal for large data)
dplyr + data.table = dtplyr: High-performing + dplyr verbs.
Other methods involve, for example, using external tools such as Spark (via the sparklyr package).
We will be focusing on data.table because of its performance (see the benchmarks referenced at the end of this session).
Take a look at this neat cheat sheet by Erik Petrovski here.
The dtplyr R package translates dplyr (tidyverse) syntax to data.table, so that we can still use the dplyr verbs while at the same time leveraging the performance of data.table.
The data that we will be using is an already processed version of the MET dataset. We can download (and load) the data directly in our session using the following commands:
# Where are we getting the data from
met_url <- "https://github.com/USCbiostats/data-science-data/raw/master/02_met/met_all.gz"
# Saving the data to a local file; alternatively, you could download it
# to a tempfile (so it is destroyed afterwards), e.g.:
# tmp <- tempfile(fileext = ".gz")
tmp <- "met.gz"
# We should download this ONLY IF it was not downloaded already;
# otherwise it is just a waste of time.
if (!file.exists(tmp)) {
  download.file(
    url      = met_url,
    destfile = tmp
    # method   = "libcurl", timeout = 1000 (you may need this option)
  )
}
Now we can load the data using the fread() function. In R:
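The call itself is not shown; a minimal sketch (assuming the file was saved to tmp as above):

# Reading the (compressed) file with data.table's fast reader
dat <- data.table::fread(tmp)
head(dat)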
   USAFID  WBAN  year month   day  hour   min   lat      lon  elev wind.dir
    <int> <int> <int> <int> <int> <int> <int> <num>    <num> <int>    <int>
1: 690150 93121  2019     8     1     0    56  34.3 -116.166   696      220
2: 690150 93121  2019     8     1     1    56  34.3 -116.166   696      230
3: 690150 93121  2019     8     1     2    56  34.3 -116.166   696      230
4: 690150 93121  2019     8     1     3    56  34.3 -116.166   696      210
5: 690150 93121  2019     8     1     4    56  34.3 -116.166   696      120
6: 690150 93121  2019     8     1     5    56  34.3 -116.166   696       NA
   wind.dir.qc wind.type.code wind.sp wind.sp.qc ceiling.ht ceiling.ht.qc
        <char>         <char>   <num>     <char>      <int>         <int>
1:           5              N     5.7          5      22000             5
2:           5              N     8.2          5      22000             5
3:           5              N     6.7          5      22000             5
4:           5              N     5.1          5      22000             5
5:           5              N     2.1          5      22000             5
6:           9              C     0.0          5      22000             5
   ceiling.ht.method sky.cond vis.dist vis.dist.qc vis.var vis.var.qc  temp
              <char>   <char>    <int>      <char>  <char>     <char> <num>
1:                 9        N    16093           5       N          5  37.2
2:                 9        N    16093           5       N          5  35.6
3:                 9        N    16093           5       N          5  34.4
4:                 9        N    16093           5       N          5  33.3
5:                 9        N    16093           5       N          5  32.8
6:                 9        N    16093           5       N          5  31.1
   temp.qc dew.point dew.point.qc atm.press atm.press.qc       rh
    <char>     <num>       <char>     <num>        <int>    <num>
1:       5      10.6            5    1009.9            5 19.88127
2:       5      10.6            5    1010.3            5 21.76098
3:       5       7.2            5    1010.6            5 18.48212
4:       5       5.0            5    1011.6            5 16.88862
5:       5       5.0            5    1012.7            5 17.38410
6:       5       5.6            5    1012.7            5 20.01540
In Python
Before we continue, let’s learn a bit more on data.table and dtplyr
data.table and dtplyr: Data Table’s Syntax
In data.table, everything happens within the square brackets. Here is a common way to imagine DT:
DT[i, j, by]
i.e., take DT, subset rows using i, then compute j, grouped by by.
Operations applied in j are evaluated within the data, meaning that names work as symbols, e.g.,
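For instance, a sketch using the dat object loaded earlier (the statistic chosen is an assumption):

# temp is evaluated as a column of dat, not as a global object
dat[, mean(temp, na.rm = TRUE)]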
Furthermore, we can do things like this:
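One possibility (a sketch, not necessarily the original example) is computing several summaries at once:

# Returning a list in j yields a data.table with one column per element
dat[, .(
  temp_avg = mean(temp, na.rm = TRUE),
  wind_avg = mean(wind.sp, na.rm = TRUE)
)]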
data.table and dtplyr: Lazy table
The dtplyr package provides a way to translate dplyr verbs to data.table syntax.
The key lies in the function lazy_dt from dtplyr (see ?dtplyr::lazy_dt).
This function creates a wrapper that “points” to a data.table object.
data.table and dtplyr: Lazy table (cont.)
# Creating a lazy table object
dat_ldt <- lazy_dt(dat, immutable = FALSE)
# We can use the address() function from data.table
# to check that both point to the same object
address(dat)
[1] "0x55555b2bb710"
address(dat_ldt$parent)
[1] "0x55555b2bb710"
# To ensure we are not modifying the original data, use immutable = TRUE
dat_ldt <- lazy_dt(dat, immutable = TRUE)
Question: What is the immutable = FALSE option used for?
How can we select the columns USAFID, lat, and lon using data.table?
dat[, list(USAFID, lat, lon)]
# dat[, .(USAFID, lat, lon)]       # Alternative 1
# dat[, c("USAFID", "lat", "lon")] # Alternative 2
         USAFID    lat      lon
          <int>  <num>    <num>
      1: 690150 34.300 -116.166
      2: 690150 34.300 -116.166
      3: 690150 34.300 -116.166
      4: 690150 34.300 -116.166
      5: 690150 34.300 -116.166
     ---                       
2377339: 726813 43.650 -116.633
2377340: 726813 43.650 -116.633
2377341: 726813 43.650 -116.633
2377342: 726813 43.642 -116.636
2377343: 726813 43.642 -116.636
What happens if instead of list() you used c()?
Using the dplyr::select verb:
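A sketch of the call that produces the output below (assuming dplyr and dtplyr are loaded; the printed Call confirms the translation):

dat_ldt %>% select(USAFID, lat, lon)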
Source: local data table [2,377,343 x 3]
Call:   `_DT2`[, .(USAFID, lat, lon)]
  USAFID   lat   lon
   <int> <dbl> <dbl>
1 690150  34.3 -116.
2 690150  34.3 -116.
3 690150  34.3 -116.
4 690150  34.3 -116.
5 690150  34.3 -116.
6 690150  34.3 -116.
# ℹ 2,377,337 more rows
# Use as.data.table()/as.data.frame()/as_tibble() to access results
In the case of pydatatable:
        | USAFID      lat       lon
        |  int32  float64   float64
------- + ------  -------  --------
      0 | 690150   34.3    -116.166
      1 | 690150   34.3    -116.166
      2 | 690150   34.3    -116.166
      3 | 690150   34.3    -116.166
      4 | 690150   34.3    -116.166
      5 | 690150   34.3    -116.166
      6 | 690150   34.3    -116.166
      7 | 690150   34.3    -116.166
      8 | 690150   34.3    -116.166
      9 | 690150   34.3    -116.166
     10 | 690150   34.296  -116.162
     11 | 690150   34.296  -116.162
     12 | 690150   34.3    -116.166
     13 | 690150   34.3    -116.166
     14 | 690150   34.3    -116.166
    ... |    ...  ...       ...    
2377338 | 726813   43.65   -116.633
2377339 | 726813   43.65   -116.633
2377340 | 726813   43.65   -116.633
2377341 | 726813   43.642  -116.636
2377342 | 726813   43.642  -116.636
[2377343 rows x 3 columns]
What happens if instead of ["USAFID", "lat", "lon"] you used {"USAFID", "lat", "lon"} (a list vs. a set)?
For the rest of the session, we will use these variables: USAFID, WBAN, year, month, day, hour, min, lat, lon, elev, wind.sp, temp, and atm.press.
Based on logical operations, e.g., condition 1 [and|or condition2 [and|or ...]]
Need to be aware of ordering and grouping of and and or operators (see Comparison).
| x | y | Negate: !x | And: x & y | Or: x \| y | Xor: xor(x, y) |
|---|---|---|---|---|---|
| true | true | false | true | true | false |
| false | true | true | false | true | true |
| true | false | false | false | true | true |
| false | false | true | false | false | false |
Write a function that takes two arguments (x,y) and applies the XOR operator element-wise. Here you have a template:
myxor <- function(x, y) {
  res <- logical(length(x))
  for (i in 1:length(x)) {
    res[i] <- # do something with x[i] and y[i]
  }
  return(res)
}
Or, if vectorized (which would be better):
Hint 1: Remember that negating (x & y) equals (!x | !y).
Hint 2: Logical operators are distributive, meaning a * (b + c) = (a * b) + (a * c), where * and + stand for & or |.
In R
myxor1 <- function(x,y) {(x & !y) | (!x & y)}
myxor2 <- function(x,y) {!((!x | y) & (x | !y))}
myxor3 <- function(x,y) {(x | y) & (!x | !y)}
myxor4 <- function(x,y) {!((!x & !y) | (x & y))}
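The test object is not defined in the snippet above; a hypothetical definition consistent with the output below is a matrix holding all four logical combinations:

# All four (x, y) combinations, one per row (hypothetical definition)
test <- cbind(
  x = c(TRUE, TRUE, FALSE, FALSE),
  y = c(TRUE, FALSE, TRUE, FALSE)
)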
cbind(
  ifelse(xor(test[,1], test[,2]), "true", "false"),
  ifelse(myxor1(test[,1], test[,2]), "true", "false"),
  ifelse(myxor2(test[,1], test[,2]), "true", "false"),
  ifelse(myxor3(test[,1], test[,2]), "true", "false"),
  ifelse(myxor4(test[,1], test[,2]), "true", "false")
)
     [,1]    [,2]    [,3]    [,4]    [,5]   
[1,] "false" "false" "false" "false" "false"
[2,] "true"  "true"  "true"  "true"  "true" 
[3,] "true"  "true"  "true"  "true"  "true" 
[4,] "false" "false" "false" "false" "false"
Or in Python:
# Loading the libraries
import numpy as np
import pandas as pa
# Defining the data
x = [True, True, False, False]
y = [False, True, True, False]
ans = {
    'x'   : x,
    'y'   : y,
    'and' : np.logical_and(x, y),
    'or'  : np.logical_or(x, y),
    'xor' : np.logical_xor(x, y)
    }
pa.DataFrame(ans)
       x      y    and     or    xor
0   True  False  False   True   True
1   True   True   True   True  False
2  False   True  False   True   True
3  False  False  False  False  False
Or in Python (bis):
def myxor(x,y):
    return np.logical_or(
        np.logical_and(x, np.logical_not(y)),
        np.logical_and(np.logical_not(x), y)
    )
ans['myxor'] = myxor(x,y)
pa.DataFrame(ans)
       x      y    and     or    xor  myxor
0   True  False  False   True   True   True
1   True   True   True   True  False  False
2  False   True  False   True   True   True
3  False  False  False  False  False  False
We will now see applications using the MET dataset.
Need to select records according to some criteria. For example:
The logical expressions would be, respectively:
day == 1
lat > 40
(elev < 500) | (elev > 1000)
In R with data.table:
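A sketch of the data.table filter (counting the matching records, consistent with the Python count further below):

# Logical conditions are combined inside the first argument (i)
nrow(dat[day == 1 & lat > 40 & (elev < 500 | elev > 1000)])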
In R with dplyr::filter():
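And a sketch of the dplyr version on the lazy table:

dat_ldt %>%
  filter(day == 1, lat > 40, (elev < 500) | (elev > 1000)) %>%
  collect() %>%
  nrow()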
In Python
dat[(dt.f.day == 1) & (dt.f.lat > 40) & ((dt.f.elev < 500) | (dt.f.elev > 1000)), :].nrows
# dat[dt.f.day == 1,:][dt.f.lat > 40,:][(dt.f.elev < 500) | (dt.f.elev > 1000),:].nrows
27623
In the case of pydatatable, we use dt.f. to refer to a column, while dt. refers to datatable’s namespace.
The f. is a symbol that allows accessing column names in a datatable’s Frame.
How many records have a temperature between 18 and 25?
Some records have missing values. How many records have temp recorded as NA?
Following the previous question, plot a sample of 1,000 (lat, lon) points, contrasting the stations where temp is NA against those with data.
logical: Bool true/false type, e.g., dead/alive, sick/healthy, good/bad, yes/no, etc.
strings: string of characters (letters/symbols), e.g., names, text, etc.
integer: Numeric variable with no decimal (discrete), e.g., age, days, counts, etc.
double: Numeric variable with decimals (continuous), e.g., distance, expression level, time.
In C (and other languages), strings, integers, and doubles may be specified with a size, e.g., in Python integers can be 8, 16, or 32 bits. This is relevant when managing large datasets, where saving space can be fundamental (more info).
Most programming languages have special types which are built using basic types. Here are a few examples:
time: Could be date, date + time, or a combination of both. Usually, it has a reference number defined as date 0. In R, the Date class has as reference 1970-01-01, in other words, “days since January 1st, 1970”.
categorical: Commonly used to represent strata/levels of variables, e.g., a variable “country” could be represented as a factor where the data is stored as numbers but has a label.
ordinal: Similar to factor, but it has ordering, e.g. “satisfaction level: 5 very satisfied, …, 1 very unsatisfied”.
Other noteworthy data types include ways to represent missing values (usually written as na or NA) and special numeric values, e.g., ±Inf and NaN (undefined).
When storing/sharing datasets, it is a good practice to do it along a dictionary describing each column data type/format.
0, 1, 1, 0, 0, 1
Diabetes type 1, Diabetes type 2, Diabetes type 1, Diabetes type 2
on, off, off, on, on, on
5, 10, 1, 15, 0, 0, 1
1.0, 2.0, 10.0, 6.0
high, low, medium, medium, high
-1, 1, -1, -1, 1
.2, 1.5, .8, \(\pi\)
\(\pi\), \(\exp{1}\), \(\pi\), \(\pi\)
If we wanted to create two variables, elev^2 and a version of wind.sp scaled by its standard deviation, we could do the following:
Using data.table
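A sketch using := to create both columns in place (the column names match the output below):

# := adds columns by reference, without copying dat
dat[, elev2 := elev^2]
dat[, windsp_scaled := wind.sp / sd(wind.sp, na.rm = TRUE)]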
With the verb dplyr::mutate():
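And a sketch with mutate() on the lazy table:

dat_ldt %>%
  mutate(
    elev2         = elev^2,
    windsp_scaled = wind.sp / sd(wind.sp, na.rm = TRUE)
  )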
# A tibble: 2,377,343 × 15
   USAFID  WBAN  year month   day  hour   min   lat   lon  elev wind.sp  temp
    <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int>   <dbl> <dbl>
 1 690150 93121  2019     8     1     0    56  34.3 -116.   696     5.7  37.2
 2 690150 93121  2019     8     1     1    56  34.3 -116.   696     8.2  35.6
 3 690150 93121  2019     8     1     2    56  34.3 -116.   696     6.7  34.4
 4 690150 93121  2019     8     1     3    56  34.3 -116.   696     5.1  33.3
 5 690150 93121  2019     8     1     4    56  34.3 -116.   696     2.1  32.8
 6 690150 93121  2019     8     1     5    56  34.3 -116.   696     0    31.1
 7 690150 93121  2019     8     1     6    56  34.3 -116.   696     1.5  29.4
 8 690150 93121  2019     8     1     7    56  34.3 -116.   696     2.1  28.9
 9 690150 93121  2019     8     1     8    56  34.3 -116.   696     2.6  27.2
10 690150 93121  2019     8     1     9    56  34.3 -116.   696     1.5  26.7
# ℹ 2,377,333 more rows
# ℹ 3 more variables: atm.press <dbl>, elev2 <dbl>, windsp_scaled <dbl>
Notice that since dat_ldt is a pointer, the variables are now in dat.
Imagine that we needed to generate all those calculations (scale by sd) on many more variables. We could then use the .SD symbol:
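A sketch consistent with the note below (in_names and out_names are assumed helper objects):

in_names  <- c("wind.sp", "temp", "atm.press")
out_names <- paste0(in_names, "_scaled")
# lapply() over .SD applies the scaling to each column listed in .SDcols
dat[,
  c(out_names) := lapply(.SD, function(x) x / sd(x, na.rm = TRUE)),
  .SDcols = in_names
]
head(dat[, ..out_names], n = 4)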
   wind.sp_scaled temp_scaled atm.press_scaled
            <num>       <num>            <num>
1:       2.654379    6.139348         248.7889
2:       3.818580    5.875290         248.8874
3:       3.120059    5.677247         248.9613
4:       2.374970    5.495707         249.2077
Key things to notice here: c(out_names), .SD, and .SDcols.
In the case of dplyr, we could use the following
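A sketch using across() (the _scaled2 names match the output below):

as_tibble(dat) %>%
  mutate(
    across(
      all_of(in_names),
      ~ .x / sd(.x, na.rm = TRUE),
      .names = "{.col}_scaled2"
    )
  ) %>%
  select(ends_with("_scaled2")) %>%
  head(n = 4)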
# A tibble: 4 × 3
  wind.sp_scaled2 temp_scaled2 atm.press_scaled2
            <dbl>        <dbl>             <dbl>
1            2.65         6.14              249.
2            3.82         5.88              249.
3            3.12         5.68              249.
4            2.37         5.50              249.
A key thing here: this approach has no direct translation to data.table, which is why we used as_tibble().
While building the MET dataset, we dropped the State data.
We can use the original Stations dataset and merge it with the MET dataset.
But we cannot do it right away. We need to process the data somewhat first.
stations <- fread("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv")
stations[, USAF := as.integer(USAF)]
# Dealing with NAs and 999999
stations[, USAF   := fifelse(USAF == 999999, NA_integer_, USAF)]
stations[, CTRY   := fifelse(CTRY == "", NA_character_, CTRY)]
stations[, STATE  := fifelse(STATE == "", NA_character_, STATE)]
# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, list(USAF, CTRY, STATE)])
# Dropping NAs
stations <- stations[!is.na(USAF)]
head(stations, n = 4)
    USAF   CTRY  STATE
   <int> <char> <char>
1:  7018   <NA>   <NA>
2:  7026     AF   <NA>
3:  7070     AF   <NA>
4:  8260   <NA>   <NA>
Notice the function fifelse(). Now, let’s try to merge the data!
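The counting step is not shown; a sketch that reproduces the row count below is merging and counting:

nrow(merge(
  x     = dat,
  y     = stations,
  by.x  = "USAFID",
  by.y  = "USAF",
  all.x = TRUE,
  all.y = FALSE
))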
[1] 2385443
This is more rows! The original dataset, dat, has 2377343 rows. This means that the stations dataset has duplicated IDs. We can fix this:
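One way to fix it (a sketch: keep a single record per station ID):

# unique() from data.table can deduplicate by a subset of columns
stations <- unique(stations, by = "USAF")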
We can now use the function merge() to add the extra data
dat <- merge(
  # Data
  x     = dat,      
  y     = stations, 
  # List of variables to match
  by.x  = "USAFID",
  by.y  = "USAF", 
  # Which obs to keep?
  all.x = TRUE,      
  all.y = FALSE
  )
head(dat[, list(USAFID, WBAN, STATE)], n = 4)
Key: <USAFID>
   USAFID  WBAN  STATE
    <int> <int> <char>
1: 690150 93121     CA
2: 690150 93121     CA
3: 690150 93121     CA
4: 690150 93121     CA
What happens when you change the options all.x and all.y?
Often, we must either impute some data or generate variables by strata.
If we, for example, wanted to impute missing temperature with the daily state average, we could use by together with the data.table::fcoalesce() function:
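A sketch of the imputation (the column name temp_imp is an assumption):

# Replace missing temp with the daily state average, by reference
dat[,
  temp_imp := fcoalesce(temp, mean(temp, na.rm = TRUE)),
  by = .(STATE, day)
]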
In the case of dplyr, we can do the following using dplyr::group_by together with dplyr::coalesce():
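And a dplyr sketch (temp_imp2 is an assumed name):

dat_ldt %>%
  group_by(STATE, day) %>%
  mutate(temp_imp2 = coalesce(temp, mean(temp, na.rm = TRUE))) %>%
  ungroup()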
Let’s see what it looks like:
Using by also allows us to create summaries of our data.
For example, if we wanted to compute the average temperature, wind speed, and atmospheric pressure by state, we could do the following:
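A data.table sketch matching the output below:

head(
  dat[, .(
    temp_avg      = mean(temp, na.rm = TRUE),
    wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
    atm.press_avg = mean(atm.press, na.rm = TRUE)
  ), by = STATE],
  n = 4
)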
    STATE temp_avg wind.sp_avg atm.press_avg
   <char>    <num>       <num>         <num>
1:     AL 26.19799    1.566381      1016.148
2:     AR 26.20697    1.836963      1014.551
3:     AZ 28.80596    2.984547      1010.771
4:     CA 22.36199    2.614120      1012.640
When dealing with too many variables, we can use the .SD special symbol in data.table:
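A sketch using .SD, renaming the outputs to match the table below:

in_names <- c("wind.sp", "temp", "atm.press")
head(
  dat[,
    setNames(lapply(.SD, mean, na.rm = TRUE), paste0(in_names, "_avg")),
    .SDcols = in_names,
    keyby   = STATE
  ],
  n = 4
)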
Key: <STATE>
    STATE wind.sp_avg temp_avg atm.press_avg
   <char>       <num>    <num>         <num>
1:     AL    1.566381 26.19799      1016.148
2:     AR    1.836963 26.20697      1014.551
3:     AZ    2.984547 28.80596      1010.771
4:     CA    2.614120 22.36199      1012.640
Notice the keyby option here: “Group by STATE and order by STATE”.
Using dplyr verbs
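A sketch of the dplyr pipeline (the printed Call below shows dtplyr’s translation):

dat_ldt %>%
  group_by(STATE) %>%
  summarise(
    temp_avg      = mean(temp, na.rm = TRUE),
    wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
    atm.press_avg = mean(atm.press, na.rm = TRUE)
  ) %>%
  arrange(STATE) %>%
  head(n = 4)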
Source: local data table [4 x 4]
Call:   head(setorder(`_DT4`[, .(temp_avg = mean(temp, na.rm = TRUE), 
    wind.sp_avg = mean(wind.sp, na.rm = TRUE), atm.press_avg = mean(atm.press, 
        na.rm = TRUE)), keyby = .(STATE)], STATE, na.last = TRUE), 
    n = 4)
  STATE temp_avg wind.sp_avg atm.press_avg
  <chr>    <dbl>       <dbl>         <dbl>
1 AL        26.2        1.57         1016.
2 AR        26.2        1.84         1015.
3 AZ        28.8        2.98         1011.
4 CA        22.4        2.61         1013.
# Use as.data.table()/as.data.frame()/as_tibble() to access results
Notice the arrange() function.
shift() Fast lead/lag for vectors and lists.
fifelse() Fast if-else, similar to base R’s ifelse().
fcoalesce() Fast coalescing of missing values.
%between% A short form of (x >= lb) & (x <= ub), i.e., an inclusive range check.
%inrange% A short form of x %in% lb:ub, checking whether x falls within the range [lb, ub].
%chin% Fast match of character vectors, equivalent to x %in% X, where both x and X are character vectors.
nafill() Fill missing values using a constant, last observed value, or the next observed value.
H2O.ai’s benchmark (link): Designed by Matt Dowle, the lead developer of data.table.
RStudio’s benchmark (link): Designed as part of the benchmarks for the vroom package.