Functions and data.table

PHS 7045: Advanced Programming

George G. Vega Yon, Ph.D.

Today’s goals

  1. Discuss some technical aspects of functions in R.

  2. Learn how to manipulate data with data.table, in particular:

  • Selecting variables, filtering data, creating variables, and summarizing data.

Throughout the session, we will see examples using the MET dataset.

Functions (part II)

Functions: Scoping and environments

Search paths and scoping (source: Hadley Wickham’s Advanced R)
  • Where the object was created matters.

  • And the order also matters.

Question: Two R packages provide a function with the same name, degree(). Are these two calls equivalent?

library(igraph)
library(sna)
degree(...)

vs

library(sna)
library(igraph)
degree(...)

What should you do?
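
One safe option is to be explicit about the namespace, so the result does not depend on load order. A minimal sketch:

library(igraph)
library(sna) # sna::degree() now masks igraph::degree()

g <- igraph::make_ring(5)

# The :: operator removes the ambiguity:
igraph::degree(g)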

In this example, the object a lives where fun was created: the Global Environment (.GlobalEnv, see ?.GlobalEnv).

a <- 1
fun <- function() {
  return(a)
}

fun_sub <- function() {
  a <- pi # It doesn't change the result!
  return(fun())
}

fun_sub()
[1] 1
a <- 2
fun_sub() # It gets the new value
[1] 2

Question: What is the danger of this?

Now, things change if fun is moved to another environment.

fun_sub()
[1] 2
fun_sub <- function() {
  a <- pi
  environment(fun) <- environment()
  fun()
}
fun_sub()
[1] 3.141593

Functions: Arguments and copy-on-modify

  • R is very smart when it comes to saving space.

  • Objects passed to a function are only copied if modified, the “copy-on-modify” concept.

  • This is very powerful as it mixes passing objects by reference and by copy dynamically (see this wiki entry).

bigdata <- 123; data.table::address(bigdata)
[1] "0x555558e70a30"
(function(a) {
  data.table::address(a)
})(bigdata)
[1] "0x555558e70a30"
# If we change it...
(function(a) {
  a <- a + 0 # The input is now modified!
  data.table::address(a)
})(bigdata)
[1] "0x55555ad042b0"

Note: Functions that don’t have a name are called “anonymous functions.”

Data Table

Data wrangling in R

Overall, you will find the following approaches:

  • base R: Use only base R functions.

  • dplyr: Using “verbs”.

  • data.table: High-performing (ideal for large data)

  • dplyr + data.table = dtplyr: High-performing + dplyr verbs.

Other methods involve, for example, using external tools such as Spark via the sparklyr package.

We will be focusing on data.table because of its performance (see the Benchmarks section at the end).

Take a look at this neat cheat sheet by Erik Petrovski here.

Selecting variables: Load the packages

library(data.table)
library(dtplyr)
library(dplyr)
library(ggplot2)

The dtplyr R package translates dplyr (tidyverse) syntax to data.table, so that we can still use the dplyr verbs while at the same time leveraging the performance of data.table.

Loading the data

The data that we will be using is an already processed version of the MET dataset. We can download (and load) the data directly in our session using the following commands:

# Where are we getting the data from
met_url <- "https://github.com/USCbiostats/data-science-data/raw/master/02_met/met_all.gz"

# Downloading the data to a tempfile (so it is destroyed afterwards)
# you can replace this with, for example, your own data:
# tmp <- tempfile(fileext = ".gz")
tmp <- "met.gz"

# We should download this ONLY IF it was not downloaded already;
# otherwise, it is just a waste of time.
if (!file.exists(tmp)) {
  download.file(
    url      = met_url,
    destfile = tmp,
    # method   = "libcurl", timeout = 1000 (you may need this option)
  )
}

Now we can load the data using the fread() function. In R:

# Reading the data
dat <- fread(tmp)
head(dat)
   USAFID  WBAN  year month   day  hour   min   lat      lon  elev wind.dir
    <int> <int> <int> <int> <int> <int> <int> <num>    <num> <int>    <int>
1: 690150 93121  2019     8     1     0    56  34.3 -116.166   696      220
2: 690150 93121  2019     8     1     1    56  34.3 -116.166   696      230
3: 690150 93121  2019     8     1     2    56  34.3 -116.166   696      230
4: 690150 93121  2019     8     1     3    56  34.3 -116.166   696      210
5: 690150 93121  2019     8     1     4    56  34.3 -116.166   696      120
6: 690150 93121  2019     8     1     5    56  34.3 -116.166   696       NA
   wind.dir.qc wind.type.code wind.sp wind.sp.qc ceiling.ht ceiling.ht.qc
        <char>         <char>   <num>     <char>      <int>         <int>
1:           5              N     5.7          5      22000             5
2:           5              N     8.2          5      22000             5
3:           5              N     6.7          5      22000             5
4:           5              N     5.1          5      22000             5
5:           5              N     2.1          5      22000             5
6:           9              C     0.0          5      22000             5
   ceiling.ht.method sky.cond vis.dist vis.dist.qc vis.var vis.var.qc  temp
              <char>   <char>    <int>      <char>  <char>     <char> <num>
1:                 9        N    16093           5       N          5  37.2
2:                 9        N    16093           5       N          5  35.6
3:                 9        N    16093           5       N          5  34.4
4:                 9        N    16093           5       N          5  33.3
5:                 9        N    16093           5       N          5  32.8
6:                 9        N    16093           5       N          5  31.1
   temp.qc dew.point dew.point.qc atm.press atm.press.qc       rh
    <char>     <num>       <char>     <num>        <int>    <num>
1:       5      10.6            5    1009.9            5 19.88127
2:       5      10.6            5    1010.3            5 21.76098
3:       5       7.2            5    1010.6            5 18.48212
4:       5       5.0            5    1011.6            5 16.88862
5:       5       5.0            5    1012.7            5 17.38410
6:       5       5.6            5    1012.7            5 20.01540

In Python

import datatable as dt
dat = dt.fread("met.gz")
dat.head(5)

Before we continue, let’s learn a bit more about data.table and dtplyr.

data.table and dtplyr: Data Table’s Syntax

  • As you have seen in previous lectures, in data.table everything happens within the square brackets. Here is a common way to imagine DT:

[Figure: the general form DT[i, j, by] — take DT, subset rows using i, then calculate j, grouped by by]

  • Any time you see := in j, that is “assignment by reference.” Using = within j only works in some specific cases.
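
A small sketch showing that := modifies the table in place (no copy):

dt <- data.table(x = 1:3)
address(dt)

dt[, y := x * 2] # assignment by reference
address(dt)      # same address as before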

data.table and dtplyr: Data Table’s Syntax

Operations applied in j are evaluated within the data, meaning that names work as symbols, e.g.,

data(USArrests)
USArrests_dt <- data.table(USArrests)

# This returns an error
USArrests[, Murder]

# This works fine
USArrests_dt[, Murder]

Furthermore, we can do things like this:

USArrests_dt[, plot(Murder, UrbanPop)]
NULL

data.table and dtplyr: Lazy table

  • The dtplyr package provides a way to translate dplyr verbs to data.table syntax.

  • The key lies on the function lazy_dt from dtplyr (see ?dtplyr::lazy_dt).

  • This function creates a wrapper that “points” to a data.table object.

data.table and dtplyr: Lazy table (cont.)

# Creating a lazy table object
dat_ldt <- lazy_dt(dat, immutable = FALSE)

# We can use the address() function from data.table to check that
# both point to the same object
address(dat)
address(dat_ldt$parent)
[1] "0x55555b2bb710"
[1] "0x55555b2bb710"

# To ensure we are not modifying the original data, set immutable = TRUE
dat_ldt <- lazy_dt(dat, immutable = TRUE)

Question: What is the immutable = FALSE option used for?

Selecting columns

How can we select the columns USAFID, lat, and lon, using data.table:

dat[, list(USAFID, lat, lon)]
# dat[, .(USAFID, lat, lon)]       # Alternative 1
# dat[, c("USAFID", "lat", "lon")] # Alternative 2
         USAFID    lat      lon
          <int>  <num>    <num>
      1: 690150 34.300 -116.166
      2: 690150 34.300 -116.166
      3: 690150 34.300 -116.166
      4: 690150 34.300 -116.166
      5: 690150 34.300 -116.166
     ---                       
2377339: 726813 43.650 -116.633
2377340: 726813 43.650 -116.633
2377341: 726813 43.650 -116.633
2377342: 726813 43.642 -116.636
2377343: 726813 43.642 -116.636

What happens if instead of list() you used c()?

Selecting columns (cont. 1)

Using the dplyr::select verb:

dat_ldt |>  select(USAFID, lat, lon)
Source: local data table [2,377,343 x 3]
Call:   `_DT2`[, .(USAFID, lat, lon)]

  USAFID   lat   lon
   <int> <dbl> <dbl>
1 690150  34.3 -116.
2 690150  34.3 -116.
3 690150  34.3 -116.
4 690150  34.3 -116.
5 690150  34.3 -116.
6 690150  34.3 -116.
# ℹ 2,377,337 more rows

# Use as.data.table()/as.data.frame()/as_tibble() to access results

Selecting columns (cont. 2)

In the case of pydatatable

dat[:,["USAFID", "lat", "lon"]]
        | USAFID      lat       lon
        |  int32  float64   float64
------- + ------  -------  --------
      0 | 690150   34.3    -116.166
      1 | 690150   34.3    -116.166
      2 | 690150   34.3    -116.166
      3 | 690150   34.3    -116.166
      4 | 690150   34.3    -116.166
      5 | 690150   34.3    -116.166
      6 | 690150   34.3    -116.166
      7 | 690150   34.3    -116.166
      8 | 690150   34.3    -116.166
      9 | 690150   34.3    -116.166
     10 | 690150   34.296  -116.162
     11 | 690150   34.296  -116.162
     12 | 690150   34.3    -116.166
     13 | 690150   34.3    -116.166
     14 | 690150   34.3    -116.166
    ... |    ...  ...       ...    
2377338 | 726813   43.65   -116.633
2377339 | 726813   43.65   -116.633
2377340 | 726813   43.65   -116.633
2377341 | 726813   43.642  -116.636
2377342 | 726813   43.642  -116.636
[2377343 rows x 3 columns]

What happens if instead of ["USAFID", "lat", "lon"] you used {"USAFID", "lat", "lon"} (a list vs a set)?

For the rest of the session, we will use these variables: USAFID, WBAN, year, month, day, hour, min, lat, lon, elev, wind.sp, temp, and atm.press.

# Data.table
dat <- dat[,
  .(USAFID, WBAN, year, month, day, hour, min, lat, lon, elev,
    wind.sp, temp, atm.press)]

# Need to redo the lazy table
dat_ldt <- lazy_dt(dat)

Data filtering: Logical conditions

  • Based on logical operations, e.g., condition 1 [and|or condition2 [and|or ...]]

  • Need to be aware of the ordering and grouping of and and or operators (see ?Comparison).

    x      y      Negate (!x)  And (x & y)  Or (x | y)  Xor (xor(x, y))
    true   true   false        true         true        false
    false  true   true         false        true        true
    true   false  false        false        true        true
    false  false  true         false        false       false

Questions 1: How many ways can you write an XOR operator?

Write a function that takes two arguments (x, y) and applies the XOR operator element-wise. Here is a template:

myxor <- function(x, y) {
  res <- logical(length(x))
  for (i in seq_along(x)) {
    res[i] <- NA # replace with an expression using x[i] and y[i]
  }
  return(res)
}

Or, better yet, write a vectorized version (no loop needed).

Hint 1: Remember that negating (x & y) equals (!x | !y).

Hint 2: Logical operators are distributive, meaning a * (b + c) = (a * b) + (a * c), where * and + stand for & or |.

In R

myxor1 <- function(x,y) {(x & !y) | (!x & y)}
myxor2 <- function(x,y) {!((!x | y) & (x | !y))}
myxor3 <- function(x,y) {(x | y) & (!x | !y)}
myxor4 <- function(x,y) {!((!x & !y) | (x & y))}
# Inputs: all four TRUE/FALSE combinations
# (test was not shown in the original; defined here so the example runs)
test <- cbind(c(TRUE, TRUE, FALSE, FALSE), c(TRUE, FALSE, TRUE, FALSE))
cbind(
  ifelse(xor(test[,1], test[,2]), "true", "false"),
  ifelse(myxor1(test[,1], test[,2]), "true", "false"),
  ifelse(myxor2(test[,1], test[,2]), "true", "false"),
  ifelse(myxor3(test[,1], test[,2]), "true", "false"),
  ifelse(myxor4(test[,1], test[,2]), "true", "false")
)
     [,1]    [,2]    [,3]    [,4]    [,5]   
[1,] "false" "false" "false" "false" "false"
[2,] "true"  "true"  "true"  "true"  "true" 
[3,] "true"  "true"  "true"  "true"  "true" 
[4,] "false" "false" "false" "false" "false"

Or in Python

# Loading the libraries
import numpy as np
import pandas as pa

# Defining the data
x = [True, True, False, False]
y = [False, True, True, False]
ans = {
    'x'   : x,
    'y'   : y,
    'and' : np.logical_and(x, y),
    'or'  : np.logical_or(x, y),
    'xor' : np.logical_xor(x, y)
    }
pa.DataFrame(ans)
       x      y    and     or    xor
0   True  False  False   True   True
1   True   True   True   True  False
2  False   True  False   True   True
3  False  False  False  False  False

Or in Python (bis)

def myxor(x,y):
    return np.logical_or(
        np.logical_and(x, np.logical_not(y)),
        np.logical_and(np.logical_not(x), y)
    )

ans['myxor'] = myxor(x,y)
pa.DataFrame(ans)
       x      y    and     or    xor  myxor
0   True  False  False   True   True   True
1   True   True   True   True  False  False
2  False   True  False   True   True   True
3  False  False  False  False  False  False

We will now see applications using the MET dataset.

Filtering (subsetting) the data

Need to select records according to some criteria. For example:

  • First day of the month, and
  • Above latitude 40, and
  • Elevation outside the range between 500 and 1,000.

The logical expressions would be

  • (day == 1)
  • (lat > 40)
  • ((elev < 500) | (elev > 1000))

respectively.

In R with data.table:

dat[(day == 1) & (lat > 40) & ((elev < 500) | (elev > 1000))] %>%
  nrow()
[1] 27623

In R with dplyr::filter():

dat_ldt %>%
  filter(day == 1, lat > 40, (elev < 500) | (elev > 1000)) %>%
  collect() %>% # Notice this line!
  nrow() 
[1] 27623

In Python

import datatable as dt
dat = dt.fread("met.gz")
dat[(dt.f.day == 1) & (dt.f.lat > 40) & ((dt.f.elev < 500) | (dt.f.elev > 1000)), :].nrows
# dat[dt.f.day == 1,:][dt.f.lat > 40,:][(dt.f.elev < 500) | (dt.f.elev > 1000),:].nrows
27623

In the case of pydatatable, we use dt.f. to refer to a column; dt. refers to datatable’s namespace.

f is a special symbol that gives access to the columns of a datatable Frame.

Questions 2

  1. How many records have a temperature between 18 and 25?

  2. Some records have missing values. How many records have temp as NA?

  3. Following the previous question, plot a sample of 1,000 (lat, lon) points, distinguishing stations with temp missing from those with data.

Solutions
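
A sketch of possible solutions with data.table (thresholds treated as inclusive; the seed is arbitrary):

# 1. Records with temperature between 18 and 25
dat[temp %between% c(18, 25), .N]

# 2. Records with missing temperature
dat[is.na(temp), .N]

# 3. A sample of 1,000 records, colored by whether temp is missing
set.seed(1231)
dat[sample.int(.N, 1000),
    plot(lon, lat, col = ifelse(is.na(temp), "red", "black"), pch = 20)]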

Creating variables: Data types

  • logical: Boolean true/false type, e.g., dead/alive, sick/healthy, good/bad, yes/no, etc.

  • string: a sequence of characters (letters/symbols), e.g., names, text, etc.

  • integer: Numeric variable with no decimal (discrete), e.g., age, days, counts, etc.

  • double: Numeric variable with decimals (continuous), e.g., distance, expression level, time.

In C (and other languages), strings, integers, and doubles may be specified with a size, e.g., in Python (NumPy) integers can be 8, 16, 32, or 64 bits. This is relevant when managing large datasets, where saving space can be fundamental (more info).

Creating variables: Special data types

Most programming languages have special types which are built using basic types. Here are a few examples:

  • time: Could be a date, a time, or a combination of both (date-time). Usually, it has a reference point defined as date 0. In R, the Date class has as reference 1970-01-01; in other words, dates are stored as “days since January 1st, 1970”.

  • categorical: Commonly used to represent strata/levels of variables, e.g., a variable “country” could be represented as a factor where the data is stored as numbers but has a label.

  • ordinal: Similar to factor, but it has ordering, e.g. “satisfaction level: 5 very satisfied, …, 1 very unsatisfied”.

Other noteworthy data types are ways to represent missing values (usually written as na or NA) and special numeric values, e.g., ±Inf and Undefined (NaN).

When storing/sharing datasets, it is a good practice to do it along a dictionary describing each column data type/format.

Questions 3: What’s the best way to represent the following

  • 0, 1, 1, 0, 0, 1

  • Diabetes type 1, Diabetes type 2, Diabetes type 1, Diabetes type 2

  • on, off, off, on, on, on

  • 5, 10, 1, 15, 0, 0, 1

  • 1.0, 2.0, 10.0, 6.0

  • high, low, medium, medium, high

  • -1, 1, -1, -1, 1

  • 0.2, 1.5, 0.8, π

  • π, exp(1), π, π

Variable creation

If we wanted to create two variables, elev^2 and a version of wind.sp scaled by its standard deviation, we could do the following:

Using data.table

dat[, elev2         := elev^2]
dat[, windsp_scaled := wind.sp/sd(wind.sp, na.rm = TRUE)]
# Alternatively:
# dat[, c("elev2", "windsp_scaled") := .(elev^2, wind.sp/sd(wind.sp,na.rm=TRUE)) ]

Variable creation (cont. 1)

With the verb dplyr::mutate():
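
A sketch of the equivalent call with dtplyr (not the original code; output collected with as_tibble()):

dat_ldt |>
  mutate(
    elev2         = elev^2,
    windsp_scaled = wind.sp / sd(wind.sp, na.rm = TRUE)
  ) |>
  as_tibble()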

# A tibble: 2,377,343 × 15
   USAFID  WBAN  year month   day  hour   min   lat   lon  elev wind.sp  temp
    <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int>   <dbl> <dbl>
 1 690150 93121  2019     8     1     0    56  34.3 -116.   696     5.7  37.2
 2 690150 93121  2019     8     1     1    56  34.3 -116.   696     8.2  35.6
 3 690150 93121  2019     8     1     2    56  34.3 -116.   696     6.7  34.4
 4 690150 93121  2019     8     1     3    56  34.3 -116.   696     5.1  33.3
 5 690150 93121  2019     8     1     4    56  34.3 -116.   696     2.1  32.8
 6 690150 93121  2019     8     1     5    56  34.3 -116.   696     0    31.1
 7 690150 93121  2019     8     1     6    56  34.3 -116.   696     1.5  29.4
 8 690150 93121  2019     8     1     7    56  34.3 -116.   696     2.1  28.9
 9 690150 93121  2019     8     1     8    56  34.3 -116.   696     2.6  27.2
10 690150 93121  2019     8     1     9    56  34.3 -116.   696     1.5  26.7
# ℹ 2,377,333 more rows
# ℹ 3 more variables: atm.press <dbl>, elev2 <dbl>, windsp_scaled <dbl>

Notice that since dat_ldt is a pointer, the variables are now in dat.

Variable creation (cont. 2)

Imagine that we needed to generate all those calculations (scale by sd) on many more variables. We could then use the .SD symbol:
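
A sketch of how this can be done (the original code was not shown; in_names is a name chosen here for illustration, while out_names matches the note below):

# Variables to scale, and names for the resulting columns
in_names  <- c("wind.sp", "temp", "atm.press")
out_names <- paste0(in_names, "_scaled")

dat[,
  c(out_names) := lapply(.SD, function(x) x / sd(x, na.rm = TRUE)),
  .SDcols = in_names]

# First rows of the new columns
head(dat[, .SD, .SDcols = out_names], n = 4)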

   wind.sp_scaled temp_scaled atm.press_scaled
            <num>       <num>            <num>
1:       2.654379    6.139348         248.7889
2:       3.818580    5.875290         248.8874
3:       3.120059    5.677247         248.9613
4:       2.374970    5.495707         249.2077

Key things to notice here: c(out_names), .SD, and .SDcols.

Variable creation (cont. 3)

In the case of dplyr, we could use the following
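
A sketch using across() (note the as_tibble() step, mentioned below):

dat_ldt |>
  as_tibble() |>
  mutate(
    across(
      all_of(c("wind.sp", "temp", "atm.press")),
      function(x) x / sd(x, na.rm = TRUE),
      .names = "{.col}_scaled2"
    )
  ) |>
  select(ends_with("_scaled2")) |>
  head(n = 4)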

# A tibble: 4 × 3
  wind.sp_scaled2 temp_scaled2 atm.press_scaled2
            <dbl>        <dbl>             <dbl>
1            2.65         6.14              249.
2            3.82         5.88              249.
3            3.12         5.68              249.
4            2.37         5.50              249.

A key thing here: This approach has no direct translation to data.table, which is why we used as_tibble().

Merging data

  • While building the MET dataset, we dropped the State data.

  • We can use the original Stations dataset and merge it with the MET dataset.

  • But we cannot do it right away; we need to do some processing first.

Merging data (cont. 1)

stations <- fread("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv")
stations[, USAF := as.integer(USAF)]

# Dealing with NAs and 999999
stations[, USAF   := fifelse(USAF == 999999, NA_integer_, USAF)]
stations[, CTRY   := fifelse(CTRY == "", NA_character_, CTRY)]
stations[, STATE  := fifelse(STATE == "", NA_character_, STATE)]

# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, list(USAF, CTRY, STATE)])

# Dropping NAs
stations <- stations[!is.na(USAF)]

head(stations, n = 4)
    USAF   CTRY  STATE
   <int> <char> <char>
1:  7018   <NA>   <NA>
2:  7026     AF   <NA>
3:  7070     AF   <NA>
4:  8260   <NA>   <NA>

Notice the function fifelse(). Now, let’s try to merge the data!

Merging data (cont. 2)
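
A sketch of the check behind the count below (attempting the merge and counting rows; merge() itself is explained in the next slide):

merge(
  x = dat, y = stations,
  by.x = "USAFID", by.y = "USAF",
  all.x = TRUE, all.y = FALSE) |>
  nrow()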

[1] 2385443

This is more rows than in the original dataset: dat has 2,377,343. This means that the stations dataset has duplicated IDs. We can fix this:
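
One way to fix it (a sketch, keeping one record per station ID):

stations <- unique(stations, by = "USAF")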

Merging data (cont. 3)

We can now use the function merge() to add the extra data

dat <- merge(
  # Data
  x     = dat,      
  y     = stations, 
  # List of variables to match
  by.x  = "USAFID",
  by.y  = "USAF", 
  # Which obs to keep?
  all.x = TRUE,      
  all.y = FALSE
  )

head(dat[, list(USAFID, WBAN, STATE)], n = 4)
Key: <USAFID>
   USAFID  WBAN  STATE
    <int> <int> <char>
1: 690150 93121     CA
2: 690150 93121     CA
3: 690150 93121     CA
4: 690150 93121     CA

What happens when you change the options all.x and all.y?

Aggregating data: Adding grouped variables

  • Often, we must either impute some data or generate variables by strata.

  • If we, for example, wanted to impute missing temperature with the daily state average, we could use by together with the data.table::fcoalesce() function (first sketch below).

  • In the case of dplyr, we can do the same using dplyr::group_by together with dplyr::coalesce() (second sketch below).
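
With data.table, a sketch of the imputation (temp_imputed is a name chosen here for illustration):

# Impute missing temperature with the daily state average
dat[, temp_imputed := fcoalesce(temp, mean(temp, na.rm = TRUE)),
    by = .(STATE, day)]

And a sketch of the dplyr version:

# (assuming dat_ldt was re-created after the merge, e.g., lazy_dt(dat))
dat_ldt |>
  group_by(STATE, day) |>
  mutate(temp_imputed2 = coalesce(temp, mean(temp, na.rm = TRUE))) |>
  collect()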

Aggregating data: Adding grouped variables (cont.)

Let’s see how it looks:
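
For instance, peeking at records where temp was missing (using the temp_imputed column created above):

dat[is.na(temp), .(STATE, day, temp, temp_imputed)] |> head(n = 4)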

Aggregating data: Summary table

  • Using by also allows us to create summaries of our data.

  • For example, if we wanted to compute the average temperature, wind speed, and atmospheric pressure by state, we could do the following:
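
A sketch that produces a summary like the one below:

dat[, .(
  temp_avg      = mean(temp, na.rm = TRUE),
  wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
  atm.press_avg = mean(atm.press, na.rm = TRUE)
), keyby = STATE] |>
  head(n = 4)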

        STATE temp_avg wind.sp_avg atm.press_avg
       <char>    <num>       <num>         <num>
    1:     AL 26.19799    1.566381      1016.148
    2:     AR 26.20697    1.836963      1014.551
    3:     AZ 28.80596    2.984547      1010.771
    4:     CA 22.36199    2.614120      1012.640

Aggregating data: Summary table (cont. 1)

When dealing with too many variables, we can use the .SD special symbol in data.table:
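
A sketch (the original code was not shown; in_names/out_names are names chosen here for illustration, and setNames() attaches the _avg suffixes):

in_names  <- c("wind.sp", "temp", "atm.press")
out_names <- paste0(in_names, "_avg")

dat[,
  setNames(lapply(.SD, mean, na.rm = TRUE), out_names),
  .SDcols = in_names,
  keyby   = STATE] |>
  head(n = 4)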

Key: <STATE>
    STATE wind.sp_avg temp_avg atm.press_avg
   <char>       <num>    <num>         <num>
1:     AL    1.566381 26.19799      1016.148
2:     AR    1.836963 26.20697      1014.551
3:     AZ    2.984547 28.80596      1010.771
4:     CA    2.614120 22.36199      1012.640

Notice the keyby option here: “Group by STATE and order by STATE”.

Aggregating data: Summary table (cont. 2)

  • Using dplyr verbs
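
A sketch of the dplyr code (the Call: line in the output below shows its data.table translation):

dat_ldt |>
  group_by(STATE) |>
  summarise(
    temp_avg      = mean(temp, na.rm = TRUE),
    wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
    atm.press_avg = mean(atm.press, na.rm = TRUE)
  ) |>
  arrange(STATE) |>
  head(n = 4)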

    Source: local data table [4 x 4]
    Call:   head(setorder(`_DT4`[, .(temp_avg = mean(temp, na.rm = TRUE), 
        wind.sp_avg = mean(wind.sp, na.rm = TRUE), atm.press_avg = mean(atm.press, 
            na.rm = TRUE)), keyby = .(STATE)], STATE, na.last = TRUE), 
        n = 4)
    
      STATE temp_avg wind.sp_avg atm.press_avg
      <chr>    <dbl>       <dbl>         <dbl>
    1 AL        26.2        1.57         1016.
    2 AR        26.2        1.84         1015.
    3 AZ        28.8        2.98         1011.
    4 CA        22.4        2.61         1013.
    
    # Use as.data.table()/as.data.frame()/as_tibble() to access results

    Notice the arrange() function.

Other data.table goodies

  • shift() Fast lead/lag for vectors and lists.

  • fifelse() Fast if-else, similar to base R’s ifelse().

  • fcoalesce() Fast coalescing of missing values.

  • %between% A short form of (x >= lb) & (x <= ub)

  • %inrange% Like %between%, but lb and ub can be vectors of intervals; checks whether x falls within any of them.

  • %chin% Fast match of character vectors, equivalent to x %in% X, where both x and X are character vectors.

  • nafill() Fill missing values using a constant, last observed value, or the next observed value.
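
A few quick illustrations (a sketch; expected results in the comments):

x <- c(1, 5, 10, NA)

x %between% c(2, 9)             # FALSE TRUE FALSE NA
fifelse(is.na(x), -1, x)        # 1 5 10 -1
fcoalesce(x, 0)                 # 1 5 10 0
nafill(x, type = "locf")        # 1 5 10 10
shift(x, n = 1)                 # NA 1 5 10
c("a", "b") %chin% c("b", "c")  # FALSE TRUE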

Benchmarks