a <- 1
fun <- function() {
  return(a)
}

fun_sub <- function() {
  a <- pi # It doesn't change the result!
  return(fun())
}

fun_sub()
[1] 1

R uses lexical scoping: fun() looks up a in the environment where fun was defined (here, the global environment), not in the environment from which it is called.
PHS 7045: Advanced Programming
Discuss some technical aspects of functions in R.
Learn how to manipulate data with data.table.
Throughout the session, we will see examples using the MET dataset.
R is very smart when it comes to saving space. Objects passed to a function are only copied if they are modified; this is the "copy-on-modify" concept. This is very powerful, as it dynamically mixes passing objects by reference and by copy (see this wiki entry).
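For instance, if we pass a large object to a function and do not modify it, no copy is made and the memory address is unchanged. A minimal sketch (bigdata is a hypothetical large object created here for illustration):

# A stand-in large object
bigdata <- runif(1e6)
data.table::address(bigdata) # address of the original

# If we don't change it, the function sees the very same object
(function(a) {
  data.table::address(a)
})(bigdata)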
# If we change it...
(function(a) {
  a <- a + 0 # The input is now modified!
  data.table::address(a)
})(bigdata)
[1] "0x55555ad042b0"
Note: Functions that don’t have a name are called “anonymous functions.”
Overall, you will find the following approaches:

base R: Use only base R functions.
dplyr: Using "verbs".
data.table: High-performing (ideal for large data).
dplyr + data.table = dtplyr: High-performing + dplyr verbs.

Other methods involve, for example, using external tools such as Spark via sparklyr.

We will be focusing on data.table because of its performance with large data (see the benchmarks at the end of the session). Take a look at this neat cheat sheet by Erik Petrovski here.
The dtplyr R package translates dplyr (tidyverse) syntax to data.table, so that we can still use the dplyr verbs while at the same time leveraging the performance of data.table.
The data that we will be using is an already processed version of the MET dataset. We can download (and load) the data directly in our session using the following commands:
# Where are we getting the data from
met_url <- "https://github.com/USCbiostats/data-science-data/raw/master/02_met/met_all.gz"

# Downloading the data to a tempfile (so it is destroyed afterwards);
# you can replace this with, for example, your own data:
# tmp <- tempfile(fileext = ".gz")
tmp <- "met.gz"

# We should only download the data if it was not downloaded already;
# otherwise it is just a waste of time.
if (!file.exists(tmp)) {
  download.file(
    url = met_url,
    destfile = tmp
    # method = "libcurl", timeout = 1000 (you may need these options)
  )
}
Now we can load the data using the fread() function. In R:
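The loading step might look as follows (a sketch; reading a gzipped file with fread() requires the R.utils package to be installed):

# Loading the data with data.table's fast reader
library(data.table)
dat <- fread(tmp)
head(dat)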
USAFID WBAN year month day hour min lat lon elev wind.dir
<int> <int> <int> <int> <int> <int> <int> <num> <num> <int> <int>
1: 690150 93121 2019 8 1 0 56 34.3 -116.166 696 220
2: 690150 93121 2019 8 1 1 56 34.3 -116.166 696 230
3: 690150 93121 2019 8 1 2 56 34.3 -116.166 696 230
4: 690150 93121 2019 8 1 3 56 34.3 -116.166 696 210
5: 690150 93121 2019 8 1 4 56 34.3 -116.166 696 120
6: 690150 93121 2019 8 1 5 56 34.3 -116.166 696 NA
wind.dir.qc wind.type.code wind.sp wind.sp.qc ceiling.ht ceiling.ht.qc
<char> <char> <num> <char> <int> <int>
1: 5 N 5.7 5 22000 5
2: 5 N 8.2 5 22000 5
3: 5 N 6.7 5 22000 5
4: 5 N 5.1 5 22000 5
5: 5 N 2.1 5 22000 5
6: 9 C 0.0 5 22000 5
ceiling.ht.method sky.cond vis.dist vis.dist.qc vis.var vis.var.qc temp
<char> <char> <int> <char> <char> <char> <num>
1: 9 N 16093 5 N 5 37.2
2: 9 N 16093 5 N 5 35.6
3: 9 N 16093 5 N 5 34.4
4: 9 N 16093 5 N 5 33.3
5: 9 N 16093 5 N 5 32.8
6: 9 N 16093 5 N 5 31.1
temp.qc dew.point dew.point.qc atm.press atm.press.qc rh
<char> <num> <char> <num> <int> <num>
1: 5 10.6 5 1009.9 5 19.88127
2: 5 10.6 5 1010.3 5 21.76098
3: 5 7.2 5 1010.6 5 18.48212
4: 5 5.0 5 1011.6 5 16.88862
5: 5 5.0 5 1012.7 5 17.38410
6: 5 5.6 5 1012.7 5 20.01540
In Python, the analogous step uses pydatatable's fread() (code not shown here).
Before we continue, let's learn a bit more about data.table and dtplyr.

data.table and dtplyr: Data Table's Syntax

In data.table, all happens within the square brackets. Here is a common way to imagine a data.table DT:

DT[i, j, by]

That is: take DT, subset or reorder rows using i, then compute j, grouped by by.

Operations applied in j are evaluated within the data, meaning that names work as symbols, e.g.,
Furthermore, we can do things like this:
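One illustrative possibility (a sketch; any R expression can go in j, even a plot):

# j can run arbitrary code evaluated within the data, e.g., a scatter plot
dat[, plot(wind.sp, temp)]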
data.table and dtplyr: Lazy table

The dtplyr package provides a way to translate dplyr verbs to data.table syntax. The key lies in the function lazy_dt from dtplyr (see ?dtplyr::lazy_dt). This function creates a wrapper that "points" to a data.table object.
data.table and dtplyr: Lazy table (cont.)

# Creating a lazy table object
dat_ldt <- lazy_dt(dat, immutable = FALSE)

# We can use the address() function from data.table:
# both point to the same object in memory
address(dat)
address(dat_ldt$parent)
[1] "0x55555b2bb710"
[1] "0x55555b2bb710"

# To ensure we are not modifying the original object
dat_ldt <- lazy_dt(dat, immutable = TRUE)
Question: What is the immutable = FALSE option used for?
How can we select the columns USAFID, lat, and lon using data.table:
dat[, list(USAFID, lat, lon)]
# dat[, .(USAFID, lat, lon)] # Alternative 1
# dat[, c("USAFID", "lat", "lon")] # Alternative 2
USAFID lat lon
<int> <num> <num>
1: 690150 34.300 -116.166
2: 690150 34.300 -116.166
3: 690150 34.300 -116.166
4: 690150 34.300 -116.166
5: 690150 34.300 -116.166
---
2377339: 726813 43.650 -116.633
2377340: 726813 43.650 -116.633
2377341: 726813 43.650 -116.633
2377342: 726813 43.642 -116.636
2377343: 726813 43.642 -116.636
What happens if instead of list() you used c()?
Using the dplyr::select verb:
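A sketch of the call (dat_ldt is the lazy table created earlier):

# Selecting columns with the dplyr verb; dtplyr translates it to data.table
dat_ldt %>% select(USAFID, lat, lon)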
Source: local data table [2,377,343 x 3]
Call: `_DT2`[, .(USAFID, lat, lon)]
USAFID lat lon
<int> <dbl> <dbl>
1 690150 34.3 -116.
2 690150 34.3 -116.
3 690150 34.3 -116.
4 690150 34.3 -116.
5 690150 34.3 -116.
6 690150 34.3 -116.
# ℹ 2,377,337 more rows
# Use as.data.table()/as.data.frame()/as_tibble() to access results
In the case of pydatatable:
| USAFID lat lon
| int32 float64 float64
------- + ------ ------- --------
0 | 690150 34.3 -116.166
1 | 690150 34.3 -116.166
2 | 690150 34.3 -116.166
3 | 690150 34.3 -116.166
4 | 690150 34.3 -116.166
5 | 690150 34.3 -116.166
6 | 690150 34.3 -116.166
7 | 690150 34.3 -116.166
8 | 690150 34.3 -116.166
9 | 690150 34.3 -116.166
10 | 690150 34.296 -116.162
11 | 690150 34.296 -116.162
12 | 690150 34.3 -116.166
13 | 690150 34.3 -116.166
14 | 690150 34.3 -116.166
... | ... ... ...
2377338 | 726813 43.65 -116.633
2377339 | 726813 43.65 -116.633
2377340 | 726813 43.65 -116.633
2377341 | 726813 43.642 -116.636
2377342 | 726813 43.642 -116.636
[2377343 rows x 3 columns]
What happens if instead of ["USAFID", "lat", "lon"] you used {"USAFID", "lat", "lon"} (a list vs a set)?
For the rest of the session, we will use these variables: USAFID, WBAN, year, month, day, hour, min, lat, lon, elev, wind.sp, temp, and atm.press.
Based on logical operations, e.g., condition1 [and|or condition2 [and|or ...]]. We need to be aware of the ordering and grouping of the and and or operators (see ?Comparison).
| x | y | Negate (!x) | And (x & y) | Or (x \| y) | Xor (xor(x, y)) |
|---|---|---|---|---|---|
| true | true | false | true | true | false |
| false | true | true | false | true | true |
| true | false | false | false | true | true |
| false | false | true | false | false | false |
Write a function that takes two arguments (x, y) and applies the XOR operator element-wise. Here you have a template:

myxor <- function(x, y) {
  res <- logical(length(x))
  for (i in seq_along(x)) {
    res[i] <- # do something with x[i] and y[i]
  }
  return(res)
}
Or if vectorized (which would be better)
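A matching template for the vectorized version (the body is again left as the exercise):

myxor <- function(x, y) {
  # do something with x and y directly; no loop needed
}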
Hint 1: Remember that negating (x & y) equals (!x | !y).
Hint 2: Logical operators are distributive, meaning a * (b + c) = (a * b) + (a * c), where * and + stand for & and | (in either role).
In R:
myxor1 <- function(x, y) {(x & !y) | (!x & y)}
myxor2 <- function(x, y) {!((!x | y) & (x | !y))}
myxor3 <- function(x, y) {(x | y) & (!x | !y)}
myxor4 <- function(x, y) {!((!x & !y) | (x & y))}

# A test matrix with all four TRUE/FALSE combinations
# (one ordering consistent with the printed output below):
test <- cbind(x = c(TRUE, TRUE, FALSE, FALSE), y = c(TRUE, FALSE, TRUE, FALSE))

cbind(
  ifelse(xor(test[, 1], test[, 2]), "true", "false"),
  ifelse(myxor1(test[, 1], test[, 2]), "true", "false"),
  ifelse(myxor2(test[, 1], test[, 2]), "true", "false"),
  ifelse(myxor3(test[, 1], test[, 2]), "true", "false"),
  ifelse(myxor4(test[, 1], test[, 2]), "true", "false")
)
[,1] [,2] [,3] [,4] [,5]
[1,] "false" "false" "false" "false" "false"
[2,] "true" "true" "true" "true" "true"
[3,] "true" "true" "true" "true" "true"
[4,] "false" "false" "false" "false" "false"
Or in Python:
# Loading the libraries
import numpy as np
import pandas as pa
# Defining the data
x = [True, True, False, False]
y = [False, True, True, False]
ans = {
'x' : x,
'y' : y,
'and' : np.logical_and(x, y),
'or' : np.logical_or(x, y),
'xor' : np.logical_xor(x, y)
}
pa.DataFrame(ans)
x y and or xor
0 True False False True True
1 True True True True False
2 False True False True True
3 False False False False False
Or in Python (bis):
def myxor(x, y):
    return np.logical_or(
        np.logical_and(x, np.logical_not(y)),
        np.logical_and(np.logical_not(x), y)
    )
ans['myxor'] = myxor(x,y)
pa.DataFrame(ans)
x y and or xor myxor
0 True False False True True True
1 True True True True False False
2 False True False True True True
3 False False False False False False
We will now see applications using the met dataset.
We often need to select records according to some criteria, for example: records from the first day of the month, stations above latitude 40, and elevations below 500 or above 1,000. The corresponding logical expressions would be:

(day == 1)
(lat > 40)
((elev < 500) | (elev > 1000))

respectively.
In R with data.table:
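A sketch combining the three conditions and counting the matching rows with .N:

# All three conditions at once; .N returns the number of selected rows
dat[(day == 1) & (lat > 40) & ((elev < 500) | (elev > 1000)), .N]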
In R with dplyr::filter():
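The dplyr counterpart might look as follows (a sketch; n() is translated to .N by dtplyr):

dat_ldt %>%
  filter(day == 1, lat > 40, (elev < 500) | (elev > 1000)) %>%
  summarise(n = n())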
In Python
dat[(dt.f.day == 1) & (dt.f.lat > 40) & ((dt.f.elev < 500) | (dt.f.elev > 1000)), :].nrows
# dat[dt.f.day == 1,:][dt.f.lat > 40,:][(dt.f.elev < 500) | (dt.f.elev > 1000),:].nrows
27623
In the case of pydatatable, we use dt.f. to refer to a column. dt. is what we use to refer to datatable's namespace. The f. is a symbol that allows accessing column names in a datatable's Frame.
How many records have a temperature between 18 and 25?
Some records have missing values. How many records have temp as NA?
Following the previous question, plot a sample of 1,000 (lat, lon) points, distinguishing the stations with temp as NA from those with data.
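A possible starting point for the first two questions (a sketch; %between% is inclusive on both ends):

# Temperatures between 18 and 25 (inclusive)
dat[temp %between% c(18, 25), .N]
# Records with missing temperature
dat[is.na(temp), .N]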
logical: Boolean true/false type, e.g., dead/alive, sick/healthy, good/bad, yes/no, etc.
strings: strings of characters (letters/symbols), e.g., names, text, etc.
integer: numeric variable with no decimals (discrete), e.g., age, days, counts, etc.
double: numeric variable with decimals (continuous), e.g., distance, expression level, time.
In C (and other languages), strings, integers, and doubles may be specified with a size, e.g., in Python integers can be of 8, 16, and 32 bits. This is relevant when managing large datasets, where saving space can be fundamental (more info).
Most programming languages have special types which are built using basic types. Here are a few examples:

time: could be a date, a date + time, or a combination of both. Usually, it has a reference number defined as date 0. In R, the Date class has 1970-01-01 as its reference; in other words, dates are stored as "days since January 1st, 1970".
categorical: commonly used to represent strata/levels of variables, e.g., a variable "country" could be represented as a factor, where the data is stored as numbers but has a label.
ordinal: similar to factor, but with an ordering, e.g., "satisfaction level: 5 very satisfied, …, 1 very unsatisfied".

Other noteworthy data types are ways to represent missing values (usually described as na or NA) and particular numeric values, e.g., ±Inf and Undefined (NaN).
When storing/sharing datasets, it is good practice to accompany them with a dictionary describing each column's data type/format.
Examples. What data type would you use for each of the following?

0, 1, 1, 0, 0, 1
Diabetes type 1, Diabetes type 2, Diabetes type 1, Diabetes type 2
on, off, off, on, on, on
5, 10, 1, 15, 0, 0, 1
1.0, 2.0, 10.0, 6.0
high, low, medium, medium, high
-1, 1, -1, -1, 1
.2, 1.5, .8, \(\pi\)
\(\pi\), \(\exp{1}\), \(\pi\), \(\pi\)
If we wanted to create two variables, elev^2 and the version of wind.sp scaled by its standard deviation, we could do the following:
Using data.table:
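A sketch of the := (assignment by reference) calls (the column names elev2 and windsp_scaled match the output below):

# Create the new columns in place, without copying dat
dat[, elev2 := elev^2]
dat[, windsp_scaled := wind.sp / sd(wind.sp, na.rm = TRUE)]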
With the verb dplyr::mutate():
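A sketch of the dplyr version on the lazy table (as_tibble() forces the computation, matching the tibble printed below):

dat_ldt %>%
  mutate(
    elev2 = elev^2,
    windsp_scaled = wind.sp / sd(wind.sp, na.rm = TRUE)
  ) %>%
  as_tibble()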
# A tibble: 2,377,343 × 15
USAFID WBAN year month day hour min lat lon elev wind.sp temp
<int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 690150 93121 2019 8 1 0 56 34.3 -116. 696 5.7 37.2
2 690150 93121 2019 8 1 1 56 34.3 -116. 696 8.2 35.6
3 690150 93121 2019 8 1 2 56 34.3 -116. 696 6.7 34.4
4 690150 93121 2019 8 1 3 56 34.3 -116. 696 5.1 33.3
5 690150 93121 2019 8 1 4 56 34.3 -116. 696 2.1 32.8
6 690150 93121 2019 8 1 5 56 34.3 -116. 696 0 31.1
7 690150 93121 2019 8 1 6 56 34.3 -116. 696 1.5 29.4
8 690150 93121 2019 8 1 7 56 34.3 -116. 696 2.1 28.9
9 690150 93121 2019 8 1 8 56 34.3 -116. 696 2.6 27.2
10 690150 93121 2019 8 1 9 56 34.3 -116. 696 1.5 26.7
# ℹ 2,377,333 more rows
# ℹ 3 more variables: atm.press <dbl>, elev2 <dbl>, windsp_scaled <dbl>
Notice that since dat_ldt is a pointer, the variables are now in dat.
Imagine that we needed to generate all those calculations (scale by sd) on many more variables. We could then use the .SD symbol:
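A sketch of how this could look (in_names/out_names are illustrative names; the _scaled suffix matches the output below):

# Scale several columns at once using the .SD (Subset of Data) symbol
in_names  <- c("wind.sp", "temp", "atm.press")
out_names <- paste0(in_names, "_scaled")

dat[, c(out_names) := lapply(.SD, function(x) x / sd(x, na.rm = TRUE)),
    .SDcols = in_names]

head(dat[, ..out_names], n = 4)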
wind.sp_scaled temp_scaled atm.press_scaled
<num> <num> <num>
1: 2.654379 6.139348 248.7889
2: 3.818580 5.875290 248.8874
3: 3.120059 5.677247 248.9613
4: 2.374970 5.495707 249.2077
Key things to notice here: c(out_names), .SD, and .SDcols.
In the case of dplyr, we could use the following:
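A sketch using across() (the _scaled2 suffix matches the output below; as_tibble() is used because this approach has no direct data.table translation):

as_tibble(dat) %>%
  mutate(across(
    all_of(in_names),
    ~ .x / sd(.x, na.rm = TRUE),
    .names = "{.col}_scaled2"
  )) %>%
  select(ends_with("_scaled2")) %>%
  head(n = 4)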
# A tibble: 4 × 3
wind.sp_scaled2 temp_scaled2 atm.press_scaled2
<dbl> <dbl> <dbl>
1 2.65 6.14 249.
2 3.82 5.88 249.
3 3.12 5.68 249.
4 2.37 5.50 249.
A key thing here: this approach has no direct translation to data.table, which is why we used as_tibble().
While building the MET dataset, we dropped the State data. We can use the original Stations dataset and merge it with the MET dataset. But we cannot do it right away; we first need to process the data a bit.
stations <- fread("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv")
stations[, USAF := as.integer(USAF)]
# Dealing with NAs and 999999
stations[, USAF := fifelse(USAF == 999999, NA_integer_, USAF)]
stations[, CTRY := fifelse(CTRY == "", NA_character_, CTRY)]
stations[, STATE := fifelse(STATE == "", NA_character_, STATE)]
# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, list(USAF, CTRY, STATE)])
# Dropping NAs
stations <- stations[!is.na(USAF)]
head(stations, n = 4)
USAF CTRY STATE
<int> <char> <char>
1: 7018 <NA> <NA>
2: 7026 AF <NA>
3: 7070 AF <NA>
4: 8260 <NA> <NA>
Notice the function fifelse(). Now, let's try to merge the data!
[1] 2385443
This is more rows! The original dataset, dat, has 2,377,343 rows. This means that the stations dataset has duplicated IDs. We can fix this:
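One way to keep a single record per station ID (a sketch using unique()'s by argument):

# Keep the first record for each USAF code
stations <- unique(stations, by = "USAF")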
We can now use the function merge() to add the extra data:
dat <- merge(
# Data
x = dat,
y = stations,
# List of variables to match
by.x = "USAFID",
by.y = "USAF",
# Which obs to keep?
all.x = TRUE,
all.y = FALSE
)
head(dat[, list(USAFID, WBAN, STATE)], n = 4)
Key: <USAFID>
USAFID WBAN STATE
<int> <int> <char>
1: 690150 93121 CA
2: 690150 93121 CA
3: 690150 93121 CA
4: 690150 93121 CA
What happens when you change the options all.x and all.y?
Often, we must either impute some data or generate variables by strata.
If we, for example, wanted to impute missing temperature with the daily state average, we could use by together with the data.table::fcoalesce() function:
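A sketch of the imputation (temp_imputed is a hypothetical column name):

# Replace missing temp with the daily state average
dat[, temp_imputed := fcoalesce(temp, mean(temp, na.rm = TRUE)),
    by = .(STATE, day)]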
In the case of dplyr, we can do the following using dplyr::group_by together with dplyr::coalesce():
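The dplyr counterpart might look as follows (a sketch; temp_imputed2 is a hypothetical column name):

dat_ldt %>%
  group_by(STATE, day) %>%
  mutate(temp_imputed2 = coalesce(temp, mean(temp, na.rm = TRUE))) %>%
  ungroup()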
Let’s see how it looks like
Using by also allows us to create summaries of our data. For example, if we wanted to compute the average temperature, wind speed, and atmospheric pressure by state, we could do the following:
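A sketch of the grouped summary (the _avg names match the output below):

head(dat[, .(
  temp_avg      = mean(temp, na.rm = TRUE),
  wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
  atm.press_avg = mean(atm.press, na.rm = TRUE)
), by = STATE], n = 4)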
STATE temp_avg wind.sp_avg atm.press_avg
<char> <num> <num> <num>
1: AL 26.19799 1.566381 1016.148
2: AR 26.20697 1.836963 1014.551
3: AZ 28.80596 2.984547 1010.771
4: CA 22.36199 2.614120 1012.640
When dealing with too many variables, we can use the .SD special symbol in data.table:
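A sketch of the .SD version (keyby groups by STATE and orders the result by it):

out_names <- paste0(c("wind.sp", "temp", "atm.press"), "_avg")

head(dat[,
  setNames(lapply(.SD, mean, na.rm = TRUE), out_names),
  .SDcols = c("wind.sp", "temp", "atm.press"),
  keyby   = STATE
], n = 4)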
Key: <STATE>
STATE wind.sp_avg temp_avg atm.press_avg
<char> <num> <num> <num>
1: AL 1.566381 26.19799 1016.148
2: AR 1.836963 26.20697 1014.551
3: AZ 2.984547 28.80596 1010.771
4: CA 2.614120 22.36199 1012.640
Notice the keyby option here: “Group by STATE and order by STATE”.
Using dplyr verbs
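A sketch consistent with the translated call shown below:

dat_ldt %>%
  group_by(STATE) %>%
  summarise(
    temp_avg      = mean(temp, na.rm = TRUE),
    wind.sp_avg   = mean(wind.sp, na.rm = TRUE),
    atm.press_avg = mean(atm.press, na.rm = TRUE)
  ) %>%
  arrange(STATE) %>%
  head(n = 4)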
Source: local data table [4 x 4]
Call: head(setorder(`_DT4`[, .(temp_avg = mean(temp, na.rm = TRUE),
wind.sp_avg = mean(wind.sp, na.rm = TRUE), atm.press_avg = mean(atm.press,
na.rm = TRUE)), keyby = .(STATE)], STATE, na.last = TRUE),
n = 4)
STATE temp_avg wind.sp_avg atm.press_avg
<chr> <dbl> <dbl> <dbl>
1 AL 26.2 1.57 1016.
2 AR 26.2 1.84 1015.
3 AZ 28.8 2.98 1011.
4 CA 22.4 2.61 1013.
# Use as.data.table()/as.data.frame()/as_tibble() to access results
Notice the arrange() function.
shift(): Fast lead/lag for vectors and lists.
fifelse(): Fast if-else, similar to base R's ifelse().
fcoalesce(): Fast coalescing of missing values.
%between%: A short form of (x >= lb) & (x <= ub).
%inrange%: A short form of x %in% lb:up.
%chin%: Fast match of character vectors, equivalent to x %in% X, where both x and X are character vectors.
nafill(): Fill missing values using a constant, the last observed value, or the next observed value.
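A quick tour of a few of these helpers (a sketch with made-up values):

library(data.table)
x <- c(1, NA, 3, 4)

shift(x)                       # lag by one: NA 1 NA 3
fifelse(is.na(x), 0, x)        # 1 0 3 4
fcoalesce(x, 0)                # 1 0 3 4 (the scalar 0 is recycled)
x %between% c(2, 4)            # FALSE NA TRUE TRUE
nafill(x, type = "locf")       # last observation carried forward: 1 1 3 4
c("a", "b") %chin% c("b", "c") # FALSE TRUE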
H2O.ai's benchmark (link): designed by data.table's lead developer, Matt Dowle.
RStudio's benchmark (link): designed as part of the benchmarks for the vroom package.