Introduction to Respectables

Introduction

The respectables package provides a framework to

  • create recipes that define how to derive interdependent variables
  • create those variables using generating functions

For this vignette we load the respectables and dplyr packages:

library(respectables)
library(dplyr)

Note that the respectables package is still under development.

Simple Dataset

Let's start by defining a simple dataset dm with a single variable id.

gen_id <- function(n) {
  paste0("id-", 1:n)
}

dm_recipe <- tribble(
  ~variables, ~dependencies,  ~func,   ~func_args,
  "id",       no_deps,        gen_id,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2

Note that the argument n is supplied by respectables; in this case it is equal to N.

We can reuse the recipe dm_recipe to create a different dataset:

gen_table_data(N = 5, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2
## 3 id-3
## 4 id-4
## 5 id-5

Adding Multiple Variables

We will now add the variables height and weight to the dm recipe:

gen_hw <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  
  data.frame(height = runif(n, min = 1.5, max = 1.95)) %>%
    mutate(weight = bmi * height^2)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.674830 52.71748
## 2 id-2 1.583293 44.50277

Note that we used random number generators in gen_hw, hence rerunning gen_table_data will give different values:

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.643961 56.36275
## 2 id-2 1.912219 65.70995
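
If reproducible output is needed, one option is to fix the random seed before generating data. The sketch below uses only base R so that it runs without respectables; gen_hw_base is a hypothetical base-R variant of gen_hw written for this example. Whether gen_table_data draws random numbers anywhere other than in the generating functions is not covered here, so treat this as an illustration of the principle.

```r
# Base-R variant of gen_hw (hypothetical name), so the sketch runs on its own.
gen_hw_base <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  height <- runif(n, min = 1.5, max = 1.95)
  data.frame(height = height, weight = bmi * height^2)
}

# Fixing the seed before each call makes the random draws repeatable.
set.seed(123)
first <- gen_hw_base(2)

set.seed(123)
second <- gen_hw_base(2)

# Both data frames contain exactly the same values.
identical(first, second)
```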

Variable Dependencies

We will now continue our dm example by defining the variable age, which for illustrative purposes depends on height.

gen_age <- function(n, .df) {
  .df %>%
    transmute(age = height*25)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args,
  "age",                    "height",       gen_age, no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight      age
## 1 id-1 1.655448 57.94775 41.38620
## 2 id-2 1.703128 64.63424 42.57819

Note that respectables creates the arguments n and .df on the fly. Also, respectables determines the evaluation order of the variables from the dependency structure. That is, respectables does not guarantee that the variables are built in the order in which the recipe rows are listed.
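
To illustrate what deriving an evaluation order from dependencies can look like, here is a minimal base-R sketch. It is not the respectables implementation; the helper resolve_order and the list-based representation of the recipe columns are assumptions made for this example.

```r
# Hypothetical helper: order recipe rows so that every row comes after
# the rows that produce the variables it depends on.
resolve_order <- function(variables, dependencies) {
  done <- character(0)               # variables already available
  ord <- integer(0)                  # row indices in evaluation order
  remaining <- seq_along(variables)  # rows not yet scheduled

  while (length(remaining) > 0) {
    # rows whose dependencies are all available
    is_ready <- vapply(
      remaining,
      function(i) all(dependencies[[i]] %in% done),
      logical(1)
    )
    ready <- remaining[is_ready]
    if (length(ready) == 0) {
      stop("circular or unsatisfied dependencies")
    }
    ord <- c(ord, ready)
    done <- c(done, unlist(variables[ready]))
    remaining <- setdiff(remaining, ready)
  }
  ord
}

# The age row is listed first but depends on height, so it is scheduled last.
variables <- list("age", "id", c("height", "weight"))
dependencies <- list("height", character(0), character(0))

resolve_order(variables, dependencies)
```

In this sketch the rows for id and for height and weight are scheduled before the row for age, regardless of where age appears in the recipe.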

Configurable Arguments

If we want configurable variable generating functions, we can specify their arguments in the recipe:

gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,     no_args,
   c("height", "weight"),   no_deps,        gen_hw,     no_args,
  "age",                    "height",       gen_age,    no_args,
  "color",                  no_deps,        gen_color,  list(colors = c("blue", "red"))
)

gen_table_data(N = 4, recipe = dm_recipe)
##     id   height   weight      age color
## 1 id-1 1.589870 54.59671 39.74676   red
## 2 id-2 1.588014 45.93671 39.70035   red
## 3 id-3 1.793005 62.68109 44.82512  blue
## 4 id-4 1.919066 71.77582 47.97664   red
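
One way to picture how func_args reach a generating function is base R's do.call, which calls a function with a list of arguments. The sketch below is an assumption about the mechanism, not a description of the respectables internals; call_generator is a hypothetical helper, and gen_color is redefined here with a short default so that the example is self-contained.

```r
# Generating function with a configurable argument.
gen_color <- function(n, colors = c("green", "yellow")) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

# Hypothetical helper: combine the framework-supplied n with the
# user-supplied func_args and call the generating function.
call_generator <- function(func, n, func_args = list()) {
  do.call(func, c(list(n = n), func_args))
}

# func_args overrides the default value of colors.
res <- call_generator(
  gen_color,
  n = 4,
  func_args = list(colors = c("blue", "red"))
)
res
```

With an empty func_args list the generating function simply falls back to its defaults.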

Injecting Missing Data

The miss_recipe argument of gen_table_data can be used to inject missing values as the last step of data creation. That is, the data generation recipe is executed first and the missing data is injected afterwards. Hence, all variables are available when the missing recipe is executed, and the .df argument is supplied to func.

gen_alternate_na <- function(.df) {
  n <- nrow(.df)
  rep(c(TRUE, FALSE), length.out = n)
}

dm_na_recipe <- tribble(
  ~variables,       ~func,             ~func_args,
  "age",            gen_alternate_na,  no_args
)

gen_table_data(N = 4, recipe = dm_recipe, miss_recipe = dm_na_recipe)
##     id   height   weight      age color
## 1 id-1 1.613530 61.76107       NA   red
## 2 id-2 1.527325 47.54801 38.18313  blue
## 3 id-3 1.655560 55.56014       NA  blue
## 4 id-4 1.776139 68.58360 44.40347   red

Note that this currently only works with one variable per row in the missing recipe. We are still working on support for more complex missing data structures.
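
The effect of a missing-data function can be sketched in base R: the function returns a logical vector, and the entries marked TRUE are replaced by NA. That TRUE marks a value to be set to missing is inferred from the output above; inject_na is a hypothetical helper and not part of respectables, and the small data frame is made up for illustration.

```r
# Returns TRUE for rows whose value should become missing.
gen_alternate_na <- function(.df) {
  rep(c(TRUE, FALSE), length.out = nrow(.df))
}

# Hypothetical helper: set the flagged entries of one variable to NA.
inject_na <- function(df, variable, func) {
  is_missing <- func(.df = df)
  df[[variable]][is_missing] <- NA
  df
}

# Made-up data with a complete age column.
df <- data.frame(id = paste0("id-", 1:4), age = c(40, 38, 41, 44))

out <- inject_na(df, "age", gen_alternate_na)
out
```

Rows 1 and 3 of age become NA, while all other values are left untouched.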

Scaffolding

For this example we create a data frame aseq with the variable seqterm taking the values "step 1", ..., "step i" for each id, where i is extracted from the variable id.

dm <- gen_table_data(N = 3, recipe = dm_recipe)

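# Scaffolding function: grows dm so that the subject "id-i" gets i rows
gen_seq <- function(.db) {
  
  dm <- .db$dm
  
  # number of rows per subject, taken from the numeric part of the id
  ni <- as.numeric(substring(dm$id, 4))
  
  df_grow <- data.frame(
    id = rep(dm$id, ni),
    seq = sequence(ni)  # 1, then 1:2, then 1:3, ... concatenated
  )
  
  left_join(dm, df_grow, by = "id")
}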

aseq_scf_recipe <- tribble(
  ~foreign_tbl, ~foreign_key, ~func,     ~func_args,
  "dm",         "id",         gen_seq,   no_args     
)

gen_seq_term <- function(.df, ...) {
  data.frame(seqterm = paste("step", .df$seq))
}

aseq_recipe <- tribble(
  ~variables,      ~dependencies,  ~func,            ~func_args,
  "seqterm",       "seq",          gen_seq_term,     no_args
)

gen_reljoin_table(joinrec = aseq_scf_recipe, tblrec = aseq_recipe, db = list(dm = dm))
##     id   height   weight      age color seq seqterm
## 1 id-1 1.718011 78.03062 42.95027  blue   1  step 1
## 2 id-2 1.651156 55.09478 41.27890   red   1  step 1
## 3 id-2 1.651156 55.09478 41.27890   red   2  step 2
## 4 id-3 1.711076 52.59656 42.77689  blue   1  step 1
## 5 id-3 1.711076 52.59656 42.77689  blue   2  step 2
## 6 id-3 1.711076 52.59656 42.77689  blue   3  step 3

The steps here are:

  1. use joinrec to grow a new data frame, say A, possibly from the tables in db
  2. call gen_table_data with the following arguments
    • A for df
    • tblrec for recipe
    • miss_recipe, forwarded unchanged

Note that this functionality is under development. For now, aseq_scf_recipe needs to be a tibble with one row, and foreign_key is not yet used.
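
The two steps can also be traced by hand in base R. This sketch mirrors the example above without calling respectables; it uses merge in place of left_join, and the small dm data frame is made up for illustration.

```r
# A small stand-in for dm (made-up data).
dm <- data.frame(id = paste0("id-", 1:3), stringsAsFactors = FALSE)

# Step 1: grow a new data frame A from dm, with i rows for "id-i".
ni <- as.numeric(substring(dm$id, 4))
df_grow <- data.frame(
  id = rep(dm$id, ni),
  seq = sequence(ni),
  stringsAsFactors = FALSE
)
A <- merge(dm, df_grow, by = "id")

# Step 2: derive the new variable from the grown data frame.
A$seqterm <- paste("step", A$seq)
A
```

The result has one row for id-1, two for id-2 and three for id-3, each labelled with its step.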

Comparison with dplyr

This section needs further work.

Let’s map the following code into respectables recipes:

iris %>% 
  mutate(SPECIES = toupper(Species)) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

There are multiple ways to map this to the respectables framework. One of them is:

gen_toupper <- function(varname, .df, ...) {
   toupper(.df[[varname]])
}

rcp <- tribble(
    ~variables, ~dependencies,  ~func,          ~func_args,
    "SPECIES",  "Species",       gen_toupper,   list(varname = "Species") 
)

gen_table_data(recipe = rcp, df = iris) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

Note that in gen_toupper we use the ellipsis ... to absorb unused arguments such as n.
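
The role of the ellipsis can be seen in a plain function call. This sketch assumes that n is passed to every generating function, as the note above suggests; the direct calls stand in for what the framework would do, and gen_toupper_strict is a hypothetical variant without the ellipsis.

```r
# With ..., extra arguments such as n are silently ignored.
gen_toupper <- function(varname, .df, ...) {
  toupper(.df[[varname]])
}

# Without ..., passing n raises an "unused argument" error.
gen_toupper_strict <- function(varname, .df) {
  toupper(.df[[varname]])
}

df <- data.frame(Species = c("setosa", "virginica"))

# Works: n is absorbed by the ellipsis.
with_dots <- gen_toupper(varname = "Species", .df = df, n = nrow(df))

# Fails: the error is caught here and turned into NULL.
without_dots <- tryCatch(
  gen_toupper_strict(varname = "Species", .df = df, n = nrow(df)),
  error = function(e) NULL
)
```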