Introduction to Respectables

Introduction

The respectables package provides a framework to

  • create recipes that define how to derive interdependent variables
  • create those variables using generating functions

For this vignette we load the respectables and dplyr packages:

library(respectables)
library(dplyr)

Note that the respectables package is still under development.

Simple Dataset

Let's start by defining a simple dataset dm with a single variable id.

gen_id <- function(n) {
  paste0("id-", 1:n)
}

dm_recipe <- tribble(
  ~variables, ~dependencies,  ~func,   ~func_args,
  "id",       no_deps,        gen_id,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2

Note that the argument n is supplied by respectables; in this case it is equal to N.

We can reuse the recipe dm_recipe to create a different dataset:

gen_table_data(N = 5, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2
## 3 id-3
## 4 id-4
## 5 id-5
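Because generating functions are ordinary R functions, they can also be checked on their own before they are placed in a recipe. The short sketch below calls gen_id directly and does not involve respectables:

```r
# Same definition as in the recipe above
gen_id <- function(n) {
  paste0("id-", 1:n)
}

# Call the generating function directly with an explicit n
ids <- gen_id(3)
print(ids)
```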

Adding Multiple Variables

We will now add the variables height and weight to the dm recipe:

gen_hw <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  
  data.frame(height = runif(n, min = 1.5, max = 1.95)) %>%
    mutate(weight = bmi * height^2)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.943792 79.91244
## 2 id-2 1.791271 67.27335

Note that we used random number generators in gen_hw, hence rerunning gen_table_data gives different values:

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.600315 53.34039
## 2 id-2 1.591516 49.22176
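If reproducible values are needed, the usual base R approach is to fix the seed before generating. The sketch below demonstrates this on a base R restatement of gen_hw (written without dplyr so that it runs on its own). Whether the same holds for a full gen_table_data call depends on the package drawing all of its random numbers from R's global generator, which is an assumption here and not something this vignette confirms.

```r
# Base R restatement of gen_hw, for a standalone demonstration
gen_hw <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  height <- runif(n, min = 1.5, max = 1.95)
  data.frame(height = height, weight = bmi * height^2)
}

set.seed(123)  # arbitrary seed value
first <- gen_hw(2)

set.seed(123)  # same seed again, so the same random draws
second <- gen_hw(2)

# TRUE when both runs produced exactly the same data frame
same <- identical(first, second)
print(same)
```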

Variable Dependencies

We now continue our dm example by defining the variable age, which for illustrative purposes depends on height.

gen_age <- function(n, .df) {
  .df %>%
    transmute(age = height*25)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args,
  "age",                    "height",       gen_age, no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight      age
## 1 id-1 1.853473 59.02684 46.33681
## 2 id-2 1.557912 56.24207 38.94780

Note that respectables creates the arguments n and .df on the fly. Also, respectables determines the evaluation order of the variables from the dependency structure. That is, respectables does not guarantee that the resulting data frame is built in the order of the recipe rows.
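To illustrate what deriving an evaluation order from dependencies can look like, here is a minimal base R sketch. The helper resolve_order is hypothetical and is not part of respectables; the package's actual algorithm may differ. It repeatedly picks the recipe rows whose dependencies have already been created.

```r
# Hypothetical helper (not a respectables function): return recipe row
# indices ordered so that each row comes after the rows it depends on.
resolve_order <- function(variables, dependencies) {
  done <- character(0)                 # variables created so far
  row_order <- integer(0)              # recipe rows in evaluation order
  remaining <- seq_along(variables)
  while (length(remaining) > 0) {
    # rows whose dependencies are all available already
    ready <- Filter(function(i) all(dependencies[[i]] %in% done), remaining)
    if (length(ready) == 0) {
      stop("circular or unsatisfied dependencies")
    }
    row_order <- c(row_order, ready)
    done <- c(done, unlist(variables[ready]))
    remaining <- setdiff(remaining, ready)
  }
  row_order
}

# The age row is listed first although it depends on height
variables <- list("age", "id", c("height", "weight"))
dependencies <- list("height", character(0), character(0))

row_order <- resolve_order(variables, dependencies)
print(row_order)
```

In this sketch the rows for id and for height and weight are evaluated before the row for age, regardless of where age appears in the recipe.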

Configurable Arguments

If we want the variable generating functions to be configurable, we can specify their arguments in the recipe:

gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,      ~func_args,
  "id",                     no_deps,        gen_id,     no_args,
   c("height", "weight"),   no_deps,        gen_hw,     no_args,
  "age",                    "height",       gen_age,    no_args,
  "color",                  no_deps,        gen_color,  list(colors = c("blue", "red"))
)

gen_table_data(N = 4, recipe = dm_recipe)
##     id   height   weight      age color
## 1 id-1 1.853894 70.14849 46.34736   red
## 2 id-2 1.866296 65.30558 46.65740   red
## 3 id-3 1.807959 58.55209 45.19899   red
## 4 id-4 1.604780 50.24193 40.11949  blue
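One way to picture how func_args reaches the generating function is a do.call that combines the framework-supplied n with the list stored in the recipe. The sketch below is an illustration in base R only; framework_args and func_args are stand-in names, and the package's internal call may be structured differently.

```r
gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

# Stand-ins: what the framework would supply, and what the recipe row stores
framework_args <- list(n = 4)
func_args <- list(colors = c("blue", "red"))

# Combine both argument lists and call the generating function
out <- do.call(gen_color, c(framework_args, func_args))
print(out)
```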

Injecting Missing Data

The miss_recipe argument of gen_table_data can be used to inject missing values as the last step of data creation. That is, the data generation recipe is executed first and the missing data is injected afterwards. Hence, all variables are available when the missing data functions run, and the .df argument is supplied to func. In the example below, func returns a logical vector in which TRUE marks the values that become NA.

gen_alternate_na <- function(.df) {
  n <- nrow(.df)
  rep(c(TRUE, FALSE), length.out = n)
}

dm_na_recipe <- tribble(
  ~variables,       ~func,             ~func_args,
  "age",            gen_alternate_na,  no_args
)

gen_table_data(N = 4, recipe = dm_recipe, miss_recipe = dm_na_recipe)
##     id   height   weight      age color
## 1 id-1 1.828758 72.83832       NA   red
## 2 id-2 1.807926 67.21645 45.19816  blue
## 3 id-3 1.540891 41.21060       NA   red
## 4 id-4 1.626210 68.30714 40.65525  blue

Note that this currently only works with one variable per row in the missing recipe. We are still working on this feature to allow for more complex definitions of missing data structures.
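The effect of such a logical vector can be reproduced in base R. The sketch below applies the mask returned by gen_alternate_na to a small hand-made data frame (the age values are made up for illustration); it mimics the injection step and is not the package's own code.

```r
gen_alternate_na <- function(.df) {
  n <- nrow(.df)
  rep(c(TRUE, FALSE), length.out = n)
}

# Illustrative data; the age values are invented for this example
df <- data.frame(id = paste0("id-", 1:4), age = c(46, 45, 38, 40))

# TRUE marks the rows whose age is replaced by NA
mask <- gen_alternate_na(df)
df$age[mask] <- NA
print(df)
```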

Scaffolding

For this example we create a data frame aseq with the variable seqterm being c("step 1", ..., "step i"), where i is extracted from the variable id.

dm <- gen_table_data(N = 3, recipe = dm_recipe)

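# grow dataset: repeat each row of dm i times, where i is the number in its id
gen_seq <- function(.db) {
  
  dm <- .db$dm
  
  # numeric part of the id, e.g. "id-3" gives 3
  ni <- as.numeric(substring(dm$id, 4))
  
  df_grow <- data.frame(
    id = rep(dm$id, ni),
    seq = sequence(ni)  # 1, 1:2, 1:3, ... concatenated
  )
  
  left_join(dm, df_grow, by = "id")
}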

aseq_scf_recipe <- tribble(
  ~foreign_tbl, ~foreign_key, ~func,     ~func_args,
  "dm",         "id",         gen_seq,   no_args     
)

gen_seq_term <- function(.df, ...) {
  data.frame(seqterm = paste("step", .df$seq))
}

aseq_recipe <- tribble(
  ~variables,      ~dependencies,  ~func,            ~func_args,
  "seqterm",       "seq",          gen_seq_term,     no_args
)

gen_reljoin_table(joinrec = aseq_scf_recipe, tblrec = aseq_recipe, db = list(dm = dm))
##     id   height   weight      age color seq seqterm
## 1 id-1 1.829745 69.46447 45.74364   red   1  step 1
## 2 id-2 1.872380 70.51194 46.80949  blue   1  step 1
## 3 id-2 1.872380 70.51194 46.80949  blue   2  step 2
## 4 id-3 1.806641 58.39891 45.16601  blue   1  step 1
## 5 id-3 1.806641 58.39891 45.16601  blue   2  step 2
## 6 id-3 1.806641 58.39891 45.16601  blue   3  step 3

The steps here are:

  1. use joinrec to grow a new data frame, say A, possibly from the tables in db
  2. call gen_table_data with the following arguments
    • A for df
    • tblrec for recipe
    • miss_recipe, forwarded as given

Note that this functionality is under development. Currently the scaffolding recipe (here aseq_scf_recipe) must be a tibble with exactly one row, and foreign_key is not yet used.
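The two steps above can be mimicked in base R to see what the result should look like. This is a sketch of the idea only, using a dm table that contains nothing but id; it does not call gen_reljoin_table.

```r
dm <- data.frame(id = paste0("id-", 1:3))

# Step 1: grow the scaffold, one row per step for each id
ni <- as.numeric(substring(dm$id, 4))
scaffold <- dm[rep(seq_len(nrow(dm)), ni), , drop = FALSE]
scaffold$seq <- sequence(ni)
rownames(scaffold) <- NULL

# Step 2: derive new variables on the grown data frame
scaffold$seqterm <- paste("step", scaffold$seq)
print(scaffold)
```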

Comparison with dplyr

This section needs further work.

Let’s map the following code into respectables recipes:

iris %>% 
  mutate(SPECIES = toupper(Species)) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

There are multiple ways to map this to the respectables framework. One of them is:

gen_toupper <- function(varname, .df, ...) {
  toupper(.df[[varname]])
}

rcp <- tribble(
  ~variables, ~dependencies,  ~func,         ~func_args,
  "SPECIES",  "Species",      gen_toupper,   list(varname = "Species")
)

gen_table_data(recipe = rcp, df = iris) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

Note that in gen_toupper we use the ellipsis ... to absorb unused arguments such as n.
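The role of the ellipsis can be seen in a small base R experiment. The sketch below calls two variants of the function with an extra n argument via do.call, as a stand-in for a framework that always passes n; the variant without ... fails with an unused argument error. The names with_dots and without_dots are made up for this illustration.

```r
with_dots <- function(varname, .df, ...) {
  toupper(.df[[varname]])
}

without_dots <- function(varname, .df) {
  toupper(.df[[varname]])
}

# Arguments as a framework might pass them, including an n the function ignores
args <- list(n = 2, varname = "x", .df = data.frame(x = c("a", "b")))

# Works: n is absorbed by ...
ok <- do.call(with_dots, args)
print(ok)

# Fails: n is not a formal argument, so R reports an unused argument
failed <- tryCatch(
  do.call(without_dots, args),
  error = function(e) conditionMessage(e)
)
print(failed)
```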