Introduction to Respectables

Introduction

The respectables package provides a framework to

  • create recipes that define how to derive interdependent variables
  • create the variables themselves using generating functions

For this vignette we load the respectables and dplyr packages:

library(respectables)
library(dplyr)

Note that the respectables package is still under development.

Simple Dataset

Let's start by defining a simple dataset dm with a single variable id.

gen_id <- function(n) {
  paste0("id-", 1:n)
}

dm_recipe <- tribble(
  ~variables, ~dependencies,  ~func,   ~func_args,
  "id",       no_deps,        gen_id,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2

Note that the argument n is supplied by respectables; in this case it is equal to N.

We can use the recipe dm_recipe again to create a different dataset:

gen_table_data(N = 5, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2
## 3 id-3
## 4 id-4
## 5 id-5

Adding Multiple Variables

We will now add the variables height and weight to the dm recipe:

gen_hw <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  
  data.frame(height = runif(n, min = 1.5, max = 1.95)) %>%
    mutate(weight = bmi * height^2)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.766413 54.70709
## 2 id-2 1.680255 61.17568

Note that we used random number generators in gen_hw, hence rerunning gen_table_data will give different values:

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.903912 85.25352
## 2 id-2 1.884228 75.13749
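
If reproducible output is needed, base R's set.seed() can fix the random number stream before the generating function runs. The following is a minimal sketch using base R only; gen_hw_base is a hypothetical stand-in for gen_hw, and whether calling set.seed() before gen_table_data has the same effect is an assumption that is not verified here.

```r
# Hypothetical base R stand-in for gen_hw (no dplyr needed).
gen_hw_base <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  height <- runif(n, min = 1.5, max = 1.95)
  data.frame(height = height, weight = bmi * height^2)
}

# Same seed before each call, so both calls draw the same random numbers.
set.seed(42)
first <- gen_hw_base(3)

set.seed(42)
second <- gen_hw_base(3)

identical(first, second)  # TRUE
```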

Variable Dependencies

We will now continue our dm example by defining the variable age, which for illustrative purposes depends on height.

gen_age <- function(n, .df) {
  .df %>%
    transmute(age = height*25)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args,
  "age",                    "height",       gen_age, no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight      age
## 1 id-1 1.755060 57.03430 43.87651
## 2 id-2 1.888836 64.41491 47.22091

Note that respectables creates the arguments n and .df on the fly. Also, respectables determines the evaluation order of the variables from the dependency structure. That is, the rows of the recipe are not necessarily evaluated in the order in which they are written.
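
To illustrate what ordering by dependencies means, here is a minimal base R sketch. The helper order_rows is hypothetical and is not part of respectables; how the package resolves the order internally may differ.

```r
# Hypothetical helper: return recipe row indices so that every row comes
# after the rows that generate its dependencies.
# `vars` and `deps` are lists of character vectors, one element per row.
order_rows <- function(vars, deps) {
  done <- character(0)       # variables generated so far
  ordered <- integer(0)      # row indices in evaluation order
  remaining <- seq_along(vars)
  while (length(remaining) > 0) {
    # Rows whose dependencies have all been generated already.
    is_ready <- vapply(
      remaining,
      function(i) all(deps[[i]] %in% done),
      logical(1)
    )
    ready <- remaining[is_ready]
    if (length(ready) == 0) stop("circular or unsatisfied dependencies")
    ordered <- c(ordered, ready)
    done <- c(done, unlist(vars[ready]))
    remaining <- setdiff(remaining, ready)
  }
  ordered
}

# "age" is written first but depends on "height", so its row comes last.
vars <- list("age", "id", c("height", "weight"))
deps <- list("height", character(0), character(0))
order_rows(vars, deps)  # 2 3 1
```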

Configurable Arguments

If we want variable-generating functions to be configurable, we can specify their arguments in the recipe:

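# The default must not refer to itself: `colors = colors()` would fail with a
# recursive default argument error when used, so the function is namespaced.
gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}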

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,      ~func_args,
  "id",                     no_deps,        gen_id,     no_args,
   c("height", "weight"),   no_deps,        gen_hw,     no_args,
  "age",                    "height",       gen_age,    no_args,
  "color",                  no_deps,        gen_color,  list(colors = c("blue", "red"))
)

gen_table_data(N = 4, recipe = dm_recipe)
##     id   height   weight      age color
## 1 id-1 1.693552 58.90777 42.33879   red
## 2 id-2 1.833927 57.60692 45.84817  blue
## 3 id-3 1.540379 49.29668 38.50947  blue
## 4 id-4 1.691383 57.73142 42.28457   red
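
One way to picture how the func_args entry reaches the generating function is base R's do.call, which calls a function with a list of named arguments. The sketch below is an assumption about the mechanism, not the package's actual implementation; call_generator and gen_color_demo are hypothetical names.

```r
# Hypothetical stand-in for evaluating one recipe row: combine the
# framework-supplied `n` with the user-supplied `func_args`.
call_generator <- function(func, n, func_args = list()) {
  do.call(func, c(list(n = n), func_args))
}

gen_color_demo <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

# With func_args, only the configured colors are sampled.
res <- call_generator(
  gen_color_demo,
  n = 4,
  func_args = list(colors = c("blue", "red"))
)
all(res$color %in% c("blue", "red"))  # TRUE
```

Without func_args the default palette from grDevices::colors() is used instead.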

Injecting Missing Data

The miss_recipe argument of gen_table_data can be used to inject missing values as the last step of data generation. That is, the data generation recipe is executed first and the missing data is injected afterwards. Hence, all variables are available when the missing recipe is executed, and the .df argument is supplied to func.

gen_alternate_na <- function(.df) {
  n <- nrow(.df)
  rep(c(TRUE, FALSE), length.out = n)
}

dm_na_recipe <- tribble(
  ~variables,       ~func,             ~func_args,
  "age",            gen_alternate_na,  no_args
)

gen_table_data(N = 4, recipe = dm_recipe, miss_recipe = dm_na_recipe)
##     id   height   weight      age color
## 1 id-1 1.541134 44.53334       NA  blue
## 2 id-2 1.793726 70.39314 44.84316  blue
## 3 id-3 1.562529 53.84893       NA  blue
## 4 id-4 1.935932 79.35835 48.39830   red

Note that this currently works only with one variable per row in the missing recipe. Support for more complex missing-data structures is still being worked on.
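
The injection step itself can be pictured as applying a logical mask to one column. The following base R sketch shows that idea under the assumption that TRUE marks a value to be replaced by NA, as in the output above; inject_na is a hypothetical helper, not a respectables function.

```r
# Hypothetical helper: set `variable` to NA wherever the mask function,
# called with the full data frame as `.df`, returns TRUE.
inject_na <- function(df, variable, mask_func) {
  mask <- mask_func(.df = df)
  df[[variable]][mask] <- NA
  df
}

alternate_na <- function(.df) {
  rep(c(TRUE, FALSE), length.out = nrow(.df))
}

df <- data.frame(id = paste0("id-", 1:4), age = c(40, 45, 38, 48))
out <- inject_na(df, "age", alternate_na)
out$age  # NA 45 NA 48
```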

Scaffolding

For this example we create a data frame aseq with the variable seqterm being c("step 1", ..., "step i"), where i is extracted from the variable id.

dm <- gen_table_data(N = 3, recipe = dm_recipe)

# grow dataset
gen_seq <- function(.db) {
  
  dm <- .db$dm
  
  ni <- as.numeric(substring(dm$id, 4))
  
  df_grow <- data.frame(
    id = rep(dm$id, ni),
    seq = sequence(ni)
  )
  
  left_join(dm, df_grow, by = "id")
}

aseq_scf_recipe <- tribble(
  ~foreign_tbl, ~foreign_key, ~func,     ~func_args,
  "dm",         "id",         gen_seq,   no_args     
)

gen_seq_term <- function(.df, ...) {
  data.frame(seqterm = paste("step", .df$seq))
}

aseq_recipe <- tribble(
  ~variables,      ~dependencies,  ~func,            ~func_args,
  "seqterm",       "seq",          gen_seq_term,     no_args
)

gen_reljoin_table(joinrec = aseq_scf_recipe, tblrec = aseq_recipe, db = list(dm = dm))
##     id   height   weight      age color seq seqterm
## 1 id-1 1.758112 65.20014 43.95280  blue   1  step 1
## 2 id-2 1.588006 69.84311 39.70016  blue   1  step 1
## 3 id-2 1.588006 69.84311 39.70016  blue   2  step 2
## 4 id-3 1.772379 59.71679 44.30948   red   1  step 1
## 5 id-3 1.772379 59.71679 44.30948   red   2  step 2
## 6 id-3 1.772379 59.71679 44.30948   red   3  step 3

The steps here are:

  1. use joinrec to grow a new data frame, say A, possibly from db
  2. call gen_table_data with the following arguments
    • A for df
    • tblrec for recipe
    • forward miss_recipe
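
The two steps can also be traced by hand without the package. This base R sketch grows a data frame A from a parent table and then derives a variable on it; dm_demo is a hypothetical miniature of dm with only the id column, and no missing recipe is applied.

```r
# Step 1: grow a data frame A from the parent table,
# with one row per step for each id.
dm_demo <- data.frame(id = paste0("id-", 1:3), stringsAsFactors = FALSE)
ni <- as.numeric(substring(dm_demo$id, 4))  # number of steps per id: 1, 2, 3
A <- data.frame(
  id = rep(dm_demo$id, ni),
  seq = sequence(ni),
  stringsAsFactors = FALSE
)

# Step 2: derive the new variable on the grown data frame.
A$seqterm <- paste("step", A$seq)

nrow(A)  # 6
```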

Note that this functionality is under development. Currently aseq_scf_recipe needs to be a tibble with one row, and foreign_key is not yet used.

Comparison with dplyr

This section needs further work.

Let’s map the following code into respectables recipes:

iris %>% 
  mutate(SPECIES = toupper(Species)) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

There are multiple ways to map this to the respectables framework. One of them is:

gen_toupper <- function(varname, .df, ...) {
   toupper(.df[[varname]])
}

rcp <- tribble(
    ~variables, ~dependencies,  ~func,          ~func_args,
    "SPECIES",  "Species",       gen_toupper,   list(varname = "Species") 
)

gen_table_data(recipe = rcp, df = iris) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

Note that in gen_toupper we use the ellipsis ... to absorb unused arguments such as n.
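
The effect of the ellipsis can be checked in plain R. The sketch below assumes that the framework passes n alongside .df and the configured arguments, as described above; with_dots and without_dots are hypothetical names used only for this comparison.

```r
with_dots <- function(varname, .df, ...) toupper(.df[[varname]])
without_dots <- function(varname, .df) toupper(.df[[varname]])

# Arguments as a framework might supply them, including an `n`
# that neither function uses.
args <- list(
  n = 2,
  .df = data.frame(x = c("a", "b"), stringsAsFactors = FALSE),
  varname = "x"
)

# The ellipsis silently absorbs `n`.
upper <- do.call(with_dots, args)
upper  # "A" "B"

# Without the ellipsis the call fails with an "unused argument" error.
res <- try(do.call(without_dots, args), silent = TRUE)
inherits(res, "try-error")  # TRUE
```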