Introduction to Respectables

Introduction

The respectables package provides a framework to

  • create recipes that define how to derive interdependent variables
  • create variables using generating functions

For this vignette we load the respectables and dplyr packages:

library(respectables)
library(dplyr)

Note that the respectables package is still under development.

Simple Dataset

Let's start by defining a simple dataset dm with a single variable id.

gen_id <- function(n) {
  paste0("id-", 1:n)
}

dm_recipe <- tribble(
  ~variables, ~dependencies,  ~func,   ~func_args,
  "id",       no_deps,        gen_id,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2

Note that the argument n is supplied by respectables; in this case it is equal to N.

We can use the recipe dm_recipe again to create a different dataset:

gen_table_data(N = 5, recipe = dm_recipe)
##     id
## 1 id-1
## 2 id-2
## 3 id-3
## 4 id-4
## 5 id-5

Adding Multiple Variables

We will now add the variables height and weight to the dm recipe:

gen_hw <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  
  data.frame(height = runif(n, min = 1.5, max = 1.95)) %>%
    mutate(weight = bmi * height^2)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.547805 49.27803
## 2 id-2 1.592881 51.36388

Note that we used random number generators in gen_hw; hence, rerunning gen_table_data will give different values:

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight
## 1 id-1 1.906048 69.70237
## 2 id-2 1.543742 44.50299
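If reproducible output is needed, fixing the random seed before each call is one option. The sketch below uses only base R so that it runs without respectables: gen_hw_base is a hypothetical base-R rewrite of gen_hw, introduced here for illustration. It assumes that gen_table_data draws from R's global random number stream, which this vignette does not confirm.

```r
# Base-R variant of gen_hw, used only for this illustration
gen_hw_base <- function(n) {
  bmi <- 17 + abs(rnorm(n, mean = 3, sd = 3))
  height <- runif(n, min = 1.5, max = 1.95)
  data.frame(height = height, weight = bmi * height^2)
}

set.seed(42)
first <- gen_hw_base(2)

set.seed(42)   # same seed, so the same random draws
second <- gen_hw_base(2)

identical(first, second)   # TRUE
```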

Variable Dependencies

We will now continue our dm example by defining the variable age, which for illustrative purposes depends on height.

gen_age <- function(n, .df) {
  .df %>%
    transmute(age = height*25)
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,  no_args,
   c("height", "weight"),   no_deps,        gen_hw,  no_args,
  "age",                    "height",       gen_age, no_args
)

gen_table_data(N = 2, recipe = dm_recipe)
##     id   height   weight      age
## 1 id-1 1.568570 60.02290 39.21426
## 2 id-2 1.658362 55.46995 41.45905

Note that respectables creates the arguments n and .df on the fly. It also determines the evaluation order of the variables from the dependency structure. That is, respectables does not guarantee that the resulting data frame is built in the row order of the recipe.
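To illustrate what ordering by dependency structure means, here is a minimal base-R sketch of a dependency-first ordering. It is not the algorithm respectables uses internally (the vignette does not describe it); resolve_order and the simplified deps list, with one variable per entry, are assumptions made for this example.

```r
# Hypothetical helper: order variables so that dependencies come first.
# `deps` is a named list: names are variables, values are the variables they need.
resolve_order <- function(deps) {
  done <- character(0)
  todo <- names(deps)
  while (length(todo) > 0) {
    # variables whose dependencies have all been generated already
    ready <- todo[vapply(deps[todo], function(d) all(d %in% done), logical(1))]
    if (length(ready) == 0) stop("circular or missing dependency")
    done <- c(done, ready)
    todo <- setdiff(todo, ready)
  }
  done
}

# age is listed first but needs height, so it is moved to the end
deps <- list(age = "height", id = character(0), height = character(0))
resolve_order(deps)
```

With this input the function returns id and height before age, regardless of where age appears in the list.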

Configurable Arguments

To make variable generating functions configurable, we can specify their arguments in the func_args column of the recipe:

gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

dm_recipe <- tribble(
  ~variables,               ~dependencies,  ~func,   ~func_args,
  "id",                     no_deps,        gen_id,     no_args,
   c("height", "weight"),   no_deps,        gen_hw,     no_args,
  "age",                    "height",       gen_age,    no_args,
  "color",                  no_deps,        gen_color,  list(colors = c("blue", "red"))
)

gen_table_data(N = 4, recipe = dm_recipe)
##     id   height   weight      age color
## 1 id-1 1.601608 48.29782 40.04019  blue
## 2 id-2 1.692776 55.71318 42.31939   red
## 3 id-3 1.928869 72.33984 48.22171  blue
## 4 id-4 1.938519 65.40852 48.46298  blue
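One way to picture how func_args reaches the generating function is a do.call that merges the framework-supplied n with the list from the recipe. This is only a sketch of the idea in base R, not respectables' actual internals; call_generator is a hypothetical helper, and gen_color is redefined here so that the block runs on its own.

```r
gen_color <- function(n, colors = grDevices::colors()) {
  data.frame(color = sample(colors, n, replace = TRUE))
}

# Hypothetical helper: combine the framework-supplied n with the recipe's func_args
call_generator <- function(func, n, func_args = list()) {
  do.call(func, c(list(n = n), func_args))
}

res <- call_generator(gen_color, n = 4, func_args = list(colors = c("blue", "red")))
nrow(res)                              # 4
all(res$color %in% c("blue", "red"))   # TRUE
```

Passing an empty func_args list falls back to the function's default, which here is the full set of colour names known to R.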

Injecting Missing Data

The miss_recipe argument of gen_table_data can be used to inject missing values as the last step of data generation. That is, the data generation recipe is executed first and the missing data is injected afterwards. Hence, all variables are available at execution time and the .df argument is supplied to func.

gen_alternate_na <- function(.df) {
  n <- nrow(.df)
  rep(c(TRUE, FALSE), length.out = n)
}

dm_na_recipe <- tribble(
  ~variables,       ~func,             ~func_args,
  "age",            gen_alternate_na,  no_args
)

gen_table_data(N = 4, recipe = dm_recipe, miss_recipe = dm_na_recipe)
##     id   height   weight      age color
## 1 id-1 1.685651 58.74241       NA   red
## 2 id-2 1.768599 57.62706 44.21496  blue
## 3 id-3 1.570489 48.85073       NA  blue
## 4 id-4 1.811559 61.47564 45.28899   red

Note that this currently works with only one variable per row in the missing recipe. Support for more complex missing-data structures is still being worked on.
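Judging from the output above, the rows for which the function returns TRUE are the ones set to NA. A minimal base-R sketch of that final step, assuming this reading is right, could look as follows; inject_na is a hypothetical helper and not part of respectables.

```r
gen_alternate_na <- function(.df) {
  rep(c(TRUE, FALSE), length.out = nrow(.df))
}

# Hypothetical helper: set `variable` to NA wherever the mask function returns TRUE
inject_na <- function(df, variable, func) {
  mask <- func(.df = df)
  df[[variable]][mask] <- NA
  df
}

df <- data.frame(id = paste0("id-", 1:4), age = c(40, 42, 48, 45))
out <- inject_na(df, "age", gen_alternate_na)
out$age   # NA 42 NA 45
```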

Scaffolding

For this example we create a data frame aseq with the variable seqterm taking the values c("step 1", ..., "step i"), where i is extracted from the variable id.

dm <- gen_table_data(N = 3, recipe = dm_recipe)

# grow dataset
gen_seq <- function(.db) {
  
  dm <- .db$dm
  
  ni <- as.numeric(substring(dm$id, 4))
  
  df_grow <- data.frame(
    id = rep(dm$id, ni),
    seq = unlist(sapply(ni, seq, from = 1))
  )
  
  left_join(dm, df_grow, by = "id")
}

aseq_scf_recipe <- tribble(
  ~foreign_tbl, ~foreign_key, ~func,     ~func_args,
  "dm",         "id",         gen_seq,   no_args     
)

gen_seq_term <- function(.df, ...) {
  data.frame(seqterm = paste("step", .df$seq))
}

aseq_recipe <- tribble(
  ~variables,      ~dependencies,  ~func,            ~func_args,
  "seqterm",       "seq",          gen_seq_term,     no_args
)

gen_reljoin_table(joinrec = aseq_scf_recipe, tblrec = aseq_recipe, db = list(dm = dm))
##     id   height   weight      age color seq seqterm
## 1 id-1 1.700983 67.94144 42.52457   red   1  step 1
## 2 id-2 1.736638 79.05012 43.41595   red   1  step 1
## 3 id-2 1.736638 79.05012 43.41595   red   2  step 2
## 4 id-3 1.770105 61.37452 44.25262  blue   1  step 1
## 5 id-3 1.770105 61.37452 44.25262  blue   2  step 2
## 6 id-3 1.770105 61.37452 44.25262  blue   3  step 3

The steps here are:

  1. use joinrec to grow a new data frame, say A, possibly from db
  2. call gen_table_data with the following arguments
    • A for df
    • tblrec for recipe
    • forward miss_recipe

Note that this functionality is under development. Currently aseq_scf_recipe needs to be a tibble with one row, and foreign_key is not yet used.
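The growth step itself does not depend on dplyr. As an alternative sketch in base R, each row of dm can be repeated by indexing and the counter built with sequence(). The name grow_by_id is hypothetical, and the small dm below stands in for the generated dataset so that the block runs without respectables.

```r
dm <- data.frame(id = paste0("id-", 1:3), stringsAsFactors = FALSE)

grow_by_id <- function(dm) {
  ni <- as.numeric(substring(dm$id, 4))                   # "id-3" -> 3
  out <- dm[rep(seq_len(nrow(dm)), ni), , drop = FALSE]   # repeat row i ni times
  out$seq <- sequence(ni)                                 # 1, 1 2, 1 2 3
  rownames(out) <- NULL
  out
}

A <- grow_by_id(dm)
A$seqterm <- paste("step", A$seq)   # what gen_seq_term adds in the second step
A
```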

Comparison with dplyr

This section needs further work.

Let’s map the following code into respectables recipes:

iris %>% 
  mutate(SPECIES = toupper(Species)) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

There are multiple ways to map this to the respectables framework. One of them is:

gen_toupper <- function(varname, .df, ...) {
   toupper(.df[[varname]])
}

rcp <- tribble(
    ~variables, ~dependencies,  ~func,          ~func_args,
    "SPECIES",  "Species",       gen_toupper,   list(varname = "Species") 
)

gen_table_data(recipe = rcp, df = iris) %>%
  head()
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SPECIES
## 1          5.1         3.5          1.4         0.2  setosa  SETOSA
## 2          4.9         3.0          1.4         0.2  setosa  SETOSA
## 3          4.7         3.2          1.3         0.2  setosa  SETOSA
## 4          4.6         3.1          1.5         0.2  setosa  SETOSA
## 5          5.0         3.6          1.4         0.2  setosa  SETOSA
## 6          5.4         3.9          1.7         0.4  setosa  SETOSA

Note that in gen_toupper we use the ellipsis ... to absorb unused arguments such as n.
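The effect of the ellipsis can be reproduced in base R with do.call. The sketch assumes that the framework passes n alongside .df and the func_args entries, as the note above suggests; gen_toupper_strict is a hypothetical variant without ..., added only for contrast.

```r
gen_toupper <- function(varname, .df, ...) {
  toupper(.df[[varname]])
}

# Same function without `...`: the extra argument n cannot be matched
gen_toupper_strict <- function(varname, .df) {
  toupper(.df[[varname]])
}

args <- list(n = 3, .df = head(iris, 3), varname = "Species")

upper <- do.call(gen_toupper, args)   # n is silently absorbed by `...`
tried <- try(do.call(gen_toupper_strict, args), silent = TRUE)
inherits(tried, "try-error")          # TRUE: unused argument n
```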