--- title: "Comparison with dplyr Tabulation" author: "Gabriel Becker and Adrian Waddell" date: "`r Sys.Date()`" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Comparison with dplyr Tabulation} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} editor_options: chunk_output_type: console --- ```{r, include = FALSE} suggested_dependent_pkgs <- c("dplyr", "tibble") knitr::opts_chunk$set( collapse = TRUE, comment = "#>", eval = all(vapply( suggested_dependent_pkgs, requireNamespace, logical(1), quietly = TRUE )) ) ``` ```{r, echo=FALSE} knitr::opts_chunk$set(comment = "#") ``` ```{css, echo=FALSE} .reveal .r code { white-space: pre; } ``` ## Introduction In this vignette, we would like to discuss the similarities and differences between `dplyr` and `rtable`. Much of the `rtables` framework focuses on tabulation/summarizing of data and then the visualization of the table. In this vignette, we focus on summarizing data using `dplyr` and contrast it to `rtables`. We won't pay attention to the table visualization/markup and just derive the cell content. Using `dplyr` to summarize data and `gt` to visualize the table is a good way if the tabulation is of a certain nature or complexity. However, there are tables such as the table created in the [`introduction`](https://insightsengineering.github.io/rtables/latest-release/articles/rtables.html) vignette that take some effort to create with `dplyr`. Part of the effort is due to fact that when using `dplyr` the table data is stored in `data.frame`s or `tibble`s which is not the most natural way to represent a table as we will show in this vignette. If you know a more elegant way of deriving the table content with `dplyr` please let us know and we will update the vignette. ```{r, message=FALSE} library(rtables) library(dplyr) ``` Here is the table and data used in the [`introduction`](https://insightsengineering.github.io/rtables/latest-release/articles/rtables.html) vignette: ```{r} n <- 400 set.seed(1) df <- tibble( arm = factor(sample(c("Arm A", "Arm B"), n, replace = TRUE), levels = c("Arm A", "Arm B")), country = factor(sample(c("CAN", "USA"), n, replace = TRUE, prob = c(.55, .45)), levels = c("CAN", "USA")), gender = factor(sample(c("Female", "Male"), n, replace = TRUE), levels = c("Female", "Male")), handed = factor(sample(c("Left", "Right"), n, prob = c(.6, .4), replace = TRUE), levels = c("Left", "Right")), age = rchisq(n, 30) + 10 ) %>% mutate( weight = 35 * rnorm(n, sd = .5) + ifelse(gender == "Female", 140, 180) ) lyt <- basic_table(show_colcounts = TRUE) %>% split_cols_by("arm") %>% split_cols_by("gender") %>% split_rows_by("country") %>% summarize_row_groups() %>% split_rows_by("handed") %>% summarize_row_groups() %>% analyze("age", afun = mean, format = "xx.x") tbl <- build_table(lyt, df) tbl ``` ## Getting Started We will start by deriving the first data cell on row 3 (note, row 1 and 2 have content cells, see the [`introduction`](https://insightsengineering.github.io/rtables/latest-release/articles/rtables.html) vignette). Cell 3,1 contains the mean age for left handed & female Canadians in "Arm A": ```{r} mean(df$age[df$country == "CAN" & df$arm == "Arm A" & df$gender == "Female" & df$handed == "Left"]) ``` or with `dplyr`: ```{r} df %>% filter(country == "CAN", arm == "Arm A", gender == "Female", handed == "Left") %>% summarise(mean_age = mean(age)) ``` Further, `dplyr` gives us other verbs to easily get the average age of left handed Canadians for each group defined by the 4 columns: ```{r} df %>% group_by(arm, gender) %>% filter(country == "CAN", handed == "Left") %>% summarise(mean_age = mean(age)) ``` We can further get to all the average age cell values with: ```{r} average_age <- df %>% group_by(arm, gender, country, handed) %>% summarise(mean_age = mean(age)) average_age ``` In `rtable` syntax, we need the following code to get to the same content: ```{r} lyt <- basic_table() %>% split_cols_by("arm") %>% split_cols_by("gender") %>% split_rows_by("country") %>% split_rows_by("handed") %>% analyze("age", afun = mean, format = "xx.x") tbl <- build_table(lyt, df) tbl ``` As mentioned in the introduction to this vignette, please ignore the difference in arranging and formatting the data: it's possible to condense the `rtable` more and it is possible to make the `tibble` look more like the reference table using the `gt` R package. In terms of tabulation for this example there was arguably not much added by `rtables` over `dplyr`. ## Content Information Unlike in `rtables` the different levels of summarization are discrete computations in `dplyr` which we will then need to combine We first focus on the count and percentage information for handedness within each country (for each arm-gender pair), along with the analysis row mean values: ```{r} c_h_df <- df %>% group_by(arm, gender, country, handed) %>% summarize(mean = mean(age), c_h_count = n()) %>% ## we need the sum below to *not* be by country, so that we're dividing by the column counts ungroup(country) %>% # now the `handed` grouping has been removed, therefore we can calculate percent now: mutate(n_col = sum(c_h_count), c_h_percent = c_h_count / n_col) c_h_df ``` which has 16 rows (cells) like the `average_age` data frame defined above. Next, we will derive the group information for countries: ```{r} c_df <- df %>% group_by(arm, gender, country) %>% summarize(c_count = n()) %>% # now the `handed` grouping has been removed, therefore we can calculate percent now: mutate(n_col = sum(c_count), c_percent = c_count / n_col) c_df ``` Finally, we `left_join()` the two levels of summary to get a data.frame containing the full set of values which make up the body of our table (note, however, they are not in the same order): ```{r} full_dplyr <- left_join(c_h_df, c_df) %>% ungroup() ``` Alternatively, we could calculate only the counts in `c_h_df`, and use `mutate()` after the `left_join()` to divide the counts by the `n_col` values which are more naturally calculated within `c_df`. This would simplify `c_h_df`'s creation somewhat by not requiring the explicit `ungroup()`, but it prevents each level of summarization from being a self-contained set of computations. The `rtables` call in contrast is: ```{r} lyt <- basic_table(show_colcounts = TRUE) %>% split_cols_by("arm") %>% split_cols_by("gender") %>% split_rows_by("country") %>% summarize_row_groups() %>% split_rows_by("handed") %>% summarize_row_groups() %>% analyze("age", afun = mean, format = "xx.x") tbl <- build_table(lyt, df) tbl ``` We can now spot check that the values are the same ```{r} frm_rtables_h <- cell_values( tbl, rowpath = c("country", "CAN", "handed", "Right", "@content"), colpath = c("arm", "Arm B", "gender", "Female") )[[1]] frm_rtables_h frm_dplyr_h <- full_dplyr %>% filter(country == "CAN" & handed == "Right" & arm == "Arm B" & gender == "Female") %>% select(c_h_count, c_h_percent) frm_dplyr_h frm_rtables_c <- cell_values( tbl, rowpath = c("country", "CAN", "@content"), colpath = c("arm", "Arm A", "gender", "Male") )[[1]] frm_rtables_c frm_dplyr_c <- full_dplyr %>% filter(country == "CAN" & arm == "Arm A" & gender == "Male") %>% select(c_count, c_percent) frm_dplyr_c ``` ```{r echo = FALSE, result="hidden"} stopifnot(isTRUE(all.equal(frm_rtables_h, unname(unlist(frm_dplyr_h))))) stopifnot(isTRUE(all.equal(frm_rtables_c, unname(unlist(frm_dplyr_c[1, ]))))) ``` Further, the `rtable` syntax has hopefully also become a bit more straightforward to derive the cell values than with `dplyr` for this particular table. ## Summary In this vignette learned that: * many tables are quite easily created with `dplyr` and `data.frame` or `tibble` as data structure * `dplyr` keeps simple things simple * if tables have group summaries then repeating of information is required * `rtables` streamlines the construction of complex tables We recommend that you continue reading the [`clinical_trials`](https://insightsengineering.github.io/rtables/latest-tag/articles/clinical_trials.html) vignette where we create a number of more advanced tables using layouts.