---
title: "Totals, Missings, and Denominators"
output:
  rmarkdown::html_vignette:
    toc: true
vignette: >
  %\VignetteIndexEntry{denom}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

```{r, include = FALSE, setup}
library(Tplyr)
library(dplyr, warn.conflicts = FALSE)
library(knitr)
```

Counting is pretty easy, right? There's not all that much to it. With a few considerations we can cover most of the scenarios that users will encounter while using Tplyr. Denominators, on the other hand, get *_a lot_* more complicated. Why? Because there are a lot of ways to do it. What values do we exclude from the denominator? What variables establish denominator grouping? Does the denominator use a different filter than the values being counted? If you've programmed enough of these tables, you know that it's all very situational. 

Given the complexity, we've spent a lot of time trying to make this easy for you, the user. In doing this it became very clear that denominators needed their own vignette. Additionally, a few other things go hand in hand with denominators. 

Make sure you have a good understand of count and shift layers before you review this content. If you've done your due diligence, then - here we go.

## Population Data in the Denominator

What do you do when your target dataset doesn't _have_ the information necessary to create your denominator? For example - when you create an adverse event table, the adverse event dataset likely only contains records for subjects who experienced an adverse event. But subjects who did _not_ have an adverse event are still part of the study population and must be considered in the denominator. 

For this reason,**Tplyr** allows lets you set a separate population dataset - but there are a couple things you need to do to trigger **Tplyr** to use the population data as your denominator. 

Consider these two examples.

```{r}
tplyr_table(tplyr_adae, TRTA) %>% 
  add_layer(
    group_count(AEDECOD) %>% 
      set_distinct_by(USUBJID) %>% 
      set_format_strings(f_str('xx (xx.x%)', distinct_n, distinct_pct))
  ) %>% 
  build() %>% 
  head() %>% 
  kable()
```

```{r}
tplyr_table(tplyr_adae, TRTA) %>% 
  set_pop_data(tplyr_adsl) %>%
  set_pop_treat_var(TRT01A) %>%
  add_layer(
    group_count(AEDECOD) %>% 
      set_distinct_by(USUBJID) %>% 
      set_format_strings(f_str('xx (xx.x%)', distinct_n, distinct_pct))
  ) %>% 
  build() %>% 
  head() %>% 
  kable()
```

There are three things done above that use the population data:

- Population data must be set in the first place using `set_pop_where()` and the population treatment variable must be specified using `set_pop_treat_var()`
- `set_distinct_by()` must be used. Why? Because when you need to use a separate population dataset, the target dataset is likely more than one record per subject. So `set_distinct_by()` in this scenario would be used to get distinct counts per subject. In this example, we're looking at the number of unique subjects who experienced a specific adverse event.
- The population denominator is then used when calculating `distinct_pct`. It is also worth noting that the default count layer formats use distinct values.
  
Note that if you need more control over the values used in the denominator from the population data, you can set a separate filter on the population data used `set_pop_where()`. 

Fortunately, denominators are much simpler when they're kept within a single dataset. Just kidding! Let's get weird.
  
## Denominator Grouping

When you're looking within a single dataset, there are a couple factors that you need to consider for a denominator. The first is which grouping variables create those denominators. Let's look at this from two perspectives - count layers and shift layers.

### Count layers

Most of the complexity of denominators comes from nuanced situations. A solid 80% of the time, defaults will work. For example, in a frequency table, you will typically want data within a column to sum to 100%. For example:

```{r}
tplyr_adsl <- tplyr_adsl %>% 
  mutate(DCSREAS = ifelse(DCSREAS == '', 'Completed', DCSREAS))
         
tplyr_table(tplyr_adsl, TRT01P) %>% 
  add_layer(
    group_count(DCSREAS)
  ) %>% 
  build() %>% 
  kable()
```
By default, when not using the population data strategy shown above, a count layer assumes that you want columns to sum to 100%. But that's not always the case. Perhaps you'd like to break this summary down by sex presented row-wise.

```{r}
tplyr_table(tplyr_adsl, TRT01P) %>% 
  add_layer(
    group_count(DCSREAS, by=SEX)
  ) %>% 
  build() %>% 
  kable()
```

Ok - so, now this is a little bit off. By breaking sex down as a row group, the denominators are still the total treatment group. Does that make sense? 34 female Placebo group subjects completed, but that calculated 39.5% also includes male subjects in the denominator. Let's fix this using `set_denoms_by()`.

```{r}
tplyr_table(tplyr_adsl, TRT01P) %>% 
  add_layer(
    group_count(DCSREAS, by=SEX) %>% 
      set_denoms_by(SEX, TRT01P)
  ) %>% 
  build() %>% 
  kable()
```

Ok - much better. `set_denoms_by()` now changed the denominator grouping for us, so the denominator used for those 34 female Placebo subjects are now the total number of female Placebo subjects. Makes sense, right? `set_denoms_by()` allows you to specify:

- The treatment variable
- any `cols` variables specified at the table level
- any `by` variables specified at the layer level

Depending on your presentation, what you require may change - but the flexibility is there to choose what you need.

### Shift layers

A major part of the shift API is the control of the denominators used in the calculation of the percentages. In shift tables, most percentages are relative to the "box" that is formed from the "from" and "to" groups of the shift for each treatment group. Just like the count layers, the `set_denoms_by()` functions any variable name from the treatment variable, `cols` argument, `by` variables. The difference with shift layers is that now you can also include your target variables used for the row or column.

```{r} 
tplyr_table(tplyr_adlb, TRTA, where=PARAMCD == "CK") %>%
  add_layer(
    group_shift(vars(row = BNRIND, column = ANRIND), by = vars(PARAM, AVISIT)) %>%
      set_format_strings(f_str("xx (xxx.x%)", n, pct)) %>%
      # This is the default, the 3x3 box formed by the target variables
      set_denoms_by(TRTA, PARAM, AVISIT) 
  ) %>%
  build() %>%
  kable()
```

In the example above, the denominators were based on the by and treatment variables, `TRTA`, `PARAM` and `VISIT`. This creates a 3 X 3 box, where the denominator is the total of all record within the **FROM** and **TO** shift variables, within each parameter, visit, and treatment. This is the default, and this is how **Tplyr** will create the denominators if `set_denoms_by()` isn't specified.

In the next example, the percentage denominators are calculated row-wise, each row percentage sums to 100%. 

```{r}
tplyr_table(tplyr_adlb, TRTA, where=PARAMCD == "CK") %>%
  add_layer(
    group_shift(vars(row = BNRIND, column = ANRIND), by = vars(PARAM, AVISIT)) %>%
      set_format_strings(f_str("xx (xxx.x%)", n, pct)) %>%
      set_denoms_by(TRTA, PARAM, AVISIT, BNRIND) # Each row made by TRTA, BNRIND
  ) %>%
  build() %>%
  arrange(ord_layer_1, ord_layer_2, ord_layer_3) %>% 
  head() %>% 
  kable()
```

While not practical, in this last example the denominators are changed to be based on the entire column instead of the 3 x 3 box. By passing the column variables, `TRTA` and `ANRIND` the layer will use those denominators when determining the percentages. 

```{r}
tplyr_table(tplyr_adlb, TRTA, where = PARAMCD == "CK") %>%
  add_layer(
    group_shift(vars(row = BNRIND, column = ANRIND), by = vars(PARAM, AVISIT)) %>%
      set_format_strings(f_str("xx (xx.xx%)", n, pct)) %>%
      set_denoms_by(TRTA, ANRIND) # Use the column total as the denominator
  ) %>%
  build() %>%
  arrange(ord_layer_1, ord_layer_2, ord_layer_3) %>% 
  head() %>%
  kable()
```

Our hope is that this gives you the flexibility you need to structure your denominator however required. 

## Controlling the Denominator Filter

There are some circumstances that you'll encounter where the filter used for a denominator needs to be different than the filter used to count. Disposition tables are an example of this, and we'll use that example to paint this picture.

**Tplyr**offers you the ability to specifically control the filter used within the denominator. This is provided through the function `set_denom_where()`. The default for `set_denom_where()` is the layer level `where` parameter, if one was supplied. `set_denom_where()` allows you to replace this layer level filter with a custom filter of your choosing. This is done on top of any filtering specified in the `tplyr_table()` where parameter - which means that the `set_denom_where()` filter is applied _in addition to_ any table level filtering. 

Yeah we know - there are a lot of different places that filtering can happen...

So let's take the example shown below. The first layer has no layer level filtering applied, so the table level `where` is the only filter applied. The second layer has a layer level filter applied, so the denominators will be based on that layer level filter. Notice how in this case, the percentages in the second layer add up to 100%. This is because the denominator only includes values used in that layer. 

The third layer has a layer level filter applied, but additionally uses `set_denom_where()`. The `set_denom_where()` in this example is actually *removing* the layer level filter for the denominators. This is because in R, when you filter using `TRUE`, the filter returns all records. So by using `TRUE` in `set_denom_where()`, the layer level filter is effectively removed. This causes the denominator to include all values available from the table and not just those selected for that layer - so for this layer, the percentages will *not add up to 100%*. This is important - this allows the percentages from Layer 3 to sum to the total percentage of "DISCONTINUED" from Layer 1. 

```{r}
tplyr_adsl2 <- tplyr_adsl %>% 
  mutate(DISCONTEXT = if_else(DISCONFL == 'Y', 'DISCONTINUED', 'COMPLETED'))

t <- tplyr_table(tplyr_adsl2, TRT01P, where = SAFFL == 'Y') %>%
  add_layer(
    group_count(DISCONTEXT)
  ) %>%
  add_layer(
    group_count(DCSREAS, where = DISCONFL == 'Y')
  ) %>%
  add_layer(
    group_count(DCSREAS, where = DISCONFL == 'Y') %>% 
    set_denom_where(TRUE)
  ) %>%
  build() %>%
  arrange(ord_layer_index, ord_layer_1) 

t %>% 
  kable()
```

## Missing Counts

Missing counts are a tricky area for frequency tables, and they play directly in with denominators as well. These values raise a number of questions. For example, do you want to format the missing counts the same way as the event counts? Do you want to present missing counts with percentages? Do missing counts belong in the denominator? 

The `set_missing_count()` function can take a new `f_str()` object to set the display of missing values. If not specified, the associated count layer's format will be used. Using the `...` parameter, you are able to specify the row label desired for missing values and values that you determine to be considered 'missing'. For example, you may have NA values in the target variable, and then values like "Not Collected" that you also wish to consider "missing". `set_missing_count()` allows you to group those together. Actually - you're able to establish as many different "missing" groups as you want - even though that scenario is fairly unlikely.

In the example below 50 random values are removed and NA is specified as the missing string. This leads us to another parameter - `denom_ignore`. By default, if you specify missing values they will still be considered within the denominator, but when you have missing counts, you may wish to exclude them from the totals being summarized. By setting `denom_ignore` to TRUE, your denominators will ignore any groups of missing values that you've specified. 

```{r}
tplyr_adae2 <- tplyr_adae
tplyr_adae2[sample(nrow(tplyr_adae2), 50), "AESEV"] <- NA

t <- tplyr_table(tplyr_adae2, TRTA) %>%
  add_layer(
    group_count(AESEV) %>%
      set_format_strings(f_str("xxx (xx.xx%)", n, pct)) %>%
      set_missing_count(f_str("xxx", n), sort_value=Inf, denom_ignore=TRUE, Missing = NA)
  ) %>%
  build() %>% 
  arrange(ord_layer_1)

t %>% 
  kable()
```

We did one more other thing worth explaining in the example above - gave the missing count its own sort value. If you leave this field null, it will simply be the maximum value in the order layer plus 1, to put the Missing counts at the bottom during an ascending sort. But tables can be sorted a lot of different ways, as you'll see in the sort vignette. So instead of trying to come up with novel ways for you to control where the missing row goes - we decided to just let you specify your own value.

## Missing Subjects

Missing counts and counting missing subjects work two different ways within Tplyr. Missing counts, as described above, will examine the records present in the data and collect and missing values. But for these results to be counted, they need to first be provided within the input data itself. On the other hand, missing subjects are calculated by looking at the difference between the potential number of subjects within the column (i.e. the combination of the treatment variables and column variables) and the number of subjects actually present. Consider this example:

```{r missing_subs1}
  missing_subs <- tplyr_table(tplyr_adae, TRTA) %>%
    set_pop_data(tplyr_adsl) %>%
    set_pop_treat_var(TRT01A) %>%
    add_layer(
      group_count(vars(AEBODSYS, AEDECOD)) %>%
        set_nest_count(TRUE) %>% 
        set_distinct_by(USUBJID) %>%
        add_missing_subjects_row(f_str("xx (XX.x%)", distinct_n, distinct_pct), sort_value = Inf) %>% 
        set_missing_subjects_row_label("Missing Subjects")
    ) %>%
    build()

  tail(missing_subs) %>% 
    select(-starts_with('ord')) %>% 
    kable()
```

In the example above, we produce a nested count layer. The function `add_missing_subjects_row()` triggers the addition of the new result row for which the missing subjects are calculated. The row label applied for this can be configured using `set_missing_subjects_row_label()`, and the row label itself will default to 'Missing'. Depending on your sorting needs, a `sort_value` can be applied to whatever numeric value you provide. Lastly, you can provide an `f_str()` to format the missing subjects row separately from the rest of the layer, but whatever format is applied to the layer will apply otherwise.

Note that in nested count layers, missing subject rows will generate for each independent group within the outer layer. Outer layers cannot have missing subject rows calculated individually. This would best be done in an independent layer itself, as the result would apply to the whole input target dataset. 

## Adding a 'Total' Row

In addition to missing counts, some summaries require the addition of a 'Total' row. **Tplyr** has the helper function `add_total_row()` to ease this process for you. Like most other things within **Tplyr** - particularly in this vignette - this too has a significant bit of nuance to it. 

Much of this functionality is similar to `set_missing_count()`. You're able to specify a different format for the total, but if not specified, the associated count layer's format will be used. You're able to set your own sort value to specify where you want the total row to sit. 

More nuance comes in two places:

- By default, `add_total_row()` *will count missing values*, but you can exclude those values using the `count_missings` parameter. **Tplyr** will warn you when `set_count_missing()` has `denom_ignore` set to `TRUE`, `add_total_row()` has `count_missings` set to `TRUE` and the format contains a percentage. Why? Because if the denominator is ignoring missing values but you're still counting them in your total, the percentage shown can exceed 100%. 
- `add_total_row()` will throw a warning when a `by` variable is used, because it becomes ambiguous what total should be calculated. You can rectify this by using `set_denoms_by()`, which allows the user to control exactly which groups are used to form the denominator. This way the totals presented by `add_total_row()` will align with denominators specified in `set_denom_by()` and generate total rows that match the grouping of your denominator values. 

In the example below, we summarize age groups by sex. The denominators are determined by treatment group and sex, and since we are not excluding any values from the denominator, the total row ends up matching the denominator that was used. The 'Missing' row tells us the number of missing values, but because `count_missings` is set to `TRUE`, the missing counts are included in the total row. This probably isn't how you would choose to display things, but here we're trying to show the flexibility built into **Tplyr**.

```{r}
tplyr_adsl2 <- tplyr_adsl
tplyr_adsl2[sample(nrow(tplyr_adsl2), 50), "AGEGR1"] <- NA

tplyr_table(tplyr_adsl2, TRT01P) %>% 
  add_layer(
    group_count(AGEGR1, by=SEX) %>% 
      set_denoms_by(TRT01P, SEX) %>%  # This gives me a Total row each group
      add_total_row(f_str("xxx", n), count_missings=TRUE, sort_value=-Inf) %>% 
      set_total_row_label("All Age Groups") %>% 
      set_missing_count(f_str("xx (xx.x%)", n, pct), Missing = NA, sort_value=Inf)
  ) %>% 
  build() %>% 
  arrange(ord_layer_1, ord_layer_2) %>% 
  kable()
```
The default text for the Total row is "Total", but we provide `set_total_row_label()` to allow you to customize the text used in your display. 

Let's look at a more practical version of the table above. If you display missings, you probably want to exclude them from the total. Here we do that using `set_missing_count()`.  So more commonly, you'll see this:

```{r}
tplyr_table(tplyr_adsl2, TRT01P) %>% 
  add_layer(
    group_count(AGEGR1, by=SEX) %>% 
      set_denoms_by(TRT01P, SEX) %>%  # This gives me a Total row each group
      add_total_row(f_str("xxx", n), count_missings=FALSE, sort_value=-Inf) %>% 
      set_total_row_label("All Age Groups") %>% 
      set_missing_count(f_str("xxx", n), Missing = NA, sort_value=Inf, denom_ignore=TRUE)
  ) %>% 
  build() %>% 
  arrange(ord_layer_1, ord_layer_2) %>% 
  kable()
```

Now the table is more intuitive. We used `set_missing_count()` to update our denominators, so missing have been excluded. Now, the total row intuitively matches the denominators used within each group, and we can see how many missing records were excluded.

_You may have stumbled upon this portion of the vignette while searching for how to create a total column. **Tplyr** allows you to do this as well with the function `add_total_group()` and read more in `vignette("table")`._ 

And that's it for denominators! Happy counting!