The freq
and nested_freq
functions are
called to produce categorical-type summary statistics (i.e. counts and
percentages) for character variables. These functions can be used to
create all different types of categorical summary statistics tables.
There are many uses for freq
, the first one we’ll show
is to create the first row of analysis set summary of population counts
in the output. In the following table the subset
argument
is used to create the one row summary with the row heading specified in
the rowtext
argument.
tbl <- cdisc_adsl %>%
freq(colvar = "TRT01PN",
rowvar = "ITTFL",
statlist = statlist("n"),
subset = ITTFL == "Y",
rowtext = "Analysis set: ITT")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level |
---|---|---|---|---|---|
Analysis set: ITT | 5 | 5 | 5 | HEADER | 0 |
A typical call of using the freq
function is to create
the tbl
chunk with:
the row of big N indicating number of subjects having values in the character variable for each treatment group
summary of counts and percentages for each category in the character variable such as age groups, gender, race, ethnicity, etc.
tbl <- cdisc_adsl %>%
freq(colvar = "TRT01PN",
rowvar = "SEX",
statlist = statlist(c("N","n (x.x%)")),
row_header = "Sex")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level |
---|---|---|---|---|---|
Sex | HEADER | 0 | |||
N | 5 | 5 | 5 | N | 0 |
F | 2 (40.0%) | 1 (20.0%) | 3 (60.0%) | VALUE | 0 |
M | 3 (60.0%) | 4 (80.0%) | 2 (40.0%) | VALUE | 0 |
In the above table, F
appears first and then
M
is shown. This is because SEX
is a character
variable and alphabetical sorting is applied to the summarized results.
We can convert SEX
to a factor variable with the customized
order and labels for enabling the user-defined sorting.
tbl <- cdisc_adsl %>%
mutate(SEX = factor(SEX, levels = c("M", "F"), labels = c("Male", "Female"))) %>%
freq(colvar = "TRT01PN",
rowvar = "SEX",
statlist = statlist(c("N","n (x.x%)")),
row_header = "Sex")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level |
---|---|---|---|---|---|
Sex | HEADER | 0 | |||
N | 5 | 5 | 5 | N | 0 |
Male | 3 (60.0%) | 4 (80.0%) | 2 (40.0%) | VALUE | 0 |
Female | 2 (40.0%) | 1 (20.0%) | 3 (60.0%) | VALUE | 0 |
In some outputs, the table mock-up may require showing the percentage
denominator, which can be done by specifying
statlist = statlist("n/N (x.x%)")
.
tbl <- cdisc_adsl %>%
mutate(SEX = factor(SEX, levels = c("M", "F"), labels = c("Male", "Female"))) %>%
freq(colvar = "TRT01PN",
rowvar = "SEX",
statlist = statlist("n/N (x.x%)"),
row_header = "Sex")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level |
---|---|---|---|---|---|
Sex | HEADER | 0 | |||
Male | 3/5 (60.0%) | 4/5 (80.0%) | 2/5 (40.0%) | VALUE | 0 |
Female | 2/5 (40.0%) | 1/5 (20.0%) | 3/5 (60.0%) | VALUE | 0 |
By-processing splits the summary statistics by another character
variable that can be specified in the argument of rowbyvar
or tablebyvar
. For the example shown below, age group
categories are summarized by the SEX
variable. In this
scenario, the denominator should also be split by SEX
in
addition to TRT01PN
, which can be done by
statlist(c("N","n (x.x%)"), denoms_by = c("SEX", "TRT01PN"))
.
tbl <- cdisc_adsl %>%
mutate(SEX = factor(SEX, levels = c("M", "F"), labels = c("Male", "Female"))) %>%
freq(colvar = "TRT01PN",
rowbyvar = "SEX",
rowvar = "AGEGR1",
statlist = statlist(c("N","n (x.x%)"), denoms_by = c("SEX", "TRT01PN")),
row_header = "Age group")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level | SEX |
---|---|---|---|---|---|---|
Male | BY_HEADER1 | 0 | Male | |||
Age group | HEADER | 0 | Male | |||
N | 3 | 4 | 2 | N | 0 | Male |
65-80 | 0 | 3 (75.0%) | 1 (50.0%) | VALUE | 0 | Male |
<65 | 2 (66.7%) | 0 | 1 (50.0%) | VALUE | 0 | Male |
>80 | 1 (33.3%) | 1 (25.0%) | 0 | VALUE | 0 | Male |
Female | BY_HEADER1 | 0 | Female | |||
Age group | HEADER | 0 | Female | |||
N | 2 | 1 | 3 | N | 0 | Female |
65-80 | 0 | 0 | 2 (66.7%) | VALUE | 0 | Female |
<65 | 1 (50.0%) | 0 | 0 | VALUE | 0 | Female |
>80 | 1 (50.0%) | 1 (100.0%) | 1 (33.3%) | VALUE | 0 | Female |
By default, the denominators are calculated by using
colvar
, tablebyvar
, and rowbyvar
.
The above freq
function call will also produce the same
results without specifying the denoms_by
argument inside
the statlist
function.
When using rowbyvar
to create by-processing summaries,
some levels of the rowvar
may have zero records as shown in
the example below.
tbl <- cdisc_adsl %>%
freq(colvar = "TRT01PN",
rowbyvar = "ETHNIC",
rowvar = "RACE",
statlist = statlist(c("N","n (x.x%)")),
row_header = "Race")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level | ETHNIC |
---|---|---|---|---|---|---|
HISPANIC OR LATINO | BY_HEADER1 | 0 | HISPANIC OR LATINO | |||
Race | HEADER | 0 | HISPANIC OR LATINO | |||
N | 2 | 0 | 0 | N | 0 | HISPANIC OR LATINO |
WHITE | 2 (100.0%) | - | - | VALUE | 0 | HISPANIC OR LATINO |
NOT HISPANIC OR LATINO | BY_HEADER1 | 0 | NOT HISPANIC OR LATINO | |||
Race | HEADER | 0 | NOT HISPANIC OR LATINO | |||
N | 3 | 5 | 5 | N | 0 | NOT HISPANIC OR LATINO |
WHITE | 3 (100.0%) | 5 (100.0%) | 5 (100.0%) | VALUE | 0 | NOT HISPANIC OR LATINO |
To remove the zero record rows and create the data driven summary,
users can specify the pad = FALSE
in the freq
function.
tbl <- cdisc_adsl %>%
freq(colvar = "TRT01PN",
rowbyvar = "ETHNIC",
rowvar = "RACE",
statlist = statlist(c("N","n (x.x%)")),
row_header = "Race",
pad = FALSE)
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | group_level | ETHNIC |
---|---|---|---|---|---|---|
HISPANIC OR LATINO | BY_HEADER1 | 0 | HISPANIC OR LATINO | |||
Race | HEADER | 0 | HISPANIC OR LATINO | |||
N | 2 | 0 | 0 | N | 0 | HISPANIC OR LATINO |
WHITE | 2 (100.0%) | - | - | VALUE | 0 | HISPANIC OR LATINO |
NOT HISPANIC OR LATINO | BY_HEADER1 | 0 | NOT HISPANIC OR LATINO | |||
Race | HEADER | 0 | NOT HISPANIC OR LATINO | |||
N | 3 | 5 | 5 | N | 0 | NOT HISPANIC OR LATINO |
WHITE | 3 (100.0%) | 5 (100.0%) | 5 (100.0%) | VALUE | 0 | NOT HISPANIC OR LATINO |
When using frequency analysis you do not need to always use your main dataframe to calculate your denominators. For example, an adverse event table may use ADSL as the denominator dataframe instead of ADAE even though the counts are coming from ADAE. The example below shows a table counting AEDECOD but using ADSL to calculate the denominators.
adae <- cdisc_adae %>%
rename(TRT01AN = TRTAN)
tbl <- adae %>%
freq(denom_df = cdisc_adsl,
colvar = "TRT01AN",
rowvar = "AEDECOD",
descending_by = "81")
knitr::kable(head(tbl, 10))
label | 0 | 54 | 81 | row_type | group_level |
---|---|---|---|---|---|
APPLICATION SITE PRURITUS | 1 (20.0) | 2 (40.0) | 5 (100.0) | VALUE | 0 |
APPLICATION SITE ERYTHEMA | 1 (20.0) | 1 (20.0) | 4 (80.0) | VALUE | 0 |
FATIGUE | 0 | 1 (20.0) | 2 (40.0) | VALUE | 0 |
ACTINIC KERATOSIS | 0 | 0 | 1 (20.0) | VALUE | 0 |
APPLICATION SITE IRRITATION | 0 | 1 (20.0) | 1 (20.0) | VALUE | 0 |
APPLICATION SITE PAIN | 0 | 0 | 1 (20.0) | VALUE | 0 |
APPLICATION SITE VESICLES | 0 | 1 (20.0) | 1 (20.0) | VALUE | 0 |
CALCULUS URETHRAL | 0 | 0 | 1 (20.0) | VALUE | 0 |
DEPRESSED MOOD | 0 | 0 | 1 (20.0) | VALUE | 0 |
DYSPEPSIA | 0 | 0 | 1 (20.0) | VALUE | 0 |
When using by variables such as tablebyvar
and
rowbyvar
the denominators may need to be changed to work
correctly. This also works in conjunction with the denom_df
argument. By default, the denominators are calculated by using
colvar
, tablebyvar
, and rowbyvar
.
This works well if you are using the same dataframe for counts and
denominators but this is not always the case. In the following example
we are doing a similar table as above but using AESEV as a
rowbyvar
. In this example we don’t want our denoms by AESEV
since that variable is not in ADSL, which is where our denominators are
coming from. To change the variables by which our denoms are calculated
by we use the denoms_by
argument to the
statlist
function. Below you can see that we are using only
our colvar
of TRT01PN as our denoms_by
.
tbl <- adae %>%
freq(denom_df = cdisc_adsl,
colvar = "TRT01AN",
rowvar = "AEDECOD",
rowbyvar = "AESEV",
statlist = statlist(c("n (x.x)"), denoms_by = "TRT01AN"))
knitr::kable(head(tbl, 10))
label | 0 | 54 | 81 | row_type | group_level | AESEV |
---|---|---|---|---|---|---|
MILD | BY_HEADER1 | 0 | MILD | |||
ACTINIC KERATOSIS | 0 | 0 | 1 (20.0) | VALUE | 0 | MILD |
APPLICATION SITE DERMATITIS | 0 | 0 | 0 | VALUE | 0 | MILD |
APPLICATION SITE ERYTHEMA | 1 (20.0) | 1 (20.0) | 4 (80.0) | VALUE | 0 | MILD |
APPLICATION SITE IRRITATION | 0 | 1 (20.0) | 1 (20.0) | VALUE | 0 | MILD |
APPLICATION SITE PAIN | 0 | 0 | 0 | VALUE | 0 | MILD |
APPLICATION SITE PRURITUS | 1 (20.0) | 2 (40.0) | 5 (100.0) | VALUE | 0 | MILD |
APPLICATION SITE URTICARIA | 0 | 1 (20.0) | 0 | VALUE | 0 | MILD |
APPLICATION SITE VESICLES | 0 | 1 (20.0) | 0 | VALUE | 0 | MILD |
ARTHRALGIA | 0 | 0 | 0 | VALUE | 0 | MILD |
We have shown the common use cases of calling different variants of
the statlist for frequency summaries. The table below describes all
available options to be specified for the statlist
in
freq
.
Statlist | Description |
---|---|
n | count |
n (x.x) | count (percentage without %) |
n (x.x%) | count (percentage with %) |
n/N | count/denominator |
n/N (x.x) | count/denominator (percentage without %) |
n/N (x.x%) | count/denominator (percentage with %) |
To learn more about using the statlist
function for freq
analysis, please type ?statlist
in your console.
A major portion of Adverse Events (AE) summary tables require
summarizing number of subjects with treatment-emergent adverse events by
system organ class and preferred term, which is in a nested structure
and needs additional processing on top of the freq
function. Therefore, we developed the nested_freq
function
to address the nested structure (counts within counts):
rowvar
: we can specify nested levels separated by
*
(e.g. AEBODSYS*AEDECOD
); this can be
expanded to three levels
descending_by
: the name of the column for sorting in
descending frequency order
cutoff_stat
: the value to cutoff by, n (count) or
pct (percentage); default = ‘pct’
cutoff
: numeric value of the percentage/count
threshold in any treatment group for cutting the data to be presented;
for example, cutoff = 1.0
means to only keep the preferred
term rows with percentages >= 1% when cutoff_stat = ‘pct’.
In the example below, we will show you how to use these arguments in
the nested_freq
function call for creating the AE summary
table by AEBODSYS and AEDECOD.
adae <- cdisc_adae %>%
filter(SAFFL == "Y", TRTEMFL == "Y") %>%
filter(AEBODSYS %in% c("GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS","SKIN AND SUBCUTANEOUS TISSUE DISORDERS")) %>%
rename(TRT01AN = TRTAN)
adsl <- cdisc_adsl %>%
filter(SAFFL == "Y")
For illustration purpose, we subset the adae data to only contain
records in the 2 categories of system organ class: GENERAL DISORDERS AND
ADMINISTRATION SITE CONDITIONS, SKIN AND SUBCUTANEOUS TISSUE DISORDERS.
So the table output is not too long and will be easier to visualize. In
addition, we would like to sort the output by the active drug group
(TRT01AN = 81) with descending frequency. Therefore, we specify
descending_by = "81"
.
tbl <- nested_freq(adae,
denom_df = adsl,
colvar = "TRT01AN",
rowvar = "AEBODSYS*AEDECOD",
statlist = statlist("n (x.x%)"),
descending_by = "81",
row_header = "System organ class \\\n Preferred term")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | nested_level | group_level | AEBODSYS |
---|---|---|---|---|---|---|---|
System organ class Preferred term |
HEADER | NA | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | |||
GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | 2 (40.0%) | 3 (60.0%) | 5 (100.0%) | NESTED | 0 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE PRURITUS | 1 (20.0%) | 2 (40.0%) | 5 (100.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE ERYTHEMA | 1 (20.0%) | 1 (20.0%) | 4 (80.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
FATIGUE | 0 | 1 (20.0%) | 2 (40.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE IRRITATION | 0 | 1 (20.0%) | 1 (20.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE PAIN | 0 | 0 | 1 (20.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE VESICLES | 0 | 1 (20.0%) | 1 (20.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE DERMATITIS | 0 | 1 (20.0%) | 0 | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE URTICARIA | 0 | 1 (20.0%) | 0 | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
PYREXIA | 1 (20.0%) | 0 | 0 | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | 2 (40.0%) | 2 (40.0%) | 1 (20.0%) | NESTED | 0 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
ACTINIC KERATOSIS | 0 | 0 | 1 (20.0%) | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
ERYTHEMA | 1 (20.0%) | 2 (40.0%) | 0 | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
PRURITUS | 1 (20.0%) | 1 (20.0%) | 0 | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
PRURITUS GENERALISED | 0 | 1 (20.0%) | 0 | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
URTICARIA | 0 | 1 (20.0%) | 0 | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
As shown in the output above, the most frequent system organ class is GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS, followed by SKIN AND SUBCUTANEOUS TISSUE DISORDERS. Within each system organ class, the preferred terms are also sorted by descending frequency. When there is a tie in the counts, the preferred terms are sorted alphabetically.
The cutoff feature is controlled by two arguments:
cutoff
and cutoff_stat
. If we want to remove
the rows of preferred terms with only 1 count in any treatment columns,
we can specify cutoff = 2
and
cutoff_stat = "n"
in the nested_freq call below.
tbl <- nested_freq(adae,
denom_df = adsl,
colvar = "TRT01AN",
rowvar = "AEBODSYS*AEDECOD",
statlist = statlist("n (x.x%)"),
descending_by = "81",
cutoff = 2,
cutoff_stat = "n",
row_header = "System organ class \\\n Preferred term")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | nested_level | group_level | AEBODSYS |
---|---|---|---|---|---|---|---|
System organ class Preferred term |
HEADER | NA | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | |||
GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | 2 (40.0%) | 3 (60.0%) | 5 (100.0%) | NESTED | 0 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE PRURITUS | 1 (20.0%) | 2 (40.0%) | 5 (100.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE ERYTHEMA | 1 (20.0%) | 1 (20.0%) | 4 (80.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
FATIGUE | 0 | 1 (20.0%) | 2 (40.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
SKIN AND SUBCUTANEOUS TISSUE DISORDERS | 2 (40.0%) | 2 (40.0%) | 1 (20.0%) | NESTED | 0 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
ERYTHEMA | 1 (20.0%) | 2 (40.0%) | 0 | NESTED | 1 | 0 | SKIN AND SUBCUTANEOUS TISSUE DISORDERS |
The same cutoff results can also be achieved by specifying the cutoff
percentage: cutoff = 25
and
cutoff_stat = "pct"
. In our example, we only have 5
subjects in each arm and one subject count is equal to 20%, and so we
need more than 20% as the cutoff.
For only keeping the preferred terms with at least 2 counts in the
active arm of TRT01AN = 81
(i.e. not considering the other
arms), we can specify cutoff = "81 >= 2" and
cutoff_stat
= “n”`.
tbl <- nested_freq(adae,
denom_df = adsl,
colvar = "TRT01AN",
rowvar = "AEBODSYS*AEDECOD",
statlist = statlist("n (x.x%)"),
descending_by = "81",
cutoff = "81 >= 2",
cutoff_stat = "n",
row_header = "System organ class \\\n Preferred term")
knitr::kable(tbl)
label | 0 | 54 | 81 | row_type | nested_level | group_level | AEBODSYS |
---|---|---|---|---|---|---|---|
System organ class Preferred term |
HEADER | NA | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | |||
GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS | 2 (40.0%) | 3 (60.0%) | 5 (100.0%) | NESTED | 0 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE PRURITUS | 1 (20.0%) | 2 (40.0%) | 5 (100.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
APPLICATION SITE ERYTHEMA | 1 (20.0%) | 1 (20.0%) | 4 (80.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |
FATIGUE | 0 | 1 (20.0%) | 2 (40.0%) | NESTED | 1 | 0 | GENERAL DISORDERS AND ADMINISTRATION SITE CONDITIONS |