---
title: "Data Expectations"
author: "NEST CoreDev Team"
date: "10.11.2022"
output:
  rmarkdown::html_document:
    theme: "spacelab"
    highlight: "kate"
    toc: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{Data Expectations}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---
## Introduction
`teal.goshawk` expects to be provided `ADSL` and an accompanying `ADLB` clinical trials data set in `ADaM` format. For more information
on `ADaM`, please see the [`ADaM standards`](https://www.cdisc.org/standards/foundational/adam).

The package provides ready-to-use `teal` modules you can embed in your `teal` application. The modules
generate highly customizable plots and outputs often used in exploratory data analysis, e.g.:

- box plots - `tm_g_gh_boxplot()`
- correlation and scatter plots - `tm_g_gh_correlationplot()` and `tm_g_gh_scatterplot()`
- density distribution plots - `tm_g_gh_density_distribution_plot()`
- line plots - `tm_g_gh_lineplot()`
- spaghetti plots - `tm_g_gh_spaghettiplot()`

See [`package functions / modules`](https://insightsengineering.github.io/teal.goshawk/reference/index.html).

## Data Expectations

#### `ADSL` 
This is a subject-level data set with one record per subject. It includes any variables that are intended to be used for plot splitting.

- For example, to split the line plot by an outcome variable, that outcome variable must be in `ADSL`.
  - e.g. `ABCWK24`, which represents a two-level outcome variable with values of `"Y"` and `"N"` at Week 24.
<br><br>

#### `ADLB`
This is a Basic Data Structure (`BDS`) data set, meaning it has multiple records per subject per assay (`PARAM`) across unique time points. Additional variables that are intended to be used for plot splitting should be joined to `ADLB`.

  - See the `ADSL` example above, where `ABCWK24` would need to be joined to `ADLB`.
<br><br>
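
As a minimal sketch of such a join, assuming `dplyr` is available and that `STUDYID` and `USUBJID` are the join keys (the toy data below are invented for illustration):

```r
library(dplyr)

# Hypothetical subject-level data; ABCWK24 is the variable used for plot splitting.
ADSL <- tibble(
  STUDYID = "STUDY1",
  USUBJID = c("S1", "S2"),
  ABCWK24 = c("Y", "N")
)

# Hypothetical laboratory data with several records per subject.
ADLB <- tibble(
  STUDYID = "STUDY1",
  USUBJID = c("S1", "S1", "S2", "S2"),
  PARAMCD = "ALT",
  AVISIT = c("BASELINE", "WEEK 24", "BASELINE", "WEEK 24"),
  AVAL = c(20, 25, 31, 28)
)

# Join only the keys plus the splitting variable so that no other
# ADSL column ends up duplicated in ADLB.
ADLB <- ADLB %>%
  left_join(select(ADSL, STUDYID, USUBJID, ABCWK24), by = c("STUDYID", "USUBJID"))
```

Each `ADLB` record then carries the subject's `ABCWK24` value, so the number of `ADLB` rows is unchanged.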

### Other Basic Data Structures
Other `BDS` data sets could be provided to `teal.goshawk`, for example `ADQS`, which contains multiple records per subject per question (`PARAM`) across unique time points. However, any data set other than `ADLB` is likely to need workarounds.

  - For example, the concept of assay units, stored in `AVALU`, is not really relevant to a `BDS` like `ADQS`, which contains questionnaire data. Because `teal.goshawk` expects an `AVALU` variable and uses its values in the plot title and y-axis label, `AVALU` would need to be added to `ADQS` with some appropriate value, perhaps `"Q"`. If a value is not provided, the `"()"` portion of the title and y-axis label will be empty.
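
A minimal sketch of this workaround, assuming `dplyr` and an invented toy `ADQS`; the value `"Q"` is only a placeholder:

```r
library(dplyr)

# Hypothetical questionnaire data without a unit variable.
ADQS <- tibble(
  USUBJID = c("S1", "S2"),
  PARAMCD = "QS01",
  AVAL = c(3, 5)
)

# Add a constant AVALU so that the plot title and y-axis label have a unit to show.
ADQS <- ADQS %>%
  mutate(AVALU = "Q")
```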

## Required Variables
Several variables are required to realize the full functionality of `teal.goshawk`.
<br><br>

#### `TRTORD`
**Definition:** This variable orders treatment values in the legend

**Rationale:** Allows for congruent ordering as compared to other outputs being generated by a study team

**Alternative:** Variable is required
<br><br>
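
One way `TRTORD` might be derived is sketched below. It assumes `dplyr` and a treatment arm variable named `ARMCD`; the arm codes and the chosen order are hypothetical.

```r
library(dplyr)

# Hypothetical data with one record per treatment arm.
ADLB <- tibble(
  USUBJID = c("S1", "S2", "S3"),
  ARMCD = c("ARM A", "ARM B", "ARM C")
)

# Assign the legend order explicitly: ARM C first, then ARM B, then ARM A.
ADLB <- ADLB %>%
  mutate(
    TRTORD = case_when(
      ARMCD == "ARM C" ~ 1,
      ARMCD == "ARM B" ~ 2,
      ARMCD == "ARM A" ~ 3
    )
  )
```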

#### `AVISITCD`
**Definition:** This variable contains abbreviated values of `AVISIT` values

**Rationale:** Many `AVISIT` values are long and, in some cases, contain arguably superfluous information. Using these long values as x-axis tick labels takes up space that would otherwise be available for the plot. Thoughtful abbreviations convey chronology with no substantive loss of information and maximize the area available for the plot.

**Alternative:** If creating abbreviations is not helpful then simply set `AVISITCD <- AVISIT`
<br><br>

#### `AVISITCDN`
**Definition:** This variable contains the numeric portion of `AVISITCD` values

**Rationale:** Often `AVISITN` contains values that are not particularly helpful to reflect the proportional chronology of visits. Once `AVISITCD` is created then it is helpful to create the numeric values from `AVISITCD` values that can be seen as intuitively reflecting the visit chronology. For example: 0, 2, 4, 12, 24, 56, 84 etc. for weeks or 0, 14, 28, 84, 168, 392, 588 etc. for days. Using these, the longitudinal visualization x-axis will nicely reflect proportional distances between visits.

**Alternative:** If creating a more intuitive numeric chronology is not helpful then simply set `AVISITCDN <- AVISITN`
<br><br>
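
The following sketch shows one possible derivation of `AVISITCD` and `AVISITCDN`. It assumes `dplyr` and `AVISIT` values of the form `"SCREENING"`, `"BASELINE"` and `"WEEK n DAY m"`. The abbreviations and the value `-2` for Screening are illustrative choices, not requirements of the package.

```r
library(dplyr)

# Hypothetical visits; the AVISITN values do not reflect the time between visits.
ADLB <- tibble(
  AVISIT = c("SCREENING", "BASELINE", "WEEK 2 DAY 15", "WEEK 12 DAY 85"),
  AVISITN = c(-1, 0, 2, 5)
)

ADLB <- ADLB %>%
  mutate(
    # Abbreviate the visit label, e.g. "WEEK 12 DAY 85" becomes "W 12".
    AVISITCD = case_when(
      AVISIT == "SCREENING" ~ "SCR",
      AVISIT == "BASELINE" ~ "BL",
      grepl("^WEEK [0-9]+", AVISIT) ~ sub("^WEEK ([0-9]+).*$", "W \\1", AVISIT),
      TRUE ~ NA_character_
    ),
    # Numeric chronology in weeks; Screening is placed before Baseline.
    # case_when() evaluates every right-hand side for all rows, so the
    # coercion warnings for "SCR" and "BL" are suppressed.
    AVISITCDN = case_when(
      AVISITCD == "SCR" ~ -2,
      AVISITCD == "BL" ~ 0,
      grepl("^W [0-9]+$", AVISITCD) ~ suppressWarnings(as.numeric(sub("^W ", "", AVISITCD))),
      TRUE ~ NA_real_
    )
  )
```

With these values the x-axis places Week 12 six times as far from Baseline as Week 2, which `AVISITN` alone would not convey.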

#### `AVALU`
**Definition:** Analysis Value Unit

**Rationale:** Used in the plot title and y-axis labels. Please see the comments under "Other Basic Data Structures" above.

**Alternative:** Variable is required.
<br><br>

#### `LBSTRESC`
**Definition:** From `SDTM` Character Result/Finding in Std Format

**Rationale:** As a character type variable, this variable contains values that include those below or above limits of quantitation (`LOQ`). These might look like `"<2.1"` or `">20.7"`. In such cases `AVAL` is often missing. To still capture these values, `AVAL` is derived from `LBSTRESC` as follows, and the `LOQFL` variable should be set to `"Y"` to signify that the `AVAL` value for this record has been derived.

- For values below the limit of quantitation, `AVAL` is set to the numeric portion of `LBSTRESC` divided by 2.
- For values above the limit of quantitation, `AVAL` is set to the numeric portion of `LBSTRESC`.

**Alternative:** Variable is required
<br><br>

#### `LOQFL`
**Definition:** This is not an `ADaM` standard variable but represents the Limit of Quantitation Flag

**Rationale:** Set to `"Y"` when `LBSTRESC` value is used to populate `AVAL` and `LBSTRESC` value is either below a limit of quantitation for the assay or above the limit of quantitation for the assay. Derivations for `AVAL` and `LOQFL` could look like the following in a `mutate()` statement:

- `AVAL = if_else(grepl("<|>", LBSTRESC), as.numeric(gsub("[^0-9, .]+", "", LBSTRESC)), AVAL)`
- `LOQFL = if_else(grepl("<|>", LBSTRESC), "Y", "N")`

**Alternative:** If the limit of quantitation concept is not relevant then please set `LOQFL <- "N"`
<br><br>
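
As a runnable sketch on invented data, the derivation might look as follows. It assumes `dplyr` and, unlike the one-line `AVAL` expression above, it also applies the halving rule described under `LBSTRESC` for values below the limit of quantitation. The helper column `loq_value` is hypothetical and is dropped at the end.

```r
library(dplyr)

# Hypothetical records: one below LOQ, one above LOQ, one regular result.
ADLB <- tibble(
  LBSTRESC = c("<2.1", ">20.7", "5.3"),
  AVAL = c(NA, NA, 5.3)
)

ADLB <- ADLB %>%
  mutate(
    # Numeric portion of LBSTRESC: keep only digits and the decimal point.
    loq_value = as.numeric(gsub("[^0-9.]+", "", LBSTRESC)),
    AVAL = case_when(
      grepl("<", LBSTRESC, fixed = TRUE) ~ loq_value / 2, # below LOQ: half the limit
      grepl(">", LBSTRESC, fixed = TRUE) ~ loq_value, # above LOQ: the limit itself
      TRUE ~ AVAL
    ),
    # Flag the records whose AVAL was derived from LBSTRESC.
    LOQFL = if_else(grepl("<|>", LBSTRESC), "Y", "N")
  ) %>%
  select(-loq_value)
```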

#### `BASE2`
**Definition:** This is not an `ADaM` standard variable but represents the assay value at Screening

**Rationale:** When change from Screening visit analyses are needed then this variable contains the assay value at Screening

**Alternative:** If Screening visit analyses are not relevant then please set `BASE2 <- NA`
<br><br>

#### `CHG2`
**Definition:** This is not an `ADaM` standard variable but represents the change from Screening assay value

**Rationale:** When change from Screening visit analyses are needed then this variable contains the assay value change between Screening and a subsequent visit

**Alternative:** If Screening visit analyses are not relevant then please set `CHG2 <- NA`
<br><br>

#### `PCHG2`
**Definition:** This is not an `ADaM` standard variable but represents the percent change from Screening assay value

**Rationale:** When percent change from Screening visit analyses are needed then this variable contains the assay value percent change between Screening and a subsequent visit

**Alternative:** If Screening visit analyses are not relevant then please set `PCHG2 <- NA`
<br><br>
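
The three Screening-based variables could be derived along the following lines. This is a sketch under stated assumptions: `dplyr` is available, Screening records are identified by `AVISIT == "SCREENING"`, there is at most one Screening record per subject and assay, and percent change is computed as `100 * CHG2 / BASE2`. A study's own derivation rules take precedence.

```r
library(dplyr)

# Hypothetical data: two subjects, one assay, three visits each.
ADLB <- tibble(
  USUBJID = c("S1", "S1", "S1", "S2", "S2", "S2"),
  PARAMCD = "ALT",
  AVISIT = rep(c("SCREENING", "BASELINE", "WEEK 4"), times = 2),
  AVAL = c(20, 22, 25, 40, 38, 30)
)

ADLB <- ADLB %>%
  group_by(USUBJID, PARAMCD) %>%
  mutate(
    # Screening value per subject and assay; NA when no Screening record exists.
    BASE2 = AVAL[match("SCREENING", AVISIT)],
    CHG2 = AVAL - BASE2,
    PCHG2 = 100 * CHG2 / BASE2
  ) %>%
  ungroup()
```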

## Additional Variables
There are additional data manipulations that are performed to create variables that are useful in the context of longitudinal visualizations.
<br><br>

#### `AVALL2`, `BASEL2`, `BASE2L2`
**Description:** Log 2 of `AVAL`, `BASE` and `BASE2` respectively

**Rationale:** This transformation addresses data variance to improve the interpretability or appearance of plots.
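
A minimal sketch of these transformations, assuming `dplyr` and strictly positive assay values (the toy data are invented):

```r
library(dplyr)

# Hypothetical analysis, baseline and Screening values.
ADLB <- tibble(
  AVAL = c(8, 16, 4),
  BASE = c(8, 8, 4),
  BASE2 = c(4, 4, 2)
)

# log2() returns -Inf for 0 and NaN (with a warning) for negative values,
# so such records may need separate handling before plotting.
ADLB <- ADLB %>%
  mutate(
    AVALL2 = log2(AVAL),
    BASEL2 = log2(BASE),
    BASE2L2 = log2(BASE2)
  )
```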