Clinical trial datasets can contain many different types of incorrect
data. This package does not intend to comprehensively cover all
scenarios in which data may be wrong, nor does it intend to replicate
the comprehensive set of P21 data checks for SDTM. Instead, the data
checks in this package are intended to be generalizable, actionable, and
meaningful for analysis. For example, many clinical trials contain the
CO domain; however, the sdtmchecks package does not have any
functionality around this domain, as it is usually not meaningful for
analysis.
The main branch (pharmaverse/sdtmchecks@main) contains the latest
released version and should not be used for development. The devel
branch is the default branch and contains the latest development version
of the package. To start contributing, please make a feature branch off
of devel. To install, please refer to the front page of the package
site. When your code is ready to be incorporated, please open a pull
request that another person will review prior to merging the update into
devel. If you do not have write access to the repository, please work
off of a forked repo and open a pull request from the fork.
The {renv} package is used to handle package dependencies. Run
renv::restore() to install the same set of package versions being used
by the team.
The sdtmchecksmeta dataset lists existing checks and contains helpful
additional information:
## # A tibble: 10 × 4
##    check                              domains title                  description
##    <chr>                              <chr>   <chr>                  <chr>
##  1 check_ae_aeacn_ds_disctx_covid     ae, ds  COVID AE trt discon    "Check pat…
##  2 check_ae_aeacnoth                  ae      AE AEACNOTH multiple   "Check for…
##  3 check_ae_aeacnoth_ds_disctx        ae, ds  AE AEACNOTx Discon     "Check for…
##  4 check_ae_aeacnoth_ds_stddisc_covid ae, ds  COVID AE study discon  "Check pat…
##  5 check_ae_aedecod                   ae      AE Missing PT          "Check for…
##  6 check_ae_aedthdtc_aesdth           ae      AE Death Date vs Indi… "Check for…
##  7 check_ae_aedthdtc_ds_death         ae, ds  DS Death Dates in AE   "Check pat…
##  8 check_ae_aelat                     ae      AE AELAT Missing       "OPHTHALMO…
##  9 check_ae_aeout                     ae      AE Death Outcome       "Check for…
## 10 check_ae_aeout_aeendtc_aedthdtc    ae      Fatal AE Resolution D… "Check for…
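As a rough illustration of how this listing can be used, the sketch below picks out the checks that involve the DS domain. The meta data frame is a hand-copied stand-in built from three of the rows printed above, not the real dataset; with the package installed, sdtmchecksmeta itself could presumably be filtered in the same way.

```r
# Stand-in for sdtmchecksmeta: three rows copied from the printed listing.
meta <- data.frame(
  check   = c("check_ae_aeacnoth", "check_ae_aeacnoth_ds_disctx", "check_ae_aedecod"),
  domains = c("ae", "ae, ds", "ae"),
  title   = c("AE AEACNOTH multiple", "AE AEACNOTx Discon", "AE Missing PT"),
  stringsAsFactors = FALSE
)

# Split the comma-separated domains and keep the checks that involve DS.
uses_ds <- vapply(
  strsplit(meta$domains, ",\\s*"),
  function(d) "ds" %in% d,
  logical(1)
)
ds_checks <- meta$check[uses_ds]
print(ds_checks)
```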
When writing a new data check function, the function name should
start with the word ‘check’ and contain key dataset information plus a
1-2 word description,
i.e. check_[dataset1]_..._[datasetN]_[brief description].
For example:
check_dm_race <- function(DM){...}
check_dv_ae_covid <- function(DV, AE){...}
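The naming pattern can be sketched in a few lines of R. Both helpers below, make_check_name() and is_valid_check_name(), are hypothetical illustrations and are not part of sdtmchecks; the validation rule (lowercase letters, digits, and underscores only) is an assumption inferred from the existing check names.

```r
# Hypothetical helper (not part of sdtmchecks): build a check name from
# the datasets involved and a brief description.
make_check_name <- function(datasets, description) {
  paste(c("check", tolower(datasets), description), collapse = "_")
}

# Hypothetical validator: the name starts with "check_" and is followed by
# at least two underscore-separated lowercase/digit parts.
is_valid_check_name <- function(name) {
  grepl("^check_[a-z0-9]+(_[a-z0-9]+)+$", name)
}

name_one <- make_check_name("DM", "race")
name_two <- make_check_name(c("DV", "AE"), "covid")
print(c(name_one, name_two))
print(is_valid_check_name(c(name_one, name_two, "dm_race_check")))
```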
Each function should at minimum take the datasets being investigated
as parameters, e.g. check_ae_aedecod <- function(AE){...}
Each function should start with a check for required variables. The
following internal utility functions may be helpful for this:
%lacks_all%, %lacks_any%, %has_all%, or %has_any%.
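To make the idea of a required-variable check concrete, here is a minimal base R sketch. The operator %lacks_any_sketch% is a hypothetical stand-in written for this illustration; it is not the package's implementation of %lacks_any%, only an approximation of what such an operator might test.

```r
# Hypothetical stand-in illustrating the idea behind %lacks_any%:
# TRUE when at least one required variable is absent from the data frame.
`%lacks_any_sketch%` <- function(df, varnames) {
  !all(varnames %in% names(df))
}

# Dummy DM data that is missing the AGE column.
DM <- data.frame(USUBJID = c("S1", "S2"), stringsAsFactors = FALSE)
required <- c("USUBJID", "AGE")

if (DM %lacks_any_sketch% required) {
  # Report which required variables are absent before doing anything else.
  missing_vars <- setdiff(required, names(DM))
  message("DM is missing required variable(s): ",
          paste(missing_vars, collapse = ", "))
}
```

Running the guard first means the rest of the check can assume the columns exist.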
All checks should use the internal pass() and fail() functions to
return either TRUE or FALSE depending on check results. These functions
enable attributes to be attached to the boolean result, e.g. a message
or a listing of flagged records.
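The sketch below shows one way a boolean result can carry attributes in base R. The functions sketch_pass() and sketch_fail() are hypothetical stand-ins, and the attribute names "msg" and "data" are assumptions made for this illustration; the internal pass() and fail() may differ in detail.

```r
# Hypothetical stand-ins for the internal helpers; the attribute names
# "msg" and "data" are assumptions for illustration only.
sketch_pass <- function() TRUE
sketch_fail <- function(msg, data = NULL) {
  structure(FALSE, msg = msg, data = data)
}

# A one-row listing of flagged records to attach to the failure.
flagged <- data.frame(USUBJID = "S2", AGE = NA_real_, stringsAsFactors = FALSE)
result <- sketch_fail("DM has 1 patient(s) with suspicious age value(s). ", flagged)

# The result still behaves as a logical value in a condition...
if (!result) {
  # ...while the message and the flagged records travel with it.
  message(attr(result, "msg"))
  print(attr(result, "data"))
}
```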
In a fail() scenario with an accompanying dataframe, write the message
with an additional space after the period. For example,
"...age value(s). " rather than "...age value(s)." This extra space is a
formatting detail used within an existing report.
sdtmchecks intentionally attempts to minimize external dependencies.
This is something to keep in mind when developing a new check. Currently
the only dependencies within data check functions are dplyr and
tidyselect.
Include examples with dummied dataframes in the header that test the logic, assumptions, and robustness of the data check function.
If you are writing your first check it might be helpful to start by editing an existing one, for example the one below:
#' Example check
#'
#' @param DM Demographics SDTM dataset with variables USUBJID and AGE
#'
#' @return boolean
#' @export
#'
#' @examples
#'
#' \dontrun{
#' check_dm_age_missing(DM)
#' }
#'
check_dm_age_missing <- function(DM){

    ### First check that required variables exist and return a message if they don't
    if(DM %lacks_any% c("USUBJID","AGE")){

        fail(lacks_msg(DM, c("USUBJID","AGE")))

    }else{

        ### Subset DM to only records with missing AGE
        mydf_0 <- subset(DM, is_sas_na(DM$AGE), c("USUBJID","AGE"))

        ### Subset DM to only records with AGE<18
        mydf_1 <- subset(DM, !is_sas_na(DM$AGE) & DM$AGE<18, c("USUBJID","AGE"))

        ### Subset DM to only records with AGE>=90
        mydf_2 <- subset(DM, !is_sas_na(DM$AGE) & DM$AGE>=90, c("USUBJID","AGE"))

        ### Combine records with abnormal AGE and sort by subject
        mydf3 <- rbind(mydf_0, mydf_1, mydf_2)
        mydf <- mydf3[order(mydf3$USUBJID),]
        rownames(mydf) <- NULL

        ### Print to report

        ### Pass if no records with missing AGE, AGE<18 or AGE>=90
        if(nrow(mydf)==0){
            pass()

        ### Fail with the subset dataframe if there are records with missing AGE, AGE<18 or AGE>=90
        }else if(nrow(mydf)>0){
            fail(paste("DM has ",length(unique(mydf$USUBJID)),
                       " patient(s) with suspicious age value(s). ",sep=""),
                 mydf)
        }
    }
}
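Dummied data along the following lines could exercise each branch of the check above. This is a base R approximation rather than a call to the package: is.na() stands in for is_sas_na(), whose exact behaviour is not shown here, and the subject IDs are made up for the example.

```r
# Dummy DM data: one normal age, one missing, one under 18, one at 90.
DM <- data.frame(
  USUBJID = c("S4", "S1", "S3", "S2"),
  AGE     = c(90, 45, 17, NA),
  stringsAsFactors = FALSE
)

# Approximate the flagging logic: missing AGE, AGE < 18, or AGE >= 90.
# is.na() is an assumption standing in for the package's is_sas_na().
suspicious <- is.na(DM$AGE) | DM$AGE < 18 | DM$AGE >= 90
flagged <- DM[suspicious, c("USUBJID", "AGE")]
flagged <- flagged[order(flagged$USUBJID), ]
rownames(flagged) <- NULL

# Build the message in the same shape as the check, including the
# trailing space after the period.
n_patients <- length(unique(flagged$USUBJID))
msg <- paste0("DM has ", n_patients, " patient(s) with suspicious age value(s). ")
print(flagged)
print(msg)
```

A second dummy dataframe without the AGE column is also worth including in the examples, so that the required-variable branch is tested as well.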
Document each check in a Roxygen2 header so that information is
available in package documentation.