Title: | synthesize data from real clinical trial |
---|---|
Description: | Syntrial takes as input the data of a clinical trial in SDTM format (https://www.cdisc.org/standards/foundational/sdtm). For each synthesized patient a number of source patients is selected. From the source patients' data randomly weighted numeric values or, for categorical variables, one value is selected in a modification of classical bootstrap. |
Authors: | Reinhold Koch [aut, cre] |
Maintainer: | Reinhold Koch <[email protected]> |
License: | GPL-3 |
Version: | 0.4.2.9001 |
Built: | 2024-10-31 03:04:07 UTC |
Source: | https://github.com/openpharma/syntrial |
Determine all categorical variables of a dataframe and return a vector of their names.
categorical_vars(df)
df | dataframe to be evaluated |
logical
categorical_vars(CRC305ABC_DM)
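How the package decides that a variable is categorical is not stated here. The following base R sketch assumes that character, factor, and logical columns count as categorical and returns their names, as the description says; the helper name `categorical_vars_sketch` and the toy dataframe are hypothetical and not part of syntrial.

```r
# Hypothetical helper, not the package's implementation.
# Assumption: character, factor, and logical columns are categorical.
categorical_vars_sketch <- function(df) {
  is_cat <- vapply(
    df,
    function(x) is.character(x) || is.factor(x) || is.logical(x),
    logical(1)
  )
  names(df)[is_cat]
}

# Toy demographics table with two text columns and one integer column.
dm <- data.frame(
  USUBJID = c("01", "02", "03"),
  AGE = c(54L, 61L, 47L),
  SEX = c("F", "M", "F"),
  stringsAsFactors = FALSE
)

categorical_vars_sketch(dm)
```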
An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).
CRC305ABC_AE
A dataframe with 552 rows and 32 variables:
Study Identifier
Domain Abbreviation
Unique Subject Identifier
Sequence Number
Sponsor-Defined Identifier
Reported Term for the Adverse Event
Modified Reported Term
Lowest Level Term
Lowest Level Term Code
Dictionary-Derived Term
Preferred Term Code
High Level Term
High Level Term Code
High Level Group Term
High Level Group Term Code
Category for Adverse Event
Pre-Specified Adverse Event
Primary System Organ Class
Primary System Organ Class Code
Location of Event
Severity/Intensity
Serious Event
Action Taken with Study Treatment
Causality
Pattern of Adverse Event
Outcome of Adverse Event
Start Date/Time of Adverse Event
End Date/Time of Adverse Event
Study Day of Start of Adverse Event
Study Day of End of Adverse Event
Duration of Adverse Event
Measure of Adverse Event
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX
CRC305ABC_AE <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462713?gbrecs=false', col_types = 'cccicccciciciciccciccccccccciici', locale = readr::locale(encoding = 'latin1') )
An anonymized dataset of 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).
CRC305ABC_DM
A dataframe with 123 rows and 19 variables:
Study Identifier
Domain Abbreviation
Unique Subject Identifier
Subject Identifier for the Study
Subject Reference Start Date/Time
Subject Reference End Date/Time
Date/Time of First Study Treatment
Date/Time of Last Study Treatment
Date/Time of Informed Consent
Date/Time of End of Participation
Study Site Identifier
Age
Age Units
Sex
Race
Ethnicity
Planned Arm Code
Description of Planned Arm
Country
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX
CRC305ABC_DM <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462712?gbrecs=false', col_types = 'ccccccccccciccccccc', locale = readr::locale(encoding = 'latin1') )
An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).
CRC305ABC_LB
A dataframe with 55660 rows and 31 variables:
Study Identifier
Domain Abbreviation
Unique Subject Identifier
Sequence Number
Specimen ID
Lab Test or Examination Short Name
Lab Test or Examination Name
Category for Lab Test
Result or Finding in Original Units
Original Units
Reference Range Lower Limit in Orig Unit
Reference Range Upper Limit in Orig Unit
Character Result/Finding in Std Format
Numeric Result/Finding in Standard Units
Standard Units
Reference Range Lower Limit-Std Units
Reference Range Upper Limit-Std Units
Reference Range Indicator
Completion Status
Reason Test Not Done
Vendor Name
Specimen Type
Specimen Condition
Baseline Flag
Fasting Status
Visit Number
Visit Name
Date/Time of Specimen Collection
Study Day of Specimen Collection
Planned Time Point Name
Planned Time Point Number
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX
CRC305ABC_LB <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462715?gbrecs=false', col_types = 'cccicccccccccdcddcccccccciccici', locale = readr::locale(encoding = 'latin1') )
An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).
CRC305ABC_VS
A dataframe with 20581 rows and 25 variables:
Study Identifier
Domain Abbreviation
Unique Subject Identifier
Sequence Number
Vital Signs Test Short Name
Vital Signs Test Name
Vital Signs Position of Subject
Result or Finding in Original Units
Original Units
Character Result/Finding in Std Format
Numeric Result/Finding in Standard Units
Standard Units
Completion Status
Reason Not Performed
Location of Vital Signs Measurement
Baseline Flag
Visit Number
Visit Name
Date/Time of Measurements
Study Day of Vital Signs
Planned Time Point Name
Planned Time Point Number
Reference Range Lower Limit
Reference Range Upper Limit
Reference Range Indicator
https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX
CRC305ABC_VS <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462714?gbrecs=false', col_types = 'ccciccccccdcccccicciciccc', locale = readr::locale(encoding = 'latin1') )
Return identifier variables of a domain.
domain_ids(df)
df | dataframe of one SDTM domain |
character vector of identifier variable names
domain_ids(CRC305ABC_DM)
Compute the maximum Shannon entropy, log(N, base=2), of a number or of a dataframe or matrix. In the latter case, the number of rows is used as N.
Hmax(N)
N | a number, a dataframe, or a matrix |
log(N, base=2)
Hmax(CRC305ABC_DM)
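The computation described above can be sketched in base R as follows. The name `hmax_sketch` is hypothetical; this only mirrors the documented behaviour (log(N, base=2), with the row count used for dataframes and matrices) and is not the package source.

```r
# Hypothetical re-implementation of the documented behaviour.
hmax_sketch <- function(N) {
  # Dataframes and matrices contribute their number of rows.
  if (is.data.frame(N) || is.matrix(N)) {
    N <- nrow(N)
  }
  log(N, base = 2)
}

hmax_sketch(8)                           # entropy of 8 equally likely values
hmax_sketch(data.frame(x = 1:16))        # uses the 16 rows
hmax_sketch(matrix(0, nrow = 4, ncol = 2))  # uses the 4 rows
```

For a dataframe with 123 rows, such as CRC305ABC_DM, this gives roughly 6.94 bits. That is the entropy a variable reaches only when it takes a different value in every row.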
Hstruc computes the Shannon entropy of each variable of a dataframe and checks for constant, record-identifying, or 1:1 equivalent variables.
Hstruc(.data)
.data | dataframe |
A list
Hstruc(CRC305ABC_DM)
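One way to read the three checks in terms of entropy: a constant variable has entropy 0, a record-identifying variable reaches the maximum log2(number of rows), and two variables are 1:1 equivalent when their joint entropy equals each of their individual entropies. The base R sketch below follows that reading. The function names, the structure of the returned list, and the toy data are assumptions for illustration and may differ from what Hstruc actually returns.

```r
# Shannon entropy in bits, from observed relative frequencies.
shannon_entropy <- function(x) {
  p <- table(x, useNA = "ifany") / length(x)
  p <- p[p > 0]                       # ignore unused factor levels
  -sum(p * log2(p))
}

# Hypothetical sketch of the structure checks; not the package's code.
hstruc_sketch <- function(df, tol = 1e-9) {
  H <- vapply(df, shannon_entropy, numeric(1))
  hmax <- log2(nrow(df))
  vars <- names(df)
  equivalent <- list()
  for (i in seq_along(vars)) {
    for (j in seq_along(vars)) {
      if (i < j) {
        # Joint entropy of the pair, via the pasted value combinations.
        joint <- shannon_entropy(paste(df[[i]], df[[j]], sep = "\r"))
        if (abs(joint - H[[i]]) < tol && abs(joint - H[[j]]) < tol) {
          equivalent[[length(equivalent) + 1]] <- c(vars[i], vars[j])
        }
      }
    }
  }
  list(
    entropy = H,
    constant = names(H)[H < tol],
    identifying = names(H)[abs(H - hmax) < tol],
    equivalent = equivalent
  )
}

# Toy table: one constant, one identifier, and a code/label pair.
dm <- data.frame(
  STUDYID = rep("S1", 4),
  USUBJID = c("a", "b", "c", "d"),
  ARMCD = c("A", "B", "A", "B"),
  ARM = c("Drug A", "Drug B", "Drug A", "Drug B"),
  stringsAsFactors = FALSE
)

res <- hstruc_sketch(dm)
res$constant
res$identifying
res$equivalent
```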
synth_df() generates the tibble that relates the original persons to the new synthetic persons. This synthesis dataframe should be kept secret to protect the privacy of the original persons; it is the basis for all synthetic dataframes/tibbles that are generated.
synth_df(USUBJID, n_new = 10, width = 3, maxweight = 2/3)
USUBJID | character vector of person identifiers |
n_new | number of new persons to generate, defaults to 10 |
width | number of persons from original trial to use for new person synthesis, defaults to 3 |
maxweight | maximum allowed weight for one person for synthesis of new person, defaults to 2/3 |
The synthesis tibble relates new synthetic persons to source persons and specifies the weight of each source person's contribution.
synth_df(USUBJID = letters)
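To illustrate what such a synthesis table might contain, the following base R sketch draws `width` source persons for each new person and assigns random weights that sum to 1, redrawing until no weight exceeds `maxweight`. The weight distribution (normalized exponential draws, as in the Bayesian bootstrap), the column names `new_id`, `source_id`, and `weight`, and the function name are all assumptions; the actual output of synth_df() may look different.

```r
# Hypothetical sketch of a synthesis table; not the package's implementation.
synth_df_sketch <- function(USUBJID, n_new = 10, width = 3, maxweight = 2/3) {
  ids <- unique(USUBJID)
  # maxweight must leave room for weights that sum to 1.
  stopifnot(length(ids) >= width, maxweight > 1 / width)
  rows <- lapply(seq_len(n_new), function(i) {
    # Pick the source persons without replacement.
    src <- ids[sample.int(length(ids), width)]
    # Assumption: weights are normalized exponential draws,
    # redrawn until none exceeds maxweight.
    repeat {
      w <- rexp(width)
      w <- w / sum(w)
      if (max(w) <= maxweight) break
    }
    data.frame(
      new_id = sprintf("SYN%03d", i),
      source_id = src,
      weight = w,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

set.seed(1)
syn <- synth_df_sketch(letters, n_new = 5)
head(syn)
```

Each new person occupies `width` rows, one per source person, which is why a table like this must stay secret: it links every synthetic person back to real identifiers.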
Create a synthetic dataframe from syndf and a source SDTM dataframe.
synthesize(syndf, df, cat_fuzz = 1)
syndf | synthesis dataframe |
df | original SDTM dataframe |
cat_fuzz | fuzz factor for noise on categorical variables, defaults to 1 |
synthesized SDTM dataframe
synthesize(synth_df(CRC305ABC_DM$USUBJID), CRC305ABC_DM)
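The general idea of combining source persons can be sketched in base R for one numeric and one categorical variable. This sketch assumes that a numeric value is the weighted mean of the source values and that a categorical value is drawn from the source values with probabilities equal to the weights; the effect of `cat_fuzz` is not modelled. The synthesis table layout (`new_id`, `source_id`, `weight`), the function name, and the toy data are hypothetical, and the package may work differently in detail.

```r
# Hand-made synthesis table for one new person (hypothetical layout).
syn <- data.frame(
  new_id = rep("SYN001", 3),
  source_id = c("P1", "P2", "P3"),
  weight = c(0.5, 0.3, 0.2),
  stringsAsFactors = FALSE
)

# Toy source demographics.
dm <- data.frame(
  USUBJID = c("P1", "P2", "P3", "P4"),
  AGE = c(50, 60, 70, 80),
  SEX = c("F", "M", "F", "M"),
  stringsAsFactors = FALSE
)

# Hypothetical sketch; not the package's implementation.
synthesize_sketch <- function(syn, df) {
  # Attach each source person's data to its row in the synthesis table.
  merged <- merge(syn, df, by.x = "source_id", by.y = "USUBJID")
  pieces <- lapply(split(merged, merged$new_id), function(g) {
    data.frame(
      USUBJID = g$new_id[1],
      # Assumption: numeric values are combined as a weighted mean.
      AGE = sum(g$weight * g$AGE),
      # Assumption: one categorical value is drawn, weighted by contribution.
      SEX = g$SEX[sample.int(nrow(g), 1, prob = g$weight)],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

set.seed(42)
out <- synthesize_sketch(syn, dm)
out
```

With these weights the synthetic AGE is 0.5 * 50 + 0.3 * 60 + 0.2 * 70 = 57, a value no source person has, while SEX is always one of the observed categories.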
Create a synthetic events dataframe from syndf and a source SDTM events dataframe.
synthesize_E(syndf, df, cat_fuzz = 1)
syndf | synthesis dataframe |
df | original SDTM dataframe |
cat_fuzz | fuzz factor for noise on categorical variables, defaults to 1 |
synthesized SDTM events dataframe
## Not run: synthesize_E(synth_df(CRC305ABC_DM$USUBJID), CRC305ABC_AE)
Stricter privacy regulations affect not only the sharing of clinical trial data but also the development of analysis scripts and reports. Synthetic data with the same statistical properties as the original data can be used instead, possibly even for data exploration.
syntrial synthesizes data in SDTM format and protects privacy via a mechanism intermediate between the frequentist and the Bayesian bootstrap. For each synthesized patient, a number of source patients is selected. For numeric variables, the source patients' values are randomly weighted; for categorical variables, one of the source patients' values is selected.
create the table linking real and synthetic persons
synthesize demographic and findings data
synthesize events data