Package 'syntrial' reference manual

Title:	Synthesize data from a real clinical trial
Description:	syntrial takes as input the data of a clinical trial in SDTM format (https://www.cdisc.org/standards/foundational/sdtm). For each synthesized patient, a number of source patients are selected. In a modification of the classical bootstrap, numeric values are synthesized as randomly weighted values from the source patients' data, while for categorical variables one of the source values is selected.
Authors:	Reinhold Koch [aut, cre]
Maintainer:	Reinhold Koch <[email protected]>
License:	GPL-3
Version:	0.4.2.9001
Built:	2025-03-30 02:54:37 UTC
Source:	https://github.com/openpharma/syntrial

categorical variables

Description

Determine all categorical variables of a dataframe and return a vector of their names.

Usage

categorical_vars(df)

Arguments

`df`	dataframe to be evaluated

Value

character vector of categorical variable names

Examples

categorical_vars(CRC305ABC_DM)
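
The rule the package uses to decide that a variable is categorical is not stated on this page. As a rough illustration only, the sketch below treats character, factor, and logical columns as categorical. `categorical_sketch` and the `toy` dataframe are hypothetical names introduced here; they are not part of syntrial and may not match what `categorical_vars()` returns.

```r
# Hypothetical helper: names of columns that are character, factor, or logical.
# Assumption: this rule only approximates what categorical_vars() does.
categorical_sketch <- function(df) {
  is_cat <- vapply(
    df,
    function(x) is.character(x) || is.factor(x) || is.logical(x),
    logical(1)
  )
  names(df)[is_cat]
}

# Small made-up dataframe with SDTM-like column names.
toy <- data.frame(
  ARM = c("Active", "Placebo", "Active"),
  AGE = c(61L, 58L, 70L),
  SEX = factor(c("F", "M", "F")),
  stringsAsFactors = FALSE
)

categorical_sketch(toy)   # "ARM" "SEX"
```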

adverse event data from clinical trials CRC305ABC

Description

An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).

Usage

CRC305ABC_AE

Format

A dataframe with 552 rows and 32 variables:

STUDYID: Study Identifier
DOMAIN: Domain Abbreviation
USUBJID: Unique Subject Identifier
AESEQ: Sequence Number
AESPID: Sponsor-Defined Identifier
AETERM: Reported Term for the Adverse Event
AEMODIFY: Modified Reported Term
AELLT: Lowest Level Term
AELLTCD: Lowest Level Term Code
AEDECOD: Dictionary-Derived Term
AEPTCD: Preferred Term Code
AEHLT: High Level Term
AEHLTCD: High Level Term Code
AEHLGT: High Level Group Term
AEHLGTCD: High Level Group Term Code
AECAT: Category for Adverse Event
AEPRESP: Pre-Specified Adverse Event
AESOC: Primary System Organ Class
AESOCCD: Primary System Organ Class Code
AELOC: Location of Event
AESEV: Severity/Intensity
AESER: Serious Event
AEACN: Action Taken with Study Treatment
AEREL: Causality
AEPATT: Pattern of Adverse Event
AEOUT: Outcome of Adverse Event
AESTDTC: Start Date/Time of Adverse Event
AEENDTC: End Date/Time of Adverse Event
AESTDY: Study Day of Start of Adverse Event
AEENDY: Study Day of End of Adverse Event
AEDUR: Duration of Adverse Event
AESIZE: Measure of Adverse Event

Source

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX

CRC305ABC_AE <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462713?gbrecs=false',
                                col_types = 'cccicccciciciciccciccccccccciici',
                                locale = readr::locale(encoding = 'latin1')
                               )

demographic data from clinical trials CRC305ABC

Description

An anonymized dataset of 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).

Usage

CRC305ABC_DM

Format

A dataframe with 123 rows and 19 variables:

STUDYID: Study Identifier
DOMAIN: Domain Abbreviation
USUBJID: Unique Subject Identifier
SUBJID: Subject Identifier for the Study
RFSTDTC: Subject Reference Start Date/Time
RFENDTC: Subject Reference End Date/Time
RFXSTDTC: Date/Time of First Study Treatment
RFXENDTC: Date/Time of Last Study Treatment
RFICTC: Date/Time of Informed Consent
RFPENDTC: Date/Time of End of Participation
SITEID: Study Site Identifier
AGE: Age
AGEU: Age Units
SEX: Sex
RACE: Race
ETHNIC: Ethnicity
ARMCD: Planned Arm Code
ARM: Description of Planned Arm
COUNTRY: Country

Source

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX

CRC305ABC_DM <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462712?gbrecs=false',
                                 col_types = 'ccccccccccciccccccc',
                                 locale = readr::locale(encoding = 'latin1')
                               )

lab data from clinical trials CRC305ABC

Description

An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).

Usage

CRC305ABC_LB

Format

A dataframe with 55660 rows and 31 variables:

STUDYID: Study Identifier
DOMAIN: Domain Abbreviation
USUBJID: Unique Subject Identifier
LBSEQ: Sequence Number
LBREFID: Specimen ID
LBTESTCD: Lab Test or Examination Short Name
LBTEST: Lab Test or Examination Name
LBCAT: Category for Lab Test
LBORRES: Result or Finding in Original Units
LBORRESU: Original Units
LBORNRLO: Reference Range Lower Limit in Orig Unit
LBORNRHI: Reference Range Upper Limit in Orig Unit
LBSTRESC: Character Result/Finding in Std Format
LBSTRESN: Numeric Result/Finding in Standard Units
LBSTRESU: Standard Units
LBSTNRLO: Reference Range Lower Limit-Std Units
LBSTNRHI: Reference Range Upper Limit-Std Units
LBNRIND: Reference Range Indicator
LBSTAT: Completion Status
LBREASND: Reason Test Not Done
LBNAM: Vendor Name
LBSPEC: Specimen Type
LBSPCCND: Specimen Condition
LBBLFL: Baseline Flag
LBFAST: Fasting Status
VISITNUM: Visit Number
VISIT: Visit Name
LBDTC: Date/Time of Specimen Collection
LBDY: Study Day of Specimen Collection
LBTPT: Planned Time Point Name
LBTPTNUM: Planned Time Point Number

Source

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX

CRC305ABC_LB <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462715?gbrecs=false',
                                col_types = 'cccicccccccccdcddcccccccciccici',
                                locale = readr::locale(encoding = 'latin1')
                                )

vital sign data from clinical trials CRC305ABC

Description

An anonymized dataset from 123 patients in SDTM format (https://en.wikipedia.org/wiki/SDTM).

Usage

CRC305ABC_VS

Format

A dataframe with 20581 rows and 25 variables:

STUDYID: Study Identifier
DOMAIN: Domain Abbreviation
USUBJID: Unique Subject Identifier
VSSEQ: Sequence Number
VSTESTCD: Vital Signs Test Short Name
VSTEST: Vital Signs Test Name
VSPOS: Vital Signs Position of Subject
VSORRES: Result or Finding in Original Units
VSORRESU: Original Units
VSSTRESC: Character Result/Finding in Std Format
VSSTRESN: Numeric Result/Finding in Standard Units
VSSTRESU: Standard Units
VSSTAT: Completion Status
VSREASND: Reason Not Performed
VSLOC: Location of Vital Signs Measurement
VSBLFL: Baseline Flag
VISITNUM: Visit Number
VISIT: Visit Name
VSDTC: Date/Time of Measurements
VSDY: Study Day of Vital Signs
VSTPT: Planned Time Point Name
VSTPTNUM: Planned Time Point Number
VSORNRLO: Reference Range Lower Limit
VSORNRHI: Reference Range Upper Limit
VSNRIND: Reference Range Indicator

Source

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX

CRC305ABC_VS <- readr::read_tsv('https://dataverse.harvard.edu/api/access/datafile/3462714?gbrecs=false',
                                col_types = 'ccciccccccdcccccicciciccc',
                                locale = readr::locale(encoding = 'latin1')
                                )

domain identifier variables

Description

Return identifier variables of a domain.

Usage

domain_ids(df)

Arguments

`df`	dataframe of one SDTM domain

Value

character vector of identifier variable names

Examples

domain_ids(CRC305ABC_DM)

maximum entropy

Description

Compute the maximum Shannon entropy, log(N, base=2), of a number, a dataframe, or a matrix. For a dataframe or matrix, N is the number of rows.

Usage

Hmax(N)

Arguments

`N`	a number, a dataframe, or a matrix

Value

log(N, base=2)

Examples

Hmax(CRC305ABC_DM)
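
The documented behaviour can be reproduced in a few lines of base R. The sketch below is an illustration, not the package source; `hmax_sketch` is a hypothetical name. For a dataframe with 123 rows, such as CRC305ABC_DM, the result would be log2(123), about 6.94 bits.

```r
# Hypothetical re-implementation of the documented behaviour, for illustration.
hmax_sketch <- function(N) {
  # For a dataframe or matrix, use the number of rows as N.
  if (is.data.frame(N) || is.matrix(N)) N <- nrow(N)
  log(N, base = 2)
}

hmax_sketch(8)                        # 3 bits: 8 equally likely values
hmax_sketch(data.frame(x = 1:16))     # 4 bits: 16 rows
hmax_sketch(matrix(0, nrow = 4, ncol = 2))  # 2 bits: 4 rows
```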

Shannon entropy structure details of a dataframe

Description

Hstruc computes the Shannon entropy of each variable of a dataframe and checks for

constant,
record identifying, or
1:1 equivalent

variables.

Usage

Hstruc(.data)

Arguments

`.data`	dataframe

Value

A list

Examples

Hstruc(CRC305ABC_DM)
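
To make the three checks concrete, here is a minimal base-R sketch on a made-up dataframe. It is an assumption about how such checks can be done, not the package's implementation, and the structure of the list returned by `Hstruc()` is not reproduced. `shannon_entropy`, `one_to_one`, and `toy` are hypothetical names.

```r
# Shannon entropy (in bits) of one variable; NA is counted as its own value.
shannon_entropy <- function(x) {
  p <- as.numeric(table(x, useNA = "ifany")) / length(x)
  p <- p[p > 0]                       # ignore unused factor levels
  -sum(p * log2(p))
}

toy <- data.frame(
  STUDYID = rep("S1", 4),                                  # constant
  USUBJID = c("p1", "p2", "p3", "p4"),                     # identifies each record
  ARMCD   = c("A", "A", "B", "B"),
  ARM     = c("Active", "Active", "Placebo", "Placebo"),   # 1:1 with ARMCD
  stringsAsFactors = FALSE
)

H <- vapply(toy, shannon_entropy, numeric(1))

# Constant variables carry no information: entropy is 0.
constant <- names(H)[H == 0]

# Record-identifying variables have as many distinct values as there are rows,
# so their entropy reaches the maximum log2(nrow).
identifying <- names(toy)[
  vapply(toy, function(x) length(unique(x)) == nrow(toy), logical(1))
]

# Two variables are 1:1 equivalent when the number of distinct pairs
# equals the number of distinct values of each variable on its own.
one_to_one <- function(x, y) {
  n_pairs <- nrow(unique(data.frame(x, y)))
  n_pairs == length(unique(x)) && n_pairs == length(unique(y))
}

one_to_one(toy$ARMCD, toy$ARM)       # TRUE
one_to_one(toy$ARMCD, toy$USUBJID)   # FALSE
```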

generate synthesis dataframe

Description

synth_df() generates the tibble that relates original persons to new synthetic persons. Keep this synthesis dataframe secret to protect the privacy of the original persons; it is the basis for all synthetic dataframes/tibbles that are generated.

Usage

synth_df(USUBJID, n_new = 10, width = 3, maxweight = 2/3)

Arguments

`USUBJID`	character vector of person identifiers
`n_new`	number of new persons to generate, defaults to 10
`width`	number of persons from original trial to use for new person synthesis, defaults to 3
`maxweight`	maximum allowed weight for one person for synthesis of new person, defaults to 2/3

Value

The synthesis tibble relates new synthetic persons to source persons and specifies the weight of each source person's contribution.

Examples

synth_df(USUBJID = letters)
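
The following sketch shows one way a table with these properties could be built: each new person gets `width` source persons and random weights that sum to 1, with no weight above `maxweight`. It is an illustration under stated assumptions, not the package's algorithm. The function name `synth_table_sketch`, the column names `new_id`, `source_id`, and `weight`, and the redraw-until-valid strategy are all hypothetical.

```r
# Hypothetical sketch of a synthesis table; column names are assumptions.
synth_table_sketch <- function(USUBJID, n_new = 10, width = 3, maxweight = 2/3) {
  # maxweight must exceed 1/width, otherwise no valid weights exist.
  stopifnot(width <= length(USUBJID), maxweight > 1 / width)
  rows <- lapply(seq_len(n_new), function(i) {
    src <- sample(USUBJID, width)      # source persons, drawn without replacement
    repeat {                           # redraw until no weight is too large
      w <- rexp(width)
      w <- w / sum(w)                  # random weights that sum to 1
      if (max(w) <= maxweight) break
    }
    data.frame(
      new_id    = sprintf("SYN%03d", i),
      source_id = src,
      weight    = w,
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

set.seed(1)
syn <- synth_table_sketch(letters, n_new = 4)
syn   # 12 rows: 4 new persons x 3 source persons each
```

Capping the largest weight keeps any synthetic person from being a near copy of a single source person, which is the privacy concern the description raises.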

synthesize an SDTM dataframe

Description

Create a synthetic dataframe from the synthesis dataframe syndf (as generated by synth_df()) and a source SDTM dataframe.

Usage

synthesize(syndf, df, cat_fuzz = 1)

Arguments

`syndf`	synthesis dataframe
`df`	original SDTM dataframe
`cat_fuzz`	fuzz factor for noise on categorical variables, defaults to 1

Value

synthesized SDTM dataframe

Examples

synthesize(synth_df(CRC305ABC_DM$USUBJID), CRC305ABC_DM)
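
As a toy illustration of the idea for a single synthetic person, the sketch below reads "randomly weighted numeric values" as a weighted mean of the source values and picks one categorical value using the weights as probabilities. Both readings are assumptions, the data are made up, and the effect of `cat_fuzz` is not modelled. None of the names below come from the package.

```r
# Hypothetical sketch for ONE synthetic person built from three source persons.
sources <- data.frame(
  USUBJID = c("p1", "p2", "p3"),
  AGE     = c(60, 50, 70),
  SEX     = c("F", "M", "F"),
  stringsAsFactors = FALSE
)
weights <- c(0.5, 0.3, 0.2)   # weights of the three source persons, sum to 1

# Numeric variable: weighted combination of the source values,
# here 0.5 * 60 + 0.3 * 50 + 0.2 * 70 = 59.
syn_age <- sum(weights * sources$AGE)

# Categorical variable: pick exactly one of the source values.
# Assumption: the pick uses the weights as selection probabilities.
set.seed(42)
syn_sex <- sample(sources$SEX, size = 1, prob = weights)

data.frame(AGE = syn_age, SEX = syn_sex)
```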

synthesize events dataframe

Description

Create a synthetic events dataframe from the synthesis dataframe syndf and a source SDTM events dataframe.

Usage

synthesize_E(syndf, df, cat_fuzz = 1)

Arguments

`syndf`	synthesis dataframe
`df`	original SDTM dataframe
`cat_fuzz`	fuzz factor for noise on categorical variables, defaults to 1

Value

synthesized SDTM events dataframe

Examples

## Not run: synthesize_E(synth_df(CRC305ABC_DM$USUBJID), CRC305ABC_AE)

Synthetic "twin trial" from real clinical trial data in SDTM format.

Description

Stricter privacy regulations affect not only the sharing of clinical trial data but also the development of analysis scripts and reports. Synthetic data with the same statistical properties as the original data can be used instead, possibly even for data exploration.

Details

syntrial synthesizes data in SDTM format and protects privacy via a mechanism intermediate between the frequentist and the Bayesian bootstrap. For each synthesized patient, a number of source patients are selected. Numeric values are synthesized as randomly weighted values from the source patients' data; for categorical variables, one of the source values is selected.
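
For orientation, the two bootstrap variants named above differ in how they weight the original observations. The sketch below shows only these two standard schemes on a toy example; how syntrial places itself between them is not shown, and the variable names are illustrative.

```r
set.seed(123)
n <- 5   # number of source patients in this toy example

# Classical (frequentist) bootstrap: resampling with replacement, so each
# patient's weight is a count divided by n (a multiple of 1/n, possibly zero).
w_classical <- as.numeric(rmultinom(1, size = n, prob = rep(1 / n, n))) / n

# Bayesian bootstrap: continuous weights from a flat Dirichlet distribution,
# generated here by normalising exponential draws.
g <- rexp(n)
w_bayesian <- g / sum(g)

# Both weight vectors sum to 1.
sum(w_classical)
sum(w_bayesian)
```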

synth_df

create the table linking real and synthetic persons

synthesize

synthesize demographic and findings data

synthesize_E

synthesize events data
