--- title: "Fitting parametric survival models in R" author: "Roche" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: references.bib csl: biomedicine.csl vignette: > %\VignetteIndexEntry{Fitting parametric survival models in R} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.width = 7, fig.height = 6, message = FALSE ) ``` # Introduction Parametric survival models are often the preferred method of extrapolating survival data for use in economic models. The National Institute for Health and Care Excellence (NICE) Decision Support Unit (DSU) technical support document (TSD) 14 recommends that the Exponential, Weibull, Gompertz, log-logistic, log normal and Generalized Gamma parametric models should all be considered.[@latimer2011nice] More recently, NICE also discusses more flexible models in NICE DSU TSD 21, however, more these models are not in the scope of this package.[@rutherford2020nice] The Canadian Agency for Drugs and Technologies in Health (CADTH) additionally specifies that the Gamma distribution must also be considered. This document therefore details the characteristics of each of these distributions and demonstrates how the parameters from each distribution, outputted using the flexsurvPlus package, can be implemented within an economic model.[@cadth2020] The Generalized F distribution is not commonly used, however it has been included in this package in case it is required. The flexsurvPlus package allows the inclusion of a treatment effect in the following four ways: * Separate models - Models fitted to each treatment arm separately * Independent shape models - Models fitted to both/all treatment arms including a treatment covariate to model the effect of treatment on both the scale and shape parameter(s) of the distribution * Common shape models - Models fitted to both/all treatment arms including a treatment covariate to model the effect of treatment on the scale parameter of the distribution. The shape parameter(s) of the distribution is common across treatments which reflects an assumption of proportional hazards or accelerated failure time between treatments depending on the distribution * One arm models - Models fitted to the entire dataset (no treatment strata) This document details how to use the flexsurvPlus package to perform these models. A separate vignette; "Parametric survival analysis using the flexsurvPlus package: understanding the theory" details the theory behind the models. # Set up packages and data ## Install packages The following packages are required to run this example: ```{r setup} # Libraries library(flexsurvPlus) library(tibble) library(dplyr) library(boot) library(ggplot2) # Non-scientific notation options(scipen=999) # Colours for plots blue = rgb(69, 185, 209, max=255) pink = rgb(211,78,147,max=255) Dyellow = rgb(214, 200, 16, max=255) orange<-rgb(247,139,21,max=255) ``` ## Read in the data To perform survival analyses, patient level data is required for the survival endpoints. This example uses a standard simulated data set (adtte). There is no standard naming that is needed for this package however, there are some set variables that are needed: * Time - a numeric variable * Event - a binary variable (event=1, censor=0) * Treatment - a character variable with the name of the intervention treatment The data must be in "wide" format such that there is one row per patient and columns for each endpoint separately. In this example, we analyze progression-free survival (PFS). ```{r} adtte <- sim_adtte(seed = 2020, rho = 0.6) head(adtte) # subset PFS data and rename PFS_data <- adtte %>% filter(PARAMCD=="PFS") %>% transmute(USUBJID, ARMCD, PFS_days = AVAL, PFS_event = 1- CNSR ) head(PFS_data) ``` # Exploratory analysis Before performing any statistical analysis, it is important to explore the data. Most importantly is a Kaplan-Meier plot. A simple Kaplan-Meier plot has been produced here: ```{r, warning=FALSE} # Create survfit object km.est.PFS <- survfit(Surv(PFS_days, PFS_event) ~ ARMCD, data = PFS_data, conf.type = 'plain') # Plot Kaplan-Meier plot(km.est.PFS, col = c(blue, pink), # plot colours lty = c(1:2), # line type xlab = "Time (Days)", # x-axis label ylab = "Progression-free survival", # y-axis label xlim = c(0, 800)) legend(x = 500, y = .9, legend = c("Arm A", "Arm B"), lty = c(1:2), col = c(blue, pink)) ``` # Fitting the models The runPSM function fits parametric survival models for multiple distributions using the flexsurv package, manipulates the flexsurv objects to get the parameter estimates and AIC and BIC value (using the flexsurvPlus function get_params) and rearranges the parameter estimates such that they can easily be output to excel to calculate survival for both the intervention and reference treatment in an economic model. These functions can be used to estimate 3 types of model: * Separate models - Models fitted to each treatment arm separately * These models are fitted using the runPSM function using the model.type="Separate" argument. * Independent shape models - Models fitted to both/all treatment arms including a treatment covariate to model the effect of treatment on both the scale and shape parameter(s) of the distribution. * They have been fit using the runPSM function using the model.type="Independent shape" argument. * Common shape models - Models fitted to both/all treatment arms including a treatment covariate to model the effect of treatment on the scale parameter of the distribution. The shape parameter(s) of the distribution is common across treatments which reflects an assumption of proportional hazards or accelerated failure time between treatments depending on the distribution * These models have been fit using the runPSM function using the model.type="Common shape" argument. * One arm models - Models fitted to the entire dataset (no treatment strata) * These models are fitted using the runPSM function using the model.type="One arm" argument. The inputs to the runPSM function are: * data - A data frame containing individual patient data for the relevant time to event outcomes * time_var - Name of time variable in 'data'. Variable must be numerical and >0. * event_var - Name of event variable in 'data'. Variable must be numerical and contain 1's to indicate an event and 0 to indicate a censor. * model.type - Character vector indicating the types of model formula provided. Permitted values are 'Common shape', 'Independent shape' or 'Separate' as per the models explained above. * distr - Each type of model can be fitted with multiple distributions. The distributions available for this package are: * Exponential ('exp') * Weibull ('weibull') * Gompertz ('gompertz') * Log-normal ('lnorm') * Log-logistic ('llogis') * Generalized gamma ('gengamma') * Gamma ('gamma') * Generalized F ('genf') * strata_var - Name of stratification variable in "data". This is usually the treatment variable and must be categorical. Not required when model.type='One arm'. * int_name - Character to indicate the name of the treatment of interest, must be a level of the "strata_var" column in "data", used for labelling the parameters. * ref_name - Character to indicate the name of the reference treatment, must be a level of the "strata_var" column in "data", used for labelling the parameters. Not required when model.type='One arm'. More information about each function can be used by running the code ?runPSM. Example code to fit all two arm models is presented below. ```{r} psm_PFS_all <- runPSM(data=PFS_data, time_var="PFS_days", event_var="PFS_event", model.type= c("Common shape", "Independent shape", "Separate"), distr = c('exp', 'weibull', 'gompertz', 'lnorm', 'llogis', 'gengamma', 'gamma', 'genf'), strata_var = "ARMCD", int_name="A", ref_name = "B") psm_PFS_all ``` # Estimate survival from the models and plot the curves Survival at a given time, t, is estimated as follows: $$ S(t) = P({T>t}) = 1 - F(t) $$ Where F(t) is the cumulative distribution function. To cross check survival estimates in Excel models, the following functions in R can be used to estimate the cumulative distribution function at given time points for each distribution explored in this package (the estimates from the cumulative distribution function can then be subtracted from 1 to estimate the survival probability): * Exponential: pexp() * Weibull: pweibull() * Gompertz: pgompertz() * Log-normal: plnorm() * Log-logistic: pllogis() * Generalized gamma: pgengamma() * Gamma: pgamma() * Generalized F: pgenf() The parameters outputted from each of the fitted models are used as inputs to these functions. The code below gives some examples. ```{r} # Landmark survival # vector of times to estimate survival (days) landmark_times <- c(0, 100, 200, 300) # Example 1: intervention arm, Weibull distribution, common shape model surv_comshp_weibull_int <- 1 - pweibull(landmark_times, shape = psm_PFS_all$parameters_vector["comshp.weibull.shape.int"], scale = psm_PFS_all$parameters_vector["comshp.weibull.scale.int"]) surv_comshp_weibull_int # Example 2: intervention arm, log-normal distribution, separate model surv_sep_lnorm_int <- 1 - plnorm(landmark_times, meanlog = psm_PFS_all$parameters_vector["sep.lnorm.meanlog.int"], sdlog = psm_PFS_all$parameters_vector["sep.lnorm.sdlog.int"]) surv_sep_lnorm_int # Example 3: reference arm, Generalized Gamma distribution, independent shape model surv_indshp_gengamma_ref <- 1 - pgengamma(landmark_times, mu = psm_PFS_all$parameters_vector["indshp.gengamma.mu.ref"], sigma = psm_PFS_all$parameters_vector["indshp.gengamma.sigma.ref"], Q = psm_PFS_all$parameters_vector["indshp.gengamma.Q.ref"]) surv_indshp_gengamma_ref ``` To simplify these actions a helper function is included in the `flexsurvPlus` package that will extract these values directly. This will calculate the same values as above. ```{r} # repeat the prior example for landmark times landmarks_df <- summaryPSM(x = psm_PFS_all, types = "survival", t = landmark_times ) landmarks_df %>% filter(Model == "Common shape", Dist == "Weibull") ``` The same functions can be used to generate the data required to plot the survival curves, overlaid on top of the KM plot. The time argument should reflect how long you want to the extrapolate for and the unit of time is the same as the input data (in this example, days). ```{r} # Plot the common shape models (Weibull distribution) with the Kaplan-Meier # vector of times to estimate survival (days) times <- c(seq(from = 0, to = 1000, by = 0.1)) # Survival probabilities: intervention arm, Weibull distribution, common shape model surv_comshp_weibull_int <- 1 - pweibull(times, shape = psm_PFS_all$parameters_vector["comshp.weibull.shape.int"], scale = psm_PFS_all$parameters_vector["comshp.weibull.scale.int"]) # Survival probabilities: reference arm, Weibull distribution, common shape model surv_comshp_weibull_ref <- 1 - pweibull(times, shape = psm_PFS_all$parameters_vector["comshp.weibull.shape.ref"], scale = psm_PFS_all$parameters_vector["comshp.weibull.scale.ref"]) # Create two data frames that include the survival probablaities and times surv_comshp_weibull_int_times <- data.frame(Time = times, Surv = surv_comshp_weibull_int, Trt = "Intervention") surv_comshp_weibull_ref_times <- data.frame(Time = times, Surv = surv_comshp_weibull_ref, Trt = "Reference") # Plot Kaplan-Meier plot(km.est.PFS, col = c(blue, pink), # plot colours lty = c(1:2), # line type xlab = "Time (Days)", # x-axis label ylab = "Progression-free survival", # y-axis label xlim = c(0, 1000)) # Add legend legend(x = 500, y = .9, legend = c("Arm A", "Arm B"), lty = c(1:2), col = c(blue, pink)) # Add model estimates lines(x = surv_comshp_weibull_int_times$Time, y = surv_comshp_weibull_int_times$Surv, col = blue) lines(x = surv_comshp_weibull_ref_times$Time, y = surv_comshp_weibull_ref_times$Surv, col = pink) ``` The same helper function can also be used to generate plots. ```{r} # repeat the prior example for plot data plot_esurv_df <- summaryPSM(x = psm_PFS_all, types = "survival", t = times ) # a similar function will estimate the values needed for KM estimates plot_km_df <- summaryKM(data = PFS_data, time_var="PFS_days", event_var="PFS_event", strata_var = "ARMCD", int_name="A", ref_name = "B", types = "survival" ) # can then combine these to plot plot_esurv_df %>% dplyr::filter(Model == "Common shape", Dist == "weibull") %>% dplyr::bind_rows(plot_km_df) %>% dplyr::filter(type == "survival", variable == "est") %>% dplyr::mutate(Model_Dist = paste(Model, Dist, sep = " - ")) %>% ggplot(aes(x = time, y = value, color = StrataName, linetype = Model_Dist)) + geom_step(data = function(x){dplyr::filter(x, Model == "Kaplan Meier") }) + geom_line(data = function(x){dplyr::filter(x, Model != "Kaplan Meier") }) ``` # Adressing uncertainty - Bootstrapping Bootstrapping has been used to estimate the uncertainty of the parameters from the survival models. Boostrapping is used for two reasons motivated by intent of this package to support further modeling in excel. 1. To simplify and accelerate calculations in excel while maintaining correlations between parameters (as is commonly done for NMA) 2. To maintain correlations across multiple endpoints (see separate vignette for details) Bootstrapping involves: 1. Sampling, with replacement, from all patients 2. Estimating all specified parametric survival models This procedure is repeated multiple times to obtain a distribution of parameters. For this example, bootstrap estimates of the parameters were calculated using the boot package. An argument for the boot function is statistic which is a function which when applied to data returns a vector containing the statistic(s) of interest. The bootPSM function in the flexsurvPlus package can be used for this purpose. The inputs for the bootPSM function are identical to the runPSM function, however there is one additional argument: * i - Index used to select a sample within boot As the parameters are stored in the config object returned by runPSM it is possible to use do.call to simplify these calls assuming that models have already been fit using runPSM. ```{r} # illustrative example using original analysis models # only create 2 replicates for illustration set.seed(2358) boot_psm_PFS_all <- do.call(boot, args = c(psm_PFS_all$config, R = 2, statistic = bootPSM)) # is the same as set.seed(2358) boot_psm_PFS_demo <- boot( R = 2, # number of bootstrap samples statistic = bootPSM, # bootstrap function data=PFS_data, time_var="PFS_days", event_var="PFS_event", model.type= c("Common shape", "Independent shape", "Separate"), distr = c('exp', 'weibull', 'gompertz', 'lnorm', 'llogis', 'gengamma', 'gamma', 'genf'), strata_var = "ARMCD", int_name="A", ref_name = "B" ) all(boot_psm_PFS_all$t==boot_psm_PFS_demo$t, na.rm = TRUE) ``` For speed and to examine how this can be used we will repeat this selecting only 4 models. ```{r} set.seed(2358) # To minimize vignette computation time only 100 bootstrap samples are taken. In general more samples should be used. n.sim <- 100 psm_PFS_selected <- runPSM( data=PFS_data, time_var="PFS_days", event_var="PFS_event", model.type = c("Common shape", "Separate"), distr = c('weibull', 'gamma'), strata_var = "ARMCD", int_name = "B", ref_name = "A" ) PSM_bootstraps_PFS <- do.call(boot, args = c(psm_PFS_selected$config, statistic = bootPSM, # bootstrap function R=n.sim # number of bootstrap samples ) ) ``` To use the result of these samples it is helpful to do some post processing to make the resulting samples easier to interpret. ```{r} # first extract the bootstrapped parameters into a tibble PFS_bootsamples <- as_tibble(PSM_bootstraps_PFS$t) # then add column names so can identify model and parameter more easily colnames(PFS_bootsamples) <- names(PSM_bootstraps_PFS$t0) # show the first 3 samples PFS_bootsamples[1:3,] ``` # Estimating quantities from the sample As the flexsurv parameterisations are used any quantity of interest can be simply calculated for all models and samples through use of the flexsurv functions such as the extrapolated means via flexsurv::mean_weibull. For this example we assume we have decided that separate models for reference and intervention are most appropriate and that for the reference arm a gamma model is preferred while for the intervention arm a weibull model is best. ```{r} # we can now calculate the mean PFS for the selected models for each bootstrap sample # using weibull for the reference PFS_ref_mean <- with(PFS_bootsamples, flexsurv::mean_gamma(shape = sep.gamma.shape.ref, rate = sep.gamma.rate.ref)) # using gamma for the intervention PFS_int_mean <- with(PFS_bootsamples, flexsurv::mean_weibull(shape = sep.weibull.shape.int, scale = sep.weibull.scale.int)) # we could also calculate these values also for the original data for a deterministic estimate PFS_means <- summaryPSM(psm_PFS_selected, type = "mean") PFS_means %>% dplyr::transmute(Model, Dist, Strata,StrataName, value) %>% dplyr::filter((Dist == "Gamma" & Model == "Separate - Reference") | (Dist == "Weibull" & Model == "Separate - Intervention") ) # we can then calculate the incremental mean PFS from these PFS_delta = PFS_int_mean - PFS_ref_mean # if we are interested we can then estimate quantiles from this quantile(PFS_delta, probs = c(0.025,0.975)) # or plot the density of this derived quantity density <- density(PFS_delta) plot(density, lwd = 2, main = "Density") ``` # Outputing parameters to excel The primary use of the bootstrap samples is to be used in probabilistic sensitivity analyses in economic models. Once all the models have been fit and bootstrap samples estimated they can be output to excel. By selecting the "Main Estimates" the estimates for the original data are returned. To run PSA in excel only a random number between 1 and the number of samples needs to be generated and the associated Bootstrap sample selected. ```{r} # combine the estimates from the main analysis with the bootstrap samples # and add meta data to include details of analysis # first we can get combine the main estimates for the models with those that were bootstrapped parameters_PFS <- rbind(PSM_bootstraps_PFS$t0, as_tibble(PSM_bootstraps_PFS$t)) # we can now add names colnames(parameters_PFS) <- names(PSM_bootstraps_PFS$t0) # we can then label the samples and add some metadata metadata_PFS <- tibble(Estimate = c("Main Estimates", paste("Bootstrap Sample",1:n.sim)), Study_name = "Study ABC", Datacut = "Final", Population = "ITT", Endpoint = "PFS", RefArmName = PSM_bootstraps_PFS$call$ref_name, IntArmName = PSM_bootstraps_PFS$call$int_name) pfs_for_export <- cbind(metadata_PFS, parameters_PFS) # not run #write.csv(parameters_PFS, "params_for_model.csv") ``` # References