Mathematical Oncology

Data science can help us run better cancer clinical trials

Behind the paper

Written by Deborah Plana - February 14, 2022

Cancer patient survival can be parametrized to improve trial precision and reveal time-dependent therapeutic effects

Deborah Plana, Geoffrey Fell, Brian M. Alexander, Adam C. Palmer, Peter K. Sorger

Clinical trials are the most expensive and highest-stakes experiments in biomedicine. They are designed to answer a single but important question: is a medical intervention superior to standard treatment in improving an outcome of interest (e.g., do cancer patients live longer by taking a new drug)? The results of these trials determine which innovative therapies make it to the clinic, and they represent the frontier of medical progress.

Yet the core methods used to analyze the results of these trials, and to decide which drugs are used to treat patients, have remained largely the same since the 1970s [1,2]. This is not for lack of innovation in mathematical modeling or biostatistical methods, but mostly because of a dearth of data. No matter how groundbreaking a method is in theory, it is hard to prove its importance until it has been tested on a large set of empirical data. Sadly, even with recent attempts to improve access to individual participant data [3-5], such information remains largely unavailable, and new clinical trials rarely make their full results freely available to the public [6].

Our recent work [7] in Nature Communications aims to modernize the study and interpretation of clinical trials by curating, releasing, and re-analyzing data reconstructed from ~150 clinical trials in breast, colorectal, lung, and prostate cancer (procedure described in Figure 1; all data are available on the accompanying website and in the Synapse repository [8]). These trials were published between 2014 and 2016, enrolled patients with metastatic and non-metastatic disease, and included treatments such as chemotherapies, immune checkpoint inhibitors, targeted therapies, radiotherapy, surgery, and placebo. We believe that this website hosts one of the largest and most diverse sets of publicly available individual participant data in oncology. Resources like The Cancer Genome Atlas (TCGA) helped revolutionize our understanding of genomics in oncology; we hope that efforts such as ours can help do the same for the study and execution of clinical trials.


Figure 1: Procedure for parameterizing survival curves starting with published figures. (a) A Kaplan-Meier survival curve and at-risk table are obtained from a clinical trial publication. Individual participant data were imputed from digitized survival curves and at-risk tables as previously described [9]. (b) Each set of parameters corresponds to a different probability density function and survival function (1 minus the cumulative distribution function). The likelihood of observing the actual data is then computed. (c) The likelihood calculation is repeated for a set of possible parameter values. (d) The most likely (best) fit is obtained by finding the parameter values with the maximum likelihood.
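
The fitting steps in panels (b) to (d) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the paper: it assumes a Weibull survival function S(t) = exp(-(t/scale)^shape), treats patients without an observed event as right-censored, and runs on synthetic data. The function names, the parameter grid, and the simulated cohort are all hypothetical choices made for the example.

```python
import numpy as np

def weibull_log_likelihood(shape, scale, times, observed):
    """Log-likelihood of right-censored survival data under a Weibull model."""
    z = times / scale
    # Patients with an event contribute the log density f(t);
    # censored patients contribute the log survival probability S(t).
    log_density = np.log(shape / scale) + (shape - 1.0) * np.log(z) - z ** shape
    log_survival = -(z ** shape)
    return np.sum(np.where(observed, log_density, log_survival))

def fit_weibull_grid(times, observed, shapes, scales):
    """Evaluate the likelihood on a parameter grid and return the best pair."""
    surface = np.array([[weibull_log_likelihood(k, lam, times, observed)
                         for lam in scales] for k in shapes])
    i, j = np.unravel_index(np.argmax(surface), surface.shape)
    return shapes[i], scales[j], surface

# Synthetic cohort (an assumption for illustration, not trial data):
# event times in months, with uniformly distributed censoring times.
rng = np.random.default_rng(0)
true_shape, true_scale = 1.3, 18.0
event_times = true_scale * rng.weibull(true_shape, size=300)
censor_times = rng.uniform(6.0, 36.0, size=300)
times = np.minimum(event_times, censor_times)
observed = event_times <= censor_times

shapes = np.linspace(0.5, 3.0, 101)
scales = np.linspace(5.0, 40.0, 141)
best_shape, best_scale, _ = fit_weibull_grid(times, observed, shapes, scales)

# Fitted survival probability at 12 months.
survival_12 = np.exp(-(12.0 / best_scale) ** best_shape)
print(f"shape={best_shape:.2f}, scale={best_scale:.1f}, S(12)={survival_12:.2f}")
```

A grid search mirrors the figure most directly; in practice a numerical optimizer would usually replace the grid once the likelihood function is in place.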

Some key insights emerged from analyzing this dataset. First, a simple two-parameter distribution (the Weibull distribution) can accurately describe survival data from oncology clinical trials. Second, when Weibull distributions are fitted to survival data, parameterized 50-patient trial arms are as accurate and precise in describing drug efficacy (i.e., percent overall survival at 12 months) as 90-patient arms evaluated by traditional non-parametric statistics. This could improve the precision of early-phase trials, yielding the same signal on a drug’s activity with fewer patients than conventional methods require. Third, the length of a trial is related to its likelihood of succeeding, and this effect depends on the shape of the survival curves in the trial. We hope that these findings highlight the power of re-analyzing published trial results and inspire others to release de-identified participant events.
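
To give a feel for the second point, the toy simulation below compares two estimates of 12-month survival in repeated 50-patient arms: the non-parametric Kaplan-Meier estimate and a Weibull maximum-likelihood fit. It is a sketch under strong assumptions, not a reproduction of the paper's analysis: the data are synthetic and drawn from a Weibull distribution (which favors the parametric estimate), and the shape, scale, censoring pattern, and function names are all hypothetical. It is not expected to recover the 50-versus-90-patient figure quoted above.

```python
import numpy as np

def km_survival_at(t_eval, times, observed):
    """Kaplan-Meier estimate of S(t_eval), assuming no tied event times."""
    order = np.argsort(times)
    t, d = times[order], observed[order]
    at_risk = len(t) - np.arange(len(t))  # patients still at risk at each time
    factors = np.where(d & (t <= t_eval), 1.0 - 1.0 / at_risk, 1.0)
    return float(np.prod(factors))

def weibull_survival_at(t_eval, times, observed,
                        shapes=np.linspace(0.3, 4.0, 371)):
    """Weibull maximum-likelihood estimate of S(t_eval) via a profile likelihood.

    For a fixed shape k, the likelihood is maximized in closed form by
    scale**k = sum(t**k) / (number of events), so only k is searched on a grid.
    Assumes at least one observed event.
    """
    n_events = observed.sum()
    sum_log_events = np.log(times[observed]).sum()
    theta = (times[:, None] ** shapes[None, :]).sum(axis=0) / n_events
    profile = (n_events * np.log(shapes) - n_events * np.log(theta)
               + (shapes - 1.0) * sum_log_events - n_events)
    best = np.argmax(profile)
    return float(np.exp(-(t_eval ** shapes[best]) / theta[best]))

def simulate(n_patients=50, n_trials=1000, seed=1):
    """Repeat a synthetic single-arm trial and collect both estimates of S(12)."""
    rng = np.random.default_rng(seed)
    km, wb = [], []
    for _ in range(n_trials):
        event = 18.0 * rng.weibull(1.3, size=n_patients)    # months
        censor = rng.uniform(6.0, 36.0, size=n_patients)    # follow-up limit
        times = np.minimum(event, censor)
        observed = event <= censor
        km.append(km_survival_at(12.0, times, observed))
        wb.append(weibull_survival_at(12.0, times, observed))
    return np.array(km), np.array(wb)

km_est, wb_est = simulate()
true_s12 = np.exp(-(12.0 / 18.0) ** 1.3)
print(f"true S(12)    = {true_s12:.3f}")
print(f"Kaplan-Meier: mean={km_est.mean():.3f}, sd={km_est.std():.3f}")
print(f"Weibull fit:  mean={wb_est.mean():.3f}, sd={wb_est.std():.3f}")
```

The quantity to compare is the spread (sd) of each estimator across simulated trials: a smaller spread at the same arm size means the same precision could be reached with fewer patients, provided the Weibull model describes the data well.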

Patients make an invaluable gift to society by enrolling in clinical trials, and they overwhelmingly support making their data public [10]. Making the most out of their contribution by releasing and re-analyzing clinical trial results is both a scientific opportunity and an ethical responsibility.


  1. Kaplan, E. L. & Meier, P. Nonparametric Estimation from Incomplete Observations. Journal of the American Statistical Association 53, 457–481 (1958).
  2. Cox, D. R. Regression Models and Life-Tables. Journal of the Royal Statistical Society. Series B (Methodological) 34, 187–220 (1972).
  3. Hede, K. Project Data Sphere to Make Cancer Clinical Trial Data Publicly Available. JNCI Journal of the National Cancer Institute 105, 1159–1160 (2013).
  4. Ross, J. S. et al. Overview and experience of the YODA Project with clinical trial data sharing after 5 years. Sci Data 5, 180268 (2018).
  5. National Cancer Institute (NCI). National Clinical Trials Network (NCTN) and NCI Community Oncology Research Program (NCORP) Data Archive. (2021).
  6. Danchev, V., Min, Y., Borghi, J., Baiocchi, M. & Ioannidis, J. P. A. Evaluation of Data Sharing After Implementation of the International Committee of Medical Journal Editors Data Sharing Statement Requirement. JAMA Network Open 4, e2033972 (2021).
  7. Plana, D., Fell, G., Alexander, B. M., Palmer, A. C. & Sorger, P. K. Cancer patient survival can be parametrized to improve trial precision and reveal time-dependent therapeutic effects. Nat Commun 13, 873 (2022).
  8. Plana, D., Fell, G., Alexander, B. M., Palmer, A. C. & Sorger, P. K. Imputed individual participant data from oncology clinical trials. (2021).
  9. Fell, G. et al. KMDATA: a curated database of reconstructed individual patient-level data from 153 oncology clinical trials. Database 2021, baab037 (2021).
  10. Mello, M. M., Lieou, V. & Goodman, S. N. Clinical Trial Participants’ Views of the Risks and Benefits of Data Sharing. New England Journal of Medicine 378, 2202–2211 (2018).