The goal of this paper is to help define a path toward a grammar for processing clinical trials by a) defining a format in which we would like to represent standardized clinical trial data, b) describing a standard set of operations to transform clinical trial data into this format, and c) identifying a set of verbs and other functionality to facilitate data processing and encourage reproducibility in the processing of these data. It provides background on standard clinical trial data and walks through a simple preprocessing example illustrating the value of the proposed approach through the use of the forceps package, which is currently being used for data of this kind.
There are few areas of data science research that hold more promise to improve human quality of life and treat disease than the development of methods and analysis in clinical trials. While adjacent, data-focused areas of biomedicine and health-related research have recently seen increased attention, especially the analysis of real-world evidence (RWE) and electronic health records (EHR), clinical trial data maintains several distinct quality advantages, enumerated here.
Along with maintaining higher quality, clinical trial data is more available and more easily accessible than real-world data sources, which often require affiliations with appropriate research institutions as well as infrastructure and appropriate staff, including data managers, to extract data. By contrast, modern clinical trial data organizations allow users to quickly search and download thousands of trials, including anonymized patient-level information. These data sets tend to include control-arm data, which can be used to understand prognostic disease populations and construct historical controls for existing trials. Some also include treatment data, which can be used to characterize predictive patient subtypes for a given treatment, understand safety profiles for classes of drugs, and aid in the design of new trials. We note that Project Data Sphere (Project data sphere: Convener, collaborator, catalyst in the fight against cancer 2020) for oncology and Immport (Immport: Bioinformatics for the future of immunology 2020) outside of oncology have been invaluable in our own experience by facilitating these types of analyses.
During a clinical trial, patient-level data is collected in case report forms (CRFs). The format and data collected in these forms are prescribed in the trial design. These forms are the basis for the construction of analysis data sets and other documents that will be submitted to governing bodies, including the Food and Drug Administration (FDA) and European Medicines Agency (EMA), for approval if the sponsor (the party funding the trial) decides it is appropriate. The Clinical Data Interchange Standards Consortium (CDISC) (Clinical data interchange standards consortium 2020) develops standards dealing with medical research data, including the submission of trial results. Adhering to these standards is necessary for a successful trial submission.
There are several data sets included with a submission that tend to be useful for analysis. This paper focuses on the Analysis Data Model (ADaM) data, which provides patient-level data that has been validated and used for data derivation and analysis. An ADaM data set is itself composed of several data sets, including a Subject-Level Analysis Data Set (ADSL) holding analysis and treatment information. Other information, including baseline characteristics, demographic data, visit information, etc., is held in the Basic Data Structure (BDS) formatted data sets. Finally, adverse events are held in the Analysis Data Sets for Adverse Events (ADAE).
ADaM data for a clinical trial is generally made available as a set of SAS7BDAT (Shotwell et al. 2013) files. While neither the FDA nor the EMA requires this format for submission or the use of SAS (SAS Institute 2020) for analysis, there is a heavy bias toward this data format and computing platform. This is partially because they are validated and approved by governing bodies and because a large effort has gone into their use in submissions. Packages like sas7bdat (Shotwell 2014) and, more recently, haven (Wickham and Miller 2020) have gone a long way to make these data sets easily accessible to R (R Core Team 2012) users working with clinical trial data.
Despite the effort that has gone into defining a structure for the data as well as the tools implemented to aid in their analysis, the data sets themselves are not particularly easy to analyze for two reasons. First, the standard is not “tidy” as defined by Wickham et al. (2014). In particular, it is not required that each variable forms a column. In fact, multiple variables may be stored in one column, with another column acting as a key as to which variable’s value is given. This case is often seen in the ADSL data set, where a single column may hold both primary and secondary endpoints. For data sets like these, the value is held in the Analysis Value (AVAL) column if the corresponding variable is numeric or in the Analysis Value Character (AVALC) column if the variable is a string, with the Parameter Code (PARAMCD) column giving a shortened variable name and the Parameter (PARAM) column providing a text description of the variable. As an example, consider the adakiep.xpt data set, which is provided as an example on the CDISC website and whose data is included in the supplementary material.
# A tibble: 24 × 8
USUBJID PARAM PARAMCD AVALC ADY ADT SRCDOM SRCSEQ
<chr> <chr> <chr> <chr> <dbl> <date> <chr> <dbl>
1 XYZ-001-001 Death DEATH Y 85 2013-11-02 DS 1
2 XYZ-001-001 Dialysis DIALYS… Y 80 2013-10-29 PR 2
3 XYZ-001-001 eGFR 25 P… EGFRDEC N 85 2013-11-02 <NA> NA
4 XYZ-001-001 Composite… AKIEP Y 80 2013-10-29 <NA> NA
5 XYZ-001-002 Death DEATH Y 82 2015-03-20 DS 1
6 XYZ-001-002 Dialysis DIALYS… Y 73 2015-03-11 PR 2
7 XYZ-001-002 eGFR 25 P… EGFRDEC N 82 2015-03-20 <NA> NA
8 XYZ-001-002 Composite… AKIEP Y 73 2015-03-11 <NA> NA
9 XYZ-001-003 Death DEATH N 94 2010-10-12 DS 1
10 XYZ-001-003 Dialysis DIALYS… Y 64 2010-09-12 PR 2
# … with 14 more rows
# ℹ Use `print(n = ...)` to see more rows
The data set includes minimal information about the trial. However, we can infer that it is from a study focusing on kidney disease. There are four distinct endpoints: death, whether dialysis was needed, whether there was a 25% decrease in estimated glomerular filtration rate (eGFR), which indicates a decrease in kidney function, and a composite endpoint (AKIEP). For analysis, these data will need to be rearranged so that each endpoint has its own column, along with another column per endpoint indicating the trial day on which the measurement was taken (from the ADY column).
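The rearrangement described above can be sketched with tidyr's pivot_wider(). This is a minimal illustration, not code from the paper: the small data frame below is typed in by hand to mimic the layout of the printed output, and the full parameter code DIALYSIS is an assumption because the printed value is truncated.

```r
library(tidyr)

# Hand-typed, illustrative records in the long ADaM-style layout.
# "DIALYSIS" is an assumed spelling; the printed code above is truncated.
long <- data.frame(
  USUBJID = rep(c("XYZ-001-001", "XYZ-001-003"), each = 2),
  PARAMCD = rep(c("DEATH", "DIALYSIS"), times = 2),
  AVALC   = c("Y", "Y", "N", "Y"),
  ADY     = c(85, 80, 94, 64),
  stringsAsFactors = FALSE
)

# One row per subject; each endpoint gets a value column and a day column.
wide <- pivot_wider(
  long,
  id_cols     = USUBJID,
  names_from  = PARAMCD,
  values_from = c(AVALC, ADY)
)

print(wide)
```

Each endpoint now contributes two columns, one prefixed with AVALC for the outcome and one prefixed with ADY for the study day.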
The task of transforming these types of data into a form appropriate for analysis is complicated by the fact that there may be other files with relevant information in a similar layout, or in slightly more complicated layouts if they include longitudinal information, for example. The rest of this paper focuses on shaping these types of data so that they can be quickly understood; they are amenable to many different types of analyses at the individual patient level; and they can be reformatted for an even larger class of analyses through a minimal set of verbs, including cohorting, which is introduced in this paper and is implemented in the forceps package (Kane 2020). The package is currently in development and has not been released to CRAN. However, it has been tagged for prerelease on GitHub and can be installed with the following code.
devtools::install_github("kaneplusplus/forceps@v0.0.5")
The next section specifies the target data shape, which can be thought of as a restriction on the tidy format. The following section specifies the steps needed to prepare clinical trial data so that it conforms to this restriction and includes an anonymized trial example. The final section provides a roadmap of near-term development as well as directions for enhancements and integration with the larger R ecosystem.
Clinical trial data is collected to be used in an analysis that determines whether or not a treatment for a disease provides a benefit when compared to those receiving either a placebo or the standard of care. “Benefit” is quantified by one or more endpoints, defined before the trial starts (in the design), which are compared across arms (treatment and placebo) at the conclusion of the trial using a statistical test. These data provide a wealth of information, and their usefulness extends well beyond the scope of the trial. For example, they can be used to understand prognostic characteristics of the disease population, to create a “historical control” for another trial, to identify patient characteristics associated with better outcomes, etc.
As shown in the previous section, while ADaM-formatted data is structured, the structure does not lend itself to analysis without first performing some data transformations. We propose that the result of these transformations is a single data set with the following characteristics.
data.frame per subject.

Data conforming to these characteristics provide several advantages over ADaM data sets. First, they are oriented towards trial analysis. Essentially, trials compare response rates between treatment and control arms. Having those values coded as their own variables in a single data set minimizes the complexity and effort that would otherwise go into extracting data from multiple files, cleaning them, and joining them. Second, it minimizes the reshaping effort for other types of analyses. For example, response rates are often analyzed by the site at which patient measurements were taken in order to check for certain types of enrollment heterogeneity. The described patient-centric format can be transformed into a site-centric format by nesting or grouping on a site variable, followed by the extraction of site-specific features and analyses, which can then be compared across sites. Transforming between these formats requires a single operation. Likewise, the patient-centric format can be transformed to a patient-longitudinal format by unnesting on the embedded variable holding the relevant longitudinal information. Third and finally, creating a single patient-centric data set minimizes the chance of inconsistent analyses. Primary and secondary analyses often use similar variables and may require similar preprocessing. If these preprocessing steps are performed separately for parallel analyses, then the probability that at least one of them contains an error in these steps is greater than when a single validated patient-centric data set is created. It also makes it easier to provide provenance for analyses if they depend on the same preprocessed data.
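The two reshaping operations mentioned above can be sketched with dplyr and tidyr. This is a minimal example on made-up data: the responder column, the contents of ae_long, and all values are hypothetical and are not taken from the paper's data sets.

```r
library(dplyr)
library(tidyr)

# Hypothetical patient-centric data: one row per subject, with
# longitudinal adverse event data embedded in a list column.
patients <- tibble(
  usubjid   = c(1003, 1005, 1006, 1009),
  site_id   = c(1, 4, 1, 4),
  responder = c(TRUE, FALSE, TRUE, TRUE),
  ae_long   = list(
    tibble(ae = c("NAUSEA", "RASH"), ae_day = c(3, 10)),
    tibble(ae = "FATIGUE", ae_day = 7),
    tibble(ae = "COUGH", ae_day = 2),
    tibble(ae = "HEADACHE", ae_day = 15)
  )
)

# Patient-centric to site-centric: group on the site variable and
# extract site-specific features.
by_site <- patients %>%
  group_by(site_id) %>%
  summarize(n_patients = n(), response_rate = mean(responder))

# Patient-centric to patient-longitudinal: unnest the embedded variable.
longitudinal <- patients %>%
  unnest(ae_long)

print(by_site)
print(longitudinal)
```

In both directions the patient-centric table is the starting point, so the site-level summary and the longitudinal view are derived from the same preprocessed data.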
This section provides an example of how to use the functionality provided in the forceps package, in the order that the operations take place. The data set is provided with the package, and the variable names are taken from several example lung cancer studies. The data set has been significantly reduced in size, and some values and variable names have been preprocessed. This allows the example to remain easy to follow. It also allows us to illustrate the formation of a patient-centric data set in a single pass. In practice, this is often an iterative process, requiring several revisions as bugs are found and hypotheses change.
The data sets used are as follows, and the task will be to create a patient-centered data set as described above.
lc_adverse_events - adverse events longitudinal data.
lc_biomarkers - patient biomarkers.
lc_demography - patient demographic information.
lc_adsl - response data.

SAS ADaM formatted data sets generally include extra information about variables, including a short description of each of the variables and possibly formatting information. The haven package keeps this as attributes of each of the columns of a tibble that is read from these files. The forceps package is capable of extracting this meta-information to create a tibble that can be used as a data dictionary using the consolidated_describe_data() function, shown below. In practice, we have found it helpful as a starting point for a fuller description of the data and often add columns to further categorize individual variables for analyses.
library(forceps)
data(lc_adverse_events)
data(lc_biomarkers)
data(lc_demography)
data(lc_adsl)
consolidated_describe_data(lc_adverse_events,
                           lc_biomarkers,
                           lc_demography,
                           lc_adsl)
# A tibble: 27 × 5
var_name type label forma…¹ data_…²
<chr> <chr> <chr> <chr> <chr>
1 usubjid double Randomization Code <NA> lc_adv…
2 ae character AE Preferred Term charac… lc_adv…
3 ae_type character System Organ Class 1 charac… lc_adv…
4 grade integer Adverse Event Grade? numeric lc_adv…
5 ae_day double Days From First Dose (nume… <NA> lc_adv…
6 ae_duration double Adverse Event Duration numeric lc_adv…
7 ae_treat logical Was the Adverse Event Trea… logical lc_adv…
8 ae_count integer Total Patient Adverse Even… integer lc_adv…
9 usubjid double Randomization Code <NA> lc_bio…
10 egfr_mutation character EGFR Mutation +ve/-ve Resu… charac… lc_bio…
# … with 17 more rows, and abbreviated variable names ¹format_sas,
# ²data_source
# ℹ Use `print(n = ...)` to see more rows
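To make the idea of an attribute-based data dictionary concrete, the following base R sketch collects column metadata into one table. The helper describe_columns() is hypothetical and is not the forceps implementation; it assumes that descriptions are stored in the "label" attribute and SAS formats in the "format.sas" attribute, which is how haven attaches them. The example labels are typed in by hand.

```r
# Hypothetical helper (not part of forceps): build a small data
# dictionary from the attributes attached to each column.
describe_columns <- function(x, source) {
  get_attr <- function(col, key) {
    value <- attr(col, key, exact = TRUE)
    if (is.null(value)) NA_character_ else as.character(value)[1]
  }
  data.frame(
    var_name    = names(x),
    type        = vapply(x, typeof, character(1)),
    label       = vapply(x, get_attr, character(1), key = "label"),
    format_sas  = vapply(x, get_attr, character(1), key = "format.sas"),
    data_source = source,
    row.names   = NULL,
    stringsAsFactors = FALSE
  )
}

# Illustrative data with hand-assigned labels.
demo <- data.frame(
  usubjid = c(1003, 1005),
  sex     = c("male", "female"),
  stringsAsFactors = FALSE
)
attr(demo$usubjid, "label") <- "Randomization Code"
attr(demo$sex, "label") <- "Sex"

dictionary <- describe_columns(demo, source = "demo")
print(dictionary)
```

Columns without a given attribute are reported as NA, which mirrors the missing format entries in the output above.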
The data dictionary (or data description) provides a summary of the variable types and information held by variables in each of the data sets. Some data sets will include repeated, longitudinal, or time series information about individual patients, like lc_adverse_events in our example. Consolidating data sets like these into a single, patient-centric data set generally involves three distinct operations. The first can be thought of as pivot_wider() operations that take columns composed of multiple variables and spread them across new columns in the data set. The second nest()s the data so that the resulting data set contains time-varying data embedded in a data.frame variable, while those variables whose values are repeated within a patient appear once per patient. This verb, which is referred to as cohort() in the package, takes the variable to cohort on (usubjid in the example below), checks which variables are repeated by the subject identifier (ae_count in the example below) and which are not, and handles the nesting appropriately. A final operation may be applied to the patient-level embedded data.frame objects to extract other features that will be used in subsequent analyses.
# A tibble: 6 × 8
usubjid ae ae_type grade ae_day ae_du…¹ ae_tr…² ae_co…³
<dbl> <chr> <chr> <int> <dbl> <dbl> <lgl> <int>
1 1003 BURNING SENSAT… NERVOU… 1 27 4 FALSE 15
2 1003 CONSTIPATION GASTRO… 2 4 4 TRUE 15
3 1003 DEPRESSION PSYCHI… 2 66 NA FALSE 15
4 1003 BACK PAIN MUSCUL… 2 27 NA TRUE 15
5 1003 DYSURIA RENAL … 2 1 3 TRUE 15
6 1003 SKIN EXFOLIATI… SKIN A… 1 5 26 FALSE 15
# … with abbreviated variable names ¹ae_duration, ²ae_treat,
# ³ae_count
library(dplyr)  # provides the %>% pipe used below
lc_adverse_events <- lc_adverse_events %>%
  cohort(on = "usubjid", name = "ae_long")
lc_adverse_events %>% head()
# A tibble: 6 × 3
usubjid ae_count ae_long
<dbl> <int> <list>
1 1003 15 <tibble [15 × 6]>
2 1005 19 <tibble [19 × 6]>
3 1006 11 <tibble [11 × 6]>
4 1009 12 <tibble [12 × 6]>
5 1014 5 <tibble [5 × 6]>
6 1018 10 <tibble [10 × 6]>
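The logic behind the cohorting step shown above can be sketched in base R. The function cohort_sketch() is a hypothetical stand-in written for illustration; it is not the forceps implementation of cohort(), which may differ in its checks and return type. The idea is to keep variables that are constant within a subject at the top level and embed everything else in a list column.

```r
# Hypothetical stand-in for the cohorting idea (not forceps::cohort()).
cohort_sketch <- function(x, on, name = "data") {
  others <- setdiff(names(x), on)
  # A variable is "repeated" if it takes a single value within every subject.
  is_constant <- vapply(others, function(v) {
    all(tapply(x[[v]], x[[on]], function(z) length(unique(z)) == 1L))
  }, logical(1))
  constant <- others[is_constant]
  varying  <- others[!is_constant]

  pieces <- split(x, x[[on]])
  # One row per subject holding the key and the constant variables.
  out <- do.call(rbind, lapply(pieces, function(p) {
    p[1, c(on, constant), drop = FALSE]
  }))
  rownames(out) <- NULL
  # Embed the time-varying variables as one data.frame per subject.
  out[[name]] <- lapply(pieces, function(p) {
    p[, varying, drop = FALSE]
  })
  out
}

# Illustrative adverse event records (values made up).
ae <- data.frame(
  usubjid  = c(1003, 1003, 1005),
  ae       = c("CONSTIPATION", "BACK PAIN", "NAUSEA"),
  ae_day   = c(4, 27, 12),
  ae_count = c(2, 2, 1),
  stringsAsFactors = FALSE
)

ae_cohorted <- cohort_sketch(ae, on = "usubjid", name = "ae_long")
str(ae_cohorted, max.level = 1)
```

Here ae_count stays at the subject level because it never varies within a subject, while ae and ae_day move into the embedded ae_long data frames.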
After cohorting, each of the data sets is in the specified format, and we are almost ready to combine them. It is important to first check whether any variables are repeated across the individual data sets and to detect conflicts. While ADaM data sets should be free of conflicts and redundancies, we have observed multiple cases where this is not true. In order to identify these issues, the duplicated_vars() function is provided. The function compares the column names of each data set against those of the others. The object returned is a named list where each name corresponds to a variable that is repeated. Each list element is a tibble, joined by the on parameter, with columns corresponding to the on variable, the name of the duplicated variable, and the data sets where the duplicated variable appears. The example below shows that the chemo_stop variable appears in the demography and adsl data sets. Furthermore, we can see that the values in the two data sets differ by comparing the demography and adsl columns. To fix this and move on, we will remove the variable from the demography data set.
data(lc_adsl)
data(lc_biomarkers)
data(lc_demography)
data_list <- list(demography = lc_demography,
                  biomarkers = lc_biomarkers,
                  adverse_events = lc_adverse_events,
                  adsl = lc_adsl)
duplicated_vars(data_list, on = "usubjid")
$chemo_stop
# A tibble: 558 × 4
usubjid var demography adsl
<dbl> <chr> <chr> <chr>
1 1003 chemo_stop patient discontinued adverse events
2 1005 chemo_stop treatment ineffective adverse events
3 1006 chemo_stop <NA> treatment ineffective
4 1009 chemo_stop treatment ineffective <NA>
5 1014 chemo_stop <NA> adverse events
6 1018 chemo_stop treatment ineffective treatment ineffective
7 1023 chemo_stop <NA> adverse events
8 1025 chemo_stop adverse events adverse events
9 1030 chemo_stop adverse events adverse events
10 1033 chemo_stop adverse events treatment ineffective
# … with 548 more rows
# ℹ Use `print(n = ...)` to see more rows
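The duplicate check described above can be approximated in a few lines of base R. The function find_shared_vars() is a hypothetical illustration, not the forceps implementation, and the two small data frames contain made-up values. For every non-key variable that appears in more than one data set, it lines the competing values up side by side, one column per data set.

```r
# Hypothetical illustration of a cross-data-set duplicate check.
find_shared_vars <- function(data_list, on) {
  all_names <- unlist(lapply(data_list, function(d) setdiff(names(d), on)))
  shared <- unique(all_names[duplicated(all_names)])
  out <- lapply(shared, function(v) {
    has_var <- vapply(data_list, function(d) v %in% names(d), logical(1))
    holders <- data_list[has_var]
    # Keep the key and the shared variable, renamed after its data set.
    pieces <- Map(function(d, nm) {
      piece <- d[, c(on, v), drop = FALSE]
      names(piece)[2] <- nm
      piece
    }, holders, names(holders))
    # A full outer join keeps subjects that are missing from one data set.
    Reduce(function(a, b) merge(a, b, by = on, all = TRUE), pieces)
  })
  names(out) <- shared
  out
}

demography <- data.frame(
  usubjid    = c(1003, 1005),
  age        = c(51, 44),
  chemo_stop = c("adverse events", "treatment ineffective"),
  stringsAsFactors = FALSE
)
adsl <- data.frame(
  usubjid    = c(1003, 1005),
  arm        = c("treatment", "control"),
  chemo_stop = c("adverse events", "adverse events"),
  stringsAsFactors = FALSE
)

shared <- find_shared_vars(
  list(demography = demography, adsl = adsl),
  on = "usubjid"
)
print(shared$chemo_stop)
```

Rows where the demography and adsl columns disagree are the conflicts that need to be resolved before the data sets are combined.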
The last step is to consolidate the data sets into a single one. This is accomplished by reducing the data_list using full joins, along with some extra checking. The consolidate() function wraps this functionality. The result conforms to the specified format and can easily be used in the exploration and analysis stage.
data_list$demography$chemo_stop <- NULL  # drop the conflicting copy, as described above
consolidate(data_list, on = "usubjid")
# A tibble: 558 × 18
usubjid site_id sex refrac…¹ age egfr_…² smoking ecog prior…³
<dbl> <int> <chr> <lgl> <dbl> <chr> <chr> <chr> <chr>
1 1003 1 male FALSE 51 negati… former… ambu… comple…
2 1005 4 female TRUE 44 negati… former… ambu… partia…
3 1006 2 male TRUE 22 negati… former… ambu… comple…
4 1009 8 male FALSE 44 <NA> unknown ambu… comple…
5 1014 6 male TRUE 76 <NA> former… ambu… partia…
6 1018 10 female TRUE 35 positi… former… ambu… comple…
7 1023 6 female TRUE 73 <NA> former… ambu… comple…
8 1025 7 male FALSE 71 <NA> never … ambu… partia…
9 1030 5 female TRUE 20 <NA> unknown ambu… partia…
10 1033 6 female TRUE 55 <NA> unknown ambu… stable…
# … with 548 more rows, 9 more variables: ae_count <int>,
# ae_long <list>, best_response <chr>, pfs_days <dbl>,
# pfs_censor <dbl>, os_days <dbl>, os_censor <dbl>,
# chemo_stop <chr>, arm <chr>, and abbreviated variable names
# ¹refractory, ²egfr_mutation, ³prior_resp
# ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
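The reduction with full joins can be sketched as follows. The function consolidate_sketch() is a hypothetical illustration; the checks that consolidate() actually performs are not described in the text, so the duplicate-name check here is an assumption about what "extra checking" might include. The arm values are made up.

```r
library(dplyr)

# Hypothetical illustration; forceps' consolidate() may check other things.
consolidate_sketch <- function(data_list, on) {
  # Assumed check: each non-key variable may appear in only one data set.
  vars <- unlist(lapply(data_list, function(d) setdiff(names(d), on)))
  repeated <- unique(vars[duplicated(vars)])
  if (length(repeated) > 0) {
    stop("Variables repeated across data sets: ",
         paste(repeated, collapse = ", "))
  }
  # Reduce the list pairwise with full joins so that no subject is dropped.
  Reduce(function(a, b) full_join(a, b, by = on), data_list)
}

demography <- tibble(usubjid = c(1003, 1005, 1006), age = c(51, 44, 22))
adsl <- tibble(
  usubjid = c(1005, 1006, 1009),
  arm     = c("treatment", "control", "treatment")
)

combined <- consolidate_sketch(
  list(demography = demography, adsl = adsl),
  on = "usubjid"
)
print(combined)
```

Because full joins are used, subjects that appear in only one data set are retained, with missing values for the variables they lack.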
As stated before, the goal of this paper is to help define a path toward a grammar for processing clinical trials by a) defining a format in which we would like to represent standardized clinical trial data, b) describing a standard set of operations to transform clinical trial data into this format, and c) identifying a set of verbs and other functionality to facilitate data processing of this kind and encourage reproducibility of these steps. Admittedly, this only serves to ease the process of preparing these types of data for exploration and analysis. Clinical trial data generally contains many more variables than what was presented, and each of these data sets comes with its own set of “quirks” and other challenges. However, it does serve to make the data preparation better defined and to propose a path toward standardization of both the processed data set format and the operations to achieve that goal.
Along with further development towards those ends, there is a great deal of development that can be done to provide an integrated data processing experience. For example, the define.xml file, which appears alongside ADaM data sets, gives better descriptions of the variables as well as the variable values. Tools to integrate these data into the construction of the data dictionary would go a long way towards orienting researchers to the data contained and help them more quickly formulate analyses. Packages like lumberjack (van der Loo 2020) could enhance and augment data preprocessing steps by keeping better track of when data are being removed and how they are being manipulated. The artifacts accumulated could then be used by packages such as ggconsort (Higgins 2020) to provide CONSORT diagrams of how patients progress through the trial and how data progresses through preprocessing. In the longer term, these advancements can provide better data provenance, more reproducible processing, and quicker debugging of problems in the processing stage, and can give rise to more effective and convenient tools for summarizing trial data.
Supplementary materials are available in addition to this article. They can be downloaded at RJ-2021-052.zip
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
Kane, "The R Journal: Towards a Grammar for Processing Clinical Trial Data", The R Journal, 2022
BibTeX citation
@article{RJ-2021-052, author = {Kane, Michael J.}, title = {The R Journal: Towards a Grammar for Processing Clinical Trial Data}, journal = {The R Journal}, year = {2022}, note = {https://doi.org/10.32614/RJ-2021-052}, doi = {10.32614/RJ-2021-052}, volume = {13}, issue = {1}, issn = {2073-4859}, pages = {563-569} }