Data Sets and Data Access Units


Data Set

A St. Jude Cloud Data Set is a grouping of data that has been curated by St. Jude and can correspond to a study, project, or specific disease. Data Sets are available for free to researchers, and access to a Data Set can be requested from the Data Browser. However, access is granted not at the Data Set level but at the level of the Data Access Unit (DAU). A single Data Set may belong to only one DAU, or it may belong to several if it contains data that came from different groups.

An approved Data Access Request (DAR) grants access to a particular Data Access Unit, which includes specific Data Sets that can be selected from the Data Browser. An approved DAR gives access not only to the data selected at the time, but also to any additional data included in the approved DAUs. Only the data initially selected will be vended to a project folder upon approval, but returning to the Data Browser and selecting additional data that falls under the approved DAUs will not require another Data Access Request.

See the list of Data Sets.

Data Access Unit

A St. Jude Cloud Data Access Unit (DAU) is a grouping of data that typically corresponds to a project, study, or Data Set generated at the same time at the same institution. Each DAU has its own governing body of researchers, the Data Access Committee, who preside over the data and who may grant or deny access. Each Data Access Committee is responsible for only one DAU and has its own protocols for approving access to its DAU. Please contact us if you have questions about committee approval protocols. We currently have 6 DAUs: Pediatric Cancer Genome Project (PCGP), St. Jude Lifetime Cohort Study (SJLIFE), Genomes for Kids (G4K) and Clinical Genomics, Sickle Cell Genome Project (SGP), Childhood Cancer Survivor Study (CCSS), and Pan-Acute Lymphoblastic Leukemia (PanALL). See below for a brief description of each DAU. For a more detailed description, please see the respective Schedule 1(s).

See the list of Data Access Units.

Data Access Committee

A St. Jude Cloud Data Access Committee (DAC) is a group of St. Jude researchers who oversee access to a particular Data Access Unit (DAU) and evaluate incoming data requests.

The first time you request access to files in a DAU, you must fill out a Data Access Agreement (DAA). Access is granted at the DAU level based on the decision of each DAC upon reviewing the DAA.

Example

For example, if you make a request asking for all of St. Jude’s Acute Lymphoblastic Leukemia sequencing data, you might be asking for data from multiple different projects (DAUs) here at St. Jude. For the sake of the example, let’s say the data you want is spread across three different Data Sets and two DAUs. Once you place a request, your application will be routed to the corresponding two data access committees for approval. Since each DAC is made up of different individuals using different criteria for evaluation, you may or may not be approved for access to all of the files.
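The routing described in this example, together with the rule that an approved DAU also covers later selections, can be sketched in a few lines of Python. This is only an illustration: the Data Set names and DAU memberships are copied from the List of Data Sets on this page, the function names are hypothetical, and the sketch assumes that selecting a Data Set involves every DAU listed for it.

```python
# DAU membership copied from the "List of Data Sets" section of this page.
DATA_SET_DAUS = {
    "Childhood Cancer Survivor Study": {"CCSS"},
    "Childhood Solid Tumor Network": {"PCGP", "Clinical Genomics"},
    "Cicero Benchmark": {"PCGP", "Clinical Genomics"},
    "Clinical Pilot": {"PCGP", "Clinical Genomics"},
    "Genome 4 Kids": {"Clinical Genomics"},
    "Pan-Acute Lymphoblastic Leukemia": {"PanALL"},
    "Pediatric Cancer Genome Project": {"PCGP"},
    "Real-time Clinical Genomics": {"Clinical Genomics"},
    "Sickle Cell Genome Project": {"SGP"},
    "St. Jude Life": {"SJLIFE"},
}


def daus_for_selection(data_sets):
    """Return the DAUs (one committee each) that a selection touches.

    Assumption: selecting a Data Set involves every DAU it belongs to.
    """
    needed = set()
    for name in data_sets:
        needed |= DATA_SET_DAUS[name]
    return needed


def daus_needing_request(data_sets, approved_daus):
    """Return the DAUs in a selection that have not been approved yet."""
    return daus_for_selection(data_sets) - set(approved_daus)


# Three Data Sets spread across two DAUs, so two committees review the request.
selection = [
    "Genome 4 Kids",
    "Real-time Clinical Genomics",
    "Pan-Acute Lymphoblastic Leukemia",
]
print(sorted(daus_for_selection(selection)))

# A later selection only needs a new request for DAUs not already approved.
approved = {"Clinical Genomics", "PanALL"}
print(sorted(daus_needing_request(["Clinical Pilot"], approved)))
```

In this sketch, the first selection touches Clinical Genomics and PanALL. The later Clinical Pilot selection would still need approval for PCGP, because only Clinical Genomics had been approved.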

Embargo Date

The Embargo Date is the date on which the publishing embargo on a given file is lifted. Publishing using any of the files before the embargo date has passed is strictly prohibited, as outlined in the Data Access Agreement (DAA). Typically, samples from the same Data Access Unit (DAU) all have the same embargo date, as they would have been released on St. Jude Cloud at the same time.

Current Embargo Dates

DAU or Data Set | Embargo Date
Pediatric Cancer Genome Project | July 23, 2018
Pan-Acute Lymphoblastic Leukemia | January 14, 2019
St. Jude LIFE | January 15, 2019
Clinical Genomics | Rolling based on release date
Sickle Cell Genome Project | September 1, 2019
Childhood Cancer Survivor Study | November 1, 2019
Pediatric therapy-related Myeloid Neoplasms | February 12, 2021
Bone Marrow Failure and Myelodysplastic Syndromes | October 30, 2021
Pediatric Acute Myeloid Leukemia | December 1, 2021
Genomics and transcriptomics of relapsed pediatric AML | August 20, 2022
Medulloblastoma Preclinical Ribociclib and Gemcitabine | September 1, 2022
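
As a rough illustration of how these dates might be checked when planning a publication, the sketch below encodes the dates listed above and compares them with a planned date. The function name is hypothetical, and the sketch makes a conservative assumption that the embargo counts as passed only on days strictly after the listed date. Clinical Genomics has a rolling embargo, so it has no fixed date here and must be checked per release. The Data Access Agreement remains the authority on what is permitted.

```python
from datetime import date

# Embargo dates copied from the list above. None marks a rolling embargo.
EMBARGO_DATES = {
    "Pediatric Cancer Genome Project": date(2018, 7, 23),
    "Pan-Acute Lymphoblastic Leukemia": date(2019, 1, 14),
    "St. Jude LIFE": date(2019, 1, 15),
    "Clinical Genomics": None,  # rolling, based on release date
    "Sickle Cell Genome Project": date(2019, 9, 1),
    "Childhood Cancer Survivor Study": date(2019, 11, 1),
    "Pediatric therapy-related Myeloid Neoplasms": date(2021, 2, 12),
    "Bone Marrow Failure and Myelodysplastic Syndromes": date(2021, 10, 30),
    "Pediatric Acute Myeloid Leukemia": date(2021, 12, 1),
    "Genomics and transcriptomics of relapsed pediatric AML": date(2022, 8, 20),
    "Medulloblastoma Preclinical Ribociclib and Gemcitabine": date(2022, 9, 1),
}


def embargo_has_passed(name, on):
    """Return True if `on` falls strictly after the listed embargo date.

    Assumption: the embargo counts as passed only after the listed day.
    Raises ValueError for rolling embargoes, which have no single date.
    """
    embargo = EMBARGO_DATES[name]
    if embargo is None:
        raise ValueError(f"{name} has a rolling embargo; check each release")
    return on > embargo


print(embargo_has_passed("Sickle Cell Genome Project", date(2019, 8, 31)))
print(embargo_has_passed("Sickle Cell Genome Project", date(2019, 9, 2)))
```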

List of DAUs

We currently have the six Data Access Units (DAUs) listed below. Basic clinical data is available for relevant subjects in each DAU. Click on the name below to navigate directly to that DAU’s Study page for more detailed information. The Data Sets included in each DAU are listed below; note that some Data Sets are part of multiple DAUs.

Childhood Cancer Survivor Study (CCSS)

CCSS consists of germline-only whole genome sequencing samples of childhood cancer survivors. The following data set(s) are included within CCSS:

Clinical Genomics

Clinical Genomics contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on identifying variants that influence the development and behavior of childhood tumors. The following data set(s) are included within Clinical Genomics:

Pan-Acute Lymphoblastic Leukemia (PanALL)

PanALL contains tumor-only RNA-Seq data focused on the spectrum of ALL subtypes from a variety of contributing sources. The following data set(s) are included within PanALL:

Pediatric Cancer Genome Project (PCGP)

PCGP contains paired tumor-normal whole genome, whole exome, and RNA sequencing data focused on discovering the genetic origins of pediatric cancer. The following data set(s) are included within PCGP:

Sickle Cell Genome Project (SGP)

SGP contains germline-only whole genome sequencing data of Sickle Cell Disease patients from birth to young adulthood. The following data set(s) are included within SGP:

St. Jude Life (SJLIFE)

SJLIFE contains germline-only whole genome and whole exome sequencing data focused on studying the long-term adverse outcomes associated with cancer and cancer-related therapy. The following data set(s) are included within SJLIFE:

List of Data Sets

We currently have the Data Sets listed below. Each entry shows which Data Access Units (DAUs) the Data Set belongs to, its tissue type, sequencing type, and number of samples, along with additional links and a brief description.

Childhood Cancer Survivor Study

DAU: CCSS | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 2912 | Additional Information About CCSS

Childhood Cancer Survivor Study (CCSS) is a germline-only Data Set consisting of whole genome sequencing of childhood cancer survivors. CCSS is a multi-institutional, multi-disciplinary, NCI-funded collaborative resource established to evaluate long-term outcomes among survivors of childhood cancer. It is a retrospective cohort consisting of >24,000 five-year survivors of childhood cancer who were diagnosed between 1970 and 1999 at one of 31 participating centers in the U.S. and Canada. The primary purpose of this sequencing of CCSS participants is to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy.

CCSS: Potential Bacterial Contamination

Samples for the Childhood Cancer Survivor Study were collected by sending out buccal swab kits to enrolled participants and having them complete the kits at home. This mechanism of collecting saliva and buccal cells for sequencing is highly desirable because of its non-invasive nature and ease of execution. However, collecting samples in this manner also carries a higher probability of contamination from external sources (as compared to, say, samples collected using blood). We have observed some samples in this cohort that suffer from bacterial contamination. To address this issue, we have taken the following steps:

  1. We have estimated the bacterial contamination rate and annotated each of the samples in the CCSS cohort. For each sample, you will find the estimated contamination rate in the Description field of the SAMPLE_INFO.txt file that is vended with your data (and as a property on the DNAnexus file). For information on this field, see the Metadata specification.
  2. Using this estimated contamination rate, we have removed 82 samples which exhibited large rates of bacterial contamination.
  3. For the remaining samples, we have provided the BAM file as aligned with bwa mem with default parameters. We have observed that there are instances of reads originating from bacterial contamination that are erroneously mapped to the human genome and display a very low mapping quality. Please be advised that we have kept these reads as they were aligned and have not yet made any attempt to unmap these reads. Any analysis you perform on these samples will need to take this into account!
  4. Last, we will be working over the coming months to unmap the reads originating from bacterial contamination and release updated BAM files along with the associated gVCF files from Microsoft Genomics Service.

With any questions on the nature or implications of this warning, please contact us at support@stjude.cloud.
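
One possible way to account for the low-mapping-quality reads described in step 3 is to drop alignments below a mapping-quality cutoff before analysis. The sketch below does this for SAM-formatted text lines using only the Python standard library. The function name and the cutoff of 20 are illustrative assumptions, not a recommendation from St. Jude, and the right threshold depends on your analysis.

```python
def filter_low_mapq(sam_lines, min_mapq=20):
    """Yield SAM header lines and alignments with MAPQ >= min_mapq.

    MAPQ is the fifth tab-separated column of a SAM alignment record.
    The default cutoff of 20 is an illustrative assumption.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) >= min_mapq:
            yield line


# Tiny made-up example: one confident alignment and one low-quality one.
example = [
    "@SQ\tSN:chr1\tLN:1000\n",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF\n",
    "read2\t0\tchr1\t200\t3\t10M\t*\t0\t0\tTTTTGGGGCC\tFFFFFFFFFF\n",
]
kept = list(filter_low_mapq(example))
print(len(kept))  # header plus read1
```

For BAM files, the same kind of filter is available through `samtools view -q`, which skips alignments whose MAPQ is below the given value.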

Childhood Solid Tumor Network

DAU: PCGP and Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 143 | Additional Information About CSTN

The Childhood Solid Tumor Network (CSTN) is a St. Jude Children’s Research Hospital initiative to disseminate its childhood solid tumor resources and data. The raw Data Sets from this initiative are made available via St. Jude Cloud.

Cicero Benchmark

DAU: PCGP and Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 124

The CICERO Data Set contains the samples which were selected for use in the CICERO Paper.

Clinical Pilot

DAU: PCGP and Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 155 | Additional Information About Clinical Genomics

The Clinical Pilot project was a retrospective study that evaluated the accuracy and demonstrated the feasibility of three-platform sequencing in a CAP/CLIA setting. The findings of this project were published in Nature Communications.

Genome 4 Kids

DAU: Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 571 | Additional Information About Clinical Genomics

The goal of the Genomes 4 Kids (G4K) prospective study was to determine whether the three-platform sequencing protocol laid out in the Clinical Pilot project could generate results on a clinical timeline in practice and to evaluate the prevalence of actionable findings. The study concluded with just over 300 patients, and the publication is currently in review.

Pan-Acute Lymphoblastic Leukemia

DAU: PanALL | Tissue Type: Paired Tumor-Normal | Sequencing Type: RNA-Seq | Samples: 735

Pan-Acute Lymphoblastic Leukemia (PanALL) comprises cases of B-progenitor and T-lineage ALL encompassing the spectrum of ALL subtypes across the age continuum. Samples sequenced were obtained from multiple sites, centers and cooperative groups including St. Jude Children’s Research Hospital, The Children’s Oncology Group, The Alliance – Cancer and Leukemia Group B, the Eastern Cooperative Oncology Group, The Southwestern Oncology group, MD Anderson Cancer Center, City of Hope National Medical Center, Princess Margaret Cancer Center, Northern Italy Leukemia Group, and UKALL.

Pediatric Cancer Genome Project

DAU: PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 3031 | Additional Information About PCGP

The Pediatric Cancer Genome Project (PCGP) is a collaboration between St. Jude Children’s Research Hospital and the McDonnell Genome Institute at Washington University School of Medicine that sequenced the genomes of over 600 pediatric cancer patients.

Pediatric therapy-related Myeloid Neoplasms

DAU: PCGP | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 206 | Additional Information About tMN

The primary purpose of the Pediatric therapy-related Myeloid Neoplasms (tMN) study is to define the genomic alterations in therapy-related myeloid neoplasms in children. Specifically, the study used WGS, WES and/or RNA-seq to define the somatic and germline alterations that drive tMN in children. The dataset is a mixture of paired tumor-normal samples and normal-only samples.

Real-time Clinical Genomics

DAU: Clinical Genomics | Tissue Type: Paired Tumor-Normal | Sequencing Type: WGS, WES, RNA-Seq | Samples: 2,371 | Additional Information About Clinical Genomics

Real-time Clinical Genomics (RTCG) is a first-of-its-kind initiative in which St. Jude began releasing data from the clinical NGS service, consented for research use, to St. Jude Cloud in monthly batches to give researchers access to valuable data as quickly as possible.

Sickle Cell Genome Project

DAU: SGP | Tissue Type: Germline Only | Sequencing Type: WGS | Samples: 807 | Additional Information About SGP

SGP is a germline-only Data Set of Sickle Cell Disease (SCD) patients from birth to young adulthood. The Sickle Cell Genome Project (SGP) is a collaboration between St. Jude Children’s Research Hospital and Baylor College of Medicine focused on identifying genetic modifiers that contribute to various health complications in SCD patients. Additional objectives include, but are not limited to, developing accurate methods to characterize germline structural variants in the highly homologous globin locus and blood typing.

St. Jude Life

DAU: SJLIFE | Tissue Type: Germline Only | Sequencing Type: WGS, WES | Samples: 4838 | Additional Information About SJLIFE

St. Jude Lifetime (SJLIFE) is a longevity study from St. Jude Children’s Research Hospital that aims to identify all inherited genome sequence and structural variants influencing the development of childhood cancer and occurrence of long-term adverse outcomes associated with cancer and cancer-related therapy. This cohort contains unpaired germline samples and does not contain tumor samples.