Metadata and Clinical Information


Metadata

Each data request includes a text file called SAMPLE_INFO.txt that provides a number of file-level properties (sample identifiers, clinical attributes, etc.).
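As a rough illustration of working with this file, the sketch below reads it into one dictionary per file row. It assumes SAMPLE_INFO.txt is tab-delimited with a header row of property names; that layout is an assumption, not something stated on this page, so check your own copy first. The function name `read_sample_info` and the sample values are hypothetical.

```python
import csv
import io


def read_sample_info(handle, delimiter="\t"):
    """Return one dict per file row, keyed by the header's property names.

    Assumes a delimited text file whose first line is a header row.
    """
    return list(csv.DictReader(handle, delimiter=delimiter))


# Hypothetical three-column excerpt; a real file carries many more properties.
example = io.StringIO(
    "file_path\tsample_name\tsequencing_type\n"
    "/demo/SJ000001_D1.bam\tSJ000001_D1\tWGS\n"
    "/demo/SJ000001_D1.bam.bai\tSJ000001_D1\tWGS\n"
)
rows = read_sample_info(example)
print(len(rows), rows[0]["sequencing_type"])
```

For a file on disk, the same function could be called as `with open("SAMPLE_INFO.txt", newline="") as fh: rows = read_sample_info(fh)`.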

Definitions

Below is the set of tags that may exist for any given file in St. Jude Cloud. Tags prefixed with sj are required fields. Tags prefixed with attr hold information queried from the physician or research team’s records at the time of sample submission to St. Jude Cloud and are considered optional, as the level of information gathered for each sample varies.

Property Description
file_path The path to the file in your St. Jude Cloud project.
file_id A unique identifier for the file on DNAnexus. You can see this value listed as “ID” in the DNAnexus user interface.
subject_name A unique subject identifier assigned internally at St. Jude.
sample_name A unique sample identifier assigned internally at St. Jude.
sample_type One of Autopsy, Cell line, Diagnosis, Germline, Metastasis, Relapse, or Xenograft.
sequencing_type Whether the file was generated from Whole Genome (WGS), Whole Exome (WES), or RNA-Seq.
file_type Specifies the type of file. Note that index files are labeled with the file type they accompany and are automatically selected together with that file in our data browser. If you wish to distinguish between the two in your project, parse the file_path: index file paths end with an additional suffix, such as .bai.
description Optional field that may contain additional file information.
file_size The size of the file in bytes, not exceeding 12 digits.
sj_dataset_accession The permanent accession number assigned to a dataset in St. Jude Cloud.
sj_embargo_date The embargo date, which specifies the first date on which the files can be used in a publication.
sj_long_disease_name The complete written name of the disease associated with the disease code stored in the sj_diseases attribute. For more information about our disease ontology go here.
attr_age_at_diagnosis Age at first diagnosis. This field is normalized as a decimal value. If empty, the physician or research team did not indicate a value for this field.
attr_diagnosis Unharmonized primary diagnosis as reported by the lab or PI upon submission of data to St. Jude Cloud.
attr_sex Self-reported sex.
attr_ethnicity Self-reported ethnicity. Values are normalized according to the US Census Bureau classifications.
attr_race Self-reported race. Values are normalized according to the US Census Bureau classifications.
attr_oncotree_disease_code The disease code (assigned at the time of genomic sequencing) as specified by Oncotree Version 2019-03-01.
attr_library_selection_protocol The laboratory method used to prepare and select the DNA or RNA for sequencing from a sample. The possible values are PCR, PolyA, Total, Random, Not Available, or Not Applicable.
attr_read_length The read length used, when available.
attr_sequencing_platform This defines which sequencing platform was used to generate the data, when available.
attr_read_type The sequencing read type, if available.
attr_inferred_strandedness Computationally determined strandedness of RNA-seq data, if applicable.
sj_publication_titles The title(s) of associated publication(s), if the file was associated with one or more papers.
sj_pub_accessions The related St. Jude Cloud accession number(s), if the file was associated with one or more papers. These group the files into publications as displayed on the Genomics Platform data browser.
sj_pmid_accessions The related PubMed accession number, if the file was associated with a paper.
attr_subtype_biomarkers A molecular mutation, structural variant (SV), or fusion event associated with a particular disease subtype that is used to define membership in that subtype.
sj_associated_diagnoses List of all available associated diagnoses for the subject (from the tumor samples or from a patient’s clinical history).
attr_germline_sample The paired germline sample that was used when creating the Somatic VCF file, if applicable.
attr_diagnosis_group Each file is categorized into one of five diagnosis groups based on the type of tumor: hematologic malignancy, solid tumor, brain tumor, germ cell tumor, or not applicable (for germline samples).
sj_ega_accessions The related EGA accession number, if the file was associated with a paper.
sj_access_unit Lists which Data Access Unit (DAU) the file belongs to. For more on Data Access Units, see here.
sj_diseases If your data request was processed after August 18, 2020, this field should be interpreted as the harmonized St. Jude Cloud diagnosis based on the best available information (data provided by the lab or PI and follow-up by scientists on the St. Jude Cloud team). If your data request was processed before August 18, 2020, this field should be interpreted as the disease identifier assigned at the time of genomic sequencing (importantly, the diagnosis known at the time of genomic testing may not be the best available information). If your request was processed after that date and you’d like to use the most up-to-date, harmonized diagnosis in your analysis, we recommend using sj_diseases. If your request was made before that date, or if you wish to use the values exactly as provided by the lab or PI, we recommend using the lab-provided value in attr_diagnosis. For more information about our disease ontology go here.
sj_datasets The dataset(s) in the data browser which this file is associated with.
sj_pipeline_name Specifies which version of the pipeline was used when generating the file.
attr_tissue_preservative The preservation method used for the tissue sample, with two options: FFPE (formalin-fixed, paraffin-embedded) or Fresh/Frozen.
attr_lab_strandedness Lab-reported strandedness of RNA-seq data.
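A few of the conventions in the table above lend themselves to small row-level helpers. The sketch below is illustrative only: it assumes each row is a dictionary of strings keyed by property name, all function names are hypothetical, and only the .bai suffix comes from this page (the other index suffixes are assumptions based on common genomics formats).

```python
# Only ".bai" is named in the table above; the remaining suffixes are
# assumptions and may need adjusting for the files in your project.
INDEX_SUFFIXES = (".bai", ".crai", ".tbi", ".csi")


def is_index_file(file_path):
    """True when file_path ends with one of the assumed index suffixes."""
    return file_path.lower().endswith(INDEX_SUFFIXES)


def split_by_prefix(row):
    """Group a row's tags into sj_ (required), attr_ (optional), and the rest."""
    groups = {"sj": {}, "attr": {}, "other": {}}
    for key, value in row.items():
        if key.startswith("sj_"):
            groups["sj"][key] = value
        elif key.startswith("attr_"):
            groups["attr"][key] = value
        else:
            groups["other"][key] = value
    return groups


def age_at_diagnosis(row):
    """Return attr_age_at_diagnosis as a float, or None when empty or non-numeric."""
    raw = (row.get("attr_age_at_diagnosis") or "").strip()
    if not raw:
        return None  # empty means no value was indicated at submission
    try:
        return float(raw)
    except ValueError:
        return None  # unexpected non-decimal text is treated as missing


# Hypothetical row with made-up values, for illustration only.
row = {
    "file_path": "/demo/SJ000001_D1.bam.bai",
    "sj_diseases": "AML",
    "attr_age_at_diagnosis": "",
    "attr_sex": "Female",
}
print(is_index_file(row["file_path"]), age_at_diagnosis(row))
print(sorted(split_by_prefix(row)["attr"]))
```

Treating an empty or non-numeric age as None keeps missing values out of numeric summaries. Whether to drop or flag such rows is a choice for your own analysis.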

Disease Codes

Note

During the release of the St. Jude Cloud paper, we undertook a massive effort to curate and harmonize diagnosis values within St. Jude Cloud. We provide two values for diagnosis, and you should select carefully which value you use based on your use case:

  1. sj_diseases, which, since August 18, 2020, represents the harmonized diagnosis value curated by scientists on the St. Jude Cloud team (before that time it represented the diagnosis known at time of sequencing). For more information about our disease ontology go here.
  2. attr_diagnosis, which contains the unharmonized diagnosis value directly as it was submitted to us from the lab or PI.

If your data request was processed after August 18, 2020 and you’d like to use the most up-to-date, harmonized diagnosis, we recommend using the sj_diseases field. If your data request was made before this time, or if you wish to use the values exactly as provided by the lab or PI, we recommend using the value in attr_diagnosis.

The SAMPLE_INFO.txt file that comes with your data request will contain the list of associated harmonized diagnosis codes (sj_diseases) for each sample. These codes represent the harmonized diagnosis values curated by the St. Jude Cloud team and reflect the most up to date information about the sample. For more information about our full disease ontology, please navigate to our St. Jude Cloud Disease Ontology section to read our white paper and access our downloadable disease ontology.
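The recommendation above can be written as a small selection rule. This is a minimal sketch, not an official St. Jude Cloud utility: the function names are hypothetical, the row is assumed to be a dictionary keyed by property name, and the date on which your request was processed is something you supply yourself. The page only describes requests processed "before" and "after" August 18, 2020, so treating a request processed on that exact day as "before" is an assumption.

```python
from datetime import date

# Cutoff taken from the text above.
HARMONIZATION_DATE = date(2020, 8, 18)


def choose_diagnosis_field(request_processed, want_lab_values=False):
    """Return the name of the diagnosis field suggested by the guidance above.

    A request processed on the cutoff day itself is conservatively treated
    as "before" (an assumption; the page does not cover that case).
    """
    if want_lab_values or request_processed <= HARMONIZATION_DATE:
        return "attr_diagnosis"
    return "sj_diseases"


def diagnosis_for(row, request_processed, want_lab_values=False):
    """Return (field_name, raw_value) for one SAMPLE_INFO row."""
    field = choose_diagnosis_field(request_processed, want_lab_values)
    return field, row.get(field, "")


# Hypothetical row with made-up values, for illustration only.
row = {"sj_diseases": "AML", "attr_diagnosis": "Acute myeloid leukemia"}
print(diagnosis_for(row, date(2021, 3, 1)))
print(diagnosis_for(row, date(2019, 6, 1)))
```

The value is returned as the raw string because sj_diseases may list several codes and this page does not say how they are delimited.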
