File Formats and Sequencing Information


File Formats

St. Jude Cloud hosts both raw genomic data files and processed results files:

File Type | Short Description
BAM | hg38-aligned BAM files produced by the Microsoft Genomics Service (DNA-Seq) or STAR 2-pass mapping (RNA-Seq).
gVCF | Genomic VCF files produced by the Microsoft Genomics Service.
Somatic VCF | Curated list of somatic variants produced by the St. Jude somatic variant analysis pipeline.
CNV | List of somatic copy number alterations produced by the St. Jude CONSERTING pipeline.
Feature Counts | Curated list of read counts mapped to each gene, produced by HTSeq.

BAM files

In St. Jude Cloud, we store aligned sequence reads in the BAM file format for whole genome sequencing, whole exome sequencing, and RNA-Seq. For more information on SAM/BAM files, please refer to the SAM/BAM specification. For research samples, we require the standard 30X coverage for whole genome sequencing and 100X for whole exome sequencing. For clinical samples, we require higher coverage, 45X, for whole genome sequencing due to the tumor purity issues found in clinical tumor specimens. For RNA-Seq, since only a subset of genes is expressed in any given tissue, we require 30% of exons to have 20X coverage in order to ensure that at least 30% of the expressed genes have sufficient coverage.
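The RNA-Seq coverage criterion above can be sketched as a simple check. This is an illustrative calculation only, not code from the St. Jude pipeline:

```python
# Illustrative sketch (not St. Jude pipeline code): check the RNA-Seq criterion
# that at least 30% of exons reach 20X coverage.

def passes_rnaseq_coverage(exon_coverages, min_depth=20, min_fraction=0.30):
    """exon_coverages: mean read depth per exon."""
    if not exon_coverages:
        return False
    covered = sum(1 for depth in exon_coverages if depth >= min_depth)
    return covered / len(exon_coverages) >= min_fraction
```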

gVCF files

We provide gVCF files produced by the Microsoft Genomics Service. gVCF files are derived from the BAM files above, with variants called by GATK's HaplotypeCaller. For the gVCF format itself, we defer to the official specification document from the Broad Institute, as well as this discussion on the difference between VCF and gVCF files. For more information about how Microsoft Genomics produces gVCF files, or any other questions regarding data generation, please refer to the official Microsoft Genomics whitepaper.

Somatic VCF files

Somatic VCF files available on St. Jude Cloud have been generated when possible for tumor samples where a matching patient normal sample has also been sequenced. These include samples from the Pediatric Cancer Genome Project (PCGP), Clinical Pilot, and Genomes 4 Kids (G4K) cohorts. These somatic VCF files contain hg38 coordinate SNV/Indel variant calls generated by the St. Jude somatic variant analysis pipeline outlined below:

  1. Whole genome sequencing (WGS) and/or whole exome sequencing (WES) data were aligned to hg19 with bwa backtrack (bwa aln + bwa sampe) using default parameters.*
  2. Post processing of aligned reads was performed using Picard CleanSam and MarkDuplicates.*
  3. Variants were called using the Bambino variant caller.*
  4. If RNA-Seq was performed on a tumor sample, we assessed the presence of any WGS/WES-called variants within a StrongARM-aligned RNA-Seq BAM file.*
  5. In some Clinical Pilot cohort study cases, validation by custom capture was performed, in which case the read counts of that experiment were reported.*
  6. Variants were post-processed using an in-house post-processing pipeline that cleans and annotates variants. This pipeline is not currently publicly available.*
  7. Depending on the cohort, either all variants were manually reviewed by analysts (PCGP cohort) or a subset comprising all coding variants was manually reviewed (Clinical Pilot and G4K). This resulted in the assignment of a validation status for each variant (see below).
  8. Variant filtering:

    • For patient samples where a complete set of variants have been manually curated and reported as part of an existing publication, the published set of variants was considered the complete set of variants for that patient tumor sample, and used to populate the associated somatic VCF file. This includes samples belonging to the Clinical Pilot cohort study and some of the PCGP cohort studies.
    • For currently unpublished patient tumor samples, we filtered variants based on metrics derived from the in-house post-processing pipeline and the input of analysts (see below). This includes the patient samples in the G4K study, along with some unpublished PCGP and Clinical Pilot cohort study samples.
  9. Variants were converted from our in-house formats to VCF format.
  10. Variant coordinates were lifted over to GRCh38noalt using Picard LiftoverVcf.
  11. Variants were normalized using vt normalize.
  12. Variants were annotated using VEP v100 and the --everything flag.
  13. VCFs were compressed with bgzip and tabix indexed, then validated using VCFtools’ vcf-validator.

    • VCFtools’ vcf-validator is outdated, and is being replaced with GATK’s ValidateVariants in future versions of the pipeline.

* For more detailed information on steps 1-6, see the Clinical Pilot paper where they are described in depth.

warning

Please note, as indicated above, that the somatic variants recorded in these VCF files were called using hg19 aligned WGS/WES/RNA-Seq sequencing data, and not from the hg38 BAMs available elsewhere on St. Jude Cloud.

Sample Variant Validation and Filtering

The validation_status field indicates the status of validating the presence and origin (somatic/germline) of the variant using an orthogonal methodology. Note that this doesn’t imply anything regarding variant significance, pathogenicity, or relation to disease. Typically, validation is performed either by secondary sequencing using a custom capture panel designed around the called variants (common in research studies) or by cross-validation using simultaneous sequencing by WGS, WES, and/or RNA-Seq (used in clinical genomics).

For validation by a secondary method, “valid” indicates the variant was validated by the secondary method, “putative” indicates that the secondary method could neither confirm nor invalidate the variant (e.g. due to lack of read coverage by the secondary method), and “invalid” indicates a negative result from the secondary method.

Cross-validation of variant data across multiple sequencing methodologies (WGS, WES, RNA-Seq, etc.) has been used across clinical genomics studies (Clinical Pilot, G4K, and RTCG). Here, SNVs and indels are first manually reviewed and called as “Good” or “Bad” for each sequencing methodology. To make this assessment, the analyst considers 1) computationally generated quality metrics and tags assigned to the variant based on read counts, base qualities, and realignment; and 2) manual inspection of tumor and germline read evidence within sequencing data from the given sequencing platform. The assigned validation status is determined based on the observed support for the variant across each of the sequencing methodologies as follows:

Fresh/Frozen Samples

Designation | Criteria
Valid | Good in both WGS and WES
Likely Valid | Good in WGS and RNA-Seq, but not WES (though not validated, the high fidelity of WGS means these have a high probability of being valid)
Putative | Good in WGS only, or Good in both WES and RNA-Seq
Invalid | Any other combination of assignments

FFPE Samples

Designation | Criteria
Valid | Good in both WES and RNA-Seq
Likely Valid | Good in WES but not RNA-Seq
Putative | Generally not used
Invalid | Any other combination of assignments
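The decision tables above can be expressed as a small function. This is an illustrative reimplementation of the published rules, not the pipeline's actual code:

```python
# Illustrative reimplementation of the validation-status decision tables.
# `calls` maps a sequencing methodology name to its manual-review call
# ("Good" or "Bad"); `ffpe` selects the FFPE table.

def validation_status(calls, ffpe=False):
    good = {method for method, call in calls.items() if call == "Good"}
    if ffpe:
        if {"WES", "RNA-Seq"} <= good:
            return "Valid"
        if "WES" in good:
            return "Likely Valid"
        return "Invalid"
    if {"WGS", "WES"} <= good:
        return "Valid"
    if {"WGS", "RNA-Seq"} <= good:
        return "Likely Valid"  # not WES, or the first branch would have fired
    if good == {"WGS"} or {"WES", "RNA-Seq"} <= good:
        return "Putative"
    return "Invalid"
```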

Pre-annotation

As part of our post-processing pipeline, variants are processed by a variety of in silico pathogenicity prediction algorithms that a default VEP run does not provide. We include those results as VCF INFO tags when applicable. Note that these predictions were computed on the hg19 coordinates. Algorithms include the Likelihood Ratio Test (LRT), MutationAssessor, MutationTaster, and FATHMM.

Bambino represents indels in a format incompatible with the VCF specification. When we convert variants to VCF, we store the original Bambino representation in the bambino_representation tag of the INFO field. Note that this entry is always in hg19 coordinates.
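Retrieving a single tag such as bambino_representation from a VCF INFO column can be sketched as below. The tag name follows the description above; the example value is hypothetical:

```python
# Illustrative sketch: pull one tag out of a semicolon-delimited VCF INFO
# column. Flag-type entries (no "=") yield an empty string as their value.

def get_info_tag(info, tag):
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == tag:
            return value
    return None
```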

Where possible, allele counts and read depths for each variant are determined across each sequencing type available for the associated patient tumor-normal pair. These can include WGS, WES, RNA-Seq, and validation-capture (VALCAP) data. These data are stored in the sample columns of the VCF.

Processing tools

The newly created hg19 coordinate VCFs are lifted over using Picard LiftoverVcf, with hg19ToHg38.over.chain as the chain file and GRCh38_no_alt.fa as the reference. The hg38 build includes some reference contigs not present in the GRCh38noalt build, so variants which would map to one of those alternate sequences are excluded from the VCFs. Variants which fail Picard liftover may be reviewed by an analyst and lifted over manually. bcftools annotate is used to document the reference file and the exact liftover command in the VCF's header. Note that hg19 alleles and coordinates are stored by Picard in the OriginalAlleles, OriginalContig, and OriginalStart INFO tags within the final VCF file.

Please also note that there is a bug in the version of Picard we use (version 2.18.29) which rarely truncates some of the VCF's FORMAT and sample fields. These are the fields in which we store read counts and depths for each sequencing type (WGS, WES, RNA-Seq, VALCAP). A bug report has been filed with Picard. In the current version of the pipeline, we recover the dropped entries from the original hg19 VCF. While searching for and correcting these entries, we also reorder the FORMAT and genotype fields into a logical ordering, as opposed to the alphabetical order output by Picard.
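The FORMAT reordering described above can be sketched as follows. The "logical" ordering shown is an assumption for illustration, not necessarily the pipeline's exact ordering:

```python
# Illustrative sketch: restore a preferred FORMAT-key ordering after a tool has
# emitted the keys alphabetically. The ordering below is assumed for
# illustration; the actual pipeline's ordering may differ.
LOGICAL_ORDER = ["GT", "AD", "DP"]

def reorder_format(format_field, sample_field):
    """Reorder colon-delimited FORMAT keys and the matching sample values."""
    keys = format_field.split(":")
    values = dict(zip(keys, sample_field.split(":")))
    rank = {key: i for i, key in enumerate(LOGICAL_ORDER)}
    ordered = sorted(keys, key=lambda k: (rank.get(k, len(LOGICAL_ORDER)), k))
    return ":".join(ordered), ":".join(values[k] for k in ordered)
```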

vt normalize is used to obtain consistent representations of all variants which may have multiple equivalent forms. If VT modifies a variant, it records the original form in the OLD_VARIANT info tag. Click here to read about variant normalization.
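A simplified version of the normalization idea can be sketched as below. Note that full normalization as performed by vt also left-aligns indels, which requires the reference sequence and is omitted here:

```python
# Simplified sketch of variant normalization: right-trim a shared allele
# suffix, then left-trim a shared prefix (advancing the position). This is an
# illustration of the concept only; vt normalize additionally left-aligns
# indels against the reference sequence.

def trim(pos, ref, alt):
    # Right-trim the common suffix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Left-trim the common prefix, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt
```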

For a full description of the annotations VEP determines for each variant, see the VEP documentation.

Naming convention

Somatic VCFs are the result of a paired tumor-normal analysis and are named to indicate the samples used in that analysis: the tumor sample name followed by the germline sample name. Usually the tumor and germline sample names share the same initial prefix, in which case we do not repeat it. For example, an analysis with tumor sample SJOS0123_D and germline sample SJOS0123_G becomes SJOS0123_D_G. In some rare cases, the prefix differs. For example, with SJHM030441_D2_SJOS030441_G1, the germline sample was taken when the patient was being treated for osteosarcoma, but the tumor sample was taken from the same patient at the time of treatment for acute myeloid leukemia; since there is no shared prefix, both names are included in full.
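The prefix-collapsing rule can be sketched as a small function. This is an illustrative reading of the convention above, assuming the prefix is everything before the final underscore; it does not cover the multi-germline case described below:

```python
# Illustrative sketch of the somatic VCF naming convention: state a shared
# sample-name prefix once, otherwise append both names in full. Assumes the
# prefix is everything before the final underscore.

def somatic_vcf_name(tumor, germline):
    t_prefix, _, _t_suffix = tumor.rpartition("_")
    g_prefix, _, g_suffix = germline.rpartition("_")
    if t_prefix and t_prefix == g_prefix:
        return f"{tumor}_{g_suffix}"  # shared prefix stated once
    return f"{tumor}_{germline}"      # no shared prefix: append in full
```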

The metadata file SAMPLE_INFO.txt contains the exact details of which samples are included in each analysis. The disease sample is under sample_name, and the normal sample is under attr_control_sample. Since these files come from a wide span of projects over the years, the exact sample naming conventions have changed, and the VCF file names reflect those changes.

An exception occurs when different germline samples were used in separate experiments against the same disease sample. A somatic VCF may contain data from up to four different experiments: WGS, WES, RNA-Seq, and VALCAP can all be presented in the same file. In nearly all cases those four experiments will have used the same germline sample. Rarely, however, one analysis (e.g. the WGS analysis) was done using one germline sample and another analysis (e.g. the WES analysis) using a different one. In those cases we integrate the data and join the germline extensions using hyphens (-); for example, SJETV010_D_G-H. The VCF header and the metadata file SAMPLE_INFO.txt contain more details about which samples were used for which experiments.

Filenames do not always reflect the full truth for a given sample; the metadata, which has been thoroughly curated by experts, should always be used during sample selection.

Additional Caveats

The VCF headers will explain each annotation used, including the fields for read counts and depths of each sequencing type. For more information on the variant call format (VCF), please see the VCF version 4.2 specification.

It is possible for the same variant to be included more than once. This is most commonly caused by functionally equivalent indels being called in different forms at nearby locations in the original pipeline; once the variants are normalized, they turn out to be the same variant. We make no attempt to select one entry when this occurs and instead include all duplicates, which may have different read evidence or validation statuses. There may be other sources of duplicates (such as liftover), but they are rare and we make no attempt to remove them.
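Detecting such duplicates after normalization can be sketched as a simple grouping by variant key. As described above, duplicates are reported rather than removed:

```python
# Illustrative sketch: group already-normalized VCF records that describe the
# same variant. Duplicates are reported, not removed, mirroring the behavior
# described above.
from collections import defaultdict

def find_duplicates(records):
    """records: iterable of (chrom, pos, ref, alt) tuples after normalization."""
    seen = defaultdict(list)
    for index, record in enumerate(records):
        seen[record].append(index)
    return {record: indexes for record, indexes in seen.items() if len(indexes) > 1}
```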

CNV files

CNV files contain copy number alteration (CNA) analysis results for paired tumor-normal WGS samples. Files are produced by running paired tumor-normal BAM files through the CONSERTING pipeline which identifies CNA through iterative analysis of (i) local segmentation by read depth within boundaries identified by structural variation (SV) breakpoints followed by (ii) segment merging and local SV analysis. CREST was used to identify local SV breakpoints. CNV files contain the following information:

Field | Description
chrom | chromosome
loc.start | start position of the segment
loc.end | end position of the segment
LogRatio | log2 ratio between tumor and normal coverage
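Since LogRatio is the log2 ratio of tumor to normal coverage, a rough copy number estimate can be derived from it. This conversion is an illustration only and assumes a diploid normal and a pure tumor sample, which real specimens rarely are:

```python
# Illustrative conversion from the LogRatio field to an estimated copy number.
# Assumes a diploid normal and 100% tumor purity, which is rarely true in
# practice; CONSERTING's actual output is the segment log ratio itself.

def estimated_copy_number(log_ratio):
    return 2 * 2 ** log_ratio
```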

Feature Counts files

Feature counts files are text files that contain counts of reads aligned to genomic features. St. Jude Cloud feature counts files are generated using HTSeq; the exact command is documented in our RNA-Seq V2 RFC. The files contain a count of the number of reads overlapping each genomic feature, in this case genes as specified in GENCODE V31. St. Jude Cloud uses the gene name as the feature key. The files are tab-delimited text containing the feature key and the read count for that feature.
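A minimal reader for this format can be sketched as below. The gene names and counts are hypothetical, and this sketch ignores the special summary lines (e.g. __no_feature) that htseq-count appends to its output:

```python
# Illustrative reader for a feature counts file: tab-delimited lines of
# gene name and read count, as described above. Skips htseq-count's special
# summary counters, whose keys start with "__".

def read_feature_counts(lines):
    counts = {}
    for line in lines:
        gene, count = line.rstrip("\n").split("\t")
        if not gene.startswith("__"):
            counts[gene] = int(count)
    return counts
```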

Sequencing Information

Whole Genome and Whole Exome

Whole Genome Sequence (WGS) and Whole Exome Sequence (WES) BAM files were produced by the Microsoft Genomics Service aligned to hg38 (GRCh38 no alt analysis set). For more information about how Microsoft Genomics produces BAM files or any other questions regarding data generation, please refer to the official Microsoft Genomics whitepaper.

RNA-Seq

RNA-Seq BAM files are mapped to hg38. For alignment, STAR v2.7.1a 2-pass mapping is used. Below is the STAR command used during alignment. For more information about any of the parameters used, please refer to the STAR manual for v2.7.1a. The complete RNA-Seq WDL pipeline is available on GitHub. The STAR alignment parameters are also available on GitHub. For additional information about St. Jude Cloud’s RNA-Seq pipeline, visit the RFC.

    STAR \
        --readFilesIn $(cat read_one_fastqs_sorted.txt) $(cat read_two_fastqs_sorted.txt) \
        --genomeDir ~{stardb_dir} \
        --runThreadN $n_cores \
        --outSAMunmapped Within \
        --outSAMstrandField intronMotif \
        --outSAMtype BAM Unsorted \
        --outSAMattributes NH HI AS nM NM MD XS \
        --outFilterMultimapScoreRange 1 \
        --outFilterMultimapNmax 20 \
        --outFilterMismatchNmax 10 \
        --alignIntronMax 500000 \
        --alignMatesGapMax 1000000 \
        --sjdbScore 2 \
        --alignSJDBoverhangMin 1 \
        --outFilterMatchNminOverLread 0.66 \
        --outFilterScoreMinOverLread 0.66 \
        --outFileNamePrefix ~{output_prefix + "."} \
        --twopassMode Basic \
        --limitBAMsortRAM ~{(memory_gb - 2) + "000000000"} \
        --outSAMattrRGline $(cat read_groups_sorted.txt)