Authors | Scott Newman, Clay McLeod, Yongjin Li |
Publication | N/A (not published) |
Technical Support | Contact Us |
Overview
Fusion genes are important for cancer diagnosis, subtype definition and targeted therapy. RNA-Seq is useful for detecting fusion transcripts; however, computational methods face challenges to identify fusion transcripts due to events such as internal tandem duplication (ITD), multiple genes, low expression, or non-templated insertions. To address some of these challenges, St. Jude Cloud offers “Rapid RNA-Seq”, an end-to-end clinically validated pipeline that detects gene fusions and ITDs from human RNA-Seq.
Inputs
The input can be either of the two entries below, based on whether you want to start with FASTQ files or a BAM file.
Name | Description | Example |
---|---|---|
Paired FASTQ files | Gzipped FASTQ files generated by human RNA-Seq | SampleR1.fastq.gz and SampleR2.fastq.gz |
BAM file | Aligned reads file from human RNA-Seq | Sample.bam |
warning
If you provide a BAM file to the pipeline, it must be aligned to GRCh37-lite. Running a BAM aligned to any other reference genome is not supported. Maybe more importantly, we do not check the genome build of the BAM, so errors in computation or the results can occur. If your BAM is not aligned to this genome build, we recommend converting the BAM back to FASTQ files using Picard’s SamToFastq functionality and using the FASTQ version of the pipeline.
Outputs
The Rapid RNA-Seq pipeline produces the following outputs:
Name | Description |
---|---|
Predicted gene fusions (.txt) | File containing putative gene fusions. |
Coverage file (.bw) | bigWig file containing coverage information. |
Splice junction read counts (.txt) | Read counts for the splice junction detected. |
Interactive fusion visualization | Fusion visualization produced by ProteinPaint. |
Interactive coverage visualization | Coverage visualization produced by ProteinPaint. |
Workflow Steps
- The raw sequence data is aligned to GRCh37-lite using standard STAR mapping.
- A coverage bigWig (.bw) file is produced to allow the user to assess sample quality across the genome.
-
Two gene fusion detection algorithms are run in parallel.
- The Fuzzion (Rice et al. unpublished data) fusion detection algorithm is run to provide high sensitivity for recurrent gene fusions.
- The RNAPEG (Wu et al.) splice junction read counting algorithm is run to quantify read counts for splice junctions. These splice junction read counts are then used by CICERO (Li et al.) to detect putative gene fusions.
- Custom visualizations for putative gene fusions and genome coverage are produced by ProteinPaint.
Mapping
We use the STAR aligner to rapidly map reads to the GRCh37 human reference genome. This step generally takes around one hour to complete assuming approximately 55-75 million paired reads are supplied.
Coverage
Internally developed scripts calculate the coverage of mapped reads genome wide. The resulting bigWig file can be viewed in ProteinPaint or used for quality control.
Splice junction read quantification
We use our RNAPEG software to quantify reads spanning known and novel splice junctions. RNAPEG also corrects improper mappings at splice junction boundaries for more accurate definition of novel splice junctions. The resulting junctions file can be viewed along with the coverage bigWig file to gain insights into gene expression and splicing patterns
Genome-wide fusion prediction
We developed an assembly-based algorithm CICERO (Clipped-reads Extended
for RNA Optimization) that is able to extend the read-length spanning
fusion junctions for detecting complex fusions. CICERO finds clipped
reads and junction spanning reads, assembles them into a contig and maps
the contig back to the reference genome. Mapped contigs are then
annotated and filtered. Those with potential genic effects including
gene fusion, ITD, readthrough or circular RNA are reported in the
final_fusions.txt
file. An interactive version of this file with
predictions sorted by quality can be inspected with the ProteinPaint
interactive fusion viewer.
An abstract describing CICERO was presented at ASHG, 2014: http://www.ashg.org/2014meeting/abstracts/fulltext/f140120024.htm
Low stringency fusion gene “Hotpot” search
We have observed that certain fusions such as KIAA1549-BRAF in low-grade
glioma have apparently limited read support in the bam file — either due
to low expression or low tumor purity. In these cases, we use a
secondary tool, FUZZION, that performs fuzzy matching for known fusion
gene junctions for every read in the bam file (both mapped and
unmapped). FUZZION can recover even a single low quality read
potentially supporting a known fusion gene junction. The FUZZION output
is a simple text file with read IDs and sequences supporting a
particular gene fusion. The fusion point is indicated with square
brackets []
.
Creating a workspace
Before you can run one of our workflows, you must first create a workspace in DNAnexus for the run. Refer to the general workflow guide to learn how to create a DNAnexus workspace for each workflow run.
You can navigate to the Rapid RNA-Seq workflow page here.
Uploading Input Files
warning
This pipeline assumes GRCh37-lite coordinates. If your BAM is not aligned to this genome build, we recommend converting the BAM back to FASTQ files using Picard’s SamToFastq functionality.
The Rapid RNA-Seq pipeline takes as input either a paired set of Gzipped FASTQ files or a GRCh37-lite aligned BAM from human RNA-Seq.
Refer to the general workflow guide to learn how to upload input files to the workspace you just created.
Running the Workflow
Refer to the general workflow guide to learn how to launch the workflow, hook up input files, adjust parameters, start a run, and monitor run progress.
note
Several of the apps that make up the Rapid RNA-seq workflow have an automatic timeout value included to prevent possible runaway costs, if you need to adjust or remove any of these, currently you must do so via the DNAnexus CLI (though a method to do so from the UI should be available in the future). See the section below for more information about these timeouts and how to adjust or remove.
The Rapid RNA-seq pipeline contains several different apps and each of their timeout values are currently set as:
- STAR - 24.5 hour timeout
- RNApeg - 15 hour timeout
- CICERO - 30 hour timeout
- Coverage_bw - no timeout
- Fuzzion - no timeout
You can use the following command to change or remove a timeout for one of the steps in the workflow, but will need the information for the specific apps within your version of the workflow. To do this, here is an example where the CICERO timeout is removed entirely:
dx run "Rapid RNA-Seq (BAM)" --extra-args '{"timeoutPolicyByExecutable": {"app-GQB77jQ078YQk83ZYF5X7fyq":{"*": {"hours": 0}}}}'
You will need to specify the workflow you are trying to run, either “Rapid RNA-Seq (BAM)” or “Rapid RNA-Seq (FastQ)“. You will also need to update with the appropriate DNAnexus app ID (replace app-GQB77jQ078YQk83ZYF5X7fyq
above).
To find the information required in that command, if you have jq
installed, you can find the app name from your copy of the workflow with:
dx describe --json "Rapid RNA-Seq (BAM)" | jq -r '.stages | .[] | select(.id == "cicero").executable'
And that can be combined with a second dx describe
to get the app ID that’s needed for the run command:
dx describe --json $(dx describe --json "Rapid RNA-Seq (BAM)" | jq -r '.stages | .[] | select(.id == "cicero").executable') | jq -r .id
The values can also be manually found in the output of dx describe
.
Detailed information on timeout values can be found in the DNAnexus documentation.
Please reach out to support@dnanexus.com and support@stjude.cloud with any questions about doing this.
Analysis of Results
Each tool in St. Jude Cloud produces a visualization that makes understanding results more accessible than working with excel spreadsheet or tab delimited files. This is the primary way we recommend you work with your results.
Refer to the general workflow guide to learn how to access these visualizations.
We also include the raw output files for you to dig into if the visualization is not sufficient to answer your research question.
Refer to the general workflow guide to learn how to access raw results files.
Interpreting results
The complete output file specification is listed in the overview section of this guide. Here, we will discuss each of the different output files in more detail.
- Predicted gene fusions: The putative gene fusions will be
contained in the file
[SAMPLE].final_fusions.txt
. This file is a tab-delimited file containing many fields for each of the predicted SV. The most important columns are the following.
Field Name | Description |
---|---|
sample |
Sample name |
gene* |
Gene name |
chr* |
Chromosome name |
pos* |
Genomic Location |
ort* |
Strand |
reads* |
Supporting reads |
medal |
Estimated pathogenicity assessment using St. Jude Medal Ceremony |
- Coverage file: A standard bigWig file used to describe genomic read coverage.
- Splice junction read counts: A custom file format describing the junction read counts. The following fields are included in the tab-delimited output file.
Field Name | Description |
---|---|
junction |
Splice junction in the TCGA format “chrX:start:+,chrX:end,+“. “start” and “end” are the 1-based position of the last mapped nucleotide before the skip and the first mapped nucleotide after the skip (i.e. the last base of the previous exon and the first base of the next exon). Note that in .bed output these coordinates will be different, see the .bed output section below. The ”+” is currently hardcoded, though this may change in the future. |
count |
Raw count of reads supporting the junction. During correction counts for ambiguous junctions can be combined, though obviously these additional reads will not be visible in the raw BAM file. |
type |
Either “known” (matching a reference junction) or “novel” (not observed in the reference junction collection). |
genes |
Gene symbols from the junction calling process. These still need work in the raw junction calling process, it’s recommended to use the “annotated” output files instead which assign matching HUGO gene symbols based on the UCSC refGene table. |
transcripts |
List of known transcript IDs matching the junction. |
qc_flanking |
Count of supporting reads passing flanking sequence checks (junctions observed adjacent to read ends require 18+ nt of flanking sequence, otherwise 10+ nt). |
qc_plus |
Count of supporting reads aligned to the + strand. |
qc_minus |
Count of supporting reads aligned to the - strand. |
qc_perfect_reads |
Count of supporting reads with perfect alignments (no reference mismatches of quality 15+, indels, or soft clips). |
qc_clean_reads |
Count of supporting reads whose alignments are not perfect but which have a ratio of <= 5% of reference mismatches of quality 15+, indels, or soft clips relative to the count of aligned bases on both the left and right flanking sequence. Note: qccleanreads does NOT include qcperfectreads: to get a count of “perfect plus pretty good” reads the two values must be added together. |
Known issues
Adapter contamination
This pipeline does not, at present, remove adapter sequences. If the sequencing library is contaminated with adapters, CICERO runtimes can increase exponentially. We recommend running FASTQ files through a QC pipeline such as FastQC and trimming adapters with tools such as Trimmomatic if adapters are found.
High coverage regions
Certain cell types show very high transcription of certain loci, for example, the immunoglobulin heavy chain locus in plasma cells. The presence of very highly covered regions (typically 100,000-1,000,000+ X) has an adverse effect on CICERO runtimes. Presently, we have no good solution to this problem as strategies such as down-sampling may reduce sensitivity over important regions of the genome.
Interactive Visualizations Exon vs Intron Nomenclature
When a codon is split over a fusion gene junction, the annotation software marks the event as intronic when really, the event should be exonic. We are working to fix this bug. In the mean time, if a fusion is predicted to be in frame but the interactive plot shows “intronic”, we suggest the user blat the contig shown just below to clarify if the true junction is either in the intron or exon.
Frequently asked questions
If you have any questions not covered here, feel free to reach out on our contact form.
Submit batch jobs on command line
See How can I run an analysis workflow on multiple sample files at the same time?
Similar Topics
Running our Workflows
Working with our Data Overview
Upload/Download Data (local)