RNA-Seq Expression Classification

Last updated: 2 years ago (view history), Time to read: 13 mins

Overview

St. Jude Cloud provides functionality for generating RNA-Seq Expression Classification plots. This workflow allows plotting of RNA-Seq data by running through the St. Jude Cloud normalization pipeline. The generated count data is then compared to a reference set of data from a cohort of St. Jude samples and a plot is produced.

Requirements

  • The RNA-Seq Expression Classification pipeline reference data uses sequencing data from fresh, frozen tissue samples. It has not been evaluated for use with sequencing data generated from formalin-fixed paraffin-embedded (FFPE) specimens.
  • If running the count-based pipeline, alignment must be done against the GRCh38noalt reference. It should use parameters as specified in our RNA-Seq workflow to minimize any discrepancies caused by differing alignment specification.
  • If running the count-based pipeline, feature counts should be generated with htseq-count as described in our RNA-Seq workflow. This pipeline uses Gencode v31 annotations.

Inputs

The input can be either of the two entries below, based on whether you want to start with a counts file or a BAM file.

Name Description Example
BAM file Aligned reads file from human RNA-Seq Sample.bam
Counts file htseq-count output feature counts file from human RNA-Seq Sample.counts.txt
warning

If you provide counts data to the counts-based pipeline, it must be aligned to GRCh38_no_alt. Running a sample aligned to any other reference genome is not supported. Maybe more importantly, we do not check the genome build of the BAM, so errors in computation or the results can occur. If your sample is not aligned to this genome build, we recommend submitting the sample to the realignment-based workflow.

Outputs

The RNA-Seq Expression Classification pipeline produces the following outputs:

Name Description Pipeline Version
Interactive expression plot (.html) Visualization of RNA-Seq data all
Aligned BAM (.bam) BAM file produced by our RNA-Seq pipeline for the input samples. Realignment
Feature read counts (.txt) Read counts for the Gencode features. Realignment
Gene list (.txt) List of genes used in analysis (if a fixed list not specified) all

Workflow Steps

  1. [Only for realignment workflow] The aligned BAM is converted to FASTQ and is aligned to GRCh38_no_alt using standard STAR mapping.
  2. [Only for realignment workflow] A feature count (.count.txt) file is produced for comparison to St. Jude Cloud reference data.
  3. A visualization for genomic features is produced.

Mapping

We use the STAR aligner to rapidly map reads to the GRCh38 human reference genome.

Feature Counts

We use htseq-count to produce genomic feature counts.

Visualization

A t-Distributed Stochastic Neighbor Embedding (t-SNE) visualization is produced using Rtsne. You can find the t-SNE paper here.

Cost Estimation

Workflow Name Per sample cost Outliers
Realignment workflow $3-10 $25
Feature read counts workflow $0.30-0.40

Running the workflow

Getting Started

warning

If you provide counts data to the counts-based pipeline, it must be aligned to GRCh38_no_alt. Running a sample aligned to any other reference genome is not supported. Maybe more importantly, we do not check the genome build of the sample, so errors in computation or the results can occur. If your sample is not aligned to this genome build, we recommend submitting the sample to the realignment-based workflow.

We provide two versions of the RNA-Seq Expression Classification workflow depending on desired input. The full workflow allows a user to upload a sample in BAM format. That sample will then be converted to FASTQ format, aligned with STAR two-pass alignment, and feature counts generated with htseq-count

To get started, you need to navigate to the RNA-Seq Expression Classification workflow page. You’ll need to click the “Start” button in the left hand pane. This creates a cloud workspace in DNAnexus with the same name as the workflow. After this, you will be able to upload your input files to that workspace.

Obtaining reference data

Reference feature count data is vended to the project folder along with the workflow. It can be found in the “counts” folder.

Reference data can also be retrieved through the Genomics Platform Data Browser

These must then be provided to the workflow through the reference_counts parameter. By default, all reference files will be used by the app, but this can be restricted to one of the three tumor types [Blood, Brain, Solid] through the app settings.

warning

The RNA-Seq Expression Classification workflow does not allow the same sample name to be included more than once. If data from multiple projects is requested through St. Jude Cloud Genomics Platform, a sample may be included more than once. We offer an opinionated deduplication method at https://github.com/stjudecloud/utilities.

Running the workflow

Once you’ve uploaded data to your cloud workspace, click “Launch Tool” on the workflow’s landing page. A dropdown will present the different presets for running the workflow. Here, you can select whether you wish to start with a counts file or a BAM file. Once the workflow has been vended, copy the workflow to the project containing the reference data.

Uploading data

The RNA-Seq Expression Classification pipeline takes either a htseq-count count file or a GRCh38noalt aligned BAM from human RNA-Seq. You can upload your input file(s) through the command line. See Uploading Data from your Local Computer.

Once you have the dx toolkit, to upload a sample HTSeq count file sample.counts.txt to the inputs folder of the project-rnaseq cloud project, you could use the following command:

dx upload sample.counts.txt --destination "project-rnaseq:/inputs/"

Preparing input data

To run an input sample, certain properties need to be set on the file. These should be specified on the HTSeq count file for the counts-based pipeline or on the BAM file for the realignment-based workflow.

Property Name Values
sample_name should match filename up to first period (”.“)
library_type PolyA or Total
read_length integer, in bp
strandedness Stranded-Forward, Stranded-Reverse, Unstranded
pairing Paired-end or Single-end

Properties can be set at the job-level for all input files that do not have properties set. These function as defaults and will not override existing values on files. The options are limited to values in our reference cohort to ensure reliable results.

dx options

If a more granular control of covariates is needed, the input files can be annotated directly with properties. To input properties in DNAnexus, you can either use the web UI or the command line interface. Here, we provide a command line snippet to set properties on a file.

# you can run the following code snippet after filling in your values
# to successfully prepare the file.

file_id=<DNAnexus file ID> # file ID or file path
dx set_properties $file_id sample_name="<value>" # Should match the file name up to the first period character
dx set_properties $file_id strandedness="<value>" # Stranded-Forward, Stranded-Reverse, or Unstranded
dx set_properties $file_id library_type="<value>" # PolyA or Total
dx set_properties $file_id read_length="<value>" # Integer number of base pairs in reads (e.g. 101, 126)
dx set_properties $file_id pairing="<value>" # Paired-end or Single-end

The file ID can be retrieved from the DNAnexus web interface. Click on the file of interest and the file ID is displayed in the sidebar.

file id

Hooking up inputs

You will need to select reference counts files from your project. These can be specified in the reference counts data input.

When specifying the reference_counts, select all HTSeq count files in the reference dataset. For the St. Jude Cloud paper dataset, there are 1576 total files to select.

input selections

The UI will then display the number of selected reference files.

input selected

Next, you’ll need to hook up either the counts file or the BAM file you uploaded in the upload data section. In this example, we are using the counts-based version of the pipeline, so you can hook up the inputs by clicking on the input_counts slot and selecting the respective files. If you are using the realignment-based workflow, the process is similar with BAM input.

rnaseq expression workflow new ui

Additionally, a parameter selecting the tissue type to compare against can be selected. The available options are “Blood Cancer”, “Brain Tumor”, and “Solid Tumor”. Based on the selection, a reference collection of tumors of that type will be selected from St. Jude Cloud data and the input samples will be compared against this reference collection.

rnaseq expression workflow options new ui

If running the realignment workflow, the input file should be specified as a BAM to realign. HTSeq will run on the realigned BAM and the result will be passed into the t-SNE app for plotting against the reference data. The reference data should be specified in the reference_counts parameter as an array of HTSeq count files. The BAM will be specified to the RNA-Seq V2 stage as input_bam. The BAM should have properties set as described above. These will automatically be applied to the new HTSeq count file.

rnaseq expression workflow realignment options new ui

Optional settings

An optional gene list can be specified as an input. This is a single column text file with gene names. If this option is specified, the list will override the default selection of the top 1000 most differentially expressed genes.

By default, all tumor types (blood, brain, and solid) will be included in the analysis. It is also possible to run each of the three types separately by selecting the appropriate tumor type.

The workflow supports both Fresh/Frozen and FFPE inputs. By default, only Fresh/Frozen samples from the reference cohort will be included. This setting can be changed to run only FFPE or Fresh/Frozen + FFPE samples.

It is also possible to return intermediate files from the workflow. If desired, ‘originalmatrix’ is the matrix of counts after covariates and additional metadata has been removed, ‘variancestabilized’ is the matrix after running varianceStabilizingTransformation, ‘batchcorrected’ is the matrix after ComBat batch correction, ‘tsneoutput’ is the matrix after Rtsne has been run and ‘matrix’ is the original file input to the R script for plotting.

The expression workflow can optionally include PDX samples from the reference cohort as well as a selection of PDX models from Dana Farber Cancer Insitute. The inclusion of PDX samples can facilitate selection of mouse models for followup functional experimentation.

Starting the workflow

Once your input files are hooked up, you should be able to start the workflow by clicking the “Run Analysis” button in the top right hand corner of the workflow dialog.

Monitoring run progress

Once you have started one or more RNA-Seq Expression Classification runs, you can safely close your browser and come back later to check the status of the jobs. To do this, navigate to the workflow’s landing page. Next, click “View Results” then select the “View Running Jobs” option. You will be redirected to the job monitoring page. Each job you kicked off gets one row in this table.

You can click the ”+” on any of the runs to check the status of individual steps of the RNA-Seq Expression Classification pipeline. Other information, such as time, cost of individual steps in the pipeline, and even viewing the job logs can accessed by clicking around the sub-items.

Interpreting Results

Once the resulting analysis job completes, an HTML plot of the results should be available. The plot is generated with the ProteinPaint library. The plot can be zoomed arbitrarily and group labels can be turned on/off for manual inspection. User input samples will be displayed in black marks with a label on the graph as well as an entry in the legend.

rnaseq expression plot

Batch effect corrections

When comparing numerous samples such as those included in this analysis, it is important to consider the variation in data created by obtaining from various sources and across time. Therefore the t-SNE visualization incorporates some batch effect corrections for the reference data. Currently we correct for batch effect based on strandedness of the RNA-Seq sample, library type, read pairing, and read length.

Batch Variable Values
Library Type PolyA or Total
Read Length integer, in bp, e.g. 101, 126
Strandedness Stranded-Forward, Stranded-Reverse, Unstranded
Pairing Paired-end or Single-end

Known issues

There are a few known cautions with the RNA-Seq Expression Classification workflow.

Data must fit well defined values

The RNA-Seq Expression Classification pipeline reference data is based on GRCh38 aligned, Gencode v31 annotated samples from fresh, frozen tissue samples. It has not been evaluated for samples that do not meet this criteria.

The RNA-Seq Expression Classification pipeline reference data uses sequencing data from fresh, frozen tissue samples. It has not been evaluated for use with sequencing data generated from formalin-fixed paraffin-embedded (FFPE) specimens.

If running the count-based RNA-Seq Expression Classification pipeline, alignment must be done against the GRCh38noalt reference. It should use parameters as specified in our RNA-Seq workflow to minimize any discrepancies caused by differing alignment specification.

If running the count-based RNA-Seq Expression Classification pipeline, feature counts should be generated with htseq-count as described in our RNA-Seq workflow. This pipeline uses Gencode v31 annotations.

Batch correction requires a minimum of two samples per batch to run properly. Introducing a single sample batch by adding an input sample with a unique protocol will cause unexpected results.

Frequently asked questions

If you have any questions not covered here, feel free to reach out on our contact form.

Submit batch jobs on the command line

See How can I run an analysis workflow on multiple sample files at the same time?