Authors | Yu Liu1,2,*,#, Chunliang Li3,*, Shuhong Shen1,4,*, Xiaolong Chen2, Karol Szlachta2, Michael N. Edmonson2, Ying Shao2, Xiaotu Ma2, Judith Hyle3, Shaela Wright3, Bensheng Ju2, Michael C. Rusch2, Yanling Liu2, Benshang Li1,4, Michael Macias2, Liqing Tian2, John Easton2, Maoxiang Qian5, Jun J. Yang5,6,7, Shaoyan Hu8, A. Thomas Look9,10 and Jinghui Zhang2,# |
Publication | “Discovery of regulatory non-coding variants in individual cancer genomes using cis-X” (in submission) |
Technical Support | Contact Us |
Overview
Activating regulatory variants usually cause cis-activation of their target genes. To find cis-activated genes, cis-X combines allele-specific (imbalanced) expression (ASE) and outlier high expression (OHE) signals. It then searches for variants in the same topologically associated domain (TAD) as each candidate, including structural variants (SV), copy number aberrations (CNA), single nucleotide variants (SNV), and insertions/deletions (indel).
A transcription factor binding analysis is also done, using motifs from HOCOMOCO v10 models.
cis-X currently only works with hg19 (GRCh37).
Inputs
Name | Type | Description | Example |
---|---|---|---|
Sample ID | String | The ID of the input sample | SJALL018373_D1 |
Disease subtype | String | The disease name under analysis. Must be either NBL or TALL. | TALL |
Single nucleotide variants | File | Tab-delimited file containing raw sequence variants | *.txt |
CNV/LOH regions | File | Tab-delimited file containing any aneuploidy region existing in the tumor genome under analysis | *.txt |
RNA-Seq BAM | File | BAM file aligned to hg19 (GRCh37) | *.bam |
RNA-Seq BAM index | File | BAM index for the given BAM | *.bam.bai |
Gene expression table | File | Tab-delimited file containing gene level expressions for the tumor under analysis in FPKM | *.txt |
Somatic SNV/indels | File | Tab-delimited file containing somatic SNV/indels in the tumor genome | *.txt |
Somatic SVs | File | Tab-delimited file containing somatic acquired structural variants in the tumor genome | *.txt |
Somatic CNVs | File | Tab-delimited file containing copy number aberrations in the tumor genome | *.txt |
CNV/LOH action | String | The behavior when handling markers in CNV/LOH regions. Can be either keep or drop. [default: keep] | drop |
Minimum coverage for WGS | Integer | The minimum coverage in WGS to be included in the analysis [default: 10] | 10 |
Minimum coverage for RNA-Seq | Integer | The minimum coverage in RNA-Seq to be included in the analysis [default: 10] | 5 |
Candidate FPKM threshold | Float | The FPKM threshold for the nomination of a cis-activated candidate [default: 5.0] | 0.1 |
User annotations | File | User applied annotations [optional] | *.bed |
chr Prefix | String | Whether the names in the reference sequence dictionary are prefixed with “chr”. Must be either TRUE or FALSE. [default: TRUE] | TRUE |
TAD annotations | File | TAD annotations [optional] | *.bed |
Input file configuration
cis-X requires six tab-delimited input files to be prepared in advance. These files can be uploaded via the command line.
Note: CNV/LOH regions, somatic SNV/indels, somatic SVs, and somatic CNVs can each be “empty”, but using empty inputs will produce results with a much higher false positive rate.
Single Nucleotide Variants
A list of single nucleotide markers is a tab-delimited file with the following columns:
- Chr: chromosome name for the marker
- Pos: genomic start location for the marker
- Chr_Allele: reference allele
- Alternative_Allele: alternative allele
- reference_tumor_count: reference allele count in the tumor genome
- alternative_tumor_count: alternative allele count in the tumor genome
- reference_normal_count: reference allele count in the matched normal genome
- alternative_normal_count: alternative allele count in the matched normal genome
This file can be generated with Bambino.
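Before submitting the file, it can help to check that the expected columns are present and to see which markers would pass a coverage cutoff. The following is a minimal sketch and not part of cis-X: the function names are hypothetical, the rows are made-up values, and treating coverage as the sum of the reference and alternative tumor counts is an assumption about how the minimum WGS coverage is applied.

```python
import csv
import io

# Column names as listed above for the single nucleotide markers file.
REQUIRED_COLUMNS = [
    "Chr", "Pos", "Chr_Allele", "Alternative_Allele",
    "reference_tumor_count", "alternative_tumor_count",
    "reference_normal_count", "alternative_normal_count",
]

def read_markers(handle):
    """Read a tab-delimited markers file and verify its header (hypothetical helper)."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    return list(reader)

def passes_coverage(row, min_coverage=10):
    """Assumption: coverage = reference + alternative read count in the tumor."""
    coverage = int(row["reference_tumor_count"]) + int(row["alternative_tumor_count"])
    return coverage >= min_coverage

# Illustrative rows only (made-up values, not from a real sample).
example = io.StringIO(
    "\t".join(REQUIRED_COLUMNS) + "\n"
    "chr1\t1000\tA\tG\t12\t9\t20\t0\n"
    "chr1\t2000\tC\tT\t3\t2\t15\t0\n"
)
markers = read_markers(example)
kept = [m for m in markers if passes_coverage(m)]
print(len(markers), "markers read,", len(kept), "pass the coverage cutoff")
```

To check a real file, pass an open file handle instead of the in-memory example.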
Example
Chr | Pos | Chr_Allele | Alternative_Allele | reference_tumor_count | alternative_tumor_count | reference_normal_count | alternative_normal_count |
---|---|---|---|---|---|---|---|
chr11 | 61396 | TT | 0 | 3 | 0 | 10 | |
chr11 | 72981 | T | 1 | 3 | 2 | 3 |
CNV/LOH regions
The CNV/LOH regions are all the genomic regions carrying copy number variations (CNV) or loss of heterozygosity (LOH), which will be filtered out during analysis.
This is a tab-delimited file in the bed format. It must have at least the following three columns:
- chrom: chromosome name
- loc.start: genomic start location
- loc.end: genomic end location
If no CNV/LOH are in the genome under analysis, a file with no rows (but including headers) can be provided.
This file can be generated with CONSERTING.
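A header-only file, as described above for a genome with no CNV/LOH, can be written in a few lines. This is a minimal sketch: the helper name and the output file name are hypothetical. The same approach would apply to the somatic SNV/indel, SV, and CNV files, each with its own column names.

```python
import csv
import os
import tempfile

def write_header_only(path, columns):
    """Write a tab-delimited file that contains the header row and no data rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)

# Hypothetical output location; any writable path works.
empty_path = os.path.join(tempfile.gettempdir(), "empty_cnvloh.txt")
write_header_only(empty_path, ["chrom", "loc.start", "loc.end"])
```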
Example
chrom | loc.start | loc.end | Sample | seg.mean | LogRatio | source |
---|---|---|---|---|---|---|
chr9 | 10712 | 37855747 | SJALL018373_D1 | 0.471181417 | | LOH |
chr9 | 20276901 | 20703900 | SJALL018373_D1 | -0.978 | -5.696 | CNV |
Gene expression table
The gene expression table is a tab-delimited file containing gene level expressions for the tumor under analysis. The expressions are in FPKM (fragments per kilobase of transcript per million mapped reads).
- GeneID: gene Ensembl ID
- GeneName: gene symbol
- Type: transcript type
- Status: transcript status (must be KNOWN, NOVEL, or PUTATIVE)
- Chr: chromosome name
- Start: genomic start location
- End: genomic end location
- [SampleID…]: FPKM for the given sample
This file can be generated from the output of HTSeq-count, preprocessed through mergeData_geneName.pl (available with the distribution of cis-X).
The data must be able to match values in the given gene-specific reference expression matrices generated from a larger cohort.
Example
GeneID | GeneName | Type | Status | Chr | Start | End | SJALL018373_D1 |
---|---|---|---|---|---|---|---|
ENSG00000261122.2 | 5S_rRNA | lincRNA | NOVEL | chr16 | 34977639 | 34990886 | 0.0000 |
ENSG00000249352.3 | 7SK | lincRNA | NOVEL | chr5 | 68266266 | 68325992 | 4.5937 |
Somatic SNV/indels
This is a tab-delimited file containing somatic sequence mutations present in the genome under analysis. It includes both single nucleotide variants (SNV) and small insertion/deletions (indel). The file must have the following columns:
- chr: chromosome name
- pos: genomic start location
- ref: reference nucleotide
- mutant: mutant nucleotide
- type: mutation type (must be either snv or indel)
Note that the coordinate used for an indel is after the inserted sequence.
If no SNV/indels are in the sample under analysis, a file with no rows (but including headers) can be provided.
This file can be created with Bambino and then preprocessed using the steps taken in “The genetic basis of early T-cell precursor acute lymphoblastic leukaemia”.
Example
chr | pos | ref | mut | type |
---|---|---|---|---|
chr1 | 24782720 | G | A | snv |
chr11 | 82896176 | T | C | snv |
Somatic SVs
This is a tab-delimited file containing somatic-acquired structural variants (SV) in the cancer genome. The file must have the following columns:
- chrA: chromosome name of the left breakpoint
- posA: genomic location of the left breakpoint
- ortA: strand orientation of the left breakpoint
- chrB: chromosome name of the right breakpoint
- posB: genomic location of the right breakpoint
- ortB: strand orientation of the right breakpoint
Strand orientations are denoted with a + for a sense or coding strand and a - for an antisense or non-coding strand.
If no somatic SVs are in the sample under analysis, a file with no rows (but including headers) can be provided.
This file can be generated by CREST.
Example
chrA | posA | ortA | chrB | posB | ortB | type |
---|---|---|---|---|---|---|
chr11 | 33913169 | + | chr7 | 142494049 | - | CTX |
chr11 | 64219334 | + | chr2 | 205042527 | - | CTX |
Somatic CNVs
This is a tab-delimited file containing the genomic regions with somatic-acquired copy number aberrations (CNA) in the cancer genome.
- chr: chromosome name
- start: genomic start location
- end: genomic end location
- logR: log2 ratio
If no somatic CNVs are in the sample under analysis, a file with no rows (but including headers) can be provided.
This file can be generated by CONSERTING.
Example
chr | start | end | logR |
---|---|---|---|
chr9 | 20276901 | 20703900 | -5.696 |
Outputs
Name | Description |
---|---|
cis-activated candidates | cis-activated candidates in the tumor genome under analysis |
SV candidates | Structural variant (SV) candidates predicted to be causal for the cis-activated genes in the regulatory territory |
CNA candidates | Copy number aberrations (CNA) predicted to be causal for the cis-activated genes in the regulatory territory |
SNV/indel candidates | SNV/indel candidates predicted to be functional, along with the predicted transcription factors |
OHE results | Raw outlier high expression (OHE) results |
Gene level ASE results | Raw gene level allelic specific expression (ASE) results |
Single marker ASE results | Raw single marker allelic specific expression (ASE) results |
Creating a workspace
Before you can run one of our workflows, you must first create a workspace in DNAnexus for the run. Refer to the general workflow guide to learn how to create a DNAnexus workspace for each workflow run.
You can navigate to the Cis-X workflow page here
Uploading Input Files
cis-X requires a total of eight files to be uploaded as input.
Refer to the general workflow guide to learn how to upload input files to the workspace you just created.
Running the Workflow
Refer to the general workflow guide to learn how to launch the workflow, hook up input files, adjust parameters, start a run, and monitor run progress.
Analysis of Results
Each tool in St. Jude Cloud produces a visualization that makes understanding results more accessible than working with Excel spreadsheets or tab-delimited files. This is the primary way we recommend you work with your results.
Refer to the general workflow guide to learn how to access these visualizations.
We also include the raw output files for you to dig into if the visualization is not sufficient to answer your research question.
Refer to the general workflow guide to learn how to access raw results files.
Interpreting results
cis-activated candidates
The main result file contains the cis-activated candidates in the tumor genome under analysis.
- gene: gene accession number (RefSeq ID)
- gsym: gene symbol
- chrom: chromosome name
- strand: strand orientation
- start: genomic start location
- end: genomic end location
- cdsStartStat: coding sequence (CDS) start status
- cdsEndStat: coding sequence (CDS) end status
- markers: number of heterozygous markers in this gene
- ase_markers: number of heterozygous markers showing allele-specific expression (ASE)
- average_ai_all: average B-allele frequency (BAF) difference between RNA and DNA for all heterozygous markers
- average_ai_ase: average BAF difference between RNA and DNA for ASE markers
- pval_all_markers: p-value for each marker in the ASE test
- pval_ase_markers: p-value for ASE markers in the ASE test
- ai_all_markers: BAF difference between RNA and DNA for all heterozygous markers
- ai_ase_markers: BAF difference between RNA and DNA for ASE markers
- comb.pval: combined p-value for the ASE test
- mean.delta: average BAF difference between RNA and DNA for all markers
- rawp: raw p-value for the ASE test
- Bonferroni: adjusted p-value for the ASE test (single-step Bonferroni)
- ABH: adjusted p-value for the ASE test (Benjamini-Hochberg)
- FPKM: FPKM value
- loo.source: which reference expression matrix was used in the outlier high expression (OHE) test
- loo.cohort.size: number of cases in the reference expression matrix for this gene
- loo.pval: p-value of the OHE test
- loo.rank: rank of the case under analysis among the reference cases
- imprinting.status: imprinting status of the gene
- candidate.group: status of the gene, combining both ASE and outlier tests
Strand orientations are denoted with a + for a sense or coding strand and a - for an antisense or non-coding strand.
Coding sequence status is typically one of “none” (not specified), “unk” (unknown), “incmpl” (incomplete), or “cmpl” (complete).
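The Bonferroni and ABH columns are adjusted versions of rawp. As a reminder of what these two standard adjustments do, here is a generic sketch using the textbook definitions. Whether cis-X computes them in exactly this way, and over which set of genes, is not stated here; the function names and p-values are illustrative only.

```python
def bonferroni(pvalues):
    """Single-step Bonferroni: multiply each p-value by the number of tests, capped at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

# Made-up raw p-values for four hypothetical genes.
raw = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(raw))
print(benjamini_hochberg(raw))
```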
Example
gene | gsym | chrom | strand | start | end | cdsStartStat | cdsEndStat | markers | ase_markers | average_ai_all | average_ai_ase | pval_all_markers | pval_ase_markers | ai_all_markers | ai_ase_markers | comb.pval | mean.delta | rawp | Bonferroni | ABH | FPKM | loo.source | loo.cohort.size | loo.pval | loo.rank | imprinting.status | candidate.group |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NM_145804 | ABTB2 | chr11 | - | 34172533 | 34379555 | cmpl | cmpl | 5 | 5 | 0.5 | 0.500 | 0.001953125,0.001953125,0.001953125,6.10351562500001e-05,0.000244140625 | 0.001953125,0.001953125,0.001953125,6.10351562500001e-05,0.000244140625 | 0.5,0.5,0.5,0.5,0.5 | 0.5,0.5,0.5,0.5,0.5 | 0.000644290972057077 | 0.5 | 0.000644290972057077 | 0.632049443587993 | 0.0110866672927557 | 7.6776 | bi_cohort | 40 | 0.0367241086505276 | 1 | | ase_outlier |
NM_003189 | TAL1 | chr1 | - | 47681961 | 47698007 | cmpl | cmpl | 2 | 2 | 0.482 | 0.482 | 6.66361745922277e-28,3.30872245021211e-24 | 6.66361745922277e-28,3.30872245021211e-24 | 0.464912280701754,0.5 | 0.464912280701754,0.5 | 4.69553625126628e-26 | 0.482456140350877 | 4.69553625126628e-26 | 4.60632106249222e-23 | 6.11761294450693e-24 | 8.8168 | white_list | 167 | 0.0139385771987089 | 1 | | ase_outlier |
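When working with the raw file instead of the visualization, a common first step is to keep only the genes supported by both tests and order them by the outlier p-value. The sketch below assumes the raw file is tab-delimited and that such genes carry the candidate.group label ase_outlier, as the example rows suggest. The helper name is hypothetical, and only a few columns are included to keep the example short.

```python
import csv
import io

def load_candidates(handle, group="ase_outlier"):
    """Return rows whose candidate.group matches, sorted by loo.pval (smallest first)."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = [row for row in reader if row["candidate.group"] == group]
    rows.sort(key=lambda row: float(row["loo.pval"]))
    return rows

# A reduced version of the example table above (selected columns only).
example = io.StringIO(
    "gsym\tFPKM\tloo.pval\tcandidate.group\n"
    "ABTB2\t7.6776\t0.0367241086505276\tase_outlier\n"
    "TAL1\t8.8168\t0.0139385771987089\tase_outlier\n"
)
candidates = load_candidates(example)
for row in candidates:
    print(row["gsym"], row["FPKM"], row["loo.pval"])
```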
SV candidates
Structural variant (SV) candidates are the SVs predicted to be causal for the cis-activated genes in the regulatory territory.
- left.candidate.inTAD: cis-activated candidate near the left breakpoint
- right.candidate.inTAD: cis-activated candidate near the right breakpoint
- chrA: chromosome name of the left breakpoint
- posA: genomic location of the left breakpoint
- ortA: strand orientation of the left breakpoint
- chrB: chromosome name of the right breakpoint
- posB: genomic location of the right breakpoint
- ortB: strand orientation of the right breakpoint
- type: type of translocation
Example
left.candidate.inTAD | right.candidate.inTAD | chrA | posA | ortA | chrB | posB | ortB | type |
---|---|---|---|---|---|---|---|---|
LMO2 | | chr11 | 33913169 | + | chr7 | 142494049 | - | CTX |
CNA candidates
Copy number aberration (CNA) candidates are the CNAs predicted to be causal for the cis-activated genes in the regulatory territory.
- candidate.inTAD: cis-activated candidate by the CNA
- chr: chromosome name
- start: genomic start position
- end: genomic end location
- logR: log ratio of the CNA
SNV/indel candidates
SNV/indel candidates are the mutations predicted to be functional, together with the transcription factors predicted to bind them. The mutations are also annotated with known regulatory elements reported by the NIH Roadmap Epigenomics Project, which collected data from 111 cell lines.
- chrom: chromosome name
- pos: genomic start position
- ref: reference allele genotype
- mut: mutant allele genotype
- type: mutation type (either snv or indel)
- target: cis-activated candidate
- dist: distance between the mutation and transcription start sites of the target gene
- tf: transcription factors predicted to have the binding motif introduced by the mutation
- EpiRoadmap_enhancer: enhancer regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)
- EpiRoadmap_promoter: promoter regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)
- EpiRoadmap_dyadic: dyadic regions that overlap with the mutation (from the NIH Roadmap Epigenomics Project)
Example
chrom | pos | ref | mut | type | target | dist | tf | EpiRoadmap_enhancer | EpiRoadmap_promoter | EpiRoadmap_dyadic |
---|---|---|---|---|---|---|---|---|---|---|
chr1 | 47696311 | C | T | snv | TAL1 | 1696 | BCL11A,CEBPG,PBX2,YY1,ZBTB4 | Brain,Digestive,ES-deriv,ESC,HSC & B-cell,Heart,Muscle,Other,Sm. Muscle,iPSC |
OHE results
OHE results are the raw results for the outlier expression test.
- Gene: gene symbol
- fpkm.raw: FPKM value
- size.bi: number of cases in the bi-allelic reference cohort
- p.bi: p-value in the outlier test using the bi-allelic reference cohort
- rank.bi: rank of the expression level in the case under analysis compared to the bi-allelic reference cohort
- size.cohort: number of cases in the entire reference cohort
- p.cohort: p-value in the outlier test using the entire reference cohort
- rank.cohort: rank of the expression level in the case under analysis compared to the entire reference cohort
- size.white: number of cases in the whitelist reference cohort
- p.white: p-value in the outlier test using the whitelist reference cohort
- rank.white: rank of the expression level in the case under analysis compared to the whitelist reference cohort
Example
Gene | fpkm.raw | size.bi | p.bi | rank.bi | size.cohort | p.cohort | rank.cohort | size.white | p.white | rank.white |
---|---|---|---|---|---|---|---|---|---|---|
7SK | 4.5937 | na | na | na | 264 | 0.716284011918374 | 162 | na | na | na |
A1BG | 0.2312 | 24 | 0.900132642257996 | 21 | 264 | 0.84055666600945 | 222 | na | na | na |
Gene level ASE results
Gene level ASE results are the raw results from the gene level ASE test.
- gene: gene accession number (RefSeq ID)
- gsym: gene symbol
- chrom: chromosome name
- strand: strand orientation
- start: genomic start location
- end: genomic end location
- cdsStartStat: coding sequence (CDS) start status
- cdsEndStat: coding sequence (CDS) end status
- markers: number of heterozygous markers in this gene
- ase_markers: number of heterozygous markers showing allele-specific expression (ASE)
- average_ai_all: average B-allele frequency (BAF) difference between RNA and DNA for all heterozygous markers
- average_ai_ase: average BAF difference between RNA and DNA for ASE markers
- pval_all_markers: p-value for each marker in the ASE test
- pval_ase_markers: p-value for ASE markers in the ASE test
- ai_all_markers: BAF difference between RNA and DNA for all heterozygous markers
- ai_ase_markers: BAF difference between RNA and DNA for ASE markers
- comb.pval: combined p-value for the ASE test
- mean.delta: average BAF difference between RNA and DNA for all markers
- rawp: raw p-value for the ASE test
- Bonferroni: adjusted p-value for the ASE test (single-step Bonferroni)
- ABH: adjusted p-value for the ASE test (Benjamini-Hochberg)
Example
gene | gsym | chrom | strand | start | end | cdsStartStat | cdsEndStat | markers | ase_markers | average_ai_all | average_ai_ase | pval_all_markers | pval_ase_markers | ai_all_markers | ai_ase_markers | comb.pval | mean.delta | rawp | Bonferroni | ABH |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
NM_024684 | AAMDC | chr11 | + | 77532207 | 77583398 | cmpl | cmpl | 2 | 0 | 0.079 | na | 0.924775093657227,0.0331439677875056 | na | 0.00892857142857145,0.149122807017544 | na | 0.175073458624837 | 0.0790256892230577 | 0.175073458624837 | 1 | 0.480780882445856 |
NM_015423 | AASDHPPT | chr11 | + | 105948291 | 105969419 | cmpl | cmpl | 2 | 0 | 0.023 | na | 0.749258624760841,1 | na | 0.0384615384615384,0.00769230769230766 | na | 0.86559726476049 | 0.023076923076923 | 0.86559726476049 | 1 | 0.873257417545981 |
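The per-marker columns hold comma-separated values, one per heterozygous marker, with na where there is nothing to report. In the example rows above, mean.delta equals the arithmetic mean of the per-marker BAF differences and comb.pval equals the geometric mean of the per-marker p-values. That relationship is inferred from the example values only, not from a documented formula, so treat the sketch below as an illustration of how to parse and summarize these fields. The helper names are hypothetical.

```python
from math import exp, log

def parse_list(field):
    """Split a comma-separated per-marker field into floats; 'na' yields an empty list."""
    if field == "na":
        return []
    return [float(value) for value in field.split(",")]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    """Geometric mean via logs; requires every value to be strictly positive."""
    return exp(sum(log(v) for v in values) / len(values))

# Per-marker fields copied from the AAMDC example row.
pvals = parse_list("0.924775093657227,0.0331439677875056")
deltas = parse_list("0.00892857142857145,0.149122807017544")

print(arithmetic_mean(deltas))  # compare with mean.delta in the AAMDC row
print(geometric_mean(pvals))    # compare with comb.pval in the AAMDC row
```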
Single marker ASE results
Single marker ASE results are the raw results from the single marker ASE test.
- chrom: chromosome name
- pos: genomic start position
- ref: reference allele genotype
- mut: non-reference allele genotype
- cvg_wgs: coverage of the marker from the whole genome sequence (WGS)
- mut_freq_wgs: non-reference allele fraction in the WGS
- cvg_rna: coverage of the marker from the RNA-Seq
- mut_freq_rna: non-reference allele fraction in the RNA-Seq
- ref.1: read count of the reference allele in the RNA-Seq
- var: read count of the non-reference allele in the RNA-Seq
- pvalue: p-value from the binomial test
- delta.abs: absolute difference of the non-reference allele fraction between the WGS and RNA-Seq
Example
chrom | pos | ref | mut | cvg_wgs | mut_freq_wgs | cvg_rna | mut_freq_rna | ref.1 | var | pvalue | delta.abs |
---|---|---|---|---|---|---|---|---|---|---|---|
chr11 | 204147 | G | A | 36 | 0.472 | 85 | 0.553 | 38 | 47 | 0.385669420119278 | 0.0529411764705883 |
chr11 | 205198 | C | A | 23 | 0.522 | 83 | 0.313 | 57 | 26 | 0.000877551780002863 | 0.186746987951807 |
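To make the pvalue column concrete, the sketch below runs an exact two-sided binomial test on the RNA-Seq read counts of one marker. It assumes a null non-reference allele fraction of 0.5 and the common convention of summing the probabilities of all outcomes no more likely than the observed one. Both are assumptions, since the exact test settings used by cis-X are not given here, and the function names are illustrative. If the assumptions hold, the first example row (ref.1 = 38, var = 47) should give a p-value close to the tabulated one. In the two example rows, delta.abs also matches the distance of the RNA-Seq non-reference fraction from 0.5, which the second function reproduces.

```python
from math import comb

def binomial_two_sided_p(successes, trials, null_fraction=0.5):
    """Exact two-sided binomial test: sum the probabilities of all outcomes
    that are no more likely than the observed one under the null."""
    probabilities = [
        comb(trials, k) * null_fraction**k * (1 - null_fraction)**(trials - k)
        for k in range(trials + 1)
    ]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    threshold = probabilities[successes] * (1 + 1e-7)
    return min(1.0, sum(p for p in probabilities if p <= threshold))

def distance_from_half(ref_count, var_count):
    """Absolute distance of the non-reference allele fraction from 0.5."""
    return abs(var_count / (ref_count + var_count) - 0.5)

# RNA-Seq read counts from the first example row (ref.1 = 38, var = 47).
ref_count, var_count = 38, 47
p_value = binomial_two_sided_p(var_count, ref_count + var_count)
delta = distance_from_half(ref_count, var_count)
print(p_value, delta)
```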
Frequently asked questions
None yet! If you have any questions not covered here, feel free to reach out on our contact form.
Footnotes
1 Pediatric Translational Medicine Institute, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
2 Department of Computational Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
3 Department of Tumor Cell Biology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
4 Key Laboratory of Pediatric Hematology & Oncology Ministry of Health, Department of Hematology & Oncology, Shanghai Children’s Medical Center, Shanghai Jiao Tong University School of Medicine, Shanghai, China
5 Department of Pharmaceutical Sciences, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
6 Hematological Malignancies Program, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
7 Department of Oncology, St. Jude Children’s Research Hospital, Memphis, TN 38105, USA
8 Children’s Hospital of Soochow University, Suzhou, Jiangsu, China
9 Department of Pediatric Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA 02215, USA
10 Division of Pediatric Hematology-Oncology, Boston Children’s Hospital, MA 02115, USA
* Contributed equally to this work.
# Correspondence should be addressed to Y.L. (liuyu@scmc.com.cn) or J.Z. (jinghui.zhang@stjude.org).