A departure from the usual productivity theme, I pleased to announce the release of my RNA-Sequencing differential expression (RSDE for short) package, available on Github. I have worked on this package for the past 4 months and have put a lot of effort into making it as easy to use as possible, so please let me know if you run into any issues.
My vision of how easy it should be to use is outlined in the headline photo and title:
.bam
files —-> ??? —> Results!
Structural variant detection is grey because it is not yet available.
The example included in the code looks at the differential expression of two prostate cancer cell lines, LNCaP and PrEC. It only includes reads from chromosome 9, an arbitrary choice to reduce the time of running the example. Currently, the example takes 20 minutes to run on my laptop (MacBook Pro, Late 2008 model, 2.4G Hz Intel Core Duo, 4 GB RAM), and this is really supposed to be run on a server that holds all your gigantic (10gb+) .bam
files. If you do choose to run it on your own personal computer, I would recommend leaving your computer alone while it runs, or else your entire machine may crash. Mine has.
Basically, the inputs are (they will be more thoroughly described below):
.bam
file locations, sample IDs, and experiment descriptions
bedtools Coverage
and htseq-count
)
And after submitting all those files and waiting ~20 min for the example to finish, you get:
.bam
files, via RSeQC. Most of the commands listed in the manual are performed. In alphabetical order, they are (with documentation copy/pasted from the RSeQC manual):
clipping_profile.py
: This program is used estimate clipping profile of RNA-seq reads from BAM or SAM file. Note that to use this function, CIGAR strings within SAM/BAM file should have S
operation (This means your reads mapper should support clipped mapping).
geneBody_coverage.py
: Read coverage over gene body. This module is used to check if reads coverage is uniform and if there is any 5’/3’ bias. This module scales all transcripts to 100 nt and calculates the number of reads covering each nucleotide position. Finally, it generates a plot illustrating the coverage profile along the gene body. NOTE: this module requires lots of memory for large BAM files, because it load the entire BAM file into memory.
infer_experiment.py
: This program is used to speculate how RNA-seq sequencing were configured, especially how reads were stranded for strand-specific RNA-seq data, through comparing reads' mapping information to the underneath gene model.
inner_distance.py
: This module is used to calculate the inner distance (or insert size) between two paired RNA reads. The distance is the mRNA length between two paired fragments.
junction_annotation.py
: For a given alignment file (-i
) in BAM or SAM format and a reference gene model (-r
) in BED format, this program will compare detected splice junction to reference gene model.
junction_saturation.py
: It’s very important to check if current sequencing depth is deep enough to perform alternative splicing analyses. For a well annotated organism, the number of expressed genes in particular tissue is almost fixed so the number of splice junctions is also fixed, all splice junctions can be predetermined according to reference gene model. All (annotated) splice junctions should be rediscovered from a saturated RNA-seq data, otherwise, downstream alternative splicing analysis is problematic because low abundance splice junctions are missing. This module checks for saturation by resampling 5%, 10%, 15%, …, 95% of total alignments from BAM or SAM file, and then detects splice junctions from each subset and compares them to reference gene model.
read_distribution.py
: Provided a BAM/SAM file and reference gene model, this module will calculate how mapped reads were distributed over genome feature (like CDS exon, 5'UTR exon, 3' UTR exon, Intron, Intergenic regions). When genome features are overlapped (e.g. a region could be annotated as both exon and intron by two different transcripts) , they are prioritize as: CDS exons > UTR exons > Introns > Intergenic regions, for example,if a read was mapped to both CDS exon and intron, it will be assigned to CDS exons.
read_duplication.py
: Two strategies were used to determine reads duplication rate:
read_GC.py
read_NVC.py
: This module is used to check the nucleotide composition bias. Due to random priming, certain patterns are over represented at the beginning (5'end) of reads. This bias could be easily examined by NVC (Nucleotide versus cycle) plot. NVC plot is generated by overlaying all reads together, then calculating nucleotide composition for each position of read (or each sequencing cycle). In ideal condition (genome is random and RNA-seq reads is randomly sampled from genome), we expect A%=C%=G%=T%=25% at each position of reads.
read_quality.py
: According to SAM specification, if Q is the character to represent “base calling quality” in SAM file, then Phred Quality Score = ord(Q) - 33
. Here ord()
is python function that returns an integer representing the Unicode code point of the character when the argument is a unicode object, for example, ord('a')
returns 97. Phred quality score is widely used to measure “reliability” of base-calling, for example, phred quality score of 20 means there is 1/100 chance that the base-calling is wrong, phred quality score of 30 means there is 1/1000 chance that the base-calling is wrong. In general: Phred quality score = -10*log10P
, here P
is probability that base-calling is wrong.
RPKM_count.py
: Given a BAM file and reference gene model, this program will calculate the raw count and RPKM values for transcript at exon, intron and mRNA level. For strand specific RNA-seq data, program will assign read to its parental gene according to strand rule, if you don’t know the strand rule, run infer_experiment.py
. Please note that chromosome ID, genome cooridinates should be concordant between BAM and BED files.
RPKM_saturation.py
: The precision of any sample statitics (RPKM) is affected by sample size (sequencing depth); “resampling” or “jackknifing” is a method to estimate the precision of sample statistics by using subsets of available data. This module will resample a series of subsets from total RNA reads and then calculate RPKM value using each subset. By doing this we are able to check if the current sequencing depth was saturated or not (or if the RPKM values were stable or not) in terms of genes' expression estimation. If sequencing depth was saturated, the estimated RPKM value will be stationary or reproducible. By default, this module will calculate 20 RPKM values (using 5%, 10%, … , 95%,100% of total reads) for each transcripts. Although people can use KPSS test to determine if the estimated RPKM level is in stationary or not. Visual inspection is more accurate.
bedtools coverage
htseq-count
DESeq
(in the R
programming language). This include gene lists and heat maps.
This is where all the results files will be. Folders that will be created are:
The basic directories are:
expression/ rseqc/ circos/
Quality control files go in:
rseqc/[all your sample ids]
Circos files go in:
circos/[all your sample ids]
The counts files will go in:
expression/bedtools/[all your sample ids] expression/htseq/[all your sample ids]
Heatmaps will go:
expression/bedtools/figures/ expression/hsteq/figures/
test-data/conditions_chr9.tab
Text of example file:
# These data are a subset of http://0-www.ncbi.nlm.nih.gov.elis.tmu.edu.tw/geo/query/acc.cgi?acc=GSE27619 from GEO, specifically: # GSM721116, GSM721117, GSM721118, GSM721119, GSM721123, GSM721124 ## LNCaP and PrEC are the names of prostate cancer cell lines used in this GEO dataset ## As you may have noticed, any lines starting with a hash (`#') are ignored ## This file is based on the columns, so if you add extraneous columns, they will not be read into the program (ie you can add comments) #### # This is a tab-delimited file, with the following columns: # bam_prefix id group gender read_type strandedness ## bam_prefix = the filename, without the .bam ## id = (user-defined, could be anything) ## group = (user-defined, could be anything) ## gender = male --or-- female ## read_type = single_read --or-- paired_end ## strandedness = whether the cDNA library preparation was strand-specific or not. values = "strand_specific" --or-- "not_strand_specific" test-data/GSM721117_mctp_20F0GAAXX_1_chr9_withheader LNCaP_1 LNCaP male single_read not_strand_specific # (This is a comment) Illumina test-data/GSM721119_mctp_20F0GAAXX_2_chr9_withheader LNCaP_2 LNCaP male single_read not_strand_specific # Illumina test-data/GSM721118_mctp_20F0GAAXX_3_chr9_withheader LNCaP_3 LNCaP male single_read not_strand_specific test-data/GSM721116_mctp_20F0GAAXX_4_chr9_withheader LNCaP_4 LNCaP male single_read not_strand_specific test-data/GSM721123_mctp_30CYNAAXX_5_chr9_withheader PrEC_1 PrEC male single_read not_strand_specific test-data/GSM721124_mctp_209ENAAXX_8_chr9_withheader PrEC_2 PrEC male single_read not_strand_specific
A tab-delimited file of the conditions of each sample, including:
.bam
file location (without the final .bam
, this makes the processing much easier)
.bam
files MUST be indexed (have a accompanying .bai
files) for RSeQC
HTSeq
and DEXSeq
(differential exon usage)
test-data/hg19_ucsc_genes.gtf
First few lines of example file:
chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 12613 12721 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc001aaa.3"; transcript_id "uc001aaa.3"; chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene exon 12646 12697 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene exon 13221 14409 0.000000 + . gene_id "uc010nxr.1"; transcript_id "uc010nxr.1"; chr1 hg19_knownGene start_codon 12190 12192 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene CDS 12190 12227 0.000000 + 0 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene exon 11874 12227 0.000000 + . gene_id "uc010nxq.1"; transcript_id "uc010nxq.1"; chr1 hg19_knownGene CDS 12595 12721 0.000000 + 1 gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
GTF files are a subset of GFF (General Feature Format) files. To get one of these files, follow the following steps: (there is probably a similar method to use the ENSEMBL website, but I am not familiar with it so I am giving these instructions that I myself have followed many times)
hg19_ucsc_genes.gtf
)
.gtf
) in the filename
test-data/hg19_ucsc_genes_chr9_dexseq.gtf
First few lines of example file:
chr9 hg19_ucsc_genes_chr9.gtf aggregate_gene 11987 14525 . + . gene_id "uc011llp.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 11987 12340 . + . transcripts "uc011llp.1"; exonic_part_number "001"; gene_id "uc011llp.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 12726 12834 . + . transcripts "uc011llp.1"; exonic_part_number "002"; gene_id "uc011llp.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 13334 14525 . + . transcripts "uc011llp.1"; exonic_part_number "003"; gene_id "uc011llp.1" chr9 hg19_ucsc_genes_chr9.gtf aggregate_gene 14511 29739 . - . gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 14511 14940 . - . transcripts "uc010mgm.1"; exonic_part_number "001"; gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 15081 15149 . - . transcripts "uc010mgm.1+uc022bcs.1"; exonic_part_number "002"; gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 15909 16061 . - . transcripts "uc010mgm.1+uc022bcs.1"; exonic_part_number "003"; gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 16188 16421 . - . transcripts "uc011llq.1"; exonic_part_number "004"; gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1" chr9 hg19_ucsc_genes_chr9.gtf exonic_part 16718 16876 . - . transcripts "uc010mgm.1+uc022bcs.1+uc011llq.1"; exonic_part_number "005"; gene_id "uc010mgp.1+uc010mgm.1+uc022bcs.1+uc011llq.1+uc003zfu.1"
To create this file, You need a GTF file (which can be obtained as described in the GTF section), and then use the script included in rna-seq-diff-exprn/scripts/external/dexseq_prepare_annotation.py
:
$ python2.7 scripts/external/dexseq_prepare_annotation.py test-data/hg19_ucsc_genes.gtf test-data/hg19_ucsc_genes_dexseq.gtf
The dollar sign $
indicates a bash shell and shows that we are using a command-line interface, as opposed to a command embedded in source code such as this document.
bedtools coverage
test-data/hg19_ucsc_genes.bed
First few lines of example file:
track name="tb_knownGene" description="table browser query on knownGene" visibility=3 url= chr1 11873 14409 uc001aaa.3 0 + 11873 11873 0 3 354,109,1189, 0,739,1347, chr1 11873 14409 uc010nxr.1 0 + 11873 11873 0 3 354,52,1189, 0,772,1347, chr1 11873 14409 uc010nxq.1 0 + 12189 13639 0 3 354,127,1007, 0,721,1529, chr1 14361 16765 uc009vis.3 0 - 14361 14361 0 4 468,69,147,159, 0,608,1434,2245, chr1 16857 17751 uc009vjc.1 0 - 16857 16857 0 2 198,519, 0,375, chr1 15795 18061 uc009vjd.2 0 - 15795 15795 0 5 152,159,198,136,456, 0,811,1062,1437,1810, chr1 14361 19759 uc009vit.3 0 - 14361 14361 0 9 468,69,152,159,198,510,147,99,847, 0,608,1434,2245,2496,2871,3553,3906,4551, chr1 14361 19759 uc009viu.3 0 - 14361 14361 0 10 468,69,152,159,198,510,147,102,54,847, 0,608,1434,2245,2496,2871,3553,3906,4139,4551, chr1 14361 19759 uc001aae.4 0 - 14361 14361 0 10 468,69,152,159,198,136,137,147,99,847, 0,608,1434,2245,2496,2871,3244,3553,3906,4551,
To get one of these files, do the following steps: (there is probably a similar method to use the ENSEMBL website, but I am not familiar with it so I am giving these instructions that I myself have followed many times and can help you out with if you are stuck)
TranscriptID-Symbol
file, so I recommend sticking with UCSC IDs for now)
hg19_ucsc_genes.bed
)
.bed
) in the filename
If you have an existing BED file, make sure the first line has track name=….
or else RSeQC gets mad.
test-data/hg19_id_symbol.txt
First few lines of example file:
uc001aaa.3 DDX11L1 uc010nxr.1 DDX11L1 uc010nxq.1 DDX11L9 uc001aal.1 OR4F5 uc001aaq.2 DQ597235 uc001aar.2 DQ599768 uc001aau.3 LOC100132287 uc021oeh.1 LOC100133331 uc009vjk.2 LOC100133331 uc021oei.1 LOC388312
You can get one of these files by:
TrascriptID-Symbol
file, so I recommend sticking with UCSC IDs for now)
hg19_kgXref.txt
)
.txt
, for example) in the filename
uc002gig.1
(column 1 in the kgXref file) and the gene symbols, e.g. TP53
(column 5 in the kgXref file), a known tumor suppressor gene.
But wait! You’re not done yet!
You need to remove the first line, the header of the file that explains what is in which column.
You could do this in Microsoft Excel, but the human file (for example) has 80,923 lines in it and will crash Excel. For organisms with fewer documented genes, using Excel to push columns around may be just fine.
The Linux/UNIX (lovingly called “*NIX”) commands to take columns is called “cut” (there is also “paste” to put together columns from different files but that’s out of the scope of what we’re doing here) We want columns 1 and 5 (the UCSC ID and the gene symbol – take a peek at the file by typing head hg19_kgXref.txt
on the command line in the directory – this will show the first 10 lines of the file), so we’ll say cut -f 1,5
where the -f
indicates the “fields” we want to cut
. Then we use sed 1d
to skip the first line (skipping more than one line has a slighly different command, check out my favorite Sed tutorial if you’re interested in learning more). And the <
indicates the input file, the |
indicates that the output of the previous command is treated as input to the next command, and the >
indicates the output file. Note that we created a new file and did not overwrite the old one. In general, it is best practices to create a new file rather than overwrite the old one. Also, if you try to make your input and output files the same, the commands may get confused and you could lose your original data. :(
$ cut -f 1,5 < hg19_kgXref.txt | sed 1d > hg19_id_symbol.txt
Or, if you want to create a chromosome-specific file like I did, use your .bed
file to search through your kgXref file:
$ cut -f 4 hg19_ucsc_genes_chr9.bed | grep --fixed-strings - hg19_kgXref.txt >hg19_kgXref_chr9.txt
Then do the same as above, but with your chr9 file:
$ cut -f 1,5 < hg19_kgXref_chr9.txt | sed 1d > hg19_id_symbol_chr9.txt
genomeCoverageBed
test-data/human.hg19.genome
chr1 249250621 chr2 243199373 chr3 198022430 chr4 191154276 chr5 180915260 chr6 171115067 chr7 159138663 chrX 155270560 chr8 146364022 chr9 141213431
Besides the example files, you can also use ones shipped with BEDTools. On my machine, these files are located in ~/packages/BEDTools/genomes
:
$ ls ~/packages/BEDTools-Version-2.16.2/genomes/ human.hg18.genome human.hg19.genome mouse.mm8.genome mouse.mm9.genome
As to how to create these files for non-mouse or human organisms, my suggestion (while rather unwieldy) is to:
$ cut -f 1,2 < platypus.ornAna1.genome | sed 1d > platypus.ornAna1.genome.fixed
test-data/karyotype/karyotype.human.hg19.txt
This file looks like:
chr - chr1 chr1 0 249250621 chr1 chr - chr2 chr2 0 243199373 chr2 chr - chr3 chr3 0 198022430 chr3 chr - chr4 chr4 0 191154276 chr4 chr - chr5 chr5 0 180915260 chr5 chr - chr6 chr6 0 171115067 chr6 chr - chr7 chr7 0 159138663 chr7 chr - chr8 chr8 0 146364022 chr8 chr - chr9 chr9 0 141213431 chr9
Karyotype file used by Circos, which specifies the chromosome lengths. The third column, the chromosome name, MUST use chr1
-type notation, and not the typical hs1
notation for Homo sapiens chromosome 1. This is because bedtools
and friends use chr1
notation, but I didn’t want to require the organism name and then lookup the conversion. Presumably, you would have samples from all the same organism since you are comparing gene expression and coverage across different treatments, so I felt this was a safe assumption. I also didn’t want to lock you into ONLY using human data, because there are plenty of interesting organisms out there.
test-data/hg19_gc_content_circos_chr9.txt
First few lines of example file:
chr9 10000 10999 58 chr9 11000 11999 58 chr9 12000 12999 57.9 chr9 13000 13999 57.9 chr9 14000 14999 58 chr9 15000 15999 60.7 chr9 16000 16999 56.9 chr9 17000 17999 61.3 chr9 18000 18999 60.7 chr9 19000 19999 57.7
This GC content file can be created by converting a .wig
(wiggle) format file that’s used for a genome browser into a circus format using:
../scripts/wig_to_circos.R hg19_gc1000Base.txt hg19_gc_content.circos
LNCaP_1
, LNCaP_2
, LNCaP_3
, LNCaP_4
, PrEC_1
, PrEC_2
, the pipeline will also create
LNCaP_bunch1of2
, containing LNCaP_1
and LNCaP_2
LNCaP_bunch2of2
, containing LNCaP_3
and LNCaP_4
PrEC_bunch1of2
, containing PrEC_1
PrEC_bunch2of2
, containing PrEC_2
To change the ordering of samples, change the conditions file, as that is the basis of the bunching order.
If you don’t want any bunches, omit this variable from the command line.
These are programs you must have installed before attempting to run the code. Note: This may seem like a lot, but if you are in biomedical research, it is likely that the servers at your institution already have most of these installed.
python setup.py install
Say: python2.7 setup.py install
http://circos.ca/software/download/circos
Aliased such that circos
will run the program On my machine, this is accomplished by adding this line to the file in /Users/olgabotvinnik/.bashrc
, or my ~/.bashrc
file: PATH=PATH : /usr/bin/circos/bin; exportPATH < /code > Whichmeansthatwhenyouruncommands, thecomputerwillknowtolookin < code > /usr/bin/circos/bin < /code > forpotentialexecutablefiles./usr/bin/circosIswhereIpersonallyinstalledCircos.OntheserverthatIuse, forexample, itisinstalledin : <code > /share/apps/circos − 0.60/bin < /code > Sothenmy < code > /.bashrc < /code > fileontheserverlookslike : <code > PATH=PATH:/share/apps/circos-0.60/bin ; export PATH
Before running, make sure you add these to your path: python2.7
, BEDTools
, SAMTools
, RSeQC
in python2.7
.
For example, for me, these are the locations of these packages on my computer. Add these lines, modified to the locations on your computer, below to your ~/.bashrc
file and then “source” it: source ~/.bashrc
. FYI: the colon character, “:” is the delimiter between places to look for binary executables files.
export PATH=/usr/local/bin/python2.7:$PATH export PATH=$PATH:/share/apps/samtools-0.1.18:/share/apps/BEDTools-Version-2.15.0/ export PYTHONPATH=/share/apps/RSeQC/usr/local/lib/python2.7/site-packages:$PYTHONPATH export PATH=/share/apps/RSeQC/usr/local/bin:$PATH
An example run looks like this (yes it is a very long command):
scripts/pipeline.sh test-results test-data/conditions_chr9.tab test-data/hg19_ucsc_genes.gtf test-data/hg19_ucsc_genes_chr9_dexseq.gtf test-data/hg19_ucsc_genes.bed test-data/hg19_id_symbol.txt test-data/human.hg19.genome test-data/karyotype/karyotype.human.hg19.txt test-data/hg19_gene_density_1e5bins.txt test-data/hg19_gc_content_circos_chr9.txt 2
Or, using the keywords we’ve discussed:
scripts/pipeline.sh [output directory] [metadata] [.gtf] [dexseq-processed .gtf] [.bed] [transcriptID-symbol] [genome] [karyotype] [gene density] [gc content] [number of bunches]