[galaxy-commits] [hg] galaxy 3676: Cufflinks tools update; added cuffdiff wrapper.

10 May 2010

details:   http://www.bx.psu.edu/hg/galaxy/rev/afbdedd0e758
changeset: 3676:afbdedd0e758
user:      jeremy goecks <jeremy.goecks@emory.edu>
date:      Wed Apr 21 11:42:50 2010 -0400
description:
Cufflinks tools update; added cuffdiff wrapper.

diffstat:

 tool_conf.xml.sample                  |    2 +
 tools/ngs_rna/cuffcompare_wrapper.xml |   14 +-
 tools/ngs_rna/cuffdiff_wrapper.py     |  129 ++++++++++++++++++++++++++++++++++
 tools/ngs_rna/cuffdiff_wrapper.xml    |  117 ++++++++++++++++++++++++++++++
 tools/ngs_rna/cufflinks_wrapper.py    |    2 +-
 tools/ngs_rna/cufflinks_wrapper.xml   |   22 +++--
 6 files changed, 270 insertions(+), 16 deletions(-)

diffs (367 lines):

diff -r d6fddb034db7 -r afbdedd0e758 tool_conf.xml.sample

--- a/tool_conf.xml.sample	Wed Apr 21 11:35:21 2010 -0400
+++ b/tool_conf.xml.sample	Wed Apr 21 11:42:50 2010 -0400
@@ -228,6 +228,8 @@
   <section name="NGS: Expression Analysis" id="ngs-rna-tools">
    <tool file="ngs_rna/tophat_wrapper.xml" />
    <tool file="ngs_rna/cufflinks_wrapper.xml" />
+   <tool file="ngs_rna/cuffcompare_wrapper.xml" />
+   <tool file="ngs_rna/cuffdiff_wrapper.xml" />
   </section>
   <section name="NGS: SAM Tools" id="samtools">
    <tool file="samtools/sam_bitwise_flag_filter.xml" />
diff -r d6fddb034db7 -r afbdedd0e758 tools/ngs_rna/cuffcompare_wrapper.xml
--- a/tools/ngs_rna/cuffcompare_wrapper.xml	Wed Apr 21 11:35:21 2010 -0400
+++ b/tools/ngs_rna/cuffcompare_wrapper.xml	Wed Apr 21 11:42:50 2010 -0400
@@ -15,15 +15,15 @@
             $input2
     </command>
     <inputs>
-        <param format="gtf" name="input1" type="data" label="SAM file of aligned RNA-Seq reads" help=""/>
-        <param format="gtf" name="input2" type="data" label="SAM file of aligned RNA-Seq reads" help=""/>
+        <param format="gtf" name="input1" type="data" label="GTF file produced by Cufflinks" help=""/>
+        <param format="gtf" name="input2" type="data" label="GTF file produced by Cufflinks" help=""/>
         <conditional name="annotation">
-            <param name="use_ref_annotation" type="select" label="Use Reference Annotation?">
+            <param name="use_ref_annotation" type="select" label="Use Reference Annotation">
                 <option value="No">No</option>
                 <option value="Yes">Yes</option>
             </param>
             <when value="Yes">
-                <param format="gtf" name="reference_annotation" type="data" label="Reference Annotation" help=""/>    
+                <param format="gtf" name="reference_annotation" type="data" label="Reference Annotation" help="Make sure your annotation file is in GTF format and that Galaxy knows that your file is GTF--not GFF."/>    
                 <param name="ignore_nonoverlapping_reference" type="boolean" label="Ignore reference transcripts that are not overlapped by any transcript in input files"/>
             </when>
             <when value="No">
@@ -32,9 +32,9 @@
     </inputs>
 
     <outputs>
-        <data format="gtf" name="transcripts_combined" />
-        <data format="tracking" name="transcripts_tracking" />
-        <data format="gtf" name="transcripts_accuracy" />
+        <data format="gtf" name="transcripts_combined" label="Cuffcompare on data ${input1.hid} and data ${input2.hid}: combined transcripts"/>
+        <data format="tracking" name="transcripts_tracking" label="Cuffcompare on data ${input1.hid} and data ${input2.hid}: transcript tracking"/>
+        <data format="gtf" name="transcripts_accuracy" label="Cuffcompare on data ${input1.hid} and data ${input2.hid}: transcript accuracy"/>
     </outputs>
 
     <tests>
diff -r d6fddb034db7 -r afbdedd0e758 tools/ngs_rna/cuffdiff_wrapper.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/ngs_rna/cuffdiff_wrapper.py	Wed Apr 21 11:42:50 2010 -0400
@@ -0,0 +1,129 @@
+#!/usr/bin/env python
+
+import optparse, os, shutil, subprocess, sys, tempfile
+
+def stop_err( msg ):
+    sys.stderr.write( "%s\n" % msg )
+    sys.exit( 1 )
+
+def __main__():
+    #Parse Command Line
+    parser = optparse.OptionParser()
+    
+    # Cuffdiff options.
+    parser.add_option( '-s', '--inner-dist-std-dev', dest='inner_dist_std_dev', help='The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.' )
+    parser.add_option( '-p', '--num-threads', dest='num_threads', help='Use this many threads to align reads. The default is 1.' )
+    parser.add_option( '-m', '--inner-mean-dist', dest='inner_mean_dist', help='This is the expected (mean) inner distance between mate pairs. \
+                                                                                For example, for paired-end runs with fragments selected at 300bp, \
+                                                                                where each end is 50bp, you should set -m to be 200. The default is 45bp.')
+    parser.add_option( '-Q', '--min-mapqual', dest='min_mapqual', help='Instructs Cufflinks to ignore alignments with a SAM mapping quality lower than this number. The default is 0.' )
+    parser.add_option( '-c', '--min-alignment-count', dest='min_alignment_count', help='The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus\' observed changes don\'t contribute to correction for multiple testing. The default is 1,000 fragment alignments (up to 2,000 paired reads).' )
+    parser.add_option( '--FDR', dest='FDR', help='The allowed false discovery rate. The default is 0.05.' )
+
+    # Advanced Options:	
+    parser.add_option( '--num-importance-samples', dest='num_importance_samples', help='Sets the number of importance samples generated for each locus during abundance estimation. Default: 1000' )
+    parser.add_option( '--max-mle-iterations', dest='max_mle_iterations', help='Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000' )
+    
+    # Wrapper / Galaxy options.
+    parser.add_option( '-A', '--inputA', dest='inputA', help='A transcript GTF file produced by cufflinks, cuffcompare, or other source.')
+    parser.add_option( '-1', '--input1', dest='input1', help='File of RNA-Seq read alignments in the SAM format. SAM is a standard short-read alignment format that allows aligners to attach custom tags to individual alignments, and Cufflinks requires that the alignments you supply have some of these tags. Please see Input formats for more details.' )
+    parser.add_option( '-2', '--input2', dest='input2', help='File of RNA-Seq read alignments in the SAM format. SAM is a standard short-read alignment format that allows aligners to attach custom tags to individual alignments, and Cufflinks requires that the alignments you supply have some of these tags. Please see Input formats for more details.' )
+
+    parser.add_option( "--isoforms_fpkm_tracking_output", dest="isoforms_fpkm_tracking_output" )
+    parser.add_option( "--genes_fpkm_tracking_output", dest="genes_fpkm_tracking_output" )
+    parser.add_option( "--cds_fpkm_tracking_output", dest="cds_fpkm_tracking_output" )
+    parser.add_option( "--tss_groups_fpkm_tracking_output", dest="tss_groups_fpkm_tracking_output" )
+    parser.add_option( "--isoforms_exp_output", dest="isoforms_exp_output" )
+    parser.add_option( "--genes_exp_output", dest="genes_exp_output" )
+    parser.add_option( "--tss_groups_exp_output", dest="tss_groups_exp_output" )
+    parser.add_option( "--cds_exp_fpkm_tracking_output", dest="cds_exp_fpkm_tracking_output" )
+    parser.add_option( "--splicing_diff_output", dest="splicing_diff_output" )
+    parser.add_option( "--cds_diff_output", dest="cds_diff_output" )
+    parser.add_option( "--promoters_diff_output", dest="promoters_diff_output" )
+    
+    (options, args) = parser.parse_args()
+    
+    # Make temp directory for output.
+    tmp_output_dir = tempfile.mkdtemp()
+    
+    # Build command.
+    
+    # Base.
+    cmd = "cuffdiff"
+    
+    # Add options.
+    if options.inner_dist_std_dev:
+        cmd += ( " -s %i" % int ( options.inner_dist_std_dev ) )
+    if options.num_threads:
+        cmd += ( " -p %i" % int ( options.num_threads ) )
+    if options.inner_mean_dist:
+        cmd += ( " -m %i" % int ( options.inner_mean_dist ) )
+    if options.min_mapqual:
+        cmd += ( " -Q %i" % int ( options.min_mapqual ) )
+    if options.min_alignment_count:
+        cmd += ( " -c %i" % int ( options.min_alignment_count ) )
+    if options.FDR:
+        cmd += ( " --FDR %f" % float( options.FDR ) )
+    if options.num_importance_samples:
+        cmd += ( " --num-importance-samples %i" % int ( options.num_importance_samples ) )
+    if options.max_mle_iterations:
+        cmd += ( " --max-mle-iterations %i" % int ( options.max_mle_iterations ) )
+        
+    # Add inputs.
+    cmd += " " + options.inputA + " " + options.input1 + " " + options.input2
+    print cmd
+
+    # Run command.
+    try:
+        tmp_name = tempfile.NamedTemporaryFile( dir=tmp_output_dir ).name
+        tmp_stderr = open( tmp_name, 'wb' )
+        proc = subprocess.Popen( args=cmd, shell=True, cwd=tmp_output_dir, stderr=tmp_stderr.fileno() )
+        returncode = proc.wait()
+        tmp_stderr.close()
+        
+        # Get stderr, allowing for case where it's very large.
+        tmp_stderr = open( tmp_name, 'rb' )
+        stderr = ''
+        buffsize = 1048576
+        try:
+            while True:
+                stderr += tmp_stderr.read( buffsize )
+                if not stderr or len( stderr ) % buffsize != 0:
+                    break
+        except OverflowError:
+            pass
+        tmp_stderr.close()
+        
+        # Error checking.
+        if returncode != 0:
+            raise Exception, stderr
+            
+        # check that there are results in the output file
+        if len( open( tmp_output_dir + "/isoforms.fpkm_tracking", 'rb' ).read().strip() ) == 0:
+            raise Exception, 'The main output file is empty, there may be an error with your input file or settings.'
+    except Exception, e:
+        stop_err( 'Error running cuffdiff. ' + str( e ) )
+
+        
+    # Copy output files from tmp directory to specified files.
+    try:
+        try:
+            shutil.copyfile( tmp_output_dir + "/isoforms.fpkm_tracking", options.isoforms_fpkm_tracking_output )
+            shutil.copyfile( tmp_output_dir + "/genes.fpkm_tracking", options.genes_fpkm_tracking_output )
+            shutil.copyfile( tmp_output_dir + "/cds.fpkm_tracking", options.cds_fpkm_tracking_output )
+            shutil.copyfile( tmp_output_dir + "/tss_groups.fpkm_tracking", options.tss_groups_fpkm_tracking_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_isoform_exp.diff", options.isoforms_exp_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_gene_exp.diff", options.genes_exp_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_tss_group_exp.diff", options.tss_groups_exp_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_splicing.diff", options.splicing_diff_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_cds.diff", options.cds_diff_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_cds_exp.diff", options.cds_exp_fpkm_tracking_output )
+            shutil.copyfile( tmp_output_dir + "/0_1_promoters.diff", options.promoters_diff_output )    
+        except Exception, e:
+            stop_err( 'Error in cuffdiff:\n' + str( e ) ) 
+    finally:
+        # Clean up temp dirs
+        if os.path.exists( tmp_output_dir ):
+            shutil.rmtree( tmp_output_dir )
+
+if __name__=="__main__": __main__()
\ No newline at end of file
diff -r d6fddb034db7 -r afbdedd0e758 tools/ngs_rna/cuffdiff_wrapper.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/ngs_rna/cuffdiff_wrapper.xml	Wed Apr 21 11:42:50 2010 -0400
@@ -0,0 +1,117 @@
+<tool id="cuffdiff" name="Cuffdiff" version="0.8.2">
+    <description>find significant changes in transcript expression, splicing, and promoter use</description>
+    <command interpreter="python">
+        cuffdiff_wrapper.py
+            --FDR=$fdr
+            --num-threads="4"
+            --min-mapqual=$min_mapqual
+            --min-alignment-count=$min_alignment_count
+
+            --isoforms_fpkm_tracking_output=$isoforms_fpkm_tracking
+            --genes_fpkm_tracking_output=$genes_fpkm_tracking
+            --cds_fpkm_tracking_output=$cds_fpkm_tracking
+            --tss_groups_fpkm_tracking_output=$tss_groups_fpkm_tracking
+            --isoforms_exp_output=$isoforms_exp
+            --genes_exp_output=$genes_exp
+            --tss_groups_exp_output=$tss_groups_exp
+            --cds_exp_fpkm_tracking_output=$cds_exp_fpkm_tracking
+            --splicing_diff_output=$splicing_diff
+            --cds_diff_output=$cds_diff
+            --promoters_diff_output=$promoters_diff
+            
+            --inputA=$gtf_input
+            --input1=$aligned_reads1
+            --input2=$aligned_reads2
+    </command>
+    <inputs>
+        <param format="gtf" name="gtf_input" type="data" label="Transcripts" help="A transcript GTF file produced by cufflinks, cuffcompare, or other source."/>
+        <param format="sam" name="aligned_reads1" type="data" label="SAM file of aligned RNA-Seq reads" help=""/>
+        <param format="sam" name="aligned_reads2" type="data" label="SAM file of aligned RNA-Seq reads" help=""/>
+        <param name="fdr" type="float" value="0.05" label="False Discovery Rate" help="The allowed false discovery rate."/>
+        <param name="min_mapqual" type="integer" value="0" label="Min SAM Mapping Quality" help="Instructs Cufflinks to ignore alignments with a SAM mapping quality lower than this number."/>
+        <param name="min_alignment_count" type="integer" value="0" label="Min Alignment Count" help="The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples."/>
+        <conditional name="singlePaired">
+            <param name="sPaired" type="select" label="Is this library mate-paired?">
+                <option value="single">Single-end</option>
+                <option value="paired">Paired-end</option>
+            </param>
+            <when value="single"></when>
+            <when value="paired">
+                <param name="mean_inner_distance" type="integer" value="20" label="Mean Inner Distance between Mate Pairs"/>
+                <param name="inner_distance_std_dev" type="integer" value="20" label="Standard Deviation for Inner Distance between Mate Pairs"/>
+            </when>
+        </conditional>
+    </inputs>
+
+    <outputs>
+        <data format="tabular" name="isoforms_exp" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: isoform expression"/>
+        <data format="tabular" name="genes_exp" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: gene expression"/>
+        <data format="tabular" name="tss_groups_exp" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: TSS groups expression"/>
+        <data format="tabular" name="cds_exp_fpkm_tracking" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: CDS Expression FPKM Tracking"/>
+        <data format="tabular" name="splicing_diff" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: splicing diff"/>
+        <data format="tabular" name="cds_diff" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: CDS diff"/>
+        <data format="tabular" name="promoters_diff" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: promoters diff"/>
+        <data format="tabular" name="tss_groups_fpkm_tracking" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: TSS groups FPKM tracking" />
+        <data format="tabular" name="cds_fpkm_tracking" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: CDS FPKM tracking"/>
+        <data format="tabular" name="genes_fpkm_tracking" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: gene FPKM tracking"/>
+        <data format="tabular" name="isoforms_fpkm_tracking" label="Cuffdiff on data ${gtf_input.hid}, data ${aligned_reads1.hid}, and data ${aligned_reads2.hid}: isoform FPKM tracking"/>        
+    </outputs>
+
+    <tests>
+        <test>
+        </test>
+    </tests>
+
+    <help>
+**Cuffdiff Overview**
+
+Cuffdiff is part of Cufflinks_. Cuffdiff finds significant changes in transcript expression, splicing, and promoter use. Please cite: Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L. Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. (manuscript in press)
+
+.. _Cufflinks: http://cufflinks.cbcb.umd.edu/
+        
+------
+
+**Know what you are doing**
+
+.. class:: warningmark
+
+There is no such thing (yet) as an automated gearshift in expression analysis. It is all like stick-shift driving in San Francisco. In other words, running this tool with default parameters will probably not give you meaningful results. A way to deal with this is to **understand** the parameters by carefully reading the `documentation`__ and experimenting. Fortunately, Galaxy makes experimenting easy.
+
+.. __: http://cufflinks.cbcb.umd.edu/manual.html#cuffdiff
+
+------
+
+**Input format**
+
+Cuffdiff takes a Cufflinks or Cuffcompare GTF file as input, along with two SAM files containing the fragment alignments for the two samples being compared.
+
+.. ___: http://www.todo.org 
+
+------
+
+**Outputs**
+
+TODO
+    
+-------
+
+**Settings**
+
+All of the options have a default value. You can change any of them. Most of the options in Cuffdiff have been implemented here.
+
+------
+
+**Cuffdiff parameter list**
+
+This is a list of implemented Cuffdiff options::
+
+  -m INT                         This is the expected (mean) inner distance between mate pairs. For example, for paired-end runs with fragments selected at 300bp, where each end is 50bp, you should set -m to be 200. The default is 45bp.
+  -s INT                         The standard deviation for the distribution on inner distances between mate pairs. The default is 20bp.
+  -Q INT                         Instructs Cufflinks to ignore alignments with a SAM mapping quality lower than this number. The default is 0.
+  -c INT                         The minimum number of alignments in a locus needed to conduct significance testing on changes in that locus observed between samples. If no testing is performed, changes in the locus are deemed not significant, and the locus' observed changes don't contribute to correction for multiple testing. The default is 1,000 fragment alignments (up to 2,000 paired reads).
+  --FDR FLOAT                    The allowed false discovery rate. The default is 0.05.
+  --num-importance-samples INT   Sets the number of importance samples generated for each locus during abundance estimation. Default: 1000
+  --max-mle-iterations INT       Sets the number of iterations allowed during maximum likelihood estimation of abundances. Default: 5000
+  
+    </help>
+</tool>
diff -r d6fddb034db7 -r afbdedd0e758 tools/ngs_rna/cufflinks_wrapper.py
--- a/tools/ngs_rna/cufflinks_wrapper.py	Wed Apr 21 11:35:21 2010 -0400
+++ b/tools/ngs_rna/cufflinks_wrapper.py	Wed Apr 21 11:42:50 2010 -0400
@@ -56,7 +56,7 @@
     if options.min_mapqual:
         cmd += ( " -Q %i" % int ( options.min_mapqual ) )
     if options.GTF:
-        cmd += ( " -G %i" % options.GTF )
+        cmd += ( " -G %s" % options.GTF )
     if options.num_importance_samples:
         cmd += ( " --num-importance-samples %i" % int ( options.num_importance_samples ) )
     if options.max_mle_iterations:
diff -r d6fddb034db7 -r afbdedd0e758 tools/ngs_rna/cufflinks_wrapper.xml
--- a/tools/ngs_rna/cufflinks_wrapper.xml	Wed Apr 21 11:35:21 2010 -0400
+++ b/tools/ngs_rna/cufflinks_wrapper.xml	Wed Apr 21 11:42:50 2010 -0400
@@ -1,5 +1,5 @@
 <tool id="cufflinks" name="Cufflinks" version="0.8.2">
-    <description>transcript assembly, differential expression, and differential regulation for RNA-Seq</description>
+    <description>transcript assembly and FPKM (RPKM) estimates for RNA-Seq data</description>
     <command interpreter="python">
         cufflinks_wrapper.py 
             --input=$input
@@ -32,7 +32,7 @@
             </param>
             <when value="No"></when>
             <when value="Yes">
-                <param format="gtf" name="reference_annotation_file" type="data" label="Reference Annotation" help=""/>
+                <param format="gtf" name="reference_annotation_file" type="data" label="Reference Annotation" help="Make sure your annotation file is in GTF format and that Galaxy knows that your file is GTF--not GFF."/>
             </when>
         </conditional>
         <conditional name="singlePaired">
@@ -50,9 +50,9 @@
     </inputs>
 
     <outputs>
-        <data format="expr" name="genes_expression" />
-        <data format="expr" name="transcripts_expression" />
-        <data format="gtf" name="assembled_isoforms" />
+        <data format="expr" name="genes_expression" label="Cufflinks on data ${input.hid}: gene expression"/>
+        <data format="expr" name="transcripts_expression" label="Cufflinks on data ${input.hid}: transcript expression"/>
+        <data format="gtf" name="assembled_isoforms" label="Cufflinks on data ${input.hid}: assembled transcripts"/>
     </outputs>
 
     <tests>
@@ -60,10 +60,16 @@
             <param name="sPaired" value="single"/>
             <param name="input" value="cufflinks_in.sam"/>
             <param name="mean_inner_distance" value="20"/>
+            <param name="max_intron_len" value="300000"/>
+            <param name="min_isoform_fraction" value="0.05"/>
+            <param name="pre_mrna_fraction" value="0.05"/>
+            <param name="min_map_quality" value="0"/>
+            <param name="use_ref" value="No"/>
             <output name="assembled_isoforms" file="cufflinks_out1.gtf"/>
-            <!-- Can't test these right now because .expr files aren't recognized.
-            <output name="genes_expression" file="cufflinks_out3.expr"/>
-            <output name="transcripts_expression" file="cufflinks_out2.expr"/>
+            <!--
+            Can't test these right now b/c .expr files aren't recognized. Need to add them?
+            <output name="genes_expression" format="tabular" file="cufflinks_out3.expr"/>
+            <output name="transcripts_expression" format="tabular" file="cufflinks_out2.expr"/>
             -->
         </test>
     </tests>
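
Note on command construction in cuffdiff_wrapper.py: the wrapper builds the cuffdiff command by string concatenation and runs it with shell=True, so a dataset path containing spaces or shell metacharacters could be misparsed. A minimal sketch of one alternative follows, assuming the same flags the wrapper already emits (-p, -Q, -c, --FDR). The function names build_cuffdiff_args and run_cuffdiff are hypothetical and are not part of this changeset.

```python
import subprocess
import sys


def build_cuffdiff_args(gtf, sam1, sam2, fdr=None, num_threads=None,
                        min_mapqual=None, min_alignment_count=None):
    """Return the cuffdiff command as an argument list (hypothetical helper)."""
    args = ["cuffdiff"]
    # Integer-valued flags, mirroring the wrapper's -p, -Q, and -c handling.
    int_flags = (("-p", num_threads),
                 ("-Q", min_mapqual),
                 ("-c", min_alignment_count))
    for flag, value in int_flags:
        if value is not None:
            args += [flag, str(int(value))]
    if fdr is not None:
        args += ["--FDR", str(float(fdr))]
    # Positional inputs: the transcript GTF first, then the two SAM files.
    args += [gtf, sam1, sam2]
    return args


def run_cuffdiff(args, cwd):
    """Run cuffdiff without a shell; exit non-zero and report stderr on failure."""
    proc = subprocess.Popen(args, cwd=cwd, stderr=subprocess.PIPE)
    _, stderr = proc.communicate()
    if proc.returncode != 0:
        sys.stderr.write("Error running cuffdiff. %s\n"
                         % stderr.decode("utf-8", "replace"))
        sys.exit(1)
```

Passing a list to subprocess.Popen hands each argument to the program unchanged, so no quoting of file paths is needed. Whether this suits Galaxy's job runner is not something this changeset establishes; treat it as an illustration only.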

    


Nate Coraor