2 new changesets in galaxy-central:

http://bitbucket.org/galaxy/galaxy-central/changeset/d79a1ad33a03/
changeset: r5510:d79a1ad33a03
user:      fubar
date:      2011-05-07 17:56:12
summary:   Lots of small tweaks and some new Picard metrics - Hs and library complexity. Includes an option to use the input data's genome as the reference wherever a reference genome is needed - this helps prevent new-user head scratching, and really should be added to (e.g.) the bwa and other wrappers too.
affected #: 9 files (15.9 KB)

--- a/test-data/picard_output_AsMetrics_indexed_hg18_sorted_pair.html Fri May 06 16:02:25 2011 -0400
+++ b/test-data/picard_output_AsMetrics_indexed_hg18_sorted_pair.html Sat May 07 11:56:12 2011 -0400
@@ -12,10 +12,10 @@
 </head><body><div class="document">
-Galaxy tool wrapper run picard_wrapper at 27/04/2011 11:46:40</b><br/><b>Picard log</b><hr/>
-<pre>## executing java -Xmx2g -jar /udd/rerla/galaxy-central/tool-data/shared/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt R=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta TMP_DIR=/tmp INPUT=/tmp/6728746.1.all.q/tmpkzvo-A/database/files/000/dataset_91.dat returned status 1 and stderr:
-[Wed Apr 27 11:46:37 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/tmp/6728746.1.all.q/tmpkzvo-A/database/files/000/dataset_91.dat OUTPUT=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=/tmp VALIDATION_STRINGENCY=LENIENT IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
-[Wed Apr 27 11:46:40 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics done.
+Galaxy tool wrapper run picard_wrapper at 06/05/2011 22:54:31</b><br/><b>Picard log</b><hr/>
+<pre>## executing java -Xmx2g -jar /udd/rerla/rgalaxy/tool-data/shared/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt R=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta TMP_DIR=/tmp INPUT=/tmp/6728809.1.all.q/tmps-Xqx9/database/files/000/dataset_91.dat returned status 1 and stderr:
+[Fri May 06 22:54:22 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/tmp/6728809.1.all.q/tmps-Xqx9/database/files/000/dataset_91.dat OUTPUT=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=/tmp VALIDATION_STRINGENCY=LENIENT IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
+[Fri May 06 22:54:30 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics done.
 Runtime.totalMemory()=910163968
 Exception in thread "main" net.sf.picard.PicardException: Requesting earlier reference sequence: 0 < 1
 	at net.sf.picard.reference.ReferenceSequenceFileWalker.get(ReferenceSequenceFileWalker.java:78)
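The failing test log above captures the wrapper's basic execution pattern: assemble a KEY=value command line for a Picard jar, shell out, and report the exit status and stderr back into the HTML log. A minimal sketch of that pattern - illustrative only, not the committed runCL/runPic implementation, and all paths are placeholders::

    import subprocess, tempfile

    def run_picard(jar, key_values, maxheap='2g'):
        """Run a Picard jar; return (exit status, stderr text)."""
        cl = ['java', '-Xmx%s' % maxheap, '-jar', jar]
        cl += ['%s=%s' % kv for kv in key_values]
        errlog = tempfile.TemporaryFile()
        status = subprocess.call(cl, stderr=errlog)
        errlog.seek(0)
        return status, errlog.read()

    # e.g. (hypothetical paths):
    # status, err = run_picard('CollectAlignmentSummaryMetrics.jar',
    #                          [('VALIDATION_STRINGENCY', 'LENIENT'),
    #                           ('INPUT', 'dataset_91.dat')])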
--- a/tool_conf.xml.sample Fri May 06 16:02:25 2011 -0400
+++ b/tool_conf.xml.sample Sat May 07 11:56:12 2011 -0400
@@ -246,7 +246,9 @@
     <tool file="picard/picard_BamIndexStats.xml" />
     <tool file="picard/rgPicardASMetrics.xml" />
     <tool file="picard/rgPicardGCBiasMetrics.xml" />
+    <tool file="picard/rgPicardLibComplexity.xml" />
     <tool file="picard/rgPicardInsertSize.xml" />
+    <tool file="picard/rgPicardHsMetrics.xml" />
     <label text="bam/sam Cleaning" id="picard-tools" />
     <tool file="picard/picard_AddOrReplaceReadGroups.xml" />
     <tool file="picard/picard_ReorderSam.xml" />

--- a/tools/picard/picard_wrapper.py Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/picard_wrapper.py Sat May 07 11:56:12 2011 -0400
@@ -8,7 +8,7 @@
 see http://www.gnu.org/copyleft/lesser.html
 """
-import optparse, os, sys, subprocess, tempfile, shutil, time
+import optparse, os, sys, subprocess, tempfile, shutil, time, pysam
 
 galhtmlprefix = """<?xml version="1.0" encoding="utf-8" ?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@@ -52,11 +52,10 @@
         if self.opts.outdir == None:
             self.opts.outdir = self.opts.tmpdir # fixmate has no html file eg
         assert self.opts.outdir <> None,'## PicardBase needs a temp directory if no output directory passed in'
-        self.maxloglines = 100
-        self.picname = self.nameMunge(opts.jar)
+        self.picname = self.baseName(opts.jar)
         if self.picname.startswith('picard'):
-            self.picname = opts.picard_cmd # special case for some tools like replaceheader?
-        self.progname = self.nameMunge(arg0)
+            self.picname = opts.picard_cmd # special case for some tools like replaceheader?
+        self.progname = self.baseName(arg0)
         self.version = '0.002'
         self.delme = [] # list of files to destroy
         self.title = opts.title
@@ -72,7 +71,7 @@
         self.log_filename = '%s.log' % self.picname
         self.metricsOut = os.path.join(opts.outdir,'%s.metrics.txt' % self.picname)
 
-    def nameMunge(self,name=None):
+    def baseName(self,name=None):
         return os.path.splitext(os.path.basename(name))[0]
 
     def readLarge(self,fname=None):
@@ -172,7 +171,7 @@
         except:
             pass
 
-    def prettyPicout(self,transpose=True,maxrows=100):
+    def prettyPicout(self,transpose,maxrows):
         """organize picard outpouts into a report html page
         """
         res = []
@@ -181,7 +180,7 @@
         except:
             r = []
         if len(r) > 0:
-            res.append('<b>Picard on line resources</b><ul>\n')
+            res.append('<b>Picard on line resources - maxrows=%d</b><ul>\n' % maxrows)
             res.append('<li><a href="http://picard.sourceforge.net/index.shtml">Click here for Picard Documentation</a></li>\n')
             res.append('<li><a href="http://picard.sourceforge.net/picard-metric-definitions.shtml">Click here for Picard Metrics definitions</a></li></ul><hr/>\n')
         if transpose:
@@ -221,19 +220,18 @@
                 tdat = ['<tr class="d%d"><td>%s</td></tr>\n' % ((i+len(heads)) % 2,x) for i,x in enumerate(dat) if i < maxrows]
                 missing = len(tdat) - maxrows
                 if missing > 0:
-                    tdat.append('<tr><td>...WARNING: %d rows deleted..see raw file %s for entire output</td></tr>' % (missing,os.path.basename(picout)))
+                    tdat.append('<tr><td>...WARNING: %d rows deleted..see raw file %s for entire output</td></tr>' % (missing,os.path.basename(self.metricsOut)))
             res += tdat
             dat = []
         res.append('</table>\n')
         return res
 
-    def fixPicardOutputs(self,transpose=True,maxloglines=100):
+    def fixPicardOutputs(self,transpose,maxloglines):
         """
         picard produces long hard to read tab header files
         make them available but present them transposed for readability
         """
         self.cleanup() # remove temp files stored in delme
-        self.maxloglines = maxloglines
         rstyle="""<style type="text/css">
         tr.d0 td {background-color: oldlace; color: black;}
         tr.d1 td {background-color: aliceblue; color: black;}
@@ -256,7 +254,7 @@
             fn = os.path.split(f)[-1]
             res.append('<tr><td><a href="%s">%s</a></td></tr>\n' % (fn,fn))
         res.append('</table><p/>\n')
-        pres = self.prettyPicout(transpose=transpose,maxrows=self.maxloglines)
+        pres = self.prettyPicout(transpose,maxloglines)
         if len(pres) > 0:
             res += pres
         l = open(self.log_filename,'r').readlines()
@@ -264,9 +262,9 @@
         if llen > 0:
             res.append('<b>Picard log</b><hr/>\n')
             rlog = ['<pre>',]
-            if llen > self.maxloglines:
-                rlog += l[:self.maxloglines]
-                rlog.append('\n<b>## WARNING - %d log lines truncated - %s contains entire output' % (llen - self.maxloglines,self.log_filename))
+            if llen > maxloglines:
+                rlog += l[:maxloglines]
+                rlog.append('\n<b>## WARNING - %d log lines truncated - %s contains entire output' % (llen - maxloglines,self.log_filename))
             else:
                 rlog += l
             rlog.append('</pre>')
@@ -281,12 +279,82 @@
             outf.write('\n')
         outf.close()
 
+    def makePicInterval(self,inbed=None,outf=None):
+        """
+        picard wants bait and target files to have the same header length as the incoming bam/sam
+        a meaningful (ie accurate) representation will fail because of this - so this hack
+        it would be far better to be able to supply the original bed untouched
+        """
+        assert inbed <> None
+        bed = open(inbed,'r').readlines()
+        thead = os.path.join(self.opts.outdir,'tempSamHead.txt')
+        if self.opts.datatype == 'sam':
+            cl = ['samtools view -H -S',self.opts.input,'>',thead]
+        else:
+            cl = ['samtools view -H',self.opts.input,'>',thead]
+        self.runCL(cl=cl,output_dir=self.opts.outdir)
+        head = open(thead,'r').readlines()
+        s = '## got %d rows of header\n' % (len(head))
+        lf = open(self.log_filename,'a')
+        lf.write(s)
+        lf.close()
+        o = open(outf,'w')
+        o.write(''.join(head))
+        o.write(''.join(bed))
+        o.close()
+        return outf
+
+    def cleanSam(self, insam=None, newsam=None, picardErrors=[],outformat=None):
+        """
+        interesting problem - if paired, must remove mate pair of errors too or we have a new set of errors after cleaning - missing mate pairs!
+        Do the work of removing all the error sequences
+        pysam is cool
+        infile = pysam.Samfile( "-", "r" )
+        outfile = pysam.Samfile( "-", "w", template = infile )
+        for s in infile: outfile.write(s)
+
+        errors from ValidateSameFile.jar look like
+        WARNING: Record 32, Read name SRR006041.1202260, NM tag (nucleotide differences) is missing
+        ERROR: Record 33, Read name SRR006041.1042721, Empty sequence dictionary.
+        ERROR: Record 33, Read name SRR006041.1042721, RG ID on SAMRecord not found in header: SRR006041
+
+        """
+        assert os.path.isfile(insam), 'rgPicardValidate cleansam needs an input sam file - cannot find %s' % insam
+        assert newsam <> None, 'rgPicardValidate cleansam needs an output new sam file path'
+        removeNames = [x.split(',')[1].replace(' Read name ','') for x in picardErrors if len(x.split(',')) > 2]
+        remDict = dict(zip(removeNames,range(len(removeNames))))
+        infile = pysam.Samfile(insam,'rb')
+        info = 'found %d error sequences in picardErrors, %d unique' % (len(removeNames),len(remDict))
+        if len(removeNames) > 0:
+            outfile = pysam.Samfile(newsam,'wb',template=infile) # template must be an open file
+            i = 0
+            j = 0
+            for row in infile:
+                dropme = remDict.get(row.qname,None) # keep if None
+                if not dropme:
+                    outfile.write(row)
+                    j += 1
+                else: # discard
+                    i += 1
+            info = '%s\n%s' % (info, 'Discarded %d lines writing %d to %s from %s' % (i,j,newsam,insam))
+            outfile.close()
+            infile.close()
+        else: # we really want a nullop or a simple pointer copy
+            infile.close()
+            if newsam:
+                shutil.copy(insam,newsam)
+        lf = open(self.log_filename,'a')
+        lf.write(info)
+        lf.write('\n')
+        lf.close()
+
 
 def __main__():
     doFix = False # tools returning htmlfile don't need this
     doTranspose = True # default
+    maxloglines = 100 # default
     #Parse Command Line
     op = optparse.OptionParser()
     # All tools
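The makePicInterval hack added above boils down to: capture the @-header of the input bam/sam with samtools, then write it in front of the untouched BED rows so Picard will accept the result as an interval list. A standalone sketch of the same idea, assuming samtools is on PATH - file names here are illustrative::

    import subprocess

    def bed_to_picard_interval(bam_path, bed_path, out_path):
        # capture the SAM @HD/@SQ header lines Picard expects to see
        p = subprocess.Popen(['samtools', 'view', '-H', bam_path],
                             stdout=subprocess.PIPE)
        header = p.communicate()[0]
        out = open(out_path, 'wb')
        out.write(header)                        # header first,
        out.write(open(bed_path, 'rb').read())   # then the interval rows
        out.close()
        return out_path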
@@ -323,7 +391,7 @@
     # CollectAlignmentSummaryMetrics
     op.add_option('', '--maxinsert', default="20")
     op.add_option('', '--adaptors', action='append', type="string")
-    # FixMateInformation
+    # FixMateInformation and validate
     op.add_option('','--newformat', default='bam')
     # CollectGcBiasMetrics
     op.add_option('', '--windowsize', default='100')
@@ -349,16 +417,25 @@
     op.add_option('','--minid', default="5")
     op.add_option('','--maxdiff', default="0.03")
     op.add_option('','--minmeanq', default="20")
-
+    #hsmetrics
+    op.add_option('','--baitbed', default=None)
+    op.add_option('','--targetbed', default=None)
+    #validate
+    op.add_option('','--ignoreflags', action='append', type="string")
+    op.add_option('','--maxerrors', default=None)
+    op.add_option('','--datatype', default=None)
+    op.add_option('','--bamout', default=None)
+    op.add_option('','--samout', default=None)
     opts, args = op.parse_args()
     opts.sortme = opts.assumesorted == 'false'
     assert opts.input <> None
     # need to add
-    # output version # of tool
+    # instance that does all the work
     pic = PicardBase(opts,sys.argv[0])
 
     tmp_dir = opts.outdir
+    haveTempout = False # we use this where sam output is an option
 
     # set ref and dict files to use (create if necessary)
     ref_file_name = opts.ref
@@ -373,7 +450,7 @@
             os.symlink( opts.ref_file, ref_file_name )
             cl = ['REFERENCE=%s' % ref_file_name]
             cl.append('OUTPUT=%s' % dict_file_name)
-            cl.append('URI=%s' % os.path.basename( ref_file ))
+            cl.append('URI=%s' % os.path.basename( opts.ref_file ))
             cl.append('TRUNCATE_NAMES_AT_WHITESPACE=%s' % opts.trunc_names)
             if opts.species_name:
                 cl.append('SPECIES=%s' % opts.species_name)
@@ -429,7 +506,7 @@
     elif pic.picname == 'EstimateLibraryComplexity':
         cl.append('I=%s' % opts.input)
         cl.append('O=%s' % pic.metricsOut)
-        if float(sopts.minid) > 0:
+        if float(opts.minid) > 0:
             cl.append('MIN_IDENTICAL_BASES=%s' % opts.minid)
         if float(opts.maxdiff) > 0.0:
             cl.append('MAX_DIFF_RATE=%s' % opts.maxdiff)
@@ -439,13 +516,11 @@
             cl.append('READ_NAME_REGEX="%s"' % opts.readregex)
         if float(opts.optdupdist) > 0:
             cl.append('OPTICAL_DUPLICATE_PIXEL_DISTANCE=%s' % opts.optdupdist)
-        self.runPic(opts.jar,cl)
-
-
+        pic.runPic(opts.jar,cl)
     elif pic.picname == 'CollectAlignmentSummaryMetrics':
         # Why do we do this fakefasta thing? Because we need NO fai to be available or picard barfs unless it has the same length as the input data.
-        # why? Dunno
+        # why? Dunno Seems to work without complaining if the .bai file is AWOL....
         fakefasta = os.path.join(opts.outdir,'%s_fake.fasta' % os.path.basename(ref_file_name))
         try:
             os.symlink(ref_file_name,fakefasta)
@@ -468,8 +543,7 @@
             pic.delme.append(fakeinput)
             cl.append('INPUT=%s' % fakeinput)
         else:
-            cl.append('INPUT=%s' % os.path.abspath(opts.input))
-
+            cl.append('INPUT=%s' % os.path.abspath(opts.input))
         pic.runPic(opts.jar,cl)
@@ -539,8 +613,7 @@
         lf.write(stdouts)
         lf.write('\n')
         lf.close()
-
-
+
     elif pic.picname == 'MarkDuplicates':
         # assume sorted even if header says otherwise
         cl.append('ASSUME_SORTED=%s' % (opts.assumesorted))
@@ -563,14 +636,7 @@
         cl.append('O=%s' % tempout)
         cl.append('SORT_ORDER=%s' % opts.sortorder)
         pic.runPic(opts.jar,cl)
-        # Picard tool produced intermediate bam file. Depending on the
-        # desired format, we either just move to final location or create
-        # a sam version of it.
-        if opts.newformat == 'sam':
-            tlog, tempsam = pic.bamToSam( tempout, opts.outdir )
-            shutil.move(tempsam,os.path.abspath(opts.output))
-        else:
-            shutil.move(tempout, os.path.abspath(opts.output))
+        haveTempout = True
 
     elif pic.picname == 'ReorderSam':
         # input
@@ -596,13 +662,84 @@
         if opts.output_format == 'sam':
             tlog,newsam = pic.bamToSam(tempout,opts.tmpdir)
             shutil.move(newsam,opts.output)
-        else:
-            shutil.move(tempout,opts.output)
+        else:
+            shutil.move(tempout,opts.output)
+
+    elif pic.picname == "CalculateHsMetrics":
+        maxloglines = 100
+        baitfname = os.path.join(opts.outdir,'rgPicardHsMetrics.bait')
+        targetfname = os.path.join(opts.outdir,'rgPicardHsMetrics.target')
+        baitf = pic.makePicInterval(opts.baitbed,baitfname)
+        if opts.targetbed == opts.baitbed: # same file sometimes
+            targetf = baitf
+        else:
+            targetf = pic.makePicInterval(opts.targetbed,targetfname)
+        cl.append('BAIT_INTERVALS=%s' % baitf)
+        cl.append('TARGET_INTERVALS=%s' % targetf)
+        cl.append('INPUT=%s' % os.path.abspath(opts.input))
+        cl.append('OUTPUT=%s' % pic.metricsOut)
+        cl.append('TMP_DIR=%s' % opts.tmpdir)
+        pic.runPic(opts.jar,cl)
+
+    elif pic.picname == "ValidateSamFile":
+        doTranspose = False
+        sortedfile = os.path.join(opts.outdir,'rgValidate.sorted')
+        stf = open(pic.log_filename,'w')
+        tlog = None
+        if opts.datatype == 'sam': # need to work with a bam
+            tlog,tempbam = pic.samToBam(opts.input,opts.outdir)
+            try:
+                tlog = pic.sortSam(tempbam,sortedfile,opts.outdir)
+            except:
+                print '## exception on sorting sam file %s' % opts.input
+        else: # is already bam
+            try:
+                tlog = pic.sortSam(opts.input,sortedfile,opts.outdir)
+            except: # bug - [bam_sort_core] not being ignored - TODO fixme
+                print '## exception on sorting bam file %s' % opts.input
+        if tlog:
+            print '##tlog=',tlog
+            stf.write(tlog)
+            stf.write('\n')
+        sortedfile = '%s.bam' % sortedfile # samtools does that
+        cl.append('O=%s' % pic.metricsOut)
+        cl.append('TMP_DIR=%s' % opts.tmpdir)
+        cl.append('I=%s' % sortedfile)
+        opts.maxerrors = '99999999'
+        cl.append('MAX_OUTPUT=%s' % opts.maxerrors)
+        if opts.ignoreflags[0] <> 'None': # picard error values to ignore
+            igs = ['IGNORE=%s' % x for x in opts.ignoreflags if x <> 'None']
+            cl.append(' '.join(igs))
+        if opts.bisulphite.lower() <> 'false':
+            cl.append('IS_BISULFITE_SEQUENCED=true')
+        if opts.ref <> None or opts.ref_file <> None:
+            cl.append('R=%s' % ref_file_name)
+        pic.runPic(opts.jar,cl)
+        if opts.datatype == 'sam':
+            pic.delme.append(tempbam)
+        newsam = opts.output
+        outformat = 'bam'
+        if opts.newformat == 'sam':
+            outformat = 'sam'
+        pe = open(pic.metricsOut,'r').readlines()
+        pic.cleanSam(insam=sortedfile, newsam=newsam, picardErrors=pe,outformat=outformat)
+        pic.delme.append(sortedfile) # not wanted
+        stf.close()
+        pic.cleanup()
     else:
         print >> sys.stderr,'picard.py got an unknown tool name - %s' % pic.picname
         sys.exit(1)
+    if haveTempout:
+        # Some Picard tools produced a potentially intermediate bam file.
+        # Either just move to final location or create sam
+        if opts.newformat == 'sam':
+            tlog, tempsam = pic.bamToSam( tempout, opts.outdir )
+            shutil.move(tempsam,os.path.abspath(opts.output))
+        else:
+            shutil.move(tempout, os.path.abspath(opts.output))
 
     if opts.htmlout <> None or doFix: # return a pretty html page
-        pic.fixPicardOutputs(transpose=doTranspose,maxloglines=100)
+        pic.fixPicardOutputs(transpose=doTranspose,maxloglines=maxloglines)
 
 if __name__=="__main__": __main__()
+
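The new ValidateSamFile branch above feeds Picard's error report into cleanSam, which drops every read whose name appears in an error line - and, because the filter works on names, the mate of each bad read is dropped with it. The core of that pysam filter, distilled into a self-contained sketch using the same old-style pysam.Samfile API the wrapper uses (file names illustrative)::

    import pysam

    def drop_reported_reads(picard_errors, inbam, outbam):
        # error lines look like:
        # ERROR: Record 33, Read name SRR006041.1042721, Empty sequence dictionary.
        bad = set(x.split(',')[1].replace(' Read name ', '')
                  for x in picard_errors if len(x.split(',')) > 2)
        infile = pysam.Samfile(inbam, 'rb')
        outfile = pysam.Samfile(outbam, 'wb', template=infile)
        kept = dropped = 0
        for read in infile:
            if read.qname in bad:      # name match also catches the mate
                dropped += 1
            else:
                outfile.write(read)
                kept += 1
        outfile.close()
        infile.close()
        return kept, dropped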
--- a/tools/picard/rgPicardASMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardASMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -17,17 +17,26 @@
    help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
   <param name="out_prefix" value="Picard Alignment Summary Metrics" type="text"
    label="Title for the output file - use this remind you what the job was for" size="80" />
-  <conditional name="genomeSource">
-    <param name="refGenomeSource" type="select" help="If in doubt, choose built-in and read Picard/Samtools documentation"
-     label="Align to reference genome - built-in or from current history?">
-      <option value="indexed">Use a built in reference sequence</option>
-      <option value="history">Use a reference genome (fasta format) from my history</option>
+
+  <conditional name="genomeSource">
+
+    <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right"
+     label="Align to which reference genome? - default, built-in or choose from current history">
+      <option value="default" selected="true">Default - use the input data genome/build</option>
+      <option value="indexed">Select a different built-in genome</option>
+      <option value="history">Use a genome (fasta format) from my history</option>
     </param>
+    <when value="default">
+      <param name="index" type="select" label="Select a default reference genome">
+        <options from_data_table="all_fasta">
+          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator="," />
+          <validator type="no_options" message="No reference build available for selected input" />
+        </options>
+      </param>
+    </when>
     <when value="indexed">
-      <param name="index" type="select" label="Select a reference genome">
-        <options from_data_table="all_fasta">
-          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
-          <validator type="no_options" message="No reference sequences are available for the build associated with the selected input data"/>
+      <param name="index" type="select" label="Select a built-in reference genome" >
+        <options from_data_table="all_fasta"></options>
      </param>
    </when>
@@ -35,7 +44,6 @@
     <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" />
    </when>
   </conditional>
-
   <param name="sorted" type="boolean" label="Assume the input file is already sorted" checked="true" truevalue="true" falsevalue="false"/>
   <param name="bisulphite" type="boolean" label="Input file contains Bisulphite sequenced reads" checked="false" falsevalue="false" truevalue="true" />
   <param name="adaptors" value="" type="text" area="true" label="Adapter sequences - one per line if multiple" size="5x120" />
@@ -191,8 +199,10 @@
 Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
 
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardFixMate.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardFixMate.xml Sat May 07 11:56:12 2011 -0400
@@ -1,5 +1,5 @@
 <tool name="Paired Read Mate Fixer:" id="rgPicFixMate" version="0.01">
-  <description>(Picard)</description>
+  <description>for paired data</description>
   <command interpreter="python">
    picard_wrapper.py -i "$input_file" -o "$out_file" --tmpdir "${__new_file_path__}" -n "$out_prefix"
    --newformat "$newformat" -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/FixMateInformation.jar" --sortorder "$sortOrder"
@@ -82,7 +82,7 @@
 
 **From the Picard documentation**
 
-.. csv-table:: GC Bias Doc
+.. csv-table:: Fixmate
    :header-rows: 1
 
@@ -106,7 +106,11 @@
 All the Picard tools are freely available and are documented
 at http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
 
-Here, you can apply Picard tools through Galaxy which might be easier than through the native Picard command line interface.
+Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
+Code cleaned up and the ugliest hacks repaired by Raphael Lullis.
+Kelly Vincent wrote the first version of picard_wrapper.py that now incorporates that code.
+
+It takes a village of programmers to wrap a picard tool
 
 -----
@@ -116,14 +120,10 @@
 
 This Galaxy tool is a component of the rgenetics toolkit.
 
-Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
-Code cleaned up and the ugliest hacks repaired by Raphael Lullis.
-Kelly Vincent wrote the first version of picard_wrapper.py that now incorporates that code.
-
-It takes a village of programmers to wrap a picard tool
-
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardGCBiasMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardGCBiasMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -17,18 +17,24 @@
    help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
   <param name="out_prefix" value="Short Read GC Bias Metrics" type="text"
    label="Title for the output file - use this remind you what the job was for" size="80" />
-  <conditional name="genomeSource">
-    <param name="refGenomeSource" type="select" help="If in doubt, choose built-in and read Picard/Samtools documentation"
-     label="Align to reference genome - built-in or from current history?">
-      <option value="indexed">Align to input data 'dbkey' reference genome</option>
-      <option value="history">Use a genome (fasta format) from my history</option>
+  <conditional name="genomeSource">
+    <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right"
+     label="Align to which reference genome? - default, built-in or choose from current history">
+      <option value="default" selected="true">Default - use the input data genome build</option>
+      <option value="indexed">Select a different built-in genome </option>
+      <option value="history">Use a genome (fasta format) from my history </option>
     </param>
+    <when value="default">
+      <param name="index" type="select" label="Select a default reference genome">
+        <options from_data_table="all_fasta">
+          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
+          <validator type="no_options" message="No reference build available for the selected input data" />
+        </options>
+      </param>
+    </when>
     <when value="indexed">
-      <param name="index" type="select" label="Select a reference genome">
-        <options from_data_table="all_fasta">
-          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
-          <validator type="no_options" message="No reference sequences are available for the build associated with the selected input data"/>
-        </options>
+      <param name="index" type="select" label="Select a built-in reference genome">
+        <options from_data_table="all_fasta"/>
      </param>
    </when>
    <when value="history">
@@ -168,9 +174,10 @@
 This Galaxy tool is a component of the rgenetics toolkit.
 
 Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
-
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardHsMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardHsMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -1,10 +1,10 @@
 <tool name="Sam/bam Hybrid Selection Metrics:" id="PicardHsMetrics" version="0.01">
-  <description>(Picard)</description>
+  <description>For (eg exome) targetted data</description>
   <command interpreter="python">
-    rgPicardHsMetrics.py -i $input_file -d $html_file.files_path -o $html_file
-    -b $bait_bed -t $target_bed -n "$out_prefix" --tmpdir "${__new_file_path__}"
-    -j ${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar -y $input_file.ext
+    picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --datatype "$input_file.ext"
+    --baitbed "$bait_bed" --targetbed "$target_bed" -n "$out_prefix" --tmpdir "${__new_file_path__}"
+    -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar" -x "$maxheap"
   </command>
   <requirements><requirement type="package">picard</requirement></requirements>
@@ -13,6 +13,13 @@
   <param name="out_prefix" value="Picard HS Metrics" type="text" label="Title for the output file - to remind you what the job was for" size="80" />
   <param name="bait_bed" type="data" format="interval" label="Bait intervals: Sequences for bait in the design - ucsc BED" size="80" />
   <param name="target_bed" type="data" format="interval" label="Target intervals: Sequences for targets in the design - ucsc BED" size="80" />
+  <param name="maxheap" type="select"
+   help="If in doubt, try the default. If it fails with a complaint about java heap size, try increasing it please - larger jobs will require your own hardware."
+   label="Java heap size">
+    <option value="4G" selected = "true">4GB default </option>
+    <option value="8G" >8GB use if 4GB fails</option>
+    <option value="16G">16GB - try this if 8GB fails</option>
+  </param>
  </inputs>
  <outputs>
    <data format="html" name="html_file" label="${out_prefix}.html" />
@@ -23,6 +30,7 @@
   <param name="input_file" value="picard_input_summary_alignment_stats.sam" ftype="sam" />
   <param name="bait_bed" value="picard_input_bait.bed" />
   <param name="target_bed" value="picard_input_bait.bed" />
+  <param name="maxheap" value="8G" />
   <output name="html_file" file="picard_output_hs_transposed_summary_alignment_stats.html" ftype="html" lines_diff="212"/>
  </test>
 </tests>
@@ -140,10 +148,10 @@
 This is part of the rgenetics toolkit.
 
 Written by Ross Lazarus 2010
-
-Licensed under the LGPL
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardInsertSize.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardInsertSize.xml Sat May 07 11:56:12 2011 -0400
@@ -1,5 +1,5 @@
 <tool name="Insertion size metrics:" id="rgPicInsertSize" version="0.01">
-  <description>(Picard)</description>
+  <description>for PAIRED data</description>
   <command interpreter="python">
    picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --taillimit "$tailLimit"
    --histwidth "$histWidth" --minpct "$minPct"
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/picard/rgPicardLibComplexity.xml Sat May 07 11:56:12 2011 -0400
@@ -0,0 +1,143 @@
+<tool name="Estimate Library Complexity:" id="rgEstLibComp" version="0.01">
+  <description>(Picard)</description>
+  <command interpreter="python">
+    picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --minid "$minIDbases"
+    --maxdiff "$maxDiff" --minmeanq "$minMeanQ" --readregex "$readRegex" --optdupdist "$optDupeDist"
+    -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/EstimateLibraryComplexity.jar" -d "$html_file.files_path" -t "$html_file"
+  </command>
+  <inputs>
+    <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history"
+     help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
+    <param name="out_prefix" value="Library Complexity" type="text"
+     label="Title for the output file - use this remind you what the job was for" size="80" />
+    <param name="minIDbases" value="5" type="integer" label="Minimum identical bases at starts of reads for grouping" size="5"
+     help="Total_reads / 4^max_id_bases reads will be compared at a time. Lower numbers = more accurate results and exponentially more time/ram" />
+    <param name="maxDiff" value="0.03" type="float"
+     label="Maximum difference rate for identical reads" size="5"
+     help="The maximum rate of differences between two reads to call them identical" />
+    <param name="minMeanQ" value="20" type="integer"
+     label="Minimum percentage" size="5"
+     help="The minimum mean quality of bases in a read pair. Lower average quality reads filtered out from all calculations" />
+    <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" type="text" size="120"
+     label="Regular expression that can be used to parse read names in the incoming SAM file"
+     help="Names are parsed to extract: tile/region, x coordinate and y coordinate, to estimate optical duplication rate" >
+      <sanitizer>
+        <valid initial="string.printable">
+          <remove value="'"/>
+        </valid>
+        <mapping initial="none">
+          <add source="'" target="__sq__"/>
+        </mapping>
+      </sanitizer>
+    </param>
+    <param name="optDupeDist" value="100" type="text"
+     label="The maximum offset between two duplicte clusters in order to consider them optical duplicates." size="5"
+     help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100" />
+
+  </inputs>
+  <outputs>
+    <data format="html" name="html_file" label="${out_prefix}_lib_complexity.html"/>
+  </outputs>
+  <tests>
+    <test>
+      <param name="input_file" value="picard_input_tiny.sam" />
+      <param name="out_prefix" value="Library Complexity" />
+      <param name="minIDbases" value="5" />
+      <param name="maxDiff" value="0.03" />
+      <param name="minMeanQ" value="20" />
+      <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" />
+      <param name="optDupeDist" value="100" />
+      <output name="html_file" file="picard_output_estlibcomplexity_tinysam.html" ftype="html" lines_diff="25" />
+    </test>
+  </tests>
+  <help>
+
+.. class:: infomark
+
+**Purpose**
+
+ EstimateLibraryComplexity
+ Attempts to estimate library complexity from sequence alone.
+ Does so by sorting all reads by the first N bases (5 by default) of each read and then
+ comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be
+ duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
+
+ Reads of poor quality are filtered out so as to provide a more accurate estimate.
+ The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than
+ MIN_MEAN_QUALITY across either the first or second read.
+
+ The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the
+ calculation of library size. Also, since there is no alignment to screen out technical reads one
+ further filter is applied on the data. After examining all reads a histogram is built of
+ [#reads in duplicate set -> #of duplicate sets];
+ all bins that contain exactly one duplicate set are then removed from the histogram as outliers before library size is estimated.
+
+**Why you might want to use this tool**
+
+ This tool provides a Galaxy interface to one of the Picard tools.
+ If you need to estimate library complexity from sequences, the Picard tool may help.
+
+
+**Note on the Regular Expression**
+
+ (from the Picard docs)
+ This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file.
+ These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size.
+ The regular expression should contain three capture groups for the three variables, in order.
+ Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
+
+
+-----
+
+.. class:: infomark
+
+**From the Picard documentation**
+
+.. csv-table:: Estimate complexity docs
+   :header-rows: 1
+
+   Option Description
+   "INPUT=File","One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. This option may be specified 0 or more times."
+   "OUTPUT=File","Output file to writes per-library metrics to. Required."
+   "MIN_IDENTICAL_BASES=Integer","The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5."
+   "MAX_DIFF_RATE=Double","The maximum rate of differences between two reads to call them identical. Default value: 0.03. "
+   "MIN_MEAN_QUALITY=Integer","The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20."
+   "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to clear the default value."
+   "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicte clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
+   "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. "
+
+
+-----
+
+.. class:: infomark
+
+**Attributions**
+
+Picard is supported through the SamTools project.
+This tool wraps Picard and is supported through the galaxy-bugs mailing list
+Please help us by completing the report form that appears automatically
+if a tool fails unexpectedly when you run it in Galaxy.
+
+All the Picard tools are freely available and are documented
+at http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeM...
+
+-----
+
+.. class:: infomark
+
+**Copyright**
+
+This Galaxy tool is a component of the rgenetics toolkit.
+
+Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
+Code cleaned up and the ugliest hacks repaired by Raphael Lullis
+
+All rgenetics artefacts are available licensed under the LGPL_
+Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
+
+ </help>
+</tool>
+
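The duplicate-grouping step the help text above describes - bucket reads on their first N bases, then histogram the duplicate-set sizes - is easy to caricature in a few lines. A toy sketch only: real Picard also applies the quality filters, the gapless MAX_DIFF_RATE comparison and optical-duplicate detection described above, none of which appears here::

    from collections import defaultdict

    def duplicate_set_histogram(reads, n_id_bases=5):
        groups = defaultdict(list)
        for seq in reads:                 # bucket on the first N bases
            groups[seq[:n_id_bases]].append(seq)
        hist = defaultdict(int)
        for members in groups.values():
            sets = defaultdict(int)       # crude stand-in for Picard's pairwise
            for seq in members:           # MAX_DIFF_RATE comparison: exact
                sets[seq] += 1            # matches only
            for size in sets.values():
                hist[size] += 1
        return dict(hist)                 # {duplicate-set size: number of sets}

    # duplicate_set_histogram(['ACGTAA', 'ACGTAA', 'ACGTTT']) -> {1: 1, 2: 1}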
http://bitbucket.org/galaxy/galaxy-central/changeset/3cf3d54e5b28/
changeset: r5511:3cf3d54e5b28
user:      fubar
date:      2011-05-07 19:58:35
summary:   add test input for hsmetrics
affected #: 1 file (376 bytes)

--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_input_bait.bed Sat May 07 13:58:35 2011 -0400
@@ -0,0 +1,8 @@
+chr1 1 300 - CCDS635.1_cds_0_0_chr1_67052401_r
+chr2 1 300 - CCDS635.1_cds_1_0_chr1_67060632_r
+chr3 1 300 - CCDS635.1_cds_2_0_chr1_67065091_r
+chr4 1 300 - CCDS635.1_cds_3_0_chr1_67066083_r
+chr5 1 300 - CCDS635.1_cds_4_0_chr1_67071856_r
+chr6 1 300 - CCDS635.1_cds_5_0_chr1_67072262_r
+chr7 1 300 - CCDS635.1_cds_6_0_chr1_67073897_r
+chr8 1 300 - CCDS635.1_cds_7_0_chr1_67075981_r

Repository URL: https://bitbucket.org/galaxy/galaxy-central/