2 new changesets in galaxy-central:

http://bitbucket.org/galaxy/galaxy-central/changeset/d79a1ad33a03/
changeset: r5510:d79a1ad33a03
user:      fubar
date:      2011-05-07 17:56:12
summary:   Lots of small tweaks and some new Picard metrics - Hs and library complexity. Includes an option to use the input data's genome as the reference wherever a reference genome is needed - this helps prevent new-user head scratching, and really should be added to (e.g.) the bwa and other wrappers too.
affected #: 9 files (15.9 KB)

--- a/test-data/picard_output_AsMetrics_indexed_hg18_sorted_pair.html Fri May 06 16:02:25 2011 -0400
+++ b/test-data/picard_output_AsMetrics_indexed_hg18_sorted_pair.html Sat May 07 11:56:12 2011 -0400
@@ -12,10 +12,10 @@
 </head><body><div class="document">
-Galaxy tool wrapper run picard_wrapper at 27/04/2011 11:46:40</b><br/><b>Picard log</b><hr/>
-<pre>## executing java -Xmx2g -jar /udd/rerla/galaxy-central/tool-data/shared/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt R=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta TMP_DIR=/tmp INPUT=/tmp/6728746.1.all.q/tmpkzvo-A/database/files/000/dataset_91.dat returned status 1 and stderr:
-[Wed Apr 27 11:46:37 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/tmp/6728746.1.all.q/tmpkzvo-A/database/files/000/dataset_91.dat OUTPUT=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/udd/rerla/galaxy-central/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=/tmp VALIDATION_STRINGENCY=LENIENT IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
-[Wed Apr 27 11:46:40 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics done.
+Galaxy tool wrapper run picard_wrapper at 06/05/2011 22:54:31</b><br/><b>Picard log</b><hr/>
+<pre>## executing java -Xmx2g -jar /udd/rerla/rgalaxy/tool-data/shared/jars/CollectAlignmentSummaryMetrics.jar VALIDATION_STRINGENCY=LENIENT ASSUME_SORTED=true ADAPTER_SEQUENCE= IS_BISULFITE_SEQUENCED=false MAX_INSERT_SIZE=100000 OUTPUT=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt R=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta TMP_DIR=/tmp INPUT=/tmp/6728809.1.all.q/tmps-Xqx9/database/files/000/dataset_91.dat returned status 1 and stderr:
+[Fri May 06 22:54:22 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics MAX_INSERT_SIZE=100000 ADAPTER_SEQUENCE=[AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAGACCGATCTCGTATGCCGTCTTCTGCTTG, AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT, AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG, IS_BISULFITE_SEQUENCED=false] INPUT=/tmp/6728809.1.all.q/tmps-Xqx9/database/files/000/dataset_91.dat OUTPUT=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/CollectAlignmentSummaryMetrics.metrics.txt REFERENCE_SEQUENCE=/udd/rerla/rgalaxy/database/job_working_directory/90/dataset_92_files/hg18.fasta_fake.fasta ASSUME_SORTED=true TMP_DIR=/tmp VALIDATION_STRINGENCY=LENIENT IS_BISULFITE_SEQUENCED=false STOP_AFTER=0 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
+[Fri May 06 22:54:30 EDT 2011] net.sf.picard.analysis.CollectAlignmentSummaryMetrics done.
 Runtime.totalMemory()=910163968
 Exception in thread "main" net.sf.picard.PicardException: Requesting earlier reference sequence: 0 < 1
 	at net.sf.picard.reference.ReferenceSequenceFileWalker.get(ReferenceSequenceFileWalker.java:78)
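The failing test log above captures the wrapper's basic execution pattern: assemble a KEY=value command line for a Picard jar, shell out, and report the exit status and stderr back into the HTML log. A minimal sketch of that pattern - illustrative only, not the committed runCL/runPic implementation, and all paths are placeholders::

    import subprocess, tempfile

    def run_picard(jar, key_values, maxheap='2g'):
        """Run a Picard jar; return (exit status, stderr text)."""
        cl = ['java', '-Xmx%s' % maxheap, '-jar', jar]
        cl += ['%s=%s' % kv for kv in key_values]
        errlog = tempfile.TemporaryFile()
        status = subprocess.call(cl, stderr=errlog)
        errlog.seek(0)
        return status, errlog.read()

    # e.g. (hypothetical paths):
    # status, err = run_picard('CollectAlignmentSummaryMetrics.jar',
    #                          [('VALIDATION_STRINGENCY', 'LENIENT'),
    #                           ('INPUT', 'dataset_91.dat')])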
--- a/tool_conf.xml.sample Fri May 06 16:02:25 2011 -0400
+++ b/tool_conf.xml.sample Sat May 07 11:56:12 2011 -0400
@@ -246,7 +246,9 @@
     <tool file="picard/picard_BamIndexStats.xml" />
     <tool file="picard/rgPicardASMetrics.xml" />
     <tool file="picard/rgPicardGCBiasMetrics.xml" />
+    <tool file="picard/rgPicardLibComplexity.xml" />
     <tool file="picard/rgPicardInsertSize.xml" />
+    <tool file="picard/rgPicardHsMetrics.xml" />
     <label text="bam/sam Cleaning" id="picard-tools" />
     <tool file="picard/picard_AddOrReplaceReadGroups.xml" />
     <tool file="picard/picard_ReorderSam.xml" />

--- a/tools/picard/picard_wrapper.py Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/picard_wrapper.py Sat May 07 11:56:12 2011 -0400
@@ -8,7 +8,7 @@
 see http://www.gnu.org/copyleft/lesser.html
 """
-import optparse, os, sys, subprocess, tempfile, shutil, time
+import optparse, os, sys, subprocess, tempfile, shutil, time, pysam
 
 galhtmlprefix = """<?xml version="1.0" encoding="utf-8" ?>
 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
@@ -52,11 +52,10 @@
         if self.opts.outdir == None:
             self.opts.outdir = self.opts.tmpdir # fixmate has no html file eg
         assert self.opts.outdir <> None,'## PicardBase needs a temp directory if no output directory passed in'
-        self.maxloglines = 100
-        self.picname = self.nameMunge(opts.jar)
+        self.picname = self.baseName(opts.jar)
         if self.picname.startswith('picard'):
-            self.picname = opts.picard_cmd # special case for some tools like replaceheader?
-        self.progname = self.nameMunge(arg0)
+            self.picname = opts.picard_cmd # special case for some tools like replaceheader?
+        self.progname = self.baseName(arg0)
         self.version = '0.002'
         self.delme = [] # list of files to destroy
         self.title = opts.title
@@ -72,7 +71,7 @@
         self.log_filename = '%s.log' % self.picname
         self.metricsOut = os.path.join(opts.outdir,'%s.metrics.txt' % self.picname)
 
-    def nameMunge(self,name=None):
+    def baseName(self,name=None):
         return os.path.splitext(os.path.basename(name))[0]
 
     def readLarge(self,fname=None):
@@ -172,7 +171,7 @@
         except:
             pass
 
-    def prettyPicout(self,transpose=True,maxrows=100):
+    def prettyPicout(self,transpose,maxrows):
         """organize picard outpouts into a report html page
         """
         res = []
@@ -181,7 +180,7 @@
         except:
             r = []
         if len(r) > 0:
-            res.append('<b>Picard on line resources</b><ul>\n')
+            res.append('<b>Picard on line resources - maxrows=%d</b><ul>\n' % maxrows)
             res.append('<li><a href="http://picard.sourceforge.net/index.shtml">Click here for Picard Documentation</a></li>\n')
             res.append('<li><a href="http://picard.sourceforge.net/picard-metric-definitions.shtml">Click here for Picard Metrics definitions</a></li></ul><hr/>\n')
         if transpose:
@@ -221,19 +220,18 @@
                 tdat = ['<tr class="d%d"><td>%s</td></tr>\n' % ((i+len(heads)) % 2,x) for i,x in enumerate(dat) if i < maxrows]
                 missing = len(tdat) - maxrows
                 if missing > 0:
-                    tdat.append('<tr><td>...WARNING: %d rows deleted..see raw file %s for entire output</td></tr>' % (missing,os.path.basename(picout)))
+                    tdat.append('<tr><td>...WARNING: %d rows deleted..see raw file %s for entire output</td></tr>' % (missing,os.path.basename(self.metricsOut)))
             res += tdat
             dat = []
         res.append('</table>\n')
         return res
 
-    def fixPicardOutputs(self,transpose=True,maxloglines=100):
+    def fixPicardOutputs(self,transpose,maxloglines):
         """
         picard produces long hard to read tab header files
         make them available but present them transposed for readability
         """
         self.cleanup() # remove temp files stored in delme
-        self.maxloglines = maxloglines
         rstyle="""<style type="text/css">
         tr.d0 td {background-color: oldlace; color: black;}
         tr.d1 td {background-color: aliceblue; color: black;}
@@ -256,7 +254,7 @@
             fn = os.path.split(f)[-1]
             res.append('<tr><td><a href="%s">%s</a></td></tr>\n' % (fn,fn))
         res.append('</table><p/>\n')
-        pres = self.prettyPicout(transpose=transpose,maxrows=self.maxloglines)
+        pres = self.prettyPicout(transpose,maxloglines)
         if len(pres) > 0:
             res += pres
         l = open(self.log_filename,'r').readlines()
@@ -264,9 +262,9 @@
         if llen > 0:
             res.append('<b>Picard log</b><hr/>\n')
             rlog = ['<pre>',]
-            if llen > self.maxloglines:
-                rlog += l[:self.maxloglines]
-                rlog.append('\n<b>## WARNING - %d log lines truncated - %s contains entire output' % (llen - self.maxloglines,self.log_filename))
+            if llen > maxloglines:
+                rlog += l[:maxloglines]
+                rlog.append('\n<b>## WARNING - %d log lines truncated - %s contains entire output' % (llen - maxloglines,self.log_filename))
             else:
                 rlog += l
             rlog.append('</pre>')
@@ -281,12 +279,82 @@
             outf.write('\n')
         outf.close()
 
+    def makePicInterval(self,inbed=None,outf=None):
+        """
+        picard wants bait and target files to have the same header length as the incoming bam/sam
+        a meaningful (ie accurate) representation will fail because of this - so this hack
+        it would be far better to be able to supply the original bed untouched
+        """
+        assert inbed <> None
+        bed = open(inbed,'r').readlines()
+        thead = os.path.join(self.opts.outdir,'tempSamHead.txt')
+        if self.opts.datatype == 'sam':
+            cl = ['samtools view -H -S',self.opts.input,'>',thead]
+        else:
+            cl = ['samtools view -H',self.opts.input,'>',thead]
+        self.runCL(cl=cl,output_dir=self.opts.outdir)
+        head = open(thead,'r').readlines()
+        s = '## got %d rows of header\n' % (len(head))
+        lf = open(self.log_filename,'a')
+        lf.write(s)
+        lf.close()
+        o = open(outf,'w')
+        o.write(''.join(head))
+        o.write(''.join(bed))
+        o.close()
+        return outf
+
+    def cleanSam(self, insam=None, newsam=None, picardErrors=[],outformat=None):
+        """
+        interesting problem - if paired, must remove mate pair of errors too or we have a new set of errors after cleaning - missing mate pairs!
+        Do the work of removing all the error sequences
+        pysam is cool
+        infile = pysam.Samfile( "-", "r" )
+        outfile = pysam.Samfile( "-", "w", template = infile )
+        for s in infile: outfile.write(s)
+
+        errors from ValidateSameFile.jar look like
+        WARNING: Record 32, Read name SRR006041.1202260, NM tag (nucleotide differences) is missing
+        ERROR: Record 33, Read name SRR006041.1042721, Empty sequence dictionary.
+        ERROR: Record 33, Read name SRR006041.1042721, RG ID on SAMRecord not found in header: SRR006041
+
+        """
+        assert os.path.isfile(insam), 'rgPicardValidate cleansam needs an input sam file - cannot find %s' % insam
+        assert newsam <> None, 'rgPicardValidate cleansam needs an output new sam file path'
+        removeNames = [x.split(',')[1].replace(' Read name ','') for x in picardErrors if len(x.split(',')) > 2]
+        remDict = dict(zip(removeNames,range(len(removeNames))))
+        infile = pysam.Samfile(insam,'rb')
+        info = 'found %d error sequences in picardErrors, %d unique' % (len(removeNames),len(remDict))
+        if len(removeNames) > 0:
+            outfile = pysam.Samfile(newsam,'wb',template=infile) # template must be an open file
+            i = 0
+            j = 0
+            for row in infile:
+                dropme = remDict.get(row.qname,None) # keep if None
+                if not dropme:
+                    outfile.write(row)
+                    j += 1
+                else: # discard
+                    i += 1
+            info = '%s\n%s' % (info, 'Discarded %d lines writing %d to %s from %s' % (i,j,newsam,insam))
+            outfile.close()
+            infile.close()
+        else: # we really want a nullop or a simple pointer copy
+            infile.close()
+            if newsam:
+                shutil.copy(insam,newsam)
+        lf = open(self.log_filename,'a')
+        lf.write(info)
+        lf.write('\n')
+        lf.close()
+
 
 def __main__():
     doFix = False # tools returning htmlfile don't need this
     doTranspose = True # default
+    maxloglines = 100 # default
     #Parse Command Line
     op = optparse.OptionParser()
     # All tools
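The makePicInterval hack added above boils down to: capture the @-header of the input bam/sam with samtools, then write it in front of the untouched BED rows so Picard will accept the result as an interval list. A standalone sketch of the same idea, assuming samtools is on PATH - file names here are illustrative::

    import subprocess

    def bed_to_picard_interval(bam_path, bed_path, out_path):
        # capture the SAM @HD/@SQ header lines Picard expects to see
        p = subprocess.Popen(['samtools', 'view', '-H', bam_path],
                             stdout=subprocess.PIPE)
        header = p.communicate()[0]
        out = open(out_path, 'wb')
        out.write(header)                        # header first,
        out.write(open(bed_path, 'rb').read())   # then the interval rows
        out.close()
        return out_path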
@@ -323,7 +391,7 @@
     # CollectAlignmentSummaryMetrics
     op.add_option('', '--maxinsert', default="20")
     op.add_option('', '--adaptors', action='append', type="string")
-    # FixMateInformation
+    # FixMateInformation and validate
     op.add_option('','--newformat', default='bam')
     # CollectGcBiasMetrics
     op.add_option('', '--windowsize', default='100')
@@ -349,16 +417,25 @@
     op.add_option('','--minid', default="5")
     op.add_option('','--maxdiff', default="0.03")
     op.add_option('','--minmeanq', default="20")
-
+    #hsmetrics
+    op.add_option('','--baitbed', default=None)
+    op.add_option('','--targetbed', default=None)
+    #validate
+    op.add_option('','--ignoreflags', action='append', type="string")
+    op.add_option('','--maxerrors', default=None)
+    op.add_option('','--datatype', default=None)
+    op.add_option('','--bamout', default=None)
+    op.add_option('','--samout', default=None)
     opts, args = op.parse_args()
     opts.sortme = opts.assumesorted == 'false'
     assert opts.input <> None
     # need to add
-    # output version # of tool
+    # instance that does all the work
     pic = PicardBase(opts,sys.argv[0])
 
     tmp_dir = opts.outdir
+    haveTempout = False # we use this where sam output is an option
 
     # set ref and dict files to use (create if necessary)
     ref_file_name = opts.ref
@@ -373,7 +450,7 @@
             os.symlink( opts.ref_file, ref_file_name )
             cl = ['REFERENCE=%s' % ref_file_name]
             cl.append('OUTPUT=%s' % dict_file_name)
-            cl.append('URI=%s' % os.path.basename( ref_file ))
+            cl.append('URI=%s' % os.path.basename( opts.ref_file ))
             cl.append('TRUNCATE_NAMES_AT_WHITESPACE=%s' % opts.trunc_names)
             if opts.species_name:
                 cl.append('SPECIES=%s' % opts.species_name)
@@ -429,7 +506,7 @@
     elif pic.picname == 'EstimateLibraryComplexity':
         cl.append('I=%s' % opts.input)
         cl.append('O=%s' % pic.metricsOut)
-        if float(sopts.minid) > 0:
+        if float(opts.minid) > 0:
             cl.append('MIN_IDENTICAL_BASES=%s' % opts.minid)
         if float(opts.maxdiff) > 0.0:
             cl.append('MAX_DIFF_RATE=%s' % opts.maxdiff)
@@ -439,13 +516,11 @@
             cl.append('READ_NAME_REGEX="%s"' % opts.readregex)
         if float(opts.optdupdist) > 0:
             cl.append('OPTICAL_DUPLICATE_PIXEL_DISTANCE=%s' % opts.optdupdist)
-        self.runPic(opts.jar,cl)
-
-
+        pic.runPic(opts.jar,cl)
     elif pic.picname == 'CollectAlignmentSummaryMetrics':
         # Why do we do this fakefasta thing? Because we need NO fai to be available or picard barfs unless it has the same length as the input data.
-        # why? Dunno
+        # why? Dunno Seems to work without complaining if the .bai file is AWOL....
         fakefasta = os.path.join(opts.outdir,'%s_fake.fasta' % os.path.basename(ref_file_name))
         try:
             os.symlink(ref_file_name,fakefasta)
@@ -468,8 +543,7 @@
             pic.delme.append(fakeinput)
             cl.append('INPUT=%s' % fakeinput)
         else:
-            cl.append('INPUT=%s' % os.path.abspath(opts.input))
-
+            cl.append('INPUT=%s' % os.path.abspath(opts.input))
         pic.runPic(opts.jar,cl)
@@ -539,8 +613,7 @@
         lf.write(stdouts)
         lf.write('\n')
         lf.close()
-
-
+
     elif pic.picname == 'MarkDuplicates':
         # assume sorted even if header says otherwise
         cl.append('ASSUME_SORTED=%s' % (opts.assumesorted))
@@ -563,14 +636,7 @@
         cl.append('O=%s' % tempout)
         cl.append('SORT_ORDER=%s' % opts.sortorder)
         pic.runPic(opts.jar,cl)
-        # Picard tool produced intermediate bam file. Depending on the
-        # desired format, we either just move to final location or create
-        # a sam version of it.
-        if opts.newformat == 'sam':
-            tlog, tempsam = pic.bamToSam( tempout, opts.outdir )
-            shutil.move(tempsam,os.path.abspath(opts.output))
-        else:
-            shutil.move(tempout, os.path.abspath(opts.output))
+        haveTempout = True
 
     elif pic.picname == 'ReorderSam':
         # input
@@ -596,13 +662,84 @@
         if opts.output_format == 'sam':
             tlog,newsam = pic.bamToSam(tempout,opts.tmpdir)
             shutil.move(newsam,opts.output)
-        else:
-            shutil.move(tempout,opts.output)
+        else:
+            shutil.move(tempout,opts.output)
+
+    elif pic.picname == "CalculateHsMetrics":
+        maxloglines = 100
+        baitfname = os.path.join(opts.outdir,'rgPicardHsMetrics.bait')
+        targetfname = os.path.join(opts.outdir,'rgPicardHsMetrics.target')
+        baitf = pic.makePicInterval(opts.baitbed,baitfname)
+        if opts.targetbed == opts.baitbed: # same file sometimes
+            targetf = baitf
+        else:
+            targetf = pic.makePicInterval(opts.targetbed,targetfname)
+        cl.append('BAIT_INTERVALS=%s' % baitf)
+        cl.append('TARGET_INTERVALS=%s' % targetf)
+        cl.append('INPUT=%s' % os.path.abspath(opts.input))
+        cl.append('OUTPUT=%s' % pic.metricsOut)
+        cl.append('TMP_DIR=%s' % opts.tmpdir)
+        pic.runPic(opts.jar,cl)
+
+    elif pic.picname == "ValidateSamFile":
+        doTranspose = False
+        sortedfile = os.path.join(opts.outdir,'rgValidate.sorted')
+        stf = open(pic.log_filename,'w')
+        tlog = None
+        if opts.datatype == 'sam': # need to work with a bam
+            tlog,tempbam = pic.samToBam(opts.input,opts.outdir)
+            try:
+                tlog = pic.sortSam(tempbam,sortedfile,opts.outdir)
+            except:
+                print '## exception on sorting sam file %s' % opts.input
+        else: # is already bam
+            try:
+                tlog = pic.sortSam(opts.input,sortedfile,opts.outdir)
+            except: # bug - [bam_sort_core] not being ignored - TODO fixme
+                print '## exception on sorting bam file %s' % opts.input
+        if tlog:
+            print '##tlog=',tlog
+            stf.write(tlog)
+            stf.write('\n')
+        sortedfile = '%s.bam' % sortedfile # samtools does that
+        cl.append('O=%s' % pic.metricsOut)
+        cl.append('TMP_DIR=%s' % opts.tmpdir)
+        cl.append('I=%s' % sortedfile)
+        opts.maxerrors = '99999999'
+        cl.append('MAX_OUTPUT=%s' % opts.maxerrors)
+        if opts.ignoreflags[0] <> 'None': # picard error values to ignore
+            igs = ['IGNORE=%s' % x for x in opts.ignoreflags if x <> 'None']
+            cl.append(' '.join(igs))
+        if opts.bisulphite.lower() <> 'false':
+            cl.append('IS_BISULFITE_SEQUENCED=true')
+        if opts.ref <> None or opts.ref_file <> None:
+            cl.append('R=%s' % ref_file_name)
+        pic.runPic(opts.jar,cl)
+        if opts.datatype == 'sam':
+            pic.delme.append(tempbam)
+        newsam = opts.output
+        outformat = 'bam'
+        if opts.newformat == 'sam':
+            outformat = 'sam'
+        pe = open(pic.metricsOut,'r').readlines()
+        pic.cleanSam(insam=sortedfile, newsam=newsam, picardErrors=pe,outformat=outformat)
+        pic.delme.append(sortedfile) # not wanted
+        stf.close()
+        pic.cleanup()
     else:
         print >> sys.stderr,'picard.py got an unknown tool name - %s' % pic.picname
         sys.exit(1)
+    if haveTempout:
+        # Some Picard tools produced a potentially intermediate bam file.
+        # Either just move to final location or create sam
+        if opts.newformat == 'sam':
+            tlog, tempsam = pic.bamToSam( tempout, opts.outdir )
+            shutil.move(tempsam,os.path.abspath(opts.output))
+        else:
+            shutil.move(tempout, os.path.abspath(opts.output))
 
     if opts.htmlout <> None or doFix: # return a pretty html page
-        pic.fixPicardOutputs(transpose=doTranspose,maxloglines=100)
+        pic.fixPicardOutputs(transpose=doTranspose,maxloglines=maxloglines)
 
 if __name__=="__main__": __main__()
+
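The new ValidateSamFile branch above feeds Picard's error report into cleanSam, which drops every read whose name appears in an error line - and, because the filter works on names, the mate of each bad read is dropped with it. The core of that pysam filter, distilled into a self-contained sketch using the same old-style pysam.Samfile API the wrapper uses (file names illustrative)::

    import pysam

    def drop_reported_reads(picard_errors, inbam, outbam):
        # error lines look like:
        # ERROR: Record 33, Read name SRR006041.1042721, Empty sequence dictionary.
        bad = set(x.split(',')[1].replace(' Read name ', '')
                  for x in picard_errors if len(x.split(',')) > 2)
        infile = pysam.Samfile(inbam, 'rb')
        outfile = pysam.Samfile(outbam, 'wb', template=infile)
        kept = dropped = 0
        for read in infile:
            if read.qname in bad:      # name match also catches the mate
                dropped += 1
            else:
                outfile.write(read)
                kept += 1
        outfile.close()
        infile.close()
        return kept, dropped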
--- a/tools/picard/rgPicardASMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardASMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -17,17 +17,26 @@
    help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
   <param name="out_prefix" value="Picard Alignment Summary Metrics" type="text"
    label="Title for the output file - use this remind you what the job was for" size="80" />
-  <conditional name="genomeSource">
-    <param name="refGenomeSource" type="select" help="If in doubt, choose built-in and read Picard/Samtools documentation"
-     label="Align to reference genome - built-in or from current history?">
-      <option value="indexed">Use a built in reference sequence</option>
-      <option value="history">Use a reference genome (fasta format) from my history</option>
+
+  <conditional name="genomeSource">
+
+    <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right"
+     label="Align to which reference genome? - default, built-in or choose from current history">
+      <option value="default" selected="true">Default - use the input data genome/build</option>
+      <option value="indexed">Select a different built-in genome</option>
+      <option value="history">Use a genome (fasta format) from my history</option>
     </param>
+    <when value="default">
+      <param name="index" type="select" label="Select a default reference genome">
+        <options from_data_table="all_fasta">
+          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator="," />
+          <validator type="no_options" message="No reference build available for selected input" />
+        </options>
+      </param>
+    </when>
     <when value="indexed">
-      <param name="index" type="select" label="Select a reference genome">
-        <options from_data_table="all_fasta">
-          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
-          <validator type="no_options" message="No reference sequences are available for the build associated with the selected input data"/>
+      <param name="index" type="select" label="Select a built-in reference genome" >
+        <options from_data_table="all_fasta"></options>
      </param>
    </when>
@@ -35,7 +44,6 @@
     <param name="ownFile" type="data" format="fasta" metadata_name="dbkey" label="Select a reference genome from history" />
    </when>
   </conditional>
-
   <param name="sorted" type="boolean" label="Assume the input file is already sorted" checked="true" truevalue="true" falsevalue="false"/>
   <param name="bisulphite" type="boolean" label="Input file contains Bisulphite sequenced reads" checked="false" falsevalue="false" truevalue="true" />
   <param name="adaptors" value="" type="text" area="true" label="Adapter sequences - one per line if multiple" size="5x120" />
@@ -191,8 +199,10 @@
 Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
 
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardFixMate.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardFixMate.xml Sat May 07 11:56:12 2011 -0400
@@ -1,5 +1,5 @@
 <tool name="Paired Read Mate Fixer:" id="rgPicFixMate" version="0.01">
-  <description>(Picard)</description>
+  <description>for paired data</description>
   <command interpreter="python">
    picard_wrapper.py -i "$input_file" -o "$out_file" --tmpdir "${__new_file_path__}" -n "$out_prefix"
    --newformat "$newformat" -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/FixMateInformation.jar" --sortorder "$sortOrder"
@@ -82,7 +82,7 @@
 
 **From the Picard documentation**
 
-.. csv-table:: GC Bias Doc
+.. csv-table:: Fixmate
    :header-rows: 1
 
@@ -106,7 +106,11 @@
 All the Picard tools are freely available and are documented
 at http://picard.sourceforge.net/command-line-overview.shtml#CollectAlignmentSu...
 
-Here, you can apply Picard tools through Galaxy which might be easier than through the native Picard command line interface.
+Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
+Code cleaned up and the ugliest hacks repaired by Raphael Lullis.
+Kelly Vincent wrote the first version of picard_wrapper.py that now incorporates that code.
+
+It takes a village of programmers to wrap a picard tool
 
 -----
@@ -116,14 +120,10 @@
 
 This Galaxy tool is a component of the rgenetics toolkit.
 
-Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
-Code cleaned up and the ugliest hacks repaired by Raphael Lullis.
-Kelly Vincent wrote the first version of picard_wrapper.py that now incorporates that code.
-
-It takes a village of programmers to wrap a picard tool
-
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardGCBiasMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardGCBiasMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -17,18 +17,24 @@
    help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
   <param name="out_prefix" value="Short Read GC Bias Metrics" type="text"
    label="Title for the output file - use this remind you what the job was for" size="80" />
-  <conditional name="genomeSource">
-    <param name="refGenomeSource" type="select" help="If in doubt, choose built-in and read Picard/Samtools documentation"
-     label="Align to reference genome - built-in or from current history?">
-      <option value="indexed">Align to input data 'dbkey' reference genome</option>
-      <option value="history">Use a genome (fasta format) from my history</option>
+  <conditional name="genomeSource">
+    <param name="refGenomeSource" type="select" help="This tool needs a reference genome. Default is usually right"
+     label="Align to which reference genome? - default, built-in or choose from current history">
+      <option value="default" selected="true">Default - use the input data genome build</option>
+      <option value="indexed">Select a different built-in genome </option>
+      <option value="history">Use a genome (fasta format) from my history </option>
     </param>
+    <when value="default">
+      <param name="index" type="select" label="Select a default reference genome">
+        <options from_data_table="all_fasta">
+          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
+          <validator type="no_options" message="No reference build available for the selected input data" />
+        </options>
+      </param>
+    </when>
     <when value="indexed">
-      <param name="index" type="select" label="Select a reference genome">
-        <options from_data_table="all_fasta">
-          <filter type="data_meta" ref="input_file" key="dbkey" column="dbkey" multiple="True" separator=","/>
-          <validator type="no_options" message="No reference sequences are available for the build associated with the selected input data"/>
-        </options>
+      <param name="index" type="select" label="Select a built-in reference genome">
+        <options from_data_table="all_fasta"/>
      </param>
    </when>
    <when value="history">
@@ -168,9 +174,10 @@
 This Galaxy tool is a component of the rgenetics toolkit.
 
 Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
-
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardHsMetrics.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardHsMetrics.xml Sat May 07 11:56:12 2011 -0400
@@ -1,10 +1,10 @@
 <tool name="Sam/bam Hybrid Selection Metrics:" id="PicardHsMetrics" version="0.01">
-  <description>(Picard)</description>
+  <description>For (eg exome) targetted data</description>
   <command interpreter="python">
-    rgPicardHsMetrics.py -i $input_file -d $html_file.files_path -o $html_file
-    -b $bait_bed -t $target_bed -n "$out_prefix" --tmpdir "${__new_file_path__}"
-    -j ${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar -y $input_file.ext
+    picard_wrapper.py -i "$input_file" -d "$html_file.files_path" -t "$html_file" --datatype "$input_file.ext"
+    --baitbed "$bait_bed" --targetbed "$target_bed" -n "$out_prefix" --tmpdir "${__new_file_path__}"
+    -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/CalculateHsMetrics.jar" -x "$maxheap"
   </command>
   <requirements><requirement type="package">picard</requirement></requirements>
@@ -13,6 +13,13 @@
   <param name="out_prefix" value="Picard HS Metrics" type="text" label="Title for the output file - to remind you what the job was for" size="80" />
   <param name="bait_bed" type="data" format="interval" label="Bait intervals: Sequences for bait in the design - ucsc BED" size="80" />
   <param name="target_bed" type="data" format="interval" label="Target intervals: Sequences for targets in the design - ucsc BED" size="80" />
+  <param name="maxheap" type="select"
+   help="If in doubt, try the default. If it fails with a complaint about java heap size, try increasing it please - larger jobs will require your own hardware."
+   label="Java heap size">
+    <option value="4G" selected = "true">4GB default </option>
+    <option value="8G" >8GB use if 4GB fails</option>
+    <option value="16G">16GB - try this if 8GB fails</option>
+  </param>
  </inputs>
  <outputs>
    <data format="html" name="html_file" label="${out_prefix}.html" />
@@ -23,6 +30,7 @@
   <param name="input_file" value="picard_input_summary_alignment_stats.sam" ftype="sam" />
   <param name="bait_bed" value="picard_input_bait.bed" />
   <param name="target_bed" value="picard_input_bait.bed" />
+  <param name="maxheap" value="8G" />
   <output name="html_file" file="picard_output_hs_transposed_summary_alignment_stats.html" ftype="html" lines_diff="212"/>
  </test>
 </tests>
@@ -140,10 +148,10 @@
 This is part of the rgenetics toolkit.
 
 Written by Ross Lazarus 2010
-
-Licensed under the LGPL
-All rgenetics artifacts are available licensed under the LGPL
+All rgenetics artefacts are available licensed under the LGPL_
 Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
 
 </help>
 </tool>

--- a/tools/picard/rgPicardInsertSize.xml Fri May 06 16:02:25 2011 -0400
+++ b/tools/picard/rgPicardInsertSize.xml Sat May 07 11:56:12 2011 -0400
@@ -1,5 +1,5 @@
 <tool name="Insertion size metrics:" id="rgPicInsertSize" version="0.01">
-  <description>(Picard)</description>
+  <description>for PAIRED data</description>
   <command interpreter="python">
    picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --taillimit "$tailLimit"
    --histwidth "$histWidth" --minpct "$minPct"
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/picard/rgPicardLibComplexity.xml Sat May 07 11:56:12 2011 -0400
@@ -0,0 +1,143 @@
+<tool name="Estimate Library Complexity:" id="rgEstLibComp" version="0.01">
+  <description>(Picard)</description>
+  <command interpreter="python">
+    picard_wrapper.py -i "$input_file" -n "$out_prefix" --tmpdir "${__new_file_path__}" --minid "$minIDbases"
+    --maxdiff "$maxDiff" --minmeanq "$minMeanQ" --readregex "$readRegex" --optdupdist "$optDupeDist"
+    -j "${GALAXY_DATA_INDEX_DIR}/shared/jars/EstimateLibraryComplexity.jar" -d "$html_file.files_path" -t "$html_file"
+  </command>
+  <inputs>
+    <param format="bam,sam" name="input_file" type="data" label="Input: sam or bam format short read data in your current history"
+     help="If the select list is empty, you need to upload or import some aligned short read data from a shared library"/>
+    <param name="out_prefix" value="Library Complexity" type="text"
+     label="Title for the output file - use this remind you what the job was for" size="80" />
+    <param name="minIDbases" value="5" type="integer" label="Minimum identical bases at starts of reads for grouping" size="5"
+     help="Total_reads / 4^max_id_bases reads will be compared at a time. Lower numbers = more accurate results and exponentially more time/ram" />
+    <param name="maxDiff" value="0.03" type="float"
+     label="Maximum difference rate for identical reads" size="5"
+     help="The maximum rate of differences between two reads to call them identical" />
+    <param name="minMeanQ" value="20" type="integer"
+     label="Minimum percentage" size="5"
+     help="The minimum mean quality of bases in a read pair. Lower average quality reads filtered out from all calculations" />
+    <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" type="text" size="120"
+     label="Regular expression that can be used to parse read names in the incoming SAM file"
+     help="Names are parsed to extract: tile/region, x coordinate and y coordinate, to estimate optical duplication rate" >
+      <sanitizer>
+        <valid initial="string.printable">
+          <remove value="'"/>
+        </valid>
+        <mapping initial="none">
+          <add source="'" target="__sq__"/>
+        </mapping>
+      </sanitizer>
+    </param>
+    <param name="optDupeDist" value="100" type="text"
+     label="The maximum offset between two duplicte clusters in order to consider them optical duplicates." size="5"
+     help="e.g. 5-10 pixels. Later Illumina software versions multiply pixel values by 10, in which case 50-100" />
+
+  </inputs>
+  <outputs>
+    <data format="html" name="html_file" label="${out_prefix}_lib_complexity.html"/>
+  </outputs>
+  <tests>
+    <test>
+      <param name="input_file" value="picard_input_tiny.sam" />
+      <param name="out_prefix" value="Library Complexity" />
+      <param name="minIDbases" value="5" />
+      <param name="maxDiff" value="0.03" />
+      <param name="minMeanQ" value="20" />
+      <param name="readRegex" value="[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*" />
+      <param name="optDupeDist" value="100" />
+      <output name="html_file" file="picard_output_estlibcomplexity_tinysam.html" ftype="html" lines_diff="25" />
+    </test>
+  </tests>
+  <help>
+
+.. class:: infomark
+
+**Purpose**
+
+ EstimateLibraryComplexity
+ Attempts to estimate library complexity from sequence alone.
+ Does so by sorting all reads by the first N bases (5 by default) of each read and then
+ comparing reads with the first N bases identical to each other for duplicates. Reads are considered to be
+ duplicates if they match each other with no gaps and an overall mismatch rate less than or equal to MAX_DIFF_RATE (0.03 by default).
+
+ Reads of poor quality are filtered out so as to provide a more accurate estimate.
+ The filtering removes reads with any no-calls in the first N bases or with a mean base quality lower than
+ MIN_MEAN_QUALITY across either the first or second read.
+
+ The algorithm attempts to detect optical duplicates separately from PCR duplicates and excludes these in the
+ calculation of library size. Also, since there is no alignment to screen out technical reads one
+ further filter is applied on the data. After examining all reads a histogram is built of
+ [#reads in duplicate set -> #of duplicate sets];
+ all bins that contain exactly one duplicate set are then removed from the histogram as outliers before library size is estimated.
+
+**Why you might want to use this tool**
+
+ This tool provides a Galaxy interface to one of the Picard tools.
+ If you need to estimate library complexity from sequences, the Picard tool may help.
+
+
+**Note on the Regular Expression**
+
+ (from the Picard docs)
+ This tool requires a valid regular expression to parse out the read names in the incoming SAM or BAM file.
+ These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size.
+ The regular expression should contain three capture groups for the three variables, in order.
+ Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*.
+
+
+-----
+
+.. class:: infomark
+
+**From the Picard documentation**
+
+.. csv-table:: Estimate complexity docs
+   :header-rows: 1
+
+   Option Description
+   "INPUT=File","One or more files to combine and estimate library complexity from. Reads can be mapped or unmapped. This option may be specified 0 or more times."
+   "OUTPUT=File","Output file to writes per-library metrics to. Required."
+   "MIN_IDENTICAL_BASES=Integer","The minimum number of bases at the starts of reads that must be identical for reads to be grouped together for duplicate detection. In effect total_reads / 4^max_id_bases reads will be compared at a time, so lower numbers will produce more accurate results but consume exponentially more memory and CPU. Default value: 5."
+   "MAX_DIFF_RATE=Double","The maximum rate of differences between two reads to call them identical. Default value: 0.03. "
+   "MIN_MEAN_QUALITY=Integer","The minimum mean quality of the bases in a read pair for the read to be analyzed. Reads with lower average quality are filtered out and not considered in any calculations. Default value: 20."
+   "READ_NAME_REGEX=String","Regular expression that can be used to parse read names in the incoming SAM file. Read names are parsed to extract three variables: tile/region, x coordinate and y coordinate. These values are used to estimate the rate of optical duplication in order to give a more accurate estimated library size. The regular expression should contain three capture groups for the three variables, in order. Default value: [a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*. This option can be set to 'null' to clear the default value."
+   "OPTICAL_DUPLICATE_PIXEL_DISTANCE=Integer","The maximum offset between two duplicte clusters in order to consider them optical duplicates. This should usually be set to some fairly small number (e.g. 5-10 pixels) unless using later versions of the Illumina pipeline that multiply pixel values by 10, in which case 50-100 is more normal. Default value: 100"
+   "CREATE_MD5_FILE=Boolean","Whether to create an MD5 digest for any BAM files created. Default value: false. This option can be set to 'null' to clear the default value. "
+
+
+-----
+
+.. class:: infomark
+
+**Attributions**
+
+Picard is supported through the SamTools project.
+This tool wraps Picard and is supported through the galaxy-bugs mailing list
+Please help us by completing the report form that appears automatically
+if a tool fails unexpectedly when you run it in Galaxy.
+
+All the Picard tools are freely available and are documented
+at http://picard.sourceforge.net/command-line-overview.shtml#CollectInsertSizeM...
+
+-----
+
+.. class:: infomark
+
+**Copyright**
+
+This Galaxy tool is a component of the rgenetics toolkit.
+
+Written by and copyright Ross Lazarus, ross.lazarus at gmail etc, September 2010
+Code cleaned up and the ugliest hacks repaired by Raphael Lullis
+
+All rgenetics artefacts are available licensed under the LGPL_
+Other dependencies are licensed at the author's discretion - please see each individual package for details
+
+ .. _LGPL: http://www.gnu.org/copyleft/lesser.html
+
+ </help>
+</tool>
+
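The duplicate-grouping step the help text above describes - bucket reads on their first N bases, then histogram the duplicate-set sizes - is easy to caricature in a few lines. A toy sketch only: real Picard also applies the quality filters, the gapless MAX_DIFF_RATE comparison and optical-duplicate detection described above, none of which appears here::

    from collections import defaultdict

    def duplicate_set_histogram(reads, n_id_bases=5):
        groups = defaultdict(list)
        for seq in reads:                 # bucket on the first N bases
            groups[seq[:n_id_bases]].append(seq)
        hist = defaultdict(int)
        for members in groups.values():
            sets = defaultdict(int)       # crude stand-in for Picard's pairwise
            for seq in members:           # MAX_DIFF_RATE comparison: exact
                sets[seq] += 1            # matches only
            for size in sets.values():
                hist[size] += 1
        return dict(hist)                 # {duplicate-set size: number of sets}

    # duplicate_set_histogram(['ACGTAA', 'ACGTAA', 'ACGTTT']) -> {1: 1, 2: 1}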
http://bitbucket.org/galaxy/galaxy-central/changeset/3cf3d54e5b28/
changeset: r5511:3cf3d54e5b28
user:      fubar
date:      2011-05-07 19:58:35
summary:   add test input for hsmetrics
affected #: 1 file (376 bytes)

--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/picard_input_bait.bed Sat May 07 13:58:35 2011 -0400
@@ -0,0 +1,8 @@
+chr1 1 300 - CCDS635.1_cds_0_0_chr1_67052401_r
+chr2 1 300 - CCDS635.1_cds_1_0_chr1_67060632_r
+chr3 1 300 - CCDS635.1_cds_2_0_chr1_67065091_r
+chr4 1 300 - CCDS635.1_cds_3_0_chr1_67066083_r
+chr5 1 300 - CCDS635.1_cds_4_0_chr1_67071856_r
+chr6 1 300 - CCDS635.1_cds_5_0_chr1_67072262_r
+chr7 1 300 - CCDS635.1_cds_6_0_chr1_67073897_r
+chr8 1 300 - CCDS635.1_cds_7_0_chr1_67075981_r

Repository URL: https://bitbucket.org/galaxy/galaxy-central/