1 new commit in galaxy-central: https://bitbucket.org/galaxy/galaxy-central/changeset/d9408bcb3cf4/ changeset: d9408bcb3cf4 user: greg date: 2011-11-10 18:09:45 summary: Eliminate the non-functional PacBio tools from the distribution and the sample tool config. affected #: 14 files diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tool_conf.xml.sample --- a/tool_conf.xml.sample +++ b/tool_conf.xml.sample @@ -477,17 +477,6 @@ <tool file="vcf_tools/filter.xml" /><tool file="vcf_tools/extract.xml" /></section> - <section name="PacBio/Illumina Assembly" id="hybrid"> - <tool file="ilmn_pacbio/quake.xml"/> - <tool file="ilmn_pacbio/quake_pe.xml"/> - <tool file="ilmn_pacbio/soap_denovo.xml"/> - <!-- - Uncomment this tool when we support the HDF5 format - <tool file="ilmn_pacbio/smrtpipe_filter.xml"/> - --> - <tool file="ilmn_pacbio/smrtpipe_hybrid.xml"/> - <tool file="ilmn_pacbio/assembly_stats.xml"/> - </section><!-- TODO: uncomment the following EMBOSS section whenever moving to test, but comment it in .sample to eliminate diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/abyss.xml --- a/tools/ilmn_pacbio/abyss.xml +++ /dev/null @@ -1,30 +0,0 @@ -<tool id="abyss" name="ABySS" version="1.0.0"> - <description>Short-read de Bruijn assembly</description> - <command interpreter="python"> - quake_wrapper.py -k $k -r $input1 -p 8 > $output1 - </command> - <inputs> - <param name="input1" format="fastq" type="data" label="Select FASTQ file to correct" /> - <param name="k" type="integer" value="16" label="Size of k-mers to correct" /> - </inputs> - <outputs> - <data format="fastq" name="output1" label="Error-corrected reads from ${on_string}" /> - </outputs> - <help> - -**What it does** - -TBD. Calls ABySS assembler - -**Parameter list** - -k - -**Output** - -Corrected reads - - </help> -</tool> - - diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/assembly_stats.py --- a/tools/ilmn_pacbio/assembly_stats.py +++ /dev/null @@ -1,83 +0,0 @@ -#!/usr/bin/env python -# -#Copyright (c) 2011, Pacific Biosciences of California, Inc. -# -#All rights reserved. -# -#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: -# * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. -# * Neither the name of Pacific Biosciences nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. -# -#THIS SOFTWARE IS PROVIDED BY PACIFIC BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -#WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS CONTRIBUTORS BE LIABLE FOR ANY -#DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -#LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# -import sys, os -from optparse import OptionParser -from galaxy import eggs -import pkg_resources -pkg_resources.require( 'bx-python' ) -from bx.seq.fasta import FastaReader - -def getStats( fastaFile, genomeLength, minContigLength ): - lengths = [] - stats = { "Num" : 0, - "Sum" : 0, - "Max" : 0, - "Avg" : 0, - "N50" : 0, - "99%" : 0 } - fasta_reader = FastaReader( open( fastaFile, 'rb' ) ) - while True: - seq = fasta_reader.next() - if not seq: - break - if seq.length < minContigLength: - continue - lengths.append( seq.length ) - if lengths: - stats[ 'Num' ] = len( lengths ) - stats[ 'Sum' ] = sum( lengths ) - stats[ 'Max' ] = max( lengths ) - stats[ 'Avg' ] = int( sum( lengths ) / float( len( lengths ) ) ) - stats[ 'N50' ] = 0 - stats[ '99%' ] = 0 - if genomeLength == 0: - genomeLength = sum( lengths ) - lengths.sort() - lengths.reverse() - lenSum = 0 - stats[ "99%" ] = len( lengths ) - for idx, length in enumerate( lengths ): - lenSum += length - if ( lenSum > genomeLength / 2 ): - stats[ "N50" ] = length - break - lenSum = 0 - for idx, length in enumerate( lengths ): - lenSum += length - if lenSum > genomeLength * 0.99: - stats[ "99%" ] = idx + 1 - break - return stats - -def __main__(): - #Parse Command Line - usage = 'Usage: %prog input output --minContigLength' - parser = OptionParser( usage=usage ) - parser.add_option( "--minContigLength", dest="minContigLength", help="Minimum length of contigs to analyze" ) - parser.add_option( "--genomeLength", dest="genomeLength", help="Length of genome for which to calculate N50s" ) - parser.set_defaults( minContigLength=0, genomeLength=0 ) - options, args = parser.parse_args() - input_fasta_file = args[ 0 ] - output_tabular_file = args[ 1 ] - statKeys = "Num Sum Max Avg N50 99%".split( " " ) - stats = getStats( input_fasta_file, int( options.genomeLength ), int( options.minContigLength ) ) - fout = open( output_tabular_file, "w" ) - fout.write( "%s\n" % "\t".join( map( lambda key: str( stats[ key ] ), statKeys ) ) ) - fout.close() - -if __name__=="__main__": __main__() diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/assembly_stats.xml --- a/tools/ilmn_pacbio/assembly_stats.xml +++ /dev/null @@ -1,54 +0,0 @@ -<tool id="assembly_stats" name="Assembly Statistics" version="1.0.0"> - <description>Calculate common measures of assembly quality</description> - <command interpreter="python"> - assembly_stats.py $input1 $output1 --minContigLength=${minLength} - </command> - <inputs> - <param name="input1" format="fasta" type="data" label="Select FASTA file containing contigs"/> - <param name="minLength" type="integer" value="0" label="Minimum length of contigs to consider"/> - </inputs> - <outputs> - <data name="output1" format="tabular" label="Assembly statistics for ${on_string}"/> - </outputs> - <tests> - <test> - <param name="input1" value="3.fasta" ftype="fasta"/> - <param name="minLength" value="100"/> - <output name="output1" ftype="tabular" file="assembly_stats.tabular" /> - </test> - </tests> - <help> - -**What it does** - -Reports standard measures of *de novo* assembly quality such as number of contigs, sum of contigs, mean contig length, and N50. - -**Parameter list** - -Minimum length - Only include contigs of this size or greater for calculating statistics. - -**Output** - -Num contigs - Total number of contigs in the assembly - -Sum of contig lengths - Total sum of contig lengths - -Maximum contig length - Maximum of the contig lengths - -Mean contig length - Average contig length - -N50 - Contig length at which 50% of the assembly is contained in contigs of this size or greater. - -99% - Number of contigs accounting for 99% of the observed assembly. - - </help> -</tool> - - diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/cov_model.py --- a/tools/ilmn_pacbio/cov_model.py +++ /dev/null @@ -1,238 +0,0 @@ -#!/usr/bin/env python -from optparse import OptionParser, SUPPRESS_HELP -import os, random, quake - -############################################################ -# cov_model.py -# -# Given a file of kmer counts, reports the cutoff to use -# to separate trusted/untrusted kmers. -############################################################ - -############################################################ -# main -############################################################ -def main(): - usage = 'usage: %prog [options] <counts file>' - parser = OptionParser(usage) - parser.add_option('--int', dest='count_kmers', action='store_true', default=False, help='Kmers were counted as integers w/o the use of quality values [default: %default]') - parser.add_option('--ratio', dest='ratio', type='int', default=200, help='Likelihood ratio to set trusted/untrusted cutoff [default: %default]') - parser.add_option('--no_sample', dest='no_sample', action='store_true', default=False, help='Do not sample kmer coverages into kmers.txt because its already done [default: %default]') - # help='Model kmer coverage as a function of GC content of kmers [default: %default]' - parser.add_option('--gc', dest='model_gc', action='store_true', default=False, help=SUPPRESS_HELP) - (options, args) = parser.parse_args() - - if len(args) != 1: - parser.error('Must provide kmers counts file') - else: - ctsf = args[0] - - if options.count_kmers: - model_cutoff(ctsf, options.ratio) - print 'Cutoff: %s' % open('cutoff.txt').readline().rstrip() - - else: - if options.model_gc: - model_q_gc_cutoffs(ctsf, 25000, options.ratio) - else: - model_q_cutoff(ctsf, 50000, options.ratio, options.no_sample) - print 'Cutoff: %s' % open('cutoff.txt').readline().rstrip() - - -############################################################ -# model_cutoff -# -# Make a histogram of kmers to give to R to learn the cutoff -############################################################ -def model_cutoff(ctsf, ratio): - # make kmer histogram - cov_max = 0 - for line in open(ctsf): - cov = int(line.split()[1]) - if cov > cov_max: - cov_max = cov - - kmer_hist = [0]*cov_max - for line in open(ctsf): - cov = int(line.split()[1]) - kmer_hist[cov-1] += 1 - - cov_out = open('kmers.hist', 'w') - for cov in range(0,cov_max): - if kmer_hist[cov]: - print >> cov_out, '%d\t%d' % (cov+1,kmer_hist[cov]) - cov_out.close() - - os.system('R --slave --args %d < %s/cov_model.r 2> r.log' % (ratio,quake.quake_dir)) - - -############################################################ -# model_q_cutoff -# -# Sample kmers to give to R to learn the cutoff -# 'div100' is necessary when the number of kmers is too -# large for random.sample, so we only consider every 100th -# kmer. -############################################################ -def model_q_cutoff(ctsf, sample, ratio, no_sample=False): - if not no_sample: - # count number of kmer coverages - num_covs = 0 - for line in open(ctsf): - num_covs += 1 - - # choose random kmer coverages - div100 = False - if sample >= num_covs: - rand_covs = range(num_covs) - else: - if num_covs > 1000000000: - div100 = True - rand_covs = random.sample(xrange(num_covs/100), sample) - else: - rand_covs = random.sample(xrange(num_covs), sample) - rand_covs.sort() - - # print to file - out = open('kmers.txt', 'w') - kmer_i = 0 - rand_i = 0 - for line in open(ctsf): - if div100: - if kmer_i % 100 == 0 and kmer_i/100 == rand_covs[rand_i]: - print >> out, line.split()[1] - rand_i += 1 - if rand_i >= sample: - break - else: - if kmer_i == rand_covs[rand_i]: - print >> out, line.split()[1] - rand_i += 1 - if rand_i >= sample: - break - kmer_i += 1 - out.close() - - os.system('R --slave --args %d < %s/cov_model_qmer.r 2> r.log' % (ratio,quake.quake_dir)) - - -############################################################ -# model_q_gc_cutoffs -# -# Sample kmers to give to R to learn the cutoff for each -# GC value -############################################################ -def model_q_gc_cutoffs(ctsf, sample, ratio): - # count number of kmer coverages at each at - k = len(open(ctsf).readline().split()[0]) - num_covs_at = [0]*(k+1) - for line in open(ctsf): - kmer = line.split()[0] - num_covs_at[count_at(kmer)] += 1 - - # for each AT bin - at_cutoffs = [] - for at in range(1,k): - # sample covs - if sample >= num_covs_at[at]: - rand_covs = range(num_covs_at[at]) - else: - rand_covs = random.sample(xrange(num_covs_at[at]), sample) - rand_covs.sort() - - # print to file - out = open('kmers.txt', 'w') - kmer_i = 0 - rand_i = 0 - for line in open(ctsf): - (kmer,cov) = line.split() - if count_at(kmer) == at: - if kmer_i == rand_covs[rand_i]: - print >> out, cov - rand_i += 1 - if rand_i >= sample: - break - kmer_i += 1 - out.close() - - os.system('R --slave --args %d < %s/cov_model_qmer.r 2> r%d.log' % (ratio,quake.quake_dir,at)) - - at_cutoffs.append( open('cutoff.txt').readline().rstrip() ) - if at in [1,k-1]: # setting extremes to next closests - at_cutoffs.append( open('cutoff.txt').readline().rstrip() ) - - os.system('mv kmers.txt kmers.at%d.txt' % at) - os.system('mv cutoff.txt cutoff.at%d.txt' % at) - - out = open('cutoffs.gc.txt','w') - print >> out, '\n'.join(at_cutoffs) - out.close() - - -############################################################ -# model_q_gc_cutoffs_bigmem -# -# Sample kmers to give to R to learn the cutoff for each -# GC value -############################################################ -def model_q_gc_cutoffs_bigmem(ctsf, sample, ratio): - # input coverages - k = 0 - for line in open(ctsf): - (kmer,cov) = line.split() - if k == 0: - k = len(kmer) - at_covs = ['']*(k+1) - else: - at = count_at(kmer) - if at_covs[at]: - at_covs[at].append(cov) - else: - at_covs[at] = [cov] - - for at in range(1,k): - print '%d %d' % (at,len(at_covs[at])) - - # for each AT bin - at_cutoffs = [] - for at in range(1,k): - # sample covs - if sample >= len(at_covs[at]): - rand_covs = at_covs[at] - else: - rand_covs = random.sample(at_covs[at], sample) - - # print to file - out = open('kmers.txt', 'w') - for rc in rand_covs: - print >> out, rc - out.close() - - os.system('R --slave --args %d < %s/cov_model_qmer.r 2> r%d.log' % (ratio,quake.quake_dir,at)) - - at_cutoffs.append( open('cutoff.txt').readline().rstrip() ) - if at in [1,k-1]: # setting extremes to next closests - at_cutoffs.append( open('cutoff.txt').readline().rstrip() ) - - os.system('mv kmers.txt kmers.at%d.txt' % at) - os.system('mv cutoff.txt cutoff.at%d.txt' % at) - - out = open('cutoffs.gc.txt','w') - print >> out, '\n'.join(at_cutoffs) - out.close() - - -############################################################ -# count_at -# -# Count A's and T's in the given sequence -############################################################ -def count_at(seq): - return len([nt for nt in seq if nt in ['A','T']]) - - -############################################################ -# __main__ -############################################################ -if __name__ == '__main__': - main() diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/quake.py --- a/tools/ilmn_pacbio/quake.py +++ /dev/null @@ -1,136 +0,0 @@ -#!/usr/bin/env python -from optparse import OptionParser, SUPPRESS_HELP -import os, random, sys -import cov_model - -############################################################ -# quake.py -# -# Launch pipeline to correct errors in Illumina sequencing -# reads. -############################################################ - -#r_dir = '/nfshomes/dakelley/research/error_correction/bin' -quake_dir = os.path.abspath(os.path.dirname(sys.argv[0])) - -############################################################ -# main -############################################################ -def main(): - usage = 'usage: %prog [options]' - parser = OptionParser(usage) - parser.add_option('-r', dest='readsf', help='Fastq file of reads') - parser.add_option('-f', dest='reads_listf', help='File containing fastq file names, one per line or two per line for paired end reads.') - parser.add_option('-k', dest='k', type='int', help='Size of k-mers to correct') - parser.add_option('-p', dest='proc', type='int', default=4, help='Number of processes [default: %default]') - parser.add_option('-q', dest='quality_scale', type='int', default=-1, help='Quality value ascii scale, generally 64 or 33. If not specified, it will guess.') - parser.add_option('--no_count', dest='no_count', action='store_true', default=False, help='Kmers are already counted and in expected file [reads file].qcts or [reads file].cts [default: %default]') - parser.add_option('--no_cut', dest='no_cut', action='store_true', default=False, help='Coverage model is optimized and cutoff was printed to expected file cutoff.txt [default: %default]') - parser.add_option('--int', dest='counted_kmers', action='store_true', default=False, help='Kmers were counted as integers w/o the use of quality values [default: %default]') - parser.add_option('--ratio', dest='ratio', type='int', default=200, help='Likelihood ratio to set trusted/untrusted cutoff. Generally set between 10-1000 with lower numbers suggesting a lower threshold. [default: %default]') - # help='Model kmer coverage as a function of GC content of kmers [default: %default]' - parser.add_option('--gc', dest='model_gc', action='store_true', default=False, help=SUPPRESS_HELP) - parser.add_option('--headers', action='store_true', default=False, help='Output original read headers (i.e. pass --headers to correct)' ) - (options, args) = parser.parse_args() - - if not options.readsf and not options.reads_listf: - parser.error('Must provide fastq file of reads with -r or file with list of fastq files of reads with -f') - if not options.k: - parser.error('Must provide k-mer size with -k') - if options.quality_scale == -1: - options.quality_scale = guess_quality_scale(options.readsf, options.reads_listf) - - if options.counted_kmers: - cts_suf = 'cts' - else: - cts_suf = 'qcts' - if options.readsf: - ctsf = '%s.%s' % (os.path.splitext( os.path.split(options.readsf)[1] )[0], cts_suf) - reads_str = '-r %s' % options.readsf - else: - ctsf = '%s.%s' % (os.path.split(options.reads_listf)[1], cts_suf) - reads_str = '-f %s' % options.reads_listf - - if not options.no_count and not options.no_cut: - count_kmers(options.readsf, options.reads_listf, options.k, ctsf, options.quality_scale) - - if not options.no_cut: - # model coverage - if options.counted_kmers: - cov_model.model_cutoff(ctsf, options.ratio) - else: - if options.model_gc: - cov_model.model_q_gc_cutoffs(ctsf, 10000, options.ratio) - else: - cov_model.model_q_cutoff(ctsf, 25000, options.ratio) - - - if options.model_gc: - # run correct C++ code - os.system('%s/correct %s -k %d -m %s -a cutoffs.gc.txt -p %d -q %d' % (quake_dir,reads_str, options.k, ctsf, options.proc, options.quality_scale)) - - else: - cutoff = open('cutoff.txt').readline().rstrip() - - # run correct C++ code - headers = '--headers' if options.headers else '' - os.system('%s/correct %s %s -k %d -m %s -c %s -p %d -q %d' % (quake_dir,headers, reads_str, options.k, ctsf, cutoff, options.proc, options.quality_scale)) - - -################################################################################ -# guess_quality_scale -# Guess at ascii scale of quality values by examining -# a bunch of reads and looking for quality values < 64, -# in which case we set it to 33. -################################################################################ -def guess_quality_scale(readsf, reads_listf): - reads_to_check = 1000 - if not readsf: - readsf = open(reads_listf).readline().split()[0] - - fqf = open(readsf) - reads_checked = 0 - header = fqf.readline() - while header and reads_checked < reads_to_check: - seq = fqf.readline() - mid = fqf.readline() - qual = fqf.readline().rstrip() - reads_checked += 1 - for q in qual: - if ord(q) < 64: - print 'Guessing quality values are on ascii 33 scale' - return 33 - header = fqf.readline() - - print 'Guessing quality values are on ascii 64 scale' - return 64 - - - -############################################################ -# count_kmers -# -# Count kmers in the reads file using AMOS count-kmers or -# count-qmers -############################################################ -def count_kmers(readsf, reads_listf, k, ctsf, quality_scale): - # find files - fq_files = [] - if readsf: - fq_files.append(readsf) - else: - for line in open(reads_listf): - for fqf in line.split(): - fq_files.append(fqf) - - if ctsf[-4:] == 'qcts': - os.system('cat %s | %s/count-qmers -k %d -q %d > %s' % (' '.join(fq_files), quake_dir, k, quality_scale, ctsf)) - else: - os.system('cat %s | %s/count-kmers -k %d > %s' % (' '.join(fq_files), quake_dir, k, ctsf)) - - -############################################################ -# __main__ -############################################################ -if __name__ == '__main__': - main() diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/quake.xml --- a/tools/ilmn_pacbio/quake.xml +++ /dev/null @@ -1,44 +0,0 @@ -<tool id="quake" name="Quake" version="1.0.0"> - <description>Quality-aware error correction</description> - <command interpreter="python"> - quake_wrapper.py --default_cutoff=10 --headers -k $k -f $fofnfile -p 12 > $output1 - </command> - <inputs> - <param name="input1" format="fastq" type="data" label="Select FASTQ file to correct" /> - <param name="k" type="integer" value="16" label="Size of k-mers to correct" /> - </inputs> - <configfiles> - <configfile name="fofnfile"> -${input1.file_name} - </configfile> - </configfiles> - <outputs> - <data format="fastq" name="output1" label="Error-corrected reads from ${on_string}" /> - </outputs> - <help> - -**What it does** - -Applies the Quake_ algorithm for quality-aware correction of -substitution error in short reads. - -Kelley DR, Schatz MC, Salzberg SL. -"Quake: quality-aware detection and correction of sequencing errors." -*Genome Biol.* 2010;11(11):R116. - -.. _Quake: http://www.cbcb.umd.edu/software/quake - -**Parameter list** - -k - k-mer size for detecting spurious k-mers versus true k-mers from - the genome. Recommendations for choosing a value of k can be found - here_. - -.. _here: http://www.cbcb.umd.edu/software/quake/faq.html - -**Output** - -A FASTQ file of corrected and trimmed reads. - </help> -</tool> diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/quake_pe.xml --- a/tools/ilmn_pacbio/quake_pe.xml +++ /dev/null @@ -1,53 +0,0 @@ -<tool id="quake_pe" name="Quake PE" version="1.0.0"> - <description>Quality-aware error correction for paired-end reads</description> - <command interpreter="python"> - quake_wrapper.py --default_cutoff=$cutoff --headers -k $k -f $fofnfile -p 12 --output=$output1,$output2 - </command> - <inputs> - <param name="input1" format="fastq" type="data" label="FASTQ file for forward reads" /> - <param name="input2" format="fastq" type="data" label="FASTQ file for reverse reads" /> - <param name="k" type="integer" value="16" label="Size of k-mers to correct" /> - <param name="cutoff" type="integer" value="0" label="Default coverage cutoff if estimation fails"/> - </inputs> - <configfiles> - <configfile name="fofnfile">${input1.file_name} ${input2.file_name} - </configfile> - </configfiles> - <outputs> - <data format="fastq" name="output1" label="Error-corrected forward reads from ${on_string}" /> - <data format="fastq" name="output2" label="Error-corrected reverse reads from ${on_string}" /> - </outputs> - <help> - -**What it does** - -Applies the Quake_ algorithm for quality-aware correction of -substitution error in short reads. This form of the tool is customized -for correcting paired-end reads. - -Kelley DR, Schatz MC, Salzberg SL. -"Quake: quality-aware detection and correction of sequencing errors." -*Genome Biol.* 2010;11(11):R116. - -.. _Quake: http://www.cbcb.umd.edu/software/quake - -**Parameter list** - -K-mer size - k-mer size for detecting spurious k-mers versus true k-mers from - the genome. Recommendations for choosing a value of k can be found - here_. - -Default coverage cutoff - If the appropriate coverage cutoff can not be found then Quake can be - forced to proceed anyways with the supplied cutoff. In this case, - the optimal cutoff can be estimated by examining - the k-mer coverage histogram by eye. - -.. _here: http://www.cbcb.umd.edu/software/quake/faq.html - -**Output** - -A FASTQ file of corrected and trimmed reads. - </help> -</tool> diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/quake_wrapper.py --- a/tools/ilmn_pacbio/quake_wrapper.py +++ /dev/null @@ -1,132 +0,0 @@ -#!/usr/bin/python -# -# Copyright (c) 2011, Pacific Biosciences of California, Inc. -# -# All rights reserved. -# -#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: -# * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. -# * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. -# * Neither the name of Pacific Biosciences nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. -# -#THIS SOFTWARE IS PROVIDED BY PACIFIC BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED -#WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS CONTRIBUTORS BE LIABLE FOR ANY -#DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; -#LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT -#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. -# -import sys -import os -import subprocess - -QUAKE_EXE = os.path.join( os.path.dirname(os.path.abspath(sys.argv[0])), 'quake.py' ) -cmdLine = sys.argv -cmdLine.pop(0) - -# -# horribly not robust, but it was a pain to rewrite everything with -# optparse -# -j = -1 -cut = 0 -for i,arg in enumerate(cmdLine): - if '--default_cutoff' in arg: - j = i - cut = int(arg.split('=')[1]) -if j>=0: - cmdLine = cmdLine[:j] + cmdLine[j+1:] - -j = -1 -output='' -for i,arg in enumerate(cmdLine): - if '--output' in arg: - j = i - output = arg.split('=')[1] -if j>=0: - cmdLine = cmdLine[:j] + cmdLine[j+1:] - -def backticks( cmd, merge_stderr=True ): - """ - Simulates the perl backticks (``) command with error-handling support - Returns ( command output as sequence of strings, error code, error message ) - """ - if merge_stderr: - _stderr = subprocess.STDOUT - else: - _stderr = subprocess.PIPE - - p = subprocess.Popen( cmd, shell=True, stdin=subprocess.PIPE, - stdout=subprocess.PIPE, stderr=_stderr, - close_fds=True ) - - out = [ l[:-1] for l in p.stdout.readlines() ] - - p.stdout.close() - if not merge_stderr: - p.stderr.close() - - # need to allow process to terminate - p.wait() - - errCode = p.returncode and p.returncode or 0 - if p.returncode>0: - errorMessage = os.linesep.join(out) - output = [] - else: - errorMessage = '' - output = out - - return output, errCode, errorMessage - -def to_stdout(): - def toCorFastq(f): - stem, ext = os.path.splitext( os.path.basename(f) ) - dir = os.path.dirname(f) - corFastq = os.path.join(dir,'%s.cor%s' % (stem,ext) ) - if not os.path.exists(corFastq): - print >>sys.stderr, "Can't find path %s" % corFastq - sys.exit(1) - return corFastq - if '-r' in cmdLine: - fastqFile = cmdLine[ cmdLine.index('-r')+1 ] - corFastq = toCorFastq(fastqFile) - infile = open( corFastq, 'r' ) - for line in infile: - sys.stdout.write( line ) - infile.close() - else: - fofnFile = cmdLine[ cmdLine.index('-f')+1 ] - infile = open(fofnFile,'r') - for line in infile: - line = line.strip() - if len(line)>0: - fastqFiles = line.split() - break - infile.close() - outs = output.split(',') - for o,f in zip(outs,fastqFiles): - cf = toCorFastq(f) - os.system( 'cp %s %s' % ( cf, o ) ) - -def run(): - cmd = '%s %s' % ( QUAKE_EXE, " ".join(cmdLine) ) - output, errCode, errMsg = backticks( cmd ) - - if errCode==0: - to_stdout() - else: - # if Quake exits with an error in cutoff determination we - # can force correction if requested - if 'cutoff.txt' in errMsg and cut>0: - outfile = open( 'cutoff.txt', 'w' ) - print >>outfile, str(cut) - outfile.close() - cmd = '%s --no_count --no_cut %s' % ( QUAKE_EXE, " ".join(cmdLine) ) - output, errCode, errMsg = backticks( cmd ) - if errCode==0: - to_stdout() - else: - print >>sys.stderr, errMsg - sys.exit(1) - -if __name__=='__main__': run() diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/smrtpipe.py --- a/tools/ilmn_pacbio/smrtpipe.py +++ /dev/null @@ -1,5 +0,0 @@ -#!/usr/bin/env python -# EASY-INSTALL-SCRIPT: 'pbpy==0.1','smrtpipe.py' -__requires__ = 'pbpy==0.1' -import pkg_resources -pkg_resources.run_script('pbpy==0.1', 'smrtpipe.py') diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/smrtpipe_filter.xml --- a/tools/ilmn_pacbio/smrtpipe_filter.xml +++ /dev/null @@ -1,67 +0,0 @@ -<tool id="smrtpipe_filter" name="SMRTpipe Filter" version="1.0.0"> - <description>Produce filtered reads from a set of PacBio primary analysis outputs.</description> - <command interpreter="python"> - smrtpipe_galaxy.py --output=data/filtered_subreads.fasta --galaxy_output=${outfile} ${iniFile} - </command> - <inputs> - <conditional name="source"> - <param name="input_source" type="select" label="Choose the source for the analysis inputs"> - <option value="path">Path to fofn or multiple bas.h5 paths</option> - <option value="history">History</option> - </param> - <when value="path"> - <repeat name="inputFiles" title="Input files"> - <param name="path" type="text" label="File path" size="75"/> - </repeat> - </when> - <when value="history"> - <param name="input1" type="data" format="tabular" label="File containing input paths" /> - </when> - </conditional> - <param name="minimum_readlength" type="integer" value="50" label="Minimum raw readlength" /> - <param name="minimum_readscore" type="float" value="0.75" label="Minimum read quality" /> - </inputs> - <configfiles> - <configfile name="iniFile"> -[input] -#if $source.input_source=="history": -#for $l in open($source.input1.file_name,'r'): -$l -#end for -#else -#for $p in $source.inputFiles -${p.path} -#end for -#end if - -[S_Filter] -filters=MinRL=${minimum_readlength},MinReadScore=${minimum_readscore} - </configfile> - </configfiles> - <outputs> - <data name="outfile" format="fasta" label="Filtered subreads" /> - </outputs> - <help> - -**What it does** - -Filters PacBio bas.h5 files and produces a FASTA file of filtered subreads. - -In PacBio SMRT sequencing, the template format is a SMRTbell: a circular -molecule with adapters at two locations in the circle. The subreads are the -portions of the read between adapters. - -**Parameter list** - -Minimum readlength - Only keep reads from ZMWs that produced this many bases or more. - -Minimum read quality - Only keep reads with overall quality scores of this value or more. The read quality score is a *de novo* prediction of the accuracy of the read. - -**Output** - -FASTA file of filtered reads. - - </help> -</tool> diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/smrtpipe_galaxy.py --- a/tools/ilmn_pacbio/smrtpipe_galaxy.py +++ /dev/null @@ -1,265 +0,0 @@ -#!/usr/bin/python -import sys -import os -import subprocess -import optparse as op -import xml.etree.cElementTree as et - -TRACE=False -# -# Turn on tracing to dump out __input__.xml and __settings__.xml somewhere -# -#TRACE=True -#TRACE_PATH='/home/UNIXHOME/jsorenson' - -class SmrtpipeGalaxy: - """Wrapper for running smrtpipe under galaxy""" - def __init__( self, argv ): - self.__parseOptions( argv ) - - def __parseOptions( self, argv ): - usage = 'Usage: %prog [--help] [options] smrtpipe.ini' - parser = op.OptionParser( usage=usage, description=SmrtpipeGalaxy.__doc__ ) - parser.add_option( "--output", - help="Designate a file generated by smrtpipe as the expected output for galaxy" ) - parser.add_option( "--nproc", type="int", - help="Number of processes to use (-D NPROC)" ) - parser.add_option( "--galaxy_output", - help="File name provided by galaxy where output should be placed" ) - parser.add_option( "--dry_run", action="store_true", - help="Create auxiliary XML files and exit" ) - parser.add_option( "--dat_extension", - help="Soft link .dat files to have this extension (some pipelines require certain extensions)" ) - - parser.set_defaults( output=None, dry_run=False, galaxy_output=None, - dat_extension=None, nproc=0 ) - self.options, self.args = parser.parse_args( argv ) - - if len(self.args)!=2: - parser.error( 'Expected 1 argument' ) - - self.configFile = self.args[1] - - def __parseConfig( self ): - infile = open( self.configFile, 'r' ) - section = None - sections = [] - for line in infile: - l = line.strip() - if len(l)==0 or line.startswith('#'): - continue - if l.startswith('[') and l.endswith(']'): - section = section_factory( l[1:-1] ) - sections.append(section) - continue - if section is None: - continue - if '=' in l: - section.addParameterLine(l) - else: - section.addLine(l) - infile.close() - return sections - - def transferOutput( self ): - if not self.options.output or not self.options.galaxy_output: - return True, '' - if not os.path.exists(self.options.output): - return False, "Can't find file %s (job error?)" % self.options.output - os.system( 'cp %s %s' % (self.options.output, self.options.galaxy_output )) - return True, '' - - def run( self ): - if not os.path.exists( self.configFile ): - print >>sys.stderr, "Can't find config file %s" % self.configFile - return 1 - - sections = self.__parseConfig() - - if len(sections)==0: - print >>sys.stderr, "No sections found in %s" % self.configFile - return 1 - if sections[0].name != 'input': - print >>sys.stderr, "No [input] section found in %s" % self.configFile - return 1 - - INPUT_FILE = '__input__.xml' - SETTINGS_FILE = '__settings__.xml' - - sections[0].softLinkDats( self.options.dat_extension ) - inputXml = sections[0].makeXmlElement() - write_xml_to_file( INPUT_FILE, inputXml ) - if TRACE: - write_xml_to_file( os.path.join(TRACE_PATH,INPUT_FILE), inputXml ) - - settings = et.Element( 'smrtpipeSettings' ) - for s in sections[1:]: - s.makeXmlElement( settings ) - - write_xml_to_file( SETTINGS_FILE, settings ) - if TRACE: - write_xml_to_file( os.path.join(TRACE_PATH,SETTINGS_FILE), settings ) - - nproc = '-D NPROC=%d' % self.options.nproc if self.options.nproc>0 else '' - cmd = 'smrtpipe.py %s --params=%s xml:%s > smrtpipe.err 2>1' % \ - ( nproc, SETTINGS_FILE, INPUT_FILE ) - - if self.options.dry_run: - print 'Command to run:' - print cmd - return 0 - - out, errCode, errMsg = backticks( cmd ) - if errCode!=0: - print >>sys.stderr, "error while running: %s" % cmd - print >>sys.stderr, errMsg - if os.path.exists('log/smrtpipe.log'): - print >>sys.stderr, 'Log:' - infile = open('log/smrtpipe.log','r') - for line in infile: sys.stderr.write(line) - infile.close() - return errCode - - success, errMsg = self.transferOutput() - if not success: - print >>sys.stderr, errMsg - return 1 - - return 0 - -def write_xml_to_file( fileName, root ): - outfile = open( fileName, 'w' ) - outfile.write( '<?xml version="1.0"?>\n' ) - outfile.write( et.tostring(root) + '\n' ) - outfile.close() - -def section_factory( name ): - if name=='input': - return InputSection(name) - else: - return Section(name) - -class Section: - def __init__( self, name ): - self._name = name - self._lines = [] - self._vars = {} - - @property - def name(self): - return self._name - - def addLine( self, line ): - self._lines.append(line) - - def addParameterLine( self, line ): - self.addLine(line) - i = line.find( '=' ) - key = line[:i].strip() - value = line[i+1:].strip() - self._vars[key] = value - - def makeXmlElement( self, settings ): - if self._name=='global': - root = et.SubElement( settings, "protocol", {'name':'generic'} ) - else: - root = et.SubElement( settings, "module", {'name':self._name} ) - for k,v in self._vars.iteritems(): - param = et.SubElement( root, 'param', {'name':k} ) - val = et.SubElement( param, 'value' ) - val.text = v - return None - - def __str__( self ): - "for debugging" - buffer = [ 'S { name=' ] - buffer.append(self._name) - buffer.append('; lines=%s' % ','.join(self._lines) ) - for k,v in self._vars.iteritems(): - buffer.append('; %s=%s' % (k,v) ) - buffer.append(' }') - return ''.join(buffer) - -class InputSection( Section ): - def __init__( self, name ): - Section.__init__(self,name) - - def softLinkDats( self, newExtension ): - if not newExtension: - return - newLines = [] - for l in self._lines: - if ':' in l: - protocol = l[:l.find(':')+1] - file = l[l.find(':')+1:] - else: - protocol = '' - file = l - if os.path.exists(file) and file.endswith('.dat'): - newFile = '%s.%s' % ( file, newExtension ) - if not os.path.exists(newFile): - os.system( 'ln -s %s %s' % ( file, newFile ) ) - newLines.append(protocol+newFile) - else: - newLines.append(l) - self._lines = newLines - - def makeXmlElement( self, parent=None ): - root = et.Element( "pacbioAnalysisInputs" ) - data = et.SubElement( root, 'dataReferences' ) - iRef = 0 - for l in self._lines: - def add(x,iRef): - if len(x)==0: return iRef - node = et.SubElement( data, 'url' ) - if ':' in x: - node.attrib[ 'ref' ] = x - else: - node.attrib[ 'ref' ] = 'run:0000000-%04d' % iRef - node2 = et.SubElement( node, 'location' ) - node2.text = x - return iRef+1 - if l.endswith('fofn') and os.path.exists(l): - infile = open(l,'r') - for j,line in enumerate(infile): iRef=add(line.strip(),iRef) - infile.close() - else: - iRef=add(l,iRef) - return root - -def backticks( cmd, merge_stderr=True ): - """ - Simulates the perl backticks (``) command with error-handling support - Returns ( command output as sequence of strings, error code, error message ) - """ - if merge_stderr: - _stderr = subprocess.STDOUT - else: - _stderr = subprocess.PIPE - - p = subprocess.Popen( cmd, shell=True, stdin=subprocess.PIPE, - stdout=subprocess.PIPE, stderr=_stderr, - close_fds=True ) - - out = [ l[:-1] for l in p.stdout.readlines() ] - - p.stdout.close() - if not merge_stderr: - p.stderr.close() - - # need to allow process to terminate - p.wait() - - errCode = p.returncode and p.returncode or 0 - if p.returncode>0: - errorMessage = os.linesep.join(out) - output = [] - else: - errorMessage = '' - output = out - - return output, errCode, errorMessage - -if __name__=='__main__': - app = SmrtpipeGalaxy( sys.argv ) - sys.exit( app.run() ) diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/smrtpipe_hybrid.xml --- a/tools/ilmn_pacbio/smrtpipe_hybrid.xml +++ /dev/null @@ -1,59 +0,0 @@ -<tool id="smrtpipe_hybrid" name="AHA" version="1.0.0"> - <description>Assemble contigs from a set of contigs and PacBio reads.</description> - <command interpreter="python"> - smrtpipe_galaxy.py --nproc=24 --dat_extension=fasta --output=data/scaffold.fasta --galaxy_output=${outfile} ${iniFile} - </command> - <!-- - <command>cp ${iniFile} ${outfile}</command> - --> - <inputs> - <param name="contigs" format="fasta" type="data" label="Starting Contigs"/> - <param name="reads" format="fasta" type="data" label="PacBio Reads"/> - <param name="schedule" type="text" value="6,3,75;6,3,75;5,3,75;5,3,75;6,2,75;6,2,75;5,2,75;5,2,75" label="Parameter Schedule" size="60"/> - </inputs> - <configfiles> - <configfile name="iniFile"> -[input] -assembled_contigs:${contigs} -file:${reads} - -[HybridAssembly] -instrumentModel=RS -cleanup=False -untangler=pacbio -#set $schedule2 = $schedule.replace('X',';') -paramSchedule=${schedule2} -dontFillin=False -longReadsAsStrobe=True -exactQueryIds=True -rm4Opts=-minMatch 7 -minFrac 0.1 -minPctIdentity 65 -bestn 10 -noSplitSubreads -numberProcesses=16 -cluster=False -minRepeatLength=100000 - </configfile> - </configfiles> - <outputs> - <data name="outfile" format="fasta" label="Hybrid assembly contigs from ${on_string}"/> - </outputs> - <help> - -**What it does** - -The AHA assembly algorithm is an AMOS_-based pipeline -for finishing bacterial-sized -genomes using draft contigs and PacBio reads. - -.. _AMOS: http://sourceforge.net/apps/mediawiki/amos - -**Parameter list** - -Parameter schedule - The parameter schedule is a semi-colon delimited list of triples. Each triple represents an iteration of hybrid assembly (alignment/scaffolding/gap-filling). The three paremeters for each iteration are the Z-score, number of reads required to define a link, and the minimum length of subreads used in links. - -**Output** - -FASTA file containing scaffolded and gap-filled contigs resulting from the -hybrid assembly. - - </help> -</tool> diff -r 3565ae97160f4a70b3910b54bcd98c0f4c7d1796 -r d9408bcb3cf4bcb81e9dac4ddeda74cbc164e123 tools/ilmn_pacbio/soap_denovo.xml --- a/tools/ilmn_pacbio/soap_denovo.xml +++ /dev/null @@ -1,73 +0,0 @@ -<tool id="soap_denovo" name="SOAPdenovo" version="1.0.0"> - <description>Short-read de novo assembly</description> - <!-- - # SOAPdenovo-127mer all -s ${soap_config} -o assembly -K ${k} -p 8 -d -D - # cat ${soap_config} > ${output1} - # cp ${soap_config} ${output1} && - --> - <command> - SOAPdenovo-127mer all -s ${soap_config} -o assembly -K ${k} -p 24 -d -D -R - </command> - <inputs> - <conditional name="inputs"> - <param name="read_type" type="select" label="Illumina read type"> - <option value="single">Single fragment</option> - <option value="paired">Paired-end</option> - </param> - <when value="single"> - <param name="input1" format="fastq" type="data" label="FASTQ file for reads"/> - </when> - <when value="paired"> - <param name="input1" format="fastq" type="data" label="FASTQ file for forward reads"/> - <param name="input2" format="fastq" type="data" label="FASTQ file for reverse reads"/> - <param name="d" type="integer" value="500" label="Estimated insert size for paired-end reads" /> - </when> - </conditional> - <param name="k" type="integer" value="23" label="Size of k for forming the de Bruijn overlap graph" /> - </inputs> - <configfiles> - <configfile name="soap_config">max_rd_len=105 -[LIB] -#if $inputs.read_type == "single" -q=${inputs.input1.file_name} -#else -avg_ins=${inputs.d} -asm_flags=3 -reverse_seq=0 -q1=${inputs.input1.file_name} -q2=${inputs.input2.file_name} -#end if - </configfile> - </configfiles> - <outputs> - <data name="assembled_contigs" format="fasta" from_work_dir="assembly.scafSeq" label="Assembled contigs from ${on_string}" /> - </outputs> - <help> - -**What it does** - -Runs SOAPdenovo_ to generate a genome assembly -using single-fragment or paired-end short reads. - -Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J. -"De novo assembly of human genomes with massively parallel short read sequencing." -*Genome Res.* 2010 Feb;20(2):265-72. - -.. _SOAPdenovo: http://soap.genomics.org.cn/soapdenovo.html - -**Parameter list** - -k - k-mer size for constructing the de Bruijn graph. The appropriate size of k is genome and data set dependent, but a good starting choice might be 75% of the read length. - -Insert size - For paired-end libraries, the expected insert size. - -**Output** - -assembly - - </help> -</tool> - - Repository URL: https://bitbucket.org/galaxy/galaxy-central/ -- This is a commit notification from bitbucket.org. You are receiving this because you have the service enabled, addressing the recipient of this email.