details:   http://www.bx.psu.edu/hg/galaxy/rev/cf02fb92ee6a
changeset: 2710:cf02fb92ee6a
user:      Kelly Vincent <kpvincent@bx.psu.edu>
date:      Fri Sep 18 10:15:09 2009 -0400
description:
Added the Pileup-to-Interval tool to condense pileup format

7 file(s) affected in this change:

test-data/pileup_interval_in1.tabular
test-data/pileup_interval_in2.tabular
test-data/pileup_interval_out1.tabular
test-data/pileup_interval_out2.tabular
tool_conf.xml.sample
tools/samtools/pileup_interval.py
tools/samtools/pileup_interval.xml

diffs (559 lines):

diff -r 8fc33cdc1857 -r cf02fb92ee6a test-data/pileup_interval_in1.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pileup_interval_in1.tabular Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,118 @@
+chr1 1 G 3 , 3
+chr1 5 A 5 , I
+chr1 10 T 2 , I
+chr1 11 C 3 , I
+chr1 12 G 4 , I
+chr1 13 C 2 , I
+chr1 14 A 3 , I
+chr1 15 T 3 , 6
+chr1 16 A 2 , 3
+chr1 17 T 4 , I
+chr1 2735 C 3 , I
+chrM 2736 t 3 , 9
+chrM 2737 t 3 , I
+chrM 2738 a 3 , I
+chrM 2739 c 3 , I
+chrM 2740 a 3 , I
+chrM 2741 c 3 , I
+chrM 2742 t 4 , 5
+chrM 2743 c 5 , I
+chrM 2744 a 2 , I
+chrM 2745 g 1 , I
+chrM 2746 a 1 , I
+chrM 2747 g 1 , I
+chrM 2748 g 1 , I
+chrM 2749 t 1 , I
+chrM 2750 t 1 , I
+chrM 2751 c 1 , I
+chrM 2752 a 1 , I
+chrM 2753 a 1 , I
+chrM 2754 c 1 , I
+chrM 2755 t 1 , I
+chrM 2756 c 1 , I
+chrM 2757 c 1 , I
+chrM 2758 t 5 , I
+chrM 2759 c 3 , I
+chrM 2760 t 1 , I
+chrM 2761 c 1 , I
+chrM 2762 c 1 n "
+chrM 2763 c 1 n "
+chrM 2764 t 1 , I
+chrM 2765 a 1 , I
+chrM 2766 a 1 , I
+chrM 2767 c 1 , I
+chrM 2768 a 1 , I
+chrM 2769 a 1 , I
+chrM 2770 c 1 ,$ I
+chrM 9563 C 1 ^:, I
+chrM 9564 T 1 , +
+chrM 9565 G 1 , -
+chrM 9566 A 1 , I
+chrM 9567 C 1 , I
+chrM 9568 T 1 , ?
+chrM 9569 A 1 , I
+chrM 9570 C 1 , D
+chrM 9571 C 1 , I
+chrM 9572 A 1 , I
+chrM 9573 C 1 , I
+chrM 9574 A 1 , I
+chrM 9575 A 1 , I
+chrM 9576 C 1 , I
+chrM 9577 T 1 , I
+chrM 9578 A 1 , I
+chrM 9579 A 1 , I
+chrM 9580 A 1 , I
+chrM 9581 C 1 , I
+chrM 9582 A 1 , I
+chrM 9583 T 1 , I
+chrM 9584 C 1 , I
+chrM 9585 T 1 , I
+chrM 9586 A 1 , I
+chrM 9587 T 1 , I
+chrM 9588 G 1 , I
+chrM 9589 C 1 , I
+chrM 9590 A 1 n "
+chrM 9591 G 1 n "
+chrM 9592 A 1 , I
+chrM 9593 A 1 , I
+chrM 9594 A 1 , I
+chrM 9595 A 1 , I
+chrM 9596 A 1 , I
+chrM 9597 A 1 , I
+chrM 9598 C 1 ,$ I
+chrM 10864 T 1 ^!, ~
+chrM 10865 G 1 , ~
+chrM 10866 T 1 , ~
+chrM 10867 A 1 , ~
+chrM 10868 G 1 , ~
+chrM 10869 A 1 , ~
+chrM 10870 A 1 , ~
+chrM 10871 G 1 , ~
+chrM 10872 C 1 , ~
+chrM 10873 C 3 , ~
+chrM 10874 C 3 , ~
+chrM 10875 C 3 , ~
+chrM 10876 A 3 , ~
+chrM 10877 A 3 , ~
+chrM 10878 T 3 , ~
+chrM 10879 T 3 , ~
+chrM 10880 G 3 , ~
+chrM 10881 C 3 , ~
+chrM 10882 C 3 , ~
+chrM 10883 G 3 , ~
+chrM 10884 G 3 , ~
+chrM 10885 A 1 , ~
+chrM 10886 T 1 , ~
+chrM 10887 C 1 , ~
+chrM 10888 C 1 , ~
+chrM 10889 A 1 , ~
+chrM 10890 T 1 , ~
+chrM 10891 A 1 n ~
+chrM 10892 G 1 n ~
+chrM 10893 T 1 , ~
+chrM 10894 G 1 , ~
+chrM 10895 C 1 , ~
+chrM 10896 T 3 , ~
+chrM 10897 A 3 , ~
+chrM 10898 G 3 , ~
+chrM 10899 C 3 ,$ ~
diff -r 8fc33cdc1857 -r cf02fb92ee6a test-data/pileup_interval_in2.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pileup_interval_in2.tabular Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,99 @@
+chr1 5016020 t T 33 0 25 2 .. II
+chr1 5016021 g G 32 0 25 2 .. )I
+chr1 5016022 t T 28 0 25 2 .. I$
+chr1 5016023 t T 33 0 25 2 .. II
+chr1 5016024 t T 33 0 25 2 .. II
+chr1 5016025 c C 39 0 25 4 ..^:,^:, II:/
+chr1 5016026 t T 28 0 25 4 ..,c III$
+chr1 5016027 c C 39 0 25 4 ..,, H0+7
+chr1 5016028 t T 28 0 25 4 ..,g III$
+chr1 5016029 g G 10 0 25 4 T.,t BII#
+chr1 5016030 c C 39 0 25 4 .$.$,, @)6I
+chr1 5016031 t T 33 0 25 2 ,, IF
+chr1 5016032 t G 0 0 25 2 ,g IC
+chr1 5016033 c C 33 0 25 2 ,, 1I
+chr1 5016034 t T 33 0 25 2 ,, II
+chr1 12459316 G G 7 0 0 2 .. II
+chr1 12459317 G G 7 0 0 2 .. II
+chr1 12459318 A A 10 0 0 3 ..^!. III
+chr1 12459319 T T 10 0 0 3 ... III
+chr1 12459320 C C 10 0 0 3 ... III
+chr1 12459321 T T 10 0 0 3 ... III
+chr1 12459322 A A 10 0 0 3 ... .II
+chr1 12459323 C C 10 0 0 3 ... ?II
+chr1 12459324 A A 10 0 0 3 ... G?I
+chr1 12459325 C C 10 0 0 3 ... II;
+chr1 12459326 A A 10 0 0 3 ... I@B
+chr1 12459327 C C 10 0 0 3 ... 8II
+chr1 12459328 A A 10 0 0 3 ... IH5
+chr1 12459329 T T 10 0 0 3 ... I;I
+chr1 12459330 C C 10 0 0 3 ... IAI
+chr1 12459331 T T 10 0 0 3 ... 3HI
+chr1 49116109 C C 28 0 18 2 .. G?
+chr1 49116110 A A 28 0 18 2 .. '@
+chr1 49116111 G G 26 0 18 2 .. 68
+chr1 49116112 A A 9 0 18 2 .. 1'
+chr1 49116113 G G 20 0 18 2 .. I2
+chr1 49116114 G G 2 0 20 3 A.^:, &&$
+chr1 49116115 G G 21 0 20 3 .A, 8$I
+chr1 49116116 T T 31 0 20 3 .., .9%
+chr1 49116117 T T 36 0 20 3 .., I55
+chr1 49116118 T T 36 0 20 3 .., II+
+chr1 49116119 T T 36 0 20 3 .., II8
+chr1 49116120 G G 32 0 20 3 .., &%B
+chr1 49116121 T T 36 0 20 3 .$., <63
+chr1 49116122 C C 33 0 25 2 ., +I
+chr1 49116123 T T 33 0 25 2 ., -7
+chr1 49116124 G G 29 0 25 2 ., %I
+chr1 49116125 C C 24 0 25 2 .$, +/
+chr1 126866554 G G 7 0 0 2 .. (I
+chr1 126866555 C C 10 0 0 3 .$.^!. III
+chr1 126866556 C C 7 0 0 2 .. II
+chr11 1021425 C C 4 0 0 1 . I
+chr11 1021426 A A 4 0 0 1 . I
+chr11 1021427 G G 4 0 0 1 . I
+chr11 1021428 G G 28 0 18 2 .^:. 0I
+chr11 1021429 G G 19 0 18 2 C. $I
+chr11 1021430 G G 36 0 20 3 ..^:. III
+chr11 1021431 T T 36 0 20 3 ... III
+chr11 1021432 G G 36 0 20 3 ... III
+chr11 1021433 A A 36 0 20 3 ... @II
+chr11 1021434 C C 36 0 20 3 ... %II
+chr11 1021435 G G 36 0 20 3 ... #II
+chr11 1021436 T T 36 0 20 3 ... 8II
+chr11 1021437 G G 36 0 20 3 ... /II
+chr11 1021438 G G 36 0 20 3 ... III
+chr11 1021439 G G 36 0 20 3 ... ;II
+chr11 1021440 C C 28 0 20 3 N.. "II
+chr11 1021441 T T 36 0 20 3 ... IFI
+chr11 1021442 G G 36 0 20 3 ... III
+chr11 1021443 T T 36 0 20 3 ... III
+chr11 1021444 G G 28 0 20 3 T.. #II
+chr11 1021445 T T 28 0 20 3 C.. #II
+chr11 1021446 C C 36 0 20 3 .$.. :II
+chr11 1021447 T T 33 0 25 2 .. II
+chr11 1021448 G G 33 0 25 2 .. II
+chr11 1021449 T T 33 0 25 2 .. 7I
+chr11 1021450 G G 33 0 25 2 .. II
+chr14 1021451 T A 33 0 25 3 .. 4I
+chr14 80839355 A A 33 0 25 2 .. I*
+chr14 80839356 G G 28 0 25 2 .. I#
+chr14 80839357 A A 31 0 25 2 .. I(
+chr14 80839358 A A 32 0 25 2 .. I)
+chr14 80839359 T T 39 0 25 4 ..^:,^:, I+I(
+chr14 80839360 T T 39 0 25 4 ..,, I+I+
+chr14 80839361 C C 39 0 25 4 ..,, I&5(
+chr14 80839362 T T 39 0 25 4 ..,, I3II
+chr14 80839363 G G 39 0 25 4 ..,, G#I4
+chr14 80839364 G G 39 0 25 4 ..,, I'II
+chr14 80839365 A A 39 0 25 4 ..,, @)IH
+chr14 80839366 T T 39 0 25 4 ..,, I/I2
+chr14 80839367 A A 39 0 25 4 ..,, I,I=
+chr14 80839368 T T 39 0 25 4 ..,, I.I7
+chr14 80839369 T T 39 0 25 4 ..,, I4II
+chr14 80839370 T T 39 0 25 4 ..,, I2I0
+chr14 80839371 A A 39 0 25 4 .$.$,, ;+I?
+chr14 80839372 C C 14 0 25 2 ,a 5$
+chr14 80839373 A A 33 0 25 2 ,, II
+chr14 80839374 T T 33 0 25 2 ,, II
+chr14 80839375 T T 33 0 25 2 ,, I?
diff -r 8fc33cdc1857 -r cf02fb92ee6a test-data/pileup_interval_out1.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pileup_interval_out1.tabular Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,10 @@
+chr1 0 1 G
+chr1 4 5 A
+chr1 10 12 CG
+chr1 13 15 AT
+chr1 16 17 T
+chr1 2734 2735 C
+chrM 2735 2743 ttacactc
+chrM 2757 2759 tc
+chrM 10872 10884 CCCAATTGCCGG
+chrM 10895 10899 TAGC
diff -r 8fc33cdc1857 -r cf02fb92ee6a test-data/pileup_interval_out2.tabular
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/pileup_interval_out2.tabular Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,7 @@
+chr1 5016024 5016030 ctctgc
+chr1 12459317 12459331 ATCTACACACATCT
+chr1 49116113 49116121 GGTTTTGT
+chr1 126866554 126866555 C
+chr11 1021429 1021446 GTGACGTGGGCTGTGTC
+chr14 1021450 1021451 T
+chr14 80839358 80839371 TTCTGGATATTTA
diff -r 8fc33cdc1857 -r cf02fb92ee6a tool_conf.xml.sample
--- a/tool_conf.xml.sample Thu Sep 17 12:45:36 2009 -0400
+++ b/tool_conf.xml.sample Fri Sep 18 10:15:09 2009 -0400
@@ -349,5 +349,6 @@
     <tool file="samtools/sam_merge.xml" />
     <tool file="samtools/sam_pileup.xml" />
     <tool file="samtools/pileup_parser.xml" />
+    <tool file="samtools/pileup_interval.xml" />
   </section>
 </toolbox>
diff -r 8fc33cdc1857 -r cf02fb92ee6a tools/samtools/pileup_interval.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/samtools/pileup_interval.py Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,105 @@
+#! /usr/bin/python
+
+"""
+Condenses a pileup file into intervals of consecutive bases that meet a coverage threshold.
+
+usage: %prog [options]
+   -i, --input=i: Input pileup file
+   -o, --output=o: Output interval file
+   -c, --coverage=c: Coverage
+   -f, --format=f: Pileup format
+   -b, --base=b: Base to select
+   -s, --seq_column=s: Sequence column
+   -l, --loc_column=l: Base location column
+   -r, --base_column=r: Reference base column
+   -C, --cvrg_column=C: Coverage column
+"""
+import sys
+from galaxy import eggs
+import pkg_resources; pkg_resources.require( "bx-python" )
+from bx.cookbook import doc_optparse
+
+def stop_err( msg ):
+    sys.stderr.write( msg )
+    sys.exit()
+
+def __main__():
+    strout = ''
+    #Parse Command Line
+    options, args = doc_optparse.parse( __doc__ )
+    coverage = int(options.coverage)
+    fin = file(options.input, 'r')
+    fout = file(options.output, 'w')
+    inLine = fin.readline()
+    if options.format == 'six':
+        seqIndex = 0
+        locIndex = 1
+        baseIndex = 2
+        covIndex = 3
+    elif options.format == 'ten':
+        seqIndex = 0
+        locIndex = 1
+        if options.base == 'first':
+            baseIndex = 2
+        else:
+            baseIndex = 3
+        covIndex = 7
+    else:
+        seqIndex = int(options.seq_column) - 1
+        locIndex = int(options.loc_column) - 1
+        baseIndex = int(options.base_column) - 1
+        covIndex = int(options.cvrg_column) - 1
+    lastSeq = ''
+    lastLoc = -1
+    locs = []
+    startLoc = -1
+    bases = []
+    while inLine.strip() != '':
+        lineParts = inLine.split('\t')
+        seq, loc, base, cov = lineParts[seqIndex], int(lineParts[locIndex]), lineParts[baseIndex], int(lineParts[covIndex])
+#        strout += str(startLoc) + '\n'
+#        strout += str(bases) + '\n'
+#        strout += '%s\t%s\t%s\t%s\n' % (seq, loc, base, cov)
+        if loc == lastLoc+1 or lastLoc == -1:
+            if cov >= coverage:
+                if seq == lastSeq or lastSeq == '':
+                    if startLoc == -1:
+                        startLoc = loc
+                    locs.append(loc)
+                    bases.append(base)
+                else:
+                    if len(bases) > 0:
+                        fout.write('%s\t%s\t%s\t%s\n' % (lastSeq, startLoc-1, lastLoc, ''.join(bases)))
+                    startLoc = loc
+                    locs = [loc]
+                    bases = [base]
+            else:
+                if len(bases) > 0:
+                    fout.write('%s\t%s\t%s\t%s\n' % (lastSeq, startLoc-1, lastLoc, ''.join(bases)))
+                startLoc = -1
+                locs = []
+                bases = []
+        else:
+            if len(bases) > 0:
+                fout.write('%s\t%s\t%s\t%s\n' % (lastSeq, startLoc-1, lastLoc, ''.join(bases)))
+            if cov >= coverage:
+                startLoc = loc
+                locs = [loc]
+                bases = [base]
+            else:
+                startLoc = -1
+                locs = []
+                bases = []
+        lastSeq = seq
+        lastLoc = loc
+        inLine = fin.readline()
+    if len(bases) > 0:
+        fout.write('%s\t%s\t%s\t%s\n' % (lastSeq, startLoc-1, lastLoc, ''.join(bases)))
+    fout.close()
+    fin.close()
+
+#    import sys
+#    strout += file(fout.name,'r').read()
+#    sys.stderr.write(strout)
+
+if __name__ == "__main__" : __main__()
diff -r 8fc33cdc1857 -r cf02fb92ee6a tools/samtools/pileup_interval.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/samtools/pileup_interval.xml Fri Sep 18 10:15:09 2009 -0400
@@ -0,0 +1,186 @@
+<tool id="pileup_interval" name="Pileup-to-Interval" version="1.0.0">
+  <description>condenses pileup format into ranges of bases</description>
+  <command interpreter="python">
+    pileup_interval.py
+      --input=$input
+      --output=$output
+      --coverage=$coverage
+      --format=$format_type.format
+      #if $format_type.format == "ten":
+        --base=$format_type.which_base
+        --seq_column="None"
+        --loc_column="None"
+        --base_column="None"
+        --cvrg_column="None"
+      #elif $format_type.format == "manual":
+        --base="None"
+        --seq_column=$format_type.seq_column
+        --loc_column=$format_type.loc_column
+        --base_column=$format_type.base_column
+        --cvrg_column=$format_type.cvrg_column
+      #else:
+        --base="None"
+        --seq_column="None"
+        --loc_column="None"
+        --base_column="None"
+        --cvrg_column="None"
+      #end if
+  </command>
+  <inputs>
+    <param name="input" type="data" format="tabular" label="Choose a pileup file to condense:" />
+    <conditional name="format_type">
+      <param name="format" type="select" label="which contains:" help="See &quot;Types of pileup datasets&quot; below for examples">
+        <option value="six" selected="true">Pileup with six columns (simple)</option>
+        <option value="ten">Pileup with ten columns (with consensus)</option>
+        <option value="manual">Set columns manually</option>
+      </param>
+      <when value="six" />
+      <when value="ten">
+        <param name="which_base" type="select" label="Which base do you want to concatenate">
+          <option value="first" selected="true">Reference base (first)</option>
+          <option value="second">Consensus base (second)</option>
+        </param>
+      </when>
+      <when value="manual">
+        <param name="seq_column" label="Select column with sequence name" type="data_column" numerical="false" data_ref="input" />
+        <param name="loc_column" label="Select column with base location" type="data_column" numerical="false" data_ref="input" />
+        <param name="base_column" label="Select column with base to concatenate" type="data_column" numerical="false" data_ref="input" />
+        <param name="cvrg_column" label="Select column with coverage" type="data_column" numerical="true" data_ref="input" />
+      </when>
+    </conditional>
+    <param name="coverage" type="integer" value="3" label="Do not report bases with coverage less than:" />
+  </inputs>
+  <outputs>
+    <data format="tabular" name="output" />
+  </outputs>
+  <tests>
+    <test>
+      <param name="input" value="pileup_interval_in1.tabular" />
+      <param name="format" value="six" />
+      <param name="coverage" value="3" />
+      <output name="output" file="pileup_interval_out1.tabular" />
+    </test>
+    <test>
+      <param name="input" value="pileup_interval_in2.tabular" />
+      <param name="format" value="ten" />
+      <param name="which_base" value="first" />
+      <param name="coverage" value="3" />
+      <output name="output" file="pileup_interval_out2.tabular" />
+    </test>
+    <test>
+      <param name="input" value="pileup_interval_in2.tabular" />
+      <param name="format" value="manual" />
+      <param name="seq_column" value="1" />
+      <param name="loc_column" value="2" />
+      <param name="base_column" value="3" />
+      <param name="cvrg_column" value="8" />
+      <param name="coverage" value="3" />
+      <output name="output" file="pileup_interval_out2.tabular" />
+    </test>
+  </tests>
+  <help>
+
+**What it does**
+
+Reduces the size of a result set by taking a pileup file and producing a condensed version that lists runs of consecutive bases meeting the coverage criterion. The tool works on the six- and ten-column pileup formats produced by the *samtools pileup* command. It assumes the dataset was produced by *samtools pileup*, although you can override this by setting the column assignments manually.
+
+--------
+
+**Types of pileup datasets**
+
+The description of the pileup format below is largely based on the SAMTools_ documentation page. The 6- and 10-column variants are described below.
+
+.. _SAMTools: http://samtools.sourceforge.net/pileup.shtml
+
+**Six column pileup**::
+
+    1    2    3  4  5     6
+    ---------------------------------
+    chrM 412  A  2  .,    II
+    chrM 413  G  4  ..t,  IIIH
+    chrM 414  C  4  ...a  III2
+    chrM 415  C  4  TTTt  III7
+
+where::
+
+    Column Definition
+    ------ ----------------------------
+    1      Chromosome
+    2      Position (1-based)
+    3      Reference base at that position
+    4      Coverage (# reads aligning over that position)
+    5      Bases within reads where (see Galaxy wiki for more info)
+    6      Quality values (phred33 scale, see Galaxy wiki for more)
+
+**Ten column pileup**
+
+The `ten-column`__ pileup incorporates additional consensus information generated with the *-c* option of the *samtools pileup* command::
+
+
+    1    2    3  4  5   6   7   8  9     10
+    ------------------------------------------------
+    chrM 412  A  A  75  0   25  2  .,    II
+    chrM 413  G  G  72  0   25  4  ..t,  IIIH
+    chrM 414  C  C  75  0   25  4  ...a  III2
+    chrM 415  C  T  75  75  25  4  TTTt  III7
+
+where::
+
+    Column  Definition
+    ------- ----------------------------
+    1       Chromosome
+    2       Position (1-based)
+    3       Reference base at that position
+    4       Consensus bases
+    5       Consensus quality
+    6       SNP quality
+    7       Maximum mapping quality
+    8       Coverage (# reads aligning over that position)
+    9       Bases within reads where (see Galaxy wiki for more info)
+    10      Quality values (phred33 scale, see Galaxy wiki for more)
+
+
+.. __: http://samtools.sourceforge.net/cns0.shtml
+
+------
+
+**The output format**
+
+The output file condenses the information in the pileup file so that consecutive bases are listed together as sequences. The starting and ending points of each sequence range are listed, with the starting value converted to a 0-based value.
+
+Given the following input with minimum coverage set to 3::
+
+    1    2    3  4  5     6
+    ---------------------------------
+    chr1 112  G  3  ..Ta  III6
+    chr1 113  T  2  aT..  III5
+    chr1 114  A  5  ,,..  IIH2
+    chr1 115  C  4  ,.,   III
+    chrM 412  A  2  .,    II
+    chrM 413  G  4  ..t,  IIIH
+    chrM 414  C  4  ...a  III2
+    chrM 415  C  4  TTTt  III7
+    chrM 490  T  3  a     I
+
+the following would be the output::
+
+    1    2    3    4
+    -------------------
+    chr1 111  112  G
+    chr1 113  115  AC
+    chrM 412  415  GCC
+    chrM 489  490  T
+
+where::
+
+    Column  Definition
+    ------- ----------------------------
+    1       Chromosome
+    2       Starting position (0-based)
+    3       Ending position (1-based)
+    4       Sequence of bases
+
+  </help>
+</tool>
+
+
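For readers who want the run-merging rule from "The output format" in isolation, it can be sketched in a few lines of Python 3. This is an illustrative sketch only, not the committed script: the function name condense_pileup and the tuple-based input (chromosome, 1-based position, base, coverage) are assumptions made for the example, and file parsing and column selection are left out.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

# Hypothetical row and interval shapes used only for this sketch.
Row = Tuple[str, int, str, int]        # (chrom, 1-based position, base, coverage)
Interval = Tuple[str, int, int, str]   # (chrom, 0-based start, 1-based end, bases)


def condense_pileup(rows: Iterable[Row], min_coverage: int) -> Iterator[Interval]:
    """Yield one interval per run of adjacent positions with enough coverage."""
    chrom: Optional[str] = None
    start = end = -1
    bases: List[str] = []
    for seq, loc, base, cov in rows:
        # A row extends the open run only if it is on the same chromosome
        # and sits immediately after the run's last position.
        extends = bool(bases) and seq == chrom and loc == end + 1
        if bases and not (extends and cov >= min_coverage):
            # Close the open run; the start is reported 0-based.
            yield chrom, start - 1, end, "".join(bases)
            bases = []
        if cov >= min_coverage:
            if not bases:
                chrom, start = seq, loc
            bases.append(base)
            end = loc
    if bases:
        yield chrom, start - 1, end, "".join(bases)


if __name__ == "__main__":
    # The example input from the help text, reduced to the four relevant columns.
    example = [
        ("chr1", 112, "G", 3), ("chr1", 113, "T", 2), ("chr1", 114, "A", 5),
        ("chr1", 115, "C", 4), ("chrM", 412, "A", 2), ("chrM", 413, "G", 4),
        ("chrM", 414, "C", 4), ("chrM", 415, "C", 4), ("chrM", 490, "T", 3),
    ]
    for interval in condense_pileup(example, 3):
        print("\t".join(str(field) for field in interval))
```

Run on the help-text example with a minimum coverage of 3, the sketch yields the same four intervals listed under "the following would be the output". A generator is used so that large pileup files could be streamed row by row instead of being held in memory.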