details: http://www.bx.psu.edu/hg/galaxy/rev/2afb7110c649
changeset: 2708:2afb7110c649
user: Dan Blankenberg <dan(a)bx.psu.edu>
date: Thu Sep 17 11:47:37 2009 -0400
description:
Add a MAF to Interval converter that produces a set of intervals with sequence data.
7 file(s) affected in this change:
test-data/maf_to_interval_out_hg17.interval
test-data/maf_to_interval_out_panTro1.interval
tool_conf.xml.main
tool_conf.xml.sample
tools/maf/maf_reverse_complement.xml
tools/maf/maf_to_interval.py
tools/maf/maf_to_interval.xml
diffs (246 lines):
diff -r bac909f808c2 -r 2afb7110c649 test-data/maf_to_interval_out_hg17.interval
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/maf_to_interval_out_hg17.interval Thu Sep 17 11:47:37 2009 -0400
@@ -0,0 +1,2 @@
+#chrom start end strand score name bosTau2 canFam2 dasNov1 hg17 mm7 panTro1 rheMac2 rn3
+chr7 127471910 127472074 + 94204.0 hg17_5_0 atgtgaacaa---------------------------------------------------------------------------------------------aacggacccgtgtgggactcggcggagcacacagattttgcgggagCACGTTCCCGTTAGGAAGTCTCTGATGCAATACGACCGGTGCCTTCAGGACCTG-TG--AGGCTGACTTTCCTTA-CCCCTCCACACCATCATCAAGGCAGGTGTGATTTTCCAGG cagtgaacaa---------------------------------------------------------------------------------------------aacagagccctgcagt-cttgatggagcacacaacctttg-gggaaCATGTTTCCATAAGAAAGTCTCCAATGTGATCTGA-TGGTGCCGCCAGGACCTA-TGTCAGCCTACCGTTCCATGTCCCCTCCACACCATCATCACTGCAGGTGTGTTTTCCCACA CAGTGAGCAA-----------------------------------------------------------------------------------------------CAGCCTGGCTCCGT-CC--GGGGGCCGCTCAGCAGCTC-GGGAGCGTGGAGACG---GGAAGTCTGTCACGCGATGCG-----------CTGGGCCCG------------CTGTTCCCGCCCCCCTCC---CCCC----------------TTTCCCAAG caatgaccaa----------------------------------------------------------------------------------------------atagactcctaccaa-ctc-aaagaatgcacattctCTG-GGA
AACATGTTTCCATTAGGAAGCCTCGAATGCAATGTGACTGTGGTCTCCAGGACCTG-TGTGATCCTGGCTTTTCCTGTTCCCTCCG---CATCATCACTGCAGGTGTGTTTTCCCAAG caaaaaccaa------------------------------------------------------------------------------------------------aaaaACCTATAGC-CTC-ACAGGGTGGGTTGTCTTTG-AGGAACATGCATCCGCTAGAAAGTCCCAAGTACACTATGACAGTTG--CCCAGGCCCCGCCTTAAACCTGGTTTTCCTGGTTTCTTTCA---CATCATTACCACGAATATATTTCCTCAAG caatgaccaa----------------------------------------------------------------------------------------------atagactcctaccaa-ctc-aaagaatgcacattctCTG-GGAAACATGTTTCCATTAGGAAGCCTCGAATGCAATGTGACTGTGGTCTCCAGGACATG-TGTGATCCTGGCTTTTCCTGTTCCCTCTG---CATCATCACTGCAGGTGTATTTTCCCAAG caatgaccaa----------------------------------------------------------------------------------------------atagacccctaccga-ctc-aaagaatgtacattctTTG-GGAAACATGTTTCCATCAGAAAATCTCAAATGCAATGTGACTGGGGTCTCCAGGACCTG-TGTGAGCCTGGCTTTTCCTGTTCCCTCCA---CATCATCACTGCAGGTGTATTTTCCC--G --ATGACCAATATACACTGTTTACATGTATAGCATTGTGAATGGAGACATAAAAAGATAATCTAGCTTTGTG
CTAGGTAGGTGCTGAGCTCTTAACAGTGCTGGGCAGAAACCTATAAC-CTC-ACAGGGTGGGTTGTCTTTG-AGGAGCGTGCTAACCCTAGGAAGTCTCAAATACAATGTGATGGTTGCCCCCAGGCACCACCTTGAACCTGGTCTTCCTGGTTTCTTTCA---CACCATTACCACAAATACATTTTCTCAGG
diff -r bac909f808c2 -r 2afb7110c649 test-data/maf_to_interval_out_panTro1.interval
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/maf_to_interval_out_panTro1.interval Thu Sep 17 11:47:37 2009 -0400
@@ -0,0 +1,2 @@
+#chrom start end strand score name bosTau2 canFam2 dasNov1 hg17 mm7 panTro1 rheMac2 rn3
+chr6 129885791 129885955 + 94204.0 panTro1_5_0 atgtgaacaa---------------------------------------------------------------------------------------------aacggacccgtgtgggactcggcggagcacacagattttgcgggagCACGTTCCCGTTAGGAAGTCTCTGATGCAATACGACCGGTGCCTTCAGGACCTG-TG--AGGCTGACTTTCCTTA-CCCCTCCACACCATCATCAAGGCAGGTGTGATTTTCCAGG cagtgaacaa---------------------------------------------------------------------------------------------aacagagccctgcagt-cttgatggagcacacaacctttg-gggaaCATGTTTCCATAAGAAAGTCTCCAATGTGATCTGA-TGGTGCCGCCAGGACCTA-TGTCAGCCTACCGTTCCATGTCCCCTCCACACCATCATCACTGCAGGTGTGTTTTCCCACA CAGTGAGCAA-----------------------------------------------------------------------------------------------CAGCCTGGCTCCGT-CC--GGGGGCCGCTCAGCAGCTC-GGGAGCGTGGAGACG---GGAAGTCTGTCACGCGATGCG-----------CTGGGCCCG------------CTGTTCCCGCCCCCCTCC---CCCC----------------TTTCCCAAG caatgaccaa----------------------------------------------------------------------------------------------atagactcctaccaa-ctc-aaagaatgcacattctCTG-
GGAAACATGTTTCCATTAGGAAGCCTCGAATGCAATGTGACTGTGGTCTCCAGGACCTG-TGTGATCCTGGCTTTTCCTGTTCCCTCCG---CATCATCACTGCAGGTGTGTTTTCCCAAG caaaaaccaa------------------------------------------------------------------------------------------------aaaaACCTATAGC-CTC-ACAGGGTGGGTTGTCTTTG-AGGAACATGCATCCGCTAGAAAGTCCCAAGTACACTATGACAGTTG--CCCAGGCCCCGCCTTAAACCTGGTTTTCCTGGTTTCTTTCA---CATCATTACCACGAATATATTTCCTCAAG caatgaccaa----------------------------------------------------------------------------------------------atagactcctaccaa-ctc-aaagaatgcacattctCTG-GGAAACATGTTTCCATTAGGAAGCCTCGAATGCAATGTGACTGTGGTCTCCAGGACATG-TGTGATCCTGGCTTTTCCTGTTCCCTCTG---CATCATCACTGCAGGTGTATTTTCCCAAG caatgaccaa----------------------------------------------------------------------------------------------atagacccctaccga-ctc-aaagaatgtacattctTTG-GGAAACATGTTTCCATCAGAAAATCTCAAATGCAATGTGACTGGGGTCTCCAGGACCTG-TGTGAGCCTGGCTTTTCCTGTTCCCTCCA---CATCATCACTGCAGGTGTATTTTCCC--G --ATGACCAATATACACTGTTTACATGTATAGCATTGTGAATGGAGACATAAAAAGATAATCTAGCTTT
GTGCTAGGTAGGTGCTGAGCTCTTAACAGTGCTGGGCAGAAACCTATAAC-CTC-ACAGGGTGGGTTGTCTTTG-AGGAGCGTGCTAACCCTAGGAAGTCTCAAATACAATGTGATGGTTGCCCCCAGGCACCACCTTGAACCTGGTCTTCCTGGTTTCTTTCA---CACCATTACCACAAATACATTTTCTCAGG
diff -r bac909f808c2 -r 2afb7110c649 tool_conf.xml.main
--- a/tool_conf.xml.main Thu Sep 17 09:08:37 2009 -0400
+++ b/tool_conf.xml.main Thu Sep 17 11:47:37 2009 -0400
@@ -40,6 +40,7 @@
<tool file="fasta_tools/fasta_to_tabular.xml" />
<tool file="filters/gff2bed.xml" />
<tool file="maf/maf_to_bed.xml" />
+ <tool file="maf/maf_to_interval.xml" />
<tool file="maf/maf_to_fasta.xml" />
<tool file="fasta_tools/tabular_to_fasta.xml" />
</section>
diff -r bac909f808c2 -r 2afb7110c649 tool_conf.xml.sample
--- a/tool_conf.xml.sample Thu Sep 17 09:08:37 2009 -0400
+++ b/tool_conf.xml.sample Thu Sep 17 11:47:37 2009 -0400
@@ -75,6 +75,7 @@
<tool file="filters/gff2bed.xml" />
<tool file="filters/lav_to_bed.xml" />
<tool file="maf/maf_to_bed.xml" />
+ <tool file="maf/maf_to_interval.xml" />
<tool file="maf/maf_to_fasta.xml" />
<tool file="fasta_tools/tabular_to_fasta.xml" />
<tool file="next_gen_conversion/solid_to_fastq.xml" />
diff -r bac909f808c2 -r 2afb7110c649 tools/maf/maf_reverse_complement.xml
--- a/tools/maf/maf_reverse_complement.xml Thu Sep 17 09:08:37 2009 -0400
+++ b/tools/maf/maf_reverse_complement.xml Thu Sep 17 11:47:37 2009 -0400
@@ -1,4 +1,4 @@
-<tool id="MAF_Reverse_Complement_1" name="Reverse Compliment" version="1.0.1">
+<tool id="MAF_Reverse_Complement_1" name="Reverse Complement" version="1.0.1">
<description>a MAF file</description>
<command interpreter="python">maf_reverse_complement.py $input1 $out_file1 $species</command>
<inputs>
diff -r bac909f808c2 -r 2afb7110c649 tools/maf/maf_to_interval.py
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/maf/maf_to_interval.py Thu Sep 17 11:47:37 2009 -0400
@@ -0,0 +1,68 @@
+#!/usr/bin/env python
+
+"""
+Read a maf and output intervals for specified list of species.
+"""
+import sys, os
+from galaxy import eggs
+import pkg_resources; pkg_resources.require( "bx-python" )
+from bx.align import maf
+from galaxy.tools.util import maf_utilities
+
+assert sys.version_info[:2] >= ( 2, 4 )
+
+def __main__():
+    input_filename = sys.argv[1]
+    output_filename = sys.argv[2]
+    output_id = sys.argv[3]
+    #where to store files that become additional output
+    database_tmp_dir = sys.argv[4]
+    primary_spec = sys.argv[5]
+    species = sys.argv[6].split( ',' )
+    all_species = sys.argv[7].split( ',' )
+    partial = sys.argv[8]
+    keep_gaps = sys.argv[9]
+    out_files = {}
+
+    if "None" in species:
+        species = []
+
+    if primary_spec not in species:
+        species.append( primary_spec )
+    if primary_spec not in all_species:
+        all_species.append( primary_spec )
+
+    all_species.sort()
+    for spec in species:
+        if spec == primary_spec:
+            out_files[ spec ] = open( output_filename, 'wb+' )
+        else:
+            out_files[ spec ] = open( os.path.join( database_tmp_dir, 'primary_%s_%s_visible_interval_%s' % ( output_id, spec, spec ) ), 'wb+' )
+        out_files[ spec ].write( '#chrom\tstart\tend\tstrand\tscore\tname\t%s\n' % ( '\t'.join( all_species ) ) )
+    num_species = len( all_species )
+
+    file_in = open( input_filename, 'r' )
+    maf_reader = maf.Reader( file_in )
+
+    for i, m in enumerate( maf_reader ):
+        for j, block in enumerate( maf_utilities.iter_blocks_split_by_species( m ) ):
+            if len( block.components ) < num_species and partial == "partial_disallowed": continue
+            sequences = {}
+            for c in block.components:
+                spec, chrom = maf_utilities.src_split( c.src )
+                if keep_gaps == 'remove_gaps':
+                    sequences[ spec ] = c.text.replace( '-', '' )
+                else:
+                    sequences[ spec ] = c.text
+            sequences = '\t'.join( [ sequences.get( spec, '' ) for spec in all_species ] )
+            for spec in species:
+                c = block.get_component_by_src_start( spec )
+                if c is not None:
+                    spec2, chrom = maf_utilities.src_split( c.src )
+                    assert spec2 == spec, Exception( 'Species name inconsistency found in component: %s != %s' % ( spec, spec2 ) )
+                    out_files[ spec ].write( "%s\t%s\t%s\t%s\t%s\t%s\t%s\n" % ( chrom, c.forward_strand_start, c.forward_strand_end, c.strand, m.score, "%s_%s_%s" % (spec, i, j), sequences ) )
+    file_in.close()
+    for file_out in out_files.values():
+        file_out.close()
+
+if __name__ == "__main__": __main__()
diff -r bac909f808c2 -r 2afb7110c649 tools/maf/maf_to_interval.xml
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/maf/maf_to_interval.xml Thu Sep 17 11:47:37 2009 -0400
@@ -0,0 +1,127 @@
+<tool id="MAF_To_Interval1" name="MAF to Interval" force_history_refresh="True">
+ <description>Converts a MAF formatted file to the Interval format</description>
+ <command interpreter="python">maf_to_interval.py $input1 $out_file1 $out_file1.id $__new_file_path__ $input1.dbkey $species $input1.metadata.species $complete_blocks $remove_gaps</command>
+ <inputs>
+ <param format="maf" name="input1" type="data" label="MAF file to convert"/>
+ <param name="species" type="select" label="Select additional species" display="checkboxes" multiple="true" help="The species matching the dbkey of the alignment is always included. A separate history item will be created for each species.">
+ <options>
+ <filter type="data_meta" ref="input1" key="species" />
+ <filter type="remove_value" meta_ref="input1" key="dbkey" />
+ </options>
+ </param>
+ <param name="complete_blocks" type="select" label="Exclude blocks which have a species missing">
+ <option value="partial_allowed">include blocks with missing species</option>
+ <option value="partial_disallowed">exclude blocks with missing species</option>
+ </param>
+ <param name="remove_gaps" type="select" label="Remove Gap characters from sequences">
+ <option value="keep_gaps">keep gaps</option>
+ <option value="remove_gaps">remove gaps</option>
+ </param>
+ </inputs>
+ <outputs>
+ <data format="interval" name="out_file1" />
+ </outputs>
+ <tests>
+ <test>
+ <param name="input1" value="4.maf" dbkey="hg17"/>
+ <param name="complete_blocks" value="partial_disallowed"/>
+ <param name="remove_gaps" value="keep_gaps"/>
+ <param name="species" value="panTro1" />
+ <!-- <output name="out_file1" file="maf_to_interval_out_hg17.interval"/> cannot test primary species, because we cannot leave species blank and we can only test the last item added to a history-->
+ <output name="out_file1" file="maf_to_interval_out_panTro1.interval"/>
+ </test>
+ </tests>
+ <help>
+
+**What it does**
+
+This tool converts every MAF block to a set of genomic intervals describing the position of that alignment block within a corresponding genome. Sequences from aligning species are also included in the output.
+
+The interface for this tool contains several options:
+
+ * **MAF file to convert**. Choose a multiple alignment from your history to be converted to Interval format.
+ * **Select additional species**. Choose additional species from the alignment to be included in the output.
+ * **Exclude blocks which have a species missing**. If an alignment block is missing any of the species found in the alignment set and this option is set to **exclude blocks with missing species**, then the coordinates of that block **will not** be included in the output (see **Example 2** below).
+ * **Remove Gap characters from sequences**. Gaps can be removed from sequences before they are output.
+
+
+-----
+
+**Example 1**: **Include only reference genome** (hg18 in this case) and **include blocks with missing species**:
+
+For the following alignment::
+
+ ##maf version=1
+ a score=68686.000000
+ s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
+ s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
+ s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
+ s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
+ s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C
+
+ a score=10289.000000
+ s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
+ s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
+ s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG
+
+the tool will create **a single** history item containing the following. **Note** that the name field is numbered iteratively (hg18_0_0, hg18_1_0, etc.): the first number is the block number and the second is the iteration through the block (if a species appears twice in a block, that interval is repeated). Sequences for each species are included in the order given in the header, and a field is left empty when no sequence is available for that species::
+
+ #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
+ chr20 56827368 56827443 + 68686.0 hg18_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
+ chr20 56827443 56827480 + 10289.0 hg18_1_0 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG
+
+
+-----
+
+**Example 2**: **Include hg18 and mm8** and **exclude blocks with missing species**:
+
+For the following alignment::
+
+ ##maf version=1
+ a score=68686.000000
+ s hg18.chr20 56827368 75 + 62435964 GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
+ s panTro2.chr20 56528685 75 + 62293572 GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC-
+ s rheMac2.chr10 89144112 69 - 94855758 GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
+ s mm8.chr2 173910832 61 + 181976762 AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC-------
+ s canFam2.chr24 46551822 67 + 50763139 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C
+
+ a score=10289.000000
+ s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
+ s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG
+ s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG
+
+the tool will create **two** history items (one for hg18 and one for mm8). **Note** that each history item contains only one line, describing the first alignment block. The second MAF block is excluded from the output because it is missing species found in the alignment (mm8 and canFam2):
+
+History item **1** (for hg18)::
+
+ #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
+ chr20 56827368 56827443 + 68686.0 hg18_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
+
+
+History item **2** (for mm8)::
+
+ #chrom start end strand score name canFam2 hg18 mm8 panTro2 rheMac2
+ chr2 173910832 173910893 + 68686.0 mm8_0_0 CG------GCGTCTGTAAGGGGCCACCGCCCGGCCTGTG-CTCAAAGCTACAAATGACTCAACTCCCAACCGA------C GACAGGGTGCATCTGGGAGGG---CCTGCCGGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- AGAAGGATCCACCT------------TGCTGGGCCTCTGCTCCAGCAAGACCCACCTCCCAACTCAAATGCCC------- GACAGGGTGCATCTGAGAGGG---CCTGCCAGGCCTTTA-TTCAACACTAGATACGCCCCATCTCCAATTCTAATGGAC- GACAGGGTGCATCTGAGAGGG---CCTGCTGGGCCTTTG-TTCAAAACTAGATATGCCCCAACTCCAATTCTA-------
+
+
+-------
+
+.. class:: infomark
+
+**About formats**
+
+**MAF format**: the multiple alignment format. It stores multiple alignments at the DNA level between entire genomes.
+
+ - The .maf format is line-oriented. Each multiple alignment ends with a blank line.
+ - Each sequence in an alignment is on a single line.
+ - Lines starting with # are considered to be comments.
+ - Each multiple alignment is in a separate paragraph that begins with an "a" line and contains an "s" line for each sequence in the multiple alignment.
+ - Some MAF files may contain two optional line types:
+
+ - An "i" line containing information about what is in the aligned species DNA before and after the immediately preceding "s" line;
+ - An "e" line containing information about the size of the gap between the alignments that span the current block.
+
+
+ </help>
+</tool>
+
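
The row layout described in the help text above can be illustrated with a small standalone sketch. It is not the code from this changeset: it uses neither bx-python nor Galaxy, it targets Python 3, and parse_s_line, forward_coords and block_to_row are hypothetical helper names. It assumes that each species appears at most once per block (so the second index in the name is always 0) and that minus-strand components are reported in forward-strand coordinates, computed as src_size - start - size.

```python
# Standalone sketch (assumption: not part of maf_to_interval.py).
# Builds one interval row from the "s" lines of a single MAF block.

def parse_s_line(line):
    """Split a MAF "s" line into named fields (hypothetical helper)."""
    _, src, start, size, strand, src_size, text = line.split()
    species, chrom = src.split(".", 1)
    return {
        "species": species,
        "chrom": chrom,
        "start": int(start),
        "size": int(size),
        "strand": strand,
        "src_size": int(src_size),
        "text": text,
    }


def forward_coords(comp):
    """Return (start, end) of a component on the forward strand."""
    if comp["strand"] == "-":
        # MAF minus-strand starts are relative to the reverse complement.
        start = comp["src_size"] - comp["start"] - comp["size"]
    else:
        start = comp["start"]
    return start, start + comp["size"]


def block_to_row(score, s_lines, species, all_species, block_index,
                 remove_gaps=False):
    """Build the tab-separated row for `species`, or None if it is absent."""
    comps = {}
    for line in s_lines:
        comp = parse_s_line(line)
        comps[comp["species"]] = comp  # assumes one component per species
    target = comps.get(species)
    if target is None:
        return None
    sequences = []
    for name in sorted(all_species):
        # Species missing from this block get an empty field.
        text = comps[name]["text"] if name in comps else ""
        sequences.append(text.replace("-", "") if remove_gaps else text)
    start, end = forward_coords(target)
    fields = [
        target["chrom"], str(start), str(end), target["strand"],
        str(float(score)), "%s_%d_0" % (species, block_index),
    ]
    return "\t".join(fields + sequences)


# Second block of Example 1 from the help text.
EXAMPLE_S_LINES = [
    "s hg18.chr20 56827443 37 + 62435964 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG",
    "s panTro2.chr20 56528760 37 + 62293572 ATGTGCAGAAAATGTGATACAGAAACCTGCAGAGCAG",
    "s rheMac2.chr10 89144181 37 - 94855758 ATGTGCGGAAAATGTGATACAGAAACCTGCAGAGCAG",
]
ALL_SPECIES = ["canFam2", "hg18", "mm8", "panTro2", "rheMac2"]

if __name__ == "__main__":
    print(block_to_row("10289.000000", EXAMPLE_S_LINES, "hg18", ALL_SPECIES, 1))
```

For the hg18 component this produces the same chrom, start, end, strand, score and name values as the second output row of Example 1, with empty fields in the canFam2 and mm8 columns. The real tool additionally splits blocks in which a species occurs more than once, which this sketch does not attempt.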