Small improvement for Extract-Genomic-Sequences tool
Hello, We've encountered a small issue when using the extract-genomic-sequences tool with a TwoBit file and lot's of sequences: memory and CPU usage increases unnecessarily (and processing time, too). Background: a TwoBit (2bit) file is a binary representation of a FASTA file. Unlike nbit files, TwoBit files contain multiple sequences per file. For example, The Drosophila Virilis 'genome' contains 13530 scaffolds. Converting the droVir3 FASTA file into a 2bit file creates a 52MB file. Details: When trying to extract sequences (using extract_genomic_dna.py), Each time a new chromosome/scaffold is encountered, the *entire* 2bit file is loaded and stored. This is unnecessary - the same 2bit file is used for all the scaffolds. (Imagine a BED file with 1000 different scaffolds - memory usage goes to 1000 * 52MB... ). The offending code is (extract_genomic_dna.py) : ------------------- elif twobit_path and os.path.exists( twobit_path ): if chrom in twobits: t = twobits[chrom] else: twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) try: sequence = t[chrom][start:end] ------------------- the hash 'twobits' contains multiple copies of the same 2bit file, over and over again (one per chromosome/scaffold), but only one chromosome/scaffold is read from it. An improvement would be to load the file only once, and each time take the chromosome/scaffold from the already-loaded hash. Here's a possible patch: ======================== --- extract_genomic_dna.py 2009-07-17 12:55:53.000000000 -0400 +++ extract_genomic_dna2.py 2009-07-17 12:55:18.000000000 -0400 @@ -84,6 +84,7 @@ def __main__(): fout = open( output_filename, "w" ) warnings = [] warning = '' + twobitfile = None for i, line in enumerate( open( input_filename ) ): line = line.rstrip( '\r\n' ) @@ -132,12 +133,10 @@ def __main__(): invalid_line = line continue elif twobit_path and os.path.exists( twobit_path ): - if chrom in twobits: - t = twobits[chrom] - else: - twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) + if not twobitfile: + twobitfile = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) try: - sequence = t[chrom][start:end] + sequence = twobitfile[chrom][start:end] except: warning = "Unable to fetch the sequence from '%d' to '%d' for build '%s'. " %( start, end-start, dbkey ) warnings.append( warning ) ======================== Technical note: This code and patch refers to an older revision (before changeset 2457:8dfe971fcc27), but I think the new code still behaves this way. Kudos to Ted Karginov who had to learn python to find and fix this bug. -Gordon.
Dear Assaf and Fed, Thanks for your suggestion. It has been incorporated in the latest changeset (#2488). Guru Galaxy team. On Jul 17, 2009, at 1:08 PM, Assaf Gordon wrote:
Hello,
We've encountered a small issue when using the extract-genomic- sequences tool with a TwoBit file and lot's of sequences: memory and CPU usage increases unnecessarily (and processing time, too).
Background: a TwoBit (2bit) file is a binary representation of a FASTA file. Unlike nbit files, TwoBit files contain multiple sequences per file.
For example, The Drosophila Virilis 'genome' contains 13530 scaffolds. Converting the droVir3 FASTA file into a 2bit file creates a 52MB file.
Details: When trying to extract sequences (using extract_genomic_dna.py), Each time a new chromosome/scaffold is encountered, the *entire* 2bit file is loaded and stored. This is unnecessary - the same 2bit file is used for all the scaffolds. (Imagine a BED file with 1000 different scaffolds - memory usage goes to 1000 * 52MB... ).
The offending code is (extract_genomic_dna.py) : ------------------- elif twobit_path and os.path.exists( twobit_path ): if chrom in twobits: t = twobits[chrom] else: twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) try: sequence = t[chrom][start:end] ------------------- the hash 'twobits' contains multiple copies of the same 2bit file, over and over again (one per chromosome/scaffold), but only one chromosome/scaffold is read from it.
An improvement would be to load the file only once, and each time take the chromosome/scaffold from the already-loaded hash.
Here's a possible patch: ======================== --- extract_genomic_dna.py 2009-07-17 12:55:53.000000000 -0400 +++ extract_genomic_dna2.py 2009-07-17 12:55:18.000000000 -0400 @@ -84,6 +84,7 @@ def __main__(): fout = open( output_filename, "w" ) warnings = [] warning = '' + twobitfile = None
for i, line in enumerate( open( input_filename ) ): line = line.rstrip( '\r\n' ) @@ -132,12 +133,10 @@ def __main__(): invalid_line = line continue elif twobit_path and os.path.exists( twobit_path ): - if chrom in twobits: - t = twobits[chrom] - else: - twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) + if not twobitfile: + twobitfile = bx.seq.twobit.TwoBitFile( file( twobit_path ) ) try: - sequence = t[chrom][start:end] + sequence = twobitfile[chrom][start:end] except: warning = "Unable to fetch the sequence from '%d' to '%d' for build '%s'. " %( start, end-start, dbkey ) warnings.append( warning ) ========================
Technical note: This code and patch refers to an older revision (before changeset 2457:8dfe971fcc27), but I think the new code still behaves this way.
Kudos to Ted Karginov who had to learn python to find and fix this bug.
-Gordon. _______________________________________________ galaxy-dev mailing list galaxy-dev@bx.psu.edu http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-dev
Guruprasad Ananda Graduate Student Bioinformatics and Genomics The Pennsylvania State University
participants (2)
-
Assaf Gordon
-
Guruprasad Ananda