On Jul 17, 2009, at 1:08 PM, Assaf Gordon wrote:

Hello,

We've encountered a small issue when using the extract-genomic-sequences
tool with a TwoBit file and lot's of sequences: memory and CPU usage
increases unnecessarily (and processing time, too).

Background:
a TwoBit (2bit) file is a binary representation of a FASTA file.
Unlike nbit files, TwoBit files contain multiple sequences per file.

For example,
The Drosophila Virilis 'genome' contains 13530 scaffolds.
Converting the droVir3 FASTA file into a 2bit file creates a 52MB file.

Details:
When trying to extract sequences (using extract_genomic_dna.py),
Each time a new chromosome/scaffold is encountered, the *entire* 2bit
file is loaded and stored.
This is unnecessary - the same 2bit file is used for all the scaffolds.
(Imagine a BED file with 1000 different scaffolds - memory usage goes to
1000 * 52MB... ).

The offending code is (extract_genomic_dna.py) :
-------------------
elif twobit_path and os.path.exists( twobit_path ):
if chrom in twobits:
   t = twobits[chrom]
   else:
   twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file(
twobit_path ) )
   try:
   sequence = t[chrom][start:end]
-------------------
the hash 'twobits' contains multiple copies of the same 2bit file, over
and over again (one per chromosome/scaffold), but only one
chromosome/scaffold is read from it.

An improvement would be to load the file only once, and each time take
the chromosome/scaffold from the already-loaded hash.

Here's a possible patch:
========================
--- extract_genomic_dna.py 2009-07-17 12:55:53.000000000 -0400
+++ extract_genomic_dna2.py 2009-07-17 12:55:18.000000000 -0400
@@ -84,6 +84,7 @@ def __main__():
fout = open( output_filename, "w" )
warnings = []
warning = ''
+ twobitfile = None

for i, line in enumerate( open( input_filename ) ):
line = line.rstrip( '\r\n' )
@@ -132,12 +133,10 @@ def __main__():
invalid_line = line
continue
elif twobit_path and os.path.exists( twobit_path ):
- if chrom in twobits:
- t = twobits[chrom]
- else:
- twobits[chrom] = t = bx.seq.twobit.TwoBitFile(
file( twobit_path ) )
+ if not twobitfile:
+ twobitfile = bx.seq.twobit.TwoBitFile( file(
twobit_path ) )
try:
- sequence = t[chrom][start:end]
+ sequence = twobitfile[chrom][start:end]
except:
warning = "Unable to fetch the sequence from '%d'
to '%d' for build '%s'. " %( start, end-start, dbkey )
warnings.append( warning )
========================

Technical note:
This code and patch refers to an older revision (before changeset
2457:8dfe971fcc27), but I think the new code still behaves this way.

Kudos to Ted Karginov who had to learn python to find and fix this bug.

-Gordon.
_______________________________________________
galaxy-dev mailing list
galaxy-dev@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-dev

Guruprasad Ananda

Graduate Student

Bioinformatics and Genomics

The Pennsylvania State University