Hello,
We've encountered a small issue when using the extract-genomic-sequences
tool with a TwoBit file and lot's of sequences: memory and CPU usage
increases unnecessarily (and processing time, too).
Background:
a TwoBit (2bit) file is a binary representation of a FASTA file.
Unlike nbit files, TwoBit files contain multiple sequences per file.
For example,
The Drosophila Virilis 'genome' contains 13530 scaffolds.
Converting the droVir3 FASTA file into a 2bit file creates a 52MB file.
Details:
When trying to extract sequences (using extract_genomic_dna.py),
Each time a new chromosome/scaffold is encountered, the *entire* 2bit
file is loaded and stored.
This is unnecessary - the same 2bit file is used for all the scaffolds.
(Imagine a BED file with 1000 different scaffolds - memory usage goes to
1000 * 52MB... ).
The offending code is (extract_genomic_dna.py) :
-------------------
elif twobit_path and os.path.exists( twobit_path ):
if chrom in twobits:
t = twobits[chrom]
else:
twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file(
twobit_path ) )
try:
sequence = t[chrom][start:end]
-------------------
the hash 'twobits' contains multiple copies of the same 2bit file, over
and over again (one per chromosome/scaffold), but only one
chromosome/scaffold is read from it.
An improvement would be to load the file only once, and each time take
the chromosome/scaffold from the already-loaded hash.
Here's a possible patch:
========================
--- extract_genomic_dna.py 2009-07-17 12:55:53.000000000 -0400
+++ extract_genomic_dna2.py 2009-07-17 12:55:18.000000000 -0400
@@ -84,6 +84,7 @@ def __main__():
fout = open( output_filename, "w" )
warnings = []
warning = ''
+ twobitfile = None
for i, line in enumerate( open( input_filename ) ):
line = line.rstrip( '\r\n' )
@@ -132,12 +133,10 @@ def __main__():
invalid_line = line
continue
elif twobit_path and os.path.exists( twobit_path ):
- if chrom in twobits:
- t = twobits[chrom]
- else:
- twobits[chrom] = t = bx.seq.twobit.TwoBitFile(
file( twobit_path ) )
+ if not twobitfile:
+ twobitfile = bx.seq.twobit.TwoBitFile( file(
twobit_path ) )
try:
- sequence = t[chrom][start:end]
+ sequence = twobitfile[chrom][start:end]
except:
warning = "Unable to fetch the sequence from '%d'
to '%d' for build '%s'. " %( start, end-start, dbkey )
warnings.append( warning )
========================
Technical note:
This code and patch refers to an older revision (before changeset
2457:8dfe971fcc27), but I think the new code still behaves this way.
Kudos to Ted Karginov who had to learn python to find and fix this bug.
-Gordon.
_______________________________________________
galaxy-dev mailing list