New subject: Small improvement for Extract-Genomic-Sequences tool

17 Jul 2009

Hello,

We've encountered a small issue when using the extract-genomic-sequences 
tool with a TwoBit file and lot's of sequences: memory and CPU usage 
increases unnecessarily (and processing time, too).

Background:
a TwoBit (2bit) file is a binary representation of a FASTA file.
Unlike nbit files, TwoBit files contain multiple sequences per file.

For example,
The Drosophila Virilis 'genome' contains 13530 scaffolds.
Converting the droVir3 FASTA file into a 2bit file creates a 52MB file.

Details:
When trying to extract sequences (using extract_genomic_dna.py),
Each time a new chromosome/scaffold is encountered, the *entire* 2bit 
file is loaded and stored.
This is unnecessary - the same 2bit file is used for all the scaffolds.
(Imagine a BED file with 1000 different scaffolds - memory usage goes to 
1000 * 52MB... ).

The offending code is (extract_genomic_dna.py) :
-------------------
elif twobit_path and os.path.exists( twobit_path ):
    if chrom in twobits:
         t = twobits[chrom]
     else:
         twobits[chrom] = t = bx.seq.twobit.TwoBitFile( file( 
twobit_path ) )
     try:
         sequence = t[chrom][start:end]
-------------------
the hash 'twobits' contains multiple copies of the same 2bit file, over 
and over again (one per chromosome/scaffold), but only one 
chromosome/scaffold is read from it.


An improvement would be to load the file only once, and each time take 
the chromosome/scaffold from the already-loaded hash.

Here's a possible patch:
========================

--- extract_genomic_dna.py          2009-07-17 12:55:53.000000000 -0400
+++ extract_genomic_dna2.py         2009-07-17 12:55:18.000000000 -0400
@@ -84,6 +84,7 @@ def __main__():
      fout = open( output_filename, "w" )
      warnings = []
      warning = ''
+    twobitfile = None

      for i, line in enumerate( open( input_filename ) ):
          line = line.rstrip( '\r\n' )
@@ -132,12 +133,10 @@ def __main__():
                          invalid_line = line
                      continue
              elif twobit_path and os.path.exists( twobit_path ):
-                if chrom in twobits:
-                    t = twobits[chrom]
-                else:
-                    twobits[chrom] = t = bx.seq.twobit.TwoBitFile( 
file( twobit_path ) )
+                if not twobitfile:
+                    twobitfile = bx.seq.twobit.TwoBitFile( file( 
twobit_path ) )
                  try:
-                    sequence = t[chrom][start:end]
+                    sequence = twobitfile[chrom][start:end]
                  except:
                      warning = "Unable to fetch the sequence from '%d' 
to '%d' for build '%s'. " %( start, end-start, dbkey )
                      warnings.append( warning )
========================


Technical note:
This code and patch refers to an older revision (before changeset 
2457:8dfe971fcc27), but I think the new code still behaves this way.



Kudos to Ted Karginov who had to learn python to find and fix this bug.



-Gordon.

    

Small improvement for Extract-Genomic-Sequences tool

Assaf Gordon

Guruprasad Ananda

tags

participants (2)