Re: [galaxy-user] fasta_to_tabular.py slowness

20 Jul 2009

Hi all;

Rasmus:
...
...
...
I've found that fasta_to_tabular.py is very slow with big sequences,  
e.g. ~4 minutes for a single 5MB sequence.
[...]
-                fasta_seq = "%s%s" % ( fasta_seq, line )
+                fasta_seq += line
Bob:
...
...
I suspect an additional improvement would be seen by keeping fasta_seq
as a list of strings, using fasta_seq.append(line), and then concatenating
them together with "".join when it's time to output.
Greg:
...
Think about memory when you have large files...
The memory usage shouldn't be any different than the current
implementation since an entire sequence is read into memory, and
then written to the output file. Bob's list/join approach is the
standard way to quickly do this, although in Python 2.5 and above
the concatenation approach is almost as good. The Python wiki has a
good summary of this common speed-up improvement:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips#StringConcatenation
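
For illustration, here is a minimal sketch of the list/join approach.
It is not the actual fasta_to_tabular.py code: the function name
fasta_to_tabular_rows is hypothetical, and the parsing is simplified
(no title-column options, written for Python 3). Each sequence line is
appended to a list, and the pieces are joined once per record.

```python
import io


def fasta_to_tabular_rows(handle):
    """Yield (title, sequence) pairs from a FASTA file handle.

    Hypothetical helper for illustration only. Sequence lines are
    collected in a list and joined once per record, instead of
    rebuilding one growing string on every line.
    """
    title = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line)
    # Emit the final record.
    if title is not None:
        yield title, "".join(chunks)


# Small in-memory example standing in for a real FASTA file.
example = io.StringIO(">seq1 test\nACGT\nTTGA\n>seq2\nGG\n")
rows = list(fasta_to_tabular_rows(example))
for title, seq in rows:
    print("%s\t%s" % (title, seq))
```

The whole sequence is still held in memory here, just as in the
current implementation; only the cost of building it changes.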

Definitely worth adding. If memory is a problem the code could
be improved to read in a specified number of lines and write them
incrementally to the output file instead of breaking at sequence
records.
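
A rough sketch of that incremental idea, under the same assumptions as
before: the function name fasta_to_tabular_stream and the
lines_per_write parameter are made up for this example and are not
part of the Galaxy script. Sequence lines are buffered and flushed to
the output every lines_per_write lines, so a single huge record is
never held in memory in full.

```python
import io


def fasta_to_tabular_stream(in_handle, out_handle, lines_per_write=1000):
    """Convert FASTA to tab-separated rows, writing incrementally.

    Hypothetical helper for illustration only. At most
    lines_per_write sequence lines are buffered before being written.
    """
    buffered = []
    started = False  # True once the first ">" title line has been seen
    for line in in_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Flush what is left of the previous record and end its row.
            out_handle.write("".join(buffered))
            buffered = []
            if started:
                out_handle.write("\n")
            out_handle.write(line[1:] + "\t")
            started = True
        elif started:
            buffered.append(line)
            if len(buffered) >= lines_per_write:
                out_handle.write("".join(buffered))
                buffered = []
    # Finish the last row.
    if started:
        out_handle.write("".join(buffered) + "\n")


# Small in-memory example; a tiny buffer size forces mid-record flushes.
out = io.StringIO()
fasta_to_tabular_stream(io.StringIO(">a\nAC\nGT\nAA\n>b\nTT\n"), out,
                        lines_per_write=2)
result = out.getvalue()
```

The trade-off is that the title has to be written before the sequence
is complete, so any per-record processing that needs the full sequence
would not fit this pattern.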

Brad

Brad Chapman