Hi all;

Rasmus:
I've found that fasta_to_tabular.py is very slow with big sequences, e.g. ~4 minutes for a single 5MB sequence. [...]
- fasta_seq = "%s%s" % ( fasta_seq, line )
+ fasta_seq += line
Bob:
I suspect an additional improvement would be seen by keeping fasta_seq as a list of strings, using fasta_seq.append(line), and then concatenating them together with "".join when it's time to output.
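
For illustration, here is a minimal sketch of the list/join idea. It is not the actual fasta_to_tabular.py code: the function name, the "title<TAB>sequence" output layout and the demo data are assumptions made only for this example.

```python
import io


def fasta_to_tabular(in_handle, out_handle):
    """Write one 'title<TAB>sequence' line per FASTA record (assumed layout)."""
    title = None
    seq_parts = []  # sequence lines are collected here and joined once per record
    for line in in_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                out_handle.write("%s\t%s\n" % (title, "".join(seq_parts)))
            title = line[1:]
            seq_parts = []
        elif title is not None:
            # append() is cheap; the single join above does the copying.
            seq_parts.append(line)
    # Flush the final record.
    if title is not None:
        out_handle.write("%s\t%s\n" % (title, "".join(seq_parts)))


# Hypothetical demo input, held in memory instead of a real file.
demo_in = io.StringIO(">seq1 desc\nACGT\nTTGA\n>seq2\nGG\n")
demo_out = io.StringIO()
fasta_to_tabular(demo_in, demo_out)
```

Each sequence line is copied only once, at join time, instead of the whole accumulated string being rebuilt on every line.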
Greg:
Think about memory when you have large files...

The memory usage shouldn't be any different than the current implementation since an entire sequence is read into memory, and then written to the output file.

Bob's list/join approach is the standard way to quickly do this, although in Python 2.5 and above the concatenation approach is almost as good. The Python wiki has a good summary of this common speed-up:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips#StringConcatenation

Definitely worth adding. If memory is a problem, the code could be improved to read in a specified number of lines and write them incrementally to the output file instead of breaking at sequence records.

Brad
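
A rough sketch of the incremental idea, in case it is useful. It is a simplified variant rather than a patch for fasta_to_tabular.py: it writes each sequence line as soon as it is read instead of batching a specified number of lines, and relies on the output file's own buffering to group the writes. The function name and the "title<TAB>sequence" layout are assumptions for the example.

```python
import io


def fasta_to_tabular_streaming(in_handle, out_handle):
    """Convert FASTA to 'title<TAB>sequence' lines without holding a whole record."""
    in_record = False
    for line in in_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # Terminate the previous output row before starting a new one.
            if in_record:
                out_handle.write("\n")
            out_handle.write(line[1:] + "\t")
            in_record = True
        elif in_record:
            # Sequence data goes straight to the output; nothing accumulates.
            out_handle.write(line)
    # Terminate the last row.
    if in_record:
        out_handle.write("\n")


# Hypothetical demo input, held in memory instead of a real file.
stream_in = io.StringIO(">seq1 desc\nACGT\nTTGA\n>seq2\nGG\n")
stream_out = io.StringIO()
fasta_to_tabular_streaming(stream_in, stream_out)
```

With this shape, peak memory depends on the longest input line rather than on the longest sequence.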