Re: [galaxy-user] fasta_to_tabular.py slowness

22 Jul 2009

Hello Rasmus,

The fix for this should be pushed out to our public repo shortly, and 
available on our main site as well.  I've opened the following ticket in 
bitbucket so you can "follow" it if you want.

http://bitbucket.org/galaxy/galaxy-central/issue/112/fix-fasta_to_tabularpy-...

Greg Von Kuster
Galaxy Development Team



Rasmus Ory Nielsen wrote:
...
Hi Greg,
I was in the middle of writing a mail with a message very similar to what Brad Chapman just sent.
Therefore I will just send my time comparisons to back up my initial mail.
At the moment it is not impossible to convert a few large sequences, but you need a lot of time to do it.
Below are two tests I just ran. Each test converts a single sequence, comparing the original and the patched version (+= approach) of fasta_to_tabular.py.
Thanks.
Best regards,
Rasmus Ory Nielsen
------------------------------------------------------------
[roni@galaxy]$ ls -lh test.fa
-rw-rw-r-- 1 roni roni 5.9M 2009-07-20 15:24 test.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test.fa test.tab 0
real    0m0.214s
user    0m0.139s
sys     0m0.024s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test.fa test.tab.orig 0
real    2m37.114s
user    1m53.467s
sys     0m43.531s
And with a bigger file:
[roni@galaxy]$ ls -lh test2.fa
-rw-rw-r-- 1 roni roni 12M 2009-07-20 15:33 test2.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test2.fa test2.tab 0
real    0m0.413s
user    0m0.264s
sys     0m0.050s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test2.fa test2.tab.orig 0
real    13m30.621s
user    9m18.316s
sys     4m12.081s
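
The gap between the two versions timed above is consistent with quadratic copying: "%s%s" % ( fasta_seq, line ) builds a brand-new string from both operands on every line, so the whole sequence read so far is copied again each time, whereas CPython can usually extend the string in place for +=. The following is only a rough sketch for reproducing the effect on synthetic data. The function names and the input size are made up for illustration, it is not the Galaxy script itself, and absolute timings will differ by machine and Python version.

```python
import time


def build_percent(lines):
    # Original approach: format a new string from the old one on every line.
    seq = ""
    for line in lines:
        seq = "%s%s" % (seq, line)
    return seq


def build_plus_equals(lines):
    # Patched approach: in-place concatenation with +=.
    seq = ""
    for line in lines:
        seq += line
    return seq


def build_join(lines):
    # Alternative: hand all the pieces to str.join in one call.
    return "".join(lines)


# Synthetic input (an assumption for this sketch): 5,000 lines of 60 bases.
lines = ["ACGT" * 15] * 5000

results = {}
for fn in (build_percent, build_plus_equals, build_join):
    start = time.perf_counter()
    results[fn.__name__] = fn(lines)
    elapsed = time.perf_counter() - start
    print("%-18s %.4fs" % (fn.__name__, elapsed))
```

All three functions return the same string; only the amount of copying differs, and the difference grows with the length of the sequence.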
________________________________________
From: Greg Von Kuster [ghv2@psu.edu]
Sent: 20 July 2009 14:44
To: Bob Harris
Cc: galaxy-user@bx.psu.edu; Rasmus Ory Nielsen
Subject: Re: [galaxy-user] fasta_to_tabular.py slowness
Think about memory when you have large files...
Bob Harris wrote:
...
I suspect an additional improvement would be seen by keeping fasta_seq
as a list of strings, using fasta_seq.append(line), and then concatenating
them together with "".join when it's time to output.
Mind you, I haven't tested that though.
Bob H
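
A minimal sketch of the list-and-join idea described above might look like the following. It is untested against the real fasta_to_tabular.py and makes some simplifying assumptions: the function name fasta_to_rows is hypothetical, it reads from any iterable of lines rather than from a file named on the command line, and it leaves out the keep_first handling of the original script.

```python
def fasta_to_rows(lines):
    """Yield (title, sequence) pairs from an iterable of FASTA lines."""
    title = None
    chunks = []  # sequence lines collected for the current record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if title is not None:
                yield title, "".join(chunks)
            title = line[1:]
            chunks = []
        elif title is not None:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if title is not None:
        yield title, "".join(chunks)


example = [">seq1 demo\n", "ACGT\n", "TTGA\n", ">seq2\n", "GG\n"]
rows = list(fasta_to_rows(example))
for title, seq in rows:
    print("%s\t%s" % (title, seq))
```

Each sequence is joined only once, when its record is complete, so no partial sequence is copied repeatedly. The list of lines and the joined string both exist in memory at that moment, which is worth keeping in mind for very large sequences.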
On Jul 18, 2009, at 3:10 PM, Rasmus Ory Nielsen wrote:
...
Hi Galaxy Team,
I've found that fasta_to_tabular.py is very slow with big sequences,
e.g. ~4 minutes for a single 5MB sequence.
The patch below makes the running time go from minutes to seconds
for such a sequence. Mind you, this is my first line of python, so
there may be a smarter way.
Best regards,
Rasmus Ory Nielsen

--- fasta_to_tabular.py.orig 2009-07-18 16:25:50.896487000 +0200
+++ fasta_to_tabular.py      2009-07-18 17:22:49.544611000 +0200
@@ -34,7 +34,7 @@
            fasta_seq = ''
        else:
            if line:
-                fasta_seq = "%s%s" % ( fasta_seq, line )
+                fasta_seq += line
if fasta_seq:
        out.write( "%s\t%s\n" %( fasta_title[ 1:keep_first ],
fasta_seq ) )
_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user