Hi Galaxy Team,

I've found that fasta_to_tabular.py is very slow with big sequences, e.g. ~4 minutes for a single 5MB sequence.

The patch below makes the running time go from minutes to seconds for such a sequence. Mind you, this is my first line of Python, so there may be a smarter way.

Best regards,
Rasmus Ory Nielsen

--- fasta_to_tabular.py.orig 2009-07-18 16:25:50.896487000 +0200
+++ fasta_to_tabular.py 2009-07-18 17:22:49.544611000 +0200
@@ -34,7 +34,7 @@
             fasta_seq = ''
         else:
             if line:
-                fasta_seq = "%s%s" % ( fasta_seq, line )
+                fasta_seq += line
 
     if fasta_seq:
         out.write( "%s\t%s\n" %( fasta_title[ 1:keep_first ], fasta_seq ) )
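For readers following along, here is a minimal sketch of the two accumulation styles in the patch. The function names and the sample data are made up for illustration and are not part of fasta_to_tabular.py. With % formatting, every iteration builds a brand-new string containing everything accumulated so far, so the total copying grows roughly with the square of the sequence length. With +=, CPython can often extend the existing string in place, but that is an implementation detail rather than a language guarantee.

```python
def accumulate_with_format(lines):
    """Original style: build a new string with % formatting on every line."""
    fasta_seq = ""
    for line in lines:
        # Copies the whole accumulated sequence plus the new line each time.
        fasta_seq = "%s%s" % (fasta_seq, line)
    return fasta_seq


def accumulate_with_iadd(lines):
    """Patched style: append each line with +=."""
    fasta_seq = ""
    for line in lines:
        fasta_seq += line
    return fasta_seq


# Hypothetical input: 1000 lines of 60 bases each.
lines = ["ACGT" * 15] * 1000

# Both styles produce the same sequence; only the amount of copying differs.
assert accumulate_with_format(lines) == accumulate_with_iadd(lines)
```

The two functions are interchangeable in terms of output, which is why the one-line patch is safe to apply.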
I suspect an additional improvement would be seen by keeping fasta_seq as a list of strings, using fasta_seq.append(line), and then concatenating them with "".join when it's time to output.

Mind you, I haven't tested that, though.

Bob H

_______________________________________________
galaxy-user mailing list
galaxy-user@bx.psu.edu
http://mail.bx.psu.edu/cgi-bin/mailman/listinfo/galaxy-user
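Bob's list-and-join suggestion could look roughly like the sketch below. This is not the actual Galaxy code. The function name, the generator interface, and the handling of blank lines are assumptions made for illustration; only the title[1:keep_first] slice and the tab-separated output are taken from the diff context. It is written for Python 3.

```python
def fasta_to_tabular(lines, keep_first=None):
    """Yield 'title<TAB>sequence' rows, collecting sequence lines in a list.

    keep_first is used as a slice end on the title line, as in the diff;
    None (an assumed default) keeps the whole title.
    """
    title = None
    parts = []  # sequence lines of the current record
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                # Join once per record instead of concatenating per line.
                yield "%s\t%s" % (title[1:keep_first], "".join(parts))
            title = line
            parts = []
        else:
            parts.append(line)
    if title is not None:
        yield "%s\t%s" % (title[1:keep_first], "".join(parts))


rows = list(fasta_to_tabular([">seq1 desc\n", "ACGT\n", "TTGA\n", ">seq2\n", "GG\n"]))
# rows == ["seq1 desc\tACGTTTGA", "seq2\tGG"]
```

Appending to a list is cheap, and "".join copies each piece exactly once, so the cost stays proportional to the sequence length regardless of interpreter optimizations.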
Think about memory when you have large files...
Hi all;

Rasmus:
I've found that fasta_to_tabular.py is very slow with big sequences, e.g. ~4 minutes for a single 5MB sequence.
[...]
- fasta_seq = "%s%s" % ( fasta_seq, line )
+ fasta_seq += line

Bob:
I suspect an additional improvement would be seen by keeping fasta_seq as a list of strings, using fasta_seq.append(line), and then concatenating them with "".join when it's time to output.

Greg:
Think about memory when you have large files...

The memory usage shouldn't be any different from the current implementation, since an entire sequence is read into memory and then written to the output file.

Bob's list/join approach is the standard way to do this quickly, although in Python 2.5 and above the concatenation approach is almost as good. The Python wiki has a good summary of this common speed-up:

http://wiki.python.org/moin/PythonSpeed/PerformanceTips#StringConcatenation

Definitely worth adding. If memory is a problem, the code could be improved to read in a specified number of lines and write them incrementally to the output file, instead of breaking at sequence records.

Brad
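As a rough illustration of the incremental idea, the sketch below writes each sequence line to the output as soon as it is read, so memory use is bounded by the longest line rather than by the longest sequence. It simplifies Brad's suggestion by writing line by line instead of in batches of a specified number of lines, and relies on the file object's own buffering to batch the writes. The function name and signature are hypothetical, and this is not the code that went into Galaxy.

```python
import io


def stream_fasta_to_tabular(infile, outfile, keep_first=None):
    """Convert FASTA to tabular without holding a whole sequence in memory."""
    in_record = False
    for raw in infile:
        line = raw.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if in_record:
                outfile.write("\n")  # finish the previous row
            # Start a new row: title, then a tab, then sequence pieces follow.
            outfile.write(line[1:keep_first] + "\t")
            in_record = True
        elif in_record:
            outfile.write(line)  # append this piece of the sequence directly
    if in_record:
        outfile.write("\n")


# Small in-memory demonstration; real use would pass open file objects.
src = io.StringIO(">a\nAC\nGT\n>b\nTT\n")
dst = io.StringIO()
stream_fasta_to_tabular(src, dst)
# dst.getvalue() == "a\tACGT\nb\tTT\n"
```

The trade-off is that a row can no longer be inspected or filtered as a whole before it is written, which matters for tools that need the full sequence, such as a length filter.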
Hi Greg,

I was in the middle of writing a mail with a message very similar to what Brad Chapman just sent, so I will just send my timing comparisons to back up my initial mail.

At the moment, converting a few large sequences is not impossible, but you need to have a lot of time.

Below are two tests I just ran. Both convert a single sequence, comparing the original and the patched version (+= approach) of fasta_to_tabular.py.

Thanks.

Best regards,
Rasmus Ory Nielsen

------------------------------------------------------------
[roni@galaxy]$ ls -lh test.fa
-rw-rw-r-- 1 roni roni 5.9M 2009-07-20 15:24 test.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test.fa test.tab 0

real 0m0.214s
user 0m0.139s
sys 0m0.024s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test.fa test.tab.orig 0

real 2m37.114s
user 1m53.467s
sys 0m43.531s

And with a bigger file:

[roni@galaxy]$ ls -lh test2.fa
-rw-rw-r-- 1 roni roni 12M 2009-07-20 15:33 test2.fa
[roni@galaxy]$ time ./fasta_to_tabular.py test2.fa test2.tab 0

real 0m0.413s
user 0m0.264s
sys 0m0.050s
[roni@galaxy]$ time ./fasta_to_tabular.py.orig test2.fa test2.tab.orig 0

real 13m30.621s
user 9m18.316s
sys 4m12.081s

________________________________________
From: Greg Von Kuster [ghv2@psu.edu]
Sent: 20 July 2009 14:44
To: Bob Harris
Cc: galaxy-user@bx.psu.edu; Rasmus Ory Nielsen
Subject: Re: [galaxy-user] fasta_to_tabular.py slowness

Think about memory when you have large files...
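The wall-clock timings Rasmus posted can be turned into a rough growth estimate. The sketch below assumes the running time scales as size**k and solves for k from the two data points. The file sizes are the rounded values printed by ls -lh, and two measurements are very little to go on, so the result is only indicative. All names in the code are made up for illustration.

```python
import math


def seconds(minutes, secs):
    """Convert the 'XmY.YYYs' values printed by time into seconds."""
    return minutes * 60 + secs


def growth_exponent(size_a, time_a, size_b, time_b):
    """Solve time ~ size**k for k, given two (size, time) measurements."""
    return math.log(time_b / time_a) / math.log(size_b / size_a)


# Sizes in MB as reported by ls -lh (rounded), 'real' times from the thread.
small_mb, large_mb = 5.9, 12.0
orig_exp = growth_exponent(small_mb, seconds(2, 37.114),
                           large_mb, seconds(13, 30.621))
patched_exp = growth_exponent(small_mb, seconds(0, 0.214),
                              large_mb, seconds(0, 0.413))

print("original: k = %.2f" % orig_exp)    # about 2.3
print("patched:  k = %.2f" % patched_exp)  # about 0.9
```

An exponent a little above 2 for the original script fits the picture of the whole accumulated string being copied on every line, while an exponent near 1 for the patched script suggests roughly linear behaviour.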
Here is the original thread: http://mail.bx.psu.edu/pipermail/galaxy-user/2009-January/000439.html
James,

Thanks very much for originally reporting this issue, and we really apologize for the lack of response until now. Messages like yours are extremely important to us, and we make our best attempt at responding to them and incorporating fixes in a timely manner. This is one of those times where we wish we had done things differently.

I've opened the following issue in bitbucket and this fix is currently under way and will soon be available in the distribution and on our main server.

Thanks James,

Greg Von Kuster
Galaxy Development Team
2009/7/22 Greg Von Kuster <ghv2@psu.edu>:
I've opened the following issue in bitbucket and this fix is currently under way and will soon be available in the distribution and on our main server.
Great!

Thanks Greg, I didn't want the overhead of maintaining my own fork!

Keep up the good work,
James
James,

Your contributed code has been used to replace the original versions of the fasta_filter_by_length.py and fasta_to_tabular.py files. These fixes have been pushed to the distribution as well.

Thanks again for your contributions here, and please overlook our initial lack of response. I promise it won't happen again.

Greg
Hello Rasmus,

The fix for this should be pushed out to our public repo shortly, and available on our main site as well. I've opened the following ticket in bitbucket so you can "follow" it if you want.

http://bitbucket.org/galaxy/galaxy-central/issue/112/fix-fasta_to_tabularpy-...

Greg Von Kuster
Galaxy Development Team
Hi Greg,

This is great. Thanks.

Best regards,
Rasmus Ory Nielsen

________________________________________
From: Greg Von Kuster [ghv2@psu.edu]
Sent: 22 July 2009 20:28
To: Rasmus Ory Nielsen
Cc: galaxy-user@bx.psu.edu
Subject: Re: SV: [galaxy-user] fasta_to_tabular.py slowness

Hello Rasmus,

The fix for this should be pushed out to our public repo shortly, and available on our main site as well. I've opened the following ticket in bitbucket so you can "follow" it if you want.

http://bitbucket.org/galaxy/galaxy-central/issue/112/fix-fasta_to_tabularpy-...

Greg Von Kuster
Galaxy Development Team
2009/7/18 Rasmus Ory Nielsen <Rasmus.Nielsen@agrsci.dk>:
Hi Galaxy Team,
I've found that fasta_to_tabular.py is very slow with big sequences, e.g. ~4 minutes for a single 5MB sequence.
The patch below makes the running time go from minutes to seconds for such a sequence. Mind you, this is my first line of python, so there may be a smarter way.
I sent through similar patches months ago - they got ignored by the core team, unfortunately.

cheers,
James
participants (5)
- Bob Harris
- Brad Chapman
- Greg Von Kuster
- James Casbon
- Rasmus Ory Nielsen