heads-up: more galaxy slowness in sequence.py
Hello all,

Continuing the search for slowness in my local Galaxy server (see http://lists.bx.psu.edu/pipermail/galaxy-dev/2009-December/001549.html ):

The datatypes/sequence.py file also scans and parses entire files when creating a new FASTA/FASTQ file. It's nice and fun and informative for small files, but with a 2.7GB FASTA file the python process stays at 100% CPU for a long, long time, causing everything else to be very slow.

The offending code is in sequence.py, method "set_meta", lines 30-39.

I think Illumina expects 25x coverage of the human genome in a single run by the end of the year - this will roughly translate to 8 FASTQ files of more than 8GB each => FASTA files of 4GB each... Galaxy will not be able to just casually scan these files.

-gordon
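To illustrate the cost described above: counting the records in a FASTA file means reading every line, so the work grows linearly with file size. The following is a minimal sketch, not the actual code in sequence.py; the function name `count_fasta_sequences` and the `MAX_SCAN_BYTES` cutoff are hypothetical and chosen only for illustration.

```python
import os

# Hypothetical cutoff: files larger than this are not scanned in full.
MAX_SCAN_BYTES = 50 * 1024 * 1024


def count_fasta_sequences(path, max_scan_bytes=MAX_SCAN_BYTES):
    """Return the number of FASTA records, or None if the file is too large.

    Every line has to be read to find the '>' header lines, so the cost
    grows linearly with file size. Returning None for large files lets the
    caller leave the count unset instead of blocking on a multi-GB scan.
    """
    if os.path.getsize(path) > max_scan_bytes:
        return None
    count = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                count += 1
    return count
```

With a cap like this, small files still get a sequence count, while very large files skip the scan and leave that piece of metadata empty.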
In reply to Assaf Gordon, Jan 6, 2010, 5:13 PM:

Hello Assaf,

Is your instance configured to set metadata externally (on your cluster nodes)? If not, add the following to the [app:main] section of your universe_wsgi.ini file:

set_metadata_externally = True

Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu

_______________________________________________
galaxy-dev mailing list
galaxy-dev@lists.bx.psu.edu
http://lists.bx.psu.edu/listinfo/galaxy-dev
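One way to confirm that the option is actually present in the [app:main] section is to parse the file with a standard INI reader. The sketch below uses Python 3's `configparser` on an inline sample; it is not how Galaxy itself loads its configuration, and the helper name `metadata_set_externally` is hypothetical.

```python
import configparser

# Inline stand-in for the relevant part of universe_wsgi.ini.
SAMPLE_INI = """
[app:main]
set_metadata_externally = True
"""


def metadata_set_externally(ini_text):
    """Return True if [app:main] enables set_metadata_externally."""
    # interpolation=None keeps any '%' characters in other options literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    # fallback=False covers a missing section or a missing option.
    return parser.getboolean(
        "app:main", "set_metadata_externally", fallback=False
    )


print(metadata_set_externally(SAMPLE_INI))
```

To check a real file, the text could be read from disk and passed in the same way.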
In reply to Greg Von Kuster, 01/08/2010 10:54 AM:

It is set to "False", but my galaxy runs jobs locally, not on a cluster... (at least, not directly through the SGE Runner).

Does this work with the local runner too (i.e. starting a new process to set the metadata)? Also, does the "external" method work when the user changes the type in the "Edit Attributes" page?
In reply to Assaf Gordon, Jan 8, 2010, 12:25 PM:

A dataset's set_meta() is done as part of the job, so if you are not running jobs on a cluster, set_meta() will be run locally as well, which is certainly chewing up CPU on your server.

If running externally, set_meta() will run on the cluster when the user does anything in the "Edit Attributes" page that calls set_meta(), including "Auto-detect".

As soon as I get a chance, I'll look at enhancing set_meta() to check if "set_metadata_externally" is True for those data types that take significant processing, and if jobs are running locally, metadata will be set differently.

Greg Von Kuster
Galaxy Development Team
greg@bx.psu.edu
Greg Von Kuster wrote, On 01/08/2010 01:02 PM:
> A dataset's set_meta() is done as part of the job, so if you are not running jobs on a cluster, set_meta() will be run locally as well, which is certainly chewing up CPU on your server.

I don't mind it running locally, I have several CPUs to spare - the problem is that it seems to be running in a thread inside the main galaxy process, which slows all of galaxy. If there's a way to have set_meta called externally with the local runner (as another local job, in a different process), this would also solve the issue (I think).

> If running externally, set_meta() will run on the cluster when the user does anything in the "Edit Attributes" page that calls set_meta(), including "Auto-detect".

This is interesting, but how does galaxy know to submit an "Edit Attributes" job to the cluster? Does it do "qsub" with the default runner? I'm asking because even when/if I switch to use the cluster, the default runner will still be local, and only some specific jobs will have an "sge://" runner. How would galaxy then know to submit the job to the SGE cluster?

> As soon as I get a chance, I'll look at enhancing set_meta() to check if "set_metadata_externally" is True for those data types that take significant processing, and if jobs are running locally, metadata will be set differently.

I'll be more than happy to beta-test this feature. Let me know if I can assist.

Thanks for all your help!
-gordon
Assaf Gordon wrote:
> Greg Von Kuster wrote, On 01/08/2010 01:02 PM:
>> A dataset's set_meta() is done as part of the job, so if you are not running jobs on a cluster, set_meta() will be run locally as well, which is certainly chewing up CPU on your server.
>
> I don't mind it running locally, I have several CPUs to spare - the problem is that it seems to be running in a thread inside the main galaxy process, which slows all of galaxy. If there's a way to have set_meta called externally with the local runner (as another local job, in a different process), this would also solve the issue (I think).

Even if using the local runner, set_metadata_externally will cause the metadata code to run in a separate process, which (python-wise) would be a huge help for performance.

>> If running externally, set_meta() will run on the cluster when the user does anything in the "Edit Attributes" page that calls set_meta(), including "Auto-detect".
>
> This is interesting, but how does galaxy know to submit an "Edit Attributes" job to the cluster? Does it do "qsub" with the default runner? I'm asking because even when/if I switch to use the cluster, the default runner will still be local, and only some specific jobs will have an "sge://" runner. How would galaxy then know to submit the job to the SGE cluster?

It gets a tool id, '__SET_METADATA__', and is submitted through the regular job runner. I just tested and you can set it in universe_wsgi.ini as you would any other job runner override.

>> As soon as I get a chance, I'll look at enhancing set_meta() to check if "set_metadata_externally" is True for those data types that take significant processing, and if jobs are running locally, metadata will be set differently.
>
> I'll be more than happy to beta-test this feature. Let me know if I can assist.

This is already implemented since auto-detect is run as a job.

--nate
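The point about a separate process being a big help "python-wise" can be illustrated with a small sketch. A CPU-bound loop running in a thread competes with every other thread in the same interpreter, while the same loop in a child interpreter does not. This is only an illustration of the general idea, not how Galaxy launches its metadata job; `COUNT_SCRIPT` and `count_sequences_in_child` are hypothetical names.

```python
import subprocess
import sys

# Script run by the child interpreter: counts FASTA header lines
# in the file whose path is passed as the first argument.
COUNT_SCRIPT = """
import sys
count = 0
with open(sys.argv[1], "rb") as handle:
    for line in handle:
        if line.startswith(b">"):
            count += 1
print(count)
"""


def count_sequences_in_child(path):
    """Run the scan in a separate Python interpreter and return the count.

    The child process has its own interpreter lock, so its CPU-bound loop
    does not compete with threads in the calling process.
    """
    result = subprocess.run(
        [sys.executable, "-c", COUNT_SCRIPT, path],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

The calling thread still waits for the child to finish, but it is idle while waiting, so other threads in the parent process (for example, ones serving web requests) keep running.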
In reply to Nate Coraor, 01/08/2010 01:39 PM:

Hello Nate, Greg,

Thanks for your help, with external set_meta everything works much better. The galaxy process is down to 23% CPU because of some SQLAlchemy thing, but that's for another time.

One tiny issue: if a user goes to "Edit Attributes" and changes the file type directly (not with Auto-Detect), set_meta is still executed inside the galaxy process as a thread. While this shouldn't happen often, it does happen in two cases:
1. When users click it by mistake (it does happen).
2. When there's a need to switch between txt/tabular/interval files.

Thanks again,
-gordon
participants (3)
- Assaf Gordon
- Greg Von Kuster
- Nate Coraor