Loading a library of BAM files
I'm currently in the process of loading (via path paste) a large library of BAM files (>10,000) into the shared Data Libraries of our local Galaxy installation, but I'm finding this process to be very slow. I'm doing a path paste, not actually copying the files. I have disabled local running of 'upload1' so that it will run on the cluster, and set 'set_metadata_externally' to true.

It looks like the job handlers are calling 'samtools index' directly. Looking through the code, that seems to happen in galaxy/datatypes/binary in Bam.dataset_content_needs_grooming, where it calls 'samtools index' and then waits. What would be the most efficient way to start changing the code so that this work can be done by an external script, at a deferred time, out on the cluster?

Kyle
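For context, the pattern being described amounts to a blocking subprocess call made inline by the job handler. The following is only a simplified sketch of that pattern, not Galaxy's actual code (the real check lives in lib/galaxy/datatypes/binary.py and differs in detail):

    # Simplified sketch of a synchronous grooming check; NOT the actual
    # Galaxy implementation in lib/galaxy/datatypes/binary.py.
    import os
    import subprocess
    import tempfile

    def dataset_content_needs_grooming(bam_path):
        # Try to build an index; 'samtools index' fails if the BAM is not
        # coordinate-sorted, which is what "needs grooming" means here.
        fd, index_path = tempfile.mkstemp(suffix=".bai")
        os.close(fd)
        try:
            proc = subprocess.Popen(
                ["samtools", "index", bam_path, index_path],
                stderr=subprocess.PIPE,
            )
            # The job handler blocks here until samtools finishes. With
            # thousands of large BAMs, all of the indexing is serialized on
            # the machine running the handler.
            proc.communicate()
            return proc.returncode != 0
        finally:
            if os.path.exists(index_path):
                os.remove(index_path)

Because each BAM is checked (and potentially indexed) one at a time in the handler process, a large path-paste library load crawls.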
I also second this request to get it addressed (Where can we vote on bug fixes?! :) ... It is very weird that samtools is run on the local machine, and that it even does the indexing sequentially.

Thon
I'm willing to put in the coding time, but I'd need some pointers on the best way to go about making the changes.

Kyle
Hi Kyle,

I'm hoping I can help you a bit on this, although I am not very familiar with the code that is producing this behavior. Your previous reply mentions the following:

During job cleanup, galaxy.jobs.__init__.py:412, because external_metadata_set_successfully returns false. An external set_metadata.sh job was run, but it doesn't seem to call samtools. Maybe if I figure out why set_metadata.sh isn't working, this problem will go away.

Based on your comments, there are a few things you can do:

1. If setting external metadata results in an error, the error should be printed in your paster log. Do you see anything relevant there?

2. You may also be able to discover the error by running the following SQL manually - make sure you have the correct job_id:

select filename_results_code from job_external_output_metadata where job_id = <job_id>;

3. Make sure you have the following config setting uncommented and set to False in your universe_wsgi.ini (the default is True):

# Although it is fairly reliable, setting metadata can occasionally fail. In
# these instances, you can choose to retry setting it internally or leave it in
# a failed state (since retrying internally may cause the Galaxy process to be
# unresponsive). If this option is set to False, the user will be given the
# option to retry externally, or set metadata manually (when possible).
retry_metadata_internally = False

Let me know if any of this helps you resolve the problem, and if not, we'll figure out next steps if possible.

Thanks,

Greg Von Kuster
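If it's easier than opening a database shell, the same lookup can be scripted. This is only an illustrative helper, assuming the default SQLite database at database/universe.sqlite (use the appropriate driver and connection string if your instance runs PostgreSQL or MySQL) and assuming filename_results_code is a path to the small results file written by the external set_metadata job:

    # Hypothetical helper: look up the external-metadata results file for a job.
    import sqlite3

    JOB_ID = 12345                          # replace with the failing job's id
    DB_PATH = "database/universe.sqlite"    # default SQLite location; adjust as needed

    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        "select filename_results_code from job_external_output_metadata "
        "where job_id = ?",
        (JOB_ID,),
    ).fetchall()
    conn.close()

    for (results_code_file,) in rows:
        print(results_code_file)
        try:
            # The file's contents record whether external metadata setting
            # succeeded and, if not, why.
            with open(results_code_file) as fh:
                print(fh.read())
        except IOError as exc:
            print("could not read results file: %s" % exc)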
It looks like if I set 'retry_metadata_internally = False', it stops trying to index them on the queue node. The datasets get added to the library without a BAM index file, but also without error. I guess the index files can be generated on demand later on.

Kyle
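If the .bai files are wanted eventually, one option is to build them in bulk as a separate job submitted to the cluster, completely outside the upload path. A minimal sketch follows; the directory path and worker count are illustrative, and note that this just produces plain .bai files next to the path-pasted BAMs rather than registering indexes as Galaxy metadata:

    # Illustrative bulk indexer: build missing .bai files for a directory of
    # BAMs in parallel. Meant to run as its own cluster job, not inside Galaxy.
    import glob
    import os
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    BAM_DIR = "/data/library_bams"   # illustrative path to the path-pasted BAMs
    WORKERS = 8                      # tune to the node's core count

    def index_bam(bam_path):
        bai_path = bam_path + ".bai"
        if os.path.exists(bai_path):
            return bam_path, "already indexed"
        proc = subprocess.run(
            ["samtools", "index", bam_path],
            stderr=subprocess.PIPE,
        )
        if proc.returncode == 0:
            return bam_path, "ok"
        return bam_path, proc.stderr.decode().strip()

    if __name__ == "__main__":
        bams = sorted(glob.glob(os.path.join(BAM_DIR, "*.bam")))
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            for bam, status in pool.map(index_bam, bams):
                print(bam, status)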
participants (3)
- Anthonius deBoer
- Greg Von Kuster
- Kyle Ellrott