Corner case in task splitter - merging zero files
Hi Scott,
Following some failing hard drives, I'm rebuilding our Galaxy server. Something isn't quite right with our cluster integration yet, but it has exposed a problem in Galaxy's handling of task splitting - it can sometimes attempt to merge zero files.
Here is my fix for the BLAST XML format (now in the ToolShed), https://bitbucket.org/peterjc/galaxy-central/changeset/5cb6411bad19802ba4001...
Here's an example using the text format:
    galaxy.jobs.splitters.multi ERROR 2012-10-18 16:26:21,330 Error merging files
    Traceback (most recent call last):
      File "/mnt/galaxy/galaxy-central/lib/galaxy/jobs/splitters/multi.py", line 133, in do_merge
        output_type.merge(output_files, output_file_name)
      File "/mnt/galaxy/galaxy-central/lib/galaxy/datatypes/data.py", line 545, in merge
        raise Exception('Result %s from %s' % (result, cmd))
    Exception: Result 2 from cat > /mnt/galaxy/galaxy-central/database/files/000/dataset_304.dat
The problem obviously is that while "cat file1 ... fileN > merged" will work fine for one or more files, with no files it sits waiting for stdin (and from a user perspective stalls).
This logic error is in lib/galaxy/datatypes/data.py method merge, which could either treat zero files as an error, or a no-op:
    if len(split_files) == 1:
        cmd = 'mv -f %s %s' % ( split_files[0], output_file )
    else:
        cmd = 'cat %s > %s' % ( ' '.join(split_files), output_file )
    result = os.system(cmd)
I think this should be something like this:
    if not split_files:
        raise Exception('Asked to merge zero files')
    elif len(split_files) == 1:
        cmd = 'mv -f %s %s' % ( split_files[0], output_file )
    else:
        cmd = 'cat %s > %s' % ( ' '.join(split_files), output_file )
    result = os.system(cmd)
It might also make sense to check for zero files in the code which calls the merge, i.e. lib/galaxy/jobs/splitters/multi.py function do_merge.
I'm still investigating upstream how this comes about; one clue:
    galaxy.jobs.runners.drmaa DEBUG 2012-10-18 16:25:01,930 (273/510) state change: job is running
    galaxy.jobs.runners.drmaa DEBUG 2012-10-18 16:25:03,040 (273/510) state change: job finished, but failed
    galaxy.jobs.runners.drmaa DEBUG 2012-10-18 16:25:03,074 Job output not returned from cluster
    galaxy.jobs DEBUG 2012-10-18 16:25:03,074 task 641 for job 273 ended; exit code: 0
    galaxy.jobs DEBUG 2012-10-18 16:25:03,148 task 641 ended
    galaxy.jobs.runners.tasks DEBUG 2012-10-18 16:25:05,169 execution finished - beginning merge: tblastx -query "/mnt/galaxy/galaxy-central/database/files/000/dataset_127.dat" -db "/var/local/blast/ncbi/nt" -query_gencode 2 -evalue 0.001 -out /mnt/galaxy/galaxy-central/database/files/000/dataset_304.dat -outfmt 0 -num_threads 8
    galaxy.jobs.splitters.multi DEBUG 2012-10-18 16:25:05,181 files []
If you would prefer that small suggestion as a pull request, let me know.
Regards,
Peter
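For illustration, the zero-file guard proposed in the message above could be sketched without a shell at all. This is a minimal sketch under stated assumptions, not Galaxy's actual merge method and not the code from the linked changeset: the standalone function name `merge`, the use of `ValueError`, and the switch from `os.system` to `shutil` are all choices made for this example.

```python
import shutil


def merge(split_files, output_file):
    """Concatenate split_files into output_file (hypothetical standalone helper)."""
    if not split_files:
        # Zero-file case from the thread: fail loudly rather than run `cat`
        # with no input files.
        raise ValueError('Asked to merge zero files')
    if len(split_files) == 1:
        # Single part: moving it is enough, mirroring the original `mv -f`.
        shutil.move(split_files[0], output_file)
        return
    # Two or more parts: append each one in order, like `cat a b > out`.
    with open(output_file, 'wb') as out:
        for name in split_files:
            with open(name, 'rb') as part:
                shutil.copyfileobj(part, out)
```

Doing the copy in Python means there is no child process that could read from stdin, so an empty list can only ever hit the explicit check.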
Hey Peter-
Thanks - I'll look into it. If you're able to reproduce the problem easily and wouldn't mind crafting a pull request, then it would be much appreciated. Otherwise I'll put this on my to-do list to be done soon. I or someone else may want to revisit the exception handling to prevent that from happening.
Thanks!
-Scott
On Thu, Oct 18, 2012 at 5:19 PM, Scott McManus <scottmcmanus@gatech.edu> wrote:
Hey Peter-
Thanks - I'll look into it. If you're able to reproduce the problem easily and wouldn't mind crafting a pull request, then it would be much appreciated. Otherwise I'll put this on my to-do list to be done soon. I or someone else may want to revisit the exception handling to prevent that from happening.
Thanks!
-Scott
OK then: https://bitbucket.org/galaxy/galaxy-central/pull-request/78/avoid-stall-when...
I can explain what was happening: We had a mount problem. The Galaxy server could talk to SGE and submit jobs, but when the jobs came to run, the mount providing their home directory and the Galaxy file system was down, so they failed. Naturally this meant Galaxy got no output files back.
Reading the code, you deliberately attempt to merge any files present (e.g. if 9 out of 10 come back). That does make sense as it could be instructive (as long as it is flagged as an error, which doesn't seem to be happening).
I think getting zero files back from the split-jobs ought to be an error condition. In fact, failing to get all the expected sub-files back should also be an error condition (although it is still nice to do the merge so the user can see the partial output).
I think a little re-factoring might be needed to treat these explicitly as errors.
Regards,
Peter
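The policy described in the message above (merge whatever came back, but report an error whenever any expected sub-file is missing) might look roughly like the following. This is only a sketch of the idea, not the refactoring that went into Galaxy: `plan_merge`, its return convention, and the message wording are hypothetical.

```python
import os


def plan_merge(expected_files):
    """Return (files_to_merge, error_message); error_message is None on success.

    Hypothetical helper: expected_files lists the per-task output paths the
    splitter expected to get back from the cluster.
    """
    present = [f for f in expected_files if os.path.exists(f)]
    if not present:
        # Nothing came back at all: an error, and there is nothing to merge.
        return [], 'No task output files were returned (expected %d)' % len(expected_files)
    if len(present) < len(expected_files):
        # Partial results: still worth merging so the user can see them,
        # but the job should be flagged as failed.
        return present, 'Only %d of %d task output files were returned' % (
            len(present), len(expected_files))
    return present, None
```

A caller could merge the returned list when it is non-empty and mark the job as failed whenever the message is not None, which keeps the partial output visible without hiding the failure.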
Thanks, Peter! I'll get to it this afternoon EDT.
-Scott
OK - it's in. Thanks again! I will add a to-do item to put output-merge messages into stdout so that they're more visible.
-Scott
On Fri, Oct 19, 2012 at 8:57 PM, Scott McManus <scottmcmanus@gatech.edu> wrote:
OK - it's in. Thanks again! I will add a to-do item to put output-merge messages into stdout so that they're more visible.
-Scott
Great, thanks.
I see Edward Kirton had already reported the underlying problem that was triggering this on our system - "Job output not returned from cluster" is not being treated as an error condition:
https://trello.com/card/813-drmaa-py-job-output-not-returned-from-cluster-sh...
(The markup imported from Bitbucket seems to have been messed up, but the gist of the report is understandable.)
Peter
Participants (2):
- Peter Cock
- Scott McManus