Sorry this is old, but I tried recompiling TORQUE after setting NCONNECTS to 20 and the issue is still there.
But there's more: it affects not only flagstat but other non-linear workflows too. One of the two jobs that get submitted once their "father" job has stopped running triggers the same error: PBS error 15033: No free connections
And the best part is that it works fine sometimes. But when it crashes, it's always the same job that crashes.
Does anyone have a clue?
Cheers, L-A
I was getting the same behavior as you on asynchronous workflows, on a multicore computer that acts as both head node and compute node for the TORQUE system. Even after recompiling with a higher NCONNECTS I was getting the same error. I suspect this is due to Galaxy opening multiple connections to check the status of currently running jobs. Because an asynchronous workflow can involve many status checks, the PBS server is busy at unpredictable moments, so whether a submission fails depends on when it comes in. To deal with this I modified lib/galaxy/jobs/runners/pbs.py to make multiple attempts at submitting, in the following way:
@@ -286,6 +286,12 @@ class PBSJobRunner( BaseJobRunner ):
         log.debug("(%s) submitting file %s" % ( galaxy_job_id, job_file ) )
         log.debug("(%s) command is: %s" % ( galaxy_job_id, command_line ) )
         job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+        ##Modified to give ten tries for qsubbing a job
+        num_try=0
+        while(not job_id and num_try<10):
+            job_id = pbs.pbs_submit(c, job_attrs, job_file, pbs_queue_name, None)
+            num_try+=1
+
         pbs.pbs_disconnect(c)

         # check to see if it submitted
I haven't had any problems since.
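For anyone who wants the same idea with a pause between attempts, here is a rough standalone sketch. It is not Galaxy code: `submit_with_retries` and `make_flaky_submit` are made-up names, the delay is my own addition (the patch above retries immediately), and the fake submit function just stands in for `pbs.pbs_submit`, assuming it returns a falsy value on failure as the `not job_id` check implies.

```python
import time


def submit_with_retries(submit, max_tries=10, delay=1.0, sleep=time.sleep):
    """Call submit() until it returns a truthy job id.

    `submit` is any zero-argument callable; inside pbs.py it would wrap
    the existing pbs.pbs_submit(...) call (assumption, not tested there).
    Makes at most `max_tries` calls in total and returns
    (job_id, number_of_calls_made).
    """
    job_id = None
    tries = 0
    while not job_id and tries < max_tries:
        if tries:
            # Pause before each retry so that in-flight status checks
            # have a chance to finish and release their connections.
            sleep(delay)
        job_id = submit()
        tries += 1
    return job_id, tries


def make_flaky_submit(failures, job_id="123.headnode"):
    """Hypothetical stand-in for pbs.pbs_submit: fails `failures` times."""
    state = {"left": failures}

    def submit():
        if state["left"] > 0:
            state["left"] -= 1
            return None  # falsy result, treated as a failed submission
        return job_id

    return submit


if __name__ == "__main__":
    job_id, tries = submit_with_retries(make_flaky_submit(2), delay=0.1)
    print(job_id, tries)
```

The only real difference from the patch is the sleep: hammering the server ten times in a row may hit the same busy window, while a short wait gives the status checks time to finish. If every attempt fails, the function returns the last falsy result, so the existing "check to see if it submitted" code still handles the error.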
Cheers, Andrew
Louise-Amélie Schmitt wrote:
>> Hello everyone
>>
>> I observed an issue when flagstat is incorporated in a workflow in which
>> the BAM file it works on is also used by another program (generate
>> pileup for instance) and is NOT the input dataset (generated by sam to
>> bam within the workflow).
>>
>> I tested it with the local job runner and with TORQUE (with the pbs
>> scheduler and Maui).
>>
>> - With the local job runner, it works just fine
>>
>> - With TORQUE I get the following error message:
>> pbs_submit failed, PBS error 15033: No free connections
>
> Hi,
>
> This can most likely be fixed by increasing the value of NCONNECTS in
> the TORQUE source, in src/include/libpbs.h, and recompiling on your
> TORQUE server. I haven't seen a problem after increasing the value to
> 20.
>
> --nate
>
>> - Surprisingly, other non-linear workflows work fine. I only observed this
>> error with flagstat. Moreover, when flagstat is in a linear workflow, it
>> works fine too. And if it is non-linear but the input dataset is the bam
>> file flagstat works on, it works fine too.
>>
>> Please find attached one of the test workflow where I found the error.
>> The input dataset is a sam file.
>>
>> Any clue?
>>
>> Cheers,
>> LA
>> {
>>     "a_galaxy_workflow": "true",
>>     "annotation": "to see if it fails if not forked",
>>     "format-version": "0.1",
>>     "name": "test flagstat",
>>     "steps": {
>>         "0": {
>>             "annotation": "",
>>             "id": 0,
>>             "input_connections": {},
>>             "inputs": [
>>                 {
>>                     "description": "",
>>                     "name": "Input Dataset"
>>                 }
>>             ],
>>             "name": "Input dataset",
>>             "outputs": [],
>>             "position": {
>>                 "left": 200,
>>                 "top": 200
>>             },
>>             "tool_errors": null,
>>             "tool_id": null,
>>             "tool_state": "{"name": "Input Dataset"}",
>>             "tool_version": null,
>>             "type": "data_input",
>>             "user_outputs": []
>>         },
>>         "1": {
>>             "annotation": "",
>>             "id": 1,
>>             "input_connections": {
>>                 "source|input1": {
>>                     "id": 0,
>>                     "output_name": "output"
>>                 }
>>             },
>>             "inputs": [],
>>             "name": "SAM-to-BAM",
>>             "outputs": [
>>                 {
>>                     "name": "output1",
>>                     "type": "bam"
>>                 }
>>             ],
>>             "position": {
>>                 "left": 274.5,
>>                 "top": 307
>>             },
>>             "tool_errors": null,
>>             "tool_id": "sam_to_bam",
>>             "tool_state": "{"source": "{\"index_source\": \"cached\", \"input1\": null, \"__current_case__\": 0}", "__page__": 0}",
>>             "tool_version": "1.1.1",
>>             "type": "tool",
>>             "user_outputs": []
>>         },
>>         "2": {
>>             "annotation": "",
>>             "id": 2,
>>             "input_connections": {
>>                 "input1": {
>>                     "id": 1,
>>                     "output_name": "output1"
>>                 }
>>             },
>>             "inputs": [],
>>             "name": "flagstat",
>>             "outputs": [
>>                 {
>>                     "name": "output1",
>>                     "type": "txt"
>>                 }
>>             ],
>>             "position": {
>>                 "left": 396.5,
>>                 "top": 445
>>             },
>>             "tool_errors": null,
>>             "tool_id": "samtools_flagstat",
>>             "tool_state": "{"__page__": 0, "input1": "null"}",
>>             "tool_version": "1.0.0",
>>             "type": "tool",
>>             "user_outputs": []
>>         },
>>         "3": {
>>             "annotation": "",
>>             "id": 3,
>>             "input_connections": {
>>                 "refOrHistory|input1": {
>>                     "id": 1,
>>                     "output_name": "output1"
>>                 }
>>             },
>>             "inputs": [],
>>             "name": "Generate pileup",
>>             "outputs": [
>>                 {
>>                     "name": "output1",
>>                     "type": "tabular"
>>                 }
>>             ],
>>             "position": {
>>                 "left": 519,
>>                 "top": 340
>>             },
>>             "tool_errors": null,
>>             "tool_id": "sam_pileup",
>>             "tool_state": "{"__page__": 0, "c": "{\"consensus\": \"no\", \"__current_case__\": 0}", "indels": "\"no\"", "refOrHistory": "{\"input1\": null, \"reference\": \"indexed\", \"__current_case__\": 0}", "lastCol": "\"no\"", "mapCap": "\"60\""}",
>>             "tool_version": "1.1.1",
>>             "type": "tool",
>>             "user_outputs": []
>>         }
>>     }
>> }
galaxy-user@lists.galaxyproject.org