On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am not using job splitting, because I am implementing this for a client with a small (one-machine) Galaxy setup.
Ah - this also explains why a job size limit is important for you.
Implementing a query limit feature in Galaxy core would probably be the best idea, but that would probably also require an admin screen for editing those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.
The wrapper script idea I outlined to you earlier would be the least invasive (although might cause trouble if BLAST is run at the command line outside Galaxy), while your idea of inserting the check script into the Galaxy Tool XML just before running BLAST itself should also work well.
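For reference, the wrapper-script idea could look something like the sketch below. Everything here is illustrative and not from the thread: the limit of 300, the renamed binary path, and the assumption that the real binary is moved aside so this script is found first on the PATH.

```python
#!/usr/bin/env python
# Hypothetical sketch of the wrapper-script approach: intercept the
# BLAST command line, count the query sequences, and refuse to run
# over-sized jobs. The limit and binary path are assumptions.
import os
import sys

MAX_QUERIES = 300  # example limit, chosen arbitrarily here
REAL_BLAST = "/usr/local/bin/blastn.real"  # the renamed real binary (assumption)


def count_fasta_sequences(path):
    """Count FASTA records by counting '>' header lines."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def main(argv):
    # BLAST+ tools take the query FASTA file via the -query argument.
    if "-query" in argv:
        query = argv[argv.index("-query") + 1]
        if count_fasta_sequences(query) > MAX_QUERIES:
            sys.exit("Size of input exceeds query limit of this Galaxy instance.")
    # Hand over to the real binary with the original arguments.
    os.execv(REAL_BLAST, [REAL_BLAST] + argv[1:])
```

A deployed copy would end with `main(sys.argv)` and be installed ahead of the real blastn on the PATH, which is exactly why it would also limit command-line BLAST runs made outside Galaxy.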
While looking at Jan's pull request to insert a query size limit before running BLAST (https://github.com/peterjc/galaxy_blast/pull/43), I realised that this will not work so well if job splitting is enabled.
If the job-splitting parallelism setting is enabled in Galaxy, then the BLAST query FASTA file is broken up into chunks of 1000 sequences. This means the new check would be made at the chunk level - so it could in effect catch extremely long query sequences (e.g. chromosomes), but could not block anyone submitting one query FASTA file containing many thousands of moderate-length query sequences (e.g. genes).
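To make the failure mode concrete, here is a small arithmetic illustration (the 1000-sequence chunk size is Galaxy's; the 2000-sequence limit and 5000-query job are hypothetical numbers): every chunk individually passes the limit even though the whole job is 2.5 times over it.

```python
# Illustration only: a per-chunk check cannot enforce a whole-job limit
# once the input FASTA has been split. CHUNK_SIZE matches Galaxy's
# splitter; LIMIT is a hypothetical per-job sequence limit.
CHUNK_SIZE = 1000
LIMIT = 2000


def chunk_sizes(total_queries):
    """Sequence counts of the chunks the splitter would produce."""
    return [min(CHUNK_SIZE, total_queries - start)
            for start in range(0, total_queries, CHUNK_SIZE)]


total = 5000
assert total > LIMIT                                      # whole job is over the limit
assert all(size <= LIMIT for size in chunk_sizes(total))  # yet every chunk passes
```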
John - regarding that Trello issue you logged, https://trello.com/c/0XQXVhRz "Generic infrastructure to let deployers specify limits for tools based on input metadata (number of sequences, file size, etc...)":
Would it be fair to say this is not likely to be implemented in the near future? i.e. Should we consider implementing the BLAST query limit approach as a short term hack?
It would be good functionality - but I don't foresee myself or anyone on the core team getting to it in the next six months, say. ... I am now angry with myself though, because I realized that dynamic job destinations are a better way to implement this in the meantime (that environment stuff was very fresh when I responded, so I think I just jumped there). You can build a flexible infrastructure locally that is largely decoupled from the tools, and that may (?) work around the task-splitting problem Peter brought up.

Outline of the idea: create a Python script - say lib/galaxy/jobs/mapper_limits.py - and add some functions to it like:

------------------
# Helper utilities for limiting tool inputs.
from galaxy.jobs.mapper import JobMappingException

DEFAULT_QUERY_LIMIT_MESSAGE = "Size of input exceeds query limit of this Galaxy instance."

def assert_fewer_than_n_sequences(input_path, n, msg=DEFAULT_QUERY_LIMIT_MESSAGE):
    # Count FASTA records by their '>' header lines.
    with open(input_path) as handle:
        num_sequences = sum(1 for line in handle if line.startswith(">"))
    if num_sequences > n:
        raise JobMappingException(msg)

# Do the same for other checks...
------------------

This is an abstract file that has nothing to do with the institution or toolbox really. Once you get it working, open a pull request and we can probably get this integrated into Galaxy (as long as it is abstract enough).

Then deployers can create specific rules for their particular cluster and toolbox. Create lib/galaxy/jobs/runners/rules/instance_dests.py:

------------------
from galaxy.jobs import mapper_limits

def limited_blast(job, app):
    inp_data = dict((da.name, da.dataset) for da in job.input_datasets)
    query_file = inp_data["query"].file_name
    mapper_limits.assert_fewer_than_n_sequences(query_file, 300)
    return app.job_config.get_destination("blast_base")
------------------

Then open job_conf.xml and add the correct destinations:

------------------
<job_conf>
    ...
    <destinations>
        ...
        <destination id="limited_blast" runner="dynamic">
            <param id="function">limited_blast</param>
        </destination>
        <destination id="blast_base" runner="torque"> <!-- or whatever -->
            ...
        </destination>
    </destinations>
    <tools>
        <tool id="ncbi_blastn_wrapper" destination="limited_blast" />
        <tool id="ncbi_blastp_wrapper" destination="limited_blast" />
        ...
    </tools>
</job_conf>
------------------

Jan, I am really sorry I didn't come up with this before you did all that work. Hopefully what you did for "limit_query_size.py" can be reused in this context.

-John
Thanks,
Peter