Per-tool configuration
I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?

Jan
I would have different answers for you depending on what options are available to the server admin. What exactly about the tool is configurable - can you be more specific?

-John

On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?
Jan
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
I have two use cases. The first is for a modification of the NCBI BLAST wrapper to limit the query input size (for a publicly accessible Galaxy instance), so this needs a configuration option for the query size limit. I was thinking about a separate config file in tool-data for this.

The second is for a tool I have written to convert a BLAST XML output into an HTML report. The report contains links for each match to a gene bank (e.g. the NCBI database). These links should be configurable per database that was searched, and preferably have an option of linking to the location of the match within the gene if the gene bank supports such links. One option is to add an extra column to the blast .loc files (if that doesn't break blast), where the databases are already configured.

Jan

On 13 Jun 2014 18:02, "John Chilton" <jmchilton@gmail.com> wrote:
I would have different answers for you depending on what options are available to the server admin. What exactly about the tool is configurable - can you be more specific?
-John
On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?
Jan
Hello Jan,

Thanks for the clarification. Not quite what I was expecting, so I am glad I asked - I don't have great answers for either case, so hopefully other people will have some ideas.

For the first use case - I would just specify some default value for the limit - let's call this N - add a parameter to the tool wrapper "--limit-size=N", test that, and then allow it to be overridden via an environment variable - so in your command block use "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit is set, but deployers can set limits. There are a number of ways to set such variables - DRM specific environment files, login rc files, etc. Just this last release I added the ability to define environment variables right in job_conf.xml (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specifica...). I thought the tool shed might have a way to collect such definitions as well and insert them into package files - but Google failed to find this for me.

Not sure how to proceed with the second use case - extending the .loc file should work locally, but I am not sure it is feasible within the context of the existing tool shed tools, data manager, etc. You could certainly duplicate this stuff with your modifications - that has downsides in terms of interoperability though.

Sorry I don't have great answers for either question,

-John

On Sat, Jun 14, 2014 at 5:12 AM, Jan Kanis <jan.code@jankanis.nl> wrote:
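For illustration, a minimal sketch of how a small check or wrapper script could honour such a variable on the Python side - the variable name BLAST_QUERY_LIMIT follows John's example, while the default value and error handling here are made up:

------------------
# Hypothetical sketch: read an admin-set query limit from the environment,
# falling back to a built-in default when the variable is unset.
import os
import sys

DEFAULT_LIMIT = 1000  # placeholder default ("N"); pick whatever suits the instance


def query_limit():
    value = os.environ.get("BLAST_QUERY_LIMIT")
    if value is None:
        return DEFAULT_LIMIT
    try:
        return int(value)
    except ValueError:
        sys.exit("BLAST_QUERY_LIMIT must be an integer, got %r" % value)
------------------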
I have two use cases. The first is for a modification of the NCBI BLAST wrapper to limit the query input size (for a publicly accessible Galaxy instance), so this needs a configuration option for the query size limit. I was thinking about a separate config file in tool-data for this.
The second is for a tool I have written to convert a BLAST XML output into an HTML report. The report contains links for each match to a gene bank (e.g. the NCBI database). These links should be configurable per database that was searched, and preferably have an option of linking to the location of the match within the gene if the gene bank supports such links. One option is to add an extra column to the blast .loc files (if that doesn't break blast), where the databases are already configured.
Jan
On 13 Jun 2014 18:02, "John Chilton" <jmchilton@gmail.com> wrote:
I would have different answers for you depending on what options are available to the server admin. What exactly about the tool is configurable - can you be more specific?
-John
On Fri, Jun 13, 2014 at 10:59 AM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am writing a tool that should be configurable by the server admin. I am considering adding a configuration file, but where should such a file be placed? Is the tool-data directory the right place? Is there another standard way for per-tool configuration?
Jan
On Mon, Jun 16, 2014 at 4:18 AM, John Chilton <jmchilton@gmail.com> wrote:
Hello Jan,
Thanks for the clarification. Not quite what I was expecting so I am glad I asked - I don't have great answers for either case so hopefully other people will have some ideas.
For the first use case - I would just specify some default value for the limit - let's call this N - add a parameter to the tool wrapper "--limit-size=N", test that, and then allow it to be overridden via an environment variable - so in your command block use "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit is set, but deployers can set limits. There are a number of ways to set such variables - DRM specific environment files, login rc files, etc. Just this last release I added the ability to define environment variables right in job_conf.xml (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specifica...). I thought the tool shed might have a way to collect such definitions as well and insert them into package files - but Google failed to find this for me.
Hmm. Jan emailed me off list earlier about this. We could insert a pre-BLAST script to check the size of the query FASTA file, and abort if it is too large (e.g. number of queries, total sequence length, perhaps scaled according to the database size if we want to get clever?).

I was hoping there was a more general mechanism in Galaxy - after all, BLAST is by no means the only computationally expensive tool ;)

We have had query files of 20,000 and more genes against NR (both BLASTP and BLASTX), but our Galaxy has task-splitting enabled so this becomes 20 (or more) individual cluster jobs of 1000 queries each. This works fine apart from the occasional glitch with the network drive when the data is merged afterwards. (We know this failed once shortly after the underlying storage had been expanded, and would have been under heavy load rebalancing the data across the new disks.)
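A minimal sketch of what such a pre-BLAST check could look like, assuming a plain FASTA query file and a limit on the number of query sequences passed on the command line (the interface here is invented for illustration):

------------------
# Hypothetical pre-BLAST check: count the FASTA query records and abort
# (non-zero exit, so Galaxy fails the job) if there are too many.
import sys


def count_fasta_records(path):
    count = 0
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count


if __name__ == "__main__":
    fasta, limit = sys.argv[1], int(sys.argv[2])
    queries = count_fasta_records(fasta)
    if queries > limit:
        sys.exit("Query file has %i sequences, exceeding the limit of %i" % (queries, limit))
------------------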
Not sure how to proceed with the second use case - extending the .loc file should work locally, but I am not sure it is feasible within the context of the existing tool shed tools, data manager, etc. You could certainly duplicate this stuff with your modifications - that has downsides in terms of interoperability though.
Currently the BLAST wrappers use the *.loc files directly, but this is likely to switch to the newer "Data Manager" approach. That may or may not complicate local modifications like adding extra columns...
Sorry I don't have great answers for either question, -John
Thanks John, Peter
Too bad there aren't any really good options. I will use the environment variable approach for the query size limit. For the gene bank links I guess modifying the .loc file is the least bad way. Maybe it can be merged into galaxy_blast, that would at least solve the interoperability problems.

@Peter: One potential problem in merging my blast2html tool could be that I have written it in Python 3, and the current tool wrapper therefore installs Python 3 and a host of its dependencies, making for quite a large download.

Jan

On 16 June 2014 09:08, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Jun 16, 2014 at 4:18 AM, John Chilton <jmchilton@gmail.com> wrote:
Hello Jan,
Thanks for the clarification. Not quite what I was expecting so I am glad I asked - I don't have great answers for either case so hopefully other people will have some ideas.
For the first use case - I would just specify some default value for the limit - let's call this N - add a parameter to the tool wrapper "--limit-size=N", test that, and then allow it to be overridden via an environment variable - so in your command block use "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit is set, but deployers can set limits. There are a number of ways to set such variables - DRM specific environment files, login rc files, etc. Just this last release I added the ability to define environment variables right in job_conf.xml ( https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specifica... ). I thought the tool shed might have a way to collect such definitions as well and insert them into package files - but Google failed to find this for me.
Hmm. Jan emailed me off list earlier about this. We could insert a pre-BLAST script to check the size of the query FASTA file, and abort if it is too large (e.g. number of queries, total sequence length, perhaps scaled according to the database size if we want to get clever?).
I was hoping there was a more general mechanism in Galaxy - after all, BLAST is by no means the only computationally expensive tool ;)
We have had query files of 20,000 and more genes against NR (both BLASTP and BLASTX), but our Galaxy has task-splitting enabled so this becomes 20 (or more) individual cluster jobs of 1000 queries each. This works fine apart from the occasional glitch with the network drive when the data is merged afterwards. (We know this failed once shortly after the underlying storage had been expanded, and would have been under heavy load rebalancing the data across the new disks.)
Not sure how to proceed with the second use case - extending the .loc file should work locally, but I am not sure it is feasible within the context of the existing tool shed tools, data manager, etc. You could certainly duplicate this stuff with your modifications - that has downsides in terms of interoperability though.
Currently the BLAST wrappers use the *.loc files directly, but this is likely to switch to the newer "Data Manager" approach. That may or may not complicate local modifications like adding extra columns...
Sorry I don't have great answers for either question, -John
Thanks John,
Peter
On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
Too bad there aren't any really good options. I will use the environment variable approach for the query size limit.
Are you using the optional job splitting (parallelism) feature in Galaxy? That seems to me to be a good place to insert a Galaxy-level job size limit, e.g. BLAST+ jobs are split into 1000 query chunks, so you might wish to impose a 25 chunk limit?

Long term, being able to set limits on the input file parameters of each tool would be nicer - e.g. limit BLASTN to at most 20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.
For the gene bank links I guess modifying the .loc file is the least bad way. Maybe it can be merged into galaxy_blast, that would at least solve the interoperability problems.
It would have to be sufficiently general, and backward compatible. FYI other people have also looked at extending the blast *.loc files (e.g. adding a category column for helping filter down a very large BLAST database list).
@Peter: One potential problem in merging my blast2html tool could be that I have written it in python3, and the current tool wrapper therefore installs python3 and a host of its dependencies, making for a quite large download.
Without seeing your code, it is hard to say, but actually writing Python code which works unmodified under Python 2.7 and Python 3 is quite doable (and under Python 2.6 with a few more provisos). Both NumPy and Biopython do this if you wanted some reassurance.

On the other hand, Galaxy itself will need to move to Python 3 at some point, and certainly individual tools will too. This will probably mean (as with Linux Python packages) having double entries on the ToolShed (one for Python 2, one for Python 3), e.g. a ToolShed package for NumPy under Python 2 (done) and under Python 3 (needed).

Peter
On Tue, Jun 17, 2014 at 2:55 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
Too bad there aren't any really good options. I will use the environment variable approach for the query size limit.
Are you using the optional job splitting (parallelism) feature in Galaxy? That seems to me to be a good place to insert a Galaxy-level job size limit, e.g. BLAST+ jobs are split into 1000 query chunks, so you might wish to impose a 25 chunk limit?
Long term being able to set limits on the input file parameters of each tool would be nicer - e.g. Limit BLASTN to at most 20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.
Trello card created, please vote! https://trello.com/c/0XQXVhRz
For the gene bank links I guess modifying the .loc file is the least bad way. Maybe it can be merged into galaxy_blast, that would at least solve the interoperability problems.
It would have to be sufficiently general, and backward compatible.
FYI other people have also looked at extending the blast *.loc files (e.g. adding a category column for helping filter down a very large BLAST database list).
@Peter: One potential problem in merging my blast2html tool could be that I have written it in python3, and the current tool wrapper therefore installs python3 and a host of its dependencies, making for a quite large download.
Without seeing your code, it is hard to say, but actually writing Python code which works unmodified under Python 2.7 and Python 3 is quite doable (and under Python 2.6 with a few more provisos). Both NumPy and Biopython do this if you wanted some reassurance.
On the other hand, Galaxy itself will need to move to Python 3 at some point, and certainly individual tools will too. This will probably mean (as with Linux Python packages) having double entries on the ToolShed (one for Python 2, one for Python 3),
I certainly hope Galaxy can move to Python 3 at some point... being a pessimist though I would place bets against it :).
e.g. a ToolShed package for NumPy under Python 2 (done) and under Python 3 (needed).
Peter
I am not using job splitting, because I am implementing this for a client with a small (one machine) Galaxy setup. Implementing a query limit feature in Galaxy core would probably be the best idea, but that would also probably require an admin screen to edit those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.

I did try earlier to make the blast2html tool run on both Python 2.6 and 3, but gave up due to too many encoding issues. The client's machine has Python 2.6. Maybe I should have another look.

Jan

On 17 June 2014 21:55, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Jun 17, 2014 at 4:57 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
Too bad there aren't any really good options. I will use the environment variable approach for the query size limit.
Are you using the optional job splitting (parallelism) feature in Galaxy? That seems to me to be a good place to insert a Galaxy-level job size limit, e.g. BLAST+ jobs are split into 1000 query chunks, so you might wish to impose a 25 chunk limit?
Long term being able to set limits on the input file parameters of each tool would be nicer - e.g. Limit BLASTN to at most 20,000 queries, limit MIRA to at most 50GB FASTQ files, etc.
For the gene bank links I guess modifying the .loc file is the least bad way. Maybe it can be merged into galaxy_blast, that would at least solve the interoperability problems.
It would have to be sufficiently general, and backward compatible.
FYI other people have also looked at extending the blast *.loc files (e.g. adding a category column for helping filter down a very large BLAST database list).
@Peter: One potential problem in merging my blast2html tool could be that I have written it in python3, and the current tool wrapper therefore installs python3 and a host of its dependencies, making for a quite large download.
Without seeing your code, it is hard to say, but actually writing Python code which works unmodified under Python 2.7 and Python 3 is quite doable (and under Python 2.6 with a few more provisos). Both NumPy and Biopython do this if you wanted some reassurance.
On the other hand, Galaxy itself will need to move to Python 3 at some point, and certainly individual tools will too. This will probably mean (as with Linux Python packages) having double entries on the ToolShed (one for Python 2, one for Python 3),
e.g. a ToolShed package for NumPy under Python 2 (done) and under Python 3 (needed).
Peter
On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am not using job splitting, because I am implementing this for a client with a small (one machine) galaxy setup.
Ah - this also explains why a job size limit is important for you.
Implementing a query limit feature in galaxy core would probably be the best idea, but that would also probably require an admin screen to edit those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.
The wrapper script idea I outlined to you earlier would be the least invasive (although might cause trouble if BLAST is run at the command line outside Galaxy), while your idea of inserting the check script into the Galaxy Tool XML just before running BLAST itself should also work well.
I did try earlier to make the blast2html tool run on both Python 2.6 and 3, but I gave up due to too many encoding issues. The client's machine has Python 2.6. Maybe I should have another look.
Jan
It gets easier with practice - a mixture of little syntax things, and the big pain about bytes versus unicode (and thus encodings, and raw versus text mode for file handles).

Peter
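For illustration, a small sketch of the kind of idiom involved - explicit encodings via io.open and unicode_literals behave the same way on Python 2.6+ and Python 3 (this is a generic example, not code taken from blast2html):

------------------
# Text I/O that behaves identically on Python 2.6+ and Python 3.
from __future__ import print_function, unicode_literals
import io


def read_text(path):
    # io.open is available from Python 2.6 and is the built-in open on Python 3;
    # a fixed encoding avoids platform-dependent defaults and always gives unicode text.
    with io.open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def write_text(path, text):
    with io.open(path, "w", encoding="utf-8") as handle:
        handle.write(text)
------------------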
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am not using job splitting, because I am implementing this for a client with a small (one machine) galaxy setup.
Ah - this also explains why a job size limit is important for you.
Implementing a query limit feature in galaxy core would probably be the best idea, but that would also probably require an admin screen to edit those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.
The wrapper script idea I outlined to you earlier would be the least invasive (although might cause trouble if BLAST is run at the command line outside Galaxy), while your idea of inserting the check script into the Galaxy Tool XML just before running BLAST itself should also work well.
While looking at Jan's pull request to insert a query size limit before running BLAST https://github.com/peterjc/galaxy_blast/pull/43 I realised that this will not work so well if job-splitting is enabled.

If using the job-splitting parallelism setting in Galaxy, then the BLAST query FASTA file is broken up into chunks of 1000 sequences. This means the new check would be made at the chunk level - so it could in effect catch extremely long query sequences (e.g. chromosomes), but could not block anyone submitting one query FASTA file containing many thousands of moderate length query sequences (e.g. genes).

John - that Trello issue you logged, https://trello.com/c/0XQXVhRz
Generic infrastructure to let deployers specify limits for tools based on input metadata (number of sequences, file size, etc...)

Would it be fair to say this is not likely to be implemented in the near future? i.e. Should we consider implementing the BLAST query limit approach as a short term hack?

Thanks,

Peter
On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:04 PM, Jan Kanis <jan.code@jankanis.nl> wrote:
I am not using job splitting, because I am implementing this for a client with a small (one machine) galaxy setup.
Ah - this also explains why a job size limit is important for you.
Implementing a query limit feature in galaxy core would probably be the best idea, but that would also probably require an admin screen to edit those limits, and I don't think I can sell the required time to my boss under the contract we have with the client.
The wrapper script idea I outlined to you earlier would be the least invasive (although might cause trouble if BLAST is run at the command line outside Galaxy), while your idea of inserting the check script into the Galaxy Tool XML just before running BLAST itself should also work well.
While looking at Jan's pull request to insert a query size limit before running BLAST https://github.com/peterjc/galaxy_blast/pull/43 I realised that this will not work so well if job-splitting is enabled.
If using the job-splitting parallelism setting in Galaxy, then the BLAST query FASTA file is broken up into chunks of 1000 sequences. This means the new check would be made at the chunk level - so it could in effect catch extremely long query sequences (e.g. chromosomes), but could not block anyone submitting one query FASTA file containing many thousands of moderate length query sequences (e.g. genes).
John - that Trello issue you logged, https://trello.com/c/0XQXVhRz Generic infrastructure to let deployers specify limits for tools based on input metadata (number of sequences, file size, etc...)
Would it be fair to say this is not likely to be implemented in the near future? i.e. Should we consider implementing the BLAST query limit approach as a short term hack?
It would be good functionality - but I don't foresee myself or anyone on the core team getting to it in the next six months, say.

...

I am now angry with myself though, because I realized that dynamic job destinations are a better way to implement this in the meantime (that environment stuff was very fresh when I responded so I think I just jumped there). You can build a flexible infrastructure locally that is largely decoupled from the tools and that may (?) work around the task splitting problem Peter brought up.

Outline of the idea:

Create a Python script - say lib/galaxy/jobs/mapper_limits.py - and add some functions to it like:

------------------
# Helper utilities for limiting tool inputs.
from galaxy.jobs.mapper import JobMappingException

DEFAULT_QUERY_LIMIT_MESSAGE = "Size of input exceeds query limit of this Galaxy instance."


def assert_fewer_than_n_sequences(input_path, n, msg=DEFAULT_QUERY_LIMIT_MESSAGE):
    ...  # compute num_sequences
    if num_sequences > n:
        raise JobMappingException(msg)

# Do same for other checks...
------------------

This is an abstract file that has nothing to do with the institution or toolbox really. Once you get it working - open a pull request and we can probably get this integrated into Galaxy (as long as it is abstract enough).

Then deployers can create specific rules for that particular cluster and toolbox. Create lib/galaxy/jobs/runners/rules/instance_dests.py:

------------------
from galaxy.jobs import mapper_limits


def limited_blast(job, app):
    inp_data = dict( [ ( da.name, da.dataset ) for da in job.input_datasets ] )
    query_file = inp_data[ "query" ].file_name
    mapper_limits.assert_fewer_than_n_sequences( query_file, 300 )
    return app.job_config.get_destination( "blast_base" )
------------------

Then open job_conf.xml and add the correct destinations...

<job_conf>
  ...
  <destinations>
    ...
    <destination id="limited_blast" runner="dynamic">
      <param id="function">limited_blast</param>
    </destination>
    <destination id="blast_base" runner="torque"> <!-- or whatever -->
      ...
    </destination>
  </destinations>
  <tools>
    <tool id="ncbi_blastn_wrapper" destination="limited_blast" />
    <tool id="ncbi_blastp_wrapper" destination="limited_blast" />
    ...
  </tools>
</job_conf>

Jan, I am really sorry I didn't come up with this before you did all that work. Hopefully what you did for "limit_query_size.py" can be reused in this context.

-John
Thanks,
Peter
On Fri, Jun 27, 2014 at 3:13 PM, John Chilton <jmchilton@gmail.com> wrote:
On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
John - that Trello issue you logged, https://trello.com/c/0XQXVhRz Generic infrastructure to let deployers specify limits for tools based on input metadata (number of sequences, file size, etc...)
Would it be fair to say this is not likely to be implemented in the near future? i.e. Should we consider implementing the BLAST query limit approach as a short term hack?
It would be good functionality - but I don't foresee myself or anyone on the core team getting to it in the next six months say.
...
I am now angry with myself though because I realized that dynamic job destinations are a better way to implement this in the meantime (that environment stuff was very fresh when I responded so I think I just jumped there). You can build a flexible infrastructure locally that is largely decoupled from the tools and that may (?) work around the task splitting problem Peter brought up.
Outline of the idea: <snip>
Hi John,

So the idea is to define a dynamic job mapper which checks the query input size, and if too big raises an error, and otherwise passes the job to the configured job handler (e.g. SGE cluster). See https://wiki.galaxyproject.org/Admin/Config/Jobs

It sounds like this ought to be possible right now, but you are suggesting that since this seems quite a general use case, the code to help build a dynamic mapper using things like file size (in bytes or number of sequences) could be added to Galaxy?

This approach would need the Galaxy admin to set up a custom job mapper for BLAST (which knows to look at the query file), but it taps into an existing Galaxy framework. By providing a reference implementation this ought to be fairly easy to set up, and it can be extended to be more clever about the limits, e.g. for BLAST we should consider both the number (and length) of the queries, plus the size of the database.

Regards,

Peter
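As one illustration of that "more clever" direction, a hypothetical variant of John's limited_blast rule that also caps the total query length - the 300-query and residue limits are invented, and the "query" input name and "blast_base" destination follow John's earlier outline:

------------------
from galaxy.jobs.mapper import JobMappingException

MAX_QUERIES = 300            # illustrative limits only
MAX_TOTAL_RESIDUES = 5000000


def limited_blast(job, app):
    inp_data = dict((da.name, da.dataset) for da in job.input_datasets)
    query_file = inp_data["query"].file_name
    num_queries = 0
    total_residues = 0
    with open(query_file) as handle:
        for line in handle:
            if line.startswith(">"):
                num_queries += 1
            else:
                total_residues += len(line.strip())
    if num_queries > MAX_QUERIES or total_residues > MAX_TOTAL_RESIDUES:
        raise JobMappingException("Query exceeds the size limits of this Galaxy instance.")
    return app.job_config.get_destination("blast_base")
------------------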
On Fri, Jun 27, 2014 at 9:30 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Fri, Jun 27, 2014 at 3:13 PM, John Chilton <jmchilton@gmail.com> wrote:
On Fri, Jun 27, 2014 at 5:16 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Wed, Jun 18, 2014 at 12:14 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
John - that Trello issue you logged, https://trello.com/c/0XQXVhRz Generic infrastructure to let deployers specify limits for tools based on input metadata (number of sequences, file size, etc...)
Would it be fair to say this is not likely to be implemented in the near future? i.e. Should we consider implementing the BLAST query limit approach as a short term hack?
It would be good functionality - but I don't foresee myself or anyone on the core team getting to it in the next six months say.
...
I am now angry with myself though because I realized that dynamic job destinations are a better way to implement this in the meantime (that environment stuff was very fresh when I responded so I think I just jumped there). You can build a flexible infrastructure locally that is largely decoupled from the tools and that may (?) work around the task splitting problem Peter brought up.
Outline of the idea: <snip>
Hi John,
So the idea is to define a dynamic job mapper which checks the query input size, and if too big raises an error, and otherwise passes the job to the configured job handler (e.g. SGE cluster).
See https://wiki.galaxyproject.org/Admin/Config/Jobs
It sounds like this ought to be possible right now, but you are suggesting since this seems quite a general use case, the code to help build a dynamic mapper using things like file size (in bytes or number of sequences) could be added to Galaxy?
Yes, it is possible right now and everything could just be stuck right in the rule file itself. I was just suggesting that sharing some of the helpers with the community might ease the process for future deployers.
This approach would need the Galaxy Admin to setup a custom job mapper for BLAST (which knows to look at the query file), but it taps into an existing Galaxy framework. By providing a reference implementation this ought to be fairly easy to setup, and can be extended to be more clever about the limits.
Yes. As you mention, this can be much more expressive than an XML-based fixed set of limit types. In addition to static sorts of limits, you could combine inputs like you mentioned, allow local users of the public resource to run as much as they want, allow larger jobs on the weekend when things are slow, etc. I recently added a high-level utility for looking at job metrics in these rules - so you can, say, restrict and/or expand the limit based on how many jobs the user has run in the last month or how many core hours they have consumed, etc. https://bitbucket.org/galaxy/galaxy-central/commits/9a905e98e1550314cf821a99...
e.g. For BLAST, we should consider both the number (and length) of the queries, plus the size of the database.
Thanks for clarifying and providing some context to my (in retrospect) seemingly random Python scripts :).
Regards,
Peter
participants (3)
- Jan Kanis
- John Chilton
- Peter Cock