Too bad there aren't any really good options. I will use the environment variable approach for the query size limit. For the GenBank links I guess modifying the .loc file is the least bad way. Maybe my tool can be merged into galaxy_blast; that would at least solve the interoperability problems.

@Peter: One potential problem with merging my blast2html tool is that I have written it in Python 3, so the current tool wrapper installs Python 3 and a host of its dependencies, making for quite a large download.

Jan


On 16 June 2014 09:08, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Mon, Jun 16, 2014 at 4:18 AM, John Chilton <jmchilton@gmail.com> wrote:
> Hello Jan,
>
> Thanks for the clarification. Not quite what I was expecting so I am
> glad I asked - I don't have great answers for either case so hopefully
> other people will have some ideas.
>
> For the first use case - I would just specify some default input to
> supply to the input wrapper - let's call this N - add a parameter to
> the tool wrapper "--limit-size=N" - test that and then allow it to be
> overridden via an environment variable - so in your command block use
> "--limit-size=\${BLAST_QUERY_LIMIT:-N}". This will use N if no limit
> is set, but deployers can set limits. There are a number of ways to
> set such variables - DRM specific environment files, login rc files,
> etc.... Just this last release I added the ability to define
> environment variables right in job_conf.xml
> (https://bitbucket.org/galaxy/galaxy-central/pull-request/378/allow-specification-of-environment/diff).
> I thought the tool shed might have a way to collect such definitions
> as well and insert them into package files - but Google failed to find
> this for me.

Hmm. Jan emailed me off list earlier about this. We could insert
a pre-BLAST script to check the size of the query FASTA file,
and abort if it is too large (e.g. number of queries, total sequence
length, perhaps scaled according to the database size if we want
to get clever?).
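
A minimal sketch of such a pre-BLAST check, written in Python. It only illustrates the idea and is not part of the existing wrappers. The function names and the default of 1000 queries are made up for illustration; BLAST_QUERY_LIMIT is the variable name suggested earlier in this thread.

```python
import os
import sys


def fasta_stats(lines):
    """Return (number_of_records, total_residues) for FASTA text lines."""
    records = 0
    residues = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            records += 1
        elif line:
            residues += len(line)
    return records, residues


def check_query(lines, max_queries, max_residues=None, environ=os.environ):
    """Return an error message if the query is too large, else None."""
    # An unset or empty BLAST_QUERY_LIMIT falls back to max_queries.
    limit = int(environ.get("BLAST_QUERY_LIMIT") or max_queries)
    records, residues = fasta_stats(lines)
    if records > limit:
        return "Query has %d sequences, limit is %d" % (records, limit)
    if max_residues is not None and residues > max_residues:
        return "Query has %d residues, limit is %d" % (residues, max_residues)
    return None


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python check_query_size.py query.fasta
    with open(sys.argv[1]) as handle:
        error = check_query(handle, max_queries=1000)
    if error:
        sys.exit(error)  # message goes to stderr, exit status is 1
```

Run before the BLAST command, a non-zero exit status would stop the job before any searching starts. Scaling the limit by database size is left out here.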

I was hoping there was a more general mechanism in Galaxy -
after all, BLAST is by no means the only computationally
expensive tool ;)

We have had query files of 20,000 and more genes against NR
(both BLASTP and BLASTX), but our Galaxy has task-splitting
enabled so this becomes 20 (or more) individual cluster jobs
of 1000 queries each. This works fine apart from the occasional
glitch with the network drive when the data is merged afterwards.
(We know this failed once shortly after the underlying storage
had been expanded, and would have been under heavy load
rebalancing the data across the new disks.)
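
To make the arithmetic concrete, here is a rough Python sketch of splitting a FASTA query into batches of 1000 records. It is not Galaxy's actual task-splitting code; split_fasta is a hypothetical helper used only to show how 20,000 queries become 20 jobs.

```python
def split_fasta(lines, chunk_size=1000):
    """Yield lists of lines, each holding at most chunk_size FASTA records."""
    chunk = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            if count == chunk_size:
                # Current chunk is full, so start a new one at this header.
                yield chunk
                chunk = []
                count = 0
            count += 1
        chunk.append(line)
    if chunk:
        yield chunk  # last, possibly smaller, chunk
```

Each chunk would be written to its own file and run as a separate cluster job, with the per-chunk outputs merged afterwards.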

> Not sure about how to proceed with the second use case - extending the
> .loc file should work locally - I am not sure it is feasible within
> the context of the existing tool shed tools, data manager, etc.... You
> could certainly duplicate this stuff with your modifications - this
> has downsides in terms of interoperability though.

Currently the BLAST wrappers use the *.loc files directly, but
this is likely to switch to the newer "Data Manager" approach.
That may or may not complicate local modifications like adding
extra columns...
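
As an illustration of what a local extra column could look like, here is a small Python sketch that reads a tab-separated .loc file. It assumes the first three columns are an identifier, a display name and a path. The fourth column, called link_template here, is hypothetical and not part of the current BLAST wrappers.

```python
def parse_loc(lines, extra_columns=("link_template",)):
    """Parse tab-separated .loc lines into dicts, tolerating missing extras."""
    base = ("value", "name", "path")  # assumed column layout
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) < len(base):
            continue  # malformed line, ignore it
        entry = dict(zip(base, fields))
        # Extra columns are optional so unmodified .loc files still parse.
        for i, key in enumerate(extra_columns, start=len(base)):
            entry[key] = fields[i] if i < len(fields) else None
        entries.append(entry)
    return entries
```

Treating the extra column as optional means a stock three-column file keeps working, which limits the interoperability cost of the local change.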

> Sorry I don't have great answers for either question,
> -John

Thanks John,

Peter