There is a sweet spot for splitting your BLAST query FASTA file by sequence: one big file with 25000 sequences is not great, but one sequence per file is the worst possible option. This is due to all the extra overheads: you would have 25000 jobs submitted to the cluster, each of which would load the BLAST binary and database off disk, etc. There are also going to be Galaxy overheads with a large collection.

I would suggest somewhere around 500 to 1000 gene sequences per FASTA query file is likely a safe choice. If you have very long sequences (e.g. chromosomes or contigs), then use fewer.

As to the number of threads for each BLAST job, more is better, but what to pick will depend on your cluster and how often there are threads free on nodes. I would suggest trying 4, 8 or 16 threads.

I hope that helps.

Peter

On Thu, Aug 30, 2018 at 3:50 PM Jochen Bick <jochen.bick@usys.ethz.ch> wrote:
Thanks Peter,
so my idea was to split my problem into single BLAST jobs and run each of them on only one core... My file has 25000 sequences and I'm blasting them against all NCBI proteins (nr). This just takes too long. I guess because the database is also very big? I tested this on the first 10 sequences and it took about 10 minutes. But maybe this is still not faster than running all at once? How many cores would you give such a job?
Cheers Jochen
On 30.08.2018 16:44, Peter Cock wrote:
If there are any limits, it would be down to the Galaxy Admin's job settings - something generic with collections.
Personally I've not done this - I tend to concatenate FASTA files to make large files with multiple sequences instead.
(And then we have the optional task splitting enabled so that Galaxy breaks up the multiple-sequence FASTA file into chunks which get shared out on our cluster for better throughput before concatenating the output back into a single file.)
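For anyone without task splitting enabled who wants to do the chunking by hand before uploading, here is a minimal sketch in plain Python. It is not what Galaxy does internally; the function name, the chunk file naming and the default of 500 sequences per file are all assumptions made for illustration.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, seqs_per_file=500):
    """Split a multi-sequence FASTA file into chunk files.

    Each chunk holds at most seqs_per_file records, in the original order.
    Returns the list of chunk paths. (Hypothetical helper, not a Galaxy API.)
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out_handle = None
    count = 0  # number of records seen so far
    with open(fasta_path) as in_handle:
        for line in in_handle:
            if line.startswith(">"):
                # A header line starts a new record; open a new chunk
                # file every seqs_per_file records.
                if count % seqs_per_file == 0:
                    if out_handle is not None:
                        out_handle.close()
                    chunk_path = out_dir / f"chunk_{len(chunk_paths) + 1:04d}.fasta"
                    chunk_paths.append(chunk_path)
                    out_handle = open(chunk_path, "w")
                count += 1
            # Anything before the first header is skipped.
            if out_handle is not None:
                out_handle.write(line)
    if out_handle is not None:
        out_handle.close()
    return chunk_paths
```

Something like split_fasta("queries.fasta", "chunks", seqs_per_file=500) would then give a few dozen files for 25000 sequences, and since the chunks keep the original order, the BLAST outputs can be concatenated back in the same order afterwards.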
Peter

On Thu, Aug 30, 2018 at 3:37 PM Jochen Bick <jochen.bick@usys.ethz.ch> wrote:
Hi,
is there any limit on running BLAST jobs from a collection of single FASTA files? I started a job but it does not get executed... it's just been sending for about an hour.
Cheers Jochen
-- ETH Zurich *Jochen Bick* Animal Physiology Institute of Agricultural Sciences Postal address: Universitätstrasse 2 / LFW B 58.1 Office: Tannenstrasse 1 / TAN D 6.2 8092 Zurich, Switzerland
Phone +41 44 632 28 25 jochen.bick@usys.ethz.ch <mailto:jochen.bick@usys.ethz.ch> www.ap.ethz.ch