Thanks Peter. My answers are below:
What query sequences are you using?
I have just been using one FASTA protein sequence.

Meanwhile monitor the system with top
top shows a maximum of only 27% CPU usage, but the Linux screen eventually freezes and I have to restart. I am not sure how to read RAM and disk IO from top.

$ grep blastp paster.log
I tried that, and it reports that there is no such file or directory.

Could you try running BLAST from the host Mac OS X
Yes, and it works fine! I get a good match in a relatively short time.
I then made a very small protein database, checked it with command-line blastp in both the host OS X and the guest Linux, and it worked fine. I added it to the blastdb_p.loc file, restarted, and saw it listed in Galaxy. When I tried to use it for a blastp search, I got the same error as before with the huge NCBI nr database. So it is not a matter of size.

Thanks for your comments about RAM and BLAST searches. They give me hope that I can get Galaxy running usefully. I only chose Bio-Linux because of its suite of programs and apparent ease of use. The other reason was that I could not install Galaxy on OS X (10.6): I get errors that others have noted on the discussion lists, but no one seems to have a solution for them.

regards,
Mike DS

--------------------------------------------------------------
On Tue, Apr 30, 2013 at 10:49 PM Peter Cock wrote:
Hi Mike,
On Tue, Apr 30, 2013 at 12:54 PM, Mike Dyall-Smith <mike.dyallsmith@gmail.com> wrote:
Dear Peter,

Thanks for the advice. I think I can now run blastp from the command line, both in the host and in the Linux virtual machine. I say 'think' because it runs in both cases, but then either never completes (I cancelled after 30 min on OS X) or freezes (VBox/Ubuntu). However, the blastp search in Galaxy gives the same error as before. This, I don't understand.
What query sequences are you using? I'd try just one or two protein sequences in a small FASTA file, and rerun blastp against nr. Meanwhile monitor the system with top or similar to see what the CPU usage, RAM usage, and disk IO is like.
That should help determine if this is as simple as not enough RAM leading to lots of paging to disk, and therefore a very slow search.
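For anyone unsure how to read memory figures out of top, a minimal Python sketch is given here that reads them directly from /proc/meminfo on the Linux guest instead. It assumes the standard "Name: value kB" layout of that file; the function names are illustrative only. Swap use that keeps growing while blastp runs would be consistent with the paging explanation above.

```python
import os

def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values

def summarize(values):
    """Return free RAM, page cache and swap in use, all in MB."""
    return {
        "free_mb": values.get("MemFree", 0) / 1024.0,
        "cached_mb": values.get("Cached", 0) / 1024.0,
        "swap_used_mb": (values.get("SwapTotal", 0) - values.get("SwapFree", 0)) / 1024.0,
    }

# Print one snapshot; only meaningful on a system that provides /proc/meminfo.
if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as handle:
        snapshot = summarize(parse_meminfo(handle.read()))
    for key in sorted(snapshot):
        print("%s: %.0f" % (key, snapshot[key]))
```

Running it a few times in a second terminal while the search is in progress gives a rough picture; it does not measure disk IO.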
I simplified the directory structure (and so the path) to the database, and altered the appropriate configuration files (blastdb_p.loc, blast_environment.sh). With 'env', I see:
BLASTDB=/media/sf_mikeds_bioinf/db
I also checked that "ls /media/sf_mikeds_bioinf/db/nr*" listed all the database files.
The relevant lines are now:
----------------from blastdb_p.loc file-------------------------
#Your blastdb_p.loc file should include an entry per line for each "base name"
#you have stored. For example:
#
#nr_05Jun2010 NCBI NR (non redundant) 05 Jun 2010 /data/blastdb/05Jun2010/nr
#nr_15Aug2010 NCBI NR (non redundant) 15 Aug 2010 /data/blastdb/15Aug2010/nr
nr_08_Apr2013 NCBI_nrprot_08Apr2013 /media/sf_mikeds_bioinf/db/nr
...
----------------------------------------------------------------------
That looks OK.
The BLAST+ blastp error from Galaxy is:
-----------------------------------------------------------------------
An error occurred running this job:
blastp: 2.2.26+
Package: blast 2.2.26, build Aug 15 2012 17:48:54
BLAST Database error: No alias or index file found for protein database [/media/sf_mikeds_bioinf/db/nr] in search path [/var/lib/galaxy-server/database/job_working_directory/000/18::]
-----------------------------------------------------------------------
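The error says blastp could not find an alias (.pal) or index (.pin) file for the base path given in blastdb_p.loc. A minimal Python sketch for checking this from inside the guest follows. It assumes the usual three-column, tab-separated .loc layout (ID, description, base path); the function names are illustrative and are not part of Galaxy or BLAST+.

```python
import os
import sys

def parse_loc(text):
    """Split .loc text into (id, description, base_path) entries and malformed lines."""
    entries, malformed = [], []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        fields = [field.strip() for field in line.split("\t")]
        if len(fields) >= 3 and all(fields[:3]):
            entries.append(tuple(fields[:3]))
        else:
            malformed.append(line)  # e.g. spaces used where tabs were expected
    return entries, malformed

def readable_db_files(base_path):
    """Return the alias/index files for base_path that the current user can read."""
    candidates = [base_path + ext for ext in (".pal", ".pin")]
    return [path for path in candidates if os.access(path, os.R_OK)]

# Usage (script name is hypothetical): python check_loc.py /path/to/blastdb_p.loc
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        entries, malformed = parse_loc(handle.read())
    for line in malformed:
        print("Not three tab-separated columns: %r" % line)
    for db_id, description, base_path in entries:
        found = readable_db_files(base_path)
        print("%s: %s" % (db_id, ", ".join(found) if found else "no readable .pal or .pin file"))
```

Running it under the account the Galaxy server uses (an assumption about this setup) is the more telling test, since files readable from your own login are not necessarily readable by that account.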
Very strange, clearly something is not right with the Galaxy config. On the bright side, Galaxy is finding the binaries. Could you try this on the Galaxy log file:
$ grep blastp paster.log
However, while this might point to a real issue for the use of Galaxy within VirtualBox/Ubuntu (and with the database on an external USB3 drive), I suspect my idea of running a local instance of Galaxy on my MacBook is not possible. I have 8 GB of RAM, and set 4 GB for use in the Linux guest system. The GenBank nr protein db directory has files totaling 24 GB.
The oldest cluster nodes we're still using have 8GB of RAM, and are fine to run BLASTP against NR with. The previous nodes only have 2GB and could not cope - so the threshold is somewhere in between. I suspect your guest VM with only 4GB of RAM is struggling.
Could you try running BLAST from the host Mac OS X instead, with access to the full 8GB of RAM?
Running blastp from the command line does not give a result in any reasonable length of time, even when a single query sequence is used (both on the host and guest systems). Even if I were able to handle raw sequence datasets of moderate size (less than 1 GB? as yet untested) in Galaxy, it would be of little use to me if I can't blastp or blastx search the resulting contigs. Do I set up Galaxy to send blast search queries to NCBI (i.e. many thousands of queries?), or is there some more elegant solution?
... Regards,
Peter