Thanks Peter. My answers are below:
>What query sequences are you using?
I have just been using a single protein sequence in FASTA format.
>Meanwhile monitor the system with top
top shows CPU usage peaking at only 27%, but the Linux screen eventually
freezes and I have to restart. I am not sure how to read RAM and disk
IO from top.
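
A minimal sketch of getting the memory figures without top, assuming a
Linux guest that provides /proc/meminfo. The function name mem_report is
made up for illustration. MemFree does not count memory used for cache, so
it understates what is really available.

```shell
# mem_report [FILE]: print total RAM, free RAM and swap in use, in MB.
# Reads /proc/meminfo by default; another file can be passed for testing.
mem_report() {
    awk '
        /^MemTotal:/  { total  = $2 }   # values in /proc/meminfo are in kB
        /^MemFree:/   { free   = $2 }
        /^SwapTotal:/ { stotal = $2 }
        /^SwapFree:/  { sfree  = $2 }
        END {
            printf "ram_total_mb=%d ram_free_mb=%d swap_used_mb=%d\n",
                   total / 1024, free / 1024, (stotal - sfree) / 1024
        }
    ' "${1:-/proc/meminfo}"
}

# Only report when the file exists (it will not on OS X).
if [ -r /proc/meminfo ]; then
    mem_report
fi
```

For disk and swap activity, "vmstat 5" prints one line every five seconds.
Steadily non-zero si/so columns mean the system is paging to disk, and the
bi/bo columns show block IO.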
>grep blastp paster.log
I tried that, and it reports that there is no such file or directory.
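
The log is presumably somewhere other than the current directory. A sketch
of searching for it instead of assuming its location: the helper name
grep_log is hypothetical, and /var/lib/galaxy-server is only a guess at a
starting directory, taken from the job path in the error message.

```shell
# grep_log DIR PATTERN: search every file named paster.log below DIR
# for PATTERN, printing the file name in front of each matching line.
grep_log() {
    find "$1" -type f -name paster.log -exec grep -H -- "$2" {} +
}

# Example (the starting directory is an assumption, not a known location):
#   grep_log /var/lib/galaxy-server blastp 2>/dev/null
```

If nothing is printed, either no paster.log exists below that directory or
none of them mention blastp.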
>Could you try running BLAST from the host Mac OS X
Yes. And it works fine! I get a good match in a relatively short time.
I then made a very small protein database and checked it with command-line
blastp in both the host OS X and the guest Linux, where it worked fine. I
added it to the blastdb_p.loc file, restarted, and saw it listed in Galaxy.
When I tried to use it for a blastp search, I got the same error as before
with the huge NCBI nr database. So it is not a matter of size.
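
Since the error complains about a missing alias or index file, one thing
that can be checked directly is whether those files are readable at the
configured path. A rough sketch, with a hypothetical helper name
check_blastdb: it looks for the .pal alias file or a .pin index file
(including numbered volumes such as nr.00.pin) that a protein database
normally has.

```shell
# check_blastdb BASE: succeed if a protein alias (.pal) or index (.pin)
# file for the database base name BASE is readable by the current user.
check_blastdb() {
    base=$1
    for f in "$base.pal" "$base.pin" "$base".[0-9][0-9].pin; do
        if [ -r "$f" ]; then
            echo "found: $f"
            return 0
        fi
    done
    echo "no readable alias or index file for: $base" >&2
    return 1
}

# Example, using the path from blastdb_p.loc:
#   check_blastdb /media/sf_mikeds_bioinf/db/nr
```

Running this as the account that runs the Galaxy server, rather than as the
login used for the command-line tests, would show whether that account can
see the files. That is only one possible explanation, not something
confirmed here.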
Thanks for your comments about RAM and BLAST searches. They give me hope
that I can get Galaxy running usefully. I only chose Bio-Linux because of
the suite of programs and the apparent ease of use. The other reason was
that I could not install Galaxy on OS X (10.6). I get errors that others
have noted on the discussion lists, but no one seems to have a solution for
them.
regards, Mike DS
--------------------------------------------------------------
On Tue, Apr 30, 2013 at 10:49 PM Peter Cock wrote:
> Hi Mike,
>
> On Tue, Apr 30, 2013 at 12:54 PM, Mike Dyall-Smith
> <mike.dyallsmith(a)gmail.com> wrote:
> > Dear Peter, thanks for the advice. I think I can now run blastp from the
> > commandline, both in the host and in the linux virtual machine. I say
> > 'think' because it runs in both cases, but then either never completes (I
> > cancelled after 30 min on OS X) or freezes (VBox/ubuntu). However, the
> > blastp search in galaxy gives the same error as before. This, I don't
> > understand.
>
> What query sequences are you using? I'd try just one or two protein
> sequences in a small FASTA file, and rerun blastp against nr.
> Meanwhile monitor the system with top or similar to see what
> the CPU usage, RAM usage, and disk IO is like.
>
> That should help determine if this is as simple as not enough RAM
> leading to lots of paging to disk, and therefore a very slow search.
>
> > I simplified the directory structure (so the path) to the database, and
> > altered the appropriate configuration files (blastdb_p.loc,
> > blast_environment.sh). With 'env', I see:
> > BLASTDB=/media/sf_mikeds_bioinf/db
> > I also checked that "ls /media/sf_mikeds_bioinf/db/nr*" listed all the
> > database files.
> >
> > The relevant lines are now:
> > ----------------from blastdb_p.loc file-------------------------
> > #Your blastdb_p.loc file should include an entry per line for each "base name"
> > #you have stored. For example:
> > #
> > #nr_05Jun2010 NCBI NR (non redundant) 05 Jun 2010 /data/blastdb/05Jun2010/nr
> > #nr_15Aug2010 NCBI NR (non redundant) 15 Aug 2010 /data/blastdb/15Aug2010/nr
> > nr_08_Apr2013 NCBI_nrprot_08Apr2013 /media/sf_mikeds_bioinf/db/nr
> > ...
> > ----------------------------------------------------------------------
>
> That looks OK.
>
> > The blast+ blastp error from galaxy is:
> > -----------------------------------------------------------------------
> > An error occurred running this job: blastp: 2.2.26+
> > Package: blast 2.2.26, build Aug 15 2012 17:48:54
> > BLAST Database error: No alias or index file found for protein database
> > [/media/sf_mikeds_bioinf/db/nr] in search path
> > [/var/lib/galaxy-server/database/job_working_directory/000/18::]
> > -----------------------------------------------------------------------
> >
>
> Very strange; clearly something is not right with the Galaxy config.
> On the bright side, Galaxy is finding the binaries. Could you try this
> on the Galaxy log file:
>
> $ grep blastp paster.log
>
> > However, while this might point to a real issue for the use of Galaxy within
> > VirtualBox/Ubuntu (and with the database on an external USB3 drive), I
> > suspect my idea of running a local instance of Galaxy on my MacBook is not
> > possible. I have 8 GB of RAM, and set 4 GB for use in the Linux guest system.
> > The GenBank nr protein db directory has files totaling 24 GB.
>
> The oldest cluster nodes we're still using have 8GB of RAM, and are
> fine to run BLASTP against NR with. The previous nodes had only
> 2GB and could not cope - so the threshold is somewhere in between.
> I suspect your guest VM with only 4GB of RAM is struggling.
>
> Could you try running BLAST from the host Mac OS X instead, with
> access to the full 8GB of RAM?
>
> > Running blastp
> > from the command line does not give a result in any reasonable length of
> > time, even when a single query sequence is used (both on the host and guest
> > systems). Even if I were able to handle raw sequence datasets of moderate
> > size (less than 1 GB? as yet untested) in Galaxy, it would be of little use
> > to me if I can't blastp or blastx search the resulting contigs. Do I set up
> > Galaxy to send BLAST search queries to NCBI (i.e. many thousands of
> > queries??) or is there some more elegant solution?
>
> ...
> Regards,
>
> Peter
>