After installing BWA from the Tool Shed, I found that a new
tool_data_table_conf.xml had been created under the tool-data directory. The
file contains the entries for BWA's loc files.
I believe the installed BWA still relies on the tool_data_table_conf.xml
under the Galaxy root dir. Can anyone clarify what the one under the tool-data dir is for?
While attempting a fresh Galaxy install from both galaxy-central and
galaxy-dist, I ran into a problem initialising the default SQLite database:
$ hg clone https://bitbucket.org/galaxy/galaxy-dist
$ cd galaxy-dist
galaxy.model.migrate.check DEBUG 2013-10-31 10:28:26,143 pysqlite>=2
egg successfully loaded for sqlite dialect
Traceback (most recent call last):
OperationalError: (OperationalError) database is locked u'PRAGMA
After some puzzlement, I realised this was down to the file system -
I was trying this under my home directory mounted via a distributed
file system (gluster I think).
Repeating the experiment under /tmp on a local hard disk worked :)
(I'm posting this message for future reference; hopefully Google and/or
mailing list searches will help anyone else facing this error)
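
A quick way to check a directory before installing there is to ask SQLite to take a write lock on a throwaway database in it. This is only a minimal sketch using Python's standard sqlite3 module; the function name is made up for illustration, and a directory that passes this probe is not guaranteed to behave under Galaxy's real workload.

```python
import os
import sqlite3
import tempfile


def sqlite_locking_works(directory):
    """Return True if SQLite can create, lock and write a database in `directory`."""
    fd, path = tempfile.mkstemp(suffix=".sqlite", dir=directory)
    os.close(fd)
    try:
        # isolation_level=None: we issue BEGIN/COMMIT ourselves.
        conn = sqlite3.connect(path, timeout=1, isolation_level=None)
        try:
            # BEGIN IMMEDIATE makes SQLite take the write lock straight away.
            conn.execute("BEGIN IMMEDIATE")
            conn.execute("CREATE TABLE probe (id INTEGER)")
            conn.execute("COMMIT")
        finally:
            conn.close()
        return True
    except sqlite3.OperationalError:
        # e.g. "database is locked" on file systems without working file locks
        return False
    finally:
        os.remove(path)
```

Calling sqlite_locking_works("/tmp") and then the same on the intended install directory should show whether the two file systems behave differently.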
I have managed to get proftpd working: it can connect to the Galaxy SQL database, and users can log in to upload files to their directory. But there is a problem: when a Galaxy user logs in to the Galaxy web platform, they can't see their uploaded files because Galaxy doesn't have the rights to open the directory. How can I change the permissions in proftpd so that Galaxy can open the user directory?
Thanks in advance to all.
What is the current status in Galaxy for supporting compressed files?
We've talked about this before, for example in addition to FASTQ,
many of us have expressed a wish to work with gzipped FASTQ.
I understand that some have customized their local Galaxy
installations to use gzipped FASTQ as a specific data type - I'm
more interested in a general file format neutral solution.
Also, I'd like to be able to use BGZF (not just GZIP) because it
is better for random access - see for example
- and makes it much easier to break up large datafiles for sharing
over a cluster (i.e. it could be exploited in the current Galaxy code
for splitting large sequence files).
The 11 May 2012 Galaxy Development News Brief
mentions tabix indexing - that uses bgzip, so is there something
general in place yet to allow tool wrappers to say they accept not just
given file formats, but different compressed versions of file formats?
Ideally I'd like to be able to write an XML tool description saying
a tool produced BGZF compressed tabular data, or GZIP
compressed Sanger FASTQ etc. Similarly, I'd like to specify my
tool accepts FASTA or gzipped FASTA (including BGZF FASTA).
For older tools that say they accept only uncompressed
FASTA, Galaxy could automatically decompress any compressed
FASTA entries in my history on demand.
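
To illustrate the GZIP versus BGZF distinction in a format neutral way: both start with the same gzip magic bytes, and a BGZF block additionally carries a 'BC' subfield in the gzip extra field (as described in the SAM/BAM specification). The sketch below only looks at the header of the first block; the function name is hypothetical and this is not how Galaxy currently sniffs files.

```python
import struct


def compression_kind(path):
    """Return 'bgzf', 'gzip' or 'none' based on the file header alone."""
    with open(path, "rb") as handle:
        header = handle.read(12)
        # gzip magic bytes followed by the deflate compression method
        if len(header) < 10 or header[:3] != b"\x1f\x8b\x08":
            return "none"
        flags = header[3]
        if len(header) < 12 or not flags & 0x04:  # FEXTRA flag not set
            return "gzip"
        (xlen,) = struct.unpack("<H", header[10:12])
        extra = handle.read(xlen)
    # Walk the extra subfields looking for the BGZF 'BC' marker.
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzf"
        pos += 4 + slen
    return "gzip"
```

A check like this would let a wrapper (or Galaxy itself) decide whether a file needs decompressing first, independent of whether the payload is FASTA, FASTQ or tabular.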
I would like to inquire whether anyone has attempted to implement the
idxstats tool from samtools into Galaxy?
The xml-file for idxstats is not present in the Galaxy source code,
which led me to try and implement it myself.
However, the main problem I face is that the idxstats tool silently
relies on having an index file available (within the same directory)
for the BAM file you wish to print the stats for.
samtools idxstats PATH/test.bam
searches for PATH/test.bam.bai and gives an error when this file is not
present. And somehow I cannot model this behavior in Galaxy.
A different solution would of course be to ask the author(s) of samtools
to have an option available where the user can directly indicate the
path to the index file.
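
One possible workaround, sketched here under assumptions: if the wrapper can get hold of both the BAM path and the path of the index Galaxy keeps for it (how the index path is obtained is left open here), a small helper script can symlink the two side by side under the names samtools expects. The function names and the "input.bam" naming are made up for illustration.

```python
import os
import tempfile


def stage_bam_with_index(bam_path, index_path, work_dir=None):
    """Symlink a BAM and its index side by side so `samtools idxstats` finds both."""
    work_dir = work_dir or tempfile.mkdtemp(prefix="idxstats_")
    staged_bam = os.path.join(work_dir, "input.bam")
    os.symlink(os.path.abspath(bam_path), staged_bam)
    # samtools looks for <bam>.bai next to the BAM file
    os.symlink(os.path.abspath(index_path), staged_bam + ".bai")
    return staged_bam


def idxstats_command(staged_bam):
    # Argument list suitable for subprocess; output goes to stdout
    return ["samtools", "idxstats", staged_bam]
```

The wrapper would then run the returned command and redirect stdout to the Galaxy output dataset.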
PS: I've searched the mailing list archives for this problem but did not
find any matches. Apologies if I somehow missed the answer.
Michiel Van Bel, PhD
Tel:+32 (0)9 331 36 95 fax:+32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
I'd like to be able to write some simple <test> entries for
some of the BLAST+ tools using composite datatypes
as input or output (i.e. small BLAST databases). This
doesn't seem to be mentioned or hinted at on the wiki:
Is it possible to use a composite datatype as a test input?
If so how? Normal datatypes are loaded into the test history
using the upload tool - does that mean I first need to
extend the relevant datatypes to allow them to be uploaded?
Example: Run blastp using a small query FASTA file and
a small database, check the output (eg tabular).
Is it possible to use a composite datatype as a test output?
If so how?
Example: Run makeblastdb using a small FASTA file, and
check the output (a small BLAST database).
There have been a few posts lately about doing distributed computing via Galaxy
(job splitters etc.). Below is a contribution of some ideas we have developed
and applied in our work, where we have arranged for some Galaxy tools to execute in parallel
on our cluster.
We have developed a job-splitter script "tardis.py" (available from
https://bitbucket.org/agr-bifo/tardis), which takes marked-up
standard unix commands that run an application or tool. The mark-up is
prefixed to the input and output command-line options. Tardis strips off the
mark-up, and re-writes the commands to refer to split inputs and outputs, which are then
executed in parallel e.g. on a distributed compute resource. Tardis knows
the output files to expect and how to join them back together.
(This was referred to in our GCC2013 talk
Any reasonable unix based data processing or analysis command may be marked up and run
using tardis, though of course tardis needs to know how to split and join the data. Our approach
also assumes a "symmetrical" HPC cluster configuration, in the sense that each node sees the same
view of the file system (and has the required underlying application installed). We use tardis
to support both Galaxy and command-line based compute.
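
The general split / run / join pattern described above can be sketched in a few lines. This is not tardis code: threads stand in for cluster jobs, the function names are invented for illustration, and a real implementation would write chunk files and launch external commands instead of calling a Python function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def split_records(records, chunk_size):
    """Yield successive lists of at most chunk_size records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def run_split_join(records, worker, chunk_size, max_workers=4):
    """Split records into chunks, run worker on each chunk in parallel, join in order."""
    chunks = list(split_records(records, chunk_size))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in submission order, so the join is deterministic
        results = list(pool.map(worker, chunks))
    return [item for result in results for item in result]
```

For example, run_split_join(["acgt", "ttga", "ggcc"], lambda chunk: [s.upper() for s in chunk], chunk_size=2) processes two chunks and returns the records in their original order.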
Background / design pattern / motivating analogy: Galaxy provides a high level
"end to end" view of a workflow; the HPC cluster resource that one uses then involves
spraying chunks of data out into parallel processes, usually in the form of some kind of
distributed compute cluster - but an end-user looking at a Galaxy history, should ideally not be able
to tell whether the workflow was run as a single process on the server, or
via many parallel processes on the cluster (apart from the fact that when run
in parallel on the cluster, it's a lot faster!). We noticed that the TCP/IP layered networking
protocol stack provides a useful metaphor and design pattern - with the "end to end" topology
of a Galaxy workflow corresponding to the transport layer of TCP/IP; and the distribution
of computation across a cluster corresponding to the next TCP/IP layer down - the packet-routing
This picture suggested a strongly layered approach to provisioning
Galaxy with parallelised compute on split data, and hence to an approach in which the
footprint in the Galaxy code-base, of parallel / distributed compute support, should ideally
(from the layered-design point of view) be minimal and superficial. Thus in our approach so far,
the only footprint is in the tool config files, where we arrange the templating to
(optionally) prefix the required tardis mark-up to the input and output command options, and
the tardis script name to the command as a whole. tardis then takes care of rewriting and
launching all of the jobs, and finally joining the results back together and putting them where
Galaxy expects them to be (and also housekeeping such as collating and passing up stderr and stdout, and
appropriate process exit codes). (For each galaxy job, tardis creates a working folder in a designated
scratch area, where input files are uncompressed and split; job files and their output
are stored; logging is done etc. Split data is cleaned up at the end unless there
was an error in some part of the job, in which case everything is retained
for debugging and in some cases restart)
(We modify Galaxy tool-configs so that the user can optionally choose to run
the tool on our HPC cluster - there are three HPC related input fields, appended
to the input section of a tool. Here the user selects whether they want to use
our cluster and if so, they specify the chunk size, and can also at that point
specify a sampling rate, since we often find it useful to be able to run preliminary
analyses on a random sample of (for example) single or paired-end NGS sequence
data, to obtain a fairly quick snapshot of the data, before the expense of a
complete run. We found it convenient to include support for input sampling
The pdf document at https://bitbucket.org/agr-bifo/tardis includes a number of
examples of marking up a command, and also a simple example of a galaxy tool-config that
has been modified to include support for optionally running the job on our HPC cluster
via the tardis pre-processor.
* we have not yet attempted to integrate our approach with the existing Galaxy job-splitting
distributed compute support, partly because of our "layered" design goal (admittedly also partly
because of ignorance about its details!)
* our current implementation is quite naive in the distributed compute API
it uses - it supports launching condor job files (and also native sub-processes) - our plan
is to replace that with using the drmaa API
* we would like to integrate it better with the galaxy type system, probably via
a galaxy-tardis wrapper
We would be keen to contribute our approach to Galaxy if people are interested.
Bioinformatics Software Engineer
I've sprouted some grays in the last week after my Galaxy instances all simultaneously ceased to submit jobs to our main cluster.
Some Galaxy instances are running the PBS job runner, and others use DRMAA. For the DRMAA runner I was getting:
galaxy.jobs.runners ERROR 2013-10-15 08:40:14,942 (1024) Unhandled exception calling queue_job
Traceback (most recent call last):
File "galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
File "galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
external_job_id = self.ds.runJob(jt)
File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 331, in runJob
_h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 213, in c
return f(*(args + (error_buffer, sizeof(error_buffer))))
File "build/bdist.linux-x86_64/egg/drmaa/errors.py", line 90, in error_check
raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
InternalException: code 1: (qsub) cannot access script file: Unauthorized Request MSG=can not authorize request (0-Success)
And in my PBS runner:
galaxy.jobs.runners.pbs WARNING 2013-10-14 17:13:07,319 (550) pbs_submit failed (try 1/5), PBS error 15044: Resources temporarily unavailable
To give some background, I had recently requested a new virtual machine to put my test/dev Galaxy on. I copied our production Galaxy to this new VM. I secured a new domain name for it and set it running. Everything was going well until I tried to hook it up to the cluster; at first I got an error saying that I didn't have permission to submit jobs. Makes sense, the new VM was not a qualified submit host for the cluster. I asked the sysadmins to add the VM as a submit host to the cluster using qmgr. As soon as this was done, not only could I still not submit jobs from the test Galaxy, but no Galaxy was able to submit jobs to the cluster.
The issue here isn't with Galaxy but with the underlying calls that it makes - for DRMAA, I tracked it back to pbs-drmaa/bin/drmaa-run. For PBS, I'm sure it's somewhere in libtorque. In every case, I could call qsub from the command line and it would correctly submit jobs, which was even more perplexing.
I re-installed python, drmaa.egg, pbs-drmaa, and rebooted the VM. I of course restarted Galaxy with each step, to no avail. I worked with the admins to see what was happening in the server logs, but the same cryptic error showed up - cannot authorize request. I've had this issue before, more or less, but usually just gave up on it. It seemed to come and go sporadically, but rebooting the clusters seemed to help.
This time, with our production server no longer functioning, I begged for help and the admins looked through the pbs_server config but couldn't find any mistypes or problems. Reloading the config by sending hangup signals to pbs_server didn't help. Then we tried pausing the scheduler and restarting pbs_server completely - and eureka, all problems went away. PBS and DRMAA runners are back up and working fine. This really seems to be a bug in Torque 22.214.171.124.
I hope this saves someone a lot of headache! Newer versions of Torque may be the answer. I would also advise against making changes to the pbs_server configuration while in production - we have monthly maintenance, and I don't think I'll ever request changes when there won't be an immediate reboot to flush the server!