October 2013 - galaxy-dev - lists.galaxyproject.org

A new tool_data_table_conf.xml created under tool-data directory
by Derrick Lin 17 Jan '14

Hi guys, After installing BWA from the Tool Shed, I found that a new tool_data_table_conf.xml had been created under the tool-data directory. The file contains the entries for BWA's loc files. I believe the installed BWA still relies on the tool_data_table_conf.xml under the Galaxy root directory. Can anyone clarify what the one under the tool-data directory is for? Regards, Derrick

New Galaxy won't start, SQLite (OperationalError) database is locked
by Peter Cock 20 Dec '13

Hello all,

While attempting a fresh Galaxy install from both galaxy-central and galaxy-dist, I ran into a problem initialising the default SQLite database (database/universe.sqlite):

    $ hg clone https://bitbucket.org/galaxy/galaxy-dist
    ...
    $ cd galaxy-dist
    $ ./run.sh
    ...
    galaxy.model.migrate.check DEBUG 2013-10-31 10:28:26,143 pysqlite>=2 egg successfully loaded for sqlite dialect
    Traceback (most recent call last):
    ...
    OperationalError: (OperationalError) database is locked u'PRAGMA table_info("dataset")' ()

After some puzzlement, I realised this was down to the file system - I was trying this under my home directory mounted via a distributed file system (gluster I think). Repeating the experiment under /tmp on a local hard disk worked :)

(I'm posting this message for future reference; hopefully Google and/or mailing list searches will help anyone else facing this error)

Regards, Peter
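
A quick way to check whether a directory is affected before installing Galaxy there is to let SQLite create and query a throwaway database in it. The following is a minimal sketch using only the Python standard library; the function name `sqlite_locking_works` is made up for this illustration and is not part of Galaxy.

```python
import os
import sqlite3
import tempfile


def sqlite_locking_works(directory):
    """Return True if SQLite can create, write and query a database in `directory`."""
    # Hypothetical helper: creates a throwaway database file and removes it afterwards.
    fd, path = tempfile.mkstemp(suffix=".sqlite", dir=directory)
    os.close(fd)
    try:
        conn = sqlite3.connect(path, timeout=5)
        try:
            conn.execute("CREATE TABLE probe (id INTEGER PRIMARY KEY)")
            conn.execute("INSERT INTO probe DEFAULT VALUES")
            conn.commit()
            # Same kind of statement that failed in the traceback above.
            conn.execute('PRAGMA table_info("probe")').fetchall()
        finally:
            conn.close()
        return True
    except sqlite3.OperationalError:
        # "database is locked" and similar errors end up here.
        return False
    finally:
        for leftover in (path, path + "-journal"):
            if os.path.exists(leftover):
                os.remove(leftover)
```

Calling `sqlite_locking_works("/path/to/galaxy-dist/database")` on a file system with broken locking would be expected to return False, while a local disk such as /tmp should return True. This is only a probe; it does not prove that every SQLite workload will behave on that file system.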

FTP Problem: Users cannot see their uploads files.
by Misharl mon 07 Dec '13

Hi everybody, I have managed to get proftpd working: it connects to the Galaxy SQL database, and users can log in and upload files to their directories. But there is a problem: when a user logs in to the Galaxy web interface, they cannot see their uploaded files, because Galaxy does not have permission to open the directory. How can I change the permissions in proftpd so that Galaxy can open the user directory? Thanks in advance to all. Mish

Missing test results on (Test) Tool Shed
by Peter Cock 29 Nov '13

Hi Greg & Dave,

I really like the new features on the (Test) Tool Shed for searching my repositories:

* Latest revision missing tool tests
* Latest revision failing tool tests
* Latest revision all tool tests pass

However there are some teething problems. Some of my tools are listed under "Latest revision failing tool tests", but when I go to look at them, no test results are shown (passing or failing):

http://testtoolshed.g2.bx.psu.edu/view/peterjc/blastxml_to_top_descr
http://testtoolshed.g2.bx.psu.edu/view/peterjc/effectivet3
http://testtoolshed.g2.bx.psu.edu/view/peterjc/clinod
http://testtoolshed.g2.bx.psu.edu/view/peterjc/get_orfs_or_cdss
http://testtoolshed.g2.bx.psu.edu/view/peterjc/mira_assembler
http://testtoolshed.g2.bx.psu.edu/view/peterjc/seq_primer_clip

In some cases there is indeed a failing test, for instance this is due to a bug in the test framework:

http://testtoolshed.g2.bx.psu.edu/view/peterjc/blast2go
https://trello.com/c/KdGX3hkh

And here there are missing dependencies (due to restrictive licensing problems):

http://testtoolshed.g2.bx.psu.edu/view/peterjc/tmhmm_and_signalp
https://trello.com/card/-/506338ce32ae458f6d15e4b3/770

Peter

Compressed data files in Galaxy? (e.g. GZIP or BGZF)
by Peter Cock 26 Nov '13

Hello all,

What is the current status in Galaxy for supporting compressed files? We've talked about this before, for example in addition to FASTQ, many of us have expressed a wish to work with gzipped FASTQ. I understand that some have customized their local Galaxy installations to use gzipped FASTQ as a specific data type - I'm more interested in a general, file-format-neutral solution.

Also, I'd like to be able to use BGZF (not just GZIP) because it is better for random access - see for example http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html - and makes it much easier to break up large data files for sharing over a cluster (i.e. it could be exploited in the current Galaxy code for splitting large sequence files).

The 11 May 2012 Galaxy Development News Brief http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-May/009757.html mentions tabix indexing - that uses bgzip, so is there something general in place yet to allow tool wrappers to say they accept not just given file formats, but different compressed versions of file formats?

Ideally I'd like to be able to write an XML tool description saying a tool produces BGZF compressed tabular data, or GZIP compressed Sanger FASTQ etc. Similarly, I'd like to specify that my tool accepts FASTA or gzipped FASTA (including BGZF FASTA). For older tools that say they accept only uncompressed FASTA, Galaxy could automatically decompress any compressed FASTA entries in my history on demand.

Peter
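
Any format-neutral handling of compressed datasets would first need to tell plain GZIP from BGZF. A BGZF file is a valid gzip file whose block headers carry an extra subfield tagged "BC", so the two can be told apart from the first few bytes. The sketch below is only an illustration of that check, not Galaxy code; the function name `compression_kind` is an assumption made for this example.

```python
import gzip
import struct

# The 28-byte empty BGZF block that bgzip appends as an end-of-file marker.
BGZF_EOF = bytes.fromhex(
    "1f8b0804" "00000000" "00ff" "0600"
    "4243" "0200" "1b00"
    "0300" "00000000" "00000000"
)


def compression_kind(header):
    """Classify the leading bytes of a file as 'bgzf', 'gzip' or 'none'."""
    # Hypothetical helper: pass it at least the first 18 bytes of the file.
    if len(header) < 10 or header[:2] != b"\x1f\x8b":
        return "none"
    flags = header[3]
    if not flags & 0x04 or len(header) < 12:
        # FEXTRA flag unset (or header too short): ordinary gzip.
        return "gzip"
    (xlen,) = struct.unpack("<H", header[10:12])
    extra = header[12:12 + xlen]
    pos = 0
    while pos + 4 <= len(extra):
        si1, si2, slen = struct.unpack("<BBH", extra[pos:pos + 4])
        if (si1, si2, slen) == (ord("B"), ord("C"), 2):
            return "bgzf"  # found the BGZF block-size subfield
        pos += 4 + slen
    return "gzip"
```

For example, `compression_kind(open(path, "rb").read(18))` would let a wrapper decide whether random access via BGZF blocks is possible or whether the file has to be decompressed as a single gzip stream.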

Samtools and idxstats
by Michiel Van Bel 20 Nov '13

Hi,

I would like to ask whether anyone has attempted to integrate the idxstats tool from samtools into Galaxy. The XML file for idxstats is not present in the Galaxy source code, which led me to try to implement it myself. However, the main problem I face is that idxstats silently relies on an index file being available (within the same directory) for the BAM file you wish to print the stats for. E.g. samtools idxstats PATH/test.bam searches for PATH/test.bam.bai and gives an error when this file is not present. And somehow I cannot model this behavior in Galaxy.

A different solution would of course be to ask the author(s) of samtools to add an option where the user can directly indicate the path to the index file.

regards, Michiel

PS: I've searched the mailing list archives for this problem but did not find any matches. Apologies if I somehow missed the answer.

--
Michiel Van Bel, PhD
Expert Bioinformatician
Tel: +32 (0)9 331 36 95
Fax: +32 (0)9 3313809
VIB Department of Plant Systems Biology, Ghent University
Technologiepark 927, 9052 Gent, BELGIUM
mibel(a)psb.vib-ugent.be
http://www.psb.vib-ugent.be
http://bioinformatics.psb.ugent.be
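
One possible workaround, sketched here under the assumption that Galaxy can hand the wrapper the paths of both the BAM dataset and its index (how it does so is not covered in this post), is to symlink the two files side by side under the names samtools expects and run idxstats on the link. The function names and the `input.bam` file name below are made up for this illustration.

```python
import os


def stage_bam_with_index(bam_path, bai_path, work_dir):
    """Symlink a BAM file and its index side by side; return the staged BAM path."""
    # Hypothetical helper: 'input.bam' is an arbitrary name chosen for this sketch.
    staged_bam = os.path.join(work_dir, "input.bam")
    os.symlink(os.path.abspath(bam_path), staged_bam)
    # samtools looks for <bam>.bai next to the BAM file.
    os.symlink(os.path.abspath(bai_path), staged_bam + ".bai")
    return staged_bam


def idxstats_command(staged_bam):
    """Build the argument list for subprocess.run(); nothing is executed here."""
    return ["samtools", "idxstats", staged_bam]
```

A wrapper script could call `stage_bam_with_index(...)` in the job working directory and then pass `idxstats_command(staged)` to `subprocess.run`, redirecting standard output to the Galaxy output dataset.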

Tool unit tests using composite datatypes
by Peter Cock 18 Nov '13

Hello all,

I'd like to be able to write some simple <test> entries for some of the BLAST+ tools using composite datatypes as input or output (i.e. small BLAST databases). This doesn't seem to be mentioned or hinted at on the wiki: http://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax?action=show&redi…

Is it possible to use a composite datatype as a test input? If so, how? Normal datatypes are loaded into the test history using the upload tool - does that mean I first need to extend the relevant datatypes to allow them to be uploaded? Example: Run blastp using a small query FASTA file and a small database, check the output (e.g. tabular).

Is it possible to use a composite datatype as a test output? If so, how? Example: Run makeblastdb using a small FASTA file, and check the output (a small BLAST database).

Thanks, Peter

tardis job splitter
by McCulloch, Alan 18 Nov '13

Dear all,

There have been a few posts lately about doing distributed computing via Galaxy - i.e. job splitters etc. Below is a contribution of some ideas we have developed and applied in our work, where we have arranged for some Galaxy tools to execute in parallel on our cluster.

We have developed a job-splitter script "tardis.py" (available from https://bitbucket.org/agr-bifo/tardis) which takes marked-up standard unix commands that run an application or tool. The mark-up is prefixed to the input and output command-line options. Tardis strips off the mark-up, and re-writes the commands to refer to split inputs and outputs, which are then executed in parallel, e.g. on a distributed compute resource. Tardis knows the output files to expect and how to join them back together. (This was referred to in our GCC2013 talk http://wiki.galaxyproject.org/Events/GCC2013/Abstracts#Events.2FGCC2013.2FA… )

Any reasonable unix based data processing or analysis command may be marked up and run using tardis, though of course tardis needs to know how to split and join the data. Our approach also assumes a "symmetrical" HPC cluster configuration, in the sense that each node sees the same view of the file system (and has the required underlying application installed). We use tardis to support both Galaxy and command-line based compute.

Background / design pattern / motivating analogy:

Galaxy provides a high level "end to end" view of a workflow; the HPC cluster resource that one uses then involves spraying chunks of data out into parallel processes, usually in the form of some kind of distributed compute cluster - but an end-user looking at a Galaxy history should ideally not be able to tell whether the workflow was run as a single process on the server, or via many parallel processes on the cluster (apart from the fact that when run in parallel on the cluster, it's a lot faster!).

We noticed that the TCP/IP layered networking protocol stack provides a useful metaphor and design pattern - with the "end to end" topology of a Galaxy workflow corresponding to the transport layer of TCP/IP, and the distribution of computation across a cluster corresponding to the next TCP/IP layer down - the packet-routing layer.

This picture suggested a strongly layered approach to provisioning Galaxy with parallelised compute on split data, and hence an approach in which the footprint in the Galaxy code-base of parallel / distributed compute support should ideally (from the layered-design point of view) be minimal and superficial.

Thus in our approach so far, the only footprint is in the tool config files, where we arrange the templating to (optionally) prefix the required tardis mark-up to the input and output command options, and the tardis script name to the command as a whole. tardis then takes care of rewriting and launching all of the jobs, and finally joining the results back together and putting them where Galaxy expects them to be (and also housekeeping such as collating and passing up stderr and stdout, and appropriate process exit codes).

(For each Galaxy job, tardis creates a working folder in a designated scratch area, where input files are uncompressed and split, job files and their output are stored, logging is done etc. Split data is cleaned up at the end unless there was an error in some part of the job, in which case everything is retained for debugging and in some cases restart.)

(We modify Galaxy tool-configs so that the user can optionally choose to run the tool on our HPC cluster - there are three HPC related input fields, appended to the input section of a tool. Here the user selects whether they want to use our cluster and if so, they specify the chunk size, and can also at that point specify a sampling rate, since we often find it useful to be able to run preliminary analyses on a random sample of (for example) single or paired-end NGS sequence data, to obtain a fairly quick snapshot of the data, before the expense of a complete run. We found it convenient to include support for input sampling in tardis.)

The pdf document at https://bitbucket.org/agr-bifo/tardis includes a number of examples of marking up a command, and also a simple example of a Galaxy tool-config that has been modified to include support for optionally running the job on our HPC cluster via the tardis pre-processor.

Known limitations:

* we have not yet attempted to integrate our approach with the existing Galaxy job-splitting distributed compute support, partly because of our "layered" design goal (admittedly also partly because of ignorance about its details!)
* our current implementation is quite naive in the distributed compute API it uses - it supports launching condor job files (and also native sub-processes) - our plan is to replace that with the drmaa API
* we would like to integrate it better with the Galaxy type system, probably via a galaxy-tardis wrapper

We would be keen to contribute our approach to Galaxy if people are interested.

Cheers
Alan McCulloch
Bioinformatics Software Engineer
AgResearch NZ
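
The split, run and join pattern described in this post (with a chunk size and an optional sampling rate) can be illustrated in a few lines. This is a toy sketch on in-memory records using local threads; it is not tardis code, it does not use tardis mark-up, and the function names are invented for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_records(records, chunk_size, sample_rate=1.0, seed=0):
    """Optionally down-sample the records, then cut them into fixed-size chunks."""
    rng = random.Random(seed)
    kept = [r for r in records if sample_rate >= 1.0 or rng.random() < sample_rate]
    return [kept[i:i + chunk_size] for i in range(0, len(kept), chunk_size)]


def run_split_job(records, worker, chunk_size, sample_rate=1.0):
    """Run `worker` on each chunk in parallel and join the outputs in input order."""
    chunks = split_records(records, chunk_size, sample_rate)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(worker, chunks))  # map() preserves chunk order
    return [item for chunk in results for item in chunk]
```

For instance, `run_split_job(reads, worker, chunk_size=1000, sample_rate=0.1)` would process roughly a tenth of the records in chunks of 1000. On a real cluster the thread pool would be replaced by job submission, and the join step would depend on the output format.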

Galaxy wrappers for CLC Assembly Cell (CLCbio)
by Peter Cock 15 Nov '13

Hello all,

This is just to announce I am working on a wrapper for "CLC Assembly Cell", which is the CLCbio commercial command line assembly tool suite: http://www.clcbio.com/products/clc-assembly-cell/

Our institute bought a licence primarily for use on plant genomes where other assemblers at the time required too much RAM to complete. This assembler is both fast and low memory, which can be very useful.

Wrapper development here: https://github.com/peterjc/pico_galaxy/tree/master/tools/clc_assembly_cell

Prototype releases will be on the Test Tool Shed (soon): http://testtoolshed.g2.bx.psu.edu/view/peterjc/clc_assembly_cell

Stable Tool Shed releases will be here (later): http://toolshed.g2.bx.psu.edu/view/peterjc/clc_assembly_cell

I would be interested to hear from anyone else with access to a licensed copy of the tool who is interested in using it from Galaxy, e.g. is it reasonable to assume the tools are on the $PATH, or is using a specific environment variable more helpful?

Regards, Peter
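
The two options in the closing question need not be exclusive: a wrapper could check a dedicated environment variable first and fall back to the $PATH. A minimal sketch follows; the variable name `CLC_ASSEMBLY_CELL_DIR` and the function name are hypothetical, chosen only for this illustration.

```python
import os
import shutil


def find_clc_tool(name, env_var="CLC_ASSEMBLY_CELL_DIR"):
    """Return the path to executable `name`, or None if it cannot be found."""
    # Hypothetical convention: env_var points at the installation directory.
    install_dir = os.environ.get(env_var)
    if install_dir:
        candidate = os.path.join(install_dir, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # Fall back to a normal $PATH lookup.
    return shutil.which(name)
```

With this approach an administrator who installs the licensed binaries outside the $PATH only has to set one variable, while sites that already have the tools on the $PATH need no extra configuration.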

Errors running DRMAA and PBS on remote server running Torque 4
by Ganote, Carrie L 14 Nov '13

Hi List,

I've sprouted some grays in the last week after my Galaxy instances all simultaneously ceased to submit jobs to our main cluster. Some Galaxy instances are running the PBS job runner, and others use DRMAA. For the DRMAA runner I was getting:

    galaxy.jobs.runners ERROR 2013-10-15 08:40:14,942 (1024) Unhandled exception calling queue_job
    Traceback (most recent call last):
      File "galaxy-dist/lib/galaxy/jobs/runners/__init__.py", line 60, in run_next
        method(arg)
      File "galaxy-dist/lib/galaxy/jobs/runners/drmaa.py", line 188, in queue_job
        external_job_id = self.ds.runJob(jt)
      File "build/bdist.linux-x86_64/egg/drmaa/__init__.py", line 331, in runJob
        _h.c(_w.drmaa_run_job, jid, _ct.sizeof(jid), jobTemplate)
      File "build/bdist.linux-x86_64/egg/drmaa/helpers.py", line 213, in c
        return f(*(args + (error_buffer, sizeof(error_buffer))))
      File "build/bdist.linux-x86_64/egg/drmaa/errors.py", line 90, in error_check
        raise _ERRORS[code-1]("code %s: %s" % (code, error_buffer.value))
    InternalException: code 1: (qsub) cannot access script file: Unauthorized Request MSG=can not authorize request (0-Success)

And in my PBS runner:

    galaxy.jobs.runners.pbs WARNING 2013-10-14 17:13:07,319 (550) pbs_submit failed (try 1/5), PBS error 15044: Resources temporarily unavailable

To give some background, I had recently requested a new virtual machine to put my test/dev Galaxy on. I copied our production Galaxy to this new VM. I secured a new domain name for it and set it running. Everything was going well until I tried to hook it up to the cluster; at first I got an error saying that I didn't have permission to submit jobs. Makes sense, the new VM was not a qualified submit host for the cluster. I asked the sysadmins to add the VM as a submit host to the cluster using qmgr. As soon as this was done, not only could I still not submit jobs from the test Galaxy, but no Galaxy was able to submit jobs to the cluster.

The issue isn't with Galaxy here but the underlying calls that it makes - for drmaa, I tracked it back to pbs-drmaa/bin/drmaa-run. For PBS, I'm sure it's somewhere in libtorque. In every case, I could call qsub from the command line and it would correctly submit jobs, which was more perplexing. I re-installed python, drmaa.egg, pbs-drmaa, and rebooted the VM. I of course restarted Galaxy with each step, to no avail. I worked with the admins to see what was happening in the server logs, but the same cryptic error showed up - cannot authorize request.

I've had this issue before, more or less, but usually just gave up on it. It seemed to come and go sporadically, but rebooting the clusters seemed to help. This time, with our production server no longer functioning, I begged for help and the admins looked through the pbs_server config but couldn't find any typos or problems. Reloading the config by sending hangup signals to pbs_server didn't help. Then we tried pausing the scheduler and restarting pbs_server completely - and eureka, all problems went away. PBS and DRMAA runners are back up and working fine.

This really seems to be a bug in Torque 4.1.5.1. I hope this saves someone a lot of headache! Newer versions of Torque may be the answer. I would also advise against making changes to the pbs_server configuration while in production - we have monthly maintenance, and I don't think I'll ever request changes when there won't be an immediate reboot to flush the server!

Cheers, Carrie
