March 2013 - galaxy-dev - lists.galaxyproject.org

Compressed data files in Galaxy? (e.g. GZIP or BGZF)
by Peter Cock 26 Nov '13


Hello all,

What is the current status in Galaxy for supporting compressed files? We've talked about this before; for example, in addition to FASTQ, many of us have expressed a wish to work with gzipped FASTQ. I understand that some have customized their local Galaxy installations to use gzipped FASTQ as a specific data type - I'm more interested in a general, file-format-neutral solution.

Also, I'd like to be able to use BGZF (not just GZIP) because it is better for random access - see for example http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html - and makes it much easier to break up large data files for sharing over a cluster (i.e. it could be exploited in the current Galaxy code for splitting large sequence files).

The 11 May 2012 Galaxy Development News Brief http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-May/009757.html mentions tabix indexing - that uses bgzip, so is there something general in place yet to allow tool wrappers to say they accept not just given file formats, but different compressed versions of file formats?

Ideally I'd like to be able to write an XML tool description saying a tool produces BGZF compressed tabular data, or GZIP compressed Sanger FASTQ, etc. Similarly, I'd like to specify that my tool accepts FASTA or gzipped FASTA (including BGZF FASTA). For older tools that say they accept only uncompressed FASTA, Galaxy could automatically decompress any compressed FASTA entries in my history on demand.

Peter
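As an illustration of the "format neutral" idea above, here is a minimal Python sketch of telling plain, GZIP and BGZF data apart from the leading bytes alone. It is not Galaxy code: the function name `sniff_compression` is made up for this example. It relies only on the gzip magic number and on the `BC` extra subfield that BGZF blocks carry in their gzip header.

```python
import gzip
import struct

# The standard 28-byte empty BGZF block that terminates a BGZF file.
BGZF_EOF = bytes.fromhex(
    "1f8b08040000000000ff0600424302001b0003000000000000000000"
)


def sniff_compression(data: bytes) -> str:
    """Classify leading bytes as 'uncompressed', 'gzip' or 'bgzf'."""
    # Every gzip member (and so every BGZF block) starts with 1f 8b.
    if len(data) < 4 or data[:2] != b"\x1f\x8b":
        return "uncompressed"
    flags = data[3]
    # BGZF requires the FEXTRA flag (bit 2); plain gzip usually lacks it.
    if not flags & 0x04 or len(data) < 12:
        return "gzip"
    (xlen,) = struct.unpack("<H", data[10:12])
    extra = data[12:12 + xlen]
    pos = 0
    # Walk the extra subfields looking for SI1='B', SI2='C', SLEN=2.
    while pos + 4 <= len(extra):
        si1, si2 = extra[pos], extra[pos + 1]
        (slen,) = struct.unpack("<H", extra[pos + 2:pos + 4])
        if si1 == ord("B") and si2 == ord("C") and slen == 2:
            return "bgzf"
        pos += 4 + slen
    return "gzip"


if __name__ == "__main__":
    print(sniff_compression(b">seq1\nACGT\n"))
    print(sniff_compression(gzip.compress(b">seq1\nACGT\n")))
    print(sniff_compression(BGZF_EOF))
```

A check like this only identifies the container; the underlying format (FASTA, FASTQ, tabular) would still need to be sniffed after decompression.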


Contributing to genome indexes on rsync server
by Brad Chapman 08 Nov '13


Hi all;

Is there a way for community members to contribute indexes to the rsync server? This resource is awesome and I'm working on migrating the CloudBioLinux retrieval scripts to use this instead of the custom S3 buckets we'd set up previously:

https://github.com/chapmanb/cloudbiolinux/blob/master/cloudbio/biodata/gala…

It's great to have this as a public shared resource and I'd like to be able to contribute back. From an initial pass, here are the things I'd like to do:

- Include bowtie2 indexes for more genomes.
- Include novoalign indexes for a number of commonly used genomes.
- Clean up hg19 to include a full canonically sorted hg19, with indexes. Broad has a nice version prepped so GATK will be happy with it, and you need to stick with this ordering if you're ever going to use a GATK tool on it. Right now there is a partial hg19canon (without the random/haplotype chromosomes) and the structure is a bit complex.

What's the best way to contribute these? Right now I have a lot of the indexes on S3. For instance, the hg19 indexes are here:

https://s3.amazonaws.com/biodata/genomes/hg19-bowtie.tar.xz
https://s3.amazonaws.com/biodata/genomes/hg19-bowtie2.tar.xz
https://s3.amazonaws.com/biodata/genomes/hg19-bwa.tar.xz
https://s3.amazonaws.com/biodata/genomes/hg19-novoalign.tar.xz
https://s3.amazonaws.com/biodata/genomes/hg19-seq.tar.xz
https://s3.amazonaws.com/biodata/genomes/hg19-ucsc.tar.xz

I'm happy to format these differently or upload somewhere that would make it easy to include. Thanks again for setting this up, I'm looking forward to working off a shared repository of data,

Brad


Re: [galaxy-dev] [galaxy-user] Inquiring
by Nate Coraor 04 Nov '13


Hi Yan,

I've moved this discussion to the galaxy-dev list since it pertains to a local installation of Galaxy. Responses to your questions follow, in-line.

Yan Luo wrote:
> Dear Sir,
>
> (1)We installed Galaxy, but recently the user can't registered and got the
> following error, how can we fix it?
>
> Sever error
> An error occurred. See the error logs for more information.(To turn debug on
> to display ...).

Since debug = False in universe_wsgi.ini, you should be able to find a more detailed error message in the log file. If starting Galaxy with:

    % sh run.sh --daemon

the default log file is 'paster.log' in Galaxy's root directory.

> (2) Could you please let me know if there is any command to stop galaxy?

If starting with the --daemon flag (as above), you can use:

    % sh run.sh --stop-daemon

If running in the foreground, you can use Ctrl-C to terminate the process. There is a recent bug whereby Ctrl-C is ineffective on some platforms under Python 2.6 - in this case you will have to kill/pkill the process manually. We are working on a fix for the latter.

> (3) If I reset universe_wsgi.ini file and want to set an administrator
> user(I can add a line in the above file), how can I get the password? Should
> I stop galaxy(See question 2) first? then run "./setup.sh" and "./run.sh".

setup.sh would only have been necessary prior to running Galaxy the first time; however, this step has recently been removed. If you are referencing documentation that still refers to setup.sh, please let us know so we can update it - I did notice this was still on the "Production Server" page, so I removed it from there. You no longer need to run setup.sh at all.

> (4) If I run "setup.sh", will a new file "universe_wsgi.ini" be generated?
> if I want to change this file,should I edit it before "run.sh" and after
> "setup.sh". Is it right?

setup.sh and its replacements in run.sh and the Galaxy application itself never overwrite files; they only create files from sample files if they do not exist.

> (5) I read some of your docs, command "sh setup.sh"(sh run.sh) and
> "./setup.sh"(./run.sh), which one is correct under Linux?

Both syntaxes are effectively the same in most cases.

--nate

>
> Looking forward to hearing from you.
>
> Best Wises,
>
> Yan Luo, Ph.D.
> NIH


Deploying LOC files for tool built-in data during a tool installation
by Jean-Frédéric Berthelot 17 Oct '13


Hi list,

The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .loc file configuration). They are .fasta databases which are rather small and are thus bundled in the tool distribution package.

Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, the code is compiled, and since they are there, the data files are copied to $INSTALL_DIR too, ready to be used.

After that, the user still has to edit tool-data/my_fancy_data_files.loc; but the thing is, during the install I know where these data files are (since I copied them there), so I would like to save the user the trouble and set up this file automagically.

I have two questions:

1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?

2/ Is there a way to set up tool-data/my_fancy_data_files.loc during the install?

Here are the options I thought of:

* shipping a "real" my_fancy_data_files.loc.sample with the correct paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
* using more <action type="shell_command"> during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).

Any thoughts?

Cheers,

--
Jean-Frédéric
Bonsai Bioinformatics group
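To make the second option more concrete, here is a rough Python sketch of a script that an install step could call to generate the .loc file from whatever .fasta files ended up in the install directory. Everything in it is an assumption for illustration: the function names, the three-column `value<TAB>name<TAB>path` layout, and the choice never to overwrite an existing .loc file. The real column layout depends on the tool's data table definition.

```python
import tempfile
from pathlib import Path


def build_loc_lines(data_dir):
    """Return one tab-separated line per .fasta file found in data_dir."""
    lines = []
    for fasta in sorted(Path(data_dir).glob("*.fasta")):
        # Hypothetical columns: unique value, display name, absolute path.
        lines.append(f"{fasta.stem}\t{fasta.stem}\t{fasta.resolve()}")
    return lines


def write_loc_file(data_dir, loc_path):
    """Create loc_path from data_dir; leave an existing file untouched."""
    loc_path = Path(loc_path)
    if loc_path.exists():
        # Never clobber a file the administrator may have edited by hand.
        return False
    loc_path.write_text("\n".join(build_loc_lines(data_dir)) + "\n")
    return True


if __name__ == "__main__":
    # Small self-contained demo using a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "db_one.fasta").write_text(">a\nACGT\n")
        loc = Path(tmp) / "my_fancy_data_files.loc"
        print(write_loc_file(tmp, loc))
        print(loc.read_text())
```

Refusing to overwrite keeps the script safe to re-run on reinstall, at the cost of not refreshing paths if the install directory changes.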


Reserved variables in param tags
by Kohler Manuel 12 Sep '13


Hi,

I have a question regarding the param tag. I would like to pass the user's email on to an external Python script. I tried to use it like this:

    <param name="email" type="hidden" value=$__user_email__ />
    <param name="experiment" type="select" label="Experiment" help="select Experiment" refresh_on_change="true" dynamic_options="getExperiments(email)"/>

This does not work. Ideally I would like to have something like this:

    dynamic_options="getExperiments($__user_email__)"

Has someone done this before?

Cheers
Manuel

--
Manuel Kohler
Center for Information Sciences and Databases (C-ISD)
Department of Biosystems Science & Engineering (D-BSSE)
ETH Zurich, Maulbeerstrasse (1078, 1.02), CH-4058 Basel, +41 61 387 3132
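For context, the Python side of such a `dynamic_options` call might look roughly like the sketch below. This does not answer how to get the email into the call; it only shows the function that would receive it. The in-memory `EXPERIMENTS` table is invented for the example, and the `(label, value, selected)` tuple shape is my understanding of what a select parameter expects, so treat both as assumptions.

```python
# Hypothetical lookup table standing in for a real experiment database.
EXPERIMENTS = {
    "alice@example.org": ["exp_001", "exp_002"],
    "bob@example.org": ["exp_100"],
}


def getExperiments(email):
    """Return select options for the experiments visible to one user.

    Each option is a (label, value, selected) tuple; the first entry is
    marked as selected so the select box has a default.
    """
    names = EXPERIMENTS.get(email, [])
    if not names:
        # An explicit placeholder avoids rendering an empty select box.
        return [("No experiments found", "", True)]
    return [(name, name, index == 0) for index, name in enumerate(names)]
```

Returning a placeholder option for unknown users is a design choice made here so that a missing or empty email fails visibly in the form rather than with an exception.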


samtools BAM to SAM tmp directory
by Matt Shirley 03 Jul '13


Is there a reason that the samtools BAM to SAM tool does not respect the new_file_path set in the config file? The tmp directory handling by different tool wrappers seems to be an issue right now on systems with small system tmp directories.

--
Matt Shirley
Ph.D Candidate - BCMB
Pevsner Lab <http://pevsnerlab.kennedykrieger.org/>
Johns Hopkins Medicine
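As a sketch of what "respecting new_file_path" could mean inside a wrapper script, the Python below creates the working directory under a configured base path and exports it as TMPDIR for child processes. The function names and the `bam_to_sam_` prefix are invented for illustration, and how the wrapper would actually receive the configured path is left open.

```python
import os
import tempfile


def make_job_tmpdir(new_file_path=None):
    """Create a private temp dir under new_file_path, or the system default."""
    base = new_file_path or tempfile.gettempdir()
    os.makedirs(base, exist_ok=True)
    return tempfile.mkdtemp(prefix="bam_to_sam_", dir=base)


def child_env(tmpdir):
    """Copy the current environment and point TMPDIR at tmpdir."""
    env = dict(os.environ)
    # Many command-line tools and libraries honour TMPDIR for scratch files.
    env["TMPDIR"] = tmpdir
    return env
```

The returned environment could then be passed as `env=` to `subprocess.run` when invoking the tool; the caller remains responsible for removing the directory afterwards.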


Error running tophat2 in Galaxy
by Sachit Adhikari 02 Jul '13


I am getting this error:

Error in tophat:

[2013-02-13 20:46:41] Beginning TopHat run (v2.0.7)
-----------------------------------------------
[2013-02-13 20:46:41] Checking for Bowtie
    Bowtie version: 2.0.6.0
[2013-02-13 20:46:41] Checking for Samtools
    Samtools version: 0.1.18.0
[2013-02-13 20:46:41] Checking for Bowtie index files
[2013-02-13 20:46:41] Checking for reference FASTA file
    Warning: Could not find FASTA file /data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
[2013-02-13 20:46:41] Reconstituting reference FASTA file from Bowtie index
    Executing: /usr/bin/bowtie2-inspect /data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome > ./tophat_out/tmp/genome.fa
[2013-02-13 20:48:51] Generating SAM header for /data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
    format: fastq
    quality scale: phred33 (default)
[2013-02-13 20:49:23] Preparing reads
    left reads: min. length=34, max. length=34, 2 kept reads (0 discarded)
Warning: you have only one segment per read. If the read length is greater than or equal to 45bp, we strongly recommend that you decrease --segment-length to about half the read length because TopHat will work better with multiple segments
[2013-02-13 20:49:23] Mapping left_kept_reads to genome genome with Bowtie2
[2013-02-13 20:49:56] Searching for junctions via segment mapping
    Coverage-search algorithm is turned on, making this step very slow
    Please try running TopHat again with the option (--no-coverage-search) if this step takes too much time or memory.
Warning: junction database is empty!
[2013-02-13 20:51:18] Reporting output tracks
    [FAILED]
Error running /usr/local/bin/tophat_reports --min-anchor 8 --splice-mismatches 0 --min-report-intron 50 --max-report-intron 500000 --min-isoform-fraction 0.15 --output-dir ./tophat_out/ --max-multihits 20 --max-seg-multihits 40 --segment-length 25 --segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50 --max-closure-intron 5000 --min-coverage-intron 50 --max-coverage-intron 20000 --min-segment-intron 50 --max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2 --read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3 --max-deletion-length 3 -z gzip -p4 --no-closure-search --no-microexon-search --sam-header ./tophat_out/tmp/genome_genome.bwt.samheader.sam --report-discordant-pair-alignments --report-mixed-alignments --samtools=/bin/samtools --bowtie2-max-penalty 6 --bowtie2-min-penalty 2 --bowtie2-penalty-for-N 1 --bowtie2-read-gap-open 5 --bowtie2-read-gap-cont 3 --bowtie2-ref-gap-open 5 --bowtie2-ref-gap-cont 3 ./tophat_out/tmp/genome.fa ./tophat_out/junctions.bed ./tophat_out/insertions.bed ./tophat_out/deletions.bed ./tophat_out/fusions.out ./tophat_out/tmp/accepted_hits ./tophat_out/tmp/left_kept_reads.bam
Loading ...done

What's wrong?


Galaxy on Cluster - how to set -a flag with username
by greg 12 Jun '13


In our local Galaxy install we want the cluster jobs to be run as the galaxy user, but we want to include -a [account name] so that our grid software bills properly.

Here's what I currently have in universe_wsgi.ini:

    default_cluster_job_runner = drmaa://-V -pe batch 8/

What I want is something like this:

    default_cluster_job_runner = drmaa://-V -pe batch 8 -a [logged in user name]/

Is this possible?

Thanks,

Greg


Issue with set_user_disk_usage.py and Postgres 8.x
by Lance Parsons 06 Jun '13


The recent updates to set_user_disk_usage.py for Postgres users have an issue with Postgres 8.x. The SQL in the pgcalc method (line 51) leads to the following error:

    sqlalchemy.exc.ProgrammingError: (ProgrammingError) column "d.total_size" must appear in the GROUP BY clause or be used in an aggregate function
    LINE 4: FROM ( SELECT d.total_siz...
                          ^

The problem is that versions of Postgres before 9.x were a bit more restrictive in the use of GROUP BY. This can be fixed by using DISTINCT ON instead. See this StackOverflow post for more info:

http://stackoverflow.com/questions/1769361/postgresql-group-by-different-fr…

I've included a patch below. Let me know if a pull request would be preferred.

--- a/scripts/set_user_disk_usage.py
+++ b/scripts/set_user_disk_usage.py
@@ -52,7 +52,7 @@
 sql = """
 UPDATE galaxy_user
 SET disk_usage = (SELECT COALESCE(SUM(total_size), 0)
-FROM ( SELECT d.total_size
+FROM ( SELECT DISTINCT ON (d.id) d.total_size, d.id
 FROM history_dataset_association hda
 JOIN history h ON h.id = hda.history_id
 JOIN dataset d ON hda.dataset_id = d.id
@@ -62,7 +62,7 @@
 AND d.purged = false
 AND d.id NOT IN (SELECT dataset_id
 FROM library_dataset_dataset_association)
-GROUP BY d.id) sizes)
+) sizes)
 WHERE id = :id
 RETURNING disk_usage;
 """

--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University
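To spell out what the inner query is meant to compute, here is a small Python sketch of the same logic: a dataset referenced by several history items must be counted once, and a user with no datasets gets 0 (the COALESCE). The function name and the `(dataset_id, total_size)` row shape are assumptions for illustration and are not taken from the Galaxy script.

```python
def disk_usage(rows):
    """Sum total_size once per dataset id.

    rows is an iterable of (dataset_id, total_size) pairs, as the join of
    history_dataset_association and dataset would produce; the same
    dataset id may appear several times.
    """
    sizes = {}
    for dataset_id, total_size in rows:
        # Keep the first size seen per id, like DISTINCT ON (d.id).
        sizes.setdefault(dataset_id, total_size)
    # sum() of nothing is 0, matching COALESCE(SUM(total_size), 0).
    return sum(sizes.values())
```

Both the original GROUP BY and the proposed DISTINCT ON aim at this one-row-per-dataset step; the patch only changes how it is expressed so that older Postgres versions accept it.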


Displaying genomic sequences in Trackster
by Naharajan Lakshmanaperumal 24 May '13


Dear all,

We have our own Galaxy instance and the idea is to have Trackster enabled so that users can visualize NGS mappings. We were able to configure Trackster in our instance and the visualization works fine. We have two questions regarding Trackster:

1) We can't display genomic sequences in Trackster. As per the tutorial, we set the location of the .2bit file in the twobit.loc file so that Trackster can display the genomic sequence, but for some reason it doesn't display it. The build name is the same in all places, i.e. in ucsc/chrom/builds.txt and also in the .loc files. Any ideas on what else should be done?

2) While saving the visualization, there is always an error message saying "could not save visualization", and it doesn't seem to be a web browser issue. How do we then save the visualization?

Thanks in advance,

Naharajan
