Compressed data files in Galaxy? (e.g. GZIP or BGZF)
by Peter Cock
Hello all,
What is the current status in Galaxy for supporting compressed files?
We've talked about this before; for example, in addition to FASTQ,
many of us have expressed a wish to work with gzipped FASTQ.
I understand that some have customized their local Galaxy
installations to use gzipped FASTQ as a specific data type - I'm
more interested in a general file format neutral solution.
Also, I'd like to be able to use BGZF (not just GZIP) because it
is better for random access - see for example
http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
- and makes it much easier to break up large datafiles for sharing
over a cluster (i.e. it could be exploited in the current Galaxy code
for splitting large sequence files).
The 11 May 2012 Galaxy Development News Brief
http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-May/009757.html
mentions tabix indexing - that uses bgzip, so is there something
general in place yet to allow tool wrappers to say they accept not just
given file formats, but different compressed versions of file formats?
Ideally I'd like to be able to write an XML tool description saying
a tool produces BGZF compressed tabular data, or GZIP
compressed Sanger FASTQ etc. Similarly, I'd like to specify my
tool accepts FASTA or gzipped FASTA (including BGZF FASTA).
For older tools that say they accept only uncompressed
FASTA, Galaxy could automatically decompress any compressed
FASTA entries in my history on demand.
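
To make the "format neutral" idea concrete, here is a minimal sketch of transparent decompression. It is not Galaxy code: the function name open_maybe_gzipped is hypothetical, and it relies only on the fact that a BGZF file is a series of concatenated gzip members, which Python's standard gzip module reads as one stream.

```python
import gzip

# Every gzip member (and therefore every BGZF block) starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzipped(path):
    """Return a text-mode handle for a plain or gzip/BGZF-compressed file."""
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        # gzip.open reads each member in turn, so BGZF blocks come out as one stream.
        return gzip.open(path, "rt")
    return open(path, "r")
```

A tool wrapper could call this on any FASTA or FASTQ input without caring how it is stored. It does not use the BGZF block index, so it gives sequential access only, not random access.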
Peter
8 years, 7 months
Re: [galaxy-dev] [galaxy-user] Inquiring
by Nate Coraor
Hi Yan,
I've moved this discussion to the galaxy-dev list since it pertains to a
local installation of Galaxy.
Responses to your questions follow, in-line.
Yan Luo wrote:
> Dear Sir,
>
> (1) We installed Galaxy, but recently users can't register and get the
> following error. How can we fix it?
>
> Server error
> An error occurred. See the error logs for more information.(To turn debug on
> to display ...).
Since debug = False in universe_wsgi.ini, you should be able to find a
more detailed error message in the log file. If starting Galaxy with:
% sh run.sh --daemon
the default log file is 'paster.log' in Galaxy's root directory.
> (2) Could you please let me know if there is any command to stop galaxy?
If starting with the --daemon flag (as above), you can use:
% sh run.sh --stop-daemon
If running in the foreground, you can use Ctrl-C to terminate the
process. There is a recent bug whereby Ctrl-C is ineffective on some
platforms under Python 2.6 - in this case you will have to kill/pkill
the process manually. We are working on a fix for the latter.
> (3) If I reset the universe_wsgi.ini file and want to set an administrator
> user (I can add a line in the above file), how can I get the password? Should
> I stop Galaxy (see question 2) first, then run "./setup.sh" and "./run.sh"?
setup.sh was only necessary before running Galaxy for the
first time, and this step has recently been removed. If you are
referencing documentation that still refers to setup.sh, please let us
know so we can update it - I did notice this was still on the
"Production Server" page, so I removed it from there.
You no longer need to run setup.sh at all.
> (4) If I run "setup.sh", will a new file "universe_wsgi.ini" be generated?
> If I want to change this file, should I edit it after "setup.sh" and before
> "run.sh"? Is that right?
setup.sh and its replacements in run.sh and the Galaxy application
itself never overwrite files; they only create files from sample files
when the target files do not already exist.
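
As a rough illustration of that create-if-missing behaviour (a sketch only, not the actual Galaxy code; the function name and the ".sample" suffix convention used as a default are assumptions):

```python
import os
import shutil


def create_from_sample(target, sample=None):
    """Copy sample to target only when target is missing; return True if copied."""
    # Assumed convention: the sample sits next to the target with a ".sample" suffix.
    sample = sample or target + ".sample"
    if os.path.exists(target):
        # Never overwrite an existing, possibly hand-edited, file.
        return False
    shutil.copy(sample, target)
    return True
```

So edits made to an existing universe_wsgi.ini survive a restart, because the copy step is skipped once the file is there.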
> (5) Your docs use both "sh setup.sh" (sh run.sh) and
> "./setup.sh" (./run.sh); which one is correct under Linux?
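Both forms are effectively the same in most cases: "./run.sh" needs the
script to be executable and uses the interpreter named on its first line,
while "sh run.sh" always runs it with sh.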
--nate
>
> Looking forward to hearing from you.
>
> Best Wishes,
>
> Yan Luo, Ph.D.
> NIH
8 years, 8 months
Deploying LOC files for tool built-in data during a tool installation
by Jean-Frédéric Berthelot
Hi list,
The tool I am currently wrapping has built-in data, which may be used by the tool users (through a relevant <from_data_table> + .loc file configuration).
They are .fasta databases which are rather small and are thus bundled in the tool distribution package.
Thanks to the tool_dependencies.xml file, said distribution package is downloaded at install time, the code is compiled, and since the data files are there, they are copied to $INSTALL_DIR too, ready to be used.
After that, the user still has to edit tool-data/my_fancy_data_files.loc; but the thing is, during the install I know where these data files are (since I copied them there), so I would like to save the user the trouble and set up this file automagically.
I have two questions:
1/ Is it okay to have tool built-in data files in $INSTALL_DIR, or would it be considered bad practice?
2/ Is there a way to set up tool-data/my_fancy_data_files.loc during the install? Here are the options I thought of:
* shipping a “real” my_fancy_data_files.loc.sample with the right paths already set up, which is going to be copied as the .loc file (a rather ugly hack)
* using more <action type="shell_command"> steps during install to create my_fancy_data_files.loc (but deploying this file is not part of the tool dependency install per se)
* a variant of the previous: shipping my_fancy_data_files.loc as part of the tool distribution package, and copying it through shell_command (same concern as above).
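
To show what generating the file would involve, here is a minimal sketch of a script that a shell_command step could call. It is not an existing Galaxy facility: the function name write_loc_file and the three-column layout (id, display name, path) are assumptions, and the real columns must match the tool's data table definition.

```python
import os


def write_loc_file(data_dir, loc_path):
    """Write one tab-separated line per .fasta file found in data_dir."""
    entries = []
    for name in sorted(os.listdir(data_dir)):
        if name.endswith(".fasta"):
            db_id = os.path.splitext(name)[0]
            full_path = os.path.abspath(os.path.join(data_dir, name))
            # Hypothetical columns: unique id, display name, absolute path.
            entries.append((db_id, db_id, full_path))
    with open(loc_path, "w") as handle:
        for entry in entries:
            handle.write("\t".join(entry) + "\n")
    return len(entries)
```
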
Any thoughts?
Cheers,
--
Jean-Frédéric
Bonsai Bioinformatics group
8 years, 8 months
Reserved variables in param tags
by Kohler Manuel
Hi,
I have a question regarding the param tag. I would like to pass the
user's email to an external Python script. I tried to use it like this:
<param name="email" type="hidden" value=$__user_email__ />
<param name="experiment" type="select" label="Experiment" help="select Experiment"
refresh_on_change="true" dynamic_options="getExperiments(email)"/>
This does not work. Ideally I would like to have something like this:
dynamic_options="getExperiments($__user_email__)"
Has someone done this before?
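
For context, this is roughly the shape of the Python side I have in mind. It is a sketch under two assumptions: that a dynamic_options function should return a list of (label, value, selected) tuples, and that the experiments come from some lookup keyed by email (the dictionary below is a hypothetical stand-in). How to get the email into the call is the open question.

```python
# Hypothetical stand-in for the external service that lists experiments per user.
EXPERIMENTS_BY_USER = {
    "alice@example.org": ["exp_001", "exp_002"],
}


def getExperiments(email):
    """Return select options as (label, value, selected) tuples for one user."""
    names = EXPERIMENTS_BY_USER.get(email, [])
    # Preselect the first experiment; unknown users get an empty list.
    return [(name, name, index == 0) for index, name in enumerate(names)]
```
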
Cheers
Manuel
--
Manuel Kohler
Center for Information Sciences and Databases (C-ISD)
Department of Biosystems Science & Engineering (D-BSSE)
ETH Zurich, Maulbeerstrasse (1078, 1.02), CH-4058 Basel, +41 61 387 3132
8 years, 9 months
samtools BAM to SAM tmp directory
by Matt Shirley
Is there a reason that the samtools BAM to SAM tool does not respect the
new_file_path set in the config file? The tmp directory handling by
different tool wrappers seems to be an issue right now on systems with
small system tmp directories.
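
For illustration, this is the kind of handling I would expect from a wrapper. It is a sketch, not the current wrapper code, and it assumes the configured new_file_path is passed to the script somehow (the function and parameter names are hypothetical).

```python
import os
import tempfile


def make_tmp_file(new_file_path=None, suffix=".sam"):
    """Create a temporary file under new_file_path, else the system default."""
    # dir=None makes tempfile fall back to TMPDIR or the system tmp directory.
    fd, path = tempfile.mkstemp(suffix=suffix, dir=new_file_path)
    os.close(fd)
    return path
```
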
--
Matt Shirley
Ph.D Candidate - BCMB
Pevsner Lab <http://pevsnerlab.kennedykrieger.org/>
Johns Hopkins Medicine
9 years
Error running tophat2 in Galaxy
by Sachit Adhikari
I am getting this error:
Error in tophat:
[2013-02-13 20:46:41] Beginning TopHat run (v2.0.7)
-----------------------------------------------
[2013-02-13 20:46:41] Checking for Bowtie
Bowtie version: 2.0.6.0
[2013-02-13 20:46:41] Checking for Samtools
Samtools version: 0.1.18.0
[2013-02-13 20:46:41] Checking for Bowtie index files
[2013-02-13 20:46:41] Checking for reference FASTA file
Warning: Could not find FASTA file
/data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome.fa
[2013-02-13 20:46:41] Reconstituting reference FASTA file from Bowtie index
Executing: /usr/bin/bowtie2-inspect
/data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome >
./tophat_out/tmp/genome.fa
[2013-02-13 20:48:51] Generating SAM header for
/data/rathi/Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome
format: fastq
quality scale: phred33 (default)
[2013-02-13 20:49:23] Preparing reads
left reads: min. length=34, max. length=34, 2 kept reads (0 discarded)
Warning: you have only one segment per read.
If the read length is greater than or equal to 45bp,
we strongly recommend that you decrease --segment-length to about
half the read length because TopHat will work better with multiple
segments
[2013-02-13 20:49:23] Mapping left_kept_reads to genome genome with Bowtie2
[2013-02-13 20:49:56] Searching for junctions via segment mapping
Coverage-search algorithm is turned on, making this step very slow
Please try running TopHat again with the option
(--no-coverage-search) if this step takes too much time or memory.
Warning: junction database is empty!
[2013-02-13 20:51:18] Reporting output tracks
[FAILED]
Error running /usr/local/bin/tophat_reports --min-anchor 8
--splice-mismatches 0 --min-report-intron 50 --max-report-intron
500000 --min-isoform-fraction 0.15 --output-dir ./tophat_out/
--max-multihits 20 --max-seg-multihits 40 --segment-length 25
--segment-mismatches 2 --min-closure-exon 100 --min-closure-intron 50
--max-closure-intron 5000 --min-coverage-intron 50
--max-coverage-intron 20000 --min-segment-intron 50
--max-segment-intron 500000 --read-mismatches 2 --read-gap-length 2
--read-edit-dist 2 --read-realign-edit-dist 3 --max-insertion-length 3
--max-deletion-length 3 -z gzip -p4 --no-closure-search
--no-microexon-search --sam-header
./tophat_out/tmp/genome_genome.bwt.samheader.sam
--report-discordant-pair-alignments --report-mixed-alignments
--samtools=/bin/samtools --bowtie2-max-penalty 6 --bowtie2-min-penalty
2 --bowtie2-penalty-for-N 1 --bowtie2-read-gap-open 5
--bowtie2-read-gap-cont 3 --bowtie2-ref-gap-open 5
--bowtie2-ref-gap-cont 3 ./tophat_out/tmp/genome.fa
./tophat_out/junctions.bed ./tophat_out/insertions.bed
./tophat_out/deletions.bed ./tophat_out/fusions.out
./tophat_out/tmp/accepted_hits ./tophat_out/tmp/left_kept_reads.bam
Loading ...done
What's wrong?
9 years
Galaxy on Cluster - how to set -a flag with username
by greg
In our local Galaxy install we want the cluster jobs to be run as
the galaxy user, but we want to include -a [account name] so that our grid
software bills properly.
Here's what I currently have in universe_wsgi.ini:
default_cluster_job_runner = drmaa://-V -pe batch 8/
What I want is something like this:
default_cluster_job_runner = drmaa://-V -pe batch 8 -a [logged in user name]/
Is this possible?
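
In other words, I am after something equivalent to the following sketch. The function runner_url is hypothetical, not a Galaxy API; it only spells out the string I would like Galaxy to build per job.

```python
def runner_url(native_spec, account=None):
    """Build a drmaa:// runner URL, appending '-a <account>' when one is given."""
    if account:
        native_spec = "%s -a %s" % (native_spec, account)
    return "drmaa://%s/" % native_spec
```
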
Thanks,
Greg
9 years
Issue with set_user_disk_usage.py and Postgres 8.x
by Lance Parsons
The recent updates to set_user_disk_usage.py for Postgres users have an
issue with Postgres 8.x. The SQL in the pgcalc method (line 51) leads
to the following error:
sqlalchemy.exc.ProgrammingError: (ProgrammingError) column "d.total_size" must appear in the GROUP BY clause or be used in an aggregate function
LINE 4: FROM ( SELECT d.total_siz...
^
The problem is that versions of Postgres before 9.x were a bit more
restrictive in the use of GROUP BY. This can be fixed using DISTINCT ON
instead. See this StackOverflow post for more info:
http://stackoverflow.com/questions/1769361/postgresql-group-by-different-...
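
The intent of the query, stated in plain Python as a sketch (the function and the row layout are illustrative, not part of the script): each dataset should be counted once, however many history entries point at it.

```python
def disk_usage(rows):
    """Sum total_size once per dataset id, mirroring DISTINCT ON (d.id)."""
    size_by_dataset = {}
    for dataset_id, total_size in rows:
        # A dataset referenced from several histories shows up once per reference.
        size_by_dataset.setdefault(dataset_id, total_size)
    # An empty input sums to 0, like COALESCE(SUM(total_size), 0).
    return sum(size_by_dataset.values())
```
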
I've included a patch below. Let me know if a pull request would be
preferred.
--- a/scripts/set_user_disk_usage.py
+++ b/scripts/set_user_disk_usage.py
@@ -52,7 +52,7 @@
sql = """
UPDATE galaxy_user
SET disk_usage = (SELECT COALESCE(SUM(total_size), 0)
- FROM ( SELECT d.total_size
+ FROM ( SELECT DISTINCT ON (d.id) d.total_size, d.id
FROM history_dataset_association hda
JOIN history h ON h.id = hda.history_id
JOIN dataset d ON hda.dataset_id = d.id
@@ -62,7 +62,7 @@
AND d.purged = false
AND d.id NOT IN
(SELECT dataset_id FROM library_dataset_dataset_association)
- GROUP BY d.id) sizes)
+ ) sizes)
WHERE id = :id
RETURNING disk_usage;
"""
--
Lance Parsons - Scientific Programmer
134 Carl C. Icahn Laboratory
Lewis-Sigler Institute for Integrative Genomics
Princeton University
9 years, 1 month
Displaying genomic sequences in Trackster
by Naharajan Lakshmanaperumal
Dear all,
We have our own Galaxy instance, and the idea is to have Trackster enabled so that users can visualize NGS mappings. We were able to configure Trackster in our instance and the visualization works fine.
We have two questions regarding Trackster:
1) We can't display genomic sequences in Trackster. As per the tutorial, we set the location of the .2bit file in the twobit.loc file so that Trackster can display the genomic sequence, but for some reason it doesn't. The build names are the same in all places, i.e. in ucsc/chrom/builds.txt and in the .loc files. Any ideas on what else should be done?
2) When saving the visualization, there is always an error message saying "could not save visualization", and it doesn't seem to be a web browser issue. How can we save the visualization?
Thanks in advance,
Naharajan
9 years, 1 month