I analyzed data and for several days could see the data loaded as a custom track on
the UCSC browser, but now all of a sudden the track is there but EMPTY.
Can someone please explain what is going on?
Cosmas Giallourakis, MD
Assistant Professor of Medicine
Harvard Medical School
Massachusetts General Hospital
55 Fruit Street
Boston, MA 02114
clinical office (617)-726-2026
November 24, 2010 Galaxy Development News Brief
Here are the highlights of this upgrade:
hg pull -u -r 8729d2e29b02
---- What's New ----
Galaxy's FTP Server New Data Upload Option
* User how-to:
* Configuration instructions for local installs:
* User how-to and config instructions:
NGS Simulation Tool
* Allows the user to simulate multiple Illumina runs, with several
configurable parameters.
o On each run, one position is randomly chosen to be polymorphic, and
sequencing errors are also simulated.
o The primary output is a PNG with two different plots.
o The other output shows summary statistics about the simulation.
* NGS simulation tool location: tools/ngs_simulation/ngs_simulation.xml
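The simulation described above can be sketched in a few lines of Python. This is an illustrative toy, not the actual ngs_simulation tool code; the function name and parameters are hypothetical.

```python
import random

def simulate_run(ref, num_reads, read_len, error_rate, seed=0):
    """Toy sketch of one simulated Illumina run: a single reference
    position is chosen to be polymorphic, and independent per-base
    sequencing errors are added to each read."""
    rng = random.Random(seed)
    bases = "ACGT"
    poly_pos = rng.randrange(len(ref))  # the one polymorphic site
    alt = rng.choice([b for b in bases if b != ref[poly_pos]])
    reads = []
    for _ in range(num_reads):
        start = rng.randrange(len(ref) - read_len + 1)
        read = list(ref[start:start + read_len])
        # apply the variant allele on roughly half of the covering reads
        if start <= poly_pos < start + read_len and rng.random() < 0.5:
            read[poly_pos - start] = alt
        # independent per-base sequencing errors
        for i in range(read_len):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in bases if b != read[i]])
        reads.append("".join(read))
    return poly_pos, reads
```

The real tool additionally summarizes the runs in plots and statistics; this sketch only covers read generation.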
Tophat and Cufflinks RNA-seq Tools
* Addition of RNA-seq analysis tools Tophat and Cufflinks.
o Together, these tools can be used to analyze RNA-seq data to
understand alternative splicing and isoforms, gene and isoform
expression, and perform statistical tests for differential expression.
o Galaxy supports Tophat version 1.1.1 and later and Cufflinks
version 0.9.1 and later. (These are the versions included in this
distribution.)
Import or Export Workflows & Histories
* Workflows can now be downloaded/exported to a file and
uploaded/imported into Galaxy, making it easy to move workflows between
Galaxy instances.
* Beta feature: Histories can also be downloaded or moved from one
Galaxy instance to another, subject to these limitations:
o history archives can be uploaded/imported only via URL, not via file upload
o histories must be shared in order for them to be importable via archive
o tags are not currently imported
o reproducibility is limited as parameters for imported jobs are not
always recovered and set
Even Better Data Visualization with Trackster
* Trackster now supports interactive filtering for VCF quality values
and BED score values.
* For example, a user can drag a slider to filter a file of splice
junctions to view junctions supported by different numbers of reads.
[Image: Trackster splice junction filtering example]
* Improved CIGAR support in BAM display. Properly displays matches,
deletions, skipped bases, and clipping. Padding for insertions is
currently not represented in the display.
* GFF feature blocks are now displayed correctly, along with name,
strand, and score information.
* General enhancements
o Removed right-hand pane, allowing inline re-ordering and configuration
o Moved navigational controls to the top
o Histogram display for LineTracks and overview
o New navigational slider and new overview settings under the
dropdown corresponding to the track name
o Summary view now shows maximum y-axis value
o Can change draw color of LineTrack
o When editing track config, "Enter" and "Esc" keys submit and cancel
the changes, respectively
o Don't index bottom level for summary_tree, greatly reducing
computation time (>5x speedup) while not sacrificing usability
o Refactored to pass JSLint.
o Fix ReferenceTrack issue.
o Don't re-add new datasets when refreshing after using "Add into
current viz" link.
o To prevent browser lockup, only display up to 50 lines of features
by default (user-editable in future). Coming soon: add warning message
when this occurs.
o Fix LineTrack rendering bug when more than one tile on screen.
Native Data Set Reorganization
* Galaxy now uses a set of data tables instead of simple loc files to
organize, document, and store native genome data sets.
* Why Data tables? Better data management for long term stability!
o Allows the information in the loc file, including the path, to be
o By using a unique ID as the parameter value, data links in existing
workflows are preserved.
* Most tools (PerM, Bowtie, BWA, Lastz, Megablast, SRMA, Tophat) that
previously used loc files now use the new data table organization.
* Better data tracking has allowed for more informative genome name
display in tool dropdown boxes.
* For local installations:
o See the new wiki describing how to use data tables:
o More help for NGS tool setup (update pending):
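Loc files such as the all_fasta.loc mentioned below are plain tab-separated tables. As a minimal sketch of how such a file can be read, assuming the common four-column layout (unique build ID, dbkey, display name, file path) — the function name and column interpretation here are illustrative, not Galaxy's internal code:

```python
def read_loc(path):
    """Parse a Galaxy-style .loc file into a list of entries, assuming a
    tab-separated layout of: unique_id, dbkey, display_name, file_path.
    Comment lines (#) and blank lines are skipped."""
    entries = []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            unique_id, dbkey, name, fasta_path = line.split("\t")[:4]
            entries.append({"id": unique_id, "dbkey": dbkey,
                            "name": name, "path": fasta_path})
    return entries
```

Because each entry carries a unique ID, tools can refer to that ID rather than the path, which is what preserves data links in existing workflows when paths change.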
---- Updated & Improved ----
* Complete re-write of the Framework and User Interface (database schema included).
* New interactive interface to select files to transfer from the
sequencer to Galaxy data libraries.
* The data transfer feature now uses the Galaxy RESTful API.
* Full documentation detailing the new functionality and how to use it
will be available within a few weeks through the home Galaxy Wiki.
* New checkouts will now perform all necessary setup directly in run.sh;
there is no longer a need to run setup.sh prior to run.sh. (setup.sh
will be removed in a future distribution.)
* Enable 'FASTX-Toolkit for FASTQ data' as a subsection under 'NGS: QC
and manipulation' in tool_conf.xml.sample/main. Includes special
handling for when the shell only allows for strict Bourne syntax.
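A subsection like the one described above is expressed in tool_conf.xml with a label inside a section. The ids and tool file path in this fragment are illustrative only, not the exact entries shipped in tool_conf.xml.sample:

```xml
<section name="NGS: QC and manipulation" id="ngs_qc_manipulation">
  <label text="FASTX-Toolkit for FASTQ data" id="fastx_toolkit_fastq" />
  <tool file="fastx_toolkit/fastq_quality_filter.xml" />
</section>
```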
* Add descriptive labels to output dataset names for MACS peakcalling tool.
* Taxonomy tools updated for better error reporting. Includes special
handling for when the shell only allows for strict Bourne syntax.
* Refactor sam_bitwise_flag_filter tool, simplifying it and making it
faster when there are multiple flag criteria.
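At its core, filtering on the SAM FLAG field is ordinary bitwise masking. A minimal sketch of the idea (the function name and parameters are hypothetical, not the tool's actual interface):

```python
def passes_flags(flag, required=0, forbidden=0):
    """Keep an alignment only if every 'required' bit is set in its
    SAM FLAG and none of the 'forbidden' bits are set."""
    return (flag & required) == required and (flag & forbidden) == 0
```

For example, required=0x10 keeps reverse-strand reads, while forbidden=0x4 drops unmapped reads; combining several criteria is a single pair of mask comparisons rather than one pass per criterion.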
Tool Dependency Enhancements
* Addition of the 'package' type to <requirement> tags in the tool config.
1 Syntax for tool configs is:
<requirement type='package' version='X.Y.Z'>NAME</requirement>
2 Next, a directory should be created, and the path to that directory
should be set in universe_wsgi.ini as 'tool_dependency_dir'.
3 Galaxy will then source the following file prior to executing the tool:
4 The 'version' attribute of the 'requirement' tag is optional and if
left off, Galaxy will look for the following instead:
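Putting the steps above together, the dependency directory might be laid out as follows. This is a sketch under stated assumptions: the package name "samtools", the version, the paths, and the exact env.sh contents are all hypothetical, and the layout assumes Galaxy sources an env.sh under tool_dependency_dir/NAME/VERSION (with a "default" entry used when the version attribute is omitted):

```python
import os

# Hypothetical example: lay out tool_dependency_dir for a package
# named "samtools" at version 0.1.12; all names/paths illustrative.
dep_dir = os.path.join("tool_deps", "samtools", "0.1.12")
os.makedirs(dep_dir)
with open(os.path.join(dep_dir, "env.sh"), "w") as fh:
    # env.sh is sourced before the tool runs; put the package on PATH
    fh.write('PATH="/opt/samtools-0.1.12/bin:$PATH"; export PATH\n')
# "default" covers <requirement> tags that omit the version attribute
os.symlink("0.1.12", os.path.join("tool_deps", "samtools", "default"))
```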
* UI: new style for dropdown menus.
* Now uses jStore to save folder expansion state.
* Pre-generate and cache variables so that expensive functions like
jQuery.siblings, jQuery.filter and jQuery.find only have to be called a
minimal number of times. Provides a significant speedup when loading
large data libraries.
* Add basic support for Bowtie indexes as a datatype (bowtie_base_index,
bowtie_color_index), available via datatype conversion. Currently, the
indexes need to be converted manually from the FASTA file before use in
Bowtie, but they can be reused.
* A new sample loc file (tool-data/all_fasta.loc.sample) was added which
lists fasta files. A script (scripts/loc_files/create_all_fasta_loc.py)
was created that can be used to generate this loc file for local
installations.
* New gff2bed tool to convert GFF3 files to BED.
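The essential step in any GFF-to-BED conversion is the coordinate shift: GFF is 1-based and end-inclusive, while BED is 0-based and half-open, so only the start coordinate changes. A minimal sketch of that conversion (not the tool's actual code; using the GFF feature type as the BED name is a simplification):

```python
def gff_to_bed_line(gff_line):
    """Convert one GFF feature line to a 6-column BED line.
    GFF coordinates are 1-based inclusive; BED is 0-based half-open,
    so the start shifts down by one and the end is unchanged."""
    f = gff_line.rstrip("\n").split("\t")
    chrom, ftype, start, end, score, strand = f[0], f[2], f[3], f[4], f[5], f[6]
    return "\t".join([chrom, str(int(start) - 1), end, ftype, score, strand])
```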
* Modified Filter and Sort -> Filter tool to operate correctly on files
with a variable number of columns, such as in SAM files.
* New datatype added: VCF (variant call format).
* Add descriptive labels to output dataset names for MACS peakcalling tool.
* Add name/designation to HDA name for new datasets created in
* Shift management of the interaction between workflow outputs and
HideDatasetActions to the front end editor.
* No usability changes, but this resolves the issue with multiple
HideDatasetActions being created.
* Existing workflows displaying multiple HideDatasetActions per step on
the Run Workflow screen will persist. These extra HideDatasetActions are
harmless, but a simple edit workflow -> save will remove them.
* Workflow Inputs change:
o Workflow inputs that aren't a subtype of text were previously not supported.
o Added 'data' datatype to registry, which will allow both text and
binary inputs (and their subtypes) to workflow input steps.
o Note that this will allow a user to change the datatype of
something to 'data'.
User Interface (UI)
* New function for downloading metadata files associated with datasets
(such as bai indices for bam files). See the Save icon drop-down menu.
* Enable display of unicode characters in history and workflow
annotations and when listing and running workflows.
* Dynamically generated popup-style menus. Greatly improves load time,
especially for data libraries with potentially large menus.
* Labels next to checkboxes can now be clicked to check the checkbox.
* Radio buttons in tool forms now also have clickable labels.
* New style for search boxes in grids. Grid items will no longer show
outline when hovered upon if there are no actions to be performed.
* Remove the creation of a background element that closes the active
menu clicked. Instead, bind an event to close active menus to the
document object of current and all other framesets. Tested in IE.
* Make links in split menu buttons "go through" instead of popping up
the menu options.
* Functional Test Framework: new nose plugin that shows a diff between
tests failed this time and last time.
* Documentation update to add more options added to the sample config file.
* Fix for TextToolParameter.get_html_field when provided value is an
empty string but default value specified in tool is non-empty string.
Fixes issue with rerun button where if a user had input an empty string,
the form displayed when rerun would have the default value from the tool
and not the actual previously specified value.
* Fix for Integer/FloatToolParameter.get_html_field() when 'value' is
provided as an integer/float. Fixes an issue seen when saving workflows:
If an integer or float tool parameter is changed to a value of 0 or 0.0
and saved, the form field would be redisplayed using the default tool
value; and not the value that is now saved in the database.
* Fix for setting columns in workflow builder for ColumnListParameter,
e.g. allows splitting lists of columns by newlines and commas and strips
whitespace.
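The splitting behavior described above amounts to tokenizing on both separators and stripping whitespace; a sketch (the function name is illustrative, not the actual ColumnListParameter code):

```python
import re

def split_column_list(text):
    """Split a user-entered column list on newlines and commas,
    stripping whitespace and dropping empty tokens."""
    return [tok.strip() for tok in re.split(r"[,\n]", text) if tok.strip()]
```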
* Fixes for rerun action to recurse grouping options when checking
unvalidated values and cloned HDAs. Better selection of corresponding
HDAs from cloned histories, when multiple copies exist.
* Have rerun action make use of tool.check_and_update_param_values().
Fixes Server Error issue when trying to rerun updated tools.
* Fix for display framework to work with workflows that contain tools
that have been updated. Previously, this would cause a server error when
trying to view a workflow or a page with an embedded workflow that
contained an updated tool.
* Fix bug that was causing Page item selection grids to be initialized
twice and hence causing grid paging to fail.
* Add some space between adjacent embedded items on Pages.
* Fix path to closebox.png image so screencast close button is shown
* Fix the Admin -> Manage Jobs interface when using multiple Galaxy
processes.
* When possible (e.g. Python >= 2.6), don't use tons of memory to handle
* Fix cluster stdout/stderr handling that could cause excessive memory
usage if stdout/stderr were very large.
* Make the PBS runner actually stop jobs when a user deletes output.
This would only work before if the Galaxy user was a PBS "operator" and
only using a single process setup.
* Cause waiting jobs to fail if any of their inputs fail to set metadata
* Fix 'import from current history' for Data Libraries that was showing
metadata files that are not visible. Fix this same issue for 'Copy
history items' feature.
* DRMAA runner now uses get_id_tag() in Wrapper instead of job_id
directly for creation of .sh .o and .e files, as well as some debugging.
* Prevent Rename Dataset Action from allowing a blank input.
hg clone http://www.bx.psu.edu/hg/galaxy galaxy-dist
Galaxy is supported in part by NSF, NHGRI, the Huck Institutes of the
Life Sciences, and The Institute for CyberScience at Penn State.
-- Galaxy Team
On Tue, Nov 23, 2010 at 4:52 PM, Martin, David A. <dmarti(a)lsuhsc.edu> wrote:
> Thank you for the help. So, assume I start up a new instance and specify
> 1 master with 100 GB and 5 slave nodes, several EBS volumes will be
> created. Now when I finish working in galaxy. I should first terminate the
> instance from the cloud console, and then terminate the master and slaves
> from the AWS console, right? Now, several EBS volumes are still up and can
> be deleted, except for one. To identify which volume has the data, I should
> look in my S3 bitbucket at the file persistent.txt, no? Is this file in a
> snapshot that is automatically created when I terminate an instance through
> cloud console?
Once you've finished your work with Galaxy for the time being, yes, click
Terminate cluster on the Galaxy Cloud console. That will stop all of the
services running on the cluster and also terminate all of the worker nodes.
Once you see 'Cluster shut down...' at the bottom of the cluster status log
on the Galaxy Cloud console, from the AWS console, terminate the master
instance. That is the only instance that should still be running at that
point.
Then, you can delete all of the EBS volumes that were created from a
snapshot. All these EBS volumes should be 15GB in size and created from
snapshot 'snap-f3a64f99' (there should be 6 of them based on your example: 1
from the master and 5 from workers). That should be it. You don't really
need to go digging through the persistent_data.txt file in the S3 bucket
because your data volume should be the only one that's still available at
that point, plus you can always pick it out from the rest by looking at its
size (100GB in your example).
> Regarding the 1 TB limit, I am thinking that intermediate files can be
> moved out of the persistent files as they are no longer needed and saved
> somewhere else, so that galaxy is never working with more than 1 TB... I am
> unsure how tricky this will be but I suppose it is possible in principle?
> My conception of the workflow is limited, but I think we need to
> convert(groom) to sanger, map with tophat/bowtie, and then use cufflinks for
> comparing expression. I am trying to practice ahead of time to see what
> kind of files/sizes these steps generate and figure out how this can work on
> the cloud. Thanks again.
I guess that could work but realize that you'll have to ssh to the instance
and clean up the datasets by hand.
> -----Original Message-----
> From: Enis Afgan [mailto:firstname.lastname@example.org <eafgan(a)emory.edu>]
> Sent: Tue 11/23/2010 2:55 PM
> To: Martin, David A.
> Cc: galaxy-user(a)bx.psu.edu
> Subject: Re: [galaxy-user] Galaxy on the Cloud/RNA-Seq
> Your approach for terminating a cluster and starting it back up when it's
> needed should continue to be fine for your purposes. That's the best and
> pretty much the only way to minimize the cost.
> The reason there are 45 EBS volumes created is because each time you start
> an instance, a root EBS volume from snapshot 'snap-f3a64f99' is created to
> serve as the root file system. When you terminate that particular instance,
> that EBS volume is no longer needed and can be deleted (in the next AMI we
> build, we will enable deletion of that volume automatically upon instance
> termination). In other words, feel free to delete all EBS volumes that were
> created from a snapshot; they can be and are recreated when needed. The
> volume that should not be deleted is your data volume. The ID of this volume
> can be found in your cluster's bucket (cm-<HASH>) in your S3 account, in a file
> named persistent_data.txt.
> As a reference, don't attach/detach EBS volumes manually to running Galaxy
> Cloud instances because the application will lose track of them and not be
> able to recover. In addition, always click 'Terminate cluster' on the
> Cloud main UI and wait for it to shut down all of the services; then
> *terminate* the master instance from the AWS console (don't *stop* the instance).
> As far as uploading 200GB of data to a cloud instance and processing it
> there: in principle, it should work. However, there is a 1TB limit on EBS
> volumes imposed by Amazon. As a result, and considering the multiple
> transformation steps your data will have to go through within Galaxy, I am
> concerned that you will reach that 1TB limit. We will be working on
> expanding beyond that limit by composing a filesystem from multiple EBS
> volumes but that's not available yet.
> Hope this helps; let us know if you have any more questions,
> On Tue, Nov 23, 2010 at 3:17 PM, David Martin <dmarti(a)lsuhsc.edu> wrote:
> > Hello,
> > We are about to get about 200 GB of illumina reads (43 bp) from 20 samples,
> > two groups of 10 animals. We are hoping to use Galaxy on the Cloud to
> > compare gene expression between the two groups. First of all, do you think
> > this is possible with the current state of Galaxy Cloud development?
> > Secondly, we are currently practicing with small drosophila datasets (4
> > sets of 2 GB each), and over the course of a few days of doing relatively
> > little besides grooming and filtering the data, we had already been charged
> > $60 by Amazon, which we thought was a bit inefficient. What is the best way
> > to proceed working from one day to the next? Should one terminate the
> > cluster at Cloud Console and then stop (pause) the cluster at the AWS
> > console, and then restart the instance the next day? Does one have to
> > reattach all of the EBS volumes before restarting the cluster? We were just
> > terminating the instance and then bringing it back up and all the data was
> > still there, ie it worked fine, but when we looked after a couple days there
> > were 45 EBS volumes - much of it was surely redundant as our data
> > wasn't very large. Perhaps we need to take a snapshot and reboot the
> > instance from this? Thank you for any hints regarding this matter, this is
> > all very new to me. Let me know if you need clarification or more
> > information.
> > David Martin
> > dmarti(a)lsuhsc.edu
> > _______________________________________________
> > galaxy-user mailing list
> > galaxy-user(a)lists.bx.psu.edu
> > http://lists.bx.psu.edu/listinfo/galaxy-user
I am trying to map my Illumina reads to the mouse genome using specific Bowtie criteria. I have transformed the fastq tags into fastq Sanger as described in the tutorials. Now I am attempting to run these tags against the genome. However, it's been more than one hour now that my job has been queued but hasn't started yet. I am wondering if there is something wrong with it, or if it is usual to wait before being able to start the mapping.
Thanks for your help!
I'm currently trying out Galaxy and I like it a lot thus far. We have our own
instance on a test server.
I have one issue with the time it takes to import large fastq files
(s_x_sequence.txt files from the GAIIx actually). I think it may be our
set up or something. How long is it supposed to take to upload a 5 GB file?
Many thanks in advance
With the continued advancement of sequencing technology, we've seen the
size of files uploaded to Galaxy grow quite large. Although our current
file upload methods have worked fine for years, they are not well suited
for the extremely large files in common use today. Uploading directly
from the browser can be unreliable and browsers don't provide feedback
on upload progress and state like they do for downloads.
Because of this, we have implemented file uploads to Galaxy via FTP.
FTP will allow you to monitor upload progress as mentioned above, as well as
resume interrupted transfers. To get started using FTP, you'll need to have
registered a regular Galaxy account on our public server at:
Once registered, you can initiate an FTP connection in your preferred
FTP client to the same host (main.g2.bx.psu.edu) using your registered
email address and password for the login details.
Files uploaded to this server won't automatically be imported to Galaxy
- rather, you will be presented with a list of the contents of your FTP
directory on the standard "Upload File" tool interface. Files not
imported within 3 days will be cleaned up from the FTP site.
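The upload itself can be scripted with the standard library's ftplib, using the host and login details described above. This is a sketch only: the function name is hypothetical, and the credentials and file name in the example invocation are placeholders for your own registered details.

```python
from ftplib import FTP
import os

def ftp_upload(host, email, password, local_path):
    """Upload one file to a Galaxy FTP server so it appears in the
    'Upload File' tool's FTP-directory listing. (Illustrative sketch;
    login uses your registered Galaxy email address and password.)"""
    ftp = FTP(host)
    ftp.login(email, password)
    with open(local_path, "rb") as fh:
        ftp.storbinary("STOR " + os.path.basename(local_path), fh)
    ftp.quit()

if __name__ == "__main__":
    # Example invocation -- replace with your own credentials and file:
    ftp_upload("main.g2.bx.psu.edu", "you@example.org", "secret", "reads.fastq")
```

Most graphical FTP clients will additionally show transfer progress and support resuming interrupted transfers, as noted above.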
We hope that this service solves a crucial problem in data transfer
speed and reliability. Please report any problems or suggestions with
the service to us at:
Please note that it may not always be practical to use the public Galaxy
servers. If you're routinely working with very large data and having to
wait for it to upload, a local Galaxy server could be a more practical
solution. Instructions on installing your own server can be found at:
In addition, with the recent announcement of Amazon's parallel upload
capability for the Simple Storage Service (S3), we are investigating
incorporating easy access to S3 buckets for Galaxy instances on the
Amazon Elastic Compute Cloud (EC2). But you don't need to wait for the
pretty interface; you can already access the contents of S3 buckets by
pasting links to their contents in the "URL/Text:" field of the "Upload
File" tool. For an example of how to do this, see the "Watch how the
complete analysis can be performed on the Amazon Cloud" screencast at:
Feature Update: Expanded Reference Genomes for NGS Tools
Expanded genome data is now available on Galaxy's main server
The expanded genome data are organized into the following categories:
Full = All chromosomes (or scaffolds/contigs) released as part of the
full build from the data source. Many will also include the latest
Mitochondrial chromosome, if available. Full is the default genome build
type for all reference genomes and what was previously available.
Canonical = All chromosomes released that represent the core chromosomes
(including mitochondrial) for that species and build. Canonical will not
include chromosomes with names that include: chrUn, chrN_random,
chrN_hap_XX (haplotype), and similar.
Coming soon will also be the reference genome variants Male and Female.
Many genomes will default to Male (identical to Canonical). Female will
be the Canonical minus a chrY chromosome.
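The Canonical rule above can be expressed as a simple name filter. A sketch, with the exclusion patterns taken from the list in the text (the exact pattern set is illustrative and may not match Galaxy's internal implementation):

```python
import re

# Names excluded from "Canonical" per the description above:
# unplaced (chrUn*), random placements, and haplotype chromosomes.
NON_CANONICAL = re.compile(r"chrUn|_random|_hap")

def is_canonical(chrom_name):
    """True for core chromosomes (including mitochondrial)."""
    return NON_CANONICAL.search(chrom_name) is None
```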
The NGS Tools with expanded genome data available: Bowtie and BWA.
Please note that previously used/saved settings will default to "Full".
Other NGS Tools are planned to have expanded reference genome data
available in the future as appropriate.
Reference genomes part of the expanded dataset include:
Thanks for using Galaxy!
November 16, 2010
[ copying in the galaxy user mailing list as this is also a problem with
the main site ]
OK. I think I've pinned this down to a problem between Firefox 4 beta
(both on Linux and OSX) and the latest version of Galaxy-dist.
Prior to updating my local install of Galaxy-dist, Firefox 4 beta worked
fine with job tracking, but with the latest update it fails to track
properly. This also occurs with a 'clean' install of Galaxy-dist, so
it's not some trailing config problem with my set-up.
Firefox 3.6.12 works fine.
Does anyone else see similar behaviour?
On 15/11/10 16:26, Chris Cole wrote:
> Just realised I hadn't copied the list into the reply.
> No there are no errors in the browser (Firefox 4 beta7) and this was
> working normally in the same browser prior to the update to galaxy-dist.
> On 12/11/10 22:33, Kanwei Li wrote:
>> Sounds like a browser issue, what browser are you using and are there
>> any errors in the browser?
>> On Fri, Nov 12, 2010 at 11:11 AM, Chris
>> Cole<chris(a)compbio.dundee.ac.uk> wrote:
>>> Another problem in the update I've applied. New jobs don't change
>>> state in
>>> the Galaxy interface, but they are being correctly farmed onto our cluster
>>> via DRMAA. This is the output I get:
>>> galaxy.jobs INFO 2010-11-12 16:02:43,029 job 2187 dispatched
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:43,836 (2187)
>>> file /homes/www-galaxy/galaxy_devel/database/pbs/galaxy_2187.sh
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:43,836 (2187)
>>> command is:
>>> perl /homes/www-galaxy/local_tools_devel/bin/get_RNA_align.pl --rna
>>> hsa-mir-30a --num-reads 1 --expt smRNAHelaC --format text --order pos
>>> galaxy.jobs.runners.drmaa INFO 2010-11-12 16:02:43,857 (2187) queued as 989766
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:44,820 (2187/989766)
>>> change: job is queued and active
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:46,824 (2187/989766)
>>> change: job is running
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:48,121 (2187/989766)
>>> change: process status cannot be determined
>>> galaxy.jobs.runners.drmaa DEBUG 2010-11-12 16:02:50,128 (2187/989766)
>>> change: job finished normally
>>> galaxy.jobs DEBUG 2010-11-12 16:02:50,530 job 2187 ended
>>> If I manually refresh the history panel then the grey history item goes
>>> green. Anyone else seen this behaviour?
(and thank you, Jennifer, for having pointed out the problem).
I guess you're talking about our CARPET tool (not CAPRET) at our Galaxy site.
I think the problem was due to a hardware failure that we experienced,
and I assume that now the issue is resolved, but if you still have problems,
please feel free to contact me as soon as you need.
>On 15/9/10 08:36:23 AM, Jennifer Jackson wrote:
>It does not seem that you are using Galaxy main at http://usegalaxy.org.
>If you are still having problems, it would be best to contact the owner
>of the instance that you are using and/or the author of the tool
>wrapper. (Even if the tool/functionality came from the Tool Shed, the
>tool author would be the primary contact as these are community driven.)
>Hopefully this helps,
>On 11/9/10 8:34 AM, Lei, Haiyan (NIH/NIDDK) [F] wrote:
> > Hi,
> > Does anyone have a problem with the CAPRET Com&Uni function recently? I submit
> > the jobs, but the status is always "running". Is it a website problem? Thanks.
> > ------
> > Haiyan Lei, Ph.D.
> > LMB,NIH/NIDDK
> > Bld 5, Rm b1-04
> > Bethesda, MD 20892
> > 301-594-9864