My PhD work may be relevant here, although its primary focus has been on generating databases comprising the changes from a specific timeframe, and it was not designed specifically for Galaxy. Like the system you are proposing, it can generate a BLAST database for any date that has been added to the system, as well as "diffs" between two dates. It supports FASTA, the UniProt EMBL variant, full files (which do not give compression benefits) and several other formats. The system uses delta compression so that fields that have not been updated do not take up extra space. It uses the Hadoop stack (HBase, HDFS and MapReduce) for parallelism in generating the databases (the BLAST database generation from FASTA files is not parallel).
You can find one of the publications here: <http://bdps.cs.uit.no/papers/hibb13.pdf>
I hope this can be of some use to you.
Earlier in the project analysis I pursued a Git solution, because its features seemed to work with documents, code and files of any kind, which would make it perfect for scientific reproducibility. But its ability to efficiently archive non-document files is quite hit and miss, and when it does not archive them efficiently, the file size limitation becomes a major problem on top of that.
I will try to design the system so that handlers for different types of databases/files can be called into play to retrieve versioned content.
It's just that this fall I'll only have time to provide the handler for fasta file archiving (the key-value database update approach enables fasta versioning and all the spinoff data from that).
The next priority would be a handler for any type of file that needs to be replaced as a whole from version to version (one just needs hard drive space to accommodate this, since caching is pointless).
A git handler for well-behaved document content would also be a possibility.
Typo: I said yesterday "I wasn't going to leave that as just "fasta" datatype since it seems tools like makeblastdb don't allow anything else ..." - but I meant 'I WAS going to leave that as just "fasta"...'
I'm attempting something that should be straightforward, but it's not. I
have a tool that runs a JAR file, which I have bundled with the tool. I
simply want to run the JAR file. And to paraphrase Thomas Edison, I've
tried several thousand things that do not work (at least for me), from
setting the JAVA_JAR_PATH environment variable in the tool_dependencies.xml
file to trying to copy the JAR file into the tool-data/shared/jars
subdirectories (which is the closest thing I've got to working). So, at
long last I'm doing the sensible thing and looking for one simple working
example that I can use as a template. Who can suggest a good toolshed tool
(either main or test) that involves running its own JAR file, and that
I have tried to restart Galaxy after restoring my database from a backup.
Here is the error message I get in the log file. Any idea what is wrong
and how to fix this problem?
galaxy.jobs DEBUG 2014-09-03 08:33:46,367 Loading job configuration from
galaxy.jobs DEBUG 2014-09-03 08:33:46,367 Done loading job configuration
Traceback (most recent call last):
35, in app_factory
app = UniverseApplication( global_conf = global_conf, **kwargs )
File "/export/users/galaxy/galaxy-test/lib/galaxy/app.py", line 102,
self.toolbox = tools.ToolBox( tool_configs, self.config.tool_path,
line 118, in __init__
line 283, in load_integrated_tool_panel_keys
tree = parse_xml( self.integrated_tool_panel_config )
line 132, in parse_xml
tree = ElementTree.parse(fname)
line 859, in parse
line 583, in parse
line 1242, in feed
ExpatError: not well-formed (invalid token): line 117, column 1
Removing PID file /var/run/paster.pid
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.
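The ExpatError in the traceback above reports a line and column, and the call stack shows it is raised while parsing the file behind `self.integrated_tool_panel_config`. A minimal sketch for locating such an error with Python's standard library follows. The function names `locate_xml_error` and `report` are hypothetical helpers, not part of Galaxy, and the path passed to `report` has to be the integrated tool panel file of the affected instance, which the log does not name.

```python
import xml.etree.ElementTree as ET


def locate_xml_error(text):
    """Return (line, column, message) for the first parse error, or None."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position  # line numbers start at 1
        return line, column, str(err)
    return None


def report(path):
    """Print the first parse error in the file at `path` with its source line."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    problem = locate_xml_error(text)
    if problem is None:
        print(f"{path}: well-formed")
        return
    line, _column, message = problem
    print(f"{path}: {message}")
    lines = text.splitlines()
    if 1 <= line <= len(lines):
        # Show the offending line so it can be inspected or repaired by hand.
        print(lines[line - 1])
```

Running `report()` on the file only shows where the parser stops; whether to repair the line or restore the file from another copy is a separate decision.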
We are about to implement a fasta database (file) versioning system as a Galaxy tool. I wanted to get interested people's feedback first before we roll ahead with the prototype implementation. The versioning system aims to:
* Enable reproducible research: To recreate a search result at a certain point in time we need versioning so that search and mapping tools can look at sequence reference databases corresponding to a particular past date. This recall can also explain the difference between what was known in the past vs. currently.
* Reduce hard drive usage. Some databases are too big to keep N copies around, e.g. 5 years of 16S, updated monthly, is, say, 670 MB + 668 MB + 665 MB + .... But occasionally we want to access past archives fairly quickly.
* Integrate database versioning into Galaxy without adding a lot of complexity.
A bonus would be to enable the efficient sharing of version databases between computers/servers.
The solution we think would work centres around a "Versioned Data Retrieval" tool (draft image attached) that would work as follows:
1) User selects from a list of databases provided by "Shared Data > Data Libraries > Versioned Data".
- Each database has a master file that keeps its various versions as a list of time-stamped insert/delete transactions of key (fasta id) value (description & sequence) pairs.
- Each master file is managed outside of Galaxy via a process triggered on regular fasta file imports from data sources like NCBI or other niche sources.
- We're expecting, due to the nature of fasta archived sequence updates, that our master file would only be about 1.1x the latest version in size (uncompressed).
2) User enters date / version id to retrieve (validated)
3) If a cached version of that database exists, it is linked into user's history.
4) Otherwise a new version of it is created, placed in cache, and linked into history.
- The cached version itself then shows up as linked data under a Data Library > Versioned Data subfolder.
5) User can select preconfigured workflow(s) to execute on the selected retrieved fasta file to regenerate any database products they need.
- Workflow output data would also be cached in the same way the fasta data is - by linking the Galaxy Data Library to it.
- Workflow execution will be skipped if end data already exists in cache.
- Simple makeblastdb or bowtie-build commands, or more specific workflows that include dustmasker etc can be implemented.
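The master file idea in step 1 can be sketched as a replay of the transaction list up to the requested date. This is only an illustration under assumptions: the record layout `(timestamp, operation, fasta_id, value)` and the function names `reconstruct_version` and `to_fasta` are hypothetical, since the actual master file format has not been decided.

```python
from datetime import date


def reconstruct_version(transactions, as_of):
    """Replay insert/delete transactions dated on or before `as_of`.

    Each transaction is assumed to be (timestamp, operation, fasta_id, value),
    where value is (description, sequence) for inserts and None for deletes.
    """
    records = {}
    # sorted() is stable, so same-day transactions keep their original order.
    for stamp, operation, fasta_id, value in sorted(transactions, key=lambda t: t[0]):
        if stamp > as_of:
            break
        if operation == "insert":
            records[fasta_id] = value
        elif operation == "delete":
            records.pop(fasta_id, None)
        else:
            raise ValueError(f"unknown operation: {operation!r}")
    return records


def to_fasta(records):
    """Serialize the key-value pairs as fasta text, ordered by identifier."""
    lines = []
    for fasta_id in sorted(records):
        description, sequence = records[fasta_id]
        lines.append(f">{fasta_id} {description}".rstrip())
        lines.append(sequence)
    return "".join(f"{line}\n" for line in lines)


# Made-up example log: two sequences added in January, one replaced in February.
LOG = [
    (date(2014, 1, 1), "insert", "seq1", ("first entry", "ACGT")),
    (date(2014, 1, 1), "insert", "seq2", ("second entry", "GGCC")),
    (date(2014, 2, 1), "delete", "seq2", None),
    (date(2014, 2, 1), "insert", "seq3", ("third entry", "TTAA")),
]

print(to_fasta(reconstruct_version(LOG, date(2014, 1, 15))))
```

Because unchanged sequences appear in the log only once, the log grows with the number of changes rather than with the number of versions, which is where the expected space saving comes from.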
Does this sound attractive?
We're hoping such a vision could handle Fasta databases from 12 MB to e.g. 200 GB (which probably requires running makeblastdb in parallel at that scale).
Preliminary work suggests this project is doable via the Galaxy API without Galaxy customization - does that sound right?!
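As a rough check on the disk space argument above, the 16S example can be worked through numerically. The sketch assumes every monthly copy is about as large as the latest one (670 MB, so a slight overestimate, since older copies are a little smaller) and takes the 1.1x master file estimate at face value; the function names are hypothetical.

```python
def full_snapshot_total_mb(latest_mb, copies):
    """Upper bound on space for keeping `copies` full snapshots."""
    return latest_mb * copies


def master_file_mb(latest_mb, overhead_factor=1.1):
    """Estimated size of one master file holding all versions."""
    return latest_mb * overhead_factor


LATEST_MB = 670   # size of the latest 16S fasta file in the example
COPIES = 5 * 12   # five years of monthly updates

full_total = full_snapshot_total_mb(LATEST_MB, COPIES)
master = master_file_mb(LATEST_MB)
print(f"{COPIES} full copies: about {full_total / 1000:.1f} GB")  # about 40.2 GB
print(f"single master file: about {master:.0f} MB")               # about 737 MB
```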
Feedback really appreciated!
Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada
About the datatype. So you are thinking of a new datatype that applies to files that hold the versioned database contents (in this case structured key-value pairs of fasta identifier and sequence, right?). Then the fasta archive versioning tool would take only files of that datatype for input. That sounds good.
I was just going to have one folder (and its subfolders) in the data library that holds all the versioned databases to choose from. So the versioned database tool would just populate its input list based on that subfolder tree. But ensuring that it lists only files of a certain datatype sounds beneficial.
Output in any case would be a fasta file that other tools are already expecting; I wasn't going to leave that as just "fasta" datatype since it seems tools like makeblastdb don't allow anything else as input from user history.
I'm hoping that a global (admin) API key can be used by the tool so that all users can get versioned data, but maybe that is a pipe dream.
Sure I'd like to see old patches!
From: Björn Grüning [bjoern.gruening(a)gmail.com]
Sent: Saturday, August 23, 2014 12:17 AM
To: Dooley, Damion; galaxy-dev(a)lists.bx.psu.edu
Cc: Hsiao, William
Subject: Re: [galaxy-dev] Concept for a Galaxy Versioned Fasta Data Retrieval Tool
the idea sounds fantastic!
Can we go a step further and use a specific datatype that keeps entire
fasta files versioned and the user can choose which version he wants to
use, in any tool? Please have a look at my talk at GCC2012. Maybe you
are interested in the (old) patches. I would be very interested to
restart this old project.
So I need to refresh on change... I see that if I have a conditional item in my form, this causes a refresh of the page and a (re)evaluation of my dynamic_options methods... so I could misuse this "feature". However, it seems that when I have a <conditional> I must have a <when> entry for every item in my select box. Is there no "when else" option?
From: galaxy-dev-bounces(a)lists.bx.psu.edu On Behalf Of Lukasse, Pieter
Sent: Wednesday, 27 August 2014 22:37
Subject: [galaxy-dev] refresh_on_change : is this a valid attribute? Any other ideas/options??
I'm trying to get a wrapper from someone else working and I found this "refresh_on_change" attribute in his select boxes which are filled using the dynamic_options feature:
<param name="col_type" type="select" label="Select column type" refresh_on_change="true"
<param name="polarity" type="select" label="Select polarity" refresh_on_change="true"
When searching the documentation/wiki I do not find a reference to this, but it would be a nice option to have ;)
Question: is there any way I can force a refresh when the user selects another option from such a select box? As you can see in the example above, this is needed because the next select box has its dynamic options built up by a function that takes the value from the previous select (col_type - highlighted above) as an input parameter. Currently this tool only works by showing each select in its own <page>, which is a deprecated option and prevents the tool from being used in a workflow... :(
Thanks for your help!
Wageningen UR, Plant Research International
Department of Bioinformatics (Bioscience)
Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB,
Wageningen, the Netherlands
I am trying to write a wrapper for a tool that takes a directory containing
SAM/BAM files as an input. I am not sure how to do that. Is there another
tool that implements this and that I can have a look at? Any suggestions
would be greatly appreciated.
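One possible approach, sketched here under assumptions rather than taken from an existing tool, is a small wrapper script that receives the individual dataset paths, symlinks them into a fresh directory, and then passes that directory to the underlying program. The function name `stage_inputs` and the `input_N.bam` naming scheme are hypothetical.

```python
import os
import tempfile


def stage_inputs(dataset_paths, target_dir, extension="bam"):
    """Symlink each dataset into `target_dir` under a name with a usable extension."""
    os.makedirs(target_dir, exist_ok=True)
    staged = []
    for index, source in enumerate(dataset_paths, start=1):
        link = os.path.join(target_dir, f"input_{index}.{extension}")
        # Symlinks avoid copying large alignment files.
        os.symlink(os.path.abspath(source), link)
        staged.append(link)
    return staged


if __name__ == "__main__":
    # Demonstration with an empty placeholder file standing in for a dataset.
    work = tempfile.mkdtemp()
    placeholder = os.path.join(work, "dataset_1.dat")
    open(placeholder, "w").close()
    print(stage_inputs([placeholder], os.path.join(work, "bam_dir")))
```

The wrapper would then call the real tool with the staged directory as its argument; how the dataset paths reach the script depends on how the tool's inputs are declared.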
yes, I was the only user but I guess it is too fragile for this kind of thing. postgres wasn’t queuing age jobs correctly so I guess I should track down why.
> From: Hans-Rudolf Hotz <hrh(a)fmi.ch>
> Subject: Re: [galaxy-dev] error with multi dataset tool run
> Date: September 1, 2014 at 5:54:10 AM PDT
> To: Robert Baertsch <baertsch(a)soe.ucsc.edu>, <galaxy-dev(a)lists.bx.psu.edu>
> Hi Robert
> Are you using the built in SQLite database ?
> On 08/31/2014 01:27 AM, Robert Baertsch wrote:
>> I submitted 13 fastq files to tophat2 using DRMAA and got this error.
>> Is it fatal? BTW: This is a super cool feature.
>> I’m running the following version of galaxy-dist.
>> changeset: 14212:91547729ffde
>> branch: stable
>> tag: tip
>> user: Nate Coraor <nate(a)bx.psu.edu>
>> date: Fri Aug 29 14:00:23 2014 -0400
>> summary: Update tag latest_2014.08.11 for changeset ea12550fbc34
>> There were errors setting up 2 submitted job(s):
>> * *Error executing tool: (OperationalError) database is locked
>> u'UPDATE history_dataset_association SET update_time=?, name=?,
>> blurb=? WHERE history_dataset_association.id = ?' ('2014-08-30
>> 23:14:44.683957', 'Tophat2 on data 7: insertions', 'queued', 137)*
>> * *Error executing tool: (OperationalError) database is locked
>> u'UPDATE dataset SET update_time=?, state=? WHERE dataset.id = ?'
>> ('2014-08-30 23:15:57.204718', 'queued', 446)*
>> Please keep all replies on the list by using "reply all"
>> in your mail client. To manage your subscriptions to this
>> and other Galaxy lists, please use the interface at:
>> To search Galaxy mailing lists use the unified search at:
> From: Sandra Derozier <sandra.derozier(a)jouy.inra.fr>
> Subject: [galaxy-dev] Error with functional tests on cluster
> Date: September 1, 2014 at 6:54:41 AM PDT
> To: galaxy-dev(a)bx.psu.edu
> Hi all,
> I am trying to set up functional tests for different tools on my Galaxy portal.
> When I run the functional tests locally everything works fine. But when I run them on the cluster they fail with this message:
> /bin/sh: module: line 1: syntax error: unexpected end of file
> /bin/sh: error importing function definition for `module'
> The execution on the cluster is ok but the dataset state is set to ERROR.
> The DIFF between the expected result and the obtained result is empty. Indeed, these two files are the same.
> As I do not know what the problem is: do you have an idea?
> Sandra DEROZIER
> Unité Mathèmatique, Informatique et Génome (MIG)
> Plateforme MIGALE
> Bâtiment 233
> Domaine de Vilvert
> 78352 Jouy-en-Josas Cedex
> From: Hans-Rudolf Hotz <hrh(a)fmi.ch>
> Subject: [galaxy-dev] Solved - Re: testing the visualization plugins
> Date: September 1, 2014 at 8:13:30 AM PDT
> To: "<galaxy-dev(a)bx.psu.edu>" <galaxy-dev(a)bx.psu.edu>
> Hi all
> First of all, a big thanks to Carl, who helped me fix this problem.
> So as a summary for all, the problem was caused by a datatype (an extension to tabular) that I had manually added to "datatypes_conf.xml".
> Removing the datatype fixed the problem. I couldn't identify a syntax problem, neither in "datatypes_conf.xml" nor in "~/lib/galaxy/datatypes/registry.py" and "~/lib/galaxy/datatypes/tabular.py". However, renaming it (in all three files) fixed it as well.
> Regards, Hans-Rudolf