Re: [galaxy-dev] pass more information on a dataset merge

4 Dec 2012

John,

Yeah!

I'm glad you took the initiative to do this. It's one of the most requested features from local users here.
I'll happily test this in our environment.

Also - thanks for the github link - I think it's vastly superior to hg for merging pull requests.
I can't figure out a nice way to do a simple pull request in hg without a full-scale repo duplication.

Best wishes!

Brad

On Dec 3, 2012, at 1:26 PM, John Chilton <chil0060@umn.edu> wrote:
...
Hey Alex,
Until I have bullied this stuff into galaxy-central, you should
probably e-mail me directly and not the dev list. That said, thanks for
the heads up; there was definitely a bug. I pushed out this
changeset to the bitbucket repository:
https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes...
I should mention that I have more or less abandoned the bitbucket
repository for this work in favor of github, so that I can rebase as
Galaxy changes and keep clean changesets.
https://github.com/jmchilton/galaxy-central/tree/multifiles
Since I am posting this on the mailing list I might as well post a
little summary of what has been done:
- For each datatype, an implicit multiple file version of that
datatype is created. A new multiple upload tool/ftp directory tool has
been implemented to create these.
- For any simple tool input you can choose a multiple file version of
that input instead, and then all outputs become multiple file
versions of the outputs. This uses the task splitting code to
distribute jobs across files.
- For multiple input tools, you can choose either multiple individual
inputs (no change there) or a single composite version. The tool
wrapper gets a consistent interface for file path, display name,
extension, etc.
- It should work with most existing tools and datatypes without change.
- Everything is enabled with a single option in universe.ini.
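To make the second bullet concrete, here is a rough sketch of the idea of running a single-file tool once per file and carrying the original base name through to the outputs. It is only an illustration: `run_over_multifile` and its arguments are hypothetical names, and a thread pool stands in for Galaxy's task splitting, which is not shown here.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_over_multifile(tool, input_paths, output_ext):
    """Apply a single-file `tool` to every path and key results by output name.

    Hypothetical helper: each output keeps the base name of its input, so
    sample/fraction/lane names survive the step.
    """
    def run_one(path):
        base = os.path.splitext(os.path.basename(path))[0]
        return "%s.%s" % (base, output_ext), tool(path)

    # pool.map preserves input order; dict() pairs each output name with its result
    with ThreadPoolExecutor() as pool:
        return dict(pool.map(run_one, input_paths))
```

A tool wrapper written for one input never needs to know that it was fanned out over several files.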
Upshots:
 - Makes workflows with arbitrary merging (and to a lesser extent
branching) and arbitrary number of input files possible.
 - Original base name is saved throughout analysis (when possible),
so sample/replicate/fraction/lane/etc tracking is easier.
I started working on the metadata piece last night. Once that is done,
I plan to make a little demo video to post to this list to
try to sell the 3 outstanding small pull requests related to this work
and the massive one that would follow those up :).
-John
On Sun, Dec 2, 2012 at 8:52 PM,  <Alex.Khassapov@csiro.au> wrote:
...
Hi John,
My colleague (Neil) has a bit of a problem with the multi file support:
When I try and use the option "Upload Directory of files" I get the error below
Error Traceback:
AttributeError: 'Bunch' object has no attribute 'multifiles'
URL: http://140.253.78.218/library_common/upload_library_dataset
Module weberror.evalexception.middleware:364 in respond
...
...
  app_iter = self.application(environ, detect_start_response)
Module paste.debug.prints:98 in __call__
  environ, self.app)
Module paste.wsgilib:539 in intercept_output
  app_iter = application(environ, replacement_start_response)
Module paste.recursive:80 in __call__
  return self.application(environ, start_response)
Module paste.httpexceptions:632 in __call__
  return self.application(environ, start_response)
Module galaxy.web.framework.base:160 in __call__
  body = method( trans, **kwargs )
Module galaxy.web.controllers.library_common:855 in upload_library_dataset
  **kwd )
Module galaxy.web.controllers.library_common:1055 in upload_dataset
  json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
Module galaxy.tools.actions.upload_common:342 in create_paramfile
  multifiles = uploaded_dataset.multifiles,
AttributeError: 'Bunch' object has no attribute 'multifiles'
Any ideas? Should we check whether the 'multifiles' attribute is set? Or is some other call missing that should set it to None when it is absent?
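The first option could look roughly like the sketch below, which reads the attribute with a default instead of assuming it exists. The `Bunch` class here is a small stand-in written for this example (it assumes a Bunch simply turns keyword arguments into attributes), and `multifiles_of` is a hypothetical helper, not code from the branch.

```python
class Bunch(object):
    """Stand-in for the example: keyword arguments become attributes."""

    def __init__(self, **kwds):
        self.__dict__.update(kwds)


def multifiles_of(uploaded_dataset):
    # getattr with a default returns None when an upload path
    # (such as the library upload) never set the attribute.
    return getattr(uploaded_dataset, "multifiles", None)
```

Whether a default at the point of use or setting the attribute where the Bunch is built is the better fix depends on how many upload paths construct these objects.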
-Alex
-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of John Chilton
Sent: Wednesday, 17 October 2012 3:21 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge
Wow, thanks for the rapid feedback! I have made the changes you have suggested. It seems you must be interested in this idea/implementation. Let me know if you have specific use cases/requirements in mind and/or if you would be interested in write access to the repository.
-John
On Mon, Oct 15, 2012 at 11:51 PM,  <Alex.Khassapov@csiro.au> wrote:
...
Hi John,
I tried your galaxy-central-homogeneous-composite-datatypes implementation; it works great, thank you (and Jorrit).
A couple of fixes:
1. Add multi_upload.xml to tool_conf.xml.
2. lib/galaxy/tools/parameters/grouping.py line 322 (in get_filenames( context )):
       "if ftp_files is not None:"
   Remove "is not None": ftp_files is an empty list [], not None, so the check passes and line 331, "user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, trans.user.email )", throws an exception if ftp_upload_dir isn't set.
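The difference between the two checks, as a small standalone sketch: an empty list is not None, but it is falsy, so a plain truthiness test skips both cases. `ftp_dir_for` and its parameters are hypothetical names for illustration; only the `os.path.join` call mirrors the line quoted above.

```python
import os


def ftp_dir_for(ftp_files, ftp_upload_dir, user_email):
    # "if ftp_files is not None" is true for [], so the join below would
    # run even when ftp_upload_dir is unset (None) and raise.
    # A truthiness test skips both None and the empty list.
    if ftp_files:
        return os.path.join(ftp_upload_dir, user_email)
    return None
```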
Alex
-----Original Message-----
From: galaxy-dev-bounces@lists.bx.psu.edu
[mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of John Chilton
Sent: Tuesday, 16 October 2012 1:07 AM
To: Jorrit Boekel
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] pass more information on a dataset merge
Here is an implementation of the implicit multi-file composite datatypes piece of that idea. I think the implicit parallelism may be harder.
https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare
Jorrit, do you have any objection to me trying to get this included in galaxy-central (this is 95% code I stole from you)? I made the changes against a clean galaxy-central fork and included nothing proteomics specific in anticipation of trying to do that. I have talked with Jim Johnson about the idea and he believes it would be useful for his mothur metagenomics tools, so the idea is valuable outside of proteomics.
Galaxy team, would you be okay with including this, and if so, is there anything you would like to see either at a high level or at the level of the actual implementation?
-John
------------------------------------------------
John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223
Bitbucket: https://bitbucket.org/jmchilton
Github: https://github.com/jmchilton
Web: http://jmchilton.net
On Mon, Oct 8, 2012 at 9:24 AM, John Chilton <chilton@msi.umn.edu> wrote:
...
Jim Johnson and I have been discussing that approach to handling
fractionated proteomics samples as well (composite datatypes, not the
specifics of the interface for parallelizing).
My perspective has been that Galaxy should be augmented with better
native mechanisms for grouping objects in histories, operating over
those groups, building workflows that involve arbitrary numbers of
inputs, etc. Composite datatypes are kind of a kludge; I think they
are more useful for grouping HTML files together when you don't care
about operating on the constituent parts and just want to view the
pages as a report or something. With this proteomic data we are
working with, the individual pieces are really interesting, right? You
want to operate on the individual pieces with the full array of tools
(not just these special tools that have the logic for dealing with the
composite datatypes), you want to visualize the files, etc. Putting
these component pieces in the composite datatype's extra_files path
really limits what you can do with the pieces in Galaxy.
I have a vague idea of something that I think could bridge some of
the gaps between the approaches (though I have no clue on the
feasibility). Looking through your implementation on bitbucket it
looks like you are defining your core datatypes (MS2, CruxSequest) as
subclasses of this composite data type (CompositeMultifile). My
recommendation would be to try to define plain datatypes for these
core datatypes (MS2, CruxSequest) and then have the separate composite
datatype sort of delegate to the plain datatypes.
You could then continue to explicitly declare subclasses of the
composite datatype (maybe MS2Set, CruxSequestSet), but also maybe
augment the tool xml so you can do implicit datatype instances the
way you can with tabular data, for instance (instead of defining
columns you would define the datatype to delegate to).
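A very rough sketch of the delegation idea, to show the shape I have in mind rather than working Galaxy code. Every name here is hypothetical: `MS2` is a toy plain datatype with a naive extension check, `HomogeneousComposite` is an illustrative wrapper, and the "m:" extension prefix is made up for the example.

```python
class MS2(object):
    """Toy plain datatype describing a single file."""

    file_ext = "ms2"

    def sniff(self, path):
        # Naive check for the example: match on the file extension only
        return path.endswith("." + self.file_ext)


class HomogeneousComposite(object):
    """Toy composite datatype: a set of files that share one plain datatype."""

    def __init__(self, delegate):
        self.delegate = delegate
        # Made-up naming scheme for the implicit multi-file extension
        self.file_ext = "m:" + delegate.file_ext

    def sniff(self, paths):
        # The set is valid only if the plain datatype accepts every member
        return bool(paths) and all(self.delegate.sniff(p) for p in paths)
```

An explicit subclass such as MS2Set would then be little more than `HomogeneousComposite(MS2())`, and the implicit version could be built on demand for any plain datatype.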
The next step would be to make the parallelism implicit (i.e., pull it
out of the tool wrapper). Your tool wrappers wouldn't reference the
composite datatypes, they would reference the simple datatypes, but
you could add a little icon next to any input that lets you replace a
single input with a composite input for that type. It would be kind
of like the run workflow page where you can replace an input with
multiple inputs. If a composite input (or inputs) is selected, the
tool would then produce composite outputs.
For the steps that actually combine multiple inputs (I think in your
case this is maybe Percolator; a tool like interprophet or Scaffold
that merges peptide probabilities across runs and groups proteins),
you could have the same sort of implicit replacement, but for
multi-inputs instead of single inputs (assuming the Galaxy powers
that be accept my fixes for multi-input tool parameters:
https://bitbucket.org/galaxy/galaxy-central/pull-request/76/multi-input-data...).
The upshot of all of that would be that even if these composite
datatypes aren't used widely, other people could still use your
proteomics tools (my users are definitely interested in Crux, for
instance) and you could then use other developers' proteomic tools
with your composite datatypes even though they weren't designed with
that use case in mind (I have msconvert, myrimatch, idpicker,
proteinpilot; Ira Cooke has X! Tandem, OMSSA, TPP; and NBIC has an
entire suite of label free quant tools). A third benefit would be
that people working in other -omics fields could make use of the
homogeneous composite datatype implementation without needing to
rewrite their wrappers and datatypes.
There is probably something that I am missing that makes this very
difficult; let me know if you think this is a good idea and what its
feasibility might be. I forked your repo and set off to implement
some of this last week, but ended up with my Galaxy pull requests to
improve batching workflows and multi-input tool parameters instead. I
hope to eventually get around to it.
-John
On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel
<jorrit.boekel@scilifelab.se> wrote:
...
Dear list,
I thought I was working with fairly large datasets, but they have
recently started to include ~2GB files in sets of >50. I have run
this sort of thing before as merged data by using tar to roll the
files up into one set, but when dealing with >100GB tarfiles, Galaxy
on EC2 seems to get very slow, although that's probably because of my
implementation of dataset type detection (untar and read through files).
Since tarring/untarring isn't very clean, I want to switch from
tarring to creating composite files on merge by putting a tool's
results into the dataset.extra_files_path. This doesn't seem to be
supported yet, because do_merge currently passes only the output
dataset.filename to the respective datatype's merge method.
I would like to pass more data to the merge method (let's say the
whole dataset object) to be able to get the composite files directory and 'merge'
the files in there. Good idea, bad idea? If anyone has views on
this, I'd love to hear them.
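To illustrate what I mean, here is a minimal sketch of a merge that receives the whole output dataset object rather than just a filename. It is not the current Galaxy signature: `Dataset` is a stand-in class for this example, its `file_name` attribute and the `merge` and `demo` functions are assumptions, and only `extra_files_path` is taken from the description above.

```python
import os
import shutil
import tempfile


class Dataset(object):
    """Stand-in for an output dataset: a primary file plus a files directory."""

    def __init__(self, file_name, extra_files_path):
        self.file_name = file_name
        self.extra_files_path = extra_files_path


def merge(split_files, output_dataset):
    """'Merge' by copying each part into the dataset's extra_files_path."""
    os.makedirs(output_dataset.extra_files_path, exist_ok=True)
    names = []
    for part in split_files:
        name = os.path.basename(part)
        shutil.copy(part, os.path.join(output_dataset.extra_files_path, name))
        names.append(name)
    # The primary file becomes a plain index of the parts
    with open(output_dataset.file_name, "w") as index:
        index.write("\n".join(names) + "\n")
    return names


def demo():
    """Build two small parts in a temp directory and merge them."""
    root = tempfile.mkdtemp()
    parts = []
    for name in ("part0.txt", "part1.txt"):
        path = os.path.join(root, name)
        with open(path, "w") as handle:
            handle.write(name)
        parts.append(path)
    out = Dataset(os.path.join(root, "dataset_1.dat"),
                  os.path.join(root, "dataset_1_files"))
    merge(parts, out)
    return sorted(os.listdir(out.extra_files_path))
```

With only a filename, the merge method has no way to find the files directory; with the dataset object it can reach both.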
cheers,
jorrit
___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this and other
Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/

--
Brad Langhorst
langhorst@neb.com
