Re: [galaxy-dev] Fwd: pass more information on a dataset merge

1 Nov 2012

Hi John,

Do you think it's possible to create a <test> for your 'm:' format? I couldn't find how to specify multiple input files for the test.

-Alex

-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of John Chilton
Sent: Tuesday, 23 October 2012 7:59 AM
To: Jorrit Boekel
Cc: Khassapov, Alex (CSIRO IM&T, Clayton)
Subject: Re: Fwd: [galaxy-dev] pass more information on a dataset merge

Hello again Jorrit,

Great, I am glad we are largely on the same page here. I don't know when I will get a chance to look at this particular aspect. If you get there first, that will be great; if not, I will get there eventually.

-John

On Mon, Oct 22, 2012 at 2:51 AM, Jorrit Boekel <jorrit.boekel@scilifelab.se> wrote:
...
IIRC, I implemented the task_X suffix (galaxy does so as well but to
the split subdirectories) to ensure jobs that contained multiple split
datasets would be run in sync. Files from two datasets that belong
together then get analysed together in subsequent steps.
It would however be much nicer to retain original file names through a
pipeline or at least the possibility to retrieve them. Since the
split/merge steps now actively look for and match files with identical
'task_x', it may be an option to do:
fraction1.raw -> fraction1.raw_dataset_43.dat_task_0 ->
fraction1.raw_dataset_44.dat_task_0
fraction2.raw -> fraction2.raw_dataset_43.dat_task_1 ->
fraction2.raw_dataset_44.dat_task_1
(Note that python starts counting at 0, while most researchers number
their first fraction 1.)
I wouldn't mind looking more into that as well, since it would be a
big improvement UI-wise.
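
A minimal sketch of how names in the scheme above could be parsed and matched on their task index. The regular expression and both function names are hypothetical, written only to illustrate the idea; this is not code from Galaxy or from the split/merge implementation.

```python
import re
from collections import defaultdict

# Hypothetical pattern for the naming scheme sketched above:
# <original name>_dataset_<id>.dat_task_<n>
TASK_NAME = re.compile(
    r"^(?P<original>.+)_dataset_(?P<dataset_id>\d+)\.dat_task_(?P<task>\d+)$"
)


def parse_task_name(filename):
    """Split a file name into (original name, dataset id, task index)."""
    match = TASK_NAME.match(filename)
    if match is None:
        raise ValueError("not a task-suffixed name: %r" % filename)
    return (
        match.group("original"),
        int(match.group("dataset_id")),
        int(match.group("task")),
    )


def group_by_task(filenames):
    """Map each task index to the files that should be analysed together."""
    groups = defaultdict(list)
    for name in filenames:
        _, _, task = parse_task_name(name)
        groups[task].append(name)
    return dict(groups)
```

With names like these, the original file name stays recoverable at every step, while the task index still decides which files belong together.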
cheers,
jorrit
On 10/19/2012 04:40 PM, John Chilton wrote:
...
Jorrit I meant to cc you on this response to Alex.
---------- Forwarded message ----------
From: John Chilton <chil0060@umn.edu>
Date: Fri, Oct 19, 2012 at 9:40 AM
Subject: Re: [galaxy-dev] pass more information on a dataset merge
To: Alex.Khassapov@csiro.au
Hey Alex,
I think the idea here is that your initially uploaded files would
have different names, but after Jorrit's tool split/merge step they
will all just be named after the dataset id (see screenshot) so you
need the task_X at the end so they don't all just have the same name.
I have not thought a whole lot about the naming thing, in general it
seems like a tough problem and one that Galaxy itself doesn't do a
particularly good job at.
Jorrit have you given any thought to this?
I wonder if it would be feasible to use the initial uploaded name as
a sort of prefix going forward. So if I upload say
fraction1.RAW
fraction2.RAW
fraction3.RAW
and run a conversion step, maybe I could get:
fraction1_dataset567.ms2
fraction2_dataset567.ms2
fraction3_dataset567.ms2
instead of
dataset567.dat_task_0
dataset567.dat_task_1
dataset567.dat_task_2
Jorrit do you mind if I give implementing that a shot? It seems like
it would be a win to me. Am I going to hit some problem I don't
see now (presumably we have to send some data from the split to the
merge and that might be tricky)?
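
The prefix idea above could look roughly like the following. The function name and its parameters are made up for illustration; how the uploaded name would actually travel from the split step to the merge step is exactly the open question and is not addressed here.

```python
import os


def prefixed_output_name(uploaded_name, dataset_id, extension):
    """Build e.g. 'fraction1_dataset567.ms2' from 'fraction1.RAW'.

    Hypothetical helper: the uploaded name is assumed to be available
    at the point where the output file is named.
    """
    # Drop any directory part and the original extension, keep the stem.
    prefix, _ = os.path.splitext(os.path.basename(uploaded_name))
    return "%s_dataset%d.%s" % (prefix, dataset_id, extension)
```

Called with "fraction1.RAW", 567 and "ms2", this gives the first name in the proposed list, so the fraction stays identifiable without a task suffix.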
-John
On Thu, Oct 18, 2012 at 7:00 PM,  <Alex.Khassapov@csiro.au> wrote:
...
Thanks John,
I wonder what's the reason for appending _task_XX to the file names,
why can't we just keep original file names?
Alex
-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of
John Chilton
Sent: Friday, 19 October 2012 6:16 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge
On Tue, Oct 16, 2012 at 11:11 PM,  <Alex.Khassapov@csiro.au> wrote:
...
Hi John,
I am definitely interested in this idea, not only me - we are
currently working on moving a few scientific tools (not related to
genome) into cloud using Galaxy.
Great. My interests in Galaxy are mostly outside of genomics as
well, it is good to have more people utilizing Galaxy in this way
because it will force the platform to become more generic and
address broader use cases.
...
We will try it further and see if we need any changes. For now one
improvement would be nice, make dataset_id.dat contain list of
paths to the location of the uploaded files, so by displaying html
page the user could just click on the link and download the file.
Code that attempted to do this was in there, but didn't work
obviously. I have now fixed it up.
Thanks for beta testing.
-John
...
We are pretty new to Galaxy, so our understanding of Galaxy is
pretty limited.
Thanks again,
Alex
-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of
John Chilton
Sent: Wednesday, 17 October 2012 3:21 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge
Wow, thanks for the rapid feedback! I have made the changes you
have suggested. It seems you must be interested in this
idea/implementation. Let me know if you have specific use
cases/requirements in mind and/or if you would be interested in write access to the repository.
-John
On Mon, Oct 15, 2012 at 11:51 PM,  <Alex.Khassapov@csiro.au> wrote:
...
Hi John,
I tried your galaxy-central-homogeneous-composite-datatypes
implementation, works great thank you (and Jorrit).
A couple of fixes:
1. Add multi_upload.xml to tool_conf.xml.
2. lib/galaxy/tools/parameters/grouping.py line 322 (in
get_filenames( context )) -
         "if ftp_files is not None:"
    Remove "is not None": ftp_files is an empty list [], not None,
so the check passes and line 331 "user_ftp_dir = os.path.join(
trans.app.config.ftp_upload_dir, trans.user.email )" throws an exception if ftp_upload_dir isn't set.
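
The difference between the two checks can be shown in isolation. This is a standalone sketch, not the code in grouping.py: the function name and arguments are hypothetical, and the extra guard on ftp_upload_dir is an addition of this sketch rather than part of the fix described above.

```python
import os


def ftp_dir_for(ftp_files, ftp_upload_dir, user_email):
    """Return the user's FTP directory, or None when there is nothing to look up."""
    # An empty list is "not None", so `is not None` would let [] through;
    # a plain truthiness check skips both None and [].
    if not ftp_files:
        return None
    if ftp_upload_dir is None:
        # Joining onto an unset directory raises an exception, so bail out early.
        return None
    return os.path.join(ftp_upload_dir, user_email)
```

With the truthiness check, an empty ftp_files list never reaches the os.path.join call, which is the case that failed when ftp_upload_dir was not configured.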
Alex
-----Original Message-----
From: galaxy-dev-bounces@lists.bx.psu.edu
[mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of John
Chilton
Sent: Tuesday, 16 October 2012 1:07 AM
To: Jorrit Boekel
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] pass more information on a dataset merge
Here is an implementation of the implicit multi-file composite
datatypes piece of that idea. I think the implicit parallelism may
be harder.
https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare
Jorrit do you have any objection to me trying to get this included
in galaxy-central (this is 95% code I stole from you)? I made the
changes against a clean galaxy-central fork and included nothing
proteomics specific in anticipation of trying to do that. I have
talked with Jim Johnson about the idea and he believes it would be
useful for his mothur metagenomics tools, so the idea is valuable outside of proteomics.
Galaxy team, would you be okay with including this and if so is
there anything you would like to see either at a high level or at
the level of the actual implementation?
-John
------------------------------------------------
John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223
Bitbucket: https://bitbucket.org/jmchilton
Github: https://github.com/jmchilton
Web: http://jmchilton.net
On Mon, Oct 8, 2012 at 9:24 AM, John Chilton <chilton@msi.umn.edu>
wrote:
...
Jim Johnson and I have been discussing that approach to handling
fractionated proteomics samples as well (composite datatypes, not
the specifics of the interface for parallelizing).
My perspective has been that Galaxy should be augmented with
better native mechanisms for grouping objects in histories,
operating over those groups, building workflows that involve
arbitrary numbers of inputs, etc... Composite data types are
kind of a kludge; I think they are more useful for grouping HTML
files together when you don't care about operating on the
constituent parts and just want to view pages as a report or
something. With this proteomic data we are working with, the individual pieces are really interesting, right?
You want to operate on the individual pieces with the full array
of tools (not just these special tools that have the logic for
dealing with the composite datatypes), you want to visualize the
files, etc... Putting these component pieces in the composite
data type extra_files path really limits what you can do with the
pieces in Galaxy.
I have a vague idea of something that I think could bridge some
of the gaps between the approaches (though I have no clue on the
feasibility). Looking through your implementation on bitbucket it
looks like you are defining your core datatypes (MS2,
CruxSequest) as subclasses of this composite data type
(CompositeMultifile). My recommendation would be to try to define
plain datatypes for these core datatypes (MS2, CruxSequest) and
then have the separate composite datatype sort of delegate to the plain datatypes.
You could then continue to explicitly declare subclasses of the
composite datatype (maybe MS2Set, CruxSequestSet), but also maybe
augment the tool xml so you can do implicit data type instances
the way you can with tabular data for instance (instead of
defining columns you would define the datatype to delegate to).
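
A rough sketch of the delegation idea described above, assuming nothing about Galaxy's real datatype classes. The class bodies, the sniff and sniff_all methods, and the extension check are all hypothetical stand-ins; only the names MS2, CompositeMultifile and MS2Set come from the discussion.

```python
class MS2(object):
    """Hypothetical plain datatype for a single MS2 file."""

    file_ext = "ms2"

    def sniff(self, path):
        # Stand-in check only: real format detection would inspect file contents.
        return path.endswith("." + self.file_ext)


class CompositeMultifile(object):
    """Hypothetical homogeneous composite that delegates to a plain datatype."""

    def __init__(self, delegate):
        # The plain datatype that every constituent file must satisfy.
        self.delegate = delegate

    def sniff_all(self, paths):
        # A set is valid only if it is non-empty and every member passes
        # the delegate's own check.
        return bool(paths) and all(self.delegate.sniff(p) for p in paths)


# An "implicit" instance: no dedicated subclass, just a composite bound to MS2.
MS2Set = CompositeMultifile(MS2())
```

The point of the design is that the format logic lives in one place, the plain datatype, so tools written against MS2 keep working and the composite only adds the grouping.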
The next step would be to make the parallelism implicit (i.e., pull
it out of the tool wrapper). Your tool wrappers wouldn't
reference the composite datatypes, they would reference the
simple datatypes, but you could add a little icon next to any
input that let you replace a single input with a composite input
for that type. It would be kind of like the run workflow page
where you can replace an input with multiple inputs. If a
composite input (or inputs) are selected the tool would then produce composite outputs.
For the steps that actually combine multiple inputs, I think in
your case this is Percolator maybe (a tool like interprophet or
Scaffold that merges peptide probabilities across runs and groups
proteins), then you could have the same sort of implicit
replacement but instead of for single inputs it could do that for
multi-inputs (assuming the Galaxy powers that be accept my fixes
for multi-input tool parameters:
https://bitbucket.org/galaxy/galaxy-central/pull-request/76/multi-input-data...).
The upshot of all of that would be that then even if these
composite datatypes aren't used widely, other people could still
use your proteomics tools (my users are definitely interested in
Crux for instance) and you could then use other developers' proteomic
tools with your composite datatypes even though they weren't
designed with that use case in mind (I have msconvert, myrimatch,
idpicker, proteinpilot, Ira Cooke has X! Tandem, OMSSA, TPP, and
NBIC has an entire suite of label free quant tools). A third
benefit would be that people working in other -omicses could make
use of the homogeneous composite datatype implementation without
needing to rewrite their wrappers and datatypes.
There is probably something that I am missing that makes this
very difficult, let me know if you think this is a good idea and
what its feasibility might be. I forked your repo and set off to
try to implement some of this stuff last week and I ended up with
my galaxy pull requests to improve batching workflows and
multi-input tool parameters instead, but I hope to eventually get around to it.
-John
On Mon, Oct 1, 2012 at 8:24 AM, Jorrit Boekel
<jorrit.boekel@scilifelab.se> wrote:
>
> Dear list,
>
> I thought I was working with fairly large datasets, but they
> have recently started to include ~2Gb files in sets of >50. I
> have run these sorts of things before as merged data by using tar
> to roll them up in one set, but when dealing with >100Gb
> tarfiles, Galaxy on EC2 seems to get very slow, although that's
> probably because of my implementation of dataset type detection
> (untar and read through files).
>
> Since tarring/untarring isn't very clean, I want to switch from
> tarring to creating composite files on merge by putting a tool's
> results into the dataset.extra_files_path. This doesn't seem to
> be supported yet, because we currently pass in do_merge the
> output dataset.filename to the respective datatype's merge method.
>
> I would like to pass more data to the merge method (let's say
> the whole dataset object) to be able to get the composite files
> directory and 'merge'
> the files in there. Good idea, bad idea? If anyone has views on
> this, I'd love to hear them.
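
One way to picture the proposal: if the merge step were handed the directory from dataset.extra_files_path instead of only dataset.filename, a merge could amount to collecting the split results there. The function below is a hypothetical sketch with made-up names, not Galaxy's do_merge or any datatype's merge method.

```python
import os
import shutil


def merge_into_extra_files(split_files, extra_files_path):
    """Copy each split result into the composite directory; return the new paths."""
    if not os.path.isdir(extra_files_path):
        os.makedirs(extra_files_path)
    merged = []
    for source in split_files:
        # Files keep their base name, so names must be unique across splits.
        target = os.path.join(extra_files_path, os.path.basename(source))
        shutil.copyfile(source, target)
        merged.append(target)
    return merged
```

Because files are stored under their base name, two split results with the same name would overwrite each other, so some distinguishing part in each name is needed.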
>
> cheers,
> jorrit
>
> ___________________________________________________________
> Please keep all replies on the list by using "reply all"
> in your mail client.  To manage your subscriptions to this and
> other Galaxy lists, please use the interface at:
>
>   http://lists.bx.psu.edu/
--
Scientific programmer
Mass spec analysis support @ BILS
Janne Lehtiö / Lukas Käll labs
SciLifeLab Stockholm
