Composite datatype output for Cuffdiff

Jim Johnson

11 Oct 2012

10:14 p.m.

Checking to see if there is any interest in including a parameter option to select outputs for cuffdiff, potentially including a composite output and a cummeRbund SQLite database.

Issues:
  cuffdiff produces 21 output files, which is a little unwieldy in a Galaxy history.
  cummeRbund generates its database when given a cuffdiff output directory, but manually hooking up 21 outputs to the cummerbund_wrapper is a pain.

I've put demo code in the testtoolshed under the repository name cummerbund:
http://jjohnson@testtoolshed.g2.bx.psu.edu/repos/jjohnson/cummerbund

This includes new datatypes defined in datatypes_conf.xml and implemented in cuffdata.py:
  <datatype extension="cuffdata" type="galaxy.datatypes.cuffdata:CuffDiffData"/>
  <datatype extension="cuffdatadb" type="galaxy.datatypes.cuffdata:CuffDataDB"/>

The cuffdiff wrapper has a multiple-select parameter to choose which output files to put in the history. In addition to the 21 cuffdiff outputs, the wrapper can also generate:
  cuffdata - a composite HTML output with links to the 21 cuffdiff outputs
  cuffdatadb - the cummeRbund SQLite database

I also added utility tools:
  cuffdata_datasets - takes files from the composite cuffdata and copies them as datasets into the history
  cuffdata_cummerbund - generates the cummeRbund cuffdatadb from the composite cuffdata

I updated the cummerbund_wrapper:
  with tryCatch, so that an R error on a plot won't exit the Rscript
  to include a small image of each plot on the HTML page
  added plots for: dispersion, scatter matrix, MDS, and PCA

Thanks,
JJ
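To make the "composite HTML output with links" idea concrete, here is a minimal Python sketch of how such an index page might be built from a cuffdiff output directory, optionally restricted to a selected subset of files. It is not the code from the cummerbund repository: the function name build_index and its selected argument are hypothetical, and the page layout is an assumption.

```python
import html
from pathlib import Path


def build_index(output_dir, selected=None):
    """Return an HTML page that links to the files in output_dir.

    Hypothetical helper, not part of Galaxy or the cummerbund repository.
    selected: optional collection of file names to include; None keeps all.
    """
    # Collect plain files only, in a stable order.
    files = sorted(p.name for p in Path(output_dir).iterdir() if p.is_file())
    if selected is not None:
        wanted = set(selected)
        files = [name for name in files if name in wanted]
    # One list item per file, with names escaped for safe embedding in HTML.
    items = "\n".join(
        f'<li><a href="{html.escape(name, quote=True)}">{html.escape(name)}</a></li>'
        for name in files
    )
    return (
        "<html><body><h1>cuffdiff outputs</h1>\n"
        f"<ul>\n{items}\n</ul>\n"
        "</body></html>"
    )
```

Calling build_index(path) lists every file, while build_index(path, selected=["gene_exp.diff"]) mimics the effect of a multiple-select parameter by keeping only the chosen outputs.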


Carlos Borroto

12 Oct 2012

2:09 p.m.

On Thu, Oct 11, 2012 at 6:14 PM, Jim Johnson <johns198@umn.edu> wrote:

Checking to see if there is any interest in including a parameter option to select outputs for cuffdiff, potentially including a composite output and a cummeRbund sqlite database.

Hi JJ,

I'm highly interested in something like this. I even tried in the past to come up with a solution [1], but I gave up when I was hit hard by the complexity of the current cuffdiff output that you describe. I think your solution sounds great, and I will make sure to test your demo code as soon as possible.

[1] http://testtoolshed.g2.bx.psu.edu/repos/cjav/cummerbund

Thanks,
Carlos

Jeremy Goecks

7:44 p.m.

Hi Jim,

This is nice and is a path forward for the immediate future.

That said, a couple of extensions to Galaxy to better support composite datatypes would enable cummerbund without the additional tools:

(i) extend the composite datatype to include a definition of the individual outputs in the collection;
(ii) extend the history panel to allow usage/selection of (1) the complete composite set of files or (2) individual items in a composite datatype.

Once (i) is done, (ii) should be straightforward using the new history panel code.

Of course, the advantage of these extensions is that they'd address both the cummerbund issues and other challenges, such as using output from the barcode splitter.

J.
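The two extensions Jeremy lists, (i) declaring the individual outputs of a composite datatype and (ii) selecting either the whole set or a single item, can be pictured with a small Python sketch. This is only an illustration under assumptions: the class names CompositeMember and CompositeDefinition and the declare/select methods are hypothetical and are not Galaxy's datatype API.

```python
from dataclasses import dataclass, field


@dataclass
class CompositeMember:
    """One declared output inside a composite datatype (hypothetical)."""
    name: str        # file name inside the composite, e.g. "gene_exp.diff"
    extension: str   # datatype extension assumed for that file, e.g. "tabular"


@dataclass
class CompositeDefinition:
    """A composite datatype that knows its individual outputs (hypothetical)."""
    extension: str
    members: list = field(default_factory=list)

    def declare(self, name, extension):
        # Extension (i): record the definition of an individual output.
        self.members.append(CompositeMember(name, extension))
        return self

    def select(self, name=None):
        # Extension (ii): return the complete set, or just the named item.
        if name is None:
            return list(self.members)
        matches = [m for m in self.members if m.name == name]
        if not matches:
            raise KeyError(name)
        return matches


cuffdata = (
    CompositeDefinition("cuffdata")
    .declare("gene_exp.diff", "tabular")
    .declare("isoforms.fpkm_tracking", "tabular")
)
```

With the members declared up front, a history panel could offer cuffdata.select() for the full set or cuffdata.select("gene_exp.diff") for one item, without a separate extraction tool.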

John Chilton

27 Feb 2013

3:05 p.m.

Hey Jeremy,

I am trying to think about a path forward with this composite multiple file dataset implementation. It seems there is consensus among the Galaxy team that it shouldn't be included, because grouping actual datasets would be superior. In that light, I am revisiting your e-mail of Oct 12, 2012, because, depending on the implementation of what you described, multiple file datasets are a specific case of this concept, with some likely uncontroversial enhancements for the specific case of composite datatypes that are a homogeneous list of files. Does that make any sense?

If I implemented (i) and (ii) in such a way that the multiple file dataset stuff flowed out more organically, is there any chance that it could be included in galaxy-central? If not, and the implicit datatypes and parallelism stuff would remain no-gos, implementing what you described would still benefit the multiple file datasets implementation, so I still might do this. Would a clean implementation of just what you described be accepted?

Any thoughts you or anyone else has on the future direction of composite datatypes in general would be appreciated.

Thanks for your time,
-John

___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:

http://lists.bx.psu.edu/

Jeremy Goecks

7:39 p.m.

New subject: Implementing dataset collections

Hi John,

Thanks for your interest and willingness to take a look at this. I've changed the subject of this thread to what I see as the core issue: implementing dataset collections.

The Galaxy team would prefer to see an implementation of dataset collections that can be used going forward for all sorts of things. This would prevent time and energy being devoted to creating unneeded flexibility to accommodate an unknown implementation of dataset collections.

With that in mind, I've spec'ed out an implementation of dataset collections that uses only 3 additional database tables + model objects:

https://trello.com/c/325AXIEr

(See the first list, where implementation is discussed.)

Please take a look (as well as anyone else who's interested) and, either on the card or in this thread, comment on this approach. This implementation would not replace composite datatypes, but we expect that it would work for JJ's Cummerbund. The key difference between collections and composite datatypes is that collections include Galaxy datasets that can be used individually, while composite datatypes can only be operated on together.

Once agreement is reached on an implementation, we would welcome a pull request for this functionality. Alternatively, I expect that the Galaxy team would implement it in the next couple of months.

Best,
J.
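The distinction drawn above, that a collection holds datasets usable individually while a composite datatype is handled only as a bundle, can be sketched in Python as follows. This is not the design on the Trello card, whose tables are not described in this thread; every class, field, and function name here is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Stand-in for an ordinary history dataset that a tool can consume alone."""
    name: str
    extension: str


@dataclass
class CompositeDataset:
    """One history item; its files are only handed to tools as a bundle."""
    name: str
    extension: str
    file_names: list = field(default_factory=list)


@dataclass
class DatasetCollection:
    """A named group of references to independent datasets."""
    name: str
    elements: list = field(default_factory=list)

    def map(self, tool):
        # Run a single-dataset tool on each element separately.
        return DatasetCollection(
            name=f"{tool.__name__} on {self.name}",
            elements=[tool(d) for d in self.elements],
        )


def gzip_tool(dataset):
    # Hypothetical tool that only knows how to handle one dataset at a time.
    return Dataset(name=dataset.name + ".gz", extension="gz")


reads = DatasetCollection(
    "split reads",
    [Dataset("bc1.fastq", "fastq"), Dataset("bc2.fastq", "fastq")],
)
compressed = reads.map(gzip_tool)
```

Because each element is a full dataset, any existing single-input tool can be applied to it, whereas the files listed in a CompositeDataset would need a tool written for that composite type.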

Alex.Khassapov＠csiro.au

4 Mar 2013

4:08 a.m.

Hi John,

Are you saying that the "composite multiple file dataset" isn't required and won't be implemented?

We are using your implementation of the multifiles dataset ("m:xxx" type) and hope that eventually it will be pushed into the main Galaxy implementation.

We are using Galaxy for CT reconstruction tools, where input and output can consist of a couple thousand files, so other options, i.e. grouping datasets, are not feasible.

-Alex

John Chilton

5:42 a.m.

Hi Alex,

Thanks for the comments. The Galaxy team has made it clear, here and to me privately, that this will NOT be included in the Galaxy main code base. I hope and am confident that they will make grouping datasets work, hopefully even for thousands of files.

I do not believe the two ideas are mutually exclusive, and I will be maintaining a fork of galaxy-central with these additions; I will hopefully set this up this week. I will do my best to respond to support requests, make multiple file datasets and composite types in general as robust as possible, keep up with Galaxy updates, etc. Obviously, however, it is risky to let a code base drift so far from Galaxy main's, and you, I, and others who might want to use them will have to carefully weigh the risks when determining whether multiple file datasets are worth the headache.

Thanks for all your help and input. I am sorry this did not turn out differently; I feel I have really failed here.

-John

Alex.Khassapov＠csiro.au

6:32 a.m.

Yeah John,

This is sad. I don't understand why it is such a problem. If it's already implemented and used in real projects like ours, then it is needed by the community. I don't think we have other options for our requirements; your multiple file datasets implementation was a real saviour for us.

-Alex

Dannon Baker

2:09 p.m.

Alex,

To reiterate what Jeremy has already said on the mailing list, this is definitely something we want, and need, for Galaxy. While this particular implementation has a lot of good parts, creating these collections as first-class composite datasets isn't ideal, and we'd be stuck supporting them going forward, forever.

There's a clear plan for implementing this in Trello (https://trello.com/c/325AXIEr), most of which is straightforward to implement. The 'hard' part is really going to be implementing an ideal UI for dealing with these collections, something which we could do in phases.

What exactly are your concerns with the implementation as set out in the Trello card?

-Dannon

On Mon, Mar 4, 2013 at 1:32 AM, <Alex.Khassapov@csiro.au> wrote:

...

Yeah John,

This is sad, I don't understand why it is such a problem? If it's already implemented and used in real projects like ours - then it is needed for the community. I don't think we have other options for our requirements, your multiple file datasets implementation was a real saviour for us.

-Alex

-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of John Chilton
Sent: Monday, 4 March 2013 4:42 PM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Cc: <galaxy-dev@bx.psu.edu>
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Hi Alex,

Thanks for the comments. The Galaxy team has made it clear here and to me privately that this will NOT be included in the Galaxy main code base. I hope, and am confident, that they will make grouping datasets work, hopefully even for thousands of files.

I do not believe the two ideas are mutually exclusive, and I will be maintaining a fork of galaxy-central with these additions; I will set this up this week, hopefully. I will do my best to respond to support requests, make multiple file datasets and composite types in general as robust as possible, keep up with Galaxy updates, etc. Obviously, however, it is risky to let a code base drift so far from Galaxy main's, and you, I, and others who might want to use them will have to carefully weigh the risks when determining if multiple file datasets are worth the headache.

Thanks for all your help and input. I am sorry this did not turn out differently; I feel I have really failed here.

-John

On Sun, Mar 3, 2013 at 10:08 PM, <Alex.Khassapov@csiro.au> wrote:

...
Hi John,

Are you saying that "composite multiple file dataset" isn't required and won't be implemented?

We are using your implementation of multifiles dataset ("m:xxx" type) and hope that eventually it will be pushed into main Galaxy implementation.

As we are using Galaxy for CT reconstruction tools, where input and output can consist of a couple thousand files, other options are not feasible, i.e. grouping datasets.

-Alex

-----Original Message-----
From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of John Chilton
Sent: Thursday, 28 February 2013 2:06 AM
To: Jeremy Goecks
Cc: Jim Johnson; <galaxy-dev@bx.psu.edu>
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Hey Jeremy,

I am trying to think about a path forward with this composite multiple file dataset implementation. It seems there is consensus among the Galaxy team that it shouldn't be included because grouping actual datasets would be superior. In that light, I am revisiting this e-mail because, depending on the implementation of what you described, multiple file datasets are a specific case of this concept, with some likely uncontroversial enhancements for the specific case of composite datatypes that are a homogeneous list of files. Does that make any sense?

If I implemented (i) and (ii) in such a way that the multiple file dataset stuff flowed out more organically, is there any chance that it could be included in galaxy-central? If not, and the implicit datatypes and parallelism stuff would remain no-gos, implementing what you described would still benefit the multiple file datasets implementation, so I still might do this. Would a clean implementation of just what you described be accepted?

Any thoughts you or anyone has on the future direction of composite datatypes in general would be appreciated.

Thanks for your time, -John

On Fri, Oct 12, 2012 at 2:44 PM, Jeremy Goecks <jeremy.goecks@emory.edu> wrote:

...
Hi Jim,

This is nice and is a path forward for the immediate future.

That said, a couple of extensions to Galaxy to better support composite datatypes would enable cummerbund without the additional tools:

(i) extending the composite datatype to include definitions of the individual outputs in the collection;
(ii) extending the history panel to allow usage/selection of (1) the complete composite set of files or (2) individual items in a composite datatype.
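To make the two proposed extensions concrete, here is a small Python sketch of what they might amount to as a data structure: a composite definition that declares its individual members (i), and a selection call that returns either the whole set or a single item (ii). None of this is Galaxy's API; `CompositeMember`, `CompositeDefinition`, and `select` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CompositeMember:
    name: str        # file name inside the composite
    extension: str   # datatype the member would have as a standalone dataset


@dataclass
class CompositeDefinition:
    extension: str
    members: Dict[str, CompositeMember] = field(default_factory=dict)

    def add_member(self, name: str, extension: str) -> None:
        """Declare one individual output of the composite (extension i)."""
        self.members[name] = CompositeMember(name, extension)

    def select(self, name: Optional[str] = None) -> List[CompositeMember]:
        """Return every member, or only the named one (extension ii).

        Raises KeyError if the named member was never declared.
        """
        if name is None:
            return list(self.members.values())
        return [self.members[name]]
```

With member definitions available, a downstream tool could ask for one file of the composite as an ordinary input without a separate extraction tool.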

Once (i) is done, (ii) should be straightforward using the new history panel code.

Of course, the advantage of these extensions is that they'd address both cummerbund issues as well as other challenges, such as using output from the barcode splitter.

J.

On Oct 11, 2012, at 6:14 PM, Jim Johnson wrote:

Checking to see if there is any interest in including a parameter option to select outputs for cuffdiff, potentially including a composite output and a cummeRbund sqlite database.

Issues:

- cuffdiff produces 21 output files, which is a little unwieldy in a Galaxy history.
- cummeRbund generates its database when given a cuffdiff output directory, but manually hooking up 21 outputs to the cummerbund_wrapper is a pain.

I've put demo code in the testtoolshed under the repository name cummerbund:

http://jjohnson@testtoolshed.g2.bx.psu.edu/repos/jjohnson/cummerbund

This includes new datatypes defined in datatypes_conf.xml and implemented in cuffdata.py:

  <datatype extension="cuffdata" type="galaxy.datatypes.cuffdata:CuffDiffData"/>
  <datatype extension="cuffdatadb" type="galaxy.datatypes.cuffdata:CuffDataDB"/>

The cuffdiff wrapper has a multiple select parameter to choose which output files to put in the history. In addition to the 21 cuffdiff outputs, the wrapper can also generate:

- cuffdata - a composite HTML output with links to the 21 cuffdiff outputs
- cuffdatadb - the cummeRbund SQLite database

I also added utility tools:

- cuffdata_datasets - takes files from the composite cuffdata and copies them as datasets into the history
- cuffdata_cummerbund - generates the cummeRbund cuffdatadb from the composite cuffdata

I updated the cummerbund_wrapper:

- with tryCatch, so that an R error on a plot won't exit the Rscript
- to include a small image of each plot on the HTML page
- added plots for dispersion, scatter matrix, MDS, and PCA

Thanks, JJ


Alex.Khassapov＠csiro.au

5 Mar

12:22 a.m.

Hi Dannon,

I understand that instead of having one dataset with multiple files you are planning to use existing datasets and combine them in a 'collection'. My concerns are:

1. Our data consists of 200-8000 files, can you imagine how many datasets we'll end up with? It will be a mess.
2. All these files in a dataset belong to each other and it doesn't make much sense to keep them separately.
3. For performance reasons, all these files are located in a single directory, which makes it easier to iterate over them.
4. From my point of view, it makes perfect sense to have a concept of a "dataset" with multiple files. You already have a dataset_xxx_files folder anyway, and it's not a big change compared to the new concept of a "collection".
5. We are already using the "m:xxx" type datasets (thanks John) in our project. I guess you don't even have a timeframe for implementing the "collection" concept? I'm sure that for many projects using multi-file datasets is a requirement now, not in 'years' time.
6. "Collection" is also a good idea and I guess they both can exist together, but only in the future, giving current users an opportunity to use Galaxy for their needs. Otherwise we simply have to look at other frameworks which already support multi-file datasets.

-Alex

From: Dannon Baker [mailto:dannon.baker@gmail.com]
Sent: Tuesday, 5 March 2013 1:09 AM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Cc: chilton@msi.umn.edu; galaxy-dev@bx.psu.edu; NeCTAR Cloud Imaging Project Team
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Alex,

To reiterate what Jeremy has already said on the mailing list, this is definitely something we want, and need, for Galaxy. While this particular implementation has a lot of good parts, creating these collections as first-class composite datasets isn't ideal and we'd be stuck supporting them going forward, forever.

There's a clear plan for implementing this in Trello (https://trello.com/c/325AXIEr), most of which is straightforward to implement. The 'hard' part is really going to be implementing an ideal UI for dealing with these collections, something which we could do in phases.

What exactly are your concerns with the implementation as set out in the Trello card?

-Dannon

On Mon, Mar 4, 2013 at 1:32 AM, <Alex.Khassapov@csiro.au> wrote:

Yeah John,

This is sad, I don't understand why it is such a problem? If it's already implemented and used in real projects like ours - then it is needed for the community. I don't think we have other options for our requirements, your multiple file datasets implementation was a real saviour for us.

-Alex

-----Original Message-----
From: jmchilton@gmail.com [mailto:jmchilton@gmail.com] On Behalf Of John Chilton
Sent: Monday, 4 March 2013 4:42 PM
To: Khassapov, Alex (CSIRO IM&T, Clayton)
Cc: <galaxy-dev@bx.psu.edu>
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Hi Alex,

Thanks for the comments. The Galaxy team has made it clear here and to me privately that this will NOT be included in the Galaxy main code base. I hope, and am confident, that they will make grouping datasets work, hopefully even for thousands of files.

I do not believe the two ideas are mutually exclusive, and I will be maintaining a fork of galaxy-central with these additions; I will set this up this week, hopefully. I will do my best to respond to support requests, make multiple file datasets and composite types in general as robust as possible, keep up with Galaxy updates, etc. Obviously, however, it is risky to let a code base drift so far from Galaxy main's, and you, I, and others who might want to use them will have to carefully weigh the risks when determining if multiple file datasets are worth the headache.

Thanks for all your help and input. I am sorry this did not turn out differently; I feel I have really failed here.

-John

On Sun, Mar 3, 2013 at 10:08 PM, <Alex.Khassapov@csiro.au> wrote:

...


James Taylor

6 Mar

8:10 p.m.

...

I understand that instead of having one dataset with multiple files you are planning to use existing datasets and combine them in a ‘collection’. My concerns are:

This needs to be fleshed out much more, but this is not exactly what we are thinking. The main change is to make it possible for a history to contain items other than datasets. Groups of datasets would be one such thing. Multifile datasets would be another. Workflow invocations a third (needed to support extensions to the workflow system we are proposing).
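One way to picture a history that holds items other than datasets is a small set of distinct item types, each handled as its own entity. The Python sketch below is purely illustrative and assumes nothing about Galaxy's actual model: the class names, fields, and the `describe` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Dataset:
    name: str
    extension: str


@dataclass
class DatasetGroup:
    name: str
    datasets: List[Dataset]


@dataclass
class MultiFileDataset:
    name: str
    extension: str
    file_names: List[str]


@dataclass
class WorkflowInvocation:
    workflow_name: str


# A history entry can be any of the four kinds above.
HistoryItem = Union[Dataset, DatasetGroup, MultiFileDataset, WorkflowInvocation]


def describe(item: HistoryItem) -> str:
    """Return a one-line summary, dispatching on the item's own type."""
    if isinstance(item, Dataset):
        return "dataset %s (%s)" % (item.name, item.extension)
    if isinstance(item, DatasetGroup):
        return "group %s of %d datasets" % (item.name, len(item.datasets))
    if isinstance(item, MultiFileDataset):
        return "multi-file dataset %s with %d files" % (
            item.name, len(item.file_names))
    if isinstance(item, WorkflowInvocation):
        return "invocation of workflow %s" % item.workflow_name
    raise TypeError("unknown history item: %r" % (item,))
```

Because each kind is a separate type, code that handles multi-file datasets does not need to inspect a prefix on a dataset's extension to find out what it is dealing with.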

...

1. Our data consists of 200-8000 files, can you imagine how many datasets we’ll end up with? It will be a mess.

Yes, it would, which is why there does need to be the concept of a homogenous dataset collection to support this.

...

5. We are already using the “m:xxx” type datasets (thanks John) in our project, I guess you don’t even have a timeframe for implementing the “collection” concept? I’m sure that for many projects using multi file datasets is a requirement now, not in ‘years’ time.

We recognize the need, but implementing these using the existing datasets with a prefix on the extension, and then special casing all over the place, is not a maintainable solution going forward. They should be implemented as their own entity.

