Appending _task_%d suffix to multi files
Hi guys,
We've been using Galaxy for a year now. We created our own Galaxy fork, where we have been making changes to adapt Galaxy to our requirements. As we need "multiple file dataset" support, we initially used John's fork for that.
Now we are trying to use "The most updated version of the multiple file dataset stuff" (https://bitbucket.org/msiappdev/galaxy-extras/) directly, as we don't want to maintain our own version.
One of the problems we have is that when we upload multiple files, their file names are changed (a _task_%d suffix is added to their names).
On our branch we simply removed the code that does this, but now we wonder whether it is possible to avoid this renaming some other way, i.e. to make it configurable.
Is it really necessary to change the file names?
-Alex
-----Original Message----- From: galaxy-dev-bounces@lists.bx.psu.edu [mailto:galaxy-dev-bounces@lists.bx.psu.edu] On Behalf Of Jorrit Boekel Sent: Thursday, 25 October 2012 8:35 PM To: Peter Cock Cc: galaxy-dev@lists.bx.psu.edu Subject: Re: [galaxy-dev] the multi job splitter
I keep the files matched by keeping a _task_%d suffix on their names. So each task is matched with its correct counterpart with the same number.
cheers,
jorrit
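The renaming Alex describes, and the switch he asks for, might look roughly like the sketch below. This is not the galaxy-extras code: the function name `task_file_name` and the `rename_parts` flag are hypothetical, and whether the real code places the suffix before or after a file extension is not stated in the thread. Only the `_task_%d` pattern comes from the messages above.

```python
def task_file_name(original_name, task_index, rename_parts=True):
    """Return the name under which an uploaded part would be stored.

    Hypothetical helper: with rename_parts=True it appends the
    _task_%d suffix discussed in this thread; with rename_parts=False
    it leaves the uploaded name untouched.
    """
    if not rename_parts:
        return original_name
    return "%s_task_%d" % (original_name, task_index)


uploaded = ["fraction_A.mzML", "sample-7.mzML"]

# Default behaviour: every part gets a numbered suffix.
renamed = [task_file_name(name, i) for i, name in enumerate(uploaded)]

# Requested behaviour: renaming switched off by configuration.
kept = [task_file_name(name, i, rename_parts=False)
        for i, name in enumerate(uploaded)]
```

With a flag like this, the default keeps the numbering that other code may rely on, while sites that need the original names can opt out.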
Hi Alex,
In our lab, files are often fractions of an experiment, but they are named by their creators in whatever way they like. I put that code in to standardize fraction naming, in case a tool needs input from two files that originate from the same fraction (but have been treated in different ways). In those cases, in my fork, Galaxy always picks the files with the same task_%d numbers.
I can't help you very much right now, as I'm away from work until October, but I hope this explains why it's in there.
cheers, jorrit
On 07/31/2013 04:15 AM, Alex.Khassapov@csiro.au wrote:
One of the problems we have - when we upload multiple files - their file names are changed (_task_%d suffix is added to their names).
On our branch we simply removed the code which does it, but now we wonder if it is possible to avoid this renaming somehow? I.e. make it configurable?
Is it really necessary to change the file names?
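Jorrit's description of picking the files "with the same task_%d numbers" can be illustrated with a small sketch. It assumes nothing about his fork beyond the `_task_%d` naming; the helper names `task_number` and `pair_by_task` and the example file names are made up for illustration.

```python
import re

# Matches the numbered suffix wherever it appears in the name.
TASK_SUFFIX = re.compile(r"_task_(\d+)")


def task_number(file_name):
    """Return the integer from a _task_%d suffix, or None if absent."""
    match = TASK_SUFFIX.search(file_name)
    return int(match.group(1)) if match else None


def pair_by_task(files_a, files_b):
    """Pair files from two inputs that carry the same task number."""
    by_number = {task_number(name): name for name in files_b}
    pairs = []
    for name in files_a:
        number = task_number(name)
        if number is not None and number in by_number:
            pairs.append((name, by_number[number]))
    return pairs


treated_one_way = ["run1_task_0", "run1_task_1"]
treated_another_way = ["other_task_1", "other_task_0"]
pairs = pair_by_task(treated_one_way, treated_another_way)
```

The pairing depends only on the number, so the two inputs may arrive in any order and may have been named freely before the suffix was added.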
Hi Jorrit,
Thank you for your explanation. Would you be able to give us an example of what you mean by fractions, and of when the task_%d numbers are used to pick files? We just want to make sure we have a good understanding of the problem that you solved.
Also, I vaguely remember seeing "data parallelism" mentioned somewhere in relation to the m: data sets. Do you currently support, in any way, automatic distribution of the processing of such datasets to parallel environments (e.g. array jobs in SGE or similar)?
Cheers,
- Piotr
From: Jorrit Boekel [mailto:jorrit.boekel@scilifelab.se] Sent: Wednesday, July 31, 2013 8:18 PM To: Khassapov, Alex (CSIRO IM&T, Clayton) Cc: p.j.a.cock@googlemail.com; jmchilton@gmail.com; galaxy-dev@lists.bx.psu.edu; Szul, Piotr (ICT Centre, Marsfield); Burdett, Neil (ICT Centre, Herston - RBWH) Subject: Re: Appending _task_%d suffix to multi files
I put that code in to standardize fraction naming, in case a tool needs input from two files that originate from the same fraction (but have been treated in different ways). In those cases, in my fork, Galaxy always picks the files with the same task_%d numbers.
Hi Piotr,
In our proteomics lab, a protein sample is fractionated (e.g. by pH) into a number of sample fractions before analysis. The fractions are then run through the mass spectrometer one at a time, and each fraction yields a data file. The mass spec data is then matched to peptides by searching a FASTA file of protein sequences, termed the target. Afterwards the matches are statistically scored by machine learning. To do this, the data is also matched against a scrambled FASTA file, termed the decoy. Each fraction is matched to both a target and a decoy file, which yields two match files per fraction. The machine learning tool then picks a target and a decoy match file and assigns statistical significances to the matches. For this to be correct, it needs to pick match files that correspond, i.e. that are derived from the same fraction.
In our lab, we have not yet looked at John Chilton's (I think) work with the m: data sets, and our parallel processing is done inside Galaxy, using its split and merge functions to divide a job into tasks. Each task is sent as a separate job to SGE, I think, but others may know more about this than I do.
I really have to get back to my holiday now,
cheers, jorrit
On 08/01/2013 04:17 AM, Piotr.Szul@csiro.au wrote:
Would you be able to give us an example of what you mean by fractions and when the task_%d are being used to pick files?
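The proteomics case Jorrit describes, one task per fraction with each task receiving that fraction's target and decoy match files, can be sketched as follows. This is an illustration only, not Galaxy's split/merge code: `build_tasks` and `score_fraction` are hypothetical names, the scoring step is a stand-in, and the lists are assumed to be ordered by fraction already.

```python
def build_tasks(target_matches, decoy_matches):
    """Return one task per fraction: (fraction index, target file, decoy file).

    Assumes both lists are ordered by fraction, so position i in each
    list refers to the same fraction.
    """
    if len(target_matches) != len(decoy_matches):
        raise ValueError("every fraction needs one target and one decoy match file")
    return [
        (i, target, decoy)
        for i, (target, decoy) in enumerate(zip(target_matches, decoy_matches))
    ]


def score_fraction(task):
    """Stand-in for the scoring tool; it only reports which files it got."""
    index, target, decoy = task
    return "fraction %d: scored %s against %s" % (index, target, decoy)


targets = ["target_task_0", "target_task_1", "target_task_2"]
decoys = ["decoy_task_0", "decoy_task_1", "decoy_task_2"]

tasks = build_tasks(targets, decoys)   # split the job into per-fraction tasks
merged = "\n".join(score_fraction(task) for task in tasks)  # merge the results
```

In a real deployment each task would be submitted as its own cluster job; here the tasks simply run one after another so the pairing logic stays visible.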
Hi Jorrit,
Thanks a lot for the explanation. Sorry to trouble you on your holidays. Enjoy your time off.
Cheers,
- Piotr
From: Jorrit Boekel [mailto:jorrit.boekel@scilifelab.se] Sent: Thursday, August 01, 2013 7:45 PM To: Szul, Piotr (ICT Centre, Marsfield) Cc: Khassapov, Alex (CSIRO IM&T, Clayton); p.j.a.cock@googlemail.com; jmchilton@gmail.com; galaxy-dev@lists.bx.psu.edu; Burdett, Neil (ICT Centre, Herston - RBWH) Subject: Re: Appending _task_%d suffix to multi files
I really have to get back to my holiday now, cheers, jorrit
Hi Piotr,
Regarding "data parallelism": Galaxy can split a single large file into small parts, process them in parallel, and then merge the outputs into a single file. That's not what we need, as we already have multiple input files. But as I understand it, it is possible to write our own splitters/mergers to fit our requirements.
And yeah, Jorrit - enjoy your holidays!
-Alex
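The split, process in parallel, merge pattern Alex refers to can be shown in a few lines. This is a generic sketch using Python's standard library, not Galaxy's splitter/merger interface; the function names and the upper-casing "tool" are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def split(lines, part_size):
    """Cut a list of lines into consecutive parts of at most part_size lines."""
    return [lines[i:i + part_size] for i in range(0, len(lines), part_size)]


def process(part):
    """Stand-in for the real tool: upper-case every line of one part."""
    return [line.upper() for line in part]


def merge(processed_parts):
    """Concatenate processed parts back into one output, in order."""
    return [line for part in processed_parts for line in part]


lines = ["a", "b", "c", "d", "e"]
parts = split(lines, part_size=2)

with ThreadPoolExecutor(max_workers=2) as pool:
    # map() yields results in input order, so the merge keeps line order.
    processed = list(pool.map(process, parts))

output = merge(processed)
```

When the input already consists of multiple files, as in Alex's case, the split step becomes trivial (one part per file) and only the process and merge steps need custom logic.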
I have added a feature request issue on galaxy-extras to support this.
Earlier this year, Hagai Cohen was exploring the existing Galaxy tools, and it seems that some of them expect dataset files to end with .dat. So I reworked the multiple file dataset stuff so that each of the individual parts ends that way. It is probably the best changeset to look at if you are considering changing the naming behavior associated with multiple file datasets.
https://bitbucket.org/msiappdev/galaxy-extras/commits/1a93fba38e74f9a307b62c...
-John
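One way to satisfy both constraints mentioned in this thread, a numbered part name and a file that still ends with .dat, is sketched below. This is purely illustrative and does not reproduce John's changeset: the function `part_name_with_dat` and the choice to replace the original extension are assumptions.

```python
import os


def part_name_with_dat(original_name, task_index):
    """Hypothetical naming scheme: keep the task number, end with .dat.

    The original extension is dropped so that tools which expect a
    .dat file still accept the part.
    """
    base, _extension = os.path.splitext(original_name)
    return "%s_task_%d.dat" % (base, task_index)


names = [part_name_with_dat(name, i)
         for i, name in enumerate(["fraction_A.mzML", "fraction_B.mzML"])]
```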
participants (4)
- Alex.Khassapov@csiro.au
- John Chilton
- Jorrit Boekel
- Piotr.Szul@csiro.au