Thank you again, Greg.

Thank you for the link too (although the image in the link is no longer accessible). In your example, 4366992 is an auto-generated number appended when a user uploads a data file into the Galaxy instance, right? Say the user uploads dataset.dat at time t1 and another dataset.dat at time t2: the new file does not overwrite the old one, because the newly generated number differs from the previous one. Is that correct?

Wanmei



From: Greg Von Kuster <greg@bx.psu.edu>
To: Wanmei <wanmei_06@yahoo.com>
Cc: "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>
Sent: Monday, May 28, 2012 9:26 AM
Subject: Re: [galaxy-dev] Data file and Analysis Program versioning


On May 28, 2012, at 8:34 AM, Wanmei wrote:

Thank you Greg.

> This job keeps information about the tool that was used, including the version.  The result of running the job is the analysis, consisting of one or more additional datasets.
[Wanmei] Does the job also keep information about which input dataset was used, besides the tool and version?

Yes.  The Galaxy reports component (discussed in the Galaxy news brief at http://wiki.g2.bx.psu.edu/DevNewsBriefs/2010%2006_08 ) is a good place to look for details about Galaxy jobs.  The reports have not yet been enhanced to display tool version information or tool version relationships, but they will soon include this information.  Here is some of the job information shown in the current reports.  You'll see information about input datasets and resulting datasets in the command line.

Job Information

State: ok
Job Id: 3865189
Create Time: 2012-05-28 00:00:56.419746
Time To Finish: 0:00:34
Session Id: 5531371
Tool: Filter1
User: xxxxxx
Runner: pbs://torque.g2.bx.psu.edu/
Runner Id: 2305392.thumper.g2.bx.psu.edu
Remote Host: xxx.xxx.xxx.xxx
Command Line:
python /galaxy/home/g2main/galaxy_main/tools/stats/filtering.py /galaxy/main_pool/pool1/files/004/366/dataset_4366992.dat /galaxy/main_pool/pool5/tmp/job_working_directory/003/865/3865189/galaxy_dataset_4366996.dat "c3!=__sq__No results__sq__" 30 "str,str,str,str,int,float,str,float,str,str,int,float,str,str,int,str,str,str,str,str,str,int,str,str,str,list,str,list,str,str"
Stdout:
Filtering with c3!='No results', 
kept 46.58% of 1241 valid lines (1241 total lines).
Stderr:
Stack Trace: None
Info: None
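
The dataset and job working directory paths in the command line above appear to follow a simple pattern: the numeric id is zero-padded to a multiple of three digits, its last three digits are dropped, and the remainder is split into three-digit directory names. Here is a minimal Python sketch of that pattern. It is inferred only from the two paths shown in this report; the names `hashed_dir` and `dataset_path` are hypothetical, and the handling of ids below 1000 is an assumption rather than something visible in the report.

```python
def hashed_dir(numeric_id):
    """Return the nested directory for an id, e.g. 4366992 -> '004/366'.

    Inferred from the paths in the job report; an illustrative
    assumption, not a copy of Galaxy's implementation.
    """
    digits = str(numeric_id)
    if len(digits) < 4:
        # Assumption: small ids all share a single '000' directory.
        return "000"
    # Pad with leading zeros up to the next multiple of three digits.
    padded = digits.zfill(-(-len(digits) // 3) * 3)
    # Drop the last three digits, then split the rest into 3-digit chunks.
    prefix = padded[:-3]
    return "/".join(prefix[i:i + 3] for i in range(0, len(prefix), 3))


def dataset_path(files_root, dataset_id):
    # The file name embeds the dataset id, so distinct ids give distinct files.
    return f"{files_root}/{hashed_dir(dataset_id)}/dataset_{dataset_id}.dat"
```

With the values from the report, `dataset_path("/galaxy/main_pool/pool1/files", 4366992)` reproduces the input path in the command line, and `hashed_dir(3865189)` gives the `003/865` part of the job working directory.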


> With each new analysis, new datasets are produced.  In no case are previous datasets overwritten.  With the new analysis in your example, the job again has information about the tool / version combination that produced the dataset.  So, like I described above, the job can be rerun at some later point.  The resulting dataset is not versioned in the way you describe, but information is kept about the analysis process that produced the resulting dataset.
[Wanmei] I think you mean this for the example we discussed: Galaxy will keep two separate jobs: Job#1 is the previous analysis with the corresponding tool/version/output dataset; Job#2 is the new analysis with the corresponding tool/version/output dataset. Is my understanding correct?

Yes!




Thanks,
Wanmei


From: Greg Von Kuster <greg@bx.psu.edu>
To: Wanmei <wanmei_06@yahoo.com>
Cc: "galaxy-dev@lists.bx.psu.edu" <galaxy-dev@lists.bx.psu.edu>
Sent: Monday, May 28, 2012 7:13 AM
Subject: Re: [galaxy-dev] Data file and Analysis Program versioning

Hello Wanmei,

On May 27, 2012, at 9:43 PM, Wanmei wrote:

Hi All,

I am pretty new to Galaxy. I would like to understand Galaxy's versioning capability from an end-user perspective (I do not mean the versioning capability that Mercurial offers in the Galaxy repo).

I did some research and found that the following link mentions a use case: if an end user wants to rerun an analysis that was previously run with a different version of the analysis program, Galaxy asks the end user whether he/she would like to proceed with the new analysis. From the screenshot in the link, it looks like Galaxy keeps track of the metadata of an output data file, such as which analysis program and which version of the code produced it. Is my understanding correct?
http://wiki.g2.bx.psu.edu/Tool%20Shed#Galaxy_Tool_Versions

You are correct.  In Galaxy, the process of providing an input dataset to an analysis tool creates a Galaxy job.  This job keeps information about the tool that was used, including the version.  The result of running the job is the analysis, consisting of one or more additional datasets.  At some later point when a Galaxy user attempts to rerun the job, the original job information is inspected to determine the tool / version combination that was used in the job.  Then the current Galaxy tool box is inspected to see if that tool / version combination is available in the tool box or if a derivative tool / version combination is available, allowing the user to rerun the tool with either the original or the derivative.
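
The rerun lookup described above can be sketched roughly as follows. This is only an illustration of the idea in Python: the classes `ToolVersion` and `ToolBox`, the `derived_from` mapping, and the version numbers are all hypothetical and do not reflect Galaxy's actual data model or API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ToolVersion:
    tool_id: str
    version: str


@dataclass
class ToolBox:
    # Tool / version combinations currently available in the tool box.
    installed: set = field(default_factory=set)
    # Maps an older tool / version to the newer one derived from it.
    derived_from: dict = field(default_factory=dict)

    def resolve_for_rerun(self, recorded):
        """Return (exact, derivative) rerun options for a recorded job."""
        exact = recorded if recorded in self.installed else None
        derivative = None
        current, seen = recorded, {recorded}
        # Walk the lineage chain and keep the newest installed derivative.
        while current in self.derived_from:
            current = self.derived_from[current]
            if current in seen:
                break  # guard against a cyclic lineage
            seen.add(current)
            if current in self.installed:
                derivative = current
        return exact, derivative
```

For example, if a job recorded `ToolVersion("Filter1", "1.0.0")` and only a derived `"1.1.0"` is installed, `resolve_for_rerun` returns `(None, ToolVersion("Filter1", "1.1.0"))`, which corresponds to the case where the user is asked whether to proceed with the newer version.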


If the answer to my question above is yes, then I have one more question. Does Galaxy version the output data as well? What I mean is, for example, if the end user agrees to use a newer version of the code for the rerun (answers Yes to Galaxy's prompt), will the newly generated output be marked as version #2, as opposed to the original output (version #1)? Or will it simply overwrite the previous analysis output file?

With each new analysis, new datasets are produced.  In no case are previous datasets overwritten.  With the new analysis in your example, the job again has information about the tool / version combination that produced the dataset.  So, like I described above, the job can be rerun at some later point.  The resulting dataset is not versioned in the way you describe, but information is kept about the analysis process that produced the resulting dataset.
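
As a toy model of this behaviour (new datasets for every run, nothing overwritten, and a job record per run), consider the following Python sketch. The `History` class, its fields, and the version strings are hypothetical and exist only for illustration; they are not Galaxy's schema.

```python
import itertools


class History:
    """Toy model: every job run creates a new dataset with a fresh id."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.datasets = {}  # dataset id -> content
        self.jobs = []      # one record per job run

    def add_dataset(self, content):
        dataset_id = next(self._next_id)
        self.datasets[dataset_id] = content
        return dataset_id

    def run_job(self, tool_id, tool_version, input_id, func):
        # The output always gets a new id; inputs and earlier outputs remain.
        output_id = self.add_dataset(func(self.datasets[input_id]))
        self.jobs.append({
            "tool_id": tool_id,
            "tool_version": tool_version,
            "input_id": input_id,
            "output_id": output_id,
        })
        return output_id
```

Running the same tool twice on one input, once per version, leaves two job records and two separate output datasets, so the output is not "version #2" of a single file but a distinct dataset whose provenance is stored in its job record.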

Greg Von Kuster



Thanks,
Wanmei


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/