Tool unit tests using composite datatypes

Peter Cock

3 Apr 2013

1:24 p.m.

Hello all,

I'd like to be able to write some simple <test> entries for some of the BLAST+ tools using composite datatypes as input or output (i.e. small BLAST databases). This doesn't seem to be mentioned or hinted at on the wiki:

http://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax?action=show&redirect=Admin%2FTools%2FTool+Config+Syntax#A.3Ctest.3E_tag_set

Is it possible to use a composite datatype as a test input? If so, how? Normal datatypes are loaded into the test history using the upload tool - does that mean I first need to extend the relevant datatypes to allow them to be uploaded?

Example: Run blastp using a small query FASTA file and a small database, check the output (e.g. tabular).

Is it possible to use a composite datatype as a test output? If so, how?

Example: Run makeblastdb using a small FASTA file, and check the output (a small BLAST database).

Thanks,

Peter

Dave Bouvier

3 Apr

1:37 p.m.

Peter,

Yes, it is definitely possible to use a composite datatype in functional tests, and a number of tools in the Galaxy distribution do so.

For examples on how to define composite inputs, you can look at the tools in tools/rgenetics/, such as rgGLM.xml or rgHaploView.xml.

For the outputs, rgClean.xml provides an example of comparing a composite output dataset with the expected test data.

--Dave B.

On 4/3/13 09:24:34.000, Peter Cock wrote:

...

Hello all,

I'd like to be able to write some simple <test> entries for some of the BLAST+ tools using composite datatypes as input or output (i.e. small BLAST databases). This doesn't seem to be mentioned or hinted at on the wiki:

http://wiki.galaxyproject.org/Admin/Tools/ToolConfigSyntax?action=show&redirect=Admin%2FTools%2FTool+Config+Syntax#A.3Ctest.3E_tag_set

Is it possible to use a composite datatype as a test input? If so, how? Normal datatypes are loaded into the test history using the upload tool - does that mean I first need to extend the relevant datatypes to allow them to be uploaded?

Example: Run blastp using a small query FASTA file and a small database, check the output (eg tabular).

Is it possible to use a composite datatype as a test output? If so how?

Example: Run makeblastdb using a small FASTA file, and check the output (a small BLAST database).

Thanks,

Peter ___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/

Peter Cock

1:48 p.m.

On Wed, Apr 3, 2013 at 2:37 PM, Dave Bouvier <dave@bx.psu.edu> wrote:

...

Peter,

Yes, it is definitely possible to use a composite datatype in functional tests, and a number of tools in the Galaxy distribution do so.

Hooray - I was hoping this functionality already existed but was just undocumented.

...

For examples on how to define composite inputs, you can look at the tools in tools/rgenetics/, such as rgGLM.xml or rgHaploView.xml.

Got it - I see that uses the tag <composite_data> within the <param> tag.

...

For the outputs, rgClean.xml provides an example of comparing a composite output dataset with the expected test data.

Here there is a tag <extra_files> used within the <output> tag.

Many thanks Dave - those should be enough to get me started, and maybe I can update the wiki once I'm happy with how this works...

Cheers,

Peter
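For readers following along, here is a minimal sketch of the two patterns just mentioned: <composite_data> children inside a test <param>, and <extra_files> children inside a test <output>. Only those two tag names come from the thread. The parameter names, file names, ftype values and other attributes below are placeholders, not copied from rgGLM.xml or rgClean.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical test fragment. Only the <composite_data> and <extra_files>
# tag names come from the thread; every name and value here is a placeholder.
TEST_XML = """
<test>
  <param name="input_db" value="example_db" ftype="example_composite">
    <composite_data value="example_db.part1"/>
    <composite_data value="example_db.part2"/>
  </param>
  <output name="out_file" file="example_primary.html" ftype="html">
    <extra_files type="file" name="child.txt" value="expected_child.txt"/>
  </output>
</test>
"""


def composite_files(test_xml):
    """Return (input component files, expected output child files)."""
    test = ET.fromstring(test_xml)
    # Component files that make up a composite test input.
    inputs = [c.get("value") for c in test.findall("./param/composite_data")]
    # (produced child name, expected test-data file) pairs for the output.
    outputs = [(e.get("name"), e.get("value"))
               for e in test.findall("./output/extra_files")]
    return inputs, outputs


inputs, outputs = composite_files(TEST_XML)
print(inputs)
print(outputs)
```

The script only parses the fragment to make the nesting explicit; it does not run anything through Galaxy.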

Peter Cock

4 Apr

5:34 p.m.

On Wed, Apr 3, 2013 at 2:48 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Wed, Apr 3, 2013 at 2:37 PM, Dave Bouvier <dave@bx.psu.edu> wrote:

...
For the outputs, rgClean.xml provides an example of comparing a composite output dataset with the expected test data.

Here there is a tag <extra_files> used within the <output> tag.

Many thanks Dave - those should be enough to get me started, and maybe I can update the wiki once I'm happy with how this works...

I'm trying this for makeblastdb, which produces a single output: a composite datatype which is a BLAST database.

The examples I found all seemed to be composite datatypes with a central file (e.g. HTML plus images), i.e. composite_type='auto_primary_file'.

In the case of the BLAST databases, this is a composite datatype without a primary file, aka composite_type='basic'. Leaving the <output> without a file gives:

Exception: Test output does not have a 'file'

Using an empty value doesn't work:

Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 171, in test_tool
    self.do_it( td, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 102, in do_it
    self.verify_dataset_correctness( outfile, hid=elem_hid, maxseconds=testdef.maxseconds, attributes=attributes, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/base/twilltestcase.py", line 828, in verify_dataset_correctness
    self.files_diff( local_name, temp_name, attributes=attributes )
  File "/mnt/galaxy/galaxy-central/test/base/twilltestcase.py", line 66, in files_diff
    local_file = open( file1, 'U' ).readlines()
IOError: [Errno 21] Is a directory: '/mnt/galaxy/galaxy-central/test-data'

Using /dev/null gets a bit further, but something is confusing the comparisons.

My current experimental test looks like this:

<tests>
  <test>
    <param name="dbtype" value="prot"/>
    <param name="file" value="four_human_proteins.fasta"/>
    <param name="title" value="Just 4 human proteins"/>
    <param name="parse_seqids" value=""/>
    <param name="hash_index" value="-hash_index"/>
    <output name="out_file" file="/dev/null" ftype="blastdbp">
      <extra_files type="file" value="four_human_proteins.fasta.phd" name="blastdb.pdb"/>
      <extra_files type="file" value="four_human_proteins.fasta.phi" name="blastdb.phi"/>
      <extra_files type="file" value="four_human_proteins.fasta.phr" name="blastdb.phr"/>
      <extra_files type="file" value="four_human_proteins.fasta.pin" name="blastdb.pin"/>
      <extra_files type="file" value="four_human_proteins.fasta.pog" name="blastdb.pog"/>
      <extra_files type="file" value="four_human_proteins.fasta.psd" name="blastdb.psd"/>
      <extra_files type="file" value="four_human_proteins.fasta.psi" name="blastdb.psi"/>
      <extra_files type="file" value="four_human_proteins.fasta.psq" name="blastdb.psq"/>
    </output>
  </test>
</tests>

Any advice?

Thanks,

Peter
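Each <extra_files> line in a test like the one above pairs a produced child file (name) with an expected test-data file (value) by hand, so a mismatched extension is easy to miss. The following is a rough sketch of a standalone sanity check, not part of Galaxy. It assumes that name and value are meant to share the same final extension, which happens to hold for this BLAST database layout but is not a general rule. Only three of the eight entries are reproduced, to keep it short.

```python
import os
import xml.etree.ElementTree as ET

# Three of the <extra_files> entries from the test in this thread.
OUTPUT_XML = """
<output name="out_file" file="/dev/null" ftype="blastdbp">
  <extra_files type="file" value="four_human_proteins.fasta.phd" name="blastdb.pdb"/>
  <extra_files type="file" value="four_human_proteins.fasta.phi" name="blastdb.phi"/>
  <extra_files type="file" value="four_human_proteins.fasta.psq" name="blastdb.psq"/>
</output>
"""


def mismatched_extensions(output_xml):
    """List (name, value) pairs whose final file extensions differ.

    Assumption: the produced child file and the expected file are supposed
    to end in the same extension. This is a convention of this example only.
    """
    output = ET.fromstring(output_xml)
    problems = []
    for extra in output.findall("extra_files"):
        name, value = extra.get("name"), extra.get("value")
        if os.path.splitext(name)[1] != os.path.splitext(value)[1]:
            problems.append((name, value))
    return problems


for name, value in mismatched_extensions(OUTPUT_XML):
    print("extension mismatch: %s vs %s" % (name, value))
```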

Daniel Blankenberg

6:19 p.m.

Hi Peter,

What is the test error given when you do have a value defined for name in output?

Can you try using 'empty_file.dat'?

e.g.

<output name="out_file" file="empty_file.dat" >

or

<output name="out_file" file="empty_file.dat" compare="contains">

etc

Thanks for using Galaxy,

Dan

On Apr 4, 2013, at 1:34 PM, Peter Cock wrote:

...

Peter Cock

5 Apr

2:08 p.m.

On Thu, Apr 4, 2013 at 7:19 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:

...

Hi Peter,

What is the test error given when you do have a value defined for name in output?

Can you try using 'empty_file.dat'?

e.g.

<output name="out_file" file="empty_file.dat" >

or

<output name="out_file" file="empty_file.dat" compare="contains">

etc

Hi Daniel,

That seems to help (plus fixing a typo in one of my child file extensions). However there is something else amiss, but my Galaxy is a little out of date:

-------------------- >> begin captured logging << --------------------
galaxy.web.framework: DEBUG: Error: this request returned None from get_history(): http://localhost:9486/
galaxy.web.framework: DEBUG: Error: this request returned None from get_history(): http://localhost:9486/
galaxy.web.framework: DEBUG: Error: this request returned None from get_history(): http://localhost:9486/user/logout
galaxy.web.framework: DEBUG: Error: this request returned None from get_history(): http://localhost:9486/
galaxy.tools.actions.upload_common: INFO: tool upload1 created job id 1
galaxy.jobs.manager: DEBUG: (1) Job assigned to handler 'main'
galaxy.jobs: DEBUG: (1) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/1
galaxy.jobs.handler: DEBUG: dispatching job 1 to local runner
galaxy.jobs.handler: INFO: (1) Job dispatched
galaxy.jobs.runners.local: DEBUG: Local runner: starting job 1
galaxy.jobs.runners.local: DEBUG: executing: python /mnt/galaxy/galaxy-central/tools/data_source/upload.py /mnt/galaxy/galaxy-central /tmp/tmpOBsw3s/database/tmp/tmpshAqc4 /tmp/tmpOBsw3s/database/tmp/tmpjPyydZ 1:/mnt/galaxy/galaxy-central/database/job_working_directory/000/1/dataset_1_files:/tmp/tmpOBsw3s/database/files/000/dataset_1.dat
galaxy.jobs.runners.local: DEBUG: execution finished: python /mnt/galaxy/galaxy-central/tools/data_source/upload.py /mnt/galaxy/galaxy-central /tmp/tmpOBsw3s/database/tmp/tmpshAqc4 /tmp/tmpOBsw3s/database/tmp/tmpjPyydZ 1:/mnt/galaxy/galaxy-central/database/job_working_directory/000/1/dataset_1_files:/tmp/tmpOBsw3s/database/files/000/dataset_1.dat
galaxy.jobs: DEBUG: Tool did not define exit code or stdio handling; checking stderr for success
galaxy.jobs: DEBUG: job 1 ended
galaxy.jobs.manager: DEBUG: (2) Job assigned to handler 'main'
galaxy.jobs: DEBUG: (2) Working directory for job is: /mnt/galaxy/galaxy-central/database/job_working_directory/000/2
galaxy.jobs.handler: DEBUG: dispatching job 2 to local runner
galaxy.jobs.handler: INFO: (2) Job dispatched
galaxy.jobs.runners.local: DEBUG: Local runner: starting job 2
galaxy.jobs.runners.local: DEBUG: executing: makeblastdb -version &> /tmp/tmpOBsw3s/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpOBsw3s/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpOBsw3s/database/files/000/dataset_1.dat /tmp/tmpOBsw3s/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs.runners.local: DEBUG: execution finished: makeblastdb -version &> /tmp/tmpOBsw3s/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpOBsw3s/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpOBsw3s/database/files/000/dataset_1.dat /tmp/tmpOBsw3s/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.tools: DEBUG: Error opening galaxy.json file: [Errno 2] No such file or directory: '/mnt/galaxy/galaxy-central/database/job_working_directory/000/2/galaxy.json'
galaxy.jobs: DEBUG: job 2 ended
base.twilltestcase: INFO: ## files diff /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd (51 bytes, 4 lines) and /tmp/tmpOBsw3s/database/tmp/tmpPaxnAGblastdb.phd (33 bytes, 1 lines)
base.twilltestcase: INFO: ## file /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd line 1 is '1111718449\x022\n'
base.twilltestcase: INFO: ## file /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd line 2 is '2924903341\x020\n'
base.twilltestcase: INFO: ## file /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd line 3 is '3666588750\x021\n'
base.twilltestcase: INFO: ## file /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd line 4 is '539247318\x023\n'
base.twilltestcase: INFO: ## file /tmp/tmpOBsw3s/database/tmp/tmpPaxnAGblastdb.phd line 1 is 'This is a BLAST protein database.'
base.twilltestcase: INFO: ## sibling files to /tmp/tmpOBsw3s/database/tmp/tmpPaxnAGblastdb.phd in same directory:
base.twilltestcase: INFO: ## sibling file: tmpUMWHqD
base.twilltestcase: INFO: ## sibling file: twilltestcase-MDnhZ1.html
base.twilltestcase: INFO: ## sibling file: twilltestcase-_vAVc9.html
base.twilltestcase: INFO: ## sibling file: tmpyiy0Kvempty_file.dat
base.twilltestcase: INFO: ## sibling file: twilltestcase-MHWm9w.html
base.twilltestcase: INFO: ## sibling file: tmpPaxnAGblastdb.phd
base.twilltestcase: INFO: ## sibling file: twilltestcase-dVKyqw.html
base.twilltestcase: INFO: ## sibling file: tmpshAqc4
base.twilltestcase: INFO: ## files diff on /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd and /tmp/tmpOBsw3s/database/tmp/tmpPaxnAGblastdb.phd lines_diff=0, found diff = 5
--------------------- >> end captured logging << ---------------------

As you might guess, I've added some additional logging, and it seems that for the blastdb.phd file expected to be produced, the comparison is being made using the default blastdbp datatype's peek text, one line: "This is a BLAST protein database."

According to the command being run, the output files should all be named:

/tmp/tmpHxY3vf/database/files/000/dataset_2_files/blastdb.p*

e.g.

/tmp/tmpHxY3vf/database/files/000/dataset_2_files/blastdb.phd

Somehow the test code is instead using an altogether different path, and this folder only contains one file with blastdb in its name, and not as the filename alone but munged onto a temp prefix. This seems to be a problem in collecting the output for a composite datatype.

Another oddity is that my input FASTA file is being used twice on the command line (a possible problem testing with a <repeat> <param> perhaps?), note:

makeblastdb -out "/tmp/tmpOBsw3s/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpOBsw3s/database/files/000/dataset_1.dat /tmp/tmpOBsw3s/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot

I will have to check out the latest Galaxy code and retest in case this is something already fixed...

Thanks,

Peter
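The "found diff = 5" figure in the log is consistent with counting the added and removed lines of a unified diff: four expected lines removed plus one unexpected line added. Here is a minimal sketch of that arithmetic using Python's standard difflib. It assumes the test framework counts changed lines in roughly this way and is not a copy of the code in twilltestcase.py.

```python
import difflib


def count_diff_lines(expected_lines, actual_lines):
    """Count added/removed lines in a unified diff, ignoring the file headers."""
    diff = difflib.unified_diff(expected_lines, actual_lines,
                                "local_file", "history_data")
    return sum(
        1 for line in diff
        if (line.startswith("+") and not line.startswith("+++"))
        or (line.startswith("-") and not line.startswith("---"))
    )


# The two files as reported by the extra logging in the message above.
expected = [
    "1111718449\x022\n",
    "2924903341\x020\n",
    "3666588750\x021\n",
    "539247318\x023\n",
]
actual = ["This is a BLAST protein database."]

print(count_diff_lines(expected, actual))  # prints 5
```

With lines_diff=0 any nonzero count fails the comparison, so the test cannot pass while the one-line placeholder text is being compared instead of the real blastdb.phd file.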

Peter Cock

25 Apr

2:36 p.m.

On Fri, Apr 5, 2013 at 3:08 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Thu, Apr 4, 2013 at 7:19 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:

...
Hi Peter,

What is the test error given when you do have a value defined for name in output?

Can you try using 'empty_file.dat'?

e.g.

<output name="out_file" file="empty_file.dat" >

or

<output name="out_file" file="empty_file.dat" compare="contains">

etc

Hi Daniel,

That seems to help (plus fixing a typo in one of my child file extensions). However there is something else amiss, but my Galaxy is a little out of date:

...

I will have to checkout the latest Galaxy code and retest in case this is something already fixed...

OK, I've updated to the latest galaxy-central default branch. Here's the slightly revised test for ncbi_makeblastdb.xml:

<tests>
  <test>
    <param name="dbtype" value="prot"/>
    <param name="file" value="four_human_proteins.fasta"/>
    <param name="title" value="Just 4 human proteins"/>
    <param name="parse_seqids" value=""/>
    <param name="hash_index" value="-hash_index"/>
    <output name="out_file" file="empty_file.dat" ftype="blastdbp">
      <extra_files type="file" value="four_human_proteins.fasta.phd" name="blastdb.pdb"/>
      <extra_files type="file" value="four_human_proteins.fasta.phi" name="blastdb.phi"/>
      <extra_files type="file" value="four_human_proteins.fasta.phr" name="blastdb.phr"/>
      <extra_files type="file" value="four_human_proteins.fasta.pin" name="blastdb.pin"/>
      <extra_files type="file" value="four_human_proteins.fasta.pog" name="blastdb.pog"/>
      <extra_files type="file" value="four_human_proteins.fasta.psd" name="blastdb.psd"/>
      <extra_files type="file" value="four_human_proteins.fasta.psi" name="blastdb.psi"/>
      <extra_files type="file" value="four_human_proteins.fasta.psq" name="blastdb.psq"/>
    </output>
  </test>
</tests>

Here's some of the Galaxy log when I run this example manually through the web interface:

galaxy.jobs.handler INFO 2013-04-25 15:17:43,820 (43) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-04-25 15:17:44,377 (43) executing: makeblastdb -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_43; makeblastdb -out "/mnt/galaxy/galaxy-central/database/files/000/dataset_46_files/blastdb" -hash_index -in " /mnt/galaxy/galaxy-central/database/files/000/dataset_45.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:17:44,436 (43) Persisting job destination (destination id: local:///)
galaxy.jobs.runners.local DEBUG 2013-04-25 15:17:44,751 execution finished: makeblastdb -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_43; makeblastdb -out "/mnt/galaxy/galaxy-central/database/files/000/dataset_46_files/blastdb" -hash_index -in " /mnt/galaxy/galaxy-central/database/files/000/dataset_45.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:17:45,456 job 43 ended

Now some snippets from the test run (unedited version at end of email).

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.handler DEBUG 2013-04-25 15:12:44,214 (2) Dispatching to local runner
galaxy.jobs DEBUG 2013-04-25 15:12:45,599 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.handler INFO 2013-04-25 15:12:45,711 (2) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,345 (2) executing: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:46,409 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,897 execution finished: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:47,502 job 2 ended
...

As noted in my last email, for some reason when running the test case, the input FASTA file is being included on the command line TWICE. Curiously the -hash_index argument has been omitted. Linked maybe?

And then once this has run, as before, the file comparison is hard to fathom (it is not comparing the correct files to each other).

The example rgClean.xml which Dave Bouvier pointed me at uses a composite datatype with a central file ('pbed' which is a subclass of 'html') while the other examples I've found are 'html', i.e. composite_type='auto_primary_file'.

It does seem likely at this point that I could be the first person attempting to write a unit test for a composite datatype without a primary file (i.e. composite_type='basic'). I'd appreciate being shown an existing working unit test using a basic composite datatype as an output file - perhaps there is something on the Tool Shed (which is harder to search than the main repository where I can use grep)?

Thanks,

Peter

--

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.handler DEBUG 2013-04-25 15:12:44,214 (2) Dispatching to local runner
galaxy.jobs DEBUG 2013-04-25 15:12:45,599 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.handler INFO 2013-04-25 15:12:45,711 (2) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,345 (2) executing: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:46,409 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,897 execution finished: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:47,502 job 2 ended
galaxy.web.framework DEBUG 2013-04-25 15:12:47,995 This request returned None from get_history(): http://localhost:8898/history
galaxy.web.framework DEBUG 2013-04-25 15:12:48,097 This request returned None from get_history(): http://localhost:8898/display
base.twilltestcase INFO 2013-04-25 15:12:48,141 ## files diff on /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd and /tmp/tmpovUM3w/database/tmp/tmpXjFIurblastdb.pdb lines_diff=0, found diff = 5
---------------------- >> begin tool stdout << -----------------------
Building a new DB, current time: 04/25/2013 15:12:46
New DB name: /tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb
New DB title: Just 4 human proteins
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 4 sequences in 0.000900984 seconds.
Adding sequences from FASTA; added 4 sequences in 0.000420094 seconds.
----------------------- >> end tool stdout << ------------------------
---------------------- >> begin tool stderr << -----------------------
----------------------- >> end tool stderr << ------------------------
FAIL
======================================================================
FAIL: NCBI BLAST+ makeblastdb ( ncbi_makeblastdb ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 171, in test_tool
    self.do_it( td, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 102, in do_it
    self.verify_dataset_correctness( outfile, hid=elem_hid, maxseconds=testdef.maxseconds, attributes=attributes, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/base/twilltestcase.py", line 849, in verify_dataset_correctness
    raise AssertionError( errmsg )
AssertionError: History item 2 different than expected, difference (using diff):
( /mnt/galaxy/galaxy-central/test-data/empty_file.dat v. /tmp/tmpovUM3w/database/tmp/tmpZFzfEJempty_file.dat )
Composite file (blastdb.pdb) of History item 2 different than expected, difference (using diff):
--- local_file
+++ history_data
@@ -1,4 +1,1 @@
-11117184492
-29249033410
-36665887501
-5392473183
+This is a BLAST protein database.
-------------------- >> begin captured stdout << ---------------------
Uploaded file: four_human_proteins.fasta , ftype: auto , extra: {'value': 'four_human_proteins.fasta', 'children': []}
button 'in_add' clicked
form 'tool_form' contains the following controls ( note the values )
control 0: <HiddenControl(refresh=refresh) (readonly)>
control 1: <HiddenControl(tool_id=ncbi_makeblastdb) (readonly)>
control 2: <HiddenControl(tool_state=8002549b010000613665366164613561313035643161303739393466363162623336343338386261633066643736303a3762323235663566373036313637363535663566323233613230333032633230323237343639373436633635323233613230323235633232356332323232326332303232363436323734373937303635323233613230323235633232373037323666373435633232323232633230323236383631373336383566363936653634363537383232336132303232356332323534373237353635356332323232326332303232363936653232336132303232356237623563323235663566363936653634363537383566356635633232336132303330326332303563323236363639366336353563323233613230333137643263323037623563323235663566363936653634363537383566356635633232336132303331326332303563323236363639366336353563323233613230333137643564323232633230323237303631373237333635356637333635373136393634373332323361323032323563323234363631366337333635356332323232376471002e) (readonly)>
control 3: <RadioControl(dbtype=[*prot, nucl])>
control 4: <SelectControl(in_0|file=[*1])>
control 5: <SubmitControl(in_0_remove=Remove FASTA file 1) (readonly)>
control 6: <SelectControl(in_1|file=[*1])>
control 7: <SubmitControl(in_1_remove=Remove FASTA file 2) (readonly)>
control 8: <SubmitControl(in_add=Add new FASTA file) (readonly)>
control 9: <TextControl(title=)>
control 10: <CheckboxControl(parse_seqids=[true])>
control 11: <HiddenControl(parse_seqids=true) (readonly)>
control 12: <CheckboxControl(hash_index=[*true])>
control 13: <HiddenControl(hash_index=true) (readonly)>
control 14: <SubmitControl(runtool_btn=Execute) (readonly)>
page_inputs (0) {'dbtype': ['prot'], 'hash_index': ['-hash_index'], 'title': ['Just 4 human proteins'], 'parse_seqids': [''], 'in_0|file': ['four_human_proteins.fasta']}
--------------------- >> end captured stdout << ----------------------

Peter Cock

30 Apr

3:30 p.m.

On Thu, Apr 25, 2013 at 3:36 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

OK, I've updated to the latest galaxy-central default branch. Here's the slightly revised test for ncbi_makeblastdb.xml,

...

In the absence of any fresh feedback, I've filed an issue on Trello for this to make sure it gets tracked:

https://trello.com/card/basic-composite-datatypes-not-working-as-test-output...

Composite datatypes like HTML seem to work (with a primary file). Basic composite datatypes seem not to work (with no primary file).

Regards,

Peter

Peter Cock

10 Jul

4:22 p.m.

On Tue, Apr 30, 2013 at 4:30 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Thu, Apr 25, 2013 at 3:36 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
OK, I've updated to the latest galaxy-central default branch. Here's the slightly revised test for ncbi_makeblastdb.xml,

...

In the absence of any fresh feedback, I've filed an issue on Trello for this to make sure it gets tracked:

https://trello.com/card/basic-composite-datatypes-not-working-as-test-output... Composite datatypes like HTML seem to work (with a primary file). Basic composite datatypes seem not to work (with no primary file).

Regards,

Peter

Hi guys,

Has anything changed in this area of the test framework since April?

Thanks,

Peter

John Chilton

17 Nov

8:30 a.m.

On Thu, Apr 25, 2013 at 9:36 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Fri, Apr 5, 2013 at 3:08 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
On Thu, Apr 4, 2013 at 7:19 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:

...
Hi Peter,

What is the test error given when you do have a value defined for name in output?

Can you try using 'empty_file.dat'?

e.g.

<output name="out_file" file="empty_file.dat" >

or

<output name="out_file" file="empty_file.dat" compare="contains">

etc

Hi Daniel,

That seems to help (plus fixing a typo in one of my child file extensions). However there is something else amiss, but my Galaxy is a little out of date:

...

I will have to checkout the latest Galaxy code and retest in case this is something already fixed...

OK, I've updated to the latest galaxy-central default branch. Here's the slightly revised test for ncbi_makeblastdb.xml,

<tests> <test>  <param name="dbtype" value="prot"/> <param name="file" value="four_human_proteins.fasta"/> <param name="title" value="Just 4 human proteins"/> <param name="parse_seqids" value=""/> <param name="hash_index" value="-hash_index"/> <output name="out_file" file="empty_file.dat" ftype="blastdbp"> <extra_files type="file" value="four_human_proteins.fasta.phd" name="blastdb.pdb"/> <extra_files type="file" value="four_human_proteins.fasta.phi" name="blastdb.phi"/> <extra_files type="file" value="four_human_proteins.fasta.phr" name="blastdb.phr"/> <extra_files type="file" value="four_human_proteins.fasta.pin" name="blastdb.pin"/> <extra_files type="file" value="four_human_proteins.fasta.pog" name="blastdb.pog"/> <extra_files type="file" value="four_human_proteins.fasta.psd" name="blastdb.psd"/> <extra_files type="file" value="four_human_proteins.fasta.psi" name="blastdb.psi"/> <extra_files type="file" value="four_human_proteins.fasta.psq" name="blastdb.psq"/> </output> </test> </tests>

Here's some of the Galaxy log when I run this example manually through the web interface:

galaxy.jobs.handler INFO 2013-04-25 15:17:43,820 (43) Job dispatched galaxy.jobs.runners.local DEBUG 2013-04-25 15:17:44,377 (43) executing: makeblastdb -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_43; makeblastdb -out "/mnt/galaxy/galaxy-central/database/files/000/dataset_46_files/blastdb" -hash_index -in " /mnt/galaxy/galaxy-central/database/files/000/dataset_45.dat " -title "Just 4 human proteins" -dbtype prot galaxy.jobs DEBUG 2013-04-25 15:17:44,436 (43) Persisting job destination (destination id: local:///) galaxy.jobs.runners.local DEBUG 2013-04-25 15:17:44,751 execution finished: makeblastdb -version &> /mnt/galaxy/galaxy-central/database/tmp/GALAXY_VERSION_STRING_43; makeblastdb -out "/mnt/galaxy/galaxy-central/database/files/000/dataset_46_files/blastdb" -hash_index -in " /mnt/galaxy/galaxy-central/database/files/000/dataset_45.dat " -title "Just 4 human proteins" -dbtype prot galaxy.jobs DEBUG 2013-04-25 15:17:45,456 job 43 ended

Now some snippets from the test run (unedited version at end of email).

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.handler DEBUG 2013-04-25 15:12:44,214 (2) Dispatching to local runner
galaxy.jobs DEBUG 2013-04-25 15:12:45,599 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.handler INFO 2013-04-25 15:12:45,711 (2) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,345 (2) executing: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:46,409 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,897 execution finished: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:47,502 job 2 ended
...

As noted in my last email, for some reason when running the test case, the input FASTA file is being included on the command line TWICE. Curiously the -hash_index argument has been omitted. Linked maybe?

Peter,

I have fixed the double listing of the FASTA file. Putting min=1 on a repeat statement would result in two repeat instances when using functional tests without this bug fix.

https://bitbucket.org/galaxy/galaxy-central/commits/5e534cc8da856ad598d63b8b...

It is likely also the problem with your mira tests?

The missing hash_index was caused because the param value you put in the test tag should be true or false, not the truevalue/falsevalue attributes as far as I can tell - those are used only by cheetah I guess. Adding the hash_index parameter creates an additional 5 files - including ones you listed in your test case.

With these changes, I was able to write working functional tests for your tool using the template you outlined in the Trello card. The .pin file doesn't match, I think there is something time-based in there, so I had to allow two lines of diff.

Also, since this e-mail, you now have two parameters named file, which doesn't go over well yet - so I renamed mask|file to mask|mask_file.

<test>
  <param name="dbtype" value="prot"/>
  <param name="file" value="four_human_proteins.fasta"/>
  <param name="title" value="Just 4 human proteins"/>
  <param name="parse_seqids" value=""/>
  <param name="hash_index" value="true"/>
  <output name="out_file" file="empty_file.dat" ftype="blastdbp">
    <extra_files type="file" value="four_human_proteins.fasta.phr" name="blastdb.phr"/>
    <extra_files type="file" value="four_human_proteins.fasta.pin" name="blastdb.pin" lines_diff="2" />
    <extra_files type="file" value="four_human_proteins.fasta.psq" name="blastdb.psq"/>
    <extra_files type="file" value="four_human_proteins.fasta.pog" name="blastdb.pog"/>
    <extra_files type="file" value="four_human_proteins.fasta.phd" name="blastdb.phd"/>
    <extra_files type="file" value="four_human_proteins.fasta.phi" name="blastdb.phi"/>
    <extra_files type="file" value="four_human_proteins.fasta.psd" name="blastdb.psd"/>
    <extra_files type="file" value="four_human_proteins.fasta.psi" name="blastdb.psi"/>
  </output>
</test>

These changes should work right out of central; they do not use my API driven variant on github.

I discovered no problems with auto_primary versus basic composite types here, just the things listed above.

-John
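To illustrate the point about boolean parameters, here is a minimal Python sketch. It is not Galaxy's actual code: it assumes the test framework only treats the literal string "true" as a ticked checkbox, and that truevalue/falsevalue are substituted later when the command line is built. The function names checkbox_state and render_flag are hypothetical.

```python
def checkbox_state(test_value):
    """Interpret the value attribute of a <param> tag inside a <test> block.

    Assumption: only the literal string "true" ticks the checkbox; anything
    else (including the truevalue text itself) leaves it unticked.
    """
    return test_value.strip().lower() == "true"


def render_flag(checked, truevalue="-hash_index", falsevalue=""):
    """Mimic what the command template would see for a boolean parameter."""
    return truevalue if checked else falsevalue


# value="-hash_index" is not "true", so the flag is dropped from the command.
print(repr(render_flag(checkbox_state("-hash_index"))))
# value="true" ticks the checkbox, so the truevalue is rendered.
print(repr(render_flag(checkbox_state("true"))))
```

Under these assumptions, passing the truevalue text as the test value silently behaves like an unticked box, which would match the missing -hash_index argument in the log.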

...

And then once this has run, as before, the file comparison is hard to fathom (it is not comparing the correct files to each other).

The example rgClean.xml which Dave Bouvier pointed me at uses a composite datatype with a central file ('pbed' which is a subclass of 'html') while the other examples I've found are 'html', i.e. composite_type='auto_primary_file'.

It does seem likely at this point that I could be the first person attempting to write a unit test for a composite datatype without a primary file (i.e. composite_type='basic').

I'd appreciate being shown an existing working unit test using a basic composite datatype as an output file - perhaps there is something on the Tool Shed (which is harder to search than the main repository where I can use grep)?

Thanks,

Peter

--

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.handler DEBUG 2013-04-25 15:12:44,214 (2) Dispatching to local runner
galaxy.jobs DEBUG 2013-04-25 15:12:45,599 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.handler INFO 2013-04-25 15:12:45,711 (2) Job dispatched
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,345 (2) executing: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:46,409 (2) Persisting job destination (destination id: local:///)
galaxy.jobs.runners.local DEBUG 2013-04-25 15:12:46,897 execution finished: makeblastdb -version &> /tmp/tmpovUM3w/database/tmp/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb" -in " /tmp/tmpovUM3w/database/files/000/dataset_1.dat /tmp/tmpovUM3w/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
galaxy.jobs DEBUG 2013-04-25 15:12:47,502 job 2 ended
galaxy.web.framework DEBUG 2013-04-25 15:12:47,995 This request returned None from get_history(): http://localhost:8898/history
galaxy.web.framework DEBUG 2013-04-25 15:12:48,097 This request returned None from get_history(): http://localhost:8898/display
base.twilltestcase INFO 2013-04-25 15:12:48,141 ## files diff on /mnt/galaxy/galaxy-central/test-data/four_human_proteins.fasta.phd and /tmp/tmpovUM3w/database/tmp/tmpXjFIurblastdb.pdb lines_diff=0, found diff = 5
---------------------- >> begin tool stdout << -----------------------

Building a new DB, current time: 04/25/2013 15:12:46
New DB name: /tmp/tmpovUM3w/database/files/000/dataset_2_files/blastdb
New DB title: Just 4 human proteins
Sequence type: Protein
Keep Linkouts: T
Keep MBits: T
Maximum file size: 1073741824B
Adding sequences from FASTA; added 4 sequences in 0.000900984 seconds.
Adding sequences from FASTA; added 4 sequences in 0.000420094 seconds.

----------------------- >> end tool stdout << ------------------------

---------------------- >> begin tool stderr << -----------------------

----------------------- >> end tool stderr << ------------------------

FAIL

======================================================================
FAIL: NCBI BLAST+ makeblastdb ( ncbi_makeblastdb ) > Test-1
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 171, in test_tool
    self.do_it( td, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/functional/test_toolbox.py", line 102, in do_it
    self.verify_dataset_correctness( outfile, hid=elem_hid, maxseconds=testdef.maxseconds, attributes=attributes, shed_tool_id=shed_tool_id )
  File "/mnt/galaxy/galaxy-central/test/base/twilltestcase.py", line 849, in verify_dataset_correctness
    raise AssertionError( errmsg )
AssertionError: History item 2 different than expected, difference (using diff):
( /mnt/galaxy/galaxy-central/test-data/empty_file.dat v. /tmp/tmpovUM3w/database/tmp/tmpZFzfEJempty_file.dat )
Composite file (blastdb.pdb) of History item 2 different than expected, difference (using diff):
--- local_file
+++ history_data
@@ -1,4 +1,1 @@
-11117184492
-29249033410
-36665887501
-5392473183
+This is a BLAST protein database.
-------------------- >> begin captured stdout << ---------------------

Uploaded file: four_human_proteins.fasta , ftype: auto , extra: {'value': 'four_human_proteins.fasta', 'children': []} button 'in_add' clicked

form 'tool_form' contains the following controls ( note the values ) control 0: <HiddenControl(refresh=refresh) (readonly)> control 1: <HiddenControl(tool_id=ncbi_makeblastdb) (readonly)> control 2: <HiddenControl(tool_state=8002549b010000613665366164613561313035643161303739393466363162623336343338386261633066643736303a3762323235663566373036313637363535663566323233613230333032633230323237343639373436633635323233613230323235633232356332323232326332303232363436323734373937303635323233613230323235633232373037323666373435633232323232633230323236383631373336383566363936653634363537383232336132303232356332323534373237353635356332323232326332303232363936653232336132303232356237623563323235663566363936653634363537383566356635633232336132303330326332303563323236363639366336353563323233613230333137643263323037623563323235663566363936653634363537383566356635633232336132303331326332303563323236363639366336353563323233613230333137643564323232633230323237303631373237333635356637333635373136393634373332323361323032323563323234363631366337333635356332323232376471002e) (readonly)> control 3: <RadioControl(dbtype=[*prot, nucl])> control 4: <SelectControl(in_0|file=[*1])> control 5: <SubmitControl(in_0_remove=Remove FASTA file 1) (readonly)> control 6: <SelectControl(in_1|file=[*1])> control 7: <SubmitControl(in_1_remove=Remove FASTA file 2) (readonly)> control 8: <SubmitControl(in_add=Add new FASTA file) (readonly)> control 9: <TextControl(title=)> control 10: <CheckboxControl(parse_seqids=[true])> control 11: <HiddenControl(parse_seqids=true) (readonly)> control 12: <CheckboxControl(hash_index=[*true])> control 13: <HiddenControl(hash_index=true) (readonly)> control 14: <SubmitControl(runtool_btn=Execute) (readonly)> page_inputs (0) {'dbtype': ['prot'], 'hash_index': ['-hash_index'], 'title': ['Just 4 human proteins'], 'parse_seqids': [''], 'in_0|file': ['four_human_proteins.fasta']}

--------------------- >> end captured stdout << ----------------------

___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/

Peter Cock

18 Nov 2013

12:35 p.m.

On Sun, Nov 17, 2013 at 8:30 AM, John Chilton wrote:

...

On Thu, Apr 25, 2013 at 9:36 AM, Peter Cock wrote:

...
...

As noted in my last email, for some reason when running the test case, the input FASTA file is being included on the command line TWICE. Curiously the -hash_index argument has been omitted. Linked maybe?

Peter,

I have fixed the double listing of the FASTA file. Putting min=1 on a repeat statement would result in two repeat instances when using functional tests without this bug fix.

https://bitbucket.org/galaxy/galaxy-central/commits/5e534cc8da856ad598d63b8b...

Thank you - such a little thing once you'd traced its cause.

...

It is likely also the problem with your mira tests?

This should help for the MIRA4 tests too :)

...

The missing hash_index was caused because the param value you put in the test tag should be true or false, not the truevalue/falsevalue attributes as far as I can tell - those are used only by cheetah I guess. Adding the hash_index parameter creates an additional 5 files - including ones you listed in your test case.

I think I tried that too (true/false), but it was a while ago now. Hitting multiple test framework issues at the same time made debugging this hard.

...

With these changes, I was able to write working functional tests for your tool using the template you outlined in the Trello card. The .pin file doesn't match, I think there is something time-based in there, so I had to allow two lines of diff.

Yes, I agree the PIN file varies run to run, so the diff trick looks good.
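The lines_diff tolerance can be pictured with a short Python sketch. This is an illustration only, not the code in test/base/twilltestcase.py: it assumes the framework counts added and removed lines of a unified diff and passes when that count does not exceed lines_diff. The helper names and the sample lines (plain text standing in for the real .pin content) are made up.

```python
import difflib


def count_changed_lines(expected_lines, actual_lines):
    """Count added or removed lines in a unified diff, ignoring the headers."""
    diff = list(difflib.unified_diff(expected_lines, actual_lines,
                                     fromfile="local_file",
                                     tofile="history_data",
                                     lineterm=""))
    body = diff[2:]  # drop the '---' and '+++' file header lines
    return sum(1 for line in body if line[:1] in ("+", "-"))


def files_match(expected_lines, actual_lines, lines_diff=0):
    """Pass when the number of changed lines is within the allowed tolerance."""
    return count_changed_lines(expected_lines, actual_lines) <= lines_diff


# One line that differs between runs shows up as one removal plus one addition.
expected = ["header", "built 04/25/2013 15:12:46", "payload"]
actual = ["header", "built 11/18/2013 11:25:38", "payload"]
print(count_changed_lines(expected, actual))
print(files_match(expected, actual, lines_diff=0))
print(files_match(expected, actual, lines_diff=2))
```

Counted this way, a single time-dependent line costs two lines of diff, which is consistent with the lines_diff="2" setting used for the .pin file.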

...

Also, since this e-mail, you now have two parameters named file, that doesn't go over well yet - so I renamed mask|file to mask|mask_file.

The makeblastdb wrapper on the main Tool Shed doesn't yet have the masking file parameter, but the one on the Test Tool Shed already does - so I'd prefer not to change this:

http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus
http://testtoolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus

In principle the pipe-based fully specified parameter name would work here too, to resolve the potential ambiguity? (That is a separate Trello Card for handling potentially ambiguous parameters in the test framework):

https://trello.com/c/zSTrfDOB/820-disambiguated-conditional-parameters-not-s...

...

...

These changes should work right out of central, does not utilize my API driven variant on github.

I discovered no problems with auto_primary versus basic composite types here, just the things listed above.

Not for me though, even if I rename the masking "file" param to avoid the ambiguous "file" parameters, commits here:

https://github.com/peterjc/galaxy_blast/commit/2043cc813c2e138d93f8a940ea711...
https://github.com/peterjc/galaxy_blast/commit/f4f74cd065921b069499a5fcc4209...

and continuing on this branch:

https://github.com/peterjc/galaxy_blast/tree/test_makeblastdb
https://github.com/peterjc/galaxy_blast/commit/e828543539ab0a0ca8f8b16dbdce1...

That gave:

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.runners.local DEBUG 2013-11-18 11:25:38,449 execution finished: export GALAXY_SLOTS="1"; makeblastdb -version &> /tmp/tmpF81TF5/tmpghF4ik/new_files_path_lVhJ4E/GALAXY_VERSION_STRING_2; makeblastdb -out "/tmp/tmpF81TF5/tmpghF4ik/database/files/000/dataset_2_files/blastdb" -hash_index -in " /tmp/tmpF81TF5/tmpghF4ik/database/files/000/dataset_1.dat /tmp/tmpF81TF5/tmpghF4ik/database/files/000/dataset_1.dat " -title "Just 4 human proteins" -dbtype prot
...
Composite file (blastdb.phr) of History item 2 different than expected, difference (using diff):
Binary data detected, not displaying diff
...
FAILED (failures=1)
...

I find that changing the order of the <extra_files> tags in the test seems to alter the failure - which supports my hunch that something is scrambling the order of the extra files, so that it fails to compare the generated blastdb.phr with the provided four_human_proteins.fasta.phd

e.g. Here it seems to compare to the (place holder) text I generate when viewing a database in the Galaxy interface:

https://github.com/peterjc/galaxy_blast/commit/e828543539ab0a0ca8f8b16dbdce1...

$ ./run_functional_tests.sh -id ncbi_makeblastdb
...
galaxy.jobs.runners.local DEBUG 2013-11-18 12:23:24,812 execution finished: export GALAXY_SLOTS="1"; python /mnt/galaxy/galaxy-central/tools/data_source/upload.py /mnt/galaxy/galaxy-central /tmp/tmph0YlBO/tmpRK7bhO/new_files_path_0SO6J0/tmpkPWlQd /tmp/tmph0YlBO/tmpRK7bhO/new_files_path_0SO6J0/tmpr_xBCK 1:/tmp/tmph0YlBO/tmpRK7bhO/job_working_directory_Qh_YcY/000/1/dataset_1_files:/tmp/tmph0YlBO/tmpRK7bhO/database/files/000/dataset_1.dat
...
Composite file (blastdb.phd) of History item 2 different than expected, difference (using diff):
--- local_file
+++ history_data
@@ -1,4 +1,1 @@
-11117184492
-29249033410
-36665887501
-5392473183
+This is a BLAST protein database.
...
FAILED (failures=1)
...

If it works for you then perhaps the filesystem is a factor, e.g. os.listdir(...) order? I have had an initial look at the code in test/base/twilltestcase.py but haven't spotted a problem yet.

Thank you John,

Peter

Peter Cock

3:55 p.m.

On Mon, Nov 18, 2013 at 12:35 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Sun, Nov 17, 2013 at 8:30 AM, John Chilton wrote:

...
With these changes, I was able to write working functional tests for your tool using the template you outlined in the Trello card. ... I discovered no problems with auto_primary versus basic composite types here, just the things listed above.

Not for me though, even if I rename the masking "file" param to avoid the ambiguous "file" parameters ...

I find that changing the order of the <extra_files> tags in the test seems to alter the failure - which supports my hunch that something is scrambling the order of the extra files, so that it fails to compare the generated blastdb.phr with the provided four_human_proteins.fasta.phd

John's replies via Twitter: https://twitter.com/jmchilton/status/402436500131807232

...

@pjacock Hard to argue with @travisci but your makeblastdb test works unmodified on my box and I can reorder the extras. I'm still looking..

https://twitter.com/jmchilton/status/402446372231581696

...

@pjacock Got it, you are overriding display_data in blast DB datatypes. This breaks much! Is how the test framework/API/etc access datasets.

There may be trouble ahead... the current display_data override is to give some meaningful output to the user when they click the "eye ball" icon for a BLAST database - rather than something unhelpful and scary like a blank page.

A more slick option would be to override display_data to run blastdbcmd live, but that means a run time dependency of the datatype definition on (a specific version of) the BLAST+ binaries - which could be problematic.

I had considered capturing the makeblastdb stdout to a file as blastdb.log and having that as the (human viewable) primary file of a composite datatype - but that would cause trouble if and when I manage to support user uploaded BLAST databases.

Thanks,

Peter

John Chilton

4:02 p.m.

On Mon, Nov 18, 2013 at 9:55 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Mon, Nov 18, 2013 at 12:35 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
On Sun, Nov 17, 2013 at 8:30 AM, John Chilton wrote:

...
With these changes, I was able to write working functional tests for your tool using the template you outlined in the Trello card. ... I discovered no problems with auto_primary versus basic composite types here, just the things listed above.

Not for me though, even if I rename the masking "file" param to avoid the ambiguous "file" parameters ...

I find that changing the order of the <extra_files> tags in the test seems to alter the failure - which supports my hunch that something is scrambling the order of the extra files, so that it fails to compare the generated blastdb.phr with the provided four_human_proteins.fasta.phd

John's replies via Twitter:

https://twitter.com/jmchilton/status/402436500131807232

...
@pjacock Hard to argue with @travisci but your makeblastdb test works unmodified on my box and I can reorder the extras. I'm still looking..

https://twitter.com/jmchilton/status/402446372231581696

...
@pjacock Got it, you are overriding display_data in blast DB datatypes. This breaks much! Is how the test framework/API/etc access datasets.

There may be trouble ahead... the current display_data override is to give some meaningful output to the user when they click the "eye ball" icon for a BLAST database - rather than something unhelpful and scary like a blank page.

I didn't implement any of this composite dataset stuff so I am just guessing on best practices here, but is the right thing to do here to switch from 'basic' to 'auto_primary_file'? Just generate a file that includes only the content you want to display and set that as the primary file - I think this might be better practice than overriding display_data. If you want that file to be a log when one is available, that is fine; if files are uploaded, an optional log could be available, and you can provide a default fallback in generate_primary_file/regenerate_primary_file when it is unavailable.

If you do insist on overriding display_data, then you should at least allow fallback to what the super class would do if (filename and filename != 'index').

-John
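The fallback condition suggested above can be sketched in Python as follows. The classes here are stand-ins written for this illustration; they are not Galaxy's real datatype classes, and the simplified display_data signature (a dict of composite files plus an optional filename) is an assumption made to keep the example self-contained.

```python
class Data:
    """Stand-in for a base datatype; NOT Galaxy's real class or signature."""

    def display_data(self, dataset_files, filename=None):
        # Base behaviour: return the raw content of the requested file.
        return dataset_files[filename]


class BlastDb(Data):
    """Hypothetical datatype with a friendly placeholder for the eye icon."""

    def display_data(self, dataset_files, filename=None):
        if filename and filename != "index":
            # A specific composite file was requested (for example by the
            # test framework), so defer to the base class.
            return super().display_data(dataset_files, filename)
        # No specific file requested: show the human-readable placeholder.
        return "This is a BLAST protein database."


files = {"blastdb.phr": b"\x00\x01"}
print(BlastDb().display_data(files))
print(BlastDb().display_data(files, "blastdb.phr"))
```

With this shape, a request for a named composite file gets the real bytes, while a plain view of the dataset still gets the placeholder text.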

...

A more slick option would be to override display_data to run blastdbcmd live, but that means a run time dependency of the datatype definition on (a specific version of) the BLAST+ binaries - which could be problematic.

I had considered capturing the makeblastdb stdout to a file as blastdb.log and having that as the (human viewable) primary file of a composite datatype - but that would cause trouble if and when I manage to support user uploaded BLAST databases.

Thanks,

Peter

Peter Cock

4:19 p.m.

On Mon, Nov 18, 2013 at 4:02 PM, John Chilton <chilton@msi.umn.edu> wrote:

...

On Mon, Nov 18, 2013 at 9:55 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
On Mon, Nov 18, 2013 at 12:35 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
On Sun, Nov 17, 2013 at 8:30 AM, John Chilton wrote:

...
With these changes, I was able to write working functional tests for your tool using the template you outlined in the Trello card. ... I discovered no problems with auto_primary versus basic composite types here, just the things listed above.

Not for me though, even if I rename the masking "file" param to avoid the ambiguous "file" parameters ...

I find that changing the order of the <extra_files> tags in the test seems to alter the failure - which supports my hunch that something is scrambling the order of the extra files, so that it fails to compare the generated blastdb.phr with the provided four_human_proteins.fasta.phd

John's replies via Twitter:

https://twitter.com/jmchilton/status/402436500131807232

...
@pjacock Hard to argue with @travisci but your makeblastdb test works unmodified on my box and I can reorder the extras. I'm still looking..

https://twitter.com/jmchilton/status/402446372231581696

...
@pjacock Got it, you are overriding display_data in blast DB datatypes. This breaks much! Is how the test framework/API/etc access datasets.

There may be trouble ahead... the current display_data override is to give some meaningful output to the user when they click the "eye ball" icon for a BLAST database - rather than something unhelpful and scary like a blank page.

I didn't implement any of this composite dataset stuff so I am just guessing on best practices here, but is the right thing to do here to switch from 'basic' to 'auto_primary_file'? Just generate a file that includes only the content you want to display and set that as the primary file - I think this might be better practice than overriding display_data. If you want that file to be a log when one is available, that is fine; if files are uploaded, an optional log could be available, and you can provide a default fallback in generate_primary_file/regenerate_primary_file when it is unavailable. If you do insist on overriding display_data, then you should at least allow fallback to what the super class would do if (filename and filename != 'index').

I see, so my (dummy) primary file could just be a small text file saying "This is a BLAST protein database." or similar - and that would let me reproduce the current behaviour?

The only possible downside I can see right now is what happens to existing histories containing old BLAST databases created with the current code... but that will sort itself out eventually, with some user inconvenience.

Who is the composite file architect (for their thoughts)?

Regards,

Peter
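A minimal Python sketch of that dummy primary file idea follows. The method name generate_primary_file comes from the message quoted above, but the class, the log_text argument and the return value are assumptions for illustration; Galaxy's real method signature is not reproduced here.

```python
class BlastProteinDb:
    """Hypothetical stand-in; Galaxy's real datatype base classes are not used."""

    composite_type = "auto_primary_file"

    def generate_primary_file(self, log_text=None):
        """Return the text to store as the human-viewable primary file.

        Assumption: a makeblastdb log is used when one is available, with a
        fixed placeholder as the fallback (e.g. for an uploaded database).
        """
        if log_text:
            return log_text
        return "This is a BLAST protein database."


datatype = BlastProteinDb()
print(datatype.generate_primary_file())
print(datatype.generate_primary_file("New DB title: Just 4 human proteins"))
```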

Robert Baertsch

4 Jul 2013

8:49 p.m.

New subject: gzipped fastq reader

Dan,

Do these readers support gzip files?

reader = fastqVerboseErrorReader
reader = fastqReader

Do I have to define a special type in galaxy for gzipped files or will the fastq type be ok?

Ideally, I would like to keep my files zipped and not have galaxy unzip them, since they triple in size when unzipped.

I'm happy to do a push request if you don't support this but I want to make sure I'm in line with your roadmap.

I have written a simple tool to convert Illumina fastq to mapsplice fastq. Does that already exist somewhere?

-Robert

Peter Cock

8 Jul 2013

11:05 a.m.

New subject: gzipped fastq reader

On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...

Dan, Do these readers support gzip files?

reader = fastqVerboseErrorReader reader = fastqReader

Presumably you are writing a Python script using this library? The answer is a qualified yes. Instead of passing them a normal file handle using open("example.fastq") you instead use gzip.open("example.fastq") via import gzip.
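As a rough illustration of the gzip.open approach, the Python sketch below reads a gzipped FASTQ file in text mode. The read_fastq generator is a deliberately simple stand-in for Galaxy's fastqReader (whose API is not shown in this thread), it assumes strict four-line records, and the file name and record are made up for the example.

```python
import gzip
import os
import tempfile


def read_fastq(handle):
    """Yield (title, sequence, quality) from a four-line-per-record FASTQ handle."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield title[1:], sequence, quality


# Write a tiny gzipped FASTQ file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "example.fastq.gz")
with gzip.open(path, "wt") as out:
    out.write("@read1\nACGT\n+\nIIII\n")

# "rt" gives text lines, just as open() would for an uncompressed file.
with gzip.open(path, "rt") as handle:
    records = list(read_fastq(handle))
print(records)
```

The parser itself never needs to know the input was compressed; only the call that creates the handle changes.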

...

Do I have to define a special type in galaxy for gzipped files or will the fastq type be ok?

This needs a special file format - but you are not the first person to look at this, some groups have defined custom gzipped variants of the FASTQ formats within their own Galaxy instances. I've not done this but there should be some useful emails in the archive.

Note you'd also need to modify any tool definitions so that they can accept a gzipped FASTQ file.

...

Ideally, I would like to keep my files zipped and not have galaxy unzip them, since they triple in size when unzipped.

I'm happy to do a push request if you don't support this but I want to make sure I'm in line with your roadmap.

Personally I would like a more general system in Galaxy for potentially any file type to be held compressed in a range of formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions for things like BAM which are already compressed. This way naive tools would get the gzipped file uncompressed to a temporary folder before use (i.e. no change for the tool wrapper), but if a tool accepts a gzipped file it will get that (less disk IO and CPU usage, but requires updating tool wrappers).

That idea is quite ambitious though ;)
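The dispatch described above could look something like the following Python sketch. Nothing like this exists in Galaxy as far as this thread shows; path_for_tool, is_gzipped and the tool_accepts_gzip flag are hypothetical names, and only gzip is handled to keep the example short.

```python
import gzip
import os
import shutil
import tempfile


def is_gzipped(path):
    """Check for the two-byte gzip signature at the start of the file."""
    with open(path, "rb") as handle:
        return handle.read(2) == b"\x1f\x8b"


def path_for_tool(path, tool_accepts_gzip):
    """Return the path a tool should be given for a stored dataset."""
    if tool_accepts_gzip or not is_gzipped(path):
        return path  # hand over the stored file untouched
    # Naive tool: decompress to a temporary file and hand that over instead.
    fd, plain_path = tempfile.mkstemp(suffix=".dat")
    with gzip.open(path, "rb") as src, os.fdopen(fd, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return plain_path


# A made-up stored dataset, held compressed on disk.
stored = os.path.join(tempfile.mkdtemp(), "dataset_1.dat")
with gzip.open(stored, "wb") as out:
    out.write(b"@read1\nACGT\n+\nIIII\n")

aware = path_for_tool(stored, tool_accepts_gzip=True)
naive = path_for_tool(stored, tool_accepts_gzip=False)
print(aware == stored, is_gzipped(naive))
```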

...

I have written a simple tool to convert Illumina fastq to mapsplice fastq. Does that already exist somewhere?

I don't know. Peter

Robert Baertsch

9:24 p.m.

New subject: gzipped fastq reader

Peter and Dan,

I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some pythonic way of overriding open? You need to read the first four bytes of the file to see if it is compressed, call gzip.open inside the function, and pass back the handle.

For now, it would require a global sweep through the tools to change open() to galaxy_open(), but it is probably a good idea to have tool developers avoid calling open directly. You would have to have special handling if there are multiple files in the compressed archive but that support could be added later.

-Robert

import bz2
import gzip
import zipfile

def galaxy_open(filename, mode="r"):
    compressor = getCompressor(filename)
    if compressor is not None:
        return openCompressed(filename, mode, compressor)
    else:
        return open(filename, mode)

def getCompressor(filename):
    # Read the first four bytes and map them to an extension (or None).
    with open(filename, "rb") as handle:
        return getExtensionFromHdrSig(handle.read(4))

def openCompressed(filename, mode, ext):
    if ext == "gz":
        return gzip.open(filename, mode)
    elif ext == "bz2":
        return bz2.BZ2File(filename, mode)
    elif ext == "zip":
        return zipfile.ZipFile(filename, mode)
    else:
        raise ValueError("unsupported compression: " + ext)

def getExtensionFromHdrSig(first4bytes):
    # Python rendering of the signature checks in the C function below.
    signatures = ((b"\x1f\x8b", "gz"), (b"\x1f\x9d\x90", "Z"),
                  (b"BZ", "bz2"), (b"PK\x03\x04", "zip"))
    for sig, ext in signatures:
        if first4bytes.startswith(sig):
            return ext
    return None

char *getExtensionFromHdrSig(char *first4bytes)
/* Check if header has signature of supported compression stream,
   and return a phoney filename with extension for it, or NULL if no sig found. */
{
char buf[20];
char *ext=NULL;
if (startsWith("\x1f\x8b",first4bytes)) ext = "gz";
else if (startsWith("\x1f\x9d\x90",first4bytes)) ext = "Z";
else if (startsWith("BZ",first4bytes)) ext = "bz2";
else if (startsWith("PK\x03\x04",first4bytes)) ext = "zip";
if (ext==NULL) return NULL;
}

On Jul 8, 2013, at 4:05 AM, Peter Cock wrote:

...

On Thu, Jul 4, 2013 at 9:49 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...
Dan, Do these readers support gzip files?

reader = fastqVerboseErrorReader reader = fastqReader

Presumably you are writing a Python script using this library? The answer is a qualified yes. Instead of passing them a normal file handle using open("example.fastq") you instead use gzip.open("example.fastq") via import gzip.

...
Do I have to define a special type in galaxy for gzipped files or will the fastq type be ok?

This needs a special file format - but you are not the first person to look at this, some groups have defined custom gzipped variants of the FASTQ formats within their own Galaxy instances. I've not done this but there should be some useful emails in the archive.

Note you'd also need to modify any tool definitions so that they can accept a gzipped FASTQ file.

...
Ideally, I would like to keep my files zipped and not have galaxy unzip them, since they triple in size when unzipped.

I'm happy to do a push request if you don't support this but I want to make sure I'm in line with your roadmap.

Personally I would like a more general system in Galaxy for potentially any file type to be held compressed in a range of formats (e.g. using gzip, bgzf, xz, bz2, etc), with exclusions for things like BAM which are already compressed. This way naive tools would get the gzipped file uncompressed to a temporary folder before use (i.e. no change for the tool wrapper), but if a tool accepts a gzipped file it will get that (less disk IO and CPU usage, but requires updating tool wrappers).

That idea is quite ambitious though ;)

...
I have written a simple tool to convert Illumina fastq to mapsplice fastq. Does that already exist somewhere?

I don't know.

Peter

Peter Cock

9:58 p.m.

New subject: gzipped fastq reader

On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...

Peter and Dan, I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python tools which happen to use the Galaxy parsing libraries - which is a very small fraction of the tools in Galaxy. Most of the tools in Galaxy are compiled programs and are entirely separate.

Peter
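For Python tools, an explicit helper avoids monkey patching the built-in open. The sketch below is one possible shape for the proposed galaxy_open, written for this illustration only: it sniffs the leading bytes listed in the C snippet quoted in this thread, handles only gzip and bz2 (a zip archive is a container rather than a single stream, so it is left out), and defaults to text mode. sniff_compression is a hypothetical helper name.

```python
import bz2
import gzip
import os
import tempfile

# Leading bytes taken from the C snippet quoted in the thread.
SIGNATURES = ((b"\x1f\x8b", "gz"), (b"BZ", "bz2"))


def sniff_compression(filename):
    """Return "gz", "bz2" or None based on the first bytes of the file."""
    with open(filename, "rb") as handle:
        head = handle.read(4)
    for magic, ext in SIGNATURES:
        if head.startswith(magic):
            return ext
    return None


def galaxy_open(filename, mode="rt"):
    """Open a plain, gzipped or bzip2-compressed file for reading."""
    ext = sniff_compression(filename)
    if ext == "gz":
        return gzip.open(filename, mode)
    if ext == "bz2":
        return bz2.open(filename, mode)
    return open(filename, mode)


# Write the same line three ways, then read each back through one call.
tmpdir = tempfile.mkdtemp()
writers = {"plain.txt": open, "data.gz": gzip.open, "data.bz2": bz2.open}
for name, opener in writers.items():
    with opener(os.path.join(tmpdir, name), "wt") as out:
        out.write("@read1\n")
for name in writers:
    with galaxy_open(os.path.join(tmpdir, name)) as handle:
        print(name, handle.readline().strip())
```

Callers opt in by calling the helper, so code that uses the built-in open elsewhere is unaffected.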

James Taylor

10:20 p.m.

New subject: gzipped fastq reader

open_compressed in bx-python does this already (for bz2 as well). On Jul 8, 2013, at 5:58 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...

On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...
Peter and Dan, I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python tools which happen to use the Galaxy parsing libraries - which is a very small fraction of the tools in Galaxy. Most of the tools in Galaxy are compiled programs and are entirely separate.

Peter

Robert Baertsch

9 Jul 2013

4:58 p.m.

New subject: gzipped fastq reader

great. Let's put the bx-python calls in a galaxy_open helper function. On Jul 8, 2013, at 3:20 PM, James Taylor wrote:

...

open_compressed in bx-python does this already (for bz2 as well).

On Jul 8, 2013, at 5:58 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:

...
On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...
Peter and Dan, I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python tools which happen to use the Galaxy parsing libraries - which is a very small fraction of the tools in Galaxy. Most of the tools in Galaxy are compiled programs and are entirely separate.

Peter
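A minimal sketch of the kind of `galaxy_open` helper being proposed, assuming Python 3 and only the standard library. It does not call bx-python's `open_compressed`; instead it sniffs the leading bytes of the file, as suggested earlier in the thread. The function body, the `MAGIC` table, and the restriction to the modes `"rt"` and `"rb"` are assumptions made for illustration.

```python
import bz2
import gzip

# Leading "magic" bytes that identify each compressed format.
MAGIC = {
    b"\x1f\x8b": gzip.open,  # gzip (BGZF files start the same way)
    b"BZh": bz2.open,        # bzip2
}


def galaxy_open(path, mode="rt"):
    """Open path for reading, decompressing gzip or bzip2 transparently."""
    if mode not in ("rt", "rb"):
        raise ValueError("this sketch only supports reading: 'rt' or 'rb'")
    # Look at the first 4 bytes rather than trusting the file name.
    with open(path, "rb") as handle:
        head = handle.read(4)
    for magic, opener in MAGIC.items():
        if head.startswith(magic):
            return opener(path, mode)
    return open(path, mode)
```

Sniffing the content rather than the extension matters in a setting where stored files may not keep their original names. A tool written in Python would call `galaxy_open(path)` wherever it previously called `open(path)`; tools in other languages would need their own equivalent.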

Robert Baertsch

8 Jul 8 Jul

10:21 p.m.

New subject: gzipped fastq reader

I respectfully disagree. If you want an extensible system, you should always wrap primitive system-level calls.

Any tool that opens a file that could be compressed would be affected. That is a huge number of tools. Do you really want a cottage industry of tools that have different methods of dealing with compression?

Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype; it tells you nothing about the content stored in the file.

It is up to the Galaxy team to provide a standard way to interact with compressed files.

My proposed solution is a very small change that could be phased in over time. Any tool that still uses open() would not support compressed files, but it would not break on uncompressed files.

Do others have an opinion?

On Jul 8, 2013, at 2:58 PM, Peter Cock wrote:

...

On Mon, Jul 8, 2013 at 10:24 PM, Robert Baertsch <robert.baertsch@gmail.com> wrote:

...
Peter and Dan, I like the idea of replacing all open() with galaxy_open() in all tools. You can tell the format by looking at the first 4 bytes (see C code below from the UCSC browser team). Is there some pythonic way of overriding open?

There is monkey patching (replace the current 'open' function with your modified version), but that is not a good idea in general.

In any case, this would only affect the small number of Python tools which happen to use the Galaxy parsing libraries - which is a very small fraction of the tools in Galaxy. Most of the tools in Galaxy are compiled programs and are entirely separate.

Peter

Peter Cock

10:33 p.m.

New subject: gzipped fastq reader

On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:

...

I respectfully disagree. If you want an extensible system, you should always wrap primitive system-level calls.

Any tool that opens a file that could be compressed would be affected. That is a huge number of tools. Do you really want a cottage industry of tools that have different methods of dealing with compression?

But defining a Python helper function within the Galaxy Python libraries doesn't achieve that. Are you talking about patching the OS level POSIX open functions or something? The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.

...

Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype, it tells you nothing about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc). Galaxy tool wrappers currently define input files with a list of file types - they'd also have to give a list of supported compression types (defaulting to none). Likewise for any output files - if they are already compressed the XML for the tool wrapper would have to tell Galaxy this.

...

It is up to the galaxy team to provide a standard way to interact with compressed files.

That is my preference too - although this could be driven by the Galaxy community rather than the core team? I see defining new datatypes like 'gzippedfastq' as a stop gap special case (but a very practical route for now).

...

My proposed solution is a very small change that could be phased in over time. Any tool that still uses open() would not support compressed files, but it would not break on uncompressed files.

Do others have an opinion?

Either I don't understand your plan, or it would only help in a tiny minority of cases. Regards, Peter
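The dual system described above (file type plus a separate compression attribute, with tool wrappers declaring which compressions they accept) can be sketched as a small data model. This is an illustration only, assuming Python 3.7+; `Dataset`, `ToolInput`, and `plan_input` are hypothetical names and do not correspond to any existing Galaxy classes or tool XML attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    path: str
    datatype: str              # e.g. "fasta", "sam", "gff3"
    compression: str = "none"  # e.g. "none", "gzip", "bgzf", "bzip2"


@dataclass(frozen=True)
class ToolInput:
    formats: tuple
    # Default mirrors the thread: tools accept uncompressed data only
    # unless the wrapper says otherwise.
    compressions: tuple = ("none",)


def plan_input(dataset, tool_input):
    """Return "reject", "pass-through" or "decompress" for this pairing."""
    if dataset.datatype not in tool_input.formats:
        return "reject"
    if dataset.compression in tool_input.compressions:
        return "pass-through"
    return "decompress"
```

With this model, one `fastq` datatype covers every compression variant, and a wrapper that can read gzip directly simply lists `"gzip"` among its accepted compressions.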

Robert Baertsch

9 Jul 9 Jul

4:53 p.m.

New subject: gzipped fastq reader

On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:

...

On Mon, Jul 8, 2013 at 11:21 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:

...
I respectfully disagree. If you want an extensible system, you should always wrap primitive system-level calls.

Any tool that opens a file that could be compressed would be affected. That is a huge number of tools. Do you really want a cottage industry of tools that have different methods of dealing with compression?

But defining a Python helper function within the Galaxy Python libraries doesn't achieve that.

Are you talking about patching the OS level POSIX open functions or something?

...

The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.

No. The helper function would have to be ported to R. We are talking about how Galaxy handles compressed data. Once we decide that, we can determine how best to implement it.

Proposal: Do not treat compressed data as a separate datatype. Treat it as an independent attribute that can be applied to any data. Otherwise you will have to create a gzipped, zip and bz2 type for every type that you want to compress.

People can use the Python helpers or write their own in other languages. We need a galaxy_open function to hide the details of compression from tool developers. We could also open HTTP files or pipes without any changes to tools (other than changing open() to galaxy_open()).

...

...
Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype, it tells you nothing about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc).

What about tabular? Should we create tab.gz, tab.bz2 and tab.zip also? This will quickly get out of hand and create a mess for tool developers that need to support all these types. The tool code and tool XML should be written to handle uncompressed data, and Galaxy should handle the details of decompression. This is not hard to do.

...

Galaxy tool wrappers currently define input files with a list of file types - they'd also have to give a list of supported compression types (defaulting to none). Likewise for any output files - if they are already compressed the XML for the tool wrapper would have to tell Galaxy this.

...
It is up to the galaxy team to provide a standard way to interact with compressed files.

That is my preference too - although this could be driven by the Galaxy community rather than the core team? I see defining new datatypes like 'gzippedfastq' as a stop gap special case (but a very practical route for now).

...
My proposed solution is a very small change that could be phased in over time. Any tool that still uses open() would not support compressed files, but it would not break on uncompressed files.

Do others have an opinion?

Either I don't understand your plan, or it would only help in a tiny minority of cases.

Regards,

Peter

Peter Cock

5:38 p.m.

New subject: gzipped fastq reader

On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:

...

On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:

...
The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.

The helper function would have to be ported to R. We are talking about how Galaxy handles compressed data. Once we decide that, we can determine how best to implement it.

Individual tools called from Galaxy read and create the files - and we can't usually control them at this level (modifying them all to call a Galaxy managed file open mechanism is not an option).

...

Proposal: Do not treat compressed data as a separate datatype. Treat it as an independent attribute that can be applied to any data. Otherwise you will have to create a gzipped, zip and bz2 type for every type that you want to compress.

That's what I've been saying - the fact that some people are already using a new gzipped FASTQ format within their Galaxy instances is practical, but I view it as a short term solution only.

...

...
...
Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype, it tells you nothing about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc).

What about tabular. Should we create tab.gz, tab.bz2 and tab.zip also?

Note ZIP is a bit different, as it is often a multiple file bundle - it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that regard.

But otherwise, yes. As a specific example, the tabix tool uses BGZF compressed tabular data to combine compression and efficient random access. This would be useful for many annotation files (e.g. GTF, GFF3).

...

This will quickly get out of hand and create a mess for tool developers that need to support all these types.

Why? Individual tool developers don't need to know if Galaxy is keeping the original data file on disk compressed - unless the tool XML says otherwise, Galaxy would hide this detail and call the tool with an uncompressed input file.

(A Unix named pipe which decompresses the file on the fly would be a potential alternative - but only if the tool XML was marked up to say that an input could be streamed. The default must be to assume potential random access to the input files.)

...

The tool code and tool xml should be written to handle uncompressed data and galaxy should handle the details of decompression. This is not hard to do.

It isn't trivial either ;) Peter
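The named-pipe alternative mentioned above can be sketched as follows, assuming a POSIX system and Python 3. `stream_decompressed` is a hypothetical helper, not existing Galaxy code: it creates a FIFO and fills it from a background thread, so a tool handed the FIFO path reads uncompressed bytes without a decompressed copy ever being written to disk.

```python
import gzip
import os
import shutil
import threading


def stream_decompressed(gz_path, fifo_path):
    """Create a FIFO at fifo_path and feed it the decompressed bytes of gz_path."""
    os.mkfifo(fifo_path)  # POSIX only

    def feed():
        # Opening the FIFO for writing blocks until a reader opens it.
        with gzip.open(gz_path, "rb") as src, open(fifo_path, "wb") as dst:
            shutil.copyfileobj(src, dst)

    worker = threading.Thread(target=feed, daemon=True)
    worker.start()
    return worker
```

A FIFO cannot be seeked, which is why this only suits tools whose wrapper declares that the input may be streamed. A tool that needs random access would still require a real uncompressed file.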

Robert Baertsch

5:47 p.m.

New subject: gzipped fastq reader

I will implement this if the Galaxy team likes the approach. We did this in the UCSC genome browser code years ago: a single open_helper call handles gzip, http, ftp and pipes. No need to care about how the data is compressed or where the data resides.

Wouldn't it be great to be able to pipe data between workflow steps rather than writing to disk? I admit that this will require some work, but the first step is to abstract the open.

On Jul 9, 2013, at 10:38 AM, Peter Cock wrote:

...

On Tue, Jul 9, 2013 at 5:53 PM, Robert Baertsch <rbaertsc@ucsc.edu> wrote:

...
On Jul 8, 2013, at 3:33 PM, Peter Cock wrote:

...
The tools available in Galaxy are written in a range of languages including C, Perl, R, etc. Yes, some are in Python, but of those most are independent of Galaxy and can be used separately from Galaxy.

The helper function would have to be ported to R. We are talking about how Galaxy handles compressed data. Once we decide that, we can determine how best to implement it.

Individual tools called from Galaxy read and create the files - and we can't usually control them at this level (modifying them all to call a Galaxy managed file open mechanism is not an option).

...
Proposal: Do not treat compressed data as a separate datatype. Treat it as an independent attribute that can be applied to any data. Otherwise you will have to create a gzipped, zip and bz2 type for every type that you want to compress.

That's what I've been saying - the fact that some people are already using a new gzipped FASTQ format within their Galaxy instances is practical, but I view it as a short term solution only.

...
...
...
Encoding the gzip status in the datatype will create an explosion of datatypes. Compression is not actually a datatype, it tells you nothing about the content data that is stored in the file.

What we'd previously discussed was a dual system, holding the file type as now (e.g. FASTA, SAM, GFF3, etc) and any compression (e.g., None, normal GZIP, BGZF which is a GZIP variant, BZIP2, etc).

What about tabular. Should we create tab.gz, tab.bz2 and tab.zip also?

Note ZIP is a bit different, as it is often a multiple file bundle - it behaves differently from GZIP, BGZF, XZ, BZIP2 etc in that regard.

But otherwise, yes. As a specific example, the tabix tool uses BGZF compressed tabular data to combine compression and efficient random access. This would be useful for many annotation files (e.g. GTF, GFF3).

...
This will quickly get out of hand and create a mess for tool developers that need to support all these types.

Why? Individual tool developers don't need to know if Galaxy is keeping the original data file on disk compressed - unless the tool XML says otherwise, Galaxy would hide this detail and call the tool with an uncompressed input file.

(A Unix named pipe which decompresses the file on the fly would be a potential alternative - but only if the tool XML was marked up to say that an input could be streamed. The default must be to assume potential random access to the input files.)

...
The tool code and tool xml should be written to handle uncompressed data and galaxy should handle the details of decompression. This is not hard to do.

It isn't trivial either ;)

Peter
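To illustrate the "abstract the open" idea for data location rather than compression, here is a minimal sketch that dispatches on where the data lives. It assumes Python 3 and the standard library; `open_source` is a hypothetical name and this is not the UCSC `open_helper` code, which is not shown in the thread.

```python
import sys
from urllib.parse import urlparse
from urllib.request import urlopen


def open_source(location):
    """Return a binary file-like object for a path, a URL, or "-" (stdin)."""
    if location == "-":
        # Data piped in from a previous step.
        return sys.stdin.buffer
    scheme = urlparse(location).scheme
    if scheme in ("http", "https", "ftp"):
        # Remote data; urlopen returns a readable response object.
        return urlopen(location)
    # Anything else is treated as a local file path.
    return open(location, "rb")
```

Only the local and stdin branches give a seekable or reusable stream in general, so a caller that needs random access would still have to download remote data first.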

26 comments

7 participants

participants (7)

Daniel Blankenberg
Dave Bouvier
James Taylor
John Chilton
Peter Cock
Robert Baertsch
Robert Baertsch