Twill compares wrong file; was: Twill comparisons ignore GALAXY_TEST_NO_CLEANUP
On Tue, Feb 11, 2014 at 3:41 PM, Dave Bouvier <dave@bx.psu.edu> wrote:
Peter,
Thanks for pointing this out, I've applied your changes in 12478:15fc8675064e on the default branch.
--Dave B.
Thanks Dave,

RE: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-February/018154.html
https://bitbucket.org/galaxy/galaxy-central/commits/15fc8675064ea46b7e081d96...

I'm still digging into the underlying problem, by focussing on just one unit test from
https://github.com/peterjc/pico_galaxy/blob/master/tools/mira4/mira4_de_novo...

Namely:

<test>
    <param name="job_type" value="genome" />
    <param name="job_quality" value="accurate" />
    <param name="type" value="none" />
    <param name="filenames" value="ecoli.fastq" ftype="fastqsanger" />
    <output name="out_fasta" file="ecoli.mira4_de_novo.fasta" ftype="fasta" />
</test>

After the test runs, we have five files, which is expected as I have one input and four outputs:

<outputs>
    <data name="out_fasta" format="fasta" label="MIRA de novo contigs (FASTA)" />
    <data name="out_bam" format="bam" label="MIRA de novo assembly (BAM)" />
    <data name="out_maf" format="mira" label="MIRA de novo assembly" />
    <data name="out_log" format="txt" label="MIRA de novo log" />
</outputs>

$ ls /tmp/tmp09TawY/tmpFOX9hS/database/files/000/
dataset_1.dat  dataset_2.dat  dataset_3.dat  dataset_4.dat  dataset_5.dat

dataset_1.dat --> test-data/ecoli.fastq
dataset_2.dat --> out_fasta
dataset_3.dat --> out_bam
dataset_4.dat --> out_maf
dataset_5.dat --> out_log

Of these, the test should only attempt to compare dataset_2.dat (out_fasta) to my sample output file test-data/ecoli.mira4_de_novo.fasta, which it matches:

$ diff test-data/ecoli.mira4_de_novo.fasta /tmp/tmp09TawY/tmpFOX9hS/database/files/000/dataset_2.dat

However, for reasons unknown, it is comparing dataset_5.dat (out_log) to the sample output file test-data/ecoli.mira4_de_novo.fasta instead.

A little more logging (e.g. [1]) shows the problem occurs when calling into the verify_dataset_correctness method (which should use hid=2):

base.interactor: DEBUG: About to call verify_dataset_correctness('ecoli.mira4_de_novo.fasta', '5', ...)
base.interactor: DEBUG: About to call verify_dataset_correctness(...) for output_data 'state': 'queued'
base.interactor: DEBUG: About to call verify_dataset_correctness(...) for output_data 'hid': '5'
base.interactor: DEBUG: About to call verify_dataset_correctness(...) for output_data 'id': '5'
base.interactor: DEBUG: About to call verify_dataset_correctness(...) for output_data 'dbkey': 'hg17'
base.interactor: DEBUG: About to call verify_dataset_correctness(...) for output_data 'name': 'MIRA de novo log'
base.twilltestcase: DEBUG: Verifying dataset correctness for 'ecoli.mira4_de_novo.fasta' (hid='5')

Now what is strange is: how can the call to get the hid for the current outfile know if it should be 2, 3, 4, or 5?

The bug appears to be in the method _verify_outputs in test/functional/test_toolbox.py - which was last touched by John...

output_data = data_list[ len(data_list) - len(testdef.outputs) + output_index ]

In this example, we have data_list[5 - 1 + 0] = data_list[4], meaning that since data_list is essentially [dataset_1, ..., dataset_5] we get dataset_5.

I believe the error is that rather than the number of defined test outputs (here 1), you should be subtracting the number of tool outputs (here 4), i.e. data_list[5 - 4 + 0] = data_list[1] = dataset_2.

As a workaround, I could probably provide expected data for all four output files...

Regards,

Peter

[1] Patch which adds more logging to test/base/interactor.py (similar debugging was added to other files to trace this):
$ hg diff test/base/interactor.py
diff -r f3dc213a5773 test/base/interactor.py
--- a/test/base/interactor.py	Mon Feb 10 22:13:35 2014 -0600
+++ b/test/base/interactor.py	Tue Feb 11 16:19:51 2014 +0000
@@ -309,6 +309,9 @@
 
     def verify_output( self, history, output_data, outfile, attributes, shed_tool_id, maxseconds ):
         hid = output_data.get( 'hid' )
+        log.debug("About to call verify_dataset_correctness(%r, %r, ...)" % (outfile, hid))
+        for key in output_data.keys():
+            log.debug("About to call verify_dataset_correctness(...) for output_data %r: %r" % (key, output_data.get(key)))
         self.twill_test_case.verify_dataset_correctness( outfile, hid=hid, attributes=attributes, shed_tool_id=shed_tool_id, maxseconds=maxseconds )
 
     def get_job_stream( self, history_id, output_data, stream ):
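To make the suspected off-by-N error concrete, here is a minimal Python sketch using the numbers from this example (the variable names are illustrative, not the actual test_toolbox.py code):

# Illustrative sketch of the indexing in _verify_outputs; names are made up.
data_list = ["dataset_1", "dataset_2", "dataset_3", "dataset_4", "dataset_5"]
num_test_outputs = 1  # <output> elements defined in the <test> block
num_tool_outputs = 4  # <data> elements defined in <outputs>
output_index = 0      # first (and only) tested output

# Current behaviour subtracts the number of *tested* outputs,
# which lands on dataset_5 (the log):
assert data_list[len(data_list) - num_test_outputs + output_index] == "dataset_5"

# Subtracting the number of *tool* outputs instead lands on
# dataset_2 (the FASTA), as expected:
assert data_list[len(data_list) - num_tool_outputs + output_index] == "dataset_2"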
On Tue, Feb 11, 2014 at 4:42 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Feb 11, 2014 at 3:41 PM, Dave Bouvier <dave@bx.psu.edu> wrote:
Peter,
Thanks for pointing this out, I've applied your changes in 12478:15fc8675064e on the default branch.
--Dave B.
Thanks Dave,
RE: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-February/018154.html https://bitbucket.org/galaxy/galaxy-central/commits/15fc8675064ea46b7e081d96...
I'm still digging into the underlying problem, by focussing on just one unit test ...
As a workaround, I could probably provide expected data for all four output files...
I've been talking to John (off list); he will reply here later, but in the meantime he pointed me at a relevant commit:
https://bitbucket.org/galaxy/galaxy-central/commits/d05be33ad6b772c2a688d875...

Basically the current Twill tests are extremely fragile: they expect the test output files in the same order used in the <outputs> definitions, and there must also be sample data for ALL the output files.

(I may have hit this bug before, I forget now.)

Regards,

Peter
On Tue, Feb 11, 2014 at 5:09 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Feb 11, 2014 at 4:42 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Feb 11, 2014 at 3:41 PM, Dave Bouvier <dave@bx.psu.edu> wrote:
Peter,
Thanks for pointing this out, I've applied your changes in 12478:15fc8675064e on the default branch.
--Dave B.
Thanks Dave,
RE: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-February/018154.html https://bitbucket.org/galaxy/galaxy-central/commits/15fc8675064ea46b7e081d96...
I'm still digging into the underlying problem, by focussing on just one unit test ...
As a workaround, I could probably provide expected data for all four output files...
Confirmed, providing a reference file for each output works:

<test>
    <param name="job_type" value="genome" />
    <param name="job_quality" value="accurate" />
    <param name="type" value="none" />
    <param name="filenames" value="ecoli.fastq" ftype="fastqsanger" />
    <output name="out_fasta" file="ecoli.mira4_de_novo.fasta" ftype="fasta" />
    <output name="out_bam" file="ecoli.mira4_de_novo.bam" ftype="bam" />
    <output name="out_maf" file="ecoli.mira4_de_novo.mira" ftype="mira" />
    <output name="out_log" file="ecoli.mira4_de_novo.log" ftype="txt" lines_diff="2000" />
</test>

However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up, so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
I've been talking to John (off list), he will reply here later, but in the meantime pointed me at a relevant commit, https://bitbucket.org/galaxy/galaxy-central/commits/d05be33ad6b772c2a688d875...
Basically the current Twill tests are extremely fragile: they expect the test output files in the same order used in the <outputs> definitions, and there must also be sample data for ALL the output files.
(I may have hit this bug before, I forget now).
I think it ought to be possible to refactor the Twill test code to use the same dictionary approach used in the API-based testing - and so support listing the test outputs in any order and only testing some of the files (see the sketch below).

However, I'll wait and see what you guys have to say.

Peter.
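As a rough illustration of the dictionary approach - a hypothetical sketch in Python, not actual Galaxy code - matching outputs by name rather than by position could look like this:

# Hypothetical sketch of name-based output matching; not actual Galaxy code.
# Map each tool output name to its history dataset, in creation order:
tool_outputs = {
    "out_fasta": "dataset_2",
    "out_bam": "dataset_3",
    "out_maf": "dataset_4",
    "out_log": "dataset_5",
}

# Only the outputs named in the <test> block get verified, in any order:
test_outputs = {"out_fasta": "ecoli.mira4_de_novo.fasta"}

for name, expected_file in test_outputs.items():
    dataset = tool_outputs[name]  # order-independent lookup by name
    print("Would compare %s against %s" % (dataset, expected_file))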
On Wednesday, 12 Feb 2014 at 10:46 +0000, Peter Cock wrote:
On Tue, Feb 11, 2014 at 5:09 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Feb 11, 2014 at 4:42 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Tue, Feb 11, 2014 at 3:41 PM, Dave Bouvier <dave@bx.psu.edu> wrote:
Peter,
Thanks for pointing this out, I've applied your changes in 12478:15fc8675064e on the default branch.
--Dave B.
Thanks Dave,
RE: http://lists.bx.psu.edu/pipermail/galaxy-dev/2014-February/018154.html https://bitbucket.org/galaxy/galaxy-central/commits/15fc8675064ea46b7e081d96...
I'm still digging into the underlying problem, by focussing on just one unit test ...
As a workaround, I could probably provide expected data for all four output files...
Confirmed, providing a reference file for each output works:
<test>
    <param name="job_type" value="genome" />
    <param name="job_quality" value="accurate" />
    <param name="type" value="none" />
    <param name="filenames" value="ecoli.fastq" ftype="fastqsanger" />
    <output name="out_fasta" file="ecoli.mira4_de_novo.fasta" ftype="fasta" />
    <output name="out_bam" file="ecoli.mira4_de_novo.bam" ftype="bam" />
    <output name="out_maf" file="ecoli.mira4_de_novo.mira" ftype="mira" />
    <output name="out_log" file="ecoli.mira4_de_novo.log" ftype="txt" lines_diff="2000" />
</test>
However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
Do you have any advice on what file size is feasible, and at which point we should skip the tests? I'm currently working on MACS2 wrappers, and to have reasonable tests I need up to 50mb of test files. Providing gz input files would be one option to get it down to 10mb, but that does not work - and it's still 10mb ...
I've been talking to John (off list), he will reply here later, but in the meantime pointed me at a relevant commit, https://bitbucket.org/galaxy/galaxy-central/commits/d05be33ad6b772c2a688d875...
Basically the current Twill tests are extremely fragile: they expect the test output files in the same order used in the <outputs> definitions, and there must also be sample data for ALL the output files.
(I may have hit this bug before, I forget now).
I think it ought to be possible to refactor the Twill test code to use the same dictionary approach used in the API-based testing - and so support listing the test outputs in any order and only testing some of the files.
However, I'll wait and see what you guys have to say.
Peter.
On Thu, Feb 13, 2014 at 7:51 AM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
On Wednesday, 12 Feb 2014 at 10:46 +0000, Peter Cock wrote:
...
However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
Do you have any advice on what file size is feasible, and at which point we should skip the tests? I'm currently working on MACS2 wrappers, and to have reasonable tests I need up to 50mb of test files.
I try to use small files under say 10kb if possible - partly as smaller unit tests are often easier to diagnose when they break, but also as noted above to reduce overhead for version control, and the ToolShed upload/download time. 50mb does sound excessive, but if that's the minimum maybe it is still better than no test at all?
Providing gz input files would be one option to get it down to 10mb, but that does not work - and it's still 10mb ...
I thought the test framework would allow you to provide gzipped input, since it uses the upload tool internally, which would handle this. If that breaks, I'd consider it a bug.

Gzipped output files might be more difficult, since the validation is based on direct file comparison.

Peter
Hi all,

I just wanted to point out that there are several different ways to compare the history output item to a test file beyond the default “diff”, including contains, re_match, sim_size, etc. This is set by the “compare” attribute in the xml tag for the test output. If you really don’t care about the contents of an output, you could, for example, use contains against an empty file.

Sorry that I didn’t provide more specific examples, as I am at a conference this week.

Thanks for using Galaxy,

Dan

On Feb 13, 2014, at 5:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Feb 13, 2014 at 7:51 AM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
On Wednesday, 12 Feb 2014 at 10:46 +0000, Peter Cock wrote:
...
However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
Do you have any advice on what file size is feasible, and at which point we should skip the tests? I'm currently working on MACS2 wrappers, and to have reasonable tests I need up to 50mb of test files.
I try to use small files under say 10kb if possible - partly as smaller unit tests are often easier to diagnose when they break, but also as noted above to reduce overhead for version control, and the ToolShed upload/download time.
50mb does sound excessive, but if that's the minimum maybe it is still better than no test at all?
Providing gz input files would be one option to get it down to 10mb, but that does not work - and it's still 10mb ...
I thought the test framework would allow you to provide gzipped input since it uses the upload tool internally which would handle this. If that breaks I'd consider it a bug.
Gzipped output files might be more difficult, since the validation is based on direct file comparison.
Peter
On Thu, Feb 13, 2014 at 1:58 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
Hi all,
I just wanted to point out that there are several different ways to compare the history output item to a test file beyond the default "diff", including contains, re_match, sim_size, etc. This is set by the "compare" attribute in the xml tag for the test output. If you really don't care about the contents of an output, you could, for example, use contains against an empty file.
Sorry that I didn't provide more specific examples, as I am at a conference this week.
Thanks Dan - I don't think I'd seen the "contains" comparison before, and it isn't on the wiki yet:
https://wiki.galaxyproject.org/Admin/Tools/Writing%20Tests

Comparing against an empty file with "contains" is a rather sneaky hack - I like it:

<test>
    <param name="job_type" value="genome" />
    <param name="job_quality" value="accurate" />
    <param name="type" value="none" />
    <param name="filenames" value="ecoli.fastq" ftype="fastqsanger" />
    <output name="out_fasta" file="ecoli.mira4_de_novo.fasta" ftype="fasta" />
    <output name="out_bam" file="empty_file.dat" compare="contains" />
    <output name="out_maf" file="empty_file.dat" compare="contains" />
    <output name="out_log" file="empty_file.dat" compare="contains" />
</test>

It will do in the short term until the test framework supports omitting some of the output test files... or I could switch to providing a snippet of data from each file.

Thanks,

Peter
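For anyone unsure what compare="contains" checks: conceptually, the lines of the reference file must all appear in the output, so an empty reference file passes trivially. A rough Python sketch of that idea (an illustration only, not Galaxy's actual comparison code):

# Rough sketch of a line-based "contains" check; not Galaxy's actual code.
# Every line of the reference file must appear somewhere in the output,
# so an empty reference file trivially passes.
def contains(output_path, reference_path):
    with open(output_path) as handle:
        output_lines = set(handle.read().splitlines())
    with open(reference_path) as handle:
        return all(line in output_lines
                   for line in handle.read().splitlines())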
On Thursday, 13 Feb 2014 at 14:23 +0000, Peter Cock wrote:
On Thu, Feb 13, 2014 at 1:58 PM, Daniel Blankenberg <dan@bx.psu.edu> wrote:
Hi all,
I just wanted to point out that there are several different ways to compare the history output item to a test file beyond the default "diff", including contains, re_match, sim_size, etc. This is set by the "compare" attribute in the xml tag for the test output. If you really don't care about the contents of an output, you could, for example, use contains against an empty file.
Sorry that I didn't provide more specific examples, as I am at a conference this week.
Thanks Dan - I don't think I'd seen the "contains" comparison before, and it isn't on the wiki yet:
It's now added.
Comparing against an empty file with "contains" is a rather sneaky hack - I like it:
<test>
    <param name="job_type" value="genome" />
    <param name="job_quality" value="accurate" />
    <param name="type" value="none" />
    <param name="filenames" value="ecoli.fastq" ftype="fastqsanger" />
    <output name="out_fasta" file="ecoli.mira4_de_novo.fasta" ftype="fasta" />
    <output name="out_bam" file="empty_file.dat" compare="contains" />
    <output name="out_maf" file="empty_file.dat" compare="contains" />
    <output name="out_log" file="empty_file.dat" compare="contains" />
</test>
It will do in the short term until the test framework supports omitting some of the output test files... or I could switch to providing a snippet of data from each file.
Thanks,
Peter
Hi Dan,
Hi all,
I just wanted to point out that there are several different ways to compare the history output item to a test file beyond the default “diff”, including contains, re_match, sim_size, etc. This is set by the “compare” attribute in the xml tag for the test output. If you really don’t care about the contents of an output, you could, for example, use contains against an empty file.
That is indeed a nice trick! It is potentially a way to compare only the first 1000 lines, assuming that the rest is identical (see the sketch below).

Unfortunately, I'm not able to complete the MACS2 tests, because I was not able to get conditionals of conditionals (i.e. nested conditionals) working. Is that a known bug?

If I understood Greg correctly, we should not invest much more time in the twill testing, and are waiting for the new API-based tests? Is that correct?

Thanks,
Bjoern
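For example, a truncated reference file for compare="contains" could be generated from a known-good full output along these lines (a sketch in Python; the file names are hypothetical):

# Sketch: keep only the first 1000 lines of a known-good output as the
# reference file for compare="contains"; the file names are hypothetical.
with open("macs2_full_output.tabular") as full_output:
    with open("test-data/macs2_peaks_head.tabular", "w") as snippet:
        for line_number, line in enumerate(full_output):
            if line_number >= 1000:
                break
            snippet.write(line)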
Sorry that I didn’t provide more specific examples, as I am at a conference this week.
Thanks for using Galaxy,
Dan
On Feb 13, 2014, at 5:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Feb 13, 2014 at 7:51 AM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
On Wednesday, 12 Feb 2014 at 10:46 +0000, Peter Cock wrote:
...
However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
Do you have any advice on what file size is feasible, and at which point we should skip the tests? I'm currently working on MACS2 wrappers, and to have reasonable tests I need up to 50mb of test files.
I try to use small files under say 10kb if possible - partly as smaller unit tests are often easier to diagnose when they break, but also as noted above to reduce overhead for version control, and the ToolShed upload/download time.
50mb does sound excessive, but if that's the minimum maybe it is still better than no test at all?
Providing gz input files would be one option to get it down to 10mb, but that does not work - and it's still 10mb ...
I thought the test framework would allow you to provide gzipped input since it uses the upload tool internally which would handle this. If that breaks I'd consider it a bug.
Gzipped output files might be more difficult, since the validation is based on direct file comparison.
Peter
Hi Björn,

I think twill is going to be around for a while yet, but it is quickly becoming more critical to eliminate it. I haven't yet looked at John's tests using the API, but my understanding is that they are focused on tools. We're using twill a lot in other areas besides tools, so it will require some work to eliminate it completely.

Some enhancements to our twill framework will undoubtedly be necessary for some time. However, non-trivial enhancements should be thought about carefully.

Thanks!

Greg Von Kuster

On Feb 13, 2014, at 5:46 PM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
Hi Dan,
Hi all,
I just wanted to point out that there are several different ways to compare the history output item to a test file beyond the default “diff”, including contains, re_match, sim_size, etc. This is set by the “compare” attribute in the xml tag for the test output. If you really don’t care about the contents of an output, you could, for example, use contains against an empty file.
That is indeed a nice trick! It is potentially a way to compare only the first 1000 lines, assuming that the rest is identical. Unfortunately, I'm not able to complete the MACS2 tests, because I was not able to get conditionals of conditionals (i.e. nested conditionals) working. Is that a known bug? If I understood Greg correctly, we should not invest much more time in the twill testing, and are waiting for the new API-based tests? Is that correct?
Thanks, Bjoern
Sorry that I didn’t provide more specific examples, as I am at a conference this week.
Thanks for using Galaxy,
Dan
On Feb 13, 2014, at 5:06 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
On Thu, Feb 13, 2014 at 7:51 AM, Björn Grüning <bjoern.gruening@gmail.com> wrote:
On Wednesday, 12 Feb 2014 at 10:46 +0000, Peter Cock wrote:
...
However, at only 5kb the FASTA file is small and fine to bundle - but the BAM (49kb), log (123kb) and MAF (764kb) quickly add up so I would prefer not to have to include these (under source code control, and for the Tool Shed upload). Also, as you might guess from the large number of allowed differences, the log file is quite variable (lots of time stamps plus Galaxy's temp filenames appear).
Do you have any advice on what file size is feasible, and at which point we should skip the tests? I'm currently working on MACS2 wrappers, and to have reasonable tests I need up to 50mb of test files.
I try to use small files under say 10kb if possible - partly as smaller unit tests are often easier to diagnose when they break, but also as noted above to reduce overhead for version control, and the ToolShed upload/download time.
50mb does sound excessive, but if that's the minimum maybe it is still better than no test at all?
Providing gz input files would be one option to get it down to 10mb, but that does not work - and it's still 10mb ...
I thought the test framework would allow you to provide gzipped input since it uses the upload tool internally which would handle this. If that breaks I'd consider it a bug.
Gzipped output files might be more difficult, since the validation is based on direct file comparison.
Peter
Hello Peter,

I'm also sure we could enhance the twill code to use the dictionary approach used in the API-based testing. However, pervasive use of Javascript is being introduced into the Galaxy code base at a fast pace, and twill does not work with Javascript. So clearly the use of twill for functional testing in Galaxy will have to be eliminated, although I'm not aware of any formal plans to do so.

All of the tool shed functional tests and the tool shed's install and test framework use twill as well, so I'll be working to eliminate its use in those frameworks as soon as possible. John has been working on some pieces of the puzzle for eliminating twill (e.g., the API-based testing), but I assume there is significant work left to completely eliminate the use of twill.

So perhaps we should plan to continue to add enhancements like this to the twill framework, even though we'll soon have to eliminate it due to Javascript. I'm not sure if others agree with this though.

Greg Von Kuster

On Feb 12, 2014, at 5:46 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
I think it ought to be possible to refactor the Twill test code to use the same dictionary approach used in the API-based testing - and so support listing the test outputs in any order and only testing some of the files.
However, I'll wait and see what you guys have to say.
Peter.