Hi all,
I have just found a problem using the "Filter data on any column using simple expressions" tool, i.e. files tools/stats/filters.xml and tools/stats/filters.py
I have a six-column tabular file like this, where I have written \t for a tab and \n for a newline:
#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n
gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n
gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n
gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n
Breakdown of my data:
Column 1 - ID, mandatory string
Column 2 - HMM_Sprob_score, mandatory float
Column 3 - SP_len, mandatory integer
Column 4 - RXLR_start, optional integer
Column 5 - EER_start, optional integer
Column 6 - RXLR?, mandatory string (Y or N)
Notice that in my output columns 4 and 5 can be empty or an integer.
I'm trying to filter this file using c6=='Y', i.e. column six is a yes. This works (one row output) but Galaxy tells me:
Info: Filtering with c6=='Y', kept 100.00% of 4 lines. Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score SP_len RXLR_start EER_start RXLR?"
Then if I try to filter using c6=='N', i.e. column six is a no, it fails to work (zero rows of output instead of three) and tells me:
kept 0.00% of 4 lines. Skipped 3 invalid lines starting at line #1: "#ID HMM_Sprob_score SP_len RXLR_start EER_start RXLR?"
Digging into the code, tools/stats/filters.py gets given the list of column types from Galaxy and (regardless of which columns are to be used) attempts to cast them to integers, floats, etc.
It looks like Galaxy has decided that my columns 4 and 5 are integers (based on the first row), and therefore filters.py blindly tries to use int(...) on all these entries, which fails on the empty cells.
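A minimal sketch of this failure mode, not the actual filters.py code. The `column_types` list and the `cast_all` helper are assumptions made for illustration: they stand in for whatever Galaxy guessed from the first data row and for the per-line casting step.

```python
# Hypothetical stand-in for the per-line casting; NOT the real filters.py code.
# Assumed: the column types Galaxy guessed from the first data row.
column_types = [str, float, int, int, int, str]


def cast_all(line):
    """Cast every field with its guessed type, whether or not the filter uses it."""
    fields = line.rstrip("\r\n").split("\t")
    return [cast(value) for cast, value in zip(column_types, fields)]


rows = [
    "gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n",
    "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n",
]

results = []
for row in rows:
    try:
        cast_all(row)
        results.append("ok")
    except ValueError:
        # int("") raises ValueError, so the whole line is treated as invalid
        results.append("invalid")

print(results)
```

The second row is rejected before column 6 is ever looked at, which would match the "Skipped 3 invalid lines" message above.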
I see several issues:
(a) The filters.py tool only really needs to cast those columns being used for the filter (fairly easy to fix).
(b) The Galaxy column type detection seems a bit fragile (hard to really fix without looking at all the data).
(c) Are there other tools that would break in a similar way to filters.py?
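Point (a) could be sketched roughly as follows. This is an illustration under stated assumptions, not the fix that went into the tool: `columns_in_expression` and `keep_line` are hypothetical helpers, and `column_types` is again assumed to be a list of Python callables, one per column.

```python
import re


def columns_in_expression(expression):
    """Return the zero-based indices of the cN columns named in the expression."""
    # Note: this also matches cN inside a quoted string, which only means
    # one extra column gets cast; good enough for a sketch.
    return sorted({int(n) - 1 for n in re.findall(r"\bc(\d+)\b", expression)})


def keep_line(line, expression, column_types):
    """Evaluate the expression on one line, casting only the columns it uses."""
    fields = line.rstrip("\r\n").split("\t")
    namespace = {}
    for index in columns_in_expression(expression):
        namespace["c%d" % (index + 1)] = column_types[index](fields[index])
    # eval keeps the sketch short; a real tool must validate the expression first.
    return bool(eval(expression, {"__builtins__": {}}, namespace))


column_types = [str, float, int, int, int, str]  # assumed guess from the first row
line = "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n"
print(keep_line(line, "c6=='N'", column_types))
```

With this approach the empty cells in columns 4 and 5 are never passed to int(...) unless the filter expression actually mentions c4 or c5.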
Peter
On Thu, Apr 7, 2011 at 4:28 PM, Peter Cock p.j.a.cock@googlemail.com wrote:
[...]
Also: (d) This probably also explains why the filter tool doesn't like my header row (which starts with a #) since the captions are not numeric. Skipping these is probably a different bug fix though.
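One possible way to handle (d), sketched under assumptions: treat leading lines that start with # as header lines and set them aside before any casting happens. The `split_header_and_data` helper is hypothetical, and whether the tool should copy such lines to the output or drop them is a separate decision.

```python
def split_header_and_data(lines):
    """Separate leading '#' lines from data lines so headers are never cast."""
    header, data = [], []
    for line in lines:
        # Only '#' lines before the first data line count as header lines.
        if line.startswith("#") and not data:
            header.append(line)
        else:
            data.append(line)
    return header, data


lines = [
    "#ID\tHMM_Sprob_score\tSP_len\tRXLR_start\tEER_start\tRXLR?\n",
    "gi|301087619|ref|XP_002894699.1|\t0.990\t21\t54\t64\tY\n",
    "gi|301087623|ref|XP_002894700.1|\t0.997\t23\t\t\tN\n",
    "gi|301087628|ref|XP_002894701.1|\t0.000\t24\t\t\tN\n",
]
header, data = split_header_and_data(lines)
print(len(header), len(data))
```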
Peter
On Thu, Apr 7, 2011 at 7:00 PM, Peter Cock p.j.a.cock@googlemail.com wrote:
[...]
To address these issues with the filters.py tool I've filed the following bugs with fixes:
https://bitbucket.org/galaxy/galaxy-central/issue/535/
https://bitbucket.org/galaxy/galaxy-central/issue/536/
https://bitbucket.org/galaxy/galaxy-central/issue/537/
Peter
galaxy-dev@lists.galaxyproject.org