[galaxy-dev] [hg] galaxy 3358: Initial version of DNA code filter tool

9 Feb 2010

details:   http://www.bx.psu.edu/hg/galaxy/rev/b36c13131ac7
changeset: 3358:b36c13131ac7
user:      Kelly Vincent <kpvincent@bx.psu.edu>
date:      Mon Feb 08 23:52:56 2010 -0500
description:
Initial version of DNA code filter tool

diffstat:

 test-data/dna_filter_in1.bed  |   49 ++++++++++
 test-data/dna_filter_out1.bed |    4 +
 test-data/dna_filter_out2.bed |   39 ++++++++
 test-data/dna_filter_out3.bed |   41 ++++++++
 test-data/dna_filter_out4.bed |   24 +++++
 tool_conf.xml.sample          |    1 +
 tools/stats/dna_filtering.py  |  195 ++++++++++++++++++++++++++++++++++++++++++
 tools/stats/dna_filtering.xml |  114 ++++++++++++++++++++++++
 8 files changed, 467 insertions(+), 0 deletions(-)

diffs (506 lines):

diff -r dedb7be9aa44 -r b36c13131ac7 test-data/dna_filter_in1.bed

--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/dna_filter_in1.bed	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,49 @@
+chr1	256	257	A	N	M	N	-	M	N	U	N	N	A	N	D	N	G	N	N	K	N	N	N
+chr1	468	469	C	C	C	N	M	N	N	K	.	N	C	U	N	H	N	G	N	N	M	N	S
+chr1	582	583	G	G	G	N	G	R	N	R	N	-	N	M	N	V	K	N	N	N	G	C	R
+chr1	602	603	G	G	G	N	G	N	Y	N	R	G	G	N	N	U	N	T	N	A	K	N	R
+chr1	4792	4793	A	A	M	K	N	W	S	S	N	N	Y	N	N	N	N	N	M	R	N	R	N
+chr1	6119	6120	G	G	M	N	S	N	N	W	B	N	S	D	N	N	H	V	N	B	W	N	N
+chr1	6357	6358	G	G	N	M	K	N	G	-	N	N	G	U	N	N	N	B	N	N	K	N	S
+chr1	6433	6434	G	G	N	R	N	N	C	N	N	N	.	N	N	.	N	N	N	N	N	R	N
+chr1	39160	39161	T	T	T	N	N	Y	N	-	N	N	N	N	N	N	N	V	N	N	N	N	Y
+chr1	41920	41921	G	C	G	N	M	C	G	N	A	N	G	N	K	N	W	S	N	N	N	V	N
+chr1	42100	42101	T	T	T	Y	R	W	N	N	N	V	N	M	R	N	N	G	N	M	Y	N	K
+chr1	45026	45027	C	A	C	N	N	Y	N	S	Y	N	N	X	N	A	D	N	N	K	N	N	A
+chr1	45161	45162	C	T	C	.	N	X	H	V	N	N	C	R	N	Y	N	N	N	N	R	N	Y
+chr2	45407	45408	C	N	C	S	B	N	N	N	N	N	C	N	Y	N	N	T	K	G	N	C	N
+chr2	45788	45789	T	T	T	N	W	S	N	Y	N	R	Y	N	S	N	W	M	N	C	T	N	C
+chr2	46243	46244	T	T	T	N	W	N	N	B	V	N	U	N	T	N	N	Y	C	N	U	N	N
+chr2	47814	47815	A	C	A	S	N	X	D	N	N	H	W	N	G	N	Y	C	N	N	M	R	N
+chr2	48073	48074	A	G	A	Y	W	.	N	K	N	N	N	G	N	N	N	G	N	N	N	Y	N
+chr2	48633	48634	T	T	T	N	G	N	N	N	.	N	N	N	N	S	N	Y	N	.	N	N	N
+chr2	51304	51305	A	G	N	N	C	N	W	-	N	S	Y	N	.	N	N	G	N	N	N	W	R
+chr2	51324	51325	T	T	N	R	N	N	N	N	N	-	N	U	N	W	A	N	N	N	N	N	N
+chr2	52065	52066	T	C	T	N	N	N	S	N	.	N	T	N	M	N	S	W	N	T	Y	C	N
+chr2	53130	53131	T	C	T	K	R	.	N	B	N	N	T	N	N	M	N	Y	N	N	Y	N	N
+chr2	53505	53506	A	A	A	M	N	N	Y	N	N	N	N	-	K	N	W	N	N	N	S	N	R
+chr2	53559	53560	T	T	T	N	N	V	R	V	N	N	T	N	U	N	N	B	N	M	N	V	Y
+chr2	55607	55608	A	N	A	U	S	N	N	H	R	K	N	N	N	Y	N	N	G	N	N	N	N
+chr10	55659	55660	T	N	T	C	N	K	N	N	N	U	N	S	N	N	N	V	C	R	S	N	N
+chr10	55734	55735	T	N	T	G	N	C	N	M	M	G	C	N	B	N	.	N	G	N	N	N	N
+chr10	55870	55871	C	G	C	N	H	G	-	N	N	N	C	N	H	K	N	M	G	N	N	N	N
+chr10	56024	56025	A	T	A	N	D	U	N	Y	B	N	N	X	N	N	Y	N	T	N	-	N	N
+chr10	56100	56101	T	T	A	W	N	N	W	N	S	N	K	M	N	R	N	R	N	R	N	G	N
+chr10	56120	56121	A	-	A	N	A	N	N	Y	N	N	N	W	V	N	N	Y	G	N	N	W	N
+chr10	56137	56138	A	A	A	N	A	Y	H	.	Y	N	G	N	.	D	N	N	T	N	N	N	N
+chr10	56174	56175	A	T	A	Y	A	N	N	N	N	N	N	N	N	N	.	S	T	Y	N	B	N
+chr10	59373	59374	A	G	A	N	N	N	N	N	N	T	N	S	N	N	N	G	N	N	N	V	N
+chr10	68912	68913	G	T	G	R	N	B	R	N	H	N	U	W	Y	N	N	N	N	N	N	N	T
+chr10	72946	72947	T	A	N	N	N	N	N	N	B	N	N	.	B	D	W	U	N	U	N	D	A
+chr10	77052	77053	G	A	R	N	G	N	N	Y	N	N	N	N	N	N	B	R	N	W	N	N	R
+chr18	78200	78201	G	G	G	N	N	H	N	N	V	N	G	N	N	N	N	A	A	N	K	X	N
+chr18	81076	81077	T	A	T	B	N	N	G	N	N	X	W	N	X	N	V	N	N	D	N	N	N
+chr18	81198	81199	A	T	A	N	N	N	N	-	N	N	X	N	K	T	N	M	N	K	X	N	W
+chr18	81216	81217	G	A	G	Y	N	N	D	N	X	N	N	N	N	A	N	S	N	N	N	D	N
+chr18	81398	81399	G	T	G	N	-	W	N	N	M	N	G	C	N	K	N	S	N	N	N	N	K
+chr18	91548	91549	A	A	A	S	N	X	H	S	R	N	A	K	N	N	N	N	U	A	R	N	N
+chr18	93895	93896	T	T	T	H	N	N	V	W	Y	N	N	N	-	N	N	N	N	N	N	Y	N
+chr18	98172	98173	T	T	T	N	.	N	N	N	S	N	T	N	Y	N	N	Y	X	D	V	N	Y
+chr18	110904	110905	T	-	A	A	N	A	N	A	W	A	N	N	A	X	N	W	N	N	N	N	N
+chr18	140324	140325	A	A	A	N	M	N	N	Y	N	S	N	V	N	N	X	N	C	N	N	.	M
+chr18	160592	160593	C	G	G	G	N	G	N	G	N	G	N	N	G	N	N	M	T	N	Y	N	N
\ No newline at end of file
diff -r dedb7be9aa44 -r b36c13131ac7 test-data/dna_filter_out1.bed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/dna_filter_out1.bed	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,4 @@
+chr1	582	583	G	G	G	N	G	R	N	R	N	-	N	M	N	V	K	N	N	N	G	C	R
+chr1	602	603	G	G	G	N	G	N	Y	N	R	G	G	N	N	U	N	T	N	A	K	N	R
+chr2	48633	48634	T	T	T	N	G	N	N	N	.	N	N	N	N	S	N	Y	N	.	N	N	N
+chr10	77052	77053	G	A	R	N	G	N	N	Y	N	N	N	N	N	N	B	R	N	W	N	N	R
diff -r dedb7be9aa44 -r b36c13131ac7 test-data/dna_filter_out2.bed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/dna_filter_out2.bed	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,39 @@
+chr1	256	257	A	N	M	N	-	M	N	U	N	N	A	N	D	N	G	N	N	K	N	N	N
+chr1	602	603	G	G	G	N	G	N	Y	N	R	G	G	N	N	U	N	T	N	A	K	N	R
+chr1	4792	4793	A	A	M	K	N	W	S	S	N	N	Y	N	N	N	N	N	M	R	N	R	N
+chr1	6119	6120	G	G	M	N	S	N	N	W	B	N	S	D	N	N	H	V	N	B	W	N	N
+chr1	6357	6358	G	G	N	M	K	N	G	-	N	N	G	U	N	N	N	B	N	N	K	N	S
+chr1	6433	6434	G	G	N	R	N	N	C	N	N	N	.	N	N	.	N	N	N	N	N	R	N
+chr1	39160	39161	T	T	T	N	N	Y	N	-	N	N	N	N	N	N	N	V	N	N	N	N	Y
+chr1	41920	41921	G	C	G	N	M	C	G	N	A	N	G	N	K	N	W	S	N	N	N	V	N
+chr1	42100	42101	T	T	T	Y	R	W	N	N	N	V	N	M	R	N	N	G	N	M	Y	N	K
+chr2	45788	45789	T	T	T	N	W	S	N	Y	N	R	Y	N	S	N	W	M	N	C	T	N	C
+chr2	46243	46244	T	T	T	N	W	N	N	B	V	N	U	N	T	N	N	Y	C	N	U	N	N
+chr2	47814	47815	A	C	A	S	N	X	D	N	N	H	W	N	G	N	Y	C	N	N	M	R	N
+chr2	48633	48634	T	T	T	N	G	N	N	N	.	N	N	N	N	S	N	Y	N	.	N	N	N
+chr2	51304	51305	A	G	N	N	C	N	W	-	N	S	Y	N	.	N	N	G	N	N	N	W	R
+chr2	51324	51325	T	T	N	R	N	N	N	N	N	-	N	U	N	W	A	N	N	N	N	N	N
+chr2	53130	53131	T	C	T	K	R	.	N	B	N	N	T	N	N	M	N	Y	N	N	Y	N	N
+chr2	53505	53506	A	A	A	M	N	N	Y	N	N	N	N	-	K	N	W	N	N	N	S	N	R
+chr2	53559	53560	T	T	T	N	N	V	R	V	N	N	T	N	U	N	N	B	N	M	N	V	Y
+chr2	55607	55608	A	N	A	U	S	N	N	H	R	K	N	N	N	Y	N	N	G	N	N	N	N
+chr10	55659	55660	T	N	T	C	N	K	N	N	N	U	N	S	N	N	N	V	C	R	S	N	N
+chr10	55734	55735	T	N	T	G	N	C	N	M	M	G	C	N	B	N	.	N	G	N	N	N	N
+chr10	56024	56025	A	T	A	N	D	U	N	Y	B	N	N	X	N	N	Y	N	T	N	-	N	N
+chr10	56100	56101	T	T	A	W	N	N	W	N	S	N	K	M	N	R	N	R	N	R	N	G	N
+chr10	56120	56121	A	-	A	N	A	N	N	Y	N	N	N	W	V	N	N	Y	G	N	N	W	N
+chr10	56137	56138	A	A	A	N	A	Y	H	.	Y	N	G	N	.	D	N	N	T	N	N	N	N
+chr10	56174	56175	A	T	A	Y	A	N	N	N	N	N	N	N	N	N	.	S	T	Y	N	B	N
+chr10	59373	59374	A	G	A	N	N	N	N	N	N	T	N	S	N	N	N	G	N	N	N	V	N
+chr10	68912	68913	G	T	G	R	N	B	R	N	H	N	U	W	Y	N	N	N	N	N	N	N	T
+chr10	72946	72947	T	A	N	N	N	N	N	N	B	N	N	.	B	D	W	U	N	U	N	D	A
+chr10	77052	77053	G	A	R	N	G	N	N	Y	N	N	N	N	N	N	B	R	N	W	N	N	R
+chr18	78200	78201	G	G	G	N	N	H	N	N	V	N	G	N	N	N	N	A	A	N	K	X	N
+chr18	81076	81077	T	A	T	B	N	N	G	N	N	X	W	N	X	N	V	N	N	D	N	N	N
+chr18	81198	81199	A	T	A	N	N	N	N	-	N	N	X	N	K	T	N	M	N	K	X	N	W
+chr18	81216	81217	G	A	G	Y	N	N	D	N	X	N	N	N	N	A	N	S	N	N	N	D	N
+chr18	81398	81399	G	T	G	N	-	W	N	N	M	N	G	C	N	K	N	S	N	N	N	N	K
+chr18	91548	91549	A	A	A	S	N	X	H	S	R	N	A	K	N	N	N	N	U	A	R	N	N
+chr18	98172	98173	T	T	T	N	.	N	N	N	S	N	T	N	Y	N	N	Y	X	D	V	N	Y
+chr18	110904	110905	T	-	A	A	N	A	N	A	W	A	N	N	A	X	N	W	N	N	N	N	N
+chr18	160592	160593	C	G	G	G	N	G	N	G	N	G	N	N	G	N	N	M	T	N	Y	N	N
diff -r dedb7be9aa44 -r b36c13131ac7 test-data/dna_filter_out3.bed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/dna_filter_out3.bed	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,41 @@
+chr1	468	469	C	C	C	N	M	N	N	K	.	N	C	U	N	H	N	G	N	N	M	N	S
+chr1	582	583	G	G	G	N	G	R	N	R	N	-	N	M	N	V	K	N	N	N	G	C	R
+chr1	602	603	G	G	G	N	G	N	Y	N	R	G	G	N	N	U	N	T	N	A	K	N	R
+chr1	6119	6120	G	G	M	N	S	N	N	W	B	N	S	D	N	N	H	V	N	B	W	N	N
+chr1	6357	6358	G	G	N	M	K	N	G	-	N	N	G	U	N	N	N	B	N	N	K	N	S
+chr1	6433	6434	G	G	N	R	N	N	C	N	N	N	.	N	N	.	N	N	N	N	N	R	N
+chr1	39160	39161	T	T	T	N	N	Y	N	-	N	N	N	N	N	N	N	V	N	N	N	N	Y
+chr1	41920	41921	G	C	G	N	M	C	G	N	A	N	G	N	K	N	W	S	N	N	N	V	N
+chr1	42100	42101	T	T	T	Y	R	W	N	N	N	V	N	M	R	N	N	G	N	M	Y	N	K
+chr1	45026	45027	C	A	C	N	N	Y	N	S	Y	N	N	X	N	A	D	N	N	K	N	N	A
+chr1	45161	45162	C	T	C	.	N	X	H	V	N	N	C	R	N	Y	N	N	N	N	R	N	Y
+chr2	45407	45408	C	N	C	S	B	N	N	N	N	N	C	N	Y	N	N	T	K	G	N	C	N
+chr2	45788	45789	T	T	T	N	W	S	N	Y	N	R	Y	N	S	N	W	M	N	C	T	N	C
+chr2	46243	46244	T	T	T	N	W	N	N	B	V	N	U	N	T	N	N	Y	C	N	U	N	N
+chr2	48073	48074	A	G	A	Y	W	.	N	K	N	N	N	G	N	N	N	G	N	N	N	Y	N
+chr2	48633	48634	T	T	T	N	G	N	N	N	.	N	N	N	N	S	N	Y	N	.	N	N	N
+chr2	51324	51325	T	T	N	R	N	N	N	N	N	-	N	U	N	W	A	N	N	N	N	N	N
+chr2	52065	52066	T	C	T	N	N	N	S	N	.	N	T	N	M	N	S	W	N	T	Y	C	N
+chr2	53130	53131	T	C	T	K	R	.	N	B	N	N	T	N	N	M	N	Y	N	N	Y	N	N
+chr2	53559	53560	T	T	T	N	N	V	R	V	N	N	T	N	U	N	N	B	N	M	N	V	Y
+chr2	55607	55608	A	N	A	U	S	N	N	H	R	K	N	N	N	Y	N	N	G	N	N	N	N
+chr10	55659	55660	T	N	T	C	N	K	N	N	N	U	N	S	N	N	N	V	C	R	S	N	N
+chr10	55734	55735	T	N	T	G	N	C	N	M	M	G	C	N	B	N	.	N	G	N	N	N	N
+chr10	55870	55871	C	G	C	N	H	G	-	N	N	N	C	N	H	K	N	M	G	N	N	N	N
+chr10	56100	56101	T	T	A	W	N	N	W	N	S	N	K	M	N	R	N	R	N	R	N	G	N
+chr10	56120	56121	A	-	A	N	A	N	N	Y	N	N	N	W	V	N	N	Y	G	N	N	W	N
+chr10	56174	56175	A	T	A	Y	A	N	N	N	N	N	N	N	N	N	.	S	T	Y	N	B	N
+chr10	59373	59374	A	G	A	N	N	N	N	N	N	T	N	S	N	N	N	G	N	N	N	V	N
+chr10	68912	68913	G	T	G	R	N	B	R	N	H	N	U	W	Y	N	N	N	N	N	N	N	T
+chr10	72946	72947	T	A	N	N	N	N	N	N	B	N	N	.	B	D	W	U	N	U	N	D	A
+chr10	77052	77053	G	A	R	N	G	N	N	Y	N	N	N	N	N	N	B	R	N	W	N	N	R
+chr18	78200	78201	G	G	G	N	N	H	N	N	V	N	G	N	N	N	N	A	A	N	K	X	N
+chr18	81076	81077	T	A	T	B	N	N	G	N	N	X	W	N	X	N	V	N	N	D	N	N	N
+chr18	81198	81199	A	T	A	N	N	N	N	-	N	N	X	N	K	T	N	M	N	K	X	N	W
+chr18	81216	81217	G	A	G	Y	N	N	D	N	X	N	N	N	N	A	N	S	N	N	N	D	N
+chr18	81398	81399	G	T	G	N	-	W	N	N	M	N	G	C	N	K	N	S	N	N	N	N	K
+chr18	93895	93896	T	T	T	H	N	N	V	W	Y	N	N	N	-	N	N	N	N	N	N	Y	N
+chr18	98172	98173	T	T	T	N	.	N	N	N	S	N	T	N	Y	N	N	Y	X	D	V	N	Y
+chr18	110904	110905	T	-	A	A	N	A	N	A	W	A	N	N	A	X	N	W	N	N	N	N	N
+chr18	140324	140325	A	A	A	N	M	N	N	Y	N	S	N	V	N	N	X	N	C	N	N	.	M
+chr18	160592	160593	C	G	G	G	N	G	N	G	N	G	N	N	G	N	N	M	T	N	Y	N	N
diff -r dedb7be9aa44 -r b36c13131ac7 test-data/dna_filter_out4.bed
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/test-data/dna_filter_out4.bed	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,24 @@
+chr1	582	583	G	G	G	N	G	R	N	R	N	-	N	M	N	V	K	N	N	N	G	C	R
+chr1	602	603	G	G	G	N	G	N	Y	N	R	G	G	N	N	U	N	T	N	A	K	N	R
+chr1	6119	6120	G	G	M	N	S	N	N	W	B	N	S	D	N	N	H	V	N	B	W	N	N
+chr1	6433	6434	G	G	N	R	N	N	C	N	N	N	.	N	N	.	N	N	N	N	N	R	N
+chr1	41920	41921	G	C	G	N	M	C	G	N	A	N	G	N	K	N	W	S	N	N	N	V	N
+chr1	45161	45162	C	T	C	.	N	X	H	V	N	N	C	R	N	Y	N	N	N	N	R	N	Y
+chr2	45788	45789	T	T	T	N	W	S	N	Y	N	R	Y	N	S	N	W	M	N	C	T	N	C
+chr2	46243	46244	T	T	T	N	W	N	N	B	V	N	U	N	T	N	N	Y	C	N	U	N	N
+chr2	48633	48634	T	T	T	N	G	N	N	N	.	N	N	N	N	S	N	Y	N	.	N	N	N
+chr2	51304	51305	A	G	N	N	C	N	W	-	N	S	Y	N	.	N	N	G	N	N	N	W	R
+chr2	51324	51325	T	T	N	R	N	N	N	N	N	-	N	U	N	W	A	N	N	N	N	N	N
+chr2	52065	52066	T	C	T	N	N	N	S	N	.	N	T	N	M	N	S	W	N	T	Y	C	N
+chr2	53559	53560	T	T	T	N	N	V	R	V	N	N	T	N	U	N	N	B	N	M	N	V	Y
+chr10	55734	55735	T	N	T	G	N	C	N	M	M	G	C	N	B	N	.	N	G	N	N	N	N
+chr10	55870	55871	C	G	C	N	H	G	-	N	N	N	C	N	H	K	N	M	G	N	N	N	N
+chr10	56120	56121	A	-	A	N	A	N	N	Y	N	N	N	W	V	N	N	Y	G	N	N	W	N
+chr10	59373	59374	A	G	A	N	N	N	N	N	N	T	N	S	N	N	N	G	N	N	N	V	N
+chr10	72946	72947	T	A	N	N	N	N	N	N	B	N	N	.	B	D	W	U	N	U	N	D	A
+chr10	77052	77053	G	A	R	N	G	N	N	Y	N	N	N	N	N	N	B	R	N	W	N	N	R
+chr18	81198	81199	A	T	A	N	N	N	N	-	N	N	X	N	K	T	N	M	N	K	X	N	W
+chr18	98172	98173	T	T	T	N	.	N	N	N	S	N	T	N	Y	N	N	Y	X	D	V	N	Y
+chr18	110904	110905	T	-	A	A	N	A	N	A	W	A	N	N	A	X	N	W	N	N	N	N	N
+chr18	140324	140325	A	A	A	N	M	N	N	Y	N	S	N	V	N	N	X	N	C	N	N	.	M
+chr18	160592	160593	C	G	G	G	N	G	N	G	N	G	N	N	G	N	N	M	T	N	Y	N	N
diff -r dedb7be9aa44 -r b36c13131ac7 tool_conf.xml.sample
--- a/tool_conf.xml.sample	Mon Feb 08 21:33:12 2010 -0500
+++ b/tool_conf.xml.sample	Mon Feb 08 23:52:56 2010 -0500
@@ -49,6 +49,7 @@
     <tool file="filters/headWrapper.xml" />
     <tool file="filters/tailWrapper.xml" />
     <tool file="filters/trimmer.xml" />
+    <tool file="stats/dna_filtering.xml" />
   </section>
   <section name="Filter and Sort" id="filter">
     <tool file="stats/filtering.xml" />
diff -r dedb7be9aa44 -r b36c13131ac7 tools/stats/dna_filtering.py
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/stats/dna_filtering.py	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,195 @@
+#!/usr/bin/env python
+
+"""
+This tool takes a tab-delimited text file as input and filters its lines using conditions on columns that hold DNA ambiguity codes. Invalid lines are skipped, and the number of skipped lines is reported to the user.
+
+usage: %prog [options]
+    -i, --input=i: tabular input file
+    -o, --output=o: filtered output file
+    -c, --cond=c: conditions to filter on
+    -n, --n_handling=n: how to handle N and X
+    -l, --columns=l: number of columns in the input
+    -t, --col_types=t: comma-separated column types
+
+"""
+
+#from __future__ import division
+import os.path, re, string, sys
+from galaxy import eggs
+import pkg_resources; pkg_resources.require( "bx-python" )
+from bx.cookbook import doc_optparse
+
+# Older py compatibility
+try:
+    set()
+except:
+    from sets import Set as set
+
+#assert sys.version_info[:2] >= ( 2, 4 )
+
+def get_operands( filter_condition ):
+    # Note that the order of items_to_strip is important
+    items_to_strip = [ '==', '!=', ' and ', ' or ' ]
+    for item in items_to_strip:
+        if filter_condition.find( item ) >= 0:
+            filter_condition = filter_condition.replace( item, ' ' )
+    operands = set( filter_condition.split( ' ' ) )
+    return operands
+
+def stop_err( msg ):
+    sys.stderr.write( msg )
+    sys.exit()
+
+def __main__():
+    #Parse Command Line
+    options, args = doc_optparse.parse( __doc__ )
+    input = options.input
+    output = options.output
+    cond = options.cond
+    n_handling = options.n_handling
+    columns = options.columns
+    col_types = options.col_types
+
+    try:
+        in_columns = int( columns )
+        assert col_types  #check to see that the column types variable isn't null
+        in_column_types = col_types.split( ',' )
+    except:
+        stop_err( "Data does not appear to be tabular.  This tool can only be used with tab-delimited data." )
+
+    # Unescape if input has been escaped
+    cond_text = cond.replace( '__eq__', '==' ).replace( '__ne__', '!=' ).replace( '__sq__', "'" )
+    orig_cond_text = cond_text
+    # Expand to allow for DNA codes
+    dot_letters = [ letter for letter in string.uppercase if letter not in \
+                   [ 'A', 'T', 'U', 'G', 'C', 'K', 'M', 'R', 'Y', 'S', 'W', 'B', 'V', 'H', 'D', 'N', 'X' ] ]
+    codes = {'A': [ 'A', 'M', 'R', 'W', 'V', 'H', 'D' ],
+             'T': [ 'T', 'U', 'K', 'Y', 'W', 'B', 'H', 'D' ],
+             'G': [ 'G', 'K', 'R', 'S', 'B', 'V', 'D' ],
+             'C': [ 'C', 'M', 'Y', 'S', 'B', 'V', 'H' ],
+             'U': [ 'T', 'U', 'K', 'Y', 'W', 'B', 'H', 'D' ],
+             'K': [ 'K', 'G', 'T' ],
+             'M': [ 'M', 'A', 'C' ],
+             'R': [ 'R', 'A', 'G' ],
+             'Y': [ 'Y', 'C', 'T' ],
+             'S': [ 'S', 'C', 'G' ],
+             'W': [ 'W', 'A', 'T' ],
+             'B': [ 'B', 'C', 'G', 'T' ],
+             'V': [ 'V', 'A', 'C', 'G' ],
+             'H': [ 'H', 'A', 'C', 'T' ],
+             'D': [ 'D', 'A', 'G', 'T' ],
+             '.': dot_letters,
+             '-': [ '-' ]}
+    # Add handling for N and X
+    if n_handling == "all":
+        codes[ 'N' ] = [ 'G', 'A', 'T', 'C', 'U', 'K', 'M', 'R', 'Y', 'S', 'W', 'B', 'V', 'H', 'D', 'N', 'X' ]
+        codes[ 'X' ] = [ 'G', 'A', 'T', 'C', 'U', 'K', 'M', 'R', 'Y', 'S', 'W', 'B', 'V', 'H', 'D', 'N', 'X' ]
+        for code in codes.keys():
+            if code != '.' and code != '-':
+                codes[code].append( 'N' )
+                codes[code].append( 'X' )
+    else:
+        codes[ 'N' ] = dot_letters
+        codes[ 'X' ] = dot_letters
+    # Expand conditions to allow for DNA codes
+    try:
+        match_replace = {}
+        pat = re.compile( "c\d+\s*[!=]=\s*[\w']+" )
+        matches = pat.findall( cond_text )
+        for match in matches:
+            if match.find( '==' ) > 0:
+                match_parts = match.split( '==' )
+                new_match = '(%s in codes[%s] and %s in codes[%s])' % ( match_parts[0], match_parts[1], match_parts[1], match_parts[0] ) 
+            elif match.find( '!=' ) > 0 :
+                match_parts = match.split( '!=' )
+                new_match = '(%s not in codes[%s] or %s not in codes[%s])' % ( match_parts[0], match_parts[1], match_parts[1], match_parts[0] )
+            else:
+                raise Exception
+            if match_parts[1].find( "'" ) >= 0:
+                assert match_parts[1].replace( "'", '' ) in [ 'G', 'A', 'T', 'C', 'U', 'K', 'M', 'R', 'Y', 'S', 'W', 'B', 'V', 'H', 'D', 'N', 'X', '-', '.' ]
+            else:
+                assert match_parts[1].startswith( 'c' )
+            match_replace[match] = new_match
+        for match in match_replace.keys():
+            cond_text = cond_text.replace(match, match_replace[match])
+        if len( match_replace ) == 0:
+            raise Exception
+    except:
+        stop_err( "One of your conditions is invalid. Make sure to use only '!=' or '==', valid column numbers, and valid base values." )
+
+    # Attempt to determine if the condition includes executable stuff and, if so, exit
+    secured = dir()
+    operands = get_operands( cond_text )
+    for operand in operands:
+        try:
+            check = int( operand )
+        except:
+            if operand in secured:
+                stop_err( "Illegal value '%s' in condition '%s'" % ( operand, cond_text ) )
+
+    # Prepare the column variable names and wrappers for column data types
+    cols, type_casts = [], []
+    for col in range( 1, in_columns + 1 ):
+        col_name = "c%d" % col
+        cols.append( col_name )
+        col_type = in_column_types[ col - 1 ]
+        type_cast = "%s(%s)" % ( col_type, col_name )
+        type_casts.append( type_cast )
+
+    col_str = ', '.join( cols )    # 'c1, c2, c3, c4'
+    type_cast_str = ', '.join( type_casts )  # 'str(c1), int(c2), int(c3), str(c4)'
+    assign = "%s = line.split( '\\t' )" % col_str
+    wrap = "%s = %s" % ( col_str, type_cast_str )
+    skipped_lines = 0
+    first_invalid_line = 0
+    invalid_line = None
+    lines_kept = 0
+    total_lines = 0
+    out = open( output, 'wt' )
+    # Read and filter input file, skipping invalid lines
+    code = '''
+for i, line in enumerate( file( input ) ):
+    total_lines += 1
+    line = line.rstrip( '\\r\\n' )
+    if not line or line.startswith( '#' ):
+        skipped_lines += 1
+        if not invalid_line:
+            first_invalid_line = i + 1
+            invalid_line = line
+        continue
+    try:
+        %s = line.split( '\\t' )
+        %s = %s
+        if %s:
+            lines_kept += 1
+            print >> out, line
+    except Exception, e:
+        skipped_lines += 1
+        if not invalid_line:
+            first_invalid_line = i + 1
+            invalid_line = line
+''' % ( col_str, col_str, type_cast_str, cond_text )
+
+    valid_filter = True
+    try:
+        exec code
+    except Exception, e:
+        out.close()
+        if str( e ).startswith( 'invalid syntax' ):
+            valid_filter = False
+            stop_err( 'Filter condition "%s" likely invalid. See tool tips, syntax and examples.' % orig_cond_text )
+        else:
+            stop_err( str( e ) )
+
+    if valid_filter:
+        out.close()
+        valid_lines = total_lines - skipped_lines
+        print 'Filtering with %s, ' % orig_cond_text
+        if valid_lines > 0:
+            print 'kept %4.2f%% of %d lines.' % ( 100.0*lines_kept/valid_lines, total_lines )
+        else:
+            print 'Possible invalid filter condition "%s" or non-existent column referenced. See tool tips, syntax and examples.' % orig_cond_text
+        if skipped_lines > 0:
+            print 'Skipped %d invalid lines starting at line #%d: "%s"' % ( skipped_lines, first_invalid_line, invalid_line )
+    
+if __name__ == "__main__" : __main__()
diff -r dedb7be9aa44 -r b36c13131ac7 tools/stats/dna_filtering.xml
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/tools/stats/dna_filtering.xml	Mon Feb 08 23:52:56 2010 -0500
@@ -0,0 +1,114 @@
+<tool id="dna_filter" name="DNA Filter" version="1.0.0">
+  <description>filter column data on DNA ambiguity codes using simple expressions</description>
+  <command interpreter="python">
+    dna_filtering.py
+      --input=$input 
+      --output=$out_file1 
+      --cond="$cond" 
+      --n_handling=$n_handling
+      --columns=${input.metadata.columns} 
+      --col_types="${input.metadata.column_types}"
+  </command>
+  <inputs>
+    <param format="tabular" name="input" type="data" label="Filter" help="Query missing? See TIP below."/>
+    <param name="cond" size="40" type="text" value="c8=='G'" label="With following condition" help="Double equal signs, ==, must be used as shown above. To filter for an arbitrary string, use the Select tool.">
+      <validator type="empty_field" message="Enter a valid filtering condition, see syntax and examples below."/>
+    </param>
+    <param name="n_handling" type="select" label="Should N (and X) match any of A, C, G, or T, or match nothing?">
+      <option value="all">N = A or C or G or T</option>
+      <option value="none">N = nothing</option>
+    </param>
+  </inputs>
+  <outputs>
+    <data format="input" name="out_file1" metadata_source="input"/>
+  </outputs>
+  <tests>
+    <test>
+      <param name="input" value="dna_filter_in1.bed" />
+      <param name="cond" value="c8=='G'" />
+      <param name="n_handling" value="all" />
+      <output name="out_file1" file="dna_filter_out1.bed" />
+    </test>
+    <test>
+      <param name="input" value="dna_filter_in1.bed" />
+      <param name="cond" value="(c10==c11 or c17==c18) and c6!='C' and c23=='R'" />
+      <param name="n_handling" value="all" />
+      <output name="out_file1" file="dna_filter_out2.bed" />
+    </test>
+    <test>
+      <param name="input" value="dna_filter_in1.bed" />
+      <param name="cond" value="c4=='B' or c9==c10" />
+      <param name="n_handling" value="none" />
+      <output name="out_file1" file="dna_filter_out3.bed" />
+    </test>
+    <test>
+      <param name="input" value="dna_filter_in1.bed" />
+      <param name="cond" value="c7!='Y' and c9!='U'" />
+      <param name="n_handling" value="none" />
+      <output name="out_file1" file="dna_filter_out4.bed" />
+    </test>
+  </tests>
+  <help>
+
+.. class:: warningmark
+
+Double equal signs, ==, must be used as *"equal to"* (e.g., **c1 == 'G'**)
+
+.. class:: infomark
+
+**TIP:** If your data is not TAB delimited, use *Text Manipulation->Convert*
+
+.. class:: infomark
+
+**TIP:** This tool is intended primarily for comparing column values (such as "c5==c12"), although it is also possible to filter on specific values (like "c6!='G'"). Be aware that when filtering on a specific value, any possible match is considered. For example, with "c6!='G'", rows are excluded when c6 is G, K, R, S, B, V, or D (plus N or X if N is set to match any base), because each of those codes could represent G.
+
+-----
+
+**Syntax**
+
+The filter tool allows you to restrict the dataset using simple conditional statements.
+
+- Columns are referenced with **c** and a **number**. For example, **c1** refers to the first column of a tab-delimited file
+- Make sure that multi-character operators contain no white space ( e.g., **!=** is valid while **! =** is not valid )
+- When using the 'equal-to' operator, a **double equal sign '==' must be used** ( e.g., **c4=='G'** )
+- Base values must be enclosed in single quotes ( e.g., **c6=='C'** )
+- Filtering conditions can include logical operators, but **make sure operators are all lower case** ( e.g., **(c4!='A' and c5!='A') or c6=='G'** )
+
+-----
+
+**DNA Codes**
+
+The following are the DNA codes used for filtering::
+
+  Code        Meaning
+  ----   ---------------------------
+   A            A
+   T            T
+   U            T
+   G            G
+   C            C
+   K          G or T
+   M          A or C
+   R          A or G
+   Y          C or T
+   S          C or G
+   W          A or T
+   B         C, G or T
+   V         A, C or G
+   H         A, C or T
+   D         A, G or T
+   X        A, C, G or T
+   N        A, C, G or T
+   .     not (A, C, G or T)
+   -     gap of indeterminate length
+
+-----
+
+**Example**
+
+- **c8=='A'** selects lines in which the eighth column is A, M, R, W, V, H, or D (plus N or X when N is set to match any base)
+- **c12==c15** selects lines where the value in the twelfth column could be the same as the value in the fifteenth column, and vice versa (based on the codes above)
+- **c9!=c19** selects lines where column nine could not be the same as column nineteen, or column nineteen could not be the same as column nine (based on the codes above)
+
+</help>
+</tool>
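
For readers skimming the diff, the matching rule used by dna_filtering.py can be tried out on its own. The following is a minimal Python 3 sketch, not the committed Python 2 code: the table entries are copied from the `codes` dict in the diff, but the names ALL_CODES, build_codes, and could_match are hypothetical and exist only for this illustration.

```python
import string

# All letters the tool treats as DNA bases or ambiguity codes.
ALL_CODES = ["G", "A", "T", "C", "U", "K", "M", "R", "Y", "S", "W",
             "B", "V", "H", "D", "N", "X"]


def build_codes(n_handling):
    """Return the code -> compatible-codes table for one N/X setting."""
    # Uppercase letters that are not DNA codes; used for '.' and,
    # when N/X should match nothing, for N and X as well.
    dot_letters = [c for c in string.ascii_uppercase if c not in ALL_CODES]
    codes = {
        "A": ["A", "M", "R", "W", "V", "H", "D"],
        "T": ["T", "U", "K", "Y", "W", "B", "H", "D"],
        "G": ["G", "K", "R", "S", "B", "V", "D"],
        "C": ["C", "M", "Y", "S", "B", "V", "H"],
        "U": ["T", "U", "K", "Y", "W", "B", "H", "D"],
        "K": ["K", "G", "T"],
        "M": ["M", "A", "C"],
        "R": ["R", "A", "G"],
        "Y": ["Y", "C", "T"],
        "S": ["S", "C", "G"],
        "W": ["W", "A", "T"],
        "B": ["B", "C", "G", "T"],
        "V": ["V", "A", "C", "G"],
        "H": ["H", "A", "C", "T"],
        "D": ["D", "A", "G", "T"],
        ".": list(dot_letters),
        "-": ["-"],
    }
    if n_handling == "all":
        codes["N"] = list(ALL_CODES)
        codes["X"] = list(ALL_CODES)
        # Every real code (not '.' or '-') also becomes compatible with N and X.
        for code in codes:
            if code not in (".", "-"):
                codes[code] += ["N", "X"]
    else:
        codes["N"] = list(dot_letters)
        codes["X"] = list(dot_letters)
    return codes


def could_match(a, b, codes):
    """Mirror the expanded '==' test: each value must appear in the other's list."""
    return a in codes.get(b, ()) and b in codes.get(a, ())
```

With the table built for "all", could_match("N", "G", codes) is true; with the table built for "none", the same call is false, which is the difference the n_handling option controls.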

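The script also rewrites each comparison in the user's condition into a pair of membership tests before evaluating it. A rough Python 3 sketch of that rewriting step follows. It is an illustration under stated assumptions rather than the committed code: expand_condition, PATTERN, and VALID_CODES are hypothetical names, the regular expression is a tidied variant of the one in the diff, and whitespace around the operator is dropped here.

```python
import re

# Single-character codes accepted as quoted base values (assumed from the script's check).
VALID_CODES = set("GATCUKMRYSWBVHDNX-.")

# Left column, operator, then either another column or a single-quoted value.
PATTERN = re.compile(r"(c\d+)\s*(==|!=)\s*(c\d+|'[^']*')")


def expand_condition(cond):
    """Rewrite 'cN==X' / 'cN!=X' comparisons into lookups in a 'codes' table."""
    def rewrite(m):
        left, op, right = m.groups()
        # Quoted operands must be a known DNA code; column references pass through.
        if right.startswith("'") and right.strip("'") not in VALID_CODES:
            raise ValueError("invalid base value: %s" % right)
        if op == "==":
            return "(%s in codes[%s] and %s in codes[%s])" % (left, right, right, left)
        return "(%s not in codes[%s] or %s not in codes[%s])" % (left, right, right, left)

    expanded, count = PATTERN.subn(rewrite, cond)
    if count == 0:
        raise ValueError("no valid comparison found in condition")
    return expanded


print(expand_condition("c4=='B' or c9==c10"))
```

Note that "!=" expands to the logical negation of "==", so the two membership tests are joined with "or" rather than "and".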
    

Greg Von Kuster