Hello Anto,

I don't know of a specific tool that filters on read content, but you could take advantage of the very low quality score (2) assigned to ambiguous bases and use the 'Filter by quality' tool to filter by percentage. Be aware that other bases may also have been assigned this low score, but these would very likely not be of practical use anyway. You could clip these ends first, then run the filter, discarding any reads that have very little usable sequence left.

If the data is Illumina, this is likely a sign of a sequence that failed vendor quality checks; these are no longer removed by default as of Casava 1.8+.

Creating a regular expression with the Select tool is another option, but this is probably more effort to construct than it is worth. But, your choice. A Google search will bring up syntax advice.

Ideally the first will do the job,

Jen
Galaxy team

On 7/29/13 3:17 AM, Anto Praveen Rajkumar Rajamani wrote:
Hello,
I would like to filter my FASTQ files (50 bp single-end Illumina RNA-seq reads) by a maximum threshold (10%) of ambiguous (N) bases. I can see that the "CLIP" tool removes all reads with one or more N bases. Is there a way to remove only the reads with five or more N bases using Galaxy? Thank you.
Best wishes, Anto
-- Jennifer Hillman-Jackson Galaxy Support and Training http://galaxyproject.org
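
For anyone who ends up doing this outside Galaxy, the filter asked about in this thread (drop 50 bp reads with five or more N bases, keep the rest) can be sketched in a few lines of Python. This is only an illustration, not a Galaxy tool: the function names (parse_fastq, keep_read, filter_reads) and the n_cutoff parameter are made up for the example, and it assumes unwrapped FASTQ with exactly four lines per read. The regular expression at the end shows the general shape of a pattern that accepts at most four N characters; whether the Select tool accepts the same syntax, and how the reads would need to be laid out for it, would have to be checked separately.

```python
import re
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # (header, sequence, separator, quality)


def parse_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from unwrapped FASTQ text (assumed: four lines per read)."""
    block = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line and not block:
            continue  # skip blank lines between records
        block.append(line)
        if len(block) == 4:
            yield (block[0], block[1], block[2], block[3])
            block = []


def keep_read(seq: str, n_cutoff: int = 5) -> bool:
    """True when the read has fewer than n_cutoff ambiguous (N) bases."""
    return seq.upper().count("N") < n_cutoff


def filter_reads(records: Iterable[Record], n_cutoff: int = 5) -> Iterator[Record]:
    """Pass through only the records whose sequence passes keep_read."""
    return (rec for rec in records if keep_read(rec[1], n_cutoff))


# Regex form of the default cutoff: the whole sequence may contain
# at most four N characters (use with fullmatch).
AT_MOST_FOUR_N = re.compile(r"(?:[^N]*N){0,4}[^N]*")


if __name__ == "__main__":
    demo = [
        "@read1", "ACGTNACGTA", "+", "IIII#IIIII",
        "@read2", "NNNNNACGTA", "+", "#####IIIII",
    ]
    for header, seq, _sep, _qual in filter_reads(parse_fastq(demo)):
        print(header, seq)
```

For 50 bp reads a 10% threshold corresponds to n_cutoff=5; for other read lengths the cutoff would need to be recomputed, or the count replaced with a fraction of the read length.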