On 02/11/09 18:41, Brown, Stuart wrote:
I am trying to come up with a nice workflow/tutorial for the use of Galaxy to search for transcription factor binding sites (TFBS) on a genome-wide scale using pattern search tools. I want to train my students to think genomically and to use clever tools to leverage their abilities.
Galaxy is absolutely awesome for grabbing the upstream promoter regions for all genes from any organism with a whole genome in UCSC. It is also possible to use the integrated EMBOSS tools such as fuzznuc and dreg to search for a known TFBS (or any other simple nucleotide pattern). However, I can't get past the simple search to a cleverer, information-based search. In particular, I have the following workflow in mind:
1. Collect upstream regions for all mouse (or human) genes.
2. Search for a published TF binding site with a single base mismatch using FUZZNUC.
3. Make a multiple alignment of the sequences returned by FUZZNUC (not possible in any way that I have been able to find).
4. Make a logo from the alignment to identify informative positions and conserved substitutions (not in Galaxy).
5. Make a PSSM profile, HMM profile, or other smart searching tool from the aligned sequences (not in Galaxy).
6. Search the upstream regions again with this more sensitive pattern search method (not in Galaxy).
7. Make a list of genes targeted with this TFBS.
8. Compare the list of genes to microarray data showing co-regulation of this gene set, or to pathways.
I am frustrated at step 3. Even if I bring the FUZZNUC results to my desktop, there is no easy way to extract just sequences and make a multiple alignment. Many of the 'allowed' Fuzznuc optional output formats produce an error, or no useable output.
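Steps 3 to 6 of the workflow above can be sketched outside Galaxy in a few lines of Python. This is only a minimal illustration under stated assumptions, not an EMBOSS or Galaxy feature: it assumes the pattern has a fixed length, so that every hit is the same width and the hits can be stacked column by column without a separate alignment step. The example sites, the pseudocount, the uniform background, and the score threshold are all hypothetical values chosen for illustration.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(sites):
    """Count bases per column; sites must already be the same length."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[i] for s in sites) for i in range(width)]

def information_content(counts, n_sites):
    """Bits per column (2 minus Shannon entropy), the quantity a logo plots.
    No small-sample correction is applied in this sketch."""
    bits = []
    for col in counts:
        entropy = 0.0
        for base in BASES:
            p = col[base] / n_sites
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits

def pssm(counts, n_sites, pseudocount=0.5, background=0.25):
    """Log-odds scores per column, assuming a uniform background."""
    total = n_sites + 4 * pseudocount
    return [
        {b: math.log2(((col[b] + pseudocount) / total) / background) for b in BASES}
        for col in counts
    ]

def scan(sequence, matrix, threshold):
    """Slide the matrix along one strand and report windows at or above threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(ch not in BASES for ch in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(matrix[i][ch] for i, ch in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits

# Hypothetical hits from a one-mismatch search, all 7 bases wide.
sites = ["TGACTCA", "TGACTCA", "TGAGTCA", "TGACTAA"]
counts = count_matrix(sites)
bits = information_content(counts, len(sites))
matrix = pssm(counts, len(sites))
hits = scan("AAATGACTCAGGG", matrix, threshold=8.0)
```

A real analysis would also scan the reverse strand and would estimate the background from the upstream regions themselves rather than assuming 0.25 per base.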
Do you want just the regions that matched the pattern? EMBOSS has an option, -rformat listfile, that will make a list file of the subsequences. You can use this as input to any other EMBOSS program using the syntax @filename (though I'm not sure how easy that is within Galaxy).
If you want the whole sequences, we can add a new report format to EMBOSS that simply reports sequences with a feature. The easiest approach is to add the sequence to the EMBL, SwissProt and other feature outputs. Fasta is tricky, as we have to write features to a separate GFF file (Galaxy is clever, but writing 1 or 2 output files is not the kindest way to deliver results).
If you need new programs, we are happy to add them to the next EMBOSS release and put them into Galaxy.
Hope this helps,
Peter Rice
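The list-file route described above might look roughly like the following when driven from a script outside Galaxy. This is a hedged sketch: the file names and the pattern are hypothetical, and it assumes an EMBOSS version in which fuzznuc accepts -pmismatch. The code only assembles the two command lines; passing them to subprocess.run would need EMBOSS on the PATH.

```python
def fuzznuc_listfile_cmd(seqfile, pattern, listfile, mismatches=1):
    """Command line for fuzznuc that reports hits as a list file."""
    return [
        "fuzznuc",
        "-sequence", seqfile,
        "-pattern", pattern,
        "-pmismatch", str(mismatches),
        "-rformat", "listfile",
        "-outfile", listfile,
    ]

def seqret_from_listfile_cmd(listfile, fasta_out):
    """Command line for seqret that reads the list file via the @filename syntax."""
    return ["seqret", "-sequence", "@" + listfile, "-outseq", fasta_out]

# Hypothetical file names and pattern, for illustration only.
search_cmd = fuzznuc_listfile_cmd("upstream.fasta", "TGASTCA", "hits.list")
extract_cmd = seqret_from_listfile_cmd("hits.list", "hits.fasta")

if __name__ == "__main__":
    # Print rather than run, so the sketch works without EMBOSS installed.
    for cmd in (search_cmd, extract_cmd):
        print(" ".join(cmd))
```

The second command should write just the matched subsequences, which is the input that step 3 of the workflow is missing.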