Re: [galaxy-user] How to combine two Reference Genome (Files) In Galaxy?

19 Oct 2011

Hello,

There is a tool from the FASTX-Toolkit to remove duplicated sequences, 
"Collapse sequences", but it is designed to work on short reads.

If the contigs shared between the two files have identical IDs and
sequences in both, you can compare the files to identify the common and
unique entries. The general path is to first convert each FASTA file to
tabular format using "Convert Formats -> FASTA-to-Tabular", then compare
the IDs using "Join, Subtract and Group -> Compare two Datasets".
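
For anyone who wants to see what the FASTA-to-tabular step amounts to
outside of Galaxy, here is a minimal Python sketch. It is not the Galaxy
tool's code; the function name fasta_to_rows and the example records are
made up for illustration, and it assumes the whole header line is used
as the ID.

```python
import io

def fasta_to_rows(handle):
    """Yield (id, sequence) rows from an open FASTA file (hypothetical helper)."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if seq_id is not None:
                yield seq_id, "".join(chunks)
            seq_id, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may span several lines
    if seq_id is not None:
        yield seq_id, "".join(chunks)

# Made-up two-record example standing in for a real FASTA file.
example = io.StringIO(">contig1\nACGT\nTTGA\n>contig2\nGGCC\n")
rows = list(fasta_to_rows(example))
for seq_id, seq in rows:
    print(f"{seq_id}\t{seq}")  # one tab-separated row per sequence
```

Each record becomes one row with the ID in the first column and the
full sequence in the second, which is the shape the comparison step
works on.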

Three comparisons will be needed:
1 - rows unique to file1
2 - rows unique to file2
3 - rows in common
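
The three comparisons correspond to simple set operations on the IDs.
A rough Python sketch, assuming each file has already been loaded into
a dict keyed by ID (the contig names and sequences below are invented
for the example):

```python
# Hypothetical tabular data: ID -> sequence for each file.
file1 = {"contigA": "ACGT", "contigB": "GGCC", "contigC": "TTAA"}
file2 = {"contigB": "GGCC", "contigD": "CCGG"}

# 1 - rows unique to file1
only_in_file1 = sorted(file1.keys() - file2.keys())
# 2 - rows unique to file2
only_in_file2 = sorted(file2.keys() - file1.keys())
# 3 - rows in common
in_common = sorted(file1.keys() & file2.keys())

print(only_in_file1, only_in_file2, in_common)
```

Note this only compares IDs. If two contigs could share an ID but
differ in sequence, the sequences would need to be compared as well.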

Then merge the results using "Text Manipulation -> Concatenate datasets" 
and convert back to fasta using "Convert Formats -> Tabular-to-FASTA".
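
As an illustration of the merge and the conversion back to FASTA, here
is a small Python sketch. The helper rows_to_fasta, the 60-column line
width, and the three example row lists are assumptions made for the
example, not what the Galaxy tools do internally.

```python
def rows_to_fasta(rows, width=60):
    """Format (id, sequence) rows as FASTA text (hypothetical helper)."""
    lines = []
    for seq_id, seq in rows:
        lines.append(f">{seq_id}")
        # Wrap the sequence at a fixed width; 60 is an arbitrary choice here.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"

# Made-up results of the three comparisons.
unique_to_file1 = [("contigA", "ACGT")]
unique_to_file2 = [("contigD", "CCGG")]
common = [("contigB", "GGCC")]

# Concatenating the three disjoint sets gives every contig exactly once.
merged = unique_to_file1 + unique_to_file2 + common
fasta_text = rows_to_fasta(merged)
print(fasta_text, end="")
```

Because the three sets do not overlap, plain concatenation is enough
and no further de-duplication is needed.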

If the IDs are not the same and the sequences are slightly different, 
then you will probably need to consider a tool designed to do genome 
sequence assembly.

Hopefully this helps,

Jen
Galaxy team

On 10/14/11 8:49 AM, Binbin You wrote:
...
Hi all,
I have two reference (genome) files. Let's say EAB_FB_MG.fa (total 37972
sequences/contigs) and EAB_FB.fa (21272 sequences/contigs). I know there
are some common contigs between them. How could I combine/merge them to
get a new reference file with all unique contigs (without duplicates)?
Many thanks for any idea!!
-- 
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support