Combining tables with different numbers of columns
Hi all,

I'm having a little trouble understanding the best way to perform some tabular file manipulations in Galaxy. I have several tabular files, which contain different numbers of columns, which I want to combine using a single column containing an identifier (which must match for the rows to be combined).

e.g.
File 1 contains, c1 = ID c2 = Score1a
File 2 contains, c1 = ID c2 = Score2a c3 = Score2b c4 = Score2c
File 3 contains, c1 = ID c2 = Score3a c3 = Score3b
Desired combined file containing:
c1 = ID c2 = Score1a c3 = Score2a c4 = Score2b c5 = Score2c c6 = Score3a c7 = Score3b

I have worked out how to do this with two calls to the "Join two Datasets" tool, but this results in the repetition of the join column (ID in this example), so a final clean-up is required using the "Cut" tool (which breaks the column assignments).

The more flexible "Column Join" tool would let me combine an arbitrary number of files, but is designed for input files containing the same column structure.

Is there a better way to do this with Galaxy as it stands?

Alternatively, would adding an option to the "Join two Datasets" tool not to bother with the redundant column be widely useful?

Peter
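For reference, the desired result amounts to a key-based join that keeps the ID column only once. Below is a minimal Python sketch of that operation outside Galaxy, not a description of how any Galaxy tool works. It assumes the rows are already parsed into lists, the ID is in the first column and unique within each file, and only IDs present in every file are kept. The function name `join_on_id` and the sample values are hypothetical.

```python
def join_on_id(tables):
    """Join tables (lists of rows) on the first column, keeping the ID once.

    Assumptions: IDs are unique within each table, and only IDs found in
    every table are kept (an inner join).
    """
    # Index every table by ID, storing the remaining columns without the ID.
    indexed = [{row[0]: row[1:] for row in table} for table in tables]
    combined = []
    # Walk the first table so that its row order is preserved.
    for row in tables[0]:
        key = row[0]
        if all(key in index for index in indexed):
            merged = [key]
            for index in indexed:
                merged.extend(index[key])
            combined.append(merged)
    return combined


# Hypothetical sample data mirroring File 1, File 2 and File 3.
file1 = [["id1", "0.1"], ["id2", "0.2"]]
file2 = [["id1", "1.1", "1.2", "1.3"], ["id2", "2.1", "2.2", "2.3"]]
file3 = [["id2", "5.1", "5.2"], ["id1", "4.1", "4.2"]]

for row in join_on_id([file1, file2, file3]):
    print("\t".join(row))
```

Because the ID is dropped from each table before merging, no clean-up step is needed afterwards, and the column order follows the order in which the tables are passed in.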
Hi Peter,

Another option would be to pad out the files so that all desired columns in the final result are present before you do the join with "Column Join". For now, this and the method you describe are the available choices.

If you wanted to open an issue in Bitbucket, the team would have a log of the enhancement request. Or I can open one, just let me know. Useful idea!

Thanks,
Jen
Galaxy team

On 4/13/11 6:29 AM, Peter Cock wrote:
Hi all,
I'm having a little trouble understanding the best way to perform some tabular file manipulations in Galaxy. I have several tabular files, which contain different numbers of columns, which I want to combine using a single column containing an identifier (which must match for the rows to be combined).
e.g.
File 1 contains, c1 = ID c2 = Score1a
File 2 contains, c1 = ID c2 = Score2a c3 = Score2b c4 = Score2c
File 3 contains, c1 = ID c2 = Score3a c3 = Score3b
Desired combined file containing:
c1 = ID c2 = Score1a c3 = Score2a c4 = Score2b c5 = Score2c c6 = Score3a c7 = Score3b
I have worked out how to do this with two calls to the "Join two Datasets" tool, but this results in the repetition of the join column (ID in this example), so a final clean-up is required using the "Cut" tool (which breaks the column assignments).
The more flexible "Column Join" tool would let me combine an arbitrary number of files, but is designed for input files containing the same column structure.
Is there a better way to do this with Galaxy as it stands?
Alternatively, would adding an option to the "Join two Datasets" tool not to bother with the redundant column be widely useful?
Peter
-- Jennifer Jackson http://usegalaxy.org http://galaxyproject.org
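The padding suggestion in the reply could be prepared outside Galaxy along these lines. This is only a sketch of one reading of "pad out the files": each row is filled on the right with a placeholder until every file has the same number of columns. How "Column Join" treats such placeholder columns is not stated in this thread, and the function name `pad_rows` and the "." placeholder are hypothetical choices.

```python
def pad_rows(rows, width, fill="."):
    """Return copies of rows padded on the right with `fill` up to `width` columns."""
    padded = []
    for row in rows:
        if len(row) > width:
            raise ValueError("row has more than %d columns" % width)
        # Append as many placeholder cells as are needed to reach `width`.
        padded.append(row + [fill] * (width - len(row)))
    return padded


# Hypothetical example: File 1 (2 columns) padded to match File 2 (4 columns).
file1 = [["id1", "0.1"], ["id2", "0.2"]]
for row in pad_rows(file1, 4):
    print("\t".join(row))
```

The input rows are left unmodified, so the original file contents remain available if a different width or placeholder turns out to be needed.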