On Feb 16, 2012, at 4:47 AM, Peter Cock wrote:
On Wed, Feb 15, 2012 at 6:07 PM, Dannon Baker <dannonbaker@me.com> wrote:
Good luck, let me know how it goes, and again - contributions are certainly welcome :)
I think I found the first bug: the split method of class Sequence in lib/galaxy/datatypes/sequence.py assumes four lines per sequence. That would make sense as the split method of the Fastq class (after grooming to remove any line wrapping), but it is a very bad idea for most sequence file formats (e.g. FASTA).
It looks like a little refactoring is needed, defining a Sequence split method which raises not implemented, and moving the current code to the Fastq class, then writing something similar but allowing multiple lines per record for the Fasta class.
Does that sound reasonable? I'll do this on a new branch for review...
Peter
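A rough sketch of the refactoring Peter describes might look like the following. It is not the code in lib/galaxy/datatypes/sequence.py: the method name iter_records and its signature are made up for illustration, and it works on an iterable of lines rather than on Galaxy datasets. The base class refuses to split, Fastq keeps the four-lines-per-record assumption, and Fasta allows any number of lines per record.

```python
class Sequence:
    """Base class: there is no generic way to split a sequence file."""

    @classmethod
    def iter_records(cls, lines):
        # Hypothetical hook; subclasses that can be split must override it.
        raise NotImplementedError("no record splitting defined for %s" % cls.__name__)


class Fastq(Sequence):
    @classmethod
    def iter_records(cls, lines):
        # Assumes groomed FASTQ: exactly four lines per record, no wrapping.
        record = []
        for line in lines:
            record.append(line)
            if len(record) == 4:
                yield record
                record = []
        if record:
            raise ValueError("truncated FASTQ record at end of input")


class Fasta(Sequence):
    @classmethod
    def iter_records(cls, lines):
        # A record starts at a '>' header line and runs until the next header,
        # so wrapped (multi-line) sequences stay together.
        record = []
        for line in lines:
            if line.startswith(">") and record:
                yield record
                record = []
            record.append(line)
        if record:
            yield record
```

With this shape, asking the base class to split fails loudly instead of silently cutting records every four lines.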
Makes sense from my perspective; splits have to be defined based on data type. It could be as low-level as defining a simple iterator per record, then a wrapper that allows a specific chunk size. The split file creation could almost be abstracted completely away into a common method. As Peter implies, maybe a simple API for defining a split method would be all that is needed. It might also be useful on any merge step: 'cat'-like merges won't work for every format but would be a suitable default.
chris
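The layering suggested above could be sketched roughly as follows. All names here (chunked, split_text, merge_cat, line_records) are hypothetical and not part of Galaxy; the code works on in-memory strings rather than real split files, and line_records is only a stand-in for whatever per-record iterator a datatype would supply.

```python
from itertools import islice


def chunked(records, chunk_size):
    """Group any iterator of records into lists of at most chunk_size records."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    records = iter(records)
    while True:
        chunk = list(islice(records, chunk_size))
        if not chunk:
            return
        yield chunk


def split_text(record_iter, lines, chunk_size):
    """Common split step: the datatype only supplies record_iter.

    Returns one string per part; a real implementation would write files.
    """
    return [
        "".join(line for record in chunk for line in record)
        for chunk in chunked(record_iter(lines), chunk_size)
    ]


def merge_cat(parts):
    """Default 'cat'-like merge; formats with headers or indexes need their own."""
    return "".join(parts)


def line_records(lines):
    """Stand-in per-record iterator: every line is one record."""
    for line in lines:
        yield [line]
```

Under these assumptions, split_text(line_records, lines, 2) followed by merge_cat on the result reproduces the original text, which is the property a 'cat'-style default merge relies on.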