On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hi Dan,
I think I need a little more advice - what is the role of the script scripts/extract_dataset_part.py and the JSON files created when splitting FASTQ files in lib/galaxy/datatypes/sequence.py, and then used by the class' process_split_file method?
Why is there no JSON file created by the base data class in lib/galaxy/datatypes/data.py and no method process_split_file?
Is the JSON thing part of a partial and unfinished rewrite of the splitter code?
On the assumption that not all splitters bother with the JSON, I am trying a little hack to scripts/extract_dataset_part.py to abort silently if there is no JSON file: https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
This seems to be working with my current attempt at a FASTA splitter (not checked in yet, only partly implemented and tested).
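For readers following along, the "abort silently if there is no JSON file" idea might look roughly like the sketch below. This is not the code in the changeset linked above; the function name `read_split_info` and the idea of passing the JSON location as a plain path are assumptions made for illustration only.

```python
import json
import os


def read_split_info(json_path):
    """Return the parsed split description, or None if no JSON file exists.

    Hypothetical helper: the caller treats None as "this splitter wrote
    no JSON, so there is nothing to extract" and exits without an error.
    """
    if not os.path.exists(json_path):
        return None
    with open(json_path) as handle:
        return json.load(handle)
```

A caller would then return early when `read_split_info(path) is None`, rather than raising.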
I've checked in my FASTA splitting, which now seems to be working OK with my BLAST tests. So far this only does splitting into chunks of the requested number of sequences, rather than the option to split the whole file into a given number of pieces. https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
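As a rough illustration of splitting into chunks of a requested number of sequences, here is a minimal sketch. It is not the code from the changeset above: `split_fasta` is a hypothetical name, it works on an iterable of lines rather than on Galaxy dataset objects, and it assumes `chunk_size` is at least 1.

```python
def split_fasta(lines, chunk_size):
    """Yield lists of lines, each holding at most chunk_size FASTA records.

    A record starts at a line beginning with '>' and runs until the next
    such line, so multi-line sequences stay together in one chunk.
    """
    chunk = []
    count = 0
    for line in lines:
        if line.startswith(">"):
            if count == chunk_size:
                # Current chunk is full; emit it before starting this record.
                yield chunk
                chunk = []
                count = 0
            count += 1
        chunk.append(line)
    if chunk:
        # Emit the final, possibly shorter, chunk.
        yield chunk
```

Splitting into a given number of pieces instead would need the total record count up front, which is why knowing the number of sequences in the dataset matters.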
Cool! Seems like a perfectly fine start. I guess you could grab the # of sequences from the dataset somehow (I'm guessing that is set upon import into Galaxy).
I also need to look at merging multiple BLAST XML outputs, but this is looking promising.
Peter
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
chris
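To illustrate why plain concatenation fails: each BLAST XML file is a complete document with its own `BlastOutput` root, so joining two files gives two roots, which is not well-formed XML. A minimal sketch of one alternative, using only the standard library, is to keep the header of the first file and append the `Iteration` elements from the rest. The function name `merge_blast_xml` is hypothetical, the inputs are assumed to be XML text from the same program and database, and the DOCTYPE line is not preserved by this approach.

```python
import xml.etree.ElementTree as ET


def merge_blast_xml(xml_texts):
    """Merge several BLAST XML documents into one XML string.

    Keeps the header of the first document and appends every Iteration
    element from the others, then renumbers Iteration_iter-num from 1.
    """
    roots = [ET.fromstring(text) for text in xml_texts]
    merged = roots[0]
    target = merged.find("BlastOutput_iterations")
    for root in roots[1:]:
        for iteration in root.findall("BlastOutput_iterations/Iteration"):
            target.append(iteration)
    # Renumber so the iteration counters are unique in the merged file.
    for number, iteration in enumerate(target.findall("Iteration"), start=1):
        counter = iteration.find("Iteration_iter-num")
        if counter is not None:
            counter.text = str(number)
    return ET.tostring(merged, encoding="unicode")
```

Loading whole files into memory like this is only reasonable for modest outputs; a streaming merge would be a better fit for large results.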