Re: [galaxy-dev] Splitting large jobs over multiple nodes/CPUs?

17 Feb 2012


      On Feb 16, 2012, at 12:24 PM, Peter Cock wrote:
...
On Thu, Feb 16, 2012 at 4:28 PM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
...
Hi Dan,
I think I need a little more advice - what is the role of the script
scripts/extract_dataset_part.py and the JSON files created
when splitting FASTQ files in lib/galaxy/datatypes/sequence.py,
and then used by the class' process_split_file method?
Why is there no JSON file created by the base data class in
lib/galaxy/datatypes/data.py and no method process_split_file?
Is the JSON thing part of a partial and unfinished rewrite of the
splitter code?
On the assumption that not all splitters bother with the JSON,
I am trying a little hack to scripts/extract_dataset_part.py to
abort silently if there is no JSON file:
https://bitbucket.org/peterjc/galaxy-central/changeset/ebe94a2c25c3
This seems to be working with my current attempt at a FASTA
splitter (not checked in yet, only partly implemented and tested).
I've checked in my FASTA splitting, which now seems to be
working OK with my BLAST tests. So far this only does splitting
into chunks of the requested number of sequences, rather than
the option to split the whole file into a given number of pieces.
https://bitbucket.org/peterjc/galaxy-central/changeset/416c961c0da9
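
For readers following along, the "chunks of the requested number of sequences" idea can be sketched roughly as below. This is not the code from the changeset above. The function name `split_fasta` and its arguments are hypothetical, and it only assumes that each FASTA record starts with a ">" header line.

```python
def split_fasta(lines, chunk_size):
    """Yield lists of lines, each holding up to chunk_size FASTA records.

    Hypothetical helper for illustration; not part of Galaxy.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    chunk = []
    count = 0  # number of records in the current chunk
    for line in lines:
        if line.startswith(">"):
            # A new record begins; flush the chunk first if it is already full.
            if count == chunk_size:
                yield chunk
                chunk = []
                count = 0
            count += 1
        chunk.append(line)
    if chunk:
        # The last chunk may hold fewer than chunk_size records.
        yield chunk
```

Since it is a generator over lines, it can be fed an open file handle directly, so the whole file never needs to be held in memory. Splitting into a given number of pieces instead would need the total record count up front.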
Cool! Seems like a perfectly fine start. I guess you could grab the number of sequences from the dataset somehow (I'm guessing that is set upon import into Galaxy).
...
I also need to look at merging multiple BLAST XML outputs, but
this is looking promising.
Peter
Yep, that's definitely one where a simple concatenation wouldn't work (though NCBI used to think so, years ago…)
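
To illustrate why plain concatenation fails: each BLAST XML file is a complete document with its own `BlastOutput` root, so gluing files together gives several roots and no longer parses. One possible approach, sketched here under assumptions, keeps the first document's header and appends the `Iteration` elements from the others into its `BlastOutput_iterations` block. The function name `merge_blast_xml` is hypothetical. The sketch does not renumber `Iteration_iter-num`, does not reconcile per-file statistics, and drops the DOCTYPE line on output.

```python
import xml.etree.ElementTree as ET


def merge_blast_xml(xml_texts):
    """Merge BLAST XML documents (given as strings) into one document.

    Hypothetical helper for illustration. The header is taken from the
    first document; iterations from later documents are appended as-is.
    """
    merged_root = None
    merged_iterations = None
    for text in xml_texts:
        root = ET.fromstring(text)
        iterations = root.find("BlastOutput_iterations")
        if iterations is None:
            raise ValueError("no BlastOutput_iterations element found")
        if merged_root is None:
            # First document: keep its root, header fields and iterations.
            merged_root = root
            merged_iterations = iterations
        else:
            # Later documents: move only their Iteration children across.
            merged_iterations.extend(list(iterations))
    if merged_root is None:
        raise ValueError("no input documents given")
    return ET.tostring(merged_root, encoding="unicode")
```

This assumes all parts came from the same program, database and parameters, which should hold when they are pieces of one split job.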

chris

Fields, Christopher J