Are those four tools being used on Galaxy Main already with
this basic parallelism in place?

Main still runs these jobs in the standard non-split fashion. As a resource that is occasionally saturated (and thus doesn't necessarily have spare capacity to parallelize onto), it will probably continue doing so as long as splitting the files carries significant overhead. Fancy scheduling could minimize the issue, but as it stands, splitting during heavy load would actually lower total throughput because of that overhead.
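
To put rough numbers on that, here is a back-of-the-envelope sketch. The core count, job cost and overhead are made-up assumptions, not measurements from Main, and jobs_per_hour is a hypothetical helper. The point is only that on a saturated cluster every CPU-minute spent splitting and merging comes straight out of throughput.

```python
# Back-of-the-envelope model; every number here is an illustrative assumption.

def jobs_per_hour(cores, cpu_minutes_per_job, overhead_minutes=0.0):
    """Throughput of a saturated cluster, i.e. all cores busy all the time."""
    total_cpu_minutes = cpu_minutes_per_job + overhead_minutes
    return cores * 60.0 / total_cpu_minutes

# Same hypothetical 64-core cluster, with and without split/merge overhead.
unsplit = jobs_per_hour(cores=64, cpu_minutes_per_job=60)
split = jobs_per_hour(cores=64, cpu_minutes_per_job=60, overhead_minutes=12)

print("unsplit: %.1f jobs/hour, split: %.1f jobs/hour" % (unsplit, split))
```

Splitting still shortens the wall-clock time of an individual job when idle cores are available, which is why it is attractive on a lightly loaded instance.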

Looking at the code in lib/galaxy/jobs/splitters/basic.py, its
comments suggest it only works on tools with one input and
one output file (although that seems a bit fuzzy, as you could
be using BWA with a FASTA history item as the reference -
would that fail?).

I haven't tried it, but it would probably fail.

I also see interesting things in lib/galaxy/jobs/splitters/multi.py.
Is that even more experimental? It looks like it could be used
to say BWA's read file was to be split, but the reference file
shared.

Yes.
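
The idea of splitting one input while sharing another can be sketched roughly as below. This is not the code in multi.py. split_fastq_records and plan_tasks are hypothetical names, and the sketch assumes plain 4-line FASTQ records with no wrapped sequence lines.

```python
from itertools import islice

def split_fastq_records(lines, records_per_chunk):
    """Yield lists of lines, each holding up to records_per_chunk 4-line records."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, records_per_chunk * 4))
        if not chunk:
            return
        yield chunk

def plan_tasks(read_lines, reference_path, records_per_chunk):
    """Pair every chunk of reads with the same, unsplit reference."""
    return [
        {"reads": chunk, "reference": reference_path}
        for chunk in split_fastq_records(read_lines, records_per_chunk)
    ]

# Five toy reads, split two records at a time against one shared reference.
reads = []
for i in range(5):
    reads += ["@read%d" % i, "ACGT", "+", "IIII"]

tasks = plan_tasks(reads, "reference.fasta", records_per_chunk=2)
print(len(tasks), [len(t["reads"]) // 4 for t in tasks])
```

Each task would then run the tool on its own chunk of reads, with every task pointing at the same reference file.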

Regarding the merging of the output, I see there is a default merge
method in lib/galaxy/datatypes/data.py which just concatenates
the files. I am surprised at that - it seems like a very bad idea in
general - consider many binary files, or XML. Why not put this
as the default for text and subclasses thereof?

I can't think of a better reasonable default behavior for "Data", though you're obviously right that each datatype subclass will need to define particular behaviors for merging files.
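
As a rough illustration of what "each subclass defines its own merge" could look like, here is a self-contained sketch. These are stand-in classes, not the ones in lib/galaxy/datatypes/data.py. The merge(split_files, output_file) signature and the SimpleXml class are assumptions for the example, and the XML merge assumes all parts share the same root element.

```python
import shutil
import xml.etree.ElementTree as ET

class Data:
    @staticmethod
    def merge(split_files, output_file):
        """Default behavior: concatenate the parts byte for byte."""
        with open(output_file, "wb") as out:
            for part in split_files:
                with open(part, "rb") as handle:
                    shutil.copyfileobj(handle, out)

class Text(Data):
    # Inherits concatenation, which suits line-oriented text.
    pass

class SimpleXml(Data):
    @staticmethod
    def merge(split_files, output_file):
        """Move the children of every part under the first part's root."""
        trees = [ET.parse(part) for part in split_files]
        root = trees[0].getroot()
        for tree in trees[1:]:
            root.extend(list(tree.getroot()))
        trees[0].write(output_file)
```

Concatenating two XML parts would give a file with two root elements, which is not well formed, so a format like that needs its own override. Binary formats would need something format-specific along the same lines.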

OK then, I hope to have a play with this shortly.

Good luck, let me know how it goes, and again - contributions are certainly welcome :)

-Dannon