i actually think illumina's pipeline produces files in this format (unaligned BAM) now.
Oh, do they? That's interesting. Do you have a reference/link?
i caught wind of this at the recent illumina users' conference, but when i asked someone in our sequencing team to confirm, he hadn't heard of it. it must be limited to the forthcoming miseq sequencer for the time being, but it may make its way to the big sequencers later. apparently illumina is thinking about storage as well. i seem to recall the speaker saying they won't produce srf files anymore, but again, this was a talk about the miseq so it may not apply to the other sequencers.
wrappers that create a temporary fastq file would need to be written, but that's easy enough.
My argument against that is that the cost of going from BAM to a temp FASTQ may be prohibitive: the need to generate very large temp FASTQ files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools, you need to take BAM. In some cases at least you can pipe the gzipped FASTQ into alignment tools that accept FASTQ on stdin, so there is no temp file per se.
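To make the piping idea concrete, here is a minimal Python sketch of streaming a gzipped FASTQ into a tool's stdin without ever writing a decompressed temp file. The filenames and data are made up, and `cat` stands in for an aligner that accepts FASTQ on stdin; with a real aligner you would substitute its command line.

```python
import gzip
import io
import subprocess

# Synthetic single-record FASTQ, gzipped in memory to stand in for reads.fastq.gz.
fastq = b"@read1\nACGTACGT\n+\nIIIIIIII\n"
compressed = gzip.compress(fastq)

# 'cat' is a stand-in for an alignment tool that reads FASTQ on stdin.
# The decompressed bytes flow straight into the child process; nothing
# is written to disk along the way.
proc = subprocess.Popen(["cat"], stdin=subprocess.PIPE, stdout=subprocess.PIPE)
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as fh:
    out, _ = proc.communicate(fh.read())

assert out == fastq  # the tool received the plain-text FASTQ
```

The shell equivalent is just `zcat reads.fastq.gz | some_aligner ...` for any tool that takes reads on stdin.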
the tools really do need to support the format; the tmpfile was simply a workaround. some tools already support bam; more currently support fastq.gz. (someone here made the wrong bet years ago and adopted a site-wide fastq.bz2 standard, which only recently changed to fastq.gz.) if illumina does start producing bam files in the future, then we can expect more tools to support that format. until they do, fastq.gz is probably the safe bet. of course there is a computational cost to compressing/uncompressing files, but that's probably better than storing unnecessarily huge files. it's a trade-off.

similarly, there's a trade-off in limiting read qc to one or a few big tools that wrap several tools with many options. users can't play around with read qc, but letting them may be too expensive (computationally and storage-wise). for the most part, a standard qc will do. one can spend a lot of time and effort squeezing a bit more useful data out of a bad library, for example, when one probably should have just sequenced another library. i favor leaving the playing around to the r&d/development/qc team and just offering a canned/vetted qc solution to the average user.
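The storage side of that trade-off is easy to illustrate with synthetic data (not a benchmark): FASTQ text is highly redundant, so gzip typically shrinks it several-fold at the cost of one compress/decompress pass per use.

```python
import gzip

# Build a synthetic FASTQ of 1000 records; real read data is less uniform
# than this, so real-world ratios will be smaller, but the direction holds.
raw = b"".join(
    b"@read%d\nACGTACGTACGTACGTACGTACGTACGTACGT\n+\n%s\n" % (i, b"I" * 32)
    for i in range(1000)
)

gz = gzip.compress(raw, compresslevel=6)

assert len(gz) < len(raw) // 4        # substantial storage savings
assert gzip.decompress(gz) == raw     # and the compression is lossless
```

The CPU cost of the round-trip is what you pay every time a tool needs the plain-text form, which is why stdin piping (above) or native fastq.gz support in tools matters.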
I recall HDF5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Is anyone using the latter two? Moving from the custom BGZF modified-gzip format used in BAM to HDF5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof-of-principle implementation too in BioHDF: http://www.hdfgroup.org/projects/biohdf/ The SAM/BAM group didn't seem overly enthusiastic, though. For NCBI's .sra format there is no open specification, just their public-domain source code: http://seqanswers.com/forums/showthread.php?t=12054
i believe hdf5 is an indexed data structure which, as you mentioned, isn't required for unprocessed reads. since i'm rapidly running out of storage, i think the best immediate solution for me is to deprecate all the fastq datatypes in favor of a new fastqsangergz and to bundle the read qc tools to eliminate intermediate files. sure, users won't be able to play around with their data as much, but my disk is 88% full and my cluster has been 100% occupied for two months straight, so less choice is probably better.
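For anyone unfamiliar with why the datatype is called fastqsanger: the Sanger FASTQ convention encodes each base's Phred quality as `ord(char) - 33`. A tiny sketch of that decoding (the function name is mine, not from any library):

```python
def sanger_quals(qual_line):
    """Decode a Sanger-encoded (Phred+33) FASTQ quality string to Phred scores."""
    return [ord(c) - 33 for c in qual_line]

assert sanger_quals("!") == [0]   # '!' is the lowest printable offset, Phred 0
assert sanger_quals("I") == [40]  # 'I' encodes Phred 40
```

Standardizing on one quality encoding before gzipping everything avoids keeping Solexa/Illumina-1.3 variants around that tools would otherwise have to detect.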