disk space and file formats
I'm not a bioinformaticist or programmer, so apologies if this is a silly question. I've been occasionally running Galaxy on my laptop and on the public server and I love it. The issue I have is that my workflow requires many steps (what I do is probably very unusual). Each step creates a new large fastq file as the sequences are iteratively trimmed of junk. This fills my laptop and the public server with lots of unnecessary, very large files.

I've been thinking about the structure of the files and my workflow, and it seems to me that a more space-efficient system would be a single file (or a SQL database) on which each tool can work. Most of what I do is remove adapter sequences, extract barcodes, trim by quality, map to the genome, and then process my hits by type (exon, intron, etc.). Since the clean-up tools in FASTX aren't written with my problem in mind, it takes several passes to get the sequences trimmed up before mapping.

Suppose I had a file with a format something like this (here as tab-delimited):

Header  Seq  Phred  Start  Len  Barcode  etc

Each tool could read the Seq and Phred starting at Start and running Len nucleotides and work on that. The tool could then write a new Start and Len to reflect the trimming it has done [1]. For convenience let me call this an HSPh format.

It would be a real pain, no doubt, to rewrite all the tools. From the little I can read of the tools, the way input is handled internally varies quite a bit. But it seems to me (naively?) that it would be relatively easy to write a conversion tool that turns the HSPh format into fastq or fasta on the fly for the tools. Since most tools take fastq or fasta, it should be a write-once, use-many-times plugin. The harder (and slower) part would be mapping the fastq output back onto the HSPh format, but again, this should be a write-once, use-for-many-tools plugin. Both of the intermediate files would be deleted when done. As a quick test I timed sed on a 1.35 GB fastq file, and it was so fast on my laptop (under 2 minutes) that it was done before I noticed.

Then, as people become interested, the tools could be converted to take the new format as input.

It may well be true that in these days of $100 terabyte drives this isn't useful, and that cycles, not drive space, are limiting. But I think that if the tools were rewritten to read and write an HSPh format, processing would be faster too. It seems like some effort has been made to create the tab-delimited format, and maybe someone is already working on something like this (no doubt better designed).

I may have a comp sci undergrad working in the lab this fall. With help we (well, he) might manage some parts of this; he is apparently quite a talented and hard-working C++ programmer. Is it worthwhile?

thanks

[1] It could even do something like:

Header  Seq  Phred  Start  Len  Tool  Parameter  Start  Len  Tool  Parameter  Start  Len  etc

Tool is the tool name, Parameter a list of the parameters used, and Start and Len the latest trim positions. The last Start/Len pair would be the one used by default by the next tool, but this would keep an edit history without doubling the space needs with each processing cycle. I wouldn't need this, but it might be friendlier for users: an "undo" just means removing four columns. A format like this would probably be better as a SQL database.
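As an illustration of the conversion idea above - not an existing tool - here is a minimal Python sketch that turns the hypothetical HSPh layout (tab-delimited Header/Seq/Phred/Start/Len columns, all names assumed) into FASTQ on the fly:

import csv
import sys

def hsph_to_fastq(hsph_path, out=sys.stdout):
    """Emit a FASTQ record for the current trim window of each HSPh row.

    Assumes a tab-delimited file whose first line names the columns
    Header, Seq, Phred, Start (0-based) and Len; extra columns such as
    Barcode are simply ignored.
    """
    with open(hsph_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            window = slice(int(row["Start"]), int(row["Start"]) + int(row["Len"]))
            out.write("@%s\n%s\n+\n%s\n" % (
                row["Header"], row["Seq"][window], row["Phred"][window]))

if __name__ == "__main__":
    hsph_to_fastq(sys.argv[1])

The reverse direction (writing a new Start/Len back after trimming) would be the slower step mentioned above, since trimmed FASTQ records have to be matched back to their rows.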
Hi Patrick,

The issue you are having is partly related to Galaxy's goal of ensuring reproducible science by saving each intermediate step and output file. For example, in your current workflow in Galaxy you can easily do something else with each intermediate file - feed it to a different tool just to check what the average read length is after filtering - and you can do that even two months after your run.

If you do, however, insist on keeping disk usage low, don't want to start programming - as your proposed solutions would require - and aren't too afraid of the command line, you might want to start there. The thing is, a lot of tools accept either an input file or an input stream, and can likewise write to either an output file or an output stream. This way you can "pipe" the tools together, e.g.:

trimMyFq -i rawinput.fq | removebarcode -i - -n optionN | filterJunk -i - -o finalOutput.fq

I don't know which programs you actually use, but the principle is probably the same (as long as the tools actually accept streams). This example saves you disk space because, of the three tools run, only one actually writes to disk. On the downside, it also means you don't have an output file from removebarcode that you can look at to see if everything went OK.

If you do want to program, or someone else wants to do it, I could imagine a tool that combines your iterative steps and runs as one tool - you could even wrap up your 'pipeline' in a script and put that as a tool in your Galaxy instance and/or in the Tool Shed.

Cheers,
Jelle
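For anyone who would rather keep such a pipeline as a single wrapper script (e.g. to register it as one Galaxy tool), here is a minimal Python sketch of the same idea using subprocess; the three tool names are just the placeholders from the example above, not real programs:

import subprocess

def run_pipeline(raw_fastq, final_output):
    """Chain the three (placeholder) tools so only the last one writes to disk."""
    trim = subprocess.Popen(["trimMyFq", "-i", raw_fastq],
                            stdout=subprocess.PIPE)
    debar = subprocess.Popen(["removebarcode", "-i", "-", "-n", "optionN"],
                             stdin=trim.stdout, stdout=subprocess.PIPE)
    trim.stdout.close()   # let trimMyFq see a broken pipe if removebarcode exits early
    filt = subprocess.Popen(["filterJunk", "-i", "-", "-o", final_output],
                            stdin=debar.stdout)
    debar.stdout.close()
    filt.wait()
    debar.wait()
    trim.wait()

if __name__ == "__main__":
    run_pipeline("rawinput.fq", "finalOutput.fq")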
Read QC intermediate files account for most of the storage used on our Galaxy site, and it's a real problem that I must solve soon.

My first attempt at taming the beast was to create a single read QC tool that did such things as convert quality encoding, quality-end trimming, etc. (very basic functions). Such a tool could simply be a wrapper around your favorite existing tools, but it doesn't keep the intermediate files. The added benefit is that it runs faster because it only has to queue onto the cluster once. Sure, one might argue that it's nice to have all the intermediate files just in case you wish to review them, but in practice I have found this happens relatively infrequently and is too expensive. If you're a small lab maybe that's fine, but if you generate a lot of sequence, a more production-line approach is reasonable.

I've also been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is Sanger-encoded and gzipped. I think gzipped reads files are about 1/4 the size of the unpacked version. Of course, many tools will require a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads). However, the import tool automatically uncompresses uploaded files, so I'd need to do some hacking there to prevent this.

Heck, what we really need is a nice compact binary format for reads, perhaps one which doesn't even store ids (although pairing would need to be recorded). Thoughts?
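A sketch of the kind of trivial wrapper described above, assuming a downstream tool that reads uncompressed FASTQ on stdin (the tool command line here is hypothetical); only the gzipped copy ever sits on disk:

import gzip
import shutil
import subprocess
import sys

def run_on_gzipped_fastq(tool_argv, fastq_gz):
    """Stream a gzipped FASTQ into a tool that only understands plain FASTQ.

    tool_argv is whatever command the wrapped tool expects (assumed to read
    stdin); the decompression happens in the stream, so no uncompressed copy
    is written to disk.
    """
    with gzip.open(fastq_gz, "rb") as reads:
        proc = subprocess.Popen(tool_argv, stdin=subprocess.PIPE)
        shutil.copyfileobj(reads, proc.stdin)
        proc.stdin.close()
        return proc.wait()

if __name__ == "__main__":
    # e.g. python gz_wrapper.py reads.fastq.gz some_qc_tool --phred33 -
    sys.exit(run_on_gzipped_fastq(sys.argv[2:], sys.argv[1]))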
On Thu, Sep 1, 2011 at 11:02 PM, Edward Kirton <eskirton@lbl.gov> wrote:
Read QC intermediate files account for most of the storage used on our galaxy site. And it's a real problem that I must solve soon. My first attempt at taming the beast was to try to create a single read QC tool that did such things as convert qual encoding, qual-end trimming, etc. (very basic functions). Such a tool could simply be a wrapper around your favorite existing tools, but doesn't keep the intermediate files. The added benefit is that it runs faster because it only has to queue onto the cluster once. Sure, one might argue that it's nice to have all the intermediate files just in case you wish to review them, but in practice, I have found this happens relatively infrequently and is too expensive. If you're a small lab maybe that's fine, but if you generate a lot of sequence, a more production-line approach is reasonable.
Sounds very sensible if you have some frequently repeated multistep analyses.
I've been toying with the idea of replacing all the fastq datatypes with a single fastq datatype that is sanger-encoded and gzipped. I think gzipped reads files are about 1/4 of the unpacked version. Of course, many tools will require a wrapper if they don't accept gzipped input, but that's trivial (and many already support compressed reads). However the import tool automatically uncompressed uploaded files so I'd need to do some hacking there to prevent this.
Hmm. Probably there are some tasks where a gzip'd FASTQ isn't ideal, but for the fairly typical case of iterating over the records it should be fine.
Heck, what we really need is a nice compact binary format for reads, perhaps which doesn't even store ids (although pairing would need to be recorded). Thoughts?
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.

Peter
What, like a BAM file of unaligned reads? Uses gzip compression, and tracks the pairing information explicitly :) Some tools will already take this as an input format, but not all.
Ah, yes, precisely. I actually think Illumina's pipeline produces files in this format now. Wrappers which create a temporary fastq file would need to be created, but that's easy enough.
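For anyone wanting to experiment with this, here is a sketch of both directions using pysam (a library assumed to be installed; it is not something the tools in this thread require): packing FASTQ into an unaligned BAM, and dumping it back out for the temporary-FASTQ wrappers described above.

import pysam

def fastq_to_unaligned_bam(fastq_path, bam_path):
    """Pack a FASTQ file into an unaligned BAM (flag 4, no @SQ lines)."""
    header = {"HD": {"VN": "1.0", "SO": "unsorted"}}
    with pysam.FastxFile(fastq_path) as reads, \
         pysam.AlignmentFile(bam_path, "wb", header=header) as bam:
        for rec in reads:
            seg = pysam.AlignedSegment()
            seg.query_name = rec.name
            seg.query_sequence = rec.sequence
            seg.query_qualities = pysam.qualitystring_to_array(rec.quality)
            seg.flag = 4                      # unmapped
            bam.write(seg)

def unaligned_bam_to_fastq(bam_path, fastq_out):
    """The temporary-FASTQ wrapper direction: dump reads back out as FASTQ."""
    with pysam.AlignmentFile(bam_path, "rb", check_sq=False) as bam, \
         open(fastq_out, "w") as out:
        for read in bam:
            out.write("@%s\n%s\n+\n%s\n" % (
                read.query_name,
                read.query_sequence,
                pysam.qualities_to_qualitystring(read.query_qualities)))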
On Sep 2, 2011, at 3:02 PM, Edward Kirton wrote:
ah, yes, precisely. i actually think illumina's pipeline produces files in this format now. wrappers which create a temporary fastq file would need to be created but that's easy enough.
My argument against that is that the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway. One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.

Or is the idea that the file itself is modified, like a database? And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?

I recall HDF5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two?

chris
On Fri, Sep 2, 2011 at 9:27 PM, Fields, Christopher J <cjfields@illinois.edu> wrote:
ah, yes, precisely. i actually think illumina's pipeline produces files in this format now.
Oh do they? - that's interesting. Do you have a reference/link?
wrappers which create a temporary fastq file would need to be created but that's easy enough.
My argument against that is the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools you need so that they take BAM. In some cases at least you can pipe the gzipped FASTQ into alignment tools which accept FASTQ on stdin, so there is no temp file per se.
One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.
Yes, but that means inventing yet another file format. At least gzipped FASTQ is quite straightforward.
Or is the idea that the file itself is modified, like a database?
That would be quite a dramatic change from the current Galaxy workflow system - I doubt that would be acceptable in general.
And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?
BAM indexing as done in samtools/picard is only for the aligned reads - so no help for a BAM file of unaligned reads. You could use a different indexing system (e.g. by read name) and the same BAM BGZF block offset system (I've tried this as an experiment with Biopython's SQLite indexing of sequence files). However, for tasks taking unaligned reads as input, you generally just iterate over the reads in the order on disk.
I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two?
Moving from the custom BGZF modified gzip format used in BAM to HDF5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof-of-principle implementation too in BioHDF, http://www.hdfgroup.org/projects/biohdf/ - the SAM/BAM group didn't seem overly enthusiastic though.

For the NCBI's .sra format, there is no open specification, just their public domain source code: http://seqanswers.com/forums/showthread.php?t=12054

Regards,

Peter
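As a small sketch of the read-name indexing mentioned above, Biopython's SQLite-backed SeqIO.index_db can key a reads file by id without loading it into memory; the file names and the example read id here are placeholders:

from Bio import SeqIO

# Build (or reopen) an on-disk index keyed by read name; works on plain
# files and on BGZF-compressed ones.
reads = SeqIO.index_db("reads.idx", "reads.fastq", "fastq")

record = reads["HWI-EAS209_0006:5:58:5894:21141#ATCACG/1"]   # hypothetical read id
print(record.seq, record.letter_annotations["phred_quality"][:10])

# Sequential iteration, the common case for unaligned reads, needs no index at all:
for record in SeqIO.parse("reads.fastq", "fastq"):
    pass  # stream through reads in file order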
i actually think illumina's pipeline produces files in this format (unaligned-bam) now.
Oh do they? - that's interesting. Do you have a reference/link?
I caught wind of this at the recent Illumina user's conference, but I asked someone in our sequencing team to confirm and he hadn't heard of this. It must be limited to the forthcoming MiSeq sequencer for the time being, but may make its way to the big sequencers later. Apparently Illumina is thinking about storage as well. I seem to recall the speaker saying they won't produce SRF files anymore, but again, this was a talk about the MiSeq so it may not apply to the other sequencers.
wrappers which create a temporary fastq file would need to be created but that's easy enough.
My argument against that is the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools you need to take BAM. In some cases at least you can pipe the gzipped FASTQ into alignment tools which accepts FASTQ on stdin, so there is no temp file per se.
The tools really do need to support the format; the tmpfile was simply a workaround. Some tools already support BAM, and more currently support fastq.gz. (Someone here made the wrong bet years ago and had adopted a site-wide fastq.bz2 standard which only recently changed to fastq.gz.) But if Illumina does start producing BAM files in the future, then we can expect more tools to support that format. Until they do, fastq.gz is probably a safe bet. Of course there is a computational cost to compressing/uncompressing files, but that's probably better than storing unnecessarily huge files. It's a trade-off.

Similarly, there's a trade-off involved in limiting read QC to one or a few big tools which wrap several tools, with many options. Users can't play around with read QC, but that playing around may be too expensive (computationally and storage-wise). For the most part, a standard QC will do. One can spend a lot of time and effort to squeeze a bit more useful data out of a bad library, for example, when they probably should have just sequenced another library. I favor leaving the playing around to the R&D/development/QC team and just offering a canned/vetted QC solution to the average user.
I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two? Moving from the custom BGZF modified gzip format used in BAM to HD5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof of principle implementation too in BioHDF, http://www.hdfgroup.org/projects/biohdf/ The SAM/BAM group didn't seem overly enthusiastic though. For the NCBI's .sra format, there is no open specification, just their public domain source code: http://seqanswers.com/forums/showthread.php?t=12054
I believe HDF5 is an indexed data structure which, as you mentioned, isn't required for unprocessed reads.

Since I'm rapidly running out of storage, I think the best immediate solution for me is to deprecate all the fastq datatypes in favor of a new fastqsangergz and to bundle the read QC tools to eliminate intermediate files. Sure, users won't be able to play around with their data as much, but my disk is 88% full and my cluster has been 100% occupied for two months straight, so less choice is probably better.
On Saturday, September 3, 2011, Edward Kirton <eskirton@lbl.gov> wrote:
of course there is a computational cost to compressing/uncompressing files but that's probably better than storing unnecessarily huge files. it's a trade-off.
It may still be faster due to less IO; it probably depends on your hardware.
since i'm rapidly running out of storage, i think the best immediate solution for me is to deprecate all the fastq datatypes in favor of a new fastqsangergz and to bundle the read qc tools to eliminate intermediate files. sure, users won't be able to play around with their data as much, but my disk is 88% full and my cluster has been 100% occupied for 2-months straight, so less choice is probably better.
In your position I agree that is a pragmatic choice. You might be able to modify the file upload code to gzip any FASTQ files... that would prevent uncompressed FASTQ getting into new histories.

I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc.) datatype? However, this seems generally useful (not just for FASTQ), so perhaps a more general mechanism would be better, where tool XML files can say which file types they accept and which of those can/must be compressed (possibly not just gzip format?).

Peter
In your position I agree that is a pragmatic choice.
Thanks for helping me muddle through my options.
You might be able to modify the file upload code to gzip any FASTQ files... that would prevent uncompressed FASTQ getting into new histories.
Right!
I wonder if Galaxy would benefit from a new fastqsanger-gzip (etc) datatype? However this seems generally useful (not just for FASTQ) so perhaps a more general mechanism would be better where tool XML files can say which file types they accept and which of those can/must be compressed (possily not just gzip format?).
Perhaps we can flesh out what more general solutions would look like...

Imagine the fastq datatypes were left alone and instead there's a mechanism by which files which haven't been used as input for X days get compressed by a cron job. The file server knows how to uncompress such files on the fly when needed. For the most part, files are uncompressed during analysis and are compressed when they exist as an archive within Galaxy.

An even simpler solution would be an archive/compress button which users could use when they're done with a history. Users could still copy (uncompressed) datasets into a new history for further analysis.

Of course there's also the solution mentioned at the 2010 Galaxy developers' conference about automatic compression at the system level. Not a possibility for me, but it is attractive.
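A rough sketch of the cron-job half of that idea (the transparent decompression on read is the part that would still need Galaxy-side support); the directory and .dat extension follow Galaxy's default file store, and the 30-day threshold is arbitrary:

import gzip
import os
import shutil
import time

DATASET_DIR = "database/files"   # Galaxy's default file_path; adjust for your instance
MAX_IDLE_DAYS = 30               # compress datasets untouched for this long

def compress_idle_datasets(root=DATASET_DIR, max_idle_days=MAX_IDLE_DAYS):
    """Gzip .dat files that have not been read or modified recently."""
    cutoff = time.time() - max_idle_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".dat"):
                continue
            path = os.path.join(dirpath, name)
            if max(os.path.getatime(path), os.path.getmtime(path)) > cutoff:
                continue
            with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)   # keep only the compressed copy

if __name__ == "__main__":
    compress_idle_datasets()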
Edward Kirton wrote:
Perhaps we can flesh-out what more general solutions would look like...
Imagine the fastq datatypes were left alone and instead there's a mechanism by which files which haven't been used as input for x days get compressed by a cron job. the file server knows how to uncompress such files on the fly when needed. For the most part, files are uncompressed during analysis and are compressed when the files exist as an archive within galaxy.
Ideally, there'd just be a column on the dataset table indicating whether the dataset is compressed or not, and then tools get a new way to indicate whether they can directly read compressed inputs, or whether the input needs to be decompressed first.

--nate
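A sketch of how a job runner might use such a column together with a per-tool flag; both names here (the per-dataset compression value and tool_reads_gzip) are made up for illustration:

import gzip
import shutil
import tempfile

def path_for_tool(dataset_path, dataset_compression, tool_reads_gzip):
    """Decide what path to hand a tool.

    dataset_compression: None or "gzip" (the hypothetical dataset-table column).
    tool_reads_gzip: would come from a new flag in the tool's XML.
    Returns (path, needs_cleanup).
    """
    if dataset_compression is None or tool_reads_gzip:
        return dataset_path, False
    # Tool can't read gzip directly: decompress to a temporary file first.
    tmp = tempfile.NamedTemporaryFile(suffix=".fastq", delete=False)
    with gzip.open(dataset_path, "rb") as src:
        shutil.copyfileobj(src, tmp)
    tmp.close()
    return tmp.name, True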
On Tue, Sep 6, 2011 at 3:24 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Ideally, there'd just be a column on the dataset table indicating whether the dataset is compressed or not, and then tools get a new way to indicate whether they can directly read compressed inputs, or whether the input needs to be decompressed first.
--nate
Yes, that's what I was envisioning, Nate. Are there any schemes other than gzip which would make sense? Perhaps rather than a boolean column (compressed or not), it should specify the kind of compression, if any (e.g. gzip). We need something which balances compression efficiency (size) with decompression speed, while also being widely supported in libraries for maximum tool uptake.

Peter
Peter Cock wrote:
Yes, that's what I was envisioning Nate.
Are there any schemes other than gzip which would make sense? Perhaps rather than a boolean column (compressed or not), it should specify the kind of compression if any (e.g. gzip).
Makes sense.
We need something which balances compression efficiency (size) with decompression speed, while also being widely supported in libraries for maximum tool uptake.
Yes, and there's a side effect of allowing this: you may decrease efficiency if the tools used downstream all require decompression, and you waste a bunch of time decompressing the dataset multiple times.

--nate
On Tue, Sep 6, 2011 at 5:00 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Yes, and there's a side effect of allowing this: you may decrease efficiency if the tools used downstream all require decompression, and you waste a bunch of time decompressing the dataset multiple times.
While decompression wastes CPU time and makes things slower, there is less data IO from disk (which may be network mounted), which makes things faster. So overall, depending on the setup and the task at hand, it could be faster.

Is it time to file an issue on bitbucket to track this potential enhancement?

Peter
Peter Cock wrote:
Is it time to file an issue on bitbucket to track this potential enhancement?
Sure.
On Tue, Sep 6, 2011 at 5:12 PM, Nate Coraor <nate@bx.psu.edu> wrote:
Is it time to file an issue on bitbucket to track this potential enhancement?
Sure.
Issue filed, better late than never: https://bitbucket.org/galaxy/galaxy-central/issue/666/

This was prompted by another thread where gzipped FASTQ was proposed: http://lists.bx.psu.edu/pipermail/galaxy-dev/2011-October/007069.html

Peter
Copied from another thread:

On Thu, Sep 8, 2011 at 7:30 AM, Anton Nekrutenko <anton@bx.psu.edu> wrote:
What we are thinking of lately is switching to unaligned BAM for everything. One of the benefits here is the ability to add readgroups from day 1, simplifying multisample analyses down the road.
This seems to be the simplest solution; I like it a lot. Really, only the reads need to be compressed; most other output files are tiny by comparison, so a more general solution may be overkill. And if compression of everything is desired, ZFS works well - another of our sites (LANL) uses this and recommended it to me too. I just haven't been able to convince my own IT people to go this route, for technical reasons beyond my attention span.
The use of (unaligned) BAM for read groups seems like a good idea. At the very least it prevents inconsistently hacking this information into the FASTQ descriptor (a common problem with any simple format).

chris
On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:
wrappers which create a temporary fastq file would need to be created but that's easy enough.
My argument against that is the cost of going from BAM -> temp fastq may be prohibitive, e.g. the need to generate very large temp fastq files on the fly as input for various applications may lead one back to just keeping a permanent FASTQ around anyway.
True - if you can't update the tools you need to take BAM. In some cases at least you can pipe the gzipped FASTQ into alignment tools which accepts FASTQ on stdin, so there is no temp file per se.
Some applications (Velvet for instance) accept gzipped FASTQ, though they may turn around and dump the data out uncompressed.
One could probably get better performance out of a simpler format that removes most of the 'AM' parts of BAM.
Yes, but that meaning inventing yet another file format. At least gzipped FASTQ is quite straightforward.
Yes.
Or is the idea that the file itself is modified, like a database?
That would be quite a dramatic change from the current Galaxy workflow system - I doubt that would be acceptable in general.
My thought as well.
And how would indexing work (BAM uses binning on the match to the reference seq), or does it matter?
BAM indexing as done in samtools/picard is only for the aligned reads - so no help for a BAM file of unaligned reads. You could use a different indexing system (e.g. by read name) and the same BAM BGZF block offset system (I've tried this as an experiment with Biopython's SQLite indexing of sequence files).
However, for tasks taking unaligned reads as input, you generally just iterate over the reads in the order on disk.
I think, unless there is a demonstrable advantage to using unaligned BAM, fastq.gz is the easiest.
I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two?
Moving from the custom BGZF modified gzip format used in BAM to HD5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof of principle implementation too in BioHDF, http://www.hdfgroup.org/projects/biohdf/ The SAM/BAM group didn't seem overly enthusiastic though.
Probably not, as it is somewhat a competitor of SAM/BAM (a bit broader in scope, beyond just alignments). As Peter indicated, I know the BioHDF folks (they are here in town); however, my actual question was whether anyone is actually using HDF5 or SRA in production? I haven't seen adoption beyond PacBio, but I have seen some things popping up in Galaxy.
For the NCBI's .sra format, there is no open specification, just their public domain source code: http://seqanswers.com/forums/showthread.php?t=12054
Simply gzipping FASTQ seems to give better compression than a .lite.sra file (and I'm not a happy user of their SRA toolset). And of course there is parallel gzip...

chris
Probably not, as it is somewhat a competitor of SAM/BAM (a bit broader in scope, beyond just alignments). As Peter indicated, I know the BioHDF folks (they are here in town); however, my actual question was whether anyone is actually using HDF5 or SRA in production? I haven't seen adoption beyond PacBio, but I have seen some things popping up in Galaxy.
FWIW, the XSQ format files created by the 5500 Series ABI SOLiD are HDF5, but not BioHDF: http://solidsoftwaretools.com/gf/download/docmanfileversion/309/1079/XSQ_Web...

--
Paul Gordon
Bioinformatics Support Specialist
Alberta Children's Hospital Research Institute
http://www.ucalgary.ca/~gordonp
On Sep 2, 2011, at 8:02 PM, Peter Cock wrote:
ah, yes, precisely. i actually think illumina's pipeline produces files in this format now.
Oh do they? - that's interesting. Do you have a reference/link?
Yeah, here at The Genome Institute at Wash-U, we get Illumina data directly in BAM format and try to avoid fastq conversion. The latest BWA supports a BAM of reads as input, as well as making BAM output. Hopefully most tools will go that way. You can always engineer something with named pipes in the meantime to avoid the read/write to real disk, but that requires some care.
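A sketch of that named-pipe trick, assuming a samtools recent enough to have 'samtools fastq' on the PATH and a downstream tool (hypothetical) that takes a FASTQ path as its last argument:

import os
import subprocess
import tempfile

def bam_to_tool_via_fifo(bam_path, tool_argv):
    """Feed reads from a BAM to a FASTQ-only tool through a named pipe.

    tool_argv is the FASTQ-consuming command line (placeholder), with the
    FIFO path appended as its input file; no real FASTQ ever touches disk.
    """
    workdir = tempfile.mkdtemp()
    fifo = os.path.join(workdir, "reads.fastq")
    os.mkfifo(fifo)
    try:
        # Start the consumer first; the open() below blocks until it reads the pipe.
        reader = subprocess.Popen(tool_argv + [fifo])
        with open(fifo, "wb") as pipe_out:
            writer = subprocess.Popen(["samtools", "fastq", bam_path],
                                      stdout=pipe_out)
            writer.wait()
        return reader.wait()
    finally:
        os.remove(fifo)
        os.rmdir(workdir)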
Or is the idea that the file itself is modified, like a database?
That would be quite a dramatic change from the current Galaxy workflow system - I doubt that would be acceptable in general.
And mutable data structures like that are harder to manage in a high-throughput environment.
I recall hdf5 was planned as an alternate format (PacBio uses it, IIRC), and of course there is NCBI's .sra format. Anyone using the latter two?
Moving from the custom BGZF modified gzip format used in BAM to HD5 has been proposed on the samtools mailing list (as Chris knows), and there is a proof of principle implementation too in BioHDF, http://www.hdfgroup.org/projects/biohdf/ The SAM/BAM group didn't seem overly enthusiastic though.
HDF5 sounds really great, though I don't think PacBio has the data volume to tax it the way Illumina does. There was some speculation that HDF5 would be underneath a new BAM standard, but I don't know the status of that. We did a few experiments in house with BioHDF in its infancy to see how it compared to BAM, and it didn't capture all of the data (it was missing the somewhat critical CIGAR strings at the time)... and we haven't revisited it since then. I'm sure it would be effective for storing reads, but starting your own standard when Illumina makes BAMs will probably not ultimately be as useful as going with BAM format.
For the NCBI's .sra format, there is no open specification, just their public domain source code: http://seqanswers.com/forums/showthread.php?t=12054
This standard is ...complex, with the associated downsides. We only convert things into that format if explicitly required to do so.
Best of luck,
Scott

--
Scott Smith
Manager, Application Programming and Development
Analysis Pipeline
The Genome Institute
Washington University School of Medicine
participants (8)
- Edward Kirton
- Fields, Christopher J
- Jelle Scholtalbers
- Nate Coraor
- Patrick Page-McCaw
- Paul Gordon
- Peter Cock
- Scott Smith