Feedback appreciated for new unified tabular format: GTrack
Dear Galaxy-dev guys!

Our research group has for the last few years been developing the Genomic HyperBrowser (http://hyperbrowser.uio.no), a system for statistical analysis of genomic data, built on top of Galaxy. The system currently includes 17 hypothesis tests and 53 descriptive statistics.

In the process of developing the HyperBrowser, we have experienced several shortcomings of the usual tabular formats for genomic datasets: BED, WIG, GFF, BedGraph, etc. This has led us to define (yet another) format for genomic data: the GTrack format (short for Genomic Track). The format will hopefully be published in an article that is currently under review (together with an extension of the XML format BioXSD, which supports many of the same properties as the GTrack format). While the article is under review, we would be very interested in feedback from you guys, in order for the format to be as good as possible.

The basic issues handled are the following:

1. One format for all track types. The first issue is the very existence of so many formats. We have, in the article, defined 15 different track types and believe that these track types are the main reason for the proliferation of tabular formats. The track types include the usual Segments (as in BED) and Valued Segments (as in BedGraph), but also other types such as Points and Step Function. In addition, we introduce linked track types usable for analysis of three-dimensional data sets. The GTrack format handles all 15 track types.

2. Simple to create. Although the GTrack format specification document is quite large, the format is still quite simple to handle. It is based on fixed columns (not attributes, as in GFF), but allows custom columns (unlike BED). If you already have scripts creating output in a common tabular format, they should require little change to support GTrack.

3. Customizability. GTrack allows any number of custom columns to be added, in any order. GTrack also supports a scheme for creating GTrack subtypes. A GTrack subtype is a particular configuration of GTrack files explicitly created for specific uses/tools. All GTrack subtypes can still be handled by generic GTrack parsers.

4. Simple to parse. We have tried to make GTrack as simple as possible to parse. This includes the use of header lines for defining properties of a file. Header lines ease parsing by telling the parser what is coming, and they allow quick and dirty parsers to explicitly assert what they are able to handle (so as to abort with a clearly stated reason instead of failing silently). Also, the GTrack subtyping scheme allows parsers to limit their support to a subset of the GTrack specification, e.g. files with a fixed number and order of columns.

5. Advanced functionality. GTrack supports more advanced functionality, such as networks of track elements and the option of defining the domain of a track (e.g. the genomic regions for which the track is defined).

6. Syntax, not semantics. As with BED or WIG, the GTrack format focuses on the structural elements of the data, i.e. how to represent data mathematically/informatically. We leave the specifics of interpretation to others (who can, for instance, use their definitions to create GTrack subtypes).

We have also been in contact with the Galaxy team and have received positive signals regarding future support for GTrack in Galaxy. Note that tools for converting between GTrack and other formats, in addition to a tool to help create GTrack headers, will be available soon.

We hope you find the format interesting and welcome all kinds of feedback/suggestions. As the paper is in a review process, we would appreciate feedback within the next two weeks.

Version 1.0b2 of the GTrack specification and an illustration of the 15 track types are available here:

http://hyperbrowser.uio.no/hb/static/hyperbrowser/files/gtrack/GTrack_specif...
http://hyperbrowser.uio.no/hb/static/hyperbrowser/files/gtrack/track_types.p...
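
To illustrate point 4, a quick and dirty parser that asserts what it can handle might look roughly like the Python sketch below. Note that this message does not spell out the header syntax, so the sketch makes several assumptions that should be checked against the specification linked above: header lines of the form `##name: value`, a tab-separated column specification line starting with `###`, and all other lines starting with `#` being safe to skip. The function name `parse_gtrack` and the `supported_track_types` parameter are hypothetical.

```python
def leading_hashes(line):
    """Count the number of '#' characters at the start of a line."""
    return len(line) - len(line.lstrip("#"))


def parse_gtrack(lines, supported_track_types=("segments", "valued segments")):
    """Parse GTrack-like lines into (headers, rows).

    Assumed syntax (not taken from this message, verify against the spec):
      '##name: value'  -> header line
      '###col1\\tcol2' -> column specification line
      other '#' lines  -> skipped by this sketch
    """
    headers = {}
    columns = None
    rows = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        n = leading_hashes(line)
        if n == 2:
            name, _, value = line[2:].partition(":")
            headers[name.strip().lower()] = value.strip()
        elif n == 3:
            columns = line[3:].split("\t")
        elif n >= 1:
            continue  # comment or a line type this sketch does not handle
        else:
            if not rows:
                # Before reading data, state clearly what we cannot handle.
                # This sketch is strict: it requires an explicit track type.
                track_type = headers.get("track type")
                if track_type not in supported_track_types:
                    raise ValueError(
                        "unsupported track type: %r" % (track_type,))
                if columns is None:
                    raise ValueError("no column specification line found")
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError("wrong number of columns: %r" % (line,))
            rows.append(dict(zip(columns, values)))
    return headers, rows


# Hypothetical example input, following the assumed syntax above.
sample = [
    "##track type: valued segments",
    "###seqid\tstart\tend\tvalue",
    "# a comment",
    "chr1\t10\t20\t0.5",
]
headers, rows = parse_gtrack(sample)
```

The point of the design is that the parser fails loudly on the first data line if the declared track type is outside what it supports, rather than silently misreading the columns.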

For the HyperBrowser team,
Sveinung Gundersen

--
Sveinung Gundersen
PhD Student, Bioinformatics,
Dept. of Tumor Biology, Inst. for Cancer Research,
The Norwegian Radium Hospital, Montebello, 0310 Oslo, Norway
E-mail: sveinung.gundersen@medisin.uio.no, Phone: +47 93 00 94 54