I'd like to have a datatype with a dict as metadata. This dict() would store file offsets to enable seeking around to process different sections of the file. How do I add a dictionary data metadata element?
John Duddy Sr. Staff Software Engineer Illumina, Inc. 9885 Towne Centre Drive San Diego, CA 92121 Tel: 858-736-3584 E-mail: jduddy@illumina.com
Hey John, are you sure you don't want to use a "converted dataset" rather than a metadata element for this? This is how we handle most types of secondary indexes for visualization. If you do it this way, the converter that creates the offset index is just another tool (but registered in datatypes_conf.xml), and the index itself is another dataset that can be accessed through the converted datasets relationship.
On Aug 25, 2011, at 6:12 PM, Duddy, John wrote:
I’d like to have a datatype with a dict as metadata. This dict() would store file offsets to enable seeking around to process different sections of the file.
How do I add a dictionary data metadata element?
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at:
A converted dataset would be fine too. I'm working on an enhancement that would allow the metadata to be provided when the file is uploaded/registered via the API. So to do what you say, I'd need to have a way of providing that converted dataset.
The files I'm talking about are concatenated GZIP files, and the GZIP format specification doesn't contain any information about the size of the compressed data, only the uncompressed size (and even then, only modulo 2^32). AFAIK, anything in Galaxy that tried to create the auxiliary index would need to read and decompress all the data in the file to do that, which is easily an hour's worth of work for some of our full genome runs. We have all that information already when we make the file, so I'd prefer to just give it to Galaxy at the start. I could place it in a special section in the first GZIP header, but then this capability would not be as general-purpose as it could be. I also want to prevent unnecessary gzip decompression in Python, because serious decompression in versions before 2.7 is so slow as to be unusable for large datasets.
Is there a way to upload that converted dataset when I upload/register the main file? I'd also need to know how to write such a file.
-----Original Message----- From: James Taylor [mailto:james@jamestaylor.org] Sent: Friday, August 26, 2011 5:37 AM To: Duddy, John Cc: galaxy-dev Subject: Re: [galaxy-dev] Storing a dict as metadata
Hey John, are you sure you don't want to use a "converted dataset" rather than a metadata element for this? This is how we handle most types of secondary indexes for visualization. If you do it this way, the converter that creates the offset index is just another tool (but registered in datatypes_conf.xml), and the index itself is another dataset that can be accessed through the converted datasets relationship.
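The offset index described above for concatenated GZIP files can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Galaxy code: the function names `write_members` and `read_member`, the section names, and the choice of JSON as the on-disk index format are all hypothetical. The idea is that the writer already knows where each gzip member starts and how long it is, so it records that at write time, and a reader can later seek straight to one member and decompress only that member.

```python
import gzip
import io
import json


def write_members(out, sections):
    """Write each section as its own gzip member.

    Returns a dict mapping section name -> [byte offset, compressed length].
    (Hypothetical helper; the index layout is an assumption.)
    """
    index = {}
    for name, payload in sections.items():
        start = out.tell()
        out.write(gzip.compress(payload))
        index[name] = [start, out.tell() - start]
    return index


def read_member(src, index, name):
    """Seek to one member and decompress only that member."""
    offset, length = index[name]
    src.seek(offset)
    return gzip.decompress(src.read(length))


# Build a small concatenated-gzip "file" in memory.
sections = {"chr1": b"ACGT" * 1000, "chr2": b"TTGA" * 500}
buf = io.BytesIO()
index = write_members(buf, sections)

# The index is plain data, so it can be stored as a separate file
# (JSON here, as an assumption) and loaded again later.
index_json = json.dumps(index)
restored = json.loads(index_json)

# Random access: only the "chr2" member is read and decompressed.
chr2 = read_member(buf, restored, "chr2")
```

Because the index is just a small standalone file, it could be uploaded like any other dataset and associated with the main file afterwards, which is the kind of arrangement discussed in this thread.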
Not currently, but since a converted dataset is just a dataset, you could reuse all of the existing upload mechanism, and just add the converted dataset connection between the two after the fact. On Aug 26, 2011, at 11:54 AM, Duddy, John wrote:
Is there a way to upload that converted dataset when I upload/register the main file? I'd also need to know how to write such a file.
I'm looking into these, and it seems that the spirit is to store a version of the data that is converted, like a FASTQ -> BAM or some such use case, where one file can be extracted from the other. Am I missing a dimension to these files? In any case, I'd have to add the ability to associate the files in the API, probably a new operation in the update method for library contents?
-----Original Message----- From: James Taylor [mailto:james@jamestaylor.org] Sent: Friday, August 26, 2011 1:52 PM To: Duddy, John Cc: galaxy-dev Subject: Re: [galaxy-dev] Storing a dict as metadata
Not currently, but since a converted dataset is just a dataset, you could reuse all of the existing upload mechanism, and just add the converted dataset connection between the two after the fact.
On Aug 26, 2011, at 7:23 PM, Duddy, John wrote:
I'm looking into these, and it seems that the spirit is to store a version of the data that is converted, like a FASTQ -> BAM or some such use case, where one file can be extracted from the other.
It was originally built for that (hence the name), but it is also good for storing other kinds of related files, particularly indexes. For visualization, for example, we use it to store interval tree indexes.
In any case, I'd have to add the ability to associate the files in the API, probably a new operation in the update method for library contents?
Yes, it would need a new API operation. If the "parent" dataset is in a library, then yes, it would make sense to do it as part of the library API.
participants (2)
- Duddy, John
- James Taylor