ASN.1 (text and binary) formats in Galaxy & Tool Shed
Hello all,
Although they are these days also offering XML for many tools, the NCBI still make heavy use of the older ASN.1 file format (both as plain text and binary). This crops up in BLAST (e.g. as the BLAST archive format, or as dustmasker output), in the Entrez Utilities (e.g. for sequence data as an alternative to GenBank or FASTA format, or for PubMed, etc) and also for 3D structures.
I think it could make sense to define generic 'asn1' and 'asn1-binary' formats in the Galaxy core (name suggestions welcome), and even perhaps 'ncbi-asn1' and 'ncbi-asn1-binary' too. Then ToolShed entries can define domain specific subclasses. For instance, the BLAST+ wrapper could include definitions for the dustmasker output, and perhaps the BLAST archive format too. Separately anyone working with 3D structures as ASN.1 could define another sub-format, etc.
I see this as a clear analogy to the assorted XML file formats in existence - defined in Galaxy as subclasses of the core XML format included with the Galaxy core.
Would a pull request implementing this be acceptable?
Peter
P.S. Does anyone know an authoritative source for the MIME types used by the NCBI? Using the BLAST website they offer plain text ASN.1 just as text/plain, likewise efetch also seems to use text/plain for ASN.1 downloads. However I've seen references to chemical/ncbi-asn1-ascii and chemical/ncbi-asn1-binary mime-types mentioned, e.g. http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn
i.e. It appears that 3D structure NCBI ASN.1 files use a well defined MIME type, while most NCBI ASN.1 text files default to text/plain - which we can handle nicely in Galaxy as subclasses.
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera. -- James Taylor, Assistant Professor, Biology/CS, Emory University
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera.
Thanks James,
Yes - very like XML, but with the subtlety that ASN.1 comes in text and binary flavours (which I presume applies to all the variants, although the binary versions may not be as commonly used for the smaller files).
Nicola - do you want to make a pull request to galaxy-central defining ASN.1 text and binary formats (which we can then subclass for the NCBI BLAST+ wrappers)? Or should I?
I think the mime-type for the base ASN.1 text format should probably be text/plain based on the NCBI usage patterns. I'm not sure what the mime-type for the base ASN.1 binary format should be (but I don't think it should be chemical/ncbi-asn1-binary).
Regards,
Peter
On Tue, 19/02/2013 at 14.15 +0000, Peter Cock wrote:
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
I think it could make sense to define generic 'asn1' and 'asn1-binary' formats in the Galaxy core (name suggestions welcome)
What about
extension="asn" type="galaxy.datatypes.data:GenericAsn1"
and
extension="asnb" type="galaxy.datatypes.binary:GenericAsn1Binary"
like the GenericXml class?
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera.
Thanks James,
Yes - very like XML, but with the subtlety that ASN.1 comes in text and binary flavours (which I presume applies to all the variants, although the binary versions may not be as commonly used for the smaller files).
Nicola - do you want to make a pull request to galaxy-central defining ASN.1 text and binary formats (which we can then subclass for the NCBI BLAST+ wrappers)? Or should I?
If you mean a minimal implementation, I can surely do that. If something more elaborate is needed, then you are probably more qualified than me!
I think the mime-type for the base ASN.1 text format should probably be text/plain based on the NCBI usage patterns.
Ok.
I'm not sure what the mime-type for the base ASN.1 binary format should be (but I don't think it should be chemical/ncbi-asn1-binary).
application/octet-stream?
Best,
Nicola
On Tue, Feb 19, 2013 at 6:45 PM, Nicola Soranzo <soranzo@crs4.it> wrote:
On Tue, 19/02/2013 at 14.15 +0000, Peter Cock wrote:
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock wrote:
I think it could make sense to define generic 'asn1' and 'asn1-binary' formats in the Galaxy core (name suggestions welcome)
What about
extension="asn" type="galaxy.datatypes.data:GenericAsn1"
and
extension="asnb" type="galaxy.datatypes.binary:GenericAsn1Binary"
like GenericXml class?
Those seem sensible to me as the class names, although I'm not so sure about the format names (aka 'extensions' in Galaxy terms). I'd prefer to see the '1' in the name for clarity. My suggestions of 'asn1' and 'asn1-binary' were based on NCBI usage.
Perhaps the Galaxy team could give their view on conciseness versus clarity in file format names for Galaxy?
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera.
Thanks James,
Yes - very like XML, but with the subtlety that ASN.1 comes in text and binary flavours (which I presume applies to all the variants, although the binary versions may not be as commonly used for the smaller files).
Nicola - do you want to make a pull request to galaxy-central defining ASN.1 text and binary formats (which we can then subclass for the NCBI BLAST+ wrappers)? Or should I?
If you mean a minimal implementation, I can surely do that. If something more elaborate is needed, then you are probably more qualified than me!
The current minimal implementation you sent me for BLAST+ would be an excellent start. Things like data type sniffers etc would be a nice to have feature, but can be added later I think. And the sooner this gets into the Galaxy core, the sooner we can use it in the BLAST+ wrappers :)
I think the mime-type for the base ASN.1 text format should probably be text/plain based on the NCBI usage patterns.
Ok.
I'm not sure what the mime-type for the base ASN.1 binary format should be (but I don't think it should be chemical/ncbi-asn1-binary).
application/octet-stream ?
Probably OK - we can/should test this by uploading a test binary ASN.1 file into Galaxy and downloading it out again.
Thanks,
Peter
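To make the proposal so far concrete, here is a minimal sketch of the class layout being discussed, using the 'asn1' and 'asn1-binary' format names and the text/plain and application/octet-stream mime-types suggested above. It deliberately does not import Galaxy: the Data, Text and Binary classes below are simplified stand-ins for Galaxy's real base datatypes, and BlastArchiveAsn1 with its 'blast-asn1' name is purely hypothetical, only there to show where a ToolShed subclass would hang.

```python
# Simplified stand-ins for Galaxy's base datatype classes (assumption: the
# real ones live in galaxy.datatypes.data / galaxy.datatypes.binary and do
# far more than carry an extension and a mime-type).
class Data:
    file_ext = "data"
    mimetype = "application/octet-stream"


class Text(Data):
    file_ext = "txt"
    mimetype = "text/plain"


class Binary(Data):
    file_ext = "binary"


class GenericAsn1(Text):
    """Generic ASN.1 text format (mime-type inherited: text/plain)."""
    file_ext = "asn1"


class GenericAsn1Binary(Binary):
    """Generic ASN.1 binary format (mime-type inherited: octet-stream)."""
    file_ext = "asn1-binary"


class BlastArchiveAsn1(GenericAsn1):
    """Hypothetical domain-specific subclass a BLAST+ wrapper might ship."""
    file_ext = "blast-asn1"  # hypothetical name, for illustration only


def build_registry(*classes):
    """Map each format name ('extension' in Galaxy terms) to its class."""
    return {cls.file_ext: cls for cls in classes}


registry = build_registry(GenericAsn1, GenericAsn1Binary, BlastArchiveAsn1)

if __name__ == "__main__":
    for ext, cls in sorted(registry.items()):
        print(ext, cls.__name__, cls.mimetype)
```

The point of the layout is the same as for XML: a tool that accepts 'asn1' would also accept any subclass, while a tool that needs a specific NCBI type can ask for the subclass only.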
Clarity in this case, definitely needs the 1.
On Tue, 19/02/2013 at 20.38 +0000, Peter Cock wrote:
On Tue, Feb 19, 2013 at 6:45 PM, Nicola Soranzo <soranzo@crs4.it> wrote:
On Tue, 19/02/2013 at 14.15 +0000, Peter Cock wrote:
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock wrote:
I think it could make sense to define generic 'asn1' and 'asn1-binary' formats in the Galaxy core (name suggestions welcome)
What about
extension="asn" type="galaxy.datatypes.data:GenericAsn1"
and
extension="asnb" type="galaxy.datatypes.binary:GenericAsn1Binary"
like GenericXml class?
Those seem sensible to me as the class names, although I'm not so sure about the format names (aka 'extensions' in Galaxy terms). I'd prefer to see the '1' in the name for clarity. My suggestions of 'asn1' and 'asn1-binary' were based on NCBI usage.
Ok, no problem!
Nicola - do you want to make a pull request to galaxy-central defining ASN.1 text and binary formats (which we can then subclass for the NCBI BLAST+ wrappers)? Or should I?
If you mean a minimal implementation, I can surely do that. If something more elaborate is needed, then you are probably more qualified than me!
The current minimal implementation you sent me for BLAST+ would be an excellent start. Things like data type sniffers etc would be a nice to have feature, but can be added later I think. And the sooner this gets into the Galaxy core, the sooner we can use it in the BLAST+ wrappers :)
Before proceeding on this route, one question: we are going to introduce a dependency for blast_datatypes/ncbi_blast_plus on a future Galaxy release, is that OK? I don't think it is possible to express such a dependency in a repository_dependencies.xml file yet.
Best,
Nicola
--
Nicola Soranzo, Ph.D.
CRS4 Bioinformatics Program
Loc. Piscina Manna
09010 Pula (CA), Italy
http://www.bioinformatica.crs4.it/
On Fri, Feb 22, 2013 at 2:55 PM, Nicola Soranzo <soranzo@crs4.it> wrote:
On Tue, 19/02/2013 at 20.38 +0000, Peter Cock wrote:
The current minimal implementation you sent me for BLAST+ would be an excellent start. Things like data type sniffers etc would be a nice to have feature, but can be added later I think. And the sooner this gets into the Galaxy core, the sooner we can use it in the BLAST+ wrappers :)
Before proceeding on this route, one question: we are going to introduce a dependency for blast_datatypes/ncbi_blast_plus on a future galaxy release, is it ok?
I would just wait a few weeks after the new datatypes have been released in a stable Galaxy release before making the new usage within the BLAST tools public on the ToolShed. Not perfect, but simple.
I don't think it is possible to express such dependency in a repository_dependencies.xml file yet.
I don't think we can declare a dependency on a version of Galaxy itself - but that is a nice idea. Peter
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera.
Nicola made the pull request we discussed a month ago:
https://bitbucket.org/galaxy/galaxy-central/pull-request/135/add-generic-asn...
Could someone please review and commit this?
Thanks,
Peter
On Wed, 03/04/2013 at 14.27 +0100, Peter Cock wrote:
On Tue, Feb 19, 2013 at 2:00 PM, James Taylor <james@jamestaylor.org> wrote:
On Tue, Feb 19, 2013 at 6:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Would a pull request implementing this be acceptable?
Yes. My understanding is that ASN is a completely flexible metaformat, like XML, and so should be under either Text or Data, with appropriate subtypes defined for blast, et cetera.
Nicola made the pull request we discussed a month ago:
https://bitbucket.org/galaxy/galaxy-central/pull-request/135/add-generic-asn...
Could someone please review and commit this?
Please note that this pull request also contains this commit:
https://bitbucket.org/nsoranzo/galaxy-central/commits/62d0924f163b8731adaf58...
which fixes a bug where uploading an unsniffable binary file whose filename has more than one dot (e.g. foo.bar.scf) results in the error 'The uploaded binary file contains inappropriate content'.
Nicola
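For readers wondering how a second dot in a filename can matter at all, the sketch below shows one plausible way such a bug can arise. It is not taken from the commit linked above, and both helper names are made up for illustration: the assumption is simply that an unsniffable binary upload falls back to the filename extension, and that the extension is taken from the wrong dot.

```python
import os


def ext_after_first_dot(filename):
    """Pitfall: takes the piece right after the FIRST dot in the name."""
    parts = filename.split(".")
    return parts[1].lower() if len(parts) > 1 else ""


def ext_after_last_dot(filename):
    """os.path.splitext splits on the last dot only."""
    return os.path.splitext(filename)[1].lstrip(".").lower()


if __name__ == "__main__":
    for name in ("foo.scf", "foo.bar.scf"):
        print(name, ext_after_first_dot(name), ext_after_last_dot(name))
```

With a name like foo.bar.scf the first helper yields 'bar', which would not match any known binary extension, while the second yields 'scf'.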
I had an interesting email about the NCBI ASN.1 files from Christopher Hogue, which (with his blessing) I am forwarding to the list in case anyone else is interested - see below.
Thanks Christopher,
Peter
---------- Forwarded message ----------
From: Christopher Hogue
Date: Wed, May 1, 2013 at 4:06 PM
Subject: ASN.1 (text and binary) formats - Use.
To: p.j.a.cock@googlemail.com
I can offer a bit of insight here about NCBI Asn.1 as it is confusing. (Sorry about the time delay here - I'm not a Galaxy developer - I just picked up on this thread via Google today.)
I was involved with the origin of the chemical/ncbi-asn1-binary type back when I wrote Cn3D 1.0 in 1996 - http://www.ch.ic.ac.uk/chemime/ which was managed by H. Rzepa.
In case this is tl;dr - you have done a sensible thing - I don't suggest you change anything you have implemented. This is just to let you know what is involved in reading NCBI emitted Asn.1 types - if that is what you want to do eventually.
The Asn.1 specification here: http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn refers to the current NCBI mime-types (chemical/ncbi-asn1-ascii and chemical/ncbi-asn1-binary).
Yes, originally this was set up for 3D structures. NCBI used *.val for binary and *.prt for ascii forms, and these were wrapped with the obsolete mime-types chemical/x-ncbi-asn1-binary and chemical/x-ncbi-asn1-ascii respectively. The only exported data at the time was MMDB/Cn3D structure data. Newer versions of Cn3D started taking in sequences and sequence alignments generated by VAST structure superposition.
Now these original file extensions could hold any exported Asn.1 symbol in their spec set, so the *.val/*.prt file could hold anything from pubmed types to 3D structure to sequences to a blast output, or various nested fragments.
BUT unlike XML, Asn.1 data does not point back to its own schema, and the binary files are trouble if you lose track of what top-level symbol type they start at. There is no observable metadata. There is no automated way to pull down the specification from a URI. This is difficult to deal with without changing the NCBI Asn.1 1990 implementation itself.
So to fix Cn3D so it could import more types, they made an Asn.1 wrapper which is in the NCBI toolkit as "ncbimime.asn". This was intended to be a catch-all wrapper for all the data types they emit for Cn3D. The idea was that all their emitted spec objects would be triaged the same way: by parsing the top-level part of this piece of Asn.1, you could figure out what the data was inside. In practice this is still used by Cn3D, and as far as I know Cn3D is the only external NCBI application still supporting this.
They also introduced *.cn3 (removing *.val and *.prt) to specify a stream (either binary or ascii) that has this top-level Mime wrapper. In one case they emit the wrong extension *.c3d on the VAST structure similarity server.
The problem with the mime-wrapper they "spec'ed" in is that it only wraps their types, not any arbitrary Asn.1 object that might be made with NCBI tools.
If you want to know more about Asn.1 - the Larmouth book is free to download. http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html#lar...
If you have any other questions about reading/writing Asn.1 specs - write me - chogue {at} blueprint.org and I can probably answer most of them, as I still use Asn.1 for my 3D structure research.
Also - the NCBI C++ toolkit datatool apparently has some support now for converting Asn.1 into JSON. I haven't tested the extent of it, but it looks interesting for simple types.
Cheers,
Christopher Hogue
www.blueprint.org
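The point about top-level symbols also bears on the datatype sniffers mentioned earlier in the thread as a later nice-to-have. The sketch below is only an illustration for the text flavour, under the assumption that an NCBI text ASN.1 file opens with its top-level type name followed by '::=' (for example 'Seq-entry ::= set {'); the function name is made up, and nothing here attempts the binary flavour, which per the email above carries no such observable hint.

```python
import re

# Assumption: text ASN.1 as emitted by NCBI tools starts with
# "<Type-name> ::=", where the type name begins with an upper-case letter
# and contains only letters, digits and hyphens.
_TOP_LEVEL = re.compile(r"\s*([A-Z][A-Za-z0-9-]*)\s*::=")


def sniff_asn1_text_type(text):
    """Return the top-level ASN.1 type name, or None if it does not look
    like text ASN.1. Leading whitespace and blank lines are skipped."""
    match = _TOP_LEVEL.match(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(sniff_asn1_text_type("Seq-entry ::= set {\n  ...\n}"))
    print(sniff_asn1_text_type(">seq1\nACGT\n"))
```

A subclass defined in a ToolShed repository could then compare the returned name against the one type it cares about, leaving the generic 'asn1' format as the fallback.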
participants (4)
- James Taylor
- James Taylor
- Nicola Soranzo
- Peter Cock