On Tue, Feb 19, 2013 at 11:32 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Hello all,
Although these days they also offer XML for many tools, the NCBI still make heavy use of the older ASN.1 file format (both as plain text and binary). This crops up in BLAST (e.g. as the BLAST archive format, or as dustmasker output), in the Entrez Utilities (e.g. for sequence data as an alternative to GenBank or FASTA format, or for PubMed records, etc) and also for 3D structures.
I think it could make sense to define generic 'asn1' and 'asn1-binary' formats in the Galaxy core (name suggestions welcome), and even perhaps 'ncbi-asn1' and 'ncbi-asn1-binary' too. Then ToolShed entries can define domain specific subclasses. For instance, the BLAST+ wrapper could include definitions for the dustmasker output, and perhaps the BLAST archive format too. Separately anyone working with 3D structures as ASN.1 could define another sub-format, etc.
I see this as a clear analogy to the assorted XML file formats in existence - defined in Galaxy as subclasses of the core XML format included with the Galaxy core.
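As a rough illustration of the subclassing idea (a sketch only, not Galaxy's actual datatype API): plain text ASN.1 in value notation begins with "TypeName ::=", so a generic format can check for that header and a subclass can additionally insist on one particular top-level type. The class names, the `sniff` signature and the file extensions below are hypothetical, and "Blast4-archive" as the top-level type of the BLAST archive format is an assumption to be checked against real files.

```python
import re

# ASN.1 value notation starts with "<TypeName> ::=". A type name begins with
# an upper-case letter and may contain letters, digits and hyphens.
_HEADER = re.compile(r"^\s*([A-Z][A-Za-z0-9-]*)\s*::=")


def top_level_type(text):
    """Return the top-level ASN.1 type name found at the start of text, or None."""
    match = _HEADER.match(text)
    return match.group(1) if match else None


class Asn1Text:
    """Hypothetical generic 'asn1' text format (not Galaxy's real base class)."""

    file_ext = "asn1"
    expected_type = None  # None means: accept any top-level type

    @classmethod
    def sniff(cls, text):
        name = top_level_type(text)
        if name is None:
            return False
        return cls.expected_type is None or name == cls.expected_type


class BlastArchiveAsn1(Asn1Text):
    """Hypothetical domain-specific subclass, e.g. defined by a ToolShed wrapper."""

    file_ext = "blast-archive-asn1"
    expected_type = "Blast4-archive"  # assumed top-level type name
```

With this layout the generic class accepts any text ASN.1 file, while the subclass only claims files whose first token names its own top-level type. Binary ASN.1 would need a different check, since it carries no such readable header.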
Would a pull request implementing this be acceptable?
Peter
P.S. Does anyone know an authoritative source for the MIME types used by the NCBI? Using the BLAST website they offer plain text ASN.1 just as text/plain, likewise efetch also seems to use text/plain for ASN.1 downloads. However I've seen references to chemical/ncbi-asn1-ascii and chemical/ncbi-asn1-binary mime-types mentioned, e.g. http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn
i.e. It appears that 3D structure NCBI ASN.1 files use a well defined MIME type, while most NCBI ASN.1 text files default to text/plain - which we can handle nicely in Galaxy as subclasses.
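To make the MIME type point concrete, here is a minimal sketch of how a download could be mapped to a format name: the two chemical/* types identify ASN.1 directly, while text/plain says nothing and has to fall back on looking at the content. The function name `guess_format`, the returned format names and the simple "::=" check are illustrative assumptions, not existing Galaxy code.

```python
# MIME types that identify NCBI ASN.1 directly (as mentioned above).
ASN1_MIME_TYPES = {
    "chemical/ncbi-asn1-ascii": "asn1",
    "chemical/ncbi-asn1-binary": "asn1-binary",
}


def guess_format(mime_type, first_line=""):
    """Return a hypothetical format name for a download, or None if unknown."""
    # Drop any parameters such as "; charset=utf-8" and normalise case.
    mime_type = mime_type.split(";")[0].strip().lower()
    if mime_type in ASN1_MIME_TYPES:
        return ASN1_MIME_TYPES[mime_type]
    # text/plain is ambiguous, so fall back on a crude content check:
    # text ASN.1 value notation has "::=" on its first line.
    if mime_type == "text/plain" and "::=" in first_line:
        return "asn1"
    return None
```

For example, `guess_format("text/plain", "Seq-entry ::= set {")` gives "asn1", whereas a FASTA file served as text/plain gives None and would be left to other sniffers.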
I had an interesting email about the NCBI ASN.1 files from Christopher Hogue, which (with his blessing) I am forwarding to the list in case anyone else is interested - see below.

Thanks Christopher,

Peter

---------- Forwarded message ----------
From: Christopher Hogue
Date: Wed, May 1, 2013 at 4:06 PM
Subject: ASN.1 (text and binary) formats - Use.
To: p.j.a.cock@googlemail.com

I can offer a bit of insight here about NCBI Asn.1, as it is confusing. (Sorry about the time delay here - I'm not a Galaxy developer - I just picked up on this thread via Google today.)

I was involved with the origin of chemical/ncbi-asn1-binary back when I wrote Cn3D 1.0 in 1996 - see http://www.ch.ic.ac.uk/chemime/ which was managed by H. Rzepa.

In case this is tl;dr - you have done a sensible thing - I don't suggest you change anything you have implemented. This is just to let you know what is involved in reading NCBI emitted Asn.1 types - if that is what you want to do eventually.

The Asn.1 specification here:
http://www.ncbi.nlm.nih.gov/data_specs/asn/NCBI_all.asn
refers to the current NCBI Mime-types (chemical/ncbi-asn1-ascii and chemical/ncbi-asn1-binary).

Yes, originally this was set up for 3D structures. NCBI used *.val for binary and *.prt for ascii forms, and these were wrapped with the obsolete mime-types chemical/x-ncbi-asn1-binary and chemical/x-ncbi-asn1-ascii respectively. The only exported data at the time was MMDB/Cn3D structure data.

Newer versions of Cn3D started taking in sequences and sequence alignments generated by VAST structure superposition. Now these original file extensions could hold any exported Asn.1 symbol in their spec set, so the *.val/*.prt file could hold anything from pubmed types to 3D structure to sequences to a blast output, or various nested fragments.

BUT unlike XML, Asn.1 data does not point back to its own schema, and the binary files are trouble if you lose track of what top-level symbol type they start at. There is no observable metadata.
There is no automated way to pull down the specification from a URI. This is difficult to deal with without changing the NCBI Asn.1 1990 implementation itself.

So to fix Cn3D so it could import more types, they made an Asn.1 wrapper, which is in the NCBI toolkit as "ncbimime.asn". This was intended to be a catch-all wrapper for all the data types they emit for Cn3D. The idea was that all their emitted spec objects would be triaged the same way: by parsing the top-level part of this piece of Asn.1, you could figure out what the data inside was. In practice this is still used by Cn3D, and as far as I know Cn3D is the only external NCBI application still supporting this.

They also introduced *.cn3 (removing *.val and *.prt) to specify a stream (either binary or ascii) that has this top-level Mime wrapper. In one case they emit the wrong extension, *.c3d, on the VAST structure similarity server.

The problem with the mime-wrapper they "spec'ed" in is that it only wraps their types, not any arbitrary Asn.1 object that might be made with NCBI tools.

If you want to know more about Asn.1 - the Larmouth book is free to download.
http://www.oss.com/asn1/resources/books-whitepapers-pubs/asn1-books.html#lar...

If you have any other questions about reading/writing Asn.1 specs - write me - chogue {at} blueprint.org and I can probably answer most of them, as I still use Asn.1 for my 3D structure research.

Also - the NCBI C++ toolkit datatool apparently has some support now for converting Asn.1 into JSON. I haven't tested the extent of it, but it looks interesting for simple types.

Cheers,
Christopher Hogue
www.blueprint.org