Let's pretend for a second that I'm rather lazy (oh...wait), and I have ZERO interest in writing datatype parsers to sniff and validate whether or not a specific file is a specific datatype. I'm a sysadmin and bioinformatician, and I've worked with dozens of libraries that exist to parse file formats, and they all die in flames when I feed them bad data.
Would it be possible to somehow define requirements for datatypes?
I don't want to take on the burden of code I write saying "yes, I've sniffed+validated this and it is absolutely a genbank file". That's a lot of responsibility, especially if people have malformed genbank files and their tools fail as a result.
I would like to do this with BioPython and turf the validation to another library that exists to parse genbank files, which will raise an exception if they're invalid.
    def sniff(self, filename):
        from Bio import SeqIO
        try:
            self.records = list(SeqIO.parse(filename, "genbank"))
            return True
        except Exception:
            self.records = None
            return False

    def validate(self, dataset):
        from Bio import SeqIO
        errors = list()
        try:
            self.records = list(SeqIO.parse(dataset.file_name, "genbank"))
        except Exception as e:
            errors.append(e)
        return errors

    def set_meta(self, dataset, **kwd):
        if self.records is not None:
            dataset.metadata.number_of_sequences = len(self.records)
So much easier! And I can shift the burden of validation and sniffing upstream, rather than any failures being my fault and requiring maintenance of a complex sniffer.
Cheers, Eric
--
Eric Rasche
Programmer II
Center for Phage Technology
Texas A&M University
College Station, TX 77843
404-692-2048
esr@tamu.edu
rasche.eric@yandex.ru
You could do something like that, and we already have Biopython packages in the ToolShed which can be listed as dependencies :)
However, some things like GenBank are tricky - in order to tolerate NCBI dumps, the Biopython parser will ignore any free text before the first LOCUS line. A confusing side effect is that most text files are then treated as a GenBank file with zero records. But if it came back with some records, it is probably OK :)
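That side effect is easy to demonstrate. A sketch (genbank_record_count is just an illustrative helper, not part of any real API, and it degrades to None if Biopython is not importable):

```python
import os
import tempfile

def genbank_record_count(path):
    """Count GenBank records in a file; None if Biopython is
    unavailable or the parse fails outright (illustrative helper)."""
    try:
        from Bio import SeqIO
    except ImportError:
        return None
    try:
        return sum(1 for _ in SeqIO.parse(path, "genbank"))
    except Exception:
        return None

# A plain text file with no LOCUS line "parses" as GenBank...
fd, path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as handle:
    handle.write("just some free text, definitely not GenBank\n")

count = genbank_record_count(path)
# ...yielding zero records rather than an error, so the record
# count, not "did it raise", is the meaningful signal.
os.remove(path)
```

With Biopython installed, count comes back 0 rather than raising, which is exactly the confusing case described above.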
Basically, Biopython also deliberately does not offer file format detection, simply because it is a can of worms.
Zen of Python - explicit is better than implicit.
We want you to tell us which format you want to try parsing it as.
Sorry,
Peter (Speaking as the Bio.SeqIO maintainer for Biopython)
On 07/17/2014 02:11 PM, Peter Cock wrote:
> You could do something like that, and we already have Biopython packages in the ToolShed which can be listed as dependencies :)
If my module depends on the biopython from the toolshed, will that be accessible within a datatype? Would it be as simple as "from Bio import X"? Most of what I've seen of dependencies (and please forgive my lack of knowledge about them) consists of env.sh being sourced with paths to binaries, prior to tool run.
> However, some things like GenBank are tricky - in order to tolerate NCBI dumps the Biopython parser will ignore any free text before the first LOCUS line. A confusing side effect is most text files are then treated as a GenBank file with zero records. But if it came back with some records it is probably OK :)
Interesting, very good to know.
> Basically Biopython also does not care to offer file format detection simply because it is a can of worms.
> Zen of Python - explicit is better than implicit.
> We want you to tell us which format you want to try parsing it as.
Yes! Exactly! Which is why it's perfectly fine here:
SeqIO.parse( dataset.file_name, "genbank" )
All I want to know is whether or not this parses as a genbank file (and has 1 or more records). BioPython may not do automatic format detection (yuck, agreed), but since I already know I'm looking for a genbank file, simply being able to parse it or not is "good enough".
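For what it's worth, the sniff above only needs one extra guard for the zero-records caveat. A sketch (the class name and structure here are illustrative, not the real Galaxy datatype API):

```python
class GenBankSketch(object):
    """Illustrative stand-in for a Galaxy datatype class."""

    records = None

    def sniff(self, filename):
        # "Parses as GenBank AND has at least one record" is the
        # whole test; everything else is Biopython's problem.
        try:
            from Bio import SeqIO
            self.records = list(SeqIO.parse(filename, "genbank"))
        except Exception:
            self.records = None
            return False
        return len(self.records) > 0
```

A stray text file then sniffs as False whether it raises (no Biopython, unreadable file) or parses to zero records.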
On Thu, Jul 17, 2014 at 8:20 PM, Eric Rasche rasche.eric@yandex.ru wrote:
> If my module depends on the biopython from the toolshed, will that be accessible within a datatype? Would it be as simple as "from Bio import X"? Most of what I've seen of dependencies (and please forgive my lack of knowledge about them) consists of env.sh being sourced with paths to binaries, prior to tool run.
I don't know - this may well be a gap in the ToolShed framework, since thus far most of the datatypes defined have been self contained.
I have asked something similar before (in the context of defining automatic file format conversions, like the way Galaxy can turn FASTA into tabular for input parameters expecting tabular), where there could be a binary dependency.
> All I want to know is whether or not this parses as a genbank file (and has 1 or more records). BioPython may not do automatic format detection (yuck, agreed), but since I already know I'm looking for a genbank file, simply being able to parse it or not is "good enough".
With those provisos, you should be OK :)
Peter
My understanding of the code is that tool shed dependencies (or local dependencies) will not be available to tool shed datatypes (for sniffing for instance). Sorry.
If you want to hack up your local instance to resolve dependencies during the sniffing process that may be possible - my guess is you could add requirement tags to tools/data_source/upload.xml and the __SET_METADATA__ tool definition embedded in lib/galaxy/datatypes/registry.py - though I have not tried this.
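If anyone tries that hack, the tag itself would presumably look like the ordinary tool requirement syntax, e.g. in tools/data_source/upload.xml (the version number is purely illustrative, and whether the dependency resolver actually honours it in these two special tools is exactly the untested part):

```xml
<requirements>
    <requirement type="package" version="1.64">biopython</requirement>
</requirements>
```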
-John
On 07/18/2014 09:49 AM, John Chilton wrote:
> My understanding of the code is that tool shed dependencies (or local dependencies) will not be available to tool shed datatypes (for sniffing for instance). Sorry.
I figured as much; not very surprising at all. Dependencies notwithstanding, the idea has some modicum of merit: there are plenty of people who have already written great parsers that throw up errors, so why should datatypes re-write them?
> If you want to hack up your local instance to resolve dependencies during the sniffing process that may be possible - my guess is you could add requirement tags to tools/data_source/upload.xml and the __SET_METADATA__ tool definition embedded in lib/galaxy/datatypes/registry.py - though I have not tried this.
Well heck, at that point I'd just use the fact that I know I'm in lib/galaxy/datatypes and locate the installed BioPython dependency with greps, globs, and finds. Though I'll hold off on that in favour of a "better" solution.
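Purely for the record, that greps-globs-and-finds hack would look something like this (the function name and the glob pattern for the shed's install layout are guesses, not any documented interface):

```python
import glob
import os
import sys

def bolt_on_shed_biopython(tool_dependency_dir):
    """Crude sketch: hunt for a Tool Shed-installed Biopython under
    Galaxy's tool dependency directory and prepend any hits to
    sys.path. The glob pattern is a guess at the shed's install
    layout, not a documented contract."""
    pattern = os.path.join(
        tool_dependency_dir, "biopython", "*", "*", "*", "*",
        "lib", "python*", "site-packages")
    hits = sorted(glob.glob(pattern))
    for path in hits:
        if path not in sys.path:
            # Prepend so the shed copy wins over any system Biopython.
            sys.path.insert(0, path)
    return hits
```

On an instance with no shed-installed Biopython the glob simply matches nothing and sys.path is left alone, which is about the only polite thing a hack like this can do.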
Cheers, Eric
On Fri, Jul 18, 2014 at 4:21 PM, Eric Rasche rasche.eric@yandex.ru wrote:
> I figured as much, not very surprising at all. Dependencies notwithstanding, the idea has some modicum of merit. There are plenty of people who have already written great parsers that throw up errors, why should datatypes re-write them?
Exactly - shall we file a Trello request for the ToolShed to handle both Python and binary dependencies for datatypes?
(e.g. samtools is a binary dependency of the SAM/BAM datatypes, used for conversion and indexing)
> Well heck, at that point I'd just use the fact that I know I'm in lib/galaxy/datatypes to locate the BioPython dependency that was installed through greps, globs, and finds. Though I'll hold off on that for a "better" solution.
I'd manually install the Python dependencies as part of the Python used to run Galaxy itself?
Peter
Hi guys,
I noticed that this thread had no further replies. I was wondering whether this is on the roadmap for upcoming Galaxy releases. It would be very helpful for sniffing and parsing complex datasets such as medical images.
Cheers,
Igor