Problem in Json datatype?
Greetings, I was adding datatype classes to galaxy.datatype.text.py for our custom JSON formats and I noticed what I think is a problem with the Json datatype class, in particular the _looks_like_json method. The basic algorithm is (in pseudo code): if the_file_is_small_enough: try to load the the file as json return True on success, False otherwise else while True find first non-blank line if line starts with ‘{‘ or ‘[‘: return True else return False However, JSON is typically compressed by stripping whitespace (including newlines), particularly when it is fetched from a web service. This means that the first call to readLine is going to load the entire file, which defeats the purpose of checking the file size in the first place. I would submit a pull request with our fix, but since I am not a Python programmer I thought I would simply post the code here for review. class Json(Text): # Unchanged bits of the class have been omitted. # Add a function that reads a single character skipping over whitespace. def read(self, fileHandle): ch = fileHandle.read(1) while ch.isspace(): ch = fileHandle.read(1) return ch # Only the “else-part” has changed def _looks_like_json(self, filename): # Pattern used by SequenceSplitLocations if os.path.getsize(filename) < 50000: # If the file is small enough - don't guess just check. try: json.load(open(filename, "r")) return True except Exception: return False else: with open(filename, "r") as fh: ch = self.read(fh) return ch == '{' or ch == '[' We then use the read() method to sniff out our JSON formats. Once we can sniff a header subclasses only need to specify the header. class Lapps( Json ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org''' def sniff(self, filename) with open(filename) as fh: for c in header: if c != self.read(fh) return False return True class Lif ( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”''' class Gate( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”''' … Regardless of how the JSON is rendered (pretty printed or not) we can match the “magic” strings used to identify our formats. Being new to Python I don’t know how expensive it is to read a file one character at a time, but it has to be cheaper than potentially reading the entire file. Cheers, Keith ------------------------------ Research Associate Department of Computer Science Vassar College Poughkeepsie, NY
Good idea - the Python file read method takes an option number of bytes/characters, so it could easily read just one, or say 100 and then strip whitespace. Peter On Thu, Jun 4, 2015 at 3:36 AM, Keith Suderman <suderman@cs.vassar.edu> wrote:
Greetings,
I was adding datatype classes to galaxy.datatype.text.py for our custom JSON formats and I noticed what I think is a problem with the Json datatype class, in particular the _looks_like_json method. The basic algorithm is (in pseudo code):
if the_file_is_small_enough: try to load the the file as json return True on success, False otherwise else while True find first non-blank line if line starts with ‘{‘ or ‘[‘: return True else return False
However, JSON is typically compressed by stripping whitespace (including newlines), particularly when it is fetched from a web service. This means that the first call to readLine is going to load the entire file, which defeats the purpose of checking the file size in the first place.
I would submit a pull request with our fix, but since I am not a Python programmer I thought I would simply post the code here for review.
class Json(Text): # Unchanged bits of the class have been omitted.
# Add a function that reads a single character skipping over whitespace. def read(self, fileHandle): ch = fileHandle.read(1) while ch.isspace(): ch = fileHandle.read(1) return ch
# Only the “else-part” has changed def _looks_like_json(self, filename): # Pattern used by SequenceSplitLocations if os.path.getsize(filename) < 50000: # If the file is small enough - don't guess just check. try: json.load(open(filename, "r")) return True except Exception: return False else: with open(filename, "r") as fh: ch = self.read(fh) return ch == '{' or ch == '['
We then use the read() method to sniff out our JSON formats. Once we can sniff a header subclasses only need to specify the header.
class Lapps( Json ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org''' def sniff(self, filename) with open(filename) as fh: for c in header: if c != self.read(fh) return False return True class Lif ( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”''' class Gate( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”''' …
Regardless of how the JSON is rendered (pretty printed or not) we can match the “magic” strings used to identify our formats.
Being new to Python I don’t know how expensive it is to read a file one character at a time, but it has to be cheaper than potentially reading the entire file.
Cheers, Keith
------------------------------ Research Associate Department of Computer Science Vassar College Poughkeepsie, NY
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Thanks for the detailed bug report Keith! I have implemented Peter suggestion as a Pull Request (https://github.com/galaxyproject/galaxy/pull/309). -John On Thu, Jun 4, 2015 at 3:34 AM, Peter Cock <p.j.a.cock@googlemail.com> wrote:
Good idea - the Python file read method takes an option number of bytes/characters, so it could easily read just one, or say 100 and then strip whitespace.
Peter
On Thu, Jun 4, 2015 at 3:36 AM, Keith Suderman <suderman@cs.vassar.edu> wrote:
Greetings,
I was adding datatype classes to galaxy.datatype.text.py for our custom JSON formats and I noticed what I think is a problem with the Json datatype class, in particular the _looks_like_json method. The basic algorithm is (in pseudo code):
if the_file_is_small_enough: try to load the the file as json return True on success, False otherwise else while True find first non-blank line if line starts with ‘{‘ or ‘[‘: return True else return False
However, JSON is typically compressed by stripping whitespace (including newlines), particularly when it is fetched from a web service. This means that the first call to readLine is going to load the entire file, which defeats the purpose of checking the file size in the first place.
I would submit a pull request with our fix, but since I am not a Python programmer I thought I would simply post the code here for review.
class Json(Text): # Unchanged bits of the class have been omitted.
# Add a function that reads a single character skipping over whitespace. def read(self, fileHandle): ch = fileHandle.read(1) while ch.isspace(): ch = fileHandle.read(1) return ch
# Only the “else-part” has changed def _looks_like_json(self, filename): # Pattern used by SequenceSplitLocations if os.path.getsize(filename) < 50000: # If the file is small enough - don't guess just check. try: json.load(open(filename, "r")) return True except Exception: return False else: with open(filename, "r") as fh: ch = self.read(fh) return ch == '{' or ch == '['
We then use the read() method to sniff out our JSON formats. Once we can sniff a header subclasses only need to specify the header.
class Lapps( Json ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org''' def sniff(self, filename) with open(filename) as fh: for c in header: if c != self.read(fh) return False return True class Lif ( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/jsonld#lif”''' class Gate( Lapps ): header = ‘’’{“discriminator”:”http://vocab.lappsgrid.org/ns/media/gate”''' …
Regardless of how the JSON is rendered (pretty printed or not) we can match the “magic” strings used to identify our formats.
Being new to Python I don’t know how expensive it is to read a file one character at a time, but it has to be cheaper than potentially reading the entire file.
Cheers, Keith
------------------------------ Research Associate Department of Computer Science Vassar College Poughkeepsie, NY
___________________________________________________________ Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Please keep all replies on the list by using "reply all" in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/
To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
participants (3)
-
John Chilton
-
Keith Suderman
-
Peter Cock