Greetings,
I was adding datatype classes to galaxy/datatypes/text.py for our custom JSON formats and I noticed what I think is a problem with the Json datatype class, in particular the _looks_like_json method. The basic algorithm is (in pseudocode):
    if the_file_is_small_enough:
        try to load the file as json
        return True on success, False otherwise
    else:
        while True:
            find first non-blank line
            if line starts with '{' or '[':
                return True
            else:
                return False
However, JSON is typically minified by stripping whitespace (including newlines), particularly when it is fetched from a web service. For such a file, the first call to readline is going to load the entire file, which defeats the purpose of checking the file size in the first place.
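To illustrate the readline problem, here is a minimal sketch. It uses an in-memory io.StringIO in place of a real file, and the payload is a made-up stand-in for a web-service response, so treat it as an illustration and not as Galaxy code.

```python
import io
import json

# Hypothetical payload standing in for a large, minified web-service response.
payload = json.dumps({"items": list(range(1000))}, separators=(",", ":"))

fh = io.StringIO(payload)
first_line = fh.readline()

# The minified document contains no newline, so a single readline()
# call returns the whole document.
print("\n" in payload)
print(len(first_line) == len(payload))
```

With a multi-gigabyte file the same call would try to pull the entire document into memory before the sniffer ever looks at the first character.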
I would submit a pull request with our fix, but since I am not a Python programmer I thought I would simply post the code here for review.
    import json
    import os

    class Json(Text):
        # Unchanged bits of the class have been omitted.

        # Add a function that reads a single character, skipping over whitespace.
        # At end of file read(1) returns '', and ''.isspace() is False, so the loop stops.
        def read(self, fileHandle):
            ch = fileHandle.read(1)
            while ch.isspace():
                ch = fileHandle.read(1)
            return ch

        # Only the "else-part" has changed.
        def _looks_like_json(self, filename):
            # Pattern used by SequenceSplitLocations
            if os.path.getsize(filename) < 50000:
                # If the file is small enough - don't guess just check.
                try:
                    # Use a context manager so the file handle is always closed.
                    with open(filename, "r") as fh:
                        json.load(fh)
                    return True
                except Exception:
                    return False
            else:
                with open(filename, "r") as fh:
                    ch = self.read(fh)
                    return ch == '{' or ch == '['
We then use the read() method to sniff out our JSON formats. Once we can sniff a header, subclasses only need to specify the header.
    class Lapps(Json):
        header = '''{"discriminator":"http://vocab.lappsgrid.org'''

        def sniff(self, filename):
            with open(filename) as fh:
                # Compare the header against the file's non-whitespace characters.
                for c in self.header:
                    if c != self.read(fh):
                        return False
            return True

    class Lif(Lapps):
        header = '''{"discriminator":"http://vocab.lappsgrid.org/ns/media/jsonld#lif"'''

    class Gate(Lapps):
        header = '''{"discriminator":"http://vocab.lappsgrid.org/ns/media/gate"'''
…
Regardless of how the JSON is rendered (pretty printed or not) we can match the “magic” strings used to identify our formats.
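The header matching can be tried outside of Galaxy with a small standalone sketch. The function names (read_non_space, matches_header, write_temp) and the "payload" key in the sample document are hypothetical and only there for illustration. Only the discriminator header comes from the code above.

```python
import json
import os
import tempfile


def read_non_space(fh):
    """Return the next non-whitespace character, or '' at end of file."""
    ch = fh.read(1)
    while ch.isspace():
        ch = fh.read(1)
    return ch


def matches_header(path, header):
    """True if the file's leading non-whitespace characters spell out header.

    Assumes header itself contains no whitespace.
    """
    with open(path, "r") as fh:
        # all() stops at the first mismatch, so at most a header's worth
        # of non-whitespace characters is read.
        return all(read_non_space(fh) == c for c in header)


def write_temp(text):
    """Write text to a temporary file and return its path (demo helper)."""
    with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
        tmp.write(text)
        return tmp.name


HEADER = '{"discriminator":"http://vocab.lappsgrid.org'

# Hypothetical document; only the "discriminator" key comes from the post.
doc = {
    "discriminator": "http://vocab.lappsgrid.org/ns/media/jsonld#lif",
    "payload": {},
}

paths = {
    "minified": write_temp(json.dumps(doc, separators=(",", ":"))),
    "pretty": write_temp(json.dumps(doc, indent=4)),
    "other": write_temp(json.dumps({"name": "not a LAPPS document"})),
}
results = {name: matches_header(path, HEADER) for name, path in paths.items()}
for path in paths.values():
    os.unlink(path)

print(results)
```

The minified and pretty-printed files both match, and the unrelated document does not. One caveat of this approach: because whitespace is skipped everywhere, including inside string values, it only works for headers that contain no whitespace themselves.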
Being new to Python I don’t know how expensive it is to read a file one character at a time, but it has to be cheaper than potentially reading the entire file.
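On the cost question: files opened with Python's built-in open() are buffered by default, so each read(1) is normally served from an in-memory buffer and does not hit the disk every time. The remaining cost is the per-call overhead. If that ever matters, one alternative is to read a bounded chunk and strip the leading whitespace in one go. The sketch below assumes a hypothetical helper name (first_non_space) and an arbitrary chunk size of 4096 characters.

```python
import os
import tempfile


def first_non_space(path, chunk_size=4096):
    """Return the first non-whitespace character of a file, or '' if none."""
    with open(path, "r") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                # End of file reached without finding a non-space character.
                return ""
            stripped = chunk.lstrip()
            if stripped:
                return stripped[0]


# Demo: a file with a long run of leading whitespace before the JSON starts.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    tmp.write(" " * 10000 + '{"key": "value"}')
    demo_path = tmp.name

found = first_non_space(demo_path)
os.unlink(demo_path)
print(found)
```

Either way, memory use stays bounded by the chunk or buffer size and does not grow with the size of the file.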
Cheers,
Keith
------------------------------
Research Associate
Department of Computer Science
Vassar College
Poughkeepsie, NY