character set encoding metadata
Does Galaxy have an official place to store the character set for a text file? I'm thinking of modifying the upload tool to prompt for the character set, since it cannot be sniffed. It should go in the hda/ldda metadata hash or extended_metadata for the dataset. Any opinions? BTW: everyone should watch this talk on Unicode: http://bit.ly/unipain
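To illustrate why the character set cannot be sniffed reliably, here is a minimal Python sketch. The sample string and the list of candidate encodings are arbitrary choices made for illustration, not anything taken from Galaxy.

```python
# The same bytes decode without error under several single-byte encodings,
# so the bytes alone do not reveal which one the author intended.
raw = "caf\u00e9".encode("iso-8859-1")  # the bytes b'caf\xe9'

candidates = ("iso-8859-1", "cp1252", "iso-8859-15", "cp437")
decoded = {charset: raw.decode(charset) for charset in candidates}

for charset, text in decoded.items():
    print(charset, repr(text))

# UTF-8 is stricter: a lone 0xE9 byte is not a complete UTF-8 sequence,
# so a failed UTF-8 decode can rule UTF-8 out but cannot name the real charset.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

print("valid UTF-8:", is_valid_utf8)
```

Every candidate decodes successfully, and at least one of them yields different text, which is why a prompt on the upload form is more dependable than a guess.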
Excellent talk! I watched it earlier in the week, and I feel like it has already made me a better programmer.

That said, I am not sure I have any really great answers for you. The least disruptive thing you could do is add a new metadata element to the Text datatype (hopefully most text-based datatypes subclass it). Intuitively this makes sense to me because the attribute is only valid for text-based datatypes, not for all datasets or hdas. How to set it is not entirely clear to me, though: for any given datatype you could override or modify set_meta to do this, but I don't know how I would get that input from the user into the set_meta procedure.

Depending on your use cases, this next suggestion might not be viable, but the easiest and most robust thing to do is probably not to track which character set something is in but instead to assume it is all UTF-8 (ASCII files are already valid UTF-8; ISO-8859-1 files are too as long as they contain only ASCII characters). Then modify the upload form(s) to have a new option to convert incoming input files from a supplied character set into UTF-8. Then in tools/data-source/upload.py check for this new character set parameter and use a tool such as recode or iconv to do the conversion during the upload/dataset creation process.

I am making no promises, but if you were hoping to get these changes included in Galaxy, this is what I would be most willing to consider. Getting Galaxy to track, process, and serve a bunch of different character sets would be a real challenge; assuming it is all just UTF-8 is much easier. It is a variant on the advice given in the video you sent along: instead of converting everything to unicode as soon as possible, you would be converting everything to UTF-8 as soon as possible.

Hope this helps.

-John

On Sun, Nov 24, 2013 at 5:22 PM, Robert Baertsch <baertsch@soe.ucsc.edu> wrote:
Does Galaxy have an official place to store the character set for a text file?
I'm thinking of modifying the upload tool to prompt for the character set, since it cannot be sniffed.
It should go in the hda/ldda metadata hash or extended_metadata for the dataset.
Any opinions?
BTW: everyone should watch this talk on Unicode.
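The convert-on-upload idea from the reply above could be sketched roughly as follows. This is not Galaxy's upload code: the function name convert_to_utf8 and its parameters are hypothetical, and it uses Python's built-in codecs in place of an external tool such as recode or iconv.

```python
import codecs


def convert_to_utf8(src_path, dst_path, source_charset, chunk_size=1 << 16):
    """Re-encode src_path from source_charset into UTF-8 at dst_path.

    Hypothetical helper for illustration only. Raises LookupError for an
    unknown charset name and UnicodeDecodeError if the file contents are
    not valid in the declared charset.
    """
    # Fail early, before creating any output, if the charset name is unknown.
    codecs.lookup(source_charset)

    # newline="" turns off newline translation so line endings pass through unchanged.
    with open(src_path, "r", encoding=source_charset, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            # Read decoded text in chunks so large uploads are not held in memory.
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
```

A call such as convert_to_utf8("upload.dat", "upload.utf8", "iso-8859-1") would run once, right after the file arrives, with the charset taken from the new upload form option. Everything downstream could then assume UTF-8. Raising on undecodable input, instead of silently replacing bytes, is a deliberate choice in this sketch so that a wrong charset selection is noticed at upload time.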
participants (2)
- John Chilton
- Robert Baertsch