A Report Calc tool which provides a scripting language and interpreter for text mining and quality control
Folks,

We're testing out a basic scripting language and interpreter for report writing and quality control. It is meant to give both command-line programmers and Galaxy platform researchers and admins ways to tweak workflow quality control behaviour without having to be programmers themselves. I wanted to sound out the community about whether there are any basic objections to our approach, described briefly below.

Currently the interpreter provides all the built-in Python operator and math functions (as well as some named-group regular expression functions for text mining) so that users can examine input log or data datasets for fields that need to be reported or compared with QC metric rules.

The Report Calc tool takes in a file of statements, each in function(parameter1 parameter2 ...) syntax, that look like this:

    set(4857000 report/contigs/reference_genome_size)
    set("serovar Typhimurium LT2" report/contigs/reference_genome)
    set(0.1 report/contigs/good_genome_size_ratio)
    set( statisticN(report/contigs/contig_lengths 50) report/contigs/N50)
    if( lt(/N50 200000) set(report/job/status FAIL))

Math is accomplished by python built-in math functions (ignore the "/"; that's a namespace syntax character):

    set( truediv( abs( sub( /sampleGenomeSize /referenceGenomeSize)) /referenceGenomeSize) report/contigs/sample_genome_size_ratio )

And it writes any output as desired to a standard tool output folder:

    writeFile( pageHtml( getHtml(report) "My Report Widget") report.html )

It allows users to build a ruleset (a file containing statements like the above) and process text, json, and tabular datasets in their history, and it can manipulate variables, arrays and dictionaries in the tool's in-memory temporary data structure namespace. It doesn't touch Galaxy's inner workings or interact with a workflow except by way of app exit codes.

One can even write little text-mining programs in it:

    iterate( readFileByName(contigs-all.fasta)
      if( eq( getitem(iterator/0/value 0) ">")
        append( regexp(iterator/0/value "length_(?P<value>\\d+)_") report/contigs/contig_lengths )
      )
    )

I'll release it shortly for review/play, with lots of documentation that describes in detail what it can and can't do, and an argument for why one might want to bother using or learning it. There are one or two irrelevant python built-in functions we might filter out (e.g. isCallable() ). So far we haven't spotted any security issues, and we've limited flow control so that loops run only over iterables, which means there's no evident way to create infinite loops: one only iterates through files or in-memory arrays.

Suggestions for the wish-list are also welcome now!

Regards,
Damion
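For readers more at home in Python, the text-mining rule and the statisticN call above can be approximated as follows. This is only a sketch, not the Report Calc implementation: the helper names contig_lengths and statistic_n, the sample FASTA headers, and the exact definition of the N statistic (the usual assembly N50 convention) are assumptions made for illustration.

```python
import re

# Same named-group pattern as in the rule above, written as a Python raw string.
LENGTH_RE = re.compile(r"length_(?P<value>\d+)_")


def contig_lengths(fasta_lines):
    """Collect contig lengths from FASTA header lines (those starting with '>')."""
    lengths = []
    for line in fasta_lines:
        if line.startswith(">"):
            match = LENGTH_RE.search(line)
            if match:
                lengths.append(int(match.group("value")))
    return lengths


def statistic_n(lengths, percent):
    """Assumed N statistic: walking contigs from longest to shortest, return
    the length at which the running total first reaches `percent` % of the
    total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 100 >= total * percent:
            return length
    return 0  # no contigs at all


# Made-up headers, for illustration only.
sample = [
    ">NODE_1_length_500_cov_10.0", "ACGT",
    ">NODE_2_length_300_cov_8.5", "ACGT",
    ">NODE_3_length_200_cov_7.1", "ACGT",
]
lengths = contig_lengths(sample)
n50 = statistic_n(lengths, 50)
n99 = statistic_n(lengths, 99)
```

With the three made-up contigs, the longest one alone covers half of the 1000 assembled bases, so the sketch reports it as the N50, while the N99 falls on the shortest contig.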
On Jan 11, 2016, at 2:42 PM, Dooley, Damion wrote:
... we're testing out a basic scripting language ... meant to provide [folks] with ways to [do something] without having to be programmers ... .... if( lt(/N50 200000) set(report/job/status FAIL))
Math is accomplished by python built-in math functions ...
It could well be that's the only way to accomplish what you want in whatever environment you're in. But the use of prefix notation and a funny name, for an operator like "<" that non-programmers use familiarly as infix, would seem contrary to the stated goal that the user needn't be a programmer.

If math can be accomplished via python, why not "<"? By "math" do you only mean function calls, and not arithmetic operators? Is it that python eval() can't be used because of security issues?

Bob H
Hi,

Thanks for the quick reply, and good point about keeping math expressions familiar to most users. In this first round I settled for a simple prefix "function(parameter1 parameter2 ...)". All the python infix operators like "a + b" have equivalent prefix functions "add(a b)", so the latter are used.

Yes, infix "a + b" parsing would be easier to read, and I was thinking we could try that on our next iteration (and remain backward compatible), but more opinions may move that up! (There are some other issues to tackle too: deciding whether "/" should map to "div(a b)" or "truediv(a b)", and how to avoid conflict with our namespace method of referring to variables via a/b/c paths.)

I'm definitely avoiding eval() since we do have to control exactly which functions, conditionals, and loop constructs are executed. We're not trying to provide the all-out iPython approach.

D.

On 2016-01-11, 12:03 PM, "Bob Harris" <rsharris@bx.psu.edu> wrote:
On Jan 11, 2016, at 2:42 PM, Dooley, Damion wrote:
... we're testing out a basic scripting language ... meant to provide [folks] with ways to [do something] without having to be programmers ... .... if( lt(/N50 200000) set(report/job/status FAIL))
Math is accomplished by python built-in math functions ...
It could well be that's the only way to accomplish what you want in whatever environment you're in. But the use of prefix notation and a funny name, for an operator like "<" that non-programmers use familiarly as infix, would seem contrary to the stated goal that the user needn't be a programmer.
If math can be accomplished via python, why not "<"? By "math" do you only mean function calls, and not arithmetic operators? Is it that python eval() can't be used because of security issues?
Bob H
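The eval() question in this exchange can be illustrated with a small Python sketch of a whitelist-based prefix evaluator. It is not the Report Calc interpreter: the FUNCTIONS table, the evaluate and _parse helpers, and the restriction to numeric literals are all assumptions chosen to keep the example short. The point it shows is that every callable comes from an explicit table built on Python's operator module, so nothing outside that table can ever run.

```python
import operator
import re

# Whitelist: only these names can be called (an illustrative subset).
FUNCTIONS = {
    "add": operator.add,
    "sub": operator.sub,
    "truediv": operator.truediv,
    "floordiv": operator.floordiv,
    "abs": operator.abs,
    "lt": operator.lt,
    "gt": operator.gt,
}

# Tokens are parentheses or runs of characters without spaces/parentheses.
TOKEN_RE = re.compile(r"[()]|[^\s()]+")


def evaluate(source):
    """Evaluate one prefix expression such as 'lt(add(1 2) 4)'."""
    tokens = TOKEN_RE.findall(source)
    value, position = _parse(tokens, 0)
    if position != len(tokens):
        raise ValueError("unexpected trailing input")
    return value


def _parse(tokens, position):
    if position >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[position]
    # A name followed by "(" is a function call.
    if position + 1 < len(tokens) and tokens[position + 1] == "(":
        if token not in FUNCTIONS:
            raise ValueError(f"function not allowed: {token}")
        position += 2
        args = []
        while position < len(tokens) and tokens[position] != ")":
            arg, position = _parse(tokens, position)
            args.append(arg)
        if position >= len(tokens):
            raise ValueError("missing closing parenthesis")
        return FUNCTIONS[token](*args), position + 1
    # Anything else must be a numeric literal (float() raises ValueError otherwise).
    return float(token), position + 1


is_small = evaluate("lt(add(1 2) 4)")
true_quotient = evaluate("truediv(7 2)")
floor_quotient = evaluate("floordiv(7 2)")
```

The last two calls show the "/" mapping decision mentioned above: truediv keeps the fractional part, while floordiv (the Python 3 counterpart of integer division) discards it. A call such as evaluate("eval(1)") is rejected with a ValueError because eval is not in the table.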
On Jan 11, 2016, at 3:31 PM, Dooley, Damion wrote:
Yes, infix "a + b" parsing would be easier to read

If you are targeting non-programmers, I think the bigger point is that a + b is easier to write. I do understand the motivation for prefix from the implementation standpoint. The issue, I suppose, will be how early you want to include non-programmers as users.

Anyway, that's my 15.2 cents (inflation unadjusted). I'm not your target audience in any case.

Bob H
Hi folks,

I'm back with one more feeler to gauge interest in the approach we are trying out for a Galaxy quality control tool to interject into existing bioinformatics pipelines. With some nudging (thanks Bob) I've implemented basic infix math expressions. We're also trying out the inclusion of ontology metadata within report data to encourage data import/export/comparison. The goal is to make it easy to see and change quality control metrics (without having to recompile code or modify Galaxy workflow mechanics).

The QC scripting language/interpreter as a Galaxy tool lets us read in text file(s), say some assembly contig data, and then run a program (a set of rules) like:

    store( 200 report/contigs/contig_count_QC_threshold )
    store( 200000 report/contigs/contig_N50_QC_threshold )
    store( 2000 report/contigs/contig_N99_QC_threshold )
    if( ( genome_size_ratio > genome_size_ratio_QC_threshold ) fail(qc "Failed genome size ratio threshold") )
    store( statisticN( contig_lengths 50 ) report/contigs/contig_N50 )
    store( statisticN( contig_lengths 99 ) report/contigs/contig_N99 )
    if( contig_N50 < contig_N50_QC_threshold fail(qc "Failed minimum N50 contig length threshold") )
    if( contig_N99 < contig_N99_QC_threshold fail(qc "Failed minimum N99 contig length threshold") )
    if( report/contigs/contigs_count > contig_count_QC_threshold fail(job "Failed minimum contig count threshold" ) )

This is a generic, basic function(parameter1 parameter2 ...) type of language. On a good run this yields a JSON report like:

    {
      "title": "RCQC Quality Control Report",
      "tool_version": "0.0.7",
      "job": { "status": "ok" },
      "quality_control": { "status": "ok" },
      "date": "2016-02-09 09:21",
      "contigs": {
        "contig_lengths": [ 128, 172, 221, 224, 238, 230, 240, 246, 407, ... , 242, 2284, 1506],
        "genome_size_ratio_QC_threshold": 0.10000000000000001,
        "contig_N99_QC_threshold": 2000,
        "assembly_genome_size": 4615592,
        "genome_size_ratio": 0.04970310891496809,
        "contig_N50": 427122,
        "contig_N99": 8542,
        "contig_count_QC_threshold": 200,
        "contig_count": 44,
        "reference_genome_identifier": "serovar Typhimurium LT2",
        "reference_genome_size": 4857000,
        "contig_N50_QC_threshold": 200000
      },
      "@context": {
        "contigs": "http://purl.obolibrary.org/obo/SO_0001462",
        "genome_size_ratio_QC_threshold": "http://purl.obolibrary.org/obo/GenEpiO_0001564",
        "contig_N99_QC_threshold": "http://purl.obolibrary.org/obo/GenEpiO_0001566",
        "assembly_genome_size": "http://purl.obolibrary.org/obo/GenEpiO_0001561",
        "genome_size_ratio": "http://purl.obolibrary.org/obo/GenEpiO_0001563",
        "contig_N50": "http://purl.obolibrary.org/obo/OBI_0001941",
        "contig_N99": "http://purl.obolibrary.org/obo/GenEpiO_0001570",
        "contig_count_QC_threshold": "http://purl.obolibrary.org/obo/GenEpiO_0001571",
        "contig_count": "http://purl.obolibrary.org/obo/GenEpiO_0000093",
        "reference_genome_identifier": "http://purl.obolibrary.org/obo/GenEpiO_0001562",
        "date": "http://purl.obolibrary.org/obo/IAO_0000416",
        "reference_genome_size": "http://purl.obolibrary.org/obo/GenEpiO_0001560",
        "contig_N50_QC_threshold": "http://purl.obolibrary.org/obo/GenEpiO_0001565"
      }
    }

So we'd appreciate any advice on roadblocks or desired features you perceive on this...

- Damion
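As a cross-check on the report above, the same thresholds can be re-applied in plain Python. This is a sketch under stated assumptions rather than what the tool does internally: the qc_failures helper is hypothetical, and the genome size ratio is assumed to follow the truediv(abs(sub(sample reference)) reference) formula given in the first message of this thread. All numbers are copied from the JSON report.

```python
# Values copied from the "contigs" section of the report above.
report = {
    "assembly_genome_size": 4615592,
    "reference_genome_size": 4857000,
    # The report prints 0.10000000000000001, which is the same float as 0.1.
    "genome_size_ratio_QC_threshold": 0.1,
    "contig_N50": 427122,
    "contig_N50_QC_threshold": 200000,
    "contig_N99": 8542,
    "contig_N99_QC_threshold": 2000,
    "contig_count": 44,
    "contig_count_QC_threshold": 200,
}


def qc_failures(r):
    """Recompute the genome size ratio and return it with any failed rules."""
    ratio = (
        abs(r["assembly_genome_size"] - r["reference_genome_size"])
        / r["reference_genome_size"]
    )
    failures = []
    if ratio > r["genome_size_ratio_QC_threshold"]:
        failures.append("Failed genome size ratio threshold")
    if r["contig_N50"] < r["contig_N50_QC_threshold"]:
        failures.append("Failed minimum N50 contig length threshold")
    if r["contig_N99"] < r["contig_N99_QC_threshold"]:
        failures.append("Failed minimum N99 contig length threshold")
    if r["contig_count"] > r["contig_count_QC_threshold"]:
        failures.append("Failed minimum contig count threshold")
    return ratio, failures


ratio, failures = qc_failures(report)
```

Under these assumptions the recomputed ratio agrees with the reported genome_size_ratio, and no rule fires, which is consistent with the "ok" status in both the job and quality_control sections.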