New subject: A Report Calc tool which provides a scripting language and interpreter for text mining and quality control

11 Jan 2016

Folks, we're testing out a basic scripting language and interpreter for
report writing and quality control. It is meant to give command-line users
as well as Galaxy platform researchers and admins ways to tweak workflow
quality control behaviour without having to be programmers themselves.

I wanted to sound out the community about whether there are any basic
objections to our approach, described briefly below.

Currently the interpreter provides all the functions from Python's
standard operator and math modules (as well as some named-group regular
expression functions for text mining) so that users can examine input
datasets (logs or data) for fields that need to be reported or compared
with QC metric rules.
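
To make that concrete, here is a rough Python sketch of how such a
function table could be assembled. It is not the tool's actual code: the
names build_function_table and FUNCTIONS are made up for illustration, and
the regexp helper only assumes, based on the examples in this message, that
the pattern carries a named group called "value".

```python
import math
import operator
import re


def build_function_table():
    """Collect the public callables of the operator and math modules by name."""
    table = {}
    for module in (operator, math):
        for name in dir(module):
            if name.startswith("_"):
                continue  # skip private and dunder names
            obj = getattr(module, name)
            if callable(obj):
                table[name] = obj  # math wins on a name clash such as pow
    return table


def regexp(text, pattern):
    """Return the named group 'value' of the first match, or None (assumed behaviour)."""
    match = re.search(pattern, text)
    return match.group("value") if match else None


# Hypothetical lookup table an interpreter could dispatch on.
FUNCTIONS = build_function_table()
FUNCTIONS["regexp"] = regexp
```

A statement such as lt(/N50 200000) would then amount to looking up "lt"
in the table and calling it with the resolved arguments.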

The Report Calc tool takes in a file of statements, each written in a
function(parameter1 parameter2 ...) syntax, that look like this:

   set(4857000 report/contigs/reference_genome_size)
   set("serovar Typhimurium LT2" report/contigs/reference_genome)
   set(0.1 report/contigs/good_genome_size_ratio)

   set( statisticN(report/contigs/contig_lengths 50) report/contigs/N50)
   if( lt(/N50 200000) set(FAIL report/job/status))
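
For anyone unfamiliar with N50, here is a small Python sketch of the
calculation that statisticN(... 50) is presumably doing. This assumes the
usual definition (the contig length at which the running total of lengths,
longest first, reaches the given percentage of the assembly); the function
name statistic_n is illustrative, not the tool's implementation.

```python
def statistic_n(lengths, percent=50):
    """Return the N-statistic (N50 by default) of a list of contig lengths."""
    # Lengths mined from text may arrive as strings, so coerce to int first.
    ordered = sorted((int(length) for length in lengths), reverse=True)
    threshold = sum(ordered) * percent / 100.0
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return None  # no contigs were supplied


def job_status(n50, minimum=200000):
    """Mirror the rule above: fail the job when N50 is below the cutoff."""
    return "FAIL" if n50 < minimum else "PASS"  # "PASS" is an assumed label
```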

Math is done with Python's built-in math functions (the "/" in the
example below is not division; it is a namespace syntax character).

   set( 
      truediv(
         abs( sub( /sampleGenomeSize /referenceGenomeSize))
         /referenceGenomeSize
      )
      report/contigs/sample_genome_size_ratio
   )
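
In plain Python the same calculation reads as follows. The reference size
and the 0.1 ratio are the values set earlier in this message; the sample
genome size is a made-up number purely for illustration.

```python
import operator

reference_genome_size = 4857000   # from the set(...) statement above
good_genome_size_ratio = 0.1      # from the set(...) statement above
sample_genome_size = 5100000      # hypothetical value for illustration

# Same nesting as the Report Calc statement: truediv(abs(sub(a b)) b)
sample_genome_size_ratio = operator.truediv(
    operator.abs(operator.sub(sample_genome_size, reference_genome_size)),
    reference_genome_size,
)

# A QC rule could then compare the ratio against the accepted limit.
within_limit = operator.lt(sample_genome_size_ratio, good_genome_size_ratio)
```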

It writes any output as desired to a standard tool output folder:

   writeFile( 
      pageHtml( getHtml(report) "My Report Widget")
      report.html
   )
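
The statement syntax in the examples above is simple enough that a parser
fits in a few lines. The following Python sketch is only a guess at one way
to do it, not the tool's parser: the names tokenize and parse are invented
here, calls come back as (name, [arguments]) tuples, and everything else
(numbers, paths, quoted strings) is left as raw text.

```python
import re

# A quoted string, a parenthesis, or a bare atom (name, number, path).
TOKEN = re.compile(
    r'\s*(?:(?P<string>"(?:\\.|[^"\\])*")|(?P<paren>[()])|(?P<atom>[^\s()"]+))'
)


def tokenize(text):
    """Split one statement into string, parenthesis and atom tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if not match:
            raise SyntaxError("cannot tokenize at position %d" % pos)
        tokens.append(match.group(match.lastgroup))
        pos = match.end()
    return tokens


def parse(tokens, pos=0):
    """Parse the expression starting at pos; return (node, next_pos)."""
    token = tokens[pos]
    if token in ("(", ")"):
        raise SyntaxError("unexpected %r" % token)
    is_call = (
        pos + 1 < len(tokens)
        and tokens[pos + 1] == "("
        and not token.startswith('"')
    )
    if not is_call:
        return token, pos + 1
    args = []
    pos += 2  # skip the function name and the opening parenthesis
    while pos < len(tokens) and tokens[pos] != ")":
        arg, pos = parse(tokens, pos)
        args.append(arg)
    if pos >= len(tokens):
        raise SyntaxError("missing closing parenthesis for %s" % token)
    return (token, args), pos + 1
```

An interpreter would then walk the resulting tree, look each function name
up in its table, and resolve path-like atoms against the namespace.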

It allows users to build a ruleset (a file containing statements like the
above) and process text, JSON and tabular datasets in their history, and
it can manipulate variables, arrays and dictionaries in the tool's
in-memory temporary data structure namespace.  It doesn't touch Galaxy's
inner workings or interact with a workflow except by way of app exit codes.
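
As a rough picture of that namespace, slash-separated paths like
report/contigs/N50 can be thought of as keys into nested dictionaries. The
Python sketch below assumes exactly that; set_path, get_path and exit_code
are hypothetical helpers, and mapping a FAIL status to exit code 1 is an
assumption about how the tool might signal a workflow.

```python
def set_path(namespace, path, value):
    """Store value at a slash-separated path, creating dictionaries as needed."""
    parts = [part for part in path.split("/") if part]
    node = namespace
    for key in parts[:-1]:
        node = node.setdefault(key, {})
    node[parts[-1]] = value


def get_path(namespace, path, default=None):
    """Fetch the value at a slash-separated path, or default if it is absent."""
    node = namespace
    for key in (part for part in path.split("/") if part):
        if not isinstance(node, dict) or key not in node:
            return default
        node = node[key]
    return node


def exit_code(namespace):
    """Assumed convention: a non-zero exit code when the job status is FAIL."""
    return 1 if get_path(namespace, "report/job/status") == "FAIL" else 0
```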

One can even write little text-mining programs in it:

   iterate( readFileByName(contigs-all.fasta)
      if( eq( getitem(iterator/0/value 0) ">")
         append( 
            regexp(iterator/0/value "length_(?P<value>\\d+)_")
            report/contigs/contig_lengths
         )
      )
   )
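
For comparison, here is roughly what that little program does when written
directly in Python. This is a sketch, not generated output of the tool; the
function name contig_lengths is invented, and converting the matched text
to int is my own addition.

```python
import re

# Same pattern as in the Report Calc example above.
LENGTH_PATTERN = re.compile(r"length_(?P<value>\d+)_")


def contig_lengths(lines):
    """Collect the length field from every FASTA header line."""
    lengths = []
    for line in lines:
        if line[:1] == ">":  # header lines start with ">"
            match = LENGTH_PATTERN.search(line)
            if match:
                lengths.append(int(match.group("value")))
    return lengths
```

Any iterable of lines works, so an open file handle for contigs-all.fasta
can be passed in directly.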

I'll release it shortly for review/play, with lots of documentation that
describes in detail what it can/can't do, and an argument for why one
might want to bother using/learning it.

There are one or two irrelevant Python built-in functions we might filter
out (e.g. isCallable()). So far we haven't spotted any security issues,
and we've limited flow control so that loops can only run over iterables,
so there's no evident way to create infinite loops.  One only iterates
through files or in-memory arrays.

Suggested functionality for the wish list is also welcome!

Regards,

Damion
