[galaxy-dev] Creating a galaxy tool in R - "You must not use 8-bit bytestrings"

24 Apr 2012

Apologies for originally posting this to galaxy-user; I now realize it
belongs here.

Hello,

I'm a Galaxy newbie and am running into several issues trying to adapt
an R script to be a Galaxy tool.

I'm using the XY plotting tool (tools/plot/xy_plot.xml) for guidance.
Rather than embedding my script in the XML, I keep it in a separate
script file so that I can still run it from the command line and check
that it works as I make incremental changes. (So my script starts with
args <- commandArgs(TRUE).) If the script works on the command line but
not in Galaxy, that suggests to me that there is a problem with my
Galaxy configuration.

First, I tried using the r_wrapper.sh script that comes with the XY
plotting tool, but it discarded my arguments:

An error occurred running this job:
ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat' __ignored__

ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat' __ignored__

ARGUMENT 'Fly' __ignored__

ARGUMENT 'Tagwise' __ignored__

etc.

So then I tried just switching to Rscript:

  <command interpreter="bash">Rscript RNASeq.R $countsTsv $designTsv
"$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

(My script produces as output a csv file and a pdf file. The final two
arguments I'm passing are the names of those files.)

But then I get an error that Rscript can't be found.

So I wrote a little wrapper script, Rscript_wrapper.sh:

#!/bin/sh

Rscript $*
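
One thing to watch in a wrapper like the one above: an unquoted $* re-splits its arguments on whitespace, so a quoted value such as "$organism" would arrive as several arguments if it ever contained a space. A minimal sketch of the difference, using a hypothetical count_args helper in place of Rscript and a made-up organism name ("Fruit fly"):

```shell
#!/bin/sh
# count_args is a hypothetical stand-in for Rscript: it reports how many
# arguments it received.
count_args() {
    echo "$#"
}

# Forward arguments the way the wrapper above does (unquoted $*).
forward_unquoted() {
    count_args $*
}

# Forward arguments with "$@", which keeps each argument intact.
forward_quoted() {
    count_args "$@"
}

forward_unquoted RNASeq.R "Fruit fly" Tagwise   # prints 4
forward_quoted   RNASeq.R "Fruit fly" Tagwise   # prints 3
```

With the arguments shown in this post (no embedded spaces) both forms behave the same, so this is probably not the cause of the errors below, but "$@" is the safer choice.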

And called that:
  <command interpreter="bash">Rscript_wrapper.sh RNASeq.R $countsTsv
$designTsv "$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

Then I got an error that RNASeq.R could not be found.

So then I added the absolute path to my R script to the <command> tag.
This seemed to work (that is, it got me further, to the next error),
but I'm not sure why I had to do this. In all the other tools I'm
looking at, the directory containing the script does not have to be
specified; I assumed that the command would run in the appropriate
directory.
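
One possible explanation (an assumption, not confirmed against the Galaxy source) is that only the first word of <command> is resolved against the tool directory, while the job itself runs from a different working directory. If that is the case, the wrapper could locate the R script relative to its own location instead of needing an absolute path. A sketch, where script_dir_of is a helper name made up for illustration:

```shell
#!/bin/sh
# script_dir_of is a hypothetical helper: it prints the absolute directory
# that contains the given file path.
script_dir_of() {
    (cd "$(dirname "$1")" && pwd)
}

# Inside Rscript_wrapper.sh this could be used roughly as follows
# (left commented out because it needs Rscript on the PATH):
#   tool_dir=$(script_dir_of "$0")
#   r_script=$1; shift
#   Rscript "$tool_dir/$r_script" "$@"
```

This keeps the tool relocatable: the <command> tag would only name RNASeq.R, and the wrapper would look for it next to itself.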

So now I've specified the full path to my R script:

  <command interpreter="bash">Rscript_wrapper.sh
/Users/dtenenba/dev/galaxy-dist/tools/bioc/RNASeq.R $countsTsv
$designTsv "$organism" $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

And I get the following long error, which includes all of the output
of my R script:

Traceback (most recent call last):
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/runners/local.py",
line 133, in run_job
    job_wrapper.finish( stdout, stderr )
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/__init__.py",
line 725, in finish
    self.sa_session.flush()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/scoping.py",
line 127, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py",
line 1356, in flush
    self._flush(objects)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py",
line 1434, in _flush
    flush_context.execute()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py",
line 261, in execute
    UOWExecutor().execute(self, tasks)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py",
line 753, in execute
    self.execute_save_steps(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py",
line 768, in execute_save_steps
    self.save_objects(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py",
line 759, in save_objects
    task.mapper._save_obj(task.polymorphic_tosave_objects, trans)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/mapper.py",
line 1413, in _save_obj
    c = connection.execute(statement.values(value_params), params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py",
line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py",
line 874, in _execute_clauseelement
    return self.__execute_context(context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py",
line 896, in __execute_context
    self._cursor_execute(context.cursor, context.statement,
context.parameters[0], context=context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py",
line 950, in _cursor_execute
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py",
line 931, in _handle_dbapi_exception
    raise exc.DBAPIError.instance(statement, parameters, e,
connection_invalidated=is_disconnect)
ProgrammingError: (ProgrammingError) You must not use 8-bit
bytestrings unless you use a text_factory that can interpret 8-bit
bytestrings (like text_factory = str). It is highly recommended that
you instead just switch your application to Unicode strings. u'UPDATE
job SET update_time=?, stdout=?, stderr=? WHERE job.id = ?'
['2012-04-24 18:55:45.791417', '', 'BiocInstaller version 1.5.7,
?biocLite for help\nWarning message:\nNAs introduced by coercion
\nLoading required package: methods\nLoading required package:
limma\nLoading required package: BiasedUrn\nLoading required package:
geneLenDataBase\nLoading required package: org.Dm.eg.db\nLoading
required package: AnnotationDbi\nLoading required package:
BiocGenerics\n\nAttaching package:
\xe2\x80\x98BiocGenerics\xe2\x80\x99\n\nThe following object(s) are
masked from \xe2\x80\x98package:stats\xe2\x80\x99:\n\n    xtabs\n\nThe
following object(s) are masked from
\xe2\x80\x98package:base\xe2\x80\x99:\n\n    anyDuplicated, cbind,
colnames, duplicated, eval, Filter, Find,\n    get, intersect, lapply,
Map, mapply, mget, order, paste, pmax,\n    pmax.int, pmin, pmin.int,
Position, rbind, Reduce, rep.int,\n    rownames, sapply, setdiff,
table, tapply, union, unique\n\nLoading required package:
Biobase\nWelcome to Bioconductor\n\n    Vignettes contain introductory
material; view with\n    \'browseVignettes()\'. To cite Bioconductor,
see\n    \'citation("Biobase")\', and for packages
\'citation("pkgname")\'.\n\nLoading required package:
DBI\n\nCalculating library sizes from column totals.\nError in
matrix(u, nrow = nrows, byrow = TRUE) : \n  negative extents to
matrix\nCalls: plotMDS.DGEList ... equalizeLibSizes -> splitIntoGroups
-> lapply -> FUN -> matrix\nExecution halted\n', 15]

Note what happens if I run my script from the command line:

./Rscript_wrapper.sh RNASeq.R
/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat
/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat Fly 1
1 Tagwise MDSPlot.pdf outputs.csv

It works fine and does not produce a warning about "NAs introduced by
coercion", nor does it fail with the "Error in matrix" above.
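
Comparing the two invocations, the argument order appears to differ: the <command> tag passes "$organism" $dispersion and then the two minimum values (so Fly, Tagwise, ...), whereas the command-line run passes Fly 1 1 Tagwise. If the script converts its fourth or fifth argument to a number, that might account for the "NAs introduced by coercion" warning, though this is a guess. A small debugging sketch that prints each argument with its position makes the mismatch easy to see; label_args is a hypothetical helper, and the file names and numeric values are placeholders:

```shell
#!/bin/sh
# label_args is a hypothetical debugging helper: it prints each positional
# argument with its index, one per line.
label_args() {
    i=1
    for arg in "$@"; do
        printf '%d=%s\n' "$i" "$arg"
        i=$((i + 1))
    done
}

# Order taken from the <command> tag (values are placeholders):
label_args counts.dat design.dat Fly Tagwise 1 1 out1 out2
# Order used in the command-line run:
label_args counts.dat design.dat Fly 1 1 Tagwise MDSPlot.pdf outputs.csv
```

The two output file arguments may deserve the same check, since the command-line run passes the PDF name before the CSV name.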

So, can anyone tell me what is going wrong here? Why does R behave
differently in Galaxy than it does on the command line? (I'm using the
same instance of R, on the same machine, for both my Galaxy and
command-line efforts.) Is this 8-bit bytestring error a red herring?
Can I filter the script's output so that Galaxy is happy?
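
If the bytestring error is triggered by the UTF-8 curly quotes (\xe2\x80\x98 and \xe2\x80\x99) in R's package startup messages, which is an assumption, one workaround might be to strip non-ASCII bytes from the script's output before Galaxy stores it. A sketch of such a filter; strip_non_ascii is a made-up name:

```shell
#!/bin/sh
# strip_non_ascii is a hypothetical filter: it reads stdin and keeps only
# tab, newline, carriage return, and printable ASCII bytes.
strip_non_ascii() {
    LC_ALL=C tr -cd '\11\12\15\40-\176'
}

# Example with the same bytes that appear in the traceback
# (\342\200\230 and \342\200\231 are the octal forms of the curly quotes).
printf 'Attaching package: \342\200\230BiocGenerics\342\200\231\n' | strip_non_ascii
```

Inside the wrapper this could be applied as `Rscript "$@" 2>&1 | strip_non_ascii`, at the cost of merging stderr into stdout and hiding Rscript's exit status. It would only hide the symptom; the "Error in matrix" failure would still need a separate fix.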

Finally, one other curiosity. Every time I hit "Execute" in Galaxy to
run my tool, it runs twice: two jobs are created, and each fails in
the same way. Why is this?

My R script:
https://gist.github.com/2482783

My XML file:
https://gist.github.com/2482792

I can share more data (such as sample input files) if necessary.

Thanks for your help.
Dan

Dan Tenenbaum