Re: [galaxy-dev] slow execution of set_metadata.py

8 Dec 2010

      Hello Matthias,

Apologies for the delayed reply, we made some changes to address this 
issue. Updating to the latest changeset will help. If you have not 
already done so, this is encouraged.

You will find that the new changeset handles sam set_meta dramatically 
faster.  It will also now respect max_optional_metadata_filesize which 
is set in datatypes_conf.xml

In addition, this line:

<datatype extension="sam" type="galaxy.datatypes.tabular:Sam" 
display_in_upload="true"/>

Could be changed to

<datatype extension="sam" type="galaxy.datatypes.tabular:Sam" 
display_in_upload="true"  max_optional_metadata_filesize="1048576"/>

And set_meta will not try to dig all the way through huge files.

Hopefully this helps! Thanks for using Galaxy,

Best,

Jen
Galaxy team

On 11/5/10 8:18 AM, Matthias Gierth wrote:
...
Hello List,
I have a small problem with my own local Galaxy instance.
So I try to set up some workflows for NGS. Everything is working fine
for now, but the process for setting the Metadata on a file takes a lot
of time.
Currently I created a workflow --> fastq file from Library -->
fastx_groomer( convert from Illumina to sanger format)-->mapping with bwa
So the grooming and mapping runs fine, but after mapping the
set_metadata.py takes longer than the mapping of the 2gb fastq-file.
The testing server is a Dell R710 with 2x6-core cpus and 72GB of memory
Below there is my Config for Galaxy.
Anybody an Idea whats going wrong with my setup?
many thanks
Matthias
#
# Galaxy is configured by default to be useable in a single-user
development
# environment. To tune the application for a multi-user production
# environment, see the documentation at:
#
# http://bitbucket.org/galaxy/galaxy-central/wiki/Config/ProductionServer
#
# Throughout this sample configuration file, except where stated otherwise,
# uncommented values override the default if left unset, whereas commented
# values are set to the default value.
# examples of many of these options are explained in more detail in the
wiki:
#
# Config hackers are encouraged to check there before asking for help.
# ---- HTTP Server
----------------------------------------------------------
# Configuration of the internal HTTP server.
[server:main]
# The internal HTTP server to use. Currently only Paste is provided. This
# option is required.
use = egg:Paste#http
# The port on which to listen.
port = 8081
# The address on which to listen. By default, only listen to localhost
(Galaxy
# will not be accessible over the network). Use '0.0.0.0' to listen on all
# available network interfaces.
host = 0.0.0.0
# Use a threadpool for the web server instead of creating a thread for each
# request.
use_threadpool = True
# Number of threads in the web server thread pool.
threadpool_workers = 8
# ---- Filters
--------------------------------------------------------------
# Filters sit between Galaxy and the HTTP server.
# These filters are disabled by default. They can be enabled with
# 'filter-with' in the [app:main] section below.
# Define the gzip filter.
[filter:gzip]
use = egg:Paste#gzip
# Define the proxy-prefix filter.
[filter:proxy-prefix]
use = egg:PasteDeploy#prefix
prefix = /galaxy
# ---- Galaxy
---------------------------------------------------------------
# Configuration of the Galaxy application.
[app:main]
# -- Application and filtering
# The factory for the WSGI application. This should not be changed.
paste.app_factory = galaxy.web.buildapp:app_factory
# If not running behind a proxy server, you may want to enable gzip
compression
# to decrease the size of data transferred over the network. If using a
proxy
# server, please enable gzip compression there instead.
#filter-with = gzip
# If running behind a proxy server and Galaxy is served from a
subdirectory,
# enable the proxy-prefix filter and set the prefix in the
# [filter:proxy-prefix] section above.
#filter-with = proxy-prefix
# If proxy-prefix is enabled and you're running more than one Galaxy
instance
# behind one hostname, you will want to set this to the same path as the
prefix
# in the filter above. This value becomes the "path" attribute set in the
# cookie so the cookies from each instance will not clobber each other.
#cookie_path = None
# -- Database
# By default, Galaxy uses a SQLite database at
'database/universe.sqlite'. You
# may use a SQLAlchemy connection string to specify an external database
# instead. This string takes many options which are explained in detail
in the
# config file documentation.
database_connection =
mysql://xxx:xxx@localhost/galaxy?unix_socket=/data/mysql/mysql.sock
# If the server logs errors about not having enough database pool
connections,
# you will want to increase these values, or consider running more Galaxy
# processes.
#database_engine_option_pool_size = 5
#database_engine_option_max_overflow = 10
# If using MySQL and the server logs the error "MySQL server has gone
away",
# you will want to set this to some positive value (7200 should work).
database_engine_option_pool_recycle = 7200
# If large database query results are causing memory or response time
issues in
# the Galaxy process, leave the result on the server instead. This
option is
# only available for PostgreSQL and is highly recommended.
#database_engine_option_server_side_cursors = False
# Create only one connection to the database per thread, to reduce the
# connection overhead. Recommended when not using SQLite:
#database_engine_option_strategy = threadlocal
# -- Files and directories
# Dataset files are stored in this directory.
file_path = /galaxytemp/data
# Temporary files are stored in this directory.
new_file_path = /galaxytemp/tmp
# Tool config file, defines what tools are available in Galaxy.
#tool_config_file = tool_conf.xml
# Path to the directory containing the tools defined in the config.
#tool_path = tools
# Directory where data used by tools is located, see the samples in that
# directory and the wiki for help:
# http://bitbucket.org/galaxy/galaxy-central/wiki/DataIntegration
#tool_data_path = tool-data
# Datatypes config file, defines what data (file) types are available in
# Galaxy.
#datatypes_config_file = datatypes_conf.xml
# -- Mail and notification
# Galaxy sends mail for various things: Subscribing users to the mailing
list
# if they request it, emailing password resets, notification from the
Galaxy
# Sample Tracking system, and reporting dataset errors. To do this, it
needs
# to send mail through an SMTP server, which you may define here.
#smtp_server = None
# On the user registration form, users may choose to join the mailing list.
# This is the address of the list they'll be subscribed to.
#mailing_join_addr = galaxy-user-join@bx.psu.edu
# Datasets in an error state include a link to report the error. Those
reports
# will be sent to this address. Error reports are disabled if no address
is set.
#error_email_to = None
# -- Display sites
# Galaxy can display data at various external browsers. These options
specify
# which browsers should be available. URLs and builds available at these
# browsers are defined in the specifield files.
# UCSC browsers: tool-data/shared/ucsc/ucsc_build_sites.txt
#ucsc_display_sites = main,test,archaea,ucla
# GBrowse servers: tool-data/shared/gbrowse/gbrowse_build_sites.txt
#gbrowse_display_sites = wormbase,tair,modencode_worm,modencode_fly
# GeneTrack servers: tool-data/shared/genetrack/genetrack_sites.txt
#genetrack_display_sites = main,test
# -- UI Localization
# Append "/{brand}" to the "Galaxy" text in the masthead.
#brand = None
# The URL linked by the "Galaxy/brand" text.
#logo_url = /
# The URL linked by the "Galaxy Wiki" link in the "Help" menu.
#wiki_url = http://bitbucket.org/galaxy/galaxy-central/wiki
# The URL linked by the "Email comments..." link in the "Help" menu.
#bugs_email = None
# The URL linked by the "How to Cite..." link in the "Help" menu.
#citation_url = http://bitbucket.org/galaxy/galaxy-central/wiki/Citations
# Serve static content, which must be enabled if you're not serving it
via a
# proxy server. These options should be self explanatory and so are not
# documented individually. You can use these paths (or ones in the proxy
# server) to point to your own styles.
static_enabled = True
static_cache_time = 360
static_dir = %(here)s/static/
static_images_dir = %(here)s/static/images
static_favicon_dir = %(here)s/static/favicon.ico
static_scripts_dir = %(here)s/static/scripts/
static_style_dir = %(here)s/static/june_2007_style/blue
# -- Logging and Debugging
# Verbosity of console log messages. Acceptable values can be found here:
# http://docs.python.org/library/logging.html#logging-levels
#log_level = DEBUG
# Print database operations to the server log (warning, quite verbose!).
#database_engine_option_echo = False
# Print database pool operations to the server log (warning, quite
verbose!).
#database_engine_option_echo_pool = False
# Turn on logging of application events and some user events to the
database.
#log_events = True
# Turn on logging of user actions to the database. Actions currently
logged are
# grid views, tool searches, and use of "recently" used tools menu. The
# log_events and log_actions functionality will eventually be merged.
#log_actions = True
# Debug enables access to various config options useful for development and
# debugging: use_lint, use_profile, use_printdebug and use_interactive. It
# also causes the files used by PBS/SGE (submission script, output, and
error)
# to remain on disk after the job is complete. Debug mode is disabled if
# commented, but is uncommented by default in the sample config.
debug = True
# Check for WSGI compliance.
#use_lint = False
# Run the Python profiler on each request.
#use_profile = False
# Intercept print statements and show them on the returned page.
#use_printdebug = True
# Enable live debugging in your browser. This should NEVER be enabled on a
# public site. Enabled in the sample config for development.
use_interactive = True
# Write thread status periodically to 'heartbeat.log', (careful, uses disk
# space rapidly!). Useful to determine why your processes may be
consuming a
# lot of CPU.
#use_heartbeat = False
# Enable the memory debugging interface (careful, negatively impacts server
# performance).
#use_memdump = False
# -- Data Libraries
# These library upload options are described in much more detail in the
wiki:
#
http://bitbucket.org/galaxy/galaxy-central/wiki/DataLibraries/UploadingFiles
# Add an option to the library upload form which allows administrators to
# upload a directory of files.
library_import_dir = /data/
# Add an option to the library upload form which allows authorized
# non-administrators to upload a directory of files. The configured
directory
# must contain sub-directories named the same as the non-admin user's
Galaxy
# login ( email ). The non-admin user is restricted to uploading files or
# sub-directories of files contained in their directory.
#user_library_import_dir = None
# Add an option to the admin library upload tool allowing admins to paste
# filesystem paths to files and directories in a box, and these paths
will be
# added to a library. Set to True to enable. Please note the security
# implication that this will give Galaxy Admins access to anything your
Galaxy
# user has access to.
allow_library_path_paste = True
# -- Users and Security
# Galaxy encodes various internal values when these values will be
output in
# some format (for example, in a URL or cookie). You should set a key to be
# used by the algorithm that encodes and decodes these values. It can be
any
# string. If left unchanged, anyone could construct a cookie that would
grant
# them access to others' sessions.
#id_secret = USING THE DEFAULT IS NOT SECURE!
# User authentication can be delegated to an upstream proxy server (usually
# Apache). The upstream proxy should set a REMOTE_USER header in the
request.
# Enabling remote user disables regular logins. For more information, see:
# http://bitbucket.org/galaxy/galaxy-central/wiki/Config/ApacheProxy
#use_remote_user = False
# If use_remote_user is enabled and your external authentication
# method just returns bare usernames, set a default mail domain to be
appended
# to usernames, to become your Galaxy usernames (email addresses).
#remote_user_maildomain = None
# If use_remote_user is enabled, you can set this to a URL that will log
your
# users out.
#remote_user_logout_href = None
# Administrative users - set this to a comma-separated list of valid Galaxy
# users (email addresses). These users will have access to the Admin
section
# of the server, and will have access to create users, groups, roles,
# libraries, and more. For more information, see:
# http://bitbucket.org/galaxy/galaxy-central/wiki/Admin/AdminInterface
admin_users = xxxx@biotec.tu-dresden.de,xxxx@biotec.tu-dresden.de
# Force everyone to log in (disable anonymous access).
#require_login = False
# Allow unregistered users to create new accounts (otherwise, they will
have to
# be created by an admin).
#allow_user_creation = True
# Allow administrators to delete accounts.
#allow_user_deletion = False
# By default, users' data will be public, but setting this to True will
cause
# it to be private. Does not affect existing users and data, only ones
created
# after this option is set. Users may still change their default back to
# public.
new_user_dataset_access_role_default_private = True
# -- Beta features
# Enable Galaxy's built-in visualization module, Trackster.
#enable_tracks = False
# Enable Galaxy Pages. Pages are custom webpages that include embedded
Galaxy items,
# such as datasets, histories, workflows, and visualizations; pages are
useful for
# documenting and sharing multiple analyses or workflows. Pages are
created using a
# WYSIWYG editor that is very similar to a word processor.
#enable_pages = False
# Enable the (experimental! beta!) Web API. Documentation forthcoming.
#enable_api = False
# -- Job Execution
# If running multiple Galaxy processes, one can be designated as the job
# runner. For more information, see:
#
http://bitbucket.org/galaxy/galaxy-central/wiki/Config/WebApplicationScaling
enable_job_running = True
# Should jobs be tracked through the database, rather than in memory.
# Necessary if you're running the load balanced setup.
track_jobs_in_database = True
# Enable job recovery (if Galaxy is restarted while cluster jobs are
running,
# it can "recover" them when it starts). This is not safe to use if you are
# running more than one Galaxy server using the same database.
enable_job_recovery = True
# Setting metadata on job outputs to in a separate process (or if using a
# cluster, on the cluster). Thanks to Python's Global Interpreter Lock
and the
# hefty expense that setting metadata incurs, your Galaxy process may
become
# unresponsive when this operation occurs internally.
set_metadata_externally = True
# Although it is fairly reliable, setting metadata can occasionally
fail. In
# these instances, you can choose to retry setting it internally or
leave it in
# a failed state (since retrying internally may cause the Galaxy process
to be
# unresponsive). If this option is set to False, the user will be given the
# option to retry externally, or set metadata manually (when possible).
#retry_metadata_internally = True
# Number of concurrent jobs to run (local job runner)
local_job_queue_workers = 7
# Jobs can be killed after a certain amount of execution time. Format is in
# hh:mm:ss. Currently only implemented for PBS.
#job_walltime = None
# Clustering Galaxy is not a straightforward process and requires some
# pre-configuration. See the the wiki before attempting to set any of these
# options:
# http://bitbucket.org/galaxy/galaxy-central/wiki/Config/Cluster
# Comma-separated list of job runners to start. local is always started. If
# left commented, no jobs will be run on the cluster, even if a cluster
URL is
# explicitly defined in the [galaxy:tool_runners] section below. The
runners
# currently available are 'pbs' and 'drmaa'.
#start_job_runners = None
# The URL for the default runner to use when a tool doesn't explicity
define a
# runner below.
#default_cluster_job_runner = local:///
# The cluster runners have their own thread pools used to prepare and
finish
# jobs (so that these sometimes lengthy operations do not block normal
queue
# operation). The value here is the number of worker threads available
to each
# started runner.
#cluster_job_queue_workers = 3
# These options are only used when using file staging with PBS.
#pbs_application_server =
#pbs_stage_path =
#pbs_dataset_server =
# ---- Tool Job Runners
-----------------------------------------------------
# Individual per-tool job runner overrides. If not listed here, a tool will
# run with the runner defined with default_cluster_job_runner.
[galaxy:tool_runners]
biomart = local:///
encode_db1 = local:///
hbvar = local:///
microbial_import1 = local:///
ucsc_table_direct1 = local:///
ucsc_table_direct_archaea1 = local:///
ucsc_table_direct_test1 = local:///
upload1 = local:///
# ---- Galaxy Message Queue
-------------------------------------------------
# Galaxy uses AMQ protocol to receive messages from external sources like
# bar code scanners. Galaxy has been tested against RabbitMQ AMQP
implementation.
# For Galaxy to receive messages from a message queue the RabbitMQ
server has
# to be set up with a user account and other parameters listed below.
The 'host'
# and 'port' fields should point to where the RabbitMQ server is running.
[galaxy_amqp]
#host = 127.0.0.1
#port = 5672
#userid = galaxy
#password = galaxy
#virtual_host = galaxy_messaging_engine
#queue = galaxy_queue
#exchange = galaxy_exchange
#routing_key = bar_code_scanner
-- 
Jennifer Jackson
http://usegalaxy.org