Huge Output Files in Galaxy (on a compute cluster)
On Apr 22, 2011, at 10:25 AM, Marc Bras wrote:
Hi All,
I use Galaxy on a compute cluster, and each job is launched on a node. When I launch a workflow that generates very large files (several GB), the job finishes on the cluster (I can see this with the 'qstat' command), but Galaxy takes several minutes, or even hours, to display a green box in the history. The dataset still appears to be running even though the job has already finished on the cluster.
I think Galaxy is checking the output file...
Is this normal?
Is it possible to skip this check?
Thanks in advance,
Marc
-- Marc Bras
------------------------------------------
Marc.Bras@versailles.inra.fr
INRA-URGI: Unité de Recherche Génomique Info Centre de Recherche de Versailles-Grignon Route de Saint Cyr 78026 Versailles - FRANCE
Tel: +33 1 30 83 34 70
------------------------------------------
On 22/04/2011 16:39, Dannon Baker wrote:
Marc,
What you're seeing is likely Galaxy setting metadata, such as line counts. A first step would be to check your datatypes_conf.xml and look for the attribute max_optional_metadata_filesize. You can set this on a per-filetype basis or, if you'd like to set it for everything, you could set it on your data entry (everything inherits from data) like so:
<datatype extension="data" type="galaxy.datatypes.data:Data" mimetype="application/octet-stream" max_optional_metadata_filesize="1048576" />
This will tell Galaxy not to set optional metadata on files larger than 1 MB (1048576 bytes), and it might resolve the issue you're seeing.
-Dannon
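As an illustration of the datatypes_conf.xml change described above, here is a minimal Python sketch that sets max_optional_metadata_filesize on one datatype entry and then reports which entries have a limit configured. The sample XML is a hypothetical, trimmed-down stand-in for a real datatypes_conf.xml, and the helper names set_metadata_limit and report_limits are made up for this example; they are not part of Galaxy.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for a real datatypes_conf.xml.
SAMPLE_CONF = """<datatypes>
  <registration>
    <datatype extension="data" type="galaxy.datatypes.data:Data" mimetype="application/octet-stream"/>
    <datatype extension="fastq" type="galaxy.datatypes.sequence:Fastq"/>
  </registration>
</datatypes>"""


def set_metadata_limit(xml_text, extension, max_bytes):
    """Return xml_text with max_optional_metadata_filesize set on one datatype."""
    root = ET.fromstring(xml_text)
    matched = False
    for datatype in root.iter("datatype"):
        if datatype.get("extension") == extension:
            datatype.set("max_optional_metadata_filesize", str(max_bytes))
            matched = True
    if not matched:
        raise ValueError("no datatype entry with extension %r" % extension)
    return ET.tostring(root, encoding="unicode")


def report_limits(xml_text):
    """Map each datatype extension to its configured limit (None if unset)."""
    root = ET.fromstring(xml_text)
    return {
        d.get("extension"): d.get("max_optional_metadata_filesize")
        for d in root.iter("datatype")
    }


# 1024 * 1024 bytes is the 1 MB value used in the reply above.
updated = set_metadata_limit(SAMPLE_CONF, "data", 1024 * 1024)
print(report_limits(updated))
```

Running a report like this against the real file is also a quick way to confirm that the attribute name is spelled exactly as max_optional_metadata_filesize on the entries you expect.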
On Apr 29, 2011, at 9:16 AM, Marc Bras wrote:
Dannon,
First of all, thank you for your reply!
I set the 'max_optional_metadata_filesize' attribute in datatypes_conf.xml, but it still takes a very long time (although I feel it is getting better).
However, when I run 'top' in a terminal on the cluster, I see a Python script, launched by Galaxy, running on the master node. This script sets the metadata.
Is it possible to run this script on a compute node rather than on the master node?
Thanks in advance,
Marc.
Yes, metadata can be set on a separate node. To do this, set the following value in your universe_wsgi.ini:
set_metadata_externally = True
-Dannon
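For reference, the universe_wsgi.ini setting mentioned above can be checked or switched on with a small script. The sketch below uses Python's standard configparser on a hypothetical excerpt of the file; it assumes the option belongs in the [app:main] section, which is not stated in this thread, and the helper names are invented for the example.

```python
import configparser
import io

# Hypothetical excerpt of universe_wsgi.ini. The [app:main] section name
# is an assumption made for this sketch.
SAMPLE_INI = """[server:main]
port = 8080

[app:main]
database_connection = sqlite:///./database/universe.sqlite
"""


def enable_external_metadata(ini_text, section="app:main"):
    """Return ini_text with set_metadata_externally = True in the given section."""
    # interpolation=None so that any literal '%' in values is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section(section):
        raise ValueError("section [%s] not found" % section)
    parser.set(section, "set_metadata_externally", "True")
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


def is_external_metadata_enabled(ini_text, section="app:main"):
    """Report whether set_metadata_externally is switched on (False if absent)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.getboolean(section, "set_metadata_externally", fallback=False)


updated_ini = enable_external_metadata(SAMPLE_INI)
print(is_external_metadata_enabled(SAMPLE_INI))
print(is_external_metadata_enabled(updated_ini))
```

Note that configparser drops comments when it rewrites a file, so on a real installation it is probably safer to use only the check and make the change by hand in an editor.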
Participants (2): Dannon Baker, Marc Bras