Hi Peter,
After installing the clc testing galaxy wrapper I noticed there are still some static paths in the wrapper. I'm in favor of expecting binaries in the system path, but maybe that's a matter of taste.
Greets,
Eric Kuijt
On 30 October 2013 17:03, Peter Cock p.j.a.cock@googlemail.com wrote:
Hello all,
This is just to announce I am working on a wrapper for "CLC Assembly Cell" which is the CLCbio commercial command line assembly tool suite. http://www.clcbio.com/products/clc-assembly-cell/
Our institute bought a licence primarily for use on plant genomes where other assemblers at the time required too much RAM to complete. This assembler is both fast and low memory, which can be very useful.
Wrapper development here: https://github.com/peterjc/pico_galaxy/tree/master/tools/clc_assembly_cell
Prototype releases will be on the Test Tool Shed (soon): http://testtoolshed.g2.bx.psu.edu/view/peterjc/clc_assembly_cell
Stable Tool Shed releases will be here (later): http://toolshed.g2.bx.psu.edu/view/peterjc/clc_assembly_cell
I would be interested to hear from anyone else with access to a licensed copy of the tool interested in using it from Galaxy. e.g. Is it reasonable to assume the tools are on the $PATH, or is using a specific environment variable more helpful?
Regards,
Peter
On Mon, Nov 18, 2013 at 3:31 PM, Eric Kuyt erickuyt@gmail.com wrote:
Hi Peter,
After installing the clc testing galaxy wrapper I noticed there are still some static paths in the wrapper. I'm in favor of expecting binaries in the system path, but maybe that's a matter of taste.
Yes - while waiting for any public opinion on $PATH versus something like $CLCBIO, I had a hard-coded path for my local testing. If you vote for using $PATH, that is fine with me and I can update the wrapper accordingly.
Other than that is it working for you? Could you run the unit tests?
Also which version of CLCbio assembly cell do you have - and is the wrapper capturing this correctly (in case it behaves any differently to the point version we have)?
Thanks Eric,
Peter
(Sorry for sending this twice, I omitted the list the first time)
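A minimal sketch of how the two options discussed above ($PATH versus something like $CLCBIO) could be combined, so that an environment variable is honoured when set and $PATH is the fallback. This is not the wrapper's actual code: the function name find_clc_binary is hypothetical, and treating $CLCBIO as the directory holding the binaries is an assumption.

```python
import os
import shutil


def find_clc_binary(name, env_var="CLCBIO"):
    """Return the path to a CLC binary, or None if it cannot be found.

    Assumption: if the environment variable is set, it names the
    directory that holds the binaries. Otherwise fall back to $PATH.
    """
    base = os.environ.get(env_var)
    if base:
        candidate = os.path.join(base, name)
        # Only accept the candidate if it is an executable file.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    # shutil.which searches $PATH and returns None when nothing matches.
    return shutil.which(name)


print(find_clc_binary("clc_assembler"))
```

With this approach a site that keeps the tools on $PATH needs no extra configuration, while a site with several installed versions can point the variable at one of them.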
Just the licence server was installed and not the actual genomics workbench, so I couldn't do real testing yet. I am now downloading 6-5-1 64bit.
I'll keep you posted.
Hi Peter, it turns out we only have a workbench licence. The clc_assembler packaged with the workbench is called ./clc_assembler_ilo, which has the man page below. Do you think this is the same binary as the clc-assembly-cell assembler?
I will just try to link clc_assembler_ilo to my path and see what it does :)
___Help page___
usage: clc_assembler [options]
Assemble some reads and output contig sequences in fasta format.
Options:
-h / --help: Display this message
-q / --reads: The files following this option are read files. Fasta, fastq, and sff formats are allowed. (may be used several times)
-i <file1> <file2> / --interleave <file1> <file2>: Interleave the sequences in two files, alternating between the files when reading the sequences. Only valid for read files. (may be used several times)
-o <file> / --output <file>: Give the output fasta file (required)
-f <file> / --feature_output <file>: Output scaffolding annotation in GFF (default) or AGP format. The file suffix is used to determine the output format. Use '.gff' for GFF format and '.agp' for AGP format.
-m <n> / --min-length <n>: Set the minimum contig/scaffold length to output (default = 200)
-w <n> / --wordsize <n>: Set the word size for the de Bruijn graph (default is automatic based on input data size)
-b <n> / --bubblesize <n>: Set the maximum bubble size for the de Bruijn graph (default is 50)
--cpus <n>: Set the number of cpus to use.
-v / --verbose: Output various information while running.
-p <par> / --paired <par>: Set the paired read mode for the read files following this option. (may be used several times)
par consists of four strings: <mode> [<dist_mode>] [<min_dist> <max_dist>]
mode is ff, fb, bf, bb and sets the relative orientation of read one and two in a pair (f = forward, b = backward)
dist_mode is ss, se, es, ee and sets the place on read one and two to measure the distance (s = start, e = end).
A typical use would be "-p fb ss 180 250" which means that the reads are inverted and pointing towards each other. The distance includes both the reads and the sequence between them. The distance may be between 180 and 250, both included.
It is also allowed to insert a "d" before the mode. This indicates that the reads in the following file(s) should only be used for their paired end information and not to build initial contigs. E.g. "-p d fb ss 180 250".
To explicitly say that the following reads are not paired, use "no" for par, i.e. "-p no".
For paired end reads split in two files, use the -i option.
-e <file> / --estimatedistances <file> Estimate paired distances for all paired reads and save the distance estimates in <file>. If it is not possible to get an accurate distance estimate for a file, the original paired distance is used.
-g <mode> / --fragmentmode <mode>: Set the mode for how reads are used to create fragments. One mode is "ignore", which ignores the nucleotides when building initial fragments. The other mode is "use", which uses the nucleotides when building initial fragments. This is the default mode. The mode applies to all read files following this option. The option may be used repeatedly.
-n / --no-scaffolding: Pair info is used for contig creation, but no scaffolding is performed.
Examples:
Assembly of a single file with reads:
clc_assembler -o contigs.fasta -q reads.fasta
Assembly of two interleaved files with paired end reads:
clc_assembler -o contigs.fasta -p fb ss 180 250 -q -i reads1.fq reads2.fq
Version: 4.20.91522
-- Central Veterinary Institute of Wageningen UR (CVI) Department of Infection Biology PO box 65, 8200 AB Lelystad, NL Visiting address: ASG, Edelhertweg 15, 8219 PH Lelystad
Tel: +31-(0)320-293391 Fax: +31-(0)320-238153 E-mail: eric.kuijt@wur.nl Web: http://www.cvi.wur.nl
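The idea of linking clc_assembler_ilo into the path under the name clc_assembler could be sketched as below. The helper name link_as and the paths in the commented usage line are hypothetical, and the chosen directory is assumed to already be on $PATH.

```python
import os


def link_as(real_binary, alias_name, bin_dir):
    """Create bin_dir/alias_name as a symlink to real_binary.

    Returns the path of the link. bin_dir is assumed to be a
    directory that is (or will be) listed in $PATH.
    """
    os.makedirs(bin_dir, exist_ok=True)
    link_path = os.path.join(bin_dir, alias_name)
    if os.path.islink(link_path):
        # Replace a stale link left over from an earlier attempt.
        os.remove(link_path)
    os.symlink(os.path.abspath(real_binary), link_path)
    return link_path


# Hypothetical usage; adjust both paths to the local installation:
# link_as("/path/to/workbench/clc_assembler_ilo", "clc_assembler",
#         os.path.expanduser("~/bin"))
```

A symlink leaves the wrapper untouched, which keeps it in line with the Tool Shed version; the alternative is editing the wrapper to look for the _ilo name.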
On Mon, Nov 18, 2013 at 4:23 PM, Eric Kuyt eric.kuijt@wur.nl wrote:
Hi Peter, it turns out we only have a workbench licence, the clc_assembler packaged with the workbench is called ./clc_assembler_ilo
which has the man page below, do you think this is the same binary as the clc-assembly-cell assembler?
I will just try to link clc_assembler_ilo to my path and see what it does :)
Try creating a symlink named clc_assembler (or hacking my wrapper to look for clc_assembler_ilo instead of clc_assembler), but yes, it looks like the same tool under a different name (see below).
Interestingly yours is newer than ours, perhaps we need an update...
What about the clc_mapper and clc_cas_to_sam binaries which are used in my clc_mapper.xml wrapper? Are they there under different names too?
[I have no idea if the CLCbio workbench licence is intended to allow you to run the clc_assembler at the command line as well - you may need to double check that to be safe.]
Regards,
Peter
--
$ /mnt/apps/clcBio/clc-assembly-cell-4.1.0-linux_64/clc_assembler
No read files
usage: clc_assembler [options]
Assemble some reads and output contig sequences in fasta format.
Options:
-h / --help: Display this message
-q / --reads: The files following this option are read files. Fasta, fastq, and sff formats are allowed. (may be used several times)
-i <file1> <file2> / --interleave <file1> <file2>: Interleave the sequences in two files, alternating between the files when reading the sequences. Only valid for read files. (may be used several times)
-o <file> / --output <file>: Give the output fasta file (required)
-f <file> / --feature_output <file>: Output scaffolding annotation in GFF (default) or AGP format. The file suffix is used to determine the output format. Use '.gff' for GFF format and '.agp' for AGP format.
-m <n> / --min-length <n>: Set the minimum contig length to output (default = 200)
-w <n> / --wordsize <n>: Set the word size for the de Bruijn graph (default is automatic based on input data size)
-b <n> / --bubblesize <n>: Set the maximum bubble size for the de Bruijn graph (default is 50)
--cpus <n>: Set the number of cpus to use.
-v / --verbose: Output various information while running.
-p <par> / --paired <par>: Set the paired read mode for the read files following this option. (may be used several times)
par consists of four strings: <mode> [<dist_mode>] [<min_dist> <max_dist>]
mode is ff, fb, bf, bb and sets the relative orientation of read one and two in a pair (f = forward, b = backward)
dist_mode is ss, se, es, ee and sets the place on read one and two to measure the distance (s = start, e = end).
A typical use would be "-p fb ss 180 250" which means that the reads are inverted and pointing towards each other. The distance includes both the reads and the sequence between them. The distance may be between 180 and 250, both included.
It is also allowed to insert a "d" before the mode. This indicates that the reads in the following file(s) should only be used for their paired end information and not to build initial contigs. E.g. "-p d fb ss 180 250".
To explicitly say that the following reads are not paired, use "no" for par, i.e. "-p no".
For paired end reads split in two files, use the -i option.
-e <file> / --estimatedistances <file> Estimate paired distances for all paired reads and save the distance estimates in <file>. If it is not possible to get an accurate distance estimate for a file, the original paired distance is used.
-g <mode> / --fragmentmode <mode>: Set the mode for how reads are used to create fragments. One mode is "ignore", which ignores the nucleotides when building initial fragments. The other mode is "use", which uses the nucleotides when building initial fragments. This is the default mode. The mode applies to all read files following this option. The option may be used repeatedly.
-n / --no-scaffolding: Pair info is used for contig creation, but no scaffolding is performed.
Examples:
Assembly of a single file with reads:
clc_assembler -o contigs.fasta -q reads.fasta
Assembly of two interleaved files with paired end reads:
clc_assembler -o contigs.fasta -p fb ss 180 250 -q -i reads1.fq reads2.fq
Version: 4.10.86742
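To compare the two builds, the "Version:" line at the end of each help page can be parsed into a tuple of integers. This is only a sketch: parse_version is a hypothetical helper, and how the wrapper itself captures the version is not shown in this thread.

```python
import re


def parse_version(help_text):
    """Return the 'Version: x.y.z' line as a tuple of ints, or None."""
    match = re.search(r"^Version:\s*(\d+(?:\.\d+)*)\s*$", help_text, re.MULTILINE)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


# The two version lines quoted in this thread.
workbench = parse_version("usage: clc_assembler [options]\nVersion: 4.20.91522\n")
assembly_cell = parse_version("usage: clc_assembler [options]\nVersion: 4.10.86742\n")

# Tuples compare element by element, so 4.20 sorts after 4.10.
print(workbench > assembly_cell)  # True
```

Comparing tuples of integers avoids the trap of comparing the strings directly, where "4.9" would sort after "4.10".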
Hi Eric,
Were you able to run the tools within the CLC workbench at the command line? This would be very encouraging for getting the most value-for-money out of existing CLCbio licences.
Thanks,
Peter
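For a quick command line trial, the paired-end example from the help page can be built as an argument list. This is a sketch using only the options shown in the help text above; the function name paired_assembly_command and its defaults are hypothetical, and nothing here checks what a given licence permits.

```python
def paired_assembly_command(output, reads1, reads2, min_dist, max_dist,
                            mode="fb", dist_mode="ss", binary="clc_assembler"):
    """Build the argument list for assembling two paired-end read files.

    Follows the help page: -p <mode> <dist_mode> <min_dist> <max_dist>,
    then -q -i <file1> <file2> to interleave the two read files.
    """
    if mode not in ("ff", "fb", "bf", "bb"):
        raise ValueError("mode must be one of ff, fb, bf, bb")
    if dist_mode not in ("ss", "se", "es", "ee"):
        raise ValueError("dist_mode must be one of ss, se, es, ee")
    if min_dist > max_dist:
        raise ValueError("min_dist must not exceed max_dist")
    return [binary, "-o", output,
            "-p", mode, dist_mode, str(min_dist), str(max_dist),
            "-q", "-i", reads1, reads2]


cmd = paired_assembly_command("contigs.fasta", "reads1.fq", "reads2.fq", 180, 250)
print(" ".join(cmd))
# prints: clc_assembler -o contigs.fasta -p fb ss 180 250 -q -i reads1.fq reads2.fq
```

The list can be handed to subprocess.run(cmd, check=True); passing binary="clc_assembler_ilo" would target the workbench copy without needing a symlink.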
galaxy-dev@lists.galaxyproject.org