I've been looking at this all day, and here's what I've figured out. picard_wrapper.py simply puts the SAM header from the input BAM file at the top of the BED file. However, the interval file has its columns in a different order: Seq Name, Start Pos (1-based), End Pos, Strand, Interval Name, whereas the BED file uses Seq Name, Start Pos (0-based), End Pos, Name, Score, Strand. So the BED file actually needs to be converted, not just have the SAM header added.

I wonder whether the wrapper should NOT be doing this at all, and this should instead be a whole different file format. I see in datatypes_conf.xml that a picard_interval_list datatype exists, but I'm not sure it's entirely correct either. Would it be more appropriate to have the user upload a correctly formatted file, or should the wrapper just re-order the BED columns and add 1 to the start pos?

On Tue, Jan 10, 2012 at 11:24 AM, Ryan Golhar <ngsbioinformatics@gmail.com> wrote:
In case anyone is interested, I posted a message to samtools-dev and got a few responses about it. The thread is called 'Picard bait/target format file for HsMetrics'. Now, for Galaxy, I think the wrapper should not accept a BED file as input, since that doesn't work. I like the idea of a new file format (picardBaitTarget or maybe picardIntervalList) as the input type.
If the converter tool adds a header to the BED file, then there is the possibility that a user can associate the BED file with the wrong version of a genome. This is what Picard was trying to avoid. But that doesn't mean a user can't manually add the wrong header anyway. If the BED file is missing strand information, I don't think the tool should add it. I would say just leave the rest of the file alone. If there is no strand information, perhaps the user doesn't care about the strand.
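To make the wrong-genome concern concrete, here is a rough sketch of a sanity check a converter could run before attaching a header. It assumes the header is available as SAM text with standard @SQ lines (SN and LN tags); the function names parse_sq_lengths and find_mismatches are made up for illustration and are not part of picard_wrapper.py.

```python
def parse_sq_lengths(header_text):
    """Collect {sequence name: length} from the @SQ lines of a SAM text header."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tag after the record type looks like "SN:chr1" or "LN:1000".
        tags = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        if "SN" in tags and "LN" in tags:
            lengths[tags["SN"]] = int(tags["LN"])
    return lengths


def find_mismatches(bed_lines, lengths):
    """Return (line number, message) pairs for BED rows that do not fit the header."""
    problems = []
    for n, line in enumerate(bed_lines, 1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # not an interval row (blank line, track line, etc.)
        chrom, end = fields[0], int(fields[2])
        if chrom not in lengths:
            problems.append((n, "sequence %s is not in the header" % chrom))
        elif end > lengths[chrom]:
            problems.append((n, "end %d is past the end of %s (%d)"
                                % (end, chrom, lengths[chrom])))
    return problems
```

A check like this only catches gross mismatches (unknown sequence names, intervals running off the end of a sequence). Two genome builds with the same names and similar lengths would still slip through.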
On Mon, Jan 9, 2012 at 6:11 PM, Ross <ross.lazarus@gmail.com> wrote:
Hi Ryan,
Yes, the Picard tool mandates a bizarre bait/target format file for reasons which might best be addressed to the Picard devs - they may have some very good reasons although I can't imagine what they are. :)
Yes, automated conversion of any valid Galaxy bed dataset into the strange format required by the Picard tool is a very good idea. We're already halfway there because the tool wrapper adds the (IMHO really silly) required SAM header automagically.
A new datatype (eg "picardBaitTarget") and an automated converter would make the tool much easier to use - it's far from ideal to force Galaxy users to comply with the strange Picard format requirements if we can automate a converter.
I thought about implementing one but stopped when I realized that I am not sure what an automated converter should do if the user supplies a valid Galaxy bed lacking strand information - generally, making up strand is not a good idea. I don't have enough insight into the way the stats are calculated to know whether bad things might happen if (eg) we assume all the bait and target regions are on the + strand when they're not - but if someone can describe how to automate the conversion, it would definitely be an improvement to the usability of the Picard tool.
Suggestions welcomed!
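For what it's worth, the per-line part of the conversion could look roughly like the sketch below. It only covers the column reordering and the 0-based to 1-based start shift described in this thread. The function name bed_to_interval_line, the default_strand parameter, and the fallback interval name are all hypothetical. What to do when strand is missing is left to the caller, since that question is still open here: by default the sketch refuses to guess.

```python
def bed_to_interval_line(bed_line, default_strand=None):
    """Convert one tab-separated BED row to a Picard-style interval row.

    BED columns:      Seq Name, Start (0-based), End, Name, Score, Strand
    Interval columns: Seq Name, Start (1-based), End, Strand, Interval Name
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED line needs at least 3 columns: %r" % bed_line)

    chrom = fields[0]
    start = int(fields[1]) + 1  # BED start is 0-based, interval start is 1-based
    end = int(fields[2])        # the end coordinate is the same in both formats

    # Assumption: with no BED name column, build a name from the coordinates.
    if len(fields) > 3 and fields[3]:
        name = fields[3]
    else:
        name = "%s:%d-%d" % (chrom, start, end)

    strand = fields[5] if len(fields) > 5 else "."
    if strand not in ("+", "-"):
        if default_strand is None:
            # Do not make up a strand unless the caller explicitly asks for one.
            raise ValueError("no strand in BED line: %r" % bed_line)
        strand = default_strand

    return "\t".join([chrom, str(start), str(end), strand, name])
```

For example, bed_to_interval_line("chr1\t99\t200\tbait1\t0\t-") gives "chr1\t100\t200\t-\tbait1", while a three-column row raises ValueError unless default_strand is passed.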
Hi all - I think there is a problem with the Picard HSMetrics wrapper in Galaxy. The wrapper accepts a BAM file and a BED file. However, the BED file isn't really in BED format...it requires a SAM header before the BED lines. That really isn't the BED file format. I'm not quite sure how Galaxy should deal with this...maybe a file format specific for Picard
On Tue, Jan 10, 2012 at 8:03 AM, Ryan Golhar <ngsbioinformatics@gmail.com> wrote:
formatted BED file.