Best practices for utilising Unique Molecular Indexing (UMI's)

Brooks, Tony

26 Mar 2018

9:45 a.m.

Hi,

We are currently seeing a number of methods that are utilising the power of unique molecular indexing. Unfortunately, there is no consensus on how libraries should be configured, and therefore no consensus on how to deal with them within Galaxy.

Libraries often have the UMI placed directly downstream of the first (i7) index, such as those using the IDT xGen adapter set (https://www.idtdna.com/pages/products/next-generation-sequencing/adapters/xg...). Sometimes UMIs exist in place of the second (i5) index (https://www.neb.com/nebnext-direct/nebnext-direct-for-target-enrichment).

In both cases, the recommended workflows are convoluted and not all of the necessary tools currently exist in the toolshed, so the datasets need to be taken out of Galaxy, processed and reloaded. It is possible to use bcl2fastq to output the UMI as an additional FASTQ file, but this would then require me to create a dataset triplet (not a pair), which afaik we can't do (yet).

A quick Google/toolshed search turned up UMI-Tools and Je-Suite, both of which exist in Galaxy. Both tools assume the UMI is "in-line" (i.e. at the beginning of read 1 or read 2, not in its own read); they extract/remove the UMI and place it in the read header, where it is then used further down the line to dedup the BAM file.

Does anyone know of any tools that would take the UMI from a separate FASTQ and use it to tag the headers of the actual read data? Or, alternatively, a tool that will paste the UMI onto the 5' end of the reads in the FASTQ? And can these steps be done within Galaxy, or perhaps prior to FASTQ upload?

Does anyone have a method/workflow for UMIs?

Thanks in advance,
Tony
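
To make the request concrete, here is a minimal Python sketch of the kind of tagging step asked about: it reads a read FASTQ and a separate UMI FASTQ in parallel and appends each UMI to the matching read name. It is not taken from UMI-Tools or Je-Suite. The function names, the ":" separator and the example UMI sequence "TTAGGC" are assumptions made for illustration, and the sketch assumes both files list the same clusters in the same order.

```python
import io
from itertools import islice


def read_fastq(handle):
    """Yield (header, sequence, plus, quality) tuples from a FASTQ handle."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        yield tuple(record)


def tag_reads_with_umi(reads, umis, out, sep=":"):
    """Append each UMI sequence to the name of the matching read.

    Hypothetical helper: assumes both inputs hold the same clusters in the
    same order. zip() stops at the shorter input, so unequal file lengths
    are not detected here.
    """
    pairs = zip(read_fastq(reads), read_fastq(umis))
    for (r_head, r_seq, r_plus, r_qual), (u_head, u_seq, _, _) in pairs:
        # Split the header into the read name and the optional comment
        # that follows the first space (e.g. "1:N:0:GCCAAT").
        r_name, _, r_comment = r_head.partition(" ")
        u_name = u_head.split(" ", 1)[0]
        if r_name != u_name:
            raise ValueError(f"read/UMI name mismatch: {r_name} vs {u_name}")
        # Put the UMI before the space so it survives tools that drop
        # everything after the first whitespace.
        new_head = f"{r_name}{sep}{u_seq}"
        if r_comment:
            new_head += f" {r_comment}"
        out.write(f"{new_head}\n{r_seq}\n{r_plus}\n{r_qual}\n")


# Tiny in-memory demonstration with a made-up UMI read.
reads = io.StringIO(
    "@NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT\n"
    "ACGTACGT\n+\nIIIIIIII\n"
)
umis = io.StringIO(
    "@NS500195:396:HLMM5BGX5:1:11101:2331:1043 2:N:0:GCCAAT\n"
    "TTAGGC\n+\nIIIIII\n"
)
tagged = io.StringIO()
tag_reads_with_umi(reads, umis, tagged)
print(tagged.getvalue().splitlines()[0])
# @NS500195:396:HLMM5BGX5:1:11101:2331:1043:TTAGGC 1:N:0:GCCAAT
```

Placing the UMI in front of the space means it becomes the last colon-separated field of the read name once the comment is dropped during alignment.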

Charles Girardot

26 Mar 2018

11:05 a.m.

Hi Tony,

It sounds like what you need is Je-Suite version 2, which we will release soon-ish.

In this version, the new “je debarcode” module lets you define an unlimited number of FASTQ inputs together with their layouts (describing BARCODE, UMI and SAMPLE positions). You can define an unlimited number of output FASTQ layouts by combining the BARCODE, UMI and SAMPLE slots defined in the input layouts (one slot can be written to multiple layouts if need be). For example, you can output the UMIs to their own separate file instead of, or in addition to, keeping them in the read header.

This version also lets you keep the demultiplexed reads in a single output file (for single end, I mean), map this file to the genome, then use the new “je retag” to transfer the BARCODE/UMI info embedded in the read header to proper BAM tags. This should make it easier to deal with single-cell datasets, i.e. manipulate a single file instead of hundreds or thousands.

The command line version is close to completion (“je retag” needs a bit more testing) and I believe we could push this to conda rather quickly. The bad news is that we haven't started to write/update the Galaxy wrappers yet.

Let me know if you are interested in trying the command line version.

Best,
Charles

===================================== Charles Girardot Head of Genome Biology Computational Support (GBCS) and Senior Bioinformatician in the Furlong Lab European Molecular Biology Laboratory Tel: +49 6221 387 -8585 Fax: +49-(0)6221-387-8166 Email: charles.girardot@embl.de Skype: charles_girardot Web : http://gbcs.embl.de Room V205 Meyerhofstraße 1, 69117 Heidelberg, Germany =====================================

Brooks, Tony

27 Mar 2018

2:45 p.m.

Hi Charles,

Thanks for replying. Yes, I'd definitely be interested in the command line version.

I'm assuming I can use je retag to add the UMI from a separate read to my sample FASTQ (already demultiplexed by bcl2fastq), then use the je-markdupes already in Galaxy to dedupe? Would this work?

Btw, I did find a small bug in markdupes. On the NextSeq, the FASTQ header contains a space, e.g.

@NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT

When I run markdupes, I get an error as it thinks the UMI is "1043", not "GCCAAT".

Thanks

Charles Girardot

9:25 p.m.

Hi Tony, please see inline answers

On 27. Mar 2018, at 16:45, Brooks, Tony <a.brooks@ucl.ac.uk> wrote:

> Hi Charles Thanks for replying. Yes, I'd definitely be interested in the command line version.

We'll try to push the version we have to conda tomorrow.

> I'm assuming I can use je retag to add the UMI from a separate read to my sample fastq (already demultiplexed by bcl2fastq) then use the je-markdupes already in Galaxy to dedupe? Would this work?

I think you got me wrong (or I got you wrong!): je retag lets you transfer UMI/BARCODE information embedded in the read names of a *BAM* file to the proper BAM tags (BC/RX/QX/OX/BZ/MI). Indeed, many tools expect (or will expect) this information to be available in BAM tags and not in read names.

The current version of je markdupes expects the UMI in the read name. We haven't yet updated je markdupes to read it from BAM tags (even in Je-Suite 2, which is not fully ready, as mentioned before).
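
As a rough illustration of what moving a UMI from the read name into a tag means, the sketch below rewrites a single SAM text line: it takes the last colon-separated field of the read name as the UMI and appends it as an `RX:Z:` tag (RX is the SAM tag for UMI sequence bases). This is not je retag's implementation. The function name, the ":" separator, the choice to trim the UMI off the read name and the example UMI "TTAGGC" are all assumptions of the sketch.

```python
def retag_sam_line(line, sep=":"):
    """Move a UMI from the end of the read name into an RX:Z: tag.

    Hypothetical helper for illustration; header lines pass through as-is.
    """
    line = line.rstrip("\n")
    if line.startswith("@"):
        # SAM header line: nothing to retag.
        return line
    fields = line.split("\t")
    # Everything after the last separator is taken to be the UMI.
    qname, _, umi = fields[0].rpartition(sep)
    if not qname or not umi:
        raise ValueError(f"no UMI found in read name: {fields[0]}")
    fields[0] = qname
    fields.append(f"RX:Z:{umi}")
    return "\t".join(fields)


# One made-up alignment record whose read name ends in the UMI.
record = "\t".join([
    "NS500195:396:HLMM5BGX5:1:11101:2331:1043:TTAGGC",
    "0", "chr1", "100", "60", "8M", "*", "0", "0",
    "ACGTACGT", "IIIIIIII",
])
retagged = retag_sam_line(record).split("\t")
print(retagged[0])   # NS500195:396:HLMM5BGX5:1:11101:2331:1043
print(retagged[-1])  # RX:Z:TTAGGC
```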

> Btw, I did find a small bug in markdupes. On the NextSeq, the fastq header contains a space, e.g.
>
> @NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT
>
> When I run markdupes, I get an error as it thinks the UMI is "1043", not "GCCAAT".

Read names in BAM files can't contain spaces, so the read name in your BAM is likely to be:

NS500195:396:HLMM5BGX5:1:11101:2331:1043

By default, the UMI is expected in the last position of the read name, hence '1043'. Please double check how the read names look in your BAM file.

Best,
Charles
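
The effect described here can be reproduced in a few lines of Python. The sketch assumes the usual aligner behaviour of dropping the leading "@" and keeping only the part of the FASTQ header before the first whitespace as the read name. The helper names are hypothetical and this is not the code markdupes runs.

```python
def bam_read_name(fastq_header):
    """Return the read name as it would typically appear in a BAM file."""
    if fastq_header.startswith("@"):
        fastq_header = fastq_header[1:]
    # Everything after the first whitespace is a comment and is dropped.
    return fastq_header.split()[0]


def last_field(read_name, sep=":"):
    """Return the last separator-delimited field of a read name."""
    return read_name.rsplit(sep, 1)[-1]


header = "@NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT"
name = bam_read_name(header)
print(name)              # NS500195:396:HLMM5BGX5:1:11101:2331:1043
print(last_field(name))  # 1043
```

The index sequence "GCCAAT" sits after the space, so it never reaches the BAM read name, and the last remaining field is the y-coordinate "1043".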

Brooks, Tony

28 Mar 2018

10:16 a.m.

Hi Charles,

Thanks for replying and clarifying a few things.

Just so I'm straight: I can use debarcode to add the UMIs from FASTQ read file #1 to the read headers of FASTQ read file #2. I would then align to a genome to generate a BAM, then use retag to move the UMI from the read header to the RX tag in the BAM; then I'd be able to use Picard to dedup with RX as the barcode_tag. Feasibly, once the UMI is added to the header, I could still use the current je-markdupes to dedupe (it being a fork of Picard)?

Previously, I've been using fgbio, but this involved aligning the data in Galaxy, exporting it, annotating it with UMIs and then loading it back in. Ideally, we'd like to keep the entire workflow within Galaxy (hence the original question).

Tony
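
For orientation, the core idea behind UMI-aware deduplication can be sketched as follows: reads that share mapping position, strand and UMI are treated as copies of one original molecule. This is a deliberately simplified illustration, not how Picard or je-markdupes work internally. The `Alignment` type and `mark_duplicates` function are hypothetical, and the sketch ignores UMI sequencing errors, mate pairs, soft clipping and quality-based choice of the representative read.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    """Minimal stand-in for an aligned read (hypothetical structure)."""
    name: str
    chrom: str
    pos: int
    strand: str
    umi: str


def mark_duplicates(alignments):
    """Return the names of reads flagged as duplicates.

    The first read seen for each (chrom, pos, strand, umi) key is kept;
    later reads with the same key are flagged.
    """
    seen = set()
    duplicates = set()
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.strand, aln.umi)
        if key in seen:
            duplicates.add(aln.name)
        else:
            seen.add(key)
    return duplicates


example = [
    Alignment("read_a", "chr1", 100, "+", "AAAC"),
    Alignment("read_b", "chr1", 100, "+", "AAAC"),  # same position and UMI
    Alignment("read_c", "chr1", 100, "+", "TTTG"),  # same position, new UMI
]
print(mark_duplicates(example))  # {'read_b'}
```

Without the UMI in the key, read_c would also be flagged, which is the over-collapsing that UMIs are meant to prevent.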

Charles Girardot

10:50 a.m.

Hi Tony,

Yes, you got this right. You might not need to run "je retag" after mapping, depending on the tool you use for removing duplicates.

As I said, having je2 in Galaxy is in our plans, but we still need some time to write the missing wrappers.

You can already get the Je-Suite 2 release candidate from our GitHub, in the “release_2_beta” branch, i.e. https://github.com/gbcs-embl/Je/blob/release_2_beta/dist/je_2.0.RC.tar.gz

Ah, and you need Java 8.

I still need to write the web documentation, but the command line should hopefully get you going (ask if anything is unclear).

Best,
C

Brooks, Tony

4 May

4:29 p.m.

Hi Charles,

I'm running into a problem with debarcode. It looks like a barcode file is required, but seeing as the data I have will already be demultiplexed with bcl2fastq, how do I run it without this file?

Can I also check that my read layout would be

RL='<SAMPLE1:x>, <SAMPLE2:x>, <UMI1:x>'

if I input my read1.fastq, read2.fastq and umi.fastq in that order?

Thanks in advance,
Tony

...

On 28. Mar 2018, at 12:16, Brooks, Tony <a.brooks@ucl.ac.uk> wrote:

Hi Charles Thanks for replying and clarifying a few things. Just so I'm straight, I can use debarcode to add the UMIs from fastq read file #1 into the read header of fastq read file #2. I would then align to a genome to generate a bam, then use retag to switch the UMI from the read header to the RX position in the bam - then I'd be able to use picard to dedup with the RX barcode_tag. Feasibly, once the UMI is added to the header, I could still use the current je-markdupes to dedupe (it being a fork of picard)?

Previously, I've been using fgbio, but this involved aligning the data in Galaxy, exporting it and annotating with UMI's and then loading back in. Ideally, we'd like to keep the entire workflow within Galaxy (hence the original question).

Tony

From: Charles Girardot <charles.girardot@embl.de> Sent: 27 March 2018 22:25:57 To: Brooks, Tony Cc: galaxy-dev@lists.galaxyproject.org; Jelle Scholtalbers Subject: Re: [galaxy-dev] Best practices for utilising Unique Molecular Indexing (UMI's)

Hi Tony,

please see inline answers

...
On 27. Mar 2018, at 16:45, Brooks, Tony <a.brooks@ucl.ac.uk> wrote:

Hi Charles Thanks for replying. Yes, I'd definitely be interested in the command line version.

we’ll try to push the version we have in conda tomorrow.

...
I'm assuming I can use je retag to add the UMI from a separate read to my sample fastq (already demultiplexed by bcl2fastq) then use the de-markdupes already in Galaxy to dedupe? Would this work?

I think you got me wrong (or I got you wrong!): je retag lets you transfer UMI/BARCODE information embedded in the read name of a *bam* file to proper bam BC/RX/QX/OX/BZ/MI tags. Indeed many tools (will) expect this info to be available in BAM tags and not in read names.
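As an illustration of what "transferring" a UMI from the read name to a BAM tag means, here is a minimal sketch that works on SAM text lines. It is not the je retag implementation. The function name `add_rx_tag`, the `:` separator and the ACGTN check are assumptions. The sketch copies the UMI into an RX tag and leaves the read name untouched; whether je retag also trims the name is not stated in this thread.

```python
def add_rx_tag(sam_line: str, sep: str = ":") -> str:
    """Copy a UMI found at the end of the read name into an RX:Z: tag.

    Assumes the UMI is the last `sep`-separated field of the read name.
    Header lines and names whose last field is not a nucleotide string
    are returned unchanged.
    """
    line = sam_line.rstrip("\n")
    if line.startswith("@"):          # SAM header line, nothing to tag
        return line
    fields = line.split("\t")
    umi = fields[0].rsplit(sep, 1)[-1]
    if umi and set(umi) <= set("ACGTN"):
        fields.append("RX:Z:" + umi)
    return "\t".join(fields)


tagged = add_rx_tag("NS500195:1:11101:2331:1043:GCCAAT\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF")
print(tagged.split("\t")[-1])

# A name without a UMI in the last field is left alone.
untagged = add_rx_tag("NS500195:1:11101:2331:1043\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tFFFF")
print(untagged.split("\t")[-1])
```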

The current version of je markdupes expects the UMI info in the read name. We haven't updated je markdupes to get this info from BAM tags yet (even in je-suite2, which is not fully ready, as mentioned before).

...
Btw, I did find a small bug in markdupes. On the NextSeq, the fastq header contains a space, e.g.

@NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT

When I run markdupes, I get an error as it thinks the UMI is "1043", not "GCCAAT".

Read names in BAM files can’t have spaces, so the read name in your BAM is likely to be:

@NS500195:396:HLMM5BGX5:1:11101:2331:1043

and by default the UMI is expected in the last position of the read name; hence '1043'.
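The effect described above can be reproduced with a few lines of Python. This is only a sketch of the parsing logic, assuming the aligner keeps the header text up to the first whitespace as the read name and that the UMI is looked up as the last colon-separated field. The helper names are hypothetical.

```python
def bam_read_name(fastq_header: str) -> str:
    """Return the header text up to the first whitespace, without the leading '@'."""
    return fastq_header.lstrip("@").split()[0]


def last_field(text: str, sep: str = ":") -> str:
    """Return the last `sep`-separated field of `text`."""
    return text.rsplit(sep, 1)[-1]


header = "@NS500195:396:HLMM5BGX5:1:11101:2331:1043 1:N:0:GCCAAT"
name = bam_read_name(header)

print(name)                # read name as it would appear in the BAM
print(last_field(name))    # what a "last field" lookup finds there
print(last_field(header))  # the index sequence only exists in the full FASTQ header
```

The index sequence sits after the space, so it is dropped together with the rest of the comment, and the last field of the remaining name is the y-coordinate `1043`.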

Please double check how the read names look in your bam file.

Best

Charles

...
...
Thanks

From: Charles Girardot <charles.girardot@embl.de> Sent: 26 March 2018 12:05:22 To: Brooks, Tony Cc: galaxy-dev@lists.galaxyproject.org Subject: Re: [galaxy-dev] Best practices for utilising Unique Molecular Indexing (UMI's)

Hi Tony,

Sounds like what you need is Je-Suite version 2, which we will release soon-ish.

In this version, the new “je debarcode” module allows you to define an unlimited number of FASTQ inputs together with their layout (describing BARCODE, UMI and SAMPLE positions). One can define an unlimited number of output FASTQ layouts by combining the BARCODE, UMI and SAMPLE slots defined in the input layouts (one slot can be written in multiple layouts if need be). For example, you can output the UMIs in their own separate file instead of, or in addition to, keeping them in the read header.
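For readers who want to see what "UMI from a separate FASTQ into the read header" looks like in practice, here is a minimal sketch. It is not je debarcode and does not use its layout syntax. It assumes plain four-line FASTQ records, a read file and a UMI file in the same order, and a `:` separator between read name and UMI. All function names are hypothetical.

```python
import io
from itertools import zip_longest


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()                      # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def tag_reads_with_umi(reads, umis, sep=":"):
    """Append each UMI sequence to the name of the matching read."""
    for read, umi in zip_longest(read_fastq(reads), read_fastq(umis)):
        if read is None or umi is None:
            raise ValueError("read and UMI files have different record counts")
        header, seq, qual = read
        name, _, comment = header.partition(" ")
        if name != umi[0].partition(" ")[0]:
            raise ValueError("read and UMI files are out of sync at " + name)
        new_header = name + sep + umi[1]       # UMI goes before the space
        if comment:
            new_header += " " + comment
        yield f"{new_header}\n{seq}\n+\n{qual}\n"


reads = io.StringIO("@r1 1:N:0:GCCAAT\nACGTACGT\n+\nFFFFFFFF\n")
umis = io.StringIO("@r1 2:N:0:GCCAAT\nTTAGGC\n+\nFFFFFF\n")
out = "".join(tag_reads_with_umi(reads, umis))
print(out, end="")
```

The UMI is attached to the name itself rather than to the comment after the space, so it should survive tools that keep only the text before the first whitespace.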

This version also lets you keep the demultiplexed reads in a single output file (for single end, I mean), map this file to the genome, then use the new “je retag” to transfer the BARCODE/UMI info embedded in the read header to proper BAM tags. This should make it easier to deal with single-cell datasets, i.e. to manipulate a single file instead of hundreds or thousands.

The command line version is close to completion (“je retag” needs a bit more testing) and I believe we could push this in conda rather quickly. The bad news is we haven’t started to write/update the Galaxy wrappers yet.

Let me know if you are interested in trying the command line version.

Best

Charles


===================================== Charles Girardot Head of Genome Biology Computational Support (GBCS) and Senior Bioinformatician in the Furlong Lab European Molecular Biology Laboratory Tel: +49 6221 387 -8585 Fax: +49-(0)6221-387-8166 Email: charles.girardot@embl.de Skype: charles_girardot Web : http://gbcs.embl.de Room V205 Meyerhofstraße 1, 69117 Heidelberg, Germany =====================================
