primer contamination, miranalyzer
Hello Rosie,

Please see below.

On 11/12/12 4:00 AM, Rosie Griffiths wrote:

Hi Galaxy,

I've got 2 problems for you;

1) I've got microRNA Illumina NGS data that I want to analyse. I put it through FastQC on Galaxy and it showed that 71% of the reads are in one overrepresented sequence;

Sequence                                             Count     Percentage          Possible Source
GAATTCCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAG  16896622  71.06413061961005   RNA PCR Primer, Index 1 (100% over 29bp)
CCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACCTTGTAATCTC  525614    2.2106372475809497  RNA PCR Primer, Index 12 (100% over 44bp)
CCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAGTCACC  416041    1.7497930632000402  RNA PCR Primer, Index 2 (100% over 34bp)

What would be the best way to remove this contamination? Also, is it still OK to use that data despite such high contamination?

You can try.

I've currently been trying to remove the sequence by using the clip adaptor tool, using the following options;

Library to clip: 2: FASTQ Groomer on data H1
Minimum sequence length (after clipping, sequences shorter than this length will be discarded): 15
Enter custom clipping sequence: GAATTCCACCACGTTCCCGTGGTGGAATTCTCGGGTGCCAAGGAACTCCAG
Enter non-zero value to keep the adapter sequence and x bases that follow it: 0
Discard sequences with unknown (N) bases: No
Output options: Output only non-clipped sequences (i.e. sequences which did not contained the adapter)

Did you really intend to discard the sequences that were clipped? Or perhaps the option "Output both clipped and non-clipped sequences" is what you intended? This would invoke the additional filters set, such as minimum length after clipping (15). Currently, with the option used, any sequence that is clipped at all is discarded as a first step.

About 75% of the reads will have some clipping.
At most 25% will be in the output, not counting other factors (sequences already under 15 bp in length, etc.)

This is a very hard hit and explains the current 13% output. See next ->
Clipped reads - discarded.
here ^^ see that any clipped sequences are discarded immediately. A re-run with the other option is recommended. It could be a negligible difference - but seems worth a check if the goal is to recover what is usable.
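To make the difference between the two output options concrete, here is a minimal Python sketch. It is not the Galaxy clip tool: the function name `clip_reads`, the `keep_clipped` flag, the toy adapter string, and the exact-match search are all assumptions made for illustration (a real clipper also tolerates mismatches and partial adapter matches).

```python
def clip_reads(reads, adapter, min_len=15, keep_clipped=True):
    """Toy adapter clipper (hypothetical helper, exact adapter matches only).

    keep_clipped=False mimics "Output only non-clipped sequences":
    any read containing the adapter is dropped outright.
    keep_clipped=True mimics "Output both clipped and non-clipped sequences":
    the adapter and everything after it is removed, then the
    minimum-length filter decides whether the remainder is kept.
    """
    kept = []
    for read in reads:
        pos = read.find(adapter)
        if pos == -1:
            # No adapter found: the read is untouched, only the length filter applies.
            if len(read) >= min_len:
                kept.append(read)
            continue
        if not keep_clipped:
            # Clipped reads are discarded as a first step.
            continue
        insert = read[:pos]
        if len(insert) >= min_len:
            kept.append(insert)
    return kept


if __name__ == "__main__":
    adapter = "TGGAATTCTCGGGTGCCAAGG"  # toy adapter string, an assumption for this demo
    reads = [
        "TATTGCACTTGTCCCGGCCTGT" + adapter,  # 22 nt insert followed by adapter
        "ACGT" + adapter,                    # insert too short after clipping
        adapter,                             # adapter-only read
        "A" * 30,                            # read without any adapter
    ]
    print(clip_reads(reads, adapter, keep_clipped=False))
    print(clip_reads(reads, adapter, keep_clipped=True))
```

With `keep_clipped=False` only the adapter-free read survives; with `keep_clipped=True` the 22 nt insert is recovered as well, which is the kind of read a miRNA analysis is after.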
Input: 23776583 reads.
Output: 3091831 reads.
discarded 1287140 too-short reads.
discarded 18984774 adapter-only reads.
discarded 412838 clipp
but then I'm only left with 13% of the reads.
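As a quick sanity check on the clipper report, the numbers can be tallied as below. This is only arithmetic on the figures quoted in the log; the dictionary keys are labels chosen here, and the last category is named after the truncated "clipp" line.

```python
# Figures copied from the clip tool report quoted above.
total_in = 23776583
output = 3091831
discarded = {
    "too-short": 1287140,
    "adapter-only": 18984774,
    "clipped (label truncated in the log)": 412838,
}

# Every input read should be either written out or discarded.
accounted = output + sum(discarded.values())
kept_fraction = output / total_in
adapter_only_fraction = discarded["adapter-only"] / total_in

print(f"accounted for: {accounted} of {total_in}")
print(f"kept: {kept_fraction:.1%}")
print(f"adapter-only: {adapter_only_fraction:.1%}")
```

The categories add up to the input total, the kept fraction matches the 13% mentioned, and most of the loss comes from reads counted as adapter-only.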
2) After I've filtered and clipped the adapter I want to analyse the frequency of each miR. I've been using miranalyzer to do this, I use the following workflow
data=>groomer=>clip adapter=>filter FastQ (min quality 20)=>fastq to fasta=>collapse
See below
the collapse file is like this;

1-17285268
GAATTCCACCACGTTCCCGTGG
2-522760
CCACCACGTTCCCGTGG
3-101198
TATTGCACTTGTCCCGGCCTGT
4-88745

Then I upload the collapse file to miRanalyzer; however, the total reads in the miRanalyzer output is the same as the total number of sequences in the collapse file, so it doesn't seem to recognise the count number.
miranalyzer says the following;
2.1 Input formats
miRanalyzer requires a single file containing the unique reads and their counts. The application accepts two different input formats:
2.1.1 A tab or space separated file as in the following example (read-count format):
GAGGTAGTAGGTTGTA 49862
ACCCGTAGAACCGACC 15490
... ...
GGAGCATCTCTCGGTC 13762

2.1.2 A multifasta file:

>ID1 49862
GAGGTAGTAGGTTGTA
>ID2 15490
ACCCGTAGAACCGACC
....
>ID 13762
GGAGCATCTCTCGGTC

The description field must hold the read count. If not set, it is supposed to be 1. The file must have extension ’fa’, ’fasta’ or ’mfa’.
Do you know how I could change my format so it can recognise the read count e.g. maybe change the '-' to a space?
You have this correct: Convert the fasta -> tabular, convert the dash to tab, then convert tab -> fasta (setting the new column as the description field).
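Outside Galaxy, the same conversion could be sketched in a few lines of Python, going straight to the tab-separated read-count format (2.1.1). This is a sketch under assumptions, not a tested miRanalyzer pipeline: it assumes each record is a header of the form `rank-count` (with or without a leading `>`) followed by a single sequence line, and the function name `collapsed_to_read_count` is made up for this example.

```python
import re

# Header such as "1-17285268" or ">1-17285268": rank, dash, read count.
HEADER = re.compile(r"^>?(\d+)-(\d+)$")


def collapsed_to_read_count(lines):
    """Yield 'SEQUENCE<TAB>COUNT' strings from collapsed FASTA-style lines.

    Assumes one sequence line per record; a header with no sequence
    after it (e.g. a truncated file) produces no output.
    """
    count = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            count = int(match.group(2))
        elif count is not None:
            yield f"{line}\t{count}"
            count = None


if __name__ == "__main__":
    example = [
        "1-17285268",
        "GAATTCCACCACGTTCCCGTGG",
        "2-522760",
        "CCACCACGTTCCCGTGG",
    ]
    for row in collapsed_to_read_count(example):
        print(row)
```

To run it on a real file, pass an open file handle instead of the example list and write the yielded rows to a new text file before uploading.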
3) I've recently got the local install of Galaxy but encounter the following error when I try to add a file to my data library
Are you set up as an admin? This is the default if you are running Galaxy straight as-is without any changes. You may also be running as for a 'production environment'. The links below have setup info for both.

If you are configured and having problems, this would be a good question to send to the galaxy-dev@bx.psu.edu mailing list as a brand new thread, and as a distinct question, to reach the developers. (No need to continue this thread or cc galaxy-user). Include as much information about your local environment as possible (but nothing personal, like a password). I can't tell from this info what is going on, but it is very likely these gurus can!

http://getgalaxy.org
http://wiki.galaxyproject.org/Admin/Config/Performance/Production%20Server
http://wiki.galaxyproject.org/Admin/Data%20Libraries
http://wiki.galaxyproject.org/Admin/Data%20Libraries/Libraries

Best wishes for your project!

Jen
Galaxy team

http://wiki.galaxyproject.org/Support
Error attempting to display contents of library (New data library): (OperationalError) no such column: True u'SELECT dataset_permissions.id AS dataset_permissions_id, dataset_permissions.create_time AS dataset_permissions_create_time, dataset_permissions.update_time AS dataset_permissions_update_time, dataset_permissions.action AS dataset_permissions_action, dataset_permissions.dataset_id AS dataset_permissions_dataset_id, dataset_permissions.role_id AS dataset_permissions_role_id \nFROM dataset_permissions \nWHERE True AND dataset_permissions.action = ?' ['access'].
I've got the latest version of Galaxy and am using Chrome on OS X Mountain Lion
changeset:   7986:12fcd068b12e
tag:         tip
user:        Daniel Blankenberg <dan@bx.psu.edu>
date:        Thu Oct 18 11:22:12 2012 -0400
summary:     Do not hide failed datasets with HideDatasetAction post job action.
Any help will be greatly appreciated
Thank you Rosie Griffiths
-- Jennifer Jackson http://galaxyproject.org
participants (2): Jennifer Jackson, Rosie Griffiths