Hi Veranja,
I am going to try to address all of your questions in one go since they
are all in the same thread. Next time, though, it would be best to send
a new question as a brand new message, not as a reply with just the
subject line changed. This helps us greatly with tracking, and it helps
other users when they search prior posts.
In the first email you seemed to have some trouble with the format
of your custom reference genome, but later in the second email this
seems to be resolved, at least as far as format is concerned
(SAM->BAM conversion is possible using this genome, in Galaxy?).
I am going to point you to our help for custom reference genomes;
if you click through to the main page, there is a table with
detailed format troubleshooting help. I will tell you up front, though,
that I do not believe this is going to be helpful for your
overall goals, if I am understanding them correctly.
Here is the link:
http://wiki.galaxyproject.org/Support#Custom_reference_genome
Your reference genome sounds as if it is not really a reference
genome but instead more of a collection of short read sequences? If
the number of sequences is very large, and the sequences are very
short, you will likely run into memory or related indexing problems
with many tools. There really isn't an easy way around this. You could
try taking the analysis to a cloud version of Galaxy and scaling up the
memory to see if that helps. You also might try breaking the job up
into smaller jobs - you mentioned that the data is from multiple
genomes - perhaps split by genome. But you will have to test this -
I don't know the actual profile of your data. I can tell you
that using purely short read data as a reference, in particular data
that has redundancy, will be problematic, likely no matter what is
attempted. Some assembly or other strategy is likely required to move
forward.
Galaxy CloudMan:
http://usegalaxy.org/cloud
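If you do try splitting the reference by genome outside of Galaxy, a
minimal Python sketch of the idea is below. It assumes (this is my
assumption, I do not know how your file is actually labeled) that each
FASTA identifier starts with a genome label followed by "|", for example
">genomeA|amplicon1". The function name and the group_of parameter are
made up for illustration, so adjust them to match your identifiers.

```python
from collections import defaultdict


def split_fasta_by_group(lines, group_of=lambda ident: ident.split("|")[0]):
    """Group FASTA lines by a label derived from each record's identifier.

    ASSUMPTION: identifiers look like ">genomeA|amplicon1"; group_of is a
    hypothetical hook that maps an identifier to its genome label.
    Returns a dict: label -> list of FASTA lines (headers and sequence).
    """
    groups = defaultdict(list)
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header with no identifier")
            # The identifier is the text up to the first whitespace.
            current = group_of(fields[0])
        if current is None:
            raise ValueError("sequence data found before the first '>' header")
        groups[current].append(line)
    return dict(groups)
```

Each group could then be written to its own FASTA file and tested as a
separate, smaller custom reference.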
For the last question, different tools are probably expected to vary
a bit in their results since they use different methods. If you want
to compare datasets, using identifiers would be a good way. Convert
the files to tabular, cut out the identifiers, compare these to find
differences, then adjust the tabular files as needed, and convert
back to fastq/fasta. Tools for these sorts of functions are in the
tool groups "Text Manipulation", "FASTA manipulation",
"Filter and Sort", "Join, Subtract and Group", and
"NGS: QC and manipulation". I know that seems like a lot of
places to look, but use the tool search at the top of the tool
panel and search by data type or tool name to make finding these
easier, for example "Cut" or "Join" or "Tabular". These tools have
the names you would probably expect them to have, and tool help is
directly on each form. Our 101 tutorial would also be a good
introduction for an overview:
https://main.g2.bx.psu.edu/u/aun1/p/galaxy101
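If you would rather do the identifier comparison outside of Galaxy, here
is a rough Python sketch of the same idea (extract identifiers, compare,
keep what matches). It assumes plain four-line FASTQ records and that
paired reads share the identifier up to the first space, or differ only
by a trailing /1 or /2. The function names are illustrative only; they
are not Galaxy tools.

```python
from itertools import islice


def fastq_records(lines):
    """Yield (identifier, record) pairs from four-line FASTQ records.

    ASSUMPTION: each record is exactly four lines; an incomplete
    trailing record is ignored.
    """
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if len(record) < 4:
            break
        # Identifier: text after '@' up to the first whitespace,
        # with a trailing /1 or /2 mate suffix removed.
        ident = record[0].rstrip("\n")[1:].split()[0]
        if ident.endswith(("/1", "/2")):
            ident = ident[:-2]
        yield ident, record


def compare_ids(lines_a, lines_b):
    """Return (only_in_a, in_both, only_in_b) as sets of identifiers."""
    ids_a = {ident for ident, _ in fastq_records(lines_a)}
    ids_b = {ident for ident, _ in fastq_records(lines_b)}
    return ids_a - ids_b, ids_a & ids_b, ids_b - ids_a


def keep_records(lines, wanted_ids):
    """Keep only the records whose identifier is in wanted_ids."""
    return [rec for ident, rec in fastq_records(lines) if ident in wanted_ids]
```

For paired reads split by barcode, you could run compare_ids on the
read 1 and read 2 files for the same barcode, then use keep_records with
the shared identifiers on both files so that only pairs assigned to the
same group remain.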
Hopefully this gives you some helpful information to work with,
Jen
Galaxy team
On 4/8/13 7:21 PM, Veranja Liyanapathirana wrote:
Dear all,
I was using the barcode splitter on MiSeq paired-end
reads; however, I am not sure if I did it correctly, as the
number of reads allocated to
each barcode does not tally with the results obtained by
our service provider using one of their in-house script-based
methods. I use it for splitting some in-house barcodes. I
need to make sure that read 1 and read 2 are split into the
same group, and drop the sequences where this criterion is
not met. I am not sure how to go about doing this. Would using
FASTQ joiner on the two reads and subsequent splitting work?
Thank you,
Kind Regards,
Veranja
Dear Galaxy team/ users,
I am sorry to spam the thread again, but I
still could not figure out what is wrong with my
workflow and need some help.
As mentioned earlier, I use MiSeq reads,
demultiplex for an in-house barcode using barcode
splitter, re-upload, and map against a reference
that consists of multiple short reference
sequences. The workflow goes well up to this
stage, and conversion from SAM to BAM after filtering
the SAM files is also fine, but I cannot use the GATK
depth of coverage tool to get the alignment data
or create pileups. An error comes up in all
instances.
I would really appreciate any input on
this.
Thanks a lot,
Veranja Liyanapathirana
Graduate Student (Microbiology)
Dear all,
My problem seems like something that
should have a very simple solution, and
due to my lack of knowledge in
bioinformatics, I am probably messing up
the workflows on my end. The experiment I ran
is one where we used MiSeq to sequence
amplicons of a multiplex PCR. We
introduced an in-house barcode to our PCR
products via an adaptor.
The MiSeq data was demultiplexed for the
Illumina barcodes by our service provider
using the MiSeq Reporter on-instrument software,
and I am trying to run the rest of the
process on the Galaxy web portal with no command
prompt programming.
The data for R1 and R2 was imported,
and then I used barcode splitter to
de-multiplex the amplicons after quality
trimming. (I did not use FASTQ groomer as
MiSeq data is supposed to be Sanger FASTQ
rather than Illumina).
Then the sequence trimmer was used to
trim the barcode+adaptor sequences. The
results of this were re-uploaded and
designated as FASTQ for alignment.
Now for the reference genome, as our
amplicons are from different sequences,
we have segmented FASTA sequences in one
file with different FASTA identifiers.
When this file was input as the reference
genome and mapping was performed using
Bowtie for Illumina, the mapping went on
with no errors.
I could filter the alignment file using
SAM filters too. But I cannot do any more
downstream visualizations, not even SAM
to BAM conversion.
I suspect that this may be due to an
error in the way that the reference genome
was formulated, but I cannot
figure it out. I would be extremely
grateful if you could help me with this
issue. I think if I string together the
sequences as one it would work, but
converting this back for interpretation
then becomes an issue.
Thank you,
Kind Regards,
Veranja
Veranja Liyanapathirana
Graduate Student (Microbiology)
--
Jennifer Hillman-Jackson
Galaxy Support and Training
http://galaxyproject.org/