I have successfully uploaded a large FASTA file (2.5 million genomic sequence contigs) to the Galaxy server. I wish to extract a subset of sequences from this file. I have a list of the FASTA headers. Is there a way I can accomplish this on Galaxy?
On Fri, Nov 30, 2012 at 3:55 PM, Perumal Vijayan peruvijayan@gmail.com wrote:
Yes, if you are running your own Galaxy instance, you could use one of these two tools available from the Galaxy Tool Shed: http://toolshed.g2.bx.psu.edu/
'seq_filter_by_id' - returns a filtered version of the sequence file, keeping only the entries on your list. It can also output two files (the entries on the list and those not on the list), or just the sequences not on the list.
'seq_select_by_id' - like the above but indexes the sequence file in order to extract the requested entries in the order given (rather than the order in the sequence file).
Both of these tools work on FASTA, FASTQ and SFF files.
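If you just want to see the core idea, here is a rough local Python sketch (not the Galaxy tool itself) of what 'seq_filter_by_id' does for a FASTA input: keep each record whose identifier (the text after ">" up to the first whitespace) is on the list, and write the rest to a second file. The file names "id_list.txt", "contigs.fasta", "matching.fasta" and "non_matching.fasta" are placeholders.

# Load the wanted identifiers, one per line (placeholder file name).
wanted = set()
with open("id_list.txt") as handle:
    for line in handle:
        line = line.strip()
        if line:
            wanted.add(line.split()[0])

# Stream through the FASTA file and route each record to one of two outputs.
with open("contigs.fasta") as fasta, \
     open("matching.fasta", "w") as keep, \
     open("non_matching.fasta", "w") as drop:
    out = None
    for line in fasta:
        if line.startswith(">"):
            identifier = line[1:].split()[0]
            out = keep if identifier in wanted else drop
        if out is not None:
            out.write(line)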
Peter
Hi Perumal,
There isn't a simple fasta extraction tool on the public Main Galaxy server, but the extraction is possible and can be grouped into a workflow for re-use once completed. This is simpler than it first looks, really just 4 steps (a rough local Python equivalent is sketched after step 4):
1. Convert the fasta file to tabular:
'FASTA manipulation' -> FASTA-to-Tabular
Settings: For the option "How many columns to divide title string into?:" use "2" if there is "identifier" and "description" text. See the next step for more details.
2. Load your list of identifiers as tabular
This means "tabular" text format. Adjust the datatype to be "tabular" as needed, and fix any other formatting so that the "identifiers" are exactly the same in both files. I am not sure if this is what you meant by "fasta headers". To be clear, in the fasta file (#1) any characters after the leading ">" but before the first whitespace (tab, space, etc.) are considered the "identifier" and everything else on the line is considered the "description". This file (#2) should only contain the "identifier", not the "description". Here is a link to the FASTA format description in case you run into problems (the IDs not matching exactly will almost certainly be the root cause of any issues): http://wiki.galaxyproject.org/Learn/Datatypes#Fasta
3. Compare the two files together, subsetting out the entries in #1 that are present in #2.
'Join, Subtract and Group' -> Compare two Datasets
Settings: Compare file #1, column 1 (c1), against file #2, column 1 (c1), 'To find' = Matching rows of 1st dataset.
4. Transform the results back to FASTA format.
'FASTA manipulation' -> Tabular-to-FASTA
Settings: Be sure to account for any description fields, if they are included in your data. At this point you can either put them into the final FASTA output or omit that data altogether and just pull out identifiers/sequences.
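For reference, here is the promised minimal Python sketch of the same four steps done locally. The file names "contigs.fasta", "id_list.tabular" and "subset.fasta" are placeholders, and the sketch holds all records in memory for simplicity; the Galaxy tools above do the same logic through the UI.

# Step 1: FASTA -> tabular rows of (identifier, description, sequence).
rows = []
with open("contigs.fasta") as fasta:
    identifier = description = None
    seq = []
    for line in fasta:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if identifier is not None:
                rows.append((identifier, description, "".join(seq)))
            parts = line[1:].split(None, 1)
            identifier = parts[0]
            description = parts[1] if len(parts) > 1 else ""
            seq = []
        else:
            seq.append(line)
    if identifier is not None:
        rows.append((identifier, description, "".join(seq)))

# Step 2: load the identifier list (one ID per line, no descriptions).
with open("id_list.tabular") as handle:
    wanted = set(line.strip().split("\t")[0] for line in handle if line.strip())

# Step 3: keep only rows whose identifier is on the list ("Compare two Datasets").
subset = [row for row in rows if row[0] in wanted]

# Step 4: tabular -> FASTA, keeping the description if present.
with open("subset.fasta", "w") as out:
    for identifier, description, sequence in subset:
        header = ">" + identifier + (" " + description if description else "")
        out.write(header + "\n" + sequence + "\n")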
Hopefully this helps -
Jen, Galaxy team
Hi,
I am having a lot of difficulty uploading some large gzipped FASTQs (~10 GB) to the public server. I have tried both FTP and "pulling" by HTTP URL. The upload succeeds; however, I get an error when it tries to gunzip the file. I have tried more than 10 times now and succeeded once. The files are correct and complete, and gunzip properly locally. The error shown is usually this:
empty format: txt, database: ? Problem decompressing gzipped data
However, on two occasions (both FTP uploads) I got the traceback below. Am I missing some obvious trick? I searched the archives and saw references to problems with large gzipped files, but no solutions.
Thanks
Jim
Traceback (most recent call last):
  File "/galaxy/home/g2main/galaxy_main/tools/data_source/upload.py", line 384, in <module>
    __main__()
  File "/galaxy/home/g2main/galaxy_main/tools/data_source/upload.py", line 373, in __main__
    add_file( dataset, registry, json_file, output_path )
  File "/galaxy/home/g2main/galaxy_main/tools/data_source/upload.py", line 270, in add_file
    line_count, converted_path = sniff.convert_newlines( dataset.path, in_place=in_place )
  File "/galaxy/home/g2main/galaxy_main/lib/galaxy/datatypes/sniff.py", line 106, in convert_newlines
    shutil.move( temp_name, fname )
  File "/usr/lib/python2.7/shutil.py", line 299, in move
    copy2(src, real_dst)
  File "/usr/lib/python2.7/shutil.py", line 128, in copy2
    copyfile(src, dst)
  File "/usr/lib/python2.7/shutil.py", line 84, in copyfile
    copyfileobj(fsrc, fdst)
  File "/usr/lib/python2.7/shutil.py", line 49, in copyfileobj
    buf = fsrc.read(length)
IOError: [Errno 5] Input/output error
Hi Jim,
Your message was misthreaded (perhaps a reply to another thread, with just the subject line changed?), but I was able to dig it out.
At this time, there are no known issues with FTP upload to the public Main server. Any issues you found previously were either related to a problem with the original file content (a compression problem) or to a transitory issue with the FTP server that has since been resolved (there have been a handful in the last few years).
The instructions to follow are here: http://wiki.galaxyproject.org/FTPUpload
I am not exactly sure what your issue is, but any chance that you have more than one file per archive? That will certainly cause an issue, but usually with just the first file loading and the remainder not.
Please send more details if this continues. Does the failure occur at the FTP stage or at the point where you move from the FTP holding area into a history?
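If it helps while troubleshooting, one quick way to rule out local corruption before re-trying the upload is to stream-decompress the whole file on your own machine, which is roughly what "gzip -t" does. A minimal sketch; the file name "reads.fastq.gz" is a placeholder:

import gzip

# Read the entire gzip stream; any corruption raises an error partway through.
total = 0
with gzip.open("reads.fastq.gz", "rb") as handle:
    while True:
        chunk = handle.read(1024 * 1024)
        if not chunk:
            break
        total += len(chunk)
print("decompressed OK: %d bytes" % total)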
Thanks!
Jen, Galaxy team
Hi Jim,
Could you send me a URL to the dataset so I can grab a copy and try to reproduce this problem? Sorry for the trouble you've been having with the upload functionality and the delay in getting back to you.
--nate
Sorry Nate, I misunderstood at first: you want a URL to the dataset here on my server? I can definitely copy one up to an HTTP server; I still have Ricardo's files on a hard disk. I'll start the copy now and let you know when it's ready.
Jim
Yeah, that's it exactly. Thanks!
--nate
To whom it may concern:
The History panel on the right side of the Galaxy page is taking a very, very long time to load. Also, when it does load, I have tried to save my .bam files and the transmission gets truncated to ~7000 kb - 8000 kb of data. All of my .bam files are several GB.
Sometimes, when I retry the download, it succeeds, and other times it is again truncated. The size of the truncation may be different for the same file on a retry attempt.
Is there a problem with Galaxy?
Thanks, Mike
Hi Mike,
There are some performance problems with the Main site that we are currently investigating. Thanks for the information and we apologize for the problems.
--nate