Sequence I/O¶
- Learning Objective
You will learn how to read and write sequence files in FASTA, FASTQ, EMBL or GenBank format.
- Difficulty
Basic
- Duration
20 min
- Prerequisites
This tutorial explains how to read and write sequence files using the SeqFileIn and SeqFileOut classes. These classes provide an API for accessing sequence files in different file formats, either compressed or uncompressed.
FASTA/FASTQ Format¶
FASTA/FASTQ are record-based files. A FASTA record contains the sequence id and the sequence characters. Here is an example of FASTA file:
>seq1
CCCCCCCCCCCCCCC
>seq2
CGATCGATC
>seq3
TTTTTTT
In addition to that, a FASTQ record contains also a quality value for each sequence character. Here is an example of FASTQ file:
@seq1
CCCCCCCCCCCCCCC
+
IIIIIHIIIIIIIII
@seq2
CGATCGATC
+
IIIIIIIII
@seq3
TTTTTTT
+
IIIIHHG
SeqFile Formats¶
We can read sequence files with the SeqFileIn class and write them with the SeqFileOut class. These classes support files in FASTA, FASTQ, EMBL or GenBank format.
Note that SeqFileOut will guess the format from the file name.
File Format |
File Extension |
---|---|
FASTA |
|
FASTQ |
|
EMBL |
|
GenBank |
|
A First Working Example¶
Let us start out with a minimal working example.
The following program reads a FASTA file called example.fa
and prints out the identifier and the sequence of the first record.
#include <seqan/seq_io.h>
using namespace seqan2;
int main()
{
CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa");
CharString id;
Dna5String seq;
SeqFileIn seqFileIn(toCString(seqFileName));
readRecord(id, seq, seqFileIn);
std::cout << id << '\t' << seq << '\n';
return 0;
}
We call the SeqFileIn constructor with the path to the file to read. Successively, we call the function readRecord to read the first record from the file. Note that, differently from all others FormattedFileIn classes, readRecord accepts separate identifier and sequence Strings rather than one single record object.
Assignment 1¶
- Type
Reproduction
- Objective
Copy the above example of a FASTA file in a new file
example.fa
in a directory of your choice.Copy the program above into a new application
basic_seq_io_example
, adjust the path"example.fa"
to the just created FASTA file, compile the program, and run it.You should see the following output:
seq1 CCCCCCCCCCCCCCC
- Solution
#include <seqan/seq_io.h> using namespace seqan2; int main() { CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa"); CharString id; Dna5String seq; SeqFileIn seqFileIn(toCString(seqFileName)); readRecord(id, seq, seqFileIn); std::cout << id << '\t' << seq << '\n'; return 0; }
Handling Errors¶
As explained in the File I/O Overview tutorial, SeqFileIn and SeqFileOut throw exceptions to signal eventual errors. Invalid characters inside an input file will be signaled by readRecord via parsing exceptions.
Assignment 2¶
- Type
Application
- Objective
Improve the above program to handle errors.
- Hint
You can use the generic class Exception to catch both low-level and high-level I/O errors.
- Solution
#include <seqan/seq_io.h> using namespace seqan2; int main() { CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa"); CharString id; Dna5String seq; SeqFileIn seqFileIn; if (!open(seqFileIn, toCString(seqFileName))) { std::cerr << "ERROR: Could not open the file.\n"; return 1; } try { readRecord(id, seq, seqFileIn); } catch (Exception const & e) { std::cout << "ERROR: " << e.what() << std::endl; return 1; } std::cout << id << '\t' << seq << '\n'; return 0; }
Accessing Records in Batches¶
There are three use cases for reading or writing record-based files:
read or write the file record by record;
read or write a batch of records, e.g. 100k records at a time;
read or write all records from or to the file.
The class SeqFileIn provides the functions readRecord and readRecords, while the class SeqFileOut provides the functions writeRecord and writeRecords.
Tip
Reading records in batches is more efficient than reading single records.
Note that the function readRecords uses StringSet instead of String. By default, readRecords reads all remaining records. Optionally, one can specify a batch of records to be read.
CharString seqFileName = getAbsolutePath("/demos/tutorial/sequence_io/example.fa");
StringSet<CharString> ids;
StringSet<Dna5String> seqs;
SeqFileIn seqFileIn(toCString(seqFileName));
// Reads up to 10 records.
readRecords(ids, seqs, seqFileIn, 10);
// Reads all remaining records.
readRecords(ids, seqs, seqFileIn);
Assignment 3¶
- Type
Application
- Objective
Change your program from above to load all sequences and print them in the same fashion.
You should be able to run your program on the example file we created above and see the following output:
seq1 CCCCCCCCCCCCCCC seq2 CGATCGATC seq3 TTTTTTT
- Hint
You can use the function readRecords to load all records at once.
- Solution
#include <seqan/seq_io.h> using namespace seqan2; int main() { CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa"); SeqFileIn seqFileIn; if (!open(seqFileIn, toCString(seqFileName))) { std::cerr << "ERROR: Could not open the file.\n"; return 1; } StringSet<CharString> ids; StringSet<Dna5String> seqs; try { readRecords(ids, seqs, seqFileIn); } catch (Exception const & e) { std::cout << "ERROR: " << e.what() << std::endl; return 1; } for (unsigned i = 0; i < length(ids); ++i) std::cout << ids[i] << '\t' << seqs[i] << '\n'; return 0; }
Accessing Qualities¶
Functions readRecord, readRecords, writeRecord and writeRecords are available in two variants:
the first variant accepts only the sequence identifier and sequence characters, besides the SeqFileIn object;
the second variant accepts an additional CharString for a PHRED base quality string.
If the first variant is used on an output file containing qualities, e.g. a FASTQ file, then writeRecord writes qualities as 'I'
, i.e. PHRED score 40.
If the second variant is used on an input file containing no qualities, e.g. a FASTA file, then readRecord returns empty quality strings.
Here is an example for the second variant of readRecord:
CharString seqFqFileName = getAbsolutePath("/demos/tutorial/sequence_io/example.fq");
CharString id;
Dna5String seq;
CharString qual;
SeqFileIn seqFqFileIn(toCString(seqFqFileName));
readRecord(id, seq, qual, seqFqFileIn);
Tip
When DnaQ or Dna5Q Strings are used, then you should use the second variant. The qualities are simply stored directly in the sequence characters.
Assignment 4¶
- Type
Application
- Objective
Copy the above example of FASTQ file in a new file
example.fq
in a directory of your choice.Change your result of Assignment 3 to use the variant of readRecord that also reads in the qualities and writes them next to the sequences.
When your program is called on this file, the result should look as follows.
seq1 CCCCCCCCCCCCCCC +qual: IIIIIHIIIIIIIII seq2 CGATCGATC +qual: IIIIIIIII seq3 TTTTTTT +qual: IIIIHHG
- Solution
#include <seqan/seq_io.h> using namespace seqan2; int main() { CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fq"); SeqFileIn seqFileIn; if (!open(seqFileIn, toCString(seqFileName))) { std::cerr << "ERROR: Could not open the file.\n"; return 1; } StringSet<CharString> ids; StringSet<Dna5String> seqs; StringSet<CharString> quals; try { readRecords(ids, seqs, quals, seqFileIn); } catch (Exception const & e) { std::cout << "ERROR: " << e.what() << std::endl; return 1; } for (unsigned i = 0; i < length(ids); ++i) std::cout << ids[i] << '\t' << seqs[i] << "\n+qual:\t" << quals[i] << '\n'; return 0; }
Next Steps¶
Read the Wikipedia articles about the FASTA file format and the FASTQ file format and quality values to refresh your knowledge.
Read the Indexed FASTA I/O tutorial to learn how to read FASTA files efficiently in a random-access fashion.
Continue with the Tutorials.