Sequence I/O

Learning Objective

You will learn how to read and write sequence files in FASTA, FASTQ, EMBL or GenBank format.

Difficulty

Basic

Duration

20 min

Prerequisites

Sequences, File I/O Overview

This tutorial explains how to read and write sequence files using the SeqFileIn and SeqFileOut classes. These classes provide an API for accessing sequence files in different file formats, either compressed or uncompressed.

FASTA/FASTQ Format

FASTA/FASTQ are record-based files. A FASTA record contains the sequence id and the sequence characters. Here is an example of FASTA file:

>seq1
CCCCCCCCCCCCCCC
>seq2
CGATCGATC
>seq3
TTTTTTT

In addition to that, a FASTQ record contains also a quality value for each sequence character. Here is an example of FASTQ file:

@seq1
CCCCCCCCCCCCCCC
+
IIIIIHIIIIIIIII
@seq2
CGATCGATC
+
IIIIIIIII
@seq3
TTTTTTT
+
IIIIHHG

SeqFile Formats

We can read sequence files with the SeqFileIn class and write them with the SeqFileOut class. These classes support files in FASTA, FASTQ, EMBL or GenBank format.

Note that SeqFileOut will guess the format from the file name.

File Format

File Extension

FASTA

.fa, .fasta

FASTQ

.fq, .fastq

EMBL

.embl

GenBank

.gbk

A First Working Example

Let us start out with a minimal working example. The following program reads a FASTA file called example.fa and prints out the identifier and the sequence of the first record.

#include <seqan/seq_io.h>

using namespace seqan2;

int main()
{
    CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa");
    CharString id;
    Dna5String seq;

    SeqFileIn seqFileIn(toCString(seqFileName));
    readRecord(id, seq, seqFileIn);
    std::cout << id << '\t' << seq << '\n';

    return 0;
}

We call the SeqFileIn constructor with the path to the file to read. Successively, we call the function readRecord to read the first record from the file. Note that, differently from all others FormattedFileIn classes, readRecord accepts separate identifier and sequence Strings rather than one single record object.

Assignment 1

Type

Reproduction

Objective

Copy the above example of a FASTA file in a new file example.fa in a directory of your choice.

Copy the program above into a new application basic_seq_io_example, adjust the path "example.fa" to the just created FASTA file, compile the program, and run it.

You should see the following output:

seq1	CCCCCCCCCCCCCCC
Solution
#include <seqan/seq_io.h>

using namespace seqan2;

int main()
{
    CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa");
    CharString id;
    Dna5String seq;

    SeqFileIn seqFileIn(toCString(seqFileName));
    readRecord(id, seq, seqFileIn);
    std::cout << id << '\t' << seq << '\n';

    return 0;
}

Handling Errors

As explained in the File I/O Overview tutorial, SeqFileIn and SeqFileOut throw exceptions to signal eventual errors. Invalid characters inside an input file will be signaled by readRecord via parsing exceptions.

Assignment 2

Type

Application

Objective

Improve the above program to handle errors.

Hint

You can use the generic class Exception to catch both low-level and high-level I/O errors.

Solution
#include <seqan/seq_io.h>

using namespace seqan2;

int main()
{
    CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa");
    CharString id;
    Dna5String seq;

    SeqFileIn seqFileIn;
    if (!open(seqFileIn, toCString(seqFileName)))
    {
        std::cerr << "ERROR: Could not open the file.\n";
        return 1;
    }

    try
    {
        readRecord(id, seq, seqFileIn);
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    std::cout << id << '\t' << seq << '\n';

    return 0;
}

Accessing Records in Batches

There are three use cases for reading or writing record-based files:

  1. read or write the file record by record;

  2. read or write a batch of records, e.g. 100k records at a time;

  3. read or write all records from or to the file.

The class SeqFileIn provides the functions readRecord and readRecords, while the class SeqFileOut provides the functions writeRecord and writeRecords.

Tip

Reading records in batches is more efficient than reading single records.

Note that the function readRecords uses StringSet instead of String. By default, readRecords reads all remaining records. Optionally, one can specify a batch of records to be read.

    CharString seqFileName = getAbsolutePath("/demos/tutorial/sequence_io/example.fa");

    StringSet<CharString> ids;
    StringSet<Dna5String> seqs;

    SeqFileIn seqFileIn(toCString(seqFileName));

    // Reads up to 10 records.
    readRecords(ids, seqs, seqFileIn, 10);

    // Reads all remaining records.
    readRecords(ids, seqs, seqFileIn);

Assignment 3

Type

Application

Objective

Change your program from above to load all sequences and print them in the same fashion.

You should be able to run your program on the example file we created above and see the following output:

seq1	CCCCCCCCCCCCCCC
seq2	CGATCGATC
seq3	TTTTTTT
Hint

You can use the function readRecords to load all records at once.

Solution
#include <seqan/seq_io.h>

using namespace seqan2;

int main()
{
    CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fa");

    SeqFileIn seqFileIn;
    if (!open(seqFileIn, toCString(seqFileName)))
    {
        std::cerr << "ERROR: Could not open the file.\n";
        return 1;
    }

    StringSet<CharString> ids;
    StringSet<Dna5String> seqs;

    try
    {
        readRecords(ids, seqs, seqFileIn);
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << '\n';

    return 0;
}

Accessing Qualities

Functions readRecord, readRecords, writeRecord and writeRecords are available in two variants:

  1. the first variant accepts only the sequence identifier and sequence characters, besides the SeqFileIn object;

  2. the second variant accepts an additional CharString for a PHRED base quality string.

If the first variant is used on an output file containing qualities, e.g. a FASTQ file, then writeRecord writes qualities as 'I', i.e. PHRED score 40. If the second variant is used on an input file containing no qualities, e.g. a FASTA file, then readRecord returns empty quality strings.

Here is an example for the second variant of readRecord:

    CharString seqFqFileName = getAbsolutePath("/demos/tutorial/sequence_io/example.fq");
    CharString id;
    Dna5String seq;
    CharString qual;

    SeqFileIn seqFqFileIn(toCString(seqFqFileName));

    readRecord(id, seq, qual, seqFqFileIn);

Tip

When DnaQ or Dna5Q Strings are used, then you should use the second variant. The qualities are simply stored directly in the sequence characters.

Assignment 4

Type

Application

Objective

Copy the above example of FASTQ file in a new file example.fq in a directory of your choice.

Change your result of Assignment 3 to use the variant of readRecord that also reads in the qualities and writes them next to the sequences.

When your program is called on this file, the result should look as follows.

seq1	CCCCCCCCCCCCCCC
+qual:	IIIIIHIIIIIIIII
seq2	CGATCGATC
+qual:	IIIIIIIII
seq3	TTTTTTT
+qual:	IIIIHHG
Solution
#include <seqan/seq_io.h>

using namespace seqan2;

int main()
{
    CharString seqFileName = getAbsolutePath("demos/tutorial/sequence_io/example.fq");

    SeqFileIn seqFileIn;
    if (!open(seqFileIn, toCString(seqFileName)))
    {
        std::cerr << "ERROR: Could not open the file.\n";
        return 1;
    }

    StringSet<CharString> ids;
    StringSet<Dna5String> seqs;
    StringSet<CharString> quals;

    try
    {
        readRecords(ids, seqs, quals, seqFileIn);
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << "\n+qual:\t" << quals[i] << '\n';

    return 0;
}

Next Steps