Sequence File I/O

Learning Objective
This tutorial will explain how to read and write sequence files in SeqAn using the RecordReader class and the MultiSeqFile API.
Difficulty
Advanced
Duration
30 min
Prerequisites
I/O Overview

This tutorial explains how to read using the RecordReader class and the MultipleSeqFile API and how to write sequence files using Streams. Currently, SeqAn supports reading sequences (and qualities where it applies) in FASTA, FASTQ, EMBL and GenBank format.

This tutorial does not explain how to use the SequenceStream class. This class is covered in the Basic Sequence I/O tutorial.

Reading Sequence Files

Because sequences play a central role in the SeqAn, the library provides a comprehensive implementation of input routines. The sequence I/O facilities are available through the seq_io module. You get access to this module by #include <seqan/seq_io.h>. We will first describe the API for record-wise reading of sequence files and then go over to reading all records in a file at once.

We start out by creating a RecordReader object. For most cases, the single-pass specialization is the most appropriate one. Then, we can read the file record-by-record using the function readRecord.

To this function, we pass the buffers we want to read the record identifier and the sequence into (id and seq). This is followed by the reader object we create with the std::fstream previously opened. The last argument is the tag seqan::Fasta() which uses the variant of readRecord for reading FASTA files.

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::Dna5String seq;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, reader, seqan::Fasta()) != 0)
            return 1;  // Could not record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

When reading FASTQ files, we use the tag seqan::Fastq(). For storing the qualities, we can pass an optional third parameter of type CharString which stores the quality values.

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::Dna5String seq;
    seqan::CharString qual;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, qual, reader, seqan::Fastq()) != 0)
            return 1;  // Could not record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

Optionally, we can also read the sequence into a string of [dox:Dna5Q Dna5Q] characters which will store the qualities directly in the string’s characters.

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::String<seqan::Dna5Q> seq;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, reader, seqan::Fastq()) != 0)
            return 1;  // Could not record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

Important

Sequence Parsing Behaviour

  • When using Dna5 or Dna5Q as the sequence’s alphabet type, the parsing routine will allow the characters 'C', 'G', 'A', 'T', and 'N' in the sequences of the file. This can make problems if the sequenc contains different characters, for example when it contains IUPAC characters. In this case, you can simply use CharString as the seq parameter and then assign them to a Dna5String.
  • Accordingly, when using Dna or DnaQ, only the characters 'C', 'G', 'A', and 'T' are allowed.
  • When omitting the qual parameter when reading FASTQ, the quality values from the file will be ignored.

Assignment 1

Record-Wise Reading Sequences into CharString

Type
Review
Objective
Modify the example above to read the sequence into a CharString instead of a Dna5String.
Solution
#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file record-wise.
    seqan::CharString id;
    seqan::CharString seq;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, reader, seqan::Fastq()) != 0)
            return 1;  // Could not record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

When we want to read a whole sequence (e.g. FASTA or FASTQ) file into memory then we only have to slightly adjust the example from above. For example, here is how we can read a whole FASTQ file into memory using the function read into StringSets of CharStrings and Dna5Strings.

Warning

For a short time, read() will still be called read2() because of name clashes with the old I/O system.

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file in one pass.
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::Dna5String> seqs;
    seqan::StringSet<seqan::CharString> quals;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << '\t' << quals[i] << '\n';
    
    return 0;
}

Assignment 2

Document-Wise Reading Sequences into CharString

Type
Review
Objective
Modify the example above to read the sequence into a StringSet of CharStrings instead of a Dna5Strings.
Solution
#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Read file in one pass.
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::CharString> seqs;
    seqan::StringSet<seqan::CharString> quals;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << '\t' << quals[i] << '\n';
    
    return 0;
}

Choice of Record Reader

In most cases, you will want to use a Single-Pass RecordReader for reading files. Mostly, it is the fastest and best way to read files and also all file formats have a single-pass implementation.

Using a double-pass record reader almost only makes sense if read a whole file into main memory using the document reading API. The file is read twice. In the first pass, the total length of ids and sequence characters is determined. When reading sequences into StringSets, the exact number of elements can be reserved. Even more, when using Concat-Direct StringSet, no superflous memory has to be allocated at all. The string sets are then filled in the second pass.

Using double-pass I/O also only makes sense for document reading when used in conjunction with MMap Strings. When using streams, the RecordReader has to buffer the read data in memory because not all stream implementation allow for jumping. In the case of MMap Strings, no buffer is used because the record reader directly operates on the memory mapped file (and thus directly on the disk buffers of the kernel).

Assignment 3

Using a Double-Pass RecordReader with a MMap String.

Type
Application
Objective
Change solution of Assignment 2 such that a Double-Pass RecordReader is used with a MMap String.
Hint

You can open files into MMap Strings as follows (include the <seqan/file.h> header):

typedef seqan::String<char, seqan::MMap<> > TMMapString;
TMMapString mmapString;
bool success = open(mmapString, "filename.fa", seqan::OPEN_RDONLY);

You can then define a DoublePassRecordReader wrapping the just opened mmapString as follows:

typedef seqan::RecordReader<
        TMMapString,
        seqan::DoublePass<seqan::StringReader> > TReader;
TReader reader(mmapString);
Solution
#include <fstream>
#include <iostream>

#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv[1], seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::StringReader> > reader(mmapString);

    // Read file in one pass.
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::CharString> seqs;
    seqan::StringSet<seqan::CharString> quals;
    if (read2(ids, seqs, quals, reader, seqan::Fastq()) != 0)
        return 1;  // Could not read file.

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << '\t' << quals[i] << '\n';
    
    return 0;
}

Auto Format Detection

Passing the format as the tag is appropriate when the format is known beforehand. Otherwise, you can use a variable of type AutoSeqStreamFormat instead of the tag.

AutoSeqStreamFormatt objects can be first passed to the function guessStreamFormat. This function tries to parse the file as different formats on the first some thousand bytes. When this succeeds, the successfully recognized file type is stored in the object.

You can then subsequently use the AutoSeqStreamFormat instead of a tag to the functions readRecord or read.

#include <fstream>
#include <iostream>

#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open file and create RecordReader.
    std::fstream in(argv[1], std::ios::binary | std::ios::in);
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(in);

    // Create the AutoSeqStreamFormat object and guess the file format.
    seqan::AutoSeqStreamFormat formatTag;
    if (!guessStreamFormat(reader, formatTag))
        return 1;  // Could not detect file format.

    // Read file record-wise.
    seqan::CharString id;
    seqan::String<seqan::Dna5Q> seq;
    while (!atEnd(reader))
    {
        if (readRecord(id, seq, reader, formatTag) != 0)
            return 1;  // Could not record from file.

        std::cout << id << "\t" << seq << "\n";
    }
    
    return 0;
}

Assignment 4

Using AutoSeqStreamFormat

Type
Application
Objective
Adjust the solution of Assignment 3 to use a AutoSeqStreamFormat for format detection.
Solution
#include <fstream>
#include <iostream>

#include <seqan/file.h>
#include <seqan/seq_io.h>
#include <seqan/sequence.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;  // Invalid number of arguments.

    // Open memory mapped string.
    seqan::String<char, seqan::MMap<> > mmapString;
    if (!open(mmapString, argv[1], seqan::OPEN_RDONLY))
        return 1;  // Could not open file.

    // Create RecordReader.
    seqan::RecordReader<seqan::String<char, seqan::MMap<> >,
                        seqan::DoublePass<seqan::StringReader> > reader(mmapString);

    // Create the AutoSeqStreamFormat object and guess the file format.
    seqan::AutoSeqStreamFormat formatTag;
    if (!guessStreamFormat(reader, formatTag))
        return 1;  // Could not detect file format.

    // Read file in one pass.
    seqan::StringSet<seqan::CharString> ids;
    seqan::StringSet<seqan::CharString> seqs;
    seqan::StringSet<seqan::CharString> quals;
    if (read2(ids, seqs, quals, reader, formatTag) != 0)
        return 1;  // Could not read file.

    for (unsigned i = 0; i < length(ids); ++i)
        std::cout << ids[i] << '\t' << seqs[i] << '\t' << quals[i] << '\n';
    
    return 0;
}

Note

Qualities and FASTA files.

When passing a qual parameter to readRecord or read then this cannot be filled with qualities from the file since FASTA files do not contain any. Instead, the qual string will be empty after the call to readRecord and after the call to read, it will be a string set with empty entries. The string set will have a size that is equal to the number of records in the file.

Writing Sequence Files

Similar to reading, sequence files can be written record-by-record or as a whole.

For record-wise writing, we use the function writeRecord. This function expects as parameters, the StreamConcept to write to, the data to write, followed by the format tag. The following example writes an identifier and a sequence StringSet record-by-record to stdout.

#include <seqan/seq_io.h>
#include <iostream>

int main()
{
    seqan::StringSet<seqan::CharString> ids;
    appendValue(ids, "id1");
    appendValue(ids, "id2");
    seqan::StringSet<seqan::Dna5String> seqs;
    appendValue(seqs, "CGATCGATCGAT");
    appendValue(seqs, "AAAAAAAAAAAA");

    for (unsigned i = 0; i < length(ids); ++i)
        if (seqan::writeRecord(std::cout, ids[i], seqs[i], seqan::Fasta()) != 0)
            return 1;  // Error writing.

    return 0;
}

The result on the console looks like this:

>id1
CGATCGATCGAT
>id2
AAAAAAAAAAAA

Assignment 5

Writing out FASTQ.

Type
Application
Objective
Change the example above such that the two sequences are written as FASTQ with qualities. Use the quality strings "IIIIIIIIIHII" and "IIIIIIIIIIII".
Hint
Simply use a new StringSet quals of CharString, append the quality strings, and modify the line with the writeRecord() call.
Solution
#include <seqan/seq_io.h>
#include <iostream>

int main()
{
    seqan::StringSet<seqan::CharString> ids;
    appendValue(ids, "id1");
    appendValue(ids, "id2");
    seqan::StringSet<seqan::Dna5String> seqs;
    appendValue(seqs, "CGATCGATCGAT");
    appendValue(seqs, "AAAAAAAAAAAA");
    seqan::StringSet<seqan::CharString> quals;
    appendValue(quals, "IIIIIIIIIHII");
    appendValue(quals, "IIIIIIIIIHII");

    for (unsigned i = 0; i < length(ids); ++i)
        if (seqan::writeRecord(std::cout, ids[i], seqs[i], quals[i], seqan::Fastq()) != 0)
            return 1;  // Error writing.

    return 0;
}

The output looks as follows:

@id1
CGATCGATCGAT
+
IIIIIIIIIHII
@id2
AAAAAAAAAAAA
+
IIIIIIIIIIII

For writing out whole string sets at once, we use the function write. The transition from record-wise writing to writing whole string sets is of similar simplicity as for reading:

Warning

For a short time, write() will still be called write2() because of name clashes with the old I/O system.

#include <seqan/seq_io.h>
#include <iostream>

int main()
{
    seqan::StringSet<seqan::CharString> ids;
    appendValue(ids, "id1");
    appendValue(ids, "id2");
    seqan::StringSet<seqan::Dna5String> seqs;
    appendValue(seqs, "CGATCGATCGAT");
    appendValue(seqs, "AAAAAAAAAAAA");
    seqan::StringSet<seqan::CharString> quals;
    appendValue(quals, "IIIIIIIIIHII");
    appendValue(quals, "IIIIIIIIIIII");

    if (seqan::write2(std::cout, ids, seqs, quals, seqan::Fastq()) != 0)
        return 1;  // Error writing.

    return 0;
}

Using MultiSeqFile

Warning

Deprecate MultiSeqFile in favour of FaiIndex?

The class MultiSeqFile (which actually is a shortcut to a memory mapped string set) allows to read sequence files in a two-pass approach. First, the file is read and the start positions of each sequence record in the file is stored in memory. The file is kept open as a memory mapped file.

Then, we can access the identifier, sequence, and quality string of a record using functions such as assignSeqId.

Indexed reading can be done through MultiSeqFile which is a shortcut to a memory mapped string set. We open the file using open on its concat member (which is a MMap String). The function split then parses the file contents and sets the separating indexes of the StringSet. For this, we need the file format. We could give a specify format in the tag (e.g. seqan::Fastq()) or use AutoSeqFormat together with guessFormat.

The following example demonstrates how to use MultiSeqFile to read sequence files. First, we include the necessary headers and start our main() function.

#include <seqan/file.h>
#include <iostream>

int main (int argc, char const ** argv)
{

Then, we declare the MultiSeqFile object and open it with the value of argv[1]. If no parameters are given then we exit the program with status code 1.

    seqan::MultiSeqFile multiSeqFile;
    if (argc < 2 || !open(multiSeqFile.concat, argv[1], seqan::OPEN_RDONLY))
        return 1;

This is followed by using AutoSeqFormat for guessing the sequence file type.

    seqan::AutoSeqFormat format;
    guessFormat(multiSeqFile.concat, format);

After guessing the file type, we can now use this knowledge to compute the start positions of each record using the function split.

    split(multiSeqFile, format);

After the call to split, we can get the number of sequences in the file using the function length. We declare the StringSets for storing the sequences and sequence ids and reserve the exact space for the number of elements we need.

    unsigned seqCount = length(multiSeqFile);
    seqan::StringSet<seqan::String<seqan::Dna5Q> > seqs;
    seqan::StringSet<seqan::CharString> seqIDs;

    reserve(seqs, seqCount, seqan::Exact());
    reserve(seqIDs, seqCount, seqan::Exact());

Then, we declare some buffers for storing the sequence id, characters, and the quality values.

    seqan::String<seqan::Dna5Q> seq;
    seqan::CharString qual;
    seqan::CharString id;

Now, we can access the sequence, qualities and ids using the functions assignSeq, assignQual, and assignSeqId. Note that these functions still have to do some parsing of the input file. The number of sequences is the same as the number of entries in the MultiSeqFile StringSet as returned by length.

In the following loop, we first extract the sequences, qualities, and the sequence id. Then, the qualities are stored in the Dna5Q letters of the string. The sequence with qualities and the sequence ids are then stored in the variables seqs and seqIDs we allocated above.

    for (unsigned i = 0; i < seqCount; ++i)
    {
        assignSeq(seq, multiSeqFile[i], format);    // read sequence
        assignQual(qual, multiSeqFile[i], format);  // read ascii quality values
        assignSeqId(id, multiSeqFile[i], format);   // read sequence id

        // Convert ascii to values from 0..62.
        // We store DNA and quality together in Dna5Q.
        for (unsigned j = 0; j < length(qual) && j < length(seq); ++j)
            assignQualityValue(seq[j], (int)(seqan::ordValue(qual[j]) - 33));

        // We use reserve and append, as assign is not supported by
        // StringSet<..., Owner<ConcatDirect<> > >
        appendValue(seqs, seq);
        appendValue(seqIDs, id);
    }

Finally, we return the status code 0 at the end of our main() function.

    return 0;
}

Indexed reading has multiple advantages.

  • Its performance is only slightly worse than when reading sequentially with a double-pass String RecordReader.
  • The input file is mapped into main memory and otherwise complicated page-wise memory management is done by the operating system and does not have to be implemented by the user. The user can access the file almost at random and only the used parts will be loaded into main memory. This is quite efficient when only few sequences are needed.

If you need to have fast random access to all sequences in a file then loading it into a Concat-Direct StringSet with the batch-reading API is faster than using MultiSeqFile.

MultiSeqFile Review

Type
Review
Objective
Change the example above, so the sequence file that is read is written to the user in a TSV format. For each record in the input file with id ${ID}, sequence ${SEQ}, and quality string ${QUAL}, write out a line ${ID}\t${SEQ}\t${QUAL}.
Solution
#include <seqan/file.h>
#include <iostream>

int main (int argc, char const ** argv)
{
    seqan::MultiSeqFile multiSeqFile;
    if (argc < 2 || !open(multiSeqFile.concat, argv[1], seqan::OPEN_RDONLY))
        return 1;

    seqan::AutoSeqFormat format;
    guessFormat(multiSeqFile.concat, format);
    split(multiSeqFile, format);

    seqan::String<seqan::Dna5> seq;
    seqan::CharString qual;
    seqan::CharString id;

    for (unsigned i = 0; i < seqCount; ++i)
    {
        assignSeq(seq, multiSeqFile[i], format);    // read sequence
        assignQual(qual, multiSeqFile[i], format);  // read ascii quality values
        assignSeqId(id, multiSeqFile[i], format);   // read sequence id

        std::cout << id << '\t' << seq << '\t' << qual << '\n';
    }

    return 0;
}

Next Steps

comments powered by Disqus