Parsing

Learning Objective
This tutorial will explain how to use the class RecordReader for parsing text file formats. You will see detailed examples for parsing TSV-based formats such as GFF and BLAST tabular output, as well as for parsing the recursive Newick tree format.
Difficulty
Advanced
Duration
40 min
Prerequisites
I/O Overview, Lexical Casting

In this tutorial, you will learn how to use the RecordReader functions to easily create parsers for structured text file formats. We will first give a quick example for parsing a simple TSV format. Then, single-pass parsing will be explained (which is the most important variant) and the interface of the RecordReader class and the skip*() and read*() functions will be described. This is followed by extensive examples on parsing the GFF and BLAST tabular format and an example on how to parse the non-linear Newick format for phylogenetic trees. The tutorial closes with an explanation of how to write double-pass I/O code and in which situations it is useful to do so.

A First Example

Let us start off with a quick first example. The following program reads a two-column TSV file from the standard input. The first column contains keys, the second one contains values. The program prints the data as ${key} -> ${value} to stdout.

#include <iostream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

// Read "<key>\t<value>" map from stdin.  Write out as "<key> -> <value>".

int main()
{
    // Define string and record reader types.  We will read from std::cin which
    // is of type std::istream.  We use a single-pass record reader.
    typedef seqan::RecordReader<std::istream, seqan::SinglePass<> > TRecordReader;

    int res = 0;  // Used to store I/O results.

    // Create RecordReader reading from standard input.
    TRecordReader reader(std::cin);

    // Read the file line by line.
    while (!atEnd(reader))
    {
        // Read first column: The key.
        seqan::CharString key;
        res = readUntilChar(key, reader, '\t');
        if (res != 0)
            return 1;

        goNext(reader);  // Skip TAB.

        // Read second column: The value.
        seqan::CharString value;
        res = readLine(value, reader);  // EOL will not be stored in value.
        if (res != 0)
            return 1;

        // Print ${key} -> ${value}.
        std::cout << key << " -> " << value << std::endl;
    }

    return 0;
}

As you can see, using the RecordReader is straightforward. First, we construct the RecordReader to wrap std::cin as also described in the I/O Overview tutorial.

Each iteration of the loop loads one record/line from standard input and writes it out. We use atEnd() to check whether the reader is at the end of the file before each iteration. The function readUntilChar() reads characters from the underlying file into the buffer key until the given character occurs, here '\t'. The reader does not copy the tab character into key and stops on it. The function goNext() advances the reader to the next character in the file. The call to readLine() copies the data into value up to the end of the line; the end-of-line marker ('\n' or '\r\n') is skipped and not copied into value. Finally, we print the key/value pair to stdout.

Assignment 1

Reading CSV instead of TSV.

Type
Review
Objective
Modify the example above to use a comma (',') instead of a tab character for separating columns.
Hint
Yes, it is very easy.
Solution
#include <iostream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

// Read "<key>,<value>" map from stdin.  Write out as "<key> -> <value>".

int main()
{
    // Define string and record reader types.  We will read from std::cin which
    // is of type std::istream.  We use a single-pass record reader.
    typedef seqan::RecordReader<std::istream, seqan::SinglePass<> > TRecordReader;

    int res = 0;  // Used to store I/O results.

    // Create RecordReader reading from standard input.
    TRecordReader reader(std::cin);

    // Read the file line by line.
    while (!atEnd(reader))
    {
        // Read first column: The key.
        seqan::CharString key;
        res = readUntilChar(key, reader, ',');
        if (res != 0)
            return 1;

        goNext(reader);  // Skip comma.

        // Read second column: The value.
        seqan::CharString value;
        res = readLine(value, reader);  // EOL will not be stored in value.
        if (res != 0)
            return 1;

        // Print ${key} -> ${value}.
        std::cout << key << " -> " << value << std::endl;
    }

    return 0;
}

The Single-Pass RecordReader Class Interface

Single-pass record readers can be seen and used simply as an abstraction of streams: they read the file character-wise, from beginning to end.

The low-level API for the single-pass reader is as follows:

Function Description
atEnd Return true if the reader is at the end of the file, false otherwise.
goNext Advance reader in file, return true if at end of file, false otherwise.
value Return the character the reader points to at the moment.
resultCode Return int with I/O status: 0 for no error, a non-0 value if an error occurred while reading.

The following program shows another example of single-pass I/O. We read a text file line by line and append the results to a StringSet of CharStrings.

#include <iostream>
#include <fstream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;
    std::fstream stream(argv[1], std::ios::binary | std::ios::in);
    if (!stream.good())
        return 1;
    
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(stream);
    seqan::StringSet<seqan::CharString> result;
    
    while (!atEnd(reader))
    {
        resize(result, length(result) + 1);
        int res = readLine(back(result), reader);
        if (res != 0)
            return 1;
    }

    return 0;
}

Character Classes and the read* and skip* Functions

Character Classes And is*

In SeqAn, the same character classes are used as in the POSIX standard. See this list of character classes for a comprehensive list and description.

For example:

printf("isdigit('a') == %d\n", isdigit('a'));  // => "isdigit('a') == 0"
printf("isdigit('0') == %d\n", isdigit('0'));  // => "isdigit('0') == 1"
printf("isblank(' ') == %d\n", isblank(' '));  // => non-zero, e.g. "isblank(' ') == 1"

The read* And skip* Functions

The parsing functionality in SeqAn, built on top of the StreamConcept concept and the RecordReader class, is optimized for reading bioinformatics text file formats.

These formats mostly consist of fairly flat data files, i.e. a sequence of records, each having very few levels of subrecords. A typical example is the FASTQ format, where one record consists of adjacent lines containing the identifier, sequence, and qualities. Another example is TSV (tab-separated values), where each record spans one line and there possibly is a header. SAM is an example of a TSV-based format with a header at the top of the file.

The main challenge in reading bioinformatics files is their size. When parsing a word processor document file, a HTML document, or a computer program, the input file is typically not larger than some MB. In bioinformatics, files having multiple GB are not uncommon, e.g. NGS data or the sequence of the human genome.

Thus, in SeqAn, files are parsed “on the fly” as they are read. Using compiler nomenclature, bioinformatics parsers often only have to be tokenizers. Making such simple parsers easy to write is the main aim of the read* and skip* functions in SeqAn. NB: By using Char Array Streams, you can also use the parsing infrastructure on in-memory data.

For each considered class of characters, there often is a read and a skip function. There are two big types of classes: white-listing/inclusion (readXXX) of certain characters and black-listing/exclusion (readUntilXXX) of certain characters. The inclusion functions stop after the last read/included character, the exclusion functions stop on the first excluded/not-read character.

Most functions have the following interface. Note that all functions only append to the buffer argument, so you have to call clear() yourself. This facilitates optimized reading into Concat Direct StringSets.

int readUntilXXX (TBuffer & buffer, RecordReader<TStream, TPass> & reader);
int readXXX      (TBuffer & buffer, RecordReader<TStream, TPass> & reader);
int skipUntilXXX (RecordReader<TStream, TPass> & reader);
int skipXXX      (RecordReader<TStream, TPass> & reader);

Tip

I/O Return Values and EOF_BEFORE_SUCCESS

The read*() and skip*() functions return an int value. Consistent with C return codes, the return value is == 0 in case that the reading/skipping was successful and != 0 if reading/skipping was not successful.

The cases of unsuccessful reading/skipping include real errors (e.g. hardware problems) but also that the reader is at the end of the file. In this case seqan::EOF_BEFORE_SUCCESS is returned. This behaviour is required for file format guessing where a return value of seqan::EOF_BEFORE_SUCCESS is interpreted as success.

There are three cases in how code can handle the value seqan::EOF_BEFORE_SUCCESS: (1) interpret it as an error, (2) return seqan::EOF_BEFORE_SUCCESS itself, or (3) interpret it as “success”.

Here are some examples:

(1) Interpret as Error

Naively, one would assume that this is the correct treatment. However, (2) is the right choice for most cases.

// TRecordReader reader created above.
seqan::CharString buffer;
while (!atEnd(reader))
{
    if (readLine(buffer, reader) != 0)
        return 1;  // handle as error
}

(2) Interpret as ``seqan::EOF_BEFORE_SUCCESS``

Returning this code gives the caller the opportunity to handle end-of-file differently from any other error. For example, a file format guesser can try to parse the first thousand bytes of a file and see whether they parse as valid. When EOF_BEFORE_SUCCESS is returned, it would count this as a success. Any other non-0 return code would be an error.

// TRecordReader reader created above.
seqan::CharString buffer;
int res = 0;
while (!atEnd(reader))
{
    if ((res = readLine(buffer, reader)) != 0)
        return res;  // handle as error or EOF_BEFORE_SUCCESS
}

(3) Interpret as Success

In some cases, EOF is a valid event. For example, if you have a line-based file format such as TSV, the last line could end with an EOF instead of a line break.

// TRecordReader reader created above.
seqan::CharString buffer;
int res = 0;
while (!atEnd(reader))
{
    if ((res = readLine(buffer, reader)) != 0 &&
        res != seqan::EOF_BEFORE_SUCCESS)
        return res;  // EOF on the last line is not an error
}

The following functions are available:

Function Description
readDigits Read digit characters.
readDna5IgnoringWhitespaces Read DNA 5 characters, ignore whitespace.
readLetters Read letter characters.
readLine Read whole line, line break is not written into buffer.
readLineStripTrailingBlanks Read whole line, trailing blanks are not written into buffer.
readNChars Read a fixed number of characters.
readNCharsIgnoringWhitespace Read a fixed number of characters, whitespace is not written into the buffer.
readUntilBlank Read until a blank character occurs.
readUntilChar Read until the given character occurs.
readUntilWhitespace Read until a whitespace character occurs.
skipBlanks Skip blank characters.
skipChar Skip one given character.
skipLine Skip from the current position to the end of the line.
skipNChars Skip a fixed number of characters.
skipNCharsIgnoringWhitespace Skip a fixed number of characters, ignore whitespace.
skipUntilBlank Skip until a blank character occurs.
skipUntilChar Skip until a certain character occurs.
skipUntilGraph Skip until a graph character occurs.
skipUntilLineBeginsWithChar Skip until a line begins with a certain character.
skipUntilLineBeginsWithOneCharOfStr Skip until a line begins with one character of a given string/list.
skipUntilLineBeginsWithStr Skip until a line begins with a certain string.
skipUntilString Skip until a certain string is found.
skipUntilWhitespace Skip until a whitespace character is found.
skipWhitespaces Skip until a non-whitespace character is found.

In the following example, we read the first two fields of a TSV file from stdin and dump them to stdout.

seqan::RecordReader<std::istream, seqan::SinglePass<> > reader(std::cin);
seqan::CharString buffer;

while (!atEnd(reader))
{
    clear(buffer);
    int res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    std::cout << buffer;

    if (goNext(reader))
        return seqan::EOF_BEFORE_SUCCESS;

    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    std::cout << buffer << std::endl;

    res = skipLine(reader);
    if (res != 0 && res != seqan::EOF_BEFORE_SUCCESS)
        return 1;
}

Writing Your Own read* and skip* Functions

Writing your own reading/skipping function is easy. As an example, we write functions for reading and skipping the characters from the set {x, y, z}. The functions follow the same pattern and use the functions _readHelper() and _skipHelper().

These functions read/skip characters as long as a specific overload of the predicate function _charCompare() (in the seqan namespace) returns true. The _charCompare() function gets two parameters: the character to test and a tag for selecting the specific _charCompare() overload. The character to test is of type int. The tag is defined by you as a developer, and the tag given to _charCompare() is the same as the one given to _readHelper() and _skipHelper().

For good examples, you can look at the file core/include/seqan/stream/tokenize.h to see how the rest of the read* and skip* functions from above are implemented.

struct Xyz_;
typedef seqan::Tag<Xyz_> Xyz;

inline int
_charCompare(int const c, Xyz const & /* tag*/)
{
    return c == 'x' || c == 'y' || c == 'z';
}

template <typename TStream, typename TPass, typename TBuffer>
inline int
readXyz(TBuffer & buffer, seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_readHelper(buffer, reader, Xyz(), false);
}

template <typename TBuffer, typename TStream, typename TPass>
inline int
readUntilXyz(TBuffer & buffer, seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_readHelper(buffer, reader, Xyz(), true);
}

template <typename TStream, typename TPass>
inline int
skipXyz(seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_skipHelper(reader, Xyz(), false);
}

template <typename TStream, typename TPass>
inline int
skipUntilXyz(seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_skipHelper(reader, Xyz(), true);
}

Assignment 2

Writing readHexNumber().

Type
Review
Objective
Write your own read and skip routines for hexadecimal numbers. Such numbers can only contain digits 0-9 and the characters a-f and A-F.
Solution

The following program reads from stdin as long as the input forms a valid hexadecimal number. Note that you can send an end-of-file character to your application by pressing Ctrl + d.

#include <iostream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

// The following few lines are the actual solution to the assignment.

struct HexNumChars_;
typedef seqan::Tag<HexNumChars_> HexNumChars;

inline int
_charCompare(int const c, HexNumChars const & /* tag*/)
{
    return isdigit(c) || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}

template <typename TStream, typename TPass, typename TBuffer>
inline int
readHexNumber(TBuffer & buffer, seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_readHelper(buffer, reader, HexNumChars(), false);
}

// This main routine is only some driver code that reads from stdin.

int main()
{
    seqan::RecordReader<std::istream, seqan::SinglePass<> > reader(std::cin);

    while (!atEnd(reader))
    {
        seqan::CharString buffer;
        int res = readHexNumber(buffer, reader);
        if (res != 0 && res != seqan::EOF_BEFORE_SUCCESS)
        {
            std::cerr << "ERROR: Could not read from standard input.\n";
            return 1;
        }

        // Print hexadecimal number back to the user.
        std::cout << "RECOGNIZED " << buffer << '\n';
        
        // Skip all trailing input.
        skipLine(reader);
    }

    return 0;
}

An example session. The Ctrl + d is shown as ^D.

# tutorial_parsing_solution2
foo
10
20
2a^D
RECOGNIZED f
RECOGNIZED 10
RECOGNIZED 20
RECOGNIZED 2a

Assignment 3

Writing readPunctuation().

Type
Review
Objective
Modify the example above to read a sequence of punctuation characters in a function called readPunctuation().
Hint
You can use the function ispunct().
Solution
#include <iostream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

// The following few lines are the actual solution to the assignment.

struct Punct_;
typedef seqan::Tag<Punct_> Punct;

inline int
_charCompare(int const c, Punct const & /* tag*/)
{
    return ispunct(c);
}

template <typename TStream, typename TPass, typename TBuffer>
inline int
readPunctuation(TBuffer & buffer, seqan::RecordReader<TStream, TPass> & reader)
{
    return seqan::_readHelper(buffer, reader, Punct(), false);
}

// This main routine is only some driver code that reads from stdin.

int main()
{
    seqan::RecordReader<std::istream, seqan::SinglePass<> > reader(std::cin);

    while (!atEnd(reader))
    {
        seqan::CharString buffer;
        int res = readPunctuation(buffer, reader);
        if (res != 0 && res != seqan::EOF_BEFORE_SUCCESS)
        {
            std::cerr << "ERROR: Could not read from standard input.\n";
            return 1;
        }

        // Print punctuation back to the user.
        std::cout << "RECOGNIZED " << buffer << '\n';
        
        // Skip all trailing input.
        skipLine(reader);
    }

    return 0;
}

An example session. The Ctrl + d is shown as ^D.

...
asdf
!!@#%%^^D
RECOGNIZED ...
RECOGNIZED 
RECOGNIZED !!@#%%^

File Parsing Practice

This section will walk you through parsers for GFF, tabular BLAST output, and the Newick tree format.

Common Patterns

In order to support a new file format, you usually (1) introduce a struct type for storing records, (2) create tags for the file type and the records, and (3) provide overloads of the functions nextIs() and readRecord(). For example, for the GFF format, we

  • create a struct GffRecord (1)
  • create the tag Gff (2)
  • create overloads of nextIs and readRecord for Gff (3).

A Simple GFF2 Parser

We will implement a simple parser for the GFF file format, version 2. For the sake of simplicity, we will not implement parsing of ## comment lines and will read the whole attributes field as one string without subdividing it further. GFF2 files are TSV files with the following fields.

<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

The following example shows a GFF2 parser. First, include the necessary headers.

#include <iostream>
#include <fstream>

#include <seqan/basic.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

Then, we define the Gff2 tag and the record struct.

struct Gff2_;
typedef Tag<Gff2_> Gff2;

struct Gff2Record
{
    CharString seqName;
    CharString source;
    CharString feature;
    unsigned start;
    unsigned end;
    bool hasScore;
    double score;
    char strand;
    unsigned frame;
    CharString attributes;
    CharString comments;

    Gff2Record() : start(0), end(0), hasScore(false), score(0), strand('.'), frame(0)
    {}
};

void clear(Gff2Record & record)
{
    clear(record.seqName);
    clear(record.source);
    clear(record.feature);
    clear(record.attributes);
    clear(record.comments);
}

We then implement a parser function for GFF records. Note that most of the code is error handling.

template <typename TStream, typename TPass>
int readRecord(Gff2Record & record, RecordReader<TStream, TPass> & reader, Gff2 const & /*tag*/)
{
    clear(record);
    CharString buffer;
    char c = '\0';
    int res = 0;

    // GFF2 records look like this:
    //
    // <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

    // <seqname>
    res = readUntilChar(record.seqName, reader, '\t');
    if (res != 0)
        return res;
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <source>
    res = readUntilChar(record.source, reader, '\t');
    if (res != 0)
        return res;
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <feature>
    res = readUntilChar(record.feature, reader, '\t');
    if (res != 0)
        return res;
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <start>
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.start, buffer))
        return 1;  // Could not cast!
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <end>
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.end, buffer))
        return 1;  // Could not cast!
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <score>
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    record.hasScore = (buffer != ".");
    if (record.hasScore && !lexicalCast2<double>(record.score, buffer))
        return 1;  // Could not cast!
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <strand>
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (length(buffer) != 1u)
        return 1;  // More than one char or none.
    c = front(buffer);
    if (c != '.' && c != '+' && c != '-')
        return 1;  // Invalid strand.
    record.strand = c;
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <frame>
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (length(buffer) != 1u)
        return 1;  // More than one char or none.
    c = front(buffer);
    if (c != '.' && c != '0' && c != '1' && c != '2')
        return 1;  // Invalid frame.
    record.frame = (c == '.') ? 0 : unsigned(c - '0');  // '.' is mapped to frame 0.
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;

    // <attributes>
    res = readUntilTabOrLineBreak(record.attributes, reader);
    if (res != 0 && res != EOF_BEFORE_SUCCESS)
        return res;
    if (atEnd(reader))
        return 0;
    if (value(reader) == '\t')
    {
        if (goNext(reader))
            return EOF_BEFORE_SUCCESS;
    }

    // <comments>
    res = readLine(record.comments, reader);
    if (res != 0 && res != EOF_BEFORE_SUCCESS)
        return res;
    return 0;
}

On top of the record-reading routine, we implement reading of whole documents. This is quite simple.

template <typename TGff2Records, typename TStream, typename TPass>
int read2(TGff2Records & records, RecordReader<TStream, TPass> & reader, Gff2 const & /*tag*/)
{
    Gff2Record record;
    while (!atEnd(reader))
    {
        clear(record);
        int res = readRecord(record, reader, Gff2());
        if (res != 0)
            return res;
        appendValue(records, record);
    }
    return 0;
}

Finally, some driver code to open a file and call the parser routine. In the end, we dump some of the information we just read.

int main(int argc, char const ** argv)
{
    // Handle command line arguments, open files.
    if (argc != 2)
        return 1;
    std::fstream stream(argv[1], std::ios::binary | std::ios::in);
    if (!stream.good())
        return 1;

    // Read file.
    RecordReader<std::fstream, SinglePass<> > reader(stream);
    String<Gff2Record> gffRecords;
    int res = read2(gffRecords, reader, Gff2());
    if (res != 0)
        return res;

    // Write out some of the data to stdout.
    for (unsigned i = 0; i < length(gffRecords); ++i)
        std::cout << gffRecords[i].seqName << "\t" << gffRecords[i].strand << "\t" << gffRecords[i].start << "\t"
                  << gffRecords[i].end << std::endl;
    
    return 0;
}

Let’s look at an example run of the program.

# cat extras/demos/tutorial/parsing/gff2_example.txt
IV     curated  mRNA   5506800 5508917 . + .   Transcript B0273.1; Note "Zn-Finger"
IV     curated  5'UTR  5506800 5508999 . + .   Transcript B0273.1
IV     curated  exon   5506900 5506996 . + .   Transcript B0273.1
IV     curated  exon   5506026 5506382 . + .   Transcript B0273.1
IV     curated  exon   5506558 5506660 . + .   Transcript B0273.1
IV     curated  exon   5506738 5506852 . + .   Transcript B0273.1
IV     curated  3'UTR  5506852 5508917 . + .   Transcript B0273.1
# ./extras/demos/tutorial/parsing/tutorial_parse_gff2 extras/demos/tutorial/parsing/gff2_example.txt
IV  +   5506800 5508917
IV  +   5506800 5508999
IV  +   5506900 5506996
IV  +   5506026 5506382
IV  +   5506558 5506660
IV  +   5506738 5506852
IV  +   5506852 5508917

Newick Tree Parsing (Recursion Example)

The Newick tree format is used for encoding phylogenetic trees (see the Newick Tree Format Standard for a formal specification). We will write a parser that reads Newick forest files (without allowing for comments).

Here is an example for the Newick format:

(((One:0.2,Two:0.3):0.3,(Three:0.5,Four:0.3):0.2):0.3,Five:0.7):0.0;

A file with this content encodes the following tree:

      +-+ One
   +--+
   |  +--+ Two
+--+
|  | +----+ Three
|  +-+
|    +--+ Four
+
+------+ Five

And here is the grammar of the Newick format in EBNF.

forest        = tree+;
tree          = node, ";";
node          = children, label?, distance?
              | children?, label, distance?;
children      = "(", node, (",",node)*, ")";
label         = quoted-list
              | unquoted-list;
distance      = ":", number;
quoted-list   = "'", (qchar | escaped-quote)*, "'";
escaped-quote = "''";
unquoted-list = uqchar+;

The following demo shows the parsers, code to dump the tree from the internal data structures and a small driver program for the routines.

First, the necessary includes.

#include <iostream>
#include <fstream>

#include <seqan/basic.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>
#include <seqan/graph_types.h>

using namespace seqan;

Then, we define a Newick tag and a struct for branch labels.

struct Newick_;
typedef Tag<Newick_> Newick;

struct NewickBranchLabel
{
    bool isDistanceSet;
    double distance;
    
    NewickBranchLabel() : isDistanceSet(false), distance(0)
    {}
};

As a next step, we write a reusable helper function for reading floating point literals.

template <typename TCharSeq, typename TStream, typename TPass>
int _readExponentialPart(TCharSeq & buffer,
                         RecordReader<TStream, TPass> & reader)
{
    // Check preconditions.
    SEQAN_ASSERT_NOT(atEnd(reader));
    SEQAN_ASSERT(value(reader) == 'e' || value(reader) == 'E');
    
    // Read 'e' or 'E';
    appendValue(buffer, value(reader));
    if (goNext(reader))
        return EOF_BEFORE_SUCCESS;
    // Possibly read '+' or '-'.
    if (value(reader) == '+' || value(reader) == '-')
    {
        appendValue(buffer, value(reader));
        if (goNext(reader))
            return EOF_BEFORE_SUCCESS;
    }
    // Read digits.
    if (!isdigit(value(reader)))
        return 1;  // Should have been a digit!
    return readDigits(buffer, reader);
}

template <typename TCharSeq, typename TStream, typename TPass>
int readFloadLiteral(TCharSeq & buffer,
                     RecordReader<TStream, TPass> & reader)
{
    if (atEnd(reader))
        return EOF_BEFORE_SUCCESS;  // Empty field.
        
    // The EBNF for floating point literals is as follows:
    //
    // exponent-indicator     = e | E 
    // exponent-part          = exponent-indicator [+|-]digits
    // floating-point-literal = digits exponent-part
    //                        | digits.[digits][exponent-part]
    //                        | .digits[exponent-part]

    // Read one leading sign if it is there.
    if (value(reader) == '-' || value(reader) == '+')
    {
        appendValue(buffer, value(reader));
        if (goNext(reader))
            return EOF_BEFORE_SUCCESS;  // At end after leading sign.
    }
    
    // Digits or dot?
    if (value(reader) == '.')
    {
        // Dot
        appendValue(buffer, '.');
        if (goNext(reader))
            return EOF_BEFORE_SUCCESS;
        if (!isdigit(value(reader)))
            return 1;  // Invalid format, >= 1 digit have to follow.
        int res = readDigits(buffer, reader);
        if (res != 0)
            return res;  // Error reading digits.
        // Optionally read exponential part.
        if (atEnd(reader))
            return 0;
        if (value(reader) == 'e' || value(reader) == 'E')
            return _readExponentialPart(buffer, reader);
    }
    else
    {
        // Digits
        if (!isdigit(value(reader)))
            return 1;  // >= 1 digit required!
        int res = readDigits(buffer, reader);
        if (res != 0)
            return res;  // Error reading digits.
        if (atEnd(reader))  // Stop if no more data.
            return 0;
        if (value(reader) == '.')
        {
            appendValue(buffer, '.');
            if (goNext(reader))
                return 0;  // End of field.
            if (isdigit(value(reader)))
            {
                res = readDigits(buffer, reader);
                if (res != 0)
                    return res;  // Error reading digits.
            }
            // Optionally read exponential part.
            if (atEnd(reader))
                return 0;
            if (value(reader) == 'e' || value(reader) == 'E')
                return _readExponentialPart(buffer, reader);
        }
        else if (value(reader) == 'e' || value(reader) == 'E')
        {
            return _readExponentialPart(buffer, reader);
        }
    }
    
    return 0;
}

The code for reading a Newick forest is recursive and a bit lengthy but not too complex. We load such forests into strings of Tree objects. Additionally, we have a vertex map for the branch distances and the vertex labels for each tree.

template <typename TTree, typename TRecordReader, typename TVertexDescriptor>
int _readNewickTree(TTree & tree,
                    String<CharString> & vertexLabels,
                    String<NewickBranchLabel> & branchLabels,
                    TRecordReader & reader,
                    TVertexDescriptor v)
{
    if (atEnd(reader))
        return EOF_BEFORE_SUCCESS;
    CharString buffer;
    int res = 0;
    
#define SKIP_WHITESPACE                            \
    do                                             \
    {                                              \
        int res = skipWhitespaces(reader);         \
        if (res != 0 && res != EOF_BEFORE_SUCCESS) \
            return res;                            \
    } while(false)
    
    if (value(reader) == '(')  // CHILDREN
    {
        if (goNext(reader))
            return EOF_BEFORE_SUCCESS;
        // children
        bool first = true;
        while (true)
        {
            SKIP_WHITESPACE;

            // Skip leading comma.
            if (!first)
            {
                res = skipChar(reader, ',');
                if (res != 0)
                    return res;
            }
            first = false;
            
            SKIP_WHITESPACE;
            
            // Read child.
            TVertexDescriptor x = addChild(tree, v);
            resizeVertexMap(tree, vertexLabels);
            resizeVertexMap(tree, branchLabels);
            res = _readNewickTree(tree, vertexLabels, branchLabels, reader, x);
            if (res != 0)
                return res;
            
            SKIP_WHITESPACE;
            
            // Exit loop.
            if (value(reader) == ')')
                break;
        }
        res = skipChar(reader, ')');
        if (res != 0)
            return res;  // Could not close child list.
        SKIP_WHITESPACE;
    }
    if (value(reader) != ':')  // LABEL
    {
        SKIP_WHITESPACE;
        clear(buffer);
        if (value(reader) == '\'')
        {
            // Read quoted label.
            if (goNext(reader))
                return EOF_BEFORE_SUCCESS;
            while (!atEnd(reader))
            {
                char c = value(reader);
                if (c == '\'')  // Possibly break, if not "''".
                {
                    if (goNext(reader))
                        break;
                    if (value(reader) != '\'')
                        break;
                }
                appendValue(buffer, value(reader));
                if (goNext(reader))
                    return 1;
            }
        }
        else
        {
            // Read unquoted label.
            while (!atEnd(reader))
            {
                char c = value(reader);
                if (isblank(c) || c == '(' || c == ')' || c == '[' ||
                    c == ']' || c == '\'' || c == '.' || c == ';' ||
                    c == ',' || c == ':')
                    break;
                appendValue(buffer, value(reader));
                if (goNext(reader))
                    return 1;
            }
        }
        assignProperty(vertexLabels, v, buffer);
        SKIP_WHITESPACE;
    }
    if (value(reader) == ':')  // DISTANCE
    {
        skipChar(reader, ':');
        SKIP_WHITESPACE;
        clear(buffer);
        res = readFloat(buffer, reader);
        if (res != 0)
            return res;  // Invalid distance.
        property(branchLabels, v).isDistanceSet = true;
        property(branchLabels, v).distance = lexicalCast<double>(buffer);
        SKIP_WHITESPACE;
    }
    return 0;
}

template <typename TStream, typename TSpec>
int read2(String<Graph<Tree<> > > & forest,
          String<String<CharString> > & vertexLabels,
          String<String<NewickBranchLabel> > & branchLabels,
          RecordReader<TStream, SinglePass<TSpec> > & reader,
          Newick const & /*tag*/)
{
    typedef Graph<Tree<> > TTree;
    typedef typename VertexDescriptor<TTree>::Type TVertexDescriptor;
    int res = 0;

    SKIP_WHITESPACE;
    
    // Read forest.
    while (!atEnd(reader))
    {
        // Allocate graph and maps.
        resize(forest, length(forest) + 1);
        resize(vertexLabels, length(vertexLabels) + 1);
        resize(branchLabels, length(branchLabels) + 1);
        // Allocate root.
        createRoot(back(forest));
        TVertexDescriptor v = root(back(forest));
        resizeVertexMap(back(forest), back(vertexLabels));
        resizeVertexMap(back(forest), back(branchLabels));
        // Read tree.
        res = _readNewickTree(back(forest), back(vertexLabels), back(branchLabels), reader, v);
        if (res != 0)
            return res;
        // Skip trailing semicolon, must be there.
        res = skipChar(reader, ';');
        if (res != 0)
            return res;
        SKIP_WHITESPACE;
    }

#undef SKIP_WHITESPACE

    return 0;
}

The code for dumping a Newick forest is also quite simple, if lengthy because of error checks.

template <typename TStream, typename TTree, typename TVertexDescriptor, typename TVertexLabels,
          typename TBranchLabels>
int _writeNewickRecurse(TStream & stream, TTree & tree, TVertexDescriptor v,
                        TVertexLabels & vertexLabels, TBranchLabels & branchLabels)
{
    if (numChildren(tree, v) > 0u)
    {
        int res = streamPut(stream, '(');
        if (res != 0)
            return res;
        
        typename Iterator<TTree, OutEdgeIterator>::Type it(tree, v);
        bool first = true;
        for (; !atEnd(it); goNext(it))
        {
            if (!first)
            {
                res = streamPut(stream, ',');
                if (res != 0)
                    return res;
            }
            first = false;
            res = _writeNewickRecurse(stream, tree, targetVertex(it), vertexLabels, branchLabels);
            if (res != 0)
                return res;
        }
        
        res = streamPut(stream, ')');
        if (res != 0)
            return res;
    }
    // Write label if any, quoted if required.
    if (length(property(vertexLabels, v)) > 0u)
    {
        bool needsQuoting = false;
        CharString const & label = property(vertexLabels, v);
        typename Iterator<CharString const, Rooted>::Type it = begin(label, Rooted());
        for (; !atEnd(it); ++it)
        {
            if (isblank(*it) || *it == ',' || *it == ';' || *it == '.' ||
                *it == '\'' || *it == '[' || *it == ']' || *it == '(' ||
                *it == ')')
            {
                needsQuoting = true;
                break;
            }
        }
        if (needsQuoting)
        {
            int res = streamPut(stream, '\'');
            if (res != 0)
                return res;
            it = begin(label, Rooted());
            for (; !atEnd(it); ++it)
            {
                if (*it == '\'')
                {
                    res = streamPut(stream, "''");
                    if (res != 0)
                        return res;
                }
                else
                {
                    res = streamPut(stream, *it);
                    if (res != 0)
                        return res;
                }
            }
            res = streamPut(stream, '\'');
            if (res != 0)
                return res;
        }
        else
        {
            int res = streamPut(stream, label);
            if (res != 0)
                return res;
        }
    }
    // Write branch length if any is given.
    if (property(branchLabels, v).isDistanceSet)
    {
        int res = streamPut(stream, ':');
        if (res != 0)
            return res;
        res = streamPut(stream, property(branchLabels, v).distance);
        if (res != 0)
            return res;
    }
    return 0;
}

template <typename TStream>
inline int write2(TStream & stream,
                  Graph<Tree<> > & tree,
                  String<CharString> & vertexLabels,
                  String<NewickBranchLabel> & branchLabels,
                  Newick const & /*tag*/)
{
    // Write <tree>;.
    int res = _writeNewickRecurse(stream, tree, getRoot(tree), vertexLabels, branchLabels);
    if (res != 0)
        return res;
    return streamPut(stream, ';');
}

Finally, the main() routine.

int main(int argc, char const ** argv)
{
    // Handle arguments, open file.
    if (argc != 2)
    {
        std::cerr << "Incorrect argument count!" << std::endl;
        std::cerr << "USAGE: tutorial_parse_newick INPUT.txt" << std::endl;
        return 1;
    }
    std::fstream stream(argv[1], std::ios::binary | std::ios::in);
    if (!stream.good())
    {
        std::cerr << "Could not open file " << argv[1] << std::endl;
        return 1;
    }
    RecordReader<std::fstream, SinglePass<> > reader(stream);
    
    // Load forest.
    String<Graph<Tree<> > > forest;
    String<String<CharString> > vertexLabels;
    String<String<NewickBranchLabel> > branchLabels;
    int res = read2(forest, vertexLabels, branchLabels, reader, Newick());
    if (res != 0)
    {
        std::cerr << "Could not read Newick file!" << std::endl;
        return res;
    }
    
    // Dump forests.
    for (unsigned i = 0; i < length(forest); ++i)
    {
        res = write2(std::cout, forest[i], vertexLabels[i], branchLabels[i], Newick());
        std::cout << "\n";
        if (res != 0)
        {
            std::cerr << "Error writing to stdout!" << std::endl;
            return 1;
        }
    }
    
    return 0;
}

Let’s look at an example run. Note that the children in SeqAn trees do not have a specific order, and the Newick format does not prescribe a normalized order either. In the written result, the order of the children has therefore changed.

# cat extras/demos/tutorial/parsing/newick_example.txt
(a,('Darwin''s Bulldog (Huxley)',c):-1.92e19)'The ''Root''':5;
((a_node,
  'another node',
  bird:0.3134)higher_node:4.5,
 c):1.03e10;
((),(,(,),));
# tutorial_parse_newick extras/demos/tutorial/parsing/newick_example.txt
((c,'Darwin''s Bulldog (Huxley)'):-1.92e+19,a)'The ''Root''':5;
(c,(bird:0.3134,'another node',a_node)higher_node:4.5):1.03e+10;
((,(,),),());
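
The descend-on-'(' recursion that _readNewickTree implements can also be illustrated independently of SeqAn. The following sketch (plain C++; countLeaves is a hypothetical helper, not part of SeqAn) counts the leaves of a Newick string, interpreting only the '(' / ',' / ')' structure and skipping over labels and branch lengths:

```cpp
#include <cstddef>
#include <string>

// Count the leaves of the Newick subtree starting at position i of s.
// Only the '('/','/')' structure is interpreted; labels and branch
// lengths are skipped over.
std::size_t countLeaves(std::string const & s, std::size_t & i)
{
    if (s[i] != '(')  // Leaf: skip its (possibly empty) label.
    {
        while (i < s.size() && s[i] != ',' && s[i] != ')' && s[i] != ';')
            ++i;
        return 1;
    }
    std::size_t leaves = 0;
    ++i;  // Skip '('.
    while (true)
    {
        leaves += countLeaves(s, i);  // Recurse into one child.
        if (s[i] != ',')
            break;
        ++i;  // Skip ',' before the next child.
    }
    ++i;  // Skip ')'.
    while (i < s.size() && s[i] != ',' && s[i] != ')' && s[i] != ';')
        ++i;  // Skip the inner node's label/distance, if any.
    return leaves;
}
```

Quoted labels, which may themselves contain parentheses and commas, are deliberately not handled in this sketch; the full reader above takes care of them.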

Parsing Tabular BLAST

The program BLASTN can be given an -outfmt parameter that makes it generate tabular output. This output is quite easy to parse (much easier than the human-readable BLAST reports) and looks as follows:

# blastn -subject NC_001405.fasta -query NC_001460.fasta -outfmt 7 > blast_example.txt
# cat blast_example.txt
# BLASTN 2.2.25+
# Query: gi|9626621|ref|NC_001460.1| Human adenovirus A, complete genome
# Subject: gi|9626158|ref|NC_001405.1| Human adenovirus C, complete genome
# Fields: query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
# 3 hits found
gi|9626621|ref|NC_001460.1| gi|9626158|ref|NC_001405.1| 81.13   408 66  11  17730   18131   18827   19229   5e-87    316
gi|9626621|ref|NC_001460.1| gi|9626158|ref|NC_001405.1| 81.63   98  12  6   383 476 433 528 9e-15   76.8
gi|9626621|ref|NC_001460.1| gi|9626158|ref|NC_001405.1| 76.27   118 22  6   25147   25261   26644   26758   3e-09   58.4
# BLAST processed 1 queries
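
Before turning to the SeqAn code, it may help to see the record layout in isolation. The following sketch (plain C++ without SeqAn; splitTabs is a hypothetical helper) splits one alignment line at tab characters, which is exactly the column structure the parser walks through field by field:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one BLAST tabular line at tab characters.  Each alignment
// line has twelve columns; numeric fields stay as text here and
// would be cast afterwards (as readRecord() does with lexicalCast2<>()).
std::vector<std::string> splitTabs(std::string const & line)
{
    std::vector<std::string> fields;
    std::istringstream in(line);
    std::string field;
    while (std::getline(in, field, '\t'))
        fields.push_back(field);
    return fields;
}
```
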

The following example program takes the name of such a blastn output file, reads it into record data structures, and then prints it out again in a different format. To do this, we first implement a record-reading API that allows streaming through the file. On top of that, we build a batch-reading API that reads such a file into a sequence of records that are all kept in main memory.

The program starts with including the required headers.

#include <iostream>

#include <seqan/sequence.h>
#include <seqan/file.h>
#include <seqan/stream.h>

using namespace seqan;

Then, we define a tag for the file format BlastnTab and tags for the comment and alignment record types.

struct BlastnTab_;
typedef Tag<BlastnTab_> BlastnTab;

struct BlastnTabComment_;
typedef Tag<BlastnTabComment_> BlastnTabComment;

struct BlastnTabAlignment_;
typedef Tag<BlastnTabAlignment_> BlastnTabAlignment;

Next, we define a record type. Note that this record type is very specialized to the blastn -outfmt 7 format. When writing I/O code for multiple formats for similar data, you might want to consider writing one record type for all of them. See the (upcoming, TODO) SAM record I/O for the implementation of one record type for both the SAM and the BAM format.

We also create a simple function to dump the record to a stream.

struct BlastnTabAlignmentRecord
{
    CharString queryName;
    CharString subjectName;
    double identity;
    unsigned alignmentLength;
    unsigned mismatches;
    unsigned gapOpens;
    unsigned queryBegin;
    unsigned queryEnd;
    unsigned subjectBegin;
    unsigned subjectEnd;
    double eValue;
    double bitScore;
    
    BlastnTabAlignmentRecord() :
            identity(0), alignmentLength(0), mismatches(0), gapOpens(0),
            queryBegin(0), queryEnd(0), subjectBegin(0), subjectEnd(0),
            eValue(0), bitScore(0)
    {}
};

template <typename TStream>
int write(TStream & stream, BlastnTabAlignmentRecord & record)
{
    streamPut(stream, "query name: ");
    streamPut(stream, record.queryName);
    streamPut(stream, "\nsubject name: ");
    streamPut(stream, record.subjectName);
    streamPut(stream, "\nidentity: ");
    streamPut(stream, record.identity);
    streamPut(stream, "\nalignment length: ");
    streamPut(stream, record.alignmentLength);
    streamPut(stream, "\nmismatches: ");
    streamPut(stream, record.mismatches);
    streamPut(stream, "\ngap opens: ");
    streamPut(stream, record.gapOpens);
    streamPut(stream, "\nquery begin: ");
    streamPut(stream, record.queryBegin);
    streamPut(stream, "\nquery end: ");
    streamPut(stream, record.queryEnd);
    streamPut(stream, "\nsubject begin: ");
    streamPut(stream, record.subjectBegin);
    streamPut(stream, "\nsubject end: ");
    streamPut(stream, record.subjectEnd);
    streamPut(stream, "\nevalue: ");
    streamPut(stream, record.eValue);
    streamPut(stream, "\nbit score: ");
    streamPut(stream, record.bitScore);
    int res = streamPut(stream, "\n\n");
    return res;
}

void clear(BlastnTabAlignmentRecord & record)
{
    clear(record.queryName);
    clear(record.subjectName);
}

Then, we define nextIs functions for the BlastnTabComment and BlastnTabAlignment tags, and their represented record types.

template <typename TStream, typename TSpec>
inline bool
nextIs(RecordReader<TStream, SinglePass<TSpec> > & reader, BlastnTabComment const & /*tag*/)
{
    return !atEnd(reader) && value(reader) == '#';
}

template <typename TStream, typename TSpec>
inline bool
nextIs(RecordReader<TStream, SinglePass<TSpec> > & reader, BlastnTabAlignment const & /*tag*/)
{
    return !atEnd(reader) && value(reader) != '#';
}

Then, we implement a record-reading API on top of the skip* and read* functions. Note that the error handling inflates the number of required lines but is necessary.

template <typename TCharSequence, typename TStream, typename TSpec>
inline int
readRecord(TCharSequence & buffer, RecordReader<TStream, SinglePass<TSpec> > & reader, BlastnTabComment const & /*tag*/)
{
    SEQAN_ASSERT(nextIs(reader, BlastnTabComment()));
    clear(buffer);
    return readLine(buffer, reader);
}

template <typename TStream, typename TSpec>
inline int
readRecord(BlastnTabAlignmentRecord & record, RecordReader<TStream, SinglePass<TSpec> > & reader, BlastnTabAlignment const & /*tag*/)
{
    SEQAN_ASSERT(nextIs(reader, BlastnTabAlignment()));
    int res = 0;
    CharString buffer;
    clear(record);

    // Read query name.
    res = readUntilChar(record.queryName, reader, '\t');
    if (res != 0)
        return res;
    goNext(reader);

    // Read subject name.
    res = readUntilChar(record.subjectName, reader, '\t');
    if (res != 0)
        return res;
    goNext(reader);
    
    // Read identity.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<double>(record.identity, buffer))
        return 1;  // Could not cast identity to double.
    goNext(reader);

    // Read alignment length.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.alignmentLength, buffer))
        return 1;  // Could not cast alignment length to unsigned.
    goNext(reader);

    // Read mismatches.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.mismatches, buffer))
        return 1;  // Could not cast mismatches to unsigned.
    goNext(reader);

    // Read gap opens.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.gapOpens, buffer))
        return 1;  // Could not cast gap opens to unsigned.
    goNext(reader);

    // Read query begin.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.queryBegin, buffer))
        return 1;  // Could not cast query begin to unsigned.
    goNext(reader);

    // Read query end.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.queryEnd, buffer))
        return 1;  // Could not cast query end to unsigned.
    goNext(reader);

    // Read subject begin.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.subjectBegin, buffer))
        return 1;  // Could not cast subject begin to unsigned.
    goNext(reader);

    // Read subject end.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<unsigned>(record.subjectEnd, buffer))
        return 1;  // Could not cast subject end to unsigned.
    goNext(reader);

    // Read evalue.
    clear(buffer);
    res = readUntilChar(buffer, reader, '\t');
    if (res != 0)
        return res;
    if (!lexicalCast2<double>(record.eValue, buffer))
        return 1;  // Could not cast evalue to double.
    goNext(reader);

    // Read bit score, up to end of the line.
    clear(buffer);
    res = readLine(buffer, reader);
    if (res != 0)
        return res;
    if (!lexicalCast2<double>(record.bitScore, buffer))
        return 1;  // Could not cast bit score to double.
    
    return 0;
}

On top of the record-reading API, we implement a batch-reading function. This function turns out to be fairly simple.

template <typename TBlastnTabRecords, typename TStream, typename TSpec>
int read(TBlastnTabRecords & records, RecordReader<TStream, SinglePass<TSpec> > & reader, BlastnTab const & /*tag*/)
{
    BlastnTabAlignmentRecord record;
    while (!atEnd(reader))
    {
        if (nextIs(reader, BlastnTabComment()))
        {
            skipRecord(reader, BlastnTabComment());
            continue;
        }
        if (!nextIs(reader, BlastnTabAlignment()))
            return 1;
        if (readRecord(record, reader, BlastnTabAlignment()) != 0)
            return 1;
        appendValue(records, record);
    }
    return 0;
}

In the main() routine, we then simply open a std::fstream and create a RecordReader. We use the batch-reading API to read the whole file into main memory and write it to stdout again.

int main(int argc, char const ** argv)
{
    // Process command line arguments, open file.
    if (argc != 2)
    {
        std::cerr << "Incorrect argument count!" << std::endl;
        std::cerr << "USAGE: tutorial_parse_blastn INPUT.txt" << std::endl;
        return 1;
    }
    std::fstream stream(argv[1], std::ios::binary | std::ios::in);
    if (!stream.good())
    {
        std::cerr << "Could not open file " << argv[1] << std::endl;
        return 1;
    }

    // Read file.
    RecordReader<std::fstream, SinglePass<> > reader(stream);
    String<BlastnTabAlignmentRecord> records;
    int res = read(records, reader, BlastnTab());
    if (res != 0)
    {
        std::cerr << "Could not read BLASTN records." << std::endl;
        return 1;
    }

    // Write read records.
    for (unsigned i = 0; i < length(records); ++i)
        write(std::cout, records[i]);

    return 0;
}

The program’s output looks as follows:

# ./extras/demos/tutorial/parsing/tutorial_parse_blastn ../../extras/demos/tutorial/parsing/blast_example.txt
query name: gi|9626621|ref|NC_001460.1|
subject name: gi|9626158|ref|NC_001405.1|
identity: 81.13
alignment length: 408
mismatches: 66
gap opens: 11
query begin: 17730
query end: 18131
subject begin: 18827
subject end: 19229
evalue: 5e-87
bit score: 316

query name: gi|9626621|ref|NC_001460.1|
subject name: gi|9626158|ref|NC_001405.1|
identity: 81.63
alignment length: 98
mismatches: 12
gap opens: 6
query begin: 383
query end: 476
subject begin: 433
subject end: 528
evalue: 9e-15
bit score: 76.8

query name: gi|9626621|ref|NC_001460.1|
subject name: gi|9626158|ref|NC_001405.1|
identity: 76.27
alignment length: 118
mismatches: 22
gap opens: 6
query begin: 25147
query end: 25261
subject begin: 26644
subject end: 26758
evalue: 3e-09
bit score: 58.4

Double-Pass I/O Using the RecordReader

The double-pass RecordReader’s API extends the functions described above for the single-pass RecordReader. It provides the following additional global interface functions.

Function Description
startFirstPass Start the first pass of reading.
startSecondPass Start the second pass of reading.

It is used as follows: For each section of the file that is to be read next (one or more records), you first call startFirstPass. This memoizes the current position in the file. Then, you use the same API as for the single-pass reader to read the section. When you are done with it, you call startSecondPass. This resets the reader to the position where startFirstPass was called, so the same data can be read a second time.

Here is an example for using double-pass I/O:

#include <iostream>
#include <fstream>

#include <seqan/basic.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan;

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;
    std::fstream stream(argv[1], std::ios::binary | std::ios::in);
    if (!stream.good())
        return 1;
    
    RecordReader<std::fstream, DoublePass<> > reader(stream);
    String<CharString> result;
    CharString buffer;
    
    while (!atEnd(reader))
    {
        startFirstPass(reader);
        clear(buffer);
        int res = readLine(buffer, reader);
        if (res != 0)
            return 1;

        startSecondPass(reader);
        resize(result, length(result) + 1);
        resize(back(result), length(buffer), Exact());
        res = readLine(back(result), reader);
        if (res != 0)
            return 1;
    }

    return 0;
}

Note that all file contents read in the first pass are buffered when operating on streams. Double-pass I/O can therefore have a high memory usage on streams when the passes are large; in this case, reading from a memory mapped string can be more efficient. However, this buffering is what makes double-pass I/O possible at all when reading from compressed streams or stdin.

Double-pass I/O has the advantage that exactly the required amount of memory can be allocated for the target data structures. This can reduce memory usage, since no memory is pre-allocated and then left unused. It pays off most when your target data structures are long-lived and the savings are substantial.
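
The exact-allocation idea can be sketched without SeqAn: the hypothetical helper loadLinesTwoPass below makes a counting pass over an in-memory buffer, reserves exactly the counted number of lines, and only then copies the data in a second pass.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Two-pass line loading from an in-memory buffer: the first pass only
// counts lines, then exactly that many entries are reserved before the
// second pass copies the data.  This mirrors what the double-pass
// RecordReader enables, without any SeqAn types.
std::vector<std::string> loadLinesTwoPass(std::string const & buffer)
{
    // First pass: count the lines.
    std::size_t numLines = 0;
    for (std::size_t i = 0; i < buffer.size(); ++i)
        if (buffer[i] == '\n')
            ++numLines;
    if (!buffer.empty() && buffer[buffer.size() - 1] != '\n')
        ++numLines;  // Final line without trailing '\n'.

    // Second pass: copy, with the outer vector allocated exactly once.
    std::vector<std::string> result;
    result.reserve(numLines);
    std::size_t begin = 0;
    for (std::size_t i = 0; i <= buffer.size(); ++i)
        if (i == buffer.size() || buffer[i] == '\n')
        {
            // Skip only the empty segment after a trailing '\n' at EOF.
            if (i > begin || i < buffer.size())
                result.push_back(buffer.substr(begin, i - begin));
            begin = i + 1;
        }
    return result;
}
```

With a memory mapped string as the buffer, this pattern avoids both the stream buffering and the over-allocation of a geometrically growing container.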

The disadvantage is the higher memory usage while reading the file itself: when reading from streams, all data read in the first pass has to be buffered.

So, when should you use double-pass I/O? A good rule of thumb is: if you need to read a whole large file (e.g. an NGS read set or a genome) into main memory and it is uncompressed, use a double-pass record reader on a memory mapped string. Otherwise, use single-pass I/O.
