File I/O Overview

Learning Objective

This article gives you an overview of formatted file I/O in SeqAn.

Difficulty

Basic

Duration

30 min

Prerequisites

Sequences

Overview

Most file formats in bioinformatics are structured as lists of records. Often, they start with a header that itself contains several header records. For example, the Sequence Alignment/Map format and its binary counterpart (SAM/BAM) start with a header that lists all contigs of the reference sequence. The header is followed by a list of alignment records, each containing a query sequence aligned to some reference contig.
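The header-then-records layout can be illustrated on SAM text, where header lines start with '@'. The following is a minimal sketch using only the C++ standard library, not SeqAn; the names ParsedSam and splitSam are hypothetical and exist only for this illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for this sketch; not a SeqAn type.
struct ParsedSam
{
    std::vector<std::string> headerRecords;     // lines starting with '@'
    std::vector<std::string> alignmentRecords;  // all remaining non-empty lines
};

// Split SAM text into header records and alignment records.
ParsedSam splitSam(std::string const & text)
{
    ParsedSam result;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.empty())
            continue;  // skip blank lines
        if (line[0] == '@')
            result.headerRecords.push_back(line);
        else
            result.alignmentRecords.push_back(line);
    }
    return result;
}
```

Calling splitSam on SAM text with two header lines and one alignment line yields two header records and one alignment record.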

Formatted Files

SeqAn reads and writes record-structured files through two types of classes: FormattedFileIn and FormattedFileOut. Classes of type FormattedFileIn read files, whereas classes of type FormattedFileOut write files. Note that these classes do not allow reading and writing the same file at the same time.

These types of classes provide the following I/O operations on formatted files:

  1. Open a file given its filename or attach to an existing stream like std::cin or std::cout.

  2. Guess the file format from the file content or filename extension.

  3. Access compressed or uncompressed files transparently.
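To make the idea of format guessing concrete, here is a minimal sketch of how a compression type might be guessed from a filename extension or from the leading bytes of the content. It uses only the C++ standard library and is not SeqAn's implementation; Compression, endsWith, guessFromExtension, and guessFromContent are hypothetical names.

```cpp
#include <string>

// Hypothetical tags for this sketch; SeqAn uses its own tag types.
enum class Compression { None, Gzip, Bzip2 };

// Return true if `name` ends with `suffix`.
bool endsWith(std::string const & name, std::string const & suffix)
{
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Guess the compression from the filename extension.
Compression guessFromExtension(std::string const & filename)
{
    if (endsWith(filename, ".gz"))
        return Compression::Gzip;
    if (endsWith(filename, ".bz2"))
        return Compression::Bzip2;
    return Compression::None;
}

// Guess the compression from the first bytes of the content.
// gzip streams start with the bytes 0x1f 0x8b; bzip2 streams start with "BZh".
Compression guessFromContent(std::string const & firstBytes)
{
    if (firstBytes.size() >= 2 &&
        static_cast<unsigned char>(firstBytes[0]) == 0x1f &&
        static_cast<unsigned char>(firstBytes[1]) == 0x8b)
        return Compression::Gzip;
    if (firstBytes.compare(0, 3, "BZh") == 0)
        return Compression::Bzip2;
    return Compression::None;
}
```

Guessing from the content is the only option when the input is a stream without a filename.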

SeqAn provides the following file formats:

Warning

Access to compressed files relies on external libraries. For instance, you need to have zlib installed for reading .gz files and libbz2 for reading .bz2 files. If you are using Linux or OS X and you followed the Getting Started tutorial closely, then you should have already installed the necessary libraries. On Windows, you will need to follow Installing Dependencies to get the necessary libraries.

You can check whether you have installed these libraries by running CMake again. Simply call cmake . in your build directory. At the end of the output, there will be a section “SeqAn Features”. If you can read ZLIB - FOUND and BZIP2 - FOUND then you can use zlib and libbz2 in your programs.

Basic I/O

This tutorial shows the basic functionality provided by any class of type FormattedFileIn or FormattedFileOut. As concrete types, it uses the classes BamFileIn and BamFileOut: BamFileIn reads files in SAM or BAM format, whereas BamFileOut writes them. Nonetheless, this functionality is independent of the particular file format and thus applies to all record-based file formats supported by SeqAn.

The demo application shown here is a simple BAM to SAM converter.

Includes

Support for a specific format is enabled by including the corresponding header file. In this case, we include the BAM header file:

#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{

Opening and Closing Files

Classes of type FormattedFileIn and FormattedFileOut can open and close files.

A file can be opened by passing the filename to the constructor:

    CharString bamFileInName = getAbsolutePath("demos/tutorial/file_io_overview/example.bam");
    CharString samFileOutName = getAbsolutePath("demos/tutorial/file_io_overview/example.sam");

    // Open input BAM file, BamFileIn supports both SAM and BAM files.
    BamFileIn bamFileIn(toCString(bamFileInName));

    // Open output SAM file by passing the context of bamFileIn and the filename to open.
    BamFileOut samFileOut(context(bamFileIn), toCString(samFileOutName));

Alternatively, a file can be opened after construction by calling open:

    // Alternative way to open a BAM or SAM file.
    BamFileIn openBamFileIn;
    open(openBamFileIn, toCString(bamFileInName));

Note that any file is closed automatically when the FormattedFileIn or FormattedFileOut object goes out of scope. If needed, a file can also be closed manually by calling close.

Accessing the Header

To access the header, we need an object representing the format-specific header. In this case, we use an object of type BamHeader. The content of this object can be ignored for now; it is covered in the SAM and BAM I/O tutorial.

    // Copy header.
    BamHeader header;
    readHeader(header, bamFileIn);
    writeHeader(samFileOut, header);

The function readHeader reads the header from the input BAM file and writeHeader writes it to the SAM output file.

Accessing the Records

Again, to access records, we need an object representing format-specific information. In this case, we use an object of type BamAlignmentRecord. Each call to readRecord reads one record from the BAM input file and moves the BamFileIn forward. Each call to writeRecord writes the record just read to the SAM output file. We detect the end of the input file by calling atEnd.

    // Copy all records.
    BamAlignmentRecord record;
    while (!atEnd(bamFileIn))
    {
        readRecord(record, bamFileIn);
        writeRecord(samFileOut, record);
    }

    return 0;
}

Our small BAM to SAM conversion demo is ready. The tool still lacks error handling, reading from standard input and writing to standard output. You are now going to add these features.

Error Handling

We distinguish between two types of errors: low-level file I/O errors and high-level file format errors. File I/O errors can affect both input and output files. Examples of such errors are: the file permissions forbid a certain operation, the file does not exist, there is a disk reading error, a file gets deleted while we are reading from it, or there is a physical error on the hard disk. Conversely, file format errors can only affect input files: such errors arise whenever the content of the input file is incorrect or damaged. Error handling in SeqAn is implemented by means of exceptions.
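The split between the two error types can be sketched with two separate exception classes that are caught independently. The example below uses only the C++ standard library; DemoIOError, DemoParseError, readLine, parsePosition, and classify are hypothetical names for illustration and are not part of SeqAn. The record layout "name, tab, position" is likewise an assumption of the sketch.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical exception types; they only mimic the split between
// low-level I/O errors and high-level format errors.
struct DemoIOError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

struct DemoParseError : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Read one line; failing to read anything is treated as an I/O error.
std::string readLine(std::istream & in)
{
    std::string line;
    if (!std::getline(in, line))
        throw DemoIOError("could not read from stream");
    return line;
}

// Parse a "name<TAB>position" record; throws DemoParseError on bad content.
int parsePosition(std::string const & record)
{
    std::size_t tab = record.find('\t');
    if (tab == std::string::npos)
        throw DemoParseError("missing tab separator");
    std::istringstream field(record.substr(tab + 1));
    int position = 0;
    if (!(field >> position))
        throw DemoParseError("position is not a number");
    return position;
}

// Read and parse one record, reporting which kind of error occurred.
std::string classify(std::istream & in)
{
    try
    {
        parsePosition(readLine(in));
        return "ok";
    }
    catch (DemoParseError const & e)
    {
        return std::string("format error: ") + e.what();
    }
    catch (DemoIOError const & e)
    {
        return std::string("I/O error: ") + e.what();
    }
}
```

Keeping the two types separate lets the caller decide, for example, to skip a malformed record but abort on a read failure.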

I/O Errors

All FormattedFileIn and FormattedFileOut constructors and functions throw exceptions of type IOError to signal low-level file I/O errors. Therefore, it is sufficient to catch these exceptions to handle I/O errors properly.

There is one exception to this rule: the function open returns a bool that indicates whether the file was opened successfully.
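The two reporting styles, a bool return value versus an exception, can be compared on a plain std::ifstream. This is a minimal sketch using only the C++ standard library, not SeqAn; tryOpen and openOrThrow are hypothetical helper names.

```cpp
#include <fstream>
#include <string>

// Style 1: report failure through the return value.
bool tryOpen(std::ifstream & in, std::string const & path)
{
    in.open(path);
    return in.is_open();
}

// Style 2: report failure through an exception.
void openOrThrow(std::ifstream & in, std::string const & path)
{
    // Ask the stream to throw whenever failbit or badbit gets set.
    in.exceptions(std::ios::failbit | std::ios::badbit);
    in.open(path);  // throws std::ios_base::failure if the file cannot be opened
}
```

With the first style the caller must check the result at the call site; with the second, a single try block can cover opening and all later reads.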

Assignment 1

Type

Application

Objective

Improve the program above to detect file I/O errors.

Hint

Use the IOError class.

Solution

#include <iostream>

#include <seqan/bam_io.h>

using namespace seqan2;

int main(int, char const **)
{
    CharString bamFileInName = getAbsolutePath("demos/tutorial/file_io_overview/example.bam");
    CharString samFileOutName = getAbsolutePath("demos/tutorial/file_io_overview/example.sam");

    // Open input BAM file.
    BamFileIn bamFileIn;
    BamHeader header;
    if (!open(bamFileIn, toCString(bamFileInName)))
    {
        std::cerr << "ERROR: could not open input file " << bamFileInName << ".\n";
        return 1;
    }

    // Open output SAM file.
    BamFileOut samFileOut(context(bamFileIn), toCString(samFileOutName));

    // Copy header.
    try
    {
        readHeader(header, bamFileIn);
        writeHeader(samFileOut, header);
    }
    catch (IOError const & e)
    {
        std::cerr << "ERROR: could not copy header. " << e.what() << "\n";
    }

    // Copy all records.
    BamAlignmentRecord record;
    while (!atEnd(bamFileIn))
    {
        try
        {
            readRecord(record, bamFileIn);
            writeRecord(samFileOut, record);
        }
        catch (IOError const & e)
        {
            std::cerr << "ERROR: could not copy record. " << e.what() << "\n";
        }
    }

    return 0;
}

Format Errors

Classes of type FormattedFileIn throw exceptions of type ParseError to signal high-level input file format errors.

Assignment 2

Type

Application

Objective

Improve the program above to detect file format errors.

Solution

#include <iostream>

#include <seqan/bam_io.h>

using namespace seqan2;

int main(int, char const **)
{
    CharString bamFileInName = getAbsolutePath("demos/tutorial/file_io_overview/example.bam");
    CharString samFileOutName = getAbsolutePath("demos/tutorial/file_io_overview/example.sam");

    // Open input BAM file.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileInName)))
    {
        std::cerr << "ERROR: could not open input file " << bamFileInName << ".\n";
        return 1;
    }

    // Open output SAM file.
    BamFileOut samFileOut(context(bamFileIn), toCString(samFileOutName));
    // Copy header.
    BamHeader header;
    try
    {
        readHeader(header, bamFileIn);
        writeHeader(samFileOut, header);
    }
    catch (ParseError const & e)
    {
        std::cerr << "ERROR: input header is badly formatted. " << e.what() << "\n";
    }
    catch (IOError const & e)
    {
        std::cerr << "ERROR: could not copy header. " << e.what() << "\n";
    }

    // Copy all records.
    BamAlignmentRecord record;
    while (!atEnd(bamFileIn))
    {
        try
        {
            readRecord(record, bamFileIn);
            writeRecord(samFileOut, record);
        }
        catch (ParseError const & e)
        {
            std::cerr << "ERROR: input record is badly formatted. " << e.what() << "\n";
        }
        catch (IOError const & e)
        {
            std::cerr << "ERROR: could not copy record. " << e.what() << "\n";
        }
    }

    return 0;
}

Streams

The FormattedFileIn and FormattedFileOut constructors accept not only filenames, but also standard C++ streams, or any other class implementing the Stream concept. For instance, you can pass std::cin to any FormattedFileIn constructor and std::cout to any FormattedFileOut constructor.

Note

When writing to std::cout, classes of type FormattedFileOut cannot guess the file format from the filename extension. Therefore, the file format has to be specified explicitly by providing a tag, e.g. Sam or Bam.

Assignment 3

Type

Application

Objective

Improve the program above to write to standard output.

Solution

#include <iostream>

#include <seqan/bam_io.h>

using namespace seqan2;

int main(int, char const **)
{
    CharString bamFileInName = getAbsolutePath("demos/tutorial/file_io_overview/example.bam");

    // Open input BAM file.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileInName)))
    {
        std::cerr << "ERROR: could not open input file " << bamFileInName << ".\n";
        return 1;
    }

    // Open the SAM output on the standard output; the format tag Sam() is given explicitly.
    BamFileOut samFileOut(context(bamFileIn), std::cout, Sam());

    // Copy header.
    BamHeader header;
    try
    {
        readHeader(header, bamFileIn);
        writeHeader(samFileOut, header);
    }
    catch (ParseError const & e)
    {
        std::cerr << "ERROR: input header is badly formatted. " << e.what() << "\n";
    }
    catch (IOError const & e)
    {
        std::cerr << "ERROR: could not copy header. " << e.what() << "\n";
    }

    // Copy all records.
    BamAlignmentRecord record;
    while (!atEnd(bamFileIn))
    {
        try
        {
            readRecord(record, bamFileIn);
            writeRecord(samFileOut, record);
        }
        catch (ParseError const & e)
        {
            std::cerr << "ERROR: input record is badly formatted. " << e.what() << "\n";
        }
        catch (IOError const & e)
        {
            std::cerr << "ERROR: could not copy record. " << e.what() << "\n";
        }
    }

    return 0;
}

Running this program results in the following output.

@HD	VN:1.3	SO:coordinate
@SQ	SN:ref	LN:45
@SQ	SN:ref2	LN:40
r001	163	ref	7	30	8M4I4M1D3M	=	37	39	TTAGATAAAGAGGATACTG	*	XX:B:S,12561,2,20,112
r002	0	ref	9	30	1S2I6M1P1I1P1I4M2I	*	0	0	AAAAGATAAGGGATAAA	*
r003	0	ref	9	30	5H6M	*	0	0	AGCTAA	*
r004	0	ref	16	30	6M14N1I5M	*	0	0	ATAGCTCTCAGC	*
r003	16	ref	29	30	6H5M	*	0	0	TAGGC	*
r001	83	ref	37	30	9M	=	7	-39	CAGCGCCAT	*

Next Steps

If you want, you can now have a look at the API documentation of the FormattedFile class.

You can now read the tutorials for already supported file formats: