File I/O

Learning Objective
In this tutorial, you will learn about the new file I/O infrastructure in SeqAn. You will get an overview of the different layers in the library and an introduction to the StreamConcept concept, the Stream class, and MMap Strings.
Difficulty
Advanced
Duration
60 min
Prerequisites
I/O Overview, Indexed FASTA I/O, Basic SAM and BAM I/O

This tutorial introduces the low-level facilities of file I/O in SeqAn:

  • SeqAn defines a concept called StreamConcept that stream data types have to implement. The class Stream, together with its specializations, provides implementations of this concept. (If you want to provide your own stream implementation, you should specialize the class Stream.)
  • In particular, the specializations GZ File Stream and BZ2 File Stream provide access to compressed files.
  • Furthermore, SeqAn allows you to access memory mapped files using the MMap String specialization.

The target audience consists of developers (1) who want to learn how to use memory mapped files and compressed streams, or (2) who want to have raw, byte-wise read and write access to files, or (3) who want to get a deeper understanding of the I/O system in the SeqAn library.

Note that this tutorial has more of a presentational character with fewer tasks.

Streams

The I/O Overview tutorial has already given you a good overview of streams in SeqAn and how to open them for reading and writing. As a reminder: always open your streams in binary mode to circumvent problems with getting and setting positions within files on Windows. How exactly you open a file in binary mode depends on the I/O library you are using, so consult its documentation.
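
As a minimal sketch that uses only the C and C++ standard libraries, binary mode is requested with "b" in the fopen mode string or with std::ios::binary for file streams. The helper names writeBinary, sizeWithCFile, and sizeWithFstream are made up for this illustration and are not part of SeqAn.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical helper: write `data` byte for byte.  std::ios::binary disables
// any newline translation that the platform might otherwise apply.
bool writeBinary(std::string const & path, std::string const & data)
{
    std::ofstream out(path.c_str(), std::ios::binary | std::ios::out);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return out.good();
}

// Hypothetical helper: return the file size in bytes using C stdio,
// or -1 on error.
long sizeWithCFile(std::string const & path)
{
    FILE * fp = std::fopen(path.c_str(), "rb");  // "b" requests binary mode.
    if (fp == 0)
        return -1;
    long size = -1;
    if (std::fseek(fp, 0, SEEK_END) == 0)
        size = std::ftell(fp);
    std::fclose(fp);
    return size;
}

// Hypothetical helper: the same query using a C++ file stream.
long sizeWithFstream(std::string const & path)
{
    std::ifstream in(path.c_str(), std::ios::binary | std::ios::in);
    if (!in.good())
        return -1;
    in.seekg(0, std::ios::end);
    std::streamoff const size = in.tellg();
    return static_cast<long>(size);
}
```

In binary mode, the positions reported by ftell and tellg are plain byte offsets, which is what makes seeking and telling behave the same on all platforms.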

The Stream Concept

The stream concept requires the following functions, which work on already opened files (e.g. FILE *, std::fstream, or Stream objects).

Function Summary
streamEof Return whether stream is at end of file.
streamError Return error code of stream.
streamFlush Flush stream buffer.
streamPeek Get next character from stream without changing the position in the file.
streamPut Write a value to the output, converted to string.
streamReadBlock Read a block of char values from the stream.
streamReadChar Read one character from the stream.
streamSeek Set stream’s location.
streamTell Retrieve stream’s location.
streamWriteBlock Write an array of char to the stream.

Not all functions might be available for all streams. The metafunction HasStreamFeature provides information about the stream types.

Stream Adaptions

The following C/C++ I/O interfaces can be adapted to the StreamConcept concept.

File Type Description
FILE* C standard library files.
std::fstream, std::ifstream, std::ofstream C++ iostream library file streams
std::stringstream, std::istringstream, std::ostringstream C++ iostream library string streams

This way, we can use the common C++ I/O types through a common interface. Also, we could add adaptions of other file and stream data types to the StreamConcept concept.

The following example shows how to use the StreamConcept global interface functions to copy the contents of the file in.txt to the file out.txt.

#include <fstream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main()
{
    std::fstream in("in.txt", std::ios::binary | std::ios::in);
    std::fstream out("out.txt", std::ios::binary | std::ios::out);

    seqan::CharString buffer;
    resize(buffer, 1000);

    while (!seqan::streamEof(in) && seqan::streamError(in) == 0)
    {
        int num = seqan::streamReadBlock(&buffer[0], in, length(buffer));
        seqan::streamWriteBlock(out, &buffer[0], num);
    }

    return 0;
}

Assignment 1

Reading / Writing

Type
Review
Objective
Write a program that accepts three parameters from the command line. The first one should identify the stream type to use (e.g. "file" for FILE* and "fstream" for std::fstream). The second should be either "r" or "w" for reading/writing. The third one should be a file name. The program should, depending on the parameters, open the given file name in read/write mode using the given file type. When reading, it should display the file contents on stdout. When writing, it should put the string "Hello World!\n" into the file.
Hint
You can encapsulate the reading and writing in their own function templates. This allows you to remove redundancy from the code.
Solution
#include <iostream>
#include <fstream>
#include <cstdio>

#include <seqan/stream.h>

// This template function reads the contents from the given Stream in and
// writes it out to std::cout

template <typename TStream>
int doReading(TStream & in)
{
    seqan::CharString buffer;
    resize(buffer, 1000);

    while (!seqan::streamEof(in) && (seqan::streamError(in) == 0))
    {
        int num = seqan::streamReadBlock(&buffer[0], in, length(buffer));
        seqan::streamWriteBlock(std::cout, &buffer[0], num);
    }
    
    return 0;
}

// This template function writes out "Hello World!\n" to the given Stream.

template <typename TStream>
int doWriting(TStream & out)
{
    seqan::CharString buffer = "Hello World!\n";
    return (seqan::streamWriteBlock(out, &buffer[0], length(buffer)) != length(buffer));
}

// The main function parses the command line, opens the files in the
// appropriate modes with the appropriate stream types and then calls either
// doWriting() or doReading().

int main(int argc, char const ** argv)
{
    if (argc != 4)
    {
        std::cerr << "USAGE: " << argv[0] << " [file|fstream] [r|w] FILENAME\n";
        return 1;
    }

    // Check first argument.
    if (seqan::CharString(argv[1]) != "file" && seqan::CharString(argv[1]) != "fstream")
    {
        std::cerr << "ERROR: " << argv[1] << " is not a valid stream type name.\n";
        return 1;
    }
    bool useFile = (seqan::CharString(argv[1]) == "file");

    // Check second argument.
    if (seqan::CharString(argv[2]) != "r" && seqan::CharString(argv[2]) != "w")
    {
        std::cerr << "ERROR: " << argv[2] << " is not a valid operation name.\n";
        return 1;
    }
    bool doRead = (seqan::CharString(argv[2]) == "r");

    // Branches for stream and operation type.
    int res = 0;
    if (useFile)  // FILE *
    {
        FILE * fp;
        
        if (doRead)  // reading
            fp = fopen(argv[3], "rb");
        else  // writing
            fp = fopen(argv[3], "wb");

        if (fp == 0)
        {
            std::cerr << "ERROR: Could not open " << argv[3] << "\n";
            return 1;
        }

        if (doRead)  // reading
            res = doReading(fp);
        else  // writing
            res = doWriting(fp);

        fclose(fp);
    }
    else  // std::fstream
    {
        std::fstream stream;
        
        if (doRead)  // reading
            stream.open(argv[3], std::ios::binary | std::ios::in);
        else  // writing
            stream.open(argv[3], std::ios::binary | std::ios::out);

        if (!stream.good())
        {
            std::cerr << "ERROR: Could not open " << argv[3] << "\n";
            return 1;
        }

        if (doRead)  // reading
            res = doReading(stream);
        else  // writing
            res = doWriting(stream);
    }

    if (res != 0)
        std::cerr << "ERROR: There was an error accessing the file!\n";
    return res;
}

Char Arrays As Streams

Sometimes it is useful to treat variables of type char * or char[] as streams, e.g., for parsing. You can use the Char-Array Stream specialization for this purpose.

char const * str = "me, myself and my pony";
seqan::Stream<seqan::CharArray<char const *> > wrapper(str, str + strlen(str));
// We can now read from wrapper as if it was a stream.

Compressed Streams

For accessing .gz and .bz2 files, the stream module contains specializations of the class Stream. The main reason they are Stream specializations rather than adaptions is that zlib and bzlib use very generic data types, e.g., void*, for which global functions might have unwanted side effects.

Use the following Stream specializations to read and write zlib and bzlib compressed files.

Stream Class Description
GZ File Stream Wraps the zlib functionality for .gz files.
BZ2 File Stream Wraps the bzlib functionality for .bz2 files.

zlib files have a decent compression ratio and support quite fast compression and decompression. bz2 files are fairly slow to read and write, although the compression ratio is better. For most bioinformatics applications, you will prefer zlib over bzlib.

If you are using SeqAn’s build system, zlib and libbz2 will be detected automatically. On Linux and Mac OS X, these libraries are usually already installed. If you are using Windows, then you can follow the instructions in Installing Contribs On Windows for installing the libraries. If you are using your own build system, see BuildManual/IntegrationWithYourOwnBuildSystem for the necessary configuration steps.

Both specializations can be constructed with an already open underlying compressed stream, i.e., you can pass the gzFile or BZFILE* handle that you want to work on to the stream. They are meant as very thin wrappers around the handle of the compressed stream. This has the advantage that you keep full access to the compression settings etc., and the wrappers only add error flags and the like where necessary. For more convenience, you can also use the open function to open them.

The following example shows (1) how to conditionally enable zlib and bzlib support, (2) how to open gzFile and BZFILE* handles for reading and their corresponding wrappers and (3) the possibilities for error checking.

In the header of the program, we include the zlib and bzlib headers if the correct preprocessor symbols are set. Also, we’ll include the required SeqAn headers.

#include <cstdio>
#include <fstream>
#include <iostream>
#if SEQAN_HAS_ZLIB
#include <zlib.h>
#endif  // #if SEQAN_HAS_ZLIB
#if SEQAN_HAS_BZIP2
#include <bzlib.h>
#endif  // #if SEQAN_HAS_BZIP2

#include <seqan/basic.h>
#include <seqan/stream.h>

The first routine demonstrates how to open a .gz file and write its contents to stdout with full error handling. Note that writing char-by-char is probably not the best idea in a real-world program.

int openGz(char const * filename)
{
#if SEQAN_HAS_ZLIB
    seqan::Stream<seqan::GZFile> f;
    if (!open(f, filename, "rb"))
    {
        std::cerr << "ERROR: GZip file has the wrong format!" << std::endl;
        return 1;
    }
    
    // Totally inefficient char-wise writing of characters from .gz file to stdout.
    while (!streamEof(f))
    {
        char c = '\0';
        int res = streamReadChar(c, f);
        if (res != 0)
        {
            std::cerr << "ERROR: Reading byte from GZip file." << std::endl;
            return 1;
        }
        std::cout << c;
    }
#else  // #if SEQAN_HAS_ZLIB
    (void) filename;
    std::cerr << "ZLIB not available!" << std::endl;
#endif  // #if SEQAN_HAS_ZLIB
    return 0;
}

The next routine demonstrates how to open a .bz2 file and write its contents to stdout, again with full error handling.

int openBz2(char const * filename)
{
#if SEQAN_HAS_BZIP2
    seqan::Stream<seqan::BZ2File> f;
    if (!open(f, filename, "rb"))
    {
        std::cerr << "ERROR: BZ2 file has the wrong format!" << std::endl;
        return 1;
    }

    // Totally inefficient char-wise writing of characters from .bz2 file to stdout.
    while (!streamEof(f))
    {
        char c = '\0';
        int res = streamReadChar(c, f);
        if (res != 0)
        {
            std::cerr << "ERROR: Reading byte from BZ2 file." << std::endl;
            return 1;
        }
        std::cout << c;
    }
#else  // #if SEQAN_HAS_BZIP2
    (void) filename;
    std::cerr << "BZLIB not available!" << std::endl;
#endif  // #if SEQAN_HAS_BZIP2
    return 0;
}

And finally, the code that calls the functions from above.

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;
    openGz(argv[1]);
    openBz2(argv[1]);
    return 0;
}

Now, let’s test the program. We’ll first create gzip and bzip2 compressed text files and an uncompressed text file. Then, we’ll run our demo program on these files. Note that the BZ2 File Stream fails when reading from a file with the wrong format, not when opening it.

# echo 'foo' > test.txt
# gzip test.txt
# echo 'bar' > test.txt
# bzip2 test.txt
# echo 'bz' > test.txt
# ./extras/demos/tutorial/stream/tutorial_stream_compression_formats test.txt
ERROR: GZip file has the wrong format!
ERROR: Reading byte from BZ2 file.
# ./extras/demos/tutorial/stream/tutorial_stream_compression_formats test.txt.gz
foo
ERROR: Reading byte from BZ2 file.
# ./extras/demos/tutorial/stream/tutorial_stream_compression_formats test.txt.bz2
ERROR: GZip file has the wrong format!
bar

Assignment 2

Writing a File Compression/Decompression Tool

Type
Application
Objective
Write a file compression/decompression tool. The first argument should be the format to read/write, e.g. "gz" for gzip and "bz2" for bzip2. The second argument should be the direction, i.e. "c" for "compress" and "x" for "extract". The third and fourth arguments should be the source and target files.
Solution
#include <iostream>
#include <fstream>

#include <seqan/stream.h>

#if SEQAN_HAS_ZLIB && SEQAN_HAS_BZIP2  // Guard against either not being installed.

// Copy from stream in to the stream out.

template <typename TInStream, typename TOutStream>
int copyStream(TInStream & in, TOutStream & out)
{
    seqan::CharString buffer;
    resize(buffer, 1000);

    while (!seqan::streamEof(in) && (seqan::streamError(in) == 0))
    {
        int num = seqan::streamReadBlock(&buffer[0], in, length(buffer));
        seqan::streamWriteBlock(out, &buffer[0], num);
    }
    
    return 0;
}

// The main function parses the command line, opens the files in the
// appropriate modes with the appropriate stream types and then calls
// copyStream().

int main(int argc, char const ** argv)
{
    if (argc != 5)
    {
        std::cerr << "USAGE: " << argv[0] << " [gz|bz2] [c|x] FILE_IN FILE_OUT\n";
        return 1;
    }

    // Check first argument.
    if (seqan::CharString(argv[1]) != "gz" && seqan::CharString(argv[1]) != "bz2")
    {
        std::cerr << "ERROR: " << argv[1] << " is not a valid compression format.\n";
        return 1;
    }
    bool useGzip = (seqan::CharString(argv[1]) == "gz");

    // Check second argument.
    if (seqan::CharString(argv[2]) != "c" && seqan::CharString(argv[2]) != "x")
    {
        std::cerr << "ERROR: " << argv[2] << " is not a valid operation name.\n";
        return 1;
    }
    bool doCompress = (seqan::CharString(argv[2]) == "c");

    // Branches for stream and operation type.
    int res = 0;
    if (useGzip)
    {
        seqan::Stream<seqan::GZFile> gzFileStream;
        std::fstream fileStream;

        if (doCompress)
        {
            fileStream.open(argv[3], std::ios::binary | std::ios::in);
            if (!fileStream.good())
            {
                std::cerr << "ERROR: Could not open file " << argv[3] << "\n";
                return 1;
            }

            if (!open(gzFileStream, argv[4], "w"))
            {
                std::cerr << "ERROR: Could not open file " << argv[4] << "\n";
                return 1;
            }

            res = copyStream(fileStream, gzFileStream);
        }
        else  // extract
        {
            if (!open(gzFileStream, argv[3], "r"))
            {
                std::cerr << "ERROR: Could not open file " << argv[3] << "\n";
                return 1;
            }

            fileStream.open(argv[4], std::ios::binary | std::ios::out);
            if (!fileStream.good())
            {
                std::cerr << "ERROR: Could not open file " << argv[4] << "\n";
                return 1;
            }

            res = copyStream(gzFileStream, fileStream);
        }
    }
    else  // bz2
    {
        seqan::Stream<seqan::BZ2File> bz2FileStream;
        std::fstream fileStream;

        if (doCompress)
        {
            fileStream.open(argv[3], std::ios::binary | std::ios::in);
            if (!fileStream.good())
            {
                std::cerr << "ERROR: Could not open file " << argv[3] << "\n";
                return 1;
            }

            if (!open(bz2FileStream, argv[4], "w"))
            {
                std::cerr << "ERROR: Could not open file " << argv[4] << "\n";
                return 1;
            }

            res = copyStream(fileStream, bz2FileStream);
        }
        else  // extract
        {
            if (!open(bz2FileStream, argv[3], "r"))
            {
                std::cerr << "ERROR: Could not open file " << argv[3] << "\n";
                return 1;
            }

            fileStream.open(argv[4], std::ios::binary | std::ios::out);
            if (!fileStream.good())
            {
                std::cerr << "ERROR: Could not open file " << argv[4] << "\n";
                return 1;
            }

            res = copyStream(bz2FileStream, fileStream);
        }
    }

    if (res != 0)
        std::cerr << "ERROR: There was an error reading/writing!\n";
    return res;
}

#else  // #if SEQAN_HAS_ZLIB && SEQAN_HAS_BZIP2

int main()
{
    return 0;
}

#endif  // #if SEQAN_HAS_ZLIB && SEQAN_HAS_BZIP2

Memory Mapped Files

Memory mapped files allow very fast access to files since they enable you to read data with few, if any, additional buffers. Wikipedia has a nice article on memory mapped files.

In SeqAn, you access memory mapped files using the MMapString specialization. After opening the mapped string using open, you can access its contents as if you were manipulating a normal String. The following shows a simple example:

#include <iostream>

#include <seqan/basic.h>
#include <seqan/file.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
        return 1;
    
    // Memory mapped string, automatically backed by temporary file.
    seqan::String<char, seqan::MMap<> > str1;
    str1 = "This is the first mapped string!";
    std::cout << str1 << std::endl;

    // Open file as memory mapped string.
    seqan::String<char, seqan::MMap<> > str2;
    if (!open(str2, argv[1], seqan::OPEN_RDONLY))
    {
        std::cerr << "Could not open file " << argv[1] << std::endl;
        return 1;
    }
    std::cout << str2 << std::endl;

    return 0;
}

An example execution of the program:

# echo 'foo' > test.txt
# ./extras/demos/tutorial/stream/tutorial_mmap_string_example test.txt
This is the first mapped string!
foo

Next Steps
