SAM and BAM I/O

Learning Objective
This tutorial will explain how to use the lower-level API for reading from and writing to SAM and BAM files.
Difficulty
Advanced
Duration
20 min
Prerequisites
Basic SAM and BAM I/O, I/O Overview

After reading the :ref;`tutorial-basic-sam-bam-io` tutorial, you should have a good understanding of the representation of SAM/BAM records in SeqAn library. This tutorial will explain how the lower-level implementation of SAM and BAM I/O works and which design decisions lead to this. You will also learn about the classes BamIOContext and NameStoreCache. This article also explains how to write template-based code that can read both SAM files from RecordReaders and BAM files from compressed streams. Furthermore, you will learn how to use BAI indices for jumping in BAM files.

This tutorial will mostly work by showing full source code examples and explaining them. Do not feel intimidated by its length, it is inflated by the amount of embedded source code.

The usage of SAM and BAM tags is explained in the Basic SAM and BAM I/O tutorial and will not be discussed here further.

Important

Note that this tutorial is targeted at readers that already know about the SAM format. If you do not know about the SAM format yet then this tutorial will be harder for your to understand. Teaching the ins and outs of SAM is out of the scope of such a tutorial.

BamAlignmentRecord Design Overview

Besides the fact that BAM is a compressed, binary format, the main difference to the plain-text format SAM is that BAM has additional header information. The binary header of BAM files contains a plain-text SAM header with the same information as a SAM file. However, BAM files contain an additional binary header section that stores all reference sequence names and their length.

This is equivalent to the @SQ entries in a SAM header but the @SQ entries are optional in both SAM and the SAM header in a BAM file. Also, the @SQ entries in the SAM header of a BAM file can be out of sync with the binary reference information entries. The authorative source for this information (most prominently the names and the order of the reference sequences). BAM records then only contain an integer rID for each reference that referes to the rID-th reference sequence in the binary header.

Because SAM and BAM files store the same information, there is only one record type to store it in SeqAn. The record type is the class BamAlignmentRecord that was already introduced in the Basic SAM and BAM I/O tutorial tutorial. The structure of this class is designed after the BAM file so conversion from the BAM file on disk to the in-memory representation is fast. For example, the tags are stored as a binary string, the same as in a BAM file. When reading SAM, the tags are converted into the BAM representation whereas BAM tags can be copied verbatim. There is one important deviation, though: The qualities are stored using a phred-style ASCII encoding for consistencies with the rest of the SeqAn library.

As a reminder, here is the synopsis of the BamAlignmentRecord class again.

namespace seqan {

class BamAlignmentRecord
{
public:
    CharString qName;               // QNAME
    __uint16 flag;                  // FLAG
    __int32 rID;                    // REF
    __int32 beginPos;               // POS
    __uint8 mapQ;                   // MAPQ mapping quality, 255 for */invalid
    __uint16 bin;                   // bin for indexing
    String<CigarElement<> > cigar;  // CIGAR string
    __int32 rNextId;                // RNEXT (0-based)
    __int32 pNext;                  // PNEXT (0-based)
    __int32 tLen;                   // TLEN
    CharString seq;                 // SEQ, as in SAM/BAM file.
    CharString qual;                // Quality string as in SAM (Phred).
    CharString tags;                // Tags, raw as in BAM.

    // Constants for marking pos, reference id and length members invalid (== */0).
    static __int32 const INVALID_POS = -1;
    static __int32 const INVALID_REFID = -1;
    static __int32 const INVALID_LEN = 0;
};

}  // namespace seqan

Name Stores and Name Store Caches

In order to translate from numeric reference id (rID) to text reference sequence name, the names have to be stored in a StringSet which we will call a name store. For being able to translate back from a textual name (stored as a CharString, for example), we need a NameStoreCache that allows the fast lookup of numeric ids from textual names. Both the name store and the cache are then wrapped by a BamIOContext. This context object is used to prescient from the differences of SAM and BAM files when reading and writing.

For example, when writing out a BamAlignmentRecord to a SAM file, we need to look up the name of the reference from its numeric id to write it out as a string. When reading a record from a SAM file, we have to translate its name string into a numeric id. Even more, if the sequence is not know yet (remember, the @SQ headers are optional), we have to append it to the name store and register it with the cache.

Here is a minimal example of setting up a name store, name store cache, and a BamIOContext. We will build upon this example below when showing how to read and write SAM and BAM files.

#include <seqan/bam_io.h>
#include <seqan/sequence.h>

int main()
{
    // Create some shortcuts to the types that we will use.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;

    // Setup the variables.
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   bamIOContext(nameStore, nameStoreCache);

    return 0;
}

BGZF Files / Stream

By default, the BAM format is compressed using the BGZF compression scheme (originating from Tabix, but also described in the SAM standard). You can read BGZF files with tools for processing .gz files, e.g. gzip and zcat.

However, there is a big difference between files written in BGZF and .gz files. BGZF is a sequence of compressed blocks. If the offset of a block is known, it can be decompressed independent of the rest of the file. This information can then be used together with indices.

SeqAn provides the BGZF Stream class in the module <seqan/stream.h> to access such streams. Here is an example for using a Stream for reading:

#include <seqan/stream.h>

int main()
{
    seqan::Stream<seqan::Bgzf> stream;
    if (!open(stream, "filename.bam", "r"))
        return 1;  // Could not open for reading.
    
    return 0;
}

Using a BGZF Stream for writing:

#include <seqan/stream.h>

int main()
{
    seqan::Stream<seqan::Bgzf> stream;
    if (!open(stream, "filename.bam", "w"))
        return 1;  // Could not open for writing.
    
    return 0;
}

Assignment 1

Uncompressing a BGZF file.

Type
Review
Objective
Write a program that reads in a BGZF compressed file using BGZF Stream and writes the uncompressed data out again.
Hint
Use the function streamReadBlock and streamWriteBlock for reading and writing data into and from buffers.
Solution
#include <iostream>
#include <fstream>

#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 3)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.bam OUT.bin\n";
        return 1;
    }

    // Open BGZF file for reading.
    seqan::Stream<seqan::Bgzf> inStream;
    if (!open(inStream, argv[1], "r"))
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Open std::fstream for writing.
    std::fstream outStream(argv[2], std::ios::binary | std::ios::out);
    if (!outStream.good())
    {
        std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
        return 1;
    }

    // Copy over data.
    seqan::CharString buffer;
    resize(buffer, 1000);
    while (!seqan::atEnd(inStream) && seqan::streamError(inStream) == 0)
    {
        int num = seqan::streamReadBlock(&buffer[0], inStream, length(buffer));
        seqan::streamWriteBlock(outStream, &buffer[0], num);
    }
    
    return 0;
}

Reading and Writing Headers

The data structure BamHeader has already been described in the Basic SAM and BAM I/O so we will not repeat that here. Instead, we will focus on how to read headers from SAM and BAM files.

Here is a minimal example of reading and writing a header from and to a SAM file. The example contains the creation of a BamIOContext, the necessary RecordReader and full error handling.

#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 3)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.sam OUT.sam\n";
        return 1;
    }

    // Open std::fstream for reading.
    std::fstream inStream(argv[1], std::ios::binary | std::ios::in);
    if (!inStream.good())
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Open std::fstream for writing.
    std::fstream outStream(argv[2], std::ios::binary | std::ios::out);
    if (!outStream.good())
    {
        std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
        return 1;
    }

    // Setup RecordReader.
    seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(inStream);

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, reader, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not read header from SAM file " << argv[1] << "\n";
        return 1;
    }

    // Write out header again.
    if (write2(outStream, header, context, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not write header to SAM file " << argv[2] << "\n";
        return 1;
    }

    return 0;
}

Reading and writing headers from and to BAM files is simple. We simply replace seqan::Sam() by seqan::Bam() and use BGZF Stream objects instead of uncompressed streams. Also, we do not need a RecordReader any more.

#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 3)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.bam OUT.bam\n";
        return 1;
    }

    // Open BGZF Stream for reading.
    seqan::Stream<seqan::Bgzf> inStream;
    if (!open(inStream, argv[1], "r"))
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Open BGZF Stream for writing.
    seqan::Stream<seqan::Bgzf> outStream;
    if (!open(outStream, argv[2], "w"))
    {
        std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
        return 1;
    }

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, inStream, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
        return 1;
    }

    // Write out header again.
    if (write2(outStream, header, context, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not write header to BAM file " << argv[2] << "\n";
        return 1;
    }

    return 0;
}

Note that except for the types, the signatures of the functions readRecord() and write() are the same. Thus, we can make copying of the header a template function copyHeader(). This function can now be used for both BAM and SAM.

#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

template <typename TOutStream, typename TInStreamOrReader, typename TTag>
int copyHeader(TOutStream & outStream,
               TInStreamOrReader & inStreamOrReader,
               seqan::BamIOContext<seqan::StringSet<seqan::CharString> > & context,
               TTag const & tag)
{

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, inStreamOrReader, tag) != 0)
    {
        std::cerr << "ERROR: Could not read header\n";
        return 1;
    }

    // Write out header again.
    if (write2(outStream, header, context, tag) != 0)
    {
        std::cerr << "ERROR: Could not write header.\n";
        return 1;
    }

    return 0;
}

int main(int argc, char const ** argv)
{
    if (argc != 3)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.sam OUT.sam\n";
        return 1;
    }

    // Streams for SAM.
    std::fstream inStreamSam, outStreamSam;
    // Streams for BAM.
    seqan::Stream<seqan::Bgzf> inStreamBam, outStreamBam;

    if (seqan::endsWith(seqan::CharString(argv[1]), ".sam"))
    {
        inStreamSam.open(argv[1], std::ios::binary | std::ios::in);
        if (!inStreamSam.good())
        {
            std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
            return 1;
        }
        outStreamSam.open(argv[2], std::ios::binary | std::ios::out);
        if (!outStreamSam.good())
        {
            std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
            return 1;
        }
    }
    else
    {
        // Open BGZF Stream for reading.
        if (!open(inStreamBam, argv[1], "r"))
        {
            std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
            return 1;
        }

        // Open BGZF Stream for writing.
        if (!open(outStreamBam, argv[2], "w"))
        {
            std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
            return 1;
        }
    }

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    if (endsWith(seqan::CharString(argv[1]), ".sam"))
    {
        // Stream must be open before constructing reader, thus we define it here.
        seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(inStreamSam);
        return copyHeader(outStreamSam, reader, context, seqan::Sam());
    }
    else
    {
        return copyHeader(outStreamBam, inStreamBam, context, seqan::Bam());
    }
}

Assignment 2

Converting BAM header to SAM.

Type
Application
Objective
Write a program that reads the header from a BAM file and writes it out as a SAM header to std::cout.
Solution
#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.bam\n";
        return 1;
    }

    // Open BGZF Stream for reading.
    seqan::Stream<seqan::Bgzf> inStream;
    if (!open(inStream, argv[1], "r"))
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, inStream, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
        return 1;
    }

    // Write out header again.
    if (write2(std::cout, header, context, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not write header to stdout.\n";
        return 1;
    }

    return 0;
}

Reading and Writing Alignment Records

BamAlignmentRecords can be read and written the same way as BamHeader objects. Here is an example for reading and writing of alignment records from SAM and to files.

// Copy over records.
seqan::BamAlignmentRecord record;
while (atEnd(reader))
{
    if (readRecord(record, context, reader, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not read record from SAM file " << argv[1] << "\n";
        return 1;
    }

    if (write2(outStream, record, context, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not write record to SAM file " << argv[2] << "\n";
        return 1;
    }
}

And here is the modified version for the BAM format. The only changes are that

// Copy over records.
seqan::BamAlignmentRecord record;
while (atEnd(reader))
{
    if (readRecord(record, context, inStream, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not read record from BAM file " << argv[1] << "\n";
        return 1;
    }

    if (write2(outStream, record, context, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not write record to BAM file " << argv[2] << "\n";
        return 1;
    }
 }

Assignment 3

Converting whole BAM files to SAM.

Type
Application
Objective
Modify the solution of Assignment 2 to not only convert the header to BAM but also the alignment records.
Solution
#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.bam\n";
        return 1;
    }

    // Open BGZF Stream for reading.
    seqan::Stream<seqan::Bgzf> inStream;
    if (!open(inStream, argv[1], "r"))
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, inStream, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
        return 1;
    }

    // Write out header again.
    if (write2(std::cout, header, context, seqan::Sam()) != 0)
    {
        std::cerr << "ERROR: Could not write header to stdout.\n";
        return 1;
    }

    // Copy over the alignment records.
    seqan::BamAlignmentRecord record;
    while (!atEnd(inStream))
    {
        if (readRecord(record, context, inStream, seqan::Bam()) != 0)
        {
            std::cerr << "ERROR: Could not read record from BAM File " << argv[1] << "\n";
            return 1;
        }

        if (write2(std::cout, record, context, seqan::Sam()) != 0)
        {
            std::cerr << "ERROR: Could not write record to stdout.\n";
            return 1;
        }
    }

    return 0;
}

Using Indices

SeqAn also contains support for reading BAM indices with the format .bai. These indices can be built using the samtools index command.

You can read such indices into a BAI BamIndex object with the function read. Then, you can use the function seqan:”Function.BamIndex#jumpToRegion” to jump within BAM files.

After jumping, the next record that is read is before at the given position. This means, you have to manually read as many records up until the one you are looking for is found. The reason for this is that the function jumpToRegion would have to read until it finds the first record that is right from or at the given position. This would lead to this record being lost.

#include <iostream>
#include <fstream>

#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 7)
    {
        std::cerr << "USAGE: " << argv[0] << " IN.bam IN.bam.bai REF BEGIN END COUNT\n";
        return 1;
    }

    // Open BGZF Stream for reading.
    seqan::Stream<seqan::Bgzf> inStream;
    if (!open(inStream, argv[1], "r"))
    {
        std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
        return 1;
    }

    // Read BAI index.
    seqan::BamIndex<seqan::Bai> baiIndex;
    if (read(baiIndex, argv[2]) != 0)
    {
        std::cerr << "ERROR: Could not read BAI index file " << argv[2] << "\n";
        return 1;
    }

    // Setup name store, cache, and BAM I/O context.
    typedef seqan::StringSet<seqan::CharString> TNameStore;
    typedef seqan::NameStoreCache<TNameStore>   TNameStoreCache;
    typedef seqan::BamIOContext<TNameStore>     TBamIOContext;
    TNameStore      nameStore;
    TNameStoreCache nameStoreCache(nameStore);
    TBamIOContext   context(nameStore, nameStoreCache);

    // Read header.
    seqan::BamHeader header;
    if (readRecord(header, context, inStream, seqan::Bam()) != 0)
    {
        std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
        return 1;
    }

    // Translate from reference name to rID.
    int rID = 0;
    if (!getIdByName(nameStore, argv[3], rID, nameStoreCache))
    {
        std::cerr << "ERROR: Reference sequence named " << argv[3] << " not known.\n";
        return 1;
    }

    // Translate BEGIN and END arguments to number, 1-based to 0-based.
    int beginPos = 0, endPos = 0;
    if (!seqan::lexicalCast2(beginPos, argv[4]) || beginPos <= 0)
    {
        std::cerr << "ERROR: Begin position " << argv[4] << " is invalid.\n";
        return 1;
    }
    beginPos -= 1;  // 1-based to 0-based.
    if (!seqan::lexicalCast2(endPos, argv[5]) || endPos <= 0)
    {
        std::cerr << "ERROR: End position " << argv[5] << " is invalid.\n";
        return 1;
    }
    endPos -= 1;  // 1-based to 0-based.

    // Translate number of elements to print to number.
    int num = 0;
    if (!seqan::lexicalCast2(num, argv[6]))
    {
        std::cerr << "ERROR: Count " << argv[6] << " is invalid.\n";
        return 1;
    }

    // Jump the BGZF stream to this position.
    bool hasAlignments = false;
    if (!jumpToRegion(inStream, hasAlignments, context, rID, beginPos, endPos, baiIndex))
    {
        std::cerr << "ERROR: Could not jump to " << argv[3] << ":" << argv[4] << "\n";
        return 1;
    }
    if (!hasAlignments)
        return 0;  // No alignments here.

    // Seek linearly to the selected position.
    seqan::BamAlignmentRecord record;
    int numPrinted = 0;
    while (!atEnd(inStream) && numPrinted < num)
    {
        if (readRecord(record, context, inStream, seqan::Bam()) != 0)
        {
            std::cerr << "ERROR: Could not read record from BAM file.\n";
            return 1;
        }

        // If we are on the next reference or at the end already then we stop.
        if (record.rID == -1 || record.rID > rID || record.beginPos >= endPos)
            break;
        // If we are left of the selected position then we skip this record.
        if (record.beginPos < beginPos)
            continue;

        // Otherwise, we print it to the user.
        numPrinted += 1;
        if (write2(std::cout, record, context, seqan::Sam()) != 0)
        {
            std::cerr << "ERROR: Could not write record to stdout.\n";
            return 1;
        }
    }

    return 0;
}

Next Steps

comments powered by Disqus