SAM and BAM I/O

Learning Objective

In this tutorial, you will learn how to read and write SAM and BAM files.

Difficulty

Average

Duration

1 h (45 min if you know the SAM format)

Prerequisites

Sequences, File I/O Overview, SAM Format Specification

Overview

Warning

Before you can read/write BAM files (bgzf compressed SAM files) you need to make sure that your program is linked against the zlib library. When you build your application within the SeqAn build infrastructure, the zlib library is automatically located by looking at the standard places for the library. Also have a look at Formatted Files to read more about support of compressed file I/O. If the macro SEQAN_HAS_ZLIB is set to 0 then reading/writing BAM file format is disabled. It is set to 1 if the zlib could be found and reading/writing of compressed files is enabled automatically. You can read Using SeqAn in CMake-based projects, Integration with your own Build System and Installing Dependencies for further notes about using the zlib and libbz2 in your build infrastructure.

This tutorial shows how to read and write SAM and BAM files using the BamFileIn and BamFileOut classes. It starts out with a quick reminder on the structure of SAM (and also BAM) files and continues with how to read and write SAM/BAM files and access the tags of a record.

Important

Note that this tutorial is targeted at readers that already know about the SAM format. If you do not know about the SAM format yet, then this tutorial will be harder for your to understand.

Both SAM and BAM files store multi-read alignments. Storing alignments of longer sequences such as contigs from assemblies is also possible, but less common. Here, we will focus on multi-read alignments.

SAM files are text files, having one record per line. BAM files are just binary, compressed versions of SAM files that have a stricter organization and aim to be more efficiently usable by programs and computers. The nuts and bolts of the formats are described in the SAM Format Specification.

The SAM and BAM related I/O functionality in SeqAn focuses on allowing access to these formats in SeqAn with thin abstractions. The Fragment Store Tutorial shows how to get a more high-level abstraction for multi-read alignments.

Important

SAM/BAM I/O vs. Fragment Store

The Fragment Store provides a high-level view of multi-read alignments. This is very useful if you want to do SNP or small indel detection because you need to access the alignment of the reads around your candidate regions. However, storing the whole alignment of a 120GB BAM file obviously is not a good idea.

The SAM/BAM I/O functionality in SeqAn is meant for sequentially reading through SAM and BAM files. Jumping within BAM files using BAI indices is described in the Using BAM Indices section of this tutorial.

SAM / BAM Format

The following shows an example of a SAM file.

@HD	VN:1.3	SO:coordinate
@SQ	SN:ref	LN:45
@SQ	SN:ref2	LN:40
r001	163	ref	7	30	8M4I4M1D3M	=	37	39	TTAGATAAAGAGGATACTG	*	XX:B:S,12561,2,20,112
r002	0	ref	9	30	1S2I6M1P1I1P1I4M2I	*	0	0	AAAAGATAAGGGATAAA	*
r003	0	ref	9	30	5H6M	*	0	0	AGCTAA	*
r004	0	ref	16	30	6M14N1I5M	*	0	0	ATAGCTCTCAGC	*
r003	16	ref	29	30	6H5M	*	0	0	TAGGC	*
r001	83	ref	37	30	9M	=	7	-39	CAGCGCCAT	*

SAM files are TSV (tab-separated-values) files and begin with an optional header. The header consists of multiple lines, starting with an '@' character, each line is a record. Each record starts with its identifier and is followed by tab-separated tags. Each tag in the header consists of a two-character identifier, followed by ':', followed by the value.

If present, the @HD record must be the first record which specifies the SAM version (tag VN) used in this file and the sort order (SO). The optional @SQ header records give the reference sequence names (tag SN) and lengths (tag LN). There also are other header record types.

The optional header section is followed by the alignment records. The alignment records are again tab-separated. There are 11 mandatory columns.

Col

Field

Type

N/A Value

Description

1

QNAME

string

mandatory

The query/read name.

2

FLAG

int

mandatory

The record’s flag.

3

RNAME

string

*

The reference name.

4

POS

32-bit int

0

1-based position on the reference.

5

MAPQ

8-bit int

255

The mapping quality.

6

CIGAR

string

*

The CIGAR string of the alignment.

7

RNEXT

string

*

The reference of the next mate/segment.

8

PNEXT

string

0

The position of the next mate/seqgment.

9

TLEN

string

0

The observed length of the template.

10

SEQ

string

*

The query/read sequence.

11

QUAL

string

*

The ASCII PHRED-encoded base qualities.

Notes:

  • The SAM standard talks about “queries”. In the context of read mapping, where the format originates, queries are reads.

  • The SAM standard talks about “templates” and “segments”. In the case of paired-end and mate-pair mapping the template consists of two segments, each is one read. The template length is the insert size.

  • Paired-end reads are stored as two alignments records with the same QNAME. The first and second mate are discriminated by the FLAG values.

  • When the FLAG indicates that SEQ is reverse-complemented, then QUAL is reversed.

  • Positions in the SAM file are 1-based. When read into a BamAlignmentRecord (see below), the positions become 0-based.

  • The qualities must be stored as ASCII PHRED-encoded qualities.

  • The query and reference names must not contain whitespace. It is common to trim query and reference ids at the first space.

There are many ambiguities, recommendations, and some special cases in the formats that we do not describe here. We recommend that you follow this tutorial, start working with the SAM and BAM formats and later read the SAM specification “on demand” when you need it.

The 11 mandatory columns are followed by an arbitrary number of optional tags. Tags have a two-character identifier followed by ":${TYPE}:", followed by the tag’s value.

BAM files store their header as plain-text SAM headers. However, they additionally store the name and length information about the reference sequences. This information is mandatory since in BAM, the alignment records only contain the numeric ids of the reference sequences. Thus, the name is stored outside the record in the header.

A First Working Example

The following program reads a file named example.sam and prints its contents back to the user on standard output.

#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.sam");

    // Open input file, BamFileIn can read SAM and BAM files.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileName)))
    {
        std::cerr << "ERROR: Could not open " << bamFileName << std::endl;
        return 1;
    }
    // Open output file, BamFileOut accepts also an ostream and a format tag.
    BamFileOut bamFileOut(context(bamFileIn), std::cout, Sam());

    try
    {
        // Copy header.
        BamHeader header;
        readHeader(header, bamFileIn);
        writeHeader(bamFileOut, header);

        // Copy records.
        BamAlignmentRecord record;
        while (!atEnd(bamFileIn))
        {
            readRecord(record, bamFileIn);
            writeRecord(bamFileOut, record);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}
@HD	VN:1.3	SO:coordinate
@SQ	SN:ref	LN:45
@SQ	SN:ref2	LN:40
r001	163	ref	7	30	8M4I4M1D3M	=	37	39	TTAGATAAAGAGGATACTG	*	XX:B:S,12561,2,20,112
r002	0	ref	9	30	1S2I6M1P1I1P1I4M2I	*	0	0	AAAAGATAAGGGATAAA	*
r003	0	ref	9	30	5H6M	*	0	0	AGCTAA	*
r004	0	ref	16	30	6M14N1I5M	*	0	0	ATAGCTCTCAGC	*
r003	16	ref	29	30	6H5M	*	0	0	TAGGC	*
r001	83	ref	37	30	9M	=	7	-39	CAGCGCCAT	*

We instantiate a BamFileIn object for reading and a BamFileOut object for writing. First, we read the BAM header with readRecord and we write it with writeRecord. Then, we read each record from the input file and print it back on standard output. The alignment records are read into BamAlignmentRecord objects, which we will focus on below.

Assignment 1

Type

Reproduction

Objective

Create a file with the sample SAM content from above and adjust the path "example.sam" to the path to your SAM file (e.g. "/path/to/my_example.sam").

Solution
#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.sam");

    // Open input file, BamFileIn can read SAM and BAM files.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileName)))
    {
        std::cerr << "ERROR: Could not open " << bamFileName << std::endl;
        return 1;
    }
    // Open output file, BamFileOut accepts also an ostream and a format tag.
    BamFileOut bamFileOut(context(bamFileIn), std::cout, Sam());

    try
    {
        // Copy header.
        BamHeader header;
        readHeader(header, bamFileIn);
        writeHeader(bamFileOut, header);

        // Copy records.
        BamAlignmentRecord record;
        while (!atEnd(bamFileIn))
        {
            readRecord(record, bamFileIn);
            writeRecord(bamFileOut, record);
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    return 0;
}

Accessing the Header

Sequence information (i.e. @SQ records) from the BAM header is stored in the BamIOContext. All remaining BAM header information is stored in the class BamHeader.

Important

The header is not mandatory in SAM files and might be missing.

The following program accesses the BamIOContext of its BamFileIn and prints the reference sequence names and lengths present in the BAM header.

#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.sam");

    BamFileIn bamFileIn(toCString(bamFileName));

    BamHeader header;
    readHeader(header, bamFileIn);

    typedef FormattedFileContext<BamFileIn, void>::Type TBamContext;

    TBamContext const & bamContext = context(bamFileIn);

    for (unsigned i = 0; i < length(contigNames(bamContext)); ++i)
        std::cout << contigNames(bamContext)[i] << '\t'
                  << contigLengths(bamContext)[i] << '\n';

    return 0;
}

The output looks like this:

ref	45
ref2	40

Accessing the Records

The class BamAlignmentRecord stores one alignment record of a SAM or BAM file. The class gives an in-memory representation that

  1. is independent of whether it comes from/goes to a SAM or BAM file,

  2. at the same time follows both formats closely,

  3. allows for efficient storage and usage in C++ and

  4. integrates well with the rest of the SeqAn library.

The following definition gives an overview of the available fields, their types, and how they map to the SAM and BAM fields. Note that we use the CigarElement class to store entries in the CIGAR string.

//![BamTagsDict]
using namespace seqan2;

//![BamTagsDict]
class myBamAlignmentRecord
{
public:
    CharString qName;               // QNAME
    uint16_t flag;                  // FLAG
    int32_t rID;                    // REF
    int32_t beginPos;               // POS
    uint8_t mapQ;                   // MAPQ mapping quality, 255 for */invalid
    uint16_t bin;                   // bin for indexing
    String<CigarElement<> > cigar;  // CIGAR string
    int32_t rNextId;                // RNEXT (0-based)
    int32_t pNext;                  // PNEXT (0-based)
    int32_t tLen;                   // TLEN
    CharString seq;                 // SEQ, as in SAM/BAM file.
    CharString qual;                // Quality string as in SAM (Phred).
    CharString tags;                // Tags, raw as in BAM.

    // Constants for marking pos, reference id and length members invalid (== 0/*).
    static int32_t const INVALID_POS = -1;
    static int32_t const INVALID_REFID = -1;
    static int32_t const INVALID_LEN = 0;
};

The static members INVALID_POS, INVALID_REFID, and INVALID_LEN store sentinel values for marking positions, reference sequence ids, and lengths as invalid or N/A.

Tip

A BamAlignmentRecord is linked to a reference sequence by the field rID. The reference sequence information is stored in the BAM header and kept in the BamIOContext. To easily access reference sequence name and and length relative to a given BamAlignmentRecord within a BamFileIn, use functions getContigName and getContigLength.

An important related type is the enum BamFlags that provides constants for bit operations on the flag field. The functions hasFlagAllProper, hasFlagDuplicate, hasFlagFirst, hasFlagLast, hasFlagMultiple, hasFlagNextRC, hasFlagNextUnmapped, hasFlagQCNoPass, hasFlagRC, hasFlagSecondary, hasFlagUnmapped, and hasFlagSupplementary allow for easy reading of flags.

Assignment 2

Counting Records

Type

Review

Objective

Count the number of unmapped reads.

Hints

Use the function hasFlagUnmapped.

Solution
#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.sam");

    // Open input file.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileName)))
    {
        std::cerr << "ERROR: Could not open " << bamFileName << std::endl;
        return 1;
    }

    unsigned numUnmappedReads = 0;

    try
    {
        // Read header.
        BamHeader header;
        readHeader(header, bamFileIn);

        // Read records.
        BamAlignmentRecord record;
        while (!atEnd(bamFileIn))
        {
            readRecord(record, bamFileIn);
            if (hasFlagUnmapped(record))
                numUnmappedReads += 1;
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    std::cout << "Number of unmapped reads: " << numUnmappedReads << "\n";

    return 0;
}
Number of unmapped reads: 0

Accessing the Records’ Tags

You can use the BamTagsDict class to access the the tag list of a record in a dictionary-like fashion. This class also performs the necessary casting when reading and writing tag list entries.

BamTagsDict acts as a wrapper around the raw tags member of a BamAlignmentRecord, which is of type CharString:

using namespace seqan2;

    BamAlignmentRecord record;
    BamTagsDict tagsDict(record.tags);

We can add a tag using the function setTagValue. When setting an already existing tag’s value, its value will be overwritten. Note that in the following, we give the tags value in SAM format because it is easier to read, although they are stored in BAM format internally.

    setTagValue(tagsDict, "NM", 2);
    // => tags: "NM:i:2"
    setTagValue(tagsDict, "NH", 1);
    // => tags: "NM:i:2 NH:i:1"
    setTagValue(tagsDict, "NM", 3);
    // => tags: "NM:i:3 NH:i:1"

The first parameter to setTagValue is the BamTagsDict, the second one is a two-character string with the key, and the third one is the value. Note that the type of tag entry will be taken automatically from the type of the third parameter.

Reading values is slightly more complex because we have to handle the case that the value is not present. First, we get the index of the tag in the tag list.

    unsigned tagIdx = 0;
    if (!findTagKey(tagIdx, tagsDict, "NH"))
        std::cerr << "ERROR: Unknown key!\n";

Then, we can read the value from the BamTagsDict using the function extractTagValue.

    int tagValInt = 0;
    if (!extractTagValue(tagValInt, tagsDict, tagIdx))
        std::cerr << "ERROR: There was an error extracting NH from tags!\n";

The function returns a bool that is true on success and false otherwise. The extraction can fail if the index is out of bounds or the value in the dictionary cannot be cast to the type of the first parameter.

The value in the tags dictionary will be casted to the type of the first parameter of extractTagValue:

    short tagValShort = 0;
    extractTagValue(tagValShort, tagsDict, tagIdx);

Assignment 3

Reading Tags

Type

Review

Objective

Modify the solution of Assignment 2 to count the number of records having the "XX" tag.

Solution
#include <seqan/bam_io.h>

using namespace seqan2;

int main()
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.sam");

    // Open input file.
    BamFileIn bamFileIn;
    if (!open(bamFileIn, toCString(bamFileName)))
    {
        std::cerr << "ERROR: Could not open " << bamFileName << std::endl;
        return 1;
    }

    unsigned numXXtags = 0;

    try
    {
        // Read header.
        BamHeader header;
        readHeader(header, bamFileIn);

        // Rear records.
        BamAlignmentRecord record;
        while (!atEnd(bamFileIn))
        {
            readRecord(record, bamFileIn);
            BamTagsDict tagsDict(record.tags);

            unsigned tagIdx = 0;
            if (findTagKey(tagIdx, tagsDict, "XX"))
                numXXtags += 1;
        }
    }
    catch (Exception const & e)
    {
        std::cout << "ERROR: " << e.what() << std::endl;
        return 1;
    }

    std::cout << "Number of records with the XX tag: " << numXXtags << "\n";

    return 0;
}
Number of records with the XX tag: 1

Using BAM Indices

SeqAn also contains features for reading BAM indices with the format .bai. These indices can be built using the samtools index command. In the near future we plan to support building the bam index with SeqAn as well.

You can read indices into a BaiBamIndex object with the function open. Then, you can use the function jumpToRegion to jump to a specific position within BAM files. After jumping, the next record to be read is before the given region. Therefore, you have to skip records until you access the one you are looking for.

#include <seqan/sequence.h>
#include <seqan/bam_io.h>

using namespace seqan2;

int main(int, char const **)
{
    CharString bamFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.bam");
    CharString baiFileName = getAbsolutePath("demos/tutorial/sam_and_bam_io/example.bam.bai");
    CharString rName = "ref";

    // Open BamFileIn for reading.
    BamFileIn inFile;
    if (!open(inFile, toCString(bamFileName)))
    {
        std::cerr << "ERROR: Could not open " << bamFileName << " for reading.\n";
        return 1;
    }

    // Read BAI index.
    BamIndex<Bai> baiIndex;
    if (!open(baiIndex, toCString(baiFileName)))
    {
        std::cerr << "ERROR: Could not read BAI index file " << baiFileName << "\n";
        return 1;
    }

    // Read header.
    BamHeader header;
    readHeader(header, inFile);

    // Translate from reference name to rID.
    int rID = 0;
    if (!getIdByName(rID, contigNamesCache(context(inFile)), rName))
    {
        std::cerr << "ERROR: Reference sequence named " << rName << " not known.\n";
        return 1;
    }

    // Translate BEGIN and END arguments to number, 1-based to 0-based.
    int beginPos = 9, endPos = 30;

    // 1-based to 0-based.
    beginPos -= 1;
    endPos -= 1;

    // Translate number of elements to print to number.
    int num = 3;

    // Jump the BGZF stream to this position.
    bool hasAlignments = false;
    if (!jumpToRegion(inFile, hasAlignments, rID, beginPos, endPos, baiIndex))
    {
        std::cerr << "ERROR: Could not jump to " << beginPos << ":" << endPos << "\n";
        return 1;
    }
    if (!hasAlignments)
        return 0;  // No alignments here.

    // Seek linearly to the selected position.
    BamAlignmentRecord record;
    int numPrinted = 0;
    BamFileOut out(inFile, std::cout, Sam());

    while (!atEnd(inFile) && numPrinted < num)
    {
        readRecord(record, inFile);

        // If we are on the next reference or at the end already then we stop.
        if (record.rID == -1 || record.rID > rID || record.beginPos >= endPos)
            break;
        // If we are left of the selected position then we skip this record.
        if (record.beginPos < beginPos)
            continue;

        // Otherwise, we print it to the user.
        numPrinted++;
        writeRecord(out, record);
    }

    return 0;
}
r002	0	ref	9	30	1S2I6M1P1I1P1I4M2I	*	0	0	AAAAGATAAGGGATAAA	*
r003	0	ref	9	30	5H6M	*	0	0	AGCTAA	*
r004	0	ref	16	30	6M14N1I5M	*	0	0	ATAGCTCTCAGC	*

Next Steps