SAM and BAM I/O¶
- Learning Objective
- This tutorial will explain how to use the lower-level API for reading from and writing to SAM and BAM files.
- Difficulty
- Advanced
- Duration
- 20 min
- Prerequisites
- Basic SAM and BAM I/O, I/O Overview
After reading the :ref;`tutorial-basic-sam-bam-io` tutorial, you should have a good understanding of the representation of SAM/BAM records in SeqAn library. This tutorial will explain how the lower-level implementation of SAM and BAM I/O works and which design decisions lead to this. You will also learn about the classes BamIOContext and NameStoreCache. This article also explains how to write template-based code that can read both SAM files from RecordReaders and BAM files from compressed streams. Furthermore, you will learn how to use BAI indices for jumping in BAM files.
This tutorial will mostly work by showing full source code examples and explaining them. Do not feel intimidated by its length, it is inflated by the amount of embedded source code.
The usage of SAM and BAM tags is explained in the Basic SAM and BAM I/O tutorial and will not be discussed here further.
Important
Note that this tutorial is targeted at readers that already know about the SAM format. If you do not know about the SAM format yet then this tutorial will be harder for your to understand. Teaching the ins and outs of SAM is out of the scope of such a tutorial.
BamAlignmentRecord Design Overview¶
Besides the fact that BAM is a compressed, binary format, the main difference to the plain-text format SAM is that BAM has additional header information. The binary header of BAM files contains a plain-text SAM header with the same information as a SAM file. However, BAM files contain an additional binary header section that stores all reference sequence names and their length.
This is equivalent to the @SQ entries in a SAM header but the @SQ entries are optional in both SAM and the SAM header in a BAM file. Also, the @SQ entries in the SAM header of a BAM file can be out of sync with the binary reference information entries. The authorative source for this information (most prominently the names and the order of the reference sequences). BAM records then only contain an integer rID for each reference that referes to the rID-th reference sequence in the binary header.
Because SAM and BAM files store the same information, there is only one record type to store it in SeqAn. The record type is the class BamAlignmentRecord that was already introduced in the Basic SAM and BAM I/O tutorial tutorial. The structure of this class is designed after the BAM file so conversion from the BAM file on disk to the in-memory representation is fast. For example, the tags are stored as a binary string, the same as in a BAM file. When reading SAM, the tags are converted into the BAM representation whereas BAM tags can be copied verbatim. There is one important deviation, though: The qualities are stored using a phred-style ASCII encoding for consistencies with the rest of the SeqAn library.
As a reminder, here is the synopsis of the BamAlignmentRecord class again.
namespace seqan {
class BamAlignmentRecord
{
public:
CharString qName; // QNAME
__uint16 flag; // FLAG
__int32 rID; // REF
__int32 beginPos; // POS
__uint8 mapQ; // MAPQ mapping quality, 255 for */invalid
__uint16 bin; // bin for indexing
String<CigarElement<> > cigar; // CIGAR string
__int32 rNextId; // RNEXT (0-based)
__int32 pNext; // PNEXT (0-based)
__int32 tLen; // TLEN
CharString seq; // SEQ, as in SAM/BAM file.
CharString qual; // Quality string as in SAM (Phred).
CharString tags; // Tags, raw as in BAM.
// Constants for marking pos, reference id and length members invalid (== */0).
static __int32 const INVALID_POS = -1;
static __int32 const INVALID_REFID = -1;
static __int32 const INVALID_LEN = 0;
};
} // namespace seqan
Name Stores and Name Store Caches¶
In order to translate from numeric reference id (rID) to text reference sequence name, the names have to be stored in a StringSet which we will call a name store. For being able to translate back from a textual name (stored as a CharString, for example), we need a NameStoreCache that allows the fast lookup of numeric ids from textual names. Both the name store and the cache are then wrapped by a BamIOContext. This context object is used to prescient from the differences of SAM and BAM files when reading and writing.
For example, when writing out a BamAlignmentRecord to a SAM file, we need to look up the name of the reference from its numeric id to write it out as a string. When reading a record from a SAM file, we have to translate its name string into a numeric id. Even more, if the sequence is not know yet (remember, the @SQ headers are optional), we have to append it to the name store and register it with the cache.
Here is a minimal example of setting up a name store, name store cache, and a BamIOContext. We will build upon this example below when showing how to read and write SAM and BAM files.
#include <seqan/bam_io.h>
#include <seqan/sequence.h>
int main()
{
// Create some shortcuts to the types that we will use.
typedef seqan::StringSet<seqan::CharString> TNameStore;
typedef seqan::NameStoreCache<TNameStore> TNameStoreCache;
typedef seqan::BamIOContext<TNameStore> TBamIOContext;
// Setup the variables.
TNameStore nameStore;
TNameStoreCache nameStoreCache(nameStore);
TBamIOContext bamIOContext(nameStore, nameStoreCache);
return 0;
}
BGZF Files / Stream¶
By default, the BAM format is compressed using the BGZF compression scheme (originating from Tabix, but also described in the SAM standard). You can read BGZF files with tools for processing .gz files, e.g. gzip and zcat.
However, there is a big difference between files written in BGZF and .gz files. BGZF is a sequence of compressed blocks. If the offset of a block is known, it can be decompressed independent of the rest of the file. This information can then be used together with indices.
SeqAn provides the BGZF Stream class in the module <seqan/stream.h> to access such streams. Here is an example for using a Stream for reading:
#include <seqan/stream.h>
int main()
{
seqan::Stream<seqan::Bgzf> stream;
if (!open(stream, "filename.bam", "r"))
return 1; // Could not open for reading.
return 0;
}
Using a BGZF Stream for writing:
#include <seqan/stream.h>
int main()
{
seqan::Stream<seqan::Bgzf> stream;
if (!open(stream, "filename.bam", "w"))
return 1; // Could not open for writing.
return 0;
}
Assignment 1¶
Uncompressing a BGZF file.
- Type
- Review
- Objective
- Write a program that reads in a BGZF compressed file using BGZF Stream and writes the uncompressed data out again.
- Hint
- Use the function streamReadBlock and streamWriteBlock for reading and writing data into and from buffers.
- Solution
#include <iostream> #include <fstream> #include <seqan/sequence.h> #include <seqan/stream.h> int main(int argc, char const ** argv) { if (argc != 3) { std::cerr << "USAGE: " << argv[0] << " IN.bam OUT.bin\n"; return 1; } // Open BGZF file for reading. seqan::Stream<seqan::Bgzf> inStream; if (!open(inStream, argv[1], "r")) { std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n"; return 1; } // Open std::fstream for writing. std::fstream outStream(argv[2], std::ios::binary | std::ios::out); if (!outStream.good()) { std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n"; return 1; } // Copy over data. seqan::CharString buffer; resize(buffer, 1000); while (!seqan::atEnd(inStream) && seqan::streamError(inStream) == 0) { int num = seqan::streamReadBlock(&buffer[0], inStream, length(buffer)); seqan::streamWriteBlock(outStream, &buffer[0], num); } return 0; }
Reading and Writing Headers¶
The data structure BamHeader has already been described in the Basic SAM and BAM I/O so we will not repeat that here. Instead, we will focus on how to read headers from SAM and BAM files.
Here is a minimal example of reading and writing a header from and to a SAM file. The example contains the creation of a BamIOContext, the necessary RecordReader and full error handling.
#include <iostream>
#include <fstream>
#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>
int main(int argc, char const ** argv)
{
if (argc != 3)
{
std::cerr << "USAGE: " << argv[0] << " IN.sam OUT.sam\n";
return 1;
}
// Open std::fstream for reading.
std::fstream inStream(argv[1], std::ios::binary | std::ios::in);
if (!inStream.good())
{
std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
return 1;
}
// Open std::fstream for writing.
std::fstream outStream(argv[2], std::ios::binary | std::ios::out);
if (!outStream.good())
{
std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
return 1;
}
// Setup RecordReader.
seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(inStream);
// Setup name store, cache, and BAM I/O context.
typedef seqan::StringSet<seqan::CharString> TNameStore;
typedef seqan::NameStoreCache<TNameStore> TNameStoreCache;
typedef seqan::BamIOContext<TNameStore> TBamIOContext;
TNameStore nameStore;
TNameStoreCache nameStoreCache(nameStore);
TBamIOContext context(nameStore, nameStoreCache);
// Read header.
seqan::BamHeader header;
if (readRecord(header, context, reader, seqan::Sam()) != 0)
{
std::cerr << "ERROR: Could not read header from SAM file " << argv[1] << "\n";
return 1;
}
// Write out header again.
if (write2(outStream, header, context, seqan::Sam()) != 0)
{
std::cerr << "ERROR: Could not write header to SAM file " << argv[2] << "\n";
return 1;
}
return 0;
}
Reading and writing headers from and to BAM files is simple. We simply replace seqan::Sam() by seqan::Bam() and use BGZF Stream objects instead of uncompressed streams. Also, we do not need a RecordReader any more.
#include <iostream>
#include <fstream>
#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>
int main(int argc, char const ** argv)
{
if (argc != 3)
{
std::cerr << "USAGE: " << argv[0] << " IN.bam OUT.bam\n";
return 1;
}
// Open BGZF Stream for reading.
seqan::Stream<seqan::Bgzf> inStream;
if (!open(inStream, argv[1], "r"))
{
std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
return 1;
}
// Open BGZF Stream for writing.
seqan::Stream<seqan::Bgzf> outStream;
if (!open(outStream, argv[2], "w"))
{
std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
return 1;
}
// Setup name store, cache, and BAM I/O context.
typedef seqan::StringSet<seqan::CharString> TNameStore;
typedef seqan::NameStoreCache<TNameStore> TNameStoreCache;
typedef seqan::BamIOContext<TNameStore> TBamIOContext;
TNameStore nameStore;
TNameStoreCache nameStoreCache(nameStore);
TBamIOContext context(nameStore, nameStoreCache);
// Read header.
seqan::BamHeader header;
if (readRecord(header, context, inStream, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
return 1;
}
// Write out header again.
if (write2(outStream, header, context, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not write header to BAM file " << argv[2] << "\n";
return 1;
}
return 0;
}
Note that except for the types, the signatures of the functions readRecord() and write() are the same. Thus, we can make copying of the header a template function copyHeader(). This function can now be used for both BAM and SAM.
#include <iostream>
#include <fstream>
#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>
template <typename TOutStream, typename TInStreamOrReader, typename TTag>
int copyHeader(TOutStream & outStream,
TInStreamOrReader & inStreamOrReader,
seqan::BamIOContext<seqan::StringSet<seqan::CharString> > & context,
TTag const & tag)
{
// Read header.
seqan::BamHeader header;
if (readRecord(header, context, inStreamOrReader, tag) != 0)
{
std::cerr << "ERROR: Could not read header\n";
return 1;
}
// Write out header again.
if (write2(outStream, header, context, tag) != 0)
{
std::cerr << "ERROR: Could not write header.\n";
return 1;
}
return 0;
}
int main(int argc, char const ** argv)
{
if (argc != 3)
{
std::cerr << "USAGE: " << argv[0] << " IN.sam OUT.sam\n";
return 1;
}
// Streams for SAM.
std::fstream inStreamSam, outStreamSam;
// Streams for BAM.
seqan::Stream<seqan::Bgzf> inStreamBam, outStreamBam;
if (seqan::endsWith(seqan::CharString(argv[1]), ".sam"))
{
inStreamSam.open(argv[1], std::ios::binary | std::ios::in);
if (!inStreamSam.good())
{
std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
return 1;
}
outStreamSam.open(argv[2], std::ios::binary | std::ios::out);
if (!outStreamSam.good())
{
std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
return 1;
}
}
else
{
// Open BGZF Stream for reading.
if (!open(inStreamBam, argv[1], "r"))
{
std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
return 1;
}
// Open BGZF Stream for writing.
if (!open(outStreamBam, argv[2], "w"))
{
std::cerr << "ERROR: Could not open " << argv[2] << " for writing.\n";
return 1;
}
}
// Setup name store, cache, and BAM I/O context.
typedef seqan::StringSet<seqan::CharString> TNameStore;
typedef seqan::NameStoreCache<TNameStore> TNameStoreCache;
typedef seqan::BamIOContext<TNameStore> TBamIOContext;
TNameStore nameStore;
TNameStoreCache nameStoreCache(nameStore);
TBamIOContext context(nameStore, nameStoreCache);
if (endsWith(seqan::CharString(argv[1]), ".sam"))
{
// Stream must be open before constructing reader, thus we define it here.
seqan::RecordReader<std::fstream, seqan::SinglePass<> > reader(inStreamSam);
return copyHeader(outStreamSam, reader, context, seqan::Sam());
}
else
{
return copyHeader(outStreamBam, inStreamBam, context, seqan::Bam());
}
}
Assignment 2¶
Converting BAM header to SAM.
- Type
- Application
- Objective
- Write a program that reads the header from a BAM file and writes it out as a SAM header to std::cout.
- Solution
#include <iostream> #include <fstream> #include <seqan/bam_io.h> #include <seqan/sequence.h> #include <seqan/stream.h> int main(int argc, char const ** argv) { if (argc != 2) { std::cerr << "USAGE: " << argv[0] << " IN.bam\n"; return 1; } // Open BGZF Stream for reading. seqan::Stream<seqan::Bgzf> inStream; if (!open(inStream, argv[1], "r")) { std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n"; return 1; } // Setup name store, cache, and BAM I/O context. typedef seqan::StringSet<seqan::CharString> TNameStore; typedef seqan::NameStoreCache<TNameStore> TNameStoreCache; typedef seqan::BamIOContext<TNameStore> TBamIOContext; TNameStore nameStore; TNameStoreCache nameStoreCache(nameStore); TBamIOContext context(nameStore, nameStoreCache); // Read header. seqan::BamHeader header; if (readRecord(header, context, inStream, seqan::Bam()) != 0) { std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n"; return 1; } // Write out header again. if (write2(std::cout, header, context, seqan::Sam()) != 0) { std::cerr << "ERROR: Could not write header to stdout.\n"; return 1; } return 0; }
Reading and Writing Alignment Records¶
BamAlignmentRecords can be read and written the same way as BamHeader objects. Here is an example for reading and writing of alignment records from SAM and to files.
// Copy over records.
seqan::BamAlignmentRecord record;
while (atEnd(reader))
{
if (readRecord(record, context, reader, seqan::Sam()) != 0)
{
std::cerr << "ERROR: Could not read record from SAM file " << argv[1] << "\n";
return 1;
}
if (write2(outStream, record, context, seqan::Sam()) != 0)
{
std::cerr << "ERROR: Could not write record to SAM file " << argv[2] << "\n";
return 1;
}
}
And here is the modified version for the BAM format. The only changes are that
- we do not read from a RecordReader but a BGZF Stream instead,
- we need to write to a BGZF Stream, and
- we need to use the tag seqan::Bam() instead of seqan::Sam().
// Copy over records.
seqan::BamAlignmentRecord record;
while (atEnd(reader))
{
if (readRecord(record, context, inStream, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not read record from BAM file " << argv[1] << "\n";
return 1;
}
if (write2(outStream, record, context, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not write record to BAM file " << argv[2] << "\n";
return 1;
}
}
Assignment 3¶
Converting whole BAM files to SAM.
- Type
- Application
- Objective
- Modify the solution of Assignment 2 to not only convert the header to BAM but also the alignment records.
- Solution
#include <iostream> #include <fstream> #include <seqan/bam_io.h> #include <seqan/sequence.h> #include <seqan/stream.h> int main(int argc, char const ** argv) { if (argc != 2) { std::cerr << "USAGE: " << argv[0] << " IN.bam\n"; return 1; } // Open BGZF Stream for reading. seqan::Stream<seqan::Bgzf> inStream; if (!open(inStream, argv[1], "r")) { std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n"; return 1; } // Setup name store, cache, and BAM I/O context. typedef seqan::StringSet<seqan::CharString> TNameStore; typedef seqan::NameStoreCache<TNameStore> TNameStoreCache; typedef seqan::BamIOContext<TNameStore> TBamIOContext; TNameStore nameStore; TNameStoreCache nameStoreCache(nameStore); TBamIOContext context(nameStore, nameStoreCache); // Read header. seqan::BamHeader header; if (readRecord(header, context, inStream, seqan::Bam()) != 0) { std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n"; return 1; } // Write out header again. if (write2(std::cout, header, context, seqan::Sam()) != 0) { std::cerr << "ERROR: Could not write header to stdout.\n"; return 1; } // Copy over the alignment records. seqan::BamAlignmentRecord record; while (!atEnd(inStream)) { if (readRecord(record, context, inStream, seqan::Bam()) != 0) { std::cerr << "ERROR: Could not read record from BAM File " << argv[1] << "\n"; return 1; } if (write2(std::cout, record, context, seqan::Sam()) != 0) { std::cerr << "ERROR: Could not write record to stdout.\n"; return 1; } } return 0; }
Using Indices¶
SeqAn also contains support for reading BAM indices with the format .bai. These indices can be built using the samtools index command.
You can read such indices into a BAI BamIndex object with the function read. Then, you can use the function seqan:”Function.BamIndex#jumpToRegion” to jump within BAM files.
After jumping, the next record that is read is before at the given position. This means, you have to manually read as many records up until the one you are looking for is found. The reason for this is that the function jumpToRegion would have to read until it finds the first record that is right from or at the given position. This would lead to this record being lost.
#include <iostream>
#include <fstream>
#include <seqan/bam_io.h>
#include <seqan/sequence.h>
#include <seqan/stream.h>
int main(int argc, char const ** argv)
{
if (argc != 7)
{
std::cerr << "USAGE: " << argv[0] << " IN.bam IN.bam.bai REF BEGIN END COUNT\n";
return 1;
}
// Open BGZF Stream for reading.
seqan::Stream<seqan::Bgzf> inStream;
if (!open(inStream, argv[1], "r"))
{
std::cerr << "ERROR: Could not open " << argv[1] << " for reading.\n";
return 1;
}
// Read BAI index.
seqan::BamIndex<seqan::Bai> baiIndex;
if (read(baiIndex, argv[2]) != 0)
{
std::cerr << "ERROR: Could not read BAI index file " << argv[2] << "\n";
return 1;
}
// Setup name store, cache, and BAM I/O context.
typedef seqan::StringSet<seqan::CharString> TNameStore;
typedef seqan::NameStoreCache<TNameStore> TNameStoreCache;
typedef seqan::BamIOContext<TNameStore> TBamIOContext;
TNameStore nameStore;
TNameStoreCache nameStoreCache(nameStore);
TBamIOContext context(nameStore, nameStoreCache);
// Read header.
seqan::BamHeader header;
if (readRecord(header, context, inStream, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not read header from BAM file " << argv[1] << "\n";
return 1;
}
// Translate from reference name to rID.
int rID = 0;
if (!getIdByName(nameStore, argv[3], rID, nameStoreCache))
{
std::cerr << "ERROR: Reference sequence named " << argv[3] << " not known.\n";
return 1;
}
// Translate BEGIN and END arguments to number, 1-based to 0-based.
int beginPos = 0, endPos = 0;
if (!seqan::lexicalCast2(beginPos, argv[4]) || beginPos <= 0)
{
std::cerr << "ERROR: Begin position " << argv[4] << " is invalid.\n";
return 1;
}
beginPos -= 1; // 1-based to 0-based.
if (!seqan::lexicalCast2(endPos, argv[5]) || endPos <= 0)
{
std::cerr << "ERROR: End position " << argv[5] << " is invalid.\n";
return 1;
}
endPos -= 1; // 1-based to 0-based.
// Translate number of elements to print to number.
int num = 0;
if (!seqan::lexicalCast2(num, argv[6]))
{
std::cerr << "ERROR: Count " << argv[6] << " is invalid.\n";
return 1;
}
// Jump the BGZF stream to this position.
bool hasAlignments = false;
if (!jumpToRegion(inStream, hasAlignments, context, rID, beginPos, endPos, baiIndex))
{
std::cerr << "ERROR: Could not jump to " << argv[3] << ":" << argv[4] << "\n";
return 1;
}
if (!hasAlignments)
return 0; // No alignments here.
// Seek linearly to the selected position.
seqan::BamAlignmentRecord record;
int numPrinted = 0;
while (!atEnd(inStream) && numPrinted < num)
{
if (readRecord(record, context, inStream, seqan::Bam()) != 0)
{
std::cerr << "ERROR: Could not read record from BAM file.\n";
return 1;
}
// If we are on the next reference or at the end already then we stop.
if (record.rID == -1 || record.rID > rID || record.beginPos >= endPos)
break;
// If we are left of the selected position then we skip this record.
if (record.beginPos < beginPos)
continue;
// Otherwise, we print it to the user.
numPrinted += 1;
if (write2(std::cout, record, context, seqan::Sam()) != 0)
{
std::cerr << "ERROR: Could not write record to stdout.\n";
return 1;
}
}
return 0;
}
Next Steps¶
- Read the SAM Specification (pdf).
- Continue with the Tutorial.