Indexed FASTA I/O

Learning Objective ::
In this tutorial, you will learn how to use a FASTA Index file (.fai) for indexed random-access to FASTA files. This is useful for retrieving regions (e.g. chr1:123-10004) or single sequences (e.g. chr1) from FASTA files quickly.
Difficulty
Average
Duration
30 min
Prerequisites
Sequences

The idea of FASTA index files (FAI) comes from the samtools program by Heng Li. The program provides a command samtools faidx for rapidly accessing parts of a large FASTA file (e.g. querying for the first chromosome by the identifier “chr1” or querying for 900 characters starting from character 100 (1-based) by chr1:100-1,000). To do this, the program creates an index file that contains one entry for each sequence. If the FASTA file is named path/sequence.fasta, the index file is usually named path/sequence.fasta.fai.

Using such index files, it is possible to rapidly read parts of the given sequence file. The module <seqan/seq_io.h> allows to create and read such .fai index files and exposes an API to read parts randomly of FASTA file.

Note

FASTA/FASTQ Meta Data and Sequence Ids

FASTA and FASTQ files have one meta data record for each sequence. This usually contains the sequence name but sometimes a lot of additional information is stored. There is no consensus for the meta data.

It is common, however, to store the identifier (id) of the sequence at the beginning of the meta data field before the first space. The id is unique to the whole file and often identifies the associate sequence uniquely in a data base (see section Sequence Identifiers on the Wikipedia FASTA format page).

While not documented anywhere explicitely, only the characters up to the first space are used as the identifiers by widely used tools such as BWA. Only the identifier is carried over into files generated from the input files (BWA uses the sequence id from the genome FASTA to identify the contig/chromosome and the read id as the read name in the SAM output).

How Does It Work?

There are two requirements that a FASTA file has to fulfill to work with the FAI scheme. For each sequence in the FASTA file, the number of characters stored per line has to be the same and the number of bytes per line has to be the same. The first restriction speaks for itself, the second restriction means that the same line ending character has to be used and no line should contain any additional spaces.

The index file then stores records of sequence identifier, length, the offset of the first sequence character in the file, the number of characters per line, and the number of bytes per line. With this information, we can easily compute the byte offset of the i-th character of a sequence in a file by looking at its index record. We skip to this byte offset in the file and from there, we can read the necessary sequence characters.

Building the Index

The class FaiIndex allows for building and loading FAI indices. fo build such an index, we use the function build of the class FaiIndex. The first parameter is the FaiIndex object, the second is the path to the FASTA file. The function returns an integer indicating the result (as usual, 0 for no errors, a value different from 0 indicating an error).

#include <seqan/seq_io.h>

seqan::FaiIndex faiIndex;
int res = build(faiIndex, "path/to/file.fasta");
if (res != 0)
    std::cerr << "ERROR: Could not build the index!\n";

There is an alternative variant of this function where you can pass the path to the FAI file that is to be built as third parameter. The FAI file name will be stored in the FaiIndex.

#include <seqan/seq_io.h>

seqan::FaiIndex faiIndex;
int res = build(faiIndex, "path/to/file.fasta", "another/path/file.fasta.fai");
if (res != 0)
    std::cerr << "ERROR: Could not build the index!\n";

We can write out the index after building it using the function write:

#include <seqan/seq_io.h>

seqan::FaiIndex faiIndex;
// ... index building here ...

int res = write(faiIndex, "path/to/file.fasta.fai");
if (res != 0)
    std::cerr << "ERROR: Could not write the index to file!\n";

Assignment 1

Building a FAI index

Type
Application
Objective
Write a small program build_fai that takes one parameter from the command line, the path to a FASTA file. The program should then build a FAI index and write it out.
Hints
Using the two-parameter variant of build is good enough.
Solution
#include <iostream>
#include <seqan/sequence.h>
#include <seqan/seq_io.h>

int main(int argc, char const ** argv)
{
    if (argc != 2)
    {
        std::cerr << "USAGE: build_fai FILE.fa\n";
        return 1;
    }

    seqan::FaiIndex faiIndex;
    if (build(faiIndex, argv[1]) != 0)
    {
        std::cerr << "ERROR: Could not build FAI index for file " << argv[1] << ".\n";
        return 1;
    }

    seqan::CharString faiFilename = argv[1];
    append(faiFilename, ".fai");

    if (write(faiIndex, toCString(faiFilename)) != 0)
    {
        std::cerr << "ERROR: Could not write the index to file!\n";
        return 1;
    }

    std::cout << "Index file " << faiFilename << " was successfully created.\n";
    return 0;
}

Using the Index

To load a FAI file, we use the function read: We pass the FaiIndex object as the first and the path to the FASTA file as the second parameter. The function returns an int indicating success (value 0) or failure (non-0 value).

#include <seqan/seq_io.h>

seqan::FaiIndex faiIndex;
int res = read(faiIndex, "path/to/file.fasta");
if (res != 0)
    std::cerr << "ERROR: Could not read FAI index path/to/file.fasta.fai\n";

In the example above, the FAI file "path/to/file.fasta.fai" would be loaded. Optionally, we can specify an extra path to the FAI file:

#include <seqan/seq_io.h>

seqan::FaiIndex faiIndex;
int res = read(faiIndex, "path/to/file.fasta", "path/to/index.fai");
if (res != 0)
    std::cerr << "ERROR: Could not load FAI index path/to/index.fai\n";

After loading the index, we can then use the index to map a sequence id to its (zero-based) position (a position i meaning that it is the i-th sequence) in the FASTA file using getIdByName. The function gets the FaiIndex to use, the id of the sequence, and an unsigned position as parameters. It returns a bool indicating whether the mapping was successful (true on success, false on failure).

unsigned idx = 0;
if (getIdByName(faiIndex, "chr1", idx))
    std::cerr << "ERROR: FAI index has no entry for chr1.\n";

Once we have the index for the sequence in the FASTA file, we can then query the FaiIndex for the length of the sequence using sequenceLength, get the whole sequence using readSequence, or get just a part of the sequence using readRegion.

unsigned seqLength = sequenceLength(faiIndex, idx);

// Load first 1000 characters of chr1.
seqan::CharString seqChr1Prefix;
if (readRegion(seqChr1Prefix, faiIndex, idx, 0, 1000) != 0)
    std::cerr << "ERROR: Could not load chr1.\n";

// Load all of chr1.
seqan::CharString seqChr1;
if (readSequence(seqChr1, faiIndex, idx) != 0)
    std::cerr << "ERROR: Could not load chr1.\n";

The sequence length can be determined by only looking at the index. When loading the sequence or a sequence infix, only the relevant part of the file will be touched. Thus, only the minimal amount of memory, time, and disk I/O is used.

Assignment 2

Using the FAI index

Type
Application
Objective
Write a small program query_fai that takes four parameters from the command line: A path to a FASTA file, the id of the sequence, a begin and an end position. The program should then read the given infix of the given sequence from the file and print it to stdout.
Hint
Use the function lexicalCast2 to convert strings of numbers into integers.
Solution

The program appears to be very long, but most is error handling, as usual with robust I/O code.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/seq_io.h>
#include <seqan/stream.h>

int main(int argc, char const ** argv)
{
    if (argc != 5)
    {
        std::cerr << "USAGE: build_fai FILE.fa SEQ BEGIN END\n";
        return 1;
    }

    // Try to load index and create on the fly if necessary.
    seqan::FaiIndex faiIndex;
    if (seqan::read(faiIndex, argv[1]) != 0)
    {
        if (build(faiIndex, argv[1]) != 0)
        {
            std::cerr << "ERROR: Index could not be loaded or built.\n";
            return 1;
        }
        if (write(faiIndex) != 0)  // Name is stored from when reading.
        {
            std::cerr << "ERROR: Index could not be written do disk.\n";
            return 1;
        }
    }

    // Translate sequence name to index.
    unsigned idx = 0;
    if (!getIdByName(faiIndex, argv[2], idx))
    {
        std::cerr << "ERROR: Index does not know about sequence " << argv[2] << "\n";
        return 1;
    }

    // Convert positions into integers.
    unsigned beginPos = 0, endPos = 0;
    if (!seqan::lexicalCast2(beginPos, argv[3]))
    {
        std::cerr << "ERROR: Cannot cast " << argv[3] << " into an unsigned.\n";
        return 1;
    }
    if (!seqan::lexicalCast2(endPos, argv[4]))
    {
        std::cerr << "ERROR: Cannot cast " << argv[4] << " into an unsigned.\n";
        return 1;
    }

    // Make sure begin and end pos are on the sequence and begin <= end.
    if (beginPos > sequenceLength(faiIndex, idx))
        beginPos = sequenceLength(faiIndex, idx);
    if (endPos > sequenceLength(faiIndex, idx))
        endPos = sequenceLength(faiIndex, idx);
    if (beginPos > endPos)
        endPos = beginPos;

    // Finally, get infix of sequence.
    seqan::Dna5String sequenceInfix;
    if (readRegion(sequenceInfix, faiIndex, idx, beginPos, endPos) != 0)
    {
        std::cerr << "ERROR: Could not load infix.\n";
        return 1;
    }

    std::cout << sequenceInfix << "\n";

    return 0;
}

Next Steps

comments powered by Disqus