Simple Genome Alignment

Learning Objective

You will learn how to write a simple genome aligner for large-scale sequences.

Difficulty

Medium

Duration

4h

Prerequisites

Parsing Command Line Arguments, Pairwise Sequence Alignment, Seed Extension, Q-gram Index, Sequence I/O

Introduction

Pairwise sequence alignment is the tool of choice if one wants to compare two biological sequences. However, for genome sized sequences the standard approach is to inefficient as it requires quadratic time and space to compute an alignment. Alternatively, one could compute only a small band but it is unclear whether evolutionary events such as insertions or deletions would shift the band away from the main diagonal. Thus the alignment would not reflect the actual alignment of the sequences. To cope with this issue different methods have been developed that can efficiently deal with large-scale sequences. One of these applications was the LAGAN (Limited Area Global Alignment of Nucleotides) [1.]. It is an iterative algorithm with the following three steps.

  1. Generate local alignments with CHAOS chaining [2.].

  2. Construct a rough global map

  3. Compute final alignment along the global map

The CHAOS chaining finds local alignments depending on three parameters: the seed size (of the q-grams), the distance and the gap parameter. The distance parameter limits the distance between two endpoints of two neighbouring seeds along the diagonal. The band limits the shift of two seeds in vertical and horizontal direction of the matrix. In order to find a good tradeoff between speed and sensitivity the algorithm performs step 1 and 2 recursively until two neighbouring anchors are less than a threshold apart. In the first iteration the parameters are set more restrictive (large seed size, smaller distance and gap size.) For every gap that is bigger than the given threshold, the parameters are set more permissive (smaller seed size and larger distance and gap sizes).

Finally, after the global map has been constructed, the algorithm performs a limited alignment around the anchors and connects the anchors with a standard alignment algorithm.

Main method and ArgumentParser

In this tutorial we want to implement a simplified version of the LAGAN approach using containing only of the first iteration. At first we will look at the basic main method and setup the argument parser.

int main(int const argc, char * argv[])
{
    LaganOption options;
    if (parseCommandLine(options, argc, argv) != ArgumentParser::PARSE_OK)
    {
        std::exit(EXIT_FAILURE);  // terminate the program.
    }
}

We create a LaganOption class where we store the arguments passed to our tool:

struct LaganOption
{
    std::string seq1Filename;
    std::string seq2Filename;
    std::string outFilename;
    unsigned qGramSize;
    unsigned distanceCriteria;
    unsigned gapCriteria;

};

After that, we parse given command line arguments using SeqAn’s ArgumentParser. Firstly, we include the source code for the argument parser by adding the following include directory:

#include <seqan/arg_parse.h>
auto parseCommandLine(LaganOption & options, int const argc, char * argv[])
{
    ArgumentParser parser("lagan");

    // Set short description, version, and date.
    setShortDescription(parser, "Glocal alignment computation.");
    setVersion(parser, "0.1");
    setDate(parser, "September 2017");

    // Define usage line and long description.
    addUsageLine(parser,
                 "[\\fIOPTIONS\\fP] \"\\fISEQUENCE_1\\fP\" \"\\fISEQUENCE_2\\fP\"");
    addDescription(parser,
                  "lagan is a large-scale sequence alignment tool "
                  "using a glocal alignment approach. It first filters for good seeding matches which are "
                  "chained together using CHAOS chaining and finally a global chain is computed and "
                  "the alignment around this chain.");

    // We require two arguments.
    addArgument(parser, seqan2::ArgParseArgument(seqan2::ArgParseArgument::INPUT_FILE, "SEQUENCE_1"));
    addArgument(parser, seqan2::ArgParseArgument(seqan2::ArgParseArgument::INPUT_FILE, "SEQUENCE_2"));

    addOption(parser, seqan2::ArgParseOption("o", "output", "Output file to write the alignment to.",
                                            ArgParseArgument::OUTPUT_FILE, "FILE"));
    setValidValues(parser, "output", ".align");
    setDefaultValue(parser, "output", "out.align");

    addOption(parser, seqan2::ArgParseOption("q", "qgram", "Size of the qGram.", ArgParseArgument::INTEGER, "INT"));
    addDefaultValue(parser, "qgram", "12");

    addOption(parser, seqan2::ArgParseOption("d", "distance", "Distance criteria for CHAOS chaining.",
                                            ArgParseArgument::INTEGER, "INT"));
    addDefaultValue(parser, "distance", "10");

    addOption(parser, seqan2::ArgParseOption("g", "gap", "Gap criteria for CHAOS chaining.",
                                            ArgParseArgument::INTEGER, "INT"));
    addDefaultValue(parser, "gap", "15");

    auto parseResult = parse(parser, argc, argv);

    if (parseResult == ArgumentParser::PARSE_OK)
    {
        getArgumentValue(options.seq1Filename, parser, 0);
        getArgumentValue(options.seq2Filename, parser, 1);
        getOptionValue(options.outFilename, parser, "output");
        getOptionValue(options.qGramSize, parser, "qgram");
        getOptionValue(options.distanceCriteria, parser, "distance");
        getOptionValue(options.gapCriteria, parser, "gap");
    }
    return parseResult;
}

Hint

If you want to learn more about parsing arguments with SeqAn read the Parsing Command Line Arguments tutorial.

With this, we have set up our initial tool. Let’s start implementing the algorithm. To do so, we need to first load the sequences using the class SeqFileIn. We can get access to the data structures and methods by including the following modules:

#include <seqan/seq_io.h>
#include <seqan/stream.h>

Assignment 1

Type

Application

Objective

Use the code template below (click more…) and implement the function loadSequence to load a single sequence file from the specified path. Use the file paths given in the options object and report an error if the files could not be opened.

inline std::pair<Dna5String, CharString>
loadSequence(std::string const & fileName)
{
    // Write your code here.
}
Hint
Solution
inline std::pair<CharString, Dna5String>
loadSequence(std::string const & fileName)
{
    if (empty(fileName))
    {
        std::cerr << "Invalid file name!\n";
        std::exit(EXIT_FAILURE);
    }
    try {
        SeqFileIn seqFile(fileName.c_str());

        Dna5String seq;
        CharString seqId;
        readRecord(seqId, seq, seqFile);
        return {std::move(seqId), std::move(seq)};
    }
    catch (ParseError & error)
    {
        std::cerr << "Error while parsing " << fileName << "!\n";
        std::cerr << error.what() << '\n';
        std::exit(EXIT_FAILURE);
    }
    catch (IOError & error)
    {
        std::cerr << "Could not read " << fileName << "!\n";
        std::cerr << error.what() << '\n';
        std::exit(EXIT_FAILURE);
    }
}

Finally, we can update our main method and use our loadSequence function to load sequence 1 …

    /* load the sequences */
    CharString seq1Id;
    Dna5String seq1;

    std::tie(seq1Id, seq1) = loadSequence(options.seq1Filename);
    std::cout << "Loaded " << seq1Id << " (" << length(seq1) << "bp)\n";

and sequence 2 …

    CharString seq2Id;
    Dna5String seq2;

    std::tie(seq2Id, seq2) = loadSequence(options.seq2Filename);
    std::cout << "Loaded " << seq2Id << " (" << length(seq2) << "bp)\n";

Filtering seeds

After we read the sequences from the command line it is time to write our actual algorithm. A naive algorithm would scan for every q-gram of sequence 2 the complete sequence 1 to find all possible positions. But this approach of course is to slow for large-scale sequences and we need a better strategy. SeqAn provides for this kind of task indexes which can be queried efficiently to find all occurrences of a pattern in an indexed text.

Hint

There are several index implementations and we recommend to read Indices to learn more about the available index data structures.

In this tutorial we are going to use a IndexQGram (see Q-gram Index for more information), which can be included with the following module:

#include <seqan/index.h>

This index type will create a directory with all distinct q-grams and stores the positions of the indexed sequence, where a specific q-gram occurs. It therefor will generate a suffix array, which is sorted by the first `q symbols of every suffix.

The following line declares our q-gram index type:

    using TIndex    = Index<Dna5String, IndexQGram<Shape<Dna5, SimpleShape>, OpenAddressing>>;

The Index type is a template class which requires two type template parameters. The first type template parameter names the sequence type that this index is constructed for. In our case this will be a Dna5String. The second type template parameter is a tag or also known a policy, that defines the type of index to use. In our case we use the IndexQGram policy, which itself can be further specialized through two type template parameters. We need to select the policy used for the q-gram shape and the policy for managing the q-grams. In our example we will need a SimpleShape, which is a variable length ungapped shape. Thus, we are able to change the size of the q-gram at runtime. Note, there also constant length shapes, whose sizes are fixed at compile time. And as a storage policy we use OpenAdressing, which allows us to use longer values for our q-gram. Alternatively, we could leave the parameter unspecified and would therefor enable the default behavior which is direct addressing. Direct addressing, however, will create a lookup table for every possible q-mer (\(\Sigma^{q}\)), which can become quite large for small q already.

In the next step we are going to initialize the index.

    TIndex qGramIndex{seq1};

    // Initialize the shape.
    resize(indexShape(qGramIndex), options.qGramSize);
    hashInit(indexShape(qGramIndex), begin(seq2, Standard()));

First the index is constructed with the sequence it should be created for. Note, that this will not yet create the index. The creation will be triggered in a lazy manner, which means it will be first created when an access to the index is requested. Before the index can be created we need to give the index the size of the q-gram shape. This is done in the second line of the above snippet. The method indexShape returns a reference to the shape stored within the index. We resize this shape to the requested length. The last line initializes the index shape with the first q-gram of the second sequence.

Assignment 2

Type

Application

Objective

Write a loop over the second sequence and write the number of occurrences per q-gram to the console.

Hint
  • Use the function hashNext to update the shape for the current q-gram.

  • Use the function getOccurrences to get a list of hits.

Solution
for (auto seqIter = begin(seq2, Standard());
     seqIter != end(seq2, Standard()) - length(indexShape(qGramIndex)) + 1;
     ++seqIter)
{
    hashNext(indexShape(qGramIndex), seqIter);
    std::cout << getOccurrences(qGramIndex, indexShape(qGramIndex));
}

Local chaining

Now we can stream over the second sequence and can extract all locations of a given q-gram in the indexed sequence. To implement the second step of the LAGAN algorithm, we need to apply local chaining to the filtered q-grams and extend them to longer anchors. SeqAn offers a data structure called SeedSet for this. The following snippet shows the declaration of a SeedSet:

#include <seqan/seeds.h>
    using TSeed     = Seed<ChainedSeed>;
    using TSeedSize = typename Size<TSeed>::Type;
    using TSeedSet  = SeedSet<TSeed>;

The first line declares the type of the seed we want to use. We can define the chaining policy as type template parameter. In our case we use the ChainedSeed policy, which enables us to locally chain the seeds. In addition we define a SeedSet with our ChainedSeed as type template parameter.

Now we create an instance of the seed set and of a scoring scheme, which we will need to score the local chain.

    TSeedSet seedSet;
    Score<int, Simple> scoreScheme{2, -1, -2};

Assignment 3

Type

Application

Objective

Update the loop from assignment 2 and fill the previously created seed set. Use the CHAOS chaining method to chain seeds locally, using the scoring scheme, the gap and distance criteria and the current position of the matching q-grams.

Hint
  • The method addSeed has different overloads for various chaining policies.

  • If the seed could not be combined to any other in the set it must be added as a single seed to set.

Solution
    for (auto seqIter = begin(seq2, Standard());
         seqIter != end(seq2, Standard()) - length(indexShape(qGramIndex)) + 1;
         ++seqIter)
    {
        hashNext(indexShape(qGramIndex), seqIter);
        for (TSeedSize seq1Pos : getOccurrences(qGramIndex, indexShape(qGramIndex)))
        {
            /* filter the seeds */
            if (!addSeed(seedSet,
                         TSeed{seq1Pos, static_cast<TSeedSize>(seqIter - begin(seq2, Standard())), options.qGramSize},
                         options.distanceCriteria, options.gapCriteria, scoreScheme, seq1, seq2, Chaos{}))
            {
                addSeed(seedSet,
                        TSeed{seq1Pos, static_cast<TSeedSize>(seqIter - begin(seq2, Standard())), options.qGramSize},
                        Single{});
            }
        }
    }
    std::cout << "Finished seeding: " << length(seedSet) << " seeds!\n";

Global chaining

After scanning the second sequence and filling the seed set the highest scoring global chain must be computed, which gives a map of good matching anchors.

Assignment 4

Type

Application

Objective

Read the documentation to chainSeedsGlobally and build a global chain over the local anchors stored in the current seed set.

Solution
    /* chain seeds */
    String<TSeed> seedChain;
    chainSeedsGlobally(seedChain, seedSet, SparseChaining());
    std::cout << "Finished chaining: " << length(seedChain) << " seeds!\n";

Final alignment

In the original algorithm, the steps from above would be repeated for the gaps between the anchors selected by the global chaining algorithm. In this tutorial we skip the iterative step and directly compute the final alignment along the global map produced by the chaining algorithm. SeqAn already offers an alignment function for filling the gaps and connecting them with the anchors, which is available in the seeds module.

Assignment 5

Type

Application

Objective

Compute an alignment around the global anchors identified by the chaining algorithm.

Hint
Solution
    /* build alignment */
    Align<Dna5String, ArrayGaps> alignment;
    resize(rows(alignment), 2);
    assignSource(row(alignment, 0), seq1);
    assignSource(row(alignment, 1), seq2);

    int score = bandedChainAlignment(alignment, seedChain, Score<int, Simple>{4, -5, -1, -11}, 15);
    std::cout << "Finished alignment: Score = " << score << "\n";

Finally we can output the alignment in the specified output file.

    /* output alignment */
    std::ofstream outStream;
    outStream.open(options.outFilename.c_str());
    if (!outStream.good())
    {
        std::cerr << "Could not open " << options.outFilename << "!\n";
        std::exit(EXIT_FAILURE);
    }

    outStream << "#seq1: " << seq1Id << "\n";
    outStream << "#seq2: " << seq2Id << "\n";
    outStream << "#score: " << score << "\n";
    outStream << "#alignment\n";
    outStream << alignment;

    std::exit(EXIT_SUCCESS);

Congratulation, you wrote a simple genome aligner!!! You can use the code to add iterative steps to make it more sensitive for large indels.

References

  1. Brudno M. and Morgenstern, B., 2002. Fast and sensitive alignment of large genomic sequences. In Proceeding of the IEEE Computer Society Bioinformatics Conference (CSB).

  2. Brudno, M., Do, C. B., Cooper, G. M., Kim, M. F., Davydov, E., Green, E. D., Sidow, A., Batzoglou, S., and and NISC Comparative Sequencing Program. (2003). LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome research, 13(4), 721-731.