Seed Extension

Learning Objective
You will learn how to do seed-and-extend with SeqAn, how to do local and global chaining of seeds. Finally, you will learn how to create a banded alignment around a seed chain.
Difficulty
Average
Duration
2 h
Prerequisites
Sequences, Seeds

Overview

Many efficient heuristics to find high scoring, but inexact, local alignments between two sequences start with small exact (or at least highly similar) segments, so called seeds, and extend or combine them to get larger highly similar regions. Probably the most prominent tool of this kind is BLAST [AGM+90], but there are many other examples like FASTA [Pea90] or LAGAN [BDC+03].

SeqAn’s header file for all data structures and functions related to two-dimensional seeds is <seqan/seeds.h>.

Seed Extension

Seeds are often created quickly using a k-mer index: When a k-mer of a given length is found in both sequences, we can use it as a seed. However, the match can be longer than just k characters. To get longer matches, we use seed extension.

In the most simple case we simply look for matching characters in both sequences to the left and right end of the seed. This is called match extension and available through the extendSeed function using the MatchExtend tag. Below example shows how to extend seeds to the right end.

    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWNIES for me!";
    // Create and print the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);
    std::cout << "original\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    // Perform match extension.
    extendSeed(seed, seqH, seqV, EXTEND_RIGHT, MatchExtend());
    // Print the resulting seed.
    std::cout << "result\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";
original
seedH: ROW
seedV: ROW
result
seedH: ROWN
seedV: ROWN

Assignment 1

Type
Review
Objective
Change the example from above to extend the seed to both sides.
Solution
#include <seqan/sequence.h>
#include <seqan/stream.h>
#include <seqan/seeds.h>

using namespace seqan;

int main()
{
    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWNIES for me!";
    // Create and print the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);
    std::cout << "original\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    // Perform match extension.
    extendSeed(seed, seqH, seqV, EXTEND_BOTH, MatchExtend());
    // Print the resulting seed.
    std::cout << "result\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    return 0;
}

A more complex case is the standard bioinformatics approach of x-drop extension.

In the ungapped case, we extend the seed by comparing the i-th character to the left/right of the seed of the horizontal sequence (subject sequence) with the j-th character to the left/right of the seed in the vertical sequence (query sequence). Matches and mismatches are assigned with scores (usually matches are assigned with positive scores and mismatches are assigned with negative scores). The scores are summed up. When one or more mismatches occur, the running total will drop. When the sum drops more than a value x, the extension is stopped.

This approach is also available in the gapped case in the SeqAn library. Here, creating gaps is also possible but also assigned negative scores.

    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWN boxes of brownies!";
    // Create and print the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);
    std::cout << "original\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    // Perform match extension.
    Score<int, Simple> scoringScheme(1, -1, -1);
    extendSeed(seed, seqH, seqV, EXTEND_BOTH, scoringScheme, 3,
               UnGappedXDrop());
    // Print the resulting seed.
    std::cout << "result\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";
original
seedH: ROW
seedV: ROW
result
seedH: ick BROWN fox
seedV: ick BROWN box

Assignment 2

Type
Review
Objective
Change the example from above to use gapped instead of ungapped x-drop extension.
Solution
#include <seqan/sequence.h>
#include <seqan/stream.h>
#include <seqan/score.h>
#include <seqan/seeds.h>

using namespace seqan;

int main()
{
    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWN boxes of brownies!";
    // Create and print the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);
    std::cout << "original\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    // Perform match extension.
    Score<int, Simple> scoringScheme(1, -1, -1);
    extendSeed(seed, seqH, seqV, EXTEND_BOTH, scoringScheme, 3,
               GappedXDrop());
    // Print the resulting seed.
    std::cout << "result\n"
              << "seedH: " << infix(seqH, beginPositionH(seed),
                          endPositionH(seed)) << "\n"
              << "seedV: " << infix(seqV, beginPositionV(seed),
                          endPositionV(seed)) << "\n";

    return 0;
}

After extending a seed, we might wish to actually get the resulting alignment. When using gapped x-drop extension, we need to perform a banded global alignment of the two sequence infixes that correspond to the seed. This is shown in the following example:

    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWN boxes of brownies!";
    // Create the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);

    // Perform match extension.
    Score<int, Simple> scoringScheme(1, -1, -1);
    extendSeed(seed, seqH, seqV, EXTEND_BOTH, scoringScheme, 3,
               GappedXDrop());

    // Perform a banded alignment.
    Align<CharString> align;
    resize(rows(align), 2);
    assignSource(row(align, 0), infix(seqH, beginPositionH(seed),
                                      endPositionH(seed)));
    assignSource(row(align, 1), infix(seqV, beginPositionV(seed),
                                      endPositionV(seed)));

    globalAlignment(align, scoringScheme);
    std::cout << "Resulting alignment\n" << align << "\n";
Resulting alignment
      0     .    :    .    :  
         quick BROWN fox-- ju
           |||||||||| ||  |  
        -thick BROWN boxes of

Assignment 3

Type
Review
Objective
Change the example from above to a gap open score of -2 and a gap extension score of -2.
Solution

Note that we do not have to explicitely call Gotoh’s algorithm in globalAlignment(). The fact that the gap extension score is different from the gap opening score is enough.

#include <seqan/align.h>
#include <seqan/stream.h>
#include <seqan/score.h>
#include <seqan/seeds.h>
#include <seqan/sequence.h>

using namespace seqan;

int main()
{
    // The horizontal and vertical sequence (subject and query sequences).
    CharString seqH = "The quick BROWN fox jumped again!";
    CharString seqV =     "thick BROWN boxes of brownies!";
    // Create the seed sequence.
    Seed<Simple> seed(11, 7, 14, 10);

    // Perform match extension.
    Score<int, Simple> scoringScheme(1, -1, -2, -2);
    extendSeed(seed, seqH, seqV, EXTEND_BOTH, scoringScheme, 3,
               GappedXDrop());

    // Perform a banded alignment.
    Align<CharString> align;
    resize(rows(align), 2);
    assignSource(row(align, 0), infix(seqH, beginPositionH(seed),
                                      endPositionH(seed)));
    assignSource(row(align, 1), infix(seqV, beginPositionV(seed),
                                      endPositionV(seed)));

    globalAlignment(align, scoringScheme);
    std::cout << "Resulting alignment\n" << align << "\n";

    return 0;
}

Local Chaining using Seed Sets

Usually, we quickly determine a large number of seeds. When a seed is found, we want to find a “close” seed that we found previously and combine it to form a longer seed. This combination is called local chaining. This approach has been pioneered in the CHAOS and BLAT programs.

SeqAn provides the SeedSet class as a data structure to efficiently store seeds and combine new seeds with existing ones. The following example creates a SeedSet object seeds, adds four seeds to it and then prints its contents.

    typedef Seed<Simple>    TSeed;
    typedef SeedSet<TSeed> TSeedSet;

    TSeedSet seedSet;
    addSeed(seedSet, TSeed(0, 0, 2), Single());
    addSeed(seedSet, TSeed(3, 5, 2), Single());
    addSeed(seedSet, TSeed(4, 2, 3), Single());
    addSeed(seedSet, TSeed(9, 9, 2), Single());

    std::cout << "Resulting seeds.\n";
    typedef Iterator<TSeedSet>::Type TIter;
    for (TIter it = begin(seedSet, Standard()); it != end(seedSet, Standard()); ++it)
        std::cout << "(" << beginPositionH(*it) << ", " << endPositionH(*it)
                  << ", " << beginPositionV(*it) << ", " << endPositionV(*it)
                  << ", " << lowerDiagonal(*it) << ", " << upperDiagonal(*it)
                  << ")\n";

The output of the program above can be seen below.

Resulting seeds.
(3, 5, 5, 7, -2, -2)
(0, 2, 0, 2, 0, 0)
(9, 11, 9, 11, 0, 0)
(4, 7, 2, 5, 2, 2)

Note that we have used the Single() tag for adding the seeds. This forces the seeds to be added independent of the current seed set contents.

By using different overloads of the addSeed, we can use various local chaining strategies when adding seed A.

Merge
If there is a seed B that overlaps with A and the difference in diagonals is smaller than a given threshold then A can be merged with B.
SimpleChain
If there is a seed B whose distance in both sequences is smaller than a given threshold then A can be chained to B.
Chaos
Following the strategy of CHAOS [BCGottgens+03], if A is within a certain distance to B in both sequences and the distance in diagonals is smaller than a given threshold then A can be chained to B.

The addSeed function returns a boolean value indicating success in finding a suitable partner for chaining. A call using the Single strategy always yields true.

The following example shows how to use the SimpleChain strategy.

    typedef Seed<Simple>   TSeed;
    typedef SeedSet<TSeed> TSeedSet;

    Dna5String seqH;
    Dna5String seqV;
    Score<int, Simple> scoringScheme(1, -1, -1);

    String<TSeed> seeds;
    appendValue(seeds, TSeed(0, 0, 2));
    appendValue(seeds, TSeed(3, 5, 2));
    appendValue(seeds, TSeed(4, 2, 3));
    appendValue(seeds, TSeed(9, 9, 2));

    TSeedSet seedSet;
    for (unsigned i = 0; i < length(seeds); ++i)
    {
        if (!addSeed(seedSet, seeds[i], 2, 2, scoringScheme,
                     seqH, seqV, SimpleChain()))
            addSeed(seedSet, seeds[i], Single());
    }

    std::cout << "Resulting seeds.\n";
    typedef Iterator<TSeedSet>::Type TIter;
    for (TIter it = begin(seedSet, Standard());
         it != end(seedSet, Standard()); ++it)
        std::cout << "(" << beginPositionH(*it) << ", " << endPositionH(*it)
                  << ", " << beginPositionV(*it) << ", " << endPositionV(*it)
                  << ", " << lowerDiagonal(*it) << ", " << upperDiagonal(*it)
                  << ")\n";

As we can see, the seed TSeed(4, 2, 3) has been chained to TSeed(0, 0, 2).

Resulting seeds.
(3, 5, 5, 7, -2, -2)
(0, 7, 0, 5, 0, 2)
(9, 11, 9, 11, 0, 0)

Assignment 4

Type
Review
Objective
Change the example above to use the Chaos strategy instead of the SimpleChain.
Solution
#include <seqan/stream.h>
#include <seqan/score.h>
#include <seqan/seeds.h>
#include <seqan/sequence.h>

using namespace seqan;

int main()
{
    typedef Seed<Simple>    TSeed;
    typedef SeedSet<TSeed> TSeedSet;

    Dna5String seqH;
    Dna5String seqV;
    Score<int, Simple> scoringScheme(1, -1, -1);

    String<TSeed> seeds;
    appendValue(seeds, TSeed(0, 0, 2));
    appendValue(seeds, TSeed(3, 5, 2));
    appendValue(seeds, TSeed(4, 2, 3));
    appendValue(seeds, TSeed(9, 9, 2));

    TSeedSet seedSet;
    for (unsigned i = 0; i < length(seeds); ++i)
    {
        if (!addSeed(seedSet, seeds[i], 2, 2, scoringScheme,
                     seqH, seqV, Chaos()))
            addSeed(seedSet, seeds[i], Single());
    }

    std::cout << "Resulting seeds.\n";
    typedef Iterator<TSeedSet>::Type TIter;
    for (TIter it = begin(seedSet, Standard());
         it != end(seedSet, Standard()); ++it)
        std::cout << "(" << beginPositionH(*it) << ", " << endPositionH(*it)
                  << ", " << beginPositionV(*it) << ", " << endPositionV(*it)
                  << ", " << lowerDiagonal(*it) << ", " << upperDiagonal(*it)
                  << ")\n";

    return 0;
}

Global Chaining

../../_images/GlobalChaining.png

After one has determined a set of candidate seeds, a lot of these seeds will conflict. The image to the right shows an example. Some conflicting seeds might be spurious matches or come from duplication events.

Often, we need to find a linear ordering of the seeds such that each seed starts after all of its predecessor end in both sequences. This can be done efficiently using dynamic sparse programming (in time \(\mathcal{O}(n log n)\) where \(n\) is the number of seeds) as described in [Gus97]. The red seeds in the image to the right show such a valid chain.

This functionality is available in SeqAn using the chainSeedsGlobally function. The function gets a sequence container of Seed objects for the result as its first parameter and a SeedSet as its second parameter. A subset of the seeds from the SeedSet are then selected and stored in the result sequence.

The following shows a simple example.

    typedef Seed<Simple>    TSeed;
    typedef SeedSet<TSeed> TSeedSet;

    TSeedSet seedSet;
    addSeed(seedSet, TSeed(0, 0, 2), Single());
    addSeed(seedSet, TSeed(3, 5, 2), Single());
    addSeed(seedSet, TSeed(4, 2, 3), Single());
    addSeed(seedSet, TSeed(9, 9, 2), Single());

    String<TSeed> result;
    chainSeedsGlobally(result, seedSet, SparseChaining());

Assignment 5

Type
Review
Objective
Change the example from above to use a different chain of seeds. The seeds should be TSeed(1, 1, 3), TSeed(6, 9, 2), TSeed(10, 13, 3), and TSeed(20, 22, 5).
Solution
#include <seqan/sequence.h>
#include <seqan/stream.h>
#include <seqan/seeds.h>

using namespace seqan;

int main()
{
    typedef Seed<Simple>    TSeed;
    typedef SeedSet<TSeed> TSeedSet;

    TSeedSet seedSet;
    addSeed(seedSet, TSeed(1, 1, 3), Single());
    addSeed(seedSet, TSeed(6, 9, 2), Single());
    addSeed(seedSet, TSeed(10, 13, 3), Single());
    addSeed(seedSet, TSeed(20, 22, 5), Single());

    String<TSeed> result;
    chainSeedsGlobally(result, seedSet, SparseChaining());

    return 0;
}

Banded Chain Alignment

After obtaining such a valid seed chain, we would like to obtain an alignment along the chain. For this, we can use the so-called banded chain alignment algorithm [BDC+03]. Around seeds, we can use banded DP alignment and the spaces between seeds can be aligned using standard DP programming alignment.

In SeqAn you can compute the banded chain alignment by calling the function bandedChainAlignment. This function gets the structure in which the alignment should be stored as the first parameter. This corresponds to the interface of the globalAlignment and allows the same input types. Additionally, this function requires a non-empty, non-decreasing monotonic chain of seeds which is used as the rough global map for computing the global alignment. The third required parameter is the Score.

Note, that there are a number of optional parameters that can be specified. These include a second Score which, if specified, is used to evaluate the regions between two consecutive seeds differently than the regions around the seeds itself (for which then the first specified score is taken.). As for the global alignment you can use the AlignConfig to specify the behavior for initial and end gaps. The last optional parameter is the band extension. This parameter specifies to which size the bands around the anchors should be extended. The default value is 15 and conforms the default value in the LAGAN-algorithm [BDC+03].

Important

At the moment the specified value for the band extension must be at least one.

    typedef Seed<Simple> TSeed;

    Dna5String sequenceH = "CGAATCCATCCCACACA";
    Dna5String sequenceV = "GGCGATNNNCATGGCACA";

    String<TSeed> seedChain;
    appendValue(seedChain, TSeed(0, 2, 5, 6));
    appendValue(seedChain, TSeed(6, 9, 9, 12));
    appendValue(seedChain, TSeed(11, 14, 17, 16));

    Align<Dna5String, ArrayGaps> alignment;
    resize(rows(alignment), 2);
    assignSource(row(alignment, 0), sequenceH);
    assignSource(row(alignment, 1), sequenceV);

    Score<int, Simple> scoringScheme(2, -1, -2);

    int result = bandedChainAlignment(alignment, seedChain, scoringScheme, 2);

    std::cout << "Score: " << result << std::endl;
    std::cout << alignment << std::endl;

The output of the example above.

Score: 5
      0     .    :    .    :  
        --CGAAT--CCATCCCACACA
          || ||   |||    ||||
        GGCG-ATNNNCATGG--CACA

Assignment 6

Type
Review
Objective

Change the example form above to use two different scoring schemes. The scoring scheme for the seeds should use the Levenshtein distance and the score for the gap regions should be an affine score with the following values: match = 2, mismatch = -1, gap open = -2, gap extend = -1.

Furthermore, we are looking for a semi-global alignment here the initial and end gaps in the query sequence are free.

Solution
#include <seqan/sequence.h>
#include <seqan/align.h>
#include <seqan/score.h>
#include <seqan/seeds.h>

using namespace seqan;

int main()
{
    typedef Seed<Simple>    TSeed;

    Dna5String sequenceH = "CGAATCCATCCCACACA";
    Dna5String sequenceV = "GGCGATNNNCATGGCACA";
    Score<int, Simple> scoringSchemeAnchor(0, -1, -1);
    Score<int, Simple> scoringSchemeGap(2, -1, -1, -2);

    String<TSeed> seedChain;
    appendValue(seedChain, TSeed(0, 2, 5, 6));
    appendValue(seedChain, TSeed(6, 9, 9, 12));
    appendValue(seedChain, TSeed(11, 14, 17, 16));

    Align<Dna5String, ArrayGaps> alignment;
    resize(rows(alignment), 2);
    assignSource(row(alignment, 0), sequenceH);
    assignSource(row(alignment, 1), sequenceV);
    AlignConfig<true, false, false, true> alignConfig;

    int result = bandedChainAlignment(alignment, seedChain, scoringSchemeAnchor, scoringSchemeGap, alignConfig, 2);

    std::cout << "Score: " << result << std::endl;
    std::cout << alignment << std::endl;

    return 0;
}