Alignment Representation (Gaps)

Learning Objective

This tutorial introduces you to the gaps data structures that can be used to represent an alignment in SeqAn. You will learn basic techniques to create and modify such data structures and how to access certain information from these data structures.

Difficulty

Basic

Duration

15 min

Prerequisites

A First Example, Sequences


The Align data structure is simply a set of multiple Gaps data structures. A Gaps data structure is a container storing gap information for a given source sequence. The gap information is put on top of the source sequence (coordinates of the gapped sequence refer to the gap space) without directly applying them to the source (coordinates of the ungapped sequence refer to the source space). This way operating with gaps sustains very flexible.

Gaps data structures

There are two specializations available for the Gaps data structures: Array Gaps and Anchor Gaps. They differ in the way they implement the gap space.

Note

In general, using Array Gaps is sufficient for most applications. This specialization is also the default one if nothing else is specified. It simply uses an array which stores the counts of gaps and characters in an alternating order. Thus, it is quite efficient to extend existing gaps while it is more expensive to search within the gapped sequence or insert new gaps. Alternatively, one should prefer Anchor Gaps if many conversions between coordinates of the gap and the source space are needed as binary search can be conducted to search for specific positions.

Constructing an alignment

Now, let’s start by constructing our first alignment. Before we can make use of any of the mentioned data structures, we need to tell the program where to find the definitions. This can be achieved by including the header file <seqan/align.h> which contains the necessary data structures and functions associated with the alignments. The next steps would be to implement the main function of our program and to define the types that we want to use.

#include <iostream>
#include <seqan/align.h>

using namespace seqan2;

int main()
{

We first define the type of the input sequences (TSequence). Then we can define the type of our actual Align object we want to use. In an Align object, the gapped sequences are arranged in rows. You can use the Metafunction Row to get the correct type of the used Gaps objects. In the following we use the term row to explicitly refer to a single gapped sequence in the Align object. We will use the term gapped sequence to describe functionalities that are related to the Gaps data structure independent of the Align object.

    typedef char TChar;                             // character type
    typedef String<TChar> TSequence;                // sequence type
    typedef Align<TSequence, ArrayGaps> TAlign;     // align type
    typedef Row<TAlign>::Type TRow;                 // gapped sequence type

After defining the types, we can continue to actually construct our own Align object. Therefore, we need to resize the alignment object in order to reserve space for the sequences we want to add. In our case, we assume a pairwise alignment, hence we reserve space for 2 sequences. With the function row, we get access to the gapped sequence at a specific row in the alignment object. This is similar to the value function used in String Sets. Now, we can assign the source to the corresponding gapped sequence.

    TSequence seq1 = "CDFGDC";
    TSequence seq2 = "CDEFGAHGC";

    TAlign align;
    resize(rows(align), 2);
    assignSource(row(align, 0), seq1);
    assignSource(row(align, 1), seq2);
    std::cout << align;
      0     .
        CDFGDC
        ||
        CDEFGA

Note

The second string CDEFGAHGC of the alignment is cropped in the output to CDEFGA, such that they are of equal length. Note that the string itself is not modified, i.e. not shortened.

After assigning the sources to the gapped sequences, we need to add some gaps to make it look like a real alignment. You can use the functions insertGap() and removeGap() to insert and delete one gap or insertGaps() and removeGaps() to insert and delete multiple gaps in a gapped sequence.

    TRow & row1 = row(align, 0);
    TRow & row2 = row(align, 1);
    insertGap(row1, 2);
    std::cout << align;
    insertGaps(row1, 5, 2);
    std::cout << align;
      0     .
        CD-FGDC
        || ||
        CDEFGAH


      0     .
        CD-FG--DC
        || ||   |
        CDEFGAHGC

Congratulations! You have created your first alignment. Note that we used a reference declaration TRow & for the variables row1 and row2. Without the reference, we would only modify copies of rows and the changes would not effect our align object.

Gap Space vs. Source Space

../../../_images/gaps_illustration.png

In the next steps, we want to dig a little deeper to get a feeling for the gap space and the source space. As mentioned above, the gaps are not inserted into the source but put on top of it in a separate space, the gap space. When inserting gaps, the gap space is modified and all coordinates right of the inserted gap are shifted to the right by the size of the gap. At the same time, the coordinates of the source remain unchanged. Using the function toSourcePosition(), we can determine which position of the current gapped sequence (gap space) corresponds to the position in the source space.

    std::cout << std::endl << "ViewToSource1: " << std::endl;
    for (auto c: row1)
        std::cout << c << " ";
    std::cout << std::endl;

    for (unsigned i = 0; i < length(row1); ++i)
        std::cout << toSourcePosition(row1, i) << " ";
    std::cout << std::endl;


    std::cout << std::endl << "ViewToSource2: " << std::endl;
    for (auto c: row2)
        std::cout << c << " ";
    std::cout << std::endl;

    for (unsigned i = 0; i < length(row2); ++i)
        std::cout << toSourcePosition(row2, i) << " ";
    std::cout << std::endl;
ViewToSource1:
C D - F G - - D C
0 1 2 2 3 4 4 4 5

ViewToSource2:
C D E F G A H G C
0 1 2 3 4 5 6 7 8

If the position in the gap space is actually a gap, then toSourcePosition() returns the source position of the next character to the right that is not a gap. Vice versa, we can determine where our current source position maps into the gap space using the function toViewPosition().

    std::cout << std::endl << "SourceToView1: " << std::endl;
    for (auto c: source(row1))
        std::cout << c << " ";
    std::cout << std::endl;

    for (unsigned i = 0; i < length(source(row1)); ++i)
        std::cout << toViewPosition(row1, i) << " ";
    std::cout << std::endl;


    std::cout << std::endl << "SourceToView2: " << std::endl;
    for (auto c: source(row2))
        std::cout << c << " ";
    std::cout << std::endl;

    for (unsigned i = 0; i < length(source(row2)); ++i)
        std::cout << toViewPosition(row2, i) << " ";
    std::cout << std::endl;
SourceToView1:
C D F G D C
0 1 3 4 7 8

SourceToView2:
C D E F G A H G C
0 1 2 3 4 5 6 7 8

In the first alignment, it seems that the end of the second row is cropped off to match the size of the first one. This effect takes place only in the visualization but is not explicitly applied to the gapped sequence. The second alignment is the one we manually constructed. Here, you can see that the second row is expanded to its full size while it matches the size of the first row. However, it is possible to explicitly crop off the ends of a gapped sequence by using the functions setClippedBeginPosition() and setClippedEndPosition(). These functions shrink the gap space and can be understood as defining an infix of the gapped sequence. After the clipping, the relative view position changes according to the clipping and so does the mapping of the source positions to the gap space. The mapping of the view positions to the source space does not change.

    std::cout << std::endl << "Before clipping:\n" << align;
    setClippedBeginPosition(row1, 1);
    setClippedEndPosition(row1, 7);
    setClippedBeginPosition(row2, 1);
    setClippedEndPosition(row2, 7);
    std::cout << std::endl << "After clipping:\n" << align;
Before clipping:
      0     .
        CD-FG--DC
        || ||   |
        CDEFGAHGC



After clipping:
      0     .
        D-FG--
        | ||
        DEFGAH

Here the output of the clipping procedure.

    std::cout << std::endl << "ViewToSource1: ";
    for (unsigned i = 0; i < length(row1); ++i)
        std::cout << toSourcePosition(row1, i) << " ";

    std::cout << std::endl << "ViewToSource2: ";
    for (unsigned i = 0; i < length(row2); ++i)
        std::cout << toSourcePosition(row2, i) << " ";
    std::cout << std::endl;

    std::cout << std::endl << "SourceToView1: ";
    for (unsigned i = 0; i < length(source(row1)); ++i)
        std::cout << toViewPosition(row1, i) << " ";

    std::cout << std::endl << "SourceToView2: ";
    for (unsigned i = 0; i < length(source(row2)); ++i)
        std::cout << toViewPosition(row2, i) << " ";
    std::cout << std::endl;
ViewToSource1: 1 2 2 3 4 4
ViewToSource2: 1 2 3 4 5 6

SourceToView1: -1 0 2 3 6 7
SourceToView2: -1 0 1 2 3 4 5 6 7

Note

It is important to understand the nature of the clipping information. It virtually shrinks the gap space not physically. That means the information before/after the begin/end of the clipping still exists and the physical gap space remains unchanged. To the outer world it seems the alignment is cropped off irreparably. But you can expand the alignment again by resetting the clipping information.

Iterating over Gapped Sequences

In the last part of this section, we are going to iterate over a Gaps object. This can be quite useful if one needs to parse the alignment rows to access position specific information. First, we have to define the type of the Iterator, which can be easily done by using the metafunction Iterator. Remember that we iterate over an TRow object. Then we have to construct the iterators it which points to the begin of row1 using the begin() function and itEnd which points behind the last value of row1 using the end() function. If you need to refresh the Iterator Concept you can read the iterator section Iteration. While we iterate over the gapped sequence, we can ask if the current value, at which the iterator it points to, is a gap or not by using the function isGap(). Use gapValue to print the correct gap symbol.

    typedef Iterator<TRow>::Type TRowIterator;
    TRowIterator it = begin(row1),
                 itEnd = end(row1);
    for (; it != itEnd; ++it)
    {
        TChar c = isGap(it) ? gapValue<TChar>() : *it;
        std::cout << c << " ";
    }
    std::cout << std::endl;
D - F G - -

We will now reset the clipping of row1 using clearClipping and iterate again over it to see its effect.

    clearClipping(row1);

    it = begin(row1);
    itEnd = end(row1);
    for (; it != itEnd; ++it)
    {
        TChar c = isGap(it) ? gapValue<TChar>() : *it;
        std::cout << c << " ";
    }
    std::cout << std::endl;
    return 0;
}
C D - F G - - D C

Here you can see how resetting the clipping positions brings back our complete row.

Assignment 1

Type

Review

Objective

Construct an alignment using the Align data structure for the sequences "ACGTCACCTC" and "ACGGGCCTATC". Insert two gaps at the second position and insert one gap at the fifth position of the first sequence. Insert one gap at the ninth position of the second sequence. Iterate over the rows of your Align object and print the total count of gaps that exist within the alignment.

Hints

You can use the function countGaps to count the number of consecutive gaps starting from the current position of the iterator.

The resulting alignment should look like:
AC--GTC-ACCTC

ACGGGCCTA--TC
Solution
#include <iostream>
#include <seqan/align.h>

using namespace seqan2;

int main()
{
    // Defining all types that are needed.
    typedef String<char> TSequence;
    typedef Align<TSequence, ArrayGaps> TAlign;
    typedef Row<TAlign>::Type TRow;
    typedef Iterator<TRow>::Type TRowIterator;

    TSequence seq1 = "ACGTCACCTC";
    TSequence seq2 = "ACGGGCCTATC";

    // Initializing the align object.
    TAlign align;
    resize(rows(align), 2);
    assignSource(row(align, 0), seq1);
    assignSource(row(align, 1), seq2);

    // Use references to the rows of align.
    TRow & row1 = row(align, 0);
    TRow & row2 = row(align, 1);

    // Insert gaps.
    insertGaps(row1, 2, 2);
    insertGap(row1, 7);  // We need to pass the view position which is changed due to the previous insertion.
    insertGaps(row2, 9, 2);

    // Initialize the row iterators.
    TRowIterator itRow1 = begin(row1);
    TRowIterator itEndRow1 = end(row1);
    TRowIterator itRow2 = begin(row2);

    // Iterate over both rows simultaneously.
    int gapCount = 0;
    for (; itRow1 != itEndRow1; ++itRow1, ++itRow2)
    {
        if (isGap(itRow1))
        {
            gapCount += countGaps(itRow1);  // Count the number of consecutive gaps from the current position in row1.
            itRow1 += countGaps(itRow1);    // Jump to next position to check for gaps.
            itRow2 += countGaps(itRow1);    // Jump to next position to check for gaps.
        }
        if (isGap(itRow2))
        {
            gapCount += countGaps(itRow2);  // Count the number of consecutive gaps from the current position in row2.
            itRow1 += countGaps(itRow2);    // Jump to next position to check for gaps.
            itRow2 += countGaps(itRow2);    // Jump to next position to check for gaps.
        }
    }
    // Print the result.
    std::cout << "Number of gaps: " << gapCount << std::endl;
}
Number of gaps: 5