Realignment

Learning Objective
In this tutorial, you will learn how to refine multi-sequence alignments in a fragment store. This can be useful for refining multi-read alignments around indels prior to small variant calling. After completing the tutorial, you will be able to load reads into a fragment store and compute a realignment.
Difficulty
Advanced
Duration
30 min
Prerequisites
Fragment Store

A common task in NGS data analysis is small variant calling (SNVs or indels with a length of up to 10 bp) after the read mapping step. Usually, one considers the “pileup” of the reads and looks for variant signatures (e.g. a certain number of non-reference characters in the aligned reads). Usually, read mappers compute pairwise alignments of each read and the reference and store them in a SAM or BAM file. In the absence of indels, such pairwise alignments can be converted to a multi-read alignment without problems. However, there can be an undesired multi-read alignment around indels (Figure 1).

The task of improving such an alignment is called realignment and there is a small number of algorithms and tools available for realignment. This tutorial describes the <seqan/realign.h> module which implements a variant of the ReAligner algorithm by Anson and Myers [AM97].

Getting Started

Consider the following program. It creates a fragment store and then reads a small reference (with a length of 2kb) from a FASTA file and also a SAM file with reads spanning a complex indel region at 1060 ~ 1140. Finally, it prints the multi-read alignment around this position using AlignedReadLayout.

#include <seqan/store.h>
#include <seqan/realign.h>

using namespace seqan;

int main()
{
    // Build paths to reference and to alignment.
    std::string refPath = getAbsolutePath("demos/tutorial/realignment/ref.fa");
    std::string samPath = getAbsolutePath("demos/tutorial/realignment/reads.sam");

    // Declare fragment store.
    FragmentStore<> store;

    // Load contigs and read alignment.
    loadContigs(store, refPath.c_str());
    BamFileIn bamFileIn(samPath.c_str());
    readRecords(store, bamFileIn);

    // Layout alignment and print.
    AlignedReadLayout layout;
    layoutAlignment(layout, store);
    printAlignment(std::cout, layout, store, /*contigID=*/ 0, /*posBegin=*/ 1060,
                   /*posEnd=*/ 1140, /*lineBegin=*/ 0, /*lineEnd=*/ 100);

    return 0;
}

The output of the program is as follows:

TTGACTGTGGGAGGATACATCTCTCCATCAATTATCTAAAAAAATAAATAAATAAATAAACATCAGTTAAAAAGTTAAGG
.........................................  C,,,,A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.........................................TT   ..A...............................
.................................C............   .TC....A...T.AATAAAC.TC......AA
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,T,ACT,A  .............................
..................................................       .......................
.....................................**.G..............    ,,,,,,,,,,,,,,,,,,,,,
...................................G...******......G..........   ...............
..............................G........******..................      ,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,          ......
.......................................******........................         ,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.......G.................................********..........C..................T.
.......................................******..................................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.........................................********.....................C.........
.......................................******................................C..
.........................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.......................................******...................................
.........................................********...............................
.......................................******...................................
.........................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,C,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
  .......................................********...............................
    ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,N,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
       ..................................********...............................
          ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
            ...........................******...................................
            .............................********...............................
                 ,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                  ,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
                   ......................********...............................
                               ,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

Figure 1: An example of a multi-read alignment from pairwise alignments

Performing the Realignment

We can now use the function reAlignment for performing a realignment of the reads in the fragment store.

contigID
The numeric ID of the contig to realign.
realignmentMethod
Whether to use linear (0) or affine gap costs (1). It is recommended to use affine gap costs.
bandwidth
The bandwidth to use in the realignment step.
includeReference
Whether or not to include the reference as a pseudo read.

The algorithm works as follows: A profile is computed, with a count of each base and the gap character at each position in the multi-read alignment. Each read is taken and aligned against this profile. This is repeated until convergence. Finally, the consensus of the multi-read alignment is written into store.contigStore[contigID].seq.

The parameter bandwidth controls the bandwidth of the banded alignment used in the alignment of reads against the profile. If includeReference is true then the reference is added as a pseudo-read (a new read at the end of the read store). This can be used for computing alignments of the reads agains the original reference.

#include <seqan/store.h>
#include <seqan/realign.h>

using namespace seqan;

int main()
{
    // Build paths to reference and to alignment.
    std::string refPath = getAbsolutePath("demos/tutorial/realignment/ref.fa");
    std::string samPath = getAbsolutePath("demos/tutorial/realignment/reads.sam");

    // Declare fragment store.
    FragmentStore<> store;

    // Load contigs and read alignment.
    loadContigs(store, refPath.c_str());
    BamFileIn bamFileIn(samPath.c_str());
    readRecords(store, bamFileIn);

    // Perform the realignment.
    reAlignment(store, /*contigID=*/ 0, /*method=*/ 1, /*bandwidth=*/ 20,
                /*includeReference=*/ true);

    // Layout alignment and print.
    AlignedReadLayout layout;
    layoutAlignment(layout, store);
    printAlignment(std::cout, layout, store, /*contigID=*/ 0, /*posBegin=*/ 1060,
                   /*posEnd=*/ 1140, /*lineBegin=*/ 0, /*lineEnd=*/ 100);

    return 0;
}

Here is the program’s output. The reference pseudo-read is here shown as the first read (second row) below the reference (first row).

TTGACTGTGGGAGGATACATCTCTCCATCAATTATCTAAAA-------TAAATAAATAAACATCAGTTAAAAAGTTAAGG
.........................................AAATAAA................................
.........................................          .............................
.........................................******T.        .......................
.................................C.......********.....     .....................
.........................................******A...C...          ...............
.........................................********.........           ...........
......................................G..******A...........               ......
...................................G.....******A...G..........                ..
..............................G..........******A...............
.........................................******A................
.........................................******A.....................
.........................................********.............................
.......G.................................********..........C..................T.
.........................................******A...............................
.........................................******A................................
.........................................********.....................C.........
.........................................******A.............................C..
.........................................********...............................
.........................................******A................................
.........................................******A................................
.........................................********...............................
.........................................******A................................
.........................................********...............................
....................................C....******A................................
  .......................................********...............................
    ..................................N..******A................................
       ..................................********...............................
          ...............................********...............................
            .............................******A................................
            .............................********...............................
                 ........................********...............................
                  .......................******A................................
                   ......................********...............................
                               ..........********...............................
                                 ........********...............................
                                   ......********...............................
                                      ...********...............................