Realignment¶
- Learning Objective
In this tutorial, you will learn how to refine multi-sequence alignments in a fragment store. This can be useful for refining multi-read alignments around indels prior to small variant calling. After completing the tutorial, you will be able to load reads into a fragment store and compute a realignment.
- Difficulty
Advanced
- Duration
30 min
- Prerequisites
A common task in NGS data analysis is small variant calling (SNVs or indels with a length of up to 10 bp) after the read mapping step. Usually, one considers the “pileup” of the reads and looks for variant signatures (e.g. a certain number of non-reference characters in the aligned reads). Usually, read mappers compute pairwise alignments of each read and the reference and store them in a SAM or BAM file. In the absence of indels, such pairwise alignments can be converted to a multi-read alignment without problems. However, there can be an undesired multi-read alignment around indels (Figure 1).
The task of improving such an alignment is called realignment and there is a small number of algorithms and tools available for realignment.
This tutorial describes the <seqan/realign.h>
module which implements a variant of the ReAligner algorithm by Anson and Myers [AM97].
Getting Started¶
Consider the following program. It creates a fragment store and then reads a small reference (with a length of 2kb) from a FASTA file and also a SAM file with reads spanning a complex indel region at 1060 ~ 1140. Finally, it prints the multi-read alignment around this position using AlignedReadLayout.
#include <seqan/store.h>
#include <seqan/realign.h>
using namespace seqan2;
int main()
{
// Build paths to reference and to alignment.
std::string refPath = getAbsolutePath("demos/tutorial/realignment/ref.fa");
std::string samPath = getAbsolutePath("demos/tutorial/realignment/reads.sam");
// Declare fragment store.
FragmentStore<> store;
// Load contigs and read alignment.
loadContigs(store, refPath.c_str());
BamFileIn bamFileIn(samPath.c_str());
readRecords(store, bamFileIn);
// Layout alignment and print.
AlignedReadLayout layout;
layoutAlignment(layout, store);
printAlignment(std::cout, layout, store, /*contigID=*/ 0, /*posBegin=*/ 1060,
/*posEnd=*/ 1140, /*lineBegin=*/ 0, /*lineEnd=*/ 100);
return 0;
}
The output of the program is as follows:
TTGACTGTGGGAGGATACATCTCTCCATCAATTATCTAAAAAAATAAATAAATAAATAAACATCAGTTAAAAAGTTAAGG
......................................... C,,,,A,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.........................................TT ..A...............................
.................................C............ .TC....A...T.AATAAAC.TC......AA
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,T,ACT,A .............................
.................................................. .......................
.....................................**.G.............. ,,,,,,,,,,,,,,,,,,,,,
...................................G...******......G.......... ...............
..............................G........******.................. ,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,, ......
.......................................******........................ ,,
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.......G.................................********..........C..................T.
.......................................******..................................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.........................................********.....................C.........
.......................................******................................C..
.........................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.......................................******...................................
.........................................********...............................
.......................................******...................................
.........................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,C,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
.......................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,******,N,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
..................................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
...........................******...................................
.............................********...............................
,,,,,,,,,,,,,,,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
,,,,,,,,,,,,,,,,,,,,,******,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
......................********...............................
,,,,,,,,,,********,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Figure 1: An example of a multi-read alignment from pairwise alignments
Performing the Realignment¶
We can now use the function reAlignment for performing a realignment of the reads in the fragment store.
- contigID
The numeric ID of the contig to realign.
- realignmentMethod
Whether to use linear (0) or affine gap costs (1). It is recommended to use affine gap costs.
- bandwidth
The bandwidth to use in the realignment step.
- includeReference
Whether or not to include the reference as a pseudo read.
The algorithm works as follows:
A profile is computed, with a count of each base and the gap character at each position in the multi-read alignment.
Each read is taken and aligned against this profile.
This is repeated until convergence.
Finally, the consensus of the multi-read alignment is written into store.contigStore[contigID].seq
.
The parameter bandwidth
controls the bandwidth of the banded alignment used in the alignment of reads against the profile.
If includeReference
is true
then the reference is added as a pseudo-read (a new read at the end of the read store).
This can be used for computing alignments of the reads agains the original reference.
#include <seqan/store.h>
#include <seqan/realign.h>
using namespace seqan2;
int main()
{
// Build paths to reference and to alignment.
std::string refPath = getAbsolutePath("demos/tutorial/realignment/ref.fa");
std::string samPath = getAbsolutePath("demos/tutorial/realignment/reads.sam");
// Declare fragment store.
FragmentStore<> store;
// Load contigs and read alignment.
loadContigs(store, refPath.c_str());
BamFileIn bamFileIn(samPath.c_str());
readRecords(store, bamFileIn);
// Perform the realignment.
reAlignment(store, /*contigID=*/ 0, /*method=*/ 1, /*bandwidth=*/ 20,
/*includeReference=*/ true);
// Layout alignment and print.
AlignedReadLayout layout;
layoutAlignment(layout, store);
printAlignment(std::cout, layout, store, /*contigID=*/ 0, /*posBegin=*/ 1060,
/*posEnd=*/ 1140, /*lineBegin=*/ 0, /*lineEnd=*/ 100);
return 0;
}
Here is the program’s output. The reference pseudo-read is here shown as the first read (second row) below the reference (first row).
TTGACTGTGGGAGGATACATCTCTCCATCAATTATCTAAAA-------TAAATAAATAAACATCAGTTAAAAAGTTAAGG
.........................................AAATAAA................................
......................................... .............................
.........................................******T. .......................
.................................C.......********..... .....................
.........................................******A...C... ...............
.........................................********......... ...........
......................................G..******A........... ......
...................................G.....******A...G.......... ..
..............................G..........******A...............
.........................................******A................
.........................................******A.....................
.........................................********.............................
.......G.................................********..........C..................T.
.........................................******A...............................
.........................................******A................................
.........................................********.....................C.........
.........................................******A.............................C..
.........................................********...............................
.........................................******A................................
.........................................******A................................
.........................................********...............................
.........................................******A................................
.........................................********...............................
....................................C....******A................................
.......................................********...............................
..................................N..******A................................
..................................********...............................
...............................********...............................
.............................******A................................
.............................********...............................
........................********...............................
.......................******A................................
......................********...............................
..........********...............................
........********...............................
......********...............................
...********...............................