Multiple Sequence Alignment

Learning Objective
You will learn how to compute a multiple sequence alignment using SeqAn’s alignment data structures and algorithms.
Difficulty
Basic
Duration
15 min
Prerequisites
A First Example, Sequences, Alphabets, Alignment Representation

Apart from pairwise alignments, also multiple sequence alignments can be computed in SeqAn. The easiest way to do this is by using the function globalMsaAlignment. This function computes a heuristic alignment based on a consistency-based progressive alignment strategy as described in SeqAn::TCoffee paper.

In the following example, we align four amino acid sequences using the AlignmentGraph data structure and the Blosum62 scoring matrix with gap extension penalty -11 and gap open penalty -1. The required header for multiple sequence alignments is <seqan/graph_msa.h>.

#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan;

int main()
{

First, the sequence type TSequence is defined and a StringSet is declared. The four sequences to be aligned are appended to the StringSet seq.

	typedef String<AminoAcid> TSequence;
	StringSet<TSequence> seq;
	appendValue(seq,"DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE");
	appendValue(seq,"RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK");
	appendValue(seq,"FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK");
	appendValue(seq,"HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK");

Now we initialize our AlignmentGraph with the sequences. The graph and the Blosum62 scoring matrix are handed to the function globalMsaAlignment which computes the desired alignment.

	Graph<Alignment<StringSet<TSequence, Dependent<> > > > aliG(seq);
	globalMsaAlignment(aliG, Blosum62(-1, -11));
	std::cout << aliG << std::endl;
	
	return 0;
}

And here is the output of this example program:

Alignment matrix:
      0     .    :    .    :    .    :    .    :    .    :
        DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSA
          | |   |          |       |      || ||     ||
        RVKRP---MNAFIVWSRDQRRKMALENP--RMRNSEISKQLGYQWKMLTE
          | |              | | |   |   | |    | |    | | |
        FPKKP---LTPYFRFFMEKRAKYAKLHP--EMSNLDLTKILSKKYKELPE
          |||   |        | ||                  ||      |
        HIKKP---LNAFMLYMKEMRANVVAEST--LKESAAINQILGRRWHALSR

     50     .    :    .    :    .    :    .    :
        KEKGKFEDMAKADKARYEREMKTY----------IPPKGE
         ||  |   |    |        |
        AEKWPFFQEAQKLQAMHREKYPNYKYRP---RRKAKMLPK
          |    |  |                            |
        KKKMKYIQDFQREKQEFERNLARFREDH---PDLIQNAKK
            ||      | |                        |
        EEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK

Assignment 1

Type
Review
Objective

Compute a multiple sequence alignments between the four protein sequences

  • DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE
  • RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK
  • FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK
  • HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK

using a Align object and the Blosum80 score matrix.

Repeat the above example using the Align data structure and the Blosum80 scoring matrix.

Solution

After the usual includes, the Align object align is initialized and the four sequences are appended as rows.

#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan;

int main()
{
	typedef String<AminoAcid> TSequence;
	Align<TSequence> align;
	resize(rows(align), 4);
	assignSource(row(align, 0),"DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMAKADKARYEREMKTYIPPKGE");
	assignSource(row(align, 1),"RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQAMHREKYPNYKYRPRRKAKMLPK");
	assignSource(row(align, 2),"FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQEFERNLARFREDHPDLIQNAKK");
	assignSource(row(align, 3),"HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK");

Now the MSA is computed, using the Blosum80 matrix for scoring.

	globalMsaAlignment(align, Blosum80(-1, -11));
	std::cout << align << std::endl;
	
	return 0;
}

And here is the output:

 0     .    :    .    :    .    :    .    :    .    :
   DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSA
     | |   |          |       |    |  | ||     ||
   RVKRP---MNAFIVWSRDQRRKMALENPRMR-NS-EISKQLGYQWKMLTE
     | |              | | |   | |  |     | |    | | |
   FPKKP---LTPYFRFFMEKRAKYAKLHPEMS-NL-DLTKILSKKYKELPE
     |||   |        | ||                  ||      |
   HIKKP---LNAFMLYMKEMRANVVAESTLKE-SA-AINQILGRRWHALSR

50     .    :    .    :    .    :    .    :    .    :
   KEKGKFEDMAKADKARYEREMKTY---------------IP--PKG---E
    ||  |   |    |   || |                  |
   AEKWPFFQEAQKLQAMH-RE-K-----YP------NYKYRPRRKAKMLPK
     |    |  |       |         |               ||   |
   KKKMKYIQDFQREKQEFERNLARFREDHP------DL--IQ--NAK---K
       ||      | |             |                    |
   EEQAKYYELARKERQLH-MQ-L-----YPGWSARDNYGKKKKRKRE---K
comments powered by Disqus