Multiple Sequence Alignment

Learning Objective

You will learn how to compute a multiple sequence alignment (MSA) using SeqAn’s alignment data structures and algorithms.

Difficulty

Basic

Duration

30 min

Prerequisites

Sequences, Alignment

Alignments are at the core of biological sequence analysis and part of the “bread and butter” tasks in this area. As you have learned in the pairwise alignment tutorial, SeqAn offers powerful and flexible functionality for computing such pairwise alignments. This tutorial shows how to compute multiple sequence alignments (MSAs) using SeqAn. First, some background on MSA will be given and the tutorial will then explain how to create multiple sequence alignments.

Note that this tutorial focuses on the <seqan/graph_msa.h> module whose purpose is the computation of global MSAs, i.e. similar to seqan2::T-Coffee [REW+08] or ClustalW [THG94]. If you are interested in computing consensus sequences of multiple overlapping sequences (e.g. NGS reads), similar to assembly after the layouting step, then have a look at the Consensus Alignment tutorial.

While the pairwise alignment of sequences can be computed exactly in quadratic time using dynamic programming, the computation of exact MSAs is harder. Given \(n\) sequences of length \(\ell\), the exact computation of an MSA is only feasible in time \(\mathcal{O}(\ell^n)\). Thus, global MSAs are usually computed using a heuristic called progressive alignment. For an introduction to MSAs, see the Wikipedia Article on Multiple Sequence Aligment.

Computing MSAs with SeqAn

The SeqAn library gives you access to the engine of seqan2::T-Coffee [REW+08], a powerful and efficient MSA algorithm based on the progressive alignment strategy. The easiest way to compute multiple sequence alignments is using the function globalMsaAlignment. The following example shows how to compute a global multiple sequence alignment of proteins using the Blosum62 scoring matrix with gap open penalty -11 and gap extension penalty -1.

First, we include the necessary headers and begin the main function by declaring our strings as a char array.

#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan2;

int main()
{
    char const * strings[4] =
    {
        "DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA"
        "KADKARYEREMKTYIPPKGE",
        "RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQA"
        "MHREKYPNYKYRPRRKAKMLPK",
        "FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQ"
        "EFERNLARFREDHPDLIQNAKK",
        "HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQ"
        "LHMQLYPGWSARDNYGKKKKRKREK"
    };

Next, we build a Align object with underlying SeqAn Strings over the AminoAcid alphabet. We create four rows and assign the previously defined amino acid strings into the rows.

    Align<String<AminoAcid> > align;
    resize(rows(align), 4);
    for (int i = 0; i < 4; ++i)
        assignSource(row(align, i), strings[i]);

Finally, we call globalMsaAlignment and print align to the standard output. We use the Blosum62 score matrix with the penalties from above.

    globalMsaAlignment(align, Blosum62(-1, -11));
    std::cout << align << "\n";

    return 0;
}

The output of the program looks as follows.

      0     .    :    .    :    .    :    .    :    .    : 
        DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSA
          | |   |          |       |    |  | ||     ||    
        RVKRP---MNAFIVWSRDQRRKMALENPRMR-NS-EISKQLGYQWKMLTE
          | |              | | |   | |  |     | |    | | |
        FPKKP---LTPYFRFFMEKRAKYAKLHPEMS-NL-DLTKILSKKYKELPE
          |||   |        | ||                  ||      |  
        HIKKP---LNAFMLYMKEMRANVVAESTLKE-SA-AINQILGRRWHALSR

     50     .    :    .    :    .    :    .    : 
        KEKGKFEDMAKADKARYEREMKTY----------IPPKGE
         ||  |   |    |        |                
        AEKWPFFQEAQKLQAMHREKYPNY--KYRP-RRKAKMLPK
          |    |  |                            |
        KKKMKYIQDFQREKQEFERNLARF--REDH-PDLIQNAKK
            ||      | |           |            |
        EEQAKYYELARKERQLHMQLYPGWSARDNYGKKKKRKREK

Note that we stored the MSA in an Align object which allows easy access to the individual rows of the MSA as Gaps objects. globalMsaAlignment also allows storing the alignment as an AlignmentGraph. While this data structure makes other operations easier, it is less intuitive than the tabular represention of the Align class.

Assignment 1

Type

Review

Objective

Compute a multiple sequence alignment between the four protein sequences from above using a Align object and the Blosum80 score matrix.

Solution

The solution looks as follows.

//![main]
#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan2;

int main()
{
    char const * strings[4] =
    {
        "DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSAKEKGKFEDMA"
        "KADKARYEREMKTYIPPKGE",
        "RVKRPMNAFIVWSRDQRRKMALENPRMRNSEISKQLGYQWKMLTEAEKWPFFQEAQKLQA"
        "MHREKYPNYKYRPRRKAKMLPK",
        "FPKKPLTPYFRFFMEKRAKYAKLHPEMSNLDLTKILSKKYKELPEKKKMKYIQDFQREKQ"
        "EFERNLARFREDHPDLIQNAKK",
        "HIKKPLNAFMLYMKEMRANVVAESTLKESAAINQILGRRWHALSREEQAKYYELARKERQ"
        "LHMQLYPGWSARDNYGKKKKRKREK"
    };

    Align<String<AminoAcid> > align;
    resize(rows(align), 4);
    for (int i = 0; i < 4; ++i)
        assignSource(row(align, i), strings[i]);

    globalMsaAlignment(align, Blosum80(-1, -11));
    std::cout << align << "\n";

    return 0;
}
//![main]

And here is the program’s output.

      0     .    :    .    :    .    :    .    :    .    : 
        DPKKPRGKMSSYAFFVQTSREEHKKKHPDASVNFSEFSKKCSERWKTMSA
          | |   |          |       |    |  | ||     ||    
        RVKRP---MNAFIVWSRDQRRKMALENPRMR-NS-EISKQLGYQWKMLTE
          | |              | | |   | |  |     | |    | | |
        FPKKP---LTPYFRFFMEKRAKYAKLHPEMS-NL-DLTKILSKKYKELPE
          |||   |        | ||                  ||      |  
        HIKKP---LNAFMLYMKEMRANVVAESTLKE-SA-AINQILGRRWHALSR

     50     .    :    .    :    .    :    .    :    .    : 
        KEKGKFEDMAKADKARYEREMKTY---------------IP--PKG---E
         ||  |   |    |   || |                  |         
        AEKWPFFQEAQKLQAMH-RE-K-----YP------NYKYRPRRKAKMLPK
          |    |  |       |         |               ||   |
        KKKMKYIQDFQREKQEFERNLARFREDHP------DL--IQ--NAK---K
            ||      | |             |                    |
        EEQAKYYELARKERQLH-MQ-L-----YPGWSARDNYGKKKKRKRE---K

Computing Consensus Sequences

One common task following the computation of a global MSA for DNA sequences is the computation of a consensus sequence. The type ProfileChar can be used for storing counts for a profile’s individual characters. It is used by creating a String over ProfileChar as the alphabet.

The following program first computes a global MSA of four variants of exon1 of the gene SHH. First, we compute the alignment as in the example above.

#include <iostream>
#include <seqan/align.h>
#include <seqan/graph_msa.h>

using namespace seqan2;

int main()
{
    // some variangs of sonic hedgehog exon 1
    char const * strings[4] =
    {
        // gi|2440284|dbj|AB007129.1| Oryzias latipes
        "GCGGGTCACTGAGGGCTGGGATGAGGACGGCCACCACTTCGAGGAGTCCCTTCACTACGAGGGCAGGGCC"
        "GTGGACATCACCACGTCAGACAGGGACAAGAGCAAGTACGGCACCCTGTCCAGACTGGCGGTGGAAGCTG"
        "GGTTCGACTGGGTCTACTATGAGTCCAAAGCGCACATCCACTGCTCTGTGAAAGCAGAAAGCTCAGTCGC"
        "TGCAAAGTCGGGCGGTTGCTTCCCAGGATCCTCCACGGTCACCCTGGAAAATGGCACCCAGAGGCCCGTC"
        "AAAGATCTCCAACCCGGGGACAGAGTACTGGCCGCGGATTACGACGGAAACCCGGTTTATACCGACTTCA"
        "TCATGTTCAA",
        // gi|1731488|gb|U51350.1|DDU51350 Devario devario
        "CTACGGCAGAAGAAGACATCCGAAAAAGCTGACACCTCTCGCCTACAAGCAGTTCATACCTAATGTCGCG"
        "GAGAAGACCTTAGGGGCCAGCGGCAGATACGAGGGCAAGATAACGCGCAATTCGGAGAGATTTAAAGAAC"
        "TTACTCCAAATTACAATCCCGACATTATCTTTAAGGATGAGGAGAACACG",
        // gi|1731504|gb|U51352.1|PTU51352 Puntius tetrazona
        "CTACGGCAGAAGAAGACATCCCAAGAAGCTGACACCTCTCGCCTACAAGCAGTTTATACCTAATGTCGCG"
        "GAGAAGACCTTAGGGGCCAGCGGCAGATACGAGGGCAAGATCACGCGCAATTCGGAGAGATTTAAAGAAC"
        "TTACTCCAAATTACAATCCCGACATTATCTTTAAGGATGAGGAGAACACT",
        // gi|54399708|gb|AY642858.1| Bos taurus
        "TGCTGCTGCTGGCGAGATGTCTGCTGGTGCTGCTTGTCTCCTCGCTGTTGATGTGCTCGGGGCTGGCGTG"
        "CGGACCCGGCAGGGGATTTGGCAAGAGGCGGAACCCCAAAAAGCTGACCCCTTTAGCCTACAAGCAGTTT"
        "ATCCCCAACGTGGCGGAGAAGACCCTAGGGGCCAGTGGAAGATATGAGGGGAAGATCACCAGAAACTCAG"
        "AGCGATTTAAGGAACTCACCCCCAATTACAACCC"
    };

    Align<DnaString> align;
    resize(rows(align), 4);
    for (int i = 0; i < 4; ++i)
        assignSource(row(align, i), strings[i]);

    globalMsaAlignment(align, SimpleScore(5, -3, -1, -3));
    std::cout << align << "\n";

Then, we create the profile string with the length of the MSA. We then count the number of characters (and gap pseudo-characters which have an ordValue of 4 for Gaps over Dna) at each position.

    // create the profile string
    String<ProfileChar<Dna> > profile;
    resize(profile, length(row(align, 0)));
    for (unsigned rowNo = 0; rowNo < 4u; ++rowNo)
        for (unsigned i = 0; i < length(row(align, rowNo)); ++i)
            profile[i].count[ordValue(getValue(row(align, rowNo), i))] += 1;

Finally, we compute the consensus and print it to the standard output. At each position, the consensus is called as the character with the highest count. Note that getMaxIndex breaks ties by the ordinal value of the characters, i.e. A would be preferred over C, C over G and so on.

    // call consensus from this string
    DnaString consensus;
    for (unsigned i = 0; i < length(profile); ++i)
    {
        int idx = getMaxIndex(profile[i]);
        if (idx < 4)  // is not gap
            appendValue(consensus, Dna(getMaxIndex(profile[i])));
    }

    std::cout << "consensus sequence is\n"
              << consensus << "\n";

    return 0;
}

The output of the program is as follows.

      0     .    :    .    :    .    :    .    :    .    : 
        -GCGGGTCACTGAG-GGCTGGGATGA------------------------
          |   | ||     |||    | ||                        
        --C---T-AC-----GGC----A-GA------------------------
          |   | ||     |||    | ||                        
        --C---T-AC-----GGC----A-GA------------------------
          |   |  |     |||    | ||                        
        TGC---T-GCT-GCTGGC---GA-GATGTCTGCTGGTGCTGCTTGTCTCC

     50     .    :    .    :    .    :    .    :    .    : 
        ---------------------------------------GG---------
                                                |         
        ---------------------------------------AG---------
                                               ||         
        ---------------------------------------AG---------
                                               ||         
        TCGCTGTTGATGTGCTCGGGGCTGGCGTGCGGACCCGGCAGGGGATTTGG

    100     .    :    .    :    .    :    .    :    .    : 
        -ACGGCCAC--C---ACTTCGAGGAGTCCCTTCACTACGAGGGCAGGGCC
         | |   |   |   |          |||    |  |  |    |  || 
        -AAG---A---C---A----------TCC---GA--A-AA----A--GC-
         |||   |   |   |          |||    |  |  |    |  || 
        -AAG---A---C---A----------TCC---CA--A-GA----A--GC-
         |||   |   |   |           ||   ||  |  |    |  || 
        CAAG---A-GGCGGAA----------CCC---CA--A-AA----A--GC-

    150     .    :    .    :    .    :    .    :    .    : 
        GTGGACATCACCACGTCAGACAGGGACAAGAGCAAG--TACGGCACCCTG
         | ||   |||| | || | |    |||  ||| ||   |    | ||| 
        -T-GA---CACCTC-TC-GCC---TACA--AGC-AGTTCA---TA-CCT-
         | ||   |||||| || |||   ||||  ||| |||| |   || ||| 
        -T-GA---CACCTC-TC-GCC---TACA--AGC-AGTTTA---TA-CCT-
         | ||   | |||  |  |||   ||||  ||| ||||||   |  ||  
        -T-GA---CCCCTT-TA-GCC---TACA--AGC-AGTTTA---TC-CCC-

    200     .    :    .    :    .    :    .    :    .    : 
        TCCAGACTG--GCGGTGGAAGCTGGGTTCGACTGGGTCTACTATGAGTCC
           | | ||  ||||  |||         |||     ||  || | | ||
        ---A-A-TGTCGCGG-AGAA---------GAC-----CT--TA-GGGGCC
           | | |||||||| ||||         |||     ||  || ||||||
        ---A-A-TGTCGCGG-AGAA---------GAC-----CT--TA-GGGGCC
           | |  || |||| ||||         |||     |   || ||||||
        ---A-A-CGTGGCGG-AGAA---------GAC-----CC--TA-GGGGCC

    250     .    :    .    :    .    :    .    :    .    : 
        AAAGC-GCACATCCACTGCTCTGTGAAAGCAGAAAGCTCAGTCGCT-GCA
          ||| ||| ||  || |    | |   ||   ||| | |  |||  || 
        --AGCGGCAGAT--AC-G---AG-G---GC---AAGATAA--CGC--GC-
          ||||||||||  || |   || |   ||   ||||| |  |||  || 
        --AGCGGCAGAT--AC-G---AG-G---GC---AAGATCA--CGC--GC-
          || || ||||  |  |   || |   |    |||||||  | |  |  
        --AGTGGAAGAT--AT-G---AG-G---GG---AAGATCA--C-C-AGA-

    300     .    :    .    :    .    :    .    :    .    : 
        AAGTCGGGCGGTTGCTTCCCAG-GATCCTCCACGGTCACCCTGGAAAATG
        || ||||             || |||  |      |        |||   
        AATTCGG-------------AGAGAT--T------T--------AAA---
        |||||||             ||||||  |      |        |||   
        AATTCGG-------------AGAGAT--T------T--------AAA---
        || || |             || |||  |      |        ||    
        AACTCAG-------------AGCGAT--T------T--------AAG---

    350     .    :    .    :    .    :    .    :    .    : 
        GCACCCAGAGGCCCGTCAAAGATCTCCAA-CCCGGGGACAGAGTACTGGC
        | ||  |    | |  |||  || | ||| |||   |||  | || |   
        GAACTTA----CTC--CAA--AT-TACAATCCC---GAC--ATTA-T---
        |||||||    |||  |||  || |||||||||   |||  |||| |   
        GAACTTA----CTC--CAA--AT-TACAATCCC---GAC--ATTA-T---
        ||||| |    | |  | |  || ||||| |||                 
        GAACTCA----CCC--CCA--AT-TACAA-CCC-----------------

    400     .    :    .    :    .    :    .    :    .    : 
        CGCGGA-TTACGACGGA--AACCCGGTTTATACCGACTTCATCATGTTCA
          |    |||  | |||  |    ||   |    ||    | || |    
        --C---TTTA--A-GGATGA----GG---A----GA----A-CACG----
          |   ||||  | ||||||    ||   |    ||    | |||     
        --C---TTTA--A-GGATGA----GG---A----GA----A-CACT----
                                                          
        --------------------------------------------------

    450   
        A
         
        -
         
        -
         
        -



consensus sequence is
GCTACTGGCGAGAAGAAGACATCCCAAAAAGCTGACACCTCTCGCCTACAAGCAGTTTATACCTAATGTCGCGGAGAAGACCTTAGGGGCCAGCGGCAGATACGAGGGCAAGATCACGCGCAATTCGGAGAGATTTAAAGAACTCACCCCAAATTACAATCCCGACATTATCTTTAAGGATGAGGAGAACACG