Alphabets

Learning Objective
You will learn the details about the alphabets in SeqAn.
Difficulty
Basic
Duration
15 min
Prerequisites
A First Example

This tutorial describes the alphabets used in SeqAn, that is, the value types a SeqAn String can contain. To continue with the other tutorials, it is enough to know that SeqAn predefines several standard alphabets, e.g. Dna, Dna5, Rna, Rna5, Iupac, and AminoAcid.

Types

Any type that provides a default constructor, a copy constructor and an assignment operator can be used as the alphabet / contained type of a String (see also the tutorial Sequences). This includes the C++ POD types such as char, int, and double. You can also use more complex types such as String as the contained type of strings, e.g. String<String<char> >.

SeqAn also provides the following types that are useful in bioinformatics. Each of them is a specialization of the class SimpleType.

Specialization   Description
AminoAcid        Amino acid alphabet
Dna              DNA alphabet
Dna5             DNA alphabet including N character
DnaQ             DNA alphabet plus phred quality
Dna5Q            DNA alphabet plus phred quality including N character
Finite           Finite alphabet of fixed size
Iupac            DNA IUPAC code
Rna              RNA alphabet
Rna5             RNA alphabet including N character

Functionality

In SeqAn, alphabets are value types that can take only a limited number of values and can therefore be mapped to a range of natural numbers. The number of different values of an alphabet, the alphabet size, can be retrieved with the metafunction ValueSize.

    typedef Dna TAlphabet;

    unsigned alphSize = ValueSize<TAlphabet>::VALUE;
    std::cout << "Alphabet size of Dna: " << alphSize << '\n';
Alphabet size of Dna: 4

Another useful metafunction called BitsPerValue can be used to determine the number of bits needed to store a value of a given alphabet.

    unsigned bits = BitsPerValue<TAlphabet>::VALUE;
    std::cout << "Number of bits needed to store a value of type Dna: " << bits << '\n';
Number of bits needed to store a value of type Dna: 2

The order of a character in the alphabet (i.e. its corresponding natural number) can be retrieved by calling the function ordValue. See each specialization’s documentation for the ordering of the alphabet’s values.

    Dna a = 'A';
    Dna c = 'C';
    Dna g = 'G';
    Dna t = 'T';

    std::cout << "A: " << (unsigned)ordValue(a) << '\n';
    std::cout << "C: " << (unsigned)ordValue(c) << '\n';
    std::cout << "G: " << (unsigned)ordValue(g) << '\n';
    std::cout << "T: " << (unsigned)ordValue(t) << '\n';
A: 0
C: 1
G: 2
T: 3

Tip

The return type of the ordValue function is determined by the metafunction ValueSize: ValueSize<T>::Type is the type that uses the least amount of memory while still being able to represent all possible values of the alphabet. For example, ValueSize<Dna>::Type is __uint8, an 8-bit type that can represent 256 different values. Note that std::cout prints 8-bit integer types as characters, and small ordinal values correspond to non-printable characters, so a cast to unsigned is usually needed to print them as numbers.

Assignment 1

Type
Application
Objective

In this task you will learn how to access all the letters of an alphabet. Use the code below and adjust the function showAllLettersOfMyAlphabet() so that it goes through all the characters of the given alphabet and prints them.

#include <seqan/sequence.h>
#include <seqan/basic.h>
#include <iostream>

using namespace seqan;

// We want to define a function, which takes
// the alphabet type as an argument
template <typename TAlphabet>
void showAllLettersOfMyAlphabet(TAlphabet const &)
{
    // ...
}

int main()
{
    showAllLettersOfMyAlphabet(AminoAcid());
    showAllLettersOfMyAlphabet(Dna());
    showAllLettersOfMyAlphabet(Dna5());
    return 0;
}
Hints
You will need the Metafunction ValueSize.
Solution


#include <seqan/sequence.h>
#include <seqan/basic.h>
#include <iostream>

using namespace seqan;

// We define a function which takes
// the alphabet type as an argument
template <typename TAlphabet>
void showAllLettersOfMyAlphabet(TAlphabet const &)
{
    typedef typename ValueSize<TAlphabet>::Type TSize;
    // We need to determine the alphabet size
    // using the metafunction ValueSize
    TSize alphSize = ValueSize<TAlphabet>::VALUE;
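    // We iterate over all characters of the alphabet
    // and output them. The index is cast to unsigned so
    // that it is printed as a number even if TSize is an
    // 8-bit type.
    for (TSize i = 0; i < alphSize; ++i)
        std::cout << static_cast<unsigned>(i) << ',' << TAlphabet(i) << "  ";
    std::cout << std::endl;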
}

int main()
{
    showAllLettersOfMyAlphabet(AminoAcid());
    showAllLettersOfMyAlphabet(Dna());
    showAllLettersOfMyAlphabet(Dna5());
    return 0;
}