Strings and Segments

Learning Objective

You will learn about the SeqAn sequence concept and its main class String as well as the class Segment. After completing this tutorial, you will be able to use important functionalities of sequences in SeqAn and you will be ready to continue with the more specific tutorials, e.g. Alignment, or Pairwise Sequence Alignment.

Difficulty

Very basic

Duration

45 min

Prerequisites

Basic C or C++ knowledge, the A First Example tutorial helps.

Sequences are the core concept of SeqAn. A sequence is a container that stores an ordered list of values. In SeqAn, there are three kinds of sequences: Strings, Sequence Adaptions and Segments.

The String class is one of the most fundamental classes in SeqAn. It is designed as a generic data structure that can be instantiated for all kinds of values, both simple (e.g. char, Dna, AminoAcid) and non-simple value types (e.g. Tuple, String). With sequence adaptions, SeqAn offers an interface for accessing data types that are not part of SeqAn, namely standard library strings and c-style char arrays. Thus those built-in types can be handled in a similar way as SeqAn strings, for example with the length function. Segments are contiguous subsequences that represent parts of other sequences.

This tutorial will deal with the SeqAn sequence classes String and Segment.

Strings

In this section, we will have a detailed look at the SeqAn class String. You will learn how to build and expand strings as well as how to compare and convert them.

Defining Strings

Let’s first have a look at a simple example on how to define a String. The type of the contained value is specified by the first template argument, e.g. char or int.

    String<char>  myText;     // A string of characters.
    String<int>   myNumbers;  // A string of integers.

To fill the string with contents, we can simply assign a string literal to the created variable:

    myText = "Hello SeqAn!";

Any type that provides a default constructor, a copy constructor and an assignment operator can be used as the alphabet / contained type of a String. This includes the C++ POD types, e.g. char, int, double etc., or even more complex types complex types, such as Strings.

    String<String<char> >   myStringList;   // A string of character strings.

Hint

Nested Sequences (aka “Strings of Strings”)

A collection of sequences can either be stored in a sequence of sequences, for example in a String< String<char> >, or in a StringSet. The latter one allows for more auxiliary functionalities to improve the efficiency of working with large sequence collections. You can learn more about it in the tutorial String Sets.

SeqAn also provides the following types that are useful in bioinformatics: AminoAcid, Dna, Dna5, DnaQ, Dna5Q, Finite, Iupac, Rna, Rna5. You can find detailed information in the tutorial Alphabets.

    String<Dna>         myGenome;   // A string of nucleotides.
    String<AminoAcid>   myProtein;  // A string of amino acids.

For commonly used string parameterizations, SeqAn has a range of shortcuts implemented, e.g. DnaString, RnaString and Peptide.

    // Instead of String<Dna> dnaSeq we can also write:
    DnaString dnaSeq = "TATA";

Working with Strings

The SeqAn String implementation provides the common C++ operators that you know already from the vector class of the STL. For example:

    String<Dna> dnaSeq = "TATA";
    dnaSeq += "CGCG";
    std::cout << dnaSeq << std::endl;
TATACGCG

Each sequence object has a capacity, i.e. the maximum length of a sequence that can be stored in this object. While some sequence types have a fixed capacity, the capacity of other sequence classes like Alloc String or std::basic_string can be changed at runtime. The capacity can be set explicitly by functions such as reserve or resize. It can also be set implicitly by functions like append or replace, if the operation’s result exceeds the length of the target string.

In the following example we create a String of Dna5String. We first set the new length of the container with resize to two elements. After assigning two elements we append one more element with appendValue. In the last step the capacity is implicitly changed.

    String<Dna5String> readList;
    resize(readList, 2);
    readList[0] = "GGTTTCGACG";
    readList[1] = "AAGATGTCGC";
    appendValue(readList, "TATGCATGAT");

Using the function length, we can now get the length of our strings, e.g.:

    std::cout << length(readList) << std::endl;
    std::cout << length(readList[0]) << std::endl;
3
10

To empty a String, the function clear resets the object.

    clear(readList);

SeqAn offers a range of other functions for the work with the String class, e.g. assign, assignValue, empty, etc. The full list of functions you can find in the documentation String.

Assignment 1

Type

Review

Objective

In the following assignment, you will write a small function that builds the reverse complement of a given string. Copy the code below and add the following functionalities:

  1. Use the resize function to resize the revComplGenome variable.

  2. Using the getRevCompl function, get the reverse complement for every nucleotide genome and store it in reverse order revComplGenome.

  3. Print out the original genome and the reverse complement.

    #include <seqan/sequence.h>
    #include <seqan/basic.h>
    #include <seqan/stream.h>
    #include <seqan/file.h>
    #include <seqan/modifier.h>
    
    using namespace seqan2;
    
    Dna getRevCompl(Dna const & nucleotide)
    {
        if (nucleotide == 'A')
            return 'T';
        if (nucleotide == 'T')
            return 'A';
        if (nucleotide == 'C')
            return 'G';
        return 'C';
    }
    
    int main()
    {
        DnaString genome = "TATATACGCGCGAGTCGT";
        DnaString revComplGenome;
    
    // Your code snippet here
    
        // And to check if your output is correct,
        // use the given SeqAn function reverseComplement(),
        // which modifies the sequence in-place:
        reverseComplement(genome);
        std::cout << genome << std::endl;
        return 0;
    }
    
Hints

Remember that the last element in genome is stored at position length(genome) - 1.

Solution

Click more… to see the solution.

//![top]
#include <seqan/sequence.h>
#include <seqan/basic.h>
#include <seqan/stream.h>
#include <seqan/file.h>
#include <seqan/modifier.h>

using namespace seqan2;

Dna getRevCompl(Dna const & nucleotide)
{
    if (nucleotide == 'A')
        return 'T';
    if (nucleotide == 'T')
        return 'A';
    if (nucleotide == 'C')
        return 'G';
    return 'C';
}

int main()
{
    DnaString genome = "TATATACGCGCGAGTCGT";
    DnaString revComplGenome;
//![top]
    //1.
    resize(revComplGenome, length(genome));
    //2.
    for (unsigned i = 0; i < length(genome); ++i)
        revComplGenome[length(genome) - 1 - i] = getRevCompl(genome[i]);
    //3.
    std::cout << genome << std::endl;
    std::cout << revComplGenome << std::endl;
//![bottom]

    // And to check if your output is correct,
    // use the given SeqAn function reverseComplement(),
    // which modifies the sequence in-place:
    reverseComplement(genome);
    std::cout << genome << std::endl;
    return 0;
}
//![bottom]

Your output should look like this:

TATATACGCGCGAGTCGT
ACGACTCGCGCGTATATA
ACGACTCGCGCGTATATA

Assignment 2

Type

Review

Objective

In this assignment, you will do some simple string building tasks, and write a simple alignment of the given reads and chromosomes. Use the given code template to solve these subtasks:

  1. Assume we have mapped the reads to the positions 7, 100, 172, and 272 in ‘chr1’. Store these positions in another string ‘alignPosList’.

  2. Build another String bsChr1 as a copy of chr1, and exchange every ‘C’ with a ‘T’, as in a bisulfite treated genome.

  3. Print alignments of the reads and chr1 (or bschr1) using the function printAlign and the string alignPosList.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
    std::cout <<  "Alignment " << std::endl;
    std::cout << "  genome : " << genomeFragment << std::endl;
    std::cout << "  read   : " << read << std::endl;
}

int main()
{
    // Build reads and genomes
    DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATTTAG";

    // Build List containing all reads
    typedef String<DnaString> TDnaList;
    TDnaList readList;
    resize(readList, 4);
    readList[0] = "TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT";
    readList[1] = "TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT";
    readList[2] = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA";
    readList[3] = "CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA";

    // Append a second chromosome sequence fragment to chr1
    DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACACATGCTGCTG";
    append(chr1, chr2);

    // Print readlist
    std::cout << " \n Read list: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
        std::cout << readList[i] << std::endl;

    // 1. Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
    // Store the start position in a String alignPosList: 7, 100, 172, 272
// Your code snippet here for 1.+2.
    // 3. Print alignments of the reads with chr1 (or bsChr1) sequence using the function printAlign
    // and the positions in alignPosList.
    // To do that, you have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.
    std::cout << " \n Print alignment: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
    {
        // Begin position beginPosition of a given alignment between the read and the genome
// Your code snippet here for 3.
        // Genome fragment
        DnaString genomeFragment;
// Your code snippet here for 3.
        // Call of our function to print the simple alignment
        printAlign(genomeFragment, readList[i]);
    }
    return 0;
}
Hints

You have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.

Solution

Click more… to see the solution.

//![one]
#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
    std::cout <<  "Alignment " << std::endl;
    std::cout << "  genome : " << genomeFragment << std::endl;
    std::cout << "  read   : " << read << std::endl;
}

int main()
{
    // Build reads and genomes
    DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATTTAG";

    // Build List containing all reads
    typedef String<DnaString> TDnaList;
    TDnaList readList;
    resize(readList, 4);
    readList[0] = "TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT";
    readList[1] = "TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT";
    readList[2] = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA";
    readList[3] = "CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA";

    // Append a second chromosome sequence fragment to chr1
    DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACACATGCTGCTG";
    append(chr1, chr2);

    // Print readlist
    std::cout << " \n Read list: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
        std::cout << readList[i] << std::endl;

    // 1. Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
    // Store the start position in a String alignPosList: 7, 100, 172, 272
//![one]
    String<unsigned> alignPosList;
    resize(alignPosList, 4);
    alignPosList[0] = 7;
    alignPosList[1] = 100;
    alignPosList[2] = 172;
    alignPosList[3] = 272;

    // 2. Bisulfite conversion
    // Assume chr1 is beeing bisulfate treated: Copy chr1 to a new genome bsChr1 and exchange every 'C' with a 'T'
    DnaString bsChr1;
    assign(bsChr1, chr1);
    for (unsigned i = 0; i < length(bsChr1); ++i)
        if (bsChr1[i] == 'C')
            bsChr1[i] = 'T';
//![two]
    // 3. Print alignments of the reads with chr1 (or bsChr1) sequence using the function printAlign
    // and the positions in alignPosList.
    // To do that, you have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.
    std::cout << " \n Print alignment: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
    {
        // Begin position beginPosition of a given alignment between the read and the genome
//![two]
        unsigned beginPosition = alignPosList[i];
//![three]

        // Genome fragment
        DnaString genomeFragment;
//![three]

        // We have to create a copy of the corresponding fragment of the genome, where the read aligns to
        for (unsigned j = 0; j < length(readList[i]); ++j)
            appendValue(genomeFragment, chr1[beginPosition + j]);
//![four]

        // Call of our function to print the simple alignment
        printAlign(genomeFragment, readList[i]);
    }
    return 0;
}
//![four]
 
 Read list: 
TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA
 
 Print alignment: 
Alignment 
  genome : TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
  read   : TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
Alignment 
  genome : TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
  read   : TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
Alignment 
  genome : AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
  read   : AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
Alignment 
  genome : CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA
  read   : CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA

Comparisons

Two sequences can be lexicographically compared using standard operators such as < or >=.

    String<char> a = "beta";
    String<char> b = "alpha";

    std::cout << (a != b) << std::endl;
    std::cout << (a < b) << std::endl;
    std::cout << (a > b) << std::endl;
1
0
1

Each comparison involves a scan of the two sequences for searching the first mismatch between the strings. This could be costly if the two sequences share a long common prefix. Suppose we want to branch in a program depending on whether a < b, a == b, or a > b.

    if (a < b)      { /* code for case "a < b"  */ }
    else if (a > b) { /* code for case "a > b"  */ }
    else            { /* code for case "a == b" */ }

In this case, although only one scan would be enough to decide what case is to be applied, each operator > and < performs a new comparison. SeqAn offers the class Lexical to avoid unnecessary sequence scans. Lexicals can store the result of a comparison, for example:

    // Compare a and b and store the result in comp
    Lexical<> comp(a, b);

    if (isLess(comp))         { /* code for case "a < b"  */ }
    else if (isGreater(comp)) { /* code for case "a > b"  */ }
    else                      { /* code for case "a == b" */ }

Conversions

A sequence of type A values can be converted into a sequence of type B values, if A can be converted into B. SeqAn offers different conversion alternatives.

Copy conversion. The source sequence is copied into the target sequence. This can be done by assignment (operator=) or using the function assign.

    String<Dna> dna_source = "acgtgcat";
    String<char> char_target;
    assign(char_target, dna_source);
    std::cout << char_target << std::endl;
ACGTGCAT

Move conversion. If the source sequence is not needed any more after the conversion, it is always advisable to use move instead of assign. The function move does not make a copy but can reuse the source sequence storage. In some cases, move can also perform an in-place conversion.

    String<char> char_source = "acgtgcat";
    String<Dna> dna_target;

    // The in-place move conversion.
    move(dna_target, char_source);
    std::cout << dna_target << std::endl;
ACGTGCAT

Assignment 3

Type

Review

Objective

In this assignment you will sort nucleotides. Copy the code below. Adjust the code such that all nucleotides, which are lexicographically smaller than a Dna5 'G' are stored in a list lesser, while all nucleotides which are greater, should be stored in a list greater. Print out the final lists.

#include <seqan/stream.h>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan2;

int main()
{
    String<Dna5> nucleotides = "AGTCGTGNNANCT";
    String<Dna5> selected;
    // Append all elements of nucleotides, apart of Gs,
    // to the list selected.
    for (unsigned i = 0; i < length(nucleotides); ++i)
    {
        appendValue(selected, nucleotides[i]);
    }
    std::cout << "Selected nucleotides: " << selected << std::endl;
    return 0;
}
Solution

Click more… to see the solution.

   #include <seqan/stream.h>
   #include <seqan/sequence.h>
   #include <seqan/file.h>

   using namespace seqan2;

   int main()
   {
       String<Dna5> nucleotides = "AGTCGTGNNANCT";
       String<Dna5> lesser;
       String<Dna5> greater;

       for (unsigned i = 0; i < length(nucleotides); ++i){
	   if (nucleotides[i] < 'G')
	       appendValue(lesser, nucleotides[i]);
	   else if (nucleotides[i] > 'G')
	       appendValue(greater, nucleotides[i]);
       }
       std::cout << "Lesser nucleotides: " << lesser << std::endl;
       std::cout << "Greater nucleotides: " << greater << std::endl;
   }
Lesser nucleotides: ACAC
Greater nucleotides: TTNNNT

Assignment 4

Type

Transfer

Objective

In this task you will compare whole sequences. Reuse the code from above. Instead of a String<Dna5> we will now deal with a String<Dna5String>. Build a string which contains the Dna5Strings “ATATANGCGT”, “AAGCATGANT” and “TGAAANTGAC”. Now check for all elements of the container, if they are lexicographically smaller or bigger than the given subject sequence “GATGCATGAT” and append them to a appropriate list. Print out the final lists.

Hints

Try to avoid unnecessary sequence scans.

Solution

Click more… to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

int main()
{
    String<Dna5String> nucleotidesList;
    Dna5String str1 = "ATATANGCGT";
    Dna5String str2 = "AAGCATGANT";
    Dna5String str3 = "TGAAANTGAC";
    resize(nucleotidesList, 3);
    nucleotidesList[0] = str1;
    nucleotidesList[1] = str2;
    nucleotidesList[2] = str3;

    String<Dna5String> lesser;
    String<Dna5String> greater;
    Dna5String ref = "GATGCATGAT";

    // For each Dna5String of the String:
    for (unsigned i = 0; i < length(nucleotidesList); ++i)
    {
        // Compare the Dna5String with the given reference string
        // The result of the comparison is stored in comp
        Lexical<> comp(nucleotidesList[i], ref);
        // The function isLess checks only the stored result
        // without comparing the sequences again
        if (isLess(comp))
            appendValue(lesser, nucleotidesList[i]);
        else if (isGreater(comp))
            appendValue(greater, nucleotidesList[i]);
    }
    // Print the results
    std::cout << "Lesser sequences: " << std::endl;
    for (unsigned i = 0; i < length(lesser); ++i)
    {
        std::cout << lesser[i] << ", ";
    }
    std::cout << std::endl;
    std::cout << "Greater sequences: " << std::endl;
    for (unsigned i = 0; i < length(greater); ++i)
    {
        std::cout << greater[i] << ", ";
    }
}
Lesser sequences: 
ATATANGCGT, AAGCATGANT, 
Greater sequences: 
TGAAANTGAC, 

Iteration

Very often you will be required to iterate over your string to either retrieve what’s stored in the string or to write something at a specific position. For this purpose SeqAn provides Iterators for all container types. The metafunction Iterator can be used to determine the appropriate iterator type for a given a container.

An iterator always points to one value of the container. The operator operator* can be used to access this value by reference. Functions like operator++(prefix) or operator–(prefix) can be used to move the iterator to other values within the container.

The functions begin and end, applied to a container, return iterators to the begin and to the end of the container. Note that similar to C++ standard library iterators, the iterator returned by end does not point to the last value of the container but to the position behind the last one. If the container is empty then end() == begin().

The following code prints out a sequence and demonstrates how to iterate over a string.

    DnaString genome = "ACGTACGTACGT";
    typedef Iterator<DnaString>::Type TIterator;
    for (TIterator it = begin(genome); it != end(genome); ++it)
    {
        std::cout << *it;
    }
ACGTACGTACGT

Different Iterator Types

Some containers offer several kinds of iterators, which can be selected by an optional template parameter of the Iterator class. For example, the tag Standard can be used to get an iterator type that resembles the C++ standard random access iterator. For containers there is also a second variant available, the so called Rooted iterator. The rooted iterator knows its container by pointing back to it. This gives us a nice interface to access member functions of the underlying container while operating on a rooted iterator. The construction of an iterator in SeqAn, e.g. for a Dna String, could look like the following:

    Iterator<DnaString>::Type           it1;  // A standard iterator
    Iterator<DnaString, Standard>::Type it2;  // Same as above
    Iterator<DnaString, Rooted>::Type   it3;  // A rooted iterator

Tip

The default iterator implementation is Standard. Rooted iterators offer some convenience interfaces for the user. They offer additional functions like container for determining the container on which the iterator works, and they simplify the interface for other functions like atEnd. Moreover, rooted iterators may change the container’s length or capacity, which makes it possible to implement a more intuitive variant of a remove algorithm.

While rooted iterators can usually be converted into standard iterators, it is not always possible to convert standard iterators back into rooted iterators, since standard iterators may lack the information about the container they work on. Therefore, many functions that return iterators like begin or end return rooted iterators instead of standard iterators; this way, they can be used to set both rooted and standard iterator variables. Alternatively it is possible to specify the returned iterator type explicitly by passing the iterator kind as a tag argument, e.g. begin(str, Standard()).

Assignment 5

Type

Review

Objective

Copy the code below, which replaces all N’s of a given String with A’s. Adjust the code to use iterators to traverse the container. Use the Standard iterator.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

int main()
{
    Dna5String genome = "ANTGGTTNCAACNGTAANTGCTGANNNACATGTNCGCGTGTA";
    for (unsigned i = 0; i < length(genome); ++i)
    {
        if (genome[i] == 'N')
            genome[i] = 'A';
    }
    std::cout << "Modified genome: " << genome << std::endl;
    return 0;
}

Solution

Click more… to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

int main()
{
    Dna5String genome = "ANTGGTTNCAACNGTAANTGCTGANNNACATGTNCGCGTGTA";

    Iterator<Dna5String>::Type it = begin(genome);
    Iterator<Dna5String>::Type itEnd = end(genome);

    for (; it != itEnd; goNext(it))
    {
        if (*it == 'N')
            *it = 'A';
    }
    std::cout << "Modified genome: " << genome << std::endl;
    return 0;
}

Assignment 6

Type

Application

Objective

Use the code from above and change the Standard to a Rooted iterator. Try to shorten the code wherever possible.

Solution

Click more… to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

int main()
{
    Dna5String genome = "ANTGGTTNCAACNGTAANTGCTGANNNACATGTNCGCGTGTA";

    Iterator<Dna5String, Rooted>::Type it = begin(genome);

    for (; !atEnd(it); goNext(it))
    {
        if (getValue(it) == 'N')
            value(it) = 'A';
    }
    std::cout << "Modified genome: " << genome << std::endl;
    return 0;
}

String Allocation Strategies

Each sequence object has a capacity, i.e. the reserved space for this object. The capacity can be set explicitly by functions such as reserve or resize. It can also bet set implicitly by functions like append, assign, insert or replace, if the operation’s result exceeds the length of the target sequence.

If the current capacity of a sequence is exceeded by chaining the length, we say that the sequence overflows. There are several overflow strategies that determine what actually happens when a string should be expanded beyond its capacity. The user can specify this for a function call by additionally handing over a tag. If no overflow strategy is specified, a default overflow strategy is selected depending on the type of the sequence.

The following overflow strategies exist:

Exact

Expand the sequence exactly as far as needed. The capacity is only changed if the current capacity is not large enough.

Generous

Whenever the capacity is exceeded, the new capacity is chosen somewhat larger than currently needed. This way, the number of capacity changes is limited in a way that resizing the sequence only takes amortized constant time.

Limit

Instead of changing the capacity, the contents are limited to current capacity. All values that exceed the capacity are lost.

Insist

No capacity check is performed, so the user has to ensure that the container’s capacity is large enough.

The next example illustrates how the different strategies could be used:

    String<Dna> dnaSeq;
    // Sets the capacity of dnaSeq to 5.
    resize(dnaSeq, 4, Exact());
    // Only "TATA" is assigned to dnaSeq, since dnaSeq is limited to 4.
    assign(dnaSeq, "TATAGGGG", Limit());
    std::cout << dnaSeq << std::endl;
    // Use the default expansion strategy.
    append(dnaSeq, "GCGCGC");
    std::cout << dnaSeq << std::endl;
TATA
TATAGCGCGC

Assignment 7

Type

Review

Objective

Build a string of Dna (default specialization) and use the function appendValue to append a million times the nucleotide ‘A’. Do it both using the overflow strategy Exact and Generous. Measure the time for the two different strategies.

Solution

Click more… to see the solution.

#include <iostream>
#include <seqan/sequence.h>

using namespace seqan2;

int main()
{
    unsigned num = 100000;
    double start;

    String<Dna> str;
    clear(str);
    start = sysTime();
    for (unsigned i = 0; i < num; ++i)
        appendValue(str, 'A', Exact());
    std::cout << "Strategy Exact() took: " << sysTime() - start << " s\n\n";

    clear(str);
    shrinkToFit(str);
    start = sysTime();
    for (unsigned i = 0; i < num; ++i)
        appendValue(str, 'A', Generous());
    std::cout << "Strategy Generous() took: " << sysTime() - start << " s\n\n";

    return 0;
}

String Specializations

The user can specify the kind of string that should be used in an optional second template argument of String. The default string implementation is Alloc String.


In most cases, the implementation Alloc String (the default when using a String<T>) is the best choice. Exceptions are when you want to process extremely large strings that are a bit larger than the available memory (consider Alloc String) or much larger so most of them are stored on the hard disk and only parts of them are loaded in main memory (consider External String). The following list describes in detail the different specializations:

Specialization Alloc String
  • Description Expandable string that is stored on the heap.

  • Applications The default string implementation that can be used for general purposes.

  • Limitations Changing the capacity can be very costly since all values must be copied.

Specialization Array String
  • Description Fast but non-expandable string. Fast storing of fixed-size sequences.

  • Limitations Capacity must already be known at compile time. Not suitable for storing large sequences.

Specialization Block String
  • Description String that stores its sequence characters in blocks.

  • Applications The capacity of the string can quickly be increased. Good choice for growing strings or stacks.

  • Limitations Iteration and random access to values is slightly slower than for Alloc String.

Specialization Packed String
  • Description A string that stores as many values in one machine word as possible.

  • Applications Suitable for storing large strings in memory.

  • Limitations Slower than other in-memory strings.

Specialization External String
  • Description String that is stored in secondary memory.

  • Applications Suitable for storing very large strings (>2GB). Parts of the string are automatically loaded from secondary memory on demand.

  • LimitationsApplications Slower than other string classes.

Specialization Journaled String
  • Description String that stores differences to an underlying text rather than applying them directly.

  • Applications Suitable for efficiently storing similar strings, if their differences to an underlying reference sequence are known.

  • LimitationsApplications Slower than other string classes, due to logarithmic penalty for random accesses.

Specialization CStyle String
  • Description Allows adaption of strings to C-style strings.

  • Applications Used for transforming other String classes into C-style strings (i.e. null terminated char arrays). Useful for calling functions of C-libraries.

  • Limitations Only sensible if value type is char or wchar_t.

    // String with maximum length 100.
    String<char, Array<100> > myArrayString;
    // String that takes only 2 bits per nucleotide.
    String<Dna, Packed<> > myPackedString;
    // Most of the string is stored on the disk.
    String<Dna, External<> > myLargeGenome;

Tip

String Simplify Memory Management

One advantage of using Strings is that the user does not need to reserve memory manually with new and does not need delete to free memory. Instead, those operations are automatically handled by the String class.

    String<Dna> myDnaGenome = "TATACGCG";

Segments

The following section will introduce you into the Segment class of SeqAn.

Segments are contiguous subsequences that represent parts of other sequences. Therefore, their functionality is similar to the String functionality. In SeqAn, there are three kinds of segments: InfixSegment, PrefixSegment, and SuffixSegment. The metafunctions Infix, Prefix, and Suffix, respectively, return the appropriate segment data type for a given sequence type.

For prefixes, we use the function prefix to build the prefix. The first parameter is the sequence we build the prefix from, the second the excluding end position. For infixes, we have to provide both the including start and the excluding end position. For suffixes, the second parameter of the function denotes the including starting position of the suffix:

    String<Dna> dnaSeq = "AGTTGGCATG";
    Prefix<String<Dna> >::Type pre = prefix(dnaSeq, 4);
    std::cout << "Prefix: " << pre << std::endl;

    Infix<String<Dna> >::Type inf = infix(dnaSeq, 4, 7);
    std::cout << "Infix: " << inf << std::endl;

    Suffix<String<Dna> >::Type suf = suffix(dnaSeq, 4);
    std::cout << "Suffix: " << suf << std::endl;
Prefix: AGTT
Infix: GGC
Suffix: GGCATG

Segments store a pointer on the underlying sequence object, the host, and an start and/or end position, depending on the type of segment. The segment is not a copy of the sequence segment.

Warning

Please note that it is not possible anymore to change the underlying sequence by changing the segment. If you want to change the host sequence, you have to explicitly modify this. If you want to modify only the segment, you have to explicitly make a copy of the string.

Assignment 8

Type

Application

Objective

In this task you will use a segment to pass over an infix of a given sequence to a function without copying the corresponding fragment. Use the code given below. Lets assume that we have given a genome and a read sequence as well as the begin position of a given alignment. In the main function a fragment of the Dna5String genome is copied and passed together with the Dna5String read to a print function. Adjust the code to use an infix of the genome, instead of copying the corresponding fragment.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

// Function to print simple alignment between two sequences with the same length
// .. for two sequences of different types
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
    std::cout <<  "Alignment " << std::endl;
    std::cout << "  genome : ";
    std::cout << genomeFragment << std::endl;
    std::cout << "  read   : ";
    std::cout << read << std::endl;
}

int main()
{
    // We have given a genome sequence
    Dna5String genome = "ATGGTTTCAACGTAATGCTGAACATGTCGCGT";
    // A read sequence
    Dna5String read = "TGGTNTCA";
    // And the begin position of a given alignment between the read and the genome
    unsigned beginPosition = 1;
    // We have to create a copy of the corresponding fragment of the genome, where the read aligns to
    // Change this piece of code using an infix of the genome
    Dna5String genomeFragment;
    for (unsigned i = 0; i < length(read); ++i)
    {
        appendValue(genomeFragment, genome[beginPosition + i]);
    }
    // Call of our function to print the simple alignment
    printAlign(genomeFragment, read);
    return 0;
}
Solution

Click more… to see the solution.

//![top]
#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;

// Function to print simple alignment between two sequences with the same length
// .. for two sequences of different types
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
    std::cout <<  "Alignment " << std::endl;
    std::cout << "  genome : ";
    std::cout << genomeFragment << std::endl;
    std::cout << "  read   : ";
    std::cout << read << std::endl;
}

int main()
{
    // We have given a genome sequence
    Dna5String genome = "ATGGTTTCAACGTAATGCTGAACATGTCGCGT";
    // A read sequence
    Dna5String read = "TGGTNTCA";
    // And the begin position of a given alignment between the read and the genome
    unsigned beginPosition = 1;
//![top]

    // Create Infix of type Dna5String and get the corresponding infix sequence of genome
    Infix<Dna5String>::Type genomeFragment = infix(genome, beginPosition, beginPosition + length(read));

//![bottom]
    // Call of our function to print the simple alignment
    printAlign(genomeFragment, read);
    return 0;
}
//![bottom]
Alignment 
  genome : TGGTTTCA
  read   : TGGTNTCA

Assignment 9

Type

Review

Objective

Take the solution from the workshop assignment above and change it to use Segments for building the genome fragment.

Hints

Note that because printAlign uses templates, you don’t have to change the function even though the type of genomeFragment is different.

Solution

Click more… to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/stream.h>

using namespace seqan2;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
    std::cout <<  "Alignment " << std::endl;
    std::cout << "  genome : " << genomeFragment << std::endl;
    std::cout << "  read   : " << read << std::endl;
}

int main(int, char const **)
{
    // Build reads and genomes
    DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATTTAG";
    // Build List containing all reads
    typedef String<DnaString> TDnaList;
    TDnaList readList;
    resize(readList, 4);
    readList[0] = "TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT";
    readList[1] = "TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT";
    readList[2] = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA";
    readList[3] = "CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA";
    // Append a second chromosome sequence fragment to chr1
    DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACACATGCTGCTG";
    append(chr1, chr2);
    // Print readlist
    std::cout << " \n Read list: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
        std::cout << readList[i] << std::endl;
    // Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
    // Store the start position in a String alignPosList: 7, 100, 172, 272
    String<unsigned> alignPosList;
    resize(alignPosList, 4);
    alignPosList[0] = 7;
    alignPosList[1] = 100;
    alignPosList[2] = 172;
    alignPosList[3] = 272;
    // Optional
    // Bisulfite conversion
    // Assume chr1 is beeing bisulfate treated: Copy chr1 to a new genome bsChr1 and exchange every 'C' with a 'T'
    DnaString bsChr1;
    assign(bsChr1, chr1);
    for (unsigned i = 0; i < length(bsChr1); ++i)
        if (bsChr1[i] == 'C')
            bsChr1[i] = 'T';
    // Print alignments using Segment: Do the same as above, but instead of using a for loop to build the fragment,
    // use the Segment class to build an infix of bsChr1.
    // Note: Because printAlign uses templates, we don't have to change the function even though the type of
    // genomeFragment is different.
    std::cout << " \n Print alignment using Segment: " << std::endl;
    for (unsigned i = 0; i < length(readList); ++i)
    {
        // Begin and end position of a given alignment between the read and the genome
        unsigned beginPosition = alignPosList[i];
        unsigned endPosition = beginPosition + length(readList[i]);
        // Build infix
        Infix<DnaString>::Type genomeFragment = infix(chr1, beginPosition, endPosition);
        // Call of our function to print the simple alignment
        printAlign(genomeFragment, readList[i]);
    }
    return 0;
}
 
 Read list: 
TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA
 
 Print alignment using Segment: 
Alignment 
  genome : TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
  read   : TTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCT
Alignment 
  genome : TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
  read   : TCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATT
Alignment 
  genome : AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
  read   : AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACA
Alignment 
  genome : CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA
  read   : CGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACA