# Sequences¶

Learning Objective
You will learn about the SeqAn sequence concept and its main class String as well as the class Segment. After completing this tutorial, you will be able to use important functionalities of sequences in SeqAn and you will be ready to continue with the more specific tutorials, e.g. Iterators, Alignment Representation, or Pairwise Sequence Alignment.
Difficulty
Very basic
Duration
45 min
Prerequisites
Basic C or C++ knowledge, the A First Example tutorial helps.

Sequences are the core concept of SeqAn. A sequence is a container that stores an ordered list of values. In SeqAn, there are three kinds of sequences: Strings, Sequence Adaptions and Segments.

The String class is one of the most fundamental classes in SeqAn. It is designed as a generic data structure that can be instantiated for all kinds of values, both simple (e.g. char, Dna, AminoAcid) and non-simple value types (e.g. Tuple, String). With sequence adaptions, SeqAn offers an interface for accessing data types that are not part of SeqAn, namely standard library strings and c-style char arrays. Thus those built-in types can be handled in a similar way as SeqAn strings, for example with the length function. Segments are contiguous subsequences that represent parts of other sequences.

This tutorial will deal with the SeqAn sequence classes String and Segment.

## Strings¶

In this section, we will have a detailed look at the SeqAn class String. You will learn how to build and expand strings as well as how to compare and convert them.

### Building Strings¶

Let’s first have a look at an example on how to define a String. The type of the contained value is specified by the first template argument, e.g. char or int.

String<char>  myText;     // A string of characters.
String<int>   myNumbers;  // A string of integers.


Any type that provides a default constructor, a copy constructor and an assignment operator can be used as the alphabet / contained type of a String. This includes the C++ POD types, e.g. char, int, double etc., but you can use more complex types, e.g. Strings, too.

String<String<char> >   myStringList;   // A string of character strings.


Hint

Nested Sequences (aka “Strings of Strings”)

A set of sequences can either be stored in a sequence of sequences, for example in a String< String<char> >, or in StringSet. See the tutorial String Sets for more information about the class StringSet.

SeqAn also provides the following types that are useful in bioinformatics: AminoAcid, Dna, Dna5, DnaQ, Dna5Q, Finite, Iupac, Rna, Rna5. You can find detailed information in the tutorial Alphabets.

String<Dna>        myGenome;   // A string of nucleotides.
String<AminAcid>   myProtein;  // A string of amino acids.


For commonly used string parameterizations, SeqAn has a range of shortcuts implemented, e.g. DnaString, RnaString and Peptide.

// Instead of String<Dna> dnaSeq we can also write:
DnaString dnaSeq = "TATA";


The user can specify the kind of string that should be used in an optional second template argument of String. This is also known as selecting the specialization of a class in SeqAn. The default string implementation is Alloc String, which the best choice for most cases.

String<Dna>              myGenome;   // A default string of nucleotides.
String<Dna, Alloc<> >    myGenome;   // The same as above.


For some scenarios though, there might be other types more suitable. One such example is when processing extremely large strings that are much larger than the available main memory. In this case, using External Strings is a good choice.

// Most of the string is stored on the disk.
String<Dna, External<> > myLargeGenome;


More details about the different specializations you can find in the tutorial Sequences In-Depth.

Tip

String Simplify Memory Management

One advantage of using Strings is that the user does not need to reserve memory manually with new and does not need delete to free memory. Instead, those operations are automatically handeld by the String class.

String<Dna> myGenome = "TATACGCG";


### Functionality¶

SeqAn also provides the common C++ operators for strings. You can use them like STL strings, for example:

String<Dna> dnaSeq = "TATA";
dnaSeq += "CGCG";
std::cout << dnaSeq << std::endl;

TATACGCG


Each sequence object has a capacity, i.e. the maximum length of a sequence that can be stored in this object. While some sequence types have a fixed capacity, the capacity of other sequence classes like Alloc String or std::basic_string can be changed at runtime. The capacity can be set explicitly by functions such as reserve or resize. It can also be set implicitly by functions like append or replace, if the operation’s result exceeds the length of the target string.

In the following example, a String of Dna5String, we first set the new length of the container with resize to two elements. After assigning two elements we append one more element with appendValue. In the last step the capacity is implicitly changed.

String<Dna5String> readList;


Using the function length, we can now get the length of our strings, e.g.:

std::cout << length(readList) << std::endl;

3
10


To empty a String, the function clear resets the object.

clear(readList);


SeqAn offers a range of other functions for the work with the String class, e.g. assign, assignValue, value, getValue, empty, etc. The full list of functions you can find in the documentation String.

### Assignment 1¶

Type
Review
Objective

In the following assignment, you will write a small function that builds the reverse complement of a given string. Copy the code below and add the following functionalities:

1. Use the resize function to resize the revComplGenome variable.
2. Using the getRevCompl function, get the reverse complement for every nucleotide genome and store it in reverse order revComplGenome.
3. Print out the original genome and the reverse complement.
#include <seqan/sequence.h>
#include <seqan/basic.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/modifier.h>

using namespace seqan;

Dna getRevCompl(Dna const & nucleotide)
{
if (nucleotide == (Dna)'A')
return (Dna)'T';
if (nucleotide == (Dna)'T')
return (Dna)'A';
if (nucleotide == (Dna)'C')
return (Dna)'G';
return (Dna)'C';
}

int main()
{
DnaString genome = "TATATACGCGCGAGTCGT";
DnaString revComplGenome;

// And to check if your output is correct,
// use the given SeqAn function reverseComplement(),
// which modifies the sequence in-place
reverseComplement(genome);
std::cout << genome << std::endl;
return 0;
}

Hints
Remember that the last element in genome is stored at position length(genome) - 1.
Solution

Click more... to see the solution.

#include <seqan/sequence.h>
#include <seqan/basic.h>
#include <iostream>
#include <seqan/file.h>
#include <seqan/modifier.h>

using namespace seqan;

Dna getRevCompl(Dna const & nucleotide)
{
if (nucleotide == (Dna)'A')
return (Dna)'T';
if (nucleotide == (Dna)'T')
return (Dna)'A';
if (nucleotide == (Dna)'C')
return (Dna)'G';
return (Dna)'C';
}

int main()
{
DnaString genome = "TATATACGCGCGAGTCGT";
DnaString revComplGenome;
resize(revComplGenome, length(genome));

for (unsigned i = 0; i < length(genome); ++i)
{
revComplGenome[length(genome) - 1 - i] = getRevCompl(genome[i]);
}

std::cout << genome << std::endl;
std::cout << revComplGenome << std::endl;
reverseComplement(genome);
std::cout << genome << std::endl;
return 0;
}


### Workshop Assignment 1¶

Type
Review
Objective
In this assignment, you will do some simple string building tasks, and write a simple alignment of the given reads and chromosomes. Use the given code template to solve these subtasks:
1. Assume we have mapped the reads to the positions 7, 100, 172, and 272 in ‘chr1’. Store these positions in another string ‘alignPosList’.
2. Build another String bsChr1 as a copy of chr1, and exchange every ‘C’ with a ‘T’, as in a bisulfite treated genome.
3. Print alignments of the reads and chr1 (or bschr1) using the function printAlign and the string alignPosList.
#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
std::cout <<  "Alignment " << std::endl;
std::cout << "  genome : " << genomeFragment << std::endl;
}

int main(int, char const **)
{
DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTG"
"CGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATC"
"GATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATA"
"TTAGCGGTCTAGCATTTAG";

// Build List containing all reads
typedef String<DnaString> TDnaList;

// Append a second chromosome sequence fragment to chr1
DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCAT"
"GTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTG"
"CACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGAC"
"ACATGCTGCTG";
append(chr1, chr2);

std::cout << " \n Read list: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)

// Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
// Store the start position in a String alignPosList: 7, 100, 172, 272

// Optional
// Bisulfite conversion
// Assume chr1 is beeing bisulfate treated: Copy chr1 to a new genome bsChr1 and exchange every 'C' with a 'T'
DnaString bsChr1;

// Print alignments of the reads with chr1 (or bsChr1) sequence using the function printAlign
// and the positions in alignPosList.
// To do that, you have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.
std::cout << " \n Print alignment: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)
{
// Temporary copy of begin position (beginPosition) from alignPosList
// of a given alignment between the read and the genome

// Genome fragment
DnaString genomeFragment;
// We have to create a copy of the corresponding fragment of the genome, where the read aligns to

// Call of our function to print the simple alignment
}
return 1;
}

Hints
You have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.
Solution

Click more... to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
std::cout <<  "Alignment " << std::endl;
std::cout << "  genome : " << genomeFragment << std::endl;
}
int main(int, char const **)
{
DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATTTAG";
// Build List containing all reads
typedef String<DnaString> TDnaList;
// Append a second chromosome sequence fragment to chr1
DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACACATGCTGCTG";
append(chr1, chr2);
std::cout << " \n Read list: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)
// Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
// Store the start position in a String alignPosList: 7, 100, 172, 272
String<unsigned> alignPosList;
resize(alignPosList, 4);
alignPosList[0] = 7;
alignPosList[1] = 100;
alignPosList[2] = 172;
alignPosList[3] = 272;
// Optional
// Bisulfite conversion
// Assume chr1 is beeing bisulfate treated: Copy chr1 to a new genome bsChr1 and exchange every 'C' with a 'T'
DnaString bsChr1;;
assign(bsChr1, chr1);
for(unsigned i = 0; i < length(bsChr1); ++i)
if(bsChr1[i] == 'C')
bsChr1[i] = 'T';
// Print alignments of the reads with chr1 (or bsChr1) sequence using the function printAlign
// and the positions in alignPosList.
// To do that, you have to create a copy of the fragment in chr1 (bsChr1) that is aligned to the read.
std::cout << " \n Print alignment: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)
{
// Begin position beginPosition of a given alignment between the read and the genome
unsigned beginPosition = alignPosList[i];
// Genome fragment
DnaString genomeFragment;
// We have to create a copy of the corresponding fragment of the genome, where the read aligns to
for (unsigned j = 0; j < length(readList[i]); ++j)
appendValue(genomeFragment, chr1[beginPosition+j]);
// Call of our function to print the simple alignment
}
return 1;
}


### Comparisons¶

Two sequences can be lexicographically compared using standard operators such as < or >=.

String<char> a = "beta";
String<char> b = "alpha";

std::cout << (a != b) << std::endl;
std::cout << (a < b) << std::endl;
std::cout << (a > b) << std::endl;

1
0
1


Each comparison involves a scan of the two sequences for searching the first mismatch between the strings. This could be costly if the two sequences share a long common prefix. Suppose we want to branch in a program depending on whether a < b, a == b, or a > b.

if (a < b)      { /* code for case "a < b"  */ }
else if (a > b) { /* code for case "a > b"  */ }
else            { /* code for case "a == b" */ }


In this case, although only one scan would be enough to decide what case is to be applied, each operator > and < performs a new comparison. SeqAn offers the class Lexical to avoid unnecessary sequence scans. Lexicals can store the result of a comparison, for example:

// Compare a and b and store the result in comp
Lexical<> comp(a, b);

if (isLess(comp))         { /* code for case "a < b"  */ }
else if (isGreater(comp)) { /* code for case "a > b"  */ }
else                      { /* code for case "a == b" */ }


### Conversions¶

A sequence of type A values can be converted into a sequence of type B values, if A can be converted into B. SeqAn offers different conversion alternatives.

Copy conversion. The source sequence is copied into the target sequence. This can be done by assignment (operator=) or using the function assign.

String<Dna> source = "acgtgcat";
String<char> target;
assign(target, source);
std::cout << target;

acgtgcat


Move conversion. If the source sequence is not needed any more after the conversion, it is always advisable to use move instead of assign. The function move does not make a copy but can reuse the source sequence storage. In some cases, move can also perform an in-place conversion.

String<char> source = "acgtgcat";
String<Dna> target;

// The in-place move conversion.
move(target, source);
std::cout << target;

acgtgcat


### Assignment 2¶

Type
Review
Objective

In this assignment you will sort nucleotides. Copy the code below. Adjust the code such that all nucleotides, which are lexicographically smaller than a Dna5 'G' are stored in a list lesser, while all nucleotides which are greater, should be stored in a list greater. Print out the final lists.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

int main()
{
String<Dna5> nucleotides = "AGTCGTGNNANCT";
String<Dna5> selected;
// Append all elements of nucleotides, apart of Gs,
// to the list selected.
for (unsigned i = 0; i < length(nucleotides); ++i){
appendValue(selected, nucleotides[i]);
}
std::cout << "Selected nucleotides: " << selected << std::endl;
return 0;
}

Solution

Click more... to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

int main()
{
String<Dna5> nucleotides = "AGTCGTGNNANCT";
String<Dna5> lesser;
String<Dna5> greater;

for (unsigned i = 0; i < length(nucleotides); ++i){
if (nucleotides[i] < 'G')
appendValue(lesser, nucleotides[i]);
else if (nucleotides[i] > 'G')
appendValue(greater, nucleotides[i]);
}
std::cout << "Lesser nucleotides: " << lesser << std::endl;
std::cout << "Greater nucleotides: " << greater << std::endl;
}


### Assignment 3¶

Type
Transfer
Objective
In this task you will compare whole sequences. Reuse the code from above. Instead of a String<Dna5> we will now deal with a String<Dna5String>. Build a string which contains the Dna5Strings “ATATANGCGT”, “AAGCATGANT” and “TGAAANTGAC”. Now check for all elements of the container, if they are lexicographically smaller or bigger than the given reference sequence “GATGCATGAT” and append them to a appropriate list. Print out the final lists.
Hints
Try to avoid unnecessary sequence scans.
Solution

Click more... to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

int main()
{
String<Dna5String> nucleotidesList;
Dna5String str1 = "ATATANGCGT";
Dna5String str2 = "AAGCATGANT";
Dna5String str3 = "TGAAANTGAC";
resize(nucleotidesList, 3);
nucleotidesList[0] = str1;
nucleotidesList[1] = str2;
nucleotidesList[2] = str3;

String<Dna5String> lesser;
String<Dna5String> greater;
Dna5String ref = "GATGCATGAT";

// For each Dna5String of the String:
for (unsigned i = 0; i < length(nucleotidesList); ++i){
// Compare the Dna5String with the given reference string
// The result of the comparison is stored in comp
Lexical<> comp(nucleotidesList[i], ref);
// The function isLess checks only the stored result
// without comparing the sequences again
if (isLess(comp))
appendValue(lesser, nucleotidesList[i]);
else if (isGreater(comp))
appendValue(greater, nucleotidesList[i]);
}
// Print the results
std::cout << "Lesser sequences: " << std::endl;
for (unsigned i = 0; i < length(lesser); ++i){
std::cout << lesser[i] << ", ";
}
std::cout << std::endl;
std::cout << "Greater sequences: " << std::endl;
for (unsigned i = 0; i < length(greater); ++i){
std::cout << greater[i] << ", ";
}
}


## Segments¶

The following section will introduce you into the Segment class of SeqAn.

Segments are contiguous subsequences that represent parts of other sequences. Therefore, their functionality is similar to the String functionality. In SeqAn, there are three kinds of segments: InfixSegment, PrefixSegment, and SuffixSegment. The metafunctions Infix, Prefix, and Suffix, respectively, return the appropriate segment data type for a given sequence type.

For prefixes, we use the function prefix to build the prefix. The first parameter is the sequence we build the prefix from, the second the excluding end position. For infixes, we have to provide both the including start and the excluding end position. For suffixes, the second parameter of the function denotes the including starting position of the suffix:

String<Dna> dnaSeq = "AGTTGGCATG";
Prefix<String<Dna> >::Type pre = prefix(dnaSeq, 4);
std::cout << "Prefix: " << pre << std::endl;

Infix<String<Dna> >::Type inf = infix(dnaSeq, 4, 7);
std::cout << "Infix: " << inf << std::endl;

Suffix<String<Dna> >::Type suf = suffix(dnaSeq, 4);
std::cout << "Suffix: " << suf << std::endl;

Prefix: AGTT
Infix: GGC
Suffix: GGCATG


Segments store a pointer on the underlying sequence object, the host, and an start and/or end position, depending on the type of segment. The segment is not a copy of the sequence segment.

Warning

Please note that it is not possible anymore to change the underlying sequence by changing the segment. If you want to change the host sequence, you have to explicilty modify this. If you want to modify only the segment, you have to explicitly make a copy of the string.

### Assignment 4¶

Type
Application

Objective

In this task you will use a segment to pass over an infix of a given sequence to a function without copying the corresponding fragment. Use the code given below. Lets assume that we have given a genome and a read sequence as well as the begin position of a given alignment. In the main function a fragment of the Dna5String genome is copied and passed together with the Dna5String read to a print function. Adjust the code to use an infix of the genome, instead of copying the corresponding fragment.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

// Function to print simple alignment between two sequences with the same length
// .. for two sequences of the same type
template <typename TText>
void printAlign(TText const & genomeFragment, TText const & read)
{
std::cout <<  "Alignment " << std::endl;
std::cout << "  genome : ";
std::cout << genomeFragment << std::endl;
std::cout << "  read   : ";
}

int main()
{
// We have given a genome sequence
Dna5String genome = "ATGGTTTCAACGTAATGCTGAACATGTCGCGT";
// And the begin position of a given alignment between the read and the genome
unsigned beginPosition = 1;

Dna5String genomeFragment;
// We have to create a copy of the corresponding fragment of the genome, where the read aligns to
for (unsigned i = 0; i < length(read); ++i){
appendValue(genomeFragment, genome[beginPosition+i]);
}
// Call of our function to print the simple alignment

return 0;
}

Solution

Click more... to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;

// Function to print simple alignment between two sequences with the same length
// .. for two sequences of different types
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
std::cout <<  "Alignment " << std::endl;
std::cout << "  genome : ";
std::cout << genomeFragment << std::endl;
std::cout << "  read   : ";
}

int main()
{
// We have given a genome sequence
Dna5String genome = "ATGGTTTCAACGTAATGCTGAACATGTCGCGT";
// And the begin position of a given alignment between the read and the genome
unsigned beginPosition = 1;

// Create Infix of type Dna5String and get the corresponding infix sequence of genome
Infix<Dna5String>::Type inf = infix(genome, beginPosition, beginPosition + length(read));
// Call of our function to print the simple alignment
return 0;
}


### Workshop Assignment 2¶

Type
Review
Objective
Take the solution from the workshop assignment above and change it to use Segments for building the genome fragment.
Hints
Note that because printAlign uses templates, you don’t have to change the function even though the type of genomeFragment is different.
Solution

Click more... to see the solution.

#include <iostream>
#include <seqan/sequence.h>
#include <seqan/file.h>

using namespace seqan;
// Function to print simple alignment between two sequences with the same length
template <typename TText1, typename TText2>
void printAlign(TText1 const & genomeFragment, TText2 const & read)
{
std::cout <<  "Alignment " << std::endl;
std::cout << "  genome : " << genomeFragment << std::endl;
}
int main(int, char const **)
{
DnaString chr1 = "TATAATATTGCTATCGCGATATCGCTAGCTAGCTACGGATTATGCGCTCTGCGATATATCGCGCTAGATGTGCAGCTCGATCGAATGCACGTGTGTGCGATCGATTAGCGTCGATCATCGATCTATATTAGCGCGCGGTATCGGACGATCATATTAGCGGTCTAGCATTTAG";
// Build List containing all reads
typedef String<DnaString> TDnaList;
// Append a second chromosome sequence fragment to chr1
DnaString chr2 = "AGCCTGCGTACGTTGCAGTGCGTGCGTAGACTGTTGCAAGCCGGGGGTTCATGTGCGCTGAAGCACACATGCACACGTCTCTGTGTTCCGACGTGTGTCACGTGCACTGCTGACGTCGTGGTTGTCACATCGTCGTGCGTGCGTACTGCTGCTGACACATGCTGCTG";
append(chr1, chr2);
std::cout << " \n Read list: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)
// Assume we have mapped the 4 reads to chr1 (and chr2) and now have the mapping start positions (no gaps).
// Store the start position in a String alignPosList: 7, 100, 172, 272
String<unsigned> alignPosList;
resize(alignPosList, 4);
alignPosList[0] = 7;
alignPosList[1] = 100;
alignPosList[2] = 172;
alignPosList[3] = 272;
// Optional
// Bisulfite conversion
// Assume chr1 is beeing bisulfate treated: Copy chr1 to a new genome bsChr1 and exchange every 'C' with a 'T'
DnaString bsChr1;
assign(bsChr1, chr1);
for(unsigned i = 0; i < length(bsChr1); ++i)
if(bsChr1[i] == 'C')
bsChr1[i] = 'T';
// Print alignments using Segment: Do the same as above, but instead of using a for loop to build the fragment,
// use the Segment class to build an infix of bsChr1.
// Note: Because printAlign uses templates, we don't have to change the function even though the type of
// genomeFragment is different.
std::cout << " \n Print alignment using Segment: " << std::endl;
for(unsigned i = 0; i < length(readList); ++i)
{
// Begin and end position of a given alignment between the read and the genome
unsigned beginPosition = alignPosList[i];
unsigned endPosition = beginPosition + length(readList[i]);
// Build infix
Infix<DnaString>::Type genomeFragment = infix(chr1, beginPosition, endPosition);
// Call of our function to print the simple alignment