A First Example¶
- Learning Objective
- You will learn the most basic concepts of SeqAn. After this tutorial you will be ready to deal with the more specific tutorials, e.g. Sequences.
- Difficulty
- Very basic
- Duration
- 1.5h
- Prerequisites
- Basic C or C++ knowledge
Welcome to the SeqAn “Hello World”. This is the first practical tutorial you should look at when starting to use our software library.
We assume that you have some programming experience (preferably in C++ or C) and concentrate on SeqAn specific aspects. We will start out pretty slowly and hopefully the tutorial will make sense to you even if you are new to C++. However, to really leverage the power of SeqAn you will have to learn C++. There are many tutorials on C++, for example the tutorial at cplusplus.com.
This tutorial will walk you through a simple example program that highlights the things that are most prominently different from the libraries that many SeqAn newcomers are used to:
- extensive usage of C++ templates,
- generic programming using templates,
- using references instead of pointers in most places,
- and more.
Running Example¶
Let’s start with a simple example programm. The program will do a pattern search of a short query sequence (pattern) in a long subject sequence (text). We define the score for each position of the database sequence as the sum of matching characters between the pattern and the text.
The following figure shows an expected result:
score: 101 ... ... 801 ...
text: This is an awesome tutorial to get to know SeqAn!
pattern: tutorial tutorial
tutorial tutorial
... ...
The first position has a score of 1, because the i
in the pattern matches the i
in is
.
This is only a toy example for explanatory reasons and we ignore any more advanced implementations.
In SeqAn the program could look like this (we will explain every line of code shortly):
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence.h>
using namespace seqan;
int main()
{
// Initialization
String<char> text = "This is an awesome tutorial to get to know SeqAn!";
String<char> pattern = "tutorial";
String<int> score;
resize(score, length(text) - length(pattern) + 1);
// Computation of the similarities
// Iteration over the text (outer loop)
for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i)
{
int localScore = 0;
// Iteration over the pattern for character comparison
for (unsigned j = 0; j < length(pattern); ++j)
{
if (text[i + j] == pattern[j])
++localScore;
}
score[i] = localScore;
}
// Printing the result
for (unsigned i = 0; i < length(score); ++i)
std::cout << score[i] << " ";
std::cout << std::endl;
return 0;
}
Whenever we use SeqAn classes or functions we have to explicitly write the namespace qualifier seqan::
in front of the class name or function.
This can be circumvented if we include the line using namespace seqan;
at the top of the working example.
However, during this tutorial we will not do this, such that SeqAn classes and functions can be recognized more easily.
Attention
Argument-Dependent Name Lookup (Koenig Lookup)
Using the namespace prefix seqan::
is not really necessary in all places.
In many cases, the Koenig lookup rule in C++ for functions makes this unnecessary.
Consider the following, compiling, example.
seqan::String<char> s = "example";
unsigned i = length(s);
Here, the function length
does not have a namespace prefix.
The code compiles nevertheless.
The compiler automatically looks for a function length
in the namespace of its arguments.
Note that we follow the rules for variable, function, and class names as outlined in the SeqAn style guide. For example: 1. variables and functions use lower case, 2. struct, enum and classes use CamelCase, 3. metafunctions start with a capital letter, and 4. metafunction values are UPPERCASE.
Assignment 1¶
- Type
- Review
- Objective
- Create a demo program and replace its content with the code above.
- Hint
- Depending on your operating system you have different alternatives to create a demo application. An in depth description can be found in GettingStarted.
- Solution
Click ‘’more...’‘
// Copy the code into a demo program and have a look at the result. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> using namespace seqan; int main() { // Initialization String<char> text = "This is an awesome tutorial to get to know SeqAn!"; String<char> pattern = "tutorial"; String<int> score; resize(score, length(text) - length(pattern) + 1); // Computation of the similarities // Iteration over the text (outer loop) for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) { int localScore = 0; // Iteration over the pattern for character comparison for (unsigned j = 0; j < length(pattern); ++j) { if (text[i + j] == pattern[j]) ++localScore; } score[i] = localScore; } // Printing the result for (unsigned i = 0; i < length(score); ++i) std::cout << score[i] << " "; std::cout << std::endl; // > 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 8 0 1 0 0 0 0 2 0 1 0 0 1 0 3 0 1 0 1 0 0 0 0 return 0; }
SeqAn and Templates¶
Let us now have a detailed look at the program.
We first include the IOStreams library that we need to print to the screen and the SeqAn’s <seqan/file.h>
as well as <seqan/sequence.h>
module from the SeqAn library that provides SeqAn String.
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence.h>
using namespace seqan;
The String class is one of the most fundamental classes in SeqAn, which comes as no surprise since SeqAn is used to analyse sequences (there is an extra tutorial for SeqAn sequences and alphabets).
In contrast to the popular string classes of Java or C++, SeqAn provides different string implementations and different alphabets for its strings. There is one string implementation that stores characters in memory, just like normal C++ strings. Another string implementation stores the characters on disk and only keeps a part of the sequence in memory. For alphabets, you can use strings of nucleotides, such as genomes, or you can use strings of amino acids, for example.
SeqAn uses template functions and template classes to implement the different types of strings using the generic programming paradigm. Template functions/classes are normal functions/classes with the additional feature that one passes the type of a variable as well as its value (see also: templates in cpp). This means that SeqAn algorithms and data structures are implemented in such a way that they work on all types implementing an informal interface (see information box below for more details). This is similar to the philosophy employed in the C++ STL (Standard Template Library).
The following two lines make use of template programming to define two strings of type char, a text and a pattern.
// Initialization
String<char> text = "This is an awesome tutorial to get to know SeqAn!";
String<char> pattern = "tutorial";
In order to store the similarities between the pattern and different text positions we additionally create a string storing integer values.
String<int> score;
Note that in contrast to the first two string definitions we do not know the values of the different positions in the string in advance.
In order to dynamically adjust the length of the new string to the text we can use the function resize.
The resize function is not a member function of the string class because SeqAn is not object oriented in the typical sence (we will see later how we adapt SeqAn to object oriented programming).
Therefore, instead of writing string.resize(newLength)
we use resize(string, newLength)
.
resize(score, length(text) - length(pattern) + 1);
Note
Global function interfaces.
SeqAn uses global interfaces for its data types/classes.
Generally, you have to use function(variable)
instead of variable.function()
.
This has the advantage that we can extend the interface of a type outside of its definition.
For example, we can provide a length()
function for STL containers std::string<T>
and std::vector<T>
outside their class files.
We can use such global functions to make one data type have the same interface as a second.
This is called adaption.
Additionally, we can use one function definition for several data types. For example, the alignment algorithms in SeqAn are written such that we can compute alignments using any String with any alphabet: There are more than 5 String variants in SeqAn and more than 8 built-in alphabets. Thus, one implementation can be used for more than 40 different data types!
After the string initializations it is now time for the similarity computation. In this toy example we simply take the pattern and shift it over the text from left to right. After each step, we check how many characters are equal between the corresponding substring of the text and the pattern. We implement this using two loops; the outer one iterates over the given text and the inner loop over the given pattern:
// Computation of the similarities
// Iteration over the text (outer loop)
for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i)
{
int localScore = 0;
// Iteration over the pattern for character comparison
for (unsigned j = 0; j < length(pattern); ++j)
{
if (text[i + j] == pattern[j])
++localScore;
}
score[i] = localScore;
}
There are two things worth mentioning here: (1) SeqAn containers or strings start at position 0 and (2) you will notice that we use ++variable
instead of variable++
wherever possible.
The reason is that ++variable
is slightly faster than its alternative, since the alternative needs to make a copy of itself before returning the result.
In the last step we simply print the result that we stored in the variable ````score
on screen.
This gives the similarity of the pattern to the string at each position.
// Printing the result
for (unsigned i = 0; i < length(score); ++i)
std::cout << score[i] << " ";
std::cout << std::endl;
Refactoring¶
At this point, we have already created a working solution!
However, in order to make it easier to maintain and reuse parts of the code we need to export them into functions.
In this example the interesting piece of code is the similarity computation, which consists of an outer and inner loop.
We encapsulate the outer loop in function computeScore
and the inner loop in function computeLocalScore
as can be seen in the following code.
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence.h>
using namespace seqan;
int computeLocalScore(String<char> subText, String<char> pattern)
{
int localScore = 0;
for (unsigned i = 0; i < length(pattern); ++i)
if (subText[i] == pattern[i])
++localScore;
return localScore;
}
String<int> computeScore(String<char> text, String<char> pattern)
{
String<int> score;
resize(score, length(text) - length(pattern) + 1, 0);
for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i)
score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern);
return score;
}
int main()
{
String<char> text = "This is an awesome tutorial to get to know SeqAn!";
String<char> pattern = "tutorial";
String<int> score = computeScore(text, pattern);
for (unsigned i = 0; i < length(score); ++i)
std::cout << score[i] << " ";
std::cout << std::endl;
return 0;
}
The function computeScore() now contains the fundamental part of the code and can be reused by other functions.
The input arguments are two strings.
One is the pattern itself and one is a substring of the text.
In order to obtain the substring we can use the function infix implemented in SeqAn.
The function call infix(text, i, j)
generates a substring equal to text[i ... j - 1]
, e.g. infix(text, 1, 5)
equals “ello”, where text
is “Hello World”.
To be more precise, infix() generates a Infix which can be used as a string, but is implemented using pointers such that no copying is necessary and running time and memory is saved.
Assignment 2¶
- Type
- Review
- Objective
- Replace the code in your current file by the code above and encapsulate the print instructions.
- Hint
The function head should look like this:
void print(String<int> text)
- Solution
// Copy the code into your current file and encapsulate the print instructions. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> using namespace seqan; int computeLocalScore(String<char> subText, String<char> pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) if (subText[i] == pattern[i]) ++localScore; return localScore; } String<int> computeScore(String<char> text, String<char> pattern) { String<int> score; resize(score, length(text) - length(pattern) + 1, 0); for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern); return score; } //![head] void print(String<int> text) //![head] { for (unsigned i = 0; i < length(text); ++i) std::cout << text[i] << " "; std::cout << std::endl; } int main() { String<char> text = "This is an awesome tutorial to get to now SeqAn!"; String<char> pattern = "tutorial"; String<int> score = computeScore(text, pattern); print(score); return 0; }
The Role of References in SeqAn¶
Let us now have a closer look at the signature of computeScore()
.
Both the text and the pattern are passed by value. This means that both the text and the pattern are copied when the function is called, which consumes twice the memory. This can become a real bottleneck since copying longer sequences is very memory and time consuming, think of the human genome, for example.
Instead of copying we could use references.
A reference in C++ is created using an ampersand sign (&
) and creates an alias to the referenced value.
Basically, a reference is a pointer to an object which can be used just like the referenced object itself.
This means that when you change something in the reference you also change the original object it came from.
But there is a solution to circumvent this modification problem as well, namely the word const.
A const
object cannot be modified.
Important
If an object does not need to be modified make it an nonmodifiably object using the keyword const
.
This makes it impossible to unwillingly change objects, which can be really hard to debug.
Therefore it is recommended to use it as often as possible.
Therefore we change the signature of computeScore to:
String<int> computeScore(String<char> const & text, String<char> const & pattern)
Reading from right to left the function expects two references
to
const objects
of type String
of char
.
Assignment 3¶
- Type
- Review
- Objective
- Adjust your current code to be more memory and time efficient by using references in the function header.
- Hint
The function head for
computeLocalScore
should look like this:int computeLocalScore(String<char> const & subText, String<char> const & pattern)
- Solution
// Adjust your current code to be more memory and time efficient by using references in the function header. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> using namespace seqan; //![head_local] int computeLocalScore(String<char> const & subText, String<char> const & pattern) //![head_local] { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) if (subText[i] == pattern[i]) ++localScore; return localScore; } //![head] String<int> computeScore(String<char> const & text, String<char> const & pattern) //![head] { String<int> score; resize(score, length(text) - length(pattern) + 1, 0); for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern); return score; } void print(String<int> const & text) { for (unsigned i = 0; i < length(text); ++i) std::cout << text[i] << " "; std::cout << std::endl; } int main() { String<char> text = "This is an awesome tutorial to get to now SeqAn!"; String<char> pattern = "tutorial"; String<int> score = computeScore(text, pattern); print(score); return 0; }
Generic and Reusable Code¶
As mentioned earlier, there is another issue: the function computeScore only works for Strings having the alphabet char
.
If we wanted to use it for Dna
or AminoAcid
strings then we would have to reimplement it even though the only difference is the signature of the function.
All used functions inside computeScore
can already handle the other datatypes.
The more appropriate solution is a generic design using templates, as often used in the SeqAn library.
Instead of specifying the input arguments to be references of strings of char
s we could use references of template arguments as shown in the following lines:
template <typename TText, typename TPattern>
String<int> computeScore(TText const & text, TPattern const & pattern)
The first line above specifies that we create a template function with two template arguments TText
and TPattern
.
At compile time the template arguments are then replace with the correct types.
If this line was missing the compiler would expect that there are types TText
and TPattern
with definitions.
Now the function signature is better in terms of memory consumption, time efficiency, and generality.
Important
The SeqAn Style Guide
The SeqAn style guide gives rules for formatting and structuring C++ code as well as naming conventions. Such rules make the code more consistent, easier to read, and also easier to use.
- Naming Scheme.
Variable and function names are written in
lowerCamelCase
, type names are written inUpperCamelCase
. Constants and enum values are written inUPPER_CASE
. Template variable names always start with ‘T’. - Function Parameter Order.
The order is (1) output, (2) non-const input (e.g. file handles), (3) input, (4) tags.
Output and non-const input can be modified, the rest is left untouched and either passed by copy or by const-reference (
const &
). - Global Functions. With the exception of constructors and a few operators that have to be defined in-class, the interfaces in SeqAn use global functions.
- No Exceptions. The SeqAn interfaces do not throw any exceptions.
While we are trying to make the interfaces consistent with our style guide, some functions have incorrect parameter order. This will change in the near future to be more in line with the style guide.
Assignment 4¶
- Type
- Review
- Objective
- Generalize the
computeLocalScore
function in your file. - Solution
// Generalize the computeLocalScore function in you file. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> #include <seqan/score.h> using namespace seqan; template <typename TText, typename TPattern> int computeLocalScore(TText const & subText, TPattern const & pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) if (subText[i] == pattern[i]) ++localScore; return localScore; } template <typename TText, typename TPattern> String<int> computeScore(TText const & text, TPattern const & pattern) { String<int> score; resize(score, length(text) - length(pattern) + 1, 0); for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern); return score; } void print(String<int> const & text) { for (unsigned i = 0; i < length(text); ++i) std::cout << text[i] << " "; std::cout << std::endl; } int main() { String<char> text = "This is an awesome tutorial to get to now SeqAn!"; String<char> pattern = "tutorial"; String<int> score = computeScore(text, pattern); print(score); return 0; }
From Object-Oriented Programming to SeqAn¶
There is another huge advantage of using templates: we can specialize a function without touching the existing function.
In our working example it might be more appropriate to treat AminoAcid
sequences differently.
As you probably know, there is a similarity relation on amino acids: Certain amino acids are more similar to each other, than others.
Therefore we want to score different kinds of mismatches differently.
In order to take this into consideration we simple write a computeLocalScore()
function for AminoAcid
strings.
In the future whenever ‘computerScore’ is called always the version above is used unless the second argument is of type String-AminoAcid.
Note that the second template argument was removed since we are using the specific type String-AminoAcid.
template <typename TText>
int computeLocalScore(TText const & subText, seqan::String<seqan::AminoAcid> const & pattern)
{
int localScore = 0;
for (unsigned i = 0; i < seqan::length(pattern); ++i)
localScore += seqan::score(seqan::Blosum62(), subText[i], pattern[i]);
return localScore;
}
In order to score a mismatch we use the function score()
from the SeqAn library.
Note that we use the Blosum62 matrix as a similarity measure.
When looking into the documentation of score you will notice that the score function requires a argument of type Score.
This object tells the function how to compare two letters and there are several types of scoring schemes available in SeqAn (of course, you can extend this with your own).
In addition, because they are so frequently used there are shortcuts as well.
For example Blosum62 is really a shortcut for Score<int, ScoreMatrix<AminoAcid, Blosum62_> >
, which is obviously very helpful.
Other shortcuts are DnaString
for String<Dna>
(sequence tutorial), CharString
for String<char>
, ...
Tip
Template Subclassing
The main idea of template subclassing is to exploit the C++ template matching mechanism.
For example, in the following code, the function calls (1) and (3) will call the function myFunction()
in variant (A) while the function call (2) will call variant (B).
struct SpecA;
struct SpecB;
struct SpecC;
template <typename TAlphabet, typename TSpec>
class String{};
template <typename TAlphabet, typename TSpec>
void myFunction(String<TAlphabet, TSpec> const &){} // Variant (A)
template <typename TAlphabet>
void myFunction(String<TAlphabet, SpecB> const &){} // Variant (B)
// ...
int main()
{
String<char, SpecA> a;
String<char, SpecB> b;
String<char, SpecC> c;
myFunction(a); // calls (A)
myFunction(b); // calls (B)
myFunction(c); // calls (A)
}
Assignment 5¶
- Type
- Application
- Objective
- Provide a generic print function which is used when the input type is not
String<int>
. - Hint
- Keep your current implementation and add a second function.
Don’t forget to make both template functions.
Include
<seqan/score.h>
as well. - Solution
// Provide a generic print function which is used when the input type is not String<int>. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> #include <seqan/score.h> using namespace seqan; template <typename TText> int computeLocalScore(TText const & subText, String<AminoAcid> const & pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) localScore += score(Blosum62(), subText[i], pattern[i]); return localScore; } template <typename TText, typename TPattern> int computeLocalScore(TText const & subText, TPattern const & pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) if (subText[i] == pattern[i]) ++localScore; return localScore; } template <typename TText, typename TPattern> String<int> computeScore(TText const & text, TPattern const & pattern) { String<int> score; resize(score, length(text) - length(pattern) + 1, 0); for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern); return score; } template <typename TText> void print(TText const & text) { std::cout << text << std::endl; } void print(String<int> const & text) { for (unsigned i = 0; i < length(text); ++i) std::cout << text[i] << " "; std::cout << std::endl; } int main() { String<char> text = "This is an awesome tutorial to get to now SeqAn!"; String<char> pattern = "tutorial"; String<int> score = computeScore(text, pattern); print(text); // > This is an awesome tutorial to get to now SeqAn! print(pattern); // > tutorial print(score); // > 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 8 0 1 0 0 0 0 2 0 1 0 0 1 0 3 0 1 1 0 0 0 0 return 0; }
Tags in SeqAn¶
Sometimes you will see something like this:
globalAlignment(align, seqan::MyersHirschberg());
Having a closer look you will notice that there is a default constructor call (MyersHirschberg()
) within a function call.
Using this mechanism one can specify which function to call at compile time.
The MyersHirschberg()
`` is only a tag to determine which specialisation of the globalAligment
function to call.
If you want more information on tags then read on otherwise you are now ready to explore SeqAn in more detail and continue with one of the other tutorials.
There is another use case of templates and function specialization.
This might be useful in a print()
function, for example.
In some scenarios, we only want to print the position where the maximal similarity between pattern and text is found.
In other cases, we might want to print the similarities of all positions.
In SeqAn, we use tag-based dispatching to realize this.
Here, the type of the tag holds the specialization information.
Tip
Tag-Based Dispatching
You will often see tags in SeqAn code, e.g. Standard()
.
These are parameters to functions that are passed as const-references.
They are not passed for their values but for their type only.
This way, we can select different specializations at compile time in a way that plays nicely together with metafunctions, template specializations, and an advanced technique called [[Tutorial/BasicTechniques| metaprogramming]].
Consider the following example:
template <typename T>
struct Tag{};
struct TagA_;
typedef Tag<TagA_> TagA;
struct TagB_;
typedef Tag<TagB_> TagB;
void myFunction(TagA const &){} // (1)
void myFunction(TagB const &){} // (2)
int main()
{
myFunction(TagA()); // (3)
myFunction(TagB()); // (4)
return 0;
}
The function call in line (3) will call myFunction()
in the variant in line (1).
The function call in line (4) will call myFunction()
in the variant in line (2).
The code for the two different print()
functions mentioned above could look like this:
#include <iostream>
#include <seqan/sequence.h>
#include <seqan/score.h>
template <typename TText, typename TSpec>
void print(TText const & text, TSpec const & /*tag*/)
{
for (unsigned i = 0; i < seqan::length(text); ++i)
std::cout << text[i] << " ";
std::cout << std::endl;
}
struct MaxOnly {};
template <typename TText>
void print(TText const & score, MaxOnly const & /*tag*/)
{
int maxScore = score[0];
seqan::String<int> output;
appendValue(output, 0);
for (unsigned i = 1; i < seqan::length(score); ++i)
{
if (score[i] > maxScore)
{
maxScore = score[i];
clear(output);
resize(output, 1, i);
}
else if (score[i] == maxScore)
{
appendValue(output, i);
}
}
for (unsigned i = 0; i < seqan::length(output); ++i)
std::cout << output[i] << " ";
std::cout << std::endl;
}
int main()
{
return 0;
}
If we call print()
with something different than MaxOnly
then we print all the positions with their similarity, because the generic template function accepts anything as the template argument.
On the other hand, if we call print with MaxOnly
only the positions with the maximum similarity as well as the maximal similarity will be shown.
Assignment 6¶
- Type
- Review
- Objective
- Provide a print function that prints pairs of positions and their score if the score is greater than 0.
- Hints
- SeqAn provides a data type Pair.
- Solution
// Provide a print function that prints pairs of positions and their score if the score is greater than 0. #include <iostream> #include <seqan/file.h> #include <seqan/sequence.h> #include <seqan/score.h> using namespace seqan; template <typename TText> int computeLocalScore(TText const & subText, String<AminoAcid> const & pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) localScore += score(Blosum62(), subText[i], pattern[i]); return localScore; } template <typename TText, typename TPattern> int computeLocalScore(TText const & subText, TPattern const & pattern) { int localScore = 0; for (unsigned i = 0; i < length(pattern); ++i) if (subText[i] == pattern[i]) ++localScore; return localScore; } template <typename TText, typename TPattern> String<int> computeScore(TText const & text, TPattern const & pattern) { String<int> score; resize(score, length(text) - length(pattern) + 1, 0); for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i) score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern); return score; } template <typename TText> void print(TText const & text) { std::cout << text << std::endl; } void print(String<int> const & text) { for (unsigned i = 0; i < length(text); ++i) std::cout << text[i] << " "; std::cout << std::endl; } template <typename TText, typename TSpec> void print(TText const & text, TSpec const & /*tag*/) { print(text); } struct MaxOnly {}; template <typename TText> void print(TText const & score, MaxOnly const & /*tag*/) { int maxScore = score[0]; String<int> output; appendValue(output, 0); for (unsigned i = 1; i < length(score); ++i) { if (score[i] > maxScore) { maxScore = score[i]; clear(output); resize(output, 1, i); } else if (score[i] == maxScore) appendValue(output, i); } print(output); } struct GreaterZero {}; template <typename TText> void print(TText const & score, GreaterZero const & /*tag*/) { String<Pair<int> > output; for (unsigned i = 1; i < length(score); ++i) if (score[i] > 0) appendValue(output, Pair<int>(i, score[i])); for (unsigned i = 0; i < length(output); ++i) std::cout << "(" << output[i].i1 << "; " << output[i].i2 << ") "; std::cout << std::endl; } int main() { String<char> text = "This is an awesome tutorial to get to now SeqAn!"; String<char> pattern = "tutorial"; String<int> score = computeScore(text, pattern); print(text); // > This is an awesome tutorial to get to now SeqAn! print(pattern); // > tutorial print(score); // > 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 8 0 1 0 0 0 0 2 0 1 0 0 1 0 3 0 1 1 0 0 0 0 print(score, MaxOnly()); // > 19 print(score, GreaterZero()); // > (2; 1) (5; 1) (12; 1) (17; 1) (19; 8) (21; 1) (26; 2) (28; 1) (31; 1) (33; 3) (35; 1) (36; 1) // And now for a protein pattern String<AminoAcid> protein = "tutorial"; String<int> proteinScore = computeScore(text, protein); print(text); // > This is an awesome tutorial to get to now SeqAn! print(protein); // > TXTXRIAL print(proteinScore); // > 6 -9 -3 -6 -6 0 -9 -8 -7 -3 -9 -5 -8 -4 -5 -6 -6 1 -6 25 -7 2 -6 -6 -9 -6 -5 -7 1 -7 -5 -4 -6 2 -6 -3 -8 -9 -10 -4 -6 0 0 0 0 0 0 0 print(proteinScore, MaxOnly()); // > 19 print(proteinScore, GreaterZero()); // > (17; 1) (19; 25) (21; 2) (28; 1) (33; 2) return 0; }
Obviously this is only a toy example in which we could have named the two print()
functions differently.
However, often this is not the case when the programs become more complex.
Because SeqAn is very generic we do not know the datatypes of template functions in advance.
This would pose a problem because the function call of function b()
in function a()
may depend on the data types of the template arguments of function a()
.
The Final Result¶
Don’t worry if you have not fully understood the last section. If you have – perfect. In any case the take home message is that you use data types for class specializations and if you see a line of code in which the default constructor is written in a function call this typical means that the data type is important to distinct between different function implementations.
Now you are ready to explore more of the SeqAn library. There are several tutorials which will teach you how to use the different SeqAn data structures and algorithms. Below you find the complete code for our example with the corresponding output.
#include <iostream>
#include <seqan/file.h>
#include <seqan/sequence.h>
#include <seqan/score.h>
using namespace seqan;
template <typename TText>
int computeLocalScore(TText const & subText, String<AminoAcid> const & pattern)
{
int localScore = 0;
for (unsigned i = 0; i < length(pattern); ++i)
localScore += score(Blosum62(), subText[i], pattern[i]);
return localScore;
}
template <typename TText, typename TPattern>
int computeLocalScore(TText const & subText, TPattern const & pattern)
{
int localScore = 0;
for (unsigned i = 0; i < length(pattern); ++i)
if (subText[i] == pattern[i])
++localScore;
return localScore;
}
template <typename TText, typename TPattern>
String<int> computeScore(TText const & text, TPattern const & pattern)
{
String<int> score;
resize(score, length(text) - length(pattern) + 1, 0);
for (unsigned i = 0; i < length(text) - length(pattern) + 1; ++i)
score[i] = computeLocalScore(infix(text, i, i + length(pattern)), pattern);
return score;
}
template <typename TText>
void print(TText const & text)
{
std::cout << text << std::endl;
}
void print(String<int> const & text)
{
for (unsigned i = 0; i < length(text); ++i)
std::cout << text[i] << " ";
std::cout << std::endl;
}
template <typename TText, typename TSpec>
void print(TText const & text, TSpec const & /*tag*/)
{
print(text);
}
struct MaxOnly {};
template <typename TText>
void print(TText const & score, MaxOnly const & /*tag*/)
{
int maxScore = score[0];
String<int> output;
appendValue(output, 0);
for (unsigned i = 1; i < length(score); ++i)
{
if (score[i] > maxScore)
{
maxScore = score[i];
clear(output);
resize(output, 1, i);
}
else if (score[i] == maxScore)
appendValue(output, i);
}
print(output);
}
struct GreaterZero {};
template <typename TText>
void print(TText const & score, GreaterZero const & /*tag*/)
{
String<Pair<int> > output;
for (unsigned i = 1; i < length(score); ++i)
if (score[i] > 0)
appendValue(output, Pair<int>(i, score[i]));
for (unsigned i = 0; i < length(output); ++i)
std::cout << "(" << output[i].i1 << "; " << output[i].i2 << ") ";
std::cout << std::endl;
}
int main()
{
String<char> text = "This is an awesome tutorial to get to now SeqAn!";
String<char> pattern = "tutorial";
String<int> score = computeScore(text, pattern);
print(text);
// > This is an awesome tutorial to get to now SeqAn!
print(pattern);
// > tutorial
print(score);
// > 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 8 0 1 0 0 0 0 2 0 1 0 0 1 0 3 0 1 1 0 0 0 0
print(score, MaxOnly());
// > 19
print(score, GreaterZero());
// > (2; 1) (5; 1) (12; 1) (17; 1) (19; 8) (21; 1) (26; 2) (28; 1) (31; 1) (33; 3) (35; 1) (36; 1)
// And now for a protein pattern
String<AminoAcid> protein = "tutorial";
String<int> proteinScore = computeScore(text, protein);
print(text);
// > This is an awesome tutorial to get to now SeqAn!
print(protein);
// > TXTXRIAL
print(proteinScore);
// > 6 -9 -3 -6 -6 0 -9 -8 -7 -3 -9 -5 -8 -4 -5 -6 -6 1 -6 25 -7 2 -6 -6 -9 -6 -5 -7 1 -7 -5 -4 -6 2 -6 -3 -8 -9 -10 -4 -6 0 0 0 0 0 0 0
print(proteinScore, MaxOnly());
// > 19
print(proteinScore, GreaterZero());
// > (17; 1) (19; 25) (21; 2) (28; 1) (33; 2)
return 0;
}